CoreBoost training and test data


Training Data

The CpG related program is trained on 1,445 CpG related promoters from EPD. The non-CpG related program is trained on 1,570 non-CpG related promoters from EPD and DBTSS (version 3.0). Each linked file here contains sequences of 10 kb in length, each centered at the annotated TSS.


Test Data

ChIP-chip data

ChIP-chip makes it possible to map active promoters genome-wide in specific cells. Using this technology, the binding sites of the pre-initiation complex were experimentally located throughout the genome in human fibroblast cells. To evaluate the performance of CoreBoost, we applied different programs to test sequences of 2.4 kb each, extracted from this genome-wide promoter mapping data. For CpG related predictions, we trained CoreBoost on 1,445 CpG related promoters from EPD and tested it on 1,765 CpG related promoters. Each test sequence is 2.4 kb long and centered at a probe (50 bp) for which there was only one DBTSS annotation and no more than one ChIP-chip prediction in the same sequence.

5-fold cross validation

In the 5-fold cross validation study for non-CpG related promoters, each model is trained on the 299 non-CpG related promoters from EPD together with 4/5 of the 1,271 promoters from DBTSS. The remaining 1/5 of the DBTSS promoters, taken as sequences of 2.4 kb each centered at the DBTSS-annotated TSS, is used as test data. The linked files below contain sequences of 10 kb in length, each centered at the annotated TSS.

The files ending with .featureTable contain the calculated features for the sequences in the corresponding training and test sets. The files ending with .featureTable.label contain the labels for each feature vector. Feature files named "i.1.train..." are for the first binary classifier, which discriminates promoters from upstream sequences, while feature files named "i.2.train..." are for the second binary classifier, which discriminates promoters from downstream sequences. For each test sequence, the test feature tables contain one feature vector per shifted window of 300 bp, with window position 251 placed in turn at each sequence position from 3800 to 6200.
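The window scheme used for the test feature tables can be illustrated with a minimal sketch. It assumes 1-based sequence coordinates and a 10 kb input string; the function name `iter_windows` and its parameters are hypothetical and are not part of CoreBoost. It only enumerates the windows and does not compute any features.

```python
def iter_windows(seq, first=3800, last=6200, width=300, anchor=251):
    """Yield (position, window) pairs for one test sequence.

    Assumption: positions are 1-based, and the window is placed so that
    its own position `anchor` (251) coincides with the sequence position
    being scored. The window therefore spans 250 bp before the position,
    the position itself, and 49 bp after it.
    """
    for pos in range(first, last + 1):
        start = pos - anchor          # 0-based index of the window start
        window = seq[start:start + width]
        if len(window) != width:
            raise ValueError("sequence too short for window at %d" % pos)
        yield pos, window


if __name__ == "__main__":
    # Toy 10 kb sequence standing in for one of the linked test sequences.
    toy = "".join("ACGT"[i % 4] for i in range(10000))
    windows = list(iter_windows(toy))
    print(len(windows), "windows per test sequence")
```

Under these assumptions each test sequence gives one window per position in the 2.4 kb test region, so a test feature table would hold one row per window.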

1.train 1.test, 2.train 2.test, 3.train 3.test, 4.train 4.test, 5.train 5.test

  • 1.1.train.featureTable 1.2.train.featureTable 1.test.featureTable 1.1.train.featureTable.label 1.2.train.featureTable.label
  • 2.1.train.featureTable 2.2.train.featureTable 2.test.featureTable 2.1.train.featureTable.label 2.2.train.featureTable.label
  • 3.1.train.featureTable 3.2.train.featureTable 3.test.featureTable 3.1.train.featureTable.label 3.2.train.featureTable.label
  • 4.1.train.featureTable 4.2.train.featureTable 4.test.featureTable 4.1.train.featureTable.label 4.2.train.featureTable.label
  • 5.1.train.featureTable 5.2.train.featureTable 5.test.featureTable 5.1.train.featureTable.label 5.2.train.featureTable.label

New test data

We selected the 5,000 promoters with the most 5'-EST evidence from the recently released DBTSS (version 5.0). After removing promoters within 2.4 kb of promoters already used in the manuscript, 1,599 promoters remained (including 213 non-CpG related promoters, based on a threshold of 0.3) as an independent test set.
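
The distance filter described above could be sketched as follows. This is only an illustration under stated assumptions: promoters are represented as hypothetical (chromosome, TSS position) pairs, strand is ignored, and "within 2.4 kb" is read as a distance of at most 2,400 bp. The function name `filter_independent` is not part of CoreBoost, and the actual procedure used to build the data set may differ.

```python
from bisect import bisect_left


def filter_independent(candidates, used, min_dist=2400):
    """Keep candidate TSSs farther than min_dist from every used TSS.

    Both arguments are lists of (chromosome, position) pairs.
    Assumption: a candidate is dropped when a used TSS on the same
    chromosome lies at a distance of min_dist bp or less.
    """
    # Group the already-used TSS positions by chromosome and sort them.
    used_by_chrom = {}
    for chrom, pos in used:
        used_by_chrom.setdefault(chrom, []).append(pos)
    for positions in used_by_chrom.values():
        positions.sort()

    kept = []
    for chrom, pos in candidates:
        positions = used_by_chrom.get(chrom, [])
        # Only the nearest used TSS on each side needs to be checked.
        i = bisect_left(positions, pos)
        neighbours = [positions[j] for j in (i - 1, i)
                      if 0 <= j < len(positions)]
        if all(abs(pos - p) > min_dist for p in neighbours):
            kept.append((chrom, pos))
    return kept


if __name__ == "__main__":
    used = [("chr1", 10000)]
    candidates = [("chr1", 12400), ("chr1", 12401), ("chr2", 10000)]
    print(filter_independent(candidates, used))
```

In the toy example the first candidate lies exactly 2,400 bp from a used TSS and is removed, while the other two are kept.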