CoreBoost_HM training data
The training set for CoreBoost_HM was
constructed from the known gene transcription start sites (TSSs) in the EPD and DBTSS
databases, for genes with known expression information in human CD4+ T cells
in the GNF gene expression atlas. The CpG-related predictor was
trained on 4263 CpG-related promoters (bed file, Feature Table), and the non-CpG-related predictor
was trained on 1683 non-CpG-related promoters (bed file, Feature Table). For each promoter in
the training set, we chose the annotated TSS as the positive sample and
randomly selected one sample from the upstream [-1200, -300] and one from the
downstream [300, 1200] regions as negative samples to train the boosting
classifier (CpG_train, nCpG_train).
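
The sampling scheme described above can be sketched in Python as follows. This is a minimal illustration, not the code used to build CpG_train or nCpG_train. It assumes that the intervals are base-pair offsets relative to the annotated TSS, that their bounds are inclusive, that negatives are drawn uniformly, and that "upstream" follows the gene's strand. The function name `sample_training_positions` and its parameters are hypothetical.

```python
import random


def sample_training_positions(tss, strand="+", rng=None):
    """Return (position, label) pairs for one promoter.

    Label 1 marks the annotated TSS (positive sample); label 0 marks the
    two randomly drawn negative samples.
    Assumptions (not stated in the source): offsets are in bp relative to
    the TSS, interval bounds are inclusive, and sampling is uniform.
    """
    rng = rng or random.Random()
    # On the minus strand, upstream corresponds to larger genomic coordinates.
    sign = 1 if strand == "+" else -1
    upstream_offset = rng.randint(-1200, -300)   # upstream window [-1200, -300]
    downstream_offset = rng.randint(300, 1200)   # downstream window [300, 1200]
    return [
        (tss, 1),
        (tss + sign * upstream_offset, 0),
        (tss + sign * downstream_offset, 0),
    ]


if __name__ == "__main__":
    # Fixed seed so the illustration is reproducible.
    for position, label in sample_training_positions(50_000, "+", random.Random(0)):
        print(position, label)
```

Under these assumptions each promoter contributes one positive and two negative samples, so the negatives always lie at least 300 bp away from the annotated TSS.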
To further test the performance of
the program, we collected all 1642 non-overlapping
promoter regions (from -5 kb to +5 kb relative to the gene 5' end) with at
least one TSS annotation from the remaining genes with known expression
information in CD4+ T cells as our test set. After merging TSSs less than
100 bp apart, this test set contains 2619
independent TSSs according to the EPD and DBTSS annotations.
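
One way to combine TSSs that lie less than 100 bp apart is sketched below. The source does not say how the merging was done, so this is an assumption: positions on the same chromosome and strand are sorted, and a position joins the current group whenever it is less than 100 bp from the previous position (which allows chains of nearby TSSs to merge into one group). The function name `merge_close_tss` and the parameter `min_gap` are hypothetical.

```python
def merge_close_tss(positions, min_gap=100):
    """Group TSS positions that are less than `min_gap` bp apart.

    Assumption (not stated in the source): grouping is done by chaining
    neighbouring positions after sorting, and all positions passed in
    belong to the same chromosome and strand.
    Returns a list of groups, each a sorted list of positions.
    """
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < min_gap:
            # Closer than min_gap to the previous TSS: same group.
            groups[-1].append(pos)
        else:
            # At least min_gap away: start a new independent TSS group.
            groups.append([pos])
    return groups


if __name__ == "__main__":
    annotated = [100, 150, 240, 400, 499, 1000]
    groups = merge_close_tss(annotated)
    print(groups)
    print(len(groups), "independent TSSs")
```

With this rule, the number of independent TSSs in a region is the number of groups returned. How a single representative position is chosen for each group is left open here, since the source does not specify it.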