CoreBoost_HM training data

The training set for CoreBoost_HM was constructed from known gene transcription start sites (TSSs) in the EPD and DBTSS databases, using genes with known expression information in human CD4+ T cells in the GNF gene expression atlas. The CpG-related predictor was trained on 4263 CpG-related promoters (bed file, Feature Table), and the non-CpG-related predictor was trained on 1683 non-CpG-related promoters (bed file, Feature Table). For each promoter in the training set, we chose the annotated TSS as the positive sample and randomly selected one sample from the upstream region [-1200, -300] and one from the downstream region [300, 1200], relative to the TSS, as negative samples to train the boosting classifier (CpG_train, nCpG_train).
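The negative-sample selection described above can be illustrated with a minimal Python sketch. This is not the original CoreBoost_HM code: the function name `sample_negatives`, the uniform sampling, the inclusive interval ends, and the strand handling (offsets are flipped for minus-strand genes) are all assumptions made for illustration.

```python
import random

def sample_negatives(tss, strand="+", rng=None):
    """Pick one upstream and one downstream negative position for a TSS.

    Offsets are relative to the TSS: upstream in [-1200, -300] and
    downstream in [300, 1200]. Uniform sampling with inclusive ends is
    an assumption, not a documented detail of CoreBoost_HM.
    """
    rng = rng or random.Random()
    up_offset = rng.randint(-1200, -300)    # inclusive on both ends
    down_offset = rng.randint(300, 1200)
    # Assumption: on the minus strand, "upstream" means larger coordinates.
    sign = 1 if strand == "+" else -1
    return tss + sign * up_offset, tss + sign * down_offset

# Example: one positive TSS yields two negative positions.
upstream_neg, downstream_neg = sample_negatives(50_000, "+", random.Random(0))
```

Keeping the negatives at least 300 bp from the annotated TSS leaves a margin around the positive sample, which is consistent with the intervals given in the text.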

To further test the performance of the program, we collected all 1642 non-overlapping promoter regions (from -5 kb to +5 kb relative to the gene 5' end) that have at least one TSS annotation, drawn from the remaining genes with known expression information in CD4+ T cells, as our test set. After combining TSSs less than 100 bp apart, this test set contains 2619 independent TSSs according to EPD and DBTSS annotation.
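One plausible way to combine TSSs less than 100 bp apart is to sort the positions and chain together neighbors whose gap is below the threshold. The sketch below is an assumption about the procedure, not the authors' script: the function name `merge_close_tss` is hypothetical, the chaining (single-linkage) behavior is a guess, and positions are assumed to come from one chromosome and one strand.

```python
def merge_close_tss(positions, min_gap=100):
    """Group TSS positions so that neighbors closer than min_gap share a cluster.

    Positions are sorted, then each one joins the current cluster if it
    lies less than min_gap bp from the previous position; otherwise it
    starts a new cluster. Each cluster counts as one independent TSS.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < min_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# Example: five annotated positions collapse into three independent TSSs.
clusters = merge_close_tss([1000, 1050, 1149, 1300, 5000])
print(len(clusters))  # 3
```

With chaining, a cluster can span more than 100 bp when several annotations lie close together in sequence; a stricter rule would compare each position with the first member of its cluster instead.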