CpGpromoter is a program for large-scale mapping of human promoters using CpG islands. As defined by Gardiner-Garden & Frommer, CpG islands are longer than 200 bp, have a G+C content above 50%, and have a CpG frequency of at least 0.6 of that expected from the G+C content of the region. CpG islands are an important signature of the 5' region of many mammalian genes. The program is based on the results of a discriminant analysis between promoter-associated CpG islands and non-associated ones (Ioshikhes & Zhang). It enables efficient mapping of human promoters with 2 kb resolution, provided there is a CpG island within the interval (-500...+1,500) around a transcription start site (TSS).
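The three criteria quoted above can be illustrated with a minimal Python sketch. It is not part of CpGpromoter and does not reproduce how cpgplot scans a sequence with a sliding window; it only evaluates one candidate region. The function names are hypothetical, and the observed/expected ratio is assumed to be the commonly used form (number of CpG x length) / (number of C x number of G).

```python
def cpg_island_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # CpG dinucleotides on the given strand
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed formula: observed CpG count divided by the count expected
    # from the C and G content of the region.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def meets_island_criteria(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the thresholds quoted in the text to a single candidate region."""
    n, gc_fraction, obs_exp = cpg_island_stats(seq)
    return n > min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp
```

For example, `meets_island_criteria(region)` could be called on a region reported in query.cpgplot to check it against the definition by hand.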
        The current version of the program is implemented as a sequence of the cpgplot, comp and qda programs, with appropriate parsing and reformatting of the intermediate data. The cpgplot program for mapping CpG islands, described in (Larsen et al.), is available as part of the EMBOSS sequence analysis programs on the ftp server of the Sanger Centre, UK (ftp://ftp.sanger.ac.uk/pub/EMBOSS/ in file EMBOSS-0.0.4.tar.gz). The comp program replaces the composition program of the GCG program package, which was used in the previous version of CpG_promoter. The qda program for quadratic discriminant analysis (McLachlan) is one of the S-Plus programs of applied statistics (Venables & Ripley). The EMBOSS and S-Plus software packages must be installed on the user's computer for CpGpromoter to work. GCG is not required for this version. The other software required is Perl and gcc; you may use the cc compiler instead of gcc by changing lines 40, 43, 46 and 49 of the file CpGpromoter.pl accordingly (see below).
        The other parts of the program are available for ftp download from this site to UNIX workstations as the file CpG_promoter.tar.Z. After retrieving the compressed tar file, do the following (in a shell window):
        1. Uncompress the .Z file with: uncompress CpG_promoter.tar.Z. This should convert CpG_promoter.tar.Z to CpG_promoter.tar.
        2. Untar the file with: tar xvf CpG_promoter.tar. This should create a directory called CpG_promoter containing all the program files. You can then delete the archive with rm CpG_promoter.tar. (Steps 1 and 2 are needed only for installation; once the program is installed, proceed directly to step 3.)
        3. Change your current directory to CpG_promoter. CpG_promoter is then ready to use. Put the sequence you wish to analyze into the CpG_promoter directory. The sequence should be in FASTA format and named myquery.seq.
        4. Run the CpG_promoter Perl script with: perl CpG_promoter.pl. The components of the script then run consecutively, sending their messages to standard output. When the script finishes, information about the mapped CpG islands should be available in the files query.cpgplot, cpgplot.ps and toqda-1.dat.
        5. At the last step of the script, the S-Plus software package should start. You must attach a .Data directory using its path on your local computer, by modifying line 1 of the file splus.input accordingly. The parameter heart0 in line 4 of this file is the result of the qda classification of our Training Set (Ioshikhes & Zhang) with respect to the interval (-500..+1500). To use the combined data set (Training+Test Set 1+Test Subset 2-1) from this paper as the training set, use heart1 instead of heart0.
        The output data (on standard output and in the file output.dat) should look like this, for example:

       CpG island pos.         Class

       (38989..39225)          "0"
       (54844..55044)          "1"
 

        Class "0" means that the respective island is predicted to be promoter-related (centered in the interval (-500...+1500) around a possible TSS); class "1" means that it is predicted to be not promoter-related.
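For downstream processing, the two-column output shown above could be read with a short Python sketch like the one below. It is an illustration only and is not shipped with CpG_promoter: the function name is hypothetical, and the layout is assumed to match the example exactly (positions as (start..end), followed by the class in double quotes).

```python
import re

# Assumed row layout, taken from the example output: (start..end)   "class"
ROW = re.compile(r'\((\d+)\.\.(\d+)\)\s+"([01])"')


def parse_cpg_promoter_output(text):
    """Return one dict per island row; header and blank lines are skipped."""
    islands = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match is None:
            continue
        islands.append({
            "start": int(match.group(1)),
            "end": int(match.group(2)),
            # Class "0" is the promoter-related class, "1" the non-related one.
            "promoter_related": match.group(3) == "0",
        })
    return islands
```

A typical use would be to pass the contents of output.dat to `parse_cpg_promoter_output` and keep only the entries whose `promoter_related` flag is true.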
        Work on CpG_promoter is in progress to make it more self-sufficient, with less outside software involved. To let bench scientists have the tool as soon as possible, we are releasing it before this work is completed and before our paper is published (see the references below). This program is free for academic use. For commercial use, please contact Ms. Carol Dempster at 516-367-6885 (dempster@cshl.org).

If you have problems or comments, please contact:

================================================================================

Michael Q. Zhang, Ph.D.                                                   Associate Professor

Phone:  (516) 367-8393                                                   Watson School of Biological Sciences
Fax:      (516) 367-8461                                                   Cold Spring Harbor Laboratory
E-mail: mzhang@cshl.org                                                1 Bungtown Road,
WinFax:(516) 367-6862                                                  P.O.Box 100,
WebURL: http://www.cshl.org/mzhanglab                      Cold Spring Harbor, NY 11724 USA

================================================================================

Ilya Ioshikhes, Ph.D.                                                          Instructor

Department of Molecular Genetics,                                 Phone: (718) 430-3732
Albert Einstein College of Medicine                                FAX:   (718) 430-8778
of Yeshiva University,
717 Ullmann Building 1300 Morris Park Avenue            E-mail: ilya@harryeagle.aecom.yu.edu
Bronx, New York 10461 USA

================================================================================

References:

    Ioshikhes, I. & Zhang, M.Q. Large-scale human promoter mapping using CpG islands. Nature Genetics 26, 61-63 (2000).
    Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol. 196, 261-276 (1987).
    Larsen, F., Gundersen, R., Lopez, R. & Prydz, H. CpG islands as gene markers in the human genome. Genomics 13, 1095-1107 (1992).
    McLachlan, G.J. Discriminant Analysis and Statistical Pattern Recognition (Wiley, New York) (1992).
    Venables, W.N. & Ripley, B.D. Modern Applied Statistics with S-Plus. (Springer-Verlag, New York) (1994).