CREAD

Regulatory sequences

The following files contain sequences known to be rich in transcription factor binding sites.

Wasserman-Fickett Muscle
LSPD

Background sequences

The following files contain 1000 promoters each. Promoters were sampled uniformly at random from the set of promoters in CSHLmpd, for Human and Mouse.

Random.Hs.1000.faa.gz
Random.Mm.1000.faa.gz

Patterns: Words, Motifs, and Modules

Currently CREAD supports regulatory sequence analysis using Words, Motifs, or Modules as pattern types. The file formats for these pattern types share a common scheme. Here is a very simple example of a Motif pattern:

AC  MA0098
XX
TY  Motif
XX
P0       A       C       G       T
00       4      16       4      16
01      17       0       0      23
02       0       1       0      39
03       0      39       1       0
04       0      39       0       1
05       5       3      17      15
XX
AT  SOME_ATTRIBUTE=the_attribute_value
XX
//

The above Motif is from the JASPAR database and describes binding sites for c-ETS in chicken. The format follows the one TRANSFAC uses for its matrices almost exactly. In fact, most of the tools in CREAD can parse the matrix.dat file that comes with TRANSFAC, so its matrices can be analyzed directly. All pattern files in CREAD consist of records; each record begins with an AC line and ends with a line containing only //.

There are two major differences between the CREAD format and the TRANSFAC format. The first is the TY line, which indicates the type of pattern. This allows some programs in CREAD to process multiple pattern types simultaneously. In many cases, if a program was designed to work on a specific type of pattern, that pattern type is assumed when the TY line is absent.
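The record structure described above can be sketched in Python. This is only an illustration, not code from CREAD: the function name split_records and the default_type parameter are hypothetical, and the fallback type stands in for the pattern type a given program would assume when the TY line is absent.

```python
def split_records(text, default_type="Motif"):
    """Split pattern-file text into records.

    A record starts at an AC line and ends at a line containing only //.
    Returns a list of dicts holding the accession, the pattern type, and
    the raw lines of each record.
    """
    records = []
    current = None
    for line in text.splitlines():
        parts = line.split(None, 1)
        tag = parts[0] if parts else ""
        value = parts[1].strip() if len(parts) > 1 else ""
        if tag == "AC":
            # An AC line opens a new record.
            current = {"accession": value, "type": None, "lines": []}
        if current is None:
            continue  # ignore anything outside a record
        current["lines"].append(line)
        if tag == "TY":
            current["type"] = value
        elif tag == "//":
            # Assumption: fall back to default_type when no TY line was seen.
            if current["type"] is None:
                current["type"] = default_type
            records.append(current)
            current = None
    return records
```

Each returned record keeps its raw lines, so the remaining line types (matrix columns, AT, BS) can be handled separately.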

The other difference is the line beginning with AT. This is an attribute of the pattern, and is simply a pair consisting of a label and a value. In the case above, the label is SOME_ATTRIBUTE, and the value of that attribute for Motif MA0098 is the_attribute_value. A more natural example might be something like this:

XX
AT  SCORE=1729
XX

Many of the programs in CREAD assign values to pattern attributes, and other programs use those attributes. For example, the sortmotifs program will sort and rank patterns according to the value of some attribute the user specifies.
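As a rough illustration of how attributes might be read and used for ranking, here is a small Python sketch. It is not the implementation of sortmotifs; the function names are hypothetical, and the sketch assumes that the chosen attribute holds a number and is present in every record.

```python
def parse_attributes(record_lines):
    """Collect the AT lines of one record into a {label: value} dict."""
    attributes = {}
    for line in record_lines:
        if line.startswith("AT"):
            label, sep, value = line[2:].strip().partition("=")
            if sep:  # skip AT lines that lack an "=" sign
                attributes[label.strip()] = value.strip()
    return attributes


def sort_by_attribute(records, label, reverse=True):
    """Sort (accession, record_lines) pairs by a numeric attribute.

    Assumption: every record carries the attribute; a missing label
    raises KeyError.
    """
    def key(item):
        return float(parse_attributes(item[1])[label])
    return sorted(records, key=key, reverse=reverse)
```

With reverse=True the pattern with the highest value comes first, so a record with SCORE=1729 would rank ahead of one with SCORE=42.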

As with the TRANSFAC matrix format, the pattern files of CREAD have a way of storing information about sites that match the pattern. While the pattern occurrences are not necessarily binding sites, we still use BS to begin the lines. Here is an example:

XX
BS  TTTCCG; NM_000732; 1003; 6; ; p;
XX

The parts of the site are delimited with semicolons. In order, they are the sequence of the site, the name of the sequence in which the site is found, the offset of the site from the beginning of that sequence, the length of the site, any gaps in the site, and the strand on which the site is found. In the above example the gaps field is empty; currently no CREAD patterns may contain gaps, but the field is kept in the format. Some programs append an additional field. For example, the storm program appends match scores to the end of the site lines.
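The field layout of a site line can be made concrete with a short Python sketch. The function name and the dictionary keys are hypothetical, chosen only for this illustration; any fields beyond the first six (such as a score appended by storm) are returned unparsed.

```python
def parse_site_line(line):
    """Parse one BS line of a Motif or Word record into a dict."""
    if not line.startswith("BS"):
        raise ValueError("not a BS line: %r" % line)
    fields = [f.strip() for f in line[2:].split(";")]
    # The trailing semicolon leaves one empty element at the end; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) < 6:
        raise ValueError("expected at least 6 fields: %r" % line)
    return {
        "site": fields[0],
        "sequence_name": fields[1],
        "offset": int(fields[2]),
        "length": int(fields[3]),
        "gaps": fields[4],      # empty: CREAD patterns have no gaps
        "strand": fields[5],
        "extra": fields[6:],    # e.g. a match score appended by a program
    }
```

For the example line above, this gives the site TTTCCG in NM_000732 at offset 1003, with length 6, an empty gaps field, and strand p.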

The line that begins with P0 indicates that the lines immediately following are the columns of a position-weight matrix, which can give either frequencies (in which case each column must sum to 1) or counts. The numbers at the beginning of those lines are not important and only serve to indicate that the line is part of the matrix. The first line that does not begin with two digits signals the end of the matrix. The remainder of the P0 line is always ignored, and it is always assumed that, on each line describing a matrix column, the values give counts or frequencies for the bases A, C, G and T, in that order.
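The matrix rules above translate into a few lines of Python. This is a sketch under stated assumptions rather than CREAD's own parser: the function names are hypothetical, the matrix header is taken to be P0 as in the example records, and only the first matrix in the given lines is read.

```python
def parse_matrix(lines):
    """Return the columns of the first matrix found in lines.

    Each column is a list of four floats for A, C, G and T, in that order.
    """
    columns = []
    in_matrix = False
    for line in lines:
        if not in_matrix:
            # The rest of the P0 line is ignored.
            if line.startswith("P0"):
                in_matrix = True
            continue
        if line[:2].isdigit():
            tokens = line.split()
            # tokens[0] is the position label, which carries no meaning.
            columns.append([float(v) for v in tokens[1:5]])
        else:
            break  # first line not starting with two digits ends the matrix
    return columns


def to_frequencies(columns):
    """Convert count columns to frequencies so that each column sums to 1."""
    result = []
    for column in columns:
        total = sum(column)
        # Assumption: an all-zero column is left unchanged.
        result.append([v / total for v in column] if total else list(column))
    return result
```

Applied to the MA0098 example, parse_matrix returns six columns, and the first column [4, 16, 4, 16] becomes [0.1, 0.4, 0.1, 0.4] after to_frequencies.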

Here is an example of the Word format:

AC  WORD01
XX
TY  Word
XX
WO  CACCTG
XX
AT  SITE_NAME=E-Box
XX
//

This format is similar, except that the matrix of the Motif is replaced by a WO line containing a word. In the case above, the word is the E-Box consensus sequence, as recorded in the SITE_NAME attribute. Currently the only characters allowed in a word are the nucleotides of DNA.
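A Word record could be read and checked along these lines. The sketch is illustrative only: the function name is hypothetical, and accepting lower-case nucleotides is an assumption, since the text does not say whether CREAD allows them.

```python
DNA_ALPHABET = set("ACGT")


def parse_word_record(lines):
    """Return the word from the WO line of a Word record."""
    for line in lines:
        if line.startswith("WO"):
            word = line[2:].strip()
            # Assumption: case is ignored when checking the alphabet.
            if not word or set(word.upper()) - DNA_ALPHABET:
                raise ValueError("word must contain only A, C, G, T: %r" % word)
            return word
    raise ValueError("no WO line found")
```

A word containing any other symbol, such as CANNTG, is rejected with a ValueError.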

A Module represents a set of Motifs, and is intended as a model for cis-regulatory modules, which are sets of interacting transcription factor binding sites.

An example of a Module pattern in CREAD is:

AC  MA0098-MA0006
XX
TY  Module
XX
P0       A       C       G       T
00       4      16       4      16
01      17       0       0      23
02       0       1       0      39
03       0      39       1       0
04       0      39       0       1
05       5       3      17      15
XX
P0       A       C       G       T
00       3       8       2      11
01       0       0      23       1
02       0      23       0       1
03       0       0      23       1
04       0       0       0      24
05       0       0      24       0
XX
AT  SOME_ATTRIBUTE=the_attribute_value
XX
//

The above module consists of two matrices from JASPAR, MA0098 and MA0006.

The module sites reported by modstorm are in the following format (where the symbol "\" indicates that the line does not break in the true format):

XX
BS  (1003:TTTCCG:1009:p:12.2)-(1203:TGCGTG:1209:n:10.1); \
NM_000732; 1003; 206; 
XX

As with the motif sites, the parts of the module sites are delimited with semicolons. The information in parentheses relates to the individual motifs in the module. The parts of each motif entry are delimited by colons and are the start of the site, the site itself, the end of the site, the strand, and the score. The remaining parts describe the module as a whole: the name of the sequence, the start of the module, and the total length of the module.
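The module site layout can be unpacked with a short Python sketch. The function name and dictionary keys are hypothetical, and the input is assumed to be the complete site on a single line, that is, without the "\" continuation used above for display.

```python
import re

# One parenthesized motif entry: (start:site:end:strand:score)
_MOTIF_ENTRY = re.compile(r"\(([^()]*)\)")


def parse_module_site(line):
    """Parse one BS line of a Module record into a dict."""
    if not line.startswith("BS"):
        raise ValueError("not a BS line: %r" % line)
    fields = [f.strip() for f in line[2:].split(";")]
    # The trailing semicolon leaves one empty element at the end; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    motifs = []
    for entry in _MOTIF_ENTRY.findall(fields[0]):
        start, site, end, strand, score = [p.strip() for p in entry.split(":")]
        motifs.append({
            "start": int(start),
            "site": site,
            "end": int(end),
            "strand": strand,
            "score": float(score),
        })
    return {
        "motifs": motifs,
        "sequence_name": fields[1],
        "start": int(fields[2]),
        "length": int(fields[3]),
    }
```

In the example above, the module starts at 1003 and the second motif ends at 1209, which agrees with the reported total length of 206.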

Module Organization

One of the important features of modstorm is the ability to search for modules with a specified organization. To specify the organization of a module, a line such as the following must be in the module file:

OR  [M1:n:12.407]-(250)-[M2:p:10.214]

This indicates that the module must start with M1 on the negative strand with a score threshold of 12.407, followed by a spacer region not greater than 250 bases, and then M2 on the positive strand with a score threshold of 10.214. If the strands are not specified, such as in:

OR  [M1: :12.407]-(250)-[M2: :10.214]

both strands are searched for either factor, with the same constraints on score thresholds and on spacing between the factors as before. For more information about the CREAD module format, see the modstorm man page.
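To make the structure of an OR line explicit, here is a Python sketch that splits it into motif and spacer elements. It covers only the two forms shown above; the modstorm man page may define further syntax, and the function name and tuple layout are hypothetical.

```python
import re

# Either a motif element [name:strand:threshold] or a spacer element (max_bases).
_ELEMENT = re.compile(r"\[([^\]:]*):([^\]:]*):([^\]:]*)\]|\((\d+)\)")


def parse_organization(line):
    """Parse an OR line into a list of motif and spacer tuples.

    Motif elements become ("motif", name, strand, threshold), where strand
    is None when it is left blank (search both strands). Spacer elements
    become ("spacer", max_bases).
    """
    if not line.startswith("OR"):
        raise ValueError("not an OR line: %r" % line)
    elements = []
    for match in _ELEMENT.finditer(line[2:]):
        if match.group(4) is not None:
            elements.append(("spacer", int(match.group(4))))
        else:
            strand = match.group(2).strip() or None
            elements.append(
                ("motif", match.group(1).strip(), strand, float(match.group(3)))
            )
    return elements
```

For the first example line this yields M1 on strand n with threshold 12.407, a spacer of at most 250 bases, and M2 on strand p with threshold 10.214; for the second, both strands come back as None.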