RMAP: A program for mapping Solexa reads

The RMAP package includes two programs, rmap and rmapq, both of which can map sequencing reads to their genomic location. The difference between these two programs is that rmapq considers the sequencing quality for each base in mapping. For both programs, the length of read must be at most 64bp, and should not be shorter than 20bp. The user must specify a maximum number of mismatches permitted between a read and the genomic location to which it maps. For rmapq, the quality score cutoff to define a high quality (HQ) base needs to be specified, and the mismatches are only counted in those HQ bases with quality score higher than this cutoff. For example, when mapping reads of length 36bp by rmap, it might be desirable to allow up to 3 mismatches in the mapping to account for sequencing errors or single nucleotide polymorphism (SNP). For 50bp reads, it might be desirable to allow, e.g., 5 mismatches.The RMAP program was developed by Dr. Andrew D Smith and Dr. Zhenyu Xuan.

Downloading RMAP

Source code: rmap_v0.41.tgz

Specifying the reads

Reads must be specified as either a FASTA format sequence file, or the Solexa sequencing probability score files. For the latter one, four numbers per base are listed to present the negative log-transform of the probabilities of four nucleotides (A, C, G, T) to be sequenced at this base position. The following shows examples for these two formats of the input reads files:

FASTA sequence file:

>1_168_0365_0364
GTTAAAAGTATGTGTGTCCTATGTCCTCAAGA
>1_168_0021_0625
TTTTATACACTTCAAAAAAAAAAAACCCTAGA
......
Sequencing probability score files:
-40  -40   40  -40     -40  -40  -40   40     -40  -40  -40   40       
 40  -40  -40  -40      40  -40  -40  -40      40  -40  -40  -40       
 40  -40  -40  -40     -40  -40   40  -40     -40  -40  -40   40        
 40  -40  -40  -40     -40  -40  -40   40     -40  -40   40  -40    
-40  -40  -40   40     -40  -40   40  -40     -40  -40  -40   40      
-40  -40   40  -40     -40  -40  -40   40     -40   40  -40  -40      
-40   40  -40  -40     -40  -40  -40   40      40  -40  -40  -40       
-40  -40  -40   40     -40  -40   40  -40     -40  -40  -40   40    
-40   40  -40  -40     -40   40  -40  -40     -40  -40  -40   40      
-40   40  -40  -40      35  -35  -40  -40      40  -40  -40  -40      
-40  -40   40  -40      40  -40  -40  -40
......

Each read should have at least a minimum length that is specified by the user as a command line argument. Reads with length exceeding the specified limit will be truncated at the 3'-end. Any characters other than {A,C,G,T,a,c,g,t} in the reads will be automatically transformed into an 'N'. When counting mismatches between a read and some genome location, any 'N' will mismatch whatever character it aligns with in program rmap, so will the high quality 'N' using rmapq.

High quality bases

When using rmapq, the users need to specify a float to define high quality (HQ) bases which have quality scores higher than the cutoff. For a read with the total HQ bases less than a specified minimum number, or the maximum length of the continuous HQ bases inside less than the specified seed length, this read will not be used in further mapping. For a give data set, we will identify the maximum quality score, and recommend using its half as the cutoff of HQ bases in the first try.

Specifying the genome

The genome can be specified in two ways: as a single chromosome file, or as a directory containing multiple files, one for each chromosome. The chromosome files must be in FASTA format, but must only have a single FASTA sequence (i.e. only the first line of the file can start with the '>' character). This is the format of the files downloadable from the UCSC Genome Browser. When specifying a directory containing multiple chromosomes, a filename suffix is also required to indicate those files in the directory that are to be searched. Any character other than {A,C,G,T,a,c,g,t} in the chromosome will be considered a mismatch.

How matches are scored

Scoring is simply a count of mismatches, and fewer mismatches are better. For rmap, the mismatches are counted in all bases used in mapping, while only high quality bases will be used in counting mismatches for rmapq. If for a given read, there is no location in the genome found to match that read with fewer than the specified number of mismatches, then nothing is reported for that read. If there is a "good" match, then the best match is reported, along with the number of mismatches. These matches are reported in BED format:

chromosome   start    end   name   score   strand
For example:
chr1   153728548   153728583   s_2_0100_1   1   +
If more than one genomic sequence matches the read with the fewest number of mismatches, then the read is considered ambiguous, and is not reported. There is an option to have the names of all ambiguous reads reported in a separate file.

Details

RMAP uses the "exclusion principle". For example, a read of length 36 can match a genomic sequence with 3 mismatches only if there is at least one sub-read of length at least 9bp that matches the corresponding part of the genomic sequence exactly (these are called "seeds"). For this reason, specifying fewer mismatches can make the program run faster, because the seeds are matched first, and fewer exact matching seeds will be found when the seeds are longer. However, there is an option to specify a seed length. In the example, specifying a seed length of 10bp, instead of 9bp, will result in some speedup, with very little effect on the accuracy of mapping. This is both for statistical reasons, and because errors in sequencing tend to cluster together (especially at the ends of the reads). The program also has a "verbose" option, which reports progress. This option will cause minor slow-down of the program, and is not recommended unless the progress information is important. The program also has the option to specify an input buffer size for buffering the chromosome data. The buffer size has little influence on the speed of the program, but does affect memory use. Also having a larger buffer _might_ improves efficiency in a cluster environment, depending on the network file system (i.e. when too many small reads cause problems).

One specific consideration of RMAP is using sequencing quality scores in mapping. For example, although Solexa sequencer can produce reads as long as 50bp now, the quality of each base is not identical. There are much more sequencing errors in the 3'-end of the read, which is synthesized in the later cycles, than in its 5'-end, therefore the quality scores of these bases at 3'-end are low. Using longer reads ideally could achieve higher mapping specificity, the specified maximum number of mismatches for a "good" mapping reduces this potential when there are too much errors in the 3'-end of the read. The program rmapq only count mismatches in specified HQ bases, which can use as many as possible of the reliable sequence information to achieve a good mapping. Our preliminary studies have shown that rmapq can achieve better mapping performance than rmap on both high quality and noisy sequencing data, with acceptable slowdown of the running speed.

System Requirements

rmap has only been tested on systems running linux on an x86_64 architecture. Compiling the program requires a recent version of g++ (from GCC>=4.0). The only other library required is the GNU popt library, for parsing command-line arguments. Compiling the program is done by simply typing "make" at the command prompt from the directory containing the source code and Makefile.

Examples

This example will map the reads in s_1_0001_seq.fa to the chromosome sequence specified in the file chr12frag.fa:

% rmap -w 36 -m 3 -h 11 -c Example/chromosome/chr12frag.fa Example/s_1_0001_seq.fa
The following example will map the reads in s_1_0001_seq.fa to the chromosome specified in the file chr12frag.fa by only counting mismatches in bases with quality score higher than 10, with the seed length being 11bp, and the minimum number of HQ bases for each read being 20, and the quality score file in Solexa PRB format s_1_0001_prb.txt is specified by -Q option:
% rmapq -c Example/chromosome/chr12frag.fa -w 36 -m 2 -h 11 -M 20 -q 10  \ 
        -Q Example/s_1_0001_prb.txt Example/s_1_0001_seq.fa

Questions or comments? Contact Andrew Smith. Last updated: