SpliceTrap is a statistic tool for quantifying exon inclusion ratios in paired-end RNA-seq data, with broad applications for the study of alternative splicing. SpliceTrap approaches to exon inclusion level estimation as a Bayesian inference problem. For every exon it quantifies the extent to which it is included, skipped or subjected to size variations due to alternative 3’/5’ splice sites or Intron Retention. In addition, SpliceTrap can quantify alternative splicing within a single cellular condition, with no need of a background set of reads.
Wu J, Akerman M, Sun S, McCombie WR, Krainer AR, Zhang MQ (2011) SpliceTrap: a method to quantify alternative splicing under single cellular conditions: Bioinformatics 27, 3010-3016, doi:10.1093/bioinformatics/btr508 Full Text
To quantify the inclusion ratio of every exon, SpliceTrap utilizes a comprehensive exon trio database called TXdb. In this database, every entry is an exon trio used as a reference to map paired-end reads. Subsequently, SpliceTrap takes advantage of the abundances and positions of the reads, to test the hypothesis whether the middle exon is alternatively spliced or not.
The exon inclusion ratio is defined as the expression level of the inclusion isoform divided by the total expression level of both isoforms (inclusion and skipped ) derived from an exon trio.
TXdb is an exon isoform database generated from dbCASE and RefSeq annotations. The database is in BED format. Since version 0.85, TXdb entry IDs have been changed to be represented by a pattern “XX-YY-ZZ-xxxx-yyyy.z”, where XX is the event type we will test in SpliceTrap, YY is the event annotation derived from RefSeq/dbCASE, ZZ is the chromosome id (1-21,X,Y, etc.), xxxx and yyyy are the start and end position of the middle exon/block, and z is a number to present differnt events involving the same exon. The 2-letter codes for event types are as follow:
- CS: ConStitutive exon
- CA: CAssette exon
- AA: Alternative Acceptor (3’ splice site)
- AD: Alternative Donor (5’ splice site)
- IR: Intron Retention
NOTE: For AA, AD and IR, the middle blocks are the variant part in the isoform, please see our publication for details.
In the current version, TXdbs for Human(hg18 & hg19), Mouse(mm9) and Rat (rn4) are provided. Databases for other genome versions can be supplied upon request.
Starting from v0.86, TXdb generator was provided to the user, the script can generate user-defined database from BED/GTF format transcript annotations, please see details in the Usage section.
The evidences (transcript ids) are also provided in the TXdb folder, the file name is TXdb.evi.
We provide an option for coverage cutoffs in three different levels of stringency. The cutoffs are applied to each member of the exon trio independently. Only exon trios of which the first and last exons are sufficiently covered are accepted. We use a dynamic coverage strategy; wherein higher coverage cutoffs are applied on shorter exons and lower coverage cutoffs are applied on longer exons. Using this method, we can achieve a better balance between information loss and accuracy. This cutoff can be specified by -c for both SpliceTrap and PostAnalysis.
Another filter we recommend to use is to require certain number of reads crossing junctions of either Long/Short isoforms of an event, this can be done with the option -j. For AA, AD and IR, the junctions were defined between blocks (not exons).
Unzip the tarball into a folder you like.
Enter the folder, then:
NOTE, please make sure your computer is connected to the internet, a script included in the package (downloaddb.pl) will download databases you specify from our server. wget and tar would be required in this process.
If everything goes correctly, you will see an executable file SpliceTrap in the folder, Now you can run:
or you can copy it to somewhere else, but do NOT move this folder. Also, you can see TXdbgen, PostAnalysis and SpliceChange in this folder.
Before reinstalling the program, you need to clean/uninstall it first by:
NOTE, this will clear everything in ./db, if you have user-defined database there, please backup it first.
Simply run the script to see a brief help:
The details of the parameters are as follow:
Option Description -m Mapping software, bowtie/rmap, 2 mismatches are allowed for both programs. Please make sure the software you want to use has been properly installed and can be found in your PATH. (default: bowtie) -t Split the input file into trunks with this number of lines for mapping (default: 5000000), please adjust this number according to your cluster’s setup and your input file size. The maximum number of the trunks is 676. -d Database you want to use, ALL/hg18/hg19/mm9/rn4/userdefined, this parameter corresponds to the folders in ./db, for ‘hg18/hg19/mm9/rn4’, you can download them during installation, For ‘userdefined’, you need to run TXdbgen to generate a database first. (default: hg18) -s Read Size of one end.(default: 36) -1 Read File 1 in fasta/fastq format. -2 Read File 2 in fasta/fastq format. The reads in these two read files should be corresponded, the reads from the same pair-mate must be in the same line in the files. If your experiment is singled-end, then simply ignore this option. However, paired-end is recommended for accuracy purpose. -c Cutoff level, H/M/L, means high/middle/low, high is the most stringent level. Please check the Cutoff section for short explanations. (default: M) -j Number of junction reads per junction required for either exon-isoform. (default: 5) -o Output file prefix. (default: Result) –outdir Output directory. (default: ./) -p Bowtie threads number, the same as -p in bowtie manual. This parameter is for the users who don’t have SUN Grid Engine Qsub, if you are using Qsub and it actually works, please ignore this parameter. (default: 1) –force-no-qsub Force SpliceTrap NOT to use Qsub. –noIRM Skip IRM adjustments, use this if you don’t have enough events to construct reliable IRM distributions. Please remember to use this option too for the downstream analysis.
TXdbgen allows the user to generate a TXdb database from BED/GTF format transcript annotation file (e.g., output from assembly software or refGene annotations).
Simply run it to get a brief usage:
The details of the parameters are as follow:
-g The location of the fasta files downloaded from UCSC genome browser, the names of the Fasta files should be something like chr1.fa etc. -a The transcript annotation file, in BED or GTF format. -n The name of the database, this name is to be used for SpliceTrap option -d.(default: userdefined) --gtf If the input file is in GTF format, please specify this parameter.
Note: for both BED and GTF format annotation files, we recommend to use the files downloaded from UCSC table browser. For example, the first column of the annotation file should be in the format like “chr1”, not “1”. In GTF file, “transcript_id” should be included in every line and this id should not be ambiguous.
PostAnalysis is a wrap-up script for post-analysis of the output from SpliceTrap, it accepts the following parameters:
-i Input file, it should be the .ratio file in the output folder of SpliceTrap. -o The output file, will be in the format of the .txt file described below. -c Cutoff level, H/M/L, means high/middle/low, high is the most stringent level. Please check the Cutoff section for short explanations. (default: M) -j Number of junction reads per junction required for either exon-isoform. (default: 5) --noIRM Use unadjusted inclusion ratios.
SpliceChange is a script for comparison between two outputs from SpliceTrap, the parameters are as below:
-1 Input file 1, output from SpliceTrap/PostAnalysis, .raw file in the output folder. -2 Input file 2, output from SpliceTrap/PostAnalysis. .raw file in the output folder. -o Output file prefix. -c Minimal change required [default:0.3]. -m Minimal inclusion ratio required for at least one of the two conditions [default:0.1]. For example, 0/0.2 will pass, while 0/0.05 will be filtered out. --noIRM Use unadjusted inclusion ratios.
NOTE: The change here is calculated by (r2-r1)/(r1+r2). Please find the output format below.
Test Fasta files have been provided in the example folder. This dataset is part of a Paired-end RNA-seq data and most of the reads should be mapped to human chromosome 12. To test if SpliceTrap is properly installed, you can run the test shell script in this folder.
*.txt is formatted as below
1 Splicing modality. Based on the statistics estimations we report the splicing modality using the following two-letter codes:
- CS: Constitutive exon (inclusion ratio > 0.9)
- CA: Cassette (alternative) exon (inclusion ratio < 0.9)
- AA: Alternative 3’ Splice Site (inclusion ratio indicated abundance of the extended isoform)
- AD: Alternative 5’ Splice Site (inclusion ratio indicated abundance of the extended isoform)
- IR: Intron retention (inclusion ratio indicated abundance of the retention isoform)
- na: Not available (expression level below cutoff)
2 Inclusion ratio estimation (between 0 and 1)
3 Chromosome id
4 exon 1 coordinates
5 exon 2 coordinates
6 exon 3 coordinates
7 strand (+/-)
8 exon 1 coverage
9 exon 2 coverage
10 exon 3 coverage
11 Coverage on the full isoform
This file contains necessary information for down-stream analysis.
*.log is a log file for debugging purpose.
Here we present an example pipeline for better understanding of how to use SpliceTrap, for easier explanation, we suppose we have a Paired-end RNAseq data (50ntX2) and the raw data are in fastq format, the filenames are end1.fq and end2.fq. In these two files, the lines should be corresponded (from same read-pairs).
Please refer to the Usage section above for details of each script and modify the parameters accordingly.
Construct a database if necessary, otherwise, skip to step 2:
/the/path/to/the/folder/TXdbgen -g /data/database/mm9/genome/fasta/ -a anno.bed -n mydatabase
Run SpliceTrap to get ratios with mm9 TXdb:
/the/path/to/the/folder/SpliceTrap -d mm9 -1 end1.fq -2 end2.fq -s 50
or with “mydatabase” constructed in step 1:
/the/path/to/the/folder/SpliceTrap -d mydatabase -1 end1.fq -2 end2.fq -s 50
If you want to filter the results with different parameters, you can run PostAnalysis:
/the/path/to/the/folder/PostAnalysis -i result.ratio -o newresult.txt
If you want to compare inclusion ratios between two samples, you can use SpliceChange:
/the/path/to/the/folder/SpliceChange -1 result1.raw -2 result2.raw -o changes.txt
How do I know if SpliceTrap is correctly installed and works well?
There is an “example” folder and you can run the shell script there and see if it works.
Do I need to rerun SpliceTrap if I want to use some other parameters?
PostAnalysis could do some of the filtering, try it before you decide to rerun.
What if I don’t have Qsub?
If you don’t have SUN grid Qsub, you can still run SpliceTrap, and please also have a look at the -p option of SpliceTrap.
What do I need to be caution if I want to use a user-specified TXdb?
First, you need to generate a database with TXdbgen. Second, if you don’t have many events in the database, we suggest you do not use IRM corrections. (i.e., Use the 2nd column in *.raw instead of the 3rd column ). See option –noIRM.
What if I only have single end reads?
SpliceTrap supports single end reads too, to do this, just ignore -2 option of SpliceTrap.
I got the event ids, but how do I know where they are?
You can find annotation files in the corresponded TXdb folder in ./db. The TXdb.bed file contains the positions of the event, you can copy the lines and generate a custom track in UCSC genome browser, then you will see the information graphically. You can also find the evidence transcripts for each event in the file TXdb.evi.
Please contact Jie Wu at wuj at cshl dot edu