
Many examples can be found in the man pages. This page will be updated with additional examples.
This example shows how to use the mars program to do regression on the log-ratios of binding p-values obtained from a ChIP-chip experiment. A ChIP-chip experiment provides p-values indicating whether significant binding has been detected for each sequence represented on the chip. For regression, we want to fit motif scores to the log of the ratio of each binding p-value to some significance cutoff, such as 0.05, 0.01 or 0.001.
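The log-ratio transformation described above can be sketched in a few lines of Python. This is only an illustration, not part of the CREAD tools: the function name, the choice of natural logarithm, and the p-values are all assumptions made for the example.

```python
import math


def log_ratio(p_value, cutoff=0.001):
    """Return log(p_value / cutoff).

    Assumption: natural log. The result is negative when the p-value is
    more significant (smaller) than the cutoff, and zero at the cutoff.
    """
    if p_value <= 0 or cutoff <= 0:
        raise ValueError("p-value and cutoff must be positive")
    return math.log(p_value / cutoff)


# Hypothetical p-values for three sequences (not real data).
p_values = {"seq1": 1e-5, "seq2": 0.001, "seq3": 0.2}
ratios = {name: log_ratio(p) for name, p in p_values.items()}
```

Sequences with strong binding evidence end up with negative values, so a model fitted to these values is rewarded for separating bound from unbound sequences around the chosen cutoff.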
First we will use the featuretab program to create a "feature table". A feature table is a table whose rows correspond to sequences and whose columns correspond to features. Essentially, the entry in cell (i, j) of the table is the value of feature j when evaluated on sequence i. This is not exactly true, because the first column of the table contains the name of each sequence (taken from the FASTA format file in which the sequences are first given to featuretab), and the first row of the table contains the names of the features. The final two columns are also different. The last column contains (so far unused) weights corresponding to each sequence, and the second-to-last column contains the vector of values that a model fitting program (like mars) tries to fit.
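To make the layout concrete, here is a small Python sketch that reads a table laid out as described above. The exact file format written by featuretab is not shown on this page, so the tab delimiter, the header cell names, and all of the values below are assumptions for illustration only.

```python
def parse_feature_table(text):
    """Parse a feature table into (feature_names, rows).

    Assumed layout: tab-separated; first row holds column names; first
    column holds sequence names; second-to-last column holds the values
    to fit; last column holds the per-sequence weights.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    feature_names = header[1:-2]
    rows = {}
    for line in lines[1:]:
        cells = line.split("\t")
        values = [float(cell) for cell in cells[1:]]
        rows[cells[0]] = {
            "features": dict(zip(feature_names, values[:-2])),
            "target": values[-2],
            "weight": values[-1],
        }
    return feature_names, rows


# Hypothetical table: two sequences, two features (made-up numbers).
example = "\n".join([
    "\t".join(["name", "motifA", "motifB", "target", "weight"]),
    "\t".join(["seq1", "7.5", "2.0", "-4.6", "1.0"]),
    "\t".join(["seq2", "3.1", "6.4", "5.3", "1.0"]),
])
feature_names, rows = parse_feature_table(example)
```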
The featuretab program takes a file containing patterns (which must be in CREAD pattern format), and either:
To build the feature table using motifs from the file motif.mat and
sequences in the file input.faa, run the featuretab program like
this:
The "max_score" at the end of the command indicates that the max-score
feature is to be used. The value of this feature, for a given sequence,
is the maximum score of any site in that sequence when matched with a
motif. The feature table will be written to the file table.tab.
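The idea behind the max-score feature can be sketched in Python as follows. This is not the featuretab implementation: the matrix representation, the additive scoring, and the forward-strand-only scan are assumptions, and the toy matrix is invented for the example.

```python
def site_score(matrix, site):
    """Sum the per-position scores of one site under a scoring matrix."""
    return sum(column[base] for column, base in zip(matrix, site))


def max_score(matrix, sequence):
    """Return the best score over all sites in the sequence.

    Assumption: only the forward strand is scanned.
    """
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        site_score(matrix, sequence[start:start + width])
        for start in range(len(sequence) - width + 1)
    )


# Toy 3-column scoring matrix that prefers the site "ACG" (made up).
toy_matrix = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]
best = max_score(toy_matrix, "TTACGTT")  # best site is "ACG", score 6.0
```

Each motif contributes one such number per sequence, which is what fills one feature column of the table.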
Next we want to use mars to produce a model
that identifies the motif whose max-score values can best fit the
binding log-ratios. Run mars like this:
This says that mars should build a model with up to 3 terms (the first
is the constant term, and then it adds two linear-spline basis
functions), produce the best model (the -m flag) and produce the RIV
(reduction in variance) obtained by fitting with this model. The model
is printed first in the output, and the RIV is printed on the next
line. The final model only contains 2 terms because one was pruned
during the backward deletion part of the algorithm: the GCV
(generalized cross validation error) was smallest without the third
term. The feature used in the only non-constant term is the max-score
of the motif named MA0046. If you want to know what that means, and
also find out which factor was ChIP'd in the data I used, go look up
that motif in the JASPAR database.
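The two ingredients mentioned above, linear-spline basis functions and the GCV criterion used for pruning, can be sketched in Python. This is a generic illustration and not the code of the mars program: the hinge form of the basis function and the GCV formula follow the usual textbook description of MARS, and how mars counts effective parameters is not documented here, so that count is passed in directly as an assumption.

```python
def hinge(x, knot, sign=1):
    """Linear-spline basis: max(0, x - knot), or max(0, knot - x) if sign is -1."""
    return max(0.0, sign * (x - knot))


def predict(x, intercept, coef, knot, sign=1):
    """Two-term model: a constant term plus one hinge term."""
    return intercept + coef * hinge(x, knot, sign)


def gcv(residual_sum_squares, n_samples, effective_params):
    """Generalized cross validation error (textbook form, assumed here).

    GCV = (RSS / N) / (1 - C / N)^2, where C is the effective number of
    parameters. A larger model needs a clearly smaller RSS to win.
    """
    if not 0 <= effective_params < n_samples:
        raise ValueError("effective_params must be in [0, n_samples)")
    return (residual_sum_squares / n_samples) / (
        1.0 - effective_params / n_samples
    ) ** 2


# Made-up numbers: the larger model fits slightly better (smaller RSS)
# but still has the larger GCV, so its extra term would be pruned.
gcv_two_terms = gcv(residual_sum_squares=10.0, n_samples=50, effective_params=4)
gcv_three_terms = gcv(residual_sum_squares=9.8, n_samples=50, effective_params=7)
```

With these hypothetical values the two-term model has the lower GCV, which mirrors the situation described above where the third term is dropped during backward deletion.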