enr [options] --p <primary sequences> [--m <motifs>]+
The name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least 2 valid sequences or ENR will reject it. Note that the command-line version of ENR does not attempt to detect the alphabet from the primary sequences, so you should specify it with the --dna, --rna, --protein or --alph options.
The name of a file containing motifs in MEME format that ENR will test for enrichment in the primary sequences. This argument may be given more than once, allowing you to analyze motifs from several motif files simultaneously.
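
As a rough illustration of the usage line above, the following Python sketch assembles an `enr` argument list and, optionally, runs it and captures standard output. The helper names `build_enr_command` and `run_enr` and the file names `peaks.fa`, `motifs1.meme` and `motifs2.meme` are hypothetical; only flags mentioned on this page (`--p`, `--m`, `--n`, `--hofract`, and the alphabet flags) are used.

```python
import subprocess


def build_enr_command(primary, motif_files, control=None,
                      alphabet_flag="--dna", hofract=None):
    """Assemble an argument list following the usage line on this page.

    Hypothetical helper: only flags documented on this page are emitted.
    """
    # The command-line version does not detect the alphabet, so pass it explicitly.
    cmd = ["enr", alphabet_flag, "--p", primary]
    for motif_file in motif_files:
        # --m may be repeated, once per motif file.
        cmd += ["--m", motif_file]
    if control is not None:
        cmd += ["--n", control]
    if hofract is not None:
        cmd += ["--hofract", str(hofract)]
    return cmd


def run_enr(cmd):
    """Run ENR and return its TSV output as text (requires `enr` on PATH)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Placeholder file names; hofract=0 measures significance on all input sequences.
cmd = build_enr_command("peaks.fa", ["motifs1.meme", "motifs2.meme"], hofract=0)
print(" ".join(cmd))
```

Calling `run_enr(cmd)` would return the tab-separated output as a string, assuming ENR is installed and the input files exist.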
ENR writes its output to standard output. The output is in tab-separated values format (TSV). The first line of the output contains the (tab-separated) names of the fields. The names and meanings of each of the fields are described in the table below.
field | name | contents |
---|---|---|
1 | ID | The ID of the motif. |
2 | ALT_ID | The alternate ID of the motif (or blank). |
3 | POS_MATCHES | The number of primary sequences matching the motif with scores greater than or equal to the optimal score threshold. |
4 | NEG_MATCHES | The number of negative sequences matching the motif with scores greater than or equal to the optimal score threshold. |
5 | SCORE_THR | The match score threshold giving the optimal p-value. This is the score threshold used by ENR to determine the values of POS_MATCHES and NEG_MATCHES. |
6 | RATIO | The relative enrichment ratio of the motif in the primary vs. control sequences, defined as (POS_MATCHES/NPOS) / (NEG_MATCHES/NNEG), where NPOS is the number of primary sequences in the input, and NNEG is the number of negative sequences in the input. |
7 | PVALUE | The statistical significance (p-value) of the motif's enrichment, according to the chosen objective function. |
8 | LOG10_PVALUE | The base-10 logarithm of the p-value. |
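
The output format described above can be read with Python's standard `csv` module. The sketch below parses a made-up sample (the values in `SAMPLE_TSV` are invented for illustration, not real ENR output) and recomputes the RATIO field from its definition. NPOS and NNEG are not columns of the output, so the values of 100 used here are assumptions, and the handling of NEG_MATCHES = 0 is this sketch's own choice because this page does not say what ENR does in that case.

```python
import csv
import io
import math

# Invented sample in the documented column order; not real ENR output.
SAMPLE_TSV = (
    "ID\tALT_ID\tPOS_MATCHES\tNEG_MATCHES\tSCORE_THR\tRATIO\tPVALUE\tLOG10_PVALUE\n"
    "motif_1\tEXAMPLE\t40\t10\t8.5\t4.0\t1e-06\t-6.0\n"
)


def read_enr_tsv(text):
    """Parse ENR-style TSV text into a list of dicts with numeric fields converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        for key in ("POS_MATCHES", "NEG_MATCHES"):
            row[key] = int(row[key])
        for key in ("SCORE_THR", "RATIO", "PVALUE", "LOG10_PVALUE"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


def enrichment_ratio(pos_matches, npos, neg_matches, nneg):
    """(POS_MATCHES/NPOS) / (NEG_MATCHES/NNEG), as defined for the RATIO field."""
    if neg_matches == 0:
        # Assumption of this sketch only; ENR's behaviour here is not documented above.
        return math.inf
    return (pos_matches / npos) / (neg_matches / nneg)


rows = read_enr_tsv(SAMPLE_TSV)
first = rows[0]
# NPOS = NNEG = 100 are assumed input sizes for this example.
ratio = enrichment_ratio(first["POS_MATCHES"], 100, first["NEG_MATCHES"], 100)
print(first["ID"], ratio)
```
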
Option | Parameter | Description | Default Behavior |
---|---|---|---|
Objective Function | | | |
--objfun | de\|cd | This option selects the objective function that ENR optimizes in searching for motifs. | ENR uses the Differential Enrichment (de) function. |
Control Sequences and Hold-out Set | | | |
--n | control sequences | The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, ENR trims the control sequences so that both sets have the same average length. | If you do not provide control sequences, ENR creates them by shuffling a copy of each primary sequence, preserving the frequencies of words of length k (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts. |
--kmer | k | Preserve the frequencies of words (k-mers) of this size when shuffling primary sequences to create control sequences. k must be in the range 1 to 6. ENR also estimates a background model of order k-1 from the primary (positive) sequences for use in log-likelihood scoring of motif sites. | ENR preserves the frequencies of words of length 3 (DNA and RNA) or 1 (Protein and Custom alphabets), and constructs background models of order 2 (DNA and RNA) or order 0 (Protein and Custom alphabets). |
--hofract | hofract | The fraction of the primary sequences that ENR will randomly select and hold out, exactly as STREME does. ENR will therefore report the same statistical significance for motifs found by STREME as STREME itself reports. Note: Set this option to zero if you want to measure the statistical significance of your motifs in the complete set of input sequences. | ENR sets hofract to 0.1 (10% of the primary sequences). |
-seed | seed | Random seed for shuffling and for sampling the hold-out set sequences (see above). | ENR uses a random seed of 0. |
Alphabet | | | |
Motif Scoring and Selection | | | |
Misc | | | |
--verbosity | 1\|2\|3\|4\|5 | A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), ENR outputs only error messages, whereas the other extreme, 5 (dump), outputs a large amount of information intended for debugging. | The verbosity level is set to 2 (normal). |
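
To make the hold-out and seed options above more concrete, here is a minimal Python sketch of a seeded random split. It is an illustration only: the function name `split_holdout`, the use of Python's `random` module, and the rounding of the hold-out size are assumptions, and ENR's own random number generator and sampling procedure will not produce the same selection.

```python
import random


def split_holdout(sequences, hofract=0.1, seed=0):
    """Randomly set aside a fraction of the sequences as a hold-out set.

    Hypothetical sketch; the defaults mirror the documented defaults
    (hofract 0.1, seed 0), but the sampling itself is not ENR's.
    """
    rng = random.Random(seed)  # same seed gives the same split
    # Assumption: the hold-out size is rounded down to a whole number.
    n_holdout = int(len(sequences) * hofract)
    holdout_idx = set(rng.sample(range(len(sequences)), n_holdout))
    training = [s for i, s in enumerate(sequences) if i not in holdout_idx]
    holdout = [s for i, s in enumerate(sequences) if i in holdout_idx]
    return training, holdout


# Twenty placeholder sequences named seq0 .. seq19.
seqs = [f"seq{i}" for i in range(20)]
training, holdout = split_holdout(seqs, hofract=0.1, seed=0)
print(len(training), len(holdout))

# With hofract set to zero, no sequences are held out.
all_training, empty_holdout = split_holdout(seqs, hofract=0.0)
```
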
ENR evaluates each motif in the motif file(s) for enrichment in the primary sequences.
ENR builds a single suffix tree that includes both the primary and control sequences (but not the hold-out set sequences).
ENR converts each motif from a frequency matrix to a log-odds score matrix. By default, ENR creates a background model from the control sequences, but you can provide a different background model if you wish.
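
The conversion from a frequency matrix to a log-odds score matrix can be sketched as follows. This is a generic illustration with an order-0 background, not ENR's exact computation: the function name `to_log_odds`, the pseudocount value, the smoothing formula and the base-2 logarithm are all assumptions.

```python
import math


def to_log_odds(freq_matrix, background, pseudocount=0.01):
    """Convert per-position letter frequencies into log-odds scores.

    freq_matrix: list of dicts mapping letter -> frequency (one dict per position).
    background:  dict mapping letter -> background frequency (order-0 model).
    The pseudocount smoothing is an assumption made to avoid log(0).
    """
    scores = []
    for column in freq_matrix:
        column_scores = {}
        for letter, bg in background.items():
            freq = column.get(letter, 0.0)
            # Mix a small amount of background into the observed frequency.
            smoothed = (freq + pseudocount * bg) / (1.0 + pseudocount)
            column_scores[letter] = math.log2(smoothed / bg)
        scores.append(column_scores)
    return scores


uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # position strongly prefers A
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
]
pssm = to_log_odds(motif, uniform_bg)
print(pssm[0]["A"] > 0, pssm[0]["C"] < 0)
```

Letters that are more frequent in the motif than in the background receive positive scores, and an uninformative position scores close to zero for every letter.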
ENR computes the unbiased statistical significance of the motif by using the motif and the optimal discriminative score threshold (based on the primary and control sequences) to classify the hold-out set sequences, and then applying the statistical test (Fisher's exact test, Binomial test, or the cumulative Bates distribution) to the classification. Classification is based on the best match to the motif in each sequence (on either strand when the alphabet is complementable).
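
The classification step above can be pictured with the following Python sketch, which scores each sequence by its best motif match, counts sequences at or above a threshold, and applies a one-sided Fisher's exact test. It is a simplified illustration under stated assumptions: only the given strand is scanned, sequences contain only core letters, and the function names, toy score matrix and threshold are hypothetical. It does not cover the Binomial or Bates alternatives.

```python
from math import comb


def best_match_score(seq, pssm):
    """Highest-scoring window of the motif in seq (given strand only)."""
    width = len(pssm)
    return max(
        (sum(pssm[j][seq[i + j]] for j in range(width))
         for i in range(len(seq) - width + 1)),
        default=float("-inf"),  # sequence shorter than the motif
    )


def count_matches(sequences, pssm, threshold):
    """Number of sequences whose best match scores at or above the threshold."""
    return sum(best_match_score(s, pssm) >= threshold for s in sequences)


def fisher_exact_greater(pos_hits, npos, neg_hits, nneg):
    """One-sided Fisher's exact test: P(at least pos_hits matches among primaries)."""
    total_hits = pos_hits + neg_hits
    total = npos + nneg
    upper = min(total_hits, npos)
    numerator = sum(
        comb(total_hits, k) * comb(total - total_hits, npos - k)
        for k in range(pos_hits, upper + 1)
    )
    return numerator / comb(total, npos)


# Toy score matrix for the two-letter motif "AC" (hypothetical values).
toy_pssm = [
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 1, "G": -1, "T": -1},
]
primary = ["GGACG", "TACTT", "ACGGG"]
control = ["GGGGG", "TTTTT", "CACAG"]
pos_hits = count_matches(primary, toy_pssm, threshold=2)
neg_hits = count_matches(control, toy_pssm, threshold=2)
p_value = fisher_exact_greater(pos_hits, len(primary), neg_hits, len(control))
print(pos_hits, neg_hits, p_value)
```
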