GenomeHistory Documentation
Installation:
See the INSTALL
file included with GenomeHistory for installation instructions.
Citing GenomeHistory:
Conant, G. C. and Wagner, A. (2002) GenomeHistory: A software tool and
its application to fully sequenced genomes, Nucleic Acids Research,
30(15): 3378-3386 [Reprint]
Running GenomeHistory:
GenomeHistory can be invoked either with command-line options or
through the use of a command file. An
example command file is included. The command-line options
can
be obtained by typing % GenomeHistory.pl with no
options.
Possible options are:
Long Option
|
Short Option
|
Example
|
Default
|
Protein sequences |
pi |
=<protein sequences file> |
orf_trans.fasta |
ORF sequences |
ni |
=<nucleotide sequences file> |
orf_coding.fasta |
Output file |
o |
=<output file> |
genome_ks.txt |
Error file |
e |
=<error file> |
genome_ks.err |
Gene file |
gf |
=<Gene name file> |
NONE |
Gene Name Field |
gn |
=<Gene name field #> |
2 |
BLAST matrix location |
bl |
= /usr/local/wublast/BLOSUM62 |
NONE |
Only Top BLAST match |
tb |
=YES/NO |
NO |
BLAST threshold |
bt |
=0.001 |
0.01 |
Minimum ORF Translation length |
mo |
=55 |
40 |
Minimum Number Aligned Residues |
ma |
=45 |
40 |
Percent identity threshold |
pt |
=0.70 |
0.50 |
Warnings |
w |
=ON/OFF |
OFF |
Genetic code |
gc |
=<U/VM/YM/IM/MM/CN/EM> |
U (Universal) |
Ordered Gene List |
og |
=YES/NO |
NO |
Base Frequencies |
bf |
=<CODON/OBSERVED/EQUAL> |
CODON |
Checkpoint |
cp |
=ON/OFF |
OFF |
Any combinations of options may be used. When used on the
command
line, only the short option format is allowed. The command file
interprets
the same set of options, but allows either format. Note that in
both
cases the equal (=) sign is required. Options are not
case-sensitive
(but filenames may be).
Example command-line invocation:
% GenomeHistory.pl pa=my_genome.fasta na=my_genome_nuc.fasta
bl=/usr/local/wublast/BLOSUM62
Example command file invocation:
% GenomeHistory.pl genome_analysis.txt
Generating the example dataset:
GenomeHistory includes a small example dataset (a portion of the yeast
genome) that can be used to test your installation of
GenomeHistory.
From the GenomeHistory directory, type (% is your prompt)
% GenomeHistory test_analysis.txt
This will create two new files, test_ks.txt and test_ks.err.
These files should be identical to the example files provided, example_ks.txt
and
example_ks.err.
You can compare them using the UNIX "diff"command:
% diff test_ks.txt example_ks.txt
% diff test_ks.err example_ks.err
In theory, "diff" should not print
anything. However, if you are using a different version of clustalw (not 1.82), there may be (mostly minor) differences.
Notes and Tips:
Locating nucleotide sequences corresponding to protein sequences:
One of the chief difficulties with the sort of genome analysis
performed
by GenomeHistory is matching nucleotide sequences to protein
sequences.
By default, GenomeHistory does this using keywords in the FASTA header
files (see the Gene Name Field option). Unfortuntately, some genomes do
not provide identifiers that allow one to one matching between protein
sequences and their corresponding nucleotide sequences.
This
problem can be overcome with a second matching method if the genome's
ORFs
and protein sequences are listed in a one to one order (meaning protein
sequence number 1 is the translation of nucleotide sequence number 1
etc.).
This second method is invoked with the Ordered gene list =
yes
option and simply uses the order that genes are listed in the two files
to identify nucleotide sequences. Microbial genomes seem
particularly
suited for use with this option, as alternative splicing and gene
identification
are less of a problem for these genomes.
Using the Genename field to locate nucleotide sequences:
If you are experiencing numerous errors where GenomeHistory is not able
to find nucleotide sequences for genes, you may want to experiment with
the "Gene Name Field =" option. Sequences in the nucleotide files
may have slightly different names than those in the protein sequence
file,
but the unquie identifiers should match. For instance, the
yeast genome stores unique names in field #2, while Arabidopsis uses
field 3. Note that this option will only be effective if there is
a unquie identifier common to both protein and nucleotide sequences
(for
instance the YXX### format yeast gene identifiers). If you cannot
find such an identifier, you may want to experiment with the Ordered
gene list = yes option described above.
Runtimes:
The running time of GenomeHistory is in large part dictated by the
number
of BLAST hits retained for each gene. In general, the lower
(smaller
E-value) the BLAST threshold is to, the fewer gene pairs will be
analyzed
and the faster the analysis will be. The trade-off, of course, is
the potential to miss significant hits. We have found values
around
E < 0.00001 to be quite conservative for the yeast genome.
BLAST in non-default locations:
Washington University BLAST expects to be installed at
/usr/ncbi/blast.
If this is not where you have installed BLAST, you will need to use the
"BLAST matrix location" option to locate your installation. For
instance
bl=/usr/local/wublast/BLOSUM62
Genetic Codes:
The Genetic code option takes on the following possible values:
- U: Universal
- VM: Vertebrate mitochondrial
- YM: Yeast mitochondrial
- MM: Mold mitochondrial
- IM: Invertebrate mitochondrial
- CN: ciliate nuclear
- EM: Echinoderm mitochondrial.
Using the "Only Top BLAST match=YES" option:
When this option is used, Ks and Ka is calculated
for each gene and its top BLAST hit, no matter what the E-value of that
hit. Thus, the "BLAST threshold" option is meaningless in this
context.
Repeated analysis of gene pairs:
Gene pairs could potentially represented (and analyzed) in two ways
(Gene
1, Gene 2) and (Gene 2, Gene 1). Under normal circumstances,
GenomeHistory
will only analyze one of these two combinations (the first one that
occurs).
However if the option "Only Top BLAST match = YES" is set, or if the
option
"Gene file = <filename>" is used, both comparisions may be made,
since
these two options may be used in analyses where it is not desirable to
exclude reciprocal comparisions.
Identical Sequences:
Each sequence pair for which Ks/Ka are calculated
is first checked to be sure that the two sequences are not
identical.
If they are, the word IDENTICAL is entered in the output rather than
values.
Saturation in Ks values:
If the divergence time between two gene pairs is sufficiently long,
there
may no longer be sufficient information in the two sequences to
estimate
Ks. Such sequences are said to be saturated.
Unfortunately,
numerical algorithms for estimating Ks under likelihood
cannot
directly determine that sequences are saturated. Instead we have
implemented a simple test: After the likelihood maximization is
complete,
the Ks value for the pair is multiplied by 10 and the
likelihood
is recalculated. If this likelihood is as high or higher than the
original likelihood, that indicates that the sequence pair has
saturated
Ks, and that Ks value in the output is marked
with
an *. Of course, this test does not address the problem of the
large
variances associated with high, but unsaturated, Ks
values.
Researchers will want to remember this issue when selecting duplicate
pairs
for analyses. A forthcoming manscript will discuss this issue in
more detail (Hahn, Conant and Wagner, Journal of Molecular
Evolution,
in press)
Checkpointing:
When the "Checkpoint" option is ON, GenomeHistory will look for a
file called .GenomeHistory.chkpnt.(argfile) (where argfile is the
command
file) or .GenomeHistory.chkpnt.cmdline (if invoking from the command
line).
If this file is found, GenomeHistory will restart from the point listed
in this file. This option is useful for stopping jobs and
restarting
them at a later time. For checkpointing to work
properly,
you must specify the same output and error filenames on the restart as
were originally used.
Note that the checkpoint file is created whether or not Checkpointing
is turned on.
File names:
You can use any combination of absolute and relative paths when
specifying
filenames to GenomeHistory.
Warnings option:
By default, any gene pair that do not meet the three analysis criteria
after pairwise alignment are discarded without further warnings.
By specifying "Warnings=on", you can output WARNING entries in the
error
file that indicate why a particular pair was discarded. This
option,
however, has the potential to make the error file very large.
Base Frequencies:
By default, GenomeHistory uses the observed base frequencies at each
codon
position in the calculation of Ks and Ka (in other words, the
nucleotide
composition at each codon position is averaged between the two
sequences).
In some cases, it may be preferable to use the over-all average base
composition
(OBSERVED) which does not break down the composition by codon
position.
It is also possible to use equal frequencies for all bases (EQUAL).
Gaps:
As written, GenomeHistory removes all gap positions in alignments
before
calculating Ks and Ka. This is done
because
gap characters can tend to downwardly bias estimates of sequence
diveragance
(Yang, Z., 2000, Mol. Bio. Evol. 51:423-432). It is
possible
to disable this feature of GenomeHistory by editting the
GenomeHistory.pl
script and removing the "-nogap" option from the calc_percent_dist
and like_pair_dist program options.
Alignment Matrices:
By default, both BLAST and Clustalw use the BLOSUM62 amino acid
similarity
matrix for sequence alignments. The BLAST matrix can be changed
with
the "BLAST matrix location= " option, provided you have another
alignment
matrix in the appropriate format. To use a different matrix in
clustalw,
you will need to edit GenomeHistory.pl to find the line that reads:
$output=`nice clustalw $tempseqfile /output=pir`;
Change this to read
$output=`nice clustalw $tempseqfile /output=pir
/PWMATRIX=<NAME>`;
where <NAME> is a matrix file or one of the default clustal
matrices.
For further details on clustal options, see http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/help9.html.
Comments on basic installation process:
To build and use the package, you will need a c and a c++ compiler
(preferably the GNu gcc and g++),
the Washington University implementation of BLAST, and clustalw.
The make command will build liblapack
(a partial lapack library), libf2c (because lapack was originally built
in Fortran), liblikelihood and
libdnafuncs (two of my libraries that have many generic functions),
like_pair_dist (which calculates Ks/Ka),
aligndna_new (which creates a nucleotide alignment from a protein
alignment),
and calc_percent_diff (which
calculates the percentage difference for all pairs of sequences in
an alignment).
You will need to add the GenomeHistory directory to your path.
You can then run the program
by typing GenomeHistory.pl <filename> where <filename> is a
input
file with the parameters for the run.
An example input file with descriptions of each command is included
at GenomeHistory/test_analysis.txt
Other platforms:
This package was developed and tested on a RedHat 7.1 Linux machines
with
the gcc compilers. It should work on
other platforms, but has not been tested on them. Use
configure.pl
or edit the file GenomeHistory/make.inc to
change the name of your c and c++ compiler.
Bug History:
7-29-02: Fixed a bug that caused incorrect nucleotide sequences
to
be returned in genomes (such as Drosophila) where gene names
can
be substrings of each other (e.g. CT1187 and CT11871).
Pairs
with this problem were still analyzed correctly, but also generated an
error message.
10-27-03: Fixed a bug that resulted in some gene pairs where Ks
= 0 (reported as Ks <10-4) being reported as
having
saturated in synonymous substitutions.
01-07-04: Fixed a bug preventing the "Ordered Gene list=YES" option
from working properly when a "Gene File" was provided.
10-08-04: Fixed a bug which caused GenomeHistory to fail to create
nucleotide alignments for calculating Ks and Ka
when compiled with gcc 3.0.
10-17-04: Fixed a bug in the GenomeHistory perl script which prevented the
analysis of gene pairs with E values of 0.0 when using some versions of WU-BLAST. Special thanks to Luke Hakes and Ignazio Carbone for help finding these last two bugs.
Included c++ packages:
The actual computational work of GenomeHistory is performed by
traditional
c++ executables. In addition to BLAST and Clustalw, this includes
3 programs we have written:
- calc_percent_dist: This program takes a sequence alignment
(protein
or nucleotide) and returns the pairwise percent similarity for all
pairs
of sequences in the alignment. The "-nogap" option causes it to
calculate
similarity based only on non-gap positions
- algndna_new: This is a complex program for deducing
nucleotide
sequence
alignments from protein sequence alignments. We use only a
portion
of it here: it also has the ablity to parse GenBank files to locate
introns
and exons from protein-coding genes
- like_pair_dist: The "meat" of GenomeHistory: this program
uses
numerical
optimization to find maximum likelihood estimates of Ks and
Ka.
Structure of the c++ codes (Advanced users):
The three c++ programs included are based on a
common
set of classes. Of these, some of the more important are:
- Exchange: By analogy to a telephone exchange, this
class is
the central repository of global parameters that all of the other
classes
need to know about. It mostly consists of public variables and
interface
functions to those variables. Improper settings in this class can
give strange results in programs. For instance, it is important
to
set to model of evolution in the exchange, as certain Boolean flags
must
be set correctly for codon-based models to work.
- Sequence_dataset: A simple data structure class
that
holds
protein or nucleotide sequences. It is primarily intended for
aligned
sequences (so that each sequence is of the same length).
- Read_Sequence: This is actually the base class for a
group
of classes
designed to read various sequence file formats. Currently
supported
formats are:
- PIR
- FASTA
- PAUP NEXUS
- PHYLIP (Interleaved only)
Support for these formats may not be complete. - Like_model:
This is the base class for the
likelihood calculations.
Models of evolution are implemented as derived classes. All
models are multiply inherited, with one side of the inheritance
determining
whether the model is nucleotide based or codon-based and the other side
describing the model of nucleotide substitution (for instance, Kimura
2-parameter).
- Tree: This class implements a bifurcating tree
data
structure
used for phylogenetic trees. In GenomeHistory, the only tree used
is a degenerate two-tip tree.
- Genetic_Code: This is the base class that
implements
different
genetic codes. To be compatible with some of my earlier codes,
the
base class Genetic_code functions equivalently to the derived
Univ_genetic_code