GenomeHistory Documentation

Installation:

See the INSTALL file included with GenomeHistory for installation instructions.

Citing GenomeHistory:

Conant, G. C. and Wagner, A. (2002) GenomeHistory: A software tool and its application to fully sequenced genomes, Nucleic Acids Research, 30(15): 3378-3386 [Reprint]

Running GenomeHistory:

GenomeHistory can be invoked either with command-line options or through the use of a command file. An example command file is included. The command-line options can be obtained by typing % GenomeHistory.pl with no options. Possible options are:

Long Option	Short Option	Example	Default
Protein sequences	pi	`=<protein sequences file>`	`orf_trans.fasta`
ORF sequences	ni	`=<nucleotide sequences file>`	`orf_coding.fasta`
Output file	o	`=<output file>`	`genome_ks.txt`
Error file	e	`=<error file>`	`genome_ks.err`
Gene file	gf	`=<Gene name file>`	`NONE`
Gene Name Field	gn	`=<Gene name field #>`	`2`
BLAST matrix location	bl	= `/usr/local/wublast/BLOSUM62`	`NONE`
Only Top BLAST match	tb	`=YES/NO`	`NO`
BLAST threshold	bt	`=0.001`	`0.01`
Minimum ORF Translation length	mo	`=55`	`40`
Minimum Number Aligned Residues	ma	`=45`	`40`
Percent identity threshold	pt	`=0.70`	`0.50`
Warnings	w	`=ON/OFF`	`OFF`
Genetic code	gc	`=<U/VM/YM/IM/MM/CN/EM>`	`U (Universal)`
Ordered Gene List	og	`=YES/NO`	`NO`
Base Frequencies	bf	`=<CODON/OBSERVED/EQUAL>`	`CODON`
Checkpoint	cp	`=ON/OFF`	`OFF`

Any combinations of options may be used. When used on the command line, only the short option format is allowed. The command file interprets the same set of options, but allows either format. Note that in both cases the equal (=) sign is required. Options are not case-sensitive (but filenames may be).

Example command-line invocation:

% GenomeHistory.pl pa=my_genome.fasta na=my_genome_nuc.fasta bl=/usr/local/wublast/BLOSUM62

Example command file invocation:

% GenomeHistory.pl genome_analysis.txt

Generating the example dataset:

GenomeHistory includes a small example dataset (a portion of the yeast genome) that can be used to test your installation of GenomeHistory. From the GenomeHistory directory, type (% is your prompt)
% GenomeHistory test_analysis.txt
This will create two new files, test_ks.txt and test_ks.err. These files should be identical to the example files provided, example_ks.txt and example_ks.err. You can compare them using the UNIX "diff"command:
% diff test_ks.txt example_ks.txt
% diff test_ks.err example_ks.err

In theory, "diff" should not print anything. However, if you are using a different version of clustalw (not 1.82), there may be (mostly minor) differences.

Notes and Tips:

Locating nucleotide sequences corresponding to protein sequences:

One of the chief difficulties with the sort of genome analysis performed by GenomeHistory is matching nucleotide sequences to protein sequences. By default, GenomeHistory does this using keywords in the FASTA header files (see the Gene Name Field option). Unfortuntately, some genomes do not provide identifiers that allow one to one matching between protein sequences and their corresponding nucleotide sequences. This problem can be overcome with a second matching method if the genome's ORFs and protein sequences are listed in a one to one order (meaning protein sequence number 1 is the translation of nucleotide sequence number 1 etc.). This second method is invoked with the Ordered gene list = yes option and simply uses the order that genes are listed in the two files to identify nucleotide sequences. Microbial genomes seem particularly suited for use with this option, as alternative splicing and gene identification are less of a problem for these genomes.

Using the Genename field to locate nucleotide sequences:

If you are experiencing numerous errors where GenomeHistory is not able to find nucleotide sequences for genes, you may want to experiment with the "Gene Name Field =" option. Sequences in the nucleotide files may have slightly different names than those in the protein sequence file, but the unquie identifiers should match. For instance, the yeast genome stores unique names in field #2, while Arabidopsis uses field 3. Note that this option will only be effective if there is a unquie identifier common to both protein and nucleotide sequences (for instance the YXX### format yeast gene identifiers). If you cannot find such an identifier, you may want to experiment with the Ordered gene list = yes option described above.

Runtimes:

The running time of GenomeHistory is in large part dictated by the number of BLAST hits retained for each gene. In general, the lower (smaller E-value) the BLAST threshold is to, the fewer gene pairs will be analyzed and the faster the analysis will be. The trade-off, of course, is the potential to miss significant hits. We have found values around E < 0.00001 to be quite conservative for the yeast genome.

BLAST in non-default locations:

Washington University BLAST expects to be installed at /usr/ncbi/blast. If this is not where you have installed BLAST, you will need to use the "BLAST matrix location" option to locate your installation. For instance bl=/usr/local/wublast/BLOSUM62

Genetic Codes:

The Genetic code option takes on the following possible values:

U: Universal
VM: Vertebrate mitochondrial
YM: Yeast mitochondrial
MM: Mold mitochondrial
IM: Invertebrate mitochondrial
CN: ciliate nuclear
EM: Echinoderm mitochondrial.

Using the "Only Top BLAST match=YES" option:

When this option is used, K_s and K_a is calculated for each gene and its top BLAST hit, no matter what the E-value of that hit. Thus, the "BLAST threshold" option is meaningless in this context.

Repeated analysis of gene pairs:

Gene pairs could potentially represented (and analyzed) in two ways (Gene 1, Gene 2) and (Gene 2, Gene 1). Under normal circumstances, GenomeHistory will only analyze one of these two combinations (the first one that occurs). However if the option "Only Top BLAST match = YES" is set, or if the option "Gene file = <filename>" is used, both comparisions may be made, since these two options may be used in analyses where it is not desirable to exclude reciprocal comparisions.

Identical Sequences:

Each sequence pair for which K_s/K_a are calculated is first checked to be sure that the two sequences are not identical. If they are, the word IDENTICAL is entered in the output rather than values.

Saturation in K_s values:

If the divergence time between two gene pairs is sufficiently long, there may no longer be sufficient information in the two sequences to estimate K_s. Such sequences are said to be saturated. Unfortunately, numerical algorithms for estimating K_s under likelihood cannot directly determine that sequences are saturated. Instead we have implemented a simple test: After the likelihood maximization is complete, the K_svalue for the pair is multiplied by 10 and the likelihood is recalculated. If this likelihood is as high or higher than the original likelihood, that indicates that the sequence pair has saturated K_s, and that K_s value in the output is marked with an *. Of course, this test does not address the problem of the large variances associated with high, but unsaturated, K_s values. Researchers will want to remember this issue when selecting duplicate pairs for analyses. A forthcoming manscript will discuss this issue in more detail (Hahn, Conant and Wagner, Journal of Molecular Evolution, in press)

Checkpointing:

When the "Checkpoint" option is ON, GenomeHistory will look for a file called .GenomeHistory.chkpnt.(argfile) (where argfile is the command file) or .GenomeHistory.chkpnt.cmdline (if invoking from the command line). If this file is found, GenomeHistory will restart from the point listed in this file. This option is useful for stopping jobs and restarting them at a later time. For checkpointing to work properly, you must specify the same output and error filenames on the restart as were originally used.
Note that the checkpoint file is created whether or not Checkpointing is turned on.

File names:

You can use any combination of absolute and relative paths when specifying filenames to GenomeHistory.

Warnings option:

By default, any gene pair that do not meet the three analysis criteria after pairwise alignment are discarded without further warnings. By specifying "Warnings=on", you can output WARNING entries in the error file that indicate why a particular pair was discarded. This option, however, has the potential to make the error file very large.

Base Frequencies:

By default, GenomeHistory uses the observed base frequencies at each codon position in the calculation of Ks and Ka (in other words, the nucleotide composition at each codon position is averaged between the two sequences). In some cases, it may be preferable to use the over-all average base composition (OBSERVED) which does not break down the composition by codon position. It is also possible to use equal frequencies for all bases (EQUAL).

Gaps:

As written, GenomeHistory removes all gap positions in alignments before calculating K_s and K_a. This is done because gap characters can tend to downwardly bias estimates of sequence diveragance (Yang, Z., 2000, Mol. Bio. Evol. 51:423-432). It is possible to disable this feature of GenomeHistory by editting the GenomeHistory.pl script and removing the "-nogap" option from the calc_percent_dist and like_pair_dist program options.

Alignment Matrices:

By default, both BLAST and Clustalw use the BLOSUM62 amino acid similarity matrix for sequence alignments. The BLAST matrix can be changed with the "BLAST matrix location= " option, provided you have another alignment matrix in the appropriate format. To use a different matrix in clustalw, you will need to edit GenomeHistory.pl to find the line that reads:

$output=`nice clustalw $tempseqfile /output=pir`;

Change this to read
$output=`nice clustalw $tempseqfile /output=pir /PWMATRIX=<NAME>`;

where <NAME> is a matrix file or one of the default clustal matrices. For further details on clustal options, see http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/help9.html.

Comments on basic installation process:

To build and use the package, you will need a c and a c++ compiler (preferably the GNu gcc and g++),
the Washington University implementation of BLAST, and clustalw. The make command will build liblapack
(a partial lapack library), libf2c (because lapack was originally built in Fortran), liblikelihood and
libdnafuncs (two of my libraries that have many generic functions), like_pair_dist (which calculates Ks/Ka),
aligndna_new (which creates a nucleotide alignment from a protein alignment), and calc_percent_diff (which
calculates the percentage difference for all pairs of sequences in an alignment).

You will need to add the GenomeHistory directory to your path. You can then run the program
by typing GenomeHistory.pl <filename> where <filename> is a input file with the parameters for the run.
An example input file with descriptions of each command is included at GenomeHistory/test_analysis.txt

Other platforms:

This package was developed and tested on a RedHat 7.1 Linux machines with the gcc compilers. It should work on
other platforms, but has not been tested on them. Use configure.pl or edit the file GenomeHistory/make.inc to
change the name of your c and c++ compiler.

Bug History:

7-29-02: Fixed a bug that caused incorrect nucleotide sequences to be returned in genomes (such as Drosophila) where gene names can be substrings of each other (e.g. CT1187 and CT11871). Pairs with this problem were still analyzed correctly, but also generated an error message.

10-27-03: Fixed a bug that resulted in some gene pairs where K_s = 0 (reported as K_s <10^-4) being reported as having saturated in synonymous substitutions.

01-07-04: Fixed a bug preventing the "Ordered Gene list=YES" option from working properly when a "Gene File" was provided.

10-08-04: Fixed a bug which caused GenomeHistory to fail to create nucleotide alignments for calculating K_s and K_a when compiled with gcc 3.0.

10-17-04: Fixed a bug in the GenomeHistory perl script which prevented the analysis of gene pairs with E values of 0.0 when using some versions of WU-BLAST. Special thanks to Luke Hakes and Ignazio Carbone for help finding these last two bugs.

Included c++ packages:

The actual computational work of GenomeHistory is performed by traditional c++ executables. In addition to BLAST and Clustalw, this includes 3 programs we have written:

calc_percent_dist: This program takes a sequence alignment (protein or nucleotide) and returns the pairwise percent similarity for all pairs of sequences in the alignment. The "-nogap" option causes it to calculate similarity based only on non-gap positions
algndna_new: This is a complex program for deducing nucleotide sequence alignments from protein sequence alignments. We use only a portion of it here: it also has the ablity to parse GenBank files to locate introns and exons from protein-coding genes
like_pair_dist: The "meat" of GenomeHistory: this program uses numerical optimization to find maximum likelihood estimates of K_s and K_a.

Structure of the c++ codes (Advanced users):

The three c++ programs included are based on a common set of classes. Of these, some of the more important are:

Exchange: By analogy to a telephone exchange, this class is the central repository of global parameters that all of the other classes need to know about. It mostly consists of public variables and interface functions to those variables. Improper settings in this class can give strange results in programs. For instance, it is important to set to model of evolution in the exchange, as certain Boolean flags must be set correctly for codon-based models to work.
Sequence_dataset: A simple data structure class that holds protein or nucleotide sequences. It is primarily intended for aligned sequences (so that each sequence is of the same length).
Read_Sequence: This is actually the base class for a group of classes designed to read various sequence file formats. Currently supported formats are:

PIR
FASTA
PAUP NEXUS
PHYLIP (Interleaved only)

Like_model: This is the base class for the likelihood calculations. Models of evolution are implemented as derived classes. All models are multiply inherited, with one side of the inheritance determining whether the model is nucleotide based or codon-based and the other side describing the model of nucleotide substitution (for instance, Kimura 2-parameter).
Tree: This class implements a bifurcating tree data structure used for phylogenetic trees. In GenomeHistory, the only tree used is a degenerate two-tip tree.
Genetic_Code: This is the base class that implements different genetic codes. To be compatible with some of my earlier codes, the base class Genetic_code functions equivalently to the derived Univ_genetic_code