GenomeHistory Documentation

Installation:

    See the INSTALL file included with GenomeHistory for installation instructions.

Citing GenomeHistory:

Conant, G. C. and Wagner, A. (2002) GenomeHistory: A software tool and its application to fully sequenced genomes, Nucleic Acids Research, 30(15): 3378-3386 [Reprint]

Running GenomeHistory:

 GenomeHistory can be invoked either with command-line options or through the use of a command file.  An example command file is included.  The command-line options can be obtained by typing % GenomeHistory.pl with no options.  Possible options are:
 

Long Option

 

Short Option

Example

Default

Protein sequences pi =<protein sequences file> orf_trans.fasta
ORF sequences ni =<nucleotide sequences file> orf_coding.fasta
Output file o =<output file> genome_ks.txt
Error file e =<error file>  genome_ks.err
Gene file gf =<Gene name file> NONE
Gene Name Field gn =<Gene name field #> 2
BLAST matrix location bl = /usr/local/wublast/BLOSUM62 NONE
Only Top BLAST match tb =YES/NO NO
BLAST threshold bt =0.001 0.01
Minimum ORF Translation length mo =55 40
Minimum Number Aligned Residues ma =45 40
Percent identity threshold pt =0.70 0.50
Warnings w =ON/OFF OFF
Genetic code gc =<U/VM/YM/IM/MM/CN/EM> U (Universal)
Ordered Gene List og =YES/NO NO
Base Frequencies bf =<CODON/OBSERVED/EQUAL> CODON
Checkpoint cp =ON/OFF OFF

Any combinations of options may be used.  When used on the command line, only the short option format is allowed.  The command file interprets the same set of options, but allows either format.  Note that in both cases the equal (=) sign is required.  Options are not case-sensitive (but filenames may be).
 

Example command-line invocation:

 % GenomeHistory.pl pa=my_genome.fasta na=my_genome_nuc.fasta bl=/usr/local/wublast/BLOSUM62
 

Example command file invocation:

% GenomeHistory.pl genome_analysis.txt
 

Generating the example dataset:

GenomeHistory includes a small example dataset (a portion of the yeast genome) that can be used to test your installation of GenomeHistory.  From the GenomeHistory directory, type (% is your prompt)
% GenomeHistory test_analysis.txt
This will create two new files, test_ks.txt and test_ks.err.  These files should be identical to the example files provided, example_ks.txt and example_ks.err.  You can compare them using the UNIX "diff"command:
% diff test_ks.txt example_ks.txt
% diff test_ks.err example_ks.err

In theory, "diff" should not print anything. However, if you are using a different version of clustalw (not 1.82), there may be (mostly minor) differences.

Notes and Tips:


Locating nucleotide sequences corresponding to protein sequences:

One of the chief difficulties with the sort of genome analysis performed by GenomeHistory is matching nucleotide sequences to protein sequences.  By default, GenomeHistory does this using keywords in the FASTA header files (see the Gene Name Field option). Unfortuntately, some genomes do not provide identifiers that allow one to one matching between protein sequences and their corresponding nucleotide sequences.   This problem can be overcome with a second matching method if the genome's ORFs and protein sequences are listed in a one to one order (meaning protein sequence number 1 is the translation of nucleotide sequence number 1 etc.).    This second method  is invoked with the Ordered gene list = yes option and simply uses the order that genes are listed in the two files to identify nucleotide sequences.  Microbial genomes seem particularly suited for use with this option, as alternative splicing and gene identification are less of a problem for these genomes.

Using the Genename field to locate nucleotide sequences:

If you are experiencing numerous errors where GenomeHistory is not able to find nucleotide sequences for genes, you may want to experiment with the "Gene Name Field =" option.  Sequences in the nucleotide files may have slightly different names than those in the protein sequence file, but the unquie identifiers should match.   For instance, the yeast genome stores unique names in field #2, while Arabidopsis uses field 3.  Note that this option will only be effective if there is a unquie identifier common to both protein and nucleotide sequences (for instance the YXX### format yeast gene identifiers).  If you cannot find such an identifier, you may want to experiment with the Ordered gene list = yes option described above.

Runtimes:

The running time of GenomeHistory is in large part dictated by the number of BLAST hits retained for each gene.  In general, the lower (smaller E-value) the BLAST threshold is to, the fewer gene pairs will be analyzed and the faster the analysis will be.  The trade-off, of course, is the potential to miss significant hits.  We have found values around E < 0.00001 to be quite conservative for the yeast genome.

BLAST in non-default locations:

Washington University BLAST expects to be installed at /usr/ncbi/blast.  If this is not where you have installed BLAST, you will need to use the "BLAST matrix location" option to locate your installation.  For instance bl=/usr/local/wublast/BLOSUM62

Genetic Codes:

The Genetic code option takes on the following possible values:

Using the "Only Top BLAST match=YES" option:

When this option is used, Ks and Ka is calculated for each gene and its top BLAST hit, no matter what the E-value of that hit.  Thus, the "BLAST threshold" option is meaningless in this context.

Repeated analysis of gene pairs:

Gene pairs could potentially represented (and analyzed) in two ways (Gene 1, Gene 2) and (Gene 2, Gene 1).  Under normal circumstances, GenomeHistory will only analyze one of these two combinations (the first one that occurs).  However if the option "Only Top BLAST match = YES" is set, or if the option "Gene file = <filename>" is used, both comparisions may be made, since these two options may be used in analyses where it is not desirable to exclude reciprocal comparisions.
 

Identical Sequences:

Each sequence pair for which Ks/Ka are calculated is first checked to be sure that the two sequences are not identical.  If they are, the word IDENTICAL is entered in the output rather than values.

Saturation in Ks values:

If the divergence time between two gene pairs is sufficiently long, there may no longer be sufficient information in the two sequences to estimate Ks.  Such sequences are said to be saturated.  Unfortunately, numerical algorithms for estimating Ks under likelihood cannot directly determine that sequences are saturated.  Instead we have implemented a simple test:  After the likelihood maximization is complete, the Ks value for the pair is multiplied by 10 and the likelihood is recalculated.  If this likelihood is as high or higher than the original likelihood, that indicates that the sequence pair has saturated Ks, and that Ks value in the output is marked with an *.  Of course, this test does not address the problem of the large variances associated with high, but unsaturated, Ks values.  Researchers will want to remember this issue when selecting duplicate pairs for analyses.  A forthcoming manscript will discuss this issue in more detail (Hahn, Conant and Wagner, Journal of Molecular Evolution, in press)

Checkpointing:

 When the "Checkpoint" option is ON, GenomeHistory will look for a file called .GenomeHistory.chkpnt.(argfile) (where argfile is the command file) or .GenomeHistory.chkpnt.cmdline (if invoking from the command line).  If this file is found, GenomeHistory will restart from the point listed in this file.  This option is useful for stopping jobs and restarting them at a later time.    For checkpointing to work properly, you must specify the same output and error filenames on the restart as were originally used.
Note that the checkpoint file is created whether or not Checkpointing is turned on.

File names:

 You can use any combination of absolute and relative paths when specifying filenames to GenomeHistory.

Warnings option:

By default, any gene pair that do not meet the three analysis criteria after pairwise alignment are discarded without further warnings.  By specifying "Warnings=on", you can output WARNING entries in the error file that indicate why a particular pair was discarded.  This option, however, has the potential to make the error file very large.

Base Frequencies:

By default, GenomeHistory uses the observed base frequencies at each codon position in the calculation of Ks and Ka (in other words, the nucleotide composition at each codon position is averaged between the two sequences).  In some cases, it may be preferable to use the over-all average base composition (OBSERVED) which does not break down the composition by codon position.  It is also possible to use equal frequencies for all bases (EQUAL).

Gaps:

As written, GenomeHistory removes all gap positions in alignments before calculating Ks and Ka.   This is done because gap characters can tend to downwardly bias estimates of sequence diveragance (Yang, Z., 2000, Mol. Bio. Evol. 51:423-432).  It is possible to disable this feature of GenomeHistory by editting the GenomeHistory.pl script and removing the "-nogap" option from the calc_percent_dist and like_pair_dist program options.
 

Alignment Matrices:

By default, both BLAST and Clustalw use the BLOSUM62 amino acid similarity matrix for sequence alignments.  The BLAST matrix can be changed with the "BLAST matrix location= " option, provided you have another alignment matrix in the appropriate format.  To use a different matrix in clustalw, you will need to edit GenomeHistory.pl to find the line that reads:

 $output=`nice clustalw $tempseqfile /output=pir`;

Change this to read
$output=`nice clustalw $tempseqfile /output=pir /PWMATRIX=<NAME>`;

where <NAME> is a matrix file or one of the default clustal matrices.  For further details on clustal options, see http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/help9.html.
 

Comments on basic installation process:


To build and use the package, you will need a c and a c++ compiler (preferably the GNu gcc and g++),
the Washington University implementation of BLAST, and clustalw.  The make command will build liblapack
(a partial lapack library), libf2c (because lapack was originally built in Fortran), liblikelihood and
libdnafuncs (two of my libraries that have many generic functions), like_pair_dist (which calculates Ks/Ka),
aligndna_new (which creates a nucleotide alignment from a protein alignment), and calc_percent_diff (which
calculates the percentage difference for all pairs of sequences in an alignment).
 

You will need to add the GenomeHistory directory to your path.  You can then run the program
by typing GenomeHistory.pl <filename> where <filename> is a input file with the parameters for the run.
An example input file with descriptions of each command is included at GenomeHistory/test_analysis.txt
 

Other platforms:

This package was developed and tested on a RedHat 7.1 Linux machines with the gcc compilers.  It should work on
other platforms, but has not been tested on them.  Use configure.pl or edit the file GenomeHistory/make.inc to
change the name of your c and c++ compiler.

Bug History:

7-29-02:  Fixed a bug that caused incorrect nucleotide sequences to be returned in genomes (such as Drosophila) where gene names can be substrings of each other (e.g. CT1187 and CT11871).  Pairs with this problem were still analyzed correctly, but also generated an error message.

10-27-03: Fixed a bug that resulted in some gene pairs where Ks = 0 (reported as Ks <10-4) being reported as having saturated in synonymous substitutions.

01-07-04: Fixed a bug preventing the "Ordered Gene list=YES" option from working properly when a "Gene File" was provided.

10-08-04: Fixed a bug which caused GenomeHistory to fail to create nucleotide alignments for calculating Ks and Ka when compiled with gcc 3.0.

10-17-04: Fixed a bug in the GenomeHistory perl script which prevented the analysis of gene pairs with E values of 0.0 when using some versions of WU-BLAST. Special thanks to Luke Hakes and Ignazio Carbone for help finding these last two bugs.

Included c++ packages:

The actual computational work of GenomeHistory is performed by traditional c++ executables.  In addition to BLAST and Clustalw, this includes 3 programs we have written:

Structure of the c++ codes (Advanced users):

    The three c++ programs included are based on a common set of classes.  Of these, some of the more important are: