Assignment 4: Simplistic phylogenomics

Computing phylogenies based on whole genomes is hard for several reasons. Do you work with genes only? If so, how do you pick the genes? In general, you cannot compare whole genomes in a straightforward way (e.g., by alignment) because the differences between them are complex and large-scale.

A simplistic approach is to compare nucleotide compositions. Let πX be the nucleotide composition vector for species X, containing the frequencies of the four nucleotides in X’s genome in the order A, C, G, and T. If all nucleotides have the same frequency in X (which would be surprising), then πX = (0.25, 0.25, 0.25, 0.25).

If a genome Y is GC rich, then its composition vector might be πY = (0.2, 0.3, 0.3, 0.2).
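As an illustration, a composition vector could be computed along these lines. This is a minimal sketch, not a reference solution: the function name composition is hypothetical, and ignoring symbols other than A, C, G, and T (such as N) is an assumption that the assignment does not state.

```python
from collections import Counter


def composition(sequence):
    """Return the frequencies of A, C, G, and T, in that order.

    Assumption: symbols other than A, C, G, and T (for example N)
    are ignored, so the four frequencies always sum to 1.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return tuple(counts[base] / total for base in "ACGT")


# A made-up ten-base sequence with the same composition as genome Y.
print(composition("AACCCGGGTT"))
```

Counting with collections.Counter makes a single pass over the sequence, which matters for genome-sized input.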

Difference in composition

To use composition as a basis for distances, we need a measure of the difference between two composition vectors. In this assignment, we use the root-mean-square distance:

  • Take the elementwise differences.
  • Square the differences and add them up.
  • Divide the sum by 4 and take the square root.

For our X and Y genomes, this is

diff(X, Y) = √(0.25 * ((0.25 − 0.2)² + (0.25 − 0.3)² + (0.25 − 0.3)² + (0.25 − 0.2)²)) = 0.05
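The three steps translate almost directly into Python. The sketch below is one possible way to write it; the name rms_distance is hypothetical, and the two vectors are the example genomes X and Y from the text.

```python
import math


def rms_distance(pi_x, pi_y):
    """Root-mean-square distance between two composition vectors."""
    # Elementwise differences, squared.
    squared = [(x - y) ** 2 for x, y in zip(pi_x, pi_y)]
    # Mean of the squares (the sum divided by 4 for nucleotides), then the root.
    return math.sqrt(sum(squared) / len(squared))


pi_x = (0.25, 0.25, 0.25, 0.25)
pi_y = (0.2, 0.3, 0.3, 0.2)
print(round(rms_distance(pi_x, pi_y), 3))  # 0.05
```

The result is rounded before printing because floating-point arithmetic may leave the raw value a tiny amount away from 0.05.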

Assignment

Write a Python program that

  • takes as input a number of filenames, each naming a file that contains one genomic sequence in Fasta format, and
  • outputs a distance matrix, using the composition difference above, in Phylip matrix format.
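The input and output sides could be sketched as follows. Everything here rests on assumptions rather than on the assignment text: the function names read_fasta and format_phylip are hypothetical, the species name is taken to be the first word of the Fasta header line, and names are cut or padded to 10 characters because that is what the example session appears to show. The number formatting (always three decimals) is a simplification and differs slightly from the example output.

```python
import io


def read_fasta(handle):
    """Return (name, sequence) for the first record of a Fasta file."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                break  # a second record starts here; only the first is used
            fields = line[1:].split()
            name = fields[0] if fields else ""
        elif name is not None:
            chunks.append(line)
    if name is None:
        raise ValueError("no Fasta header line found")
    return name, "".join(chunks)


def format_phylip(names, matrix):
    """Format a square distance matrix in Phylip style."""
    lines = [f"{len(names):5d}"]  # first line: number of species
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # names occupy exactly 10 characters
        values = "  ".join(f"{d:.3f}" for d in row)
        lines.append(label + values)
    return "\n".join(lines)


# In-memory stand-in for a small Fasta file.
name, seq = read_fasta(io.StringIO(">X example genome\nACGT\nACGT\n"))
print(name, seq)
print(format_phylip(["X", "Y"], [[0.0, 0.05], [0.05, 0.0]]))
```

Passing an open file handle rather than a filename keeps read_fasta easy to try out with io.StringIO, as in the last lines.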

The following requirements must be fulfilled:

  • All filenames are read from the command line. Use the argparse module!
  • The program is a good Unix citizen.
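A command-line skeleton that meets both requirements might look like the sketch below. It reads "good Unix citizen" as: results go to standard output, error messages go to standard error, and the exit status is nonzero on failure. That reading is an assumption, as are the names build_parser, main, and fasta_files; the program name basedist is borrowed from the example session.

```python
import argparse
import sys


def build_parser():
    """Describe the command line: one or more Fasta filenames."""
    parser = argparse.ArgumentParser(
        prog="basedist",
        description="Compute a composition-based distance matrix.")
    parser.add_argument("fasta_files", nargs="+", metavar="FASTA",
                        help="file containing one genomic sequence")
    return parser


def main(argv=None):
    """Return 0 on success and 1 if an input file cannot be read."""
    args = build_parser().parse_args(argv)
    for filename in args.fasta_files:
        try:
            with open(filename) as handle:
                handle.readline()  # placeholder for the real parsing step
        except OSError as err:
            # Errors go to stderr so that stdout holds only the matrix.
            print(f"basedist: {err}", file=sys.stderr)
            return 1
    return 0


# In a script, finish with: sys.exit(main())
```

With nargs="+", argparse itself reports a usage error and exits with status 2 when no filename is given.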

Test data

The linked file contains “simple1.fa” and “simple2.fa” for genomes X and Y in the example, as well as a selection of “reduced” genomes (severely shortened).

Example session

Running the program should look close to this:

prompt> ./basedist human.fa mouse.fa fly.fa yeast.fa ecoli.fa plasmodium.fa thermus.fa 
   7
Hsapiens  0.0    0.009  0.014  0.041  0.042  0.073  0.126
Mus_muscul0.009  0.0    0.017  0.044  0.039  0.079  0.126
Fly       0.014  0.017  0.0    0.054  0.031  0.081  0.112
Yeast     0.041  0.044  0.054  0.0    0.083  0.048  0.166
Ecoli     0.042  0.039  0.031  0.083  0.0    0.111  0.088
Plasmodium0.073  0.079  0.081  0.048  0.111  0.0    0.182
Thermus_th0.126  0.126  0.112  0.166  0.088  0.182  0.0