Alignment trimming

A common workflow in phylogenetics is as follows:

  • Compute a multiple-sequence alignment (MSA).
  • Trim the MSA.
  • Infer a phylogeny from the trimmed MSA.
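
The last two steps of this workflow can be sketched in Python as follows. This is a minimal sketch, assuming that trimAl and FastTree are installed and available on the PATH as `trimal` and `FastTree`. The `-automated1` trimming mode, the output file names, and the helper names `trim_command`, `tree_command`, and `trim_and_infer` are illustrative choices, not something prescribed by the project. The alignment step is left out because the command-line syntax differs between MUSCLE versions.

```python
import subprocess
from pathlib import Path


def trim_command(msa, trimmed):
    """Build a trimAl command line (assumes the `trimal` executable is on PATH)."""
    return ["trimal", "-in", str(msa), "-out", str(trimmed), "-automated1"]


def tree_command(msa):
    """Build a FastTree command line; FastTree writes the Newick tree to stdout."""
    return ["FastTree", str(msa)]


def trim_and_infer(msa, workdir):
    """Trim one alignment, infer a tree from it, and return the tree file path."""
    workdir = Path(workdir)
    stem = Path(msa).stem
    trimmed = workdir / (stem + ".trimmed.fasta")   # hypothetical naming scheme
    tree_file = workdir / (stem + ".trimmed.nwk")   # hypothetical naming scheme

    # Step 2: trim the alignment.
    subprocess.run(trim_command(msa, trimmed), check=True)

    # Step 3: infer a phylogeny from the trimmed alignment.
    result = subprocess.run(
        tree_command(trimmed), check=True, capture_output=True, text=True
    )
    tree_file.write_text(result.stdout)
    return tree_file
```

Running the inference command directly on the untrimmed alignment gives the baseline tree that the trimmed result is compared against.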

The idea behind trimming is to remove noisy regions. Homologous proteins can contain regions that are not inherited from a common ancestor and should therefore not be aligned, and other regions may have evolved so fast that the correct multiple alignment is impossible to infer. To get rid of such problematic regions in subsequent analysis, in particular phylogeny inference, “bad” columns are often removed using tools such as Gblocks or trimAl.
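
To make the notion of removing “bad” columns concrete, here is a toy sketch that drops every column whose gap fraction exceeds a threshold. It is not the algorithm used by Gblocks or trimAl, which apply more elaborate criteria; the function name `trim_gappy_columns`, the 0.5 threshold, and the example sequences are made up for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns in which more than max_gap_fraction of rows are gaps.

    alignment maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Indices of the columns that pass the gap filter.
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


toy = {"a": "MK-LV", "b": "MKALV", "c": "M--L-"}
print(trim_gappy_columns(toy))
# The third column is a gap in two of three rows, so it is removed.
```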

In this project, you will investigate the value of alignment trimming.

Evaluation

The question you will try to answer is whether inferred phylogenies are closer to the true tree after alignment trimming. You can choose the software for phylogenetic inference yourself, for example RAxML, FastTree, or PhyML.
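
One way to approach this question is a paired comparison: for every alignment, measure the distance from the reference tree to the tree inferred without trimming and to the tree inferred with trimming, then summarize how often trimming helped. The sketch below assumes such distances have already been computed; the function name `summarize_paired` and the example numbers are hypothetical and do not come from the data set.

```python
from statistics import mean


def summarize_paired(untrimmed, trimmed):
    """Summarize paired tree distances (one pair per alignment).

    Smaller distances mean the inferred tree is closer to the reference tree.
    """
    if len(untrimmed) != len(trimmed):
        raise ValueError("need one distance per alignment in both lists")
    diffs = [t - u for u, t in zip(untrimmed, trimmed)]
    n = len(diffs)
    return {
        "closer_after_trimming": sum(d < 0 for d in diffs) / n,
        "unchanged": sum(d == 0 for d in diffs) / n,
        "farther_after_trimming": sum(d > 0 for d in diffs) / n,
        "mean_change": mean(diffs),
    }


# Made-up distances for four alignments, purely for illustration.
print(summarize_paired([4, 2, 6, 0], [2, 2, 8, 0]))
```

Keeping the comparison paired matters because alignments differ in difficulty, so the per-alignment difference is more informative than two separate averages.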

Data

Download this linked data set to get data to work on.

The test data contains six subdirectories:

$ ls
asymmetric_0.5  asymmetric_1.0  asymmetric_2.0  
symmetric_0.5  symmetric_1.0  symmetric_2.0

Each such directory contains one tree file, holding a reference tree, and 300 alignments. The alignments were created by simulating sequence evolution along the reference tree and then aligning the resulting sequences with MUSCLE. The trees are either symmetric (“easy”) or asymmetric (“less easy”), and the number in the directory name denotes the average number of mutations per site in the sequences. Hence, the *_2.0 directories have alignments with, on average, two mutations per sequence position from the root to the leaves, and this is of course the hardest case.
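
Since the directory names encode both the tree shape and the mutation rate, results can be grouped by these two factors. The following sketch parses the names shown in the listing above; the helper names `parse_condition` and `list_conditions` are hypothetical, and nothing is assumed about the names of the files inside each directory.

```python
from pathlib import Path


def parse_condition(dirname):
    """Split a name such as 'asymmetric_0.5' into ('asymmetric', 0.5)."""
    shape, rate = dirname.rsplit("_", 1)
    return shape, float(rate)


def list_conditions(root):
    """Return sorted (shape, rate, path) tuples for the subdirectories of root."""
    conditions = []
    for path in Path(root).iterdir():
        if not path.is_dir():
            continue
        try:
            shape, rate = parse_condition(path.name)
        except ValueError:
            continue  # skip directories that do not follow the shape_rate pattern
        conditions.append((shape, rate, path))
    return sorted(conditions)
```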

The benchmarking

To determine how accurate an inferred tree is, compare it to the reference tree. A natural measure for that comparison is the Robinson-Foulds distance (or symmetric distance, as some prefer to call it). Methods for computing it are available in the Python library DendroPy.
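
To show what the Robinson-Foulds distance measures, here is a small pure-Python sketch that counts the bipartitions (splits of the leaf set induced by internal branches) present in one tree but not the other. It assumes unrooted comparison, identical leaf sets in both trees, and simple Newick strings without quoted labels; the function names are illustrative. For the actual project, a library routine such as `dendropy.calculate.treecompare.symmetric_difference`, applied to trees loaded with a shared taxon namespace, is likely the more robust choice.

```python
import re


def clade_leaf_sets(newick):
    """Return (all leaves, list of leaf sets, one per parenthesized clade)."""
    # Drop branch lengths and the trailing semicolon, then tokenize.
    text = re.sub(r":[^,();]+", "", newick.strip().rstrip(";"))
    tokens = re.findall(r"[(),]|[^(),\s]+", text)
    stack, clades, prev = [set()], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            stack[-1] |= done
        elif tok != "," and prev != ")":
            # A name directly after ")" is an internal label, not a leaf.
            stack[-1].add(tok)
        prev = tok
    return frozenset(stack[0]), clades


def splits(newick):
    """Non-trivial bipartitions of the leaf set induced by the tree."""
    leaves, clades = clade_leaf_sets(newick)
    return {
        frozenset((c, leaves - c))
        for c in clades
        if 2 <= len(c) <= len(leaves) - 2
    }


def robinson_foulds(newick_a, newick_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(newick_a) ^ splits(newick_b))


print(robinson_foulds("((A,B),(C,D));", "((A,C),(B,D));"))  # → 2
```

A distance of 0 means the two trees have the same unrooted topology; branch lengths are ignored.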