Dynamic Programming

Global pairwise alignments

Implement a Needleman Wunch sequence alignment, i.e. the dynamic programming of the following matrix:

S(0,j) = j*g
S(i,0) = i*g
S(i,j) = max(S(i-1,j-1) + d(a(i),b(j)), S(i-1,j) + g, S(i,j-1) + g)

Here g is a gap penalty, a and b the sequences we are aligning and d a scoring matrix defining the score for two amino acids matching each other. For the exercise, use a PAM250 as a score matrix. You will also need to implement a tracing matrix, to recover the path taken by the max operator for each cell in the dynamic programming matrix. A good strategy is to store the relative coordinates in each step, i.e. (-1,-1), (-1,0) or (0,-1) for the three possible steps.

Viterbi decoding of HMMs

The example is taken (and slightly modified) from Durbin et al. 1998.

Here are the estimated probabilities of emitting a certain base (in columns) given the last observed base (in rows) in a DNA sequence in a CpG island.

Model + A C G T
A .179 .273 .425 .120 .003
C .171 .367 .273 .187 .003
G .160 .338 .374 .125 .003
T .079 .354 .383 .181 .003

Similarly, here are the estimated probabilities of emitting a certain base given the last observed base in a DNA sequence in a non-CpG island.

Model – A C G T +
A .299 .205 .285 .210 .001
C .321 .298 .078 .302 .001
G .248 .246 .297 .208 .001
T .177 .239 .292 .291 .001

The transition probabilities between CpG and non-CpG are given in the +/- column, which should be interpreted as the probability of transitioning to any of the four bases of in the other submodel.

Make a dynamic programming decoder that can calculate the most probable path taken by a generic DNA sequence through the model in order to localize any CpG-island. Try out your algorithm with this stretch of DNA in fasta-format.

Mass distribution of Proteins

Example taken from Rosalind.

Implement a function that can calculate the number of possible proteins (permutations of amino acids) with a certain integer mass (e.g. 5000 Da) using dynamic programming. For the purpose of the exercise, assume that the weights of the 20 standard amino acids in Dalton are:

aa = {'A':71,
'R': 156,
'N': 114,
'D': 115,
'C': 103,
'E': 129,
'Q': 128,
'G': 57,
'H': 137,
'I': 113,
'L': 113,
'K': 128,
'M': 131,
'F': 147,
'P': 97,
'S': 87,
'T': 101,
'W': 186,
'Y': 163,
'V': 99}

Tip: If you know the number of amino acid sequences with masses up to the current mass -1, you can calculate the number of amino acid sequences reaching the current mass by adding up all the sequences differating in mass by the mass of any of the amino acids.
(*) Which range of masses do you really need to keep in memory for performing this exercise? Could you implement the same function with a lower range in mind? Try to explore a function that can, even more quickly and memory efficiently, calculate the number of amino acid sequences of a certain mass.