In [ ]:
import math  # Just ignore this :-)

def log(x):
    if x == 0:
        return float('-inf')
    return math.log(x)

ML - Week 11 - Practical Exercises

In the exercise below, you will implement and experiment with various ways of training a HMM (i.e. estimating its parameters from data), and you will see an example of how to apply a HMM for identifying coding regions (genes) in genetic material.

1 - Training

You are given the same 7-state HMM and helper functions that you used last time:

In [ ]:
class hmm:
    def __init__(self, init_probs, trans_probs, emission_probs):
        self.init_probs = init_probs
        self.trans_probs = trans_probs
        self.emission_probs = emission_probs
In [ ]:
init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]

trans_probs_7_state = [
    [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],
]

emission_probs_7_state = [
    #   A     C     G     T
    [0.30, 0.25, 0.25, 0.20],
    [0.20, 0.35, 0.15, 0.30],
    [0.40, 0.15, 0.20, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.40, 0.30, 0.10],
    [0.30, 0.20, 0.30, 0.20],
    [0.15, 0.30, 0.20, 0.35],
]

hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)
In [ ]:
def translate_indices_to_observations(indices):
    mapping = ['a', 'c', 'g', 't']
    return ''.join(mapping[idx] for idx in indices)

def translate_path_to_indices(path):
    return [int(i) for i in path]

def translate_indices_to_path(indices):
    return ''.join([str(i) for i in indices])

Training by counting

Training a hidden Markov model is a matter of estimating the initial, transition, and emission probabilities. If we are given training data, i.e. a sequence of observations, ${\bf X}$, and a corresponding sequence of hidden states, ${\bf Z}$, we can do "training by counting" by counting the transitions and emissions observed in the training data, as explained in the lecture.

Given ${\bf X}$ and ${\bf Z}$ we would like to count the number of transitions from one state to another, and the number of times that symbol $k$ was observed while being in state $i$. That is, we want to construct a $K \times K$ matrix such that entry $i, j$ is the number of times that a transition from state $i$ to state $j$ is observed in the training data, and a $K \times D$ matrix where entry $i, k$ contains the number of times that symbol $k$ is observed in the training data while being in state $i$.

Implement this as the function below:

In [ ]:
def count_transitions_and_emissions(K, D, x, z):
    """
    Returns a KxK matrix and a KxD matrix containing counts cf. above
    """
    pass
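
A minimal sketch of one possible implementation is given below. It assumes that x and z are already sequences of observation and state indices; if they are strings like x_long and z_long below, translate them with the helper functions first. The name count_transitions_and_emissions_sketch is only used to keep it separate from your own implementation.

In [ ]:
def count_transitions_and_emissions_sketch(K, D, x, z):
    """A sketch: assumes x and z are index sequences of equal length."""
    trans_counts = [[0] * K for _ in range(K)]     # KxK transition counts
    emission_counts = [[0] * D for _ in range(K)]  # KxD emission counts
    for i in range(len(x)):
        emission_counts[z[i]][x[i]] += 1           # symbol x[i] seen in state z[i]
        if i > 0:
            trans_counts[z[i - 1]][z[i]] += 1      # transition z[i-1] -> z[i]
    return trans_counts, emission_counts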
In [ ]:
x_long = 'TGAGTATCACTTAGGTCTATGTCTAGTCGTCTTTCGTAATGTTTGGTCTTGTCACCAGTTATCCTATGGCGCTCCGAGTCTGGTTCTCGAAATAAGCATCCCCGCCCAAGTCATGCACCCGTTTGTGTTCTTCGCCGACTTGAGCGACTTAATGAGGATGCCACTCGTCACCATCTTGAACATGCCACCAACGAGGTTGCCGCCGTCCATTATAACTACAACCTAGACAATTTTCGCTTTAGGTCCATTCACTAGGCCGAAATCCGCTGGAGTAAGCACAAAGCTCGTATAGGCAAAACCGACTCCATGAGTCTGCCTCCCGACCATTCCCATCAAAATACGCTATCAATACTAAAAAAATGACGGTTCAGCCTCACCCGGATGCTCGAGACAGCACACGGACATGATAGCGAACGTGACCAGTGTAGTGGCCCAGGGGAACCGCCGCGCCATTTTGTTCATGGCCCCGCTGCCGAATATTTCGATCCCAGCTAGAGTAATGACCTGTAGCTTAAACCCACTTTTGGCCCAAACTAGAGCAACAATCGGAATGGCTGAAGTGAATGCCGGCATGCCCTCAGCTCTAAGCGCCTCGATCGCAGTAATGACCGTCTTAACATTAGCTCTCAACGCTATGCAGTGGCTTTGGTGTCGCTTACTACCAGTTCCGAACGTCTCGGGGGTCTTGATGCAGCGCACCACGATGCCAAGCCACGCTGAATCGGGCAGCCAGCAGGATCGTTACAGTCGAGCCCACGGCAATGCGAGCCGTCACGTTGCCGAATATGCACTGCGGGACTACGGACGCAGGGCCGCCAACCATCTGGTTGACGATAGCCAAACACGGTCCAGAGGTGCCCCATCTCGGTTATTTGGATCGTAATTTTTGTGAAGAACACTGCAAACGCAAGTGGCTTTCCAGACTTTACGACTATGTGCCATCATTTAAGGCTACGACCCGGCTTTTAAGACCCCCACCACTAAATAGAGGTACATCTGA'
z_long = '3333321021021021021021021021021021021021021021021021021021021021021021033333333334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210321021021021021021021021033334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563333333456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456332102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102103210210210210210210210210210210210210210210210210210210210210210'

Test your implementation of count_transitions_and_emissions on (prefixes of) x_long and z_long above in order to check that your implementation works as expected.

In [ ]:
# Your code here ...
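
For example, a test on a prefix could look like the sketch below. It assumes your function takes index sequences and that you have the translate_observations_to_indices helper from last week (it is also given in part 2 below).

In [ ]:
# A possible test on a 100-symbol prefix (assumes translate_observations_to_indices
# from last week and that count_transitions_and_emissions expects index sequences).
x_idx = translate_observations_to_indices(x_long[:100])
z_idx = translate_path_to_indices(z_long[:100])
trans_counts, emission_counts = count_transitions_and_emissions(7, 4, x_idx, z_idx)
print(trans_counts)
print(emission_counts)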

Use your implementation of count_transitions_and_emissions to implement a function training_by_counting that, given the number of hidden states, $K$, the number of observables, $D$, a sequence of observations, ${\bf X}$, and a corresponding sequence of hidden states, ${\bf Z}$, returns a HMM (as an instance of class hmm), where the transition, emission, and initial probabilities are set cf. training by counting on ${\bf X}$ and ${\bf Z}$.

In [ ]:
def training_by_counting(K, D, x, z):
    """
    Returns a HMM trained on x and z cf. training-by-counting.
    """
    pass
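
A minimal sketch of how the counts could be normalized into probabilities is given below (again assuming x and z are index sequences, and using your count_transitions_and_emissions from above).

In [ ]:
def training_by_counting_sketch(K, D, x, z):
    """A sketch: normalize the count matrices into probabilities."""
    trans_counts, emission_counts = count_transitions_and_emissions(K, D, x, z)
    # With a single training sequence, its first state gets initial probability 1.
    init_probs = [0.0] * K
    init_probs[z[0]] = 1.0
    # Normalize each row of the count matrices (rows with no counts stay all-zero).
    trans_probs = [[c / sum(row) if sum(row) > 0 else 0.0 for c in row]
                   for row in trans_counts]
    emission_probs = [[c / sum(row) if sum(row) > 0 else 0.0 for c in row]
                      for row in emission_counts]
    return hmm(init_probs, trans_probs, emission_probs)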

Consider a HMM trained on x_long and z_long:

In [ ]:
hmm_7_state_tbc = training_by_counting(7, 4, x_long, z_long)

How does this HMM (i.e. its transition, emission, and initial probabilities) compare to hmm_7_state as specified above?

You can e.g. try to perform a Viterbi decoding of x_long using the two HMMs and investigate if the decodings differ:

In [ ]:
# Your implementation of Viterbi (log transformed) from last week

def compute_w_log(model, x):
    pass

def opt_path_prob_log(w):
    pass

def backtrack_log(model, w):
    pass
In [ ]:
w = compute_w_log(hmm_7_state, x_long)
z_vit = backtrack_log(hmm_7_state, w)

w_tbc = compute_w_log(hmm_7_state_tbc, x_long)
z_vit_tbc = backtrack_log(hmm_7_state_tbc, w_tbc)

# Your comparison of z_vit and z_vit_tbc here ...
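# For example, one could simply count the positions where the two decodings
# differ (a sketch, assuming both are index sequences of equal length):
differences = sum(1 for a, b in zip(z_vit, z_vit_tbc) if a != b)
print(differences, 'of', len(z_vit), 'positions differ')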

Viterbi training

Use your implementation of the (log transformed) Viterbi algorithm (compute_w_log and backtrack_log) from last week, and your implementation of training_by_counting above, to implement Viterbi training as explained in class.

In the cell below, you should implement a function, viterbi_update_model, that given a HMM, model, and a sequence of observations, x, computes the Viterbi decoding of x, z_vit, and returns an updated model obtained by doing training by counting on x and z_vit. I.e. a function that corresponds to one round of Viterbi training.

In [ ]:
def viterbi_update_model(model, x):
    """
    return a new model that corresponds to one round of Viterbi training, 
    i.e. a model where the parameters reflect training by counting on x 
    and z_vit, where z_vit is the Viterbi decoding of x under the given 
    model.
    """
    
    # Your code here ...
    
    return new_model
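
A minimal sketch of one round of Viterbi training is given below. It assumes your compute_w_log, backtrack_log, and training_by_counting from above, and that x is (or has been translated to) a sequence of observation indices.

In [ ]:
def viterbi_update_model_sketch(model, x):
    """A sketch: decode x under model, then retrain by counting on the decoding."""
    K = len(model.init_probs)
    D = len(model.emission_probs[0])
    w = compute_w_log(model, x)
    z_vit = backtrack_log(model, w)
    return training_by_counting(K, D, x, z_vit)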

Use your implementation of viterbi_update_model to implement Viterbi training, i.e. continue updating the model until it does not change, or until a stopping criterion of your choice is met.

Make an experiment where you perform Viterbi training on our example 7-state HMM using x_long, starting both from the given model, hmm_7_state, and from a random model. Examine how the parameters evolve during the iterations. How do the resulting parameters compare to the parameters obtained by training by counting from x_long and z_long above?

In [ ]:
# Your code here ...
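
The training loop could look roughly like the sketch below: keep updating the model and stop when the Viterbi decoding no longer changes (one simple stopping criterion), or after a fixed number of iterations. It assumes your viterbi_update_model, compute_w_log, and backtrack_log from above.

In [ ]:
def viterbi_training_sketch(model, x, max_iter=50):
    """A sketch of Viterbi training: iterate until the decoding stops changing."""
    prev_z = None
    for _ in range(max_iter):
        model = viterbi_update_model(model, x)
        w = compute_w_log(model, x)
        z = backtrack_log(model, w)
        if z == prev_z:  # unchanged decoding => the next update gives the same model
            break
        prev_z = z
    return model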

Baum-Welch training

If you have time, you can experiment as above but with Baum-Welch (EM) training. This of course requires a working implementation of the forward and backward algorithms (with scaling).

In [ ]:
# Your code here ...

2 - Using a HMM for Gene Finding

Below we will investigate how to use a hidden Markov model for gene finding in prokaryotes.

You are given a data set containing 2 Staphylococcus genomes, each containing several genes (i.e. substrings) obeying the "gene syntax" explained in class. The genomes are between 1.8 and 2.8 million nucleotides long.

The genomes and their annotations are given in FASTA format.

In [ ]:
def read_fasta_file(filename):
    """
    Reads the given FASTA file f and returns a dictionary of sequences.

    Lines starting with ';' in the FASTA file are ignored.
    """
    sequences_lines = {}
    current_sequence_lines = None
    with open(filename) as fp:
        for line in fp:
            line = line.strip()
            if line.startswith(';') or not line:
                continue
            if line.startswith('>'):
                sequence_name = line.lstrip('>')
                current_sequence_lines = []
                sequences_lines[sequence_name] = current_sequence_lines
            else:
                if current_sequence_lines is not None:
                    current_sequence_lines.append(line)
    sequences = {}
    for name, lines in sequences_lines.items():
        sequences[name] = ''.join(lines)
    return sequences

You can use the function like this (note that reading the entire genome will take some time):

In [ ]:
g1 = read_fasta_file('genome1.fa')
g1['genome1'][:50]

The data is:

  • The files genome1.fa and genome2.fa contain the 2 genomes.
  • The files true-ann1.fa and true-ann2.fa contain the annotation of the two genomes with the true gene structure. The annotation is given in FASTA format as a sequence over the symbols N, C, and R. The symbol N, C, or R at position $i$ in true-annk.fa gives the "state" of the nucleotide at position $i$ in genomek.fa. N means that the nucleotide is non-coding. C means that the nucleotide is coding and part of a gene read in the direction from left to right. R means that the nucleotide is coding and part of a gene read in the reverse direction, from right to left.

The annotation files can also be read with read_fasta_file.

You are given the same 7-state HMM that you used above and similar helper functions:

In [ ]:
class hmm:
    def __init__(self, init_probs, trans_probs, emission_probs):
        self.init_probs = init_probs
        self.trans_probs = trans_probs
        self.emission_probs = emission_probs
In [ ]:
init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]

trans_probs_7_state = [
    [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],
]

emission_probs_7_state = [
    #   A     C     G     T
    [0.30, 0.25, 0.25, 0.20],
    [0.20, 0.35, 0.15, 0.30],
    [0.40, 0.15, 0.20, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.40, 0.30, 0.10],
    [0.30, 0.20, 0.30, 0.20],
    [0.15, 0.30, 0.20, 0.35],
]

hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)

Notice that this time the function translate_indices_to_path is a bit different. In the given model the states 0, 1, 2 represent coding (C), state 3 represents non-coding (N) and states 4, 5, 6 represent reverse-coding (R) as explained in class.

In [ ]:
def translate_indices_to_path(indices):
    mapping = ['C', 'C', 'C', 'N', 'R', 'R', 'R']
    return ''.join([mapping[i] for i in indices])

def translate_observations_to_indices(obs):
    mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    return [mapping[symbol.lower()] for symbol in obs]

def translate_indices_to_observations(indices):
    mapping = ['a', 'c', 'g', 't']
    return ''.join(mapping[idx] for idx in indices)
In [ ]:
def make_table(m, n):
    """Make a table with `m` rows and `n` columns filled with zeros."""
    return [[0] * n for _ in range(m)]

Now insert your Viterbi implementation (log transformed) in the cell below. This means that you should copy compute_w_log, opt_path_prob_log, backtrack_log, and any other functions you may have defined yourself for your Viterbi implementation.

In [ ]:
# Your Viterbi implementation here...

Finding genes in a genome

In the cell below, use your Viterbi implementation to compute an annotation for genome 1 and 2. Save the annotation in a variable (remember to translate the indices to a path using translate_indices_to_path). Feel free to define a function that wraps compute_w_log and backtrack_log so that you don't have to call both functions each time you want an annotation for a sequence.

In [ ]:
# Your code here...
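
Such a wrapper could look like the sketch below. The name predict_annotation is just an example, and it assumes your compute_w_log and backtrack_log take the arguments used above.

In [ ]:
def predict_annotation(model, sequence):
    """A sketch: genome string in, annotation string (over C/N/R) out."""
    x = translate_observations_to_indices(sequence)
    w = compute_w_log(model, x)
    z = backtrack_log(model, w)
    return translate_indices_to_path(z)

# e.g. pred_ann1 = predict_annotation(hmm_7_state, g1['genome1'])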

Comparing annotations

We will now compare the predicted annotations to the true annotations. Read the true annotations (true-ann1.fa and true-ann2.fa) and use the compute_accuracy function given below to compare the predicted annotation to the true annotation.

In [ ]:
def compute_accuracy(true_ann, pred_ann):
    if len(true_ann) != len(pred_ann):
        return 0.0
    return sum(1 if true_ann[i] == pred_ann[i] else 0 
               for i in range(len(true_ann))) / len(true_ann)
In [ ]:
# Your code to read the annotations and compute the accuracies of your predictions...
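
For example, something along these lines could work, assuming you stored your predicted annotations in pred_ann1 and pred_ann2 above, and that the FASTA headers in the annotation files are 'true-ann1' and 'true-ann2' (check the keys returned by read_fasta_file).

In [ ]:
# Assumed names: pred_ann1/pred_ann2 from above, and the FASTA keys below.
true_ann1 = read_fasta_file('true-ann1.fa')['true-ann1']
true_ann2 = read_fasta_file('true-ann2.fa')['true-ann2']
print(compute_accuracy(true_ann1, pred_ann1))
print(compute_accuracy(true_ann2, pred_ann2))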

Training a model

Above, we used the stock hmm_7_state for prediction. In a real application, one would train the HMM on genomes with known gene structure in order to make a model that reflects reality.

Make a HMM hmm_7_state_genome1 that has a transition diagram similar to hmm_7_state, but where the transition, emission, and initial probabilities are set by training by counting on genome1.fa and its corresponding true gene structure as given in true-ann1.fa.

You should be able to use your implementation of training by counting as done above, but you must translate the annotation in true-ann1.fa into a proper sequence of hidden states, i.e. the annotation NCCCNRRRN would correspond to 321034563.
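
A possible helper for this translation is sketched below. It assumes that every gene starts at the beginning of its codon cycle, so a run of C's becomes 2, 1, 0, 2, 1, 0, ... and a run of R's becomes 4, 5, 6, 4, 5, 6, ..., matching the example NCCCNRRRN -> 321034563; the name translate_annotation_to_indices is just a suggestion.

In [ ]:
def translate_annotation_to_indices(ann):
    """A sketch: map an annotation over N/C/R to indices of the 7 hidden states."""
    c_cycle = [2, 1, 0]
    r_cycle = [4, 5, 6]
    indices = []
    pos = 0        # position within the current gene
    prev = 'N'
    for symbol in ann:
        if symbol == 'N':
            indices.append(3)
        else:
            if symbol != prev:
                pos = 0  # a new gene starts
            cycle = c_cycle if symbol == 'C' else r_cycle
            indices.append(cycle[pos % 3])
            pos += 1
        prev = symbol
    return indices

# translate_annotation_to_indices('NCCCNRRRN') should give [3, 2, 1, 0, 3, 4, 5, 6, 3]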

Use the trained HMM hmm_7_state_genome1 to predict the gene structure of genome 2, and compare the predicted annotation to the true annotation (true-ann2.fa). Is the accuracy better than your prediction on genome 2 using hmm_7_state?

Implement training by counting in the cell below. We'll use it to train a new model for predicting genes. Feel free to define any helper functions you find useful.

In [ ]:
# Your code to get hmm_7_state_genome1 using training by counting, predict an annotation of genome2, and compare the prediction to true-ann2.fa
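
Put together, the pipeline could look roughly like the sketch below. It reuses the hypothetical helpers sketched above (translate_annotation_to_indices and predict_annotation) together with your own training_by_counting; adjust the names and FASTA keys to your own code.

In [ ]:
# A rough sketch of the whole pipeline (FASTA keys and helper names are assumptions).
g1 = read_fasta_file('genome1.fa')['genome1']
true_ann1 = read_fasta_file('true-ann1.fa')['true-ann1']

x1 = translate_observations_to_indices(g1)
z1 = translate_annotation_to_indices(true_ann1)
hmm_7_state_genome1 = training_by_counting(7, 4, x1, z1)

g2 = read_fasta_file('genome2.fa')['genome2']
pred_ann2 = predict_annotation(hmm_7_state_genome1, g2)
true_ann2 = read_fasta_file('true-ann2.fa')['true-ann2']
print(compute_accuracy(true_ann2, pred_ann2))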

Redo the above, where you train on genome 2 and predict on genome 1, i.e. make a model hmm_7_state_genome2 using training by counting on genome2.fa and true-ann2.fa, predict the gene structure of genome1.fa, and compare your prediction against true-ann1.fa.

In [ ]:
# Your code to get hmm_7_state_genome2 using training by counting, predict an annotation of genome1, and compare the prediction to true-ann1.fa

If you have time you can redo the above for other training methods, e.g. Viterbi training that you also considered above. I.e. train a model hmm_7_state_genome1_vit using Viterbi training on genome1.fa, and use it to predict a gene structure for genome 2.