{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import math # Just ignore this :-)\n", "\n", "def log(x):\n", " if x == 0:\n", " return float('-inf')\n", " return math.log(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ML E2022 - Week 11 - Practical Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the exercise below, you will implement and experiment with various ways of training a HMM (i.e. deciding parameters from data), and you will see an example of how to apply a HMM for identifying coding regions (genes) in genetic matrial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 - Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are given the same 7-state HMM and helper functions that you used last week:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class hmm:\n", " def __init__(self, init_probs, trans_probs, emission_probs):\n", " self.init_probs = init_probs\n", " self.trans_probs = trans_probs\n", " self.emission_probs = emission_probs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]\n", "\n", "trans_probs_7_state = [\n", " [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],\n", " [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n", " [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n", " [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],\n", " [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],\n", "]\n", "\n", "emission_probs_7_state = [\n", " # A C G T\n", " [0.30, 0.25, 0.25, 0.20],\n", " [0.20, 0.35, 0.15, 0.30],\n", " [0.40, 0.15, 0.20, 0.25],\n", " [0.25, 0.25, 0.25, 0.25],\n", " [0.20, 0.40, 0.30, 0.10],\n", " [0.30, 0.20, 0.30, 0.20],\n", " [0.15, 0.30, 0.20, 0.35],\n", "]\n", "\n", "hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def translate_observations_to_indices(obs):\n", " mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}\n", " return [mapping[symbol.lower()] for symbol in obs]\n", "\n", "def translate_indices_to_observations(indices):\n", " mapping = ['a', 'c', 'g', 't']\n", " return ''.join(mapping[idx] for idx in indices)\n", "\n", "def translate_path_to_indices(path):\n", " return list(map(lambda x: int(x), path))\n", "\n", "def translate_indices_to_path(indices):\n", " return ''.join([str(i) for i in indices])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_table(m, n):\n", " \"\"\"Make a table with `m` rows and `n` columns filled with zeros.\"\"\"\n", " return [[0] * n for _ in range(m)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training by counting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training a hidden Markov model is a matter of estimating the initial, transition and emission probabilities. If we are given training data, i.e. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Training a hidden Markov model is a matter of estimating the initial, transition, and emission probabilities. If we are given training data, i.e. a sequence of observations, ${\\bf X}$, and a corresponding sequence of hidden states, ${\\bf Z}$, we can do \"training by counting\": we count the number of observed transitions and emissions in the training data as explained in the lecture.\n", "\n", "Given ${\\bf X}$ and ${\\bf Z}$ we would like to count the number of transitions from one state to another, and the number of times that symbol $k$ is observed while being in state $i$. That is, we want to construct a $K \\times K$ matrix such that entry $i, j$ is the number of times that a transition from state $i$ to state $j$ is observed in the training data, and a $K \\times D$ matrix where entry $i, k$ contains the number of times that symbol $k$ is observed in the training data while being in state $i$.\n", "\n", "Implement this in the function below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def count_transitions_and_emissions(K, D, x, z):\n", "    \"\"\"\n", "    Returns a KxK matrix and a KxD matrix containing counts cf. above\n", "    \"\"\"\n", "    pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_long = 'TGAGTATCACTTAGGTCTATGTCTAGTCGTCTTTCGTAATGTTTGGTCTTGTCACCAGTTATCCTATGGCGCTCCGAGTCTGGTTCTCGAAATAAGCATCCCCGCCCAAGTCATGCACCCGTTTGTGTTCTTCGCCGACTTGAGCGACTTAATGAGGATGCCACTCGTCACCATCTTGAACATGCCACCAACGAGGTTGCCGCCGTCCATTATAACTACAACCTAGACAATTTTCGCTTTAGGTCCATTCACTAGGCCGAAATCCGCTGGAGTAAGCACAAAGCTCGTATAGGCAAAACCGACTCCATGAGTCTGCCTCCCGACCATTCCCATCAAAATACGCTATCAATACTAAAAAAATGACGGTTCAGCCTCACCCGGATGCTCGAGACAGCACACGGACATGATAGCGAACGTGACCAGTGTAGTGGCCCAGGGGAACCGCCGCGCCATTTTGTTCATGGCCCCGCTGCCGAATATTTCGATCCCAGCTAGAGTAATGACCTGTAGCTTAAACCCACTTTTGGCCCAAACTAGAGCAACAATCGGAATGGCTGAAGTGAATGCCGGCATGCCCTCAGCTCTAAGCGCCTCGATCGCAGTAATGACCGTCTTAACATTAGCTCTCAACGCTATGCAGTGGCTTTGGTGTCGCTTACTACCAGTTCCGAACGTCTCGGGGGTCTTGATGCAGCGCACCACGATGCCAAGCCACGCTGAATCGGGCAGCCAGCAGGATCGTTACAGTCGAGCCCACGGCAATGCGAGCCGTCACGTTGCCGAATATGCACTGCGGGACTACGGACGCAGGGCCGCCAACCATCTGGTTGACGATAGCCAAACACGGTCCAGAGGTGCCCCATCTCGGTTATTTGGATCGTAATTTTTGTGAAGAACACTGCAAACGCAAGTGGCTTTCCAGACTTTACGACTATGTGCCATCATTTAAGGCTACGACCCGGCTTTTAAGACCCCCACCACTAAATAGAGGTACATCTGA'\n", "z_long = '3333321021021021021021021021021021021021021021021021021021021021021021033333333334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210321021021021021021021021033334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563333333456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456332102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102103210210210210210210210210210210210210210210210210210210210210210'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your implementation of `count_transitions_and_emissions` on (prefixes of) `x_long` and `z_long` above in order to convince yourself that your implementation works as expected."
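] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is one possible sketch of the counting step (the name `count_transitions_and_emissions_sketch` is ours; it assumes that `x` and `z` have already been translated to index lists with `translate_observations_to_indices` and `translate_path_to_indices`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch, not necessarily the only (or best) solution.\n", "# Assumes x and z are equally long lists of indices.\n", "def count_transitions_and_emissions_sketch(K, D, x, z):\n", "    trans_counts = make_table(K, K)\n", "    emission_counts = make_table(K, D)\n", "    # Count the K x K transitions z[i] -> z[i+1] ...\n", "    for i in range(len(z) - 1):\n", "        trans_counts[z[i]][z[i + 1]] += 1\n", "    # ... and the K x D emissions of symbol x[i] while in state z[i].\n", "    for i in range(len(x)):\n", "        emission_counts[z[i]][x[i]] += 1\n", "    return trans_counts, emission_counts\n", "\n", "# Example: counting on a prefix of x_long and z_long.\n", "A_counts, phi_counts = count_transitions_and_emissions_sketch(\n", "    7, 4,\n", "    translate_observations_to_indices(x_long[:100]),\n", "    translate_path_to_indices(z_long[:100]))"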
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your implementation of `count_transitions_and_emissions` to implement a function `training_by_counting` that given the number of hidden states, $K$, the number of observables, $D$, a sequence of observations, ${\\bf X}$, and a corresponding sequence of hidden states, ${\\bf Z}$, returns a HMM (as an instance of `class hmm`), where the tranistion, emission, and initial probabilities are set cf. training by counting on ${\\bf X}$ and ${\\bf Z}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def training_by_counting(K, D, x, z):\n", " \"\"\"\n", " Returns a HMM trained on x and z cf. training-by-counting.\n", " \"\"\"\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consinder a HMM trained on `x_long` and `z_long`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "hmm_7_state_tbc = training_by_counting(7, 4, x_long, z_long)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does this HMM (i.e. its transistion, emission, and initial probabilities) compare to `hmm_7_state` as specified above?\n", "\n", "You can e.g. try to perform a Viterbi decoding of `x_long` using the two HMMs and investigate if the decodings differ:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your implementation of Viterbi (log transformed) from last week\n", "\n", "def compute_w_log(model, x):\n", " pass\n", "\n", "def opt_path_prob_log(w):\n", " pass\n", "\n", "def backtrack_log(model, x, w):\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "w = compute_w_log(hmm_7_state, x_long)\n", "z_vit = backtrack_log(hmm_7_state, x_long, w)\n", "\n", "w_tbc = compute_w_log(hmm_7_state_tbc, x_long)\n", "z_vit_tbc = backtrack_log(hmm_7_state_tbc, x_long, w_tbc)\n", "\n", "# Your comparison of z_vit and z_vit_tbc here ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viterbi training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your implementation of the (log transformed) Viterbi algorithm (`compute_w_log` and `backtrack_log`) from last week, and your implementation of `training_by_counting` above, to implement Viterbi training as explained in class. \n", "\n", "In the cell below, you should implement a function, `viterbi_update_model`, that given a HMM, `model`, and a sequence of observations, `x`, computes the Viterbi decoding of `x`, `z_vit`, and returns an updated model obtained by doing training by counting on `x` and `z_vit`. I.e. a function that corresponds to one round of Viterbi training." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def viterbi_update_model(model, x):\n", " \"\"\"\n", " return a new model that corresponds to one round of Viterbi training, \n", " i.e. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def viterbi_update_model(model, x):\n", "    \"\"\"\n", "    Return a new model that corresponds to one round of Viterbi training,\n", "    i.e. a model where the parameters reflect training by counting on x\n", "    and z_vit, where z_vit is the Viterbi decoding of x under the given\n", "    model.\n", "    \"\"\"\n", "    \n", "    # Your code here ...\n", "    \n", "    return new_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your implementation of `viterbi_update_model` to implement Viterbi training, i.e. continue updating the model until it does not change, or until a stopping criterion of your choice is met (a minimal sketch of such a training loop is shown further below, just before the FASTA helper in Part 2).\n", "\n", "Make an experiment where you perform Viterbi training on our example 7-state HMM using `x_long`, starting both from the given model, `hmm_7_state`, and from a random model. Examine how the parameters evolve during the iterations. How do the resulting parameters compare to the parameters obtained by training by counting on `x_long` and `z_long` above?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baum-Welch training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have time, you can experiment as above but with Baum-Welch (EM) training. This of course requires that you have a working implementation of the forward and backward algorithms (with scaling)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 - Using a HMM for Gene Finding\n", "\n", "Below we will investigate how to use a hidden Markov model for gene finding in prokaryotes.\n", "\n", "You are given a data set containing 2 Staphylococcus genomes, each containing several genes (i.e. substrings) obeying the \"gene syntax\" explained in class. The genomes are between 1.8 million and 2.8 million nucleotides long.\n", "\n", "The genomes and their annotations are given in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format)."
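] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we turn to the data, here is a minimal sketch of the Viterbi-training loop described in Part 1 above. It assumes that `viterbi_update_model` has been implemented; the function name, the stopping criterion (largest change in any transition probability), and the iteration cap are our choices, not the only reasonable ones:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of Viterbi training, assuming viterbi_update_model works.\n", "def viterbi_training_sketch(model, x, max_iter=50, tol=1e-9):\n", "    for _ in range(max_iter):\n", "        new_model = viterbi_update_model(model, x)\n", "        # Largest absolute change in any transition probability.\n", "        change = max(abs(p - q)\n", "                     for old_row, new_row in zip(model.trans_probs, new_model.trans_probs)\n", "                     for p, q in zip(old_row, new_row))\n", "        model = new_model\n", "        if change < tol:\n", "            break\n", "    return model\n", "\n", "# Example (once viterbi_update_model is implemented):\n", "# hmm_7_state_vt = viterbi_training_sketch(hmm_7_state, x_long)"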
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def read_fasta_file(filename):\n", " \"\"\"\n", " Reads the given FASTA file f and returns a dictionary of sequences.\n", "\n", " Lines starting with ';' in the FASTA file are ignored.\n", " \"\"\"\n", " sequences_lines = {}\n", " current_sequence_lines = None\n", " with open(filename) as fp:\n", " for line in fp:\n", " line = line.strip()\n", " if line.startswith(';') or not line:\n", " continue\n", " if line.startswith('>'):\n", " sequence_name = line.lstrip('>')\n", " current_sequence_lines = []\n", " sequences_lines[sequence_name] = current_sequence_lines\n", " else:\n", " if current_sequence_lines is not None:\n", " current_sequence_lines.append(line)\n", " sequences = {}\n", " for name, lines in sequences_lines.items():\n", " sequences[name] = ''.join(lines)\n", " return sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the function like this (note that reading the entire genome will take some time):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "g1 = read_fasta_file('genome1.fa')\n", "g1['genome1'][:50]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is:\n", "\n", "* The files [genome1.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/genome1.fa) and [genome2.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/genome2.fa) contain the 2 genomes.\n", "* The files [true-ann1.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/true-ann1.fa) and [true-ann2.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/true-ann2.fa) contain the annotation of the two genomes with the tru gene structure. The annotation is given in FASTA format as a sequence over the symbols `N`, `C`, and `R`. The symbol `N`, `C`, or `R` at position $i$ in `true-annk.fa` gives the \"state\" of the nucleotide at position $i$ in `genomek.fa`. `N` means that the nucleotide is non-coding. `C` means that the nucleotide is coding and part of a gene in the direction from left to right. 
"\n", "The annotation files can also be read with `read_fasta_file`.\n", "\n", "You are given the same 7-state HMM that you used above and similar helper functions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class hmm:\n", "    def __init__(self, init_probs, trans_probs, emission_probs):\n", "        self.init_probs = init_probs\n", "        self.trans_probs = trans_probs\n", "        self.emission_probs = emission_probs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]\n", "\n", "trans_probs_7_state = [\n", "    [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],\n", "    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n", "    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n", "    [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],\n", "    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],\n", "    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],\n", "    [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],\n", "]\n", "\n", "emission_probs_7_state = [\n", "    #   A     C     G     T\n", "    [0.30, 0.25, 0.25, 0.20],\n", "    [0.20, 0.35, 0.15, 0.30],\n", "    [0.40, 0.15, 0.20, 0.25],\n", "    [0.25, 0.25, 0.25, 0.25],\n", "    [0.20, 0.40, 0.30, 0.10],\n", "    [0.30, 0.20, 0.30, 0.20],\n", "    [0.15, 0.30, 0.20, 0.35],\n", "]\n", "\n", "hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this time the function `translate_indices_to_path` is a bit different. In the given model, the states 0, 1, 2 represent coding (C), state 3 represents non-coding (N), and states 4, 5, 6 represent reverse-coding (R), as explained in class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def translate_indices_to_path(indices):\n", "    mapping = ['C', 'C', 'C', 'N', 'R', 'R', 'R']\n", "    return ''.join([mapping[i] for i in indices])\n", "\n", "def translate_observations_to_indices(obs):\n", "    mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}\n", "    return [mapping[symbol.lower()] for symbol in obs]\n", "\n", "def translate_indices_to_observations(indices):\n", "    mapping = ['a', 'c', 'g', 't']\n", "    return ''.join(mapping[idx] for idx in indices)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make_table(m, n):\n", "    \"\"\"Make a table with `m` rows and `n` columns filled with zeros.\"\"\"\n", "    return [[0] * n for _ in range(m)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now insert your (log transformed) Viterbi implementation in the cell below; that is, copy `compute_w_log`, `opt_path_prob_log`, `backtrack_log`, and any other functions you may have defined yourself for your Viterbi implementation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your Viterbi implementation here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding genes in a genome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell below, use your Viterbi implementation to compute an annotation for genomes 1 and 2. Save the annotation in a variable (remember to translate the indices to a path using `translate_indices_to_path`). Feel free to define a function that wraps `compute_w_log` and `backtrack_log` so that you don't have to call both functions each time you want an annotation for a sequence." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing annotations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now compare the predicted annotations to the true annotations. Read the true annotations (`true-ann1.fa` and `true-ann2.fa`) and use the `compute_accuracy` function given below to compare each predicted annotation to the true annotation by computing the accuracy. Note that there are other ways to measure the quality of a predicted annotation against the true annotation, e.g. the ACC as shown in the lectures." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def compute_accuracy(true_ann, pred_ann):\n", "    if len(true_ann) != len(pred_ann):\n", "        return 0.0\n", "    return sum(1 if true_ann[i] == pred_ann[i] else 0\n", "               for i in range(len(true_ann))) / len(true_ann)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code to read the annotations and compute the accuracies of your predictions..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we used the stock `hmm_7_state` for prediction. In a real application, one would train the HMM on genomes with known gene structure in order to make a model that reflects reality.\n", "\n", "Make a HMM `hmm_7_state_genome1` that has a transition diagram similar to `hmm_7_state`, but where the transition, emission, and initial probabilities are set by training by counting on `genome1.fa` and its corresponding true gene structure as given in `true-ann1.fa`.\n", "\n", "You should be able to use your implementation of training by counting as done above, but you must translate the annotation in `true-ann1.fa` into a proper sequence of hidden states; e.g. the annotation `NCCCNRRRN` would correspond to `321034563` (one possible sketch of this translation is given below).\n", "\n", "Use the trained HMM `hmm_7_state_genome1` to predict the gene structure of genome 2, and compare the predicted annotation to the true annotation (`true-ann2.fa`). Is the accuracy better than that of your prediction on genome 2 using `hmm_7_state`?\n", "\n", "Implement training by counting in the cell below. We'll use it to train a new model for predicting genes. Feel free to define any helper functions you find useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code to get hmm_7_state_genome1 using training by counting, predict an annotation of genome2, and compare the prediction to true-ann2.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Redo the above, where you train on genome 2 and predict on genome 1, i.e. make a model `hmm_7_state_genome2` using training by counting on `true-ann2.fa`, predict the gene structure of `genome1.fa`, and compare your prediction against `true-ann1.fa`."
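] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both directions, you need the translation from an `N`/`C`/`R` annotation to a state path. Here is one possible sketch (the helper name is ours): `N` maps to state 3, a `C` run cycles through states 2, 1, 0, and an `R` run cycles through states 4, 5, 6, so that `NCCCNRRRN` becomes `[3, 2, 1, 0, 3, 4, 5, 6, 3]`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch; the cycling below matches the transition diagram of\n", "# hmm_7_state (3 -> 2 -> 1 -> 0 -> 2 -> ... and 3 -> 4 -> 5 -> 6 -> 4 -> ...).\n", "def translate_annotation_to_indices(ann):\n", "    c_cycle = [2, 1, 0]\n", "    r_cycle = [4, 5, 6]\n", "    indices = []\n", "    run_pos = 0  # position within the current run of identical symbols\n", "    for i, symbol in enumerate(ann):\n", "        if i > 0 and ann[i] != ann[i - 1]:\n", "            run_pos = 0\n", "        if symbol == 'N':\n", "            indices.append(3)\n", "        elif symbol == 'C':\n", "            indices.append(c_cycle[run_pos % 3])\n", "        else:  # 'R'\n", "            indices.append(r_cycle[run_pos % 3])\n", "        run_pos += 1\n", "    return indices\n", "\n", "# Sanity check against the example above:\n", "# translate_annotation_to_indices('NCCCNRRRN') == [3, 2, 1, 0, 3, 4, 5, 6, 3]"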
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code to get hmm_7_state_genome2 using trainng by counting, predict an annotation of genome1, and compare the prediction to true-ann1.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have time you can redo the above for other training methods, e.g. Viterbi training that you also considered above. I.e. train a model `hmm_7_state_genome1_vit` using Viterbi training on `genome1.fa`, and use it to predict a gene structure for genome 2.\n", "\n", "You can also experiment with other HMMs that allow for a more precise modelling of gene structure as explained in class, e.g. the model with 31 states that models start- and stop-codons. What is the best accuracy that you can obtain?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 1 }