{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project 3 - Machine Learning, Fall 2018"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This project is about using a hidden Markov model for gene finding in prokaryotes as explained in class and explored in the exercises in week 46.\n",
    "\n",
    "We suggest that you do the project in groups of 2-3 students. The project is mandatory and each group must hand in a report describing their work and results as outlined in section Report below. Handin the report in pdf-format via Blackboard no later than **Monday, November 26, 2018, at 08:00**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You are give a data set containing 10\n",
    "[Staphylococcus](http://en.wikipedia.org/wiki/Staphylococcus) genomes,\n",
    "each containing several genes (i.e. substring) obeying the \"gene syntax\" explained in class.\n",
    "The genomes are between 1.8 million and 2.8 million nucleotides.\n",
    "For 5 of the genomes, you are also given the location of the genes.\n",
    "For the remaining 5 genomes, you only know that they contain genes according to the \"gene syntax\". (You have already explored two of the genomes in the practical exercises in week 46.)\n",
    "\n",
    "The genomes and their annontations are given in\n",
    "[FASTA format](http://en.wikipedia.org/wiki/Fasta_format).\n",
    "There are several Python libraries for reading sequences in this format (use Google to find one),\n",
    "but it is also easy to read it directly,\n",
    "and you are welcome to use the simple reader function `read_fasta_file` below.\n",
    "The data is:\n",
    "\n",
    "  * The files [genome1.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome1.fa),\n",
    "    [genome2.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome2.fa),\n",
    "    [genome3.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome3.fa),\n",
    "    [genome4.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome4.fa),\n",
    "    [genome5.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome5.fa),\n",
    "    [genome6.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome6.fa),\n",
    "    [genome7.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome7.fa),\n",
    "    [genome8.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome8.fa),\n",
    "    [genome9.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome9.fa), and\n",
    "    [genome10.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome10.fa) contain the 10 genomes.\n",
    "  * The files [true-ann1.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/true-ann1.fa),\n",
    "    [true-ann2.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/true-ann2.fa),\n",
    "    [true-ann3.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/true-ann3.fa),\n",
    "    [true-ann4.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/true-ann4.fa), and\n",
    "    [true-ann5.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/true-ann5.fa) contain the annotation\n",
    "    of the first 5 genomes. The annotation is given in FASTA format as a sequence\n",
    "    over the symbols `N`, `C`, and `R`. The symbol `N`, `C`, or `R` at position _i_ in\n",
    "    true-ann_k_.fa gives the \"state\" of the nucleotide at position _i_ in\n",
    "    genome_k_.fa. `N` means that the nucleotide is non-coding. `C` means that the\n",
    "    nucleotide is coding and part of a gene in the direction from left to right. `R`\n",
    "    means that the nucleotide is coding and part of gene in the reverse direction\n",
    "    from right to left. (This is exactly as in the pratical exercises in week 46.)   \n",
    "  * All the above files are available in [data-handin3.zip](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/data-handin3.zip)\n",
    "\n",
    "By simple reading of the annotations and the corresponding genomes, it is\n",
    "clear that fx several start-codons are possible (and not only atg as modeled\n",
    "in class).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def read_fasta_file(filename):\n",
    "    \"\"\"\n",
    "    Reads the given FASTA file f and returns a dictionary of sequences.\n",
    "\n",
    "    Lines starting with ';' in the FASTA file are ignored.\n",
    "    \"\"\"\n",
    "    sequences_lines = {}\n",
    "    current_sequence_lines = None\n",
    "    with open(filename) as fp:\n",
    "        for line in fp:\n",
    "            line = line.strip()\n",
    "            if line.startswith(';') or not line:\n",
    "                continue\n",
    "            if line.startswith('>'):\n",
    "                sequence_name = line.lstrip('>')\n",
    "                current_sequence_lines = []\n",
    "                sequences_lines[sequence_name] = current_sequence_lines\n",
    "            else:\n",
    "                if current_sequence_lines is not None:\n",
    "                    current_sequence_lines.append(line)\n",
    "    sequences = {}\n",
    "    for name, lines in sequences_lines.items():\n",
    "        sequences[name] = ''.join(lines)\n",
    "    return sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function is used as follows. First, we write an example FASTA file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s = '''\n",
    ";Example FASTA file\n",
    ">genomeA\n",
    "CGATTAAAGA\n",
    "TAGAAATACA\n",
    ">annotationA\n",
    "CCCCCCNNNN\n",
    "NNNNRRRRRR\n",
    "'''\n",
    "with open('test.fa', 'w') as fp:\n",
    "    fp.write(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we read the file with `read_fasta_file`, we get a dictionary with the keys `genomeA` and `genomeB` corresponding to the contents of the above file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'annotationA': 'CCCCCCNNNNNNNNRRRRRR', 'genomeA': 'CGATTAAAGATAGAAATACA'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = read_fasta_file('test.fa')\n",
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CGATTAAAGATAGAAATACA\n"
     ]
    }
   ],
   "source": [
    "print(x['genomeA'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem\n",
    "\n",
    "Your task is to predict the gene structure of the 5 unannotated genomes, i.e.\n",
    "[genome6.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome6.fa),\n",
    "[genome7.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome7.fa),\n",
    "[genome8.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome8.fa),\n",
    "[genome9.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome9.fa), and\n",
    "[genome10.fa](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/genome10.fa).\n",
    "\n",
    "As explained in class, predicting gene structure using a HMM involves:\n",
    "\n",
    "  * Deciding on an initial model structure, i.e. the number of hidden states\n",
    "    and which transitions and emission should have a fixed probability (e.g. 0\n",
    "    for \"not possible\", or 1 for \"always the case\").\n",
    "    You might get inspiration from, or even use, one of the model structures discussed in class.\n",
    "    However, be aware that the models presented in class assume start-codon atg.\n",
    "    This is not the case in the presented data, where multiple start-codons are possible.\n",
    "\n",
    "    To predict the full gene structure, you must decide how to handle to fact that\n",
    "    the genomes have genes in both directions (i.e. a nucleotide can be `C` or `R`).\n",
    "    You can e.g. choose to make a HMM which only models genes in one direction and\n",
    "    then use it twice to predict the genes in each direction (beware that start-\n",
    "    and stop-codons look different in the reversed direction, unless you make the\n",
    "    reverse-complement of the input string), or you can make a model which models\n",
    "    genes in both directions.\n",
    "\n",
    "\n",
    "  * Tune model parameters by training, i.e. set the non-fixed emission and transition probabilities.\n",
    "    This can be done by several methods as explained in class:\n",
    "    (1) \"By counting\" using the 5 genomes with know gene structure because you\n",
    "    know the underlying hidden state for each observation (i.e. symbol) in these\n",
    "    sequences, or\n",
    "    (2) by Viterbi- or EM-training using all the 10 genomes,\n",
    "    including the 5 genomes with unknown gene structure.\n",
    "    You can of course also combine the two approaches.\n",
    "    You can evaluate the performance of your gene finder by computing the\n",
    "    approximate correlation (AC) between your predictions and\n",
    "    the true structure using this small python program\n",
    "    [compare_anns.py](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/compare_anns.py)\n",
    "    (see [this paper](http://www.sciencedirect.com/science/article/pii/S0888754396902980)\n",
    "    for details).\n",
    "\n",
    "    In this project, training by counting is mandatory, while Viterbi- or\n",
    "    EM-training is optional. If you implement EM-training you should of course be\n",
    "    aware of the precision problems of the forward- and backward-algorithms.\n",
    "\n",
    "\n",
    "  * Use your best model to predict the gene structure for the 5 unannotated\n",
    "    genomes using the Viterbi algorithm with subsequent backtracking.\n",
    "    I.e. for each unannotated genome you must find a most likely sequence of\n",
    "    states in your model generating it, and convert this sequence of states into a\n",
    "    FASTA file giving the annotation of each nucleotide as `N`, `C`, or `R`.\n",
    "    You should make files, `pred-ann6.fa`, ..., `pred-ann10.fa`, giving the\n",
    "    annotation for the 5 unannotated genomes. \n",
    "    \n",
    "    You can use the www-service [GeneFinder Verifier](https://services.birc.au.dk/genefinder-verifier/) to compare your predictions on Genome 6-10 against their true structures.\n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Report\n",
    "\n",
    "You (i.e. your group) must hand in a report (in pdf-format) describing your\n",
    "work and results and a zip-archive of the files `pred-ann6.fa` and\n",
    "`pred-ann7.fa`,..., `pred-ann10.fa` containing your predictions on the 5\n",
    "genomes with unknown structure. The files _must_ be in FASTA format similar to\n",
    "the annotation of the other genomes. Handin the report and zip-archive via\n",
    "Blackboard no later than **Monday, November 26, 2018, at 08:00**. Your report should\n",
    "be no more than 3 pages and clearly state who is in the group. It must cover:\n",
    "\n",
    "  * The status of the work, i.e. does it work, if not, then why.\n",
    "\n",
    "\n",
    "  * An explanation and illustration of your model structure (i.e. a transition\n",
    "    diagram). This should include an explanation of how you model that there\n",
    "    can be several start and stop codons, and an explanation of how you have\n",
    "    trained your model. For training by counting you should comment on how you have\n",
    "    translated the given annotations of `C`, `N`, and `R`'s into corresponding sequences\n",
    "    of hidden states. The transition-diagram should contain all non-0 probabilities\n",
    "    as computed by your training.\n",
    "\n",
    "\n",
    "  * An explanation of how you have predicted the gene structure for the unannotated sequences.\n",
    "\n",
    "\n",
    "  * An evaluation of the performance of your gene predictor.\n",
    "    You must make a 5-fold cross validation on the 5 genomes with known structure:\n",
    "\n",
    "    For each genome 1 to 5, you train your model by training-by-counting on the\n",
    "    remaining 4 genomes, and predict the gene structure on the genome you picked.\n",
    "    You compute and report the approximate correlation coefficient (AC) between\n",
    "    your prediction and the true annotation using the small python program\n",
    "    [compare_anns.py](https://users-cs.au.dk/cstorm/courses/ML_e18/projects/handin3/compare_anns.py).\n",
    "    Include a table with the computed ACs in your report.\n",
    "    \n",
    "    \n",
    "  * The result of comparing your predictions on the 5 genomes with unknown structure against their true structures via the www-service [GeneFinder Verifer](https://services.birc.au.dk/genefinder-verifier/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}