{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hand in 3 - Machine Learning, Fall 2021"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This project is about using a hidden Markov model for gene finding in prokaryotes as explained in class and explored in the exercises in week 47.\n",
    "\n",
    "You must do the project in groups similar to the previous hand ins. The project is mandatory and each group must hand in a report describing their work and results as outlined in section Report below. Handin the report in pdf-format via BrightSpace no later than **Wednesday, December 8, 2021, at 23:59**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You are give a data set containing 10\n",
    "[Staphylococcus](http://en.wikipedia.org/wiki/Staphylococcus) genomes,\n",
    "each containing several genes (i.e. substring) obeying the \"gene syntax\" explained in class.\n",
    "The genomes are between 1.8 million and 2.8 million nucleotides long.\n",
    "For 5 of the genomes, you are also given the gene structure, i.e. the location of the genes.\n",
    "For the remaining 5 genomes, you only know that they contain genes according to the \"gene syntax\". You have already explored two of the genomes in the practical exercises in week 47.\n",
    "\n",
    "The genomes and their annontations are given in\n",
    "[FASTA format](http://en.wikipedia.org/wiki/Fasta_format).\n",
    "There are several Python libraries for reading sequences in FASTA format (use Google to find one),\n",
    "and you are welcome to use the simple reader function `read_fasta_file` below that you also used in the exercises in week 47.\n",
    "\n",
    "The data is:\n",
    "\n",
    "\n",
    "  * The files [genome1.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome1.fa),\n",
    "    [genome2.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome2.fa),\n",
    "    [genome3.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome3.fa),\n",
    "    [genome4.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome4.fa),\n",
    "    [genome5.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome5.fa),\n",
    "    [genome6.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome6.fa),\n",
    "    [genome7.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome7.fa),\n",
    "    [genome8.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome8.fa),\n",
    "    [genome9.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome9.fa), and\n",
    "    [genome10.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome10.fa) contain the 10 genomes.\n",
    "\n",
    "\n",
    "* The files [true-ann1.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/true-ann1.fa),\n",
    "    [true-ann2.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/true-ann2.fa),\n",
    "    [true-ann3.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/true-ann3.fa),\n",
    "    [true-ann4.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/true-ann4.fa), and\n",
    "    [true-ann5.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/true-ann5.fa) contain the annotation\n",
    "    of the first 5 genomes. The annotation is given in FASTA format as a sequence\n",
    "    over the symbols `N`, `C`, and `R`. The symbol `N`, `C`, or `R` at position _i_ in\n",
    "    true-ann_k_.fa gives the \"state\" of the nucleotide at position _i_ in\n",
    "    genome_k_.fa. `N` means that the nucleotide is non-coding. `C` means that the\n",
    "    nucleotide is coding and part of a gene in the direction from left to right. `R`\n",
    "    means that the nucleotide is coding and part of gene in the reverse direction\n",
    "    from right to left. This is exactly as in the exercises in week 47.  \n",
    "\n",
    "\n",
    "* All the above files are available in [data-handin3.zip](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/data-handin3.zip)\n",
    "\n",
    "A simple analysis of the annotations and the corresponding genomes, it is\n",
    "clear that fx several start-codons are possible (and not only atg as modelled\n",
    "in class).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_fasta_file(filename):\n",
    "    \"\"\"\n",
    "    Reads the given FASTA file f and returns a dictionary of sequences.\n",
    "\n",
    "    Lines starting with ';' in the FASTA file are ignored.\n",
    "    \"\"\"\n",
    "    sequences_lines = {}\n",
    "    current_sequence_lines = None\n",
    "    with open(filename) as fp:\n",
    "        for line in fp:\n",
    "            line = line.strip()\n",
    "            if line.startswith(';') or not line:\n",
    "                continue\n",
    "            if line.startswith('>'):\n",
    "                sequence_name = line.lstrip('>')\n",
    "                current_sequence_lines = []\n",
    "                sequences_lines[sequence_name] = current_sequence_lines\n",
    "            else:\n",
    "                if current_sequence_lines is not None:\n",
    "                    current_sequence_lines.append(line)\n",
    "    sequences = {}\n",
    "    for name, lines in sequences_lines.items():\n",
    "        sequences[name] = ''.join(lines)\n",
    "    return sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `read_fast_file` function is used as follows. First, we write an example FASTA file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = '''\n",
    ";Example FASTA file\n",
    ">genomeA\n",
    "CGATTAAAGA\n",
    "TAGAAATACA\n",
    ">annotationA\n",
    "CCCCCCNNNN\n",
    "NNNNRRRRRR\n",
    "'''\n",
    "with open('test.fa', 'w') as fp:\n",
    "    fp.write(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we read the simple FASTA file `test.fa` using `read_fasta_file`, we get a dictionary with the keys `genomeA` and `annotationA` corresponding to the contents of the above file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'genomeA': 'CGATTAAAGATAGAAATACA', 'annotationA': 'CCCCCCNNNNNNNNRRRRRR'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = read_fasta_file('test.fa')\n",
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CGATTAAAGATAGAAATACA\n"
     ]
    }
   ],
   "source": [
    "print(x['genomeA'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem\n",
    "\n",
    "Your task is to predict the gene structure of the 5 unannotated genomes, i.e.\n",
    "[genome6.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome6.fa),\n",
    "[genome7.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome7.fa),\n",
    "[genome8.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome8.fa),\n",
    "[genome9.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome9.fa), and\n",
    "[genome10.fa](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/genome10.fa), in the best possible manner using an HMM based approach.\n",
    "\n",
    "As explained in class, predicting gene structure using an HMM involves:\n",
    "\n",
    "  * Deciding on an initial model structure, i.e. the number of hidden states\n",
    "    and which transitions and emission should have a fixed probability (e.g. 0\n",
    "    for \"not possible\", or 1 for \"always the case\").\n",
    "    You might get inspiration from, or even use, one of the model structures discussed in class.\n",
    "    However, be aware that the models presented in class model only one start-codon atg. \n",
    "    This is not the case in the presented data, where multiple start-codons are possible (as also mentioned in the     lecture/slides about this project).\n",
    "\n",
    "\n",
    "  * Tune model parameters by training, i.e. set the non-fixed emission and transition probabilities.\n",
    "    This can be done by several methods as explained in class. You can use\n",
    "    (1) Training-by-Counting using the 5 genomes with know gene structure because you\n",
    "    know the underlying hidden state for each observation (i.e. symbol) in these\n",
    "    sequences, or\n",
    "    (2) Viterbi- or EM-training using all the 10 genomes,\n",
    "    including the 5 genomes with unknown gene structure.\n",
    "    You can of course also combine the two approaches.\n",
    "    You can evaluate the performance of your gene finder by computing the\n",
    "    approximate correlation (AC) between your predictions and\n",
    "    the true structure using this small python program\n",
    "    [compare_anns.py](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/compare_anns.py)\n",
    "    (see [this paper](http://www.sciencedirect.com/science/article/pii/S0888754396902980)\n",
    "    for details).\n",
    "\n",
    "    In this project, Training-by-Counting is mandatory, while Viterbi- or\n",
    "    EM-training is optional. If you implement EM-training you should of course be\n",
    "    aware of the numerical problems of the forward- and backward-algorithms unless you use scaling or similar techniqes.\n",
    "\n",
    "* Evaluate the performance of your gene predictor. In this project, you must do a 5-fold cross validation on the 5 genomes with known gene structure, i.e.:\n",
    "\n",
    "    You consider genome 1 to 5 in five rounds. In a each you train your model using Training-by-Counting on the remaining 4 genomes, predict the gene structure of the genome you consider, and compute the approximate correlation coefficient (AC) between your predicted gene structure and the true gene structure using the python program  [compare_anns.py](https://birc.au.dk/~cstorm/courses/ML_e21/projects/handin3/compare_anns.py).\n",
    "\n",
    "\n",
    "  * Use your best model to predict the gene structure for the 5 unannotated\n",
    "    genomes using the Viterbi algorithm with subsequent backtracking.\n",
    "    I.e. for each unannotated genome you must find a most likely sequence of\n",
    "    states in your model generating it, and convert this sequence of states into a\n",
    "    FASTA file giving the annotation of each nucleotide as `N`, `C`, or `R`.\n",
    "    You should make files, `pred-ann6.fa`, ..., `pred-ann10.fa`, giving the\n",
    "    annotation for the 5 unannotated genomes. \n",
    "    \n",
    "    You can use the www-service [GeneFinder Verifier](https://services.birc.au.dk/genefinder-verifier/) to compare your predictions on Genome 6-10 against their true structures.\n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Report\n",
    "\n",
    "You (i.e. your group) must hand in a short report (in pdf-format) describing your\n",
    "work and results and a zip-archive of the files `pred-ann6.fa` and\n",
    "`pred-ann7.fa`,..., `pred-ann10.fa` containing your predictions on the 5\n",
    "genomes with unknown structure. The files _must_ be in FASTA format similar to\n",
    "the annotation of the other genomes. Handin the report and zip-archive via\n",
    "Blackboard no later than **Wednesday, December 8, 2021, at 23:59**. Your report should\n",
    "be no more than 3 pages. It must cover:\n",
    "\n",
    "  * The status of the work, i.e. does it work, if not, then why.\n",
    "\n",
    "\n",
    "  * An explanation and illustration of your model structure (i.e. a transition\n",
    "    diagram). This should include an explanation of how you model start and stop codons, and an explanation of how you have trained your model. For Training-by-Counting you might comment on how you \n",
    "    translata the given annotations of `C`, `N`, and `R`'s into corresponding sequences\n",
    "    of hidden states. The transition-diagram must contain all non-0 emission- and transition probabilities\n",
    "    for the model that you have used for predicting the gene structure of the 5 genomes with unknown gene structure.\n",
    "\n",
    "\n",
    "  * An explanation of how you have predicted the gene structure for the the 5 genomes with unknown gene structure. You should comment on how you translate a most likely sequence of hidden states as returned by Viterbi decoding into a sequence of `C`, `N`, and `R`'s.\n",
    "\n",
    "\n",
    "  * The result of your 5-fold cross valiation on the 5 genomes with known gene structure. The report must include a table with the AC obtained in each of the five rounds of the 5-fold cross validation.\n",
    "    \n",
    "    \n",
    "  * The result of comparing your predictions on the 5 genomes with unknown gene structure against their true structures via the www-service [GeneFinder Verifer](https://services.birc.au.dk/genefinder-verifier/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}