Project 3 - Machine Learning, Fall 2018

This project is about using a hidden Markov model for gene finding in prokaryotes as explained in class and explored in the exercises in week 46.

We suggest that you do the project in groups of 2-3 students. The project is mandatory and each group must hand in a report describing their work and results as outlined in section Report below. Handin the report in pdf-format via Blackboard no later than Monday, November 26, 2018, at 08:00.

Data set

You are give a data set containing 10 Staphylococcus genomes, each containing several genes (i.e. substring) obeying the "gene syntax" explained in class. The genomes are between 1.8 million and 2.8 million nucleotides. For 5 of the genomes, you are also given the location of the genes. For the remaining 5 genomes, you only know that they contain genes according to the "gene syntax". (You have already explored two of the genomes in the practical exercises in week 46.)

The genomes and their annontations are given in FASTA format. There are several Python libraries for reading sequences in this format (use Google to find one), but it is also easy to read it directly, and you are welcome to use the simple reader function read_fasta_file below. The data is:

By simple reading of the annotations and the corresponding genomes, it is clear that fx several start-codons are possible (and not only atg as modeled in class).

In [1]:
def read_fasta_file(filename):
    """
    Reads the given FASTA file f and returns a dictionary of sequences.

    Lines starting with ';' in the FASTA file are ignored.
    """
    sequences_lines = {}
    current_sequence_lines = None
    with open(filename) as fp:
        for line in fp:
            line = line.strip()
            if line.startswith(';') or not line:
                continue
            if line.startswith('>'):
                sequence_name = line.lstrip('>')
                current_sequence_lines = []
                sequences_lines[sequence_name] = current_sequence_lines
            else:
                if current_sequence_lines is not None:
                    current_sequence_lines.append(line)
    sequences = {}
    for name, lines in sequences_lines.items():
        sequences[name] = ''.join(lines)
    return sequences

The function is used as follows. First, we write an example FASTA file:

In [2]:
s = '''
;Example FASTA file
>genomeA
CGATTAAAGA
TAGAAATACA
>annotationA
CCCCCCNNNN
NNNNRRRRRR
'''
with open('test.fa', 'w') as fp:
    fp.write(s)

When we read the file with read_fasta_file, we get a dictionary with the keys genomeA and genomeB corresponding to the contents of the above file.

In [3]:
x = read_fasta_file('test.fa')
x
Out[3]:
{'annotationA': 'CCCCCCNNNNNNNNRRRRRR', 'genomeA': 'CGATTAAAGATAGAAATACA'}
In [4]:
print(x['genomeA'])
CGATTAAAGATAGAAATACA

Problem

Your task is to predict the gene structure of the 5 unannotated genomes, i.e. genome6.fa, genome7.fa, genome8.fa, genome9.fa, and genome10.fa.

As explained in class, predicting gene structure using a HMM involves:

  • Deciding on an initial model structure, i.e. the number of hidden states and which transitions and emission should have a fixed probability (e.g. 0 for "not possible", or 1 for "always the case"). You might get inspiration from, or even use, one of the model structures discussed in class. However, be aware that the models presented in class assume start-codon atg. This is not the case in the presented data, where multiple start-codons are possible.

    To predict the full gene structure, you must decide how to handle to fact that the genomes have genes in both directions (i.e. a nucleotide can be C or R). You can e.g. choose to make a HMM which only models genes in one direction and then use it twice to predict the genes in each direction (beware that start- and stop-codons look different in the reversed direction, unless you make the reverse-complement of the input string), or you can make a model which models genes in both directions.

  • Tune model parameters by training, i.e. set the non-fixed emission and transition probabilities. This can be done by several methods as explained in class: (1) "By counting" using the 5 genomes with know gene structure because you know the underlying hidden state for each observation (i.e. symbol) in these sequences, or (2) by Viterbi- or EM-training using all the 10 genomes, including the 5 genomes with unknown gene structure. You can of course also combine the two approaches. You can evaluate the performance of your gene finder by computing the approximate correlation (AC) between your predictions and the true structure using this small python program compare_anns.py (see this paper for details).

    In this project, training by counting is mandatory, while Viterbi- or EM-training is optional. If you implement EM-training you should of course be aware of the precision problems of the forward- and backward-algorithms.

  • Use your best model to predict the gene structure for the 5 unannotated genomes using the Viterbi algorithm with subsequent backtracking. I.e. for each unannotated genome you must find a most likely sequence of states in your model generating it, and convert this sequence of states into a FASTA file giving the annotation of each nucleotide as N, C, or R. You should make files, pred-ann6.fa, ..., pred-ann10.fa, giving the annotation for the 5 unannotated genomes.

    You can use the www-service GeneFinder Verifier to compare your predictions on Genome 6-10 against their true structures.

Report

You (i.e. your group) must hand in a report (in pdf-format) describing your work and results and a zip-archive of the files pred-ann6.fa and pred-ann7.fa,..., pred-ann10.fa containing your predictions on the 5 genomes with unknown structure. The files must be in FASTA format similar to the annotation of the other genomes. Handin the report and zip-archive via Blackboard no later than Monday, November 26, 2018, at 08:00. Your report should be no more than 3 pages and clearly state who is in the group. It must cover:

  • The status of the work, i.e. does it work, if not, then why.
  • An explanation and illustration of your model structure (i.e. a transition diagram). This should include an explanation of how you model that there can be several start and stop codons, and an explanation of how you have trained your model. For training by counting you should comment on how you have translated the given annotations of C, N, and R's into corresponding sequences of hidden states. The transition-diagram should contain all non-0 probabilities as computed by your training.
  • An explanation of how you have predicted the gene structure for the unannotated sequences.
  • An evaluation of the performance of your gene predictor. You must make a 5-fold cross validation on the 5 genomes with known structure:

    For each genome 1 to 5, you train your model by training-by-counting on the remaining 4 genomes, and predict the gene structure on the genome you picked. You compute and report the approximate correlation coefficient (AC) between your prediction and the true annotation using the small python program compare_anns.py. Include a table with the computed ACs in your report.

  • The result of comparing your predictions on the 5 genomes with unknown structure against their true structures via the www-service GeneFinder Verifer.