This project is about using a hidden Markov model for gene finding in prokaryotes as explained in class and explored in the exercises in week 46.
We suggest that you do the project in groups of 2-3 students. The project is mandatory and each group must hand in a report describing their work and results as outlined in section Report below. Handin the report in pdf-format via Blackboard no later than Monday, November 26, 2018, at 08:00.
You are give a data set containing 10 Staphylococcus genomes, each containing several genes (i.e. substring) obeying the "gene syntax" explained in class. The genomes are between 1.8 million and 2.8 million nucleotides. For 5 of the genomes, you are also given the location of the genes. For the remaining 5 genomes, you only know that they contain genes according to the "gene syntax". (You have already explored two of the genomes in the practical exercises in week 46.)
The genomes and their annontations are given in
FASTA format.
There are several Python libraries for reading sequences in this format (use Google to find one),
but it is also easy to read it directly,
and you are welcome to use the simple reader function read_fasta_file
below.
The data is:
N
, C
, and R
. The symbol N
, C
, or R
at position i in
true-annk.fa gives the "state" of the nucleotide at position i in
genomek.fa. N
means that the nucleotide is non-coding. C
means that the
nucleotide is coding and part of a gene in the direction from left to right. R
means that the nucleotide is coding and part of gene in the reverse direction
from right to left. (This is exactly as in the pratical exercises in week 46.) By simple reading of the annotations and the corresponding genomes, it is clear that fx several start-codons are possible (and not only atg as modeled in class).
def read_fasta_file(filename):
"""
Reads the given FASTA file f and returns a dictionary of sequences.
Lines starting with ';' in the FASTA file are ignored.
"""
sequences_lines = {}
current_sequence_lines = None
with open(filename) as fp:
for line in fp:
line = line.strip()
if line.startswith(';') or not line:
continue
if line.startswith('>'):
sequence_name = line.lstrip('>')
current_sequence_lines = []
sequences_lines[sequence_name] = current_sequence_lines
else:
if current_sequence_lines is not None:
current_sequence_lines.append(line)
sequences = {}
for name, lines in sequences_lines.items():
sequences[name] = ''.join(lines)
return sequences
The function is used as follows. First, we write an example FASTA file:
s = '''
;Example FASTA file
>genomeA
CGATTAAAGA
TAGAAATACA
>annotationA
CCCCCCNNNN
NNNNRRRRRR
'''
with open('test.fa', 'w') as fp:
fp.write(s)
When we read the file with read_fasta_file
, we get a dictionary with the keys genomeA
and genomeB
corresponding to the contents of the above file.
x = read_fasta_file('test.fa')
x
print(x['genomeA'])
Your task is to predict the gene structure of the 5 unannotated genomes, i.e. genome6.fa, genome7.fa, genome8.fa, genome9.fa, and genome10.fa.
As explained in class, predicting gene structure using a HMM involves:
Deciding on an initial model structure, i.e. the number of hidden states and which transitions and emission should have a fixed probability (e.g. 0 for "not possible", or 1 for "always the case"). You might get inspiration from, or even use, one of the model structures discussed in class. However, be aware that the models presented in class assume start-codon atg. This is not the case in the presented data, where multiple start-codons are possible.
To predict the full gene structure, you must decide how to handle to fact that
the genomes have genes in both directions (i.e. a nucleotide can be C
or R
).
You can e.g. choose to make a HMM which only models genes in one direction and
then use it twice to predict the genes in each direction (beware that start-
and stop-codons look different in the reversed direction, unless you make the
reverse-complement of the input string), or you can make a model which models
genes in both directions.
Tune model parameters by training, i.e. set the non-fixed emission and transition probabilities. This can be done by several methods as explained in class: (1) "By counting" using the 5 genomes with know gene structure because you know the underlying hidden state for each observation (i.e. symbol) in these sequences, or (2) by Viterbi- or EM-training using all the 10 genomes, including the 5 genomes with unknown gene structure. You can of course also combine the two approaches. You can evaluate the performance of your gene finder by computing the approximate correlation (AC) between your predictions and the true structure using this small python program compare_anns.py (see this paper for details).
In this project, training by counting is mandatory, while Viterbi- or EM-training is optional. If you implement EM-training you should of course be aware of the precision problems of the forward- and backward-algorithms.
Use your best model to predict the gene structure for the 5 unannotated
genomes using the Viterbi algorithm with subsequent backtracking.
I.e. for each unannotated genome you must find a most likely sequence of
states in your model generating it, and convert this sequence of states into a
FASTA file giving the annotation of each nucleotide as N
, C
, or R
.
You should make files, pred-ann6.fa
, ..., pred-ann10.fa
, giving the
annotation for the 5 unannotated genomes.
You can use the www-service GeneFinder Verifier to compare your predictions on Genome 6-10 against their true structures.
You (i.e. your group) must hand in a report (in pdf-format) describing your
work and results and a zip-archive of the files pred-ann6.fa
and
pred-ann7.fa
,..., pred-ann10.fa
containing your predictions on the 5
genomes with unknown structure. The files must be in FASTA format similar to
the annotation of the other genomes. Handin the report and zip-archive via
Blackboard no later than Monday, November 26, 2018, at 08:00. Your report should
be no more than 3 pages and clearly state who is in the group. It must cover:
C
, N
, and R
's into corresponding sequences
of hidden states. The transition-diagram should contain all non-0 probabilities
as computed by your training.An evaluation of the performance of your gene predictor. You must make a 5-fold cross validation on the 5 genomes with known structure:
For each genome 1 to 5, you train your model by training-by-counting on the remaining 4 genomes, and predict the gene structure on the genome you picked. You compute and report the approximate correlation coefficient (AC) between your prediction and the true annotation using the small python program compare_anns.py. Include a table with the computed ACs in your report.