Introduction
This page contains both an explanation of the topics we cover and some exercises for you to do. Read through the page and complete the exercises at the end. Spend any remaining time playing around with the code examples in the explanatory text. Remember that the key here is learning by doing. Type some of the code examples into your IDLE editor and try them out yourself, and be curious: see what happens if you change things a bit.
Creating lists
Lists are sequences of objects, such as numbers, strings, or generally whatever you want to put into a sequence. You create an empty list like this:
L = []
and you can then manipulate it by adding and removing elements to your heart's content. The range() function that you have used several times by now also produces a sequence of numbers: in Python 2 it returns a list directly, and in Python 3 you can write list(range(n)) to get one. That is another way of creating a list.
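A minimal sketch of the ways of creating a list mentioned above. The names mixed and numbers are made up for illustration, and wrapping range() in list() is a choice made here so that the example gives a list in both Python 2 and Python 3.

```python
# An empty list, ready to be filled later
L = []

# A list written out directly; the elements can be of different types
mixed = [1, "two", 3.0]

# list() around range() gives a list in both Python 2 and Python 3
numbers = list(range(5))

print(L)        # []
print(mixed)    # [1, 'two', 3.0]
print(numbers)  # [0, 1, 2, 3, 4]
```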
List indexing
To get the ith element of a list, you can index it as L[i]. Indexing starts at zero, so L[0] is the first element, and if the list is n elements long, the last element is L[n-1]. You can index from the back by using negative numbers, so L[-1] is the last element in the list, L[-2] is the second last element and so on.
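The indexing rules can be tried out on a small list. This is only a sketch; the list contents and the variable names are made up for illustration.

```python
L = ['a', 'b', 'c', 'd']
n = len(L)            # the list is 4 elements long

first = L[0]          # 'a', indexing starts at zero
last = L[n - 1]       # 'd', the last element
alsoLast = L[-1]      # 'd', the same element counted from the back
secondLast = L[-2]    # 'c'

print(first)
print(last)
print(alsoLast)
print(secondLast)
```

Asking for an index outside the list, such as L[4] here, raises an IndexError.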
List slicing
You can get the sub-list starting at index i and ending at index j-1 using slicing, which is written L[i:j]. The result is a new list; the original list is left unchanged.
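A small sketch of slicing on a made-up list. Leaving out i or j, as in the last two slices, is not covered in the text above: a missing i means "from the start" and a missing j means "to the end".

```python
L = [10, 20, 30, 40, 50]

middle = L[1:4]   # elements at index 1, 2 and 3
head = L[:2]      # from the start up to, but not including, index 2
tail = L[2:]      # from index 2 to the end

print(middle)     # [20, 30, 40]
print(head)       # [10, 20]
print(tail)       # [30, 40, 50]
print(L)          # [10, 20, 30, 40, 50], the original is unchanged
```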
List methods
Read the documentation for lists in the reading material to get a feeling for what you can do with lists. The methods you are most likely to need for the exercises are:
- reverse(): reverses the order of a list in place, so [1,2,3] becomes [3,2,1]. If you want to remove elements from a list, it is much faster to remove them from the end than from the beginning (for reasons we won't go into here), so sometimes it makes sense to first reverse the list and then pop() from the reversed list rather than from the original. That being said, it is typically easier to create new lists than to modify existing ones.
- pop(): removes the last element of a list and returns it.
- sort(): does exactly that; it sorts the elements of the list in place.
- append(): adds an element to the end of the list.
There are plenty more, though, so go see for yourself.
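The four methods listed above can be tried out on a small made-up list. Note that append(), sort() and reverse() change the list itself and return None, while pop() returns the element it removed.

```python
L = [3, 1, 2]

L.append(5)       # L is now [3, 1, 2, 5]
L.sort()          # L is now [1, 2, 3, 5]
L.reverse()       # L is now [5, 3, 2, 1]
removed = L.pop() # removes and returns the last element

print(removed)    # 1
print(L)          # [5, 3, 2]
```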
Exercise
In this exercise you will work with an HIV sequence. It is an HIV-1 sequence, but we also want to know what type it is. To help us type our sequence, we have access to a number of database sequences of each of the types. By comparing our HIV sequence to the sequences of each type, we can identify the most similar sequence, which is then assumed to be of the same type as our HIV sequence.
First download the module exerciseWeek3.py (if you have not done so already) and put it in the folder where you keep your other Python code. On most computers you can right-click on the link and have it "save files as...". Have a look at what is in the module. Import it like this:
import exerciseWeek3
This module gives you access to exerciseWeek3.alignedHivSeq, which is the sequence we want to type. The lists exerciseWeek3.typeA, exerciseWeek3.typeB, exerciseWeek3.typeC, and exerciseWeek3.typeD contain database HIV sequences of different known types. All sequences are from the same multiple alignment. This means that sequence positions match up across all sequences, but also that a lot of gap characters ('-') have been inserted.
Use the function pairWiseDifferences(seq1, seq2) you implemented in the strings exercise to find the database sequence that is most similar to exerciseWeek3.alignedHivSeq to determine its type.
Before doing so, you need to modify pairWiseDifferences(seq1, seq2) so that it does not consider sequence positions where both sequences have a '-' character. In other words, you need to count not only the differences but also how many alignment columns are relevant, i.e. are not '-' in both sequences.
Hint: use an if-statement in pairWiseDifferences with a boolean expression like this:
not (seq1[i] == '-' and seq2[i] == '-')
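To see what this expression selects, here is a small sketch on two made-up aligned strings. It only collects the indexes of the relevant columns; counting the differences is left for your own function. The names seq1, seq2 and relevantColumns and the toy sequences are assumptions for illustration.

```python
# Two made-up aligned sequences of the same length
seq1 = "AC--GT"
seq2 = "A-C-GA"

relevantColumns = []
for i in range(len(seq1)):
    # Skip the column only when BOTH sequences have a gap there
    if not (seq1[i] == '-' and seq2[i] == '-'):
        relevantColumns.append(i)

print(relevantColumns)  # [0, 1, 2, 4, 5]
```

Column 3 is left out because both sequences have a gap there. Columns 1 and 2 are kept because only one of the two sequences has a gap.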
Once your function has computed both the number of differences and the number of relevant positions you can have it return the fraction of different bases as nrDifferences/float(nrRelevantPositions).
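The float() in that expression matters in Python 2, where dividing one integer by another with / throws away the decimals. A small sketch with made-up counts, using // to show the floored result in a way that behaves the same in Python 2 and Python 3:

```python
# Made-up counts for illustration
nrDifferences = 3
nrRelevantPositions = 4

# Converting one operand to float gives the fraction we want
fraction = nrDifferences / float(nrRelevantPositions)

# Integer (floor) division drops the decimals
floored = nrDifferences // nrRelevantPositions

print(fraction)  # 0.75
print(floored)   # 0
```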
To type your HIV sequence, you need to compare it to all the database sequences to see which group has the best matching sequence.
Hint: For each list, e.g. exerciseWeek3.typeA, you can go through the database sequences with a for loop to calculate the similarities.
You can define a list to hold the results for each type like this:
resultsTypeA = []
and then append results to the list as you calculate them in the for loop. Look up the append() method. Once you are done, you need to find the smallest result in the list (fewest differences). Try the min() function.
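The pattern of appending in a for loop and then taking the minimum can be sketched on toy numbers instead of sequences. Everything here is hypothetical: toyDistance is a stand-in for your own pairWiseDifferences, and query and toyDatabase stand in for the HIV sequence and one of the type lists.

```python
def toyDistance(a, b):
    # Hypothetical stand-in for your own pairWiseDifferences function
    return abs(a - b)

query = 10                  # stands in for the sequence to type
toyDatabase = [4, 11, 17]   # stands in for one list of database sequences

results = []
for entry in toyDatabase:
    # Compare the query to each database entry and store the result
    results.append(toyDistance(query, entry))

best = min(results)

print(results)  # [6, 1, 7]
print(best)     # 1
```

Repeating this for each type gives one smallest value per type, and the type with the lowest of those has the best matching sequence.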