Introduction
The Labeled Fasta (LFasta) format was invented by Anders Krogh and is used extensively at the Bioinformatics Centre. The description below is taken from his homepage.
In many applications of biological sequence analysis, a label is associated with each letter in a sequence. For the secondary structure of proteins, for instance, you may put an 'H' for an alpha helix, an 'E' for an (extended) beta sheet and, say, an 'x' for anything else. The 'Labeled FASTA format' allows for a string of such labels (or more than one string). For the secondary structure example it would look like this:
>1IRK._TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAV
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEE
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNP
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxx
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKG
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxx
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVT
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
Here the '#' precedes each line of labels.
Format Specification
- Each entry starts with a '>' as the first character on a line immediately followed by the name of the entry (like FASTA format). The rest of the first line is ignored. An entry is terminated by EOF or a new entry ('>').
- The following lines contain a number of sequences of the same length.
- Lines starting with '#' are 'primary' labels.
- Lines starting with '?' followed by a letter are other labels identified by that letter.
- Lines starting with '%' are comment lines (ignored).
- Other lines contain the primary sequence (usually protein, DNA or RNA).
- After the leading '#' or '?x' marker has been removed, all blanks are deleted from all sequences.
Here's an example with three label sequences: the primary labels (#), the DSSP secondary structure annotation (?1), and a track showing the helices only (?2):
>1IRK._TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
?1 ......GGGB..GGGEEEEEEEEE.SSSEEEEEEEEEEETTEEEEEEEEE...TT..HHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHHHHHHH
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
?1 HTT...TTB..EEEEE.SSSS.EEEEE..TT.BHHHHHHHTSTT.TT..S..S..HHHHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
?1 HHHHHTT...S..SGGGEEE.TT..EEE...S.SSSTTGGG.EEGGGSSEE.GGG..HHHHHH....HHH
?2 HHHHHxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
?1 HHHHHHHHHHHHHHTS..TTTTS.HHHHHHHHHTT......SS..HHHHHHHHHHT.SSGGGS..HHHHH
?2 HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
?1 HHHGGGS.TTHHHH.STTSTT..
?2 HHHxxxxxxxHHHHxxxxxxxxx
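The format rules above are simple enough to sketch as a small reader. The Python below is only an illustration and not part of the LFasta scripts or modules described on this page. The function name parse_lfasta and the dictionary layout are hypothetical, and it assumes that the entry name ends at the first blank on the '>' line and that marker characters must be the very first character of a line.

```python
def parse_lfasta(text):
    """Parse LFasta text into a list of entries (illustrative sketch).

    Each entry is a dict with keys "name", "seq" and "labels", where
    "labels" maps '#' (primary labels) or the letter after '?' to the
    concatenated label string.
    """
    entries = []
    current = None
    for line in text.splitlines():
        if line.startswith("%"):
            continue  # comment lines are ignored
        if line.startswith(">"):
            # Assumption: the name is the first whitespace-delimited token;
            # the rest of the header line is ignored.
            parts = line[1:].split(None, 1)
            current = {"name": parts[0] if parts else "", "seq": "", "labels": {}}
            entries.append(current)
        elif current is None:
            continue  # text before the first '>' belongs to no entry
        elif line.startswith("#"):
            labels = current["labels"]
            labels["#"] = labels.get("#", "") + "".join(line[1:].split())
        elif line.startswith("?") and len(line) > 1:
            key = line[1]  # the letter identifying this label track
            labels = current["labels"]
            labels[key] = labels.get(key, "") + "".join(line[2:].split())
        else:
            # Any other line is primary sequence; blanks are removed.
            current["seq"] += "".join(line.split())
    return entries
```

Because every sequence in an entry must have the same length, a caller could check len(entry["seq"]) against the length of each label string after parsing.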
LFasta related Scripts
reformat.pl
This script converts between sequence formats. It handles
Genbank, EMBL, Fasta and all the other formats supported by
bioperl. In addition it can write labeled fasta (lfa), which is a
handy extension of the fasta format developed by Anders Krogh for use
in HMM training. The labeling is generated from the sequence features
in a manner directed by the --labelkey option. The information surplus
or deficit when converting between a rich format like EMBL and Fasta
can be handled by using the gff option. This specifies a gff file that
is read from or written to, depending on which way the conversion
goes.
grepseq.pl
Extracts sub-sequences from sequences on stdin based on a (perl)
regular expression given on the command line. Input sequences must be
in labeled fasta format. By default the regexp is matched against the
labels. Note that the IDs on the output are made unique by adding an
incrementing suffix for each match in an entry. This can be avoided by
using the keepid option.
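The idea of searching the labels and cutting out the matching stretch of sequence can be sketched in a few lines of Python. This is not the code of grepseq.pl: the function name grep_labels, the keep_id parameter and the "_N" suffix style are assumptions made for illustration only.

```python
import re


def grep_labels(name, seq, labels, pattern, keep_id=False):
    """Return (id, sub-sequence, sub-labels) for each regexp match in labels.

    seq and labels must have the same length, so a match at label
    positions [start, end) selects the same positions in the sequence.
    """
    hits = []
    count = 0
    for match in re.finditer(pattern, labels):
        if match.start() == match.end():
            continue  # skip empty matches such as those produced by 'H*'
        count += 1
        # Assumed suffix style: name_1, name_2, ... unless keep_id is set.
        hit_id = name if keep_id else f"{name}_{count}"
        hits.append((hit_id, seq[match.start():match.end()], match.group(0)))
    return hits
```

For example, grep_labels("e1", "ACDEFGHIKL", "xxHHHxxHHx", "H+") yields one hit per helix, each carrying the residues that lie under the matched labels.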
addprediction.pl
This script adds a prediction track to labeled Fasta entries as
specified by a gff file. This is useful for comparing predictions.
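As a rough illustration of what adding a track involves, the Python sketch below turns a list of feature intervals into a label string of the same length as the sequence. It does not reproduce addprediction.pl: the function name track_from_features, the (start, end, symbol) tuples and the 'x' background symbol are assumptions. The only thing taken from GFF itself is that coordinates are 1-based and inclusive.

```python
def track_from_features(length, features, background="x"):
    """Build a label string of the given length from feature intervals.

    features is an iterable of (start, end, symbol) tuples using
    GFF-style 1-based inclusive coordinates. Positions not covered by
    any feature get the background symbol; later features overwrite
    earlier ones where they overlap.
    """
    track = [background] * length
    for start, end, symbol in features:
        first = max(start, 1)     # clip to the start of the sequence
        last = min(end, length)   # clip to the end of the sequence
        for pos in range(first, last + 1):
            track[pos - 1] = symbol
    return "".join(track)


def format_track(letter, track):
    """Format a label string as an extra '?x' label line."""
    return f"?{letter} {track}"
```

A track built this way could be written next to the existing label lines of an entry, for example as format_track("p", track_from_features(len(seq), features)).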
untangle.pl
This script untangles Labeled Fasta as it comes out if you treat it as
ordinary Fasta in a Seq or SeqIO object.
LFasta Modules
LFasta
An LFasta (L for labeled) object is a sequence with sequence features
placed on it. The LFasta format is a hybrid between the simple Fasta
format and the rich formats such as Genbank, EMBL and Swissprot. Along
with the sequence it holds any information that maps directly to the
plus strand of the sequence. The features are held on one or more
label lines for each sequence line. A letter represents a type of
feature, e.g. E for exons, H for helix and so on. This gives LFasta the
"grepability" of the Fasta format and a sequence feature richness
comparable to the rich Seq formats.
LFastaIO
LFastaIO is to LFasta what SeqIO is to Seq. It works in much the same
way, but only supports the filehandle emulation for input and
output. So LFastaIO->new corresponds to Bio::SeqIO->newFh. As of now
the module only supports LFasta as input format. For output formats
other than LFasta, it uses the facilities of SeqIO.