Labeled Fasta Format

Introduction

The Labeled Fasta (LFasta) format was invented by Anders Krogh and is used extensively at the Bioinformatics Centre. The description below is taken from his homepage.

In many applications of biological sequence analysis, a label is associated with each letter in a sequence. For instance, for the secondary structure of proteins you might use 'H' for an alpha helix, 'E' for an (extended) beta sheet, and 'x' for anything else. The 'Labeled FASTA format' allows for a string of such labels (or more than one string). For the secondary structure example it would look like this:

>1IRK._TRANSFERASE
   SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAV
#  xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEE
   MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNP
#  HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxx
   AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKG
#  HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxx
   DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVT
#  HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHH
   NLLKDDLHPSFPEVSFFHSEENK
#  HHHxxxxxxxHHHHxxxxxxxxx

Here the '#' precedes the sequence of labels.

Format Specification

- Each entry starts with a '>' as the first character on a line immediately followed by the name of the entry (like FASTA format). The rest of the first line is ignored. An entry is terminated by EOF or a new entry ('>').

- The following lines contain a number of sequences of the same length.

- Lines starting with '#' are 'primary' labels.

- Lines starting with '?' followed by a letter are other labels identified by that letter.

- Lines starting with '%' are comment lines (ignored).

- Other lines contain the primary sequence (usually protein, DNA or RNA).

- After the leading '#' or '?x' marker has been removed, all blanks are deleted from all sequences.

Here is an example with three label sequences: the primary labels (#), one (?1) showing the DSSP secondary structure annotation, and another (?2) showing the helices only:

>1IRK._TRANSFERASE
   SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
#  xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
?1 ......GGGB..GGGEEEEEEEEE.SSSEEEEEEEEEEETTEEEEEEEEE...TT..HHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHHHHHHH
   MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
#  HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
?1 HTT...TTB..EEEEE.SSSS.EEEEE..TT.BHHHHHHHTSTT.TT..S..S..HHHHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
   AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
#  HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
?1 HHHHHTT...S..SGGGEEE.TT..EEE...S.SSSTTGGG.EEGGGSSEE.GGG..HHHHHH....HHH
?2 HHHHHxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHxxxxHHH
   DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
#  HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
?1 HHHHHHHHHHHHHHTS..TTTTS.HHHHHHHHHTT......SS..HHHHHHHHHHT.SSGGGS..HHHHH
?2 HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
   NLLKDDLHPSFPEVSFFHSEENK
#  HHHxxxxxxxHHHHxxxxxxxxx
?1 HHHGGGS.TTHHHH.STTSTT..
?2 HHHxxxxxxxHHHHxxxxxxxxx

LFasta related Scripts

reformat.pl
This script converts between sequence formats. It handles Genbank, EMBL, Fasta and all the other formats supported by bioperl. In addition, it can write labeled Fasta (lfa), a handy extension of the Fasta format developed by Anders Krogh for use in HMM training. The labeling is generated from the sequence features in a manner directed by the --labelkey option. The information surplus or deficit when converting between rich formats like EMBL and Fasta can be handled with the gff option, which specifies a gff file that is read from or written to depending on the direction of the conversion.

grepseq.pl
Extracts sub-sequences from sequences on stdin based on a (Perl) regular expression given on the command line. Input sequences must be in labeled Fasta format. By default the regexp is matched against the labels. Note that the IDs in the output are made unique by adding an incrementing suffix for each match in an entry. This can be avoided by using the keepid option.
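
The idea of searching the label string and cutting out the matching part of the sequence can be sketched as follows. This is an illustration in Python, not the code of grepseq.pl. It uses Python regular expressions rather than Perl ones, and the function name grep_labels, the keep_id parameter and the "_N" suffix format are assumptions.

```python
import re
from typing import Iterator, Tuple


def grep_labels(name: str, seq: str, labels: str, pattern: str,
                keep_id: bool = False) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, sub-sequence, sub-labels) for every regexp match in labels."""
    count = 0
    for match in re.finditer(pattern, labels):
        start, end = match.span()
        if start == end:
            continue  # ignore empty matches
        count += 1
        # Assumed suffix scheme: name_1, name_2, ... unless keep_id is set.
        new_id = name if keep_id else f"{name}_{count}"
        yield new_id, seq[start:end], labels[start:end]
```

Because sequence and labels have the same length, the span of a match in the label string can be applied directly to the sequence. For example, the pattern "H+" would extract every helix from a secondary structure labeling.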

addprediction.pl
This script adds a prediction track to labeled Fasta entries as specified by a gff file. This is useful for comparing predictions.
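
How a label track might be built from gff records is sketched below. This is not the implementation of addprediction.pl. It assumes the standard tab-separated gff columns (sequence name, source, feature type, start, end, with 1-based inclusive coordinates). The function name prediction_track, the letters mapping from feature type to label letter, and the 'x' background are hypothetical.

```python
from typing import Dict, Iterable


def prediction_track(name: str, length: int, gff_lines: Iterable[str],
                     letters: Dict[str, str], background: str = "x") -> str:
    """Build a label string of the given length from the gff records for one entry."""
    track = [background] * length
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and gff comments or directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[0] != name or cols[2] not in letters:
            continue  # malformed line, other entry, or unmapped feature type
        start, end = int(cols[3]), int(cols[4])  # 1-based, inclusive
        # Clip the feature to the sequence and paint its letter.
        for pos in range(max(start, 1) - 1, min(end, length)):
            track[pos] = letters[cols[2]]
    return "".join(track)
```

The resulting string has the same length as the sequence, so it could be written as an extra '?x' line next to an existing labeling and compared with it position by position.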

untangle.pl
This script untangles the output you get when Labeled Fasta is read as ordinary Fasta through a Seq or SeqIO object.

LFasta Modules

LFasta
An LFasta (L for labeled) object is a sequence with sequence features placed on it. The LFasta format is a hybrid between the simple Fasta format and the rich formats such as Genbank, EMBL and Swissprot. Along with the sequence it holds any information that maps directly to the plus strand of the sequence. The features are held on one or more label lines for each sequence line. Each letter represents a type of feature, e.g. E for exons, H for helix and so on. This gives LFasta the "grepability" of the Fasta format and a sequence feature richness comparable to the rich Seq formats.
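
To illustrate how a label line encodes features, the sketch below turns a label string back into a list of feature intervals. It is written in Python and is not part of the LFasta module. The function name label_runs and the choice of 'x' as the "no feature" letter are assumptions.

```python
from itertools import groupby
from typing import List, Tuple


def label_runs(labels: str, background: str = "x") -> List[Tuple[str, int, int]]:
    """Return (letter, start, end) for each run of a non-background label.

    Coordinates are 1-based and inclusive.
    """
    runs = []
    pos = 0
    for letter, group in groupby(labels):
        length = sum(1 for _ in group)
        if letter != background:
            runs.append((letter, pos + 1, pos + length))
        pos += length
    return runs
```

Each run of identical letters corresponds to one feature of the type that letter stands for, which is what makes it possible to move between a label line and the feature tables of the rich formats.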

LFastaIO
LFastaIO is to LFasta what SeqIO is to Seq. It works in much the same way, but only supports the filehandle emulation for input and output, so LFastaIO->new corresponds to Bio::SeqIO->newFh. For now, the module only supports LFasta as input format. For output formats other than LFasta, it uses the facilities of SeqIO.