tqDist - a triplet and quartet distance library
Distance measures between trees are useful for comparing trees in a systematic manner and several different distance measures have been proposed. The triplet and quartet distances, for rooted and unrooted trees, are defined as the number of subsets of three or four leaves, respectively, where the topologies of the induced sub-trees differ. These distances can trivially be computed by explicitly enumerating all sets of three or four leaves and testing if the topologies are different, but this leads to time complexities at least of the order n3 or n4 just for enumerating the sets. The different topologies can be counted implicitly, however, and using this tqDist computes the triplet distance between rooted trees in O(n log n) time and the quartet distance between unrooted trees in O(dn log n) time, where d degree of the tree with the smallest degree.
If you use tqDist
in your research, please cite it in
the following way:
- Andreas Sand, Morten K. Holt, Jens Johansen, Gerth
Stølting Brodal, Thomas Mailund and Christian N.S. Pedersen;
tqDist: A Library for Computing the Quartet and
Triplet
Distances Between Binary or General Trees, Bioinformatics, 2014, xx(yy), pp. ii-jj, doi: 10.1093/bioinformatics/btu157.
You can e.g. use this BibTex entry.
The datasets used for the performance experiments in this Applications Note are available for download at http://birc.au.dk/~cstorm/software/tqDist/data/.
Source files
- Source: tqDist-1.0.2.tar.gz or tqDist-1.0.2.zip
- R-bindings are available via CRAN: https://CRAN.R-project.org/package=Quartet. Thanks to Martin R. Smith for providing these.
Building and installing
To install tqDist from source (on Linux or MacOS) download and unpack the source code, and execute the following commands in a terminal:
$ tar -xvf tqDist-1.0.2.tar.gz
$ cd tqDist-1.0.2/
$ cmake .
$ make
$ make test
$ make install
If you do not have permisions to install the package in the default location, you need to replace the fourth command above by something on this form:
Finally, if CMake is not installed on your system, you can download the newest version from http://cmake.org/.
Executables
The following executables are installed with tqDist:
triplet_dist
Usage: triplet_dist [-v] <filename1> <filename2>
Where <filename1>
and
<filename2>
point to two files each containing
one tree in Newick format. In both trees all leaves should be
labeled and the two trees should have the same set of leaf labels.
The triplet distance between the two trees will be printed to
stdout.
If the -v
option is used, the following numbers will be reported (in this
order):
- The number of leaves in the trees (should be the same for both).
- The number of triplets in the two trees (n choose 3).
- The triplet distance between the two trees.
- The normalized triplet distance between the two trees.
- The number of resolved triplets that agree in the two trees.
- The normalized number of resolved triplets that agree in the two trees.
- The number triplets that are unresolved in both trees.
- The normalized number triplets that are unresolved in both trees.
quartet_dist
Usage: quartet_dist [-v] <filename1> <filename2>
Where: <filename1>
and
<filename2>
point to two files each containing
one tree in Newick format. In both trees all leaves should be
labeled and the two trees should have the same set of leaf labels.
The quartet distance between the two trees will be printed to
stdout.
If the -v
option is used, the following numbers will be reported (in this
order):
- The number of leaves in the trees (should be the same for both).
- The number of quartets in the two trees (n choose 4).
- The quartet distance between the two trees.
- The normalized quartet distance between the two trees.
- The number of resolved quartets that agree in the two trees.
- The normalized number of resolved quartets that agree in the two trees.
- The number of quartets that are unresolved in both trees.
- The normalized number of quartets that are unresolved in both trees.
pairs_triplet_dist
Usage: bin/pairs_triplet_dist [-v] <filename1> <filename2> [<output filename>]
Where: <filename1>
and
<filename2>
point to two files each containing the
same number of trees in Newick format. The two trees on line i in the
two files must have the same set of leaf labels. The output is a list
of numbers, where the i'th number is the triplet distance between the
two trees on line i in the two files. If [output filename]
is
specified the output is written to the file pointed to (if the file
already exists the current content is deleted first), otherwise the
output is written to stdout.
If the -v
option is used, the following numbers will be reported for
each pair of trees (in this order):
- The number of leaves in the trees (should be the same for both).
- The number of triplets in the two trees (n choose 3).
- The triplet distance between the two trees.
- The normalized triplet distance between the two trees.
- The number of resolved triplets that agree in the two trees.
- The normalized number of resolved triplets that agree in the two trees.
- The number of triplets that are unresolved in both trees.
- The normalized number of triplets that are unresolved in both trees.
pairs_quartet_dist
Usage: bin/pairs_quartet_dist [-v] <filename1> <filename2> [<output filename>]
Where: <filename1>
and
<filename2>
point to two files each containing the
same number of trees in Newick format. The two trees on line i in the
two files must have the same set of leaf labels. The output is a list
of numbers, where the i'th number is the quartet distance between the
two trees on line i in the two files. If [output filename]
is
specified the output is written to the file pointed to (if the file
already exists the current content is deleted first), otherwise the
output is written to stdout.
If the -v
option is used, the following numbers will be reported for
each pair of trees (in this order):
- The number of leaves in the trees (should be the same for both).
- The number of quartets in the two trees (n choose 4).
- The quartet distance between the two trees.
- The normalized quartet distance between the two trees.
- The number of resolved quartets that agree in the two trees.
- The normalized number of resolved quartets that agree in the two trees.
- The number of quartets that are unresolved in both trees.
- The normalized number of quartets that are unresolved in both trees.
all_pairs_triplet_dist
Usage: all_pairs_triplet_dist <input filename> [output filename]
Where: <input filename>
is the name of a file
containing multiple trees in Newick format. Each tree should be on a
seperate line. In each tree all leaves should be labeled and all trees
should have the same set of leaf labels. If [output filename]
is
specified the output is written to the file pointed to (if the file
already exists the current content is deleted first), otherwise the
output is written to stdout. The output is a lower triangular matrix
in which the i, j'th entry is the pairwise triplet distance between
the tree on line i and the tree on line j in <input
filename>
.
all_pairt_quartet_dist
Usage: all_pairs_quartet_dist <input filename> [output filename]
Where: <input filename>
is the name of a file
containing multiple trees in Newick format. Each tree should
be on a seperate line. In each tree all leaves should be
labeled and all trees should have the same set of leaf labels.
If [output filename]
is specified the output is written to the
file pointed to (if the file already exists the current
content is deleted first), otherwise the output is written to
stdout. The output is a lower triangular matrix in which the
i, j'th entry is the pairwise quartet distance between the
tree on line i and the tree on line j in <input
filename>
.
File format
All input files for tqDist should contain trees encoded in Newick
format. Here is couple of examples that can be used with the
quartet_distance
and triplet_distance
executables:
And here is an example of a file containing multiple trees, which can
be used with the all_pairs_quartet_distance
,
all_pairs_triplet_distance
,
pairs_triplet_distance
and
pairs_triplet_distance
executables:
For details on how to encode trees in Newick format please see http://en.wikipedia.org/wiki/Newick_format.
Literature
The algorithms implemented in tqDist
was developed by Andreas
Sand, Gerth
Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian
N. S. Pedersen, Jens Johansen and Morten K. Holt, and they are
explained in the following two papers:
- Efficient Algorithms for Computing the Triplet and
Quartet Distance Between Trees of Arbitrary Degree, Gerth
Stølting Brodal, Rolf Fagerberg, Christian
N.S. Pedersen,
Thomas Mailund and Andreas Sand, Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1814-1832, 2013. - Algorithms for Computing the Triplet and Quartet
Distances for Binary and General Trees, Andreas Sand, Morten
K. Holt, Jens Johansen, Rolf Fagerberg, Gerth Stølting
Brodal,
Christian N. S. Pedersen and Thomas Mailund, Biology, 2013, 2(4), 1189-1209
The implementation is desribed in:
- On the Scalability of Computing Triplet and Quartet Distances, Morten K. Holt, Jens Johansen and Gerth Stølting Brodal, to appear at ALENEX 2014.
Contact
If you encounter any problems or have questions about using
tqDist
, please contact Christian N. S. Pedersen.