CTree - analysis of clusters on phylogenetic trees

Introduction
CTree has been designed by John Archer and David Robertson for viewing, analyzing and editing phylogenetic trees. There is a particular emphasis on the analysis of clusters within such trees. Clusters are stored as individual data structures from which statistical data can be easily extracted.

Clusters can be populated either manually or via a novel heuristic algorithm that automatically identifies clusters on a tree (if they exist). The former is useful when previously published clusters are available, while the latter is useful when control clusters across many random trees are required or when there are no previously published clusters. A detailed description of the heuristic algorithm is available in [4].

Treating clusters in this way makes it easy to compare topologies between two or more trees in terms of both inter- and intra-cluster diversity. CTree can be downloaded here.

Screen Shots

[Screenshots: CTree Work Area; A Radial Tree; A Square Tree; Pairwise Distance Output For Clusters]

Novel Features
Some of the more novel features include:

  • Automatically defining clusters on the tree using a heuristic cluster finding algorithm.
  • Manually defining clusters.
  • Displaying various tree statistics including Subtype Diversity Ratio (SDR) (a) and Subtype Diversity Variance (SDV) (b).
  • Generation and sampling from random trees.
  • Calculation of the SDR and SDV distributions over 'x' number of random trees.
  • Finding the Center Of The Tree (COT) (c).

(a) The SDR is defined as the ratio of the mean intra-cluster pairwise distance to the mean inter-cluster pairwise distance [1]. Low intra-cluster pairwise distances relative to inter-cluster pairwise distances imply more defined clustering, so the SDR is a quantitative measure of the extent of clustering found within the tree.
(b) The SDV is a measure of the variation in the ratio of the mean intra-cluster pairwise distance to the mean inter-cluster pairwise distance, calculated for each cluster on the tree [2].
(c) The COT is the point on a tree with the smallest average distance from each of the strains on the tree [3].
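
The SDR and SDV definitions above can be illustrated with a short Python sketch. It is not CTree's implementation (CTree is written in Java). The function names, the pooling of all intra- and inter-cluster pairs before averaging, and the use of the population variance of the per-cluster ratios for the SDV are all assumptions made for illustration. The exact formulations are given in [1] and [2].

```python
from itertools import combinations
from statistics import mean, pvariance


def pair(a, b):
    """Order-independent key for a pair of strains."""
    return frozenset((a, b))


def sdr(dist, clusters):
    """Mean intra-cluster distance divided by mean inter-cluster distance.

    Assumption: all intra-cluster pairs (and all inter-cluster pairs)
    are pooled before averaging. Every cluster needs >= 2 strains.
    """
    intra = [dist[pair(a, b)]
             for members in clusters.values()
             for a, b in combinations(members, 2)]
    inter = [dist[pair(a, b)]
             for c1, c2 in combinations(list(clusters.values()), 2)
             for a in c1 for b in c2]
    return mean(intra) / mean(inter)


def sdv(dist, clusters):
    """Variance of the per-cluster intra/inter ratio.

    Assumption: population variance over one ratio per cluster.
    """
    ratios = []
    for name, members in clusters.items():
        others = [s for n, c in clusters.items() if n != name for s in c]
        intra = mean([dist[pair(a, b)] for a, b in combinations(members, 2)])
        inter = mean([dist[pair(a, b)] for a in members for b in others])
        ratios.append(intra / inter)
    return pvariance(ratios)


# Hypothetical four-strain example with two clusters.
dist = {pair("A1", "A2"): 0.25, pair("B1", "B2"): 0.75}
for a in ("A1", "A2"):
    for b in ("B1", "B2"):
        dist[pair(a, b)] = 2.0
clusters = {"A": ["A1", "A2"], "B": ["B1", "B2"]}

print(sdr(dist, clusters))  # 0.25
print(sdv(dist, clusters))  # 0.015625
```

In this toy example the strains within each cluster are much closer to each other than to the other cluster, so the SDR is well below 1.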

Standard Tree Viewing/Editing Features
Standard features include:

  • Ability to load multiple trees from a single file.
  • Re-rooting the tree.
  • Obtaining information such as pairwise distances between strains.
  • Swapping the order of sibling strains.
  • Manually removing strains (or groups of strains) from the tree.
  • Removing strains randomly from the tree.
  • An improved search interface that allows the user to color strains based on comma-delimited search criteria.
  • Basic coloring of the tree.
  • Displaying various attributes on the tree such as strain labels, bootstrap values and evolutionary distances.
  • Allowing the user to obtain lists of strains within a user specified proximity to each other.
  • Allowing the user to define the distance covered by the scale bar associated with the tree.
  • Saving trees in publication-quality PDF format (using the iText library [5]), Newick format or Java binary format (so that edits can be reloaded at a later date).
  • CTree deals well with large trees.

Defining Clusters
Automatically Defining Clusters
Automatic designation of clusters is based on an algorithm that chooses the cluster set that minimizes the SDR. This method of picking clusters automatically was first described in [1], and the algorithm is explained in detail in [4]. To use it, select "Pick Clusters Automatically" under the "Cluster" menu. A popup box will then ask you to enter the sensitivity of the process. This is the percentage of the tree's diameter that is used as the incrementing step when defining the radius of potential clusters. The lower the value, the more accurate but slower the process, and very low values on large trees will take a very long time. We found that for a tree with 100 to 200 nodes a value of 2 to 5% is sufficient. After you enter this value you will be asked for the minimum number of strains allowed in a cluster. Once this is entered the program will try to find clusters on the tree. This feature has been limited to trees containing fewer than 126 taxa.
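
To make the effect of the sensitivity value concrete, the following Python sketch lists the candidate cluster radii that a given sensitivity would produce. It is only an illustration: the function name is hypothetical, and the assumption that the radii run from one step up to the full tree diameter is not taken from CTree itself (see [4] for the actual algorithm).

```python
def candidate_radii(diameter, sensitivity_percent):
    """Radii to try, stepping by a percentage of the tree diameter.

    Assumption: the radii run from one step up to the full diameter.
    """
    if not 0 < sensitivity_percent <= 100:
        raise ValueError("sensitivity must be in the range (0, 100]")
    step = diameter * sensitivity_percent / 100.0
    # Small tolerance guards against floating-point rounding in the division.
    n_steps = int(100.0 / sensitivity_percent + 1e-9)
    return [step * i for i in range(1, n_steps + 1)]


# A lower sensitivity value means more radii to evaluate.
print(len(candidate_radii(1.0, 5)))  # 20
print(len(candidate_radii(1.0, 2)))  # 50
```

Under these assumptions, halving the sensitivity value doubles the number of radii that must be evaluated, which is why very low values become slow on large trees.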

Manually Defining Clusters
To manually define a cluster, select the "Manually Create Cluster Data Structure" button on the left-hand side of the tree panel, then select the node that will be the root of the cluster (a). Note that clusters are created away from the tree root, so it is helpful to place the tree root somewhere near the center of the tree, e.g. at the COT. This prevents unexpected clustering. The tree root itself should not be selected as a cluster root.

(a) The root node for the cluster is the common ancestor of all strains within that cluster.
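
The reason the position of the tree root matters can be seen in a small Python sketch. The tree representation (a dictionary mapping each node to its children), the node names and the function name are hypothetical and chosen only for illustration. The sketch assumes that a cluster consists of every strain found by walking away from the tree root, starting at the selected node.

```python
def strains_below(children, node):
    """Collect the strains (leaves) reached from node, moving away from the root."""
    kids = children.get(node, [])
    if not kids:
        return [node]  # a node without children is a strain
    found = []
    for kid in kids:
        found.extend(strains_below(children, kid))
    return found


# Root placed centrally: X joins A and B, Y joins C and D.
centered = {"R": ["X", "Y"], "X": ["A", "B"], "Y": ["C", "D"]}

# Same unrooted tree, but rooted on the branch leading to strain A.
rerooted = {"R": ["A", "X"], "X": ["B", "Y"], "Y": ["C", "D"]}

print(strains_below(centered, "X"))  # ['A', 'B']
print(strains_below(rerooted, "X"))  # ['B', 'C', 'D']
```

Selecting the same internal node X yields a different cluster once the root has moved, which is the kind of unexpected clustering that a centrally placed root helps to avoid.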

Usage Example
A comparison, made using CTree, of the topologies of phylogenetic trees for HIV-1 groups M and O is presented in [2].

Input Data
Files containing random trees can be generated within the program itself using the "Generate File Of 'x' Random Trees" option under the "Tree" menu (or by generating a single random tree and saving it). In addition, any tree in Newick (.ph/.phb) format can be used. This is a fairly standard output format for many tree-building programs and libraries such as Phylip, ClustalX/W and the Phylogenetic Analysis Library (PAL). Here are some sample datasets to get a new user started:
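
For readers unfamiliar with Newick, the following Python sketch parses a small Newick string and computes the pairwise distances between its strains. It is a minimal illustration and not CTree's parser. The sample tree and all function names are hypothetical, quoted labels and bracketed comments are not handled, and branch lengths are assumed to be non-negative.

```python
from itertools import combinations


def parse_newick(text):
    """Parse a Newick string into (root, parent, length) maps.

    Internal nodes get synthetic ids such as "#1"; their labels
    (for example bootstrap values) are read but ignored here.
    """
    text = "".join(text.split()).rstrip(";")
    parent, length = {}, {}
    pos = 0
    internal_count = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_token():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        return text[start:pos]

    def read_node():
        nonlocal pos, internal_count
        children = []
        if peek() == "(":
            pos += 1  # consume "("
            children.append(read_node())
            while peek() == ",":
                pos += 1  # consume ","
                children.append(read_node())
            pos += 1  # consume ")"
        label = read_token()
        if children or not label:
            internal_count += 1
            name = f"#{internal_count}"
        else:
            name = label
        if peek() == ":":
            pos += 1  # consume ":"
            length[name] = float(read_token())
        else:
            length[name] = 0.0
        for child in children:
            parent[child] = name
        return name

    root = read_node()
    return root, parent, length


def path_to_root(node, parent, length):
    """Map the node and each of its ancestors to its distance from the node."""
    travelled, path = 0.0, {node: 0.0}
    while node in parent:
        travelled += length[node]
        node = parent[node]
        path[node] = travelled
    return path


def pairwise_distances(newick):
    """Distance along the tree between every pair of strains."""
    _, parent, length = parse_newick(newick)
    internal = set(parent.values())
    leaves = [n for n in length if n not in internal]
    paths = {leaf: path_to_root(leaf, parent, length) for leaf in leaves}
    result = {}
    for a, b in combinations(leaves, 2):
        # With non-negative branch lengths, the closest shared ancestor
        # gives the smallest summed distance.
        result[frozenset((a, b))] = min(
            paths[a][n] + paths[b][n] for n in paths[a] if n in paths[b])
    return result


sample = "((A:0.25,B:0.5)90:0.25,(C:0.5,D:0.75)85:1.0);"
distances = pairwise_distances(sample)
print(distances[frozenset(("A", "B"))])  # 0.75
print(distances[frozenset(("A", "C"))])  # 2.0
```

In the sample string, the values after each colon are branch lengths and the values 90 and 85 after the closing parentheses are internal node labels such as bootstrap values.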

Here are the HIV-1 reference trees for the gag, pol and env genomic regions:

Remember that CTree deals well with large trees, but if a tree is very large some tasks, including finding clusters heuristically and finding the center of the tree, may take some time.

Download and Installation
An executable jar file is available here.

If the file does not execute, it is probably because you do not have the latest Java Runtime Environment (JRE) installed. If so, it can be downloaded here.

Implementation and Compatibility
CTree was designed and implemented by John Archer (PhD student) under the supervision of Dr. David Robertson. The webpage for David Robertson's research group can be found here. The program was implemented using the Java SDK (version 1.5.0_10) and so is platform independent. Testing has been done on both PC and Macintosh.

References
1. Rambaut A, Robertson DL, Pybus OG, Peeters M, Holmes C. Phylogeny and the origin of HIV-1. Nature 2001; 410:1047-1048.

2. Archer J, Robertson DL. Understanding the Diversification of the HIV-1 Groups. AIDS 2007; In Press.

3. Nickle DC, Jensen MA, Gottlieb GS, et al. Consensus and Ancestral State HIV Vaccines. Science 2003; 299:1515-1517.

4. Archer J, Robertson DL. CTree: comparison of clusters between phylogenetic trees made easy. Bioinformatics 2007; In Press.

5. iText Library - Copyright (C) 1999-2006 by Bruno Lowagie and Paulo Soares. All Rights Reserved.

Citing
CTree can be cited using [4].