Segminator - A tool for the analysis of viral data generated using the 454 Life Sciences sequencing platform

Introduction
Segminator is a tool for extracting data from reads generated using the 454 Life Sciences sequencing platform [1]. A number of options are available to the user through the interface, including:

  • Obtaining nucleotide and amino acid residue frequencies at sites across the template.

  • Generating a consensus across a data set using overlapping windows (while allowing for data-specific indels relative to the template).

  • Removal of reads based on Hamming distance cutoff thresholds.

  • Multiple alignment of reads spanning a particular region of the template, which can then be used to infer phylogenetic trees.

  • Identification of variants within viral populations.

Further details are described in [2].

Download and Installation
A jar file is available here.

The jar should be launched from the directory in which it resides, via the command line, using the command java -jar -Xmx2000M segminator_v1.3.3.jar. If available memory is limited, the memory allocation can be lowered by reducing the -Xmx2000M parameter value. If available memory is not limited, the value can be increased to accommodate larger numbers of reads.
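
As a small illustration of the memory setting, the sketch below builds the launch command for a chosen heap size. The helper name build_launch_command is hypothetical; the jar name and the option order are taken from the command above.

```python
def build_launch_command(heap_mb=2000, jar="segminator_v1.3.3.jar"):
    """Return the argument list for launching the jar with a given heap ceiling.

    build_launch_command is a hypothetical helper, not part of Segminator.
    """
    if heap_mb <= 0:
        raise ValueError("heap_mb must be a positive number of megabytes")
    # -Xmx sets the maximum Java heap size; the value here is in megabytes.
    return ["java", "-jar", f"-Xmx{heap_mb}M", jar]


# Lower the ceiling on a machine with little memory, raise it for many reads.
small = build_launch_command(512)
large = build_launch_command(4000)

# To actually start the program from the jar's directory one could run:
# import subprocess; subprocess.run(large, check=True)
```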

If the file does not execute, the most likely cause is that the latest Java Runtime Environment (JRE) is not installed. This can be downloaded here.

Case Study
A typical usage scenario is presented in Archer et al. (2009) [3], where the software was used to detect and visualize low-frequency forms resistant to the HIV drug maraviroc.

Sample Data and Input
Sample data is available within the software and can be loaded using the Sample Data button on the welcome screen. It has been constructed from the gp120 gene of the HXB2 strain of HIV using MetaSim.

User data can be loaded using the File menu. Input is in the form of reads in FASTA format. A suitable template in FASTA format must be loaded first, after which the reads can be loaded.

Further 454 sample data is available from the NCBI's website. When downloading this data, the FASTA format should be chosen. Future versions of the software will incorporate an option for loading FASTQ data, which includes Phred quality scores. Remember to load a suitable template before loading FASTA files.

Note: when using the multiple alignment option on the Matched Tab, make sure that the selected region is spanned by reads; if it is not, no multiple alignment will be possible. This matters because the reads are short. For example, if the template is 1500 nucleotides in length and the user chooses to align reads that span positions 3 to 600, no data will be extracted: the mean read length is 200, so no read spans this region. To look at all reads within such a large region, use the P ALIGN option.
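
The spanning requirement in the note above can be sketched as a simple filter. This is a minimal illustration, not Segminator's own code: the MappedRead type, the spanning_reads function and the read coordinates are all hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from typing import List, NamedTuple


class MappedRead(NamedTuple):
    name: str
    start: int  # first template position covered (1-based, inclusive)
    end: int    # last template position covered (1-based, inclusive)


def spanning_reads(reads: List[MappedRead], region_start: int, region_end: int) -> List[MappedRead]:
    """Keep only the reads that cover every position of the selected region."""
    return [r for r in reads if r.start <= region_start and r.end >= region_end]


# Hypothetical reads of roughly 200 nucleotides on a 1500-nucleotide template.
reads = [
    MappedRead("read_1", 1, 200),
    MappedRead("read_2", 150, 350),
    MappedRead("read_3", 420, 610),
]

wide = spanning_reads(reads, 3, 600)      # empty: no read is long enough
narrow = spanning_reads(reads, 160, 190)  # read_1 and read_2 span this region
```

A region narrower than the typical read length is therefore the practical choice for the multiple alignment option.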

Design and Implementation
John Archer and David Robertson designed Segminator with help from Andrew Rambaut. The program was implemented using the Java SDK and runs on both Mac OS X and Windows operating systems (the limiting factor on other operating systems is the inclusion of external binaries that are operating system dependent). All dependent binaries [5-7] are included within the jar, so no complex setup is required.

We would like to thank Marilyn Lewis for providing further discussion.

Screen Shots
Figure 1: Welcome screen.
Figure 2: Pairwise indexed reads.
Figure 3: Multiple alignment of a narrow region as well as amino acid translations.
Figure 4: Maximum likelihood tree of the alignment displayed in Figure 3.

Parameters
Various parameters can be set within the program through pop-up boxes that are accessible via the Parameter menu or by selecting a specific data analysis option. The following is a list of the available options. The default setting is displayed in brackets. Where a parameter is not self-explanatory, a description is given.

  • Sequence Font:
    1. Sequence Font (12): Sets the size of the sequence display font.

  • Sample Size:
    1. Sample Size (100%): Percent of initial data to be randomly selected and processed. This option allows the user to work with a subset of the data if the number of reads is very large.

  • Pairwise Alignment: The Smith-Waterman algorithm is used to perform pairwise alignments.
    1. Gap Opening (-4): Gap opening penalty for the pairwise alignment process.
    2. Extension Divisor (4): Amount by which the gap penalty is reduced once the initial gap has been inserted.
    3. Transversion (-2): Penalty for a transversion.
    4. Transition Divisor (2): Amount by which the transversion penalty is reduced to reflect a transition.
    5. Match (1): Score for a match.

  • Word Matching: Word matching is used to rapidly locate the most probable location of individual reads across the template. It is the same algorithm that BLAST initially uses to rapidly compare sequences and is described in [4]. It is not an alignment, just a guide for later steps.
    1. Word Size (5): Defines the size of the words used for the matching process.
    2. Read Quality (4): Defines the minimum word density that an individual read must provide; reads below it are discarded as having low identity to the template.
    3. Min. Read Length (70): Defines the minimum read length used in the word matching process. Reads shorter than the user-specified length are removed from the analysis.
    4. Max. Read Length (350): Same as above, but for the maximum read length.

  • Multiple Alignment: Muscle [5] is used to perform multiple alignments within Segminator. The default parameters are used. Additional parameters that can be set within Segminator are:
    1. Region Start (1): This must be set by the user. It specifies the start of the region; reads that fully cover the region are extracted and truncated to it.
    2. Region End (2): This must be set by the user. It specifies the end of the region; reads that fully cover the region are extracted and truncated to it.
    3. Indel Threshold (90): Specifies the percentage of gaps allowed in a column before a correction is made. Columns with a gap percentage above the threshold (gaps caused by insertions) are stripped, while in columns with a gap percentage below 100% minus the threshold (deletions) each gap is replaced by N.
    4. Display Translations (yes): The user can choose whether or not it is appropriate to display translations corresponding to the multiple alignment.

  • PhyML: Maximum likelihood trees within Segminator are constructed using PhyML [6]. The default parameters are used. Changes can be made to:
    1. PhyML Trans/Transv (4): Specifies how PhyML treats transitions and transversions (the transition/transversion ratio).
    2. PhyML Model (HKY): Specifies the evolutionary model used. The options are HKY, JC69, K80, F81, F84, TN93 and GTR.

  • Miscellaneous:
    1. Hamming Threshold (35): During translation, any sequence with a Hamming distance from the translated template greater than this value is removed from the analysis. This is a way to identify uncorrected frameshift errors within the reads.
    2. Min. Cover Highlight (50): Areas with coverage below this value are highlighted in red.
    3. Max. Cover Highlight (70): Areas with coverage below this value are highlighted in blue.
    4. Condense Alignment (yes): Identical reads or sequences within the word matching, pairwise and multiple alignments are removed and represented by a single sequence. The number of original sequences is stored in the title.
    5. Use Read Index as Titles (yes): If this is selected, the indexes of reads that are kept during the word matching process are used as titles instead of the original read titles.
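
Two of the numeric settings above, the pairwise alignment divisors and the Hamming threshold, can be made concrete with a short sketch. It rests on assumptions the text does not confirm: that each divisor literally divides the corresponding penalty (so a gap extension costs -4/4 = -1 and a transition costs -2/2 = -1), and that the Hamming threshold is a count of differing residues between sequences of equal length. All function names are hypothetical and this is not Segminator's own code.

```python
# Default values quoted from the parameter list above.
GAP_OPENING = -4
EXTENSION_DIVISOR = 4
TRANSVERSION = -2
TRANSITION_DIVISOR = 2
MATCH = 1
HAMMING_THRESHOLD = 35

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b):
    """Score one aligned nucleotide pair (assumes the divisor divides the penalty)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return MATCH
    # A transition swaps two purines or two pyrimidines.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return TRANSVERSION / TRANSITION_DIVISOR
    # Everything else, including ambiguity codes, is scored as a transversion here.
    return TRANSVERSION


def gap_score(length):
    """Score a gap: one opening penalty plus a reduced penalty per extension."""
    if length <= 0:
        return 0
    return GAP_OPENING + (length - 1) * (GAP_OPENING / EXTENSION_DIVISOR)


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def keep_translation(read_aa, template_aa, threshold=HAMMING_THRESHOLD):
    """Keep a translated read unless its distance exceeds the threshold."""
    return hamming_distance(read_aa, template_aa) <= threshold
```

Under these assumptions a three-position gap scores -4 + 2 × (-1) = -6, and a translated read is removed only when more than 35 of its residues differ from the translated template.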

Other Features
Consensus: If the user does not have a template that is representative of the data, word matching can be performed against a generic template that is representative of the genomic region (for example, one obtained from a reference strain). Following this, a multiple alignment can be constructed in narrow overlapping windows, and a consensus is taken within each window. These consensus sequences can then be pieced together to form a representative template that takes data-specific indels into account. The parameters for this process are:

  1. Window Size (70): Window size within which multiple alignments will be generated.
  2. Overlap (20): The overlap between the windows.
  3. Max % Gaps (49): The percentage of gaps allowed within columns of the generated alignments. Columns with a gap percentage above this value will be removed.
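
To show how the window size and overlap interact, the sketch below lists the windows that would tile a template. It assumes each window starts (window size minus overlap) positions after the previous one and that the final window is simply cut off at the end of the template; how Segminator treats the last window is not stated here. The function name overlapping_windows is hypothetical.

```python
def overlapping_windows(template_length, window_size=70, overlap=20):
    """Return 1-based inclusive (start, end) windows that tile the template."""
    if template_length <= 0 or window_size <= 0:
        raise ValueError("template_length and window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be at least 0 and smaller than window_size")
    step = window_size - overlap  # distance between consecutive window starts
    windows = []
    start = 1
    while True:
        end = min(start + window_size - 1, template_length)
        windows.append((start, end))
        if end == template_length:
            break
        start += step
    return windows


# With the defaults, a 200-nucleotide template yields four windows,
# each sharing 20 positions with its neighbour.
windows = overlapping_windows(200)
```

With the default settings the windows start every 50 positions, so consecutive consensus sequences share 20 positions that can be used when piecing them together.
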

Bar Coding: Reads tagged with a particular bar code can be extracted from a FASTA file using the Bar Code menu. The user selects the FASTA file containing the read data, enters the bar code that identifies the reads of interest and selects the location to save the extracted reads to. The extracted reads will be saved (minus the bar code) in FASTA format. If a file contains multiple different bar codes, the process should be repeated for each bar code for which the user requires data.
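
The bar code extraction described above can be sketched as follows. This is an illustration only: it assumes the bar code sits at the very start of each read and is matched exactly, ignoring case, which may differ from what the Bar Code menu does. The function names and example reads are hypothetical.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def extract_barcoded(records, barcode):
    """Yield reads that begin with the bar code, with the bar code removed."""
    if not barcode:
        raise ValueError("the bar code must not be empty")
    barcode = barcode.upper()
    for title, sequence in records:
        if sequence.upper().startswith(barcode):
            yield title, sequence[len(barcode):]


def to_fasta(records):
    """Format (title, sequence) pairs as FASTA text."""
    return "".join(f">{title}\n{sequence}\n" for title, sequence in records)


# Hypothetical reads; read_1 and read_3 carry the bar code ACGTAC.
example = """>read_1
ACGTACTTGACCA
>read_2
TTGCAGGCTA
>read_3
ACGTACGGATTC
""".splitlines()

extracted = list(extract_barcoded(read_fasta(example), "ACGTAC"))
output_text = to_fasta(extracted)
```

For a file with several bar codes, extract_barcoded would be called once per bar code, mirroring the repeated use of the menu described above.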

References
1. Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005,437:376-380.

2. Archer J, Rambaut A, Taillon BE, Harrigan R, Lewis M and Robertson D. The evolutionary analysis of emerging low frequency HIV-1 CXCR4-using variants through time - an ultra-deep approach. PLoS Comput Biol. 2010 Dec 16;6(12):e1001022.

3. Archer J, Braverman MS, et al. Detection of low frequency pre-therapy CXCR4-using HIV-1 with ultra-deep pyrosequencing. AIDS. 2009. May 6. [Epub ahead of print]

4. Altschul SF, Madden TL, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997; 25:3389.

5. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004; 32:1792-1797.

6. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003; 52:696-704.

7. iText Library - Copyright (C) 1999-2006 by Bruno Lowagie and Paulo Soares. All Rights Reserved.

Citing
Segminator can be cited using [2].