Since the introduction of high-throughput, second-generation DNA sequencing technologies, there has been an enormous increase in the size of datasets being used for estimating bacterial population phylodynamics. Although many phylogenetic techniques are scalable to hundreds of bacterial genomes, methods which have been used for mitigating the effect of mechanisms of horizontal sequence transfer on phylogenetic reconstructions cannot cope with these new datasets. Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences) is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Simulations demonstrate the algorithm generates highly accurate reconstructions under realistic models of short-term bacterial evolution, and can be run in only a few hours on alignments of hundreds of bacterial genome sequences.
The paper
The paper is available from Nucleic Acids Research (open access): Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins.
Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S. R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids Research, 2014. doi:10.1093/nar/gku1196.
Data used in the paper
- PMEN1 dataset (12 Mbyte multi-FASTA alignment file)
- ST239 dataset (7 Mbyte multi-FASTA alignment file)
If you need help or have any queries, please contact us using the details below. Our normal office hours are 8:30-17:00 (GMT) Monday to Friday.
- Email firstname.lastname@example.org
- Phone Pathogen Informatics on +44 (0)1223 834244 extension 8736
Detailed installation instructions are available in the README file. If you already have Homebrew installed, run brew install gubbins.
To run Gubbins with default settings:
- run_gubbins.py [FASTA alignment]
- --outgroup, -o
The name of a sequence in the alignment on which to root the tree
- --starting_tree, -s
A Newick-format starting tree on which to perform the first iteration analysis. The default is to compute a starting tree using RAxML
- --filter_percentage, -f
Filter out taxa with more than this percentage of missing data. Default is 25%
- --tree_builder, -t
The algorithm used to construct phylogenies in the analysis; can be ‘raxml’, to use RAxML, ‘fasttree’, to use FastTree, or ‘hybrid’, to use FastTree for the first iteration and RAxML in all subsequent iterations. Default is raxml.
- --iterations, -i
The maximum number of iterations to perform; the algorithm will stop earlier than this if it converges on the same tree in two successive iterations. Default is 5.
- --min_snps, -m
The minimum number of base substitutions required to identify a recombination. Default is 3.
- --converge_method, -z
The convergence criterion used to decide when to halt iterations; can be weighted_robinson_foulds, robinson_foulds or recombination. Default is 'weighted_robinson_foulds'.
- --use_time_stamp, -u
Include a time stamp in the name of output files to avoid overwriting previous runs on the same input file. Default is to not include a time stamp.
- --prefix, -p
Specify a prefix for output files. If none is provided, it defaults to the name of the input FASTA alignment.
- --verbose, -v
Print debugging messages. Default is off.
- --no_cleanup, -n
Do not remove files from intermediate iterations. This option will also keep other files created by RAxML and FastTree, which would otherwise be deleted. Default is to keep only the files from the final iteration.
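The options above can be combined on one command line. The following is a minimal sketch in Python that assembles such a command from the long option names listed here; the helper name build_gubbins_command, the example file name and the example prefix are hypothetical, and it assumes that --use_time_stamp is a plain switch that takes no value.

```python
import shlex

# Values documented above for --tree_builder.
TREE_BUILDERS = ("raxml", "fasttree", "hybrid")


def build_gubbins_command(alignment, outgroup=None, tree_builder=None,
                          iterations=None, min_snps=None, prefix=None,
                          use_time_stamp=False):
    """Assemble an argument list for run_gubbins.py (hypothetical helper)."""
    cmd = ["run_gubbins.py"]
    if outgroup is not None:
        cmd += ["--outgroup", outgroup]
    if tree_builder is not None:
        if tree_builder not in TREE_BUILDERS:
            raise ValueError(f"tree_builder must be one of {TREE_BUILDERS}")
        cmd += ["--tree_builder", tree_builder]
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    if min_snps is not None:
        cmd += ["--min_snps", str(min_snps)]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    if use_time_stamp:
        # Assumption: this option is a switch with no argument.
        cmd.append("--use_time_stamp")
    # The FASTA alignment is the positional argument.
    cmd.append(alignment)
    return cmd


# Hypothetical file name and prefix, for illustration only.
cmd = build_gubbins_command("my_alignment.aln", tree_builder="hybrid",
                            iterations=10, prefix="pmen1")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once Gubbins is installed; building a list rather than a single string avoids shell quoting problems with unusual file names.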
If a prefix is not defined with the --prefix option, the default prefix of the output files is: X.Y
- X = Prefix taken from the input FASTA file
- Y = Time stamp. NOTE: This will only be included in the output file prefix if the -u flag has been selected
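The X.Y naming rule can be illustrated with a short Python sketch. This is not taken from the Gubbins source: it assumes that X is the input file name with its directory and final extension removed, and it uses an arbitrary time stamp layout, since neither detail is specified here. The function name default_output_prefix is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def default_output_prefix(alignment_path, use_time_stamp=False, now=None):
    """Return an X or X.Y style prefix (illustrative sketch only)."""
    # Assumption: X is the file name without directory or final extension.
    x = Path(alignment_path).stem
    if not use_time_stamp:
        return x
    # Assumption: the time stamp layout is chosen here for illustration.
    moment = now if now is not None else datetime.now()
    y = moment.strftime("%Y%m%d%H%M%S")
    return f"{x}.{y}"


print(default_output_prefix("data/my_alignment.aln"))
print(default_output_prefix("data/my_alignment.aln", use_time_stamp=True))
```

Without the time stamp, repeated runs on the same alignment produce the same prefix and therefore overwrite earlier output, which is the situation the -u flag is meant to avoid.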
Recombination predictions in EMBL tab file format.
Recombination predictions in GFF3 format
Base substitution reconstruction in EMBL tab format.
VCF file summarising the distribution of SNPs
Per-branch reporting of the base substitutions inside and outside recombination events.
FASTA format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration.
Phylip format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration.
Final phylogenetic tree in Newick format.
Final phylogenetic tree in Newick format, with internal node labels.
If you get a segfault, your input data is probably too divergent or insufficiently curated for quality. Keep removing samples until Gubbins can complete.
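When deciding which samples to remove first, one possible starting point is to rank sequences by how much missing data they contain. The sketch below is an assumption-laden illustration rather than Gubbins' own filter: it treats N, n, - and ? as missing, uses a tiny made-up alignment, and compares each sequence against the 25% default of --filter_percentage. All function names are hypothetical.

```python
def read_fasta(lines):
    """Parse multi-FASTA text lines into a {name: sequence} dict."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def missing_percentage(sequence, missing_chars="Nn-?"):
    """Percentage of positions that are missing (assumed: N, n, -, ?)."""
    if not sequence:
        return 100.0
    missing = sum(1 for base in sequence if base in missing_chars)
    return 100.0 * missing / len(sequence)


def rank_by_missing_data(sequences):
    """Return (percentage, name) pairs, worst sequence first."""
    return sorted(((missing_percentage(seq), name)
                   for name, seq in sequences.items()), reverse=True)


# Made-up alignment for illustration only.
example = """>sample_a
ACGTACGT
>sample_b
ACNN--GT
>sample_c
ACGTNCGT
""".splitlines()

for pct, name in rank_by_missing_data(read_fasta(example)):
    status = "above 25%" if pct > 25.0 else "ok"
    print(f"{name}\t{pct:.1f}%\t{status}")
```

For a real alignment, the lines can come from an open file handle instead of the example text. Missing data is only one proxy for poor curation; highly divergent samples may need to be identified by other means.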
The output of Roary (the pan genome pipeline) cannot be used as the input to Gubbins. They are fundamentally different methods. Detecting recombination in pan genomes is an open problem.
The command to run the script within Sanger with 8GB of RAM and 16 cores looks like:
- bsub.py --threads 16 8 log run_gubbins.py --threads 16 my_alignment.aln