Phylogenetic trees provide a framework for organizing evolutionary histories across the tree of life and aid downstream comparative analyses such as metagenomic identification. Methods that rely on single-marker genes such as 16S rRNA have produced trees of limited accuracy with hundreds of thousands of organisms, whereas methods that use genome-wide data are not scalable to large numbers of genomes. We introduce updating trees using divide-and-conquer (uDance), a method that enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability. With uDance, we infer a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues. uDance is available publicly at GitHub.
Phylogenetic placement of query samples on an existing phylogeny is increasingly used in microbiome analyses and other biological studies. As the size of available reference trees used in microbiome analyses continues to grow, there is a growing need for methods that place sequences on ultra-large trees (e.g. 200,000 leaves) with high accuracy. In this project, I developed an algorithm called APPLES (and its successor APPLES-2), a distance-based method for phylogenetic placement. In extensive studies, we showed that APPLES-2 is more accurate and scalable, even more so than some of the leading maximum likelihood methods. This scalability is owed to a divide-and-conquer technique that limits distance calculation and phylogenetic placement to parts of the tree most relevant to each query. APPLES-2 is available publicly at GitHub. [Distance-based methods; genome skimming;phylogenetic placement.]
Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. In this project, we introduce a model based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. Resulting method called MISA and data are available at Github.
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. We defined a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. These three problems can be solved in time that increases linearly with the size of the tree. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at Github.
The alignment-free methods have much appeal in terms of simplifying the process of inference, especially when analyzing genome-wide data. Despite the appeal, one limitation of alignment-free methods is that they typically rely on simplified models of sequence evolution such as Jukes-Cantor. It is possible to compute pairwise distances under more complex models by computing frequencies of base substitutions provided that these quantities can be estimated in the alignment-free setting. An underappreciated limitation is that for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the strand of DNA sequences is unknown. Under such conditions, the no-strand bias models are the best that we can do. Here, we show how to calculate distances under a no-strain bias restriction of the General Time Reversible (GTR) model called TK4 without relying on alignments or even assemblies using k-mers.