Average genome size estimation enables accurate quantification of gene family abundance and sheds light on the functional ecology of the human microbiome

Average genome size estimation enables accurate quantification of gene family abundance and sheds light on the functional ecology of the human microbiome

Stephen Nayfach, Katherine S Pollard
doi: http://dx.doi.org/10.1101/009001

Average genome size (AGS) is an important, yet often overlooked property of microbial communities. We developed MicrobeCensus to rapidly and accurately estimate AGS from short-read metagenomics data and applied our tool to over 1,300 human microbiome samples. We found that AGS differs significantly within and between body sites and tracks with major functional and taxonomic differences. For example, in the gut, AGS ranges from 2.5 to 5.8 megabases and is positively correlated with the abundance of Bacteroides and polysaccharide metabolism. Furthermore, we found that AGS variation can bias comparative analyses, and that normalization improves detection of differentially abundant genes.

Increasing evolvability of local adaptation during range expansion.

Increasing evolvability of local adaptation during range expansion.
Marleen M. P. Cobben, Alexander Kubisch
doi: http://dx.doi.org/10.1101/008979

Increasing dispersal under range expansion increases invasion speed, which implies that a species needs to adapt more rapidly to newly experienced local conditions. However, due to iterated founder effects, local genetic diversity under range expansion is low. Evolvability (the evolution of mutation rates) has been reported to possibly be an adaptive trait itself. Thus, we expect that increased dispersal during range expansion may raise the evolvability of local adaptation, and thus increase the survival of expanding populations. We have studied this phenomenon with a spatially explicit individual-based metapopulation model of a sexually reproducing species with discrete generations, expanding into an elevational gradient. Our results show that evolvability is likely to evolve as a result of spatial variation experienced under range expansion. In addition, we show that different spatial phenomena associated with range expansion, in this case spatial sorting / kin selection and priority effects, can enforce each other.

Behavioral individuality reveals genetic control of phenotypic variability

Behavioral individuality reveals genetic control of phenotypic variability

Julien F Ayroles, Sean M Buchanan, Chelsea Jenney, Kyobi Skutt-Kakaria, Jennifer Grenier, Andrew G Clark, Daniel L Hartl, Benjamin L de Bivort
doi: http://dx.doi.org/10.1101/009027

Variability is ubiquitous in nature and a fundamental feature of complex systems. Few studies, however, have investigated variance itself as a trait under genetic control. By focusing primarily on trait means and ignoring the effect of alternative alleles on trait variability, we may be missing an important axis of genetic variation contributing to phenotypic differences among individuals. To study genetic effects on individual-to-individual phenotypic variability (or intragenotypic variability), we used a panel of Drosophila inbred lines and focused on locomotor handedness, in an assay optimized to measure variability. We discovered that some lines had consistently high levels of intragenotypic variability among individuals while others had low levels. We demonstrate that the degree of variability is itself heritable. Using a genome-wide association study (GWAS) for the degree of intragenotypic variability as the phenotype across lines, we identified several genes expressed in the brain that affect variability in handedness without affecting the mean. One of these genes, Ten-a implicated a neuropil in the central complex of the fly brain as influencing the magnitude of behavioral variability, a brain region involved in sensory integration and locomotor coordination6. We have validated these results using genetic deficiencies, null alleles, and inducible RNAi transgenes. This study reveals the constellation of phenotypes that can arise from a single genotype and it shows that different genetic backgrounds differ dramatically in their propensity for phenotypic variabililty. Because traditional mean-focused GWASs ignore the contribution of variability to overall phenotypic variation, current methods may miss important links between genotype and phenotype.

Author post: Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

This guest post is by Adam Siepel on his group’s paper Gulko et al. Probabilities of Fitness Consequences for Point Mutations Across the Human Genome, bioRxived here.

Four Genomicists in a Subaru

The idea for this paper emerged during a long drive across New York State, from Cold Spring Harbor to Ithaca, after the 2011 Biology of Genomes meeting. Two postdocs in my research group, Ilan Gronau and Leo Arbiza, were riding with me in my old Subaru, trying not to express too much alarm at my distracted driving. Also with us in the car was Ran Blekhman, who was at the time a postdoc with Andy Clark. (Ran is now an assistant professor at the University of Minnesota.)

Our conversation turned to important open questions in computational genomics, and in particular, to ways of making better use of the vast quantities of functional genomic data being pumped out of projects such as ENCODE. At the time, Ilan, Leo, and I were thinking a lot about how to use patterns of within-species polymorphism and between-species divergence to shed light on the influence of natural selection on regulatory sequences. Along these lines, we had spent much of the spring developing our new INSIGHT method [1,2], and we had just presented this work for the first time at Biology of Genomes. Ran, however, was pushing us to think less about abstract evolutionary questions and more about genomic function and disease association. He made a strong case that the biggest obstacle to progress in medically related human genomics was the absence of adequate functional annotations in noncoding regions of the human genome.

For a while the conversation went in circles, as we grasped for ways of measuring “functional potential” across the genome that would make use of genomic data yet be grounded in evolutionary theory. Then it suddenly dawned on us that we already had in hand a key piece of what we needed. The INSIGHT program was designed to estimate, for any collection of nucleotide positions across the genome, the fraction (denoted ρ) of those positions that were directly influenced by natural selection, in the sense that point mutations at those positions tended either to increase or to decrease the fitness of an organism. We realized that an alternative way of interpreting ρ was as a probability that the nucleotide at each position in the analyzed collection influenced fitness, assuming exchangeability of sites (as INSIGHT does).

All that we needed, therefore, was a general way of partitioning nucleotide positions from across the genome into distinct classes that were reasonably homogeneous in their functional roles. We could then estimate ρ for each class using INSIGHT, and assign to each genomic position the estimate of ρ for the partition to which that position belonged. This procedure would produce a “score” across the genome that looked roughly like widely used evolutionary conservation scores, but instead of representing local divergence patterns across the mammalian phylogeny, the score at each position would be estimated from groupwise patterns of polymorphism and divergence and would be directly interpretable as a probability of fitness consequences. Later Ilan would dub these “fitCons” scores, to emphasize this fitness-related interpretation. (“FitCons” also nicely parallels “phastCons,” our first conservation-scoring method.)

Because INSIGHT measures selection on recent time scales, fitCons scores would circumvent a major shortcoming of standard evolutionary conservation scores—that they require functional roles to have remained consistent over very long evolutionary time periods (tens to hundreds of millions of years) in order to be detectable in divergence patterns at individual sites or small loci. In principle, fitCons scores should be able to detect selection (hence potential function) at sites whose functional role had emerged quite recently, perhaps even along the human lineage.

The Problem of Grouping

The piece that was still missing in our plan was a particular scheme for grouping together similar genomic sites from across the genome. We did not get to the point of working this problem out in any detail during our revelatory drive to Ithaca, and, as it happened, it took several more months to settle on a solution. By this time, a Ph.D. student from Computer Science, Brad Gulko, had joined the project and assumed the lead in implementing a prototype of the scores.

At first, Brad, Ilan, Leo, and I spent some time thinking about fancy algorithms for clustering genomic sites that would consider functional and evolutionary information jointly. However, it did not take long to realize that this was a hard problem. Eventually, we decided to move forward with a simple grouping scheme, based on functional genomic data alone. This would allow us to cluster genomic sites in a pre-processing step and avoid the need for an iterative solution. Our hunch was that the scores would not be too sensitive to the grouping scheme as long as it was reasonable. As we discuss in our article, it may be worthwhile to revisit this clustering approach eventually, but it appears to be adequate for our current purposes. (I hope to convince Brad to discuss some of the technical issues with the clustering problem in an upcoming blog post.)

Relevance to the “Share Under Selection” in the Human Genome

By the fall of 2012, we had finally settled on an initial set of fitCons tracks and were beginning to observe decent prediction performance for cis-regulatory elements, when the human genomics community was thrown into a frenzy by a deluge of publications and accompanying press releases from the ENCODE Consortium. This event led to the now-famous controversy over what fraction of the genome is truly “functional” and whether ENCODE’s measures of “reproducible biochemical activity” (which apply to over 80% of the genome) were comparable in any meaningful way to the “share under selection” (SUS) estimated from comparative genomics (which generally came out to 5–10%).

I do not wish to rehash the familiar terms of this debate here, but I do want to focus on one aspect of it that was particularly relevant to our work. Many of the criticisms of ENCODE reminded readers that comparative genomic analyses pointed to a SUS of ~5–10%, suggesting that 80% might be a gross over-estimate of the functional content of the genome. However, others pointed out that these comparative-genomic estimates applied only to the fraction of the genome that had been under long-term selective constraint, because evolutionary turnover of functional elements—if it occurred at appreciable rates—could bias estimates based on long-term genomic divergence substantially downward. (For the latest chapter in this saga, see a recent paper by Gerton Lunter, Chris Ponting, and colleagues [3].)

We realized that the fitCons scores could help address aspects of this controversy, because they were based on patterns of variation over much more recent time scales and should therefore be much less sensitive to turnover than scores based on divergence patterns across the mammalian phylogeny. Moreover, the fitCons scores, by making use of INSIGHT to interpret patterns of polymorphism and divergence, might provide substantially better estimates of the quantities of interest than simple analyses of SNP densities or allele frequencies, a few of which had appeared among the ENCODE publications. Finally, the INSIGHT-based estimates are unique in that they directly predict the SUS, without the need for separate thresholding, mixture deconvolutions, or enrichment analyses.

Somewhat surprisingly, when we estimated the SUS based on fitCons scores, we obtained values (4.2–7.5%) that were quite similar to those based on conservation patterns in mammals. There are a number of tricky technical issues involved in this type of estimation—for example, concerning the corrections for local mutation rates and coalescence times—but violations of our modeling assumptions generally should tend to push our estimated upper bound (7.5%) to conservatively high values, implying that the true value is lower than 7.5%. In addition, the correction we have applied to obtain our lower bound (4.2%) is quite conservative, making it likely that the true value is higher than 4.2%. Therefore we have high confidence that the fraction of the genome under detectable selection from the available polymorphism and divergence data is indeed fairly close to 5%. As we discuss in the paper, it is important to bear in mind that the absolute values of these estimates reflect constraint on the identities of individual nucleotides only, and do not take into consideration higher order constraints, for example, on element lengths or spacing. Nevertheless, the similarity of the estimates based on mammalian divergence and human polymorphism suggest that evolutionary turnover has not produced a major downward bias in conservation-based estimates of the SUS.

Ilan and I soon realized that we could go a step further in this analysis and compare the fitCons-based estimates with parallel estimates based on the same functional categories but a measure of natural selection based on divergence only. The idea here was to perform a direct “apples to apples” comparison of the fraction of the genome under selection as measured on two different time scales: the 1–5 million-year time scale measured by fitCons and the ~30 million-year time scale measured by an analogous method based on divergence patterns in four primate genomes (human, chimpanzee, orangutan, and rhesus macaque), which we called “fitConsD” (the “D” is for “divergence”). I won’t attempt to describe this analysis in detail here, but our general conclusion is that the estimates of selection are highly similar on these two different time scales, suggesting further that evolutionary turnover has not had a dramatic effect on the functional content of the human genome over the past 30 million years or so. It is worth noting that Lunter and colleagues’ recent analysis is not strictly incompatible with ours (they estimate 7.1–9.2% constraint at present and focus on turnover over longer time scales) but their qualitative interpretation suggests large amounts of turnover, while ours suggests modest amounts.

Scooped by CADD… or Perhaps Not

As grant proposals, other manuscripts, and job searches led to delays in writing up our work through 2013, we began to hear rumblings on social media about a method called CADD, developed by Greg Cooper and Jay Shendure’s groups, that sounded alarmingly similar to fitCons. Then, in early 2014, a paper by Kircher, Witten et al. describing CADD appeared in Nature Genetics [4]. When we saw this paper, our initial impression was that we had been scooped by sitting on a good idea for too long. CADD was described as a method that integrated functional and evolutionary data and produced a measure of “relative pathogenicity” across the entire genome, and it was motivated, in part, by its potential usefulness in noncoding as well as coding regions. The paper included several impressive-looking ROC plots in which CADD apparently outperformed conservation based methods such as phastCons, phyloP, and GERP by a significant margin. In addition, CADD made use of a support vector machine (SVM), which was potentially a highly flexible and powerful means for considering large numbers of covariates with arbitrarily complex correlations.

We decided, with a certain amount of dread, that we needed to add CADD to our empirical performance comparisons on putative regulatory elements. At the time, fitCons was showing clear advantages in predictive power for cis regulatory elements compared with conservation-based methods and a functional annotation database called RegulomeDB. FitCons had several potential advantages over CADD—for example, it made direct use of polymorphism data for prediction, it considered covariates in a cell-type-specific manner, and it avoided a need for brute-force simulations through its use of INSIGHT for inference—but we thought that the use of the SVM in CADD made it unlikely that fitCons could compete with it in a pure classification task. Nevertheless, Brad dutifully downloaded the CADD scores, added them to his experiments, and displayed curves for CADD in his ROC plots for three types of regulatory elements (ChIP-seq-supported transcription factor binding sites, eQTLs, and enhancer predictions based on chromatin marks).

To our surprise, fitCons significantly outperformed CADD in all of these tests. This was true for three different types of putative regulatory elements, and true whether or not we considered cell-type-specific test sets. In fact, CADD performed essentially no better than conventional conservation scores in these tests, in apparent contradiction to the results presented in the CADD paper.

A closer reading of the CADD paper revealed a possible explanation for these observations. While the method was motivated, in part, by its applicability to the entire genome, the authors’ validation experiments heavily emphasized coding regions. In fact, it appears that even the ROC plot for “genome-wide” results (Figure 3a in the paper) is actually based almost exclusively (>92% by our interpretation of the paper) on missense variants in coding regions. The experiments that included substantial numbers of noncoding sites, in turn, were much more indirect—for example, by showing correlations with derived allele frequencies (Figure 2), known disease-causing status (Figure 4), and changes in expression in saturation mutagenesis experiments at two enhancers and one promoter. It is possible to have correlations of this kind without having substantial predictive power for regulatory variants.

When Greg Cooper saw our initial preprint on bioRxiv, he raised two major objections to our validation experiments. First, he pointed out that we were measuring the sensitivity of the fitCons scores in terms of bulk coverage of elements, when those elements actually consist of a mixture of sites at which mutations are deleterious and sites at which mutations are neutral or nearly neutral (such as degenerate positions in transcription factor binding sites). This approach to measuring sensitivity may be overly generous to the fitCons scores, which are relatively “blocky” along the genome, varying little from one site to the next, in comparison to higher-resolution prediction methods that properly distinguish between functionally important and neutral sites within elements. Second, Greg pointed out that we were using a naive genome-wide background set for our eQTL, which did not properly account for the ascertainment scheme used for eQTL identification.

We felt that these were fair and reasonable criticisms, and needed to be addressed. Therefore, we revised our validation experiments to consider only high-information-content positions in transcription factor binding sites (a proxy for functionally important nucleotides), and to use a more appropriate control for eQTL. The details of these follow-up experiments are described in our revised preprint (now on bioRxiv), but the bottom line was that they had almost no effect on our ROC plots. In other words, the apparent performance advantages of fitCons over CADD and other divergence-based methods is not an artifact of our experimental design but appears to reflect real advantages of the method. While Greg is correct that the coarse, “blocky” nature of the fitCons scores is a limitation of our current methods, the method still appears to perform significantly better than any competing method in distinguishing putatively functional regulatory nucleotides from background sequence. In other words, while scores that exhibit more variation from one nucleotide to the next—such as CADD, GERP, and phyloP—may appear on the surface to have higher predictive resolution, much of that variation is uninformative about regulatory function, and, on balance, the “blocky” fitCons scores are more useful in prediction.

We have spent some time trying to understand the differences in performance between fitCons and CADD, and believe we have some insights into why fitCons performs significantly better on regulatory elements. (What follows are our conjectures only; the authors of CADD do not agree with our analysis.) While the SVM in CADD is potentially a strength, we believe that it is substantially limited in this case by the use of a linear kernel and by pooling features across cell types, rather than focusing separately on each cell type of interest. In addition, we think there is a fundamental problem with the optimization scheme used by CADD. The SVM in CADD is trained by a global strategy, in the sense that a single set of parameter values is selected (for a given choice of training set and generalization parameter) to obtain an optimal fit, on average, across all examples in the training set. Thus, if it is true that different covariates are relevant in coding and noncoding regions, as expected, then the method will have to make tradeoffs between these types of sites. If the “signal” for training (i.e., the contribution toward the SVM’s objective function) is stronger in one type of region than another, it is likely that the tradeoff will favor that type of region. Because constraint will be strongest, on average, in coding regions, leading to higher rates of difference between simulated and observed variants in these regions, it seems likely that these regions will indeed dominate in the training procedure, and this may explain CADD’s superior performance in coding regions and its weaker performance in noncoding regions. FitCons avoids this problem by applying INSIGHT separately to each class of sites.

These observations raise the interesting possibility of a modified CADD that addresses some of these limitations. There is no reason why CADD couldn’t be trained separately on noncoding and coding regions, perhaps with different sets of covariates for each type of sites. Moreover, regulation-associated covariates could be treated in a cell-type-specific manner. A modified CADD designed along these lines (regulatory CADD, or rCADD?) could provide an interesting alternative to fitCons.

Summary

When we first discussed the idea for the fitCons scores during our drive across New York State three years ago, I envisioned a quick spin-off project that could be completed in perhaps half a year. As so often happens in research, several unanticipated challenges arose in completing this work, but we also found unexpected opportunities to connect our analysis with important open questions in the field. In addition, we were stimulated to think about the problem of combining functional and evolutionary data in new and deeper ways by another paper that addressed a similar problem but in a fundamentally different way. The end result is a paper I am quite proud of—one that provides what I think will be a useful resource to the genomics community and that also offers new insights into longstanding evolutionary questions.

References

[1] Gronau, I., Arbiza, L., Mohammed, J., & Siepel, A. (2013). Inference of Natural Selection from Interspersed Genomic Elements Based on Polymorphism and Divergence. Molecular Biology and Evolution. doi:10.1093/molbev/mst019

[2] Arbiza, L., Gronau, I., Aksoy, B. A., Hubisz, M. J., Gulko, B., Keinan, A., & Siepel, A. (2013). Genome-wide inference of natural selection on human transcription factor binding sites. Nature Genetics, 45(7), 723–729. doi:10.1038/ng.2658

[3] Rands, C. M., Meader, S., Ponting, C. P., & Lunter, G. (2014). 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genetics, 10(7), e1004525. doi:10.1371/journal.pgen.1004525

[4] Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics. doi:10.1038/ng.2892

The Sea Lamprey Meiotic Map Resolves Ancient Vertebrate Genome Duplications

The Sea Lamprey Meiotic Map Resolves Ancient Vertebrate Genome Duplications
Jeramiah Smith
doi: http://dx.doi.org/10.1101/008953

Gene and genome duplications serve as an important reservoir of material for the evolution of new biological functions. It is generally accepted that many genes present in vertebrate genomes owe their origin to two whole genome duplications that occurred deep in the ancestry of the vertebrate lineage. However, details regarding the timing and outcome of these duplications are not well resolved. We present high-density meiotic and comparative genomic maps for the sea lamprey, a representative of an ancient lineage that diverged from all other vertebrates approximately 550 million years ago. Linkage analyses yielded a total of 95 linkage groups, similar to the estimated number of germline chromosomes (1N ~ 99), spanning a total of 5,570.25 cM. Comparative mapping data yield strong support for one ancient whole genome duplication but do not strongly support a hypothetical second event. Rather, these comparative maps reveal several evolutionary independent segmental duplications occurring over the last 600+ million years of chordate evolution. This refined history of vertebrate genome duplication should permit more precise investigations into the evolution of vertebrate gene functions.

Scalable Genomics with R and Bioconductor

Scalable Genomics with R and Bioconductor
Michael Lawrence, Martin Morgan
Journal-ref: Statistical Science 2014, Vol. 29, No. 2, 214-226
Subjects: Genomics (q-bio.GN); Distributed, Parallel, and Cluster Computing (cs.DC)

This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes the implementation of those strategies in R by packages from the Bioconductor project. We treat the scalable processing, summarization and visualization of big genomic data. The general ideas are well established and include restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.

Genome-wide characterization of RNA editing in chicken: lack of evidence for non-A-to-I events

Genome-wide characterization of RNA editing in chicken: lack of evidence for non-A-to-I events

Laure Frésard, Sophie Leroux, Pierre-François Roux, C Klopp, Stéphane Fabre, Diane Esquerré, Patrice Dehais, Anis Djari, David Gourichon, Sandrine Lagarrigue, Frédérique Pitel
doi: http://dx.doi.org/10.1101/008912

RNA editing corresponds to a post-transcriptional nucleotide change in the RNA sequence, creating an alternative nucleotide, not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is more and more described in mammals, notably since the availability of next generation sequencing technologies allowing a whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken are still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, to understand to what extent this phenomenon is conserved in vertebrates. We performed RNA and DNA sequencing from 8 embryos. Being aware of common pitfalls inherent to sequence analyses leading to false positive discovery, we stringently filtered our datasets and found less than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual editing level increase with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, and provide support of an absence of non A-to-I events from the chicken transcriptome.

Accurate Liability Estimation Substantially Improves Power in Ascertained Case Control Studies

Accurate Liability Estimation Substantially Improves Power in Ascertained Case Control Studies

Omer Weissbrod, Christoph Lippert, Dan Geiger, David Heckerman
(Submitted on 8 Sep 2014)

Future genome wide association studies (GWAS) of diseases will include hundreds of thousands of individuals in order to detect risk variants with small effect sizes. Such samples are susceptible to confounding, which can lead to spurious results. Recently, linear mixed models (LMMs) have emerged as the method of choice for GWAS, due to their robustness to confounding. However, the performance of LMMs in case-control studies deteriorates with increasing sample size, resulting in reduced power. This loss of power can be remedied by transforming observed case-control status to liability space, wherein each individual is assigned a score corresponding to severity of phenotype. We propose a novel method for estimating liabilities, and demonstrate that testing for associations with estimated liabilities by way of an LMM leads to a substantial power increase. The proposed framework enables testing for association in ascertained case-control studies, without suffering from reduced power, while remaining resilient to confounding. Extensive experiments on synthetic and real data demonstrate that the proposed framework can lead to an average increase of over 20 percent for test statistics of causal variants, thus dramatically improving GWAS power.

Biallelic Mutation-Drift Diffusion in the Limit of Small Scaled Mutation Rates

Biallelic Mutation-Drift Diffusion in the Limit of Small Scaled Mutation Rates

Claus Vogl
(Submitted on 8 Sep 2014)

The evolution of the allelic proportion x of a biallelic locus subject to the forces of mutation and drift is investigated in a diffusion model, assuming small scaled mutation rates. The overall scaled mutation rate is parametrized with θ=(μ1+μ0)N and the ratio of mutation rates with α=μ1/(μ1+μ0)=1−β. The equilibrium density of this process is beta with parameters αθ and βθ. Away from equilibrium, the transition density can be expanded into a series of modified Jacobi polynomials. If the scaled mutation rates are small, i.e., θ≪1, it may be assumed that polymorphism derives from mutations at the boundaries. A model, where the interior dynamics conform to the pure drift diffusion model and the mutations are entering from the boundaries is derived. In equilibrium, the density of the proportion of polymorphic alleles, \ie\ x within the polymorphic region [1/N,1−1/N], is αβθ(1x+11−x)=αβθx(1−x), while the mutation bias α influences the proportion of monomorphic alleles at 0 and 1. Analogous to the expansion with modified Jacobi polynomials, a series expansion of the transition density is derived, which is connected to Kimura’s well known solution of the pure drift model using Gegenbauer polynomials. Two temporal and two spatial regions are separated. The eigenvectors representing the spatial component within the polymorphic region depend neither on the on the scaled mutation rate θ nor on the mutation bias α. Therefore parameter changes, e.g., growing or shrinking populations or changes in the mutation bias, can be modeled relatively easily, without the change of the eigenfunctions necessary for the series expansion with Jacobi polynomials.

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Sebastien Roch, Mike Steel
(Submitted on 6 Sep 2014)

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multi-species coalescent with a standard site substitution model, maximum likelihood estimation on sequence data that has been concatenated across genes and performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described of Kubatko and Degnan (2007) and others, and complements recent theoretical results by DeGorgio and Degnan (2010) and Chifman and Kubtako (2014).