Average genome size estimation enables accurate quantification of gene family abundance and sheds light on the functional ecology of the human microbiome

Stephen Nayfach, Katherine S Pollard
doi: http://dx.doi.org/10.1101/009001

Average genome size (AGS) is an important, yet often overlooked property of microbial communities. We developed MicrobeCensus to rapidly and accurately estimate AGS from short-read metagenomics data and applied our tool to over 1,300 human microbiome samples. We found that AGS differs significantly within and between body sites and tracks with major functional and taxonomic differences. For example, in the gut, AGS ranges from 2.5 to 5.8 megabases and is positively correlated with the abundance of Bacteroides and polysaccharide metabolism. Furthermore, we found that AGS variation can bias comparative analyses, and that normalization improves detection of differentially abundant genes.
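To make the normalization point concrete: scaling read counts by "genome equivalents" (total sequencing depth divided by AGS) puts gene family abundances on a per-genome scale, so communities with different average genome sizes become comparable. The sketch below uses made-up numbers and illustrative function names, not the MicrobeCensus API.

```python
# Minimal sketch of AGS-based normalization (illustrative names and
# numbers; not the MicrobeCensus API).

def genome_equivalents(total_bp_sequenced, ags_bp):
    """How many average-sized genomes the sample's reads represent."""
    return total_bp_sequenced / ags_bp

def normalized_abundance(reads_mapped, gene_length_bp, total_bp_sequenced, ags_bp):
    """Reads per kilobase of gene per genome equivalent."""
    rpk = reads_mapped / (gene_length_bp / 1000.0)
    return rpk / genome_equivalents(total_bp_sequenced, ags_bp)

# Two samples with identical raw counts but different AGS (the gut range
# quoted above): raw counts alone would wrongly suggest equal per-genome
# abundance of the gene family.
small = normalized_abundance(500, 1000, 1e9, ags_bp=2.5e6)  # ~1.25 copies/genome
large = normalized_abundance(500, 1000, 1e9, ags_bp=5.8e6)  # ~2.90 copies/genome
print(small, large)
```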

Author post: Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

This guest post is by Adam Siepel on his group’s paper Gulko et al. Probabilities of Fitness Consequences for Point Mutations Across the Human Genome, bioRxived here.

Four Genomicists in a Subaru

The idea for this paper emerged during a long drive across New York State, from Cold Spring Harbor to Ithaca, after the 2011 Biology of Genomes meeting. Two postdocs in my research group, Ilan Gronau and Leo Arbiza, were riding with me in my old Subaru, trying not to express too much alarm at my distracted driving. Also with us in the car was Ran Blekhman, who was at the time a postdoc with Andy Clark. (Ran is now an assistant professor at the University of Minnesota.)

Our conversation turned to important open questions in computational genomics, and in particular, to ways of making better use of the vast quantities of functional genomic data being pumped out of projects such as ENCODE. At the time, Ilan, Leo, and I were thinking a lot about how to use patterns of within-species polymorphism and between-species divergence to shed light on the influence of natural selection on regulatory sequences. Along these lines, we had spent much of the spring developing our new INSIGHT method [1,2], and we had just presented this work for the first time at Biology of Genomes. Ran, however, was pushing us to think less about abstract evolutionary questions and more about genomic function and disease association. He made a strong case that the biggest obstacle to progress in medically related human genomics was the absence of adequate functional annotations in noncoding regions of the human genome.

For a while the conversation went in circles, as we grasped for ways of measuring “functional potential” across the genome that would make use of genomic data yet be grounded in evolutionary theory. Then it suddenly dawned on us that we already had in hand a key piece of what we needed. The INSIGHT program was designed to estimate, for any collection of nucleotide positions across the genome, the fraction (denoted ρ) of those positions that were directly influenced by natural selection, in the sense that point mutations at those positions tended either to increase or to decrease the fitness of an organism. We realized that an alternative way of interpreting ρ was as a probability that the nucleotide at each position in the analyzed collection influenced fitness, assuming exchangeability of sites (as INSIGHT does).

All that we needed, therefore, was a general way of partitioning nucleotide positions from across the genome into distinct classes that were reasonably homogeneous in their functional roles. We could then estimate ρ for each class using INSIGHT, and assign to each genomic position the estimate of ρ for the partition to which that position belonged. This procedure would produce a “score” across the genome that looked roughly like widely used evolutionary conservation scores, but instead of representing local divergence patterns across the mammalian phylogeny, the score at each position would be estimated from groupwise patterns of polymorphism and divergence and would be directly interpretable as a probability of fitness consequences. Later Ilan would dub these “fitCons” scores, to emphasize this fitness-related interpretation. (“FitCons” also nicely parallels “phastCons,” our first conservation-scoring method.)
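To make the scheme concrete, here is a toy sketch in Python. The class labels and ρ values are hypothetical, and INSIGHT is treated as a black box that returns one ρ per class; the real pipeline is of course more involved.

```python
# Toy sketch of the fitCons scoring scheme described above. Class labels
# and rho values are hypothetical; INSIGHT is treated as a black box
# that returns one rho estimate per class of sites.

def assign_scores(position_class, rho_by_class):
    """Give every genomic position the rho of the class it belongs to.

    position_class: dict mapping position -> functional class label
    rho_by_class:   dict mapping class label -> INSIGHT estimate of rho
    """
    return {pos: rho_by_class[cls] for pos, cls in position_class.items()}

# Hypothetical example: three classes with made-up rho estimates.
position_class = {1001: "tfbs", 1002: "tfbs", 2050: "dnase_only", 3300: "quiescent"}
rho_by_class = {"tfbs": 0.40, "dnase_only": 0.12, "quiescent": 0.05}
print(assign_scores(position_class, rho_by_class))
# {1001: 0.4, 1002: 0.4, 2050: 0.12, 3300: 0.05}
```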

Because INSIGHT measures selection on recent time scales, fitCons scores would circumvent a major shortcoming of standard evolutionary conservation scores—that they require functional roles to have remained consistent over very long evolutionary time periods (tens to hundreds of millions of years) in order to be detectable in divergence patterns at individual sites or small loci. In principle, fitCons scores should be able to detect selection (hence potential function) at sites whose functional role had emerged quite recently, perhaps even along the human lineage.

The Problem of Grouping

The piece that was still missing in our plan was a particular scheme for grouping together similar genomic sites from across the genome. We did not get to the point of working this problem out in any detail during our revelatory drive to Ithaca, and, as it happened, it took several more months to settle on a solution. By this time, a Ph.D. student from Computer Science, Brad Gulko, had joined the project and assumed the lead in implementing a prototype of the scores.

At first, Brad, Ilan, Leo, and I spent some time thinking about fancy algorithms for clustering genomic sites that would consider functional and evolutionary information jointly. However, it did not take long to realize that this was a hard problem. Eventually, we decided to move forward with a simple grouping scheme, based on functional genomic data alone. This would allow us to cluster genomic sites in a pre-processing step and avoid the need for an iterative solution. Our hunch was that the scores would not be too sensitive to the grouping scheme as long as it was reasonable. As we discuss in our article, it may be worthwhile to revisit this clustering approach eventually, but it appears to be adequate for our current purposes. (I hope to convince Brad to discuss some of the technical issues with the clustering problem in an upcoming blog post.)

Relevance to the “Share Under Selection” in the Human Genome

By the fall of 2012, we had finally settled on an initial set of fitCons tracks and were beginning to observe decent prediction performance for cis-regulatory elements, when the human genomics community was thrown into a frenzy by a deluge of publications and accompanying press releases from the ENCODE Consortium. This event led to the now-famous controversy over what fraction of the genome is truly “functional” and whether ENCODE’s measures of “reproducible biochemical activity” (which apply to over 80% of the genome) were comparable in any meaningful way to the “share under selection” (SUS) estimated from comparative genomics (which generally came out to 5–10%).

I do not wish to rehash the familiar terms of this debate here, but I do want to focus on one aspect of it that was particularly relevant to our work. Many of the criticisms of ENCODE reminded readers that comparative genomic analyses pointed to a SUS of ~5–10%, suggesting that 80% might be a gross over-estimate of the functional content of the genome. However, others pointed out that these comparative-genomic estimates applied only to the fraction of the genome that had been under long-term selective constraint, because evolutionary turnover of functional elements—if it occurred at appreciable rates—could bias estimates based on long-term genomic divergence substantially downward. (For the latest chapter in this saga, see a recent paper by Gerton Lunter, Chris Ponting, and colleagues [3].)

We realized that the fitCons scores could help address aspects of this controversy, because they were based on patterns of variation over much more recent time scales and should therefore be much less sensitive to turnover than scores based on divergence patterns across the mammalian phylogeny. Moreover, the fitCons scores, by making use of INSIGHT to interpret patterns of polymorphism and divergence, might provide substantially better estimates of the quantities of interest than simple analyses of SNP densities or allele frequencies, a few of which had appeared among the ENCODE publications. Finally, the INSIGHT-based estimates are unique in that they directly predict the SUS, without the need for separate thresholding, mixture deconvolutions, or enrichment analyses.
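To spell out the last point: if the genome's N scored positions are partitioned into classes c of size n_c, each with an INSIGHT estimate ρ̂_c, the implied share under selection is simply the coverage-weighted average of the class estimates (notation chosen here for illustration, not taken verbatim from the paper):

```latex
% Share under selection implied by class-wise INSIGHT estimates
% (notation ours): N positions total, classes c of size n_c.
\mathrm{SUS} \;=\; \frac{1}{N} \sum_{c} n_c \, \hat{\rho}_c
```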

Somewhat surprisingly, when we estimated the SUS based on fitCons scores, we obtained values (4.2–7.5%) that were quite similar to those based on conservation patterns in mammals. There are a number of tricky technical issues involved in this type of estimation—for example, concerning the corrections for local mutation rates and coalescence times—but violations of our modeling assumptions should generally tend to push our estimated upper bound (7.5%) to conservatively high values, implying that the true value is lower than 7.5%. In addition, the correction we applied to obtain our lower bound (4.2%) is quite conservative, making it likely that the true value is higher than 4.2%. Therefore, we have high confidence that the fraction of the genome under detectable selection, given the available polymorphism and divergence data, is indeed fairly close to 5%. As we discuss in the paper, it is important to bear in mind that the absolute values of these estimates reflect constraint on the identities of individual nucleotides only, and do not take into consideration higher-order constraints, for example, on element lengths or spacing. Nevertheless, the similarity of the estimates based on mammalian divergence and human polymorphism suggests that evolutionary turnover has not produced a major downward bias in conservation-based estimates of the SUS.

Ilan and I soon realized that we could go a step further in this analysis and compare the fitCons-based estimates with parallel estimates based on the same functional categories but a measure of natural selection based on divergence only. The idea here was to perform a direct "apples to apples" comparison of the fraction of the genome under selection as measured on two different time scales: the 1–5 million-year time scale measured by fitCons and the ~30 million-year time scale measured by an analogous method based on divergence patterns in four primate genomes (human, chimpanzee, orangutan, and rhesus macaque), which we called "fitConsD" (the "D" is for "divergence"). I won't attempt to describe this analysis in detail here, but our general conclusion is that the estimates of selection are highly similar on these two different time scales, suggesting further that evolutionary turnover has not had a dramatic effect on the functional content of the human genome over the past 30 million years or so. It is worth noting that Lunter and colleagues' recent analysis is not strictly incompatible with ours (they estimate 7.1–9.2% constraint at present and focus on turnover over longer time scales), but their qualitative interpretation suggests large amounts of turnover, while ours suggests modest amounts.

Scooped by CADD… or Perhaps Not

As grant proposals, other manuscripts, and job searches led to delays in writing up our work through 2013, we began to hear rumblings on social media about a method called CADD, developed by Greg Cooper and Jay Shendure's groups, that sounded alarmingly similar to fitCons. Then, in early 2014, a paper by Kircher, Witten et al. describing CADD appeared in Nature Genetics [4]. When we saw this paper, our initial impression was that we had been scooped by sitting on a good idea for too long. CADD was described as a method that integrated functional and evolutionary data and produced a measure of "relative pathogenicity" across the entire genome, and it was motivated, in part, by its potential usefulness in noncoding as well as coding regions. The paper included several impressive-looking ROC plots in which CADD apparently outperformed conservation-based methods such as phastCons, phyloP, and GERP by a significant margin. In addition, CADD made use of a support vector machine (SVM), which was potentially a highly flexible and powerful means for considering large numbers of covariates with arbitrarily complex correlations.

We decided, with a certain amount of dread, that we needed to add CADD to our empirical performance comparisons on putative regulatory elements. At the time, fitCons was showing clear advantages in predictive power for cis-regulatory elements compared with conservation-based methods and a functional annotation database called RegulomeDB. FitCons had several potential advantages over CADD—for example, it made direct use of polymorphism data for prediction, it considered covariates in a cell-type-specific manner, and it avoided a need for brute-force simulations through its use of INSIGHT for inference—but we thought that the use of the SVM in CADD made it unlikely that fitCons could compete with it in a pure classification task. Nevertheless, Brad dutifully downloaded the CADD scores, added them to his experiments, and displayed curves for CADD in his ROC plots for three types of regulatory elements (ChIP-seq-supported transcription factor binding sites, eQTLs, and enhancer predictions based on chromatin marks).

To our surprise, fitCons significantly outperformed CADD in all of these tests. This was true for three different types of putative regulatory elements, and true whether or not we considered cell-type-specific test sets. In fact, CADD performed essentially no better than conventional conservation scores in these tests, in apparent contradiction to the results presented in the CADD paper.

A closer reading of the CADD paper revealed a possible explanation for these observations. While the method was motivated, in part, by its applicability to the entire genome, the authors' validation experiments heavily emphasized coding regions. In fact, it appears that even the ROC plot for "genome-wide" results (Figure 3a in the paper) is actually based almost exclusively (>92% by our interpretation of the paper) on missense variants in coding regions. The experiments that included substantial numbers of noncoding sites, in turn, were much more indirect—showing, for example, correlations with derived allele frequencies (Figure 2), with known disease-causing status (Figure 4), and with changes in expression in saturation mutagenesis experiments at two enhancers and one promoter. It is possible to have correlations of this kind without having substantial predictive power for regulatory variants.

When Greg Cooper saw our initial preprint on bioRxiv, he raised two major objections to our validation experiments. First, he pointed out that we were measuring the sensitivity of the fitCons scores in terms of bulk coverage of elements, when those elements actually consist of a mixture of sites at which mutations are deleterious and sites at which mutations are neutral or nearly neutral (such as degenerate positions in transcription factor binding sites). This approach to measuring sensitivity may be overly generous to the fitCons scores, which are relatively "blocky" along the genome, varying little from one site to the next, in comparison to higher-resolution prediction methods that properly distinguish between functionally important and neutral sites within elements. Second, Greg pointed out that we were using a naive genome-wide background set for our eQTL analyses, which did not properly account for the ascertainment scheme used for eQTL identification.

We felt that these were fair and reasonable criticisms, and needed to be addressed. Therefore, we revised our validation experiments to consider only high-information-content positions in transcription factor binding sites (a proxy for functionally important nucleotides), and to use a more appropriate control set for the eQTLs. The details of these follow-up experiments are described in our revised preprint (now on bioRxiv), but the bottom line was that they had almost no effect on our ROC plots. In other words, the apparent performance advantages of fitCons over CADD and other divergence-based methods are not an artifact of our experimental design but appear to reflect real advantages of the method. While Greg is correct that the coarse, "blocky" nature of the fitCons scores is a limitation of our current methods, the method still appears to perform significantly better than any competing method in distinguishing putatively functional regulatory nucleotides from background sequence. In other words, while scores that exhibit more variation from one nucleotide to the next—such as CADD, GERP, and phyloP—may appear on the surface to have higher predictive resolution, much of that variation is uninformative about regulatory function, and, on balance, the "blocky" fitCons scores are more useful in prediction.

We have spent some time trying to understand the differences in performance between fitCons and CADD, and believe we have some insights into why fitCons performs significantly better on regulatory elements. (What follows are our conjectures only; the authors of CADD do not agree with our analysis.) While the SVM in CADD is potentially a strength, we believe that it is substantially limited in this case by the use of a linear kernel and by pooling features across cell types, rather than focusing separately on each cell type of interest. In addition, we think there is a fundamental problem with the optimization scheme used by CADD. The SVM in CADD is trained by a global strategy, in the sense that a single set of parameter values is selected (for a given choice of training set and generalization parameter) to obtain an optimal fit, on average, across all examples in the training set. Thus, if it is true that different covariates are relevant in coding and noncoding regions, as expected, then the method will have to make tradeoffs between these types of sites. If the “signal” for training (i.e., the contribution toward the SVM’s objective function) is stronger in one type of region than another, it is likely that the tradeoff will favor that type of region. Because constraint will be strongest, on average, in coding regions, leading to higher rates of difference between simulated and observed variants in these regions, it seems likely that these regions will indeed dominate in the training procedure, and this may explain CADD’s superior performance in coding regions and its weaker performance in noncoding regions. FitCons avoids this problem by applying INSIGHT separately to each class of sites.

These observations raise the interesting possibility of a modified CADD that addresses some of these limitations. There is no reason why CADD couldn’t be trained separately on noncoding and coding regions, perhaps with different sets of covariates for each type of sites. Moreover, regulation-associated covariates could be treated in a cell-type-specific manner. A modified CADD designed along these lines (regulatory CADD, or rCADD?) could provide an interesting alternative to fitCons.
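Here is a purely conjectural sketch, using scikit-learn, of what such stratified training could look like. The feature matrix X, the observed-vs-simulated labels y, and the region annotations are all assumed inputs; this is not the published CADD pipeline.

```python
# Conjectural sketch of an "rCADD"-style setup: fit one linear SVM per
# region type (coding vs. noncoding) rather than a single global model.
# Inputs are assumed: X holds per-variant feature vectors, y labels
# observed vs. simulated variants, region marks each variant's context.

import numpy as np
from sklearn.svm import LinearSVC

def train_stratified(X, y, region):
    """Train a separate linear SVM for each region type."""
    models = {}
    for r in np.unique(region):
        mask = region == r
        models[r] = LinearSVC().fit(X[mask], y[mask])
    return models

def score_variants(models, X, region):
    """Score each variant with the model trained on its own region type."""
    scores = np.empty(X.shape[0])
    for r, model in models.items():
        mask = region == r
        scores[mask] = model.decision_function(X[mask])
    return scores
```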

Summary

When we first discussed the idea for the fitCons scores during our drive across New York State three years ago, I envisioned a quick spin-off project that could be completed in perhaps half a year. As so often happens in research, several unanticipated challenges arose in completing this work, but we also found unexpected opportunities to connect our analysis with important open questions in the field. In addition, we were stimulated to think about the problem of combining functional and evolutionary data in new and deeper ways by another paper that addressed a similar problem but in a fundamentally different way. The end result is a paper I am quite proud of—one that provides what I think will be a useful resource to the genomics community and that also offers new insights into longstanding evolutionary questions.

References

[1] Gronau, I., Arbiza, L., Mohammed, J., & Siepel, A. (2013). Inference of Natural Selection from Interspersed Genomic Elements Based on Polymorphism and Divergence. Molecular Biology and Evolution. doi:10.1093/molbev/mst019

[2] Arbiza, L., Gronau, I., Aksoy, B. A., Hubisz, M. J., Gulko, B., Keinan, A., & Siepel, A. (2013). Genome-wide inference of natural selection on human transcription factor binding sites. Nature Genetics, 45(7), 723–729. doi:10.1038/ng.2658

[3] Rands, C. M., Meader, S., Ponting, C. P., & Lunter, G. (2014). 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genetics, 10(7), e1004525. doi:10.1371/journal.pgen.1004525

[4] Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics. doi:10.1038/ng.2892

Genome-wide characterization of RNA editing in chicken: lack of evidence for non-A-to-I events

Laure Frésard, Sophie Leroux, Pierre-François Roux, C Klopp, Stéphane Fabre, Diane Esquerré, Patrice Dehais, Anis Djari, David Gourichon, Sandrine Lagarrigue, Frédérique Pitel
doi: http://dx.doi.org/10.1101/008912

RNA editing is a post-transcriptional nucleotide change in an RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This diversifies transcription products, with potential functional consequences. Two nucleotide substitutions are mainly described in animals: adenosine to inosine (A-to-I) and cytidine to uridine (C-to-U). The phenomenon has been increasingly described in mammals, notably since next-generation sequencing technologies made whole-genome screening of RNA-DNA differences possible. Studies reporting RNA editing in other vertebrates, such as chicken, are still limited. We used high-throughput sequencing to search for RNA editing in chicken and to understand to what extent this phenomenon is conserved in vertebrates. We performed RNA and DNA sequencing on 8 embryos. Being aware of the common pitfalls of sequence analyses that lead to false-positive discoveries, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular RNA editing sites was supported by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adulthood, and observed clear tissue specificity and a gradual increase in editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates and support the absence of non-A-to-I events from the chicken transcriptome.

Accurate Liability Estimation Substantially Improves Power in Ascertained Case Control Studies

Omer Weissbrod, Christoph Lippert, Dan Geiger, David Heckerman
(Submitted on 8 Sep 2014)

Future genome-wide association studies (GWAS) of diseases will include hundreds of thousands of individuals in order to detect risk variants with small effect sizes. Such samples are susceptible to confounding, which can lead to spurious results. Recently, linear mixed models (LMMs) have emerged as the method of choice for GWAS, due to their robustness to confounding. However, the performance of LMMs in case-control studies deteriorates with increasing sample size, resulting in reduced power. This loss of power can be remedied by transforming observed case-control status to liability space, wherein each individual is assigned a score corresponding to the severity of the phenotype. We propose a novel method for estimating liabilities, and demonstrate that testing for associations with estimated liabilities by way of an LMM leads to a substantial power increase. The proposed framework enables testing for association in ascertained case-control studies without suffering from reduced power, while remaining resilient to confounding. Extensive experiments on synthetic and real data demonstrate that the proposed framework can lead to an average increase of over 20 percent in the test statistics of causal variants, thus dramatically improving GWAS power.

Biallelic Mutation-Drift Diffusion in the Limit of Small Scaled Mutation Rates

Claus Vogl
(Submitted on 8 Sep 2014)

The evolution of the allelic proportion x of a biallelic locus subject to the forces of mutation and drift is investigated in a diffusion model, assuming small scaled mutation rates. The overall scaled mutation rate is parametrized as θ = (μ1+μ0)N and the ratio of mutation rates as α = μ1/(μ1+μ0) = 1−β. The equilibrium density of this process is beta with parameters αθ and βθ. Away from equilibrium, the transition density can be expanded into a series of modified Jacobi polynomials. If the scaled mutation rates are small, i.e., θ ≪ 1, it may be assumed that polymorphism derives from mutations at the boundaries. A model is derived in which the interior dynamics conform to the pure-drift diffusion model and mutations enter from the boundaries. In equilibrium, the density of the proportion of polymorphic alleles, i.e., x within the polymorphic region [1/N, 1−1/N], is αβθ(1/x + 1/(1−x)) = αβθ/(x(1−x)), while the mutation bias α influences the proportion of monomorphic alleles at 0 and 1. Analogous to the expansion with modified Jacobi polynomials, a series expansion of the transition density is derived, which is connected to Kimura's well-known solution of the pure-drift model using Gegenbauer polynomials. Two temporal and two spatial regions are separated. The eigenvectors representing the spatial component within the polymorphic region depend neither on the scaled mutation rate θ nor on the mutation bias α. Therefore, parameter changes, e.g., growing or shrinking populations or changes in the mutation bias, can be modeled relatively easily, without the change of eigenfunctions necessary in the series expansion with Jacobi polynomials.
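For readers who want the densities written out, the equilibrium beta density and its small-θ limit in the polymorphic interior read as follows (standard beta-distribution algebra; the limit uses Γ(z) ≈ 1/z as z → 0):

```latex
% Equilibrium (beta) density with parameters (alpha*theta, beta*theta):
\phi(x) = \frac{\Gamma(\theta)}{\Gamma(\alpha\theta)\,\Gamma(\beta\theta)}\,
          x^{\alpha\theta-1}(1-x)^{\beta\theta-1},
\qquad \theta = (\mu_1+\mu_0)N .
% For \theta \ll 1, using \Gamma(z) \approx 1/z, the interior density is
\phi(x) \approx \alpha\beta\theta\left(\frac{1}{x}+\frac{1}{1-x}\right)
        = \frac{\alpha\beta\theta}{x(1-x)},
\qquad x \in [1/N,\; 1-1/N] .
```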

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Sebastien Roch, Mike Steel
(Submitted on 6 Sep 2014)

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multi-species coalescent with a standard site substitution model, maximum likelihood estimation on sequence data that has been concatenated across genes and performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).

Mixed Model with Correction for Case-Control Ascertainment Increases Association Power

Tristan Hayeck, Noah Zaitlen, Po-Ru Loh, Bjarni Vilhjalmsson, Samuela Pollack, Alexander Gusev, Jian Yang, Guo-Bo Chen, Michael E. Goddard, Peter M. Visscher, Nick Patterson, Alkes Price
doi: http://dx.doi.org/10.1101/008755

We introduce a Liability Threshold Mixed Linear Model (LTMLM) association statistic for ascertained case-control studies that increases power vs. existing mixed model methods, with a well-controlled false-positive rate. Recent work has shown that existing mixed model methods suffer a loss in power under case-control ascertainment, but no solution has been proposed. Here, we solve this problem using a chi-square score statistic computed from posterior mean liabilities (PML) under the liability threshold model. Each individual’s PML is conditional not only on that individual’s case-control status, but also on every individual’s case-control status and on the genetic relationship matrix obtained from the data. The PML are estimated using a multivariate Gibbs sampler, with the liability-scale phenotypic covariance matrix based on the genetic relationship matrix (GRM) and a heritability parameter estimated via Haseman-Elston regression on case-control phenotypes followed by transformation to liability scale. In simulations of unrelated individuals, the LTMLM statistic was correctly calibrated and achieved higher power than existing mixed model methods in all scenarios tested, with the magnitude of the improvement depending on sample size and severity of case-control ascertainment. In a WTCCC2 multiple sclerosis data set with >10,000 samples, LTMLM was correctly calibrated and attained a 4.1% improvement (P=0.007) in chi-square statistics (vs. existing mixed model methods) at 75 known associated SNPs, consistent with simulations. Larger increases in power are expected at larger sample sizes. In conclusion, an increase in power over existing mixed model methods is available for ascertained case-control studies of diseases with low prevalence.
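To illustrate the PML construction described in this abstract, the following toy Gibbs sampler draws liabilities from a truncated multivariate normal, assuming the liability-scale covariance V and the threshold t are given. It is an illustrative sketch only, far too slow for real GWAS sample sizes, and not the authors' implementation.

```python
# Toy Gibbs sampler for posterior mean liabilities (PML) under a
# liability threshold model. Simplifying assumptions: the covariance V
# and threshold t are known. Cases have liability above t, controls
# below. Illustrative sketch only, not the LTMLM code.

import numpy as np
from scipy.stats import truncnorm

def pml_gibbs(V, case, t, n_iter=200, burn=50, seed=0):
    """Estimate posterior mean liabilities by single-site Gibbs updates.

    V:    (n, n) liability-scale phenotypic covariance matrix
    case: boolean array, True for cases (liability above threshold t)
    """
    rng = np.random.default_rng(seed)
    n = len(case)
    P = np.linalg.inv(V)                       # precision matrix
    liab = np.where(case, t + 0.5, t - 0.5)    # crude starting values
    total = np.zeros(n)
    for it in range(n_iter):
        for i in range(n):
            # Conditional of liab[i] given all others is normal:
            s2 = 1.0 / P[i, i]
            mu = -s2 * (P[i] @ liab - P[i, i] * liab[i])
            sd = np.sqrt(s2)
            # Truncate to respect the observed case-control status.
            lo, hi = ((t - mu) / sd, np.inf) if case[i] else (-np.inf, (t - mu) / sd)
            liab[i] = truncnorm.rvs(lo, hi, loc=mu, scale=sd, random_state=rng)
        if it >= burn:
            total += liab
    return total / (n_iter - burn)
```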

Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study

Xiaoquan Wen, Francesca Luca, Roger Pique-Regi
doi: http://dx.doi.org/10.1101/008797

Mapping expression quantitative trait loci (eQTLs) has been shown to be a powerful tool for uncovering the genetic underpinnings of many complex traits at the molecular level. In this paper, we present an integrative analysis approach that leverages eQTL data collected from multiple population groups. In particular, our approach effectively identifies multiple independent cis-eQTL signals that are consistently present across populations, accounting for population heterogeneity in allele frequencies and linkage disequilibrium patterns. Furthermore, by integrating genomic annotations, our analysis framework enables high-resolution functional analysis of eQTLs. We applied our statistical approach to analyze the GEUVADIS data consisting of samples from five population groups. From this analysis, we concluded that i) joint analysis across population groups greatly improves the power of eQTL discovery and the resolution of fine mapping of causal eQTLs; ii) many genes harbor multiple independent eQTLs in their cis regions; and iii) genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs (p-value = 4.93 × 10⁻²²).

The projection of a test genome onto a reference population and applications to humans and archaic hominins

Melinda A Yang, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/008805

We introduce a method for comparing a test genome with numerous genomes from a reference population. Sites in the test genome are given a weight w that depends on the allele frequency x in the reference population. The projection of the test genome onto the reference population is the average weight for each x, w(x). The weight is assigned in such a way that, if the test genome is a random sample from the reference population, w(x) = 1. Using analytic theory, numerical analysis, and simulations, we show how the projection depends on the time of population splitting, the history of admixture, and changes in past population size. The projection is sensitive to small amounts of past admixture, the direction of admixture, and admixture from a population not sampled (a ghost population). We compute the projection of several human and two archaic genomes onto three reference populations from the 1000 Genomes Project: Europeans (CEU), Han Chinese (CHB), and Yoruba (YRI), and discuss the consistency of our analysis with previously published results for European and Yoruba demographic history. Including higher amounts of admixture between Europeans and Yoruba soon after their separation, and low amounts of admixture more recently, can resolve discrepancies between the projections and demographic inferences from some previous studies.
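The projection is simple enough to sketch in code. The weight below, w = g/(2x) for a diploid genotype g, is one simple choice that satisfies the stated calibration E[w | x] = 1 for a random draw from the reference population; the authors' actual weighting may differ, so treat this purely as an illustration of the projection idea.

```python
# Illustrative computation of a projection w(x), with hypothetical
# inputs. The weight w = g / (2x) is one choice with mean 1 at every x
# when the test genome is a random diploid draw from the reference; it
# is not necessarily the authors' exact weighting scheme.

import numpy as np

def projection(genotypes, ref_freqs, n_bins=20):
    """Average weight per reference-frequency bin, i.e., w(x)."""
    g = np.asarray(genotypes, dtype=float)   # 0, 1, or 2 derived alleles
    x = np.asarray(ref_freqs, dtype=float)   # reference allele frequencies
    keep = (x > 0) & (x < 1)                 # polymorphic sites only
    w = g[keep] / (2.0 * x[keep])
    bins = np.digitize(x[keep], np.linspace(0, 1, n_bins + 1)) - 1
    return np.array([w[bins == b].mean() if np.any(bins == b) else np.nan
                     for b in range(n_bins)])
```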

Highly epistatic genetic architecture of root length in Arabidopsis thaliana

Jennifer Lachowiec, Xia Shen, Christine Queitsch, Örjan Carlborg
doi: http://dx.doi.org/10.1101/008789

Efforts to identify loci underlying complex traits generally assume that most genetic variance is additive. This is despite the fact that non-additive genetic effects, such as epistatic interactions and developmental noise, are also likely to make important contributions to phenotypic variability. Analyses beyond additivity require additional care in the design and collection of data, and introduce significant analytical and computational challenges. Here, we have conducted a study that overcomes these challenges by focusing on a model complex trait that allows precise phenotyping across many replicates and by applying advanced analytical tools capable of capturing epistatic interactions. Specifically, we examined the genetic determinants of Arabidopsis thaliana root length, considering both trait mean and variance. Analysis of narrow- and broad-sense heritability of mean root length identified a large contribution of non-additive variation and a low contribution of additive variation. Moreover, no loci were found to contribute to mean root length in a standard additive-model genome-wide association analysis (GWAS). We could, however, identify one locus regulating developmental noise and seven loci contributing to root length through epistatic interactions, four of which were also experimentally confirmed. The candidate locus associated with root-length variance contains a candidate gene that, when mutated, appears to decrease developmental noise. This is particularly interesting, as most other known noise regulators in multicellular organisms increase noise when mutated. Mutant analysis of candidate genes within the seven epistatic loci implicated four genes in root development, including three without previously described root phenotypes. In summary, we identify several novel genes affecting root development, demonstrate the benefits of advanced analytical tools for studying the genetic determinants of complex traits, and show that epistatic interactions can be a major determinant of complex traits in A. thaliana.