On the genetic architecture of intelligence and other quantitative traits

Stephen D.H. Hsu
(Submitted on 14 Aug 2014)

How do genes affect cognitive ability or other human quantitative traits such as height or disease risk? Progress on this challenging question is likely to be significant in the near future. I begin with a brief review of psychometric measurements of intelligence, introducing the idea of a “general factor” or g score. The main results concern the stability, validity (predictive power), and heritability of adult g. The largest component of genetic variance for both height and intelligence is additive (linear), leading to important simplifications in predictive modeling and statistical estimation. Due mainly to the rapidly decreasing cost of genotyping, it is possible that within the coming decade researchers will identify loci which account for a significant fraction of total g variation. In the case of height analogous efforts are well under way. I describe some unpublished results concerning the genetic architecture of height and cognitive ability, which suggest that roughly 10k moderately rare causal variants of mostly negative effect are responsible for normal population variation. Using results from Compressed Sensing (L1-penalized regression), I estimate the statistical power required to characterize both linear and nonlinear models for quantitative traits. The main unknown parameter s (sparsity) is the number of loci which account for the bulk of the genetic variation. The required sample size is of order 100s (that is, about 100 times s), or roughly a million in the case of cognitive ability.
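
The sample-size arithmetic and the idea of L1-penalized recovery can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's analysis: the function names, the toy dimensions (100 samples, 150 loci, 3 causal loci), the noise level, and the penalty value are all hypothetical, and the solver is a plain coordinate-descent lasso written for readability.

```python
import random

def required_samples(s, c=100):
    """Rule of thumb quoted in the abstract: n is of order c * s, with c ~ 100."""
    return c * s

def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal step for the L1 penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=30):
    """Minimize (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # running residual y - Xb
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + col_sq[j] * b[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                b[j] = new
    return b

def toy_recovery(seed=0, n=100, p=150, s=3, lam=0.2):
    """Simulate s causal loci of negative effect and check which loci the lasso ranks highest."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    support = sorted(rng.sample(range(p), s))
    # Each causal locus has effect -1; small Gaussian noise is added (hypothetical values).
    y = [-sum(row[j] for j in support) + rng.gauss(0.0, 0.1) for row in X]
    b = lasso_cd(X, y, lam)
    top = sorted(sorted(range(p), key=lambda j: -abs(b[j]))[:s])
    return support, top
```

With s = 10,000, `required_samples(10_000)` gives 1,000,000, the "roughly a million" figure in the abstract. The toy problem is far smaller and easier than a real genotype matrix, so it says nothing about the constant of 100 itself.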

Seasonality in the migration and establishment of H3N2 Influenza lineages with epidemic growth and decline

Daniel Zinder, Trevor Bedford, Edward B. Baskerville, Robert J. Woods, Manojit Roy, Mercedes Pascual
(Submitted on 15 Aug 2014)

Background: Influenza A/H3N2 has been circulating in humans since 1968, causing considerable morbidity and mortality. Although H3N2 incidence is highly seasonal, how such seasonality contributes to global phylogeographic migration dynamics has not yet been established. In this study, we incorporate time-varying migration rates in a Bayesian MCMC framework, focusing initially on migration within China, and to and from North America, as case studies, and later on global communities.
Results: Both global migration and migration between and within large geographic regions are clearly seasonal. On a global level, windows of immigration (in migration) map to the seasonal timing of epidemic spread, while windows of emigration (out migration) map to epidemic decline. Seasonal patterns also affect the probability that local lineages go extinct and fail to contribute to long term viral evolution. The probability that a region will contribute to long term viral evolution as a part of the trunk of the phylogenetic tree increases in the absence of deep troughs and with reduced incidence variability.
Conclusions: Seasonal migration and rapid turnover within regions are sustained by the invasion of ‘fertile epidemic grounds’ at the end of older epidemics. Thus, the current emphasis on connectivity, including air-travel, should be complemented with a better understanding of the conditions and timing required for successful establishment. This will better our understanding of seasonal drivers, improve predictions, and improve vaccine updating by identifying strains that not only escape immunity but also have the seasonal opportunity to establish and spread. Further work is also needed on additional conditions that contribute to the persistence and long term evolution of influenza within the human population, such as spatial heterogeneity with respect to climate and seasonality.

Understanding Admixture Fractions

Mason Liang, Rasmus Nielsen
doi: http://dx.doi.org/10.1101/008078

Estimation of admixture fractions has become one of the most commonly used computational tools in population genomics. However, there is remarkably little population genetic theory on their statistical properties. We develop theoretical results that can accurately predict means and variances of admixture proportions within a population using models with recombination and genetic drift. Based on established theory on measures of multilocus disequilibrium, we show that there is a set of recurrence relations that can be used to derive expectations for higher moments of the admixture fraction distribution. We obtain closed form solutions for some special cases. Using these results, we develop a method for estimating admixture parameters from estimated admixture proportions obtained from programs such as Structure or Admixture. We apply this method to HapMap data and find that the population history of African Americans, as expected, is not best explained by a single admixture event between people of European and African ancestry. A model of constant gene flow for the past 11 generations until 2 generations ago gives a better fit.

Transposable elements contribute to activation of maize genes in response to abiotic stress

Irina Makarevitch, Amanda J Waters, Patrick T West, Michelle C Stitzer, Jeffrey Ross-Ibarra, Nathan M Springer
doi: http://dx.doi.org/10.1101/008052

Transposable elements (TEs) account for a large portion of the genome in many eukaryotic species. Despite their reputation as “junk” DNA or genomic parasites deleterious for the host, TEs have complex interactions with host genes and the potential to contribute to regulatory variation in gene expression. It has been hypothesized that TEs and genes they insert near may be transcriptionally activated in response to stress conditions. The maize genome, with many different types of TEs interspersed with genes, provides an ideal system to study the genome-wide influence of TEs on gene regulation. To analyze the magnitude of the TE effect on gene expression response to environmental changes, we profiled gene and TE transcript levels in maize seedlings exposed to a number of abiotic stresses. Many genes exhibit up- or down-regulation in response to these stress conditions. The analysis of TE families inserted within upstream regions of up-regulated genes revealed that between four and nine different TE families are associated with up-regulated gene expression in each of these stress conditions, affecting up to 20% of the genes up-regulated in response to abiotic stress and as many as 33% of genes that are only expressed in response to stress. Expression of many of these same TE families also responds to the same stress conditions. The analysis of the stress-induced transcripts and proximity of the transposon to the gene suggests that these TEs may provide local enhancer activities that stimulate stress-responsive gene expression. Our data on allelic variation for insertions of several of these TEs show strong correlation between the presence of TE insertions and stress-responsive up-regulation of gene expression. Our findings suggest that TEs provide an important source of allelic regulatory variation in gene response to abiotic stress in maize.

Long-read, whole genome shotgun sequence data for five model organisms

Kristi E Kim, Paul Peluso, Primo Baybayan, Patricia Jane Yeadon, Charles Yu, William Fisher, Chen-Shan Chin, Nicole A Rapicavoli, David R Rank, Joachim Li, David Catcheside, Susan E Celniker, Adam M Phillippy, Casey M Bergman, Jane M Landolin
doi: http://dx.doi.org/10.1101/008037

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4-C2 and P5-C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.

A Distance Method to Reconstruct Species Trees In the Presence of Gene Flow

Lingfei Cui, Laura Kubatko
doi: http://dx.doi.org/10.1101/007955

One of the central tasks in evolutionary biology is to reconstruct the evolutionary relationships among species from sequence data, particularly from multilocus data. In the last ten years, many methods have been proposed to use the variance in the gene histories to estimate species trees by explicitly modeling deep coalescence. However, gene flow, another process that may produce gene history variance, has been less studied. In this paper, we propose a simple yet innovative method for species tree estimation in the presence of gene flow. Our method, called STEST (Species Tree Estimation from Speciation Times), constructs species tree estimates from pairwise speciation time or species divergence time estimates. By using methods that estimate speciation times in the presence of gene flow (for example, M1 (Yang 2010) or SIM3s (Zhu and Yang 2012)), STEST is able to estimate species trees from data subject to gene flow. We develop two methods, called STEST (M1) and STEST (SIM3s), for this purpose. Additionally, we consider the method STEST (M0), which instead uses the M0 method (Yang 2002), a coalescent-based method that does not assume gene flow, to estimate speciation times. It is therefore devised to estimate species trees in the absence of gene flow. Our simulation studies show that STEST (M0) outperforms STEST (M1), STEST (SIM3s) and STEM in terms of estimation accuracy and outperforms *BEAST in terms of running time when the degree of gene flow is small. STEST (M1) outperforms STEST (M0), STEST (SIM3s), STEM and *BEAST in terms of estimation accuracy when the degree of gene flow is large. An empirical data set analyzed by these methods gives species tree estimates that are consistent with the previous results.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James Drake, Jane M Landolin, Adam M Phillippy
doi: http://dx.doi.org/10.1101/008003

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
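
The general MinHash idea behind locality-sensitive overlap detection can be sketched as follows. This is a minimal illustration, not MHAP's implementation: the k-mer size, the number of hash functions, the use of seeded BLAKE2b as a hash family, and all function names are assumptions chosen for clarity. The fraction of positions at which two sketches agree estimates the Jaccard similarity of the reads' k-mer sets, so candidate overlaps can be screened without aligning every pair of reads.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def _hash(kmer, seed):
    """One member of a seeded hash family (an assumption; any good hash family works)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_sketch(seq, k=8, num_hashes=128):
    """For each hash function, keep the minimum hash value over the read's k-mers."""
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching sketch positions, an estimate of k-mer Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)
```

Sketches have a fixed length regardless of read length, which is what makes the comparison cheap. Noisy long reads share fewer exact k-mers, so in practice the k-mer size and sketch size would have to be tuned to the error rate.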

Changes in epistatic interactions in the long-term evolution of HIV-1 protease

Aditi Gupta, Christoph Adami
(Submitted on 12 Aug 2014)

The human immuno-deficiency virus sub-type 1 (HIV-1) is evolving to keep up with a changing fitness landscape, due to the various drugs introduced to stop the virus’s replication. As the virus adapts, the information the virus encodes about its environment must change, and this change is reflected in the amino-acid composition of proteins, as well as changes in viral RNAs, binding sites, and splice sites. Information can also be encoded in the interaction between residues in a single protein as well as across proteins, leading to a change in the epistatic patterns that can affect how the virus can change in the future. Measuring epistasis usually requires fitness measurements that are difficult to obtain in high-throughput. Here we show that epistasis can be inferred from the pair-wise information between residues, and study how epistasis and information have changed over the long-term. Using HIV-1 protease sequence data from public databases covering the years 1998-2006 (from both treated and untreated subjects), we show that drug treatment has increased the protease’s per-site entropies on average. At the same time, the sum of mutual entropies across all pairs of residues within the protease shows a significant increase over the years, indicating an increase in epistasis in response to treatment, a trend not seen within sequences from untreated subjects. Our findings suggest that information theory can be an important tool to study long-term trends in the evolution of macromolecules.
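
The two quantities named in the abstract, per-site entropy and pairwise mutual information between residues, can be computed from an alignment as sketched here. This is an illustration only: it uses simple plug-in frequency estimates with no correction for finite sample size, and the function names and the toy sequences in the usage note are hypothetical, not the authors' data or code.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column(seqs, i):
    """Residues observed at alignment position i across all sequences."""
    return [s[i] for s in seqs]

def entropy(symbols):
    """Shannon entropy in bits of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(seqs, i, j):
    """I(i; j) = H(i) + H(j) - H(i, j) for two alignment positions."""
    xi, xj = column(seqs, i), column(seqs, j)
    return entropy(xi) + entropy(xj) - entropy(list(zip(xi, xj)))

def summed_pairwise_information(seqs):
    """Sum of mutual information over all pairs of positions."""
    length = len(seqs[0])
    return sum(mutual_information(seqs, i, j)
               for i, j in combinations(range(length), 2))
```

For the toy alignment `["AA", "AA", "CC", "CC"]` the two positions are perfectly coupled and share 1 bit, while for `["AA", "AC", "CA", "CC"]` they vary independently and share 0 bits. Plug-in estimates are biased upward for small samples, so comparisons across years would need comparable sample sizes or a bias correction.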

A codon model of nucleotide substitution with selection on synonymous codon usage

Laura Kubatko, Premal Shah, Radu Herbei, Michael Gilchrist
doi: http://dx.doi.org/10.1101/007849

The quality of phylogenetic inference made from protein-coding genes depends, in part, on the realism with which the codon substitution process is modeled. Here we propose a new mechanistic model that combines the standard M0 substitution model of Yang (1997) with a simplified model from Gilchrist (2007) that includes selection on synonymous substitutions as a function of codon-specific nonsense error rates. We tested the newly proposed model by applying it to 104 protein-coding genes in brewer’s yeast, and compared the fit of the new model to the standard M0 model and to the mutation-selection model of Yang and Nielsen (2008) using the AIC. Our new model provided significantly better fit in approximately 85% of the cases considered for the basic M0 model and in approximately 25% of the cases for the M0 model with estimated codon frequencies, but only in a few cases when the mutation-selection model was considered. However, our model includes a parameter that can be interpreted as a measure of the rate of protein production, and the estimates of this parameter were highly correlated with an independent measure of protein production for the yeast genes considered here. Finally, we found that in some cases the new model led to the preference of a different phylogeny for a subset of the genes considered, indicating that substitution model choice may have an impact on the estimated phylogeny.
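
The model comparison by AIC described above reduces to a small calculation per gene. The sketch below uses the standard definition AIC = 2k - 2 ln L; the log-likelihoods and parameter counts in the usage note are made-up numbers for illustration, not values from the paper, and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def preferred_model(fits):
    """Given {model name: (log-likelihood, number of parameters)}, return the lowest-AIC model."""
    return min(fits, key=lambda name: aic(*fits[name]))
```

For example, with hypothetical fits `{"M0": (-1000.0, 10), "new": (-990.0, 12)}` the AIC values are 2020 and 2004, so the model with two extra parameters is still preferred because its likelihood gain outweighs the penalty.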

Efficient Bayesian mixed model analysis increases association power in large cohorts

Po-Ru Loh, George Tucker, Brendan K Bulik-Sullivan, Bjarni J Vilhjalmsson, Hilary K Finucane, Daniel I Chasman, Paul M Ridker, Benjamin M Neale, Bonnie Berger, Nick Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/007799

Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts, and may not optimize power. All existing methods require time cost O(MN^2) (where N = #samples and M = #SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN) iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for GWAS in large cohorts.
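
The difference between the two cost regimes can be made concrete with a back-of-the-envelope operation count. This is only a sketch that ignores constant factors and memory costs: the marker count and the iteration count used in the usage note are hypothetical, since the abstract states neither, and the function names are illustrative.

```python
def quadratic_cost(n_samples, n_snps):
    """Operation count for a method scaling as O(M * N^2)."""
    return n_snps * n_samples ** 2

def iterative_cost(n_samples, n_snps, n_iterations):
    """Operation count for a small number of O(M * N) iterations."""
    return n_iterations * n_snps * n_samples

def cost_ratio(n_samples, n_snps, n_iterations):
    """Ratio of the two counts; M cancels, leaving N / n_iterations."""
    return quadratic_cost(n_samples, n_snps) / iterative_cost(n_samples, n_snps, n_iterations)
```

With the 23,294 samples mentioned above and an assumed 100 iterations, the ratio is about 233 whatever the number of SNPs, and it grows linearly with cohort size.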