Genome-wide association study of carbon and nitrogen metabolism in the maize nested association mapping population

Nengyi Zhang, Yves Gibon, Nicholas Lepak, Pinghua Li, Lauren Dedow, Charles Chen, Yoon-Sup So, Jason Wallace, Karl Kremling, Peter Bradbury, Thomas Brutnell, Mark Stitt, Edward Buckler
doi: http://dx.doi.org/10.1101/010785

Carbon (C) and nitrogen (N) metabolism are critical to plant growth and development and underlie yield and adaptation. We applied high-throughput metabolite analyses to over 12,000 diverse field-grown samples from the maize nested association mapping population. This allowed us to identify natural variation controlling the levels of twelve key C and N metabolites, often with single-gene resolution. In addition to expected genes like invertases, critical natural variation was identified in key C4 metabolism genes like carbonic anhydrases and a malate transporter. In contrast to prior maize studies, we found extensive pleiotropy for C and N metabolites. This integration of field-derived metabolite data with powerful mapping and genomics resources allows dissection of key metabolic pathways, providing avenues for future genetic improvement.

Introns structure patterns of variation in nucleotide composition in Arabidopsis thaliana and rice protein-coding genes

Adrienne Ressayre, Sylvain Glemin, Pierre Montalent, Laurana Serres-Giardi, Christine Dillmann, Johann Joets
doi: http://dx.doi.org/10.1101/010819

Plant genomes are large, intron-rich and present a wide range of variation in coding region G+C content. Concerning coding regions, a sort of syndrome can be described in plants: the increase in G+C content is associated with both the increase in heterogeneity among genes within a genome and the increase in variation across genes. Taking advantage of the large number of genes composing plant genomes and the wide range of variation in gene intron number, we performed a comprehensive survey of the patterns of variation in G+C content at different scales, from the nucleotide level to the genome scale, in two species, Arabidopsis thaliana and Oryza sativa, comparing the patterns in genes with different intron numbers. In both species, we observed a pervasive effect of gene intron number and location along genes on G+C content, codon and amino acid frequencies, suggesting that introns have a barrier effect structuring G+C content along genes. In external gene regions (located upstream of the first intron or downstream of the last intron), species-specific factors shape G+C content, while in internal gene regions (surrounded by introns), G+C content is constrained to remain within a range common to both species. In rice, introns appear to be a major determinant of gene G+C content, while in A. thaliana introns have a weaker but significant effect. The structuring effect of introns in both species may explain the G+C content syndrome observed in plants.

Multicellularity makes cellular differentiation evolutionarily stable

Mary Elizabeth Wahl, Andrew Wood Murray
doi: http://dx.doi.org/10.1101/010728

Multicellularity and cellular differentiation, two traits shared by all developing organisms, have evolved independently in many taxa and are often found together in extant species. Differentiation, which we define as a permanent and heritable change in gene expression, produces somatic cells from a totipotent germ line. Though somatic cells may divide indefinitely, they cannot reproduce the complete organism and are thus effectively sterile on long timescales. How has differentiation evolved, repeatedly, despite the fitness costs of producing non-reproductive cells? The absence of extant unicellular differentiating species, as well as the persistence of undifferentiated multicellular groups among the volvocine algae and cyanobacteria, have fueled speculation that multicellularity must arise before differentiation can evolve. We propose that unicellular differentiating populations are intrinsically susceptible to invasion by non-differentiating mutants (“cheats”), whose spread eventually drives differentiating lineages extinct. To directly compare organisms which differ only in the presence or absence of these traits, we engineered both multicellularity and cellular differentiation in budding yeast, including such essential features as irreversible conversion, reproductive division of labor, and clonal multicellularity. We find that non-differentiating mutants overtake unicellular populations but are outcompeted effectively by multicellular differentiating strains, suggesting that multicellularity evolved before differentiation.

A systematic survey of an intragenic epistatic landscape

Claudia Bank, Ryan T. Hietpas, Jeffrey D. Jensen, Daniel N.A. Bolon
doi: http://dx.doi.org/10.1101/010645

Mutations are the source of evolutionary variation. The interactions of multiple mutations can have important effects on fitness and evolutionary trajectories. We have recently described the distribution of fitness effects of all single mutations for a nine amino acid region of yeast Hsp90 (Hsp82) implicated in substrate binding. Here, we report and discuss the distribution of intragenic epistatic effects within this region in seven Hsp90 point mutant backgrounds of neutral to slightly deleterious effect, resulting in an analysis of more than 1000 double-mutants. We find negative epistasis between substitutions to be common, and positive epistasis to be rare – resulting in a pattern that indicates a drastic change in the distribution of fitness effects one step away from the wild type. This can be well explained by a concave relationship between phenotype and genotype (i.e., a concave shape of the local fitness landscape), suggesting mutational robustness intrinsic to the local sequence space. Structural analyses indicate that, in this region, epistatic effects are most pronounced when a solvent-inaccessible position is involved in the interaction. In contrast, all 18 observations of positive epistasis involved at least one mutation at a solvent-exposed position. By combining the analysis of evolutionary and biophysical properties of an epistatic landscape, these results contribute to a more detailed understanding of the complexity of protein evolution.

Bayesian analyses of Yemeni mitochondrial genomes suggest multiple migration events with Africa and Western Eurasia

Deven N Vyas, Andrew Kitchen, Aida T Miró-Herrans, Laurel N Pearson, Ali Al-Meeri, Connie J Mulligan
doi: http://dx.doi.org/10.1101/010629

Anatomically modern humans (AMHs) left Africa ~60,000 years ago, marking the first of multiple dispersal events by AMH between Africa and the Arabian Peninsula. The southern dispersal route (SDR) out of Africa (OOA) posits that early AMHs crossed the Bab el-Mandeb strait from the Horn of Africa into what is now Yemen and followed the coast of the Indian Ocean into eastern Eurasia. If AMHs followed the SDR and left modern descendants in situ, Yemeni populations should retain old autochthonous mitogenome lineages. Alternatively, if AMHs did not follow the SDR or did not leave modern descendants in the region, only young autochthonous lineages will remain as evidence of more recent dispersals. We sequenced 113 whole mitogenomes from multiple Yemeni regions with a focus on haplogroups M, N, and L3(xM,N) as they are considered markers of the initial OOA migrations. We performed Bayesian evolutionary analyses to generate time-measured phylogenies calibrated by Neanderthal and Denisovan mitogenome sequences in order to determine the age of Yemeni-specific clades in our dataset. Our results indicate that the M1, N1, and L3(xM,N) sequences in Yemen are the product of recent migration from Africa and western Eurasia. Although these data suggest that modern Yemeni mitogenomes are not markers of the original OOA migrants, we hypothesize that recent population dynamics may obscure any genetic signature of an ancient SDR migration.

Conflations of short IBD blocks can bias inferred length of IBD

Charleston W.K. Chiang, Peter Ralph, John Novembre
Comments: 12 figures, 1 table
Subjects: Populations and Evolution (q-bio.PE)

Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. Often, segments between two haplotypes are said to be IBD if they are inherited from a recent shared common ancestor without intervening recombination. Long IBD blocks (> 1cM) can be efficiently detected by a number of computer programs using high-density SNP array data from a population sample. However, all programs detect IBD based on contiguous segments of identity-by-state, so an inferred block can in fact be a conflation of smaller, nearby IBD blocks. We quantified this effect using coalescent simulations, finding that nearly 40% of inferred blocks 1-2cM long are false conflations of two or more shorter blocks, under demographic scenarios typical for modern humans. This biases the inferred IBD block length distribution, and so can affect downstream inferences. We observed this conflation effect universally across different IBD detection programs and human demographic histories, and found inference of segments longer than 2cM to be much more reliable (less than 5% conflation rate). We then present and analyze a novel estimator of the de novo mutation rate using IBD blocks, and demonstrate that the biased length distribution of the IBD segments due to conflation can strongly affect this estimator if the conflation is not modeled. Thus, the conflation effect should be carefully considered, especially as methods to detect shorter IBD blocks using sequencing data are being developed.

Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Y Durand, Chuong B Do, Joanna L Mountain, J. Michael Macpherson
doi: http://dx.doi.org/10.1101/010512

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country of origin from the member database of the personal genetics company 23andMe, Inc., excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
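
The third stage above recalibrates confidence estimates with isotonic regression. As a rough illustration of that idea only, the sketch below fits a non-decreasing step function from raw confidence scores to observed accuracy using the pool-adjacent-violators algorithm. The function names (`isotonic_fit`, `recalibrate`) and the toy data are hypothetical, and nothing here is taken from the Ancestry Composition implementation.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    scores:   raw confidence scores (hypothetical classifier output)
    outcomes: 1 if the corresponding call was correct, else 0
    Returns a list of (upper_score, calibrated_value) steps.
    """
    # Each block stores [sum of outcomes, count, largest score in the block].
    blocks = []
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block's mean exceeds the last one,
        # which restores the non-decreasing constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def recalibrate(fit, score):
    """Map a raw score to its calibrated value using the fitted steps."""
    for upper, value in fit:
        if score <= upper:
            return value
    # Scores above the training range take the last (largest) value.
    return fit[-1][1]


# Toy data, invented for illustration.
fit = isotonic_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
print(recalibrate(fit, 0.25))
```

In this toy example the scores 0.2 and 0.3 violate monotonicity (a correct call followed by an incorrect one), so they are pooled into a single step with the average of their outcomes.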

Transcriptome Sequencing Reveals Widespread Gene-Gene and Gene-Environment Interactions

Alfonso Buil, Andrew A Brown, Tuuli Lappalainen, Ana Viñuela, Matthew N Davies, Houfeng F Zheng, Brent J Richards, Daniel Glass, Kerrin S Small, Richard Durbin, Timothy D Spector, Emmanouil T Dermitzakis
doi: http://dx.doi.org/10.1101/010546

Understanding the genetic architecture of gene expression is an intermediate step toward understanding the genetic architecture of complex diseases. RNA-seq technologies have improved the quantification of gene expression and allow the measurement of allele-specific expression (ASE) [1-3]. ASE is hypothesized to result from the direct effect of cis regulatory variants, but a proper estimation of the causes of ASE has not been performed to date. In this study we take advantage of a sample of twins to measure the relative contribution of genetic and environmental effects on ASE, and we find substantial effects of gene x gene (GxG) and gene x environment (GxE) interactions. We propose a model where ASE requires genetic variability in cis, that is, a difference in the sequence of the two alleles, but the magnitude of the ASE effect depends on trans genetic and environmental factors that interact with the cis genetic variants. We uncover large GxG and GxE effects on gene expression and likely complex phenotypes that currently remain elusive.

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, Carson C. Chow, Laurent C.A.M. Tellier, Shashaank Vattikuti, Shaun M. Purcell, James J. Lee
Comments: 2 figures, 1 additional file
Subjects: Genomics (q-bio.GN); Computation (stat.CO)

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
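
To give a feel for what "bit-level parallelism" means in this setting, the sketch below packs binary alleles into a single integer and compares two haplotypes with one XOR followed by a population count, instead of looping over variants one at a time. This is only an illustration of the general technique: the one-bit-per-allele encoding and the function names are assumptions made for the example, and PLINK itself is written in C/C++ with its own data format.

```python
def pack_bits(alleles):
    """Pack a sequence of 0/1 alleles into one int, one bit per variant.

    Hypothetical encoding for illustration; not PLINK's on-disk format.
    """
    word = 0
    for position, allele in enumerate(alleles):
        if allele:
            word |= 1 << position
    return word


def count_mismatches(word_a, word_b):
    """Count positions where two packed haplotypes differ.

    XOR sets a bit wherever the inputs disagree; counting the set bits
    then tallies every disagreement in one pass over the packed word.
    """
    return bin(word_a ^ word_b).count("1")


# Toy haplotypes, invented for illustration.
hap_a = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
hap_b = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
print(count_mismatches(hap_a, hap_b))
```

In compiled code the same pattern operates on fixed-width machine words, so a single instruction handles many variants at once.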

Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants

Aziz Belkadi, Alexandre Bolze, Yuval Itan, Quentin B Vincent, Alexander Antipenko, Bertrand Boisson, Jean-Laurent Casanova, Laurent Abel
doi: http://dx.doi.org/10.1101/010363

We compared whole-exome sequencing (WES) and whole-genome sequencing (WGS) for the detection of single-nucleotide variants (SNVs) in the exomes of six unrelated individuals. In the regions targeted by exome capture, the mean number of SNVs detected was 84,192 for WES and 84,968 for WGS. Only 96% of the variants were detected by both methods, with the same genotype identified for 99.2% of them. The distributions of coverage depth (CD), genotype quality (GQ), and minor read ratio (MRR) were much more homogeneous for WGS than for WES data. Most variants with discordant genotypes were filtered out when we used thresholds of CD≥8X, GQ≥20, and MRR≥0.2. However, a substantial number of coding variants were identified exclusively by WES (105 on average) or WGS (692). We Sanger sequenced a random selection of 170 of these exclusive variants, and estimated the mean number of false-positive coding variants per sample at 79 for WES and 36 for WGS. Importantly, the mean number of real coding variants identified by WGS and missed by WES (656) was much larger than the number of real coding variants identified by WES and missed by WGS (26). A substantial proportion of these exclusive variants (32%) were predicted to be damaging. In addition, about 380 genes were poorly covered (~27% of base pairs with CD<8X) by WES for all samples, including 49 genes underlying Mendelian disorders. We conclude that WGS is more powerful and reliable than WES for detecting potential disease-causing mutations in the exome.
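
As a minimal sketch of the quality filter described above, the code below applies the three stated thresholds (CD≥8X, GQ≥20, MRR≥0.2) to a variant call, and then rechecks the abstract's arithmetic for real exclusive variants (exclusive calls minus estimated false positives). The `VariantCall` class, its field names, and the choice to apply all three thresholds uniformly to every call are assumptions made for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record holding the three metrics named in the abstract."""
    coverage_depth: int       # CD: reads covering the site
    genotype_quality: float   # GQ
    minor_read_ratio: float   # MRR


def passes_filters(call, min_cd=8, min_gq=20, min_mrr=0.2):
    """Return True if the call meets all three thresholds from the abstract."""
    return (call.coverage_depth >= min_cd
            and call.genotype_quality >= min_gq
            and call.minor_read_ratio >= min_mrr)


# Toy calls, invented for illustration.
calls = [
    VariantCall(coverage_depth=30, genotype_quality=99, minor_read_ratio=0.45),
    VariantCall(coverage_depth=5, genotype_quality=40, minor_read_ratio=0.40),
    VariantCall(coverage_depth=12, genotype_quality=15, minor_read_ratio=0.30),
]
kept = [call for call in calls if passes_filters(call)]
print(len(kept))

# Consistency check on the reported per-sample means:
# real exclusive variants = exclusive calls - estimated false positives.
real_wes_only = 105 - 79   # reported as 26
real_wgs_only = 692 - 36   # reported as 656
print(real_wes_only, real_wgs_only)
```

The subtraction reproduces the reported figures of 26 real coding variants found only by WES and 656 found only by WGS.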