Sequence co-evolution gives 3D contacts and structures of protein complexes

Thomas A. Hopf, Charlotta P.I. Schärfe, João P.G.L.M. Rodrigues, Anna G. Green, Chris Sander, Alexandre M.J.J. Bonvin, Debora S. Marks

High-throughput experiments in bacteria and eukaryotic cells have identified tens of thousands of interactions between proteins. This genome-wide view of the protein interaction universe is coarse-grained, whilst fine-grained detail of macro-molecular interactions critically depends on lower throughput, labor-intensive experiments. Computational approaches using measures of residue co-evolution across proteins show promise, but have been limited to specific interactions. Here we present a new generalized method showing that patterns of evolutionary sequence changes across proteins reflect residues that are close in space, with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We demonstrate that the inferred evolutionary coupling scores accurately predict inter-protein residue interactions and can distinguish between interacting and non-interacting proteins. To illustrate the utility of the method, we predict co-evolved contacts between 50 E. coli complexes (of unknown structure), including the unknown 3D interactions between subunits of ATP synthase, and find results consistent with detailed experimental data. We expect that the method can be generalized to genome-wide interaction predictions at residue resolution.
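To make the idea of inter-protein co-evolution concrete, here is a minimal sketch that scores every pair of alignment columns (one column from each protein) by mutual information. This is not the paper's evolutionary coupling score, which the abstract does not specify; mutual information is a simpler stand-in chosen for illustration. The function names and the assumption that the two alignments are already paired row by row (same species in the same order) are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two equally long alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_joint = c / n
        # p_joint / (p_a * p_b), written with raw counts
        mi += p_joint * log(p_joint * n * n / (count_a[a] * count_b[b]))
    return mi


def inter_protein_scores(msa_a, msa_b):
    """Score each column pair (i in protein A, j in protein B).

    Assumption: msa_a[k] and msa_b[k] come from the same genome, so the
    two alignments are paired row by row.
    """
    if len(msa_a) != len(msa_b):
        raise ValueError("alignments must contain the same number of sequences")
    cols_a = list(zip(*msa_a))
    cols_b = list(zip(*msa_b))
    return {
        (i, j): mutual_information(ca, cb)
        for i, ca in enumerate(cols_a)
        for j, cb in enumerate(cols_b)
    }


# Toy paired alignments: column 0 of A co-varies with column 0 of B.
msa_a = ["AK", "AK", "GE", "GE"]
msa_b = ["LD", "LD", "VD", "VD"]
scores = inter_protein_scores(msa_a, msa_b)
```

In the toy example, the pair (0, 0) gets a high score because the two columns change together, while any pair involving the invariant second column of B scores zero. Real alignments would additionally need sequence weighting and correction for phylogenetic and indirect correlations, which this sketch ignores.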

A Simple Data-Adaptive Probabilistic Variant Calling Model

Steve Hoffmann, Peter F. Stadler, Korbinian Strimmer
(Submitted on 20 May 2014)

Background: Several sources of noise obfuscate the identification of single nucleotide variation in next generation sequencing data. Not only errors introduced during library construction and sequencing steps but also the quality of the reference genome and the algorithms used for the alignment of the reads play an influential role. It is not trivial to estimate the influence of these factors for individual sequencing experiments.
Results: We introduce a simple data-adaptive model for variant calling. Several characteristics are sampled from sites with low mismatch rates and used to estimate empirical log-likelihoods. These likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this distribution we determine a decision threshold to separate potentially variant sites from the noisy background.
Conclusions: In simulations we show that the proposed model is on par with frequently used SNV calling algorithms in terms of sensitivity and specificity. The application to next-generation sequencing data reveals stark differences between the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.
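The Results paragraph above can be illustrated with a rough sketch: estimate per-feature empirical log-likelihoods from background sites, sum them into a score, and flag sites whose score falls below a threshold. Everything specific here is an assumption for illustration, not the authors' implementation: the feature names (`mismatch_rate`, `mean_qual`), the histogram binning, the pseudocount, and especially the threshold, which is taken as a low quantile of the background scores rather than derived from the mixture distribution the abstract mentions.

```python
from collections import Counter
from math import log


def fit_background(values, n_bins, lo, hi, pseudocount=1.0):
    """Return a function giving the empirical log-likelihood of a value,
    estimated from a smoothed fixed-width histogram of background values."""
    width = (hi - lo) / n_bins

    def bin_of(x):
        return min(n_bins - 1, max(0, int((x - lo) / width)))

    counts = Counter(bin_of(v) for v in values)
    total = len(values) + pseudocount * n_bins

    def loglik(x):
        return log((counts[bin_of(x)] + pseudocount) / total)

    return loglik


def site_score(site, logliks):
    """Combine the per-feature log-likelihoods of one site into a single score."""
    return sum(logliks[name](site[name]) for name in logliks)


def call_variants(sites, background, features, quantile=0.01):
    """Flag sites that look unlikely under the background model.

    features maps a (hypothetical) feature name to (n_bins, lo, hi).
    The quantile-based threshold is an assumption of this sketch.
    """
    logliks = {
        name: fit_background([s[name] for s in background], n_bins, lo, hi)
        for name, (n_bins, lo, hi) in features.items()
    }
    bg_scores = sorted(site_score(s, logliks) for s in background)
    threshold = bg_scores[int(quantile * (len(bg_scores) - 1))]
    return [s for s in sites if site_score(s, logliks) < threshold]


features = {"mismatch_rate": (10, 0.0, 1.0), "mean_qual": (10, 0.0, 40.0)}
background = [
    {"mismatch_rate": 0.01 * (i % 5), "mean_qual": 10 + (i % 3)}
    for i in range(100)
]
candidate = {"mismatch_rate": 0.5, "mean_qual": 35}
typical = {"mismatch_rate": 0.02, "mean_qual": 10}
called = call_variants([candidate, typical], background, features)
```

Because the background model is re-estimated from each data set, the same code adapts to experiments with different noise profiles, which is the data-adaptive aspect the abstract emphasizes.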

Inferring human population size and separation history from multiple genome sequences

Stephan Schiffels, Richard Durbin

The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.

LIMIX: genetic analysis of multiple traits

Christoph Lippert, Francesco Paolo Casale, Barbara Rakitsch, Oliver Stegle

Multi-trait mixed models have emerged as a promising approach for joint analyses of multiple traits. In principle, the mixed model framework is remarkably general. However, current methods implement only a very specific range of tasks to optimize the necessary computations. Here, we present a multi-trait modeling framework that is versatile and fast: LIMIX makes it possible to flexibly adapt mixed models for a broad range of applications with different observed and hidden covariates, and variable study designs. To highlight the novel modeling aspects of LIMIX we performed three vastly different genetic studies: joint GWAS of correlated blood lipid phenotypes, joint analysis of the expression levels of the multiple transcript-isoforms of a gene, and pathway-based modeling of molecular traits across environments. In these applications we show that LIMIX increases GWAS power and phenotype prediction accuracy, in particular when integrating stepwise multi-locus regression into multi-trait models, and when analyzing large numbers of traits. An open source implementation of LIMIX is freely available at: https://github.com/PMBio/limix.

The distribution of deleterious genetic variation in human populations

Kirk E Lohmueller

Population genetic studies suggest that most amino-acid changing mutations are deleterious. Such mutations are of tremendous interest in human population genetics as they are important for the evolutionary process and may contribute risk to common disease. Genomic studies over the past 5 years have documented differences across populations in the number of heterozygous deleterious genotypes, numbers of homozygous derived deleterious genotypes, number of deleterious segregating sites and proportion of sites that are potentially deleterious. These differences have been attributed to population history affecting the ability of natural selection to remove deleterious variants from the population. However, recent studies have suggested that the genetic load may not differ across populations, and that the efficacy of natural selection has not differed across human populations. Here I show that these observations are not incompatible with each other and that the apparent differences are due to examining different features of the genetic data and differing definitions of terms.

Sperm should evolve to make female meiosis fair.

Yaniv Brandvain, Graham Coop

Genomic conflicts arise when an allele gains an evolutionary advantage at a cost to organismal fitness. Oogenesis is inherently susceptible to such conflicts because alleles compete to be the product of female meiosis transmitted to the egg. Alleles that distort meiosis in their favor (i.e. meiotic drivers) often decrease organismal fitness, and therefore indirectly favor the evolution of mechanisms to suppress meiotic drive. In this light, many facets of oogenesis and gametogenesis have been interpreted as mechanisms of protection against genomic outlaws. Why then is female meiosis often left uncompleted until after fertilization in many animals — potentially providing an opportunity for sperm alleles to meddle with its outcome and help like-alleles drive in heterozygous females? The population genetic theory presented herein suggests that sperm nearly always evolve to increase the fairness of female meiosis in the face of genomic conflicts. These results are consistent with current knowledge of sperm-dependent meiotic drivers (loci whose distortion of female meiosis depends on sperm genotype), and suggest that the requirement of fertilization for the completion of female meiosis potentially represents a mechanism employed by females to ensure a fair meiosis.

Author post: Diversity and evolution of centromere repeats in the maize genome

This guest post is by Paul Bilinski on his paper with coauthors, "Diversity and evolution of centromere repeats in the maize genome", BioRxived here.

Centromeres have the potential to play a central role in speciation, yet our ability to study them has been limited because of their repetitive nature. The centromeres of many eukaryotes consist partly of large arrays of short tandem repeats, though the actual sequence of the repeat varies widely across taxa. To investigate whether the variation found in the tandem repeats themselves could inform our understanding of their evolutionary history, we made use of the reference maize genome as well as resequencing data from several lines of maize and its wild relative teosinte.

Although tandem repeats should be identical upon duplication, our analysis of CentC in maize revealed that most copies genome-wide are unique. We observed only three instances where adjacent copies were identical in sequence and length, driving home the idea that these tandem repeats have accumulated immense diversity. Given such diversity, we wanted to investigate genetic relatedness across CentC copies.

Using positional and genetic relatedness information from the fully-sequenced centromeres 2 and 5, we found high within-cluster similarity, suggesting that tandem duplications drove most CentC copy number increase. Contrary to patterns seen in Arabidopsis (Kawabe and Nasuda 2005), principal coordinate analysis of repeats found no clustering by chromosome, with groups of CentC with similar sequence distributed across all of the chromosomes.

Another surprising discovery involved the origin of the biggest arrays of CentC. As an ancient tetraploid, maize originally had 20 chromosomes with 20 centromeres. Processes of fractionation and rearrangement have led to the 10 chromosomes in the extant maize genome. Schnable et al. (2011) were able to identify which chromosomal segments derive from each of maize’s ancient parents, referred to as subgenomes 1 and 2. Wang and Bennetzen (2011) built on this information, and found that about half of the modern centromeres came from each parent. Inferring subgenome of origin by flanking regions, we found that all of the CentC clusters >20kb in length derive from subgenome 1. The proportions are less skewed when looking at clusters >10kb, though in all cases we see more bp of CentC assigned to subgenome 1 than we expect based on its total bp in the genome. This is particularly interesting because subgenome 1 also shows higher overall gene expression and fewer deletions than subgenome 2 (Schnable et al. 2011).

The diversity of CentC seen might suggest that CentC repeats were reasonably static in the genome, persisting in the same spot for a long time with occasional increases in copy number via tandem duplication. However, fluorescent in situ hybridization suggested that domestication resulted in a large loss of CentC signal across many of maize’s 10 chromosomes. We confirmed and quantified the loss of CentC using resequencing data from a set of maize and teosinte lines (Chia et al. 2012).

Combined, our results suggest long-term stability of CentC clusters with new copies arising from tandem duplication, while mutation serves to homogenize rather than separate clusters. We hope our insights into centromere repeat evolution will build toward a better understanding of their role in evolution.

Deleterious passengers in adapting populations

Benjamin H Good, Michael M Desai
Subjects: Populations and Evolution (q-bio.PE)

Most new mutations are deleterious and are eventually eliminated by natural selection. But in an adapting population, the rapid amplification of beneficial mutations can hinder the removal of deleterious variants in nearby regions of the genome, altering the patterns of sequence evolution. Here, we analyze the interactions between beneficial “driver” mutations and linked deleterious “passengers” during the course of adaptation. We derive analytical expressions for the substitution rate of a deleterious mutation as a function of its fitness cost, as well as the reduction in the beneficial substitution rate due to the genetic load of the passengers. We find that the fate of each deleterious mutation varies dramatically with the rate and spectrum of beneficial mutations, with a non-monotonic dependence on both the population size and the rate of adaptation. By quantifying this dependence, our results allow us to estimate which deleterious mutations will be likely to fix, and how many of these mutations must arise before the progress of adaptation is significantly reduced.

Locus architecture affects mRNA expression levels in Drosophila embryos

Tara Lydiard-Martin, Meghan Bragdon, Kelly B Eckenrode, Zeba Wunderlich, Angela H DePace

Structural variation in the genome is common due to insertions, deletions, duplications and rearrangements. However, little is known about the ways structural variants impact gene expression. Developmental genes are controlled by multiple regulatory sequence elements scattered over thousands of bases; developmental loci are therefore a good model to test the functional impact of structural variation on gene expression. Here, we measured the effect of rearranging two developmental enhancers from the even-skipped (eve) locus in Drosophila melanogaster blastoderm embryos. We systematically varied orientation, order, and spacing of the enhancers in transgenic reporter constructs and measured expression quantitatively at single cell resolution in whole embryos to detect changes in both level and position of expression. We found that the position of expression was robust to changes in locus organization, but levels of expression were highly sensitive to the spacing between enhancers and order relative to the promoter. Our data demonstrate that changes in locus architecture can dramatically impact levels of gene expression. To quantitatively predict gene expression from sequence, we must therefore consider how information is integrated both within enhancers and across gene loci.

RNA-seq gene profiling – a systematic empirical comparison

Nuno A Fonseca, John A Marioni, Alvis Brazma

Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNAseq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNAseq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
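The kind of comparison described above, pipeline estimates against simulated ground truth, can be sketched in a few lines. This is an illustrative assumption about how such an evaluation might look, not the authors' code: the choice of Pearson correlation on log2(x + 1) values, the relative-error definition, the 0.5 cutoff, and all names and numbers are hypothetical.

```python
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def evaluate(truth, estimates):
    """Return (correlation on log2(x + 1) values, per-gene relative error)."""
    genes = sorted(truth)
    t = [log2(truth[g] + 1) for g in genes]
    e = [log2(estimates.get(g, 0.0) + 1) for g in genes]
    rel_err = {
        # max(..., 1.0) avoids division by zero for unexpressed genes
        g: abs(estimates.get(g, 0.0) - truth[g]) / max(truth[g], 1.0)
        for g in genes
    }
    return pearson(t, e), rel_err


def consistently_bad(truth, pipelines, cutoff=0.5):
    """Genes whose relative error exceeds the cutoff in every pipeline."""
    bad = None
    for estimates in pipelines.values():
        _, rel_err = evaluate(truth, estimates)
        flagged = {g for g, err in rel_err.items() if err > cutoff}
        bad = flagged if bad is None else bad & flagged
    return bad or set()


# Made-up expression values for two hypothetical pipelines.
truth = {"g1": 10, "g2": 100, "g3": 1000, "g4": 50}
pipelines = {
    "pipeline_a": {"g1": 11, "g2": 95, "g3": 990, "g4": 5},
    "pipeline_b": {"g1": 9, "g2": 110, "g3": 1200, "g4": 150},
}
problem_genes = consistently_bad(truth, pipelines)
```

Separating the global correlation from the per-gene error mirrors findings i) to iii): a pipeline can correlate well with the truth overall while a small set of genes is estimated poorly by every method.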