Powerful tests for multi-marker association analysis using ensemble learning

Badri Padhukasahasram

Multi-marker approaches are attracting growing interest in genome-wide association studies and can enhance power to detect new associations under certain conditions. Gene- and pathway-based association tests are increasingly viewed as useful complements to the more widely used single-marker association analyses, which have successfully uncovered numerous disease variants. A major drawback of single-marker methods is that they do not consider pairwise and higher-order interactions between variants. Here, we describe multivariate methods for gene- and pathway-based association analyses using phenotype predictions from machine learning algorithms. Instead of relying only on a linear or logistic regression model, we propose the use of ensembles of diverse machine learning algorithms for testing multivariate associations. As the true mathematical relationship between a phenotype and any group of genetic and clinical variables is unknown in advance and may be complex, such a strategy gives us a general and flexible framework to approximate this relationship across different sets of SNPs. We show how phenotype prediction based on our method can be used to construct tests for SNP-set association analysis. We first apply our method to simulated datasets to demonstrate its power and correctness. Then, we apply it to previously studied asthma-related genes in two independent asthma cohorts to conduct association tests.
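As a rough illustration of how a prediction-based SNP-set test could be built, the sketch below averages two toy learners, scores their out-of-fold predictions against the phenotype, and calibrates that score by permutation. This is not the author's implementation: the two learners (a k-nearest-neighbour regressor and a simple allele-count regression), the correlation statistic, the permutation scheme, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def knn_predict(train_x, train_y, x, k=5):
    """Toy learner 1: mean phenotype of the k nearest training genotypes."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return mean(train_y[i] for i in ranked[:k])

def burden_predict(train_x, train_y, x):
    """Toy learner 2: least-squares regression on the total allele count."""
    s = [sum(g) for g in train_x]
    ms, my = mean(s), mean(train_y)
    sxx = sum((a - ms) ** 2 for a in s)
    if sxx == 0:
        return my
    slope = sum((a - ms) * (b - my) for a, b in zip(s, train_y)) / sxx
    return my + slope * (sum(x) - ms)

def cv_score(genotypes, phenotype, folds=5):
    """Correlation between out-of-fold ensemble predictions and phenotype."""
    n = len(phenotype)
    preds = [0.0] * n
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        tx = [genotypes[i] for i in train]
        ty = [phenotype[i] for i in train]
        for i in range(f, n, folds):  # held-out individuals of this fold
            preds[i] = 0.5 * (knn_predict(tx, ty, genotypes[i])
                              + burden_predict(tx, ty, genotypes[i]))
    return pearson(preds, phenotype)

def snp_set_test(genotypes, phenotype, n_perm=99, seed=0):
    """Permutation p-value for the cross-validated prediction score."""
    rng = random.Random(seed)
    observed = cv_score(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        if cv_score(genotypes, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `genotypes` is a list of per-individual allele-count vectors (0, 1 or 2 per SNP) and `phenotype` a list of quantitative trait values. Permuting the phenotype keeps the correlation structure among SNPs intact, so the null distribution reflects the same SNP set with no association.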

Sequence co-evolution gives 3D contacts and structures of protein complexes

Thomas A. Hopf, Charlotta P.I. Schärfe, João P.G.L.M. Rodrigues, Anna G. Green, Chris Sander, Alexandre M.J.J. Bonvin, Debora S. Marks

High-throughput experiments in bacteria and eukaryotic cells have identified tens of thousands of interactions between proteins. This genome-wide view of the protein interaction universe is coarse-grained, whilst fine-grained detail of macro-molecular interactions critically depends on lower throughput, labor-intensive experiments. Computational approaches using measures of residue co-evolution across proteins show promise, but have been limited to specific interactions. Here we present a new generalized method showing that patterns of evolutionary sequence changes across proteins reflect residues that are close in space, with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We demonstrate that the inferred evolutionary coupling scores accurately predict inter-protein residue interactions and can distinguish between interacting and non-interacting proteins. To illustrate the utility of the method, we predict co-evolved contacts between 50 E. coli complexes (of unknown structure), including the unknown 3D interactions between subunits of ATP synthase and find results consistent with detailed experimental data. We expect that the method can be generalized to genome-wide interaction predictions at residue resolution.

A Simple Data-Adaptive Probabilistic Variant Calling Model

Steve Hoffmann, Peter F. Stadler, Korbinian Strimmer
(Submitted on 20 May 2014)

Background: Several sources of noise obfuscate the identification of single nucleotide variation in next-generation sequencing data. Not only errors introduced during library construction and sequencing steps but also the quality of the reference genome and the algorithms used for the alignment of the reads play an influential role. It is not trivial to estimate the influence of these factors for individual sequencing experiments.
Results: We introduce a simple data-adaptive model for variant calling. Several characteristics are sampled from sites with low mismatch rates and used to estimate empirical log-likelihoods. These likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this distribution we determine a decision threshold to separate potentially variant sites from the noisy background.
Conclusions: In simulations we show that the proposed model is on par with frequently used SNV calling algorithms in terms of sensitivity and specificity. The application to next-generation sequencing data reveals stark differences in the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.

Inferring human population size and separation history from multiple genome sequences

Stephan Schiffels, Richard Durbin

The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.

LIMIX: genetic analysis of multiple traits

Christoph Lippert, Francesco Paolo Casale, Barbara Rakitsch, Oliver Stegle

Multi-trait mixed models have emerged as a promising approach for joint analyses of multiple traits. In principle, the mixed model framework is remarkably general. However, current methods implement only a very specific range of tasks to optimize the necessary computations. Here, we present a multi-trait modeling framework that is versatile and fast: LIMIX allows mixed models to be flexibly adapted to a broad range of applications with different observed and hidden covariates, and variable study designs. To highlight the novel modeling aspects of LIMIX we performed three vastly different genetic studies: joint GWAS of correlated blood lipid phenotypes, joint analysis of the expression levels of the multiple transcript isoforms of a gene, and pathway-based modeling of molecular traits across environments. In these applications we show that LIMIX increases GWAS power and phenotype prediction accuracy, in particular when integrating stepwise multi-locus regression into multi-trait models, and when analyzing large numbers of traits. An open source implementation of LIMIX is freely available at: https://github.com/PMBio/limix.

The distribution of deleterious genetic variation in human populations

Kirk E Lohmueller

Population genetic studies suggest that most amino-acid changing mutations are deleterious. Such mutations are of tremendous interest in human population genetics as they are important for the evolutionary process and may contribute to the risk of common disease. Genomic studies over the past 5 years have documented differences across populations in the number of heterozygous deleterious genotypes, the number of homozygous derived deleterious genotypes, the number of deleterious segregating sites and the proportion of sites that are potentially deleterious. These differences have been attributed to population history affecting the ability of natural selection to remove deleterious variants from the population. However, recent studies have suggested that the genetic load may not differ across populations, and that the efficacy of natural selection has not differed across human populations. Here I show that these observations are not incompatible with each other and that the apparent differences are due to examining different features of the genetic data and differing definitions of terms.

Sperm should evolve to make female meiosis fair.

Yaniv Brandvain, Graham Coop

Genomic conflicts arise when an allele gains an evolutionary advantage at a cost to organismal fitness. Oogenesis is inherently susceptible to such conflicts because alleles compete to be the product of female meiosis transmitted to the egg. Alleles that distort meiosis in their favor (i.e. meiotic drivers) often decrease organismal fitness, and therefore indirectly favor the evolution of mechanisms to suppress meiotic drive. In this light, many facets of oogenesis and gametogenesis have been interpreted as mechanisms of protection against genomic outlaws. Why then is female meiosis often left uncompleted until after fertilization in many animals — potentially providing an opportunity for sperm alleles to meddle with its outcome and help like-alleles drive in heterozygous females? The population genetic theory presented herein suggests that sperm nearly always evolve to increase the fairness of female meiosis in the face of genomic conflicts. These results are consistent with current knowledge of sperm-dependent meiotic drivers (loci whose distortion of female meiosis depends on sperm genotype), and suggest that the requirement of fertilization for the completion of female meiosis potentially represents a mechanism employed by females to ensure a fair meiosis.

Author post: Long non-coding RNAs as a source of new peptides

This post is by M. Mar Albà on her preprint (with co-authors), Long non-coding RNAs as a source of new peptides, available from arXiv.

Several recent studies based on deep sequencing of ribosome protected fragments have reported that many long non-coding RNAs (lncRNAs) associate with ribosomes (see for example Everything old is new again: (linc)RNAs make proteins! a comment by Stephen M Cohen). We have analyzed the original data from experiments performed in six different eukaryotic species and confirmed that this is a widespread phenomenon. This is paradoxical because lncRNAs apparently have very little coding capacity with only short open reading frames (ORFs) that do not show sequence similarity to known proteins.

In contrast to typical mRNAs, many lncRNAs are lineage-specific. Therefore, if they are translated, they should be similar to recently evolved protein-coding genes. This is exactly what we have found. It turns out that transcripts encoding young proteins show very similar properties to lncRNAs: short and non-conserved ORFs, low coding sequence potential, and relatively weak selective constraints.

Evidence has accumulated in recent years that new protein-coding genes are continuously evolving (The continuing evolution of genes by Carl Zimmer). The birth of a new functional protein is a process of trial and error that most likely requires the expression of many transcripts that will not survive the test of time. LncRNAs seem to fit the bill for this role.

Author post: Diversity and evolution of centromere repeats in the maize genome

This guest post is by Paul Bilinski on his paper with coauthors Diversity and evolution of centromere repeats in the maize genome BioRxived here.

Centromeres have the potential to play a central role in speciation, yet our ability to study them has been limited because of their repetitive nature. The centromeres of many eukaryotes consist partly of large arrays of short tandem repeats, though the actual sequence of the repeat varies widely across taxa. To investigate whether the variation found in the tandem repeats themselves could inform our understanding of their evolutionary history, we made use of the reference maize genome as well as resequencing data from several lines of maize and its wild relative teosinte.

Although tandem repeats should be identical upon duplication, our analysis of CentC in maize revealed that most copies genome-wide are unique. We observed only three instances where adjacent copies were identical in sequence and length, driving home the idea that these tandem repeats have accumulated immense diversity. Given such diversity, we wanted to investigate genetic relatedness across CentC copies.

Using positional and genetic relatedness information from the fully sequenced centromeres 2 and 5, we found high within-cluster similarity, suggesting that tandem duplications drove most CentC copy number increase. Contrary to patterns seen in Arabidopsis (Kawabe and Nasuda 2005), principal coordinate analysis of repeats found no clustering by chromosome, with groups of CentC with similar sequence distributed across all of the chromosomes.

Another surprising discovery involved the origin of the biggest arrays of CentC. As an ancient tetraploid, maize originally had 20 chromosomes with 20 centromeres. Processes of fractionation and rearrangement have led to the 10 chromosomes in the extant maize genome. Schnable et al (2011) were able to identify which chromosomal segments derive from each of maize’s ancient parents, referred to as subgenomes 1 and 2. Wang and Bennetzen (2011) built on this information, and found that about half of the modern centromeres came from each parent. Inferring subgenome of origin by flanking regions, we found that all of the CentC clusters >20kb in length derive from subgenome 1. The proportions are less skewed when looking at clusters >10kb, though in all cases we see more bp of CentC assigned to subgenome 1 than we expect based on its total bp in the genome. This is particularly interesting because subgenome 1 also shows higher overall gene expression and fewer deletions than subgenome 2 (Schnable et al 2011).

The diversity of CentC seen might suggest that CentC repeats were reasonably static in the genome, persisting in the same spot for a long time with occasional increases in copy number via tandem duplication. However, fluorescent in situ hybridization suggested that domestication resulted in a large loss of CentC signal across many of maize’s 10 chromosomes. We confirmed and quantified the loss of CentC using resequencing data from a set of maize and teosinte lines (Chia et al. 2012).

Combined, our results suggest long term stability of CentC clusters with new copies arising from tandem duplication, while mutation serves to homogenize rather than separate clusters. We hope our insights into centromere repeat evolution will build toward a better understanding of their role in evolution.

Evidence for strong co-evolution of mitochondrial and somatic genomes

Michael G. Sadovsky
(Submitted on 20 May 2014)

We studied the relation between the triplet frequency composition of mitochondrial genomes and the phylogeny of their bearers. First, clusters in 63-dimensional space were identified using K-means. Second, the clade composition of those clusters was studied. It was found that genomes are distributed among the clusters very regularly, with strong correlation to taxonomy. Strong co-evolution manifests through this correlation: the proximity in frequency space was determined over the mitochondrial genomes, while the proximity in taxonomy was determined morphologically.
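The clustering step in this abstract can be sketched in a few lines. The example below converts each sequence into its 64 triplet frequencies and groups the resulting vectors with a plain Lloyd-style K-means. Because the 64 frequencies sum to one, the points lie in a 63-dimensional subspace, which is presumably the space the abstract refers to. The overlapping reading window (`step=1`), the handling of ambiguous bases, and the function names are assumptions made for illustration, not details taken from the preprint.

```python
from itertools import product
import random

TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 triplets

def triplet_frequencies(seq, step=1):
    """Return the 64 triplet frequencies of seq, in TRIPLETS order."""
    seq = seq.upper()
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(0, len(seq) - 2, step):
        word = seq[i:i + 3]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid triplets")
    return [counts[t] / total for t in TRIPLETS]

def sq_dist(p, q):
    """Squared Euclidean distance between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Given a list of genome sequences, `kmeans([triplet_frequencies(s) for s in genomes], k)` returns cluster labels that can then be cross-tabulated against the taxonomy of each genome's bearer.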