A novel test for detecting gene-gene interactions in trio studies

Brunilda Balliu, Noah Zaitlen
doi: http://dx.doi.org/10.1101/021469

Epistasis plays a significant role in the genetic architecture of many complex phenotypes in model organisms. To date, there have been very few interactions replicated in human studies due in part to the multiple hypothesis burden implicit in genome-wide tests of epistasis. Therefore, it is of paramount importance to develop the most powerful tests possible for detecting interactions. In this work we develop a new gene-gene interaction test for use in trio studies called the trio correlation (TC) test. The TC test computes the expected joint distribution of marker pairs in offspring conditional on parental genotypes. This distribution is then incorporated into a standard one degree of freedom correlation test of interaction. We show via extensive simulations that our test substantially outperforms existing tests of interaction in trio studies. The gain in power under standard models of phenotype is large, with previous tests requiring more than twice the number of trios to obtain the power of our test. We also demonstrate a bias in a previous trio interaction test and identify its origin. We conclude that the TC test shows improved power to identify interactions in existing, as well as emerging, trio association studies. The method is publicly available at http://www.github.com/BrunildaBalliu/TrioEpi.
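The conditioning step described in the abstract can be pictured with a minimal sketch of Mendelian transmission. This is not the TC test itself: it only computes the expected offspring genotype distribution at one biallelic marker given the parental genotypes and, under the added assumption that the two markers are unlinked, the joint distribution for a marker pair. The function names and the 0/1/2 genotype coding are illustrative choices, not taken from the authors' software.

```python
from itertools import product


def transmit_probs(genotype):
    """P(parent transmits 0 or 1 copies of the minor allele); genotype in {0, 1, 2}."""
    p_minor = genotype / 2.0
    return {0: 1.0 - p_minor, 1: p_minor}


def offspring_dist(mother, father):
    """Offspring genotype distribution at one marker, given parental genotypes."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for (a, pa), (b, pb) in product(transmit_probs(mother).items(),
                                    transmit_probs(father).items()):
        dist[a + b] += pa * pb
    return dist


def joint_offspring_dist(parents_m1, parents_m2):
    """Joint distribution for a marker pair, ASSUMING the two markers are unlinked."""
    d1 = offspring_dist(*parents_m1)
    d2 = offspring_dist(*parents_m2)
    return {(g1, g2): p1 * p2 for g1, p1 in d1.items() for g2, p2 in d2.items()}


# Two heterozygous parents at marker 1; a 0 x 2 mating at marker 2.
joint = joint_offspring_dist((1, 1), (0, 2))
```

For linked markers the product form no longer holds, and the joint distribution would have to account for parental haplotype phase and recombination.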

On enhancing variation detection through pan-genome indexing

Daniel Valenzuela, Niko Välimäki, Esa Pitkänen, Veli Mäkinen
doi: http://dx.doi.org/10.1101/021444

Detection of genomic variants is commonly conducted by aligning a set of reads sequenced from an individual to the reference genome of the species and analyzing the resulting read pileup. Typically, this process finds a subset of variants already reported in databases and additional novel variants characteristic of the sequenced individual. Most of the effort in the literature has gone into the alignment problem on a single reference sequence, although our accumulated knowledge of species such as human is pan-genomic: we know most of the common variation in addition to the reference sequence. There have been some efforts to exploit pan-genome indexing, where the most widely adopted approach is to build an index structure on a set of reference sequences containing observed variation combinations. The enhancement in alignment accuracy when using pan-genome indexing has been demonstrated in experiments, but so far the above multiple references pan-genome indexing approach has not been tested against its final goal, namely enhancing variation detection. This is the focus of this article: we study a generic approach to add variation detection support on top of the multiple references pan-genomic indexing approach. Namely, we study the read pileup on a multiple alignment of reference genomes, and propose a heaviest path algorithm to extract a new recombined reference sequence. This recombined reference sequence can then be utilized in any standard read alignment and variation detection workflow. We demonstrate that the approach enhances variation detection on realistic data sets.
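One simple way to picture a heaviest-path extraction is a dynamic program over a coverage matrix. The sketch below is not the authors' algorithm: the matrix layout (one row per reference, one column per alignment position), the uniform `switch_penalty` for jumping between rows, and the function name are all assumptions made for illustration.

```python
def heaviest_path(coverage, switch_penalty=0.0):
    """Pick one reference row per alignment column to maximise total read support.

    coverage[r][c] is the read support for reference row r at column c.
    Changing rows between adjacent columns costs switch_penalty (an assumed
    scoring choice). Returns (best score, chosen row for each column).
    """
    n_rows, n_cols = len(coverage), len(coverage[0])
    score = [coverage[r][0] for r in range(n_rows)]
    back = []  # back[c - 1][r] = row used at column c - 1 when column c uses row r
    for c in range(1, n_cols):
        # With a uniform penalty, the best row to jump from is the top scorer.
        best_prev = max(range(n_rows), key=lambda r: score[r])
        new_score, pointers = [], []
        for r in range(n_rows):
            stay = score[r]
            jump = score[best_prev] - switch_penalty
            if stay >= jump:
                pointers.append(r)
                new_score.append(stay + coverage[r][c])
            else:
                pointers.append(best_prev)
                new_score.append(jump + coverage[r][c])
        back.append(pointers)
        score = new_score
    end = max(range(n_rows), key=lambda r: score[r])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[end], path


# Toy pileup: reads support reference 0 on the left and reference 1 on the right.
toy = [[5, 5, 0, 0],
       [0, 0, 5, 5]]
best_score, rows = heaviest_path(toy, switch_penalty=1.0)
```

Under these assumptions, a recombined reference would be spelled out by reading the base of the chosen row at each column and skipping gap characters.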

The structure of the genotype-phenotype map strongly constrains the evolution of non-coding RNA

Kamaludin Dingle, Steffen Schaper, Ard A. Louis
(Submitted on 17 Jun 2015)

The prevalence of neutral mutations implies that biological systems typically have many more genotypes than phenotypes. But can the way that genotypes are distributed over phenotypes determine evolutionary outcomes? Answering such questions is difficult because the number of genotypes can be hyper-astronomically large. By solving the genotype-phenotype (GP) map for RNA secondary structure for systems up to length L=126 nucleotides (where the set of all possible RNA strands would weigh more than the mass of the visible universe) we show that the GP map strongly constrains the evolution of non-coding RNA (ncRNA). Remarkably, simple random sampling over genotypes accurately predicts the distribution of properties such as the mutational robustness or the number of stems per secondary structure found in naturally occurring ncRNA. Since we ignore natural selection, this close correspondence with the mapping suggests that structures allowing for functionality are easily discovered, despite the enormous size of the genetic spaces. The mapping is extremely biased: the majority of genotypes map to an exponentially small portion of the morphospace of all biophysically possible structures. Such strong constraints provide a non-adaptive explanation for the convergent evolution of structures such as the hammerhead ribozyme. ncRNA presents a particularly clear example of bias in the arrival of variation strongly shaping evolutionary outcomes.
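The scale of the genotype space at L=126 can be checked with a short back-of-the-envelope calculation. The genotype count (four nucleotides per position) follows directly from the abstract; the average nucleotide mass of about 330 Da and the figure of about 10^53 kg for ordinary matter in the observable universe are rough assumptions introduced here, not values from the paper.

```python
import math

L = 126                     # sequence length quoted in the abstract
n_genotypes = 4 ** L        # four possible nucleotides at each position

# ASSUMPTIONS (not from the abstract): rough average RNA nucleotide mass and
# a rough order-of-magnitude estimate for ordinary matter in the universe.
NUCLEOTIDE_MASS_DA = 330.0
DALTON_KG = 1.66054e-27
UNIVERSE_MASS_KG = 1e53

mass_per_strand_kg = L * NUCLEOTIDE_MASS_DA * DALTON_KG
total_mass_kg = n_genotypes * mass_per_strand_kg

print(f"number of genotypes: about 10^{math.log10(n_genotypes):.1f}")
print(f"mass of one copy of each strand: about 10^{math.log10(total_mass_kg):.1f} kg")
print(f"exceeds assumed universe mass: {total_mass_kg > UNIVERSE_MASS_KG}")
```

The comparison is only as good as the two assumed constants, but the genotype count itself is exact.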

Selection against maternal microRNA target sites in maternal transcripts

Antonio Marco
doi: http://dx.doi.org/10.1101/012757

In animals, before the zygotic genome is expressed, the egg already contains gene products deposited by the mother. These maternal products are crucial during the initial steps of development. In Drosophila melanogaster a large number of maternal products are found in the oocyte, some of which are indispensable. Many of these products are RNA molecules, such as gene transcripts and ribosomal RNAs. Recently, microRNAs (small RNA gene regulators) have been detected early during development and are important in these initial steps. The presence of some microRNAs in unfertilized eggs has been reported, but whether they have a functional impact in the egg or early embryo has not been explored. I have extracted and sequenced small RNAs from Drosophila unfertilized eggs. The unfertilized egg is rich in small RNAs and contains multiple microRNA products. Maternal microRNAs are often encoded within the intron of maternal genes, suggesting that many maternal microRNAs are the product of transcriptional hitch-hiking. Comparative genomics and population data suggest that maternal transcripts tend to avoid target sites for maternal microRNAs. A potential role of the maternal microRNA mir-9c in maternal-to-zygotic transition is also discussed. In conclusion, maternal microRNAs in Drosophila have a functional impact on maternal protein-coding transcripts.

Computational Performance and Statistical Accuracy of *BEAST and Comparisons with Other Methods

Huw A. Ogilvie, Joseph Heled, Dong Xie, Alexei J. Drummond
(Submitted on 22 Jun 2015)

Under the multispecies coalescent model of molecular evolution, gene trees evolve within a species tree and follow predicted distributions of topologies and coalescent times. In comparison, supermatrix concatenation methods assume that gene trees share a common history and equate gene coalescence with species divergence. The multispecies coalescent is supported by previous studies which found that its predicted distributions fit empirical data, and that concatenation is not a consistent estimator of the species tree. *BEAST, a fully Bayesian implementation of the multispecies coalescent, is popular but computationally intensive, so the advent of large phylogenomic data sets is both a computational challenge and an opportunity for better systematics. Using simulation studies, we characterise the scaling behaviour of *BEAST, and enable quantitative prediction of the impact that increasing the number of loci has on both computational performance and statistical accuracy. Follow-up simulations over a wide range of parameters show that the statistical performance of *BEAST relative to concatenation improves both as branch length is reduced and as the number of loci is increased. Finally, using simulations based on estimated parameters from two phylogenomic data sets, we compare the performance of a range of species tree and concatenation methods to show that using *BEAST with a small subset of loci can be preferable to using concatenation with thousands of loci. Our results provide insight into the practicalities of Bayesian species tree estimation, the number of genes required to obtain a given level of accuracy and the situations in which supermatrix or summary methods will be outperformed by the fully Bayesian multispecies coalescent.

A targeted subgenomic approach for phylogenomics based on microfluidic PCR and high throughput sequencing

Simon Uribe-Convers, Matthew L Settles, David C Tank
doi: http://dx.doi.org/10.1101/021246

Advances in high-throughput sequencing (HTS) have allowed researchers to obtain large amounts of biological sequence information at speeds and costs unimaginable only a decade ago. Phylogenetics, and the study of evolution in general, is quickly migrating towards using HTS to generate larger and more complex molecular datasets. In this paper, we present a method that utilizes microfluidic PCR and HTS to generate large amounts of sequence data suitable for phylogenetic analyses. The approach uses a Fluidigm microfluidic PCR array and two sets of PCR primers to simultaneously amplify 48 target regions across 48 samples, incorporating sample-specific barcodes and HTS adapters (2,304 unique amplicons per microfluidic array). The final product is a pooled set of amplicons ready to be sequenced, and thus, there is no need to construct separate, costly genomic libraries for each sample. Further, we present a bioinformatics pipeline to process the raw HTS reads to either generate consensus sequences (with or without ambiguities) for every locus in every sample or—more importantly—recover the separate alleles from heterozygous target regions in each sample. This is important because it adds allelic information that is well suited for coalescent-based phylogenetic analyses that are becoming very common in conservation and evolutionary biology. To test our subgenomic method and bioinformatics pipeline, we sequenced 576 samples across 96 target regions belonging to the South American clade of the genus Bartsia L. in the plant family Orobanchaceae. After sequencing cleanup and alignment, the experiment resulted in ~25,300bp across 486 samples for a set of 48 primer pairs targeting the plastome, and ~13,500bp for 363 samples for a set of primers targeting regions in the nuclear genome. Finally, we constructed a combined concatenated matrix from all 96 primer combinations, resulting in a combined aligned length of ~40,500bp for 349 samples.
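The amplicon count quoted in the abstract follows from pairing every target region with every sample on one array. The sketch below only reproduces that arithmetic; the target and sample labels are hypothetical placeholders, and the number of arrays per primer set is an inference that assumes every array is fully loaded, which the abstract does not state.

```python
from itertools import product

# Hypothetical labels; the abstract gives only the counts (48 targets x 48 samples).
targets = [f"target_{i:02d}" for i in range(1, 49)]
samples = [f"sample_{j:02d}" for j in range(1, 49)]

# Each (target, sample) pair is one uniquely barcoded amplicon on the array.
amplicons = list(product(targets, samples))
print(len(amplicons))

# ASSUMPTION: arrays are fully loaded, so 576 samples need 576 / 48 arrays
# for each set of 48 primer pairs.
arrays_per_primer_set = 576 // 48
print(arrays_per_primer_set)
```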

A novel normalization approach unveils blind spots in gene expression profiling

Carlos P. Roca, Susana I. L. Gomes, Mónica J. B. Amorim, Janeck J. Scott-Fordsmand
doi: http://dx.doi.org/10.1101/021212

RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, by measuring the concentration of tens of thousands of mRNA molecules in single assays. However, lack of accuracy and reproducibility have hindered the application of these high-throughput technologies. A key challenge in the data analysis is the normalization of gene expression levels, which is required to make them comparable between samples. This normalization is currently performed following approaches resting on an implicit assumption that most genes are not differentially expressed. Here we show that this assumption is unrealistic and likely results in failure to detect numerous gene expression changes. We have devised a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much greater than currently believed, and that it can be measured with available technologies. Our results also explain, at least partially, the problems encountered in transcriptomics studies. We expect this improvement in detection to help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression, such as cell differentiation, toxic responses and cancer.

Environmental fluctuations do not select for increased variation or population-based resistance in Escherichia coli

Shraddha Madhav Karve, Kanishka Tiwary, S Selveshwari, Sutirth Dey
doi: http://dx.doi.org/10.1101/021030

In nature, organisms often face unpredictably fluctuating environments. However, little is understood about the mechanisms that allow organisms to cope with such unpredictability. To address this issue, we used replicate populations of Escherichia coli selected under complex, randomly changing environments. We assayed growth at the level of single cells under four different novel stresses that had no known correlation with the selection environments. Under such conditions, the individuals of the selected populations had significantly lower lag and greater yield compared to the controls. More importantly, there were no outliers in terms of growth, thus ruling out the evolution of population-based resistance. We also assayed the standing phenotypic variation of the selected populations, in terms of their growth on 94 different substrates. Contrary to extant theoretical predictions, there was no increase in the standing variation of the selected populations, nor was there any significant divergence from the ancestors. This suggested that the greater fitness in novel environments is brought about by selection at the level of the individuals, which restricts the suite of traits that can potentially evolve through this mechanism. Given that day-to-day climatic variability of the world is rising, these results have potential public health implications. Our results also underline the need for a very different kind of theoretical approach to study the effects of fluctuating environments.

Leveraging distant relatedness to quantify human mutation and gene conversion rates

Pier Francesco Palamara, Laurent Francioli, Giulio Genovese, Peter Wilton, Alexander Gusev, Hilary Finucane, Sriram Sankararaman, Shamil Sunyaev, Paul Debakker, John Wakeley, Itsik Pe’er, Alkes L. Price, The Genome of the Netherlands Consortium
doi: http://dx.doi.org/10.1101/020776

The rate at which human genomes mutate is a central biological parameter that has many implications for our ability to understand demographic and evolutionary phenomena. We present a method for inferring mutation and gene conversion rates using the number of sequence differences observed in identical-by-descent (IBD) segments together with a reconstructed model of recent population size history. This approach is robust to, and can quantify, the presence of substantial genotyping error, as validated in coalescent simulations. We applied the method to 498 trio-phased Dutch individuals from the Genome of the Netherlands (GoNL) project, sequenced at an average depth of 13×. We infer a point mutation rate of (1.66 ± 0.04) × 10^-8 per base per generation, and a rate of (1.26 ± 0.06) × 10^-9 for <20 bp indels. Our estimated average genome-wide mutation rate is higher than most pedigree-based estimates reported thus far, but lower than estimates obtained using substitution rates across primates. By quantifying how estimates vary as a function of allele frequency, we infer the probability that a site is involved in non-crossover gene conversion as (5.99 ± 0.69) × 10^-6, consistent with recent reports. We find that recombination does not have observable mutagenic effects after gene conversion is accounted for, and that local gene conversion rates reflect recombination rates. We detect a strong enrichment for recent deleterious variation among mismatching variants found within IBD regions, and observe summary statistics of local IBD sharing to closely match previously proposed metrics of background selection, but find no significant effects of selection on our estimates of mutation rate. We detect no evidence for strong variation of mutation rates in a number of genomic annotations obtained from several recent studies.
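To give a feel for what the per-base rates mean, the sketch below converts them into expected new mutations per diploid genome per generation. The two rates are the point estimates from the abstract; the haploid genome length of 3.1 × 10^9 bp is an assumption added here, and because not all of the genome is accessible to sequencing, the result should be read as a rough upper-end illustration rather than a figure from the paper.

```python
# Point estimates quoted in the abstract (per base per generation).
MU_POINT = 1.66e-8
MU_INDEL = 1.26e-9   # indels shorter than 20 bp

# ASSUMPTION (not from the abstract): approximate haploid human genome length.
HAPLOID_GENOME_BP = 3.1e9


def expected_new_mutations(rate_per_base, haploid_bp=HAPLOID_GENOME_BP):
    """Expected de novo events per diploid genome per generation."""
    return 2 * haploid_bp * rate_per_base


point_per_generation = expected_new_mutations(MU_POINT)
indel_per_generation = expected_new_mutations(MU_INDEL)
print(f"point mutations: {point_per_generation:.1f}")
print(f"short indels:    {indel_per_generation:.1f}")
```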

Signatures of Dobzhansky-Muller Incompatibilities in the Genomes of Recombinant Inbred Lines

Maria Colomé-Tatché, Frank Johannes
doi: http://dx.doi.org/10.1101/021006

In the construction of Recombinant Inbred Lines (RILs) from two divergent inbred parents, certain genotype (or epigenotype) combinations may be functionally “incompatible” when brought together in the genomes of the progeny, thus resulting in sterility or lower fertility. Natural selection against these epistatic combinations during inbreeding can change haplotype frequencies and distort linkage disequilibrium (LD) relations between loci within and across chromosomes. These LD distortions have received increased experimental attention, because they point to genomic regions that may drive Dobzhansky-Muller-type reproductive isolation and, ultimately, speciation in the wild. Here we study the selection signatures of two-locus epistatic incompatibility models and quantify their impact on the genetic composition of the genomes of 2-way RILs obtained by selfing. We also consider the biases introduced by breeders when trying to counteract the loss of lines by selectively propagating only viable seeds. Building on our theoretical results, we develop model-based maximum likelihood (ML) tests which can be employed in pairwise genome scans for incompatibility loci using multi-locus genotype data. We illustrate this ML approach in the context of two published A. thaliana RIL panels. Our work lays the theoretical foundation for studying more complex systems such as RILs obtained by sibling mating and/or from multi-parental crosses.