The Spatial Mixing of Genomes in Secondary Contact Zones

Alisa Sedghifar, Yaniv Brandvain, Peter L. Ralph, Graham Coop
doi: http://dx.doi.org/10.1101/016337

Recent genomic studies have highlighted the important role of admixture in shaping genome-wide patterns of diversity. Past admixture leaves a population genomic signature of linkage disequilibrium (LD), reflecting the mixing of parental chromosomes by segregation and recombination. The extent of this LD can be used to infer the timing of admixture. However, the results of inference can depend strongly on the assumed demographic model. Here, we introduce a theoretical framework for modeling patterns of LD in a geographic contact zone where two differentiated populations are diffusing back together. We derive expressions for the expected LD and admixture tract lengths across geographic space as a function of the age of the contact zone and the dispersal distance of individuals. We develop an approach to infer age of contact zones using population genomic data from multiple spatially sampled populations by fitting our model to the decay of LD with recombination distance. We use our approach to explore the fit of a geographic contact zone model to three human population genomic datasets from populations along the Indonesian archipelago, populations in Central Asia and populations in India.
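As a point of reference for the inference described above, the simplest baseline is a single pulse of admixture T generations ago, under which admixture LD between markers at recombination distance r (in Morgans) decays approximately as exp(-rT). The sketch below fits that baseline by least squares on log LD. It is not the authors' geographic contact zone model, and the function name, amplitude, and synthetic data are illustrative assumptions.

```python
import math

def fit_admixture_time(distances_morgans, ld_values):
    """Estimate T from LD ~ A * exp(-r * T) via a linear fit of log(LD) on r."""
    xs = list(distances_morgans)
    ys = [math.log(v) for v in ld_values]  # requires strictly positive LD values
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # the slope of log(LD) against r is -T

# Synthetic, noise-free example (hypothetical values).
true_T = 50
distances = [0.001 * k for k in range(1, 21)]           # 0.1 cM to 2 cM
ld = [0.2 * math.exp(-true_T * r) for r in distances]   # amplitude 0.2 is arbitrary
estimate = fit_admixture_time(distances, ld)
print(estimate)
```

On noise-free synthetic input the fit recovers the simulated age. Real LD estimates are noisy and, as the abstract stresses, the inferred age depends strongly on the demographic model assumed.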

Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis

Po-Ru Loh, Gaurav Bhatia, Alexander Gusev, Hilary K Finucane, Brendan K Bulik-Sullivan, Samuela J Pollack, Schizophrenia Working Group Psychiatric Genomics Consortium, Teresa R de Candia, Sang Hong Lee, Naomi R Wray, Kenneth S Kendler, Michael C O’Donovan, Benjamin M Neale, Nick Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/016527

Heritability analyses of GWAS cohorts have yielded important insights into complex disease architecture, and increasing sample sizes hold the promise of further discoveries. Here, we analyze the genetic architecture of schizophrenia in 49,806 samples from the PGC, and nine complex diseases in 54,734 samples from the GERA cohort. For schizophrenia, we infer an overwhelmingly polygenic disease architecture in which ≥76% of 1Mb genomic regions harbor at least one variant influencing schizophrenia risk. We also observe significant enrichment of heritability in GC-rich regions and in higher-frequency SNPs for both schizophrenia and GERA diseases. In bivariate analyses, we observe significant genetic correlations (ranging from 0.18 to 0.85) for 13 of 36 pairs of GERA diseases; genetic correlations were consistently stronger (1.3x on average) than correlations of overall disease liabilities. To accomplish these analyses, we developed a novel, fast algorithm for multi-component, multi-trait variance components analysis that overcomes prior computational barriers that made such analyses intractable at this scale.

Eight thousand years of natural selection in Europe

Iain Mathieson, Iosif Lazaridis, Nadin Rohland, Swapan Mallick, Bastien Llamas, Joseph Pickrell, Harald Meller, Manuel A. Rojo Guerra, Johannes Krause, David Anthony, Dorcas Brown, Carles Lalueza Fox, Alan Cooper, Kurt W. Alt, Wolfgang Haak, Nick Patterson, David Reich
doi: http://dx.doi.org/10.1101/016477

The arrival of farming in Europe beginning around 8,500 years ago required adaptation to new environments, pathogens, diets, and social organizations. While evidence of natural selection can be revealed by studying patterns of genetic variation in present-day people, these patterns are only indirect echoes of past events, and provide little information about where and when selection occurred. Ancient DNA makes it possible to examine populations as they were before, during and after adaptation events, and thus to reveal the tempo and mode of selection. Here we report the first genome-wide scan for selection using ancient DNA, based on 83 human samples from Holocene Europe analyzed at over 300,000 positions. We find five genome-wide signals of selection, at loci associated with diet and pigmentation. Surprisingly, in light of suggestions of selection on immune traits associated with the advent of agriculture and denser living conditions, we find no strong sweeps associated with immunological phenotypes. We also report a scan for selection for complex traits, and find two signals of selection on height: for short stature in Iberia after the arrival of agriculture, and for tall stature on the Pontic-Caspian steppe earlier than 5,000 years ago. A surprise is that in Scandinavian hunter-gatherers living around 8,000 years ago, there is a high frequency of the derived allele at the EDAR gene that is the strongest known signal of selection in East Asians and that is thought to have arisen in East Asia. These results document the power of ancient DNA to reveal features of past adaptation that could not be understood from analyses of present-day people.

Coalescent histories for lodgepole species trees

Filippo Disanto, Noah A. Rosenberg
Subjects: Populations and Evolution (q-bio.PE); Combinatorics (math.CO)

Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the \emph{lodgepole} species trees $(\lambda_n)_{n\geq 0}$, in which tree $\lambda_n$ has $m=2n+1$ taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with $m!!$ in the number of taxa $m$. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with $m$ taxa, increasing a previous bound of $(\sqrt{\pi} / 32)[(5m-12)/(4m-6)] m \sqrt{m}$ to $[ \sqrt{m-1}/(4 \sqrt{e}) ]^{m}$. We discuss the implications of our enumerative results for phylogenetic computations.
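To get a feel for the growth rates quoted above, the following sketch evaluates the double factorial $m!!$ for $m = 2n+1$ and the two lower bounds on the ratio exactly as written in the abstract. The function names are illustrative, and the numbers only evaluate the stated formulas; they are not the paper's enumeration of coalescent histories.

```python
import math

def double_factorial(m):
    """Return m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def previous_bound(m):
    """(sqrt(pi) / 32) * [(5m - 12) / (4m - 6)] * m * sqrt(m)."""
    return (math.sqrt(math.pi) / 32) * ((5 * m - 12) / (4 * m - 6)) * m * math.sqrt(m)

def lodgepole_bound(m):
    """[sqrt(m - 1) / (4 * sqrt(e))] ** m."""
    return (math.sqrt(m - 1) / (4 * math.sqrt(math.e))) ** m

# Tabulate for a few lodgepole sizes n, with m = 2n + 1 taxa.
for n in (3, 10, 50):
    m = 2 * n + 1
    print(m, double_factorial(m), previous_bound(m), lodgepole_bound(m))
```

The previous bound grows polynomially in $m$, whereas the base of the new bound itself grows with $m$, so the new bound overtakes the old one only once $m$ is sufficiently large and then grows faster than any exponential.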

PoMo: An Allele Frequency-based Approach for Species Tree Estimation

Nicola De Maio, Dominik Schrempf, Carolin Kosiol
doi: http://dx.doi.org/10.1101/016360

Incomplete lineage sorting can cause incongruencies of the overall species-level phylogenetic tree with the phylogenetic trees for individual genes or genomic segments. If these incongruencies are not accounted for, it is possible to incur several biases in species tree estimation. Here, we present a simple maximum likelihood approach that accounts for ancestral variation and incomplete lineage sorting. We use a POlymorphisms-aware phylogenetic MOdel (PoMo) that we have recently shown to efficiently estimate mutation rates and fixation biases from within and between-species variation data. We extend this model to perform efficient estimation of species trees. We test the performance of PoMo in several different scenarios of incomplete lineage sorting using simulations and compare it with existing methods both in accuracy and computational speed. In contrast to other approaches, our model does not use coalescent theory but is allele-frequency based. We show that PoMo is well suited for genome-wide species tree estimation and that on such data it is more accurate than previous approaches.

ISMapper: Identifying insertion sequences in bacterial genomes from short read sequence data

Jane Hawkey, Mohammad Hamidian, Ryan R Wick, David J Edwards, Helen Billman-Jacobe, Ruth M Hall, Kathryn E Holt
doi: http://dx.doi.org/10.1101/016345

Background: Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However, IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, direct from paired-end short read data. Results: ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 96% of known IS insertions in the analysis of simulated reads, and 98% in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was reliable for average genome-wide read depths >20x. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots. Conclusions: ISMapper provides a rapid and robust method for identifying IS insertion sites direct from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.

No association between plant mating system & geographic range overlap

Dena Grossenbacher, Ryan Briscoe Runquist, Emma Goldberg, Yaniv Brandvain
doi: http://dx.doi.org/10.1101/016261

Both evolutionary theory and numerous case studies suggest that selfing taxa are more likely to co-occur with outcrossing relatives than are outcrossing taxa. Despite suggestions that this pattern may be general, the extent to which mating system influences range overlap in close relatives has not been tested formally across a diverse group of plant species pairs. We test for a difference in range overlap between species pairs where zero, one, or both species are selfers with data from 98 sister species pairs in 20 genera. We also use divergence time estimates from time-calibrated phylogenies to ask how range overlap changes with divergence time and whether this effect depends on mating system. We find no evidence that self-pollination influences range overlap of closely related species. While the extent of range overlap decreased modestly with the divergence time of sister species, this effect did not depend on mating system. The absence of a strong influence of mating system on range overlap suggests that of the many mechanisms potentially influencing the co-occurrence of close relatives, mating system plays a minor and/or inconsistent role.

A Comparison of Methods to Measure Fitness in Escherichia coli

Michael J Wiser, Richard E Lenski
doi: http://dx.doi.org/10.1101/016121

In order to characterize the dynamics of adaptation, it is important to be able to quantify how a population’s mean fitness changes over time. Such measurements are especially important in experimental studies of evolution using microbes. The Long-Term Evolution Experiment (LTEE) with Escherichia coli provides one such system in which mean fitness has been measured by competing derived and ancestral populations. The traditional method used to measure fitness in the LTEE and many similar experiments, though, is subject to a potential limitation. As the relative fitness of the two competitors diverges, the measurement error increases because the less-fit population becomes increasingly small and cannot be enumerated as precisely. Here, we present and employ two alternatives to the traditional method. One is based on reducing the fitness differential between the competitors by using a common reference competitor from an intermediate generation that has intermediate fitness; the other alternative increases the initial population size of the less-fit, ancestral competitor. We performed a total of 480 competitions to compare the statistical properties of estimates obtained using these alternative methods with those obtained using the traditional method for samples taken over 50,000 generations from one of the LTEE populations. On balance, neither alternative method yielded measurements that were more precise than the traditional method.
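The abstract does not state the fitness formula, so the sketch below assumes the convention commonly used for such competition assays: relative fitness is the ratio of the two competitors' realized growth rates, each computed as the natural log of final over initial density. The function name and the example densities are hypothetical.

```python
import math

def relative_fitness(derived_initial, derived_final, ancestor_initial, ancestor_final):
    """Ratio of realized growth rates (assumed convention, not taken from the abstract)."""
    derived_growth = math.log(derived_final / derived_initial)
    ancestor_growth = math.log(ancestor_final / ancestor_initial)
    return derived_growth / ancestor_growth

# Hypothetical densities (cells per mL) before and after one competition.
w = relative_fitness(5e5, 1e8, 5e5, 5e7)
print(round(w, 3))
```

Under this convention, a much less fit competitor ends the assay at a low density, so its growth term rests on a small count. That is the source of imprecision the two alternative designs in the abstract try to reduce.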

Tools and best practices for allelic expression analysis

Stephane E Castel, Ami Levy-Moonshine, Pejman Mohammadi, Eric Banks, Tuuli Lappalainen
doi: http://dx.doi.org/10.1101/016097

Allelic expression (AE) analysis has become an important tool for integrating genome and transcriptome data to characterize various biological phenomena such as cis-regulatory variation and nonsense-mediated decay. In this paper, we systematically analyze the properties of AE read count data and technical sources of error, such as low-quality or double-counted RNA-seq reads, genotyping errors, allelic mapping bias, and technical covariates due to sample preparation and sequencing, and variation in total read depth. We provide guidelines for correcting and filtering for such errors, and show that the resulting AE data has extremely low technical noise. Finally, we introduce novel software for high-throughput production of AE data from RNA-sequencing data, implemented in the GATK framework. These improved tools and best practices for AE analysis yield higher quality AE data by reducing technical bias. This provides a practical framework for wider adoption of AE analysis by the genomics community.

Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads

Hung-I Harry Chen, Yuanhang Liu, Yi Zou, Zhao Lai, Devanand Sarkar, Yufei Huang, Yidong Chen
doi: http://dx.doi.org/10.1101/016196

Background: RNA sequencing (RNA-seq) is a powerful tool for genome-wide expression profiling of biological samples, with the advantages of high throughput and high resolution. Many algorithms exist for quantifying expression levels and detecting differential gene expression, but none of them takes into account the misaligned reads that map to non-exonic regions. We developed a novel algorithm, XBSeq, whose statistical model is based on the assumption that observed signals are the convolution of true expression signals and sequencing noise. The reads mapped to non-exonic regions are treated as sequencing noise, which follows a Poisson distribution. Given measurable observed and noise signals from RNA-seq data, the true expression signals, assumed to follow a negative binomial distribution, can be delineated, allowing accurate detection of differentially expressed genes. Results: We implemented the XBSeq algorithm and evaluated it on a set of simulated expression datasets under different conditions, generated from a combination of negative binomial and Poisson distributions with parameters derived from real RNA-seq data. We compared the performance of our method with other commonly used differential expression analysis algorithms. We also evaluated the changes in true and false positive rates with variations in biological replicates, differential fold changes, and expression levels in non-exonic regions. We also tested the algorithm on a set of real RNA-seq data, for which the common and differing detection results from the different algorithms were reported. Conclusions: We proposed XBSeq, a differential expression analysis algorithm for RNA-seq data that takes non-exonic mapped reads into consideration. When background noise is at baseline level, the performance of XBSeq and DESeq is mostly equivalent. However, our method surpasses DESeq and other algorithms as the number of non-exonic mapped reads increases. Only under very low read counts did XBSeq have a slightly higher false discovery rate, which may be improved by adjusting the background noise effect in this situation. Taken together, by considering non-exonic mapped reads, XBSeq can provide accurate expression measurement and thus detect differentially expressed genes even in noisy conditions.
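The convolution assumption can be illustrated with a method-of-moments sketch: if the observed count is the sum of an independent signal and Poisson noise, then means and variances add, and the Poisson variance equals its mean. The code below uses this to back out the signal mean and variance from observed and non-exonic counts. It is an illustration under those assumptions, not the XBSeq implementation; the function name and the counts are made up.

```python
import statistics

def estimate_signal_moments(observed_counts, background_counts):
    """Method-of-moments estimate of the signal mean and variance.

    Assumes observed = signal + noise, with noise ~ Poisson independent of signal,
    so that mean(obs) = mean(signal) + lambda and var(obs) = var(signal) + lambda.
    """
    noise_mean = statistics.mean(background_counts)  # estimate of Poisson lambda
    signal_mean = statistics.mean(observed_counts) - noise_mean
    signal_var = statistics.variance(observed_counts) - noise_mean
    # Clamp at zero: sampling error can push the differences negative.
    return max(signal_mean, 0.0), max(signal_var, 0.0)

# Hypothetical replicate counts for one gene (exonic) and its non-exonic background.
observed = [12, 15, 9, 20, 14]
background = [2, 3, 1, 2, 2]
signal_mean, signal_var = estimate_signal_moments(observed, background)
print(signal_mean, signal_var)
```

A signal variance larger than the signal mean is consistent with the overdispersion that motivates a negative binomial model for the true expression signal.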