Length Distribution of Ancestral Tracks under a General Admixture Model and Its Applications in Population History Inference

Xumin Ni, Xiong Yang, Wei Guo, Kai Yuan, Ying Zhou, Zhiming Ma, Shuhua Xu
doi: http://dx.doi.org/10.1101/023390

As a chromosome is sliced into pieces by recombination after entering an admixed population, ancestral tracks of chromosomes are shortened with the passing of generations. The length distribution of ancestral tracks carries information about recombination and thus can be used to infer the histories of admixed populations. Previous studies have shown that inference based on ancestral tracks is powerful in recovering the histories of admixed populations. However, population histories are always complex, and previous studies only deduced the length distribution of ancestral tracks under very simple admixture models. The deduction of length distribution of ancestral tracks under a more general model will greatly elevate the power in inferring population histories. Here we first deduced the length distribution of ancestral tracks under a general model in an admixed population, and proposed general principles in parameter estimation and model selection with the length distribution. Next, we focused on studying the length distribution of ancestral tracks and its applications under three typical admixture models, which were all special cases of our general model. Extensive simulations showed that the length distribution of ancestral tracks was well predicted by our theoretical models. We further developed a new method based on the length distribution of ancestral tracks and good performance was observed when it was applied in inferring population histories under the three typical models. Notably, our method was insensitive to demographic history, sample size and threshold to discard short tracks. Finally, we applied our method in African Americans and Mexicans from the HapMap dataset, and several South Asian populations from the Human Genome Diversity Project dataset. The results showed that the histories of African Americans and Mexicans matched the historical records well, and the population admixture history of South Asians was very complex and could be traced back to around 100 generations ago.
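To illustrate why a threshold for discarding short tracks need not bias the estimate, here is a minimal sketch that assumes track lengths (in Morgans) follow a single exponential distribution. That is a deliberate simplification, not the general model of the preprint, and the function name and input values are hypothetical. Because the exponential distribution is memoryless, the excess length above the cutoff has the same rate as the full distribution, so the rate can be estimated from the retained tracks alone.

```python
def estimate_track_rate(track_lengths_morgans, cutoff):
    """Maximum-likelihood rate of an exponential length distribution
    when tracks shorter than `cutoff` have been discarded.

    Assumes (for illustration only) that lengths are exponential.
    By memorylessness, (length - cutoff) for the retained tracks is
    exponential with the same rate, so rate = 1 / mean(length - cutoff).
    """
    kept = [x for x in track_lengths_morgans if x >= cutoff]
    if not kept:
        raise ValueError("no tracks at or above the cutoff")
    mean_excess = sum(kept) / len(kept) - cutoff
    if mean_excess <= 0:
        raise ValueError("all retained tracks equal the cutoff")
    return 1.0 / mean_excess


# Made-up track lengths in Morgans, with a 0.02 Morgan cutoff.
tracks = [0.05, 0.10, 0.15, 0.30, 0.01]
rate = estimate_track_rate(tracks, cutoff=0.02)
print(f"estimated rate per Morgan: {rate:.3f}")
```

Turning such a rate into an admixture time requires a specific admixture model, which is exactly where the models in the preprint differ.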


Circlator: automated circularization of genome assemblies using long sequencing reads

Martin Hunt, Nishadi De Silva, Thomas D Otto, Julian Parkhill, Jacqueline A Keane, Simon R Harris
doi: http://dx.doi.org/10.1101/023408
The assembly of DNA sequence data into finished genomes is undergoing a renaissance thanks to emerging technologies producing reads of tens of kilobases. Assembling complete bacterial and small eukaryotic genomes is now possible, but the final step of circularizing sequences remains unsolved. Here we present Circlator, the first tool to automate assembly circularization and produce accurate linear representations of circular sequences. Using Pacific Biosciences and Oxford Nanopore data, Circlator correctly circularized 26 of 27 circularizable sequences, comprising 11 chromosomes and 12 plasmids from bacteria, the apicoplast and mitochondrion of Plasmodium falciparum and a human mitochondrion. Circlator is available at http://sanger-pathogens.github.io/circlator/.

Origins of de novo genes in human and chimpanzee

Jorge Ruiz-Orera, Jessica Hernandez-Rodriguez, Cristina Chiva, Eduard Sabidó, Ivanela Kondova, Ronald Bontrop, Tomàs Marqués-Bonet, M. Mar Albà
(Submitted on 28 Jul 2015)

The birth of new genes is an important motor of evolutionary innovation. Whereas many new genes arise by gene duplication, others originate at genomic regions that do not contain any gene or gene copy. Some of these newly expressed genes may acquire coding or non-coding functions and be preserved by natural selection. However, it remains unclear how prevalent de novo gene emergence is and what mechanisms underlie it. In order to obtain a comprehensive view of this process we have performed in-depth sequencing of the transcriptomes of four mammalian species, human, chimpanzee, macaque and mouse, and subsequently compared the assembled transcripts and the corresponding syntenic genomic regions. This has resulted in the identification of over five thousand new transcriptional multiexonic events in human and/or chimpanzee that are not observed in the other species. By comparative genomics we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS. We also find that the coding potential of the new genes is higher than expected by chance, consistent with the presence of protein-coding genes in the dataset. Using available human tissue proteomics and ribosome profiling data we identify several de novo genes with translation evidence. These genes show significant purifying selection signatures, indicating that they are probably functional. Taken together, the data support a model in which frequently occurring new transcriptional events in the genome provide the raw material for the evolution of new proteins.

Dis-integrating the fly: A mutational perspective on phenotypic integration and covariation

Annat Haber, Ian Dworkin
doi: http://dx.doi.org/10.1101/023333

The structure of environmentally induced phenotypic covariation can influence the effective strength and magnitude of natural selection. Yet our understanding of the factors that contribute to and influence the evolutionary lability of such covariation is poor. Most studies have examined either environmental variation, without accounting for covariation, or examined phenotypic and genetic covariation, without distinguishing the environmental component. In this study we examined the effect of mutational perturbations on different properties of environmental covariation, as well as mean shape. We use strains of Drosophila melanogaster bearing well-characterized mutations known to influence wing shape, as well as naturally-derived strains, all reared under carefully-controlled conditions and with the same genetic background. We find that mean shape changes more freely than the covariance structure, and that different properties of the covariance matrix change independently from each other. The perturbations affect matrix orientation more than they affect matrix size or eccentricity. Yet, mutational effects on matrix orientation do not cluster according to the developmental pathway that they target. These results suggest that it might be useful to consider a more general concept of ‘decanalization’, involving all aspects of variation and covariation.

Long-term natural selection affects patterns of neutral divergence on the X chromosome more than the autosomes.

Melissa Ann Wilson Sayres, Pooja Narang
doi: http://dx.doi.org/10.1101/023234

Natural selection reduces neutral population genetic diversity near coding regions of the genome because recombination has not had time to unlink selected alleles from nearby neutral regions. For ten sub-species of great apes, including human, we show that long-term selection affects estimates of divergence on the X differently from the autosomes. Divergence increases with increasing distance from genes on both the X chromosome and autosomes, but increases faster on the X chromosome than on the autosomes, resulting in increasing ratios of X/A divergence in putatively neutral regions. Similarly, divergence is reduced more on the X chromosome in neutral regions near conserved regulatory elements than on the autosomes. Consequently, estimates of male mutation bias, which rely on comparing neutral divergence between the X and autosomes, are twice as high in neutral regions near genes versus far from genes. Our results suggest that filters for putatively neutral genomic regions differ between the X and autosomes.
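The sensitivity of male mutation bias to the X/A divergence ratio can be seen from the classic textbook relation, sketched below. It assumes the X chromosome spends two thirds of its time in females and autosomes half, so that X/A = (2/3)(2 + α)/(1 + α), where α is the male-to-female mutation rate ratio. This is an illustration under that standard assumption and not necessarily the estimator used in the preprint. The function name and the input ratios are hypothetical.

```python
def male_mutation_bias(x_to_a_ratio):
    """Solve X/A = (2/3) * (2 + alpha) / (1 + alpha) for alpha.

    alpha is the ratio of male to female mutation rates. The relation
    only yields a finite positive alpha for 2/3 < X/A < 4/3.
    """
    r = x_to_a_ratio
    if not (2.0 / 3.0 < r < 4.0 / 3.0):
        raise ValueError("X/A ratio must lie strictly between 2/3 and 4/3")
    return (4.0 - 3.0 * r) / (3.0 * r - 2.0)


# Made-up X/A ratios: a modest shift in X/A moves alpha substantially.
for ratio in (0.80, 0.85, 0.90):
    print(f"X/A = {ratio:.2f} -> alpha = {male_mutation_bias(ratio):.2f}")
```

Because the denominator shrinks as X/A approaches 2/3, small changes in X/A divergence caused by linked selection can translate into large changes in the inferred bias.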

Dating ancient human samples using the recombination clock

Priya Moorjani, Sriram Sankararaman, Qiaomei Fu, Molly Przeworski, Nick J Patterson, David E. Reich
doi: http://dx.doi.org/10.1101/023341

The study of human evolution has been revolutionized by inferences from ancient DNA analyses. Key to these is the reliable estimation of the age of ancient specimens. The current best practice is radiocarbon dating, which relies on characterizing the decay of the radioactive carbon isotope (14C), and is applicable for dating samples up to 50,000 years old. Here, we introduce a new genetic method that uses a recombination clock for dating. The key idea is that an ancient genome has evolved less than the genomes of extant individuals. Thus, given a molecular clock provided by the steady accumulation of recombination events, one can infer the age of the ancient genome based on the number of missing years of evolution. To implement this idea, we take advantage of the shared history of Neanderthal gene flow into non-Africans that occurred around 50,000 years ago. Using the Neanderthal ancestry decay patterns, we estimate the Neanderthal admixture time for both ancient and extant samples. The difference in these admixture dates then provides an estimate of the age of the ancient genome. We show that our method provides reliable results in simulations. We apply our method to date five ancient Eurasian genomes with radiocarbon dates ranging between 12,000 and 45,000 years and recover consistent age estimates. Our method provides a complementary approach for dating ancient human samples and is applicable to ancient non-African genomes with Neanderthal ancestry. Extensions of this methodology that use older shared events may be able to date ancient genomes that fall beyond the radiocarbon frontier.
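The arithmetic behind the dating idea is simple enough to sketch. The example below assumes admixture times are estimated in generations and converted with a fixed generation time. The function name, the input values and the generation time are all hypothetical and are not results from the preprint.

```python
def sample_age_years(t_extant_generations, t_ancient_generations,
                     years_per_generation):
    """Age of an ancient sample from two admixture-time estimates.

    Both samples share the same admixture event, so the ancient genome
    sees it as more recent; the difference is the sample's age.
    """
    if t_ancient_generations > t_extant_generations:
        raise ValueError("ancient sample cannot see an older admixture date")
    missing = t_extant_generations - t_ancient_generations
    return missing * years_per_generation


# Made-up inputs: admixture dated 1800 generations before present in
# extant genomes and 300 generations before the ancient individual lived.
age = sample_age_years(1800, 300, years_per_generation=29)
print(age)  # 43500
```

In practice both admixture times carry uncertainty, so the age estimate inherits the combined error of the two estimates and of the assumed generation time.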

The genetic basis of cone serotiny in Pinus contorta as a function of mixed-severity and stand-replacement fire regimes

Mike Feduck, Philippe Henry, Richard Winder, David Dunn, René I Alfaro, Lara vanAkker, Brad Hawkes
doi: http://dx.doi.org/10.1101/023267

Wildfires and mountain pine beetle (MPB) attacks are important contributors to the development of stand structure in lodgepole pine, and major drivers of its evolution. The historical pattern of these events has been correlated with variation in cone serotiny (possessing cones that remain closed and retain seeds until opened by fire) across the Rocky Mountain region of Western North America. As climate change brings about a marked increase in the size, intensity, and severity of our wildfires, it is becoming increasingly important to study the genetic basis of serotiny as an adaptation to wildfire. Knowledge gleaned from these studies would have direct implications for future forest management. In this study, we collected physical data and DNA samples from 122 trees of two different areas in the IDF-dk of British Columbia: multi-cohort stands (Cariboo-Chilcotin) with a history of mixed-severity fire and frequent MPB disturbances, and single-cohort stands (Logan Lake) with a history of stand-replacing (crown) fire and infrequent MPB disturbances. We used QuantiNemo to construct simulated populations of lodgepole pine at five different growth rates, and compared the statistical outputs to physical data, then ran a random forest analysis to shed light on sources of variation in serotiny. We also sequenced 39 SNPs, of which 23 failed or were monomorphic. The 16 informative SNPs were used to calculate observed and expected heterozygosity (HO and HE), which were included alongside genotypes for a second random forest analysis. Our best random forest model explained 33% of variation in serotiny, using simulation and physical variables. Our results highlight the need for more investigation into this matter, using more extensive approaches, and also consideration of alternative methods of heredity such as epigenetics.
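For readers unfamiliar with HO and HE, the sketch below shows how they are commonly computed for a single biallelic SNP. It assumes genotypes are coded as the count of one allele (0, 1 or 2) with `None` for failed calls, and it uses the uncorrected expectation 2pq. The coding, function name and genotype values are hypothetical, and the preprint may have used different software or a sample-size correction.

```python
def heterozygosity(genotypes):
    """Return (HO, HE) for one biallelic SNP.

    genotypes: iterable of 0, 1, 2 (copies of the alternate allele),
    with None marking a missing call.
    HO is the fraction of heterozygous individuals.
    HE is 2 * p * q, the Hardy-Weinberg expectation from allele
    frequencies (no small-sample correction applied).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    n = len(called)
    h_obs = sum(1 for g in called if g == 1) / n
    p = sum(called) / (2 * n)  # alternate-allele frequency
    h_exp = 2 * p * (1 - p)
    return h_obs, h_exp


# Made-up genotypes for eight trees, one of them a failed call.
ho, he = heterozygosity([0, 1, 1, 2, 0, 1, None, 2])
print(f"HO = {ho:.3f}, HE = {he:.3f}")
```

A monomorphic SNP gives HO = HE = 0 under this calculation, which is why such markers carry no information for the analysis.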

Phylogenetic effective sample size

Krzysztof Bartoszek
doi: http://dx.doi.org/10.1101/023242

In this paper I address the question – how large is a phylogenetic sample? I propose a definition of a phylogenetic effective sample size for Brownian motion and Ornstein-Uhlenbeck processes – the regression effective sample size. I discuss how mutual information can be used to define an effective sample size in the non-normal process case and compare these two definitions to an already present concept of effective sample size (the mean effective sample size). Through a simulation study I find that the AICc is robust if one corrects for the number of species or effective number of species. Lastly I discuss how the concept of the phylogenetic effective sample size can be useful for biodiversity quantification, identification of interesting clades and deciding on the importance of phylogenetic correlations.
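As a rough illustration of what an effective sample size means for correlated tips, the sketch below computes one definition found in the literature for the mean effective sample size: the sum of all entries of the inverse correlation matrix, 1ᵀR⁻¹1. Treating this as the quantity the abstract refers to is an assumption here. The helper names and the example matrix are hypothetical, and a small hand-written solver is used only to keep the example dependency-free.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def mean_effective_sample_size(corr):
    """Return 1^T R^{-1} 1 for a correlation matrix R (list of lists)."""
    ones = [1.0] * len(corr)
    return sum(solve_linear(corr, ones))


# Four tips, all pairs correlated at 0.5 (made-up values).
rho = 0.5
corr = [[1.0 if i == j else rho for j in range(4)] for i in range(4)]
print(mean_effective_sample_size(corr))
```

For n equally correlated tips this reduces to n / (1 + (n - 1)ρ), so four tips with pairwise correlation 0.5 behave like 1.6 independent observations, while uncorrelated tips recover the full n.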

A probabilistic method for identifying sex-linked genes using RNA-seq-derived genotyping data

Aline Muyle, Jos Käfer, Niklaus Zemp, Sylvain Mousset, Franck Picard, Gabriel AB Marais
doi: http://dx.doi.org/10.1101/023358

The genetic basis of sex determination remains unknown for the vast majority of organisms with separate sexes. A key question is whether a species has sex chromosomes (SC). SC presence indicates genetic sex determination, and their sequencing may help identify the sex-determining genes and understand the molecular mechanisms of sex determination. Identifying SC, especially homomorphic SC, can be difficult. Sequencing SC is also very challenging, in particular the repeat-rich non-recombining regions. A novel approach for identifying sex-linked genes and SC, which uses RNA-seq to genotype male and female individuals and study sex-linkage, has recently been proposed. This approach entails a modest sequencing effort and does not require prior genomic or genetic resources, and is thus particularly suited to the study of non-model organisms. Applying this approach to many organisms is, however, difficult due to the lack of an appropriate statistically grounded pipeline to analyse the data. Here we propose a model-based method to infer sex-linkage using a maximum likelihood framework and genotyping data from a full-sib family, which can be obtained for most organisms that can be grown in the lab and for economically important animals/plants. Our method works on any type of SC (XY, ZW, UV) and has been embedded in a pipeline that includes a genotyper specifically developed for RNA-seq data. Validation on empirical and simulated data indicates that our pipeline is particularly relevant to study SC of recent or intermediate age but can return useful information in old systems as well; it is available as a Galaxy workflow.

Interpreting the dependence of mutation rates on age and time

Ziyue Gao, Minyoung J. Wyman, Guy Sella, Molly Przeworski
(Submitted on 24 Jul 2015)

Mutations can arise from the chance misincorporation of nucleotides during DNA replication or from DNA lesions that are not repaired correctly. We introduce a model that relates the source of mutations to their accumulation with cell divisions, providing a framework for understanding how mutation rates depend on sex, age and absolute time. We show that the accrual of mutations should track cell divisions not only when mutations are replicative in origin but also when they are non-replicative and repaired efficiently. One implication is that the higher incidence of cancer in rapidly renewing tissues, an observation ascribed to replication errors, could instead reflect exogenous or endogenous mutagens. We further find that only mutations that arise from inefficiently repaired lesions will accrue according to absolute time; thus, in the absence of selection on mutation rates, the phylogenetic “molecular clock” should not be expected to run steadily across species.