# Differential Evolution Approach to Detect Recent Admixture

Konstantin Kozlov, Dmitry Chebotarov, Mehedi Hassan, Petr Triska, Martin Triska, Pavel Flegontov, Tatiana V Tatarinova
doi: http://dx.doi.org/10.1101/015446

The genetic structure of human populations is extraordinarily complex and of fundamental importance to studies of anthropology, evolution, and medicine. As increasingly many individuals are of mixed origin, there is an unmet need for tools that can infer multiple origins. Misclassification of such individuals can lead to incorrect and costly misinterpretations of genomic data, primarily in disease studies and drug trials. We present an advanced tool to infer ancestry that can identify the biogeographic origins of highly mixed individuals. reAdmix can incorporate an individual's knowledge of their ancestors (e.g. having some ancestors from Turkey or a Scottish grandmother). reAdmix is an online tool available at http://chcb.saban-chla.usc.edu/reAdmix/.

# Chromosome-scale shotgun assembly using an in vitro method for long-range linkage

Nicholas H. Putnam, Brendan O’Connell, Jonathan C. Stites, Brandon J. Rice, Andrew Fields, Paul D. Hartley, Charles W. Sugnet, David Haussler, Daniel S. Rokhsar, Richard E. Green
Subjects: Genomics (q-bio.GN); Biomolecules (q-bio.BM)

Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline (“HiRise”) to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
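
The abstract reports its gains as scaffold N50 values (508 kb to 10 Mb for the alligator). As a reminder of what that metric measures, here is a minimal sketch of the standard N50 definition: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly length. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the HiRise pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)   # longest scaffold first
    half_total = sum(ordered) / 2             # half of the total assembly length
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that brings the running total to half
        # of the assembly length defines the N50.
        if running >= half_total:
            return length


# Toy example: total length 54, so half is 27; 10 + 9 + 8 reaches 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends only on the length distribution, it says nothing about whether scaffolds are joined correctly, which is why the abstract pairs it with claims about accuracy.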

# Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations

Marc Haber, Massimo Mezzavilla, Yali Xue, David Comas, Paolo Gasparini, Pierre Zalloua, Chris Tyler-Smith
doi: http://dx.doi.org/10.1101/015396

The Armenians are a culturally isolated population who historically inhabited a region in the Near East bounded by the Mediterranean and Black seas and the Caucasus, but remain underrepresented in genetic studies and have a complex history including a major geographic displacement during World War One. Here, we analyse genome-wide variation in 173 Armenians and compare them to 78 other worldwide populations. We find that Armenians form a distinctive cluster linking the Near East, Europe, and the Caucasus. We show that Armenian diversity can be explained by several mixtures of Eurasian populations that occurred between ~3,000 and ~2,000 BCE, a period characterized by major population migrations after the domestication of the horse, appearance of chariots, and the rise of advanced civilizations in the Near East. However, genetic signals of population mixture cease after ~1,200 BCE when Bronze Age civilizations in the Eastern Mediterranean world suddenly and violently collapsed. Armenians have since remained isolated and genetic structure within the population developed ~500 years ago when Armenia was divided between the Ottomans and the Safavid Empire in Iran. Finally, we show that Armenians have higher genetic affinity to Neolithic Europeans than other present-day Near Easterners, and that 29% of the Armenian ancestry may originate from an ancestral population best represented by Neolithic Europeans.

# Partitioning, duality, and linkage disequilibria in the Moran model with recombination

Mareike Esser, Sebastian Probst, Ellen Baake
Subjects: Probability (math.PR); Populations and Evolution (q-bio.PE)

The Moran model with recombination is considered, which describes the evolution of the genetic composition of a population under recombination and resampling. There are $n$ sites (or loci), a finite number of letters (or alleles) at every site, and we do not make any scaling assumptions. In particular, we do not assume a diffusion limit. We consider the following marginal ancestral recombination process. Let $S = \{1,\ldots,n\}$ and $\mathcal A=\{A_1, \ldots, A_m\}$ be a partition of $S$. We concentrate on the joint probability of the letters at the sites in $A_1$ in individual $1$, $\ldots$, at the sites in $A_m$ in individual $m$, where the individuals are sampled from the current population without replacement. Following the ancestry of these sites backwards in time yields a process on the set of partitions of $S$, which, in the diffusion limit, turns into a marginalised version of the $n$-locus ancestral recombination graph. With the help of an inclusion-exclusion principle, we show that the type distribution corresponding to a given partition may be represented in a systematic way, with the help of so-called recombinators and sampling functions. The same is true of correlation functions (known as linkage disequilibria in genetics) of all orders.
We prove that the partitioning process (backward in time) is dual to the Moran population process (forward in time), where the sampling function plays the role of the duality function. This sheds new light on the work of Bobrowski, Wojdyla, and Kimmel (2010). The result also leads to a closed system of ordinary differential equations for the expectations of the sampling functions, which can be translated into expected type distributions and expected linkage disequilibria.
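
To make the forward-in-time process concrete, the following is a rough simulation sketch of a finite population evolving under resampling and recombination. It is not the authors' formulation: the discrete event steps, the single-crossover recombination, the recombination probability `rho`, and the names `moran_step` and `simulate` are all assumptions chosen for illustration, whereas the paper works with a continuous-time process.

```python
import random


def moran_step(pop, rho, rng):
    """Apply one event in place: recombination with probability rho, else resampling."""
    n_sites = len(pop[0])
    target = rng.randrange(len(pop))          # individual to be replaced
    if n_sites > 1 and rng.random() < rho:
        # Assumed single crossover: leading sites from one parent,
        # trailing sites from another.
        left, right = rng.choice(pop), rng.choice(pop)
        cut = rng.randrange(1, n_sites)
        pop[target] = left[:cut] + right[cut:]
    else:
        # Resampling: a copy of a randomly chosen individual replaces the target.
        pop[target] = rng.choice(pop)


def simulate(pop, steps, rho, seed=0):
    """Run `steps` events on a copy of `pop` (a list of equal-length tuples)."""
    rng = random.Random(seed)
    pop = list(pop)
    for _ in range(steps):
        moran_step(pop, rho, rng)
    return pop


# Ten individuals, three sites, two letters per site.
start = [(0, 0, 0)] * 5 + [(1, 1, 1)] * 5
print(simulate(start, steps=200, rho=0.3))
```

The population size stays fixed and no new letters are created, so the only things that change are the type frequencies and the associations between sites, which is what the linkage disequilibria in the abstract quantify.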

# Systematic discovery and classification of human cell line essential genes

Traver Hart, Megha Chandrashekhar, Michael Aregger, Zachary Steinhart, Kevin R Brown, Stephane Angers, Jason Moffat
doi: http://dx.doi.org/10.1101/015412

The study of gene essentiality in human cells is crucial for elucidating gene function and holds great potential for finding therapeutic targets for diseases such as cancer. Technological advances in genome editing using clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 systems have set the stage for identifying human cell line core and context-dependent essential genes. However, first-generation negative selection screens using CRISPR technology demonstrate extreme variability across different cell lines. To advance the development of the catalogue of human core and context-dependent essential genes, we have developed an optimized, ultracomplex, genome-scale gRNA library of 176,500 guide RNAs targeting 17,661 genes and have applied it to negative and positive selection screens in a human cell line. Using an improved Bayesian analytical approach, we find that CRISPR-based screens yield two to three times as many essential genes as were previously observed using systematic RNA interference, including many genes at moderate expression levels that are largely refractory to RNAi methods. We further characterized four essential genes of unknown significance and found that they all likely exist in protein complexes with other essential genes. For example, RBM48 and ARMC7 are both essential nuclear proteins, strongly interact, and are commonly amplified across major cancers. Our findings suggest the CRISPR-Cas9 system fundamentally alters the landscape for systematic reverse genetics in human cells for elucidating gene function, identifying disease genes, and uncovering therapeutic targets.

# Maximum Likelihood Estimation and Phylogenetic Tree based Backward Elimination for reconstructing Viral Haplotypes in a Population

Raunaq Malhotra, Steven Wu, Allen Rodrigo, Mary Poss, Raj Acharya
(Submitted on 14 Feb 2015)

A viral population can contain a large and diverse collection of viral haplotypes which play important roles in maintaining the viral population. We present an algorithm for reconstructing viral haplotypes in a population from paired-end Next Generation Sequencing (NGS) data. We propose a novel polynomial-time, dynamic-programming-based approximation algorithm for generating top paths through each node in a De Bruijn graph constructed from the paired-end NGS data. We also propose two novel formulations for obtaining an optimal set of viral haplotypes for the population using the paths generated by the approximation algorithm. The first formulation obtains a maximum likelihood estimate of the viral population given the observed paired-end reads. The second formulation obtains a minimal set of viral haplotypes retaining the phylogenetic information in the population. We evaluate our algorithm on simulated datasets that vary in the mutation rate and genome length of the viral haplotypes. The results of our method are compared to other methods for viral haplotype estimation. While all the methods overestimate the number of viral haplotypes in a population, the two proposed optimality formulations correctly estimate the exact sequence of all the haplotypes in most datasets, and recover the overall diversity of the population in all datasets. The haplotypes recovered from popular methods are biased toward the reference sequence used for mapping of reads, while the proposed formulations are reference-free and retain the overall diversity in the population.
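
The method starts from a De Bruijn graph built over the reads. As a minimal sketch of that data structure only, the code below links each (k-1)-mer prefix to the (k-1)-mer suffix of every k-mer seen in the reads. The function name `de_bruijn`, the choice of `k`, and the toy reads are assumptions for illustration; the paired-end information, path scoring, and haplotype selection described in the abstract are not modelled here.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix node -> suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; the shared k-mer "CGT" yields a single edge.
print(de_bruijn(["ACGT", "CGTA"], k=3))
```

Paths through such a graph spell out candidate sequences, which is why a reference-free method can enumerate haplotype candidates from it without mapping reads to a reference.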

# Selection constrains phenotypic evolution in a functionally important plant trait

A long-standing idea is that the macroevolutionary adaptive landscape, a "map" of phenotype to fitness, constrains evolution because certain phenotypes are fit, while others are universally unfit. Such constraints should be evident in traits that, across many species, cluster around particular modal values, with few intermediates between modes. Here, I compile a new global database of 599 species from 94 plant families showing that stomatal ratio, an important functional trait affecting photosynthesis, is multimodal, hinting at distinct peaks in the adaptive landscape. The dataset confirms that most plants have all their stomata on the lower leaf surface (hypostomy), but shows for the first time that species with roughly half their stomata on each leaf surface (amphistomy) form a distinct mode in the trait distribution. Based on a new evolutionary process model, this multimodal pattern is unlikely without constraint. Further, multimodality has evolved repeatedly across disparate families, evincing long-term constraint on the adaptive landscape. A simple cost-benefit model of stomatal ratio demonstrates that selection alone is sufficient to generate an adaptive landscape with multiple peaks. Finally, phylogenetic comparative methods indicate that life history evolution drives shifts between peaks. This implies that the adaptive benefit conferred by amphistomy, increased photosynthesis, is most important in plants with fast life histories, challenging existing ideas that amphistomy is an adaptation to thick leaves and open habitats. I conclude that peaks in the adaptive landscape have been constrained by selection over much of land plant evolution, leading to predictable, repeatable patterns of evolution.