Hide and seek: placing and finding an optimal tree for thousands of homoplasy-rich sequences

Dietrich Radel, Andreas Sand, Mike Steel
(Submitted on 6 Jun 2013)

Finding optimal evolutionary trees from sequence data is typically an intractable problem, and there is usually no way of knowing how close to optimal the best tree from some search truly is. The problem would seem to be particularly acute when we have many taxa and when the data have high levels of homoplasy, in which the individual characters require many changes to fit on the best tree. However, a recent mathematical result has provided a precise tool to generate a small number of high-homoplasy characters for any given tree, so that this tree is provably the optimal tree under the maximum parsimony criterion. This provides, for the first time, a rigorous way to test tree search algorithms on homoplasy-rich data, where we know in advance what the 'best' tree is. In this short note we consider just one search program (TNT) but show that it is able to locate the globally optimal tree correctly for 32,768 taxa, even though the characters in the dataset require, on average, 1148 state-changes each to fit on this tree, and the number of characters is only 57.
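
To make "state-changes to fit on this tree" concrete, here is a minimal sketch of how the parsimony score of a single character can be counted on a fixed tree, using the standard Fitch small-parsimony algorithm. It is an illustration only, not the construction from the paper and not how TNT is implemented; the nested-tuple tree encoding and the name `fitch_score` are assumptions made for this example.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with leaf names (strings) at the tips
            (hypothetical encoding chosen for this sketch).
    states: dict mapping each leaf name to its observed character state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):            # leaf: its state set is the observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                           # children agree: no change needed here
            return common
        changes += 1                         # children disagree: one change on this subtree
        return left | right

    visit(tree)
    return changes


tree = (("A", "B"), ("C", "D"))
compatible = fitch_score(tree, {"A": 0, "B": 0, "C": 1, "D": 1})   # fits with a single change
homoplastic = fitch_score(tree, {"A": 0, "B": 1, "C": 0, "D": 1})  # needs two changes
print(compatible, homoplastic)
```

The second character is homoplastic on this tree: it has two states but needs two changes rather than the minimum of one. The recursion is fine for a toy tree; a tree with tens of thousands of taxa would need an iterative traversal.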

Density behavior of spatial birth-and-death stochastic evolution of mutating genotypes under selection rates

Dmitri Finkelshtein, Yuri Kondratiev, Oleksandr Kutoviy, Stanislav Molchanov, Elena Zhizhina
(Submitted on 5 Jun 2013)

We consider birth-and-death stochastic evolution of genotypes with different lengths. The genotypes may mutate, which produces a stochastic change of lengths according to a free diffusion law. The birth and death rates are length dependent, which corresponds to a selection effect. We study the asymptotic behavior of the density for an infinite collection of genotypes. The cases of space homogeneous and space heterogeneous densities are considered.

Coalescence, genetic diversity and adaptation in sexual populations

Richard A. Neher, Taylor A. Kessinger, Boris I. Shraiman
(Submitted on 5 Jun 2013)

In diverse sexual populations, selection operates neither on the whole genome, which is repeatedly taken apart and reassembled by recombination, nor on individual alleles, which are tightly linked to the chromosomal neighborhood. Those tightly linked alleles affect each other's dynamics, which reduces the efficiency of selection and distorts patterns of genetic diversity. Inference of evolutionary history from diversity shaped by linked selection requires an understanding of these patterns. Here, we reexamine this problem in the light of recent progress in coalescent theory of rapidly adapting asexual populations. We present a simple but powerful scaling analysis identifying the unit of selection as the genomic “linkage block” with characteristic length \xi_b, which is determined in a self-consistent manner by the condition that the rate of recombination within the block is comparable to the fitness differences between different alleles of the block. We find that an asexual model with strength of selection tuned to that of the linkage block provides an excellent description of genetic diversity and the site frequency spectra when compared to computer simulations of population dynamics. This correspondence holds for the entire spectrum of strength of selection. When fitness differentials arise from the collective contribution of numerous weakly selected polymorphisms, the rate of adaptation increases as the square root of the recombination rate. The linkage block approximation thus provides a simple but powerful tool for understanding interference and collective behavior of dense weakly selected loci.

Interference limits resolution of selection pressures from linked neutral diversity

Benjamin H. Good, Aleksandra M. Walczak, Richard A. Neher, Michael M. Desai
(Submitted on 5 Jun 2013)

Pervasive natural selection can strongly influence observed patterns of genetic variation, but these effects remain poorly understood when multiple selected variants segregate in nearby regions of the genome. Classical population genetics fails to account for interference between linked mutations, which grows increasingly severe as the density of selected polymorphisms increases. Here, we describe a simple limit that emerges when interference is common, in which the fitness effects of individual mutations play a relatively minor role. Instead, molecular evolution is determined by the variance in fitness within the population, defined over an effectively asexual segment of the genome (a “linkage block”). We exploit this insensitivity in a new “coarse-grained” coalescent framework, which approximates the effects of many weakly selected mutations with a smaller number of strongly selected mutations with the same variance in fitness. This approximation generates accurate and efficient predictions for the genetic diversity that cannot be summarized by a simple reduction in effective population size. However, these results suggest a fundamental limit on our ability to resolve individual selection pressures from contemporary sequence data alone, since a wide range of parameters yield nearly identical patterns of sequence variability.

Reconstructing the Population Genetic History of the Caribbean

Andres Moreno-Estrada, Simon Gravel, Fouad Zakharia, Jacob L. McCauley, Jake K. Byrnes, Christopher R. Gignoux, Patricia A. Ortiz-Tello, Ricardo J. Martinez, Dale J. Hedges, Richard W. Morris, Celeste Eng, Karla Sandoval, Suehelay Acevedo-Acevedo, Juan Carlos Martinez-Cruzado, Paul J. Norman, Zulay Layrisse, Peter Parham, Esteban Gonzalez Burchard, Michael L. Cuccaro, Eden R. Martin, Carlos D. Bustamante
(Submitted on 3 Jun 2013)

The Caribbean basin is home to some of the most complex interactions in recent history among previously diverged human populations. Here, by making use of genome-wide SNP array data, we characterize ancestral components of Caribbean populations on a sub-continental level and unveil fine-scale patterns of population structure distinguishing insular from mainland Caribbean populations as well as from other Hispanic/Latino groups. We provide genetic evidence for an inland South American origin of the Native American component in island populations and for extensive pre-Columbian gene flow across the Caribbean basin. The Caribbean-derived European component shows significant differentiation from parental Iberian populations, presumably as a result of founder effects during the colonization of the New World. Based on demographic models, we reconstruct the complex population history of the Caribbean since the onset of continental admixture. We find that insular populations are best modeled as mixtures absorbing two pulses of African migrants, coinciding with early and maximum activity stages of the transatlantic slave trade. These two pulses appear to have originated in different regions within West Africa, imprinting two distinguishable signatures in present-day Afro-Caribbean genomes and shedding light on the genetic impact of the dynamics occurring during the slave trade in the Caribbean.

biobambam: tools for read pair collation based algorithms on BAM files

German Tischler, Steven Leonard
(Submitted on 4 Jun 2013)

Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications such as duplicate marking or conversion to the FastQ format, which require access to the full information of the pairs. In this paper we introduce biobambam, an API for efficient BAM file reading that supports the efficient collation of alignments by read name without performing a complete resorting of the input file, together with some tools based on this API that perform tasks like marking duplicate reads and conversion to the FastQ format. In comparison with previous approaches to problems involving the collation of alignments by read name, such as the BAM to FastQ or duplicate marking utilities in the Picard suite, the approach of biobambam can often perform an equivalent task more efficiently in terms of the required main memory and run-time.
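
The abstract does not spell out how the collation works, so the following is only a generic sketch of pairing reads by name in a single pass over coordinate-sorted records, without re-sorting. It holds each read in a hash table until its mate arrives. This is not biobambam's API or algorithm; the `Record` tuple layout and the name `collate_pairs` are hypothetical, and real BAM records would come from a BAM reader rather than a Python list.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical minimal record: (read name, mapped position, sequence).
Record = Tuple[str, int, str]


def collate_pairs(records: Iterable[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield mate pairs from a coordinate-sorted stream in one pass."""
    pending: Dict[str, Record] = {}          # reads still waiting for their mate
    for rec in records:
        name = rec[0]
        mate = pending.pop(name, None)
        if mate is None:
            pending[name] = rec              # first mate seen: park it
        else:
            yield mate, rec                  # second mate seen: emit the pair
    # Reads left in `pending` have no mate in the input; this sketch drops them.


coordinate_sorted = [
    ("r1", 100, "ACGT"),
    ("r2", 150, "GGCA"),
    ("r1", 300, "TTAG"),
    ("r2", 420, "CCAT"),
]
pairs = list(collate_pairs(coordinate_sorted))
for first, second in pairs:
    print(first[0], first[1], second[1])
```

Memory use in this sketch grows with the number of reads whose mates have not yet been seen, which is why a practical tool needs some strategy for bounding or spilling that table.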

Populations in statistical genetic modelling and inference

Daniel John Lawson
(Submitted on 4 Jun 2013)

What is a population? This review considers how a population may be defined in terms of understanding the structure of the underlying genetics of the individuals involved. The main approach is to consider statistically identifiable groups of randomly mating individuals, which is well defined in theory for any type of (sexual) organism. We discuss generative models using drift, admixture and spatial structure, and the ancestral recombination graph. These are contrasted with statistical models for inference, principal component analysis and other 'non-parametric' methods. The relationships between these approaches are explored with both simulated and real-data examples. The state-of-the-art practical software tools are discussed and contrasted. We conclude that populations are a useful theoretical construct that can be well defined in theory and often approximately exist in practice.

Genetic Complexity in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin

Soo-Young Park, Michael Z. Ludwig, Natalia A. Tamarina, Bin Z. He, Sarah H. Carl, Desiree A. Dickerson, Levi Barse, Bharath Arun, Calvin Williams, Cecelia M. Miles, Louis H. Philipson, Donald F. Steiner, Graeme I. Bell, Martin Kreitman
(Submitted on 31 May 2013)

Here we use Drosophila melanogaster to create a genetic model of human permanent neonatal diabetes mellitus and present experimental results describing dimensions of this complexity. The approach involves the transgenic expression of a misfolded mutant of human preproinsulin, hINSC96Y, which is a cause of the disease. When expressed in fly imaginal discs, hINSC96Y causes a reduction of adult structures, including the eye, wing and notum. Eye imaginal discs exhibit defects in both the structure and arrangement of ommatidia. In the wing, expression of hINSC96Y leads to ectopic expression of veins and mechano-sensory organs, indicating disruption of wild type signaling processes regulating cell fates. These readily measurable disease phenotypes are sensitive to temperature, gene dose and sex. Mutant (but not wild type) proinsulin expression in the eye imaginal disc induces IRE1-mediated Xbp1 alternative splicing, a signal for endoplasmic reticulum stress response activation, and produces global change in gene expression. Mutant hINS transgene tester strains, when crossed to stocks from the Drosophila Genetic Reference Panel, produce F1 adults with a continuous range of disease phenotypes and large broad-sense heritability. Surprisingly, the severity of mutant hINS-induced disease in the eye is not correlated with that in the notum in these crosses, nor with eye reduction phenotypes caused by the expression of two dominant eye mutants acting in two different eye development pathways, Drop (Dr) or Lobe (L), when crossed into the same genetic backgrounds. The tissue specificity of genetic variability for mutant hINS-induced disease thus has its own distinct signature. The genetic dominance of disease-specific phenotypic variability makes this approach amenable to genome-wide association study (GWAS) in a simple F1 screen of natural variation.

Genome Sequencing Highlights Genes Under Selection and the Dynamic Early History of Dogs

Adam H. Freedman, Rena M. Schweizer, Ilan Gronau, Eunjung Han, Diego Ortega-Del Vecchyo, Pedro M. Silva, Marco Galaverni, Zhenxin Fan, Peter Marx, Belen Lorente-Galdos, Holly Beale, Oscar Ramirez, Farhad Hormozdiari, Can Alkan, Carles Vilà, Kevin Squire, Eli Geffen, Josip Kusak, Adam R. Boyko, Heidi G. Parker, Clarence Lee, Vasisht Tadigotla, Adam Siepel, Carlos D. Bustamante, Timothy T. Harkins, Stanley F. Nelson, Elaine A. Ostrander, Tomas Marques-Bonet, Robert K. Wayne, John Novembre
(Submitted on 31 May 2013)

To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we analyzed novel high-quality genome sequences of three gray wolves, one from each of three putative centers of dog domestication, two ancient dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. We find dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow, which confounds previous inferences of dog origins. In dogs, the domestication bottleneck was severe, involving a 17 to 49-fold reduction in population size, a much stronger bottleneck than estimated previously from less intensive sequencing efforts. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was far larger than represented by modern wolf populations. Conditional on mutation rate, we narrow the plausible range for the date of initial dog domestication to an interval from 11 to 16 thousand years ago. This period predates the rise of agriculture, implying that the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers are more closely related to dogs, and the sampled wolves instead form a sister monophyletic clade. This result, in combination with our finding of dog-wolf admixture during the process of domestication, suggests a re-evaluation of past hypotheses of dog origin is necessary. Finally, we also detect signatures of selection, including evidence for selection on genes implicated in morphology, metabolism, and neural development. Uniquely, we find support for selective sweeps at regulatory sites, suggesting gene regulatory changes played a critical role in dog domestication.

Most viewed on Haldane’s Sieve: May 2013

The most viewed posts on Haldane’s Sieve in May 2013 were: