On the number of ranked species trees producing anomalous ranked gene trees

Filippo Disanto, Noah A. Rosenberg
Subjects: Populations and Evolution (q-bio.PE)

Analysis of probability distributions conditional on species trees has demonstrated the existence of anomalous ranked gene trees (ARGTs), ranked gene trees that are more probable than the ranked gene tree that accords with the ranked species tree. Here, to improve the characterization of ARGTs, we study enumerative and probabilistic properties of two classes of ranked labeled species trees, focusing on the presence or avoidance of certain subtree patterns associated with the production of ARGTs. We provide exact enumerations and asymptotic estimates for cardinalities of these sets of trees, showing that as the number of species increases without bound, the fraction of all ranked labeled species trees that are ARGT-producing approaches 1. This result extends beyond earlier existence results to provide a probabilistic claim about the frequency of ARGTs.
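For a sense of scale, the total number of ranked labeled binary species trees on n species has a classical closed form, n!(n-1)!/2^(n-1), which is the product of the C(k,2) ways to pick the pair of lineages that merges when k lineages remain. The sketch below computes only this denominator. The function name is illustrative, and the code does not attempt the paper's enumeration of ARGT-producing trees.

```python
from math import comb, factorial


def ranked_labeled_trees(n: int) -> int:
    """Count ranked labeled binary rooted trees (labeled histories) on n leaves."""
    if n < 2:
        raise ValueError("need at least two leaves")
    # Closed form n! (n-1)! / 2^(n-1); the division is always exact.
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def ranked_labeled_trees_product(n: int) -> int:
    """Same count, built as the product of C(k, 2) for k = 2..n."""
    total = 1
    for k in range(2, n + 1):
        total *= comb(k, 2)
    return total


print([ranked_labeled_trees(n) for n in range(2, 7)])  # [1, 3, 18, 180, 2700]
```

The count grows faster than exponentially, which is why asymptotic fractions rather than exhaustive listings are the natural way to state a result for large n.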

Author post: Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris

This guest post is by Gavin Douglas (@gmdougla), Stephen Wright (@stepheniwright), and Tanja Slotte (@tanjaslotte) on their paper Douglas et al. Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris. bioRxived here.

photo credit: Tanja Slotte


In this preprint we investigate the mode of origin and evolutionary consequences of polyploidy in the highly successful tetraploid plant Capsella bursa-pastoris. We analyze high-coverage massively parallel genomic sequence data and first show that C. bursa-pastoris is a recent hybrid of two Capsella lineages leading to C. grandiflora and C. orientalis. This settles a long-standing uncertainty regarding the origins of C. bursa-pastoris. Second, we investigate patterns of nonfunctionalization and gene loss, and while we find little evidence for rapid, massive genome-wide fractionation, our analyses suggest that there is a decrease in the efficacy of selection in this recently formed tetraploid.

Allopolyploid origins of Capsella bursa-pastoris

Determining the evolutionary origin of C. bursa-pastoris has proven to be difficult and many contradictory hypotheses have been suggested, including that the tetraploid is an autopolyploid of a single Capsella species. Part of the complication has been the relatively low levels of sequence divergence between homeologous gene copies, and across the diploid Capsella lineages. Given population genomic sequences from all three Capsella species mentioned, we were able to address this question again with several different approaches.
C. bursa-pastoris undergoes disomic inheritance, meaning that genes duplicated as a result of polyploidy (homeologs) are independently inherited. Thus, one of the major tasks with our genomic data was to partition out the sequences from the two homeologous subgenomes. Because of the low levels of sequence divergence between homeologs (3% on average), this can be a challenging task. We took two approaches to generate phased genome sequence for inferring species origins: de novo assembly of short reads, and phasing of SNPs from mapping reads to the reference genome of the diploid Capsella rubella. Phylogenetic trees generated from de novo assemblies of these species overwhelmingly support one C. bursa-pastoris homeolog forming a clade with C. grandiflora and the other with C. orientalis. The distribution of SNPs and transposable elements shared between these species also strongly supports this hybridization model, and we estimate that the hybridization occurred within the last 100,000-300,000 years.
One reason the hybrid origin of C. bursa-pastoris is exciting is the divergent evolution of its progenitor lineages. C. orientalis and C. grandiflora differ both in their mating system and in their geographical distribution. Given that C. bursa-pastoris is a highly successful weed found worldwide, it will be interesting in future work to assess whether this divergence between the C. orientalis and C. grandiflora lineages contributed to the tetraploid’s adaptability.

Decreased efficacy of selection in the recently arisen polyploid
Following genome duplication, the majority of redundant loci are expected to be lost over time through the process of diploidization. This model has been supported by several ancient polyploid events, including in Arabidopsis. Capsella bursa-pastoris presents an interesting model for studying the early phases of diploidization, and allows for an investigation of the rate of gene loss as well as the relative importance of relaxed selection vs. positive selection during early stages of gene inactivation. We searched for large deletions spanning genes using several approaches, based both on determining exact breakpoints and on cross-referencing low-coverage regions in C. bursa-pastoris with other Capsella species. Although we identified proportionately more large deletions segregating in C. bursa-pastoris than in the diploids, we did not find evidence for massive genomic changes in the tetraploid.
We were able to demonstrate relaxation of selection by analyzing the site frequency spectrum of SNPs segregating at 0-fold nonsynonymous sites in the three Capsella species. We also investigated SNPs causing putatively deleterious effects, such as premature stop codons, segregating in the three Capsella species. Many of these SNPs are shared between the three species, although they segregate at low frequencies in C. grandiflora. Since this shared deleterious variation inherited from progenitors seems to be responsible for a large proportion of the earliest stages of gene degeneration, these data support a model of genome fractionation that is given a “head start” from standing variation. A key message following from this result is that we should be giving more weight to purely historical explanations of gene loss when studying biased fractionation.

Chromosomal distribution of cyto-nuclear genes in a dioecious plant with sex chromosomes

Josh Hough, J Arvid Agren, Spencer CH Barrett, Stephen I Wright

The coordination between nuclear and organellar genes is essential to many aspects of eukaryotic life, including basic metabolism, energy production, and ultimately, organismal fitness. Whereas nuclear genes are bi-parentally inherited, mitochondrial and chloroplast genes are almost exclusively maternally inherited, and this asymmetry may lead to a bias in the chromosomal distribution of nuclear genes whose products act in the mitochondria or chloroplasts. In particular, because X-linked genes have a higher probability of co-transmission with organellar genes (2/3) compared to autosomal genes (1/2), selection for co-adaptation has been predicted to lead to an over-representation of nuclear-mitochondrial (N-mt) or nuclear-chloroplast (N-cp) genes on the X chromosome relative to autosomes. In contrast, the occurrence of sexually antagonistic organellar mutations might lead to selection for movement of cyto-nuclear genes from the X chromosome to autosomes to reduce male mutation load. Recent broad-scale comparative studies of N-mt distributions in animals have found evidence for these hypotheses in some species, but not others. Here, we use transcriptome sequences to conduct the first study of the chromosomal distribution of cyto-nuclear interacting genes in a plant species with sex chromosomes (Rumex hastatulus; Polygonaceae). We found no evidence of under- or over-representation of either N-mt or N-cp genes on the X chromosome, and thus no support for either the co-adaptation or the sexual-conflict hypothesis. We discuss how our results from a species with recently evolved sex chromosomes fit into an emerging picture of the evolutionary forces governing the chromosomal distribution of N-mt and N-cp genes.
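The 2/3 versus 1/2 co-transmission figures quoted above can be recovered with a short calculation. The sketch below assumes an XY system, an equal sex ratio, and strictly maternal organelle inheritance, so that a nuclear gene copy is co-transmitted with the organelles only when it sits in a female. The function name and parameters are illustrative and do not come from the paper.

```python
from fractions import Fraction


def fraction_in_females(copies_in_female, copies_in_male, female_fraction=Fraction(1, 2)):
    """Share of a chromosome's gene copies that reside in females.

    Under strictly maternal organelle inheritance (an assumption of this
    sketch), this share is the probability that a randomly chosen copy is
    transmitted together with the organellar genome.
    """
    in_females = female_fraction * copies_in_female
    in_males = (1 - female_fraction) * copies_in_male
    return in_females / (in_females + in_males)


x_linked = fraction_in_females(2, 1)   # females XX, males XY
autosomal = fraction_in_females(2, 2)  # two copies in both sexes
print(x_linked, autosomal)  # 2/3 1/2
```

Changing female_fraction shows how a biased sex ratio would shift both values, which is one reason the simple 2/3 expectation should be read as a baseline rather than a prediction for any particular population.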

Author post: Single haplotype assembly of the human genome from a hydatidiform mole

This guest post is by Karyn Meltz Steinberg on her preprint (with coauthors) Single haplotype assembly of the human genome from a hydatidiform mole, bioRxived here.

The human reference sequence is a mosaic of many DNA sources patched together to create the (mostly) contiguous chromosomes we all use every day in genomics labs. This mélange of haplotypes can result in reference representations that do not exist. For example, in GRCh37 at the MRC1 locus mixed haplotypes led to the presence of two gene models that represent false duplications and a gap that affected alignments of short reads. The problem of diploid source DNA was even worse in regions of structural variation where it was difficult to distinguish allelic variation from paralogous variation. The assembly structure in these regions was often wrong with either a collapsed assembly leading to missing sequence or with haplotype expanse, meaning 2, 3 or more haplotypes were represented on the chromosome sequence. The genomic resources associated with the essentially haploid complete hydatidiform mole, CHM1, have opened the door to allow us to address these issues.

What does “essentially haploid” mean? Well, what usually happens is that a sperm (usually bearing an X chromosome) fertilizes an egg that doesn’t have a nucleus (thus no DNA), and the sperm DNA doubles, giving the correct chromosome complement, but with both copies of each chromosome being identical. This cell divides and grows but does not form a normal embryo. In the early 1990s, Dr. Urvashi Surti was able to make an immortalized cell line from CHM1 tissue using hTERT. She karyotyped each passage to check that it maintained ploidy and that there were no gross somatic rearrangements.

Dr. Pieter de Jong then created an indexed BAC library from this high fidelity material. These BACs could then be used to resolve structurally complex regions such as 17q21.31 [1]. We continued working on sequencing more tiling paths across structurally complex regions; however, it was not practical or cost-efficient to Sanger sequence every single clone in the BAC library. As work continued, it became clear that developing the Primary Assembly using a single haplotype resource could be very powerful. This was possible due to the efforts of the Genome Reference Consortium (GRC) to extend the assembly model to include more than one sequence representation for a given region. We used Illumina sequencing technology and a reference-based assembly algorithm developed at NCBI to produce an initial assembly. We then integrated the BAC sequences into the assembly to improve regions that are nearly impossible to assemble using whole genome strategies. The result is the highest quality whole-genome-sequence assembly of the human genome that is publicly available to date, as assessed by metrics including contig and scaffold N50, repetitive element content and gene annotation.
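For readers unfamiliar with the N50 metric mentioned above: it is the length L such that contigs (or scaffolds) of length at least L together cover at least half of the total assembled bases. A minimal sketch follows, using made-up contig lengths rather than any values from the CHM1 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths: total 290, so the 80 and 70 contigs cover half.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why it is reported alongside repeat content and gene annotation.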

So how can this help us, you ask? For the first time, one single haplotype of the human genome is represented. The fact that CHM1 is haploid means that we are finally able to go into the messy regions of the genome and resolve the genomic architecture as well as put any structural variation in the context of surrounding linked allelic variation. These are often biologically interesting regions; for example, genes related to immune response and metabolism that are probably associated with complex traits are usually members of large gene families in segmentally duplicated sequence.

A fine example of the power of this haploid resource is also on BioRxiv, “Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity” (disclosure: this is also co-authored by me). The IG light chain genes encode one part of the immunoglobulin molecules that are expressed by B cells in response to antigenic stimulation (the heavy chain, IGH, was also resolved using CHM1 resources last year [2]). They are part of large gene families formed by duplication at three loci in the human genome. In previous versions of the reference assembly, these loci were composed of sequence from multiple DNA sources that may have undergone somatic rearrangement. By sequencing BAC clones in a tiling path across these loci we now have a single haplotype representation of germline DNA sequence that allows us to perform accurate analyses of variation.

REFERENCES

1 Itsara, A. et al. Resolving the breakpoints of the 17q21.31 microdeletion syndrome with next-generation sequencing. American Journal of Human Genetics 90, 599-613, doi:10.1016/j.ajhg.2012.02.013 (2012).
2 Watson, C. T. et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. American Journal of Human Genetics 92, 530-546 (2013).

Mito-seek enables deep analysis of mitochondrial DNA, revealing ubiquitous, stable heteroplasmy maintained by intercellular exchange

Ravi Sachidanandam, Anitha D Jayaprakash, Erica Benson, Raymond Liang, Jaehee Shim, Luca Lambertini, Mike Wigler, Stuart Aaronson
doi: http://dx.doi.org/10.1101/007005

Eukaryotic cells carry two genomes, nuclear (nDNA) and mitochondrial (mtDNA), which are ostensibly decoupled in their replication, segregation and inheritance. It is increasingly appreciated that heteroplasmy, the occurrence of multiple mtDNA haplotypes in a cell, plays an important biological role, but its features are not well understood. Until now, accurately determining the diversity of mtDNA has been difficult due to the relatively small amount of mtDNA in each cell ([...] 98%) mtDNA and its ability to detect rare variants is limited only by sequencing depth, providing unprecedented sensitivity and specificity. Using Mito-seek, we confirmed the ubiquity of heteroplasmy by analyzing mtDNA from a diverse set of cell lines and human samples. By applying Mito-seek to colonies derived from single cells, we showed that heteroplasmy is stably maintained in individual daughter cells over multiple cell divisions. Our simulations indicate that the stability of heteroplasmy can be facilitated by the exchange of mtDNA between cells. We also explicitly demonstrate this exchange by co-culturing cell lines with distinct mtDNA haplotypes. Our results shed new light on the maintenance of heteroplasmy and provide a novel platform to investigate various features of heteroplasmy in normal and diseased tissues.

Phylogenomic analyses of deep gastropod relationships reject Orthogastropoda

Felipe Zapata, Nerida G Wilson, Mark Howison, Sónia CS Andrade, Katharina M Jörger, Michael Schrödl, Freya E Goetz, Gonzalo Giribet, Casey W Dunn
doi: http://dx.doi.org/10.1101/007039

Gastropods are a highly diverse clade of molluscs that includes many familiar animals, such as limpets, snails, slugs, and sea slugs. It is one of the most abundant groups of animals in the sea and the only molluscan lineage that has successfully colonised land. Yet the relationships among and within its constituent clades have remained in flux for over a century of morphological, anatomical and molecular study. Here we re-evaluate gastropod phylogenetic relationships by collecting new transcriptome data for 40 species and analysing them in combination with publicly available genomes and transcriptomes. Our datasets include all five main gastropod clades: Patellogastropoda, Vetigastropoda, Neritimorpha, Caenogastropoda and Heterobranchia. We use two different methods to assign orthology, subsample each of these matrices into three increasingly dense subsets, and analyse all six of these supermatrices with two different models of molecular evolution. All twelve analyses yield the same unrooted network connecting the five major gastropod lineages. This reduces deep gastropod phylogeny to three alternative rooting hypotheses. These results reject the prevalent hypothesis of gastropod phylogeny, Orthogastropoda. Our dated tree is congruent with a possible end-Permian recovery of some gastropod clades, namely Caenogastropoda and some Heterobranchia subclades.

Interpretation and approximation tools for big, dense Markov chain transition matrices in ecology and evolution

Katja Reichel, Valentin Bahier, Cédric Midoux, Jean-Pierre Masson, Solenn Stoeckel
Comments: 8 pages, 4 figures, supplement: 2 figures, visual abstract, highlights, source code
Subjects: Quantitative Methods (q-bio.QM); Populations and Evolution (q-bio.PE)

Markov chains are a common framework for individual-based state and time discrete models in ecology and evolution. Their use, however, is largely limited to systems with a low number of states, since the transition matrices involved pose considerable challenges as their size and their density increase. Big, dense transition matrices may easily defy both the computer’s memory and the scientists’ ability to interpret them, due to the very high amount of information they contain; yet approximations using other types of models are not always the best solution.
We propose a set of methods to overcome the difficulties associated with big, dense Markov chain transition matrices. Using a population genetic model as an example, we demonstrate how big matrices can be transformed into clear and easily interpretable graphs with the help of network analysis. Moreover, we describe an algorithm to save computer memory by substituting the original matrix with a sparse approximation while preserving all its mathematically important properties. In the same model example, we manage to store about 90% less data while keeping more than 99% of the information contained in the matrix and a closely corresponding dominant eigenvector.
Our approach is an example of how numerical limitations on the number of states in a Markov chain can be overcome. By facilitating their use, state-rich Markov chain models may become a valuable supplement to the diversity of models currently employed in biology.
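To make the idea of a sparse approximation concrete, here is a minimal sketch under stated assumptions. It is not the authors' algorithm: it uses plain per-row thresholding (keep the largest entries of each row until a chosen share of the probability mass is retained, then renormalise), and the test matrix is a small Wright-Fisher model with symmetric mutation chosen purely for illustration. All function names and parameter values are hypothetical.

```python
from math import comb

import numpy as np


def wright_fisher_matrix(n_copies=50, u=0.01):
    """Dense transition matrix of a Wright-Fisher model with symmetric mutation."""
    size = n_copies + 1
    P = np.empty((size, size))
    for i in range(size):
        x = i / n_copies
        p = x * (1 - u) + (1 - x) * u  # allele frequency after mutation
        P[i] = [comb(n_copies, j) * p**j * (1 - p) ** (n_copies - j) for j in range(size)]
    return P


def sparsify_rows(P, keep_mass=0.99):
    """Per row, keep the largest entries reaching keep_mass, then renormalise."""
    P = np.asarray(P, dtype=float)
    S = np.zeros_like(P)
    for i, row in enumerate(P):
        order = np.argsort(row)[::-1]           # columns, largest entry first
        cumulative = np.cumsum(row[order])
        k = int(np.searchsorted(cumulative, keep_mass)) + 1
        kept = order[:k]
        S[i, kept] = row[kept]
        S[i] /= S[i].sum()                      # rows must still sum to one
    return S


def stationary(P, iters=10_000, tol=1e-12):
    """Approximate the dominant left eigenvector by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi


P = wright_fisher_matrix()
S = sparsify_rows(P, keep_mass=0.99)
tv = 0.5 * np.abs(stationary(P) - stationary(S)).sum()
print(f"non-zero entries: {np.count_nonzero(P)} -> {np.count_nonzero(S)}; "
      f"total variation distance between stationary distributions: {tv:.4f}")
```

Simple thresholding like this keeps the matrix stochastic but gives no guarantee about the dominant eigenvector, so the printed distance is worth checking for each model. The preprint's method is described as preserving the mathematically important properties, which this sketch does not claim to do.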

The site frequency spectrum of dispensable genes

Franz Baumdicker
Comments: 24 pages, 8 figures
Subjects: Populations and Evolution (q-bio.PE); Probability (math.PR)

The differences between DNA sequences within a population are the basis for inferring the ancestral relationships of the individuals. Within the classical infinitely many sites model, it is possible to estimate the mutation rate based on the site frequency spectrum, which comprises the numbers $C_1,\ldots,C_{n-1}$, where $n$ is the sample size and $C_s$ is the number of site mutations (Single Nucleotide Polymorphisms, SNPs) which are seen in $s$ genomes. Classical results can be used to compare the observed site frequency spectrum with its neutral expectation, $E[C_s]= \theta_2/s$, where $\theta_2$ is the scaled site mutation rate. In this paper, we relax the assumption of the infinitely many sites model that all individuals only carry homologous genetic material. In particular, it is now well known that bacterial genomes have the ability to gain and lose genes, such that every single genome is a mosaic of genes, and genes are present and absent in a random fashion, giving rise to the dispensable genome. While this presence and absence has been modeled under neutral evolution within the infinitely many genes model in previous papers, we link presence and absence of genes with the numbers of site mutations seen within each gene. In this work we derive a formula for the expectation of the joint gene and site frequency spectrum, where $G_{k,s}$ denotes the number of mutated sites occurring in exactly $s$ gene sequences, while the corresponding gene is present in exactly $k$ individuals. We show that standard estimators of $\theta_2$ for dispensable genes are biased and that the site frequency spectrum for dispensable genes differs from the classical result.
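The classical baseline that the abstract compares against can be written out in a few lines. The sketch below builds the neutral expectation $E[C_s]=\theta_2/s$ and applies two widely used estimators, Watterson's and Tajima's pairwise-difference estimator. Treating these two as the "standard estimators" is an assumption of this example, and the code covers only the classical case in which every gene is present in all $n$ genomes, not the dispensable-gene model of the paper.

```python
from math import isclose


def expected_sfs(theta, n):
    """Neutral expectation E[C_s] = theta / s for s = 1, ..., n-1."""
    return [theta / s for s in range(1, n)]


def watterson(sfs):
    """Watterson's estimator: segregating sites divided by the harmonic number a_n."""
    n = len(sfs) + 1
    a_n = sum(1.0 / s for s in range(1, n))
    return sum(sfs) / a_n


def tajima_pi(sfs):
    """Tajima's estimator: mean number of pairwise differences."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2
    return sum(s * (n - s) * c for s, c in enumerate(sfs, start=1)) / pairs


sfs = expected_sfs(theta=5.0, n=20)
print(watterson(sfs), tajima_pi(sfs))  # both recover theta (up to rounding) on the neutral expectation
```

Feeding these functions a spectrum that deviates from $\theta_2/s$ makes the two estimates diverge, which is the kind of discrepancy the abstract reports for dispensable genes.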

Extraordinarily wide genomic impact of a selective sweep associated with the evolution of sex ratio distorter suppression

Emily A Hornett, Bruce Moran, Louise A Reynolds, Sylvain Charlat, Samuel Tazzyman, Nina Wedell, Chris D Jiggins, Gregory Hurst

Symbionts that distort their host's sex ratio by favouring the production and survival of females are common in arthropods. Their presence produces intense Fisherian selection to return the sex ratio to parity, typified by the rapid spread of host 'suppressor' loci that restore male survival/development. In this study, we investigated the genomic impact of a selective event of this kind in the butterfly Hypolimnas bolina. Through linkage mapping we first identified a genomic region that was necessary for males to survive Wolbachia-induced killing. We then investigated the genomic impact of the rapid spread of suppression that converted the Samoan population of this butterfly from a 100:1 female-biased sex ratio in 2001, to a 1:1 sex ratio by 2006. Models of this process revealed the potential for a chromosome-wide selective sweep. To measure the impact directly, the pattern of genetic variation before and after the episode of selection was compared. Significant changes in allele frequencies were observed over a 25 cM region surrounding the suppressor locus, alongside generation of linkage disequilibrium. The presence of novel allelic variants in 2006 suggests that the suppressor was introduced via immigration rather than through de novo mutation. In addition, further sampling in 2010 indicated that many of the introduced variants were lost or had reduced in frequency since 2006. We hypothesise that this loss may have resulted from a period of purifying selection that removed deleterious material that introgressed during the initial sweep. Our observations of the impact of suppression of sex ratio distorting activity reveal an extraordinarily wide genomic imprint, reflecting its status as one of the strongest selective forces in nature.

inPHAP: Interactive visualization of genotype and phased haplotype data

Günter Jäger, Alexander Peltzer, Kay Nieselt
Comments: BioVis 2014 conference
Subjects: Graphics (cs.GR); Genomics (q-bio.GN)

Background: To understand individual genomes it is necessary to look at the variations that lead to changes in phenotype and possibly to disease. However, genotype information alone is often not sufficient and additional knowledge regarding the phase of the variation is needed to make correct interpretations. Interactive visualizations that allow the user to explore the data in various ways can be of great assistance in the process of making well-informed decisions. Currently, however, there is a lack of visualizations that are able to deal with phased haplotype data. Results: We present inPHAP, an interactive visualization tool for genotype and phased haplotype data. inPHAP features a variety of interaction possibilities such as zooming, sorting, filtering and aggregation of rows in order to explore patterns hidden in large genetic data sets. As a proof of concept, we apply inPHAP to the phased haplotype data set of Phase 1 of the 1000 Genomes Project. Thereby, inPHAP’s ability to show genetic variations at the population level as well as at the individual level is demonstrated for several disease-related loci. Conclusions: As of today, inPHAP is the only visual analytical tool that allows the user to explore unphased and phased haplotype data interactively. Due to its highly scalable design, inPHAP can be applied to large datasets with up to 100 GB of data, enabling users to visualize even large-scale input data. inPHAP closes the gap between common visualization tools for unphased genotype data and introduces several new features, such as the visualization of phased data.