Heterozygous gene truncation delineates the human haploinsufficient genome

István Bartha, Antonio Rausell, Paul McLaren, Manuel Tardaguila, Pejman Mohammadi, Nimisha Chaturvedi, Jacques Fellay, Amalio Telenti
doi: http://dx.doi.org/10.1101/010611

Sequencing projects have identified large numbers of rare stop-gain and frameshift variants in the human genome. As most of these are observed in the heterozygous state, they test a gene's tolerance to haploinsufficiency and dominant loss of function. We analyzed the distribution of truncating variants across 16,260 protein coding autosomal genes in 11,546 individuals. We observed 39,893 truncating variants affecting 12,062 genes, which significantly differed from an expectation of 12,916 genes under a model of neutral de novo mutation (p<1E-4). Extrapolating this to increasing numbers of sequenced individuals, we estimate that 10.8% of human genes do not tolerate heterozygous truncating variants. An additional 10 to 15% of truncated genes may be rescued by incomplete penetrance or compensatory mutations, or because the truncating variants are of limited functional impact. The study of protein truncating variants delineates the essential genome and, more generally, identifies rare heterozygous variants as an unexplored source of diversity of phenotypic traits and diseases.
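
The comparison between observed and expected numbers of truncated genes can be illustrated with a toy calculation. The sketch below is not the authors' model: it assumes that each truncating variant lands on a gene independently, with probability proportional to a hypothetical per-gene relative mutation rate, and computes how many distinct genes would be expected to carry at least one variant. The function name and the example rates are illustrative assumptions.

```python
def expected_genes_hit(relative_rates, n_variants):
    """Expected number of genes carrying at least one variant.

    Assumes (for illustration only) that each of n_variants variants falls
    on gene i independently with probability rate_i / sum(rates).
    """
    total = sum(relative_rates)
    expected = 0.0
    for rate in relative_rates:
        share = rate / total
        # Probability that gene i is missed by every variant is (1 - share)^n.
        expected += 1.0 - (1.0 - share) ** n_variants
    return expected


# Hypothetical example: four genes, one of them twice as mutable as the rest.
print(expected_genes_hit([2.0, 1.0, 1.0, 1.0], n_variants=5))
```

Under such a model, observing fewer truncated genes than expected is what one would see if some genes do not tolerate truncation.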

High Frequency Haplotypes are Expected Events, not Historical Figures

Elsa G Guillot, Murray P Cox
doi: http://dx.doi.org/10.1101/022160

Cultural transmission of reproductive success states that successful men have more children and pass this greater fecundity to their offspring. Balaresque and colleagues found high frequency haplotypes in a Central Asian Y chromosome dataset, which they attribute to cultural transmission of reproductive success by prominent historical men, including Genghis Khan. Using coalescent simulation, we show that these high frequency haplotypes are expected simply by chance. Hence, an explanation invoking cultural transmission of reproductive success is statistically unnecessary.
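
The idea that drift alone can produce common lineages can be sketched with a small simulation. Note that this is a forward-in-time neutral Wright-Fisher model of paternal lineages, swapped in for simplicity; it is not the coalescent simulation used in the study, it ignores mutation, and the population size, number of generations and seed are arbitrary assumptions.

```python
import random
from collections import Counter


def max_lineage_frequency(pop_size, generations, rng):
    """Frequency of the most common founding lineage after neutral drift."""
    # Every man starts as the founder of his own lineage.
    lineages = list(range(pop_size))
    for _ in range(generations):
        # Each son picks a father uniformly at random (no selection,
        # no inherited fecundity).
        lineages = [lineages[rng.randrange(pop_size)] for _ in range(pop_size)]
    counts = Counter(lineages)
    return max(counts.values()) / pop_size


def simulate(replicates, pop_size, generations, seed=0):
    """Run independent replicates and return the top lineage frequency of each."""
    rng = random.Random(seed)
    return [max_lineage_frequency(pop_size, generations, rng)
            for _ in range(replicates)]


if __name__ == "__main__":
    freqs = simulate(replicates=100, pop_size=200, generations=50)
    print("mean top-lineage frequency:", sum(freqs) / len(freqs))
```

Even with every man given an equal chance of fathering each son, some lineages grow well above their starting frequency of 1/pop_size, which is the qualitative point of the abstract.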

Low but significant genetic differentiation underlies biologically meaningful phenotypic divergence in a large Atlantic salmon population

Tutku Aykanat, Susan E Johnston, Panu Orell, Eero Niemelä, Jaakko Erkinaro, Craig Primmer
doi: http://dx.doi.org/10.1101/022178

Despite decades of research assessing the genetic structure of natural populations, the biological meaning of low yet significant genetic divergence often remains unclear due to a lack of associated phenotypic and ecological information. At the same time, structured populations with low genetic divergence and overlapping boundaries can potentially provide excellent models to study the eco-evolutionary dynamics in cases where high resolution genetic markers and relevant phenotypic and life history information are available. Here, we combined SNP-based population inference with extensive phenotypic and life history data to identify potential biological mechanisms driving fine scale sub-population differentiation in Atlantic salmon (Salmo salar) from the Teno River, a major salmon river in Europe. Two sympatrically occurring sub-populations had low but significant genetic differentiation (FST = 0.018) and displayed marked differences in the distribution of life history strategies, including variation in juvenile growth rate, age at maturity and size within age classes. Large, late-maturing individuals were virtually absent from one of the two sub-populations and there were significant differences in juvenile growth rates and size-at-age after oceanic migration between individuals in the respective sub-populations. Our findings suggest that different eco-evolutionary processes affect each sub-population and that hybridization and subsequent selection may maintain low genetic differentiation without hindering adaptive divergence.
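
For readers unfamiliar with FST, a minimal sketch of how it can be computed at a single biallelic locus is given below. It uses the textbook heterozygosity-based definition, FST = (HT - HS) / HT, with two equally weighted sub-populations. The abstract does not say which estimator the authors used, so this is an assumption, and the allele frequencies in the example are made up.

```python
def fst_two_populations(p1, p2):
    """Heterozygosity-based FST for one biallelic locus, two equal-sized groups.

    p1 and p2 are the frequencies of the same allele in each sub-population.
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if the two groups were one randomly mating pool.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # locus is monomorphic overall, so no differentiation
    # Mean expected heterozygosity within each sub-population.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies differing modestly between sub-populations.
print(fst_two_populations(0.45, 0.55))
```

Modest differences in allele frequency give small FST values, which is why a value such as 0.018 counts as low differentiation.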

Worldwide patterns of human epigenetic variation

Oana Carja, Julia L MacIsaac, Sarah M Mah, Brenna M Henn, Michael S Kobor, Marcus W Feldman, Hunter B Fraser
doi: http://dx.doi.org/10.1101/021931

DNA methylation is an epigenetic modification, influenced by both genetic and environmental variation, that can affect transcription and many organismal phenotypes. Although patterns of DNA methylation have been shown to differ between human populations, it remains to be determined whether epigenetic diversity mirrors the patterns observed for DNA polymorphisms or gene expression levels. We measured DNA methylation at 480,000 sites in 34 individuals from five diverse human populations in the Human Genome Diversity Panel, and analyzed these together with single nucleotide polymorphisms (SNPs) and gene expression data. We found greater population-specificity of DNA methylation than of mRNA levels, which may be driven by the greater genetic control of methylation. This study provides insights into gene expression and its epigenetic regulation across populations and offers a deeper understanding of worldwide patterns of epigenetic diversity in humans.

Predicting genome sizes and restriction enzyme recognition-sequence probabilities across the eukaryotic tree of life

Santiago Herrera, Paula H. Reyes-Herrera, Timothy M. Shank
doi: http://dx.doi.org/10.1101/007781

High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes, generically known as restriction-site associated DNA sequencing (RAD-seq), is a common strategy to generate genome-wide genotypic and sequence data from eukaryotes. A critical design element of any RAD-seq study is a knowledge of the approximate number of genetic markers that can be obtained for a taxon using different restriction enzymes, as this number determines the scope of a project, and ultimately defines its success. This number can only be directly determined if a reference genome sequence is available, or it can be estimated if the genome size and restriction recognition sequence probabilities are known. However, both scenarios are uncommon for non-model species. Here, we performed systematic in silico surveys of recognition sequences for diverse and commonly used type II restriction enzymes across the eukaryotic tree of life. Our observations reveal that recognition-sequence frequencies for a given restriction enzyme are strikingly variable among broad eukaryotic taxonomic groups, being largely determined by phylogenetic relatedness. We demonstrate that genome sizes can be predicted from cleavage frequency data obtained with restriction enzymes targeting “neutral” elements. Models based on genomic compositions are also effective tools to accurately calculate probabilities of recognition sequences across taxa, and can be applied to species for which reduced-representation data are available (including transcriptomes and “neutral” RAD-seq datasets). The analytical pipeline developed in this study, PredRAD (https://github.com/phrh/PredRAD), and the resulting databases constitute valuable resources that will help guide the design of any study using RAD-seq or related methods.
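
To make the link between genome size, recognition-sequence probability and marker number concrete, here is a minimal sketch of the simplest composition-based calculation. It assumes independent bases drawn according to a genome-wide GC content, which is only one possible model and not necessarily what PredRAD implements. The function names are hypothetical, and the genome size and GC content in the example are made up.

```python
def site_probability(recognition_seq, gc_content):
    """Probability that a random position starts a match to the sequence.

    Assumes independent bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1-gc)/2.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in recognition_seq.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unsupported base: {base!r}")
    return prob


def expected_sites(recognition_seq, gc_content, genome_size):
    """Expected number of recognition sites in a genome of the given length."""
    positions = genome_size - len(recognition_seq) + 1
    return site_probability(recognition_seq, gc_content) * max(positions, 0)


# EcoRI recognizes GAATTC; SbfI recognizes CCTGCAGG.
for enzyme, seq in [("EcoRI", "GAATTC"), ("SbfI", "CCTGCAGG")]:
    print(enzyme, expected_sites(seq, gc_content=0.41, genome_size=500_000_000))
```

Because the abstract reports that observed frequencies vary strongly among taxonomic groups, a single-parameter model like this one should be read as a first approximation only.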

How obstacles perturb population fronts and alter their genetic structure

Wolfram Moebius, Andrew W. Murray, David R. Nelson
doi: http://dx.doi.org/10.1101/021964

As populations spread into new territory, environmental heterogeneities can shape the population front and genetic composition. We study here the effect of one important building block of inhomogeneous environments, compact obstacles. With a combination of experiments, theory, and simulation, we show how isolated obstacles both create long-lived distortions of the front shape and amplify the effect of genetic drift. A system of bacteriophage T7 spreading on a spatially heterogeneous Escherichia coli lawn serves as an experimental model system to study population expansions. Using an inkjet printer, we create well-defined replicates of the lawn and quantitatively study the population expansion manifested in plaque growth. The transient perturbations of the plaque boundary found in the experiments are well described by a model in which the front moves with constant speed. Independent of the precise details of the expansion, we show that obstacles create a kink in the front that persists over large distances and is insensitive to the details of the obstacle’s shape. The small deviations between experimental findings and the predictions of the constant speed model can be understood with a more general reaction-diffusion model, which reduces to the constant speed model when the obstacle size is large compared to the front width. Using this framework, we demonstrate that frontier alleles that just graze the side of an isolated obstacle increase in abundance, a phenomenon we call ‘geometry-enhanced genetic drift’, complementary to the founder effect associated with spatial bottlenecks. Bacterial range expansions around nutrient-poor barriers and stochastic simulations confirm this prediction; the latter also highlight the effect of the obstacle on the genealogy of individuals at the front. We argue that related ideas and experimental techniques are applicable to a wide variety of more complex environments, leading to a better understanding of how environmental heterogeneities affect population range expansions.

The anatomical distribution of genetic associations

Alan B Wells, Nathan Kopp, Xiaoxiao Xu, David R O’Brien, Wei Yang, Arye Nehorai, Tracy L. Adair-Kirk, Raphael Kopan, Joseph D Dougherty
doi: http://dx.doi.org/10.1101/021824

Deeper understanding of the anatomical intermediaries for disease and other complex genetic traits is essential to understanding mechanisms and developing new interventions. Existing ontology tools provide functional annotations for many genes in the genome and they are widely used to develop mechanistic hypotheses based on genetic and transcriptomic data. Yet, information about where a set of genes is expressed may be equally useful in interpreting results and forming novel mechanistic hypotheses for a trait. Therefore, we developed a framework for statistically testing the relationship between gene expression across the body and sets of candidate genes from across the genome. We validated this tool and tested its utility on three applications. First, using thousands of loci identified by GWA studies, our framework identifies the number of disease-associated genes that have enriched expression in the disease-affected tissue. Second, we experimentally confirmed an underappreciated prediction highlighted by our tool: variation in skin-expressed genes is a major quantitative genetic modulator of white blood cell count – a trait considered to be a feature of the immune system. Finally, using gene lists derived from sequencing data, we show that human genes under constrained selective pressure are disproportionately expressed in nervous system tissues.
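
One common way to test whether a candidate gene set overlaps a tissue-enriched gene set more than expected by chance is a one-sided hypergeometric test. The sketch below illustrates that general idea only. The abstract does not specify the statistic used by the authors' framework, so the choice of test, the function name and the counts in the example are all assumptions.

```python
from math import comb


def enrichment_p_value(n_genes, n_tissue, n_candidates, n_overlap):
    """One-sided hypergeometric p-value, P(overlap >= n_overlap).

    n_genes:      size of the background gene universe
    n_tissue:     genes with enriched expression in the tissue
    n_candidates: genes in the candidate list (e.g. from GWA loci)
    n_overlap:    candidates that are also tissue-enriched
    """
    total = comb(n_genes, n_candidates)
    upper = min(n_tissue, n_candidates)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_tissue, k) * comb(n_genes - n_tissue, n_candidates - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes, 500 tissue-enriched, 100 candidates, 10 shared.
print(enrichment_p_value(20_000, 500, 100, 10))
```

A small p-value would suggest that the candidate genes are over-represented among genes expressed in that tissue; in practice the result would need correction for the number of tissues tested.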

The two-speed genomes of filamentous pathogens: waltz with plants

Suomeng Dong, Sylvain Raffaele, Sophien Kamoun
doi: http://dx.doi.org/10.1101/021774

Fungi and oomycetes include deep and diverse lineages of eukaryotic plant pathogens. The last 10 years have seen the sequencing of the genomes of a multitude of species of these so-called filamentous plant pathogens. Already, fundamental concepts have emerged. Filamentous plant pathogen genomes tend to harbor large repertoires of genes encoding virulence effectors that modulate host plant processes. Effector genes are not randomly distributed across the genomes but tend to be associated with compartments enriched in repetitive sequences and transposable elements. These findings have led to the “two-speed genome” model in which filamentous pathogen genomes have a bipartite architecture with gene-sparse, repeat-rich compartments serving as a cradle for adaptive evolution. Here, we review this concept and discuss how plant pathogens are great model systems to study evolutionary adaptations at multiple time scales. We will also introduce the next phase of research on this topic.

Most viewed on Haldane’s Sieve: June 2015

The most viewed preprints on Haldane’s Sieve in June 2015 were:

SimPhy: Phylogenomic Simulation of Gene, Locus and Species Trees

Diego Mallo, Leonardo de Oliveira Martins, David Posada
doi: http://dx.doi.org/10.1101/021709

We present here a fast and flexible software package, SimPhy, for the simulation of multiple gene families evolving under incomplete lineage sorting, gene duplication and loss, horizontal gene transfer (all three potentially leading to species tree/gene tree discordance) and gene conversion. SimPhy implements a hierarchical phylogenetic model in which the evolution of species, locus and gene trees is governed by global and local parameters (e.g., genome-wide, species-specific, locus-specific) that can be fixed or sampled from a priori statistical distributions. SimPhy also incorporates comprehensive models of substitution rate variation among lineages (uncorrelated relaxed clocks) and the capability of simulating partitioned nucleotide, codon and protein multilocus sequence alignments under a plethora of substitution models using the program INDELible. We validate SimPhy’s output using theoretical expectations and other programs, and show that it scales extremely well with complex models and/or large trees, being an order of magnitude faster than the most similar program (DLCoal-Sim). In addition, we demonstrate how SimPhy can be useful to understand interactions among different evolutionary processes, conducting a simulation study to characterize the systematic overestimation of the duplication time when using standard reconciliation methods. SimPhy is available at https://github.com/adamallo/SimPhy, where users can find the source code, pre-compiled executables, a detailed manual and example cases.