V genes in primates from whole genome shotgun data

David N Olivieri, Francisco Gambon-Deza

The adaptive immune system uses V genes for antigen recognition. The evolutionary diversification and selection processes within and across species and orders are poorly understood. Here, we studied the amino acid (AA) sequences obtained from translated in-frame V exons of immunoglobulins (IG) and T cell receptors (TR) from 16 primate species whose genomes have been sequenced. Multi-species comparative analysis supports the hypothesis that V genes in the IG loci undergo birth/death processes, thereby permitting rapid adaptability over evolutionary time. We also show that multiple cladistic groupings exist in the TRA (35 clades) and TRB (25 clades) V gene loci and that each primate species typically contributes at least one V gene to each of these clades. The results demonstrate that IG V genes and TR V genes have quite different evolutionary pathways; multiple duplications can explain the IG loci results, while co-evolutionary pressures can explain the phylogenetic results, as seen in genes of the TR loci. We describe how each of the 35 V gene clades of the TRA locus and 25 clades of the TRB locus must have specific and necessary roles for the viability of the species.

Improved genome inference in the MHC using a population reference graph

Alexander Dilthey, Charles J Cox, Zamin Iqbal, Matthew R Nelson, Gil McVean

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
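
The idea of reconstructing a sample as a path through several reference haplotypes, with recombination allowed between them, can be illustrated with a toy Viterbi decoder. This is only a minimal sketch under strong assumptions, not the authors' implementation: the haplotypes are taken to be pre-aligned strings of equal length rather than a graph, the observation is a single error-prone sequence rather than read data, and the function name `viterbi_mosaic` and the parameters `p_switch` and `p_error` are hypothetical.

```python
import math


def viterbi_mosaic(haplotypes, observed, p_switch=0.01, p_error=0.01):
    """Return the most probable haplotype index at each position.

    Hidden state: which reference haplotype is being copied.
    Transition: stay with prob 1 - p_switch, else recombine to another haplotype.
    Emission: observed base matches the copied haplotype with prob 1 - p_error.
    All names and parameter values here are illustrative assumptions.
    """
    n = len(haplotypes)
    length = len(observed)
    log_stay = math.log(1.0 - p_switch)
    log_switch = math.log(p_switch / (n - 1)) if n > 1 else float("-inf")
    log_match = math.log(1.0 - p_error)
    log_mismatch = math.log(p_error)

    def emit(h, pos):
        return log_match if haplotypes[h][pos] == observed[pos] else log_mismatch

    # Uniform prior over the starting haplotype.
    score = [math.log(1.0 / n) + emit(h, 0) for h in range(n)]
    back = []
    for pos in range(1, length):
        new_score, pointers = [], []
        for h in range(n):
            # Best predecessor state for haplotype h at this position.
            best_prev = max(
                range(n),
                key=lambda g: score[g] + (log_stay if g == h else log_switch),
            )
            trans = log_stay if best_prev == h else log_switch
            new_score.append(score[best_prev] + trans + emit(h, pos))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# A sample that copies haplotype 0 and then recombines onto haplotype 1.
print(viterbi_mosaic(["AAAAAAAA", "CCCCCCCC"], "AAAACCCC"))
```

With the default penalties, one recombination is far cheaper than four mismatches, so the decoded path switches haplotype at the breakpoint rather than explaining the second half as sequencing error.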

Convergent Evolution During Local Adaptation to Patchy Landscapes

Peter L. Ralph, Graham Coop

Species often encounter, and adapt to, many patches of locally similar environmental conditions across their range. Such adaptation can occur through convergent evolution as different alleles arise and spread in different patches, or through the spread of alleles by migration acting to synchronize adaptation across the species. The tension between the two reflects the degree of constraint imposed on evolution by the underlying genetic architecture versus how effectively selection acts to inhibit the geographic spread of locally adapted alleles. This paper studies a model of the balance between these two routes to adaptation in continuous environments with patchy selection pressures. We address the following questions: How long does it take for a new, locally adapted allele to appear in a patch of habitat where it is favored through new mutation? Or, through migration from another, already adapted patch? Which is more likely to occur, as a function of distance between the patches? How can we tell which has occurred, i.e., what population genetic signal is left by the spread of migrant alleles and how long does this signal persist? To answer these questions we decompose the migration–selection equilibrium surrounding an already adapted patch into families of migrant alleles, in particular treating those rare families that reach new patches as spatial branching processes. This provides a way to understand the role of geographic separation between patches in promoting convergent adaptation and the genomic signals it leaves behind. We illustrate these ideas using the convergent evolution of cryptic coloration in the rock pocket mouse, Chaetodipus intermedius, as an empirical example.

Purifying selection, drift and reversible mutation with arbitrarily high mutation rates


Brian Charlesworth, Kavita Jain
Comments: Supplementary Information available on request
Subjects: Populations and Evolution (q-bio.PE)

Some species exhibit very high levels of DNA sequence variability; there is also evidence for the existence of heritable epigenetic variants that experience state changes at a much higher rate than sequence variants. In both cases, the resulting high diversity levels within a population (hyperdiversity) mean that standard population genetics methods are not trustworthy. We analyze a population genetics model that incorporates purifying selection, reversible mutations and genetic drift, assuming a stationary population size. We derive analytical results for both population parameters and sample statistics, and discuss their implications for studies of natural genetic and epigenetic variation. In particular, we find that (1) many more intermediate frequency variants are expected than under standard models, even with moderately strong purifying selection; and (2) rates of evolution under purifying selection may be close to, or even exceed, neutral rates. These findings are related to empirical studies of sequence and epigenetic variation.
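
To give a feel for how reversible mutation and purifying selection interact, the following sketch iterates a deterministic two-allele haploid recursion. It is an illustration under assumptions that are not taken from the paper: genetic drift is ignored entirely (an infinite-population limit), selection acts before mutation in each generation, and the function name and the symbols `u` (mutation to the deleterious allele), `v` (back mutation) and `s` (selection coefficient) are hypothetical labels.

```python
def equilibrium_deleterious_frequency(u, v, s, q0=0.5, tol=1e-12,
                                      max_generations=10_000_000):
    """Iterate a deterministic haploid recursion to its fixed point.

    q is the frequency of the deleterious allele (fitness 1 - s).
    Assumed life cycle (an illustrative choice): selection, then mutation.
    Drift is NOT modelled here.
    """
    q = q0
    for _ in range(max_generations):
        # Selection: deleterious allele has relative fitness 1 - s.
        q_sel = q * (1.0 - s) / (1.0 - s * q)
        # Reversible mutation: loss by back mutation v, gain by forward mutation u.
        q_next = q_sel * (1.0 - v) + (1.0 - q_sel) * u
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


# Without selection the fixed point is u / (u + v); selection lowers it.
print(equilibrium_deleterious_frequency(u=0.01, v=0.02, s=0.0))
print(equilibrium_deleterious_frequency(u=0.01, v=0.02, s=0.1))
```

When mutation rates are of the same order as the selection coefficient, the deleterious allele settles at an appreciable frequency instead of staying rare, which is the regime where approximations that assume rare variants become unreliable.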

Phylogenetics and the human microbiome

Frederick A Matsen IV
Comments: to appear in Systematic Biology
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN)

The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodology has developed in a somewhat different direction than for other applications of phylogenetics. In this paper I review the field and its methods from the perspective of a phylogeneticist, and describe current challenges for phylogenetics coming from this type of work.

Bayesian Coalescent Epidemic Inference: Comparison of Stochastic and Deterministic SIR Population Dynamics


Alex Popinga, Tim Vaughan, Tanja Stadler, Alexei Drummond
Comments: Submitted
Subjects: Populations and Evolution (q-bio.PE)

Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingman's coalescent as well as from birth-death branching processes. The development of alternative approaches merits investigation of their characteristics and differences. Here we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent SIR tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the models' performance and contrast results obtained with a recently published birth-death sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. We also compare results of analyses using published HIV-1 sequence data obtained from known UK infection clusters. We show that the coalescent SIR model is effective at estimating epidemiological parameters from data with large fundamental reproductive number R0 and large population size S0. We find that the stochastic variant generally outperforms its deterministic counterpart. However, each of these Bayesian estimators is shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with R0 close to one or with small susceptible populations.
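
For readers unfamiliar with the deterministic SIR dynamics that underlie one of the tree priors, a minimal sketch follows. It is not the authors' code and contains no coalescent or Bayesian component: it simply integrates the standard mass-action SIR equations with a forward Euler step. The parameterisation (`beta` as a per-contact infection rate, `gamma` as a removal rate, and R0 computed as beta * S0 / gamma) and all function names are assumptions made for illustration.

```python
def basic_reproduction_number(beta, gamma, s0):
    """R0 under the mass-action parameterisation assumed in this sketch."""
    return beta * s0 / gamma


def simulate_sir(beta, gamma, s0, i0=1.0, r0=0.0, dt=0.01, t_max=100.0):
    """Forward-Euler integration of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I,
    dR/dt = gamma*I. Returns a list of (time, S, I, R) tuples."""
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(0.0, s, i, r)]
    steps = int(round(t_max / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i * dt
        new_removals = gamma * i * dt
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        trajectory.append((k * dt, s, i, r))
    return trajectory


# Example: R0 of about 2 in a population of 1000 susceptibles.
traj = simulate_sir(beta=0.0005, gamma=0.25, s0=1000)
peak_time, _, peak_infected, _ = max(traj, key=lambda row: row[2])
print(basic_reproduction_number(0.0005, 0.25, 1000), peak_time, peak_infected)
```

When R0 is below one the infected compartment in this model only declines from its initial value, so a deterministic trajectory carries little information about an outbreak; that is consistent with the abstract's remark that outbreaks with R0 close to one are a difficult case.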

Single haplotype assembly of the human genome from a hydatidiform mole

Karyn Meltz Steinberg, Valerie K Schneider, Tina A Graves-Lindsay, Robert S Fulton, Richa Agarwala, John Huddleston, Sergey A Shiryayev, Aleksandr Morgulis, Urvashi Surti, Wesley C Warren, Deanna M Church, Evan E Eichler, Richard K Wilson

An accurate and complete reference human genome sequence assembly is essential for accurately interpreting individual genomes and associating sequence variation with disease phenotypes. While the current reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can help overcome these problems, even the longest available reads do not resolve all regions of the human genome. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones, an optical map, and 100X whole genome shotgun (WGS) sequence coverage using short (Illumina) read pairs. We used the WGS sequence and the GRCh37 reference assembly to create a sequence assembly of the CHM1 genome. We subsequently incorporated 382 finished CHORI-17 BAC clone sequences to generate a second draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene and repeat content shows this assembly to be of excellent quality and contiguity, and comparisons to ClinVar and the NHGRI GWAS catalog show that the CHM1 genome does not harbor an excess of deleterious alleles. However, comparison to assembly-independent resources, such as BAC clone end sequences and long reads generated by a different sequencing technology (PacBio), indicates misassembled regions. The great majority of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future by sequencing BAC clone tiling paths.
This publicly available first generation assembly will be integrated into the Genome Reference Consortium (GRC) curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity

Corey T Watson, Karyn Meltz Steinberg, Tina A Graves-Lindsay, Rene L Warren, Maika Malig, Jacqueline E Schein, Richard K Wilson, Rob Holt, Evan Eichler, Felix Breden

Germline variation at immunoglobulin gene (IG) loci is critical for pathogen-mediated immunity, but establishing complete reference sequences in these regions is problematic because of segmental duplications and somatically rearranged source DNA. We sequenced BAC clones from the essentially haploid hydatidiform mole, CHM1, across the light chain IG loci, kappa (IGK) and lambda (IGL), creating single haplotype representations of these regions. The IGL haplotype is 1.25Mb of contiguous sequence with four novel V gene alleles, one novel C gene allele and an 11.9kbp insertion. The IGK haplotype consists of two contigs, a 644kbp proximal contig and a 466kbp distal contig, separated by a gap also present in the reference genome sequence. Our effort added 49kbp of unique sequence extending into this gap. The IGK haplotype contains six novel V gene alleles, one novel J gene allele and a 16.7kbp region with increased sequence identity between the two IGK contigs, exhibiting signatures of interlocus gene conversion. Our data facilitated the first comparison of nucleotide diversity between the light and IG heavy (IGH) chain haplotypes within a single genome, revealing a three- to six-fold enrichment in the IGH locus, supporting the theory that the heavy chain may be more important in determining antigenic specificity.

Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris

Gavin Douglas, Gesseca Gos, Kim Steige, Adriana Salcedo, Karl Holm, J. Arvid Ågren, Khaled Hazzouri, Wei Wang, Adrian E. Platts, Emily B. Josephs, Robert J. Williamson, Barbara Neuffer, Martin Lascoux, Tanja Slotte, Stephen Wright

Whole genome duplication events have occurred repeatedly during flowering plant evolution, and there is growing evidence for predictable patterns of gene retention and loss following polyploidization. Despite these important insights, the rate and processes governing the earliest stages of diploidization remain uncertain, and the relative importance of genetic drift vs. natural selection in the process of gene degeneration and loss is unclear. Here we conduct whole genome resequencing in Capsella bursa-pastoris, a recently formed tetraploid with one of the most widespread species distributions of any angiosperm. Whole genome data provide strong support for recent hybrid origins of the tetraploid species within the last 100,000-300,000 years from two diploid progenitors in the Capsella genus. Major-effect inactivating mutations are frequent, but many were inherited from the parental species and show no evidence of being fixed by positive selection. Despite a lack of large-scale gene loss, we observe a shift in the efficacy of natural selection genome-wide. Our results suggest that the earliest stages of diploidization are associated with quantitative genome-wide shifts in the strength and efficacy of selection rather than rapid gene loss, and that nonfunctionalization can receive a ‘head start’ through deleterious variants found in parental diploid populations.

Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

Brad Gulko, Ilan Gronau, Melissa J Hubisz, Adam Siepel

The identification of noncoding functional elements based on high-throughput genomic data remains an important open problem. Here we describe a novel computational approach for estimating the probability that a point mutation at each nucleotide position in a genome will influence organismal fitness. These fitness consequence (fitCons) scores can be interpreted as an evolution-based measure of potential genomic function. We first partition the genome into clusters of positions having distinct functional genomic “fingerprints,” based on cell-type-specific DNase-seq, RNA-seq, and histone modification data. Then we estimate the probability of fitness consequences for each cluster from associated patterns of genetic polymorphism and divergence using a recently developed probabilistic method called INSIGHT. We have generated fitCons scores for three human cell types based on publicly available genomic data and made them available as UCSC Genome Browser tracks. Like conventional evolutionary conservation scores, fitCons scores are clearly elevated in known coding and noncoding functional elements, but they show considerably better sensitivity than conservation scores for many noncoding elements. In addition, they perform exceptionally well in distinguishing ChIP-seq-supported transcription factor binding sites, expression quantitative trait loci, and predicted enhancers from putatively nonfunctional sequences. The fitCons scores indicate that 4.2-7.5% of nucleotide positions in the human genome have influenced fitness since the human-chimpanzee divergence. In contrast to several recent studies, they suggest that recent evolutionary turnover has had a relatively modest impact on the functional content of the genome. Our approach provides a unique new measure of genomic function that complements measures based on evolutionary conservation or functional genomics alone and is particularly well suited for characterizing turnover and evolutionary novelty.