Model Inadequacy and Mistaken Inferences of Trait-Dependent Speciation

Daniel L. Rabosky, Emma E. Goldberg
(Submitted on 22 Dec 2014)

Species richness varies widely across the tree of life, and there is great interest in identifying ecological, geographic, and other factors that affect rates of species proliferation. Recent methods for explicitly modeling the relationships among character states, speciation rates, and extinction rates on phylogenetic trees (BiSSE, QuaSSE, GeoSSE, and related models) have been widely used to test hypotheses about character state-dependent diversification rates. Here, we document the disconcerting ease with which neutral traits are inferred to have statistically significant associations with speciation rate. We first demonstrate this unfortunate effect for a known model assumption violation: shifts in speciation rate associated with a character not included in the model. We further show that for many empirical phylogenies, characters simulated in the absence of state-dependent diversification exhibit an even higher Type I error rate, indicating that the method is susceptible to additional, unknown model inadequacies. For traits that evolve slowly, the root cause appears to be a statistical framework that does not require replicated shifts in character state and diversification. However, spurious associations between character state and speciation rate arise even for traits that lack phylogenetic signal, suggesting that phylogenetic pseudoreplication alone cannot fully explain the problem. The surprising severity of this phenomenon suggests that many trait-diversification relationships reported in the literature may not be real. More generally, we highlight the need for diagnosing and understanding the consequences of model inadequacy in phylogenetic comparative methods.

Marker-based estimation of heritability in immortal populations

Willem Kruijer, Martin Boer, Marcos Malosetti, Padraic J. Flood, Bas Engel, Rik Kooke, Joost Keurentjes, Fred van Eeuwijk
(Submitted on 21 Dec 2014)

Heritability is a central parameter in quantitative genetics, both from an evolutionary and a breeding perspective. For plant traits, heritability is traditionally estimated by comparing within- and between-genotype variability. This approach estimates broad-sense heritability, and does not account for different genetic relatedness. With the availability of high-density markers there is growing interest in marker-based estimates of narrow-sense heritability, using mixed models in which genetic relatedness is estimated from genetic markers. Such estimates have received much attention in human genetics but are rarely reported for plant traits. A major obstacle is that current methodology and software assume a single phenotypic value per genotype, hence requiring genotypic means. An alternative that we propose here is to use mixed models at individual plant or plot level. Using statistical arguments, simulations and real data we investigate the feasibility of both approaches, and how these affect genomic prediction with G-BLUP and genome-wide association studies. Heritability estimates obtained from genotypic means had very large standard errors and were sometimes biologically unrealistic. Mixed models at individual plant or plot level produced more realistic estimates, and for simulated traits standard errors were up to 13 times smaller. Genomic prediction was also improved by using these mixed models, with up to a 49% increase in accuracy. For GWAS on simulated traits, the use of individual plant data gave almost no increase in power. The new methodology is applicable to any complex trait where multiple replicates of individual genotypes can be scored. This includes important agronomic crops, as well as bacteria and fungi.
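
To make the "within and between genotype variability" comparison concrete, here is a minimal sketch of the traditional broad-sense estimate from a balanced one-way ANOVA. It is not the authors' marker-based mixed model: the function name, the balanced design, and the plot-level definition H^2 = var_g / (var_g + var_e) are assumptions chosen for illustration.

```python
from statistics import mean


def broad_sense_heritability(plots):
    """Plot-level broad-sense heritability from a balanced one-way ANOVA.

    plots: list with one entry per genotype, each a list of replicate
    phenotype values (assumed: same number of replicates per genotype).
    """
    g = len(plots)
    r = len(plots[0]) if plots else 0
    if g < 2 or r < 2 or any(len(p) != r for p in plots):
        raise ValueError("need a balanced design: >= 2 genotypes, >= 2 replicates")

    geno_means = [mean(p) for p in plots]
    grand_mean = mean(geno_means)

    # Mean squares between genotypes and within genotypes (residual).
    ms_between = r * sum((m - grand_mean) ** 2 for m in geno_means) / (g - 1)
    ms_within = sum(
        (y - m) ** 2 for p, m in zip(plots, geno_means) for y in p
    ) / (g * (r - 1))

    # Method-of-moments variance components; negative estimates are set to 0.
    var_g = max((ms_between - ms_within) / r, 0.0)
    var_e = ms_within

    total = var_g + var_e
    return var_g / total if total > 0 else float("nan")


if __name__ == "__main__":
    # Hypothetical data: three genotypes, three replicate plots each.
    example = [[10.1, 9.8, 10.4], [12.0, 12.5, 11.7], [8.9, 9.3, 9.1]]
    print(broad_sense_heritability(example))
```

Because this estimate uses only genotype labels, it ignores how closely related the genotypes are, which is the limitation the abstract attributes to the traditional approach.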

Bias in Estimators of Archaic Admixture

Alan R. Rogers, Ryan J. Bohlender
(Submitted on 20 Dec 2014)

This article evaluates bias in one class of methods used to estimate archaic admixture in modern humans. These methods study the pattern of allele sharing among modern and archaic genomes. They are sensitive to “ghost” admixture, which occurs when a population receives archaic DNA from sources not acknowledged by the statistical model. The effect of ghost admixture depends on two factors: branch-length bias and population-size bias. Branch-length bias occurs because a given amount of admixture has a larger effect if the two populations have been separated for a long time. Population-size bias occurs because differences in population size distort branch lengths in the gene genealogy. In the absence of ghost admixture, these effects are small. They become important, however, in the presence of ghost admixture. Estimators differ in the pattern of response. Increasing a given parameter may inflate one estimator but deflate another. For this reason, comparisons among estimators are informative. Using such comparisons, this article supports previous findings that the archaic population was small and that Europeans received little gene flow from archaic populations other than Neanderthals. It also identifies an inconsistency in estimates of archaic admixture into Melanesia.

Scaling probabilistic models of genetic variation to millions of humans

Prem Gopalan, Wei Hao, David M. Blei, John D. Storey
doi: http://dx.doi.org/10.1101/013227

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. Researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans. The number of humans that have been densely genotyped across the genome has grown significantly in recent years. In aggregate about 1M individuals have been densely genotyped to date, and if we could analyze this data then we would have a nearly complete picture of human genetic variation. Existing state-of-the-art methods, however, cannot scale to data of this size. To this end, we have developed TeraStructure. TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to approximate Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure. On real and simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis.

A FISH-based chromosome map for the European corn borers yields insights into ancient chromosomal fusions in the silkworm.

Yuji Yasukochi, Mizuki Ohno, Fukashi Shibata, Akiya Jouraku, Ryo Nakano, Yukio Ishikawa, Ken Sahara
doi: http://dx.doi.org/10.1101/013284

A significant feature of the genomes of Lepidoptera, butterflies and moths, is the high conservation of chromosome organization. Recent remarkable progress in genome sequencing of Lepidoptera has revealed that syntenic gene order is extensively conserved across phylogenetically distant species. The ancestral karyotype of Lepidoptera is thought to be n = 31; however, that of the most well studied moth, Bombyx mori, is n = 28, suggesting that three chromosomal fusion events occurred in this lineage. To identify the boundaries between predicted ancient fusions involving B. mori chromosomes 11, 23 and 24, we constructed FISH-based chromosome maps of the European corn borer, Ostrinia nubilalis (n = 31). We first determined 511 Mb of genomic sequence of the Asian corn borer, Ostrinia furnacalis, a congener of O. nubilalis, and isolated BAC and fosmid clones that were expected to localize in candidate regions for the boundaries using these sequences. Combined with FISH and genetic analysis, we narrowed down the candidate regions to 40 kb to 1.5 Mb, in strong agreement with a previous estimate based on the genome of a butterfly, Melitaea cinxia. The significant difference in the lengths of the candidate regions where no functional genes were observed may reflect the evolutionary time after fusion events.

Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data

Debora Yoshihara Caldeira Brandt, Vitor Rezende da Costa Aguiar, Bárbara Domingues Bitarello, Kelly Nunes, Jérôme Goudet, Diogo Meyer
doi: http://dx.doi.org/10.1101/013151

Next Generation Sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, such as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the Human Leukocyte Antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the SNPs reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1,092 1000G samples, and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect, and that allele frequencies are estimated with an error higher than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias towards overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates, and discuss the outcomes of including those sites in different kinds of analyses. Since the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.
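
The benchmarking idea (score NGS calls against a gold standard, then look at the sign of the frequency error) can be sketched for a single SNP as follows. The genotype coding (count of reference alleles per diploid sample), the function names, and the toy data are assumptions made for illustration; this is not the authors' pipeline.

```python
def genotype_error_rate(calls, truth):
    """Fraction of samples whose NGS genotype differs from the gold standard."""
    if not calls or len(calls) != len(truth):
        raise ValueError("calls and truth must be non-empty and the same length")
    return sum(c != t for c, t in zip(calls, truth)) / len(calls)


def ref_allele_frequency(genotypes):
    """Reference allele frequency; genotypes are reference-allele counts (0, 1, 2)."""
    if not genotypes:
        raise ValueError("genotypes must be non-empty")
    return sum(genotypes) / (2 * len(genotypes))


def ref_frequency_bias(calls, truth):
    """Signed error: positive means the NGS calls overestimate the reference allele."""
    return ref_allele_frequency(calls) - ref_allele_frequency(truth)


if __name__ == "__main__":
    # Hypothetical SNP in four samples: two heterozygotes miscalled as
    # reference homozygotes, the pattern expected under mapping bias.
    truth = [2, 1, 0, 1]
    calls = [2, 2, 0, 2]
    print(genotype_error_rate(calls, truth))
    print(ref_frequency_bias(calls, truth))
```

A SNP would then be flagged as poorly estimated when the absolute value of this signed error exceeds a threshold such as the 0.1 used in the abstract.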

Genome-engineering with CRISPR-Cas9 in the mosquito Aedes aegypti

Kathryn E Kistler, Leslie B Vosshall, Benjamin J Matthews
doi: http://dx.doi.org/10.1101/013276

The mosquito Aedes aegypti is a potent vector of the Chikungunya, yellow fever, and Dengue viruses, which result in hundreds of millions of infections and over 50,000 human deaths per year. Loss-of-function mutagenesis in Ae. aegypti has been established with TALENs, ZFNs, and homing endonucleases, which require the engineering of DNA-binding protein domains to generate target specificity for a particular stretch of genomic DNA. Here, we describe the first use of the CRISPR-Cas9 system to generate targeted, site-specific mutations in Ae. aegypti. CRISPR-Cas9 relies on RNA-DNA base-pairing to generate targeting specificity, resulting in cheaper, faster, and more flexible genome-editing reagents. We investigate the efficiency of reagent concentrations and compositions, demonstrate the ability of CRISPR-Cas9 to generate several different types of mutations via disparate repair mechanisms, and show that stable germ-line mutations can be readily generated at the vast majority of genomic loci tested. This work offers a detailed exploration into the optimal use of CRISPR-Cas9 in Ae. aegypti that should be applicable to non-model organisms previously out of reach of genetic modification.

Expansion of the HSFY gene family in pig lineages

Benjamin M Skinner, Kim Lachani, Carole A Sargent, Fengtang Yang, Peter JI Ellis, Toby Hunt, Beiyuan Fu, Sandra Louzada, Carol Churcher, Chris Tyler-Smith, Nabeel A Affara
doi: http://dx.doi.org/10.1101/012906

Amplified gene families on sex chromosomes can harbour genes with important biological functions, especially relating to fertility. The HSFY family has amplified on the Y chromosome of the domestic pig (Sus scrofa), in an apparently independent event to an HSFY expansion on the Y chromosome of cattle (Bos taurus). Although the biological functions of HSFY genes are poorly understood, they appear to be involved in gametogenesis in a number of mammalian species, and, in cattle, HSFY gene copy number correlates with levels of fertility. We have investigated the HSFY family in domestic pigs, and other suid species including warthogs, bushpigs, babirusas and peccaries. The domestic pig contains at least two amplified variants of HSFY, distinguished predominantly by presence or absence of a SINE within the intron. Both these variants are expressed in testis, and both are present in approximately 50 copies each in a single cluster on the short arm of the Y. The longer form has multiple nonsense mutations rendering it likely non-functional, but many of the shorter forms still have coding potential. Other suid species also have these two variants of HSFY, and estimates of copy number suggest the HSFY family may have amplified independently twice during suid evolution. Given the association of HSFY gene copy number with fertility in cattle, HSFY is likely to play an important role in spermatogenesis in pigs also.

Stationary solutions for metapopulation Moran models with mutation and selection

George W. A. Constable, Alan J. McKane
(Submitted on 19 Dec 2014)

We construct an individual-based metapopulation model of population genetics featuring migration, mutation, selection and genetic drift. In the case of a single 'island', the model reduces to the Moran model. Using the diffusion approximation and timescale separation arguments, an effective one-variable description of the model is developed. The effective description bears similarities to the well-mixed Moran model with effective parameters which depend on the network structure and island sizes, and is amenable to analysis. Predictions from the reduced theory match the results from stochastic simulations across a range of parameters. The nature of the fast-variable elimination technique we adopt is further studied by applying it to a linear system, where it provides a precise description of the slow dynamics in the limit of large timescale separation.
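
For readers unfamiliar with the single-island limit, the following is a minimal sketch of a two-allele Moran process with mutation and selection. It is a generic textbook formulation, not the authors' metapopulation model or their reduction technique: the fitness parameterisation (1 + s for allele A, 1 for allele B), the symmetric mutation probability u, and all function names are assumptions.

```python
import random


def moran_step(n_a, n_total, s, u, rng):
    """One birth-death event; returns the new number of A alleles."""
    # Parent is chosen with probability proportional to fitness
    # (assumed: A has fitness 1 + s, B has fitness 1).
    weight_a = n_a * (1.0 + s)
    weight_b = n_total - n_a
    parent_is_a = rng.random() < weight_a / (weight_a + weight_b)

    # Symmetric mutation: the offspring's allele flips with probability u.
    mutated = rng.random() < u
    offspring_is_a = parent_is_a != mutated

    # The individual that dies is chosen uniformly from the current population.
    dead_is_a = rng.random() < n_a / n_total

    return n_a + int(offspring_is_a) - int(dead_is_a)


def simulate_moran(n_a, n_total, s, u, steps, seed=0):
    """Trajectory of the A-allele count over a fixed number of events."""
    rng = random.Random(seed)
    trajectory = [n_a]
    for _ in range(steps):
        n_a = moran_step(n_a, n_total, s, u, rng)
        trajectory.append(n_a)
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameters: 100 individuals, weak selection and mutation.
    path = simulate_moran(n_a=50, n_total=100, s=0.02, u=0.001, steps=10_000)
    print(path[-1] / 100)
```

Population size stays fixed because each event pairs one birth with one death; the metapopulation model in the abstract adds migration between several such islands.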

The pig X and Y chromosomes: structure, sequence and evolution

Benjamin M Skinner, Carole A Sargent, Carol Churcher, Toby Hunt, Javier Herrero, Jane Loveland, Matt Dunn, Sandra Louzada, Beiyuan Fu, William Chow, James Gilbert, Siobhan Austin-Guest, Kathryn Beal, Denise Carvalho-Silva, William Cheng, Daria Gordon, Darren Grafham, Matt Hardy, Jo Harley, Heidi Hauser, Philip Howden, Kerstin Howe, Kim Lachani, Peter JI Ellis, Daniel Kelly, Giselle Kerry, James Kerwin, Bee Ling Ng, Glen Threadgold, Thomas Wileman, Jonathan MD Wood, Fengtang Yang, Jen Harrow, Nabeel A Affara, Chris Tyler-Smith
doi: http://dx.doi.org/10.1101/012914

We have generated an improved assembly and gene annotation of the pig X chromosome, and a first draft assembly of the pig Y chromosome, by sequencing BAC and fosmid clones, and incorporating information from optical mapping and fibre-FISH. The X chromosome carries 1,014 annotated genes, 689 of which are protein-coding. Gene order closely matches that found in Primates (including humans) and Carnivores (including cats and dogs), which is inferred to be ancestral. Nevertheless, several protein-coding genes present on the human X chromosome were absent from the pig (e.g. the cancer/testis antigen family) or inactive (e.g. AWAT1), and 38 pig-specific X-chromosomal genes were annotated, 22 of which were olfactory receptors. The pig Y chromosome assembly focussed on two clusters of male-specific low-copy number genes, separated by an ampliconic region including the HSFY gene family, which together make up most of the short arm. Both clusters contain palindromes with high sequence identity, presumably maintained by gene conversion. The long arm of the chromosome is almost entirely repetitive, containing previously characterised sequences. Many of the ancestral X-related genes previously reported in at least one mammalian Y chromosome are represented either as active genes or partial sequences. This sequencing project has allowed us to identify genes – both single copy and amplified – on the pig Y, to compare the pig X and Y chromosomes for homologous sequences, and thereby to reveal mechanisms underlying pig X and Y chromosome evolution.