Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data

Debora Yoshihara Caldeira Brandt, Vitor Rezende da Costa Aguiar, Bárbara Domingues Bitarello, Kelly Nunes, Jérôme Goudet, Diogo Meyer
doi: http://dx.doi.org/10.1101/013151

Next Generation Sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, such as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the Human Leukocyte Antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the SNPs reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1,092 1000G samples, and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect, and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias towards overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates, and discuss the outcomes of including those sites in different kinds of analyses. Since the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.
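
The benchmarking idea in this abstract (compare NGS genotype calls against a gold standard, then compare the allele frequencies each implies) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the genotype coding (number of reference alleles per sample), the function names, and the toy data are all invented here. Only the 0.1 threshold comes from the abstract.

```python
from typing import Dict, List, Tuple

# Assumed coding: each genotype is the number of reference alleles (0, 1 or 2).
Genotypes = Dict[str, List[int]]  # SNP id -> one genotype per sample


def ref_allele_freq(genotypes: List[int]) -> float:
    """Reference allele frequency for one SNP in a diploid sample."""
    return sum(genotypes) / (2 * len(genotypes))


def benchmark(
    ngs: Genotypes, gold: Genotypes, threshold: float = 0.1
) -> Tuple[float, Dict[str, float], List[str]]:
    """Compare NGS calls with gold-standard calls.

    Returns the fraction of mismatched genotype calls, the per-SNP
    frequency difference (NGS minus gold), and the SNPs whose absolute
    difference exceeds the threshold.
    """
    mismatches = 0
    total = 0
    freq_diff: Dict[str, float] = {}
    for snp, gold_calls in gold.items():
        ngs_calls = ngs[snp]
        mismatches += sum(a != b for a, b in zip(ngs_calls, gold_calls))
        total += len(gold_calls)
        freq_diff[snp] = ref_allele_freq(ngs_calls) - ref_allele_freq(gold_calls)
    flagged = [snp for snp, diff in freq_diff.items() if abs(diff) > threshold]
    return mismatches / total, freq_diff, flagged


# Toy data, invented for illustration: 4 samples typed at 2 SNPs.
gold = {"snp1": [0, 1, 1, 2], "snp2": [2, 2, 1, 0]}
ngs = {"snp1": [1, 2, 1, 2], "snp2": [2, 2, 1, 0]}
error_rate, freq_diff, flagged = benchmark(ngs, gold)
```

In this toy case a positive entry in `freq_diff` means the NGS calls overestimate the reference allele frequency, which is the direction of bias the abstract reports.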

Genome-engineering with CRISPR-Cas9 in the mosquito Aedes aegypti

Kathryn E Kistler, Leslie B Vosshall, Benjamin J Matthews
doi: http://dx.doi.org/10.1101/013276

The mosquito Aedes aegypti is a potent vector of the Chikungunya, yellow fever, and Dengue viruses, which result in hundreds of millions of infections and over 50,000 human deaths per year. Loss-of-function mutagenesis in Ae. aegypti has been established with TALENs, ZFNs, and homing endonucleases, which require the engineering of DNA-binding protein domains to generate target specificity for a particular stretch of genomic DNA. Here, we describe the first use of the CRISPR-Cas9 system to generate targeted, site-specific mutations in Ae. aegypti. CRISPR-Cas9 relies on RNA-DNA base-pairing to generate targeting specificity, resulting in cheaper, faster, and more flexible genome-editing reagents. We investigate the efficiency of reagent concentrations and compositions, demonstrate the ability of CRISPR-Cas9 to generate several different types of mutations via disparate repair mechanisms, and show that stable germ-line mutations can be readily generated at the vast majority of genomic loci tested. This work offers a detailed exploration into the optimal use of CRISPR-Cas9 in Ae. aegypti that should be applicable to non-model organisms previously out of reach of genetic modification.

Expansion of the HSFY gene family in pig lineages

Benjamin M Skinner, Kim Lachani, Carole A Sargent, Fengtang Yang, Peter JI Ellis, Toby Hunt, Beiyuan Fu, Sandra Louzada, Carol Churcher, Chris Tyler-Smith, Nabeel A Affara
doi: http://dx.doi.org/10.1101/012906

Amplified gene families on sex chromosomes can harbour genes with important biological functions, especially relating to fertility. The HSFY family has amplified on the Y chromosome of the domestic pig (Sus scrofa), in an apparently independent event to an HSFY expansion on the Y chromosome of cattle (Bos taurus). Although the biological functions of HSFY genes are poorly understood, they appear to be involved in gametogenesis in a number of mammalian species, and, in cattle, HSFY gene copy number correlates with levels of fertility. We have investigated the HSFY family in domestic pigs, and other suid species including warthogs, bushpigs, babirusas and peccaries. The domestic pig contains at least two amplified variants of HSFY, distinguished predominantly by presence or absence of a SINE within the intron. Both these variants are expressed in testis, and both are present in approximately 50 copies each in a single cluster on the short arm of the Y. The longer form has multiple nonsense mutations rendering it likely non-functional, but many of the shorter forms still have coding potential. Other suid species also have these two variants of HSFY, and estimates of copy number suggest the HSFY family may have amplified independently twice during suid evolution. Given the association of HSFY gene copy number with fertility in cattle, HSFY is likely to play an important role in spermatogenesis in pigs also.

Stationary solutions for metapopulation Moran models with mutation and selection

George W. A. Constable, Alan J. McKane
(Submitted on 19 Dec 2014)

We construct an individual-based metapopulation model of population genetics featuring migration, mutation, selection and genetic drift. In the case of a single ‘island’, the model reduces to the Moran model. Using the diffusion approximation and timescale separation arguments, an effective one-variable description of the model is developed. The effective description bears similarities to the well-mixed Moran model with effective parameters which depend on the network structure and island sizes, and is amenable to analysis. Predictions from the reduced theory match the results from stochastic simulations across a range of parameters. The nature of the fast-variable elimination technique we adopt is further studied by applying it to a linear system, where it provides a precise description of the slow dynamics in the limit of large timescale separation.
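
For readers unfamiliar with the single-island case mentioned above, here is a minimal sketch of a two-allele Moran process with selection and mutation. It uses a common textbook formulation (fitness-proportional birth, uniform death, symmetric mutation of the offspring). The paper's exact update rule may differ, and the migration between islands is not reproduced. All names and parameter values are assumptions made for illustration.

```python
import random
from typing import List


def moran_step(n_a: int, n: int, s: float, u: float, rng: random.Random) -> int:
    """One birth-death event in a well-mixed island of fixed size n.

    n_a : current number of type-A individuals
    s   : selection coefficient (A has fitness 1 + s, B has fitness 1)
    u   : probability that the offspring mutates to the other type
    """
    # Choose the parent in proportion to fitness.
    weight_a = (1.0 + s) * n_a
    weight_b = float(n - n_a)
    offspring_is_a = rng.random() < weight_a / (weight_a + weight_b)
    # Mutation flips the offspring's type with probability u.
    if rng.random() < u:
        offspring_is_a = not offspring_is_a
    # The individual that dies is chosen uniformly at random.
    dying_is_a = rng.random() < n_a / n
    return n_a + int(offspring_is_a) - int(dying_is_a)


def simulate(n: int, n_a0: int, s: float, u: float, steps: int, seed: int = 0) -> List[int]:
    """Return the trajectory of type-A counts over the given number of events."""
    rng = random.Random(seed)
    trajectory = [n_a0]
    for _ in range(steps):
        trajectory.append(moran_step(trajectory[-1], n, s, u, rng))
    return trajectory


# Example run with arbitrary illustrative parameters.
trajectory = simulate(n=50, n_a0=25, s=0.05, u=0.01, steps=2000)
```

Because each event replaces exactly one individual, the count changes by at most one per step and the population size stays fixed. With `u = 0` the states 0 and `n` are absorbing.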

The pig X and Y chromosomes: structure, sequence and evolution

Benjamin M Skinner, Carole A Sargent, Carol Churcher, Toby Hunt, Javier Herrero, Jane Loveland, Matt Dunn, Sandra Louzada, Beiyuan Fu, William Chow, James Gilbert, Siobhan Austin-Guest, Kathryn Beal, Denise Carvalho-Silva, William Cheng, Daria Gordon, Darren Grafham, Matt Hardy, Jo Harley, Heidi Hauser, Philip Howden, Kerstin Howe, Kim Lachani, Peter JI Ellis, Daniel Kelly, Giselle Kerry, James Kerwin, Bee Ling Ng, Glen Threadgold, Thomas Wileman, Jonathan MD Wood, Fengtang Yang, Jen Harrow, Nabeel A Affara, Chris Tyler-Smith
doi: http://dx.doi.org/10.1101/012914

We have generated an improved assembly and gene annotation of the pig X chromosome, and a first draft assembly of the pig Y chromosome, by sequencing BAC and fosmid clones, and incorporating information from optical mapping and fibre-FISH. The X chromosome carries 1,014 annotated genes, 689 of which are protein-coding. Gene order closely matches that found in Primates (including humans) and Carnivores (including cats and dogs), which is inferred to be ancestral. Nevertheless, several protein-coding genes present on the human X chromosome were absent from the pig (e.g. the cancer/testis antigen family) or inactive (e.g. AWAT1), and 38 pig-specific X-chromosomal genes were annotated, 22 of which were olfactory receptors. The pig Y chromosome assembly focussed on two clusters of male-specific low-copy number genes, separated by an ampliconic region including the HSFY gene family, which together make up most of the short arm. Both clusters contain palindromes with high sequence identity, presumably maintained by gene conversion. The long arm of the chromosome is almost entirely repetitive, containing previously characterised sequences. Many of the ancestral X-related genes previously reported in at least one mammalian Y chromosome are represented either as active genes or partial sequences. This sequencing project has allowed us to identify genes – both single copy and amplified – on the pig Y, to compare the pig X and Y chromosomes for homologous sequences, and thereby to reveal mechanisms underlying pig X and Y chromosome evolution.

FORGE: A tool to discover cell specific enrichments of GWAS associated SNPs in regulatory regions.

Ian Dunham, Eugene Kulesha, Valentina Iotchkova, Sandro Morganella, Ewan Birney
doi: http://dx.doi.org/10.1101/013045

Genome-wide association studies provide an unbiased discovery mechanism for numerous human diseases. However, a frustration in the analysis of GWAS is that the majority of variants discovered do not directly alter protein-coding genes. We have developed a simple analysis approach that detects the tissue-specific regulatory component of a set of GWAS SNPs by identifying enrichment of overlap with DNase I hotspots from diverse tissue samples. Functional element Overlap analysis of the Results of GWAS Experiments (FORGE) is available as a web tool and as standalone software and provides tabular and graphical summaries of the enrichments. Conducting FORGE analysis on SNP sets for 260 phenotypes available from the GWAS catalogue reveals numerous overlap enrichments with tissue-specific components reflecting the known aetiology of the phenotypes as well as revealing other unforeseen tissue involvements that may lead to mechanistic insights for disease.
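
The general idea of an overlap enrichment test can be illustrated as follows. This is a hedged sketch and not FORGE's actual implementation: it assumes a single chromosome, half-open and non-overlapping hotspot intervals, and a background overlap rate supplied by the caller (how such a rate is obtained is not specified here). The function names and toy coordinates are hypothetical.

```python
from bisect import bisect_right
from math import comb
from typing import List, Sequence, Tuple


def overlaps(position: int, starts: Sequence[int], ends: Sequence[int]) -> bool:
    """True if position lies in one of the sorted, non-overlapping intervals [start, end)."""
    i = bisect_right(starts, position) - 1
    return i >= 0 and position < ends[i]


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def enrichment(
    snp_positions: List[int],
    hotspots: List[Tuple[int, int]],
    background_rate: float,
) -> Tuple[int, float]:
    """Count SNPs inside hotspots and return (hits, one-sided binomial p-value)."""
    hotspots = sorted(hotspots)
    starts = [start for start, _ in hotspots]
    ends = [end for _, end in hotspots]
    hits = sum(overlaps(pos, starts, ends) for pos in snp_positions)
    return hits, binomial_tail(hits, len(snp_positions), background_rate)


# Toy coordinates, invented for illustration.
hotspots = [(100, 200), (500, 600)]
snps = [150, 550, 590, 900]
hits, p_value = enrichment(snps, hotspots, background_rate=0.1)
```

Repeating this calculation once per tissue, each with its own hotspot set, would give the kind of per-tissue enrichment summary the abstract describes.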

Genetic Analysis of Substrain Divergence in NOD Mice

Petr Simecek, Gary A Churchill, Hyuna Yang, Lucy B Rowe, Lieselotte Herberg, David V Serreze, Edward H Leiter
doi: http://dx.doi.org/10.1101/013037

The NOD mouse is a polygenic model for type 1 diabetes that is characterized by insulitis, a leukocytic infiltration of the pancreatic islets. In the ~35 years since the original inbred strain was developed in Japan, NOD substrains have been established at different laboratories around the world. Although environmental differences among NOD colonies capable of impacting diabetes incidence have been recognized, differences arising from genetic divergence have not previously been analyzed. We illustrate the importance of intersubstrain genetic differences by showing a difference in diabetes incidence between two substrains (NOD/ShiLtJ and NOD/Bom) maintained in a common environment. We use both Mouse Diversity Array and Whole Exome Capture Sequencing platforms to identify genetic differences distinguishing 5 NOD substrains. We describe 64 SNPs and 2 short indels that differ in coding regions of the 5 NOD substrains. A 100 kb deletion on Chromosome 3 distinguishes NOD/ShiLtJ and NOD/ShiLtDvs from 3 other substrains, while a 111 kb deletion in the Icam2 gene on Chromosome 11 is unique to the NOD/ShiLtDvs genome. The extent of genetic divergence for NOD substrains is compared to similar studies for C57BL/6 and BALB/c substrains. As mutations are fixed to homozygosity by continued inbreeding, significant differences in substrain phenotypes are to be expected. These results emphasize the importance of using embryo freezing methods to minimize genetic drift within substrains.

Y Chromosome of Aisin Gioro, the Imperial House of Qing Dynasty

Shi Yan, Harumasa Tachibana, Lan-Hai Wei, Ge Yu, Shao-Qing Wen, Chuan-Chao Wang
(Submitted on 19 Dec 2014)

The House of Aisin Gioro was the imperial family of the last dynasty in Chinese history, the Qing Dynasty (1644–1911). The Aisin Gioro family originated from the Jurchen tribes and developed the Manchu people before they conquered China. By investigating the Y chromosomal short tandem repeats (STRs) of 7 modern male individuals who claim to belong to the Aisin Gioro family (3 of whom have full pedigree records), we found that 3 of them (2 of whom have full pedigrees, and whose most recent common ancestor is Nurgaci) show a very close relationship (1–2 steps of difference across 17 STRs) and that the haplotype is rare. We therefore conclude that this haplotype is the Y chromosome of the House of Aisin Gioro. Further tests of single nucleotide polymorphisms (SNPs) indicate that they belong to Haplogroup C3b2b1*-M401(xF5483), although their Y-STR results are distant from the “star cluster”, which also belongs to the same haplogroup. This study forms the basis for pedigree research on the imperial family of the Qing Dynasty by means of genetics.

Using Bayesian multilevel whole-genome regression models for partial pooling of estimation sets in genomic prediction

Frank Technow, L. Radu Totir
doi: http://dx.doi.org/10.1101/012971

Estimation set size is an important determinant of genomic prediction accuracy. Plant breeding programs are characterized by a high degree of structuring, particularly into populations. This hampers establishment of large estimation sets for each population. Pooling populations increases estimation set size but ignores unique genetic characteristics of each. A possible solution is partial pooling with multilevel models, which allows estimating population-specific marker effects while still leveraging information across populations. We developed a Bayesian multilevel whole-genome regression model and compared its performance to that of the popular BayesA model applied to each population separately (no pooling) and to the joined data set (complete pooling). As an example, we analyzed a wide array of traits from the nested association mapping maize population. There we show that for small population sizes (e.g., < 50), partial pooling increased prediction accuracy over no or complete pooling for populations represented in the estimation set. No pooling was superior, however, when populations were large. In another example data set of interconnected biparental maize populations, either partial or complete pooling was superior, depending on the trait. A simulation showed that no pooling is superior when differences in genetic effects among populations are large and partial pooling when they are intermediate. With small differences, partial and complete pooling achieved equally high accuracy. For prediction of new populations, partial and complete pooling had very similar accuracy in all cases. We conclude that partial pooling with multilevel models can maximize the potential of pooling by making optimal use of information in pooled estimation sets.

Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance

Stefany Moreno-Gamez, Alison L Hill, Daniel I.S. Rosenbloom, Dmitri A. Petrov, Martin A Nowak, Pleuni Pennings
doi: http://dx.doi.org/10.1101/013003

Infections with rapidly evolving pathogens are often treated using combinations of drugs with different mechanisms of action. One of the major goals of combination therapy is to reduce the risk of drug resistance emerging during a patient’s treatment. While this strategy generally has significant benefits over monotherapy, it may also select for multi-drug resistant strains, which present an important clinical and public health problem. For many antimicrobial treatment regimes, individual drugs have imperfect penetration throughout the body, so there may be regions where only one drug reaches an effective concentration. Here we propose that mismatched drug coverage can greatly speed up the evolution of multi-drug resistance by allowing mutations to accumulate in a stepwise fashion. We develop a mathematical model of within-host pathogen evolution under spatially heterogeneous drug coverage and demonstrate that even very small single-drug compartments lead to dramatically higher resistance risk. We find that it is often better to use drug combinations with matched penetration profiles, although there may be a trade-off between preventing eventual treatment failure due to resistance in this way, and temporarily reducing pathogen levels systemically. Our results show that drugs with the most extensive distribution are likely to be the most vulnerable to resistance. We conclude that optimal combination treatments should be designed to prevent this spatial effective monotherapy. These results are widely applicable to diverse microbial infections including viruses, bacteria and parasites.