Haplotypes of common SNPs can explain missing heritability of complex diseases

Gaurav Bhatia, Alexander Gusev, Po-Ru Loh, Bjarni J Vilhjálmsson, Stephan Ripke, Shaun Purcell, Eli Stahl, Mark Daly, Teresa R de Candia, Kenneth S Kendler, Michael C O’Donovan, Sang Hong Lee, Naomi R Wray, Benjamin M Neale, Matthew C Keller, Noah A Zaitlen, Bogdan Pasaniuc, Jian Yang, Alkes L Price, Schizophrenia Working Group Psychiatric Genomics C
doi: http://dx.doi.org/10.1101/022418

While genome-wide significant associations generally explain only a small proportion of the narrow-sense heritability of complex disease (h2), recent work has shown that more heritability is explained by all genotyped SNPs (hg2). However, much of the heritability is still missing (hg2 0.1% explained substantially more phenotypic variance (hhap2 = 0.64 (S.E. 0.084)) than genotyped SNPs alone (hg2 = 0.32 (S.E. 0.029)). These estimates were based on cross-cohort comparisons, ensuring that cohort-specific assay artifacts did not contribute to our estimates. In a large multiple sclerosis data set (WTCCC2-MS), we observed an even larger difference between hhap2 and hg2, though data from other cohorts will be required to validate this result. Overall, our results suggest that haplotypes of common SNPs can explain a large fraction of missing heritability of complex disease, shedding light on genetic architecture and informing disease mapping strategies.

Coevolution of male and female reproductive traits drive cascading reinforcement in Drosophila yakuba

Aaron A Comeault, Aarti Venkat, Daniel R Matute
doi: http://dx.doi.org/10.1101/022244

When the ranges of two hybridizing species overlap, individuals may ‘waste’ gametes on inviable or infertile hybrids. In these cases, selection against maladaptive hybridization can lead to the evolution of enhanced reproductive isolation in a process called reinforcement. On the slopes of the African island of São Tomé, Drosophila yakuba and its endemic sister species D. santomea have a well-defined hybrid zone. Drosophila yakuba females from within this zone show increased postmating-prezygotic isolation towards D. santomea males when compared with D. yakuba females from allopatric populations. To understand why reinforced gametic isolation is confined to areas of secondary contact and has not spread throughout the entire D. yakuba geographic range, we studied the costs of reinforcement in D. yakuba using a combination of natural collections and experimental evolution. We found that D. yakuba males from sympatric populations sire fewer progeny than allopatric males when mated to allopatric D. yakuba females. Our results suggest that the correlated evolution of male and female reproductive traits in sympatric D. yakuba has associated costs (i.e., reduced male fertility) that prevent the alleles responsible for enhanced isolation from spreading outside the hybrid zone.

Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation

Minliang Jin, Haijun Liu, Cheng He, Junjie Fu, Yingjie Xiao, Yuebin Wang, Weibo Xie, Guoying Wang, Jianbing Yan
doi: http://dx.doi.org/10.1101/022384

Variation in gene expression contributes to the diversity of phenotype. The construction of the pan-transcriptome is especially necessary for species with complex genomes, such as maize. However, knowledge of the regulation mechanisms and functional consequences of the pan-transcriptome is limited. In this study, we identified 13,382 nuclear expression presence and absence variation candidates (ePAVs, expressed in 5% to 95% of lines; based on the reference genome) by re-analyzing the RNA sequencing data from the kernels (15 days after pollination) of 368 diverse maize inbreds. It was estimated that only ~1% of the ePAVs are explained by DNA sequence presence and absence variations (PAV). The ePAV genes tend to be regulated by distant eQTLs when compared with non-ePAV genes (called here core expression genes, expressed in more than 95% of lines). When the expression presence/absence status was used as the “genotype” to perform a genome-wide association study, 56 (0.42%) ePAVs were significantly associated with 15 agronomic traits and 1,967 (14.74%) with 526 metabolic traits, measured from the mature kernels. While the above was mainly based on the reference genome, by using a modified “assemble-then-align” strategy, 2,355 high-confidence novel sequences with a total length of 1.9 Mb were found to be absent from the current B73 reference genome (v2). Ten randomly selected novel sequences were validated with genomic PCR. A simulation analysis suggested that the pan-transcriptome of the maize whole kernel is approaching a maximum value of 63,000 genes. Two novel validated sequences annotated as NBS_LRR-like genes were found to associate with flavonoid content, and their homologs in rice were also found to affect flavonoids and disease resistance. Novel sequences absent from the present reference genome might be functionally important and deserve more attention.
This study provides novel perspectives and resources to discover maize quantitative trait variations and help us to better understand the kernel regulation networks, thus enhancing maize breeding.
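
The ePAV and core definitions in the abstract above amount to a threshold on the fraction of lines in which a gene is expressed. The following is a minimal sketch of that rule, not the authors' pipeline. The function name `classify_genes`, the expression cutoff `min_expr`, and the label "rare" for genes expressed in fewer than 5% of lines are all assumptions made for illustration; the abstract does not say how "expressed" was called or how such genes were handled.

```python
def classify_genes(expression, min_expr=1.0, low=0.05, high=0.95):
    """Label each gene by the fraction of lines in which it is expressed.

    expression: dict mapping gene name -> list of expression values,
                one value per inbred line.
    min_expr:   hypothetical cutoff above which a gene counts as expressed.
    """
    labels = {}
    for gene, values in expression.items():
        # Fraction of lines in which the gene passes the expression cutoff.
        frac = sum(v >= min_expr for v in values) / len(values)
        if frac > high:
            labels[gene] = "core"   # expressed in more than 95% of lines
        elif frac >= low:
            labels[gene] = "ePAV"   # expressed in 5% to 95% of lines
        else:
            labels[gene] = "rare"   # below 5%; label is an assumption
    return labels


if __name__ == "__main__":
    toy = {
        "geneA": [5.0] * 20,              # expressed everywhere
        "geneB": [5.0] * 10 + [0.0] * 10, # expressed in half of the lines
        "geneC": [0.0] * 20,              # never expressed
    }
    print(classify_genes(toy))
```

The resulting presence/absence labels per line are what the abstract describes using as the "genotype" in the association study.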

Genoogle: an indexed and parallelized search engine for similar DNA sequences

Felipe Albrecht
(Submitted on 10 Jul 2015)

The search for similar genetic sequences is one of the main tasks in bioinformatics. Genetic sequence databases are growing exponentially, and search techniques that run in linear time can no longer complete a search in the required time. Another problem is that the clock speed of modern processors is no longer growing as it did before; instead, processing capacity is growing through the addition of more processing cores, and techniques that do not use parallel computing do not benefit from these extra cores. This work aims to use data indexing techniques to reduce the computational cost of the search process, combined with parallelization of the search techniques to use the computational capacity of multi-core processors. To verify the viability of using these two techniques simultaneously, software that uses parallelization techniques with inverted indexes was developed.
Experiments were executed to analyze the performance gain when parallelism is used, the search time gain, and the quality of the results compared with other search tools. The results of these experiments were promising: the parallelism gain exceeded the expected speedup, the search was 20 times faster than the parallelized NCBI BLAST, and the search results showed good quality when compared with this tool.
The software source code is available at this https URL .
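
To illustrate the two ideas the abstract combines, here is a minimal sketch of an inverted k-mer index with queries dispatched to a worker pool. It is not Genoogle's code or API: the function names, the choice of fixed-length k-mers as index keys, and scoring by shared k-mer count are assumptions for illustration. Python threads are also limited by the interpreter lock for CPU-bound work, so this shows the structure of the approach rather than its speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(sequences, k):
    """Inverted index: map each k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def search(index, query, k):
    """Count k-mer hits per database sequence without scanning the sequences themselves."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for seq_id, _offset in index.get(query[pos:pos + k], ()):
            hits[seq_id] += 1
    # Best-scoring sequences first; ties broken by sequence id.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


def search_many(index, queries, k, workers=4):
    """Run independent queries on a pool of workers; results keep the query order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: search(index, q, k), queries))


if __name__ == "__main__":
    database = {"s1": "ACGTACGT", "s2": "TTTTGGGG"}
    idx = build_index(database, k=4)
    print(search_many(idx, ["ACGTA", "TTTT"], k=4))
```

A real tool would extend each k-mer hit into a scored alignment; the index lookup only replaces the linear scan that finds candidate regions.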

Joint estimation of contamination, error and demography for nuclear DNA from ancient humans

Fernando Racimo, Gabriel Renaud, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/022285

When sequencing an ancient DNA sample from a hominin fossil, DNA from present-day humans involved in excavation and extraction will be sequenced along with the endogenous material. This type of contamination is problematic for downstream analyses as it will introduce a bias towards the population to which the contaminating individuals belong. Quantifying the extent of contamination is a crucial step as it allows researchers to account for possible biases that may arise in downstream genetic analyses. Here, we present an MCMC algorithm to co-estimate the contamination rate, sequencing error rate and demographic parameters – including drift times and admixture rates – for an ancient nuclear genome obtained from human remains, when the putative contaminating DNA comes from present-day humans. We assume we have a large panel representing the putative contaminating population (e.g. European, East Asian or African). The method is implemented in a C++ program called ‘Demographic Inference with Contamination and Error’ (DICE). The program can also be used to determine the most likely population to which the contaminant DNA belongs. We applied it to simulations and Neanderthal genome data, and we recover accurate estimates of all parameters, even when the average sequencing coverage is low (0.5X) and the per-read contamination rate is high (25%).
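
To make the roles of the contamination and error rates concrete, here is a deliberately simplified sketch of a read-level mixture likelihood. It is not the DICE model: it assumes the ancient genotype at each site is known, it ignores demography entirely, and it replaces the MCMC described above with a grid search over the contamination rate. All function and parameter names are hypothetical.

```python
import math


def derived_read_prob(genotype, contam_freq, contam_rate, error_rate):
    """Probability that a single read shows the derived allele.

    genotype:    number of derived copies (0, 1 or 2) in the ancient individual,
                 assumed known here purely for illustration.
    contam_freq: derived allele frequency in the contaminant panel.
    """
    endogenous = genotype / 2.0
    # A read comes from the contaminant with probability contam_rate.
    p = (1.0 - contam_rate) * endogenous + contam_rate * contam_freq
    # Symmetric sequencing error flips the observed allele.
    return p * (1.0 - error_rate) + (1.0 - p) * error_rate


def log_likelihood(sites, contam_rate, error_rate):
    """Binomial log-likelihood (up to a constant) summed over sites.

    sites: iterable of (genotype, contam_freq, derived_reads, depth).
    """
    total = 0.0
    for genotype, contam_freq, derived, depth in sites:
        q = derived_read_prob(genotype, contam_freq, contam_rate, error_rate)
        q = min(max(q, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
        total += derived * math.log(q) + (depth - derived) * math.log(1.0 - q)
    return total


def grid_mle_contamination(sites, error_rate, steps=100):
    """Grid search over the contamination rate (a stand-in for MCMC)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda c: log_likelihood(sites, c, error_rate))


if __name__ == "__main__":
    # One toy site: ancient genotype ancestral, contaminant panel fixed for derived.
    toy_sites = [(0, 1.0, 25, 100)]
    print(grid_mle_contamination(toy_sites, error_rate=0.0))
```

In this toy setting every derived read must come from contamination, so the estimate simply tracks the fraction of derived reads. The method in the abstract estimates the unknown quantities jointly instead of fixing them.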

Automated and accurate estimation of gene family abundance from shotgun metagenomes

Stephen Nayfach, Patrick H. Bradley, Stacia K. Wyman, Timothy J. Laurent, Alex Williams, Jonathan A. Eisen, Katherine S. Pollard, Thomas J. Sharpton
doi: http://dx.doi.org/10.1101/022335

Shotgun metagenomic DNA sequencing is a widely applicable tool for characterizing the functions that are encoded by microbial communities. Several bioinformatic tools can be used to functionally annotate metagenomes, allowing researchers to draw inferences about the functional potential of the community and to identify putative functional biomarkers. However, little is known about how decisions made during annotation affect the reliability of the results. Here, we use statistical simulations to rigorously assess how to optimize annotation accuracy and speed, given parameters of the input data like read length and library size. We identify best practices in metagenome annotation and use them to guide the development of the Shotgun Metagenome Annotation Pipeline (ShotMAP). ShotMAP is an analytically flexible, end-to-end annotation pipeline that can be implemented either on a local computer or a cloud compute cluster. We use ShotMAP to assess how different annotation databases impact the interpretation of how marine metagenome and metatranscriptome functional capacity changes across seasons. We also apply ShotMAP to data obtained from a clinical microbiome investigation of inflammatory bowel disease. This analysis finds that gut microbiota collected from Crohn’s disease patients are functionally distinct from gut microbiota collected from either ulcerative colitis patients or healthy controls, with differential abundance of metabolic pathways related to host-microbiome interactions that may serve as putative biomarkers of disease.

Protein binding and methylation on looping chromatin accurately predict distal regulatory interactions

Sean Whalen, Rebecca M. Truty, Katherine S. Pollard
doi: http://dx.doi.org/10.1101/022293

Identifying the gene targets of distal regulatory sequences is a challenging problem with the potential to illuminate the causal underpinnings of complex diseases. However, current experimental methods to map enhancer-promoter interactions genome-wide are limited by their cost and complexity. We present TargetFinder, a computational method that reconstructs a cell’s three-dimensional regulatory landscape from two-dimensional genomic features. TargetFinder achieves outstanding predictive accuracy across diverse cell lines with a false discovery rate up to fifteen times smaller than common heuristics, and reveals that distal regulatory interactions are characterized by distinct signatures of protein interactions and epigenetic marks on the DNA loop between an active enhancer and targeted promoter. Much of this signature is shared across cell types, shedding light on the role of chromatin organization in gene regulation and establishing TargetFinder as a method to accurately map long-range regulatory interactions using a small number of easily acquired datasets.

Coalescent models for developmental biology and the spatio-temporal dynamics of growing tissues.

Patrick Smadbeck, Michael P.H. Stumpf
doi: http://dx.doi.org/10.1101/022251

Development is a process that needs to be tightly coordinated in both space and time. Cell tracking and lineage tracing have become important experimental techniques in developmental biology and allow us to map the fate of cells and their progeny in both space and time. A generic feature of developing (as well as homeostatic) tissues that these analyses have revealed is that relatively few cells give rise to the bulk of the cells in a tissue; the lineages of most cells come to an end fairly quickly. This has also spurred the interest of computational and theoretical biologists and physicists, who have developed a range of modelling approaches, perhaps most notably agent-based modelling (ABM). These can become computationally prohibitively expensive but seem to capture some of the features observed in experiments. Here we develop a complementary perspective that allows us to understand the dynamics leading to the formation of a tissue (or colony of cells). Borrowing from the rich population genetics literature, we develop genealogical models of tissue development that trace the ancestry of cells in a tissue back to their most recent common ancestors. We apply this approach to tissues that grow under confined conditions (as would, for example, be appropriate for the neural crest) and to unbounded growth (illustrative of the behaviour of 2D tumours or bacterial colonies). The classical coalescent model from population genetics is readily adapted to capture tissue genealogies for different models of tissue growth and development. We show that simple but universal scaling relationships allow us to establish relationships between the coalescent and different fractal growth models that have been extensively studied in many different contexts, including developmental biology. Using our genealogical perspective, we are able to study the statistical properties of the processes that give rise to tissues of cells, without the need for large-scale simulations.
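
As background for the genealogical view described above, the following is a minimal sketch of the classical (Kingman) coalescent for a constant-size population, which is the starting point the abstract says is adapted, not the tissue-growth models themselves. Time is measured in units where each pair of lineages coalesces at rate 1; the function names are illustrative.

```python
import random


def simulate_tmrca(n, rng):
    """Time to the most recent common ancestor of n sampled lineages.

    While k lineages remain, the waiting time to the next coalescence is
    exponential with rate k * (k - 1) / 2 (one unit of rate per pair).
    """
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        k -= 1  # two lineages merge into one
    return t


def expected_tmrca(n):
    """Closed form: sum over k = 2..n of 2 / (k * (k - 1)) = 2 * (1 - 1 / n)."""
    return 2.0 * (1.0 - 1.0 / n)


if __name__ == "__main__":
    rng = random.Random(0)
    sims = [simulate_tmrca(10, rng) for _ in range(20000)]
    print(sum(sims) / len(sims), expected_tmrca(10))
```

The closed-form mean is an example of the kind of statistical property that can be read off a genealogical model without simulating every cell forward in time.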

Whole genome sequence analyses of Western Central African Pygmy hunter-gatherers reveal a complex demographic history and identify candidate genes under positive natural selection

PingHsun Hsieh, Krishna R Veeramah, Joseph Lachance, Sarah A Tishkoff, Jeffrey D Wall, Michael F Hammer, Ryan N Gutenkunst
doi: http://dx.doi.org/10.1101/022194

African Pygmies practicing a mobile hunter-gatherer lifestyle are phenotypically and genetically diverged from other anatomically modern humans, and they likely experienced strong selective pressures due to their unique lifestyle in the Central African rainforest. To identify genomic targets of adaptation, we sequenced the genomes of four Biaka Pygmies from the Central African Republic and jointly analyzed these data with the genome sequences of three Baka Pygmies from Cameroon and nine Yoruba farmers. To account for the complex demographic history of these populations that includes both isolation and gene flow, we fit models using the joint allele frequency spectrum and validated them using independent approaches. Our two best-fit models both suggest ancient divergence between the ancestors of the farmers and Pygmies, 90,000 or 150,000 years ago. We also find that bi-directional asymmetric gene flow is statistically better supported than a single pulse of unidirectional gene flow from farmers to Pygmies, as previously suggested. We then applied complementary statistics to scan the genome for evidence of selective sweeps and polygenic selection. We found that conventional statistical outlier approaches were biased toward identifying candidates in regions of high mutation or low recombination rate. To avoid this bias, we assigned P-values for candidates using whole-genome simulations incorporating demography and variation in both recombination and mutation rates. We found that genes and gene sets involved in muscle development, bone synthesis, immunity, reproduction, cell signaling and development, and energy metabolism are likely to be targets of positive natural selection in Western African Pygmies or their recent ancestors.
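
The joint allele frequency spectrum used for model fitting above is a two-dimensional table of site counts. Here is a minimal sketch of how such a table can be tabulated from per-site derived allele counts in two populations. The function name, the input layout, and the toy sample sizes are assumptions for illustration; the demographic model fitting itself is not shown.

```python
def joint_sfs(counts_a, counts_b, n_a, n_b):
    """Tabulate the joint allele frequency spectrum of two populations.

    counts_a, counts_b: per-site derived allele counts in populations A and B
                        (same site order in both lists).
    n_a, n_b:           number of sampled chromosomes in each population.
    Returns an (n_a + 1) x (n_b + 1) table where entry [i][j] is the number of
    sites with i derived copies in A and j derived copies in B.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both populations must cover the same sites")
    sfs = [[0] * (n_b + 1) for _ in range(n_a + 1)]
    for i, j in zip(counts_a, counts_b):
        sfs[i][j] += 1
    return sfs


if __name__ == "__main__":
    # Toy data: 5 sites, 4 chromosomes sampled in A and 2 in B.
    table = joint_sfs([1, 1, 0, 4, 2], [0, 0, 2, 2, 1], n_a=4, n_b=2)
    for row in table:
        print(row)
```

Demographic models are then compared by how well the spectrum they predict matches a table like this one.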