Inference of super-exponential human population growth via efficient computation of the site frequency spectrum for generalized models

Feng Gao, Alon Keinan
doi: http://dx.doi.org/10.1101/022574

The site frequency spectrum (SFS) and other genetic summary statistics are at the heart of many population genetics studies. Previous studies have shown that human populations have undergone a recent epoch of fast growth in effective population size. These studies assumed that growth is exponential, and the ensuing models leave an excess of extremely rare variants unexplained. This suggests that human populations might have experienced recent growth at a speed faster than exponential. Recent studies have introduced a generalized growth model in which the growth speed can be faster or slower than exponential. However, only simulation approaches were available for obtaining summary statistics under such models. In this study, we provide expressions to accurately and efficiently evaluate the SFS and other summary statistics under generalized models, which we further implement in publicly available software. Investigating the power to infer deviation of growth from exponential, we observed that decent sample sizes facilitate accurate inference; e.g., a sample of 3000 individuals with the amount of data expected from exome sequencing allows observing and accurately estimating growth with speed deviating by 10% or more from that of exponential. Applying our inference framework to data from the NHLBI Exome Sequencing Project, we found that a model with a generalized growth epoch fits the observed SFS significantly better than the equivalent model with exponential growth (p-value = 3.85 × 10^-6). The estimated growth speed significantly deviates from exponential (p-value << 10^-12), with the best-fit estimate being a growth speed 12% faster than exponential.
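
For readers unfamiliar with the statistic, the SFS records, for each i, the number of segregating sites at which the derived allele appears i times in a sample of n chromosomes. The minimal Python sketch below tabulates an SFS from per-site derived-allele counts and compares it with the textbook expectation under a constant-size neutral model, θ/i. It does not implement the generalized growth model or the authors' software; the function names and the toy counts are illustrative assumptions.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n):
    """Return [xi_1, ..., xi_{n-1}]: the number of sites whose derived
    allele is carried by exactly i of the n sampled chromosomes."""
    # Sites with count 0 or n are not segregating in the sample, so drop them.
    tally = Counter(c for c in derived_counts if 0 < c < n)
    return [tally.get(i, 0) for i in range(1, n)]


def neutral_expected_sfs(theta, n):
    """Expected SFS under the standard neutral model with constant
    population size: E[xi_i] = theta / i."""
    return [theta / i for i in range(1, n)]


# Toy data (made up for illustration): derived-allele counts at 8 sites
# in a sample of n = 5 chromosomes.
counts = [1, 1, 1, 2, 1, 3, 0, 5]
observed = site_frequency_spectrum(counts, n=5)
expected = neutral_expected_sfs(theta=2.0, n=5)
print(observed)
print(expected)
```

In the toy data the singleton class is larger than θ/i would predict, which is the qualitative pattern (an excess of rare variants) that recent population growth is expected to produce.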

Monoallelic methylation and allele specific expression in a social insect

Kate D Lee, Zoe N Lonsdale, Maria Kyriakidou, Despina Nathanael, Harindra E Amarasinghe, Eamonn B Mallon
doi: http://dx.doi.org/10.1101/022657

Social insects are emerging models for epigenetics. Here we examine the link between monoallelic methylation and monoallelic expression in the bumblebee Bombus terrestris using whole methylome and transcriptome analysis. We found nineteen genes displaying monoallelic methylation and expression. They were enriched for functions related to social organisation in social insects. These are the biological processes predicted to involve imprinting by evolutionary theory.

Investigating the Evolutionary Importance of Denisovan Introgressions in Papua New Guineans and Australians

Ya Hu, Qiliang Ding, Yi Wang, Shuhua Xu, Yungang He, Minxian Wang, Jiucun Wang, Li Jin
doi: http://dx.doi.org/10.1101/022632

Previous research reported that Papua New Guineans (PNG) and Australians carry introgressions from Denisovans. Here we present a genome-wide analysis of Denisovan introgressions in PNG and Australians. We first developed a two-phase method to detect Denisovan introgressions from whole-genome sequencing data. Based on simulations, this method has relatively high detection power (79.74%) and a low false positive rate (2.44%). Using this method, we identified 1.34 Gb of Denisovan introgressions in sixteen PNG and four Australian genomes, in which we identified 38,877 Denisovan introgressive alleles (DIAs). We found that 78 Denisovan introgressions were under positive selection. Genes located in the 78 introgressions are related to evolutionarily important functions, such as spermatogenesis; fertilization; cold acclimation; circadian rhythm; development of the brain, neural tube, face, and olfactory pit; and immunity. We also found that 121 DIAs are missense. Genes harboring the 121 missense DIAs are also related to evolutionarily important functions, such as female pregnancy; development of the face, lung, heart, skin, nervous system, and male gonad; visual and smell perception; response to heat, pain, hypoxia, and UV; lipid transport; metabolism; blood coagulation; wound healing; and aging. Taken together, this study suggests that Denisovan introgressions in PNG and Australians are evolutionarily important and may help PNG and Australians in local adaptation. In this study, we also proposed a method that could efficiently identify archaic hominin introgressions in modern non-African genomes.

The distribution and impact of common copy-number variation in the genome of the domesticated apple, Malus x domestica Borkh.

James Boocock, David Chagné, Tony R Merriman, Mik Black
doi: http://dx.doi.org/10.1101/021857

Background: Copy number variation (CNV) is a common feature of eukaryotic genomes, and a growing body of evidence suggests that genes affected by CNV are enriched in processes that are associated with environmental responses. Here we use next-generation sequencing (NGS) data to detect copy-number variable regions (CNVRs) within the Malus x domestica genome, as well as to examine their distribution and impact. Methods: CNVRs were detected using NGS data derived from 30 accessions of M. x domestica analysed using the read-depth method, as implemented in the CNVrd2 software. To improve the reliability of our results, we developed a quality control and analysis procedure that involved checking for organelle DNA, not repeat masking, and the determination of CNVR identity using a permutation testing procedure. Results: Overall, we identified 876 CNVRs, which spanned 3.5% of the apple genome. To verify that detected CNVRs were not artefacts, we analysed the B-allele frequencies (BAF) within a SNP array dataset derived from a screening of 185 individual apple accessions and found the CNVRs were enriched for SNPs having aberrant BAFs (P < 1e-13, Fisher’s exact test). Putative CNVRs overlapped 845 gene models and were enriched for resistance (R) genes (P < 1e-22, Fisher’s exact test). Of note is a cluster of resistance genes on chromosome 2 near a region containing multiple major gene loci conferring resistance to apple scab. Conclusion: We present the first analysis and catalogue of CNVRs in the M. x domestica genome. The enrichment of the CNVRs with R genes and their overlap with gene loci of agricultural significance draw attention to a form of unexplored genetic variation in apple. This research will underpin further investigation of the role that CNV plays within the apple genome.
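
The enrichment results above rest on Fisher's exact test applied to a 2x2 table. As a rough illustration of what such a test computes, the sketch below evaluates a one-sided enrichment p-value from the hypergeometric distribution using only the Python standard library. The table layout, the one-sided alternative, the function name, and the counts are all assumptions made for illustration; they are not the counts or the exact procedure used in the study.

```python
from math import comb


def fisher_enrichment_pvalue(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table

                 in CNVR   not in CNVR
        R gene      a           b
        other       c           d

    Returns P(X >= a) under the hypergeometric null, i.e. the chance of
    seeing at least `a` R genes among the CNVR-overlapping genes if R genes
    were placed without regard to CNVRs."""
    row1 = a + b            # all R genes
    col1 = a + c            # all genes overlapping a CNVR
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Hypothetical toy counts, chosen only so the arithmetic is easy to check.
p = fisher_enrichment_pvalue(3, 1, 1, 3)
print(p)
```

For the toy table the tail sums to 17 of the 70 equally likely arrangements, so the p-value is 17/70. With gene-scale counts the same function applies unchanged, since `math.comb` works with arbitrarily large integers.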

Accelerating Scientific Publication in Biology

Ronald D Vale
doi: http://dx.doi.org/10.1101/022368

Scientific publications enable results and ideas to be transmitted throughout the scientific community. The number and type of journal publications also have become the primary criteria used in evaluating career advancement. Our analysis suggests that publication practices have changed considerably in the life sciences over the past thirty years. Considerably more experimental data is now required for publication, and the average time required for graduate students to publish their first paper has increased and is approaching the desirable duration of Ph.D. training. Since publication is generally a requirement for career progression, schemes to reduce the time of graduate student and postdoctoral training may be difficult to implement without also considering new mechanisms for accelerating communication of their work. The increasing time to publication also delays potential catalytic effects that ensue when many scientists have access to new information. The time has come for life scientists, funding agencies, and publishers to discuss how to communicate new findings in a way that best serves the interests of the public and the scientific community.

Haplotypes of common SNPs can explain missing heritability of complex diseases

Gaurav Bhatia, Alexander Gusev, Po-Ru Loh, Bjarni J Vilhjálmsson, Stephan Ripke, Shaun Purcell, Eli Stahl, Mark Daly, Teresa R de Candia, Kenneth S Kendler, Michael C O’Donovan, Sang Hong Lee, Naomi R Wray, Benjamin M Neale, Matthew C Keller, Noah A Zaitlen, Bogdan Pasaniuc, Jian Yang, Alkes L Price, Schizophrenia Working Group Psychiatric Genomics C
doi: http://dx.doi.org/10.1101/022418

While genome-wide significant associations generally explain only a small proportion of the narrow-sense heritability of complex disease (h^2), recent work has shown that more heritability is explained by all genotyped SNPs (h_g^2). However, much of the heritability is still missing (h_g^2 [...] 0.1% explained substantially more phenotypic variance (h_hap^2 = 0.64 (S.E. 0.084)) than genotyped SNPs alone (h_g^2 = 0.32 (S.E. 0.029)). These estimates were based on cross-cohort comparisons, ensuring that cohort-specific assay artifacts did not contribute to our estimates. In a large multiple sclerosis data set (WTCCC2-MS), we observed an even larger difference between h_hap^2 and h_g^2, though data from other cohorts will be required to validate this result. Overall, our results suggest that haplotypes of common SNPs can explain a large fraction of missing heritability of complex disease, shedding light on genetic architecture and informing disease mapping strategies.

Coevolution of male and female reproductive traits drive cascading reinforcement in Drosophila yakuba

Aaron A Comeault, Aarti Venkat, Daniel R Matute
doi: http://dx.doi.org/10.1101/022244

When the ranges of two hybridizing species overlap, individuals may ‘waste’ gametes on inviable or infertile hybrids. In these cases, selection against maladaptive hybridization can lead to the evolution of enhanced reproductive isolation in a process called reinforcement. On the slopes of the African island of São Tomé, Drosophila yakuba and its endemic sister species D. santomea have a well-defined hybrid zone. Drosophila yakuba females from within this zone show increased postmating-prezygotic isolation towards D. santomea males when compared with D. yakuba females from allopatric populations. To understand why reinforced gametic isolation is confined to areas of secondary contact and has not spread throughout the entire D. yakuba geographic range, we studied the costs of reinforcement in D. yakuba using a combination of natural collections and experimental evolution. We found that D. yakuba males from sympatric populations sire fewer progeny than allopatric males when mated to allopatric D. yakuba females. Our results suggest that the correlated evolution of male and female reproductive traits in sympatric D. yakuba has associated costs (i.e., reduced male fertility) that prevent the alleles responsible for enhanced isolation from spreading outside the hybrid zone.

Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation

Minliang Jin, Haijun Liu, Cheng He, Junjie Fu, Yingjie Xiao, Yuebin Wang, Weibo Xie, Guoying Wang, Jianbing Yan
doi: http://dx.doi.org/10.1101/022384

Variation in gene expression contributes to the diversity of phenotype. The construction of the pan-transcriptome is especially necessary for species with complex genomes, such as maize. However, knowledge of the regulation mechanisms and functional consequences of the pan-transcriptome is limited. In this study, we identified 13,382 nuclear expression presence and absence variation candidates (ePAVs, expressed in 5%-95% of lines; based on the reference genome) by re-analyzing the RNA sequencing data from the kernels (15 days after pollination) of 368 diverse maize inbreds. It was estimated that only ~1% of the ePAVs are explained by DNA sequence presence and absence variations (PAV). The ePAV genes tend to be regulated by distant eQTLs when compared with non-ePAV genes (called here core expression genes, expressed in more than 95% of lines). When the expression presence/absence status was used as the “genotype” to perform a genome-wide association study, 56 (0.42%) ePAVs were significantly associated with 15 agronomic traits and 1,967 (14.74%) with 526 metabolic traits, measured from the mature kernels. While the above was mainly based on the reference genome, by using a modified “assemble-then-align” strategy, 2,355 high-confidence novel sequences with a total length of 1.9 Mb were found to be absent from the current B73 reference genome (v2). Ten randomly selected novel sequences were validated with genomic PCR. A simulation analysis suggested that the pan-transcriptome of the maize whole kernel is approaching a maximum value of 63,000 genes. Two validated novel sequences annotated as NBS_LRR-like genes were found to be associated with flavonoid content, and their homologs in rice were also found to affect flavonoids and disease resistance. Novel sequences absent from the present reference genome might be functionally important and deserve more attention. This study provides novel perspectives and resources to discover maize quantitative trait variations and helps us to better understand the kernel regulation networks, thus enhancing maize breeding.

Genoogle: an indexed and parallelized search engine for similar DNA sequences

Felipe Albrecht
(Submitted on 10 Jul 2015)

The search for similar genetic sequences is one of the main bioinformatics tasks. Genetic sequence databases are growing exponentially, and search techniques that run in linear time can no longer complete a search in the required time. Another problem is that the clock speed of modern processors is no longer growing as it did before; instead, processing capacity is growing through the addition of more processing cores, and techniques that do not use parallel computing do not benefit from these extra cores. This work aims to use data indexing techniques to reduce the computational cost of searching, combined with parallelization of the search techniques to exploit the computational capacity of multi-core processors. To verify the viability of using these two techniques simultaneously, software that uses parallelization techniques with inverted indexes was developed.
Experiments were executed to analyze the performance gain when parallelism is used, the search time gain, and the quality of the results compared with other search tools. The results of these experiments were promising: the parallelism gain exceeded the expected speedup, the search was 20 times faster than the parallelized NCBI BLAST, and the search results showed good quality when compared with this tool.
The software source code is available at this https URL .
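
To make the combination of inverted indexes and parallel lookup described above more concrete, here is a minimal Python sketch. It is not Genoogle's implementation: the function names, the k-mer length, the scoring by seed count, and the toy sequences are all assumptions chosen for illustration. The index maps each k-mer to the positions where it occurs, so a query only touches the database entries that share a seed with it instead of scanning every sequence.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(sequences, k):
    """Inverted index: map every k-mer to the (sequence id, offset) pairs
    where it occurs in the database."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def seed_hits(index, query, k):
    """Count, per database sequence, the number of k-mer seed matches
    between the query and that sequence."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[pos:pos + k], ()):
            hits[seq_id] += 1
    return dict(hits)


def search_many(index, queries, k, workers=4):
    """Look up several queries concurrently against the shared,
    read-only index; results come back in query order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: seed_hits(index, q, k), queries))


# Toy database and queries (made up for illustration).
database = {"s1": "ACGTACGT", "s2": "TTTTACGA"}
index = build_index(database, k=4)
results = search_many(index, ["TACGT", "GGGG"], k=4)
print(results)
```

The thread pool here only illustrates how independent queries can be dispatched against one shared index. In CPython, CPU-bound lookups would need processes or a runtime without a global interpreter lock to see a real multi-core speedup, and a full search tool would also extend and score the seed matches rather than just counting them.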
