Tanglegrams: a reduction tool for mathematical phylogenetics

Frederick A Matsen IV, Sara Billey, Arnold Kas, Matjaž Konvalinka
(Submitted on 16 Jul 2015)

Many discrete mathematics problems in phylogenetics are defined in terms of the relative labeling of pairs of leaf-labeled trees. These relative labelings are naturally formalized as tanglegrams, which have previously been an object of study in coevolutionary analysis. Although there has been considerable work on planar drawings of tanglegrams, they have not been fully explored as combinatorial objects until recently. In this paper, we describe how many discrete mathematical questions on trees “factor” through a problem on tanglegrams, and how understanding that factoring can simplify analysis. Depending on the problem, it may be useful to consider an unordered version of tanglegrams, and/or their unrooted counterparts. For all of these definitions, we show how the isomorphism types of tanglegrams can be understood in terms of double cosets of the symmetric group, and we investigate their automorphisms. Understanding tanglegrams better will isolate the distinct problems on leaf-labeled pairs of trees and reveal natural symmetries of spaces associated with such problems.
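
To make the idea of an isomorphism type concrete, here is a minimal brute-force sketch. It assumes rooted binary trees and treats a tanglegram as a pair of trees on the same leaf labels, with two pairs considered the same when one simultaneous relabeling of the leaves turns one into the other. It does not use the double-coset description from the paper, and the function names are illustrative only.

```python
from itertools import permutations

def labeled_trees(leaves):
    """Yield every rooted binary tree on the given leaf labels (nested 2-tuples)."""
    leaves = tuple(leaves)
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    # Keep `first` on the left side so each unordered split is produced once;
    # the all-ones mask is skipped because it would leave the right side empty.
    for mask in range(2 ** len(rest) - 1):
        left = (first,) + tuple(x for i, x in enumerate(rest) if (mask >> i) & 1)
        right = tuple(x for i, x in enumerate(rest) if not (mask >> i) & 1)
        for l in labeled_trees(left):
            for r in labeled_trees(right):
                yield (l, r)

def canon(tree, relabel):
    """String form of a tree after relabeling its leaves, ignoring child order."""
    if isinstance(tree, int):
        return str(relabel[tree])
    return "(" + ",".join(sorted(canon(child, relabel) for child in tree)) + ")"

def canonical_tanglegram(left, right, n):
    """Smallest (left, right) encoding over all simultaneous leaf relabelings."""
    return min((canon(left, p), canon(right, p)) for p in permutations(range(n)))

def count_tanglegrams(n):
    """Count pairs of labeled trees on n leaves up to simultaneous relabeling."""
    trees = list(labeled_trees(range(n)))
    return len({canonical_tanglegram(l, r, n) for l in trees for r in trees})
```

The search visits every pair of labeled trees and every permutation of the leaves, so it is only practical for a handful of leaves. That cost is one reason a group-theoretic description of the isomorphism types is useful.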

Adaptive variation in human toll-like receptors is contributed by introgression from both Neandertals and Denisovans

Michael Dannemann, Aida M. Andrés, Janet Kelso
doi: http://dx.doi.org/10.1101/022699

Pathogens and the diseases they cause have been among the most important selective forces experienced by humans during their evolutionary history. Although adaptive alleles generally arise by mutation, introgression can also be a valuable source of beneficial alleles. Archaic humans, who lived in Europe and Western Asia for over 200,000 years, were likely well-adapted to the environment and its local pathogens, and it is therefore conceivable that modern humans entering Europe and Western Asia who admixed with them obtained a substantial immune advantage from the introgression of archaic alleles. Here we document a cluster of three toll-like receptors (TLR6-TLR1-TLR10) in modern humans that carries three distinct archaic haplotypes, indicating repeated introgression from archaic humans. Two of these haplotypes are most similar to the Neandertal genome, while the third haplotype is most similar to the Denisovan genome. The toll-like receptors are key components of innate immunity and provide an important first line of immune defense against bacteria, fungi and parasites. The unusually high allele frequencies and unexpected levels of population differentiation indicate that there has been local positive selection on multiple haplotypes at this locus. We show that the introgressed alleles have clear functional effects in modern humans; archaic-like alleles underlie differences in the expression of the TLR genes and are associated with reduced microbial resistance and increased allergic disease in large cohorts. This provides strong evidence for recurrent adaptive introgression at the TLR6-TLR1-TLR10 locus, resulting in differences in disease phenotypes in modern humans.

Improving the Efficiency of Genomic Selection in Chinese Simmental beef cattle

Jiangwei Xia, Yang Wu, Huizhong Fang, Wengang Zhang, Yuxin Song, Lupei Zhang, Xue Gao, Yan Chen, Junya Li, Huijiang Gao
doi: http://dx.doi.org/10.1101/022673

Genomic selection is an accurate and efficient method of estimating genetic merit using high-density genome-wide single nucleotide polymorphisms (SNPs). In this study, we investigate an approach to increase the efficiency of genomic prediction using genome-wide markers. The approach is a feature selection based on genomic best linear unbiased prediction (GBLUP), a statistical method used to predict breeding values from SNPs for selection in animal and plant breeding. The objective of this study is the choice of kinship matrix for GBLUP. The G-matrix uses the information from genome-wide dense markers. We compare three kinds of kinship based on different combinations of centring and scaling of marker genotypes, and find a suitable kinship approach that adjusts for the resource population of Chinese Simmental beef cattle. SNPs can be used to estimate the kinship matrix and individual inbreeding coefficients more accurately. In our research, a genomic relationship matrix was therefore developed for 1059 Chinese Simmental beef cattle using 640000 SNPs, and breeding values were estimated using phenotypes for carcass weight and sirloin weight. The number of SNPs needed to accurately estimate a genomic relationship matrix was evaluated in this population. Another aim of this study was to optimize the selection of markers and determine the required number of SNPs for estimation of kinship in Chinese Simmental beef cattle. We find that feature selection with GBLUP using Xu’s and Astle and Balding’s kinship models performed similarly well, and these were the best-performing methods in our study. Inbreeding and the kinship matrix can be estimated with high accuracy using ≥12,000 SNPs in Chinese Simmental beef cattle.
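
As a rough illustration of what "centring and scaling of marker genotypes" can mean, the sketch below builds a marker-based relationship matrix two ways from genotypes coded 0/1/2. Both formulas are commonly used forms and are an assumption here; the exact definitions behind the three kinships compared by the authors are not given in the abstract, and the function names are hypothetical.

```python
def allele_frequencies(genotypes):
    """Sample frequency of the counted allele at each marker (genotypes coded 0/1/2)."""
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(row[k] for row in genotypes) / (2 * n) for k in range(m)]

def kinship(genotypes, scale_per_marker):
    """Relationship matrix from centred genotypes.

    scale_per_marker=False: centre only, divide by the summed variance 2p(1-p).
    scale_per_marker=True: centre, divide each marker by its own 2p(1-p), then average.
    Monomorphic markers carry no information and are dropped.
    """
    n = len(genotypes)
    freqs = allele_frequencies(genotypes)
    markers = [(k, p) for k, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not markers:
        raise ValueError("no polymorphic markers")
    total_var = sum(2 * p * (1 - p) for _, p in markers)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = 0.0
            for k, p in markers:
                prod = (genotypes[i][k] - 2 * p) * (genotypes[j][k] - 2 * p)
                s += prod / (2 * p * (1 - p)) if scale_per_marker else prod
            value = s / len(markers) if scale_per_marker else s / total_var
            G[i][j] = G[j][i] = value
    return G
```

Per-marker scaling gives rare alleles more weight than the centre-only version, which is one way the two choices can lead to different predictions.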

metaCCA: Summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis

Anna Cichonska, Juho Rousu, Pekka Marttinen, Antti J Kangas, Pasi Soininen, Terho Lehtimäki, Olli T Raitakari, Marjo-Riitta Järvelin, Veikko Salomaa, Mika Ala-Korpela, Samuli Ripatti, Matti Pirinen
doi: http://dx.doi.org/10.1101/022665

A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analysing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies.
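
A small worked case may help show why joint analysis of traits can add signal. For a single SNP, the leading canonical correlation with a set of traits equals the multiple correlation, which needs only the SNP-trait correlations and the trait-trait correlation. The sketch below handles two traits and assumes those correlations are already known; how metaCCA derives them from summary statistics, and its shrinkage step, are not shown. The function name is hypothetical.

```python
import math

def canonical_correlation_single_snp(r_snp_traits, r_traits):
    """Canonical correlation between one SNP and two traits.

    r_snp_traits: (r1, r2), correlation of the SNP with each trait.
    r_traits: correlation between the two traits.
    Computes sqrt(c' R^-1 c) with c = (r1, r2) and R = [[1, r], [r, 1]].
    """
    r1, r2 = r_snp_traits
    det = 1.0 - r_traits ** 2
    if det <= 0.0:
        raise ValueError("traits are perfectly correlated; R is singular")
    r_squared = (r1 * r1 - 2.0 * r_traits * r1 * r2 + r2 * r2) / det
    return math.sqrt(r_squared)
```

With this formula, a second trait that is merely a noisy copy of the first (r2 equal to r_traits times r1) leaves the result at |r1|, while a second trait whose association departs from that expectation raises it above the univariate value.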

Iron Age and Anglo-Saxon genomes from East England reveal British migration history

Stephan Schiffels, Wolfgang Haak, Pirita Paajanen, Bastien Llamas, Elizabeth Popescu, Louise Lou, Rachel Clarke, Alice Lyons, Richard Mortimer, Duncan Sayer, Chris Tyler-Smith, Alan Cooper, Richard Durbin
doi: http://dx.doi.org/10.1101/022723

British population history has been shaped by a series of immigrations and internal movements, including the early Anglo-Saxon migrations following the breakdown of the Roman administration after 410 CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences generated from ten ancient individuals found in archaeological excavations close to Cambridge in the East of England, ranging from 2,300 to 1,200 years before present (Iron Age to Anglo-Saxon period). We use present-day genetic data to characterize the relationship of these ancient individuals to contemporary British and other European populations. By analyzing the distribution of shared rare variants across ancient and modern individuals, we find that today’s British are more similar to the Iron Age individuals than to most of the Anglo-Saxon individuals, and estimate that the contemporary East English population derives 30% of its ancestry from Anglo-Saxon migrations, with a lower fraction in Wales and Scotland. We gain further insight with a new method, rarecoal, which fits a demographic model to the distribution of shared rare variants across a large number of samples, enabling fine-scale analysis of subtle genetic differences and yielding explicit estimates of population sizes and split times. Using rarecoal we find that the ancestors of the Anglo-Saxon samples are closest to modern Danish and Dutch populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.

CUA: a Flexible and Comprehensive Codon Usage Analyzer

ZHENGUO ZHANG
doi: http://dx.doi.org/10.1101/022814

Codon usage bias (CUB) is pervasive in genomes. Studying its patterns and causes is fundamental for understanding genome evolution. Rapidly emerging large-scale RNA and DNA sequences make studying CUB in many species feasible. Existing software, however, is limited in incorporating the new data resources. Therefore, I release the software CUA, which can compute all popular CUB metrics, including CAI, tAI, Fop, and ENC. More importantly, CUA allows users to incorporate user-specific data, such as tRNA abundance and highly expressed genes from considered tissues; this flexibility enables computing CUB metrics for any species with improved accuracy. In sum, CUA eases codon usage studies and establishes a platform for incorporating new metrics in the future. CUA is available at http://search.cpan.org/dist/Bio-CUA/ with help documentation and a tutorial.
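
For readers unfamiliar with how codon usage bias is quantified, the sketch below computes relative synonymous codon usage (RSCU), a simple building block behind several codon usage metrics. It is plain Python for illustration, assumes the standard genetic code, and is not CUA's interface; the names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)

def rscu(cds):
    """RSCU per codon: observed count divided by the count expected
    if all synonymous codons of that amino acid were used equally."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa, codons in SYNONYMS.items():
        if aa == "*":
            continue  # stop codons are not scored
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

A value of 1 means a codon is used exactly as often as expected under equal usage; values above 1 mark preferred codons. Codons containing ambiguous bases are simply ignored in this sketch.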

Replaying Evolution to Test the Cause of Extinction of One Ecotype in an Experimentally Evolved Population

Caroline B. Turner, Zachary D. Blount, Richard E. Lenski
doi: http://dx.doi.org/10.1101/022798

In a long-term evolution experiment with Escherichia coli, bacteria in one of twelve populations evolved the ability to consume citrate, a previously unexploited resource in a glucose-limited medium. This innovation led to the frequency-dependent coexistence of citrate-consuming (Cit+) and non-consuming (Cit–) ecotypes, with Cit– bacteria persisting on the exogenously supplied glucose as well as other carbon molecules released by the Cit+ bacteria. After more than 10,000 generations of coexistence, however, the Cit– lineage went extinct; cells with the Cit– phenotype dropped to levels below detection, and the Cit– clade could not be detected by molecular assays based on its unique genotype. We hypothesized that this extinction event was a deterministic outcome of evolutionary change within the population, specifically the appearance of a more-fit Cit+ ecotype that competitively excluded the Cit– ecotype. We tested this hypothesis by re-evolving the population from one frozen sample taken just prior to the extinction and from another sample taken several thousand generations earlier, in each case for 500 generations and with 20-fold replication. To our surprise, the Cit– type did not go extinct in any of these replays, and Cit– cells also persisted in a single replicate that was propagated for 3,000 generations. Even more unexpectedly, we showed that the Cit– ecotype could reinvade the Cit+ population after its extinction. Taken together, these results indicate that the extinction of the Cit– ecotype was not a deterministic outcome driven by competitive exclusion by the Cit+ ecotype. The extinction also cannot be explained by demographic stochasticity, as the population size of the Cit– ecotype should have been many thousands of cells even during the daily transfer events. Instead, we infer that the extinction must have been caused by a rare chance event in which some aspect of the experimental conditions was inadvertently perturbed.

Coevolution of male and female reproductive traits drive cascading reinforcement in Drosophila yakuba

Aaron A Comeault, Aarti Venkat, Daniel R Matute
doi: http://dx.doi.org/10.1101/022244

When the ranges of two hybridizing species overlap, individuals may ‘waste’ gametes on inviable or infertile hybrids. In these cases, selection against maladaptive hybridization can lead to the evolution of enhanced reproductive isolation in a process called reinforcement. On the slopes of the African island of São Tomé, Drosophila yakuba and its endemic sister species D. santomea have a well-defined hybrid zone. Drosophila yakuba females from within this zone show increased postmating-prezygotic isolation towards D. santomea males when compared with D. yakuba females from allopatric populations. To understand why reinforced gametic isolation is confined to areas of secondary contact and has not spread throughout the entire D. yakuba geographic range, we studied the costs of reinforcement in D. yakuba using a combination of natural collections and experimental evolution. We found that D. yakuba males from sympatric populations sire fewer progeny than allopatric males when mated to allopatric D. yakuba females. Our results suggest that the correlated evolution of male and female reproductive traits in sympatric D. yakuba has associated costs (i.e., reduced male fertility) that prevent the alleles responsible for enhanced isolation from spreading outside the hybrid zone.

Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation

Minliang Jin, Haijun Liu, Cheng He, Junjie Fu, Yingjie Xiao, Yuebin Wang, Weibo Xie, Guoying Wang, Jianbing Yan
doi: http://dx.doi.org/10.1101/022384

Variation in gene expression contributes to the diversity of phenotype. The construction of the pan-transcriptome is especially necessary for species with complex genomes, such as maize. However, knowledge of the regulation mechanisms and functional consequences of the pan-transcriptome is limited. In this study, we identified 13,382 nuclear expression presence and absence variation candidates (ePAVs, expressed in 5%–95% of lines; based on the reference genome) by re-analyzing the RNA sequencing data from the kernels (15 days after pollination) of 368 diverse maize inbreds. It was estimated that only ~1% of the ePAVs are explained by DNA sequence presence and absence variations (PAVs). The ePAV genes tend to be regulated by distant eQTLs when compared with non-ePAV genes (called here core expression genes, expressed in more than 95% of lines). When the expression presence/absence status was used as the “genotype” to perform a genome-wide association study, 56 (0.42%) ePAVs were significantly associated with 15 agronomic traits and 1,967 (14.74%) with 526 metabolic traits, measured from the mature kernels. While the above was mainly based on the reference genome, by using a modified “assemble-then-align” strategy, 2,355 high-confidence novel sequences with a total length of 1.9 Mb were found to be absent from the current B73 reference genome (v2). Ten randomly selected novel sequences were validated with genomic PCR. A simulation analysis suggested that the pan-transcriptome of the maize whole kernel is approaching a maximum value of 63,000 genes. Two novel validated sequences annotated as NBS_LRR-like genes were found to associate with flavonoid content, and their homologs in rice were also found to affect flavonoids and disease resistance. Novel sequences absent from the present reference genome might be functionally important and deserve more attention.
This study provides novel perspectives and resources to discover maize quantitative trait variations and help us to better understand the kernel regulation networks, thus enhancing maize breeding.
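
The ePAV and core categories in this abstract are defined by the fraction of lines in which a gene is expressed, which can be written down in a few lines. The sketch below assumes each gene comes with one expressed/not-expressed flag per inbred line; how expression is called, how the boundary cases at exactly 5% and 95% are treated, and the "rare" label for genes below 5% are assumptions made here for illustration.

```python
def classify_gene(expressed_flags, low=0.05, high=0.95):
    """Classify a gene by the fraction of lines in which it is expressed.

    expressed_flags: one boolean per line (True = expressed).
    Returns "core" (more than `high`), "ePAV" (between `low` and `high`),
    or "rare" (below `low`; a label used only in this sketch).
    """
    if not expressed_flags:
        raise ValueError("need at least one line")
    fraction = sum(expressed_flags) / len(expressed_flags)
    if fraction > high:
        return "core"
    if fraction >= low:
        return "ePAV"
    return "rare"
```

Once genes are labelled this way, the presence/absence flags of each ePAV gene can serve as a binary marker in an association test, as the abstract describes.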

Genoogle: an indexed and parallelized search engine for similar DNA sequences

Felipe Albrecht
(Submitted on 10 Jul 2015)

The search for similar genetic sequences is one of the main bioinformatics tasks. Genetic sequence data banks are growing exponentially, and search techniques that run in linear time can no longer complete the search in the required time. Another problem is that the clock speed of modern processors is no longer growing as it did before; instead, processing capacity is growing through the addition of more processing cores, and techniques that do not use parallel computing do not benefit from these extra cores. This work aims to use data indexing techniques to reduce the computational cost of the search process, combined with parallelization of the search techniques to exploit the computational capacity of multi-core processors. To verify the viability of using these two techniques simultaneously, software that uses parallelization techniques with inverted indexes was developed.
Experiments were executed to analyze the performance gain when parallelism is utilized, the search time gain, and also the quality of the results when compared with other search tools. The results of these experiments were promising: the parallelism gain exceeded the expected speedup, the search was 20 times faster than the parallelized NCBI BLAST, and the search results showed good quality when compared with this tool.
The software source code is available at this https URL .
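
The combination described in this abstract, an inverted index plus parallel search, can be sketched in a few lines. The example below is a toy illustration and not Genoogle's code: it assumes a fixed seed length, exact k-mer matches only, a database split into shards that are indexed separately, and hypothetical function names.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

K = 4  # toy seed length (assumption; real tools use longer seeds)

def build_index(sequences, k=K):
    """Inverted index: k-mer -> list of (sequence id, position)."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def seed_hits(index, query, k=K):
    """Count shared k-mers per (sequence id, diagonal).

    The diagonal is database position minus query position; many hits on
    one diagonal suggest an ungapped region of similarity.
    """
    hits = Counter()
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], ()):
            hits[(seq_id, dpos - qpos)] += 1
    return hits

def parallel_search(shards, query, k=K, workers=2):
    """Look the query up in every index shard concurrently and merge the hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda idx: seed_hits(idx, query, k), shards))
    total = Counter()
    for hits in partial:
        total.update(hits)
    return total.most_common()
```

Each lookup costs time proportional to the query length and the number of hits rather than the database size, which is the point of indexing. Threads are used here only to keep the sketch short; in CPython they will not speed up this CPU-bound work, so a process pool or a language with native threads would be needed for a real gain.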