Decreased transcription factor binding levels nearby primate pseudogenes suggests regulatory degeneration

Gavin M Douglas, Michael D Wilson, Alan M Moses
doi: http://dx.doi.org/10.1101/024026

Characteristics of pseudogene degeneration at the coding level are well-known, such as a shift towards neutral rates of nonsynonymous substitutions and gain of frameshift mutations. In contrast, degeneration of pseudogene transcriptional regulation is not well understood. Here, we test two predictions of regulatory degeneration along the pseudogenized lineage: (1) decreased transcription factor binding and (2) accelerated evolution in putative cis-regulatory regions. We find evidence for decreased TF binding levels nearby two primate pseudogenes compared to functional liver genes. We also find evidence for pseudogene-lineage-specific relaxation of sequence constraint on a fragment of the promoter of the primate pseudogene urate oxidase (Uox) and a nearby cis-regulatory module (CRM). However, the majority of TF-bound sequences nearby pseudogenes do not show evidence for lineage-specific accelerated rates of evolution. We conclude that decreases in TF binding level could be a marker for regulatory degeneration, while sequence degeneration in most CRMs may be obscured by background rates of TF binding site turnover.

Inference and analysis of population structure using genetic data and network theory

Gili Greenbaum, Alan R. Templeton, Shirli Bar-David
doi: http://dx.doi.org/10.1101/024042

Clustering individuals based on genetic data has become commonplace in many genetic and ecological studies. Most often, statistical inference of population structure is done by applying model-based approaches, such as Bayesian clustering, aided by visualization using distance-based approaches, such as PCA (Principal Component Analysis). While existing distance-based approaches suffer from a lack of statistical rigour, model-based approaches entail assumptions of prior conditions, such as that the subpopulations are at Hardy-Weinberg equilibrium. Here we present a distance-based approach for inference of population structure using genetic data, based on the network theory concept of community, a dense subgraph within a network. A network is constructed from the pairwise genetic-distance matrix of all sampled individuals, and community detection algorithms are used to partition the network into communities, interpreted as a partition of the population into subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition’s modularity, a network theory concept measuring the strength with which a partition divides the network. In order to further characterize population structure, a measure of the Strength of Association (SA) of an individual to its assigned community is calculated, and the Strength of Association Distribution (SAD) of the communities is analysed to provide additional population structure details. The approach presented here provides a novel, computationally efficient method for inference of population structure which assumes neither an underlying model nor prior conditions, making inference potentially more robust. The method is implemented in the software NetStruct, available at https://github.com/GiliG/NetStruct.
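
The modularity score and permutation test described in this abstract can be illustrated with a minimal sketch. It is not the NetStruct implementation: the function names, the toy similarity matrix, and the choice to permute community labels are all assumptions made for illustration, and the abstract does not say which permutation scheme NetStruct uses. Community detection itself is omitted; the partition is taken as given.

```python
import random

def modularity(weights, labels):
    """Weighted Newman modularity Q of a partition.

    weights: symmetric matrix (list of lists) of edge weights, zero diagonal.
    labels:  community label of each individual.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]  # weighted degree of each node
    total = sum(strength)                     # equals twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # observed weight minus the weight expected by chance
                q += weights[i][j] - strength[i] * strength[j] / total
    return q / total

def permutation_p_value(weights, labels, n_perm=999, seed=0):
    """Assumed scheme: shuffle labels and count partitions at least as modular."""
    rng = random.Random(seed)
    observed = modularity(weights, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if modularity(weights, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: six individuals in two groups of three, with
# high similarity (0.9) within groups and low similarity (0.1) between them.
groups = [0, 0, 0, 1, 1, 1]
toy = [[0.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
        for j in range(6)] for i in range(6)]
q_obs, p_val = permutation_p_value(toy, groups)
```

With only six individuals, few distinct label arrangements exist, so the p-value cannot become very small; real datasets would have many more individuals.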

The impact of partitioning on phylogenomic accuracy

Diego Darriba, David Posada
doi: http://dx.doi.org/10.1101/023978

Several strategies have been proposed to assign substitution models in phylogenomic datasets, or partitioning. The accuracy of these methods, and most importantly, their impact on phylogenetic estimation has not been thoroughly assessed using computer simulations. We simulated multiple partitioning scenarios to benchmark two a priori partitioning schemes (one model for the whole alignment, one model for each data block), and two statistical approaches (hierarchical clustering and greedy) implemented in PartitionFinder and in our new program, PartitionTest. Most methods were able to identify optimal partitioning schemes closely related to the true one. Greedy algorithms identified the true partitioning scheme more frequently than the clustering algorithms, but selected slightly less accurate partitioning schemes and tended to underestimate the number of partitions. PartitionTest was several times faster than PartitionFinder, with equal or better accuracy. Importantly, maximum likelihood phylogenetic inference was very robust to the partitioning scheme. Best-fit partitioning schemes resulted in optimal phylogenetic performance, without appreciable differences compared to the use of the true partitioning scheme. However, accurate trees were also obtained by a “simple” strategy consisting of assigning independent GTR+G models to each data block. In contrast, leaving the data unpartitioned always diminished the quality of the trees inferred, to a greater or lesser extent depending on the simulated scenario. The analysis of empirical data confirmed these trends, although it suggested a stronger influence of the partitioning scheme. Overall, our results suggest that statistical partitioning, but also the a priori assignment of independent GTR+G models, maximizes phylogenomic performance.

Learning the human chromatin network from all ENCODE ChIP-seq data

Scott M. Lundberg, William B. Tu, Brian Raught, Linda Z. Penn, Michael M. Hoffman, Su-In Lee
doi: http://dx.doi.org/10.1101/023911

Introduction: A cell’s epigenome arises from interactions among chromatin factors — transcription factors, histones, and other DNA-associated proteins — co-localized at particular genomic regions. Identifying the network of interactions among chromatin factors, the chromatin network, is of paramount importance in understanding epigenome regulation. Methods: We developed a novel computational approach, ChromNet, to infer the chromatin network from a set of ChIP-seq datasets. ChromNet has three key features that enable its use on large collections of ChIP-seq data. First, rather than using pairwise co-localization of factors along the genome, ChromNet identifies conditional dependence relationships that better discriminate direct and indirect interactions. Second, our novel statistical technique, the group graphical model, improves inference of conditional dependence on tightly correlated datasets. These datasets include transcription factors that form a complex or the same transcription factor assayed in different laboratories. Third, ChromNet’s computationally efficient method allows network learning among thousands of factors, and efficient relearning as new data is added. Results: We applied ChromNet to all available ChIP-seq data from the ENCODE Project, consisting of 1,415 ChIP-seq datasets, which revealed previously known chromatin factor interactions better than alternative approaches. ChromNet also identified previously unreported chromatin factor interactions. We experimentally validated one of these interactions, between the MYC and HCFC1 transcription factors. Discussion: ChromNet provides a useful tool for understanding the interactions among chromatin factors and identifying novel interactions. We have provided an interactive web-based visualization of the full ENCODE chromatin network and the ability to incorporate custom datasets at http://chromnet.cs.washington.edu.
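
The distinction this abstract draws between pairwise co-localization and conditional dependence can be illustrated with a three-variable partial correlation. This is only a sketch of the general idea, not ChromNet's group graphical model; the simulated signals and the names `factor_a`, `factor_b`, and `factor_c` are hypothetical. Here two factors are each driven by a third and never interact directly, so they correlate strongly pairwise but show almost no dependence once the third is conditioned on.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(x, y, z):
    """Correlation between x and y after conditioning on a single variable z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical simulated signal at 2000 genomic positions:
# factor_c drives both factor_a and factor_b, which do not interact directly.
rng = random.Random(1)
factor_c = [rng.gauss(0, 1) for _ in range(2000)]
factor_a = [c + rng.gauss(0, 0.5) for c in factor_c]
factor_b = [c + rng.gauss(0, 0.5) for c in factor_c]

pairwise = pearson(factor_a, factor_b)                          # high: indirect link via C
conditional = partial_correlation(factor_a, factor_b, factor_c)  # close to zero
```

With many factors, the analogous quantity is typically read from the inverse of the correlation matrix rather than computed one triple at a time.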

Infinitely Long Branches and an Informal Test of Common Ancestry

Leonardo de Oliveira Martins, David Posada
doi: http://dx.doi.org/10.1101/023903

The evidence for universal common ancestry (UCA) is vast and persuasive, and a phylogenetic test was proposed for quantifying its odds against independently originated sequences based on the comparison between one and several trees. This test was successfully applied to a well-supported homologous sequence alignment, but was criticized once simulations showed that even alignments without any phylogenetic structure could mislead its conclusions. Despite claims to the contrary, we believe that the counterexample successfully showed a drawback of the test: its reliance on good alignments. Here we present a simplified version of this counterexample, which can be interpreted as a tree with arbitrarily long branches, and where the test again fails. We also present another simulation showing circumstances whereby any sufficiently similar alignment will favor UCA irrespective of the true independent origins for the sequences. We therefore conclude that the test should not be trusted unless convergence has already been ruled out a priori. Finally, we present a class of frequentist tests that perform better than the purportedly formal UCA test.

Trait Evolution on Phylogenetic Networks

Dwueng-Chwuan Jhwueng, Brian O’Meara
doi: http://dx.doi.org/10.1101/023986

Species may evolve on a reticulate network due to hybridization or other gene flow rather than on a strictly bifurcating tree, but comparative methods to deal with trait evolution on a network are lacking. We create such a method, which uses a Brownian motion model. Our method seeks to separately or jointly detect a bias in trait value coming from hybridization (β) and a burst of variation at the time of hybridization (v_H) associated with the hybridization event, as well as traditional Brownian motion parameters of ancestral state (μ) and rate of evolution (σ^2) of Brownian motion, as well as measurement error of the tips (SE). We test the method with extensive simulations. We also apply the model to two empirical examples, cichlid body size and Nicotiana drought tolerance, and find substantial measurement error and a hint that hybrids have greater drought tolerance in the latter case. The new methods are available in CRAN R package BMhyd.

Disentangling sources of selection on exonic transcriptional enhancers

Rachel M Agoglia, Hunter B Fraser
doi: http://dx.doi.org/10.1101/024000

In addition to coding for proteins, exons can also impact transcription by encoding regulatory elements such as enhancers. It has been debated whether such features confer heightened selective constraint, or evolve neutrally. We have addressed this question by developing a new approach to disentangle the sources of selection acting on exonic enhancers, in which we model the evolutionary rates of every possible substitution as a function of their effects on both protein sequence and enhancer activity. In three exonic enhancers, we found no significant association between evolutionary rates and effects on enhancer activity. This suggests that despite having biochemical activity, these exonic enhancers have no detectable selective constraint, and thus are unlikely to play a major role in protein evolution.

Genomic DNA transposition induced by human PGBD5

Anton Henssen, Elizabeth Henaff, Eileen Jiang, Amy R Eisenberg, Julianne R Carson, Camila Villasante, Mondira Ray, Eric Still, Melissa Burns, Jorge Gandara, Cedric Feschotte, Christopher E. Mason, Alex Kentsis
doi: http://dx.doi.org/10.1101/023887

Transposons are mobile genetic elements that are found in nearly all organisms, including humans. Mobilization of DNA transposons by transposase enzymes can cause genomic rearrangements, but our knowledge of human genes derived from transposases is limited. Here, we find that the protein encoded by human PGBD5, the most evolutionarily conserved transposable element-derived gene in chordates, can induce stereotypical cut-and-paste DNA transposition in human cells. Genomic integration activity of PGBD5 requires distinct aspartic acid residues in its transposase domain, and specific DNA sequences with inverted terminal repeats with similarity to piggyBac transposons. DNA transposition catalyzed by PGBD5 in human cells occurs genome-wide, with precise transposon excision and preference for insertion at TTAA sites. The apparent conservation of DNA transposition activity by PGBD5 raises the possibility that genomic remodeling may contribute to its biological function.

Species Tree Estimation from Genome-wide Data with Guenomu

Leonardo de Oliveira Martins, David Posada
doi: http://dx.doi.org/10.1101/023861

The history of particular genes and that of the species that carry them can differ for several reasons. In particular, gene trees and species trees can truly differ due to well-known evolutionary processes like gene duplication and loss, lateral gene transfer or incomplete lineage sorting. Different species tree reconstruction methods have been developed to take this incongruence into account, and they can be broadly divided into supertree and supermatrix approaches. Here, we introduce a new Bayesian hierarchical model, which we have recently developed and implemented in the program Guenomu, that considers multiple sources of gene tree/species tree disagreement. Guenomu takes as input the posterior distributions of unrooted gene tree topologies for multiple gene families, in order to estimate the posterior distribution of rooted species tree topologies.

Population genomic scans reveal novel genes underlie convergent flowering time evolution in the introduced range of Arabidopsis thaliana

Billie Gould, John R Stinchcombe
doi: http://dx.doi.org/10.1101/023788

A long-standing question in evolutionary biology is whether the evolution of convergent phenotypes results from selection on the same heritable genetic components. Using whole genome sequencing and genome scans, we tested whether the evolution of parallel longitudinal flowering time clines in the native and introduced ranges of Arabidopsis thaliana has a similar genetic basis. We found that common variants of large effect on flowering time in the native range do not appear to have been under recent strong selection in the introduced range. Genes in regions of the genome that are under selection for flowering time are also not enriched for functions related to development or environmental sensing. We instead identified a set of 53 new candidate genes putatively linked to the evolution of flowering time in the species' introduced range. A high degree of conditional neutrality of flowering time variants between the native and introduced ranges may preclude parallel evolution at the level of genes. Overall, neither gene pleiotropy nor available standing genetic variation appears to have restricted the evolution of flowering time in the introduced range to high frequency variants from the native range or to known flowering time pathway genes.