Sex-specific recombination rates and allele frequencies affect the invasion of sexually antagonistic variation on autosomes

Minyoung Wyman, Mark Wyman
(Submitted on 19 Oct 2013)

The introduction and persistence of novel sexually antagonistic alleles can depend upon factors that differ between males and females. Understanding the conditions for invasion in a two-locus model can elucidate these processes. For instance, selection can act differently upon the sexes, or sex-linkage can facilitate the invasion of genetic variation with opposing fitness effects between the sexes. Two factors that deserve further attention are recombination rates and allele frequencies — both of which can vary substantially between the sexes. We find that sex-specific recombination rates in a two-locus diploid model can affect the invasion outcome of sexually antagonistic alleles and that the sex-averaged recombination rate is not necessarily sufficient to predict invasion. We confirm that the range of permissible recombination rates is smaller in the sex benefitting from invasion and larger in the sex harmed by invasion. However, within the invasion space, male recombination rate can be greater than, equal to, or less than female recombination rate in order for a male-benefit, female-detriment allele to invade (and similarly for a female-benefit, male-detriment allele). We further show that a novel, sexually antagonistic allele that is also associated with a lowered recombination rate can invade more easily when present in the double heterozygote genotype. Finally, we find that sexual dimorphism in resident allele frequencies can impact the invasion of new sexually antagonistic alleles at a second locus. Our results suggest that accounting for sex-specific recombination rates and allele frequencies can determine the difference between invasion and non-invasion of novel sexually antagonistic alleles in a two-locus model.

The Functional Consequences of Variation in Transcription Factor Binding

Darren A. Cusanovich, Bryan Pavlovic, Jonathan K. Pritchard, Yoav Gilad
(Submitted on 18 Oct 2013)

One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly play an important role in determining gene expression outputs, yet the regulatory logic underlying functional transcription factor binding is poorly understood. Many studies have focused on characterizing the genomic locations of TF binding, yet it is unclear to what extent TF binding at any specific locus has functional consequences with respect to gene expression output. To evaluate the context of functional TF binding we knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line. We then identified genes whose expression was affected by the knockdowns. We intersected the gene expression data with transcription factor binding data (based on ChIP-seq and DNase-seq) within 10 kb of the transcription start sites of expressed genes. This combination of data allowed us to infer functional TF binding. On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. We found that functional TF binding is enriched in regulatory elements that harbor a large number of TF binding sites, at sites with predicted higher binding affinity, and at sites that are enriched in genomic regions annotated as active enhancers.
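The headline figure above (the share of bound genes that respond to a knockdown) comes down to a set intersection per factor, averaged over factors. The following is a minimal sketch of that calculation, not the authors' pipeline: the factor names, gene identifiers, and the `functional_fraction` helper are hypothetical, and the bound-gene sets are assumed to have been computed already (for instance from binding sites near transcription start sites).

```python
def functional_fraction(bound, differentially_expressed):
    """Fraction of bound genes that are also differentially expressed."""
    if not bound:
        return 0.0
    return len(bound & differentially_expressed) / len(bound)


# Hypothetical toy data: factor -> (genes bound, genes changed after knockdown)
knockdowns = {
    "TF_A": ({"g1", "g2", "g3", "g4"}, {"g2", "g9"}),
    "TF_B": ({"g1", "g5"}, {"g1", "g5", "g7"}),
}

per_factor = {
    tf: functional_fraction(bound, changed)
    for tf, (bound, changed) in knockdowns.items()
}
# Average over factors, so each factor counts equally regardless of target count
mean_fraction = sum(per_factor.values()) / len(per_factor)
print(per_factor, mean_fraction)
```

Averaging per-factor fractions weights every factor equally; pooling all factor-gene pairs before dividing would instead favor factors with many bound genes. Which of the two the abstract's average uses is not stated here.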

Non-monotonic effects of migration in populations with balancing selection

Pierangelo Lombardo, Andrea Gambassi, Luca Dall’Asta
(Submitted on 18 Oct 2013)

Balancing selection is recognized as a prominent evolutionary force responsible for the maintenance of genetic diversity in natural populations. We quantify its influence on the evolution of a subdivided population, investigating how the mean-fixation time (MFT) depends on the migration rate among subpopulations. We identify a threshold in the strength of the balancing selection above which the MFT changes its qualitative behavior compared to that of neutral populations, developing an unexpected non-monotonic dependence on the migration rate. This feature carries over into an analogous behavior of the heterozygosity, which is an index of the biodiversity of the population.

Author post: A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

This author post is by Cyrus Maher and Ryan Hernandez on their preprint A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity, arXived here.

Rigorous evolutionary analysis of protein coding regions often requires high-quality multiple sequence alignments. These alignments can only be generated after the identification of orthologous sequences. In our pre-print, “A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity”, we present a novel method that substantially improves the number and quality of detected orthologs, especially in the presence of sequencing error and complex evolutionary processes.

This endeavor grew out of our forthcoming work on the evolutionary impact of ancient pathogens on the human genome. Early on, we observed the decisive influence ortholog quality exerted on our downstream conclusions. As one might imagine, accurate sequence analysis is a fool’s errand if the sequences are, in fact, the wrong ones! Such experiences have impelled us to take a keen interest in orthologs, much as a bad case of gastroenteritis might inspire a sushi chef to become thoroughly attentive to the quality of his or her fish.

Identifying orthologous sequences is referred to as ortholog detection (OD). In brief, existing OD methods can be classified as tree-based, graph-based, or a hybrid of the two. Tree-based methods may use reconciliation techniques between gene and species trees or may rely on the gene tree alone. Graph-based methods can employ a variety of metrics to quantify similarity between sequences. Popular measures include sequence identity and matrix-weighted similarity scores. Syntenic information may also be incorporated in this context.

Here we consider alignments from UCSC (MZ), MultiParanoid (MP), translated BLAT (BL), and OMA. To briefly summarize the strengths of the considered methods: MZ utilizes syntenic similarity, MP includes all-by-all similarity in its calculations, OMA considers phylogenetic information directly, and BL does not require an accurately predicted proteome. In figure 1A of our paper, we illustrate the head-to-head performance of four popular methods for OD. Interestingly, we find striking complementarity between methods, motivating a search for a practical way to integrate ortholog predictions from methodologically diverse sources.


Figure 1: Comparison of sequence identity levels between methods A.) Heat map of the percent of orthologs for which BLAT (BL), OMA (OMA), MultiParanoid (MP), and MultiZ (MZ) outperform one another. Performance is based on percent identity of each method’s orthologs to the human sequence. One method is considered to outperform another method if it improves percent identity by at least five percentage points. Text in diagonal cells shows the number of orthologs identified by each method, colored by the percent of transcripts at which a given method outperforms all the others.
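The pairwise comparison described in the caption can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the figure: the identity values are made up, the `outperform_matrix` helper is hypothetical, and restricting each comparison to transcripts where both methods returned an ortholog is our assumption.

```python
from itertools import permutations

METHODS = ["BL", "OMA", "MP", "MZ"]
THRESHOLD = 5.0  # percentage points, the cutoff quoted in the caption

# Hypothetical percent identity to the human sequence, per transcript and method.
# None marks a transcript for which the method returned no ortholog.
identity = {
    "tx1": {"BL": 91.0, "OMA": 97.0, "MP": 96.0, "MZ": 90.0},
    "tx2": {"BL": 88.0, "OMA": None, "MP": 80.0, "MZ": 89.0},
    "tx3": {"BL": 75.0, "OMA": 76.0, "MP": 95.0, "MZ": 94.0},
}


def outperform_matrix(identity, methods=METHODS, threshold=THRESHOLD):
    """Map (a, b) to the percent of shared transcripts where a beats b by >= threshold."""
    result = {}
    for a, b in permutations(methods, 2):
        # Assumption: compare only transcripts for which both methods have a call
        shared = [
            (row[a], row[b])
            for row in identity.values()
            if row[a] is not None and row[b] is not None
        ]
        wins = sum(1 for id_a, id_b in shared if id_a - id_b >= threshold)
        result[(a, b)] = 100.0 * wins / len(shared) if shared else 0.0
    return result


matrix = outperform_matrix(identity)
print(matrix[("OMA", "BL")], matrix[("BL", "MP")])
```

The matrix is not symmetric: a method can beat another on some transcripts and lose on others, which is the complementarity the heat map is meant to show.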

These efforts culminate in the presentation of MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization. MOSAIC is a well-documented Python package that can flexibly integrate ortholog predictions from an arbitrary number of sources. We compare integrated MOSAIC alignments to those generated using each constituent method alone. Relative to the best-performing single method, we show that MOSAIC more than quintuples the number of sequences for which all orthologs of interest are successfully identified (see figure below). However, this increase in putative orthologs could be the result of, e.g., the improper inclusion of low-quality or paralogous sequences. This does not appear to be the case for MOSAIC. Crucially, improvements in power are secured while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.


Figure 2: OD power and the effect of pooling methods A.) The cumulative number of human transcripts as a function of the maximum number of missing species allowed.
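The curve in this figure is a cumulative count, which can be illustrated as follows. The sketch assumes nine non-human species and uses made-up inputs; both helper functions are hypothetical. Note that the naive union used for pooling here is a stand-in and is not MOSAIC's cluster optimization.

```python
from collections import Counter


def missing_after_pooling(found_by_method, species):
    """Count species with no ortholog from any method (naive union, an assumption)."""
    found = set().union(*found_by_method.values())
    return len(set(species) - found)


def cumulative_by_missing(missing_counts, n_species):
    """Return c where c[k] is the number of transcripts with at most k missing species."""
    tally = Counter(missing_counts)
    cumulative, running = [], 0
    for k in range(n_species + 1):
        running += tally.get(k, 0)
        cumulative.append(running)
    return cumulative


# Hypothetical example: one transcript, two methods, three species of interest
species = ["chimp", "gorilla", "mouse"]
calls = {"BL": {"chimp"}, "OMA": {"chimp", "mouse"}}
print(missing_after_pooling(calls, species))

# Hypothetical missing-species counts for seven transcripts
curve = cumulative_by_missing([0, 0, 1, 3, 9, 2, 0], n_species=9)
print(curve)
```

The first entry of `curve` is the number of transcripts with every species present, the quantity that the text reports as more than quintupled.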

These results are obtained from alignments between the human proteome and orthologs from nine species encompassing a range of primates and closely related mammals. For other sequence sets, the best strategy for method integration may differ slightly depending on, e.g., the level of divergence between species of interest. To account for this, MOSAIC provides several options for scoring and optimization, and even facilitates the specification of user-defined metrics for sequence similarity and cluster optimality.

In the future, we would also like to add functionality to automatically fetch relevant alignments from major ortholog databases. In the meantime, we hope that this tool will prove a useful addition to a variety of evolutionary analysis pipelines. We of course welcome feedback on how we might improve the performance and practical utility of the method. Thank you in advance for your input!

Mutant epigenetic machinery mediates climate adaptation in Arabidopsis thaliana

Xia Shen, Simon Forsberg, Mats Pettersson, Zheya Sheng, Orjan Carlborg
(Submitted on 16 Oct 2013)

The genetic basis of adaptation to climate is largely unknown. We explored the genetic regulation of climate plasticity and its contribution to adaptation using publicly available data from two collections of natural Arabidopsis thaliana accessions from a wide range of habitats. Sixteen loci with plastic alleles were mapped and many of these contained candidate genes with amino acid changes. The Chromomethylase 2 (CMT2) genotype influenced adaptation to seasonal temperature variability and accessions carrying a mutant CMT2 allele disrupting the genome-wide CHH-methylation pattern displayed a more plastic response to climate. We conclude that genetic regulation of plasticity appears to be important for climate adaptation and that genetic variation in the epigenetic machinery, leading to altered genome-wide epigenetic modifications, is one of the underlying molecular mechanisms.

A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects

Chuan Gao, Christopher D Brown, Barbara E Engelhardt
(Submitted on 17 Oct 2013)

One important problem in genome science is to determine sets of co-regulated genes based on measurements of gene expression levels across samples, where the quantification of expression levels includes substantial technical and biological noise. To address this problem, we developed a Bayesian sparse latent factor model that uses a three parameter beta prior to flexibly model shrinkage in the loading matrix. By applying three layers of shrinkage to the loading matrix (global, factor-specific, and element-wise), this model has non-parametric properties in that it estimates the appropriate number of factors from the data. We added a two-component mixture to model each factor loading as being generated from either a sparse or a dense mixture component; this allows dense factors that capture confounding noise, and sparse factors that capture local gene interactions. We developed two statistics to quantify the stability of the recovered matrices for both sparse and dense matrices. We tested our model on simulated data and found that we successfully recovered the true latent structure as compared to related models. We applied our model to a large gene expression study and found that we recovered known covariates and small groups of co-regulated genes. We validated these gene subsets by testing for associations between genotype data and these latent factors, and we found a substantial number of biologically important genetic regulators for the recovered gene subsets.

Y Chromosomes of 40% Chinese Are Descendants of Three Neolithic Super-grandfathers

Shi Yan, Chuan-Chao Wang, Hong-Xiang Zheng, Wei Wang, Zhen-Dong Qin, Lan-Hai Wei, Yi Wang, Xue-Dong Pan, Wen-Qing Fu, Yun-Gang He, Li-Jun Xiong, Wen-Fei Jin, Shi-Lin Li, Yu An, Hui Li, Li Jin
(Submitted on 15 Oct 2013)

Demographic change in human populations is one of the central questions in reconstructing the human past. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely using a molecular clock. We found that all the Paleolithic divergences were binary; however, three strong star-like Neolithic expansions at ~6 kya (thousand years ago), assuming a constant substitution rate of 1e-9/bp/year, indicate that ~40% of modern Chinese are patrilineal descendants of only three super-grandfathers at that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be related to the development of agriculture.
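The numbers quoted in the abstract give a feel for how a strict molecular clock converts mutation counts into dates. The sketch below is back-of-the-envelope arithmetic using only the rate and region length stated above; the function names are hypothetical, and the authors' actual dating procedure may differ.

```python
RATE_PER_BP_PER_YEAR = 1e-9   # substitution rate assumed in the abstract
REGION_LENGTH_BP = 3.9e6      # sequenced portion of the NRY


def expected_mutations(years):
    """Expected mutations accumulated along one lineage over the given time."""
    return RATE_PER_BP_PER_YEAR * REGION_LENGTH_BP * years


def divergence_years(mutations_per_lineage):
    """Invert the clock: time implied by an average mutation count per lineage."""
    return mutations_per_lineage / (RATE_PER_BP_PER_YEAR * REGION_LENGTH_BP)


print(expected_mutations(6000))   # mutations expected since ~6 kya
print(divergence_years(1))        # years represented by a single mutation
```

Under these assumptions a lineage picks up about 23 mutations over 6,000 years, so one mutation corresponds to roughly 256 years. That resolution is what makes star-like expansions visible in a region of this size.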

General triallelic frequency spectrum under demographic models with variable population size

Paul A. Jenkins, Jonas W. Mueller, Yun S. Song
(Submitted on 13 Oct 2013)

It is becoming routine to obtain datasets on DNA sequence variation across several thousands of chromosomes, providing unprecedented opportunity to infer the underlying biological and demographic forces. Such data make it vital to study summary statistics which offer enough compression to be tractable, while preserving a great deal of information. One well-studied summary is the site frequency spectrum—the empirical distribution, across segregating sites, of the sample frequency of the derived allele. However, most previous theoretical work has assumed that each site has experienced at most one mutation event in its genealogical history, which becomes less tenable for very large sample sizes. In this work we obtain, in closed-form, the predicted frequency spectrum of a site that has experienced at most two mutation events, under very general assumptions about the distribution of branch lengths in the underlying coalescent tree. Among other applications, we obtain the frequency spectrum of a triallelic site in a model of historically varying population size. We demonstrate the utility of our formulas in two settings: First, we show that triallelic sites are more sensitive to the parameters of a population that has experienced historical growth, suggesting that they will have use if they can be incorporated into demographic inference. Second, we investigate a recently proposed alternative mechanism of mutation in which the two derived alleles of a triallelic site are created simultaneously within a single individual, and we develop a test to determine whether it is responsible for the excess of triallelic sites in the human genome.
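For readers unfamiliar with the site frequency spectrum, here is a minimal sketch of how the standard biallelic version is tabulated from a sample. The data are made up and the function name is hypothetical. The sketch assumes each site carries at most one derived allele, which is the very assumption the paper relaxes.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Tabulate the SFS from rows of 0/1 alleles (1 = derived).

    Returns a list whose entry i-1 is the number of segregating sites at
    which exactly i of the n sampled chromosomes carry the derived allele,
    for i = 1 .. n-1. Monomorphic sites (count 0 or n) are ignored.
    """
    n = len(haplotypes)
    derived_counts = Counter(sum(column) for column in zip(*haplotypes))
    return [derived_counts.get(i, 0) for i in range(1, n)]


# Hypothetical sample: 4 chromosomes (rows) by 5 sites (columns)
sample = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
sfs = site_frequency_spectrum(sample)
print(sfs)
```

In this sample the second site is fixed for the derived allele and the last site is monomorphic ancestral, so neither contributes. A triallelic spectrum would instead need a two-dimensional table indexed by the counts of both derived alleles.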

Non-identifiability of identity coefficients at biallelic loci

Miklós Csűrös
(Submitted on 13 Oct 2013)

Shared genealogies introduce allele dependencies in diploid genotypes, as alleles within an individual or between different individuals will likely match when they originate from a recent common ancestor. At a locus shared by a pair of diploid individuals, there are nine combinatorially distinct modes of identity-by-descent (IBD), capturing all possible combinations of coancestry and inbreeding. A distribution over the IBD modes is described by the nine associated probabilities, known as (Jacquard’s) identity coefficients. The genetic relatedness between two individuals can be succinctly characterized by the identity coefficients corresponding to the joint genealogy. The identity coefficients (together with allele frequencies) determine the distribution of joint genotypes at a locus. At a locus with two possible alleles, identity coefficients are not identifiable because different coefficients can generate the same genotype distribution.
We analyze precisely how different IBD modes combine into identical genotype distributions at diallelic loci. In particular, we describe IBD mode mixtures that result in identical genotype distributions at all allele frequencies, implying the non-identifiability of the identity coefficients from independent loci. Our analysis yields an exhaustive characterization of relatedness statistics that are always identifiable. Importantly, we show that identifiable relatedness statistics include the kinship coefficient (probability that a random pair of alleles are identical by descent between individuals) and inbreeding-related measures, which can thus be estimated from genotype distributions at independent loci.
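To make the identifiable statistics concrete, the kinship coefficient and the two inbreeding coefficients can be written as linear combinations of the nine identity coefficients. The sketch below assumes the conventional Jacquard ordering Δ1 to Δ9, which the abstract does not spell out; the function names and example vectors are ours.

```python
from fractions import Fraction as F


def kinship(delta):
    """Kinship from Jacquard coefficients (conventional ordering, an assumption)."""
    d = dict(enumerate(delta, start=1))
    return d[1] + F(1, 2) * (d[3] + d[5] + d[7]) + F(1, 4) * d[8]


def inbreeding(delta):
    """Inbreeding coefficients of the first and second individual."""
    d = dict(enumerate(delta, start=1))
    return d[1] + d[2] + d[3] + d[4], d[1] + d[2] + d[5] + d[6]


# Non-inbred relatives only use the last three modes (Delta7, Delta8, Delta9)
full_sibs = [0, 0, 0, 0, 0, 0, F(1, 4), F(1, 2), F(1, 4)]
parent_offspring = [0, 0, 0, 0, 0, 0, 0, 1, 0]

print(kinship(full_sibs), kinship(parent_offspring), inbreeding(full_sibs))
```

Full siblings and parent-offspring pairs have different identity coefficients but the same kinship of 1/4, a reminder that kinship is a summary and not the full IBD distribution.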

forqs: Forward-in-time Simulation of Recombination, Quantitative Traits, and Selection

Darren Kessner, John Novembre
(Submitted on 11 Oct 2013)

forqs is a forward-in-time simulation of recombination, quantitative traits, and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OSX, and Windows are freely available under a permissive BSD license.