Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction

Chuan-Chao Wang, Ling-Xiang Wang, Rukesh Shrestha, Shaoqing Wen, Manfei Zhang, Xinzhu Tong, Li Jin, Hui Li
(Submitted on 21 Oct 2013)

Short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs) are two commonly used kinds of markers in Y chromosome studies in forensic and population genetics. There has been increasing interest in the cost-saving strategy of using STR haplotypes to predict SNP haplogroups. However, the convergence of Y chromosome STR haplotypes from different haplogroups might compromise the accuracy of haplogroup prediction. Here, we compared worldwide Y chromosome lineages at both the haplogroup level and the haplotype level to search for possible haplotype similarities among haplogroups. We found similar haplotypes between haplogroups B and I2, C1 and E1b1b1, C2 and E1b1a1, H1 and J, L and O3a2c1, O1a and N, O3a1c and O3a2b, and M1 and O3a2, and those similarities reduce the accuracy of prediction.
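
To make the idea of haplotype convergence concrete, here is a minimal sketch of how one might flag near-identical STR haplotypes that carry different SNP haplogroup labels. It is not the authors' procedure: the sample IDs, repeat counts, six-locus panel, and the one-step threshold are all hypothetical, and the distance is a plain sum of absolute repeat-count differences.

```python
from itertools import combinations

# Hypothetical 6-locus Y-STR panel; the repeat counts below are made up.
LOCI = ("DYS19", "DYS389I", "DYS390", "DYS391", "DYS392", "DYS393")

# (sample id, SNP haplogroup, repeat counts in LOCI order)
samples = [
    ("s1", "O1a", (15, 12, 23, 10, 13, 13)),
    ("s2", "N",   (15, 12, 23, 10, 14, 13)),
    ("s3", "O1a", (15, 13, 24, 10, 13, 12)),
    ("s4", "J",   (14, 13, 23, 11, 11, 12)),
]


def step_distance(a, b):
    """Sum of absolute repeat-count differences across loci."""
    return sum(abs(x - y) for x, y in zip(a, b))


def convergent_pairs(samples, max_steps=1):
    """Pairs of samples from different haplogroups whose haplotypes are
    within max_steps of each other (an assumed, illustrative cutoff)."""
    hits = []
    for (id1, hg1, h1), (id2, hg2, h2) in combinations(samples, 2):
        d = step_distance(h1, h2)
        if hg1 != hg2 and d <= max_steps:
            hits.append((id1, id2, hg1, hg2, d))
    return hits


for hit in convergent_pairs(samples):
    print(hit)
```

A predictor that relies on STR similarity alone would have trouble separating any pair this function reports, which is the failure mode the abstract describes.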

Sex-specific recombination rates and allele frequencies affect the invasion of sexually antagonistic variation on autosomes

Minyoung Wyman, Mark Wyman
(Submitted on 19 Oct 2013)

The introduction and persistence of novel sexually antagonistic alleles can depend upon factors that differ between males and females. Understanding the conditions for invasion in a two-locus model can elucidate these processes. For instance, selection can act differently upon the sexes, or sex-linkage can facilitate the invasion of genetic variation with opposing fitness effects between the sexes. Two factors that deserve further attention are recombination rates and allele frequencies — both of which can vary substantially between the sexes. We find that sex-specific recombination rates in a two-locus diploid model can affect the invasion outcome of sexually antagonistic alleles and that the sex-averaged recombination rate is not necessarily sufficient to predict invasion. We confirm that the range of permissible recombination rates is smaller in the sex benefitting from invasion and larger in the sex harmed by invasion. However, within the invasion space, male recombination rate can be greater than, equal to, or less than female recombination rate in order for a male-benefit, female-detriment allele to invade (and similarly for a female-benefit, male-detriment allele). We further show that a novel, sexually antagonistic allele that is also associated with a lowered recombination rate can invade more easily when present in the double heterozygote genotype. Finally, we find that sexual dimorphism in resident allele frequencies can impact the invasion of new sexually antagonistic alleles at a second locus. Our results suggest that accounting for sex-specific recombination rates and allele frequencies can determine the difference between invasion and non-invasion of novel sexually antagonistic alleles in a two-locus model.

The Functional Consequences of Variation in Transcription Factor Binding

Darren A. Cusanovich, Bryan Pavlovic, Jonathan K. Pritchard, Yoav Gilad
(Submitted on 18 Oct 2013)

One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly play an important role in determining gene expression outputs, yet the regulatory logic underlying functional transcription factor binding is poorly understood. Many studies have focused on characterizing the genomic locations of TF binding, yet it is unclear to what extent TF binding at any specific locus has functional consequences with respect to gene expression output. To evaluate the context of functional TF binding we knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line. We then identified genes whose expression was affected by the knockdowns. We intersected the gene expression data with transcription factor binding data (based on ChIP-seq and DNase-seq) within 10 kb of the transcription start sites of expressed genes. This combination of data allowed us to infer functional TF binding. On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. We found that functional TF binding is enriched in regulatory elements that harbor a large number of TF binding sites, at sites with predicted higher binding affinity, and at sites that are enriched in genomic regions annotated as active enhancers.
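
The headline figure (the share of bound genes that respond to a knockdown) comes from intersecting two gene sets per factor. The sketch below shows that arithmetic on made-up data; the factor names, gene IDs, and the simple unweighted mean across factors are assumptions for illustration, not the authors' pipeline.

```python
# Hypothetical inputs: genes bound near the TSS by each factor, and genes
# differentially expressed (DE) after knocking down that same factor.
bound = {
    "TF_A": {"g1", "g2", "g3", "g4"},
    "TF_B": {"g2", "g5"},
}
de_after_knockdown = {
    "TF_A": {"g1", "g9"},
    "TF_B": {"g7"},
}


def functional_fraction(bound_genes, de_genes):
    """Fraction of bound genes that are also differentially expressed."""
    bound_set = set(bound_genes)
    if not bound_set:
        return 0.0
    return len(bound_set & set(de_genes)) / len(bound_set)


per_factor = {
    tf: functional_fraction(bound[tf], de_after_knockdown.get(tf, set()))
    for tf in bound
}
mean_fraction = sum(per_factor.values()) / len(per_factor)
print(per_factor, mean_fraction)
```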

Non-monotonic effects of migration in populations with balancing selection

Pierangelo Lombardo, Andrea Gambassi, Luca Dall’Asta
(Submitted on 18 Oct 2013)

Balancing selection is recognized as a prominent evolutionary force responsible for the maintenance of genetic diversity in natural populations. We quantify its influence on the evolution of a subdivided population, investigating how the mean-fixation time (MFT) depends on the migration rate among subpopulations. We identify a threshold in the strength of the balancing selection above which the MFT changes its qualitative behavior compared to that of neutral populations, developing an unexpected non-monotonic dependence on the migration rate. This feature carries over into an analogous behavior of the heterozygosity, which is an index of the biodiversity of the population.

Author post: A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

This author post is by Cyrus Maher and Ryan Hernandez on their preprint A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity, arXived here.

Rigorous evolutionary analysis of protein coding regions often requires high-quality multiple sequence alignments. These alignments can only be generated after the identification of orthologous sequences. In our pre-print, “A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity”, we present a novel method that substantially improves the number and quality of detected orthologs, especially in the presence of sequencing error and complex evolutionary processes.

This endeavor grew out of our forthcoming work on the evolutionary impact of ancient pathogens on the human genome. Early on, we observed the decisive influence ortholog quality exerted on our downstream conclusions. As one might imagine, accurate sequence analysis is a fool’s errand if the sequences are, in fact, the wrong ones! Such experiences have impelled us to take a keen interest in orthologs, much as a bad case of gastroenteritis might inspire a sushi chef to become thoroughly attentive to the quality of his or her fish.

Identifying orthologous sequences is referred to as ortholog detection (OD). In brief, existing OD methods can be classified as tree-based, graph-based, or a hybrid of the two. Tree-based methods may use reconciliation techniques between gene and species trees or may rely on the gene tree alone. Graph-based methods can employ a variety of metrics to quantify similarity between sequences. Popular measures include sequence identity and matrix-weighted similarity scores. Syntenic information may also be incorporated in this context.

Here we consider alignments from UCSC (MZ), MultiParanoid (MP), translated BLAT (BL), and OMA. To briefly summarize the strengths of the considered methods: MZ utilizes syntenic similarity, MP includes all-by-all similarity in its calculations, OMA considers phylogenetic information directly, and BL does not require an accurately predicted proteome. In figure 1A of our paper, we illustrate the head-to-head performance of four popular methods for OD. Interestingly, we find striking complementarity between methods, motivating a search for a practical way to integrate ortholog predictions from methodologically diverse sources.


Figure 1: Comparison of sequence identity levels between methods. A.) Heat map of the percent of orthologs for which BLAT (BL), OMA (OMA), MultiParanoid (MP), and MultiZ (MZ) outperform one another. Performance is based on percent identity of each method’s orthologs to the human sequence. One method is considered to outperform another method if it improves percent identity by at least five percentage points. Text in diagonal cells shows the number of orthologs identified by each method, colored by the percent of transcripts at which a given method outperforms all the others.
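
The outperformance rule in the caption is easy to state in code. The following sketch computes, for a pair of methods, the percent of transcripts at which one beats the other by at least five percentage points of identity. The identity values and transcript IDs are invented, and restricting the comparison to transcripts for which both methods returned an ortholog is an assumption on our part here, not something the caption specifies.

```python
METHODS = ("BL", "OMA", "MP", "MZ")

# Hypothetical percent identity to the human sequence, per method and transcript.
identity = {
    "BL":  {"t1": 92.0, "t2": 80.0, "t3": 70.0},
    "OMA": {"t1": 85.0, "t2": 81.0},
    "MP":  {"t1": 92.0, "t3": 78.0},
    "MZ":  {"t1": 97.5, "t2": 86.0, "t3": 71.0},
}


def outperform_percent(a, b, identity, margin=5.0):
    """Percent of shared transcripts where method a exceeds method b
    by at least `margin` percentage points of identity."""
    shared = identity[a].keys() & identity[b].keys()
    if not shared:
        return 0.0
    wins = sum(1 for t in shared if identity[a][t] - identity[b][t] >= margin)
    return 100.0 * wins / len(shared)


# Off-diagonal cells of a heat map like the one described in the caption.
matrix = {
    (a, b): outperform_percent(a, b, identity)
    for a in METHODS for b in METHODS if a != b
}
for (a, b), pct in sorted(matrix.items()):
    print(f"{a} over {b}: {pct:.1f}%")
```

Cells where both (a, b) and (b, a) are well above zero are the kind of complementarity that would motivate pooling methods.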

These efforts culminate in the presentation of MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization. MOSAIC is a well-documented Python package that can flexibly integrate ortholog predictions from an arbitrary number of sources. We compare integrated MOSAIC alignments to those generated using each constituent method alone. Relative to the best-performing single method, we show that MOSAIC more than quintuples the number of sequences for which all orthologs of interest are successfully identified (see figure below). However, this increase in putative orthologs could be the result of, e.g., the improper inclusion of low-quality or paralogous sequences. This does not appear to be the case for MOSAIC. Crucially, improvements in power are secured while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.


Figure 2: OD power and the effect of pooling methods. A.) The cumulative number of human transcripts as a function of the maximum number of missing species allowed.

These results are obtained from alignments between the human proteome and orthologs from nine species encompassing a range of primates and closely related mammals. For other sequence sets, the best strategy for method integration may differ slightly depending on, e.g. the level of divergence between species of interest. To account for this, MOSAIC provides several options for scoring and optimization, and even facilitates the specification of user-defined metrics for sequence similarity and cluster optimality.

In the future, we would also like to add functionality to automatically fetch relevant alignments from major ortholog databases. In the meantime, we hope that this tool will prove a useful addition to a variety of evolutionary analysis pipelines. We of course welcome feedback on how we might improve the performance and practical utility of the method. Thank you in advance for your input!

Y Chromosomes of 40% Chinese Are Descendants of Three Neolithic Super-grandfathers

Shi Yan, Chuan-Chao Wang, Hong-Xiang Zheng, Wei Wang, Zhen-Dong Qin, Lan-Hai Wei, Yi Wang, Xue-Dong Pan, Wen-Qing Fu, Yun-Gang He, Li-Jun Xiong, Wen-Fei Jin, Shi-Lin Li, Yu An, Hui Li, Li Jin
(Submitted on 15 Oct 2013)

Demographic change of human populations is one of the central questions for delving into the past of human beings. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely using a molecular clock. We found that all the Paleolithic divergences were binary; however, three strong star-like Neolithic expansions at ~6 kya (thousand years ago) (assuming a constant substitution rate of 1e-9/bp/year) indicate that ~40% of modern Chinese are patrilineal descendants of only three super-grandfathers at that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be related to the development of agriculture.
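
For readers who want to see how the stated rate and region size translate into dates, here is a back-of-the-envelope sketch of a strict molecular clock. It uses only the two figures quoted in the abstract (1e-9 substitutions per bp per year, 3.9 Mbp); counting mutations along a single lineage from a node to the present is a simplifying assumption, and the function names are ours.

```python
RATE_PER_BP_YEAR = 1e-9   # substitution rate quoted in the abstract
REGION_BP = 3.9e6         # sequenced NRY region quoted in the abstract


def expected_mutations(years, rate=RATE_PER_BP_YEAR, region_bp=REGION_BP):
    """Expected substitutions accumulated along one lineage over `years`."""
    return years * rate * region_bp


def divergence_years(mutations, rate=RATE_PER_BP_YEAR, region_bp=REGION_BP):
    """Age implied by a mutation count along one lineage (strict clock)."""
    return mutations / (rate * region_bp)


print(expected_mutations(6000))   # mutations expected since ~6 kya
print(divergence_years(1))        # average waiting time for one mutation
```

Under these figures a lineage accumulates roughly one mutation every 250 years or so across the sequenced region, which gives a sense of the dating resolution available at this sequencing depth.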

Application of compressed sensing to genome wide association studies and genomic selection

Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h² = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h² = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
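
As a toy illustration of sparse recovery when markers outnumber samples, the sketch below fits an L1-penalized regression (lasso) by coordinate descent on simulated data. The abstract does not name its algorithm, so the lasso is an assumption here, chosen as a standard compressed-sensing estimator. The "genotypes" are Gaussian rather than real SNP calls, the trait is noise-free (the h² = 1 case), and the sample size, marker count, causal indices, penalty, and selection threshold are all arbitrary choices.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the proximal step for the L1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=100):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    norms = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, kept up to date after every update
    for _ in range(sweeps):
        for j in range(p):
            col, old = cols[j], beta[j]
            rho = sum(c * r for c, r in zip(col, resid)) / n + norms[j] * old
            new = soft_threshold(rho, lam) / norms[j]
            if new != old:
                delta = new - old
                for i in range(n):
                    resid[i] -= col[i] * delta
                beta[j] = new
    return beta


def simulate(n=80, p=160, causal=(3, 40, 77), seed=0):
    """Gaussian 'genotypes' and a noise-free trait with unit effects."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(X[i][j] for j in causal) for i in range(n)]
    return X, y


X, y = simulate()
beta = lasso_cd(X, y, lam=0.05)
selected = sorted(j for j, b in enumerate(beta) if abs(b) > 0.5)
print(selected)
```

Rerunning `simulate` with smaller `n` or more causal indices is a quick way to watch selection degrade as the problem moves toward the transition boundary the abstract describes.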

IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics

Marta Rosikiewicz, Marc Robinson-Rechavi
(Submitted on 8 Oct 2013)

Motivation: Microarray results accumulated in public repositories are widely re-used in meta-analytical studies and secondary databases. The quality of the data obtained with this technology varies from experiment to experiment, and an efficient method for quality assessment is necessary to ensure their reliability. Results: The lack of a good benchmark has hampered evaluation of existing methods for quality control. In this study we propose a new independent quality metric that is based on evolutionary conservation of expression profiles. We show, using 11 large organ-specific datasets, that IQRray, a new quality metric developed by us, exhibits the highest correlation with this reference metric among the 14 metrics tested. IQRray outperforms other methods in identifying poor-quality arrays in a dataset composed of arrays from many independent experiments. In contrast, the performance of methods designed for detecting outliers in a single experiment, like NUSE and RLE, was low because of the inability of these methods to detect datasets containing only low-quality arrays, and the fact that the scores cannot be directly compared between experiments. Availability: The R implementation of IQRray is available at: this ftp URL

Let my people go (home) to Spain: a genealogical model of Jewish identities since 1492

Joshua S. Weitz
(Submitted on 7 Oct 2013)

The Spanish government recently announced an official fast-track path to citizenship for any individual who is Jewish and whose ancestors were expelled from Spain during the inquisition-related dislocation of Spanish Jews in 1492. It would seem that this policy targets a small subset of the global Jewish population, i.e., restricted to individuals who retain cultural practices associated with ancestral origins in Spain. However, the central contribution of this manuscript is to demonstrate how and why the policy is far more likely to apply to a very large fraction (i.e., the vast majority) of Jews. This claim is supported using a series of genealogical models that include transmissible “identities” and preferential intra-group mating. Model analysis reveals that even when intra-group mating is strong and even if only a small subset of a present-day population retains cultural practices typically associated with that of an ancestral group, it is highly likely that nearly all members of that population have direct genealogical links to that ancestral group, given that a sufficient number of generations have elapsed. The basis for this conclusion is that not having a link to an ancestral group must be a property of all of an individual’s ancestors, the probability of which declines (nearly) superexponentially with each successive generation. These findings highlight unexpected incongruities induced by genealogical dynamics between present-day and ancestral identities.
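
The superexponential decline can be seen in a deliberately crude calculation. Suppose a fraction f of the relevant population g generations back belonged to the ancestral group, and treat an individual's 2^g ancestral slots as independent draws from that population. This ignores intra-group mating and pedigree collapse, both of which the manuscript's models address, so it is only an illustration of the "all ancestors must lack the link" logic, with made-up values of f.

```python
def prob_no_link(frac, generations):
    """P(no ancestor in the group) under independent draws:
    (1 - frac) raised to the number of ancestral slots, 2**generations."""
    return (1.0 - frac) ** (2 ** generations)


# Illustrative only: 1% of the ancestral population belongs to the group.
for g in (1, 5, 10, 15):
    print(g, prob_no_link(0.01, g))
```

Because the exponent doubles every generation, even a small f drives the probability of having no link toward zero within a modest number of generations.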

Neighbor Joining Plus – algorithm for phylogenetic tree reconstruction with proper nodes assignment

Piotr Plonski, Jan P. Radomski
(Submitted on 8 Oct 2013)

Most major algorithms for phylogenetic tree reconstruction assume that sequences in the analyzed set either do not have any offspring, or that parent sequences can mutate into at most two descendants. The graph resulting from such assumptions therefore forms a binary tree, with all the sampled sequences placed at the leaves. However, these constraints are unduly restrictive, as there are numerous data sets with multiple offspring of the same ancestors. Here we propose a solution to analyze and visualize such sets in a more intuitive manner. The method reconstructs the phylogenetic tree by assigning the sequences with offspring as internal nodes, and the sequences without offspring as leaf nodes. In the resulting tree there is no constraint on the number of adjacent nodes, which means that the solution tree need not be binary. The subsequent derivation of evolutionary pathways and pair-wise mutations is then algorithmically straightforward, with each edge’s length corresponding directly to the number of mutations. Other tree reconstruction algorithms can be extended in the proposed manner to also give unbiased topologies.
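
The sketch below is not Neighbor Joining Plus. It swaps in a much simpler technique, a minimum spanning tree over Hamming distances built with Prim's algorithm, purely to show what a tree looks like when sampled sequences may sit at internal nodes, when a node may have more than two neighbours, and when edge lengths count mutations. The sequence names and the toy sequences are invented.

```python
# Hypothetical aligned sequences; "anc" has three sampled offspring.
seqs = {
    "anc": "AAAA",
    "a":   "AAAT",
    "b":   "AACA",
    "c":   "GAAA",
    "d":   "AATT",
}


def hamming(x, y):
    """Number of mismatching positions between two aligned sequences."""
    return sum(p != q for p, q in zip(x, y))


def mst_edges(seqs):
    """Prim's algorithm: returns (mutations, parent, child) edges."""
    names = list(seqs)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = min(
            (hamming(seqs[u], seqs[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges


def node_roles(edges):
    """Split nodes into internal (degree > 1) and leaf (degree == 1)."""
    degree = {}
    for _, u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    internal = {n for n, k in degree.items() if k > 1}
    leaves = {n for n, k in degree.items() if k == 1}
    return internal, leaves


edges = mst_edges(seqs)
internal, leaves = node_roles(edges)
print(edges)
print("internal:", sorted(internal), "leaves:", sorted(leaves))
```

In this toy set the sampled sequence "anc" ends up as an internal node with three neighbours, something a strictly binary, leaves-only tree cannot represent.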