Present Y chromosomes support the Persian ancestry of Sayyid Ajjal Shams al-Din Omar and Eminent Navigator Zheng He

Chuan-Chao Wang, Ling-Xiang Wang, Manfei Zhang, Dali Yao, Li Jin, Hui Li
(Submitted on 21 Oct 2013)

Sayyid Ajjal is the ancestor of many Muslims in areas all across China. And one of his descendants is the famous Navigator of Ming Dynasty, Zheng He, who led the largest armada in the world of 15th century. The origin of Sayyid Ajjal’s family remains unclear although many studies have been done on this topic of Muslim history. In this paper, we studied the Y chromosomes of his present descendants, and found they all have haplogroup L1a-M76, proving a southern Persian origin.

Thoughts on: Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

I (@joe_pickrell) was recently asked to review a preprint by Decker et al., Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle, for a journal. Below are the comments I sent the journal.

In this paper, the authors apply a suite of population genetics analyses to a set of cattle breeds. The basic data consist of around 1,500 individuals from 143 breeds typed at around 40,000 SNPs. The authors use these data to build population trees/graphs using TreeMix and visualize population structure with PCA/ADMIXTURE. They then interpret the results of these programs in light of their knowledge of the history of cattle domestication. I had no knowledge of cattle history prior to reading this manuscript, so I enjoyed reading it. I first have a few comments on the manuscript as a whole, then on individual points.

Overall comments:

1. A lot of interpretation depends on the robustness of the inferred population graph from TreeMix. It would be extremely helpful to see that the estimated graph is consistent across different random starting points. The authors could run TreeMix, say, five different times, and compare the results across runs. I expect that many of the inferred migration edges will be consistent, but a subset will not. It’s probably most interesting to focus interpretation on the edges that are consistent.
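The comparison across runs suggested in comment 1 could be organized along these lines. This is only a minimal sketch: it assumes the migration edges from each TreeMix run have already been extracted by hand into (source, target) pairs, and the function names and the population labels in the example are hypothetical placeholders, not results from the manuscript.

```python
from collections import Counter


def edge_support(runs):
    """Count in how many runs each migration edge (source, target) appears."""
    counts = Counter()
    for edges in runs:
        counts.update(set(edges))  # count each edge at most once per run
    return counts


def consistent_edges(runs, min_fraction=1.0):
    """Return the edges found in at least min_fraction of the runs, sorted."""
    support = edge_support(runs)
    threshold = min_fraction * len(runs)
    return sorted(edge for edge, n in support.items() if n >= threshold)


# Hypothetical edges from five runs with different random seeds.
runs = [
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("E", "F")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("C", "D"), ("E", "F")},
]
print(consistent_edges(runs))       # edges present in every run
print(consistent_edges(runs, 0.8))  # edges present in at least 4 of 5 runs
```

Matching edges by exact (source, target) labels is a simplification; in practice the source of a migration edge is often an internal branch, so some rule for deciding when two edges count as "the same" would be needed.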

2. Throughout the manuscript, inference from genetics is mixed in with evidence from other sources. At points it becomes unclear which claims are made strictly from genetics and which are not. For example, the authors write, “Anatolian breeds are admixed between European, African, and Asian cattle, and do not represent the populations originally domesticated in the region”. It seems possible that the first part of that statement (about admixture) could be their conclusion from the genetic data, but it’s difficult to make the second statement (about the original populations in the region) from genetics, so presumably this is based on other sources. In general, I would suggest splitting the results internal to this paper apart from the other statements and making a clear firewall between their results and the historical interpretation of the results. Right now the authors have a “Results and Discussion” section; it might be easiest to do this by splitting the “Results” from the “Discussion”, but this is up to the authors.

3. Related to the above point, could the authors add subsection headings to the results/discussion section? Right now the topic of the paper jumps around considerably from paragraph to paragraph, and at points I had difficulty following. One possibility would be to organize subheadings by the claims made in the abstract, e.g. “Cline of indicine introgression into Africa”, “wild African auroch ancestry”, etc.

Specific comments:

There are quite a few results claimed in this paper, so I’m going to split my comments apart by the results reported in the abstract. As mentioned above, it would be nice if the authors clearly stated exactly which pieces of evidence they view as supporting each of these, perhaps in subheadings in the Results section. Each relevant sentence from the abstract is quoted on its own line, followed by my thoughts:

Using 19 breeds, we map the cline of indicine introgression into Africa.

This claim is based on interpretation of the ADMIXTURE plot in Figure 5. I wonder if a map might make this point more clearly than Figure 5, however; the three-letter population labels in Figure 5 are not very easy to read, especially since most readers will have no knowledge of the geographic locations of these breeds.

We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine.

This claim appears to be largely based on the interpretation of the TreeMix plot in Figure 4. This figure shows an admixture edge from the ancestors of the European breeds into the African breeds. As noted above, it seems important that this migration edge be robust across different TreeMix runs. Also, labeling this ancestry as “wild African auroch ancestry” seems like an interpretation of the data rather than something that has been explicitly tested, since the authors don’t have wild African aurochs in their data.

Additionally, the authors claim that this result shows “there was not a third domestication process, rather there was a single origin of domesticated taurine…”. I may be missing something, but it seems that genetic data cannot distinguish whether a population was “domesticated” or “wild”. That is, it seems plausible that the source population tentatively identified in Figure 4 may have been independently domesticated. There may be other sources of evidence that refute this interpretation, but this is another example of where it would be useful to have a firewall between the genetic results and the interpretation in light of other evidence. The speculation about the role of disease resistance in introgression is similarly not based on evidence from this paper and should probably be set apart.

We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia.

The cline of taurine/indicine hybridization is based on interpretation of ADMIXTURE plots and some follow-up f4 statistics. I found this difficult to follow, especially since a significant f4 statistic can have multiple interpretations. Perhaps the authors could draw out the proposed phylogeny for these breeds and explain the reasons they chose particular f4 statistics to highlight.
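To make the ambiguity concrete, here is a minimal sketch of an f4 statistic computed from allele frequencies. It assumes each population is represented by a list of allele frequencies at the same SNPs; the function name and the example frequencies are illustrative, and assessing significance (for example with a block jackknife) is deliberately left out.

```python
def f4(a, b, c, d):
    """Naive f4(A, B; C, D): the mean over SNPs of (a - b) * (c - d)."""
    if not a or not (len(a) == len(b) == len(c) == len(d)):
        raise ValueError("need four non-empty frequency lists of equal length")
    total = sum((ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d))
    return total / len(a)


# Hypothetical allele frequencies at two SNPs in four populations.
a = [0.2, 0.6]
b = [0.4, 0.2]
c = [0.5, 0.9]
d = [0.3, 0.5]
print(f4(a, b, c, d))
```

If the four populations are related by the unrooted tree ((A,B),(C,D)) with no admixture, the expectation of this quantity is zero. A significant departure from zero says that this tree does not fit, but not which of the four populations is responsible, which is why the proposed phylogeny behind each test needs to be stated.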

We also identify the influence of species other than Bos taurus in the formation of Asian breeds.

The conclusion that species other than Bos taurus have introgressed into Asian breeds seems to be based on interpretation of branch lengths in the trees in Figures 2-3 and some f3 statistics. The interpretation of branch lengths is extremely weak evidence for introgression, probably not even worth mentioning. The f3 statistics are potentially quite informative though. For the breeds in question (Brebes and Madura), which pairs of populations give the most negative f3 statistics? This is difficult information to extract from Supplementary Table 2, where the populations appear to be sorted alphabetically. A table showing (for example) the five most negative f3 statistics could be quite useful here. In general, if the SNP ascertainment scheme is not extremely complicated (can the authors describe the ascertainment scheme for this array?), a negative f3 statistic is very strong evidence that a target population is admixed, while a significant f4 statistic only means that at least one of the four populations in the statistic is admixed. This might be a useful property for the authors.
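The kind of table requested here could be produced roughly as follows. This is a sketch under stated assumptions: allele frequencies are available per population as equal-length lists, the estimator is the naive one without the finite-sample correction used in practice, and the function names and population labels ("T", "P1", and so on) are hypothetical.

```python
from itertools import combinations


def f3(c, a, b):
    """Naive f3(C; A, B): the mean over SNPs of (c - a) * (c - b)."""
    if not c or not (len(a) == len(b) == len(c)):
        raise ValueError("need three non-empty frequency lists of equal length")
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)


def most_negative_f3(target, freqs, k=5):
    """Return the k reference pairs giving the lowest f3(target; X, Y)."""
    refs = sorted(name for name in freqs if name != target)
    scores = [
        (f3(freqs[target], freqs[x], freqs[y]), x, y)
        for x, y in combinations(refs, 2)
    ]
    return sorted(scores)[:k]


# Hypothetical allele frequencies at two SNPs; "T" is the target population.
freqs = {
    "T": [0.5, 0.5],
    "P1": [0.1, 0.9],
    "P2": [0.9, 0.1],
    "P3": [0.4, 0.6],
}
for value, x, y in most_negative_f3("T", freqs):
    print(f"f3(T; {x}, {y}) = {value:.3f}")
```

In this toy example the target's frequencies lie between those of P1 and P2 at both SNPs, so that pair gives the most negative value, which is the pattern expected when a target is a mixture of populations related to the two references.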

We detect the pronounced influence of Shorthorn cattle in the formation of European breeds.

This conclusion appears to be based on interpretation of ADMIXTURE plots in Figures S6-S9. Interpreting these types of plots is notoriously difficult. I wonder if the f3 statistics might be useful here: do the authors get negative f3 statistics in the populations they write “share ancestry with Shorthorn cattle” when using the Durham shorthorns as one reference?

Iberian and Italian cattle possess introgression from African taurine.

This conclusion is based on ADMIXTURE plots and TreeMix; it would be interesting to see the results from f3 statistics as well.

American Criollo cattle are shown to be of Iberian, and not African, descent.

I found this difficult to follow: the authors write that these breeds “derive 7.5% of their ancestry from African taurine introgression”, so presumably they are in fact partially of African descent?

Indicine introgression into American cattle occurred in the Americas, and not Europe

This conclusion seems difficult to make from genetic data. The authors identify “indicine” ancestry in American cattle, so I don’t see how they can determine whether this happened before or after a migration without temporal information. It would be helpful if the authors walked the reader through each logical step they’re making so that the reader can decide whether they believe the evidence for each step.

Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction

Chuan-Chao Wang, Ling-Xiang Wang, Rukesh Shrestha, Shaoqing Wen, Manfei Zhang, Xinzhu Tong, Li Jin, Hui Li
(Submitted on 21 Oct 2013)

Short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs) are two kinds of commonly used markers in Y chromosome studies of forensic and population genetics. There has been increasing interest in the cost saving strategy by using the STR haplotypes to predict SNP haplogroups. However, the convergence of Y chromosome STR haplotypes from different haplogroups might compromise the accuracy of haplogroup prediction. Here, we compared the worldwide Y chromosome lineages at both haplogroup level and haplotype level to search for the possible haplotype similarities among haplogroups. The similar haplotypes between haplogroups B and I2, C1 and E1b1b1, C2 and E1b1a1, H1 and J, L and O3a2c1, O1a and N, O3a1c and O3a2b, and M1 and O3a2 have been found, and those similarities reduce the accuracy of prediction.

Sex-specific recombination rates and allele frequencies affect the invasion of sexually antagonistic variation on autosomes

Minyoung Wyman, Mark Wyman
(Submitted on 19 Oct 2013)

The introduction and persistence of novel sexually antagonistic alleles can depend upon factors that differ between males and females. Understanding the conditions for invasion in a two-locus model can elucidate these processes. For instance, selection can act differently upon the sexes, or sex-linkage can facilitate the invasion of genetic variation with opposing fitness effects between the sexes. Two factors that deserve further attention are recombination rates and allele frequencies — both of which can vary substantially between the sexes. We find that sex-specific recombination rates in a two-locus diploid model can affect the invasion outcome of sexually antagonistic alleles and that the sex-averaged recombination rate is not necessarily sufficient to predict invasion. We confirm that the range of permissible recombination rates is smaller in the sex benefitting from invasion and larger in the sex harmed by invasion. However, within the invasion space, male recombination rate can be greater than, equal to, or less than female recombination rate in order for a male-benefit, female-detriment allele to invade (and similarly for a female-benefit, male-detriment allele). We further show that a novel, sexually antagonistic allele that is also associated with a lowered recombination rate can invade more easily when present in the double heterozygote genotype. Finally, we find that sexual dimorphism in resident allele frequencies can impact the invasion of new sexually antagonistic alleles at a second locus. Our results suggest that accounting for sex-specific recombination rates and allele frequencies can determine the difference between invasion and non-invasion of novel sexually antagonistic alleles in a two-locus model.

The Functional Consequences of Variation in Transcription Factor Binding

Darren A. Cusanovich, Bryan Pavlovic, Jonathan K. Pritchard, Yoav Gilad
(Submitted on 18 Oct 2013)

One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly play an important role in determining gene expression outputs, yet the regulatory logic underlying functional transcription factor binding is poorly understood. Many studies have focused on characterizing the genomic locations of TF binding, yet it is unclear to what extent TF binding at any specific locus has functional consequences with respect to gene expression output. To evaluate the context of functional TF binding we knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line. We then identified genes whose expression was affected by the knockdowns. We intersected the gene expression data with transcription factor binding data (based on ChIP-seq and DNase-seq) within 10 kb of the transcription start sites of expressed genes. This combination of data allowed us to infer functional TF binding. On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. We found that functional TF binding is enriched in regulatory elements that harbor a large number of TF binding sites, at sites with predicted higher binding affinity, and at sites that are enriched in genomic regions annotated as active enhancers.

Non-monotonic effects of migration in populations with balancing selection

Pierangelo Lombardo, Andrea Gambassi, Luca Dall’Asta
(Submitted on 18 Oct 2013)

Balancing selection is recognized as a prominent evolutionary force responsible for the maintenance of genetic diversity in natural populations. We quantify its influence on the evolution of a subdivided population, investigating how the mean-fixation time (MFT) depends on the migration rate among subpopulations. We identify a threshold in the strength of the balancing selection above which the MFT changes its qualitative behavior compared to that of neutral populations, developing an unexpected non-monotonic dependence on the migration rate. This feature carries over into an analogous behavior of the heterozygosity, which is an index of the biodiversity of the population.

Author post: A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

This author post is by Cyrus Maher and Ryan Hernandez on their preprint A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity, arXived here.

Rigorous evolutionary analysis of protein coding regions often requires high-quality multiple sequence alignments. These alignments can only be generated after the identification of orthologous sequences. In our pre-print, “A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity”, we present a novel method that substantially improves the number and quality of detected orthologs, especially in the presence of sequencing error and complex evolutionary processes.

This endeavor grew out of our forthcoming work on the evolutionary impact of ancient pathogens on the human genome. Early on, we observed the decisive influence ortholog quality exerted on our downstream conclusions. As one might imagine, accurate sequence analysis is a fool’s errand if the sequences are, in fact, the wrong ones! Such experiences have impelled us to take a keen interest in orthologs, much as a bad case of gastroenteritis might inspire a sushi chef to become thoroughly attentive to the quality of his or her fish.

Identifying orthologous sequences is referred to as ortholog detection (OD). In brief, existing OD methods can be classified as tree-based, graph-based, or a hybrid of the two. Tree-based methods may use reconciliation techniques between gene and species trees or may rely on the gene tree alone. Graph-based methods can employ a variety of metrics to quantify similarity between sequences. Popular measures include sequence identity and matrix-weighted similarity scores. Syntenic information may also be incorporated in this context.

Here we consider alignments from UCSC MultiZ (MZ), MultiParanoid (MP), translated BLAT (BL), and OMA. To briefly summarize the strengths of the considered methods: MZ utilizes syntenic similarity, MP includes all-by-all similarity in its calculations, OMA considers phylogenetic information directly, and BL does not require an accurately predicted proteome. In Figure 1A of our paper, we illustrate the head-to-head performance of these four popular methods for OD. Interestingly, we find striking complementarity between methods, motivating a search for a practical way to integrate ortholog predictions from methodologically diverse sources.

Figure 1: Comparison of sequence identity levels between methods. (A) Heat map of the percent of orthologs for which BLAT (BL), OMA (OMA), MultiParanoid (MP), and MultiZ (MZ) outperform one another. Performance is based on percent identity of each method’s orthologs to the human sequence. One method is considered to outperform another method if it improves percent identity by at least five percentage points. Text in diagonal cells shows the number of orthologs identified by each method, colored by the percent of transcripts at which a given method outperforms all the others.
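The pairwise comparison described in the caption can be sketched as below. This is not MOSAIC's code: the function name, the input layout, and the choice to compare only orthologs that both methods returned are assumptions made for illustration, and the identity values are made up.

```python
def outperform_matrix(identity, margin=5.0):
    """Percent of shared orthologs where one method beats another.

    identity maps method name -> {ortholog id: percent identity to human}.
    Returns {(m1, m2): percent of orthologs found by both methods for which
    m1's identity exceeds m2's by at least `margin` percentage points}.
    """
    result = {}
    for m1, scores1 in identity.items():
        for m2, scores2 in identity.items():
            if m1 == m2:
                continue
            shared = scores1.keys() & scores2.keys()
            if not shared:
                result[(m1, m2)] = 0.0
                continue
            wins = sum(1 for t in shared if scores1[t] - scores2[t] >= margin)
            result[(m1, m2)] = 100.0 * wins / len(shared)
    return result


# Hypothetical percent identities for four transcripts and two methods.
identity = {
    "BL": {"t1": 90.0, "t2": 80.0, "t3": 70.0, "t4": 95.0},
    "MZ": {"t1": 84.0, "t2": 80.0, "t3": 78.0, "t4": 89.0},
}
matrix = outperform_matrix(identity)
print(matrix[("BL", "MZ")], matrix[("MZ", "BL")])
```

The two directions need not sum to 100 percent, since differences smaller than the margin count for neither method.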

These efforts culminate in the presentation of MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization. MOSAIC is a well-documented Python package that can flexibly integrate ortholog predictions from an arbitrary number of sources. We compare integrated MOSAIC alignments to those generated using each constituent method alone. Relative to the best-performing single method, we show that MOSAIC more than quintuples the number of sequences for which all orthologs of interest are successfully identified (see Figure 2). However, this increase in putative orthologs could be the result of, e.g., the improper inclusion of low-quality or paralogous sequences. This does not appear to be the case for MOSAIC. Crucially, improvements in power are secured while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.

Figure 2: OD power and the effect of pooling methods. (A) The cumulative number of human transcripts as a function of the maximum number of missing species allowed.
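The curve described in this caption is a cumulative count, which can be sketched as follows. The function name and the example counts are hypothetical; the only assumption is that, for each human transcript, we know the number of species for which no ortholog was found.

```python
def cumulative_transcripts(missing_counts, n_species):
    """result[k] = number of transcripts missing orthologs in at most k species."""
    histogram = [0] * (n_species + 1)
    for m in missing_counts:
        if not 0 <= m <= n_species:
            raise ValueError(f"missing count {m} outside 0..{n_species}")
        histogram[m] += 1
    cumulative, running = [], 0
    for count in histogram:
        running += count
        cumulative.append(running)
    return cumulative


# Hypothetical missing-species counts for seven transcripts, nine species.
missing = [0, 0, 1, 3, 9, 2, 0]
print(cumulative_transcripts(missing, n_species=9))
```

The first entry is the number of transcripts for which all orthologs of interest were identified, which is the quantity compared between MOSAIC and the single methods.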

These results are obtained from alignments between the human proteome and orthologs from nine species encompassing a range of primates and closely related mammals. For other sequence sets, the best strategy for method integration may differ slightly depending on, e.g. the level of divergence between species of interest. To account for this, MOSAIC provides several options for scoring and optimization, and even facilitates the specification of user-defined metrics for sequence similarity and cluster optimality.

In the future, we would also like to add functionality to automatically fetch relevant alignments from major ortholog databases. In the meantime, we hope that this tool will prove a useful addition to a variety of evolutionary analysis pipelines. We of course welcome feedback on how we might improve the performance and practical utility of the method. Thank you in advance for your input!

Mutant epigenetic machinery mediates climate adaptation in Arabidopsis thaliana

Xia Shen, Simon Forsberg, Mats Pettersson, Zheya Sheng, Orjan Carlborg
(Submitted on 16 Oct 2013)

The genetic basis of adaptation to climate is largely unknown. We explored the genetic regulation of climate plasticity and its contribution to adaptation using publicly available data from two collections of natural Arabidopsis thaliana accessions from a wide range of habitats. Sixteen loci with plastic alleles were mapped and many of these contained candidate genes with amino acid changes. The Chromomethylase 2 (CMT2) genotype influenced adaptation to seasonal temperature variability and accessions carrying a mutant CMT2 allele disrupting the genome-wide CHH-methylation pattern displayed a more plastic response to climate. We conclude that genetic regulation of plasticity appears to be important for climate adaptation and that genetic variation in the epigenetic machinery, leading to altered genome-wide epigenetic modifications, is one of the underlying molecular mechanisms.

A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects

Chuan Gao, Christopher D Brown, Barbara E Engelhardt
(Submitted on 17 Oct 2013)

One important problem in genome science is to determine sets of co-regulated genes based on measurements of gene expression levels across samples, where the quantification of expression levels includes substantial technical and biological noise. To address this problem, we developed a Bayesian sparse latent factor model that uses a three parameter beta prior to flexibly model shrinkage in the loading matrix. By applying three layers of shrinkage to the loading matrix (global, factor-specific, and element-wise), this model has non-parametric properties in that it estimates the appropriate number of factors from the data. We added a two-component mixture to model each factor loading as being generated from either a sparse or a dense mixture component; this allows dense factors that capture confounding noise, and sparse factors that capture local gene interactions. We developed two statistics to quantify the stability of the recovered matrices for both sparse and dense matrices. We tested our model on simulated data and found that we successfully recovered the true latent structure as compared to related models. We applied our model to a large gene expression study and found that we recovered known covariates and small groups of co-regulated genes. We validated these gene subsets by testing for associations between genotype data and these latent factors, and we found a substantial number of biologically important genetic regulators for the recovered gene subsets.

Y Chromosomes of 40% Chinese Are Descendants of Three Neolithic Super-grandfathers

Shi Yan, Chuan-Chao Wang, Hong-Xiang Zheng, Wei Wang, Zhen-Dong Qin, Lan-Hai Wei, Yi Wang, Xue-Dong Pan, Wen-Qing Fu, Yun-Gang He, Li-Jun Xiong, Wen-Fei Jin, Shi-Lin Li, Yu An, Hui Li, Li Jin
(Submitted on 15 Oct 2013)

Demographic change of human populations is one of the central questions for delving into the past of human beings. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely using molecular clock. We found that all the Paleolithic divergences were binary; however, three strong star-like Neolithic expansions at ~6 kya (thousand years ago) (assuming a constant substitution rate of 1e-9/bp/year) indicates that ~40% of modern Chinese are patrilineal descendants of only three super-grandfathers at that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be related to the development of agriculture.
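As a back-of-the-envelope illustration of the molecular clock quoted in this abstract, the stated substitution rate and sequenced length imply the following. This is only arithmetic on the two numbers given above, not the authors' dating procedure; the constant and function names are made up.

```python
SEQUENCED_BP = 3.9e6        # 3.9 Mbp of NRY sequenced, from the abstract
RATE_PER_BP_YEAR = 1e-9     # assumed constant substitution rate, from the abstract


def expected_mutations(years):
    """Expected number of new mutations on one lineage over `years`."""
    return RATE_PER_BP_YEAR * SEQUENCED_BP * years


def years_for_mutations(n_mutations):
    """Expected time for one lineage to accumulate `n_mutations`."""
    return n_mutations / (RATE_PER_BP_YEAR * SEQUENCED_BP)


print(f"{years_for_mutations(1):.0f} years per mutation")
print(f"{expected_mutations(6000):.1f} mutations expected over 6,000 years")
```

Under these figures a lineage gains roughly one mutation every 256 years across the sequenced region, so lineages descending from an expansion about 6 kya would each carry on the order of 23 derived mutations.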