Author post: Facilitated diffusion buffers noise in gene expression

This guest post is by Radu Zabet on his preprint (with Armin Schoech) Facilitated diffusion buffers noise in gene expression, arXived here.

How does the binding dynamics of transcription factors affect the noise in gene expression?

Transcription factors (TFs) are proteins that bind to DNA and control gene activity. Gene regulation can be modelled as a chemical reaction, which is fundamentally a stochastic process. Given the importance of an accurate control of the gene regulatory program in the cell, significant efforts have been made in understanding the noise properties of gene expression.

Why can noise in gene expression be modelled assuming an ON/OFF gene model?

With few exceptions, previous studies investigated the noise in gene expression assuming that the regulatory process is a two-state Markov model (genes switch stochastically between ON and OFF states). However, it is known that, mechanistically, transcription factors find their genomic target sites through facilitated diffusion, a combination of 3D diffusion in the cytoplasm/nucleoplasm and 1D random walk along the DNA, and this is likely to influence the noise properties of the gene regulation process. Previous experimental studies successfully modelled the noise measured experimentally by assuming an ON/OFF gene model (two-state Markov model) in bacterial and animal cells. In this manuscript, we built a three-state Markov model that accurately models facilitated diffusion, and we showed that for biologically relevant parameters, at least in bacteria (we assumed the lac repressor system), noise in gene expression can be modelled assuming the ON/OFF gene model, but only if the binding/unbinding rates are adjusted accordingly. This explains why in many cases the experimental noise in gene regulation can be modelled assuming an ON/OFF gene model. Note that there are several exceptions where the noise in gene expression does not seem to be accounted for by the ON/OFF gene model.

What is the effect of facilitated diffusion on the noise in gene expression?

Next, assuming the ON/OFF gene model we investigated the evolutionary advantage that a TF, which performs facilitated diffusion, has on noise in gene expression compared to an equivalent TF that only performs the 3D diffusion (and does not perform 1D random walk on the DNA). Our results show that the noise in gene expression can be reduced significantly when the TF performs facilitated diffusion compared to its equivalent TF that only performs 3D diffusion in the cell. This is important, because while the majority of the studies identify the speedup in the binding site search process as the main evolutionary advantage of why facilitated diffusion exists, we show that, in addition to this speedup in binding kinetics, facilitated diffusion also reduces the noise in gene expression. Interestingly, it seems that the noise level in gene expression is reduced close to the noise level of an unregulated gene (the lowest noise level in gene expression that could be achieved), while the noise of an equivalent TF that performs only 3D diffusion is significantly higher.

Finally, to test our model, we parameterised it with values measured experimentally for the lac repressor in E. coli and estimated the mean mRNA level to be 0.16 and the Fano factor (variance divided by mean) to be 1.3 (as opposed to 2.0 in the case of a TF performing only 3D diffusion). These values are similar to the values measured experimentally in the low inducer case of Plac (mean mRNA level of 0.15 and Fano factor of 1.25) and show that facilitated diffusion is essential in explaining the experimentally measured noise in mRNA.
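To make the ON/OFF picture concrete, here is a minimal sketch of the standard steady-state results for a two-state (telegraph) gene model with first-order mRNA degradation. It is not the three-state model from the preprint, and the function name and all rate values are hypothetical, chosen only to show how slower switching at the same mean raises the Fano factor.

```python
def two_state_mrna_stats(k_on, k_off, k_tx, k_deg):
    """Steady-state mRNA mean and Fano factor for an ON/OFF gene.

    k_on  : rate of switching OFF -> ON (e.g. repressor unbinding)
    k_off : rate of switching ON -> OFF (e.g. repressor binding)
    k_tx  : transcription rate while the gene is ON
    k_deg : mRNA degradation rate
    All rates share the same (arbitrary) time unit.
    """
    p_on = k_on / (k_on + k_off)          # fraction of time the gene is ON
    mean = p_on * k_tx / k_deg            # mean mRNA copy number
    # Fano factor = variance / mean; equals 1 for an unregulated (Poisson) gene
    fano = 1.0 + k_tx * k_off / ((k_on + k_off) * (k_on + k_off + k_deg))
    return mean, fano


# Hypothetical rates: same ON fraction (10%), fast versus slow switching.
fast = two_state_mrna_stats(k_on=1.0, k_off=9.0, k_tx=2.0, k_deg=1.0)
slow = two_state_mrna_stats(k_on=0.01, k_off=0.09, k_tx=2.0, k_deg=1.0)
print("fast switching: mean=%.3f Fano=%.3f" % fast)
print("slow switching: mean=%.3f Fano=%.3f" % slow)
```

Both parameter sets give the same mean, but the slowly switching gene is noisier. This is why adjusting the effective binding/unbinding rates matters when a two-state model stands in for a more detailed search process.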

Author post: Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans

This guest post is by Gundula Povysil and Sepp Hochreiter on their preprint Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans, bioRxived here.

We have updated our preprint Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans on bioRxiv: it now presents results for all autosomes and chromosome X, not only for chromosome 1.

In this manuscript we analyze the sharing of very short identity by descent (IBD) segments between humans, Neandertals, and Denisovans to gain new insights into their demographic history. In the updated version we included a separate chromosome X analysis (both IBD segment sharing and length of segments). We identified IBD segments in the 1000 Genomes Project sequencing data using our recently published method HapFABIA, many of which are shared with Neandertals or Denisovans.

Here we highlight the most interesting findings of our analysis:

Introgression from Denisovans into ancestors of Asians:

The Denisova genome most prominently matches IBD segments that are shared by Asians and on average these segments are longer than segments shared between other continental populations and the Denisova genome. Therefore, we could confirm an introgression from Denisovans into ancestors of Asians after their migration out of Africa.

Introgression from Neandertals into ancestors of Europeans and Asians:

While Neandertal-matching IBD segments are most often shared by Asians, Europeans also share a considerably higher percentage of IBD segments with Neandertals compared to other populations. Neandertal-matching IBD segments that are shared by Asians or Europeans are longer than those observed in Africans. These IBD segments hint at a gene flow from Neandertals into ancestors of Asians and Europeans after they left Africa.

Ancient Neandertal and Denisova IBD segments survived only in Africans

Interestingly, many Neandertal- and/or Denisova-matching IBD segments are predominantly observed in Africans, some of them even exclusively. IBD segments shared between Africans and Neandertals or Denisovans are strikingly short, so we assume that they are very old. Consequently, we conclude that DNA regions from ancestors of humans, Neandertals, and Denisovans have survived in Africans.

Neandertal but no Denisova introgression on the X chromosome

Neandertal-matching IBD segments on chromosome X confirm gene flow from Neandertals into ancestors of Asians and Europeans outside Africa. Interestingly, there is hardly any signal of Denisova introgression on the X chromosome.

We highly appreciate any comments, discussions, or thoughts on our results.

Author post: Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris

This guest post is by Gavin Douglas (@gmdougla), Stephen Wright (@stepheniwright), and Tanja Slotte (@tanjaslotte) on their paper Douglas et al. Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris. bioRxived here.

photo credit: Tanja Slotte

In this preprint we investigate the mode of origin and evolutionary consequences of polyploidy in the highly successful tetraploid plant Capsella bursa-pastoris. We analyze high-coverage massively parallel genomic sequence data and first show that C. bursa-pastoris is a recent hybrid of two Capsella lineages leading to C. grandiflora and C. orientalis. This settles a long-standing uncertainty regarding the origins of C. bursa-pastoris. Second, we investigate patterns of nonfunctionalization and gene loss, and while we find little evidence for rapid, massive genome-wide fractionation, our analyses suggest that there is a decrease in the efficacy of selection in this recently formed tetraploid.

Allopolyploid origins of Capsella bursa-pastoris

Determining the evolutionary origin of C. bursa-pastoris has proven to be difficult and many contradictory hypotheses have been suggested, including that the tetraploid is an autopolyploid of a single Capsella species. Part of the complication has been the relatively low levels of sequence divergence between homeologous gene copies, and across the diploid Capsella lineages. Given population genomic sequences from all three Capsella species mentioned, we were able to address this question again with several different approaches.

C. bursa-pastoris undergoes disomic inheritance, meaning that genes duplicated as a result of polyploidy (homeologs) are independently inherited. Thus, one of the major tasks with our genomic data was to partition out the sequences from the two homeologous subgenomes. Because of the low levels of sequence divergence between homeologs (3% on average), this can be a challenging task. We took two approaches to generate phased genome sequence for inferring species origins: de novo assembly of short reads, and phasing of SNPs from mapping reads to the reference genome of the diploid Capsella rubella. Phylogenetic trees generated from de novo assemblies of these species overwhelmingly support one C. bursa-pastoris homeolog forming a clade with C. grandiflora and the other with C. orientalis. The distribution of SNPs and transposable elements shared between these species also strongly supports this hybridization model, and we estimate that the hybridization occurred within the last 100,000-300,000 years.

One reason the hybrid origin of C. bursa-pastoris is exciting is the divergent evolution of its progenitor lineages. C. orientalis and C. grandiflora differ both in their mating system and geographical distribution. Given that C. bursa-pastoris is a highly successful weed found worldwide, it will be interesting in future work to assess whether this divergence between the C. orientalis and C. grandiflora lineages contributed to the tetraploid’s adaptability.

Decreased efficacy of selection in the recently arisen polyploid

Following genome duplications the majority of redundant loci are expected to become lost over time through the process of diploidization. This model has been supported by several ancient polyploid events, including in Arabidopsis. Capsella bursa-pastoris presents an interesting model for studying the early phases of diploidization, and allows for an investigation of the rate of gene loss as well as the relative importance of relaxed selection vs. positive selection during early stages of gene inactivation. We searched for large deletions spanning genes using several approaches, based both on determination of exact breakpoints and on cross-referencing low-coverage regions in C. bursa-pastoris with other Capsella species. Although we identified proportionately more large deletions segregating in C. bursa-pastoris than in the diploids, we did not find evidence for massive genomic changes in the tetraploid.

We were able to demonstrate relaxation of selection by analyzing the site frequency spectrum of SNPs segregating at 0-fold nonsynonymous sites in the three Capsella species. We also investigated SNPs causing putatively deleterious effects, such as premature stop codons, segregating in the three Capsella species. Many of these SNPs are shared between the three species, although segregating at low frequencies in C. grandiflora. Since this shared deleterious variation inherited from progenitors seems to be responsible for a large proportion of the earliest stages of gene degeneration, these data support a model of genome fractionation that is given a “head start” from standing variation. A key message following from this result is that we should be giving more weight to purely historical explanations of gene loss when studying biased fractionation.
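For readers unfamiliar with the site frequency spectrum (SFS), the sketch below shows how one is tabulated from per-SNP derived-allele counts. It is only an illustration of the summary statistic, not the analysis pipeline of the preprint; the function names and the toy counts are hypothetical. Purifying selection tends to hold deleterious variants at low frequency, so a smaller share of rare variants at nonsynonymous sites is one common signature of relaxed selection.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry k-1 is the number of SNPs whose derived allele
    is carried by exactly k of the n sampled chromosomes (k = 1 .. n-1).
    Sites that are not segregating (count 0 or n) are ignored."""
    tally = Counter(derived_counts)
    return [tally.get(k, 0) for k in range(1, n_chromosomes)]


def fraction_rare(sfs, max_count=1):
    """Share of segregating sites with derived-allele count <= max_count."""
    total = sum(sfs)
    return sum(sfs[:max_count]) / total if total else 0.0


# Toy data (hypothetical): derived-allele counts at six sites, n = 6 chromosomes.
toy_counts = [1, 1, 2, 5, 0, 6]
sfs = site_frequency_spectrum(toy_counts, n_chromosomes=6)
print(sfs, fraction_rare(sfs))
```

Comparing `fraction_rare` between nonsynonymous and synonymous sites, or between species, gives a quick first look before any model-based inference.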

Author post: Single haplotype assembly of the human genome from a hydatidiform mole

This guest post is by Karyn Meltz Steinberg on her preprint (with coauthors) Single haplotype assembly of the human genome from a hydatidiform mole, bioRxived here.

The human reference sequence is a mosaic of many DNA sources patched together to create the (mostly) contiguous chromosomes we all use every day in genomics labs. This mélange of haplotypes can result in reference representations that do not exist. For example, in GRCh37 at the MRC1 locus mixed haplotypes led to the presence of two gene models that represent false duplications and a gap that affected alignments of short reads. The problem of diploid source DNA was even worse in regions of structural variation where it was difficult to distinguish allelic variation from paralogous variation. The assembly structure in these regions was often wrong with either a collapsed assembly leading to missing sequence or with haplotype expanse, meaning 2, 3 or more haplotypes were represented on the chromosome sequence. The genomic resources associated with the essentially haploid complete hydatidiform mole, CHM1, have opened the door to allow us to address these issues.

What is essentially haploid? Well, what usually happens is that a sperm (usually bearing an X chromosome) fertilizes an egg that doesn’t have a nucleus (thus no DNA), the sperm DNA doubles, giving the correct chromosome complement, but with both copies of each chromosome being identical. This cell divides and grows but does not form a normal embryo. In the early 1990s, Dr. Urvashi Surti was able to make an immortalized cell line from CHM1 tissue using hTERT. She karyotyped each passage to check that it maintained ploidy and that there were no gross somatic rearrangements.

Dr. Pieter de Jong then created an indexed BAC library from this high fidelity material. These BACs could then be used to resolve structurally complex regions such as 17q21.31 (ref. 1). We continued working on sequencing more tiling paths across structurally complex regions; however, it was not practical or cost-efficient to Sanger sequence every single clone in the BAC library. As work continued, it became clear that developing the Primary Assembly using a single haplotype resource could be very powerful. This was possible due to the efforts of the Genome Reference Consortium (GRC) to extend the assembly model to include more than one sequence representation for a given region. We used Illumina sequencing technology and a reference based assembly algorithm developed at NCBI to produce an initial assembly. We then integrated the BAC sequences into the assembly to improve regions that are nearly impossible to assemble using whole genome strategies. The result is the highest quality whole-genome-sequence assembly of the human genome that is publicly available to date, as assessed by metrics including contig and scaffold N50, repetitive element content and gene annotation.
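Since contig and scaffold N50 are among the metrics mentioned above, here is a minimal sketch of how N50 is conventionally computed: it is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative only and are not taken from the CHM1 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("n50 needs at least one length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # half the assembly is covered
            return length


# Toy example: total length 32, so half is 16; the two contigs of length 8 reach it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A higher N50 means fewer, longer pieces, but it says nothing about correctness, which is why it is reported alongside measures such as repeat content and gene annotation.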

So what–how can this help us, you ask? For the first time, one single haplotype of the human genome is represented. The fact that CHM1 is haploid means that we are able to finally go into the messy regions of the genome and resolve the genomic architecture as well as put any structural variation in the context of surrounding linked allelic variation. These are often biologically interesting regions; for example genes related to immune response and metabolism that are probably associated with complex traits are usually members of large gene families in segmentally duplicated sequence.

A fine example of the power of this haploid resource is also on BioRxiv, “Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity” (disclosure: this is also co-authored by me). The IG light chain genes encode one part of the immunoglobulin molecules that are expressed by B cells in response to antigenic stimulation (the heavy chain, IGH, was also resolved using CHM1 resources last year; ref. 2). They are part of large gene families formed by duplication at three loci in the human genome. In previous versions of the reference assembly, these loci were composed of sequence from multiple DNA sources that may have undergone somatic rearrangement. By sequencing BAC clones in a tiling path across these loci we now have a single haplotype representation of germline DNA sequence that allows us to perform accurate analyses of variation.


1 Itsara, A. et al. Resolving the breakpoints of the 17q21.31 microdeletion syndrome with next-generation sequencing. American journal of human genetics 90, 599-613, doi:10.1016/j.ajhg.2012.02.013 (2012).
2 Watson, C. T. et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. American journal of human genetics 92, 530-546 (2013).

Author post: Predicting evolution from the shape of genealogical trees

This guest post by Richard Neher discusses his preprint Predicting evolution from the shape of genealogical trees. Richard A. Neher, Colin A. Russell, Boris I. Shraiman. arXived here. This is cross-posted from the Neher lab website.

In this preprint (a collaboration with Colin Russell and Boris Shraiman) we show that it is possible to predict which individual from a population is most closely related to future populations. To this end, we have developed a method that uses the branching pattern of genealogical trees to estimate which part of the tree contains the “fittest” sequences, where fit means rapidly multiplying. Those that multiply rapidly are most likely to take over the population. We demonstrate the power of our method by predicting the evolution of seasonal influenza viruses.

How does it work?
Individuals adapt to a changing environment by accumulating beneficial mutations, while avoiding deleterious mutations. We model this process assuming that there are many such mutations which change fitness in small increments. Using this model, we calculate the probability that an individual that lived in the past at time t leaves n descendants in the present. This distribution depends critically on the fitness of the ancestral individual. We then extend this calculation to the probability of observing a certain branch in a genealogical tree reconstructed from a sample of sequences. A branch in a tree connects an individual A that lived at time tA and had fitness xA with an individual B that lived at a later time tB with fitness xB, as illustrated in the figure. B has descendants in the sample, otherwise the branch would not be part of the tree. Furthermore, all sampled descendants of A are also descendants of B, otherwise the connection between A and B would have branched between tA and tB. We call the mathematical object describing fitness evolution between A and B the “branch propagator” and denote it by g(xB,tB|xA,tA). The joint probability distribution of fitness values of all nodes of the tree is given by a product of branch propagators. We then calculate the expected fitness of each node and use it to rank the sampled sequences. The top ranked sequence is our prediction for the sequence of the progenitor of the future population.
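The product structure described in words above can be written compactly. The following is only a sketch in the notation of this post: the indexing of nodes and branches is an assumption made for illustration, and normalization and any root-specific factor are omitted, so the preprint should be consulted for the full expressions.

```latex
% Joint distribution of node fitnesses as a product over tree branches,
% where each branch runs from a parent node A to a child node B:
P\bigl(\{x_i\}\bigr) \;\propto\; \prod_{(A \to B)\,\in\,\text{branches}} g\bigl(x_B, t_B \mid x_A, t_A\bigr)

% Expected fitness of node i, used to rank the sampled sequences:
\langle x_i \rangle \;=\; \int x_i \, P\bigl(\{x_j\}\bigr) \prod_j \mathrm{d}x_j
```

Ranking then amounts to sorting the sampled nodes by their expected fitness and taking the top one as the predicted progenitor.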

Why do we care?
Being able to predict evolution could have immediate applications. The best example is the seasonal influenza vaccine, which needs to be updated frequently to keep up with the evolving virus. Vaccine strains are chosen among sampled virus strains, and the more closely this strain matches the future influenza virus population, the better the vaccine is going to be. Hence by predicting a likely progenitor of the future, our method could help to improve influenza vaccines. One of our predictions is shown in the figure, with the top ranked sequence marked by a black arrow. Influenza is not the only possible application. Since the algorithm only requires a reconstructed tree as input, it can be applied to other rapidly evolving pathogens or cancer cell populations. In addition to being useful, the ability to predict also implies that the model captures an essential aspect of evolutionary dynamics: influenza evolution is to a substantial degree (enough to enable prediction) dependent on the accumulation of small effect mutations.

Comparison to other approaches
Given the importance of good influenza vaccines, there have been a number of previous efforts to anticipate influenza virus evolution, typically based on using patterns of molecular evolution from historical data. Along these lines, Luksza and Lässig have recently presented an explicit fitness model for influenza virus evolution that rewards mutations at positions known to convey antigenic novelty and penalizes likely deleterious mutations (plus a few other things). By using influenza-specific molecular signatures, this model is complementary to ours, which uses only the tree reconstructed from nucleotide sequences. Interestingly, the two models do more or less equally well, and combining different methods of prediction should give more reliable results.

Author post: Long non-coding RNAs as a source of new peptides

This post is by M. Mar Albà on her preprint (with co-authors) available from arXiv, Long non-coding RNAs as a source of new peptides.

Several recent studies based on deep sequencing of ribosome protected fragments have reported that many long non-coding RNAs (lncRNAs) associate with ribosomes (see for example Everything old is new again: (linc)RNAs make proteins! a comment by Stephen M Cohen). We have analyzed the original data from experiments performed in six different eukaryotic species and confirmed that this is a widespread phenomenon. This is paradoxical because lncRNAs apparently have very little coding capacity with only short open reading frames (ORFs) that do not show sequence similarity to known proteins.

In contrast to typical mRNAs, many lncRNAs are lineage-specific. Therefore, if they are translated, they should be similar to recently evolved protein-coding genes. This is exactly what we have found. It turns out that transcripts encoding young proteins show very similar properties to lncRNAs: short and non-conserved ORFs, low coding sequence potential, and relatively weak selective constraints.

Evidence has accumulated in recent years that new protein-coding genes are continuously evolving (The continuing evolution of genes by Carl Zimmer). The birth of a new functional protein is a process of trial and error that most likely requires the expression of many transcripts that will not survive the test of time. LncRNAs seem to fit the bill for this role.

Author post: Diversity and evolution of centromere repeats in the maize genome

This guest post is by Paul Bilinski on his paper with coauthors Diversity and evolution of centromere repeats in the maize genome BioRxived here.

Centromeres have the potential to play a central role in speciation, yet our ability to study them has been limited because of their repetitive nature. The centromeres of many eukaryotes consist partly of large arrays of short tandem repeats, though the actual sequence of the repeat varies widely across taxa. To investigate whether the variation found in the tandem repeats themselves could inform our understanding of their evolutionary history, we made use of the reference maize genome as well as resequencing data from several lines of maize and its wild relative teosinte.

Although tandem repeats should be identical upon duplication, our analysis of CentC in maize revealed that most copies genome-wide are unique. We observed only three instances where adjacent copies were identical in sequence and length, driving home the idea that these tandem repeats have accumulated immense diversity. Given such diversity, we wanted to investigate genetic relatedness across CentC copies.

Using positional and genetic relatedness information from the fully-sequenced centromeres 2 and 5, we found high within-cluster similarity, suggesting that tandem duplications drove most CentC copy number increase. Contrary to patterns seen in Arabidopsis (Kawabe and Nasuda 2005), principal coordinate analysis of repeats found no clustering by chromosome, with groups of CentC with similar sequence distributed across all of the chromosomes.

Another surprising discovery involved the origin of the biggest arrays of CentC. As an ancient tetraploid, maize originally had 20 chromosomes with 20 centromeres. Processes of fractionation and rearrangement have led to the 10 chromosomes in the extant maize genome. Schnable et al (2011) were able to identify which chromosomal segments derive from each of maize’s ancient parents, referred to as subgenomes 1 and 2. Wang and Bennetzen (2011) built on this information, and found that about half of the modern centromeres came from each parent. Inferring subgenome of origin by flanking regions, we found that all of the CentC clusters >20kb in length derive from subgenome 1. The proportions are less skewed when looking at clusters >10kb, though in all cases we see more bp of CentC assigned to subgenome 1 than we expect based on its total bp in the genome. This is particularly interesting because subgenome 1 also shows higher overall gene expression and fewer deletions than subgenome 2 (Schnable et al 2011).

The diversity of CentC seen might suggest that CentC repeats were reasonably static in the genome, persisting in the same spot for a long time with occasional increases in copy number via tandem duplication. However, fluorescent in situ hybridization suggested that domestication resulted in a large loss of CentC signal across many of maize’s 10 chromosomes. We confirmed and quantified the loss of CentC using resequencing data from a set of maize and teosinte lines (Chia et al. 2012).

Combined, our results suggest long term stability of CentC clusters with new copies arising from tandem duplication, while mutation serves to homogenize rather than separate clusters. We hope our insights into centromere repeat evolution will build toward a better understanding of their role in evolution.

Author post: When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis

This guest post is by Justin Blumenstiel on his preprint (with co-authors) When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis, available from bioRxiv here.

Does the activation of one transposable element (TE) family typically lead to the activation of many? If so, this would indicate a synergism between different TE families with significance for TE dynamics in natural populations. A standard model of TE dynamics typically takes into account population size, transposition rate (which may vary based on host defense) and selection against TE insertions. If the mobilization of one TE can lead to the mobilization of others, the transposition rate of one TE family could influence the transposition rate of others.

Hybrid dysgenic syndromes in Drosophila are an important model for TE dynamics when one TE family becomes mobilized. In the 1980s, it was generally concluded that P element dysgenesis did not lead to mobilization of other TEs. However, studies in the D. virilis system of hybrid dysgenesis indicated otherwise. More recently, an analysis of transposition in the P element system indicated movement of elements other than the P element. Thus, it appears that co-mobilization may be a common feature of dysgenic syndromes.

What is the mechanism of co-mobilization? For the P element system, studies by William Theurkauf and colleagues point to the DNA damage response as key. Specifically, via Chk2 kinase, DNA damage signaling leads to perturbed piRNA biogenesis, which in turn leads to the activation of other elements under control of piRNA-based silencing. Does this mechanism also apply to other systems?

To study the mechanism of TE co-mobilization in the D. virilis system, we performed small RNA sequencing and mRNA sequencing experiments using germline material of reciprocal females of the dysgenic and non-dysgenic crosses. In contrast to the P element and I element systems, hybrid dysgenesis in D. virilis is more complex. For one, there is not a single element that has been proven to be the sole cause. This was previously shown, but in this study we identified several more elements that likely contribute. From small RNA sequencing, we find that TE mis-expression persists in the progeny of the dysgenic cross, without a persisting global defect in piRNA biogenesis. Rather, it appears that piRNA biogenesis defects are idiosyncratic across different TE families. Interestingly, we also find evidence that piRNA silencing loses specificity in the dysgenic cross, with some highly expressed genes becoming non-specific targets.

Overall, this study provided several insights, but the mechanism of co-mobilization in the D. virilis system remains unknown. The complexity of this syndrome makes it a challenge for study, but it may provide significant insight into genome dynamics of hybrids whose parents differ for more than one TE family. Future genetic analysis may allow us to determine the role of the DNA damage response in maintaining the activity of some TE families, but not others.

Author post: Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans

This guest post is by Rebekah Rogers (@evolscientist) on her paper with coauthors “Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans” arXived here.

Tandem duplications are widely recognized as a source of genetic novelty. Duplication of gene sequences can result in adaptive evolution through the development of novel functions or specialization in subsets of ancestral functions when ‘spare parts’ are relieved of evolutionary constraints. Additionally, tandem duplications have the potential to create entirely novel gene structures through chimeric gene formation and recruitment of formerly non-coding sequence. Here, we survey the limits of standing variation for tandem duplications in natural populations of D. yakuba and D. simulans, estimate the upper bound of mutation rates, and explore their role in rapid evolution.

Tandem duplicates on the X chromosome in D. simulans show an excess of high frequency variants consistent with adaptive evolution through tandem duplication. Furthermore, we identify an overrepresentation of genes involved in rapidly evolving phenotypes such as chorion development and oogenesis, drug and toxin metabolism, chitin cuticle formation, chemosensory processes, lipases and endopeptidases expressed in male reproduction, as well as immune response to pathogens in both D. yakuba and D. simulans. The enrichment of such rapidly evolving functional classes points to a role for tandem duplicates in Red Queen dynamics and responses to strong selective pressures.

In spite of the observed concordance across functional classes, we observe few duplicated genes that are shared across species, indicating that parallel recruitment of tandem duplications is rare. The span of duplicates in the population is quite limited, and we estimate that less than 15% of the genome is represented among the tandem duplications segregating in the entire population for the species. Moreover, many duplicates are present at low frequency and will have difficulty escaping the forces of drift during selective sweeps. This very limited standing variation combined with low mutation rates for tandem duplications results in severe limitations in the substrate of genetic novelty that is available for adaptation.

Thus, the limits of standing variation and the rate of new mutations are expected to play a vital role in defining evolutionary trajectories and the ability of organisms to adapt in the event of gross environmental change. Given the limited substrate of genetic novelty, we expect that if adaptation is dependent upon gene duplications, suboptimal outcomes in adaptive walks will be common, long wait times will occur for new phenotypic changes, and many multicellular eukaryotes will display limited ability to adapt to rapidly changing environments.

Author post: Spatial localization of recent ancestors for admixed individuals

A guest post by Bogdan Pasaniuc [@bpasaniuc] on his paper with coauthors: Spatial localization of recent ancestors for admixed individuals by Wen-Yun Yang, Alexander Platt, Charleston Wen-Kai Chiang, Eleazar Eskin, John Novembre, Bogdan Pasaniuc. bioRxived here.

Geographic localization based on genetic data has received much attention recently. Here we present a preprint that aims to address one of the drawbacks of existing approaches. As opposed to existing works that typically make a very strong assumption that all recent ancestors come from the same location on a map, we seek to infer multiple locations for a given individual corresponding to its ancestors. That is, our approach uses genetic data from a given individual to localize on the map its recent ancestors several generations ago (e.g. grandparents).

To accomplish this we approximate the admixture process (i.e. mixing of genetic variants from different sources) in a genetic-geographic continuum. We view the mixed ancestry genome as being generated from several locations on a map (corresponding to its recent ancestors) and model the mosaic structure of local ancestries across the genome through an admixture HMM. We link geography to the admixture process by allowing allele frequencies at every site in the genome to vary across geography according to a logistic gradient function (as in SPA[1]); the complete model is an admixture HMM for a genotype-specific pair of ancestral locations on the map.
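As a concrete illustration of the logistic gradient idea, the sketch below maps a 2D map position to an allele frequency through a logistic function of a linear gradient, and then averages over a small set of ancestral locations. The function names, the per-site slope and intercept values, and the simple weighted average are assumptions made for illustration; the actual model couples these frequencies to an admixture HMM over local ancestry.

```python
import math


def allele_frequency(location, slope, intercept):
    """Allele frequency at a map position under a logistic gradient.

    location  : (x, y) coordinates on the map
    slope     : (a_x, a_y), direction and steepness of the gradient at this site
    intercept : offset b at this site
    """
    x, y = location
    a_x, a_y = slope
    return 1.0 / (1.0 + math.exp(-(a_x * x + a_y * y + intercept)))


def mixed_allele_frequency(locations, weights, slope, intercept):
    """Expected allele frequency for a genome drawn from several ancestral
    locations, each contributing the given fraction (weights sum to 1)."""
    return sum(w * allele_frequency(loc, slope, intercept)
               for loc, w in zip(locations, weights))


# Hypothetical site with a gradient along the x axis, two ancestral locations.
site_slope, site_intercept = (2.0, 0.0), 0.0
ancestors = [(0.0, 0.0), (1.0, 0.0)]
print(allele_frequency(ancestors[0], site_slope, site_intercept))
print(mixed_allele_frequency(ancestors, [0.5, 0.5], site_slope, site_intercept))
```

Because the frequency is a smooth function of position, the ancestral locations can in principle be searched for by optimizing the likelihood of the observed genotypes over the map.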

As the number of generations since admixture increases, the total number of ancestors to localize increases dramatically, making the inference infeasible. To account for this, we limit the number of different “ancestry locations” that contribute to admixture to a small constant, each with a varying amount of contribution. We devise efficient algorithms to make inferences in our model and show that accuracy decreases with the number of locations to infer, with the number of generations since admixture, and with geographic distance among ancestors. For example, our method, SPAMIX, can localize the grandparents of the POPRES[2] individuals with multiple sub-continental European ancestries within 470 km of their reported locations.

As with all methods, limitations do exist and we outline several here. We use logistic gradient functions to relate geography to genetics, and investigating more complex functions may prove fruitful. We developed an efficient algorithm for producing point estimates for location and locus-specific ancestry; in some cases a probabilistic output may be desired. Finally, our approach models admixture-LD and assumes no background LD; more involved procedures to model background LD (such as the one we proposed [3]) are an interesting area of research.

1. Yang, Wen-Yun, et al. “A model-based approach for analysis of spatial structure in genetic data.” Nature genetics 44.6 (2012): 725-731.
2. Nelson, Matthew R., et al. “The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research.” The American Journal of Human Genetics 83.3 (2008): 347-358.
3. Baran, Yael, et al. “Enhanced localization of genetic samples through linkage-disequilibrium correction.” The American Journal of Human Genetics 92.6 (2013): 882-894.