Mutation rate estimation for 15 autosomal STR loci in a large population from Mainland China
Zhuo Zhao , Hua Wang , Jie Zhang , Zhi-Peng Liu , Ming Liu , Yuan Zhang , Li Sun , Hui Zhang
doi: http://dx.doi.org/10.1101/015875

Short tandem repeats (STRs) are well known as powerful genetic markers and are widely used in studying human population genetics. Compared with conventional genetic markers, STRs have a higher mutation rate. Additionally, the mutations of STR loci do not lead to genetic inconsistencies between the genotypes of parents and children; therefore, the analysis of STR mutation is better suited to assessing population mutation. In this study, we focused on 15 autosomal STR loci (D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, FGA). DNA samples from a total of 42416 unrelated healthy individuals (19037 trios) from the population of Mainland China, collected between Jan 2012 and May 2014, were successfully investigated. We estimated the allele frequencies, paternal mutation rates, maternal mutation rates and average mutation rates at the 15 STR loci. Furthermore, we investigated the relationship between paternal age, maternal age, pregnant time, area and average mutation rate. We found that the paternal mutation rate is higher than the maternal mutation rate, and that the paternal, maternal, and average mutation rates are positively correlated with paternal age, maternal age and pregnant time, respectively. Additionally, the average mutation rates of coastal areas are higher than those of inland areas. Overall, these results suggest that the 15 autosomal STR loci can provide highly informative polymorphic data for population genetic assessment in Mainland China, and they confirm and extend the application of STR analysis in population genetics.
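As a rough illustration of how a per-locus mutation rate can be estimated from trio data, the sketch below divides the number of observed mutations by the number of meioses and attaches a Wilson score confidence interval. This is not the authors' procedure: the function name and the mutation count in the usage line are hypothetical, and only the number of trios (19037) comes from the abstract.

```python
import math


def mutation_rate_with_ci(mutations, meioses, z=1.96):
    """Return (rate, lower, upper) for a per-locus mutation rate.

    The interval is a Wilson score interval for a binomial proportion;
    z=1.96 corresponds to an approximate 95% interval.
    """
    if meioses <= 0 or not 0 <= mutations <= meioses:
        raise ValueError("need 0 <= mutations <= meioses and meioses > 0")
    p = mutations / meioses
    denom = 1 + z ** 2 / meioses
    centre = (p + z ** 2 / (2 * meioses)) / denom
    half = z * math.sqrt(p * (1 - p) / meioses + z ** 2 / (4 * meioses ** 2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical count: 30 paternal mutations at one locus.
# Each trio contributes one paternal meiosis per locus, hence 19037 meioses.
rate, lower, upper = mutation_rate_with_ci(30, 19037)
print(f"rate={rate:.2e}, 95% CI=({lower:.2e}, {upper:.2e})")
```

The Wilson interval is chosen here because mutation counts are small relative to the number of meioses, a setting in which the simple normal approximation can give a negative lower bound.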

Recent evolution in Rattus norvegicus is shaped by declining effective population size

Eva E Deinum , Daniel L Halligan , Rob W Ness , Yao-Hua Zhang , Lin Cong , Jian-Xu Zhang , Peter D Keightley
doi: http://dx.doi.org/10.1101/015818

The brown rat, Rattus norvegicus, is both a notorious pest and a frequently used model in biomedical research. By analysing genome sequences of 12 wild-caught brown rats from their ancestral range in NE China, along with the sequence of a black rat, R. rattus, we investigate the selective and demographic forces shaping variation in the genome. We estimate that the recent effective population size (N_e) of this species is 1.24 x 10^5, based on silent site diversity. We compare patterns of diversity in these genomes with patterns in multiple genome sequences of the house mouse (Mus musculus castaneus), which has a much larger N_e. This reveals an important role for variation in the strength of genetic drift in mammalian genome evolution. By a Pairwise Sequentially Markovian Coalescent (PSMC) analysis of demographic history, we infer that there has been a recent population size bottleneck in wild rats, which we date to approximately 20,000 years ago. Consistent with this, wild rat populations have experienced an increased flux of mildly deleterious mutations, which segregate at higher frequencies in protein-coding genes and conserved noncoding elements (CNEs). This leads to negative estimates of the rate of adaptive evolution (alpha) in proteins and CNEs, a result which we discuss in relation to the strongly positive estimates observed in wild house mice. As a consequence of the population bottleneck, wild rats also show a markedly slower decay of linkage disequilibrium with physical distance than wild house mice.

Speciation in Heliconius Butterflies: Minimal Contact Followed by Millions of Generations of Hybridisation

Simon Henry Martin , Anders Eriksson , Krzysztof M. Kozak , Andrea Manica , Chris D. Jiggins
doi: http://dx.doi.org/10.1101/015800

Documenting the full extent of gene flow during speciation poses a challenge, as species ranges change over time and current rates of hybridisation might not reflect historical trends. Theoretical work has emphasized the potential for speciation in the face of ongoing hybridisation, and the genetic mechanisms that might facilitate this process. However, elucidating how the rate of gene flow between species may have changed over time has proved difficult. Here we use Approximate Bayesian Computation (ABC) to fit a model of speciation between the Neotropical butterflies Heliconius melpomene and Heliconius cydno. These species are ecologically divergent, rarely hybridize and display female hybrid sterility. Nevertheless, previous genomic studies suggest pervasive gene flow between them, extending deep into their past, and potentially throughout the speciation process. By modelling the rates of gene flow during early and later stages of speciation, we find that these species have been hybridising for hundreds of thousands of years, but have not done so continuously since their initial divergence. Instead, it appears that gene flow was rare or absent for as long as a million years in the early stages of speciation. Therefore, by dissecting the timing of gene flow between these species, we are able to reject a scenario of purely sympatric speciation in the face of continuous gene flow. We suggest that the period of minimal contact early in speciation may have allowed for the accumulation of genomic changes that later enabled these species to remain distinct despite a dramatic increase in the rate of hybridisation.

Quality assessment for different haplotyping methods and GWAS sensitivity to phasing errors

Giovanni Busonera , Marco Cogoni , Gianluigi Zanetti
doi: http://dx.doi.org/10.1101/015669

In this report we present a multimarker association tool (Flash) based on a novel algorithm to generate haplotypes from raw genotype data. It belongs to the entropy minimization class of methods and is composed of a two-stage deterministic-heuristic part and an optional stochastic optimization. This algorithm scales well to huge datasets and runs faster than competing tools such as BEAGLE and MACH while maintaining comparable accuracy. A quality assessment of the results is carried out by comparing the switch error. Finally, the haplotypes are used to perform a haplotype-based Genome-wide Association Study (GWAS). The association results are compared with a multimarker and a single SNP association test performed with Plink. Our experiments confirm that the multimarker association test can be more powerful than the single SNP one, as stated in the literature. Moreover, Flash and Plink show similar results for the multimarker association test, but Flash reduces the computation time by about an order of magnitude using 5 SNP size haplotypes.
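The switch error used for quality assessment can be sketched as follows, under the common definition: among the heterozygous sites of one individual, count how often the inferred phase flips relative to the true phase between consecutive sites, and divide by the number of consecutive pairs. The function and argument names are hypothetical, and the abstract does not state which exact variant of the metric Flash reports.

```python
def switch_error_rate(true_h1, true_h2, inferred_h1):
    """Switch error rate for one individual.

    true_h1, true_h2: the two true haplotypes as sequences of 0/1 alleles.
    inferred_h1: one of the two inferred haplotypes (the other is its
    complement at heterozygous sites, so it carries no extra information).
    """
    # Only heterozygous sites carry phase information.
    het_sites = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    if len(het_sites) < 2:
        return 0.0
    # At each het site, record whether inferred_h1 agrees with true_h1.
    agrees = [inferred_h1[i] == true_h1[i] for i in het_sites]
    # A switch is a change of agreement between consecutive het sites.
    switches = sum(1 for p, q in zip(agrees, agrees[1:]) if p != q)
    return switches / (len(het_sites) - 1)


# One switch among four consecutive pairs of heterozygous sites.
print(switch_error_rate([0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 1, 1, 1]))  # 0.25
```

Note that an inferred haplotype matching the second true haplotype everywhere yields zero switches, since the labelling of the two haplotypes is arbitrary.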

Chromosome-scale shotgun assembly using an in vitro method for long-range linkage

Nicholas H. Putnam, Brendan O’Connell, Jonathan C. Stites, Brandon J. Rice, Andrew Fields, Paul D. Hartley, Charles W. Sugnet, David Haussler, Daniel S. Rokhsar, Richard E. Green
Subjects: Genomics (q-bio.GN); Biomolecules (q-bio.BM)

Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline (“HiRise”) to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
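For readers unfamiliar with the scaffold N50 statistic quoted above, a minimal sketch of its usual definition follows: the length L such that scaffolds of length at least L together cover at least half of the total assembly. The function name is hypothetical and the lengths in the usage line are toy values, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total length 20; the scaffolds of length 6 and 5 cover 11 >= 10.
print(n50([2, 3, 4, 5, 6]))  # 5
```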

Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations

Marc Haber , Massimo Mezzavilla , Yali Xue , David Comas , Paolo Gasparini , Pierre Zalloua , Chris Tyler-Smith
doi: http://dx.doi.org/10.1101/015396

The Armenians are a culturally isolated population who historically inhabited a region in the Near East bounded by the Mediterranean and Black seas and the Caucasus, but remain underrepresented in genetic studies and have a complex history including a major geographic displacement during World War One. Here, we analyse genome-wide variation in 173 Armenians and compare them to 78 other worldwide populations. We find that Armenians form a distinctive cluster linking the Near East, Europe, and the Caucasus. We show that Armenian diversity can be explained by several mixtures of Eurasian populations that occurred between ~3,000 and ~2,000 BCE, a period characterized by major population migrations after the domestication of the horse, appearance of chariots, and the rise of advanced civilizations in the Near East. However, genetic signals of population mixture cease after ~1,200 BCE when Bronze Age civilizations in the Eastern Mediterranean world suddenly and violently collapsed. Armenians have since remained isolated and genetic structure within the population developed ~500 years ago when Armenia was divided between the Ottomans and the Safavid Empire in Iran. Finally, we show that Armenians have higher genetic affinity to Neolithic Europeans than other present-day Near Easterners, and that 29% of the Armenian ancestry may originate from an ancestral population best represented by Neolithic Europeans.

Partitioning, duality, and linkage disequilibria in the Moran model with recombination

Mareike Esser, Sebastian Probst, Ellen Baake
Comments: 29 pages, 6 figures
Subjects: Probability (math.PR); Populations and Evolution (q-bio.PE)

The Moran model with recombination is considered, which describes the evolution of the genetic composition of a population under recombination and resampling. There are $n$ sites (or loci), a finite number of letters (or alleles) at every site, and we do not make any scaling assumptions. In particular, we do not assume a diffusion limit. We consider the following marginal ancestral recombination process. Let $S = \{1,\ldots,n\}$ and $\mathcal A=\{A_1, \ldots, A_m\}$ be a partition of $S$. We concentrate on the joint probability of the letters at the sites in $A_1$ in individual $1$, $\ldots$, at the sites in $A_m$ in individual $m$, where the individuals are sampled from the current population without replacement. Following the ancestry of these sites backwards in time yields a process on the set of partitions of $S$, which, in the diffusion limit, turns into a marginalised version of the $n$-locus ancestral recombination graph. With the help of an inclusion-exclusion principle, we show that the type distribution corresponding to a given partition may be represented in a systematic way, with the help of so-called recombinators and sampling functions. The same is true of correlation functions (known as linkage disequilibria in genetics) of all orders.
We prove that the partitioning process (backward in time) is dual to the Moran population process (forward in time), where the sampling function plays the role of the duality function. This sheds new light on the work of Bobrowski, Wojdyla, and Kimmel (2010). The result also leads to a closed system of ordinary differential equations for the expectations of the sampling functions, which can be translated into expected type distributions and expected linkage disequilibria.
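To make the forward-in-time population process concrete, here is a heavily simplified sketch: a two-locus, discrete-event Moran-type simulation in which each event replaces one individual either by a copy of a random individual (resampling) or by a recombinant of two random parents. The paper treats $n$ sites in continuous time with general partitions; the two-locus restriction, the event probability `recomb_prob`, and all function names are assumptions made for illustration only.

```python
import random


def linkage_disequilibrium(pop):
    """Two-locus linkage disequilibrium D = p_AB - p_A * p_B for 0/1 alleles."""
    n = len(pop)
    p_a = sum(1 for a, _ in pop if a == 1) / n
    p_b = sum(1 for _, b in pop if b == 1) / n
    p_ab = sum(1 for a, b in pop if a == 1 and b == 1) / n
    return p_ab - p_a * p_b


def moran_step(pop, recomb_prob, rng):
    """Apply one event in place; the population size stays constant."""
    n = len(pop)
    target = rng.randrange(n)  # individual to be replaced
    if rng.random() < recomb_prob:
        # Recombination: locus 1 from one parent, locus 2 from another.
        left = pop[rng.randrange(n)]
        right = pop[rng.randrange(n)]
        pop[target] = (left[0], right[1])
    else:
        # Resampling: replace by a copy of a random individual.
        pop[target] = pop[rng.randrange(n)]


def simulate(pop, steps, recomb_prob, seed=0):
    """Run the chain for a number of events and return the new population."""
    rng = random.Random(seed)
    pop = list(pop)
    for _ in range(steps):
        moran_step(pop, recomb_prob, rng)
    return pop


# Start from maximal association between the two loci (D = 0.25).
start = [(1, 1)] * 50 + [(0, 0)] * 50
final = simulate(start, steps=5000, recomb_prob=0.5, seed=1)
print(linkage_disequilibrium(start), linkage_disequilibrium(final))
```

Averaging `linkage_disequilibrium` over many seeds would give a Monte Carlo counterpart to the expected linkage disequilibria that the paper obtains from a closed system of ordinary differential equations.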