A Complete Public Domain Family Genomics Dataset

Manuel Corpas, Mike Cariaso, Alain Coletta, David Weiss, Andrew P Harrison, Federico Moran, Huanming Yang

BACKGROUND: The availability of open access genomic data is essential for the personal genomics field. Public genomic data allow comparative analyses, testing of new tools and genotype-phenotype association studies. Personal genomics data of unrelated individuals are available in the public domain, notably through the Personal Genome Project; however, to date family genomics data and metadata are severely lacking, mainly due to cost, privacy concerns or restricted access to Next Generation Sequencing (NGS) technology. Family data allow the study of heritability, which is impossible with unrelated individuals alone. FINDINGS: A whole family from Southern Spain decided to genotype, sequence and analyse their personal genomes, making them publicly available under a Creative Commons 0 license (CC0; commonly referred to as public domain). These data include a) five 23andMe SNP chip genotype bed files, b) four raw exomes with their assorted bam files and VCF files, c) a metagenomic raw sequencing data file and d) derived data of likely phenotypes using SNPedia-derived tools. CONCLUSIONS: To our knowledge this is the first CC0 released set of genomic, phenotypic and metagenomic data for a whole family. This dataset is also unique in that it was obtained through direct-to-consumer genetic tests. Hence any ordinary citizen with enough budget and samples should be able to reproduce this experiment. We envisage this dataset as a useful resource for a variety of applications in the personal genomics field: a) negative control data for trait association discovery, b) testing data for development of new software and c) sample data for heritability studies. We encourage prospective users to share derived results with us so that they can be added to our existing collection.

Population genomics of parallel hybrid zones in the mimetic butterflies, H. melpomene and H. erato

Nicola Nadeau, Mayte Ruiz, Patricio Salazar, Brian Counterman, Jose Alejandro Medina, Humberto Ortiz-Zuazaga, Anna Morrison, W. Owen McMillan, Chris Jiggins, Riccardo Papa

Hybrid zones can be valuable tools for studying evolution and identifying genomic regions responsible for adaptive divergence and underlying phenotypic variation. Hybrid zones between subspecies of Heliconius butterflies can be very narrow and are maintained by strong selection acting on colour pattern. The co-mimetic species H. erato and H. melpomene have parallel hybrid zones where both species undergo a change from one colour pattern form to another. We use restriction associated DNA sequencing to obtain several thousand genome-wide sequence markers and use these to analyse patterns of population divergence across two pairs of parallel hybrid zones in Peru and Ecuador. We compare two approaches for analysis of this type of data: alignment to a reference genome and de novo assembly, and find that alignment gives the best results for species both closely (H. melpomene) and distantly (H. erato, ~15% divergent) related to the reference sequence. Our results confirm that the loci controlling colour pattern account for the majority of divergent regions across the genome, but we also detect other divergent regions apparently unlinked to colour pattern differences. We also use association mapping to identify previously unmapped colour pattern loci, in particular the Ro locus. Finally, we identify within our sample a new cryptic population of H. timareta in Ecuador, which occurs at relatively low altitude and is mimetic with H. melpomene malleti.

The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana

Erik Wijnker, Geo Velikkakam James, Jia Ding, Frank Becker, Jonas R. Klasen, Vimal Rawat, Beth A. Rowan, Daniel F. de Jong, C. Bastiaan de Snoo, Luis Zapata, Bruno Huettel, Hans de Jong, Stephan Ossowski, Detlef Weigel, Maarten Koornneef, Joost J.B. Keurentjes, Korbinian Schneeberger
(Submitted on 13 Nov 2013)

Knowledge of the exact distribution of meiotic crossovers (COs) and gene conversions (GCs) is essential for understanding many aspects of population genetics and evolution, from haplotype structure and long-distance genetic linkage to the generation of new allelic variants of genes. To this end, we resequenced the four products of 13 meiotic tetrads along with 10 doubled haploids derived from Arabidopsis thaliana hybrids. GC detection through short reads has previously been confounded by genomic rearrangements. Rigid filtering for misaligned reads allowed GC identification at high accuracy and revealed an ~80-kb transposition, which undergoes copy-number changes mediated by meiotic recombination. Non-crossover-associated GCs were extremely rare, most likely due to their short average length of ~25-50 bp, which is significantly shorter than the length of CO-associated GCs. Overall, recombination preferentially targeted non-methylated nucleosome-free regions at gene promoters, which showed significant enrichment of two sequence motifs.

Genome-wide targets of selection: female response to experimental removal of sexual selection in Drosophila melanogaster

Paolo Innocenti, Ilona Flis, Edward H Morrow

Despite the common assumption that promiscuity should in general be favored in males, but not in females, to date there is no consensus on the general impact of multiple mating on female fitness. Notably, very little is known about the genetic and physiological features underlying the female response to sexual selection pressures. By combining an experimental evolution approach with genomic techniques, we investigated the effects of single and multiple matings on female fecundity and gene expression. We experimentally manipulated the mating system in replicate populations of Drosophila melanogaster by removing sexual selection, with the aim of testing differences in short term post-mating effects of females evolved under different mating strategies. We show that monogamous females suffer decreased fecundity, a decrease that was partially recovered by experimentally reversing the selection pressure back to the ancestral promiscuous state. The post-mating gene expression profiles of monogamous females differ significantly from promiscuous females, involving 9% of the genes tested. These transcripts are active in several tissues, mainly ovaries, neural tissues and midgut, and are involved in metabolic processes, reproduction and signaling pathways. Our results demonstrate how the female post-mating response can evolve under different mating systems, and provide novel insights into the genes targeted by sexual selection in females, by identifying a list of candidate genes responsible for the decrease in female fecundity in the absence of promiscuity.

Improved annotation of 3-prime untranslated regions and complex loci by combination of strand-specific Direct RNA Sequencing, RNA-seq and ESTs

Nick Schurch, Christian Cole, Alexander Sherstnev, Junfang Song, Céline Duc, Kate G. Storey, W. H. Irwin McLean, Sara J. Brown, Gordon G. Simpson, Geoffrey J. Barton
(Submitted on 11 Nov 2013)

The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct annotation is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3-prime untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in human, chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3-prime polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.

The first steps of adaptation of Escherichia coli to the gut are dominated by soft sweeps

João Barroso-Batista, Ana Sousa, Marta Lourenço, Marie-Louise Bergman, Jocelyne Demengeot, Karina B. Xavier, Isabel Gordo
(Submitted on 11 Nov 2013)

The accumulation of adaptive mutations is essential for survival in novel environments. However, in clonal populations with a high mutational supply, the power of natural selection is expected to be limited. This is due to clonal interference – the competition of clones carrying different beneficial mutations – which leads to the loss of many small effect mutations and fixation of large effect ones. If interference is abundant, then mechanisms for horizontal transfer of genes, which allow the immediate combination of beneficial alleles in a single background, are expected to evolve. However, the relevance of interference in natural complex environments, such as the gut, is poorly known. To address this issue, we studied the invasion of beneficial mutations responsible for Escherichia coli’s adaptation to the mouse gut and demonstrate the pervasiveness of clonal interference. The observed dynamics of change in frequency of beneficial mutations are consistent with soft sweeps, where a similar adaptive mutation arises repeatedly on different haplotypes without reaching fixation. The genetic basis of the adaptive mutations revealed a striking parallelism in independently evolving populations. This was mainly characterized by the insertion of transposable elements in both coding and regulatory regions of a few genes. Interestingly, in most populations we observed a complete phenotypic sweep without loss of genetic variation. The intense clonal interference during adaptation to the gut environment, demonstrated here, may be important for our understanding of the levels of strain diversity of E. coli inhabiting the human gut microbiota and of its recombination rate.

Functional Annotation Signatures of Disease Susceptibility Loci Improve SNP Association Analysis

Edwin S Iversen, Gary Lipton, Merlise A. Clyde, Alvaro N. A. Monteiro
doi: 10.1101/000158

We describe the development and application of a Bayesian statistical model for the prior probability of phenotype-genotype association that incorporates data from past association studies and publicly available functional annotation data regarding the susceptibility variants under study. The model takes the form of a binary regression of association status on a set of annotation variables whose coefficients were estimated through an analysis of associated SNPs housed in the GWAS Catalog (GC). The set of functional predictors we examined includes measures that have been demonstrated to correlate with the association status of SNPs in the GC and some whose utility in this regard is speculative: summaries of the UCSC Human Genome Browser ENCODE super-track data, dbSNP function class, sequence conservation summaries, proximity to genomic variants included in the Database of Genomic Variants (DGV) and known regulatory elements included in the Open Regulatory Annotation database (ORegAnno), PolyPhen-2 probabilities and RegulomeDB categories. Because we expected that only a fraction of the annotation variables would contribute to predicting association, we employed a penalized likelihood method to reduce the impact of non-informative predictors and evaluated the model’s ability to predict GC SNPs not used to construct the model. We show that the functional data alone are predictive of a SNP’s presence in the GC. Further, using data from a genome-wide study of ovarian cancer, we demonstrate that their use as prior data when testing for association is practical at the genome-wide scale and improves power to detect associations.
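
The core idea of the abstract above, a binary regression of association status on annotation variables fitted with a penalty that suppresses non-informative predictors, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the penalty (an L1 penalty fitted by proximal gradient descent is used here), the two annotation features and the simulated SNPs are hypothetical toy data, and the function names are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.02, lr=0.5, n_iter=1000):
    """L1-penalized logistic regression by proximal gradient descent.

    The intercept is left unpenalized. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def prior_probability(w, b, annotations):
    """Model-based prior probability that a SNP is associated."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, annotations)))


# Hypothetical toy data: feature 0 is an informative binary annotation,
# feature 1 is pure noise. Association status is simulated, not real.
rng = random.Random(1)
X, y = [], []
for _ in range(400):
    informative = float(rng.random() < 0.5)
    noise = rng.random()
    X.append([informative, noise])
    y.append(1 if rng.random() < sigmoid(-1.0 + 2.0 * informative) else 0)

w, b = fit_l1_logistic(X, y)
p_annotated = prior_probability(w, b, [1.0, 0.5])
p_unannotated = prior_probability(w, b, [0.0, 0.5])
```

In this toy setting the fitted prior for a SNP carrying the informative annotation (`p_annotated`) comes out higher than for one without it (`p_unannotated`). Such a prior could then be combined with the evidence from a new association study, which is the use the abstract describes.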

A stochastic microscopic model for the dynamics of antigenic variation


Gustavo Guerberoff, Fernando Alvarez-Valin
(Submitted on 8 Nov 2013)

We present a novel model that describes the within-host evolutionary dynamics of parasites undergoing antigenic variation. The approach uses a multi-type branching process with two types of entities defined according to their relationship with the immune system: clans of resistant parasitic cells (i.e. groups of cells sharing the same antigen not yet recognized by the immune system) that may become sensitive, and individual sensitive cells that can acquire a new resistance, thus giving rise to a new clan. The simplicity of the model allows analytical treatment to determine the subcritical and supercritical regimes in the space of parameters. By incorporating a density-dependent mechanism, the model is able to capture additional relevant features observed in experimental data, such as the characteristic parasitemia waves. In summary, our approach provides a new general framework to address the dynamics of antigenic variation which can be easily adapted to cope with broader and more complex situations.
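
To make the notion of subcritical and supercritical regimes concrete, here is a minimal sketch of a two-type branching process with clans and sensitive cells. It is not the authors' model: the offspring rules, the parameter names (`r`, `m`, `d`, `mu`) and the discrete generations are all assumptions chosen for illustration, and the density-dependent mechanism mentioned in the abstract is left out. The regime is read off the largest eigenvalue of the mean offspring matrix, the standard criterion for multi-type branching processes.

```python
import math
import random


def mean_matrix(r, m, d, mu):
    """Mean offspring matrix; rows are parents, columns are offspring.

    Assumed rules (hypothetical, per generation):
      clan: recognized with probability r and replaced by m sensitive
            cells, otherwise it persists as one clan;
      sensitive cell: killed with probability d; a survivor switches
            antigen with probability mu (founding one new clan),
            otherwise it divides into two sensitive cells.
    Order of types: [clan, sensitive cell].
    """
    return [
        [1.0 - r, r * m],
        [(1.0 - d) * mu, 2.0 * (1.0 - d) * (1.0 - mu)],
    ]


def spectral_radius(M):
    """Largest eigenvalue of a non-negative 2x2 matrix."""
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    # The discriminant equals (a - d)^2 + 4bc, so it is never negative here.
    return (trace + math.sqrt(trace * trace - 4.0 * det)) / 2.0


def simulate(r, m, d, mu, generations=200, max_pop=10_000, seed=0):
    """Simulate from a single clan; returns final (clans, sensitive_cells)."""
    rng = random.Random(seed)
    clans, cells = 1, 0
    for _ in range(generations):
        new_clans, new_cells = 0, 0
        for _ in range(clans):
            if rng.random() < r:
                new_cells += m      # clan recognized: cells become sensitive
            else:
                new_clans += 1      # clan persists
        for _ in range(cells):
            if rng.random() < d:
                continue            # killed by the immune response
            if rng.random() < mu:
                new_clans += 1      # antigen switch founds a new clan
            else:
                new_cells += 2      # cell division
        clans, cells = new_clans, new_cells
        if clans + cells == 0 or clans + cells > max_pop:
            break                   # extinction, or stop runaway growth
    return clans, cells


rho_sub = spectral_radius(mean_matrix(r=0.5, m=2, d=0.9, mu=0.1))
rho_super = spectral_radius(mean_matrix(r=0.5, m=2, d=0.3, mu=0.1))
final_sub = simulate(r=0.5, m=2, d=0.9, mu=0.1)
```

With strong immune killing (`d=0.9`) the largest eigenvalue is below 1 and the infection dies out; with weaker killing (`d=0.3`) it exceeds 1 and the population can grow. Capturing parasitemia waves would require making the parameters depend on the population size, which this sketch does not attempt.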

Mapping of the Influenza-A Hemagglutinin Serotypes Evolution by the ISSCOR Method

Jan P. Radomski, Piotr P. Slonimski, Włodzimierz Zagórski-Ostoja, Piotr Borowicz
(Submitted on 8 Nov 2013)

Analyses and visualizations by the ISSCOR method of influenza virus hemagglutinin genes of different A-subtypes revealed some rather striking temporal relationships between groups of individual gene subsets. Based on these findings we consider application of the ISSCOR-PCA method for analyses of large sets of homologous genes to be a worthwhile addition to the genomics toolbox – allowing rapid diagnostics of trends, and ultimately even aiding early warning of newly emerging epidemiological threats.

Sequencing and characterisation of rearrangements in three S. pastorianus strains reveals the presence of chimeric genes and gives evidence of breakpoint reuse

Sarah K. Hewitt, Ian Donaldson, Simon C. Lovell, Daniela Delneri
(Submitted on 8 Nov 2013)

Gross chromosomal rearrangements have the potential to be evolutionarily advantageous to an adapting organism. The generation of a hybrid species increases opportunity for recombination by bringing together two homologous genomes. We sought to define the location of genomic rearrangements in three strains of Saccharomyces pastorianus, a natural lager-brewing yeast hybrid of Saccharomyces cerevisiae and Saccharomyces eubayanus, using whole genome shotgun sequencing. Each strain of S. pastorianus has lost species-specific portions of its genome and has undergone extensive recombination, producing chimeric chromosomes. We predicted 30 breakpoints that we confirmed at the single nucleotide level by designing species-specific primers that flank each breakpoint, and then sequencing the PCR product. These rearrangements are the result of recombination between areas of homology between the two subgenomes, rather than repetitive elements such as transposons or tRNAs. Interestingly, 28 of the 30 S. cerevisiae-S. eubayanus recombination breakpoints are located within genic regions, generating chimeric genes. Furthermore, we show evidence for the reuse of two breakpoints, located in HSP82 and KEM1, in strains of proposed independent origin.