Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads

Hung-I Harry Chen , Yuanhang Liu , Yi Zou , Zhao Lai , Devanand Sarkar , Yufei Huang , Yidong Chen
doi: http://dx.doi.org/10.1101/016196

Background: RNA sequencing (RNA-seq) is a powerful tool for genome-wide expression profiling of biological samples, with the advantages of high throughput and high resolution. Many algorithms exist for quantifying expression levels and detecting differential gene expression, but none of them takes into account the misaligned reads that are mapped to non-exonic regions. We developed a novel algorithm, XBSeq, in which a statistical model was established based on the assumption that observed signals are the convolution of true expression signals and sequencing noise. The mapped reads in non-exonic regions are considered sequencing noise, which follows a Poisson distribution. Given measurable observed and noise signals from RNA-seq data, the true expression signals, assumed to be governed by the negative binomial distribution, can be delineated, enabling accurate detection of differentially expressed genes. Results: We implemented our novel XBSeq algorithm and evaluated it using a set of simulated expression datasets under different conditions, generated from a combination of negative binomial and Poisson distributions with parameters derived from real RNA-seq data. We compared the performance of our method with other commonly used differential expression analysis algorithms. We also evaluated the changes in true and false positive rates with variations in biological replicates, differential fold changes, and expression levels in non-exonic regions. Finally, we tested the algorithm on a set of real RNA-seq data, reporting the detection results that were shared and those that differed between algorithms. Conclusions: In this paper, we proposed XBSeq, a novel differential expression analysis algorithm for RNA-seq data that takes non-exonic mapped reads into consideration. When background noise is at baseline level, the performance of XBSeq and DESeq is mostly equivalent. However, our method surpasses DESeq and other algorithms as the number of non-exonic mapped reads increases. Only under very low read count conditions did XBSeq have a slightly higher false discovery rate, which may be improved by adjusting the background noise effect in this situation. Taken together, by considering non-exonic mapped reads, XBSeq can provide accurate expression measurement and thus detect differentially expressed genes even in noisy conditions.
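
The signal model described in this abstract (observed counts as the sum of a true expression signal and an independent Poisson background) implies simple moment relations: the means subtract, and so do the variances, with the Poisson variance equal to its mean. The sketch below illustrates only that relation. It is not the XBSeq implementation, and the function name and toy counts are hypothetical.

```python
def estimate_true_signal_moments(observed, background):
    """Moment estimates for a true signal S, assuming X = S + B with S and B
    independent and B ~ Poisson(lam). Illustrative only, not XBSeq itself."""
    n = len(observed)
    mean_x = sum(observed) / n
    # Unbiased sample variance of the observed (exonic) counts.
    var_x = sum((x - mean_x) ** 2 for x in observed) / (n - 1)
    # For a Poisson background, mean and variance are both lam.
    lam = sum(background) / len(background)
    mean_s = mean_x - lam
    # Clip at zero: sampling error can push the difference below zero.
    var_s = max(var_x - lam, 0.0)
    return mean_s, var_s


# Toy replicate counts for one gene (hypothetical values).
exonic_counts = [12, 15, 9, 14]
non_exonic_counts = [2, 3, 1, 2]
mean_s, var_s = estimate_true_signal_moments(exonic_counts, non_exonic_counts)
print(mean_s, var_s)
```

In a negative binomial setting, moment estimates like these could then be used to derive a dispersion estimate for the true signal; how XBSeq does this in practice is not specified in the abstract.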

Pathway based factor analysis of gene expression data produces highly heritable phenotypes that associate with age

Andrew Anand Brown , Zhihao Ding , Ana Viñuela , Dan Glass , Leopold Parts , Timothy Spector , John Winn , Richard Durbin
doi: http://dx.doi.org/10.1101/016154

Statistical factor analysis methods have previously been used to remove noise components from high dimensional data prior to genetic association mapping, and in a guided fashion to summarise biologically relevant sources of variation. Here we show how the derived factors summarising pathway expression can be used to analyse the relationships between expression, heritability and ageing. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarise patterns of gene expression, both to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 “pathway phenotypes” which summarised patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P<5.38E-5). These phenotypes are more heritable (h^2=0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to the metabolism of sugars and fatty acids, and others to insulin signalling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.
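
The Bonferroni threshold quoted in this abstract can be checked from the numbers it gives. The short sketch below assumes a family-wise error rate of 0.05, which the abstract does not state explicitly but which reproduces the reported value.

```python
n_pathways = 186
phenotypes_per_pathway = 5
n_tests = n_pathways * phenotypes_per_pathway  # 930 pathway phenotypes

alpha = 0.05  # assumed family-wise error rate (not stated in the abstract)
threshold = alpha / n_tests  # Bonferroni: divide alpha by the number of tests

print(f"{n_tests} tests, threshold = {threshold:.2e}")
```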

svviz: a read viewer for validating structural variants

Noah Spies , Justin M Zook , Marc Salit , Arend Sidow
doi: http://dx.doi.org/10.1101/016063

Visualizing read alignments is the most effective way to validate candidate SVs with existing data. We present svviz, a sequencing read visualizer for structural variants (SVs) that sorts and displays only reads relevant to a candidate SV. svviz works by searching the input BAM file(s) for potentially relevant reads, realigning them against the inferred sequence of the putative variant allele as well as the reference allele, and identifying reads that match one allele better than the other. Reads are assigned to the proper allele based on alignment score, read pair orientation and insert size. Separate views of the two alleles are then displayed in a scrollable web browser view, enabling a more intuitive visualization of each allele than the single reference genome-based view common to most current read browsers. The web view facilitates examining the evidence for or against a putative variant, estimating zygosity, visualizing affected genomic annotations, and manually refining breakpoints. An optional command-line-only interface allows summary statistics and graphics to be exported directly to standard graphics file formats. svviz requires as input only structural variant coordinates (called using any other software package), reads in BAM format, and a reference genome. Reads from any high-throughput sequencing platform are supported, including Illumina short-read, mate-pair, synthetic long-read (assembled), Pacific Biosciences, and Oxford Nanopore. svviz is open source and freely available from https://github.com/svviz/svviz.

The origins of a novel butterfly wing patterning gene from within a family of conserved cell cycle regulators

Nicola Nadeau , Carolina Pardo-Diaz , Annabel Whibley , Megan Ann Supple , Richard Wallbank , Grace C. Wu , Luana Maroja , Laura Ferguson , Heather Hines , Camilo Salazar , Richard ffrench-Constant , Mathieu Joron , William Owen McMillan , Chris Jiggins
doi: http://dx.doi.org/10.1101/016006

A major challenge in evolutionary biology is to understand the origins of novel structures. The wing patterns of butterflies and moths are derived phenotypes unique to the Lepidoptera. Here we identify a gene that we name poikilomousa (poik), which regulates colour pattern switches in the mimetic Heliconius butterflies. Strong associations between phenotypic variation and DNA sequence variation are seen in three different Heliconius species, in addition to associations between gene expression and colour pattern. Colour pattern variants are also associated with differences in splicing of poik transcripts. poik is a member of the conserved fizzy family of cell cycle regulators. It belongs to a faster evolving subfamily, the closest functionally characterised orthologue being the cortex gene in Drosophila, a female germ-line specific protein involved in meiosis. poik appears to have adopted a novel function in the Lepidoptera and become a major target for natural selection acting on colour and pattern variation in this group.

Mutation rate estimation for 15 autosomal STR loci in a large population from Mainland China

Zhuo Zhao , Hua Wang , Jie Zhang , Zhi-Peng Liu , Ming Liu , Yuan Zhang , Li Sun , Hui Zhang
doi: http://dx.doi.org/10.1101/015875

STRs (short tandem repeats) are well known as powerful genetic markers and are widely used in studying human population genetics. Compared with conventional genetic markers, STRs have a higher mutation rate. Additionally, mutations at STR loci do not lead to genetic inconsistencies between the genotypes of parents and children; therefore, the analysis of STR mutations is better suited to assessing mutation in a population. In this study, we focused on 15 autosomal STR loci (D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, FGA). DNA samples from a total of 42416 unrelated healthy individuals (19037 trios) from the population of Mainland China, collected between Jan 2012 and May 2014, were successfully investigated. We estimated the allele frequencies, paternal mutation rates, maternal mutation rates and average mutation rates for the 15 STR loci. Furthermore, we investigated the relationship of paternal age, maternal age, time of pregnancy and area with the average mutation rate. We found that the paternal mutation rate is higher than the maternal mutation rate, and that the paternal, maternal, and average mutation rates are positively correlated with paternal age, maternal age and time of pregnancy, respectively. Additionally, the average mutation rates of coastal areas are higher than those of inland areas. Overall, these results suggest that the 15 autosomal STR loci can provide highly informative polymorphic data for population genetic assessment in Mainland China, and they confirm and extend the application of STR analysis in population genetics.

Recent evolution in Rattus norvegicus is shaped by declining effective population size

Eva E Deinum , Daniel L Halligan , Rob W Ness , Yao-Hua Zhang , Lin Cong , Jian-Xu Zhang , Peter D Keightley
doi: http://dx.doi.org/10.1101/015818

The brown rat, Rattus norvegicus, is both a notorious pest and a frequently used model in biomedical research. By analysing genome sequences of 12 wild-caught brown rats from their ancestral range in NE China, along with the sequence of a black rat, R. rattus, we investigate the selective and demographic forces shaping variation in the genome. We estimate that the recent effective population size (N_e) of this species is 1.24 x 10^5, based on silent site diversity. We compare patterns of diversity in these genomes with patterns in multiple genome sequences of the house mouse (Mus musculus castaneus), which has a much larger N_e. This reveals an important role for variation in the strength of genetic drift in mammalian genome evolution. Using a Pairwise Sequentially Markovian Coalescent (PSMC) analysis of demographic history, we infer that there has been a recent population size bottleneck in wild rats, which we date to approximately 20,000 years ago. Consistent with this, wild rat populations have experienced an increased flux of mildly deleterious mutations, which segregate at higher frequencies in protein-coding genes and conserved noncoding elements (CNEs). This leads to negative estimates of the rate of adaptive evolution (alpha) in proteins and CNEs, a result which we discuss in relation to the strongly positive estimates observed in wild house mice. As a consequence of the population bottleneck, wild rats also show a markedly slower decay of linkage disequilibrium with physical distance than wild house mice.
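
Estimates of N_e from silent site diversity commonly rest on the neutral expectation for a diploid population, pi = 4 * N_e * mu, where pi is nucleotide diversity and mu is the per-site mutation rate per generation. The sketch below shows that relation only. The input values are hypothetical placeholders, not the values used in the preprint, and the function name is an assumption.

```python
def effective_population_size(pi, mu):
    """Solve pi = 4 * N_e * mu for N_e (diploid, neutral equilibrium)."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return pi / (4.0 * mu)


# Hypothetical inputs chosen for illustration only.
example_pi = 0.001   # silent site nucleotide diversity
example_mu = 2.5e-9  # mutations per site per generation
example_ne = effective_population_size(example_pi, example_mu)
print(f"N_e = {example_ne:.3g}")
```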

Speciation in Heliconius Butterflies: Minimal Contact Followed by Millions of Generations of Hybridisation

Simon Henry Martin , Anders Eriksson , Krzysztof M. Kozak , Andrea Manica , Chris D. Jiggins
doi: http://dx.doi.org/10.1101/015800

Documenting the full extent of gene flow during speciation poses a challenge, as species ranges change over time and current rates of hybridisation might not reflect historical trends. Theoretical work has emphasized the potential for speciation in the face of ongoing hybridisation, and the genetic mechanisms that might facilitate this process. However, elucidating how the rate of gene flow between species may have changed over time has proved difficult. Here we use Approximate Bayesian Computation (ABC) to fit a model of speciation between the Neotropical butterflies Heliconius melpomene and Heliconius cydno. These species are ecologically divergent, rarely hybridize and display female hybrid sterility. Nevertheless, previous genomic studies suggest pervasive gene flow between them, extending deep into their past, and potentially throughout the speciation process. By modelling the rates of gene flow during early and later stages of speciation, we find that these species have been hybridising for hundreds of thousands of years, but have not done so continuously since their initial divergence. Instead, it appears that gene flow was rare or absent for as long as a million years in the early stages of speciation. Therefore, by dissecting the timing of gene flow between these species, we are able to reject a scenario of purely sympatric speciation in the face of continuous gene flow. We suggest that the period of minimal contact early in speciation may have allowed for the accumulation of genomic changes that later enabled these species to remain distinct despite a dramatic increase in the rate of hybridisation.

Quality assessment for different haplotyping methods and GWAS sensitivity to phasing errors


Giovanni Busonera , Marco Cogoni , Gianluigi Zanetti
doi: http://dx.doi.org/10.1101/015669

In this report we present a multimarker association tool (Flash) based on a novel algorithm to generate haplotypes from raw genotype data. It belongs to the entropy minimization class of methods and is composed of a two-stage deterministic-heuristic part and an optional stochastic optimization. The algorithm scales well to huge datasets, running faster than competing tools such as BEAGLE and MACH while maintaining comparable accuracy. The quality of the results is assessed by comparing switch error rates. Finally, the haplotypes are used to perform a haplotype-based Genome-wide Association Study (GWAS). The association results are compared with a multimarker and a single SNP association test performed with Plink. Our experiments confirm that the multimarker association test can be more powerful than the single SNP one, as stated in the literature. Moreover, Flash and Plink show similar results for the multimarker association test, but Flash reduces the computation time by about an order of magnitude when using haplotypes of 5 SNPs.
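
Switch error, the quality measure mentioned in this abstract, is commonly defined as the fraction of adjacent heterozygous site pairs whose relative phase differs between an inferred haplotype and the true one. The sketch below follows that common definition and is not taken from Flash. The function name and the 0/1 string encoding (one haplotype of the pair, heterozygous sites only) are assumptions.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of adjacent heterozygous-site pairs where the phase flips.

    Both arguments encode one haplotype of a diploid pair at heterozygous
    sites only, as equal-length strings of '0' and '1'.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same sites")
    if len(true_hap) < 2:
        return 0.0
    # True where the inferred allele agrees with the true haplotype.
    agree = [t == i for t, i in zip(true_hap, inferred_hap)]
    # A switch occurs whenever agreement changes between adjacent sites.
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


# One switch after the fourth site, out of six adjacent pairs.
example_rate = switch_error_rate("0101100", "0101011")
print(example_rate)
```

Because only changes in agreement are counted, an inferred haplotype that is the exact complement of the true one scores zero: it describes the same phasing, labelled from the other chromosome.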

Chromosome-scale shotgun assembly using an in vitro method for long-range linkage

Nicholas H. Putnam, Brendan O’Connell, Jonathan C. Stites, Brandon J. Rice, Andrew Fields, Paul D. Hartley, Charles W. Sugnet, David Haussler, Daniel S. Rokhsar, Richard E. Green
Subjects: Genomics (q-bio.GN); Biomolecules (q-bio.BM)

Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline (“HiRise”) to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
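
Scaffold N50, the contiguity statistic quoted in this abstract, is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch of the computation follows. The function name and the toy lengths are hypothetical and unrelated to the assemblies in the preprint.

```python
def scaffold_n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly: total 290, so the half-way point (145) is reached
# within the second-longest scaffold.
example_n50 = scaffold_n50([80, 70, 50, 40, 30, 20])
print(example_n50)
```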

Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations

Marc Haber , Massimo Mezzavilla , Yali Xue , David Comas , Paolo Gasparini , Pierre Zalloua , Chris Tyler-Smith
doi: http://dx.doi.org/10.1101/015396

The Armenians are a culturally isolated population who historically inhabited a region in the Near East bounded by the Mediterranean and Black seas and the Caucasus, but remain underrepresented in genetic studies and have a complex history including a major geographic displacement during World War One. Here, we analyse genome-wide variation in 173 Armenians and compare them to 78 other worldwide populations. We find that Armenians form a distinctive cluster linking the Near East, Europe, and the Caucasus. We show that Armenian diversity can be explained by several mixtures of Eurasian populations that occurred between ~3,000 and ~2,000 BCE, a period characterized by major population migrations after the domestication of the horse, appearance of chariots, and the rise of advanced civilizations in the Near East. However, genetic signals of population mixture cease after ~1,200 BCE when Bronze Age civilizations in the Eastern Mediterranean world suddenly and violently collapsed. Armenians have since remained isolated and genetic structure within the population developed ~500 years ago when Armenia was divided between the Ottomans and the Safavid Empire in Iran. Finally, we show that Armenians have higher genetic affinity to Neolithic Europeans than other present-day Near Easterners, and that 29% of the Armenian ancestry may originate from an ancestral population best represented by Neolithic Europeans.