RNA profiling using Next-Generation Sequencing technologies

Conclusion
An RNA-Seq pipeline was created to facilitate mRNA profiling
of eukaryotic cells. The pipeline includes the analysis of differential
gene expression. Preliminary results indicate good correlation
with microarray data. Future developments upstream of the
bioinformatics pipeline include increasing the throughput of
library generation, decreasing sample requirements and
developing a pipeline for the profiling of non-polyA RNA species.
Future work on the bioinformatics pipeline will focus on de novo
transcript discovery and the analysis of splice isoforms.
Technical Application Note
RNA profiling using Next-Generation Sequencing technologies
Introduction
RNA profiling using Next-Generation Sequencing technology
(RNA-Seq) has started to revolutionize transcriptome analysis. For
instance, it can be used to determine the structure of genes, to identify
splicing variants and other post-transcriptional modifications, to
detect rare and novel transcripts, and to quantify the changing
expression levels of individual transcripts and splice variants. Recent
advances in sequencing throughput and workflows, from sample
preparation to data analysis, are now increasing the speed and depth
of output available from the transcriptome of a myriad of cell types.
These advances have also lowered costs, which in turn has
opened the door to studies with a larger number of
samples, so that gene expression can be analyzed with statistical tools
similar to those available for gene expression on arrays.
Here we describe an RNA-Seq pipeline based on Illumina technologies
combined with a bioinformatics analysis pipeline assembled in-house
from software freely available on the web. This pipeline is specific
to mRNA profiling from eukaryotes for which a reference genome
is available. Alternative pipelines are also available for the study of
non-polyadenylated RNA species.
References
[1] H. Li and R. Durbin (2009). Fast and accurate short read
alignment with Burrows–Wheeler transform. Bioinformatics,
25(14): 1754–1760.
Generation of cDNA Libraries and Sequencing
An overview of the steps involved in the construction of cDNA libraries
and sequencing is shown in Figure 1. Total RNA quality is verified on an
RNA chip using an Agilent 2100 Bioanalyzer (Agilent), and the RNA is quantified
using a NanoDrop ND-1000 UV-VIS spectrophotometer (Thermo
Fisher). Libraries are generated using the TruSeq RNA kit (Illumina).
Messenger RNA is purified from 1 µg of total RNA using oligo-dT
beads. The mRNA-enriched fraction is reverse transcribed to generate
cDNA fragments that are sheared using a Covaris instrument to yield
~200 bp fragments. Following end-repair and 3’ end adenylation
steps, an index is ligated and a PCR step is performed. Up to 24 different
indexes can be ligated to cDNA fragments prior to the PCR step,
which allows several samples to be multiplexed prior
to sequencing. Library quality is assessed on a DNA
1000 chip and the library is quantified by PCR.
[2] M.A. DePristo, E. Banks, R. Poplin, K.V. Garimella, J.R. Maguire,
C. Hartl, A.A. Philippakis, G. Del Angel, M.A. Rivas, M. Hanna et al.
(2011). A framework for variation discovery and genotyping using
next-generation DNA sequencing data. Nature Genetics, 43(5):
491–498.
[3] A. Roberts, H. Pimentel, C. Trapnell and L. Pachter (2011).
Identification of novel transcripts in annotated genomes using
RNA-Seq. Bioinformatics, 27(17): 2325–2329.
[4] A.P. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge
and S.J.M. Jones (2008). FindPeaks 3.1: a tool for identifying areas
of enrichment from massively parallel short-read sequencing
technology. Bioinformatics, 24(15): 1729–1730.
Libraries can then be subjected to 50-300 cycles of sequencing on the
Illumina HiSeq 2000 instrument. In a typical run, over 90% of the reads
will align to the reference transcriptome. With a well-annotated
genome such as human, 50 bases (cycles) in single-read mode are
sufficient. However, if the goal is to describe the transcriptome of
a novel species or to study alternative splicing variation, reads of
100 or even 150 bases in paired-end mode (where both ends of a
fragment are sequenced) can be necessary. Each lane of an Illumina
flow cell will generate over 150 million independent reads. A third of
that amount will probably be sufficient to obtain results comparable
to a typical expression array for most genes; however, increasing
the depth of sequencing could be useful to study detailed gene
structure or splicing.
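The depth figures above translate into a simple multiplexing calculation. The following is a minimal sketch, assuming the lower bound of 150 million reads per lane and the "one third of a lane" target quoted in the text; the function name and the even split of reads between samples are illustrative assumptions, not part of the pipeline.

```python
def samples_per_lane(reads_per_lane: int, reads_per_sample: int,
                     max_indexes: int = 24) -> int:
    """Return how many indexed samples can share one lane at the target depth.

    Assumes reads are split evenly between samples, which real runs only
    approximate. The cap of 24 reflects the number of available indexes.
    """
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return min(reads_per_lane // reads_per_sample, max_indexes)


LANE_READS = 150_000_000             # "over 150 million" reads per lane
ARRAY_LIKE_DEPTH = LANE_READS // 3   # "a third of that amount" per sample

print(samples_per_lane(LANE_READS, ARRAY_LIKE_DEPTH))
```

Under these assumptions three samples fit in one lane for array-like expression profiling, while studies of gene structure or splicing would lower that number by raising `reads_per_sample`.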
[5] M.D. Robinson, D.J. McCarthy and G.K. Smyth (2010). edgeR:
a Bioconductor package for differential expression analysis of
digital gene expression data. Bioinformatics, 26(1): 139–140.
[6] S. Anders and W. Huber (2010). Differential expression analysis
for sequence count data. Genome Biology, 11(10):R106.
[7] M.D. Young, M.J. Wakefield, G.K. Smyth and A. Oshlack (2010).
Gene ontology analysis for RNA-seq: accounting for
selection bias. Genome Biology, 11(2): R14.
Figure 1. Schematic Representation of the RNA Sequencing
Workflow.

Sylvie LaBoissière, Alexandre Montpetit, Maxime Caron,
Mathieu Bourgey, Gregory Voisin and Guillaume Bourque.
McGill University and Génome Québec Innovation Centre, Montréal, QC

Client Management Office: [email protected]
Assistant Scientific Director:
Alexandre Montpetit, PhD
[email protected]
514 398-3311 ext. 00913

McGill University and Génome Québec Innovation Centre
740 Docteur Penfield Ave., Montréal, QC Canada H3A 0G1
T 514 398-7211 • [email protected]
www.genomequebec.mcgill.ca
Bioinformatics Analysis
Reads from the instrument are transferred onto our servers and go through our in-house bioinformatics pipeline. A report, summarizing
key findings and providing a basic description of analysis steps, is generated and made available to users.
4) “Wiggle track” files are generated from the aligned reads using “FindPeaks” [4]. They can be loaded in browsers such as IGV and UCSC
and represent overall read counts across the genome (Figure 3).
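To make the wiggle format concrete, here is a minimal sketch that turns aligned read intervals into per-base read counts and writes them as a fixedStep wiggle track. It only illustrates the file layout that genome browsers read; it is not how FindPeaks works internally, and the function names and the toy reads are assumptions made for the example.

```python
def per_base_depth(reads, region_start, region_end):
    """Count reads overlapping each position of a region.

    `reads` is an iterable of (start, end) pairs, 1-based and inclusive.
    """
    depth = [0] * (region_end - region_start + 1)
    for start, end in reads:
        first = max(start, region_start)
        last = min(end, region_end)
        for pos in range(first, last + 1):
            depth[pos - region_start] += 1
    return depth


def to_fixed_step_wiggle(chrom, start, depths, name="coverage"):
    """Format per-base depths as a fixedStep wiggle track (1-based start)."""
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start={start} step=1",
    ]
    lines.extend(str(d) for d in depths)
    return "\n".join(lines) + "\n"


# Two hypothetical overlapping reads on chr1, positions 1-4 and 3-6.
toy_depths = per_base_depth([(1, 4), (3, 6)], region_start=1, region_end=6)
print(to_fixed_step_wiggle("chr1", 1, toy_depths, name="sample1"))
```

A production track would normally be written per chromosome and compressed (the note mentions `.wig.gz` files), which this sketch leaves out.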
Key findings include:
1) The raw read statistics.
Table 1. Read statistics.

Sample   Raw (1)      %       Filtered (2)   %      Aligned (3)   %
1        65,391,207   100.0   63,645,362     97.3   60,079,278    91.9
2        77,247,651   100.0   74,992,558     97.1   71,048,142    92.0
3        74,948,506   100.0   72,833,957     97.2   69,338,686    92.5

(1) Raw reads correspond to the number of reads from the sequencing instrument.
(2) Filtered reads correspond to the number of reads after quality and length trimming. Sequence trimming is performed with
“fastx” and alignment with “BWA” [1].
(3) Aligned reads correspond to the number of reads that were successfully aligned to a reference genome, if applicable.
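The percentage columns of Table 1 can be checked directly from the read counts. The short sketch below assumes, as the numbers suggest, that both the filtered and the aligned percentages are expressed relative to the raw read count; the variable and function names are illustrative.

```python
# Read counts from Table 1: sample -> (raw, filtered, aligned)
READ_COUNTS = {
    "1": (65_391_207, 63_645_362, 60_079_278),
    "2": (77_247_651, 74_992_558, 71_048_142),
    "3": (74_948_506, 72_833_957, 69_338_686),
}


def percent_of_raw(count: int, raw: int) -> float:
    """Express a read count as a percentage of the raw reads, to one decimal."""
    return round(100.0 * count / raw, 1)


for sample, (raw, filtered, aligned) in READ_COUNTS.items():
    print(sample, percent_of_raw(filtered, raw), percent_of_raw(aligned, raw))
```

Computing the aligned percentage against the filtered reads instead would give slightly higher values than those in the table.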
2) The sequencing coverage (Figure 2), which is calculated with the “Genome Analysis Toolkit” [2]. In the example shown in Figure 2, the graph
shows that 40-60% of the nucleotides in the exome have been sequenced at least 10 times (10X) across the 6 study samples. By contrast,
~25% of nucleotides display a coverage of 100X.
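The coverage statistic described here amounts to the fraction of positions whose depth reaches a threshold. A minimal sketch follows, assuming per-base depths are already available as a list of integers; the depth values are invented for illustration and this is not the Genome Analysis Toolkit computation itself.

```python
def fraction_covered(depths, min_depth):
    """Return the fraction of positions sequenced at least `min_depth` times."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Hypothetical per-base depths over eight exonic positions.
toy_depths = [0, 4, 12, 150, 9, 10, 230, 35]

print(fraction_covered(toy_depths, 10))    # share of bases at 10X or more
print(fraction_covered(toy_depths, 100))   # share of bases at 100X or more
```

Evaluating the function over a range of thresholds for each sample yields the kind of cumulative coverage curve that Figure 2 describes.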
Figure 3. Visualization of Sequence Coverage of RNA-Seq Samples using UCSC Browser.
Aligned reads are used to generate a wiggle file (.wig.gz) with FindPeaks. The file is loaded in the IGV browser.
5) Gene expression analysis is carried out with “edgeR” and “DESeq”. edgeR uses empirical Bayes estimation and exact tests based
on the negative binomial distribution [5], whereas DESeq uses a model based on the negative binomial distribution [6]. Raw read counts
generated by the sequencing instruments are used as input. As with gene expression analyses of microarray experiments, data
are normalized prior to calculating fold changes and p-values. High correlation is observed between the data obtained from RNA-Seq
and from microarrays (Figure 4).
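For reference, the negative binomial model that both tools build on can be summarized by its mean-variance relationship. The notation below (count Y, mean \mu and dispersion \phi for gene g) is chosen here for illustration and is not taken from this note; the two packages differ in how they parameterize and estimate the dispersion.

```latex
Y_{g} \sim \mathrm{NB}(\mu_{g}, \phi_{g}), \qquad
\operatorname{Var}(Y_{g}) = \mu_{g} + \phi_{g}\,\mu_{g}^{2}
```

With \phi_g = 0 the variance equals the mean, as in a Poisson model, so the dispersion term accounts for the additional variability seen between biological replicates.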
Figure 2. Sequence Coverage of RNA-Seq Samples.
Aligned sequences are used as input for the calculations.
3) Another quality measure is the correlation of FPKM values across study samples (not shown). Biological replicates should have
correlation coefficients above 0.9 to get the most out of differential gene expression analyses across actual samples. Replicates with correlation values
< 0.9 should be explained or repeated. FPKM values are determined using “cufflinks” [3].
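The replicate check can be sketched as follows. This uses the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments) and a plain Pearson correlation. Cufflinks estimates FPKM with a statistical model rather than this direct formula, and the note does not say which correlation measure or data transformation is used, so both choices are assumptions.

```python
import math


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def replicates_agree(fpkm_a, fpkm_b, threshold=0.9):
    """Flag whether two replicates exceed the 0.9 correlation guideline."""
    return pearson(fpkm_a, fpkm_b) > threshold


# Hypothetical FPKM vectors for the same four genes in two replicates.
rep1 = [10.0, 250.0, 3.5, 80.0]
rep2 = [12.0, 240.0, 4.0, 75.0]
print(replicates_agree(rep1, rep2))
```

In practice the correlation is often computed on log-transformed values so that a few highly expressed genes do not dominate the result.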
Figure 4. Examples of Gene Expression Analyses.
A) Overview of the distribution of the data (MA plot). Normalized counts of aligned reads are used as input to generate the graph. The Y-axis (M) displays the
intensity ratio of each gene (fold change in log2) whereas the X-axis (A) displays the average abundance of each gene, irrespective of the experimental condition.
B) Comparison of estimated fold changes (log2) from a typical RNA-Seq experiment (Y-axis) vs. a microarray experiment (X-axis) using two experimental conditions. Only genes to which
at least 96 aligned reads were assigned were considered (9,972/12,425 genes). Green and red dots represent genes for which p-values are > 0.05 and < 0.05, respectively.
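The M and A values plotted in panel A can be computed per gene as sketched below. The pseudocount, the direction of the ratio (second condition over first) and the reading of the 96-read filter as a total across both conditions are assumptions made for illustration; the note does not specify them.

```python
import math


def ma_values(count_a, count_b, pseudocount=0.5):
    """Return (M, A) for one gene from normalized counts in two conditions.

    M is the log2 fold change of condition B over condition A.
    A is the mean of the two log2 abundances.
    """
    a = count_a + pseudocount
    b = count_b + pseudocount
    m = math.log2(b / a)
    avg = 0.5 * (math.log2(a) + math.log2(b))
    return m, avg


def passes_read_filter(count_a, count_b, min_reads=96):
    """Keep genes with at least `min_reads` reads summed over both conditions."""
    return count_a + count_b >= min_reads


# Hypothetical normalized counts for one gene in the two conditions.
print(ma_values(8, 32, pseudocount=0.0))
print(passes_read_filter(8, 32))
```

The pseudocount only serves to avoid taking the logarithm of zero for genes with no reads in one condition.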
Additional features for data visualization are currently being developed. Finally, the findings of a gene ontology analysis of
differentially expressed genes, done using “goseq” [7], are also provided. A deeper understanding of networks and pathways can be achieved by uploading
differential gene expression analysis results into software such as Ingenuity (licenses are available for our users).