RNA profiling using Next-Generation Sequencing technologies
Technical Application Note

Sylvie LaBoissière, Alexandre Montpetit, Maxime Caron, Mathieu Bourgey, Gregory Voisin and Guillaume Bourque.
McGill University and Génome Québec Innovation Centre, Montréal, QC

Client Management Office: [email protected], 514 398-7211
Assistant Scientific Director: Alexandre Montpetit, PhD, [email protected], 514 398-3311 ext. 00913
McGill University and Génome Québec Innovation Centre, 740 Docteur Penfield Ave., Montréal, QC Canada H3A 0G1
T 514 398-7211 • [email protected]
www.genomequebec.mcgill.ca
3043_GQ11 (03-12)

Introduction

RNA profiling using Next-Generation Sequencing technology (RNA-Seq) has started to revolutionize transcriptome analysis. For instance, it can be used to determine the structure of genes, to identify splicing variants and other post-transcriptional modifications, to detect rare and novel transcripts, and to quantify the changing expression levels of individual transcripts and splice variants. Recent advances in sequencing throughput and workflows, from sample preparation to data analysis, are increasing the speed and depth of output available from the transcriptome of a myriad of cell types. These advances have also lowered costs, which in turn makes it possible to run studies with a larger number of samples and to analyze gene expression with statistical tools similar to those available for expression arrays.

Here we describe an RNA-Seq pipeline based on Illumina technologies combined with a bioinformatics analysis pipeline assembled in-house from software freely available on the web. This pipeline is specific to mRNA profiling of eukaryotes for which a reference genome is available. Alternative pipelines are also available for the study of non-polyadenylated RNA species.

Generation of cDNA Libraries and Sequencing

An overview of the steps involved in the construction of cDNA libraries and sequencing is shown in Figure 1. Total RNA quality is verified on an RNA chip using an Agilent 2100 Bioanalyzer (Agilent), and the RNA is quantified using a NanoDrop ND-1000 UV-VIS spectrophotometer (Thermo Fisher). Libraries are generated using the TruSeq RNA kit (Illumina). Messenger RNA is purified from 1 µg of total RNA using oligo-dT beads. The mRNA-enriched fraction is reverse transcribed to generate cDNA fragments, which are sheared using a Covaris instrument to yield ~200 bp fragments. Following end-repair and 3’ end adenylation steps, an index is ligated and a PCR step is performed. Up to 24 different indexes can be ligated to cDNA fragments prior to the PCR step, which makes it possible to multiplex several samples prior to sequencing. The quality of the library is assessed on a DNA 1000 chip and the library is quantified by PCR.

Figure 1. Schematic Representation of the RNA Sequencing Workflow.

Libraries can then be subjected to 50-300 cycles of sequencing on the Illumina HiSeq2000 instrument. In a typical run, over 90% of the reads will align to the reference transcriptome. With a well-annotated genome such as human, 50 bases (cycles) in single-read mode are sufficient. However, if the goal is to describe the transcriptome of a novel species or to study alternative splicing variation, reads of 100 or even 150 bases in paired-end mode (where both ends of a fragment are sequenced) can be necessary. Each lane of an Illumina flow cell will generate over 150 million independent reads. A third of that amount will probably be sufficient to obtain results comparable to a typical expression array for most genes; however, increasing the depth of sequencing could be useful to study detailed gene structure or splicing.

Bioinformatics Analysis

Reads from the instrument are transferred onto our servers and go through our in-house bioinformatics pipeline. A report, summarizing key findings and providing a basic description of analysis steps, is generated and made available to users. Key findings include:

1) The raw read statistics (Table 1).

Table 1. Read statistics.

Sample   Raw (1)      %       Filtered (2)   %      Aligned (3)   %
1        65,391,207   100.0   63,645,362     97.3   60,079,278    91.9
2        77,247,651   100.0   74,992,558     97.1   71,048,142    92.0
3        74,948,506   100.0   72,833,957     97.2   69,338,686    92.5

(1) Raw reads correspond to the number of reads from the sequencing instrument.
(2) Filtered reads correspond to the number of reads after quality and length trimming. Sequence trimming is performed with “fastx” and alignment with “BWA” [1].
(3) Aligned reads correspond to the number of reads that were successfully aligned to a reference genome, if applicable.

2) The sequencing coverage (Figure 2), which is calculated with the “Genome Analysis ToolKit” [2]. In the example shown in Figure 2, 40-60% of the nucleotides in the exome have been sequenced at least 10 times (10X) across the 6 study samples. By contrast, ~25% of nucleotides display a coverage of 100X.

Figure 2. Sequence Coverage of RNA-Seq Samples. Aligned sequences are used as input for the calculations.

3) Another measure of quality is the calculation of FPKM correlations across study samples (not shown). Biological replicates should have correlation values above 0.9 to get the most out of differential gene expression analyses across samples. Replicates with correlation values < 0.9 should be explained or repeated. FPKM values are determined using “cufflinks” [3].

4) “Wiggle Tracks” files are generated from the aligned reads using “FindPeaks” [4]. They can be loaded in browsers such as IGV and UCSC and represent overall read counts across the genome (Figure 3).

Figure 3. Visualization of Sequence Coverage of RNA-Seq Samples using UCSC Browser. Aligned reads are used to generate a wiggle file (.wig.gz) with FindPeaks. The file is loaded in the IGV browser.

5) Gene expression analysis is carried out with “edgeR” and “DESeq”. edgeR uses empirical Bayes estimation and exact tests based on the negative binomial distribution [5], whereas DESeq uses a model based on the negative binomial distribution [6]. Raw read counts generated by the sequencing instruments are used as input. As with gene expression analyses of microarray experiments, data are normalized prior to calculating fold changes and p-values. High correlation is observed between the data obtained from the two approaches (Figure 4).

Figure 4. Examples of Gene Expression Analyses. A) Overview of the distribution of the data (MA plot). Normalized counts of aligned reads are used as input to generate the graph. The Y-axis (M) displays the intensity ratio of each gene (fold change in log2), whereas the X-axis (A) displays the average abundance of each gene, irrespective of the experimental condition. B) Comparison of estimated fold changes (log2) from a typical RNA-Seq experiment (Y-axis) vs. microarray (X-axis) using two experimental conditions. Only genes to which at least 96 aligned reads were assigned were considered (9,972/12,425 genes). Green and red dots represent genes for which p-values are > 0.05 and < 0.05, respectively.

Additional features for data visualization are currently being developed. Finally, the findings of a gene ontology analysis performed with “goseq” [7] on differentially expressed genes are also provided. A deeper understanding of networks and pathways can be gained by uploading the differential gene expression results into software such as Ingenuity (licenses are available for our users).

Conclusion

An RNA-Seq pipeline was created in order to facilitate mRNA profiling of eukaryotic cells. The pipeline includes the analysis of differential gene expression. Preliminary results indicate good correlations with microarray data. Future developments upstream from the bioinformatics pipeline include increasing the throughput for generating the libraries, decreasing sample requirements and developing a pipeline for the profiling of non-polyA RNA species. Future work on the bioinformatics pipeline will focus on de novo transcript discovery and the analysis of splice isoforms.

References

[1] H. Li and R. Durbin (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14): 1754–1760.
[2] M.A. DePristo, E. Banks, R. Poplin, K.V. Garimella, J.R. Maguire, C. Hartl, A.A. Philippakis, G. Del Angel, M.A. Rivas, M. Hanna et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5): 491–498.
[3] A. Roberts, H. Pimentel, C. Trapnell and L. Pachter (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 27(17): 2325–2329.
[4] A.P. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge and S.J.M. Jones (2008). FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics, 24(15): 1729–1730.
[5] M.D. Robinson, D.J. McCarthy and G.K. Smyth (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1): 139–140.
[6] S. Anders and W. Huber (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10): R106.
[7] M.D. Young, M.J. Wakefield, G.K. Smyth and A. Oshlack (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11(2): R14.
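As a rough illustration of the quantities mentioned in the bioinformatics analysis (FPKM values, the 0.9 replicate correlation threshold, and the M and A axes of an MA plot), the following Python sketch may help. It is not the code used in our pipeline: it uses the textbook FPKM formula rather than the statistical model implemented in cufflinks, and the function names, the pseudocount of 1 and the example counts are assumptions made for illustration only.

```python
import math


def simple_fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments.

    Textbook formula only (assumption); cufflinks estimates FPKM differently.
    """
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def replicates_pass(fpkm_a, fpkm_b, threshold=0.9):
    """True when two replicates reach the correlation threshold (0.9 in the note)."""
    return pearson(fpkm_a, fpkm_b) >= threshold


def ma_values(norm_a, norm_b, pseudocount=1.0):
    """Return (M, A) lists from normalized counts of two conditions.

    M is the log2 fold change, A the mean log2 abundance. The pseudocount
    is an assumption that avoids taking the log of zero.
    """
    m_vals, a_vals = [], []
    for count_a, count_b in zip(norm_a, norm_b):
        log_a = math.log2(count_a + pseudocount)
        log_b = math.log2(count_b + pseudocount)
        m_vals.append(log_a - log_b)
        a_vals.append((log_a + log_b) / 2)
    return m_vals, a_vals


if __name__ == "__main__":
    # Hypothetical gene lengths (bp) and read counts for two replicates.
    lengths = [1000, 2000, 500, 1500]
    rep1 = [100, 400, 50, 300]
    rep2 = [110, 380, 55, 290]
    fpkm1 = simple_fpkm(rep1, lengths)
    fpkm2 = simple_fpkm(rep2, lengths)
    print("replicate correlation:", pearson(fpkm1, fpkm2))
    print("replicates pass:", replicates_pass(fpkm1, fpkm2))
    print("M and A values:", ma_values(rep1, rep2))
```

In practice the fold changes and p-values reported to users come from edgeR and DESeq, which model count dispersion; the sketch only shows how the plotted axes and the replicate check relate to the underlying counts.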