ROSO: optimizing oligonucleotide probes for microarrays
Transcription
ROSO: optimizing oligonucleotide probes for microarrays
BIOINFORMATICS APPLICATIONS NOTE Vol. 20 no. 2 2004, pages 271–273 DOI: 10.1093/bioinformatics/btg401 ROSO: optimizing oligonucleotide probes for microarrays Nancie Reymond1, ∗, Hubert Charles1 , Laurent Duret2 , Federica Calevro1 , Guillaume Beslon3 and Jean-Michel Fayard1 1 Biologie Fonctionnelle, Insectes et Interactions (BF2I), UMR INRA/INSA de Lyon, 69621 Villeurbanne, France, 2 BBE, UMR CNRS, Université Claude Bernard—Lyon 1, 69622 Villeurbanne, France and 3 PRISMa, 69621 Villeurbanne, France Received on April 1, 2003; revised on June 20, 2003; accepted on August 5, 2003 ABSTRACT Summary: ROSO is software to design optimal oligonucleotide probe sets for microarrays. Selected probes show no significant cross-hybridization, no stable secondary structures and their Tm are chosen to minimize the Tm variability of the probe set. Availability: The program is available on the internet. Sources are freely available, for non-profit use, on request to the authors. Contact: [email protected] Supplementary information: http://pbil.univ-lyon1.fr/roso Until recently, the analysis of gene function and regulation has been carried out using step-by-step studies of a unique gene. Currently, with the increasing amount of genomic sequences available, microarray technology offers the possibility of studying the expression of hundreds of thousands of genes in a single experiment and appears to be a powerful tool in achieving a more global investigation of biological questions. To take advantage of this emerging technology, microarray users need to be very careful as regards the definition of an appropriate probe design strategy (Holloway et al., 2002). Either PCR amplified fragments (>100 bases) or synthetic oligonucleotides (25–70 bases) can be used as probes. However, compared with long PCR probes, oligonucleotide probes provide the possibility of targeting a specific region of the transcript (Kane et al., 2000), thus minimizing cross-hybridization events (Tomiuk and Hofman, 2001). Nevertheless, the choice of an optimal oligonucleotide probe set is a time-consuming multicriteria problem that requires a complete bioinformatic analysis of genes. In most of the free software developed to date, each possible probe is analyzed individually regarding its duplex stability [i.e. melting temperature (Tm ) and secondary structures] with threshold values determined a priori. For the study of probe specificity, two approaches are commonly used. Some software perform this analysis with a BLAST search (Altschul et al., 1997), either ∗ To whom correspondence should be addressed. Bioinformatics 20(2) © Oxford University Press 2004; all rights reserved. on the probe sequence (OligoArray: Rouillard et al., 2002), or on the whole input sequence (OligoArray2.0: Rouillard et al., 2003 or Oligo Wiz: Nielsen et al., 2003). Others software are not directly relying on BLAST search (OligoPicker: Wang and Seed, 2003) or use other algorithms (ProbeSelect: Li and Stormo, 2001). The originality of the ROSO software is to separate the time-consuming step of cross-hybridization analysis (estimated on the whole gene sequence using BLAST) from the fast step of thermodynamic analysis. ROSO is a French acronym for ‘Recherche et Optimisation de Sondes Oligonucléotidiques’. ROSO is a program written in C that can run on all platforms supporting C and BLAST program. BLAST results are parsed using the parser Dublastn (Duret, unpublished data). Implementations are available on Irix (Silicon Graphics), Solaris (Sun) and Windows platforms. A web interface developed in PHP is also available (see Supplementary information). ROSO utilization requires two kinds of input fastaformatted files: an interest file and a facultative external file. The interest file contains the sequences of the overall genes to be spotted on the chip while the external file contains the sequences of any genes that are not to be spotted. It enables the user to avoid cross-hybridization explicitly with known genes. As an example, if the interest file contains a subset of CDSs from one organism, the other CDSs, as well as pseudogenes and intergenic regions, might be added in the external file. A list of external files dedicated to model organisms is provided on the ROSO web site. When design is performed without any complete external file, it may be useful to remove repeated sequences from the interest file before running ROSO. For the main parameters, that need to be defined to start the probe selection, default values are proposed: targets and ionic concentrations (K + and Na+ : 1 M, Target: 10−6 M), probe size and probe orientation (either reverse complementary or identical), probe location along the gene sequence, the number of probes per gene (overlapping or not), the threshold values for secondary structure stability (Hairpin: 0 kcal/mol and Homoduplex: −6 kcal/mol) and the hybridization temperature (65◦ C in the absence of denaturing agents). Parameter range 271 N.Reymond et al. and significance are described in the Help section of the web site. Our algorithm is organized in five successive steps in order of decreasing importance. The optimization process can be briefly described as follows: (1) The interest file is reduced in three ways. First, ROSO removes any putative identical genes, defined here as any pairs showing more than 98% identity on more than 100 bases, or 95% for expressed sequence tag sequences. For each group of identical genes only the longest is kept in the interest file and the name of identical genes is reported in the output file. Second, the user can choose probe localization on a range of n nucleotides from the 3 or 5 end of genes. Third, during the probe selection process, probes with undetermined or degenerated nucleotides are not selected for the next step. Finally, the user can accept, or not, any sequences with small repetitions of four identical nucleotides (such repeated sequences might facilitate cross-hybridization and are difficult to synthesize). (2) Putative cross-hybridizations are checked for each gene within the interest file and compared to both the interest and the external sequences using the BLAST program. BLAST parameters (W = 7, z = 1 000 000, r = 2) were estimated on simulated data sets (data not shown) to detect a minimal identity of 70% on 20 bases. In fact, it has been reported that at this threshold the risks of cross-hybridization are not significant (Richmond et al., 1999). Moreover, Hughes et al. (2001) have shown that a 70% minimal similarity is sufficient to reduce crosshybridization to the noise level. This phase ends with the computation of a cross-hybridization homology score for each putative probe. This score is calculated as the mean of each nucleotide homology (corresponding to the higher BLAST score hit homology). Homology ranges from 60% for a specific area (this value corresponding to the maximal sensibility of BLAST) to 100% for a strictly homologous area. (3) The software then searches for stable secondary structures (hairpin and homoduplex) in the probes with the best specificity. Such structures act as a barrier to hybridization between the probe and its target. Each secondary structure conformation is analyzed and the corresponding free energy is computed at the user hybridization temperature. The simple algorithm of Rychlik and Rhoads (1989) is used with the thermodynamics parameters of Groebe and Uhlenbeck (1988). All the probes which may turn into stable secondary structures (i.e. the free energy of which is above a certain threshold) are removed from the probe list. (4) The melting temperature (Tm ) of all the remaining probes is estimated using the nearest-neighbor thermodynamic model and the parameters indicated in 272 SantaLucia (1998). This model, dedicated to hybridization in solution, is perhaps misleading for fixed oligonucleotides. However, it allows a relative Tm calculation that permits the iterative selection of a set of probes with a minimal Tm variability. (5) The ultimate step is the selection of the optimal probe set. It is based on four stability criteria: G + C rate (preferably between 40 and 65%), first and last bases (preferably a G or a C), repetitions (avoid GGG or CCC strings) and free energy (GC clamps at both ends). In addition, ROSO allows users to calculate the Tm of control probes with mismatches to evaluate cross-hybridization on the microarray, but these functions are not available online. The originality of our optimization process is mainly based on the separation between the specificity analysis (using BLAST) and the probe selection. Indeed, BLAST finds word matches of a minimal size before extending them. Then, running BLAST oligonucleotide probes against genes leads to loss of information, as compared to running BLAST between the whole genes, when a word matches outside a potential probe. Moreover, since microarrays involve the parallel hybridization of many probes, the thermodynamic properties need to be homogeneous among the entire probe set. The ROSO structure enables the user to isolate the time-consuming BLAST step, thus enabling us to perform efficiently an iterative selection of the probes in order to ensure the thermodynamic uniformity. This progressive refinement of the probe selection enables us to introduce different criteria of localization or secondary structure thresholds. Models of such multicriteria research are proposed on the web site (see Supplementary information). When confronted with ‘problematic genes’, the program can then quickly compute several solutions according to the relative weights of the different criteria. ROSO was first validated on simulated data (with probe sizes between 15 and 70 bases and G + C content ranging from 30 to 70%). Tm and secondary structures computed with ROSO were found to be equivalent to those obtained with the reference software Oligo6® (Rychlik, Molecular Biology Insights). Second, ROSO was used to design several probe sets for microarray projects (see Supplementary information). As an example, an optimal 70 bases probe set was designed for 5133 CDSs of the bacterium Ralstonia solanacearum. In this experiment the design was realized with three successive runs: after the first run, optimal probes were found for 4677 (91%) genes. After the second run, 415 (8%) supplementary probes were found with more stable secondary structures. The last 22 probes (1%) were found by relaxing the hairpin threshold to −12 kcal/mol. The probes set has a 7◦ C Tm range and 4924 (96%) specific probes (i.e. with no significant crosshybridization). Finally, an experimental validation has been achieved with 35 base probes designed for four CDS of the bacterium Buchnera aphidicola, three positive controls, one negative control and four mismatch control (MM). MM were ROSO: optimizing oligonucleotide probes designed to be identical to their perfect match partners except 3 or 5 bases dispread along the probe. Hybridization results agreed with the predictions and allow us to assess specificity (unpublished data). ROSO can also be used to select PCR probes (>300 bases), when using appropriate parameters. At present, small probes (<25 nt) designed with a sequence signature centrally positioned (Probe software: Pozhitkov and Tautz, 2002; http://www.biomedcentral.com/) and used in SNP detection, bacterial identification or analysis of expression of highly homologous genes (paralogous family), cannot be selected with ROSO. However, a small probe version is currently under development which introduces mismatch position control. ACKNOWLEDGEMENTS We are grateful to Christian Boucher for his collaboration (experimental test) on the Ralstonia solanacearum data set and Tom Wilkinson for his critical reading. This work was supported by a PhD fellowship from the French Ministry of Education and Research, the Action Bioinformatique InterEPST and a BQR INSA-UCBL. REFERENCES Altschul,S., Madden,T., Schäffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Groebe,D. and Uhlenbeck,O. (1988) Characterization of RNA hairpin loop stability. Nucleic Acids Res., 16, 11725–11735. Holloway,A., Van Laar,K., Tothill,R. and Bowtell,D. (2002) Options available-from start to finish-for obtaining data from DNA microarrays II. Nat. Genet. Suppl., 32, 481–489. Hughes,T., Mao,M., Jones,A., Burchard,J., Marton,M., Shannon,K., Lefkowitz,S., Ziman,M., Schelter,J., Meyer,M. et al. (2001) Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthetizer. Nat. Biotechnol., 19, 342–347. Kane,M., Jatkoe,T., Stumpf,C., Lu,J., Thomas,J. and Madore,S. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Res., 28, 4552–4557. Li,F. and Stormo,G. (2001) Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17, 1067–1076. Nielsen,H., Wernersson,R. and Knudsen,S. (2003) Design of oligonucleotides for microarrays and perspectives for design of multi-transcriptome arrays. Nucleic Acids Res., 31, 3491–3496. Pozhitkov,A. and Tautz,D. (2002) An algorithm and program for finding sequence specific oligonucleotide probes for species identification. BMC Bioinformatics, 3, 1471–2105/3/9. Richmond,C., Glasner,J., Mau,R., Jin,H. and Blattner,F. (1999) Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res., 27, 3821–3835. Rouillard,J.-M., Herbert,C. and Zucker,M. (2002) OligoArray: genome-scale oligonucleotide design for microarrays. Bioinformatics, 18, 486–487. Rouillard,J.-M., Zuker,M. and Gulari,E. (2003) OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res., 31, 3057–3062. Rychlik,W. and Rhoads,R. (1989) A computer program for choosing optimal oligonucleotides for filter hybridization, sequencing and in vitro amplification of DNA. Nucleic Acids Res., 17, 8543–8551. SantaLucia,J. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl Acad. Sci. USA, 95, 1460–1465. Tomiuk,S. and Hofman,K. (2001) Microarray probe strategies. Brief. Bioinform., 2, 329–340. Wang,X. and Seed,B. (2003) Selection of oligonucleotide probes for protein coding sequences. Bioinformatics, 19, 796–802. 273