ROSO: optimizing oligonucleotide probes for microarrays

Transcription

ROSO: optimizing oligonucleotide probes for microarrays
BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 2 2004, pages 271–273
DOI: 10.1093/bioinformatics/btg401
ROSO: optimizing oligonucleotide probes for
microarrays
Nancie Reymond1, ∗, Hubert Charles1 , Laurent Duret2 , Federica
Calevro1 , Guillaume Beslon3 and Jean-Michel Fayard1
1 Biologie
Fonctionnelle, Insectes et Interactions (BF2I), UMR INRA/INSA de Lyon,
69621 Villeurbanne, France, 2 BBE, UMR CNRS, Université Claude Bernard—Lyon 1,
69622 Villeurbanne, France and 3 PRISMa, 69621 Villeurbanne, France
Received on April 1, 2003; revised on June 20, 2003; accepted on August 5, 2003
ABSTRACT
Summary: ROSO is software to design optimal oligonucleotide probe sets for microarrays. Selected probes show no
significant cross-hybridization, no stable secondary structures
and their Tm are chosen to minimize the Tm variability of the
probe set.
Availability: The program is available on the internet. Sources
are freely available, for non-profit use, on request to the
authors.
Contact: [email protected]
Supplementary information: http://pbil.univ-lyon1.fr/roso
Until recently, the analysis of gene function and regulation has been carried out using step-by-step studies of a
unique gene. Currently, with the increasing amount of genomic sequences available, microarray technology offers the
possibility of studying the expression of hundreds of thousands of genes in a single experiment and appears to be a
powerful tool in achieving a more global investigation of
biological questions. To take advantage of this emerging technology, microarray users need to be very careful as regards the
definition of an appropriate probe design strategy (Holloway
et al., 2002). Either PCR amplified fragments (>100 bases)
or synthetic oligonucleotides (25–70 bases) can be used as
probes. However, compared with long PCR probes, oligonucleotide probes provide the possibility of targeting a specific
region of the transcript (Kane et al., 2000), thus minimizing cross-hybridization events (Tomiuk and Hofman, 2001).
Nevertheless, the choice of an optimal oligonucleotide probe
set is a time-consuming multicriteria problem that requires a
complete bioinformatic analysis of genes. In most of the free
software developed to date, each possible probe is analyzed
individually regarding its duplex stability [i.e. melting temperature (Tm ) and secondary structures] with threshold values
determined a priori. For the study of probe specificity, two
approaches are commonly used. Some software perform this
analysis with a BLAST search (Altschul et al., 1997), either
∗ To
whom correspondence should be addressed.
Bioinformatics 20(2) © Oxford University Press 2004; all rights reserved.
on the probe sequence (OligoArray: Rouillard et al., 2002), or
on the whole input sequence (OligoArray2.0: Rouillard et al.,
2003 or Oligo Wiz: Nielsen et al., 2003). Others software are
not directly relying on BLAST search (OligoPicker: Wang
and Seed, 2003) or use other algorithms (ProbeSelect: Li and
Stormo, 2001). The originality of the ROSO software is to
separate the time-consuming step of cross-hybridization analysis (estimated on the whole gene sequence using BLAST)
from the fast step of thermodynamic analysis. ROSO is a
French acronym for ‘Recherche et Optimisation de Sondes
Oligonucléotidiques’.
ROSO is a program written in C that can run on all platforms
supporting C and BLAST program. BLAST results are parsed
using the parser Dublastn (Duret, unpublished data). Implementations are available on Irix (Silicon Graphics), Solaris
(Sun) and Windows platforms. A web interface developed in
PHP is also available (see Supplementary information).
ROSO utilization requires two kinds of input fastaformatted files: an interest file and a facultative external file.
The interest file contains the sequences of the overall genes
to be spotted on the chip while the external file contains the
sequences of any genes that are not to be spotted. It enables the
user to avoid cross-hybridization explicitly with known genes.
As an example, if the interest file contains a subset of CDSs
from one organism, the other CDSs, as well as pseudogenes
and intergenic regions, might be added in the external file. A
list of external files dedicated to model organisms is provided
on the ROSO web site. When design is performed without any
complete external file, it may be useful to remove repeated
sequences from the interest file before running ROSO.
For the main parameters, that need to be defined to start the
probe selection, default values are proposed: targets and ionic
concentrations (K + and Na+ : 1 M, Target: 10−6 M), probe
size and probe orientation (either reverse complementary or
identical), probe location along the gene sequence, the number
of probes per gene (overlapping or not), the threshold values for secondary structure stability (Hairpin: 0 kcal/mol and
Homoduplex: −6 kcal/mol) and the hybridization temperature
(65◦ C in the absence of denaturing agents). Parameter range
271
N.Reymond et al.
and significance are described in the Help section of the
web site.
Our algorithm is organized in five successive steps in order
of decreasing importance. The optimization process can be
briefly described as follows:
(1) The interest file is reduced in three ways. First, ROSO
removes any putative identical genes, defined here as
any pairs showing more than 98% identity on more
than 100 bases, or 95% for expressed sequence tag
sequences. For each group of identical genes only the
longest is kept in the interest file and the name of
identical genes is reported in the output file. Second,
the user can choose probe localization on a range of
n nucleotides from the 3 or 5 end of genes. Third,
during the probe selection process, probes with undetermined or degenerated nucleotides are not selected for
the next step. Finally, the user can accept, or not,
any sequences with small repetitions of four identical
nucleotides (such repeated sequences might facilitate
cross-hybridization and are difficult to synthesize).
(2) Putative cross-hybridizations are checked for each gene
within the interest file and compared to both the interest
and the external sequences using the BLAST program.
BLAST parameters (W = 7, z = 1 000 000, r = 2)
were estimated on simulated data sets (data not shown)
to detect a minimal identity of 70% on 20 bases. In fact,
it has been reported that at this threshold the risks of
cross-hybridization are not significant (Richmond et al.,
1999). Moreover, Hughes et al. (2001) have shown that
a 70% minimal similarity is sufficient to reduce crosshybridization to the noise level. This phase ends with the
computation of a cross-hybridization homology score
for each putative probe. This score is calculated as
the mean of each nucleotide homology (corresponding
to the higher BLAST score hit homology). Homology
ranges from 60% for a specific area (this value corresponding to the maximal sensibility of BLAST) to 100%
for a strictly homologous area.
(3) The software then searches for stable secondary structures (hairpin and homoduplex) in the probes with the
best specificity. Such structures act as a barrier to
hybridization between the probe and its target. Each
secondary structure conformation is analyzed and the
corresponding free energy is computed at the user
hybridization temperature. The simple algorithm of
Rychlik and Rhoads (1989) is used with the thermodynamics parameters of Groebe and Uhlenbeck (1988).
All the probes which may turn into stable secondary
structures (i.e. the free energy of which is above a
certain threshold) are removed from the probe list.
(4) The melting temperature (Tm ) of all the remaining probes is estimated using the nearest-neighbor
thermodynamic model and the parameters indicated in
272
SantaLucia (1998). This model, dedicated to hybridization in solution, is perhaps misleading for fixed
oligonucleotides. However, it allows a relative Tm calculation that permits the iterative selection of a set of
probes with a minimal Tm variability.
(5) The ultimate step is the selection of the optimal probe
set. It is based on four stability criteria: G + C rate
(preferably between 40 and 65%), first and last bases
(preferably a G or a C), repetitions (avoid GGG or CCC
strings) and free energy (GC clamps at both ends).
In addition, ROSO allows users to calculate the Tm of control
probes with mismatches to evaluate cross-hybridization on the
microarray, but these functions are not available online.
The originality of our optimization process is mainly based
on the separation between the specificity analysis (using
BLAST) and the probe selection. Indeed, BLAST finds word
matches of a minimal size before extending them. Then, running BLAST oligonucleotide probes against genes leads to
loss of information, as compared to running BLAST between
the whole genes, when a word matches outside a potential probe. Moreover, since microarrays involve the parallel
hybridization of many probes, the thermodynamic properties
need to be homogeneous among the entire probe set. The
ROSO structure enables the user to isolate the time-consuming
BLAST step, thus enabling us to perform efficiently an
iterative selection of the probes in order to ensure the thermodynamic uniformity. This progressive refinement of the probe
selection enables us to introduce different criteria of localization or secondary structure thresholds. Models of such
multicriteria research are proposed on the web site (see Supplementary information). When confronted with ‘problematic
genes’, the program can then quickly compute several solutions according to the relative weights of the different criteria.
ROSO was first validated on simulated data (with probe
sizes between 15 and 70 bases and G + C content ranging
from 30 to 70%). Tm and secondary structures computed with
ROSO were found to be equivalent to those obtained with
the reference software Oligo6® (Rychlik, Molecular Biology
Insights). Second, ROSO was used to design several probe
sets for microarray projects (see Supplementary information).
As an example, an optimal 70 bases probe set was designed
for 5133 CDSs of the bacterium Ralstonia solanacearum. In
this experiment the design was realized with three successive
runs: after the first run, optimal probes were found for 4677
(91%) genes. After the second run, 415 (8%) supplementary probes were found with more stable secondary structures.
The last 22 probes (1%) were found by relaxing the hairpin
threshold to −12 kcal/mol. The probes set has a 7◦ C Tm range
and 4924 (96%) specific probes (i.e. with no significant crosshybridization). Finally, an experimental validation has been
achieved with 35 base probes designed for four CDS of the
bacterium Buchnera aphidicola, three positive controls, one
negative control and four mismatch control (MM). MM were
ROSO: optimizing oligonucleotide probes
designed to be identical to their perfect match partners except
3 or 5 bases dispread along the probe. Hybridization results
agreed with the predictions and allow us to assess specificity
(unpublished data).
ROSO can also be used to select PCR probes (>300 bases),
when using appropriate parameters. At present, small
probes (<25 nt) designed with a sequence signature centrally positioned (Probe software: Pozhitkov and Tautz, 2002;
http://www.biomedcentral.com/) and used in SNP detection,
bacterial identification or analysis of expression of highly
homologous genes (paralogous family), cannot be selected
with ROSO. However, a small probe version is currently under
development which introduces mismatch position control.
ACKNOWLEDGEMENTS
We are grateful to Christian Boucher for his collaboration
(experimental test) on the Ralstonia solanacearum data set
and Tom Wilkinson for his critical reading. This work was
supported by a PhD fellowship from the French Ministry of
Education and Research, the Action Bioinformatique InterEPST and a BQR INSA-UCBL.
REFERENCES
Altschul,S., Madden,T., Schäffer,A., Zhang,J., Zhang,Z., Miller,W.
and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Res., 25, 3389–3402.
Groebe,D. and Uhlenbeck,O. (1988) Characterization of RNA hairpin loop stability. Nucleic Acids Res., 16, 11725–11735.
Holloway,A., Van Laar,K., Tothill,R. and Bowtell,D. (2002) Options
available-from start to finish-for obtaining data from DNA
microarrays II. Nat. Genet. Suppl., 32, 481–489.
Hughes,T., Mao,M., Jones,A., Burchard,J., Marton,M., Shannon,K.,
Lefkowitz,S., Ziman,M., Schelter,J., Meyer,M. et al. (2001)
Expression profiling using microarrays fabricated by an ink-jet
oligonucleotide synthetizer. Nat. Biotechnol., 19, 342–347.
Kane,M., Jatkoe,T., Stumpf,C., Lu,J., Thomas,J. and Madore,S.
(2000) Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Res., 28,
4552–4557.
Li,F. and Stormo,G. (2001) Selection of optimal DNA oligos for gene
expression arrays. Bioinformatics, 17, 1067–1076.
Nielsen,H., Wernersson,R. and Knudsen,S. (2003) Design of oligonucleotides for microarrays and perspectives for design of
multi-transcriptome arrays. Nucleic Acids Res., 31, 3491–3496.
Pozhitkov,A. and Tautz,D. (2002) An algorithm and program for
finding sequence specific oligonucleotide probes for species
identification. BMC Bioinformatics, 3, 1471–2105/3/9.
Richmond,C., Glasner,J., Mau,R., Jin,H. and Blattner,F. (1999)
Genome-wide expression profiling in Escherichia coli K-12.
Nucleic Acids Res., 27, 3821–3835.
Rouillard,J.-M., Herbert,C. and Zucker,M. (2002) OligoArray:
genome-scale oligonucleotide design for microarrays.
Bioinformatics, 18, 486–487.
Rouillard,J.-M., Zuker,M. and Gulari,E. (2003) OligoArray 2.0:
design of oligonucleotide probes for DNA microarrays using a
thermodynamic approach. Nucleic Acids Res., 31, 3057–3062.
Rychlik,W. and Rhoads,R. (1989) A computer program for choosing
optimal oligonucleotides for filter hybridization, sequencing and
in vitro amplification of DNA. Nucleic Acids Res., 17, 8543–8551.
SantaLucia,J. (1998) A unified view of polymer, dumbbell, and
oligonucleotide DNA nearest-neighbor thermodynamics. Proc.
Natl Acad. Sci. USA, 95, 1460–1465.
Tomiuk,S. and Hofman,K. (2001) Microarray probe strategies. Brief.
Bioinform., 2, 329–340.
Wang,X. and Seed,B. (2003) Selection of oligonucleotide probes for
protein coding sequences. Bioinformatics, 19, 796–802.
273

Documents pareils