FLLL-TR-0210 - Department of Knowledge
Technical Report FLLL–TR–0210

Dimension Reduction: A Comparison of Methods with respect to Fault Diagnosis of Engine Test Bench Data

Edwin Lughofer
Fuzzy Logic Laboratorium Linz-Hagenberg

Werner Groissböck
Fuzzy Logic Laboratorium Linz-Hagenberg

Johannes Kepler Universität Linz
Institut für Algebra, Stochastik und wissensbasierte mathematische Systeme
Johannes Kepler Universität Linz, A-4040 Linz, Austria

Fuzzy Logic Laboratorium Linz-Hagenberg (FLLL)
Softwarepark Hagenberg, Hauptstrasse 99, A-4232 Hagenberg, Austria

1 Motivation

In many real-world industrial applications of data analysis, data transformation or data processing, the question arises of how to keep the dimension of the data (for example, the number of columns in a data matrix) manageable. In a number of situations it is therefore useful or even indispensable to first reduce the dimension of the data, keeping as much of the original information as possible, and then feed the reduced data into the system. Situations that call for dimension reduction algorithms include:

Too few data records for the given input dimension: A well-known and widely used estimate of how many data records are needed to apply data analysis methods to a given number n of measured channels is $n^2$. Of course, this is just a rough estimate and has to be adapted to the special needs of a concrete method (see [1]). If too few data points are used for data analysis methods, the probability of obtaining inaccurate or even unstable results increases.

The empty space phenomenon: The curse of dimensionality refers to the fact that the sample size needed to estimate a function of several variables to a given degree of accuracy grows exponentially with the number of variables (= number of dimensions).
A related fact, responsible for the curse of dimensionality, is the so-called empty space phenomenon: high-dimensional spaces are inherently sparse. Consider, for example, a multi-dimensional lattice over the input space and a fixed number of data points characterizing the input. The density of data points in the cuboids formed by the lattice then decreases as dimensions are added. For practical use there is a lower bound on how many data points should lie in one cuboid, and this bound is given as 10. Thus, for a multi-dimensional cuboid of dimension dim, $10^{dim}$ data points are needed (see also [1] or [2]).

Visualization of data clouds and data models: In scientific applications of data analysis and data processing it is very welcome or even indispensable, first, to visualize the data in order to get a first impression of its nature and, second, to visualize the empirical models which describe the data with special parameters, in order to verify the model generation methods. Dimension reduction is needed for this purpose as well, because only models of up to 3 dimensions can be visualized. 4- or even 5-dimensional models can be prepared for visualization by setting 1 or 2 dimensions to a fixed value, but this only simulates a visualization, so that information can hardly be read off the resulting curves, points etc.

Fast computation time for data analysis methods: For the verification and validation of results achieved by theoretical research, as much data as possible is required. This does not hold for real-time applications, where for example data models need to be computed online. The computation time of some data processing algorithms even grows exponentially with the dimension of the data. Hence, dimension reduction is needed to bring the computation time below the required maximum.
Local regression methods which build up models on the "check side" from the actual test data points take a long time to come to a plausibility decision if, for example, all dimensions are used as input space. To check the speed of the method and the improvement gained through dimension reduction, a 200-dimensional data set with 1534 data points was chosen as input to a local regression method. The calculation was performed in MATHEMATICA under Windows 2000 on a Pentium II 366 MHz processor with 128 MB RAM. The following results were achieved:

Computation time for the 200-dimensional input data set: 20.23 seconds
Computation time after reducing the data dimension to 97: 3.17 seconds

The computation time was thus reduced by a factor of about 6.4, whereas the dimension was only reduced by a factor of about 2.

[Figure 1: General Dimension Reduction Process]

Complexity reduction of data analysis methods: Complexity reduction here means above all reducing the memory needed by a data processing procedure, in order to make an algorithm more efficient or even applicable at all. For example, the so-called RENO method (REgularized Numerical Optimization, see [3], [4], [5]), which builds up rule bases for fuzzy inference systems from all possible combinations of the fuzzy sets defined over each input dimension, was applied to data. With 10 input dimensions and 5 fuzzy sets defined for each of them, this configuration yields $5^{10} = 9765625$ different rules, and a virtual memory overflow occurred during the computation of RENO on a PC with 512 MB RAM.
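The rules of thumb and figures quoted in the list above are simple arithmetic. The following minimal sketch evaluates them; the function names are illustrative and do not come from the report.

```python
def records_needed_quadratic(n_channels: int) -> int:
    """Rough estimate from the text: about n^2 records for n channels."""
    return n_channels ** 2


def records_needed_lattice(dim: int, per_cuboid: int = 10) -> int:
    """Empty-space bound from the text: 10^dim points for dimension dim."""
    return per_cuboid ** dim


def full_rule_base_size(n_inputs: int, sets_per_input: int) -> int:
    """Number of rules when all fuzzy-set combinations are enumerated."""
    return sets_per_input ** n_inputs


def speedup(time_before: float, time_after: float) -> float:
    """Ratio of computation times before and after dimension reduction."""
    return time_before / time_after


if __name__ == "__main__":
    print(records_needed_quadratic(200))    # records suggested for 200 channels
    print(records_needed_lattice(5))        # lattice bound for 5 dimensions
    print(full_rule_base_size(10, 5))       # RENO configuration from the text
    print(round(speedup(20.23, 3.17), 2))   # timing example from the text
```

The exponential growth of the last two quantities is what makes a full rule base or a densely populated lattice infeasible beyond a handful of dimensions.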
Figure 1 shows the work-flow of a data analysis process that uses a dimension reduction step.

2 Dimension Reduction Techniques - Theoretical Aspects

As a consequence of the previous chapter, several dimension reduction techniques were studied and applied to data coming from engine test stations during an industrial project. These techniques are presented in this chapter. It should be mentioned that this is not a complete account of the current state of the art in dimension reduction algorithms; that would include further techniques, for example:

- Projection pursuit
- Generalized additive models
- Self-organizing maps (Kohonen maps)
- Generative topographic mapping

See also [2].

2.1 Neglecting Redundancy

A substantial contribution to dimension reduction can be made by removing redundant data vectors, often called data channels. Redundancy is information which exists several times independently inside the system and therefore shows up in several columns of a data matrix. When neglecting redundancy in a data set obtained from an engine test bench, no distinction is made between redundancy among measured channels and redundancy among calculated channels.

Redundancy between 2 data channels can simply be discovered by using correlation coefficients, such as the empirical correlation coefficient for identifying redundancy causing a linear behaviour:

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1} $$

or the Spearman rank correlation coefficient for identifying redundancy causing a monotonic behaviour:

$$ r_{xy} = \frac{\sum_{i=1}^{n} \left(Rg(x_i) - \frac{n+1}{2}\right)\left(Rg(y_i) - \frac{n+1}{2}\right)}{\sqrt{\sum_{i=1}^{n} \left(Rg(x_i) - \frac{n+1}{2}\right)^2} \sqrt{\sum_{i=1}^{n} \left(Rg(y_i) - \frac{n+1}{2}\right)^2}} \tag{2} $$

where $x_i$ and $y_i$ for $i = 1, ..., n$ are the measurements of the 2 channels, $\bar{x}$ and $\bar{y}$ are the mean estimators of the 2 channels, and $Rg(x_i)$, $Rg(y_i)$ are the ranks of the i-th measurement in the sorted data columns $x_{sort}$ and $y_{sort}$.
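Equations (1) and (2) can be evaluated directly. The sketch below is a minimal pure-Python illustration; the function names are illustrative, and the use of average ranks for tied values is an assumption, since the report does not say how ties are ranked. Because the mean rank equals (n+1)/2, equation (2) is the empirical correlation coefficient of the ranks.

```python
import math


def pearson(x, y):
    """Empirical correlation coefficient, equation (1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        # The coefficient is undefined for a constant channel.
        raise ValueError("constant channel: correlation is undefined")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks Rg(.); tied values get their average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, equation (2): Pearson applied to ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    cubes = [v ** 3 for v in x]      # monotonic but nonlinear relation
    print(pearson(x, cubes))         # below 1: the relation is not linear
    print(spearman(x, cubes))        # 1: the relation is monotonic
```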
If now

$$ |r_{xy}| \geq r_{red} \tag{3} $$

where $r_{red}$ is slightly smaller than 1, a redundancy is identified. We applied this redundancy detection to measured data from a diesel engine containing 101 channels and 758 samples and obtained the results shown in Figure 2. For example, if somebody wants to build an empirical model with input channels VE, GL, QLUFT, NOX and NOX_S, the dimension can simply be reduced from 5 to 3 by using the redundancy table and neglecting QLUFT and NOX_S.

One problem in determining redundant channels is that a redundancy may be computed although it is only apparent (a "fake"), for example in the case of 2 constant channels. Fortunately, as channels are rarely exactly constant, this does not occur very often. Moreover, single-channel analysis can check whether a (nearly) constant channel appears in the data matrix, which may then be neglected.

Figure 2: Example of a redundancy causing linear behaviour (left) and the list of channel redundancies (right)

Channel   Redundant with
VE        VE_H
TA31_1    TA31_2, TAZ2, TAZ3
TA31_2    TAZ4, TAZ5
GL        GAH, QLUFT, GLF
NOX       NOX_S
GNOX      GNOX_S

2.2 T-Values

Another approach to channel reduction is given by the so-called T-values, which denote the degree of influence of an input channel on an empirical model that reproduces another channel (the output channel). To compute the T-values, a regression with linear or nonlinear regressors first has to be carried out in order to estimate the parameters $\beta_i$, $i = 1, ..., k$, where k is the number of regressors. (Note: for static linear regression without any time-discrete dynamics, the number of regressors k is equal to the number of channels m; for details see [6].)
The T-value can then be calculated by

$$ T_i = \frac{\beta_i}{s \sqrt{d_i}} \tag{4} $$

where s is an estimator of the perturbation in the regression analysis and can be calculated by

$$ s = \sqrt{\frac{SS_{reg}}{n - (m+1)}} \tag{5} $$

with $SS_{reg} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ the sum of squared errors of the regression and $d_i$ the i-th diagonal element of the matrix $(x^t x)^{-1}$.

Applying the formulas above to a data set coming from a large diesel engine yields the "influence" matrix in Table 1. Each row contains the channel which is reconstructed (output channel) and the 4 channels having the greatest influence on it.

Table 1: Table of channels having greatest influence for reproducing output channels plus t-values (= influence levels) of these channels

Output Ch  Ch 1     t-value  Ch 2     t-value  Ch 3    t-value  Ch 4     t-value
N_MI       POEL     30.18    LBDREZ   23.6     P11     20.12    LBDF2_1  19.46
PE_MI      MD_MI    8620     TA32     2.5301   TP      2.44     P31      2.405
BE         PHC      70.12    PNOX_S   69.48    PSOOT   29.57    KNOX_S   8.22
VE         VE_H     134.4    STOECH   64.4     X       9.59     KLFS     8.24
VE_H       VE       134.4    STOECH   55.02    X       10.17    P_MI     9.57
RW_S1      GSOOT    30.49    GNOX_S   13.56    LBDREZ  11.67    SOOT     8.37
LAMBDA     KHC      38.05    HC       26.25    GHC     19.71    KNOX_S   16.32
LBDREZ     N_MI     23.6     TA41     22.8     BH      16.43    LAMBDA   16.04
LBDL       LBDF2_1  53.6     TL2_1    44.78    QLUFT   29.81    NOX_S    20.7

Another way to gain information about the degree of influence of a channel in an empirical model using the regression estimators $\beta_i$ is to perform a statistical test with hypothesis $H_{0i}: \beta_i = 0$ and rejection domain

$$ \frac{|H\beta - \zeta_0|}{\sqrt{\frac{n \sigma^2}{n - (m+1)} \, H (x^t x)^{-1} H^t}} \geq t_{1-\alpha/2}(n - (m+1)) \tag{6} $$

where x is the data matrix containing all data channels and $t_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the Student distribution. For the hypothesis $H_{0i}$, H is the row vector that selects $\beta_i$ and $\zeta_0 = 0$. A large value of the left-hand side of the inequality above (equivalently, a small p-value) leads to a rejection of the hypothesis and therefore indicates that $\beta_i \neq 0$, hence an influence of the corresponding regressor, depending on $\alpha$: with the parameter $\alpha$ the rejection domain can be adjusted in such a way that inequality (6) holds for almost all $\beta$, not only for those whose absolute value is significantly greater than 0.

The information in Table 1 can be used in a strategy for building models on a smaller number of data channels, that is, lower-dimensional models. For example, when reproducing channel PE_MI (second row), MD_MI (with an influence value of 8620) plays the crucial role and the other channels can more or less be neglected. The strategy is:

Input: train data matrix, threshold for the number of necessary channels
Output: sorted channel list (sorted from top to bottom by degree of influence) for each channel -> channel matrix

Step 1: Designate "well-measured" channels in data matrix
For i=0; i<number_of_well_measured_channels; ++i
    Step 2: Perform multi-dimensional regression for each channel
    Step 3: Compute t-values (as described above)
    Step 4: Sort t-values from top to bottom
    Step 5: Choose threshold (either computed from the actual t-values or fixed)
    Step 6: Store channels in channel list according to threshold
End For
End

Step 1 is more or less a simple pre-filtering step and is shown in [7]. Step 2 is described in [7], [6] and [9]. Step 5 can be carried out in many different ways, which are stated below in chapter 3.

The resulting channel matrix, which denotes the most important channels for building a model of each channel, can be taken as input by any data-based model generation algorithm (a variety of model generation or identification algorithms is presented in [7] and [8]).

2.3 Principal Component Analysis (PCA)

Principal component analysis is possibly the dimension reduction technique most widely used in practice, perhaps due to its conceptual simplicity and to the fact that relatively efficient algorithms exist for its computation.
Let us start with a data matrix A containing channels (= columns) $x_1, x_2, ..., x_n$. Let $\mu_x = E\{x\}$ denote the mean values of the channels and let the matrix

$$ C_x = E\{(x - \mu_x)(x - \mu_x)^t\} \tag{7} $$

be the covariance matrix of the data set. The components of $C_x$, denoted by $c_{ij}$, represent the covariances between the data channels $x_i$ and $x_j$. The covariance matrix is, by definition, always symmetric. From such a symmetric matrix it is possible to calculate an orthogonal basis by finding its eigenvalues ($\lambda_i$) and eigenvectors ($e_i$), which are the solutions of the equation

$$ C_x e_i = \lambda_i e_i \quad \forall i = 1, ..., n \tag{8} $$

For simplicity it can be assumed that the $\lambda_i$ are distinct. These values can be found as the solutions of the characteristic equation

$$ |C_x - \lambda I| = 0 \tag{9} $$

where I is the identity matrix of the same order as $C_x$ and $|.|$ denotes the determinant of the matrix. If the data vector has n components (= dimensions), the characteristic equation is of order n. This is easy to solve only if n is small. Computing eigenvalues and the corresponding eigenvectors is a non-trivial task (the zeros of a characteristic polynomial of degree n have to be evaluated), and many methods exist which can perform it in $O(n^3)$: singular value decomposition, Cholesky decomposition, etc. (see [10] for further details).

By ordering the eigenvectors in the order of descending eigenvalues (largest first), one can create an ordered orthogonal basis in which the first eigenvector has the direction of largest variance of the data. In this way we can find the directions in which the data set has the most significant amounts of energy and neglect the others ⇒ dimension reduction.

Figure 3 shows a 2-dimensional data set and how these 2 dimensions can be represented by just one principal component by applying principal component analysis.
It can easily be seen that the first eigenvector, which has the largest eigenvalue, points in the direction of largest variance (right and upwards), whereas the second eigenvector is orthogonal to the first one (pointing left and upwards). In this example the eigenvalue corresponding to the first eigenvector is $\lambda_1 = 0.1737$, while the other eigenvalue is $\lambda_2 = 0.0001$. By comparing the eigenvalues to the total sum of eigenvalues, we get an idea of how much of the energy is concentrated along a particular eigenvector. In this case the first eigenvector contains almost all the energy ($\lambda_1 / (\lambda_1 + \lambda_2) \approx 0.999$), therefore the data can be well approximated by a one-dimensional representation.

[Figure 3: Principal components for 2 data channels]

Instead of using all eigenvalues, and consequently all eigenvectors, in a transformation matrix B which maps data points from the original coordinate system to the coordinate system spanned by the principal components, a subspace consisting of the eigenvectors corresponding to the K largest eigenvalues can be chosen and represented as a matrix $B_K$. Arbitrary input points are transformed from the original space to the "principal component space" by the operation

$$ y = B_K (x - \mu_x) \tag{10} $$

This has to be done, for example, if models are generated from the principal component vectors and new data points have to be checked against these models.

If the data is concentrated in a linear subspace, this provides a way to compress the data without losing much information while simplifying the representation. By picking the eigenvectors with the largest eigenvalues we lose as little information as possible in the mean-square sense. One can, for example, choose a fixed number of eigenvectors and their respective eigenvalues and get a consistent representation, or abstraction, of the data.
This preserves a varying amount of energy of the original data. Alternatively, we can choose approximately the same amount of energy and a varying number of eigenvectors and their respective eigenvalues. This in turn gives an approximately consistent amount of information at the expense of varying representations with regard to the dimension of the subspace. PCA thus offers a convenient way to control the trade-off between losing information and simplifying the problem at hand by reducing dimensions.

To recapitulate, 3 effects are achieved:

- Orthogonalisation of the components of the input vector
- Ordering of the resulting orthogonal components (principal components) so that those with the largest variation come first
- Elimination of those components that contribute the least to the variation in the data set

Although principal component analysis yields a clear, compact mathematical description of the data through transformed channels (the principal components), 3 disadvantages appear in practical use when it is compared to the channel selection approach through regression analysis:

- PCA is an unsupervised method, which means that it does not take into account any output channel which should be mapped by the others. Therefore, in a measurement system containing N channels, N − 1 PCAs with N − 1 input channels have to be carried out, which would probably lead to an enormous computational effort.
- As every channel appears in every principal component, faults detected through plausibility analysis algorithms using empirical models based on principal components can hardly be isolated by FI (= Fault Isolation) algorithms (see [12]).
- The number of components used for building an empirical model has to be adjusted: too small a number yields too inaccurate a model, too large a number increases the computational complexity.
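The PCA steps described in this section (covariance matrix (7), eigenpairs (8) ordered by descending eigenvalue, projection (10)) can be sketched as follows. This is a minimal pure-Python illustration and not the implementation used in the project: the eigenpairs are obtained by power iteration with re-orthogonalisation, a technique swapped in here for brevity, the covariance uses the sample divisor n − 1 (an assumption), and all function names are illustrative. As in the text, the eigenvalues are assumed to be distinct.

```python
import math
import random


def mean_vector(data):
    """Column means mu_x of a data matrix given as a list of rows."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]


def covariance_matrix(data):
    """Covariance matrix C_x, equation (7); sample divisor n - 1 (assumption)."""
    n, d = len(data), len(data[0])
    mu = mean_vector(data)
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in data) / (n - 1)
             for j in range(d)] for i in range(d)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(c, v):
    return [dot(row, v) for row in c]


def remove_components(v, basis):
    """Subtract from v its projections onto the orthonormal vectors in basis."""
    for u in basis:
        coeff = dot(v, u)
        v = [a - coeff * b for a, b in zip(v, u)]
    return v


def principal_components(c, k, iterations=1000, seed=0):
    """K largest eigenvalues and unit eigenvectors of a symmetric PSD matrix."""
    rng = random.Random(seed)
    values, vectors = [], []
    for _ in range(k):
        v = remove_components([rng.random() for _ in c], vectors)
        norm = math.sqrt(dot(v, v))
        v = [a / norm for a in v]
        for _ in range(iterations):
            w = remove_components(mat_vec(c, v), vectors)
            norm = math.sqrt(dot(w, w))
            if norm < 1e-15:  # no variance left in the remaining directions
                break
            v = [a / norm for a in w]
        values.append(dot(v, mat_vec(c, v)))
        vectors.append(v)
    return values, vectors


def project(x, mu, basis):
    """Equation (10): y = B_K (x - mu_x); the rows of B_K are eigenvectors."""
    centered = [a - b for a, b in zip(x, mu)]
    return [dot(b, centered) for b in basis]


if __name__ == "__main__":
    data = [[2.0, 1.9], [1.0, 1.1], [3.0, 3.0], [4.0, 4.1], [0.0, -0.1]]
    c = covariance_matrix(data)
    values, vectors = principal_components(c, 2)
    print("share of energy in first component:", values[0] / sum(values))
    print("reduced point:", project(data[0], mean_vector(data), vectors[:1]))
```

Keeping only `vectors[:1]` in the call to `project` corresponds to choosing K = 1; the share of energy printed before it is the ratio of the first eigenvalue to the sum of all eigenvalues discussed above.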
For further details about special variants of PCA (for example clustered PCA, recursive PCA etc.) see [11].

2.4 Partial low dimensional Models

Another approach, which was pursued, implemented and verified in an industrial project and which is not a conventional dimension reduction method like those in the preceding sections, consists of selecting partial low-dimensional models out of the whole data matrix. In other words, these partial models can be seen as a cloud of low-dimensional models covering subspaces of the original input space. This approach can therefore be seen as a dimension reduction that does not work "top-to-bottom" (as, for example, principal component analysis does) but rather "bottom-to-top".

From the mathematical point of view this approach is not really sophisticated. It relies only on channel selection, either through measures which give hints as to which channel combinations data-based models should be built for, or through expert knowledge defined by the operator. The main effort lies in the strategy for selecting the channels. For our model generation approaches (see [7], chapter 8), 2 different selection strategies were used:

- Selecting pairs of channels by using the redundancy obtained through the empirical correlation coefficient (1) and/or the Spearman rank correlation coefficient (2), and reproducing this redundancy by an empirical model.
- Computing models (especially 3-dimensional fuzzy inference systems) based on a list (also called leading channel list) defined by an operator with physical expert knowledge.
The practical and theoretical advantages of building small partial models are the following:

Fast computation time: Although one may object that, theoretically, $\binom{n}{2}$ 2-dimensional models may have to be built for n channels, in practice it turned out that approximately 2n models were generated when a reasonable lower bound was chosen for the correlation coefficient required to build a useful model. Thus, if n is large, building 2n low-dimensional models takes much less time for many model generation algorithms than computing 1 high-dimensional model with all input channels; the reason is that the computation time of many model generation algorithms grows exponentially with the dimension (see [7]).

Strong flexibility with respect to changes in the channel configuration: In the practical use of offline training and online checking it can happen that some channels which were used as input in offline training are no longer present during online checking, while others may have been added. In this case plausibility checking with a model based on all input channels present during offline training cannot be performed, while checking with those low-dimensional partial models which consist only of channels actually present in the online configuration is still possible and valid. With a number of low-dimensional models, a plausibility statement for a set of test data points can therefore still be reached.

Easier fault isolation: If there are small partial models with just a few channels each and faults are detected by these models, the fault can be isolated much more easily than if there is only one model in which all channels are included (for further details see the concepts of FDI in [7], chapter 9).
Another hint as to which models to build is given by the so-called multiple correlation coefficient, which in the 3-dimensional case can be evaluated via the following formula:

$$ R_{123} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} \tag{11} $$

where $r_{12}$, $r_{13}$ and $r_{23}$ are the correlation coefficients between the channels $x_1$ and $x_2$, $x_1$ and $x_3$, and $x_2$ and $x_3$. For higher dimensions this multiple correlation coefficient turns into the so-called r-squared values (for further details see [13]).

3 Results with respect to Fault Diagnosis

In this chapter, the results of fault detection based on empirical models are compared for different dimension reduction techniques, which serve as preliminary steps before the generation of fuzzy inference systems with the well-known MATLAB function genfis2. As opposed to RENO (see [3], [4], [5]), genfis2 has the great advantage of being able to deal with a dimensionality of up to 20, while RENO produces a virtual memory overflow on a machine with 256 MB RAM if the input dimension is higher than 7. Genfis2 first uses the subtractive clustering algorithm (see [16]) to determine the number of rules and the antecedent membership functions, and then uses linear least squares estimation to determine each rule's consequent equations. The Gaussian-shaped fuzzy sets are obtained by projecting each cluster onto each axis. This method returns a Takagi-Sugeno-type fuzzy inference system (see [15]) structure that contains a set of fuzzy rules to cover the feature space (see [17], [18]).
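To illustrate how subtractive clustering determines the number of rules, the following is a simplified pure-Python sketch in the spirit of the cluster estimation method of [18]. It is not the MATLAB implementation behind genfis2: the single stopping ratio replaces the more elaborate accept/reject test, the values of `squash` and `stop_ratio` are assumptions, the function name is illustrative, and the data is assumed to be scaled to the unit hypercube.

```python
import math


def subtractive_clustering(points, radius=0.5, squash=1.5, stop_ratio=0.15):
    """Return cluster centers chosen among the data points.

    Each point gets a potential that is high when many points lie within
    roughly `radius` of it. The point with the highest potential becomes a
    center, and the potential in its neighbourhood (radius * squash) is then
    reduced. Simplified stopping rule (assumption): stop as soon as the best
    remaining potential drops below stop_ratio times the first peak.
    """
    if not points:
        return []
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Initial potential of every data point.
    potential = [sum(math.exp(-alpha * sqdist(p, q)) for q in points)
                 for p in points]
    first_peak = max(potential)
    centers = []
    while True:
        best = max(range(len(points)), key=potential.__getitem__)
        peak = potential[best]
        if peak < stop_ratio * first_peak:
            break
        centers.append(points[best])
        # Reduce the potential near the new center so it is not chosen again.
        potential = [pot - peak * math.exp(-beta * sqdist(p, points[best]))
                     for p, pot in zip(points, potential)]
    return centers


if __name__ == "__main__":
    data = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12), (0.11, 0.11),
            (0.90, 0.90), (0.88, 0.90), (0.90, 0.88)]
    for center in subtractive_clustering(data):
        print(center)
```

Each returned center would correspond to one fuzzy rule; a smaller `radius` generally produces more centers and therefore more rules.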
One approach to the fault detection process itself, that is, to the decision based on data-based models such as fuzzy inference systems whether a current point is faulty or not, is to filter the data in order to estimate a parameter which represents the deviation from the reference situation (see [8]). Additionally, an internal and an external quality measure are calculated, both of which indicate the trustworthiness of a model (see [7]) and are taken into account in the whole fault diagnosis process: statements about faults coming from models of better quality are weighted higher than those obtained from worse ones.

Table 2: Complete process scheme using dimension reduction methods as preliminary steps for model generation and fault diagnosis

Input: Train data set, check data set with rating channel
Output: Detection and over-detection rate
Step 1: Defining parameters
Step 2: Pre-filtering of train data matrix
Step 3: Applying a special dimension reduction method onto filtered data
Step 4: Adjusting parameter(s) for genfis2
Step 5: Generate FIS with genfis2
Step 6: Evaluate quality of FIS
Step 7: Perform fault detection due to check data
Step 8: Calculate detection and over-detection rate

Performance is measured by the detection rate and the over-detection rate for faulty points which occurred during a test procedure of a large diesel engine. The train data matrix for building the fuzzy inference systems, containing 1810 data points (rows) with 80 channels (columns), was first pre-filtered by the outlier detection algorithms described in [7], chapter 7, and by the well-known Mahalanobis distance measure (see [14]) before being passed to the dimension reduction methods. For the verification and validation of the complete process scheme described in Table 2, a so-called check data set of 250 points with the same 80 channels as in the train data set was defined, including 129 faulty and 121 fault-free samples.
Additionally, a so-called rating channel was placed into the check data matrix, which states for each sample whether there is a fault in it or not (1 or 0).

Step 3 of the algorithm above was carried out for the following 3 dimension reduction techniques:

- Dimension reduction based on physical expert knowledge ⇒ partial low dimensional models (see above)
- Principal Component Analysis with different variations of parameters
- Dimension reduction based on t-values with different variations of parameters

Table 3 summarizes the results, including comments on the choice of parameters and the output structure of the fuzzy inference systems generated by genfis2. For the calculation of the detection and over-detection rate (step 8 in Table 2) see [7].

Table 3: Performance results of fault diagnosis obtained by using different dimension reduction techniques

Row  Dim. Red. Method  Parameters                Det. Rate  Overdet. Rate
1    Expert Knowledge  radius_cluster = 0.2      56.2%      6.0%
2    PCA               radius_cluster = 0.2      71.32%     62.28%
3    PCA               radius_cluster = 0.2      46.32%     0.88%
4    T-Values          radius_cluster = 0.2      48.53%     1.75%
5    T-Values          radius_cluster = 0.2      45.59%     1.75%
6    T-Values          radius_cluster = dynamic  50.0%      1.75%
7    T-Values          radius_cluster = dynamic  45.59%     1.75%

Model structure by row:
1: 12 3-dimensional models with inputs N_MI (rev) and PE (torsional) and 12 common leading channels for diesel engines
2: 10 10-dimensional models by taking out the 10 most significant principal components, reproducing each of these 10 by the other 9
3: 62 5-dimensional models by taking all original channels (after filtering!) as output channels and reproducing them by the 5 most significant principal components
4: 62 5-dimensional models by taking all original channels (after filtering!) as output channels and reproducing them by the 5 most significant channels
5: 62 up-to-5-dimensional models by taking all original channels (after filtering!) as output channels and reproducing them by the most (up to 5) significant channels
6: 62 up-to-5-dimensional models by taking all original channels (after filtering!) as output channels and reproducing them by the most (up to 5) significant channels
7: 62 up-to-10-dimensional models by taking all original channels (after filtering!) as output channels and reproducing them by the most (up to 10) significant channels

"Up-to-n-dimensional models" here means that the number of inputs for reproducing one output channel is bounded by n, where below n the input dimension is adapted dynamically for output channel m through the following assignment:

$$ dim_m = \left| \{ tval \in T_m \mid tval \geq 0.3 \max_i (tval_i),\ tval_i \in T_m \} \right| \tag{12} $$

where $T_m$ is the set of t-values belonging to the input channels for reproducing output channel m. Moreover, the crucial parameter radius_cluster for genfis2 can also be adjusted dynamically (see Table 3), depending on the actual input dimension, by

$$ rad = \begin{cases} \frac{1}{n - dim_m} & dim_m < n - 1 \\ 0.7 & dim_m \geq n - 1 \end{cases} \tag{13} $$

Radius_cluster is always between 0 and 1 and specifies the range of influence of one cluster in a data dimension as a fraction of the width of the data space. If radius_cluster is a scalar (as in our case), the scalar value is applied to all data dimensions, i.e., each cluster center has a spherical neighborhood of influence with the given radius.

4 Interpretation of Results and Conclusion

Since too high an over-detection rate carries a great danger that the operators lose confidence in the complete fault detection system, the dimension reduction method stated in row 6 of Table 3 should be preferred to all others; hence, making both the input dimension used and the parameter radius_cluster for genfis2 dynamic has an essential impact on the performance. The method based on expert knowledge does achieve a higher detection rate (56.2%), but its over-detection rate, 3 to 4 times higher than that of the T-value methods, is unacceptable.
Moreover, too many fixed input dimensions (as in the case of the algorithm in row 2) explain the variances of the data in too much detail and therefore lead to so-called overfitting, which means that the model is adapted too exactly to the data, such that new incoming fault-free points often do not fit the model ⇒ high over-detection rate.

Decoupling the dynamic fuzzy expert systems from physical a-priori knowledge shifts them into the area of purely data-based modeling methods (see the functional framework figure in [7], chapter 4) and, as an important side effect, reduces the parametrization effort. Although the over-detection rate can thereby be squeezed down almost to 0, the detection rate of about 50% is not yet satisfactory. Advanced techniques in dimension reduction and channel/variable selection will therefore be needed to improve the fault detection process further.

References

[1] Edwin Lughofer, "Testdata Requirement Specification"
[2] Miguel A. Carreira-Perpinan, "A review of dimension reduction techniques", Technical Report CS-96-09, Dept. of Computer Science, University of Sheffield
[3] Josef Haslinger, Ulrich Bodenhofer, Martin Burger, "Data-Driven Construction of Sugeno Controllers: Analytical Aspects and New Numerical Results", Software Competence Center Hagenberg, 2000
[4] Martin Burger, Josef Haslinger and Ulrich Bodenhofer, "Regularized Optimization of Fuzzy Controllers", Technical Report SCCH-TR-0056, Software Competence Center Hagenberg, 2000
[5] H. Schwetlick and Thorsten Schütze, "Least Squares Approximation by Splines with Free Knots", BIT 35(3):361-384, 1995
[6] Frank E. Harrell Jr., "Regression modeling strategies: With applications to linear models, logistic regression and survival analysis", Springer Series in Statistics
[7] Günther Frizberg, Edwin Lughofer, Thomas Strasser, "Final Report AMPA01"
[8] Oliver Nelles, "Nonlinear System Identification - From Classical Approaches to Neural Networks and Fuzzy Models", ISBN 3-540-67369-5, Springer-Verlag Berlin Heidelberg New York
[9] Peter Weiss, "Statistik 2"
[10] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, "Numerical Recipes in C: The Art of Scientific Computing", Cambridge University Press, Cambridge, U.K., second ed., 1992
[11] J. Edward Jackson, "A user's guide to principal components", ISBN 0471622672, John Wiley & Sons
[12] R. J. Patton et al., "Issues of Fault Diagnosis for Dynamic Systems", Springer-Verlag London Limited, 2000, pp. 87-114
[13] Murray R. Spiegel, "Statistik", McGraw-Hill Book Company Europe
[14] P. C. Mahalanobis, "Proc. Natl. Institute of Science of India", 2, 49 (original reference to Mahalanobis distance calculations)
[15] Hung T. Nguyen, Michio Sugeno, Richard Tong and Ronald R. Yager, "Theoretical Aspects of Fuzzy Control", ISBN 0-471-02079-6, John Wiley and Sons, Inc.
[16] Robert P. Velthuizen, Lawrence O. Hall, Laurence P. Clarke and Martin L. Silbiger, "An Investigation of Mountain Method Clustering for large Data Sets", Pattern Recognition, Vol. 30, No. 7, pp. 1121-1135, 1997
[17] Ronald R. Yager and Dimitar P. Filev, "Generation of Fuzzy Rules by Mountain Clustering", Technical Report MII-1318R, Machine Intelligence Institute, Iona College, New Rochelle, NY 10801
[18] S. Chiu, "Fuzzy Model Identification Based on Cluster Estimation", Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, 1994