Towards Adaptive Anomaly Detection Systems using
Transcription
Towards Adaptive Anomaly Detection Systems using
ÉCOLE DE TECHNOLOGIE SUPÉRIEURE UNIVERSITÉ DU QUÉBEC MANUSCRIPT-BASED THESIS PRESENTED TO ÉCOLE DE TECHNOLOGIE SUPÉRIEURE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Ph.D. BY Wael KHREICH TOWARDS ADAPTIVE ANOMALY DETECTION SYSTEMS USING BOOLEAN COMBINATION OF HIDDEN MARKOV MODELS MONTREAL, JULY 18, 2011 c Copyright 2011 reserved by Wael Khreich Wael Khreich, 2011 BOARD OF EXAMINERS THIS THESIS HAS BEEN EVALUATED BY THE FOLLOWING BOARD OF EXAMINERS : Mr. Éric Granger, Thesis Supervisor Département de génie de la production automatisée at École de technologie supérieure Mr. Robert Sabourin, Thesis Co-supervisor Département de génie de la production automatisée at École de technologie supérieure Mr. Ali Miri, Thesis Co-supervisor School of Computer Science, Ryerson University, Toronto, Canada Mr. Jean-Marc Robert, President of the Board of Examiners Département de génie logiciel at École de technologie supérieure Mr. Tony Wong, Examiner Département de génie de la production automatisée at École de technologie supérieure Mr. Qinghan Xiao, External examiner Defence Research and Development Canada (DRDC), Ottawa THIS THESIS WAS PRESENTED AND DEFENDED BEFORE A BOARD OF EXAMINERS AND PUBLIC ON JUNE, 21 2011 AT ÉCOLE DE TECHNOLOGIE SUPÉRIEURE ACKNOWLEDGEMENTS I would like to express sincere appreciation to Professor Éric Granger my research supervisor for his worthwhile assistance, support, and guidance throughout this research work. In particular, I would like to thank him for his protracted patience during those many hours of stimulating discussion. In addition, the progress I have achieved in writing is mostly due to his thorough and critical reviews of my manuscripts. I would like to extend my heartfelt gratitude to my co-supervisor Professor Ali Miri whose constant encouragement and valuable insights provided the perfect guidance that I needed through those years of research. I am deeply indebted to Prof. Ali Miri, who first believed in me and introduced me to Prof. Éric Granger and made possible the first step of this PhD thesis. I would like to express my sincere and special thanks to Professor Robert Sabourin my codirector, for his support and instruction throughout this project. Each meeting with him added invaluable aspects to the project and broadened my perspective. His stimulating discussions and comments led to much of this work. My gratitude also goes to members of my thesis committee: Prof. Jean-Marc Robert, Prof. Tony Wong, and Prof. Qinghan Xiao for their time and their constructive comments. Thanks to all the wonderful people that I met at LIVIA and that made all this experience so satisfactory. Special thanks to Dominique Rivard, Eduardo Vellasques, Éric Thibodeau, Luana Batista, Marcelo Kapp, Paulo Cavalin, Vincent Doré, and Youssouf Chherawala. Thanks to all my friends back home, who made me feel like I never left and provided me with long-distance support. Thanks to my friends Joy Khoriaty, Toufic Khreich and Myra Sader, for their moral support and help. Special thanks to Marc Hasrouny, Patrick IV Diab, Sam Khreiche, Roger Mehanna, and Julie Barakat who have supported me since I came to Canada, and made me feel like home. To my beloved wife Roula, thank you for being by my side, for putting up with all those nights and weekends that I spent in front of my laptop, and for giving me endless support and motivation to finish this thesis. Your being here was the best thing that could have happened to me. Thank you. Towards the end, gratitude goes to my loved mother and father back home for their endless love and sacrifices, and for their sustained moral and financial support during those years of research and throughout my whole life. You were always my source of strength. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), and le Fonds québécois de la recherche sur la nature et les technologies (FQRNT). The author gratefully acknowledges their financial support. TOWARDS ADAPTIVE ANOMALY DETECTION SYSTEMS USING BOOLEAN COMBINATION OF HIDDEN MARKOV MODELS Wael KHREICH ABSTRACT Anomaly detection monitors for significant deviations from normal system behavior. Hidden Markov Models (HMMs) have been successfully applied in many intrusion detection applications, including anomaly detection from sequences of operating system calls. In practice, anomaly detection systems (ADSs) based on HMMs typically generate false alarms because they are designed using limited representative training data and prior knowledge. However, since new data may become available over time, an important feature of an ADS is the ability to accommodate newly-acquired data incrementally, after it has originally been trained and deployed for operations. Incremental re-estimation of HMM parameters raises several challenges. HMM parameters should be updated from new data without requiring access to the previously-learned training data, and without corrupting previously-learned models of normal behavior. Standard techniques for training HMM parameters involve iterative batch learning, and hence must observe the entire training data prior to updating HMM parameters. Given new training data, these techniques must restart the training procedure using all (new and previously-accumulated) data. Moreover, a single HMM system for incremental learning may not adequately approximate the underlying data distribution of the normal process, due to the many local maxima in the solution space. Ensemble methods have been shown to alleviate knowledge corruption, by combining the outputs of classifiers trained independently on successive blocks of data. This thesis makes contributions at the HMM and decision levels towards improved accuracy, efficiency and adaptability of HMM-based ADSs. It first presents a survey of techniques found in literature that may be suitable for incremental learning of HMM parameters, and assesses the challenges faced when these techniques are applied to incremental learning scenarios in which the new training data is limited and abundant. Consequently, An efficient alternative to the Forward-Backward algorithm is first proposed to reduce the memory complexity without increasing the computational overhead of HMM parameters estimation from fixed-size abundant data. Improved techniques for incremental learning of HMM parameters are then proposed to accommodate new data over time, while maintaining a high level of performance. However, knowledge corruption caused by a single HMM with a fixed number of states remains an issue. To overcome such limitations, this thesis presents an efficient system to accommodate new data using a learn-and-combine approach at the decision level. When a new block of training data becomes available, a new pool of base HMMs is generated from the data using a different number of HMM states and random initializations. The responses from the newly-trained HMMs are then combined to those of the previously-trained HMMs in receiver operat- VI ing characteristic (ROC) space using novel Boolean combination (BC) techniques. The learn-and-combine approach allows to select a diversified ensemble of HMMs (EoHMMs) from the pool, and adapts the Boolean fusion functions and thresholds for improved performance, while it prunes redundant base HMMs. The proposed system is capable of changing its desired operating point during operations, and this point can be adjusted to changes in prior probabilities and costs of errors. During simulations conducted for incremental learning from successive data blocks using both synthetic and real-world system call data sets, the proposed learn-and-combine approach has been shown to achieve the highest level of accuracy than all related techniques. In particular, it can sustain a significantly higher level of accuracy than when the parameters of a single best HMM are re-estimated for each new block of data, using the reference batch learning and the proposed incremental learning techniques. It also outperforms static fusion techniques such as majority voting for combining the responses of new and previously-generated pools of HMMs. Ensemble selection techniques have been shown to form compact EoHMMs for operations, by selecting diverse and accurate base HMMs from the pool while maintaining or improving the overall system accuracy. Pruning has been shown to prevents pool sizes from increasing indefinitely with the number of data blocks acquired over time. Therefore, the storage space for accommodating HMMs parameters and the computational costs of the selection techniques are reduced, without negatively affecting the overall system performance. The proposed techniques are general in that they can be employed to adapt HMM-based systems to new data, within a wide range of application domains. More importantly, the proposed Boolean combination techniques can be employed to combine diverse responses from any set of crisp or soft one- or two-class classifiers trained on different data or features or trained according to different parameters, or from different detectors trained on the same data. In particular, they can be effectively applied when training data is limited and test data is imbalanced. Keywords: Intrusion Detection Systems, Anomaly Detection, Adaptive Systems, Ensemble of Classifiers, Information Fusion, Boolean Combination, Incremental Learning, On-Line Learning, Hidden Markov Models, Receiver Operating Characteristics. VERS DES SYSTÈMES ADAPTATIFS DE DÉTECTION D’ANOMALIES UTILISANT DES COMBINAISONS BOOLÉENNES DE MODÈLES DE MARKOV CACHÉS Wael KHREICH RÉSUMÉ La détection d’anomalies permet de surveiller les déviations significatives du comportement normal d’un système. Les modèles de Markov cachés (MMCs) ont été utilisés avec succès dans différentes applications de détection d’intrusions, en particulier la détection d’anomalies à partir de séquences d’appels système. Dans la pratique, les systèmes de détection d’anomalies (SDAs) basés sur les MMCs génèrent typiquement de fausses alertes étant donné qu’ils ont été conçu en utilisant des données d’apprentissage et des connaissances préalables limitées. Mais puisque de nouvelles données peuvent être acquises avec le temps, les SDAs doivent s’adapter aux nouvelles données une fois qu’ils ont été entraînés et mis en opération. La ré-estimation incrémentale des paramètres des MMCs présente plusieurs défis à relever. Ces paramètres devront être ajustés selon l’information fournie par les nouvelles données, sans réutiliser les données d’apprentissage antérieures et sans compromettre les informations déjà acquises dans les modèles de comportement normal. Les techniques standards pour l’entraînement des paramètres des MMCs utilisent l’apprentissage itératif en mode « batch » et doivent ainsi observer la totalité de la base de données d’apprentissage avant d’ajuster les paramètres des MMCs. À l’acquisition de nouvelles données, ces techniques devraient recommencer la procédure de l’apprentissage en utilisant toutes les données accumulées. En outre, un système d’apprentissage incrémental basé sur un seul MMC n’a pas la capacité de fournir une bonne approximation de la distribution sous-jacente du comportement normal du processus à cause du grand nombre des maximums locaux. Les ensembles de classificateurs permettent de réduire la corruption des connaissances acquises, en combinant les sorties des classificateurs indépendamment entraînés sur des blocs de données successives. Cette thèse apporte des contributions au niveau des MMCs ainsi qu’en regard de la décision des classificateurs dans le but d’améliorer la précision, l’efficacité et l’adaptabilité des SDAs basés sur les MMCs. Elle présente en premier lieu une étude d’ensemble des techniques qui peuvent être utilisées pour l’apprentissage incrémental des paramètres des MMCs et évalue les défis rencontrés par ces techniques lors d’un apprentissage incrémental avec des données limitées ou abondantes. Par conséquent, une alternative efficace à « l’algorithme avant-arrière » est proposée pour réduire la complexité de la mémoire sans augmenter le coût computationnel de l’estimation des paramètres des MMCs, faite à partir de données fixes. D’autre part, elle propose des améliorations aux techniques d’apprentissage incrémental des paramètres des MMCs pour qu’elles s’adaptent à des nouvelles données, tout en conservant un niveau de performance élevé. Cependant, le problème de la corruption des connaissances acquises causée par l’utilisation d’un sys- VIII tème basé sur un seul MMC n’est toujours pas complètement résolu. Pour surmonter ces problèmes, cette thèse présente un système efficace qui s’adapte aux nouvelles données, en utilisant une approche apprentissage-combinaison au niveau décision. À l’arrivée d’un nouveau bloc de données d’apprentissage, un ensemble de MMCs est généré en variant le nombre d’états et l’initialisation aléatoire des paramètres. Les réponses des MMCs entraînés sur les nouvelles données sont combinées avec celles des MMCs déjà entraînés sur les données antérieures dans l’espace ROC (caractéristique de fonctionnement du récepteur, receiver operating characteristic), en utilisant les techniques de combinaison booléenne. L’approche apprentissage-combinaison proposée permet de sélectionner un ensemble diversifié de MMCs et d’ajuster les fonctions booléennes et les seuils de décision, tout en éliminant les MMCs redondants. Pendant la phase d’opération, le système proposé est capable de changer le point d’opération et celui-ci peut s’adapter au changement des probabilités a priori et des coûts des erreurs. Les résultats des simulations obtenus sur des bases de données réelles et synthétiques montrent que l’approche apprentissage-combinaison proposée a atteint le taux le plus élevé de précision en comparaison avec d’autres techniques d’apprentissage incrémental à partir de blocs de données successives. En particulier, le système a démontré qu’il pouvait maintenir un plus haut taux de précision que celui d’un système basé sur un seul MMC entraîné selon le mode batch (utilisant tous les blocs de données cumulatifs) ou qui adapte les paramètres du MMC selon la méthode incrémentale proposée (utilisant des blocs de données successifs). De même, la précision du système proposé a surpassé celle des techniques basées sur les fonctions de fusion classiques tel que le vote majoritaire combinant les décisions des MMCs entraînés sur les anciens et les nouveaux blocs de données. Les techniques de sélection d’ensembles du système proposé sont capables de choisir un ensemble compact, en sélectionnant les MMCs générés, les plus précis et les plus diversifiés, tout en conservant ou en améliorant la précision du système. La technique d’élimination des modèles empêche l’augmentation de la taille du pool avec le temps. Ceci permet de réduire l’espace de stockage des MMCs et le coût computationnel des techniques de sélection, sans compromettre la performance globale du système. Les techniques proposées sont générales et peuvent être utilisées dans diverses applications pratiques qui nécessitent l’adaptation aux nouvelles données des systèmes basés sur les MMCs. Qui plus est, les techniques de combinaison booléennes proposées sont capables de combiner les réponses des classificateurs binaires ou probabilistes pour des problèmes à une ou deux classes. Ceci inclut la combinaison du même type de classificateurs entraînés sur différentes données ou caractéristiques ou selon différents paramètres, et la combinaison de différents classificateurs entraînés sur la même de données. En particulier, ces techniques peuvent être efficaces dans des applications où les données d’apprentissage sont limitées et les données de test sont non équilibrées. Mots-clés : Systèmes de détection d’intrusions, détection d’anomalies, systèmes adaptatifs, ensembles de classificateurs, fusion de l’information, combinaison booléenne, apprentissage incrémental, apprentissage en-ligne, modèles de Markov cachés, caractéristique de fonctionnement du récepteur. CONTENTS Page INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 1 ANOMALY INTRUSION DETECTION SYSTEMS . . . . . . . . . . . . . . . . 1.1 Overview of Intrusion Detection Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Network-based IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2 Host-based IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.3 Misuse Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.4 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Host-based Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 Privileged Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 System Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.3 Anomaly Detection using System Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Anomaly Detection with HMMs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 HMM-based Anomaly Detection using System Calls . . . . . . . . . . . . . . . . . 1.4 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.1 University of New Mexico (UNM) Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 Synthetic Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Evaluation of Intrusion Detection Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 Receiver Operating Characteristic (ROC) Analysis . . . . . . . . . . . . . . . . . . . 1.6 Anomaly Detection Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1 Representative Data Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.2 Unrepresentative Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 17 19 20 22 23 24 25 27 28 29 34 37 37 38 42 43 46 46 48 CHAPTER 2 2.1 2.2 2.3 A SURVEY OF TECHNIQUES FOR INCREMENTAL LEARNING OF HMM PARAMETERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Batch Learning of HMM Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Objective Functions:. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 Optimization Techniques: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2.1 Expectation-Maximization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2.2 Standard Numerical Optimization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2.3 Expectation-Maximization versus Gradient-based Techniques On-Line Learning of HMM Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Minimum Model Divergence (MMD). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Maximum Likelihood Estimation (MLE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2.1 On-line Expectation-Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2.2 Numerical Optimization Methods: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2.3 Recursive Maximum Likelihood Estimation (RMLE): . . . . . . . 53 54 59 61 65 66 68 71 72 73 75 76 79 80 X 2.4 2.5 2.6 2.3.3 Minimum Prediction Error (MPE): . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 An Analysis of the On-line Learning Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 2.4.1 Convergence Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 2.4.2 Time and Memory Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Guidelines for Incremental Learning of HMM Parameters . . . . . . . . . . . . . . . . . . . . 99 2.5.1 Abundant Data Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 2.5.2 Limited Data Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 CHAPTER 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 ITERATIVE BOOLEAN COMBINATION OF CLASSIFIERS IN THE ROC SPACE: AN APPLICATION TO ANOMALY DETECTION WITH HMMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Anomaly Detection with HMMs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Fusion of Detectors in the Receiver Operating Characteristic (ROC) Space . 113 3.3.1 Maximum Realizable ROC (MRROC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 3.3.2 Repairing Concavities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.3.3 Conjunction and Disjunction Rules for Crisp Detectors . . . . . . . . . . . . . . 117 3.3.4 Conjunction and Disjunction Rules for Combining Soft Detectors . . . 119 A Boolean Combination (BCALL ) Algorithm for Fusion of Detectors . . . . . . . . 121 3.4.1 Boolean Combination of Two ROC Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 3.4.2 Boolean Combination of Multiple ROC Curves . . . . . . . . . . . . . . . . . . . . . . . 124 3.4.3 Time and Memory Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 3.4.4 Related Work on Classifiers Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 3.5.1 University of New Mexico (UNM) Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 3.5.2 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 3.5.3 Experimental Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 Simulation Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.6.1 An Illustrative Example with Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.6.2 Results with Synthetic and Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 CHAPTER 4 4.1 4.2 4.3 ADAPTIVE ROC-BASED ENSEMBLES OF HMMS APPLIED TO ANOMALY DETECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 Adaptive Anomaly Detection Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 4.2.1 Anomaly Detection Using HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 4.2.2 Adaptation in Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 4.2.3 Techniques for Incremental Learning of HMM Parameters . . . . . . . . . . . 162 4.2.4 Incremental Learning with Ensembles of Classifiers. . . . . . . . . . . . . . . . . . . 165 Learn-and-Combine Approach Using Incremental Boolean Combination. . . . 170 4.3.1 Incremental Boolean Combination in the ROC Space . . . . . . . . . . . . . . . . 172 4.3.2 Model Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 XI 4.4 4.5 4.6 4.3.2.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 4.3.2.2 Model Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 4.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 4.4.2 Experimental Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 Simulation Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 4.5.1 Evaluation of the Learn-and-Combine Approach: . . . . . . . . . . . . . . . . . . . . . 192 4.5.2 Evaluation of Model Management Strategies: . . . . . . . . . . . . . . . . . . . . . . . . . 198 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 APPENDIX I ON THE MEMORY COMPLEXITY OF THE FORWARD BACKWARD ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 APPENDIX II A COMPARISON OF TECHNIQUES FOR ON-LINE INCREMENTAL LEARNING OF HMM PARAMETERS IN ANOMALY DETECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 APPENDIX III INCREMENTAL LEARNING STRATEGY FOR UPDATING HMM PARAMETERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 APPENDIX IV BOOLEAN FUNCTIONS AND ADDITIONAL RESULTS . . . . . . . . 271 APPENDIX V COMBINING HIDDEN MARKOV MODELS FOR IMPROVED ANOMALY DETECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 XII LIST OF TABLES Page Table 2.1 Time and memory complexity for some representative blockwise algorithms used for on-line learning of a new sub-sequence o1:T of length T . N is the number of HMM states and M is the size of alphabet symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Table 2.2 Time and memory complexity of some representative symbol-wise algorithms used for on-line learning of new symbol oi . N is the number of HMM states and M is the size of alphabet symbols . . . . . . . . . . . . . . 97 Table 3.1 Combination of conditionally independent detectors . . . . . . . . . . . . . . . . . . . . 118 Table 3.2 Combination of conditionally dependent detectors . . . . . . . . . . . . . . . . . . . . . . 119 Table 3.3 The maximum likelihood combination of detectors Ca and Cb as proposed by Haker et al. (2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Table 3.4 Worst-case time and memory complexity for the illustrative example . 139 XIV LIST OF FIGURES Page Figure 0.1 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Figure 1.1 High level architecture of an intrusion detection system . . . . . . . . . . . . . . . . 18 Figure 1.2 Host-based IDSs run on each server or host systems, while network-based IDS monitor network borders and DMZ. . . . . . . . . . . . . . . . . 21 Figure 1.3 A high-level architecture of operating system, illustrating system-call interface to kernel and generic types of system calls . . . . . . . 28 Figure 1.4 Examples of Markov models with alphabet Σ = 8 symbols and various CRE values, each used to generate a sequence of length T = 50 observation symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Figure 1.5 Synthetic normal and anomalous data generation according to a Markov model with CRE = 0.3 and alphabet of size Σ = 8 symbols 40 Figure 1.6 Confusion matrix and ROC common measures. . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 1.7 Illustration of the fixed tpr and f pr values produced by a crisp detector and the ROC curve generated by a soft detector (for various decision thresholds T). Important regions in the ROC space are annotated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 1.8 Illustration of normal process behavior when modeled with unrepresentative training data. Normal rare events will be considered as anomalous by the ADS, and hence trigger false alarms . 49 Figure 1.9 Illustration of changes in normal process behavior, due for instance to application update or changes in user behavior, and the resulting regions causing both false positive and negative errors . . 50 Figure 2.1 A generic incremental learning scenario where blocks of data are used to update the classifier in an incremental fashion over a period of time t. Let D1 , . . . , Dn+1 be the blocks of training data available to the classifier at discrete instants in time t1 , . . . , tn+1 . The classifier starts with initial hypothesis h0 which constitutes the prior knowledge of the domain. Thus, h0 gets updated to h1 on the basis of D1 , and h1 gets updated to h2 on the basis of data D2 , and so forth (Caragea et al., 2001) . . . . . 56 XVI Figure 2.2 An illustration of the degeneration that may occur with batch learning of a new block of data. Suppose that the dotted curve represents the cost function associated with a system trained on block D1 , and that the plain curve represents the cost function associated with a system trained on the cumulative data D1 D2 . Point (1) represents the optimum solution of batch learning performed on D1 , while point (4) is the optimum solution for batch learning performed on D1 D2 . If point (1) is used as a starting point for incremental training on D2 (point (2)), then it will become trapped in the local optimum at point (3) . . . 57 Figure 2.3 An illustration of an ergodic three states HMM with either continuous or discrete output observations (left). A discrete HMM with N states and M symbols transits between the hidden states qt , and generates the observations ot (right) . . . . . . . . . . . . . 59 Figure 2.4 Taxonomy of techniques for on-line learning of HMM parameters . . . . . 73 Figure 2.5 An illustration of on-line learning (a) from an infinite stream of observation symbols (oi ) or sub-sequences (si ) versus batch learning (b) from accumulated sequences of observations symbols (S1 ∪ S2 ) or blocks of sub-sequences (D1 ∪ D2) . . . . . . . . . . . . . . . . 92 Figure 2.6 An example of the time and memory complexity of a block-wise (Mizuno et al. (2000)) versus symbol-wise (Florez-Larrahondo et al. (2005)) algorithm for learning an observation sequence of length T = 1000, 000 with an output alphabet of size M = 50 symbols 98 Figure 2.7 The dotted curve represents the log-likelihood function associated with and HMM (λ1 ) trained on block D1 over the space of one model parameter λ. After training on D1 , the online algorithm estimates λ = λ1 (point (a)) and provides HMM (λ1 ). The plain curve represents the log-likelihood function associated with HMM (λ2 ) trained on block D2 over the same optimization space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Figure 3.1 Illustration of a fully connected three state HMM with a discrete output observations (left). Illustration of a discrete HMM with N states and M symbols switching between the hidden states qt and generating the observations ot (right). The state qt = i denotes that the state of the process at time t is Si . . . . . . 112 Figure 3.2 Illustration of the ROCCH (dashed line) applied to: (a) the combination of two soft detectors (b) the repair of concavities of an improper soft detector. For instance, in (a) the composite XVII detector Cc is realized by randomly selecting the responses from Ca half of the time and the other half from Cb . . . . . . . . . . . . . . . . . . . . . . . . . 115 Figure 3.3 Examples of combination of two conditionally-independent crisp detectors, Ca and Cb , using the AND and OR rules. The performance of the their combination is shown superior to that of the MRROC. The shaded regions are the expected performance of combination when there is an interaction between the detectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Figure 3.4 Block diagram of the system used for combining the responses of two HMMs. It illustrates the combination of HMM responses in the ROC space according to the BCALL technique . . . . . . . . . . . . . . . . . 122 Figure 3.5 Illustration of data pre-processing for training, validation and testing 133 Figure 3.6 Illustration of the steps involved for estimating HMM parameters . . . 135 Figure 3.7 Illustrative example that compares the AUC performance of techniques for combination (top) and for repairing (bottom) of ROC curves. The example is conducted on a system consisting of two ergodic HMMs trained with N = 4 and N = 12 states, on a block of 100 sequences, each of length DW = 4 symbols, synthetically generated with Σ = 8 and CRE = 0.3 . . . . . . . . . . . . . . . . . . . . 137 Figure 3.8 Results for synthetically generated data with Σ = 8 and CRE = 0.3. Average AUC values obtained on the test sets as a function of the number of training blocks (50 to 450 sequences) for a 3-HMM system each trained with a different state (N = 4, 8, 12), and combined with the MRROC, BC and IBC techniques. Average AUC performance is compared for various detector windows sizes (DW = 2, 4, 6). Error bars are standard deviations over ten replications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 Figure 3.9 Results for synthetically generated data with Σ = 50 and CRE = 0.4. Average AUC values (left) and tpr values at f pr = 0.1 (right) obtained on the test sets as a function of the number of training blocks (1000 to 5000 sequences) for a 3-HMM system each trained with a different state (N = 40, 50, 60), and combined with the MRROC and IBC techniques. The performance is compared for various detector windows sizes (DW = 2, 4, 6). Error bars are standard deviations over ten replications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 XVIII Figure 3.10 Results for sendmail data. AUC values (left) and tpr values at f pr = 0.1 (right) obtained on the test sets as function of the number of training blocks (100 to 1000 sequences) for a 5-HMM system, each trained with a different state (N = 40, 45, 50, 55, 60), and combined with the MRROC and IBC techniques. The performance is compared for various detector windows sizes (DW = 2, 4, 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Figure 3.11 Performance versus the number of training blocks achieved after repairing concavities for synthetically generated data with Σ = 8 and CRE = 0.3. Average AUCs (left) and tpr values at f pr = 0.1 (right) on the test set for a μ-HMM where each HMM is trained with different number of states (N = 4, 8, 12). HMMs are combined with the MRROC technique and compared to the performance of IBCALL and LCR repairing techniques, for various training block sizes (50 to 450 sequences) and detector windows sizes (DW = 4 and 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 Figure 4.1 An incremental learning scenario where blocks of data are used to update the HMM parameters (λ) over a period of time. Let D1 , D2 , ..., Dn be the blocks of training data available to the HMM at discrete instants in time. The HMM starts with an initial hypothesis h0 associated with the initial set of parameters λ0 , which constitutes the prior knowledge of the domain. Thus, h0 and λ0 get updated to h1 and λ1 on the basis of D1 , then h1 and λ1 get updated to h2 and λ2 on the basis of D2 , and so forth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 Figure 4.2 An illustration of the decline in performance that may occur with batch or on-line estimation of HMM parameters, when learning is performed incrementally on successive blocks . . . . . . . . . . . . . . 166 Figure 4.3 Block diagram of the adaptive system proposed for incrBC of HMMs, trained from newly-acquired blocks of data Dk , according to the learn-and-combine approach. It allows for an efficient management of the pool of HMMs and selection of EoHMMs, decision thresholds, and Boolean functions . . . . . . . . . . . . . . . . . 172 Figure 4.4 An illustration of the steps involved during the design phase of the incrBC algorithm employed for incremental combination from a pool of four HMMs P1 = {λ11 , . . . , λ14 }. Each HMM is trained with different number of states and initializations on a block (D1 ) of normal data synthetically generated with Σ = 8 and CRE = 0.3 using the BW algorithm (see Section 4.4 for XIX details on data generation and HMM training). At each step, the example illustrates the update of the composite ROCCH (CH) and the selection of the corresponding set (S) of decision thresholds and Boolean functions for overall improved system performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 Figure 4.5 An illustration of the incrBC algorithm during the operational phase. The example presents the set of decision thresholds and Boolean functions that are activated given, for instance, a maximum specified f pr of 10% for the system designed in Figure 4.4. The operational point (cop ) that corresponds to f pr = 10% is located between the vertices c33 and c24 of the composite ROCCH, which can be achieved by interpolation of responses (Provost and Fawcett, 2001; Scott et al., 1998). The desired cop is therefore realized by randomly taking the responses from c33 with probability value of 0.85 and from c24 with probability value of 0.15 . The decision thresholds and Boolean functions of c33 and c24 are then retrieved from S and applied for operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 Figure 4.6 A comparison of the composite ROCCH (CH) and AUC performance achieved with the HMMs selected according to the BCgreedy and BCsearch algorithms. Suppose that a new pool of four HMMs P2 = {λ21 , . . . , λ24 } is generated from a new block of training data (D2 ), and appended to the previouslygenerated pool, P1 = {λ11 , . . . , λ14 }, in the example presented in Section 4.3.1 (Figure 4.4). The incrBC algorithm combines all available HMMs in P = {λ11 , . . . , λ14 , λ21 , . . . , λ24 }, according to their order of generation and storage, while BCgreedy and BCsearch start by ranking the members of P according to AUC values and then apply their ensemble selection strategies. . . . . . . . . . . . . . 185 Figure 4.7 Overall steps involved to estimate HMM parameters and select HMMs with the highest AUCH for each number of states from the first block (D1 ) of normal data, using 10-FCV and ten random initializations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 Figure 4.8 An illustration of HMM parameter estimation according to each learning technique (BBW, BW, OBW, and IBW) when subsequent blocks of observation sub-sequences (D1 , D2 , D3 , . . .) become available . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Figure 4.9 Results for synthetically generated data with Σ = 8 and CRE = 0.3. The HMMs are trained according to each XX technique with N = 6 states for each block of data providing a pool of size |P| = 1, 2, . . . , 10 HMMs. Error bars are lower and upper quartiles over ten replications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 Figure 4.10 Results for synthetically generated data with Σ = 8 and CRE = 0.3. The HMMs are trained according to each technique with nine different states (N = 4, 5, . . . , 12) for each block of data providing a pool of size |P| = 9, 18, . . . , 90 HMMs. Numbers above points are the state values that achieved the highest average level of performance on each block. Error bars are lower and upper quartiles over ten replications. . . . . . . . . . . . . . . . . . . . . 195 Figure 4.11 Results for synthetically generated data with Σ = 50 and CRE = 0.4. The HMMs are trained according to each technique with 20 different states (N = 5, 10, . . . , 100) for each block of data providing a pool of size |P| = 20, 40, . . . , 200 HMMs. Numbers above points are the state values that achieved the highest average level of performance on each block. Error bars are lower and upper quartiles over ten replications 196 Figure 4.12 Results for sendmail data. The HMMs are trained according to each technique with 20 different states (N = 5, 10, . . . , 100) for each block of data providing a pool of size |P| = 20, 40, . . . , 200 HMMs. Numbers above points are the state values that achieved the highest level of performance on each block . . . . . . . . . . . . . . . 197 Figure 4.13 Ensemble selection results for synthetically generated data with Σ = 50 and CRE = 0.4. These results are for the first replication of Figure 4.11. For each block, the values on the arrows represent the size of the EoHMMs (|E|) selected by each technique from the pool of size |P| = 20, 40, . . . , 200 HMMs . . . . . . . . . . . 199 Figure 4.14 Representation of the HMMs selected in Figure 4.13a with the presentation of each new block of data according to the BCgreedy algorithm with tolerance = 0.01. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . , 200 HMMs indicated by the grid on the figure . . . . . . . . . . . . . . 199 Figure 4.15 Representation of the HMMs selected in Figure 4.13a with the presentation of each new block of data according to the BCsearch algorithm with tolerance = 0.01. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . , 200 HMMs indicated by the grid on the figure . . . . . . . . . . . . . . 200 XXI Figure 4.16 Ensemble selection For each block, the the EoHMMs (|E|) |P| = 20, 40, . . . , 200 results for sendmail data of Figure 4.12. values on the arrows represent the size of selected by each technique from P of size HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Figure 4.17 Representation of the HMMs selected in Figure 4.16a with the presentation of each new block of data according to the BCgreedy algorithm with tolerance = 0.003. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . 200 HMMs indicated by the grid on the figure. . . . . . . . . 201 Figure 4.18 Representation of the HMMs selected in Figure 4.16a with the presentation of each new block of data according to the BCsearch algorithm with tolerance = 0.003. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . 200 HMMs indicated by the grid on the figure. . . . . . . . . 202 Figure 4.19 illustration of the impact on performance of pruning the pool of HMMs in Figure 4.13a . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 Figure 4.20 illustration of the impact on performance of pruning the pool of HMMs in Figure 4.16a . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 XXII LIST OF ABBREVIATIONS μ-HMMs Multiple Hidden Markov Models. ADS Anomaly Detection System. AS Anomaly Size. AUC Area Under the ROC Curve. AUCH Area under the ROC ROC Convex Hull. BC Boolean Combination. BCM Boolean Combination of Multiple Detectors. BKS Behavior Knowledge Space. BW Baum-Welch. CIA Confidentiality, Integrity and Availability. CRE Conditional Relative Entropy. DMZ Demilitarized Zone. DW Detector Window Size. EFFBS Efficient Forward Filtering Backward Smoothing. ELS Extended Least Squares. EM Expectation-Maximization. EoC Ensemble of Classifiers. EoHMMs Ensemble of Hidden Markov Models. FB Forward Backward. XXIV FFBS Forward Filtering Backward Smoothing. FO Forward Only. FSA Finite State Automaton. GAM Generalized Alternating Minimization. GD Gradient Descent. GEM Generalized Expectation-Maximization. HIDS Host-based Intrusion Detection System. HMM Hidden Markov Model. IBC Iterative Boolean Combination. IBW Incremental Baum-Welch. IDPS Intrusion Detection and Prevention System. IDS Intrusion Detection System. IPS Intrusion Prevention System. KL Kullback-Leibler. LAN local Area Network. LT Life Time Expectancy of Models. LCR Largest Concavity Repair. MDI Minimum Discrimination Information. MLE Maximum Likelihood Estimation. MMD Minimum Model Divergence. XXV MMI Maximum Mutual Information. MM Markov Model. MMSE Minimum Mean Square Error. MPE Minimum Prediction Error. MRROC Maximum Realizable ROC. NIDS Network-based Intrusion Detection System. NIST National Institute of Standards and Technology. ODE Ordinary Differential Equation. OS Operating System. RCLSE Recursive Conditioned Least Squares Estimator. RIPPER Repeated Incremental Pruning to Produce Error Reduction. ROCCH ROC Convex Hull. ROC Receiver Operating Characteristic. RPE Recursive Prediction Error. RSPE Recursive State Prediction Error. SSH Secure Shell. SSL Secure Sockets Layer. STIDE Sequence Time-Delay Embedding. TIDE Time-Delay Embedding. UNM University of New Mexico. XXVI VPN Virtual Private Network. WMW Wilcoxon-Mann-Whitney. acc Accuracy. bf Boolean function. err Error Rate. f nr False Negative Rate. f pr False Positive Rate. setuid Set user identifier permission. tnr True Negative Rate. tpr True Positive Rate. uid User identifiers. LIST OF SYMBOLS Δ Fixed positive lag in fixed-lag smoothing estimation. Λ HMM parameter space. Σ Alphabet size of Markov model generator (or of a process). ∇(Y ) X Gradient of X with reference to Y. αt (i) The forward variable. Unnormalized joint probability density of the state i and the observations up to time t, given the HMM λ. βt (i) The backward variable. Conditional probability density of the observations from time t + 1 up to the last observation, given the state at time t is i. γτ |t (i) Conditional state density. Probability of being in state i at time τ given the HMM λ and the observation sequence o1:t . It is a filtered state densisty if τ = t; predictive state density if τ < t, and smoothed state density if τ > t. δi (j) Kronecker delta. It is equal to one if i = j and zero otherwise. Small positive constant representing tolerance in error or accuracy measure values. ε Output prediction error, ε = ot − ôt . Et HMM output prediction error, Et = ot − ôt|t−1 ζ Gradient of state variable alpha w.r.t. HMM state transitions. η Fixed learning rate. ηt Time-varying learning rate. λ HMM parameters, λ = (A, B, π). λ̂ Estimated HMM parameters. λ0 Initial estimation (or guess) of HMM parameters. λ̂k Estimated HMM parameters at the k−th iteration. λr Estimated HMM parameters from the r-th observation. λt Estimated HMM parameters at time t. μ Mean of output probability distribution of a continuous HMM. XXVIII ξτ |t (i, j) Conditional joint state density. Probability of being in state i at time τ and transiting to state j at time τ + 1 given the HMM λ and the observation sequence o1:t . It is a filtered joint state densisty if τ = t; predictive joint state density if τ < t, and smoothed joint state density if τ > t. π Vector of initial state probability distribution. π(i) Element of π. Probability of being in state i at time t = 0. σt Variance of output probability distribution of a continuous HMM. σt (i, j, k) Probability of having made a transition from state i to state j, at some point in the past, and of ending up in state k at the current time t. ς Gradient of state variable alpha w.r.t. HMM outputs. ψt (i) Gradient of the prediction error cost function (J) w.r.t. HMM parameters. ω Decay function. A Matrix of HMM state transition probability distribution. AU CH0.1 Partial area under the ROC convex hull for the range of f pr = [0, 0.1]. B Matrix of HMM state output probability distribution. D Block of training data. Dk Block of training data received at time t = k. E[.] Expectation function. E Selected ensemble of base HMMs from the pool. Ek Selected ensemble of base HMMs after receiving the k th block of data. I Number of iterations of the IBC algorithm. J(λ) Prediction error cost function based on HMM (λ) M Size of output alphabet (V ) of a discrete HMM. The number of distinct observable symbols. N Number of HMM states. t,τ Nijk (O) The probability of making at time t a transition from state i to j with symbol Ot and to be in state k at time τ ≥ t. O Sequence of observation symbols. P Pool of base HMMs. Pk Pool of base HMMs generated after receiving the k th block of data. XXIX S State space of HMM. S Selected set of decision thresholds (from each base HMM) and Boolean functions that most improves the overall ROC convex hull of Boolean combination. T Length of the observation sequence. T Decision threshold. T Test data set. V Output alphabet of HMM. It can be either a continuous or discrete. V Validation data set. aij Element of A. Probability of being in state i at time t and going to state j at time t + 1. aτij (ot ) Probability to make a transition from state i to state j and to generate the output ot at time τ . bj (k) Element of B. Probability of emitting an observation symbol ok at state j. T (λ) Log-likelihood of the observation sequence o1:T with regard to HMM parameters (λ). ot Observation symbol at time t. o1:t Concise notation denoting the observation sub-sequence o1 , o2 , . . . , ot . ôt Output prediction at time t. ôt|t−1 Output prediction at time t, conditioned on all previous output values. qt State of HMM at time t. qt = i denotes that the state of the HMM at time t is Si . wt Gradient of the state prediction filter (γt|t−1 ) w.r.t. HMM parameters. XXX INTRODUCTION Computers and network touch every facet of modern life. Our society is increasingly dependent on interconnected information systems, offering more opportunities and challenges for computer criminals to break into these systems. Security attacks through Internet have proliferated in recent years. Information security has therefore become a critical issue for all – governments, organizations and individuals. As systems become ever more complex, there are always exploitable weaknesses that are inherent to all stages of system development life cycle (from design to deployment and maintenance). Several security attacks have emerged because of vulnerabilities in protocol design. Programming errors are unavoidable, even for experienced programmers. Developers often assume that their products will be used under expected conditions. Many security vulnerabilities are also caused by system misconfiguration and mismanagement. In fact, managing system security is a time-consuming and challenging task. Different complex software components, including operating systems, firmware, and applications, must be configured securely, updated, and continuously monitored for security. This requires a large amount of time, energy, and resources to keep up with the rapidly evolving pace of threat sophistication. Preventive security mechanisms such as firewalls, cryptography, access control and authentication are deployed to stop unwanted or unauthorized activities before they actually cause damage. These techniques provide a security perimeter but are insufficient to ensure reliable security. Firewalls are prone to misconfiguration, and can be circumvented by redirecting traffic or using encrypted tunnels (Ingham and Forrest, 2005). With the exception of one-time pad cipher, all cryptographic algorithms are theoretically vulnerable to cryptanalytic attacks. Furthermore, viruses, worms, and other malicious code may defeat crypto-systems by secretly recording and transmitting secret keys residing in a computer memory. Regardless of the preventive security measures, incidents are likely to happen. Insider threats are invisible to perimeter security mechanisms. At- 2 tacks perpetrated by insiders are very often more damaging because they understand the organization business and have access privileges, which facilitate breaking into systems and extracting or damaging critical information (Salem et al., 2008). Furthermore, social engineering bypasses all prevention techniques and provides a direct access to systems. In addition, organizational security policies typically attempt to maintain an appropriate balance between security and usability, which makes it impossible for an operational system to be completely secure. Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions, defined as attempts to compromise the confidentiality, integrity and availability (CIA), or to bypass the security mechanisms of a computer or network (Scarfone and Mell, 2007). It is more feasible to prevent some attacks and detect the rest than to try to prevent everything. When perimeter security fails, intrusion attempts must be detected as soon as possible to limit the damage and take corrective measures, hence the need for an intrusion detection system (IDS) as a second line of defense (McHugh et al., 2000). IDSs monitor computer or network systems to detect unusual activities and notify the system administrator. In addition, IDSs should also provide relevant information for post-attack forensics analysis, and should ideally provide reactive countermeasures. Without an IDS, security administrators may have no sign of many ongoing or previously-deployed attacks. For instance, attacks that are not intended to damage or control a system, but instead to compromise the confidentiality or integrity of the data (e.g., by extracting or altering sensitive information), would be very difficult to detect. An IDS is not a stand-alone system, but rather a fundamental technology that complements preventive techniques and other security mechanisms. Since their inceptions in the 1980s (Anderson, 1980; Denning, 1986), IDSs have received increasing research attention (Alessandri et al., 2001; Axelsson, 2000; Debar et al., 2000; Estevez-Tapiador et al., 2004; Lazarevic et al., 2005; Scarfone and Mell, 2007; Tucker et al., 2007). IDSs are classified based on their monitoring scope into host-based IDS (HIDS) and network-based IDS (NIDS). HIDSs are designed to monitor the activities of 3 host systems, such as mail servers, web servers, or individual workstations. NIDSs monitor the network traffic for multiple hosts by capturing and analyzing network packets. HIDSs are typically more expensive to deploy and manage, and may consume system resource, however they can detect insider attacks. NIDSs are easier to manage because they are platform independent and typically installed on a designated system, but they can only detect attacks which come through the network. In general, intrusion detection methods are categorized into misuse and anomaly detection. In misuse detection, known attack patterns are stored in a database and then system activities are checked against these patterns. Such approach is also employed in commercial anti-virus products. Misuse detection systems may provide a high level of accuracy, but unable to detect novel attacks. In contrast, anomaly detection approaches learn normal system behavior and detect significant deviations from this baseline behavior. Anomaly detection systems (ADS) can detect novel attacks, but generate a prohibitive number of false alarms due in large part to the difficulty in obtaining complete descriptions of normal behavior. Security events monitored for anomalies typically involve sequential and categorical records (e.g., audit trails, application logs, system calls, network requests) that reflect specific system activities and may indirectly reflect user behaviors. Hidden Markov Model (HMM) is a doubly stochastic process for sequential data Theoretical and empirical results have shown that, given an adequate number of hidden states and a sufficiently rich set of observations, HMMs are capable of representing probability distributions corresponding to complex real-world phenomena (Bengio, 1999; Cappe et al., 2005; Ephraim and Merhav, 2002; Ghahramani, 2001; Poritz, 1988; Rabiner, 1989; Smyth et al., 1997). A well trained HMM provides a compact detector that captures the underlying structures of the monitored system based on the temporal order of events generated during normal operations. It then detects deviations from normal system behavior with high accuracy and tolerance to noise. ADSs based on HMMs have been successfully applied to model sequential security events occurring at different levels within a networking environment, 4 such as at the host level (Cho and Han, 2003; Hu, 2010; Lane and Brodley, 2003; Warrender et al., 1999; Yeung and Ding, 2003), network level (Gao et al., 2003; Ourston et al., 2003; Tosun, 2005), and wireless level (Cardenas et al., 2003; Konorski, 2005). Traditional host-based anomaly detection systems monitor for significant deviation in operating system calls, as they provide a gateway between user and kernel mode (Forrest et al., 1996). In particular, abnormal behavior of privileged processes (e.g., root and setuid processes described in Section 1.2.1) is most dangerous, since these processes run with elevated administrative privileges. Flaws in privileged processes are often exploited to compromise system security. Various neural and statistical anomaly detectors have been applied within ADSs to learn the normal behavior of privileged processes through the system call sequences that are generated (Forrest et al., 2008; Warrender et al., 1999), and then detect deviations from normal behavior as anomalous. Among these, techniques based on discrete HMMs have been shown to produce a very high level of performance (Du et al., 2004; Florez-Larrahondo et al., 2005; Gao et al., 2002, 2003; Hoang and Hu, 2004; Hu, 2010; Wang et al., 2010, 2004; Warrender et al., 1999; Zhang et al., 2003). This thesis focuses on HMM-based anomaly detection techniques for monitoring the deviations from normal behavior of privileged processes based on sequences of operating system calls. Motivation and Problem Statement Ideally, anomaly detection systems should efficiently detect and report all accidental or deliberate malicious activities – from insiders or outsiders, successful or unsuccessful, known or novel, without raising false alarms. In this ideal situation, the output of an ADS should also be readily usable, with no or limited user intervention. More importantly, an ADS must accommodate newly-acquired data and adapt to changes in normal system behavior over time, to maintain or improve its accuracy. 5 In practice, however, ADSs typically generate an excessive number of false alarms – a major obstacle limiting their deployment in real-world applications. A false alarm (or false positive) occurs when a normal event is misclassified as anomalous. False alarms cause expensive disruption due to the need to investigate, and ascertain or refute. It is often very difficult to determine the exact sequence of events that triggered an alarm. Furthermore, frequent false alarms reduce the confidence in the system, and lead operators to undermine the credibility of future alarms. This issue is compounded further when several IDSs are actively monitored by a single operator. In addition, when intrusion detection is coupled with active response capabilities1 (Ghorbani et al., 2010; Rash et al., 2005; Stakhanova et al., 2007), frequent false alarms will affect the availability of system resources, because of possible service disruptions. False alarms are caused by several reasons, including poorly designed detectors, unrepresentative normal data for training, and limited data for validation and testing. Poorly designed anomaly detectors can not precisely describe the typical behavior of the protected system, and hence normal events are misclassified as anomalous. Design issues typically include inadequate assumptions about the real system behavior, inappropriate model selection, and poor optimization of models parameters. For instance, the number of HMM states may have a significant impact on the detection accuracy. Overfitting is an important issue that leads to inaccurate predictions. It occurs when the training process leads to memorization of noise in training data, providing models that are overspecialized for the training data but perform poorly on unseen test data. In general, complex HMMs (with large number of states) trained on limited amount of data are more prone to overfitting. Anomaly detectors based on HMMs require a representative amount of normal training data. In practice, a limited amount of representative normal data is typically provided for training, because the collection and analysis of training data is costly. The anomaly 1 An IDS that actively responds to suspicious activities by, for instance, shutting systems down, logging out users, or blocking network connections is often referred to as an intrusion prevention systems (IPS) or an intrusion detection and prevention system (IDPS). 6 detector will therefore have an incomplete view of the normal process behavior, and hence misclassify rare normal events as anomalous. Furthermore, substantial changes to the monitored environment, reduce the reliability of the detector as the internal model of normal behavior diverges with respect to the underlying data distribution, producing both false positive and negative errors. Therefore, an ADS must be able to efficiently accommodate newly-acquired data, after it has originally been trained and deployed for operations, to maintain or improve system accuracy over time. The incremental re-estimation of HMM parameters raises several challenges. Given newly-acquired data for training, HMM parameters should be updated from new data without requiring access to the previously-learned training data, and without corrupting previously-acquired knowledge (i.e., models of normal behavior) (Grossberg, 1988; Polikar et al., 2001). Standard techniques for training HMM parameters involve iterative batch learning, based either on the Baum-Welch (BW) algorithm (Baum et al., 1970), a specialized expectation maximization (EM) technique (Dempster et al., 1977), or on numerical optimization methods, such as the Gradient Descent (GD) algorithm (Levinson et al., 1983). These techniques assume a fixed (finite) amount of training data available throughout the training process. In either case, HMM parameters are estimated over several training iterations, until the likelihood function is maximized over data samples. Each training iteration requires applying the Forward-Backward (FB) algorithm (Baum, 1972; Baum et al., 1970; Chang and Hancock, 1966) on the entire training data to evaluate the likelihood value and estimate the state densities – the sufficient statistics for updating HMM parameters. To accommodate newly-acquired data, batch learning techniques must restart HMM training from the start using all cumulative data, which is a resource intensive and time consuming task. Over time, an increasingly large space is required for storing cumulative training data, and HMM parameters estimation becomes prohibitively costly. In fact, considering the HMM parameters obtained from previously-learned data as starting point for training on newly-acquired data, may not allow for an optimization process that 7 escapes the local maximum associated with the previously-learned data. Being stuck in local maxima leads to knowledge corruption and hence to a decline in system performance. The time and memory complexity of BW or GD algorithm, for training an HMM with N states, is O(N 2 T ) and O(N T ) respectively, for a sequence of length T symbols. As an alternative, several on-line learning techniques have been proposed in literature to re-estimate HMM parameters from an infinite stream of observation symbols or subsequences. These algorithms re-estimate HMM parameters continuously upon observing each new observation symbol (Florez-Larrahondo et al., 2005; Garg and Warmuth, 2003; LeGland and Mevel, 1995, 1997; Mongillo and Deneve, 2008; Stiller and Radons, 1999) or new observation sub-sequence, (Baldi and Chauvin, 1994; Cappe et al., 1998; Mizuno et al., 2000; Ryden, 1997; Singer and Warmuth, 1996) typically without iteration. In practice however, on-line learning using blocks comprising a limited number of observation sub-sequences yields a low level of performance as one pass over each block is not sufficient to capture the phenomena. A single classifier systems for incremental learning may approximate the underlying data distribution inadequately. Indeed, incremental learning using a single HMM with a fixed number of states may not capture a representative approximation of the normal process behavior. Different HMMs trained with different number of states capture different underlying structures of the training data. Furthermore, HMMs trained according to different random initializations may lead the algorithm to converge to different solutions in parameters space, due to the many local maxima of the likelihood function. Therefore, a single HMM not provide a high level of performance over the entire detection space. Ensemble methods have been recently employed to overcome the limitations faced with a single classifier system (Dietterich, 2000; Kuncheva, 2004a; Polikar, 2006; Tulyakov et al., 2008). Theoretical and empirical evidence have shown that combining the outputs of several accurate and diverse classifiers is an effective technique for improving the overall system accuracy (Brown et al., 2005; Dietterich, 2000; Kittler, 1998; Kuncheva, 2004a; 8 Polikar, 2006; Rokach, 2010; Tulyakov et al., 2008). In general, designing an ensemble of classifiers (EoC) involves generating a diverse pool of base classifiers (Breiman, 1996; Freund and Schapire, 1996; Ho et al., 1994), selecting an accurate and diversified subset of classifiers (Tsoumakas et al., 2009), and then combining their output predictions (Kuncheva, 2004a; Tulyakov et al., 2008). The combination of selected classifiers typically occurs at the score, rank or decision levels (Kuncheva, 2004a; Tulyakov et al., 2008). Fusion at the score level typically requires normalization of the scores before applying the fusion functions, such as sum, product, and average (Kittler, 1998). Fusion at the rank level is mostly suitable for multi-class classification problems where the correct class is expected to appear in the top of the ranked list (Ho et al., 1994; Van Erp and Schomaker, 2000). Fusion at the decision level exploits the least amount of information since only class labels are considered. Majority voting (Ruta and Gabrys, 2002), weighted majority voting (Ali and Pazzani, 1996; Littlestone and Warmuth, 1994) are among the most representative decision-level fusing functions. Combination of responses in the Receiver Operating Characteristic (ROC) space have been recently investigated as alternative decision-level fusion techniques. The ROC convex hull (ROCCH) or the maximum realizable ROC (MRROC) have been proposed to combine detectors based on a simple interpolation between their responses (Provost and Fawcett, 2001; Scott et al., 1998). Using Boolean conjunction or disjunction functions to combine the responses of multiple soft detectors in the ROC space have shown and improved performance over the MRROC (Haker et al., 2005; Tao and Veldhuis, 2008). However, these techniques assume conditionally independent classifiers and convexity of ROC curves (Haker et al., 2005; Tao and Veldhuis, 2008). EoCs may be adapted for incremental learning by generating a new pool of classifiers as each block of data becomes available, and combining the outputs with those of previouslygenerated classifiers with some fusion technique (Kuncheva, 2004b). Most existing ensemble techniques proposed for incremental learning aim at maximizing the system performance through a single measure of accuracy (Chen and Chen, 2009; Muhlbaier et al., 9 2004; Polikar et al., 2001). Furthermore, these techniques are mostly designed for twoor multi-class classification tasks, and hence require labeled training data sets to compute the errors committed by the base classifiers and update the weight distribution at each iteration. In HMM-based ADSs, the HMM detector is a one-class classifier that is trained using the normal system call data only. Moreover, an arbitrary and fixed threshold is typically set on the output scores of base classifiers prior to combining their decisions. This implicitly assumes fixed prior probabilities and misclassification costs. Nevertheless, the effectiveness of an EoC strongly depends on the decision fusion strategies. Standard techniques for fusion, such as voting, sum or averaging, assume independent classifiers and do not consider class priors (Kittler, 1998; Kuncheva, 2002). However, these assumptions are violated in anomaly detection applications, since detectors are typically designed using limited representative training data. Furthermore, the prior class distributions and misclassification costs are imbalanced, and may vary over time. In addition the correlation between classifiers decisions depends on the selection of decision thresholds. Objectives and Contributions This thesis addresses the challenges mentioned above and aims at improving the accuracy, efficiency and adaptability of existing host-based anomaly detection systems based on HMMs. A major objective of this thesis is to develop adaptive techniques for ADSs based on HMMs that can maintain a high level of performance by efficiently adapting to newlyacquired data over time to account for rare normal events and adapt to legitimate changes in the monitored environments. To address the objectives mentioned above, this thesis presents effective techniques for adaptation of ADS based on HMMs. In response to new training data, these techniques allow for incremental learning at the decision and detection levels. At the decision level, this thesis presents a novel and general learn-and-combine approach based on Boolean combination of detector responses in the ROC space. It also proposes improved techniques for incremental learning of HMM parameters, which allow to accommodate new 10 data over time. In addition, this thesis attempts to provide an answer to the following question: Given newly-acquired training data, is it best to adapt parameters of HMM detectors trained on previously-acquired data, or to train new HMMs on newly-acquired data and combine them with those trained on previously-acquired data? This thesis presents the following key contributions: At the HMM Level. • A survey of techniques found in literature that are suitable for incremental learning of HMM parameters. These techniques are classified according to the objective function, optimization technique, and target application, involving block-wise and symbol-wise learning of parameters. Convergence properties of these techniques are presented, along with an analysis of time and memory complexity. In addition, the challenges faced when these techniques are applied to incremental learning is assessed for scenarios in which the new training data is limited and abundant. As a result of this survey, improved strategies for incremental learning of HMM parameters are proposed. They are capable of maintaining or improving HMM accuracy over time, by limiting corruption of previously-acquired knowledge, while reducing training time and storage requirements. These incremental learning strategies provide efficient solutions for HMM-based anomaly detectors, and can be applied to other application domains. • A novel variation of the Forward-Backward algorithm, called the Efficient Forward Filtering Backward Smoothing (EFFBS). EFFBS reduces the memory complexity, required for training an HMM with N states on a sequence of length T , from O(N T ) to O(N ), without increasing the computational cost. EFFBS is useful when abundant data (large T values) is provided for training HMM parameters. • In depth investigation of the impact of several factors on HMM-based ADSs performance of training set size, number of HMM states, detector window size, anomaly size, and complexity and irregularity of the monitored process. 11 At the Decision Level. • An efficient system is proposed for incremental learning of new blocks of training data using a learn-and-combine approach. When a new block of training data becomes available, a new pool of HMMs is generated using different number of HMM states and random initializations. The responses from the newly-trained HMMs are then combined to those of the previously-trained HMMs in ROC space using a novel incremental Boolean combination technique. The learn-and-combine approach allows to select a diversified ensemble of HMMs (EoHMMs) from the pool, and adapts the Boolean fusion functions and thresholds for improved performance, while it prunes redundant base HMMs. • Since the pool size grows indefinitely as new blocks of data become available over time, employing specialized model management strategies is therefore a crucial aspect for maintaining the efficiency of an adaptive ADS. When a new pool of HMM is generated from a new block of data, the best EoHMMs is selected from the pool according to the proposed model selection algorithms (BCgreedy or BCsearch ), and then deployed for operations. Less accurate and redundant HMMs that have not been selected for some user-defined time interval are discarded from the pool. • The proposed Boolean combination techniques includes: – A new technique for Boolean combination (BC) of responses from two detectors in ROC space. – A new technique for cumulative Boolean combination of multiple detector (BCM ) responses. BCM combines the results of BC (from the first two detectors) with the responses form the third detector, then with those of the fourth detector, and so on until the last detector. When applied to incremental learning from new data a variant of this algorithm is referred to as incrBC to emphasize its incremental learning properties. – A new iterative Boolean combination (IBC) technique. IBC recombines original detector responses with the combination results of BCM over several iterations, 12 until the ROCCH stops improving (or a maximum number of iteration is performed). • These techniques apply all Boolean functions to combine detector responses corresponding to each decision threshold (or each crisp detectors), and select the Boolean functions and decision thresholds (or crisp detectors) that most improve the overall ROCCH over that of original detectors. Since all Boolean functions are applied to explore the solution space, the proposed techniques require no prior assumptions about conditional independence of detectors or convexity of ROC curves, and their time complexity is linear with the number of soft detectors and quadratic with the number of crisp detectors. Improving the ROCCH minimizes both false positive and negative errors over the entire range of operating conditions, separately from assumptions about class distributions and error costs. More importantly, the overall composite ROCCH is always retained for adjusting the desired operating point during operations, and hence accounting for changes in prior probabilities and costs of errors. • The proposed Boolean combination techniques are general in the sense that they can be employed to combine diverse responses of any set of crisp or soft one- or twoclass classifiers, within a wide range of application domains. This includes combining the responses of the same detector trained on different data or features or trained according to different parameters, or from different detectors trained on the same data. In particular, they can be effectively applied when training data is limited and test data is imbalanced. Proof-of-concept simulations are applied to adaptive anomaly detection from system call sequences. The experiments are conducted on both synthetically generated data and sendmail data from the University of New Mexico (UNM) data sets2 . They are conducted using a wide range of training set sizes with various alphabet sizes and complexities, and different anomaly sizes. The impact on performance of different techniques is assessed 2 http://www.cs.unm.edu/~immsec/systemcalls.htm 13 through different ROC measures, such as the area under the curve (AUC), partial AUC, and true positive rate (tpr) at a fixed false positive rate (f pr). Organization of the Thesis This manuscript-based thesis is structured into four chapters and five appendices. Figure 0.1, presents a schematic diagram of the thesis structure, and depicts the relationship between chapters and appendices. As illustrated in Figure 0.1 and further described next, most chapters (and appendices) consist of published (or submitted for publication) articles or papers in refereed scientific journals or conferences. The contents of each chapter is almost the same as that of the published paper, with minor modifications for consistency of notation throughout the thesis. Therefore, an overlap could not be avoided. Although each chapter may be read independently, it is recommended to read the thesis sequentially. However, readers unfamiliar with HMMs may benefit from reading Appendix I and II, prior to reading Chapter 2. Appendix V may also provide good introduction to Chapter 3. The rest of this thesis is structured as follows. The first chapter provides an introduction to intrusion detection systems and their components, describes the general and basic concepts related to IDS, and presents a overview of relevant and related literature on host-based anomaly detection systems monitoring privileged process behavior by using system call sequences. It then focuses on process anomaly detection based on hidden Markov models. Both real and synthetic data sets employed in this research are also presented and the methodology considered for evaluation of IDSs performance is described. This chapter ends with the challenges facing process anomaly detection with HMMs. Chapter 2 presents the survey of techniques found in literature that may be suitable for incremental learning of HMM parameters, which has been accepted in Information Sciences (Khreich et al., 2011a). Most techniques found in literature apply to on-line learning of HMM parameters from an infinite data stream. In appendix II, representative on-line techniques for training HMM parameters (from Chapter 2) are selected, extended 14 CHAPTER 0 Introduction Appendix I On the Memory Complexity of the Forward-Backward Algorithm (Published in PRL) Appendix II A Comparison of Techniques for On-line Incremental Learning of HMM Parameters in Anomaly Detection (Appeared in CISDA09) CHAPTER 1 Anomaly Detection with HMMs CHAPTER 2 A Survey of Techniques for Incremental Learning of HMM Parameters (Accepted in INS) CHAPTER 3 Iterative Boolean Combination of Classifiers in the ROC Space: An Application to Anomaly Detection with HMMs (Published in PR) (A general version appeared in ICPR10) Appendix III Incremental Learning Strategy Appendix IV Boolean Functions and Additional Results for Chapter 3 Appendix V Combining Hidden Markov Models for Improved Anomaly Detection (Appeared in ICC09) CHAPTER 4 Adaptive ROC-based Ensembles of HMMs applied to Anomaly Detection (Accepted in PR) (A general version appeared in MSC11) CONCLUSIONS Figure 0.1 Structure of the thesis to incremental learning, and compared to the original algorithms. Appendix II, appeared in the IEEE international conference on computational intelligence for security and defense applications (CISDA09) (Khreich et al., 2009a). Appendix III proposes strategies for further improvement of incremental learning of HMM parameters. A variation of these incremental learning strategies based on Baum-Welch algorithm (IBW) is applied in Chapter 4, and compared with the novel learn-and-combine approach proposed. Al- 15 though on-line learning techniques provide efficient solutions for training HMMs parameters from long observation sequences, they provide approximations to the smoothed state densities computed with the FB algorithm. The Efficient Forward Filtering Backward Smoothing (EFFBS) algorithm described in Appendix I, provides an exact alternative to the FB algorithm, with a memory complexity that is independent of the sequence length. This chapter has been published in Pattern Recognition Letters (Khreich et al., 2010c). Chapter 3 describes the Boolean combination techniques proposed for efficient fusion of responses from detectors in the ROC space. The performance obtained by combining the responses of ADSs based on multiple HMMs, trained on fixed-size data sets, according to the Boolean combination techniques is compared to that of related techniques. This chapter has been published in Pattern Recognition (Khreich et al., 2010b). Further details about the Boolean functions and additional results are presented in Appendix IV. A version of this chapter, which addresses the Boolean combination in the more general decision-level fusion context, appeared in International Conference on Pattern Recognition (ICPR10) (Khreich et al., 2010a). The proposed Boolean combination techniques have been also successfully applied to combine different biometric systems for iris recognition (Gorodnichy et al., 2011). Appendix V presents an ADS with multiple-HMMs combined according to the MRROC technique. This chapter appeared in the International Conference on Communications (ICC09) (Khreich et al., 2009b). Chapter 4 presents a novel anomaly detection system based on the learn-and-combine approach to accommodate newly-acquired system call data over time. It describes the generation, selection and pruning of HMMs, as well as selection and adaptation of Boolean functions and decision thresholds, when a new block of training data becomes available. The performance of the proposed system is compared to that of the reference batch BW (BBW), of on-line BW (OBW), and to the proposed incremental (IBW) described in Appendix III. This chapter has been accepted in Pattern Recognition (Khreich et al., 2011c). A version of this chapter, addressing incremental Boolean combination in the more general sense, has been also accepted in Multiple Classifier Systems (MSC11) (Khreich et al., 16 2011b). Finally, a summary of contributions and discussion of key findings, followed by recommendations for future extensions to this research are presented in the conclusions. CHAPTER 1 ANOMALY INTRUSION DETECTION SYSTEMS This chapter presents an introduction to intrusion detection systems (IDSs). It starts with a general description of an IDS and its components, followed by an overview of common intrusion detection techniques in both host and network systems. It then focuses on host-based anomaly detection techniques using system call sequences generated by privileged processes. In particular, related work on HMM-based ADSs is presented and discussed in greater details. Both real and synthetic data sets employed in this research are presented next as well as the methodology considered for evaluation of IDSs performance. This chapter ends with a discussion on the challenges facing process anomaly detection with HMMs. 1.1 Overview of Intrusion Detection Systems IDSs allow for the detection of successful or unsuccessful attempts to compromise systems security. An IDS is an important component of any security infrastructure that complements other security mechanisms. As illustrated in Figure 1.1, an IDS consists of four essential components: sensors, analysis engines, data repository, and management and reporting modules. An IDS monitors the activity of a target system through a data source, such as system call traces, audit trails, or network packets. Relevant information from these data sources are captured by IDSs sensors, synthesized as events, and forwarded to the analysis engine for on-line analysis or to a repository for off-line analysis. The analysis engine contains decision-making mechanisms to discriminate malicious events from normal events. It may include anomaly, misuse, or hybrid detection approaches (described next). Outputs from analysis engines include specific information regarding manifestation of suspicious events. These information are stored in a repository for foren- 18 Data Srouces Intrusion detection systems (IDS) Internet Raw Data Sensor Events Data Repository • • • • Audit trails System calls Network trafic ··· Alerts Severs Host PCs Management & Reporting Module IDS Administrator Figure 1.1 High level architecture of an intrusion detection system sics analysis. A management and reporting module receives events that could indicate an attack from the analysis engine, raises an alarm to notify human operators, and reports the relevant information and the level of threat. The management module controls operations of IDS components, such as tuning decision thresholds of the analysis engine and updating the data repository. The configuration of analysis engines, update of data repository, and response to alerts are among the responsibilities of the IDS administrator. When alerts are raised, the IDS administrator should prioritize and investigate incidents to refute or confirm that an attack has actually occurred. If an intrusion attempt is confirmed, a response team should react to limit the damage, and a forensic analysis team should investigate the cause of the successful attack. An IDS may include a response module that undertake further actions either to prevent an ongoing attack or to collect additional supporting information – it is often referred to as intrusion prevention system (IPS) or intrusion detection and prevention system (IDPS), (Ghorbani et al., 2010; Rash et al., 2005; Scarfone and Mell, 2007; Stakhanova et al., 2007). 19 IDSs are typically categorized depending on their monitoring scope (or location of the sensors) into network-based and host-based intrusion detection systems. They are also classified based on the detection methodology (employed by the analysis engine) into misuse and anomaly detection. More detailed taxonomies have been also developed, which further classify IDSs according to their architecture (centralized or fully distributed), behavior after attacks (passive or active), processing time (on-line or off-line), level of inspection (stateless or state full), etc. (Alessandri et al., 2001; Axelsson, 2000; Debar et al., 2000; Estevez-Tapiador et al., 2004; Lazarevic et al., 2005; Scarfone and Mell, 2007; Tucker et al., 2007). 1.1.1 Network-based IDS Network-based IDSs (NIDSs) monitor the network traffic for multiple hosts by capturing and analyzing network packets for patterns of malicious activities. An NIDS is typically a stand-alone device that can control a network of arbitrary size with a small number of sensors (Lim and Jones, 2008). As illustrated in Figure 1.2, the sensors are often located at critical network junctures, such as network borders, demilitarized zone (DMZ)1 , or inside the local network (Northcutt and Novak, 2002; Proctor, 2000). NIDSs capture network traffic in promiscuous mode by connecting to a hub, network switch configured for port mirroring2 , or network tap3 . Examples of open source NIDSs include Snort (Roesch, 1999) and Bro (Paxson, 1999). NIDSs are platform independent, easy to deploy, and can cover large networks without consuming network or host resources. They are also able to detect unsuccessful attack attempts. However, an NIDS can detect only attacks which come through the monitored 1 Demilitarized zone is a network segment located between a secured local network and unsecured external networks (Internet). DMZ usually contains servers that provide services to users on the external network, such as web, mail, and DNS servers, which must be hardened systems. Two firewalls are typically installed to form the DMZ. 2 Port mirroring or spanning cross connects two or more ports on a network switch so that traffic can be simultaneously sent to a network analyzer connected to another port. 3 A network tap is a direct connection between a sensor and the physical network. It provides the sensor with a copy of the data flowing across the network. 20 network links and has no visibility into individual systems, and hence unable to verify attack results. Advances in technology such as high-speed networks, switched networks and encrypted traffic have imposed several challenges on NIDSs. To inspect network traffic, a NIDS must reconstruct network data streams from every host. For instance, it must reassemble TCP/IP fragments and also take into account the variability in the implementation of these network protocols among different platforms. This causes a significant performance bottleneck and packet drop, especially with high-speed network throughputs. In addition, traffic reconstruction may open new doors for attackers (Peng et al., 2007; Ptacek and Newsham, 1998). The migration from shared to switching environment has enhanced both network security and performance, but also complicates the allocation of NIDSs sensors. In traditional shared network, hubs were commonly used to connect segments of a local area network (LAN). When a packet arrives at one port, it is copied to the other ports to be available for all segments of the LAN. In switched networks messages are directly sent to the ultimate recipient, without broadcast and collision information over the network. Switching reduces the network traffic and makes it more difficult for intruders to sniff switched traffic, however it presents additional challenges on the distribution of sensors, especially if port spanning is not enabled on a switch. Enabling port spanning can also pose a risk if the spanned port is accessible to intruders. In addition, network communications are increasingly using encryption technologies such as virtual private network (VPN), secure sockets layer (SSL), and secure shell (SSH) to conceal plain text. Encryption also prevents NIDSs from accessing and analyzing the content of network traffic. In fact, it is very difficult to analyze encrypted traffic until it has been decrypted with specific a application on the target host. 1.1.2 Host-based IDS Host-based IDSs (HIDSs) are designed to monitor the activity of a host system, such as a mail server, web server, or an individual workstation. HIDSs identify intrusions by 21 Front-end Firewall Internet LAN NIDS DMZ NIDS NIDS NIDS DNS DHCP File Server Server Server Local Servers Back-end Firewall DNS Web Mail Server Server Server Public Servers Workstations Figure 1.2 Host-based IDSs run on each server or host systems, while network-based IDS monitor network borders and DMZ analyzing operating system calls, audit trails, application logs, file-system modifications (e.g., password files, access control lists, executable files), and other host activities and state (De Boer and Pels, 2005; Vigna and Kruegel, 2005). HIDSs are typically softwarebased systems which should be installed on every host of interest. In contrast with NIDSs, most HIDSs have software-based sensors that are able to extract sensitive (decrypted) information from a specific host at both kernel and user level, which make them more versatile systems. Examples of HIDSs include OSSEC (Bray et al., 2008) and NIDES (Anderson et al., 1995). HIDSs are particularly effective at detecting insider attacks and verifying their success or failure. They are able to inspect low-level local system activities such as file access or modification time, changes to file permissions, or attempts to access privileged pro- 22 cesses or overwrite system executables; or to install new applications, Trojan horses4 or backdoors5 , or other attacks that may involve software integrity breaches. Although HIDSs typically require no dedicated hardware, they have to be tailored for a specific platform, and must be implemented on each individual machine. While this is manageable in a small homogeneous network, it may impose deployment and management difficulties as the network becomes larger and more heterogeneous. HIDSs are concerned only with individual systems and usually have limited view of network activity. Depending on the underlying detection methods, a HIDS can cause a significant degradation in hosts performance. In addition, HIDSs are prone to tampering since they are accessible to users. 1.1.3 Misuse Detection Misuse detection approaches analyze host or network activity, looking for events that match patterns of known attacks (signatures). First a reference database of attack signatures is constructed, then monitored events from sensors data are compared against this database for evidence of intrusions. Signature matching is the most commonly employed misuse detection technique. For instance, Snort is a well-known open source signature-based network intrusion detection system (Roesch, 1999). Other misuse detection approaches include rule-based systems, state transition analysis, machine learning and data mining techniques. In rule-based techniques, a set of if-then-else implication rules is used to characterize attacks (Lindqvist and Porras, 1999). Rule-based techniques are employed in various IDSs, such as EMERALD (Porras and Neumann, 1997) and NIDES (Anderson et al., 1995). 4A Trojan horse is a malicious program that masquerades as a legitimate application or file. Users are typically tricked into opening and executing it on their systems because it looks useful such as a music or picture file in E-mail attachments. Trojan infections range from annoying (modifying desktop appearance) to destructive (deleting files or changing information). Trojans are also commonly used to create backdoors. 5 A backdoor is a tool installed on a compromised system to give an attacker direct access at a later date, without going through normal authentication or other security procedures. Typically, a backdoor program listens on specific ports to interact with the attackers. 23 State transition6 techniques represent an attack as a minimal sequence of actions that an intruder must perform to break into a system, e.g., STAT (Ilgun et al., 1995). Recently, several machine learning and data mining techniques have been employed to extract attack signatures from collected data which contains some known attacks (Brugger et al., 2001; Helali, 2010; Lee et al., 2000). Misuse detection systems provide a high level of detection accuracy without generating an overwhelming number of false alarms, but unable to detect novel attacks. Attacks for which signatures or rules have not been extracted yet, will go undetected. Signatures or rules describing new attacks must therefore be constantly updated. However, some time lag will inevitably occur before analyzing and incorporating patterns of newly deployed attacks, and thus novel attacks will always do some damage (zero-day attack). In general signature definition is a difficult task. Loosely defined signatures decrease the detection accuracy and generate false alarms. However, by employing very specific signatures, a slight variation of the attack may not be detected (polymorphic attack). In practice, attackers attempt to deploy several variations of the same attack, and the vulnerabilities can change throughout the life cycle of a system and from one system to another. Therefore, the number of signatures or rules increases over time and may become unmanageable. Despite these limitations, misuse detection is still the most commonly applied approach in commercial IDS. 1.1.4 Anomaly Detection Anomalies are patterns in data that do not conform to expected normal behavior. Anomaly detectors identify anomalous behavior on a host or network, under the assumption that attack patterns are typically different from those of normal behavior. An anomaly detection system (ADS) constructs a profile of expected normal behavior by using event data from users, processes, hosts,network connections, etc. These data sets 6 States are used to characterize different snapshots of a system during the evolution of an attack, while transitions are user actions associated with specific events. 24 are typically collected over a period of normal (attack-free) operation to build normal profiles. During operation, an ADS attempts to detect sensor data events that deviate significantly from the normal profile. These deviations are considered as anomalous activities, however they are not necessarily malicious, since they may also correspond to rare normal events, program errors, or changes in normal behavior. Several anomaly detection approaches have been proposed in literature (Chandola et al., 2009b), including statistical (Anderson et al., 1995; Markou and Singh, 2003a), rule-based (Vaccaro and Liepins, 1989), neural (Debar et al., 1992; Markou and Singh, 2003b), and machine learning approaches (Alanazi et al., 2010; Kumar et al., 2010; Ng, 2006; Tsai et al., 2009). Unlike misuse detection techniques, anomaly detection is capable of detecting novel attacks. As discussed in Section , ADSs generate a large number of false alarms that would not be tolerated by the operator, and may decrease his confidence in the system. These alarms are not very instructive and require a costly and time-consuming investigation. In fact, ADSs can only detect symptoms of attacks without specific knowledge of details. However, anomaly detection remains an active intrusion detection research area. 1.2 Host-based Anomaly Detection Host-based anomaly detection systems typically monitor the behavior of system processes7 to determine whether a process is behaving normally or has been subverted by an attack. In particular, abnormal behavior of privileged processes is most dangerous. Attacks exploiting vulnerabilities in privileged processes can lead to full system compromise. In general, ADSs monitor events from variety of data sources, including log files created by a process and audit trails generated by the operating system. While these sources provide important security information, it is fairly easy to deploy attacks which do not leave any traces in traditional logs or audit trails, and hence evade detection. Sequences of system calls issued by a process to request kernel services, have been shown effec7A process is a program in execution. 25 tive in describing normal process behavior (Forrest et al., 1996). A substantial amount of research have investigated various techniques for detecting anomalies in system call sequences (Forrest et al., 2008; Warrender et al., 1999). The remaining of this section provides a brief background on privileged processes and system calls and then reviews system call-based approaches to process anomaly detection. 1.2.1 Privileged Processes A privilege is a permission to perform an action. The fundamental principles of least privilege and separation of privileges should be enforced on all subjects8 at every level of the organization, for improved security, reliability, and accountability. According to the principle of least privilege (or need-to-know), all subjects should only be given enough privileges to perform their tasks (Saltzer and Schroeder, 1975). Ensuring least privilege requires identifying a subject tasks, determining the minimum set of privileges required to perform those tasks, and restricting the subject to a protection domain – a logical boundary that controls or prevents direct access among entities with different levels of privilege. To enforce these principles, and control subjects’ access to objects9 , modern computer systems provide several layers of protection domains. Modern processor architectures allow the operating system (OS) to run at different privilege levels (Silberschatz et al., 2008). In a typical dual-mode operation, user applications run in non-privileged user mode, while critical OS code runs in a privileged kernel mode and has access to system data and hardware. In user mode, programs have limited access to system information, and execute within their own virtual memory space. In a multi-user operating system, two protection domains (user and kernel) are insufficient, since users need to share system objects and processes. For instance, in UNIX-like 8A subject is an active entity such as a user, administrator, process, or any other device that causes information to flow among objects or changes the system state. 9 An object is a passive entity that contains or receives data, such as a data file, directory, printer, or other device. 26 OS, each user is associated with a protection domain. Each domain is a collection of access rights, each of which is an ordered a pair <object-name, rights-set>. When a user executes a program, the process runs in the protection domain associated with the user. Typically, two user identifiers (uid) are employed within the kernel. The real uid identifies the user who has created a process, and the effective uid allows a process to gain or drop privileges temporarily. Likewise, users with similar privileges have the same group identifier. The root, or superuser, has the highest level of privilege (uid = 0); an absolute power over a system. Users are granted different levels of privileges by the root according to the organization’s security policy. In some cases users privileges are insufficient to complete the required tasks. For example, if users are granted write access to the root-owned password file (/etc/passwd), they can alter other passwords; otherwise they cannot change their own password. UNIX-like OS employ the set user identifier (setuid) permission, which temporarily changes the effective uid of the process to that of the owner of the file, whereas the real uid is left unchanged. Therefore, the process inherits the privileges of the owner during the execution time of the required task, and then drops it. The setuid scheme allows low-privilege users to execute higher-privilege processes. A similar idea exists for groups. Privileged processes and daemons10 running with setuid permissions have been the source of endless security problems (Chen et al., 2002). Any process spawned by an elevatedprivileges process, runs with these privileges. Buffer overflows are the most common attacks targeting privileged (root-owned) processes for escalation of privilege (AlephOne, 1996). Buffer overflow occurs when more data is written into the memory allocated to a variable than was allocated at compile time. Buffer overflow attacks exploit improper bounds checking on input data, to redirect program execution to arbitrary malicious code. If successful, the attacker may spawn a root shell and take full control over the 10 A process continuously running in the background often with root or setuid permission and provides services from other processes. 27 system. Even though they have been reported for more than 15 years, buffer overflow vulnerabilities are likely to persist (Foster et al., 2005). 1.2.2 System Calls System calls provide an interface between user and kernel mode. Users requests are made from applications or libraries11 to the kernel via a platform-dependent set of system calls. A program that is not statically compiled will typically link to libraries at run-time to dynamically load required functions. Applications or libraries must invoke system calls to request kernel services, such as access to standard input/output, physical devices and network resources. Every system call has a unique number (known by the kernel) and a set of arguments. As illustrated in Figure 1.3, system calls are usually grouped according to their services. Examples of system calls for file management include open(), read(), write(), and close(). The length of a system call sequence generated by a process depends on the complexity and execution time of the process. The OS gains control upon a system call12 , switches to kernel mode, performs the requested service, and switches back to user mode (Silberschatz et al., 2008). Manifestations of a wide variety of attacks, including buffer overflows, symbolic link, decode and SYN floods may appear at the system call level and differ from normal behavior of privileged processes, (Forrest et al., 1996; Hofmeyr et al., 1998; Kosoresow and Hofmeyer, 1997; Somayaji, 2002; Warrender et al., 1999). Furthermore, after gaining root privileges on a host, attackers will typically try to maintain administrative access on the compromised host (Blunden, 2009; Mitnick and Simon, 2005), by for instance installing backdoors and rootkits. Attackers will also attempt to distribute Trojan horses, using for instance comprised email addresses or shared network folders, to compromise other systems. These malicious activities may also invoke different sequences of system calls than those generated during normal execution of a process. 11 In general, libraries runs in user mode. However, some libraries (e.g., standard C library) could offer a portion of system call interface. 12 System calls require an architecture-specific control transfer, which is typically implemented through a software interrupt or trap. Interrupts transfer control form user mode to the kernel mode and back. 28 root-owned processes User mode passwd setuid cron syslog user-level processes editor shell interpreter compiler Types of system calls Process control: File management: • end, abort • create/delete file • load, execute • open, close • create/terminate process • read, write, reposition • get/set process attributes shared system libraries • wait for time • wait/signal event Kernel mode process control, file management, device manipulation, information maintenance, and communications. Hardware system calls device controllers memory controllers physical memory disk file IPC system device driver CPU scheduiling • allocate/free memory sockets TCP/IP virtual memory device driver Device manipulation: • request/release device • read, write, reposition • get/set device attributes I/O devices network • logically devices terminal attach/detach • get/set file attributes Information maintenance: • get/set time or date • get/set system data • get/set process, file, or device attributes Communications: • create/delete communication connection • send/receive messages • transfer status information • attach/detach remote devices Figure 1.3 A high-level architecture of operating system, illustrating system-call interface to kernel and generic types of system calls 1.2.3 Anomaly Detection using System Calls Forrest et al. (1996) were the first to suggest that the temporal order of system calls could be used to represent the normal behavior of a privileged process. They have collected system call data sets from various privileged process at the University of New Mexico (UNM) (as described in Section 1.4.1), and confirmed that short sub-sequences of system calls are consistent with normal process operation, and unusual burst will occur during an attack. Their anomaly detection system, called Time-Delay Embedding (TIDE), employed a sliding look-ahead window of a fixed length to record correlations between pairs of system calls. These correlations were stored in a database of normal patterns. During operations, the sliding window scheme is used to scan the system calls generated by the monitored process for anomalies – sub-sequences that are not found in the normal data. These anomalies were accumulated over the entire sequence and an alarm was raised if the anomaly count exceeded a user-defined threshold. Sequence time-delay embedding (STIDE) extended previous work by segmenting and enumerating system call sequences generated by a privileged process of interest into fixed-length 29 contiguous sub-sequences, using a fixed-size sliding window, shifted by one symbol (Warrender et al., 1999). During operations however, an anomaly score was defined as the number of mismatches in a temporally local region. To reduce the number of false alarms, a threshold was set for the anomaly score above which a sequence is considered as anomalous. Hamming-distance between sub-sequences has been also employed as a measure of anomalous behavior (Hofmeyr et al., 1998). Several statistical and machine learning techniques, and other various extensions have been investigated over the last two decades for detecting system call anomalies using the UNM data sets Forrest et al. (2008). For instance, an inductive rule generator called RIPPER (Repeated Incremental Pruning to Produce Error Reduction) (Cohen, 1995), has been used for analyzing sequences of system calls and extracting rules (Fan et al., 2004). Finite state automata (FSA) have been proposed to model the system calls language, using deterministic or nondeterministic automatons (Michael and Ghosh, 2002; Sekar et al., 2001), or a call graph representation (Wagner and Dean, 2001). Lee and Xiang (2001) evaluated information-theoretic measures such as entropy and information cost. Application of machine learning techniques include neural network (Ghosh et al., 1999), k-nearest neighbors (Liao and Vemuri, 2002), n-grams Marceau (2000), Bayesian models (Kruegel et al., 2003), Markov models (Jha et al., 2001). Among these, techniques based on discrete HMMs have been shown to produce a high level of accuracy (Du et al., 2004; Florez-Larrahondo et al., 2005; Gao et al., 2002, 2003; Hoang and Hu, 2004; Hu, 2010; Wang et al., 2010, 2004; Warrender et al., 1999; Zhang et al., 2003). These HMMbased techniques are detailed in the next section. 1.3 Anomaly Detection with HMMs Hidden Markov model is a stochastic process determined by the two interrelated mechanisms – a latent Markov chain having a finite number of states, N , and a set of observation probability distributions, each one associated with a state. Starting from an initial state Si ∈ {S1 , ..., SN }, determined by the initial state probability distribution πi , 30 at each discrete-time instant, the process transits from state Si to state Sj according to the transition probability distribution aij . The process then emits a symbol vk , from a finite alphabet V = {v1 , . . . , vM } with M distinct observable symbols, according to the discrete-output probability distribution bj (vk ) of the current state Sj (Cappe et al., 2005; Elliott, 1994; Ephraim and Merhav, 2002; Rabiner, 1989). Let qt ∈ S denotes the state of the process at time t, where qt = i indicates that the state is Si at time t. An observation sequence of length T is denoted by O = o1 , . . . , oT (or more concisely by o1:T ), where ot is the observation at time t. An HMM is commonly parametrized by λ = (π, A, B), where • π = {πi } denotes the vector of initial state probability distribution, πi P (q1 = i)1≤i≤N • A = {aij } denotes the state transition probability distribution, aij P (qt+1 = j | qt = i)1≤i,j≤N • B = {bj (k)} denotes the state output probability distribution, bj (k) P (ot = vk | qt = j)1≤j≤N, 1≤k≤M The matrices A and B, and the transpose of vector π (π ) are row stochastic, which impose the following constraints: N j=1 aij = 1∀i, M k=1 bj (k) = 1 ∀j, N πi = 1, and aij , bj (k), πi ∈ [0, 1], ∀i, j, k i=1 The theory underlying HMMs rely on the following assumptions: 31 • The first order Markov property (or limited horizon): The conditional probability distribution of current state depends only on previous state, P (qt = j | q1 , q2 , . . . , qt−1 ) = P (qt = j | qt−1 ) • Time-homogeneous (or time-invariant) assumption: The state transition probabilities are independent of the actual time at which the transitions occur, P (qt = j | qt−1 ) = aij , ∀t • Output independence assumption: The output generated at the current time step t depends solely on the current state qt (independent of previous outputs), P (ot | q1 , q2 , . . . , qt ) = P (ot | qt ) The success or failure of HMM application for pattern classification or recognition tasks rely on these fundamental assumptions. If the patterns considered can be described by stochastic processes without potentially violating these assumptions, then a well trained HMM may provide successful results. In fact, theoretical and empirical results have shown that, given an adequate number of states and a sufficiently rich set of data, HMMs are capable of representing probability distributions corresponding to complex real-world phenomena in terms of simple and compact models (Bengio, 1999; Bilmes, 2002). HMM has been successfully applied in various practical applications. It has become a predominant methodology for design of automatic speech recognition systems (Bahl et al., 1982; Huang and Hon, 2001; Rabiner, 1989). It has also been successfully applied to various other fields, such as communication and control (Elliott, 1994; Hovland and McCarragher, 1998), bioinformatics (Eddy, 1998; Krogh et al., 2001), computer vision (Brand and Kettnaker, 2000; Rittscher et al., 2000). The output observations of HMMs can be discrete or continuous, scalars or vectors. As further detailed in Chapter 2, an HMM is termed discrete if the output alphabet is finite, 32 and continuous if the output alphabet is not necessarily finite, e.g., each state is governed by a parametric density function. Although, a continuous signal may be quantized into discrete observation, some applications, such as automatic speech recognition, attempt to avoid the inherent quantization error by using continuous output HMMs (Huang and Hon, 2001). The transitions allowed between states and the number of states (or model order) determine the HMM topology. For instance, in ergodic (or fully connected) HMMs every state can be reached in a single step from any other state of the model. Strict left-right HMMs only allow self transition and transition from state Sn to Sn+1 . An HMM that restricts transition from only left to right (but allow skipping states), is called Bakis HMM. In many real-world applications, the best topology is determined empirically. For example, in HMM application to automatic speech recognition, left-right topologies, where the state(s) represents a phoneme in a word, have been shown successful (Huang and Hon, 2001; Rabiner, 1989). In general, when the states have no physical meaning ergodic HMMs are employed. In this thesis discrete-time finite-state HMMs are only considered, since they are representative of the intended application. In fact the process behavior can be described by finite set of hidden states, and there is only a finite alphabet of system calls. Furthermore, since the states have no physical representation, but only varied to best fit the considered process, ergodic HMMs are only considered. In general, HMMs can perform the following canonical functions (Rabiner, 1989): Evaluation aims to compute the likelihood of an observation sequence o1:T given a trained model λ, P (o1:T | λ). The likelihood is typically evaluated by using a fixedinterval smoothing algorithm such as the Forward-Backward (FB) (Rabiner, 1989) or the numerically more stable Forward-Filtering Backward-Smoothing (FFBS) (Ephraim and Merhav, 2002). See Appendix I for further details. 33 Decoding aims to find the most likely state sequence S that best explains the observation sequence o1:T , given a trained HMM. The target is therefore to find the most likely state sequence S that maximizes P (S | o1:T , λ). The best state sequence is commonly determined by the Viterbi algorithm (Forney, 2005; Viterbi, 1967). Learning aims to estimate the HMM parameters λ to best fit the observed data. Given a predefined topology and an observation sequence o1:T , the target is to find the model λ that maximizes P (o1:T | λ). Unfortunately, there is no known analytical solution to the training problem. In practice, HMM parameters estimation is frequently performed according to the maximum likelihood estimation (MLE) criterion. MLE consists of maximizing the log-likelihood, log P (o1:T | λ), of the training data over HMM parameters space. Unfortunately, since the log-likelihood depends on missing information (the latent states), there is no known analytical solution to the learning problem. In practice, iterative optimization techniques such as the Baum-Welch (BW) algorithm (Baum et al., 1970), a special case of the ExpectationMaximization (EM) (Dempster et al., 1977), or standard numerical optimization methods such as gradient descent (Baldi and Chauvin, 1994; Levinson et al., 1983) are often employed for this task. In either case, HMM parameters are estimated over several training iterations, until the likelihood function is maximized over data samples. Each training iteration typically involves observing all available training data to evaluate the log-likelihood value and estimate the state densities by using a fixed-interval smoothing algorithm. Further details on the evaluation and parameter estimation of HMMs are presented in Chapter 2 and Appendix I. Decoding is less relevant to the application, since the hidden states have no physical. In fact, decoding the most likely state sequences (or uncovering the hidden states) is most commonly applied for recognition tasks, such as in automatic speech recognition (Bahl et al., 1982; Huang and Hon, 2001; Rabiner, 1989). For anomaly detection applications, HMMs (detectors) are typically trained on normal data and then employed to evaluate the test data, as described in the next section. 34 1.3.1 HMM-based Anomaly Detection using System Calls In an effort to find the best technique for detecting anomalies in system call sequences generated by privileged processes, Warrender et al. (1999) have conducted a comparative benchmarking on UNM data sets (described in Section 1.4.1), using various techniques including sequence matching, data mining, and HMMs. The authors trained an ergodic HMM using BW algorithm on the normal system call sequences for each process in the UNM data sets. The number of HMM states (N ) was selected heuristically. It is set roughly equal to the process alphabet size (Σ) – the number of unique system call symbols used by the process. For instance, they selected N = 60 states for sendmail process since its alphabet size Σ = 53 symbols. Each HMM is then tested on the entire anomalous sequences, looking for unusual state transitions or symbol outputs according to a predefined threshold. Their experimental results have shown that HMM-based anomaly detectors produce the highest level of detection accuracy on average, compared to other techniques at the expenses of expensive training resource requirements. Indeed, the time complexity of BW algorithm per iteration scales linearly with the sequence length and quadratically with the number of states. In addition, its memory complexity scales linearly with both sequence length and number of states (see Section 2.2 and Appendix I, for details). Subsequent work addressed other variations of HMM training and testing techniques, as well as various alarm raising strategies, for detecting system calls anomaly using UNM data sets, in particular sendmail data sets. For instance, Wang et al. (2004) used BW algorithm to train ergodic HMMs with a fixed number of states (N = 53) on sendmail data, using different detector window sizes, DW , to assess the impact of DW on detection accuracy. The training sub-sequences are segmented from the original normal sequences using an overlapping sliding window of size DW . Anomalous sub-sequences are then labeled in comparison with normal the sub-sequences (of the same length) using STIDE. The accuracy is measured by counting the number of anomalies in a test sequence that are below a pre-specified threshold. Similarly, Du et al. (2004) trained HMM with fixed 35 number of states according to BW algorithm, and focused on the impact of DW on detection accuracy. However they introduced the notion of relative probability to raise alarms. This consists of computing the number of anomalous sub-sequences of length DW contained in a larger sliding window around the tested sub-sequence. An alarm is raised when the number of anomalies reaches an arbitrary threshold. Gao et al. (2002) trained an ergodic HMM with N = 60 states using BW algorithm on the normal sub-sequences and a subset of the anomalous sub-sequences. Remaining subsets of anomalous sub-sequences, which are not included in training, are used for testing. The author also focused on the impact of detector window size and that of neighboring effects on detection accuracy. Zhang et al. (2003) proposed a detection method based on a hierarchical HMM to overcome the training computational time and memory costs. First, an HMM with a large number of states is trained on the entire data set, then HMM transition sequences are used to train another HMM with a lower number of states. The authors also applied this strategy to create profiles for both normal and anomalous data sets. These two approaches are based on misuse detection rather than an anomaly detection though. Yeung and Ding (2003) compared ergodic and left-right HMM topologies using UNM data sets. STIDE was used for labeling training and testing sub-sequences. HMMs are trained according to BW algorithm on unique normal sub-sequences, using different values of sub-sequence length DW and number of states N . For each DW , they used the trained HMMs to compute the likelihood of all training sub-sequences and selected the minimal likelihood values as decision thresholds. Results have shown that ergodic HMMs outperform the left-right HMMs for a given number of states, and left-right HMMs have shown high sensitivity to the sequence length. Qiao et al. (2002) trained ergodic HMMs according to BW algorithm using sendmail data sets and STIDE for labeling anomalous sub-sequences. However, a threshold is set on state transition only to discriminate between normal and intrusion behavior. A different 36 number of states have been used, with a fixed detector window size. The author noted the costly time and memory complexity with a large HMM state values. Hoang and Hu (2004) proposed a combination technique at the detector level. It consists of training an HMM for each sub-sequence of observation symbols segmented from the entire sequence of observation, and then weight averaging the parameters of these HMMs to produce the combined model. The author claimed that this strategy may be used to incrementally combine ergodic HMMs. Although this technique may work for left-right HMMs, it is not suitable for ergodic HMMs as their states are permuted with each different training phase, and hence averaging parameters leads to knowledge corruption. This have been confirmed empirically in Section II.5 of Appendix II. Florez-Larrahondo et al. (2005) proposed an on-line algorithm based on BW for efficient learning of HMM parameters from long sequence of observation (described in Section 2.3.2.1). The algorithm allows to learn to reduce the time and memory complexity in comparison to the traditional BW training. (Chen and Chen, 2009) employed this on-line algorithm to train five ergodic HMMs with the same number of states as base detectors in AdaBoost algorithm (Freund and Schapire, 1996) using fixed-sized data set. AdaBoost requires a labeled training set comprising normal and anomalous examples that can be weighted for importance. Therefore, the authors set an arbitrary threshold on the log-likelihood output from HMMs trained on the normal system call behavior, below which normal system call sequences are considered as anomalous. These HMMs are then updated according to an on-line AdaBoost variant (Oza and Russell, 2001) from new data. Threshold setting is shown to have a significant impact on the detection performance of the EoHMMs (Chen and Chen, 2009). In fact, rare system call events are normal, if they are considered anomalous during the design phase, they will generate false alarms during the testing phase. These rare events may be suspicious if, during operation, they occur in bursts over a short period of time. 37 A recent survey on HMM-based techniques for system call anomaly detection is provided by Wang et al. (2010). 1.4 1.4.1 Data Sets University of New Mexico (UNM) Data Sets The UNM data sets13 are commonly used for benchmarking anomaly detections based on system calls sequences (Warrender et al., 1999). Normal system call sequences generated by privileged processes during (secured) normal operation were collected. These sequences are assumed attack-free and used for training the anomaly detectors. Different kinds of attacks such as Trojan and backdoor intrusion, buffer overflows, symbolic link, and decode attacks (Forrest et al., 1996; Hofmeyr et al., 1998; Kosoresow and Hofmeyer, 1997; Warrender et al., 1999) were launched against these processes, while collecting the system call sequences. These intrusive sequences comprise both normal and anomalous sub-sequences for testing, but specific labeling of these sub-sequences remains an issue. In related work, intrusive sequences are usually labeled in comparison with normal sequences, using the STIDE matching technique. This labeling process considers STIDE responses as the ground truth, and leads to a biased evaluation and comparison of techniques, which depends on both training data size and detector window size. To confirm the results on system calls data from real processes, the same labeling strategy is used in this work. However fewer sequences are used to train the HMMs to alleviate the bias. Therefore, for each DW , STIDE is first trained on all available normal data from UNM sendmail (about 1.8 million system calls), and then used to label the corresponding subsequences (AS = DW ) from the ten sequences available for testing (about 6, 755 system calls). The resulting labeled sub-sequences are concatenated, then divided into blocks of equal sizes, one for validation and the other for testing. During the experiments, smaller blocks of normal data are used for training the HMMs as normal system call observations are very redundant. In spite of labeling issues, redundant training data, and unrepresen13 http://www.cs.unm.edu/~immsec/systemcalls.htm 38 tative test data, UNM sendmail data set is the mostly used in literature due to limited publicly available system call data sets. 1.4.2 Synthetic Generator The need to overcome issues encountered when using real-world data for anomaly-based HIDS (incomplete data for training and labeling) has lead to the implementation of a synthetic data generation platform for proof-of-concept simulations. It is intended to provide normal data for training and labeled data (normal and anomalous) for testing. This is done by simulating different processes with various complexities then injecting anomalies in known locations. The data generator is based on the Conditional Relative Entropy (CRE) of a source; it is closely related to the work of Florez-Larrahondo et al. (2005); Maxion and Tan (2000); Tan and Maxion (2002, 2003, 2005); Tan et al. (2002). The CRE is defined as the conditional entropy divided by the maximum entropy (MaxEnt) of that source, which gives an irregularity index to the generated data. For two random variables x and y the CRE is given by CRE = − x p(x) | x) log p(y | x) M axEnt y p(y (1.1) where for an alphabet of size Σ symbols, M axEnt = −Σ log(1/Σ) is the entropy of a theoretical source in which all symbols are equiprobale. It normalizes the conditional entropy values between CRE = 0 (perfect regularity) and CRE = 1 (complete irregularity or random). In a sequence of system calls, the conditional probability, p(y | x), represents the probability of the next system call given the current one. It can be represented as the columns and rows (respectively) of a Markov Model with the transition matrix M M = {aij }, where aij = p(St+1 = j | St = i) is the transition probability from state i at time t to state j at time t + 1. Accordingly, for a specific alphabet size Σ and CRE value, a Markov model is first constructed by fixing some state transition values and varying other values, subject to j aij = 1, to obtain the desired CRE (see Figure 1.4). The resulting Markov matrix is then used as a generative model for normal data, as illustrated 39 ⎡ M M(Σ=8,CRE=0.0) ⎢ ⎢ ⎢ ⎢ ⎢ =⎢ ⎢ ⎢ ⎢ ⎢ ⎣ ⎡ M M(Σ=8,CRE=0.3) ⎢ ⎢ ⎢ ⎢ ⎢ =⎢ ⎢ ⎢ ⎢ ⎢ ⎣ ⎡ M M(Σ=8,CRE=1.0) ⎢ ⎢ ⎢ ⎢ ⎢ =⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0003 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0380 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.2365 0.2630 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.2529 0.2903 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.1364 0.1865 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.2080 0.1784 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.1450 0.0435 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0209 0.0000 0.0000 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 0.1250 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ Example of a sequence of length T = 50 observation symbols generated from M M(Σ=8,CRE=0.0) : 81234567812345678123456781234567812345678123456781 Example of a sequence of length T = 50 observation symbols generated from M M(Σ=8,CRE=0.3) : 76565645667234563456673456345634563456345663456456 Example of a sequence of length T = 50 observation symbols generated from M M(Σ=8,CRE=1.0) : 871636176633746263456828426338613216132184422764512 Figure 1.4 Examples of Markov models with alphabet Σ = 8 symbols and various CRE values, each used to generate a sequence of length T = 50 observation symbols in Figure 1.5. Sampling starts with a uniform initial state probability distribution. At each discrete time step, the process transits from state Si to Sj according to the state probability distributions (aij ), and outputs the value of new state (Sj ), until the desired sequence length is reached. The same Markov model, M M , is also used for labeling injected anomalies as described below. Let an anomalous event be defined as a surprising event which does not belong to the process normal pattern. This type of event may be a foreign-symbol anomaly 40 • Generate a normal sequence of observation symbols (sn ) of length Tn for training using a Markov model specified by an alphabet size (Σ) and CRE value. M M(Σ=8,CRE=0.3) sn : 7 6 5 6 5 6 4 5 6 6 7 2 3 4 5 6 3 4 5 6 6 · · · • Segment sn into sub-sequences (si ) according to the detector window size (DW ) DW = 4 DW = 6 7 6 5 6 5 6 4 5 ··· 7 6 5 6 5 6 4 5 ··· s1 : 7 6 5 6 s2 : 6 5 6 5 s3 : 5 6 5 6 .. . s1 : 7 6 5 6 5 6 s2 : 6 5 6 5 6 4 s3 : 5 6 5 6 4 5 .. . (a) Generation of normal data for training • Generate a random sequence of observation symbols (sr ) of length Tr using a Markov model with the same alphabet size Σ, and a CRE = 1 (uniform transition distribution) M M(Σ=8,CRE=1.0) Segment sr into sub-sequences (si ) according to the anomaly size (AS) 8 7 1 6 3 6 1 7 ··· AS = 4 Sub-Sequences s1 : 6 3 6 1 s2 : 3 6 1 7 s3 : 6 1 7 6 s4 : 1 7 6 6 s5 : 7 6 6 3 s6 : 6 6 3 3 .. .. . . sr : 6 3 6 1 7 6 6 3 3 7 4 6 2 6 3 4 5 6 8 2 8 · · · Compute the log-likelihood (LL) of each sub-sequence (si ), given the same Markov generator used for normal data generation (Σ = 8 and CRE = 0.3) LL = log P (si | M M(Σ=8,CRE=0.3) ) Compare LL to the threshold of normal subsequences of length AS: T = log( Σ1 (min(aij ))AS−1 ) Yes LL > T No 1 (anomaly) 0 (normal) T = log( 18 × 0.00033 ) = −26.4 LL -7.2e+02 -1.4e+03 -7.2e+02 -7.1e+02 -6.8e+00 -7.1e+02 .. . Label 1 1 1 1 0 1 .. . (b) Generation of anomalous data for validation and testing Figure 1.5 Synthetic normal and anomalous data generation according to a Markov model with CRE = 0.3 and alphabet of size Σ = 8 symbols 41 sequence that contains symbols not included in the process normal alphabet, a foreign n-gram anomaly sequence that contains n-grams not present in the process normal data, or a rare n-gram anomaly sequence that contains n-grams that are infrequent in the process normal data and occurs in burst during the test14 . Generating training data consists of constructing Markov transition matrices for an alphabet of size Σ symbols with the desired irregularity index (CRE) for the normal sequences (see Figure 1.4). As illustrated in Figure 1.5a, the normal data sequence with the desired length is then produced with the Markov model, and segmented using a sliding window (shift one) of a fixed size DW . To produce the anomalous data, a random sequence is generated using the same alphabet size Σ and a CRE = 1, which covers the entire normal and anomalous space. This random sequence is then segmented into sub-sequences of a desired length using a sliding window with a fixed size of AS (see Figure 1.5b). Then, the original generative Markov model is used to compute the likelihood of each sub-sequence. If the likelihood is lower than a threshold it is labeled as anomaly. The threshold is set to T = Σ1 (min(aij ))AS−1 , ∀i,j , the minimal positive value in the Markov transition matrix to the power (AS − 1), which is the number of symbol transitions in the sequence of size AS. This ensures that the anomalous sequences of size AS are not associated with the process normal behavior, and hence foreign n-gram anomalies are collected. The trivial case of foreign-symbol anomaly is disregarded since it is easy to be detected. Rare n-gram anomalies are not considered since we seek to investigate the performance at the detection level, and such kind of anomalies are accounted for at a higher level by computing the frequency of rare events over a local region. Finally, to create the testing data another normal sequence is generated, segmented and labeled as normal. The collected anomalies of the same length are then injected into this sequence at random according to a mixing ratio. 14 This is in contrast with other work which consider rare event as anomalies. Rare events are normal, however they may be suspicious if they occur in high frequency over a short period of time. 42 1.5 Evaluation of Intrusion Detection Systems The evaluation of intrusion detection systems is an open research topic, which face several challenges, including the lack of representative data and unified methodologies, and the employment of inadequate metrics for evaluation. (Abouzakhar and Manson, 2004; Gadelrab, 2008; Tucker et al., 2007). This thesis evaluates the proposed techniques for system call anomaly detection based on their efficiency, adaptability, and accuracy. The efficiency considers the costs involved during the design and operation phase of an anomaly detection system. It considers the time and memory complexity required for designing an ADS, including detectors training, validation, selection or combination, as well as the space requirement for storing training data (e.g., batch technique) and for storing selected or combined models. Impact on efficiency of training set sizes with various alphabet sizes and complexities of monitored processes is also assessed. During operation, the efficiency considers the time and memory complexity required to operate the ADS, with one or multiple detectors, evaluate the likelihood of the input sub-sequences of observations, and make a decision. A fully adaptive ADS must have mechanisms to detect legitimate changes in normal behaviors, collect data that reflect the changes, ensure the relevant data contain no pattern of attacks, and update its internal detectors and decision thresholds to adapt for the changes. The scope of this thesis is limited to adaptation at the detector and decision level. The remaining tasks are still under the administrator’s scope of responsibility. Therefore, the adaptability of an ADS is evaluated for its effectiveness in maintaining or improving the overall system accuracy in response to new data. The accuracy of an ADS is determined based on the receiver operating characteristic analysis, and should not be confused with the accuracy measure (or inversely error rate), as described next. 43 1.5.1 Receiver Operating Characteristic (ROC) Analysis This subsection provides relevant background details on ROC analysis, since it is heavily used in this research for evaluation and combination of detectors. Given a detector and a test sample, there are four possible outcomes described by means of a confusion matrix, as illustrated in Figure 1.6. When a positive test sample (p) is presented to the detector and predicted as positive (p̂) then it is counted as a true positive (T P ); if it is however predicted as negative (n̂) then it is counted as a false negative (F N ). On the other hand, a negative test sample (n) that is predicted as negative (n̂) is a true negative (T N ), while it is a false positive (F P ) if predicted as positive (p̂). The true positive rate (tpr) is therefore the proportion of positives correctly classified (as positives) over the total number of positive samples in the test. The false positive rate (f pr) is the proportion of negatives incorrectly classified (as positives) over the total number of negative samples in the test. Similarly, the true negative rate (tnr) and false negative rate (f nr) can be defined over the negative class however. A crisp detector outputs only a class label (Y = 0 or Y = 1), while a soft detector assigns scores or probabilities to the test samples (x), by the means of a scoring function f : x → R. Typically, the higher the score value, f (x), the more likely the prediction of the positive event (Y = 1). A soft detector can be converted to a crisp one by setting a threshold T on the score values. A sample x is classified as positive (Y = 1) if it is assigned a score that is greater than or equal to T and negative (Y = 0) otherwise (see Figure 1.6). A ROC curve is a two-dimensional curve in which the tpr is plotted against the f pr. An empirical ROC curve is typically obtained by connecting the observed (tpr, f pr) pairs of a detector for each decision threshold T. ROC curves can be efficiently generated by sorting the output scores of a detector, from the most likely to the least likely positive value, and considering unique values as decision thresholds (Fawcett, 2006). A crisp Predicted Class 44 p̂ n̂ TP P = T PT+F P os N FP FP f pr = N eg = F P +T N N tnr = NT N = F PT+T = 1 − f pr eg N N N f nr = PF os = T PF+F = 1 − tpr N P +T N = 1 − err acc = PTos+N eg tpr = True Class p n True Positive False Positive (T P ) (F P ) Good: correct detection Bad: Type I error False Negative (F N ) True Negative (T N ) Bad: Type II error Good: correct rejection P os = T P + F N (Total number of actual positives) N eg = F P + T N (Total number of actual negatives) AU C = ri −P os(P os+1)/2 P os×N eg ri is the rank of the ith positive example Figure 1.6 Confusion matrix and ROC common measures Negative class (normal) Positive class (anomalous) tnr tpr f nr f pr x Soft detector f (x) ≥T score x Crisp detector Perfect discrimination (ROC heaven) 1 Crisp detector (data point) 0.8 0.6 Soft detector (empirical ROC) 0.4 p̂, (Y = 1) n̂, (Y = 0) p̂, (Y = 1) n̂, (Y = 0) Always predicts positive tnr More aggressive f nr Predicted positive (p̂) tpr Decision threshold (T) Predicted negative (n̂) er sifi s a cl m do an R AUC Non-proper classifiers (Negative responses may 0.2 be negated) More conservative 0 0.2 0.4 0.6 0.8 1 Always predicts f pr negative Figure 1.7 Illustration of the fixed tpr and f pr values produced by a crisp detector and the ROC curve generated by a soft detector (for various decision thresholds T). Important regions in the ROC space are annotated detector however produces a single point (or data point) defined by its (f pr, tpr) in the ROC space (see Figure 1.7). Given two operating points, say i and j, in the ROC space, i is defined as superior to j if f pri ≤ f prj and tpri ≥ tprj . If one ROC curve I has all its points superior to those of another curve J, then I dominates J. This means that the detector of I always has a lower expected cost than that of J , over all possible class distributions and error costs . 45 If a ROC curve has tpri > f pri for all its points i then, it is a proper ROC curve. ROC curves below the major diagonal can always be transformed into proper curves by using a linear transformation or changing the meaning of positive and negative. ROC curves allows to visualize the performance of detectors and select optimal operational points, without committing to a single decision threshold. It presents detectors performance across the entire range of class distribution and error costs. For equal prior probability and cost of errors, the optimal decision threshold (minimizing overall errors and costs) corresponds to the vertex that is closest to the upper-left corner of the ROC plane (also called ROC heaven). In many practical cases, where prior knowledge of skewed class distributions or misclassification costs are available from the application domain, they should be considered while analyzing the ROC curve and selecting operational thresholds, using iso-performance lines (Fawcett, 2006; Provost and Fawcett, 2001). For imprecise class distribution, the area under the ROC curve (AUC) has been proposed as more robust (global) measure for evaluation and selection of classifiers than accuracy (acc) – or conversely error rate (err), (Bradley, 1997; Huang and Ling, 2005; Provost et al., 1998). In fact, the acc depends on the decision threshold, and assume fixed distributions for both positive and negative classes. The AUC however is the average of the tpr over all values of the f pr (independently of decision thresholds and prior class distributions). It can be also interpreted in terms of class separation, as the fraction of positive–negative pairs that are ranked correctly, (see Figure 1.7). The AUC evaluates how well a classifier is able to sort its predictions according to the confidence it assigns to them. For instance, with an AU C = 1 all positives are ranked higher than negatives indicating a perfect discrimination between classes. A random classifier has an AU C = 0.5 that is both classes are ranked at random. For a crisp classifier, the AUC is simply the area under the trapezoid and is calculated as the average of the tpr and f pr values. For a soft classifier, the AUC may be estimated directly from the data either by summing up the areas of the underlying trapezoids (Fawcett, 2004) or by means of the Wilcoxon–Mann–Whitney (WMW) statistic (Hanley and McNeil, 1982). 46 In some cases ROC curves may cross, and hence detectors providing higher overall AUC values may perform worse than those providing lower AUC values, in a specific region of ROC space . In such cases, the partial area under the ROC curve (pAUC) (Walter, 2005), for instance the area under between f pr = 0 and f pr = 0.1 (AUC0.1 ), could be useful for comparing the specific regions of interest (Zhang et al., 2002). If the AUC (or the pAUC) values are not significantly different, the shape of the curves might need to be looked at. It may also be useful to look at the tpr for a fixed f pr of particular interest. However, any attempt to summarize a ROC curve into a single number leads to information loss, such as errors and costs trade-offs, which reduces the system adaptability. 1.6 1.6.1 Anomaly Detection Challenges Representative Data Assumption Most work found in related literature (described in Section 1.2.3 and 1.3.1) considers static environments and assumes being provided with a sufficiently representative amount of clean (attack-free) system call data for training the anomaly detectors. Even under these ideal assumptions, anomaly detectors still face the following challenges. Designing an HMM for anomaly detection involves estimating HMM parameters and the number of hidden states, N , from the training data. The value of N has a considerable impact not only on system accuracy but also on HMM training time and memory requirements. In the literature on HMMs applied to anomaly detection (Du et al., 2004; Gao et al., 2002; Hoang and Hu, 2004; Hu, 2010; Qiao et al., 2002; Warrender et al., 1999; Zhang et al., 2003), the number of states is often chosen heuristically or empirically using validation data. In fact, HMMs trained with different number of states are able to capture different underlying structures of data. Therefore, a single best HMM will not provide a high level of performance over the entire detection space (see Appendix V). An open issue, for all window based anomaly detection, is the selection of the operating detector window size DW – the size of the sliding window used for testing. A small 47 DW value is always desirable since it allows faster detection and response, incurs less processing overhead , as described in Appendix V. However, choosing the smallest DW that has good discriminative abilities, depends on the anomaly size, AS, in the testing set. The anomaly size is not known a priori, may vary considerably in real world application, and could even be controlled by the attacker. In general, ADSs are designed to provide accurate results for a particular window size and anomaly size. Discriminative or sequence matching techniques, are only able to detect anomalies with sizes equal to that of the detector window (AS = DW ). In addition, they provide blind regions in the detection space (Maxion and Tan, 2000; Tan and Maxion, 2002, 2003). These techniques will miss an anomalous sequence that are larger that the detector window size (DW < AS) if all of its sub-sequences (of size equal to that of DW ) are normal. The detector window will slide on these normal sub-sequences, without being able to discover that the whole sequence is anomalous (Maxion and Tan, 2000; Tan and Maxion, 2002, 2003). As a generative model, HMM is less sensitive to the detector window size and have no blind regions since it computes the likelihood of each observation sequence. Although not investigated in depth in this thesis, Chapter 3 and Appendix V, show that HMM detection ability increases with the detector window size, since the likelihood of anomalous sequences becomes smaller at a faster rate than normal ones, and hence easier to detect. Once designed and trained, an HMM provides a compact model that does not increase with the size of training data or detector window. In contrast, the dimensionality of the training sub-sequences space increases exponentially with DW , which makes discriminative techniques prone to the curse of dimensionality. In addition, matching techniques that are based on search procedure such as look-up tables in STIDE must compare inputs to all normal training sub-sequences. The number of comparisons also increases exponentially with the detector window size DW , while for HMM evaluation the time complexity grows linearly with DW . To eliminate the dependence on DW , HMM is typically trained on the entire sequence of system call observations without segmentation (Lane, 2000; Warrender et al., 1999). 48 However, when learning from long sequences of observations the FB algorithm, employed within standard BW (Baum et al., 1970) or Gradient-based algorithms (Baldi and Chauvin, 1994; Levinson et al., 1983) for estimation of HMM parameters, may become prohibitively costly in terms of time and memory complexity (Lane, 2000; Warrender et al., 1999). This has been also noted in other fields, such as bioinformatics (Krogh et al., 1994; Meyer and Durbin, 2004) and robot navigation systems (Koenig and Simmons, 1996). For a sequence of length T and an ergodic HMM with N states, the memory complexity of the FB algorithm grows linearly with T and N , O(N T ), and its time complexity grows quadratically with N and linearly with T , O(N 2 T ). The efficient forward filtering backward smoothing (EFFBS) algorithm proposed in Appendix I, provides an alternative to the FB algorithm, which a reduced memory complexity, O(N ), i.e., independent of the sequence length T . Another issue with related anomaly detection techniques resides in their evaluation methodology, in particular, the accuracy measure and arbitrary selection of decision thresholds. Most techniques consider the number of anomalous sub-sequences found in the test sequence as accuracy measure. A sub-sequence is considered as anomalous if its likelihood value assigned by the HMM is below a given threshold. The thresholds are typically selected according to the minimum likelihood value assigned by a trained HMM to the normal sub-sequences. Consequently, this evaluation methodology accounts only for one type of error, as it does not consider false negative errors. Furthermore, setting the decision threshold at the tail of the normal distribution does not provide the optimal (Bayes) decision, unless in trivial situations where the two distributions (normal and anomalous) are linearly separable (see Figure 1.7). 1.6.2 Unrepresentative Data In practice however, unrepresentative data is typically provided for training. Acquiring a sufficient amount of clean data that represents the normal behavior of complex realworld processes is a challenging task. Although abundant data can be collected from a 49 Rare Events (false alarms) Normal Behavior Modeled Behavior Suspicious Behavior Figure 1.8 Illustration of normal process behavior when modeled with unrepresentative training data. Normal rare events will be considered as anomalous by the ADS, and hence trigger false alarms constrained and overly secured environment,such as UNM data sets, this data still provide an incomplete view of the normal behavior of complex processes. This is due to the trade-off between process functionality and the restrictions required to achieve a high level of security, and hence “ascertain” that the collected data contains no attack patterns. More representative data can be collected from an open environment, in which a process is allowed to fully preform its intended functionality. However this data requires a time-consuming investigation by human specialists for the presence of attacks. Therefore, a limited amount of clean data is typically analyzed over time, and then provided for training the anomaly detector. The anomaly detector will also have an incomplete view of the normal process behavior. Incomplete view of the normal behavior leads to misclassifying rare normal events as anomalous, as illustrated in Figure 1.8. Since HIDSs are deployed on each host, it is more practical to start from a generic model with limited view of normal behavior then update it, than to restart the whole training procedure on each host. For example, the same process of interest may have different settings and functionality on different hosts machines. It may be very costly to collect and analyze training data from the same process on every host and train a new model for each host. A less costly solution consists of training a generic HMM using normal data corresponding to common settings, an then incrementally update each model with 50 Old Normal False negatives New Normal Behavior False alarms Behavior Modeled Behavior Suspicious Behavior Figure 1.9 Illustration of changes in normal process behavior, due for instance to application update or changes in user behavior, and the resulting regions causing both false positive and negative errors the required normal data to better fit the process normal behavior and hence reduce generation of false alarms. Furthermore, the underlying data distribution of normal behavior may vary according to changes in the monitored environment after the ADS has been deployed for operations. For instance, an update to the code of the monitored process ( e.g., to add newer functionality, fix some security holes, etc.) could lead the underlying distribution of normal behavior to drift. Normal behavior may also drift due to a change in the way users interact with the process. Changes in user behavior may occur when they are assigned new tasks or acquire new skills. Changes to the underlying normal distribution lead the model to deviate, and generate both false positive and negative error as illustrated in Figure 1.9. Accordingly, the model must be updated from new data to account from this change. As a part of an ADS, the system administrator plays a crucial role in providing new data for training and validation of the detectors. When an alarm is raised, the suspicious system call sub-sequences are logged and analyzed for evidence of an attack. If an intrusion attempt is confirmed, the corresponding anomalous system calls should be provided to update the validation set. A response team should react to limit the damage, and a forensic analysis team should investigate the cause of the successful attack. Otherwise, 51 the intrusion attempt is considered as a false alarm and these rare sub-sequences are tagged as normal and employed to update the HMM detectors. Therefore, an important feature of an ADS is the ability to accommodate newly-acquired data incrementally, after it has originally been trained and deployed for operations. Newly-acquired data allow to account for the incomplete view of normal process behavior and for possible changes that may occur over time. However, one major challenge is the efficient integration of the newly-acquired data into the ADS without corrupting the existing knowledge structure, and thereby degrading the performance. 52 CHAPTER 2 A SURVEY OF TECHNIQUES FOR INCREMENTAL LEARNING OF HMM PARAMETERS∗ This chapter presents a survey of techniques found in literature that are suitable for incremental learning of HMM parameters. These techniques are classified according to the objective function, optimization technique, and target application, involving blockwise and symbol-wise learning of parameters. Convergence properties of these techniques are presented, along with an analysis of time and memory complexity. In addition, the challenges faced when these techniques are applied to incremental learning is assessed for scenarios in which the new training data is limited and abundant. While the convergence rate and resource requirements are critical factors when incremental learning is performed through one pass over abundant stream of data, effective stopping criteria and management of validation set are important when learning is performed through several iterations over limited data. In both cases managing the learning rate to integrate preexisting knowledge and new data is crucial for maintaining a high level of performance. Finally, this chapter underscores the need for empirical benchmarking studies among techniques presented in literature, and proposes several evaluation criteria based on nonparametric statistical testing to facilitate the selection of techniques given a particular application domain. ∗ THIS CHAPTER IS ACCEPTED, UNDER REVISION, IN INFORMATION SCIENCES JOURNAL, ON MAY 30, 2011, SUBMISSION NUMBER: INS-D-10-141 54 2.1 Introduction The Hidden Markov Model (HMM) is a stochastic model for sequential data. It is a stochastic process determined by the two interrelated mechanisms – a latent Markov chain having a finite number of states, and a set of observation probability distributions, each one associated with a state. At each discrete time instant, the process is assumed to be in a state, and an observation is generated by the probability distribution corresponding to the current state. The HMM is termed discrete if the output alphabet is finite, and continuous if the output alphabet is not necessarily finite, e.g., each state is governed by a parametric density function (Elliott, 1994; Ephraim and Merhav, 2002; Rabiner, 1989). Theoretical and empirical results have shown that, given an adequate number of states and a sufficiently rich set of data, HMMs are capable of representing probability distributions corresponding to complex real-world phenomena in terms of simple and compact models (Bengio, 1999; Bilmes, 2002). This is supported by the success of HMMs in various practical applications, where it has become a predominant methodology for design of automatic speech recognition systems (Bahl et al., 1982; Huang and Hon, 2001; Rabiner, 1989). It has also been successfully applied to various other fields, such as communication and control (Elliott, 1994; Hovland and McCarragher, 1998), bioinformatics (Eddy, 1998; Krogh et al., 2001), computer vision (Brand and Kettnaker, 2000; Rittscher et al., 2000), and computer and network security (Cho and Han, 2003; Lane and Brodley, 2003; Warrender et al., 1999). For instance, in the area of computer and network security, a growing number of HMM applications are found in intrusion detection systems (IDSs). HMMs have been applied either to anomaly detection, to model normal patterns of behavior, or in misuse detection, to model a predefined set of attacks. HMM applications in anomaly and misuse detection have emerged in both main categories of IDS - host-based IDS (Cho and Han, 2003; Lane and Brodley, 2003; Warrender et al., 1999; Yeung and Ding, 2003) and network-based IDS (Gao et al., 2003; Tosun, 2005). Moreover, HMMs have recently begun to emerge in wireless IDS applications (Cardenas et al., 2003; Konorski, 2005). 55 In many practical applications, the collection and analysis of training data is expensive and time consuming. As a consequence, data for training an HMM is often limited in practice, and may over time no longer be representative of the underlying data distribution. However, the performance of a generative model like the HMM depends heavily on the availability of an adequate amount of representative training data to estimate its parameters, and in some cases its topology. In static environments, where the underlying data distribution remains fixed, designing a HMM with a limited number of training observations may significantly degrade performance. This is also the case when new information emerges in dynamically-changing environments, where underlying data distribution varies or drifts in time. A HMM that is trained using data sampled from the environment will therefore incorporate some uncertainty with respect to the underlying data distribution (Domingos, 2000). It is common to acquire additional training data from the environment at some point in time after a pattern classification system has originally been trained and deployed for operations. Since limited training data is typically employed in practice, and underlying data distribution are susceptible to change, a system based on HMMs should allow for adaptation in response to new training data from the operational environment or other sources (see Figure 2.1). The ability to efficiently adapt HMM parameters in response to newly-acquired training data, through incremental learning, is therefore an undisputed asset for sustaining a high level of performance. Indeed, refining a HMM to novelty encountered in the environment may reduce its uncertainty with respect to the underlying data distribution. However, incremental learning raises several issues. For one, HMM parameters should be updated from new data without requiring access to the previouslylearned training data. In addition, parameters should be updated without corrupting previously-acquired knowledge (Polikar et al., 2001). Standard techniques for estimating HMM parameters involve batch learning, based either on specialized Expectation-Maximization (EM) techniques (Dempster et al., 1977), such as the Baum-Welch algorithm (Baum et al., 1970), or on numerical optimization 56 h0 Classifier h1 D1 Classifier hn D2 Classifier hn+1 Dn+1 Figure 2.1 A generic incremental learning scenario where blocks of data are used to update the classifier in an incremental fashion over a period of time t. Let D1 , . . . , Dn+1 be the blocks of training data available to the classifier at discrete instants in time t1 , . . . , tn+1 . The classifier starts with initial hypothesis h0 which constitutes the prior knowledge of the domain. Thus, h0 gets updated to h1 on the basis of D1 , and h1 gets updated to h2 on the basis of data D2 , and so forth (Caragea et al., 2001) techniques, such as the Gradient Descent algorithm (Levinson et al., 1983). In either case, HMM parameters are estimated over several training iterations, until some objective function, e.g., maximum likelihood over some independent validation data, is maximized. For a batch learning technique, a finite-length sequence O = o1 , o2 , . . . , oT of T training observations oi is assumed to be available throughout the training process. Assuming that O is assembled into a block D of training data1 , each training iteration typically involves observing all subsequences in D prior to updating HMM parameters. Given a new block D2 of training data, a HMM that has previously been trained on D1 through batch learning cannot accommodate D2 without accumulating and storing all training data in memory, and training from the start using all of the cumulative data, D2 D1 . Otherwise, the previously-acquired knowledge may be corrupted, thereby compromising HMM performance. As illustrated in Figure 2.2, the HMM probabilities to be optimized may become trapped in local optima of the new cost function associated with D2 D1 . In fact, probabilities estimated after training on D1 may not constitute a good starting point for training on D2 . Updating HMM parameters on all data using some batch learning technique may therefore incur a significant cost in terms of processing time and storage requirements. The time and memory complexity of standard techniques 1A block of training data is defined as a sequence of training observations that has been segmented into overlapping or non-overlapping subsequences according to a user-defined window size. Cost function 57 D1 D1 D2 (2) (3) (4) (1) Optimization Space of one Model Parameter Figure 2.2 An illustration of the degeneration that may occur with batch learning of a new block of data. Suppose that the dotted curve represents the cost function associated with a system trained on block D1 , and that the plain curve represents the cost function associated with a system trained on the cumulative data D1 D2 . Point (1) represents the optimum solution of batch learning performed on D1 , while point (4) is the optimum solution for batch learning performed on D1 D2 . If point (1) is used as a starting point for incremental training on D2 (point (2)), then it will become trapped in the local optimum at point (3) grows linearly with the length, T , and number of training sequences R, and quadratically with the number of HMM states, N . As an alternative, several on-line learning techniques proposed in literature may be applied for incremental learning. These include techniques based on EM, numerical optimization and recursive estimation, and assume the observation of stream of data. Some of these techniques are designed to update HMM parameters at a symbol level (symbolwise), while others update parameters at a sequence level (block-wise). Techniques for on-line symbol-wise learning, also referred to as as recursive or sequential estimation techniques, are designed for situations in which training symbols are received one at a time, and HMM parameters are re-estimated upon observing each new symbol. Techniques for on-line block-wise learning are designed for situations in which training symbols are orga- 58 nized into a block of one or more sub-sequences, and HMM parameters are re-estimated upon observing each new sub-sequence of symbols. In either case, HMM parameters are updated from new training data, without requiring access to the previously-learned training data, and potentially without corrupting previously acquired knowledge. The main advantages of applying these techniques to incremental learning are the ability to sustain a high level of performance, yet decrease the memory requirements, since there is no need for storing the data from previous training phases. Furthermore, since training is performed only on the new training sequences, and not on all accumulated data, on-line learning would also lower time complexity needed to learn new data. Finally, incremental learning may provide a powerful tool in a human-centric approach, where domain experts may be called upon to gradually design and update HMMs as the operational environment unfolds. This chapter contains a survey of techniques that apply to incremental learning of HMM parameters2 . These techniques are classified according to objective function, optimization technique, and target application that involve block-wise and symbol-wise learning of parameters. An analysis of their convergence properties and of their time and memory complexity is presented, and the applicability of these techniques is assessed for incremental learning scenarios in which new data is either abundant or limited. Finally, the advantages and shortcomings of these techniques are outlined, providing the key issues and guidelines for their application in different learning scenarios. This chapter is structured according to five sections. The next section briefly reviews the batch learning techniques employed to estimate HMM parameters, and introduces the formalism needed to support subsequent sections. Section 2.3 provides a taxonomy of on-line learning techniques form literature that apply for incremental learning of HMM parameters. An analysis of their convergence properties and resource requirements is 2A predefined topology and permissible transitions between the states (e.g., ergodic or temporal) is assumed when learning HMM parameters. In many real-world applications, the best topology is determined empirically. Although adapting HMM topologies, or jointly HMM parameters and topologies, to new data may have a significant impact on performance, this issue is beyond the scope of the thesis. 59 Continuous output bj (k) = P (ot = vk | qt = j) bj (k) = f (ot = v | qt = j) a22 0.5 5 v 0 a12 a21 S1 a11 aij S2 a23 a13 k 0.5 0.0 Observations M b2 (k) 1.0 b2 (v) 1.0 0.0 −5 Discrete output o1 , o2 , . . . , oT 2 1 2 3 4 vk 1 bj (k) a32 Hidden states N j S3 a33 a31 = P (qt+1 = j | qt = i) i aij q1 , q 2 , . . . , q T 2 1 t 1 2 T Figure 2.3 An illustration of an ergodic three states HMM with either continuous or discrete output observations (left). A discrete HMM with N states and M symbols transits between the hidden states qt , and generates the observations ot (right) provided in Section 2.4. Then, an analysis of their potential applicability in different incremental learning scenarios is presented in Section 2.5. This chapter concludes with a discussion of the main challenges to be addressed for incremental learning of HMM parameters. 2.2 Batch Learning of HMM Parameters A discrete-time finite-state HMM consists of N hidden states in the finite-state space S = {S1 , ..., SN } of the Markov process. Starting from an initial state Si , determined by the initial state probability distribution πi , at each discrete-time instant, the process transits from state Si to state Sj according to the transition probability distribution aij . The process then emits a symbol v according to the output probability distribution bj (v) of the current state Sj (see Figure 2.3). With a discrete HMM, the output bj (v) is finite, and with a continuous HMM, the output alphabet is governed by a parametric density function. Let qt ∈ S denotes the state of the process at time t, where qt = i indicates that the state is Si at time t. An observation sequence of length T is denoted by O = o1 , . . . , oT , where 60 ot is the observation at time t. The sub-sequence om , om+1 , . . . , on , n > m, by the concise notation om:n . The HMM is then usually parametrized by λ = (π, A, B), where • π = {πi } denotes the vector of initial state probability distribution, πi P (q1 = i)1≤i≤N • A = {aij } denotes the state transition probability distribution, aij P (qt+1 = j | qt = i)1≤i,j≤N • B = {bj (k)} denotes the state output probability distribution, Aligned symbols with upper level text . – bj (k) P (ot = vk | qt = j)1≤j≤N, 1≤k≤M for a finite and discrete alphabet V = {v1 , v2 , . . . , vM }, vk ∈ RL with M distinct observable symbols. – bj (v) f (ot = v | qt = j)1≤j≤N for an infinite and continuous alphabet V = {v | v ∈ RL }. For instance, if the observation density for each state in the HMM is described by a univariate Gaussian distribution3 bj (ot ) ∼ N (μj , σj ), for a scalar observation ot , with a mean μ and a variance σ 2 then the state output density is given by: (ot − μj )2 bj (ot | μj , σj ) = exp − 2σj2 2πσj 1 (2.1) For a finite-discrete HMM, both A, B and π (the transpose of vector π) are row stochastic, which impose the following constraints: N j=1 aij = 1∀i, M bj (k) = 1 ∀j, and k=1 aij , bj (k), πi ∈ [0, 1], ∀ijk 3 It N πi = 1 (2.2) i=1 (2.3) is usually termed as finite-state Markov chains in white Gaussian noise in control and communication community. It is also referred to as normal HMM. 61 For a continuous HMM, other constraints arise depending on the probability density function (pdf) of the states. For example, in the Gaussian distribution case, (2.1), one must ensure that the standard deviation is always positive, σj > 0, ∀j. Given a HMM initialized according to the constraints described so far, there are three tasks of interest – the evaluation, decoding, and training tasks (Rabiner, 1989). The rest of this section focuses on batch learning techniques applied to address the third task, and in particular estimating HMM parameters, along with the definitions needed for future sections. For further details regarding the HMM, the reader is referred to the extensive literature (Elliott, 1994; Ephraim and Merhav, 2002; Rabiner, 1989). Standard techniques for estimating HMM parameters λ = (π, A, B) involve batch learning are based either on expectation-maximization or numerical optimization techniques. With batch learning, a finite-length sequence O = o1 , o2 , ..., oT of T training observations oi is assumed to be available throughout the training process. The parameters are estimated over several training iterations, until some objective function is maximized. Each training iteration involves observing all the observation symbols prior to updating HMM parameters. 2.2.1 Objective Functions: The estimation of HMM parameters is frequently performed according to the Maximum Likelihood Estimation (MLE) criterion. Other criteria such as the maximum mutual information (MMI), and minimum discrimination information (MDI) also be used for estimating HMM parameters (Ephraim and Rabiner, 1990). However, the widespread use of the MLE for HMM is a result of its attractive statistical properties – consistency and asymptotic normality – proven under quite general conditions (Bickel et al., 1998; Leroux, 1992). MLE consists in maximizing the likelihood or equivalently the log-likelihood of 62 the training data with regard to the model parameters: T (λ) log P (o1:T | λ) = log P (o1:T , S | λ) = S log P (o1:T | S, λ)P (S | λ) (2.4) S over the model parameter space Λ: λ∗ = arg max T (λ) (2.5) λ∈Λ There is no known analytical solution to HMM parameters estimation since the loglikelihood function (2.4) depends on the unknown probability values of the latent states, S. In practice, iterative optimization procedures such as the expectation-maximization or the standard numerical optimizations techniques, which are described in the following sub-sections, are usually employed. In order to proceed iteratively any numerical optimization procedure must also evaluate the log-likelihood function at any value. However, a direct evaluation of the log-likelihood function (2.4) requires a summation over all hidden state paths q1:T ∈ S, which has a prohibitively costly time complexity O(T N T ). Fortunately, there exists efficient recursive procedures for evaluating the log-likelihood values as well as estimating the conditional state (2.6) and joint state (2.7) densities associated with a given observation sequence o1:t : γτ |t (i) P (qτ = i | o1:t , λ) (2.6) ξτ |t (i, j) P (qτ = i, qτ +1 = j | o1:t , λ) (2.7) in order to provide an optimal estimate, in the minimum mean square error (MMSE) sense, for the unknown state q̂τ |t (i) frequency – the key problem for HMM parameters estimation: q̂τ |t (i) = E {qτ = i | o1:t } = N i=1 γτ |t (i) (2.8) 63 In estimation theory, this conditional estimation problem is called filtering if τ = t; prediction if τ < t, and smoothing if τ > t. The smoothing problem is termed fixed-point smoothing when computing the E {qτ | o1:t } for a fixed τ and increasing t = τ, τ + 1, . . ., fixed-lag smoothing when computing the E {qτ | o1:t+Δ } for a fixed lag Δ > 0, and fixedinterval smoothing when computing the E {qτ | o1:T } for all τ = 1, 2, . . . t, . . . , T . The fixed-interval smoothing problem is therefore to find the best estimate of the states at any time conditioned on the entire observations sequence, which is typically performed in batch learning using the Forward-Backward (FB) (Baum, 1972; Baum et al., 1970; Chang and Hancock, 1966) or the Forward-Filtering Backward-Smoothing (FFBS) (Cappe and Moulines, 2005; Ephraim and Merhav, 2002) algorithms. Typically, fixedinterval smoothing algorithms involve an estimation of the filtered state density, for t = 1, . . . , T : γt|t (i) P (qt = i | o1:t , λ) (2.9) which provides the best estimate of the conditional distribution of states given the past and present observations. It can be computed from the predictive state density (2.10), which provides the best estimate of the conditional distribution of states given only the past observations, γt|t−1 (i) P (qt = i | o1:t−1 , λ); γ1|0 (i) = πi (2.10) according to the following recursion: γt|t−1 (i)bi (ot ) γt|t (i) = N j=1 γt|t−1 (j)bj (ot ) (2.11) The predictive state density (2.10) at time t + 1 can be then computed from the filtered state density (2.9) at time t: γt+1|t (j) = N i=1 γt|t (i)aij (2.12) 64 Given an observation sequence o1:T , the log-likelihood is therefore evaluated with a time complexity of O(N 2 T ): T (λ) = T log P (ot | o1:t−1 ) = t=1 T t=1 log N bi (ot )γt|t−1 (i) (2.13) i=1 The predictive state density (2.10) is in fact a fixed-lag smoothing with one symbol look-ahead (Δ = 1). In situations where some delay or latency between receiving the observations and updating HMM parameters can be tolerated, incorporating more lag provides stability and improved performance over filtering estimation (Anderson, 1999). This is because the latter relies only on the information that is available at the current time. Although they provide an approximation of the conditional state distributions, both filtering and fixed-lag smoothing have been the core of several on-line learning techniques, presented in Section 2.3, since their recursions can go on indefinitely in time providing the state estimates without (or with a small fixed) delay. In addition, they require a linear memory complexity O(N ) that is independent of the observation length T , while maintaining a low computational complexity O(N 2 T ). The backward recursion of the FFBS (or FB) algorithm exploits the filtered state estimates in order to provide an exact smoothed estimate for the states, for t = T − 1, . . . , 1: γt|T (i) = N ξt|T (i, j) (2.14) j=1 γt|t (i)aij ξt|T (i, j) = N γt+1|T (j) i=1 γt|t (i)aij (2.15) Although fixed-interval smoothing is the standard for batch learning, waiting until the last observation before providing the state estimates incurs long delays as well as a large memory complexity O(N T ), to store the filtered state estimate for the entire observation sequence (see Khreich et al. (2010c) for more details). These issues are prohibitive for on-line learning. 65 A Forward Only (FO) fixed-interval smoothing algorithm has been proposed as an alternative to reduce the memory requirement of the FB or FFBS algorithms (Churbanov and Winters-Hilt, 2008; Elliott et al., 1995; Sivaprakasam and Shanmugan, 1995). The basic idea is to directly propagate all smoothed information in the forward pass: σt (i, j, k) t−1 τ =1 P (qτ = i, qτ +1 = j, qt = k | o1:t , λ), which represents the probability of having made a transition from state Si to state Sj at some point in the past (τ < t) and of ending up in state Sk at the current time t. At the next time step, σt+1 (i, j, k) can be recursively computed from σt (i, j, k) using: σt (i, j, k) = N σt−1 (i, j, n)ank bk (ot ) + αt−1 (i)aik bk (ot )δjk (2.16) n=1 from which the smoothed state densities can be obtained, at time t, by marginalizing over the states. The advantage of the FO algorithm is that it provides the exact smoothed state estimate at each time step with the linear memory complexity O(N ). However, this is achieved at the expenses of increasing the time complexity to O(N 4 T ) as can be seen in the four-dimensional recursion (2.16). As described in Section 2.3, several authors have proposed the FO algorithm for on-line learning of HMM parameters. 2.2.2 Optimization Techniques: Maximum-likelihood (ML) parameter estimation in Hidden Markov Models (HMMs) can be carried out using either the expectation maximization (EM) (Baum et al., 1970; Dempster et al., 1977) or standard numerical optimization techniques (Nocedal and Wright, 2006; Zucchini and MacDonald, 2009). This section describes both estimation procedures for HMMs. 66 2.2.2.1 Expectation-Maximization: The EM is a general iterative method to find the MLE of the parameters of an underlying distribution given a data set when the data is incomplete or has missing values. EM alternates between computing an expectation of the likelihood (E-step) – by including the latent variables as if they were observed – and a maximization of the expected likelihood (M-step) found on the E-step. The parameters found in the M-step are then used to initiate another iteration until a monotonic convergence to a stationary point of the likelihood (Dempster et al., 1977). In the context of HMM, the basic idea consists of optimizing an auxiliary Q-function (also known as the intermediate quantity of EM): QT (λ, λ(k) ) = Eλ(k) log P (o1:T , q1:T | λ) | o1:T , λ(k) = (2.17) P (o1:T , q1:T | λ(k) ) log P (o1:T , q1:T | λ) q∈S which is the expected value of the complete-data log-likelihood. By assuming the probability values of the hidden states, the Q-function is therefore easier to optimize than the incomplete-data log-likelihood (2.4). Nevertheless, it can be explicitly expressed in terms of HMM parameters. For instance, for a discrete output HMM it has the following form: QT (λ, λ(k) ) = N (k) N N T −1 i=1 t=1 i=1 j=1 γ1|T (i) log πi + (k) ξt|T (i, j) log aij + M T (k) t=1 j=1 γt|T (j)δot vm log bj (m) (2.18) Each term on the right hand side can be maximized individually using Lagrange multipliers subject to the relevant constraints (2.2). The solutions provide the same update formulas as those provided with the Baum-Welch algorithm below, (2.19), (2.20) and (2.21). Starting with an initial guess λ(0) , subsequent iterations, k = 1, 2, . . ., of the EM consist in: 67 E-step: computing the auxiliary function Q(λ, λ(k) ) M-step: determining the model that maximizes Q, λ(k+1) = arg maxλ Q(λ, λ(k) ) If instead of maximizing Q, we find some λ(k+1) that increases the likelihood (partial M-step) then it is called Generalized EM (GEM), which is also guaranteed to converge (Wu, 1983) . The Baum-Welch (BW) algorithm (Baum and Petrie, 1966; Baum et al., 1970) is a specialized EM algorithm for estimating HMM parameters. Originally introduced to estimate the parameters when provided with a single discrete observations sequence (Baum et al., 1970), it was later extended for multiple observation sequences (Levinson et al., 1983; Li et al., 2000) and for continuous observations (Juang et al., 1986; Liporace, 1982). Given a sequence o1:T of T observations, an HMM with N states, and an initial guess λ(0) , the BW algorithm proceeds as follows. During each iteration k, the FB or FF-BS computes the expected number of state transitions and state emissions based on the current model (E-step), and then re-estimated the model parameters (M-step) using: (k+1) πi (k+1) aij (k) = γ1|T (i) = (expected frequency of state Si at t = 1) (2.19) expected #of trans. from Si → Sj t=1 ξt|T (i, j) = = T −1 (k) expected #of trans. from Si t=1 γt|T (i) (2.20) (k+1) bj (m) = T −1 (k) T (k) t=1 γt|T (j)δot vm T (k) t=1 γt|T (j) = expected #of times in Sj observing vm expected #of times in Sj (2.21) Each iteration of the E-step and M-step is guaranteed to increase the likelihood of the observations giving the new model until a convergence to a stationary point of the likelihood (Baum, 1972; Baum et al., 1970; Dempster et al., 1977). For a continuous HMM only the last term of (2.18) and hence (2.21) must be changed according to the state parametric density. For instance, for an HMM with Gaussian state 68 densities (2.1), the state outputs update are given by: T μˆj = t=1 γt (j)ot , T t=1 γt (j) and σ̂j2 T = 2 t=1 γt (j)(ot − μj ) T t=1 γt (j) For multiple independent observation sequences, Levinson et al. (1983) proposed averaging the smoothed conditional densities, (2.14) and (2.14), over all observation sequences in the E-step, then updating the model parameters (M-step). Li et al. (2000) propose using a combinatorial method on individual observations probabilities, rather than their product, to overcome the independence assumption. For both cases, re-estimating parameters with the standard EM requires accessing all of the sequences in the training block during each iteration. 2.2.2.2 Standard Numerical Optimization: While EM-based techniques indirectly optimize the log-likelihood function (2.4) through the complete-data log-likelihood (2.17), standard numerical optimization methods work directly with the log-likelihood function and its derivatives. For instance, starting with an initial guess of the model λ0 , first order methods such as the gradient descent update the model at each iteration k using: λ(k+1) = λ(k) + ηk ∇λ T (λ(k) ) (2.22) where the learning rate ηk decrease monotonically over training iterations to ensure that the sequence T (λ(k) ) is non-decreasing. It is usually chosen as to maximize the objective function in the search direction: ηk = arg max T [λ(k) + η∇λ (λ(k) )] (2.23) η≥0 However due to its linear convergence rate, gradient descent methods could converge slowly in large optimization spaces. 69 Second order methods guarantee a faster convergence. For instance, the Newton-Raphson method updates the model at each iteration k using: λ(k+1) = λ(k) − H −1 (λ(k) )∇λ T (λ(k) ) (2.24) The convergence rate is at least quadratic at the convergence point, for which the Hessian is negative definite. However, if the initial parameters are far from those of true model parameters, the convergence is not guaranteed. A learning rate ηk similar to (2.23) can be introduced to control the step length in the search direction. A polynomial interpolation of T (λ) along the line-segment between λ(k) and λ(k+1) is usually used in practice, since it is often impossible to have an exact maximum of the line search. Nonetheless, the Hessian H = ∇2λ T (λ(k) ) could be non-invertible or not negative semi-definite. To avoid these issues, quasi-Newton methods use an adaptive matrix W (k) which provides an approximation of the Hessian at each iteration: λ(k+1) = λ(k) + ηk W (k) ∇λ T (λ(k) ) (2.25) where W (k) is negative definite to ensure that ascent direction is chosen. The numerical issues associated with the matrix inversion are therefore avoided, while still exhibiting the convergence rate of the Newton algorithms near the convergence point. For a discrete HMM, the gradient of the log-likelihood ∇λ T (λ(k) ) can be computed using the state conditional densities obtained from the fixed-interval smoothing algorithms (2.14) and (2.15) as: ∂T (λ(k) ) = ∂aij ∂T (λ(k) ) = ∂bj (m) T −1 (k) t=1 T ξt|T (i, j) aij (k) t=1 γt|T (j)δot vm bj (m) (2.26) (2.27) 70 An alternative computation can be achieved by using a recursive computation of the gradient itself as described in sub-section 2.3.2.2. Once the gradient of the likelihood is computed, the HMM parameters are then additively updated according to, for instance, the first order (2.22) or second order (2.25) methods. For multiple independent observation sequences, the derivatives with reference to each model parameter (2.26) and (2.27) must be accumulated and averaged over all observation sequences, prior to updating the model parameters. The reader is referred to Cappe et al. (1998); Qin et al. (2000) for more details on quasi-Newton, and to Turner (2008) on Levenberg–Marquardt optimizations for HMMs. Standard numerical optimization methods have to ensure parameter constraints, (2.2) and (2.3), explicitly through either a constrained optimization or a re-parametrization to reduce the problem to unconstrained optimization. Levinson et al. (1983) detail the early implementation of a constrained optimization for HMM using Lagrange multipliers. Among the unconstrained transformation is the soft-max parametrization proposed in (Baldi and Chauvin, 1994), which maps the bounded space (a, b) to the unbounded space (u, v): euij evj (k) aij = u and b(k) = v (z) j k e ik ze (2.28) Other unconstrained parametrization consist of a projection of the model parameters onto the constraints domain, such as a simplex or a sphere. The parametrization on a simplex (Krishnamurthy and Moore, 1993; Slingsby, 1992) = {aij ∈ R N −1 : N −1 aij ≤ 1, and aij ≥ 0} (2.29) j=1 enforces the stochastic constraints (2.2) by updating all but one of the parameters in each row, ali = 1 − N j=li aij , 1 ≤ li ≤ N . However, it does not ensure the positiveness of the parameters (2.3). One improvement consists in using a restricted projection to the 71 space of positive matrices for some positive (Arapostathis and Marcus, 1990): = {aij ∈ RN −1 : N −1 aij ≤ 1 − , and aij ≥ } (2.30) j=1 Alternatively, parametrization on a sphere adequately enforces both constraints (Collings et al., 1994): SN −1 = {sij : N s2ij = 1} (2.31) j=1 where aij = s2ij . This also has the advantage that the constraint manifold is differentiable at all points. 2.2.2.3 Expectation-Maximization versus Gradient-based Techniques For discrete output HMMs, the EM algorithm is generally easier to apply than gradientbased technique since derivatives, Hessian inversion, line-search, etc., are not required. It is numerically more robust against poor initialization values or when the model has large number of parameters. It also has a monotonic convergence propriety which is not maintained in gradient-based techniques, and ensures parameters constraints implicitly. However the rate of convergence of the EM can be very slow, it is usually linear in the neighborhood of a maximum (Dempster et al., 1977). This is comparable to gradient descent but slower than the quadratic convergence that can be achieved with second order or conjugate gradient techniques. Nevertheless, EM steps do not always involve closed-form expressions in such cases the maximization has to be carried out numerically. Another advantage of numerical optimization techniques is that they automatically yield an estimate of parameters variances (Cappe and Moulines, 2005; Zucchini and MacDonald, 2009). Since there is no clear advantage of one technique on the other, hybrid algorithms have been proposed to take advantage from both techniques (Bulla and Berzel, 2008; Lange, 1995). Given a block of data with R independent sequences each of length T and an N state HMM, the time an complexity per iteration for EM and gradient-based techniques are 72 O(RN 2 T ) and their memory complexity is O(N T ). This is because they both rely on the fixed-interval smoothing algorithms (2.14) and (2.15) to compute the state densities for the E-step or the gradient of the log-likelihood. The difference is basically related to the internal procedures employed in gradient-based such as the line search, and to the speed of convergence (see Cappe and Moulines (2005, Ch. 10) for more details). 2.3 On-Line Learning of HMM Parameters Several on-line learning techniques from the literature may be applied to incremental learning of HMM parameters from new training sequences. Figure 2.4 presents a taxonomy of techniques for on-line learning of HMM parameters, according to objective function, optimization technique, and target application. As shown in the figure, they fall in the categories of standard numerical optimization, expectation-maximization and recursive estimation, with the objective of either maximizing the likelihood estimation (MLE) criterion, minimizing the model divergence (MMD) of parameters penalized with the MLE, or minimizing the output or state prediction error (MPE). The target application implies a scenario for data organization and learning. Some techniques have been designed for block-wise estimation of HMM parameters, while others for symbol-wise estimation of parameters. Block-wise techniques are designed for scenarios in which training symbols are organized into a block of sub-sequences and the HMM re-estimates its parameters after observing each sub-sequence. In contrast, symbol-wise techniques, also known as as recursive or sequential techniques, are designed for scenarios in which training symbols are observed one at a time, from a stream of symbols, and the HMM parameters are re-estimated upon observing each new symbol. The rest of this section provides a survey of techniques for on-line learning of HMM parameters shown in Figure 2.4. Symbol-wise Block-wise Standard Numerical Optimization Maximum Likelihood Minimum Prediction Estimation Error Expectation Maximization Singer and Warmuth (1996) Baldi and Chauvin (1994) Cappe et al. (1998) Digalakis (1999) Mizuno et al. (2000) Garg and Warmuth (2003) Stiller and Radons (1999) Stinger et al. (2001) FlorezLarrahondo et al. (2005) Mongillo and Deneve (2008) Cappe (2009) Holst and Lindgren (1991) Slingsby (1992) Krishnamurthy and Moore (1993) Recursive Estimation Ryden (1997, 1998) Collings et al. LeGland and (1994) Mevel (1995, 1997) Ford and Moore (1998) Collings and Ryden (1998) Ford and Moore (1998b) LeGland and Mevel (1997) Target Application Minimum Model Divergence Optimization Objective Techniques Functions 73 Figure 2.4 Taxonomy of techniques for on-line learning of HMM parameters 2.3.1 Minimum Model Divergence (MMD) A dual cost function that maximizes the log-likelihood while minimizing the divergence of model parameters, using an exponentiated gradient optimization framework (Kivinen and Warmuth, 1997), is first proposed for the block-wise case (Singer and Warmuth, 1996), then extended to symbol-wise case (Garg and Warmuth, 2003). Block-wise Based on the exponentiated gradient framework (Kivinen and Warmuth, 1997), Singer and Warmuth (1996) proposed an objective function for discrete HMMs that minimizes the divergence between old and new HMM parameters penalized by the negative log- 74 likelihood of each sequence multiplied by a fixed positive learning rate (η > 0): λ(k+1) = arg min KL(λ(k+1) , λ(k) ) − ηT (λ(k+1) ) λ where KL(λ(k+1) , λ(k) ) is the Kullback-Leibler (KL) divergence between the distributions λ(k+1) and λ(k) . KL is defined between two probability distributions Pλ0 (o1:t ) and Pλ (o1:t ) by KLt (Pλ0 P 0 (o1:t ) P 0 (o1:t ) Pλ ) = E log λ = Pλ0 (o1:t ) log λ Pλ (o1:t ) Pλ (o1:t ) t (2.32) which is always non-negative and attains its global minimum at zero for Pλ → Pλ(0) . Its limit (t → ∞) is Kullback-Leibler rate or relative entropy rate. The model parameters are updated after processing each observation sequence: (k+1) aij ∂T (λ(k+1) ) 1 (k) η = a exp − Z1 ij ∂aij ni (λ(k+1) ) (k+1) bj (m) = ∂T (λ(k+1) ) 1 (k) η bj (m) exp − Z2 nj (λ(k+1) ) ∂bj (m) (2.33) (2.34) where Z1 and Z2 are normalization factors. The expected usage of state i, ni (λ(k) ) = T t=1 γt|T (i) can be computed using the filtering estimation (2.14), and the derivatives of the log-likelihood are given by (2.26) and (2.27). Symbol-wise Garg and Warmuth (2003) extended the block-wise learning of Singer and Warmuth (1996) to symbol-wise. The model divergence is however penalized by the negative of the log-likelihood increment, i.e., log-likelihood of each new symbol given all previous ones. At each time instant the model is updated using: λt+1 = arg min KL(λt+1 , λt ) − η log P (ot+1 | o1:t , λt+1 ) λ 75 which can be decoupled into similar update, for both state transition and state output, as with (2.33) and (2.34), respectively. The main difference resides however in the computation of the expected usage of state i – it is computed recursively based on each symbol with ni (λt ) = t τ =1 γτ |t (i), is the gradient of the log-likelihood increment. The recursive formulas proposed to implement this computation are very similar to the recursive maximum likelihood estimation proposed by LeGland and Mevel (LeGland and Mevel, 1995, 1997) (see Section 2.3.2). 2.3.2 Maximum Likelihood Estimation (MLE) The attractive asymptotic properties of the MLE in general, have prompted many attempts to extend this objective function to on-line learning for different applications. In particular, indirect maximization of log-likelihood, through the complete log-likelihood (2.17), has been the basis for many on-line learning algorithms. Neal and Hinton (1998) have proposed an “incremental” version of the EM algorithm to accelerate its convergence on a finite training data set4 . Assuming a fixed training data set is divided into n blocks, each iteration of this algorithm performs a partial E-step (for a selected block) then updates the model parameters (M-step), until a convergence criterion is met. The improved convergence is attributed to exploiting new information more quickly, rather than waiting for the complete data to be processed before updating the parameters (Gotoh et al., 1998; Neal and Hinton, 1998). Many extensions to this algorithm have emerged for online block-wise (Digalakis, 1999; Mizuno et al., 2000) or symbol-wise (Florez-Larrahondo et al., 2005; Stenger et al., 2001; Stiller and Radons, 1999) learning. Direct maximization of the log-likelihood function is first proposed for the block-wise case using the GD algorithm (Baldi and Chauvin, 1994), and then using the quasiNewton algorithm (Cappe et al., 1998) for faster convergence. A recursive block-wise estimation technique is also proposed for this optimization (Ryden, 1998, 1997). Finally, extensions to the work of Titterington (1984) and Weinstein E. and Oppenheim (1990) 4 In contrast, the incremental scenario considered in this chapter restricts accessing a block of data once it has been processed. 76 for independent data has lead to a symbol-wise technique to optimize the complete data likelihood (2.17) recursively, at each time step. 2.3.2.1 On-line Expectation-Maximization An important property of the “incremental” version of EM resides in its flexibility (Neal and Hinton, 1998). There is no constraints on how the data is divided into blocks. The block might range from a single observation symbol to one or multiple sequence of observations. In addition, one can choose to update the parameters by selecting data blocks cyclically, or by any other scheme. Accordingly, several algorithms have extended this approach to on-line learning of HMM parameters. As detailed next, this is basically achieved by initializing the smoothed densities to zero, processing the training data (symbol-wise or block-wise) sequentially, and updating the HMM after each symbol or sequence. Block-wise Digalakis (1999) derived an on-line algorithm of EM to update the parameters of a continuous HMM for automatic speech recognition, and showed faster convergence and higher recognition rate over the batch algorithm. Similarly, Mizuno et al. (2000) proposed a similar algorithm for a discrete HMM however. Given the initial model λ0 , and after processing each new sequence of observation of length T , the state densities are recursively computed using: T −1 t=1 (k+1) ξt|T T (k+1) t=1 γt|T (i, j) = (1 − ηk ) (j)δot m = (1 − ηk ) T −1 t=1 (k) ξt|T (i, j) + ηk T −1 t=1 (k+1) ξt|T (i, j) T (k) T (k+1) t=1 t=1 γt|T (j)δot m + ηk γt|T (j)δot m and the model parameters are then directly updated using (2.20) and (2.21). The learning rate ηk is proposed in polynomial form ηk = c( k1 )d for some positive constants c and d. 77 Symbol-wise Stiller and Radons (1999) introduced a recursive algorithm for non-stationary discrete Mealy HMMs5 , which is essentially based on the FO (2.16). The main idea is to recursively update a tensorial quantity, t ξijk (ot ) = t τ =1 τ,t αiτ −1 aτij−1 (oτ )βjk δot oτ (2.35) which accounts for the weighted sum of transition probability from state Si to Sj (at time τ ), emitting symbol ot and being in state k at time t ≥ τ . The model parameters (emission on arcs) are condensed in atij (ot ) = P (qt = j, ot | qt−1 = i). The forward-vector (ατ ) and backward-matrix (β τ,t ) are recursively computed by: ατ = π τ a(r−1) (or ) and β τ,t = r=1 t r=τ +1 ar−1 (or ) t (o ) and Equation (2.35) can therefore be recursively computed as function of the old ξijk t the new symbol ot+1 : t+1 ξijk (ot+1 ) = t l ξk̂kl (ot+1 ) t ξij k̂ (ot ) t (o ) τ ĵ,ˆ l ξiĵ ˆ l τ k̂ t k ξijk (ot ) t + ηt δot ot+1 δjk ξlmi (oτ ) t τ ĵ,k̂ ξij k̂ (oτ ) l,m τ where ηt is a time-varying learning rate. Finally, the model parameters (aτij ) and the probability of the system currently being in state k (αkτ ) can be estimated at any time using: t k ξijk (ot ) t aij (o) = t t ĵ,k̂ ξij k̂ (ot ) and αkτ = i,j i,j,k̂ t t ξijk (ot ) t t ξij k̂ (ot ) This algorithm have been recently proposed for Moore HMMs (Mongillo and Deneve, 2008) and then generalized (Cappe, 2009). 5 The emission is produced on transitions (Mealy model) while for all other algorithms presented the output is produced on states (Moore model). 78 Stenger et al. (2001) approximated the fixed-interval smoothing, used in BW, by the filtered state density (2.9) for updating the parameters of a continuous HMM from a stream of data. Naturally, the filtered state density is recursively computed independently of the sequence length. The model parameters are then updated, at each time step, by t−2 atij μti = ξt−1|t (i, j) τ =1 γt|τ (i) t−1 aij + t−1 τ =1 γt|τ (i) τ =1 γt|τ (i) t−1 γt|t (i)ot τ =1 γt|τ (i) t−1 μi + t t τ =1 γt|τ (i) τ =1 γt|τ (i) t−1 t−1 γt|t (i)(ot − μt )2 τ =1 γt|τ (i) i σi2 + t t τ =1 γt|τ (i) τ =1 γt|τ (i) = t−1 t σi2 = (2.36) An exponential forgetting is proposed to make old estimates receive less weight by tuning a fixed learning rate. Similarly, Florez-Larrahondo et al. (2005) proposed, for discrete HMM, using the predictive state density (2.10) as a better approximation6 . The model parameters are recursively updated using an adaptation of Stenger’s recursion formulas for discrete HMMs however. The transition update is the same as (2.36) with γt|τ replaced by γt+1|τ , while the state output is given by: t−1 γt+1|t (j)δ(oT , vk ) τ =1 γt+1|τ (j) T −1 bj (k) + t τ =1 γt+1|τ (j) τ =1 γt+1|τ (j) bTj (k) = t (2.37) The author argued that there is no need to exponential forgetting and proposed postponing the first update of the parameters until some time t > 0 so that enough statistics should have been collected to stabilize the re-estimation. 6 This is equivalent to setting βt (i) = N j=1 aij bj (ot+1 ) in the FB algorithm. 79 2.3.2.2 Numerical Optimization Methods: Block-wise Based on the GD of the negative likelihood (2.22), Baldi and Chauvin (1994) have introduced a block-wise algorithm for estimation of discrete HMM parameters, which related to the Generalized EM (GEM). By using the soft-max parametrization, (2.28), and the FF-BS algorithm, parameters are updated after each sequence as follows: (k+1) (k) = uij + η uij (k+1) vj (m) = T ξt|T (i, j) − aij γt|T (i) t=1 (k+1) vj (m) + η Tt=1 γt|T (j)δot m − bj (m)γt|T (j) This is typically referred to as stochastic gradient-based techniques (Bottou, 2004). Based on the recursive maximum likelihood estimation of LeGland and Mevel (LeGland and Mevel, 1995, 1997) presented below, Cappe et al. (1998) proposed a quasi-Newton (2.25) faster convergent technique for maximizing the likelihood of a continuous HMM with Gaussian output (2.1). Using the parametrization on a simplex (2.29) and replacing the standard deviation by its log (to ensure its positivity), the gradient of the log-likelihood (2.13), can be computed recursively using: ∇λ T (λ(k) ) = where ct = N N 1 γt|t−1 (i)∇λ(k−1) bi (ot ) + bi (ot )∇λ(k−1) γt|t−1 (i) t=1 ct i=1 T k=1 bk (ot )γt|t−1 (k) (2.38) is a normalization factor. The gradient of the predictive density is also computed recursively, with reference to the model parameters, using: N ∂γt|t−1 (i) ∂γt+1|t (j) 1 = aij − γt+1|t (j) + γt|t−1 (i) ∂aij ct i=1 ∂aij N 1 ∇λb γt+1|t (j) = aij − γt+1|t (j) γt|t−1 (i)∇λb bi (ot ) + bi (ot )∇λb γt|t−1 (i) ct i=1 80 and the gradient of the Gaussian output density (2.1), w.r.t. the mean and the standard deviation, is given by ∂bi (ot ) ∂bi (ot ) , ∇λb bi (ot ) = ∂μi ∂σi 2.3.2.3 ⎡ ⎛ (ot − μi ) (ot − μi ) 1 =⎣ bi (ot ), ⎝ 2 σi σi σi 2 ⎞ ⎤ − 1⎠ bi (ot )⎦ (2.39) Recursive Maximum Likelihood Estimation (RMLE): The general recursive estimator for the parameters of a stochastic process is of the form (Benveniste et al., 1990; Ljung and Soderstrom, 1983): λt+1 = λt + ηt H(ot+1 , λt )−1 h(ot+1 , λt ) (2.40) where ηt is a sequence of small gains (constant or decreasing with t), H(ot+1 , λt )−1 an adaptive matrix, and h(ot+1 , λt ) is a score function. Both the adaptive matrix, H, and the score function, h, determine the update of the parameter λt as a function of new observations. The goal of the recursive procedures is to find the roots of a regression function given the true parameters λ0 : limt→∞ E H(ot+1 , λt )−1 h(ot+1 , λt ) | λ0 . In the general case of incomplete data, the score is equal to the gradient of the log-likelihood function h = ∇λ log P (ot+1 | λ). The adaptive matrix is equal to the incomplete data observed information matrix H = ∇2λ log P (o1:t+1 | λ), which ideally must be taken as the incomplete data Fisher information matrix H = Eλ [h(ot+1 | λ)h (ot+1 | λ)]. However, for the incomplete data models, explicit form of the incomplete data Fisher information matrix is rarely available. Titterington (1984, Eq. 9) proposed to use the complete data Fisher information matrix instead H = E −∇2λt log P (ot , qt | λ) . He showed that it is easy to compute and invert because it is always block-diagonal with respect to latent data and model parameters. For several independent and identically distributed (i.i.d.) incomplete data models, this recursion had proven to be consistent and asymptotically 81 normal (Titterington, 1984): λt+1 = λt + −1 1 E −∇2λt log P (ot , qt | λ) ∇λt log P (ot+1 | λt ) t+1 (2.41) This recursive learning algorithm is related to on-line estimation of parameters using EM. It was also related to an EM iteration since it uses a recursive version of (2.17). The efficiency and convergence is an issue with Titterington’s recursion (Ryden, 1998; Wang and Zhao, 2006). In fact, for HMMs, the score function must consider the previous observations: h(ot+1 | λ) = ∇λ log P (ot+1 | o1:t , λ) (2.42) Another recursion using the relative entropy (2.32) as objective function, is proposed by Weinstein E. and Oppenheim (1990, Eq. 4): λ̂t+1 = λ̂t + ηt h(ot+1 | λ̂t ) (2.43) where the gain ηt satisfies: lim ηt = 0; t→∞ ∞ t=1 ηt = ∞; ∞ ηt2 ≤ ∞ (2.44) t=1 It was suggested that h may be calculated from the complete data using h(ot+1 | λ̂t ) = Eλ {∇ log P (ot+1 , qt+1 | λ) | ot+1 } and is shown to be consistent in the strong sense and in the mean-square sense for stationary ergodic processes that satisfy some regularity conditions. However, some of these conditions do not hold for HMM (Ryden, 1998, 1997). 82 Block-wise The idea proposed by Ryden (Ryden, 1998, 1997) is to consider successive blocks of m observations taken from data stream, on = {o(n−1)m+1 , . . . , onm }, as independent of each other, while the Markov dependence is only maintained within the observations of each block. This assumption reduces the extraction of information from all the previous observations to a data-segment of length m. Although it gives an approximation of the MLE – a less efficient maximum pseudo-likelihood – the presented technique is easier to apply in practice. Using a projection PG on the simplex (2.30), , for some > 0, at each iteration, the recursion is given (without matrix inversion) λ̂n+1 = PG λ̂n + ηn h(on+1 | λ̂n ) (2.45) where h(on+1 | λ̂n ) = ∇λn log P (on+1 | λn ) and ηn = η0 n−ρ for some positive constant η0 and ρ ∈ ( 12 , 1]. It was shown to converge almost surely to the set of Kuhn-Tucker points for minimizing the relative entropy KLm (λ0 λ) defined in (2.32), which attains its global minimum at λ̂n → λ0 provided that the HMM is identifiable (Leroux, 1992; Ryden, 1994), hence the requirement m ≥ 2. Symbol-wise Holst and Lindgren (1991, Eq. 16) proposed a similar recursion to (2.41) for estimating the parameter of an HMM however. They used the conditional expectation over (qt−1 , qt ) given o1:t , which can be efficiently calculated using a forward recursion. It is guided by Ht−1 t 1 = h(ot | λ̂t−1 )h (ot | λ̂t−1 ) t τ =1 83 which is different from the score function (2.42). The adaptive matrix is suggested by an empirical estimate to the incomplete data information matrix given by Ht−1 = t 1 h(ot | λ̂t−1 )h (ot | λ̂t−1 ) t τ =1 which can be computed recursively without matrix inversion using: t Ht−1 ht ht Ht−1 Ht = Ht−1 − t−1 t − 1 + ht Ht−1 ht Ryden (1997) argued that the recursion of Holst and Lindgren aims at local minimization of the relative entropy rate (2.32) between λ0 and λ̂t . Moreover, he showed that if √ λ̂t+1 → λ0 , then t λ̂t+1 − λ0 is asymptotically normal with zero mean and covariance matrix given by the inverse of this expectation limt→∞ E h(ot+1 , λ0 )h (ot+1 , λ0 ) . Slingsby (1992) and Krishnamurthy and Moore (1993) derived an on-line algorithm for HMM based on the recursive version of the EM (2.41) proposed in (Titterington, 1984; Weinstein E. and Oppenheim, 1990). This technique applies a two step procedure, similar to EM, as each observation symbol arrives: E-step: Recursively computes the complete data likelihood Qt+1 (λ, λt ) = Eλ [log P (ot+1 , qt+1 | λ) | o1:t+1 , λt ] = t+1 τ =1 Qτ (λ, λt ) (2.46) M-step: Estimates λt+1 = arg maxλ Qt+1 (Λt , λ) where Qτ (λ, λt ) is defined in (2.17). The model is updated using: −1 λt+1 = λt + ∇2λt Qt+1 (λ, λt ) ∇λt Qt+1 (λ, λt ) Since each term of (2.46) depends on only one of the model parameters, ∇2λt Qt+1 (λ, λt ) is a block diagonal matrix for each of the model parameters. A projection into the constraint 84 domain along with the application of smoothing (fixed-lag or more conveniently sawtoothlag) and forgetting show, through empirical experiments, a convergence to the correct model. Slingsby (1992) presented the parameters update for a discrete HMM using a projection on a simplex (2.29): ⎛ (t+1) aij where (i) dj (t+1) bj t+1 τ =1 ξτ (i, j) ; â2ij = (t) ⎞ N (i) (i) 1 ⎝ (i) (t) h=1 gh /dh ⎠ = aij + (i) gj − (i) N dj h=1 1/dh (i) gj = γt+1 (j) τ =1 γτ (j)δ(ot+1 , k) ξt+1 (i, j) âij (k) = bj (k) − bj (ot+1 ) t+1 (2.47) ⎛ (2.48) b (k)2 t+1 j τ =1 γτ (j)δ(ot+1 ,k) ⎜ ⎜ ⎜ ⎝ M p=1 bj (p)2 τ =1 γτ (j)δ(ot+1 ,p) t+1 ⎞ ⎟ ⎟ ⎟; ⎠ k = ot+1 (t+1) bj (k) = 1 − M (t+1) p=k bj (p); k = ot+1 Krishnamurthy and Moore (1993) focused on continuous HMM with Gaussian output (2.1). The state transitions update is the same as (2.48) and (2.47), and the state output update is given by: (t+1) μi 2 σw (t+1) (t) = μi + 2 = σw (t) γt+1 (i) ot+1 − μi (t) t+1 τ =1 γτ |t+1 (i) N + i=1 γt+1 (i) (t) ot+1 − μi 2 2 − σw (t) t+1 Since the parameters positiveness is not maintained with the simplex parametrization, the projection on a sphere (2.31) as detailed in (Collings et al., 1994) solves this issue. 85 Now (i) dj = t+1 τ =1 2ξτ (i, j) + 2γt+1 (i) ; s2ij (i) gj = 2ξt+1 (i, j) − 2γt+1 (i)sij sij and the transition parameters update is given by (t+1) sij (i) gj (t) = sij + (i) , dj aij = s2ij LeGland and Mevel (1995, 1997) suggested and proved the convergence and the asymptotic normality of the RMLE (and the RCLSE described in the next section) without any stationary assumption using the geometric ergodicity and the exponential forgetting of their prediction filter and its gradient. This approach is based on the observation that the log likelihood can be expressed as an additive function of an extended Markov chain, i.e., as sum of terms depending on the observations and the state predictive filter (2.10): τ (λ) = τ t=1 log N bi (ot )γt|t−1 (i) i=1 Taking the gradient of the log-likelihood increment gives the score: (as a function of the extended Markov chain Zt = F (ot , γt|t−1 , wt )) h(Zt | λ) = ∇λ τ (λ) = where ct = N k=1 bk (ot )γt|t−1 (k) N 1 γt|t−1 (i)∇λ bi (ot ) + bi (ot )wti ct i=1 is a normalization factor, and wti = ∇λ γt|t−1 (i) is the gradient of the state prediction filter, which can be also computed recursively using: i = R1 (ot , γt|t−1 , λ)wti + R2i (ot , γt|t−1 , λ) wt+1 (2.49) 86 where R1 (ot , γt|t−1 , λ) = N N i=1 j=1 R2i (ot , γt|t−1 , λ) = N ⎡ aij ⎣1 − ⎡ aij ⎣1 − γt|t−1 (i) bi (ot ) ⎤ N k=1 bk (ot ) ⎦ bi (ot ) ct ct ⎤ N j=1 k=1 γt|t−1 (k) ⎦ γt|t−1 (i) ∇λ bi (ot ) + ct ct N γt|t−1 (i)bi (ot ) ∇λ aij ct j=1 The parameters update is then done for each row i using: λ̂it+1 = P λ̂it + ηt+1 h(Zti | λ) where, ηt = 1t , and P is the projection on the simplex (2.30), , for some > 0. The update can be decoupled into transition and state output probabilities as follow (R1 does not change). For transition probabilities, ∇λ bi (ot ) are zeros, accordingly h and R2 reduces to: h(Zt | λ) = R2i (ot , γt|t−1 , λ) = N 1 bi (ot )wt (i) ct i=1 N γt|t−1 (i)bi (ot ) ∇λ aij ct j=1 where, ∇λ aij = 1 at position i, j and zero otherwise. For discrete state outputs, ⎧ ⎪ i ; ⎪ ⎨1 N γ (i)∇ b (o ) + b (o )w t t i i λ t i=1 t|t−1 h(Zt | λ) = ⎪ ct ⎪ ⎩ 1 N b ct ⎧ ⎪ ⎪ ⎪ N ⎨ R2i (ot , γt|t−1 , λ) = ⎪ ⎪ ⎪ ⎩0 i=1 j=1 aij i i (ot )wt 1− ; bi (ot ) N k=1 γt|t−1 (k) ct if ot = vk if ot = vk γt|t−1 (i) ∇λ bi (ot ), ct if ot = vk if ot = vk 87 For the Gaussian state outputs (2.1): h(Zt | λ) = R2i (ot , γt|t−1 , λ) = N 1 γt|t−1 (i)∇λb bi (ot ) + bi (ot )wti ct i=1 N ⎡ aij ⎣1 − j=1 bi (ot ) N ⎤ k=1 γt|t−1 (k) ⎦ γt|t−1 (i) ∇λb bi (ot ) ct ct where ∇λb bi (ot ) is given by (2.39). Collings and Ryden (1998) proposed a similar RMLE approach, for continuous HMM with Gaussian output (2.1). The difference consists of using the sphere parametrization (2.31), instead of the simplex used in (LeGland and Mevel, 1995, 1997). The gradient of the transition probabilities and the recursive computation of the gradient of the state predictive density should be now taken w.r.t. the projected parameters sij (aij = s2ij ). Although for different purpose, these derivations are very similar to (2.50) and (2.51) presented next. 2.3.3 Minimum Prediction Error (MPE): Minimizing the output or state prediction error is also considered as alternative objective function. It consists in measuring the error of HMM output prediction (Collings et al., 1994; LeGland and Mevel, 1997) or of HMM filtered state (Ford and Moore, 1998a,b), and then providing an updated estimate of HMM parameters with each new observation symbol. Recursive Prediction Error (RPE) is first proposed for continuous range Gauss-Markov process (Ljung, 1977), and for a general recursive stochastic gradient algorithm (Arapostathis and Marcus, 1990). RPE extends the concept of least squares from linear to non linear functions. It locates the locale minimum of the prediction error cost function J(λ) and provides an updated estimate of the model with each new observation. It is formulated by considering the minimum variance of the prediction error based on the 88 best model estimate at the time # 1 λ̂ = arg min J(λ) = E [ot − ôt ]2 2 λ $ where ôt is the output prediction. The expectation of the off-line minimum variance, for an observation sequence of length T , is approximated by its time average: JT (λ) = T 1 [ot − ôt ]2 2T t=1 which is commonly minimized with the Gauss-Newton method – a simplified Newton method specialized to the nonlinear least square problem – where each pass k through the data the model is updated using: ⎡ ⎤−1 ⎡ T 1 λ̂(k+1) = λ̂(k) − ηt ⎣ ψ (k) ψλ (k) ⎦ T t=1 λ ⎤ T 1 ⎣ ψ (k) ε (k) ⎦ T t=1 λ λ where ψλ(k) = ∇λ(k) JT (λ) and ελ(k) = ot − ôt . An on-line version of the Gauss-Newton method is also possible if the gradient, and the inverse of the empirical information matrix, can be built up recursively, at each time instant, as new observation arrive. A general practical implementation is given by Ljung and Soderstrom (1983). Collings et al. (1994) extended the RPE for continuous HMM (known variance is assumed). The model is updated at each time step using: −1 λ̂t+1 = Ps λ̂t + ηt+1 Rt+1 ψt+1 Et+1 where ηt is a gain sequence satisfying (2.44) and Rt is the adaptive matrix which is computed by: Rt+1 = Rt + ηt+1 (ψt+1 ψt+1 − Rt ) 89 or alternatively, to avoid matrix inversion, Rt−1 could be computed as −1 Rt+1 R−1 ηt+1 Rt−1 ψt+1 ψt+1 1 t −1 = Rt − −1 1 − ηt+1 (1 − ηt+1 ) + ηt+1 ψt+1 Rt ψt+1 However, now for the HMM case, the output error is given by Et = ot − ôt|t−1 where, the output prediction, ôt|t−1 = E [ot | o1:t−1 , λt ], is conditioned on previous values and given by (using the unnormalized state variable αt (i) of the Forward-Backward algorithm) N N ôt|t−1 = i=1 j=1 aij αt−1 (i)bj (ot ) N i=1 αt−1 (i) In addition, a projection on the sphere, Ps , is made at each time step to ensure model constrains (2.31). Accordingly, the gradient of the output error is given by: ψt = [−∇λ ε̂t ] = ∇μj ôt|t−1 , ∇sij ôt|t−1 which can be also computed recursively: N N N aij αt (i)ςt (i)bj (ot ) i=1 aij αt (i) + i=1 aij ςt (i)bj (ot ) ∇μj ôt+1|t = − i=1 N 2 N i=1 αt−1 (i) i=1 αt−1 (i) N N N 2 2 i=1 sij αt (i) μj − i=1 j=1 sij bj (ot ) + N i=1 aij ζt (i, j)bj (ot ) ∇sij ôt+1|t = N 2 i=1 αt−1 (i) N aij αt (i)ζt (i, j)bj (ot ) − i=1 N 2 (2.50) i=1 αt−1 (i) where ςt (j) = ∇μj αt (j) and ζt (j, m, n) = ∇sij αt (j) are given by the following recursion: ςt+1 (j) = ζt+1 (i, j) = N i=1 N i=1 aij ςt (j)bj (ot+1 ) + N ot+1 − bj (ot ) aij αt (i)bj (ot+1 ) 2 σw i=1 aij ζt (i, j)bj (ot+1 ) − 2sij αt (i) N j=1 bj (ot+1 )s2ij + 2αt (i)sij bj (ot+1 ) (2.51) 90 The authors did not study the convergence properties of the algorithm, but through simulations. Later, a convergence problem for this algorithm has been noted when the variance error is small (low noise condition), as explained in (Collings and Ryden, 1998, Sec. 5) and (Ford and Moore, 1998b, Sec. 3-C). Ford and Moore (1998a) proposed an ELS (extended least squares) and RPE schemes in cases where the transition probabilities are assumed known, which exploit filtered state estimates. These schemes have been extended to the recursive state prediction error (RSPE) algorithm (Ford and Moore, 1998b), where the state transition probabilities are now estimated. Local convergence analysis, for the RSPE, is shown using the ordinary differential equation (ODE) approach developed for the RPE methods. The objective function is given (as a function of the filtered state error) by: ⎧ ⎪ ⎨ λ̂ = arg min Jt (λ) = λ ⎪ ⎩ ⎤2 ⎫ ⎪ ⎬ 1 ⎣γ (i) − ⎦ γ (i)a ij ⎪ t−1|t−1 2 t=1 t|t ⎭ i=1 τ ⎡ N and the model is updated, for each row i, using: λ̂it = λ̂it−1 + Hti Hti −1 i = Ht−1 −1 −1 wti + γt−1|t−1 (i) where Ht−1 is an approximation to the second derivative of wt , and wti = ∂Jt (λ) ∂aij can be computed recursively, similar to (2.49). In addition to the RMLE described previously, LeGland and Mevel (1997) proved the convergence of the recursive conditioned least squares estimator (RCLSE), which is a generalization of the RPE approach (Collings et al., 1994). A similar objective function 91 is used: Jτ (λ) = τ 2 1 ot − ôt|t−1,λ 2τ t=1 where the output prediction given by ôt|t−1 = N bj (ot )γt|t−1 (j) j=1 is based on the state predictive density. Accordingly, its gradients (wti ) can be computed recursively using the previously described recursions (2.49) according to each case. The model update for transition probabilities is given, for each row i, by ⎛ ⎡ λ̂it+1 = P ⎝λ̂it + ηt+1 ⎣μi (ot − N ⎤ ⎞ μj γt|t−1 (j))⎦ wti ⎠ j=1 and for the continuous state outputs by ⎛ ⎛ λ̂it+1 = P ⎝λ̂it + ηt+1 ⎝ N i=1 ⎡ ⎣∇i μi (ot − N ⎡ ⎣μi (ot − i=1 N N ⎤ μj γt|t−1 (j))⎦ wti + j=1 ⎤ ⎞⎞ μj γt|t−1 (j))⎦ γt|t−1 (i)⎠⎠ j=1 As for the discrete state outputs, the algorithm is only applicable for finite alphabet. In this case, the same formulas apply, however with μi = T t=1 ot bi (ot ) and of course using the gradient recursion (2.49) for the discrete case. 2.4 An Analysis of the On-line Learning Algorithms Algorithms for on-line learning of HMM parameters are mostly designed to learn observations from a source generating an infinite amount of data. For instance, these algorithms can be employed to re-estimate HMM parameters from a large number of sub-sequences (e.g., speech sentences), or a stream of symbols (e.g., signal in a noisy communication channel). Block-wise techniques re-estimate HMM parameters after observing each sub-sequence, while symbol-wise techniques re-estimate HMM parameters 92 λ0 Sub-sequences of observations λ0 λ0 λ1 o1 λ1 s1 λ2 o2 λ2 s2 λr or λr sr (a) On-line learning Two sequences of T observation symbols Two blocks of R sub-sequences λ0 o1 S1 ∪ S2 Stream of observations Block-wise Symbol-wise oT o1 oT D1 ∪ D2 Block-wise Symbol-wise s1 sR s1 sR λ(k) λ(k) λ λ (b) Batch learning Figure 2.5 An illustration of on-line learning (a) from an infinite stream of observation symbols (oi ) or sub-sequences (si ) versus batch learning (b) from accumulated sequences of observations symbols (S1 ∪ S2 ) or blocks of sub-sequences (D1 ∪ D2) upon observing each new symbol (Figure 2.5 (a)). The objective is to optimize HMM parameters to fit the source generating the data through one observation of the data. In contrast, batch learning algorithms re-estimate HMM parameters upon observing all accumulated sequences or blocks of sub-sequences of observations (Figure 2.5 (b)). Batch learning algorithms optimize HMM parameters to fit the accumulated data over several iterations. On-line learning is challenging since algorithms must converge rapidly to the true model with minimum resource requirements. It is assumed that the true generative model resides within the HMM parameters space. Finding recursive formulas to update the HMM parameter is necessary for on-line learning but not sufficient. Statistical properties such as consistency and asymptotic normality must be proven to ensure the convergence of an on-line learning algorithm. This section provides an analysis of the convergence properties and of the computational complexity of the on-line algorithms presented in Section 2.3. 93 2.4.1 Convergence Properties The extension of the batch EM to an incremental EM version using a partial data for the E-step followed by a direct update of the model parameters in the M-step, is a well known technique to overcome the resource requirements when processing a large fixedsize data set (Hathaway, 1986). Neal and Hinton (1998) showed that this EM variant gives non-increasing divergence, and that local minima in divergence are local maxima in likelihood. However, as argued by Gunawardana and Byrne (2005), this is insufficient to conclude that it converges to local maxima in the likelihood. They showed that this incremental EM variant is not in fact an EM procedure and hence the standard GEM convergence results (Wu, 1983) do not apply. This is because the E-step is modified to use summary statistics from the posterior distributions of previous estimates. By using the Generalized Alternating Minimization (GAM) procedure in an information geometric framework, Gunawardana and Byrne (2005) provided a convergence analysis which proves that the incremental EM converges to stationary points in likelihood, although not monotonically. The authors could find no proof of convergence for on-line EM-based algorithms when HMM parameters estimation is performed from infinite amount of data, for both blockwise (Digalakis, 1999; Mizuno et al., 2000) and symbol-wise (Cappe, 2009; Florez-Larrahondo et al., 2005; Mongillo and Deneve, 2008; Stenger et al., 2001; Stiller and Radons, 1999) techniques. Furthermore, statistical analysis such as consistency and asymptotic normality are not provided. Although the rate of convergence is not studied analytically, applying stochastic techniques in literature such as averaging (Polyak, 1991) has been shown to help improve convergence properties (Cappe, 2009). Among the block-wise algorithms, the gradient-based algorithm proposed by Cappe et al. (1998) should achieve the fastest convergence rate since it is based on a quasi-Newton method (second order approximation), while the others should have similar slower convergence rate. Similarly, for the recursive EM symbol-wise algorithms (Krishnamurthy and Moore, 1993; Slingsby, 94 1992) since the complete likelihood is optimized by a second order method, although no proof of convergence is provided. Among the methods based on RMLE, Ryden (1998, 1997) provided convergence analysis for his block-wise RMLE. He proved the consistency by using the classical result of Kushner and Clark (1978). Independently, LeGland and Mevel (1995, 1997) proved the convergence and asymptotic normality of their symbol-wise RMLE, under the assumption that the transition probability matrix is primitive, yet without any stationary assumption. This is accomplished by using the geometric ergodicity and the exponential forgetting of the predictive filter and its gradient. Krishnamurthy and Yin (2002) extended the RMLE results of LeGland and Mevel (1997) to auto-regressive models with Markov regime, and added results on convergence, rate of convergence, model averaging, and parameter tracking. In particular, they showed that the above assumption can be relaxed to the condition in which the transition probability matrix is aperiodic and irreducible. The authors also suggested using the iterate averaging (Polyak, 1991), as well as observation averaging for faster convergence and lower variances. RPE based techniques were first extended and applied for recursive estimation of HMM parameters due to their quadratic convergence rate (although asymptotically sub-optimal). This convergence speed comes at the expenses of an increased complexity, and as discovered later, the RPE algorithm proposed by Collings et al. (1994) suffers from numerical issues. Local convergence analysis is only shown for the RSPE (Ford and Moore, 1998a) using the ordinary differential equation. Finally, similar to the RMLE, LeGland and Mevel (1997) also proved the convergence of the RCLSE. 2.4.2 Time and Memory Complexity The time complexity of an algorithm, can be analytically defined as the sum of the worstcase running times T ime{p} for each operation p required to re-estimate HMM parameters 95 Table 2.1 Time and memory complexity for some representative block-wise algorithms used for on-line learning of a new sub-sequence o1:T of length T . N is the number of HMM states and M is the size of alphabet symbols Algo. Baldi and Chauvin (1998) Step γ and ξ # Multiplications 6N 2 T + 3N T − 6N 2 − N # Divisions N 2T + 3N T # Exp − N2 − N Memory N T + N 2 + 2N A 2N 2 N2 N2 2N 2 B 2N M 6N 2 T + 3N T − 4N 2 +2N M − N MN NM 2N M N 2 T + 3N T + M N − N N2 + NM N T + 3N 2 + 2N M + 2N Total ∇and Cappe et al. (1998), quasiNewton based γt 5N 2 T + 2N T 3N 2 T + N T N (N 2 + N M ) A 2N 2 N2 2N 2 B Total 2N 5N 2 T + 2N T + 2N 2 +2N N 4N 2 T + N T + 2N +N γ and ξ 6N 2 T + 3N T − 6N 2 − N N 2 T + 3N T − N 2 − N N T + N 2 + 2N A N2 N2 N2 B NM 6N 2 T + 3N T − 5N 2 +N M − N NM N 2 T + 3N T +N M − N NM Mizuno et al. (2000) Total 2N 2N 3 + 2N 2 N T + 2N 2 + N M + 2N based on new training sequence (block-wise) or new observation (symbol-wise): T ime = p T ime{p} = tp np p where tp is the constant time needed to execute an operation p (e.g., multiplication, division, etc.), and np is the number of times this operation is executed. The growth rate is then obtained by making the key parameters (T and N ) of the worst-case time complexity tend to ∞. For simplicity, only most costly type of operation is considered – multiplication, division, and exponent. Time complexity is independent from the convergence time, which varies among different algorithms as discussed in Section 2.4.1. Memory complexity is estimated as the number of 32 bit registers needed during learning process to store temporary variables. The worst-case memory space required for estimation of sufficient statistics such as conditional distributions of states and gradients of the log-likelihood is considered. 96 Tables 2.1 and 2.2 present a breakdown of time and memory complexity for some representative algorithms. Table 2.1 compares the complexity of block-wise algorithms applied to on-line learning of an observation sequence o1:T of length T and then updating the model parameters. Block-wise techniques typically employ a fixed-lag smoothing approach for estimating HMM states upon receiving each sub-sequence. In general, all these algorithms have the same time complexity, O(N 2 T ). However, EM-based block-wise techniques (Digalakis, 1999; Mizuno et al., 2000) are easier to implement than gradient-based (Baldi and Chauvin, 1994; Cappe et al., 1998; Singer and Warmuth, 1996) and RMLE (Ryden, 1998, 1997) techniques. EM-based techniques maintain the parameters constraints (2.2) and (2.3) implicitly, and do not employ derivatives and projections of the gradient filter. Accordingly, they require fewer computations to update the HMM parameters upon receiving an observation symbol (see Table 2.1). Although, block-wise algorithms have the same memory complexity of O(N T ), the memory requirements of the algorithm proposed by Cappe et al. (1998) is independent of the sequence length T . This is due to the recursion on the gradient itself and could be useful when T is large. However, it requires additional time complexity due to an internally employed line search technique. Table 2.2 compares the complexity of symbol-wise algorithms for on-line learning from a stream of observation symbols ot . The time complexity is based on the learning of one observation. Therefore, it must be multiplied by the length, T , of a finite observation sequence for comparison with block-wise techniques. Algorithms that requires more than N 2 computational complexity per time step may become less attractive when the number of HMM states N is large. As for block-wise techniques, EM-based symbolwise filtering and prediction techniques (Florez-Larrahondo et al., 2005; Stenger et al., 2001) are generally easier to implement than gradient-based (Collings and Ryden, 1998; Garg and Warmuth, 2003; LeGland and Mevel, 1995, 1997) and minimum prediction error (Collings et al., 1994; Ford and Moore, 1998a,b) techniques. On the other hand, although EM-based smoothing techniques (Cappe, 2009; Mongillo and Deneve, 2008; Stiller and Radons, 1999) have been shown to provide more accurate states estimate 97 Table 2.2 Time and memory complexity of some representative symbol-wise algorithms used for on-line learning of new symbol oi . N is the number of HMM states and M is the size of alphabet symbols Algorithms Stiller and Radons (1999), smoothing-based approach Florez-Larrahondo et al. (2005), prediction-based approach Step # Multiplications # Divisions Memory t (o ) ξijk t N4 + N2 2N 3 2N 3 M A = {aij (ot )} – N 2M N 2M Total N4 + N2 2N 3 + N 2 M 2N 3 M + N 2 M γ and ξ 5N 2 + N N2 + N N 2 + 3N A N2 2N 2 N2 B Total Slingsby (1992) +N NM +N 2N 2 + N M + 3N d, g,γ 2N 2 + N 2N 2 + N A N2 N 2 + 2N N2 Total N M + 3N 7N 2 + N M + 4N N M + 2N 3N 2 + N M + 5N NM 3N 2 + N M +N R1 and R2 N 3 + N 2M N2 + NM N 2 + N M + 4N A N2 2N 2 N2 B NM 2N M NM Total N 3 + N 2 (M + 1) + N M 3N 2 + 3N M 2N 2 + 2N M + 4N 3N 2 + M 2 + M N 2 + N M + 2N Rt−1 , ψt , Et Collings et al. (1994) 2N M 3N 2 + 2N M 6N 2 + N B Legland and Mevel (1995) NM 6N 2 + N M A B Total N 4 + 4N 3 + N 3 +5N 2 + 5M 2 + 3N M – (only N 2 additions) – N2 – (only N M additions) – NM N 4 + 4N 3 + N 3 +5N 2 + 5M 2 + 3N M 3N 2 + M 2 + M 2N 2 + 2N M + 4N than filtering and prediction (Anderson, 1999), their time complexity per observation is O(N 4 ), which makes them less attractive when the number of HMM states N is large. In general the memory requirement of symbol-wise techniques is independent of T , since the HMM re-estimation is either based the current symbol (filtering) or on small finite predictions – one symbol look-ahead or a larger fixed-lag smoothing. Block-wise algorithms typically require fewer computations than symbol-wise algorithms when learning from a large observation sequence, at the expenses of an increased memory requirements. This is mainly due to fewer update of HMM parameters that is required with block-wise learning algorithms. Figure 2.6 illustrates the time and memory complexity required by the EM-based symbol-wise prediction algorithm proposed by Florez- 98 Larrahondo et al. (2005) versus its block-wise fixed-lag smoothing counterpart proposed by Mizuno et al. (2000), when learning an observation sequence of length T = 1000, 000 symbols, with an output alphabet of size M = 50 symbols. While symbol-wise algorithms can be directly employed to learn such sequence of observations, block-wise algorithms require segmenting the sequence into non-overlapping sub-sequences using a sliding window of size W , which is also the fixed-lag value. 8 10 x 10 7 Florez Mizuno (W= 5) Mizuno (W= 10) Mizuno (W= 50) 4 Number of Multiplications Number of Multiplications 5 3 2 1 0 0 2 4 6 8 T x 10 Florez Mizuno (W= 5) Mizuno (W= 10) Mizuno (W= 50) 6 5 4 3 2 1 0 0 10 5 x 10 2 Florez Mizuno (W= 5) Mizuno (W= 10) Mizuno (W= 50) 800 1000 x 10 5 Memory Complexity Number of Multiplications 6 4 3 2 Florez Mizuno 4 3 2 1 1 0 0 10 5 x 10 4 10 5 8 (b) N = 100 states x 10 6 6 T (a) N = 5 states 7 4 20 40 60 80 N (c) T = 1000, 000 symbols 100 0 0 200 400 600 W (window size) (d) T = 1000, 000 symbols Figure 2.6 An example of the time and memory complexity of a block-wise (Mizuno et al. (2000)) versus symbol-wise (Florez-Larrahondo et al. (2005)) algorithm for learning an observation sequence of length T = 1000, 000 with an output alphabet of size M = 50 symbols As shown in Figure 2.6 (a), (b) and (c), block-wise algorithms are more time efficient with smaller number of states N and smaller window sizes W . While symbol-wise algo- 99 T W updates when learning the same observation sequence. When the N value increases the rithms require T updates of HMM parameters, block-wise algorithms require about computational time of block-wise algorithms may approach that of symbol-wise for large window sizes. In such cases, the computational time of parameters update is dominated by that of state densities which scales quadratically with N and linearly with W. Therefore, for a given N value, smaller window sizes should be favored since they reduce the time complexity for block-wise learning from a large observation sequence. The memory complexity is also decreased linearly with the window size as shown in Figure 2.6 (d). Increasing the fixed-lag value W may yield a more accurate local estimate of HMM states, and hence improves algorithms stability. If memory constraints are an issue, block-wise techniques that employ recursions on the gradient itself (Cappe et al., 1998) should be favored since their memory complexity are independent of W . When the application imposes stricter constraints on the memory space, symbol-wise algorithms should be employed. The strategy for selecting a value for W is thereby an additional issue to address with block-wise techniques. 2.5 Guidelines for Incremental Learning of HMM Parameters The target application implies a scenario for data organization and learning. Depending on the application, incremental learning may be performed on an abundant or limited amount of new data. In addition, this data may be organized into a block of observation sub-sequences or into one long observation sequence. Within the incremental learning scenario (Figure 2.1), HMM parameters should be re-estimated using the newly-acquired training data, without corrupting the previously acquired knowledge, and therefore degrading the system performance. One of the key issues is optimizing HMM parameters such that contributions of new data and pre-existing knowledge is balanced. The choice of user-defined parameters (e.g., learning rate and sub-sequence length) have a significant impact on HMM performance. The rest of this section provides guidelines and underscores challenges faced when applying on-line techniques to supervised incremental learning of new training data, when the data is either abundant or limited. 100 2.5.1 Abundant Data Scenario Under the abundant data scenario, it is assumed that a long sequence of training observations becomes available to update HMM parameters through incremental learning. A more complete view of phenomena is therefore presented to the learning algorithm. As stated previously, symbol-wise algorithms can be directly employed to learn such sequences, one observation at a time, while block-wise algorithms require buffering some amount of data using for instance a sliding window according to a user-defined buffer size W . When learning is performed in a static environment from a large sequence of observations, one view of the data is typically sufficient to capture the underlying structure by exploiting patterns redundancy. This is because the abundant data provide a more complete view of phenomena. Resource constraints would be another reason behind restricting the iterative learning procedure to one pass. Therefore, the convergence properties of an on-line algorithm must be known to determine the speed and limits of convergence behavior. As described in Section 2.4.1, when the true model is contained in the set of solutions, some algorithms are shown to be consistent and asymptotically normal by properly choosing a monotonically decreasing learning rate ηt and by applying PolyakRuppert averaging (Polyak, 1991). For on-line learning in static environment, decreasing step-sizes are essential conditions of convergence. Fixed step-sizes may yield the convergence of algorithms to oscillate around their limiting values with variance proportional to the step-size. However, some novelty criteria on the new data should be employed to reset the monotonically decreased learning rate when provided with a new sequence of observations. In contrast, when learning is performed in dynamically-changing environments, the notion of optimality is no longer valid because the on-line algorithm should forget past knowledge and adapt rapidly to the newly acquired information. Tracking non-stationary environments and handling slow drift is typically achieved by choosing a fixed learn- 101 ing rate η. For abrupt drift however, a data driven learning rate is required to detect changes and adapt the step-sizes according to the incoming data (Schraudolph, 1999; Stiller, 2003). In both static and dynamically-changing environments, algorithms with low time and memory complexity are favored to learn the long sequence of observations. As discussed in Section 2.4.2, symbol-wise algorithms that requires more than N 2 time complexity upon receiving each symbol are less attractive, especially when the number of HMM states N is large. For block-wise algorithms, smaller window sizes W are favored for reducing both time and memory complexity. However, the stability of algorithms must also be considered when selecting the window size. Some issues that require further investigation when adapting the HMM parameters from abundant data. Empirical benchmarking studies of these techniques could offer further insight on trade-offs for selecting algorithms and user-defined parameters for an application domain. For instance, the performance of techniques based on MLE (such as EM and gradient based) should be compared versus those based on MMD, and recursive techniques for MPE. An interesting comparison would also involve symbol-wise versus block-wise and filtering versus fixed-lag smoothing. Significant improvement, accuracy, speed and stability of convergence, computing resources, and amount of training data required to reach or maintain a given level of performance should be among the evaluation criteria. To this end, it is recommended to conduct non-parametric tests for statistical comparisons of multiple classifiers over multiple data sets (Demšar, 2006; García and Herrera, 2008). In addition to assessing the strength and weakness of each algorithm, such comparison may lead to efficient hybrid on-line algorithms that require fewer resources (see Section 2.2.2.3). 2.5.2 Limited Data Scenario Under the limited data scenario, it is assumed that a short sequence of observations becomes available to update the HMM parameters, providing therefore a limited view of 102 phenomena. For block-wise algorithms the sequence is typically segmented into shorter sub-sequences using a sliding window with a user-defined window size W . In contrast to the abundant data case, when provided with limited data, several iterations over the new sequence or block of observations are required to re-estimate HMM parameters. This local optimization raises additional challenges. Specialized strategies are needed for managing the learning rate which, at each iteration must balance integration of preexisting knowledge of the HMM and the newly-acquired training data. In addition, the learning technique may become trapped in local maxima on the cost function on the parameter space. To illustrate the difficulty, assume that the training block D2 , which contains one or multiple sub-sequences, becomes available after a HMM has been previously trained on a block D1 and deployed for operations. After training on D1 , the HMM has a parameter settings of λ1 . An on-line algorithm that optimizes, for instance, the loglikelihood function requires re-estimating the HMM parameters over several iterations until this function is maximized for D2 . The plain curve in Figure 2.7 represents the log-likelihood function of an HMM(λ1 ) that was previously learned D1 , and then has incrementally learned D2 over the space of on HMM parameter (λ). Since λ1 is the only information at hand from previous data D1 , the training process starts from HMM(λ1 ). Optimization of HMM parameters depends on the shape and position of the log-likelihood function of HMM(λ2 ) with respect to that of HMM(λ1 ). For instance, if λ1 was selected according to point (a) in Figure 2.7, then the optimization will probably lead to point (d), which degrades HMM performance. If λ1 was selected as point (b), the optimization would probably lead to a better solution (f ). In contrast, training a new HMM(λ2 ) on D2 by starting from the start on different initializations may lead to point (g) which, depending on the new data, may yield higher log-likelihood than all previous models. Combinations of HMM(λ1 ) and HMM(λ2 ) at the classifier level may however provide improved performance in such cases. Log-Likelihood 103 HMM (λ1 ) HMM (λ2 ) (g) (f ) (a) (b) (d) (e) (c) Optimization Space of one Model Parameter (λ) Figure 2.7 The dotted curve represents the log-likelihood function associated with and HMM (λ1 ) trained on block D1 over the space of one model parameter λ. After training on D1 , the on-line algorithm estimates λ = λ1 (point (a)) and provides HMM (λ1 ). The plain curve represents the log-likelihood function associated with HMM (λ2 ) trained on block D2 over the same optimization space Typical procedures for unbiased performance estimation should be employed. This involves stopping the training iterations when the log-likelihood of an HMM on independent validation data no longer improves. Using the hold-out or cross-validation procedures reduces the effects of overfitting. HMM parameters that yield the maximum log-likelihood value on other validation data are then selected, which guarantees generalization performance during operations. When the situation allows, some proportion of a new training data should be dedicated to validation. This would provide a better the stopping criterion and improve the generalization capabilities of the classification system. In order to perform such validation, a set of observations must be stored and updated over time in a fixed-sized buffer. Managing this validation set over time depends on the application environment and should be selected and maintained according to some relevant selection criteria (e.g., some informationtheoretic criteria). For instance, in dynamically-changing environments, older validation 104 data should be discarded and replaced with new observations data that is more representative of the underlying data distribution. In static or cycle-stationary environments, older data should be discarded in a first-in-first-out manner. A potential advantage of applying the on-line learning techniques over batch learning techniques to limited data scenario resides in their added stochasticity, which stems from the rapid re-estimation of HMM parameters. This may aid escaping local maxima during the early adaptation to newly provided data. Managing the internal learning rate, associated with each sub-sequence or observation symbol, is therefore different for limited data than for abundant data. In general, employing a fixed learning rate along with the validation strategies described above would maintaining the level of performance in both static and dynamically-changing environments. However, more insights are required in this regard to determine if applying a monotonically decreasing or auto-adaptive learning rates at the iteration, sequence, or symbol levels are beneficial for escaping local maxima and hence improving the performance. The resource requirements for the on-line algorithms when learning from limited data are comparable to that of when learning from abundant data scenario and presented in Section 2.4.2 except for abundant data T is much bigger. The only difference is that the complexity presented for both block and symbol wise techniques must be multiplied by the number of training iterations, which varies according to the algorithm employed, validation strategy and stopping criteria, and training data. Learning from limited data is less restrictive to memory and time complexity than learning from abundant data. Therefore, even the most compute-intensive symbol-wise algorithms – requiring O(N 4 ) time complexity upon receiving each symbol – or block-wise algorithms – requiring O(N 4 T ) time and O(N T ) memory complexity upon receiving each sub-sequence – can be afforded as long as they improve or maintain the HMM performance. In limited data scenario, escaping local maxima, managing learning rates and stopping criteria may be critical factors for selecting an on-line algorithm than minimizing the resource requirements. Finally, as for the abundant data scenario, applying the on-line learning techniques from 105 this survey to adapt HMM parameters to limited new data requires comparative benchmarking and statistical testing at the objective functions, optimization techniques and algorithms (symbol-wise versus block-wise) levels, and according to various user-defined parameters (learning rate and window size). 2.6 Conclusion This chapter present a survey of techniques found in literature that are suitable for incremental learning of HMM parameters. These include on-line learning algorithms that have been initially proposed for learning from a long training sequence, or block of sub-sequences. In this chapter, these techniques are categorized according to the objective functions, optimization techniques and target application. Some techniques are designed to update HMM parameters upon receiving a new symbol (symbol-wise), while others update the parameters upon receiving a sequence (block-wise). Convergence properties of these techniques are presented, along with an analysis of their time and memory complexity. In addition, the challenges faced when these techniques are applied to incremental learning is assessed for scenarios in which the new training data is limited and abundant. When the new training data is abundant (scenario 1), the HMM parameters should be re-estimated to fit the source generating the data through one pass over a sequence of observations. When new data corresponds to a long sequence or block of sub-sequences generated from a stationary source (static environment), these techniques must employ a specialized strategy to manage the learning rate such that new data and existing knowledge are integrated without compromising the HMM performance. This includes resetting a monotonically decreasing learning rate when provided with new data or employing an auto-adaptive learning rate. Rapid convergence with limited resource requirements are other major factors in such cases. However few learning algorithms have been provided with a proof of convergence or other relevant statistical properties. Another important issue is the ability to operate in dynamically-changing environments. In such cases, some 106 novelty criteria on the new data should be employed to detect changes and trigger adaptation by resting a monotonically decreasing learning rate or fine-tuning an auto-adaptive learning rate. When the new training data is limited (scenario 2), HMM parameters should be reestimated over several iterations due to the limited view of phenomena. Therefore, HMM parameters should be able to escape local maxima of the cost function associated with the new data. Given the limited data, early stopping criteria through hold-out or cross validation must be considered when learning the new block of data to reduce the effects of overfitting. Accumulating and updating representative validation data set over time must be considered. However, this requires investigating some selection criteria for maintaining the most informative sequences of observations and discarding less relevant ones. Another major challenge resides in adapting the current operational HMM to the newly-acquired data without corrupting existing knowledge. One possible solution involves using an adaptive learning rate which, at each iteration, controls the weight given to the current HMM with reference to the new information. This learning rate should preferably be inferred from the data. Finally, this chapter underscores the need for comparative benchmarking studies of these on-line algorithms for incremental learning in both abundant and limited data scenarios. Some interesting comparative studies, evaluation criteria and statistical testing methods are suggested for selecting an algorithm for a particular situation. Knowledge corruption and performance degradation may arise because incremental learning of new data is initiated from a pre-existing HMM. One promising solution involves combining HMMs at either classification or response levels. In such cases, new HMMs would be trained using the newly-acquired data, and combined with those trained on the previously-acquired data. Next chapter describes the novel decision-level Boolean combination technique proposed for efficient fusion of the responses from multiple classifiers in the ROC space. CHAPTER 3 ITERATIVE BOOLEAN COMBINATION OF CLASSIFIERS IN THE ROC SPACE: AN APPLICATION TO ANOMALY DETECTION WITH HMMS∗ In this chapter, a novel Iterative Boolean Combination (IBC ) technique is proposed for efficient fusion of the responses from multiple classifiers in the ROC space. It applies all Boolean functions to combine the ROC curves corresponding to multiple classifiers, requires no prior assumptions, and its time complexity is linear with the number of classifiers. The results of computer simulations conducted on both synthetic and real-world Host-Based intrusion detection data indicate that the IBC of responses from multiple HMMs can achieve a significantly higher level of performance than the Boolean conjunction and disjunction combinations, especially when training data is limited and imbalanced. The proposed IBC is general in that it can be employed to combine diverse responses of any crisp or soft one- or two-class classifiers, and for wide range of application domains. 3.1 Introduction Intrusion Detection Systems (IDS) is used to identify, assess, and report unauthorized computer or network activities. Host-Based IDSs (HIDS) are designed to monitor the activities of a host system and state, while network-based IDSs (NIDS) monitor the network traffic for multiple hosts. HIDSs and NIDSs have been designed to perform misuse detection and anomaly detection. Anomaly-based intrusion detection allows to detect novel attacks for which the signatures have not yet been extracted (Chandola et al., 2009a). In practice, anomaly detectors will typically generate false alarms due in large part to the limited data used for training, and to the complexity of underlying data ∗ THIS CHAPTER IS PUBLISHED AS AN ARTICLE IN PATTERN RECOGNITION JOURNAL, DOI: 10.1016/J.PATCOG.2010.03.006 108 distributions that may change dynamically over time. Since it is very difficult to collect and label representative data to design and validate an Anomaly Detection Systems (ADS), its internal model of normal behavior will tend to diverge from the underlying data distribution. In HIDSs applied to anomaly detection, operating system events are usually monitored. Since system calls are the gateway between user and kernel mode, traditional host-based anomaly detection systems monitor deviation in system call sequences. Forrest et al. (1996) confirmed that short sequences of system calls are consistent with normal operation, and unusual burst will occur during an attack. Their anomaly detection system, called Sequence Time-Delay Embedding (STIDE), is based on look-up tables of memorized normal sequences. During operations, STIDE must compare each input sequence to all “normal” training sequences. The number of comparisons increases exponentially with the detector window size. Moreover, STIDE is often used for design and validation of other state-of-the-art detectors. Various neural and statistical detectors have been applied to learn the normal process behavior through system call sequences (Warrender et al., 1999). Among these, techniques based on discrete Hidden Markov Models (HMMs) (Rabiner, 1989) have been shown to produce a very high level of performance (Warrender et al., 1999). A well trained HMM is able to capture the underlying structure of the monitored application and detect deviations from “normal” system call sequences. Once trained, an HMM provides a fast and compact detector, with tolerance to noise and uncertainty. Designing an HMM for anomaly detection involves estimating HMM parameters and order (number of hidden states, N ) from the training data. The value of N has a considerable impact not only on the detection rate but also on the training time. In the literature on HMMs applied to anomaly detection (Gao et al., 2002; Hoang and Hu, 2004; Warrender et al., 1999), the number of states is often chosen heuristically or em- 109 pirically using validation data1 . In addition, HMMs are designed to provide accurate results for a particular window size and anomaly size. Therefore, a single best HMM will not provide a high level of performance over the entire detection space. In a previous work (Khreich et al., 2009b), the authors proposed a multiple-HMM (μ-HMM) approach, where each HMM is trained using a different number of hidden states. During operations, each HMM outputs a probability that the HMM produced the input sequence. HMM responses are combined in the Receiver Operating Characteristics (ROC) space according to the Maximum Realizable ROC (MRROC) technique (Scott et al., 1998). This technique is robust in imprecise environments where for instance prior probabilities and/or classification costs may change (Provost and Fawcett, 1997), and can be always applied in practice without any assumption of independence between detectors (Scott et al., 1998). Results have shown that this μ-HMMs approach can provide a significant increase in performance over a single best HMM and STIDE (Khreich et al., 2009b). Boolean functions – especially the conjunction AND and disjunction OR operations – have recently been investigated to combine crisp or soft detectors within the ROC space. Successful applications for such combination include machine learning (Barreno et al., 2008; Fawcett, 2004), biometrics (Tao and Veldhuis, 2008), bio-informatics (Haker et al., 2005; Langdon and Buxton, 2001), automatic target recognition (Hill et al., 2003; Oxley et al., 2007). Boolean Combination (BC) based on conjunction or disjunction has been shown to improve performance over the MRROC in many applications, yet requires idealistic assumptions in which the detectors are conditionally independent and their respective ROC curves are smooth and proper. In contrast, applying all Boolean functions, using an exhaustive brute-force search to determine optimal combinations leads to an exponential explosion of combinations, which is prohibitive even for a small number of crisp detectors (Barreno et al., 2008). 1 This is also the case in many other applications of ergodic HMMs. However, recently nonparametric Bayesian approaches have been proposed to overcome HMMs order selection (Beal et al., 2002; Gael et al., 2008). 110 In practice, neural and statistical classifiers are typically trained with limited amount of representative data, and class distributions are often complex and imbalanced, which leads to concavities on empirical ROC curves, and to poor performance. This issue is especially critical in binary classification problems, such as anomaly detection, where samples from the positive class2 are inherently rare, and are costly to analyze. In such cases the ROC curve typically comprises large concavities. Some authors have argued that ROC curves can be simply repaired using the ROCCH prior to the conjunction or disjunction combinations (Haker et al., 2005). However this technique is inefficient, since the ROCCH selects thresholds of the locally superior points and discard the rest. This leads to a loss of diverse information which could be used to improve performance. In this chapter, the problem of ROC-based combination is addressed in the general case, where detectors are trained with limited and imbalanced training data. An Iterative Boolean Combination (IBCALL )3 technique is proposed to efficiently combine the responses from multiple detectors in the ROC space. In contrast with most techniques in literature, where only the AND or OR are investigated, the IBCALL exploits all Boolean functions applied to the ROC curves, and it does not require any prior assumption regarding the independence of the detectors and the convexity of ROC curves. At each iteration, the IBCALL selects the combinations that improve the ROCCH and recombines them with the original ROC curves until the ROCCH ceases to improve. Although it seeks a sub-optimal set of combinations, the IBCALL is very efficient in practice and does not suffer from the exponential explosion (Barreno et al., 2008), and it provides a higher level of performance than related techniques in literature (Haker et al., 2005; Scott et al., 1998; Tao and Veldhuis, 2008). The IBCALL technique can be applied when training data is limited and test data is heavily imbalanced. Another advantage of the proposed technique is its ability to repair the concavities in the ROC curve when applied to combine the responses of the same ROC curve. The IBCALL is general in the sense 2 The positive or target class is typically the class of interest for the detection problem. For anomaly detection, the target class corresponds to the intrusive or abnormal class. 3 The subscript “ALL” is used to emphasize that the IBC technique employs all Boolean functions. 111 that can be applied to combine the responses of any soft, crisp, or hybrid detector in the ROC space, whether the corresponding curves result from the same detector trained on different data or trained according to different parameters, or from different detectors trained on the same data. During computer simulations, multiple HMMs are applied to anomaly detection based on system calls. The performance obtained by combining the responses of μ-HMMs in ROC space with the proposed IBCALL technique is compared to that of MRROC fusion (Scott et al., 1998), to that of the conjunction (BCAN D ) and disjunction (BCOR ) combinations (Haker et al., 2005; Tao and Veldhuis, 2008), and to that of STIDE (for reference). To investigate the effect of repairing the concavities of the ROC curves on the combinations, each ROC curve is first repaired using the IBCALL technique, and then all curves are combined with the MRROC. This is also compared to the Largest Concavity Repair (LCR) proposed in (Flach and Wu, 2005). The impact on performance of using different training set sizes, detector window sizes and anomaly sizes is assessed through the area under the curve (AUC) (Hanley and McNeil, 1982), partial AUC (Walter, 2005), and true positive rate (tpr) at a fixed false positive rate (f pr). The experiments are conducted on both synthetically generated data and sendmail data from the University of New Mexico (UNM) data sets. The rest of this chapter is organized as follows. The next section describes the application of discrete HMMs in anomaly-based HIDS. In Section 3.3, existing techniques for combination of detectors in the ROC space are presented. The proposed technique for Iterative Boolean Combination of detector responses is presented in Section 3.4. Section 3.5 presents the experimental methodology (data sets, evaluation methods and performance metrics) used for proof-of-concept computer simulations. Finally, simulation results are presented and discussed in Section 3.6. 112 3.2 Anomaly Detection with HMMs A discrete-time finite-state HMM is a stochastic process determined by the two interrelated mechanisms – a latent Markov chain having a finite number of states, and a set of observation probability distributions, each one associated with a state. At each discrete time instant, the process is assumed to be in a state, and an observation is generated by the probability distribution corresponding to the current state. HMM parameters are usually trained using the Baum-Welch (BW) algorithm (Baum et al., 1970) – a specialized expectation maximization technique to estimate the parameters of the model from the training data. Theoretical and empirical results have shown that, given an adequate number of states and a sufficiently rich set of observations, HMMs are capable of representing probability distributions corresponding to complex real-world phenomena in terms of simple and compact models, with tolerance to noise and uncertainty. For further details regarding HMM the reader is referred to the extensive literature (Ephraim and Merhav, 2002; Rabiner, 1989). Discrete output distributions k a22 b1 (k) 1 0 a12 12 M S1 a11 aij S2 a21 0 12 M a32 a23 a13 S3 Observations M bj (k) = P (ot = vk | qt = j) b2 (k) 1 o1 , o2 , . . . , oT 2 1 vk bj (k) b3 (k) 1 0 a33 a31 = P (qt+1 = j | qt = i) Hidden states distributions vk Hidden states N j 12 M i aij q1 , q 2 , . . . , q T 1 2 t 1 2 T Figure 3.1 Illustration of a fully connected three state HMM with a discrete output observations (left). Illustration of a discrete HMM with N states and M symbols switching between the hidden states qt and generating the observations ot (right). The state qt = i denotes that the state of the process at time t is Si Formally, a discrete-time finite-state HMM consists of N hidden states in the finitestate space S = {S1 , S2 , ..., SN } of the Markov process. Starting from an initial state Si , 113 determined by the initial state probability distribution πi , at each discrete-time instant, the process transits from state Si to state Sj according to the transition probability distribution aij (1 ≤ i, j ≤ N ). As illustrated in Figure 3.1, the process then emits a symbol v according to the output probability distribution bj (v), which may be discrete or continuous, of the current state Sj . The model is therefore parametrized by the set λ = (π, A, B), where vector π = {πi } is initial state probability distribution, matrix A = {aij } denotes the state transition probability distribution, and matrix B = {bj (k)} is the state output probability distribution. Estimating the parameters of a HMM requires the specification of its order (i.e., the number of hidden states N ). The value of N may have a significant impact on both detection performance and training time. The time and memory complexity of BW training is O(N 2 T ) and O(N T ) respectively, for a sequence of length T symbols. In the literature on HMMs applied to anomaly detection (Gao et al., 2002; Hoang and Hu, 2004; Warrender et al., 1999), the number of states is often overlooked and typically chosen heuristically. In addition, a single “best” HMM will not provide a high level of performance over the entire detection space. A multiple-HMMs (μ-HMMs) approach, where each HMM is trained using a different number of hidden states, and where HMM responses are combined according to the MRROC has significantly improved performance over a single best HMM and STIDE (Khreich et al., 2009b). 3.3 Fusion of Detectors in the Receiver Operating Characteristic (ROC) Space A crisp detector (e.g., STIDE) outputs only a class label and produces a single operational data point in the ROC plane. In contrast, a soft detector (e.g., HMM) assigns scores or probabilities to the input samples, which can be converted to a crisp detector by setting a threshold on the scores. Given a detector and a set of test samples, the true positive rate (tpr) is the proportion of positives correctly classified over the total number of positive samples. The false positive rate (f pr) is the proportion of negatives incorrectly 114 classified (as positives) over the total number of negative samples. A ROC curve is a two-dimensional curve in which the tpr is plotted against the f pr. A parametric ROC curve typically assumes that a pair of normal distributions underlies the data (Hanley, 1988; Metz, 1978) and increases monotonically with the f pr. In practice, an empirical ROC curve may be obtained by connecting the observed (tpr, f pr) pairs of detectors, therefore it makes minimal assumptions (Fawcett, 2004). Given two operating points, say i and j, in the ROC space, i is defined as superior to j if f pri ≤ f prj and tpri ≥ tprj . If one ROC curve has all its points superior to those of another curve, it dominates the latter. If a ROC curve has tpri > f pri for all its points i then, it is a proper ROC curve. Finally, the area under the ROC curve (AUC) is the fraction of positive–negative pairs that are ranked correctly (see Section 3.5.3). The rest of this section provides an overview of techniques in literature for combining detectors and repairing curves within the ROC space. It includes the stochastic interpolation of the Maximum Realizable ROC (MRROC) or ROCCH, and the conjunction and disjunction rules for combining crisp and soft detectors. 3.3.1 Maximum Realizable ROC (MRROC) Given that the underlying distributions are generally considered fixed, a parametric ROC curve will always be proper and convex. In practice, an ROC plot is a step-like function which approaches a true curve as the number of samples approaches infinity. An empirical ROC curve is therefore not necessarily convex and proper as illustrated in Figure 3.2. Concavities indicate local performance that is worse than random behavior. These concavities occur with unequal variances between classes, with large skew in the data, or when the classifier is unable to capture the modality of the data (e.g., classifying multimodal data with a linear classifier). As shown in Figure 3.2, the ROCCH of an empirical ROC is the piece-wise outer envelope connecting only the superior points of an ROC with straight lines. The ROCCH of a single crisp detector connects its resulting point to the (0, 0) and (1, 1) points. The ROCCH 115 is also applicable to multiple detectors which may be crisp or soft. If one detector has a dominant ROC curve over f pr values, its convex hull constitutes the overall ROCCH. When there is no clear dominance among multiple detectors across different regions of the ROC space, the ROCCH corresponds to the envelope connecting the superior points in each region (Figure 3.2a). The ROCCH formulation has been proven to be robust in imprecise environments (Provost and Fawcett, 1997). As environmental conditions change, such as prior probabilities and/or costs of errors, only the portion of interest will change, not the hull itself. This change of conditions may lead to shifting the optimal operating point to another threshold or classifier on the convex hull. tnr 0.8 f nr Cb Cc tnr 1 f nr 1 Cb 0.8 Cc tpr 0.6 tpr 0.6 Ca 0.4 0.4 Class 1 Class 1 Class 2 Class 2 0.2 0 0 0.2 0.2 0.4 f pr 0.6 0.8 (a) Proper ROC curve 1 0 0 0.2 0.4 f pr 0.6 0.8 1 (b) Improper ROC curve Figure 3.2 Illustration of the ROCCH (dashed line) applied to: (a) the combination of two soft detectors (b) the repair of concavities of an improper soft detector. For instance, in (a) the composite detector Cc is realized by randomly selecting the responses from Ca half of the time and the other half from Cb The ROCCH is useful for combining detectors. The idea is based on a simple interpolation between the corresponding detectors (Provost and Fawcett, 2001; Scott et al., 1998). In practice, this is achieved by randomly alternating detectors responses proportionately between the two corresponding vertices of the line segment on the convex hull where the desired operational point lies. This approach has been called the maximum realizable 116 ROC (MRROC) by Scott et al. (1998) since it represents a system that is equal to, or better than, all the existing systems for all Neyman-Pearson criteria. Hereafter, the acronyms ROCCH and MRROC will be used interchangeably. The performance of the composite detector Cc can be readily set to any point along the line segment, connecting Ca and Cb , simply by varying the desired f prc and thus the ratio between the number of times one detector is used relative to the other. The MRROC considers only the responses of detectors that lie on the facet of the ROCCH, since they are potentially “optimal”, and discards the responses not touching the ROCCH (Provost and Fawcett, 1997; Scott et al., 1998). However, this may degrade the performance due to loss of information since inferior detectors are not exploited for decisions. Indeed, these detectors may contain valuable diverse information. As detailed next, some combination techniques have been proposed in literature to improve upon the ROCCH. 3.3.2 Repairing Concavities Repairing concavities is useful in itself for situations with only one detector with some concavities in its ROC curve. Repairing these concavities could be useful to improve the performance. The MRROC may be used to repair the ROC concavities by discretizing and interpolating between thresholds of superior points. Flach and Wu proposed a technique for largest concavity repairing (LCR) (Flach and Wu, 2005). It consists of inverting the largest concavity section with reference to the mid point of the local line segment on the ROCCH. The intuition is that points underneath the ROCCH can be mirrored in the same way negative classifiers (under the ascending diagonal) can be inverted. The LCR consist in determining the two thresholds on the ROCCH, limiting the section to be inverted, and then in negating the responses with reference to the mid point of the corresponding line segment of the ROCCH. In some cases the LCR may provide a higher level of performance than the MRROC. 117 3.3.3 Conjunction and Disjunction Rules for Crisp Detectors The Boolean conjunction (AND) and disjunction (OR) fusion functions were first introduced for combining crisp detectors (Daugman, 2000). Other authors such as Fawcett (2004) have noted that it is sometimes possible to find nonlinear combinations of detectors which produce an ROC exceeding their convex hulls. As illustrated in Figure 3.3, the conjunction rule decreases the f pr at the expenses of decreasing the tpr, thus providing a more conservative performance than each of the original detectors. Analogously, the disjunction rule increases the tpr at the expenses of increasing the f pr, providing a more aggressive performance that each of the original detectors. When these fusions are considered alone – outside the ROC space – their achieved performance may not be of considerable interest as advocated by Daugman (2000). However, depending on detector interaction, within the ROC space, these fusion rules may produce a new convex hull that is superior than that of existing detectors alone. In addition, the new MRROC curve provides the flexibility of choosing any operating point which lies on its hull by interpolating between the relevant vertices as described previously. The conditional independence assumption among the detectors simplifies the computation. In this cases, the combination rules depend only on the true and false positive rates. Let (tpra , f pra ) and (tprb , f prb ) be the true positive and false positive rates of detectors Ca and Cb , respectively. Under the conditional assumption, the performance of the composite crisp detectors Cc is given in Table 3.1 (Black and Craig, 2002; Fawcett, 2004). These formulas stem from the conditional independence of probability. For instance, the probability that both detectors correctly classify a positive test sample is given by: P11|1 = Pr(Ca = 1, Cb = 1 | 1) = Pr(Ca = 1 | 1) Pr(Cb = 1 | 1) = tpra tprb . 118 tnr f nr 1 Cb 0.8 Ca MRROC tpr 0.6 0.4 Ca ∨ Cb Ca ∧ Cb 0.2 0 0 0.2 0.4 f pr 0.6 0.8 1 Figure 3.3 Examples of combination of two conditionally-independent crisp detectors, Ca and Cb , using the AND and OR rules. The performance of the their combination is shown superior to that of the MRROC. The shaded regions are the expected performance of combination when there is an interaction between the detectors Table 3.1 Combination of conditionally independent detectors P11|1 = tpra tprb P10|1 = (1 − tpra )tprb P01|1 = tpra (1 − tprb ) P00|1 = (1 − tpra )(1 − tpra ) P11|0 = f pra f prb P10|0 = (1 − f pra )f prb P01|0 = f pra (1 − f prb ) P00|0 = (1 − f pra )(1 − f prb ) As mentioned, this assumption is violated in most real-world applications. In the more realistic conditionally dependent case, the performance of the composite crisp detectors Cc is given in Table 3.2. The performance now depends on the positive (P11|1 ) and negative (P00|0 ) correlations between detectors (Black and Craig, 2002). That is, the joint distributions of both detectors are required, and the performance can now be anywhere in the shaded regions of Figure 3.3. An attempt to characterize this dependence is given by Venkataramani and Kumar (2006), where the correlation coefficient between the detector scores is shown to be useful for predicting the “best” decision fusion rule, as well as for evaluating the quality of 119 Table 3.2 Combination of conditionally dependent detectors tpra tprb < P11|1 < min(tpra , tprb ) P10|1 = tpra − P11|1 P01|1 = tprb − P11|1 P00|1 = 1 − tpra − tpra + P11|1 (1 − f pra )(1 − f prb ) < P00|0 < min(1 − f pra , 1 − f prb ) P10|0 = (1 − f prb ) − P00|0 P01|0 = (1 − f pra ) − P00|0 P11|0 = f pra + f prb − 1 + P00|0 detectors. More recently, in order to avoid the restrictive conditional assumption among detectors, the combination rules were extended to include all Boolean functions (Barreno et al., 2008). By ranking these combinations according to their likelihood ratios, the optimal rules can be obtained. However, due to the doubly exponential explosion n of combinations – for n detectors there is 2n possible outputs resulting in 22 possible combinations – the proposed global search for the optimal rules is impractical. 3.3.4 Conjunction and Disjunction Rules for Combining Soft Detectors Several authors have proposed the application of the Boolean AND and OR fusion functions to combine soft detectors. For a pair-wise combination, the fusion function is applied to each threshold on the first ROC curve with respect to each threshold on of the second curve. The optimum threshold, as well as the combination function, is then found according to the Neyman-Person test (Neyman and Pearson, 1933). That is for each value of the f pr, the point which has the maximum tpr value is selected, along with the corresponding thresholds and Boolean function to be used during operations. Haker et al. (2005) proposed to apply the AND and OR functions to combine a pair of soft detectors under the assumption of conditional independence between detectors (Table 3.1), and when both detectors are proper and convex. The authors proposed a set of “maximum likelihood combination” rules (see Table 3.3) to select the combination rules or original detectors to be employed. For instance, the AND rule is selected if 120 Table 3.3 The maximum likelihood combination of detectors Ca and Cb as proposed by Haker et al. (2005) Ca 1 1 0 0 Cb 1 0 1 0 Selection rules for Cc tpra tprb ≥ f pra f prb tpra (1 − tprb ) ≥ f pra (1 − f prb ) (1 − tpra )tprb ≥ (1 − f pra )f prb (1 − tpra )(1 − tprb ) ≥ (1 − f pra )(1 − f prb ) the first condition (Ca = 1, Cb = 1 in Table 3.3) is exclusively true, while the OR rule is selected when the first three conditions are true. Otherwise one of the individual detectors is selected. This selection may however discard important combinations since for two thresholds several combination rules may emerge. For example, by computing these theoretical conditions for the ROC curves shown in Figure 3.3, one can observe that the first and third conditions are verified in Table 3.3, hence only Cb is considered for these two points. However, as shown in Figure 3.3, both the AND and OR combination improve the performance over the original detectors. Tao and Veldhuis (2008) applied and compared AND versus OR Boolean function separately for combining multiple ROC curves using a pair-wise combination. They showed that the OR rule emerges most of the time in their biometrics application. Oxley et al. (2007) proved that, under the independence assumption between detectors, a Boolean algebra of families of classification systems is isomorphic to a Boolean algebra of ROC curves. Shen (2008) tried to characterize the effect of correlation on the AND and OR combination rules, using a bivariate normal model. He showed that discrimination power is higher when the correlation is of opposite sign. Most research has addressed the problem of Boolean combinations under the assumption of smooth, convex and proper ROC curves. Such curves results from a parametric models, or when the data is abundant for both classes. In the ideal case, when both conditional independence and convexity assumptions are fulfilled, the AND and OR combinations are proven to be optimal, providing a higher level of performance than the original ROC curves (Barreno et al., 2008; Thomopoulos et al., 1989; Varshney, 1997). When provided 121 with limited and imbalanced data for training and validation, the ROC curves may be improper and large concavities will appear. When either one of the assumptions is violated, the performance of these combinations will be sub-optimal. In contrast to crisp detectors, the correlation between soft detector decisions also depends on the thresholds selection. At different thresholds on two ROC curves the conditional independence may be violated and different type of correlations may be introduced. 3.4 A Boolean Combination (BCALL ) Algorithm for Fusion of Detectors 3.4.1 Boolean Combination of Two ROC Curves In this chapter, a general technique for Boolean combination (BCALL ) is proposed for fusion of detector responses in the ROC space. In particular, this algorithm can exploit information from the ROC curves when detectors are trained from limited and imbalanced data. In this case, the ROC curves typically comprise concavities and are not necessarily proper. Figure 3.4 presents the block diagram of a system that combines the responses of two HMMs in the ROC space according to the BCALL technique. It involves fusing responses of detectors using all Boolean functions, prior to applying the MRROC. In contrast with most work in literature, where either AND or OR functions are applied, the BCALL technique takes advantage of all Boolean functions applied to the ROC curves and selects those that improve the ROCCH. The main steps of BCALL are presented in Algorithm 3.1. The BCALL technique inputs a pair of ROC curves defined by their decision thresholds, Ta and Tb , and the labels for the validation set. Using each of the ten Boolean functions (refer to IV.1), BCALL fuses the responses of each threshold from the first curve (Rai ) with the responses of each threshold from the second (Rbi )4 . Responses of the fused thresholds are then mapped to points (f pr, tpr) in the ROC space. The thresholds of points that exceeded the original ROCCH 4 For each ROC curve, the matrix of responses associated with the thresholds can be directly input to Algorithm 3.1 instead of converting each threshold to corresponding responses at line 8 and 10. Although it is not useful for combining two curves and it increases the memory requirements, the matrix of responses is needed to recombine resulting responses with another curve as in the following algorithms. 122 Input O = o1:T HMM1 Pr(O | λ1 ) (λ1 ) Boolean Combination (BCALL ) HMM2 Pr(O | λ2 ) (λ2 ) 1 S ∗ global ≥ T∗global p̂, (Y = 1) n̂, (Y = 0) 0.8 tpr 0.6 ROC curve of λ1 ROC curve of λ2 Original MRROC Emerging BCALL pts MRROC after BCALL 0.4 0.2 0 0 0.2 0.4 0.6 f pr 0.8 1 Figure 3.4 Block diagram of the system used for combining the responses of two HMMs. It illustrates the combination of HMM responses in the ROC space according to the BCALL technique of original curves are then stored along with their corresponding Boolean functions. The ROCCH is then updated to include the new emerging points. When the algorithm stops, the final ROCCH is the new MRROC in the Newman-Pearson sense. The outputs are the vertices of the final ROCCH, where each point is the results of two thresholds from the ROC curves fused with the corresponding Boolean function. These thresholds ∗ and Boolean functions form the elements of Sglobal , and are stored and applied during operations, as illustrated in Figure 3.4. The BCALL technique makes no assumptions regarding the independence of the detectors. Instead of fusing the points of ROC curves under the independence assumptions (Table 3.1), this techniques directly fuses the responses of each decision threshold, accounting for both independent and dependent cases. In fact, by applying all Boolean functions to combine the responses for each threshold (line 11 of Algorithm 3.1), it implicitly accounts for the effects of correlation (see Table 3.2). This is due to the direct fusion of responses, which considers the joint conditional probabilities of each detector at each threshold. Furthermore, the BCALL always provides a level of performance that 123 Algorithm 3.1: BCALL (Ta , Tb , labels): Boolean combination of two ROC curves Input: Thresholds of ROC curves, Ta and Tb , and labels (of validation set) Output: ROCCH and fused responses (Rab) of combined curves, where each point is the result of two fused thresholds along with the corresponding Boolean function (bf ) 1 let m ← number of distinct thresholds in Ta 2 let n ← number of distinct thresholds in Tb 3 Allocate F an array of size: [2, m × n] // holds temporary results of fusions 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 BooleanF unctions ← {a ∧ b, ¬a ∧ b, a ∧ ¬b, ¬(a ∧ b), a ∨ b, ¬a ∨ b, a ∨ ¬b, ¬(a ∨ b), a ⊕ b, a ≡ b} Compute ROCCHold of the original curves foreach bf ∈ BooleanF unctions do for i = 1, . . . , m do Ra ← (Ta ≥ Tai ) // converting threshold of 1st ROC to responses for j = 1, . . . , n do Rb ← (Tb ≥ Tbj ) // converting threshold of 2nd ROC to responses Rc ← bf (Ra , Rb ) // combined responses with bf Compute (tpr, f pr) using Rc and labels Push (tpr, f pr) onto F Compute ROCCHnew of F Store thresholds and corresponding Boolean functions that exeeded the ROCCHold , ∗ Sglobal ← (Tax , Tby , bf ) // to be used during operations Store the responses of these emerging points into R // to be used with BCMALL and IBCALL ROCCHnew ← ROCCHold // Update ROCCH ∗ Return ROCCHnew , R, Sglobal is equivalent or higher than that of the MRROC of the original ROC curves. In the worst-case scenario, when the responses of detectors do not provide diverse information, or when the shape of the ROC curve on the validation set differs significantly from that of the test set, the BCALL is lower bounded by the MRROC of the original curves. Including all Boolean functions accommodates for the concavities in the curves. Indeed, AND and OR rules will not provide improvements for the inferior points that correspond to concavities and make for an improper ROC curve, or points that are close to the diagonal line in the ROC space. Other Boolean functions, for instance those that exploit negations of responses, may however emerge. The BCALL technique can therefore be applied even when training and validation data are limited and heavily imbalanced, 124 to combine the decisions of any soft, crisp, or hybrid detectors in the ROC space. This includes combining the responses of the same detector trained on different data or features or trained according to different parameters, or from different detectors trained on the same data, etc. 3.4.2 Boolean Combination of Multiple ROC Curves Different strategies may be implemented for combining multi-ROC curves. A commonly proposed strategy for a cumulative combination of ROC curves is to start by any pair of the ROC curves then combine the resulting responses with the third, then with the fourth and so on, until the last ROC curve (Pepe and Thompson, 2000; Tao and Veldhuis, 2008). As described in Algorithm 3.2, the thresholds (T1 and T2 ) of first two ROC curves are initially combined with the BCALL technique. Then, their combined responses (R1 ) are directly input into line 8 of Algorithm 3.1 and combined with the thresholds of the third ROC curve (T3 ). A pair-wise combination of ROC curves is another alternative, in which the BCALL technique is applied to each pair of ROC curves, and then the MRROC is then applied to the resulting combinations. Both strategies have been investigated and typically lead to comparable results. However, the time and memory complexity associated with the cumulative strategy can be considerably lower than for the pair-wise one. This is due to the number of permutations required in the pair-wise combinations. In additions, the pair-wise strategy requires combining all thresholds for each two curves, while combining the resulting responses with a new curve is less demanding since the number of selected responses is typically much lower than the number of thresholds. The reader is referred to Subsection 3.4.3 for additional details. The cumulative strategy for Boolean Combination of Multiple ROC curves (BCMALL ) described in Algorithm 3.2 is adopted in this chapter. Further improvements in performance may be achieved by re-combining the output responses of combinations resulting from the BCALL (or BCMALL ) with those of the original ROC curves over several iterations. A novel Iterative Boolean Combination 125 Algorithm 3.2: BCMALL (T1 , . . . , TK , labels): Cummulative combination of multiple ROC curves based on BCALL Input: Thresholds of K ROC curves [T1 , . . . , TK ] and labels Output: ROCCH of combined curves where each point is the result of the combination of combinations 1 [ROCCH1 , R1 ] = BCALL (T1 , T2 , labels) // combine the first two ROC curves 2 for k = 3, . . . , K do // combine the responses of the previous combination with those of the following ROC curve 3 [ROCCHk−1 , Rk−1 ] = BCALL (Rk−2 , Tk , labels) 4 Return ROCCHK−1 , RK−1 and the stored tree of the selected responses/thresholds fusions along with their corresponding fusion functions (IBCALL ) is presented in Algorithm 3.3 and allows for combination that maximize the AUC of K ROC curves by re-combining the previously selected thresholds and fusion functions with those of the original ROC curves. During the first iteration, the ROC curves of two or more detectors are combined using the BCALL or BCMALL . This defines a potential direction for further improvement in performance within the combination space. Then, the IBCALL proceeds in this direction by re-considering information from the original curves over several iterations. The iterative procedure accounts for potential combinations that may have been disregarded during the first iteration, and are mostly useful when provided with limited and imbalanced training data. The iterative procedure stops when there are no further improvements between the AUC of old and new ROCCHs or a maximum number of iterations are performed. This stopping criteria can be controlled by tolerating a small difference between the old and new AUC values ( = AU C(ROCCHN EW ) − AU C(ROCCHOLD )), or by applying a statistical test for significance when working with several replications. Although sub-optimal, the IBCALL algorithm overcomes the impractical exponential explosion in computational complexity associated with the brute-force strategy suggested by Barreno et al. (2008) (see Subsection 3.4.3). Note that the IBCALL can also be applied to repair ROC concavities. In such scenarios, the same thresholds, say Ta , of the ROC curve to be repaired are input twice into the IBCALL algorithm, i.e., IBCALL (Ta , Ta , labels), and iterates until the AUC stops 126 Algorithm 3.3: IBCALL (T1 , . . . , TK , labels): Iterative Boolean combination based on BCALL or BCALL Input: Thresholds of K ROC curves [T1 , . . . , TK ] and labels Output: ROCCH of combined curves where each point is the result of the combination of combinations through several iterations 1 [ROCCHOLD , ROLD ] = BCM ([T1 , T2 , . . . , TK ], labels) 2 while (AU C(ROCCHN EW ) ≥ AU C(ROCCHOLD ) + ) or (numberIterations ≤ maxIter) do 3 [ROCCHN EW , RN EW ] = BC(ROLD , [T1 , T2 , . . . , TK ], labels) 4 return ROCCHN EW , RN EW and the stored tree of the selected responses fusions along with their corresponding fusion functions improving. After applying the IBCALL to a ROC curve the diverse information from the inferior points are taken into consideration in view improving the performance. The resulting MRROC curve is guaranteed to be proper and convex. In the worst-case scenario, the repaired is lower bounded by the ROCCH. Finally, the IBCALL repairs the concavities in a complementary way to LCR (Flach and Wu, 2005), therefore applying both techniques may yield even higher level of performance as presented in Section 3.6. 3.4.3 Time and Memory Complexity Given a pair of detectors, Ca and Cb , having respectively na and nb distinct thresholds on their ROC curves. During the design of the IBCALL system, the worst-case time required for fusion using BCALL is the time required for computing all ten Boolean functions to combine those thresholds, i.e., 10 × nb × nb . The worst-case time complexity is O(na nb ) Boolean operations. The worst-case memory requirements is an array of floating point registers of size 2 × nb × nb for storing the temporary results (tpr, f pr) of each Boolean function (denoted by F in Algorithm 3.1). Therefore, the worst-case memory complexity is O(na nb ). When the number of distinct thresholds becomes very large, these thresholds can be sampled (or histogrammed) into a smaller number of bins before applying the algorithm to reduce both time and memory complexity. Nevertheless, in scenarios with limited and 127 imbalanced data, which is the main focus of this work, the number of distinct thresholds is typically small. The BCALL is very efficient in these cases. When the BCMALL is applied to combine the response of several ROC curves of K detectors, the worst-case time can be roughly stated as K times that of the BCALL algorithm. However, after combining the first two ROC curves, the number of emerging responses on the ROCCH, is typically very small with respect to the number of thresholds on each ROC curve. Let nmax be the largest number of thresholds among the K ROC curves to be combined. When K grows, it is conservative to consider the worst-case time complexity of the order of O(n2max + K.nmax ) Boolean operations. The worstcase time complexity for combining K detectors with IBCALL is that of the BCMALL multiplied with the number of iterations (I), O(I(n2max + K.nmax )). The worst-case memory complexity for both BCMALL and IBCALL is O(nmax nmax ). This is limited to the memory required for combining the first two ROC curves with the BCALL algorithm. As a comparative example, consider two soft detectors Ca and Cb with their ROC curves having respectively a small number of distinct thresholds na = 100 and nb = 50. Since each threshold on the ROC curve of a soft detector is a crisp detector, the total number of crisp detectors is therefore n = na +nb = 150. The brute-force search for optimal combinan tion of crisp detector is 22 and can be reduced to 2n as proposed by Barreno et al. (2008). The exhaustive optimal search requires a prohibitively large number, 2150 ≈ 1.4 × 1045 , of Boolean computationsauthors stated that for only n = 40 detectors the computational time would require about a year and a half (Barreno et al., 2008).5, 000 Boolean computations with BCALL and I × 10, 200 ≈ 106 Boolean computations with IBCALL , where the number of iterations I is typically less than ten. Although the IBCALL algorithm is efficient, its time and memory complexity can be always reduced using the sampling technique, on the account of a small loss in the combination performance. In contrast, during operations the system will be using one vertex or interpolating between two vertices on the final convex hull provided by IBCALL according to a specific 128 false alarm rate. Each vertex has its own set of Boolean combination functions, which may be derived from all (or a subset of) K detectors that have been considered during the design phase. In practice, the computational overhead of these Boolean functions is lower than that of operating the required number of detectors. Therefore, the worst-case time and memory complexity is limited to operating the K detectors. When there are design constraints on the number of operational detectors K, they must be considered during the design phase. A larger set of detectors K > K could be first employed to realize an upper bound for analyzing the system performance. Then, the best subset of K detectors that limits the decrease of performance with reference to the upper bound can be selected for operations. This trade-off between the number of operational detectors and required performance can be a time consuming task. However since it is conducted during the design phase, various subsets selection and parallel processing techniques can be employed for a more efficient search. 3.4.4 Related Work on Classifiers Combinations Classifiers can be combined at various levels and categorized according to pre- and postclassification levels (Kuncheva, 2004a; Tulyakov et al., 2008). In pre-classification combination occurs at the sensor (or raw data) and feature levels, while post-classification combination occurs at the score, rank and decision levels. Pre-classification fusions methods are based on the generation of ensemble of classifiers (EoC), each trained on different data sets or subsets obtained by using techniques such as data-splitting, cross-validation, bagging (Breiman, 1996), and boosting (Freund and Schapire, 1996). This can be also obtained by constructing classifiers that are trained on different features subsets, for instance ensemble generation methods such as the random subspace method (Ho, 1998). Static ensemble selection attempts to select the “best” classifiers from the pool based on various diversity measures (Brown et al., 2005; Kuncheva and Whitaker, 2003), prior to combining their results. An alternative approach consists in combining the outputs of classifiers and then selecting the best performing ensemble evaluated on an independent validation set (Banfield et al., 2003; Ruta and Gabrys, 2005). The later approach has 129 proven to be more reliable than the diversity-based approach (Kuncheva and Whitaker, 2003; Ruta and Gabrys, 2005). However, since combination is performed before selection its success depends on the chosen method(s) of combination, which may be sub-optimal. In post-classification phase, whether the EoC is inherent to the problem at hand, generated or selected, classifier responses must be combined using a fusion function. Fusion at the score level is more prevalent in literature (Kittler, 1998). Normalization of the scores is typically required, which may not be a trivial task. Fusion functions may be static (e.g., sum, product, min, max, average, majority vote, etc.), adaptive (e.g., weighted average, weighted vote, etc.) or trainable (also known as stacked or meta-classifier), where another classifier is trained on classifier responses and then used as combiner (Roli et al., 2002; Wolpert, 1992). This trainable approach may introduce overfitting and requires an independent validation set for tuning the combiner parameters. Fusion at the rank level is mostly suitable for multi-class classification problems, where the correct class is expected to appear in the top of the ranked list. Logistic regression and Borda count (Ho et al., 1994; Van Erp and Schomaker, 2000) are among the fusion methods at this level. Rank-level methods simplify the combiner design since normalization is not required. Fusion at the decision level exploits the least amount of information since only class labels are input to the combiner. Compared to the other fusion methods, it is less investigated in literature. The simple majority voting rule (Ruta and Gabrys, 2002) and behaviorknowledge-space (BKS) (Raudys and Roli, 2003) are the two most representative decisionlevel fusion methods. One issue that appears with decision level fusion is the possibility of ties. The number of classifiers must therefore be greater than the number of classes. BKS can be only applied to low dimensional problems. Moreover, in order to have an accurate probability estimation, it requires a large number of training samples and another independent database to design the combination rules. The proposed IBC in this chapter provides an efficient technique which exploit all Boolean combinations as well as the MRROC interpolation for an improved performance. 130 Combination of responses within the ROC space does not require neither re-training of dichotomizers nor normalization of scores. This is because ROC curves are invariant to monotonic transformation of classification thresholds (Fawcett, 2004). These advantages allow the IBC technique to be directly applied at either the score or the decision levels. Bayesian learning approaches have been proposed to estimate HMM parameters by integrating over the parameters rather than optimizing. For instance, variational Bayesian learning has been proposed with a suitable prior for all variables to estimate an ensemble of HMMs with the same order, each trained on a different subset of the data, to approximate the entire posterior probability distribution (MacKay, 1997). Nonparametric Bayesian learning have also been proposed for estimating HMMs order and other parameters from the training data (Beal et al., 2002). Starting with a large order, the HMM parameters and the number of states are integrated out with reference to their posterior probabilities. As the method converges to a solution, redundant states are eliminated which yields to automatic order selection. These can been considered as pre-classification combinations of HMMs. HMMs with the same orders are trained and combined using different subsets of the data or HMMs with the different orders are trained and combined on the same data. Comparing the performance of these techniques to that of the IBC would be an interesting future work. Note however that the proposed IBC is a general post-classification combination technique at the response level. It can be used to combine the results of these Bayesian approaches, as well as the results of any other technique, for improved performance. 3.5 Experimental Methodology The experiments are conducted on both synthetically generated data and sendmail data from the University of New Mexico (UNM) data sets5 . 5 http://www.cs.unm.edu/~immsec/systemcalls.htm 131 3.5.1 University of New Mexico (UNM) Data The UNM data sets are commonly used for benchmarking anomaly detections based on system calls sequences (Warrender et al., 1999). In related work, intrusive sequences are usually labeled by comparing normal sequences, using STIDE matching technique. This labeling process considers STIDE responses as the ground truth, and leads to a biased evaluation and comparison of techniques, which depends on both training data size and detector window size. To confirm the results on system calls data from real processes, the same labeling strategy is used in this work. However fewer sequences are used to train the HMMs and STIDE to alleviate the bias. Therefore, STIDE is first trained on all the available normal data according to different window sizes, and then used to label the corresponding sub-sequences from the ten sequences available for testing. The resulting labeled sub-sequences of the same size are concatenated, then divided into blocks of equal sizes, one for validation and the other for testing. During the experiments, smaller blocks of normal data (100 to 1, 000 symbols) are used for training the HMMs and STIDE. In addition to the labeling issue, the normal sendmail data is very redundant and anomalous sub-sequences in the testing data are very limited. Nevertheless, due to the limited publicly available system call data, sendmail data is the mostly used in literature. 3.5.2 Synthetic Data The need to overcome issues encountered when using real-world data for anomaly-based HIDS (incomplete data for training and labeling) has lead to the implementation of a synthetic data generation platform for proof-of-concept simulations. It is intended to provide normal data for training and labeled data (normal and anomalous) for testing. This is done by simulating different processes with various complexities then injecting anomalies in known locations. The data generator is based on the Conditional Relative Entropy (CRE) of a source; it is closely related to the work of Tan and Maxion (Tan and Maxion, 2003). The CRE is defined as the conditional entropy divided by the maximum entropy (MaxEnt) of that source, which gives an irregularity index to the generated data. 132 For two random variables x and y the CRE is given by CRE = − x p(x) y p(y|x) log p(y|x) , M axEnt where for an alphabet of size Σ symbols, M axEnt = −Σ log(1/Σ) is the entropy of a theoretical source in which all symbols are equiprobale. It normalizes the conditional entropy values between CRE = 0 (perfect regularity) and CRE = 1 (complete irregularity or random). In a sequence of system calls, the conditional probability, p(y | x), represents the probability of the next system call given the current one. It can be represented as the columns and rows (respectively) of a Markov Model with the transition matrix M M = {aij }, where aij = p(St+1 = j | St = i) is the transition probability from state i at time t to state j at time t + 1. Accordingly, for a specific alphabet size Σ and CRE value, a Markov chain is first constructed, then used as a generative model for normal data. This Markov chain is also used for labeling injected anomalies as described below. Let an anomalous event be defined as a surprising event which does not belong to the process normal pattern. This type of event may be a foreign-symbol anomaly sequence that contains symbols not included in the process normal alphabet, a foreign n-gram anomaly sequence that contains n-grams not present in the process normal data, or a rare n-gram anomaly sequence that contains n-grams that are infrequent in the process normal data and occurs in burst during the test6 . Generating training data consists of constructing Markov transition matrices for an alphabet of size Σ symbols with the desired irregularity index (CRE) for the normal sequences. The normal data sequence with the desired length is then produced with the Markov chain, and segmented using a sliding window (shift one) of a fixed size, DW . To produce the anomalous data, a random sequence (CRE = 1) is generated, using the same alphabet size Σ, and segmented into sub-sequences of a desired length using a sliding window with a fixed size of AS. Then, the original generative Markov chain is used to compute the likelihood of each sub-sequence. If the likelihood is lower than a threshold it is labeled as anomaly. The threshold is set to (min(aij ))AS−1 , ∀i,j , the minimal value in the Markov transition matrix to the power (AS − 1), which is the number of symbol tran6 This is in contrast with other work which consider rare event as anomalies. Rare events are normal, however they may be suspicious if they occurs in high frequency over a short period of time. 133 sitions in the sequence of size AS. This ensures that the anomalous sequences of size AS are not associated with the process normal behavior, and hence foreign n-gram anomalies are collected. The trivial case of foreign-symbol anomaly is disregarded since it is easy to be detected. Rare n-gram anomalies are not considered since we seek to investigate the performance at the detection level, and such kind of anomalies are accounted for at a higher level by computing the frequency of rare events over a local region. Finally, to create the testing data another normal sequence is generated, segmented and labeled as normal. The collected anomalies of the same length are then injected into this sequence at random according to a mixing ratio. Figure 3.5 illustrates the data pre-processing for training, validating and testing using the UNM sendmail data or the generated data. The only difference is the ground truth for labeling, which is all the available normal data for sendmail and the generator itself Train DW O = o1 , . . . , oT of system calls AS O = o1 , . . . , oT Sliding Window of Size AS Mixed sequence of system calls Test Validation Labeled sub-sequences normal or anomalous Validation Sliding Window Normal sequence of Size DW Normal Sub-sequences for the synthetic data. Figure 3.5 Illustration of data pre-processing for training, validation and testing 134 The experiments conducted in this chapter using the data generator simulate a small process and a more complex process, with Σ = {8, 50} symbols, and CRE = {0.3, 0.4}, respectively. The sizes of injected anomalies are assumed equal to the detector window sizes AS = DW = {2, 4, 6}. For both scenarios, the presented results are for validation and test sets that comprise 75% of normal and 25% of anomalous data. 3.5.3 Experimental Protocol Figure 3.6 illustrates the steps involved for estimating HMM parameters. For each detector window set of size DW , different discrete-time ergodic HMMs are trained with various number of hidden states N = [Nmin , . . . , Nmax ]. The number of symbols is taken equal to the process alphabet size. The iterative Baum-Welch algorithm is used to estimate HMM parameters (Baum et al., 1970) using the training data, which only comprises normal sequences. To reduce overfitting effects, the evaluation of the log-likelihood, using the Forward algorithm (Baum et al., 1970), on an independent validation set also comprising only normal sequences is used as a stopping criterion. The training process is repeated ten times using a different random initialization to avoid local minima. The log-likelihood of a second validation set comprising normal and anomalous sequences is then evaluated by each HMM, which provides ten ROC curves. Finally, the model that gives the highest area under its convex hull is selected, which results an HMM for each N value. When working with the synthetic data, this procedure is replicated ten times with different training, validation and testing sets, and the results are averaged and presented along with the standard deviations to provide a statistical confidence intervals. The performance obtained by fusion of μ-HMMs in ROC space with the proposed BCALL technique is compared to that of MRROC fusion of the original models, and to that of the conjunction (BCAN D ) and disjunction (BCOR ) combinations (Haker et al., 2005; Tao and Veldhuis, 2008). This is also compared to the performance of the IBCALL technique applied to combine the μ-HMMs through several iterations. In addition, the performance of STIDE is shown as reference. To investigate the effect of repairing the concavities in DW For N = Nmin , . . . , Nmax Baum-Welch (training) Forward algo. (evaluation) Best Train Val1 # iterations λ1N λ10 N AS tpr Labeled sub-sequences Validation Restart the training 10 times from different random initializations log-likelihood Normal sub-sequences Validation Train 135 λNmin Highest AUCH value over the 10 initializations λNmax f pr Figure 3.6 Illustration of the steps involved for estimating HMM parameters ROC curves, each ROC curve is first repaired using the IBCALL and LCR (Flach and Wu, 2005) techniques, and then all curves are combined with the MRROC technique. The area under the ROC curve (AUC), has been proposed as more robust scalar summary of classifiers performance than accuracy (Bradley, 1997; Huang and Ling, 2005; Provost and Fawcett, 2001). The AUC assesses the ranking in terms of class separation, i.e., evaluates how well a classifier is able to sort its predictions according to the confidence it assigns to them. For instance, with an AU C = 1 all positives are ranked higher than negatives indicating a perfect discrimination between classes. A random classifier has an AU C = 0.5 that is both classes are ranked at random. For a crisp classifier, the AUC is simply the area under the trapezoid and is calculated as the average of the tpr and f pr values. For a soft classifier, the AUC may be estimated directly from the data either by summing up the areas of the underlying trapezoids (Fawcett, 2004) or by means of the Wilcoxon–Mann–Whitney (WMW) statistic (Hanley and McNeil, 1982). When the ROC curves cross, It is possible for a high-AUC classifier to perform worse in a specific 136 region of ROC space than a low-AUC classifier. In such case, the partial area under the ROC curve (pAUC) (Walter, 2005) could be useful for comparing the specific regions of interest (Zhang et al., 2002). If the AUCs (or the pAUCs) are not significantly different, the shape of the curves might need to be looked at. It may also be useful to look at the tpr for a fixed f pr of particular interest. Since the MRROC can be applied to any ROC curve, the performance measures in all experiments are taken with reference to the ROCCH. This includes the area under the convex hull (AUCH), the partial area under the convex hull for the range of f pr = [0, 0.1] (AU CH0.1 ), and the tpr at a fixed f pr = 0.1. 3.6 3.6.1 Simulation Results and Discussion An Illustrative Example with Synthetic Data Figure 3.7 presents an example of the impact on performance obtained after applying different techniques for combining and repairing ROC curves: MRROC, BC and IBC. The training, validation and testing data are generated synthetically as described in Section 3.5.2, with an alphabet of size Σ = 8 symbols and with a CRE = 0.3. The training and validation of the ergodic HMMs is carried out according to the methodology described in Section 3.5. A block of data of size 100 sequences, each of size DW = 4, is used to train HMMs, and another validation block of the same size is used to implement a stopping criterion. Each validation and test set is comprised of 200 sequences, each of size AS = 4. In both cases, the ratio of normal to anomalous sequences is four to one. For improved visibility, Figure 3.7 only shows the combinations of two HMMs, each one trained with different number of states, N = 4 and 12. The ROC curves for the two HMMs along with their MRROC are presented for the validation (Figure 3.7a) and test (Figure 3.7b) data sets. The performance of STIDE is also shown for reference. To visualize the impact on performance of combining with AND and OR Boolean functions separately (Haker et al., 2005; Tao and Veldhuis, 2008), Figures 3.7a and b show the results with BCAN D and BCOR along with BCALL and IBCALL . In Figures 3.7c and 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.5 STIDE λN=4 : auc=0.545 : auc=0.712 0.4 λN=12 : auc=0.708 0.3 MRROC : auc=0.780 BCAND : auc=0.802 0.2 BCOR : auc=0.812 0.1 BCALL : auc=0.826 0 IBCALL : auc=0.932 XOR 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 False alarm rate True positive rate True positive rate 137 0.6 0.5 STIDE λN=4 : auc=0.550 :auc=0.705 0.4 λN=12 :auc=0.728 0.3 MRROC : auc=0.798 BCAND : auc=0.806 0.2 BCOR : auc=0.819 0.1 BCALL : auc=0.827 0 IBCALL : auc=0.927 1 0 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 STIDE λN=4 : auc=0.545 : auc=0.712 0.4 λN=12 : auc=0.708 0.2 MRROC LCR(N=4) IBCALL(N=4) : auc=0.780 : auc=0.799 : auc=0.837 0.1 LCR(N=12) IBCALL(N=12) : auc=0.810 : auc=0.858 MRROC(LCR,IBC): auc=0.865 0 0.3 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 False alarm rate (c) Repairing using the validation set. 1 (b) Combinations using the test set. True positive rate True positive rate (a) Combinations using the validation set. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 False alarm rate 1 0.5 STIDE λN=4 : auc=0.550 : auc=0.705 0.4 λN=12 : auc=0.728 0.2 MRROC LCR(N=4) IBCALL(N=4) : auc=0.798 : auc=0.783 : auc=0.809 0.1 LCR(N=12) IBCALL(N=12) : auc=0.833 : auc=0.872 0.3 MRROC(LCR,IBC): auc=0.872 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 False alarm rate 1 (d) Repairing using the test set. Figure 3.7 Illustrative example that compares the AUC performance of techniques for combination (top) and for repairing (bottom) of ROC curves. The example is conducted on a system consisting of two ergodic HMMs trained with N = 4 and N = 12 states, on a block of 100 sequences, each of length DW = 4 symbols, synthetically generated with Σ = 8 and CRE = 0.3 d, IBCALL , and LCR (Flach and Wu, 2005) are applied to repair each ROC curve, and the resulting curves are then combined with the MRROC technique. 138 As expected, STIDE performs poorly since the data provided for training is limited and the database memorized by STIDE only comprises a fraction of the normal behavior. This contrasts with the generalization capabilities of HMMs. As shown in Figures 3.7a or b, the improvement in AUC performance achieved by applying BCAN D and BCOR is modest. This is also seen when comparing their tpr values at an operating point of, for instance, f pr = 0.1 with respect to the MRROC of the original ROC curves. Indeed, these AND and OR Boolean functions are unable to exploit information in inferior points. In this case, the BCALL does not provide a considerably higher level of performance than the BCAN D and BCOR , this is not typically the case, but was deliberately selected for this example. However, as shown in Figure 3.7a, two additional combination rules have emerged from the XOR Boolean function with BCALL . Diverse information from the emerging combination rules (resulting, in this case, from AN D, OR and XOR Boolean functions) serve as a guide, within the combination space, to a further performance improvement with the IBCALL technique. Indeed, after seven iterations, IBCALL is able to achieve a considerably higher level of performance by recombining emerging rules from each iteration with the original curves. IBCALL iterative procedure repeatedly selects diverse information and accounts for potential combinations that may have been disregarded during previous iterations. In this example, the validation set is comprised of 200 sequences and the ROC curves of both HMMs (λN =4 and λN =12 ) contain the same number of unique thresholds7 , 200. The worst-case time complexity involved with BCALL is the time required to compute all ten Boolean functions for each combination of thresholds, i.e., 10 × 200 × 200 = 400, 000 Boolean operations. The worst-case memory complexity is an array of floating point registers of size 2 × 200 × 200 = 80, 000, holding the temporary results (tpr, f pr) of each Boolean function. This consists the first iteration of IBCALL and results in nine emerging combinations (or points on the ROCCH) as shown in Figure 3.7a. The time complexity of the second iteration is reduced to the time required for computing 10 × 9 × (200 + 200) = 7 In many cases, the number of unique thresholds could be lower than the number of samples in the validation set. 139 Table 3.4 Worst-case time and memory complexity for the illustrative example Number of iterations Number of emerging points Time complexity Memory complexity 1 9 400, 000 80, 000 2 10 36, 000 3, 600 3 10 40, 000 4, 000 4 11 40, 000 4, 000 5 10 44, 000 4, 400 6 9 40, 000 4, 000 7 6 36, 000 3, 600 36, 000 Boolean operations, and the memory complexity is reduced to 2 × 1, 800 = 3, 600 floating point registers. As shown in Table 3.4, the time and memory complexity of successive iterations are reduced by order of magnitude compared to the first iteration. However, as shown in Figure 3.7a, the level of performance is significantly improved due to these low cost iterations. During operations, the number of Boolean combinations varies with the selected operational point, however it is upper-bounded by the number of iterations. For instance, in this example a maximum of seven Boolean functions must be applied to HMMs response to achieve the desired performance as in Figure 3.7b. The time complexity of these Boolean combinations is negligible compared to that of computing the log-likelihood of the test sequences with HMM. When using the IBCALL technique to repair the ROC curve belonging to the HMM trained with N = 12 states, the resulting curve IBCALL (N = 12) dominates other ROC curves on the test set. This is because its test set performance exceeds its expected validation set performance, while the expected IBCALL (N = 4) performance decreases on the validation set. Similarly, the expected validation performance of the LCR technique can be largely different from that of the test set as shown for LCR(N = 12) curve. In general, the performance of both repairing techniques is less robust to variances between validation and test sets, since repairing relies on the shape of one ROC curve. In contrast, combining several ROC curves according to IBCALL technique is more robust to such variance since the interactions among all ROC curves are considered. A closer look at ROC-based repairing techniques shows that IBCALL and LCR are complimentary. For instance, ROC curves repaired for λN =4 , IBCALL (N = 4) and LCR(N = 4), cross in both validation and test sets. In general, IBCALL performs 140 well when the concavities are located in the bottom-left or in the top-right corners of the ROC space. IBCALL exploits the asymmetry in the ROC curve shape caused by an imbalanced variances at the head or tail of class distributions. The LCR technique is efficient for repairing concavities that are located close to non-major diagonal, since it only considers the shape of the ROC curve when negating the responses of the largest concavity. A higher level of AUC performance can therefore be achieved by using the MRROC to combine ROC curves repaired using the IBCALL and LCR techniques. The attempt to combine the responses of the ROC curves repaired with the IBCALL or LCR techniques using IBCALL , was not successful due to the low level of robustness of the repairing techniques. Combination are based on the few points that define repaired ROC curve points on the validation set, which may be in different locations on the test set. In contrast, the direct combination of ROC curves according to BCALL or IBCALL techniques starts with existing thresholds on the validation set. Although ROC curves thresholds may change on the test set, these algorithms proceed by using neighboring thresholds. 3.6.2 Results with Synthetic and Real Data Figure 3.8 shows the average AUC performance on the test sets versus the number of training blocks of a μ-HMM system where each HMM is trained with a different number of states (N = 4, 8, 10). Results are produced for synthetically generated data with Σ = 8 and CRE = 0.3, and for various training set sizes (50 to 450 symbols) and detector window sizes (DW = 2, 4, 6). The performance of the composite system obtained with the MRROC combination for the original HMM ROC curves, is compared to those obtained with the BCAN D , BCOR , BCALL 8 and IBCALL techniques. The AUC performance of STIDE is also shown for reference. The reader is referred to Section IV.2 of Appendix IV for additional supporting results – AU CH0.1 and tpr values at f pr = 0.1. 8 For simplicity, BCALL is also used to indicate BCMALL when combining multiple ROC curves. 141 As shown in these figures, the BCALL can significantly improve the AUC over the MRROC in all the presented cases. The performance of the MRROC approaches that of BCALL technique when the training data becomes abundant for the problem at hand, or when test cases are easily detected (e.g., when classes are well separated). For example, this is shown in Figure 3.8a when the training block size grows beyond 150 sequences, even STIDE is able to achieve a high level of performance on this simple scenario (that is rarely encountered in real-world applications). Although the BCAN D and BCOR fusion were able to increase the performance over the MRROC, their performances is often significantly lower than that of the BCALL , as shown in Figure 3.8c and d. The significant increase in performance achieved with the BCOR supports prior conclusion by Tao and Veldhuis (Tao and Veldhuis, 2008). The authors recommend using the OR Boolean fusion for detecting outliers in biometrics applications. However, this may not hold true when the number of positive samples is very limited. The IBCALL technique provides the highest level of performance over all the range of conducted experiments. Since it includes the BCALL in its first iteration, the performance of IBCALL is lower bounded by that of the BCALL . Results indicate that IBCALL is most suitable for cases in which data are limited and imbalanced, as shown in the first blocks of Figures 3.8b, d, and f, (see additional results in Section IV.2 of Appendix IV). The performance of IBCALL improves significantly over that of the BCALL . This is due to the iterative nature of the IBCALL that is able to repeatedly exploit the information residing in the inferior points, and hence select better combinations. When the number of blocks for training is limited, the performance of the IBCALL technique is higher than other combination techniques. In such cases, each HMM trained with a different order provides diverse information by capturing different data structure, which allows to increase the performance of IBCALL . When the amount training data becomes abundant, the HMMs tend to achieve the same performance with less diversity in 142 their responses, which degrades the performance achieved by IBCALL . With the increase of detector window size9 , the likelihood of anomalous sequences becomes smaller at a faster rate than normal ones, and hence increases HMM detection rate (Khreich et al., 2009b). As a consequence, HMMs responses become less diverse yielding to a decrease of IBCALL performance. The impact of DW on performance is illustrated with the second and more complex scenario below. Since the IBCALL technique incorporates all combinations employed within the BCALL technique, only results of IBCALL are compared to those of MRROC, IBCAN D and IBCOR for the second synthetic scenario (Σ = 50, CRE = 0.4) and for sendmail data. Figure 3.9 confirms the results of Figure 3.8 on the second synthetic and more complex scenario, where Σ = 50 and CRE = 0.4. Again the IBCALL technique provides higher level of performance than the MRROC, IBCAN D and IBCOR techniques. In this scenario, although the results of the AND and OR Boolean combinations were allowed to iterate until convergence, their achieved performances are still significantly lower than that of the IBCALL , as shown in Figure 3.9. This demonstrates the impact on performance of employing and iterating all Boolean functions until convergence. The performance gain is best illustrated for the first five training blocks (1000 to 3000 sequences), where the HMMs trained with different orders provide IBCALL with diverse responses for a significantly improved performance. Increasing the number of training blocks however increases the detection capabilities of the HMMs and reduces the diversity in their responses, which decreases the level of performance achieved with the IBCALL technique. As discussed previously, HMM detection ability increases with the detector window size (or anomaly size) which reduces the diversity in the μ-HMMs system. This negatively affects the performance achieved by the IBCALL technique as shown in Figure 3.9. This is also confirmed on the sendmail data in Figure 3.10. Note however that with sendmail, the training data is very redundant and the test samples are very limited. 9 For simplicity, the detector window size, DW , is assumed equal to the anomaly size, AS. 143 Even STIDE performance was moderate with only up to 1000 training sequences, out of the available 1.5 million sequences used for labeling. Nevertheless, the difference in performance between validation and test sets, not shown for improved visibility, is significantly large where the limited test samples are split in half for testing and half for validation. Although the IBCALL is able to increase the performance, this increase is not as prominent as in the synthetic cases. This is mainly due to redundancy in the training data, where the HMMs trained with different number of states were not able to capture enough diversity in the underlying data structure. Effective combination technique must exploit diverse and complimentary information to improve systems performance. 1 1 0.95 0.95 0.9 0.9 0.85 0.85 0.8 0.8 AUCH AUCH 144 0.75 0.7 0.65 0.75 0.7 STIDE MRROC BCAND 0.65 0.55 BCOR 0.55 IBCOR 0.5 BCALL 0.5 IBCALL 0.6 50 100 150 200 250 300 350 number of sequences in training blocks 400 STIDE MRROC IBCAND 0.6 450 50 100 (a) BC (DW = 2) 1 1 0.95 0.95 0.9 0.9 0.85 0.85 450 400 450 400 450 0.8 AUCH AUCH 400 (b) IBC (DW = 2) 0.8 0.75 0.7 0.65 0.75 0.7 0.65 0.6 0.6 0.55 0.55 0.5 0.5 50 100 150 200 250 300 350 number of sequences in training blocks 400 450 50 100 (c) BC (DW = 4) 150 200 250 300 350 number of sequences in training blocks (d) IBC (DW = 4) 1 1 0.95 0.95 0.9 0.9 0.85 0.85 0.8 0.8 AUCH AUCH 150 200 250 300 350 number of sequences in training blocks 0.75 0.7 0.65 0.75 0.7 0.65 0.6 0.6 0.55 0.55 0.5 0.5 50 100 150 200 250 300 350 number of sequences in training blocks (e) BC (DW = 6) 400 450 50 100 150 200 250 300 350 number of sequences in training blocks (f) IBC (DW = 6) Figure 3.8 Results for synthetically generated data with Σ = 8 and CRE = 0.3. Average AUC values obtained on the test sets as a function of the number of training blocks (50 to 450 sequences) for a 3-HMM system each trained with a different state (N = 4, 8, 12), and combined with the MRROC, BC and IBC techniques. Average AUC performance is compared for various detector windows sizes (DW = 2, 4, 6). Error bars are standard deviations over ten replications 145 0.9 1 0.95 0.8 0.9 0.7 0.85 AUCH tpr at fpr=0.1 STIDE 0.8 MRROC IBCAND 0.75 0.7 IBC OR IBCALL 0.65 0.6 0.6 0.5 STIDE MRROC 0.4 IBCAND 0.3 IBC 0.2 IBCALL OR 0.55 0.5 1000 1500 2000 2500 3000 3500 4000 4500 number of sequences in training blocks 0.1 1000 5000 1500 (a) DW=2 2000 2500 3000 3500 4000 4500 number of sequences in training blocks 5000 (b) DW=2 0.8 1 0.95 0.7 0.9 0.6 0.85 tpr at fpr=0.1 AUCH 0.8 0.75 0.7 0.65 0.6 0.5 0.4 0.3 0.2 0.55 0.1 0.5 1000 1500 2000 2500 3000 3500 4000 4500 number of sequences in training blocks 0 1000 5000 1500 (c) DW=4 2000 2500 3000 3500 4000 4500 number of sequences in training blocks 5000 (d) DW=4 1 0.9 0.95 0.8 0.9 0.7 0.85 0.6 tpr at fpr=0.1 AUCH 0.8 0.75 0.7 0.65 0.5 0.4 0.3 0.6 0.2 0.55 0.1 0.5 1000 1500 2000 2500 3000 3500 4000 4500 number of sequences in training blocks (e) DW=6 5000 0 1000 1500 2000 2500 3000 3500 4000 4500 number of sequences in training blocks 5000 (f) DW=6 Figure 3.9 Results for synthetically generated data with Σ = 50 and CRE = 0.4. Average AUC values (left) and tpr values at f pr = 0.1 (right) obtained on the test sets as a function of the number of training blocks (1000 to 5000 sequences) for a 3-HMM system each trained with a different state (N = 40, 50, 60), and combined with the MRROC and IBC techniques. The performance is compared for various detector windows sizes (DW = 2, 4, 6). Error bars are standard deviations over ten replications 146 0.95 0.8 0.9 0.7 STIDE MRROC IBCAND IBC OR 0.85 tpr at fpr=0.1 0.8 AUCH IBCALL 0.6 0.75 0.7 STIDE MRROC IBC 0.65 0.5 0.4 0.3 AND 0.2 IBCOR 0.6 IBC ALL 0.55 0.1 100 200 300 400 500 600 700 800 number of sequences in training blocks 900 1000 100 200 (a) DW=2 300 400 500 600 700 800 number of sequences in training blocks 900 1000 900 1000 900 1000 (b) DW=2 0.8 1 0.95 0.7 0.9 0.6 tpr at fpr=0.1 0.85 AUCH 0.8 0.75 0.7 0.65 0.5 0.4 0.3 0.6 0.2 0.55 0.5 0.1 100 200 300 400 500 600 700 800 number of sequences in training blocks 900 1000 100 200 (c) DW=4 300 400 500 600 700 800 number of sequences in training blocks (d) DW=4 0.8 1 0.95 0.7 0.9 0.6 tpr at fpr=0.1 0.85 AUCH 0.8 0.75 0.7 0.65 0.5 0.4 0.3 0.6 0.2 0.55 0.5 100 200 300 400 500 600 700 800 number of sequences in training blocks (e) DW=6 900 1000 0.1 100 200 300 400 500 600 700 800 number of sequences in training blocks (f) DW=6 Figure 3.10 Results for sendmail data. AUC values (left) and tpr values at f pr = 0.1 (right) obtained on the test sets as function of the number of training blocks (100 to 1000 sequences) for a 5-HMM system, each trained with a different state (N = 40, 45, 50, 55, 60), and combined with the MRROC and IBC techniques. The performance is compared for various detector windows sizes (DW = 2, 4, 6) 147 Overall, the AUCs tend to increase to a comparable level as the number of training blocks increases. With a sufficient amount of training data, all HMMs are capable of achieving an equal level of performance, however with less diverse and complimentary information. On the other hand, the IBCALL performance will outperform others for small training set and detector window sizes. When the number of blocks for training is limited, the performance of the IBCALL technique is higher than other combination techniques. In such scenarios, each HMM trained with a different number of states is a able to capture different underlying structures of data. These HMMs provide diverse information, which allows to increase the performance of IBCALL . This holds true for different detector window sizes (DW ), although the impact of combinations on performance degrades with the increase of DW values. HMMs are more capable of detecting larger anomalies, providing therefore less diverse responses to the IBCALL technique. Increasing the number of HMMs trained with different orders (N ) in the μ -HMM system has a significant impact on the performance achieved with the IBCALL technique due to the added diversity among the combined HMMs. Since diversity measures and the creation of diverse ensembles are not yet well defined in literature (Brown et al., 2005; Dos Santos et al., 2008), one can simply combine the responses of HMMs trained using a wide range of states to have an upper bound on performance. This is feasible in short period of time due to the efficiency of IBCALL . During operations however, simplified combination rules and fewer HMMs may be favored for speed constraints on the detection system. In such cases, the number of HMMs involved in the μ-HMM system can be reduced by selecting a subset of N values that does not significantly affect the upper bound achieved on performance (as described in Section 3.4.3). Training different HMMs on different subsets of the data provides other sources of diversity. Future work involves combining the responses of HMMs trained with different orders on different subsets of the data according to the IBCALL technique. Finally, Figure 3.11 shows results for repairing concavities using the first synthetically generated scenario (Σ = 8, CRE = 0.3). The μ-HMM system is trained with three dif- 148 ferent states (N = 4, 8, 12), for various training set sizes and detector window sizes, and combined with the MRROC technique. Then, the IBCALL and LCR techniques are applied to repair the concavities presented in each ROC curve associated with an HMM. The repaired ROC curves are then combined according to the MRROC. In addition, the results of IBCALL and LCR techniques are also combined with the MRROC. 1 1 0.95 0.9 0.9 0.8 0.85 0.7 AUCH tpr at fpr=0.1 STIDE 0.8 MRROC 0.75 LCR 0.7 IBCALL 0.65 0.6 STIDE 0.5 MRROC LCR 0.4 MRROC(IBCALL,LCR) IBC 0.6 ALL 0.3 MRROC(IBC ,LCR) ALL 0.55 0.2 0.5 0.1 50 100 150 200 250 300 350 number of sequences in training blocks 400 450 50 100 (a) DW=4 400 450 400 450 (b) DW=4 1 0.95 0.9 0.9 0.8 0.85 0.7 tpr at fpr=0.1 1 0.8 AUCH 150 200 250 300 350 number of sequences in training blocks 0.75 0.7 0.65 0.6 0.5 0.4 0.3 0.6 0.2 0.55 0.1 0.5 50 100 150 200 250 300 350 number of sequences in training blocks (c) DW=6 400 450 0 50 100 150 200 250 300 350 number of sequences in training blocks (d) DW=6 Figure 3.11 Performance versus the number of training blocks achieved after repairing concavities for synthetically generated data with Σ = 8 and CRE = 0.3. Average AUCs (left) and tpr values at f pr = 0.1 (right) on the test set for a μ-HMM where each HMM is trained with different number of states (N = 4, 8, 12). HMMs are combined with the MRROC technique and compared to the performance of IBCALL and LCR repairing techniques, for various training block sizes (50 to 450 sequences) and detector windows sizes (DW = 4 and 6) 149 Figure 3.11 shows that each of the repairing techniques, IBCALL and LCR, is able to exceed the MRROC of the original curves at the first blocks. This is because in such cases the ROC curves comprise large concavities due to the limited training data. However, when the amount of training data increases, IBCALL and LCR are not able to provide a higher performance than the MRROC of the original curves unless their responses are themselves combined with the MRROC technique. This noticeable increase in performance, achieved by combining the repaired curves according to IBCALL and LCR with the MRROC, clearly indicates the complementarity of both techniques as discussed in Section 3.6.1. However, the performance achieved by combining the repaired curves according to IBCALL and LCR with the MRROC is still lower than that of combining the original ROC curves using IBCALL , as presented in Figures 3.8d and f. Nevertheless, in results not shown in this chapter, repairing with the IBCALL and LCR techniques have shown lack of robustness to changes between the validation and test sets (as discussed in Section 3.6.1). In addition, repairing relies on large ROC concavities and they are not designed to improve the performance when the ROC curve comprises small concavities, but represent a poor performance (e.g., parallel to the diagonal of chance). In contrast, combining ROC curves using the IBCALL may allow to exploit this information to increase performance. In practice, when a ROC curve of a detector presents concavities, performance could be improved by combining the responses of both repairing techniques, IBCALL and LCR, using the MRROC fusion. Otherwise, the IBCALL could be directly applied to combine the responses of several ROC curves and provide a higher level of performance than other techniques. 3.7 Conclusion This chapter presents an Iterative Boolean Combination (IBC ) technique for efficient fusion of the responses from multiple classifiers in the ROC space. The IBC efficiently exploits all Boolean functions applied to the ROC curves and requires no prior assump- 150 tions about conditional independence of detectors or convexity of ROC curves. Although it seeks a sub-optimal set of combinations, the IBC is very efficient in practice and it provides a higher level of performance than related ROC-based combination techniques, especially when training data is limited and test data is heavily imbalanced. Its time complexity is linear with the number of classifiers while, the memory requirement is independent of the number of classifier, which allows for a large number of combinations. The proposed IBC is general in that it can be employed to combine diverse responses of any crisp or soft one- or two-class classifiers, within a wide range of application domains. This includes combining the responses of the same classifier trained on different data or features or trained according to different parameters, or from different classifiers trained on the same data, etc. It is also useful for repairing the concavities in a ROC curve. During simulations conducted on both synthetic and real HIDS data sets, the IBC has been applied to combine the responses a of multiple-HMM system, where each HMM is trained using a different number of states, and capturing different temporal structures of the data. Results indicate that the IBC significantly improves the overall system performance over a wide range of training set sizes with various alphabet sizes and complexities of monitored processes, and according to different anomaly sizes, without a significant computational and storage overhead. Results have shown that, even with one iteration, the IBC technique always increases system performance over the MRROC fusion, and over the Boolean conjunction and disjunction combinations. When the IBC is allowed to iterate until convergence, the system performance improves significantly and the time and memory complexity required for each iteration are reduced by an order of magnitude with reference to the first iteration. The performance gain, especially when provided with limited training data, is due to the ability of the IBC technique to exploit diverse information residing in inferior points on the ROC curves, which are disregarded by the other techniques. In addition, repairing the concavities in a ROC curve using the IBC technique can yield higher level of performance than with the MRROC technique alone. The impact on performance of repairing the concavities is shown to be comparable and 151 complementary to an existing repairing technique that relies on inverting the largest concavity section. Therefore, combining the results of both repairing techniques according to the MRROC fusion yields a higher level of performance than applying each repairing technique alone. In this chapter, the proposed ROC-based iterative Boolean combination technique is applied for fusion of responses from HMMs trained on fixed-size data sets. In next chapter, an adaptive system is proposed for incremental learning of new data over time using a learn-and-combine approach based on the proposed Boolean combination techniques. 152 CHAPTER 4 ADAPTIVE ROC-BASED ENSEMBLES OF HMMS APPLIED TO ANOMALY DETECTION∗ In this chapter, an efficient system is proposed to accommodate new data using a learnand-combine approach. When a new block of training data becomes available, a new pool of base HMMs is generated from the data using a different number of HMM states and random initializations. The responses from the newly-trained HMMs are then combined to those of the previously-trained HMMs in receiver operating characteristic (ROC) space using a novel incremental Boolean combination technique. The learn-and-combine approach allows to select a diversified ensemble of HMMs (EoHMMs) from the pool, and adapts the Boolean fusion functions and thresholds for improved performance, while it prunes redundant base HMMs. The proposed system is capable of changing its desired operating point during operations, and this point can be adjusted to changes in prior probabilities and costs of errors. Computer simulations conducted on synthetic and realworld host-based intrusion detection data indicate that the proposed system can achieve a higher level of performance than when parameters of a single best HMM are estimated, at each learning stage, using reference batch and incremental learning techniques. It also outperforms the static fusion functions (e.g., majority voting) for combining the new pool of HMMs with previously-generated HMMs. Over time, the proposed ensemble selection techniques have shown to form compact EoHMMs for operations, while maintaining or improving the overall system accuracy. Pruning has allowed to limit the pool size from increasing indefinitely, reducing thereby the storage space for accommodating HMMs parameters without negatively affecting the overall EoHMMs performance. Although applied for HMM-based ADSs, the proposed approach is general and can be employed for a wide range of classifiers and detection applications. ∗ THIS CHAPTER IS ACCEPTED, IN PRESS, IN PATTERN RECOGNITION JOURNAL, ON JULY 5, 2011, SUBMISSION NUMBER: PR-D-10-01131 154 4.1 Introduction Anomaly detection systems (ADSs) detect intrusions by monitoring for significant deviations from normal system behavior. Traditional host-based ADSs monitor for significant deviation in normal operating system calls – the gateway between user and kernel mode. ADSs designed with discrete hidden Markov models (HMMs) have been shown to produce a high level of accuracy (Hoang and Hu, 2004; Warrender et al., 1999). A well trained HMM provides a compact detector that captures the underlying structure of a process based on the temporal order of system calls, and detects deviations from normal system call sequences with high accuracy and tolerance to noise. An ADS allows to detect novel attacks, however will typically generate false alarms due in large part to the limited amount of representative data (Chandola et al., 2009a). Because the collection and analysis of training data to design and validate an ADS is costly, the anomaly detector will have an incomplete view of the normal process behavior, and hence misclassify rare normal events as anomalous. Substantial changes to the monitored environment, such as application upgrade or changes in users behavior, reduce the reliability of the detector as the internal model of normal behavior diverges with respect to the underlying data distribution. Since new training data may become available over time, an ADS should accommodate newly-acquired data, after it has originally been trained and deployed for operations, in order to maintain or improve system performance. The incremental re-estimation of HMM parameters is a common approach to accommodate new data, although it raises several challenges. Parameters should be updated from new data without requiring access to the previous training data, and without corrupting previously-learned models of normal behavior (Grossberg, 1988; Polikar et al., 2001). Standard techniques for estimating HMM parameters involve iterative batch learning algorithms (Baum et al., 1970; Levinson et al., 1983), and hence require observing the entire training data prior to updating HMM parameters. Given new training data, these techniques can be costly since they involve restarting the HMM training using all cu- 155 mulative training data. Alternatively, several efficient on-line learning techniques have been proposed to re-estimate HMM parameters continuously upon observing each new observation symbol or new observation sub-sequence from an infinite data stream (FlorezLarrahondo et al., 2005; Mizuno et al., 2000). In practice however, on-line learning using blocks comprising limited amount of data yields a low level of performance as one pass over each block is not sufficient to capture the phenomena (Khreich et al., 2009a). It is also possible to adapt on-line learning techniques for incremental learning over several passes of each block, although it requires strategies to manage the learning rate (Khreich et al., 2009a). Moreover, using a single HMM with a fixed number of states for incremental learning may not capture a representative approximation of the underlying data distribution due to the many local maxima in the solution space. Ensemble methods have been recently employed to overcome the limitations faced with a single classifier system (Dietterich, 2000; Kuncheva, 2004a; Polikar, 2006; Tulyakov et al., 2008). Theoretical and empirical evidence have shown that combining the outputs of several accurate and diverse classifiers is an effective technique for improving the overall system accuracy (Brown et al., 2005; Dietterich, 2000; Kittler, 1998; Kuncheva, 2004a; Polikar, 2006; Rokach, 2010; Tulyakov et al., 2008). In general, designing an ensemble of classifiers involves generating a diverse pool of base classifiers (Breiman, 1996; Freund and Schapire, 1996; Ho et al., 1994), selecting an accurate and diversified subset of classifiers (Tsoumakas et al., 2009), and then combining their output predictions (Kuncheva, 2004a; Tulyakov et al., 2008). Most existing ensemble techniques aim at maximizing the overall ensemble accuracy, which assumes fixed prior probabilities and fixed misclassification costs. In many real world applications, like anomaly detection, prior class distributions are highly imbalanced and misclassification costs may vary over time. Nevertheless, common techniques used for fusion, such as voting, sum or averaging, assume independent classifiers and do not consider class priors (Kittler, 1998; Kuncheva, 2002). This chapter focuses on HMM-based systems applied to anomaly detection, and 156 in particular on the efficient adaptation of ensembles of HMMs (EoHMMs) over time, in response to new data. The receiver operating characteristic (ROC) curve is commonly used for evaluating the performance of detectors at different operating points, without committing to a single decision threshold (Fawcett, 2004). ROC analysis is robust to imprecise class distributions and misclassification costs (Provost and Fawcett, 2001). In a previous work, the authors proposed a ROC-based iterative Boolean combination (IBC) technique for fusion of responses from any crisp or soft detector trained on fixed-size data sets (Khreich et al., 2010b). The IBC algorithm inputs all generated classifiers, and combines their responses in the ROC space by applying all Boolean functions, over several iterations until the ROC convex hull (ROCCH) stops improving. IBC was applied to combine the responses from a multiple-HMM ADS, where each HMM is trained through batch learning using the same data but with different number of states and initializations. Results have shown that detection accuracy improves significantly, especially when training data are limited and class distributions are imbalanced (Khreich et al., 2010b). The IBC is a general decisionlevel fusion technique in the ROC space that aims at improving ensembles accuracy, when learning is performed from a fixed-size set of limited training data. However, IBC does not allow to efficiently adapt a fusion function over time when new data becomes available, since it requires a fixed number of classifiers. In contrast, this chapter presents a new ROC-based system to efficiently adapt EoHMMs over time, from new training data, according to a learn-and-combine approach. Given new training data, a new pool of HMMs is generated from newly-acquired data using different HMM states and initializations. The responses from these newly-trained HMMs are then combined with those of the previously-trained HMMs in ROC space using the incremental Boolean combination (incrBC) technique. IncrBC extends on IBC in that Boolean fusion of HMMs may be adapted for new data, without multiple iterations. However, system efficiency and scalability remain a serious issue. On its own, incrBC allows to discard previously-learned data, yet it retains all previously-generated HMMs in 157 the pool. Over time, the pool size would therefore grow indefinitely, yielding unnecessarily high memory requirements, and increasingly complex Boolean fusion rules. Operating ever-growing EoHMMs would increase computational and memory complexity. To overcome these limitations, specialized algorithms are proposed for memory management. First, ensemble selection algorithms are proposed to efficiently select diversified EoHMMs from the pool, and to adapt the Boolean fusion functions and decision thresholds to improve overall system performance. The proposed system may employ one of two ensemble selection algorithms, called BCgreedy and BCsearch , that are adapted to benefit from the monotonicity in incrBC accuracy, for a reduced complexity. Both algorithms feature rank- and search-based selection strategies to optimize ROOCH of combinations, and hence maximize the area under the ROCCH (AUCH). Both algorithms preserve the ROCCH of combination which allows visualizing the expected performance over the entire ROC space, and adapting to changes in operating conditions during system operations, such as tolerated false alarm rate, prior probabilities and costs of errors. Most existing techniques for adapting ensembles of classifiers require restarting the training, selection, combination, and optimization procedures to account for such a change in operating conditions. Finally, to address the potentially growing memory requirements over time, the proposed system integrates a memory management strategy to prune less relevant HMMs from the pool. For a proof-of-concept validation, this system is applied to adaptive anomaly detection from system call sequences. The experiments are conducted on both synthetically generated data and sendmail data from the University of New Mexico (UNM) data sets1 (Warrender et al., 1999). UNM data sets are still commonly used in literature (Forrest et al., 2008; Kershaw et al., 2011) due to limited publicly available system call data sets. In fact labeling real system call sequences is a challenging task, which depends on detector window size. To overcome issues encountered when using real-world system call data for anomaly-based HIDS, the experimental results are complemented with synthetically gen1 http://www.cs.unm.edu/~immsec/systemcalls.htm 158 erated data. The synthetic data generator is based on the Conditional Relative Entropy (CRE), and is closely related to the work of Tan and Maxion (Tan and Maxion, 2003). It allows simulating different processes with various complexities and provides the desired amount of normal data for training and labeled data (normal and anomalous) for testing. For both real and synthetic data sets, the data are organized into several blocks that are assumed to have been acquired over time. The new data allow to account for rare events, and adapt to changes in monitored environments such as application updates or changes in user behavior over time. The performance of the proposed system is compared to that of the reference batch BW (BBW) trained on all cumulative data, of on-line BW (OBW) (Mizuno et al., 2000), and of an algorithm for incremental learning of HMM parameters based on BW (IBW) BW (IBW) (Khreich et al., 2009a) described in Appendix III. The performance of the median and majority vote fusion functions, combining outputs from the previously-generated pool of HMMs, are also presented. Accuracy is assessed using the area under the ROC curve (AUC) (Hanley and McNeil, 1982) and the true positive rate (tpr) achieved at a fixed false positive rate (f pr) value. The rest of this chapter is organized as follows. The next section briefly reviews the HMM, as it applies to adaptive ADS from system call sequences. It also reviews techniques for incremental learning of HMM parameters, and for designing multiple classifiers systems. Section 4.3 describes the proposed ROC-based system for efficient adaptation of EoHMM from new data according to the learn-and-combine approach, including new techniques for Boolean combination and model management. Section 4.4 presents the experimental methodology (data sets, evaluation methods and performance metrics) used for proof-of-concept computer simulations. Simulation results are presented and discussed in Section 4.5. 159 4.2 4.2.1 Adaptive Anomaly Detection Systems Anomaly Detection Using HMMs HMMs have been shown successful in detecting anomalies in sequences of operating system calls (Chen and Chen, 2009; Hoang and Hu, 2004; Khreich et al., 2009b; Tan and Maxion, 2003; Warrender et al., 1999). A discrete-time finite-state HMM is a stochastic process determined by the two interrelated mechanisms – a latent Markov chain having a finite number of states, N , and a set of observation probability distributions, each one associated with a state. Starting from an initial state Si ∈ {S1 , ..., SN }, determined by the initial state probability distribution πi , at each discrete-time instant, the process transits from state Si to state Sj according to the transition probability distribution aij . The process then emits a symbol vk , from a finite alphabet V = {v1 , . . . , vM } with M distinct observable symbols, according to the discrete-output probability distribution bj (vk ) of the current state Sj . An HMM is commonly parametrized by λ = (π, A, B), where the vector π = {πi } is the initial state probability distribution, matrix A = {aij } is the state transition probability distribution, and matrix B = {bj (vk )} is the state output probability distribution, (1 ≤ i, j ≤ N and 1 ≤ k ≤ M ). HMMs can perform three canonical functions, evaluation, decoding and learning (Rabiner, 1989). Evaluation aims to compute the likelihood of an observation sequence o1:T given a trained model λ, P (o1:T | λ). The likelihood is typically evaluated by using a fixed-interval smoothing algorithm such as the Forward-Backward (FB) (Rabiner, 1989) or the numerically more stable Forward-Filtering Backward-Smoothing (FFBS) (Ephraim and Merhav, 2002). Decoding means finding the most likely state sequence S that best explains a given observation sequence o1:T (i.e, maximize P (S | o1:T , λ)). The best state sequence is commonly determined by the Viterbi algorithm. Learning aims to estimate the HMM parameters λ to best fit the observed batch of data o1:T . HMM parameters estimation is frequently performed according to the maximum likelihood estimation (MLE) criterion. MLE consists of maximizing the log-likelihood, log P (o1:T | λ), of the train- 160 ing data over HMM parameters space. Unfortunately, since the log-likelihood depends on missing information (the latent states), there is no known analytical solution to the learning problem. In practice, iterative optimization techniques such as the Baum-Welch (BW) algorithm (Baum et al., 1970), a special case of the Expectation-Maximization (EM) (Dempster et al., 1977), or standard numerical optimization methods such as gradient descent (Baldi and Chauvin, 1994; Levinson et al., 1983) are often employed for this task. In either case, HMM parameters are estimated over several training iterations, until the likelihood function is maximized over data samples. Each training iteration typically involves observing all available training data to evaluate the log-likelihood value and estimate the state densities by using a fixed-interval smoothing algorithm. Anomaly detection creates a profile that describes normal behaviors. Events that deviate significantly from this profile are considered anomalous. In computer and network security, legitimate system call sequences generated during the normal execution of a privileged process are typically employed in host-based ADSs (Chandola et al., 2009a; Warrender et al., 1999). As stochastic models for temporal data, HMMs provide compact detectors that are able to capture probability distributions corresponding to complex real-world phenomena, with tolerance to noise and uncertainty. Given a sufficiently large training set of legitimate system call observations, an ergodic2 HMM trained according to the BW or gradient-based algorithms, is able to capture the underlying structure of the monitored process (Hoang and Hu, 2004; Warrender et al., 1999). During operations, system calls sequences generated by the monitored process are evaluated with the trained HMM, according to the FB or FFBS algorithms, which should assign significantly lower likelihood values to anomalous sequences than to normal ones. By setting a decision threshold on the likelihood values, monitored sequences can be classified as normal or anomalous. 2 An HMM with an ergodic or fully connected topology is typically employed in situations where the hidden states have no physical meaning such as modeling a process behavior using its generated system calls. 161 Estimating the parameters of a HMM requires the specification of the number of hidden states. Although it is a critical parameter for HMM performance, this number is often chosen heuristically or empirically, by selecting the single value that provides the best performance on training data (Chen and Chen, 2009; Hoang and Hu, 2004; Warrender et al., 1999). Using a single HMM with a pre-specified number of states may have limited capabilities to capture the underlying structure of the data. HMMs trained with a fixedsize training data and with a different number of states and initializations have been shown to provided a higher level of performance in host-based ADSs (Khreich et al., 2009b). 4.2.2 Adaptation in Anomaly Detection The design of accurate detectors for ADS requires the collection of a sufficient amount of representative data. In practice however, it is very difficult to collect, analyze and label comprehensive data sets for reasons that range from technical to ethical. Limited normal data are typically provided for training HMM-based detectors, and limited labeled data are also provided for validation of an ADS. Furthermore, the underlying data distribution of normal behaviors may vary according to gradual, systematic, or abrupt changes in the monitored environment such as an application update or changes in users behavior. The ability to incrementally adapt HMM parameters in response to newly-acquired training data from the operational environment or other sources is therefore an undisputed asset for sustaining a high level of performance. Indeed, refining a HMM to novelty encountered in the environment may reduce its uncertainty with respect to the underlying data distribution. Assume that an HMM-based ADS is a priori trained, optimized and validated off-line using a finite set of system call data, to provide an acceptable level of performance in terms of false and true positive rates. During operations, monitoring a centralized server, for instance, the system is susceptible to produce a higher false alarms rate than tolerated by the system administrator. This is largely due to the limited amount data 162 that is available for training. The anomaly detector will have an incomplete view of the normal process behavior, and rare events will be mistakenly considered as anomalous. The HMM detector should be refined incrementally on some additional normal data when it becomes available, to better fit the normal behavior of process in consideration. Such situations may also occur when trying to implement an HMM-based ADS specialized for monitoring a specific process, but on different hosts machines with different settings and functionality. Generic models that are trained using data corresponding to common settings, could be set into operation then incrementally refined to fit the normal process behavior on each host. Otherwise, training data must be collected and analyzed from the same process on every host. As a part of an ADS, the system administrator plays a crucial role in providing new data for training and validation of the detectors. By accommodating new data and refining its detection boundaries ADSs motivate the administrator to interact more positively and build confidence in the system responses over time. When an alarm is raised, the suspicious system call sub-sequences are logged and analyzed for evidence of an attack. If an intrusion attempt is confirmed, the corresponding anomalous system calls should be provided to update the validation set (V). A response team should react to limit the damage, and a forensic analysis team should investigate the cause of the successful attack. Otherwise, the intrusion attempt is considered as a false alarm and these rare sub-sequences are tagged as normal and employed to update the HMM detectors. One challenge is the efficient integration of the newly-acquired data into the ADS without corrupting the existing knowledge structure, and thereby degrading the performance. 4.2.3 Techniques for Incremental Learning of HMM Parameters Incremental learning refers to the ability of an algorithm to learn from new data after a classifier has already been trained from previous data and deployed for operations. As illustrated in Figure 4.1, once a new block of data becomes available, incremental learning produces a new hypothesis (i.e., decision rule) that depends only on the previous 163 h0 HMM1 (λ1 ) h1 HMM2 (λ2 ) h2 hn−1 HMMn (λn ) D1 D2 Dn 1 2 n hn Time Figure 4.1 An incremental learning scenario where blocks of data are used to update the HMM parameters (λ) over a period of time. Let D1 , D2 , ..., Dn be the blocks of training data available to the HMM at discrete instants in time. The HMM starts with an initial hypothesis h0 associated with the initial set of parameters λ0 , which constitutes the prior knowledge of the domain. Thus, h0 and λ0 get updated to h1 and λ1 on the basis of D1 , then h1 and λ1 get updated to h2 and λ2 on the basis of D2 , and so forth hypothesis and the current training data (Polikar et al., 2001). Incremental learning allows to decrease the memory requirements needed to learn the new data, since there is no need to store and access previously-learned data. Furthermore, since training is performed only on new training blocks, and not on all accumulated data, incremental learning would also lower the time complexity needed to learn new data. The main challenge of incremental learning is the ability to sustain a high level of performance without corrupting previously-learned knowledge (Grossberg, 1988; Polikar et al., 2001). In fact, considering the HMM parameters obtained using a batch learning technique on a first block of sub-sequences as starting point for training the HMM on a second newly-acquired block, may not allow the optimization process to escape the local maximum associated with the first block. Remaining trapped in local maxima leads to knowledge corruption and hence to a decline in system performance (see Figure 4.2). Several on-line learning techniques have been proposed in literature to re-estimate HMM parameters from a continuous stream of symbols. The objective is to optimize HMM parameters to fit the source generating the data through one observation of the data. These techniques are typically designed to reduce the training time and memory complexity and to accelerate the convergence when the learning scenario involves very long stream of data. These techniques are derived from their batch counterparts, and in- 164 clude, for instance, EM-based (Florez-Larrahondo et al., 2005; Mizuno et al., 2000) and gradient-based (Baldi and Chauvin, 1994; LeGland and Mevel, 1997) techniques. However, the key difference is that the on-line optimization and update of HMM parameters are continuously performed upon observing each new sub-sequence (Baldi and Chauvin, 1994; Mizuno et al., 2000) or new symbol (Florez-Larrahondo et al., 2005; LeGland and Mevel, 1997) of observation, with no iterations. In practice however, when an on-line learning technique is applied to incremental learning of a finite data set, one pass over the sub-sequences within a new training block is insufficient to capture the phenomena. In addition, the convergence properties of these on-line algorithms no longer hold when provided with limited amount of data. Several iterations over the new block of observations are therefore required to better fit the HMM parameters to the newly-acquired data, and hence provide an operational model with acceptable performance. When several iterations are allowed, on-line algorithms may attain a local maximum on the log-likelihood function associated with the current block of data, and must overcome the same issue of remaining trapped in the previous local maximum (see Figure 4.2). A potential advantage of applying on-line learning over batch learning techniques to incremental learning resides in their added stochasticity, which stems from the rapid reestimation of HMM parameters after each sub-sequence or symbol of observations. This may aid escaping local maxima during the early adaptation to newly provided blocks of data. In addition, a fixed or monotonically decreasing learning rate is typically employed in these algorithms to integrate pre-existing knowledge of the HMM and the newlyacquired information associated with each sub-sequence or observation symbol. This may alleviate knowledge corruption. Figure 4.2 illustrates the decline in performance that may occur with batch or on-line algorithms during incremental update of one HMM parameter. The values λ∗1 , λ∗2 , and λ∗12 correspond to the optimal values of parameters after learning D1 , D2 and D1 ∪ D2 , 165 respectively. Assume that the training block D2 , which contains one or multiple subsequences, becomes available after an HMM(λ1 ) has been previously trained on a block D1 and deployed for operations. An incremental algorithm that optimizes, for instance, the log-likelihood function must re-estimate the HMM parameters over several iterations until this function is maximized for D2 . The optimization of HMM parameters depends on the shape and position of the log-likelihood function of HMM(λ2 ) with respect to that of HMM(λ1 ). When an incremental learning technique is employed to train on D2 alone, λ∗1 is the starting point for re-estimating the HMM. If the log-likelihood function of HMM(λ2 ) has some local maxima in the proximity of λ∗1 , the optimization may remain trapped in these maxima, and hence does not accommodate D2 , providing a similar model parameter. If λ∗1 was selected according to point (a), the optimization will become trapped in the local maximum at point (c) instead of approaching λ∗12 . This leads to a knowledge corruption, and a decline in HMM performance. In contrast, training a new HMM(λ2 ) on D2 from the start using different initializations may lead to point (d). As described in subsequent sections, combination of responses from two HMMs selected according to λ∗1 and λ∗2 exploits their complementary information for a higher overall level of performance. 4.2.4 Incremental Learning with Ensembles of Classifiers Multiple classifier systems (MCSs) are based on the combination of several accurate and diverse base classifiers to improve the overall system accuracy (Brown et al., 2005; Dietterich, 2000; Kittler, 1998; Kuncheva, 2004a). MCSs have been employed to overcome the limitations faced with a single classifier system (Dietterich, 2000; Kuncheva, 2004a; Polikar, 2006; Tulyakov et al., 2008). According to the no-free-lunch theorem (Wolpert and Macready, 1997), no algorithm is best for all possible problems. Classifiers may converge to different local optima according to different initializations or different parameter values. Different classifiers can have different domains of expertise or perceptions of the same problem. Therefore, combining the predictions of an ensemble of classifiers (EoC) reduces the risk of selecting a single classifier with poor generalization performance. 166 Optimization Space of one Model Parameter (λ) λ∗12 λ∗2 Log-Likelihood λ∗1 (d) (a) (c) (b) HMM(λ1 ) trained on D1 HMM(λ2 ) trained on D2 HMM(λ12 ) trained on D1 ∪ D2 Figure 4.2 An illustration of the decline in performance that may occur with batch or on-line estimation of HMM parameters, when learning is performed incrementally on successive blocks Furthermore, base classifiers may commit different errors providing an EoC with complementary information that increase system performance. EoCs provide an alternative solution for incremental learning by combining classifiers trained on each block of data as it becomes available, instead of selecting and updating a single classifier. In general, the design of EoCs involves generating a diversified pool of base classifiers, selecting a subset of members from that pool, and then combining their output predictions such that the overall accuracy and reliability is improved. A pool can be composed of either homogeneous or heterogeneous classifiers. Homogeneous classifiers are generated by the same classifier trained using different manipulations of parameters, training data (Breiman, 1996; Freund and Schapire, 1996), or input features (Ho, 1998). Heterogeneous classifiers are generated by training different classifiers on the same data set. The combination of selected classifiers typically occurs at the score, rank or decision levels (Kuncheva, 2004a; Tulyakov et al., 2008), and may be static (e.g., sum, product, majority voting, etc.), adaptive (e.g., weighted average, weighted voting, etc.) or trainable (also known as stacked or meta-classifier). Fusion at the score level is most 167 prevalent in literature (Kittler, 1998). Normalization of the scores is typically required before applying static or adaptive fusion functions, which may not be a trivial task for heterogeneous classifiers. Trainable approaches, where a global meta-classifier is trained for fusion on classifiers responses (Kittler, 1998; Roli et al., 2002; Ruta and Gabrys, 2005; Wolpert, 1992), are prone to overfitting and require a large independent validation set for estimating the parameters of the combining classifier (Roli et al., 2002; Wolpert, 1992). Fusion at the rank level is mostly suitable for multi-class classification problems, where the correct class is expected to appear in the top of the ranked list (Ho et al., 1994; Van Erp and Schomaker, 2000). Fusion at the decision level exploits the least amount of information since only class labels are input to the combiner. Compared to the other fusion methods, it is less investigated in literature. Majority voting (Ruta and Gabrys, 2002), weighted majority voting (Ali and Pazzani, 1996; Littlestone and Warmuth, 1994) and behavior-knowledge-space (BKS) methods (Raudys and Roli, 2003) are the most representative decision-level fusing methods. One issue that appears with decision level fusion is the possibility of ties. The number of base classifiers must therefore be greater than the number of output classes. BKS only applies to low dimensional problems. Moreover, in order to have an accurate probability estimation, it requires a large independent training and validation set to design the combination rules. EoCs may be adapted for incremental learning by generating a new pool of classifiers as each block of data becomes available, and combining the outputs with those of previouslygenerated classifiers with some fusion technique (Kuncheva, 2004b). The main advantage of using EoCs to implement a learn-and-combine approach is the possibility of avoiding knowledge corruption as classifiers are trained from the start on a new block of data. For instance, the adaptive boosting methods employed in AdaBoost (Freund and Schapire, 1996) have been extended for incremental learning from new blocks of data in Learn++ (Polikar et al., 2001). Given a fixed-size training set, AdaBoost iteratively generates a sequence of weak classifiers each focusing on training observations that have been misclassified by the previous classifier. This is achieved by updating a weight distribution 168 over the entire training set according to the performance of the previously-generated classifier. At each iteration, AdaBoost increases the weights of wrongly classified training observations and decreases those of correctly classified observations, such that the new classifier focuses on hard-to-classify observations. The final output is a weighted majority voting of all generated classifiers. By contrast, Learn++ generates a number of base classifiers for each block of data that becomes available. The weight distribution is updated according to the performance of all previously-generated classifiers, which allows to accommodate previously unseen classes (Polikar et al., 2001). The outputs are then combined through weighted majority voting to obtain the final classification. Improvements proposed for this algorithm essentially differ by the combination functions and weight distribution update. In particular, Learn++ variant for imbalanced data (Muhlbaier et al., 2004) employs a class conditional weight factor for updating classifiers weight and accounting for class imbalance. This factor is the ratio of the number of observations from a particular class used for training a classifier, to the number of observations from the same class used for training all other previously-generated classifiers. An on-line version of bagging and boosting ensembles have been proposed for learning from labeled streams of data (Oza and Russell, 2001). The bootstrap sampling and weight distribution update that are employed for fixed-sized data sets are now simulated according to Poisson distribution for data streams. For on-line AdaBoost algorithm, the ensemble is composed of a fixed number of classifiers. Each new observation is presented sequentially to the ensemble members. A classifier is trained on this observation k times, where k is drawn from a Poisson distribution parametrized by the observation weight. When an observation is misclassified its weight increases and hence will be presented more often to the following classifier in the ensemble. The final output is the weighted majority vote over all the base classifiers. However, on-line bagging and boosting techniques consider learning continuously from labeled streams of data. 169 The techniques described above are designed for two- or multi-class classification and hence require labeled training data sets to compute the errors committed by the base classifiers and update the weight distribution at each iteration. In HMM-based ADSs, the HMM detector is a one-class classifier that is trained using the normal system call data only as described in Sections 4.2.1 and 4.2.2. Therefore, they are not suitable for ADSs using system call sequences. Some authors however, adopt these techniques by considering that rare normal system call sequences are anomalous. For instance, an EoHMMs based on discrete AdaBoost and its on-line version has been applied for anomaly detection (Chen and Chen, 2009). With this method, an arbitrary threshold is set according to the log-likelihood output from HMMs trained on the normal system call behavior, below which normal system call sequences are considered as anomalous. Five HMMs are trained with the same number of states on fixed-sized data, and then employed as base detectors in AdaBoost algorithm. When new data becomes available, these HMMs are updated according to an on-line AdaBoost variant (Oza and Russell, 2001). Threshold setting is shown to have a significant impact on the detection performance of the EoHMMs (Chen and Chen, 2009). In fact, rare system call events are normal, if they are considered anomalous during the design phase, they will generate false alarms during the testing phase. These rare events may be suspicious if, during operation, they occur in bursts over a short period of time. Another specific combination technique for HMMs consists of weight averaging the parameters of an HMM trained on a newly-acquired block of data with those of a previouslytrained HMMs (Hoang and Hu, 2004). Although this technique may work for left-right HMMs, it is not suitable for ergodic HMMs as their states are permuted with each different training phase, and hence averaging parameters leads to knowledge corruption. Furthermore, it is restricted to combining HMMs with the same number of states. For one- or two-class classification problems, Boolean combination of classifier responses in the ROC space provides several advantages over related fusion techniques (Khreich et al., 2010b; Tao and Veldhuis, 2008). It does not require any prior assumption regard- 170 ing the independence of classifiers, in contrast with other fusion functions such as sum, product or averaging that are based on the independence of classifiers (Kittler, 1998; Kuncheva, 2002). In addition, normalization of heterogeneous classifiers output scores is not required, because ROC curves are invariant to monotonic transformation of classification thresholds (Fawcett, 2004). Boolean combination techniques may therefore be applied to combine the responses from any crisp or soft homogeneous or heterogeneous classifiers, or from classifiers trained on different data. Most existing ensemble techniques, especially those proposed for incremental learning, aim at maximizing the system performance through a single measure of accuracy. This implicitly assumes fixed prior probabilities and misclassification costs. Furthermore, an arbitrary and fixed threshold is typically set (implicitly or explicitly) on the output scores of soft classifiers of an ensemble prior to taking a majority voting decision. In anomaly detection however, prior class distributions are highly imbalanced and misclassification costs are different and both may change over time. As detailed in Section 4.3.1, Boolean combination in the ROC space aims at maximizing the overall convex hull of combinations overall classifiers decision thresholds, which allows to adapt to changes in prior probabilities and cost of errors during system operation. 4.3 Learn-and-Combine Approach Using Incremental Boolean Combination In this section, an adaptive system is proposed for incremental learning of new data over time using a learn-and-combine approach. As new blocks of data become available, the system learn new HMMs to evolve a pool of classifiers, and performs incremental Boolean combination (incrBC) of a diversified subset of HMM responses in the ROC space. This is achieved by selecting the HMMs, decision thresholds, and Boolean functions such that the overall EoHMMs performance is maximized. The main components of the system are illustrated in Figure 4.3 and described in Algorithm 4.1. When a new block of normal training sub-sequences (Dk ) becomes available, a new pool of base HMMs (Pk ) is generated according to different number of states and 171 random initializations. (Further details on HMMs training and validation are given in Section 4.4.2.) Pk is then appended to the previously-generated pool of base HMMs P = {P1 , P2 , . . . , Pk−1 } ∪ Pk , and Dk is discarded. The model management module, described in Section 4.3.2, performs model selection and pruning. The selection module forms the EoHMMs (Ek ) considered for operations, by selecting HMMs from the pool P that most improve the overall system performance according to one of two proposed model selection algorithms, BCgreedy and BCsearch (see Algorithm 4.3 and 4.4), both of which depend on the incremental Boolean combination module. The pruning module controls the size of the pool by discarding less accurate HMMs that have not been selected for some user-defined time interval. To form and EoHMMs, the incremental Boolean combination module combines the responses from the selected HMMs in the ROC space using all Boolean functions according to the incrBC algorithm (see Algorithm 4.2 in Section 4.3.1). The incrBC algorithm also selects and stores the set (S) of decision thresholds (from each base HMM of the EoHMMs) and Boolean functions that most improves the overall EoHMMs performance. The HMMs that form the EoHMMs, Ek , and the set of decision thresholds and Boolean functions, S, are used during operations. A set of validation data (V) comprised of normal and anomalous sub-sequences, is required during the design phase for selection of HMMs, decision thresholds, and Boolean functions. It is managed and input by the system administrator as discussed in Section 4.2.2. When manifestations of novel attacks are discovered, relevant anomalous subsequences are then stored to update the validation set. The HMMs, decision thresholds, and Boolean combinations must then be re-selected to accommodate the new information. Furthermore, the system administrator ensures that the blocks of training data provided for updating the pool of HMMs are clean from intrusions. He is also responsible for selecting and tuning the model management algorithms to trade-off the size of the pool and EoHMMs against overall system accuracy. The proposed approach inherits all desirable properties of Boolean combination in the ROC space as detailed in the following sections. It covers the whole performance range of Validation data set (normal & anomalous) Forensic information on past intrusion V Tuning parameters Attack-free training data Generation of pool of HMMs HMM parameter estimation Dk New block of normal training data Pk : new pool of HMMs ⎧ k ⎫ ⎪ ⎨ λ1 ⎪ ⎬ .. . ⎪ ⎪ ⎩ k ⎭ λL Operational phase System administrator Design phase 172 Model Management Model Selection • BCgreedy algo. • BCsearch algo. Ek : EoHMMs 1≤i≤k Ek = λil ∈ P 1≤l≤L Model Pruning Dk−1 D1 Discard old training data Pk−1 = {λk−1 , . . . , λk−1 1 L } .. . • Discard HMMs with low accuracy that are infrequently selected by BCgreedy or BCsearch S : set of decision thresholds (λil , tj ) and Boolean functions (bf ) P1 = {λ11 , . . . , λ1L } P : previously-generated pool of HMMs Incremental Boolean Combination • incrBC algorithm S= λil , tj , bf , λil ∈ Ek Figure 4.3 Block diagram of the adaptive system proposed for incrBC of HMMs, trained from newly-acquired blocks of data Dk , according to the learn-and-combine approach. It allows for an efficient management of the pool of HMMs and selection of EoHMMs, decision thresholds, and Boolean functions false and true positive rates, which allows for a flexible selection of the desired operating performance. As conditions change, such as prior probabilities and costs of errors, the overall convex hull of combinations does not change; only the portion of interest. This change shifts the optimal operating point to another vertex on the composite convex hull. The corresponding HMMs are then activated, and their responses are combined according to different decision thresholds and Boolean functions. 4.3.1 Incremental Boolean Combination in the ROC Space The incremental Boolean combination (incrBC) technique summarized in this section has been proposed by the authors (Khreich et al., 2010a,b) to combine the responses from models trained with different number of states and initialization on a fixed-size data set. In the learn-and-combine approach for incremental learning, it is adapted to combine the 173 Algorithm 4.1: Learn-and-combine approach using incremental Boolean combination input : Training data blocks D1 , D2 , . . . , DK that become available over time (normal sub-sequences) Labeled validation data set V (normal and anomalous sub-sequences) output: Selected HMMs to form the EoHMMs (Ek ) of size |Ek | Selected set (S) of decision thresholds (λil , tj ) and Boolean functions, each of size 2 to |Ek | detector 1 select selection_algo ∈ {BCgreedy , BCsearch } ; // choose the model selection algorithm 2 set counti ← 0, ∀λil ∈ P ; // counter for unselected HMMs over time 3 set LT ; // lifetime expectancy of models in the pool 4 foreach Dk do 5 train new pool of HMMs, Pk = {λk1 , . . . , λkL } ; // use different no. states and initializations 6 // add the new pool to previously-learned pools P ← P ∪ Pk ; 7 switch selection_algo do 8 case BCgreedy 9 Ek = BCgreedy (P, V) ; // use greedy selection (Algorithm 4.3) 10 11 case BCsearch Ek = BCsearch (P, V) ; (Algorithm 4.4) // use incremental search // both algorithms call the incrBC algorithm (Algorithm 3.1) for selecting and storing the set of best decision thresholds and Boolean functions, S = {(λil , tj ), bf }, λil ∈ Ek 12 count(λil ) ← count(λil ) + 1, ∀λil ∈ P\Ek ; // increment counter for unselected HMMs 13 if k > LT then prune λil : count(λil ) > LT ; // discard HMMs that are not selected for LT blocks from P 14 15 return Ek and S responses of HMMs trained with different number of states and initialization, however using new data blocks that are acquired over time. A crisp detector outputs a class label while a soft detector assigns scores or probabilities to the input samples, which can be converted to a crisp detector by setting a decision threshold on the scores. Given the responses of a crisp detector on a validation set, the true positive rate (tpr) is the proportion of positives correctly classified over the total number of positive samples. The false positive rate (f pr) is the proportion of negatives 174 incorrectly classified over the total number of negative samples. A ROC curve is a plot of tpr against f pr. A crisp detector produces a single data point in the ROC plane, while a soft detector produces a ROC curve by varying the decision thresholds. For equal priors and cost of errors, the optimal decision threshold (minimizing overall costs) corresponds to the vertex that is closest to the upper-left corner of the ROC plane. Given two operating points, say a and b, in the ROC space, a is defined as superior to b if f pra ≤ f prb and tpra ≥ tprb . If one ROC curve has all its points superior to those of another curve, it dominates the latter. If a ROC curve has tprx > f prx for all its points x then, it is a proper ROC curve. In practice, the number of decision thresholds is the number of distinct score values assigned by a soft detector to the validation samples. This number also corresponds to the number of resulting crisp detectors and to the number of vertices on the empirical ROC plot. In fact, an empirical ROC plot is obtained by connecting the observed (tpr, f pr) pairs of a soft detector at each decision threshold. With a finite number of decision thresholds, an empirical ROC plot is a step-like function which approaches a true curve as the number of samples, and hence the number of decision thresholds, approaches infinity. Therefore, it is not necessarily convex and proper. Concavities indicate local performance that is worse than random behavior and may provide diverse information. The convex hull of an empirical ROC plot (ROCCH) is the smallest convex set containing its vertices, i.e., the piece-wise outer envelope connecting only its superior points. It may be used to combining detectors based on a simple interpolation between their responses (Provost and Fawcett, 2001; Scott et al., 1998). This approach has been called the maximum realizable ROC (MRROC) (Scott et al., 1998) since it represents a system that is equal to, or better than, all the existing systems for all Neyman-Pearson criteria (Neyman and Pearson, 1933). However, the MRROC discards responses from inferior detectors which may provide diverse information for an improved performance. Using Boolean conjunction and disjunction functions to combine the responses of multiple soft detectors in the ROC space have shown and improved performance over the MRROC 175 (Haker et al., 2005; Tao and Veldhuis, 2008). However, these techniques assume that the detectors are conditionally independent, and their of ROC curves are convex. These assumptions are violated in most real-world applications, especially when detectors are designed using limited and imbalanced training data. Furthermore, the correlation between soft detector decisions also depends on the threshold selections. The incrBC technique applies all Boolean functions to combine the responses of multiple crisp or soft detectors, requires no prior assumptions, and selects the thresholds and Boolean functions that improve the MRROC (Khreich et al., 2010b). By applying all Boolean functions to combine the responses of detectors at each corresponding decision threshold, the incrBC technique implicitly accounts for the effects of correlation among detectors and accommodates for the concavities in the ROC curves. Indeed, AND and OR rules will not provide improvements for the inferior points that correspond to concavities. Other Boolean functions, for instance those that exploit negations of responses, may however emerge (see Figure 4.4). The main steps of incrBC are presented in Algorithm 3.1. Given a pool of K HMMs P = {λ1 , λ2 , . . . , λK } having nk , (k = 1, . . . , K), distinct decision thresholds on a validation set (V), the incrBC algorithm starts by combining the responses of the first two HMMs in the pool. Each of the ten Boolean functions is applied to combine the responses of each decision threshold from the first HMM (λ1 , ti ), i = 1, . . . , n1 , with the responses of each decision threshold from the second HMM (λ2 , tj ), j = 1, . . . , n2 . The fused responses are then mapped to points (f pr, tpr) in the ROC space and their ROCCH is computed. Then, incrBC selects the emerging vertices which are superior to the ROCCH of original HMMs, stores the set (S) of selected decision thresholds from each HMM and Boolean functions, and updates the ROCCH to include the new emerging vertices. The responses of previously emerging vertices, resulting from the first two HMMs, are then combined with the responses of the third HMM and so on, until the last HMM. The incrBC algorithm outputs the final composite ROCCH for visualization and selection of operating points (see Figure 4.4). It also stores the selected set of decision thresholds and Boolean 176 Algorithm 4.2: incrBC(λ1 , λ2 , . . . , λK , V): Incremental Boolean combination of HMMs in ROC space input : K HMMs (λ1 , λ2 , . . . , λK ) and a validation set V of size |V| output: ROCCH of combined HMMs where each vertex is the result of 2 to K combination of crisp detectors. Each combination selects the best decision thresholds from different HMMs (λi , tj ) and the best Boolean function, which are stored in the set (S) 1 nk ← no. decision thresholds of λk using V ; // no. vertices on ROC(λk ) or no. crisp detectors 2 BooleanF unctions ← {a ∧ b, ¬a ∧ b, a ∧ ¬b, ¬(a ∧ b), a ∨ b, ¬a ∨ b, a ∨ ¬b, ¬(a ∨ b), a ⊕ b, a ≡ b} 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 // combine responses from the first two HMMs compute ROCCH1 of the first two HMMs (λ1 and λ2 ) allocate F an array of size: [2, n1 × n2 ] ; // temporary storage of combination results foreach bf ∈ BooleanF unctions do for i ← 1 to n1 do R1 ← (λ1 , ti ) ; // responses of λ1 at decision threshold ti using V for j ← 1 to n2 do R2 ← (λ2 , tj ) ; // responses of λ2 at decision threshold tj using V Rc ← bf (R1 , R2 ) ; // combine responses using current Boolean function compute (tpr, f pr) of Rc using V ; // map combined responses to ROC space (1 pt) push (tpr, f pr) onto F compute ROCCH2 of all ROC points in F nev ← number of emerging vertices ; // no. vertices of ROCCH2 superior to ROCCH1 S2 ← {(λ1 , ti ), (λ2 , tj ), bf } // set of selected decision thresholds from each HMM and Boolean functions for emerging vertices // combine responses of each successive HMM with the previously-combined responses for k ← 3 to K do allocate F of size: [2, nk × nev ] foreach bf ∈ BooleanF unctions do for i ← 1 to nev do Ri ← Sk−1 (i) ; // responses from previous combinations for j ← 1 to nk do Rk ← (λk , tj ) Rc ← bf (Ri , Rk ) compute (tpr, f pr) of Rc using V push (tpr, f pr) onto F compute ROCCHk of all ROC points in F nev ← number of emerging vertices ; // no. vertices of ROCCHk superior to ROCCHk−1 Sk ← {Sk−1 (i), (λk , tj ), bf } // set of selected subset from previous combinations, decision thresholds from the newly-selected HMM, and Boolean functions for emerging vertices store Sk , 2 ≤ k ≤ K return ROCCHK functions (S = {(λi , tj ), bf }, 2 ≤ i ≤ K; 1 ≤ j ≤ ni ), for each vertex on the composite ROCCH to be applied during operations. 177 1 V c15c16 c14 0.5 c13 λ11 λ12 c12 c11 0.5 tpr 1 0.5 c23 λ13 c25 c24 c22 c21 S2 λ11 , auc=0.660 λ12 , auc=0.715 CH2 , auc=0.831 (λ11 , t1 ) 0 P1 = (λ12 , t3 ) (λ12 , t5 ) (λ12 , t2 ) (λ12 , t1 ) (λ11 , t4 ) (λ11 , t2 ) 0 c17 c18 (λ11 , t3 ) (λ11 , t5 ) (λ12 , t6 ) (λ12 , t4 ) c16 c26 c18 c17 1 c24 λ14 0.5 c33 c32 c c31 c2223 c21 c22 = c12 ∨ (λ13 , t1 ) c23 = c13 ∨ (λ13 , t2 ) c24 = c17 ∧ (λ13 , t3 ) CH3 , auc=0.890 0.5 tpr 0 c25 (λ14 , t3 ) (λ14 , t1 ) Design phase 0 λ11 λ12 0 0.5 f pr 1 λ11 ∧ λ12 ¬λ11 ∧ λ12 λ11 ∧ ¬λ12 ¬λ1 ∧ ¬λ2 ¬λ11 ∨ λ12 λ11 ∨ λ12 λ11 ∨ ¬λ12 ¬λ11 ∨ ¬λ12 λ11 ⊕ λ12 λ11 ≡ λ12 1 f pr c31 S4 = ¬c23 ∧ (λ14 , t1 ) c32 = c22 ∨ (λ14 , t2 ) 0 Composite ROCCH (CH2 ) Original ROCCH (CH1 ) c26 c18 incrBC(CH3 , λ14 ) (λ14 , t2 ) 1 Each Boolean function λ13 , auc=0.735 (λ13 , t1 ) c16 = ¬(λ11 , t3 ) ∨ (λ12 , t3 ) 1 1 c17 = (λ1 , t4 ) ∨ (λ2 , t4 ) c18 = (λ11 , t5 ) ∨ (λ12 , t6 ) All Boolean functions 0 c12 c11 0 c11 c12 c13 = ¬(λ11 , t1 ) ∧ (λ12 , t2 ) 1 1 = ¬(λ1 , t1 ) ∧ (λ2 , t5 ) = (λ11 , t2 ) ∨ (λ12 , t1 ) 1 f pr c15 c14 (λ13 , t3 ) c13 incrBC(CH2 , λ13 ) S 3 (λ13 , t2 ) 11 1 BC(λ11 , λ ) 2) incrBC(λ 1 ,2λ (λ, t): decision thresholds from λ c: emerging vertices tpr tpr incrBC(λ11 , λ12 ) 0.5 Validation set λ14 , auc=0.707 CH4 , auc=0.897 0.5 c33 = c22 ∨ (λ14 , t3 ) c34 = c24 1 f pr Operational phase Final composite ROCCH S = {S2 ∪ S3 ∪ S4 } Figure 4.4 An illustration of the steps involved during the design phase of the incrBC algorithm employed for incremental combination from a pool of four HMMs P1 = {λ11 , . . . , λ14 }. Each HMM is trained with different number of states and initializations on a block (D1 ) of normal data synthetically generated with Σ = 8 and CRE = 0.3 using the BW algorithm (see Section 4.4 for details on data generation and HMM training). At each step, the example illustrates the update of the composite ROCCH (CH) and the selection of the corresponding set (S) of decision thresholds and Boolean functions for overall improved system performance During the operation phases applied to incremental learning, the system uses one vertex or the interpolation between a pair of vertices on the facets of the final composite ROCCH, depending on the desired false alarm rate. Given a tolerable f pr value, the 178 Design phase tpr Final composite ROCCH 1 c25 c24 f pr=10% 0.5 cop c33 c32 c31 0 c26 c18 cop 0.85 × c33 = 0.85 × 0.15 × c24 = 0.15 × (λ13 , t3 ) (λ12 , t4 ) (λ12 , t5 ) (λ11 , t4 ) (λ11 , t1 ) (λ14 , t3 ) (λ13 , t1 ) (λ14 , t2 ) 0 S 0.5 Test set λ11 λ12 λ13 λ14 T ∨ (λ13 , t1 ) (λ13 , t3 ) (λ11 , t1 ) (λ12 , t5 ) 85% of the time (λ13 , t1 ) (λ14 , t3 ) 0: normal (λ13 , t3 ) 1: anomaly (λ12 , t4 ) 15% of the time (λ11 , t4 ) 1 f pr Operational phase ∨ (λ14 , t3 ) ¬(λ11 , t1 ) ∧ (λ12 , t5 ) 1 (λ1 , t4 ) ∨ (λ12 , t4 ) ∧ Figure 4.5 An illustration of the incrBC algorithm during the operational phase. The example presents the set of decision thresholds and Boolean functions that are activated given, for instance, a maximum specified f pr of 10% for the system designed in Figure 4.4. The operational point (cop ) that corresponds to f pr = 10% is located between the vertices c33 and c24 of the composite ROCCH, which can be achieved by interpolation of responses (Provost and Fawcett, 2001; Scott et al., 1998). The desired cop is therefore realized by randomly taking the responses from c33 with probability value of 0.85 and from c24 with probability value of 0.15 . The decision thresholds and Boolean functions of c33 and c24 are then retrieved from S and applied for operations corresponding vertex or pair of vertices (which form an edge intersecting the vertical line at the specified f pr value) are first located on the ROCCH. The decision thresholds and Boolean functions, which have been derived and stored during the design phase, are then extracted and applied during the operational phase (see Figure 4.5). When the operational conditions change, the optimal operating point can be shifted during operations to adapt for the changes by activating a different set of decision thresholds and Boolean functions. If, for instance, c32 (in Figure 4.5) has become the optimal operating point because of a different cost of errors or a different tolerance in f pr, then cop can be shifted during operation to c32 = ¬(λ11 , t1 ) ∧ (λ12 , t5 ) ∨ (λ13 , t1 ) ∨ (λ14 , t2 ) . The activated decision thresholds and Boolean functions are then updated accordingly. During the design phase, the worst-case time and memory complexity for incrBC of a pool of K HMMs are O(Kn1 n2 ) Boolean operations and O(n1 n2 ) floating point registers, where nk is the number of distinct decision thresholds of λk on a validation set V. They 179 can be roughly stated as functions of n1 and n2 , because the number of emerging vertices (nev ) from each successive combination of the remaining HMMs is typically lower than n1 and n2 . During operations, the computational overhead of the activated set of Boolean functions is lower than that of operating the required number of HMMs. Therefore, the worst-case time and memory complexity is limited to operating the K HMMs. By selecting crisp detectors from all available HMMs in the pool, the incrBC algorithm yields to unnecessarily high computational and storage cost when the pool size grows as new blocks of data become available. In addition, HMMs are combined according to the order in which they are stored in the pool. An HMM trained on new data block may capture different underlying data structure and could replace several previously-selected HMMs. Model management strategies are therefore required to manage system resources. 4.3.2 Model Management When batch learning is performed on a fixed-size data set, a fixed number of classifiers is typically generated. In this case, model management consists in selecting the most accurate and diverse base classifiers from the pool and, if necessary, pruning less relevant classifiers from the pool to reduce computational costs. The selected classifiers form an ensemble to be applied during system operations. Therefore, ensemble selection and pruning are commonly used interchangeably in related literature (Tsoumakas et al., 2009). When learning is performed incrementally, the pool size grows as new blocks of data become available over time. Different ensemble selection and pruning mechanisms are therefore required. Ensemble selection is performed at each learning stage, when a new pool of classifiers is generated from a new block of data, and selected classifiers are deployed for operations. It involves applying some optimization criteria for selection the of the best ensemble of classifiers from all base classifiers in the pool. Pruning involves discarding less relevant members of the pool from future combination according to different criteria. Discarding all unselected classifiers from the pool at each learning stage decreases the system performance as described in Section 4.3.2.2. 180 4.3.2.1 Model Selection Ensemble selection techniques are employed to choose a compact ensemble from a large pool of classifiers to increase accuracy and reduce the computational and storage costs of systems deployed during operation. A brute-force search for the optimal ensemble of classifiers results in a combinatorial complexity explosion: the search space comprises 2K possible ensembles for a pool of K classifiers. Therefore, ensemble selection attempts to select the best ensemble of classifiers from the pool based on different optimization criteria and selection strategies (Tsoumakas et al., 2009; Ulaş et al., 2009). Typical optimization criteria are ensemble accuracy (Ulaş et al., 2009), cross-entropy (Caruana et al., 2004), and ROC-based measures (Rokach et al., 2006). Although there is no universal consensus about the definition and efficiency of diversity measures (Brown et al., 2005), certain approaches have shown promising results (Banfield et al., 2003; Margineantu and Dietterich, 1997; Partalas et al., 2008; Rokach et al., 2006). The selection strategies include ranking-based techniques in which the pool members are ordered according to an evaluation measure on a validation set, and the top classifiers are then selected to form an ensemble (Martinez-Munoz et al., 2009). Search-based ensemble selection is an alternative approach which consists in combining the outputs of classifiers and then selecting the best performing ensemble evaluated on an independent validation set. A heuristic search in the space of all possible ensembles is typically performed, while evaluating the collective merit of selected members (Banfield et al., 2003; Caruana et al., 2004; Margineantu and Dietterich, 1997; Ruta and Gabrys, 2005). Typically, the selection procedure stops when a user-defined maximum number of classifiers is reached (Caruana et al., 2004) or when remaining classifiers provide no further improvement to the ensemble (Ruta and Gabrys, 2005; Ulaş et al., 2009). Many ensemble selection approaches have been considered for cost-insensitive applications with sufficient amount of training data. Few approaches consider cost-sensitive and limited training data (Fan et al., 2002; Rokach et al., 2006), which are prevalent in anomaly detection applications. Although the performance measures employed in these 181 cost-sensitive approaches consider the cost of errors and class imbalance, they are either computed at a specific decision threshold or averaged over all decision thresholds. Any change in the operating conditions, such as prior probabilities, cost of errors, and tolerable f pr values, requires reconsidering the combination and selection process. The proposed ensemble selection algorithms (BCgreedy and BCsearch ), described below, employ both rank- and search-based selection strategies to optimize the area under the ROCCH (AUCH). Before applying their selection strategies, they rank all available pool members in decreasing order of AUCH performance on a validation set. In contrast with existing techniques, the proposed selection algorithms exploit the monotonicity of the incrBC algorithm for an early stopping criterion. As show in line 15 and 28 of Algorithm 3.1, incrBC is bound below by the previous ROCCH, and hence guarantees a monotonic improvement in AUCH performance. Furthermore, both algorithms allow for adaptation to changes in prior probabilities and costs of errors by simply shifting the desired operational point, which activates different HMMs, decision thresholds and Boolean functions. As described in Algorithm 4.3, the BCgreedy algorithm employs a greedy search within the pool of HMMs. First, the HMMs in the pool are ranked in decreasing order of their AUCH performance on a validation set and the best HMM is selected. Then, the algorithm starts a cumulative combination of successive HMMs one at the time, selecting only those that improve the AUCH performance, of a cumulative EoHMMs, over a userdefined tolerance value. The algorithms stops when the last HMM in the pool is reached. The worst-case time complexity of this algorithm is O(K log K) for sorting, followed by O(K) for scanning down the ensemble, resulting in O(K log K) time complexity w.r.t. the number of Boolean combination of incrBC (see Section 4.3.1). As described in Algorithm 4.4, the BCsearch algorithm employs an incremental selective search strategy. It also ranks all HMMs in decreasing order of their AUCH performance on a validation set and selects the HMM with the largest AUCH value. Then, it combines 182 the selected HMM with each of the remaining HMMs in the pool. The HMM that most improves the AUCH performance of the EoHMMs, is then selected. The cumulative EoHMMs are then combined with each of the remaining HMMs in the pool, and the HMM that provides the largest AUCH improvement to the EoHMMs is selected, and so on. The algorithm stops when the AUCH improvement of the remaining HMMs is lower than a user-defined tolerance value, or when all HMMs in the original ensemble are selected. The worst-case time complexity of this selection algorithm is O(K 2 ) w.r.t. the number of Boolean operation of incrBC (see Section 4.3.1). However, this is only attained for a zero tolerance value for which the algorithm selects all models with the Algorithm 4.3: BCgreedy (P, V): Boolean combination with a greedy selection input : Pool of HMMs P = {λ1 , λ2 , . . . , λK } and a validation set V output: EoHMMs (E) from the pool P 1 set tol ; // tolerated improvement in AUCH values 2 j←1 3 foreach λk ∈ P do 4 compute ROC curves and their ROCCHk , using V 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 sort HMMs (λ1 , . . . , λK ) in descending order of their AU CH(ROCCHk ) values λj = arg maxk {AU CH(ROCCHk ) : λk ∈ P} E ← λj ; // select HMM with the largest AUCH value Sj ← λj ; for k ← 2 to K do ROCCHk = incrBC ((Sj , λk ), V) if AU C(ROCCHk ) ≥ AU C(ROCCHj ) + tol) then j ← j +1 E ← E ∪ λk ; // select the k th HMM ROCCHj ← ROCCHk ; // update the convex hull if k = 2 then Sj ← {(λj , ti ), (λk , ti ), bf } // set of selected decision thresholds from each HMM and Boolean functions; derived from incrBC for emerging vertices else Sj ← {Sj−1 (i), (λk , ti ), bf } // set of selected subset from previous combinations, decision thresholds from the newly-selected HMM, and Boolean functions; derived from incrBC for emerging vertices store Sj , 2 ≤ j ≤ K return E 183 optimum order for combination. In practice, however, depending on the value of the tolerance, the selected number of models k < K and also the computational time, can be traded-off as desired. Algorithm 4.4: BCsearch (P, V): Boolean combination with an incremental selective search input : Pool of HMMs P = {λ1 , λ2 , . . . , λK } and a validation set V output: EoHMMs (E) from the pool P 1 set tol ; // tolerated improvement in AUCH values 2 foreach λk ∈ P do 3 compute ROC curves and their ROCCHk , using V 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 sort HMMs (λ1 , . . . , λK ) in descending order of their AU CH(ROCCHk ) values λ1 = arg maxk {AU CH(ROCCHk ) : λk ∈ P} E ← λ1 ; // select HMM with the largest AUCH value foreach λk ∈ P\E do // remaining HMMs in P ROCCHk =incrBC ((λ1 , λk ), V) λ2 = arg maxk {AU CH(ROCCHk ) : λk ∈ P\E} E ← E ∪ λ2 ; // select HMM that provides the largest AUCH improvement to E ROCCH1 ← ROCCH2 ; // update the convex hull S2 ← {(λ1 , ti ), (λ2 , ti ), bf } // Set of selected decision thresholds from each HMM and Boolean functions; derived from incrBC for emerging vertices j←3 repeat foreach λk ∈ P\E do ROCCHk =incrBC ((Sj−1 , λk ), V) λj = arg maxk {AU CH(ROCCHk ) : λk ∈ P\E} E ← E ∪ λj ROCCHj−1 ← ROCCHj Sj ← {Sj−1 (i), (λj , ti ), bf } // Set of selected subset from previous combinations, decision thresholds from the newly-selected HMM, and Boolean functions; derived from incrBC for emerging vertices j ← j +1 until AU CH(ROCCHj ) ≤ AU CH(ROCCHj−1 ) + tol ; // no further improvement store Sj , 2 ≤ j ≤ K return E For both ensemble selection algorithms the AUCH improvement is the only user-defined parameters. If desired, it is also possible to impose a maximum number of HMMs as a stopping criterion. The performance measure employed to guide the search is not restricted to AUCH. Other measures such as the partial AUCH or the true positive rate at a required false positive rate, can be also employed for a region-specific performance. 184 Although these performance measures summarize the ROC space with a single value to guide the search for potential ensemble members, the final composite ROCCH is always stored for visualization of the whole range of performance and adaptation of the operating point to environmental changes. Figure 4.6 presents an example illustrating the level of performance achieved by incrBC, BCgreedy , and BCsearch algorithms. In Figure 4.6, all algorithms achieve a comparable ROCCH and AUC performance. However, as shown in the legends, the size of the EoHMMs obtained by BCsearch is four HMMs compared to seven HMMs selected by BCgreedy from the pool of eight HMMs, which are all combined by incrBC. For comparison, the selected set of decision thresholds and Boolean functions for the vertices denoted by C1 and C2 on the ROCCHs of Figure 4.6 are shown below for each algorithms: ⎧ (((((( 1 ) ) ) ) ) ) ⎪ ⎨C1 = ((λ1 , t1 ) ∨ (λ12 , t1 )) ∧ ¬(λ13 , t1 ) ∨ (λ14 , t1 ) ∨ (λ21 , t2 ) ∨ (λ22 , t2 ) ∧ (λ23 , t2 ) ∧ (λ24 , t2 ) incrBC ⎪ ⎩C2 = ((((((((λ1 , t1 ) ∨ (λ1 , t1 )) ∧ ¬(λ1 , t1 )) ∨ (λ1 , t1 )) ∨ (λ2 , t2 )) ∨ (λ2 , t2 )) ∧ (λ2 , t2 )) ∧ (λ2 , t2 )) 1 2 3 1 2 3 4 4 ⎧ ((((( ) ) ) ) ) ⎪ ⎨C1 = (¬(λ23 , t1 ) ∧ (λ24 , t1 )) ∨ (λ13 , t1 ) ∨ ¬(λ22 , t1 ) ∧ (λ12 , t1 ) ∧ ¬(λ21 , t1 ) ∨ (λ11 , t1 ) BCgreedy ⎪ ⎩C2 = ((((((¬(λ2 , t1 ) ∧ (λ2 , t1 )) ∨ (λ1 , t1 )) ∨ ¬(λ2 , t1 )) ∧ (λ1 , t1 )) ∨ (λ2 , t2 )) ∨ (λ1 , t2 )) 3 3 2 2 1 1 4 ⎧ (( ) ) ⎪ ⎨C1 = (¬(λ2 , t1 ) ∧ (λ2 , t1 )) ∨ (λ1 , t1 ) ∨ (λ2 , t1 ) 3 2 3 1 BCsearch ⎪ ⎩C2 = (((¬(λ2 , t1 ) ∧ (λ2 , t1 )) ∨ (λ1 , t2 )) ∨ (λ2 , t2 )) 3 2 3 1 This example demonstrates the efficiency of the proposed ensemble selection techniques in forming compact EoHMMs by exploiting the new information provided from newlyacquired data while maintaining a high level of performance. More extensive simulations and detailed discussions are presented in Section 4.5. 4.3.2.2 Model Pruning Pruning less relevant models is essential to restrict the pool size with incremental learning from growing indefinitely as new blocks of data become available. As described in the previous subsection, BCgreedy and BCsearch algorithms are designed to form the most compact EoHMMs from a pool P of HMMs. According to the learn-and-combine approach new pools of HMMs are accumulated from successive blocks of data. When the 185 number of data blocks increases over time, an increasingly large storage space is required incrBC(P) C2 incrBC(P1 ) BCgreedy (P) 1 tpr tpr 1 C1 .5 0 0 C2 (λ24 , t2 ) 2 2 (λ2 , t2 ) (λ3 , t2 ) (λ21 , t1 ) (λ21 , t2 ) (λ12 , t1 ) λ11 , auc = 0.660 1 λ12 , auc = 0.715 (λ1 , t1 ) λ13 , auc = 0.735 (λ24 , t1 ) λ14 , auc = 0.707 (λ13 , t1 ) λ21 , auc = 0.684 λ22 , auc = 0.719 (λ22 , t1 ) λ23 , auc = 0.749 (λ14 , t1 ) λ24 , auc = 0.748 2 CH 1 , auc = 0.897 (λ3 , t1 ) CH, auc = 0.953 (λ24 , t1 ) λ23 , λ24 , λ13 , λ22 , λ12 , λ21 , λ11 , .5 (λ13 , t1 ) (λ23 , t1 ) (λ21 , t2 ) 1 (λ 1 , t2 ) (λ21 , t1 ) (λ11 , t1 ) 0 tpr 1 .5 (λ12 , t1 ) (λ22 , t1 ) C1 0 CH, .5 (a) incrBC auc auc auc auc auc auc auc auc = 0.749 = 0.748 = 0.735 = 0.719 = 0.715 = 0.684 = 0.660 = 0.950 tpr 1 (b) BCgreedy tpr 1 BCsearch (P) C2 C1 (λ22 , t1 ) (λ21 , t2 ) .5 λ23 , auc = 0.749 λ22 , auc = 0.719 (λ23 , t1 ) (λ13 , t1 ) λ13 , auc = 0.735 (λ21 , t1 ) λ21 , auc = 0.684 (λ13 , t2 ) 0 0 CH, .5 auc = 0.950 tpr 1 (c) BCsearch Figure 4.6 A comparison of the composite ROCCH (CH) and AUC performance achieved with the HMMs selected according to the BCgreedy and BCsearch algorithms. Suppose that a new pool of four HMMs P2 = {λ21 , . . . , λ24 } is generated from a new block of training data (D2 ), and appended to the previously-generated pool, P1 = {λ11 , . . . , λ14 }, in the example presented in Section 4.3.1 (Figure 4.4). The incrBC algorithm combines all available HMMs in P = {λ11 , . . . , λ14 , λ21 , . . . , λ24 }, according to their order of generation and storage, while BCgreedy and BCsearch start by ranking the members of P according to AUC values and then apply their ensemble selection strategies 186 for storing these HMMs. In addition, the time complexity of BCgreedy and BCsearch algorithms also increases over time. Discarding unselected classifiers immediately from the pool yields to knowledge corruption and hence to a decline system performance. Although these classifiers did not provide any improvement to the cumulative EoHMMs, they may provide diverse information to the newly-generated pool of HMMs, which have different view of the data, to increase system performance. One solution to this issue is to discard repeatedly unselected HMMs over time. A counter is therefore assigned to each HMM in the pool indicating the number of blocks for which an HMM was not selected as an ensemble member (see Algorithm 4.1). An HMM is then pruned from the pool, according to a user-defined life time (LT ) expectancy value of unselected models. For instance, with an LT = 3 all HMMs that have not been selected after receiving three blocks of data, as indicated by their counters, are discarded from the pool. 4.4 4.4.1 Experimental Methodology Data Sets The experiments are conducted on both synthetically generated data and sendmail data from the University of New Mexico (UNM) data sets3 . The UNM data sets are commonly used for benchmarking anomaly detections based on system calls sequences (Warrender et al., 1999). In related work, intrusive sequences are usually labeled by comparing normal sequences, using the Sequence Time-Delay Embedding (STIDE) matching technique. This labeling process considers STIDE responses as the ground truth, and leads to a biased evaluation and comparison of techniques, which depends on both training data size and detector window size. To confirm the results on system calls data from real processes, the same labeling strategy is used in this work. However fewer sequences are used to train the HMMs to alleviate the bias. Therefore, STIDE is first trained on all the available normal data, and then used to label the corresponding sub-sequences from the ten sequences available for testing. The resulting labeled sub-sequences are concatenated, 3 http://www.cs.unm.edu/~immsec/systemcalls.htm 187 then divided into blocks of equal sizes, one for validation and the other for testing. During the experiments, smaller blocks of normal data (100 to 1, 000 sub-sequences) are used for training the HMMs as normal system call observations are very redundant. In spite of labeling issues, redundant training data, and unrepresentative test data, UNM sendmail data set is the mostly used in literature due to limited publicly available system call data sets. The need to overcome issues encountered when using real-world data for anomaly-based HIDS (incomplete data for training and labeling) has lead to the implementation of a synthetic data generation platform for proof-of-concept simulations. It is intended to provide normal data for training and labeled data (normal and anomalous) for testing. This is done by simulating different processes with various complexities then injecting anomalies in known locations. The data generator is based on the Conditional Relative Entropy (CRE) of a source; it is closely related to the work of Tan and Maxion (Tan and Maxion, 2003). The CRE is defined as the conditional entropy divided by the maximum entropy (MaxEnt) of that source, which gives an irregularity index to the generated data. For two random variables x and y the CRE is given by CRE = − x p(x) y p(y|x) log p(y|x) , M axEnt where for an alphabet of size Σ symbols, M axEnt = −Σ log(1/Σ) is the entropy of a theoretical source in which all symbols are equiprobale. It normalizes the conditional entropy values between CRE = 0 (perfect regularity) and CRE = 1 (complete irregularity or random). In a sequence of system calls, the conditional probability, p(y | x), represents the probability of the next system call given the current one. It can be represented as the columns and rows (respectively) of a Markov Model with the transition matrix M = {aij }, where aij = p(St+1 = j | St = i) is the transition probability from state i at time t to state j at time t + 1. Accordingly, for a specific alphabet size Σ and CRE value, a Markov chain is first constructed, then used as a generative model for normal data. This Markov chain is also used for labeling injected anomalies as described below. Let an anomalous event be defined as a surprising event which does not belong to the process normal pattern. This type of event may be a foreign-symbol anomaly sequence 188 that contains symbols not included in the process normal alphabet, a foreign n-gram anomaly sequence that contains n-grams not present in the process normal data, or a rare n-gram anomaly sequence that contains n-grams that are infrequent in the process normal data and occurs in burst during the test4 . Generating training data consists of constructing Markov transition matrices for an alphabet of size Σ symbols with the desired irregularity index (CRE) for the normal sequences. The normal data sequence with the desired length is then produced with the Markov chain, and segmented using a sliding window (shift one) of a fixed size, DW . To produce the anomalous data, a random sequence (CRE = 1) is generated, using the same alphabet size Σ, and segmented into sub-sequences of a desired length using a sliding window with a fixed size of AS. Then, the original generative Markov chain is used to compute the likelihood of each sub-sequence. If the likelihood is lower than a threshold it is labeled as anomaly. The threshold is set to (min(aij ))AS−1 , ∀i,j , the minimal value in the Markov transition matrix to the power (AS − 1), which is the number of symbol transitions in the sequence of size AS. This ensures that the anomalous sequences of size AS are not associated with the process normal behavior, and hence foreign n-gram anomalies are collected. The trivial case of foreign-symbol anomaly is disregarded since it is easy to be detected. Rare n-gram anomalies are not considered since we seek to investigate the performance at the detection level, and such kind of anomalies are accounted for at a higher level by computing the frequency of rare events over a local region. Finally, to create the testing data another normal sequence is generated, segmented and labeled as normal. The collected anomalies of the same length are then injected into this sequence at random according to a mixing ratio. 4.4.2 Experimental Protocol The experiments conducted in this chapter using the data generator simulate a simple process, with Σ = 8 and CRE = 0.3 and a more complex process, with Σ = 50 and 4 This is in contrast with other work which consider rare event as anomalies. Rare events are normal, however they may be suspicious if they occur in high frequency over a short period of time. 189 CRE = 0.4. The sizes of injected anomalies are assumed equal to the detector window sizes AS = DW = 4. For both scenarios, the presented results are for validation and test sets that comprise 75% of normal and 25% of anomalous data. Although not show in this chapter, various experiments have been conducted using different values of AS and DW , and different ratios of normal to anomalous data. These experiments produced similar results and hence the discussion presented in the next section hold. Figure 4.7 illustrates the steps involved for estimating HMM parameters. Given the first block of training data D1 , different discrete-time ergodic HMMs are trained with various number of hidden states N = [Nmin , . . . , Nmax ] using the 10-fold cross validation (10FCV). The training block D1 , which only comprises normal sub-sequences, is randomly partitioned into K = 10 sub-blocks of equal size. For each fold k, each of the learning techniques described below is used to estimate HMM parameters using K − 1 sub-blocks. The Forward algorithm is then used to evaluate the log-likelihood on the remaining sub-block, which is used as a stopping criterion to reduce the overfitting effects. The training process is repeated ten times using a different random initialization to avoid local maxima, which provides 100 ROC curves for each N value. Finally, the model that gives the highest area under its convex hull on a validation set (V) comprising normal and anomalous sub-sequences is selected, which results an HMM for each N value. Figure 4.8 presents HMM parameter estimation according to each learning technique from successive blocks of data for each N value. When the second training block D2 becomes available, the batch Baum-Welch (BBW) algorithm discards the previously learned HMMs and restarts the training procedure with all cumulative data (D1 ∪ D2 ), while the Baum-Welch (BW) algorithm restarts the training using D2 only providing HMMs for incremental combination according to the incrBC algorithm. The on-line BW (OBW) and incremental BW (IBW) algorithms resume the training from the previously-learned HMMs (λ1N ) using only the current block (D2 ). The OBW algorithm re-estimates HMM parameters based on each sub-sequence without iterations, while the IBW algorithm reiterates on D2 and integrates the new information with previously-learned model at each 190 DW Train on k − 1 folds: BBW, BW, OBW, IBW Evaluate on remaining fold (Forward algorithm) D1 : Normal sub-sequences Stopping criterion log-likelihood 10-fold cross validation For k = 1, . . . 10 For N = Nmin , . . . , Nmax folds For i = 1, . . . , 10 random initializations Best Train Valid. # iterations Temporary HMMs λk,1 N λk,10 N AS Validation set (V ) Normal & anomalous sub-sequences tpr Best HMMs Highest AUCH value over the 10 × 10 crossvalidations and initializations λNmin λNmax f pr Figure 4.7 Overall steps involved to estimate HMM parameters and select HMMs with the highest AUCH for each number of states from the first block (D1 ) of normal data, using 10-FCV and ten random initializations iteration, until the stopping criteria are met (see Appendix III and Khreich et al., 2009a). The area under the ROC curve (AUC), has been largely suggested as a robust scalar summary of classifiers performance (Huang and Ling, 2005; Provost and Fawcett, 2001). The AUC assesses ranking in terms of class separation – it evaluates how well a classifier is able to sort its predictions according to the confidence they are assigned. For instance, with an AU C = 1 all positives are ranked higher than negatives indicating a perfect discrimination between classes. A random classifier has an AU C = 0.5 that is both classes are ranked at random. If the AUC or the partial AUC (Walter, 2005) are not significantly different, the shape of the curves might need to be looked at. It may also be useful to observe the tpr for a fixed f pr of particular interest. Since the MRROC 191 Normal data blocks: D1 D2 Time V Normal & Anomalous Validation set λ0N BBW D1 λ(iter) λ0N BW D1 λ(iter) λ0N λ1N λ0N λ1N OBW D1 λ0N λ0N λ(iter) D1 ∪ D2 λ(iter) λ2N BW D2 λ(iter) λ2N OBW λ1N IBW D1 BBW D2 λ2N IBW λ1N D2 λ(iter) λ2N Figure 4.8 An illustration of HMM parameter estimation according to each learning technique (BBW, BW, OBW, and IBW) when subsequent blocks of observation sub-sequences (D1 , D2 , D3 , . . .) become available can be applied to any ROC curve, the performance in all experiments are measured with reference to the ROCCH, including the area under the convex hull (AUCH) and the tpr at f pr = 0.1. When working with the synthetic data, the whole procedure is replicated ten times with different training, validation and testing sets, and the median results are presented along with the lower and upper quartiles to provide statistical confidence intervals. 4.5 Simulation Results The first subsection presents the performance of the proposed learn-and-combine approach for incremental learning of new data without employing a model management strategy. In this case, a pool of HMMs for a new block of training data is combined with all previously-generated HMMs according to the incrBC algorithm. The second 192 subsection shows the impact on performance of the ensemble selection algorithms and of the pruning strategies for limiting the pool size. 4.5.1 Evaluation of the Learn-and-Combine Approach: A. HMMs Trained with a Fixed Number of States on Successive Blocks of Data. The first experiment involves ergodic HMMs trained with N = 6 states on ten blocks of data. The training, validation and testing data are generated synthetically as described in Section 4.4.1, with Σ = 8 and CRE = 0.3. Each training block Dk (k = 1, . . . , 10) comprises 50 normal sub-sequences, each of size DW = 4. Each validation (V) and test (T ) set comprises 200 sub-sequences, each of size AS = 4. In both data sets, the ratio of normal to anomalous sub-sequences is 4 : 1. The training and validation of HMMs follow the methodology described in Section 4.4.2. For each block Dk , a pool Pk is obtained by applying the BW algorithm to Dk using 10-FCV and ten different random initializations, and selecting the HMM (λkN =6 ) that gives the highest AUCH on V (see Figure 4.7). Pk is then appended to a pool P (P ← P ∪ Pk ). After receiving the last block of data, the pool P = λ1N =6 , . . . , λ10 N =6 of size |P| = 10 HMMs is provided for incremental combination according to the incrBC algorithm. The reference BBW follows the same training procedure but with cumulative blocks of data, and both OBW and IBW algorithms resume the training from the previously-learned HMMs using only the current block of data (see Figure 4.8). Figure 4.9 compares the median AUCH performance (Figure 4.9a) and the median tpr values at f pr = 0.1 (Figure 4.9b) as well as their lower and upper quartiles over ten replications for each learning technique. The performance obtained with BBW, OBW and IBW algorithms are compared to that of the incrBC algorithm combining the responses of the HMMs in P, incrementally, over the ten blocks of data. The performance achieved by combining the outputs of the HMMs in P, over the ten blocks of data, with the static median (MED) and majority vote (VOTE) fusion functions are also provided for reference. 193 1 0.95 0.9 0.9 0.8 0.85 0.7 tpr at f pr = 0.1 1 AU CH 0.8 0.75 0.7 0.6 0.5 0.4 0.65 0.3 0.6 0.2 0.55 0.1 0.5 D1 D2 D3 D4 D5 D6 D7 Number of training blocks D8 D9 D10 0 D1 D2 D3 D4 D5 D6 D7 Number of training blocks (a) BBW D8 D9 D10 (b) OBW IBW incrBC M EDall N V OT Eall N Figure 4.9 Results for synthetically generated data with Σ = 8 and CRE = 0.3. The HMMs are trained according to each technique with N = 6 states for each block of data providing a pool of size |P| = 1, 2, . . . , 10 HMMs. Error bars are lower and upper quartiles over ten replications As shown in Figure 4.9, the learn-and-combine approach using the incrBC algorithm provides the highest level of performance (with the lowest variances) among all incremental learning techniques over the whole range of results. Performance is significantly higher than that of the reference BBW, especially when provided with limited training data, as shown in the performance achieved with the first few blocks. The level of performance provided by the static MED and VOTE combiners is lower (with higher variances) versus other techniques, and oscillates around their initial values. Not surprisingly, the OBW algorithm has achieved the worst level of performance as one pass over a limited data is insufficient to capture its underlying structure. In contrast, the IBW algorithm has provided a higher level of performance than that of OBW as it iterates over each block and employs a fixed learning rate to integrate the newly-acquired information into HMM parameters at each iteration (see Appendix III). 194 B. HMMs Trained with Different Number of States on Successive Blocks of Data. In order to investigate the effects of the number of states on the performance of each technique, the previous experiment is conducted with a number of states ranging from N = 4 to 12. For each N value, a pool PN is obtained by applying the BW algorithm to each block Dk using 10-FCV and ten different random initializations, and selecting the HMM (λkN ) that gives the highest AUCH on V (see Figure 4.7 and 4.8). After receiving the last block of data, nine pools are generated, a pool PN = λ1N , . . . , λ10 N for each state value N = 4, . . . , 12, each of size |PN | = 10 HMMs. The responses of HMMs in each pool PN are incrementally combined using the incrBC algorithm over each block. The number of states that achieved the highest average level of performance on each block of data is selected. The same procedure is also applied for BBW, OBW and IBW algorithms (see Figure 4.7 and 4.8). In addition, the learn-and-combine approach is employed to incrementally combine the HMMs trained using the whole range of states over the ten blocks. The nine pools are therefore concatenated into one pool 10 P = λ1N =4 , . . . λ1N =12 ; . . . ; λ10 N =4 , . . . λN =12 of size |P| = 90 HMMs, and incrementally combined according to the incrBC algorithm over all the successive blocks (incrBCall N ). The HMMs in P are also combined according to the median (M EDall N ) and majority vote (V OT Eall N ) functions for comparison. Figure 4.10 presents the median results of the experiments as well as their lower and upper quartiles over ten replications for each learning technique. For BBW, OBW, IBW, and incrBC technique, the number of states that achieved the highest average level of performance on each block of data is indicated with each median value. As shown in Figure 4.10, the best number of states varies from one block to the next and different among all techniques. Therefore, combination of HMMs each trained with different number of states and random initializations may increase the ensemble diversity and improve system performance. This is clearly seen in the performance achieved with incrBCall N , which is significantly higher than all other techniques. The previous obser- 195 vations hold true for the remaining techniques. In fact, to obtain the number of states which provides the best performance for each learning techniques on each data block, the HMMs are trained and evaluated with each N values according to different random initializations. Instead of selecting the single “best” HMM from the pool, incrBCall N combines all HMMs with a minimal overhead. As expected, the level of performance achieved by the learning algorithms increases when provided with more training blocks. Although incrBC and incrBCall N algorithms are still capable of achieving the highest level of performance, the increased performance over other techniques is relatively lower, as the combined HMMs become more positively correlated. 6 6 6 0.9 9 10 11 6 10 9 8 9 6 9 8 9 10 6 8 10 10 10 4 7 4 4 0.5 D1 D2 D3 4 10 12 12 8 9 0.6 0.4 D9 D10 OBW 9 6 9 10 6 6 7 6 11 10 9 10 9 10 9 9 10 0 D1 7 4 4 D2 D3 (a) BBW 9 7 7 6 7 D8 6 6 0.2 7 D6 D4 D5 D7 Number of training blocks 6 6 6 8 11 6 0.8 0.7 0.6 1 6 12 11 AU CH 0.8 8 9 6 tpr at f pr = 0.1 1 11 7 4 12 D6 D4 D5 D7 Number of training blocks D8 D9 11 D10 (b) IBW incrBC incrBCall N M EDall N V OT Eall N Figure 4.10 Results for synthetically generated data with Σ = 8 and CRE = 0.3. The HMMs are trained according to each technique with nine different states (N = 4, 5, . . . , 12) for each block of data providing a pool of size |P| = 9, 18, . . . , 90 HMMs. Numbers above points are the state values that achieved the highest average level of performance on each block. Error bars are lower and upper quartiles over ten replications The previous results are also confirmed on the more complex synthetic data in Figure 4.11 and on sendmail data in Figure 4.12. The training and validation of the ergodic HMMs with 20 different number of states (N = 5, 10, . . . , 100), is carried out according to the methodology described in Section 4.4. For each of the ten data blocks, 20 HMMs are 196 generated and appended to the pool, which gives a pool of size |P| = 20, 40, . . . , 200 HMMs. For each replication of the synthetic data, the training is conducted on ten successive blocks each comprising 500 sub-sequences of length DW = 4 symbols, the validation set V comprises 2, 000 sub-sequences and the test set T comprises 5, 000 subsequences. For sendmail data, the training is conducted on ten successive blocks each comprising 100 sub-sequences of length DW = 4 symbols, each V and T comprise 450 subsequences. In both cases, the anomaly size is AS = 4 symbols and the ratio of normal to anomalous sequences is 4 : 1. Again, the incremental learn-and-combine approach provides a significantly higher level of performance than the BBW, OBW, IBW, MED and VOTE techniques. 1 0.7 0.6 0.9 AU CH 0.8 90 95 95 0.7 20 25 65 35 95 75 75 40 25 85 95 30 65 55 50 35 5 D1 5 D2 5 D3 5 5 5 50 D6 D4 D5 D7 Number of training blocks (a) 95 65 90 65 65 D8 60 D9 0.5 80 BBW OBW IBW incrBC incrBCall N M EDall N V OT Eall N 0.6 0.5 100 tpr at f pr = 0.1 90 75 50 0.4 90 90 0.3 95 0.2 5 0.1 35 75 35 5 0 D1 20 5 D2 35 5 D3 25 25 20 5 5 70 25 45 80 85 85 25 30 30 D10 90 90 50 95 65 25 85 30 5 5 D4 D5 D6 D7 Number of training blocks 5 D8 60 D9 80 D10 (b) Figure 4.11 Results for synthetically generated data with Σ = 50 and CRE = 0.4. The HMMs are trained according to each technique with 20 different states (N = 5, 10, . . . , 100) for each block of data providing a pool of size |P| = 20, 40, . . . , 200 HMMs. Numbers above points are the state values that achieved the highest average level of performance on each block. Error bars are lower and upper quartiles over ten replications The level of performance achieved by applying the learn-and-combine approach to HMMs trained on each newly-acquired block of data always provide the highest overall accuracy. In particular, the results have shown that it provides a higher level of performance than 197 0.9 1 0.8 0.95 95 40 0.85 15 5 60 0.8 5 20 5 0.6 50 55 40 0.7 70 40 100 30 20 5 15 20 95 0.75 5 5 5 60 15 35 5 5 5 0.7 10 D1 D2 D3 V OT Eall N D9 D10 25 100 45 35 60 45 0.4 5 0.3 0 30 40 50 50 95 50 25 80 10 55 75 95 5 70 15 100 35 (a) D2 D3 40 40 40 5 5 15 5 D1 35 80 40 85 80 25 D8 45 0.5 0.1 10 D4 D5 D6 D7 Number of training blocks M EDall N 0.2 10 10 tpr at f pr = 0.1 100 50 75 AU CH 35 35 35 0.9 0.65 BBW OBW IBW incrBC incrBCall N 5 5 D4 D5 D6 D7 Number of training blocks 5 D8 D9 D10 (b) Figure 4.12 Results for sendmail data. The HMMs are trained according to each technique with 20 different states (N = 5, 10, . . . , 100) for each block of data providing a pool of size |P| = 20, 40, . . . , 200 HMMs. Numbers above points are the state values that achieved the highest level of performance on each block the reference BBW, especially when provided with limited data. This stems from the capabilities of the incrBC algorithm in effectively exploiting the diverse and complementary information provided from the pool of HMMs trained with different number of states and different initializations, and from the newly-acquired data. In fact, BW training optimizes HMM parameters locally, which results in converging to different local maxima. Furthermore, HMMs trained with a different number of states allow for capturing different structures of the underlying data. In addition, the newly-acquired blocks of training data provide different views of the true underlying distribution. According to the bias-variance decomposition (Breiman, 1996; Domingos, 2000), segmenting the training data introduces bias and decreases the generalization ability of the individual classifiers. However, this increases the variances among classifiers trained on each data block. An increased generalization ability is therefore achieved by combining the outputs of those classifiers and hence improved level of performance. 198 The MED and VOTE fusion functions have shown incapable of increasing the level of performance compared to the incrBC technique. This reflects their inabilities to effectively exploit the information provided from the validation set. The MED function directly combines HMMs likelihood values for each sub-sequence in the test, while VOTE considers the crisp decisions from HMMs at optimal operating thresholds or equivalently at equal error rates. In contrast, the incrBC algorithm applies ten Boolean functions to the crisp decisions provided by each threshold from the first HMM to those provided by the second HMM, and then selects the decision thresholds and Boolean functions that improve the overall ROCCH of the validation set V. Shifting the decision thresholds for each HMM in the ensemble before combining their responses has shown to largely increase the diversity of the ensemble. The incrBC algorithm is effectively designed to take advantage from these threshold-based complementary information and to select those that most improve the level of performance. 4.5.2 Evaluation of Model Management Strategies: A. Model Selection. In the previously conducted experiments, using the synthetic (Σ = 50 and CRE = 0.4) and sendmail data sets, the pool P comprised 20 HMMs for each data block (an HMM for each number of states N = 5, 10, . . . , 100), yielding a pool of size |P| = 200 HMMs after receiving the ten blocks of data. This is also the size of the EoHMMs (|E| = 20, 40, . . . , 200 HMMs), since for each block Dk , all available HMMs in P were selected and incrementally combined according to the learn-and-combine approach (incrBCall N ). Figure 4.13 presents the results of the combined EoHMMs previously considered using the synthetically generated data with Σ = 50 and CRE = 0.4. These results belong to the first replication of Figure 4.11. AUCH performance of the incrBCall N that combines all available HMMs from the pool is compared to that of the proposed ensemble selection algorithms, BCgreedy (Algorithm 4.3) and BCsearch (Algorithm 4.4). For each block, the size of the EoHMMs selected according to each technique is shown on the respective 199 curves of the figure. The performance of the BBW, with the selected (best) state value 1 1 0.95 0.95 160 7 180 7 200 7 100 7 120 7 140 60 ↓ ↓ 0.9 8 80 ↓ ↓ ↓ 6 ↓ ↓ 7 40 6 ↓ ↓ 13 14 12 0.85 14 12 20 N=65 13 N=10 14 ↓ 12 N=50 13 0.8 3 N=35 N=25 0.75 10 N=55 N=40 0.7 0.65 D2 D3 D4 D5 D6 D7 Number of training blocks D8 41 21 47 56 55 57 22 60 64 N=65 N=50 N=35 N=25 N=40 N=55 BBW N=50 incrBCall N 0.6 BCgreedy N=15 0.55 BCsearch N=45 D9 20 21 23 N=10 18 0.75 0.65 BCsearch N=45 D1 30 BCgreedy N=15 0.55 43 21 0.7 incrBCall N 0.6 0.5 0.85 12 0.8 BBW N=50 21 23 19 0.9 AU CH AU CH at each block, is also shown for reference. 0.5 D10 D1 D2 (a) tolerance = 0.01 D3 D4 D5 D6 D7 Number of training blocks D8 D9 D10 (b) tolerance = 0.001 N =5 N = 10 N = 15 N = 20 N = 25 N = 30 N = 35 N = 40 N = 45 N = 50 N = 55 N = 60 N = 65 N = 70 N = 75 N = 80 N = 85 N = 90 N = 95 N = 100 D6 D7 λ D1 λ D2 λ D3 λ D4 λ D5 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 D8 D9 D10 , λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 λD10 D5 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 D4 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 D3 λ D1 λ D2 λ D3 λ D4 λ D1 λ D2 λ D1 D1 D2 λ D1 λ D2 λ D3 Figure 4.13 Ensemble selection results for synthetically generated data with Σ = 50 and CRE = 0.4. These results are for the first replication of Figure 4.11. For each block, the values on the arrows represent the size of the EoHMMs (|E|) selected by each technique from the pool of size |P| = 20, 40, . . . , 200 HMMs , % % % , , % M Figure 4.14 Representation of the HMMs selected in Figure 4.13a with the presentation of each new block of data according to the BCgreedy algorithm with tolerance = 0.01. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . , 200 HMMs indicated by the grid on the figure D5 D6 D7 D8 λ D1 λ D2 λ D3 λ D4 λ D5 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 N =5 N = 10 N = 15 N = 20 N = 25 N = 30 N = 35 N = 40 N = 45 N = 50 N = 55 N = 60 N = 65 N = 70 N = 75 N = 80 N = 85 N = 90 N = 95 N = 100 D9 , , D10 , λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 λD10 D4 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 D3 λ D1 λ D2 λ D3 λ D4 λ D1 λ D2 λ D1 D1 D2 λ D1 λ D2 λ D3 200 , % % Figure 4.15 Representation of the HMMs selected in Figure 4.13a with the presentation of each new block of data according to the BCsearch algorithm with tolerance = 0.01. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . , 200 HMMs indicated by the grid on the figure Figure 4.13a, presents the results of the BCgreedy and BCsearch algorithms for a tolerated improvement value of 0.01 between the AUCH values. Both BCgreedy and BCsearch are capable of maintaining a slightly lower level of AUCH performance than that of the incrBCall N algorithm. However, for each block, the size of the selected EoHMMs (|E|) is largely reduced compared with the original pool size (|P| = 20, 40, . . . , 200 HMMs). The BCgreedy algorithm selects ensembles of sizes |E| = 10 to 14 HMMs (see Figure 4.14), while BCsearch selects ensembles of sizes |E| = 3 to 8 HMMs (see Figure 4.15). When the number of blocks increases, the number of the selected HMMs according to both algorithms remains almost stable. In particular, at the 10th block of data, BCgreedy selects an ensemble of size |E| = 12 HMMs, while BCsearch selects an ensemble of size |E| = 7 from the generated pool of size |P| = 200 HMMs, and even provides a slight increase in AUCH performance over that of incrBCall N . This indicates that both BCgreedy and BCsearch are effective in exploiting the complimentary information provided from the new blocks of data. As expected, the incremental search strategy employed within the BCsearch algorithm is most effective for reducing the ensemble sizes while maintaining or improving the overall performance. The selection strategy employed withing the BCgreedy algorithm 201 200 180 160 140 ↓ 12 ↓ ↓ ↓ 8 9 120 8 0.95 100 9 ↓ ↓ 80 21 20 21 8 19 ↓ 60 0.9 9 16 40 10 18 ↓ N=70 8 ↓ N=95 N=55 N=50 16 0.85 12 N=75 20 10 0.8 ↓ N=15 5 N=20 N=60 N= 5 BBW 7 0.75 incrBCall N 1 33 0.95 0.9 32 27 18 38 63 22 45 32 D1 D2 D3 D4 D5 D6 D7 Number of training blocks D8 84 78 64 47 N=70 N=95 N=55 N=50 N=15 0.75 N=20 N=60 N= 5 BBW incrBCall N BCgreedy 0.7 N= 5 BCsearch 0.65 31 32 N=75 12 0.8 18 BCgreedy 0.7 N= 5 0.85 20 21 19 AU CH AU CH 1 BCsearch D9 0.65 D10 D1 (a) tolerance = 0.003 D2 D3 D4 D5 D6 D7 Number of training blocks D8 D9 D10 (b) tolerance = 0.0001 λ D1 λ D2 λ D3 λ D4 λ D1 λ D2 λ D3 λ D4 λ D5 D7 D8 D9 , , , , % D10 , ,% λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 λD10 D6 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 D5 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 D4 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 D3 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 N =5 N = 10 N = 15 N = 20 N = 25 N = 30 N = 35 N = 40 N = 45 N = 50 N = 55 N = 60 N = 65 N = 70 N = 75 N = 80 N = 85 N = 90 N = 95 N = 100 λ D1 λ D2 λ D1 D1 D2 λ D1 λ D2 λ D3 Figure 4.16 Ensemble selection results for sendmail data of Figure 4.12. For each block, the values on the arrows represent the size of the EoHMMs (|E|) selected by each technique from P of size |P| = 20, 40, . . . , 200 HMMs M , ,% , , , , Figure 4.17 Representation of the HMMs selected in Figure 4.16a with the presentation of each new block of data according to the BCgreedy algorithm with tolerance = 0.003. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . 200 HMMs indicated by the grid on the figure is more conservative than that of the BCsearch (see Figure 4.14 and 4.15). BCgreedy tends to select and conserve older models due to its linear scanning and selection, while the incremental search strategy of BCsearch exploits further information by exploring the benefit achieved after combining each new HMM with the cumulative EoHMMs. In λ D1 λ D2 λ D3 λ D4 λ D5 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 N =5 N = 10 N = 15 N = 20 N = 25 N = 30 N = 35 N = 40 N = 45 N = 50 N = 55 N = 60 N = 65 N = 70 N = 75 N = 80 N = 85 N = 90 N = 95 N = 100 D7 D8 D9 , D10 , , , λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 λD10 D6 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 λ D9 D5 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 λ D8 D4 λ D1 λ D2 λ D3 λ D4 λ D5 λ D6 λ D7 D3 λ D1 λ D2 λ D3 λ D4 λ D1 λ D2 λ D1 D1 D2 λ D1 λ D2 λ D3 202 , , , , , % Figure 4.18 Representation of the HMMs selected in Figure 4.16a with the presentation of each new block of data according to the BCsearch algorithm with tolerance = 0.003. HMMs trained on different blocks are presented with a different symbol. |P| = 20, 40, . . . 200 HMMs indicated by the grid on the figure fact, the order in which the models are selected in BCsearch assures the best ensemble performance up to the tolerance value. Therefore, as illustrated in Figure 4.13b, with lower tolerance values, the level of performance achieved by BCsearch is higher than that of BCgreedy and incrBCall N algorithms. In this case, the size of the EoHMMs selected according to BCsearch is on average about half of that selected by BCgreedy , over the ten blocks. However, this comes at the cost of efficiency, where BCsearch may be an order of magnitude more computationally expensive than BCgreedy (see Section 4.3.2). These selection results are also confirmed on sendmail data as shown in Figure 4.16, and detailed in Figure 4.17 and Figure 4.18 for BCgreedy and BCsearch algorithms. In practice, when efficiency is important, BCgreedy should be considered as it scans the ordered list of detectors once. Otherwise, BCsearch has shown to be more effective in selecting detectors that contribute the most to an improved performance of the ensemble. B. Model Pruning. Previous experiments have shown that BCgreedy and BCsearch algorithms are effective in maintaining a high level of performance by selecting relatively small sized EoHMMs from the entire pool of previously-generated HMMs. In practice, however, the size of the 203 pool must be restricted from increasing indefinitely with the number of data blocks. The impact on performance of the pruning strategy proposed in Section 4.3.2 is now evaluated for BCgreedy and BCsearch algorithms according to various life time (LT) expectancy values. In the following results, an HMM is pruned if it is not selected for an LT corresponding to 1, 3 or 5 data blocks. The reference results of previous experiments, conducted without pruning HMMs from the pool, are shown with an LT = ∞. Figure 4.19 illustrates the impact on accuracy of pruning the pool of HMMs in Figure 4.13a (synthetic data with Σ = 50 and CRE = 0.4), while Figure 4.20 illustrates the impact of pruning the pool of HMMs in Figure 4.16a (sendmail data). The size of the se0.95 0.95 0.9 0.9 0.85 0.85 0.8 0.8 0.75 0.75 BBW BBW incrBCall N 0.7 0.7 incrBCall N BCgreedy (LT = ∞) 0.65 BCgreedy (LT = 3) 0.6 BCsearch (LT = ∞) 0.65 BCgreedy (LT = 1) BCsearch (LT = 1) BCsearch (LT = 3) 0.6 BCgreedy (LT = 5) 0.55 D2 D1 D3 D4 D5 D6 D7 Number of training blocks D8 D9 BCsearch (LT = 5) D10 0.55 D1 D2 D3 (a) BCgreedy D1 D2 D3 |E| |P| 10 20 13 40 14 60 |E| |P| 7 20 7 22 7 24 |E| |P| 7 20 7 40 9 53 |E| |P| 7 20 7 40 9 60 D4 D5 D6 D7 BCgreedy (LT = ∞) 12 14 13 12 80 100 120 140 BCgreedy (LT = 1) 7 10 10 7 28 25 25 25 BCgreedy (LT = 3) 9 9 9 9 61 55 54 53 BCgreedy (LT = 5) 9 10 10 8 80 94 104 99 D4 D5 D6 D7 Number of training blocks D9 D8 D10 (b) BCsearch D1 D2 D3 |E| |P| 3 20 7 40 6 60 8 26 |E| |P| 6 20 6 22 7 24 10 58 10 57 |E| |P| 6 20 6 40 7 52 13 95 13 94 |E| |P| 6 20 6 40 7 60 D8 D9 D10 13 160 14 180 12 200 7 26 8 26 9 56 8 97 D4 D5 D6 D7 BCsearch (LT = ∞) 8 7 7 7 80 100 120 140 BCsearch (LT = 1) 7 3 3 7 23 25 23 25 BCsearch (LT = 3) 7 7 7 3 57 49 50 50 Bsearch (LT = 5) 7 7 7 8 80 92 97 92 D8 D9 D10 6 160 7 180 7 200 7 25 3 24 3 25 3 46 7 47 7 52 8 93 8 94 8 92 Figure 4.19 illustration of the impact on performance of pruning the pool of HMMs in Figure 4.13a 204 lected EoHMMs (|E|) and of the pool (|P|) are presented below the figures for each block of data. The performance of the BBW algorithm and that of incrBCall N , combining all HMMs without any pruning, are also shown for reference. As shown in Figure 4.19, the level of performance achieved with BCgreedy and BCsearch algorithms is decreased for LT = 1 compared to that achieved with LT = ∞. This aggressive pruning strategy eliminates all HMMs that have not been selected as ensemble members during the previous incremental learning stage, which may lead to knowledge corruption, and hence degrades the system performance. In fact, the discarded HMMs may have complementary information with respect to the newly-generated HMMs from new block of data. Early elimination of this information limits the search space for future combinations. The impact on performance of pruning depends on the variability of the data and the block size. As shown in Figure 4.20, the decline in performance with BCgreedy (LT = 1) and BCsearch (LT = 1) is relatively small for sendmail data, which incorporates more redundancy than the synthetically-generated data. The performance achieved with a delayed pruning of HMMs approaches that of retaining all generated HMMs in the pool for larger LT values. As shown in Figures 4.19 and 4.20, the performance achieved by pruning the HMMs that have not been selected for LT = 3 and 5 blocks according to BCgreedy and BCsearch algorithms is comparable to that of BCgreedy (LT = ∞) and BCsearch (LT = ∞), respectively. For fixed tolerance and LT values, BCsearch is capable of selecting smaller EoHMMs and further reducing the size of the pool than BCgreedy algorithm. The results show that BCsearch limits the size of pool to about LT + 1 times the averaged number of generated HMMs from new blocks, while BCgreedy upper bound on pool size is slightly higher. For instance, with LT = 5 we need a storage that accommodates parameters of about 100 HMMs. A fixed-size pool may be obtained by adaptively changing the tolerance and LT values upon receiving a new block of data. In HMM-based ADSs, the system administrator must set these parameters to optimize the system performance, while reducing the size of the selected EoHMMs and of the pool of HMMs for reduced time and memory requirements. 205 1 1 0.95 0.95 0.9 0.9 0.85 0.85 BBW BBW incrBCall N 0.8 incrBCall N BCgreedy (LT = ∞) 0.8 BCsearch (LT = ∞) BCgreedy (LT = 1) 0.75 BCsearch (LT = 1) 0.75 BCgreedy (LT = 3) BCsearch (LT = 3) BCgreedy (LT = 5) 0.7 D2 D1 D3 D7 D4 D5 D6 Number of training blocks D8 D9 BCsearch (LT = 5) D10 0.7 D1 D2 D3 (a) BCgreedy D1 D2 D3 |E| |P| 7 20 10 40 12 60 |E| |P| 7 20 7 27 10 30 |E| |P| 7 20 7 40 10 47 |E| |P| 7 20 7 40 10 60 D4 D5 D6 D7 D8 D9 D10 BCgreedy (LT = ∞) 16 18 16 19 21 20 21 80 100 120 140 160 180 200 BCgreedy (LT = 1) 10 9 9 15 15 17 17 31 38 37 43 40 40 42 BCgreedy (LT = 3) 10 12 12 16 16 18 18 59 57 62 60 66 70 66 BCgreedy (LT = 5) 10 12 12 16 16 18 18 80 87 100 98 104 102 110 D7 D4 D5 D6 Number of training blocks D8 D9 D10 (b) BCsearch D1 D2 D3 |E| |P| 5 20 8 40 10 60 |E| |P| 5 20 5 25 10 30 |E| |P| 5 20 5 40 8 46 |E| |P| 5 20 5 40 8 60 D4 D5 D6 D7 BCsearch (LT = ∞) 9 8 9 8 80 100 120 140 BCsearch (LT = 1) 10 3 3 8 26 29 30 31 BCsearch (LT = 3) 8 8 8 9 53 55 56 55 Bsearch (LT = 5) 8 10 10 9 80 86 97 101 D8 D9 D10 8 160 9 180 12 200 8 31 9 30 9 31 9 54 8 54 8 52 9 99 8 95 8 92 Figure 4.20 illustration of the impact on performance of pruning the pool of HMMs in Figure 4.16a 4.6 Conclusion This chapter presents a ROC-based system to efficiently adapt EoHMMs over time, from new training data, according to a learn-and-combine approach. When a new block of data becomes available, a pool of base HMMs is generated and combined to a global pool comprising previously-generated pools. The base HMMs are trained from each new data block with different number of states and initializations, which allows to capture different underlying structures of the data and hence increases diversity among pool members. The responses from these newly-trained HMMs are then combined with those of the previously-trained HMMs in ROC space using a novel incremental Boolean combination 206 (incrBC) technique. On its own, incrBC allows to maintain or improve system accuracy, yet it retains all previously-generated HMMs in the pool. Specialized model management algorithms are proposed to limit the computational and memory complexity from increasing indefinitely with the incrBC of new HMMs. First, the proposed system selects a diversified EoHMMs from the pool according to one of two ensemble selection algorithms, called BCgreedy and BCsearch , that are adapted to benefit from the monotonicity in incrBC accuracy, for a reduced complexity. Decision thresholds from selected HMMs and Boolean fusion functions are then adapted to improve overall system performance. Finally, redundant and inaccurate base HMMs, which have not been selected for some user-defined time interval, are pruned from the pool. Since the overall composite ROC convex hull is retained, the proposed system is capable of changing its desired operating point during operations, and hence accounts for changes in operating conditions, such as tolerated false alarm rate, prior probabilities, and costs of errors. Most existing techniques for adapting ensembles of classifiers require restarting the training, selection, and combination procedures to account for such a change. During simulations conducted on both synthetic and real-world HIDS data, the proposed system has been shown to achieve a higher level of accuracy than when parameters of a single best HMM are estimated, at each learning stage, using reference batch and incremental learning techniques. It also outperforms the learn-and-combine approaches using static fusion functions (e.g., majority voting) for combining newly- and previouslygenerated pools of HMMs. The proposed ensemble selection algorithms have been shown to provide compact and diverse EoHMMs for operations, and hence simplified Boolean fusion rules, while maintaining or improving the overall system accuracy. The employed pruning strategy has been shown to limit the ever-growing pool size, thereby reducing the storage space for accommodating HMMs parameters and the computational costs for the selection algorithms, without negatively affecting the overall system performance. CONCLUSIONS Anomaly detection approaches first learn normal system behavior, and then attempt to detect significant deviations from this baseline behavior during operations. Anomaly detection is an active and challenging research area. It has been motivated by numerous real-world applications such as fraud detection in electronic commerce, anomaly detection in control systems and medical image analysis, and target detection in military systems. In computer and network security applications, anomaly detection complements misuse detection techniques for improved intrusion detection systems. Network traffic and protocols, operating systems and host events, as well as user behavior (e.g., keystroke dynamics) can be monitored for anomaly detection. Although capable of detecting novel attacks, ADSs suffer from excessive false alarms and degradation in performance, because they are typically designed using unrepresentative training and validation data with limited prior knowledge about their distributions, and because they face complex environments that change during operations. Providing representative data for designing ADSs that monitor computer or network events is particularly difficult in today’s internetworked environments. An HMM detector employed within an intrusion detection system can not be directly trained from data generated by a process of interest. These data may include patterns of attacks that will be missed by the ADS during operations. As described in Section 1.6.2, unrepresentative normal data is typically provided for training, whether the amount of data is abundant (collected from overly secured environments with restricted functionalities) or limited (collected from unconstrained environments which require significant time and effort to identify existing attacks). The anomaly detector will therefore have an incomplete view of the normal process behavior, and hence rare normal events will be misclassified as anomalous. Therefore, ADSs should be able to efficiently accommodate new data, to account for rare normal events and adapt to changes in normal behaviors that may occur over time. 208 Contributions and Discussion This thesis presents novel approaches at the HMM and decision levels for improving the accuracy, efficiency and adaptability of HMM-based ADSs. To sustain a high level of performance, an HMM should be capable to incrementally update its parameters in response to new data that may become available after a it has already been trained on previous data, and deployed for operation. However, the incremental re-estimation of HMM parameters raises several challenges. HMM parameters should be updated from new data without requiring access to the previously-learned training data, and without corrupting previously-acquired knowledge (Grossberg, 1988; Polikar et al., 2001), i.e., previously-learned models of normal behavior. As described in Chapter 2 and Appendix I, standard techniques for estimating HMM parameters involve batch iterative learning, which require a finite amount of training data throughout the training process. The HMM parameters are estimated over several training iterations, where each iteration requires processing the entire training data, until some objective function is maximized. To avoid knowledge corruption, learning new data with these techniques require a cumulative storage of training data over time, and increasingly large time and memory complexity for re-estimation of HMM parameters from the start using all accumulated data. This may represent a very costly solution in practice. Chapter 2 presents a survey and detailed analysis of on-line learning techniques that have been proposed in literature to estimate HMM parameters from an infinite stream of observation symbols or sub-sequences of observation symbols. They assume being provided with a complete representation of normal system behavior through the stream of generated observations over time, and focus on issues such as efficiency and convergence rate. Therefore, these algorithms update HMM parameters continuously upon observing each new observation symbol or new observation sub-sequence, and typically employ a learning rate to control model stability and accelerate the rate of convergence. When provided with unrepresentative data for training, these on-line learning techniques yield a low level of performance as one pass over the data is not sufficient to capture normal 209 system behavior. This have been addressed analytically in Chapter 2, and confirmed empirically in Appendix II. When the previously-learned model is considered as a starting point for incremental learning from successive data blocks, both batch and on-line learning algorithms lead to a decline in system performance caused by the plasticity-stability dilemma. In fact these algorithms may remain trapped in local optima of the likelihood function associated with previously-learned data and would not be able to accommodate the newly-acquired information. Otherwise, they will adapt completely to the new data, thereby corrupting the previously-acquired knowledge. Appendix II extends some on-line learning techniques for HMM parameters to incremental learning, by allowing them to iterate over each block of data and by resetting their internal learning rates when a new block is presented. These learning rates are employed to integrate the information from each new symbol or sub-sequences of observation with the previously-learned model. Evaluation results for detecting system call anomalies, have shown that the extended techniques for incremental learning can provide a higher level of detection accuracy than that of the original on-line learning algorithms, as described in Appendix II. However, optimizing the internal learning rates employed within these algorithms to trade off plasticity against stability remains a challenging task. This is because all previously learned information contained within the current model should be balanced with new information contained in a symbol or sub-sequence of observation. Appendix III presents an improved incremental learning strategy, which consists of employing an additional (global) learning rate at the iteration level. This aims at delaying the integration of newly-acquired information with the previously-learned model until at least on iteration over the new block of data is performed by the new model. Furthermore, the previously-learned model is kept in memory and weight-balanced with the new model over each iteration. This incremental learning strategy is general in that it can be applied to any batch of on-line learning algorithm. When limited amount of data is provided for training within each new block, batch EM-based algorithms may provide easier solutions, as no internal learning rate is required and monotonic convergence to 210 a local maximum of the likelihood function is guaranteed. However, when a new block comprising abundant amount of data is provided for incremental learning, on-line algorithms may be employed due to their reduced storage requirements. An optimization of their internal learning rates is yet required to ensure convergence to a local maximum of the likelihood function, although not guaranteed. The EFFBS algorithm proposed in Appendix I provides an alternative solution, which allow standard batch learning techniques to efficiently estimate HMM parameters from abundant data. In practice, the EFFBS algorithm requires fewer resources than the traditional FB algorithm, yet provides the same results. The memory complexity of FB O(N T ) is reduced to O(N ) with EFFBS, which is independent of the observation sequence length T , but only depends on number of HMM states N . Although EFFBS by itself does not provide an incremental solution for estimation HMM parameters, it can be employed within the BW algorithm according to the incremental learning strategy, when the new block comprises abundant training data. This allows to take advantage of the monotonic convergence over each iteration of the BW algorithm, without any internal learning rate. EFFBS may also be useful in batch learning scenarios, where segmenting a long sequence of observation into smaller training sub-sequences is not desirable. Segmenting long sequences adds another variable to the problem at hand – the best sub-sequence length to be employed, and may raise dependence (or independence) issues between segmented sub-sequences. As discussed in Section 2.2.2.1, training an HMM with multiple sub-sequences assumes their conditional independence (Levinson et al., 1983), or otherwise requires a costly combinatorial techniques Li et al. (2000) to overcome the independence assumption. Training HMM parameters on the entire sequence of system call observations eliminates the issue of sub-sequence length selection during the design phase, and allow to use a desired detector window size for operations (Lane, 2000; Warrender et al., 1999). The issue of learning form long sequences of observations has been also raised in other fields, such as robot navigation systems (Koenig and Simmons, 1996), and bioinformatics (Lam and Meyer, 2010). 211 Designing an HMM for anomaly detection involves estimating HMM parameters and the number of hidden states, N , from the training data. The value of N and the initial guess of HMM parameters have a considerable impact not only on system accuracy but also on HMM training time and memory requirements, yet overlooked in most related worked as described in Section 1.6. In fact, HMMs trained with different number of states are able to capture different underlying structures of data. Different initializations may lead the algorithm to converge to different solutions in the parameters space due to the many local maxima of the likelihood function. Therefore, a single best HMM will not provide a high level of performance over the entire detection space. Appendix V empirically confirms the impact of N on performance of HMM-based anomaly detectors using different fixed-size synthetic HIDS data sets. A multiple-HMMs approach is therefore proposed, where each HMM is trained using a different number of hidden states, and where HMM responses are combined in the ROC space according to the MRROC technique (Scott et al., 1998). Results indicate that this approach provides a higher level of performance than a single best HMM and STIDE matching techniques , over a wide range of training set sizes with various alphabet sizes and irregularity indices, and different anomaly sizes. Chapter 3 presents a novel iterative Boolean combination (IBC ) technique for efficient fusion of the responses from multiple classifiers in the ROC space. The IBC efficiently exploits all Boolean functions applied to the ROC curves and therefore requires no prior assumptions about conditional independence of detectors or convexity of ROC curves. During simulations conducted on fixed-size synthetic and real HIDS data sets, the IBC has been also applied to combine the responses a of multiple-HMM system, where each HMM is trained using a different number of states and initializations. Results have shown that, even with one iteration, the IBC technique always increases system performance over related ROC-based combination techniques, such as the MRROC fusion, and the Boolean conjunction or disjunction fusion functions, without a significant computational and storage overhead. In fact, the time complexity of IBC is linear with the number of classifiers while its memory requirement is independent of the number of classifiers, 212 which allows for a large number of combinations. When the IBC is allowed to iterate until convergence, the performance improves significantly while the time and memory complexity required for each iteration are reduced by an order of magnitude with reference to the first iteration. The performance gain, especially when provided with limited training data, is due to the ability of the IBC technique to exploit diverse information residing in inferior points on the ROC curves, which are disregarded by the other techniques. IBC has been also shown useful for repairing the concavities in a ROC curve. The proposed IBC is general in that it can be employed to combine diverse responses of any crisp or soft detectors, within a wide range of application domains. For instance, it has been successfully applied to combine the responses from different biometric systems (trained on the same fixed-size data) for iris recognition (Gorodnichy et al., 2011). Similar to batch learning form fixed-size data sets, a single HMM system for incremental learning from successive blocks of data may not approximate the underlying data distribution adequately. Chapter 4 presents an adaptive system for incremental learning of new data over time using a learn-and-combine approach based on the proposed Boolean combination techniques. When a new block of training data becomes available, a new pool of HMMs is generated using different number of HMM states and random initializations. The responses from the newly-trained HMMs are then incrementally combined to those of the previously-trained HMMs in ROC space using the proposed Boolean combination techniques. Since the pool size grows indefinitely as new blocks of data become available over time, employing specialized model management strategies is therefore a crucial aspect for maintaining the efficiency of an adaptive ADS. The proposed ensemble selection techniques have been shown to form compact EoHMMs by selecting diverse and accurate base HMMs from the pool, while maintaining or improving the overall system accuracy. The proposed pruning strategy has been shown to limit the pool size from increasing indefinitely with the number of data blocks, reducing thereby the storage space for accommodating HMMs parameters and the computational costs of the selection algorithms, without negatively affecting the overall system performance. The proposed 213 system is capable of changing its desired operating point during operations, and this point can be adjusted to changes in prior probabilities and costs of errors. During simulations conducted for incremental learning from on successive blocks of synthetic and real-world HIDS data, the proposed learn-and-combine approach has been shown to achieve the highest level of accuracy than all related techniques presented in Chapter 4. In particular, it has been shown to achieve higher level of accuracy than when parameters of a single best HMM are estimated, at each learning stage, using reference batch (BBW) and the proposed incremental learning techniques (IBW). The results of this thesis provide an answer to the main research question: Given newly-acquired training data, is it best to adapt parameters of HMM detectors trained on previously-acquired data, or to train new HMMs on newly-acquired data and combine them with those trained on previously-acquired data? Adapting parameters of a previously-trained (single) HMM to newly-acquired training data according to IBW provide an efficient solution, which can limit existing knowledge corruption, and maintain or improve the detection accuracy over time. However, the corruption of existing knowledge caused by the large number of local maxima in the solution space remains an issue. This is illustrated in the large variances of the results achieved with the IBW algorithm, and in the lower level of detection accuracy compared to that of BBW (see Section 4.5). In this thesis, a learn-and-combine has been proposed whereby a new HMM is trained on newly-acquired data and its responses are combined with those trained on previouslyacquired data using incremental Boolean combination in ROC space. This approach has been shown to provide the overall highest level of accuracy over time, especially when training data is limited (see Section 4.5). In contrast, existing techniques for combining HMMs have been shown shown unable to maintain system accuracy over time. For instance, the combination of these (previously- and newly-trained) HMMs at the detector level by using model averaging or weighted averaging (Hoang and Hu, 2004), have been shown inadequate for ergodic HMMs and lead to knowledge corruption, as presented in 214 Section II.5. Similarly, combining the outputs of these HMMs using traditional fusion techniques at the score-level (such as median) or decision-level (e.g., majority vote), have been shown to provide poor detection accuracy over time (see Section 4.5). The learnand-combine approach may require more resources than IBW since several HMMs are employed during operations. Depending on the operation requirements, higher tolerance values can be imposed on the model selection algorithms to reduce the number of selected HMMs for operations, and hence trade-off accuracy with efficiency. Finally, the learnand-combine approach provide a general and versatile solution for adapting and selecting HMMs over time. In particular, HMMs that are incrementally updated according to IBW can be also combined according to the learn-and-combine approach. Several techniques have been applied to learn the normal process behavior using system call sequences (Forrest et al., 2008). It is outside the scope of this thesis to compare with every possible approach. In particular, different experimental protocols (data pre-processing, detector window size, detector training, etc.) and different evaluation techniques (anomaly definition and labeling) have been employed during the conducted experiments. Therefore, comparison, in this thesis have been limited to classical sequence matching techniques, HMM batch and incremental techniques, and all related fusion techniques. Future work Several works have discussed the possibility of evading system call based IDSs by disguising anomalous sub-sequences within the normal process behavior. Mimicry attacks (Krügel et al., 2005; Parampalli et al., 2008; Tandon and Chan, 2006; Wagner and Soto, 2002) rely on crafting code that can generate malicious system call sequences that are considered legitimate by anomaly detection model. For instance, changing foreign system calls required for the attack to equivalent system calls used by the original process (Tan et al., 2002). In essence, mimicry attacks illustrate possible disadvantages of using the system call temporal order only as features. Therefore, some authors proposed including 215 system call arguments additional features (Cho and Han, 2003; Tandon and Chan, 2006). The applicability of mimicry attacks and their variants in practice, and the techniques to detect these attacks are addressed by Forrest et al. (2008). The proof-of-concept simulations conducted in this thesis considered the order of system calls only, for simplicity and availability of UNM datasets. However, including feature vectors to train HMM is a direct extension and does not affect the proposed incremental learning strategy nor the EFFBS algorithm. In reality there are many different types of intrusions, and different features and detectors are therefore needed. A hybrid intrusion detection system combines two or more types of IDSs for improved accuracy. A hybrid IDS attempts to provide a more comprehensive system that may address the limitations of each IDS type. For instance, anomaly detection techniques may provide useful information to define signatures for misuse detectors, yielding an improved classification of attacks. On the other hand, misuse detection systems provide precise and descriptive alarms; while anomaly detection can only detect symptoms of attacks without specific knowledge of details. A hybrid IDS can have both host and network sensors and thus detects both host-based and network-based attacks. Furthermore, a HIDS may complement a NIDS as a verification system, to confirm for instance whether the suspicious network traffic has resulted in a successful attack or not. In this research, the proposed Boolean combination techniques are applied to combine responses from the same one-class classifiers (HMMs) trained on the same data and using the same features, however according to different hyperparameters and initializations. Future work involves applying these techniques to combine various detection techniques employed within a hybrid IDS. For instance, to combine the response from the same detectors trained using different feature sets and also to combine the responses from different detectors (e.g., using supervised, unsupervised, generative, and discriminative approaches) trained on the same data. Investigation of diversity measures and impact of correlations within the ROC space is an important future direction. This may yield 216 further insight into the design of classifier ensembles that contribute toward efficient and accurate combinations. The robustness of the learn-and-combine approach depends on maintaining a representative validation set over time, for selection of HMMs, decision thresholds and Boolean functions. The proposed ADS adopts a human-centric approach, which relies on the system administrator to update the validation set with recent and most informative anomalous sub-sequences. Future work involves investigating techniques such as active learning to reduce labeling and selection costs. The presented results are for validation and test sets that comprised moderate skew (up to 75% of normal and 25% of anomalous data). Another future investigation would be to assess the impact on system performance of heavily imbalanced data. Another important issue is the ability to operate in dynamically-changing environments. In such cases, some novelty criteria on the new data should be employed to detect changes and trigger adaptation by resting a monotonically decreasing learning rate or fine-tuning an auto-adaptive learning rate. It may be useful to investigate whether the on-line algorithms for estimation of HMM parameters, which are surveyed in Chapter 2, can be employed to detect changes in normal behaviors over time. Although not explored in this thesis, the proposed learn-and-combine approach can handle concept drift in changing environments, which is typically attributed to changes in prior, class-conditional, or posterior probability distribution of classes (Kuncheva, 2004b). As designed, the learn-and-combine approach allows to account for changes in class prior probabilities. Recent techniques for on-line estimation of evolving class priors (Zhang and Zhou, 2010) can, thereby, be directly employed in the proposed system, for a dynamic selection of the best ensemble of classifiers and for adaptation of decision thresholds and Boolean fusion functions. Other types of drift can be accounted for by updating training data and ensemble members, replacing inaccurate classifiers, and adding new features (Kuncheva, 2004b). The learn-and-combine approach is designed to select and combine 217 output decisions from different homogeneous or heterogeneous, crisp or soft detectors, which are trained on different data or features using on-line or batch learning techniques. Another interesting future work is to evaluate the ability of the learn-and-combine approach to adapt to concept drift and evolving priors in various detection applications with dynamically changing environments. 218 APPENDIX I ON THE MEMORY COMPLEXITY OF THE FORWARD BACKWARD ALGORITHM∗ Abstract The Forward-Backward (FB) algorithm forms the basis for estimation of Hidden Markov Model (HMM) parameters using the Baum-Welch technique. It is however known to be prohibitively costly when estimation is performed from long observation sequences. Several alternatives have been proposed in literature to reduce the memory complexity of FB at the expense of increased time complexity. In this paper, a novel variation of the FB algorithm – called the Efficient Forward Filtering Backward Smoothing (EFFBS) – is proposed to reduce the memory complexity without the computational overhead. Given an HMM with N states and an observation sequence of length T , both FB and EFFBS algorithms have the same time complexity, O(N 2 T ). Nevertheless, FB has a memory complexity of O(N T ), while EFFBS has a memory complexity that is independent of T , O(N ). EFFBS requires fewer resources than FB, yet provides the same results. I.1 Introduction Hidden Markov Model (HMM) is a stochastic model for sequential data. Provided with an adequate number of states and sufficient set of data, HMMs are capable of representing probability distributions corresponding to complex real-world phenomena in term of simple and compact models. The forward-backward (FB) algorithm is a dynamic programming technique that forms the basis for estimation of HMM parameters. Given a finite sequence of training data, it efficiently evaluates the likelihood of this data given an HMM, and computes the smoothed conditional state probability densities – the sufficient statistics – for updating HMM parameters according to the Baum-Welch (BW) algorithm ∗ This chapter is published DOI:10.1016/j.patrec.2009.09.023 as an article in Pattern Recognition Letters journal, 220 (Baum, 1972; Baum et al., 1970). BW is an iterative expectation maximization (EM) technique specialized for batch learning of HMM parameters via maximum likelihood estimation. The FB algorithm is usually attributed to Baum (1972); Baum et al. (1970), although it was discovered earlier by Chang and Hancock (1966) and then re-discovered in different areas in the literature (Ephraim and Merhav, 2002). Despite suffering from numerical instability, the FB remains more famous than its numerically stable counterpart; the Forward Filtering Backward Smoothing (FFBS)1 algorithm (Ott, 1967; Raviv, 1967). As a part of the BW algorithm, the FB algorithm has been employed in various applications such as signal processing, bioinformatics and computer security, (Cappe, 2001). When learning from long sequences of observations the FB algorithm may be prohibitively costly in terms of time and memory complexity. This is the case, for example, of anomaly detection in computer security (Lane, 2000; Warrender et al., 1999), protein sequence alignment (Krogh et al., 1994) and gene-finding (Meyer and Durbin, 2004) in bioinformatics, and robot navigation systems (Koenig and Simmons, 1996). For a sequence of length T and an ergodic HMM with N states, the memory complexity of the FB algorithm grows linearly with T and N , O(N T ), and its time complexity grows quadratically with N and linearly with T , O(N 2 T ) 2 . For a detailed analysis of complexity the reader is referred to Table AI.1 (Section I.3). When T is very large, the memory complexity may exceed the resources available for the training process, causing overflow from internal system memory to disk storage. Several alternatives have been proposed in literature trying to reduce the memory complexity of the FB algorithm or, ideally, to eliminate its dependence on the sequence length T . Although there is an overlap between these approaches and some have been re-discovered, due to the wide range of HMMs applications, they can be divided into 1 FFBS is also referred to in literature as α − γ and disturbance smoothing algorithm. the HMM is not fully connected, the time complexity reduces to O(N Qmax T ), where Qmax is the maximum number of states that any state is connected to. 2 If 221 two main types. The first type performs exact computations of the state densities using fixed-interval smoothing (similar to FB and FFBS), such as the checkpointing and forward-only algorithms. However, as detailed in section I.3, the memory complexity of the former algorithm still depends on T , while the latter eliminates this dependency at the expenses of a significant increase in time complexity, O(N 4 T ). Although outside the scope of this paper, there are approximations to the FB algorithm. Such techniques include approximating the backward variables of the FB algorithm using a sliding time-window on the training sequence (Koenig and Simmons, 1996), or slicing the sequence into several shorter ones that are assumed independent, and then applying the FB on each sub-sequence (Wang et al., 2004; Yeung and Ding, 2003). Other solutions involve performing on-line learning techniques on a finite data however (FlorezLarrahondo et al., 2005; LeGland and Mevel, 1997). These techniques are not explored because they provide approximations to the smoothed state densities, and hence lead to different HMMs. In addition, their theoretical convergence properties are based on infinite data sequence (T → ∞). A novel modification to the FFBS called Efficient Forward Filtering Backward Smoothing (EFFBS) is proposed, such that its memory complexity is independent of T , without incurring a considerable computational overhead. In contrast with the FB, it employs the inverse of HMM transition matrix to compute the smoothed state densities without any approximations. A detailed complexity analysis indicates that the EFFBS is more efficient than existing approaches when learning of HMM parameters form long sequences of training data. This paper is organized as follows. In Section I.2, the estimation of HMM parameters using the FB and the Baum-Welch algorithms is presented. Section I.3 reviews techniques in literature that addressed the problem of reducing the memory complexity of the FB algorithm, and accesses their impact on the time complexity. The new EFFBS algorithm is presented in Section I.4, along with a complexity analysis and case study. 222 I.2 Estimation of HMM Parameters The Hidden Markov Model (HMM) is a stochastic model for sequential data. A discretetime finite-state HMM is a stochastic process determined by the two interrelated mechanisms – a latent Markov chain having a finite number of states, and a set of observation probability distributions, each one associated with a state. At each discrete time instant, the process is assumed to be in a state, and an observation is generated by the probability distribution corresponding to the current state. The model is termed discrete (or finite-alphabet) if the output alphabet is finite. It is termed continuous (or general) if the output alphabet is not necessarily finite e.g., the state is governed by a parametric density function, such as Gaussian, Poisson, etc. For further details regarding HMM the reader is referred to the extensive literature (Ephraim and Merhav, 2002; Rabiner, 1989). Formally, a discrete-time finite-state HMM consists of N hidden states in the finitestate space S = {S1 , S2 , ..., SN } of the Markov process. Starting from an initial state Si , determined by the initial state probability distribution πi , at each discrete-time instant, the process transits from state Si to state Sj according to the transition probability distribution aij (1 ≤ i, j ≤ N ). As illustrated in Figure AI.1, the process then emits a symbol v according to the output probability distribution bj (v), which may be discrete or continuous, of the current state Sj . The model is therefore parametrized by the set λ = (π, A, B), where vector π = {πi } is initial state probability distribution, matrix A = {aij } denotes the state transition probability distribution, and matrix B = {bj (k)} is the state output probability distribution. I.2.1 Baum-Welch Algorithm: The Baum-Welch (BW) algorithm (Baum and Petrie, 1966; Baum et al., 1970) is an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) specialized for estimating HMM parameters. It is an iterative algorithm that adjusts HMM parameters to best fit the observed data, o1:T = {o1 , o2 , . . . , oT }. This is typically achieved by maximiz- 223 Continuous output Discrete output bj (k) = f (ot = v | qt = j) bj (k) = P (ot = vk | qt = j) b2 (v) 1.0 a22 0.5 0.0 −5 0 5 v a12 a21 S1 0.0 S2 o1 , o2 , . . . , oT 2 1 2 3 4 vk 1 bj (k) a32 a23 a13 Observations k 0.5 a11 aij M b2 (k) 1.0 N j S3 a33 a31 = P (qt+1 = j | qt = i) i Hidden states aij q1 , q 2 , . . . , q T 2 1 t 1 2 T Figure AI.1 Illustration of a fully connected (ergodic) three states HMM with either a continuous or discrete output observations (left). Illustration of a discrete HMM with N states and M symbols switching between the hidden states qt and generating the observations ot (right). qt ∈ S denotes the state of the process at time t with qt = i means that the state at time t is Si ing the log-likelihood, T (λ) log P (o1:T | λ) of the training data over HMM parameters space (Λ): λ∗ = argmaxλ∈Λ T (λ) (I.1) The subscripts (.)t|T in formulas (II.5) to (II.8) are used to stress the fact that these are smoothed probability estimates. That is, computed from the whole sequence of observations during batch learning. During each iteration, the E-step computes the sufficient statistics, i.e., the smoothed a posteriori conditional state density: γt|T (i) P (qt = i | o1:T , λ) (I.2) and the smoothed a posteriori conditional joint state density 3 : ξt|T (i, j) P (qt = i, qt+1 = j | o1:T , λ) 3 To (I.3) facilitate reading, the terms a posteriori and conditional will be omitted whenever it is clear from the context 224 which are then used, in the M-step, to re-estimate the model parameters: (k+1) πi (k+1) aij (k) = γ1|T (i) T −1 (k) t=1 ξt|T (i, j) = T −1 (k) t=1 γt|T (i) (k+1) (m) = bj (I.4) T (k) t=1 γt|T (j)δot vm T (k) t=1 γt|T (j) The Kronecker delta δij is equal to one if i = j and zero otherwise. Starting with an initial guess of the HMM parameters, λ0 , each iteration k, of the E- and M-step is guaranteed to increase the likelihood of the observations giving the new model until a convergence to a stationary point of the likelihood is reached (Baum, 1972; Baum et al., 1970). The objective of the E-step is therefore to compute an estimate q̂t|τ of an HMM’s hidden state at any time t given the observation history o1:τ . The optimal estimate q̂t|τ in the minimum mean square error (MMSE) sense of the state qt of the HMM at any time t, given the observation is the conditional expectation of the state qt given o1:τ (Li and Evans, 1992; Shue et al., 1998): q̂t|τ = E(qt | o1:τ ) = N γt|τ (i) (I.5) i=1 In estimation theory, this conditional estimation problem is called filtering if t = τ ; prediction if t > τ , and smoothing if t < τ . Figure AI.2 provides an illustration of these estimation problems. The smoothing problem is termed fixed-lag smoothing when computing the E(qt | o1:t+Δ ) for a fixed-lag Δ > 0, and fixed-interval smoothing when computing the E(qt | o1:T ) for all t = 1, 2, . . . , T . For batch learning, the state estimate is typically performed using fixed-interval smoothing algorithms (as described in Subsections I.2.2 and I.2.3). Since it incorporates more 225 evidence, fixed-interval smoothing provides better estimate of the states than filtering. This is because the latter relies only on the information that is available at the time. Filtering o1 , o2 , , oτ ... qt Prediction o1 , o2 , , oτ h ... qt Fixed-lag o ,o , smoothing 1 2 , oτ Δ qt ... Fixed-interval o1 , o2 , smoothing ... , oT qt Figure AI.2 Illustration of the filtering, prediction and smoothing estimation problems. Shaded boxes represent the observation history (already observed), while the arrows represent the time at which we would like to compute the state estimates I.2.2 Forward-Backward (FB) The Forward-Backward (Baum, 1972; Baum et al., 1970; Chang and Hancock, 1966) is the most commonly used algorithm for computing the smoothed state densities of Eqs. (II.5) and (II.6). Its forward pass computes the joint probability of the state at time t and the observations up to time t: αt (i) P (qt = i, o1:t | λ), (I.6) using the following recursion, for t = 1, 2, . . . , T and j = 1, . . . , N : ⎡ αt+1 (j) = ⎣ N i=1 ⎤ αt (i)aij ⎦ bj (ot+1 ) (I.7) 226 Similarly, the backward path computes the probability of the observations from time t + 1 up to T given the state at time t, βt (i) P (ot+1:T | qt = i, λ), (I.8) according to the reversed recursion, for t = T − 1, . . . , 1 and i = 1, . . . , N : βt (i) = N aij bj (ot+1 )βt+1 (j) (I.9) j=1 Then, both smoothed state and joint state densities of Eqs. (II.5) and (II.6) – the sufficient statistics for HMM parameters update – are directly obtained: αt (i)βt (i) γt|T (i) = N k=1 αt (k)βt (k) αt (i)aij bj (ot+1 )βt+1 (j) ξt|T (i, j) = N N k=1 l=1 αt (k)akl bl (ot+1 )βt+1 (l) (I.10) (I.11) By definition, the elements of α are not probability measures unless normalized, that is why they are usually called the α-variables. Due to this issue, the FB algorithm is not stable and susceptible to underflow when applied to a long sequence of observation. In Levinson et al. (1983) and Rabiner (1989) the authors suggest scaling the α-variables by their summation at each time: P (qt = i, o1:t λ) αt (i) ᾱt (i) = = = P (qt = i | o1:t , λ) P (o1:t λ) i αt (i) (I.12) making for a filtered state estimate. This of course requires scaling the beta β values, which have no an intuitive probabilistic meaning, with the same quantities (refer to Algorithm AI.1 and AI.2). Other solutions involves working with extended exponential and extended logarithm functions (Mann, 2006). I.2.3 Forward Filtering Backward Smoothing (FFBS) 227 Algorithm AI.1: Forward-Scaled(o1:T , λ) 1 2 3 4 5 6 7 8 9 10 11 12 13 Output: αt (i) P (qt = i, o1:t | λ), T (λ) and scale() for i = 1, . . . , N do α1 (i) ← πi bi (o1 ) // Initialization scale(1) ← N i=1 α1 (i) for i = 1, . . . , N do α1 (i) ← α1 (i)/scale(1) for t = 1, . . . T − 1 do for j = 1, . . . , N do αt+1 (j) ← N i=1 αt (i)aij bj (ot+1 ) // Induction scale(t + 1) ← N i=1 αt+1 (i) for i = 1, . . . , N do αt+1 (i) ← αt+1 (i)/scale(t + 1) for t = 1, . . . T do T (λ) ← T (λ) + log(scale(t)) // Evaluation Algorithm AI.2: Backward-Scaled (o1:T , λ, scale()) 1 2 Output: βt (i) P (ot+1:T | qt = i, λ) for i = 1, . . . , N do βT (i) ← 1 4 for t = T − 1, . . . , 1 do for i = 1, . . . , N do 5 βt (i) ← 3 // Initialization // Induction N a b (o )β (j) j=1 ij j t+1 t+1 scale(t+1) As highlighted in literature (Devijver, 1985; Ephraim and Merhav, 2002; Cappe and Moulines 2005), the FFBS is a lesser known but numerically stable alternative algorithm to FB, which has been proposed in Askar and Derin (1981); Lindgren (1978); Ott (1967); Raviv (1967). In addition, the FFBS is probabilistically more meaningful since it propagates probability densities in its forward and backward passes. During the forward pass, the algorithm splits the state density estimates into predictive (I.13) and filtered (I.14) state densities: γt|t−1 (i) P (qt = i | o1:t−1 , λ) (I.13) γt|t (i) P (qt = i | o1:t , λ) (I.14) 228 The initial predictive state density is the the initial probability distribution, γ1|0 (i) = πi . For t = 1, . . . , T , the filtered state density at time t is computed from the predictive state density at time t − 1, γt|t−1 (i)bi (ot ) γt|t (i) = N j=1 γt|t−1 (j)bj (ot ) (I.15) and then, the predictive state density at time t + 1 is computed from the filtered state density at time t: γt+1|t (j) = N γt|t (i)aij (I.16) i=1 Both the filtered and one-step prediction density, follow from a combination of the Bayes’ rule of conditional probability and the Markov property. Algorithm AI.3: Forward-Filtering(o1:T , λ) Output: γt|t (i) = P (qt = i | o1:t , λ) and T (λ) 1 for i = 1, . . . , N do γ1|0 (i) = πi 2 3 4 for t = 1, . . . T − 1 do for i = 1, . . . , N do γ 5 6 7 8 9 γt|t (i) ← N t|t−1 // Initialization // Induction (i)bi (ot ) γ (j)bj (ot ) j=1 t|t−1 for j = 1, . . . , N do γt+1|t (j) ← N i=1 γt|t (i)aij for t = 1, . . . T do T (λ) ← T (λ) + log N i=1 bi (ot )γt|t−1 (i) // Evaluation Algorithm AI.4: Backward-Smoothing (γT |T , λ) 1 2 3 4 Output: γt|T (i) = P (qt = i | o1:T ) and ξt|T (i, j) = P (qt = i, qt+1 = j | o1:T ) // Initialization γT |T ← input for t = T − 1, . . . , 1 do // Induction for i = 1, . . . , N do for j = 1, . . . , N do γ 5 6 ξt|T (i, j) ← Nt|t γt|T (i) ← N (i)aij γ (i)aij i=1 t|t j=1 ξt|T (i, j) γt+1|T (j) 229 Backward recursion, leads directly to the required smoothed densities by solely using the filtered and predictive state densities (Askar and Derin, 1981; Lindgren, 1978) for t = T − 1, . . . , 1: γt|t (i)aij γt+1|T (j) ξt|T (i, j) = N i=1 γt|t (i)aij γt|T (i) = N ξt|T (i, j) (I.17) (I.18) j=1 In contrast with the FB algorithm, the backward pass of the FFBS does not require access to the observations. In addition, since it is propagating probabilities densities, in both forward and backward passes, the algorithm is numerically stable. In the forward pass, the filtered state density can be also computed in the same way as the normalized ᾱ-variables (I.12), since after the normalization ᾱ becomes the state filter. When the forward pass is completed at time T , γT |T is the only smoothed state density, while all previous ones are filtered and predictive state estimates. The reader is referred to Algorithm AI.3 and AI.4 for further details on the FFBS. I.3 Complexity of Fixed-Interval Smoothing Algorithms This section presents a detailed analysis of the time and memory complexity for the FB and FFBS algorithms. It also presents the complexity of the checkpointing and forward only algorithms. These algorithms have been proposed to reduce the memory complexity of FB at the expense of time complexity. The time complexity is defined as the sum of the worst-case running time for each operation (e.g., multiplication, division and addition) required to process an input. The growth rate is then obtained by making the parameters of the worst-case complexity tend to ∞. Memory complexity is estimated as the number of 32 bit registers needed during learning process to store variables. Only the worst-case memory space required during processing phase is considered. 230 I.3.1 FB and FFBS Algorithms The main bottleneck for both FB and FFBS algorithms occurs during the induction phase. In the ergodic case, the time complexity per iteration is O(N 2 T ) as shown in (I.7) for the FB algorithm (see also lines 6-11 of Algorithm AI.1), and in (I.15) and (I.16) for the FFBS algorithm (see also lines 3-7 Algorithm AI.3). In addition, as presented in Figure AI.3, the filtered state densities (Eqs. (I.7) for FB, and (I.15) and (I.16) for FFBS) computed at each time step of the forward pass, must be loaded into the memory in order to compute smoothed state densities (Eqs. (II.5) and (II.6)) in the backward pass. The bottleneck in terms of memory storage, for both algorithms, is therefore the size of the matrix of N × T floating-points values. A memory complexity of O(N 2 T ) is often attributed to FB and FFBS algorithms in the literature. In such analysis, memory is assigned to ξt|T (i, j) which requires T matrices of N × N floating points. However, as shown in the transition update Equation (II.8), only the summation over time is required; hence only one matrix of N 2 floating points is required for storing T t=1 ξt|T (i, j). In addition, the N floating point values required for storing the filtered state estimate, γt|t , (i.e., N 2 + N ), are also unavoidable. Table AI.1 details an analysis of the worst-case time and memory complexity for the FB and FFBS algorithms processing an observation sequence o1:T of length T . Time complexity represents the worst-case number of operations required for one iteration of BW. A BW iteration involves computing one forward and one backward pass with o1:T . Memory complexity is the worst-case number of 32 bit words needed to store the required temporary variables in RAM. Analysis shows that the FFBS algorithm requires slightly fewer computations (3N 2 T + N T multiplications, 2N T divisions and 3N T additions) than the FB algorithm. It can be also seen that both algorithms require the same amount of storage: an array of N T + N 2 + N floating point values. Forward Pass 231 o1 o2 ot oT −1 S1 P (q1 = 1 | o1 ) P (q2 = 1 | o1:2 ) P (qt = 1 | o1:t ) P (qT = 1 | o1:T ) P (qT = 1 | o1:T ) Si P (q1 = i | o1 ) P (q2 = i | o1:2 ) P (qt = i | o1:t ) P (qT = i | o1:T ) P (qT = i | o1:T ) SN P (q1 = N | o1 ) P (q2 = N | o1:2 ) P (qt = N | o1:t ) P (qT = N | o1:T ) P (qT = N | o1:T ) ᾱ2 γ2|2 ᾱt γt|t ᾱT −1 γT −1|T −1 ᾱT γT |T Backward Pass State FB: ᾱ1 FFBS: γ1|1 Time γ2|1 γt+1|t γT |T −1 o1 o2 ot oT −1 oT β1 β2 γ2|T ξ2|T βt γt|T ξt|T βT −1 γT −1|T ξT −1|T βT γT |T ξT |T FB: γ1|T ξ1|T FFBS: γ3|2 oT Time γ1|T ξ1|T γ2|T γ2|T ξ2|T γ3|T γt|T ξt|T γt+1|T γT −1|T ξT −1|T γT |T −1 γT |T ξT |T Figure AI.3 An illustration of the values that require storage during the forward and backward passes of the FB and FFBS algorithms. During the forward pass the filtered state densities at each time step (N × T floating-points values) must be stored into the internal system memory, to be then used to compute the smoothed state densities during the backward pass Table AI.1 Worst-case time and memory complexity analysis for the FB and FFBS for processing an observation sequence o1:T of length T with an N state HMM. The scaling procedure as described in Rabiner (1989) and shown in Algorithms AI.1 and AI.2 is taken into consideration with the FB algorithm Computations αt βt T t γt|T T −1 t=1 ξt|T Total γt|t βt T t γt|T T −1 t=1 ξt|T Total Time # Multiplications # Divisions # Additions Forward-Backward (FB) N 2T + N T − N 2 NT − N N 2T − N 2 + N 2N 2 T − 2N 2 NT N 2T − N T − N 2 + N NT NT NT − T 3N 2 T − 3N 2 N 2T − N 2 N 2T − N T − N 2 + N 6N 2 T + 2N T − 6N 2 N 2 T + 3N T − N 2 − N 3N 2 T + N T − 3N 2 + 3N − T Forward Filtering Backward Smoothing (FFBS) N 2T + N T − N 2 NT N 2T − N 2 + N 2 – – N T − NT − N2 + N – – NT − T 2N 2 T − 2N 2 N 2T − N 2 N 2T − N T − N 2 + N 3N 2 T + N T − 3N 2 N 2T + N T − N 2 3N 2 T + N T − 3N 2 + 3N − T Memory NT – N N2 NT + N2 + N NT – N N2 NT + N2 + N With both FB and FFBS algorithms a memory problem stems from its linear scaling with the length of the observation sequence T . Applications that require estimating HMM 232 parameters from a long sequence will require long training times and large amounts of memory, which may be prohibitive for the training process. As described next, some alternatives for computation of exact smoothed density have been proposed to alleviate this issue. However, these alternatives are either still partially dependent on the sequence length T , or increase the computational time by orders of magnitude to achieve a constant memory complexity. I.3.2 Checkpointing Algorithm The checkpointing algorithm stores only some reference columns or checkpoints, instead of storing the whole matrix of filtered state densities (Grice et al., 1997; Tarnas and √ Hughey, 1998). Therefore, it divides the input sequence into T sub-sequences. While performing the forward pass, it only stores the first column of the forward variables for each sub-sequences. Then, during the backward pass, the forward values for each subsequence are sequentially recomputed, beginning with their corresponding checkpoints. √ Its implementation reduces the memory complexity to O(N T ), yet increases the time √ complexity to O(N 2 (2T − T )). The memory requirement can be therefore traded-off against the time complexity. I.3.3 Forward Only Algorithm A direct propagation of the smoothed densities in forward-only manner is an alternative that has been proposed in the communication field (Elliott et al., 1995; Narciso, 1993; Sivaprakasam and Shanmugan, 1995) and re-discovered in bioinformatics (Churbanov and Winters-Hilt, 2008; Miklos and Meyer, 2005). The basic idea is to directly propagate all smoothed information in the forward pass σt (i, j, k) t−1 τ =1 P (qτ = i, qτ +1 = j, qt = k | o1:t , λ), 233 which represents the probability of having made a transition from state Si to state Sj at some point in the past (τ < t) and of ending up in state Sk at the current time t. At the next time step, σt+1 (i, j, k) can be recursively computed from σt (i, j, k) using: σt (i, j, k) = N σt−1 (i, j, n)ank bk (ot ) + αt−1 (i)aik bk (ot )δjk (I.19) n=1 from which the smoothed state densities can be obtained, at time T , by marginalizing over the states. By eliminating the backward pass, the memory complexity becomes O(N 2 + N ), which is independent T . However, it is achieved at the expenses of increasing the time complexity to O(N 4 T ) as can be seen in the four-dimensional recursion (I.19). Due to this computational overhead for training, this forward-only approach has not gained much popularity in practical applications. I.4 Efficient Forward Filtering Backward Smoothing (EFFBS) Algorithm In this section a novel and efficient alternative that is able to compute the exact smoothed densities with a memory complexity that is independent of T , and without increasing time complexity is described. The proposed alternative – termed the Efficient Forward Filtering Backward Smoothing (EFFBS) – is an extension of the FFBS algorithm. It exploits the facts that a backward pass of the FFBS algorithm does not require any access to the observations and that the smoothed state densities are recursively computed from each other in a backward pass. Instead of storing all the predictive and filtered state densities (see Eqs. (I.13) and (I.14)) for each time step t = 1, . . . , T , the EFFBS recursively recomputes their values starting from the end of the sequence. Accordingly, the memory complexity of the algorithm becomes independent of the sequence length. That is, by storing only the last values of γT |T and γT |T −1 from the forward pass, the previously-computed filtered and predictive 234 densities can be recomputed recursively, in parallel with the smoothed densities (refer to Algorithm AI.5). Let the notations and (÷) be the term-by-term multiplication and division of two vectors, respectively. Eqs. (I.14) and (I.16) can be written as: γt|t = γt|t−1 bt * * * * *γt|t * (I.20) 1 γt|t−1 = A γt−1|t−1 (I.21) where bt = [b1 (ot ), . . . bN (ot )] , .1 is the usual 1-norm of a vector (column sum) and the prime superscript denotes matrix transpose. The backward pass of the EFFBS algorithm starts by using only the column-vector filtered state estimate of the last observation, γT |T , resulting from the forward pass. Then, for t = T − 1, . . . 1, γt|t−1 is recomputed from γt|t using: γt|t (÷)bt * γt|t−1 = ** * *γt|t−1 * (I.22) 1 from which, the filtered state estimate can be recomputed by γt−1|t−1 = A −1 γt|t−1 (I.23) This step requires the inverse of the transposed transition matrix [A ]−1 to solve N linear equations with N unknowns. The only necessary condition is the nonsingularity of the transition matrix A to be invertible. This is a weak assumption that has been used in most of the studies that addressed the statistical properties of HMMs (c.f. Ephraim and Merhav, 2002). In particular, it holds true in applications that use fully connected HMMs, where the elements of the transition matrix are positives (i.e., irreducible and aperiodic). I.4.1 Complexity Analysis of the EFFBS Algorithm 235 Algorithm AI.5: EFFBS(o1:T , λ): Efficient Forward Filtering Backward Smoothing Output: γt|T (i) = P (qt = i | o1:T ) and ξt|T (i, j) = P (qt = i, qt+1 = j | o1:T ) // Forward Filtering: the filtered, γt|t , and predictive, γt+1|t , state densities are column vectors of length N . Their values are recursively computed, however they are not stored at each time step, t, but recomputed in the backward pass to reduce the memory complexity. 1 for i = 1, . . . , N do // Initialization γ1|0 (i) = πi 2 3 4 for t = 1, . . . T − 1 do for i = 1, . . . , N do γ 5 6 7 8 9 10 11 12 13 γt|t (i) ← N t|t−1 15 (i)bi (ot ) γ (k)bk (ot ) k=1 t|t−1 for j = 1, . . . , N do γt+1|t (j) ← N i=1 γt|t (i)aij for t = 1, . . . T do T (λ) ← T (λ) + log N i=1 bi (ot )γt|t−1 (i) γt|t−1 (i) = N t|t (i)/bi (ot ) γ (k)/bk (ot ) k=1 t|t for j = 1, . . . , N do γ 16 17 // Evaluation // Backward Smoothing: the last values of the filtered, γT |T , and predictive, γT |T −1 , state densities are now used to recompute their previous values. A = a−1 // matrix inversion only once per iteration for t = T − 1, . . . , 1 do // Induction for i = 1, . . . , N do γt|t (i) = N k=1 Aki γt+1|t (k) γ 14 // Induction ξt|T (i, j) ← Nt|t γt|T (i) ← N (i)aij γ (i)aij i=1 t|t γt+1|T (j) j=1 ξt|T (i, j) In order to compute the smoothed densities in a constant memory complexity, the EFFBS algorithm must recompute γt|t and γt|t−1 values in its backward pass, instead of storing them, which requires twice the effort of the forward pass computation. This recomputation however takes about the same amount of time required for computing the β values in the backward pass of the FB algorithm. Table AI.2 presents a worst-case analysis of time and memory complexity for the EFFBS. The proposed algorithm is comparable to both FB and FFBS algorithms (see Table AI.1) in terms of computational time, with an overhead of about N 2 T + N T multiplications 236 Table AI.2 Worst-case Time and Memory Complexity of the EFFBS Algorithm for Processing an Observation Sequence o1:T of Length T with an N State Ergodic HMM Computation γT |T γt|t and γt|t−1 A−1 T t γt|T T −1 t=1 ξt|T Total Time # Divisions NT NT – – 2 N T − N2 # Multiplications N 2T + N T − N 2 N 2T + N T − N 2 N3 N2 N 3 + 2 − 3 – 2 2N T − 2N 2 4N 2 T + 2N T + N 2 T + 2N T − N 2 N3 7N 2 N − − 3 2 2 Memory # Additions N 2T − N 2 + N N 2T − N 2 + N N3 N2 5N 3 + 2 − 6 2 N T − NT − N2 + N N 2T − N T − N 2 + N 4N 2 T − 2N T + N3 7N 2 19N 3 − 2 + 6 N 2N N2 N N2 N 2 + 5N Table AI.3 Time and Memory Complexity of the EFFBS Versus Other Approaches (Processing an Observation Sequence o1:T of Length T , for an Ergodic HMM with N States) Algorithm FB FFBS Checkpointing Forward-Only EFFBS Time O(N 2 T ) O(N 2 T )√ O(N 2 (2T − T )) O(N 4 T ) O(N 2 T ) Memory O(N T ) O(N√T ) O(N T ) O(N ) O(N ) and additions. An additional computational time is required for the inversion of the transition matrix, which is computed only once for each iteration. The classical GaussJordan elimination, may be used for instance to achieve this task in O(N 3 ). Faster algorithms for matrix inversion could be also used. For instance, inversion based on Coppersmith–Winograd algorithm (Coppersmith and Winograd, 1990) is O(N 2.376 ). The EFFBS only requires two column vectors of N floating point values on top of the unavoidable memory requirements (N 2 + N ) of the smoothed densities. In addition, a matrix of N 2 floating point values is required for storing the inverse of the transposed transition matrix [A ]−1 . Table AI.3 compares the time and memory complexities of existing fixed-interval smoothing algorithms for computing the state densities to the proposed EFFBS algorithm. 237 I.4.2 Case Study in Host-Based Anomaly Detection Intrusion Detection Systems (IDSs) are used to identify, assess, and report unauthorized computer or network activities. Host-based IDSs (HIDSs) are designed to monitor the host system activities and state, while network-based IDSs monitor network traffic for multiple hosts. In either case, IDSs have been designed to perform misuse detection – looking for events that match patterns corresponding to known attacks – and anomaly detection – detecting significant deviations from normal system behavior. Operating system events are usually monitored in HIDSs for anomaly detection. Since system calls are the gateway between user and kernel mode, early host-based anomaly detection systems monitor deviation in system call sequences. Various detection techniques have been proposed to learn the normal process behavior through system call sequences (Warrender et al., 1999). Among these, techniques based on discrete Hidden Markov Models (HMMs) have been shown to shown to provide high level of performance (Warrender et al., 1999). The primary advantage of anomaly-based IDS is the ability to detect novel attacks for which the signatures have not yet been extracted. However, anomaly detectors will typically generate false alarms mainly due to incomplete data for training. A crucial step to design an anomaly-based IDS is to acquire sufficient amount of data for training. Therefore, a large stream of system call data is usually collected from a process under normal operation in a secured environment. This data is then provided for training the parameters of an ergodic HMM to fit the normal behavior of the monitored process. A well trained HMM should be able to capture the underlying structure of the monitored application using the temporal order of system calls generated by the process. Once trained, a normal sequence presented to HMM should produce a higher likelihood value than for a sequence that does not belong to the normal process pattern or language. According to a decision threshold the HMM-based Anomaly Detection System (ADS) is able to discriminate between the normal and anomalous system call sequences. 238 The sufficient amount of training data depends on the complexity of the process and is very difficult to be quantified a priori. Therefore, the larger the sequence of system calls provided for training, the better the discrimination capabilities of the HMM-based ADS and the fewer the rate of false alarms. In practice however, the memory complexity of the FB (or FFBS) algorithm could be be prohibitively costly when learning is performed for a large sequence of system calls, as previously described. In their experimental work on training HMM for anomaly detection using large sequences of system calls, Warrender et al. (1999) state: On our largest data set, HMM training took approximately two months, while the other methods took a few hours each. For all but the smallest data sets, HMM training times were measured in days, as compared to minutes for the other methods. This is because the required memory complexity for training the HMM with the FB algorithm exceeds the resources available, causing overflows from internal system memory, e.g., Random-Access Memory (RAM), to disk storage. Accessing paged memory data on a typical disk drive is approximately one million time slower than accessing data in the RAM. For example, if an ergodic HMM with N = 50 states must be trained using an observation sequence of length T = 100 × 106 system calls collected from a complex process. The time complexity per iteration of the FFBS algorithm (see Table AI.1): T ime{F F BS} = 75.5 × 1010 MUL + 25.5 × 1010 DIV + 75.5 × 1010 ADD is comparable to that of the EFFBS algorithm (see Table AI.2): T ime{EF F BS} = 101 × 1010 MUL + 26 × 1010 DIV + 99 × 1010 ADD 239 However, the amount of memory required by the EFFBS algorithm is 0.01 MB, which is negligible compared to the 20, 000 MB (or 20 GB) required by the FFBS algorithm. Therefore, the EFFBS algorithm could form a practical approach for designing lowfootprint intrusion detection systems. I.5 Conclusion This paper considered the memory problem of the Forward-Backward algorithm when trained with a long observation sequence. This is an interesting problem facing various HMM applications - ranging from computer security to robotic navigation systems and bioinformatics. To alleviate this issue, an Efficient Forward Filtering Backward Smoothing (EFFBS) algorithm is proposed to render the memory complexity independent of the sequence length, without incurring any considerable computational overhead and without any approximations. A detailed complexity analysis has shown that the proposed algorithm is more efficient than existing solutions in terms of both memory and computational requirements. 240 APPENDIX II A COMPARISON OF TECHNIQUES FOR ON-LINE INCREMENTAL LEARNING OF HMM PARAMETERS IN ANOMALY DETECTION∗ Abstract Hidden Markov Models (HMMs) have been shown to provide a high level performance for detecting anomalies in intrusion detection systems. Since incomplete training data is always employed in practice, and environments being monitored are susceptible to changes, a system for anomaly detection should update its HMM parameters in response to new training data from the environment. Several techniques have been proposed in literature for on-line learning of HMM parameters. However, the theoretical convergence of these algorithms is based on an infinite stream of data for optimal performances. When learning sequences with a finite length, incremental versions of these algorithms can improve discrimination by allowing for convergence over several training iterations. In this paper, the performance of these techniques is compared for learning new sequences of training data in host-based intrusion detection. The discrimination of HMMs trained with different techniques is assessed from data corresponding to sequences of system calls to the operating system kernel. In addition, the resource requirements are assessed through an analysis of time and memory complexity. Results suggest that the techniques for incremental learning of HMM parameters can provide a higher level of discrimination than those for on-line learning, yet require significantly fewer resources than with batch training. Incremental learning techniques may provide a promising solution for adaptive intrusion detection systems. ∗ This chapter is published in proceedings of the second IEEE international conference on Computational Intelligence for Security and Defense Applications (CISDA09) 242 II.1 Introduction Intrusion Detection Systems (IDSs) are used to identify, assess, and report unauthorized computer or network activities. Host-based IDSs (HIDSs) are designed to monitor the host system activities and state, while network-based IDSs monitor network traffic for multiple hosts. In either case, IDSs have been designed to perform misuse detection – looking for events that match patterns corresponding to known attacks – and anomaly detection – detecting significant deviations from normal system behavior. Operating system events are usually monitored in HIDSs for anomaly detection. Since system calls are the gateway between user and kernel mode, early host-based anomaly detection systems monitor deviation in system call sequences (Forrest et al., 1996). Various detection techniques have been proposed to learn the normal process behavior through system call sequences (Warrender et al., 1999). Among these, techniques based on discrete Hidden Markov Models (HMMs) have been shown to shown to provide high level of performance (Warrender et al., 1999). HMM is stochastic process for sequential data (Rabiner, 1989). Given an adequate amount of system call training data, HMM-based anomaly detectors can efficiently model the normal process behavior. A well trained HMM should be able to capture the underlying structure of the monitored application using the temporal order of system calls generated by the process. Once trained, an HMM provides a compact model, with tolerance to noise and uncertainty, which allows a fast evaluation during operations1 . A normal sequence presented to HMM should produce a higher likelihood value than for a sequence that does not belong to the normal process pattern or language. Their ability to discriminate between normal and malicious sequences have been discussed in literature (Gao et al., 2002; Hoang et al., 2003; Wang et al., 2004). The effects on performance of 1 In contrast, matching techniques that are based on look-up tables, e.g., STIDE (sequence time-delay embedding) must compare inputs to all normal training sequences. The number of comparisons increases exponentially with the detector window size DW , while for HMM evaluation the time complexity grows linearly with DW . 243 the training set size, irregularity of the process, anomaly types, and number of hidden states of HMM were recently investigated in Khreich et al. (2009b). The primary advantage of anomaly-based IDS is the ability to detect novel attacks for which the signatures have not yet been extracted. However, anomaly detectors will typically generate false alarms mainly due to incomplete data for training, poor modeling, and difficulty in obtaining representative labeled data for validation. In practice, it is very difficult to acquire (collect and label) comprehensive data sets to design a HIDS for anomaly detection. Therefore, a major requirement for an anomaly detection system (ADS) is the ability to accommodate new data without the need to restart the training process with all accumulated data. Most research found in literature for HMM-based anomaly detection using system calls assume being provided with a sufficient amount of data. Furthermore, the monitored process is not static – changes in the environment may occur, such as application update. This is also the case when fine tuning a base model to a specific host platform. Therefore, HMM parameters should be refined incrementally over time by accommodating newly acquired training data, to better fit the normal process behavior. Standard techniques for training HMM parameters involve batch learning, based either on the Baum-Welch (BW) algorithm (Baum et al., 1970), a specialized expectation maximization (EM) technique (Dempster et al., 1977), or on numerical optimization methods, such as the Gradient Descent (GD) algorithm (Levinson et al., 1983). Both approaches are iterative algorithms for maximizing the likelihood estimate (MLE) of the data. For a batch learning technique, the data sequence is assumed to be finite. Each training iteration of BW or GD involves observing all subsequences in the presented block2 for training prior to updating HMM parameters. Successive iterations continues until some stopping criterion is achieved (e.g., likelihood drop on a validation set). Given a new 2A block of data is defined as a sequence of system call observations that has been segmented into overlapping subsequences according to a user-defined window size. 244 block of data, an HMM trained with BW or GD must be trained from start using all cumulative training data. As an alternative, incremental algorithm updates the HMM parameters after each subsequence, yet is allowed to perform several iterations over all subsequences within the block (refer to Figure AII.1). Some desirable characteristics for incremental learning include the ability to update HMM parameters from new training data, without requiring access to the previously-learned training data and without corrupting previously acquired knowledge (Polikar et al., 2001). In contrast, for an on-line learning technique, the data sequence is assumed to be infinite. Such techniques update HMM parameters after observing each subsequence, with no iterations. User-defined hyper-parameters remain constant or they are allowed to degrade monotonically over time. Given a new block, an HMM that performs on-line learning continues the training seamlessly. In practice when learning sequences with finite length, on-line learning may lead to poor performance. Several techniques have been proposed in literature for different real-world applications. Among these, on-line learning techniques are based on the current sequence of observations for optimizing the objective function (commonly the MLE), and updating the HMM parameters. As with batch learning, they can also be divided into EM-based (Mizuno et al., 2000) and gradient-based (Baldi and Chauvin, 1994; Ryden, 1998; Singer and Warmuth, 1996) learning techniques. These on-line techniques are extended in this work to incremental learning by allowing them to iterate over each block of data and by resetting the learning rates when a new block is presented. Another solution consists of learning an HMM for each new block of data then merging it with old ones using weight-averaging (Hoang and Hu, 2004). The objectives of this paper are to compare the techniques for on-line and incremental learning of HMMs parameters when applied for anomaly HIDS application using system calls sequences. A synthetic generator of normal data plus injection of anomaly has been 245 utilized to avoid various drawbacks encountered when experimenting with real data. The receiver operating characteristics (ROC) curves and the area under the ROC curve (AUC) are used as a measure of performance (Fawcett, 2006). Analytical comparison of convergence time and resources requirements are also provided and discussed. The rest of this paper is organized as follows. The next section discusses the importance of incremental update for anomaly detection. Section II.3 presents techniques for batch, on-line and incremental learning for HMM parameters. The experimental methodology in II.4 describes data generation, evaluation methods and performance metrics. Finally, simulation results are discussed in II.5. II.2 Incremental Learning in Anomaly Detection System An crucial step to design an ADS is to acquire sufficient amount of data for training. In practice however, it is very difficult to collect, analyze and label comprehensive data sets due to many reasons that range from technical to ethical. In addition, even in the simplest scenario where no change in the environment is assumed, characterizing the sufficient amount of data required for building an efficient ADS is not a trivial task. In practice, limited data is always provided for training, and the computer environment is always susceptible to dynamic changes. This work focuses on providing solutions to the limited data problem as described with the following practical scenarios. Given some amount of normal system call data for training, an ADS based on HMM could be trained, optimized and validated off-line to provide some acceptable performance in terms of false and true positive rates. However, during operations, monitoring a centralized server for instance, the system is susceptible to produce a higher rate of false alarms than expected and tolerated by the system administrator. This is largely due to the limited data that is the provided for training. The anomaly detector will have a limited view of the normal process behavior, and rare events will be mistakenly considered as anomalous. Accordingly, the HMM detector should be refined, i.e., trained on some 246 additional normal data when it becomes available, to better fit the normal behavior of process in consideration. As a part of the detection system, the system administrator plays an important role for providing such new data. When an alarm is raised the suspicious system call subsequences are logged and the system administrator starts an investigation into other evidence of an attack. If an intrusion attempt is detected the response team will act to limit the damage, and the forensic analysis team try to find the cause of the successful attack. Otherwise, it is considered as a false alarm and the logged subsequences (which are possibly rare events) are tagged as normal and collected for updating the HMM detector. One challenge is the efficient integration of this newly-acquired data into the ADS without corrupting the existing knowledge structure, and thereby degrading the performance. II.3 Techniques for Learning HMM Parameters A discrete-time finite-state HMM is a stochastic process determined by the two interrelated mechanisms. A latent Markov chain having comprising N states in the finite-state space S = {S1 , S2 , ..., SN }, and a set of observation discrete probability distributions bj (v), each one associated with a state (Ephraim and Merhav, 2002; Rabiner, 1989). Starting from an initial state Si , determined by the initial state probability distribution πi , at each discrete-time instant, the process transits from state Si to state Sj according to the transition probability distribution aij (1 ≤ i, j ≤ N ). The process then emits a symbol v according to the output probability distribution bj (v) of the current state Sj . The model is therefore parametrized by the set λ = (π, A, B), where vector π = {πi } is initial state probability distribution, matrix A = {aij } denotes the state transition probability distribution, and matrix B = {bj (k)} is the state output probability distribution. Both A, B, and π are row stochastic, which impose the following constraints: N j=1 aij = 1∀i, M k=1 bj (k) = 1 ∀j, and N i=1 πi = 1 (II.1) 247 aij , bj (k), and πi ∈ [0, 1], ∀ijk II.3.1 (II.2) Batch Learning The target in HMM parameters learning is to train the model λ to best fit the observed batch of data o1:T . The estimation of HMM parameters is frequently performed according to the maximum likelihood estimation (MLE) criterion3 . MLE consists of maximizing the log-likelihood T (λ) log Pr(o1:T | λ) (II.3) of the training data over HMM parameters space (Λ): λ∗ = argmaxλ∈Λ T (λ) (II.4) Unfortunately, since the log-likelihood depends on missing information (the latent states), there is no known analytical solution to the training problem. In practice, iterative optimization techniques (briefly described below) such as the Baum-Welch algorithm, a special case of the Expectation-Maximization (EM), or alternatively the standard numerical optimization methods such as gradient descent are usually used for this task. In either case, the optimization requires the evaluation of the log-likelihood value (II.3) at each iteration, and the estimation of the conditional state densities. That is, the smoothed a posteriori conditional state density: γt (i) Pr(qt = i | o1:T , λ) (II.5) and the smoothed a posteriori conditional joint state density: ξt (i, j) Pr(qt = i, qt+1 = j | o1:T , λ) 3 Other (II.6) criteria such as the maximum mutual information (MMI), and minimum discrimination information (MDI) could be also used for estimating HMM parameters. However, the widespread usage of the MLE for HMM is due to its attractive statistical properties – consistency and asymptotic normality – proved under quite general conditions. 248 The Forward-Backward (FB) (Rabiner, 1989) or the numerically more stable ForwardFiltering Backward-Smoothing (FFBS) (Ephraim and Merhav, 2002) algorithms are typically used for computing the log-likelihood value (II.3) and the smoothed state densities of Eqs. (II.5) and (II.6). The Baum-Welch (BW) algorithm (Baum et al., 1970) is an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) specialized for estimating HMM parameters. Instead of a direct maximization of the log-likelihood (II.3) BW optimizes the auxiliary Q-function: QT (λ, λ(k) ) = Pr(o1:T , q1:T | λ(k) ) log Pr(o1:T , q1:T | λ) (II.7) q∈S which is the expected value of the complete-data log-likelihood and hence easier to be optimized. This is done by alternating between the expectation step (E-step) and maximization step (M-step). The E-step uses the FB or FFBS algorithms to compute the state densities (Eqs. II.5 and II.6) which are then used, in the M-step to re-estimate the model parameters: (k+1) (k) πi = γ1 (i) (k+1) aij = (k+1) (m) bj T −1 (k) T = ξt (i, j) t=1 γt (i) t=1 T −1 (II.8) (k) t=1 γt (j)δot vm T t=1 γt (j) The Kronecker delta δij is equal to one if i = j and zero otherwise. Starting with an initial guess of the HMM parameters, λ0 , each iteration k, of the E- and M-step is guaranteed to increase the likelihood of the observations giving the new model until a convergence to a stationary point of the likelihood is reached (Baum et al., 1970). In contrast, standard numerical optimization methods work directly with the log-likelihood function (II.3) and its derivatives. Starting with an initial guess of HMM parameters λ0 , 249 the gradient descent (GD) updates the model at each iteration k using: λ(k+1) = λ(k) + ηk ∇λ T (λ(k) ) (II.9) where the learning rate ηk could be fixed, or adjusted at each iteration. One way of computing the gradient of the likelihood ∇λ T (λ(k) ) is by using the values of the conditional densities (Eqs. II.5 and II.6) obtained from the FB or FFBS algorithms: ∂T (λ(k) ) = ∂aij ∂T (λ(k) ) = ∂bj (m) T −1 t=1 T ξt (i, j) aij t=1 γt (j)δot vm bj (m) (II.10) (II.11) However, with the numerical optimization methods HMM parameters are not guaranteed to stay within their space limits. As described next, the parameters constraints (Eqs. II.1 and II.2) must therefore be imposed explicitly through a re-parametrization, to reduce the problem to unconstrained optimization. Since, at each iteration, both BW and GD algorithms rely on the fixed-interval smoothing algorithms (FB or FFBS) to compute the state conditional densities, they are therefore performing a batch learning approach. This is because these fixed-interval smoothing algorithms require an access to the end of sequence in order to compute the smoothed densities. Similarly, when learning from a block of multiple subsequences, each iteration of the BW and GD requires the averaged smoothed densities over all the subsequences in the block. Therefore, the block must have a finite number of subsequences, and all the subsequences are visited at each iteration. Consequently, when provided with a new block of subsequences the training process must be restarted using the accumulated (old and new) data to accommodate the new data. II.3.2 On-line and Incremental Learning 250 Figure AII.1 presents an illustration of the batch, on-line, and incremental learning approaches when provided with subsequent blocks each comprises R observation subsequences. When the first block (D1 ) is presented, all algorithms start with the same initial guess of HMM parameters (λ0 ). At each iteration k, batch algorithms update the model parameters using the averaged state densities (Eqs. II.5 and II.6) over all the subsequences in the block until stopping criteria are met, then the first operational model (λ1 ) is produced. On-line algorithms directly update the model parameters using the stated densities based on each subsequence and output λ1 upon reaching the last subsequence in the block. This constitutes one iteration of the incremental algorithms which then re-iterate until reaching the stopping criteria before producing λ1 . When the second block (D2 ) is presented, batch algorithms restart the training procedure, using all accumulated training data (D1 ∪ D2 ). The on-line algorithms however resumes training from the previous model (λ1 ) using only the current block (D2 ) without iterations, while the incremental algorithms will re-iterate on D2 until the stopping criteria are met. On-line learning techniques for HMM parameters can be broadly divided into EM-based (Mizuno et al., 2000) or gradient-based (Baldi and Chauvin, 1994; Ryden, 1998; Singer and Warmuth, 1996) optimization of the log-likelihood function. These techniques are essentially derived from their batch counterparts, however the key difference is that the optimization and the HMM update are based on the currently presented subsequence of observations without iterations. EM-based techniques employ an indirect maximization of log-likelihood, through the complete log-likelihood (II.7), in which the E-step is performed on each subsequence of observation and then the model parameters are updated (Mizuno et al., 2000). Gradient-based techniques however directly maximize the log-likelihood function (II.3) and then update HMM parameters after processing each subsequence of observation. Several gradient-based optimization algorithms have been proposed, such as the GD algorithm (Baldi and Chauvin, 1994), the exponentiated 251 Data Blocks D1 (1) s1 Batch On-line λ0 Incremental λ0 (1) s2 (1) sR λ0 λ1 λ2 λ1 λ2 λR λR λ(k) λ(k) λ1 D2 (2) s1 (2) s2 (2) sR λ1 λ1 λ0 D1 ∪ D2 D2 D2 λ(k) λ(k) λ2 λ2 λ2 Figure AII.1 Illustration of batch, on-line, and incremental learning approaches when learning from subsequent blocks (D1 , D2 , . . .) of R observation subsequences, provided at different time intervals gradient framework which also minimizes the model parameters divergence (Singer and Warmuth, 1996), and the recursive estimation technique (Ryden, 1998). The on-line learning technique proposed by Mizuno et al. (2000) is based on the BW algorithm, however applies a decayed accumulation of the state densities and a direct update of the model parameters after each subsequence of observations (Mizuno et al., 2000). Starting with an initial model λ0 , the conditional state densities are recursively computed after processing each subsequence (r) of observations of length T by: T −1 ξtr+1 (i, j) = (1 − ηr ) t=1 T t=1 γtr+1 (j)δot k = (1 − ηr ) T −1 t=1 T t=1 ξtr (i, j) + ηr T −1 ξtr+1 (i, j) t=1 T γtr (j)δot k + ηr t=1 γtr+1 (j)δot k (II.12) (II.13) 252 and the model parameters are then directly updated using (II.8). The learning rate ηk is proposed in polynomial form ηr = c( 1r )d for some positive constants c and d. Based on the GD (II.9) of the negative likelihood, the on-line algorithm for HMM parameters estimations introduced by Baldi and Chauvin (1994) employs a softmax parametrization to transform the constraint optimization into an unconstrained one. This is achieved by mapping the bounded space (a, b) to the unbounded space (u, v): euij evj (k) aij = u and bj (k) = v (z) j k e ik ze (II.14) The transformed parameters are then updated, after each subsequence of observations, as follows: r ur+1 ij = uij + η T ξtr+1 (i, j) − aij γtr+1 (i) (II.15) t=1 vjr+1 (k) = vjr (k) + η T γtr+1 (j)δot k − bj (k)γtr+1 (j) (II.16) t=1 The objective function proposed by Singer and Warmuth (1996) minimizes the divergence between the old and new model parameters penalized by the negative log-likelihood of each subsequence multiplied by a fixed positive learning rate (η > 0): λr+1 = arg min KL(λr+1 , λr ) − ηT (λr+1 ) λ (II.17) The Kullback-Leibler (KL) divergence (or relative entropy) is defined between two probability distributions Pλ(k) (o1:t ) and Pλ (o1:t ) by: KL(Pλr Pλ ) = Prr (o1:t ) log t λ Prλr (o1:t ) Prλ (o1:t ) (II.18) KL is always non-negative and attains its global minimum at zero for Prλ → Prλr . This optimization is based on the exponentiated gradient framework, therefore the parameters 253 constraints are respected implicitly. After processing each subsequence of observations, the model parameters are updated using the conditional state densities and the derivatives of the log-likelihood ((II.10) and (II.11)): ar+1 ij 1 r = a e Z1 ij br+1 j (k) = − T η γ r+1 (i) t=1 t ∂T (λr+1 ) ∂aij 1 r b (k)e Z2 j (II.19) − T η γ r+1 (j) t=1 t ∂T (λr+1 ) ∂bj (k) (II.20) where Z1 and Z2 are normalization factors. The idea proposed by Ryden (1998) is to consider successive subsequences of T observations taken from a data stream, or = {o(r−1)T +1 , . . . , orT }, as independent of each other. This assumption reduces the extraction of information from all the previous observations to a data-segment of length T . In fact, this has been considered implicitly with all the above techniques that process multiple subsequences of T observations. To enforce parameters stochastic constraints (II.1), a projection (PG ) on a simplex is suggested, which updates all but one of the parameters in each row of the matrices. At each iteration, the recursion is given (without matrix inversion) λr+1 = PG (λr + ηr h(or+1 | λr )) (II.21) where h(or+1 | λr ) = ∇λr log Pr(or+1 | λr ) and ηr = η0 r−ρ for some positive constant η0 and ρ ∈ ( 12 , 1]. It was shown to converge almost surely to the set of Kuhn–Tucker points for minimizing the Kullback-Leibler divergence KLk (λr λ(true) ) defined in (II.18). KL attains its global minimum at λr → λ(true) , provided that the HMM is identifiable, therefore the subsequence must contains at least two symbols (T ≥ 2). On-line learning algorithms are therefore proposed for situations where a long (ideally infinite) sequences of observation are provided for training. In this paper however, these techniques are applied in an incremental fashion, where algorithms are allowed to con- 254 verge over several iterations of each new training block. However, when a new block is provided all learning rates are reset. II.4 Experimental Methodology The University of New Mexico (UNM) data sets are commonly used for benchmarking ADS based on system calls sequences. Normal data are collected from a monitored process in a secured environment, while testing data are the collection of the system calls when this process is under attack (Warrender et al., 1999). Since it is very difficult to isolate the manifestation of an attack at the system call level, the UNM test sets are not labeled. Therefore, in related work, intrusive sequences are usually labeled in comparison with the normal subsequences, (e.g., using STIDE). This labeling process leads to a biased evaluation of techniques, which depends on both training data size and detector window size. The need to overcome issues encountered when using real-world data for anomaly-based HIDS (incomplete data for training, and labeled data) has lead to the implementation of a synthetic data generation platform for proof-of-concept simulations. It is intended to provide normal data for training and labeled data (normal and anomalous) for testing. This is done by simulating different processes with various complexities then injecting anomalies in known locations. Inspired by the work of Tan and Maxion (Maxion and Tan, 2000; Tan and Maxion, 2003), the data generator is based on the Conditional Relative Entropy (CRE) of a source. It is defined as the conditional entropy divided by the maximum entropy (MaxEnt) of that source, which gives an irregularity index to the generated data. For two random variables x and y the CRE can be computed by: CRE = − x p(x) | x) log p(y | x) M axEnt y p(y (II.22) 255 where for an alphabet of size Σ symbols, M axEnt = −Σ log(1/Σ) is the entropy of a theoretical source in which all symbols are equiprobale. It normalizes the conditional entropy values between CRE = 0 (perfect regularity) and CRE = 1 (complete irregularity or random). In a subsequence of system calls, the conditional probability, p(y | x), represents the probability of the next system call given the current one. It can be represented as the columns and rows (respectively) of a Markov Model with the transition matrix M = {aij }, where aij = p(St+1 = j | St = i) is the transition probability from state i at time t to state j at time t + 1. Accordingly, for a specific alphabet size Σ and CRE value, a Markov chain is first constructed, then used as a generative model for normal data. This Markov chain is also used for labeling injected anomalies as described below. Let an anomalous event be defined as a surprising event which does not belong to the process normal pattern. This type of event may be a foreign-symbol anomaly subsequence that contains symbols not included in the process normal alphabet, a foreign n-gram anomaly subsequence that contains n-grams not present in the process normal data, or a rare n-gram anomaly subsequence that contains n-grams that are infrequent in the process normal data and occurs in burst during the test4 . Generating training data consists of constructing Markov transition matrices for an alphabet of size Σ symbols with the desired irregularity index (CRE) for the normal sequences. The normal data sequence with the desired length is then produced with the Markov chain, and segmented using a sliding window (shift one) of a fixed size, DW . To produce the anomalous data, a random sequence (CRE = 1) is generated, using the same alphabet size Σ, and segmented into subsequences of a desired length using a sliding window with a fixed size of AS. Then, the original generative Markov chain is used to compute the likelihood of each subsequence. If the likelihood is lower than a threshold it is labeled as anomaly. The threshold is set to (min(aij ))AS−1 , ∀i,j , the minimal value in the Markov transition matrix to the power (AS − 1), which is the number of symbol transitions in the subsequence of size AS. This ensures that the anomalous subsequences 4 This is in contrast with other work which consider rare event as anomalies. Rare events are normal, however they may be suspicious if they occurs in high frequency over a short period of time. 256 of size AS are not associated with the process normal behavior, and hence foreign n-gram anomalies are collected. The trivial case of foreign-symbol anomaly is disregarded since it is easy to be detected. Rare n-gram anomalies are not considered since we seek to investigate the performance at the detection level, and such kind of anomalies are accounted for at a higher level by computing the frequency of rare events over a local region. Finally, to create the testing data another normal sequence is generated, segmented and labeled as normal. The collected anomalies of the same length are then injected into those subsequences at random according to a mixing ratio. In the presented experiments, a normal data sequence of length 1, 600 symbols is produced using a Markov model with an irregularity index CRE = 0.4, and segmented using a sliding window of a fixed size, DW = 8 (Khreich et al., 2009b). The data are then divided into 10 blocks, Di , for i = 1, . . . 10, each comprises R = 20 subsequences. A test set of 400 subsequences each of size AS = 8 is prepared as described above. It comprises 75% of normal and 25% of anomalous data. Each block of the normal data Di is divided into blocks of equal size – one is used for training (Ditrain ) and the other for validation (Divalid ), which is used to reduce the overfitting effects (hold-out validation). For batch algorithms (BW-batch and GD-batch), successive blocks for training are accumulated and training is restarted each time a new block is presented. That is, the HMMs are first trained and validated on the first block of data (D1train ,D1valid ). The training is then restarted, by re-initializing the models at random, and performing the batch algorithms using the accumulated blocks of data (D1train ∪ D2train ,D1valid ∪ D2valid ), and so on. In contrast, the incremental algorithms resume training by starting with the corresponding models produced from the previous block and by using the presented block only (Ditrain , Divalid ). The stopping criterion for the batch and the incremental algorithms is set to a maximum of 100 iterations or to when the log-likelihood remains constant for at least 10 iterations for the validation data. On-line learning algorithms are not allowed to iterate on successive blocks of data. In this case, all learning rates are optimized and reset with the presentation of each new block. In all cases, the algorithms are applied to 257 an ergodic (fully connected) HMM with eight hidden states (N = 8), and are initialized with the same random model. For each algorithm, the model that produced the highest log-likelihood value on the validation data is selected for testing. The log-likelihood of the test data (normal + anomalous) are then evaluated using the forward algorithm. By sorting the test subsequences decreasing by these log-likelihood values, and updating the true positive rate (tpr) and the false positive rate (fpr) while moving down the list, results in a Receiver Operating Characteristics (ROC) curves (Fawcett, 2006). The ROC curve depicts the trade-off between the tpr – the number of anomalous subsequences correctly detected over the total number of anomalous subsequences – and the fpr – the number of normal subsequences detected as anomalous over the total number of normal subsequences in the test set. The Area Under the ROC Curve (AUC) is used as a measure of performance. AU C = 1 means a perfect separation between normal and anomalous (tpr = 100%, f pr = 0%), while AU C = 0.5 means a random classification. This procedure is replicated ten times with different training, validation and testing sets, and the resulting AUCs are averaged and presented along with their standard deviations (error bars). Although not show in this paper, various experiments have been conducted using different values of N , CRE, DW and Σ. These experiments produced similar results and hence the below discussion hold. 258 II.5 Results The EM-based algorithm (Mizuno (Mizuno et al., 2000)), and the gradient-based ones (Baldi (Baldi and Chauvin, 1994), Singer (Singer and Warmuth, 1996), and Ryden (Ryden, 1998)) are first applied in an on-line learning approach as originally proposed by the authors. Figure AII.2 presents the average AUC achieved for on-line techniques for each block of training data. The results of the batch BW and GD algorithms are presented for reference. The performances of the on-line algorithms tend to be equivalent with the increase of data. This is also confirmed by the statistical tests, conducted at the final block of data as shown in Figure AII.4. However, at the beginning, when few blocks of data 1 0.95 0.9 AUC 0.85 0.8 0.75 BW(batch) GD(batch) Mizuno Baldi Singer Ryden 0.7 0.65 0 2 4 6 Number of Blocks 8 10 12 Figure AII.2 Average AUC of on-line learning techniques vs the amount of training data that are used to train an ergodic HMM with N = 8 259 are presented, the EM-based algorithm performs better than gradient-based ones. Due to one view of the data, on-line algorithms require a large amount of data to achieve good performance. Theoretically, an infinite amount of data is assumed when trying to prove the convergence of such on-line algorithms. Among the presented techniques, only Ryden provided convergence analysis and proof of consistency (Ryden, 1998). The average performances of the batch algorithms are equivalent and increase with the number of blocks. This is expected, since the batch algorithms are allowed to iterate on the accumulated data, they have therefore a global (backward) view. Figure AII.3 presents the averaged AUC achieved by incremental techniques for each block of training data. The average AUC of the incremental algorithms are lower than the performance of batch and higher than that of the on-line ones. In fact, these algorithms have a local view of the presented data. Although they are allowed to learn the new data through several iterations, there is a loss of information form the previously learned data. This is usually controlled with the decaying learning rate, which assigns a weight to the past information with reference to the future data contribution. Among the incremental algorithms, statistical testing shows that Baldi’s algorithm (Baldi and Chauvin, 1994) achieved slightly superior performance than the others as shown in Figure AII.4. This is possibly related to the short length of training subsequences (DW = 8). The other incremental algorithms may require larger subsequences to accumulate enough information from each one before updating the model parameters. For instance, Ryden (1998) suggests using a minimum subsequence length of DW = 20. However, the stochastic nature of the incremental algorithms allows them to escape local minima. This important characteristic can be shown when both batch and incremental algorithms are trained on the same data. For instance, this is illustrated when learning the first block in Figure AII.3. It can be seen that incremental algorithms are capable of producing superior results to their batch counterparts. Therefore, they may be used as an alternative for batch training, especially that they converge faster than the batch 260 1 0.95 AUC 0.9 0.85 0.8 BW(batch) GD(batch) Mizuno Baldi Singer Ryden BW(ib) Hoang 0.75 0.7 0 2 4 6 Number of Blocks 8 10 12 Figure AII.3 Average AUC of incremental learning techniques vs the amount of training data that are used to train an ergodic HMM with N = 8 algorithms since they update the model after each subsequence and hence exploit new information faster. Figure AII.3 also presents the results of the BW incremental batch, BW(ib) for reference. That is, when learning D1 , BW is initialized with a random HMM and the algorithm is applied until it converges. For the subsequent block D2 , BW is then initialized with λ1 . The performance of BW(ib) indicates data corruption since it is prone to get stuck in a local minimum from the previous block. Interestingly, this straightforward EM-based solutions produced similar results to Mizuno (Mizuno et al., 2000). This indicates that the learning rates employed by the latter during the experiments may be better optimized to escape local minima. 261 In addition, Figure AII.3 includes another incremental approach based on learning an HMM for each new block of data then merging it with old ones using weight-averaging (Hoang and Hu, 2004). This learn and merge approach performed statistically worse than most of the incremental techniques as shown in Figure AII.4. This is due to averaging ergodic HMMs since the states order may be mixed up between the HMM trained on the first block and that trained on the second block of data. BW GD Mizuno(OL) Baldi(OL) Singer(OL) Ryden(OL) Mizuno(IL) Baldi(IL) Singer(IL) Ryden(IL) BW(ib) Hoang 0 20 40 60 80 100 120 Figure AII.4 Kruskall-Wallis (one-way analysis of variance) statistical test for batch, on-line (OL) and incremental learning (IL) algorithms after processing the final block D10 of data Table AII.1 compares the time and memory complexity of EM-based and gradient-based algorithms each processing a subsequence of T observations and then updating the model parameters. This represents the core operations required by all learning approaches. Time complexity represents the worst-case number of operations required for one iteration of EM-based and gradient-based algorithms. For both algorithms one iteration involves computing one forward and one backward pass with o1:T . Memory complexity is the worst-case number of 32 bit words needed by the algorithms to store the required temporary variables in RAM. Since both algorithms rely on the FB or FFBS to compute the state conditional densities (Eqs. II.5 and II.6), therefore they require about the same computational time complexity, O(N 2 T ), and the same memory requirements, 262 O(N T ). The additional computation time required by the gradient-based algorithms while updating HMM parameters stems from the re-parametrization (Section 3). Table AII.1 Worst-case time and memory complexity analysis for EM- and gradient-based algorithms each processing a subsequence of T observations, with an N state HMM states and an alphabet of size Σ = M symbols. This represents the core operations required by batch, on-line and incremental algorithms, the differences stem from iterating until convergence Algo. EMbased Estimation State Prob. (Eqs. II.5 & II.6) Transition Prob. (A) Emission Prob. (B) Total Gradbased State Prob. (Eqs. II.5 & II.6) Transition Prob. (A) Emission Prob. (B) Total # Multiplications 6N 2 T + 3N T − 6N 2 − N N2 NM 6N 2 T + 3N T − 5N 2 +N M − N 6N 2 T + 3N T − 6N 2 − N N2 NM 6N 2 T + 3N T − 5N 2 +N M − N Time # Divisions N 2 T + 3N T − N 2 − N N2 NM Memory # Exponentiation N T + N 2 + 2N N2 NM N 2 T + 3N T + M N − N N T + 2N 2 + N M + 2N N 2 T + 3N T − N 2 − N N2 MN N2 NM N T + N 2 + 2N N2 NM N 2 T + 3N T + M N − N N2 + NM N T + 2N 2 + N M + 2N For a block of R subsequences, batch learning involves R times the computations of the state densities whereas only one update of the parameters is performed at each iteration. On the other hand, the on-line algorithms perform R times the computation of state densities and R times the update of the model, without iterating however. On-line algorithms are therefore the fastest in learning HMM parameters. At each iteration, the incremental algorithms require the same computational time need by the on-line ones. Accordingly, the incremental techniques require more computational time per iteration than the batch counterparts, however fewer iterations are required to converge. The memory complexity of the on-line algorithms is constant in time, O(N T ), and also for the incremental ones though it is R times greater. However, for batch algorithms this value scales linearly with the number of the accumulated subsequences. II.6 Conclusion In this paper, the performance of several techniques is compared for on-line incremental learning of HMM parameters as new sequences of training data becomes available. These techniques are considered for updating host-based intrusion detection systems from sys- 263 tem calls to the OS kernel, but apply to other HMM-based detection systems that face the challenges associated with limited training data and environmental changes. Results have shown that techniques for incremental learning of HMM parameters can provide a higher level of discrimination than those for on-line learning, yet require significantly fewer resources than with batch training. Incremental learning techniques may provide a promising solution for adaptive intrusion detection systems. Future work includes in-depth investigation of the learning rates effects on performances, which are sensitive parameters especially for gradient-based techniques. Comparing these techniques with models combinations at the training level or responses fusion at the decision level is also an interesting direction to explore. 264 APPENDIX III INCREMENTAL LEARNING STRATEGY FOR UPDATING HMM PARAMETERS This Appendix presents an effective strategy to overcome the challenges faced when HMM parameters must be incrementally updated from a newly-acquired data. Incremental learning refers to the ability of an algorithm to learn from new data that may become available after a classifier (or a model) has already been generated from a previously available data set. Incremental learning produces a sequence of hypotheses, for a sequence of training data sets, where the current hypothesis describes all data that have been seen thus far, but depends only on previous hypotheses and the current training data (Grossberg, 1988; Polikar et al., 2001). Incremental learning leads to a reduced storage requirement since there is no need for storing the data from previous training phases, after adapting HMM parameters. Furthermore, since training is performed only on the newly-acquired data, and not on all accumulated data, incremental learning would also lower the time and memory complexity required by the algorithm employed to train an HMM from new data. However, incremental learning is typically faced with the stability-plasticity dilemma, which refers the problem of learning new information incrementally, yet overcoming the problem of catastrophic forgetting (Grossberg, 1988; Polikar et al., 2001). To sustain a high level of of generalization during future operations, HMM parameters should therefore be updated from new data without corrupting previously-acquired knowledge and without being subject to catastrophic forgetting. When new training data become available, standard batch techniques for estimating HMM parameters must restart the training procedure using all cumulative training data, as described in Chapter 2. Alternatively, the on-line learning techniques proposed in literature and surveyed in Chapter 2 assume infinite data sequence, and update HMM parameters continuously upon observing each 266 new sub-sequence or new symbol of observation. They typically employ some (fixed or decreasing) internal learning rate or decay function between the successive updates of HMM parameters, to help stabilizing the algorithm and adapt to the newly acquired information. In practice, both batch and on-line learning techniques lead to knowledge corruption when performed incrementally to update HMM parameters. Figure AIII.1 illustrates the challenges caused by the stability-plasticity dilemma, when a batch or on-line learning algorithm is performed incrementally to update one HMM parameter. Assume that the training block D2 , which contains one or multiple subsequences, becomes available after an HMM(λ1 ) has been previously trained on a block D1 and deployed for operations. An incremental algorithm that optimizes, for instance, the log-likelihood function requires re-estimating the HMM parameters over several iterations until this function is maximized for D2 . The dotted curve in Figure AIII.1 represents the log-likelihood function of an HMM(λ1 ) that has previously learned D1 , while the plain curve represents the log-likelihood function associated with HMM(λ2 ) that has incrementally learned D2 over the space of one model parameter (λ). Since λ1 is the only information at hand from previous data D1 , the incremental training process starts from HMM(λ1 ). The optimization of HMM parameters depends on the log-likelihood function of HMM(λ2 ) with respect to that of HMM(λ1 ). If the log-likelihood function of HMM(λ2 ) has some local maxima in the proximity of that of HMM(λ1 ), the optimization of λ2 may remain trapped in these local maxima and would not be able to accommodate the newly-acquired information from D2 . In this case, the model stability is privileged and its ability to learn new information becomes limited. For instance, if λ1 was selected according to point (a) in Figure AIII.1, the optimization will probably lead to point (d), providing the same model parameters (λ2 ≈ λ1 ). When HMM parameters remain trapped in the proximity of the same local maxima and old information is mostly conserved, the performance of HMM will decline over time. On the other hand, if the HMM parameters (λ2 ) were able to escape the local maximum found by λ1 , λ2 will be completely adapted to D2 , thereby 267 corrupting the previously-acquired knowledge, and compromising HMM performance. In this case, the plasticity of the model is privileged and its adaptability leads to a catastrophic forgetting. For instance, If λ1 was selected as point (b), the optimization will lead to a better (g) or worse (e) solution on the log-likelihood function associated with D2 alone. The log-likelihood function depends on other factors than the starting point (λ1 ), such as the size and regularity of newly-acquired data D2 with respect to previously-learned data D1 , as well as on the learning algorithm. Given the same starting point λ1 , and assuming that a batch or on-line algorithm is employed to incrementally update HMM parameters from the same block of data D2 . In general, batch algorithms tend to privilege stability, while on-line algorithms tend to privilege plasticity. In fact, starting from λ1 , one iteration of a batch algorithm requires accumulating the sufficient statistic – the Stability λ2 remains trapped in the proximity of previous local maxima, and hence unable to accommodate new information provided by D2 Log-Likelihood conditional state and joint state densities required to update HMM parameters, over (h) (g) (a) (d) (b) (e) (c) (f ) Plasticity λ2 adapts completely to the new information provided by D2 , and hence corrupts the previously-acquired knowledge from D1 HMM (λ1 ) HMM (λ2 ) Optimization Space of one Model Parameter (λ) Figure AIII.1 An illustration of the degeneration that may occur with batch or on-line estimation of HMM parameters, when learning is performed incrementally on successive blocks data, each comprising unrepresentative amount of data providing incomplete view of phenomena 268 the entire sequence or sub-sequences in D2 , prior to updating λ1 . On the other hand, given the initial model parameters λ1 , a typical iteration of on-line algorithms involves a sequence of updates to λ1 based on each observed symbol or each observed sub-sequence in D2 , until the last symbol or sub-sequence is reached. This added stochasticity induced by the early updates of HMM parameters according to on-line algorithms force the HMM (λ1 ) to escape previous local maxima, and to achieve a faster and complete adaptation to D2 . However, the previously-acquired knowledge of the model may be compromised with the first few symbols or sub-sequences of D2 , leading thereby to a catastrophic forgetting. In our previous work, described in Appendix II, some representative on-line learning techniques for HMM parameters are extended to incremental learning, by allowing them to iterate over each block of data and by resetting their internal learning rates when a new block is presented. The learning rates are employed to integrate the information from each new symbol or sub-sequences of observation with the previously-learned model. Although they have been shown to improve accuracy over time, optimizing the internal learning rates employed within these algorithms to trade off plasticity against stability is a challenging task. For one, information form all previously learned data contained within the current model should be balanced with new information contained in a symbol or subsequence of observation. Furthermore, model selection and stopping criteria become very difficult, because these on-line algorithms do not guarantee a monotonic improvement of the log-likelihood value at each update of parameters. To alleviate these issues, the incremental learning strategy proposed in this appendix consists of employing a (global) learning rate at the iteration level. This aims at delaying the integration of newly-acquired information with the previously-learned model until at least on iteration over the new block of data is performed by the new model. In addition, the previously-learned model is kept in memory and weight-balanced with the new model over each iteration. 269 Assume that the HMM parameters (λDn ) obtained from the nth data block Dn must be updated to accommodate the newly-acquired data from Dn+1 . The learning process of λDn+1 starts from λDn . At each iteration (iter), the parameters of the new HMM (λDn+1 ) are weight-balanced with those of previously-trained HMM (λDn ) and the new (iter) model parameters, which are iteratively updated from Dn+1 (λDn+1 ), according to a learning rate η: (iter) λDn+1 ← ηλDn + (1 − η)λDn+1 (III.1) Several advantages are provided by retaining the previous model λDn in memory during the current learning stage, and employing the learning rate η to reintegrate pre-existing knowledge from λDn and the newly-acquired information from successive iterations of (iter) λDn+1 over Dn+1 . For one, the previously-acquired knowledge contained in λDn is no longer susceptible to be compromised by the insufficient information provided by the first symbol or sub-sequence of observation from Dn+1 . The internal learning rates of on-line algorithms can be now optimize based on new data to stabilize the algorithm and ensure its converge smooth convergence, without worrying about corrupting λDn . In addition, monitoring the improvement of log-likelihood values of the new data given the updated model parameters at each iteration becomes more tractable, which facilitates both stopping criteria of algorithms and selection of models for operations. This incremental learning strategy is general in that it can be applied to any batch of online learning algorithm. The iterative procedure provided by Eq. III.1 does not specify the (iter) algorithm for local learning from the new block of data Dn+1 . Therefore, updating λDn+1 is not restricted to on-line algorithms, other techniques such as batch learning algorithms for estimating HMM parameters can be also applied. Algorithm selection depends on the size and regularity of the data. When limited amount of data is provided for training within each new block, batch algorithms provide easier solution, as no internal learning rate is employed. However when abundant yet unrepresentative data is provided on-line algorithms may be employed due to their reduced storage requirements. However, an 270 effort is still required for optimizing the internal learning rates of these on-line algorithms to ensure convergence. In this case, the EFFBS algorithm proposed in Appendix I provides an alternative efficient solution, which requires no learning rate optimization. In general, employing a fixed user-defined learning rate along with a validation strategy would maintaining the level of performance in static or slowly changing environments. As described in Section 2.5, various validation strategies may be employed, to reduce the overfitting effects and improve the generalization performance during operations. For instance, such as hold-out or cross-validation procedures, or when applicable accumulating and updating representative validation data set over time. However, this requires investigating some selection criteria for maintaining the most informative sequences of observations and discarding less relevant one. Other strategies for an automatic update of the learning rate based upon the magnitude of the performance change, during the incremental training and validation can be also applied (Kuncheva and Plumpton, 2008). A version of this incremental learning strategy called incremental BW (IBW), has been applied in the proof-of-concept simulations conducted in Section 4.4. IBW employs the BW algorithm to update HMM parameters at each iteration, and uses a fixed learning rate η optimized during the first learning stage, with a 10-fold cross validation procedure for model selection. APPENDIX IV BOOLEAN FUNCTIONS AND ADDITIONAL RESULTS IV.1 Boolean Functions Figure AIV.1 presents all the distinct Boolean functions on two variables. Table (a) shows the four Boolean operations that depend on one variable (the inputs and their negations) along with the two constant operations (always positive and always negatives). Table (b) represents the ten operations that take on two variables. The first four are obtained from conjunction with some subset of its inputs negated. Similarly for the next four which are obtained from disjunction however. The last two (the XOR and its negation, EQV) are obtained with a “checkerboard” truth table. In general, for two inputs variables (n = 2) n with two possible binary outputs, there are 22 = 16 distinct Boolean operations, whereas ten are effectively used for combining. Ca Cb ¬Ca ¬Cb One Zero 0 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 0 De Morgan’s laws: ¬ = N OT ∧ = AN D ¬(Ca ∧ Cb ) = ¬Ca ∨ ¬Cb ¬(Ca ∨ Cb ) = ¬Ca ∧ ¬Cb ∨ ⊕ IM P EQV Constant operations Depend on one variable (a) AN D Ca Cb Symbols: Ca ∧ Cb N AN D OR ¬Ca ∧ Cb Ca ∧ ¬Cb ¬(Ca ∧ Cb ) Ca ∨ Cb Ca IM P Cb Cb IM P Ca ¬Ca ∨ Cb Ca ∨ ¬Cb N OR = = = = OR XOR Implication Equivalence XOR ¬(Ca ∨ Cb ) Ca ⊕ Cb EQV ¬(Ca ⊕ Cb ) 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 1 0 0 1 Called minterms: they represent the four miminum classes Called maxterms: they represent the four maximum classes (b) Operations with a ”checkerboard” truth table Figure AIV.1 All distinct Boolean operations on two variables 272 IV.2 Additional Results Figures AIV.2 and AIV.3 present the average performance in terms of the partial area under the convex hull for the range of f pr = [0, 0.1] (AU CH0.1 ), and the tpr at a fixed f pr = 0.1, respectively. These additional results are provided for a μ-HMM with three HMMs, each one trained with a different number of states (N = 4, 8, 12), and using the synthetically generated data with Σ = 8 and CRE = 0.3. These results complement those shown in Figure 3.8 (Section 3.6.2). 1 0.9 0.9 0.8 0.8 0.7 0.7 0.1 1 AUCH AUCH 0.1 273 0.6 0.6 0.5 STIDE MRROC BC 0.5 STIDE MRROC IBC 0.4 BC 0.4 IBC AND AND OR OR BCALL 50 100 150 200 250 300 350 number of seuqences in training blocks 400 IBCALL 450 50 100 (a) BC (DW = 2) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 AUCH AUCH 450 400 450 400 450 0.1 0.9 0.1 1 0.5 0.4 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 50 100 150 200 250 300 350 number of sequences in training blocks 400 0 450 50 100 (c) BC (DW = 4) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.1 0.9 0.5 0.4 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 50 100 150 200 250 300 350 number of sequences in training blocks (e) BC (DW = 6) 150 200 250 300 350 number of sequences in training blocks (d) IBC (DW = 4) AUCH 0.1 400 (b) IBC (DW = 2) 1 0 AUCH 150 200 250 300 350 number of sequences in training blocks 400 450 0 50 100 150 200 250 300 350 number of sequences in training blocks (f) IBC (DW = 6) Figure AIV.2 Results for synthetically generated data with Σ = 8 and CRE = 0.3. Average AU CH0.1 values obtained on the test sets as a function of the number of training blocks (50 to 450 sequences) for a 3-HMM system, each trained with a different state (N = 4, 8, 12), and combined with the MRROC, BC and IBC techniques. Results are compared for various detector windows sizes (DW = 2, 4, 6). Error bars are standard deviations over ten replications 1 1 0.95 0.95 0.9 0.9 0.85 0.85 tpr at fpr=0.1 tpr at fpr=0.1 274 0.8 0.75 STIDE MRROC BC 0.7 0.8 0.75 STIDE MRROC IBC 0.7 AND AND BCOR 0.65 IBCOR 0.65 BC IBC ALL 50 100 150 200 250 300 350 number of sequences in training blocks 400 ALL 450 50 100 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.5 0.3 0.3 0.2 0.2 100 150 200 250 300 350 number of sequences in training blocks 400 0.1 450 50 100 (c) BC 0.9 0.9 0.8 0.8 450 400 450 0.7 0.7 0.6 tpr at fpr=0.1 tpr at fpr=0.1 150 200 250 300 350 number of sequences in training blocks (d) IBC 1 0.6 0.5 0.4 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 400 0.5 0.4 50 450 0.6 0.4 0.1 400 (b) IBC tpr at fpr=0.1 tpr at fpr=0.1 (a) BC 150 200 250 300 350 number of sequences in training blocks 50 100 150 200 250 300 350 number of sequences in training blocks (e) BC 400 450 0 50 100 150 200 250 300 350 number of sequences in training blocks (f) IBC Figure AIV.3 Results for synthetically generated data with Σ = 8 and CRE = 0.3. Average tpr values at f pr = 0.1 obtained on the test sets as a function of the number of training blocks (50 to 450 sequences) for a 3-HMM system, each trained with a different state (N = 4, 8, 12), and combined with the MRROC, BC and IBC techniques. Results are compared for various detector windows sizes (DW = 2, 4, 6). Error bars are standard deviations over ten replications APPENDIX V COMBINING HIDDEN MARKOV MODELS FOR IMPROVED ANOMALY DETECTION ∗ Abstract In host-based intrusion detection systems (HIDS), anomaly detection involves monitoring for significant deviations from normal system behavior. Hidden Markov Models (HMMs) have been shown to provide a high level performance for detecting anomalies in sequences of system calls to the operating system kernel. Although the number of hidden states is a critical parameter for HMM performance, it is often chosen heuristically or empirically, by selecting the single value that provides the best performance on training data. However, this single best HMM does not typically provide a high level of performance over the entire detection space. This paper presents a multiple-HMMs approach, where each HMM is trained using a different number of hidden states, and where HMM responses are combined in the Receiver Operating Characteristics (ROC) space according to the Maximum Realizable ROC (MRROC) technique. The performance of this approach is compared favorably to that of a single best HMM and to a traditional sequence matching technique called STIDE, using different synthetic HIDS data sets. Results indicate that this approach provides a higher level of performance over a wide range of training set sizes with various alphabet sizes and irregularity indices, and different anomaly sizes, without a significant computational and storage overhead. ∗ This chapter is published in International Conference on Communications (ICC09) 276 V.1 Introduction Intrusion Detection Systems (IDS) is used to identify, assess, and report unauthorized computer or network activities. Host-based IDSs (HIDS) are designed to monitor the host system activities and state, while network-based IDSs monitor network traffic for multiple hosts. In either case, IDSs have been designed to perform misuse detection (looking for events that match patterns corresponding to known attacks) and anomaly detection (detecting significant deviations from normal system behavior). In a HIDS for anomaly detection, operating system events are usually monitored. Since system calls are the gateway between user and kernel mode, most traditional host-based anomaly detection systems monitor deviation in system call sequences. In general, anomaly detection systems seek to detect novel attacks, yet will typically generate false alarms mainly due to incomplete data for training, poor modeling, and difficulty in obtaining representative labeled data for validation. In practice, it is very difficult to acquire (collect and label) comprehensive data sets to design a HIDS for anomaly detection. Forrest et al. (1996) confirmed that short sequences of system calls are consistent with normal operation, and unusual burst will occur during an attack. Their anomaly detection system, called Sequence Time-Delay Embedding (STIDE), is based on segmenting and enumerating the training data into fixed-length contiguous sequences, using a fixed-size sliding window, shifted by one symbol. During operations, the sliding window scheme is used to scan the data for anomalies – sequences that are not found in the normal data. Various anomaly detection techniques have been applied to learn the normal process behavior through system call sequences (Warrender et al., 1999). Among these, techniques based on discrete Hidden Markov Models (HMMs) (Rabiner, 1989) have been shown to produce slightly superior results, at the expense of training resource requirements. Estimating the parameters of an HMM requires the choice of the number of hidden states. In the literature on HMMs applied to anomaly detection (Gao et al., 2002; Hoang and 277 Hu, 2004; Hoang et al., 2003; Wang et al., 2004), the impact of the number of states on performance is often overlooked. It is typically chosen heuristically or empirically by selecting the number of states that provides the best performance on training data. Given incomplete training data, types of anomalies, and detector window (DW ) sizes, a single “best” HMM does not provide a high level of performance over the entire detection space. For a given data set, a multiple-HMMs (μ-HMMs) approach, where each model is trained using a different number of hidden states, allows to outperform any single best HMM. The responses of these HMMs are combined in the receiver operating characteristics (ROC) space, using the Maximum Realizable ROC (MRROC). This approach allows selecting operational HMMs independently from prior and class distributions, and cost contexts. Although the MRROC fusion has been applied to combine responses of different classifiers in fields such as medical diagnostics and image segmentation (Scott et al., 1998), it has never appeared in HIDS literature. In particular, it has never been proposed in HMM application to anomaly detection due to the overlooked impact on performance of the number of HMM states, and to its expensive training resource requirements, which are emphasized in this work. However, other fusion techniques such as weighted voting has been applied to combine HMMs trained on different features (Choy and Cho, 2000). The objective of this paper consists in comparing the performance of μ-HMMs combined through the MRROC fusion, to that of a single best HMM and STIDE for detecting anomalies in system calls sequences. The impact on performance of using different training set sizes, anomaly sizes, and complexities of the monitored processes is assessed using ROC analysis (Fawcett, 2006). To this end, a synthetic simulator for normal data generation and anomaly injection has been constructed to overcome the issues encountered when experimenting with real data. The rest of this paper is organized as follows. The next section describes the application of HMMs in anomaly-based HIDS. In Section V.3, the proposed μ-HMMs approach is 278 presented. Then, the experimental methodology in Section V.4 describes data generation, evaluation methods and performance metrics. Finally, simulation results are presented and discussed in Section V.5. V.2 Anomaly Detection with HMM A discrete-time finite-state HMM is a stochastic process determined by the two interrelated mechanisms – a latent Markov chain having a finite number of states, and a set of observation probability distributions, each one associated with a state. At each discrete time instant, the process is assumed to be in a state, and an observation is generated by the probability distribution corresponding to the current state. HMMs are usually trained using the Baum-Welch algorithm (Baum et al., 1970) – a specialized expectation maximization technique to estimate the parameters of the model from the training data. Theoretical and empirical results have shown that, given an adequate number of states and a sufficiently rich set of observations, HMMs are capable of representing probability distributions corresponding to complex real-world phenomena in terms of simple and compact models. Given the trained model, the Forward algorithm (Baum et al., 1970) can be used to evaluate the likelihood of the test sequence. For further details regarding HMM the reader is referred to the extensive literature e.g., Rabiner (1989). Estimating the parameters of a HMM requires the specification of the number of hidden states (N ). Warrender et al. (1999) present the first application to anomaly-based HIDS with the University of New Mexico (UNM) data sets1 . The authors trained an ergodic HMM on the normal data sequence for each process of the UNM data set. The number of states was selected heuristically. It is set roughly equal to the alphabet size – the number of unique system call symbols used by the process. Each HMM is then tested on the anomalous sequences, looking for unusual state transitions and/or symbol outputs according to a predefined threshold. Compared to other techniques, HMM produced slightly superior results in terms of average detection and false alarm rate, at the expenses 1 http://www.cs.unm.edu/~immsec/systemcalls.htm 279 of expensive training resource requirements. Indeed, the time complexity of the BaumWelch algorithm per iteration scales linearly with the sequence length and quadratically with the number of states. In addition, its memory complexity scales linearly with both sequence length and number of states. Subsequent applications of HMMs to anomaly-based HIDS, e.g., (Gao et al., 2002; Hoang and Hu, 2004; Hoang et al., 2003; Wang et al., 2004), follow Warrender et al. (1999) in choosing the number of HMM states without further optimization. The authors in (Yeung and Ding, 2003) are among few who tried to investigate the effect of the number of states and the detector window size on the performance using UNM data sets. However only unique contiguous sequences are used for HMM training, which might affect the model parameters. Furthermore, as explained in Section V.4, the data sets are labeled using the STIDE matching technique which makes it as the ground truth for other classifiers. V.3 Selection and Fusion of HMMs in the ROC Space The ROC curve displays a trade-off between true detection (or true positive) rate and false alarm (or false positive) rate for all thresholds. In our context, the true detection rate is the number of anomalous sequences correctly detected over the total number of anomalous sequences, and the false alarm rate is the number of normal sequences detected as anomalous over the total number of normal sequences in the test set. ROC analysis provides a useful tool for combining classifiers given a set of one- or twoclass classifiers using the MRROC (Scott et al., 1998). The MRROC is a fusion technique that uses randomized decision rules and provides an overall classification system whose curve is the convex hull over all existing detection and false alarm rates. In contrast with traditional ROCs where only operational thresholds change when moving along the curve, the MRROC points may also change other model parameters or even switch models. All operating points on the MRROC are achievable in practice, using an exhaustive search, and are theoretically proved by application of the realizable classifier theory (Scott et al., 280 1998). The procedure is to create from two existing classifiers Ca and Cb a composite one Cc whose performance in terms of its ROC, lies on the line segment connecting Ca and Cb . This is done by interpolating the responses of the existing classifiers as follows. Let (tpra , f pra ) and (tprb , f prb ) be the true positive and false positive rates of the existing classifiers Ca and Cb , and (tprc , f prc ) those of the desired composite classifier Cc . The output of Cc for any input O = {o1 , . . . , oT } from the test set, is a random variable that selects the output of Ca or Cb with probability: f prc − f pra f prb − f pra Pr(Cc = Ca ) = 1 − Pr(Cc = Cb ) Pr(Cc = Cb ) = (V.1) Instead of training the HMM with an arbitrary chosen number of states, our approach consists in training several HMMs over a range of n = [Nmin , Nmax ] values and combine their decisions in the ROC space using the MRROC. HMMs trained with various number of states capture different temporal structures of the data. As illustrated in Figure AV.1, for a desired operational system Cc the fusion is done by switching at random between the responses of neighboring models Ca and Cb on the convex hull curve according to eq. (V.1). This combination leads to the realization of the virtual desired model Cc , which is able to achieve in the least a higher performance than any existing models. Since only the HMMs touching the MRROC are potentially optimal, no others need be retained. In addition, this approach allows visualizing systems’ performance and selecting operational HMMs independently from both prior and class distributions as well as cost contexts (Fawcett, 2006). As conditions change, the MRROC does not change; only the portion of interest will. This change is accounted for by shifting the operational system to another point on the facets of the MRROC. In contrast with the commonly used anomaly index measure in related work, this approach also permits to use the Area Under the ROC Curve (AUC), which has been proved an effective evaluation measure in contexts with variable misclassification costs or skewed data (Fürnkranz and Flach, 2003). 281 Input O HM M1 λ1 P r(O | λi ) HM Mn λn ROC Fusion MRROC 1 tpr .8 .6 Cb Cc Ca .4 Disregarded HMM .2 0 Output Cc superior to Ca and Cb .2 .4 .6 .8 1 fpr Figure AV.1 Illustration of the μ-HMMs using the MRROC fusion. For a particular operational system, the combined model may achieve a higher performance than any existing one. Models below the MRROC are disregarded The time and memory complexity required for Baum-Welch training of n different HMMs for the μ-HMMs is comparable to that of training an HMM over a range of n different N values, and selecting the single best HMM (i.e., O(n.N.DW 2 )). During operations, the worst-case time complexity of the Forward-Backward algorithm to evaluate a sequence of data (for a given detection and false alarm rate) with the μ-HMMs solution is two times of the single best HMM solution (i.e., O(N.DW 2 )). For a specific detection and false alarm rate, the MRROC requires in the worst-case the responses of two HMMs to interpolate between two convex hull facets in the ROC space. The worst-case memory complexity required to store HMM parameters may be significantly higher than that of a single best HMM. However, HMMs with a ROC curve that does not touch the MRROC are suboptimal and may be discarded. 282 V.4 Experimental Methodology The UNM data sets are commonly used for benchmarking anomaly detections based on system calls sequences. Normal data are collected from the monitored process in a secured environment, while testing data are the collection of the system calls when this process is under attack (Warrender et al., 1999). Since it is very difficult to isolate the manifestation of an attack at the system call level, the UNM test sets are not labeled. Therefore, in related work, intrusive sequences are usually labeled by comparing normal sequences, using STIDE matching technique. This labeling process considers STIDE response the ground truth, and leads to a biased evaluation and comparison of techniques, which depends on both training data size and detector window size. The need to overcome issues encountered when using real-world data for anomaly-based HIDS (incomplete data for training, and labeled data) has lead to the implementation of a synthetic data generation platform for proof-of-concept simulations. It is intended to provide normal data for training and labeled data (normal and anomalous) for testing. This is done by simulating different processes with various complexities then injecting anomalies in known locations. Inspired by the work of Tan and Maxion (Maxion and Tan, 2000; Tan and Maxion, 2002, 2003), the data generator is based on the Conditional Relative Entropy (CRE) of a source. It is defined as the conditional entropy divided by the maximum entropy (MaxEnt) of that source, which gives an irregularity index to the generated data. For two random variables x and y the CRE can be computed by: CRE = − x p(x) | x) log p(y | x) M axEnt y p(y where for an alphabet of size Σ symbols, M axEnt = −Σ log(1/Σ) is the entropy of a theoretical source in which all symbols are equiprobale. It normalizes the conditional entropy values between CRE = 0 (perfect regularity) and CRE = 1 (complete irregularity or random). In a sequence of system calls, the conditional probability, p(y | x), represents 283 the probability of the next system call given the current one. It can be represented as the columns and rows (respectively) of a Markov Model with the transition matrix M = {aij }, where aij = p(St+1 = j | St = i) is the transition probability from state i at time t to state j at time t + 1. Accordingly, for a specific alphabet size Σ and CRE value, a Markov chain is first constructed, then used as a generative model for normal data. This Markov chain is also used for labeling injected anomalies as described below. Let an anomalous event be defined as a surprising event which does not belong to the process normal pattern. This type of event may be a foreign-symbol anomaly sequence that contains symbols not included in the process normal alphabet, a foreign n-gram anomaly sequence that contains n-grams not present in the process normal data, or a rare n-gram anomaly sequence that contains n-grams that are infrequent in the process normal data and occurs in burst during the test2 . Generating training data consists of constructing Markov transition matrices for an alphabet of size Σ symbols with the desired irregularity index (CRE) for the normal sequences. The normal data sequence with the desired length is then produced with the Markov chain, and segmented using a sliding window (shift one) of a fixed size, DW . To produce the anomalous data, a random sequence (CRE = 1) is generated, using the same alphabet size Σ, and segmented into sub-sequences of a desired length using a sliding window with a fixed size of AS. Then, the original generative Markov chain is used to compute the likelihood of each sub-sequence. If the likelihood is lower than a threshold it is labeled as anomaly. The threshold is set to (min(aij ))AS−1 , ∀i,j , the minimal value in the Markov transition matrix to the power (AS − 1), which is the number of symbol transitions in the sequence of size AS. This ensures that the anomalous sequences of size AS are not associated with the process normal behavior, and hence foreign n-gram anomalies are collected. The trivial case of foreign-symbol anomaly is disregarded since it is easy to be detected. Rare n-gram anomalies are not considered since we seek to investigate the performance at the detection level, and such kind of anomalies are accounted for at 2 This is in contrast with other work which consider rare event as anomalies. Rare events are normal, however they may be suspicious if they occurs in high frequency over a short period of time. 284 a higher level by computing the frequency of rare events over a local region. Finally, to create the testing data another normal sequence is generated, segmented and labeled as normal. The collected anomalies of the same length are then injected into this sequence at random according to a mixing ratio. The experiments conducted in this paper cover a wide range of the parameters space: Σ = 8 − 50 symbols, training set size = 100 − 1000, 000 symbols and CRE = 0.3 − 0.8, to approach as much as possible to real-world processes (Lee and Xiang, 2001). The sizes of injected anomalies are assumed equal to the detector window sizes AS = DW = {2, 4, 6, 8}, and different normal/anomalous ratios are considered for the test phase. For each training set of size DW , different discrete-time ergodic HMMs are trained with various number of hidden states N . The number of symbols is taken equal to the process alphabet size. The iterative Baum-Welch algorithm is used to estimate HMM parameters (Baum et al., 1970). To reduce overfitting effects, the evaluation of the log-likelihood on an independent validation set is used as a stopping criterion. The training process is repeated ten times using a different random initialization to avoid local minima. Finally, the model that gives the highest log-likelihood value on validation data is selected. This procedure is replicated ten times with different training, validation and testing sets, and the results are averaged and presented along with the standard deviations. After training an HMM, the log-likelihood of a test sub-sequences is evaluated using the forward algorithm (Rabiner, 1989). By sweeping all the decision thresholds, HMM evaluation results in an ROC curve for each value of N . In contrast, STIDE testing consists of comparing the test sub-sequences with its normal database, which gives a point in the ROC space. It should be noted that since in this paper, the detector window size is always considered equal to the anomaly size (DW = AS), STIDE detection rate is always 100% and only the false alarm rate varies. This is due to its blind regions – STIDE only misses anomalous sequences that are larger that the detector window size (DW < AS) and have all of its sub-sequences in STIDE normal database. The window 285 will slide on its sub-sequences that are all normal, without being able to discover that the whole sequence is anomalous (Maxion and Tan, 2000; Tan and Maxion, 2002, 2003). V.5 Simulation Results Due to space limitations only a limited number of results are presented. First, the results of a simpler scenario are illustrated in Figure AV.2 and AV.3, and then those of a more complex scenario are shown in Figure AV.4. The simpler scenario involves a relatively simple case with a small alphabet size Σ = 8 and a low irregularity index CRE = 0.3, while for the more complex scenario Σ = 50 symbols and the CRE = 0.4. For both scenarios, the presented results are for test sets that comprise 75% of normal and 25% of anomalous data. However, these results are consistent throughout the range of experiments not shown in this paper. The ROC curves in Figure AV.2 show the impact of using different training set and anomaly sizes on the performance of STIDE and HMM, where HMMs are trained with different numbers of states N . Results in this figure indicate that the AUC of HMM grow with the training set size. As illustrated in Figure AV.2c, when the anomaly size increases, the search space grows exponentially (according to ΣAS ) which also expands both anomalous and normal partitions of the space. For instance, for Σ = 8 and AS = 2 the space of combinations comprises 82 = 64 sequences, part of which is normal and the rest anomalous (depending on the CRE the data). By changing the anomaly size to AS = 8 the combinations grows to 88 = 16, 777, 216 sequences, which imposes a much larger amount of training data for STIDE to memorize the normal space. The storage capacity and detection time associated with STIDE increases considerably. Accordingly, for a fixed training set size, the performance of STIDE is significantly degraded with the increase of the anomaly size. Nevertheless, any normal sequence that was not collected during training will be classified as anomaly, triggering false alarms in operational mode. This problem is usually mitigated by introducing an anomaly counter or index in local 286 regions according to an arbitrary threshold, which may however be exploited by an 1 1 0.9 0.9 0.8 0.8 0.7 0.7 Detection rate Detection rate adversary. 0.6 HMM: N= 4, AUC=0.845 HMM: N= 5, AUC=0.901 HMM: N= 6, AUC=0.902 HMM: N= 7, AUC=0.967 HMM: N= 8, AUC=0.938 HMM: N= 9, AUC=0.934 HMM: N=10, AUC=0.940 HMM: N=11, AUC=0.963 HMM: N=12, AUC=0.971 HMM: MRROC,AUC=0.984 STIDE 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.6 HMM: N= 4, AUC=0.814 HMM: N= 5, AUC=0.890 HMM: N= 6, AUC=0.952 HMM: N= 7, AUC=0.938 HMM: N= 8, AUC=0.937 HMM: N= 9, AUC=0.984 HMM: N=10, AUC=0.966 HMM: N=11, AUC=0.971 HMM: N=12, AUC=0.977 HMM: MRROC,AUC=0.993 STIDE 0.5 0.4 0.3 0.2 0.1 0 0 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False alarm rate False alarm rate (a) Training Set Size = 100 symbols, DW = AS = 2 symbols (b) Training Set Size = 300 symbols, DW = AS = 2 symbols 1 0.9 0.8 Detection rate 0.7 0.6 HMM: N= 4, AUC=0.991 HMM: N= 5, AUC=0.998 HMM: N= 6, AUC=0.979 HMM: N= 7, AUC=0.995 HMM: N= 8, AUC=0.997 HMM: N= 9, AUC=0.989 HMM: N=10, AUC=0.997 HMM: N=11, AUC=0.997 HMM: N=12, AUC=0.997 HMM: MRROC,AUC=0.999 STIDE 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False alarm rate (c) Training Set Size = 100 symbols, DW = AS = 8 symbols Figure AV.2 Illustration of the effect of training set sizes and anomaly sizes on the performance of STIDE, HMM with and the μ-HMMs using MRROC for the simple scenario. The HMMs were trained for a numbers of states N ∈ [4, 12] 287 In contrast, even when an HMM is trained on a relatively small data set, its discrimination capabilities increases with the anomaly size. This is due to the HMM abilities to capture dependencies in temporal patterns and to generalize in the case of unknown test sequences. Indeed, when evaluating the likelihood P r(O | HM M ), an HMM computes the product of probabilities (over states transitions and emissions) for symbols in the sequence presented for testing. Anomalous sequences should lead to significantly lower likelihood values than normal ones with a well trained HMM. Therefore, with the increase of the anomaly size, the likelihood of anomalous sequences becomes smaller at a faster rate than normal ones, and hence increases HMM detection rate. 1 Area Under the ROC Curve (AUC) Area Under the ROC Curve (AUC) 1 0.95 0.9 0.85 0.8 100 200 300 400 Training Data Set Size (a) Detector Window Size, DW = 2 500 0.995 N=4 N=5 N=6 N=7 N=8 N=9 N=10 N=11 N=12 MRROC 0.99 0.985 0.98 100 200 300 400 Training Data Set Size 500 (b) Detector Window Size, DW = 4 Figure AV.3 μ-HMMs vs HMM performance evaluation using the AUC for various training set sizes, DW and N (Σ = 8 and CRE = 0.3) The impact of the number of HMM states N on performance is rarely addressed in the HIDS applications. Figures AV.2 and AV.3 illustrates this effect on AUC measures versus different amounts of training data and different detector window sizes DW . It can be seen that the value of N that provides the best performance depends on the training data set size and DW . More importantly, the common practice (in most application to HMMs for anomaly-based HIDS) of selecting the number of states equal to the alphabet size (e.g., N = Σ = 8 in Figure AV.3), does not provide a high level of performance over 288 the wide range of training set sizes and DW values. This effect is further illustrated in Figure AV.4 with the average AUC values (over ten independent replications) for the more complex scenario. 1 0.9 0.8 0.7 N=40 N=45 N=50 N=55 N=60 MRROC 0.6 0.5 0.4 2000 4000 6000 8000 Training Data Set Size Area Under the ROC Curve (AUC) Area Under the ROC Curve (AUC) 1 0.9 0.8 0.7 0.6 0.5 0.4 10000 2000 (a) Detector Window Size, DW = 2 10000 (b) Detector Window Size, DW = 4 1 Area Under the ROC Curve (AUC) 1 Area Under the ROC Curve (AUC) 4000 6000 8000 Training Data Set Size 0.9 0.8 0.7 0.6 0.5 2000 4000 6000 8000 Training Data Set Size (c) Detector Window Size, DW = 6 10000 0.9 0.8 0.7 0.6 0.5 0.4 2000 4000 6000 8000 Training Data Set Size 10000 (d) Detector Window Size, DW = 8 Figure AV.4 Average AUC of single HMMs and the μ-HMMs with MRROC combination for various training set sizes, DW and N values (Σ = 50 and CRE = 0.4). Error bars are standard deviations The optimal number of states does not exist over the whole range of detection space to design a “single best” HMM. Combining the responses of μ-HMMs with the MRROC fusion is shown to achieve the highest over all performance over anyone of the HMMs alone. Furthermore, the resulting convex hull (from the MRROC combination) provides a smoother curve than any existing single model. The operational system may therefore 289 be changed according to the desired detection and false alarm rate without compromising the performance. In fact, moving along the convex hull curve allows switching to another model or interpolating the responses of two models. In contrast, as illustrated in Figure AV.2a and b, using a single HMM typically results a staircase-shaped ROC curve and this effect is even worse in real-world cases since the anomalies are relatively rare with reference to normal data. Finally, in contrast with STIDE, HMM and the μ-HMMs are capable of detecting different anomaly sizes with the same detector window size. This impact along with the combination of HMMs trained with different window sizes is being characterized and will appear in future work. V.6 Conclusion This paper presents a multiple-HMMs approach, where each HMM is trained using a different number of hidden states capturing different temporal structures of the data. The HMM responses are then combined in the ROC space according to the MRROC technique. Results have shown that the overall performance of the proposed μ-HMMs approach increased considerably over a single best HMM and STIDE. In addition, this combination provides a smooth convex hull for composite operational systems without significantly increasing the computational overhead. Future work also involves characterizing the performance of the μ-HMMs approach using for instance the UNM data and observing the impact of combining HMMs trained with different window sizes to detect various anomaly sizes. 290 BIBLIOGRAPHY Abouzakhar N. and Manson G.A., 2004. Evaluation of intelligent intrusion detection models. International Journal of Digital Evidence, volume 3. Alanazi Hamdan O., Noor Rafidah Md, Zaidan B. B., and Zaidan A. A., February 2010. Intrusion detection system: Overview. Journal of Computing, 2(2):130–133. AlephOne, November 1996. Smashing the stack for fun and profit. Online. Phrack Magazine, 7(14). URL http://www.phrack.com/issues.html?issue=49&id=14. Accessed December 10, 2010. Alessandri Dominique, Cachin Christian, Dacier Marc, Deak Oliver, Julisch Klaus, Randell Brian, Riordan James, Tscharner Andreas, Wespi Andreas, and Wüest Candid, September 2001. Towards a taxonomy of intrusion detection systems and attacks. Malicious- and Accidental-Fault Tolerance for Internet Applications, MAFTIA Deliverable D3 EU Project IST-1999-11583, IBM, Zurich, Switzerland. Ali Kamal M. and Pazzani Michael J., 1996. Error reduction through learning multiple descriptions. Machine Learning, 24:173–202. Anderson B.D.O., 1999. From Wiener to hidden Markov models. IEEE Control Systems Magazine, 19(3):41–51. ISSN 0272-1708. Anderson Debra, Frivold Thane, and Valdes Alfonso, May 1995. Next-generation intrusion detection expert system (nides): A summary. Technical Report SRI-CSL-9507, Computer Science Laboratory, SRI International. Anderson James P., April 1980. Computer security threat monitoring and surveillance. Technical Report 79F26400, James P. Anderson Co., Fort Washington. Arapostathis Aristotle and Marcus Steven I., 1990. Analysis of an identication algorithm arising in the adaptive estimation of Markov chains. Mathematics of Control, Signals and Systems, 3(1):1–29. Askar M. and Derin H., 1981. A recursive algorithm for the bayes solution of the smoothing problem. IEEE Transactions on Automatic Control, 26(2):558–561. Axelsson Stefan, March 2000. Intrusion detection systems: A survey and taxonomy. Technical Report 99–15, Chalmers University. Bahl Lalit R., Jelinek Frederick, and Mercer Robert L., 1982. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 2:179–190. Baldi Pierre and Chauvin Yves, 1994. Smooth on-line learning algorithms for hidden Markov models. Neural Computation, 6(2):307–318. ISSN 0899-7667. 292 Banfield Robert, Hall Lawrence, Bowyer Kevin, and Kegelmeyer W., 2003. A new ensemble diversity measure applied to thinning ensembles. Multiple Classifier Systems, 2709:306–316. Barreno Marco, Cardenas Alvaro, and Tygar Doug, January 2008. Optimal ROC for a combination of classifiers. Advances in Neural Information Processing Systems (NIPS), 20. Baum Leonard E., 1972. An inequality and associated maximization technique in statistical estimation for probalistic functions of Markov processes. Inequalities, III: New York: Academic. (Proc. 3rd Symp., Univ. Calif., Los Angeles, Calif., 1969; dedicated to the memory of Theodore S. Motzkin). Baum Leonard E. and Petrie Ted, 1966. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563. Baum Leonard E., Petrie G. Soules, and Weiss N., 1970. A maximization technique occuring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171. Beal Matthew J., Ghahramani Zoubin, and Rasmussen Carl E., 2002. The infinite Hidden Markov Model. Advances in Neural Information Processing Systems (NIPS), 14. Bengio Y., 1999. Markovian models for sequential data. Neural Computing Surveys, 2: 129–162. ISSN 1093-7609. Benveniste Albert, Priouret Pierre, and Métivier Michel, 1990. Adaptive algorithms and stochastic approximations. Springer-Verlag New York, Inc., New York, NY, USA. ISBN 0-387-52894-6. Bickel Peter J., Ritov Yaacov, and Ryden Tobias, 1998. Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Annal of Statistics, 26(4):1614–1635. Bilmes Jeff A., 2002. What HMMs can do. Technical Report UWEETR-2002-0003, Dept of EE, University of Washington Seattle WA. Black Michael A. and Craig Bruce A., 2002. Estimating disease prevalence in the absence of a gold standard. Statistics in Medicine, 21(18):2653–2669. Blunden Bill, 2009. The Rootkit Arsenal: Escape and Evasion in the Dark Corners of the System. Wordware Publishing. Bottou Léon, 2004. Stochastic learning. Bousquet Olivier and von Luxburg Ulrike, editors, Advanced Lectures on Machine Learning, number LNAI 3176 in Lecture Notes in Artificial Intelligence, pages 146–168. Springer Verlag, Berlin. 293 Bradley Andrew P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159. Brand M. and Kettnaker V., 2000. Discovery and segmentation of activities in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):844–851. Bray Rory, Cid Daniel, and Hay Andrew, March 2008. OSSEC Host-Based Intrusion Detection Guide. SYNGRESS. Breiman Leo, August 1996. Bagging predictors. Machine Learning, 24(2):123–140. Brown G., Wyatt J., Harris R., and Yao X., March 2005. Diversity creation methods: A survey and categorisation. Journal of Information Fusion, 6(1):5–20. Brugger S. T., Kelley M., Sumikawa K., and Wakumoto S., April 2001. Data mining for security information: A survey. Technical Report UCRL-JC-143464, U.S. Department of Energy, Lawrence Livermore National Laboratory., Philadelphia, PA. Bulla Jan and Berzel Andreas, January 2008. Computational issues in parameter estimation for stationary hidden Markov models. Computational Statistics, 23(1): 1–18. Cappe O., 2009. Online EM algorithm for hidden Markov models. preprint. Cappe O. and Moulines E., 2005. Recursive computation of the score and observed information matrix in hidden Markov models. IEEE/SP 13th Workshop on Statistical Signal Processing, pages 703–708. Cappe O., Buchoux V., and Moulines E., 1998. Quasi-Newton method for maximum likelihood estimation of hidden Markov models. Proceedings of the 1998 IEEE International Conference onAcoustics, Speech, and Signal Processing, ICASSP98, volume 4, pages 2265–2268. Cappe Olivier, March 12 2001. Ten years of HMMs. Available online: http://www.tsi.enst.fr/ cappe/docs/hmmbib.html. Accessed December 10, 2010. Cappe Olivier, Moulines Eric, and Ryden Tobias, 2005. Inference in Hidden Markov Models. Series in Statistics. Springer-Verlag, New-York. Caragea D., Silvescu A.., and Honavar V., 2001. Towards a theoretical framework for analysis and synthesis of agents that learn from distributed dynamic data sources. Emerging Neural Architectures Based on Neuroscience, pages 547–559. SpringerVerlag. Cardenas Alvaro A., Ramezani Vahid, and Baras John S., 2003. HMM sequential hypothesis tests for intrusion detection in MANETs. Technical Report ISR TR 2003-47, Department of Electrical and Computer Engineering and Institute for Systems Research University of Maryland College Park, MD, 20740 USA. 294 Caruana Rich, Niculescu-Mizil Alexandru, Crew Geoff, and Ksikes Alex, 2004. Ensemble selection from libraries of models. ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 18, New York, NY, USA, 2004. ACM. ISBN 1-58113-828-5. Chandola Varun, Banerjee Arindam, and Kumar Vipin, 2009a. Anomaly detection for discrete sequences: A survey. Technical Report TR 09-015, University of Minnesota, Department of Computer Science and Engineering. Chandola Varun, Banerjee Arindam, and Kumar Vipin, July 2009b. Anomaly detection: A survey. ACM Comput. Surv., 41:15:1–15:58. ISSN 0360-0300. Chang R. and Hancock J., 1966. On receiver structures for channels having memory. IEEE Transactions on Information Theory, 12(4):463–468. ISSN 0018-9448. Chen Hao, Wagner David, and Dean Drew, 2002. Setuid demystified. USENIX Security Symposium, pages 171–190. Chen Yu-Shu and Chen Yi-Ming, 2009. Combining incremental hidden Markov model and Adaboost algorithm for anomaly intrusion detection. CSI-KDD ’09: Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, pages 3–9, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-669-4. Cho Sung-Bae and Han Sang-Jun, 2003. Two sophisticated techniques to improve hmmbased intrusion detection systems. Vigna Giovanni, Kruegel Christopher, and Jonsson Erland, editors, Recent Advances in Intrusion Detection, volume 2820 of Lecture Notes in Computer Science, pages 207–219. Springer Berlin / Heidelberg. Choy Jongho and Cho Sung-Bae, 2000. Intrusion detection by combining multiple HMMs. PRICAI 2000 Topics in Artificial Intelligence, 1886:829–829. Churbanov Alexander and Winters-Hilt Stephen, 2008. Implementing EM and viterbi algorithms for hidden Markov model in linear memory. BMC Bioinformatics, 9 (224):–. Cohen William W., July 1995. Fast effective rule induction. Prieditis Armand and Russell Stuart, editors, Proceeding of the 12th International Conference on Machine Learning, pages 115–123, Tahoe City, CA, July 1995. Morgan Kaufmann. ISBN 1-55860-377-8. Collings Iain B., Krishnamurthy Vikram., and Moore Jhon B., December 1994. On-line identication of hidden Markov models via recursive prediction error techniques. IEEE Transactions on Signal Processing, 42(12):3535–3539. Collings I.B. and Ryden T., 1998. A new maximum likelihood gradient algorithm for on-line hidden Markov model identification. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP98, volume 4, pages 2261–2264. 295 Coppersmith D. and Winograd S., 1990. Matrix multiplication via arithmetic progressions. J. Symbolic Comput., 9:251–280. Daugman John, 2000. Biometric decision landscapes. Technical Report UCAM-CL-TR482, Universtity of Cambridge, UK. De Boer Pieter and Pels Martin, 2005. Host-based intrusion detection systems. Technical Report 1.10, Faculty of Science, Informatics Institute, University of Amsterdam. Debar H., Becker M., and Siboni D., May 1992. A neural network component for an intrusion detection system. IEEE Computer Society Symposium on Research in Security and Privacy, pages 240–250. Debar Herve, Dacier Marc, and Wespi Andreas, 2000. Revised taxonomy for intrusiondetection systems. Annales des Telecommunications, 55(7):361–378. ISSN 00034347. Dempster A.P., Laird N., and Rubin D.B., 1977. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38. Demšar Janez, 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30. ISSN 1532-4435. Denning Dorothy E., May 1986. An Intrusion Detection Model. Proceedings of the Seventh IEEE Symposium on Security and Privacy, pages 119–131. Devijver P., 1985. Baum’s forward-backward algorithm revisited. Pattern Recognition Letters, 3:369–373. Dietterich Thomas, 2000. Ensemble methods in machine learning. Multiple Classifier Systems, 1857:1–15. Digalakis Vassilios V., 1999. Online adaptation of hidden Markov models using incremental estimation algorithms. IEEE Transactions on Speech and Audio Processing, 7 (3):253–261. ISSN 1063-6676. Domingos Pedro, 2000. A unified bias-variance decomposition and its applications. 17th International Conference on Machine Learning, pages 231–238. Morgan Kaufmann. Dos Santos E. M, Sabourin R., and Maupin P., July 2008. Pareto analysis for the selection of classifier ensembles. Genetic and Evolutionary Computation Conference (GECCO), pages 681–688, Atlanta, GA, USA, July 2008. Du Ye, Wang Huiqiang, and Pang Yonggang, 2004. A hidden Markov models-based anomaly intrusion detection method. Proceedings of the World Congress on Intelligent Control and Automation (WCICA), volume 5, pages 4348–4351, Hangzhou, China, 2004. 296 Eddy S. R., 1998. Profile hidden Markov models. Bioinformatics, 14(9):755–763. Elliott Robert J., 1994. Exact adaptive filters for Markov chains observed in gaussian noise. Automatica, 30(9):1399–1408. ISSN 0005-1098. Elliott Robert J., Aggoun Lakhdar, and Moore John B., 1995. Hidden Markov Models: Estimation and Control. Springer-Verlag. Ephraim Y. and Merhav N., 2002. Hidden Markov processes. IEEE Transactions on Information Theory, 48(6):1518–1569. Ephraim Y. and Rabiner L.R., September 1990. On the relations between modeling approaches for information sources [speech recognition]. IEEE Transactions on Information Theory, 36(2):372–380. Estevez-Tapiador Juan M., Garcia-Teodoro Pedro, and Diaz-Verdejo Jesus E., 2004. Anomaly detection methods in wired networks: A survey and taxonomy. Computer Communications, 27(16):1569–1584. ISSN 0140-3664. Fan W., Miller M., Stolfo S., Lee W., and Chan P., 2004. Using artificial anomalies to detect unknown and known network intrusions. Knowledge and Information Systems, 6:507–527. ISSN 0219-1377. Fan Wei, Chu Fang, Wang Haixun, and Yu Philip S., 2002. Pruning and dynamic scheduling of cost-sensitive ensembles. Eighteenth national conference on Artificial intelligence, pages 146–151, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence. ISBN 0-262-51129-0. Fawcett Tom, 2004. ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Laboratories, Palo Alto, CA, USA. Fawcett Tom, 2006. An introduction to ROC analysis. Pattern Recognition Letter, 27 (8):861–874. ISSN 0167-8655. Flach Peter A. and Wu Shaomin, August 2005. Repairing concavities in ROC curves. Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 702–707. IJCAI. Florez-Larrahondo German, Bridges Susan, and Hansen Eric A., 2005. Incremental estimation of discrete hidden Markov models based on a new backward procedure. Proceedings of the National Conference on Artificial Intelligence, 2:758–763. Ford J.J. and Moore J.B., 1998a. Adaptive estimation of HMM transition probabilities. IEEE Transactions on Signal Processing (USA), 46(5):1374–85. ISSN 1053-587X. Ford J.J. and Moore J.B., 1998b. On adaptive HMM state estimation. IEEE Transactions on Signal Processing (USA), 46(2):475–86. ISSN 1053-587X. 297 Forney G. David, March 2005. The viterbi algorithm: A personal history. Viterbi Conference, University of Southern California, Los Angeles. Forrest S., Hofmeyr S., and Somayaji A., December 2008. The evolution of systemcall monitoring. Computer Security Applications Conference, 2008. ACSAC 2008. Annual, pages 418–430. Forrest Stephanie, Hofmeyr Steven A., Somayaji Anil, and Longstaff Thomas A., 1996. A sense of self for Unix processes. Proceedings of the 1996 IEEE Symposium on Research in Security and Privacy, pages 120–128. Foster James C., Osipov Vitaly, Bhalla Nish, and Heinen Niels, 2005. Buffer Overflow Attacks: Detect, Exploit, Prevent. Syngress. ISBN 1-932266-67-4. Freund Yoav and Schapire Robert E., 1996. Experiments with a new boosting algorithm. ICML 96, pages 148–156. Fürnkranz Johannes and Flach Peter, 2003. An analysis of rule evaluation metrics. Proceedings of the 20th International Conference on Machine Learning, volume 1, pages 202–209. Gadelrab Mohammed El-Sayed, December 2008. Evaluation of Intrusion Detection Systems. PhD thesis, Université Toulouse III – Paul Sabatier. Gael Jurgen Van, Saatci Yunus, Teh Yee Whye, and Ghahramani Zoubin, 2008. Beam sampling for the infinite Hidden Markov Model. Proceedings of the 25th international conference on Machine learning, pages 1088–1095, Helsinki, Finland, 2008. ACM. Gao Bo, Ma Hui-Ye, and Yang Yu-Hang, 2002. HMMs (Hidden Markov Models) based on anomaly intrusion detection method. Proceedings of 2002 International Conference on Machine Learning and Cybernetics, 1:381–385. Gao Fei, Sun Jizhou, and Wei Zunce, 2003. The prediction role of hidden Markov model in intrusion detection. Canadian Conference on Electrical and Computer Engineering, volume 2, pages 893–896, Montreal, Canada, 2003. García Salvador and Herrera Francisco, December 2008. An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694. Garg Ashutosh and Warmuth Manfred K., September 2003. Inline updates for HMMs. EUROSPEECH-2003, pages 1005–1008. Ghahramani Z., 2001. An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence, 15(1):9–42. ISSN 0218-0014. 298 Ghorbani Ali A., Lu Wei, Tavallaee Mahbod, Ghorbani Ali A., Lu Wei, and Tavallaee Mahbod, 2010. Intrusion response. Jajodia Sushil, editor, Network Intrusion Detection and Prevention, volume 47 of Advances in Information Security, pages 185–198. Springer US. ISBN 978-0-387-88771-5. Ghosh Anup K., Schwartzbard Aaron, and Schatz Michael, 1999. Learning program behavior profiles for intrusion detection. Proceedings of the Workshop on Intrusion Detection and Network Monitoring, pages 51–62, Berkeley, CA, USA, 1999. USENIX Association. ISBN 1-880446-37-5. Gorodnichy Dmitry O., Dubrofsky Elan, Hoshino Richard, Khreich Wael, Granger Eric, and Sabourin Robert, April 2011. Exploring the upper bound performance limit of iris biometrics using score calibration and fusion. (SSCI) CIBIM - 2011 IEEE Workshop on Computational Intelligence in Biometrics and Identity Management, Paris, France, April 2011. Gotoh Yoshihiko, Hochberg Michael M., and Silverman Harvey F., 1998. Efficient training algorithms for HMM’s using incremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6):539–548. ISSN 1063-6676. Grice JA., Hughey R., and Speck D., 1997. Reduced space sequence alignment. CABIOS, 13(1):45–53. Grossberg Stephen, 1988. Nonlinear neural networks: principles, mechanisms and architectures. Neural Networks, 1:17–61. Gunawardana Asela and Byrne William, 2005. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6: 2049–2073. ISSN 1533-7928. Haker Steven, Wells William M., Warfield Simon K., Talos Ion-Florin, Bhagwat Jui G., Goldberg-Zimring Daniel, Mian Asim, Ohno-Machado Lucila, and Zou Kelly H., 2005. Combining classifiers using their receiver operating characteristics and maximum likelihood estimation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 3749:506–514. Hanley JA and McNeil BJ, 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36. Hanley James A., 1988. The robustness of the “binormal” assumptions used in fitting ROC curves. Medical Decision Making, 8(3):197–203. Hathaway R. J., 1986. Another interpretation of the EM algorithm for mixture distributions. Statistics and Probability Letters, 4:53–56. Helali Rasha G. Mohammed, 2010. Data mining based network intrusion detection system: A survey. Sobh Tarek, Elleithy Khaled, and Mahmood Ausif, editors, Novel 299 Algorithms and Techniques in Telecommunications and Networking, pages 501–505. Springer Netherlands. ISBN 978-90-481-3662-9. Hill J.M., Oxley M.E., and Bauer K.W., 2003. Receiver operating characteristic curves and fusion of multiple classifiers. Proceedings of the 6th International Confenrence on Information Fusion, volume 2, pages 815–822. Ho Tin Kam, 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844. ISSN 0162-8828. Ho Tin Kam, Hull J.J., and Srihari S.N., 1994. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1): 66–75. ISSN 0162-8828. Hoang X.D. and Hu J., 2004. An efficient Hidden Markov Model training scheme for anomaly intrusion detection of server applications based on system calls. IEEE International Conference on Networks, ICON, volume 2, pages 470–474, Singapore, 2004. Hoang Xuan Dau, Hu Jiankun, and Bertok Peter, 2003. A multi-layer model for anomaly intrusion detection. IEEE International Conference on Networks, ICON, volume 1, pages 531–536. Hofmeyr Steven A., Forrest Stephanie, and Somayaji Anil, 1998. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180. Holst Ulla and Lindgren Georg, 1991. Recursive estimation in mixture models with Markov regime. IEEE Transactions on Information Theory, 37(6):1683–1690. Hovland Geir E. and McCarragher Brenan J., 1998. Hidden Markov models as a process monitor in robotic assembly. International Journal of Robotics Research, 17(2): 153–168. ISSN 0278-3649. Hu Jiankun, 2010. Host-based anomaly intrusion detection. Stavroulakis Peter and Stamp Mark, editors, Handbook of Information and Communication Security, pages 235–255. Springer Berlin Heidelberg. ISBN 978-3-642-04117-4. Huang Jin and Ling C.X., 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299–310. ISSN 1041-4347. Huang Xuedong and Hon Hsiao-Wuen, 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River, NJ, USA. ISBN 0130226165. Foreword By-Raj Reddy. 300 Ilgun Koral, Kemmerer R. A., and Porras Phillip A., March 1995. State transition analysis: A rule-based intrusion detection approach. IEEE Transactions on Software Engineering, 21:181–199. ISSN 0098-5589. Ingham Kenneth and Forrest Stephanie, 2005. Network firewalls. Vemuri V. Rao and Sreeharirao V., editors, Enhancing Computer Security with Smart Technology, pages 9–35. CRC Press. Jha S., Tan K., and Maxion R.A., 2001. Markov chains, classifiers, and intrusion detection. Proceedings of the Computer Security Foundations Workshop, pages 206–219. Juang Bing-Hwang, Levinson S., and Sondhi M., 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains (corresp.). IEEE Transactions on Information Theory, 32(2):307–309. ISSN 0018-9448. Kershaw David, Gao Qigang, and Wang Hai, 2011. Anomaly-based network intrusion detection using outlier subspace analysis: A case study. Butz Cory and Lingras Pawan, editors, Advances in Artificial Intelligence, volume 6657 of Lecture Notes in Computer Science, pages 234–239. Springer Berlin / Heidelberg. Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, 2009a. A comparison of techniques for on-line incremental learning of HMM parameters in anomaly detection. Proceedings of the Second IEEE international conference on computational intelligence for security and defense applications (CISDA09), pages 1–8, Ottawa, Canada, 2009a. ISBN 978-1-4244-3763-4. Khreich Wael, Granger Eric, Sabourin Robert, and Miri Ali, 2009b. Combining Hidden Markov Models for anomaly detection. International Conference on Communications (ICC09), pages 1–6, Dresden, Germany, 2009b. Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, 2010a. Boolean combination of classifiers in the ROC space. 20th International Conference on Pattern Recognition (ICPR10), pages 4299–4303, Istanbul, Turkey, 2010a. Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, July 2010b. Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs. Pattern Recognition, 43(8):2732–2752. ISSN 0031-3203. Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, 2010c. On the memory complexity of the forward-backward algorithm. Pattern Recognition Letters, 31(2): 91–99. ISSN 0167-8655. Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, 2011a. A survey of techniques for incremental learning of HMM parameters. Information Sciences. Accepted. 301 Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, 2011b. Incremental Boolean combination of classifiers. 10th International Workshop on Multiple Classifier Systems (MCS11), pages 340–349, Naples, Italy, 2011b. Khreich Wael, Granger Eric, Miri Ali, and Sabourin Robert, 2011c. Adaptive ROCbased ensembles of HMMs applied to anomaly detection. Pattern Recognition. ISSN 0031-3203. Accepted. Kittler J., March 1998. Combining classifiers: A theoretical framework. Pattern Analysis & Applications, 1(1):18–27. Kivinen Jyrki and Warmuth Manfred K., 1997. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63. ISSN 0890-5401. Koenig Sven and Simmons Reid G., 1996. Unsupervised learning of probabilistic models for robot navigation. Proceedings - IEEE International Conference on Robotics and Automation, 3:2301–2308. ISSN 1050-4729. Konorski J., 2005. Solvability of a Markovian model of an IEEE802.11 LAN under a backoff attack. 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 491–8, Atlanta,GA,USA, 2005. Kosoresow A.P. and Hofmeyer S.A., 1997. Intrusion detection via system call traces. IEEE Software, 14(5):35–42. ISSN 0740-7459. Krishnamurthy V. and Moore J.B., 1993. On-line estimation of hidden Markov model parameters based on the Kullback-leibler information measure. IEEE Transactions on Signal Processing, [see also IEEE Transactions on Acoustics, Speech, and Signal Processing], 41(8):2557–2573. ISSN 1053-587X. Krishnamurthy V. and Yin George Gang, 2002. Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime. IEEE Transactions on Information Theory, 48(2):458–476. ISSN 0018-9448. Krogh A., Brown M., Mian I. S., Sjolander K., and Haussler D., February 1994. Hidden Markov-models in computational biology applications to protein modeling. JOURNAL OF MOLECULAR BIOLOGY, 235(5):1501–1531. Krogh A., Larsson B., von Heijne G., and Sonnhammer E. L. L., January 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology, 305(3):567–580. Kruegel Christopher, Mutz Darren, Robertson William, and Valeur Fredrik, 2003. Bayesian event classification for intrusion detection. Proceedings of the 19th Annual Computer Security Applications Conference, ACSAC ’03, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-2041-3. 302 Krügel Christopher, Kirda E., Mutz D., Robertson W., and Vigna G., 2005. Automating mimicry attacks using static binary analysis. Proceedings of the 14th USENIX Security Symposium, pages 161–176, Baltimore, MD, USA, 2005. Kumar Gulshan, Kumar Krishan, and Sachdeva Monika, 2010. The use of artificial intelligence based techniques for intrusion detection: a review. Artificial Intelligence Review, 34:369–387. ISSN 0269-2821. Kuncheva Ludmila and Plumpton Catrin, 2008. Adaptive learning rate for online linear discriminant classifiers. Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition S+SSPR, pages 510–519, Orlando, Florida, USA, 2008. Kuncheva Ludmila I., 2002. A theoretical study on six classifier fusion strategies. IEEE Trans. Pattern Anal. Mach. Intell., 24(2):281–286. ISSN 0162–8828. Kuncheva Ludmila I., 2004a. Combining Pattern Classifiers: Methods and Algorithms. Wiley, Hoboken, NJ. Kuncheva Ludmila I., 2004b. Classifier ensembles for changing environments. Multiple Classifier Systems, volume 3077, pages 1–15, Cagliari, Italy, 2004b. SpringerVerlag, LNCS. Kuncheva Ludmila I. and Whitaker Christopher J., May 2003. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207. Kushner H.J. and Clark D.S., 1978. Stochastic Approximation for Constrained and Unconstrained Systems, volume 26. Springer-Verlag, New York. Lam Tin and Meyer Irmtraud, 2010. Efficient algorithms for training the parameters of hidden markov models using stochastic expectation maximization (em) training and viterbi training. Algorithms for Molecular Biology, 5(1):38. ISSN 1748-7188. Lane Terran and Brodley Carla E., 2003. An empirical study of two approaches to sequence learning for anomaly detection. Machine Learning, 51(1):73–107. Lane Terran D., 2000. Machine learning techniques for the computer security domain of anomaly detection. PhD thesis, Purdue University, W. Lafayette, IN. Langdon William B. and Buxton Bernard F., 2001. Evolving receiver operating characteristics for data fusion. EuroGP ’01: Proceedings of the 4th European Conference on Genetic Programming, pages 87–96, London, UK, 2001. Springer-Verlag. Lange Kenneth, 1995. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, Series B, 57(2):425–437. 303 Lazarevic Aleksandar, Kumar Vipin, and Srivastava Jaideep, 2005. Intrusion detection: A survey. Kumar Vipin, Srivastava Jaideep, and Lazarevic Aleksandar, editors, Managing Cyber Threats, volume 5 of Massive Computing, pages 19–78. Springer US. ISBN 978-0-387-24230-9. Lee Wenke and Xiang Dong, 2001. Information-theoretic measures for anomaly detection. Proceedings of the 2001 IEEE Symposium on Security and Privacy, pages 130–143. Lee Wenke, Stolfo Salvatore J., and Mok Kui W., 2000. Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 14:533–567. ISSN 0269-2821. LeGland F. and Mevel L., December 1995. Recursive identication of HMM’s with observations in a finite set. Proceedings of the 34th IEEE Conference on Decision and Control, pages 216–221, New Orleans, December 1995. LeGland F. and Mevel L., December 1997. Recursive estimation in hidden Markov models. Proceedings of the 36th IEEE Conference on Decision and Control, volume 4, pages 3468–3473, San Diego, CA, December 1997. Leroux B.G., 1992. Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and its Applications, 40:127–143. Levinson S. E., Rabiner L. R., and Sondhi M. M., 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62:1035–1074. Li Xiaolin, Parizeau M., and Plamondon R., 2000. Training hidden Markov models with multiple observations–A combinatorial method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):371–377. ISSN 0162-8828. Li Z. and Evans R.J., 1992. Optimal filtering, prediction and smoothing of hidden Markov models. Proceedings of the 31st IEEE Conference on Decision and Control, volume 4, pages 3299–3304. Liao Yihua and Vemuri V. Rao, 2002. Use of k-nearest neighbor classifier for intrusion detection. Computers & Security, 21(5):439–448. ISSN 0167-4048. Lim S.Y. and Jones A., August 2008. Network anomaly detection system: The state of art of network behaviour analysis. International Conference on Convergence and Hybrid Information Technology, ICHIT08, pages 459–465. Lindgren G., 1978. Markov regime models for mixed distributions and switching regressions. Scandinavian Journal of Statistics, 5:81–91. Lindqvist Ulf and Porras Phillip A., 1999. Detecting computer and network misuse through the production-based expert system toolset (p-best). Proceedings of the 1999 IEEE Symposium on Security and Privacy, pages 146–161. 304 Liporace L., 1982. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Transactions on Information Theory, 28(5):729–734. Littlestone N. and Warmuth M. K., 1994. The weighted majority algorithm. Information and Computation, 108(2):212–261. ISSN 0890-5401. Ljung L., 1977. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575. Ljung Lennart and Soderstrom Torsten, 1983. Theory and Practice of Recursive Identification. MIT Press, Boston. MacKay D., 1997. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, Cambridge, UK. Mann Tobias P., February 2006. Numerically stable hidden markov Model implementation. An HMM scaling tutorial. Marceau Carla, 2000. Characterizing the behavior of a program using multiple-length n-grams. Proceedings of the 2000 workshop on New security paradigms NSPW ’00, pages 101–110, New York, NY, USA, 2000. ACM Press. ISBN 1-58113-260-3. Margineantu Dragos D. and Dietterich Thomas G., 1997. Pruning adaptive boosting. ICML, pages 211–218. Markou Markos and Singh Sameer, 2003a. Novelty detection: a review part 1: statistical approaches. Signal Process., 83(12):2481–2497. ISSN 0165-1684. Markou Markos and Singh Sameer, 2003b. Novelty detection: a review part 2: neural network based approaches. Signal Process., 83(12):2499–2521. ISSN 0165-1684. Martinez-Munoz Gonzalo, Hernandez-Lobato D., and Suraez Alberto, feb. 2009. An analysis of ensemble pruning techniques based on ordered aggregation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):245–259. ISSN 0162-8828. Maxion R.A. and Tan K.M.C., 2000. Benchmarking anomaly-based detection systems. Proceedings of the 2000 International Conference on Dependable Systems and Networks, pages 623–630. McHugh J., Christie A., and Allen J., September/October 2000. Defending yourself: the role of intrusion detection systems. IEEE Software, 17(5):42–51. ISSN 0740-7459. Metz C.E., 1978. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8: 283–298. Meyer Irmtraud M. and Durbin Richard, 2004. Gene structure conservation aids similarity based gene prediction. Nucleic Acids Research, 32(2):776–783. 305 Michael C. C. and Ghosh Anup, August 2002. Simple, state-based approaches to programbased anomaly detection. ACM Trans. Information System Security, 5:203–237. ISSN 1094-9224. Miklos Istvan and Meyer Irmtraud M., 2005. A linear memory algorithm for baum-welch training. BMC Bioinformatics, 6:231. Mitnick Kevin D. and Simon William L., 2005. The Art of Intrusion: The Real Stories Behind the Exploits of Hackers, Intruders and Deceivers. Wiley. Mizuno J., Watanabe T., Ueki K., Amano K., Takimoto E., and Maruoka A., 2000. On-line estimation of hidden Markov model parameters. Proceedings of Third International Conference on Discovery Science, DS 2000, 1967:155–169. Mongillo Gianluigi and Deneve Sophie, 2008. Online learning with hidden Markov models. Neural Computation, 20(7):1706–1716. Muhlbaier M., Topalis A., and Polikar R., July 2004. Incremental learning from unbalanced data. Proc. of the Internaltional Joint Conference on Neural Networks (IJCNN), pages 1057–1062, Budapest, Hungary, July 2004. Narciso Tan Lau, 1993. Adaptive channel/code matching. PhD thesis, University of Southern California, California, United States. Neal R. M. and Hinton G. E., 1998. A new view of the EM algorithm that justifies incremental, sparse and other variants. Jordan M. I., editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers. Neyman J. and Pearson E. S., 1933. On the problem of the most efficient tests of statistical hypotheses. Royal Society of London Philosophical Transactions Series A, 231:289–337. Ng Brenda, October 2006. Survey of anomaly detection methods. Technical Report UCRL-TR-225264, Lawrence Livermore National Laboratory, Livermore, CA. Nocedal Jorge and Wright Stephen J., 2006. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition. Northcutt Stephen and Novak Judy, 2002. Network Intrusion Detection: An Analyst’s Handbook. New Riders Publishing, Thousand Oaks, CA, USA, 3rd edition. ISBN 0735712654. Ott G., 1967. Compact encoding of stationary markov sources. Information Theory, IEEE Transactions on, 13(1):82–86. ISSN 0018-9448. Ourston D., Matzner S., Stump W., and Hopkins B., January 2003. Applications of hidden Markov models to detecting multi-stage network attacks. Proceedings of the 36th Annual Hawaii International Conference on System Sciences. 306 Oxley M.E., Thorsen S.N., and Schubert C.M., 2007. A Boolean Algebra of receiver operating characteristic curves. 10th International Conference on Information Fusion, pages 1–8. Oza Nikunj C. and Russell Stuart, January 2001. Online bagging and boosting. Jaakkola Tommi and Richardson Thomas, editors, Eighth International Workshop on Artificial Intelligence and Statistics, pages 105–112, Key West, Florida. USA, January 2001. Morgan Kaufmann. Parampalli Chetan, Sekar R., and Johnson Rob, 2008. A practical mimicry attack against powerful system-call monitors. Proceedings of the 2008 ACM symposium on Information, computer and communications security, ASIACCS ’08, pages 156–167, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-979-1. Partalas Ioannis, Tsoumakas Grigorios, and Vlahavas Ioannis, 2008. Focused ensemble selection: A diversity-based method for greedy ensemble selection. Proceeding of the 2008 conference on ECAI 2008, pages 117–121, Amsterdam, The Netherlands, The Netherlands, 2008. IOS Press. ISBN 978-1-58603-891-5. Paxson Vern, December 1999. Bro: A system for detecting network intruders in real-time. Computer Networks, 31:2435–2463. ISSN 1389-1286. Peng Tao, Leckie Christopher, and Ramamohanarao Kotagiri, April 2007. Survey of network-based defense mechanisms countering the DoS and DDoS problems. ACM Computing Surveys (CSUR), 39(1):1–42. ISSN 0360-0300. Pepe Margaret Sullivan and Thompson Mary Lou, 2000. Combining diagnostic test results to increase accuracy. Biostat, 1(2):123–140. Polikar R., Upda L., Upda S.S., and Honavar V., 2001. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part C, 31(4):497–508. Polikar Robi, 2006. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45. Polyak B. T., 1991. New method of stochastic approximation type. Automation Remote Contr., 7:937–946. Poritz A.B., 1988. Hidden Markov models: a guided tour. International Conference on Acoustics, Speech, and Signal Processing ICASSP-88., volume 1, pages 7–13. Porras Phillip A. and Neumann Peter G., 1997. EMERALD: Event monitoring enabling responses to anomalous live disturbances. In Proceedings of the 20th National Information Systems Security Conference, pages 353–365. Proctor Paul E., 2000. Practical Intrusion Detection Handbook. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition. ISBN 0130259608. 307 Provost Foster and Fawcett Tom, 1997. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 43–48, Menlo Park, CA, 1997. AAAI Press. Provost Foster J. and Fawcett Tom, 2001. Robust classification for imprecise environments. Machine Learning, 42(3):203–231. Provost Foster J., Fawcett Tom, and Kohavi Ron, 1998. The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453. Morgan Kaufmann Publishers Inc. Ptacek Thomas H. and Newsham Timothy N., 1998. Insertion, evasion, and denial of service: Eluding network intrusion detection. Technical Report T2R-0Y6, Secure Networks, Calgary, Canada. Qiao Y., Xin X.W., Bin Y., and Ge S., 2002. Anomaly intrusion detection method based on HMM. Electronics Letters, 38(13):663–664. Qin Feng, Auerbach Anthony, and Sachs Frederick, 2000. A direct optimization approach to hidden markov modeling for single channel kinetics. Biophysical Journal, 79(4): 1915–1927. Rabiner L., 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286. Rash Michael, Orebaugh Angela D., Clark Graham, Pinkard Becky, and Babbin Jake, 2005. Intrusion Prevention and Active Response: Deployment Network and Host IPS. Syngress. Raudys Å arunas and Roli Fabio, 2003. The behavior knowledge space fusion method: Analysis of generalization error and strategies for performance improvement. Multiple Classifier Systems, 2709:55–64. Raviv J., 1967. Decision making in markov chains applied to the problem of pattern recognition. Information Theory, IEEE Transactions on, 13(4):536–551. ISSN 0018-9448. Rittscher Jens, Kato Jien, Joga Sébastien, and Blake Andrew, 2000. A probabilistic background model for tracking. Proceedings of the 6th European Conference on Computer Vision-Part II, ECCV00, pages 336–350, London, UK, 2000. SpringerVerlag. ISBN 3-540-67686-4. Roesch Martin, 1999. Snort–lightweight intrusion detection for networks. 13th Systems Administration Conference (LISA), pages 229–238. 308 Rokach L., Maimon O., and Arbel R., 2006. Selective voting–getting more for less in sensor fusion. International Journal of Pattern Recognition and Artificial Intelligence, 20(3):329–350. Rokach Lior, February 2010. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1–39. Roli F., Fumera G., and Kittler J., 2002. Fixed and trained combiners for fusion of imbalanced pattern classifiers. Information Fusion, 2002. Proceedings of the Fifth International Conference on, volume 1, pages 278–284 vol.1. Ruta Dymitr and Gabrys Bogdan, October 2002. A theoretical analysis of the limits of majority voting errors for multiple classifier systems. Pattern Analysis & Applications, 5(4):333–350. Ruta Dymitr and Gabrys Bogdan, March 2005. Classifier selection for majority voting. Information Fusion, 6(1):63–81. ISSN 1566-2535. Ryden T., 1998. Asymptotic efficient recursive estimation for incomplete data models using the observed information. Metrika, 44:119–145. Ryden Tobias, 1994. Consistent and asymptotically normal parameter estimates for hidden Markov models. Annals of Statistics, 22(4):1884–1895. Ryden Tobias, February 1997. On recursive estimation for hidden Markov models. Stochastic Processes and their Applications, 66(1):79–96. Salem Malek Ben, Hershkop Shlomo, and Stolfo Salvatore J., 2008. A survey of insider attack detection research. Jajodia Sushil, Stolfo Salvatore J., Bellovin Steven M., Keromytis Angelos D., Hershkop Shlomo, Smith Sean W., and Sinclair Sara, editors, Insider Attack and Cyber Security, volume 39 of Advances in Information Security, pages 69–90. Springer US. ISBN 978-0-387-77322-3. Saltzer J.H. and Schroeder M.D., September 1975. The protection of information in computer systems. Proceedings of the IEEE, 63(9):1278 –1308. ISSN 0018-9219. Scarfone Karen and Mell Peter, February 2007. Guide to intrusion detection and prevention systems (IDPS). Recommendations of the National Institute of Standards and Technology sp800-94, NIST, Technology Administration, Department of Commerce, USA, 2007. Schraudolph N.N., 1999. Local gain adaptation in stochastic gradient descent. Ninth International Conference on Artificial Neural Networks, ICANN99, volume 2, pages 569–574. Scott M. J. J., Niranjan M., and Prager R. W, September 1998. Realisable classifiers: Improving operating performance on variable cost problems. Lewis Paul H. and 309 Nixon Mark S., editors, Proceedings of the Ninth British Machine Vision Conference, volume 1, pages 304–315, University of Southampton, UK, September 1998. Sekar R., Bendre M., Dhurjati D., and Bollineni P., 2001. A fast automaton-based method for detecting anomalous program behaviors. IEEE Symposium on Security and Privacy, Los Alamitos, CA, USA, 2001. IEEE Computer Society. Shen Changyu, 2008. On the principles of believe the positive and believe the negative for diagnosis using two continuous tests. Journal of Data Science, 6:189–205. Shue L., Anderson D.O., and Dey S., 1998. Exponential stability of filters and smoothers for hidden markov models. Signal Processing, IEEE Transactions on, 46(8):2180– 2194. ISSN 1053-587X. Silberschatz Avi, Galvin Peter Baer, and Gagne Greg, July 2008. Operating System Concepts. John Wiley & Sons. Singer Yoram and Warmuth Manfred K., 1996. Training algorithms for hidden Markov models using entropy based distance functions. Neural Information Processing Systems, pages 641–647. Sivaprakasam S. and Shanmugan Sam K., 1995. A forward-only recursion based HMM for modeling burst errors in digital channels. Global Telecommunications Conference, 1995. GLOBECOM ’95., IEEE, volume 2, pages 1054–1058. Slingsby P.L., 1992. Recursive parameter estimation for arbitrary hidden Markov models. IFAC Adavptive Systems in Control and Signal Processing, Grenoble, France, 1992. Smyth Padhraic, Heckerman David, and Jordan Michael, 1997. Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9: 227–269. Somayaji Anil B., July 2002. Operating System Stability and Security through Process Homeostasis. PhD thesis, University of New Mexico. Stakhanova Natalia, Basu Samik, and Wong Johnny, 2007. A taxonomy of intrusion response systems. International Journal of Information and Computer Security, 1 (1/2):169–184. Stenger B., Ramesh V., Paragios N., Coetzee F., and Buhmann J.M., 2001. Topology free hidden Markov models: Application to background modeling. Proceedings of the IEEE International Conference on Computer Vision, 1:294–301. Stiller J.C., 2003. Adaptive online learning of generative stochastic models. Complexity (USA), 8(4):95–101. ISSN 1076-2787. Stiller J.C. and Radons G., 1999. Online estimation of hidden Markov models. IEEE Signal Processing Letters (USA), 6(8):213–15. ISSN 1070-9908. 310 Tan K.M.C. and Maxion R.A., 2002. "Why 6?" Defining the operational limits of stide, an anomaly-based intrusion detector. IEEE Symposium on Security and Privacy, pages 188–201. Tan K.M.C. and Maxion R.A., 2003. Determining the operational limits of an anomalybased intrusion detector. IEEE Journal on Selected Areas in Communications, 21 (1):96–110. Tan K.M.C. and Maxion R.A., 2005. The effects of algorithmic diversity on anomaly detector performance. International Conference on Dependable Systems and Networks (DSN), pages 216–225. Tan Kymie M. C., Killourhy Kevin S., and Maxion Roy A., October 2002. Undermining an anomaly-based intrusion detection system using common exploits. Proceedings of the 5th international conference on Recent advances in intrusion detection, volume 2516 of RAID’02, pages 54–73, Berlin, Heidelberg, October 2002. SpringerVerlag. Tandon G. and Chan P., 2006. On the learning of useful system call attributes for hostbased anomaly detection. International Journal on Artificial Intelligence Tools, 15 (6):875–892. Tao Qian and Veldhuis Raymond, 2008. Threshold-optimized decision-level fusion and its application to biometrics. Pattern Recognition, 41(5):852–867. ISSN 0031-3203. Tarnas Christopher and Hughey Richard, 1998. Reduced space hidden markov model training. Bioinformatics, 14:401–406. Thomopoulos S.C.A., Viswanathan R., and Bougoulias D.K., 1989. Optimal distributed decision fusion. IEEE Transactions on Aerospace and Electronic Systems, 25(5): 761–765. Titterington D. M., 1984. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2):257–267. Tosun Umut, 2005. Hidden Markov models to analyze user behaviour in network traffic. Technical report, Bilkent University 06800 Bilkent, Ankara, Turkey. Tsai Chih-Fong, Hsu Yu-Feng, Lin Chia-Ying, and Lin Wei-Yang, December 2009. Intrusion detection by machine learning: A review. Expert Systems with Applications, 36(10):11994–12000. ISSN 0957-4174. Tsoumakas Grigorios, Partalas Ioannis, and Vlahavas Ioannis, 2009. An ensemble pruning primer. Applications of Supervised and Unsupervised Ensemble Methods, 245:1–13. Tucker C.J., Furnell S.M., Ghita B.V., and Brooke P.J., 2007. A new taxonomy for comparing intrusion detection systems. Internet Research, 17:88–98. 311 Tulyakov Sergey, Jaeger Stefan, Govindaraju Venu, and Doermann David, 2008. Review of classifier combination methods. Simone Marinai Hiromichi Fujisawa, editor, Studies in Computational Intelligence: Machine Learning in Document Analysis and Recognition, pages 361–386. Springer. Turner Rolf, 2008. Direct maximization of the likelihood of a hidden Markov model. Computational Statistics and Data Analysis, 52(9):4147–4160. ISSN 0167-9473. Ulaş Aydin, Semerci Murat, Yildiz Olcay Taner, and Alpaydin Ethem, 2009. Incremental construction of classifier and discriminant ensembles. Information Sciences, 179 (9):1298–1318. Vaccaro H.S. and Liepins G.E., May 1989. Detection of anomalous computer session activity. IEEE Symposium on Security and Privacy, pages 280–289. Van Erp Merijn and Schomaker Lambert, 2000. Variants of the borda count method for combining ranked classifier hypotheses. Seventh International Workshop on Frontiers in Handwriting Recognition, Amsterdam, 2000. Varshney P. K., 1997. Distributed detection and data fusion. Springer-Verlag, New York. Venkataramani Krithika and Kumar B., 2006. Role of statistical dependence between classifier scores in determining the best decision fusion rule for improved biometric verification. Multimedia Content Representation, Classification and Security, 4105: 489–496. Vigna G. and Kruegel C., December 2005. Host-based intrusion detection systems. Bigdoli H., editor, Handbook of Information Security, volume III. Wiley. Viterbi A. J., April 1967. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Infromation Theory, IT-13: 260–269. Wagner David and Dean Drew, 2001. Intrusion detection via static analysis. Proceedings of the 2001 IEEE Symposium on Security and Privacy, Washington, DC, USA, 2001. IEEE Computer Society. Wagner David and Soto Paolo, 2002. Mimicry attacks on host-based intrusion detection systems. Proceedings of the 9th ACM conference on Computer and communications security, pages 255–264, Washington, DC, United States, 2002. Walter S. D., 2005. The partial area under the summary ROC curve. Statistics in Medicine, 24(13):2025–2040. Wang Panhong, Shi Liang, Wang Beizhan, Wu Yuanqin, and Liu Yangbin, August 2010. Survey on HMM based anomaly intrusion detection using system calls. Computer Science and Education (ICCSE), 2010 5th International Conference on, pages 102– 105. 312 Wang Shaojun and Zhao Yunxin, December 2006. Almost sure convergence of Titterington’s recursive estimator for mixture models. Statistics and Probability Letters, 76 (18):2001–2006. Wang Wei, Guan Xiao-Hong, and Zhang Xiang-Liang, 2004. Modeling program behaviors by Hidden Markov Models for intrusion detection. Proceedings of 2004 International Conference on Machine Learning and Cybernetics, 5:2830–2835. Warrender Christina, Forrest Stephanie, and Pearlmutter Barak, 1999. Detecting intrusions using system calls: Alternative data models. Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy, pages 133–45, Oakland, CA, USA, 1999. Weinstein E. M. Feder and Oppenheim A. V., 1990. Sequential algorithms for parameter estimation based on the Kullback-Leibler information measure. IEEE Transactions on Acoustics Speech and Signal Processing, 38(9):1652–1654. Wolpert David H., 1992. Stacked generalization. Neural Networks, 5:241–259. Wolpert D.H. and Macready W.G., apr 1997. No free lunch theorems for optimization. Evolutionary Computation, IEEE Transactions on, 1(1):67–82. ISSN 1089-778X. Wu C. F. Jeff, March 1983. On the convergence properties of the EM algorithm. Annals of Statistics, 11(1):95–103. Yeung Dit-Yan and Ding Yuxin, 2003. Host-based intrusion detection using dynamic and static behavioral models. Pattern Recognition, 36(1):229–243. Zhang Dong D., Zhou Xia-Hua, Freeman Jr. Daniel H., and Freeman Jean L., 2002. A non-parametric method for the comparison of partial areas under ROC curves and its application to large health care data sets. Statistics in Medicine, 21(5):701–715. Zhang Xiaoqiang, Fan Pingzhi, and Zhu Zhongliang, 2003. A new anomaly detection method based on hierarchical HMM. Parallel and Distributed Computing, Applications and Technologies, 2003. PDCAT’2003. Proceedings of the Fourth International Conference on, pages 249–252. Zhang Zhihao and Zhou Jie, September 2010. Transfer estimation of evolving class priors in data stream classification. Pattern Recognition, 43(9):3151–3161. ISSN 00313203. Zucchini Walter and MacDonald Iain, 2009. Hidden Markov and Other Models for Discrete-valued Time Series: An Introduction Using R. Monographs on statistics and applied probability, 110. Chapman & Hall.