Wi-Fi Activity in Open Environments
PhD Thesis
University Pierre and Marie Curie (Université Pierre et Marie Curie)
Doctoral school: Computer Science, Telecommunications, and Electronics (École doctorale Informatique, Télécommunications et Électronique)

presented by Thomas Claveirole
submitted for the degree of Doctor of Science of the University Pierre and Marie Curie

Wi-Fi Activity in Open Environments: Tools, Measurements, and Analyses
(French title: Activité Wi-Fi en environnement ouvert : outils, mesures et analyses)

To be defended on 26 February 2010 before the committee composed of:
Ana Cavalli, Reviewer, Prof., TELECOM & Management SudParis
Thierry Turletti, Reviewer, INRIA researcher
Khaldoun Al Agha, Examiner, Prof., University of Paris-Sud 11
Guillaume Chelius, Examiner, INRIA researcher
Marcelo Dias de Amorim, Co-advisor, CNRS researcher
Serge Fdida, Co-advisor, Prof., University Pierre and Marie Curie

Acknowledgments (Remerciements)

First of all, I would like to thank my reviewers and my committee. Established researchers giving me their time despite busy schedules, when nothing obliges them to: this has always seemed a little odd to me. I can therefore only express my gratitude. Thank you to Ana Cavalli, Thierry Turletti, and Guillaume Chelius. I take the liberty of thanking Khaldoun Al Agha separately, and especially, because he is the one who threw me into networking, five years ago. Likewise, it is impossible to forget Serge Fdida and Marcelo Dias de Amorim.
Thank you for the supervision, for the welcome, for everything you have given me. Marcelo, in particular, I have something to tell you. Several times I came to see you with my morale at rock bottom. I am at a dead end, nothing is going to work as planned, that much is certain. And every time, you talk, and there I am, all fired up, more motivated than ever. I do not know how you do it. I have not figured out your secret. But I hope to only ever work with people as motivating as you.

I also thank Mathias Boc. Without our collaboration, this thesis would be a little less complete. And, let us admit it, I owe him half of a trip to San Francisco...

There are also some rather special people whom I absolutely want to thank: my friends from my former school, ÉPITA. Among others, the research and development laboratory (Akim, Théo, Raph., and so many others). But also Claire, Max., Alexandre, and all the (former) student assistants. Thank you. Long before my thesis I shared a lot with you. Not only work, but a lot of work all the same. And it changed the way I approach computer science. At the risk of sounding crazy, I will say it: I think your personalities seep through my research, my lines of code, my papers.

But I have many other friends to thank. It took me time to fully appreciate what my thesis owes them. It also took a misunderstanding on the subject setting me against one of them. By mistake, almost by chance, at the very moment I am writing these lines. So thank you Yosra, Brice, Salim, Amélie, Mathias, Cédric. Thank you P.-E., Thomas, Clémence, Matthieu, Anneli. Thank you Magali, thank you Sophia, thank you Élodie. Thank you Pierre. That makes a lot of thanks, and a lot of people, it is true. And yet I am sure I am forgetting some. Plenty. And yet they all took part in my thesis indirectly, not always knowingly. Thank you.

Finally, I close with a thought for my family.
And especially for my grandparents: they have already told me how proud they will be to have a grandson who is a doctor.

Résumé

For about ten years, Wi-Fi has enjoyed enormous success. As a result, a significant part of networking research consists in measuring its underlying protocol, IEEE 802.11, in order to understand it better. Sniffing is one of the techniques useful to this understanding. It consists in deploying monitors within a measurement area, which record all the traffic they can hear. It is a passive technique, and each monitor produces packet traces. It is also a fundamental step in a number of operations, including network troubleshooting, security enhancement, and the behavioral analysis of some protocols.

Existing work based on sniffing raises a number of questions. Although this technique essentially relies on manipulating IEEE 802.11 packet traces, there is no generic software toolbox to perform these manipulations. As a result, efforts are duplicated, some tools are too specific, interoperability is sometimes poor, and performance is not always satisfactory. This is particularly true of trace merging. Although it is a step common to several studies, few tools exist, and their usability is limited. Beyond these prosaic problems, there are also higher-level questions. First, there is uncertainty about the accuracy one can expect from monitors. Second, most studies concentrate on the low-level characteristics of IEEE 802.11. Since this protocol is now present on new categories of devices, notably mobile devices, it would also be interesting to study its usage habits rather than protocol problems.
Finally, most experiments concentrate on academic environments (universities, laboratories, conferences). It is reasonable to expect other environments to exhibit different characteristics.

In this thesis, we propose WiPal, a software suite for processing IEEE 802.11 packet traces, and we use it to address the problems described above. WiPal includes a generic library for manipulating packet traces and IEEE 802.11 frames. It also provides a set of programs built on top of this library. These make it possible to perform various operations (for example concatenation or comparison), extract statistics, anonymize traces, or merge traces. To make WiPal generic and efficient, we developed several specific algorithms, as well as optimizations to process large datasets efficiently.

Using WiPal, we carry out several analyses in different environments. By analyzing two short-duration datasets, we improve our understanding of the accuracy of sniffing. Then, by analyzing three long-duration datasets (several days) in different environments, we gain a better understanding of users' daily behaviors with respect to wireless networks. These environments have different social characteristics: an office space, a suburban housing area, and a dense urban residential area. Our results reveal new and unexpected properties. For example, we show that the usual techniques for measuring trace accuracy are not as reliable as expected, and that long-duration traces contain a very small proportion of regular users.

Keywords (Mots-clefs)
Measurement, Wi-Fi, IEEE 802.11, sniffing, packet trace, trace merging.
Abstract

For about a dozen years, Wi-Fi has enjoyed tremendous success. Consequently, a large part of networking research has focused on measuring and understanding its underlying protocol, IEEE 802.11. Among the techniques that have proved useful to this research is wireless sniffing. This passive measurement technique consists in spreading within some target area a number of monitors that capture all the wireless traffic they hear to produce packet traces. It is a fundamental step in a number of network operations, including network diagnosis, security enhancement, and behavioral analysis of protocols.

Existing work based on wireless sniffing nevertheless raises a number of issues. Although IEEE 802.11 packet trace manipulations are fundamental to this technique, no generic framework exists to carry them out. This results in duplicated efforts among scientists, overspecialized tools, poor interoperability, and sometimes sub-optimal performance. This is especially true for trace merging: though it is a common step in many studies, only a few tools exist to merge Wi-Fi traces, and they have limited usability. Beyond these “prosaic” problems, there are also more challenging questions. First, there is a lack of insight into the accuracy one can expect from wireless sniffers. Second, most studies focus on low-level characteristics of IEEE 802.11. As Wi-Fi now equips new categories of mobile devices, studying usage patterns rather than protocol issues also becomes interesting. Finally, most experiments collect traces in “academic” environments (university campuses, laboratories, or conference venues). It is likely that other environments would display different properties.

In this thesis we propose WiPal, a framework to process IEEE 802.11 packet traces, and use it to tackle the aforementioned issues. WiPal includes a generic library to handle packet traces and IEEE 802.11 frames. It also provides a set of programs atop this library.
These programs feature miscellaneous operations such as concatenation or comparison, statistics extraction, trace anonymization, and, most notably, trace merging. We developed a number of specific algorithms and optimizations in order to make WiPal a generic tool able to cope efficiently with large datasets.

By using WiPal, we perform a number of analyses on traces we collected in various environments. Through the analysis of two short-lived datasets using up to eight monitors, we extend our understanding of the accuracy of Wi-Fi traces. Then, through the analysis of three long-lived datasets (several days), we obtain a better understanding of people’s daily behaviors with respect to the underlying wireless network. These environments carry different sociological meanings: an office area, a sparse residential area, and a dense residential area. Our results reveal previously unseen and unexpected properties. For instance, traditional techniques to estimate trace accuracy are much less reliable than previously thought, and regular users account for a very small portion of the total population in long-lived traces.

Keywords
Measurement, Wi-Fi, IEEE 802.11, sniffing, packet trace, trace merging.

Contents

Acknowledgments (Remerciements)
Résumé
Abstract
Contents

1 Introduction
  1.1 Context: Wi-Fi measurements
  1.2 Issues with Wi-Fi sniffing and related techniques
  1.3 Contributions of this thesis
    1.3.1 WiPal: manipulating IEEE 802.11 traces
    1.3.2 Applying WiPal: empirical analyses
  1.4 Outline

I The WiPal trace manipulation framework

2 WiPal: overview and design
  2.1 Trace manipulation tools: related work
  2.2 Overview of WiPal
    2.2.1 Features
    2.2.2 Overall architecture
  2.3 Packet parsing
    2.3.1 PHY headers
    2.3.2 IEEE 802.11 parsing
  2.4 Filters
    2.4.1 Filter sources: pcap abstractions
    2.4.2 Processing filters
  2.5 Performance evaluation
    2.5.1 Methodology
    2.5.2 Results
  2.6 Conclusion

3 WiPal: IEEE 802.11 trace merging
  3.1 Trace merging: state of the art
  3.2 WiPal’s basics
  3.3 Detailed operation of WiPal’s trace merging
    3.3.1 Identifying reference frames
    3.3.2 Extraction of unique frames
    3.3.3 Intersection
    3.3.4 Synchronization
    3.3.5 Merging
  3.4 Evaluation
    3.4.1 Correctness
    3.4.2 Efficiency
  3.5 Conclusion

II Applying WiPal: empirical analyses

4 Accuracy of wireless packet sniffing
  4.1 Completeness evaluation: state of the art
  4.2 Datasets
    4.2.1 Overview
    4.2.2 Preliminary analysis
  4.3 Completeness evaluation: shortcomings
  4.4 Completeness and number of sniffers
    4.4.1 Methodology
    4.4.2 Results
  4.5 Conclusion

5 Empirical analysis of Wi-Fi activity in three urban scenarios
  5.1 Setup
  5.2 Device diversity
    5.2.1 Cumulated activity durations
    5.2.2 Growth of the number of devices
  5.3 Activity/Mobility Behaviors
    5.3.1 Inter-activity patterns
    5.3.2 Predominant activity pattern
  5.4 Conclusion

6 Conclusion and future work
  6.1 WiPal
  6.2 Wi-Fi sniffing accuracy
  6.3 Wi-Fi activity
  6.4 Perspectives

Appendices

A Summary of the thesis in French (Résumé de la thèse en français)
  A.1 Context
    A.1.1 Passive Wi-Fi measurements and sniffing
    A.1.2 Open questions
  A.2 Contributions of this thesis
    A.2.1 WiPal: manipulating IEEE 802.11 traces
    A.2.2 Applying WiPal: empirical analyses
  A.3 Conclusion

B WiPal manual
  B.1 The programs
    B.1.1 Invocation
    B.1.2 Concatenation (and Prism noise filtering)
    B.1.3 Comparisons
    B.1.4 Sub-traces
    B.1.5 Merging
    B.1.6 Synchronization
    B.1.7 Unique frames
    B.1.8 Duplicate data frames
    B.1.9 Statistics
    B.1.10 Anonymization
    B.1.11 Miscellaneous programs
    B.1.12 Undocumented programs
  B.2 The library
  B.3 FAQ

C List of publications
  C.1 Journals
  C.2 Conferences
  C.3 Demos and posters
  C.4 Software
  C.5 Under review

Bibliography
List of Figures
List of Tables
Listings

Chapter 1 Introduction

The IEEE 802.11 standard [37] defines the base layers for wireless communications. It appeared about a dozen years ago under the Wi-Fi trademark and is widely used today. Personal computers featuring IP communications over wireless links rely almost exclusively on this protocol. Furthermore, Wi-Fi also plays a major role on other wireless-capable mobile devices: it is available on most PDAs, smartphones, portable music players, and even some digital cameras. As a consequence, Wi-Fi is part of the landscape of ubiquitous computing [58]. Along with other wireless protocols, such as Bluetooth or GSM, it is involved in creating a transparent digital environment for everyday life. For instance, Wi-Fi access points provide Internet access (hotspots) in households, hotels, conferences, and many other places.
Understanding how IEEE 802.11 implementations behave in the wild, and what the usage patterns of their users are, is therefore essential. This insight is necessary for developing new applications and protocols, or improving existing ones.

1.1 Context: Wi-Fi measurements

IEEE 802.11 specifies a physical layer (PHY) and a medium access control scheme (MAC) for wireless networks. The PHY is in charge of encoding and decoding digital information (bit sequences) to and from radio wave signals. The MAC, on the other hand, schedules transmissions so that devices can share the medium without interfering with each other.

Although it is mostly an industry-driven standard, computer scientists have produced a large body of research concerning IEEE 802.11. This includes specialized topics such as studying its PHY [30;45], its MAC [46], and other features such as security [12;14]. More generic research topics also involve this protocol: ad hoc and mesh networking [10;27], sensor networks [60], or pervasive computing [58]. A proper understanding of IEEE 802.11 can therefore benefit all these topics. To achieve this understanding one needs both theoretical analyses and practical experiments. This thesis focuses on experiments, and more specifically on measurements of wireless networks in the wild.

Figure 1.1: Wireless sniffing: passive monitors listen to the wireless activity inside the measurement area.

Every network measurement is either active or passive. Active measurements alter network traffic so they can evaluate various parameters. Classic active techniques include saturating a link to measure maximum throughput, or sending probes back and forth to evaluate round-trip delays. Conversely, passive measurements do not interfere with network traffic. This is the case, for instance, when tapping a network link to analyze packet flows.
Note that passive techniques might still interfere with the infrastructure: they might require users to install specific software, or administrators to plug in specific tapping equipment.

A common passive technique to measure wireless networks is sniffing. Wireless sniffing consists in spreading within some target area a number of monitors (or sniffers) that capture all wireless traffic they hear (see Figure 1.1). Sniffers produce traces composed of a succession of MAC frames. Wireless sniffing is a fundamental step in a number of network operations, including network diagnosis [23;34], security enhancement [12;48], and behavioral analysis of protocols [22;39;43;59]. Although not mandatory, it is also possible to use wireless sniffing to support some location systems [20;21;61]. It comes in a variety of flavors: there might be one or several sniffers, sniffers may be commodity hardware or specialized devices, and they can operate offline or be part of a wired infrastructure (among other parameters). In any case, the sniffing operation is passive and does not interfere with the network’s normal operation.

Wireless sniffing often involves a centralized process that is responsible for merging the traces [22;43;59]. The objective is to obtain a global view of the wireless activity from multiple local measurements. By providing overlapping coverage zones, it is also possible to compensate for frame losses with data from different sniffers. Merging is however a difficult task; it requires precise synchronization among traces (to within a few microseconds) and coping with the unreliable nature of the medium (frame loss is unavoidable).

1.2 Issues with Wi-Fi sniffing and related techniques

There are still, however, a number of issues with Wi-Fi sniffing. This thesis focuses on technical issues (see footnote 1). We categorize them in two classes: issues with the technique itself and issues with the tools.
This thesis addresses both, in an effort to collect new datasets and produce original analyses.

Issues with the technique relate to the relevance of the produced traces. This includes sniffer accuracy. Even in good radio conditions, sniffers may miss successfully transmitted frames. In this context, natural questions arise: since each sniffer trace is incomplete (i.e., lacks some frames), a merged trace is likely to be incomplete as well. What accuracy can one expect from a single sniffer? From multiple sniffers? What results can be drawn from incomplete traces?

Another issue regarding the relevance of traces concerns the available datasets. Although Wi-Fi is almost ubiquitous, most of the datasets made available by the research community come from university campuses, laboratories, or conference venues [2]. This is partly due to current practices focusing on environments that are easy for researchers to access, but also to the fact that existing monitoring techniques only fit specific scenarios. Most of the techniques available in the literature either focus on a single network, require setting up a whole infrastructure, or need intrusive access to one’s network. Down the street or inside an individual’s house, such techniques are therefore difficult to implement. Wireless sniffing, however, has strong potential for monitoring all kinds of environments: it is passive, it does not interfere with the monitored networks’ infrastructures, and in some cases it does not need to rely on any infrastructure at all. But this potential has remained unexploited so far. As a consequence, most researchers restrict sniffing to studying protocol quirks [22;39;43]. We think, however, that sniffing could be a great tool for studying wireless network usage in usually hard-to-reach environments (e.g., individual houses, streets, or parks).

Issues with tools mostly relate to the manipulation of packet traces.
Many network operations involve these traces: administrators use them for monitoring or troubleshooting, researchers use them for measurements, simulations, or validations. Wireless sniffers produce packet traces consisting of lists of MAC frames. Many tools exist for their creation and manipulation, but most of these are designed for a very specific goal and carry their own packet-processing code. For instance, tcpdump [8] understands many network protocols, but its parsing code cannot be used for any purpose other than displaying packets on a terminal. As another example, Wireshark [6] is more generic, but it is still mostly visualization-oriented and suffers from similar issues. Most packet processing programs are well designed and very efficient within their focus, but each time one creates another tool to handle packet traces, it is impractical to rely on previous code. Furthermore, some tools suffer from performance issues (for instance, Scapy [5] is a powerful tool to analyze packet traces but is not tractable on large traces, i.e., 1 GB or more). All of this makes carrying out custom analyses on sniffer traces a tedious process. It often requires developing new tools from scratch.

Footnote 1: Some non-technical issues also exist, for instance legal and ethical issues [11;55].

For the same reasons, merging IEEE 802.11 traces is also an issue. The literature has provided the community with a few merging tools, but most of them require a wired infrastructure [22;28]. The others are too specific to the experiments conducted in the papers [43;44]. In order to make Wi-Fi sniffing generalizable to any environment, one needs tools that are both generic and independent of a wired infrastructure.

1.3 Contributions of this thesis

This thesis’ contributions are twofold. First, we develop a framework, called WiPal, to help process IEEE 802.11 packet traces.
This framework includes a generic library to help develop new tools, and several hands-on utilities to perform predefined operations on trace files. These utilities include a trace merger with innovative features. Second, we perform two analyses using these tools. In order to carry out these analyses we collect several datasets in various environments, including day-long traces from a residential area and an uptown location. The first analysis focuses on the accuracy of Wi-Fi sniffing, while the second studies Wi-Fi usage patterns.

1.3.1 WiPal: manipulating IEEE 802.11 traces

The first part of this thesis presents WiPal, our packet trace manipulation framework. WiPal was designed for performance, without any specific application in mind, in the hope that others could rely on it to develop custom trace processing software. Though it focuses on handling the IEEE 802.11 protocol, it provides several protocol-agnostic features. What makes WiPal interesting is its original design and some of its novel features. In this thesis:

• We present generic patterns for handling various types of packet traces. For instance, using a pipe and filter mechanism to process packet traces, or using a static callback mechanism to generate frame parsers that are both efficient and generic.
• We present how some novel features might benefit packet processing programs, and how to implement them. For instance, random access to a packet trace, or the ability to treat the aggregation of several files as a single packet stream.
• We raise a number of issues a program designer might encounter when writing packet trace processing software. We discuss existing practices to solve them and the specific solutions adopted by WiPal (how and why).
• We evaluate the performance of WiPal and compare it with other tools. The results show that WiPal’s generic design does not penalize its execution speed: it can compete with specialized code.
Also, some new features do not impact performance, and others, which are optional, only imply a limited overhead.

A distinctive component of WiPal is its merging tool. This tool works offline and is able to merge IEEE 802.11 packet traces. Its key features are performance, ease of use, and flexibility. As a consequence, its design does not assume properties of traces that would require monitors to access a network infrastructure (e.g., some tools require network synchronization [22]). It also supports most of the existing input formats (e.g., raw IEEE 802.11 frames, and Prism, Radiotap, and AVS headers). Finally, it is usable in a straightforward fashion by just calling the appropriate programs on trace files (other mergers require more complex setups, generally involving various servers [22;28;43]). This thesis motivates and describes the design of WiPal’s trace merger:

• It proposes new algorithms for various stages of the merging process. In particular, the synchronization algorithm is a generalization of previous algorithms from the literature.
• It provides an analysis of the synchronization algorithm; we show that our algorithm is more accurate than previous algorithms.
• It provides a performance study that shows WiPal’s merger is an order of magnitude faster than the other publicly available offline merger, namely Wit [43].

Our analyses rely on sixteen real traces from four distinct datasets (CRAWDAD’s uw/sigcomm2004 [50], recorded during the SIGCOMM 2004 conference, and three private datasets we collected in various conditions). They allow us to calibrate various parameters of WiPal, validate its trace merger’s operation, and show its efficiency. We believe that WiPal will be of great utility for the research community working on wireless network measurements.

1.3.2 Applying WiPal: empirical analyses

WiPal enables us to carry out analyses on datasets we collect using Wi-Fi sniffing. The second part of this thesis presents two of these analyses.
The first one focuses on the accuracy of Wi-Fi sniffing. The second one studies Wi-Fi usage patterns in environments with different sociological meanings.

First, we collect short-lived (up to two hours) datasets using up to eight monitors sharing the same location. Analyzing these traces reveals that existing techniques to evaluate trace completeness are inaccurate. Among other issues, we observe that a single buggy device can be responsible for misleading the whole system. We also investigate how the number of sniffers impacts trace completeness. We show that even though individual sniffers may provide good accuracy, sometimes using eight sniffers is still not enough to capture all frames. Furthermore, the sniffing process exhibits a high level of randomness with variable accuracy.

Second, we record and analyze long-lived traces (three-day long and ten-day long) obtained in three environments: an office, a dense uptown residential area, and a sparse suburban residential area. We focus on the behavior of the devices rather than on the traffic characteristics. We are interested in observations such as the total duration a device is active, the frequency of appearance of new devices, and the activity patterns that can be extracted from traces. Among a number of results, we show that: (i) independently of the trace, most devices are inactive most of the time, (ii) due to mobility, two traces have a constant discovery rate of new users, even after days of measurements, and (iii) as the environments are part of users’ lives along a typical day, activity intensity alternates between residential and office areas.

1.4 Outline

For the sake of clarity, we decided to split the description of related work: each chapter includes the part related to its concerns. There are four chapters divided into two parts: the first part focuses on WiPal while the second part presents empirical studies. The first part comprises Chapters 2 and 3. Chapter 2 gives a general overview of WiPal.
It presents its original features and design, and also includes a performance evaluation of this design. Chapter 3 focuses on WiPal’s trace merging process. It presents some algorithms and optimizations that WiPal uses for this process. This includes an evaluation of WiPal’s performance with regard to trace merging. The second part presents the empirical analyses we performed using WiPal. It features Chapters 4 and 5. Chapter 4 is a study of Wi-Fi sniffing accuracy. Chapter 5 studies Wi-Fi usage patterns in various environments. Finally, appendices include WiPal’s manual and a list of references.

Part I
The WiPal trace manipulation framework

Chapter 2
WiPal: overview and design

WiPal emerged from our need to manipulate Wi-Fi packet traces. At its creation, existing tools lacked features regarding some operations (e.g., merging or extracting statistics). We therefore developed WiPal to fulfill these needs. WiPal has a number of features, but its most significant one is certainly trace merging, and it includes some original algorithms in this regard (see footnote 1). This chapter reports on our experience designing the WiPal framework: we draw attention to WiPal’s design and to some of its features, which are original and might benefit other software developers faced with similar issues. WiPal is free software, available at http://wipal.lip6.fr/. Appendix B includes WiPal’s manual.

In the following, Section 2.1 first gives a short overview of existing software for packet trace manipulation. Then Section 2.2 gives an overview of WiPal’s design and features. The two subsequent sections focus on WiPal’s modules: Section 2.3 addresses packet parsing, while Section 2.4 describes WiPal’s pipe and filter mechanisms. Finally, Section 2.5 evaluates WiPal’s performance.

2.1 Trace manipulation tools: related work

Packet traces are lists of network packets, either synthetic or, more commonly, acquired by tapping a network medium.
They are involved in a significant part of network operations: administrators use them for monitoring and troubleshooting; researchers use them in measurements, simulations, or validations. As a consequence, many tools exist for their creation and manipulation (this includes, for instance, visualization or filtering) [29]. A common format is also prevalent for packet trace operations: pcap (packet capture) [7].

However, most packet trace processing tools are designed for a very specific goal, and carry their own packet-processing code. For instance, tcpdump understands many network protocols, but its parsing code cannot be used for any purpose other than displaying packets on a terminal [8]. Many tools rely on libpcap, but this library focuses on capturing packets and does not export parsing or processing capabilities [7]. Wireshark is more modular, but still mostly visualization-oriented [6]. All these programs are well designed and very efficient within their focus, but each time one creates another tool to handle packet traces, it is impractical to rely on previous code.

Scapy is a notable exception [5]. It is an interactive packet manipulation program written in Python, able to parse many network protocols, providing features to read and write trace files, or interact with the network. Nevertheless, the scripting nature of Scapy makes it dedicated to prototyping, experiments, short programs, or at least programs where performance is not an issue. Scapy’s features also stop at the packet level; they do not provide trace-processing algorithms.

Another interesting piece of software is binpac [47]. binpac is a parser generator similar to Yacc [40], but it focuses on network protocols. binpac is efficient but only handles unicast streams. It is therefore not suited for sniffer traces.

We designed WiPal to solve these issues regarding reusability and performance.

Footnote 1: Chapter 3 details WiPal’s merging algorithms.
Although WiPal focuses on Wi-Fi traces, it provides several protocol-agnostic features.

2.2 Overview of WiPal

At the very beginning, WiPal started as a limited C++ library to parse IEEE 802.11 frames. The library then grew with the applications using it. Due to our focus on reusability, we designed these applications as shells around WiPal’s features. Each new feature was therefore part of WiPal rather than of a specific program. Eventually, applications relied so much on WiPal and had so few features of their own that they were merged into WiPal. Our need for solutions regarding all aspects of packet trace processing made WiPal a consistent generic library rather than a patchwork of specific features.

Another critical aspect of WiPal is performance. In the literature, some libraries exist to help users parse various network protocols, but they are only available for scripting languages (e.g., Scapy [5]). While this is especially well suited for quick prototyping and experimentation, it is impractical for handling large traces (especially for heavy computations). WiPal’s most complex utilities can process gigabyte-long traces in minutes. WiPal uses C++ because of this combined importance of reusability, performance, and genericity.

2.2.1 Features

WiPal comes both as a library and a set of binaries (programs). Binaries provide a quick and simple interface for high-level features, but these features are also available as library services. Low-level features, though, are only available through the library.

As an example, one does not need to write a program for merging several trace files. The following command does the job:

    $ wipal-merge t1.pcap t2.pcap [t3.pcap...]

High-level features include trace synchronization (using the wipal-synchronize binary), trace merging (using wipal-merge), statistics extraction (wipal-stats), trace anonymization (wipal-anonymize), and various minor utilities such as comparison, concatenation, or hexadecimal dumping (wipal-cmp or wipal-cat, to name a few). The most important low-level features are pcap file I/O, IEEE 802.11 parsing, and support for other IEEE 802.11 related protocols.

Note that wipal-merge’s code is just a shell around the library features. As of this writing, WiPal binaries have an average length of 122 lines of C++ (the whole of WiPal, including the library, is about 20k lines of code). The smallest binary is 44 lines of code, and the biggest 267. Most of this code is boilerplate due to specific C++ programming techniques.

On the other hand, performing a specific task using WiPal’s parser, or combining several treatments in one executable file, requires users to write their own programs using the WiPal library. Listing 2.1 shows a sample program using this library. This program has few features, but other snippets will extend it in the following sections.

     1  #include <wipal/pcap/stream.hh>
     2  #include <wipal/wifi/frame.hh>
     3
     4  using namespace wpl;
     5
     6  int main()
     7  {
     8    pcap::file<> f ("file.pcap");
     9
    10    for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
    11      std::cout << wifi::type::names[wifi::type_of(i->bytes())] << std::endl;
    12  }

Listing 2.1: A sample program using WiPal. This program prints the type of every IEEE 802.11 frame included in file.pcap.

2.2.2 Overall architecture

Figure 2.1 presents a simplification of WiPal’s structure. Binaries (on top) rely on the library, and the library itself relies on other external libraries. The WiPal library is also composed of several modules. We classify each module into one of three categories: base, protocols and file formats, or filters.

Base.
These modules provide simple and common features, unrelated to WiPal’s specific domain. For instance, they include various exceptions for error handling, generic abstract classes, and static programming helpers. We kept this layer as thin as possible thanks to external libraries such as Boost [1] or GNU MP [3].

Figure 2.1: WiPal’s structure and modules.

Protocols and file formats. These modules are domain specific and provide the base for high-level processing. They feature abstractions such as IEEE 802 addresses, pcap traces, and protocol headers.

Filters. One may view a packet trace as a packet stream. Most algorithms just read this stream linearly, one packet after another, from its beginning to its end. This mode of operation is particularly well suited to a pipe and filter pattern [17]. Therefore, we base WiPal’s high-level modules on this design. WiPal provides pipe input and output through iterators [32]. The instantiation of a filter object requires one or several iterators as input, and each object provides an output iterator. For instance, an anonymizer filter reads packets and outputs anonymized packets. A merge filter requires two packet streams as input but produces one output stream.

Some processing tasks need to be adapted to this pattern. For instance, simultaneously synchronizing and merging two IEEE 802.11 packet traces is a complex operation [22;44;59]. Implementing it in WiPal means decomposing it into several filters and then using a specific wiring for these filters (see Figure 2.2). Every algorithm that needs to access a packet trace non-linearly needs such an adaptation.

The base modules, as well as some of the protocols and file formats modules, form the lower modules of the WiPal library. On the other hand, filters form the higher modules.

2.3 Packet parsing

Although network packets often use a binary format, parsing them is not always straightforward. This is the case even when considering only Wi-Fi packet traces.
Furthermore, distinct traces may involve distinct formats in addition to IEEE 802.11 (see Section 2.3.1 below). IEEE 802.11 packets may have several types and subtypes, and each type/subtype pair yields a distinct format (although all formats share some similarities). A well-crafted program should handle as many formats as possible, and handle each field properly according to its type. Implementing a new format should not require modifying existing code. It should be possible to perform various processing on the same frame without modifying the frame parser. In the following, we describe WiPal’s mechanisms that enable users to write such programs.

Figure 2.2: A complex filter example. This figure shows how WiPal uses filters to synchronize and merge two IEEE 802.11 traces. Each box represents a filter and arrows show pipes. Pipes convey different types of data.

2.3.1 PHY headers

IEEE 802.11 packet traces often include extra information about the physical parameters of transmissions (e.g., frequency, signal-to-interference ratio, or precise timestamp). This information is available as an extra packet header inserted by the operating system for each frame. We call this header a PHY header.

A pcap file is a succession of chunks, each chunk containing a pcap header and a byte sequence corresponding to a packet. PHY headers are located at the beginning of this byte sequence, between the pcap header and the IEEE 802.11 header. Inside packet traces that do not include PHY headers, an IEEE 802.11 header appears directly after each pcap header.

There is no reference format for PHY headers: hardware vendors introduced them independently of any standardization process. Open source developers push the Radiotap format [4] as a de facto standard, but many traces are already available in other formats (e.g., AVS or Prism), and some network drivers do not support Radiotap. Furthermore, some developers are reluctant to use it due to its variable-length headers. As a consequence, interoperability between IEEE 802.11 tools and packet traces is problematic. Most researchers develop their tools for the specific PHY headers they use, and different tools might not be able to process the same traces. Sometimes the features provided by two PHY formats are not even compatible!

     6  template <class PHY>
     7  void
     8  print(pcap::file<>& f)
     9  {
    10    for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
    11      {
    12        const PHY*  phy = static_cast<const PHY*> (i->bytes());
    13        const void* ieee80211 = phy->decapsulate(i->meta().caplen, i->swapped());
    14
    15        std::cout << wifi::type::names[wifi::type_of(ieee80211)] << std::endl;
    16      }
    17  }
    18
    19  int main()
    20  {
    21    pcap::file<> f ("file.pcap");
    22
    23    switch (f.type())
    24      {
    25      case pkt::IEEE802_11: print<phy::empty_header<> >(f); break;
    26      case pkt::IEEE802_11_RADIO: print<rtap::header>(f); break;
    27      case pkt::IEEE802_11_RADIO_AVS: print<avs::header>(f); break;
    28      case pkt::PRISM_HEADER: print<prism::header>(f); break;
    29      }
    30  }

Listing 2.2: The program of listing 2.1 with support for multiple PHY headers.

WiPal solves this issue using a proper abstraction for PHY headers (see listing 2.2). WiPal users can handle any PHY header using the same consistent API (Application Programming Interface), as shown in lines 12–13 of listing 2.2. Note that users need to test the format of the trace file to set up the proper PHY header type. This is the purpose of the switch statement at line 23. The reason is that each PHY header’s C++ type is part of a static class hierarchy [16], thus no dynamic method resolution is possible. We wanted to avoid dynamic resolution because a trace file may contain several hundred million packets, and we wanted to minimize the number of dynamic method calls for each packet (for the sake of performance).
WiPal binaries factor out these case statements and avoid this redundancy using the Boost preprocessor library [1].

2.3.2 IEEE 802.11 parsing

Several practices exist regarding IEEE 802.11 frame parsing. A common malpractice among researchers is to feed frames to a program such as Wireshark or tcpdump, which parses them and outputs human-readable text, and then to use a scripting language such as Perl to re-process this output. This should be avoided for three reasons:
1. It processes each frame twice. One of the processing steps is done by a scripting language and involves regular expressions. This is inefficient for parsing a binary format.
2. Script code that involves regular expressions is error-prone and more difficult to maintain. In this case, the code also depends on the specific version of Wireshark or tcpdump used, as their outputs may change between versions. This imposes an extra maintenance burden.
3. This often results in overspecialized code. A change in the sequence of protocols each packet uses might break the whole program.

Another practice consists in focusing on the specific bytes we are interested in for each frame. This often produces error-prone and hard-to-maintain code. Listing 2.3 is a typical example of such code.

    const uint8_t* offset = static_cast<const uint8_t*>(frame) + 30;
    uint16_t eth_type  = tool::extract_big_endian_short_u(offset);
    uint8_t  ip6nxthdr = offset[8];
    uint8_t  icmp6type = offset[42];
    uint16_t udp6port  = tool::extract_big_endian_short_u(offset + 42);

    if(eth_type == 0x86dd and ip6nxthdr == 0x11 and udp6port == 698)
      // ...
    else if(eth_type == 0x86dd and ip6nxthdr == 0x3a and icmp6type == 0x86)
      // ...

Listing 2.3: A typical example of packet processing code. The code is error-prone, depends on the whole protocol stack, and does not handle truncated frames.
Who could tell that it checks for pseudo-Ethernet frames that include either OLSR messages in IPv6 UDP packets or ICMPv6 router advertisements (see footnote 2)? Even with proper comments and constants, one would need to be very careful about the offsets and the protocols under test (and we do not even mention handling truncated frames). Another problem with this technique is that it is specific to a given problem: change only one layer of the protocol stack, and the whole code must be rewritten.

Wireshark adopts a valid approach. Its frame parsing component generates a syntax tree, and one can access each of the frame’s fields using a consistent API. We believe, however, that this approach is overkill for WiPal. Many algorithms only focus on a few fields inside each frame; there is thus no need to spend resources on allocating and constructing a whole structure. Furthermore, handling each frame element would be a waste of time, e.g., when one needs only two of them.

Footnote 2: OLSR (Optimized Link State Routing) is a routing protocol. UDP (User Datagram Protocol) is a transport-level protocol. ICMP (Internet Control Message Protocol) is a protocol from the Internet protocol suite. IPv6 and ICMPv6 refer to the latest version of the Internet protocol, which succeeds IPv4 and ICMPv4.

     6  struct hooks: public wifi::dissector_default_hooks
     7  {
     8    void
     9    seq_ctl_hook(const void* ieee80211, size_t ieee80211_len,
    10                 unsigned fragno,
    11                 unsigned seqno)
    12    {
    13      std::cout << seqno << '/' << fragno << std::endl;
    14    }
    15  };
    16
    17  template <class PHY>
    18  void
    19  print(pcap::file<>& f)
    20  {
    21    for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
    22      wifi::dissect<PHY, hooks>(*i);
    23  }

Listing 2.4: A program using WiPal’s IEEE 802.11 parser. It uses the same main() function as listing 2.2.
Instead of generating a syntax tree, WiPal’s parser calls user-given callback functions at various stages of its processing (for instance, for each address field or each time a sequence control field is encountered). When retrieving a specific field is unnecessary, the user just provides an empty callback (WiPal actually provides empty callbacks by default). Callbacks are static parameters to the parser, therefore compiler optimizations (function inlining and dead code elimination) ensure efficiency.

Listing 2.4 shows an example. The print() function of listing 2.2 now parses the frame using callbacks (hooks) defined in line 6. The parser calls the seq_ctl_hook() function each time it parses a sequence number field. Note that, although some frames may be truncated or may not include sequence numbers, this is transparent to the user. Also note that the user does not need to care about bit manipulations (inside IEEE 802.11 frames, sequence numbers and fragment numbers are respectively 12-bit-long and 4-bit-long fields embedded into a 16-bit-long word, using the network byte order).

Finally, it would also be possible to use WiPal’s parser to build a syntax tree. To this end, one just needs to implement suitable callbacks.

2.4 Filters

WiPal processes packet traces using a pipe and filter pattern [17]. Iterators provide pipe input and output [32]. A filter is an object that takes iterators as input and makes an output iterator available. This section presents the benefits of using this pattern for packet processing. Trace files using the pcap file format provide the base iterators for filters. Therefore, this section also presents the abstractions that provide basic iterators from pcap files, although they are considered lower modules of the library.

2.4.1 Filter sources: pcap abstractions

pcap is the de facto standard for handling packet traces [7].
The format is both simple to read and simple to write, and may handle any type of packet trace, which explains its wide acceptance. Although some other formats exist (e.g., the formats used by Cheng et al. [22] or VeriWave [57]), WiPal does not implement them for now as they are still barely used. One could, however, easily implement them with only minor intrusions into WiPal’s code. It is important to underline that tools exist to convert other formats to pcap.

WiPal provides several abstractions for reading and writing pcap files. The following sections elaborate on three original features of WiPal’s pcap system: (i) random access to a pcap file, (ii) the ability to aggregate several files as one pcap stream, and (iii) the ability to attach meta-data to a pcap stream.

Random access to a pcap file

A basic usage for a pcap stream is to retrieve an iterator pointing to the stream’s first packet. Incrementing this iterator then enables the user to traverse the packet stream. But WiPal also features another access mode: one can retrieve iterators pointing to arbitrary packets in constant time. Random access to a packet is useful to focus efficiently on a specific portion of a trace.

Here is an example. When opening a pcap trace, standard trace visualizers start by loading the whole trace into memory. Browsing the list of packets then just requires memory accesses. This works for traces of reasonable size, but traces used in network research are frequently several gigabytes long [50]. Such traces cannot be loaded into memory. For instance, Wireshark on a GNU/Linux machine with 2 GB of RAM is unable to load traces of more than 500 MB. A solution would be to load into memory only the part of the trace the user is displaying at a given time. But as the user moves inside a trace, the program must be able to quickly load the correct part. From a programming point of view, it is not feasible to re-traverse the whole trace each time, for performance reasons.
Thus the need to access, in constant time, any specific packet inside the trace.

WiPal achieves constant-time access to random packets using file indexes. When opening a pcap file, WiPal performs a single file traversal and records its position in the file every K packets (K is a customizable parameter). When asked for an iterator to the p-th packet, it seeks to recorded position number ⌊p/K⌋ (i.e., the position of packet K·⌊p/K⌋) and then traverses p mod K packets. Since seek() is constant time, and at most K read operations are required (K being fixed and independent of the trace file), random access is O(K) = O(1). The smaller K is, the faster the operation, but also the larger the index’s memory footprint. Because building the index requires a full trace traversal, this indexing mechanism is optional: users can disable it when they do not need random access.

As a proof of concept, we developed a trace visualizer using this feature: WScout [24]. Figure 2.3 displays a screenshot. Thanks to WiPal’s random access feature, WScout is able to display in a graphical interface packet traces too large to fit into memory. To the best of our knowledge, WScout is the only visualizer with such a feature. It is available as free software at http://wscout.lip6.fr/.

Figure 2.3: A screenshot of WScout [24]. WScout uses WiPal’s random access feature to open packet traces that do not fit in memory.

pcap file aggregation

A common practice when capturing packets is to split the resulting packet stream into multiple files. Some tools require it in order to generate long traces (e.g., more than 2 GB). CRAWDAD’s uw/sigcomm2004 dataset includes such traces [50]. To later process these traces, one must consider the concatenation of the trace files as one unique packet stream.
Despite looking like a minor issue, this is an annoying burden for developers: one would like to focus on the processing logic rather than working around measurement quirks.

WiPal enables users to consider multiple pcap files as one single pcap stream. Adding this feature to a program is as simple as replacing every occurrence of pcap::file<> with pcap::list<> (e.g., in the preceding code snippets). One can then use a specific syntax to aggregate files. For instance, opening "file1.pcap:file2.pcap" will generate a stream that outputs packets from file1.pcap first, then from file2.pcap. Note how this operation is transparent to end users. Other services are also available, for instance to check the consistency of the list (e.g., to check that every file in the list uses the same PHY format).

Packet stream meta-data

Trace files are often associated with some information they do not include directly. A common example is the IP or MAC address of the machine that generated the trace file (i.e., that performed the packet capture). Such information can be useful, for instance, if this machine injected packets during the capture, and one needs to filter these packets out when processing the trace (e.g., because their timestamps are less accurate). A common practice is to embed these pieces of information into the traces’ file names. Some tools require users to arrange trace files according to a specific filesystem tree.

In order to ease the programming of such mechanisms, every packet stream in WiPal can embed meta-data. Stream meta-data in WiPal takes the form of a mapping from a string to an object of any type. Users can therefore attach any needed piece of information to a packet stream. pcap lists use this mechanism. For instance, when opening "file.pcap=192.168.1.1" or "foo.pcap:bar.pcap=10.0.0.1", WiPal associates the given IP address with the corresponding stream, under the string addr. WiPal’s trace merging services use such information.
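As a rough illustration of the naming convention described above, the following sketch splits such a string into a list of files and an addr entry. It is not WiPal’s implementation: the stream_spec type and the parse_spec() function are hypothetical names, the grammar (a colon-separated file list optionally followed by =ADDRESS) is an assumption inferred from the two examples above, and meta-data values are simplified to strings, whereas WiPal maps strings to objects of any type.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical result type (not part of WiPal): the files to concatenate,
// plus string-keyed meta-data attached to the resulting stream.
struct stream_spec
{
  std::vector<std::string>           files;
  std::map<std::string, std::string> meta;
};

// Parse a specification such as "foo.pcap:bar.pcap=10.0.0.1".
// Assumed grammar: FILE(:FILE)*(=ADDRESS)?
stream_spec parse_spec(const std::string& spec)
{
  stream_spec r;
  std::string files = spec;

  // Everything after the last '=' is taken as the capturing machine's address.
  const std::string::size_type eq = spec.rfind('=');
  if (eq != std::string::npos)
    {
      r.meta["addr"] = spec.substr(eq + 1);
      files = spec.substr(0, eq);
    }

  // Split the remaining part on ':' to obtain the ordered file list.
  std::string::size_type start = 0;
  while (true)
    {
      const std::string::size_type colon = files.find(':', start);
      // When colon == npos, substr() clamps the count and returns the rest.
      r.files.push_back(files.substr(start, colon - start));
      if (colon == std::string::npos)
        break;
      start = colon + 1;
    }
  return r;
}
```

Under these assumptions, parse_spec("foo.pcap:bar.pcap=10.0.0.1") yields the two file names in order and a single addr entry. A real implementation would also need a rule for file names or addresses that themselves contain ':' or '='.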
2.4.2 Processing filters

Filters are the core of WiPal’s advanced processing features. WiPal features a dozen filters related to trace merging, synchronization, or anonymization. This section illustrates with a simple example how they can improve code quality when dealing with packet traces.

The example is a program that anonymizes a packet trace and then prints statistics concerning the resulting trace. This needs two filters and two “data sinks”. The filters are an anonymizer and a timetracker, and the sinks are a pcap output stream (for the anonymized trace) and a statistics extraction module. The anonymizer filter reads IEEE 802.11 frames and outputs a copy of these frames truncated at the end of the MAC layer, and where MAC addresses and ESSIDs (network identifiers; see footnote 3) have been replaced with random values. The timetracker filter is in charge of extracting precise timestamps from PHY headers (it falls back to pcap timestamps when there are no PHY headers). It also handles wraparounds (some timestamp formats wrap roughly every hour and a half), so it produces monotonically increasing timestamps. Having timestamps that are as precise as possible is necessary in order to compute statistics about the packet stream.

Figure 2.4: A simple processing pipeline using two filters (represented as white boxes). Listing 2.5 displays the code implementing this pipeline.

Figure 2.4 shows how to connect each element. The input file links to the anonymizer, and the anonymizer to the timetracker. We then send the timetracker output to both the output stream and the statistics module.

Listing 2.5 implements this program. One can distinguish three parts: type definitions (lines 10–12), object declarations (14–17), and processing (19–24). Inside WiPal, every filter is a class, therefore type definitions set up two type aliases for the filter classes: anonymizer and timetracker. One can notice that the C++ types for filters embed the type of their input iterators.
This is a drawback of using static C++: when using many filters, one starts with a long list of typedefs, and this requires the user to juggle type names. It is however important to note that these static mechanisms enable compilers to perform optimizations and produce efficient code. This is the key to WiPal’s performance. Furthermore, type checking ensures correctness and safety, as the compiler catches mistakes in this part of the code. Finally, C++0x, the next release of the C++ standard to be published soon [18], will solve this problem, as it includes features that make writing these type definitions unnecessary (thanks to the auto keyword).

Footnote 3: ESSID stands for Extended Service Set Identifier.

     6  template <class PHY>
     7  void
     8  process(pcap::file<>& f)
     9  {
    10    typedef filter::anonymizer<pcap::file<>::iterator, PHY> anonymizer;
    11    typedef filter::timetracker<typename anonymizer::iterator, PHY> timetracker;
    12    typedef typename timetracker::iterator iterator;
    13
    14    anonymizer a (f.begin(), f.end());
    15    timetracker t (a.begin(), a.end());
    16    pcap::ostream o ("output.pcap", f);
    17    wifi::stats::stats s;
    18
    19    for (iterator i = t.begin(); i != t.end(); ++i)
    20      {
    21        o << *i;
    22        s.account<PHY>(*i);
    23      }
    24    std::cout << s << std::endl;
    25  }

Listing 2.5: An example of advanced trace processing using filters. This program uses the same main() function as listing 2.2.

The next part of listing 2.5 (lines 14–17) declares the filter objects and end-modules. Connecting the filters is achieved by giving the proper iterators to the filters’ constructors (lines 14–15). Note that when the program runs, at this stage, no processing has started. Filters operate in a lazy fashion: the input file will not be read until we start reading the timetracker’s output. Furthermore, filters only load into memory the data they need for producing their next element, nothing more.
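The lazy behavior described above can be sketched with a toy filter, independently of WiPal. In the following example, the doubler class, the g_processed counter, and the run_pipeline() function are hypothetical names introduced for illustration only; integers stand in for packets, and the "processing" is a simple multiplication. The sketch only assumes what the text states: a filter is built from input iterators, exposes begin() and end(), and does its work when its output is read.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many elements filters have actually processed, so that
// laziness can be observed from the outside (illustration only).
static std::size_t g_processed = 0;

// Hypothetical filter: doubles each integer "packet", but only when the
// corresponding output element is dereferenced.
template <class InputIt>
class doubler
{
public:
  class iterator
  {
  public:
    explicit iterator(InputIt pos): pos_(pos) {}
    int operator*() const { ++g_processed; return 2 * *pos_; }
    iterator& operator++() { ++pos_; return *this; }
    bool operator!=(const iterator& rhs) const { return pos_ != rhs.pos_; }
  private:
    InputIt pos_;
  };

  // Construction only stores the input iterators; nothing is read yet.
  doubler(InputIt first, InputIt last): first_(first), last_(last) {}
  iterator begin() const { return iterator(first_); }
  iterator end() const { return iterator(last_); }

private:
  InputIt first_, last_;
};

// Chain two filters, then drain the pipeline; returns the sum of its outputs.
int run_pipeline(const std::vector<int>& packets)
{
  // As in listing 2.5, each filter type embeds the type of its input iterator.
  typedef doubler<std::vector<int>::const_iterator> stage1;
  typedef doubler<stage1::iterator>                 stage2;

  const std::size_t before = g_processed;
  stage1 a (packets.begin(), packets.end());
  stage2 b (a.begin(), a.end());
  assert(g_processed == before);  // wiring the filters processed nothing

  int sum = 0;
  for (stage2::iterator i = b.begin(); i != b.end(); ++i)
    sum += *i;                    // each read pulls one element through both stages
  return sum;
}
```

Each element is pulled through both stages only when the final loop dereferences the last filter’s iterator, so at any time the pipeline holds a single element rather than a copy of the whole input.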
Finally, the last part of listing 2.5 (lines 19–24) reads each output frame from the timetracker, and sends it to the end-modules. Sending packets to a pcap::ostream object using the << operator transparently writes the corresponding pcap file. The following method call to account() updates the statistics module. It is then possible to report these statistics on the standard output using standard C++ streams and formatting operators. These statistics include frame counts and traffic rates for each frame type/subtype, estimations of missed frames, lists of networks and cells, information about transmitters, and various other figures.

There are two important points about this program. First, it is very easy to alter its behavior by just adding or removing the desired filters. For instance, one could add a filter before the anonymizer to filter a certain type of packets out. Or the anonymizer could be removed, thus making the program a simple statistics extractor. The second point is that filters are an easy mechanism to parameterize a processing task. Some processing tasks have parts that can be implemented with different algorithms (for instance, a merge process might use several synchronization algorithms). In such cases, testing various algorithms is just as simple as changing the corresponding filter, without altering the rest of the pipeline.

In other words, filters enable decomposing trace operations into several basic blocks, thus making trace processing modular. As a consequence, we can expect programs to be easier to maintain and adapt, and code to be easier to reuse.

2.5 Performance evaluation

We evaluate WiPal’s efficiency using nine test programs involving WiPal and some well-known packet processing software. We are interested both in how WiPal performs with regard to other programs and in the overhead generated by WiPal’s original features (namely trace aggregation, random packet access, IEEE 802.11 parser, and filters).
2.5.1 Methodology

We use nine test programs. Here is a short description of each of them.

libpcap. This is a simple program that uses libpcap [7] to perform a single pcap file traversal. Packets are discarded immediately after being read from the file. We use this program as a reference.

WiPal-file. This is the same program as above, using WiPal instead of libpcap. It uses pcap::file objects and its code is very similar to listing 2.1. The goal is to compare WiPal’s pcap reading mechanisms to libpcap’s.

WiPal-list. This program is the same as WiPal-file, using pcap::list objects instead of pcap::file (see Section 2.4.1). We use this test to measure the overhead of WiPal’s file aggregation feature.

WiPal-parser. This program performs a single file traversal, calling WiPal’s IEEE 802.11 parser on all frames composing the trace. We use the default behavior of WiPal’s parser, which is to call empty callbacks. This allows us to measure the overhead of an “empty” parser. In the ideal theoretical case, the C++ compiler would optimize the code out and the program would exhibit performance similar to WiPal-list. We also compare this program to Scapy (see below), which performs basically the same task.

WiPal-random. This program tests WiPal’s random access feature (see Section 2.4.1). It starts by building an index of its input file, then performs successive accesses to random packets. The number of random accesses is twice the input trace’s packet count. Therefore, it does the equivalent of three file traversals: one using standard iteration mechanisms and two using random accesses. If one subtracts from its execution time the time for building the index (estimating it with WiPal-file) and divides the result by two, the result is the average time of a single random traversal. One can use this result to compute the overhead of random access over conventional iteration. In this program we use K = 4.
This value ensures fast random access while keeping a reasonable memory footprint. Measurements show that WScout [24], using WiPal’s indexes with K = 4, is able to load a 22 GB trace (including about 108,000,000 packets) using a total of 560 MB of virtual memory.

WiPal-filters. This is the program of listing 2.5. The goal is to get an idea of how a moderately complex processing task performs using WiPal’s filters (anonymization and statistics extraction running simultaneously). Of course, this is not directly comparable to tshark or tcpdump (see below) because each program implements different features. But we expect these programs to have execution times of the same order of magnitude.

Scapy. This program is very similar to WiPal-parser in its features. It uses Scapy’s sniff function to read its input file. Scapy parses each packet it reads, but we set up the function to discard the packet immediately afterwards, without further processing. We set up Scapy to parse only the MAC layer.

tshark. This is the plain tshark program, which is the console version of Wireshark. It reads the input file, parses each packet, and displays a text summary on standard output. tshark relies on libpcap for its I/O operations.

tcpdump. This is the plain tcpdump program, which basically offers the same features as tshark. Contrary to Wireshark, it uses a custom parser dedicated to printing packet summaries on a terminal, and we expect it to be faster than Wireshark. tcpdump also relies on libpcap for its I/O operations.

In order to evaluate a program, we feed it with a 460 MB pcap trace (including about 2,100,000 packets). We run each program a hundred times, measuring its execution time (accounting only for the user and system time as reported by the UNIX time command). We then compute the mean execution time and 95% confidence intervals.
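As an illustration of the last step, the mean and a 95% confidence interval over a series of run times can be computed as in the following minimal Python sketch. It is not the script used for the measurements: the function name mean_with_ci95 and the sample timings are hypothetical, and the interval relies on a normal approximation, which is an assumption that seems reasonable for about a hundred runs.

```python
import math
import statistics


def mean_with_ci95(samples):
    """Return (mean, half_width) of a 95% confidence interval for the mean.

    Assumption: normal approximation (z close to 1.96), not a Student t
    interval; the two are close when there are about a hundred samples.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    # Standard error of the mean, from the sample standard deviation.
    stderr = statistics.stdev(samples) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.975)
    return mean, z * stderr


if __name__ == "__main__":
    # Hypothetical user+system times in seconds, for illustration only.
    timings = [0.95, 0.96, 0.94, 0.95, 0.97, 0.93]
    mean, half_width = mean_with_ci95(timings)
    print(f"mean = {mean:.3f} s, 95% CI = +/- {half_width:.3f} s")
```

The interval is then reported as the mean plus or minus the returned half-width.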
We always use the same trace file: (i) we do not expect another trace of similar size to lead to significant differences, and (ii) each of these programs is linear w.r.t. the trace size, so the average processing time per packet will not change with bigger or smaller traces. The trace comes from a real-world measurement and may be considered average-sized for measurements in wireless environments. In order to avoid disk slowdowns, we store this file in a RAM disk and we redirect all outputs to /dev/null. The machine executing these tests has a dual-core Intel® Pentium® D CPU at 3 GHz, with 2 GB of RAM and a 2 MB cache.

2.5.2 Results

Figure 2.5 displays the results. It is important to keep in mind that many of these programs do not have the exact same features as the others. Therefore, most of the time, one should not expect precise comparisons from these results. They rather give an idea of the orders of magnitude for a typical trace. We can nevertheless draw a number of interesting conclusions.

[Figure 2.5: bar chart of the mean execution time of libpcap, WiPal-file, WiPal-list, WiPal-parser, WiPal-random, WiPal-filters, Scapy, tshark, and tcpdump.]
Figure 2.5: Mean execution time for a hundred runs of the various test programs. Note that most 95% confidence intervals are too small to be distinguished clearly.

Comparison with libpcap. A first thing to notice is that WiPal’s packet reading features perform almost as well as libpcap’s (WiPal-file is 120 ms slower, for a total execution time of nearly a second). This extra delay is negligible: as WiPal-filters and tcpdump show, for more elaborate processing the time actually spent performing I/O operations is small compared to the time spent performing other computations. The important point is that using iterators does not noticeably impact the performance of WiPal’s C++ API.
WiPal’s I/O speed is comparable to libpcap’s.

Overhead of WiPal-list and WiPal-parser. It is interesting to note that WiPal-list leads to the same execution time as WiPal-file and that WiPal-parser performs almost as well as the previous two (a couple dozen milliseconds slower). One can draw two conclusions: (i) the file aggregation feature has a negligible cost, and (ii) the generic parser implementation using static callbacks is efficient. This means only user-provided callbacks might cause a noticeable overhead.

Overhead of random access. The WiPal-random program runs in 28 seconds. Thus, we can estimate that traversing the trace once in random order takes less than 14 seconds (remember that the WiPal-random program performs one sequential and two random accesses per packet). This makes random access to a packet roughly 14 times slower than sequential access in practice. In theory, however, with K = 4, it should only be twice as slow on average. We believe the difference between theory and practice is due to the fact that a random file access at the standard library level breaks the underlying buffering mechanisms (whereas a sequential access does not). In conclusion, random access is significantly slower but is still reasonable with regard to the feature offered. This extra delay is of the same order of magnitude as that of other processing tasks such as WiPal-filters or tcpdump.

Overhead of using filters for advanced trace processing. WiPal-filters runs in 13 seconds. This is about the same execution time as tcpdump, while tshark is several times slower and Scapy more than two orders of magnitude slower. Therefore, WiPal’s design does not hinder its efficiency: by using filters, WiPal achieves performance levels that are similar to specialized programs. WiPal-filters uses two filter objects: an anonymizer and a timetracker.
The anonymizer relies in part on WiPal’s generic IEEE 802.11 parser and the timetracker uses PHY abstractions to extract timestamps from PHY headers. On the one hand, this means that WiPal’s genericity does not prevent it from competing with specialized code. On the other hand, the extra genericity in tshark’s design (compared to tcpdump) comes at the cost of reduced performance (tshark is about seven times slower than tcpdump or WiPal-filters). Scapy is even slower, requiring more than one hour to process the trace. In this case, the first cause is its implementation language (Python). Of course, scripting languages are known to be slower than compiled ones, but they are also dynamic. Therefore, they lack several optimization opportunities. For instance, Scapy cannot optimize the parser out, even though each packet is discarded, while this is possible with WiPal-parser, which provides the same features as Scapy.

To conclude this section, WiPal does not sacrifice performance for reusability. It may even outperform existing state-of-the-art programs. Its I/O operations are almost as fast as libpcap’s despite the extra features. It also has a generic design that can compete with specialized code. This is a strong argument for adopting WiPal instead of writing specific code when designing packet trace manipulation software.

2.6 Conclusion

This chapter presented WiPal, a packet manipulation framework, and reported on our experience designing it. To the best of our knowledge, WiPal is the only framework that focuses both on performance and genericity. This makes it a valuable tool for researchers who need to develop packet trace processing software. Though WiPal addresses mostly the IEEE 802.11 protocol, it also provides several protocol-agnostic features (e.g., pcap I/O operations). Furthermore, WiPal uses patterns that could be useful to handle other types of packet traces and protocols.

WiPal introduces a number of original features and a novel design.
It features trace anonymization, statistics extraction, synchronization, merging, as well as other miscellaneous operations. Instead of relying on syntax trees, its IEEE 802.11 frame parser uses a static callback mechanism. By applying modern compiler optimizations, we obtain generic and fast operations. WiPal features the ability to index pcap files, thus allowing random access to packets. These accesses are constant-time and only imply limited overhead. It is also possible to aggregate trace files and to consider the concatenation of several files as one unique packet stream. WiPal also includes a mechanism to attach meta-data to trace files, as they are often associated with data they do not include directly. Finally, WiPal’s whole design is based on a pipe and filter pattern. This pattern enables decomposing trace operations into several basic blocks, thus making trace processing modular. The consequence is that programs become easier to maintain and to adapt, and code easier to re-use.

Measurements show that WiPal competes with state-of-the-art packet processing software. Its I/O operations are almost as fast as libpcap’s, and its generic design is as fast as specialized code.

Chapter 3
WiPal: IEEE 802.11 trace merging

The most innovative part of WiPal is probably the one dedicated to merging IEEE 802.11 packet traces. This merger includes original algorithms and focuses on performance, ease-of-use, and flexibility. We achieve performance using a proper design and careful programming. Ease-of-use and flexibility, on the other hand, are the consequences of a number of characteristics that distinguish WiPal’s trace merger from other software:

Offline operation. Because it is designed to run offline, WiPal is independent of the monitors. This means that one may use any software to acquire data. Most trace mergers expect monitors to embed specific software [22;28].

Independence of infrastructure.
WiPal’s internal algorithms do not expect features from traces that would require monitors to access a network infrastructure (e.g., “loose” sniffer synchronization using NTP, the network time protocol). Monitors just need to record data in a compatible input format.

Compliance with multiple formats. WiPal supports most of the existing input formats. Other trace mergers, on the other hand, require a specific format. Some tools even require a custom dedicated format [22].

Hands-on design. WiPal is usable in a straightforward fashion by just calling the adequate programs on trace files. Other mergers require more complex setups (e.g., a database server [43] or a network setup involving multiple servers [22]).

This chapter explains the design and internals of WiPal’s merger. We also intend to complement existing papers in the literature and give additional insights into the complex process of trace merging. Section 3.1 first gives an overview of existing trace merging techniques. Like every other tool, WiPal uses these techniques. Then Section 3.2 explains the basics of WiPal’s merging algorithms. Section 3.3 goes into more detail and presents each distinct part individually. Finally, Section 3.4 provides an evaluation of WiPal’s efficiency regarding trace merging.

[Figure 3.1, four panels:]
A. The traces are not synchronized and miss some frames.
B. One identifies some reference frames common to both traces. This information enables trace synchronization.
C. One adjusts the frames’ timestamps and synchronizes T1 and T2.
D. One can merge the traces. Duplicate frames are only accounted for once.
Figure 3.1: Merging two traces T1 and T2.

3.1 Trace merging: state of the art

Wireless sniffing requires the use of multiple monitors for coverage and redundancy reasons. Coverage is a concern when the distance between the monitor and at least one of the transmitters to be sniffed is too large to ensure a minimum reception threshold.
Redundancy is the consequence of the unreliability of the wireless medium. Even in good radio conditions monitors may miss successfully transmitted frames.

After the collection phase, traces must be combined into one. A merged trace holds all the frames recorded by the different monitors and gives a global view of the network traffic. The traditional approach to merging traces involves a synchronization step, which aligns frames according to their timestamps. This step includes identifying the frames that are identical in all traces so that they appear once and only once in the output trace (Cheng et al. [22] refer to this operation as unification). Figure 3.1 illustrates this process (more details are given in Section 3.3).

Synchronization is difficult to obtain because, in order to be useful, it must be very precise. Imprecise frame timestamps may result in duplicate frames and incorrect ordering in the output trace. An invalid synchronization may also lead to distinct frames being accounted for as the same frame in the output trace. In order to avoid such undesirable effects, one needs a precision of less than 106 µs [59]. To the best of our knowledge, only the VeriWave WaveTest appliances [57] are able to synchronize network cards’ clocks with such a precision (note that we are interested in frame arrival times in the card, not in the operating system). But this requires specific wiring between the sniffers, and this hardware is expensive. Therefore, all merging tools post-process traces to resynchronize them with the help of reference frames, which are frames that appear in multiple traces. One may readjust the traces’ timing information using the timestamps of the reference frames (see Figure 3.1). Finding reference frames is however a hard task, since we must be sure a given reference frame is an occurrence of the same frame in every trace.
That is, some frames that occur frequently (e.g., MAC acknowledgements) cannot be used as reference frames because their content does not vary enough. Therefore, only a subset of frames are used as reference frames, as explained later in this paper (cf. Section 3.3).

A few trace merging tools exist in the literature, but they do not focus on the same set of features as WiPal. For instance, Jigsaw is able to merge traces from hundreds of monitors, but requires monitors to access a network infrastructure [22]. This paper, in contrast, considers smaller-scale systems (dozens of monitors) where no monitor can access a network infrastructure. WisMon is an online tool that has requirements similar to Jigsaw [28]. CrunchXML [51] is a tool that uses the same merging algorithm as WisMon, but that can work either online or offline. However, due to this algorithm, its operation needs all sniffers to hear a common access point. In order to work in all kinds of environments, WiPal cannot make such an assumption (sometimes access points are not shared among all traces, or there are no access points). The system that is the closest to ours is Wit [43;44]. Although Wit provides valuable insights on how to develop a merging tool, it is difficult to use, modify, and extend in practice (cf. its authors’ note in CRAWDAD [44]).¹ This explains in part our motivation to propose a new trace merger.

3.2 WiPal’s basics

WiPal has been designed according to the following constraints:

No wired connectivity. The sniffers must be able to work in environments where no wired connectivity is provided. The idea is to be able to perform measurements when it is difficult to have all sniffers access a shared network infrastructure (e.g., in some conference venues, or when studying interference between two wireless networks belonging to distinct entities).

¹ Note that we refer to Wit’s merging process, not to the other features available (e.g., a module to infer missing packets).
Simplicity to the end-user. We believe simplicity is the key to re-usability. Users are not expected to install and set up complex systems (e.g., a database backend) in order to use WiPal.

Clean design. WiPal exhibits a modular design. Developers can easily adapt parts of the trace merger or integrate them into other systems (e.g., the reference frame identification process, the synchronization, or the merging algorithm).

These constraints call for an offline trace merger that does not require traces to be synchronized a priori. In practical terms, this means that sniffers only have to record their measurements on a local storage device, using the widely used pcap file format [7]. With regard to this format, WiPal supports all mainstream PHY headers: raw IEEE 802.11 frames, AVS, Prism, and Radiotap headers.² Some wireless packet traces use another link type though: IP packets encapsulated into pseudo-Ethernet frames. It is important to note that such traces are not MAC traces (only IP packets are available) and thus do not contain enough information for accurate synchronization and merging. WiPal merges these traces when requested, but this is an experimental feature that has not been extensively tested. As seen in the previous chapter, adding new link types is straightforward: WiPal’s design only requires implementing the right abstractions and modifying a couple of lines in the existing codebase.

One can access WiPal’s merging services through its software library or using a set of binaries to manipulate wireless traces. All tools work directly on pcap files both as input and output. wipal-merge is the main command to merge an arbitrary number of traces:

    $ wipal-merge t1.pcap t2.pcap [t3.pcap...]
It is worth mentioning that intermediate steps of the merge procedure can be performed separately, such as:

    $ wipal-intersect-unique-frames t1.pcap t2.pcap
    $ wipal-synchronize t1.pcap t2.pcap sync_t1.out.pcap

3.3 Detailed operation of WiPal’s trace merging

Figure 3.2 depicts WiPal’s structure. Each box represents a distinct module and arrows show WiPal’s data flow. WiPal takes two wireless traces as input and produces a single merged trace.³ In the following, we explain in detail the functioning of each of the modules.

² See Chapter 2, Section 2.3.1 for an explanation about PHY headers.
³ In order to merge more than two traces, it suffices to execute the merging tool as many times as required (two by two). The wipal-merge command does this automatically.

Figure 3.2: The structure of a merge process in WiPal.

3.3.1 Identifying reference frames

This section explains the process of extracting reference frames. This operation involves two steps: extraction of unique frames and intersection of unique frames (see Figure 3.2). Let us first define what a unique frame is. A frame is said to be unique when it appears “in the air” once and only once for the whole duration of the measurement. A frame that is unique within each trace but that actually appeared twice on the wireless medium should not be considered unique. The process of extracting unique frames finds candidates to become reference frames. The process of intersecting unique frames then identifies identical unique frames from both traces to become reference frames.

3.3.2 Extraction of unique frames

WiPal considers every beacon frame and non-retransmitted probe response as unique frames. These are management frames that access points send on a regular basis (e.g., every 100 ms for beacon frames).
The uniqueness of these frames is due to the 64-bit timestamps they embed (these timestamps are not related to the actual timestamps used for synchronization, as we will see later).

In practice, the extraction process does not load full frames into memory. It uses 16-byte hashes instead, which are stored in memory and used for comparisons. Limiting the size of stored information is an important aspect since, as we will see later, WiPal’s intersection process performs a lot of comparisons and needs to store many unique frames in memory. Tests with CRAWDAD’s uw/sigcomm2004 dataset [50] have shown that this technique is practical. For instance, WiPal needs less than 600 MB to load 7,700,000 unique frames.

There are some rare cases where the assumption that beacons and probe responses are unique does not hold. The uw/sigcomm2004 dataset has a total of 50,375,921 unique frames (about 14% of the total 364,081,644 frames). Among those frames, we detected 5 collisions (distinct unique frames sharing identical hashes). WiPal’s intersection process includes a filtering mechanism to detect and filter out such collisions.

3.3.3 Intersection

The intersection process intersects the sets of unique frames from both input traces. There are multiple algorithms to perform such a task. Based on Cheng et al. [22], a solution is to “bootstrap” the system by finding the first unique frame common to both traces and then use this reference frame as a basis for the synchronization mechanism, as shown in Algorithm 1 (we call this algorithm streaming intersection). One may also use subsequent reference frames to update synchronization. This algorithm is practical because the inner loop only searches a very limited subset of I2.
It has several drawbacks though: (i) the performance of the algorithm strongly depends on the precision of the synchronization process; (ii) finding the first reference frame is an issue when no other synchronization mechanisms are available; (iii) this algorithm couples intersection with synchronization, which is undesirable with respect to modularity; and (iv) there is a possibility that some frames are read multiple times from I2. More specifically, access to I2 is not sequential.

We propose the retained intersection algorithm, which is much simpler to implement and avoids the drawbacks of the abovementioned solution (see Algorithm 2). Its main characteristics are: (i) it does not require a bootstrapping phase; (ii) it does not depend on any kind of synchronization; and (iii) it sequentially reads each frame only once from I1 and I2.

Algorithm 1 Streaming intersection (uses synchronization).
Input: two lists of unique frames I1 and I2.
Output: a list of reference frames.
    δ ← synchronization precision
    for all u1 ∈ I1 do
        t_u1 ← u1’s time of arrival
        for all u2 ∈ I2 between t_u1 − δ and t_u1 + δ do
            if u2 is an occurrence of u1 then
                Append (u1, u2) to output.
            end if
        end for
    end for

Algorithm 2 WiPal’s retained intersection.
Input: two lists of unique frames I1 and I2.
Output: a list of reference frames.
    h ← ∅    ▷ Implement h with a hash table.
    for all u1 ∈ I1 do
        Insert u1 into h.
    end for
    for all u2 ∈ I2 do
        if h contains an element u1 equal to u2 then
            Append (u1, u2) to output.
        end if
    end for

Our algorithm starts by loading all unique frames of the first trace into memory. This precludes using it as an online tool. Note that loading all unique frames from a trace into memory may also hog resources; this justifies the importance of having small identifiers for the unique frames. These constraints are however irrelevant in practice. To support our argument, let us show an example using the uw/sigcomm2004 dataset.
The biggest traces are those from sniffers mojave and sonoran on channel 11 (roughly 19 GB each). Extracting these traces’ unique frames and intersecting them using WiPal needs 575 MB of memory. Therefore, memory consumption is not a concern in our algorithm.

Another advantage of the proposed algorithm is its ability to detect collisions of unique frames within the first trace. As indicated in Algorithm 2, this algorithm uses a set h (in practice, implemented using a hash table) that contains unique frames from the first trace. One detects collisions when trying to insert into h an element that is already part of it. When WiPal encounters such cases, it memorizes collisions and filters them out of the hash table before starting the algorithm’s second loop.

Of course, collisions in the second trace remain undetected. Even if WiPal detected them, there would still be the possibility that a collision spans both traces (i.e., each trace contains one occurrence of a colliding unique frame). Such cases lead to producing invalid reference frames. To detect invalid reference frames, WiPal looks at possible anomalies w.r.t. the interarrival times between unique frames. In practice, invalid references are rare: only three occurrences when merging uw/sigcomm2004’s channel 11 (a 73 GB input which produces a 22 GB output).

Dataset          Environment        Hardware
uw/sigcomm2004   Conference         Laptop
Private 1        Office building    Soekris
Private 2        Uptown apartment   Netbook
Private 3        Office building    Netbook

Id.     1    2     3    4    5    6    7    8
Chan.   11   6/8   6    6    11   6    6    1

Table 3.1: Characteristics of the traces used for testing merge operations. Id. relates to the identification number of the merge operations.

3.3.4 Synchronization

Synchronizing two traces means mapping trace one’s timestamps to values compatible with trace two’s. WiPal computes this mapping with an affine function t2 = a t1 + b.
It estimates a and b with the help of reference frames as the process runs. Several techniques exist to perform these estimations: linear interpolations [44], linear regressions [31;59], or solving linear problems [52]. To combine generality with speed efficiency, WiPal uses a simple generalization of the techniques from Mahajan et al. [44] and Yeo et al. [59]. Note that other techniques could also be implemented without requiring modifications in WiPal’s other components.

WiPal’s synchronization process operates on windows of $w + 1$ reference frames (finding an optimal value of $w$ is discussed below). For each reference frame $R_i$, the process performs a linear regression using reference frames $R_{i-\lfloor w/2 \rfloor}, \ldots, R_{i+\lceil w/2 \rceil}$. At the beginning and at the end of the trace, we use $R_1, \ldots, R_w$ and $R_{N-w}, \ldots, R_N$ ($N$ is the number of reference frames). The result gives a and b for all frames between $R_i$ and $R_{i+1}$.

Experiments led us to choose 3 as the optimal value for $w$ (i.e., WiPal performs linear regressions on windows of 4 reference frames). Figure 3.3 shows the results of performing eight merge operations (on sixteen traces from four distinct datasets) with varying window sizes. The merges concern 12 hour-long excerpts of various traces. One of the four datasets is uw/sigcomm2004 while the three others are private datasets we collected. Table 3.1 presents some characteristics of the traces we used for each merge operation. It is important to note that these sixteen traces were collected with various hardware in several environments, on different channels.

We define the synchronization difference between two traces as follows. First, consider only the subset $S$ of frames that are shared by both $T_1$ and $T_2$. For a given frame $f$, let $t_{f,1}$ be the arrival time of $f$ inside $T_1$ (after clock synchronization) and $t_{f,2}$ be the
arrival time of $f$ inside $T_2$. The synchronization difference is given by

$$\frac{1}{|S|} \sum_{f \in S} \left| t_{f,2} - t_{f,1} \right|.$$

One can summarize the synchronization difference as the average difference of synchronization between frames that are identified as shared among input traces.

[Figure 3.3: two plots of the synchronization difference (µs) against the window size (w + 1); upper panel: merges 1 and 3 to 8; lower panel: merge 2.]
Figure 3.3: Synchronization difference w.r.t. linear regression window size. The upper curve represents average, minimum, and maximum values for seven of the eight merges. The lower curve represents the result for the other one, and is plotted separately because it has a singular shape. We think this is related to the timestamping accuracy of the input traces for this merge.

With the exception of merge 2, which exhibits a very singular behavior, Figure 3.3 shows that w = 2 leads to the minimum synchronization difference. Note that techniques that use w = 1 (i.e., that perform linear interpolations on couples of reference frames) lead to the worst synchronization difference on average.

However, choosing a w that is too low or too high might lead to missing some shared frames. Figure 3.4 shows the number of frames that are identified as duplicates in the input traces with respect to window size. Whereas using 3 ≤ w ≤ 7 allows detecting the maximal number of shared frames, using other values leads to some missed duplicates. Note that w = 1 gives the worst results. This indicates that synchronizing traces using linear interpolation (as Wit [44] does) may lead to incorrect results. Therefore WiPal uses w = 3: among the values that detect the maximum number of shared frames, this is the one that leads to the minimum synchronization difference.
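The windowed regression and the synchronization difference described in this section can be sketched as follows in Python. This is an illustrative sketch under stated assumptions, not WiPal’s C++ implementation: the function names (fit_line, window_for, sync_coefficients, sync_difference) are hypothetical, reference frames are reduced to (t1, t2) timestamp pairs, indices are 0-based, and the window is simply shifted so that it always holds w + 1 frames at both ends of the trace.

```python
def fit_line(points):
    """Least-squares fit of t2 = a * t1 + b over (t1, t2) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def window_for(i, n, w):
    """0-based indices of the w + 1 reference frames used around frame i.

    Assumption: near the edges the window is shifted so that it still
    contains w + 1 frames.
    """
    lo = i - w // 2            # i - floor(w / 2)
    hi = i + (w + 1) // 2      # i + ceil(w / 2)
    if lo < 0:
        lo, hi = 0, w
    if hi > n - 1:
        lo, hi = n - 1 - w, n - 1
    return range(lo, hi + 1)


def sync_coefficients(refs, w=3):
    """Return one (a, b) pair per reference frame, refs sorted by t1."""
    n = len(refs)
    if n < w + 1:
        raise ValueError("need at least w + 1 reference frames")
    return [fit_line([refs[j] for j in window_for(i, n, w)])
            for i in range(n)]


def sync_difference(shared):
    """Mean |t2 - t1| over shared frames, t1 being already synchronized."""
    return sum(abs(t2 - t1) for t1, t2 in shared) / len(shared)


if __name__ == "__main__":
    # Hypothetical reference frames: a clock drifting by 100 ppm
    # with a 50 µs offset (timestamps in microseconds).
    refs = [(float(t), 1.0001 * t + 50.0) for t in range(0, 600000, 100000)]
    for a, b in sync_coefficients(refs):
        print(f"a = {a:.6f}, b = {b:.3f}")
```

Each (a, b) pair would then be applied to the frames located between the corresponding reference frame and the next one.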
[Figure 3.4: plot of the number of shared frames (normalized) against the window size (w + 1), with an inset zooming on the highest values.]
Figure 3.4: Number of frames detected as shared by both input traces w.r.t. linear regression window size. The curve represents the average, minimum, and maximum values for eight merge operations. For each merge operation, this number is normalized using 1 as the number of frames from the window size that gives the highest value.

3.3.5 Merging

We now present how WiPal performs the final step, namely the merging process itself. Its role is to copy frames from synchronized traces to the output trace. Of course, it must organize its output correctly while avoiding duplicate frames. Algorithm 3 details WiPal’s merging algorithm. For the sake of illustration, we present here a simplified version that assumes that only one frame is emitted at a given time inside the monitoring area. It simultaneously iterates on both inputs, where each iteration adds the earliest input frame to the output (lines 15–16). Duplicate frames are the ones that have identical contents and that are spaced less than 106 µs apart (line 11). The rationale for this value is that 106 µs is half the minimum gap between two valid frames [59]. Therefore, the appearance of identical frames during such an interval is in fact a unique occurrence of the same frame.

3.4 Evaluation

This section provides an evaluation of WiPal’s merging algorithms using the datasets previously described. We investigate both the correctness and the efficiency of WiPal. We run the merges and then use some heuristics to evaluate the quality of the result. We also analyze WiPal’s execution speed.

3.4.1 Correctness

Checking the correctness of merge outputs is difficult. Being able to test whether traces are correctly merged or not would be equivalent to knowing exactly in advance what the merge should look like. Unfortunately, there is no reference output against which we could
compare. Thus, we propose several heuristics to check if WiPal introduces inconsistencies in its outputs. We also check WiPal’s correctness with a test-suite of synthetic traces for which we know exactly what to expect as output.

Algorithm 3 WiPal’s merging algorithm.
Input: two synchronized traces T1 and T2.
Output: the merge of T1 and T2.
 1: procedure ADVANCE(f: frame, T: trace)
 2:     Append f to output; f ← T’s next frame (or nil)
 3: end procedure
 4: f1 ← T1’s first frame; f2 ← T2’s first frame
 5: while f1 ≠ nil or f2 ≠ nil do
 6:     if f1 = nil then ADVANCE(f2, T2)
 7:     else if f2 = nil then ADVANCE(f1, T1)
 8:     else
 9:         t_f1 ← f1’s time of arrival
10:         t_f2 ← f2’s time of arrival
11:         if f1 = f2 and |t_f1 − t_f2| < 106 µs then
12:             Append either f1 or f2 to output.
13:             f1 ← T1’s next frame (or nil)
14:             f2 ← T2’s next frame (or nil)
15:         else if t_f1 < t_f2 then ADVANCE(f1, T1)
16:         else ADVANCE(f2, T2)
17:         end if
18:     end if
19: end while

A broken merging process could lead to several inconsistencies in the output traces. Regarding our datasets, we investigate in particular two of those inconsistencies: duplicate unique frames and duplicate data frames.

Duplicate unique frames. As seen previously, every unique frame should only occur once in the traces (including merged traces). Yet, it is difficult to avoid collisions in practice (see Section 3.3.2). Thus one should not consider all collisions as inconsistencies. After merging, our traces have 6 collisions. After a manual check, five of them are not inconsistencies introduced by WiPal’s merging process. The last one is due to a synchronization error of 1.5 milliseconds. When looking closer at the output trace, it appears that this error spans 4.7 milliseconds and duplicates at most 4 frames (a beacon frame and three identical retransmitted data frames).
We believe this is an excellent score, considering our inputs have 79,340,347 frames with various timestamping accuracies.

Duplicate data frames. We search traces on a per-sender basis for successive duplicate data frames (only considering non-retransmitted frames). Such cases should not occur in theory: without retransmissions, sequence numbers should at least vary. Surprisingly, some input traces contain such anomalies. We have no explanation for why some datasets exhibit these phenomena. We checked however that the merged trace does not have more duplicates than the original traces (inputs have 1,689 duplicates while the output only has 1,149).

3.4.2 Efficiency

Trace merging is a run-once operation and WiPal is an offline process. Yet speed is an important metric to consider:

• It is always desirable to make a program run faster, as long as it does not answer instantaneously. In particular, as shown below, WiPal is able to perform in minutes what takes hours with other merging software.

• Less time spent merging means more time is available for other, more important processing (e.g., analyzing the dataset, which might also be a heavy operation). As another example, the merge operation might run on a multi-user system, with other users having some time constraints.

• Shorter delays between trace collection and trace analysis mean more interactivity and gains in productivity (e.g., if the collected traces have issues, it might be desirable to detect them quickly in order to fix the problem, possibly by collecting other traces).

Merging all the traces (17.5 GB) takes 35 minutes (real time as reported by the time UNIX command) on a 3 GHz processor with 2 GB RAM. The average CPU usage is 93%. User time, which does not account for system delays and thus disk slowdowns, is 31 minutes and 32 seconds.
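For reference, the merging step being timed here (Algorithm 3 of Section 3.3.5) can be sketched in a few lines of Python. This is only an illustration of the simplified algorithm, not WiPal’s C++ implementation: the `Frame` type, its field names, and the `merge` function are hypothetical, and frames are compared by raw content only.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Frame(NamedTuple):
    """Hypothetical frame record (not WiPal's actual data structure)."""
    timestamp_us: float  # arrival time on the common, synchronized clock
    content: bytes       # raw frame bytes, used to test frame equality


DUPLICATE_WINDOW_US = 106  # threshold used on line 11 of Algorithm 3


def merge(t1: Iterable[Frame], t2: Iterable[Frame]) -> Iterator[Frame]:
    """Merge two synchronized, time-ordered traces, dropping duplicates."""
    it1, it2 = iter(t1), iter(t2)
    f1: Optional[Frame] = next(it1, None)
    f2: Optional[Frame] = next(it2, None)
    while f1 is not None or f2 is not None:
        if f1 is None:                      # T1 exhausted: drain T2
            yield f2
            f2 = next(it2, None)
        elif f2 is None:                    # T2 exhausted: drain T1
            yield f1
            f1 = next(it1, None)
        elif (f1.content == f2.content
              and abs(f1.timestamp_us - f2.timestamp_us) < DUPLICATE_WINDOW_US):
            yield f1                        # same frame heard twice: keep one copy
            f1 = next(it1, None)
            f2 = next(it2, None)
        elif f1.timestamp_us < f2.timestamp_us:
            yield f1                        # earliest frame goes first
            f1 = next(it1, None)
        else:
            yield f2
            f2 = next(it2, None)


if __name__ == "__main__":
    a = [Frame(0, b"beacon"), Frame(500, b"data1"), Frame(900, b"ack")]
    b = [Frame(40, b"beacon"), Frame(700, b"data2"), Frame(950, b"ack")]
    for frame in merge(a, b):
        print(frame.timestamp_us, frame.content)
```

Each input is traversed once, so the cost of this step grows linearly with the number of frames.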
Comparing WiPal with online trace mergers does not make much sense: their mode of operation is different, and they also have different requirements (e.g., wired connectivity and loose synchronization). The comparison would be unfair. We can however compare WiPal with Wit [44], another offline merger. Wit works on top of a database backend, which means that trace files need to be imported into a database before any further operation can begin (e.g., merging or inferring missing packets). Using the same machine as before, importing all input traces into Wit’s database takes 8 hours and 20 minutes (user time). This means that, before Wit begins its merge operations, WiPal can perform at least 14 runs of a full merge with the same data. WiPal thus allows tremendous speed improvements. One of the reasons for such a difference is that WiPal uses high-performance C++ code while Wit is just a set of Perl scripts using the SQL language to interact with a database.

3.5 Conclusion

This chapter introduces WiPal’s trace merger. As an offline merger, it does not require sniffers to be synchronized nor to have access to a wired infrastructure. WiPal provides several improvements over existing equivalent software: (i) it comes as a simple program able to manipulate trace files directly, instead of requiring a more complex software setup; (ii) its synchronization algorithm offers better precision than the existing algorithms; and (iii) it has a clean modular design. Furthermore, we also showed that WiPal is an order of magnitude faster than Wit [44], the other available offline merger.

We have several plans for the future of WiPal’s merging procedure. First, we are currently extending it to include new features. For instance, we are working with other contributors in order to merge other types of packet traces using WiPal’s algorithms.
We are also working with researchers from the University of California, Los Angeles on new synchronization algorithms. We would also like to make better use of WiPal’s modularity and test other algorithms for the various stages of the merging operation.

Part II
Applying WiPal: empirical analyses

Chapter 4
Accuracy of wireless packet sniffing

Once one has tools for sniffing and merging, the question of trace completeness arises. With Wi-Fi sniffing, each sniffer trace is incomplete (i.e., it lacks some frames). Therefore, it is possible that the merged traces are incomplete as well. This chapter focuses on two aspects of trace completeness in IEEE 802.11 networks. First, we observe that existing techniques to evaluate trace completeness are inaccurate (see Section 4.3). Among other issues, a single buggy device may be enough to skew the whole estimate. Second, we study how the number of sniffers impacts trace completeness (see Section 4.4). Using up to eight sniffers sharing (approximately) the same location, we show that even though individual sniffers may provide good accuracy, sometimes using eight sniffers is still not enough to capture all frames. Furthermore, the sniffing process exhibits a high level of randomness with variable accuracy.

To obtain these results we conduct two similar controlled experiments. In each experiment, one records a spot’s Wi-Fi activity for a given duration using multiple sniffers. All sniffers have (approximately) the same location. It is then possible to analyze each sniffer trace, compare the traces to each other, and compute merge operations with a varying number of traces. Finally, studying each merge operation with respect to the number of traces that compose it provides comparative information.

This chapter is structured as follows. Section 4.1 presents the existing techniques to estimate trace completeness. Section 4.2 introduces our datasets and provides a preliminary analysis.
Then Section 4.3 evaluates our datasets’ completeness and draws conclusions about existing evaluation techniques. Finally, Section 4.4 studies the impact of using multiple monitors on completeness.

4.1 Completeness evaluation: state of the art

When collecting IEEE 802.11 data using wireless sniffing, trace completeness is a key issue. Even under good radio conditions, sniffers may miss a successful transmission. Since missed frames are unrecorded, it is impossible to know exactly how complete a trace is. Several methods exist however to estimate the efficiency of wireless sniffing as a technique. Other methods exist to estimate the completeness of single traces. Here is an overview of previous related work.

Yeo et al. [59] use active indoor measurements (in a single university building). They estimate that sniffer traces contain at least 73% of all of their experiment’s frames. When merging traces from three monitors, they obtain a completeness of at least 99%. Using similar experiments in the same kind of environment, Cheng et al. [22] obtain a completeness of 95%.

Serrano et al. [54] also perform active measurements, using an anechoic chamber. Their results show that single sniffer accuracy varies significantly across sniffers, and that performance may also depend on the nature of the experiment under study and on slight changes of the sniffer position. With this best-case scenario using an anechoic chamber, they obtain a completeness of about 96% for single sniffers, on average.

Based on the message sequences allowed by the IEEE 802.11 standard, one can infer some missing frames. For instance, since an acknowledgment frame must follow a successful data frame transmission, a trace containing an acknowledgment with no preceding data frame lacks a frame. Of course, other rules exist for other frame types. Using this technique on real traces from an IETF meeting, Jardosh et al.
[38] estimate a completeness of at least 80% for individual sniffers (due to the dataset, no merging was possible, and therefore no data is available concerning the accuracy of merging).

Rodrig et al. [49] use another technique based on frame sequence numbers to estimate the completeness of their traces. This technique is simple: since most IEEE 802.11 frames contain a sequence number, they look at sequence gaps to estimate missing frames. Using traces they recorded at the 2004 SIGCOMM conference, they evaluate an overall completeness of “roughly 90%”. Curiously, after merging the same dataset, Mahajan et al. [43] estimate an equal completeness of 90% for channel 1, but also a lower completeness of 79% for channel 11.

Schulman et al. [53] also raise an interesting point: since the parameters that impact trace completeness may vary during measurements, one should not use completeness as an accurate indicator of trace quality. For instance, a sniffer might provide a very accurate recording during “silent” periods where only a few access points send beacons, but perform very badly when the network load grows. To solve this issue they propose using dedicated “T-Fi” plots [53]. While we agree with them, we believe however that studying trace completeness is still interesting in some cases. It provides quick insight and is easy to understand. For instance, a trace with a low completeness raises issues, whatever the network load through time.

Figure 4.1: ASUS EeePC 700 with three Netgear WG111v3, as used for trace collection.

As a summary, existing techniques rely on the fact that network protocols define “valid frame sequences”. When a trace contains an invalid (incomplete) frame sequence, one finds a number of frames to insert so that the sequence becomes valid. The inserted frames are then counted as missing frames.
Regarding IEEE 802.11, two categories exist: (i) message-type techniques that rely on frame types (e.g., a management or data frame must precede an acknowledgement) [22;38;43] and (ii) seqnum-based techniques that rely on sequence numbers (e.g., if frame 42 occurs right after frame 39, then frames 40 and 41 are missing) [49;53].

Applications of these techniques show attractive results [22;38;43;49;53;59]. In “academic” environments (laboratories, campuses, conference venues), the literature shows that individual sniffers exhibit completeness values between 70% and 80%. By merging traces, it is possible to reach values above 90%. But, as we will see in the following, we could never achieve such values in our experiments.

4.2 Datasets

We study trace completeness using two datasets. They feature traces from multiple sniffers, each one equipped with three IEEE 802.11 radio interfaces (ASUS EeePC 700 and Netgear WG111v3, see Figure 4.1). Interfaces listen on channels 1, 6, and 11. Each radio is set up in monitor mode and records every frame it hears regardless of the network the frame comes from. We then merge each sniffer’s traces (on a per-channel basis) using the WiPal software suite [25;26].

4.2.1 Overview

We measure both datasets in the same environment but at different times. They record wireless activity in the computer science laboratory building of Université Pierre et Marie Curie. The laboratory spans four floors of a twelve-floor building mostly occupied by private companies. It is located in Paris, well outside the university campus. We refer to the datasets as follows:

2008-12-01. Eight sniffers. Traces last roughly one hour and were recorded on December 1st 2008, starting around 3 p.m. All sniffers were located indoors on the same desktop.

2008-12-19. Six sniffers. Traces last roughly two hours and were recorded on December 19th 2008, starting around 11 a.m. In this dataset, due to other constraints, we split sniffers into three groups of two. All groups are located indoors in the same room, but each group is at a different spot in the room.

                 2008-12-01        2008-12-19
Duration         1h13              2h10
Size¹            3 GB / 578 MB     2.8 GB / 833 MB
Data size²       1 GB / 203 MB     210 MB / 82 MB
Devices³         190               341
ESSIDs           13                24
Access points    66                122
Ad hoc cells     3                 3

¹ Sizes before/after one merges the dataset. Only includes IEEE 802.11 frames and their payloads.
² Data sizes before/after one merges the dataset. Only includes IEEE 802.11 data frames and their payloads.
³ Each distinct MAC address in a frame’s sender field accounts for a device.

Table 4.1: Quantitative characteristics of the 2008-12-01 and 2008-12-19 datasets.

4.2.2 Preliminary analysis

Table 4.1 presents some quantitative characteristics of the datasets. Despite not being very different in nature, traces display some unexpected differences.

2008-12-19 lasts twice as long as 2008-12-01, but its merged dataset is only one and a half times bigger. This difference in activity is probably due to the fact that more people are active during the afternoon than around lunch time. Also, 2008-12-19 is close to Christmas, thus we can expect some regular users to be on vacation at this time.

2008-12-19 has far less data traffic than 2008-12-01. This confirms the previous point. Average management traffic rates are of the same order of magnitude in both datasets,¹ but 2008-12-01 has a higher average data traffic rate (46 kB/s vs. 11 kB/s, all channels cumulated). This is why 2008-12-19 is not twice as big as 2008-12-01. Again, this confirms that 2008-12-01 displays more user activity.

¹ When cumulating traffic from all channels on merged datasets, 2008-12-01 has an average rate of 83 kB/s for management frames, while 2008-12-19 has an average rate of 96 kB/s.

Figure 4.2: Number of MAC addresses each merged trace contains from its beginning to a given time, for (a) 2008-12-01 and (b) 2008-12-19 (curves: channel 1, channel 6, channel 11, and channel average). Contrary to Table 4.1, which only counts MAC addresses from frames’ sender fields, all fields containing valid MAC addresses are used.

Also note that non-data traffic (management and control traffic) is unexpectedly high. Control traffic is negligible (less than 2% of all traffic), therefore this overhead is mostly management traffic. This is a sign that many networks share the medium, each network having its own management traffic.

2008-12-19, which lasts twice as long as 2008-12-01, also has twice as many distinct devices. This also holds for ESSIDs and access points. This is surprising because one should expect to discover most of the devices at the beginning of traces and then to have a curve that increases slowly (especially for ESSIDs or access points). Figure 4.2 presents growth curves. They do appear to be non-linear, but sniffers discover a majority of devices long after the first few minutes and the growth curves are not that flat. The datasets probably do not last long enough for us to draw further conclusions on this point. However, the fact that one is able to discover new networks after more than one hour is another sign that many distinct networks share the radio medium.

Although both datasets exhibit the same small number of ad hoc cells, cell IDs are different in each dataset. Two cells from 2008-12-19 however share a prefix with a cell from 2008-12-01. We believe these cells relate to temporary or test networks (e.g., mesh test 1, mesh test 2, or meshtest).

As a summary, both datasets reflect the same environment under different usage conditions. The environment features a high number of networks, almost all of them being infrastructure networks.
Despite a crowded medium, 2008-12-19 displays noticeably less user activity than 2008-12-01.

4.3 Completeness evaluation: shortcomings

Several issues render completeness evaluation techniques inaccurate, partly because of their strategies and partly because of some anomalies that occur in traces. In fact, existing estimation techniques assume strict conformance to the IEEE 802.11 standard for all devices; this is often not the case, as we will see later in this section.

Analyses of our datasets reveal multiple shortcomings. The individual traces of 2008-12-01 and 2008-12-19 exhibit unexpectedly low completeness values, between 10% and 15% (using a seqnum-based technique). Merging traces raises these values by barely more than 1%. This is far from the expected 90%! Starting with this result, we made several observations.

Estimation techniques assume the network is not congested. In a congested environment many frames fail to access the medium. This means that counting gaps in sequence numbers reveals transmission failures rather than sniffer losses. Note that the large number of access points in our traces supports the congestion hypothesis. It also suggests that the hidden terminal problem is likely to occur on a massive scale.

Seqnum-based techniques assume IEEE 802.11 implementations generate correct sequence numbers. This is wrong in practice because:

1. Some access points wrap their counters at 2,048 instead of 4,096 [53]. How this affects estimation techniques is implementation-dependent. Possible effects include ignoring some relevant gaps or detecting invalid gaps with large values.

2. Some access points set their sequence numbers to zero for all frames (we observed this behavior during other minor experiments).

3. Some access points manage multiple “virtual” access points simultaneously. 2008-12-01 and 2008-12-19 contain several such devices. In the ideal case, each virtual access point should maintain its own sequence counter (IEEE 802.11 [37, p. 66, 7.1.3.4.1]). But in practice this is not true, which introduces invalid gaps (i.e., leading to underestimating completeness).

At the time of this writing, no automatic technique exists to detect such anomalies. In particular, we see no straightforward solution for the third anomaly (single counter for multiple virtual access points). Nevertheless, once one detects the faulty stations, it is in theory possible to work around these anomalies. As an example, we detected that a single device in 2008-12-01 was responsible for a 5% underestimation of completeness.

$$\mathrm{score}(m_k) = \frac{\min(o_1, o_2)}{|m_N|}$$

Figure 4.3: “Score” of a single merge operation. $m_N$ is the last merge, i.e., the one that includes $N$ sniffers. Note that when $k > 2$, $m_{k-1}$ features frames from at least two distinct sniffer traces and thus it is expected that $o_2 > o_1$. Therefore in most cases $\mathrm{score}(m_k) = o_1 / |m_N|$.

Message-type techniques fail to detect series of missing frames. For instance, message-type techniques cannot detect a missing data frame if its corresponding acknowledgement is also missing. We call a clear gap two consecutive frames from the same station that exhibit a gap in the sequence number and that are not interleaved with any frame that mentions this station (either as transmitter or receiver). Clear gaps are symptoms of missing frames that message-type techniques would not detect. In 2008-12-01 and 2008-12-19, 81% and 89% of the estimated missed frames are due to clear gaps. In the famous sigcomm2004 traces [50], clear gaps represent 59% of the estimated missed frames. This means that message-type techniques fail to detect most of the missed frames.

As a conclusion, one should use completeness estimation techniques with care. Message-type techniques are likely inaccurate. Seqnum-based techniques might lead to good results, provided there is no congestion and all participating devices strictly conform to IEEE 802.11.
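To make the seqnum-based idea and its fragility concrete, here is a minimal Python sketch. It is not the estimator used in this chapter or in [49;53]; the function name and the input format (pairs of sender address and sequence number, in capture order) are assumptions made for illustration. It relies only on the fact that IEEE 802.11 sequence numbers are 12 bits wide and therefore wrap at 4,096.

```python
from typing import Dict, Iterable, Tuple

SEQ_MODULO = 4096  # 12-bit IEEE 802.11 sequence numbers


def estimate_completeness(frames: Iterable[Tuple[str, int]]) -> float:
    """Estimate completeness from per-sender sequence number gaps.

    `frames` yields (sender address, sequence number) pairs in capture
    order. Every gap larger than one between two successive frames of the
    same sender is counted as missing frames. Retransmissions, congestion,
    and non-conforming devices are deliberately ignored in this sketch.
    """
    last_seq: Dict[str, int] = {}
    captured = 0
    missing = 0
    for sender, seq in frames:
        captured += 1
        if sender in last_seq:
            gap = (seq - last_seq[sender]) % SEQ_MODULO
            if gap > 1:
                missing += gap - 1  # frames 40 and 41 if 42 follows 39
        last_seq[sender] = seq
    total = captured + missing
    return captured / total if total else 1.0


if __name__ == "__main__":
    # One sender with a real gap (39 -> 42), one wrapping correctly at 4,096.
    print(estimate_completeness([("a", 39), ("a", 42), ("b", 4095), ("b", 0)]))
    # A device wrapping at 2,048: its wrap is mistaken for a huge gap.
    print(estimate_completeness([("c", 2047), ("c", 0)]))
```

In the second call, a device that wraps its counter at 2,048 makes this naive estimator count 2,048 missing frames for a trace that is in fact complete, which illustrates the first anomaly listed above.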
In any case, uncertainty exists regarding the accuracy of these techniques.

4.4 Completeness and number of sniffers

We now go a step further and investigate the impact of the number of sniffers on the completeness of the dataset. To this end, we analyze subsets of our datasets with a varying number of monitors.

4.4.1 Methodology

The goal is to evaluate the “quality” of a merged dataset with respect to the number of sniffers that compose it. We combine individual traces in groups of increasing size $k$, where $k \in \{2, 3, \ldots, N\}$ is the number of traces inside the group. Recall that $N = 8$ for 2008-12-01 and $N = 6$ for 2008-12-19.

Let $M_k$ be the set of groups of size $k$ (i.e., merged datasets including traces from $k$ sniffers). To compute $M_k$, we proceed recursively from $M_{k-1}$. For the sake of simplicity, we define the binary merge operation $a \star b$, meaning the result of merging datasets $a$ and $b$. This operation is theoretically symmetric and associative, and we assume that our trace merging algorithms have these properties.

Let us show an example with $N = 4$ (see Figure 4.4). The original traces are $\{a, b, c, d\}$. We first compute $M_2$ by merging each trace with each other (due to symmetry, we skip some operations, e.g., $b \star a$ because we compute $a \star b$ instead). We compute $M_3$ by merging each element of $M_2$ with the remaining traces. Again, we can skip some operations due to symmetry and associativity (e.g., we skip $b \star c \star a$ because we compute $c \star a \star b$ instead). We keep on performing this procedure until $k = N$.

Figure 4.4: Successive computations of $M_k$ for $N = 4$. An arrow from $x$ to $y$ symbolizes the $x \star y$ merge operation.

Note that computing $M_k$ involves $\binom{N}{k}$ merge operations. Also note that each merge operation produces a new merged dataset. Therefore, in the following, we identify each merge operation with its output.

In order to evaluate the quality of each $M_k$, we attribute a score to each element of $M_k$.
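Assuming the merge operation is indeed symmetric and associative, each element of $M_k$ corresponds to one unordered group of $k$ traces among $N$. The following Python sketch (function names are hypothetical and not part of WiPal) enumerates these groups and counts the merge operations the methodology requires; it does not perform any actual merging.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Sequence, Tuple


def groups_by_size(traces: Sequence[str]) -> Dict[int, List[Tuple[str, ...]]]:
    """Return, for each k in 2..N, the list of k-trace groups forming M_k."""
    n = len(traces)
    return {k: list(combinations(traces, k)) for k in range(2, n + 1)}


def total_merges(n: int) -> int:
    """Number of merge operations needed to compute M_2 through M_N."""
    return sum(comb(n, k) for k in range(2, n + 1))


if __name__ == "__main__":
    for k, m_k in groups_by_size(["a", "b", "c", "d"]).items():
        print(k, len(m_k), m_k)
    print(total_merges(8), total_merges(6))
```

Under these assumptions, the full procedure amounts to 247 merge operations for eight sniffers and 57 for six, which is one reason merging speed matters for this kind of study.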
We then compute $M_k$’s average score. Let $m_k$ be the merged dataset one wants to score, $m_{k-1}$ the previously merged dataset, and $t$ the new individual trace we want to add to $m_{k-1}$. We have $m_k = t \star m_{k-1}$. Figure 4.3 depicts how we compute $\mathrm{score}(m_k)$. Basically, $\mathrm{score}(m_k)$ represents how many frames $m_k$ contains that would not have been taken into account if only $m_{k-1}$ or $t$ were considered. For better readability, we normalize this quantity with the frame count of our largest merged dataset (so a score is a ratio between 0 and 0.5). The larger the score, the more useful the merge.

Figure 4.5: Evolution of scores w.r.t. the number of monitors (panels: 2008-12-01 and 2008-12-19; x-axis: number of monitors; y-axis: score in %; one curve per channel: 1, 6, and 11).

4.4.2 Results

Figure 4.5 and Figure 4.6 present the results. For both datasets, we merge each channel individually. Each cell presents average values for a given set of individual datasets. We draw the following conclusions.

Scores decrease with size. This is expected: the bigger the dataset, the less interesting it is to add new sniffers.

Scores never reach zero. This is however unexpected: even with eight sniffers, each trace contains a small percentage of frames that do not exist in the seven others.

Small merges are not that bad. Merges of size 2 are able to provide a significant portion of the datasets’ total number of frames (78% and 73% on average for the two datasets). This indicates that a large part of individual traces’ frames are shared among sniffers. This is also visible when looking at the average proportion of shared frames inside $M_2$. One needs many sniffers however to obtain a near-complete trace: at least 5 sniffers for sizes above 90%.

Figure 4.6: Scores w.r.t. number of monitors and dataset.
Each column represents a given channel of a specific dataset. Each row $M_k$ represents the set of sub-datasets of size $k$. Each cell contains a box whose size is proportional to the average number of packets inside the corresponding sub-datasets. Red (dark) parts of boxes represent average values of $o_1$ (see Figure 4.3). Pink parts (medium grey) represent average values of $o_2$. Numbers below boxes are average scores (in percent) with 95% confidence intervals.

Individual sniffers display high variability. This translates into wide confidence intervals on the first row of Figure 4.6. For instance, a sniffer of 2008-12-01 accounts for 53% of all the dataset’s frames while another accounts for up to 87% (results vary between 45% and 80% for 2008-12-19). Since some of these variations occur with sniffers next to each other, we conclude that sniffing processes exhibit high randomness.

As a summary, although most frames are heard by multiple sniffers, a few of them are difficult to receive. This means that each sniffer’s traces contain most of the dataset’s frames but also some original frames. Therefore, researchers should use techniques that are robust to frame losses, as such losses are unavoidable no matter the number of sniffers.

4.5 Conclusion

Our analyses reveal that traditional completeness estimation techniques have several shortcomings, making them unreliable. Even when using eight sniffers on the same desktop, there exist frames only recorded by one sniffer. This suggests some other frames were left unrecorded even though we use more sniffers than in typical settings.

Several extensions to this work are possible. We plan to analyze underloaded environments with more monitors. One could also focus only on networks with good reception. Finally, it could be interesting to look for other completeness estimation techniques, to differentiate between transmission failures and frame losses.

Chapter 5
Empirical analysis of Wi-Fi activity in three urban scenarios

The ability to study arbitrary environments is one of the motivations that led to developing WiPal. More specifically, we are interested in environments where no network traces are publicly available. This is why, in this chapter, we record and analyze traces from three environments with different sociological profiles: an office, a dense uptown residential area, and a sparse suburban residential area. Contrary to existing studies, we do not focus on a single network, but on the overall network activity. We study the behavior of devices rather than traffic characteristics. We are interested in observations like the total duration a device is active, the frequency of appearance of new devices, and activity that can be extracted from traces. A sniffer usually faces radio range limitations and high packet losses; nevertheless, analyzing traces provides important insights into the activity of a given wireless environment as perceived by the wireless adapters. This is joint work with Mathias Boc [15]. We carried it out during our respective PhD theses.

Many papers actually use wireless sniffing as a monitoring technique. For instance, Cheng et al. propose Jigsaw [22], a large-scale monitoring system based on sniffing. However, despite being powerful and scalable, Jigsaw imposes some constraints on monitors that make it impractical in a number of environments. Researchers often use sniffing as a means to diagnose network problems [23], enhance security [12], or analyze communication protocols [43;59]. Nevertheless, as far as we know, authors using sniffing do not study user behaviors.

In fact, some papers analyze the behavior of users with other techniques, but most of them focus on specific environments. They typically rely on traces collected from a given network’s logs [13;35;42;56].
Consequently, their methods are not applicable when several independent networks cover the target area, or when it is infeasible to access the network infrastructure. It is interesting to note however that some of these papers study large-scale networks. In particular, Afanasyev et al. [9] use such a technique on a city-wide network with several types of users (broadband access, 3G cellular, and commercial). Some papers rely on giving volunteers dedicated devices that measure contacts with other devices [36]. Typically, experiments involve a few dozen participants for a few days, which is the main limitation of this technique. To the best of our knowledge, only González et al. [33] study human mobility in a large environment, but they focus on real mobility rather than user behavior as seen from IEEE 802.11 networks.

5.1 Setup

We perform our analyses on traces collected in three different environments. We obtain each trace using a sniffer (laptop) equipped with three IEEE 802.11 radio interfaces (ASUS EeePC 700 and Netgear WG111v3, see Figure 4.1 in the previous chapter). The interfaces listen on channels 1, 6, and 11. Each radio is set up in monitor mode and records every frame it hears regardless of the network.¹ We refer to the three traces as follows:

Office. This is a three-day-long trace recorded in the computer science laboratory of Université Pierre et Marie Curie – Paris 6. The laboratory spans three floors of a twelve-floor building that is also occupied by some private companies.

Residential, sparse. This is a three-day-long trace recorded in a suburban residential area. The area contains only small residential buildings and houses.

Residential, dense. This is a ten-day-long trace recorded uptown. The area is mostly residential but includes shops and schools. Residential buildings are tall towers. There is heavy car and pedestrian traffic.

Table 5.1 presents quantitative characteristics of these traces.
As expected, the office trace has the greatest number of devices, ESSIDs,² and access points (AP). It has more access points than ESSIDs, which means that some wireless networks span multiple access points. The office trace also contains beacons from a relatively high number of ad hoc networks. This comes mostly from unconfigured devices (e.g., printers and “Free Public WiFi”) and devices that create a network they expected to find (e.g., “AT&T Wireless”). The same reasons make the dense residential trace include information on ad hoc networks.

¹ Although they are not available yet, we plan to make these traces public as soon as possible.
² ESSIDs are strings used as network identifiers. A single network might include multiple access points, but has only one ESSID.

                 Office        Residential, sparse   Residential, dense
Duration         3 days 10h    3 days 12h            10 days 15h
Size¹            11.92 GB      3.67 GB               1.61 GB
Data size²       3.43 GB       4.75 MB               290.82 MB
Devices³         856           49                    294
ESSIDs           44            9                     7
Access points    52            14                    4
Ad hoc cells     82            1                     10

¹ Sizes only include IEEE 802.11 frames and their payloads.
² Data sizes only include IEEE 802.11 data frames and their payloads.
³ Each distinct MAC address in a frame’s sender field accounts for a device.

Table 5.1: Quantitative characteristics of the Office, Residential sparse, and Residential dense traces.

The sparse residential trace, as expected, has the smallest number of devices. It has more access points than ESSIDs; this is due in part to hidden ESSIDs (5 APs hide their ESSID, and we expect them not to belong to the same network) and to Internet boxes that advertise shared ESSIDs belonging to network operators (e.g., for Wi-Fi phone service). This trace however includes two surprising features:

1. Out of the 3.67 GB that compose the sparse residential trace, only 4.75 MB are data frames!
98.7% of frames in the sparse trace are access point beacons, which suggests that these networks exist but are unused in practice. We believe that they are provided by default with network operator boxes, but that most people access the Internet using wired links to their boxes.
2. The sparse residential trace is bigger and has more access points and ESSIDs than the dense residential trace. In fact, the sparse residential trace is bigger because its sniffer has more networks in its vicinity. This means that access points’ frames account for most of a trace’s size. It is however surprising that the sparse trace has more networks than the dense one. This might be due to differences in Wi-Fi signal propagation in each area (making it easier to hear distant networks in a sparse environment) or to social differences in the populations composing the neighborhoods.

5.2 Device diversity

This section investigates two sources of device diversity: cumulated activity durations and growth of the number of devices. The term device refers to any IEEE 802.11 station. This typically concerns human-operated computers, but also access points and Wi-Fi printers. The reason we study these two characteristics is twofold. First, we want to investigate who exactly uses the wireless medium at given periods and locations. Second, device diversity is relatively easy to compute even in the presence of huge frame losses. This is important because we record each trace using wireless sniffing in areas that are unfriendly to this technique (due to interference and the presence of multiple walls). In this regard, the sparse residential trace is the worst: by looking at frame sequence numbers, we observe that the trace lacks 85% of the frames. We estimate that the office trace has a missing frame ratio of 70%, and the dense residential trace a missing frame ratio of only 4%. This small value is due to the fact that the trace features a very small number of active networks, which means that the sniffer had very good reception of the predominant network. We first analyze the distribution of cumulated activity durations and then the growth of the number of devices.

[Figure 5.1: Distributions of cumulated activity durations. Panels: (a) Office, (b) Residential, sparse, (c) Residential, dense, each with one plot per channel (1, 6, and 11). x-axis: devices (sorted by total activity duration); y-axis: total activity duration (ticks from 3 min to 1 week).]

5.2.1 Cumulated activity durations

Figure 5.1 plots the distribution of cumulated activity durations among all traces and channels. Each impulse maps a single device to the total duration of its activity inside the trace. We consider that a device is active when it emits a frame within a window of three minutes (any type of frame: management, data, or control). We use the three-minute threshold because access point drivers use activity timers with similar values (e.g., MadWifi drivers use timers varying from 30 s to 5 min). Requiring one frame within a window of a few minutes makes the technique resilient to frame losses.
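The activity criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool we used: the function name `cumulated_activity`, the `(timestamp, sender)` input format, and the choice of fixed, non-overlapping three-minute windows aligned on the start of the trace are all hypothetical.

```python
from collections import defaultdict

WINDOW = 180  # seconds; the three-minute threshold used in the text


def cumulated_activity(frames, trace_start=0.0, window=WINDOW):
    """Map each sender MAC address to its total activity duration (seconds).

    frames: iterable of (timestamp_in_seconds, sender_mac) pairs.
    Assumption: windows are fixed, non-overlapping, and aligned on
    trace_start; a device is active in a window if it sent >= 1 frame.
    """
    active_windows = defaultdict(set)
    for timestamp, sender in frames:
        index = int((timestamp - trace_start) // window)
        active_windows[sender].add(index)
    # Each active window contributes one full window to the total.
    return {mac: len(idx) * window for mac, idx in active_windows.items()}


frames = [
    (0.0, "aa:aa"), (10.0, "aa:aa"),  # same window: counted once
    (200.0, "aa:aa"),                 # second window
    (1000.0, "bb:bb"),                # a device seen only once
]
durations = cumulated_activity(frames)
```

Because a single received frame is enough to mark a whole window as active, losing some of a device's frames rarely changes its total, which is the resilience property mentioned above.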
A few features are common to all traces:

Devices are unevenly distributed among channels. In all traces, more devices appear on channel 11 than on channel 6, and more devices appear on channel 6 than on channel 1. This is a direct consequence of networks being unevenly distributed among channels (both in ad hoc and infrastructure modes).

Device activity has a highly uneven distribution for a given trace and channel (note the logarithmic scale on Figure 5.1). We can classify devices in three groups: (1) devices that are (almost) always active, (2) devices that appear only once, and (3) other devices. Among all traces, a total of 31 devices (out of 2,395) belong to class (1).³ 27 of these devices seem to be access points. Of the four remaining devices, two appear in the office trace and two in the dense residential trace. These four devices emit no beacons, so they are not in ad hoc mode. It is interesting to underline that a handful of users always leave their devices on. A significant portion of devices belong to class (2) (20% in the office and dense residential traces, 9% in the sparse residential trace). This means that many users are not regular and just pass by. Class (3) is diverse and includes the whole range of possible duration values. However, the smaller the duration, the higher the probability.

Most devices are nearly inactive. Depending on the trace and channel, 48% to 96% (76% on average) of the devices are active for less than one hour during the whole duration of measurements. Therefore, a majority of devices are inactive most of the time.

The traces also differ on some points and include specific features. The office and sparse residential traces have similar shapes, but devices in the latter tend to accumulate longer activity durations. The dense residential trace has a shape that includes a visible cut between very active and nearly inactive devices.
These variations are noticeable through the average durations: 2h36min for the office trace, 11h48min for the sparse residential trace, and 2h21min for the dense residential trace (keep in mind that this trace is three times as long as the other two). Therefore, in some environments, devices tend on average to be active longer.

³ A device that appears on multiple channels is counted multiple times.

5.2.2 Growth of the number of devices

Figure 5.2 plots the growth of the number of devices. Each curve corresponds to a given trace and channel (plus a curve for each trace that represents the average of the three channels). Each point shows the number of devices a given [trace, channel] pair features from its beginning up to the corresponding timestamp. We consider that each MAC address represents a device, and look for MAC addresses in all address fields of the frames. Some devices are mentioned as destinations but never as transmitters. That explains why we discover more devices than indicated in Table 5.1 and Figure 5.1. Furthermore, due to a subtlety in the IEEE 802.11 protocol, some address fields of the frames may contain values that are not real MAC addresses (e.g., independent BSSIDs). We ignore these values.

[Figure 5.2: Number of distinct MAC addresses each trace contains from its beginning to a given time. Panels: (a) Office, (b) Residential, sparse, (c) Residential, dense, each with curves for channel 1, channel 6, channel 11, and the channel average. x-axis: time (days).]

We can derive a number of interesting observations from Figure 5.2:

The distribution of devices among channels is uneven. Furthermore, it does not always correlate with the distribution of sending devices among channels. In all traces, fewer devices appear on channel 1 than on any other channel. This is perfectly consistent with previous results (cf. Section 5.2.1). Nevertheless, channel 6 attracts more users than channel 11 in two of the three traces. This contradicts the channel distribution of Figure 5.1. The difference is that Figure 5.1 only considers devices that emit frames while Figure 5.2 considers all types of devices. This indicates that it is difficult to evaluate the distribution of users among certain channels.

The discovery rate follows a day-night pattern. Curves periodically alternate between flat and growing periods. Depending on the trace, this effect has varying amplitudes and periods, but it is visible in all traces. Flat periods occur during nights, usually starting around midnight and stopping a few hours before noon. This shows that, as expected, devices’ activity correlates with human activity.

In the office and dense residential traces, the discovery rate is constant during long periods. Furthermore, in the dense residential trace, this still holds after a week of measurement. On the other hand, the sparse residential trace flattens drastically after two days. We believe that this is a consequence of the type of environment: high mobility is expected in uptown streets and offices, as well as a high turnover of people. We can expect that many new users will not come back before the measurement ends. This also explains why the average activity duration per user is higher in the sparse residential trace (see the end of Section 5.2.1).
Note, however, that even when the discovery rate falls after two days, it is still possible to discover new users near the end of the trace.

Among the different observations derived in this section, we believe that two are of particular importance. First, as shown by the study of activity durations, users are mobile or do not generally keep their Wi-Fi equipment switched on. This translates into packet traces where most devices are inactive most of the time. Second, different environments have different impacts on mobility. This translates either into new user appearances being evenly spread inside traces or, on the contrary, grouped at the beginning of traces.

[Figure 5.3: CCDFs of aggregated inter-activity times of all devices for the three traces. The distributions are well fitted by truncated power laws with exponential decays. The parameters of the distributions are presented in the text. Panels: (a) Office, (b) Residential, sparse, (c) Residential, dense. x-axis: time t (seconds); y-axis: P[Inter-activity > t]; each panel shows the measured distribution and the fitted curve.]

5.3 Activity/Mobility Behaviors

This section analyzes the type of relationship devices develop with their environments. Behind the notion of relationship, we are interested in understanding how device activity evolves. We highlight predominant patterns when they exist, with the objective of characterizing the importance of locations on the behaviors of the devices. Because of our single-point view of space (we only use one sniffer at a time), it is difficult to extract physical mobility behaviors from traces.
In some situations, however, temporal activity patterns give insight into devices’ mobility: either a device is no longer in the considered space or it is back and active. Statistical tools exist to extract information on mobility patterns. We take advantage of them and rely on the available activity information.

5.3.1 Inter-activity patterns

In this first part, we analyze the devices’ rhythm of activity. For this purpose, we represent the aggregated complementary cumulative distribution function (CCDF) of the inter-activity times (see Figure 5.3). The inter-activity time is the time gap between the beginnings of two consecutive periods of activity. Therefore, the duration of activity is included in the inter-activity time. We start by presenting the distribution parameters and, for each trace, we investigate the meaning of variations when they exist. Note that only devices that are active at least twice are represented here.

We can approximate the CCDFs of the three traces by truncated power laws with exponential decays:

P(t) = (t + t_0)^{-\alpha} \exp(-\beta t / k)    (5.1)

For the office trace, t_0 = 1 minute, α = 0.40, β = 1.2, and k = 24 hours (Figure 5.3(a)). The parameters for the residential sparse trace are t_0 = 1 minute, α = 0.45, β = 1.40, and k = 24 hours (Figure 5.3(b)). For the residential dense trace, t_0 = 15 minutes, α = 0.40, β = 0.8, and k = 24 hours (Figure 5.3(c)). The power law part of the distributions shows a slope that is very similar to recent experimental results found in the literature [19;41]. It accounts for a large proportion of the inter-activity times: ≈ 98.3% for the office trace, ≈ 99.2% for the residential sparse trace, and ≈ 92% for the residential dense trace. For the three distributions, k is almost the same, which may point to a cycle or period of one day (the characteristic time in [41]).
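As an illustration of the definitions above, the following Python sketch derives inter-activity times from a device's active windows and evaluates the model of Equation (5.1). It is a sketch under assumptions: the function names are hypothetical, activity is assumed to be given as indexes of three-minute windows, a period of activity is assumed to be a run of consecutive active windows, and any normalization constant of the fitted curve is omitted because the text does not give one.

```python
import math


def activity_period_starts(active_windows):
    """Return the first window index of each run of consecutive windows."""
    starts = []
    previous = None
    for index in sorted(set(active_windows)):
        if previous is None or index != previous + 1:
            starts.append(index)  # a gap precedes this window: new period
        previous = index
    return starts


def inter_activity_times(active_windows, window=180):
    """Gaps (seconds) between the beginnings of consecutive periods."""
    starts = activity_period_starts(active_windows)
    return [(b - a) * window for a, b in zip(starts, starts[1:])]


def empirical_ccdf(samples, t):
    """Fraction of samples strictly greater than t."""
    return sum(1 for s in samples if s > t) / len(samples)


def model_ccdf(t, t0, alpha, beta, k):
    """Equation (5.1); t, t0, and k must use the same time unit."""
    return (t + t0) ** (-alpha) * math.exp(-beta * t / k)


windows = [0, 1, 2, 10, 11, 30]  # hypothetical device: three periods
gaps = inter_activity_times(windows)
```

Since the gap is measured between period beginnings, the duration of the first period is included in it, as stated above. A device with fewer than two periods yields an empty list and therefore does not contribute to the aggregated CCDF.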
The value of β is around 1.3 for the first two traces, which indicates a strong contraction of the probabilities of activity after 24 hours. For the residential dense trace, this value is lower (0.8), which indicates a greater disparity of the probabilities. It is important to note that, partly due to the longer duration of this trace, there are no strong variations and thus no coordinated behaviors among the devices. Finally, the parameter values are very similar for the office and residential sparse distributions, which indicates that these locations might have the same level of influence on device behaviors (constraint, necessity, social habits).

Concerning the variations in the distributions, we observe three main steps in the distribution of the office trace: the first around 1 hour, the second around 24 hours, and the third around 48 hours. Given the characteristic time k of 24 hours, we can suppose a periodicity of one day for a large part of the devices and one of 48 hours for a smaller part. The first variation around 1 hour is difficult to interpret because of its small length, but it may be linked to pauses in the devices’ activity during their presence in the environment.

The residential sparse distribution presents four main steps: the first around 12 minutes, the second around 14 hours, the third around 24 hours, and the fourth around 34 hours. The first variation around 12 minutes is particularly interesting. After verification in the trace, this duration corresponds to a handheld mobile device that is programmed to check mail every 15 minutes (the observed 12 minutes plus the 3 minutes of activity duration granularity). Of the three other variations, the one around 24 hours concentrates the greatest proportion of probability. With a characteristic time of 24 hours, it points to a daily periodicity.
Following the same logic, the two other variations also point to a strong periodicity of 24 hours but time-shifted by 14 hours, and to a periodicity of 34 hours that gathers fewer devices than the 24-hour period.

Compared to the other traces, the residential dense distribution does not present clear wide variations. Although the characteristic time is also around 24 hours, there is no variation around this value. In this situation it is difficult to judge whether there really are no coordinated behaviors among devices. To confirm the observations of periodicity, we analyze the device activity behaviors from a different point of view in the following.

[Figure 5.4: Proportion of users that are active in each time interval relative to the first time (interval) they appeared, for the three traces. In these traces, we observe a clear periodicity of 24 hours with some variations that are characteristic of the social meaning of each environment. Panels: (a) Office, (b) Residential, sparse, (c) Residential, dense. x-axis: time t (hours); y-axis: proportion of active users.]

5.3.2 Predominant activity pattern

With a long-term scope, we now investigate and extract, if it exists, the predominant pattern that defines each context, along with the properties that characterize it. There are different means to address this issue. Our approach is simply to consider the activity of each device by slicing the observation period into time intervals of equal duration and by aggregating the activity patterns in each of these time intervals. More specifically, for each device we mark the set of time intervals where it has been active relative to the first time interval when it appeared in the environment (which is set to 0).
For each time interval, we compute the number of devices that were active to obtain the different proportions. With this method, the proportion of active devices at time interval x indicates how many devices have been active x time interval(s) after the first time they were seen. Therefore, a peak at each kx (with k > 0) could point to a coordinated activity and a periodicity of behaviors. Here, we set time intervals of 1 hour and plot the results in Figure 5.4.

We start by analyzing the results obtained for the office trace (Figure 5.4(a)). As we can observe, the figure presents clear peaks every 24 hours, which indicates a daily periodicity in the behaviors. The decrease in the proportions is due to the new devices that have not been active during the whole period of activity. Therefore, their activities are mostly visible in the first part of the figure. The second observation is that, around the peaks, the proportions remain high during a period of about 8 hours and then decrease abruptly. Hence, there is a real coordinated movement of a large proportion of the observed population in this context. Office constraints and schedules can explain this phenomenon, and this predominant behavior can be judged representative of this type of environment, as a large part of the population are workers.

Contrary to the office trace, the residential sparse one presents a different pattern with interesting properties (Figure 5.4(b)). If we start with a periodicity of 24 hours, the pattern presents peaks every 24 hours, which confirms this (expected) periodicity. However, we also observe another period of 24 hours, time-shifted by 10 hours from the start. To summarize, globally, devices are active 10 hours after the first time of activity and 14 hours after that, with a periodicity of 24 hours. This phenomenon may have different explanations.
The explanation most closely related to the residential environment is a diurnal activity from which an office-like pattern is subtracted. Devices are active in the early morning and in the early night. In this situation, the gap of 10 hours corresponds to night periods when devices are not active, and the gap of 14 hours to periods when devices are away from home. Therefore, there is a real complementary link between the two environments. Compared with what we know from the networking literature, where most mobility/activity behaviors come from university campuses, the activity pattern we observe here is clearly new and different.

However, while we are able to extract a predominant pattern from the residential (sparse) environment, we obtain a different pattern for the residential (dense) context. As mentioned in Section 5.1, the residential dense trace has been obtained uptown while the residential sparse one is suburban. In a suburban residential environment, the proportion of observed devices that may have a relationship with the environment is large because of the high proportion of homes. Uptown, the presence of shops, schools, and other concentration points may introduce a proportion of devices that do not have any relationship with the considered location. With these elements in mind, we observe that it is difficult to extract a predominant pattern from the results of the residential dense trace (Figure 5.4(c)). If, as for the residential sparse trace, we consider a priori a periodicity of 24 hours, there are peaks that confirm this periodicity, but with the same proportions as other peaks that occur irregularly. In this situation, a classification of the devices could be interesting to better understand the different relationships that exist in this environment. Although we leave this study for future work, we should be able to detect and analyze the as yet unseen population category of householders, as well as more traditional ones such as workers, commuters, and visitors.
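The slicing method of Section 5.3.2 can be sketched as follows in Python. This is a minimal illustration, not the code behind Figure 5.4: the function name, the input format (for each device, the indexes of the one-hour intervals in which it was active), and the normalization by the total number of devices are assumptions.

```python
from collections import Counter


def relative_activity_profile(active_hours_by_device):
    """Proportion of devices active x intervals after their first one.

    active_hours_by_device: dict mapping a MAC address to the indexes of
    the one-hour intervals in which the device was active.
    Assumption: proportions are relative to the total number of devices.
    """
    counts = Counter()
    total = 0
    for hours in active_hours_by_device.values():
        hours = set(hours)
        if not hours:
            continue  # never active: not part of the population
        total += 1
        first = min(hours)  # first appearance is re-based to interval 0
        for hour in hours:
            counts[hour - first] += 1
    return {offset: n / total for offset, n in sorted(counts.items())}


# Hypothetical devices: two come back every 24 hours, one appears once.
profile = relative_activity_profile({
    "aa:aa": [8, 32, 56],
    "bb:bb": [9, 33],
    "cc:cc": [20],
})
```

With this re-basing, devices that return every day pile up at offsets 24, 48, and so on, regardless of the absolute hour at which each of them was first seen, which is why daily habits show up as peaks in the aggregated profile.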
5.4 Conclusion

This chapter analyzes behaviors of Wi-Fi users in three different locations that have distinct social meanings. With our sniffing technique, we are able to provide a more complete view of the population moving in a given location and to highlight important aspects of what can be found in real situations. In particular, we notice that: (i) in popular places, the rate of discovered users can increase almost linearly within the window of observation, (ii) regular users account for a very small portion of the total population, (iii) user activity varies greatly from scenario to scenario, and (iv) the location plays a role in the presence duration. Related to these aspects, our study also raises open issues, such as how to distinguish the population for which the considered location has a social meaning and how a device can understand what kind of environment it is currently in.

In order to extend this study, we are currently analyzing traces from multiple monitors that we collected in a Parisian park, the Parc Monceau. This park’s Wi-Fi activity interests us because it includes several access points spread over various locations. We used ten monitors and measured an area about half the park wide for one hour (see Figure 5.5). Our analyses are in progress, so we only have a few results for the moment. Traces include 138 emitting devices, 71 of which are Apple devices. We believe these are mostly mobile devices (iPhone or iPod touch). With such a number of mobile devices, it is possible that the traces reveal unseen usage patterns.

[Figure 5.5: Sniffer locations regarding the collection of traces inside the Parc Monceau. The subsequent trace analysis is currently in progress. (Background from Google Maps.)]

Chapter 6
Conclusion and future work

Wireless sniffing is a powerful technique to measure activity in Wi-Fi networks but suffers from a number of issues.
These are both pragmatic and theoretical. First, existing software to handle IEEE 802.11 packet traces is not satisfactory. In general, available software has not been designed for reusability. Thus, developing new tools requires starting from scratch. There is also a lack of efficient and flexible merging tools. Second, several issues exist regarding the relevance of wireless packet traces. Wi-Fi sniffers inherently miss some frames, and it is therefore essential to evaluate the number of missed frames (i.e., the completeness of traces). Most studies involving wireless sniffing do not focus on Wi-Fi usage patterns. Other studies use it only in specific environments such as laboratories, campuses, or conferences.

This thesis addressed the aforementioned issues. We first developed WiPal, a framework to help process IEEE 802.11 packet traces. WiPal includes a flexible trace merger. Through the analysis of two short-lived traces, we studied the accuracy of completeness evaluation techniques and the impact of adding new sniffers on trace completeness. A final study collected and exploited three long-lived datasets in different environments to study Wi-Fi usage patterns.

6.1 WiPal

WiPal’s design includes several software patterns that are relevant to packet trace processing. Since packet traces are basically streams of packets, using a pipe and filter pattern gives users a modular approach to trace processing. This allows for easy parametrization and maintenance of existing algorithms. Many algorithms also need to access specific fields of IEEE 802.11 frames and thus need to embed a parser. WiPal provides a solution that uses static callback functions to combine performance and reusability. WiPal also includes original features that cannot be found in other tools. Among them are random access to a packet trace and trace aggregation. Evaluation shows that most of WiPal’s features have marginal costs on its performance, and thus WiPal does not trade performance for reusability. Some of WiPal’s utilities run faster than other state-of-the-art programs. WiPal features a library and tools to carry out various miscellaneous operations (such as comparison, concatenation, or hexadecimal dumping), statistics extraction, or anonymization.

WiPal also includes an innovative offline trace merger. This merger includes original algorithms for reference frame extraction and trace synchronization. A study shows that its synchronization algorithm offers better performance and better accuracy than previous algorithms. WiPal’s merger supports more input formats than any other Wi-Fi packet trace merger. Contrary to other tools, using it is straightforward and does not require setting up database backends or time servers. A performance evaluation also shows that it is an order of magnitude faster than Wit, the other offline trace merger.

6.2 Wi-Fi sniffing accuracy

In order to gain further insight into the completeness one can expect from Wi-Fi sniffing, we collected two short-lived datasets involving six and eight sniffers. Possibly due to congestion, these datasets exhibit a lower completeness than expected. A careful analysis reveals, however, that the existing evaluation techniques suffer from a number of issues. First, techniques based on analyzing message types are not accurate. Second, some Wi-Fi devices do not conform to the IEEE 802.11 standard and might skew the results of techniques based on analyzing sequence numbers. Finally, none of the existing techniques is accurate when the network is congested.

We then go further into our analyses and study the impact of the number of sniffers on trace accuracy. To this end, we vary the number of sniffers we use from a given dataset (starting with a single trace and then adding traces one after another until we use all the traces from the dataset).
We find that, although most frames are heard by multiple sniffers, a few of them are difficult to receive. In other words, each sniffer’s trace contains a large share of the dataset’s frames (between 45% and 87% in our traces) but also some original frames. This is true even when using eight sniffers sharing the same location. We argue that researchers should use analysis techniques that are robust to frame losses.

6.3 Wi-Fi activity

In a last study, we deploy Wi-Fi sniffers in three distinct environments with different sociological meanings: an office space, a sparse suburban residential area, and a dense uptown residential area. Inside these environments, we focus on Wi-Fi usage patterns. We consider the whole traffic rather than a single network. The traces we collect last three days, except for the dense residential trace, which lasts ten days.

The three traces exhibit a number of differences. Among the residential traces, the biggest one carries almost no data traffic but includes more than ten distinct access points. The other residential trace, on the other hand, only mentions four access points but has a significant part of its traffic dedicated to data. This reveals that access point frames (and most notably management frames) account for most of a trace’s size. Also, some environments include networks that are configured but not used. Another interesting feature is device discovery. While the office and the dense residential traces display an ever-increasing number of discovered devices, the sparse residential trace flattens after two days. This reveals that some environments display users with higher mobility and a high turnover, but this behavior is not universal. Environments also complement each other. While the office space has one peak of activity per day, the residential environments display two peaks spaced by ten hours.
This reflects how people use their Internet connection before and after going to work.

Finally, the traces share a number of features. In all environments, devices are unevenly distributed among channels. Channel 11 always includes more devices than channel 6, and channel 6 always includes more devices than channel 1. Datasets also feature day-night patterns (but this is rather expected, as device activity reflects human activity). Also, inside every trace, only a very small portion of devices appear regularly. Finally, all traces include ad hoc cells.

6.4 Perspectives

As WiPal is a framework to help develop new tools, several perspectives exist regarding its extension. A first natural step would be to implement new protocols and filters in order to obtain a tool with a more general purpose than IEEE 802.11 traces. WiPal already includes a few features regarding Ethernet, IPv4, and IPv6, but these are not at the same level as its handling of IEEE 802.11. Merging is also available as an experimental feature for some IP packet traces. Such generalizations are good indications that WiPal’s design is not specific to Wi-Fi and suits other protocols. It would also be of interest to exploit the modular nature of WiPal and develop multiple implementations of some features. For instance, WiPal could include multiple synchronization algorithms or multiple anonymization functions.

Because WiPal uses static C++ techniques, several of its features require writing cumbersome code. Research in this regard could make these features nicer to use. Another issue with WiPal is that its large code base makes it difficult to check for correctness of operation. Although most algorithms WiPal implements are simple, the C++ techniques it uses and the number of interactions involved make the code difficult to check. As a consequence, many test cases were developed to this end. It would however be interesting to study whether WiPal could be formally proved.
In the end, WiPal could become a generic framework for handling packet traces, including algorithms at every level: from packet trace input/output to trace algorithms, and including parsers for a number of protocols.

On the accuracy of wireless sniffing, our analysis raises a number of questions. First, we are not sure that congestion is the source of the poor completeness in our traces. Furthermore, it would be unexpected for the CSMA mechanism of IEEE 802.11 to generate such significant losses.¹ Further controlled experiments in this regard are desirable. Maybe these losses are due to our setting (sniffers close to each other, with less-than-average capability, or even the specific traffic characteristics of the environment). It would be interesting to develop more experiments with different settings (e.g., different network adaptors, focusing on a single channel, or reducing interference). This could also give more insight into how the number of sniffers impacts accuracy. In this regard, it may be worth investigating why sniffers exhibit such variable accuracy.

We also show that the existing completeness evaluation techniques have some weaknesses. Although some of them probably cannot be worked around (it might be impossible to distinguish between sniffer losses and transmission failures), it would be interesting to develop techniques that fix the others (e.g., automatically detecting some non-conforming behaviors regarding sequence numbers). Active experiments could also be of interest to evaluate how inaccurate each evaluation technique is.

Our analysis of Wi-Fi activity in urban environments also raises questions. First, it is possible that wireless sniffing introduces a bias in our results.
For instance, devices located far from sniffers are likely to be seen less often than nearby devices, and it is unclear how this impacts their calculated active duration (even though we tried to mitigate such a phenomenon using an activity period of three minutes). At this stage, it is also unclear how the information we discovered could be of use to others (e.g., researchers, application designers, software engineers, or hardware vendors). The uneven distribution of devices among channels probably argues for commodity access points to include algorithms that dynamically switch among channels. The periodicity in device behaviors might be of some use to people designing opportunistic networking schemes. Another concern is the generality of our results. Since the tools now exist to perform wireless sniffing in any environment, it would be of interest to perform more experiments, both in similar urban environments and in others. In this regard, we collected some traces in a Wi-Fi enabled park.

¹ CSMA (Carrier Sense Multiple Access) is a type of MAC scheme used by IEEE 802.11.

Appendices

Appendix A
Summary of the thesis (translated from the French original)

The IEEE 802.11 standard [37] defines base layers for wireless communications. It appeared about ten years ago, under the Wi-Fi brand, and is widely used today. Personal computers that communicate over the Internet through radio links use this protocol almost exclusively. Wi-Fi also plays a major role in many mobile devices: it is found in PDAs, phones, portable media players, and even in some cameras. As a consequence, Wi-Fi is part of the ubiquitous computing landscape [58]. With the help of other protocols such as Bluetooth or GSM, it is used to create a seamless digital environment that is integrated into our daily life.
For example, Wi-Fi access points (hotspots) equip homes, hotels, conference rooms, and many other places. It is therefore essential to understand how implementations of the IEEE 802.11 standard behave "in the field". This knowledge is needed to develop new applications and new protocols, or to improve existing ones.

A.1 Context

IEEE 802.11 specifies a physical layer (PHY) and medium access rules (MAC, see footnote 1) for a wireless network. The PHY is in charge of encoding and decoding digital information (bit sequences) to and from a radio signal. The MAC, on the other hand, coordinates transmissions so that each station can share the medium without interfering with the others. Although it is mainly an industry-driven standard, researchers have produced a large body of work on IEEE 802.11. This includes highly specialized topics that focus, for example, on the PHY [30;45], the MAC [46], or other features such as security [12;14]. But more general research topics also involve this protocol: ad hoc and mesh networks [10;27], sensor networks [60], and ubiquitous computing [58]. A good understanding of Wi-Fi therefore benefits all these fields. Reaching this understanding requires theoretical analyses as well as experimental studies. This thesis focuses on the experimental side, and in particular on field measurements of wireless networks.

Footnote 1: MAC stands for Media Access Control.

Figure A.1: Wireless sniffing: passive monitors listen to radio activity within the measurement area.

A.1.1 Passive Wi-Fi measurements and sniffing

Every network measurement technique is either active or passive.
Active measurements modify network traffic in order to evaluate some parameters. Classic active techniques include saturating a link to evaluate its capacity, or sending probes to evaluate round-trip delays. In contrast, passive measurements do not interfere with network traffic. This is the case, for example, when one listens on a link to analyze its traffic. Passive techniques may nevertheless interfere with the infrastructure: they may require users to install specific software, or administrators to plug in dedicated listening equipment.

A classic passive technique for measuring wireless networks is sniffing. It consists in spreading monitors across the measurement area so that they capture all the traffic they can hear (see Figure A.1). Monitors produce traces, which are successions of MAC packets (frames). Sniffing is a fundamental step in a number of network operations, such as diagnosis [23;34], security studies [12;48], and the analysis of protocol behaviors [22;39;43;59]. Although it is not strictly required for them, it can also support localization systems [20;21;61].

Many different sniffing setups exist: there may be one or several monitors, these may be built from commodity or specialized hardware, and they may operate in isolation or be connected to a wired infrastructure (among other parameters). In all cases, however, the measurement is passive, non-intrusive, and does not interfere with the normal operation of the network. Wireless sniffing often uses a centralized procedure to merge traces [22;43;59].
The primary goal is to obtain a global view of radio activity from several local measurements. By using monitors with overlapping coverage areas, it is also possible to compensate for the losses of some monitors with data from other monitors. But this merge is a difficult task: it requires a very precise synchronization of the traces (within a few microseconds) and must account for the unreliable nature of the radio channel (frame losses are inevitable).

A.1.2 Open questions

Sniffing nevertheless raises a number of open questions. In this thesis, we focus on the computer science aspects (see footnote 2). We classify them into two categories: questions about the technique itself, and questions about the tools. This thesis addresses both, in an effort to collect new datasets and produce original analyses.

Questions about the technique relate to the relevance of the traces produced. One example is the accuracy of monitors. Even in good radio conditions, monitors may miss frames that were nevertheless transmitted successfully. In this context, a natural question arises: since the trace of each monitor is incomplete (that is, some frames have been lost), the merge of these traces is probably incomplete as well. What accuracy can one expect from one monitor? From several monitors? What results can be drawn from incomplete traces?

Another question concerns the relevance of the available datasets. While Wi-Fi is nearly ubiquitous, most of the datasets made public by researchers concern university campuses, laboratories, and conference venues [2].
This is partly because the common practice is to focus on environments that are easy for researchers to access, but also because existing measurement techniques only work in some scenarios. Most of these techniques focus on a single network, require deploying a complete infrastructure, or are intrusive with respect to network equipment. In the street or in a private home, these techniques are therefore hard to put into practice. Yet wireless sniffing has a very strong potential for measuring any kind of environment: it is passive, it does not interfere with the infrastructure, and in some cases it does not require deploying one. This potential has so far remained untapped. As a consequence, researchers focus on the study of anomalies and of some protocol specificities [22;39;43]. We believe, on the contrary, that it is more interesting to use sniffing as a technique for studying network usage in hard-to-access environments (for example homes, streets, or parks).

Footnote 2: Some aspects are not directly related to computer science. For example, sniffing raises legal and ethical questions.

Questions about the tools relate to the handling of packet traces. In networking, many operations involve this kind of trace: administrators use them for monitoring and debugging, researchers for measurement, simulation, or validation. Wireless monitors produce packet traces, which are in fact lists of MAC frames. Many tools exist to create and manipulate these traces, but most of them are very specific and use code that is hard to generalize.
For example, tcpdump [8] is able to decode a great many distinct protocols, but its processing code can only be used to print packets in a terminal. Wireshark [6] is more modular, but remains visualization-oriented overall, and therefore suffers from similar problems. Most programs that process network packets are well designed and appear efficient with respect to their goals. But whenever a new packet trace processing program must be written, relying on existing code is not practical. In addition, some tools suffer from performance problems (for example, Scapy [5] is a very powerful tool for trace analysis, but it is not usable on large traces of 1 GB or more). All this makes producing custom analyses of packet traces tedious. It generally requires programming new tools from scratch.

For all these reasons, merging IEEE 802.11 traces is also a problem. The literature contains a few tools for this purpose, but most of them rely on a wired infrastructure [22;28]. The others are too specific to the experiment they were designed for [43;44]. Generalizing Wi-Fi sniffing to any environment requires tools that are both generic and independent of a wired infrastructure.

A.2 Contributions of this thesis

The contributions of this thesis are twofold. On the one hand, we develop a software toolbox, named WiPal, to help manipulate IEEE 802.11 packet traces. This toolbox includes a generic library for developing new tools, and several ready-to-use utilities that perform predefined operations on traces. In particular, WiPal has an innovative trace merging tool.
On the other hand, we use these tools to produce two analyses. They rely on several datasets that we collected in different environments, including multi-day traces in suburban residential areas and in a city center. The first analysis focuses on the accuracy of Wi-Fi sniffing. The second focuses on Wi-Fi usage in these different environments.

A.2.1 WiPal: manipulating IEEE 802.11 traces

WiPal is our software suite for manipulating packet traces. It can be freely downloaded from http://wipal.lip6.fr/. It is designed for performance and in a generic way, in the hope that others can use it to develop new software, rather than to support one specific program. Although it focuses on the IEEE 802.11 protocol, it provides several protocol-independent features. What makes WiPal interesting is its original design and the novelty of some of its features. In this thesis:

• We present generic design patterns for handling several types of packet traces. Examples are the use of a pipe and filters mechanism for trace processing, and the use of static callbacks to generate parsers that are both generic and efficient.

• We show how some new features can benefit packet trace processing programs, and how to implement them. Examples are random access to a packet trace, and the transparent aggregation of several files into a single packet stream.

• We raise a number of problems that a program designer may encounter when writing packet processing software.
We present the existing techniques for addressing them, and we explain which techniques we selected for WiPal, and why.

• We evaluate the performance of WiPal and compare it with other packet trace processing programs. The results show that the generic design of WiPal has no notable effect on its performance (in terms of execution speed). The speed of WiPal is comparable to that of specialized code. Moreover, some of the new features have no impact on performance, while others, which are optional, imply a limited slowdown.

General presentation of WiPal

WiPal consists of a library and a set of binaries (programs). The binaries provide a simple and quick interface to the high-level features, but these features are also available through the library. For example, merging several traces only requires the following command:

    $ wipal-merge t1.pcap t2.pcap [t3.pcap...]

    #include <iostream>

    #include <wipal/pcap/stream.hh>
    #include <wipal/wifi/frame.hh>

    using namespace wpl;

    int main()
    {
      // Open the pcap trace to read.
      pcap::file<> f("file.pcap");

      // Print the IEEE 802.11 type name of each frame in the trace.
      for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
        std::cout << wifi::type::names[wifi::type_of(i->bytes())] << std::endl;
    }

Listing A.1: An example program that uses the WiPal library. This program prints the type of each IEEE 802.11 frame in file.pcap.
The high-level features include trace synchronization (with the wipal-synchronize program), merging (with wipal-merge), statistics computation (wipal-stats), anonymization (wipal-anonymize), and a few simple operations such as comparison, concatenation, or hexadecimal display (wipal-cmp or wipal-cat, for example). The most important low-level features are input/output in the pcap format, IEEE 802.11 frame decoding, and support for various related protocols.

It is important to note that the source code of wipal-merge is only a shell around the features of the library. Currently, the source code of the binaries averages 122 lines of C++ (the whole of WiPal, including the library, is about 20,000 lines of code). The smallest binary requires 44 lines of code, and the largest 267. This code is mostly the "glue" required by the generic programming techniques that WiPal uses. On the other hand, performing specific tasks with the WiPal frame decoder, or combining several processing steps into a single executable, requires users to write their own programs using the WiPal library. Listing A.1 shows a very simple example of a program that uses this library.

WiPal architecture

Figure A.2 presents a simplified diagram of the WiPal architecture. The binaries (at the top) rely on the library, which itself uses other external libraries. The library is composed of several modules. We classify these modules into three categories: base, protocols and formats, and filters.

Figure A.2: The architecture and modules of WiPal.

Base. These modules provide simple, common features that do not really depend on the application domain of WiPal. Examples are exceptions for error handling, generic abstract classes, and helpers for static programming. By using external libraries (such as Boost [1] or GNU MP [3]), we try to keep this layer as thin as possible.

Protocols and formats. These modules are specific to the application domain of WiPal and provide the foundations for high-level processing. The abstractions provided include IEEE 802 addresses, traces in the pcap format, and various protocol headers, including IEEE 802.11.

Filters. Fundamentally, a packet trace is just a stream of network packets. Most algorithms need nothing more than to read this stream linearly, one packet after another, from beginning to end. For such a mode of operation, a pipe and filters architecture [17] is entirely appropriate. This is therefore what WiPal uses. The pipes are implemented with iterators [32]. For example, an anonymization filter requires an iterator as input and provides an iterator as output. Sometimes, a processing step needs to be adapted to fit such an architecture. This is the case for merging IEEE 802.11 traces. The merge must then be decomposed into several elementary operations (one filter performs each operation), and these operations must be connected in a precise way. Figure A.4 shows how WiPal decomposes the merge operation. All operations that access a trace non-linearly need such an adaptation.

Merging IEEE 802.11 packet traces

One of the distinctive components of WiPal is its merging tool. This tool works offline and merges IEEE 802.11 packet traces. Its main characteristics are performance, ease of use, and flexibility. As a consequence, its design makes no assumption on the traces that would require the monitors to be connected to a wired infrastructure (for example, some tools require network synchronization [22]). The tool is also compatible with all common formats (raw IEEE 802.11, and Prism, Radiotap, and AVS headers). Finally, it can be used simply by invoking it directly on the traces (whereas other tools require more complicated architectures, which generally involve several servers [22;28;43]).

Figure A.3: Merging two traces T1 and T2.
A. The traces are not synchronized and do not contain all the frames.
B. Reference frames that are common to both traces are identified. This information makes it possible to synchronize the traces.
C. The timestamps of each frame are adjusted in order to synchronize T1 and T2.
D. The traces can be merged by comparing timestamps. Frames that appear twice (once in each trace) are only counted once.

This thesis motivates and describes the design choices of the WiPal merging tool:

• It proposes new algorithms for different steps of the merging process. In particular, the synchronization algorithm is a generalization of the algorithms found in the literature.

• It provides an analysis of the synchronization algorithm; we show that it is more accurate than previous algorithms.

• It provides a performance study which shows that the WiPal merging tool is an order of magnitude faster than Wit, the only other publicly available offline merging tool.
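Step D of Figure A.3 can be illustrated with a minimal sketch. It is not WiPal code: the Frame structure, the merge_traces function, and the 100 µs tolerance are assumptions made for the illustration. The sketch assumes that both traces are already synchronized and sorted by timestamp, and it only compares the frames at the head of each trace.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical frame record (not a WiPal type): a synchronized timestamp
// in microseconds and the raw frame bytes.
struct Frame {
    std::int64_t time_us;
    std::string bytes;
};

// Merge two traces that are already synchronized and sorted by time.
// Two frames count as the same transmission when their bytes match and
// their timestamps differ by at most tolerance_us (an assumed value).
std::vector<Frame> merge_traces(const std::vector<Frame>& t1,
                                const std::vector<Frame>& t2,
                                std::int64_t tolerance_us = 100)
{
    std::vector<Frame> out;
    std::size_t i = 0, j = 0;
    while (i < t1.size() && j < t2.size()) {
        const std::int64_t delta = t1[i].time_us - t2[j].time_us;
        const bool close = delta <= tolerance_us && -delta <= tolerance_us;
        if (close && t1[i].bytes == t2[j].bytes) {
            out.push_back(t1[i]);  // duplicate: keep a single copy
            ++i;
            ++j;
        } else if (delta <= 0) {
            out.push_back(t1[i++]);  // frame only seen (so far) in t1
        } else {
            out.push_back(t2[j++]);  // frame only seen (so far) in t2
        }
    }
    // Copy whatever remains in the longer trace.
    for (; i < t1.size(); ++i) out.push_back(t1[i]);
    for (; j < t2.size(); ++j) out.push_back(t2[j]);
    return out;
}
```

Because only the head frames are compared, two monitors that record nearby frames in a different order could still produce a duplicate; a real tool would need a wider comparison window.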
Our analyses rely on sixteen real traces from four datasets (uw/sigcomm2004 [50] from CRAWDAD, recorded during the SIGCOMM 2004 conference, and three private datasets recorded in different conditions). They allow us to calibrate various parameters, to validate the operation of the merging tool, and to show its efficiency.

How a trace merge works

Merging Wi-Fi traces generally requires synchronizing them first. This step corrects the timestamps of each frame so that all traces use the same time reference. It is then possible to identify the frames that are identical across traces so that they appear only once in the output (Cheng et al. [22] call this step unification).

Obtaining a precise synchronization (a precision of 106 µs at worst is required) requires extracting reference frames. These are frames that could be identified automatically, without relying on any synchronization, as being present in all input traces. Analyzing the timestamps of the reference frames makes it possible to compute a clock model for each trace, which in turn enables synchronization. Figure A.3 illustrates this process.

To identify reference frames, WiPal starts by isolating unique frames. A frame is unique when it appears on the radio channel one single time during the whole measurement. A frame that appears only once in a trace but actually appeared twice during the measurement must not be considered a unique frame. Unique frames are candidates for becoming reference frames. In fact, reference frames are the unique frames that are shared by every trace.
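The selection of reference frames can be sketched as follows. This is an illustration under simplifying assumptions, not the WiPal implementation: a trace is modelled as a list of raw frame contents, and the names Trace, unique_frames, and reference_frames are hypothetical. Uniqueness is approximated per trace, so a frame sent twice but captured once by every monitor would be wrongly accepted.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// A trace is modelled here as the list of raw frame contents
// (hypothetical representation for this sketch).
using Trace = std::vector<std::string>;

// Frames whose content occurs exactly once in the given trace.
std::set<std::string> unique_frames(const Trace& trace)
{
    std::map<std::string, int> count;
    for (const std::string& f : trace) ++count[f];

    std::set<std::string> result;
    for (const auto& kv : count)
        if (kv.second == 1) result.insert(kv.first);
    return result;
}

// Reference frames: the unique frames that are shared by every trace.
std::set<std::string> reference_frames(const std::vector<Trace>& traces)
{
    if (traces.empty()) return {};

    std::set<std::string> refs = unique_frames(traces.front());
    for (std::size_t k = 1; k < traces.size(); ++k) {
        const std::set<std::string> u = unique_frames(traces[k]);
        std::set<std::string> kept;
        for (const std::string& f : refs)
            if (u.count(f)) kept.insert(f);  // keep frames unique in both
        refs.swap(kept);
    }
    return refs;
}
```

For two traces {beacon1, data, data, beacon2} and {beacon1, beacon2, beacon2, probe}, only beacon1 is unique in both, so it is the only reference frame.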
The step that computes the references from the unique frames is the intersection. Figure A.4 shows a diagram of the whole merge operation as performed by WiPal.

Figure A.4: The architecture of the WiPal merging process.

A.2.2 Applications of WiPal: empirical analyses

Using the various WiPal tools, we can then conduct analyses on datasets that we collected through sniffing. This thesis presents two of these analyses. The first focuses on the accuracy of Wi-Fi sniffing. The second studies Wi-Fi usage in sociologically different environments.

We obtain all our traces using monitors (netbooks) equipped with three radio interfaces (ASUS EeePC 700 with Netgear WG111v3 USB Wi-Fi adapters, see Figure A.5). The radios listen on channels 1, 6, and 11. Each radio is configured in monitor mode and records every frame it hears, regardless of the network.

Figure A.5: An ASUS EeePC 700 with three Netgear WG111v3 USB Wi-Fi adapters, as used to collect our traces.

Sniffing accuracy

First, we collect short datasets (one to two hours long) using up to eight monitors located at the same place. The analysis of these traces first reveals several flaws in the existing techniques for evaluating the completeness of a Wi-Fi packet trace. We then analyze how the completeness of a dataset varies with the number of monitors that contribute traces to it.

Flaws of completeness evaluation techniques

All existing techniques for evaluating the completeness of a trace rely on the fact that a protocol, by nature, defines which frame sequences are valid. When a trace contains an invalid sequence, this sequence is most probably incomplete. The task is then to find a minimal number of frames to insert so that the sequence becomes valid. This number is then assumed to be exactly the number of frames that the monitor lost. For IEEE 802.11 there are two categories of techniques: (i) message-oriented techniques, which rely on frame types (for example, an acknowledgment is necessarily preceded by a management frame or a data frame) [22;38;43], and (ii) sequence number (seqnum) oriented techniques, which rely on sequence numbers (for example, if frame 42 follows frame 39, then frames 40 and 41 have been lost).

However, several flaws make these techniques inaccurate, partly because of the way they operate and partly because anomalies exist in the traces. Indeed, these techniques assume that every Wi-Fi device conforms exactly to the IEEE 802.11 standard; unfortunately, this is not always the case. Here is a list of the flaws we were able to identify.

• Existing techniques assume that the network is not congested. In a congested environment, many frames fail their medium access procedures. Gaps in sequence numbers then reveal transmission failures rather than monitor losses.

• Seqnum techniques assume devices that generate correct sequence numbers. This is false in practice. Indeed:

  1. Some access points reset their sequence number counters at 2048 instead of 4096 [53]. Depending on the analysis technique, this can lead to overestimating or underestimating the number of missing frames.

  2. Some access points use zero for all their sequence numbers (we observed this in some of our experiments).

  3. Some access points actually manage several "virtual" access points. In theory, each virtual access point should maintain its own sequence number counter. In practice this is not always the case, which leads to overestimating the number of missing frames.

• Message techniques do not detect some burst losses. For example, message techniques cannot detect the loss of a data frame if the corresponding acknowledgment has also been lost. Our studies show that burst losses make up a significant proportion of the losses in each trace.

Impact of the number of monitors on completeness

After studying the completeness estimation techniques, we analyze how the completeness of a dataset varies with the number of monitors that contribute traces to it. We identify several interesting results.

• As one might expect, the higher the number of monitors, the smaller the benefit of adding a new one.

• However, even with eight monitors at the same place, each monitor's trace contains a small proportion of frames that no other monitor heard.

• With only two monitors, one obtains on average between 70% and 80% of the frames that eight monitors would have captured. That is, most frames are shared among the monitors. That said, at least five monitors are needed to exceed 90%.

• Individually, the accuracy of monitors is highly variable. A single monitor captures between 45% and 90% of what eight monitors could have captured.

To summarize, most frames are received by several monitors, but a few are very hard to hear. That is, out of six or eight monitors, each monitor's trace contains a small proportion of frames found in no other trace.
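The sequence-number technique discussed above (if frame 42 follows frame 39, frames 40 and 41 were lost) can be written as a short sketch. The function name missing_frames is hypothetical, and the sketch assumes a single well-behaved sender whose 12-bit IEEE 802.11 sequence counter increases by one per frame and wraps at 4096.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// IEEE 802.11 sequence numbers are 12 bits wide, so they wrap at 4096.
constexpr std::uint32_t kSeqModulo = 4096;

// Estimate how many frames a monitor missed, given the sequence numbers
// of the frames it captured from one sender, in capture order.
// Assumption of this sketch: the sender increments its counter by one
// per frame and never skips a value.
unsigned missing_frames(const std::vector<std::uint16_t>& seqnums)
{
    unsigned missing = 0;
    for (std::size_t i = 1; i < seqnums.size(); ++i) {
        // Distance between consecutive sequence numbers, modulo 4096.
        const std::uint32_t gap =
            (seqnums[i] + kSeqModulo - seqnums[i - 1]) % kSeqModulo;
        // gap == 0: repeated number (e.g., a retransmission);
        // gap == 1: consecutive frames, nothing lost.
        if (gap > 1) missing += gap - 1;
    }
    return missing;
}
```

For the captured sequence 39, 42 the estimate is 2 missing frames. The sketch also reproduces the first anomaly listed above: for an access point that wraps at 2048, the sequence 2047, 0 yields an estimate of 2048 missing frames although none was lost.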
In conclusion, it seems to us that losses are inevitable, and it is therefore important that researchers use analysis techniques that remain reliable even when frames are missing.

Wi-Fi usage in urban environments

In a second analysis, we collect and analyze long traces (three and ten days long) obtained in three environments: an office, a dense urban residential area, and a low-density suburban residential area. We study the behavior of each device rather than the characteristics of the traffic. We are interested in observations such as the total activity duration of a device, the rate at which new devices appear, and the activity that we can extract from the traces. In this summary, we present two examples of results obtained by analyzing these datasets.

Figure A.6: Distribution of cumulated activity durations, for each trace and each station. Panels: (a) Office; (b) Residential, suburb; (c) Residential, city center. Each panel contains one plot per channel (1, 6, and 11). Horizontal axis: devices (sorted by activity duration). Vertical axis: total activity duration (from 3 min to 1 week).

Cumulated activity durations

Figure A.6 presents the distribution of cumulated activity durations in all traces and on all channels.
Each impulse represents the total activity duration of one device for a given trace. We consider a device active when it has emitted a frame within the last three minutes (any type of frame: management, data, or control). We use this three-minute window because access point drivers use timers with similar durations (for example, MadWifi timers vary between 30 seconds and 5 minutes). In addition, requiring only one frame every three minutes makes the technique robust to frame losses. A number of characteristics are common to all traces.

• Devices are not spread uniformly over the channels. In all traces, devices appear more often on channel 11 than on channel 6, and more often on channel 6 than on channel 1. This is a direct consequence of networks not being spread evenly over the channels.

• The distribution of activity durations is not uniform for a given trace and channel (note that Figure A.6 uses a logarithmic scale). There are three classes of devices: (1) those that are (almost) always active, (2) those that appear only once, and (3) the others. Across all traces, 31 devices (out of a total of 2,395) belong to class (1) (see footnote 3). 27 of these devices appear to be access points. Of the four remaining devices, two are in the office trace and two in the city center trace. Since they emit no beacons, they are not devices in ad hoc mode. It is therefore worth noting that a handful of users leave their devices switched on permanently. A significant share of devices belongs to class (2) (20% in the office and in the city center, 9% in the suburb).
This means that many users are not regular and are only passing through. Class (3) is varied and covers the whole range of possible values. Nevertheless, the shorter the duration, the higher the probability.

• Most devices are almost inactive. Between 48% and 96% of devices, depending on the trace and the channel (76% on average), are active for less than one hour over the whole duration of the trace. A majority of devices is therefore inactive most of the time.

Footnote 3: A device that appears on several channels is counted several times.

On some points, however, the traces exhibit different characteristics. The profiles of the office and suburb traces are similar, but in the latter, devices tend to accumulate longer activity durations. The profile of the city center trace shows a very sharp break between active devices and almost inactive ones. These variations are clearly visible in the average activity durations: 2h36 for the office trace, 11h48 for the suburb trace, and 2h21 for the city center trace (even though this trace is three times longer than the other two). In conclusion, in some environments, devices are on average active more often.

Growth of the number of devices

Figure A.7 presents the growth of the number of devices. Each curve is associated with a given channel and trace (with one additional curve per trace, which represents the average over the channels). Each point shows how many distinct devices a (trace, channel) combination contains between the beginning of the measurement and the time given on the horizontal axis. We consider that each MAC address represents one device, and we look for MAC addresses in all the fields of the frame. That is, some devices are mentioned as a recipient but never as a sender. This is why we discover more devices than in Figure A.6. In addition, because of a detail of IEEE 802.11, some address fields carry values that do not actually correspond to real MAC addresses (but to independent BSSIDs). We ignore these fields.

Figure A.7: Number of distinct MAC addresses that each trace contains between the beginning of the measurement and a given time. Panels: (a) Office; (b) Residential, suburb; (c) Residential, city center. Each panel plots channel 1, channel 6, channel 11, and their average against time in days.

From Figure A.7 we can draw a number of observations.

• Devices are not spread uniformly over the channels. Curiously, the distribution is not the same as that of the devices that emit frames. In all traces, channel 1 is the one that contains the fewest devices. This is entirely consistent with the previous results (see above). Nevertheless, channel 6 attracts more users than channel 11 in two of the three traces. This directly contradicts the distribution in Figure A.6. One difference is that Figure A.6 only takes senders into account, whereas Figure A.7 considers all devices. In any case, this means that determining the distribution of users over some channels is not so straightforward.

• The discovery rate reveals a day/night phenomenon. The curves periodically alternate between flat periods and growth periods.
Depending on the trace, this phenomenon has a variable amplitude and period, but it is visible in all traces. Flat periods occur at night, generally start around midnight, and stop a few hours before noon. This shows, as expected, that Wi-Fi activity is correlated with human activity.

• In the office and downtown traces, the discovery rate is constant over a long period. Moreover, in the downtown trace, this still holds after one week of measurement. In contrast, the suburban trace flattens after two days. We believe this is a direct consequence of the environment: downtown and in offices there is higher mobility and a higher turnover of individuals. One can therefore expect many new users to appear and not return before the end of the measurement. This also explains why the average activity times per user are higher in the suburban trace (see above). Note however that even when the discovery rate drops after two days, new users can still be discovered towards the end of the trace.

Among all these observations, we believe two are of particular importance. First, as the activity durations show, users are mobile, or else they generally switch off their Wi-Fi equipment. This yields packet traces in which most devices are off most of the time. Second, environments have different impacts on mobility. This translates into appearances of new users that are either evenly spread or clustered at the beginning of the trace.
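As an illustration, the cumulative count plotted in Figure A.7 can be computed in a few lines of Python. This is only a sketch under stated assumptions, not the code used for the thesis: the function name `growth_rows` and the input format (one collection of MAC addresses per time bucket) are hypothetical choices made for the example. The middle value of each row is the quantity the growth curves show.

```python
from typing import Iterable, List, Set, Tuple


def growth_rows(buckets: Iterable[Iterable[str]]) -> List[Tuple[int, int, int]]:
    """Return one (new, cumulative, seen_in_bucket) row per time bucket.

    `buckets` is assumed to hold, for each time slot (e.g., one minute),
    the addresses observed during that slot.
    """
    seen: Set[str] = set()          # every address discovered so far
    rows: List[Tuple[int, int, int]] = []
    for bucket in buckets:
        current = set(bucket)       # distinct addresses in this slot
        new = current - seen        # addresses never seen before
        seen |= current
        rows.append((len(new), len(seen), len(current)))
    return rows


if __name__ == "__main__":
    # Three slots: {A, B, C}, then {A, D}, then {A, B, D}.
    for row in growth_rows([["A", "B", "C"], ["A", "D"], ["A", "B", "D"]]):
        print(*row)
```

A curve that keeps rising in the second column corresponds to an environment with high turnover, whereas a curve that flattens corresponds to a stable population of devices.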
Among the other results not discussed in this summary, we also noted that the intensity of Wi-Fi activity alternates between residential areas and offices. This is because all environments are part of users' lives, but each at a specific time of day.

A.3 Conclusion

Wireless sniffing is a powerful technique to measure the activity of Wi-Fi networks, although it raises a number of questions. These questions are both practical and theoretical. On the one hand, the software available to handle IEEE 802.11 traces is often unsatisfactory. On the other hand, the relevance of IEEE 802.11 packet traces is open to question. In this thesis, we address these questions and provide a number of answers. First, we develop WiPal, a software toolbox that eases the processing of IEEE 802.11 packet traces. WiPal includes a flexible trace merging tool. Then, through the analysis of two short traces, we study the accuracy offered by Wi-Fi monitors. A last study collects and exploits three long traces in different environments. This lets us study how users make use of Wi-Fi.

To extend these analyses, we are currently analyzing traces obtained with several monitors spread across Parc Monceau, in Paris. Wi-Fi activity within this park interests us because the park includes several access points located at different places. With ten monitors, we covered an area equivalent to about half of the park (cf. Figure A.8). Our analyses are ongoing and we only have few results for the moment.

[Figure A.8: Position of the monitors for trace collection in Parc Monceau. Trace analysis is ongoing. (Background: Google Maps.)]
The traces include 138 transmitters, 71 of which are Apple-branded. We believe these are mainly mobile devices (iPhone or iPod touch). With such a number of mobile devices, these traces may reveal new usages.

Finally, we plan several pieces of work to extend WiPal and better understand the phenomena observed previously. Support for new protocols and new algorithms could be added to WiPal, to demonstrate its genericity and make it a "universal" tool. We would also like to make it even simpler to use, and to improve its testing procedures (and why not formally prove it?). Regarding our measurements of monitor accuracy, we would like to run controlled experiments to measure the actual impact of congestion on monitors. We should also study why the sniffing process shows so much variability. To this end, it would be interesting to use different kinds of hardware and to vary the parameters of the experiments. Regarding the measurement of different environments, we have two main questions. First, we would like to see to what extent our analysis method introduces measurement biases. Second, we would like to test more environments, and try to bring out categories of environments with similar properties.

Appendix B

WiPal manual

WiPal is a piece of software dedicated to the manipulation of IEEE 802.11 traces. It comes as a set of programs and a C++ library. A distinctive feature of WiPal is its merging tool, which merges multiple wireless traces into a single global trace. WiPal's key features are flexibility, ease of use, and efficiency.

B.1 The programs

This part documents the programs that WiPal provides.
B.1.1 Invocation

WiPal's programs all use the same invocation scheme:

    wipal-<command> [options] [inputs] [outputs]

The command line may include no options and, depending on the program, there may be no inputs or no outputs. Most programs expect at least one input, however. See the specific documentation of each program to know how many inputs and outputs it expects. Inputs, outputs, and options may be mixed on the command line, e.g.,

    wipal-simple-merge -n -P input1.pcap input2.pcap output.pcap
    wipal-simple-merge input1.pcap input2.pcap output.pcap -P -n
    wipal-simple-merge input1.pcap -n input2.pcap -P output.pcap
    ...

are all equivalent. WiPal's programs use getopt(3) to parse options, so they only have short options (no long options) composed of a dash followed by a letter (e.g., -a, -t, etc.). Option letters always have the same meaning whatever the program. Not all options are available for all programs, though (some options do not make sense with some programs). For instance, -P always means the invoked program should consider frames with non-zero Prism noise fields as invalid. To know which options a program accepts, use the -h option. Finally, some options expect an extra argument right after them:

    wipal-test-uniqueness -a hsh_80211 input.pcap
                             ^^^^^^^^^ This is not an input

Available options

-8 When comparing two packets, only compare IEEE 802.11 frames. Do not compare Prism or pcap headers. This option is incompatible with traces of pcap link type EN10MB.

-a See Section B.1.7. Specify which attributes the program must use to identify unique frames. An attribute specifier must follow this option on the command line. To see a list of valid attribute specifiers, use the -h option.

-b When comparing two packets, only compare packet bytes. Do not compare pcap headers.

-c Do not print column headers. This is the default when standard output is not a TTY.

-C Do print column headers.
This is the default when standard output is a TTY.

-d When comparing two packets, compare everything: pcap headers and packet bytes. This is the default.

-e In table outputs, do not use a column to report error values. This is the default.

-E In table outputs, do use a column to report error values.

-g Enable debugging output. As of now this only makes WiPal programs display their options right after they parse the command line.

-h Help. Print a short summary describing how one should invoke the program, which options it accepts, and possibly which attribute specifiers are accepted for option -a.

-i In table outputs, do not print frame indices.

-I In table outputs, do print frame indices. This is the default.

-m Specify a MAC address mapping file. Some WiPal programs need to map MAC addresses to other identifiers. For instance, wipal-extract-unique-frames with the seq bss tmp attributes maps MAC addresses to 32-bit integers for performance reasons. wipal-anonymize maps real addresses to anonymous ones. Each program stores these mappings in a file so they can be reloaded and reused later. This option lets users control the name of this file. When not specified, the "MAC.map" filename is used. The file is a plain-text file where each line contains a value and the corresponding MAC identifier. A filename should follow this option. The file need not exist (in which case it will be created). If it exists, it may be extended, but it will not be truncated.

-n Consider that Prism headers are little endian. This is the default when the corresponding pcap file is little endian. Note that some broken traces are big endian yet have little endian Prism headers, hence this option.

-N Consider that Prism headers are big endian. This is the default when the corresponding pcap file is big endian.

-o When comparing two traces with wipal-cmp, compare everything (pcap headers and packet bytes, as with option -d) and count how many bytes differ.
The count is printed on standard output. The exit status remains unchanged.

-p In Prism headers, do not consider that noise fields have a special meaning. This is the default.

-P In Prism headers, consider that non-null noise fields indicate a PHY error, and thus an invalid frame. Such frames will be ignored, e.g., with wipal-cat they will not appear in the output. This option implies that the input trace is composed of Prism headers (as pcap link type).

-q Quiet. Produce minimal output.

-r Blacklist a given reference frame. The reference frame will then be ignored and will not be used during synchronization. See Section B.1.6. A reference frame identifier must follow this option, e.g., 42-51 (indicating the reference frame composed of the unique frames 42 and 51). You may use this option multiple times, e.g.,

    wipal-simple-merge -r 42-51 -r 666-505 \
        input1.pcap input2.pcap output.pcap

will blacklist both references 42-51 and 666-505.

-s Specify an ESSID mapping file. This option works as -m but for files that map ESSIDs to other values. For instance, wipal-anonymize maps valid ESSIDs to anonymous ESSIDs. See -m for details. The default mapping file is "ESSID.map".

-t When comparing two packets, only compare IEEE 802.11 frames, along with some timestamps (e.g., pcap time, Prism MAC time, etc.). Which timestamps are used depends on the traces' link types (and on whether options -y or -Y are provided as well). Compare time values with a precision of 106 microseconds by default (that is, assume two values are equal when they are spaced by less than 107 microseconds). You can change the precision using option -x.

-u In table outputs, do not print microsecond timestamps. This is the default.

-U In table outputs, do print microsecond timestamps.

-v Display the program's version (actually the version of the WiPal package the program comes from).
-x Change the precision of timestamp comparisons (e.g., when using wipal-cmp with -t or when merging or synchronizing traces). By default, when the duration between two timestamps is 106 microseconds or less, WiPal programs consider the timestamps equal. The rationale for this behavior is that 106 microseconds is half the shortest interarrival time between two IEEE 802.11 frames (in infrastructure mode). Thus, this is the largest value one can afford when synchronizing IEEE 802.11 traces. Use -x to change this value. The new expected precision should follow -x on the command line.

-y Force the use of pcap timestamps. Some traces contain multiple timestamps for each frame. For instance, pcap traces with link type PRISM_HEADER have the standard pcap timestamps plus extra PHY-level timestamps provided by the network adapter. The policy of WiPal programs is to use the most precise timestamps (that is, to ignore pcap timestamps when something else is available). This option alters this behavior and forces programs to use pcap timestamps.

-Y Force the use of PHY-level timestamps when available. This is the default. See option -y for a more detailed explanation.

Input syntax

Basic usage

You may provide the name of a pcap trace file as input.

    wipal-cat input.pcap output.pcap

Input concatenation

You may provide the names of several pcap traces separated with colons (do not include any space). This tells the program to consider the concatenation of these traces as a single input.

    wipal-cat input1.pcap:input2.pcap:input3.pcap output.pcap

will put into "output.pcap" the content of "input1.pcap", followed by the content of "input2.pcap" and then "input3.pcap". Every program understands this syntax. Note that specifying multiple traces with colons makes no sense for outputs:

    wipal-cat input1.pcap:input2.pcap output1.pcap:output2.pcap

will concatenate "input1.pcap" and "input2.pcap" into a single file named "output1.pcap:output2.pcap"!
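The input syntax can be illustrated with a small parser. The Python sketch below is not WiPal's code (WiPal is written in C++); the function name `parse_input` is hypothetical. Besides colon-separated concatenation, it also handles the `=` address suffix and the backslash escapes documented in the rest of this section, and it assumes that an address applies to the whole concatenated input.

```python
from typing import List, Optional, Tuple


def parse_input(spec: str) -> Tuple[List[str], Optional[str]]:
    """Split an input specifier into (file names, optional IPv4 address).

    ':' separates concatenated traces, '=' introduces an address, and a
    backslash makes the next character literal.
    """
    files: List[str] = []
    current: List[str] = []         # characters of the file name being read
    address: Optional[str] = None
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == "\\" and i + 1 < len(spec):
            current.append(spec[i + 1])    # escaped character, taken literally
            i += 2
            continue
        if ch == ":":
            files.append("".join(current))  # end of one file name
            current = []
        elif ch == "=":
            files.append("".join(current))
            address = spec[i + 1:]          # everything after '=' is the address
            break
        else:
            current.append(ch)
        i += 1
    else:
        # Reached the end without meeting '=': flush the last file name.
        files.append("".join(current))
    return files, address


if __name__ == "__main__":
    print(parse_input("input1.pcap:input2.pcap:input3.pcap"))
    print(parse_input("bar1.pcap:bar2.pcap=192.168.1.2"))
    print(parse_input("weird\\=file\\:name.pcap"))
```

Scanning character by character, rather than calling `str.split`, keeps escaped separators inside file names intact.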
Address specification

Some programs (e.g., wipal-merge with attributes hsh_en2) might need the IPv4 address of the machine that generated a trace to work properly. Attach such an address to a trace as follows:

    wipal-merge -a hsh_en2 foo.pcap=192.168.1.1 \
        bar1.pcap:bar2.pcap=192.168.1.2

The rationale is that, in some cases, timestamps of emitted frames are not as precise as timestamps of received frames, and thus emitted frames should be ignored during synchronization.

Special characters

When your traces' filenames contain the special characters ':' or '=', these characters need to be escaped with a backslash ('\'):

    wipal-cat weird\=file\:name.pcap out.pcap
    wipal-merge -a hsh_en2 weird\=1:weird\=1:2=192.168.1.1 \
        foo.pcap=192.168.1.2

B.1.2 Concatenation (and Prism noise filtering)

One may concatenate traces using the wipal-cat command. It takes exactly one input and one output. It may be useful to recombine a trace that was split, or to filter out frames with Prism noise (using the -P option).

    wipal-cat in.pcap out.pcap
    wipal-cat foo.pcap.0:foo.pcap.1 foo.pcap
    wipal-cat -P in.pcap out.pcap
    wipal-cat -P bar.pcap.0:bar.pcap.1:bar.pcap.2 bar.pcap

The first example just copies "in.pcap" into "out.pcap". Note that the two files might differ at the byte level, e.g., if "in.pcap" is big endian and the program is run on a little endian machine. The second example concatenates "foo.pcap.0" and "foo.pcap.1" and puts the result into "foo.pcap". The third example copies "in.pcap" into "out.pcap" but removes frames that have a non-zero noise field in their Prism headers. The fourth example both concatenates traces and filters noisy frames out.

B.1.3 Comparisons

One may test two pcap traces for equivalence using the wipal-cmp command. The default is to compare every bit of information (pcap headers plus packet bytes) but you may change this behavior using the -8, -b, -o, or -t options.
Note, however, that this differs from using diff or cmp, since traces with different endianness may contain the same packets. By default wipal-cmp produces a report on the standard output indicating either that the traces are equal or which packet is the first to mismatch. Use -q if you are only interested in the program's exit status and do not want to produce any output. Use -o if you are interested in counting the number of bytes that differ between two traces. e.g.,

    wipal-cmp foo.pcap bar.pcap
    wipal-cmp -q foo.pcap bar.pcap
    wipal-cmp -q -8 in1.pcap.0:in1.pcap.1 in2.pcap
    ...

B.1.4 Sub-traces

One may extract sub-traces of pcap traces using wipal-extract-subtrace, wipal-extract-transmitter, or wipal-extract-bssid.

wipal-extract-subtrace takes two dates and a pcap trace as inputs, and produces one output. Unfortunately, it does not currently support any option.

wipal-extract-transmitter takes a MAC address and a pcap trace as input, and produces one output. Its output contains the frames from its input that were transmitted by the given address. Note that the command looks at transmitters, not originators, e.g., the transmitter of a data frame that crossed the distribution system is the output access point, not the original sender. Also note that some frames do not contain information regarding their transmitters (e.g., MAC acknowledgements) and therefore cannot appear in the output, even if they were actually sent by the given address.

wipal-extract-bssid works as wipal-extract-transmitter, but the MAC address represents a BSSID and the command extracts frames that belong to the corresponding BSS. Again, note that some frames do not contain information regarding their BSS. These frames therefore cannot appear in the output, even if they actually belonged to the given BSS.
e.g.,

    wipal-extract-subtrace 2007-01-01 2008-01-01 \
        in.pcap.0:in.pcap.1 out.pcap
    wipal-extract-subtrace \
        "2004-Aug-30 16:59:39.789221" "2004-Aug-30 16:59:39.929872" \
        kalahari-ath2 subtrace.pcap
    wipal-extract-transmitter 71:19:9f:6f:71:33 in.pcap out.pcap
    wipal-extract-bssid 9b:d2:d7:7f:aa:63 in.pcap out.pcap

B.1.5 Merging

One may merge two IEEE 802.11 traces into one using the wipal-simple-merge command. Use the -h option to get a description of the command's syntax. It takes two inputs and produces one output. When run, the merging process starts by precisely synchronizing both inputs (see Section B.1.6). Then both traces are merged, and special care is taken not to re-order packets or to count duplicate packets twice in the output (that is, packets that are present in both traces appear only once in the output).

This command expects pcap traces with either Prism headers, AVS headers, Radiotap headers, raw IEEE 802.11 frames, or pseudo-Ethernet II frames as link type. The -p and -P options only work with Prism headers. The following timestamps are used, unless -y is provided:

    IEEE 802.11 frames: pcap timestamps,
    Ethernet II frames: pcap timestamps,
    Radiotap headers: Radiotap headers' tsft fields (the command will fail with Radiotap headers that do not contain such fields),
    AVS headers: AVS headers' mactime fields,
    Prism headers: Prism headers' mactime fields.

e.g.,

    wipal-simple-merge a.pcap b.pcap output.pcap
    wipal-simple-merge -P -n foo-ath2.0:foo-ath2.1 bar-ath2 foo-bar-ath2
    ...

Notes regarding traces with Ethernet II frames as link type

See Section B.1.7. Since version 4.0, WiPal is able to merge traces with Ethernet II frames as link type. This is useful because some wireless traces use this link type. These traces only contain IP packets encapsulated into pseudo-Ethernet frames. Since these traces contain no IEEE 802.11 MAC headers, one cannot use the usual attributes (which rely on these headers) to merge them.
Therefore, use the hsh_en2 attributes when merging Ethernet II traces (see option -a). Using these attributes tells WiPal to decapsulate Ethernet frames and use the following frames as unique frames:

• OLSR packets (IPv4 and IPv6),
• IPv6 router advertisements.

Also note that machines recording pcap traces while emitting packets generally record imprecise timestamps for emitted packets. To solve this issue, you might specify an IPv4 address for each trace (see Section B.1.1). Frames originating from this address in this specific trace will be ignored for synchronization. Finally, remember that Ethernet traces only contain pcap timestamps, and these timestamps are not as precise as PHY-level timestamps. You might want to use option -x to raise the expected precision above 106 microseconds.

Merging more than two traces

wipal-simple-merge is only able to merge two traces. To merge more traces, one should run successive merges following a given sequence. For instance, merging traces A, B, and C might involve merging A and B into T first, and then merging T and C. The wipal-merge command selects a merging sequence and runs the corresponding merge operations in turn. e.g.,

    wipal-merge t1.pcap t2.pcap t3.pcap
    wipal-merge -n -P t11.pcap:t12.pcap:t13.pcap t21.pcap:t22.pcap t3.pcap

There is no rule to determine which merging sequence will give the "best" results. We consider that the two traces that are the most similar should be merged first. This avoids generating anomalies due to a lack of reference frames (see Section B.1.6). To compute the similarity between two traces A and B, WiPal counts the number of reference frames it is able to extract from these traces, stopping when it reaches B's 250,000th unique frame (see Section B.1.7). Despite its issues, this technique has the advantage of being both simple to implement and fast (determining a merging sequence should not take more time than actually merging the traces).
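The similarity measure and the greedy ordering that wipal-merge derives from it can be sketched in Python. This is an illustration only, not WiPal's implementation. The names `similarity` and `merge_sequence` are hypothetical, and traces are assumed to be given as lists of unique-frame identifiers. The sketch interprets "stopping at B's 250,000th unique frame" as intersecting A's identifiers with the first 250,000 of B's, and it treats similarity as symmetric. Both are simplifying assumptions.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def similarity(unique_a: Sequence[str], unique_b: Sequence[str],
               limit: int = 250_000) -> int:
    """Count identifiers shared by A and the first `limit` unique frames of B."""
    return len(set(unique_a) & set(unique_b[:limit]))


def merge_sequence(traces: Dict[str, Sequence[str]]) -> List[Tuple[str, str]]:
    """Return the successive (left, right) merges, most similar pairs first."""
    # Steps 1 and 2: compute every pairwise similarity, sort in descending order.
    pairs = sorted(
        combinations(sorted(traces), 2),
        key=lambda pair: similarity(traces[pair[0]], traces[pair[1]]),
        reverse=True,
    )
    merged_into = {name: name for name in traces}   # trace -> trace it was merged into

    def current(name: str) -> str:
        # Follow the chain of merges up to the trace that currently holds `name`.
        while merged_into[name] != name:
            name = merged_into[name]
        return name

    steps: List[Tuple[str, str]] = []
    for a, b in pairs:                               # steps 3 and 4
        left, right = current(a), current(b)
        if left == right:
            continue                                 # already part of the same trace
        result = f"({left}+{right})"
        merged_into[result] = result
        merged_into[left] = merged_into[right] = result
        steps.append((left, right))
    return steps


if __name__ == "__main__":
    demo = {"t1": ["a", "b", "c", "d"], "t2": ["a", "b", "c", "x"], "t3": ["a", "y", "z"]}
    print(merge_sequence(demo))
```

In this toy run, t1 and t2 share the most identifiers and are merged first; the result is then merged with t3.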
wipal-merge computes its merging sequence as follows. Note that it is designed to be fast rather than to yield an optimal sequence.

1. For each trace, compute its similarity with each other trace.
2. Sort results by similarity.
3. Pick the most similar result.
   • If it involves two non-merged traces, merge them.
   • If it involves a trace A that has already been merged into another trace T, consider merging T instead of A.
   • If it involves two traces that were already merged into the same trace, do nothing.
4. Pick the next result in the list and repeat step 3 until all traces have been merged into one unique trace.

One may compute the similarity between multiple traces using the wipal-similarity command. The output is sorted in ascending order of similarity. e.g.,

    wipal-similarity t1.pcap t2.pcap
    wipal-similarity -P t1.pcap t2.pcap t3.pcap t4.pcap

B.1.6 Synchronization

To merge two IEEE 802.11 traces, WiPal needs to synchronize them precisely. To do so, it first identifies some frames that appear in both inputs. These are reference frames. It uses these frames to model clock desynchronization among the traces. It then updates the first trace's timestamps so they are synchronized with the second trace.

One may use the wipal-synchronize command to synchronize two traces. It takes two inputs and produces one output. The output contains the same packets as the first input, but with synchronized timestamps.

To extract reference frames, WiPal extracts some specific frames called unique frames (see Section B.1.7) from both input traces and then intersects the two obtained sets. One may use the wipal-intersect-unique-frames command to get the result of this operation (i.e., the list of reference frames used to synchronize two traces). WiPal's synchronization process synchronizes reference frames before it synchronizes other frames. One may get the result of this operation using the wipal-synchronize-unique-frames command.
e.g.,

    wipal-intersect-unique-frames -n -P foo.0:foo.1:foo.2 bar.0:bar.1
    wipal-synchronize-unique-frames -n -P foo.0:foo.1:foo.2 bar.0:bar.1
    wipal-synchronize -n -P foo.0:foo.1:foo.2 bar.0:bar.1 foo-sync

Synchronizing more than two traces

Due to WiPal's mode of operation, it is not possible to synchronize multiple traces on a common timeline in a single operation. wipal-merge-and-synchronize, however, provides a similar feature. It behaves as follows:

1. First, merge all the traces given on the command line. At this stage, the command behaves exactly as wipal-merge.
2. Then, synchronize each individual trace from the command line with the timeline of the previously merged trace. Record each synchronized trace into the files "sync-1", "sync-2", etc.

For instance:

    wipal-merge-and-synchronize t1.pcap t2.pcap t3.pcap

will merge "t1.pcap", "t2.pcap", and "t3.pcap" into the file "merge-2". Then each trace will be synchronized using the timeline of "merge-2".

B.1.7 Unique frames

A frame is said to be unique when it appears on the air once and only once for the whole duration of a trace. WiPal's unique frame extraction process is an important stage of its trace synchronization process. The default policy of WiPal programs is to consider all beacon frames and all non-retransmitted probe responses as unique frames.

One may use the wipal-extract-unique-frames command to get a list of the unique frames that compose a trace. Run wipal-extract-unique-frames -h to get its invocation syntax.

In practice, WiPal does not extract and load full unique frames into memory. This would slow the process down and require an excessive amount of memory. The default is to work on MD5 frame hashes when WiPal was compiled with OpenSSL. When compiled without OpenSSL, WiPal only extracts a subset of frame fields. We call the pieces of information WiPal extracts to identify unique frames "frame attributes", or sometimes "frame identifiers".
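The idea of identifying unique frames by a compact attribute, then intersecting two traces to obtain reference frames, can be sketched as follows. This Python fragment is an illustration, not WiPal's C++ code. The names `frame_id` and `reference_frames` are hypothetical, and which bytes of a frame WiPal actually feeds to MD5 is not specified here, so the sketch simply hashes whatever bytes it is given.

```python
import hashlib
from typing import List, Sequence, Tuple


def frame_id(frame: bytes) -> str:
    """Compact identifier of a frame: the MD5 digest of its bytes."""
    return hashlib.md5(frame).hexdigest()


def reference_frames(unique_a: Sequence[bytes],
                     unique_b: Sequence[bytes]) -> List[Tuple[int, int]]:
    """Pair up unique frames that appear in both traces.

    Each result (i, j) means: unique frame number i of the first trace and
    unique frame number j of the second trace have the same identifier.
    """
    index_b = {frame_id(frame): j for j, frame in enumerate(unique_b)}
    pairs: List[Tuple[int, int]] = []
    for i, frame in enumerate(unique_a):
        j = index_b.get(frame_id(frame))
        if j is not None:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    trace_a = [b"beacon-1", b"probe-response-7", b"beacon-2"]
    trace_b = [b"beacon-2", b"beacon-1"]
    print(reference_frames(trace_a, trace_b))
```

Storing only digests keeps memory usage independent of frame size, at the cost of possible collisions between distinct frames when weaker attributes than a full hash are used.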
You may specify which frame attributes to use with the -a option. In practice, the difference in speed and memory consumption between attributes is negligible. There is an important difference between attributes, though. With some attributes, different unique frames may yield identical attributes (collisions). This is of course undesirable. One may check that a given trace's unique frames are really unique w.r.t. unique frame attributes using the wipal-test-uniqueness command. This command finds collisions inside its input traces. You might specify different frame attributes using the -a option. e.g.,

    wipal-test-uniqueness -P -a tmp foo.pcap.1:foo.pcap.2
    wipal-extract-unique-frames -P foo.pcap.1:foo.pcap.2 > foo-unique.txt

Special attributes

WiPal's "standard" behavior only considers non-retransmitted IEEE 802.11 probe responses and beacons to compute unique frames. Two "special" attributes, however, change this behavior:

hsh_en2 These attributes only work with traces using the EN10MB pcap link type (Ethernet II). Conversely, "standard" attributes do not work with this link type. Using these attributes tells WiPal to decapsulate Ethernet frames and use the following frames as unique frames:

• OLSR packets (IPv4 and IPv6),
• IPv6 router advertisements.

Beware: if you use traces that last long, the sequence numbers of OLSR packets might wrap. As a result, the assumption that OLSR packets are unique might no longer hold. In such cases, you cannot use WiPal to merge your traces.

hsh_80211_x These attributes work exactly as hsh_80211 but consider more frames as unique frames. Non-retransmitted IEEE 802.11 probe responses and beacons are still considered unique frames. In addition, WiPal programs decapsulate IEEE 802.11 data frames and consider the same frames as hsh_en2 (previously described). Note that these attributes are disabled by default (because they slow the compilation down and for long traces they are less robust than hsh_80211).
To enable them, use the --enable-attributes option of WiPal's configure script before compiling WiPal.

B.1.8 Duplicate data frames

One may use the wipal-find-data-dups command to search for some invalid data frames. It looks into traces on a per-sender basis for successive duplicate data frames (it only considers non-retransmitted frames). Such cases should not occur in theory: since the command ignores retransmissions, successive data frames from the same sender should at least differ in their sequence numbers. Surprisingly, some traces contain such anomalies: identical data frames that are not retransmissions and are only spaced by a few milliseconds. We have no explanation for why some datasets exhibit this phenomenon. e.g.,

    wipal-find-data-dups foo.pcap.0:foo.pcap.1:foo.pcap.2

B.1.9 Statistics

wipal-stats computes several figures concerning its input pcap traces. It displays these figures as plain text on the standard output. You might either interpret them directly or post-process them with other tools, e.g., to generate plots. Most of the output figures are self-explanatory and therefore are not mentioned in this manual. Some others need an explanation, though:

frames from expired senders The computation of some figures requires wipal-stats to keep a state for each sender (e.g., its current sequence number). To avoid some measurement artifacts, each state expires after one minute of inactivity from its sender. This counter indicates how many frames were received whose sender had expired upon reception of the frame.

sequence gap too large to make sense A sequence gap occurs every time a frame is received whose sequence number is greater than its sender's previous sequence number plus one. Theoretically, a gap of length N (e.g., receiving frame 42 and then frame 42 + N + 1) means the sniffer missed N frames. Sometimes, however, the gap is too large to make sense (e.g., a gap of 2000 within a window of 500 microseconds).
WiPal counts the number of occurrences of these gaps, but otherwise ignores them (e.g., when estimating the number of missed frames).

gap length frequencies This gives the frequencies of sequence gap lengths (see above). The data is directly suitable for Gnuplot. Use the wipal-plot-gaplenfreqs script to generate the plot using Gnuplot. e.g.,

    wipal-stats foo.pcap > foo.stats
    wipal-plot-gaplenfreqs foo.stats freqs.eps "A title"

T-Fi plot This gives data suitable for Gnuplot to generate a T-Fi plot. Use the wipal-plot-tfi script to generate the plot using Gnuplot. e.g.,

    wipal-stats foo.pcap > foo.stats
    wipal-plot-tfi foo.stats tfi.eps "A title"

One may find an explanation of T-Fi plots in the following paper: On the fidelity of 802.11 Packet Traces, A. Schulman, D. Levin, and N. Spring, in the proceedings of PAM 2008.

BSS figures This gives a list of all BSSs the trace contains as well as a few other figures (e.g., number of distinct BSSs, APs and STAs corresponding to each BSS, etc.). The list is ordered by number of beacons seen for each BSS.

SSID figures This gives the number of distinct SSIDs the trace contains as well as two lists of these SSIDs. The first one orders them by frequency, the second one orders them lexicographically.

activity This gives data that represents the quantity of traffic w.r.t. elapsed time. Each line corresponds to one minute. Columns respectively represent:

1. how many frames were sent (during the corresponding minute),
2. how many bytes were sent,
3. how many bytes from management frames were sent,
4. how many bytes from data frames were sent,
5. how many bytes from access points were sent. When a STA emits a beacon that does not belong to an independent BSS (i.e., the STA emits an infrastructure mode beacon), WiPal identifies this STA as an access point. All further frames from this STA are counted as access point traffic.

One might use the wipal-plot-activity script to plot traffic rate w.r.t.
elapsed time for the whole trace, only for management frames, or only for access point frames. e.g.,

    wipal-stats foo.pcap > foo.stats
    wipal-plot-activity foo.stats activity.eps "A title"

Various growths (MAC addr., BSSID, IBSSID, SSID, AP)
Actually each “growth” section gives the same kind of statistics, but for various elements. Elements are:

MAC addr.
MAC addresses, without BSSIDs or IBSSIDs. Inspect all frames.

BSSID
BSSIDs that are not IBSSIDs. That is, independent BSS frames (i.e., ad hoc mode frames) are ignored. Only inspect beacon frames, even though other frames also contain BSSIDs.

IBSSID
IBSSIDs. That is, only account independent BSS frames (i.e., ad hoc mode frames). Also, only inspect beacon frames, even though other frames also contain IBSSIDs.

SSID
All SSIDs. Only inspect beacon frames (e.g., ignore probe responses).

AP
Sender MAC addresses from beacons. Account both normal BSS frames (infrastructure mode) and independent BSS frames (ad hoc mode).

For a given element type, “growth” data gives statistics about the evolution of the number of distinct elements. Each row represents a minute of measurement. Columns respectively represent:
1. The number of new distinct elements seen during the last minute.
2. The total number of distinct elements seen since the beginning of the trace.
3. The number of distinct elements seen during the last minute.

For instance, if a trace contains the following elements:

    first minute    A B C
    second minute   A D
    third minute    A B D

The corresponding rows are:

    3 3 3
    1 4 2
    0 4 3

One might use the wipal-plot-growth script to plot an element growth w.r.t. elapsed time. e.g.,

    wipal-stats foo.pcap > foo.stats
    wipal-plot-growth "MAC addr." foo.stats mac-growth.eps "A title"

ON/OFF events
When a STA emits a frame, wipal-stats considers it as active. A STA’s state gets back to inactive after three minutes of silence. The ON/OFF events section lists these state changes.
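As an illustration of this rule, the following Python sketch derives state changes for one STA from the timestamps of its frames. It is not WiPal’s implementation. The function name is hypothetical, and the sketch assumes that an inactive event is dated at the last frame seen before the silence and that the end of the trace closes any open activity period; the manual does not specify either point.

```python
TIMEOUT_US = 3 * 60 * 1_000_000  # three minutes of silence, in microseconds


def on_off_events(timestamps_us, timeout_us=TIMEOUT_US):
    """Return (timestamp, state) pairs for one STA.

    timestamps_us must be sorted frame timestamps in microseconds.
    State 1 means active, state 0 means inactive.
    """
    events = []
    last = None
    for ts in timestamps_us:
        if last is None:
            events.append((ts, 1))    # first frame: the STA becomes active
        elif ts - last > timeout_us:
            events.append((last, 0))  # assumption: OFF is dated at the last frame
            events.append((ts, 1))    # the STA becomes active again
        last = ts
    if last is not None:
        events.append((last, 0))      # assumption: end of trace closes the period
    return events


if __name__ == "__main__":
    # One STA seen at timestamps 0 and 60000000 (microseconds).
    for timestamp, state in on_off_events([0, 60_000_000]):
        print(timestamp, state)
```

Each resulting pair maps to one line with two columns, a timestamp followed by the new state.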
The section is composed of one subsection per STA and per trace. Within these subsections, each line indicates a state change. A state change line consists of two columns. The first one indicates the event’s timestamp, and the second one the STA’s new state after the event (0 for inactive and 1 for active). For instance:

    begin ON/OFF T2 STA 00:00:00:00:00:42
    0 1
    60000000 0
    end ON/OFF T2 STA 00:00:00:00:00:42

indicates that, within the third trace (the first trace is referred to as T0), STA 00:00:00:00:00:42 is active between timestamp 0 and timestamp 60000000.

One might use wipal-plot-onoff to generate a PDF file containing a visual representation of this section. Beware that this script only gets installed if Python is present on your system, and will only work with a proper Pycairo installation (http://cairographics.org/pycairo/). e.g.,

    wipal-stats foo.pcap > foo.stats
    wipal-stats bar.pcap > bar.stats
    wipal-plot-onoff foo.stats bar.stats > foo-bar-onoff.pdf

per STA counters
For each IEEE 802.11 station, wipal-stats maintains various counters. This section lists these counters. It is composed of several subsections which contain the same information sorted differently (e.g., by traffic per STA, by activity periods (“on time”), etc.). Inside a given subsection, each row contains information about a particular station. Each row has the following columns:
1. The MAC address of the station the row is about.
2. Total number of emitted bytes. This includes MAC frames and their payloads.
3. Average rate when on. That is, size / time on, where size is the total number of emitted bytes and time on is the total duration the station is active (“on”) inside traces. Values are in bytes per microsecond.
4. Total duration the station is active (“on”) inside traces. For instance, if a station is active for 3 minutes somewhere at the beginning of the trace and then active for 4 more minutes at another moment in the trace, this column holds 7 minutes. Values are in microseconds.
5. Proportion of stations that have been printed so far. For instance, if the trace contains 10 distinct stations, the first row’s value is 0.1, the second 0.2, etc. This is useful for scripts that compute cumulated distributions.
6. Total number of bytes emitted, cumulated with previous rows. This is useful for scripts that compute cumulated distributions.
7. Average rate when on, cumulated with previous rows. This is useful for scripts that compute cumulated distributions.
8. Total number of frames emitted.
9. Average number of frames per microsecond when active. That is, count / time on, where count is the total number of frames emitted. time on has the same meaning as above.

Three scripts use the “per STA counters” section: wipal-plot-t-dist, wipal-plot-t-c-dist, and wipal-plot-ot-dist.

wipal-plot-t-dist
Plots the distribution of traffic (and average rate when on) per STA.

wipal-plot-t-c-dist
Plots cumulated distributions of traffic (and average rate when on) per STA.

wipal-plot-ot-dist
Plots the distribution of total activity periods (“on time”).

Plotting scripts

wipal-plot-all is a wrapper that calls all of WiPal’s plotting scripts. e.g.,

    $ wipal-stats foo.pcap > foo.stats
    $ wipal-plot-all foo.stats
    $ ls
    foo.pcap                foo.stats.I-growth.eps  foo.stats.gaplenfreqs.eps
    foo.stats               foo.stats.M-growth.eps  foo.stats.tfi.eps
    foo.stats.A-growth.eps  foo.stats.S-growth.eps
    foo.stats.B-growth.eps  foo.stats.activity.eps

wipal-plot-activity and wipal-plot-growth use PCAP timestamps for the x axis. Usually, PCAP timestamps use GMT. However, traces are not necessarily recorded in a GMT zone. You might use the WP_TZ environment variable to fix this. This variable specifies to WiPal’s plot scripts a time adjustment in minutes. e.g., if you recorded a trace in a GMT-4 zone, plot its statistics with:

    WP_TZ=$((-4 * 60)) wipal-plot-activity foo.stats

B.1.10 Anonymization

wipal-anonymize is a program to anonymize IEEE 802.11 traces.
It takes one input and one output: the filename of the trace to anonymize, and the filename of the anonymous trace to produce. The output contains the same frames as the input with the following modifications:
• NIC specific parts of MAC addresses are anonymized.
• ESSIDs are anonymized with a prefix-preserving scheme. For instance, a valid anonymization could map operator-4251, operator-DODO, and foobar to abcdefgh*x0yz, abcdefgh*9876, and zxycba. The anonymization scheme also preserves character classes, i.e., alphanumerical characters are anonymized to other alphanumerical characters, printable characters stay printable, and ASCII extended characters (128 to 255) stay extended.
• Data frames are truncated so the output only contains MAC headers.

wipal-anonymize stores valid-to-anonymous MAC and ESSID mappings into files so these mappings can be reused later. wipal-anonymize also reads these files at start-up when they exist. This enables the creation of distinct anonymous traces with consistent MAC addresses and ESSIDs. By default these mapping files’ names are “MAC.map” and “ESSID.map”. Use the -m and -s options to change this. See Section B.1.1.

B.1.11 Miscellaneous programs

wipal-list-frames just lists a trace’s frames. This is a pretty dumb program, yet one may use it to display a trace’s timestamps. e.g.,

    $ wipal-list-frames foo.pcap | head
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    $ wipal-list-frames -C -U foo.pcap | head
    foo.pcap
    Frame ID Microseconds
    ======== ============
    1        1258703194
    2        1258704299
    3        1258704368
    4        1258705143
    5        1258709302
    6        1258709362
    7        1258709784

B.1.12 Undocumented programs

WiPal’s configure script has two options: --enable-probe-stats and --enable-wit-import. These options enable the build of several programs, namely wipal-probe-stats, wit-create-datafiles, wit-create-tables-and-load-data, and wit-import. By default the build of those programs is disabled.
Those are legacy programs that were useful to somebody once, yet are incomplete and flawed. They will not be updated later, and are not documented here. Build and use at your own risk!

B.2 The library

WiPal also includes a C++ library, which all WiPal programs use. At a low level it provides various convenience tools (pcap file input/output, random access to PCAP traces, support for various static C++ techniques, etc.). At an upper level it provides a generic IEEE 802.11 frame parser that is easy to customize and re-use. Finally, it provides various mechanisms to synchronize and merge pcap traces directly from C++ code.

The library is called libwipal and its headers are located in $(prefix)/include/wipal. You should be able to include them as follows:

    #include <wipal/pcap/stream.hh>
    // ...

You will then need to provide the -lwipal option to your compiling/linking tools.

The main documentation for this library is provided as Doxygen documentation. It should be installed into WiPal’s package data directory, into the “doxygen” subdirectory. By default this gives “/usr/local/share/wipal/doxygen/”. This documentation is however a bit messy, and lacks some parts. The best entry point to learn how to use the library is to look at the source code of some of WiPal’s tools (e.g., “src/misc/wipal-find-data-dups.cc”). You may also want to have a look at WScout (http://wscout.lip6.fr/) which is another program that uses WiPal (some versions of WScout embed WiPal under the name tracetools).

B.3 FAQ

What systems does WiPal support?
WiPal was mostly designed using standard C++ and portable libraries. It however uses a few GCC extensions. Yet WiPal should run fine on most systems (e.g., GNU/Linux, WhateverBSD, Mac OS, Windows, ...). WiPal is however exclusively tested on Debian GNU/Linux (amd64 and, to a lesser extent, powerpc). This means you might experience problems on other systems, which the developers might not be aware of.
In this case, please give feedback to them so they can fix it. Anyway, there should be no major obstacle to WiPal’s portability.

What are WiPal’s requirements?
WiPal needs:
• GCC,
• The Boost C++ libraries. More specifically:
  – any,
  – array,
  – conversion/lexical_cast,
  – date_time,
  – filesystem,
  – foreach,
  – format,
  – multi_array,
  – optional,
  – preprocessor,
  – smart_ptr,
  – string_algo,
  – tuple.
• The GNU MP Bignum Library,
• OpenSSL.

How do I install WiPal?
WiPal’s packaging follows the GNU conventions. An installation documentation is provided in the “INSTALL” file in the package’s root directory. However, with a standard system, the following commands should do the trick:

    mkdir _build
    cd _build
    ../configure
    make
    make install-strip
    make check

On some systems, you might have to customize the “configure” script’s invocation. e.g.,

    mkdir _build
    cd _build
    ../configure CPPFLAGS=-I/foo/bar/libgmp
    make
    make install-strip
    make check

Are there any options to optimize WiPal when building it?
You might want to compile WiPal with the NDEBUG preprocessor symbol defined. If you use GCC you might also want to use its -O3 option. You can do that by running “configure” with the following options:

    ./configure CPPFLAGS=-DNDEBUG CXXFLAGS=-O3

Gee! WiPal’s compilation takes long and requires a lot of memory!
WiPal heavily uses static C++ mechanisms and a full build requires instantiating many templates. This results in a long build process that requires much memory. You may disable some template instantiations to have a faster and lighter build process. This will however remove some features at the end. You may invoke configure with the following options:

--enable-linktypes=LT1:LT2:...
will only enable the listed pcap link types when compiling WiPal. The available link types are:
IEEE802_11            raw IEEE 802.11 frames,
IEEE802_11_RADIO      Radiotap headers,
IEEE802_11_RADIO_AVS  AVS headers,
PRISM_HEADER          Prism headers,
EN10MB                Ethernet II.

--enable-attributes=A1:A2:...
will only enable the listed unique frame attributes (see Section B.1.7) when compiling WiPal. The list’s first attribute specifier is the default one (when -a is not provided on the command line). Available attributes are:
• tmp
• seq_tmp
• dst_tmp
• src_tmp
• bss_tmp
• src_bss_tmp
• seq_bss_tmp
• seq_dst_bss_tmp
• seq_src_bss_tmp
• hsh_80211 (requires OpenSSL)
• hsh_80211_x (requires OpenSSL)
• hsh_en2 (requires OpenSSL)

If you know you are going to need only one pcap link type (e.g., Prism headers), and you do not want to test various attributes, a good choice might be:

    ./configure --enable-linktypes=PRISM_HEADER --enable-attributes=hsh_80211

which will only instantiate one template configuration for each WiPal utility.

Do WiPal’s tools have a verbose mode to report extra information about their operation?
There is no such option that can be activated dynamically. You might want however to compile WiPal with the WP_ENABLE_INFO preprocessor symbol defined. This will enable the printing of some extra information in some tools as they run (e.g., number of processed frames, synchronization error, etc.). Invoke the “configure” script with the following option:

    ./configure CPPFLAGS=-DWP_ENABLE_INFO

Note however that this may slow some tools down and may require more memory.

You say WiPal is flexible and customizable. Is there a way to customize WiPal’s tools beyond the options they propose?
Yes! But this requires recompiling WiPal’s tools, and sometimes modifying a few lines of their source code.
• You may change WiPal’s linear regression window (for trace synchronization) by defining the WP_LRSYNC_WINDOW_SIZE macro symbol. Use the CPPFLAGS environment variable for this. The default value is 3. e.g.,

    ./configure CPPFLAGS='-DWP_LRSYNC_WINDOW_SIZE=42'

• You may change the windowed merging algorithm’s window size by defining the WP_WMERGE_WINDOW_SIZE macro symbol. Use the CPPFLAGS environment variable for this.
The default value is 3. e.g.,

    ./configure CPPFLAGS='-DWP_WMERGE_WINDOW_SIZE=42'

• You may change the frame attributes (i.e., frame identifiers) to use in tools that do not support the -a option by modifying a few lines of their source code. This generally requires changing an include and a typedef, e.g.,

    -#include <wipal/wifi/unique_id/seqctl_bssid_timestamp.hh>
    +#include <wipal/wifi/unique_id/seqctl_source_bssid_timestamp.hh>
     // ...
    - typedef wifi::seq_bss_tmp_id unique_id;
    + typedef wifi::seq_src_bss_tmp_id unique_id;

“configure” complains it did not find library X?
Either library X is not installed on your system, or your system is not properly configured, so the library cannot be found. You may use the CPPFLAGS and LDFLAGS variables to correct this behavior. e.g., run

    ./configure CPPFLAGS=-I/custom/path/include LDFLAGS=-L/custom/path/lib

“configure” complains it found library X’s headers, but is unable to link?
Most probably library X is installed but its binaries are in a non-standard place. Use the LDFLAGS variable as described previously.

“configure” complains library X’s headers are unusable, despite successful linking?
Most probably library X is installed but its headers are in a non-standard place. Use the CPPFLAGS variable as described previously.

Do you have a list of WiPal’s bugs?
No. We are not aware of any serious bug in WiPal. We take special care to test WiPal with an automated test suite. Do not hesitate to report unknown bugs to the package’s maintainers. We will hunt them. With some tools, you might however encounter some strange behaviors when providing invalid inputs (e.g., running wipal-find-data-dups a:b with “b” having a link type different from “a”). Consider that as a “feature”! ;-)

I have found a bug, what should I do?
Report it to [email protected], the package’s maintainer.

I would really love having feature X implemented!
Give feedback to the package’s maintainers about the features you want.
We might not have the time to implement them, yet it is important for us to know when important features are missing. Regarding features you miss, you are greatly encouraged to contribute to WiPal. Again, contact the package’s maintainers so they can help you implement new features.

I have a question this document did not answer!
Mail [email protected].

Appendix C List of publications

C.1 Journals

Thomas Claveirole and Marcelo Dias de Amorim, WiPal: Efficient Offline Merging of IEEE 802.11 Traces, to appear in the ACM Mobile Computing and Communications Review, 2010.

Thomas Claveirole, Marcelo Dias de Amorim, Michel Abdalla, and Yannis Viniotis, Securing Wireless Sensor Networks Against Aggregator Compromises, IEEE Communications Magazine, vol. 46, no. 4, pp. 134-141, April 2008.

C.2 Conferences

Thomas Claveirole, Mathias Boc, and Marcelo Dias de Amorim, An Empirical Analysis of Wi-Fi Activity in Three Urban Scenarios, IEEE PerCom Workshop on Pervasive Wireless Networking, March 2009.

Thomas Claveirole, Marcelo Dias de Amorim, Michel Abdalla, and Yannis Viniotis, Résistance contre les attaques par capture dans les réseaux de capteurs, Journées Doctorales en Informatique et Réseaux, January 2007. Awarded best paper.

Alia Fourati, Khaldoun Al Agha, and Thomas Claveirole, Securing OLSR Routes, Asian Internet Engineering Conference, December 2005.

Thomas Claveirole, Sylvain Lombardy, Sarah O’Connor, Louis-Noël Pouchet, and Jacques Sakarovitch, Inside Vaucanson, International Conference on Implementation and Application of Automata, June 2005.

C.3 Demos and posters

Thomas Claveirole, Marcelo Dias de Amorim, and Serge Fdida, Sniffing IEEE 802.11 Mobility, AdHoc-NOW PhD Workshop, September 2008.

Thomas Claveirole and Marcelo Dias de Amorim, WiPal and WScout, Two Hands-on Tools for Wireless Packet Traces Manipulation and Visualization (demo), ACM Mobicom Workshop on Wireless Network Testbeds, Experimental Evaluation and Characterization, September 2008.

Thomas Claveirole, Marcelo Dias de Amorim, Michel Abdalla, and Yannis Viniotis, Resisting Against Aggregator Compromises in Sensor Networks (poster), ACM CoNext Student Workshop, December 2006.

C.4 Software

WiPal, IEEE 802.11 trace manipulation software, http://wipal.lip6.fr/, published on CRAWDAD [2]: http://www.crawdad.org/meta.php?name=tools/process/pcap/WiPal.

WScout, lightweight pcap trace visualizer, http://wscout.lip6.fr/, published on CRAWDAD [2]: http://www.crawdad.org/meta.php?name=tools/analyze/pcap/WScout.

C.5 Under review

Thomas Claveirole and Marcelo Dias de Amorim, Manipulating Wi-Fi Packet Traces with WiPal: Design and Experience, submitted to Wiley Software Practice and Experience, 2010.

Thomas Claveirole and Marcelo Dias de Amorim, On the Completeness of Wireless Packet Sniffing, submitted to IEEE Communications Letters (second round).

Ryad Ben-El-Kezadri, Giovanni Pau, and Thomas Claveirole, TurboSync: Clock Synchronization for Shared Media Networks via Principal Component Analysis with Missing Data, submitted to WiOpt: Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks, June 2010.

Bibliography

[1] Boost. http://www.boost.org/. Free peer-reviewed portable C++ source libraries.
[2] CRAWDAD data sets. http://www.crawdad.org/data.php. Community Resource for Archiving Wireless Data At Dartmouth.
[3] The GNU multiple precision arithmetic library. http://gmplib.org/. Free library for arbitrary precision arithmetic.
[4] Radiotap. http://www.radiotap.org/. Standard for 802.11 frame injection and reception.
[5] Scapy. http://www.secdev.org/projects/scapy/. Interactive packet manipulation program.
[6] Wireshark. http://www.wireshark.org/.
Network protocol analyzer.
[7] libpcap. http://www.tcpdump.org/. Packet Capture library.
[8] tcpdump. http://www.tcpdump.org/. Protocol packet capture and dumper program.
[9] Mikhail Afanasyev, Tsuwei Chen, Geoffrey M. Voelker, and Alex C. Snoeren. Analysis of a mixed-use urban WiFi network: When metropolitan becomes neapolitan. In IMC: ACM SIGCOMM/USENIX Internet Measurement Conference, pages 85–98, Vouliagmeni, Greece, 2008. ISBN 978-1-60558-334-1.
[10] Ian F. Akyildiz, Xudong Wang, and Weilin Wang. Wireless mesh networks: a survey. Computer Networks, 47(4):445–487, 2005. ISSN 1389-1286.
[11] Mark Allman and Vern Paxson. Issues and etiquette concerning use of shared measurement data. In IMC: ACM SIGCOMM/USENIX Internet Measurement Conference, pages 135–140, San Diego, California, USA, 2007. ISBN 978-1-59593-908-1.
[12] Paramvir Bahl, Ranveer Chandra, Jitendra Padhye, Lenin Ravindranath, Manpreet Singh, Alec Wolman, and Brian Zill. Enhancing the security of corporate Wi-Fi networks using DAIR. In MobiSys: ACM/USENIX International Conference on Mobile Systems, Applications, and Services, Uppsala, Sweden, June 2006.
[13] Magdalena Balazinska and Paul Castro. Characterizing mobility and network usage in a corporate wireless local-area network. In MobiSys: ACM/USENIX International Conference on Mobile Systems, Applications, and Services, San Francisco, California, USA, May 2003.
[14] Kemal Bicakci and Bulent Tavli. Denial-of-service attacks and countermeasures in IEEE 802.11 wireless networks. Computer Standards & Interfaces, 31(5):931–941, 2009. ISSN 0920-5489.
[15] Mathias Michael Boc. Profile of Mobility: User-centric Networking. PhD thesis, Université Pierre et Marie Curie, Paris, France, November 2009.
[16] Nicolas Burrus, Alexandre Duret-Lutz, Thierry Géraud, David Lesage, and Raphaël Poss. A static C++ object-oriented programming (SCOOP) paradigm mixing benefits of traditional OOP and generic programming.
In Workshop on Multiple Paradigm with OO Languages (MPOOL’03), Anaheim, California, USA, October 2003.
[17] Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, and Michael Stal. Pattern-Oriented Software Architecture: A System of Patterns, volume 1. John Wiley and Sons Ltd., 1996.
[18] C++0x. ISO Working Draft, Standard for Programming Language C++, September 2009. Document number: N3000=09-0190.
[19] A. Chaintreau, P. Hui, C. Diot, R. Gass, and J. Scott. Impact of human mobility on opportunistic forwarding algorithms. IEEE Trans. Mobile Comput., 6(6):606–620, June 2007.
[20] Ranveer Chandra, Jitendra Padhye, Alec Wolman, and Brian Zill. A location-based management system for enterprise wireless LANs. In NSDI: ACM/USENIX Symposium on Networked Systems Design and Implementation, pages 115–130, Cambridge, Massachusetts, USA, 2007.
[21] Yu-Chung Cheng, Yatin Chawathe, Anthony LaMarca, and John Krumm. Accuracy characterization for metropolitan-scale Wi-Fi localization. In MobiSys: ACM/USENIX International Conference on Mobile Systems, Applications, and Services, pages 233–245, Seattle, Washington, USA, 2005. ISBN 1-931971-31-5.
[22] Yu-Chung Cheng, John Bellardo, Péter Benkö, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Jigsaw: Solving the puzzle of enterprise 802.11 analysis. In ACM SIGCOMM Conference, pages 39–50, Pisa, Italy, September 2006. ISBN 1-59593-308-5.
[23] Yu-Chung Cheng, Mikhail Afanasyev, Patrick Verkaik, Péter Benkö, Jennifer Chiang, Alex C. Snoeren, Stefan Savage, and Geoffrey M. Voelker. Automating cross-layer diagnosis of enterprise wireless networks. In ACM SIGCOMM Conference, pages 25–36, Kyoto, Japan, August 2007. ISBN 978-1-59593-713-1.
[24] Thomas Claveirole. WScout. http://wscout.lip6.fr/. Lightweight pcap trace visualizer.
[25] Thomas Claveirole. WiPal. http://wipal.lip6.fr/. IEEE 802.11 Trace Manipulation Software.
[26] Thomas Claveirole and Marcelo Dias de Amorim.
WiPal: Efficient offline merging of IEEE 802.11 traces. To appear in the ACM Mobile Computing and Communications Review, 2010. Draft available at http://www.citebase.org/abstract?id=oai:arXiv.org:0806.4526.
[27] Falko Dressler. A study of self-organization mechanisms in ad hoc and sensor networks. Computer Communications, 31(13):3018–3029, 2008. ISSN 0140-3664.
[28] Diego Dujovne. WisMon: A wireless network statistical tool. Technical report, INRIA Sophia Antipolis, October 2006.
[29] Diego Dujovne, Thierry Turletti, and Fethi Filali. A taxonomy of IEEE 802.11 wireless parameters and open source measurement tools. IEEE Commun. Surveys Tuts., 2009.
[30] David Eckhardt and Peter Steenkiste. Measurement and analysis of the error characteristics of an in-building wireless network. In ACM SIGCOMM Conference, pages 243–254, Palo Alto, California, United States, 1996. ISBN 0-89791-790-1.
[31] Jeremy Elson, Lewis Girod, and Deborah Estrin. Fine-grained network time synchronization using reference broadcasts. In NSDI: ACM/USENIX Symposium on Networked Systems Design and Implementation, pages 147–163, Boston, Massachusetts, USA, 2002.
[32] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1994. ISBN 0-201-63361-2.
[33] Marta C. González, César A. Hidalgo, and Albert-László Barabási. Understanding individual human mobility patterns. Nature, (453):779–782, June 2008.
[34] K. N. Gopinath, Pravin Bhagwat, and K. Gopinath. An empirical analysis of heterogeneity in IEEE 802.11 MAC protocol implementations and its implications. In WiNTECH: ACM SIGMOBILE International Workshop on Wireless Network Testbeds, Experimental Evaluation and CHaracterization, Los Angeles, California, USA, September 2006.
[35] Tristan Henderson, David Kotz, and Ilya Abyzov. The changing usage of a mature campus-wide wireless network. Computer Networks, 52(14):2690–2712, October 2008.
ISSN 1389-1286.
[36] Pan Hui, Augustin Chaintreau, James Scott, Richard Gass, Jon Crowcroft, and Christophe Diot. Pocket switched networks and human mobility in conference environments. In WDTN: ACM SIGCOMM Workshop on Delay Tolerant Networking and Related Topics, pages 244–251, Philadelphia, Pennsylvania, USA, August 2005. ISBN 1-59593-026-4.
[37] IEEE 802.11. IEEE Standard for Information Technology - Telecommunications and Information Exchange Between Systems - Local and Metropolitan Area Networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, June 2007. IEEE Std 802.11-2007 (Revision of IEEE Std 802.11-1999).
[38] Amit P. Jardosh, Krishna N. Ramachandran, Kevin C. Almeroth, and Elizabeth M. Belding-Royer. Understanding congestion in IEEE 802.11b wireless networks. In IMC: ACM SIGCOMM/USENIX Internet Measurement Conference, Berkeley, California, USA, October 2005.
[39] Amit P. Jardosh, Krishna N. Ramachandran, Kevin C. Almeroth, and Elizabeth M. Belding-Royer. Understanding link-layer behavior in highly congested IEEE 802.11b wireless networks. In E-WIND: ACM SIGCOMM Workshop on Experimental Approaches to Wireless Network Design and Analysis, pages 11–16, Philadelphia, Pennsylvania, USA, August 2005. ISBN 1-59593-026-4.
[40] Stephen C. Johnson. Yacc: Yet another compiler-compiler. Technical report, AT&T Bell Laboratories, 1979.
[41] T. Karagiannis, J.Y. Le Boudec, and M. Vojnović. Power law and exponential decay of inter contact times between mobile devices. In MobiCom: ACM SIGMOBILE Annual International Conference on Mobile Computing and Networking, Montréal, Québec, Canada, September 2007.
[42] Minkyong Kim, David Kotz, and Songkuk Kim. Extracting a mobility model from real user traces. In INFOCOM: IEEE Conference on Computer Communications, Barcelona, Spain, April 2006.
[43] Ratul Mahajan, Maya Rodrig, David Wetherall, and John Zahorjan.
Analyzing the MAC-level behavior of wireless networks in the wild. In ACM SIGCOMM Conference, pages 75–86, Pisa, Italy, September 2006. ISBN 1-59593-308-5.
[44] Ratul Mahajan, Maya Rodrig, and John Zahorjan. CRAWDAD tool tools/analyze/802.11/wit (v. 2006-09-29). Downloaded from http://crawdad.cs.dartmouth.edu/tools/analyze/802.11/Wit, September 2006.
[45] Mohammad Hossein Manshaei, Mathieu Lacage, Ceilidh Hoffmann, and Thierry Turletti. On selecting the best transmission mode for wifi devices. Wirel. Commun. Mob. Comput., 9(7):959–975, 2009. ISSN 1530-8669.
[46] Salim Nahle and Naceur Malouch. Graph-based approach for enhancing capacity and fairness in wireless mesh networks. In GLOBECOM: IEEE Global Communications Conference, Honolulu, Hawaii, USA, November 2009.
[47] Ruoming Pang, Vern Paxson, Robin Sommer, and Larry Peterson. binpac: a yacc for writing application protocol parsers. In IMC: ACM SIGCOMM/USENIX Internet Measurement Conference, pages 289–300, Rio de Janeiro, Brazil, 2006. ISBN 1-59593-561-4.
[48] Maxim Raya, Imad Aad, Jean-Pierre Hubaux, and Alaeddine El Fawal. DOMINO: Detecting MAC layer greedy behavior in IEEE 802.11 hotspots. IEEE Trans. Mobile Comput., 5:1691–1705, 2006. ISSN 1536-1233.
[49] Maya Rodrig, Charles Reis, Ratul Mahajan, David Wetherall, and John Zahorjan. Measurement-based characterization of 802.11 in a hotspot setting. In E-WIND: ACM SIGCOMM Workshop on Experimental Approaches to Wireless Network Design and Analysis, pages 5–10, Philadelphia, Pennsylvania, USA, August 2005. ISBN 1-59593-026-4.
[50] Maya Rodrig, Charles Reis, Ratul Mahajan, David Wetherall, John Zahorjan, and Ed Lazowska. CRAWDAD data set uw/sigcomm2004 (v. 2006-10-17). Downloaded from http://crawdad.cs.dartmouth.edu/uw/sigcomm2004, October 2006.
[51] Bilel Ben Romdhanne, Diego Dujovne, and Thierry Turletti. Efficient and scalable merging algorithms for wireless traces. ROADS: Workshop on Real Overlays and Distributed Systems, October 2009.
[52] Björn Scheuermann, Wolfgang Kiess, Magnus Roos, Florian Jarre, and Martin Mauve. On the time synchronization of distributed log files in networks with local broadcast media. IEEE/ACM Trans. Netw., 17(2):431–444, April 2008. ISSN 1063-6692.
[53] Aaron Schulman, Dave Levin, and Neil Spring. On the fidelity of 802.11 packet traces. In PAM: Passive and Active Measurement Conference, pages 132–141, Cleveland, Ohio, USA, April 2008.
[54] Pablo Serrano, Michael Zink, and Jim Kurose. Assessing the fidelity of COTS 802.11 sniffers. In INFOCOM: IEEE Conference on Computer Communications, pages 1089–1097, April 2009.
[55] Douglas C. Sicker, Paul Ohm, and Dirk Grunwald. Legal issues surrounding monitoring during network research. In IMC: ACM SIGCOMM/USENIX Internet Measurement Conference, pages 141–148, San Diego, California, USA, 2007. ISBN 978-1-59593-908-1.
[56] Libo Song, David Kotz, Ravi Jain, and Xiaoning He. Evaluating location predictors with extensive Wi-Fi mobility data. In INFOCOM: IEEE Conference on Computer Communications, Hong Kong, China, March 2004.
[57] VeriWave. WaveTest multi client traffic generator / performance analyzer. http://veriwave.com/products/wavetest_90_20.asp, 2004.
[58] Mark Weiser. The computer for the twenty-first century. Scientific American, pages 94–104, September 1991.
[59] Jihwang Yeo, Moustafa Youssef, and Ashok Agrawala. A framework for wireless LAN monitoring and its applications. In WiSe: IEEE International Workshop on Wireless & Internet Services, Philadelphia, Pennsylvania, USA, October 2004.
[60] Jennifer Yick, Biswanath Mukherjee, and Dipak Ghosal. Wireless sensor network survey. Computer Networks, 52(12):2292–2330, 2008. ISSN 1389-1286.
[61] Moustafa Youssef, Matthew Mah, and Ashok Agrawala. Challenges: Device-free passive localization for wireless environments. In MobiCom: ACM SIGMOBILE Annual International Conference on Mobile Computing and Networking, pages 222–229, Montréal, Québec, Canada, 2007.
ISBN 978-1-59593-681-3.

List of Figures

1.1 Wireless sniffing: passive monitors listen to the wireless activity inside the measurement area.
2.1 WiPal’s structure and modules.
2.2 A complex filter example. This figure shows how WiPal uses filters to synchronize and merge two IEEE 802.11 traces. Each box represents a filter and arrows show pipes. Pipes convey different types of data.
2.3 A screenshot of WScout [24]. WScout uses WiPal’s random access feature to open packet traces that do not fit in memory.
2.4 A simple processing pipeline using two filters (represented as white boxes). Listing 2.5 displays the code implementing this pipeline.
2.5 Mean execution time for a hundred runs of the various test programs. Note that most 95% confidence intervals are too small to be distinguished clearly.
3.1 Merging two traces T1 and T2.
3.2 The structure of a merge process in WiPal.
3.3 Synchronization difference w.r.t. linear regression window size. The upper curve represents average, minimum, and maximum values for seven of the eight merges. The lower curve represents the result for the other one, and is plotted separately because it has a singular shape. We think this is related to the timestamping accuracy of the input traces for this merge.
3.4 Number of frames detected as shared by both input traces w.r.t. linear regression window size. The curve represents the average, minimum and maximum values for eight merge operations. For each merge operation, this number is normalized using 1 as the number of frames from the window size that gives the highest value.
4.1 ASUS EeePC 700 with three Netgear WG111v3, as used for trace collection.
4.2 Number of MAC addresses each merged trace contains from its beginning to a given time. Contrary to table 5.1, which only accounts MAC addresses from frames’ sender fields, all fields containing valid MAC addresses are used.
4.3 “Score” of a single merge operation. m_N is the last merge, i.e., the one that includes N sniffers. Note that when k > 2, m_(k−1) features frames from at least two distinct sniffer traces and thus it is expected that o_2 > o_1. Therefore in most cases score(m_k) = o_1 / |m_N|.
4.4 Successive computations of M_k for N = 4. An arrow from x to y symbolizes the x ? y merge operation.
4.5 Evolution of scores w.r.t. the number of monitors.
4.6 Scores w.r.t. number of monitors and dataset. Each column represents a given channel of a specific dataset. Each row M_k represents the set of sub-datasets of size k. Each cell contains a box whose size is proportional to the average number of packets inside the corresponding sub-datasets. Red (dark) parts of boxes represent average values of o_1 (see Figure 4.3). Pink parts (medium grey) represent average values of o_2. Numbers below boxes are average scores (in percent) with 95% confidence intervals.
5.1 Distributions of cumulated activity durations.
5.2 Number of distinct MAC addresses each trace contains from its beginning to a given time.
5.3 CCDFs of aggregated inter-activity times of all devices for the three traces. The distributions are well fitted by truncated power laws with exponential decays. The parameters of the distributions are presented in the text.
5.4 5.5
. . Proportion of users that are active each time interval relatively to the first time (interval) they appeared for the three traces. In these traces, we observe a clear periodicity of 24 hours with some variations that are characteristic of the social meaning of each environment. . . . . . . . . . . . . . . . . . . . . . . Sniffer locations regarding the collection of traces inside the Parc Monceau. The subsequent trace analysis is currently in progress. (Background from Google Maps.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1 Sniffing sans fils : des moniteurs passifs écoutent l’activité radio au sein de la zone de mesure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2 L’architecture et les modules de WiPal. . . . . . . . . . . . . . . . . . . . . . . . A.3 Fusion de deux traces T1 et T2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.4 L’architecture du processus de fusion de WiPal. . . . . . . . . . . . . . . . . . . A.5 Un ASUS EeePC 700 avec trois adaptateurs Wi-Fi USB Netgear WG111v3 tel qu’utilisé pour la collection de nos traces. . . . . . . . . . . . . . . . . . . . . . A.6 Distribution des durées d’activité cumulées, pour chaque trace et pour chaque station. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.7 Nombre d’adresse MAC distinctes que contient chaque trace entre le début de la mesure et un temps donné. . . . . . . . . . . . . . . . . . . . . . . . . . . A.8 Position des moniteurs pour la collection de traces dans le parc Monceau. Le travail d’analyse des traces est en cours. (Arrière plan : Google Maps.) . . . . 52 58 60 61 63 65 74 79 80 82 83 85 87 89 List of Tables 3.1 Characteristics of the traces used for testing merge operations. Id. relates to the identification number of the merge operations. . . . . . . . . . . . . . . . . 34 4.1 Quantitative characteristics of the 2008-12-01 and 2008-12-19 datasets. . . . 
46 5.1 Quantitative characteristics of the Office, Residential sparse, and Residential dense traces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 125 126 List of Tables Listings 2.1 A sample program using WiPal. This program prints the type of every IEEE 802.11 frame included in file.pcap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 The program of listing 2.1 with support for multiple PHY headers. . . . . . . 14 2.3 A typical example of packet processing code. The code is error-prone, depends on the whole protocol stack, and does not handle truncated frames. . . 15 2.4 A program using WiPal’s IEEE 802.11 parser. It uses the same main() function as listing 2.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.5 An example of advanced trace processing using filters. This program uses the same main() function as listing 2.2. . . . . . . . . . . . . . . . . . . . . . . . . . 21 A.1 Un exemple de programme qui utilise la bibliothèque de WiPal. Ce programme affiche le type de chaque trame IEEE 802.11 qui compose file.pcap. 78 127 128 Listings