Motivations: at Web scale
Change Control in XML
Course: Semi-structured Data
DEA I3: Information, Interaction, Intelligence
Grégory Cobéna, http://www-rocq.inria.fr/verso/, [email protected]
20/12/2002

Objectives
Understand the management of dynamic data:
• At large scale: the case of a Web data warehouse (Xyleme)
• At the scale of the XML document: the case of version management

Motivations: at Web scale
Change control is first of all:
• Being able to discover data sources and XML documents, on the Web or on an intranet
• Tracking these documents over time
• Extracting knowledge about what changes: the documents, their properties, their content

Motivations: at document scale
Where does the notion of change arise?
• When managing several documents, one studies the changes between documents. Example: an XML file describing two car models, a Peugeot 307 and a 206
• When one is interested in how a given document evolves over time. Example: an XML file describing an address book

Issues
• Semi-structured data must provide a more precise description than plain text, with well-defined semantics
• Managing changes in semi-structured data is even more complex than in relational databases

Course outline
Xyleme
• A large-scale XML data warehouse
• Integration of Web data
• Active monitoring of Web data
XML Diff
• Representation of changes
• Detection of changes

Organization
Part One: Xyleme
1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
4.
XML Repository, Semantic Data Integration and Query Processing
5. Query Subscription
Conclusion

Part I: Xyleme, a Dynamic Warehouse for the XML Data of the Web

(Part I: Xyleme) 1. The Web and XML

The Web today
• Terabytes of data
• A lot of public pages: 1 billion in 06/2000, on several million servers
• Private web: pages that are not publicly available
• Deep web: data hidden behind forms

HTML = Hypertext Language
Text + presentation. Where is the data?

  The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.

Loading such a page into an information system is hard. The target table is:

  Ref    Name     Price
  X23    Camera   359.99
  R2D2   Robot    19350.00
  Z25    PC       1299.99

XML = Semistructured Data
Data + structure. Semistructured: more flexible. Loading the same table into an information system is easy.

  <product-table>
    <product reference="X23">
      <designation> camera </designation>
      <price unit="Dollars"> 359.99 </price>
      <description> … </description>
      ...
    </product>
    <product reference="R2D2">
      <designation> Robot </designation>
      <price unit="Dollars"> 19350 </price>
      <description> … </description>
      ...
    </product>
  </product-table>

XML: Tree Types
(Part I: Xyleme) 2.
A Dynamic Warehouse for the XML Data of the Web

Tree of element types: product-table contains product; product contains designation, price, reference and description.
Semantics and structure are in paths:
• product-table/product/reference
• product-table/product/price

Xyleme Research
Project Xyleme at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a knowledge database.
• INRIA: Sophie Cluet (databases, OQL…), Serge Abiteboul (semi-structured data + web)
• Mannheim University: Guido Moerkotte
• Université d'Orsay: Marie Christine Rousset
• CNAM: Dan Vodislav

Xyleme Company
Started September 2000 (25 employees at the end of 2001). Guy Ferran: ex O2 Technology.
Market challenges:
• Few XML documents are available on the Web (because of weak software support)
• The company is focusing on private XML: press, editors, financial data, biology…
Technology:
• Scalability for large amounts of data
• Internet (+ focus) / intranet support
• Monitoring and version management
• Heterogeneous data integration

Architecture
• Cluster of PCs
• Developed with Linux and C++
• Communications: local with Corba, external with HTTP
• Distribution between autonomous machines
• Now Web Services

Functional Architecture
[Diagram: a user interface and a web interface reach the Xyleme interface over the Internet. Modules: Change Control, Semantic Module, Query Processor, Repository and Index Manager, Acquisition (Loader and Crawler).]

(Part I: Xyleme) 3.
Data Acquisition and Maintenance, Page Importance

Architecture
[Diagram: machines connected by Ethernet. Labels recovered: Change Control and Semantic Integration, Index, Loader | Query, Repository, Acquisition and Maintenance. The acquisition machines face the Internet.]

Goals
• Discover XML pages on the web that are of interest for customers; for this, crawl the web (HTML + XML)
• Keep them up to date
• Do this under bounded resources: memory for known URLs, bandwidth

Life Cycle of a page in Xyleme
• The URL of D is discovered as a link in another page (or published by a customer)
• The page scheduler decides to read D
• The metadata of D is read: type, last_date_update...
• The document D is loaded
• The document D is (re)read regularly

Main Issues
• Loading of pages: we can load up to 5 million pages/day on a standard PC; the main cost is the Internet connection
• Metadata management (access to disk)
• Page scheduling: decide which page to read or refresh next

Page Importance (M. Preda, S. Abiteboul, G. Cobéna)
• Definition: important pages are linked to by important pages
• Offline algorithm (used by Google)
• Our online algorithm: does not require maintaining graph information; faster convergence with focused crawling

(Part I: Xyleme) 4.
XML Repository: Semantic Data Integration and Query Processing

Querying Language
Today: a mix of OQL and XQL. We are currently moving to X-Query (which is also a mix of OQL and XQL…).

  Select boss/Name, boss/Phone
  From comp in BusinessDomain, boss in comp//Manager
  Where comp/Product contains "Xyleme"

Indexing
• Standard inverted index: word → documents that contain this word
• Xyleme index: word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data

Web Heterogeneity
• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic integration: one abstract DTD for the domain, which gives the illusion that the system maintains a homogeneous database for this domain
• 1 domain = 1 abstract DTD

I.4.1 Xyleme: Semantic Data Integration

Data Integration
One application domain, several schemas:
• heterogeneous vocabulary and structure
Xyleme Semantic Integration:
• gives the illusion that the system maintains a homogeneous database for each domain
• abstracts a set of DTDs into an abstract DTD = a hierarchy of pertinent terms for a particular domain

Technology in short
Cluster DTDs into application domains:
• Business, culture, tourism, biology, …
For an application domain, semi-automatically:
• Organize tags into a hierarchy of concepts using thesauri such as Wordnet and other linguistic tools
• This provides the abstract DTD for the particular domain
• Generate mappings between concrete DTDs and the abstract one

I.4.2 Xyleme: Query Processing

Xyleme Query Language
A mix of OQL and XQL; will use the W3C standard when there is one.

  Select product/name, product/price
  From doc in catalogue, product in doc/product
  Where product//components contains "flash"
  and product/description contains "camera"

Principle of Querying
A query on the abstract DTD is translated, through the mappings between concrete and abstract DTDs, into a union of concrete queries (possibly with joins):
• catalogue/product/price ⇒ d1//camera/price ⇒ d2/product/cost
• catalogue/product/description ⇒ d1//camera/description ⇒ d2/product/info, ref ⇒ d2/description

Query Processing
1. Partial translation, from abstract to concrete, to identify "machines" with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, relaxation
Essential use of a smart index combining full-text and structure.

I.4.2 Xyleme: Repository

Storage System: Xyleme Store
• Efficient storage of trees in variable-length records within fixed-length pages
• Balancing of tree branches in case of overflow
• Minimize the number of I/O for direct access and scanning
• Good compromise: compaction / access time

Tree Balancing in Xyleme Store
[Figure: Records 1 to 4 spread over pages. Labels recovered: "Overflow: more children in other page", "Overflow: sub-tree in other page".]

Questions?

(Part I: Xyleme) 5.
Change Control

The Web changes all the time
• Data acquisition + maintenance: keep the warehouse up to date
• Version management: representation and storage of change (see Part II)
• Change monitoring: query subscription

Subscription Language
• SQL-like language based on 'atomic events'
• Combines the use of monitoring queries and continuous queries
• The language can be extended by adding new types of atomic events
• Uses the XML Query Language for continuous queries
"Querying the XML Documents of the Web", V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Example

  subscription myPaintings
  % what are the new painting entries in Musee d'Orsay site
  monitoring newPainting
  select URL
  where URL extends www.musee-orsay.fr/*
  and <painter> contains "Monet"
  % manage the changes in the expositions
  continuous delta Exposition
  select ... from ... where
  when monthly
  notify daily % send me a daily report

The two conditions of the where clause are atomic events.

Step 1: Atomic Event Detection
[Diagram: the XML loader passes each document d to the alerters (5 million pages/day), next to the metadata manager. The document and its alerts, e.g. d/46 and then d/46,67, go on to complex event detection.]
• atomic event 46: URL matches pattern www.musee-orsay.fr/*
• atomic event 67: XML document contains the tag <painter> with the value "Monet"

Alerters
• Each alerter can be viewed as a plug-in that acts on a document flow
• All sorts of atomic events can be detected: URL pattern detection, keywords, XPath expressions, page rank…
• Can be distributed
• Some advanced alerts are: long string look-ups; finding XML patterns (e.g. XPath); comparing digital signatures of text documents (copy tracker)

URL Patterns Detection (1)
• Supported patterns: URL | prefix* | *suffix
• Using a hash table: try all possible patterns
• Each test is in O(1); the total test time is O(n), where n is the length of the URL
• Example: for http://www.inria.fr/verso/index.html, test http://www.inria.fr/verso/* and http://www.inria.fr/*

URL Patterns Detection (2)
• Using a tree: navigate the tree until a leaf is encountered
• Example tree:

  <root str="www.inria.fr/">
    <node str="/verso" alert="1">
      <node str="/index.html" alert="2"/>
      <node str="/main.html" alert="3"/>
    </node>
  </root>

• Patricia trees?
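The hash-table variant of URL pattern detection can be sketched as follows. This is only an illustration under stated assumptions: the tables `PREFIX_ALERTS` and `EXACT_ALERTS`, the alert numbers, and the function names are hypothetical and do not come from Xyleme. Each look-up is counted as O(1) as on the slide; Python's slicing and hashing actually cost time proportional to the prefix length, so a true O(n) total would need an incrementally computed hash.

```python
# Hypothetical alert tables: a prefix pattern is stored without its trailing '*'.
PREFIX_ALERTS = {
    "www.inria.fr/": 1,
    "www.inria.fr/verso/": 2,
}
EXACT_ALERTS = {
    "www.inria.fr/verso/index.html": 3,
}


def strip_scheme(url):
    """Drop a leading http:// or https:// so URLs compare with the patterns."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def match_url(url):
    """Return the sorted alert ids raised by `url`."""
    url = strip_scheme(url)
    alerts = set()
    if url in EXACT_ALERTS:
        alerts.add(EXACT_ALERTS[url])
    # Try every prefix of the URL: n hash look-ups for a URL of length n.
    for end in range(1, len(url) + 1):
        alert = PREFIX_ALERTS.get(url[:end])
        if alert is not None:
            alerts.add(alert)
    return sorted(alerts)


print(match_url("http://www.inria.fr/verso/index.html"))  # [1, 2, 3]
```

Suffix patterns (`*suffix`) could be handled symmetrically by looking up every suffix in a second table.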
Keywords Sequence Algorithm
Problem:
• Detect: "Air France"
Solution:
• a tree of backward keyword sequences
• a context memory with O(1) update cost
The tree is implemented over a hash table.

Simple XPath filtering Algorithm
Problem:
• detect <a> CONTAINS "word"
Solution:
• Reverse the path expression
• Use postfix order
• Use a stack for '//' and another stack for '/'

Simple XPath filter example: understanding the tree structure in postfix order
Consider the tree: <A><B><C>toto</C><C/></B></A>
Nodes come as:
• toto (id=1, level=4)
• C (id=2, level=3)
• C (id=3, level=3)
• B (id=4, level=2)
• A (id=5, level=1)

Simple XPath: Example
<a> CONTAINS toto is detected by:
• "toto"::ancestor <a>
When "toto" is detected, it is stored. For each ancestor of "toto", the name is compared to <a>.
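The postfix-order filter for `<a> CONTAINS "word"` can be sketched as below. This is an illustrative reading of the slides, not Xyleme's code: the tuple layout `(label, level, is_text)` and the function name are assumptions. The key observation is that, in postfix order, the first node whose level is lower than that of a stored match is its parent, the next lower one its grandparent, and so on; a stack of levels tracks this for several stored matches at once.

```python
def contains(postfix_nodes, tag, word):
    """Return True if some <tag> element has `word` in the text of its subtree.

    `postfix_nodes` yields (label, level, is_text) tuples in postfix order,
    i.e. children before their parent, as a streaming parser could emit them.
    """
    # One entry per stored match: the level of the last node seen on the
    # path from that match up to the root. Levels never decrease upward
    # in the stack, so ancestors are found by popping from the top.
    stack = []
    for label, level, is_text in postfix_nodes:
        if is_text:
            if word in label.split():
                stack.append(level)  # the word is detected and stored
            continue
        is_ancestor = False
        while stack and stack[-1] > level:
            stack.pop()
            is_ancestor = True
        if is_ancestor:
            if label == tag:  # compare the ancestor's name to the target
                return True
            if not stack or stack[-1] != level:
                stack.append(level)
    return False


# The tree <A><B><C>toto</C><C/></B></A> in postfix order.
nodes = [
    ("toto", 4, True),
    ("C", 3, False),
    ("C", 3, False),
    ("B", 2, False),
    ("A", 1, False),
]
print(contains(nodes, "A", "toto"))  # True
```

The second `C` arrives at the same level as the stored frontier, so it is recognised as a sibling and not as an ancestor of "toto".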
All tests are executed using a hash table.

Stemming
On the Alerter:
• Example: Éléphant -> ELEPHANT
• Do it for 500 documents / second
• Noise may be introduced (example: tâche = tache)
On the Subscription Manager:
• To avoid duplicate registration of similar events
• To show the user how his query is stemmed
Real stemmers: chevaux -> cheval

Step 2: Complex Event Detection
[Diagram: the HTML parser and the XML loader produce alerts for millions of pages/day; complex event detection matches them against millions of subscriptions.]
• complex event 12: 67 & 46 (XML document contains the tag <painter> with value "Monet" and URL matches pattern www.musee-orsay.fr/*)

Complex Events Algorithm
• The formal problem is NP-hard
• We proposed several possible algorithms
• Simulation experiments showed the effectiveness of our solutions
• The hash-tree based algorithm is well suited for our application: 10 million complex events, 1 million atomic events, 100 atomic events detected per document
• 0.8 ms to process a document. ~2 million documents per day (on each PC).
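One way to picture complex event detection is a tree of sorted atomic-event identifiers whose nodes keep their children in hash tables. The sketch below is an assumption-laden illustration, not the hash-tree algorithm evaluated above: the class and function names are hypothetical, and a complex event is taken to be a plain conjunction of atomic events.

```python
class Node:
    """A tree node; children are kept in a hash table keyed by atomic event id."""

    def __init__(self):
        self.children = {}
        self.complex_ids = []  # complex events whose last atomic event ends here


def build(complex_events):
    """Index {complex_id: set of atomic ids} as a tree of sorted id sequences."""
    root = Node()
    for complex_id, atoms in complex_events.items():
        node = root
        for atom in sorted(atoms):
            node = node.children.setdefault(atom, Node())
        node.complex_ids.append(complex_id)
    return root


def detect(root, detected_atoms):
    """Return the complex events whose atomic events were all detected."""
    atoms = sorted(set(detected_atoms))
    found = []

    def walk(node, start):
        found.extend(node.complex_ids)
        # Only later (larger) atomic ids can extend the current sequence.
        for i in range(start, len(atoms)):
            child = node.children.get(atoms[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return sorted(found)


# Complex event 12 is "67 & 46" as in the example; 13 and 14 are made up.
index = build({12: {46, 67}, 13: {46}, 14: {67, 99}})
print(detect(index, [67, 46]))  # [12, 13]
```

Only branches that exist in the index are explored, so the cost depends on how many registered sequences share prefixes with the detected events, not on the total number of subscriptions.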
Step 3: Notification Processor
[Diagram: alerts feed complex event detection, which sends monitoring notifications to the Reporter (millions of notifications/day). A clock fires triggers that run continuous queries, whose results are also sent to the Reporter as notifications.]

Architecture
[Diagram. Components recovered: Xyleme Alerter, Complex Event Detection, Trigger Engine, Xyleme Query Processor, Xyleme Reporter, Xyleme Subscription Manager, SQL, Web Browser. Documents flow in through the Alerter.]

Monitoring Applications

Copy tracking
• Example: a press agency wants to check that people are not illegally publishing copies of its wires
• Need to react fast to changes: an illegal copy of a wire may last only a couple of days
[Diagram: (1) query to a search engine, or specific crawl + pre-filter, producing a flow of candidate documents; (2) filter; (3) detection.]

Web portal management
Standard portal management:
• Unreachable pages
• Dangling pointers
• Incorrect pages (e.g., do not parse)
• Detection of interesting pages on the web
• Etc.
Portal archiving
Subscription and notification
(Copy tracking diagram, filter step: slice the document)

Web surveillance
Applications:
• Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
• Business intelligence, e.g., discovering potential customers, partners, competitors
Find the data (crawl the web).
Monitor the changes:
• new pages, deleted pages, changes in a page
Classify information and extract data of interest:
• Data mining, text understanding, knowledge representation and extraction, linguistics… Very AI

Conclusion and Perspectives
• Focus crawling on important pages: refine the notion of importance; improve the discovery of important pages
• Improve change control accuracy
• Semantic web
• Real-time advanced processing

Questions?

Part Two: XML Diff

Objectives:
• Versions: version some documents or some sites (store a 'delta')
• Temporal queries (persistent identification of nodes)
• Change monitoring (query changes)
We proposed a representation of changes: "Change-Centric Management of Versions" (VLDB 2001).
We developed a diff algorithm for XML: "Detecting Changes in XML Documents", G. Cobéna, S. Abiteboul, A. Marian, ICDE 2002 (San Jose).

(Part II: XML Diff) 1. Detecting Changes in XML Documents
Grégory Cobéna, Serge Abiteboul, Amélie Marian
INRIA Rocquencourt, Columbia University

Introduction
Overview: algorithms for detecting changes in XML documents
• An XML diff algorithm
• A comparative study for XML change detection

Plan
• Motivations
• State of the art
• Change model
• Algorithm: tradeoff 'quality' versus speed; quasi-linear time and space complexity
• Experiments: synthetic and real-world experiments

II.1.1 XML Diff: Motivations

Motivations
• Change-centric management of versions in an XML warehouse. A. Marian, S. Abiteboul, G. Cobéna, L. Mignet, VLDB 2001
• Monitoring XML data on the Web. B. Nguyen, S. Abiteboul, G. Cobéna, M. Preda, SIGMOD 2001
• Learning about changes
• Architecture and requirements ('speed')
• Multiple optimality criteria ('quality')

What is a diff?
• Unix Diff: shows the lines that differ between two text files
• String Diff: shows which symbols have changed
• XML Diff: which parts of the tree have been modified, inserted or deleted
In fact, all these problems are very similar.

The String Edit Problem
Consider the string abcdefg. How can it be transformed into bczdeyz?
Possible solutions:
• Delete all 7 characters and insert 7 other characters
• Update <a> into <b>, <b> into <c>, <c> into <z>, <f> into <y>, <g> into <z>
• Mix both solutions
Question: what is the shortest edit sequence?

Solving the String-Edit-Problem
If we know how to transform S1 into S2, then we know how to transform S1x into S2y. Conversely, the shortest way to transform S1x into S2y is found by comparing three candidate transformations.
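The question "what is the shortest edit sequence?" has a standard dynamic-programming answer over string prefixes. The sketch below is one minimal version, assuming a unit cost for every insert, delete and update (the cost parameters and the function name are choices made for this illustration).

```python
def edit_distance(s1, s2, insert_cost=1, delete_cost=1, update_cost=1):
    """Cheapest cost of turning s1 into s2 with insert, delete and update."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # m[x][y] is the cost of transforming s1[:x] into s2[:y].
    m = [[0] * cols for _ in range(rows)]
    for x in range(1, rows):
        m[x][0] = m[x - 1][0] + delete_cost  # delete every character of s1[:x]
    for y in range(1, cols):
        m[0][y] = m[0][y - 1] + insert_cost  # insert every character of s2[:y]
    for x in range(1, rows):
        for y in range(1, cols):
            same = s1[x - 1] == s2[y - 1]
            m[x][y] = min(
                m[x - 1][y - 1] + (0 if same else update_cost),
                m[x - 1][y] + delete_cost,
                m[x][y - 1] + insert_cost,
            )
    return m[-1][-1]


print(edit_distance("abcdefg", "bczdeyz"))  # 4
```

For the example, a cheapest script is: delete `a`, keep `bc`, insert `z`, keep `de`, update `f` into `y` and `g` into `z`, which costs 4, against 5 for updates only and 14 for deleting and re-inserting everything.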
The three candidate transformations of S1x into S2y are:
• S1 into S2, then x into y
• S1x into S2, then insert y
• delete x, and then S1 into S2y

String Edit Problem: the algorithm
• Two strings S1 and S2
• Cost(x,y) represents the shortest edit cost to transform S1[1..x] into S2[1..y]
• The cost is the sum of the individual costs of each edit operation (insert, delete, update)
Then Cost(x,y) is the minimum of:
• Cost(x-1,y-1) + update_cost(S1[x],S2[y])
• Cost(x-1,y) + delete_cost(S1[x])
• Cost(x,y-1) + insert_cost(S2[y])

A Quadratic Solution
The solution is to represent all possible paths on a matrix M[1..|S1|][1..|S2|]:
• M[x,y] represents the cost of transforming S1[1..x] into S2[1..y]
• M[x,y] can be computed using M[x-1,y-1], M[x-1,y] and M[x,y-1]
• M[0,i] and M[i,0] are obvious
• Thus, M[|S1|,|S2|] can be computed
Note that the number of paths is exponential, but the cost remains quadratic. Time and space cost is O(|S1|*|S2|).

Questions?

State of the art (1): the string edit problem
• The best result is an O(|s|^2 / log s) solution over a finite alphabet
• O(|x|*|y|) solution with a directed acyclic graph
[Figure: edit graph between a source string (A C …) and a destination string (B C …); edges are labelled delete C (cost=1), insert C (cost=1) and do nothing (cost=0).]
• Compute M[x,y] only close to the diagonal (E. Myers): finds the solution in O(n*D), where n is the size of the largest string and D the distance between the two strings

Extending the String problem
• Adapt M[x,y] to work on trees (S. Chawathe)
• Remove some edges to ensure that deleting a node deletes the subtree rooted at that node (and conversely for insert)

State of the art (2): the tree pattern matching for XML
• Kuo-Chung Tai, Lu, Selkow: based on the string edit problem
• In XML, many labels are identical
• Unix Diff, Sun DiffML
• LaDiff (MH-Diff), by Chawathe, Rajaraman, Garcia-Molina and J. Widom: matching criteria to compare nodes and subtrees; quadratic in the 'distance' between both trees
• IBM diff, available at alphaWorks

Data Model
Issue: persistent identification of nodes.
[Figure: a Catalog of products (Pr), each with a name (N) and a price (P). Version 1: Camera 300, TV 100, VCR 200. Version 2: TV 100, DVD 500, VCR 150.]

Change Model
Attach persistent identifiers:
• to every node = XID
• to the document = XID-map
Represent changes with a delta:
• Delta = set of changes
• Nice mathematical properties
Change-centric Management of Versions, VLDB 2001

[Figure: the same catalog with XIDs assigned in postfix order.
Version 1, XID-map (1-16|17):
  Catalog 16
    Pr 5: N 2 (Camera 1), P 4 (300 3)
    Pr 10: N 7 (TV 6), P 9 (100 8)
    Pr 15: N 12 (VCR 11), P 14 (200 13)
Version 2, new XID-map (6-10,17-21,11-16|22):
  Catalog 16
    Pr 10: N 7 (TV 6), P 9 (100 8)
    Pr 21: N 18 (DVD 17), P 20 (500 19)
    Pr 15: N 12 (VCR 11), P 14 (150 13)
Pr 5 is deleted, Pr 21 is inserted, text node 13 is updated.]
Diff(V1,V2): delete(5), update(13,150), insert(16,2,(17-21))

Algorithm: Intuition
Objectives:
• Assign persistent identifiers by matching nodes
• Compute a representation of the changes between the two documents
Also:
• Constraint-awareness: follow DTD specifications
• Correctness: no change is missed

Representation of Changes: Example

  <delete XID-map=(1-5) parent=16 pos=1>
    <Product><Name>Camera</Name><Price>$300</Price></Product>
  </delete>
  <insert XID-map=(17-21) parent=16 pos=2>
    <Product><Name>DVD</Name><Price>$400</Price></Product>
  </insert>
  <update XID=13>
    <old>$200</old><new>$150</new>
  </update>

II.1.2 XML Diff: The XyDiff Algorithm

Phase 1: Identify Subtrees
• Use ID-attributes from the DTD to match nodes (or forbid matching)
• One traversal of the tree
• Compute for every subtree: signature and weight

Phase 2: Bottom-Up + Lazy-Down Propagation
Let L be the list of all subtrees in the second document. For each subtree S in L (in decreasing weight order):
• 1/ Find all identical subtrees in the first document
• 2/ Select acceptable matches
• 3/ If there is at least one match, choose the best candidate
• 4/ Propagate [very carefully] the matching to parents and ancestors
Remove S and all its subtrees from L.

Phase 3: Optimization
• Some nodes are unmatched after the previous phases
• Use previous results to propagate matching [now a bit less carefully]
• First, bottom-up: propagate matching to ancestors
• Then a quick top-down pass: propagate matching to descendant nodes based on element names if there is no ambiguity

Phase 4: Construct the delta
• Find inserted/deleted nodes
• Find "easy" move operations: the parent node changed
• Find "complex" moves = reordering of children
• Largest common subsequence (weighted)
• Example: A, B, C, D, E, F becomes E, D, A, B, C, F. The largest common subsequence is A, B, C, F; nodes D and E are 'moved'
• Complexity is quadratic
• We approximate the solution in linear time

Select Acceptable Match
Key aspect: the weight of trees.
• Use locality (e.g. find matching ancestors) to avoid wrong matches. Two small trees are matched if some ancestors are matching. For large trees, further look-up is accepted.
• Propagation should try not to induce wrong matches. The intuition is that large matching subtrees are more relevant. The larger the tree, the more we propagate the matching to ancestors.
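The "complex move" step of Phase 4 can be illustrated with an exact largest-common-subsequence computation over the children of a node. This sketch is a simplification under stated assumptions: it is unweighted and quadratic, whereas the slides describe a weighted version approximated in linear time; child labels are assumed unique; and the function names are hypothetical.

```python
def lcs(a, b):
    """Return one largest common subsequence of the sequences a and b."""
    n, m = len(a), len(b)
    # table[i][j] is the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table to recover the subsequence itself.
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def moved_children(old_children, new_children):
    """Children that are not in the common subsequence count as moved."""
    kept = set(lcs(old_children, new_children))
    return [child for child in new_children if child not in kept]


old = ["A", "B", "C", "D", "E", "F"]
new = ["E", "D", "A", "B", "C", "F"]
print(lcs(old, new))             # ['A', 'B', 'C', 'F']
print(moved_children(old, new))  # ['E', 'D']
```

Keeping the largest common subsequence in place minimises the number of children reported as moved, which is why D and E, and not A, B, C and F, are the moved nodes in the example.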
Algorithm: Tuning
• Definition of weight: affects both speed and quality
• Look-up and propagation distances: the choice affects speed and quality
• Trade-off: quality vs. speed
• We exhibit in the paper some bounds that guarantee linear time complexity

Complexity: n*log(n)
• Phase 1 (identification) is one traversal of the tree
• Phase 2 (propagation) is n times 'get best candidate' in the worst case. The look-up level is designed so that 'get best candidate' costs O(log(n)); it uses some pre-computed indexes
• Phase 3 (optimization) is designed to be linear
• Phase 4 (delta construction) is linear: the longest common subsequence of children is approximated

Experiments
• Simulator of changes on XML documents
• Speed and quality evaluation on synthetic data
• Comparison with Unix Diff on web data

Simulator of changes
• Input: XML document D
• Parameters control the number of delete/insert/move operations
• Outputs: a delta of changes and the XML document D' = delta(D)

Synthetic Data: Speed of the algorithm
Experimental verification that the algorithm is quasi-linear.
[Plot. Labels recovered: "Typical Pattern", "1Mb".]

Synthetic Data: Quality of the algorithm
• Start from document D
• Generate changes over D: D' = delta(D)
• Give (D, D') to XyDiff and compute a delta
[Plot: size ratio of the diff over the original delta, on a scale from 0 to 4. Label recovered: "Typical Pattern".]
• The size of the computed delta is comparable to the 'original' delta size
• For large deltas, XyDiff finds more efficient operations

Comparison of the size of results: XyDiff vs. UnixDiff
• Experiments on 10,000 XML web documents that changed; at the time of that experiment, we had to crawl 10 million web pages to find them
• 80% of the documents are below the size of (UnixDiff*1.2)
• Almost all are below UnixDiff*2
• Of course, the delta of XyDiff contains much more information

Conclusion
• A novel algorithm for XML diff in quasi-linear time
• XML specificities are used to improve quality
• Available as open source freeware at: http://www-rocq.inria.fr/~cobena/XyDiffWeb/

Perspective
• Larger-scale experiments on web data
• Learn about changes: frequency, patterns, …; obtain statistics for DTD and XMLSchema; use the statistics to learn about changes and improve XyDiff for typed XML data
• Use XML diff to observe changes between websites

XyDiff in Xyleme Architecture
[Diagram: Web Crawler, then XML Loader, which delivers version V(n) of the XML document to XyDiff; XyDiff produces Delta(V(n-1),V(n)) for the Alerter and for Storage.]

Questions?

(Part II: XML Diff) A Comparative Study of Change Detection in XML
Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)

Context
Consider change-control in XML data warehouses.
We want to understand changes. We have only the old and new versions of documents. A diff needs to be computed.

Organization
• Motivations
• Data Model
• Representing Changes: version management and querying; comparison of change representation models; experiments
• Detecting Changes: state of the art in change detection; performance analysis and experiments; quality analysis and experiments
• Summary

Motivations: Representing Changes
• Version management, which means that the representation should allow for effective storage strategies
• Temporal databases: support for persistent identification of nodes is mandatory
• Monitoring: information about changes is used to support triggers or detect events
Note: HTML or XHTML documents may be used.

Motivations: Detecting Changes
• Correctness: the diff programs miss no changes
• Minimality of the result is important to save storage space and network bandwidth
• Semantics: some algorithms consider more semantics in XML documents
• Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
• 'Move operations': some algorithms support move operations whereas others don't. This impacts both the performance of the tool and the quality of results.

II.2.1 XML Diff: Comparing Data Models

Data Model (quick overview)
Operations are:
• (i) insert, delete applied to leaves or subtrees
• (ii) update of text nodes
• (iii) move applied to a subtree root, moving the entire subtree
An edit cost is assigned to each operation. Usually, the cost is 1 per node touched.
The semantics of move is to identify subtrees even when their context has changed.
We use the notion of mapping between the two trees.
Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).

Data Model: Intuition
[Figure: a root with children a, b and c, where b has children x and y, shown twice. Tai's model: delete 'b'. Selkow's model: delete 'b'.]
In Tai's model, deleting 'b' attaches its children x and y to the root; in Selkow's model, deleting 'b' removes the whole subtree.

II.2.2 XML Diff: Representing Changes

Representing Changes
Version management:
• There are several version management strategies. For instance, when only deltas are stored, their size must be reduced
• We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases
• A simple text-based version management is possible but cannot be used for querying
Querying changes:
• Labeling nodes with prefix + postfix identifiers improves querying algorithms
• Labeling nodes with persistent identifiers improves temporal databases
• There is no short labeling scheme that is good for both

Our Example
First version:

  <catalog>
    <product>
      <name>Notebook</name>
      <description> 2200MHz Pentium4 </description>
      <price>$1999</price>
    </product>
    <product>
      <name>Digital Camera</name>
      <description> Fuji FinePix 2600Z </description>
      <status> Not Available </status>
    </product>
  </catalog>

Second version:

  <catalog>
    <product>
      <name>Notebook</name>
      <description> 2200MHz Pentium4 </description>
      <price>$1999</price>
    </product>
    <product>
      <name>Digital Camera</name>
      <description> Fuji FinePix 2600Z </description>
      <price> $299 </price>
    </product>
  </catalog>

Different representations

Change Models: XUpdate
Nodes are addressed with XPath expressions.

  <xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
    <xupdate:insert-after select="/catalog[1]/product[2]/description[1]" >
      <xupdate:element name="price"> $299 </xupdate:element>
    </xupdate:insert-after>
    <xupdate:remove select="/catalog[1]/product[2]/status[1]" />
  </xupdate:modifications>

Change Models: DeltaXML (Example)
Same look and feel as the document; mentions some unchanged nodes; the order is important (no ids, no move).

  <catalog delta='modified'>
    <product deltaxml:delta='unchanged' />
    <product deltaxml:delta='modified'>
      <status deltaxml:delta='deleted'> Not Available </status>
      <name deltaxml:delta='unchanged'/>
      <description deltaxml:delta='unchanged'/>
      <price deltaxml:delta='inserted'> $399 </price>
    </product>
  </catalog>

Change Models: XyDelta (Example)
Persistent identifiers. What is the parent node?

  <xydelta v1_XidMap="(1-30)" v2_XidMap="(1-14;18-23;31-33;24-30)">
    <delete xid=(15-17) parent=6 position=1>
      <status>Not Available</status>
    </delete>
    <insert xid=(31-33) parent=6 position=4>
      <price>$399</price>
    </insert>
  </xydelta>

Change Models: Microsoft XDL (Example)
The source document hash is used to verify consistency; match attributes identify nodes; the change element updates an element node.

  <xd:xmldiff srcDocHash="fd452bab54320191" xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
    <xd:node match="1">
      <xd:node match="2">
        <xd:change match="3" name="price">
          <xd:change match="1"> $299 </xd:change>
        </xd:change>
      </xd:node>
    </xd:node>
  </xd:xmldiff>

Storage Experiments
Comparing Delta Size
[Plot: delta file size against edit cost for XyDelta and DeltaXML, on logarithmic axes.]
• Identifiers save space when there are few updates

Change Models: Conclusion
• Change monitoring is easier with DeltaXML and XUpdate
• Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
Future work:
• It is not yet clear how to query changes
• Define transaction or synchronization protocols

Summary
Nice features that some are missing:
• Validation by a DTD (may be a problem for DeltaXML, XyDelta)
• Verify the source document (only XDL)
• Support of 'move' operations (only XyDelta and XDL)
• Backward deltas (only XyDelta)
• Monitoring the delta (only XUpdate and DeltaXML)
Still missing for all of them:
• A framework for querying
Unique advantages of XyDelta:
• A formal model and nice mathematical properties
• Persistent identification of nodes (at least as an option)

II.2.3 Detecting Changes: State of the art

Algorithms: Overview
Tree-to-tree correction algorithms:
• Based on the String Edit Problem (1966)
• Find the minimum edit script in O(m*n) time and space, where m and n are the sizes of the two documents
Other algorithms:
• Run in linear time or close
• Match nodes or subtrees depending on their content

Experiments: Speed of several algorithms
[Plot not recoverable.]

Experiments: Quality (measured by the Edit Cost)
From:

  <root>
    <a> <x/><y/><z/> </a>
    <a> <x/><y/><z/> <u/><v/> </a>
  </root>

To:

  <root>
    <a> <u/><v/> <x/><y/><z/> </a>
    <a> <x/><y/><z/> </a>
  </root>

• The cheapest choice would be to move <u> and <v> (cost=2). But finding the best script with 'move' operations is NP-hard.
• The minimum edit script consists in deleting <u> and <v> and then inserting them (cost=4) (MMDiff).
• Preprocessing often consists in mapping identical subtrees.
In these case, an additional ‘move’ operations will be needed (cost=5) DEA I3 - Données semi-structurées - Grégory Cobéna 124 Experiments: Speed (focus on DeltaXML) 125 20/12/2002 DEA I3 - Données semi-structurées - Grégory Cobéna 126 Comparison summary Other issues Many other algorithms that have no advantages MMDiff is the reference for quality DeltaXML and XyDiff are good compromises quality/performance; but performances of XyDiff more regular Performance measure for Microsoft available soon – seems comparable in performance to DeltaXML 20/12/2002 DEA I3 - Données semi-structurées - Grégory Cobéna Constrained Diff is often interesting: • Using ‘keys’ to match specific nodes (e.g. DeltaXML) • Using XMLSchema or DTD information • Time-constrained diff (e.g. XyDiff) Postprocessing of results? 127 What’s next? 20/12/2002 DEA I3 - Données semi-structurées - Grégory Cobéna 128 Questions ? Representing Changes: • • • Unify and improve existing features Support Queries! Chain versions? Change Detection: • • • 20/12/2002 We are currently working on Microsoft’s XML Diff Use XMLSchema (or DTD) information Mining changes? Use learning ? DEA I3 - Données semi-structurées - Grégory Cobéna merci 129 20/12/2002 DEA I3 - Données semi-structurées - Grégory Cobéna 130
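Supplementary sketch for "Detecting Changes: State of the art". The String Edit Problem cited there is usually solved with a dynamic program over an (m+1) x (n+1) table, which is where an O(m*n) time and space bound comes from. The code below is a minimal illustration under assumptions that are not in the slides: unit cost for insert, delete and replace, and a hypothetical function name `edit_distance`. It works on flat sequences only; it is not the tree-to-tree algorithm used by MMDiff, XyDiff or any other tool named in the course.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Unit-cost edit distance between two sequences (insert, delete, replace).

    Illustrative only: costs and name are assumptions, not taken from the course.
    """
    m, n = len(a), len(b)
    # dist[i][j] holds the cheapest cost of turning a[:i] into b[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete the first i items of a
    for j in range(1, n + 1):
        dist[0][j] = j  # insert the first j items of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                       # delete a[i-1]
                dist[i][j - 1] + 1,                       # insert b[j-1]
                dist[i - 1][j - 1] + (0 if same else 1),  # keep or replace
            )
    return dist[m][n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance(list("xyzuv"), list("uvxyz")))
```

Applied to the flat label sequences x y z u v and u v x y z, the function returns 4: without a 'move' operation, relocating two items costs two deletions plus two insertions. This is only a flattened analogue of the tree example in "Algorithms: Overview", where the nodes change parent.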