Change Control in XML
Course: Semi-structured Data
DEA I3: Information, Interaction, Intelligence
Grégory Cobena
http://www-rocq.inria.fr/verso/
[email protected]
20/12/2002

Objectives
Understand the management of dynamic data
• At large scale: the case of a Web data warehouse (Xyleme)
• At the scale of the XML document: the case of version management

Stakes
Semi-structured data must provide a more precise description than plain text, with well-defined semantics.
Managing changes in semi-structured data is even more complex than in relational databases.

Motivations: at the scale of the Web
• Discover data sources and XML documents, on the Web or on an intranet
• Set up tracking of these documents over time
• Extract knowledge about what changes: the documents, their properties, their content

Motivations: at the scale of the document
In which cases does the notion of change arise?
• When managing several documents, one studies inter-document changes
  Example: an XML file describing two car models, a Peugeot 307 and a 206
• When one is interested in the evolution over time of a given document
  Example: an XML file describing an address book

Course outline
Change control is first of all:
• Xyleme
  • A large-scale XML data warehouse
  • Integration of Web data
  • Active monitoring of Web data
• XML Diff
  • Representation of changes
  • Detection of changes

Part One: Xyleme
A Dynamic Warehouse for the XML Data of the Web

Organization
1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
4. XML Repository, Semantic Data Integration and Query Processing
5. Query Subscription
Conclusion

(Part I: Xyleme)
1. The Web and XML

The Web today
Terabytes of data
A lot of public pages
• 1 billion in [06/2000]
• several million servers
Private web: pages that are not publicly available
Deep web: data hidden behind forms

Information System
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
Data + Structure (easy)

HTML = Hypertext Language
The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $.
Text + presentation (hard)
Where is the data?

XML = Semistructured Data
Semistructured: more flexible
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
    ...
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
    ...
  </product>
</product-table>

XML: Tree Types
product-table
  product
    reference
    designation
    price
    description
Semantics and structure are in paths
• product-table/product/reference
• product-table/product/price

(Part I: Xyleme)
2. A Dynamic Warehouse for the XML Data of the Web
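
The idea from the Tree Types slide, that semantics and structure are carried by paths, can be illustrated with a small sketch. It is only an illustration under assumptions: the function name `element_paths` and the sample document are made up here, and attributes are treated as extra path steps so that `product-table/product/reference` appears as on the slide.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the sorted set of root-to-node label paths found in a document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        # Assumption: attributes are listed as one more step of the path.
        for attribute in node.attrib:
            paths.add(f"{path}/{attribute}")
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(paths)


SAMPLE = """<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
</product-table>"""

for path in element_paths(SAMPLE):
    print(path)
```

Two documents with different tag names but the same kind of content would yield different path sets, which is the heterogeneity that the semantic integration part of the course addresses.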

Xyleme Research
Project Xyleme at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a Knowledge Database
INRIA
• Sophie Cluet: Databases (OQL…)
• Serge Abiteboul: semi-structured data + web
• Guy Ferran: ex O2 Technology
Mannheim University
• Guido Moerkotte
Université d’Orsay
• Marie Christine Rousset
CNAM
• Dan Vodislav

Xyleme Company
Started September 2000 (25 employees end of 2001)
Market Challenges:
• Few XML documents available on the Web (because of weak software support)
• Company is focusing on private XML: Press, Editors, Financial Data, Biology…
Technology:
• Scalability for large amounts of data
• Internet (+focus) / Intranet support
• Monitoring and Version Management
• Heterogeneous Data Integration

Architecture
Cluster of PCs
Developed with Linux and C++
Communications
• local: Corba
• external: HTTP
Distribution between autonomous machines
Now Web Services

Functional Architecture
[Figure: User Interface and Web Interface, connected through the Internet to the Xyleme Interface; modules: Acquisition & Crawler, Loader, Change Control, Semantic Module, Query Processor, Repository and Index Manager]

(Part I: Xyleme)
3. Data Acquisition and Maintenance, Page Importance

Architecture
[Figure: machines connected to the Internet and to each other over Ethernet; modules shown: Change Control and Semantic Integration, Acquisition and Maintenance, Index, Loader | Query, Repository]

Goals
Discover XML pages on the web that are of interest for customers
• For this, crawl the web (HTML+XML)
Keep them up to date
Do this under bounded resources:
• Memory for known URLs
• Bandwidth

Life Cycle of a page in Xyleme
The URL of D is discovered as a link in another page (or published by a customer)
The page scheduler decides to read D
• The metadata of D is read
• type, last_date_update...
• The document D is loaded
The document D is (re)read regularly

Main Issues
Loading of pages
• we can load up to 5 million pages/day on a standard PC
• main cost is Internet connection
Metadata management (access to disk)
Page scheduling
• decide which page to read or refresh next

Page Importance
(M. Preda, S. Abiteboul, G. Cobena)
Definition: Important pages are linked to by important pages
Offline algorithm (used by Google)
Our Online algorithm
• does not require maintaining graph information
• faster convergence with focused crawling
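
The circular definition "important pages are linked to by important pages" is a fixed point, and the offline approach computes it by iterating over the whole link graph. The sketch below shows that offline iteration only; it is not the online algorithm mentioned on the slide. The damping factor, the iteration count and the toy graph `LINKS` are assumptions chosen for illustration.

```python
def page_importance(links, iterations=50, damping=0.85):
    """Offline fixed-point iteration over a link graph {page: [linked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base score (assumed damping scheme).
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page without outgoing links spreads its score uniformly.
                for target in pages:
                    new_score[target] += damping * score[page] / n
        score = new_score
    return score


LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = page_importance(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

This version needs the full graph in memory at every iteration, which is exactly the cost that an online algorithm avoids.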
(Part I: Xyleme)
4. XML Repository:
Semantic Data Integration
and Query Processing
Querying Language
Today: A mix of OQL and XQL
We are currently moving to X-Query
(which is also a mix of OQL and XQL…)
Select boss/Name, boss/Phone
From comp in BusinessDomain,
boss in comp//Manager
Where comp/Product contains “Xyleme”

Web Heterogeneity
Semantic domains, e.g., cinema
Many possible types for data in this domain, many DTDs
Semantic Integration
• one abstract DTD for the domain
• gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD

Indexing
Standard inverted index
• word → documents that contain this word
Xyleme index
• word → elements that contain this word (document + element identifier)
Goal: more work can be performed without accessing data
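
The difference between the two indexes on the Indexing slide can be sketched as follows. This is a toy version under assumptions: the element identifier is simply the element's position in document order, only the text directly inside an element is indexed, and the names `build_element_index` and `DOCS` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(documents):
    """Map each word to the set of (document id, element id, tag) containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # Assumption: element id = rank of the element in document order.
        for element_id, element in enumerate(root.iter()):
            for word in re.findall(r"\w+", (element.text or "").lower()):
                index[word].add((doc_id, element_id, element.tag))
    return index


DOCS = {
    "d1": "<product><designation>camera</designation>"
          "<description>camera with flash</description></product>",
}
index = build_element_index(DOCS)
print(sorted(index["camera"]))
print(sorted(index["flash"]))
```

With such postings, a condition like "description contains flash" can be checked from the index alone, which matches the stated goal of doing more work without accessing the data.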

I.4.1 Xyleme: Semantic Data Integration

Data Integration
One application domain -- Several schemas
• heterogeneous vocabulary and structure
Xyleme Semantic Integration
• gives the illusion that the system maintains a homogeneous database for each domain
• abstracts a set of DTDs into an abstract DTD = a hierarchy of pertinent terms for a particular domain

Technology in short
Cluster DTDs into application domains
• Business, culture, tourism, biology, …
For an application domain, semi-automatically
• Organize tags into a hierarchy of concepts using thesauri such as Wordnet and other linguistic tools
• This provides the abstract DTD for the particular domain
• Generate mappings between concrete DTDs and the abstract one

I.4.2 Xyleme: Query Processing

Xyleme Query Language
A mix of OQL and XQL; will use the W3C standard when there is one
Select product/name, product/price
From doc in catalogue,
     product in doc/product
Where product//components contains “flash”
      and product/description contains “camera”

Principle of Querying
query on abstract DTD ⇒ Union of concrete queries (possibly with joins)
MAPPINGS between concrete and abstract DTDs
catalogue/product/price
⇒ d1//camera/price
⇒ d2/product/cost
catalogue/product/description
⇒ d1//camera/description
⇒ d2/product/info, ref
⇒ d2/description
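
The translation from an abstract path to a union of concrete paths can be sketched with a lookup table. The table below copies the mappings shown on the Principle of Querying slide, except that the entry written "d2/product/info, ref" is simplified to `d2/product/info`, which is an assumption; the function name `translate` is hypothetical.

```python
MAPPINGS = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": [
        "d1//camera/description",
        "d2/product/info",  # simplified from the slide's "d2/product/info, ref"
        "d2/description",
    ],
}


def translate(abstract_paths):
    """Group the concrete paths to evaluate by concrete source (d1, d2, ...)."""
    plan = {}
    for abstract_path in abstract_paths:
        for concrete_path in MAPPINGS.get(abstract_path, []):
            source = concrete_path.split("/", 1)[0]
            plan.setdefault(source, []).append(concrete_path)
    return plan


plan = translate(["catalogue/product/price", "catalogue/product/description"])
for source, paths in plan.items():
    print(source, paths)
```

Grouping by source mirrors the idea that one abstract query becomes one concrete query per concrete DTD, whose results are then unioned.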

Query Processing
1. Partial translation, from abstract to concrete, to identify “machines” with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, Relaxation

Query processing
Essential use of a smart index combining full-text and structure

I.4.2 Xyleme: Repository

Storage System: Xyleme Store
Efficient storage of trees in variable length records within fixed length pages
Balancing of tree branches in case of overflow
• minimize the number of I/O for direct access and scanning
• good compromise: compaction / access time

Tree Balancing in Xyleme Store
[Figure: a tree stored across Record 1 to Record 4; overflow cases: more children in another page, sub-tree in another page]

Questions ?

(Part I: Xyleme)
5. Change Control

The Web changes all the time
Data acquisition + maintenance
• keep the warehouse up-to-date
Version management
• representation and storage of change (see part II)
Change monitoring
• query subscription

Subscription Language
SQL-like language based on ‘atomic events’.
Combines the use of monitoring queries and continuous queries.
The language can be extended by adding new types of atomic events.
Uses the XML Query Language for continuous queries. “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Example
subscription myPaintings
% what are the new painting entries in Musee d’Orsay site
monitoring newPainting
select URL
where URL extends www.musee-orsay.fr/*
and <painter> contains “Monet”
% manage the changes in the expositions
continuous delta Exposition
select ... from ... where
when monthly
notify daily
% send me a daily report

Step 1: Atomic Event Detection
[Figure: a document d flows through the Alerters (metadata manager, XML loader) at 5 million pages/day; the document and its alerts (d/46, then d/46,67) are passed to complex event detection]
atomic event 46: URL matches pattern www.musee-orsay.fr/*
atomic event 67: XML document contains the tag <painter> with the value “Monet”

Alerters
Each Alerter can be viewed as a plug-in that acts on a document flow.
All sorts of atomic events can be detected: URL pattern detection, Keywords, XPath expressions, Page rank…
Can be distributed.
Some advanced alerts are:
• Long string look-ups
• Finding XML Patterns (e.g. XPath)
• Comparing digital signatures of text documents (copy tracker)

URL Patterns Detection (1)
Supported patterns
• URL | prefix* | *suffix
Using Hash Table: try all possible patterns
• Example: http://www.inria.fr/verso/index.html
  Test:
  http://www.inria.fr/verso/*
  http://www.inria.fr/*
Each test is in O(1), total test time is O(n), where n is the length of URLs

URL Patterns Detection (2)
Using a tree: navigate on the tree until a leaf is encountered
Example: Tree is,
<root str="www.inria.fr/">
  <node str="/verso" alert="1">
    <node str="/index.html" alert="2"/>
    <node str="/main.html" alert="3"/>
  </node>
</root>
Patricia Trees ?
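
The hash-table variant of URL pattern detection can be sketched as follows: every prefix of the URL that ends at a '/' is turned into a candidate `prefix*` pattern and looked up in a dictionary. This is a minimal sketch under assumptions: the scheme (`http://`) is stripped before matching, suffix patterns are left out, and the pattern table `PATTERNS` with its alert numbers is made up for the example.

```python
def candidate_prefixes(url):
    """Yield every 'prefix/*' pattern that could match the URL."""
    bare = url.split("://", 1)[-1]  # assumption: patterns are stored without scheme
    parts = bare.split("/")
    for i in range(1, len(parts)):
        yield "/".join(parts[:i]) + "/*"


def matching_alerts(url, patterns):
    """Return the alert codes of all registered patterns that match the URL."""
    bare = url.split("://", 1)[-1]
    hits = []
    if bare in patterns:  # exact URL pattern
        hits.append(patterns[bare])
    for pattern in candidate_prefixes(url):
        if pattern in patterns:  # one O(1) look-up per candidate
            hits.append(patterns[pattern])
    return hits


PATTERNS = {
    "www.inria.fr/verso/*": 1,
    "www.inria.fr/*": 2,
    "www.musee-orsay.fr/*": 46,
}
print(matching_alerts("http://www.inria.fr/verso/index.html", PATTERNS))
```

The number of look-ups grows with the number of '/' separators in the URL, not with the number of registered patterns, which is why the approach scales to many subscriptions.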

Keywords Sequence Algorithm
Problem:
• Detect « Air France »
Solution:
• a Tree of backward keyword sequences
• a context memory with O(1) update cost
Tree is implemented over a hash table

Simple XPath filtering Algorithm
Problem:
• detect <a> CONTAINS « word »
Solution:
• Reverse path expression
• Use postfix order
• Use a stack for ‘//’ and another stack for ‘/’

Simple XPath filter example: Understanding the tree structure in postfix order
Consider tree:
<A><B><C>toto</C><C/></B></A>
Nodes come as:
toto (id=1, level=4)
C (id=2, level=3)
C (id=3, level=3)
B (id=4, level=2)
A (id=5, level=1)

Simple XPath: Example
<a> CONTAINS toto is detected by:
• « toto »::ancestor <a>
When « toto » is detected, it is stored
For each ancestor of « toto », the name is compared to <a>.
All tests are executed using a hash table
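
The filtering of `<a> CONTAINS « word »` over a postfix stream can be sketched as below. It is a simplification of what the slides describe: instead of two stacks, it keeps a set of pending levels, relying on the fact that in postfix order the first node seen at level L-1 after a node at level L is its parent. The helper names `postfix_nodes` and `contains` are hypothetical, and text that follows a closing tag (tail text) is ignored.

```python
import xml.etree.ElementTree as ET


def postfix_nodes(element, level=1):
    """Yield (kind, label, level) with children before their parent."""
    for word in (element.text or "").split():
        yield ("text", word, level + 1)
    for child in element:
        yield from postfix_nodes(child, level + 1)
    yield ("element", element.tag, level)


def contains(nodes, tag, word):
    """Detect '<tag> CONTAINS word' in one pass over a postfix node stream."""
    pending = set()  # levels of the nodes whose next ancestor is still awaited
    for kind, label, level in nodes:
        if kind == "text":
            if label == word:
                pending.add(level)  # the word is stored when detected
        elif level + 1 in pending:
            # This element is the parent of a stored node: it is an ancestor.
            pending.discard(level + 1)
            pending.add(level)
            if label == tag:
                return True
    return False


doc = ET.fromstring("<A><B><C>toto</C><C/></B></A>")
print(list(postfix_nodes(doc)))
print(contains(postfix_nodes(doc), "A", "toto"))
```

On the example tree, the stream is toto (level 4), C, C (level 3), B (level 2), A (level 1), and the second C is correctly skipped because no stored node is waiting for a parent at its level.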

Stemming
On the Alerter
• Example: Éléphant –> ELEPHANT
• Do it for 500 documents / second
• Noise may be introduced (Example: tâche = tache)
On the Subscription Manager
• To avoid duplicate registration of similar events
• To show the user how his query is stemmed
Real stemmers: chevaux -> cheval

Step 2: Complex Event Detection
[Figure: the HTML parser and the XML loader send millions of alerts per day to complex event detection, which holds millions of subscriptions]
complex event 12: 67 & 46
(XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
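
The light normalization shown on the Stemming slide (Éléphant becomes ELEPHANT) can be sketched by stripping accents and upper-casing. This is an assumed implementation of that normalization, not a real stemmer; it would not map "chevaux" to "cheval".

```python
import unicodedata


def normalize(word):
    """Strip accents and upper-case a word."""
    decomposed = unicodedata.normalize("NFD", word)
    # Combining marks (category Mn) carry the accents after decomposition.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.upper()


print(normalize("Éléphant"))
print(normalize("tâche") == normalize("tache"))
```

The second line illustrates the noise mentioned on the slide: two different French words collapse to the same key.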

Complex Events Algorithm
The formal problem is NP-hard
We proposed several possible algorithms
Experimental (simulation) values proved the effectiveness of our solutions
The Hash-Tree based algorithm is well suited for our application:
• 10 million Complex Events
• 1 million Atomic Events
• 100 Atomic events detected per document
0.8 ms to process a document.
~2 million documents per day (on each PC).

Step 3: Notification Processor
[Figure: alerts go to complex event detection, which sends notifications (monitoring) to the Reporter; a clock fires triggers for the continuous queries, whose results also reach the Reporter; millions of notifications/day]
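
The task of complex event detection, finding every registered conjunction whose atomic events were all detected on a document, can be sketched with a simple counting scheme. This is not the hash-tree algorithm mentioned on the slide, only a baseline that makes the problem concrete. Event 12 and atomic events 46 and 67 come from the slides; event 13 and atomic event 99 are invented for the example.

```python
from collections import defaultdict


def build_index(complex_events):
    """Map each atomic event to the complex events that require it."""
    index = defaultdict(list)
    for complex_id, atoms in complex_events.items():
        for atom in atoms:
            index[atom].append(complex_id)
    return index


def detect(detected_atoms, complex_events, index):
    """Return the complex events whose atomic events were all detected."""
    counts = defaultdict(int)
    for atom in set(detected_atoms):
        for complex_id in index.get(atom, ()):
            counts[complex_id] += 1
    return sorted(cid for cid, n in counts.items() if n == len(complex_events[cid]))


COMPLEX_EVENTS = {12: {46, 67}, 13: {46, 99}}
INDEX = build_index(COMPLEX_EVENTS)
print(detect([46, 67], COMPLEX_EVENTS, INDEX))
```

The cost per document depends on how many complex events share the detected atomic events, which is the quantity a more elaborate structure tries to keep small.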

Architecture
[Figure: a Web Browser talks to the Xyleme Subscription Manager, which stores subscriptions through SQL; documents flow through the Xyleme Alerter to Complex Event Detection; the Trigger Engine and the Xyleme Query Processor feed the Xyleme Reporter]

Monitoring Applications

Copy tracking
Example: a press agency wants to check that people are not illegally publishing copies of its wires
Need to react fast to changes: an illegal copy of a wire may last only a couple of days
1. Query to search engine, or specific crawl + pre-filter: flow of candidate documents
2. Slice the document, filter
3. Detection

Web portal management
Standard portal management
• Unreachable pages
• Dangling pointers
• Incorrect pages (e.g., do not parse)
• Detection of interesting pages on the web
• Etc.
Portal archiving
Subscription and notification

Web surveillance
Applications
• Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
• Business intelligence, e.g., discovering potential customers, partners, competitors
Find the data (crawl the web)
Monitor the changes
• new pages, deleted pages, changes in a page
Classify information and extract data of interest
• Data mining, text understanding, knowledge representation and extraction, linguistics… Very AI

Conclusion & Perspectives
Focus crawling on important pages
• Refine notion of importance
• Improve important pages discovery
Improve Change control accuracy
• Semantic web
• Real-time advanced processing

Questions ?

Part Two: XML Diff

(Part II: XML Diff)
1. Detecting Changes in XML Documents
Grégory Cobéna, Serge Abiteboul, Amélie Marian
INRIA Rocquencourt, Columbia University

Objectives:
Versions
• Temporal Queries (persistent identification of nodes)
• Version some documents or some sites (store a ‘delta’)
• Change Monitoring (query changes)
We proposed a representation of changes
“Change-Centric Management of Versions” (VLDB 2001)
We developed a Diff algorithm for XML
“Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)

Introduction
Overview
Algorithms for detecting changes in XML documents
• An XML Diff algorithm
• A comparative study for XML change detection

Plan
Motivations
State of the art
Change model
Algorithm
• Tradeoff ‘quality’ versus speed
• Quasi linear time and space complexity
Experiments
• Synthetic and real world experiments

II.1.1 XML Diff
Motivations

Motivations
Monitoring XML data on the Web
B. Nguyen, S. Abiteboul, G. Cobéna, M. Preda, SIGMOD2001
Change-centric management of versions in an XML warehouse
A. Marian, S. Abiteboul, G. Cobéna, L. Mignet, VLDB2001
Learning about changes
In fact, all these problems are very similar
• Architecture and requirements ( ‘speed’ )
• Multiple optimality criteria ( ‘quality’ )

What is a diff ?
• Unix Diff: shows the different lines between two text files
• String Diff: shows which symbols have changed
• XML Diff: which parts of the tree have been modified, inserted or deleted

The String Edit Problem
Consider string: abcdefg
How to transform it into: bczdeyz ?
Possible solutions
• delete all 7 chars and insert 7 other chars
• Update <a> into <b>, <b> into <c>, <c> into <z>, <f> into <y>, <g> into <z>
• Mix both solutions
Question: What is the shortest edit sequence?

Solving the String-Edit-Problem
If we know how to transform S1 into S2, then we know how to transform:
• S1x into S2y
Conversely, to find the shortest path for transforming S1x into S2y, it is sufficient to compare the following transformations:
• S1 into S2, then x into y
• S1x into S2, then insert y
• delete x, and then S1 into S2y

String Edit Problem: The algorithm
Two strings S1 and S2
Cost(x,y) represents the shortest edit cost to transform S1[1..x] into S2[1..y]
The cost is the sum of individual costs for each edit operation (insert, delete, update)
Then, Cost(x,y) is the min of:
• Cost(x-1,y-1)+update_cost(S1[x],S2[y])
• Cost(x-1,y)+delete_cost(S1[x])
• Cost(x,y-1)+insert_cost(S2[y])

A Quadratic Solution
The solution is to represent all possible paths on a matrix: M[1..|S1|][1..|S2|]
• M[x,y] represents the cost of transforming S1[1..x] into S2[1..y]
• M[x,y] can be computed using M[x-1,y-1], M[x-1,y] and M[x,y-1]
• M[0,i] and M[i,0] are obvious
• Thus, M[|S1|,|S2|] can be computed
Note that the number of paths is exponential, but the cost remains quadratic.
Time and Space cost is O(|S1|*|S2|)
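
The recurrence and the matrix M translate directly into a short dynamic program. The sketch below assumes unit costs for insert, delete and update (and a zero cost when the two characters are equal); the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1, s2, insert_cost=1, delete_cost=1, update_cost=1):
    """Fill M[x][y] = cheapest cost to transform s1[:x] into s2[:y]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    for x in range(1, rows):
        m[x][0] = m[x - 1][0] + delete_cost  # delete every character of s1[:x]
    for y in range(1, cols):
        m[0][y] = m[0][y - 1] + insert_cost  # insert every character of s2[:y]
    for x in range(1, rows):
        for y in range(1, cols):
            same = s1[x - 1] == s2[y - 1]
            m[x][y] = min(
                m[x - 1][y - 1] + (0 if same else update_cost),
                m[x - 1][y] + delete_cost,
                m[x][y - 1] + insert_cost,
            )
    return m[len(s1)][len(s2)]


print(edit_distance("abcdefg", "bczdeyz"))
```

For the slide's example, one cheapest script deletes a, keeps b and c, inserts z, keeps d and e, and updates f and g, so it mixes the two naive solutions.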

Questions?

State of the art (1): the string edit problem
O(|x|*|y|) solution with Directed Acyclic Graph
[Figure: edit graph between a source string and a destination string; edges: delete C (cost=1), insert C (cost=1), do nothing (cost=0)]
Best result is an O(|s|^2 / log s) solution over a finite alphabet
Compute M[x,y] only close to the diagonal (E. Myers)
• Finds the solution in O(n*D), where n is the size of the largest string and D the distance between the two strings

Extending the String problem
Adapt M[x,y] to work on trees (S. Chawathe)
• Remove some edges to ensure that deleting a node will delete the subtree rooted at that node (and conversely for insert)
Kuo-Chung Tai, Lu, Selkow: based on string edit problem

State of the art (2): the tree pattern matching for XML
Unix Diff, Sun DiffML
LaDiff (MH-Diff), Chawathe, Rajaraman, Garcia-Molina, J. Widom
• matching criteria to compare nodes and subtrees
• in XML, many labels are identical
• quadratic in the ‘distance’ between both trees
IBM diff available at alphaworks

Data Model
[Figure: two versions of a Catalog tree; each product (Pr) has a name (N) and a price (P)]
Version 1: Camera 300, TV 100, VCR 200
Version 2: TV 100, DVD 500, VCR 150

Change Model
Issue: Persistent identification of nodes
Attach persistent identifiers:
• to every node = XID
• to the document = XID-map
Represent changes with a Delta
• Delta = Set of changes
• Nice mathematical properties
Change-centric Management of versions, VLDB2001

Version 1, XID-map: (1-16|17)
Catalog (16)
  Pr (5):  N (2) Camera (1), P (4) 300 (3)
  Pr (10): N (7) TV (6), P (9) 100 (8)
  Pr (15): N (12) VCR (11), P (14) 200 (13)

Version 2, New XID-map: (6-10,17-21,11-16|22)
Catalog (16)
  Pr (10): N (7) TV (6), P (9) 100 (8)
  Pr (21): N (18) DVD (17), P (20) 500 (19)
  Pr (15): N (12) VCR (11), P (14) 150 (13)

Diff (V1,V2)
delete(5)
update(13,150)
insert(16,2,(17-21))

Algorithm: Intuition

Objectives
Assign persistent identifiers by matching nodes
Compute a representation of changes between the two documents
Also
• Constraint-Awareness: Follow DTD specifications
• Correctness: No change is missed
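
The XIDs of Version 1 in the catalog example are consistent with numbering the nodes in postfix order (children before their parent), which is why the XID-map of a whole subtree is a contiguous range such as (1-5). The sketch below reproduces that numbering; the tuple-based tree encoding and the function name `assign_xids` are assumptions made for the example.

```python
def assign_xids(tree, start=1):
    """Number nodes in postfix order; return ([(xid, label)], next free xid)."""
    xids = []
    counter = start

    def visit(node):
        nonlocal counter
        if isinstance(node, tuple):  # element: (label, [children])
            label, children = node
            for child in children:
                visit(child)
        else:  # text node: a plain string
            label = node
        xids.append((counter, label))
        counter += 1

    visit(tree)
    return xids, counter


V1 = ("Catalog", [
    ("Pr", [("N", ["Camera"]), ("P", ["300"])]),
    ("Pr", [("N", ["TV"]), ("P", ["100"])]),
    ("Pr", [("N", ["VCR"]), ("P", ["200"])]),
])
xids, next_free = assign_xids(V1)
print(xids)
print("XID-map: (1-%d|%d)" % (xids[-1][0], next_free))
```

The last number of the map, 17, is the next free identifier, so the inserted DVD subtree of Version 2 receives XIDs 17 to 21.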

Representation of Changes: Example
<delete XID-map=(1-5) parent=16 pos=1>
  <Product><Name>Camera</Name><Price>$300</Price></Product>
</delete>
<insert XID-map=(17-21) parent=16 pos=2>
  <Product><Name>DVD</Name><Price>$400</Price></Product>
</insert>
<update XID=13>
  <old>$200</old><new>$150</new>
</update>

II.1.2 XML Diff
The XyDiff Algorithm

Phase 1: Identify Subtrees
Use ID-attributes from DTD to match nodes (or forbid matching)
Compute for every subtree
• Signature
• Weight
One traversal of the tree

Phase 2: Bottom Up + Lazy Down Propagation
Let L be the list of all subtrees in second document
For each subtree S in L (in decreasing weight order)
• 1/ find all identical subtrees in first document
• 2/ Select acceptable matches
• 3/ If at least one match, choose the best candidate
• 4/ Propagate [Very Carefully] matching to parents and ancestors
Remove S and all its subtrees from L

Phase 3: Optimization
Some nodes are unmatched after previous phases
Use previous results to propagate matching [now a bit less carefully]
First, bottom-up
• propagate matching to ancestors
Then quick top-down pass
• propagate matching to descendant nodes based on element names if no ambiguity

Phase 4: Construct the delta
Find inserted/deleted nodes
Find “easy” move operations: parent node changed
Find “complex” move = reordering children
• Largest common subsequence (weight)
  Ex: A, B, C, D, E, F becomes E, D, A, B, C, F
  Largest common subsequence is A, B, C, F; nodes D and E are ‘moved’
• Complexity is quadratic
• We approximate the solution in linear time

Algorithm: Tuning
Key aspect: the weight of trees
• Definition of weight affects both speed and quality
Look-up and Propagation distances
• Choice affects speed and quality
Trade-off Quality vs. Speed
We exhibit in the paper some bounds that guarantee linear time complexity

Select Acceptable Match
• Use locality (e.g. find matching ancestors) to avoid wrong matches
• Two small trees are matched if some ancestors are matching. For large trees, further look-up is accepted.
Propagation
• Propagation should try not to induce wrong matches. Intuition is that large matching subtrees are more relevant
• The larger the tree, the more we propagate the matching to ancestors.
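
Phase 1 and the first step of Phase 2 can be sketched as follows: one traversal computes a signature and a weight for every subtree, then the subtrees of the second document are visited by decreasing weight and looked up by signature in the first document. The hash function, the weight definition (one per element, plus one when it has text) and the names `annotate` and `identical_subtrees` are assumptions for illustration and are not taken from XyDiff.

```python
import hashlib
import xml.etree.ElementTree as ET


def annotate(element, info):
    """Fill info[element] = (signature, weight) for a subtree, bottom-up."""
    text = (element.text or "").strip()
    child_signatures = []
    weight = 1 + (1 if text else 0)  # assumed weight definition
    for child in element:
        signature, child_weight = annotate(child, info)
        child_signatures.append(signature)
        weight += child_weight
    payload = "\x00".join(
        [element.tag, repr(sorted(element.attrib.items())), text] + child_signatures
    )
    signature = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    info[element] = (signature, weight)
    return signature, weight


def identical_subtrees(old_root, new_root):
    """List (new element, identical old elements, weight), heaviest first."""
    old_info, new_info = {}, {}
    annotate(old_root, old_info)
    annotate(new_root, new_info)
    by_signature = {}
    for element, (signature, _) in old_info.items():
        by_signature.setdefault(signature, []).append(element)
    heaviest_first = sorted(new_info.items(), key=lambda item: -item[1][1])
    return [
        (element, by_signature[signature], weight)
        for element, (signature, weight) in heaviest_first
        if signature in by_signature
    ]


OLD = ET.fromstring(
    "<catalog><product><name>Notebook</name><price>$1999</price></product>"
    "<product><name>Digital Camera</name><status>Not Available</status></product></catalog>"
)
NEW = ET.fromstring(
    "<catalog><product><name>Notebook</name><price>$1999</price></product>"
    "<product><name>Digital Camera</name><price>$299</price></product></catalog>"
)
matches = identical_subtrees(OLD, NEW)
for element, candidates, weight in matches:
    print(element.tag, weight, len(candidates))
```

Visiting the heaviest subtrees first means the unchanged Notebook product is matched as a whole before its name and price are considered individually, which is the intuition behind the decreasing weight order.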

Complexity: n*log(n)
Phase 1 (identification) is one traversal of the tree
Phase 2 (propagation) is n times ‘get best candidate’ in the worst case
• Look-up level is designed to have ‘get best candidate’ cost in O(log(n))
• uses some pre-computed indexes
Phase 3 (optimization) is designed to be linear
Phase 4 (delta construction) is linear
• longest common subsequences of children is approximated

Experiments
Simulator of changes on XML documents
Speed and Quality evaluation on synthetic data
Comparison with Unix Diff on web data
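
The reordering step of Phase 4 relies on a longest common subsequence of the children: whatever is outside the subsequence is reported as moved. The sketch below uses the exact quadratic dynamic program, without weights, whereas the slides state that the real algorithm approximates a weighted version in linear time. It also assumes the children carry distinct identifiers.

```python
def longest_common_subsequence(a, b):
    """Exact LCS by dynamic programming (quadratic time and space)."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def moved_children(old_order, new_order):
    """Children outside the common subsequence are the ones to move."""
    kept = set(longest_common_subsequence(old_order, new_order))
    return [child for child in new_order if child not in kept]


OLD_ORDER = ["A", "B", "C", "D", "E", "F"]
NEW_ORDER = ["E", "D", "A", "B", "C", "F"]
print(longest_common_subsequence(OLD_ORDER, NEW_ORDER))
print(moved_children(OLD_ORDER, NEW_ORDER))
```

On the slide's example the subsequence A, B, C, F stays in place and only D and E need a move operation.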

Simulator of changes
Input: XML document D
Parameters control the number of delete/insert/move
Outputs:
• delta of changes
• XML document D’=delta(D)

Synthetic Data: Speed of the algorithm
Experimental verification that the algorithm is quasi-linear
[Figure: computation time against document size, up to 1Mb, with the typical pattern highlighted]

Synthetic Data: Quality of the algorithm
Start from document D
– Generate changes over D: D’ = delta(D)
– Give (D, D’) to XyDiff and compute a delta
Size of the computed delta is comparable to ‘original’ delta size
For large deltas, XyDiff finds more efficient operations
[Figure: size ratio of the diff over the original delta, on a scale from 0 to 4, with the typical pattern highlighted]

Comparison of the size of results: XyDiff vs. UnixDiff
Experiments on 10,000 XML web documents that changed
• at the time of that experiment, we had to crawl 10 million web pages to find them ☺
80% of the documents below the size of (UnixDiff*1.2)
Almost all below UnixDiff*2
Of course, the delta of XyDiff contains much more information

Conclusion
A novel algorithm for XML diff in quasi linear time
XML specificities are used to improve quality
Available as Open Source freeware at:
http://www-rocq.inria.fr/~cobena/XyDiffWeb/

Perspective
Larger scale experiments on web data
Learn about changes:
• Frequency, patterns, …
• Obtain statistics for DTD and XMLSchema
• Use the statistics to learn about changes and improve XyDiff for typed XML data
Use XML diff to observe changes between websites

XyDiff in Xyleme Architecture
[Figure: Web Crawler, XML Loader, XyDiff, Alerter; XyDiff receives V(n) of the XML document and produces Delta(V(n-1),V(n)), which goes to Storage]

Questions ?

(Part II: XML Diff)
Comparative study of change detection in XML
Grégory Cobéna (INRIA),
Talel Abdessalem (ENST),
Yassine Hinnach (ENST)

Context
Consider change-control in XML data warehouses. We want to understand changes
We have only the old and new version of documents
A diff needs to be computed

Organization
Motivations
Data Model
Representing Changes
• Version Management and Querying
• Comparison of Change representation models
• Experiments
Detecting Changes
• State of the art in change detection
• Performance analysis and experiments
• Quality analysis and experiments
Summary

Motivations: Representing Changes
Version management, which means that the representation should allow for effective storage strategies
Temporal Databases, the support for persistent identification of nodes is mandatory
Monitoring: information about changes is used to support triggers or detect events
Note: HTML or XHTML documents may be used

Motivations: Detecting Changes
Correctness: the diff programs miss no changes
Minimality of the result is important to save storage space and network bandwidth
Semantics: some algorithms consider more semantics in XML documents
Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
‘Move operations’: some algorithms support move operations whereas others don’t. This impacts both the performance of the tool and the quality of results.

II.2.1 XML Diff
Comparing Data Models

Data Model (quick overview)
Operations are:
• (i) insert, delete applied to leaves or subtrees
• (ii) update of text nodes
• (iii) move applied to a subtree root, moving the entire subtree
An edit cost is assigned to each operation. Usually, the cost is 1 per node touched
The semantics of move is to identify subtrees even when their context has changed.
We use the notion of mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).

Data Model: Intuition
[Figure: a tree with root and children a, b, c, where b has children x and y, shown twice. Tai’s model: delete ‘b’. Selkow’s model: delete ‘b’.]

II.2.2 XML Diff

Version Management
• There are several version management strategies. For instance, when only deltas are stored, their size must be reduced
• We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
• A simple text-based version management is possible but cannot be used for querying.

Querying Changes
• Labeling nodes by prefix+postfix identifiers improves querying algorithms
• Labeling nodes with persistent identifiers improves temporal databases
• There is no short labeling scheme that is good for both

Our Example
Old version:
<catalog>
  <product>
    <name>Notebook</name>
    <description>2200MHz Pentium4</description>
    <price>$1999</price>
  </product>
  <product>
    <name>Digital Camera</name>
    <description>Fuji FinePix 2600Z</description>
    <status>Not Available</status>
  </product>
</catalog>

New version:
<catalog>
  <product>
    <name>Notebook</name>
    <description>2200MHz Pentium4</description>
    <price>$1999</price>
  </product>
  <product>
    <name>Digital Camera</name>
    <description>Fuji FinePix 2600Z</description>
    <price>$299</price>
  </product>
</catalog>

Different representations

Change Models: XUpdate
(the select attributes are XPath expressions)
<xupdate:modifications version="1.0"
    xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:insert-after
      select="/catalog[1]/product[2]/description[1]">
    <xupdate:element name="price">
      $299
    </xupdate:element>
  </xupdate:insert-after>
  <xupdate:remove
      select="/catalog[1]/product[2]/status[1]" />
</xupdate:modifications>
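
To make the effect of the two XUpdate operations concrete, the sketch below applies them by hand with Python's standard `xml.etree.ElementTree`. It does not interpret XUpdate documents: the two operations are hard-coded, the XPath expressions are rewritten relative to the root element, and the function name `apply_changes` is hypothetical.

```python
import xml.etree.ElementTree as ET

OLD_VERSION = """<catalog>
  <product><name>Notebook</name><description>2200MHz Pentium4</description><price>$1999</price></product>
  <product><name>Digital Camera</name><description>Fuji FinePix 2600Z</description><status>Not Available</status></product>
</catalog>"""


def apply_changes(root):
    """Hard-coded equivalent of the insert-after and remove operations."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # insert-after select="/catalog[1]/product[2]/description[1]"
    anchor = root.find("product[2]/description[1]")
    parent = parent_of[anchor]
    price = ET.Element("price")
    price.text = "$299"
    parent.insert(list(parent).index(anchor) + 1, price)

    # remove select="/catalog[1]/product[2]/status[1]"
    status = root.find("product[2]/status[1]")
    parent_of[status].remove(status)
    return root


catalog = apply_changes(ET.fromstring(OLD_VERSION))
print(ET.tostring(catalog, encoding="unicode"))
```

Both operations locate their target by position, which illustrates why such a delta depends on the order of operations and carries no persistent node identifiers.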

Change Models: DeltaXML (Example)
Same look’n’feel as the document; mentions some unchanged nodes; the order is important (no ids, no move)
<catalog delta='modified'>
  <product deltaxml:delta='unchanged' />
  <product deltaxml:delta='modified'>
    <status deltaxml:delta='deleted'>
      Not Available
    </status>
    <name deltaxml:delta='unchanged'/>
    <description deltaxml:delta='unchanged'/>
    <price deltaxml:delta='inserted'>
      $399
    </price>
  </product>
</catalog>

Change Models: XyDelta (Example)
Persistent identifiers
<xydelta
    v1_XidMap="(1-30)"
    v2_XidMap="(1-14;18-23;31-33;24-30)">
  <delete xid=(15-17) parent=6 position=1>
    <status>Not Available</status>
  </delete>
  <insert xid=(31-33) parent=6 position=4>
    <price>$399</price>
  </insert>
</xydelta>

Change Models: Microsoft XDL (Example)
Verify consistency (srcDocHash); identify nodes (match); updates an element node. What is the parent node?
<xd:xmldiff srcDocHash="fd452bab54320191"
    xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
  <xd:node match="1">
    <xd:node match="2">
      <xd:change match="3" name="price">
        <xd:change match="1">
          $299
        </xd:change>
      </xd:change>
    </xd:node>
  </xd:node>
</xd:xmldiff>

Storage Experiments
Comparing Delta Size
[Figure: log-log plot of delta file size against edit cost, for XyDelta and DeltaXML]
Identifiers save space when few updates

Summary
Unique advantages of XyDelta
• A formal model and nice mathematical properties
Nice features that some are missing
• Persistent identification of nodes (at least as an option)
• Verify the source document (only XDL)
• Support of ‘move’ operations (only XyDelta and XDL)
• Backward deltas (only XyDelta)
• Monitoring the delta (only XUpdate and DeltaXML)
Still missing for all of them
• A framework for querying
• Validation by a DTD (may be a problem for DeltaXML, XyDelta)

Change Models: Conclusion
Change monitoring is easier with DeltaXML and XUpdate
Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
Future work:
• It is not yet clear how to query changes
• Define transaction or synchronization protocols

II.2.3
Detecting Changes

State of the art
Based on the String Edit Problem (1966)
Tree-to-tree correction Algorithms:
• find the Minimum Edit Script
• in O(m*n) time and space, where m and n are the sizes of the two documents
Other algorithms
• Run in linear time or close
• Match nodes or subtrees depending on their content

Algorithms: Overview
From:
<root>
  <a>
    <x/><y/><z/>
  </a>
  <a>
    <x/><y/><z/><u/><v/>
  </a>
</root>
To:
<root>
  <a>
    <u/><v/><x/><y/><z/>
  </a>
  <a>
    <x/><y/><z/>
  </a>
</root>
The cheapest choice would be to move <u> and <v> (cost=2). But finding the best script with ‘move’ operations is NP-hard.
The minimum edit script consists in deleting <u> and <v> and then inserting them (cost=4). (MMDiff)
Preprocessing often consists in mapping identical subtrees. In this case, an additional ‘move’ operation will be needed (cost=5).

Experiments: Speed of several algorithms
[Figure]

Experiments: Quality (measured by the Edit Cost)
[Figure]

Experiments: Speed (focus on DeltaXML)
[Figure]

Comparison summary
MMDiff is the reference for quality
DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
Performance measures for Microsoft available soon; it seems comparable in performance to DeltaXML
Many other algorithms that have no advantages

Other issues
Constrained Diff is often interesting:
• Using ‘keys’ to match specific nodes (e.g. DeltaXML)
• Using XMLSchema or DTD information
• Time-constrained diff (e.g. XyDiff)
Postprocessing of results?

What’s next?
Representing Changes:
• Unify and improve existing features
• Support Queries!
• Chain versions?
Change Detection:
• We are currently working on Microsoft’s XML Diff
• Use XMLSchema (or DTD) information
• Mining changes? Use learning ?

Questions ?

Thank you
