Provenance for Web 2.0 Data
Meghyn Bienvenu, CNRS and University of Paris-Sud
Daniel Deutch, Ben Gurion University of the Negev
Fabian M. Suchanek, Max Planck Institute for Informatics
Vision Paper
1
The Web 1.0
5
Content Producer
Web Page
User
Elvis is a rock star
Elvis is a rock star
Scientific
articles
...
Elvis is a rock star
The Web 1.0
6
Content Producer
Web Page
User
This view is obsolete!
Today's Web is no longer
"flat"!
The Web 2.0
7
W
WordPress
Elvis is alive!
Elvis is dead!
Juan Carlos is the King!
Elvis is the King!
Who is this Elvis?
Challenge 1: Context
8
W
WordPress
Juan Carlos is the King!
Elvis is the King!
Different sources have different presuppositions
=> it is important to know the context of a statement
The Web 2.0
9
W
WordPress
Elvis for president!
Can you please tell me who is this Elvis?
Good that Elvis is dead!
Elvis is great!
Elvis is boring!
Challenge 2: Perspectives
10
W
WordPress
Elvis is great!
Elvis is boring!
In today's Web, statements are produced by different sources
=> it is important to know where a statement comes from
The Web 2.0
17
W
WordPress
Elvis is great!
says
Elvis is great!
I don't like that
that
Statements can be propagated
=> we should model citation, aggregation, and copying
Desiderata for a Model
21
We want a model for Web 2.0 data
• that accounts for different types of Web data:
  Social Networks
  Collaborative Resources
  Blogs / Homepages (e.g. WordPress)
Desiderata for a Model
25
We want a model for Web 2.0 data
• that accounts for different types of Web data
• that can deal with different transformations:
  Citation
  Aggregation
  Negation
Desiderata for a Model
27
We want a model for Web 2.0 data
• that accounts for different types of Web data
• that can deal with different transformations
• that supports different types of meta-data:
  Valuations, Context, Access Rights, Place, Authorship, Time
Desiderata for a Model
28
We want a model for Web 2.0 data
• that accounts for different types of Web data
• that can deal with different transformations
• that supports different types of meta-data
Basic principle:
Statement
+
Annotation
Desiderata for Reasoning
33
We want to perform at least the following
archetypical reasoning tasks:
• given an annotation, what statements have it?
What happened in 1984?
What does Alice think?
What happened in 1984 in Istanbul?
What are the statements on which Alice and Bob agree?
Desiderata for Reasoning
39
We want to perform at least the following
archetypical reasoning tasks:
• given an annotation, what statements have it?
• given a statement, what are its annotations?
Where does this statement come from?
  Alice said that she liked that so many people liked her statement about Elvis.
When and where did this statement happen?
Who has sufficient credentials to see this statement?
  Everybody who is in Unix group X and is friends or relatives with Alice
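The two archetypical reasoning tasks can be illustrated with a minimal Python sketch. It is only an illustration of the "Statement + Annotation" principle, not the model itself: the class name `AnnotatedStore`, its methods, and the sample data are all assumptions made for this example.

```python
class AnnotatedStore:
    """Hypothetical toy store: each assertion is a statement plus its annotations."""

    def __init__(self):
        self.assertions = []  # list of (statement, annotations dict)

    def add(self, statement, **annotations):
        self.assertions.append((statement, dict(annotations)))

    def statements_with(self, **wanted):
        """Task 1: given annotations, which statements carry all of them?"""
        return {s for s, a in self.assertions
                if all(a.get(k) == v for k, v in wanted.items())}

    def annotations_of(self, statement):
        """Task 2: given a statement, what are its annotations (one dict per assertion)?"""
        return [a for s, a in self.assertions if s == statement]

    def agreed_by(self, *authors):
        """Statements asserted by every one of the given authors."""
        sets = [self.statements_with(author=x) for x in authors]
        return set.intersection(*sets) if sets else set()


# Made-up sample data, loosely following the questions on the slides.
store = AnnotatedStore()
store.add("Elvis is great!", author="Alice", year=1984, place="Istanbul")
store.add("Elvis is great!", author="Bob")
store.add("Elvis is boring!", author="Bob", year=1984, place="Paris")

print(store.statements_with(year=1984))                    # What happened in 1984?
print(store.statements_with(year=1984, place="Istanbul"))  # ... in 1984 in Istanbul?
print(store.agreed_by("Alice", "Bob"))                     # On what do Alice and Bob agree?
print(store.annotations_of("Elvis is boring!"))            # When, where, and by whom?
```

Keeping one record per assertion, rather than one merged annotation set per statement, avoids mixing Alice's time and place with Bob's authorship of the same sentence.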
Applications
40
Such a model with its reasoning
would have numerous applications:
Modeling privacy and access rights
Modeling authorship and provenance
Modeling time and space contexts
Modeling opinions and valuations
Estimating trust
Related Work
41
The idea of annotations has been explored
• in Database research
• on the Semantic Web
• on the Web itself
• in Artificial Intelligence
On the Semantic Web: Named Graphs
43
A named graph is a group of ontological statements:
[Figure: the named graph "Alice's world", containing a "plays" statement (object: guitar) and an "rdf:type" statement (object: livingPerson)]
Named graphs can express basic provenance and nesting,
but not transformations or aggregations.
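As a rough illustration of what named graphs can express, the sketch below stores quads (graph name, subject, predicate, object) in plain Python. The subject "Elvis", the graph identifier "AlicesWorld", and the predicate "assertedBy" are assumptions made for this example; they do not come from the slide or from any RDF vocabulary.

```python
# Quads: (graph name, subject, predicate, object).
# "Elvis", "AlicesWorld" and "assertedBy" are hypothetical names.
quads = [
    ("AlicesWorld", "Elvis", "plays", "guitar"),
    ("AlicesWorld", "Elvis", "rdf:type", "livingPerson"),
    # A statement about the graph itself gives basic provenance.
    ("default", "AlicesWorld", "assertedBy", "Alice"),
]


def graph(name):
    """All triples that belong to the named graph."""
    return {(s, p, o) for g, s, p, o in quads if g == name}


def provenance(subject, predicate, obj):
    """Map each graph containing the triple to the set of agents asserting that graph."""
    result = {}
    for g, s, p, o in quads:
        if (s, p, o) == (subject, predicate, obj):
            result[g] = {o2 for _, s2, p2, o2 in quads
                         if s2 == g and p2 == "assertedBy"}
    return result


print(graph("AlicesWorld"))
print(provenance("Elvis", "plays", "guitar"))
```

The sketch can say who asserted a group of statements, but nothing in it records how a statement was derived, aggregated, or transformed.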
On the Semantic Web: Watermarking
48
Ontological Watermarking removes certain statements
to prove provenance and plagiarism.
[Figure: an Original Ontology is plagiarized into a Plagiarizing Ontology; the pattern of removed facts proves provenance.]
Watermarking can prove provenance,
but it cannot model meta-information or transformations.
On the Web: HTML5
50
HTML5 allows users to specify provenance using tags:
<blockquote cite="http://elvis.com/quotes.html">
Don't criticize what you don't understand, son.
You never walked in that man's shoes.
</blockquote>
Don’t criticize what you don’t understand, son.
You never walked in that man’s shoes.
HTML5 is limited to citations and references,
and does not support other types of annotations.
On the Web: Information Extraction
53
Information Extraction often attaches meta-information to its facts:
• attaching sources (example from NELL):
  elvis_presley is a male
  Human feedback from bsettles @181 on 02-jan-2011
• attaching context (example from YAGO2):
  <f42>: <Elvis_Presley> rdf:type <person>
  <f42> <validFrom> "1935"
  <f42> <validTo> "1977"
  <f42> <extractedFrom> <http://wikipedia.org>
This type of attached information can serve as a use-case,
but does not support transformations.
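The YAGO2-style example, where a fact gets an identifier and further facts talk about that identifier, can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `meta_of` and `valid_in` are invented here, and the data structures are not the format used by YAGO2.

```python
# Facts keyed by identifier, plus meta-facts whose subject is a fact identifier.
facts = {"f42": ("Elvis_Presley", "rdf:type", "person")}
meta = [
    ("f42", "validFrom", "1935"),
    ("f42", "validTo", "1977"),
    ("f42", "extractedFrom", "http://wikipedia.org"),
]


def meta_of(fact_id):
    """Collect the meta-information attached to one fact."""
    return {p: o for s, p, o in meta if s == fact_id}


def valid_in(year):
    """Facts whose [validFrom, validTo] interval contains the year.

    Facts that lack either bound are skipped.
    """
    out = []
    for fact_id, triple in facts.items():
        m = meta_of(fact_id)
        if "validFrom" in m and "validTo" in m:
            if int(m["validFrom"]) <= year <= int(m["validTo"]):
                out.append(triple)
    return out


print(meta_of("f42")["extractedFrom"])  # where the fact was extracted from
print(valid_in(1960))
print(valid_in(1980))
```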
In AI: Context and Epistemic Logics
57
Context and epistemic logics relativize logical statements to contexts or agents:
Mary believes that
Bob does not know that
"Mary is dating Peter"
France, [2007-2012]: "Sarkozy is president"
[2008] ⊂ [2007-2012]
=> France, [2008]: "Sarkozy is president"
These models support complex reasoning,
but are not designed to handle large amounts of data.
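The Sarkozy example can be mimicked with a small Python sketch. It assumes, as a simplification, that a statement asserted for a place and a time interval holds in every sub-interval for the same place; the class `Context` and the function `holds` are names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical context: a place together with a closed interval of years."""
    place: str
    start: int
    end: int

    def contains(self, other):
        """True if `other` is the same place and its interval lies within this one."""
        return (self.place == other.place
                and self.start <= other.start
                and other.end <= self.end)


# Statements asserted relative to a context.
asserted = [(Context("France", 2007, 2012), "Sarkozy is president")]


def holds(statement, ctx):
    """Assumption: a statement holds in every context contained in one where it was asserted."""
    return any(s == statement and c.contains(ctx) for c, s in asserted)


print(holds("Sarkozy is president", Context("France", 2008, 2008)))   # [2008] lies in [2007-2012]
print(holds("Sarkozy is president", Context("France", 2006, 2008)))   # interval not contained
print(holds("Sarkozy is president", Context("Germany", 2008, 2008)))  # different place
```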
In Databases: Provenance
62
Semi-ring provenance captures the transformations
that data undergoes in queries.
Emps:
Name    Dep     Prov
Alice   Eng.    x
Bob     Eng.    y
Elvis   Sales   k
GoodEmps:
Name    Prov
Alice   z
Bob     w
Elvis   m
The Prov column indicates where each tuple came from.
Query result:
Dep     Prov
Eng.    x*z+y*w
Sales   k*m
The Prov column of the result answers: whom do I have to believe to believe this result tuple?
Provenance can capture "positive SQL",
but there are many other transformations on the Web.
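The result annotations x*z+y*w and k*m can be reproduced with a short Python sketch. It assumes that the query joins Emps and GoodEmps on Name and projects onto Dep, which is consistent with the result table but not stated on the slide; the function name and the set `trusted` are made up for the example. Joint use of tuples multiplies annotations, and alternative derivations add them.

```python
emps = [("Alice", "Eng.", "x"), ("Bob", "Eng.", "y"), ("Elvis", "Sales", "k")]
good_emps = [("Alice", "z"), ("Bob", "w"), ("Elvis", "m")]


def departments_of_good_emps(plus, times, value):
    """Join Emps and GoodEmps on Name, project onto Dep (assumed query).

    `plus` and `times` are the two semiring operations; `value` maps an
    annotation variable such as "x" to an element of the semiring.
    """
    result = {}
    for name, dep, p in emps:
        for good_name, q in good_emps:
            if name == good_name:
                term = times(value(p), value(q))
                result[dep] = plus(result[dep], term) if dep in result else term
    return result


# Symbolic reading: build the provenance expressions as strings.
symbolic = departments_of_good_emps(
    plus=lambda a, b: f"{a}+{b}",
    times=lambda a, b: f"{a}*{b}",
    value=lambda v: v,
)
print(symbolic)  # {'Eng.': 'x*z+y*w', 'Sales': 'k*m'}

# Boolean reading: which result tuples can I believe if I trust only some inputs?
trusted = {"x", "z", "k"}  # hypothetical choice of trusted source tuples
believable = departments_of_good_emps(
    plus=lambda a, b: a or b,
    times=lambda a, b: a and b,
    value=lambda v: v in trusted,
)
print(believable)  # {'Eng.': True, 'Sales': False}
```

The same query code serves both readings; only the interpretation of the two operations changes.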
Toy Model
67
Inspired by provenance, we have developed a toy model
to illustrate the complexity of basic reasoning tasks:
Context-annotated database D (the Context column gives the conditions of tuples):
Person   Location   Context
Alice    Paris
Bob      Paris
Background theory T:
expresses relationships between contexts
Reasoning tasks:
1. Is there a consistent set of contexts that makes the query hold?
(or: find all such contexts)
NP-complete (PTIME if T positive)
2. Does a given set of contexts ensure that the query holds?
(or: find all tuples that hold in this context)
coNP-complete (PTIME if T Horn)
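The two reasoning tasks can be made concrete with a brute-force Python sketch. Everything specific in it is an assumption for illustration: the context names c1, c2, c3, the conditions attached to the two tuples, the two rules in T, and the restriction to queries of the form "all of these tuples are present". It enumerates all truth assignments to the contexts, so it takes exponential time and says nothing about the tractable cases.

```python
from itertools import product

CONTEXTS = ["c1", "c2", "c3"]  # hypothetical context names

# Context-annotated database D: each tuple holds under a condition on contexts.
D = {
    ("Alice", "Paris"): lambda v: v["c1"],
    ("Bob", "Paris"): lambda v: v["c2"],
}

# Background theory T: relationships between contexts (made up for this sketch).
T = [
    lambda v: not (v["c1"] and v["c2"]),  # c1 and c2 exclude each other
    lambda v: (not v["c3"]) or v["c1"],   # c3 implies c1
]


def consistent_valuations():
    """Enumerate all truth assignments to the contexts that satisfy T."""
    for bits in product([False, True], repeat=len(CONTEXTS)):
        v = dict(zip(CONTEXTS, bits))
        if all(rule(v) for rule in T):
            yield v


def query_holds(query, v):
    """A very simple query: all listed tuples are present under valuation v."""
    return all(D[t](v) for t in query)


def task1(query):
    """Is there a T-consistent choice of contexts that makes the query hold?"""
    return any(query_holds(query, v) for v in consistent_valuations())


def task2(query, given):
    """Does the query hold in every T-consistent valuation where the given contexts hold?

    Vacuously True if the given contexts are inconsistent with T.
    """
    return all(query_holds(query, v)
               for v in consistent_valuations()
               if all(v[c] for c in given))


print(task1([("Alice", "Paris")]))
print(task1([("Alice", "Paris"), ("Bob", "Paris")]))
print(task2([("Alice", "Paris")], {"c3"}))
print(task2([("Bob", "Paris")], {"c3"}))
```

Task 1 is an existential search over valuations and task 2 a universal check, which mirrors the NP versus coNP split stated above.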
What we need
68
Database Research: Efficient storage, Powerful model
Semantic Web: Semantic, Large-scale, Distributed
Artificial Intelligence: Complex inference
Conclusion
69
• Data on the Web is not "flat", but comes with meta-information
• To model this data, we can use research from
• Databases
• Semantic Web
• Artificial Intelligence
• There are still many open issues, in particular for reasoning
• most importantly: Elvis is alive!
  (it all depends on the context)
Slides done with PowerLine, the free SVG slide editor with Latex support