
1 Very VERY large scale knowledge representation
Frank van Harmelen
In collaboration with: Jacopo Urbani (VUA), Henri Bal (VUA)

2 Can we do this at “Web scale”?
Reminder: RDF(S) & OWL are (weak) logics, so we can infer implicit statements from explicit ones: the subClass hierarchy, predicate types, cardinalities, transitive & symmetric relations, inconsistency, ... Can we do this at “Web scale”?
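As a concrete toy illustration (my sketch, not from the talk; all names are invented), the snippet below forward-chains two RDFS rules, type propagation and subclass transitivity, to a fixpoint:

```python
# Toy sketch (not from the talk): naive forward chaining of two RDFS rules
# over a tiny triple set. Real systems do this over billions of triples.

triples = {
    ("Socrates", "rdf:type", "Human"),
    ("Human", "rdfs:subClassOf", "Animal"),
    ("Animal", "rdfs:subClassOf", "LivingThing"),
}

def closure(triples):
    """Apply both rules repeatedly until no new triple appears (fixpoint)."""
    triples = set(triples)
    while True:
        new = set()
        for s, p, o in triples:
            for s2, p2, o2 in triples:
                # rdfs9: (x rdf:type C) + (C rdfs:subClassOf D) => (x rdf:type D)
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and o == s2:
                    new.add((s, "rdf:type", o2))
                # rdfs11: rdfs:subClassOf is transitive
                if p == p2 == "rdfs:subClassOf" and o == s2:
                    new.add((s, "rdfs:subClassOf", o2))
        if new <= triples:          # nothing new: closure reached
            return triples
        triples |= new

for t in sorted(closure(triples)):
    print(t)   # includes ('Socrates', 'rdf:type', 'Animal'), etc.
```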

3 25 billion facts and counting

4 1 triple:

5

6

7

8 10^7 Triples: Suez Canal (163 km ≈ 10^5 m)
Denny Vrandečić – AIFB, Universität Karlsruhe (TH)

9 10^8 Triples: Moon (RDF stores manage subsecond querying at this scale)

10 ~10^9 Triples: Earth (diameter: 10^5 km, volume: 10^21 km³)

11 ~10^10 Triples ≈ 1 triple per web-page: Jupiter [LarKC proposal]

12 ~10^11 Triples

13 Rephrase inference as Map-Reduce
Map = group (key,val) pairs; Reduce = process grouped pairs.
Inference = join: Map = group triples on equal 1st or 3rd elements; Reduce = perform the joins.
Example: Socrates a Human + Human subClassOf Animal ⇒ Socrates an Animal.
[Figure: example triples keyed on shared terms, grouped in the Map phase and joined in the Reduce phase]
Jacopo Urbani
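A single-process simulation of that idea (my sketch, not WebPIE's actual code; the tags and helper names are invented): Map keys each triple on the term it can join on, the shuffle groups pairs by key, and Reduce performs the join for the subclass rule:

```python
# Sketch: simulating the Map-Reduce join for the subclass rule in one process.
from collections import defaultdict

triples = [
    ("Socrates", "rdf:type", "Human"),
    ("Human", "rdfs:subClassOf", "Animal"),
]

def map_phase(triples):
    """Emit (key, value) pairs keyed on the term the join runs over."""
    for s, p, o in triples:
        if p == "rdf:type":
            yield o, ("instance", s)       # key on the class (3rd element)
        elif p == "rdfs:subClassOf":
            yield s, ("superclass", o)     # key on the subclass (1st element)

def reduce_phase(key, values):
    """Join: every instance of `key` is also an instance of its superclasses."""
    instances = [v for tag, v in values if tag == "instance"]
    supers = [v for tag, v in values if tag == "superclass"]
    for x in instances:
        for c in supers:
            yield (x, "rdf:type", c)

groups = defaultdict(list)                 # the "shuffle": group pairs by key
for key, value in map_phase(triples):
    groups[key].append(value)

for key, values in groups.items():
    for t in reduce_phase(key, values):
        print(t)                           # ('Socrates', 'rdf:type', 'Animal')
```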

14 WebPIE: Scalable reasoning in Hadoop
Deploy Map-Reduce on the Hadoop platform. Run on a cluster: 64 quad-core nodes, cheap network (Gbit Ethernet), cheap disks (1 per node), limited memory (4GB per node).
Jacopo Urbani

15 WebPIE performance: headlines
Compute the closure of 1.6B triples in < 2 hr (Uniprot, OWL, 64 nodes). Compute the closure of 100B triples in < 2 days (LUBM, OWL, 64 nodes). Linear scalability.

16 WebPIE: performance, scalability in the input size (LUBM, up to 100 billion triples)
In this experiment we evaluated how WebPIE performs as the input grows. We kept the number of machines fixed (64) and generated LUBM data up to 100 billion triples. The graph shows that the execution time increases linearly, which is good.

17 WebPIE: performance, scalability in the number of nodes (up to 32)
Here we tested performance as the number of nodes grows: we kept the input size constant (1B triples) and doubled the number of nodes at each step. The speedup initially appears superlinear, but this is misleading: the Hadoop settings were not optimized for execution on a single machine, so the one-node time is penalized. The real scaling is linear or even slightly sublinear, as the last part of the graph shows (compare 16 and 32 nodes).

18 Scalable reasoning in Hadoop

19 What to do for infinite scalability? (2/2)
Anytime convergence (more complete over time): “a brain the size of a planet”
Eyal Oren

20 What to do for infinite scalability?
MaRVIN: Divide – Conquer – Swap. 1. Split the input across peers; 2. Calculate the closure; 3. If you want more completeness, goto 1. RDFS closure of 200M triples in 7 minutes. Approximate reasoning: the full closure is guaranteed at ∞ (“a brain the size of a planet”).
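A minimal single-machine sketch of this loop (my approximation, not MaRVIN's implementation; `local_closure` stands for any local reasoner, e.g. the RDFS fixpoint sketched earlier):

```python
# Sketch of Divide-Conquer-Swap, simulated with peers as plain sets.
import random

def divide_conquer_swap(triples, local_closure, n_peers=4, rounds=10):
    # 1. Divide: split the input across peers
    peers = [set() for _ in range(n_peers)]
    for t in triples:
        peers[random.randrange(n_peers)].add(t)
    derived = set(triples)
    for _ in range(rounds):
        # 2. Conquer: each peer computes the closure of its own partition
        for i in range(n_peers):
            peers[i] = set(local_closure(peers[i]))
            derived |= peers[i]
        # 3. Swap: each peer ships roughly half its triples to a random peer,
        # so triples that must meet for an inference eventually co-locate.
        for i in range(n_peers):
            outgoing = {t for t in peers[i] if random.random() < 0.5}
            peers[i] -= outgoing
            peers[random.randrange(n_peers)] |= outgoing
    return derived   # anytime: more rounds => a more complete closure
```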

21 Does this guarantee completeness?
Questions:
Does this guarantee completeness? Yes: theoretical model, experimentally verified.
Will this take forever? Yes, if triples are exchanged randomly; no, if we can do something better.

22 Why is random routing inefficient?
Random is inefficient: triples meet other triples randomly, and most meetings are useless because inferences are sparse. Random scales badly: useful meetings decrease as the system size increases.
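A back-of-the-envelope illustration (the numbers are mine, not the slide's): if each triple is routed to one of n peers uniformly at random, two triples that must meet for an inference end up on the same peer with probability 1/n, so useful meetings become rarer as the system grows:

```python
# Sketch: probability that two triples needed for one inference co-locate
# under uniform random routing, for growing system sizes.
for n_peers in (2, 8, 32, 128, 512):
    print(f"{n_peers:4d} peers -> P(useful meeting) = {1 / n_peers:.4f}")
```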

23 Why is efficient routing difficult?
Efficient routing = term-based partitions: all triples with term x go to node y, because for inferencing you need terms in common (Socrates a Human + Human subClassOf Animal ⇒ Socrates an Animal). But this will not work: the term distribution is very skewed (Zipf), so the load balance will be too uneven.
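A synthetic demonstration of the skew problem (my sketch; the term names and Zipf weights are invented): hashing each term to a peer gives a clean partition, but a Zipf-distributed workload piles the popular terms onto a few peers:

```python
# Sketch: term-based routing (peer = hash(term) mod n) under Zipf-skewed terms.
import random
import zlib
from collections import Counter

random.seed(0)
n_peers = 8
terms = [f"term{i}" for i in range(1, 1001)]
weights = [1.0 / i for i in range(1, 1001)]          # Zipf-like frequencies
workload = random.choices(terms, weights=weights, k=100_000)

# Route every occurrence to the peer that owns its term (stable hash).
load = Counter(zlib.crc32(t.encode()) % n_peers for t in workload)
for peer in range(n_peers):
    print(f"peer {peer}: {load[peer]:6d} occurrences")
# The peer that owns the few most popular terms is badly overloaded,
# while a uniform workload would give each peer ~12500 occurrences.
```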

24 Data clustering with SpeedDate
[Figure: clustering behaviour of DHT, random, and SpeedDate routing]

25 SpeedDate vs. other approaches
We’re almost as good as a DHT

26 SpeedDate with various data distributions
We can handle skewed data

27 SpeedDate under network churn
We can handle node failures

28 SpeedDate scaling with system size
We scale ~ sqrt(x)

29 Experimental speedup

30 [Figure: summary diagram combining the two approaches, Map-Reduce (Jacopo Urbani) and Divide-Conquer-Swap (Spyros Kotoulas, Eyal Oren): input data flowing through compute nodes to output data]

31 Inference in Weak Logics at very, VERY large scale is possible
Conclusion: inference in weak logics at very, VERY large scale is possible. Future challenges: incremental reasoning, stream reasoning, approximate reasoning (targeted incompleteness), stronger logics, cost predictions.

32

33 Semantic Web Intro in 6 slides & a movie

34

35 P1. Give all things a name

36 P2. Relations form a graph between things

37 P3. The names are addresses on the Web
[Figure: a triple <x> IsOfType <T>, where <x> and <T> (e.g. <village>) live at different owners & locations on the Web]

38 P1+P2+P3 = Giant Global Graph

39 P4. explicit & formal semantics
Assign types to things; assign types to relations; organise types in a hierarchy; impose constraints on possible interpretations.

40 Examples of “semantics”
Facts: Frank married-to Lynda; Frank married-to Hazel.
Semantics: married-to relates males to females ⇒ Frank is male.
Semantics: married-to relates 1 male to 1 female (lower bound, upper bound) ⇒ Lynda = Hazel.
Semantics = predictable inference
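The same example as a toy rule application (the facts are the slide's; the code is my sketch): the cardinality constraint on married-to forces the two wives to denote one individual:

```python
# Sketch: applying "married-to relates 1 male to 1 female" to the slide's facts.
facts = [
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
]

# Max cardinality 1 on married-to: two objects of the same subject are equal.
same_as = set()
for s1, p1, o1 in facts:
    for s2, p2, o2 in facts:
        if p1 == p2 == "married-to" and s1 == s2 and o1 != o2:
            same_as.add(tuple(sorted((o1, o2))))

print(same_as)   # {('Hazel', 'Lynda')} -- the "Lynda = Hazel" inference
```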

