

1 Unique identifiers for the Web
Zoltan Miklos
Joint work with Gleb Skobeltsyn, Saket Sathe, Nicolas Bonvin, Philippe Cudré-Mauroux, Ekaterini Ioannou, Karl Aberer and others
EPFL

2 Providing unique identifiers
Diagram labels: Webpages; Documents; Okkamization; Entities store; (Information extraction)
Query: “Barack Obama”
Response: http://www.okkam.org/ens/idb3016709-b9e1-42c0-ac5f-6383d2e5b235

3 Web search vs. entity search
Documents
  Web search: Web documents
  Entity search: Entity profiles/unique ID
Ranking
  Web search: PageRank
  Entity search: OKKAM ranking
Query
  Web search: Keywords, e.g. Barack Obama
  Entity search: Keywords + attribute names, e.g. Barack Obama, Name:“Barack Obama”, geoLocation:Paris, firstName:Paris
Query “semantic”
  Web search: Find all relevant documents and have the most relevant first
  Entity search: Find the only relevant document; if this is not possible, an ordered list of candidates (with confidence values)

4 Entities and Entity Requests
Entity profiles are collections of attribute-value pairs with an okkam-id.
Examples of entity requests:
◦ Q1 -- name=“Einstein” (AND) physicist
◦ Q2 -- Einstein (AND) physicist
◦ Q3 -- name=“Einstein” (AND) profession=“physicist”
Example entity profile:
  name : Albert Einstein
  affiliation : Institute for Advanced Study
  profession : physicist
  okkam-id : http://www.okkam.org/ens/id06b1791f
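A minimal Python sketch of the two notions on this slide: an entity profile as attribute-value pairs plus an okkam-id, and an entity request split into keyword terms and attribute=value terms. The class names, the parse_request helper, and the plain "AND" separator are assumptions made for illustration; they are not the OKKAM API.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityProfile:
    """Attribute-value pairs identified by an okkam-id."""
    okkam_id: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Term:
    """One term of an entity request; attribute is None for a plain keyword."""
    value: str
    attribute: Optional[str] = None


def parse_request(request: str) -> List[Term]:
    """Split a request such as 'name="Einstein" AND physicist' into terms."""
    terms = []
    for part in re.split(r"\s+AND\s+", request.strip()):
        if "=" in part:
            attribute, value = part.split("=", 1)
            terms.append(Term(value.strip().strip('"“”'), attribute.strip()))
        else:
            terms.append(Term(part.strip().strip('"“”')))
    return terms


einstein = EntityProfile(
    okkam_id="http://www.okkam.org/ens/id06b1791f",
    attributes={
        "name": "Albert Einstein",
        "affiliation": "Institute for Advanced Study",
        "profession": "physicist",
    },
)
q1 = parse_request('name="Einstein" AND physicist')
```

Here q1 holds one structured term (attribute name) and one keyword term, which mirrors request Q1.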

5 OKKAM Match & Store Process
OKKAM Match API: receive the entity request
Request: Name=“Einstein” AND physicist
Convert request and select matching module
Matching Modules: Group Linkage, Generic Matching, Product Matching
Module Selection:
◦ Entity type
◦ Inferred from attributes
◦ Identified from receiver
◦ Required response time
◦ …
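The selection step could look roughly like the Python sketch below. The slide names the three modules and the selection criteria but not the actual policy, so the attribute hints, the 100 ms threshold, and the fallback order are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical hints: attribute names that suggest an entity type.
ATTRIBUTE_HINTS = {"price": "product", "brand": "product", "profession": "person"}


def infer_entity_type(request: Dict[str, str]) -> Optional[str]:
    """Guess the entity type from the attribute names used in the request."""
    for attribute in request:
        if attribute in ATTRIBUTE_HINTS:
            return ATTRIBUTE_HINTS[attribute]
    return None


def select_module(
    request: Dict[str, str],
    entity_type: Optional[str] = None,
    max_response_ms: Optional[int] = None,
) -> str:
    """Pick a matching module name (assumed policy, not OKKAM's)."""
    entity_type = entity_type or infer_entity_type(request)
    if entity_type == "product":
        return "product_matching"
    # Assumption: the generic module is the cheap fallback when the type is
    # unknown or the caller needs a fast answer.
    if entity_type is None or (max_response_ms is not None and max_response_ms < 100):
        return "generic_matching"
    return "group_linkage"
```

The entity_type argument stands in for a type "identified from receiver"; when it is absent the type is inferred from the attributes.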

6 OKKAM Match & Store Process
OKKAM Match API: generation of the storage query
Request: Name=“Einstein” AND physicist
Matching Module: create query for OKKAM Store
◦ Possibility to override the default implementation
◦ Schema rewriting (internal object, or store query)
◦ Add attributes to values
◦ Complex query plan

7 OKKAM Match & Store Process
OKKAM Store API: send the query to the index (OKKAM Store Index)
Query: name:einstein physicist
◦ Query the distributed index
◦ Each server processes the query from the index and returns top-k results
◦ Aggregate top-k results from each server
Result: top-k matches (IDs + scores)
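The aggregation step can be sketched in a few lines of Python. This assumes each server returns its own top-k list of (entity ID, score) pairs and that an entity lives on exactly one server, so no de-duplication is needed; the function name and the sample IDs are made up.

```python
import heapq
from typing import Iterable, List, Tuple

Match = Tuple[str, float]  # (entity ID, score)


def aggregate_top_k(per_server: Iterable[List[Match]], k: int) -> List[Match]:
    """Merge the per-server top-k lists into one global top-k list."""
    all_matches = (match for results in per_server for match in results)
    # nlargest returns the k best matches, highest score first.
    return heapq.nlargest(k, all_matches, key=lambda match: match[1])


server_a = [("id-a1", 0.95), ("id-a2", 0.40)]
server_b = [("id-b1", 0.89), ("id-b2", 0.10)]
top2 = aggregate_top_k([server_a, server_b], k=2)
```

Because every server already returns only its k best matches, the coordinator handles at most k results per server.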

8 OKKAM Match & Store Process
OKKAM Store API: OKKAM Store storage
Query: name:einstein physicist
◦ Top-k matches (IDs)
◦ Requesting entity profiles by their IDs for top-k candidate matches
◦ Top-k matching candidates are obtained
◦ Top-k entities (candidates)

9 OKKAM Match & Store Process
OKKAM Match API: receive matching candidates
Request: Name=“Einstein” AND physicist
Matching Module: advanced matching and final entities
◦ Background knowledge
◦ Domain specific information
◦ Analyze inner-relationships
◦ Make another query
◦ …

10 OKKAM Match & Store Process
OKKAM Match API: ranked list with matching entities
Request: Name=“Einstein” AND physicist
Matching Module:
◦ Background knowledge
◦ Domain specific information
◦ Analyze inner-relationships
◦ Make another query
◦ …
Figure: candidates are filtered into a ranked list; the scores shown are 0.95 and 0.89.

11 OKKAM Match & Store Process
Figure: OKKAM Store Index. Labels: terms einstein, physicist; attribute names name, firmname, school, affiliation, …; posting lists D3, D5, D9, D15, ... and D1, D3, D9, D17, ...; D3.

12 Scoring at the index level
OKKAMStore returns top-k candidate entities.
Scoring for keyword queries:
◦ Example: for the query Paolo, an entity with “name=Paolo” will be scored higher than an entity with “comment=Paolo leads OKKAM…”
Scoring for structured queries:
◦ Example: for the query name=Paolo, a high score goes to the entity with “name=paolo” and a low score to the entity with “location=paolo alto”
bu: boosting for unstructured queries
◦ bu ~ popularity of the attribute used with the term t from the query q in the entity e
bs(a): boosting for structured queries
◦ bs(a) = 1 if the entity e contains the term t exactly with the attribute a from the query q
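The two boosts can be illustrated with a small Python sketch. The slide gives only the intuition, so the popularity weights, the 0.1 value used when the attribute does not match, and the whitespace tokenization below are assumptions, not OKKAMStore's actual scoring function.

```python
from typing import Dict, Optional

Entity = Dict[str, str]  # attribute name -> value

# Hypothetical popularity weights standing in for bu.
ATTRIBUTE_POPULARITY = {"name": 1.0, "location": 0.5, "comment": 0.2}
DEFAULT_POPULARITY = 0.1
# Hypothetical low boost when the term occurs under a different attribute.
MISMATCH_BOOST = 0.1


def contains(value: str, term: str) -> bool:
    """True if the term is one of the whitespace-separated tokens of value."""
    return term.lower() in value.lower().split()


def score_term(entity: Entity, term: str, attribute: Optional[str] = None) -> float:
    """Score one query term; attribute is None for a keyword (unstructured) term."""
    matching = [a for a, v in entity.items() if contains(v, term)]
    if not matching:
        return 0.0
    if attribute is None:
        # bu: boost by the popularity of the attribute the term was found in.
        return max(ATTRIBUTE_POPULARITY.get(a, DEFAULT_POPULARITY) for a in matching)
    # bs(a): 1 if the term occurs under the requested attribute.
    return 1.0 if attribute in matching else MISMATCH_BOOST


by_name = {"name": "Paolo"}
by_comment = {"comment": "Paolo leads OKKAM"}
by_location = {"location": "paolo alto"}
```

With these weights, the keyword query Paolo ranks by_name above by_comment, and the structured query name=Paolo ranks by_name above by_location, as in the slide's examples.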

13 Identified Challenges → Achievements
Challenges:
◦ Huge number of entities that OKKAM needs to store and process
◦ A single algorithm for matching an entity description to the OKKAM entities does not exist [ENA+]

14 OKKAMstore distributed architecture
Conceptual principles:
◦ Document- (entity-) partitioned distributed index +
◦ distributed storage
Servers (replicas), each maintains:
◦ Storage: collection of entities
    Entity read(OkkamID)
    write(Entity)
◦ Index: inverted index
    Collection match(query)
Figure: two groups of servers (labelled A B and C D), each holding entities (E) in storage together with inverted indexes.
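A toy Python sketch of the entity-partitioned layout: every server holds the storage and the inverted index for its own share of entities, and a query is sent to all servers. The read, write and match names follow the slide; the CRC32 partitioning, the whitespace tokenization and the AND semantics of match are assumptions, and replication is left out.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set

Entity = Dict[str, str]  # must contain an "okkam-id" attribute


class Server:
    """One partition: entity storage plus an inverted index over its entities."""

    def __init__(self) -> None:
        self.storage: Dict[str, Entity] = {}
        self.index: Dict[str, Set[str]] = defaultdict(set)

    def write(self, entity: Entity) -> None:
        okkam_id = entity["okkam-id"]
        self.storage[okkam_id] = entity
        # Simplification: postings of an overwritten entity are not cleaned up.
        for attribute, value in entity.items():
            if attribute == "okkam-id":
                continue
            for token in value.lower().split():
                self.index[token].add(okkam_id)

    def read(self, okkam_id: str) -> Entity:
        return self.storage[okkam_id]

    def match(self, query: str) -> Set[str]:
        """Return IDs of local entities that contain every query token."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        postings = [self.index.get(token, set()) for token in tokens]
        return set.intersection(*postings)


class Cluster:
    """Routes writes and reads by okkam-id and fans queries out to all servers."""

    def __init__(self, size: int) -> None:
        self.servers: List[Server] = [Server() for _ in range(size)]

    def _server_for(self, okkam_id: str) -> Server:
        return self.servers[zlib.crc32(okkam_id.encode("utf-8")) % len(self.servers)]

    def write(self, entity: Entity) -> None:
        self._server_for(entity["okkam-id"]).write(entity)

    def read(self, okkam_id: str) -> Entity:
        return self._server_for(okkam_id).read(okkam_id)

    def match(self, query: str) -> Set[str]:
        result: Set[str] = set()
        for server in self.servers:
            result |= server.match(query)
        return result


cluster = Cluster(size=2)
cluster.write({"okkam-id": "id1", "name": "Albert Einstein", "profession": "physicist"})
cluster.write({"okkam-id": "id2", "name": "Einstein Bros", "type": "bakery"})
```

Since an entity is indexed only on the server that stores it, each server can answer a query from local data alone, and the coordinator just merges the partial results.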

15 Future work: manage mappings
The user in general does not know the set of available attributes.
Example: the posting list holds D1 with fName = Paris, …; the user query is firstName=Paris.
Need a mapping firstName -> fName
Challenge: on-the-fly mappings are needed, but only mappings with very low computation costs (constant time) are realistic
Strategy: create mapping candidates from the dataset, adapt the mappings based on statistics (we don’t have good test data …)
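One way to get constant-time mappings is to precompute them into a hash table and only update the table when new evidence arrives. The Python sketch below illustrates that idea; the AttributeMapper class, its method names and the weighting scheme are hypothetical and are not part of OKKAM.

```python
from collections import Counter, defaultdict
from typing import Dict


class AttributeMapper:
    """Maps attribute names used in queries to attribute names used in the store."""

    def __init__(self) -> None:
        # query attribute -> evidence counts for each candidate stored attribute
        self._evidence: Dict[str, Counter] = defaultdict(Counter)
        # query attribute -> currently best stored attribute (constant-time lookup)
        self._best: Dict[str, str] = {}

    def add_evidence(self, query_attr: str, stored_attr: str, weight: int = 1) -> None:
        """Record that query_attr probably means stored_attr and refresh the best guess."""
        self._evidence[query_attr][stored_attr] += weight
        self._best[query_attr] = self._evidence[query_attr].most_common(1)[0][0]

    def resolve(self, query_attr: str) -> str:
        """Constant-time lookup; unknown attributes are passed through unchanged."""
        return self._best.get(query_attr, query_attr)


mapper = AttributeMapper()
mapper.add_evidence("firstName", "fName")
```

The cost of adapting a mapping is paid in add_evidence, so the lookup on the query path stays a single dictionary access.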

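16 idMesh: Cudré-Mauroux et al., WWW’09
Relations declared by the sources (Source1, Source2):
  c1: e1 = e2
  c2: e1 = e3
  c3: e1 ≠ e4
  c4: e2 ≠ e4
  c5: e2 = e4
  c6: e3 = e4
Entity graph: entities e1, e2, e3, e4; links l12, l13, l14, l24, l34
Source graph: sources S1, S2; links l12, l13, l14, l24, l34; declarations c1, c2, c3, c4, c5, c6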

17 idMesh: Inferring the most-probable relations
We formulate a set of integrity constraints:
◦ P(l=equal) + P(l=non-equal) = 1, for link variables
◦ No cycle can contain exactly one non-equivalent link
We also define a trust framework and attach a trust variable to each source (which has the value 1 if all the relations declared by this source are correct).
With a graphical model-based (factor graph) probabilistic inference machinery we compute the most probable values for the entity relations.
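For a fixed (non-probabilistic) assignment of link values, the cycle constraint can be checked with a union-find structure: a cycle with exactly one non-equivalent link exists precisely when the two endpoints of a non-equal link are also connected through equal links. The Python sketch below shows only this hard check; it is not idMesh's factor-graph inference, and the function name and sample links are made up.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def find_violations(equal_links: Iterable[Link], non_equal_links: Iterable[Link]) -> List[Link]:
    """Return the non-equal links that close a cycle of otherwise equal links."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Entities joined by equal links end up in the same set.
    for a, b in equal_links:
        union(a, b)
    # A non-equal link inside one set contradicts the transitivity of equality.
    return [(a, b) for a, b in non_equal_links if find(a) == find(b)]


consistent = find_violations([("e1", "e2"), ("e1", "e3")], [("e1", "e4")])
conflicting = find_violations([("e1", "e2"), ("e2", "e4")], [("e1", "e4")])
```

In the second call e1 and e4 are connected via e2 by equal links, so declaring e1 ≠ e4 violates the constraint; the probabilistic model weighs such conflicts against the trust in each source instead of rejecting them outright.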

18 idMesh: further challenges
Entity graph can be very large
◦ ε-graph: represent only edges with confidence larger than (1-ε) or smaller than ε (even this is difficult to compute)
How to construct the graph if the entity profiles have a different set of attributes
Large connected components in the graph or large cycles
◦ Apply standard graph algorithms for finding max connected components
Problems with the dataset, e.g. the number of sources is low
◦ More advanced models
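The ε-graph filter and the component search can be sketched in Python as follows. The edge representation (a dictionary from entity pairs to a confidence value), the function names and the sample numbers are assumptions for illustration only.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def epsilon_graph(confidence: Dict[Edge, float], eps: float) -> Dict[Edge, float]:
    """Keep only edges whose confidence is above 1 - eps or below eps."""
    return {edge: p for edge, p in confidence.items() if p > 1 - eps or p < eps}


def connected_components(nodes: Iterable[str], edges: Iterable[Edge]) -> List[Set[str]]:
    """Breadth-first search over an undirected graph."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


confidence = {
    ("e1", "e2"): 0.97,
    ("e2", "e3"): 0.50,
    ("e3", "e4"): 0.02,
    ("e4", "e5"): 0.99,
}
kept = epsilon_graph(confidence, eps=0.1)
components = connected_components(["e1", "e2", "e3", "e4", "e5"], kept)
largest = max(components, key=len)
```

Dropping the uncertain edge between e2 and e3 splits the graph into two components, and the largest one is the natural unit on which to run the more expensive inference.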

19 Thank you for your attention!
http://www.okkam.org
New version (V2) will be online in July/August 2009

