Data Integration for the Relational Web Katsarakis Michalis.

Presentation of the paper: Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. Data integration for the relational web. Proc. VLDB Endow. 2, 1 (August 2009). Prepared for the needs of the course hy562.

Octopus system in one slide
[Figure: example extracted relation with columns Name, Institute, Country]
1. Search
   1. Find relations relevant to the user's query string
   2. Cluster similar tables together
2. Context – Enrich relations with data from the surrounding text
3. Extend – Adorn an existing relation with additional data columns derived from other relations

Index
1. Integration Operators
2. Algorithms
3. Implementation at Scale
4. Experiments
5. Related Work
6. Conclusions

INTEGRATION OPERATORS

Search Operator
[Diagram: the inputs are a keyword query string and the extracted set of relations; Relevance Ranking produces an ordered list of relevant relations, and Clustering produces an ordered list of clusters of relations]

Search Operator (2)
– The Search operator finds relevant data on the Web and then clusters the results.
– Each member table of a cluster is a concrete table that contributes to the cluster's schema relation.

Context Operator
[Diagram: Context takes an extracted relation T and T's source Web page, and outputs T enriched with new columns]

Context Operator (2)
[Figure: example table; surviving column labels: Course id, Semester]

Context Operator (3)
– Data values that hold for every tuple are generally "projected out" of the table and added to the Web page's surrounding text.
– Context takes as input a single extracted table T and modifies it to contain additional columns, using data retrieved from T's source Web page.

Extend Operator
[Diagram: Extend takes a column c of relation T and a topic keyword k, and outputs the extended relation T']

Extend Operator (2)
– Enables the user to add more columns to a table by performing a join.
– Takes as input a column c of table T and a topic keyword k, and returns one or more columns whose values are described by k.
– The new column added to T does not necessarily come from a single data source; Extend can gather data from a large number of sources.
– It can also gather data from tables whose label differs from k, or that have no label at all.

ALGORITHMS

Algorithms
– Search
   – Ranking
   – Clustering
– Context
– Extend
Search:
– Rank the tables by relevance to the user's query.
– Cluster other related tables around the top-ranking search result.

Ranking Algorithms
SimpleRank
– Transmits the user's search query to a Web search engine, obtains the URL ordering, and presents the extracted data in that order.
– Drawbacks:
   – It ranks the whole page, not the data on that page. E.g., a person's home page may contain an HTML list that serves only as a navigation list to other pages.
   – When multiple datasets are present on a Web page, SimpleRank relies on in-page ordering (i.e., the order of appearance).
   – Any metadata about an HTML list exists only in the surrounding text and not in the table itself, so SimpleRank cannot count hits between the query and a specific table's metadata.

Ranking Algorithms (2) SCPRank

Ranking Algorithms (3)
SCPRank
– Uses symmetric conditional probability (SCP) to measure the correlation between a cell in the extracted database and a query term: how likely the query term q and the cell value c are to appear together in a document.
– SCPRank scores the table, not the cell.
– It sends the query to the search engine and extracts a candidate set of tables.
– It then computes per-column scores, each of which is the sum of the per-cell SCP scores in the column.
– The table's overall score is the maximum of its per-column scores.
– Finally, it sorts the tables by score and returns a ranked list.
– Computing these scores is time consuming, so two approximations are used:
   – Compute scores only for the first r rows of every candidate table.
   – Approximate the SCP score on a small subset of the Web corpus.
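
The scoring steps of SCPRank can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy corpus `DOCS`, the helper `doc_fraction`, and the table layout (a list of columns, each a list of single-term cells) are all hypothetical, and probabilities are estimated as plain document fractions.

```python
from typing import Mapping, Sequence

# Hypothetical toy corpus: each document is a set of lowercase terms.
DOCS = [
    {"vldb", "paper", "halevy"},
    {"vldb", "paper", "cafarella"},
    {"recipe", "pasta"},
    {"vldb", "program", "committee"},
]

def doc_fraction(*terms: str) -> float:
    """Fraction of documents that contain all of the given terms."""
    hits = sum(1 for d in DOCS if all(t in d for t in terms))
    return hits / len(DOCS)

def scp(q: str, c: str) -> float:
    """Symmetric conditional probability: p(q, c)^2 / (p(q) * p(c))."""
    p_q, p_c = doc_fraction(q), doc_fraction(c)
    if p_q == 0 or p_c == 0:
        return 0.0
    return doc_fraction(q, c) ** 2 / (p_q * p_c)

def table_score(query: str, columns: Sequence[Sequence[str]], r: int = 10) -> float:
    """Max over columns of the summed per-cell SCP, using only the first r rows."""
    return max((sum(scp(query, cell) for cell in col[:r]) for col in columns),
               default=0.0)

def scp_rank(query: str, tables: Mapping[str, Sequence[Sequence[str]]], r: int = 10):
    """Return table names sorted by descending table score."""
    return sorted(tables, key=lambda name: table_score(query, tables[name], r),
                  reverse=True)
```

Limiting the sum to the first `r` rows mirrors the first approximation listed on the slide; the small in-memory corpus stands in, very loosely, for the second.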

Embedded Appendix: symmetric conditional probability
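
The formula on this slide did not survive in the transcript. The usual definition of symmetric conditional probability, which agrees with the description "how likely the terms q and c are to appear together in a document", is given here as a reference; p(q), p(c), and p(q, c) denote the probabilities that a document contains q, c, and both.

```latex
\mathrm{scp}(q, c) \;=\; p(q \mid c)\, p(c \mid q) \;=\; \frac{p(q, c)^{2}}{p(q)\, p(c)}
```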

Ranking Algorithms (4)

Clustering Algorithms
– TextCluster: computes the tf-idf cosine distance between the text of table a and the text of table b.
– SizeCluster: computes a column-to-column similarity score that measures the difference in mean string length between the two columns. The overall table-to-table similarity score for a pair of tables is the sum of the per-column scores for the best column-to-column matching.
– ColumnCluster: similar to SizeCluster, but it computes a tf-idf cosine distance using only the text found in the two columns.
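
A minimal sketch of the SizeCluster idea, under assumptions the slides do not spell out: the mapping from length difference to a similarity (`1 / (1 + difference)`) and the brute-force search over column matchings are illustrative choices, and every column is assumed to be non-empty.

```python
from itertools import permutations
from statistics import mean

def mean_len(column):
    """Mean string length of the values in a (non-empty) column."""
    return mean(len(value) for value in column)

def column_sim(a, b):
    """Assumed similarity: 1.0 for equal mean lengths, decaying with the difference."""
    return 1.0 / (1.0 + abs(mean_len(a) - mean_len(b)))

def table_sim(t1, t2):
    """Sum of per-column similarities under the best column-to-column matching.

    Tables are lists of columns. Brute force over matchings is only
    practical for tables with a handful of columns.
    """
    small, large = sorted((t1, t2), key=len)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(column_sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return best
```

For wider tables, an assignment algorithm would be a more sensible way to find the best matching than enumerating permutations.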

Embedded Appendix: tf-idf
Term frequency–inverse document frequency reflects how important a word is to a document in a collection or corpus. Its value is:
– highest when the term occurs many times within a small number of documents
– lower when the term occurs fewer times in a document, or occurs in many documents
– lowest when the term occurs in virtually all documents
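
As a rough illustration of the tf-idf cosine distance used by TextCluster and ColumnCluster, the sketch below uses one common weighting, raw term frequency times log(N / document frequency). The slides do not say which tf-idf variant Octopus uses, so the weighting and the function names here are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * log(N / df)} vector."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return 1.0 - dot / norm if norm else 1.0
```

With this weighting, a term that occurs in every document gets weight log(1) = 0, which matches the "lowest" case in the list above.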

Context Algorithms
SignificantTerms (ST)
– Examines the source page of the extracted table and returns the k terms that have the highest tf-idf values and do not appear in the extracted data.
RVP (Related View Partners)
– Looks beyond the source page.
– Operating on the table T, it obtains a large number of candidate related-view tables by using each value in T as the parameter for a new Web search.
– It then filters out tables that are unrelated to T's source page, by removing all tables that do not contain at least one value from ST(T), the output of SignificantTerms on T.
– It collects all the data values in the remaining tables, ranks them by frequency of occurrence, and returns the k highest-ranked values.
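
The three RVP steps can be sketched as below. This is a simplified illustration, not the paper's code: `search` is a hypothetical callable standing in for the Web search and table extraction step, tables are assumed to be lists of rows of string cells, and the significant terms are passed in rather than computed.

```python
from collections import Counter

def rvp(table_values, significant_terms, search, k):
    """Return the k most frequent values among related-view tables.

    table_values: values of T, each used as one search parameter.
    significant_terms: ST(T), used to filter out unrelated tables.
    search: hypothetical callable, value -> list of extracted tables.
    """
    # Step 1: one search per value in T gathers the candidate tables.
    candidates = [table for value in table_values for table in search(value)]

    # Step 2: keep only tables that contain at least one significant term.
    st = set(significant_terms)
    related = [t for t in candidates
               if st & {cell for row in t for cell in row}]

    # Step 3: rank all remaining values by how often they occur.
    counts = Counter(cell for t in related for row in t for cell in row)
    return [value for value, _ in counts.most_common(k)]
```

Passing `search` as a parameter keeps the sketch testable with a stub; a real deployment would issue one Web query per value, which is the cost discussed under "Multi-Query web searches".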

Context Algorithms (2)
Hybrid
– Exploits the fact that the two algorithms above are complementary in nature.
– ST finds context terms that RVP misses, and RVP discovers context terms that ST misses.
– Hybrid returns the context terms that appear in the result of either algorithm.

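Extend Algorithms
[Diagram: JoinTest computes the Jaccardian distance for each candidate table (Table / Distance: Candidate 1 / α, Candidate 2 / β), applies a threshold (Threshold: Distance ≤), and outputs an ordered list of joinable tables]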

Extend Algorithms (2)
JoinTest
– Combines Web search and key matching to perform schema matching.
– Uses the Jaccardian distance to measure the compatibility between the values of T's column c and each column in each candidate table.
– If the distance is greater than a constant threshold t, the tables are considered joinable.
– All tables that pass this threshold are sorted by relevance to the keyword k.

Embedded Appendix: Jaccardian Distance
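
The formula for this appendix is missing from the transcript. The sketch below uses the standard Jaccard distance, 1 − |A ∩ B| / |A ∪ B|, so that 0 means identical value sets and 1 means disjoint ones. Note that the slides state the threshold test in two ways (the diagram shows "Distance ≤", the text says "greater than"); the helper `best_column_distance` therefore only reports the distance and leaves the comparison to the caller. Both function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Standard Jaccard distance between two collections of values."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

def best_column_distance(source_column, candidate_table):
    """Smallest distance between the source column and any candidate column.

    candidate_table is assumed to be a non-empty list of columns.
    """
    return min(jaccard_distance(source_column, col) for col in candidate_table)
```

In practice the values would probably be normalized (case, whitespace) before comparison; that step is omitted here.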

Extend Algorithms (3)
[Diagram: MultiJoin takes the topic keyword k, runs a Web search for every pair (c.cell, k) to obtain an ordered list of relevant relations, and applies Clustering to produce clusters of relations, ordered by relevance and JoinScore]

Extend Algorithms (4)
MultiJoin
– Attempts to join each tuple in the source table T with a potentially different table, so it can handle the case when there is no single joinable table.
– Issues a distinct Web search query for every (c.cell, k) pair.
– Clusters the results.
– Ranks the clusters, using a combination of the relevance score for the ranked table and a join score for the cluster. JoinScore counts how many unique values from T's column c elicited tables in the cluster via the Web search step.
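
JoinScore as described above can be sketched in a few lines. The data layout is an assumption made for illustration: `elicited_by` maps each unique value of column c to the set of table identifiers returned by the Web search for (value, k), and a cluster is a set of table identifiers. How JoinScore is combined with the relevance score is not specified in the slides, so that step is left out.

```python
def join_score(cluster, elicited_by):
    """Count the unique source values whose search returned a table in the cluster.

    cluster: set of table identifiers (hypothetical representation).
    elicited_by: dict mapping each unique value of T's column c to the
        set of table identifiers its (value, k) search returned.
    """
    return sum(1 for tables in elicited_by.values() if tables & cluster)
```

Because dictionary keys are unique, each source value is counted at most once, however many tables in the cluster it elicited.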

Extend Algorithms (5)

IMPLEMENTATION AT SCALE

Implementation at Scale
Question: Can Octopus ever provide low latencies for a mass audience?
Challenges
– Traditional relevance-based Web search challenges
– Non-adjacent SCP computations for the Search SCPRank algorithm
– Multi-query Web searches for
   – the Context RVP algorithm
   – the Extend MultiJoin algorithm
Search engines can afford to spend a huge amount of resources in order to quickly process a single query, but the same is not true for one Octopus user who generates tens of thousands of queries.
– Case 1: two small prototype back-end systems
– Case 2: approximation techniques to make it computationally feasible

Non-adjacent SCP computations
– It is not feasible to precompute word-pair statistics: even for pairs of tokens alone, each sampled document would yield O(w^2) unique token combinations.
– Miniature search engine that fits entirely in memory
   – 100 GiB of RAM over 100 machines
   – A few billion Web pages
   – No absolute precision for hit-count numbers (memory is saved by representing document sets using Bloom filters)

Embedded Appendix: Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. A query can return:
– "inside set (may be wrong)"
– "definitely not in set"
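
A minimal Bloom filter sketch showing the two possible answers. It is a generic illustration, not the data structure used in Octopus: the bit-array size, the number of hash functions, and the use of salted SHA-256 digests as hash functions are all arbitrary choices for this example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "inside set (may be wrong)"; False means "definitely not in set".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

An added item is always reported as present, because its bits are never cleared. An item that was never added can still be reported as present if other items happened to set all of its bits, which is why hit counts computed from such sets are only approximate.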

Multi-Query web searches
– The naïve Context RVP implementation requires r*d Web searches
   – r: number of tables processed by Context
   – d: average number of sampled non-numeric data cells in each table
– d can be kept at fairly low values (e.g., 30)
– RVP offers a real gain in quality
– MultiJoin has a smaller problem, as it needs 1 query per row

EXPERIMENTS

Experiments
The goal is to evaluate the quality of the results generated by each Octopus operator.
Collecting queries
– A diverse query load was collected from Web users, using Amazon Mechanical Turk.
– Each user suggested:
   – a topic for a data table
   – 2 distinct URLs that provide example tables

Experiments (2)

Ranking Experiments
– The ranking phase of Search was run on each of the above 52 queries, first using SimpleRank, then SCPRank.
– Two judges, drawn from Amazon Mechanical Turk, labeled each table's relevance to the query on a scale of 1 to 5.
– A table was marked as relevant only when both judges gave a score of 4 or higher.

Ranking Experiments (2)
Results
– SCPRank performs substantially better than SimpleRank, especially in the top-2 case.
– The extra computational overhead clearly offers real gains in result quality.

Clustering Experiments
– Queries were issued and a sorted list of tables was obtained, using SCPRank.
   – The best table for each result was chosen manually and used as the center input to the clustering system.
– Cluster quality was assessed by computing the percentage of queries in which a k-sized cluster contains a table that is "highly similar" to the center.
– Whether a table is "highly similar" was determined by asking two users from Amazon Mechanical Turk to rate the similarity of the pair on a scale of 1 to 5.
– A table was marked as "highly similar" only when both judges gave a score of 4 or higher.

Clustering Experiments (2)
Results
– k is the cluster size: the system has only k "guesses" to find a table that is similar to the center.
– There is little variance in quality across all algorithms.

Context Experiments
– The top-1 relevant table per query was used.
– Two of the authors manually reviewed each table's source page, noting terms that appeared to be useful context values.
– The values that both reviewers noted were added to the test set of true context values.
– Within the test set, there is a median of 3 test context values per table.
– The measure is the percentage of tables for which a true context value is included in the top-k context terms generated by each algorithm.

Context Experiments (2)
Results
– Context can adorn a table with useful data from the surrounding text over 80% of the time.
– Although the outputs of RVP and SignificantTerms are not disjoint, RVP is able to discover new context terms that were missed by SignificantTerms.
– SignificantTerms does not yield the best output quality, but it is still efficient and very easy to implement.

Extend Experiments
– A small number of queries that appear to be Extend-able were chosen.
– The top-1 ranked "relevant" table returned from Search was used.
– The join column c and the topic keyword k were chosen by hand, opting for values that appear to be amenable to Extend processing.

Extend Experiments (2)
Results
– JoinTest (which tries to find a single satisfactory table) only found extended tuples in 3 cases:
   – Countries
   – US Cities
   – UK Political Parties
– In these 3 cases, 60% of tuples were extended.
– MultiJoin found extended data for all cases.
– On average, 33% of the source tuples were extended.
– MultiJoin has a lower rate of tuple extension than JoinTest.
– MultiJoin finds an average of 45.5 correct extension values for every successfully extended source tuple.
– MultiJoin's per-tuple approach gives it flexibility.
– With MultiJoin, fewer rows may be extended, but at least some data can be found.

Experiments Summary
– It is possible to obtain high-quality results for all three Octopus operators.
– Even with imperfect outputs, Octopus improves the productivity of the user.
– Promising areas of future research:
   – Output quality
   – Algorithmic runtime performance

RELATED WORK

Related Work
– Data integration on the Web, known as "mashup", is an increasingly popular area of work.
– Yahoo Pipes allows the user to graphically describe the flow of data (structured data only).
– CIMPLE is a data integration system for Web use, designed to construct community websites.

CONCLUSIONS

Conclusions
– Octopus allows the user to integrate data from many unstructured data sources.
– It offers access to orders of magnitude more data sources and frees the user from having to design, or even know about, a mediated schema.

Questions