ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.


Outline  Introduction, opportunities and challenges with Web data  The Deep Web  Vertical search  Surfacing the Deep Web  Creating topical portals  Lightweight data management on the Web  Discovery of data sets  Extracting data from Web pages  Combining multiple data sets  Re-using others’ work

Broad Range of Data on the Web

Key Characteristics  Scale and heterogeneity  Data is about everything! Overlapping sources, varying levels of quality.  Multiple formats (tables, lists, cards, etc.)  Data is laid out for visual appeal  Extracting the data is very tricky!  Semantics of the data are rarely specified and need to be inferred from text and other clues.

Different Forms of Structured Data on the Web

Tables: hundreds of millions good ones

Databases Behind Forms  The Deep/Invisible Web  Examples: store locations, used cars, radio stations, patents, recipes  Tens of millions of high-quality forms

HTML Lists Every list item is a row in a table, but figuring out cell boundaries is very tricky.

Structured data embedded more loosely in pages. Extraction is very tricky!

What Can we do with Structured Web Data?  Integrate:  Imagine integrating your data with any data on the Web!  Insights come when independently developed data sets come together  (of course, you can also get garbage that way, so you need to be careful).  Improve web search  Find tables & lists when they’re relevant to queries  Answer fact-seeking queries with facts rather than links to Web pages.  Aggregate: answer “total GDP of 10 largest countries” by putting together facts from multiple pages

Bigger Vision: create an ecosystem of structured data on the Web  Discover via search  Extract from Web sources  Manage, analyze, visualize, integrate, create compelling stories  Publish back to the Web

Outline Introduction, opportunities and challenges with Web data  The Deep Web  Vertical search  Surfacing the Deep Web  Creating topical portals  Lightweight data management on the Web  Discovery of data sets  Extracting data from Web pages  Combining multiple data sets  Re-using others’ work

What is the Deep Web?  Content hidden behind HTML forms, not accessible to search engines.

The Deep Web  The collection of databases that users access by entering values into HTML forms.  Search engine crawlers cannot fill in the forms, so the content is invisible to search engines.  The work on the Deep Web illustrates many of the challenges of managing Web data.

Two Approaches to the Deep Web  Build a vertical search engine:  Apply all the data integration techniques we’ve learned so far to a set of data sources such as job sites, airplane reservations, etc.  The approach is applicable to domains that have thousands of form sites.  Surface the content:  Try to guess good queries to pose to the forms. Insert the resulting HTML pages into the Web index.  The approach covers the long tail of content on the Web.

Approach #1: Vertical Search: Data Integration

Vertical Search as Data Integration  Mediated schema: the properties of the domain that need to be exposed to the user  If you include too many attributes in the mediated schema, you may not be able to query them on many sources.  Source descriptions: relatively simple. Sources are often distinguished by their geographical coverage.  Wrappers:  Parsing the answers from the resulting HTML is the tricky part.  Alternate approach: don’t parse the answers. Just show the user the returned web pages.
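The mediated schema, source descriptions, and query reformulation described on this slide can be sketched roughly as follows. The two job sites, their form parameter names, and their coverage sets are hypothetical and exist only for illustration; this is a minimal sketch, not the design of any particular vertical search engine.

```python
from urllib.parse import urlencode

# Hypothetical source descriptions: form URL, mapping from mediated-schema
# attributes to the source's form inputs, and geographical coverage.
SOURCES = [
    {"url": "http://jobs-a.example/search",
     "mapping": {"title": "kw", "state": "st"},
     "states": {"CA", "WA"}},
    {"url": "http://jobs-b.example/find",
     "mapping": {"title": "q", "city": "town", "state": "region"},
     "states": {"NY"}},
]

def reformulate(query):
    """Translate a mediated-schema query into one form URL per relevant source."""
    urls = []
    for src in SOURCES:
        # Skip sources whose geographical coverage excludes the query.
        if "state" in query and query["state"] not in src["states"]:
            continue
        # Keep only the attributes that this source's form can accept.
        params = {src["mapping"][attr]: value
                  for attr, value in query.items() if attr in src["mapping"]}
        urls.append(src["url"] + "?" + urlencode(params))
    return urls

urls = reformulate({"title": "nurse", "city": "Seattle", "state": "WA"})
```

Note that the `city` attribute is silently dropped for the first source, which illustrates the slide's warning: attributes in the mediated schema that few sources support cannot be queried on many sources.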

Deep Web: the Long Tail  [Figure: examples of long-tail topics: tree search, Amish quilts, parking tickets in India, horses]

The Surfacing Approach  Crawl & Indexing time  Pre-compute interesting form submissions  Insert resulting pages into the Web Index  Query time: nothing!  Deep web URLs in the Index are like any other URL  Advantages  Reuse existing search engine infrastructure  Reduced load on target web sites – users click only on what they deem relevant.  Approach taken at Google for the long tail.

Surfacing Challenges  1. Predicting the correct input combinations: generating all possible URLs is wasteful and unnecessary. Cars.com has ~500K listings, but 250M possible queries.  2. Predicting the appropriate values for text inputs: valid input values are required for retrieving data, e.g., ingredients in recipes.com and zipcodes in borderstores.com.  3. Don’t do anything bad!  4. Coverage of the crawl: don’t try to cover sites in their entirety; it’s not necessary. (a) Once you get part of the content, there will be links to the rest. (b) It’s enough to have part of the content in the index to send it relevant traffic.

Form Processing 101  GET and POST: the two submission methods of HTML forms  Only GETs can be surfaced  … On submit, the form produces the URL: http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
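A GET form submission is just a URL whose query string lists the form's inputs, which is why such submissions can be inserted into a Web index. The sketch below rebuilds the store-locator URL from the slide with Python's standard `urllib.parse.urlencode`; the helper name `get_form_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode

def get_form_url(action, inputs):
    """Build the URL a browser would request when a GET form is submitted."""
    return action + "?" + urlencode(inputs)

# Input names and values taken from the slide's example URL.
url = get_form_url(
    "http://www.borders.com/locator",
    {"store": "All", "city": "", "state": "", "zip": "94043",
     "within": "25", "search": "Go", "site": "homepage"},
)
```

A POST form sends the same name/value pairs in the request body instead, so there is no distinct URL to index.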

Google's Deep-Web Crawl (VLDB 2008) Predicting Input Combinations  Forms can have multiple inputs  Generating all possible URLs is wasteful … and unnecessary!  Goal: minimize URLs while maximizing retrieval!  Other considerations  Generated URLs must be good candidates for the index  Only need URLs sufficient to drive traffic  Only need URLs sufficient to seed the web crawler  Solution: discover only informative input combinations.

Informative Form Fields
 Varying state: result pages are different, so the field is informative
http://jobs.shrm.org/search?state=All&kw=&type=All
http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All
 Varying type: result pages are similar, so the field is uninformative
http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT
 Varying the state results in qualitatively different content, and hence it is an informative field.

Computing Informative Field Combinations  Informative field combinations can be computed bottom up:  Begin with single fields and find which ones are informative.  For every informative combination, try to add another field and check if the resulting combination is still informative.  In practice, we rarely need combinations of more than 3 fields.
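The bottom-up search can be sketched as below. The informativeness test used here (the fraction of distinct result pages among all submissions must reach a threshold), the threshold value, and the toy `fetch` function are assumptions for illustration; a real crawler would issue HTTP requests and compare content signatures of the returned pages.

```python
from itertools import product

def is_informative(fields, domains, fetch, threshold=0.5):
    """A field set is informative if enough of its submissions yield distinct pages."""
    submissions = [dict(zip(fields, values))
                   for values in product(*(domains[f] for f in fields))]
    signatures = {fetch(s) for s in submissions}
    return len(signatures) / len(submissions) >= threshold

def informative_combinations(domains, fetch, max_size=3):
    """Bottom-up search: only extend combinations that are already informative."""
    frontier = [(f,) for f in sorted(domains)
                if is_informative((f,), domains, fetch)]
    found = list(frontier)
    for _ in range(max_size - 1):
        candidates = set()
        for combo in frontier:
            for f in domains:
                if f not in combo:
                    candidates.add(tuple(sorted(combo + (f,))))
        frontier = [c for c in sorted(candidates)
                    if is_informative(c, domains, fetch)]
        found.extend(frontier)
    return found

# Toy stand-in for a form: the result page depends only on the state input.
domains = {"state": ["AL", "AK", "WV"], "type": ["ALL", "ANY", "EXACT"]}

def fetch(submission):
    return "jobs in " + submission.get("state", "All")

result = informative_combinations(domains, fetch)
```

In this toy setup only `state` survives: varying `type` never changes the page, and adding `type` to `state` triples the number of URLs without producing new content.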

Google's Deep-Web Crawl (VLDB 2008) Challenge 2: Generic and Typed Text boxes  Generic Search Boxes  Accept any keywords  Challenge: selecting the most appropriate values  Typed Text Boxes  Only values belonging to specific types, e.g., zipcodes  Challenge: selecting the type of the input

Google's Deep-Web Crawl (VLDB 2008) Example: www.wipo.int

Input values for Generic Search  Iterative Probing for search boxes  Select an initial list of candidate keywords  Download pages based on current set of keywords  Extract more candidate keywords from result pages  Refine the current set of keywords  Repeat until no more new candidate keywords  Prune list of candidate keywords
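The probing loop on this slide can be sketched as follows. The `search` callable stands in for submitting a keyword to the form's search box, and the pruning rule (drop keywords that retrieved nothing, then cap the list) is an assumption; the slide does not say how the list is pruned.

```python
import re

def iterative_probing(seeds, search, max_rounds=5, max_keywords=50):
    """Grow a keyword list by probing a search box and mining the result pages."""
    hits = {}                                   # keyword -> number of result pages
    frontier = list(dict.fromkeys(seeds))       # initial candidates, deduplicated
    for _ in range(max_rounds):
        candidates = []
        for kw in frontier:
            pages = search(kw)                  # download pages for this keyword
            hits[kw] = len(pages)
            for page in pages:                  # extract more candidate keywords
                candidates.extend(re.findall(r"[a-z]+", page.lower()))
        # Keep only words that have not been probed yet.
        frontier = [w for w in dict.fromkeys(candidates) if w not in hits]
        if not frontier:                        # no new candidates: stop
            break
    # Prune: keep keywords that retrieved at least one page, up to a cap.
    productive = [kw for kw, n in hits.items() if n > 0]
    return productive[:max_keywords]

# Toy stand-in for a site's search box over three hypothetical documents.
DOCS = ["protein antibody", "antibody viscosity", "ozonizer"]

def search(keyword):
    return [d for d in DOCS if keyword in d.split()]

keywords = iterative_probing(["protein"], search)
```

The toy run also shows a limit of the approach: the document containing only "ozonizer" is never reached, because no retrieved page links the seed vocabulary to it.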

Example: www.wipo.int Metalworking Protein Antibody Pyrazole Immobilizer Vasoconstriction Phosphinates Nosepiece Sandbridge Viscosity Carboxydiphenylsulphide Ozonizer …

Outline Introduction, opportunities and challenges with Web data The Deep Web  Vertical search  Surfacing the Deep Web  Creating topical portals  Lightweight data management on the Web  Discovery of data sets  Extracting data from Web pages  Combining multiple data sets  Re-using others’ work

Topical Portals  An integrated view of a topic:  E.g., all info about database researchers, or all info about coffees and their growing regions.  Topical portals find different aspects of the same objects on different sources  E.g., publications of a person may come from one source, while their job affiliations may come from another  In contrast, vertical search integrates similar objects from multiple sources  E.g., job listings, apartments for rent, …

Topical Portal: example Integrated Page for an Entity

Building a Topical Portal  Approach #1:  Perform a focused crawl of the Web to find pages on the topic  Use word signatures as a method for determining the topic of a page.  Use information extraction techniques to get the data out of the pages.  Perform reference resolution and schema matching to create a cleaner set of data.
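One simple way to realize a word signature is a bag-of-words vector compared by cosine similarity. The sketch below assumes that representation and an arbitrary threshold; the slide does not specify how signatures are built or compared.

```python
import math
import re
from collections import Counter

def signature(text):
    """Word signature of a text: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def on_topic(page_text, topic_signature, threshold=0.3):
    """Decide whether a crawled page should be kept by the focused crawler."""
    return cosine(signature(page_text), topic_signature) >= threshold

# Hypothetical topic signature built from a few seed words.
topic = signature("database query schema integration database")
keep = on_topic("a schema for the database query", topic)
skip = on_topic("coffee beans roasted", topic)
```

A focused crawler would follow links only from pages for which `on_topic` holds, which keeps the crawl near the topic.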

Creating a Topical Portal  Approach #2:  Start with a set of well known sites in the domain  Create an initial schema for the domain (the properties you’re interested in modeling)  Create extractors for pages on the known sites  Note: extractors will be more accurate because they were created for the sites themselves  Result: a good basis of entities and relationships to build on.  Extend the initial data set:  Follow references from the initial set of chosen pages  Use collaboration (of people in the community) to find additional data and to correct extractions.

Outline Introduction, opportunities and challenges with Web data The Deep Web  Vertical search  Surfacing the Deep Web Creating topical portals  Lightweight data management on the Web  Discovery of data sets  Extracting data from Web pages  Combining multiple data sets  Re-using others’ work

Lightweight Combination of Web Data  With such a vast collection of data, we would like to enable easy data integration.  Imagine a school student combining her data about bird species with a country population table found on the Web  A journalist creating a news story with data about riots in the UK and needing to combine it with demographic data  …  Many data integration tasks are transient: the result will be used for a short period of time only  Hence, creating the integrated data must be easy. Creating a mediated schema and mappings is too tedious.

Challenges to Data Integration on the Web  Discovering data on the Web (search engines are optimized for documents, not tables or lists)  Extracting the data from the Web pages into a form that can be processed  Combining multiple data sets  Unique opportunities on the Web: re-use work of others!

Not a great result!

But the data does exist out there!

Discovering Data on the Web  Search engines are optimized for documents  E.g., proximity of terms matters in ranking. In tables, the schema applies to all rows. “zambia” is far from “population” in a document containing population data, but should be considered close.  No special attention is given to schema rows (if they can be detected) or columns closer to the left of the table (that are often the “subject” of the table).  Tables with high quality data look like ones that are used for formatting.  Over 99% of the HTML tables on the Web are not high quality data tables!

Challenges to Discovering the Semantics of Structured Data on the Web

Semantics Embedded in Surrounding Text Topic of table is in the text, and the token “2006” is crucial to understanding the data.

No schema, but the table is perfectly understandable to people.

Structured Data can be Plain Complicated!

HTML Tables used for Formatting

“Vertical” Tables: one tuple of a bigger table

Can’t Use Domain Knowledge: Data is about Everything  [Figure: example topics: tree search, Amish quilts, parking tickets in India, horses]

Search by Tweaking Traditional Document Search  Consider new cues in ranking:  Hits on left column  Hits on schema (where there is one)  Number of rows, columns  Hits on table body  Size of table relative to page  But we can do better: try to recover the underlying semantics of the data.
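The ranking cues listed on this slide could be combined into a score along the following lines. The weights, the small-table penalty, and the function names are assumptions chosen for illustration, not values from any deployed ranker.

```python
def hits(terms, cells):
    """Number of query terms that occur somewhere in the given cells."""
    text = " ".join(cells).lower()
    return sum(1 for t in terms if t.lower() in text)

def table_score(terms, header, rows):
    """Score a table for a query, weighting schema and left-column hits higher."""
    left = [r[0] for r in rows]
    body = [c for r in rows for c in r[1:]]
    score = (3.0 * hits(terms, header)    # hits on the schema row
             + 2.0 * hits(terms, left)    # hits on the left ("subject") column
             + 1.0 * hits(terms, body))   # hits on the table body
    if len(rows) < 2:                     # tiny tables are often used for layout
        score *= 0.5
    return score

query = ["zambia", "population"]
data_table = table_score(query, ["Country", "Population"],
                         [["Zambia", "v1"], ["Kenya", "v2"]])
layout_table = table_score(query, [], [["Zambia population page", "link"]])
```

Here "zambia" and "population" never appear close together in the data table, yet it outranks the layout table because the hits fall on the schema row and the left column.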

Recovering Table Semantics: cells on the Web are mentioned in Web text  If we see these patterns enough times, we can infer that Green Ash is a North American species.

Recovering Table Semantics: cells on the Web are mentioned in Web text  If we infer that a large fraction of the left-column cells are North American tree species, we can infer that the table is about these tree species, even though this is not mentioned on the page!
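The inference on this slide amounts to a vote: each cell contributes the class labels that Web text assigns to it, and the column takes the label that covers a large fraction of its cells. The sketch below assumes the is-a facts have already been mined from text patterns; the `isa_facts` dictionary and the threshold are hypothetical.

```python
from collections import Counter

def column_class(cells, isa_facts, min_fraction=0.5):
    """Label a column with the class that most of its cells belong to."""
    votes = Counter()
    for cell in cells:
        for label in isa_facts.get(cell.lower(), ()):
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    # Accept the label only if it covers a large fraction of the column.
    return label if count / len(cells) >= min_fraction else None

# Hypothetical facts, as if mined from text patterns such as "X is a Y".
isa_facts = {
    "green ash": {"north american tree species"},
    "white oak": {"north american tree species", "hardwood"},
    "red maple": {"north american tree species"},
}
label = column_class(["Green Ash", "White Oak", "Red Maple", "Unknown"], isa_facts)
```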

Extracting Data from the Page  In the case of tables, it’s fairly easy  Main challenge: decide if there is a row with attribute names  Lists are tricky: punctuation and formatting do not always provide the right cues for partitioning a list element into cells.  Structured data in cards: in general, it’s an information extraction problem.
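Deciding whether a table has a row of attribute names can be approached with a type heuristic: a header row is textual while at least one column beneath it is numeric. This is one possible heuristic under that assumption, not a method stated on the slide, and it will miss tables whose data is entirely textual.

```python
def is_number(s):
    """True if the cell parses as a number (thousands separators allowed)."""
    try:
        float(s.replace(",", ""))
        return True
    except ValueError:
        return False

def has_header_row(rows):
    """Guess whether the first row holds attribute names rather than data."""
    if len(rows) < 2:
        return False
    first, rest = rows[0], rows[1:]
    if any(is_number(c) for c in first):
        return False                      # a header row should be textual
    for col in range(len(first)):
        values = [r[col] for r in rest if col < len(r)]
        if values and all(is_number(v) for v in values):
            return True                   # textual label above a numeric column
    return False

with_header = has_header_row([["Country", "Population"], ["A", "1,000"], ["B", "2500"]])
without_header = has_header_row([["A", "1,000"], ["B", "2500"]])
```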

Structured Data in Cards

Copy & Paste Approach: Extraction by Demonstration  Using previous slide as example.  Start by copying “Four Barrel” into a column of a spreadsheet.  System tries to generalize and suggest other café names: Sightglass, Blue Bottle, Ritual.  Next, the user copies the address of Four Barrel into the next column of the spreadsheet  System generalizes…  Etc.
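A very small version of the generalization step: once the user copies one value, the system finds where that value sits inside its card and pulls the value at the same position from every other card. Representing each card as a list of text fields is a simplifying assumption, and the address strings are placeholders; a real system would generalize over the page's markup.

```python
def generalize(cards, example):
    """Suggest values from all cards at the position where the example was found."""
    for card in cards:
        if example in card:
            position = card.index(example)
            return [c[position] for c in cards if position < len(c)]
    return []                             # example not found on the page

# Each card is modeled as a list of its text fields (placeholder addresses).
cards = [
    ["Four Barrel", "address A"],
    ["Sightglass", "address B"],
    ["Blue Bottle", "address C"],
    ["Ritual", "address D"],
]
names = generalize(cards, "Four Barrel")
addresses = generalize(cards, "address A")
```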

Combining Multiple Data Sets  First, find related data sets. Depending on the context, you may be looking for:  Data sets to join with (add new columns)  Data sets to union with (add new rows)  Specifying the join:  Again, by demonstration. Drag and drop a cell from one table into another.  Reference reconciliation is a big challenge:  Use reference data such as Freebase?
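The difference between joining (adding columns) and unioning (adding rows) can be shown with plain lists of dictionaries. The `normalize` function is a deliberately naive stand-in for reference reconciliation, and the sample rows are hypothetical placeholders.

```python
def normalize(value):
    """Naive reference reconciliation: ignore case and extra whitespace."""
    return " ".join(value.lower().split())

def join(left, right, key):
    """Add the columns of `right` to the rows of `left` that match on `key`."""
    index = {normalize(r[key]): r for r in right}
    joined = []
    for row in left:
        match = index.get(normalize(row[key]))
        if match is not None:
            joined.append({**match, **row})   # left values win on conflicts
    return joined

def union(a, b):
    """Add the rows of `b` to `a`, padding missing columns with None."""
    columns = list(dict.fromkeys(k for r in a + b for k in r))
    return [{c: r.get(c) for c in columns} for r in a + b]

birds = [{"country": "Zambia", "species": "s1"}]
population = [{"country": " zambia ", "population": "p1"}]
joined = join(birds, population, "country")
stacked = union([{"a": 1}], [{"b": 2}])
```

The join only succeeds here because `normalize` reconciles "Zambia" with " zambia "; real data needs far stronger matching, which is why the slide calls reference reconciliation a big challenge.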

Re-Using Work of Others  Most good data sets will get extracted more than once:  Re-use the work done by other extractors  Data cleaning can be a collaborative effort  Data sets that get integrated often are probably high quality – leverage that signal  With 200M tables on the Web, you can mine their schemas to find attribute synonyms and common schematic patterns.
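One heuristic for mining attribute synonyms from a large schema collection is that synonyms rarely appear together in the same table but tend to appear alongside the same other attributes. The sketch below assumes that heuristic and a hypothetical two-schema corpus; it produces candidates only and would yield false positives on real data.

```python
from collections import defaultdict
from itertools import combinations

def synonym_candidates(schemas, min_shared=2):
    """Pairs of attributes that never co-occur but share enough context."""
    context = defaultdict(set)   # attribute -> attributes seen alongside it
    together = set()             # attribute pairs seen in the same schema
    for schema in schemas:
        attrs = set(schema)
        for a in attrs:
            context[a] |= attrs - {a}
        together |= {frozenset(pair) for pair in combinations(attrs, 2)}
    candidates = []
    for a, b in combinations(sorted(context), 2):
        if frozenset((a, b)) in together:
            continue             # attributes used together are not synonyms
        if len(context[a] & context[b]) >= min_shared:
            candidates.append((a, b))
    return candidates

schemas = [
    ["name", "address", "phone"],
    ["name", "address", "telephone"],
]
pairs = synonym_candidates(schemas)
```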

Summary of Chapter 15  Structured data on the Web is an incredible collection of data  More is coming online because organizations and governments are being encouraged to publish data  Data comes with little or no semantics  Huge challenge when you try to make sense of it  Key emphasis: create data management tools that anyone can use  Data is no longer just for database experts!
