1
The Semantic Web - Week 21
Building the SW: Information Extraction and Integration
Module Website: http://scom.hud.ac.uk/scomtlm/chs2533
Practical this week: http://www.isi.edu/info-agents/
2
Recall the slide from Week 18 on Content Acquisition:
- The semantic web needs populating with content. How can this be done, given that people in general don't understand description logic / FOL?
- There are two types of content:
  - A. NEW knowledge
  - B. OLD information in existing, structured formats
- We concentrated on A, although most content, initially at least, will come through B. We will return to this later…
- Actually, we can also acquire semantic web content from OLD information in existing semi-structured or unstructured formats, e.g. in HTML form.
3
Information Extraction
- Information extraction is the process of extracting “meaningful” data from raw or semi-structured text.
- Two extremes:
  - “Natural Language Understanding”: take raw (English) text and turn it into some logic representing its meaning. In SW terms, raw text => OWL.
  - “Feature Extraction”: extract a particular piece of data from a semi-structured or unstructured document, e.g. extract an address from a standard web page. In SW terms, HTML => XML.
4
Information Extraction
Example: You're on eBay and you want a toilet cistern and a wash basin that have a combined width of under 90cm.
Solution: waste all Sunday afternoon going through 673 entries for “toilet” looking for widths, and cross-checking with 923 entries for “wash basin”!
The Web's HTML content makes it difficult to retrieve and integrate data from multiple sources.
Information Agents are capable of retrieving info from some web sites via database-like queries (such as the one required in the example above).
The Agent uses a wrapper to extract the information from a collection of similar-looking Web pages.
The wrapper ~ grammar of the data in the web site + code to utilize the grammar.
This is equivalent to turning HTML => XML + DTDs!
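The "wrapper = grammar + code" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the listing snippets, the item names and the width pattern are all invented here, and a real eBay page would need a far more robust grammar.

```python
import re
from itertools import product

# Assumed "grammar": the word "width", optional colon/space, a number, then "cm".
WIDTH_PATTERN = re.compile(r"width[:\s]*(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE)

def extract_width(listing_html):
    """Return the width in cm found in a listing, or None if the grammar does not match."""
    text = re.sub(r"<[^>]+>", " ", listing_html)  # crude tag stripping
    match = WIDTH_PATTERN.search(text)
    return float(match.group(1)) if match else None

def pairs_under(cisterns, basins, limit_cm):
    """Database-like query: (cistern, basin, total) triples with combined width under limit_cm."""
    result = []
    for (c_name, c_html), (b_name, b_html) in product(cisterns.items(), basins.items()):
        c_width, b_width = extract_width(c_html), extract_width(b_html)
        if c_width is not None and b_width is not None and c_width + b_width < limit_cm:
            result.append((c_name, b_name, c_width + b_width))
    return result

# Hypothetical listings, standing in for the 673 + 923 real entries.
cisterns = {
    "cistern-A": "<li>Close-coupled cistern, <b>Width:</b> 38cm</li>",
    "cistern-B": "<p>Cistern width 45 cm, white</p>",
}
basins = {
    "basin-X": "<td>Basin</td><td>width: 50cm</td>",
    "basin-Y": "<p>Large basin, Width 56 cm</p>",
}

print(pairs_under(cisterns, basins, 90))
```

The regular expression plays the role of the grammar and the two functions are the code that uses it; the weakness is that the pattern was written by hand for one kind of page.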
5
Example of Automated Extraction
Source: HTML ======> Destination: XML
Source: Hebden Bridge West Yorks UK 01422 843222 #350,000 Bijou residence on the edge of this popular little town...... Residential Housing House For Sale
Destination:
location: Hebden Bridge
agent-phone: 01422 843222
listed-price: #350,000
comments: Bijou residence on the edge of this popular little town...
House For Sale......
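A rough sketch of the HTML => XML step for this example follows. The source markup is an assumption (the slide's actual tags are not available), as are the per-field rules and the root element name `house-for-sale`; only the field names and values come from the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed source markup: these tags are invented for illustration.
SOURCE_HTML = (
    "<h1>Hebden Bridge</h1> West Yorks UK "
    "<b>01422 843222</b> <i>#350,000</i> "
    "<p>Bijou residence on the edge of this popular little town...</p>"
)

# Hand-written wrapper rules: one regular expression per target field.
RULES = {
    "location": r"<h1>(.*?)</h1>",
    "agent-phone": r"<b>(.*?)</b>",
    "listed-price": r"<i>(.*?)</i>",
    "comments": r"<p>(.*?)</p>",
}

def html_to_xml(html, rules, root_tag="house-for-sale"):
    """Apply each rule to the HTML and emit the matched fields as an XML string."""
    root = ET.Element(root_tag)
    for field, pattern in rules.items():
        match = re.search(pattern, html, re.DOTALL)
        if match:
            ET.SubElement(root, field).text = match.group(1).strip()
    return ET.tostring(root, encoding="unicode")

xml_text = html_to_xml(SOURCE_HTML, RULES)
print(xml_text)
```

Once the data is in XML, a query such as "listed-price for location X" becomes a simple tree lookup rather than a text search.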
6
Information Integration
Example: Consider the problem of travel planning on the Web. There are a huge number of travel sites, with different types of information:
- Site 1: hotel and flight information
- Site 2: airports that are closest to your destination
- Site 3: directions to your hotel
- Site 4: weather in the destination city
- etc.
Information Agents are capable of retrieving and integrating info from web sites to solve complex queries.
ISI built an ‘information agent’ which performs this function. See the University of Southern California's Information Sciences Institute (ISI): Heracles project (http://www.isi.edu/info-agents/).
The technology is based on Information Extraction + Integration.
7
Information Extraction
How can we create tools to ‘extract meaningful data’ from the current Web for (a) populating the SW? (b) inputting to information agents?
(1) Write a tool to extract data… BUT we would have to write a tool for every type of data / every type of web page, e.g. a C program to process every eBay page on toilets and output the width. This is far too specific!
(2) ISI's idea: write a tool to ‘learn’ the format of web pages and/or particular fields. The user is given, or acquires, ‘good examples’ of web pages. The user points out the fields to be learned. The tool builds up a characterisation of the formats from the examples and uses this to recognize and extract data from similar web pages.
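To make option (2) concrete, here is a deliberately simple sketch of learning from marked-up examples. It is not ISI's algorithm: it assumes the field is always surrounded by the same literal text, and it learns those left and right "landmarks" as the longest text shared by all examples. The page snippets and function names are invented for illustration.

```python
import os

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def learn_landmarks(examples):
    """examples: list of (page, field_value) pairs pointed out by the user.
    Returns the (left, right) text found around the field in every example."""
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)            # first occurrence of the marked field
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    return common_suffix(befores), os.path.commonprefix(afters)

def extract(page, landmarks):
    """Use learned landmarks to pull the field out of an unseen, similar page."""
    left, right = landmarks
    if not left or not right:
        return None                           # nothing useful was learned
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None

# Hypothetical 'good examples' with the width field pointed out by the user.
examples = [
    ("<li>Cistern, white <b>Width:</b> 38cm</li>", "38"),
    ("<li>Slimline cistern <b>Width:</b> 45cm</li>", "45"),
]
landmarks = learn_landmarks(examples)
print(landmarks)
print(extract("<li>Basin, corner <b>Width:</b> 50cm</li>", landmarks))
```

The same two functions work for any field on any site with regular formatting, which is the point of learning the wrapper rather than hand-coding one per page type.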
8
Similarity-based Learning
Algorithms that ‘learn’ = Machine Learning
[Diagram of machine learning approaches, labelled: Similarity-Based Learning, Explanation-Based Learning, Neural Networks, Learning from Examples, Learning by Observation, Rule Induction, Symbolic Learning, Sub-symbolic Learning, Genetic Algorithms]
9
Inductive Learning – Rule Induction from Examples
Roughly, the algorithm is as follows:
Input: a (large) number of +ve instances (examples) of concept C, plus (possibly) a number of –ve instances of C
Output: a characterization H of the examples, forming the rule H => C
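One very simple way to realise this input/output scheme is sketched below. It assumes, beyond what the slide states, that each instance is a set of features and that H is a conjunction of features; H is then taken to be the features shared by every +ve instance, and is rejected if any –ve instance also has them all. The instance data is borrowed from this deck's own a, b, c, d, e example.

```python
def induce(positives, negatives=()):
    """Return a conjunctive hypothesis H (a frozenset of features) with H => C,
    or None if the features common to all +ve instances also fit a -ve instance."""
    pos = [frozenset(p) for p in positives]
    if not pos:
        raise ValueError("at least one +ve instance is needed")
    hypothesis = frozenset.intersection(*pos)   # features every +ve instance has
    for neg in negatives:
        if hypothesis <= frozenset(neg):         # H would wrongly cover a -ve instance
            return None
    return hypothesis

# Each string is one instance; each character is one feature.
positives = ["dbc", "abcd", "ecba"]
negatives = ["abe", "de"]
print(sorted(induce(positives, negatives)))   # -> ['b', 'c']
```

Real rule-induction systems search a much larger space of hypotheses (for example disjunctions), so this should be read as the simplest instance of the idea.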
10
Inductive Learning – JARGON
Learning rule: H => C, where H is a ‘hypothesis’
-- H COVERS an instance x if x satisfies H.
-- H1 is a GENERALISATION of H2 if every instance that satisfies H2 also satisfies H1.
-- H is CONSISTENT if it covers no –ve instance.
-- H is COMPLETE if it covers all +ve instances.
-- H is CHARACTERISTIC if it is complete and consistent.
-- H is a MAXIMAL GENERALISATION if it is the most specific complete hypothesis.
Example: features – a, b, c, d, e
+ve instances of concept C: d&b&c, a&b&c&d, e&c&b&a
-ve instances: a&b&e, d&e
Give examples of consistent, complete and maximal hypotheses.
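The definitions above can be checked mechanically. The sketch below assumes that a hypothesis is a conjunction of features, written as a set, so that "x satisfies H" means every feature in H appears in x; the slide does not fix the hypothesis language, so this is only one possible reading.

```python
def covers(h, x):
    """H covers x if x has every feature that H requires."""
    return frozenset(h) <= frozenset(x)

def is_consistent(h, negatives):
    return not any(covers(h, n) for n in negatives)

def is_complete(h, positives):
    return all(covers(h, p) for p in positives)

def is_characteristic(h, positives, negatives):
    return is_complete(h, positives) and is_consistent(h, negatives)

def generalises(h1, h2):
    """For conjunctions, H1 generalises H2 when H1 requires a subset of H2's features."""
    return frozenset(h1) <= frozenset(h2)

# The slide's example; each character is one feature.
positives = ["dbc", "abcd", "ecba"]
negatives = ["abe", "de"]

for h in ["bc", "c", "b", "abc"]:
    print(h,
          "complete" if is_complete(h, positives) else "incomplete",
          "consistent" if is_consistent(h, negatives) else "inconsistent")
```

Under this reading, b&c and c are both characteristic, b is complete but not consistent (it covers a&b&e), and a&b&c is consistent but not complete (it misses d&b&c). Since b&c is the set of features shared by all +ve instances, it is the most specific complete conjunction.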
11
Inductive Learning – Learning features in HTML pages
In ML the input can be strings (learning a grammar) or assertions.
In Document Feature Extraction, order is important.
Document examples can be represented by sequences of tokens.
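As an illustration of representing a document as an ordered sequence of tokens, the sketch below splits an HTML fragment into tokens and labels each with a class. The class names and their definitions (for instance CAPS meaning an all-upper-case word) are assumptions made for this example, not ISI's actual token set.

```python
import re

# A token is an HTML tag, a run of word characters, or a single punctuation mark.
TOKEN_RE = re.compile(r"</?\w+[^>]*>|\w+|[^\w\s]")

def token_class(token):
    """Assign an (assumed) token class to one token."""
    if token.startswith("<") and token.endswith(">") and len(token) > 2:
        return "HTML-TAG"
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        if token.isupper():
            return "CAPS"
        if token.islower():
            return "LOWER"
        return "ALPHA"
    if token.isalnum():
        return "ALPHANUM"
    return "PUNCT"

def classify(document):
    """Return the document as an ordered list of (token, class) pairs."""
    return [(tok, token_class(tok)) for tok in TOKEN_RE.findall(document)]

print(classify("<b>UK</b> house 350"))
```

Because the result is a list, the order of the tokens is preserved, which is what lets a learner use the tokens before and after a field as clues to its position.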
12
Token hierarchy (from ISI's travel assistant)
e.g.
Example 1: ALPHA CAPS
Example 2: NUM LOWER
Inductive hypothesis: Title-tag ALPHANUM EndTitleTag ALPHA
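The step from the two examples to the inductive hypothesis can be sketched as "replace each pair of differing tokens by their nearest common ancestor in the hierarchy". The hierarchy below is an assumption chosen to agree with the slide's example (the original hierarchy diagram is not reproduced here), and placing the title tags inside the two example sequences is likewise a guess.

```python
# Assumed hierarchy: child class -> parent class. "ANY" is the implicit root.
PARENT = {
    "CAPS": "ALPHA",
    "LOWER": "ALPHA",
    "ALPHA": "ALPHANUM",
    "NUM": "ALPHANUM",
    "ALPHANUM": "ANY",
}

def ancestors(token_class):
    """The class itself followed by its ancestors, ending at the root ANY."""
    chain = [token_class]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    if chain[-1] != "ANY":
        chain.append("ANY")       # classes outside the hierarchy sit directly under ANY
    return chain

def generalise_pair(a, b):
    """Most specific class that covers both a and b."""
    b_chain = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_chain:
            return candidate

def generalise(seq1, seq2):
    """Position-by-position generalisation of two equal-length token sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("this sketch only handles sequences of equal length")
    return [generalise_pair(a, b) for a, b in zip(seq1, seq2)]

example1 = ["Title-tag", "ALPHA", "EndTitleTag", "CAPS"]
example2 = ["Title-tag", "NUM", "EndTitleTag", "LOWER"]
print(generalise(example1, example2))
# -> ['Title-tag', 'ALPHANUM', 'EndTitleTag', 'ALPHA']
```

Tokens that agree in both examples (the tags) are kept as they are, while differing tokens climb the hierarchy only as far as needed, which keeps the hypothesis as specific as the examples allow.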
13
Exercises
1. Find examples of consistent, complete and maximal hypotheses (if they exist) for the following example features:
(a) +ve instances of concept C: e&b&c, f&b&c&d, e&c&b&a; -ve instances: a&b&e, d&e&b
(b) +ve instances of concept C: b&c, a&b&d, e&c&b&a, c&b, e; -ve instances: a&b&e, d&e
(c) +ve instances of concept C: d&b&c, a&b&c&d, e&c&b&a&d; -ve instances: a&b&e&d, d&e&d
2. Run the demonstration of information extraction via the following website: http://www.isi.edu/info-agents
3. Look at the source of some web sites containing ‘regular’ data and fields and see if there are any learnable patterns that could help extract data. E.g. Amazon books: author; eBay toilet cisterns: dimensions.