
1 Information extraction from web pages using extraction ontologies
Martin Labský, University of Economics, Prague
KEG Seminar, 28th November 2006

2 Agenda
• Purpose
• Knowledge sources
• Extraction ontology
• Finding attribute candidates
• Instance parsing
• Wrapper induction
• Ex demo
• Discussion

3 Purpose
• Extract objects from documents
  – object = instance of a class from an ontology
  – document = text, possibly with formatting
• Objects
  – belong to known, well-defined class(es)
  – classes consist of attributes, axioms and constraints
• Documents
  – may come in collections of arbitrary size
  – structured, semi-structured or free text
  – extraction should improve if:
    · documents contain some formatting (e.g. HTML)
    · this formatting is similar within or across documents
• Examples
  – product catalogues (e.g. detailed product descriptions)
  – weather forecast sites (e.g. forecasts for the next day)
  – restaurant descriptions (cuisine, opening hours etc.)
  – contact information
  – financial news

4 Knowledge sources
• Why
  – for some attributes of a class it is often easier to obtain manual extraction knowledge than training data, and vice versa (experience from Bicycle product IE)
  – allow people to experiment with manually encoded patterns alone, so they can quickly find out whether an IE task is feasible; if so, training data can be added for the attributes that require it
• 1. Knowledge entered manually by an expert
  – the only mandatory source
  – class definitions + extraction evidence
• 2. Training data
  – sample attribute values or sample instances
  – possibly coupled with referring documents
  – used to induce typical content and context of extractable items, cardinalities and orderings of class attributes, ...
• 3. Common formatting structure
  – of observed instances
  – in a single document, or
  – across documents from the same source

5 Extraction ontology
• Attribute data types
  – assigned manually
• Cardinality ranges
  – assigned manually; cardinality probability estimates could be trained
• Patterns for the content and typical context of attributes
  – regular grammars at the level of words, lemmas, POS tags or word types (uppercase, capital, number, alphanumeric etc.)
  – phrase lists
  – attribute value lengths
  – the above equipped with probability estimates
  – assigned manually or induced from training data
• For numeric attributes
  – units
  – estimated probability distributions (e.g. tables, Gaussian)
  – assigned manually or trained
• Sample Ex ontologies: contacts_en.xml and monitors.eol.xml
  – see the class and attribute definitions, data types
  – ECMAScript axioms
  – regular pattern language
  – pattern precision and recall parameters
• Sample instances: monitors.tsv and *.html
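To make the bullet points above concrete, here is a minimal Java sketch of the in-memory model such an extraction ontology could map to: an attribute with a data type, a cardinality range, an engagement probability, and content/context patterns that carry precision and recall estimates. All class and field names are hypothetical and do not reflect the actual Ex ontology schema used by contacts_en.xml or monitors.eol.xml.

```java
// Illustrative sketch only: a plausible in-memory model for the ontology
// concepts listed above. Names are hypothetical, not Ex's real schema.
import java.util.List;
import java.util.regex.Pattern;

class EvidencePattern {
    enum Kind { CONTENT, CONTEXT }
    Kind kind;                 // matches the value itself, or its surroundings?
    Pattern regex;             // word-level regular pattern (simplified here to a Java regex)
    double precision;          // P(attribute | pattern matches)
    double recall;             // P(pattern matches | attribute)
}

class AttributeDefinition {
    String name;               // e.g. "phone" in a contact-page ontology
    Class<?> dataType;         // e.g. String.class, Double.class
    int minCardinality;        // allowed number of values per instance
    int maxCardinality;
    double engagedProbability; // P(engaged | A): value occurs as part of an instance
    List<EvidencePattern> patterns;
    List<String> phraseList;   // known literal values, if any
}
```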

6 Finding attribute candidates (1)
• Preprocessing
  – the document is tokenized
  – and parsed into a light-weight DOM (if HTML)
• Matching of the attributes' regular patterns
  – content and context patterns are matched
  – each pattern has a precision and a recall estimate:
• Pattern precision
  – estimates how often the pattern actually identifies a value of the attribute in question
  – P(attribute | pattern)
• Pattern recall
  – estimates how many values of the attribute in question satisfy the pattern
  – P(pattern | attribute)
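The two estimates above can be read in count form as follows; the concrete estimator shown here is an assumption (it mirrors the count-based formulas used for formatting patterns on slide 23), with E standing for "the pattern matches a token span", A for "the span is a value of the attribute" and C(·) for counts observed in annotated data:

```latex
\mathrm{precision}(E) = P(A \mid E) \approx \frac{C(E \wedge A)}{C(E)},
\qquad
\mathrm{recall}(E) = P(E \mid A) \approx \frac{C(E \wedge A)}{C(A)}
```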

7 Finding attribute candidates (2)
• Create a new attribute candidate (AC) wherever at least one pattern matches
• An AC for attribute A is scored by an estimate of P(A | patterns)
  – where patterns = the matched state of all patterns known for A
  – an independence assumption is made for all patterns E, F from the set Φ of patterns known for attribute A
  – AC score computation: for the derivation see ex.pdf (a reconstruction under the independence assumption is sketched below)
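The score formula itself appeared on the original slide as an image and is not reproduced in this transcript. One standard combination consistent with the stated independence assumption (the matched patterns in Φ treated as conditionally independent given A and given ¬A) expresses the AC score in odds form via the pattern precisions P(A | E); this is offered as a reconstruction under those assumptions, not as the exact formula derived in ex.pdf:

```latex
\frac{P(A \mid \Phi)}{1 - P(A \mid \Phi)}
  \;=\; \left(\frac{P(A)}{1 - P(A)}\right)^{1 - |\Phi|}
        \prod_{E \in \Phi} \frac{P(A \mid E)}{1 - P(A \mid E)}
```

where Φ is the set of matched patterns for A and P(A) is the prior probability of the attribute; solving the odds for P(A | Φ) yields the AC score.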

8 Finding attribute candidates (3)
• Most attributes can occur independently or as part of their containing class
  – each attribute is equipped with an estimate of P(engaged | A), e.g. 0.75
• Three ways of explaining an AC:
  – part of an instance; the AC score is then computed as P(A | patterns) * P(engaged | A)
  – standalone: P(A | patterns) * (1 - P(engaged | A))
  – mistake: 1 - P(A | patterns)
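A minimal numeric illustration of the three explanations (hypothetical helper class; the example values follow the slide):

```java
// Sketch only: the three mutually exclusive explanations of an attribute
// candidate and their scores. pAttr = P(A | patterns), pEngaged = P(engaged | A).
final class AcExplanation {
    static double engagedScore(double pAttr, double pEngaged)    { return pAttr * pEngaged; }
    static double standaloneScore(double pAttr, double pEngaged) { return pAttr * (1.0 - pEngaged); }
    static double mistakeScore(double pAttr)                     { return 1.0 - pAttr; }

    public static void main(String[] args) {
        double pAttr = 0.9, pEngaged = 0.75;                     // example values
        System.out.println(engagedScore(pAttr, pEngaged));       // ~0.675 (part of an instance)
        System.out.println(standaloneScore(pAttr, pEngaged));    // ~0.225 (standalone)
        System.out.println(mistakeScore(pAttr));                 // ~0.1 (mistake); the three sum to 1
    }
}
```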

9 Finding attribute candidates (4)
• ACs naturally overlap; they form a lattice within the document
• [figure: an AC lattice over the document, starting from an initial null state; edges are weighted by log(AC standalone score); each AC is labelled with its ID and the indices of its start and end tokens; the best path through the example lattice scores Σ = -0.5754]
• If we wanted just standalone attributes, we could stop here; the extraction would be complete

10 Parsing instances (1)
• Initially, each AC is converted into a singleton instance candidate IC = {AC}
• Nested ACs are supported
• Then, iteratively, the most promising ICs are expanded: neighbouring ACs are added to them
• Expansion is only possible if no constraints are violated (e.g. maximum cardinality reached, or ECMAScript axioms may fail; selective axiom evaluation)
• IC scoring (see the sketch below)
  – so far, IC score = Σ log(AC engaged score)
    + penalties for skipped ACs (orphans) within the IC's span
    + fixed penalties for the IC crossing formatting blocks
  – we need to incorporate ASAP:
    · the likelihood of the IC's attribute cardinalities and ordering
    · learnable formatting-block crossing penalties
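A sketch of the current IC score in Java, with hypothetical names and assumed penalty constants (the slide itself notes that block-crossing penalties should eventually be learned):

```java
// Sketch of the IC score described above; penalty constants are assumed values.
import java.util.List;

class InstanceCandidateScore {
    static final double ORPHAN_PENALTY = -2.0;        // per skipped AC inside the IC's span (assumed)
    static final double BLOCK_CROSS_PENALTY = -1.5;   // per formatting block boundary crossed (assumed)

    static double score(List<Double> engagedAcScores, int orphanCount, int blocksCrossed) {
        double s = 0.0;
        for (double acScore : engagedAcScores) {
            s += Math.log(acScore);                   // sum of log(AC engaged score)
        }
        s += orphanCount * ORPHAN_PENALTY;            // penalties for orphans within the IC's span
        s += blocksCrossed * BLOCK_CROSS_PENALTY;     // fixed penalties for crossing formatting blocks
        return s;
    }
}
```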

11 Parsing instances (2): simplified IC parsing algorithm
1. Create a set ICs_singletons = { {AC}, {AC}, ... } of singleton ICs, each containing just one AC.
2. Enrich ICs_singletons by adding ICs with two or more contained attribute values (still referred to as singletons since they have a single containing root attribute).
3. Create a set of instance candidates ICs_valid = {}.
4. Create a queue of instance candidates ICs_work = {}; keep ICs_work sorted by IC score, with a maximum size of K (heap).
5. Add the content of ICs_singletons to ICs_work.
6. Pick the best-scoring IC_best from ICs_work.
7. Set the beam area of the document BA = the span of the document fragment (e.g. HTML element) containing IC_best.
8. While expanding IC_best:
   1. If BA does not contain more ACs, expand BA to its parent block area.
   2. Within BA, try adding to IC_best those IC_near_singleton which are singletons and are closest to it: IC_new = IC_best + IC_near_singleton.
   3. If IC_new does not violate integrity constraints (e.g. maximum cardinality already reached in IC_best, or an axiom fails):
      – add IC_new to ICs_work,
      – and if IC_new is valid, add it to ICs_valid.
   4. Break if a large portion of the ICs_near_singleton was refused due to integrity constraints, or if BA is too large or too high in the formatting block tree.
9. Remove IC_best from the document, and if ICs_work is not empty, go to step 6.
10. Return ICs_valid.
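A compressed Java sketch of the loop above. The IC, BlockArea and Document types and every helper method are hypothetical stand-ins for the real Ex classes, and constraint checking, scoring and beam-area geometry are stubbed, so this shows the shape of the beam search rather than the actual implementation:

```java
// Shape-of-the-algorithm sketch; IC, BlockArea and Document are hypothetical
// stand-ins for the real Ex classes, and scoring/constraints are stubbed.
import java.util.*;

class InstanceParserSketch {
    static final int K = 100;                          // beam width (assumed value)

    List<IC> parse(Document doc, List<IC> singletons) {
        List<IC> valid = new ArrayList<>();
        // ICs_work: queue of candidates ordered by descending score
        PriorityQueue<IC> work =
            new PriorityQueue<>(Comparator.<IC>comparingDouble(IC::score).reversed());
        work.addAll(singletons);

        while (!work.isEmpty()) {
            IC best = work.poll();                     // IC_best
            BlockArea ba = doc.smallestBlockContaining(best);
            boolean stop = false;
            while (!stop) {
                if (!ba.containsAcsOutside(best)) ba = ba.parent();   // grow the beam area
                int refused = 0, tried = 0;
                for (IC near : ba.nearestSingletons(best)) {          // IC_near_singleton
                    tried++;
                    IC expanded = best.plus(near);                    // IC_new
                    if (expanded.satisfiesConstraints()) {            // cardinalities, axioms
                        work.offer(expanded);
                        if (expanded.isValid()) valid.add(expanded);
                    } else {
                        refused++;
                    }
                }
                stop = (tried == 0) || refused > tried / 2 || ba.tooLarge();
            }
            doc.remove(best);                          // the explained span is taken out
            while (work.size() > K) removeWorst(work); // crude beam truncation for the sketch
        }
        return valid;
    }

    private void removeWorst(PriorityQueue<IC> work) {
        List<IC> all = new ArrayList<>(work);
        all.sort(Comparator.comparingDouble(IC::score));
        work.remove(all.get(0));                       // drop the lowest-scoring IC
    }

    // Hypothetical interfaces standing in for Ex's data model.
    interface IC { double score(); IC plus(IC other); boolean satisfiesConstraints(); boolean isValid(); }
    interface BlockArea { BlockArea parent(); boolean containsAcsOutside(IC ic);
                          List<IC> nearestSingletons(IC ic); boolean tooLarge(); }
    interface Document { BlockArea smallestBlockContaining(IC ic); void remove(IC ic); }
}
```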

12–18 Parsing instances (3)
• [Figure slides: a worked example of instance parsing over a sample HTML table (block structure A > TABLE > TR > TD over tokens a..n). The example class C has attributes X (card=1, may contain Y), Y (card=1..n) and Z (card=1..n); tokens covered by no AC are garbage. Successive animation frames show singleton ICs such as {A_X}, {A_Y} and {A_Z} being expanded into larger candidates like {A_X A_Y}, {A_X A_Z} and {A_X A_Y A_Z}, including nested variants such as {A_X [A_Y]}, following the beam-driven expansion from the previous slide.]

19 Parsing instances (4)
• From the instance parser we get a set of valid ICs
• Similar to ACs, these may overlap
• Valid ICs form a lattice within the analyzed document

20 Parsing instances (5)
• Since we want to extract both valid instances and standalone attributes, we merge the AC lattice and the valid-IC lattice
• ICs which interfere with other ICs and leave parts of them unexplained are penalized relative to the unexplained parts of the interfering ICs

21 Parsing instances (6)
• The best path is found through the merged lattice
• This should be the sequence of standalone ACs and valid ICs which best explains the document content
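A sketch of the best-path search through the merged lattice: dynamic programming over token positions, where each edge is a standalone AC or a valid IC spanning a token interval with a log score, and unexplained tokens are bridged by a per-token skip cost. The types, the skip cost and this particular formulation are assumptions for illustration, not the Ex implementation:

```java
// Sketch: best path through a lattice of AC/IC edges over document tokens.
import java.util.*;

class LatticeBestPath {
    static class Edge {
        final int start, end;          // token span [start, end)
        final double logScore;         // log score of the AC or IC this edge represents
        Edge(int start, int end, double logScore) { this.start = start; this.end = end; this.logScore = logScore; }
    }

    static final double SKIP_LOG_SCORE = -1.0;   // cost of leaving one token unexplained (assumed value)

    static List<Edge> bestPath(int tokenCount, List<Edge> edges) {
        double[] best = new double[tokenCount + 1];
        Edge[] via = new Edge[tokenCount + 1];    // edge ending at each position on the best path; null = skip
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        Map<Integer, List<Edge>> byStart = new HashMap<>();
        for (Edge e : edges) byStart.computeIfAbsent(e.start, k -> new ArrayList<>()).add(e);

        for (int pos = 0; pos < tokenCount; pos++) {
            // option 1: leave this token unexplained
            if (best[pos] + SKIP_LOG_SCORE > best[pos + 1]) {
                best[pos + 1] = best[pos] + SKIP_LOG_SCORE;
                via[pos + 1] = null;
            }
            // option 2: follow an AC/IC edge starting at this position
            for (Edge e : byStart.getOrDefault(pos, Collections.emptyList())) {
                if (best[pos] + e.logScore > best[e.end]) {
                    best[e.end] = best[pos] + e.logScore;
                    via[e.end] = e;
                }
            }
        }

        // backtrack the best sequence of standalone ACs and valid ICs
        LinkedList<Edge> path = new LinkedList<>();
        for (int pos = tokenCount; pos > 0; ) {
            if (via[pos] == null) { pos--; }
            else { path.addFirst(via[pos]); pos = via[pos].start; }
        }
        return path;
    }
}
```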

22 Wrapper induction (1)
• During IC parsing, we search for common formatting patterns which encapsulate part of the ICs being generated
• E.g. a person's first name and last name (if we extracted these as separate attributes) could regularly be contained in the formatting pattern:
  – TR[1..n] { TD[0] {person.firstname} TD[1] {person.lastname} }
• A formatting pattern is defined as the first block area (HTML tag) containing the whole IC, plus the paths from that area to each of the IC's attributes
• If "reliable" formatting patterns are found, we add them to the context patterns of the respective attributes. For such an attribute A, we then:
  – boost/lower the scores of all ACs of A,
  – create new ACs for A where the formatting patterns match and no AC existed before,
  – rescore all ICs which contain rescored ACs,
  – add new singleton ICs for the newly added ACs.
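One way the formatting pattern defined above could be derived is to take the lowest common ancestor element of the IC's attribute values and record the tag/index path from it down to each attribute; the repetition range over rows (the TR[1..n] part of the example) would then come from aggregating such per-instance patterns, which is not covered here. The sketch below uses a hypothetical DomNode interface and is not the Ex implementation:

```java
// Sketch: derive a formatting pattern (containing block area + paths to attributes)
// from the DOM positions of an IC's attribute values. DomNode is hypothetical.
import java.util.*;

class FormattingPatternSketch {
    interface DomNode { DomNode parent(); String tag(); int indexInParent(); }

    /** Returns e.g. "TR { TD[0]{person.firstname} TD[1]{person.lastname} }". */
    static String derivePattern(Map<String, DomNode> attributeNodes) {
        DomNode root = lowestCommonAncestor(attributeNodes.values());
        StringBuilder sb = new StringBuilder(root.tag()).append(" {");
        for (Map.Entry<String, DomNode> e : attributeNodes.entrySet()) {
            sb.append(' ').append(pathFrom(root, e.getValue())).append('{').append(e.getKey()).append('}');
        }
        return sb.append(" }").toString();
    }

    static DomNode lowestCommonAncestor(Collection<DomNode> nodes) {
        Iterator<DomNode> it = nodes.iterator();
        Set<DomNode> ancestors = new HashSet<>();
        for (DomNode n = it.next(); n != null; n = n.parent()) ancestors.add(n);  // root's parent() assumed null
        DomNode lca = null;
        while (it.hasNext()) {
            DomNode n = it.next();
            while (n != null && !ancestors.contains(n)) n = n.parent();
            lca = n;
            ancestors.clear();                      // keep only ancestors common to all nodes seen so far
            for (DomNode a = lca; a != null; a = a.parent()) ancestors.add(a);
        }
        return lca != null ? lca : nodes.iterator().next();   // single-attribute IC: its own element
    }

    static String pathFrom(DomNode ancestor, DomNode node) {
        Deque<String> parts = new ArrayDeque<>();
        for (DomNode n = node; n != null && n != ancestor; n = n.parent()) {
            parts.addFirst(n.tag() + "[" + n.indexInParent() + "]");
        }
        return String.join(" ", parts);
    }
}
```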

23 Wrapper induction (2): formatting pattern induction process
• Segment all ICs from the parser's queue (not only the valid ones) into clusters of ICs with the same attributes populated
  – e.g. {firstname: Varel, lastname: Fristensky} and {firstname: Karel, lastname: Nemec} would fall into one cluster
• For each cluster, build an IC lattice over the document and find the best path of non-overlapping ICs
• For the ICs on the best path, compute the counts of each distinct formatting pattern. For each formatting pattern FP, estimate
  – precision(FP) = C(FP, instance from cluster) / C(FP)
  – recall(FP) = C(FP, instance from cluster) / C(instance from cluster)
  where C() denotes observed counts
• We induce a new pattern if precision(FP), recall(FP) and C(FP, instance from cluster) reach configurable thresholds
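A minimal sketch of the induction test described above (method and threshold names are assumptions, not Ex configuration options):

```java
// Sketch of the threshold test for inducing a new formatting pattern FP.
class FormattingPatternInduction {
    // cFpAndInstance = C(FP, instance from cluster), cFp = C(FP), cInstance = C(instance from cluster)
    static boolean induce(int cFpAndInstance, int cFp, int cInstance,
                          double minPrecision, double minRecall, int minSupport) {
        double precision = (double) cFpAndInstance / cFp;        // P(instance | FP)
        double recall    = (double) cFpAndInstance / cInstance;  // P(FP | instance)
        return precision >= minPrecision && recall >= minRecall && cFpAndInstance >= minSupport;
    }
}
```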

24 Wrapper induction (3)
• Plugging wrapper generation into the instance parsing algorithm
  – in the current implementation, formatting patterns are only induced once, for singleton ICs
• Parallel parsing of multiple documents
  – documents from the same source (e.g. a website) often share formatting patterns; we expect a measurable improvement over the single-document extraction approach
  – to be implemented
• More experiments needed

25 Ex demo
• Command-line version
• GUI available
  – the GUI of the Information Extraction Toolkit exists as a separate project, ready to accommodate other IE engines
• Simple API to enable usage in 3rd-party systems
• Everything written in Java
  – however, it may connect to lemmatizers / POS taggers / other tools written in arbitrary languages
  – Ex: ~26,000 lines of code
  – Information Extraction Toolkit: ~2,500 lines of code

26 Discussion
• Thank you.

