
1 Learning to Extract Symbolic Knowledge from the World Wide Web
Changho Choi
Source: http://www.cs.cmu.edu/~knigam/
Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum
Carnegie Mellon University, J. Stefan Institute
AAAI-98

2 Abstract
Information on the Web is understandable to humans, but not directly usable by machines. By extracting that information into a knowledge base (KB), it becomes machine-understandable.

3 Introduction (#1/4)
Two types of input to the information extraction system:
- Ontology: specifies the classes and relations of interest. For example, a hierarchy of classes including Person, Student, Research.Project, Course, etc.
- Training examples: represent instances of the ontology classes and relations. For example, a course web page as an instance of the Course class, faculty home pages for the Faculty class, a pair of pages for Courses.Taught.By, etc.
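
A minimal sketch of how these two inputs might be represented in Python. The class and relation names come from the slide; the data structures and example URLs are assumptions for illustration, not part of the original system:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OntologyClass:
    """A class in the ontology, e.g. Person, Student, Course."""
    name: str
    parent: Optional[str] = None   # superclass in the hierarchy, if any

@dataclass
class OntologyRelation:
    """A binary relation between two ontology classes."""
    name: str     # e.g. "Courses.Taught.By"
    domain: str   # e.g. "Course"
    range: str    # e.g. "Faculty"

# Ontology: classes and relations of interest (names taken from the slide).
classes = [
    OntologyClass("Person"),
    OntologyClass("Student", parent="Person"),
    OntologyClass("Faculty", parent="Person"),
    OntologyClass("Research.Project"),
    OntologyClass("Course"),
]
relations = [OntologyRelation("Courses.Taught.By", domain="Course", range="Faculty")]

# Training examples: URLs labeled with a class, or pairs of URLs labeled with a relation.
# The URLs below are hypothetical placeholders.
class_examples = {"http://example.edu/cs101.html": "Course"}
relation_examples = {
    ("http://example.edu/cs101.html", "http://example.edu/~prof.html"): "Courses.Taught.By",
}
```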

4 Introduction (#2/4)
[Figure: the example ontology, showing its classes and relations with attribute values.]

5 Introduction (#3/4)
Assumptions about the mapping between the ontology and the Web:
1. Each instance of an ontology class is represented by a single Web page, a contiguous string of text, or a collection of several Web pages.
2. Each instance of a relation is represented by a segment of hypertext (such as a hyperlink or chain of hyperlinks connecting the pages involved) or a contiguous segment of text.

6 Introduction (#4/4)
Three primary learning tasks are involved in extracting knowledge-base instances from the Web:
1. Recognizing class instances by classifying bodies of hypertext.
2. Recognizing relation instances by classifying chains of hyperlinks.
3. Recognizing class and relation instances by extracting small fields of text from Web pages.

7 Experimental Testbed
Experiments are based on the following ontology:
- Classes: department, faculty, staff, student, research_project, course, other
- Relations: Instructors.Of.Course (251), Members.Of.Project (392), Department.Of.Person (748)
Data sets:
- A set of pages (4,127) and hyperlinks (10,945) from four CS departments
- A set of pages (4,120) from numerous other CS departments
Evaluation:
- Four-fold cross-validation: three folds for training, one for testing (one department held out per fold)

8 Statistical Text Classification
Process:
- Build a probabilistic model of each class using labeled training data.
- Classify newly seen pages by selecting the class that is most probable given the words of the new page.
Three classifiers are trained:
- Full-text
- Title/heading
- Hyperlink

9 Statistical Text Classification
Approach: naïve Bayes, with minor modifications based on Kullback-Leibler divergence.
Given a document d to classify, a score is calculated for each class c from the class prior and the class-conditional word probabilities.
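
The scoring equation itself did not survive the transcript, but the described classifier is a naïve Bayes model. Below is a minimal sketch of a multinomial naïve Bayes scorer with Laplace smoothing; ranking classes by this log-score is equivalent, up to a per-document constant, to a KL-divergence formulation like the one mentioned on the slide. The training interface and smoothing choice are assumptions, not the authors' exact implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over bag-of-words page representations.
    A sketch of the kind of classifier described on this slide, not the
    authors' exact implementation."""

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per class
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.total_words = Counter()             # class -> total word count
        self.vocab = set()

    def train(self, labeled_pages):
        """labeled_pages: iterable of (list_of_words, class_label)."""
        for words, label in labeled_pages:
            self.class_doc_counts[label] += 1
            for w in words:
                self.word_counts[label][w] += 1
                self.total_words[label] += 1
                self.vocab.add(w)

    def score(self, words, label):
        """log P(class) + sum_i log P(word_i | class), Laplace-smoothed."""
        n_docs = sum(self.class_doc_counts.values())
        log_score = math.log(self.class_doc_counts[label] / n_docs)
        v = len(self.vocab)
        for w in words:
            count = self.word_counts[label][w]
            log_score += math.log((count + 1) / (self.total_words[label] + v))
        return log_score

    def classify(self, words):
        """Pick the most probable class for a newly seen page."""
        return max(self.class_doc_counts, key=lambda c: self.score(words, c))
```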

10 Statistical Text Classification: Experimental Evaluation
Confusion matrix (rows: predicted class; columns: actual class):

Predicted \ Actual   course  student  faculty  staff  res_proj  department  other  Accuracy
course                  202       17        0      0         1           0    552      26.2
student                   0      421       14     17         2           0    519      43.3
faculty                   5       56      118     16         3           0    264      17.9
staff                     0       15        0      4         0           0     45       6.2
research_project          8        9       10      5        62           0    384      13.0
department               10        8        3      1         5           4    209       1.7
other                    19       32        7      3        12           0   1064      93.6
Coverage               82.8     72.4     77.1    8.7      72.9       100.0   35.0

11 Accuracy/Coverage
- Coverage: the percentage of pages of a given class that are correctly classified as belonging to that class.
- Accuracy: the percentage of pages classified into a given class that are actually members of that class.
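
As a worked example of the two definitions, the sketch below computes both measures from a confusion matrix keyed as confusion[predicted][actual]. The demo collapses the non-course classes of the table above into a single "other" bucket, so the cell values are approximate and illustrative:

```python
def accuracy_and_coverage(confusion, cls):
    """confusion[predicted][actual] = page count.
    Accuracy: fraction of pages predicted as cls that really are cls.
    Coverage: fraction of actual cls pages that were predicted as cls."""
    predicted_total = sum(confusion[cls].values())
    actual_total = sum(row.get(cls, 0) for row in confusion.values())
    correct = confusion[cls].get(cls, 0)
    accuracy = correct / predicted_total if predicted_total else 0.0
    coverage = correct / actual_total if actual_total else 0.0
    return accuracy, coverage

# Simplified two-class view of the course row/column from the table above:
# 202 of 772 pages predicted as course are correct  -> accuracy ~ 26.2%
# 202 of 244 actual course pages are found          -> coverage ~ 82.8%
demo = {"course": {"course": 202, "other": 570},
        "other":  {"course": 42,  "other": 1000}}
print(accuracy_and_coverage(demo, "course"))
```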

12 Accuracy/Coverage Tradeoff
[Plot comparing the accuracy/coverage tradeoff of (1) full-text, (2) hyperlink, and (3) title/heading classifiers.]
"Hyperlink information can provide strong knowledge."

13 First-Order Text Classification
Second approach to text classification: learn first-order rules for classifying pages.
- 1st-order: rules with variables; Prolog-like, function-free Horn clauses. FOIL is the well-known algorithm for first-order learning.
- 0th-order: no variables. C4.5 is the well-known algorithm for zeroth-order learning.

14 FOIL's Input for Text Classification
- For each distinct word: has_word(Page), where the word is stemmed (e.g., has_professor(Page)).
- For every hyperlink: link_to(Page, Page).
Training data:
- Student("http://www.cs.buffalo.edu/grads.html"), …
- Course("http://www.cse.buffalo.edu/courses.html"), …
- …
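
A rough sketch of how pages might be converted into these ground facts. The regex-based tokenization, the toy stemmer, and the example URL are simplifications assumed for illustration; the actual system used a proper stemmer and HTML handling:

```python
import re

def stem(word):
    """Crude stand-in for a real stemmer (e.g. Porter stemming)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def page_to_facts(url, html):
    """Emit FOIL-style ground facts for one page:
    has_<word>(url) for each distinct stemmed word in the page text, and
    link_to(url, target) for each hyperlink."""
    facts = set()
    text = re.sub(r"<[^>]+>", " ", html.lower())          # strip HTML tags
    for w in set(map(stem, re.findall(r"[a-z]+", text))):
        facts.add(f"has_{w}({url!r})")
    for target in re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE):
        facts.add(f"link_to({url!r}, {target!r})")
    return facts

# Hypothetical page content; the source URL is just an example.
html = '<a href="http://www.cs.buffalo.edu/grads.html">graduate students</a>'
for fact in sorted(page_to_facts("http://example.edu/index.html", html)):
    print(fact)
```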

15 FOIL's Result
Sample learned rules:

student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A),
              has_jame(B), has_paul(B), not(has_mail(B)).
  Test set: 126 (+), 5 (-)

faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
  Test set: 18 (+), 3 (-)
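
As an illustration of how such a learned rule is applied, the sketch below encodes the student rule directly as a Python check over a fact base of stemmed words and hyperlinks. The fact base and page names are invented:

```python
def student(page, has_word, link_to):
    """Direct encoding of the learned rule:
    student(A) :- not(has_data(A)), not(has_comment(A)),
                  link_to(B, A), has_jame(B), has_paul(B), not(has_mail(B))."""
    if "data" in has_word[page] or "comment" in has_word[page]:
        return False
    # Look for some page B that links to A and satisfies the literals on B.
    for b, a in link_to:
        if a == page and {"jame", "paul"} <= has_word.get(b, set()) \
                and "mail" not in has_word.get(b, set()):
            return True
    return False

# Tiny made-up fact base (stemmed words per page, and hyperlinks).
has_word = {
    "students.html": {"jame", "paul", "phd"},   # directory page listing students
    "alice.html": {"research", "publication"},
}
link_to = [("students.html", "alice.html")]
print(student("alice.html", has_word, link_to))   # True under this fact base
```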

16 FOIL's Result
Compared to statistical classification, the learned rules are:
- More accurate
- Lower in coverage

17 Classifying Hyperlinks
A first-order representation is used because this task involves discovering hyperlink paths of unknown and variable size, and because we want to learn patterns such as:
"The ProjectMember(A,B) relation holds if A is a Person, B is a ResearchProject, and B includes a link to A near the word 'People'."

18 FOIL's Input for Classifying Hyperlinks
Predicates:
- class(Page)
- link_to(Hyperlink, Page, Page)
- has_word(Hyperlink)
- all_words_capitalized(Hyperlink)
- has_alphanumeric_word(Hyperlink)
- has_neighborhood_word(Hyperlink)
Training examples:
- Department.Of.Person("CSE", "Changho Choi"), …
- Instructors.Of.Course("Sargur N. Srihari", "CSE711"), …

19 FOIL's Result
Sample learned rules:

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D),
                           link_to(E,D,B), neighborhood_word_people(C).
  Test set: 18 (+), 0 (-)

department_of_person(A,B) :- person(A), department(B), link_to(C,D,A),
                             link_to(E,F,D), link_to(G,B,F),
                             neighborhood_word_graduate(E).
  Test set: 371 (+), 4 (-)
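
The relation rules chain link_to facts through intermediate pages. The sketch below encodes the members_of_project rule against an invented fact base; all page and hyperlink names are hypothetical:

```python
def members_of_project(a, b, facts):
    """Encoding of the learned rule:
    members_of_project(A,B) :- research_project(A), person(B),
        link_to(C, A, D), link_to(E, D, B), neighborhood_word_people(C)."""
    if a not in facts["research_project"] or b not in facts["person"]:
        return False
    for c, src1, d in facts["link_to"]:          # hyperlink C from A to some page D
        if src1 != a or c not in facts["neighborhood_word_people"]:
            continue
        for e, src2, dst2 in facts["link_to"]:   # hyperlink E from D on to B
            if src2 == d and dst2 == b:
                return True
    return False

# Invented fact base: a project page links (near the word "people") to a
# member list, which links on to a person's home page.
facts = {
    "research_project": {"project.html"},
    "person": {"bob.html"},
    "link_to": [("link1", "project.html", "people.html"),
                ("link2", "people.html", "bob.html")],
    "neighborhood_word_people": {"link1"},
}
print(members_of_project("project.html", "bob.html", facts))   # True
```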

20 FOIL's Result
- Fairly high accuracy
- Limited coverage, because of the limited coverage of the page classifiers

21 Extracting Text Fields
This task uses a richer set of predicates:
- length(Fragment, Relop, N)
- some(Fragment, Var, Path, Attr, Value)
- position(Fragment, Var, From, Relop, N)
- relpos(Fragment, Var1, Var2, Relop, N)
Sample learned rule:

ownername(Fragment) :- some(Fragment, B, [], in_title, true), length(Fragment, <, 3),
                       some(Fragment, B, [prev_token], word, "gmt"),
                       some(Fragment, A, [], longp, true),
                       some(Fragment, B, [], word, unknown),
                       some(Fragment, B, [], quadrupletonp, false).
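
To make the predicate set concrete, the sketch below shows what simplified versions of length and some (with an empty path argument) could look like over a toy token representation. The token attributes and the example fragment are assumptions for illustration, not the paper's actual feature set:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One token of a candidate text fragment, with a few attributes
    the predicates can test (a simplified stand-in for the real features)."""
    word: str
    in_title: bool = False
    longp: bool = False            # "long word" feature
    quadrupletonp: bool = False    # token is exactly four characters

def length(fragment, relop, n):
    """length(Fragment, Relop, N): compare the fragment's token count to N."""
    ops = {"<": lambda a, b: a < b, "=": lambda a, b: a == b, ">": lambda a, b: a > b}
    return ops[relop](len(fragment), n)

def some(fragment, attr, value):
    """some(Fragment, Var, [], Attr, Value): does some token in the fragment
    have attribute Attr equal to Value? (The Var and Path arguments of the
    full predicate are omitted in this simplified version.)"""
    return any(getattr(tok, attr) == value for tok in fragment)

# A hypothetical two-token fragment that might be a page owner's name.
fragment = [Token("jane", in_title=True, quadrupletonp=True),
            Token("doe", in_title=True)]
print(length(fragment, "<", 3), some(fragment, "in_title", True))   # True True
```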

22 FOIL's Result
[Results figure from the slide.]

23 Conclusions
The approach proposed in this paper is to construct a system that can be trained to automatically populate such a knowledge base.
A variety of approaches have been presented that take advantage of the special structure of hypertext by considering relationships among Web pages, their hyperlinks, and specific words on individual pages and hyperlinks.

