Human Language Technologies for the Semantic Web Department of Computer Science, University of Sheffield Fabio Ciravegna and Yorick Wilks.

1 Human Language Technologies for the Semantic Web
Department of Computer Science, University of Sheffield
Fabio Ciravegna and Yorick Wilks

2 F. Ciravegna- AKT Town Meeting April 2003
Language Technologies
Goal
–Building systems able to process Natural Language in its written or spoken form
Methodology
–Use of Language Analysis Technologies (examples): Information Extraction from Text; Question Answering; Text Generation

3 HLT for Knowledge Management
Use of HLT for Knowledge
–Acquisition
–Retrieval
–Publication
Main benefits
–Cost Reduction
–Time needed for KM
–Improving knowledge accessibility: Accessing/Diffusing/Understanding

4 HLT in AKT for KM
KM stages: acquisition, retrieval, publishing
Technologies: Text mining, Information Extraction from Text, Text Generation

5 HLT for Semantic Web
Use of HLT for:
–Document annotation
–Information integration from different sources
Benefit
–Reduce annotation needs
–Retrieve and integrate dispersed information

6 Information Extraction
Textual documents are pervasive (e.g. Web)
–Contained knowledge cannot be queried, and therefore cannot be used by automatic systems or easily managed by humans
IE can identify information in documents
–e.g. to populate a database
–e.g. to annotate documents
Method: natural language analysis
Words → Information → Knowledge

7 IE tasks
Named Entities; Template Elements; Template Relations; Scenario Template
Example text: WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that Paul Jacobs, former Vice-President of E-Commerce at SRA International, has joined the company's executive management team as president.
Entities: nQuest Inc.; Paul Jacobs; SRA International
Template slots: Company: nQuest Inc.; Date: today; InPerson: Paul Jacobs; InRole: president
Template slots: Company: SRA International; OutPerson: Paul Jacobs; OutRole: Vice-President of E-Commerce
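The slot filling on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SuccessionTemplate` class, the `ANNOUNCEMENT` pattern and the `extract_succession` function are hypothetical names, and the single hand-written regular expression stands in for the full natural language analysis a real IE system would perform. The Date slot is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionTemplate:
    """Slots named after the ones shown on the slide (Date omitted)."""
    company: str
    in_person: str
    in_role: str
    out_company: str
    out_person: str
    out_role: str


# Hypothetical hand-written pattern; it covers only this one sentence shape.
ANNOUNCEMENT = re.compile(
    r"(?P<company>[A-Za-z]+ Inc\.) today announced that "
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), former "
    r"(?P<out_role>[^,]+?) at (?P<out_company>[^,]+), "
    r"has joined .*? as (?P<in_role>\w+)"
)

TEXT = (
    "WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that "
    "Paul Jacobs, former Vice-President of E-Commerce at SRA International, "
    "has joined the company's executive management team as president."
)


def extract_succession(text: str) -> Optional[SuccessionTemplate]:
    """Fill the template from one announcement sentence, or return None."""
    match = ANNOUNCEMENT.search(text)
    if match is None:
        return None
    return SuccessionTemplate(
        company=match["company"],
        in_person=match["person"],
        in_role=match["in_role"],
        out_company=match["out_company"],
        out_person=match["person"],  # the same person leaves one role for another
        out_role=match["out_role"],
    )


if __name__ == "__main__":
    print(extract_succession(TEXT))
```

A pattern this rigid breaks as soon as the wording changes, which is the motivation for the rule-based and adaptive systems listed on the following slides.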

8 IE Sheffield
GATE:
–General Architecture for Text Engineering
–Used to integrate HLT modules
ANNIE:
–Rule-based Named Entity Recogniser
–Download at
Amilcare:
–Adaptive IE system
–Portable using examples
–www.nlp.shef.ac.uk/amilcare

9 IE Sheffield (2)
Melita:
–Annotation tool
–Supported by adaptive IE (Amilcare)
–Learns how to annotate
–www.aktors.org/technologies/melita/
Lasie:
–IE system for complex event extraction
–Manual rule development
–www.dcs.shef.ac.uk/research/groups/nlp/funded/lasie.html

10 GATE is…
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al, GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
Free software (LGPL). Mature robust software (in development since 1995).
Comes with…
–Some free components...
–...and wrappers for other people's components
–Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

11 Some users…
At time of writing a representative fraction of GATE users includes:
–Longman Pearson publishing, UK;
–BT Exact Technologies, UK;
–Merck KgAa, Germany;
–Canon Europe, UK;
–Knight Ridder (the second biggest US news publisher);
–BBN Technologies, US;
–Sirma AI Ltd., Bulgaria;
–Resco AB, Sweden/Finland/Germany;
–Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
–Master Foods NV: extraction of commodities events from news;
–the American National Corpus project, US;
–Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU Universities;
–the Perseus Digital Library project, Tufts University, US.

12 GATE and Content Extraction
ANNIE - Open-source IE system in GATE, providing modules needed for content extraction
–Pre-processing
–Named entity recognition
–Coreference resolution
ANNIE handles proper names, pronouns, and nominals
Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
Contact Hamish Cunningham

13 Amilcare
Active annotation for the Semantic Web
Tool for adaptive IE from Web-related texts
–Specifically designed for document annotation
–Trains with a limited number of examples
–Effective on different text types: from free texts to rigid docs (XML, HTML, etc.)
–Tools for: Normal user (able to annotate a corpus); Amilcare Expert (able to optimise experiments); IE Expert (able to edit rules)
–Uses Annie for preprocessing up to Named Entity Recognition
[Ciravegna – IJCAI 2001]

14 Implementation details
100% Java
External Interfaces:
–API for use from other programs
–GUI for manual training
Requirements:
–10 MB on hard disk
–Up to 300 MB RAM
Contact Fabio Ciravegna

15 Users
Integrated with SW annotation tools:
–MnM (Open Univ.)
–Ontomat (Karlsruhe Univ.)
–Melita (Sheffield Univ.)
Users:
–Merck (D)
–ISOCO (SP)
–Quinary (I)
–Ontoprise (D)
–University College Dublin (IE)
–2 departments of CNRS (F)
–University of Trier (D)
–University of Texas (Austin, USA)

16 Document Annotation
Many application areas require document annotation (enrichment)
–Knowledge Management: protocol analysis in industry (Kingston 94); Italian police: 100 annotators/6 pages a day each
–Semantic Web (Staab00, Motta02, Ciravegna02)
Annotation is generally manual
–Expensive
–Inefficient
–Difficult
–Tedious & Tiring: error prone (15-30% inter-annotator disagreement)
–Never ending
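The slide does not say how the 15-30% inter-annotator disagreement was measured. As a minimal sketch, assuming the simplest possible definition (the share of items to which two annotators assign different labels), the figure could be computed as below. The function name `disagreement_rate`, the tag names and the two label sequences are hypothetical and made up for illustration.

```python
from typing import Sequence


def disagreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of items on which two annotators chose different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        return 0.0
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


if __name__ == "__main__":
    # Invented labels for ten tokens; "O" means "no annotation".
    annotator_1 = ["O", "PERSON", "PERSON", "O", "O", "ORG", "O", "O", "DATE", "O"]
    annotator_2 = ["O", "PERSON", "O", "O", "O", "ORG", "O", "O", "O", "O"]
    print(disagreement_rate(annotator_1, annotator_2))
```

Chance-corrected measures exist as well; this raw rate is used here only because it is the most direct reading of a percentage disagreement.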

17 Melita
Document annotation tool
–Uses adaptive IE engine to support annotation
IE System:
–Trains while users annotate
–Provides preliminary annotation for new documents
Advantages
–Annotates trivial or previously seen cases
–Focuses slow/expensive user activity on unseen cases
–Validating extracted information: simpler & less error prone; speeds up corpus annotation
–Learns how to improve capabilities

18 Annotation with IE
–User annotates bare text; the system trains on the annotated corpus
–The system annotates bare text; its annotation is compared with the user's
–The system retrains using errors, missing tags and mistakes

19 Annotation with Suggestions
–The system annotates bare text
–User corrects
–The system uses corrections to retrain
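The suggest, correct, retrain cycle of the last two slides can be sketched as a loop. This is not how Amilcare or Melita are implemented: `MemorisingTagger` is a hypothetical toy learner that only remembers the most frequent tag seen for each token, the user is simulated by a gold annotation, and the tag names and documents are invented. The point is only that the number of user corrections drops once cases have been seen before.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class MemorisingTagger:
    """Toy stand-in for an adaptive IE engine."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = defaultdict(Counter)

    def train(self, tokens: List[str], tags: List[str]) -> None:
        """Record which tag each token received in a validated annotation."""
        for token, tag in zip(tokens, tags):
            self._counts[token][tag] += 1

    def suggest(self, tokens: List[str]) -> List[str]:
        """Propose the most frequent known tag, or "O" for unseen tokens."""
        return [
            self._counts[t].most_common(1)[0][0] if t in self._counts else "O"
            for t in tokens
        ]


def annotate_corpus(documents: List[List[str]], gold: List[List[str]]) -> List[int]:
    """Run the loop and return how many corrections each document needed."""
    tagger = MemorisingTagger()
    corrections_per_doc = []
    for tokens, user_tags in zip(documents, gold):
        suggested = tagger.suggest(tokens)        # system annotates bare text
        corrections = sum(s != u for s, u in zip(suggested, user_tags))
        corrections_per_doc.append(corrections)   # user corrects
        tagger.train(tokens, user_tags)           # system retrains
    return corrections_per_doc


if __name__ == "__main__":
    docs = [
        ["Seminar", "by", "Smith", "at", "3pm"],
        ["Talk", "by", "Smith", "at", "3pm"],
    ]
    tags = [
        ["O", "O", "SPEAKER", "O", "STIME"],
        ["O", "O", "SPEAKER", "O", "STIME"],
    ]
    print(annotate_corpus(docs, tags))
```

In this run the first document needs two corrections and the second none, because both tagged tokens were already seen.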

20 Cooperation: is IE a Useful Support?
CMU Seminars task
Test: 250 texts (Amilcare reports the best IE results ever)

21 Integrating Information
Information is available over the Web
–Dispersed
–In textual format
IE as basis for retrieval and integration of information
–Unsupervised learning using the redundancy of the web and available repositories (collections of documents/data; known services, e.g. databases, digital libraries, search engines) to bootstrap learning and produce simple high precision IE applications

22 Mining Web Sites
Extracting knowledge from CS Web sites
Person: Name; Position; /Telephone; Involvement in projects; Publications; Co-workers
Information distributed
Challenges
–Retrieving information
–Integrating Information
Largely unsupervised by user

23 Mining Web sites: People and Project names
HomePageSearch
–Annotates known names
–Trains on annotations to discover the HTML structure of the page
–Recovers all names and hyperlinks
–Mines the site looking for Project and People names
–Uses: Generic patterns; Annie; Citeseer for likely bigrams
Basket: Project/People name lists and hyperlinks
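The step "annotates known names, trains on the annotations to discover the HTML structure, recovers all names" can be illustrated with a very small wrapper-induction sketch. It assumes that the structure to learn is simply the chain of enclosing tags around each known name; the slides do not state what is actually learned. `PathTextCollector`, `recover_names` and the sample page are hypothetical, and hyperlinks are not collected here.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple


class PathTextCollector(HTMLParser):
    """Collects every non-empty text node with the path of tags around it."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[str] = []
        self.nodes: List[Tuple[str, str]] = []  # (tag path, text)

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self._stack), text))


def recover_names(html: str, known_names: Set[str]) -> List[str]:
    """Learn the tag paths of the known names, then return all text on them."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    learned_paths = {path for path, text in collector.nodes if text in known_names}
    return [text for path, text in collector.nodes if path in learned_paths]


if __name__ == "__main__":
    page = """
    <html><body><h1>Staff</h1><ul>
    <li><a href="/~ann">Ann Example</a></li>
    <li><a href="/~bob">Bob Sample</a></li>
    <li><a href="/~cat">Cat Placeholder</a></li>
    </ul><p>Contact the office.</p></body></html>
    """
    print(recover_names(page, {"Ann Example"}))
```

One known name is enough here to recover the other two list entries, while the heading and the closing paragraph are ignored because they sit on different tag paths.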

24 Mining Web sites: Projects/People Web pages
Basket: Project/People name lists and hyperlinks
HomePageSearch
–Extracts personal data: Addresses; Tel number; address; …
Basket (People and Projects): Name lists and hyperlinks; Personal data

25 Mining Web sites: People Publications
Basket (People and Projects): Name lists and hyperlinks; Personal data
HomePageSearch
–Annotates known papers
–Trains on annotations to discover the HTML structure
–Recovers co-authoring information
Basket (People and Projects): Name lists and hyperlinks; Personal data; Co-authoring information

26 Paper discovery

27 Focus on people

28 User Role
Providing:
–A URL
–List of services (e.g. Google)
Training wrappers using examples
–Some examples of fillers (e.g. projects)
Correcting intermediate results where needed

29 Rationale
Large collections (e.g. Web) contain redundant information
–Redundancy can be used to bootstrap learning
Mining the Web for information
–Learned patterns
Integration of information
–Multiple evidence: different strategies with different reliability
Scruffy works!
–User correction of data where needed
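One way to read "multiple evidence, different strategies with different reliability" is that a candidate found by several independent strategies deserves more confidence than one found by a single weak strategy. The sketch below uses a noisy-OR combination, which is an assumption made for illustration and not a method stated in the slides. The function names, candidate names and reliability values are all hypothetical.

```python
from typing import Dict, List, Tuple


def combine_evidence(reliabilities: List[float]) -> float:
    """Noisy-OR: confidence that at least one supporting strategy is right,
    assuming the strategies err independently."""
    remaining_doubt = 1.0
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliability must lie between 0 and 1")
        remaining_doubt *= 1.0 - r
    return 1.0 - remaining_doubt


def rank_candidates(evidence: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Score each candidate by its combined evidence, best first."""
    scored = {name: combine_evidence(rs) for name, rs in evidence.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented reliabilities of the strategies that proposed each candidate name.
    evidence = {
        "Jane Doe": [0.5, 0.5],    # found by two moderately reliable strategies
        "Contact Us": [0.5],       # found by one strategy only
    }
    for name, score in rank_candidates(evidence):
        print(name, score)
```

Candidates below a chosen threshold could then be passed to the user for correction, in line with the last bullet of the slide.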

30 Conclusion
In AKT we are using HLT (IE) for:
–Helping in document annotation
–Integrating information from different sources
Benefit:
–Reduce annotation needs
–Retrieve and integrate dispersed information
Minimum user intervention

