
The Semantic Web and Language Technology BT Exact, Martlesham Hamish Cunningham Department of Computer Science, University of Sheffield Friday October.


1 The Semantic Web and Language Technology. BT Exact, Martlesham. Hamish Cunningham, Department of Computer Science, University of Sheffield. Friday October 11th 2002. Next generation web; GATE, language technology infrastructure.

2 A Ubiquitous Permeable Web. The next generation of the web must be: ubiquitous: semantics for every device, every organisation, every individual; permeable: allow contextual data to penetrate and persist; companionable: able to engage with us via multiple natural modalities. Roles for Language Technology: discovery of semantics (ubiquity); mediating between context and personal semantic memories (permeability); conversing with people and the semantic web (companionableness).

3 Critical Mass for the Semantic Web. The SW: machine-processable, repurposable data to complement hypertext. But: semantics = % of the Web. How to achieve critical mass? Huge-scale automatic annotation. Requirements: Huge scale: freely available to all EU citizens; distributed (over a Grid); repurposable (delivered as Web Services). Portability and robustness via: simple and therefore shallow HLT methods; +ve and -ve learning; analogs of IPSEs for computer-literate users.

4 Motivation for Software Infrastructure for Language Engineering. Need for scalable, reusable, and portable HLT solutions. Support for large data, in multiple media, languages, formats, and locations. Lowering the cost of creating new language processing components. Promoting quantitative evaluation metrics via tools and a level playing field.

5 Motivation (II):

6 GATE, a General Architecture for Text Engineering. An architecture: a macro-level organisational picture for LE software systems. A framework: for programmers, GATE is an object-oriented class library that implements the architecture. A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction. Some free components... and wrappers for other people's components. Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc. Free software (LGPL). Download at

7 Architectural principles. Non-prescriptive, theory neutral (strength and weakness). Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka). (Almost) everything is a component, and component sets are user-extendable. Component-based development: an OO way of chunking software: Java Beans. GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans. The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

8 GATE Language Resources. GATE LRs are documents, ontologies, corpora, lexicons, ... Documents / corpora: GATE documents are loaded from local files or the web. Diverse document formats: text, HTML, XML, RTF, SGML. Processing Resources: algorithmic components known as PRs, i.e. beans with execute methods. All PRs can handle Unicode data by default. Clear distinction between code and data (simple repurposing). Freebies with GATE include e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene.
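The split between language resources (data) and processing resources (beans with an execute method) can be pictured with a small sketch. This is only an illustration under stated assumptions: SpanAnnotation, SimpleDocument, ProcessingResourceSketch and WhitespaceTokeniser are hypothetical names invented here and are not the GATE class library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the GATE API.

// A typed span over the document text.
final class SpanAnnotation {
    final String type;
    final int start;
    final int end;

    SpanAnnotation(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

// Language resource: holds data (text plus annotations), no algorithms.
final class SimpleDocument {
    private final String content;
    private final List<SpanAnnotation> annotations = new ArrayList<>();

    SimpleDocument(String content) {
        this.content = content;
    }

    String getContent() {
        return content;
    }

    List<SpanAnnotation> getAnnotations() {
        return annotations;
    }
}

// Processing resource: a bean with a settable parameter and execute().
interface ProcessingResourceSketch {
    void setDocument(SimpleDocument document);

    void execute();
}

// A toy PR that adds one "Token" annotation per whitespace-separated word.
final class WhitespaceTokeniser implements ProcessingResourceSketch {
    private SimpleDocument document;

    @Override
    public void setDocument(SimpleDocument document) {
        this.document = document;
    }

    @Override
    public void execute() {
        String text = document.getContent();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                document.getAnnotations().add(new SpanAnnotation("Token", start, i));
            }
        }
    }
}
```

Because the tokeniser only reads and writes through the document object, the same component can be pointed at a different document without change, which is the kind of repurposing the slide refers to.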

9 A Language Analysis Example. [Figure: architecture diagram. Labels: HTML docs, RTF docs, XML docs; GATE format handlers; document content, document metadata, document format data, linguistic data; ANNIE components (POS tagger, named entity, coreference, event extraction); custom application 1; storage in files or a relational database (Oracle/PostgreSQL).]

11 Building IE Components in GATE (1). The ANNIE system: a reusable and easily extendable set of components.

12 Building IE Components in GATE (2). JAPE: a Java Annotation Patterns Engine. Light, robust regular-expression-based processing. Cascaded finite state transduction. Low-overhead development of new components. Example rule:

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):companyMatch
    -->
    :companyMatch.NamedEntity = { kind = company, rule = "Company1" }
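To make the pattern in the Company1 rule concrete, here is a minimal plain-Java sketch of the same idea: one or more upper-initial tokens followed by a company designator. It is not how JAPE is implemented. Tok and CompanyRuleSketch are hypothetical names, the designator is simplified to a boolean flag on a token (in the rule it is a separate Lookup annotation), and the Priority setting and the kind/rule features on the output are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token type for illustration; not a GATE or JAPE class.
final class Tok {
    final String text;
    final boolean upperInitial;      // stands in for Token.orthography == upperInitial
    final boolean companyDesignator; // stands in for Lookup.kind == companyDesignator

    Tok(String text, boolean companyDesignator) {
        this.text = text;
        this.upperInitial = !text.isEmpty() && Character.isUpperCase(text.charAt(0));
        this.companyDesignator = companyDesignator;
    }
}

final class CompanyRuleSketch {
    // Returns the text of every span matching (upperInitial)+ companyDesignator.
    static List<String> match(List<Tok> tokens) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int j = i;
            // Greedily consume upper-initial tokens that are not designators.
            while (j < tokens.size()
                    && tokens.get(j).upperInitial
                    && !tokens.get(j).companyDesignator) {
                j++;
            }
            boolean sawName = j > i;
            boolean sawDesignator = j < tokens.size() && tokens.get(j).companyDesignator;
            if (sawName && sawDesignator) {
                // Join tokens i..j (inclusive) into the matched company name.
                StringBuilder span = new StringBuilder();
                for (int k = i; k <= j; k++) {
                    if (k > i) {
                        span.append(' ');
                    }
                    span.append(tokens.get(k).text);
                }
                matches.add(span.toString());
                i = j + 1;
            } else {
                // No match starting at i; resume after the scanned run.
                i = Math.max(j, i + 1);
            }
        }
        return matches;
    }
}
```

For the token sequence We, visited, Acme, Widgets, Ltd, yesterday, with only Ltd flagged as a designator, the sketch returns the single span "Acme Widgets Ltd".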

13 The Semantic Web and GATE. GATE is being used for development of (semi-)automatic methods for: linking web pages to ontologies using Information Extraction; learning and evolving ontologies via IE and lexical semantic network traversal.

14 Populating Ontologies with IE

15 Protégé and Ontology Management

16 Information Retrieval Support. Based on the Lucene IR engine.

17 Displaying Multilingual Data. All the visualisation and editing tools for ML LRs use enhanced Java facilities:

18 Applications. GATE has been used for a variety of applications, including: MUMIS: automatic creation of semantic indexes for multimedia programme material; MUSE: a multi-genre IE system; Metadata for Medline (at Merck); ACE: participation in the Automatic Content Extraction programme; HSE: summarisation of health and safety information from company reports; OldBaileyIE: NE recognition on 17th century Old Bailey Court reports; various Medical Informatics and database technology projects; IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian this autumn).

19 Conclusion. GATE: an infrastructure that lowers the overhead of creating and embedding robust NLP components. Further information: online demos, tutorials and documentation; software downloads; talks and papers.

