5-6 Oct 2005 JRC
2/ Oct 2005 Semantic Web The Semantic Web is the abstract representation of data on the WWW, based on RDF and other standards. The SW is being developed by the W3C, in collaboration with a large number of researchers and industrial partners.
JRC 3/ Oct 2005 Semantic Web (II) "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." [Berners-Lee et al. 2001] The spirit: Automatically processable metadata regarding: –the structure (syntax) and –the meaning (semantics) –of the content. Presented in a standard form; Dynamic interpretation for unforeseen purposes
JRC 4/ Oct 2005 Semantic Web: Languages RDF(S) – the next slides SHOE, XOL, etc – the pioneers Topic Maps – a metadata language with limited impact OIL – Ontology Interchange Language, the basis of the next two –Description Logics-based multilayered language DAML+OIL – the predecessor of OWL, no longer developed OWL – the W3C standard for Semantic Web ontology language, –Extends RDF(S), but also constrains it –Has multiple layers (Lite, DL, Full) –Transitive/symmetric/etc properties, disjointness, cardinality restrictions
JRC 5/ Oct 2005 Semantic Web: Problems Critical mass of metadata is necessary Still lack of consensus on many issues (like query languages) Lack of practices at the proper scale and complexity Lack of robust Semantic (in our days RDFS) repositories: –Should be as flexible, multi-purpose and easy to use as HTTP servers and –As efficient in structured knowledge management as RDBMS
JRC 6/ Oct 2005 What are Sirma & Ontotext? Established in 1992 as a Bulgarian AI Lab. Current structure: – Sirma Group International Corp, Montreal, Canada; – 8 subsidiary companies; the most important ones follow below. Sirma AI, Sofia – The R&D backbone of the group with two divisions: – Sirma Solutions: e-Business, banking, C3, e-Publishing, consultancy; – Ontotext Lab: Knowledge and Language Engineering. EngView Systems, Montreal – CAD/CAM systems and applications. WorkLogic.Com, Ottawa – Web-based collaboration, workflow, e-Gov.
JRC 7/ Oct 2005 Software Development and Research since 1992 Track record of success – large companies and government organizations in US, Canada, Western Europe and Bulgaria; Top-3 Software Company in Bulgaria; About 70 developers; ISO 2001 Certificate; 1999 EIST prize winner;
JRC 8/ Oct 2005 Sirma Businesses and Domains Diverse business, ranging from COTS products to custom projects, consultancy, and outsourcing services. Major areas: AI – expert systems (beside Ontotext); b2b market places CAD/CAM (for packaging, quality control) e-Government, CSCW, Groupware, Workflow; Banking C3/C4 Systems (military, airport traffic); VOIP billing systems; e-Publishing, Proofing tools.
JRC 9/ Oct 2005 Ontotext Lab An R&D lab of Sirma for Knowledge and Language Engineering Research and core technology development for knowledge discovery, management, and engineering. Specialized for applications in Semantic Web, Knowledge Management, and Web Services. Aside from the scientific matters, most of us are just professional software developers.
JRC 10/ Oct 2005 Leading Semantic Web Technology Provider Ontotext is a leading Semantic Web technology provider, being: the developer of the KIM Semantic Annotation Platform and a co-developer of the GATE language engineering platform; a co-developer of the Sesame semantic repository and OWLIM high-performance OWL reasoner; the developer of the WSMO4J semantic web services API; a partner in the SWAN Semantic Web Annotator project. Ontotext is part of most of the major European research projects in the field; the most successful Bulgarian participant in FP6.
JRC 11/ Oct 2005 Mission A critical mass of research in a number of AI areas has made efficient KM almost possible. The technology on the market is mostly of two sorts: –Expensive black boxes –Academic prototypes Our mission is: To develop and popularize open, skillfully engineered tools... For Information Extraction and Knowledge Management, Which considerably reduce the cost for implementation and use of KM applications.
JRC 12/ Oct 2005 Major Research Areas We focus on building cutting-edge expertise and technology in the following areas: ontology design, management, and alignment; knowledge representation, reasoning; information extraction (IE), applications in IR; semantic web services; upper-level ontologies and lexical semantics; NLP: POS, gazetteers, co-reference resolution, named entity recognition (NER); machine learning (HMM, NN, etc.)
JRC 13/ Oct 2005 Academic & Technology Partners NLP Group, Sheffield University, UK; Digital Enterprise Research Institute (DERI), Institut für Informatik, Innsbruck, Austria, and National University of Ireland, Galway; Aduna (Aidministrator) b.v., The Netherlands; Linguistic Modelling Lab. CLPOI, Bulgarian Academy of Sciences; British Telecommunications Plc, (BT), UK. Forschungszentrum Informatik (FZI) and Institut AIFB Karlsruhe, Germany.
JRC 15/ Oct 2005 Research Projects We were/are part of a number of FP5 research projects: On-To-Knowledge - the project which invented OIL. Ontology Middleware Module and a DAML+OIL reasoner. VISION - Towards Next Generation Knowledge Management. OntoWeb - Ontology-based information exchange for knowledge management …. SWWS - Semantic Web enabled Web Services.
JRC 16/ Oct 2005 Research Projects (II) FP6 integrated projects that started Jan 2004, durations ~3 years: SEKT: Semantic Knowledge Technologies. Targeting a synergy of Ontology and Metadata Technology, Knowledge Discovery and Human Language Technology. DIP: Data, Information, and Process Integration with Semantic Web Services. PrestoSpace: Preservation towards storage and access. Standardized Practices for Audiovisual Contents in Europe. Infrawebs: Intelligent Framework for Generating Open (Adaptable) Development Platforms for Web-Service Enabled Applications Using Semantic Web Technologies, Distributed Decision Support Units and Multi-Agent-Systems
JRC 17/ Oct 2005 Introduction to Ontologies Despite the formal definitions, ontologies are: Conceptual models or schemata –Represented in a formalism which allows –Unambiguous semantic interpretation –Inference Can be considered a combination of: –DB schema –XML Schema –OO-diagram (e.g. UML) –Subject hierarchy/taxonomy (think of Yahoo) –Business logic rules
JRC 18/ Oct 2005 Introduction to Ontologies (II) Imagine a DB storing "John is a son of Mary". It will be able to "answer" just: –Which are the sons of Mary? –Whose son is John? An ontology with a definition of the family relationships could infer: –John is a child of Mary (more general); –Mary is a woman; –Mary is the mother of John (inverse); –Mary is a relative of John (generalized inverse). The above facts would remain "invisible" to a typical DB, whose model of the world is limited to data structures of strings and numbers.
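The inference described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular reasoner is implemented: the property names (sonOf, childOf, parentOf, motherOf, relativeOf) and the rule tables are made up for the example, and Mary's gender is asserted as a second fact because it cannot be derived from "son of" alone.

```python
# Hypothetical mini-ontology; names are illustrative, not taken from PROTON or KIM.
SUB_PROPERTY = {"sonOf": "childOf", "childOf": "relativeOf", "parentOf": "relativeOf"}
INVERSE = {"childOf": "parentOf", "parentOf": "childOf"}


def infer(facts):
    """Forward-chain over (subject, property, object) triples until nothing new appears."""
    closure = set(facts)
    while True:
        new = set()
        for s, p, o in closure:
            if p in SUB_PROPERTY:              # more general property
                new.add((s, SUB_PROPERTY[p], o))
            if p in INVERSE:                   # inverse property
                new.add((o, INVERSE[p], s))
            # Assumed rule: a parent who is a woman is a mother.
            if p == "parentOf" and (s, "type", "Woman") in closure:
                new.add((s, "motherOf", o))
        if new <= closure:                     # fixpoint reached
            return closure
        closure |= new


facts = {("John", "sonOf", "Mary"), ("Mary", "type", "Woman")}
closure = infer(facts)
for triple in sorted(closure - facts):
    print(triple)
```

A plain database would hold only the two stored rows; the five extra statements printed here exist only because the schema carries meaning that the rules can act on.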
JRC 19/ Oct 2005 Products The Ontology Middleware Module (OMM) is an enterprise back-end for formal KR and KM applications based on Semantic Web standards An extension of the Sesame RDF(S) repository that adds a Knowledge Control System. OMM integration options: Built-In, RMI, SOAP, HTTP.
JRC 20/ Oct 2005 Products BOR – a DAML+OIL reasoner. Proprietary GATE components: –Hash Gazetteer. A high-performance lookup tool. –Hidden Markov Model Learner. A stochastic module for filtering annotations, disambiguation, etc., based on confidence measures. The News Collector is a web service, collecting and indexing articles from the top-10 global news wires: –About 1000 articles/day, annotated and indexed using KIM; –Used to validate the heuristics and resources of KIM;
JRC 21/ Oct 2005 Products (II) The KIM Platform (the next slides), SWWS Studio (http://swws.ontotext.com) – Semantic Web Service description development environment – Developed in the course of the SWWS project – Based on WSMO (http://www.wsmo.org) WSMO4J (http://wsmo4j.sourceforge.net) –A WSMO API and a reference implementation –for building Semantic Web Services applications –Used in WSMO Studio, (http://www.wsmostudio.org/) –The basis for ORDI, used in OMWG (http://www.omwg.org) –Used in projects DIP, SEKT, Infrawebs
JRC 22/ Oct 2005 OWLIM OWLIM is a high-performance OWL repository: a Storage and Inference Layer (SAIL) for the Sesame RDF database. OWLIM performs OWL DLP reasoning. It uses the IRRE (Inductive Rule Reasoning Engine) for forward-chaining and total materialization. In-memory reasoning and query evaluation. OWLIM provides reliable persistence, based on RDF N-Triples. OWLIM can manage millions of statements on desktop hardware. Extremely fast upload and query evaluation even for huge ontologies and knowledge bases.
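To make "forward-chaining and total materialization" concrete, here is a toy in-memory store, written as a sketch and not as a description of OWLIM's internals. The class name `MaterializedStore`, the single supported rule (transitivity) and the sample region data are all assumptions. The point it illustrates is the trade-off: all consequences are derived when a statement is added, so answering a query is a plain set lookup.

```python
class MaterializedStore:
    """Toy triple store: every add() eagerly derives all transitive consequences."""

    def __init__(self, transitive):
        self.transitive = set(transitive)  # properties treated as transitive
        self.triples = set()               # explicit + inferred statements

    def add(self, s, p, o):
        pending = [(s, p, o)]
        while pending:
            triple = pending.pop()
            if triple in self.triples:
                continue
            self.triples.add(triple)
            s1, p1, o1 = triple
            if p1 not in self.transitive:
                continue
            # Join the new statement with existing ones on both ends.
            for s2, p2, o2 in list(self.triples):
                if p2 != p1:
                    continue
                if o1 == s2:
                    pending.append((s1, p1, o2))
                if o2 == s1:
                    pending.append((s2, p1, o1))

    def ask(self, s, p, o):
        # No reasoning at query time: the answer was materialized on load.
        return (s, p, o) in self.triples


store = MaterializedStore(transitive={"subRegionOf"})
store.add("Glasgow", "subRegionOf", "Scotland")
store.add("Scotland", "subRegionOf", "UK")
store.add("UK", "subRegionOf", "Europe")
print(store.ask("Glasgow", "subRegionOf", "Europe"))
```

Three explicit statements grow to six after inference here; the price of fast queries is a larger store and more work at upload time.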
JRC 23/ Oct 2005 Scalability: Upload and Reasoning
JRC 24/ Oct 2005 Scalability: Query Answering Q2: Pattern of 12 statement-joins and LIKE literal constraint
JRC 25/ Oct 2005 OWLIM under LUBM Benchmark The Lehigh Univ. evaluation (LUBM) is one of the most comprehensive benchmark experiments published recently (ISWC 2004, WSJ 2005). Synthetically generated OWL knowledge bases. The biggest set generated is LUBM(50,0) – 6M explicit statements. 14 queries, checking different inferences. OWLIM on LUBM: –On a desktop machine OWLIM loads LUBM(50,0) in 10 min –The only other system known to load it takes 12 hours –All the queries are answered correctly Based on this we can claim that: –OWLIM is the fastest OWL repository in the world!
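As a quick sanity check of what the load times quoted on this slide imply, the arithmetic below uses only the slide's round figures (6M explicit statements, 10 minutes, 12 hours). The results are therefore rough, order-of-magnitude numbers, not measurements.

```python
explicit_statements = 6_000_000      # size of the largest data set, per the slide
owlim_load_s = 10 * 60               # 10 minutes
other_load_s = 12 * 60 * 60          # 12 hours

speedup = other_load_s / owlim_load_s
throughput = explicit_statements / owlim_load_s

print(f"speedup: {speedup:.0f}x, load rate: {throughput:.0f} statements/s")
```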
JRC 26/ Oct 2005 JOCI Jobs & Contacts Intelligence, Innovantage, Fairway Consultants. Gathering recruitment-related information from web-sites of UK organizations. Offering services on top of this data to recruitment agencies, job portals, and others. JOCI uses KIM for information extraction (IE, text-mining). JOCI makes use of a domain ontology to: –support the IE process, –structure the knowledge base with the obtained results, and –facilitate semantic queries. Sirma is a shareholder in Fairway Consultants.
JRC 27/ Oct 2005 JOCI Dataflow (diagram). Components shown: UK Web Space; Focused Crawler (Crawler, Classifier); Information Extraction (Single-Document IE, Object Consolidation); KIM Server (Semantic Repository, Document Store); Web UI.
JRC 28/ Oct 2005 JOCI: Vacancy Consolidation/Matching (diagram). Classes shown: Location, City, Country. Instances: U.K., Scotland, Glasgow; Vacancy 1, Vacancy 2, Consolidated Vacancy. Job titles: "IT Applications Support Analyst", "Support Analyst". Relations: subRegionOf, hasJobTitle, locatedIn, sub-string, type, subClassOf.
JRC 29/ Oct 2005 JOCI Statistics The figures below are indicative and reflect an old state of the JOCI system: –The actual figures are to be announced after the launch of JOCI Web-sites inspected: 0.5M Web-sites with vacancy announcements: 30K Extracted vacancies: 100K
JRC 30/ Oct 2005 The KIM Platform A platform offering services and infrastructure for: –(semi-) automatic semantic annotation and –ontology population –semantic indexing and retrieval of content –query and navigation over the formal knowledge Based on Information Extraction technology
JRC 31/ Oct 2005 KIM: What's Inside? The KIM Platform includes: Ontologies (PROTON + KIMSO + KIMLO) and KIM World KB; KIM Server – with a set of APIs for remote access and integration; Front-ends: Web-UI and plug-in for Internet Explorer.
JRC 32/ Oct 2005 The AIM of KIM Aim: to arm Semantic Web applications -by providing a metadata generation technology -in a standard, consistent, and scalable framework
JRC 33/ Oct 2005 What Does KIM Do? Semantic Annotation
JRC 34/ Oct 2005 Simple Usage: Highlight, Hyperlink, and…
JRC 35/ Oct 2005 Simple Usage: … Explore and Navigate
JRC 36/ Oct 2005 Simple Usage: … Enjoy a Hyperbolic Tree View
JRC 37/ Oct 2005 KIM is Based On… KIM is based on the following open-source platforms: GATE – the most popular NLP and IE platform in the world, developed at the University of Sheffield. Ontotext is its biggest co-developer. OWLIM – an OWL repository, compliant with the Sesame RDF database from Aduna B.V. Lucene – an open-source IR engine by Apache. jakarta.apache.org/lucene/
JRC 38/ Oct 2005 How KIM Searches Better KIM can match a query like: "Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002" with a document containing: "At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO". Classical IR could not match: –Vodafone with a "telecom in Europe", because: Vodafone is a mobile operator, which is a sort of telecom; Vodafone is in the UK, which is a part of Europe. –10th of May with a "date in the first half of 2002"; –John G. Smith with John Smith.
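The three matches on this slide each rely on a different kind of background knowledge, which the sketch below spells out. It is a hand-written illustration, not KIM's query engine: the class names, the tiny knowledge base, the year 2002 for the meeting date and the name-matching heuristic are all assumptions.

```python
from datetime import date

# Hypothetical fragments of an ontology and knowledge base.
SUBCLASS = {"MobileOperator": "TelecomCompany", "TelecomCompany": "Company"}
PART_OF = {"UK": "Europe"}
ENTITY = {"Vodafone": {"type": "MobileOperator", "locatedIn": "UK"}}


def ancestors(start, parent_of):
    """Yield start and everything reachable by following parent links."""
    while start is not None:
        yield start
        start = parent_of.get(start)


def is_a(entity, cls):
    return cls in ancestors(ENTITY[entity]["type"], SUBCLASS)


def is_in(entity, region):
    return region in ancestors(ENTITY[entity]["locatedIn"], PART_OF)


def same_person(query_name, doc_name):
    # Crude heuristic: first and last tokens agree, middle initials are ignored.
    q, d = query_name.split(), doc_name.split()
    return q[0] == d[0] and q[-1] == d[-1]


def in_first_half(day, year):
    return date(year, 1, 1) <= day <= date(year, 6, 30)


matches = (
    is_a("Vodafone", "TelecomCompany")
    and is_in("Vodafone", "Europe")
    and same_person("John Smith", "John G. Smith")
    and in_first_half(date(2002, 5, 10), 2002)
)
print(matches)
```

A keyword index sees none of these links, since the words "telecom" and "Europe" never occur in the document text.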
JRC 39/ Oct 2005 Entity Pattern Search
JRC 40/ Oct 2005 Pattern Search: Entity Results
JRC 41/ Oct 2005 Entity Pattern Search: KIM Explorer
JRC 42/ Oct 2005 Semantic Metadata in KIM… Provides a specific metadata schema, –focusing on named entities (particulars), –as well as number and time-expressions, addresses, etc., –everything specific, apart from the general concepts. Defines specific tasks for generation and usage of the metadata which are well-understood and measurable. Why not metadata about general things (universals)? –It is too complex… –but we leave the door open. The particulars seem to provide a good 80/20 compromise.
JRC 43/ Oct 2005 World Knowledge in KIM Rationale: provide common knowledge about world entities; KIM bets on scale and avoids heavy semantics; minimum modeling of common-sense, almost no axioms. The ontology is encoded in OWL Lite and RDF. In addition, a number of rules (generative axioms) are defined, e.g.: and => Axioms of this sort are supported by OWLIM, and they provide a consistent mechanism for custom extensions to the OWL or RDF(S) semantics with respect to a particular ontology.
JRC 44/ Oct 2005 PROTON Name. PROTON is an acronym for Proto Ontology –ex-names: BULO (basic upper-level ontology), GO (generic ontology); –not a Russian space rocket –proto – used in the sense of primary, beginning, giving rise to, vs. first in time or oldest; –connotations: positive, fundamental, elemental, in favour of, even romantic (like a science-fiction novel from the 60-ies) Intended usage. A Basic Upper-Level Ontology like PROTON - used for: –ontology population –knowledge modelling and integration strategy of a KM environment; –generation of domain, application, and other ontologies.
JRC 45/ Oct 2005 PROTON Design Design principles: 1.domain-independence; 2.light-weight logical definitions; 3.Compliance with popular metadata standards; 4.good coverage of concrete and/or named entities (i.e. people, organizations, numbers); 5.no specific support for general concepts (such as apple, love, walk), however the design allows for such extensions
JRC 46/ Oct 2005 Some Figures… PROTON defines about 250 classes and 100 properties Providing coverage of most of the upper-level concepts necessary for semantic annotation, indexing, and retrieval A modular architecture, allowing for great flexibility of usage and extension: –SYSTEM module - contains a few meta-level primitives (6 classes and 7 properties); introduces the notion of 'entity', which can have aliases; –TOP module - the highest, most general, conceptual level, consisting of about 20 classes; –UPPER module - over 200 general classes of entities, which often appear in multiple domains.
JRC 47/ Oct 2005 PROTON Ontology Language The current version of the ontology is encoded in OWL Lite. A few custom entailment rules (axioms) are also defined for usage in tools that support them, for instance: Premise: Consequent: Axioms of this sort are interpreted by OWLIM. PROTON is portable to any OWL (Lite)-compliant tool. PROTON can also be used without such axioms.
JRC 48/ Oct 2005 Other Standards: Relations ADL Feature Type Thesaurus and GNS –the backbone of the Location branch; –in turn aligned with the geographic feature designators of the GNS database of NIMA; –PROTON is more coarse-grained, taking about 80 out of 300 types. Dublin Core –the basic element set available as properties of protont:InformationResource and protont:Document classes; –the resource type vocabulary is mapped to sub-classes of InformationResource. OpenCyc and WordNet – consulted and referred to in glosses. ACE (Automatic Content Extraction) annotation types – covered. FOAF – assure easy mapping (e.g. the Account class was added). DOLCE, EuroWordnet Top, and others – consulted to various extent.
JRC 49/ Oct 2005 Other Standards: Compliance Other models are not directly imported (for consistency reasons) The mapping of the appropriate primitives is easy, on the basis of –a compliant design, and –formal notes in the PROTON glosses, which indicate the appropriate mappings. For instance, in PROTON, a protont:inLanguage property is defined –as an equivalent of the dc:language element in Dublin Core –with a domain protont:InformationResource –and a range protont:Language
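The protont:inLanguage example can be written down as three schema statements, and a few lines of Python show what they entail for a single data statement. This is a sketch only: the slide says the Dublin Core mapping is recorded in glosses, so expressing it as owl:equivalentProperty is an assumption of the example, and the instance identifiers (doc:report42, lang:en) are hypothetical.

```python
# Schema statements as described on the slide; the equivalence link is assumed.
SCHEMA = [
    ("protont:inLanguage", "rdfs:domain", "protont:InformationResource"),
    ("protont:inLanguage", "rdfs:range", "protont:Language"),
    ("protont:inLanguage", "owl:equivalentProperty", "dc:language"),
]


def expand(statement, schema=SCHEMA):
    """Return the statement plus what domain, range and equivalence entail."""
    s, p, o = statement
    derived = {statement}
    for prop, rel, target in schema:
        if prop != p:
            continue
        if rel == "rdfs:domain":
            derived.add((s, "rdf:type", target))   # subject belongs to the domain
        elif rel == "rdfs:range":
            derived.add((o, "rdf:type", target))   # object belongs to the range
        elif rel == "owl:equivalentProperty":
            derived.add((s, target, o))            # same fact, Dublin Core name
    return derived


result = expand(("doc:report42", "protont:inLanguage", "lang:en"))
for triple in sorted(result):
    print(triple)
```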
JRC 50/ Oct 2005 KIM World KB A quasi-exhaustive coverage of the most popular entities in the world … What a person is expected to have heard about that is beyond the horizons of his country, profession, and hobbies. Entities of general importance … like the ones that appear in the news … KIM knows: Locations: mountains, cities, roads, etc. Organizations, all important sorts of: business, international, political, government, sport, academic… Specific people, etc.
JRC 51/ Oct 2005 KIM World KB: Entity Description The NE-s are represented with their Semantic Descriptions via: Aliases (Florida & FL); Relations with other entities (Person hasPosition Position); Attributes (latitude & longitude of geographic entities); their proper Class
JRC 52/ Oct 2005 The Scale of KIM World KB (figures given as Small KB / Full KB)
RDF Statements
 - explicit: 444,086 / 2,248,576
 - after inference: 1,014,409 / 5,200,017
Instances
 - Entity: 40,804 / 205,287
 - Location: 12,528 / 35,590
 - Country: 261
 - Province: 4,262
 - City: 4,400 / 4,417
 - Organization: 8,339 / 146,969
 - Company: 7,848 / 146,262
 - Person: 6,022 / 6,354
 - Alias: 64,589 / 429,035
JRC 53/ Oct 2005 KIM IE Pipeline
JRC 54/ Oct 2005 JAPE Grammars Jape grammars are based on the last MUSE version Class/instance information included Better class granularity in grammars Relation recognition grammars - LocatedIn and HasPositionWithinOrganization
JRC 55/ Oct 2005 Disambiguation & Filtering Simple disambiguation (longest match), e.g. San Francisco Journal Based on the main alias, e.g. Beijing By priority of the class, instance or relative class priority –E.g. Brand Microsoft vs. Company Microsoft Corp. –We assign a priority (1-1000) to each class and instance –For pairs of classes we define relative priority –If the difference between the priorities is greater than a certain threshold the possible reference to the entity with the lower priority is ignored Still to be improved
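Two of the heuristics on this slide, longest match and priority filtering, are simple enough to sketch. The code below is an illustration under assumptions: the function names, the tuple layouts, the concrete priority values and the threshold are invented, and the relative priorities defined for pairs of classes are not modelled.

```python
def longest_match(lookups):
    """Drop any (start, end, text) lookup that lies inside a longer one."""
    def inside(a, b):
        return b[0] <= a[0] and a[1] <= b[1] and (b[1] - b[0]) > (a[1] - a[0])
    return [a for a in lookups if not any(inside(a, b) for b in lookups)]


def filter_by_priority(candidates, threshold):
    """Keep candidates whose priority is within `threshold` of the best one."""
    best = max(priority for _, priority in candidates)
    return [name for name, priority in candidates if best - priority <= threshold]


# "San Francisco" is contained in "San Francisco Journal", so it is dropped.
lookups = [(0, 21, "San Francisco Journal"), (0, 13, "San Francisco")]
kept = longest_match(lookups)

# Hypothetical priorities on the 1-1000 scale mentioned on the slide.
candidates = [("Company: Microsoft Corp.", 900), ("Brand: Microsoft", 400)]
winners = filter_by_priority(candidates, threshold=300)
print(kept, winners)
```

With a looser threshold (say 600) both readings of "Microsoft" would survive and a later stage would have to choose between them.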
JRC 56/ Oct 2005 KIM Scaling on Data The Semantic Repository is based on OWLIM In our practical tests we observe perfect performance on top of: –1.2M of entity descriptions: –about 15M explicit statements; –above 30M statements after forward chaining. Document and Annotation storage and indexing with Lucene: –One million docs, processed on a $1000-worth machine; –retrieval in milliseconds.
JRC 57/ Oct 2005 Entity Ranking: a sketch for Jan-May 2004 (columns: No, Instance, Label, Rank)
1 Country_T.5 United States
Country_T.IZ Republic of Iraq
Person_T.51 George W. Bush
Country_T.IS State of Israel
DayOfWeek_T.4 Tuesday
NewsAgency_T.6 The Associated Press
InternationalOrganization_T.13 United Nations
Country_T.CH People's Republic of China
City_T.3068 New York
InternationalOrganization_T.18 European Union
Person_T.115 Ariel Sharon
Country_T.JA Japan
Country_T.UK United Kingdom
CountryCapital_T.93 Baghdad 0.003
JRC 58/ Oct 2005 SWAN/KIM Cluster Architecture At present, KIM is used for massive semantic annotation in the context of the SWAN and SEKT projects. Here are some of its features: support for a virtually unlimited number of annotators; centralized ontology storage and querying; centralized meta-data (annotations) and document storage, indexing, and querying; support for multiple crawlers (or other data sources); dynamic reconfiguration of the cluster (e.g. starting new crawlers or annotators on demand).
JRC 59/ Oct 2005 SWAN/KIM Cluster Console
JRC 60/ Oct 2005 SWAN Project: Semantic Web Annotator Large Scale Annotation of human language for the Semantic Web using Human Language Technology (HLT). Hosted by DERI (NUIG, Galway) and involves also: GATE team (from the Sheffield University's NLP Group) and Ontotext Lab. For more details take a look at The current status: KIM Cluster of 7 servers in DERI Above 0.5TB shared storage 6 AMD64 Opterons, 6 Xeons, 36GB RAM
JRC 61/ Oct 2005 CoreDB: Name and Goals CoreDB is a component of KIM Stands for: Co-Occurrence and Ranking of Entities DB In a nutshell, it is designed to allow fast queries of the sort: –Q1: the number of appearances of UK in documents during Jan 2005 –Q2: all people co-occurring with John Smith and some bank institution in documents from the second half of 2003 –Q3: Q2 + where the documents contain fraud and the name of the institution contains capital
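Queries of the Q1 and Q2 kind can be illustrated over a handful of in-memory records. This sketch says nothing about CoreDB's actual schema or API: the document tuples, the entity identifiers with a class prefix, and both function names are assumptions. It also counts documents, not individual mentions, and does not follow sub-classes.

```python
from collections import Counter
from datetime import date

# Hypothetical annotated documents: (doc id, date, entities mentioned).
DOCS = [
    ("d1", date(2003, 7, 14), {"Person:John_Smith", "Bank:Acme_Capital", "Person:Jane_Roe"}),
    ("d2", date(2003, 11, 2), {"Person:John_Smith", "Bank:Acme_Capital", "Person:Jane_Roe", "Country:UK"}),
    ("d3", date(2005, 1, 20), {"Country:UK", "Person:Jane_Roe"}),
    ("d4", date(2005, 1, 25), {"Country:UK"}),
]


def count_docs(entity, start, end):
    """Q1-style: number of documents dated in [start, end] that mention the entity."""
    return sum(1 for _, day, ents in DOCS if start <= day <= end and entity in ents)


def co_occurring(anchor, other_class, wanted_class, start, end):
    """Q2-style: wanted_class entities appearing with anchor and some other_class entity."""
    hits = Counter()
    for _, day, ents in DOCS:
        if not (start <= day <= end) or anchor not in ents:
            continue
        if not any(e.startswith(other_class + ":") for e in ents):
            continue
        hits.update(e for e in ents if e.startswith(wanted_class + ":") and e != anchor)
    return hits


q1 = count_docs("Country:UK", date(2005, 1, 1), date(2005, 1, 31))
q2 = co_occurring("Person:John_Smith", "Bank", "Person", date(2003, 7, 1), date(2003, 12, 31))
print(q1, dict(q2))
```

Q3 would add two more filters to the same loop: a keyword test on the document content and a token test on the institution's name.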
JRC 62/ Oct 2005 CoreDB: Functionality It allows asking in a structured manner for: –The number of references to entities in a (sub-)set of documents –The entities, which co-occur together with other entities Entities can be constrained by: –Class (and its sub-classes) –Keyword/token in one of its names/aliases/labels Documents can be constrained according to DC-like features: –Date (range; could be any date in the doc) –Type (exact match; could be any string) –Authors –Title and Sub-title –Keyword/token in the content, authors or the title fields
JRC 63/ Oct 2005 The Scale of Ambition The major point is to allow such queries in an *efficient* manner over data with the following cardinality: –10^6 entities/terms –10^7 documents –10^2 entities occurring in an average document. This means managing and querying efficiently 10^9 entity occurrences (10^7 documents × 10^2 entities per document). We have tested the current implementations with 10^7 occurrences, and they answer the basic queries in milliseconds.
JRC 64/ Oct 2005 CoreDB Applications Detection of associative links between entities, based on co-occurrence in documents –It is an alternative to the detection of strong links based on local context parsing. Ranking, measuring popularity, of an entity over a set of documents –The ranking is as good/relevant/representative as the set of documents is. Computing timelines (changes over time) for entity ranking or co-occurrence –How did our popularity in the IT press change during June (i.e. what is the effect of this 1.5MEuro media campaign ?!?) –How does the strength of association between organization X and RDF change over Q1?
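Both timeline questions reduce to counting over dated document sets. The sketch below shows one way to compute them; the slides do not say which measures CoreDB uses, so the popularity score (share of a month's documents mentioning the entity) and the association score (Jaccard overlap) are assumptions, as are the sample data and function names.

```python
from collections import defaultdict


def monthly_popularity(docs, entity):
    """Share of each month's documents that mention the entity."""
    total, hits = defaultdict(int), defaultdict(int)
    for month, ents in docs:
        total[month] += 1
        hits[month] += entity in ents
    return {month: hits[month] / total[month] for month in sorted(total)}


def association(docs, a, b):
    """Jaccard overlap of the document sets mentioning a and b."""
    with_a = {i for i, (_, ents) in enumerate(docs) if a in ents}
    with_b = {i for i, (_, ents) in enumerate(docs) if b in ents}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


# Hypothetical annotated documents: ((year, month), entities mentioned).
docs = [
    ((2005, 5), {"Org:X"}),
    ((2005, 5), {"Org:Y"}),
    ((2005, 6), {"Org:X", "Topic:RDF"}),
    ((2005, 6), {"Org:X"}),
]

timeline = monthly_popularity(docs, "Org:X")
strength = association(docs, "Org:X", "Topic:RDF")
print(timeline, round(strength, 3))
```

Running `association` separately on each month's documents would give the change in association strength over time.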
JRC 65/ Oct 2005 Implementation It is a new component in the architecture of KIM –Having an API (part of the KIM API), allows different implementations There are now a couple of RDBMS-based implementations: –Derby (free, open-source, 100% Java, was Cloudscape from IBM) –ORACLE (v. 10g) The Derby implementation – does not allow for efficient searches involving keywords The ORACLE implementation is used also for FTS-style indexing of the document contents –Makes possible efficient combination of semantic and keyword search (which is already available through the SemanticQuery API) In both RDBMS implementations: –Part of the ontology and the KB are replicated –Same with part of the document and index related information
JRC 66/ Oct 2005 Ontotext Facts Founded year employees (permanent, without the shared personnel and associates) Daily statistics for over: 150 visits; 2000 hits Number of scientific publications: above 30 Number of projects running: 9 More than 20 partners we directly cooperate with on projects Average age: about 28 Number of servers per developer: 0.7
JRC 67/ Oct 2005 Ontotext Lab Robust Technology and Professional Services for Knowledge and Language Engineering