Inside semantic Web search engines: between semantic annotation and Natural Language Processing (Dentro i motori di ricerca semantici: tra annotazione semantica ed elaborazione della lingua naturale). ISKO Italia meeting, Torino, 3 April 2009. Talk by Mela Bosch.
Terminology on Web search engines
Text search engine: based on lexical analysis. The main aim of lexical analysis is to divide the text into paragraphs, sentences and words, and to pick out entities such as addresses or URLs. All these elements are known as tokens; the search engine parses them using statistical parameters to produce a range of links in response to a query.
Latent semantic indexing (LSI): based on latent semantic analysis (LSA). LSI is a Natural Language Processing (NLP) technique which uses an indexed database of documents to find similar terms. It can find a synonym and then return the best-matching websites for the query, so it does not require exactly matching words to rank results.
Semantic Web search engines: take the sense of a word as a factor in their ranking lists, or offer the user a choice as to the sense of a word or phrase.
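The lexical analysis step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the tokenizer of any particular search engine: the token patterns, the naive sentence splitter and the function names (`split_sentences`, `tokenize`, `lexical_analysis`) are all assumptions made for the example.

```python
import re

# Illustrative token patterns (assumptions, not taken from any specific engine).
# Order matters: URLs are tried before plain words so they stay in one piece.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://[^\s,;]+)"
    r"|(?P<word>[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*)"
    r"|(?P<punct>[^\w\s])"
)


def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return (kind, text) pairs for each token found in the sentence."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(sentence)]


def lexical_analysis(text):
    """Divide text into paragraphs, then sentences, then tokens."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [[tokenize(s) for s in split_sentences(p)] for p in paragraphs]
```

The result is a nested list (paragraphs, sentences, tokens). A real engine would then compute statistics over these tokens; that ranking step is not shown here.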
Semantic Web search engines, or third-generation search engines
Three types:
User-oriented Semantic Web search engine: returns web page links. Internally it can use both Semantic Web technologies and LSI. Ex.: True Knowledge, Hakia and PowerSet.
Semantic Web Services-oriented engine: returns links to ontologies, OWL files and RDF instances, so it is not suited to end users. The idea is to provide ways for businesses to interoperate across domains or services. Ex.: SOWL, WSE, Watson, Falcons, Sindice and Swoogle.
Social-semantic Web oriented engine: the socio-semantic web (s2w) uses classification and ontologies in very practical situations. The aim of s2w search engines is to complement the formal Semantic Web vision by adding a pragmatic collaborative tagging (folksonomy) approach. The main interest is to enable users to share knowledge. Ex.:
Semantic Web search engines. What are all these differences for?
“Semantic Web means many things to different people: it is about artificial intelligence, computer programs solving complex optimization problems; it is about web services, in terms of end user value; it is the web of data, where information is represented in RDF or microformats and OWL.” See: emantic_web_patterns_a_guide_redux.php
The components of Semantic Web search engines: Annotation and Natural Language Processing (NLP).
Annotation
Free-text annotation: the annotations can be comments, notes, explanations, references, examples, advice, corrections or any other type of external remark that can be attached to or embedded in a Web document or a selected part of the document.
Semantic annotation in general: semantic annotation is the association of a data entity with an element from a classification scheme, ontology or other knowledge repository.
Examples of semantic annotation: the assignment of MeSH descriptors to citations in MEDLINE; the assignment of Gene Ontology terms to gene products in UniProt.
Semantic Web Annotation
Giving useful meaning to data or to unstructured text is crucial to the fulfillment of the Semantic Web.
A semantic annotation is a formal annotation, where the predicate is an ontological term and the object conforms to an ontological definition.
It is the technique for publishing machine-understandable data on the Web by creating metadata through semantic tagging.
The term “annotation” can denote both the process of annotating and the result of that process.
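To make the definition of a formal annotation concrete, here is a small Python sketch that checks the two conditions named above: the predicate must be an ontological term, and the object must conform to the ontology's definition of that term. The `ex:` identifiers, the two lookup tables and the `is_formal` helper are hypothetical and exist only for this illustration; a real system would consult an OWL or RDFS ontology instead.

```python
from dataclasses import dataclass

# Toy ontology (hypothetical): each property lists the class its object must belong to.
PROPERTY_RANGES = {"ex:hasAuthor": "ex:Person", "ex:locatedIn": "ex:City"}
# Known instances and their classes (also hypothetical).
INSTANCE_CLASSES = {"ex:MelaBosch": "ex:Person", "ex:Torino": "ex:City"}


@dataclass(frozen=True)
class Annotation:
    subject: str    # the annotated resource, e.g. a document identifier
    predicate: str  # should be a term defined in the ontology
    obj: str        # should be an instance of the predicate's range class


def is_formal(annotation):
    """True if the predicate is an ontology term and the object fits its range."""
    expected = PROPERTY_RANGES.get(annotation.predicate)
    if expected is None:
        return False  # predicate is not an ontological term
    return INSTANCE_CLASSES.get(annotation.obj) == expected
```

Under these assumptions, `Annotation("doc:1", "ex:hasAuthor", "ex:MelaBosch")` passes the check, while the same predicate with `"ex:Torino"` as object does not, because a city is not a person.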
Semantic Web Annotation
The Semantic Web Annotation process includes three components:
an ontology which describes the domain of interest;
a data instance recognition process that discovers all instances of interest in target web documents, based on the defined ontology;
an annotation generation process that creates a semantic meaning disclosure file for each annotated document. Through the semantic meaning disclosure file, any ontology-aware machine agent can understand the target document.
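The three components can be sketched end to end in Python. This is a deliberately simplified illustration: the ontology is a plain dictionary of class labels, recognition is exact label matching, and the "semantic meaning disclosure file" is rendered as JSON. All of these choices, along with the names `ONTOLOGY`, `recognize_instances` and `generate_disclosure`, are assumptions made for the example rather than a format prescribed by the talk.

```python
import json
import re

# Component 1: a toy ontology (hypothetical) mapping classes to known labels.
ONTOLOGY = {
    "ex:City": ["Torino", "Helsinki"],
    "ex:Organization": ["ISKO Italia"],
}


def recognize_instances(text, ontology):
    """Component 2: find every label of every ontology class in the text."""
    found = []
    for cls, labels in ontology.items():
        for label in labels:
            for m in re.finditer(r"\b" + re.escape(label) + r"\b", text):
                found.append({"class": cls, "label": label,
                              "start": m.start(), "end": m.end()})
    return sorted(found, key=lambda a: a["start"])


def generate_disclosure(doc_uri, text, ontology):
    """Component 3: serialize the recognized instances for one document."""
    record = {"document": doc_uri,
              "annotations": recognize_instances(text, ontology)}
    return json.dumps(record, indent=2)
```

Calling `generate_disclosure("doc:1", "ISKO Italia met in Torino.", ONTOLOGY)` yields one JSON record listing both recognized instances with their character offsets, which an ontology-aware agent could read without parsing the document again.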
Annotation can be generated manually, automatically or semi-automatically. The process of annotating requires semantic annotation tools.
Types of semantic annotation tools
Inline annotation means that the original document is augmented with metadata information (embedded metadata). Also called semantic authoring or the bottom-up approach, it focuses on annotating information on pages using RDF so that it is machine readable.
Types of semantic annotation tools
Standoff annotation means that the metadata is stored separately from the original document (attached metadata). Also called the top-down approach, it focuses on leveraging information from existing web pages to derive meaning automatically. The annotations are then stored in a database that is made available to users via websites and sometimes via web services. It is generally preferable from the point of view of interoperability.
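The difference between the two storage styles is easy to see in code. The sketch below is an illustration under simple assumptions: entities are found by exact string match, spans do not overlap, and the inline form borrows the RDFa `typeof` attribute as markup. The function names and the `ex:` class identifiers are hypothetical.

```python
def standoff_annotations(text, entities):
    """Standoff: leave the text untouched and record offsets separately."""
    records = []
    for label, cls in entities.items():
        start = text.find(label)  # first occurrence only, for brevity
        if start != -1:
            records.append({"start": start, "end": start + len(label), "class": cls})
    return sorted(records, key=lambda r: r["start"])


def inline_annotations(text, records):
    """Inline: embed the same metadata into a copy of the document itself."""
    out = text
    # Insert from right to left so that earlier offsets stay valid.
    for r in sorted(records, key=lambda r: r["start"], reverse=True):
        span = out[r["start"]:r["end"]]
        out = out[:r["start"]] + f'<span typeof="{r["class"]}">{span}</span>' + out[r["end"]:]
    return out
```

With the text `"Mela Bosch spoke in Torino."` and the entities `{"Mela Bosch": "ex:Person", "Torino": "ex:City"}`, the standoff form is a list of offset records that can live in a separate database, while the inline form is a new version of the document with the metadata embedded in it.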
There are several choices for annotation
The components of Semantic Web search engines: Natural Language Processing (NLP)
Initially, NLP was conceived as a support for linguistic studies. It aimed at using computers to interpret and manipulate words as part of a language, offering a powerful method for the investigation and evaluation of human language itself, i.e. enhanced study over large corpora of texts.
Then, Artificial Intelligence defined NLP as the use of computers to process written and spoken languages for some practical purpose, such as translating languages or carrying on conversations with machines.
The components of Semantic Web search engines: Natural Language Processing (NLP)
Now: after the Web explosion, NLP has been used for the development of natural language understanding systems, which convert samples of human language into more formal representations that are easier for computer programs to manipulate.
Thanks to NLP techniques, different algorithms such as chunking, clustering, parsing, spellchecking, tagging and word sense disambiguation are used to handle text intelligently and to get information from the Web or from text data banks in order to answer questions.
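As one example of the algorithms listed above, word sense disambiguation can be sketched with a gloss-overlap heuristic in the spirit of the simplified Lesk algorithm: the chosen sense is the one whose definition shares the most content words with the surrounding context. The sense inventory, glosses and stopword list below are hand-made for illustration and are not taken from a real lexicon.

```python
import re

# Tiny hand-made sense inventory (hypothetical glosses, not from a real lexicon).
SENSES = {
    "bank": {
        "bank/finance": "institution that accepts deposits and lends money",
        "bank/river": "sloping land beside a river or stream of water",
    }
}
STOPWORDS = {"a", "an", "and", "of", "or", "that", "the", "to", "in", "on"}


def content_words(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense, overlap
    return best_sense
```

In the sentence "She sat on the bank of the river watching the water", the river gloss shares "river" and "water" with the context, so that sense is selected. Production systems use far larger sense inventories and statistical models, but the principle of scoring senses against context is the same.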
Conclusion
Both methodologies are now being combined: Semantic Web search engines need many pages to be annotated, which requires an enormous effort, so NLP becomes an important help in automatic or semi-automatic annotation. At the same time, the precision of text analysis may be improved by means of the assignments made by users and professionals. In conclusion, the trend is towards collective knowledge systems that improve as more people participate, because they are based on human contributions. All of this will possibly be complemented by NLP algorithms.
References
Iskold, Alex (2006). Semantic Web Patterns: A Guide to Semantic Technologies.
Atanas, K. et al. (2005). Semantic Annotation, Indexing, and Retrieval. Ontotext Lab.
Vehvilainen, A. et al. (2006). Semi-Automatic Semantic Annotation and Authoring Tool for a Library Help Desk Service. Helsinki University. http://www.seco.tkk.fi/publications/2006/vehvilainen-hyvonen-alm-semi-automatic-semantic-annotation-and-authoring-tool.pdf
Maynard, Diana (2005). Benchmarking ontology-based annotation tools for the Semantic Web. Department of Computer Science, University of Sheffield, UK. http://gate.ac.uk/sale/ahm05/ahm.pdf
Good, Benjamin M.; Kawas, Edward; Wilkinson, Mark (2007). Bridging the gap between social tagging and semantic annotation: E.D. the Entity Describer.