12 July 2002, Colloquium on Applications of Natural Language Corpora, Saarland University. Domain-specific Web Corpora and their Applications. Gregor Erbach.

Presentation transcript:

Domain-specific Web Corpora and their Applications
Gregor Erbach, Saarland University
Project COLLATE (funding: BMBF 01 IN A01 B)

Outline
Part I: Web Corpora
Part II: Applications of Web Corpora
Part III: LT-World Web Corpus
Part IV: Research in COLLATE

Part I: Web Corpora
1. Formal Properties of the Web
2. Web Corpus
3. Document and Hyperlink Database
4. TREC web track

Formal Properties of the Web
Hypertext/Hypermedia
Directed graph with cycles
Edges = hyperlinks
Nodes = documents ???
Nodes often have internal tree structure (HTML, XML)

Web Corpus
A web corpus consists of
– a database of documents
– a database of hyperlinks

Document Database
Information for each document:
– URL/URN
– Full Text (possibly with linguistic annotation such as POS, named entities, phrases)
– Full Text Index
– Metadata: Author, Language, Date, MIME type … (Dublin Core); Category, Abstract, Keywords, Type of Page …

Fields of Hyperlink Database
– source anchor URL
– source anchor position on web page (percentage)
– source anchor position in document structure (HTML element path)
– source anchor type (text or image)
– source anchor text and context
– target anchor URL
– target anchor position on web page
– target anchor MIME type
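The two databases can be pictured as two record types. The following is only a minimal sketch, assuming in-memory Python records; the class and field names (`Document`, `Hyperlink`, `source_position_pct`, `outgoing` and so on) are illustrative and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """One record of the document database (field names are illustrative)."""
    url: str
    full_text: str
    metadata: dict = field(default_factory=dict)  # e.g. Dublin Core elements


@dataclass
class Hyperlink:
    """One record of the hyperlink database, mirroring the field list above."""
    source_url: str
    source_position_pct: float      # position of the anchor on the page, 0..100
    source_element_path: str        # e.g. "html/body/ul/li/a"
    source_anchor_type: str         # "text" or "image"
    source_anchor_text: str
    target_url: str
    source_context: str = ""        # text surrounding the anchor
    target_position: Optional[str] = None   # e.g. a fragment identifier
    target_mime_type: Optional[str] = None


def outgoing(links, url):
    """Return all links whose source anchor is on the given page."""
    return [link for link in links if link.source_url == url]
```

A real corpus would keep these records in a database with an index on `source_url` and `target_url`; the list scan in `outgoing` is only for illustration.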

Derived Properties of Hyperlinks
Same document?
Same server?
Same 2nd/3rd level domain?
Ascending or descending in directory structure?
Source is within a list of links
Navigation link (up, previous, next …)
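Several of these properties can be computed from the source and target URLs alone. The sketch below is one possible reading, not the method used in the project: it takes the 2nd/3rd level domain naively as the last two or three host labels (which is wrong for suffixes such as `co.uk`), and it treats "ascending" as "the target directory is an ancestor of the source directory". The function name `derived_properties` and the result keys are assumptions.

```python
from posixpath import dirname
from urllib.parse import urlsplit


def _domain(host, level):
    """Naive n-th level domain: the last `level` labels of the host name."""
    return ".".join(host.split(".")[-level:])


def _dir_parts(path):
    """Directory components of a URL path, e.g. '/a/b/x.html' -> ['a', 'b']."""
    return [part for part in dirname(path).split("/") if part]


def derived_properties(source_url, target_url):
    src, tgt = urlsplit(source_url), urlsplit(target_url)
    src_host = (src.hostname or "").lower()
    tgt_host = (tgt.hostname or "").lower()
    same_server = src_host == tgt_host

    # Directory direction is only meaningful on the same server.
    direction = None
    if same_server:
        s, t = _dir_parts(src.path), _dir_parts(tgt.path)
        if s == t:
            direction = "same"
        elif s[:len(t)] == t:
            direction = "ascending"    # target directory is an ancestor
        elif t[:len(s)] == s:
            direction = "descending"   # target directory is a descendant
        else:
            direction = "other"

    return {
        # Fragments are ignored: a link to '#top' stays in the same document.
        "same_document": same_server
        and (src.path or "/") == (tgt.path or "/")
        and src.query == tgt.query,
        "same_server": same_server,
        "same_2nd_level_domain": _domain(src_host, 2) == _domain(tgt_host, 2),
        "same_3rd_level_domain": _domain(src_host, 3) == _domain(tgt_host, 3),
        "direction": direction,
    }
```

The last two properties on the slide (source within a list of links, navigation link) need the HTML element path and anchor text rather than the URLs, so they are not covered here.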

TREC web track
Construction of a web corpus (WT10g) according to the following criteria:
– Broadly representative of web data in general
– Many inter-server links
– Contains all available pages from a set of servers
– Contains an interesting set of meta-data
– Contains few binary, non-English or duplicate documents
– Size: 10 GB
P. Bailey, N. Craswell and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. IP&M, to appear.

Part II: Applications of Web Corpora
1. Web Mining
2. Information Retrieval
3. Clustering and Categorisation
4. Summarisation
5. Discovery of Relations
6. Terminology Extraction
7. Information Extraction
8. Ontology Learning

Useful Methods
Machine Learning and Data Mining
Natural Language Processing
Information Retrieval
Ontologies and Semantic Web
Bibliometrics (citation analysis ~ link analysis)

Web Mining
Web Content Mining
– Discovery of terminology, acronyms, concepts
Web Structure Mining
– Discovery of relations, communities …
Web Usage Mining
– Discovery of navigation patterns

Information Retrieval
Usage of hyperlinks for determining popularity of web pages
Hub and authority pages
Widely used: Google PageRank
Mixed results in TREC web track
Jon M. Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment. Journal of the ACM
Sergey Brin, Lawrence Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems
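The hub and authority idea can be illustrated with a simplified iteration in the spirit of Kleinberg's paper: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. This is only a sketch under simplifying assumptions: it runs on a given list of (source, target) pairs, skips the construction of a query-specific subgraph, and uses a fixed number of iterations. The function name `hits` is an assumption.

```python
from math import sqrt


def hits(links, iterations=50):
    """links: iterable of (source, target) pairs.

    Returns (hub, authority), two dicts mapping each node to its score.
    """
    links = list(links)
    nodes = {node for edge in links for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {node: 0.0 for node in nodes}
        for source, target in links:
            auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {node: 0.0 for node in nodes}
        for source, target in links:
            hub[source] += auth[target]
        # Normalise both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm

    return hub, auth
```

For example, with links `h1→a`, `h2→a`, `h1→b`, page `a` ends up with a higher authority score than `b`, and `h1` with a higher hub score than `h2`.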

Clustering
Standard clustering algorithms form clusters by iteratively grouping documents/clusters, according to a distance measure
Content-based methods measure distance by counting terms/concepts (often TF/IDF)
Connectivity-based distance measures make use of hyperlinks
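As an illustration of a content-based distance, the sketch below weights terms by TF/IDF and compares documents by cosine distance. The particular weighting (raw term count times log(N/df)) and the function names are assumptions; the slides do not specify a variant.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zero."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

A clustering algorithm would then repeatedly merge the pair of documents or clusters with the smallest distance. A connectivity-based measure could be plugged in at the same place, for instance one based on shared link targets.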

Categorisation
Categorisation algorithms determine the membership of a document in a pre-defined thematic category
Content-based categorisation methods measure distance from a representative of the category
Connectivity-based distance measures are based on the assumption that certain types of hyperlinks lead to documents of the same category
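The connectivity-based assumption can be sketched as a simple vote: an uncategorised page receives the category that is most common among the already categorised pages it links to or is linked from. This is a deliberately crude illustration. It counts every link equally, whereas the slide speaks of certain types of hyperlinks, and the function name `categorise_by_links` is hypothetical.

```python
from collections import Counter


def categorise_by_links(url, links, known_categories):
    """Guess a category for `url` from its categorised link neighbours.

    links: iterable of (source, target) pairs.
    known_categories: dict mapping URLs to category labels.
    Returns the most frequent neighbour category, or None if there is none.
    """
    votes = Counter()
    for source, target in links:
        if source == url and target in known_categories:
            votes[known_categories[target]] += 1   # outgoing link
        elif target == url and source in known_categories:
            votes[known_categories[source]] += 1   # incoming link
    return votes.most_common(1)[0][0] if votes else None
```

In practice the vote could be restricted to link types that are likely to preserve the category, for example links within the same directory, using the derived hyperlink properties from Part I.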

Summarisation / Keyword Extraction
Source anchor text has been used to generate short summaries of target web pages.

Discovery of Relations
Hyperlink structure reflects relations between web resources (e.g. between personal homepage, project page, organisation page)
Relations can be discovered by content-based methods and by connectivity-based methods

Terminology Extraction
Content-based: extraction of domain terminology by statistical analysis (TF/IDF …) and/or phrasal chunking
Applicability of connectivity-based methods?

Information Extraction
Automatic extraction of meta-data
Extraction of named entities for concept-based indexing
Extraction of templates/relations for relation-based indexing, and question answering

Ontology Learning
Extraction of candidates by frequency of occurrence in similar contexts
Usage of textual clues (“such as”, “sogar” …)
Applicability of connectivity-based methods?
Definition and acronym mining
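The "such as" clue can be illustrated with a small pattern matcher that proposes (narrower term, broader term) candidates. This is a sketch only: it handles single-word terms, covers just the one English clue, and its output would still need filtering, for example by frequency. The names `SUCH_AS` and `hyponym_candidates` are assumptions.

```python
import re

# Separator between enumerated items: ", ", " and ", " or ", ", and ", ", or ".
SEP = r"(?:,? (?:and|or) |, )"

# "<broader term> such as <item>, <item> and <item>" (single-word terms only).
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+" + SEP + r"?)+)")


def hyponym_candidates(text):
    """Return (narrower, broader) candidate pairs found via the 'such as' clue."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        broader = match.group(1)
        for item in re.split(SEP, match.group(2)):
            if item:
                pairs.append((item, broader))
    return pairs
```

For the sentence "formalisms such as HPSG, LFG and TAG are used", this yields the candidates (HPSG, formalisms), (LFG, formalisms) and (TAG, formalisms).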

Part III: LT-World Web Corpus
1. Content of LT World
2. Ontology
3. Hyperlinking within LT World
4. Construction of the corpus

LT World: Idea and Context
The virtual information center is a comprehensive WWW-based information and knowledge service for the entire area of language technology.
LT World is a “virtual” center in the sense that most information will physically remain with its creators or with other service providers.
The virtual information center has been online since October 2001 under the name “LT World”, for “Language Technology World”.

Virtual Information Center - LT World
Information and Knowledge
– Technical and Scientific Information
Players and Teams
– Persons, Projects, Organisations
Resources and Results
– Research Systems, Commercial Products
Communication and Events
– News, Conferences

LT World Ontology
Layer 1: Dublin Core
Layer 2: Specific Ontologies (Publications, Products, Projects, People, Corpora, etc.)
Layer 3: Ontology for CL & LT

LT World Ontology
Dimensions
– Linguality (monolingual, multilingual, cross-language)
– Application
– Computational/mathematical methods
– Linguistic Models / Theories
– Level of linguistic description/processing
– Technologies
– Language(s)
Ontology is modelled in RDF with Protégé 2000

LT World: Coverage
99 topic nodes
300 NLP tools and products
1800 people
850 organisations
500 projects

Data Acquisition Process
Manual collection, categorization and annotation of URLs by students and staff
Sources: conference proceedings and journals, lists of links on the web
Self-registration and correction of data by users of the service
Technical/scientific information in topic nodes has been provided by domain experts

LT World: Topic Nodes
Topic nodes are the main information unit of the area “Knowledge and Information”.
They are organized in a shallow, slightly multidimensional hierarchy following the chapter plan of the second edition of the Language Technology Survey.
Example of the shallow hierarchy:
Information Extraction
– Named Entity Recognition
– Terminology Extraction
– Relation Extraction
– Answer Extraction

Information for each Topic
Name
Acronyms, aka's, Term Translations
Short Definition
Overview Article (from HLT Survey)
Topic Websites
R&D Prototypes/Products
Projects
People
Literature

Hyperlinking between Sections

Corpus Construction
Start from URLs in LT-World collection
Expand document set by recursively following outgoing hyperlinks using a webspider (e.g., GNU wget)
Expand document set by following incoming hyperlinks (“link” query to search engine)
Expand document set by search engine queries with domain terminology
Construct document database and link database
(Filter out irrelevant documents)
Publish Corpus
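The first expansion step, following outgoing hyperlinks from the seed URLs, can be sketched as a breadth-first traversal with a depth limit. To keep the example self-contained, fetching and link parsing are hidden behind a caller-supplied function `get_outlinks`, a hypothetical stand-in for a real spider such as wget plus an HTML parser. The depth limit and all names are assumptions.

```python
from collections import deque


def expand_by_outlinks(seed_urls, get_outlinks, max_depth=2):
    """Expand a seed set by following outgoing links breadth-first.

    get_outlinks(url) must return an iterable of target URLs for a page.
    Returns (documents, links): the set of collected URLs and the list of
    (source, target) pairs seen on the way, i.e. raw material for the
    document database and the link database.
    """
    seeds = list(dict.fromkeys(seed_urls))   # drop duplicates, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    links = []

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue   # page is kept, but its links are not followed
        for target in get_outlinks(url):
            links.append((url, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))

    return seen, links
```

With a toy link graph the function can be tried without network access, e.g. `expand_by_outlinks(["s"], lambda u: {"s": ["a", "b"], "a": ["c"]}.get(u, []))`. The later steps (incoming links, search engine queries, filtering) would add to or prune the same document set.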

Part IV: Research Directions
Categorisation / Information Extraction
Discovery of Relations for Hyperlinking
Other

Categorisation and Information Extraction
Research objectives
– find method for categorising documents according to LT-World ontology
– find method for extraction of meta-information
Compare and combine content-based and connectivity-based methods
If successful, it will contribute to semi-automatic extension of the coverage of LT-World

Discovery of Relations
Objective: develop method for finding pairs of related documents, e.g. personal page – organisation page.
Content-based and connectivity-based methods are applicable
If successful, it will enable a significant improvement of LT-World (resource discovery, resource annotation)

Other
Objective: compare and combine content-based and connectivity-based clustering methods
Applications:
1. Information Retrieval
2. Clustering
3. Summarisation
4. Terminology Extraction
5. Ontology Learning

Conclusion
Main research interest: comparison and combination of content-based and connectivity-based methods
Main application impact: going from a set of “seed” web pages to a domain-specific information system