1 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? Jiawei Han Simon Fraser University, Canada ACM-SIGMOD’99 Web Mining Panel Presentation.

Presentation transcript:

1 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? Jiawei Han Simon Fraser University, Canada ACM-SIGMOD’99 Web Mining Panel Presentation

2 Challenges to Web Mining
- Web: A huge, widely-distributed, highly heterogeneous, semi-structured, interconnected, evolving, hypertext/hypermedia information repository.
- Problems:
  - the “abundance” problem
  - limited coverage of the Web (hidden Web sources)
  - limited query interface: keyword-oriented search
  - limited customization to individual users
- DBMS, DBers, and data miners will play an increasingly important role in the new generation of Internet

3 Web Mining: Lots Can Be Done!
- A taxonomy of Web mining:
  - Web content mining
  - Web structure mining
  - Web usage mining
- Some interesting examples of Web mining
  - Mining integrated with Web search engines
  - Weblog mining (usage, access, and evolution)
  - Warehousing a Meta-Web: An MLDB approach

4 Mine What Web Search Engine Finds
- Current Web search engines: convenient source for mining
  - keyword-based, return too many answers, low quality answers, still missing a lot, not customized, etc.
- Data mining will help:
  - coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies
  - better search primitives: user preferences/hints
  - linkage analysis: authoritative pages and clusters
  - Web-based languages: XML + WebSQL + WebML
  - customization: home page + Weblog + user profiles
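The "enlarge and then shrink" idea can be illustrated with a small sketch. This is not the presenter's implementation: the synonym table, the toy documents, and the scoring rule (matches on the original terms count double) are all assumptions made for illustration.

```python
# Hypothetical synonym table; a real system would draw on a thesaurus
# or a conceptual hierarchy instead.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "cheap": {"inexpensive", "budget"},
}

# Hypothetical document collection: id -> text.
DOCS = {
    "d1": "budget automobile listings",
    "d2": "cheap car rentals",
    "d3": "gardening tips",
    "d4": "vehicle safety report",
}


def enlarge(terms):
    """Enlarge step: add the synonyms of every query term."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def shrink(docs, expanded, original, top_k=2):
    """Shrink step: rank documents that match any expanded term, keep top_k."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        hits = words & expanded
        if hits:
            # Assumed scoring: one point per matched term, plus one extra
            # point for each match on an original query term.
            score = len(hits) + len(words & set(original))
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def search(query):
    original = query.lower().split()
    return shrink(DOCS, enlarge(original), original)
```

Here `search("cheap car")` first widens the query to six terms, then narrows the three matching documents down to the two best-scoring ones.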

5 Web Log Mining
- Weblog provides rich information about Web dynamics
- Multidimensional Weblog analysis:
  - disclose potential customers, users, markets, etc.
- Plan mining (mining general Web accessing regularities):
  - Web linkage adjustment, performance improvements
- Web accessing association/sequential pattern analysis:
  - Web caching, prefetching, swapping
- Trend analysis:
  - Dynamics of the Web: what has been changing?
- Customized to individual users
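As a rough illustration of how sequential access patterns could drive prefetching, the sketch below counts which page most often directly follows another within a session. The log format, the URLs, and both function names are hypothetical; real Web logs need sessionization and cleaning first, and the slides do not prescribe this particular method.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified log: (session_id, requested_url) in time order.
LOG = [
    ("s1", "/home"), ("s1", "/products"), ("s1", "/cart"),
    ("s2", "/home"), ("s2", "/products"), ("s2", "/about"),
    ("s3", "/home"), ("s3", "/products"), ("s3", "/cart"),
]


def transition_counts(log):
    """Count how often page B directly follows page A within one session."""
    last_page = {}
    counts = defaultdict(Counter)
    for session, url in log:
        previous = last_page.get(session)
        if previous is not None:
            counts[previous][url] += 1
        last_page[session] = url
    return counts


def prefetch_candidate(counts, page):
    """Return the most frequent successor of `page`, or None if none was seen."""
    successors = counts.get(page)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


COUNTS = transition_counts(LOG)
```

With this toy log, `/products` is followed by `/cart` in two of three sessions, so `/cart` would be the page to prefetch.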

6 Warehousing a Meta-Web: An MLDB Approach
- Meta-Web: A structure which summarizes the contents, structure, linkage, and access of the Web and which evolves with the Web
- Layer 0: the Web itself
- Layer 1: the lowest layer of the Meta-Web
  - an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
- Layer 2 and up: summary/classification/clustering in various ways and distributed for various applications
- Meta-Web can be warehoused and incrementally updated
- Querying and mining can be performed on or assisted by meta-Web (a multi-layer digital library catalogue, yellow page).
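One way to picture the layers is as a record type for Layer 1 plus an aggregation step that produces Layer 2. The sketch below is only an assumption-laden illustration: the field names follow the slide, but the types, the example URLs, and the choice to generalize by class are invented here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageSummary:
    """A Layer-1 entry. Field names follow the slide; types are assumptions."""
    url: str
    doc_class: str
    time: str
    keywords: list = field(default_factory=list)
    popularity: int = 0
    links: list = field(default_factory=list)


def generalize(entries):
    """Build a Layer-2 summary: per class, page count, total popularity,
    and the union of keywords. Grouping by class is an illustrative choice."""
    layer2 = defaultdict(lambda: {"pages": 0, "popularity": 0, "keywords": set()})
    for entry in entries:
        summary = layer2[entry.doc_class]
        summary["pages"] += 1
        summary["popularity"] += entry.popularity
        summary["keywords"].update(entry.keywords)
    return dict(layer2)


# Hypothetical Layer-1 contents.
LAYER1 = [
    PageSummary("http://example.org/a", "databases", "1999-05-01", ["olap", "warehouse"], 10),
    PageSummary("http://example.org/b", "databases", "1999-05-02", ["sql"], 5),
    PageSummary("http://example.org/c", "sports", "1999-05-03", ["hockey"], 7),
]
LAYER2 = generalize(LAYER1)
```

Because the summary is additive, a new Layer-1 entry could be folded into Layer 2 without rebuilding it, which is in the spirit of the incremental update mentioned above.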

7 A Multiple Layered Meta-Web Architecture
[Figure: a stack of layers, Layer 0, Layer 1, ..., Layer n, with the labels “Generalized Descriptions” and “More Generalized Descriptions”.]

8 Construction of Multi-Layer Meta-Web
- XML: facilitates structured and meta-information extraction
- Hidden Web: DB schema “extraction” + other meta info
- Automatic classification of Web documents:
  - based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
- Automatic ranking of important Web pages
  - authoritative site recognition and clustering Web pages
- Generalization-based multi-layer meta-Web construction
  - With the assistance of clustering and classification analysis
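The keyword-based classification step might look roughly like the following sketch, in which a hand-labeled directory plays the role of the training set. The categories, the training texts, and the simple keyword-count score are assumptions for illustration; the slides do not specify a classifier.

```python
from collections import Counter

# Hypothetical training set standing in for a hand-built Web directory:
# category -> texts of pages already filed under it.
TRAINING = {
    "databases": ["sql query index transaction", "olap warehouse cube query"],
    "sports": ["hockey goal season team", "team score league match"],
}


def build_profiles(training):
    """Turn each category's pages into a keyword-frequency profile."""
    return {
        label: Counter(word for doc in docs for word in doc.split())
        for label, docs in training.items()
    }


def classify(text, profiles):
    """Assign the category whose profile shares the most keyword mass with
    the text, or None when no keyword matches at all."""
    words = text.lower().split()
    scores = {
        label: sum(profile[word] for word in words)
        for label, profile in profiles.items()
    }
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else None


PROFILES = build_profiles(TRAINING)
```

A page that matches no profile is left unclassified rather than forced into a category, which keeps the sketch from overstating what keyword overlap alone can decide.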

9 Use of Multi-Layer Meta Web
- Benefits of Multi-Layer Meta-Web:
  - Multi-dimensional Web info summary analysis
  - Approximate and intelligent query answering
  - Web high-level query answering (WebSQL, WebML)
  - Web content and structure mining
  - Observing the dynamics/evolution of the Web
- Is it realistic to construct such a meta-Web?
  - Benefits even if it is partially constructed
  - Benefits may justify the cost of tool development, standardization and partial restructuring

10 Thank you !!!

11 Question # 2: Can Web Structure Be Mined?
- Use topic hierarchies for document classification?
  - Topic hierarchies, such as CS classifications, are essential components for document classification
  - Yahoo!, AOL, and other information service providers are teachers (training sets) for Web page automatic classification
  - Classification leads to “lattices”, “trees”, or “clusters”
- Mine patterns involving Web pages and hyperlinks?
  - Find authoritative Web pages
  - Find Web page structures and clusters.
  - Query and mine Web structures
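The slides do not name an algorithm for finding authoritative pages. One well-known option is Kleinberg's HITS, which scores pages as hubs and authorities by iterating over the link graph. The sketch below is a minimal version on a hypothetical four-page graph, offered as an illustration rather than as the presenter's method.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}


def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries after a fixed number of rounds."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


HUB, AUTH = hits(LINKS)
```

In this graph, page `d` receives links from every other page and ends up with the highest authority score, while `a` and `b`, which point at both `c` and `d`, are the strongest hubs.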

12 Question # 3: Can Customization Be Improved?
- Learn about user’s interests based on access patterns
  - Weblog mining + multidimensional log analysis
  - Home page and user profiles disclose interests
- Provide users with pages, sites, and advertisements of interest
  - Provide facilities for users to specify interests, constraints, and customization
  - Intelligent query answering using multidimensional Web warehouse.

13 Question # 4: What Role will XML Play?
- XML provides a promising direction for a more structured Web and DBMS-based Web servers
- Promote standardization, help construction of multi-layered Web-base.
- Will XML transform the Web into one unified database enabling structured queries like:
  - “find the cheapest airline ticket from NY to Chicago”
  - “list all jobs with salary > 50 K in the Boston area”
- It is a dream now but more will be minable in the future!
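To make the second example query concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The XML document, its element names (`jobs`, `job`, `title`, `city`, `salary`), and the `jobs_above` helper are all hypothetical; the point is only that tagged data lets a program filter on fields instead of matching keywords.

```python
import xml.etree.ElementTree as ET

# Hypothetical job listings; element names are invented for illustration.
XML = """
<jobs>
  <job><title>DBA</title><city>Boston</city><salary>62000</salary></job>
  <job><title>Clerk</title><city>Boston</city><salary>38000</salary></job>
  <job><title>Analyst</title><city>Chicago</city><salary>70000</salary></job>
  <job><title>Data Miner</title><city>Boston</city><salary>85000</salary></job>
</jobs>
"""


def jobs_above(xml_text, city, min_salary):
    """List the titles of jobs in `city` whose salary exceeds `min_salary`."""
    root = ET.fromstring(xml_text)
    titles = []
    for job in root.findall("job"):
        in_city = job.findtext("city") == city
        salary = int(job.findtext("salary"))
        if in_city and salary > min_salary:
            titles.append(job.findtext("title"))
    return titles


BOSTON_JOBS = jobs_above(XML, "Boston", 50000)
```

The same question asked of untagged HTML pages would require guessing which numbers are salaries and which words are cities, which is the gap the slide expects XML to narrow.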

14 Question # 5: What is the Vision for the Future?
- How will users interact with the Web in the future?
  - Keyword-based search of Web pages
  - RDBMS-server based query of hidden Webs
  - Meta-Web based query and multidimensional analysis
- Will structured, declarative querying become widespread?
  - Yes, but it will co-exist with keyword-oriented search
  - Web will be more structured with XML and leaders
  - IR and DBMS will be a joint force in Web technology
  - Keyword search + query + OLAP + mining tools

15 Question # 5: Future? (continued)
- Will traditional mining techniques (e.g., clustering, classification) be able to cope with scale, heterogeneity and dynamic nature of the Web?
  - New technologies:
- What key innovation will be required going forward?
  - Web warehouse