1 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? Jiawei Han Simon Fraser University, Canada ACM-SIGMOD’99 Web Mining Panel Presentation.


1 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web?
Jiawei Han
Simon Fraser University, Canada
ACM-SIGMOD’99 Web Mining Panel Presentation

2 Challenges to Web Mining
• Web: A huge, widely-distributed, highly heterogeneous, semi-structured, interconnected, evolving, hypertext/hypermedia information repository.
• Problems:
  - the “abundance” problem
  - limited coverage of the Web (hidden Web sources)
  - limited query interface: keyword-oriented search
  - limited customization to individual users
• DBMS, DBers, and data miners will play an increasingly important role in the new generation of the Internet

3 Web Mining: Lots Can Be Done!
• A taxonomy of Web mining:
  - Web content mining
  - Web structure mining
  - Web usage mining
• Some interesting examples of Web mining
  - Mining integrated with Web search engines
  - Weblog mining (usage, access, and evolution)
  - Warehousing a Meta-Web: An MLDB approach

4 Mine What Web Search Engine Finds
• Current Web search engines: convenient source for mining
  - keyword-based, return too many answers, low quality answers, still missing a lot, not customized, etc.
• Data mining will help:
  - coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies
  - better search primitives: user preferences/hints
  - linkage analysis: authoritative pages and clusters
  - Web-based languages: XML + WebSQL + WebML
  - customization: home page + Weblog + user profiles
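The “Enlarge and then shrink” bullet can be illustrated with a small sketch. The slide does not say how either step works, so the reading here is an assumption: first widen the query with synonyms and parent concepts, then narrow the candidates with a required term. The tables `SYNONYMS` and `PARENT` and all function names are hypothetical.

```python
# Hypothetical synonym table and concept hierarchy; the slides define neither.
SYNONYMS = {"car": {"automobile", "auto"}}
PARENT = {"car": "vehicle", "automobile": "vehicle"}


def enlarge(terms):
    """Widen a set of query terms with synonyms and parent concepts."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    # Climb one level of the concept hierarchy for every term found so far.
    expanded |= {PARENT[t] for t in list(expanded) if t in PARENT}
    return expanded


def enlarge_then_shrink(docs, query, must_have):
    """Return ids of documents that match the enlarged query and the constraint."""
    wide = enlarge(query)
    # Enlarge: any document sharing at least one term with the widened query.
    candidates = [d for d, text in docs.items() if set(text.lower().split()) & wide]
    # Shrink: keep only candidates that also contain the required term.
    return sorted(d for d in candidates if must_have in docs[d].lower().split())


docs = {
    "a": "used automobile for sale",
    "b": "car rental deals",
    "c": "vehicle sale today",
    "d": "bicycle sale",
}
result = enlarge_then_shrink(docs, {"car"}, "sale")
```

With these toy documents, the plain keyword "car" would only reach document "b"; the enlarged query also reaches "a" and "c", and the shrink step then drops "b" because it lacks the required term.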

5 Web Log Mining
• Weblog provides rich information about Web dynamics
• Multidimensional Weblog analysis:
  - disclose potential customers, users, markets, etc.
• Plan mining (mining general Web accessing regularities):
  - Web linkage adjustment, performance improvements
• Web accessing association/sequential pattern analysis:
  - Web caching, prefetching, swapping
• Trend analysis:
  - Dynamics of the Web: what has been changing?
• Customized to individual users
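As a rough illustration of how sequential access patterns in a Weblog could drive prefetching, the sketch below counts page-to-page transitions within sessions and suggests the most frequent successor. It is a minimal first-order model chosen for brevity, not a method taken from the slides; the session format, the function names, and the `min_support` threshold are all assumptions.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count how often each page is followed directly by another page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def prefetch_candidate(counts, page, min_support=2):
    """Return the most frequent next page, or None if support is too low."""
    successors = counts.get(page)
    if not successors:
        return None
    following, seen = successors.most_common(1)[0]
    return following if seen >= min_support else None


# Hypothetical sessions: each list is the ordered pages one visitor requested.
sessions = [
    ["/", "/products", "/cart"],
    ["/", "/products", "/about"],
    ["/", "/about"],
]
counts = transition_counts(sessions)
```

Here `prefetch_candidate(counts, "/")` suggests "/products", since two of the three sessions go there next, while "/products" has no successor with enough support to justify a prefetch.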

6 Warehousing a Meta-Web: An MLDB Approach
• Meta-Web: A structure which summarizes the contents, structure, linkage, and access of the Web and which evolves with the Web
• Layer 0: the Web itself
• Layer 1: the lowest layer of the Meta-Web
  - an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
• Layer 2 and up: summary/classification/clustering in various ways and distributed for various applications
• Meta-Web can be warehoused and incrementally updated
• Querying and mining can be performed on or assisted by meta-Web (a multi-layer digital library catalogue, yellow page).
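One way to picture a Layer 1 entry and a Layer 2 summary is sketched below. The field names follow the list on the slide, but their types, the name `PageSummary`, and the choice to generalize by grouping on class are assumptions made only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageSummary:
    """A Layer 1 entry: one summary record per Web page (types are assumed)."""
    url: str
    page_class: str
    time: str
    contents: str
    keywords: set = field(default_factory=set)
    popularity: int = 0
    weight: float = 0.0
    links: list = field(default_factory=list)


def generalize(layer1):
    """Build a Layer 2 view: one summary per class of Layer 1 entries."""
    groups = defaultdict(lambda: {"pages": 0, "popularity": 0, "keywords": set()})
    for entry in layer1:
        group = groups[entry.page_class]
        group["pages"] += 1
        group["popularity"] += entry.popularity
        group["keywords"] |= entry.keywords
    return dict(groups)


# Hypothetical entries; the URLs and values are placeholders.
layer1 = [
    PageSummary(url="http://example.org/a", page_class="databases",
                time="1999-06-01", contents="intro", keywords={"sql"}, popularity=10),
    PageSummary(url="http://example.org/b", page_class="databases",
                time="1999-06-02", contents="cubes", keywords={"olap"}, popularity=5),
    PageSummary(url="http://example.org/c", page_class="sports",
                time="1999-06-03", contents="scores", keywords={"hockey"}, popularity=7),
]
layer2 = generalize(layer1)
```

Because the Layer 2 view is a plain aggregate over Layer 1, a new page summary only touches the group for its own class, which fits the slide's point that the Meta-Web can be updated incrementally.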

7 A Multiple Layered Meta-Web Architecture
[Figure: Layer 0, Layer 1, ..., Layer n; labels: "Generalized Descriptions", "More Generalized Descriptions"]

8 Construction of Multi-Layer Meta-Web
• XML: facilitates structured and meta-information extraction
• Hidden Web: DB schema “extraction” + other meta info
• Automatic classification of Web documents:
  - based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
• Automatic ranking of important Web pages
  - authoritative site recognition and clustering Web pages
• Generalization-based multi-layer meta-Web construction
  - With the assistance of clustering and classification analysis
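The idea of using a hand-built directory as a training set for keyword-based classification can be sketched as follows. This is a deliberately naive word-count scorer, not the analysis the slides refer to; the categories, the training snippets, and the function names are hypothetical.

```python
from collections import Counter


def train(labelled):
    """Build one keyword-frequency profile per directory category."""
    profiles = {}
    for category, text in labelled:
        profiles.setdefault(category, Counter()).update(text.lower().split())
    return profiles


def classify(profiles, text):
    """Assign the category whose profile shares the most keyword mass."""
    words = text.lower().split()
    scores = {c: sum(profile[w] for w in words) for c, profile in profiles.items()}
    return max(scores, key=scores.get)


# Hypothetical (category, page text) pairs standing in for a Web directory.
training = [
    ("Computers", "database query index sql"),
    ("Computers", "compiler database"),
    ("Recreation", "hiking trail camping"),
]
profiles = train(training)
```

A real system would need stop-word handling, weighting, and far more training data; the sketch only shows how directory labels turn into per-category keyword profiles that new pages are scored against.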

9 Use of Multi-Layer Meta Web
• Benefits of Multi-Layer Meta-Web:
  - Multi-dimensional Web info summary analysis
  - Approximate and intelligent query answering
  - Web high-level query answering (WebSQL, WebML)
  - Web content and structure mining
  - Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
  - Benefits even if it is partially constructed
  - Benefits may justify the cost of tool development, standardization and partial restructuring

10 http://db.cs.sfu.ca/
Thank you!

11 Question #2: Can Web Structure Be Mined?
• Use topic hierarchies for document classification?
  - Topic hierarchies, such as CS classifications, are essential components for document classification
  - Yahoo!, AOL, and other information service providers are teachers (training sets) for Web page automatic classification
  - Classification leads to “lattices”, “trees”, or “clusters”
• Mine patterns involving Web pages and hyperlinks?
  - Find authoritative Web pages
  - Find Web page structures and clusters
  - Query and mine Web structures
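The slides do not name a method for finding authoritative pages. One well-known option is hub/authority iteration (HITS), sketched below in plain Python purely as an illustration; the link graph, the function name, and the iteration count are assumptions.

```python
import math


def hits(links, iterations=20):
    """Iterate hub and authority scores over a dict of page -> outgoing links."""
    pages = set(links) | {target for outs in links.values() for target in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: three pages point at "a", one of them also at "b".
links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a"], "b": []}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph "a" comes out as the strongest authority because every hub links to it, and "h1" as the strongest hub because it links to both authorities.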

12 Question #3: Can Customization Be Improved?
• Learn about user’s interests based on access patterns
  - Weblog mining + multidimensional log analysis
  - Home page and user profiles disclose interests
• Provide users with pages, sites, and advertisements of interest
  - Provide facilities for users to specify interests, constraints, and customization
  - Intelligent query answering using multidimensional Web warehouse

13 Question #4: What Role will XML Play?
• XML provides a promising direction for a more structured Web and DBMS-based Web servers
• Promote standardization, help construction of multi-layered Web-base.
• Will XML transform the Web into one unified database enabling structured queries like:
  - “find the cheapest airline ticket from NY to Chicago”
  - “list all jobs with salary > 50 K in the Boston area”
• It is a dream now but more will be minable in the future!
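To make the second example query concrete, the sketch below runs it against a small XML document using Python's standard `xml.etree.ElementTree` module. The element names (`jobs`, `job`, `title`, `city`, `salary`) and the sample data are invented for illustration; the slides do not propose a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical job listings; a real source would be XML published by a site.
JOBS_XML = """
<jobs>
  <job><title>DBA</title><city>Boston</city><salary>62000</salary></job>
  <job><title>Clerk</title><city>Boston</city><salary>38000</salary></job>
  <job><title>Analyst</title><city>Chicago</city><salary>70000</salary></job>
</jobs>
"""


def jobs_above(xml_text, city, min_salary):
    """Return titles of jobs in the given city paying more than min_salary."""
    root = ET.fromstring(xml_text)
    return [
        job.findtext("title")
        for job in root.iter("job")
        if job.findtext("city") == city
        and int(job.findtext("salary")) > min_salary
    ]


boston_jobs = jobs_above(JOBS_XML, "Boston", 50000)
```

The query is only this simple because every listing shares one tag vocabulary, which is the standardization the slide says XML would need to promote before such queries work across sites.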

14 Question #5: What is the Vision for the Future?
• How will users interact with the Web in the future?
  - Keyword-based search of Web pages
  - RDBMS-server based query of hidden Webs
  - Meta-Web based query and multidimensional analysis
• Will structured, declarative querying become widespread?
  - Yes, but it will co-exist with keyword-oriented search
  - Web will be more structured with XML and leaders
  - IR and DBMS will be a joint force in Web technology
  - Keyword search + query + OLAP + mining tools

15 Question #5: Future? (continued)
• Will traditional mining techniques (e.g., clustering, classification) be able to cope with scale, heterogeneity and dynamic nature of the Web?
  - New technologies:
• What key innovation will be required going forward?
  - Web warehouse

