Published by Stuart Pettitt. Modified over 2 years ago.

1
Hypertext Databases and Data Mining (SIGMOD 1999 Tutorial)
Soumen Chakrabarti
Indian Institute of Technology Bombay
http://www.cse.iitb.ernet.in/~soumen
http://www.cs.berkeley.edu/~soumen
soumen@cse.iitb.ernet.in

2
Soumen Chakrabarti, IIT Bombay
The Web
350 million static HTML pages, 2 terabytes
0.8–1 million new pages created per day
600 GB of pages change per month
Average page changes in a few weeks
Average page has about ten links
Increasing volume of active pages and views
Boundaries between repositories blurred
Bigger than the sum of its parts

3
Hypertext databases
Academia
 –Digital library, web publication
Consumer
 –Newsgroups, communities, product reviews
Industry and organizations
 –Health care, customer service
 –Corporate email

4
What to expect
Write in decimal the exact circumference of a circle of radius one inch
Is the distance between Tokyo and Rome more than 6000 miles?
What is the distance between Tokyo and Rome?
java
java +coffee -applet
“uninterrupt* power suppl*” ups -parcel

5
Search products and services
Verity
Fulcrum
PLS
Oracle text extender
DB2 text extender
Infoseek Intranet
SMART (academic)
Glimpse (academic)
Inktomi (HotBot)
Alta Vista
Google!
Yahoo!
Infoseek Internet
Lycos
Excite

6
[Figure: tutorial roadmap. Labels: FTP, Gopher, HTML; Crawling, Indexing, Search; Local data; Monitor, Mine, Modify; Web Servers; Web Browsers; Social Network of Hyperlinks; Relevance Ranking; Latent Semantic Indexing; Topic Directories; More structure; Clustering; Scatter-Gather; Semi-supervised Learning; Automatic Classification; Web Communities; Topic Distillation; Focused Crawling; User Profiling; Collaborative Filtering; WebSQL; WebL; XML]

7
Basic indexing and search

8
Keyword indexing
Boolean search
 –care AND NOT old
Stemming
 –gain*
Phrases and proximity
 –“new care”
 –loss care
D1: My care is loss of care with old care done
D2: Your care is gain of care with new care won
Postings (word positions counted from 0):
 care  D1: 1, 5, 8  D2: 1, 5, 8
 new   D2: 7
 old   D1: 7
 loss  D1: 3
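The posting lists on this slide can be reproduced with a small positional inverted index. The following is a minimal sketch, not the tutorial's implementation: the function names `build_index`, `and_not`, and `phrase` are hypothetical, tokenization is a plain whitespace split, and positions are counted from 0 to match the postings shown.

```python
from collections import defaultdict

# The two example documents from the slide.
DOCS = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}

def build_index(docs):
    """Map each lowercased term to {document id: [0-based word positions]}."""
    index = defaultdict(dict)
    for did, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(did, []).append(pos)
    return index

def and_not(index, include, exclude):
    """Boolean query 'include AND NOT exclude' over document ids."""
    return set(index.get(include, {})) - set(index.get(exclude, {}))

def phrase(index, first, second):
    """Documents in which `second` occurs immediately after `first`."""
    hits = set()
    for did, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(did, []))
        if any(p + 1 in following for p in positions):
            hits.add(did)
    return hits

INDEX = build_index(DOCS)
```

With this index, `and_not(INDEX, "care", "old")` answers the Boolean example and `phrase(INDEX, "new", "care")` answers the phrase example; both select D2 only.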

9
Tables and queries 1
Table: POSTING

select distinct did from POSTING where tid = 'care'
except
select distinct did from POSTING where tid like 'gain%'

with TPOS1(did, pos) as (select did, pos from POSTING where tid = 'new'),
     TPOS2(did, pos) as (select did, pos from POSTING where tid = 'care')
select distinct did from TPOS1, TPOS2
where TPOS1.did = TPOS2.did
  and proximity(TPOS1.pos, TPOS2.pos)

proximity(a, b) ::= a + 1 = b        (phrase)
proximity(a, b) ::= abs(a - b) < 5   (proximity)

10
Relevance ranking
Recall = coverage
 –What fraction of relevant documents were reported
Precision = accuracy
 –What fraction of reported documents were relevant
Trade-off
[Figure: a query has a “true response” and a search output sequence; consider prefix k of the output and compare]
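The two definitions above, evaluated on a prefix of length k of the ranked output, can be sketched as follows. The function name `precision_recall_at_k` and the document ids in the usage note are illustrative assumptions, not part of the tutorial.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the first k entries of a ranked result list.

    ranked:   document ids in the order the search engine returned them
    relevant: set of ids in the "true response"
    """
    prefix = ranked[:k]
    hits = sum(1 for did in prefix if did in relevant)
    precision = hits / len(prefix) if prefix else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For a ranking `["d3", "d1", "d7", "d2"]` and a true response `{"d1", "d2", "d5"}`, growing k from 2 to 4 raises recall from 1/3 to 2/3 while precision stays at 1/2, which illustrates the trade-off the slide refers to.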

11
Vector space model and TFIDF
Some words are more important than others
W.r.t. a document collection D
 –d+ have a term, d- do not
 –“Inverse document frequency”
“Term frequency” (TF)
 –Many variants:
Probabilistic models

12
Tables and queries 2
VECTOR(did, tid, elem) ::=
with TEXT(did, tid, freq) as
       (select did, tid, count(distinct pos) from POSTING group by did, tid),
     LENGTH(did, len) as
       (select did, sum(freq) from TEXT group by did),
     DOCFREQ(tid, df) as
       (select tid, count(distinct did) from TEXT group by tid)
select TEXT.did, TEXT.tid,
       (freq / len) * (1 + log((select count(distinct did) from POSTING) / df))
from TEXT, LENGTH, DOCFREQ
where TEXT.did = LENGTH.did and TEXT.tid = DOCFREQ.tid
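The VECTOR query computes, for each document and term, elem = (freq / len) * (1 + log(N / df)), where N is the number of documents. A minimal in-memory sketch of the same arithmetic follows. It assumes whitespace tokenization and the natural logarithm (the query does not say which base `log` uses), and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

DOCS = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}

def tfidf_vectors(docs):
    """Return {did: {term: (freq / len) * (1 + log(N / df))}}."""
    tokens = {did: text.lower().split() for did, text in docs.items()}
    n_docs = len(tokens)

    # df: number of distinct documents containing each term (DOCFREQ).
    doc_freq = Counter()
    for terms in tokens.values():
        doc_freq.update(set(terms))

    vectors = {}
    for did, terms in tokens.items():
        counts = Counter(terms)   # freq per term (TEXT)
        length = len(terms)       # total term occurrences (LENGTH)
        vectors[did] = {
            term: (freq / length) * (1 + math.log(n_docs / doc_freq[term]))
            for term, freq in counts.items()
        }
    return vectors

VECTORS = tfidf_vectors(DOCS)
```

In this two-document corpus, "care" occurs in both documents, so its log term is zero and its weight reduces to the plain term frequency 3/10; "old" occurs only in D1 and is weighted up by the factor 1 + log 2.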

13
Relevance ranking
select did, cosine(did, query)
from corpus
where candidate(did, query)
order by cosine(did, query) desc
fetch first k rows only
[Figure: query vector over the terms ‘auto’, ‘car’, ‘now’; matrices labeled A^T and D]
Find largest k columns of:
Exact computation: O(n^2)
All entries above mean can be estimated with error e within O(n e^-2) time
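The ranking query above can be mimicked in memory as a sketch. Here `cosine` is the usual cosine of the angle between two sparse vectors, and the candidate filter is assumed to mean "shares at least one term with the query"; the slide does not define `candidate`, so that choice and the name `top_k` are assumptions. This is the exact computation, not the sampling-based estimate mentioned on the slide.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(corpus, query, k):
    """Return the k best (did, score) pairs, highest cosine first."""
    scored = (
        (cosine(vec, query), did)
        for did, vec in corpus.items()
        if any(t in vec for t in query)   # assumed candidate(did, query)
    )
    return [(did, score) for score, did in heapq.nlargest(k, scored)]
```

`heapq.nlargest` keeps only k entries at a time, which corresponds to `order by ... desc fetch first k rows only` without sorting the whole corpus.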

14
Similarity and clustering

15
Clustering
Given an unlabeled collection of documents, induce a taxonomy based on similarity
Need document similarity measure
 –Distance between normalized document vectors
 –Cosine of angle between document vectors
Top-down clustering is difficult because of huge number of noisy dimensions
 –k-means, expectation maximization
Quadratic-time bottom-up clustering

16
Document model
Vocabulary V, term w_i
A document is represented by a vector of term counts, where each count f is the number of times w_i occurs in the document
Most f’s are zeroes for a single document
Monotone component-wise damping function g such as log or square-root

17
Similarity
Normalized document profile:
Profile for document group:

18
Group average clustering 1
Initially G is a collection of singleton groups, each with one document
Repeat
 –Find the pair of groups in G with maximum similarity s
 –Merge the two groups
For each group, keep track of its best merge partner
O(n^2) algorithm

19
Group average clustering 2
Un-normalized group profile:
Can show:
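The merge loop from the group average clustering slides can be sketched as below. Because the similarity formulas did not survive in this transcript, the sketch assumes the common group-average definition: s of a group is the mean dot product over all pairs of unit-length document profiles in it. It recomputes s from scratch for every candidate merge, so it is much slower than the O(n^2) bookkeeping the slides describe. All function names are hypothetical.

```python
import math
from itertools import combinations

def normalize(vec):
    """Scale a dense vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def group_average(group, profiles):
    """Mean pairwise similarity among members (group needs >= 2 members)."""
    pairs = list(combinations(group, 2))
    return sum(dot(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

def cluster(vectors, k):
    """Bottom-up clustering: merge the best pair of groups until k remain."""
    profiles = [normalize(v) for v in vectors]
    groups = [[i] for i in range(len(profiles))]   # singleton groups
    while len(groups) > k:
        # Pick the pair whose union has the highest average similarity.
        a, b = max(
            combinations(range(len(groups)), 2),
            key=lambda ab: group_average(groups[ab[0]] + groups[ab[1]], profiles),
        )
        merged = groups[a] + groups[b]
        groups = [g for idx, g in enumerate(groups) if idx not in (a, b)]
        groups.append(merged)
    return groups
```

Each returned group is a list of indices into `vectors`. Maintaining un-normalized group profiles, as the second slide indicates, would avoid recomputing the pairwise sums at every step.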

20
“Rectangular time” algorithm
Buckshot
Randomly sample documents
Run group average clustering algorithm to reduce to k groups or clusters
Iterate assign-to-nearest O(1) times
 –Move each document to the cluster with maximum similarity s
Total time taken is O(kn)

21
Extended similarity
auto and car co-occur often
Therefore they must be related
Documents having related words are related
Useful for search and clustering
Two basic approaches
 –Hand-made thesaurus (WordNet)
 –Co-occurrence and associations
[Figure: documents in which ‘car’ and ‘auto’ occur together]

22
Latent semantic indexing
[Figure: SVD of the term-by-document matrix A (t terms, d documents, rank r) into factors U, D, V; each term and document is mapped to a k-dim vector]

23
Collaborative recommendation
People=record, movies=features, cluster people
Both people and features can be clustered
For hypertext access, time of access is a feature
Need advanced models

24
A model for collaboration
People and movies belong to unknown classes
P_k = probability a random person is in class k
P_l = probability a random movie is in class l
P_kl = probability of a class-k person liking a class-l movie
Gibbs sampling: iterate
 –Pick a person or movie at random and assign to a class with probability proportional to P_k or P_l
 –Estimate new parameters

25
Supervised learning

26
Supervised learning (classification)
Many forms
 –Content: automatically organize the web per Yahoo!
 –Type: faculty, student, staff
 –Intent: education, discussion, comparison, advertisement
Applications
 –Relevance feedback for re-scoring query responses
 –Filtering news, email, etc.
 –Narrowing searches and selective data acquisition

27
Difficulties
Dimensionality
 –Decision tree classifiers: dozens of columns
 –Vector space model: 50,000 ‘columns’
Context-dependent noise
 –‘Can’ (v.) considered a ‘stopword’
 –‘Can’ (n.) may not be a stopword in /Yahoo/SocietyCulture/Environment/Recycling

28
More difficulties
Need for scalability
 –High dimension needs more data to learn
Class labels are from a hierarchy
 –All documents belong to the root node
 –Highest probability leaf may have low confidence

29
Techniques
Nearest neighbor
 +Standard keyword index also supports classification
 –How to define similarity? (TFIDF may not work)
 –Wastes space by storing individual document info
Rule-based, decision-tree based
 –Very slow to train (but quick to test)
 +Good accuracy (but brittle rules)
Model-based
 +Fast training and testing with small footprint

30
More document models
Boolean vector (word counts ignored)
 –Toss one coin for each term in the universe
Bag of words (multinomial)
 –Repeatedly toss coin with a term on each face
Limited dependence models
 –Bayesian network where each feature has at most k features as parents
 –Maximum entropy estimation

31
“Bag-of-words”
Decide topic; topic c is picked with prior probability π(c); Σ_c π(c) = 1
Each topic c has parameters θ(c,t) for terms t
Coin with face probabilities Σ_t θ(c,t) = 1
Fix document length and keep tossing coin
Given c, probability of document is
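A minimal sketch of a classifier built on this coin-tossing model follows. It assumes the standard multinomial form, in which the probability of a document given topic c is proportional to the product of the per-term face probabilities raised to the term counts; the multinomial coefficient is the same for every topic and is dropped. Add-one (Laplace) smoothing is an assumption made here so that unseen terms do not get zero probability. The names `train`, `log_score`, and `classify` and the training data in the usage note are hypothetical.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (topic, list of terms).

    Returns (prior, theta, vocab): the topic priors, the per-topic
    face probabilities with add-one smoothing, and the term universe.
    """
    vocab = {t for _, terms in labeled_docs for t in terms}
    topic_counts = Counter(c for c, _ in labeled_docs)
    term_counts = {c: Counter() for c in topic_counts}
    for c, terms in labeled_docs:
        term_counts[c].update(terms)

    prior = {c: n / len(labeled_docs) for c, n in topic_counts.items()}
    theta = {}
    for c, counts in term_counts.items():
        total = sum(counts.values()) + len(vocab)
        theta[c] = {t: (counts[t] + 1) / total for t in vocab}
    return prior, theta, vocab

def log_score(terms, c, prior, theta, vocab):
    """log prior(c) + sum over occurrences of log theta(c, t)."""
    score = math.log(prior[c])
    for t in terms:
        if t in vocab:            # terms never seen in training are skipped
            score += math.log(theta[c][t])
    return score

def classify(terms, prior, theta, vocab):
    """Return the topic with the highest log score."""
    return max(prior, key=lambda c: log_score(terms, c, prior, theta, vocab))
```

Trained on a few documents labeled "cars" and "coffee", `classify(["car", "engine", "car"], prior, theta, vocab)` picks "cars". Working in log space avoids underflow when documents are long.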

32
Limitations
With the model
 –100th occurrence of term as surprising as first
 –No inter-term dependence
With using the model
 –Most observed θ(c,t) are zero and/or noisy
 –Have to pick a low-noise subset of the term universe
  Improves space, time, and accuracy
 –Have to “fix” low-support statistics

33
Feature selection
[Figure: models with unknown parameters p1, p2, ... and q1, q2, ... over the term set T; observed data from N samples give confidence intervals for p1 and q1]
Pick F ⊆ T such that models built over F have high separation confidence

34
Tables and queries 3
Tables: TAXONOMY, TEXT, EGMAP
EGMAPR(did, kcid) ::=
  ((select did, kcid from EGMAP)
   union all
   (select e.did, t.pcid from EGMAPR as e, TAXONOMY as t
    where e.kcid = t.kcid))
STAT(pcid, tid, kcid, ksmc, ksnc) ::=
  (select pcid, tid, TAXONOMY.kcid, count(distinct TEXT.did), sum(freq)
   from EGMAPR, TAXONOMY, TEXT
   where TAXONOMY.kcid = EGMAPR.kcid and EGMAPR.did = TEXT.did
   group by pcid, tid, TAXONOMY.kcid)

35
Analyzing hyperlink structure

36
Hyperlink graph analysis
Hypermedia is a social network
 –Telephoned, advised, co-authored, paid, cited
Social network theory (cf. Wasserman & Faust)
 –Extensive research applying graph notions
 –Centrality
 –Prestige and reflected prestige
 –Co-citation
Can be applied directly to Web search
 –HITS, Google, CLEVER, topic distillation

37
Hypertext models for classification
c=class, t=text, N=neighbors
Text-only model: Pr[t|c]
Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
Better model: Pr[t, c(N) | c]
Non-linear relaxation

38
Exploiting link features
9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
‘Forget’ fraction of neighbors’ classes

39
Google and HITS
In-degree prestige
Not all votes are worth the same
Prestige of a page is the sum of prestige of citing pages: p = Ep
Pre-compute query independent prestige score
High prestige implies good authority
High reflected prestige implies good hub
Bipartite iteration
 –a = Eh
 –h = E^T a
 –h = E^T E h
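The bipartite iteration above can be sketched over an edge list instead of an explicit matrix. The sketch follows the slide's convention that a page's authority score sums the hub scores of the pages citing it (a = Eh) and a page's hub score sums the authority scores of the pages it cites (h = E^T a). Dividing by the sum of scores after each round, the fixed iteration count, and the names `hits` and `normalize` are assumptions made for this example.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (left unchanged if all are zero)."""
    total = sum(scores)
    return [s / total for s in scores] if total else list(scores)

def hits(links, n, iterations=50):
    """Iterate a = Eh, h = E^T a on pages numbered 0..n-1.

    links: list of (src, dst) pairs meaning page src cites page dst.
    Returns (authority, hub) score lists.
    """
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages citing it.
        new_auth = [0.0] * n
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of pages it cites.
        new_hub = [0.0] * n
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

For the links `[(0, 2), (1, 2), (0, 3)]`, page 2 ends up with the highest authority score because two pages cite it, and page 0 ends up as the best hub because it cites both authorities.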

40
Tables and queries 4
Tables: HUBS, AUTH, LINK
Columns shown: urlsrc, @ipsrc, urldst, @ipdst, wgtfwd, wgtrev, score

update LINK as X set (wtfwd) = 1. /
  (select count(ipsrc) from LINK
   where ipsrc = X.ipsrc and urldst = X.urldst)
where type = non-local;

delete from HUBS;

insert into HUBS(url, score)
  (select urlsrc, sum(score * wtrev) from AUTH, LINK
   where authwt is not null and type = non-local
     and ipdst <> ipsrc and url = urldst
   group by urlsrc);

update HUBS set (score) = score / (select sum(score) from HUBS);

41
Querying/mining semi-structured data

42
Semi-structured database systems
Lore (Stanford)
 –Object exchange model, dataguides
WebSQL (Toronto), WebL (Compaq SRC)
 –Structured query languages for the Web
WHIRL (AT&T Research)
 –Approximate matches on multiple textual columns
Strudel (AT&T Research, U. Washington)
 –Web site generation and management

43
Queries combining structure and content
select x.url, x.title from Document x such that “http://www.cs.wisc.edu”==| | x where x mentions “semi-structured data”
Apart from cycling, find the most common topic found within link radius 2 of pages on cycling
 –Answer: “first-aid”
In the last year, how many links were made from environment protection pages to Exxon?

44
Resource discovery
[Figure: resource discovery system. Components: Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Topic Distiller, Feedback, Scheduler, Workers, Crawler]

45
Resource discovery results
High rate of “harvesting” relevant pages
Robust to perturbations of starting URLs
Great resources found 12 links from start set

46
Resource discovery results 1
High rate of ‘harvesting’ relevant pages
 –Standard crawling neither necessary nor adequate for answering specific queries

47
Resource discovery results 2
Robust to perturbations of starting URLs
Great resources found 12 links from start set

48
Database issues
Useful features
 +Concurrency and recovery (crawlers)
 +I/O-efficient representation of mining algorithms
 +Ad-hoc queries combining structure and content
Need better support for
 –Flexible choices for concurrency and recovery
 –Indexed scans over temporary table expressions
 –Efficient string storage and operations
 –Answering complex queries approximately

49
Resources

50
Research areas
Modeling, representation, and manipulation
More applications of machine learning
Approximate structure and content matching
Answering questions in specific domains
Interactive refinement of ill-defined queries
Tracking emergent topics in a discussion group
Content-based collaborative recommendation
Semantic prefetching and caching

51
Events and activities
Text REtrieval Conference (TREC)
 –Mature ad-hoc query and filtering tracks (newswire)
 –New track for web search (2GB and 100GB corpus)
 –New track for question answering
DIMACS special years on Networks (-2000)
 –Includes applications such as information retrieval, databases and the Web, multimedia transmission and coding, distributed and collaborative computing
Conferences: WWW, SIGIR, SIGMOD/VLDB?
