
1 Database and Knowledge Discovery: from Market Basket Data to Web PageRank
Jianlin Feng, School of Software, SUN YAT-SEN UNIVERSITY

2 What is a Database?
A database is a very large, integrated collection of data.
It models a real-world “enterprise”:
- Entities, e.g., students, courses
- Relationships, e.g., Zhang San is taking the Database Systems course.

3 What is a Database Management System?
A Database Management System (DBMS) is a software system designed to store, manage, and facilitate queries over databases.
Popular DBMSs:
- Oracle
- IBM DB2
- Microsoft SQL Server
Database System = Databases + DBMS

4 DBMS is a big industry
The world's largest software companies by Forbes Global 2000 (http://www.forbes.com/lists/2009/18/global-09_The-Global-2000_IndName_17.html):
- IBM
- Microsoft
- Oracle
- Google
- Softbank
- SAP
- Accenture
- Computer Sciences Corporation
- Yahoo!
- Cap Gemini

5 Typical Applications Supported by Database Systems
Online Transaction Processing (OLTP):
- Recording sales data in supermarkets
- Booking flight tickets
- Electronic banking
Online Analytical Processing (OLAP) and Data Warehousing:
- Business reporting for sales data
- Customer Relationship Management (CRM)

6 Is the WWW a DBMS?
The Web = Surface Web + Deep Web
Surface Web: simply the HTML pages.
- Accessed by “search”: pose keywords in a search box.
Deep Web: content hidden behind HTML forms.
- Accessed by “query”: fill in query forms.

7 “Search” as “Simple Search”

8 “Query” as “Advanced Search”

9 “Search” vs. “Query”
Search is structure-free.
- The keywords “database systems” can appear anywhere in an HTML page.
Query is structure-aware.
- Say, we restrict the keywords “database systems” to appear only in the “TITLE” field.
- I.e., we assume there is an underlying STRUCTURE (of a book).

10 What is a “STRUCTURE”?
Referring to the C programming language:

    struct BOOK {
        char  TITLE[256];    /* book title */
        char  AUTHOR[256];   /* author name */
        float PRICE;         /* price */
        int   YEAR;          /* publication year */
    };

In this course, we study database management systems that focus on processing structured data.

11 An Anecdote about “Search”: Google’s founders are from the Database Group at Stanford.

12 A Historical Perspective (1)
Relational Data Model, by Edgar Codd, 1970.
- Codd, E.F. (1970). “A Relational Model of Data for Large Shared Data Banks”. Communications of the ACM 13 (6): 377–387.
- Turing Award, 1981.
System R, by IBM, started in 1974.
- Structured Query Language (SQL)
INGRES, by Berkeley, started in 1974.
- POSTGRES, Mariposa, C-Store

13 A Historical Perspective (2)
Database Transaction Processing, mainly by Jim Gray.
- J. Gray, A. Reuter. Transaction Processing: Concepts and Techniques. 1993.
- Turing Award, 1998.
Object-Relational DBMS, 1990s.
- Stonebraker, Michael, with Moore, Dorothy. Object-Relational DBMSs: The Next Great Wave. 1996.
- Postgres (UC Berkeley), PostgreSQL.
- IBM’s DB2, Oracle Database, and Microsoft SQL Server.

14 Describing Data: Data Models
A data model is a collection of concepts for describing data.
A schema is a description of a particular collection of data, using a given data model.
The relational data model is the most widely used model today.
- Main concept: relation, basically a table with rows and columns.
- Every relation has a schema, which describes the columns, or fields (their names, types, constraints, etc.).

15 Schema in the Relational Data Model
Definition of the Students schema:
Students (sid: string, name: string, login: string, age: integer, gpa: real)

Table 1. An instance of the Students relation:
sid     name    login        age   gpa
53666   Jones   jones@cs     18    3.4
53688   Smith   smith@ee     18    3.2
53650   Smith   smith@math   19    3.8

A relation schema is a TEMPLATE of the corresponding relation.

16 Queries in a Relational DBMS
Queries are specified in a non-procedural way:
- Users only specify what data they need.
- The DBMS takes care of evaluating queries as efficiently as possible.
A non-procedural query language:
- SQL: Structured Query Language
- Basic form of a SQL query:
      SELECT target-list
      FROM   relation-list
      WHERE  qualification

17 A Simple SQL Example
At an airport, a gate agent clicks on a form to request the passenger list for a flight.
Schema: Passenger(pid: string, name: string, flight: integer)

    SELECT name
    FROM   Passenger
    WHERE  flight = 510275

18 Concurrent execution of user programs
Why?
- Utilize the CPU while waiting for disk I/O (database programs make heavy use of disk).
- Avoid short programs waiting behind long ones, e.g., an ATM withdrawal while the bank manager sums balances across all accounts.
(courtesy of Joe Hellerstein)

19 Concurrent execution
Interleaving the actions of different user programs can lead to inconsistency.
Example:
- Bill transfers $100 from savings to checking: Savings -= 100; Checking += 100
- Meanwhile, Bill's wife requests the account info.
Bad interleaving (see the sketch below):
- Savings -= 100
- Print balances
- Checking += 100
- The printout is missing $100!
(courtesy of Joe Hellerstein)
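A minimal sketch of this bad interleaving played out as straight-line Python; the starting balances (500 and 200) are made up for illustration:

    savings, checking = 500, 200                  # hypothetical starting balances

    savings -= 100                                # Bill's transfer, step 1
    print("wife sees:", savings + checking)       # 600: the $100 "in flight" is missing
    checking += 100                               # Bill's transfer, step 2
    print("true total:", savings + checking)      # 700 again once the transfer completes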

20 Concurrency Control
The DBMS ensures that such problems don't arise.
Users can pretend they are using a single-user system.

21 STRUCTURE OF A DBMS

22 The Market-Basket Model
A large set of items, e.g., things sold in a supermarket.
A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

23 Support
Simplest question: find sets of items that appear "frequently" in the baskets.
Support for itemset I = the number of baskets containing all items in I.
- Sometimes given as a percentage.
Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.

24 Example: Frequent Itemsets
Items = {milk, coke, pepsi, beer, juice}. Support threshold = 3 baskets.
B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}.
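A minimal sketch (not part of the slides) that recomputes these frequent itemsets in Python; baskets are sets, and only itemsets of size 1 and 2 are checked:

    from itertools import combinations

    baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
               {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]
    s = 3                                      # support threshold

    def support(itemset):
        # number of baskets containing every item of the itemset
        return sum(1 for b in baskets if set(itemset) <= b)

    items = sorted(set().union(*baskets))
    frequent = [c for k in (1, 2) for c in combinations(items, k) if support(c) >= s]
    print(frequent)   # ('b',), ('c',), ('j',), ('m',), ('b','c'), ('b','m'), ('c','j')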

25 Applications – (1)
Items = products; baskets = sets of products someone bought in one trip to the store.
Example application: given that many people buy beer and diapers together:
- Run a sale on diapers; raise the price of beer.
Only useful if many buy diapers & beer.

26 Applications – (2)
Baskets = Web pages; items = words.
Unusual words appearing together in a large number of documents, e.g., "Brad" and "Angelina," may indicate an interesting relationship.

27 Association Rules
If-then rules about the contents of baskets.
{i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik then it is likely to contain j."
The confidence of this association rule is the probability of j given i1, ..., ik.

28 Example: Confidence
B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}
An association rule: {m, b} → c.
- The baskets containing {m, b} are B1, B3, B5, and B6; of these, only B1 and B6 also contain c.
- Confidence = 2/4 = 50%.
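A small sketch (illustrative, not from the slides) computing the same confidence value:

    baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
               {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]

    left = {'m', 'b'}                                    # left-hand side of the rule
    with_left = [b for b in baskets if left <= b]        # B1, B3, B5, B6
    confidence = sum('c' in b for b in with_left) / len(with_left)
    print(confidence)                                    # 0.5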

29 Finding Association Rules
Question: "find all association rules with support ≥ s and confidence ≥ c."
- Note: the "support" of an association rule is the support of the set of items on the left.
The hard part is finding the frequent itemsets.
- Note: if {i1, i2, ..., ik} → j has high support and confidence, then both {i1, i2, ..., ik} and {i1, i2, ..., ik, j} will be "frequent."

30 A-Priori Algorithm – (1)
A two-pass approach called A-Priori limits the need for main memory.
Key idea (monotonicity): if a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

31 A-Priori Algorithm – (2)
Pass 1: Read baskets and count in main memory the occurrences of each item.
- Requires only memory proportional to the number of items.
Items that appear at least s times are the frequent items.

32 A-Priori Algorithm – (3)
Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent.
- Requires memory proportional to the square of the number of frequent items (for the counts), plus a list of the frequent items (so you know what must be counted).
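A minimal in-memory sketch of the two passes for frequent pairs (a real implementation would stream the baskets from disk on each pass; the function name is made up):

    from collections import Counter
    from itertools import combinations

    def apriori_pairs(baskets, s):
        # Pass 1: count individual items and keep those with count >= s.
        item_counts = Counter(item for basket in baskets for item in basket)
        frequent_items = {i for i, c in item_counts.items() if c >= s}

        # Pass 2: count only pairs whose two items are both frequent.
        pair_counts = Counter()
        for basket in baskets:
            kept = sorted(set(basket) & frequent_items)
            pair_counts.update(combinations(kept, 2))
        return {pair for pair, c in pair_counts.items() if c >= s}

    baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
               {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]
    print(apriori_pairs(baskets, 3))   # {('b','c'), ('b','m'), ('c','j')}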

33 Picture of A-Priori
Main-memory layout: in Pass 1, counts for all items; in Pass 2, the list of frequent items plus counts of pairs of frequent items.

34 Page Rank: Ranking web pages
Web pages are not equally "important"
- www.joe-schmoe.com vs. www.stanford.edu
Inlinks as votes:
- www.stanford.edu has 23,400 inlinks
- www.joe-schmoe.com has 1 inlink
Are all inlinks equal?
- Recursive question!

35 Simple recursive formulation
Each link's vote is proportional to the importance of its source page.
If page P with importance x has n outlinks, each link gets x/n votes.
Page P's own importance is the sum of the votes on its inlinks.

36 Simple "flow" model
The web in 1839: three pages, Yahoo (y), Amazon (a), and M'soft (m). Yahoo links to itself and to Amazon, Amazon links to Yahoo and to M'soft, and M'soft links only to Amazon; each page splits its vote evenly among its outlinks (y/2, a/2, m).
Flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2

37 Solving the flow equations
3 equations, 3 unknowns, no constant terms:
- No unique solution.
- All solutions are equivalent modulo a scale factor.
An additional constraint forces uniqueness:
- y + a + m = 1
- y = 2/5, a = 2/5, m = 1/5
Gaussian elimination works for small examples (see the sketch below), but we need a better method for large graphs.
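A minimal sketch (assuming NumPy is available) that replaces one redundant flow equation with the constraint y + a + m = 1 and solves the resulting system:

    import numpy as np

    # y = y/2 + a/2  ->  -y/2 + a/2       = 0
    # a = y/2 + m    ->   y/2 - a  + m    = 0
    # normalization  ->   y   + a  + m    = 1   (replaces m = a/2)
    A = np.array([[-0.5,  0.5, 0.0],
                  [ 0.5, -1.0, 1.0],
                  [ 1.0,  1.0, 1.0]])
    b = np.array([0.0, 0.0, 1.0])
    print(np.linalg.solve(A, b))   # [0.4 0.4 0.2], i.e. y = 2/5, a = 2/5, m = 1/5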

38 Matrix formulation
Matrix M has one row and one column for each web page.
Suppose page j has n outlinks:
- If j → i, then Mij = 1/n.
- Else Mij = 0.
M is a column-stochastic matrix:
- Columns sum to 1.
Suppose r is a vector with one entry per web page:
- ri is the importance score of page i.
- Call it the rank vector.
- |r| = 1.
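A small sketch (the 'links' dictionary below is just the three-page example, with pages ordered alphabetically) building a column-stochastic M with Mij = 1/n when j → i:

    import numpy as np

    links = {'y': ['y', 'a'],   # Yahoo links to itself and to Amazon
             'a': ['y', 'm'],   # Amazon links to Yahoo and to M'soft
             'm': ['a']}        # M'soft links only to Amazon
    pages = sorted(links)                      # ['a', 'm', 'y']
    idx = {p: k for k, p in enumerate(pages)}

    M = np.zeros((len(pages), len(pages)))
    for j, outlinks in links.items():
        for i in outlinks:
            M[idx[i], idx[j]] = 1.0 / len(outlinks)

    print(M.sum(axis=0))   # [1. 1. 1.]: every column sums to 1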

39 Example
Suppose page j links to 3 pages, one of which is i. Then column j of M has 1/3 in row i, so in the product r = Mr, page i receives a 1/3 share of rj.

40 Eigenvector formulation
The flow equations can be written r = Mr.
So the rank vector r is an eigenvector of the stochastic web matrix M.
- In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1.

41 Example
            y    a    m
M  =   y   1/2  1/2   0
       a   1/2   0    1
       m    0   1/2   0

Flow equations:          r = Mr:
y = y/2 + a/2            [y]     [1/2  1/2   0] [y]
a = y/2 + m              [a]  =  [1/2   0    1] [a]
m = a/2                  [m]     [ 0   1/2   0] [m]

42 Power Iteration method
Simple iterative scheme (aka relaxation).
Suppose there are N web pages.
Initialize: r^(0) = [1/N, ..., 1/N]^T
Iterate: r^(k+1) = M r^(k)
Stop when |r^(k+1) - r^(k)|_1 < ε
- |x|_1 = sum over 1 ≤ i ≤ N of |x_i| is the L1 norm.
- Any other vector norm can be used, e.g., Euclidean.
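A minimal power-iteration sketch (assuming NumPy; M is the three-page example in the order y, a, m, and the epsilon value is an arbitrary choice):

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])

    def power_iterate(M, eps=1e-10):
        n = M.shape[0]
        r = np.full(n, 1.0 / n)                  # r0 = [1/N, ..., 1/N]
        while True:
            r_next = M @ r                       # r_{k+1} = M r_k
            if np.abs(r_next - r).sum() < eps:   # L1 norm of the change
                return r_next
            r = r_next

    print(power_iterate(M))   # approximately [0.4, 0.4, 0.2] = (2/5, 2/5, 1/5)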

43 Power Iteration Example
            y    a    m
M  =   y   1/2  1/2   0
       a   1/2   0    1
       m    0   1/2   0

Starting from r0 = (1/3, 1/3, 1/3), the iterates (y, a, m) are:
(1/3, 1/2, 1/6) → (5/12, 1/3, 1/4) → (3/8, 11/24, 1/6) → ... → (2/5, 2/5, 1/5)

44 Random Walk Interpretation
Imagine a random web surfer:
- At any time t, the surfer is on some page P.
- At time t+1, the surfer follows an outlink from P uniformly at random.
- It ends up on some page Q linked from P.
- The process repeats indefinitely.
Let p(t) be a vector whose i-th component is the probability that the surfer is at page i at time t.
- p(t) is a probability distribution over pages.

45 The stationary distribution
Where is the surfer at time t+1?
- It follows a link uniformly at random, so p(t+1) = M p(t).
Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t).
- Then p(t) is called a stationary distribution for the random walk.
Our rank vector r satisfies r = Mr.
- So it is a stationary distribution for the random surfer.

46 Existence and Uniqueness A central result from the theory of random walks (aka Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.

47 Spider traps
A group of pages is a spider trap if there are no links from within the group to outside the group.
- The random surfer gets trapped.
Spider traps violate the conditions needed for the random walk theorem.

48 Microsoft becomes a spider trap
            y    a    m
M  =   y   1/2  1/2   0
       a   1/2   0    0
       m    0   1/2   1

Starting from (1, 1, 1), the iterates (y, a, m) are:
(1, 1/2, 3/2) → (3/4, 1/2, 7/4) → (5/8, 3/8, 2) → ... → (0, 0, 3)
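A short numeric check of the trap (assuming NumPy; 50 iterations is an arbitrary cutoff): iterating this M drains all the importance into m.

    import numpy as np

    M_trap = np.array([[0.5, 0.5, 0.0],   # order y, a, m; M'soft now links only to itself
                       [0.5, 0.0, 0.0],
                       [0.0, 0.5, 1.0]])
    r = np.array([1.0, 1.0, 1.0])
    for _ in range(50):
        r = M_trap @ r
    print(r.round(3))                      # approaches [0, 0, 3]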

49 Random teleports
The Google solution for spider traps.
At each time step, the random surfer has two options:
- With probability β, follow a link at random.
- With probability 1-β, jump to some page uniformly at random.
- Common values for β are in the range 0.8 to 0.9.
The surfer will teleport out of a spider trap within a few time steps.

50 Random teleports (β = 0.8)
With probability 0.8 the surfer follows one of the page's outlinks (e.g., each of Yahoo's two outlinks is followed with probability 0.8 * 1/2), and with probability 0.2 it teleports to one of the 3 pages (each with probability 0.2 * 1/3):

        1/2  1/2   0            1/3  1/3  1/3        y   7/15  7/15   1/15
0.8 *   1/2   0    0    + 0.2 * 1/3  1/3  1/3   =    a   7/15  1/15   1/15
         0   1/2   1            1/3  1/3  1/3        m   1/15  7/15  13/15

51 Random teleports (β = 0.8)
        1/2  1/2   0            1/3  1/3  1/3        y   7/15  7/15   1/15
0.8 *   1/2   0    0    + 0.2 * 1/3  1/3  1/3   =    a   7/15  1/15   1/15
         0   1/2   1            1/3  1/3  1/3        m   1/15  7/15  13/15

Starting from (1, 1, 1), the iterates (y, a, m) are:
(1.00, 0.60, 1.40) → (0.84, 0.60, 1.56) → (0.776, 0.536, 1.688) → ... → (7/11, 5/11, 21/11)

52 Matrix formulation
Suppose there are N pages.
- Consider a page j, with set of outlinks O(j).
- We have Mij = 1/|O(j)| when j → i, and Mij = 0 otherwise.
- The random teleport is equivalent to adding a teleport link from j to every page with probability (1-β)/N, and reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|.
- Equivalent: tax each page a fraction (1-β) of its score and redistribute it evenly.

53 Page Rank
Construct the N x N matrix A as follows:
- Aij = β Mij + (1-β)/N
Verify that A is a stochastic matrix.
The PageRank vector r is the principal eigenvector of this matrix:
- satisfying r = Ar.
Equivalently, r is the stationary distribution of the random walk with teleports.
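A minimal end-to-end sketch (assuming NumPy; β = 0.8 and the spider-trap example from the previous slides) that builds A and iterates to the stationary distribution with teleports:

    import numpy as np

    beta = 0.8
    M = np.array([[0.5, 0.5, 0.0],   # order y, a, m, with M'soft as a spider trap
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])
    N = M.shape[0]
    A = beta * M + (1 - beta) / N    # A_ij = beta * M_ij + (1 - beta)/N

    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r = A @ r                    # power iteration on A
    print(r)                         # approx [7/33, 5/33, 21/33], i.e. y : a : m = 7 : 5 : 21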

