
1 Data Warehousing

2 Data Warehouse
A data warehouse is a huge database that stores historical data.
Example: store information about all sales of products in all branches of a supermarket.
Example: store information about all telephone calls made through a certain company.

3 OLTP vs. OLAP
The queries we have seen so far were Online Transaction Processing (OLTP), i.e., many small queries with results needed immediately.
Typically, Online Analytical Processing (OLAP) is done against a data warehouse:
– a lot of aggregation
– results aren't needed immediately (queries can run for hours)
– used for decision support (more about this later)

4 Creating a Data Warehouse
Data from different sources is combined:
– need to transform data to a single schema
– need to transform data to have the same type of measurements (e.g., same currency)
Occasionally, data is refreshed (get new data).
Occasionally, data is purged (throw out old data).
Need a meta-data repository to store information for the above.

5 Design
Usually, a star schema:
– large fact table that continuously grows
– small dimension tables that remain basically static
Example:
– Sales(pid, timeid, locid, amount)
– Products(pid, pname, category, price)
– Locations(locid, city, state, country)
– Times(timeid, date, week, month, holiday_flag)
Sales is the fact table; the rest are dimension tables.
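An OLAP-style query against a star schema joins the fact table with one or more dimension tables and then aggregates. The following is a minimal in-memory sketch of that pattern; the rows, locids, and city names are made up for illustration and are not from the slides.

```python
# Illustrative rows for the star schema: a fact table plus one dimension.
sales = [  # fact table rows: Sales(pid, timeid, locid, amount)
    {"pid": 1, "timeid": 1, "locid": 10, "amount": 25.0},
    {"pid": 2, "timeid": 1, "locid": 10, "amount": 40.0},
    {"pid": 1, "timeid": 2, "locid": 20, "amount": 30.0},
]
locations = {  # dimension table keyed by locid (small, basically static)
    10: {"city": "Haifa", "country": "Israel"},
    20: {"city": "Tel Aviv", "country": "Israel"},
}

# Total sales per city: join each fact row to its dimension row, then aggregate.
totals = {}
for row in sales:
    city = locations[row["locid"]]["city"]
    totals[city] = totals.get(city, 0.0) + row["amount"]
```

In SQL terms this corresponds to joining Sales with Locations on locid and grouping by city; the fact table is scanned once while the small dimension table is used for lookups.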

6 Normal Forms
Fact table: in BCNF.
Dimension tables: generally not in BCNF:
– few updates, inserts, and deletes, so few anomalies
– data is relatively small, so duplication doesn't matter
– queries are faster, since fewer joins are needed

7 Data Mining

8 Data Mining
Data mining is the process of finding interesting trends or patterns in large datasets in order to guide future decisions.
It is related to exploratory data analysis (an area of statistics) and knowledge discovery (an area of artificial intelligence and machine learning).
Data mining is characterized by having VERY LARGE datasets.

9 Some Real Problems
Associations between products bought in a store:
– milk
– beers and diapers
Deciding on product discounts.
Figuring out how to keep customers at the lowest cost to the company.
Personal recommendation page at Amazon.
Stock market trends.

10 Data Mining vs. Machine Learning
Size: databases are usually very large, so algorithms must scale well.
Design purpose: databases are not usually designed for data mining (but for other purposes), and thus may not have convenient attributes.
Errors and noise: databases almost always contain errors.

11 Data Mining Techniques

Technique                                                    Course Where It Is Taught
Decision Trees                                               Introduction to Artificial Intelligence
Neural Networks                                              Neural Networks 1, Neural Networks 2
K-Nearest Neighbor                                           Computational Geometry
Association Rules (Sequence Patterns, Classification Rules)  Introduction to Database Systems!

12 Association Rules
A market basket is a collection of items bought by a single customer in a single customer transaction.
Problem: identify sets of items that are purchased together.
Attempt to identify rules of the form {pen} → {ink}.
Meaning: if a pen is bought in a transaction, it is likely that ink will also be bought.

13 Example Purchases Table

transid  item
111      pen
111      ink
111      diary
111      soap
112      pen
112      ink
112      diary
113      pen
113      diary
114      pen
114      ink
114      soap
114      tissues

14 Measures for Association Rules
General form: L → R, where L and R are sets of items.
Support: the support of L → R is the percentage of transactions containing all items from both L and R.
Confidence: the confidence of L → R is the percentage of transactions that contain R, from among the transactions that contain L.
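These two measures can be computed directly from the example Purchases table. The sketch below encodes transactions 111–114 as sets and implements the definitions above.

```python
# Transactions 111-114 from the example Purchases table, as item sets.
transactions = [
    {"pen", "ink", "diary", "soap"},    # 111
    {"pen", "ink", "diary"},            # 112
    {"pen", "diary"},                   # 113
    {"pen", "ink", "soap", "tissues"},  # 114
]

def support(L, R):
    """Fraction of transactions containing all items from L and from R."""
    both = L | R
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(L, R):
    """Among transactions containing L, the fraction also containing R."""
    with_L = [t for t in transactions if L <= t]
    return sum((L | R) <= t for t in with_L) / len(with_L)

s = support({"pen"}, {"ink"})      # 3 of 4 transactions contain both
c = confidence({"pen"}, {"ink"})   # 3 of the 4 pen-transactions contain ink
```

Running this for the rules on the next slide gives, e.g., support 0.75 and confidence 0.75 for {pen} → {ink}.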

15 Examples
The rule {pen} → {ink} has:
– support: 3/4 = 75%
– confidence: 3/4 = 75%
The rule {tissues} → {ink} has:
– support: 1/4 = 25%
– confidence: 1/1 = 100%
The rule {pen} → {soap} has:
– support: 2/4 = 50%
– confidence: 2/4 = 50%

16 Understanding the Measures: The Intuition
If a rule has low support, then it might have arisen by chance; there is not enough evidence to draw a conclusion.
If a rule L → R has low confidence, then it is likely that there is no relationship between buying L and buying R.
Note: L → R and R → L always have the same support, but may have different confidence.

17 Goal
Find association rules with high support and high confidence. The support and confidence thresholds are specified by the user.
Remember: finding such a rule does not mean that there must be a relationship between the left and right sides. A person must evaluate such rules by hand.
– Example: {milk} → {bread}

18 Algorithm
Given values s and c, find all association rules with support >= s and confidence >= c. This can be done in two steps:
Step 1: find frequent itemsets, i.e., sets of items that appear with support >= s.
Step 2: for each frequent itemset F, try all ways to partition F into L and R, and check whether the rule L → R has confidence >= c.

19 Finding Frequent Itemsets
How would you do this? Think!

20 Finding Frequent Itemsets
The a priori property: every subset of a frequent itemset must also be a frequent itemset.
This means that we can create frequent itemsets iteratively, by taking frequent itemsets of size n and extending them to frequent itemsets of size n+1.

21 Finding Frequent Itemsets (2)
  Freq = {}
  scan all transactions once and add to Freq the single items that have support >= s
  k = 1
  repeat
      foreach itemset I_k in Freq with k items
          generate all itemsets I_{k+1} with k+1 items, such that I_k is contained in I_{k+1}
      scan all transactions once and add to Freq the (k+1)-itemsets that have support >= s
      k++
  until no new frequent itemsets are found
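The level-wise search can be sketched in a few lines of Python. This is a straightforward (not optimized) rendering of the pseudocode, run on the example Purchases table; candidate generation here extends a frequent k-itemset by any single item, matching the pseudocode rather than the refined version discussed later.

```python
# Transactions 111-114 from the example Purchases table.
transactions = [
    {"pen", "ink", "diary", "soap"},    # 111
    {"pen", "ink", "diary"},            # 112
    {"pen", "diary"},                   # 113
    {"pen", "ink", "soap", "tissues"},  # 114
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(s):
    """Return all itemsets with support >= s, found level by level."""
    items = set().union(*transactions)
    # Level 1: frequent single items.
    level = [frozenset({i}) for i in items if support(frozenset({i})) >= s]
    freq = list(level)
    while level:
        # Extend each frequent k-itemset by one more item, then re-check support.
        candidates = {I | {i} for I in level for i in items if i not in I}
        level = [c for c in candidates if support(c) >= s]
        freq.extend(level)
    return freq

freq = frequent_itemsets(0.7)
```

With support threshold 0.7 this reproduces the example run on the next slides: {pen}, {ink}, {diary} at level 1, then {pen, ink} and {pen, diary} at level 2, and nothing at level 3.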

22 Example
Run the algorithm with support = 0.7.
Level 1: the frequent itemsets are {pen}, {ink}, {diary}.
Level 2: check each of {pen, ink}, {pen, diary}, {pen, soap}, {pen, tissues}, {ink, diary}, {ink, soap}, {ink, tissues}, {diary, soap}, {diary, tissues}.
The frequent itemsets are {pen, ink} and {pen, diary}.

23 Example
Level 3: check each of {pen, ink, diary}, {pen, ink, soap}, {pen, ink, tissues}, {pen, diary, soap}, {pen, diary, tissues}.
No frequent itemsets!
To summarize, the frequent itemsets were: {pen}, {ink}, {diary}, {pen, ink}, {pen, diary}.

24 Refinements to the Algorithm
Note that we checked whether {pen, tissues} is a frequent itemset even though we already knew that {tissues} is not a frequent itemset.
Refinement: try to extend frequent itemsets in a fashion that ensures that all their subsets are frequent itemsets.
Scanning the transactions can be expensive if the table does not fit in main memory.
Refinement: find rules over a sample of the database that fits in memory.

25 Deriving Association Rules
  foreach frequent itemset I
      foreach partition of I into two nonempty sets L, R
          generate a candidate rule L → R
  foreach transaction T in the database
      foreach candidate rule L → R
          if L ⊆ T then lnum(L → R)++
          if L ∪ R ⊆ T then num(L → R)++
  return all rules L → R with num(L → R) / lnum(L → R) >= c
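Step 2 can likewise be sketched in Python. The version below takes the frequent itemsets already found for the example transactions and enumerates all nonempty partitions of each one, keeping the rules whose confidence meets the threshold.

```python
from itertools import combinations

# Transactions 111-114 and the frequent itemsets found for support = 0.7.
transactions = [
    {"pen", "ink", "diary", "soap"},    # 111
    {"pen", "ink", "diary"},            # 112
    {"pen", "diary"},                   # 113
    {"pen", "ink", "soap", "tissues"},  # 114
]
freq = [frozenset(f) for f in
        [{"pen"}, {"ink"}, {"diary"}, {"pen", "ink"}, {"pen", "diary"}]]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions)

def derive_rules(freq, c):
    """All rules L -> R over the frequent itemsets with confidence >= c."""
    rules = []
    for I in freq:
        for k in range(1, len(I)):           # nonempty L and nonempty R
            for left in combinations(sorted(I), k):
                L = frozenset(left)
                R = I - L
                # confidence = count(L union R) / count(L), and L union R = I
                if count(I) / count(L) >= c:
                    rules.append((L, R))
    return rules

rules = derive_rules(freq, 0.7)
```

For the example data with c = 0.7 this yields four rules, including {pen} → {ink} with confidence 3/4 and {ink} → {pen} with confidence 1.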

26 Other Types of Rules
Sequential patterns:
– each transaction of each customer is a set of items
– a sequence of transactions is a sequence of sets, for example: {pen, ink, soap}, {pen, ink, diary, soap}
– a subsequence is derived from a sequence by deleting some itemsets and deleting some items from other itemsets

27 Sequence Patterns (2)
{pen}, {ink, diary}, {pen, soap} is a subsequence of {pen, ink}, {shirt}, {milk, ink, diary}, {soap, pen, diary}.
{pen}, {ink, diary}, {pen, soap} is not a subsequence of {pen, ink}, {shirt}, {soap, pen, diary}, {milk, ink, diary}.

28 Sequence Patterns (3)
The support for a sequence pattern S is the percentage of customer sequences that have S as a subsequence.
A sequence s1, s2, ..., sn with high support means that someone who buys s1 is likely to buy s2 later on, etc.
Finding sequence patterns is done in a similar fashion to finding frequent itemsets: start with short sequences and try to extend them.
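The subsequence test underlying this definition can be written as a greedy left-to-right scan: match each itemset of the candidate against the earliest later itemset of the sequence that contains it. The sketch below checks the two examples from the previous slide.

```python
def is_subsequence(sub, seq):
    """True if sub (a list of item sets) can be obtained from seq by
    deleting whole itemsets and items within itemsets, preserving order."""
    i = 0
    for itemset in seq:
        if i < len(sub) and sub[i] <= itemset:
            i += 1  # sub[i] is contained in this itemset; move to the next one
    return i == len(sub)

sub  = [{"pen"}, {"ink", "diary"}, {"pen", "soap"}]
seq1 = [{"pen", "ink"}, {"shirt"}, {"milk", "ink", "diary"}, {"soap", "pen", "diary"}]
seq2 = [{"pen", "ink"}, {"shirt"}, {"soap", "pen", "diary"}, {"milk", "ink", "diary"}]
```

The greedy earliest match is safe here: if any embedding of sub into seq exists, matching each itemset as early as possible also succeeds.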

29 Classification Rules
Consider a table: BankInfo(custid, balance, age, returnLoan).
We would like to find rules of the type: "If age is in a certain range and balance is in a certain range, then a customer is likely to not return a loan."
We can apply this rule to new customers who ask for a loan.

30 Classification Rules (2)
General form: (l1 < X1 < h1) and ... and (lk < Xk < hk) → Y = c
Use the values of X1, ..., Xk to "predict" the value of Y.
We can define support and confidence as before.
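Support and confidence for a classification rule carry over directly: support is the fraction of rows satisfying both sides, confidence is the fraction of rows satisfying the left side that also satisfy the right. The sketch below evaluates one such rule over made-up BankInfo rows; the data values and range thresholds are illustrative assumptions, not from the slides.

```python
# Made-up BankInfo(custid, balance, age, returnLoan) rows for illustration.
rows = [
    {"custid": 1, "balance": 200,  "age": 22, "returnLoan": False},
    {"custid": 2, "balance": 150,  "age": 25, "returnLoan": False},
    {"custid": 3, "balance": 5000, "age": 40, "returnLoan": True},
    {"custid": 4, "balance": 300,  "age": 24, "returnLoan": True},
]

# Hypothetical rule: (100 < balance < 400) and (20 < age < 30) -> returnLoan = False
def matches_lhs(r):
    return 100 < r["balance"] < 400 and 20 < r["age"] < 30

lhs = [r for r in rows if matches_lhs(r)]
both = [r for r in lhs if r["returnLoan"] is False]
rule_support = len(both) / len(rows)       # rows satisfying LHS and RHS
rule_confidence = len(both) / len(lhs)     # among LHS rows, fraction satisfying RHS
```

On this toy data three rows match the left side and two of them have returnLoan = False, so the rule has support 2/4 and confidence 2/3.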

