Chapter 22: Advanced Querying and Information Retrieval


1 Chapter 22: Advanced Querying and Information Retrieval
Decision-Support Systems Data Analysis and OLAP Data Mining Data Warehousing Information-Retrieval Systems

2 Decision Support Systems
Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems. Examples of business decisions: What items to stock? What insurance premium to charge? Who to send advertisements to? Examples of data used for making decisions: retail sales transaction details; customer profiles (income, age, sex, etc.).

3 Decision-Support Systems: Overview
Data analysis tasks are simplified by specialized tools and SQL extensions. Example tasks: For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year? As above, for each product category and each customer category. Statistical analysis packages (e.g., S++) can be interfaced with databases. Statistical analysis is a large field; we will not study it here. Data mining seeks to discover knowledge automatically, in the form of statistical rules and patterns, from large databases. A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site. Important for large businesses, which generate data from multiple divisions, possibly at multiple sites. Data may also be purchased externally.

4 Data Analysis and OLAP Aggregate functions summarize large volumes of data. Online analytical processing (OLAP): tools to support interactive analysis of data, allowing data to be summarized and viewed in different ways. A histogram partitions the values taken by an attribute into ranges, and computes an aggregate over the values in each range; it is cumbersome to use standard SQL to construct a histogram. Extension proposed by Red Brick:
select percentile, avg(balance)
from account
group by N_tile(balance, 10) as percentile

5 Cross Tabulation of sales by item-name and color
The table above is an example of a cross-tabulation (or cross-tab), also referred to as a pivot-table. In general, a cross-tab is a table where values for one attribute form the row headers, values for another attribute form the column headers, and the value in an individual cell is derived by aggregating (here, summing) the measure attribute over all tuples with that cell's row and column values. A cross-tab with summary rows/columns can be represented by introducing a special value all to represent subtotals.
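A minimal Python sketch of how such cells can be derived, including the all subtotals; the tuple layout and sample data (which mirror the sales figures on the later Data Analysis slides) are assumptions for illustration:

    from collections import defaultdict

    # sales relation as (color, size, number) tuples
    sales = [
        ("Light", "Small", 8), ("Light", "Medium", 35), ("Light", "Large", 10),
        ("Dark", "Small", 20), ("Dark", "Medium", 10), ("Dark", "Large", 5),
    ]

    # Cell (r, c) sums number over all tuples with color r and size c;
    # the special value "all" stands for a subtotal over that dimension.
    cells = defaultdict(int)
    for color, size, number in sales:
        for r in (color, "all"):
            for c in (size, "all"):
                cells[(r, c)] += number

    print(cells[("Light", "Small")])  # 8: an ordinary cell
    print(cells[("Light", "all")])    # 53: row subtotal
    print(cells[("all", "all")])      # 88: grand total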

6 Relational Representation of the Data in Figure 22.1

7 Three-Dimensional Data Cube

8 Online Analytical Processing
The operation of changing the dimensions used in a cross-tab is called pivoting. An OLAP system provides other functionality as well. For instance, the analyst may wish to see a cross-tab on item-name and color for a fixed value of size, for example large, instead of the sum across all sizes. Such an operation is referred to as slicing; it is sometimes called dicing, particularly when values for multiple dimensions are fixed. The operation of moving from finer-granularity data to a coarser granularity is called a rollup. The opposite operation, that of moving from coarser-granularity data to finer-granularity data, is called a drill down.

9 OLAP Implementation The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems. OLAP implementations using only relational database features are called relational OLAP (ROLAP) systems. Hybrid systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems.

10 Hierarchies on Dimensions

11 Cross Tabulation of sales With Hierarchy on item-name

12 Data Analysis (Cont.)
         Small   Medium   Large   Total
Light      8       35       10      53
Dark      20       10        5      35
Total     28       45       15      88
Cross-tabulation of number by size and color of the sample relation sales, with the schema Sales(color, size, number).

13 Data Analysis (Cont.)
Color   Size     Number
Light   Small        8
Light   Medium      35
Light   Large       10
Light   all         53
Dark    Small       20
Dark    Medium      10
Dark    Large        5
Dark    all         35
all     Small       28
all     Medium      45
all     Large       15
all     all         88
Can represent subtotals in relational form by using the value all. E.g., obtain (Light, all, 53) and (Dark, all, 35) by aggregating individual tuples with different values for size for each color.

14 Data Analysis (Cont.) Rollup: moving from finer-granularity data to a coarser granularity by means of aggregation. Drill down: moving from coarser-granularity data to finer-granularity data. Proposed extensions to SQL, such as the cube operation, help to support generation of summary data. The following query generates the previous table:
select color, size, sum(number)
from sales
group by color, size with cube

15 Data Analysis (Cont.) The figure shows the combinations of the dimensions size, color, and price. In general, computing the cube operation with n group by columns gives 2^n different group by combinations.

16 Data Mining Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns; it differs from machine learning in that it deals with large volumes of data stored primarily on disk. Knowledge discovered from a database can be represented by a set of rules, e.g., “Young women with annual incomes greater than $50,000 are most likely to buy sports cars.” Discover rules using one of two models: 1. The user is involved directly in the process of knowledge discovery. 2. The system is responsible for automatically discovering knowledge from the database, by detecting patterns and correlations in the data.

17 Knowledge Representation Using Rules
General form of rules: ∀ X, antecedent ⇒ consequent, where X is a list of one or more variables with associated ranges. The rule ∀ transactions T, buys(T, bread) ⇒ buys(T, milk) states: if there is a tuple (t1, bread) in the relation buys, there must also be a tuple (t1, milk) in the relation buys. Population: the cross-product of the ranges of the variables in the rule; in the above example, the set of all transactions. Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule, e.g., 10% of transactions buy bread and milk. Confidence: a measure of how often the consequent is true when the antecedent is true, e.g., 80% of transactions that buy bread also buy milk.
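A minimal Python sketch of these two measures, assuming the population is a list of transactions, each represented by the set of items it contains (the sample data is invented):

    transactions = [
        {"bread", "milk"}, {"bread", "milk", "eggs"},
        {"bread"}, {"milk"}, {"screwdriver"},
    ]

    def support(itemset):
        # Fraction of the population containing every item in itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # How often the consequent holds when the antecedent holds.
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "milk"}))       # 0.4
    print(confidence({"bread"}, {"milk"}))  # 0.666...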

18 Some Classes of Data-Mining Problems
Classification: finding rules that partition the given data into disjoint groups (classes) that are relevant for making a decision (e.g., which of several factors help classify a person’s credit worthiness). Associations: useful to determine associations between different items (e.g., someone who buys bread is quite likely also to buy milk). Sequence correlations: determine correlations between related sequence data (e.g., when bond rates go up, stock prices go down within two days).

19 User-Guided Data Mining
In user-guided data mining, primary responsibility for discovering rules is with the user. The user may run tests on the database to verify or refute a hypothesis. Confidence and support for rules expressing a hypothesis are derived from the database. An iterative process of forming and refining rules is used. Example: test the hypothesis “People who hold master’s degrees are the most likely to have an excellent credit rating.” If the confidence of the rule is low, it may be refined into the rule: ∀ people P, P.degree = masters and P.income ≥ 75,000 ⇒ P.credit = excellent. Data visualization, through graphical representations like maps, charts, and color-coding, helps detect patterns in data.

20 Classification Rules Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk? Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant. Classification rules can be compactly shown as a classification tree.

21 Part of Credit Risk Classification Tree

22 Discovery of Classification Rules
Training set: a data sample in which the grouping of each tuple is already known. Top-down generation of classification tree: each internal node of the tree partitions the data into groups based on a partitioning attribute. The data at a node is not partitioned further if either all (or most) of the items at the node belong to the same class, or all attributes have been considered; such a node is a leaf node. Otherwise the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

23 Discovery of Classification Rules (Cont.)
Consider the credit risk example: suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value. At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval. Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are. In general, different branches of the tree could grow to different levels, and different nodes at the same level may use different partitioning attributes.

24 Discovery of Association Rules
Example: ∀ transactions T, buys(T, bread) ⇒ buys(T, milk). In general: notion of a transaction, and its itemset, the set of items contained in the transaction. General form of rule: ∀ transactions T, c(T, i1) and … and c(T, in) ⇒ c(T, i0), where c(T, ik) denotes that transaction T contains item ik. The above can be represented as A ⇒ b, where A = {i1, i2, …, in} and b = i0. Support of rule = number of transactions whose itemsets contain A ∪ {b}. Usually desire rules with strong support, which will involve only items purchased in a significant percentage of the transactions.

25 Discovery of Association Rules (Cont.)
Consider all possible sets of relevant items. For each set, find its support (i.e., how many transactions purchase all items in the set). Use sets with sufficiently high support to generate association rules: from set A, generate the rule A − {b} ⇒ b for each b ∈ A. The support of each of these rules is the support of A. The confidence of a rule is the support of A divided by the support of A − {b}.
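A hedged Python sketch of this rule-generation step, assuming itemset supports have already been counted (the support counts below are invented):

    def rules_from_itemset(A, support_of):
        # Generate A - {b} => b for each b in A; support_of maps a
        # frozenset to its support count.
        rules = []
        for b in A:
            antecedent = A - {b}
            if not antecedent:
                continue
            conf = support_of[A] / support_of[antecedent]
            rules.append((set(antecedent), b, support_of[A], conf))
        return rules

    support_of = {
        frozenset({"bread", "milk"}): 40,
        frozenset({"bread"}): 60,
        frozenset({"milk"}): 50,
    }
    for rule in rules_from_itemset(frozenset({"bread", "milk"}), support_of):
        print(rule)   # ({'milk'}, 'bread', 40, 0.8) and ({'bread'}, 'milk', 40, 0.666...)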

26 Finding Support Few sets: determine the level of support in a single pass. A count is maintained for each set, initially set to 0. When a transaction is fetched, the count is incremented for each set of items that is contained in the itemset of the transaction. Sets with a high count at the end of the pass correspond to items with a high degree of association. Many sets: if memory is not enough to hold all counts for all sets, use multiple passes, considering only some sets in each pass. Optimization: once a set is eliminated because it occurs in too small a fraction of the transactions, none of its supersets needs to be considered.
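A minimal Python sketch of the single-pass counting; the candidate sets are assumed to be given (in a real system they would come from the superset-pruning optimization just described):

    def count_supports(transactions, candidates):
        # One pass over the data: increment the count of each candidate
        # set that is contained in the transaction's itemset.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return counts

    transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk"}]
    candidates = [frozenset({"bread", "milk"}), frozenset({"milk", "eggs"})]
    print(count_supports(transactions, candidates))
    # {frozenset({'bread', 'milk'}): 2, frozenset({'milk', 'eggs'}): 1}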

27 Data Warehousing A data warehouse is a repository of information gathered from multiple sources.

28 Data Warehousing (Cont.)
Provides a single consolidated interface to data. Data is stored for an extended period, providing access to historical data. Data/updates are periodically downloaded from online transaction processing (OLTP) systems; typically, the download happens each night. Data may not be completely up-to-date, but is recent enough for analysis. Running large queries at the warehouse ensures that OLTP systems are not affected by the decision-support workload.

29 Issues in Building a Warehouse
When and how to gather data. Source driven: the data source initiates data transfer. Destination driven: the warehouse initiates data transfer. What schema to use: schema integration; cleaning and conversion of incoming data. What data to summarize: raw data may be too large to store on-line; aggregate values (totals/subtotals) often suffice; queries on raw data can often be transformed by the query optimizer to use aggregate values. How to propagate updates: data at the warehouse is a view on the source data, so efficient view-maintenance techniques are required.

30 Information Retrieval Systems
Information retrieval (IR) systems use a simpler data model than database systems, but provide more powerful querying capabilities within the restricted model. Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords, e.g., find documents containing the words “database systems”. Information retrieval systems order answers based on their estimated relevance. E.g., the user may really only want documents about database systems, but the system may retrieve all documents that mention the phrase “database systems”. Documents may be ordered by, for example, how many times the phrase appears in the document.

31 Queries Combinations of keywords
motorcycle and maintenance; computer or micro-processor; computer but not database. Closeness of keywords in the and case affects ranking; some systems allow the user to specify that the keywords must occur close to each other. Synonyms: to retrieve a document titled motorcycle repair for the query motorcycle and maintenance, the system needs to realize that maintenance and repair are synonyms. Similarity-based retrieval: retrieve documents similar to a given document. Similarity may be defined based on metrics such as the number of common keywords.

32 Differences From Database Systems
Information retrieval systems, unlike traditional database systems, handle: unstructured documents; searching using keywords and relevance ranking. Most information retrieval systems do not handle: high update rates; concurrency control; data structured using more complex data models (e.g., relational or object-oriented data models); complex queries written in, e.g., SQL.

33 Indexing of Documents Documents that contain a specified keyword can be located using an inverted index, which maps each keyword Ki to the set Si of identifiers of documents that contain Ki. IR systems save space by using index structures that support only approximate retrieval. This may result in: false drop - some relevant documents may not be retrieved; false positive - some irrelevant documents may be retrieved. For many applications a good index should not permit any false drops, but may permit a few false positives. Relevant performance metrics: precision - what percentage of the retrieved documents are relevant to the query; recall - what percentage of the documents relevant to the query were retrieved.

34 Indexing of Documents (Cont.)
and operation: finds documents that contain all of a set of keywords K1, K2, ..., Kn. Retrieve the corresponding sets of document identifiers S1, S2, ..., Sn. The intersection, S1 ∩ S2 ∩ ... ∩ Sn, gives the identifiers of the desired set of documents. or operation: gives the set of all documents that contain at least one of the keywords K1, K2, ..., Kn. Found by computing the union, S1 ∪ S2 ∪ ... ∪ Sn, of the sets.

35 Indexing of Documents (Cont.)
not operation: finds documents that do not contain a specified keyword Ki. Let Si be the set of identifiers of documents that contain the keyword Ki. Given a set of document identifiers S, eliminate documents that contain the specified keyword Ki by taking the difference S − Si. A full-text index uses every word in the document as a keyword. Stop words are very commonly occurring words that are useless as keywords, e.g., a, an, the, it, etc. These are eliminated from the index.
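A minimal Python sketch of an inverted index supporting the and, or, and not operations through set intersection, union, and difference (the documents are invented):

    docs = {
        1: "database systems concepts",
        2: "operating systems",
        3: "database design",
    }

    # Inverted index: keyword -> set of identifiers of documents containing it.
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def ids(keyword):
        return index.get(keyword, set())

    print(ids("database") & ids("systems"))  # and: intersection -> {1}
    print(ids("database") | ids("systems"))  # or: union -> {1, 2, 3}
    print(set(docs) - ids("database"))       # not: S - Si -> {2}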

36 Browsing Storing related documents together facilitates browsing, where a user can see not only the requested document but also related ones. Browsing in a library is facilitated by a classification system that organizes logically related books together.

37 A Classification Tree

38 Classification DAG Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important. The classification hierarchy is thus a directed acyclic graph (DAG).

39 A Classification DAG for a library information retrieval system

40 Locating of Information on the Web
The Archie system automatically follows Gopher links to locate information, and creates a centralized index of information found at various sites. Web indexing systems (Web crawlers) follow the hypertext links in documents to find other documents, and build an index on the documents. The indices are often full-text indices, and are stored locally at the indexing system. These systems run a background process to find new sites, obtain updated information from known sites, and discard defunct sites.

41 Locating of Information on the Web (Cont.)
Web indexing systems permit documents to be located even though they are not registered with any central authority. Drawback: poor precision or recall. Full-text indexing retrieves unrelated documents that just happen to mention the requested keyword. HTML extensions now allow documents to be tagged with keywords to be used by search engines; unfortunately, many documents do not provide such keywords. The extremely large number of documents on the Web often leads to far more results than a human can handle. Alternative approach: a cataloging system for the Web, such as that provided by Yahoo. Combinations of catalogs and indexing are quite useful.

42 Classification Tree

43 Data-Warehouse Architecture

44 Star Schema For A Data Warehouse

45 A Classification Hierarchy For A Library System

46 A Classification DAG For A Library Information Retrieval System

47 Data Analysis and OLAP Online Analytical Processing
Data that can be modeled as dimension attributes and measure attributes is called multidimensional data. Given a relation used for data analysis, we can identify some of its attributes as measure attributes, since they measure some value and can be aggregated upon. For instance, the attribute number of the sales relation is a measure attribute, since it measures the number of units sold. Some of the other attributes of the relation are identified as dimension attributes, since they define the dimensions on which measure attributes, and summaries of measure attributes, are viewed.

48 Extended Aggregation SQL:1999 also supports generalizations of the group by construct, using the cube and rollup constructs. A representative use of the cube construct is:
select item-name, color, size, sum(number)
from sales
group by cube(item-name, color, size)
This query computes the union of eight different groupings of the sales relation: { (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }, where ( ) denotes an empty group by list. For each grouping, the result contains the null value for attributes not present in the grouping. For instance, the relational representation shown on the earlier Data Analysis slide, with occurrences of all replaced by null, can be computed by the query
select item-name, color, sum(number)
from sales
group by cube(item-name, color)

49 Extended Aggregation (Cont.)
A representative rollup construct is
select item-name, color, size, sum(number)
from sales
group by rollup(item-name, color, size)
Here only four groupings are generated: { (item-name, color, size), (item-name, color), (item-name), ( ) }. Rollup can be used to generate aggregates at multiple levels of a hierarchy on a column. For instance, suppose we have a table itemcategory(item-name, category) giving the category of each item. Then the query
select category, item-name, sum(number)
from sales, itemcategory
where sales.item-name = itemcategory.item-name
group by rollup(category, item-name)
would give a hierarchical summary by item-name and by category.

50 Extended Aggregation (Cont.)
Multiple rollups and cubes can be used in a single group by clause. For instance, the following query
select item-name, color, size, sum(number)
from sales
group by rollup(item-name), rollup(color, size)
generates the groupings { (item-name, color, size), (item-name, color), (item-name), (color, size), (color), ( ) }. The function grouping can be applied to an attribute; it returns 1 if the value is a null value representing all, and returns 0 in all other cases. Consider the following query:
select item-name, color, size, sum(number),
grouping(item-name) as item-name-flag,
grouping(color) as color-flag,
grouping(size) as size-flag
from sales
group by cube(item-name, color, size)

51 Ranking Ranking is done in conjunction with an order by specification. Suppose we are given a relation student-marks(student-id, marks), which stores the marks obtained by each student. The following query gives the rank of each student:
select student-id, rank() over (order by marks desc) as s-rank
from student-marks
An extra order by clause is needed to get the results in sorted order, as shown below:
select student-id, rank() over (order by marks desc) as s-rank
from student-marks
order by s-rank
Ranking can be done within partitions of the data. The following query gives the rank of students within each section:
select student-id, section, rank() over (partition by section order by marks desc) as sec-rank
from student-marks, student-section
where student-marks.student-id = student-section.student-id
order by section, sec-rank

52 Ranking (Cont.) For a given constant n, the ranking function ntile(n) takes the tuples in each partition in the specified order and divides them into n buckets with equal numbers of tuples. For instance, we can sort employees by salary, and use ntile(3) to find which range (bottom third, middle third, or top third) each employee is in, and compute the total salary earned by employees in each range:
select threetile, sum(salary)
from (select salary, ntile(3) over (order by salary) as threetile
      from employee) as s
group by threetile
SQL:1999 permits the user to specify where nulls should occur in the sort order, by using nulls first or nulls last, for instance:
select student-id, rank() over (order by marks desc nulls last) as s-rank
from student-marks

53 Windowing An example of a window query is one that, given sales values for each date, calculates for each date the average of the sales on that day, the previous day, and the next day; such moving-average queries are used to smooth out random variations. In contrast to group by, the same tuple can exist in multiple windows. Suppose we are given a relation transaction(account-number, date-time, value), where value is positive for a deposit and negative for a withdrawal. The following query computes a running balance for each account:
select account-number, date-time,
sum(value) over (partition by account-number order by date-time rows unbounded preceding) as balance
from transaction
order by account-number, date-time

54 Decision - Tree Construction
The main idea of decision-tree construction is to evaluate different attributes and different partitioning conditions, and pick the attribute and partitioning condition that results in the maximum information-gain ratio.
procedure GrowTree(S)
    Partition(S)
procedure Partition(S)
    if (purity(S) > p or |S| < s) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
We can generate classification rules from a decision tree, if we so desire. For each leaf we generate a rule as follows: the left-hand side is the conjunction of all the split conditions on the path to the leaf, and the class is the class of the majority of the training instances at the leaf. An example of such a classification rule is degree = masters and income > 75,000 ⇒ excellent.
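A hedged Python sketch of the Partition procedure above, using the majority-class fraction as a simple stand-in for the purity measure and the weighted purity of the children as the split criterion (the full criterion, information-gain ratio, is defined on the Best Splits slides); the thresholds and data layout are assumptions:

    from collections import Counter

    P_THRESHOLD = 0.9   # the constant p in the pseudocode (assumed value)
    S_THRESHOLD = 5     # the constant s in the pseudocode (assumed value)

    def purity(S):
        # Fraction of instances in the majority class; 1.0 means S is pure.
        return Counter(cls for _, cls in S).most_common(1)[0][1] / len(S)

    def majority_class(S):
        return Counter(cls for _, cls in S).most_common(1)[0][0]

    def partition(S, splits):
        # S: list of (attributes-dict, class); splits: candidate predicates,
        # e.g. lambda row: row["income"] > 75000.
        if purity(S) > P_THRESHOLD or len(S) < S_THRESHOLD:
            return ("leaf", majority_class(S))
        def score(p):
            yes = [x for x in S if p(x[0])]
            no = [x for x in S if not p(x[0])]
            if not yes or not no:
                return -1.0   # useless split: one side is empty
            return (len(yes) * purity(yes) + len(no) * purity(no)) / len(S)
        best = max(splits, key=score)
        if score(best) < 0:
            return ("leaf", majority_class(S))
        yes = [x for x in S if best(x[0])]
        no = [x for x in S if not best(x[0])]
        return ("node", best, partition(yes, splits), partition(no, splits))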

55 Other Types of Classifiers
Two other types of classifiers are neural net classifiers and Bayesian classifiers. Neural net classifiers use the training data to train artificial neural nets. To find the probability p(cj | d) of instance d being in class cj, Bayesian classifiers use Bayes' theorem, which says p(cj | d) = p(d | cj) p(cj) / p(d), where p(d | cj) is the probability of generating instance d given class cj, p(cj) is the probability of occurrence of class cj, and p(d) is the probability of instance d occurring. To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj). That is, the probability of the instance d occurring is the product of the probabilities of occurrence of each of the attribute values di of d, given the class cj.
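A minimal Python sketch of a naïve Bayesian classifier along these lines, estimating all probabilities by relative frequencies (no smoothing, so unseen attribute values get probability 0; the training data is invented):

    from collections import Counter, defaultdict

    def train(instances):
        # instances: list of (attribute-tuple, class)
        class_counts = Counter(cls for _, cls in instances)
        value_counts = defaultdict(Counter)  # (position, class) -> value counts
        for attrs, cls in instances:
            for i, v in enumerate(attrs):
                value_counts[(i, cls)][v] += 1
        return class_counts, value_counts, len(instances)

    def classify(d, model):
        class_counts, value_counts, n = model
        def score(cj):
            p = class_counts[cj] / n                   # p(cj)
            for i, v in enumerate(d):                  # p(d|cj) = product of p(di|cj)
                p *= value_counts[(i, cj)][v] / class_counts[cj]
            return p                                   # p(d) omitted: same for every class
        return max(class_counts, key=score)

    model = train([(("masters", "high"), "excellent"),
                   (("masters", "high"), "excellent"),
                   (("bachelors", "low"), "good")])
    print(classify(("masters", "high"), model))  # excellent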

56 Regression Regression deals with the prediction of a value, rather than a class. Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y. One way is to infer coefficients a0, a1, …, an such that Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn. Finding such a linear polynomial is called linear regression. The process of finding a curve that fits the data is called curve fitting. The fit may only be approximate, because of noise in the data or because the relationship is not exactly a polynomial, so regression aims to find coefficients that give the best possible fit.
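A minimal sketch of linear regression by least squares, using NumPy (the data points are invented and happen to fit Y = 2*X1 + 3*X2 exactly):

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    Y = np.array([8.0, 7.0, 18.0, 17.0])

    # Prepend a column of ones so the first coefficient is the intercept a0.
    A = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)  # minimizes squared error
    print(coeffs)  # approximately [0, 2, 3]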

57 Association Rules Retail shops are often interested in associations between different items that people buy. Examples of such associations: someone who buys bread is quite likely also to buy milk; a person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts. Association information can be used in several ways: when a customer buys a particular book, an online shop may suggest associated books. An example of an association rule is bread ⇒ milk. An association rule must have an associated population; the population consists of a set of instances.

58 Association Rules (Cont.)
Association rules: an example of an association rule is bread ⇒ milk. An association rule must have an associated population; the population consists of a set of instances. Rules have an associated support, as well as an associated confidence. These are defined in the context of the population. Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. For instance, suppose only a tiny fraction of all purchases include milk and screwdrivers; then the support for the rule milk ⇒ screwdrivers is low. Confidence is a measure of how often the consequent is true when the antecedent is true. For instance, the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. A rule with a low confidence is not meaningful. Note that the confidence of bread ⇒ milk may be very different from the confidence of milk ⇒ bread, although both have the same support.

59 Clustering

60 Other Types of Mining

61 Data Warehousing

62 Components of Data Warehouse
When and how to gather data. What schema to use. Data cleansing. How to propagate updates. What data to summarize.

63 Warehouse Schemas

64 Information-Retrieval Systems

65 Keyword Search Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not. A query containing keywords without any of the above connectives is assumed to have ands implicitly connecting the keywords. In full-text retrieval, all the words in each document are considered to be keywords. We shall use the word term to refer to the words in a document, since all words are keywords.

66 Relevance Ranking Using Terms
One way of measuring r(d, t), the relevance of a document d to a term t, is
r(d, t) = log(1 + n(d, t) / n(d))
where n(d) denotes the number of terms in the document and n(d, t) denotes the number of occurrences of term t in the document d. The relevance of a document d to a query Q is then
r(d, Q) = Σ(t ∈ Q) r(d, t) / n(t)
where n(t) is the number of documents that contain term t.
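A minimal Python sketch of these formulas (the document, query, and document-frequency counts n(t) are invented):

    import math

    def r_term(doc_terms, t):
        # r(d, t) = log(1 + n(d, t) / n(d))
        return math.log(1 + doc_terms.count(t) / len(doc_terms))

    def r_query(doc_terms, query_terms, n_docs_with):
        # r(d, Q) = sum over t in Q of r(d, t) / n(t), where n(t) is the
        # number of documents containing t (rarer terms weigh more).
        return sum(r_term(doc_terms, t) / n_docs_with[t] for t in query_terms)

    doc = "the database systems book covers database design".split()
    print(r_term(doc, "database"))  # 2 occurrences out of 7 terms
    print(r_query(doc, ["database", "systems"], {"database": 3, "systems": 5}))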

67 Relevance Using Hyperlinks

68 Similarity-Based Retrieval

69 Synonyms and Homonyms

70 Indexing of Documents

71 Measuring Retrieval Effectiveness

72 Web Search Engines

73 Directories


75 Applications of Data Mining
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors or (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good

76 Best Splits The purity of a set S of training instances can be measured quantitatively in several ways. Suppose there are k classes, and of the instances in S the fraction of instances in class i is pi. One measure of purity, the Gini measure, is defined as
Gini(S) = 1 − Σ(i=1..k) pi²
When all instances are in a single class, the Gini value is 0, while it reaches its maximum (of 1 − 1/k) if each class has the same number of instances. Another measure of purity is the entropy measure, which is defined as
Entropy(S) = − Σ(i=1..k) pi log2 pi
When a set S is split into multiple sets Si, i = 1, 2, …, r, we can measure the purity of the resultant set of sets as
Purity(S1, S2, …, Sr) = Σ(i=1..r) (|Si| / |S|) purity(Si)
The above formula can be used with both the Gini measure and the entropy measure of purity.

77 Best Splits (Cont.) The information gain due to a particular split of S into Si, i = 1, 2, …, r, is
Information-gain(S, {S1, S2, …, Sr}) = purity(S) − purity(S1, S2, …, Sr)
The information content of a particular split can be defined in terms of entropy as
Information-content(S, {S1, S2, …, Sr}) = − Σ(i=1..r) (|Si| / |S|) log2(|Si| / |S|)
All of this leads to a definition: the best split for an attribute is the one that gives the maximum information-gain ratio, defined as
Information-gain(S, {S1, S2, …, Sr}) / Information-content(S, {S1, S2, …, Sr})
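A minimal Python sketch of these definitions; either the Gini or the entropy function can serve as the purity measure, as the slide notes (the example split at the end is invented):

    import math

    def gini(fractions):
        # Gini(S) = 1 - sum of pi^2
        return 1 - sum(p * p for p in fractions)

    def entropy(fractions):
        # Entropy(S) = -sum of pi * log2(pi), with 0 * log2(0) taken as 0
        return -sum(p * math.log2(p) for p in fractions if p > 0)

    def split_purity(sizes, purities):
        # purity(S1, ..., Sr) = sum of (|Si| / |S|) * purity(Si)
        total = sum(sizes)
        return sum(n / total * p for n, p in zip(sizes, purities))

    def information_gain_ratio(parent_purity, sizes, purities):
        gain = parent_purity - split_purity(sizes, purities)
        total = sum(sizes)
        content = -sum(n / total * math.log2(n / total) for n in sizes)
        return gain / content

    # A 50/50 parent split into a purer pair of children:
    parent = gini([0.5, 0.5])
    print(information_gain_ratio(parent, [6, 4], [gini([0.9, 0.1]), gini([0.25, 0.75])]))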

