
1 Monday, February 22, 2016

2
- The term analytics is often used interchangeably with:
  - Data science
  - Data mining
  - Knowledge discovery
  - Extracting useful business patterns or mathematical decision models from a preprocessed data set

3

4
- Analytics techniques come from a variety of disciplines:
  - Statistics (e.g., regression)
  - Machine learning (e.g., decision trees)
  - Biology (e.g., neural networks, genetic algorithms)

5
- Applications exist in numerous areas: retail, travel, health care, actuarial science, credit scoring, movies, sports, marketing, financial services, pharmaceuticals, telecommunications, etc.

6
1. In predictive analytics, a target variable is typically available
   - Can be categorical (e.g., churn or not, fraud or not) or continuous (e.g., customer lifetime value, loss given default)
2. In descriptive analytics, no such target variable is available
   - Clustering is one example

7

8
- Missing data values can occur for various reasons
  - Customer decides not to disclose income
  - Error occurs in merging because of typos in name
- Popular schemes to deal with it:
  - Replace data
    - With average or median
    - Using a regression based on other data (e.g., age, income)
  - Delete data
    - Simplest and most straightforward option
    - Assumes no meaningful interpretation is lost
  - Keep data
    - Missing data may be meaningful (e.g., customer did not disclose income because he is currently unemployed)
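The three schemes above can be sketched in a few lines. This is only an illustration: the function names, the `income` field, and the sample values are assumptions made for the example, and missing values are assumed to be stored as `None`.

```python
from statistics import median

def impute_median(values):
    """Replace: fill each missing (None) entry with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]

def drop_missing(rows, field):
    """Delete: keep only the records where `field` is present."""
    return [r for r in rows if r.get(field) is not None]

def flag_missing(rows, field):
    """Keep: retain every record and add an indicator so missingness itself can be analyzed."""
    return [{**r, f"{field}_missing": r.get(field) is None} for r in rows]

# Hypothetical sample data
incomes = [1800, None, 2400, 3000, None]
customers = [{"id": 1, "income": 1800}, {"id": 2, "income": None}]

imputed = impute_median(incomes)
kept = drop_missing(customers, "income")
flagged = flag_missing(customers, "income")
```

Which scheme fits depends on whether the missingness carries information; the indicator column is one way to let a later model decide.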

9
- Two types of outliers can be considered:
  - Valid observation (e.g., salary of $2 million)
  - Invalid observation (e.g., age of 200 years)
- Detection can be done statistically
- A couple of techniques:
  - Trimming/truncating: remove outliers
  - Winsorising: bring data back to lower and upper limits (e.g., median +/- 3SD)
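A minimal sketch of both techniques, using the slide's example limits of median +/- 3 SD. The function names and the age values are made up for illustration, and the sample standard deviation from `statistics.stdev` is an assumed choice of spread measure.

```python
from statistics import median, stdev

def outlier_limits(values, k=3.0):
    """Lower and upper limits as median +/- k sample standard deviations."""
    center = median(values)
    spread = stdev(values)
    return center - k * spread, center + k * spread

def trim(values, k=3.0):
    """Trimming/truncating: drop observations outside the limits."""
    lo, hi = outlier_limits(values, k)
    return [v for v in values if lo <= v <= hi]

def winsorise(values, k=3.0):
    """Winsorising: pull observations outside the limits back to the nearest limit."""
    lo, hi = outlier_limits(values, k)
    return [min(max(v, lo), hi) for v in values]

# Hypothetical ages with one invalid observation (200 years)
ages = [25, 28, 29, 31, 33, 35, 36, 38, 40, 42, 44, 47, 200]

lower, upper = outlier_limits(ages)
trimmed = trim(ages)
winsorised = winsorise(ages)
```

Note that the outlier itself inflates the standard deviation, so in small samples an extreme value can widen the limits enough to escape detection.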

10
- Regression: target variable is continuous
  - Stock prices
  - Loss given default (LGD)
  - Customer lifetime value (CLV)
- Classification: target is categorical
  - Binary (fraud, churn, credit risk)
  - Multiclass (predict credit ratings)

11
- Active churn: customer stops relationship with firm
  - Contractual setting (e.g., cell phone service): easy to detect, since the customer cancels the contract
  - Noncontractual setting (e.g., grocery store): churn needs to be operationalized (e.g., customer has not purchased any products in the last 3 months)
- Passive churn: decreasing product or service usage
- Forced churn: company stops the relationship
- Expected churn: customer no longer needs a product or service (e.g., baby products)
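In a noncontractual setting, the churn definition has to be turned into a rule. The sketch below applies the "no purchase in the last 3 months" example; approximating 3 months as 90 days, the function name, and the customer data are all assumptions for illustration.

```python
from datetime import date, timedelta

def is_churned(last_purchase, today, inactivity_days=90):
    """Flag a customer as churned if the last purchase is older than the inactivity window."""
    return (today - last_purchase) > timedelta(days=inactivity_days)

# Hypothetical last-purchase dates per customer
last_purchases = {
    "c1": date(2016, 2, 1),
    "c2": date(2015, 10, 15),
}
today = date(2016, 2, 22)

churned = {cid for cid, d in last_purchases.items() if is_churned(d, today)}
```

The window length is a business choice; a grocery store and a furniture store would likely pick very different values.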

12
- Recursive partitioning algorithm (RPA) that represents patterns in underlying data set
- Leaf/terminal nodes represent outcomes
- Building a decision tree:
  - Splitting: Which variables and at what values?
  - Stopping: When to stop growing the tree?
  - Decisions: What class to assign each leaf node?
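The splitting and decision steps can be illustrated on a single numeric variable. The slide does not name a splitting criterion; Gini impurity is assumed here as one common choice, and the function names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Splitting: find the threshold t for 'value <= t' with the lowest weighted impurity."""
    best_threshold, best_score = None, gini(labels)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold halfway between neighboring values
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

def majority_class(labels):
    """Decisions: assign a leaf the most frequent class among its observations."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical data: customer age and churn outcome
ages = [22, 25, 30, 41, 45, 52]
outcome = ["churn", "churn", "churn", "stay", "stay", "stay"]

threshold, impurity = best_split(ages, outcome)
```

A full tree would apply `best_split` recursively to each side and stop when, for example, a node is pure or too small to split further.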

13  Decision trees essentially model decision boundaries orthogonal to the axes

14  Decision trees can be used for continuous targets

15
- Contrary to predictive analytics, there is no real target variable available
- Sometimes called unsupervised learning since there is no target variable to steer the learning process

16  Typically begins with a database of transactions:

17
- Stochastic in nature, with a statistical measure of the strength of the association
- Rules measure correlational association and should not be interpreted in a causal way
- Examples:
  - If a customer buys spaghetti, then the customer buys red wine in 70% of the cases
  - If a customer visits web page A, then the customer will visit web page B in 90% of the cases
  - If a customer has a car loan and car insurance, then the customer has a checking account in 80% of the cases
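The "in X% of the cases" figure in rules like these is usually called the confidence of the rule, and it is commonly reported alongside the rule's support. A small sketch follows, with made-up transactions (so the numbers differ from the 70% in the example) and hypothetical function names.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / antecedent_support

# Hypothetical transactions database
transactions = [
    {"spaghetti", "red wine"},
    {"spaghetti", "red wine", "bread"},
    {"spaghetti", "bread"},
    {"red wine"},
    {"bread", "milk"},
]

rule_support = support(transactions, {"spaghetti", "red wine"})
rule_confidence = confidence(transactions, {"spaghetti"}, {"red wine"})
```

Here spaghetti and red wine appear together in 2 of 5 transactions, and red wine appears in 2 of the 3 transactions that contain spaghetti.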

18
- Suppose customer web page visits were logged:
  - Session 1: A, B, C
  - Session 2: B, C
  - Session 3: A, C, D
  - Session 4: A, B, D
  - Session 5: D, C, A
- Consider the sequence rule A -> C
- The support and confidence can be measured in various ways
  - Support:
    - C follows A in any subsequent stage (2/5)
    - C immediately follows A (1/5)
  - Confidence (given that A occurs):
    - C follows A in any subsequent stage (2/4)
    - C immediately follows A (1/4)
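The counts for the rule A -> C can be checked with a short script over the five sessions listed on the slide. The function names are hypothetical; `Fraction` is used only to keep the results as exact ratios.

```python
from fractions import Fraction

sessions = [
    ["A", "B", "C"],
    ["B", "C"],
    ["A", "C", "D"],
    ["A", "B", "D"],
    ["D", "C", "A"],
]

def follows(session, first, then, immediate=False):
    """True if `then` occurs after `first` in the session (directly after, if immediate)."""
    for i, page in enumerate(session):
        if page == first:
            rest = session[i + 1:]
            if immediate:
                if rest[:1] == [then]:
                    return True
            elif then in rest:
                return True
    return False

def support(sessions, first, then, immediate=False):
    """Share of all sessions in which the sequence occurs."""
    hits = sum(follows(s, first, then, immediate) for s in sessions)
    return Fraction(hits, len(sessions))

def confidence(sessions, first, then, immediate=False):
    """Share of the sessions containing `first` in which the sequence occurs."""
    with_first = [s for s in sessions if first in s]
    if not with_first:
        return Fraction(0)
    hits = sum(follows(s, first, then, immediate) for s in with_first)
    return Fraction(hits, len(with_first))

support_any = support(sessions, "A", "C")                   # 2/5
support_next = support(sessions, "A", "C", immediate=True)  # 1/5
confidence_any = confidence(sessions, "A", "C")             # 2/4
confidence_next = confidence(sessions, "A", "C", immediate=True)  # 1/4
```

Session 5 contains both pages but in the order C then A, so it counts toward the confidence denominator without counting as a hit.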

19
- Divisive clustering starts with the entire data set in one cluster and breaks it up into smaller clusters until one observation per cluster remains (right to left in the accompanying figure)
- Agglomerative clustering does the reverse: it merges clusters until one big cluster is left (left to right)

20
- The vertical lines on the dendrogram give the distance between the two clusters being amalgamated
- The elbow point of a scree plot indicates the optimal clustering
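A minimal sketch of agglomerative clustering on one-dimensional data, recording the distance at each merge (the quantity a dendrogram displays). The slides do not specify how the distance between clusters is measured; single linkage (closest pair of members) is assumed here, and the function name and points are made up.

```python
def agglomerate(points):
    """Merge the two closest clusters until one remains; return (cluster_a, cluster_b, distance) per merge."""
    clusters = [[p] for p in points]  # start with one observation per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members of the two clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical observations
points = [1.0, 2.0, 6.0, 7.0, 15.0]
merges = agglomerate(points)
merge_distances = [d for _, _, d in merges]
```

Plotting the merge distances against the number of clusters gives a scree-style plot; a sharp jump (here from 1.0 to 4.0 and then 8.0) is the kind of elbow used to pick a cut.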

21
- Non-hierarchical procedure
  1. Select k observations as initial cluster centroids (seeds)
  2. Assign each observation to the cluster that has the closest centroid
  3. When all observations have been assigned, recalculate the positions of the k centroids
  4. Repeat steps #2 and #3 until the cluster centroids no longer change
- Notes:
  - The number of clusters, k, must be specified before the procedure begins.
  - Different seeds should be tried to verify the stability of the clustering solution.
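The four steps map directly onto a short loop. This is a sketch rather than a production implementation: squared Euclidean distance is assumed as the closeness measure, and the function names, the `max_iter` safety cap, and the sample points are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, seeds=None, max_iter=100, rng=None):
    rng = rng or random.Random(0)
    # Step 1: select k observations as initial centroids (seeds)
    centroids = list(seeds) if seeds is not None else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each observation to the cluster with the closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 3: recalculate the centroids (an empty cluster keeps its old centroid)
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional observations
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2, seeds=[points[0], points[3]])
```

Rerunning with different `seeds` (or a different `rng`) and comparing the resulting clusters is one way to check the stability mentioned in the notes.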

22
- Read Chapter 6 of your textbook
- Work on my term project

