Analysis of Reliance Home Comfort (RHC) Survey Data (fragment)
Objectives
–Show the potential of Business Intelligence solutions in the design and analysis of complex survey studies
–Illustrate how statistical and data mining approaches complement each other in survey data analysis
–Formulate new, important business questions that can be answered only within the data mining modeling paradigm
Brief description of the 2008 Reliance Home Comfort (RHC) Brand & Ad Tracking Study
The study evaluates customer awareness of, and ability to recognize, the 7 most popular Canadian providers of home comfort products and services: Reliance Home Comfort, Direct Energy, Lennox, Carrier, Air One, Sears, and Home Depot.
The household phone survey was conducted by agents who asked customers to identify at most 3 of those 7 companies, so the number of recognized companies ranges from 0 to 3. The questionnaire contained about 300 questions, but it had a hierarchical structure, and the average time to complete the survey was approximately 15 minutes.
Examples of questions:
–When you think of COMPANIES that provide ESSENTIAL HOME COMFORT products and services, which company comes to mind FIRST?
–Have you seen or heard any advertising from any companies that provide ESSENTIAL HOME COMFORT products and services in the past 3 months?
Executive Summary: BI Solutions
Business Intelligence Solutions (BIS) is a well-established statistical, data mining, and GIS company that conducts business in the USA and Canada. We specialize in complex, unstructured business problems for data-rich firms. Our multidisciplinary team includes professionals in applied statistics, data mining, GIS, and software application development. Our employees include professionals with PhD degrees in diverse quantitative fields: Applied Statistics, Data Mining/Machine Learning, Operations Research, and Differential Equations. Team members have authored more than 100 published papers on applications of data mining and other quantitative methods to market research, customer relationship management, pilot study design, and related areas. BIS has access to the best statistical, visualization, data mining, and GIS software on the market.
The essence of our approach is to understand and analyze our clients' business problems and the corresponding data through the prism of dissimilar statistical and data mining models. As a result, we are always able to produce the best possible models and results and to help our clients in the most effective and scientifically sound way.
Exploratory Data Analysis (EDA) and Data Complexity
Example of Data Transformation
[Tables: frequencies of Original First Response (Q9), original data with 22 categories; frequencies of Modified First Response (Q9), transformed data with 9 categories]
Categories MorEnergy, Prestige Home Comfort, and Roy Inch & Sons have no variance and contribute no useful information to the analysis. Therefore, these categories should be aggregated.
Exploratory Data Analysis (EDA) and data preprocessing are vital steps of any data analysis project. This example demonstrates why these preliminary steps are necessary: the constructed variable Modified First Response turns out to be much more predictable than the original First Response (Q9).
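The aggregation step described above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the 22 categories were mapped to 9, so the frequency threshold `min_count`, the label `"Other"`, and the sample responses below are all assumptions.

```python
from collections import Counter

def collapse_rare(values, min_count, other_label="Other"):
    """Replace every category seen fewer than `min_count` times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Made-up responses, not the survey data.
responses = (["RHC"] * 5 + ["Direct Energy"] * 4
             + ["MorEnergy"] + ["Roy Inch & Sons"])

modified = collapse_rare(responses, min_count=3)
modified_counts = Counter(modified)  # RHC: 5, Direct Energy: 4, Other: 2
print(modified_counts)
```

A fixed count threshold is the simplest rule; in practice the grouping may also rely on domain knowledge about which companies belong together.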
Modified First Response (Q9) by Region (Q1)
[Chart: Binary Q9 by Region]
The company that comes to mind FIRST (Q9) differs significantly across regions (Q1); the p-value of the Chi-Square test is 0.0003. For Sudbury/Thunder Bay residents, Reliance Home Comfort comes to mind FIRST 6 times more often than for Hamilton residents. In contrast, the aggregate of all other companies (Other) has a practically uniform distribution across regions. Therefore, RHC advertising and marketing in Burlington, Hamilton, and Oakville needs to be improved.
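As a rough illustration of the Chi-Square test mentioned above, the following sketch computes the Pearson chi-square statistic and its degrees of freedom for a response-by-region contingency table. The counts and the three regions are made up, and the function name is hypothetical. Turning the statistic into a p-value additionally requires the chi-square distribution, which a statistics library would normally supply.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if response and region were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Made-up counts: rows = (RHC, Other), columns = three hypothetical regions.
counts = [[30, 10, 20],
          [70, 90, 80]]
stat, dof = chi_square_statistic(counts)
print(stat, dof)
```

The larger the statistic relative to its degrees of freedom, the stronger the evidence that the first-mentioned company depends on region.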
The 5-point scale statements (questions Q74a - Q74f) should be analyzed separately for interviewees who heard about the company by word of mouth and for those who did not
Q70a (how the interviewee heard about the company), Word of mouth / Recommendation = Yes: just 2 of the 15 pairs of questions have a non-significant correlation.
Q70a, Word of mouth / Recommendation = No: 10 of the 15 pairs of questions have a non-significant correlation.
The two groups therefore have different correlation structures.
Spearman correlation is a non-parametric (distribution-free) measure of the relationship between two variables.
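A minimal sketch of this kind of analysis, using only the Python standard library: Spearman correlation is computed as the Pearson correlation of tie-averaged ranks, and significance is assessed with a permutation test. The permutation test is a stand-in chosen for self-containment; the slides do not say which significance test was used. The answer vectors are made up, and the code assumes neither variable is constant.

```python
import random
from itertools import combinations

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |rho| at least as large as observed."""
    rng = random.Random(seed)
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    shuffled = list(ry)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(rx, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Six statements give 15 distinct pairs, matching the count on the slide.
pairs = list(combinations(["Q74a", "Q74b", "Q74c", "Q74d", "Q74e", "Q74f"], 2))

# Made-up 5-point answers for two statements within one subgroup.
a = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
b = [1, 2, 3, 4, 5, 4, 5, 2, 3, 1]
rho = spearman(a, b)
p = permutation_p_value(a, b)
print(len(pairs), rho, p)
```

Running the same loop over all 15 pairs, once per Q70a subgroup, and counting the pairs whose p-value exceeds the chosen significance level would give the kind of comparison shown on the slide.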
Exploratory Data Analysis summary
RHC survey data analysis requires sophisticated approaches because of the high complexity of the data. This complexity is characterized by:
–High dimensionality (about 300 attributes/questions)
–Non-linearities that are hard to characterize
–Hierarchy among attributes
–Differently scaled attributes (numeric, binary, and nominal)
–A vast majority of nominal attributes
–A large percentage of categorical attributes with very many categories and non-uniform frequency distributions
–A large percentage of missing values for some attributes/predictors
Data Mining Application (Decision Tree and TreeNet) to Survey Data Analysis
Fragment of Decision Tree for Binary First Response (Q9)
Binary First Response: RHC or OTHER
Providing high-quality products and services is a strong predictor of Binary First Response: the probability of First Response = RHC nearly doubles (an increase of about 91%), from 0.11 for the whole sample to 0.21 for interviewees who experienced good quality. RHC has a weak association with low quality of products and services.
Example of an If-Then scenario that a Decision Tree can answer: if all interviewees gave the highest score to the quality of RHC products and services, how would the probability of First Response = RHC change?
TreeNet: Intro
TreeNet (Stochastic Gradient Boosting) was invented in 1999 by Stanford University Professor Jerome Friedman. It is one of the most flexible and powerful data mining tools. Salford Systems, a California-based data mining software development company (http://www.salford-systems.com), implemented and commercialized this invention as the TreeNet product in 2003. It was the first stochastic gradient boosting tool in the data mining industry. Intensive research has shown that TreeNet models are among the most accurate of any known modeling techniques. A TreeNet model is a non-parametric, non-linear regression and can be described as a linear combination of a large number of small trees.
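To make the phrase "a linear combination of a large number of small trees" concrete, here is a toy sketch of stochastic gradient boosting for least-squares regression on a single numeric feature. It is not the TreeNet implementation: the one-split trees (stumps), the function names, the parameter defaults, and the step-shaped data are all assumptions chosen to keep the example short.

```python
import random

def stump_value(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single numeric feature."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], overall, overall)  # fallback when no split exists
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def boost(xs, ys, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Fit each stump to the current residuals on a random subsample of rows."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), k)  # the "stochastic" part
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    # The model is literally a constant plus a weighted sum of small trees.
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: a step function of one feature.
xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]
model = boost(xs, ys)
print(predict(model, 2), predict(model, 17), training_mse(model, xs, ys))
```

Each stump corrects a small part of the remaining error, and the learning rate shrinks its contribution, which is why many small trees are needed.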
Drivers of Modified First Response (Q9)
[Chart: Predictor importance for the probability (Modified First Response = RHC)]
The most important predictor of the First Response (Q9) is Q75a (age of interviewee). Q1 (Region) and Q78 (income) are examples of predictors with a modest impact on First Response (Q9). Q8 (Gender) is an example of a predictor that has no impact on First Response (Q9).
Misclassification Rate: TreeNet model for Modified First Response (Q9) prediction
[Tables: Cost Matrix; Prediction Accuracy (learning data: 60%)]
The cost of a correct classification equals 0, and the cost of an incorrect classification equals 1.
The Percent Error is smallest for Reliance Home Comfort (best accuracy): Pct Error = 0.00. The Percent Error is largest for Union Energy (worst accuracy): Pct Error = 27.59. On average, the percent error for Modified First Response across all 9 categories is 15.79%.
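The per-category percent error can be derived from a confusion matrix, as the following sketch shows. The three categories and all counts are made up for illustration and are not the study's results; the function names are hypothetical. With a cost of 0 for correct and 1 for incorrect classifications, the total cost equals the number of misclassified cases.

```python
def percent_error_by_class(confusion, labels):
    """confusion[i][j] = cases whose actual class is i and predicted class is j."""
    errors = {}
    for i, label in enumerate(labels):
        total = sum(confusion[i])
        wrong = total - confusion[i][i]
        errors[label] = 100.0 * wrong / total
    return errors

def total_cost(confusion):
    """Total misclassification cost under a 0/1 cost matrix."""
    return sum(count
               for i, row in enumerate(confusion)
               for j, count in enumerate(row)
               if i != j)

# Made-up counts, not the study's results.
labels = ["RHC", "Union Energy", "Other"]
confusion = [[20, 0, 0],
             [2, 6, 2],
             [1, 1, 18]]

errors = percent_error_by_class(confusion, labels)
cost = total_cost(confusion)
n_cases = sum(sum(row) for row in confusion)
overall_percent_error = 100.0 * cost / n_cases
print(errors, cost, overall_percent_error)
```

Note that an average error can be computed either over all cases, as here, or as an unweighted mean of the per-category errors; the slide does not say which was used.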
TreeNet model: Impact of "You mentioned that you are familiar with RHC" (Q15) on the Probability of Binary First Response (Q9) = RHC, controlling for all other predictors
[Chart annotations: the highest positive impact on the probability of First Response (Q9) = RHC; the highest negative impact on the probability of First Response (Q9) = RHC]
Using the TreeNet model, it is possible to answer diverse If-Then business questions. For example, if the share of Telemarketing responses increased by 10%, how would the probability of First Response = RHC change?
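One common way to answer such If-Then questions with any fitted model is to change one predictor for every respondent, leave all other predictors as observed, and compare the average predicted probability before and after. The sketch below assumes this approach; the slides do not describe TreeNet's own procedure. The scoring function `toy_model`, the feature names, and the rows are hypothetical stand-ins for a fitted model and real survey records.

```python
def average_prediction(model, rows):
    return sum(model(row) for row in rows) / len(rows)

def what_if(model, rows, feature, new_value):
    """Average prediction after setting `feature` to `new_value` in every row."""
    changed = [dict(row, **{feature: new_value}) for row in rows]
    return average_prediction(model, changed)

def toy_model(row):
    # Hypothetical probability that First Response = RHC; not a fitted model.
    return 0.05 + 0.10 * row["quality"] / 5 + 0.05 * row["familiar"]

# Made-up respondents: quality score on a 5-point scale, familiarity as 0/1.
rows = [
    {"quality": 1, "familiar": 0},
    {"quality": 3, "familiar": 1},
    {"quality": 5, "familiar": 1},
    {"quality": 3, "familiar": 0},
]

baseline = average_prediction(toy_model, rows)
scenario = what_if(toy_model, rows, "quality", 5)
print(baseline, scenario, scenario - baseline)
```

Here the scenario "everyone gives the highest quality score" raises the average predicted probability, and the difference from the baseline is the estimated effect of that change alone.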
TreeNet summary
The TreeNet algorithm has about 20 different options that the researcher can control. The default options did not produce a good model. Determining the best set of options (the optimal model) is time consuming and requires experience and expertise.
TreeNet is an appropriate tool for the analysis of complex survey data. It is well suited to:
–Prediction and scoring
–Estimation of the probability of an event of interest
–Identification of predictor importance and drivers
–If-Then scenario analysis
Conclusion
Typical survey data analysis questions are:
–Segmenting respondents
–Identifying the drivers of a question of interest
–Relationships between different survey questions
–Predictability of the answer to a question under consideration
–Diverse If-Then scenarios
–Combining primary and secondary data to answer unique business questions
The essence of our approach is to understand and analyze our clients' business problems and the corresponding data through the prism of dissimilar statistical and data mining models. The synergy of data mining and traditional statistics makes it possible to extract the maximum useful information from complex survey data. As a result, we are always able to produce the best possible models and results and to help our clients in the most effective and scientifically sound way.