Published by Abril Hache. Modified over 9 years ago.
2
Consumer Research Organization. Commissions surveys and publishes reports and ratings for automobiles. Maintains online discussion forums where consumers can post questions and experiences related to cars and driving.
3
Focus on cars in the 10–25 year age range. The study examines the mileage of cars of that age. The aim is to quantify the impact of a set of variables on mileage, and thereby help a potential customer decide when looking to buy a car within that age range.
4
Multiple Linear Regression: Least Squares Approach. Enables quantification of the strength of the relationship between the response and the predictor variables. Allows prediction of response values based on the fitted relationships between the response and the predictor variables.
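The least squares approach named on this slide can be sketched as follows. This is a minimal illustration in Python rather than in R (the tool used for the project), and it runs on made-up toy data, not the car dataset. The function names `solve_linear_system`, `fit_least_squares` and `predict` are hypothetical. In R, the same kind of fit is normally obtained with `lm()`.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation applies to both sides.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(predictors, response):
    """Return [intercept, b1, b2, ...] minimising the sum of squared residuals."""
    # Prepend a column of ones so the first coefficient is the intercept.
    x = [[1.0] + list(row) for row in predictors]
    k = len(x[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Predicted response for one row of predictor values."""
    return coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], row))


# Made-up toy data (NOT the car dataset): y = 1 + 2*x1 + 3*x2 exactly.
toy_predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
toy_response = [1 + 2 * a + 3 * b for a, b in toy_predictors]
coefficients = fit_least_squares(toy_predictors, toy_response)
```

Each fitted coefficient quantifies how much the response changes per unit change in one predictor with the others held fixed, and `predict` applies those coefficients to new predictor values.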
5
R Version 3.0.1, a free download from http://www.r-project.org/. Provides built-in ‘dummy’ variable creation for factor variables. The generated model contains fitted values and residuals, which can be easily accessed and isolated for analysis.
6
392 observations.*
The response variable is MPG (miles per gallon).
The predictor variables are:
- cylinders: 3, 4, 5, 6 or 8 cylinders.
- displacement (cu. inches): the total air displaced by the pistons in all of an engine’s cylinders. It is a measure of engine size and power.
- horsepower (hp).
- weight (lbs): vehicle weight.
- acceleration (seconds): time to accelerate from 0 to 60 mph.
- origin: 1 = American, 2 = European, 3 = Japanese.
- age (years): years elapsed since the year of manufacture.
* Revised from the StatLib library at Carnegie Mellon University.
7
Dataset available in the form of a .csv file. “carID” is a unique identifier for each row and carries no other meaning. All the variables are stored in numeric format. The variables “cylinders” and “origin” need to be converted to factor variables, because their numeric codes label categories rather than measure quantities.
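Converting a coded column such as “origin” into a factor means the model sees one 0/1 ‘dummy’ column per category instead of a single number. The sketch below illustrates the idea in Python rather than R. It assumes treatment coding with the lowest code as the baseline level; the function name `dummy_code` and the sample values are hypothetical.

```python
def dummy_code(name, values):
    """Treatment-code a categorical column.

    Returns the baseline level and a dict with one 0/1 column per
    non-baseline level. The lowest level is used as the baseline
    (an assumption made for this sketch).
    """
    levels = sorted(set(values))
    baseline, others = levels[0], levels[1:]
    columns = {
        f"{name}{level}": [1 if v == level else 0 for v in values]
        for level in others
    }
    return baseline, columns


# Made-up sample using the slide's codes: 1 American, 2 European, 3 Japanese.
origin = [1, 3, 2, 1, 3]
baseline, origin_dummies = dummy_code("origin", origin)
```

With this coding, the coefficient on `origin3` would be read as the expected difference in MPG between a Japanese car and an American car (the baseline), all other predictors being equal.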
8
A “base” regression model was built to predict “mpg” using all the variables. Residual plots were created and visually inspected to check the zero-mean, constant-variance and independence assumptions. The normality assumption was checked with a histogram and a Q-Q plot of the residuals. An R² value of 0.8462 was obtained. An improved model (i.e., one containing one fewer predictor variable) was identified by statistical selection.
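The R² figure and the zero-mean check on the residuals both come directly from the observed and fitted values. A minimal sketch in Python (not the R code used for the project), with made-up numbers and hypothetical function names:

```python
def residuals(observed, fitted):
    """Residual = observed value minus fitted value, per observation."""
    return [o - f for o, f in zip(observed, fitted)]


def r_squared(observed, fitted):
    """R^2 = 1 - SSE / SST: the share of response variance the model explains."""
    mean_obs = sum(observed) / len(observed)
    sse = sum(e * e for e in residuals(observed, fitted))  # unexplained variation
    sst = sum((o - mean_obs) ** 2 for o in observed)       # total variation
    return 1.0 - sse / sst


# Made-up values for illustration only (NOT results from the car dataset).
observed_mpg = [10.0, 20.0, 30.0, 40.0]
fitted_mpg = [12.0, 18.0, 33.0, 37.0]

resid = residuals(observed_mpg, fitted_mpg)
mean_residual = sum(resid) / len(resid)
r2 = r_squared(observed_mpg, fitted_mpg)
```

A mean residual near zero supports the zero-mean assumption. Constant variance, independence and normality still need the plots described on this slide.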
9
Client feedback: “Can this model be applied to a dataset with different values for the same variables? Is it re-usable?” (This model can be re-used for prediction, provided the values of the variables fall within the ranges seen in the dataset used to generate the model. The assumptions of linear regression may not hold once the data is outside these ranges, and hence linear regression may not be applicable.)
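The re-use condition in this answer can be checked mechanically before predicting. The sketch below, in Python rather than R, records the minimum and maximum of each predictor in the training data and flags any new value outside them. The function names and all numbers are hypothetical.

```python
def training_ranges(rows):
    """Map each variable name to the (min, max) seen in the training rows."""
    names = rows[0].keys()
    return {n: (min(r[n] for r in rows), max(r[n] for r in rows)) for n in names}


def out_of_range(new_row, ranges):
    """Return the variables whose new value lies outside the training range."""
    return [n for n, (lo, hi) in ranges.items() if not lo <= new_row[n] <= hi]


# Made-up training rows (NOT the car dataset).
training = [
    {"weight": 2000, "horsepower": 70},
    {"weight": 3500, "horsepower": 150},
    {"weight": 2800, "horsepower": 95},
]
ranges = training_ranges(training)

inside = out_of_range({"weight": 3000, "horsepower": 100}, ranges)
outside = out_of_range({"weight": 5200, "horsepower": 120}, ranges)
```

An empty result means every predictor lies within the training range; a non-empty result names the variables for which the prediction would be an extrapolation.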
10
Client feedback: “Is there a need to download/install special R packages to carry out the necessary charting and analysis?” (No packages need to be downloaded or installed. The basic R functionality is more than enough to carry out regression analysis, work with the residuals and fitted values, and generate the necessary visualizations.)
11
Ideas for further analysis: Include car name (e.g. “Chevrolet Malibu”) as a variable in the analysis. Generate a side-by-side comparison of different statistical selection methods for improving the model.
12
Data gathered from nationwide surveys over a period of 7 months. Analysis and review carried out over an 8-week period.
13
An analyst and a ‘domain’ expert were assigned to this project full time. The project involved a combined effort of about 95–110 man-hours. The cost can be estimated from these details.