WALMART RECRUITING – STORE SALES FORECASTING Chaoyi Liu, Yuqing Lu, Haoran Wu Rajendran, Goutham
The dataset we chose
First dataset:
·Store - the store number
·Date - the week
·Temperature - average temperature in the region
·Fuel Price - cost of fuel in the region
·MarkDown1-5 - data is only available after Nov 2011,
·CPI - the consumer price index
·Unemployment - the unemployment rate
·IsHoliday - whether the week is a special holiday week
This dataset provides parameters that may affect weekly sales, but it does not include the weekly sales themselves.
Second dataset:
·Store - the store number
·Dept - the department number
·Date - the week
·Weekly Sales - sales for the given department in the given store
·IsHoliday - whether the week is a special holiday week
This dataset provides sales data for 45 stores with up to 99 departments in more than 421,000 records, but it does not sum up each store's weekly sales.
Then we integrated the two datasets
Initially, we integrated these two massive tables into one table with 6,435 records that has everything we need:
Store  Date  Temperature  Fuel_Price  MarkDown1-5  CPI  Unemployment  IsHoliday  Weekly_Sales
We decided to divide the 6,435 records equally into 5 groups of 1,287 records each, splitting at the quintiles of weekly sales from smallest to largest:
Level    Mark as  Description
Level 1  D        More than $ 0.00
Level 2  C        More than $ 497,
Level 3  B        More than $ 748,
Level 4  A        More than $ 1,056,
Level 5  S        More than $ 1,414,343.53
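The integration and the five-way split can be sketched in plain Python. This is only an illustration under stated assumptions: the sample rows, the helper names `merge_tables` and `assign_levels`, and the choice to sum department sales per store and week are ours, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical sample data: 5 stores, 2 weeks, 2 departments per store.
dates = ["2010-02-05", "2010-02-12"]
features = [
    {"Store": s, "Date": d, "IsHoliday": d == "2010-02-12"}
    for s in range(1, 6) for d in dates
]
sales = [
    {"Store": s, "Dept": dept, "Date": d,
     "Weekly_Sales": 1000.0 * s + 100.0 * dept + 10.0 * w}
    for s in range(1, 6) for w, d in enumerate(dates) for dept in (1, 2)
]

def merge_tables(features, sales):
    """Sum department sales per (Store, Date) and attach the total to each feature row."""
    totals = defaultdict(float)
    for row in sales:
        totals[(row["Store"], row["Date"])] += row["Weekly_Sales"]
    merged = []
    for row in features:
        key = (row["Store"], row["Date"])
        if key in totals:  # keep only weeks that have sales data
            merged.append({**row, "Weekly_Sales": totals[key]})
    return merged

def assign_levels(records, marks=("D", "C", "B", "A", "S")):
    """Sort by Weekly_Sales and split into equal-sized groups, lowest sales first."""
    ordered = sorted(records, key=lambda r: r["Weekly_Sales"])
    n = len(ordered)
    return [
        {**row, "Level": marks[i * len(marks) // n]}
        for i, row in enumerate(ordered)
    ]

merged = merge_tables(features, sales)
labelled = assign_levels(merged)
for row in labelled:
    print(row["Store"], row["Date"], row["Weekly_Sales"], row["Level"])
```

With 6,435 records the same index arithmetic yields five groups of 1,287 records each, since 6,435 is divisible by 5.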
Neural Network Model
·It is suited to complicated prediction problems
·Visualization or understanding of the rules is not needed
·Accuracy is very important
Result
Learning Rate / Training Cycles = 0.03 / 2000
Accuracy = 70.61%
Confusion matrix: rows pred. D, pred. S, pred. C, pred. B, pred. A; columns true D, true S, true C, true B, true A, plus class precision.
               true D   true S   true C   true B   true A
class recall   78.32%   68.09%   76.92%   65.00%   65.67%
Accuracy reaches 70.61% when the learning rate is 0.03, and it increases as the number of training cycles increases.
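For readers unfamiliar with this table layout, the sketch below shows how accuracy, class precision, and class recall are derived from a confusion matrix whose rows are predictions and whose columns are true classes. The 3-class matrix is made up purely for illustration and is not the result reported here; the function name `confusion_metrics` is also our own.

```python
def confusion_metrics(matrix, labels):
    """matrix[i][j] = number of records predicted as labels[i] whose true class is labels[j]."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    accuracy = correct / total
    precision, recall = {}, {}
    for i, label in enumerate(labels):
        predicted = sum(matrix[i])               # row sum: everything predicted as this class
        actual = sum(row[i] for row in matrix)   # column sum: everything truly in this class
        precision[label] = matrix[i][i] / predicted if predicted else 0.0
        recall[label] = matrix[i][i] / actual if actual else 0.0
    return accuracy, precision, recall

# Hypothetical 3-class example (illustrative numbers only).
labels = ["D", "C", "B"]
matrix = [
    [8, 2, 0],  # pred. D
    [1, 6, 3],  # pred. C
    [1, 2, 7],  # pred. B
]
accuracy, precision, recall = confusion_metrics(matrix, labels)
print(accuracy, precision, recall)
```

Overall accuracy is the diagonal sum divided by the total count, which is why it can sit between the highest and lowest class recall values.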
Neural Network Weights
Hidden layer: Node 1 to Node 10, each with a weight for Store, Date, Temperature, Fuel_Price, the five MarkDown inputs, CPI, Unemployment, IsHoliday, and a Bias.
Output: Class 'S', Class 'A', Class 'B', Class 'C', Class 'D', each with a weight for the ten hidden nodes and a Threshold.
Naïve Bayes
Accuracy = 18.63%
Confusion matrix: rows pred. D, pred. S, pred. C, pred. B, pred. A; columns true D, true S, true C, true B, true A, plus class precision; the class recall row includes 0.00%.
Why does Naïve Bayes perform so poorly on this sample?
Naïve Bayes treats the variables Store, Date, …, IsHoliday as independent of each other, so:
P(Store, Date, Temperature, …, IsHoliday) = P(Store) × P(Date) × … × P(IsHoliday)
Because the columns Store, Date, …, IsHoliday contain so many values that do not repeat, the probability of each individual value is very small, so the product P(Store) × P(Date) × … × P(IsHoliday) is far lower than 1/6435. With probabilities this small, predicting the sales level from such a model is infeasible.
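The size of this product can be checked with a short calculation. The 45 stores and 6,435 records come from the slides; the 143 weeks per store is derived by assuming every store covers the same weeks, and the Temperature probability is a purely hypothetical value chosen for illustration.

```python
from fractions import Fraction

N_RECORDS = 6435                 # merged records, from the slides
N_STORES = 45                    # stores, from the slides
N_WEEKS = N_RECORDS // N_STORES  # 143, assuming every store covers the same weeks

# If each store and each week is equally frequent:
p_store = Fraction(1, N_STORES)
p_date = Fraction(1, N_WEEKS)
joint_two = p_store * p_date     # already equal to 1/6435

# Hypothetical: a near-unique continuous attribute such as Temperature.
p_temperature = Fraction(1, 3000)  # illustrative value, not from the data
joint_three = joint_two * p_temperature

print(joint_two, joint_three)
```

Two attributes alone already bring the product down to 1/6435, so every further attribute pushes it well below that bound.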
K-NN
When K = 1, Accuracy = 26%
Confusion matrix: rows pred. D, pred. S, pred. C, pred. B, pred. A; columns true D, true S, true C, true B, true A, plus class precision.
               true D   true S   true C   true B   true A
class recall   32.11%   37.84%   15.58%   15.00%   25.17%
When K = 10, Accuracy = 29.03%
Confusion matrix: rows pred. D, pred. S, pred. C, pred. B, pred. A; columns true D, true S, true C, true B, true A, plus class precision.
               true D   true S   true C   true B   true A
class recall   35.78%   50.98%   15.50%   28.92%   6.00%
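To make the role of K concrete, here is a minimal k-nearest-neighbours sketch using Euclidean distance and a majority vote. The two-dimensional points and the function name `knn_predict` are hypothetical; the slides do not say which distance measure or tool settings were used.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training points: (features, sales level).
train = [
    ((1.0, 1.0), "D"),
    ((1.2, 0.8), "D"),
    ((0.9, 1.1), "D"),
    ((3.0, 3.0), "S"),
    ((3.2, 2.9), "S"),
    ((1.6, 1.6), "S"),  # an isolated "S" point close to the "D" cluster
]
query = (1.5, 1.5)

print(knn_predict(train, query, k=1))  # decided by the single closest point
print(knn_predict(train, query, k=3))  # decided by a vote among three points
```

In this toy case K = 1 follows the single isolated neighbour, while K = 3 lets the surrounding cluster outvote it, which is one reason a larger K can change accuracy. Attributes on very different scales would normally need to be normalized before computing distances.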
Conclusion
MarkDown1 to 5 have the highest weight, 16, which means they have an enormous impact on sales. Promotions increase weekly sales remarkably.
Fuel price and temperature also have a positive impact: the higher the fuel price, the higher the sales.
CPI and the unemployment rate have a heavy negative impact on sales prospects: the higher the CPI and the unemployment rate, the lower the weekly sales.
Holidays affect weekly sales only slightly. I think customers do not care whether it is a holiday or not; the only reason they buy items is promotion.