Statistical Analysis of Transaction Dataset
  Data Visualization Homework 2
  Hongli Li
Dataset Introduction
  Generated by the IBM Quest Synthetic Data Generation Code
  It is a transaction dataset
  It is intended for mining association rules
  Generation parameters:
    Number of transactions = 1000
    Average transaction length = 10 (default)
    Number of items = 30
Transaction Dataset
Metadata
  No missing values
  Actual number of transactions = 980
  Actual average transaction length = 9.24
  Actual number of items = 30
  The most frequent item is Item 12 (64%)
  The second most frequent item is Item 9 (62%)
  Other information
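The metadata figures above (transaction count, average length, item count, item frequency) can be derived directly from the raw transactions. The following is a minimal sketch, not the code used for the homework: the `transactions` dictionary is a hypothetical toy example rather than the real dataset, and the `summarize` helper is an illustrative name.

```python
from collections import Counter

# Hypothetical toy data (TID -> set of item IDs); NOT the real dataset.
transactions = {
    1: {1, 2, 3},
    2: {2, 3},
    3: {2, 4},
    4: {1, 2, 3, 4},
}

def summarize(transactions):
    """Return basic metadata for a dict mapping TID -> set of items."""
    n_tx = len(transactions)
    avg_len = sum(len(items) for items in transactions.values()) / n_tx
    # Count in how many transactions each item appears.
    counts = Counter(item for items in transactions.values() for item in items)
    # Frequency of an item = fraction of transactions that contain it.
    frequency = {item: c / n_tx for item, c in counts.items()}
    # Sort by descending frequency, breaking ties by item ID.
    ranked = sorted(frequency.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "n_transactions": n_tx,
        "avg_length": avg_len,
        "n_items": len(counts),
        "ranked_frequency": ranked,
    }

summary = summarize(transactions)
print(summary["n_transactions"], summary["avg_length"], summary["n_items"])
print(summary["ranked_frequency"][:2])  # two most frequent items
```

On the real data, the first two entries of `ranked_frequency` would correspond to the "most frequent" and "second most frequent" items reported on the slide.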
Pearson Correlation – Item × Item
  A measure of the degree of linear relationship between two variables
  Pearson correlation matrix of Item × Item
  The two most correlated items are Item 24 and Item 1 (0.138)
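One way to obtain an Item × Item correlation is to represent each item as a 0/1 vector over the transactions (1 if the transaction contains the item) and correlate every pair of vectors. The sketch below assumes that binary encoding, which the slides do not state explicitly; the toy `transactions` and the `pearson` helper are illustrative and not taken from the original analysis.

```python
from itertools import combinations
from math import isnan, sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / sqrt(vx * vy)

# Hypothetical toy data (TID -> set of item IDs); NOT the real dataset.
transactions = {1: {1, 2}, 2: {1, 2, 3}, 3: {3}, 4: {1, 3}}

tids = sorted(transactions)
items = sorted(set().union(*transactions.values()))

# One 0/1 indicator vector per item, indexed by transaction.
indicator = {
    item: [1 if item in transactions[t] else 0 for t in tids]
    for item in items
}

# Correlation for every unordered pair of items, skipping undefined values.
item_corr = {}
for a, b in combinations(items, 2):
    r = pearson(indicator[a], indicator[b])
    if not isnan(r):
        item_corr[(a, b)] = r

best_pair = max(item_corr, key=item_corr.get)
print(best_pair, round(item_corr[best_pair], 3))
```

A maximum as low as 0.138 on the real data suggests that no pair of items co-occurs much more often than their individual frequencies would predict, which is why pairwise correlation alone reveals little here.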
Pearson Correlation – TID × TID
  Pivot the dataset to get the Item × TID matrix
  Pearson correlation matrix of TID × TID
  The most correlated transactions are TID 9 and TID 857, with a correlation coefficient of 1
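The pivot step can be sketched by turning each transaction into a 0/1 vector over the items and then correlating pairs of transactions. This assumes a binary Item × TID matrix, which the slides do not confirm; under that assumption, two transactions can only reach a coefficient of exactly 1 if they contain exactly the same items. The toy `transactions` and the `pearson` helper below are illustrative only.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / sqrt(vx * vy)

# Hypothetical toy data (TID -> set of item IDs); NOT the real dataset.
transactions = {1: {1, 2}, 2: {2, 3, 4}, 3: {1, 2}, 4: {3, 4}}

items = sorted(set().union(*transactions.values()))

# Pivot: one 0/1 vector per transaction, indexed by item.
tid_vectors = {
    tid: [1 if item in contents else 0 for item in items]
    for tid, contents in transactions.items()
}

# Correlation for every unordered pair of transactions.
tid_corr = {
    (a, b): pearson(tid_vectors[a], tid_vectors[b])
    for a, b in combinations(sorted(transactions), 2)
}

# Pairs with a coefficient of 1 are transactions with identical item sets.
perfect = [pair for pair, r in tid_corr.items() if abs(r - 1.0) < 1e-12]
print(perfect)
```

With about 980 transactions this produces roughly 480,000 pairs, so for larger datasets grouping transactions by their item set would be a cheaper way to find exact duplicates.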
Conclusion
  Finding patterns with statistical tools alone is hard!
  Mining algorithms are needed
  Visualization could help