Presentation on theme: "Why preprocessing? Learning method needs data type: numerical, nominal,.. Learning method cannot deal well enough with noisy / incomplete data Too many."— Presentation transcript:

1 Why preprocessing?
–Learning method needs a specific data type: numerical, nominal, …
–Learning method cannot deal well enough with noisy / incomplete data
–Too much data (memory, time): too many examples, attributes, or values
–Data violate assumptions of the method, e.g. correlated attributes

2 Bias in learning method –E.g. linearity

3 Preprocessing as part of DM/ML
–Ideally, learning methods should include transformations themselves
–In practice: preprocessing takes most of the time of the DM process
–Transformations × learning methods → a large search space
–Some preprocessing is useful for all learning methods, some is specific
–Main types: remove bad examples / features; discretisation

4 Attribute selection
–Aka "feature subset selection"
–Features that cannot contribute at all to prediction/classification cause problems for (some) learners
–Redundant attributes can also be harmful
–"Wrapper approach": evaluate feature subsets by learning with them
–"Filter approach": try to identify bad attributes without learning, e.g. via association with the target class and association between attributes
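The filter approach on this slide can be sketched as follows: rank each attribute by its association with the target (here Pearson correlation, one common choice) and keep only the ones that pass a threshold, without running any learner. All names and the threshold value are illustrative, not from the slides.

```python
# Filter-approach sketch: score attributes against the target
# without training a model, then keep the high-scoring ones.

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / ((vx * vy) ** 0.5) if vx and vy else 0.0

def filter_select(columns, target, threshold=0.3):
    """Keep attributes whose |correlation| with the target passes the threshold."""
    return {name: col for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold}
```

A wrapper approach would instead retrain the learner on each candidate subset, which is more faithful to the final task but far more expensive.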

5 Many combinations …
–The optimal attribute subset depends on the learner
–Redundant attributes: combine, do not remove
–E.g. "thermometer value" and "subjective temperature" → the average value is more reliable than either one alone!
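The averaging idea above can be shown in a few lines: the mean of two noisy readings of the same quantity has lower variance than either reading alone, so combining beats dropping. Attribute names are illustrative.

```python
# Combine two redundant measurements instead of discarding one:
# their average is a more reliable estimate of the true quantity.

def combine_redundant(thermometer, subjective):
    """Replace two redundant temperature attributes with their elementwise average."""
    return [(a + b) / 2 for a, b in zip(thermometer, subjective)]
```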

6 Discretisation
–Supervised / unsupervised
–Fixed-width "bins" / fixed number of "bins" / flexible
–Supervised ~ one-attribute learning with intervals
–So: information gain, MDL (!?); maybe chi-square for stopping
–Recursive splitting vs. one-pass splitting
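The two unsupervised variants on this slide (fixed-width bins vs. a fixed number of equally filled bins) can be sketched as below; function names and the bin count are illustrative.

```python
# Unsupervised discretisation sketches.

def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 over k equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0            # avoid division by zero on constant data
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds roughly len(values)/k values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per), k - 1)
    return bins
```

A supervised discretiser would instead choose cut points that maximise information gain with respect to the class, stopping via an MDL or chi-square criterion, as the slide notes.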

7 Discrete → numerical
–Each attribute–value combination as a separate binary attribute 1/0
–Or "scaling", mapping each value to a number, e.g.: red 10, yellow 7, red 9, green 5, 3, yellow 6
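The first option above, one binary 1/0 attribute per attribute-value combination, is what is now usually called one-hot encoding. A minimal sketch (names illustrative):

```python
# One-hot encoding sketch: each distinct nominal value becomes
# its own 1/0 column.

def one_hot(values):
    """Turn a nominal column into one binary column per distinct value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```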

8 More transformations
–Principal component analysis
–Find principal components (~ combinations of correlated attributes)
–Remove components with little variance
–Use components as attributes for learning
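The three PCA steps on this slide can be sketched with NumPy: centre the data, find the principal directions from the covariance matrix, and project onto the highest-variance ones. This is an illustrative sketch, not a full library implementation.

```python
# PCA sketch: components with little variance are dropped by
# keeping only the top n_components eigenvector directions.
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top-variance principal components."""
    Xc = X - X.mean(axis=0)                    # centre each attribute
    cov = np.cov(Xc, rowvar=False)             # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # highest-variance directions first
    return Xc @ top                            # components become the new attributes
```

On perfectly correlated attributes the second component carries no variance at all, which is exactly the redundancy PCA is meant to remove.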

9 Data cleansing
–Impossible values
–Outliers (from the distribution, median/mean)
–Outliers (from predictions)
–Risk: throwing away unexpected but correct data: anomalies
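Outlier detection "from the distribution" can be sketched with a robust rule: flag values far from the median, measured in median-absolute-deviation units (a common robust alternative to mean/standard deviation; the cutoff 3.5 is an illustrative convention, not from the slides). As the slide warns, anything this flags may still be a correct anomaly, so flagging should precede, not replace, human judgement.

```python
# Distribution-based outlier flagging using median and MAD,
# which are less distorted by the outliers themselves than the mean.

def flag_outliers(values, k=3.5):
    """Return True for values more than k robust deviations from the median."""
    s = sorted(values)
    n = len(s)
    median = (s[n // 2] + s[(n - 1) // 2]) / 2
    devs = sorted(abs(v - median) for v in values)
    mad = (devs[n // 2] + devs[(n - 1) // 2]) / 2   # median absolute deviation
    if mad == 0:
        return [False] * n
    return [abs(v - median) / mad > k for v in values]
```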

