Presentation on theme: "Why preprocessing? Learning method needs data type: numerical, nominal,.. Learning method cannot deal well enough with noisy / incomplete data Too many."— Presentation transcript:

1 Why preprocessing?
–Learning method needs a specific data type: numerical, nominal, …
–Learning method cannot deal well enough with noisy / incomplete data
–Too much data (memory, time): too many examples, attributes, or values
–Data violate assumptions of the method, e.g. correlated attributes

2 Bias in learning method –E.g. linearity

3 Preprocessing as part of DM/ML
–Ideally, learning methods should include transformations themselves
–In practice: preprocessing takes most of the time of the DM process
–Transformations × learning methods → a large search space
–Some preprocessing is useful for all learning methods, some is specific
–Main types: remove bad examples / features; discretisation

4 Attribute selection
–Aka "feature subset selection"
–Features that cannot contribute at all to prediction/classification cause problems for (some) learners
–Redundant attributes can also be harmful
–"Wrapper approach": evaluate feature subsets by learning with them
–"Filter approach": try to identify bad attributes without learning, e.g. via association with the target class and association between attributes
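The filter approach on this slide can be sketched as follows: rank each attribute by its association with the target (here Pearson correlation, one common choice) and keep only the ones that pass a threshold, without running any learner. All names and the threshold value are illustrative, not from the slides.

```python
# Filter-approach sketch: score attributes against the target
# without training a model, then keep the high-scoring ones.

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / ((vx * vy) ** 0.5) if vx and vy else 0.0

def filter_select(columns, target, threshold=0.3):
    """Keep attributes whose |correlation| with the target passes the threshold."""
    return {name: col for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold}
```

A wrapper approach would instead retrain the learner on each candidate subset, which is more faithful to the final task but far more expensive.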

5 Many combinations …
–The optimal attribute subset depends on the learner
–Redundant attributes: combine, do not remove
–E.g. "thermometer value" and "subjective temperature" → the average value is more reliable than either one alone!
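The averaging idea above can be shown in a few lines: the mean of two noisy readings of the same quantity has lower variance than either reading alone, so combining beats dropping. Attribute names are illustrative.

```python
# Combine two redundant measurements instead of discarding one:
# their average is a more reliable estimate of the true quantity.

def combine_redundant(thermometer, subjective):
    """Replace two redundant temperature attributes with their elementwise average."""
    return [(a + b) / 2 for a, b in zip(thermometer, subjective)]
```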

6 Discretisation
–Supervised / unsupervised
–Fixed-width "bins" / fixed number of "bins" / flexible
–Supervised ~ one-attribute learning with intervals
–So: information gain, MDL (!?); maybe chi-square for stopping
–Recursive splitting vs. one-pass splitting
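The two unsupervised variants on this slide (fixed-width bins vs. a fixed number of equally filled bins) can be sketched as below; function names and the bin count are illustrative.

```python
# Unsupervised discretisation sketches.

def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 over k equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0            # avoid division by zero on constant data
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds roughly len(values)/k values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per), k - 1)
    return bins
```

A supervised discretiser would instead choose cut points that maximise information gain with respect to the class, stopping via an MDL or chi-square criterion, as the slide notes.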

7 Discrete → numerical
–Each attribute–value combination as a separate binary attribute 1/0
–Or "scaling", mapping each value to a number, e.g.: red 10, yellow 7, red 9, green 5, 3, yellow 6
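The first option above, one binary 1/0 attribute per attribute-value combination, is what is now usually called one-hot encoding. A minimal sketch (names illustrative):

```python
# One-hot encoding sketch: each distinct nominal value becomes
# its own 1/0 column.

def one_hot(values):
    """Turn a nominal column into one binary column per distinct value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```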

8 More transformations
–Principal component analysis
–Find principal components (~ combinations of correlated attributes)
–Remove components with little variance
–Use components as attributes for learning
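The three PCA steps on this slide can be sketched with NumPy: centre the data, find the principal directions from the covariance matrix, and project onto the highest-variance ones. This is an illustrative sketch, not a full library implementation.

```python
# PCA sketch: components with little variance are dropped by
# keeping only the top n_components eigenvector directions.
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top-variance principal components."""
    Xc = X - X.mean(axis=0)                    # centre each attribute
    cov = np.cov(Xc, rowvar=False)             # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # highest-variance directions first
    return Xc @ top                            # components become the new attributes
```

On perfectly correlated attributes the second component carries no variance at all, which is exactly the redundancy PCA is meant to remove.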

9 Data cleansing
–Impossible values
–Outliers (from the distribution, median/mean)
–Outliers (from predictions)
–Risk: throwing away unexpected but correct data: anomalies
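Outlier detection "from the distribution" can be sketched with a robust rule: flag values far from the median, measured in median-absolute-deviation units (a common robust alternative to mean/standard deviation; the cutoff 3.5 is an illustrative convention, not from the slides). As the slide warns, anything this flags may still be a correct anomaly, so flagging should precede, not replace, human judgement.

```python
# Distribution-based outlier flagging using median and MAD,
# which are less distorted by the outliers themselves than the mean.

def flag_outliers(values, k=3.5):
    """Return True for values more than k robust deviations from the median."""
    s = sorted(values)
    n = len(s)
    median = (s[n // 2] + s[(n - 1) // 2]) / 2
    devs = sorted(abs(v - median) for v in values)
    mad = (devs[n // 2] + devs[(n - 1) // 2]) / 2   # median absolute deviation
    if mad == 0:
        return [False] * n
    return [abs(v - median) / mad > k for v in values]
```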

