WOW World of Walkover-weight “My God, it’s full of cows!” (David Bowman, 2001)

Can walkover-weight suggest a cow needs attention?

Join with breeding information …

Position at the outset …  Obstacle: No health information!!!  Suggested: Milking order (i.e. where a cow is in the herd/line-up) is hierarchical and affected by health issues  Proposed goal: to predict a drop in milking order using WOW and other facts

Assumptions … deck of cards  Same cows come in for milking each time  Cows are well-behaved (e.g. arrive in a nice queue)  Data is in good shape (e.g. one reading per cow per milking)

Data problems  Multiple entries for cows (e.g. four entries for one cow in QBH2005)  Delete duplicate weights (SQL problem?)  Cow skipped and recycled back into order  Use average if more than one value
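
The averaging rule above can be sketched in plain Python (the project itself used SQL; the tuple layout here is hypothetical):

```python
from collections import defaultdict

def dedupe_weights(readings):
    """Collapse duplicate (cow, milking) readings into one averaged weight.

    readings: iterable of (cow_id, milking_id, weight) tuples, where a cow
    skipped and recycled back into the queue may appear more than once.
    Returns a dict mapping (cow_id, milking_id) -> mean weight.
    """
    groups = defaultdict(list)
    for cow_id, milking_id, weight in readings:
        groups[(cow_id, milking_id)].append(weight)
    return {key: sum(ws) / len(ws) for key, ws in groups.items()}

rows = [("c1", "am-01", 480.0), ("c1", "am-01", 500.0), ("c2", "am-01", 610.0)]
# dedupe_weights(rows) -> {("c1", "am-01"): 490.0, ("c2", "am-01"): 610.0}
```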

About a quarter of the data are zeroes  [table: instances, zero weights and λ weights per herd-year, for BBYG, JJVX2007 and QBH2005–QBH2008; e.g. QBH2008: 48,534 instances, 10,535 zero weights, 224 λ weights]

“zero” problems  Differentiate between a missing cow, a missing weight and a “zero” weight  Ignore missing cows  Cow skipped and recycled back into order  Time-based interpolation  Can be problematic if cow has been missing for a while  Add flag to indicate weight was “guessed”
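
The time-based interpolation with a "guessed" flag could look like the following stdlib-only sketch (an assumption, not the project's actual code; zeros stand in for missing weights):

```python
def fill_zero_weights(weights):
    """Linearly interpolate zero weights in one cow's weight series.

    weights: list of floats where 0.0 marks a missing reading.
    Returns a list of (weight, guessed) pairs; guessed=True flags values
    that were interpolated rather than measured.
    """
    known = [i for i, w in enumerate(weights) if w > 0]
    out = []
    for i, w in enumerate(weights):
        if w > 0:
            out.append((w, False))
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:
            out.append((0.0, True))  # edge gap: nothing to interpolate from
        else:
            frac = (i - left) / (right - left)
            est = weights[left] + frac * (weights[right] - weights[left])
            out.append((est, True))
    return out

# fill_zero_weights([500.0, 0.0, 510.0]) -> [(500.0, False), (505.0, True), (510.0, False)]
```

A cow missing for a long stretch produces a long straight-line segment of guessed values, which is exactly the "problematic if missing for a while" caveat above.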

Other issues in data preparation  Change milking date to milk index  Change birthdate to age in months  Change parturition date to days since last calved  Additional derivatives  milking index – cow’s position in milk order  ∆-index – change in index for a cow over various time periods (1, 3 and 7 days)  mu-weight – average weight over varying-length periods (3, 7, 14, 21 and 28 milkings)  ∆-mu-weight – change in mu-weight for a cow (1, 3 and 7 days)
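
The mu-weight and ∆ derivatives above are trailing means and lagged differences; a minimal Python sketch (windows here are in milkings, an assumption consistent with the slide):

```python
from statistics import mean

def mu_weight(weights, window):
    """Trailing mean weight over the last `window` milkings (mu-weight)."""
    return [mean(weights[max(0, i - window + 1):i + 1]) for i in range(len(weights))]

def delta(series, lag):
    """Change over `lag` milkings (used for ∆-index and ∆-mu-weight).

    Returns None while there is not yet `lag` milkings of history.
    """
    return [None if i < lag else series[i] - series[i - lag] for i in range(len(series))]

w = [500.0, 502.0, 504.0, 510.0]
mu3 = mu_weight(w, 3)  # third entry is mean of the first three readings: 502.0
d1 = delta(w, 1)       # [None, 2.0, 2.0, 6.0]
```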

Does [change in] milk order correlate to WOW?

Correlation coefficients QBH2006 (dense)  WOW to index == 0.12  WOW to 14-day mu-weight == 0.93  Index to 10-day mu-weight == 0.14  3-day ∆-order to ∆-weight == 0.045
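
The coefficients above are ordinary Pearson correlations; for reference, a stdlib-only sketch of the computation:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# pearson([1, 2, 3], [2, 4, 6]) -> 1.0 (perfect positive correlation)
```

Values near 0, like the 0.045 for 3-day ∆-order vs. ∆-weight, indicate essentially no linear relationship.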

∆∆ 3-day ∆-order and 3-day ∆-weight

Predict change in milking order  Use M5P to predict how the milking order will change for a cow at the next milking  Approx. 205,000 QBH2006 samples (with fewer than 5/25 missing attributes)  2/3 training 1/3 testing
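
M5P is Weka's model-tree learner; the 2/3 training, 1/3 testing split itself is straightforward. A hedged Python sketch of just the split step (seeded shuffling is an assumption, not necessarily what was done):

```python
import random

def split_train_test(samples, train_frac=2 / 3, seed=0):
    """Shuffle samples and split into training and testing portions."""
    rng = random.Random(seed)
    s = list(samples)
    rng.shuffle(s)
    cut = int(len(s) * train_frac)
    return s[:cut], s[cut:]

train, test = split_train_test(range(9))
# len(train) == 6, len(test) == 3
```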

Re-running took too long, but you’ve all seen it before: accuracy was 51.89% (discrimination 0.527) and the model tree was hugely ugly (65 nodes, 33 leaves). Also tried predicting a cow’s index as a decile and as a ratio to herd size.

Cow’s position (index) as ratio to herdsize

Cow index vs. herd size

Where to? …  Data must still be scrubbed so that milking order makes sense (if milking order is going to be relevant)  Perhaps cow order needs to be described in completely different terms (e.g. cow buddies)  Easy visualization of herds/cows/breeds/dates/trends is needed … this segued into another area of the project.

Visualization tools (alpha and beta)

In the meantime … health data is obtained …

Can WOW predict onset of illness?  Combine original attributes and derivatives with health judgments  Cows with unknown health are considered healthy  Need equal number of positive and negative instances

Health data becomes available  [table: quantity of health records, and quantity > 50, per herd-year for the BBYG and QBH herds]

Not so much health data  1613 recorded instances of health  913 different cows with health info  2540 cows with milking info  788 milked cows with health data  7 broad categories of illness:  Calving disorder  Metabolic disorder  Udder disorder (only one with >50 in herd)  Reproductive disorder  Lameness  Infectious diseases  Other ailments

Data sparseness QBH2006  75 instances out of 324,291 have health events  63 udder disorder  10 metabolic disorder  2 lameness  Only ~0.02% positives → will never be isolated → must subsample negatives  Random selection of 75 negatives → data sparseness → over-fitting likely

Data sparseness QBH2006  36 cows have illness at some time, so just learn those?  11,966 records for those cows, 76 of which have illness (still <1% positive)  Random selection of 1% as negatives (about 120)

Refinements to approach QBH2006  Restrict target objective to UDDER DISORDER  Randomly select equal number of negatives from cows who have a health problem at some point  Goal: differentiate between a healthy and an unhealthy state

Detecting mastitis amidst random normal cows QBH2006  Restrict learning objective to UDDER DISORDER  Randomly select equal number of negatives from all cows that have been milked (63+,63-)
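
The balanced 63+/63− selection is a simple downsampling of the negative class; a Python sketch under the assumption that instances are already split into positive and negative lists:

```python
import random

def balanced_sample(positives, negatives, seed=0):
    """Downsample negatives to match the positive count, giving a 50/50 set."""
    rng = random.Random(seed)
    return positives + rng.sample(negatives, len(positives))

pos = [("cow-p", i) for i in range(3)]
neg = [("cow-n", i) for i in range(10)]
# balanced_sample(pos, neg) -> 6 instances: all 3 positives plus 3 random negatives
```

Equal class sizes keep the learner from trivially predicting "healthy" everywhere, at the cost of distorting the true base rate.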

When is a cow sick?  So far, attempted to predict health label at point of milking, but..  … when was the health label attached? before, during or after the current milking?  Goal: predict whether cow needs attention at the next milking (i.e. time series)

=== Summary ===
Correctly Classified Instances                          %
Incorrectly Classified Instances                        %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error                                 %
Root relative squared error                             %
Total Number of Instances                128

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                                                                UDDER DISORDER
                                                                NONE

=== Confusion Matrix ===
   a   b   <-- classified as
           |   a = UDDER DISORDER
   7  58   |   b = NONE
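
The per-class figures in a Weka summary like this all derive from the confusion matrix; a stdlib sketch of the arithmetic (the counts below are hypothetical, since parts of the slide's output did not survive):

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from a 2x2 confusion matrix.

    tp/fn are the positive-class row (e.g. UDDER DISORDER),
    fp/tn the negative-class row (e.g. NONE).
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics(40, 10, 5, 45)  # hypothetical counts
# acc == 0.85, rec == 0.8
```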

Agenda  Replace quantified attributes with simpler (e.g. boolean, nominal) ones  Characterise exceptions  Below average weight for cow/herd/breed/age  Dropped decile/>50 in order  Broad statistical measures  How many std.devs. from mean  z-score (probability of variation)  Choose negative instances more carefully (select fewer interpolates)  Spend more time with people who know cows
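
The "how many std.devs. from mean" measure on the agenda is an ordinary z-score; a minimal sketch for one value against a reference sample (herd, breed or the cow's own history):

```python
from statistics import mean, stdev

def z_score(value, sample):
    """How many (sample) standard deviations `value` lies from the sample mean."""
    return (value - mean(sample)) / stdev(sample)

# z_score(12, [10, 10, 10, 14]) -> 0.5  (mean 11, sample stdev 2)
```

Thresholding |z| then gives the kind of simple boolean exception attribute the agenda proposes.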