Machine Learning for Language Technology 2015 Introduction to Weka: Arff format and Preprocessing.

1 Machine Learning for Language Technology 2015 http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm Introduction to Weka: Arff format and Preprocessing Practical Machine Learning for Language Technology Marina Santini santinim@stp.lingfil.uu.se Department of Linguistics and Philology Uppsala University, Uppsala, Sweden Autumn 2015 ML4LT 2015 - Lecture 2: LAB SESSION1

2 Acknowledgements Many thanks to Weka slides… Martin D. Sykora,

3 Outline: Aim of the lab sessions; Requirements of the lab sessions; Structure of the lab assignments; The Weka Package; ARFF format; Preprocessing – Feature Selection

4 Aim of the lab sessions The aim of the lab sessions is manifold: – to practise with a number of machine learning methods – to apply machine-learning methods to real-world problems in LT – to learn how to use a state-of-the-art machine-learning workbench.

5 Requirements of the lab sessions Each lab session includes a number of lab assignments to be completed. Completing the lab assignments is required to pass the course. Physical attendance at the lab sessions is required to pass the course. Out of 12 lectures and corresponding lab sessions, 9? lab assignments must be correctly completed to pass the course.

6 Structure of the lab assignments Lab assignments should be completed in class. A lab assignment includes a number of tasks. Tasks are divided into G tasks and VG tasks. In order to pass a lab assignment, the G tasks must be completed correctly and a short report must be sent to the teacher by the due date.

7 Weka 1 Weka stands for Waikato Environment for Knowledge Analysis. It is a state-of-the-art machine learning workbench normally used to derive useful knowledge from datasets that are far too large to be analysed by hand.

8 Weka 2 Weka is a general-purpose workbench that is used for data and text mining in many different domains (bioinformatics, medicine, text analytics, etc.). It contains many machine learning methods (both supervised and unsupervised), preprocessing tools and statistical tests to evaluate the performance of the different models.

9 ??? When you want to apply ML to a classification problem: – Either you write your own implementation of a model in a programming language – Or you use an off-the-shelf software package that frees you from the programming task.

10 ??? Some learning models are easy to program: students in the previous year provided their own implementations of the Perceptron in Java. You could do this in Python this year… You can also take Weka's open-source code and modify it (if you are not happy with it) to achieve your specific purposes.
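As a hedged illustration of how small such an implementation can be, here is a minimal Python sketch of the perceptron learning rule. It is not the students' Java code and not Weka's implementation; the function names (train_perceptron, predict), the toy AND dataset, and the default learning rate and epoch count are all assumptions made for this example.

```python
# Minimal perceptron sketch (illustrative; names and data are hypothetical).
# Labels are +1 / -1; features are lists or tuples of numbers.

def predict(weights, bias, features):
    """Return +1 if the weighted sum is positive, otherwise -1."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else -1


def train_perceptron(examples, epochs=20, learning_rate=1.0):
    """Train on (features, label) pairs with the classic mistake-driven update."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            if predict(weights, bias, features) != label:
                # Move the decision boundary towards the misclassified example.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


# Toy, linearly separable data: logical AND (assumed for illustration only).
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]

weights, bias = train_perceptron(AND_DATA)
for features, label in AND_DATA:
    print(features, "gold:", label, "predicted:", predict(weights, bias, features))
```

The sketch only converges when the training data is linearly separable; otherwise it simply stops after the given number of epochs.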

11 Weka includes: Regression, Classification, Clustering, Association Rules, Attribute Selection, Visualization

12 The ARFF format The standard format of the datasets to be processed by Weka is the ARFF format. See section 2.4.
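To make the layout of an ARFF file concrete, the following Python sketch builds a very small one as a string: a relation name, one @attribute line per feature, then the @data rows. The three attributes and three rows are illustrative assumptions loosely modelled on the familiar weather example, not the dataset shown in the lecture; the helper name to_arff is hypothetical.

```python
# Sketch: assemble a tiny ARFF document with nominal attributes only.
# Attribute names, values and rows below are illustrative assumptions.

RELATION = "weather"
ATTRIBUTES = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),  # the class attribute is conventionally last
]
ROWS = [
    ("sunny", "FALSE", "no"),
    ("overcast", "FALSE", "yes"),
    ("rainy", "TRUE", "no"),
]


def to_arff(relation, attributes, rows):
    """Return ARFF text: header (@relation, @attribute) followed by @data."""
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        # Nominal attributes list their allowed values in curly braces.
        lines.append(f"@attribute {name} {{{', '.join(values)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


arff_text = to_arff(RELATION, ATTRIBUTES, ROWS)
print(arff_text)
```

Numeric features would be declared as "@attribute name numeric" instead of listing values in braces.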

13 The Weather Table

14 Feature representation You must decide on the best way of representing the problem you want to address! Different features give different results. There is no a priori correct/incorrect answer to "which are the best features?". Feature selection is based on your theoretical knowledge about the problem, your theoretical assumptions and empirical trials with different models/algorithms.

15 How to get the ARFF format? (p. 407) Either you use an already prepared ARFF file that somebody else has made available, or you create one yourself (feature manipulation and extraction): – Decide the best way to represent your problem through its features – Extract the features from a corpus – Organize the features in a spreadsheet (e.g. CSV, Excel) – Convert it into ARFF – Or…
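The "organize in a spreadsheet, then convert" steps above can be sketched in Python using only the standard library. This is one possible approach under stated assumptions, not the procedure prescribed in the lab: the function name csv_to_arff, the column names (num_tokens, avg_word_len, genre) and the sample values are hypothetical. A column is declared numeric when every value parses as a number, and nominal otherwise.

```python
import csv
import io


def is_number(value):
    """True if the string can be read as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = column names) into ARFF text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines

    lines = [f"@relation {relation}", ""]
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        if all(is_number(value) for value in column):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the distinct values seen in the data.
            values = sorted(set(column))
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical features extracted from three documents of a corpus.
SAMPLE_CSV = (
    "num_tokens,avg_word_len,genre\n"
    "120,4.2,news\n"
    "85,3.9,blog\n"
    "240,5.1,news\n"
)

print(csv_to_arff(SAMPLE_CSV, "genres"))
```

The sketch assumes values contain no spaces, commas or quotes; such values would need quoting in the ARFF output.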

16 Get the Lab Assignment

17 Summary and Conclusions

