Text Classification Seminar Social Media Mining University UC3M


1 Text Classification Seminar Social Media Mining University UC3M
Date: May 2017
Lecturer: Carlos Castillo
Sources: CS124 slides by Dan Jurafsky
Slides by Muhammad Atif Qureshi & Arjuman Younus – 2017

2 Facebook study (comments and timeline posts)
Burke, Moira, Lada A. Adamic, and Karyn Marciniak. "Families on Facebook." In ICWSM. Featured in a blog post by M. Burke.

3 Example applications
“Federalist Papers” in USA
Mosteller, Frederick, and David L. Wallace. "Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers." Journal of the American Statistical Association 58, no. 302 (1963).
Gmail smart folders

4 Per-document frequency of use of the word “you” in fiction documents
[Figure: two series, male author and female author]
“even in formal writing, female writing exhibits greater usage of features identified by previous researchers as ‘involved’ while male writing exhibits greater usage of features which have been identified as ‘informational’.”
Argamon, S., Koppel, M., Fine, J. and Shimoni, A.R., Gender, genre, and writing style in formal written texts. TEXT 23(3), pp

5 Positive or negative review?
Given a text, determine if the author is praising or complaining about a monument / landmark

6 Academic articles
Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology

7 Text classification problems
Generic documents
  → Topics, Keywords, …
  → Author age, Author gender, …
  → Language
Messages
  → Folder(s), Priority, Spam?, …
Usual approach: supervised learning methods

8 Learning on text
The most obvious mapping is:
Each document is an input element
Each word is a possible feature
Huge dimensionality (on the order of hundreds of thousands of words), so sparse representations are needed
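
The mapping above can be sketched as follows. This is a minimal illustration, not the method used in the lecture: the whitespace tokenizer, the two example documents, and the helper names (`tokenize`, `build_vocabulary`, `to_sparse`) are all assumptions made for the example. Each document becomes a dictionary that stores only the words it actually contains, which is what keeps the representation sparse.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def build_vocabulary(documents):
    """Map each distinct word to a column index (one feature per word)."""
    vocabulary = {}
    for text in documents:
        for word in tokenize(text):
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary

def to_sparse(text, vocabulary):
    """Return {column index: count}, storing only words present in the text."""
    counts = Counter(tokenize(text))
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}

# Invented example documents.
documents = [
    "the tower is beautiful",
    "the queue at the tower was long",
]
vocabulary = build_vocabulary(documents)
vectors = [to_sparse(d, vocabulary) for d in documents]

print(len(vocabulary))  # number of features (distinct words)
print(vectors[1])       # only the non-zero entries of the second document
```

With a realistic vocabulary of hundreds of thousands of words, each document would still hold only as many entries as it has distinct words.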

9 Determining features
Apply the same pre-processing pipeline used for search
Join tokens when needed (e.g., “AK-48”, part numbers, chemical formulas, etc.)
May need to emphasize words in titles, abstracts, or section headings
One option: multiply the input dimensionality by the number of existing blocks, so that “embryo” in the title is a feature completely unrelated to “embryo” in the body
Another option: heuristically increase the weight of words in titles and section headers
Term frequency is not relevant for short messages
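
The two options for emphasizing title words can be sketched as below. This is only an illustration under assumptions: the function names, the `title:` / `body:` prefixes, and the weight of 3.0 are invented for the example and are not taken from the slides.

```python
from collections import Counter

def features_by_block(title, body):
    """Option 1: one feature per (block, word) pair.

    The feature space grows with the number of blocks, and the same word
    in the title and in the body yields two unrelated features.
    """
    features = Counter()
    for word in title.lower().split():
        features["title:" + word] += 1
    for word in body.lower().split():
        features["body:" + word] += 1
    return features

def features_weighted(title, body, title_weight=3.0):
    """Option 2: shared features; a title occurrence counts title_weight times.

    The value of title_weight is an arbitrary heuristic choice.
    """
    features = Counter()
    for word in title.lower().split():
        features[word] += title_weight
    for word in body.lower().split():
        features[word] += 1.0
    return features

# Invented example document.
title = "embryo development"
body = "the embryo was observed daily"
split = features_by_block(title, body)
weighted = features_weighted(title, body)

print(split["title:embryo"], split["body:embryo"])
print(weighted["embryo"], weighted["daily"])
```

The first option lets the learner assign independent weights to title and body occurrences at the cost of more dimensions; the second keeps the dimensionality fixed but hard-codes the emphasis.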

10 Training data is essential
SVMs and random forests are popular choices
Very little training data: Naïve Bayes (however, I would say just get more training data)
The amount of training data will vary during the learning cycle
In practice:
With a few hundred examples per class, you already see that obvious examples are classified correctly
With a few thousand examples per class, less common cases start to be classified correctly
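
For the small-data case, a multinomial Naïve Bayes classifier fits in a few lines. The sketch below is an assumption-laden illustration: the four training reviews are invented, tokenization is a plain whitespace split, and add-one (Laplace) smoothing is one common choice rather than something the slides prescribe.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: reviews of a monument.
examples = [
    ("amazing view and beautiful gardens", "positive"),
    ("beautiful monument worth the visit", "positive"),
    ("long queue and rude staff", "negative"),
    ("overpriced and crowded not worth it", "negative"),
]
model = train_naive_bayes(examples)
print(predict(model, "beautiful gardens"))
print(predict(model, "rude staff and long queue"))
```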

11 The devil is in the details
Real systems may combine automatic classification with a few carefully hand-crafted rules
Real systems continuously incorporate new examples to maintain and improve performance
Classes are commonly unbalanced
You need many examples of the minority class; you can obtain them by keyword filtering, but that biases the training data (which harms generative models)
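
One way to combine hand-crafted rules with a learned classifier is to let the rules decide first and fall back to the model otherwise. The sketch below assumes that ordering; the single rule, the labels, and `dummy_model` are hypothetical stand-ins, not part of any real system described in the slides.

```python
import re

# Hypothetical hand-crafted rules: (compiled pattern, label).
RULES = [
    (re.compile(r"\bunsubscribe\b", re.IGNORECASE), "spam"),
]

def classify(text, model_predict):
    """Apply the rules first; fall back to the learned classifier otherwise.

    Returns (label, source) so that rule hits can be audited separately.
    """
    for pattern, label in RULES:
        if pattern.search(text):
            return label, "rule"
    return model_predict(text), "model"

def dummy_model(text):
    """Stand-in for a trained classifier; always answers 'ham'."""
    return "ham"

print(classify("Click to UNSUBSCRIBE now", dummy_model))
print(classify("lunch tomorrow?", dummy_model))
```

Returning the source of each decision makes it easier to check later whether a rule is still needed once more training examples have been incorporated.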

12 Evaluating
Evaluation can be done on a hold-out set
If more data becomes available … how do we know our classifier is performing better?
Cross-validation
Fixed assignment to test or hold-out (validation)

13 Cross-validation
Divide sample into n “folds” (5 in this example)
For k = 1 … n
  Train on all folds except fold k
  Test on fold k
Average n runs → result
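
The procedure above can be sketched in plain Python. The assumptions here are loud ones: items are assigned to folds round-robin without shuffling (real data should usually be shuffled first), accuracy is the measure being averaged, and the majority-class "learner" and the ten dummy documents are placeholders invented for the example.

```python
def k_fold_indices(n_items, n_folds):
    """Assign item i to fold i % n_folds; returns a list of index lists."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_items):
        folds[i % n_folds].append(i)
    return folds

def cross_validate(items, labels, n_folds, train, predict):
    """Average accuracy over n_folds runs, each tested on one held-out fold."""
    folds = k_fold_indices(len(items), n_folds)
    accuracies = []
    for k in range(n_folds):
        test_idx = set(folds[k])
        # Train on all folds except fold k.
        train_x = [x for i, x in enumerate(items) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train(train_x, train_y)
        # Test on fold k.
        correct = sum(predict(model, items[i]) == labels[i] for i in folds[k])
        accuracies.append(correct / len(folds[k]))
    return sum(accuracies) / n_folds

# Placeholder learner: always predicts the most frequent training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

# Invented data: 10 documents, 3 "spam" and 7 "ham".
items = ["doc%d" % i for i in range(10)]
labels = ["spam"] * 3 + ["ham"] * 7
score = cross_validate(items, labels, 5, train_majority, predict_majority)
print(score)
```

Any pair of `train` and `predict` functions with the same signatures can be passed in place of the placeholder learner.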

14 With unbalanced classes, accuracy becomes meaningless
Need to analyze the confusion matrix
Example: classes are { uk, poultry, …, trade }
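
A small numeric sketch shows why accuracy misleads here. The class names are borrowed from the example above, but the 95/5 split and the always-"trade" predictions are invented for illustration.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

def recall(matrix, label):
    """Fraction of items truly in `label` that were predicted as `label`."""
    total = sum(c for (t, _), c in matrix.items() if t == label)
    return matrix[(label, label)] / total if total else 0.0

# Invented, heavily unbalanced data; the classifier always answers "trade".
true_labels = ["trade"] * 95 + ["poultry"] * 5
predicted = ["trade"] * 100
matrix = confusion_matrix(true_labels, predicted)

print(accuracy(true_labels, predicted))  # high despite ignoring "poultry"
print(recall(matrix, "poultry"))         # the minority class is never found
print(matrix[("poultry", "trade")])      # all minority items are misfiled
```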

15 Micro- and Macro-Average
Micro-Average: evaluate every item separately
Macro-Average: evaluate each class separately, then average
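
The difference between the two averages can be illustrated with recall. This is a sketch under assumptions: recall is chosen as the measure, and the two classes and their counts are invented. Micro-averaging pools the per-item decisions of all classes before computing the measure; macro-averaging computes the measure per class and then takes the unweighted mean.

```python
def class_counts(true_labels, predicted_labels, label):
    """True positives and false negatives for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fn

def macro_recall(true_labels, predicted_labels):
    """Compute recall per class, then average the per-class values."""
    classes = sorted(set(true_labels))
    recalls = []
    for c in classes:
        tp, fn = class_counts(true_labels, predicted_labels, c)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)

def micro_recall(true_labels, predicted_labels):
    """Pool the counts of all classes, then compute a single recall."""
    tp_total = fn_total = 0
    for c in sorted(set(true_labels)):
        tp, fn = class_counts(true_labels, predicted_labels, c)
        tp_total += tp
        fn_total += fn
    return tp_total / (tp_total + fn_total)

# Invented data: 90 "trade" items all correct, 10 "poultry" items with 2 correct.
true_labels = ["trade"] * 90 + ["poultry"] * 10
predicted = ["trade"] * 90 + ["trade"] * 8 + ["poultry"] * 2

print(micro_recall(true_labels, predicted))  # dominated by the large class
print(macro_recall(true_labels, predicted))  # each class counts equally
```

In this example the micro-average is high because the large class dominates, while the macro-average exposes the poor result on the small class.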

16 Micro- and Macro-Average (cont.)

