
1 Film classification using subtitles and automatically generated language factors
Joshua Wortman, Industrial Engineering and Management, Technion; Prof. Alon Itai, Computer Science, Technion.
The contents of this presentation may not be reused or redistributed, in whole or in part, without the express written permission of the authors. Please contact jwortman-at-technion.ac.il for more information.

2 Contents
Background and motivations
Components of analysis
Classification models
Conclusions

3 Background & Motivations
The data: subtitle files for 1062 films, with genre labels from IMDB.
The challenge: label each film with its genres.

4 What is a genre? Drama, Thriller, Comedy, Action, Crime, Romance, Animation, Family, War, Sci-Fi, Fantasy, Horror, Adventure, Mystery. A genre is a theme or style, not a topic.

5 Background & Motivations
There are no prototypes.

6 Background & Motivations
Genre distribution over the films (the per-genre percentage column and some example years did not survive extraction):

Genre      Count  Examples
Drama      464    The Shawshank Redemption, 1994; The Passion of the Christ
Comedy     401    Next Friday, 2000; Legally Blonde
Thriller   391    The Glass House, 2001; Red Eye
Action     340    Sniper 2, 2002; A Better Way to Die
Romance    244    Meet the Parents, 2000; The Notebook
Adventure  240    Indiana Jones and the Temple of Doom
Crime      230    Get Shorty, 1995; Gangs of New York
Sci-Fi     142    Battlestar Galactica: The Second Coming
Fantasy    140    Big, 1988; Bruce Almighty
Horror     114    Secret Window, 2004; Alien: Resurrection
Mystery    110    The Sixth Sense, 1999; Flightplan
Family     107    Charlie and the Chocolate Factory
Animation  66     Monsters, Inc., 2001; Snow White and the Seven Dwarfs
War        49     Kippur, 2000; Saving Private Ryan

7 Background & Motivations
What tools and methods do we have available?
Topical:
- Bag of words
- Entity extraction (Feldman et al.; various)
- WordNet (Katsiouli, Tsetsos & Hadjiefthymiades, 2007)
Stylistic:
- Lexicographic analysis, corpus linguistics (Biber, Conrad & Reppen, 2004)
- POS tagging
- Usage of linguistic word categories in text: Linguistic Inquiry and Word Count (LIWC)

8 Background & Motivations
Factor approach with LIWC:
Pennebaker & King. Linguistic Style: Language Use as an Individual Difference. Journal of Personality and Social Psychology (1999).
Mairesse et al. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research (2007).

LIWC categories (category: examples; total stems):

Linguistic Processes
- Personal pronouns: I, them, her (70)
- 1st person singular: I, me, mine (12)
- 2nd person: you, your, thou (20)
- 3rd person singular: she, her, him (17)
- Past tense: went, ran, had (145)
- Present tense: is, does, hear (169)
- Future tense: will, gonna (48)
- Adverbs: very, really, quickly (69)
- Prepositions: to, with, above (60)
- Conjunctions: and, but, whereas (28)
- Negations: no, not, never (57)
- Quantifiers: few, many, much (89)
- Swear words: damn, piss, … (53)
- Assent: agree, OK, yes (30)

Psychological Processes
- Social processes: mate, talk, they, child (455)
- Family: daughter, husband, aunt (64)
- Friends: buddy, friend, neighbor (37)
- Humans: adult, baby, boy (61)
- Affective processes: happy, cried, abandon (915)
- Positive emotion: love, nice, sweet (406)
- Negative emotion: hurt, ugly, nasty (499)
- Anxiety: worried, fearful, nervous (91)
- Anger: hate, kill, annoyed (184)
- Sadness: crying, grief, sad (101)
- Cognitive processes: cause, know, ought (730)

Personal Concerns
- Work: job, majors, xerox (327)
- Achievement: earn, hero, win (186)
- Leisure: cook, chat, movie (229)
- Home: apartment, kitchen, family (93)
- Money: audit, cash, owe (173)
- Religion: altar, church, mosque (159)
- Death: bury, coffin, kill (62)

9 Background & Motivations
[Same LIWC category table as slide 8, this time highlighting categories used as probability features, e.g. p(future), p(assent), p(anxiety).]

10 Components of Analysis: Factor Approach
For a set of films i = 1, 2, … and genres g ∈ {Drama, Comedy, …}:
- d_i is subtitle file i (one document)
- D is the set of subtitle files from a training set: D = {d_i : i ∈ training set}
- D_g is the set of subtitle files from the same training set whose films are labeled with g (the genre template): D_g = {d_i : i ∈ g ∩ training set}
[Figure: a toy 17-word subtitle file d_i; with factor α = {a, b, c, d}, the factor occurs #(factor α | d_i) = 5 times.]

11 Components of Analysis: Factor Approach
The probability of factor α in d_i is its relative frequency: p_{i,α} = #(factor α | d_i) / |d_i|. For the toy film, p_{i,α} = 5/17 ≈ 29%.
Film i is represented as a vector of probabilities: p_i = (p_{i,1}, p_{i,2}, …, p_{i,α}, …, p_{i,m}) ∈ [0,1]^m.
The probabilities of factor α in D and in D_g are defined the same way over the pooled documents of each set.
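As a sanity check on the definitions above, here is a minimal Python sketch of p_{i,α} and the film vector p_i. The tokenized toy film and the helper names are illustrative, not from the thesis.

```python
from collections import Counter

def factor_probability(tokens, factor_words):
    """p_{i,alpha}: share of the subtitle's tokens that belong to factor alpha."""
    counts = Counter(tokens)
    return sum(counts[w] for w in factor_words) / len(tokens)

def film_vector(tokens, factors):
    """p_i = (p_{i,1}, ..., p_{i,m}), one coordinate per factor, in [0,1]^m."""
    return [factor_probability(tokens, f) for f in factors]

# The slide's 17-word toy film: factor alpha = {a, b, c, d} occurs 5 times.
film = "s a c s a b x w z u x y r n a f m".split()
alpha = {"a", "b", "c", "d"}
print(factor_probability(film, alpha))  # 5/17 ~ 0.294
```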

12 Components of Analysis
[Figure: films (Scooby-Doo, Scream, Cinderella, Top Gun) and the MYSTERY genre template plotted in the m-dimensional factor space spanned by factor 1, factor 2, …, factor m; each point is a probability vector.]

13 Components of Analysis
Drill down into the LIWC category SWEAR_WORDS: p_{CRIME, SWEAR} ≈ 0.63%. Maybe log likelihood can help?

14 Components of Analysis
Drill down into the LIWC category SWEAR_WORDS: the Michelson contrast function, in its standard form MC(a, b) = (a − b) / (a + b), amplifies low signals and bounds the maximum signal.
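A small sketch of the contrast function named on the slide; the numeric inputs are made up purely to show the "amplifies low signals, bounds max signal" behavior.

```python
def michelson_contrast(a, b):
    """MC(a, b) = (a - b) / (a + b): bounded in [-1, 1]."""
    return 0.0 if a + b == 0 else (a - b) / (a + b)

# Same absolute gap, very different contrast:
print(michelson_contrast(0.0063, 0.0021))  # 0.5    (weak signals, amplified)
print(michelson_contrast(0.5063, 0.5021))  # ~0.004 (strong baseline, bounded)
```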

15 Components of Analysis
Drill down into the LIWC category SWEAR_WORDS.
[Chart: raw probability versus Michelson contrast per genre; p_{CRIME, SWEAR} ≈ 0.63%.]

16 Components of Analysis
[Figure: the same films (Scooby-Doo, Scream, Cinderella, Top Gun) and the MYSTERY template, now plotted in contrast space with axes MC(·, x_1), MC(·, x_2), …, MC(·, x_m).]
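How a film is scored against a genre template in contrast space is not preserved in the transcript; the sketch below assumes one plausible rule (negative Euclidean distance to the template), reusing michelson_contrast from the previous sketch. The scoring function here is hypothetical.

```python
def contrast_vector(p_film, p_corpus):
    """Map a film's factor probabilities into Michelson-contrast space,
    coordinate by coordinate, relative to the whole-corpus probabilities."""
    return [michelson_contrast(pf, pc) for pf, pc in zip(p_film, p_corpus)]

def genre_score(film_mc, template_mc):
    """Hypothetical similarity: higher (less negative) means closer to the template."""
    return -sum((a - b) ** 2 for a, b in zip(film_mc, template_mc)) ** 0.5
```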

17 Classification Models
10-fold cross-validation:
- 70% train set: creates D and each D_g
- 20% threshold training: identifies the optimal classification threshold
- 10% test set: used to generate performance values
65 LIWC categories, targeting optimal F-score.
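A minimal sketch of one fold of the 70/20/10 split described above; the shuffling and seeding details are assumptions.

```python
import random

def split_films(film_ids, seed=0):
    """One fold: 70% train, 20% threshold tuning, 10% test."""
    ids = list(film_ids)
    random.Random(seed).shuffle(ids)
    a, b = int(0.7 * len(ids)), int(0.9 * len(ids))
    return ids[:a], ids[a:b], ids[b:]
```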

18 Classification Models
Precision = Hits / (Hits + False Alarms); Recall = Hits / (Hits + Misses).
F-score = 2 · Precision · Recall / (Precision + Recall).
Example from the slide: precision = 0.6 and recall = 0.75 give F-score ≈ 0.67.
65 LIWC categories, targeting optimal F-score.
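The metrics in code form; the raw counts below are invented so that they reproduce the slide's precision of 0.6 and recall of 0.75.

```python
def precision_recall_f(hits, false_alarms, misses):
    """Standard precision, recall, and F-score from confusion counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

print(precision_recall_f(hits=6, false_alarms=4, misses=2))
# (0.6, 0.75, 0.666...)
```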

19 Classification Models
65 LIWC categories, targeting optimal F-score. [Per-genre results chart not preserved in the transcript.]

20 Classification Models
Strength metric for factor selection. Using D and D_g, split each factor k into two sets:
- H_{k,g} = {w ∈ factor k | p(w|g) > p(w)}
- L_{k,g} = {w ∈ factor k | p(w|g) ≤ p(w)}
Three possibilities: p(H_{k,g}) = p(L_{k,g}), p(H_{k,g}) > p(L_{k,g}), or p(H_{k,g}) < p(L_{k,g}).
Sort the factors k by |log(p(H_{k,g}) / p(L_{k,g}))|.
Result: 40.5 factors used on average (37% savings!)
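A sketch of the strength metric under one reading of the slide: p(H) and p(L) are taken as the genre-conditional probability mass of the high and low word sets. That interpretation is an assumption; the transcript does not say how p(H_{k,g}) is computed.

```python
import math

def factor_strength(factor_words, p_global, p_genre):
    """|log(p(H_{k,g}) / p(L_{k,g}))| for one factor k and genre g.
    p_global[w] = p(w) over D; p_genre[w] = p(w|g) over D_g."""
    high = [w for w in factor_words if p_genre.get(w, 0.0) > p_global.get(w, 0.0)]
    low = [w for w in factor_words if p_genre.get(w, 0.0) <= p_global.get(w, 0.0)]
    p_high = sum(p_genre.get(w, 0.0) for w in high)  # assumed definition of p(H)
    p_low = sum(p_genre.get(w, 0.0) for w in low)
    if p_high == 0.0 or p_low == 0.0:
        return float("inf")  # one side empty: maximally skewed factor
    return abs(math.log(p_high / p_low))

# Factors are then sorted by strength per genre and only the top ones kept.
```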

21 Classification Models: Generating factors from a graph
1. Select a set of useful words W.
2. Represent the relationships between members of W as a graph.
3. Cluster words from the graph.
4. Implement the vector model with these clusters as factors.
Two variants: a general graph and a genre-specific graph.

22 Classification Models (general method)
1. Selecting useful words (the "no love" methodology):
For each w ∈ D, let δ(w) = the percentage of films with #(w|d) ≥ c. If 2% < δ(w) < 45%, add w to W.
Stop words include:

word    df(w)  δ(w)
please  96%    88%
sorry   95%    85%
life    94%    77%
love    90%    74%
kill    85%    55%
girl    83%    54%
wanted  81%    53%
shit    67%    51%
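A sketch of step 1; the threshold c's actual value is not given in the transcript, so c=2 here is a placeholder.

```python
from collections import Counter

def select_useful_words(films, c=2, lo=0.02, hi=0.45):
    """Keep w with 2% < delta(w) < 45%, where delta(w) is the share of
    films containing w at least c times. films: list of token lists."""
    n = len(films)
    df_above_c = Counter()
    for tokens in films:
        counts = Counter(tokens)
        df_above_c.update(w for w, k in counts.items() if k >= c)
    return {w for w, df in df_above_c.items() if lo < df / n < hi}
```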

23 Classification Models (general method)
2. Building the general word graph (GWG) with the words in W:
For each word pair w_i, w_j ∈ W, let δ(w_i, w_j) = the percentage of films with both #(w_i|d) ≥ c and #(w_j|d) ≥ c.
The co-occurrence ratio: ρ(w_i, w_j) = δ(w_i, w_j) / [δ(w_i) · δ(w_j)].
GWG contains the edge (w_i, w_j) iff ρ(w_i, w_j) ≥ θ, where c and θ are empirical constants.
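Step 2 as code, with c and theta standing in for the slide's unnamed empirical constants; δ and ρ follow the reconstruction above.

```python
from collections import Counter
from itertools import combinations

def gwg_edges(films, W, c=2, theta=1.5):
    """Edges (w_i, w_j) whose co-occurrence ratio rho clears theta."""
    n = len(films)
    present = []  # per film: which useful words occur at least c times
    for tokens in films:
        counts = Counter(tokens)
        present.append({w for w in W if counts[w] >= c})
    delta = {w: sum(w in s for s in present) / n for w in W}
    edges = []
    for wi, wj in combinations(sorted(W), 2):
        pair = sum(wi in s and wj in s for s in present) / n
        if delta[wi] and delta[wj] and pair / (delta[wi] * delta[wj]) >= theta:
            edges.append((wi, wj))
    return edges
```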

24 Classification Models (general method)
3. Clustering the words of GWG into sets:
- Create the set of maximum-size cliques for each w ∈ GWG.
- Merge highly similar cliques; each merged cluster becomes a factor in our custom model.
- Modeling performance with these factors equals LIWC performance.
- The strength metric reduces dimensionality: 38% of 220 factors used.

Maximum-size cliques for "commander" (each of size 9):
- commander contact heading national position states strike target weapons
- commander contact heading major national states strike target weapons
- commander contact heading launch position states strike target weapons
- commander contact heading launch major states strike target weapons

Relaxed clique containing "commander", using θ = 0.7:
attack base begin bomb bridge build built captain center commander complete contact crew destroy destroyed earth emergency energy escape force forward holding immediately impossible launch lieutenant main necessary planet prepare project ship signal space speed system weapons
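Step 3 could look like this with networkx: maximal cliques as candidate factors, then a greedy merge of similar cliques. The Jaccard merge criterion is an assumption; the transcript only says "merge highly similar cliques" (θ = 0.7 is mentioned for relaxed cliques, a different relaxation).

```python
import networkx as nx

def cluster_factors(edges, overlap=0.7):
    """Maximal cliques of the word graph, greedily merged when similar."""
    G = nx.Graph()
    G.add_edges_from(edges)
    cliques = sorted((set(c) for c in nx.find_cliques(G)), key=len, reverse=True)
    factors = []
    for c in cliques:
        for f in factors:
            if len(c & f) / len(c | f) >= overlap:  # assumed Jaccard criterion
                f |= c
                break
        else:
            factors.append(c)
    return factors
```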

25 Classification Models (genre-specific method)
1. Selecting useful words:
For each w ∈ D_g (where D_g = {d_i : i ∈ g}):
- δ_g(w) = the percentage of films in g with #(w|d) ≥ c
- δ_¬g(w) = the percentage of films in ¬g with #(w|d) ≥ c (compare δ(w) − δ_g(w))
If δ_g(w) ≥ 25% and δ_g(w) / δ_¬g(w) > 1.4, add w to W_g.
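Step 1 of the genre-specific method in the same style; labels is assumed to map each film to its set of genres, and c is again a placeholder constant.

```python
from collections import Counter

def select_genre_words(films, labels, genre, c=2):
    """W_g: words with delta_g(w) >= 25% and delta_g(w) / delta_notg(w) > 1.4."""
    in_g = [t for t, gs in zip(films, labels) if genre in gs]
    out_g = [t for t, gs in zip(films, labels) if genre not in gs]

    def delta(subset, w):
        # Share of films in the subset containing w at least c times.
        return sum(Counter(t)[w] >= c for t in subset) / max(len(subset), 1)

    vocab = {w for t in in_g for w in t}
    return {w for w in vocab
            if delta(in_g, w) >= 0.25
            and delta(in_g, w) > 1.4 * max(delta(out_g, w), 1e-9)}
```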

26 Classification Models (genre-specific method)
2. Building the GS word graph from W_g:
- For each pair w, v ∈ W_g, calculate a new word-pair relationship value γ_g(w, v), then normalize it to δ_g(w, v). [The formulas shown on the slide are not preserved in the transcript.]
- GS contains the edge (w_i, w_j) iff γ_g(w_i, w_j) ≥ θ_g.
3. Create cliques and relaxed cliques as before.

27 Classification Models (genre-specific method)
Results using GS factors:
- Performance also equals LIWC performance!
- The strength metric reduces dimensionality: 55% of factors used.

Comparing GWG and GS factors (values marked n/a did not survive extraction):

                      GWG factors  GS factors
Unique words          n/a          n/a
Total usage           3.57%        3.98%
Clusters              n/a          n/a
Average cluster size  n/a          n/a
Largest               37           33

Conclusion: let's try mixing them…

28 Classification Models (mixed model)
Mixed-model performance is significantly better than all previous methods (p < 0.003). Compared to the LIWC model, error is reduced by:
- 8.1% for precision
- 6.6% for recall
- 7.4% for F-score

              GWG factors  GS factors  Mixed factors
Unique words  n/a          n/a         n/a
Total usage   3.6%         4.0%        5.9%
Factors       n/a          n/a         n/a
Factors used  38%          55%         49%

[Chart comparing the mixed and LIWC models; values not preserved.]

29 Classification Models (mixed model continued)
[Figure; only the highlighted word "CAUGHT" survives in the transcript.]

30 Classification Models (mixed model continued)
Unmasking Scooby-Doo… [Per-genre table of score, threshold, tag, and correctness for Animation, Fantasy, Family, Adventure, Sci-Fi, Romance, Horror, Comedy, Mystery, Action, War, Thriller, Crime, and Drama; the numeric values and most correctness marks did not survive extraction. Visible X tags: Animation, Fantasy, Family, Adventure, Romance, Comedy.]
The film scores far below the Mystery threshold. CA = 78.6% for this film; over all films, classification accuracy is 79.4%.

31 Conclusions: Should automatically generated factors replace LIWC?
"You can't begin to imagine the thousands of hours that have gone into the making of these dictionaries. And when I see that we apparently missed 'yourself' I wonder how it is possible that it happened." – Prof. James Pennebaker, LIWC, personal communication.
Our methods may be used for:
- thematic document classification
- personality research
- building a better search engine
- creating a movie recommendation system

32 ~ Thank you ~

