
Boosting Textual Source Attribution Foaad Khosmood Department of Computer Science University of California, Santa Cruz Winter 2006.


1 Boosting Textual Source Attribution Foaad Khosmood Department of Computer Science University of California, Santa Cruz Winter 2006

2 HEY??? What’s so funny? ► What makes something funny? ► Can we tell by just reading? Can a computer? ► Shakespeare’s Comedies and Tragedies.  Actually, Comedies, Tragedies, Historical Plays and Sonnets.

3 High Level Source Attribution Process

4 Experimenting with Boosting  Most prior work has been done on binary classification.  Needs lots of “weak” learners.  Some variants work well with a limited data set.  Provides knowledge about the importance of features.

5 Data Set (Training) ► Comedies  Measure for Measure  Much Ado about Nothing  Merchant of Venice  Midsummer Night’s Dream  Taming of the Shrew  Twelfth Night ► Tragedies  Antony and Cleopatra  Titus Andronicus  Hamlet  Julius Caesar  Romeo and Juliet

6 Data Set (Test) ► All’s Well That Ends Well [c] ► Comedy of Errors [c] ► As You Like It [c] ► The Tempest [c] ► Merry Wives of Windsor [c] ► King Lear [t] ► Macbeth [t] ► Coriolanus [t] ► Othello [t]

7 Feature Selection ► Features: words ► Selection method: picked the 2500 most common words in the training set ► Preprocessing: 300 common English words and grammar operators removed  HTML and stage directions removed ► 429 of the 2500 words were not common to all plays; these 429 were chosen for the weak-learner functions (in this particular run)
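The selection pipeline on this slide can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the tokenization regex, and the stop-word handling are my assumptions.

```python
import re
from collections import Counter

def select_features(training_texts, stop_words, vocab_size=2500):
    """Pick the most common words across the training set, minus stop words."""
    counts = Counter()
    for text in training_texts:
        # Strip HTML tags (stage-direction markup) before counting
        text = re.sub(r"<[^>]+>", " ", text)
        counts.update(w for w in re.findall(r"[a-z']+", text.lower())
                      if w not in stop_words)
    return [w for w, _ in counts.most_common(vocab_size)]

def discriminating_words(vocab, play_word_sets):
    """Keep words missing from at least one play (429 of 2500 in the slide's run)."""
    return [w for w in vocab if not all(w in s for s in play_word_sets)]
```

Words that occur in every play carry no discriminating signal between sources, which is why only the non-universal words become weak-learner features.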

8 [Word lists: TRAGEDY WORDS / COMEDY WORDS] 429 words: 225 (comedy), 204 (tragedy). Data: vector of 2500 words, X = [X1, X2, …, X2500]. Weak learners F1(X)…F429(X), each returning 1 for a positive hit.
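Each weak learner here is just a word-presence test over the 2500-dimensional word vector. A minimal sketch (the factory function and variable names are mine):

```python
def make_weak_learner(word_index):
    """Weak learner f_i: fires (returns 1) when word i occurs in the vector."""
    def f(x):  # x is the word-count vector for one document
        return 1 if x[word_index] > 0 else 0
    return f

# One learner per discriminating word -- 429 of them in the slide's run
weak_learners = [make_weak_learner(i) for i in range(429)]
```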

9 Boosting iteration:

| Play      | Input S[n]     | Y[n] | d[t] | h1[t] | u[n] | e1[t] | d[t+1]       |
|-----------|----------------|------|------|-------|------|-------|--------------|
| Measure   | [ 2500 words ] | 1    | +1/8 | 1     | 1    | 3/8   | 3/8 · 8/10 e |
| Much Ado  | [ 2500 words ] | 1    | +1/8 |       |      | 3/8   | 3/8 · 8/6 e  |
| Merchant  | [ 2500 words ] | 1    | +1/8 |       |      | 3/8   | 3/8 · 8/6 e  |
| Midsummer | [ 2500 words ] | 1    | +1/8 |       |      | 3/8   | 3/8 · 8/6 e  |
| Romeo     | [ 2500 words ] |      | +1/8 | 1     |      | 3/8   | 3/8 · 8/10 e |
| Antony    | [ 2500 words ] |      | +1/8 | 1     |      | 3/8   | 3/8 · 8/10 e |
| Titus     | [ 2500 words ] |      | +1/8 | 1     |      | 3/8   | 3/8 · 8/10 e |
| Hamlet    | [ 2500 words ] |      | +1/8 | 1     |      | 3/8   | 3/8 · 8/10 e |

10 Boosting ► A mix of LPBoost and TotalBoost ► No termination (finite set of weak learners) ► Didn’t have a gamma function; used eta (the error) instead ► Didn’t use the zero-sum constraint on normalization of the weight updates
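The slide's exact algorithm is an LPBoost/TotalBoost mix; as a rough sketch of the reweighting idea (the example weights d[t] and error e on slide 9), here is a generic AdaBoost-style loop instead. This is an illustrative stand-in, not the author's variant:

```python
import math

def boost(weak_learners, X, y, rounds):
    """Generic boosting sketch: reweight examples by each learner's error.

    X: list of word-count vectors; y: labels in {+1, -1} (comedy/tragedy).
    """
    n = len(X)
    d = [1.0 / n] * n                      # example weights (d[t] on slide 9)
    alphas = []
    for t in range(rounds):
        h = weak_learners[t % len(weak_learners)]
        preds = [1 if h(x) else -1 for x in X]
        # Weighted error of this learner; clamp away from 0 and 1
        err = sum(w for w, p, yy in zip(d, preds, y) if p != yy)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        alphas.append(alpha)
        # Increase weight on misclassified examples, then renormalize
        d = [w * math.exp(-alpha * p * yy) for w, p, yy in zip(d, preds, y)]
        z = sum(d)
        d = [w / z for w in d]
    return alphas
```

A learner that fires on comedies but not tragedies earns a large positive weight; one that fires indiscriminately earns a weight near zero.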

11 Classification ► Used the accumulated weights at the very end. ► Every presence in the test corpus adds (1 × W) to totalW; some W’s are negative. ► At the end it was a simple matter of observing whether the result was positive or negative, and by how much.
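The accumulated-weight decision rule can be sketched like so (a hypothetical sign-of-weighted-sum function following the slide's description; the label strings are mine):

```python
def classify(x, weak_learners, weights):
    """Sum each firing learner's weight (some negative); the sign gives the class."""
    total = sum(w for f, w in zip(weak_learners, weights) if f(x) == 1)
    return ("comedy" if total > 0 else "tragedy"), total
```

Returning the magnitude along with the label matches the slide's "and by how much" -- the ratio values on slide 15 are exactly that margin.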


15 Program Output ► [root@localhost output]#./classify.sh ► 00_allswell.html-ratio.txt: 14.6807 ► 01_comedyErrors.html-ratio.txt: 13.2634 ► 02_measure.html-ratio.txt: 34.2748 ► 03_muchAdo.html-ratio.txt: -6.43018 ► 04_asyoulikeit.html-ratio.txt: 18.8413 ► 05_cleopatra.html-ratio.txt: 14.1148 ► 06_lear.html-ratio.txt: 32.2858 ► 07_macbeth.html-ratio.txt: -21.095 ► 08_coriolanus.html-ratio.txt: 43.5599 ► 09_titus.html-ratio.txt: -3.31167 ► 10_cleopatraFull.html-ratio.txt: -300.179 ► 11_learFull.html-ratio.txt: 356.504 ► 13_tempestFull.html-ratio.txt: 454.171 ► 14_marryWivesFull.html-ratio.txt: 147.738 ► 15_measure2.html-ratio.txt: 39.0357 ► 16_measureFull.html-ratio.txt: 112.527 ► 17_muchAdoFull.html-ratio.txt: 256.078 ► 18_veronaFull.html-ratio.txt: -222.444 ► 19_othelloFull.html-ratio.txt: -433.769 ► 20_titusFull.html-ratio.txt: -564.977

16 Results ► All’s Well That Ends Well [c][1] ► Comedy of Errors [c][1] ► As You Like It [c][1] ► The Tempest [c][1] ► Merry Wives of Windsor [c][1] ► King Lear [t][0] ► Macbeth [t][1] ► Coriolanus [t][0] ► Othello [t][1] 2/9 mistakes, i.e. 7/9 or 77% correct (other runs: 66% and 69%). Previous run with a neural net (different setup: 5/13, 61%) - with no proportionals!

17 Challenges ► Natural language has a lot of nuances that could make a difference (preprocessing methods, “common word” sets, adaptations) ► Boosting has great potential in this area ► Words provide an easy method for coming up with (many) weak learners

