Toward Automatic Speech Act Discovery. email newsgroups forums blogs.


1 Toward Automatic Speech Act Discovery

2 email newsgroups forums blogs

3 Data Set
20 Usenet newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "NewsWeeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

4 Preprocessing
>> I just wonder if this will also cause a divergence between commercial
>> and non-commercial software (ie. you will only get free software using
>> Athena or OpenLook widget sets, and only get commercial software using
>> the Motif widget sets).
>
> I can't see why. If just about every workstation will come with Motif
> by default and you can buy it for under $100 for the "free" UNIX
> platforms, I can't see this causing major problems.
Let me add another of my concerns: Yes, I can buy a port of Motif for "cheap", but I cannot get the source for "cheap", hence I am limited to using whatever X libraries the Motif port was compiled against (at least with older versions of Motif. I have been told that Motif 1.2 can be used with any X, but I have not seen it myself).

5 Preprocessing
>> I just wonder if this will also cause a divergence between commercial
>> and non-commercial software (ie. you will only get free software using
>> Athena or OpenLook widget sets, and only get commercial software using
>> the Motif widget sets).
>
> I can't see why. If just about every workstation will come with Motif
> by default and you can buy it for under $100 for the "free" UNIX
> platforms, I can't see this causing major problems.
Let me add another of my concerns: Yes, I can buy a port of Motif for "cheap", but I cannot get the source for "cheap", hence I am limited to using whatever X libraries the Motif port was compiled against (at least with older versions of Motif. I have been told that Motif 1.2 can be used with any X, but I have not seen it myself).
Section into “levels”
Level < previous level = reply to previous message
Level > previous level = new message
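The level rule on this slide could be sketched in Python roughly as follows. This is only an illustration, assuming a "level" is the number of leading `>` markers on a line. The helper names `quote_level`, `split_levels` and `reply_pairs` are hypothetical and do not come from the presentation.

```python
import re

# Leading run of '>' markers, optionally separated by spaces ("> >" or ">>").
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")


def quote_level(line):
    """Number of '>' markers at the start of a line (0 for unquoted text)."""
    match = QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def split_levels(lines):
    """Group consecutive non-empty lines that share a quote level."""
    segments = []
    for line in lines:
        level = quote_level(line)
        text = QUOTE_PREFIX.sub("", line).strip()
        if not text:
            continue  # skip blank lines, including bare '>' lines
        if segments and segments[-1][0] == level:
            segments[-1][1].append(text)
        else:
            segments.append((level, [text]))
    return segments


def reply_pairs(segments):
    """Pair each segment with the next one when the level drops (a reply).

    A rise in level is read as the start of a new message, so no pair
    is produced for it.
    """
    pairs = []
    for prev, cur in zip(segments, segments[1:]):
        if cur[0] < prev[0]:
            pairs.append((" ".join(prev[1]), " ".join(cur[1])))
    return pairs


lines = [
    ">> I just wonder if this will also cause a divergence",
    ">> the Motif widget sets).",
    ">",
    "> I can't see why.",
    "",
    "Let me add another of my concerns:",
]
segments = split_levels(lines)
pairs = reply_pairs(segments)
```

On the excerpt above this yields three segments (levels 2, 1, 0) and two message/response pairs.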

6 Also: Remove headers
Xref: cantaloupe.srv.cs.cmu.edu comp.windows.x:66928 comp.windows.x.apps:2487
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!howland.reston.ans.net!gatech!asuvax!chnews!tmcconne
From: tmcconne@sedona.intel.com (Tom McConnell~)
Newsgroups: comp.windows.x,comp.windows.x.apps
Subject: Re: Motif vs. [Athena, etc.]
Date: 16 Apr 1993 20:14:04 GMT
Organization: Intel Corporation
Lines: 44
Sender: tmcconne@sedona (Tom McConnell~)
Distribution: world
Message-ID:
References:
NNTP-Posting-Host: thunder.intel.com
Originator: tmcconne@sedona
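One way to drop the header block is to let Python's standard `email` package split the article at the first blank line. This is a minimal sketch, not the method used in the presentation. The function name `split_article` is hypothetical and the sample article is abbreviated from the slide.

```python
from email import message_from_string


def split_article(raw):
    """Return (headers, body) for one raw Usenet article.

    Headers end at the first blank line; everything after it is the body.
    """
    msg = message_from_string(raw)
    headers = dict(msg.items())
    body = msg.get_payload()
    return headers, body


raw = (
    "From: tmcconne@sedona.intel.com (Tom McConnell~)\n"
    "Newsgroups: comp.windows.x,comp.windows.x.apps\n"
    "Subject: Re: Motif vs. [Athena, etc.]\n"
    "Lines: 44\n"
    "\n"
    "Let me add another of my concerns:\n"
)
headers, body = split_article(raw)
```

Keeping the parsed headers around (rather than discarding them) leaves fields such as `Subject` available if they are wanted later.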

7 Also: Remove signatures
Cheers,
Tom McConnell
--
Tom McConnell | Internet: tmcconne@sedona.intel.com
Intel, Corp. C3-91 | Phone: (602)-554-8229
5000 W. Chandler Blvd. | The opinions expressed are my own. No one in
Chandler, AZ 85226 | their right mind would claim them.

8 Also: Remove signatures
Cheers,
Tom McConnell
--
Tom McConnell | Internet: tmcconne@sedona.intel.com
Intel, Corp. C3-91 | Phone: (602)-554-8229
5000 W. Chandler Blvd. | The opinions expressed are my own. No one in
Chandler, AZ 85226 | their right mind would claim them.
Look for ---*
Doesn't always find it

9 Also: Remove signatures
Cheers,
Tom McConnell
--
Tom McConnell | Internet: tmcconne@sedona.intel.com
Intel, Corp. C3-91 | Phone: (602)-554-8229
5000 W. Chandler Blvd. | The opinions expressed are my own. No one in
Chandler, AZ 85226 | their right mind would claim them.
Look for ---*
Doesn't always match
First paragraph only
Might miss important content
Sometimes grabs greetings (e.g. “Hi, \n”)
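The two signature heuristics on this slide could look roughly like this in Python. This is a sketch under assumptions: `---*` is read as a regular expression (two or more dashes alone on a line), and the function names `strip_signature` and `first_paragraph` are hypothetical.

```python
import re

# A line made only of two or more dashes, e.g. "--" or "-----".
SIG_DELIMITER = re.compile(r"^---*\s*$")


def strip_signature(body):
    """Cut the body at the first delimiter line, if there is one."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if SIG_DELIMITER.match(line):
            return "\n".join(lines[:i]).rstrip()
    return body.rstrip()  # no delimiter found: the heuristic does not match


def first_paragraph(body):
    """Fallback heuristic: keep only the text before the first blank line."""
    return re.split(r"\n\s*\n", body.strip(), maxsplit=1)[0].strip()


body = (
    "I have not seen it myself.\n"
    "\n"
    "Cheers,\n"
    "Tom McConnell\n"
    "--\n"
    "Tom McConnell | Internet: tmcconne@sedona.intel.com\n"
)
without_sig = strip_signature(body)
greeting_only = first_paragraph("Hi,\n\nDoes Motif 1.2 work with any X?")
```

The last call illustrates the failure mode noted on the slide: the first paragraph of that message is just the greeting.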

10 Preprocessing
Bi- and tri-grams
Tag start of sentence with ^
Force “not” to join with adjacent n-grams
e.g. ^there_is_not not_a_way a_way way_to to_do do_that
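The slide does not spell out the exact joining rule, so the following Python sketch is only one reading that happens to reproduce the example above: start from bigrams, let a bigram that is followed by "not" absorb it, let a bigram that starts with "not" absorb the next word, and drop the bigram that merely ends in "not". The function name `negation_ngrams` is hypothetical.

```python
def negation_ngrams(words):
    """Bigrams over one sentence, with 'not' absorbed into a neighbouring n-gram."""
    if not words:
        return []
    # Surface forms: the first word carries the sentence-start marker '^'.
    forms = ["^" + words[0]] + list(words[1:])
    n = len(words)
    grams = []
    for i in range(n - 1):
        left, right = words[i].lower(), words[i + 1].lower()
        if right == "not" and i > 0:
            # Already covered by the trigram emitted at position i - 1.
            continue
        gram = [forms[i], forms[i + 1]]
        if left == "not" and i + 2 < n:
            gram.append(forms[i + 2])  # not_a -> not_a_way
        elif i + 2 < n and words[i + 2].lower() == "not":
            gram.append(forms[i + 2])  # ^there_is -> ^there_is_not
        grams.append("_".join(gram))
    return grams


features = negation_ngrams("there is not a way to do that".split())
```

Attaching "not" to its neighbours keeps negated and non-negated phrases as distinct features, which plain unigrams or bigrams would blur.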

11 Text Modeling and Topic Discovery
Assume words and/or documents belong to some class/topic
Assume words are conditionally independent given the class/topic: $P(w \mid z)$

12 Naïve Bayes
Each document belongs to one class $z$
$P(d \mid z) = \prod_{w \in d} P(w \mid z)$

13 Naïve Bayes - Inference Expectation-Maximization
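For unlabeled documents, EM for naïve Bayes alternates between soft class assignments and parameter re-estimation. The pure-Python sketch below illustrates that loop for a mixture of multinomials. The function name `em_naive_bayes`, the smoothing constant `alpha`, the random initialisation and the toy documents are all assumptions made for illustration, not details from the presentation.

```python
import math
import random
from collections import Counter


def em_naive_bayes(docs, n_classes, n_iters=20, alpha=0.1, seed=0):
    """Fit an unsupervised naive Bayes model with EM.

    docs is a list of token lists. Returns (prior, word_probs, resp),
    where resp[d][z] is the posterior P(z | d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    # Random soft assignments break the symmetry between classes.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(n_classes)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(n_iters):
        # M-step: smoothed estimates of P(z) and P(w | z) from soft counts.
        prior = [
            (sum(r[z] for r in resp) + alpha) / (len(docs) + alpha * n_classes)
            for z in range(n_classes)
        ]
        word_probs = []
        for z in range(n_classes):
            soft = {w: alpha for w in vocab}
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    soft[w] += r[z] * n
            total = sum(soft.values())
            word_probs.append({w: v / total for w, v in soft.items()})
        # E-step: P(z | d) is proportional to P(z) * prod_w P(w | z).
        for d, c in enumerate(counts):
            logp = [
                math.log(prior[z])
                + sum(n * math.log(word_probs[z][w]) for w, n in c.items())
                for z in range(n_classes)
            ]
            peak = max(logp)  # subtract the max before exponentiating
            unnorm = [math.exp(lp - peak) for lp in logp]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]
    return prior, word_probs, resp


docs = [
    ["motif", "widget", "motif"],
    ["widget", "motif"],
    ["thanks", "cheers"],
    ["cheers", "thanks", "thanks"],
]
prior, word_probs, resp = em_naive_bayes(docs, n_classes=2)
```

EM only finds a local optimum, so the classes it recovers depend on the initialisation; in practice one would compare several seeds.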

14 Latent Semantic Indexing / Latent Dirichlet Allocation
Each document contains multiple topics
$P(d) = \prod_{w \in d} \sum_z P(w \mid z)\, P(z \mid d)$

15 Model for Conversational Text
Message $m$, response $r$
$P(m, r \mid z) = P(m \mid z)\, P(r \mid z)$
$P(r \mid m) \propto \sum_z P(z)\, P(m \mid z)\, P(r \mid z)$
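Given estimated parameters, the message/response score on this slide can be computed by summing over the latent class in log space. The sketch below is illustrative only. It assumes separate word distributions for messages and responses (`msg_probs`, `resp_probs`), which the slide does not specify, and the function names and toy probabilities are hypothetical.

```python
import math


def log_doc_prob(tokens, word_probs):
    """log P(tokens | z) under the conditional independence assumption."""
    return sum(math.log(word_probs[w]) for w in tokens)


def log_pair_score(message, response, prior, msg_probs, resp_probs):
    """log of sum_z P(z) P(m | z) P(r | z).

    For a fixed message m this is proportional to P(r | m).
    """
    terms = [
        math.log(prior[z])
        + log_doc_prob(message, msg_probs[z])
        + log_doc_prob(response, resp_probs[z])
        for z in range(len(prior))
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy parameters (made up): class 0 pairs questions with answers,
# class 1 pairs thanks with acknowledgements.
prior = [0.5, 0.5]
msg_probs = [{"why": 0.9, "thanks": 0.1}, {"why": 0.1, "thanks": 0.9}]
resp_probs = [{"because": 0.9, "welcome": 0.1}, {"because": 0.1, "welcome": 0.9}]

likely = log_pair_score(["why"], ["because"], prior, msg_probs, resp_probs)
unlikely = log_pair_score(["why"], ["welcome"], prior, msg_probs, resp_probs)
```

With these toy numbers the matching pair scores 0.41 and the mismatched pair 0.09 before taking logs, so candidate responses to the same message can be ranked by this score.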

16 Example

17

18

19

20

21 Classification Performance
Labeled ~100 messages with speech acts
– M/R model: 40-60%
– Single-message NB: 20-30%
Need more labels

