Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.


1 Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD

2 Outline SIGKDD: Text Mining Workshop: Session: Mining Time-Tagged Text –Mining of Concurrent Text and Time Series –TimeMines: Constructing Timelines with Statistical Models of Word Usage Session: Text Mining Applications: –Mining E-mail Authorship

3 Mining of Concurrent Text and Time Series Ænalyst Predicting trends in stock prices based on the content of news stories that precede the trends Two types of data –Financial time series –Time-stamped news stories How to connect? –Learn a language model for every trend type

4 Mining of Concurrent Text and Time Series System Design Time-Series Data (Stock Price) Trends Textual Data (News Articles) Relevant Documents Align Trends With Documents Language Model For Trend-Type New Document Likelihood That the Document Is from Each Model

5 Mining of Concurrent Text and Time Series Redescribe Time Series Identifying Trends Discretizing Trends –This step is a subjective one in which we assign labels to segments based on their characteristics: Length, Slope, Intercept, r²
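
The four segment characteristics named on this slide can be computed with an ordinary least-squares fit. The following is a minimal sketch, not the authors' implementation: the function names, the label names ("surge", "plunge", "flat"), and the slope threshold are all assumptions made for illustration.

```python
def describe_segment(prices):
    """Fit y = slope * x + intercept to one segment by least squares."""
    n = len(prices)
    if n < 2:
        raise ValueError("a segment needs at least two points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, prices))
    syy = sum((y - mean_y) ** 2 for y in prices)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r^2 is the share of variance explained by the line. A perfectly flat
    # segment has no variance, so 1.0 is reported for it by convention.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return {"length": n, "slope": slope, "intercept": intercept, "r2": r2}


def label_segment(features, threshold=0.5):
    """Discretize a segment. Labels and threshold are hypothetical."""
    if features["slope"] > threshold:
        return "surge"
    if features["slope"] < -threshold:
        return "plunge"
    return "flat"
```

Because the labelling step is described as subjective, the threshold here is only a placeholder; in practice it would be tuned by inspecting the segments.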

6 Mining of Concurrent Text and Time Series Clustering Agglomerative clustering
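
Agglomerative clustering starts with every item in its own cluster and repeatedly merges the two closest clusters. The sketch below assumes each trend segment is a numeric tuple (for example slope and r²) and uses the distance between centroids; the slide does not say which linkage was used, so that choice is an assumption.

```python
def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]

    def centroid(cluster):
        dims = len(cluster[0])
        return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]

    def distance(c1, c2):
        # Euclidean distance between cluster centroids (assumed linkage).
        a, b = centroid(c1), centroid(c2)
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    while len(clusters) > k:
        # Find the pair of clusters with the smallest centroid distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Stopping at a fixed `k` is one option; stopping when the smallest distance exceeds a cutoff would work equally well with the same loop.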

7 Mining of Concurrent Text and Time Series Language Models (I) A Language Model represents a discrete distribution over the words in the vocabulary
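
A discrete distribution over a vocabulary can be estimated from word counts. This is a minimal unigram sketch; the whitespace tokenization and the additive smoothing constant `alpha` are assumptions, not details taken from the slides.

```python
from collections import Counter


def unigram_model(documents, vocabulary, alpha=1.0):
    """Estimate P(word) over a fixed vocabulary with additive smoothing."""
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()
        if word in vocabulary
    )
    # Every vocabulary word receives alpha pseudo-counts so that no word
    # ends up with probability zero.
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}
```

The returned probabilities sum to one over the vocabulary, which is what makes the model a proper distribution.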

8 Mining of Concurrent Text and Time Series Language Models (II) A Language Model can separate stories that are followed by a surge from stories that are not
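
One way to separate stories is to score a new document under the model of each trend type and pick the most likely one. The sketch below assumes each model is a dictionary mapping words to probabilities and that words outside the model are ignored; the trend names in the usage note are hypothetical.

```python
import math


def log_likelihood(document, model):
    """Sum of log P(word) over the document's in-vocabulary words."""
    return sum(
        math.log(model[word])
        for word in document.lower().split()
        if word in model
    )


def most_likely_trend(document, models):
    """Return the trend type whose model gives the document the highest score."""
    return max(models, key=lambda trend: log_likelihood(document, models[trend]))
```

For example, with a "surge" model that favours the word "profit" and a "plunge" model that favours "loss", a story mentioning "profit" scores higher under the "surge" model.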

9 Mining of Concurrent Text and Time Series Current Alignment A document may be associated with more than one trend. It is possible for d_2 to influence both trends t_1 and t_2.
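
The many-to-many association can be sketched by pairing each document with every trend that begins shortly after the document appears. The window length and the data layout below are assumptions chosen for illustration.

```python
def align(documents, trends, window_hours=5.0):
    """Pair each document with every trend starting within the window.

    documents: list of (doc_id, publication time in hours)
    trends:    list of (trend_id, start time in hours)
    """
    pairs = []
    for doc_id, published in documents:
        for trend_id, start in trends:
            # A document can only influence trends that start after it
            # was published, and only within the assumed window.
            if 0 <= start - published <= window_hours:
                pairs.append((doc_id, trend_id))
    return pairs
```

With documents at hours 0 and 3 and trends starting at hours 4 and 7, the second document falls inside the window of both trends and is therefore paired with each.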

10 TimeMines: Constructing Timelines with Statistical Models of Word Usage Automatically generates timelines from date-tagged free text corpora Construct overviews of text corpora suitable for browsing using timelines Identify time-dependent features that identify important topics in text documents

11 TimeMines Systems Overview Process steps to discover features in text

12 TimeMines The Model for Extracting Features Stationary random model –The occurrence of a feature depends only on its base rate, and does NOT vary with time. –The arrival of features is a random process with an unknown binomial distribution Extracting Features –Noun phrases and named entities –Label as noun phrases any groups of words of length less than 6 which matched the regular expression (NOUN|ADJECTIVE)*NOUN
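
The noun-phrase rule can be applied to part-of-speech tags by encoding each tag as one letter and running the regular expression over the resulting string. This sketch assumes the text has already been tagged and that the tag names are exactly `NOUN` and `ADJECTIVE`; the tagger itself is outside its scope.

```python
import re

# One letter per tag so the tag sequence becomes a plain string.
TAG_CODES = {"NOUN": "N", "ADJECTIVE": "A"}
# Letter-level equivalent of (NOUN|ADJECTIVE)*NOUN.
PATTERN = re.compile(r"[NA]*N")


def noun_phrases(tagged_words, max_len=6):
    """Return matching word groups shorter than max_len words."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_words)
    phrases = []
    for match in PATTERN.finditer(codes):
        if match.end() - match.start() < max_len:
            words = tagged_words[match.start():match.end()]
            phrases.append(" ".join(word for word, _ in words))
    return phrases
```

Because each tag maps to exactly one character, match offsets in the code string line up with word positions in the input list.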

13 TimeMines Finding Significant Features Many statistics can be used to characterize a 2x2 Contingency Table –EMIM: Expected Mutual Information Measure –KL: Kullback-Leibler divergence –χ²: Chi-Square

         f_0   ~f_0
 t_0      a     b
~t_0      c     d
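
Of the three statistics listed, chi-square has a simple closed form for a 2x2 table with cells a, b, c, d. The sketch below uses the standard formula N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)); reading a as the count of documents in the time window that contain the feature is an assumption about the table layout.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # An empty row or column carries no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

With one degree of freedom, a value above roughly 3.84 corresponds to significance at the 0.05 level, which gives one possible cutoff for calling a feature significant.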

14 TimeMines Grouping Significant Features The assumption that two features f_j and f_k have independent distributions implies that P(f_k) = P(f_k | f_j)

         f_j   ~f_j
 f_k      a     b
~f_k      c     d
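
The independence condition can be checked directly from the co-occurrence counts. In this sketch a is assumed to count documents containing both features, b those with f_k only, c those with f_j only, and d those with neither; the function name is hypothetical.

```python
def occurrence_probabilities(a, b, c, d):
    """Return (P(f_k), P(f_k | f_j)) from 2x2 co-occurrence counts."""
    n = a + b + c + d
    p_fk = (a + b) / n            # f_k appears in a + b of the n documents
    p_fk_given_fj = a / (a + c)   # among the a + c documents with f_j
    return p_fk, p_fk_given_fj
```

When the two values are close, the features look independent; a conditional probability well above the base rate suggests the features belong to the same topic and can be grouped.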

15 TimeMines Systems Image The pop-up window shows significant named entities of Oklahoma, FBI, Justice Department, etc.

16 Mining E-mail Authorship Authorship identification or categorisation of E-mail documents E-mail document features –Structural characteristics –Linguistic evidence Support Vector Machine

17 Mining E-mail Authorship E-mail document body attributes Structural features pattern of vocabulary usage Stylistic Sub-stylistic features
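
Attributes of this kind are usually turned into a numeric vector before being passed to a classifier. The sketch below is purely illustrative: the four features are hypothetical examples of structural and vocabulary-usage measurements and are not the attribute set used in the presented work.

```python
def email_features(body):
    """Compute a few illustrative structural and vocabulary features."""
    lines = body.splitlines()
    words = body.split()
    distinct = {word.lower() for word in words}
    blank = sum(1 for line in lines if not line.strip())
    return {
        # Structural: overall layout of the message body.
        "num_lines": len(lines),
        "blank_line_ratio": blank / len(lines) if lines else 0.0,
        # Vocabulary usage: word length and lexical variety.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "type_token_ratio": len(distinct) / len(words) if words else 0.0,
    }
```

Each message then becomes one row of numbers, which is the input form a Support Vector Machine expects.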

18 Mining E-mail Authorship Experimental Results SVMlight F-measure with β=1.0
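
The F-measure combines precision and recall, with β weighting recall relative to precision; at β = 1.0 it reduces to their harmonic mean. A minimal sketch of the standard formula follows; returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

For instance, a precision of 0.8 and a recall of 0.4 give an F-measure of about 0.533 at β = 1.0, noticeably lower than their arithmetic mean of 0.6.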

