Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Outline
SIGKDD Text Mining Workshop
Session: Mining Time-Tagged Text
– Mining of Concurrent Text and Time Series
– TimeMines: Constructing Timelines with Statistical Models of Word Usage
Session: Text Mining Applications
– Mining E-mail Authorship

Mining of Concurrent Text and Time Series
Ænalyst
Predicting trends in stock prices based on the content of news stories that precede the trends
Two types of data
– Financial time series
– Time-stamped news stories
How to connect?
– Learn a language model for every trend type

Mining of Concurrent Text and Time Series
System Design
– Time-Series Data (Stock Price)
– Trends
– Textual Data (News Articles)
– Relevant Documents
– Align Trends With Documents
– Language Model For Trend-Type
– New Document
– Likelihood That the Document Is from Each Model

Mining of Concurrent Text and Time Series
Redescribe Time Series
Identifying Trends
Discretizing Trends
– This step is a subjective one in which we assign labels to segments based on their characteristics: length, slope, intercept, r²
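The segment characteristics named on this slide (slope, intercept, r²) can be computed with an ordinary least-squares fit. The following is a minimal sketch, not the presenter's implementation: the function names, the label names, and the `threshold` value are all assumptions chosen for illustration.

```python
from statistics import mean

def fit_segment(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept, r_squared). Needs at least two points.
    """
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # A perfectly flat segment has no variance to explain; treat the fit as exact.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def label_segment(slope, threshold=0.5):
    """Discretize a segment by slope. Labels and threshold are hypothetical."""
    if slope > threshold:
        return "surge"
    if slope < -threshold:
        return "plunge"
    return "flat"
```

The choice of threshold is exactly the subjective part the slide mentions: a different cutoff yields a different set of trend labels.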

Mining of Concurrent Text and Time Series
Clustering
Agglomerative clustering
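Agglomerative clustering starts with every item in its own cluster and repeatedly merges the two closest clusters. The sketch below is one simple variant (centroid distance, stop at `k` clusters); the slide does not say which linkage or stopping rule was used, and representing each trend segment as a (slope, r²) pair is an assumption.

```python
from itertools import combinations
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # combinations() yields index pairs with i < j, so popping j keeps i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical segments described by (slope, r_squared).
segments = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
groups = agglomerate(segments, k=2)
```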

Mining of Concurrent Text and Time Series
Language Models (I)
A language model represents a discrete distribution over the words in the vocabulary
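A discrete distribution over the vocabulary can be estimated from word counts. This is a minimal unigram sketch; the add-`alpha` smoothing and the whitespace tokenization are assumptions, since the slide does not specify how the models were estimated.

```python
from collections import Counter

def unigram_model(docs, vocab, alpha=1.0):
    """Estimate P(word) over `vocab` from a list of documents.

    Add-alpha smoothing gives every vocabulary word a non-zero probability.
    """
    counts = Counter(w for doc in docs for w in doc.split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

# Made-up example data.
vocab = {"profit", "up", "beat", "loss"}
model = unigram_model(["profit up", "profit beat"], vocab)
```

In this example the probabilities sum to 1 across the four vocabulary words, with "profit" receiving the largest share because it occurs most often.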

Mining of Concurrent Text and Time Series
Language Models (II)
A language model can separate stories that are followed by a surge from stories that are not
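One way to use per-trend language models for this separation is to score a new story under each model and compare the log-likelihoods. The sketch below assumes unigram models stored as word-to-probability dictionaries; the probabilities, the `floor` value for unseen words, and the function names are made up for illustration.

```python
from math import log

def log_likelihood(doc, model, floor=1e-6):
    """Sum of log P(word) under a unigram model; unseen words get `floor`."""
    return sum(log(model.get(w, floor)) for w in doc.split())

# Hypothetical models: one for stories that precede a surge, one for all others.
surge_model = {"profit": 0.5, "beat": 0.3, "loss": 0.2}
other_model = {"profit": 0.2, "beat": 0.1, "loss": 0.7}

def more_likely_surge(doc):
    """True if the story is more probable under the surge model."""
    return log_likelihood(doc, surge_model) > log_likelihood(doc, other_model)
```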

Mining of Concurrent Text and Time Series
Current Alignment
A document may be associated with more than one trend
It is possible for d_2 to influence both trends t_1 and t_2.
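A simple way to realize this alignment is to associate each document with every trend that begins within a fixed window after the document's timestamp. The window-based rule and the name `window` are assumptions; the slide only states that a document can be linked to more than one trend.

```python
def align(doc_times, trend_starts, window):
    """Map each document index to the trends starting within `window` after it."""
    return {
        d: [t for t, start in enumerate(trend_starts) if 0 <= start - ts <= window]
        for d, ts in enumerate(doc_times)
    }

# Made-up timestamps (same time unit for documents and trends).
alignment = align(doc_times=[0, 4, 9], trend_starts=[5, 8], window=5)
```

With these made-up timestamps the second document (index 1) precedes both trend starts within the window, so it is aligned with both trends, which is the situation the slide describes for d_2.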

TimeMines: Constructing Timelines with Statistical Models of Word Usage
Automatically generates timelines from date-tagged free-text corpora
Constructs overviews of text corpora suitable for browsing using timelines
Identifies time-dependent features that mark important topics in text documents

TimeMines Systems Overview Process steps to discover features in text

TimeMines
The Model for Extracting Features
Stationary random model
– The occurrence of a feature depends only on its base rate and does NOT vary with time.
– The arrival of features is a random process with an unknown binomial distribution
Extracting Features
– Noun phrases and named entities
– Label as noun phrases any groups of words of length less than 6 that match the regular expression (NOUN|ADJECTIVE)*NOUN
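The noun-phrase rule above can be applied to part-of-speech-tagged text by encoding each tag as one character and running the regular expression over the resulting string. This is a sketch under assumptions: the input is a list of (word, tag) pairs from some tagger, the tag names `NOUN` and `ADJECTIVE` follow the slide, and the helper name is hypothetical.

```python
import re

# (NOUN|ADJECTIVE)*NOUN with N = noun, A = adjective.
NOUN_PHRASE = re.compile(r"[NA]*N")

def noun_phrases(tagged, max_len=5):
    """Return phrases of at most `max_len` words matching (NOUN|ADJECTIVE)*NOUN."""
    # Any tag other than NOUN or ADJECTIVE becomes "x" and breaks a phrase.
    code = "".join({"NOUN": "N", "ADJECTIVE": "A"}.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NOUN_PHRASE.finditer(code)
        if m.end() - m.start() <= max_len  # "length less than 6"
    ]

tagged = [
    ("federal", "ADJECTIVE"), ("building", "NOUN"), ("was", "VERB"),
    ("bombed", "VERB"), ("in", "PREP"), ("Oklahoma", "NOUN"), ("City", "NOUN"),
]
phrases = noun_phrases(tagged)
```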

TimeMines
Finding Significant Features
Many statistics can be used to characterize a 2x2 contingency table
– EMIM: Expected Mutual Information Measure
– KL: Kullback-Leibler divergence
– χ²: Chi-square
(Table: 2x2 contingency table over feature f_0 / ~f_0 and time t_0, with cell counts a, b, c, d)
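For a 2x2 contingency table with cell counts a, b, c, d, the chi-square statistic has the closed form N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)), where N = a+b+c+d. The sketch below computes it; the function name is hypothetical, and no particular significance threshold is taken from the slides.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # An empty row or column makes the statistic undefined; report 0.0 instead.
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom
```

A value of 0 means the observed counts match what independence would predict; larger values indicate that the feature's occurrence depends on the time period.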

TimeMines
Grouping Significant Features
The assumption that two features f_j and f_k have independent distributions implies that P(f_k) = P(f_k | f_j)

         f_j    ~f_j
 f_k      a      b
~f_k      c      d
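The independence condition can be checked directly from the co-occurrence table, reading it as columns f_j / ~f_j and rows f_k (a, b) / ~f_k (c, d). Under that layout P(f_k) = (a+b)/N and P(f_k | f_j) = a/(a+c). The sketch below returns both estimates; the function name and the example counts are made up.

```python
def independence_estimates(a, b, c, d):
    """Return (P(f_k), P(f_k | f_j)) estimated from the 2x2 table.

    Rows are f_k (a, b) and ~f_k (c, d); columns are f_j and ~f_j.
    """
    n = a + b + c + d
    p_fk = (a + b) / n
    p_fk_given_fj = a / (a + c)
    return p_fk, p_fk_given_fj

# Independent features: both estimates are equal.
print(independence_estimates(10, 20, 20, 40))
# Strongly co-occurring features: the conditional estimate is much larger.
print(independence_estimates(30, 5, 5, 60))
```

Features whose conditional estimate differs strongly from the base rate are candidates for being grouped into the same topic.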

TimeMines Systems Image The pop-up window shows significant named entities of Oklahoma, FBI, Justice Department, etc.

Mining E-mail Authorship
Authorship identification or categorisation of E-mail documents
E-mail document features
– Structural characteristics
– Linguistic evidence
Support Vector Machine

Mining E-mail Authorship
E-mail document body attributes
Structural features
Pattern of vocabulary usage
Stylistic and sub-stylistic features
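Body attributes of this kind are usually turned into a numeric feature vector before classification. The sketch below extracts a few illustrative attributes; the specific features chosen here (line counts, average word length, vocabulary richness) are assumptions and not the attribute set used in the presented work.

```python
def email_features(body):
    """Compute a small, hypothetical set of structural and vocabulary features."""
    lines = body.splitlines()
    words = body.split()
    n_words = len(words)
    return {
        # Structural: layout of the message body.
        "num_lines": len(lines),
        "num_blank_lines": sum(1 for line in lines if not line.strip()),
        # Vocabulary usage: word length and distinct-word ratio.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocab_richness": len({w.lower() for w in words}) / n_words if n_words else 0.0,
    }

features = email_features("Hi Bob\n\nSee you soon\n")
```

A dictionary like this can then be flattened into a fixed-order vector and passed to a classifier such as a support vector machine.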

Mining E-mail Authorship
Experimental Results
SVMlight
F-measure with β=1.0
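The F-measure combines precision P and recall R as F_β = (1 + β²)·P·R / (β²·P + R); with β = 1.0 it is their harmonic mean. A minimal sketch follows; the function name is hypothetical and the numbers in the usage lines are made up, not results from the presentation.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 weights precision and recall equally."""
    b2 = beta * beta
    denom = b2 * precision + recall
    # Avoid division by zero when the weighted sum is zero.
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

print(f_measure(0.5, 0.5))   # equal precision and recall
print(f_measure(0.8, 0.4))   # harmonic mean is pulled toward the lower value
```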
