Ping-Tsun Chang Intelligent Systems Laboratory Computer Science and Information Engineering National Taiwan University Text Mining with Machine Learning.

2 Ping-Tsun Chang Intelligent Systems Laboratory Computer Science and Information Engineering National Taiwan University Text Mining with Machine Learning Techniques

3 Text Analysis: Language Identification, Classification, Clustering, Summarization, Feature Selection

4 Text Mining
Text mining is about looking for patterns in natural language text
–Natural Language Processing
May be defined as the process of analyzing text to extract information from it for particular purposes
–Information Extraction
–Information Retrieval

5 Text Mining and Knowledge Management
A recent study indicated that 80% of a company's information is contained in text documents
–emails, memos, customer correspondence, and reports
The ability to distil this untapped source of information gives a company a substantial competitive advantage in the era of a knowledge-based economy.

6 Text Mining Applications
Customer profile analysis
–mining incoming emails for customer complaints and feedback.
Patent analysis
–analyzing patent databases for major technology players, trends, and opportunities.
Information dissemination
–organizing and summarizing trade news and reports for personalized information services.
Company resource planning
–mining a company's reports and correspondence for activities, status, and problems reported.

7 Text Categorization: Problem Definition
Text categorization is the problem of automatically assigning predefined categories to free-text documents
–Document classification
–Web page classification
–News classification

8 Information Retrieval
Full text is hard to process, but it is a complete representation of the document
Logical view of documents
Models
–Boolean Model
–Vector Model
–Probabilistic Model
Think of text as patterns?

9 Evaluation
[Figure: Retrieved and Relevant document sets, with regions labeled a, b, c, d]
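
Retrieval quality is usually summarized from such counts as precision and recall. A minimal sketch follows, assuming a is the number of documents that are both retrieved and relevant, b the number retrieved but not relevant, and c the number relevant but not retrieved. This mapping of the labels is an assumption, not something the slide states, and the function names are illustrative.

```python
def precision_recall(a, b, c):
    """Precision and recall from counts.

    Assumed meaning of the counts (not confirmed by the slide):
      a: retrieved and relevant
      b: retrieved but not relevant
      c: relevant but not retrieved
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = precision_recall(a=30, b=10, c=20)
print(p, r, f1_score(p, r))
```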

10 Pattern Recognition
[Figure: processing pipeline with stages Sensing, Segmentation, Feature Extraction, Classification, Post-Processing, Decision]

11 Pattern Classification
[Figure: classes C1 and C2 plotted against features f1 and f2]

12 Machine Learning
Using computers to help us perform induction from large amounts of complex pattern data
Bayesian Learning
Instance-Based Learning
–K-Nearest Neighbors
Neural Networks
Support Vector Machine

13 Feature Selection (I): Information Gain
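
Information gain for a term is commonly defined as the reduction in category entropy obtained by knowing whether the term occurs in a document. A minimal sketch of that definition, assuming each document is represented as a set of tokens. The function names and toy data are illustrative and do not come from the presentation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """H(C) minus the expected entropy after splitting on term presence."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
print(information_gain(docs, labels, "goal"))   # term that splits the classes cleanly
print(information_gain(docs, labels, "match"))  # less informative term
```

Terms would then be ranked by this score and only the highest-scoring ones kept as features.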

14 Feature Selection (II): Mutual Information, Chi-Square
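
Both measures can be computed from a 2x2 term/category contingency table. The sketch below uses the common text-categorization formulation, with the following assumed counts: a = documents in the category containing the term, b = documents outside the category containing the term, c = documents in the category without the term, d = documents outside the category without the term. The names are illustrative.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def mutual_information(a, b, c, d):
    """Pointwise mutual information (bits) between term and category."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never occurs in the category
    return math.log2(a * n / ((a + c) * (a + b)))


print(chi_square(2, 0, 0, 2), mutual_information(2, 0, 0, 2))  # perfectly associated
print(chi_square(1, 1, 1, 1), mutual_information(1, 1, 1, 1))  # independent
```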

15 Weighting Scheme: TF ‧ IDF
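
TF-IDF has several variants. The sketch below assumes the simplest one: raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The presentation may have used a different variant, and the function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


docs = [["svm", "text", "text"], ["svm", "kernel"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document, such as "svm" here, gets weight 0 because it does not help distinguish documents.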

16 Similarity Evaluation: Cosine-Like Scheme
[Figure: document vectors d_i and d_j]
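
The cosine measure compares two document vectors by the angle between them rather than by their lengths. A minimal sketch, assuming documents are sparse {term: weight} dictionaries (a representation chosen here for illustration):

```python
import math


def cosine(d_i, d_j):
    """Cosine of the angle between two sparse document vectors."""
    dot = sum(w * d_j.get(t, 0.0) for t, w in d_i.items())
    norm_i = math.sqrt(sum(w * w for w in d_i.values()))
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_i * norm_j)


print(cosine({"a": 3, "b": 4}, {"a": 3}))
print(cosine({"a": 1}, {"b": 1}))
```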

17 Machine Learning Approaches: Bayesian Classifier
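
A Bayesian text classifier is often realized as multinomial naive Bayes. The sketch below assumes that variant with add-one (Laplace) smoothing. It is an illustration of the general approach, not necessarily the classifier used in the presentation, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts, per-class term counts, and the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        term_counts[y].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab


def classify_nb(model, doc):
    """Return the class maximizing log P(c) + sum of log P(t|c)."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        total = sum(term_counts[c].values())
        score = math.log(n_c / n_docs)
        for t in doc:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


docs = [["goal", "match"], ["goal", "team"], ["vote", "party"], ["vote", "law"]]
labels = ["sport", "sport", "politics", "politics"]
model = train_nb(docs, labels)
print(classify_nb(model, ["goal", "team"]))
```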

18 Machine Learning Approaches: kNN Classifier
[Figure: unlabeled document d (marked "?") among labeled neighbors]
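
A k-nearest-neighbor classifier labels a new document d by majority vote among the k most similar training documents. A minimal sketch, assuming cosine similarity over sparse {term: weight} dictionaries and an unweighted vote. Both choices are assumptions, and the names are illustrative.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(query, training, k=3):
    """training is a list of (vector, label) pairs."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


training = [
    ({"goal": 1, "match": 1}, "sport"),
    ({"goal": 1, "team": 1}, "sport"),
    ({"vote": 1, "party": 1}, "politics"),
    ({"vote": 1, "law": 1}, "politics"),
]
print(knn_classify({"goal": 1}, training, k=3))
```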

19 Machine Learning Approaches: Support Vector Machine
Basic hypotheses: consistent hypotheses of the version space
Project the original training data in space X to a higher-dimensional feature space F via a Mercer operator K
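
A Mercer kernel K computes the inner product in F without building the feature vectors explicitly. The sketch below illustrates this with the degree-2 polynomial kernel on two-dimensional inputs. The kernel is chosen here only as an example, since the slide does not say which kernel was used.

```python
import math


def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated in the input space X."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map into F whose inner product equals K."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


x, z = (1.0, 2.0), (3.0, 4.0)
print(poly2_kernel(x, z))    # computed in the 2-D input space
print(dot(phi(x), phi(z)))   # same value (up to rounding) in the 3-D feature space
```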

20 Compare: SVM and Traditional Learners
Traditional Learner
SVM
Access the hypothesis space!
[Figure: distributions over hypotheses for P(h), P(h|D1), and P(h|D1 ∧ D2)]

21 SVM Learning in Feature Spaces
Example: [Figure: mapping from input space X to feature space F]

22 Support Vector Machine (cont’d)
Nonlinear
–Example: XOR Problem
Natural Language is Nonlinear!
[Figure: XOR data plotted against features f1 and f2]
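
XOR is the standard example of a problem that no single line in the (f1, f2) plane can separate. The sketch below assumes inputs encoded as -1/+1 (the slide gives no encoding) and shows that adding the product feature f1*f2 makes the problem linearly separable. The feature map and weights are hand-picked for illustration, not learned.

```python
def phi(f1, f2):
    """Map the 2-D input to a 3-D feature space by adding the product f1*f2."""
    return (f1, f2, f1 * f2)


def linear_decision(w, b, z):
    """Sign of the linear discriminant w . z + b."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else -1


# XOR with -1/+1 encoding: the label is +1 exactly when the inputs differ.
XOR = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

# A single hyperplane in the feature space: only the product feature matters.
w, b = (0.0, 0.0, -1.0), 0.0
predictions = {x: linear_decision(w, b, phi(*x)) for x in XOR}
print(predictions == XOR)
```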

23 Support Vector Machine (cont’d)
Consistent hypotheses
Maximum margin
Support Vector

24 Statistical Learning Theory
[Figure: learning model. The Generator draws x from P(x); the Supervisor returns y according to P(y|x); the Learner outputs y* = F(x)]

25 Support Vector Machine: Linear Discriminant Functions
Linear discriminant space
Hyperplane
[Figure: hyperplane in the (y1, y2) plane separating the regions g(y)>1 and g(y)<1]

26 Learning of Support Vector Machine
Maximize margin
Minimize ||a||
Optimal hyperplane
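
The two goals are the same goal. Assume the usual canonical form, in which every training point y_i with label t_i satisfies t_i (a . y_i + b) >= 1, with equality for the closest points. The distance from the closest point to the hyperplane is then 1/||a||, so a smaller ||a|| means a larger margin. The sketch below checks this on a toy data set; the bias term b and the function name are assumptions introduced for illustration.

```python
import math


def geometric_margin(a, b, points, labels):
    """Smallest signed distance from any labeled point to a . y + b = 0."""
    norm = math.sqrt(sum(ai * ai for ai in a))
    return min(t * (sum(ai * yi for ai, yi in zip(a, y)) + b) / norm
               for y, t in zip(points, labels))


points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]

# Both weight vectors satisfy t_i (a . y_i + b) >= 1 with equality,
# but the one with the smaller norm yields the larger margin.
print(geometric_margin((1.0, 0.0), 0.0, points, labels))  # ||a|| = 1
print(geometric_margin((1.0, 1.0), 0.0, points, labels))  # ||a|| = sqrt(2)
```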

27 Version Space
Hypothesis Space H
Version Space V
[Figure: version space V within hypothesis space H]

28 Support Vector Machine Active Learning
Why Support Vector Machine?
–Text categorization involves large amounts of data
–Traditional learning causes over-fitting
–Language is complex and nonlinear
Why Active Learning?
–Labeling instances is time-consuming and costly
–Reduces the need for labeled training instances

29 Active Learning: History
Text Classification: [Rocchio, 71] [Dumais, 98]
Support Vector Machine: [Vapnik, 82]
Text Classification with Support Vector Machine: [Joachims, 98] [Dumais, 98]
Pool-Based Active Learning: [Lewis, Gale ‘94] [McCallum, Nigam ‘98]
The Nature of Statistical Learning Theory [Vapnik, 95]
Automated Text Categorization Using Support Vector Machine [Kwok, 98]

30 Active Learning
Pool-based active learning has a pool U of unlabeled instances
Active learner l has three components (f, q, X)
–f: classifier x -> {-1, 1}
–q: querying function q(X); given the labeled training set X, it decides which instance in U to query next
–X: labeled training data

31 Active Learning (cont’d)
Main difference: the querying component q
How to choose the next unlabeled instance to query?
[Figure: resulting version space]

32 Active Learner
Active learner l* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space
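
Finding the instance that exactly halves the version space is generally impractical. A common approximation, assumed here and not stated on the slide, is the "simple margin" heuristic: query the unlabeled instance closest to the current SVM hyperplane. A minimal sketch with a hypothetical weight vector w and bias b:

```python
def simple_margin_query(w, b, pool):
    """Index of the pool instance closest to the hyperplane w . x + b = 0.

    Dividing by ||w|| is unnecessary because it is the same for every instance.
    """
    def distance(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)

    return min(range(len(pool)), key=lambda i: distance(pool[i]))


pool = [(2.0, 0.0), (0.1, 5.0), (-3.0, 1.0)]
w, b = (1.0, 0.0), 0.0
print(simple_margin_query(w, b, pool))  # picks the instance nearest the boundary
```

After the chosen instance is labeled, it would be added to X, the classifier retrained, and the query repeated on the remaining pool.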

33 Experiments: Bayesian Classifier

34 Comparison of Learning Methods
[Figure: precision (0.2 to 1) versus training data size (0 to 60) for SVM, kNN, NB, and NNet]

35 Conclusions
Text mining extracts knowledge from text
Support Vector Machine is nearly the best statistics-based machine learning method
Natural Language Understanding is still an open problem
