A Neural Network Approach to Topic Spotting Presented by: Loulwah AlSumait INFS 795 Spec. Topics in Data Mining 4.14.2005.


1 A Neural Network Approach to Topic Spotting Presented by: Loulwah AlSumait INFS 795 Spec. Topics in Data Mining 4.14.2005

2 Article Information Published in: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995. Authors: Wiener, E.; Pedersen, J.O.; Weigend, A.S. 54 citations.

3 Summary Introduction  Related Work  The Corpus Representation  Term Selection  Latent Semantic Indexing Generic LSI Local LSI  Cluster-Directed LSI  Topic-Directed LSI Relevancy Weighting LSI

4 Summary Neural Network Classifier Neural Networks for Topic Spotting  Linear vs. Non Linear Networks  Flat Architecture vs. Modular Architecture Experiment Results  Evaluating Performance  Results & discussions

5 Introduction Topic Spotting = Text Categorization = Text Classification: the problem of identifying which of a set of predefined topics are present in a natural language document. [Figure: a document mapped to topics 1, 2, ..., n]

6 Introduction Classification approaches. Expert system approach: manually construct a system of inference rules on top of a large body of linguistic and domain knowledge; could be extremely accurate, but is very time consuming and brittle to changes in the data environment. Data-driven approach: induce a set of rules from a corpus of labeled training documents; practically better.

7 Introduction – Related Work The major remarks regarding the related work: a separate classifier was constructed for each topic, and a different set of terms was used to train each classifier.

8 Introduction – The Corpus Reuters-22173: corpus of Reuters newswire stories from 1987. 21,450 stories: 9,610 for training, 3,662 for testing; mean length 90.6 words, SD 91.6. 92 topics appeared at least once in the training set; the mean is 1.24 topics/doc (up to 14 topics for some documents). 11,161 unique terms after preprocessing: inflectional stemming, stop word removal, conversion to lower case, and elimination of words appearing in fewer than three documents.

9 Representations Starting point: Document Profile: a term-by-document matrix containing word frequency entries.
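The document-profile starting point can be illustrated with a small sketch (toy documents; the helper name is hypothetical):

```python
from collections import Counter

def term_by_document_matrix(docs):
    """Build a term-by-document matrix of raw word-frequency entries.

    Rows correspond to vocabulary terms (sorted); columns to documents."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted({t for c in counts for t in c})
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix

# Two toy "stories": each row of M is one term's frequency per document.
vocab, M = term_by_document_matrix(["wheat prices rose", "gold prices fell"])
```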

10 Representation [Figure: example document representation, from Thorsten Joachims. 1997. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. http://citeseer.ist.psu.edu/joachims97text.html]

11 Representation - Term Selection Select the subset of the original terms that are most useful for the classification task. It is difficult to select terms that discriminate between 92 classes while being small enough to serve as the feature set for a neural network, so: divide the problem into 92 independent classification tasks, and search for the best discriminator terms between documents with the topic and those without.

12 Representation - Term Selection Relevancy Score: measures how unbalanced a term is across documents with or without the topic. Highly positive and highly negative scores indicate useful terms for discrimination. Using about 20 terms yielded the best classification performance. (The slide shows the ratio: number of documents with topic t that contain term k, divided by the total number of documents with topic t.)
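A minimal sketch of a relevancy-style score as a smoothed log-ratio; the exact formula and the smoothing constant `d` are assumptions, not taken from the slide:

```python
import math

def relevancy_score(n_topic_with_term, n_topic,
                    n_other_with_term, n_other, d=0.5):
    """Sketch of a relevancy score: how much more often term k occurs in
    documents with topic t than in documents without it. Positive scores
    favor the topic, negative scores disfavor it; d smooths zero counts."""
    frac_with = (n_topic_with_term + d) / (n_topic + d)
    frac_without = (n_other_with_term + d) / (n_other + d)
    return math.log(frac_with / frac_without)
```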

13 Representation - Term Selection

14 Representation - Term Selection Advantage: little computation is required, and the resulting features have direct interpretability. Drawback: many of the best individual predictors contain redundant information, and a term which may appear to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa (e.g. Apple vs. Apple Computers). Selected Term Representation (TERMS) with 20 features

15 Representation – LSI Transform original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection. (Training set): apply a singular-value decomposition (SVD) to the original term-by-document matrix to get U, Σ, V. (Test set): transform document vectors by projecting them into the LSI space. Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss. Found: performance continues to improve up to at least 250 dimensions, but improvement rapidly slows down after about 100 dimensions. Generic LSI Representation (LSI) with 200 features
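The train/project steps above can be sketched as follows (toy random matrix; `k` here is small, where the paper uses 200; the fold-in of test documents via U and the singular values is the standard LSI construction):

```python
import numpy as np

# SVD of a toy term-by-document matrix A; keep the top k singular
# dimensions, then project new document vectors into that space.
rng = np.random.default_rng(0)
A = rng.random((50, 30))                 # 50 terms x 30 training documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                    # kept dimensions (200 in the paper)
U_k, s_k = U[:, :k], s[:k]

def project(doc_vec):
    """Fold a raw term-frequency vector into the k-dim LSI space."""
    return (doc_vec @ U_k) / s_k

test_doc = rng.random(50)
lsi_vec = project(test_doc)              # k-dimensional representation
```

Projecting a training column recovers the corresponding row of Vt, which is a quick sanity check on the fold-in.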

16 Representation – LSI [Figure: SVD applied to the full Reuters corpus (topics such as wool, barley, wheat, money-supply, zinc, gold) yields the Generic LSI Representation with 200 features]

17 Representation – Local LSI Global LSI performs worse as topic frequency decreases: infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of LSI and treated as mere noise. Proposed: two task-directed methods that make use of prior knowledge of the classification task.

18 Representation – Local LSI What is Local LSI? Model only the local portion of the corpus related to those topics; it includes documents that use terminology related to the topics (not necessarily having any of the topics assigned). Performing SVD over only the local set of documents makes the representation more sensitive to small, localized effects of infrequent terms and more effective for classification of topics related to that local structure.

19 Representation – Local LSI Types of Local LSI: Cluster-Directed representation. 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals. How to construct the local region? Break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic, and perform SVD for each meta-topic region. Cluster-Directed LSI Representation (CD/LSI) with 200 features

20 Representation – Local LSI [Figure: SVD over the full Reuters corpus, shown before restricting to a local region]

21 Representation – Local LSI [Figure: the Reuters corpus split into the five meta-topic clusters (Agriculture, Energy, Foreign Exchange, Government, Metal); a separate SVD per cluster yields the Cluster-Directed LSI Representation (CD/LSI) with 200 features]

22 Representation – Local LSI Types of Local LSI: Topic-Directed representation, a more fine-grained approach to local LSI with a separate representation for each topic. How to construct the local region? Use the 100 most predictive terms for the topic, then pick the N most similar documents, with N = 5 * (number of documents containing the topic) and 110 ≤ N ≤ 350. Final documents in the topic region = the N documents + 150 random documents. Topic-Directed LSI Representation (TD/LSI) with 200 features

23 Representation – Local LSI [Figure: the full Reuters corpus, shown before selecting the topic-directed local region]

24 Representation – Local LSI [Figure: SVD over the local topic region of the Reuters corpus yields the Topic-Directed LSI Representation (TD/LSI) with 200 features]

25 Representation – Local LSI Drawbacks of Local LSI: the narrower the region, the lower the flexibility of the representations for modeling the classification of multiple topics; and high computational overhead.

26 Representation - Relevancy Weighting LSI Use term weights to emphasize the importance of particular terms before applying SVD. IDF weighting increases the importance of low-frequency terms and decreases the importance of high-frequency terms; it assumes low-frequency terms to be better discriminators than high-frequency terms.

27 Representation - Relevancy Weighting LSI Relevancy Weighting: tune the IDF assumption to emphasize terms in proportion to their estimated topic discrimination power, via the Global Relevancy Weighting of term k (GRW_k). Final weighting of term k = IDF^2 * GRW_k: all low-frequency terms are pulled up by IDF, poor predictors are pushed down, leaving only relevant low-frequency terms with high weights. Relevancy Weighted LSI Representation (REL/LSI) with 200 features
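The combined weight IDF^2 * GRW_k can be sketched as follows, assuming the standard log-form IDF; GRW_k is passed in by the caller, since its exact definition is not shown on the slide:

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency (standard log form; an assumption)."""
    return math.log(n_docs / doc_freq)

def final_weight(n_docs, doc_freq, grw_k):
    """Final weighting of term k = IDF^2 * GRW_k, per the slide.

    IDF pulls up low-frequency terms; a small GRW_k pushes poor
    predictors back down, leaving relevant rare terms with high weight."""
    return idf(n_docs, doc_freq) ** 2 * grw_k
```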

28 Neural Network Classifier (NN) NN consists of:  processing units (Neurons)  weighted links connecting neurons

29 Neural Network Classifier (NN) Major components of an NN model: architecture: defines the functional form relating input to output (network topology, unit connectivity, activation functions, e.g. the logistic regression function).

30 Logistic regression function p = 1 / (1 + e^(-z)), where z is a linear combination of the input features and p ∈ (0, 1). Can be converted to a binary classification method by thresholding the output probability.
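The logistic output and its thresholded binary decision, as a small sketch (function names hypothetical):

```python
import math

def logistic(z):
    """Logistic activation: maps a linear combination z to p in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, threshold=0.5):
    """Binary topic decision by thresholding the output probability."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return logistic(z) >= threshold
```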

31 Neural Network Classifier (NN) Major components of an NN model (cont.): search algorithm: the search in weight space for a set of weights which minimizes the error between the output and the expected output (the TRAINING PROCESS); backpropagation method. Error functions: mean squared error; cross-entropy error performance function C = -sum over all cases and outputs of (d*log(y) + (1-d)*log(1-y)), where d is the desired output and y the actual output.
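The cross-entropy error performance function C above, written out directly (the small `eps` guard against log(0) is an implementation detail, not from the slide):

```python
import math

def cross_entropy(desired, actual, eps=1e-12):
    """C = -sum over outputs of (d*log(y) + (1-d)*log(1-y)).

    desired holds the target outputs d, actual the network outputs y."""
    return -sum(d * math.log(y + eps) + (1 - d) * math.log(1 - y + eps)
                for d, y in zip(desired, actual))
```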

32 NN for Topic Spotting Network outputs are estimates of the probability of topic presence given the feature vector of a document. Generic LSI representation: each network uses the same representation. Local LSI representation: a different representation for each network.

33 NN for Topic Spotting Linear NN: output units with logistic activation and no hidden layer.

34 Non-Linear NN: simple networks with a single hidden layer of 6 to 15 logistic sigmoid units.

35 NN for Topic Spotting Flat Architecture: a separate network for each topic; the entire training set is used to train for each topic. Avoid overfitting by: adding a penalty term to the cross-entropy cost function to encourage elimination of small weights; and early stopping based on cross-validation.

36 NN for Topic Spotting Modular Architecture: decompose the learning problem into smaller problems. Meta-Topic Network: trained on the full training set to estimate the presence probability of the five meta-topics in a document; uses 15 hidden units.

37 NN for Topic Spotting Modular Architecture: five groups of local topic networks, consisting of a local topic network for each topic in the meta-topic; each network is trained only on the meta-topic region.

38 NN for Topic Spotting Modular Architecture: five groups of local topic networks (cont.). Example: the wheat network is trained on the Agriculture meta-topic region; it can focus on finer distinctions, e.g. wheat vs. grain, and doesn't waste time on easier distinctions, e.g. wheat vs. gold. Each local topic network uses 6 hidden units.

39 NN for Topic Spotting Modular Architecture: to compute topic predictions for a given document, present the document to the meta-topic network and to each of the topic networks; output of the meta-topic network * estimates of the topic networks = final topic estimates.
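The modular forward pass above can be sketched as follows (the network objects and their dictionary outputs are hypothetical stand-ins for the trained networks):

```python
def modular_prediction(doc_vec, meta_net, topic_nets):
    """Sketch of the modular forward pass: the meta-topic network returns
    P(meta-topic | doc); each group's local networks return estimates for
    their topics; the product gives the final topic estimate."""
    estimates = {}
    for meta, p_meta in meta_net(doc_vec).items():
        for topic, p_topic in topic_nets[meta](doc_vec).items():
            estimates[topic] = p_meta * p_topic
    return estimates
```

For example, with a stub meta network returning P(Agriculture) = 0.8 and a wheat network returning 0.5, the final wheat estimate is 0.4.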

40 Experimental Results Evaluating Performance: mean squared error between actual and predicted values is an inadequate measure; instead, compute precision and recall based on contingency tables constructed over a range of decision thresholds. How to get the decision thresholds?

41 Experimental Results Evaluating Performance: how to get the decision thresholds? Proportional assignment: predict Topic = 'wool' iff the output probability ≥ θ, where θ is the output probability of the kP-th highest ranked document (k an integer, P the prior probability of the 'wool' topic); otherwise predict Topic ≠ 'wool'.
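A sketch of proportional assignment as described above; reading "kP-th highest rank" as assigning roughly k * P * N documents is an assumption about the slide's intent:

```python
def proportional_threshold(probs, k, prior):
    """Pick the decision threshold by proportional assignment (sketch):
    rank documents by output probability and take the probability of the
    document at rank k * prior * N as the threshold."""
    ranked = sorted(probs, reverse=True)
    n_assign = max(1, round(k * prior * len(probs)))
    return ranked[min(n_assign, len(ranked)) - 1]
```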

42 Experimental Results Evaluating Performance: how to get the decision thresholds? Fixed recall level approach: determine a set of target recall levels, then analyze the ranked documents to determine which decision thresholds lead to the desired recall levels. Predict Topic = 'wool' iff the output probability ≥ θ, where θ is the output probability of the document at which the number of documents with higher output probability yields the desired recall level; otherwise predict Topic ≠ 'wool'.

43 Experimental Results Performance by Microaveraging: add all contingency tables together across topics at a certain threshold, then compute precision and recall; proportional assignment was used for picking decision thresholds. Microaveraging does not weight the topics evenly; it is used for comparisons to previously reported results. The breakeven point is used as a summary value.
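Microaveraging, summing per-topic contingency counts before computing precision and recall, can be sketched as (the (tp, fp, fn) tuple layout is an assumed simplification of the contingency tables):

```python
def micro_precision_recall(tables):
    """Microaveraging: sum contingency counts (tp, fp, fn) across all
    topics at one threshold, then compute precision and recall from the
    totals, so frequent topics dominate the averages."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return tp / (tp + fp), tp / (tp + fn)
```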

44 Experimental Results Performance by Macroaveraging: compute precision and recall for each topic, then take the average across topics, using a fixed set of recall levels. Summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95.
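Macroaveraging, averaging per-topic precision (itself averaged over the fixed recall levels) evenly across topics, as a sketch:

```python
def macro_average_precision(precisions_by_topic):
    """Macroaveraging: each inner list holds one topic's precision at the
    fixed recall levels; average within each topic first, then evenly
    across topics, so rare topics count as much as frequent ones."""
    per_topic = [sum(p) / len(p) for p in precisions_by_topic]
    return sum(per_topic) / len(per_topic)
```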

45 Experimental Results Microaveraged performance: breakeven points compared to the best previous algorithm, a rule induction method based on heuristic search with a breakeven point of 0.789. [Chart: breakeven points 0.82, 0.801, 0.795, 0.775]

46 Experimental Results Macroaveraged performance: TERMS appears much closer to the other three. The relative effectiveness of the representations at low recall levels is reversed at high recall levels.

47 Slight improvement from nonlinear networks. LSI performance degrades compared to TERMS as topic frequency f_t decreases. Performance of the six techniques on the 54 most frequent topics: considerable variation of performance across topics; the relative ups and downs are mirrored in both plots.

48 Experimental Results Performance of combinations of techniques and their improvement. [Table: document representations (TERMS, LSI, CD-LSI, TD-LSI, REL-LSI, Hybrid CD-LSI + TERMS) crossed with NN architectures (flat vs. modular, linear vs. non-linear; the modular meta-topic network is trained using the LSI representation). In the slide, matching colour and shape marks identify each experiment.]

49 Experimental Results Flat Networks

50 Experimental Results Modular Networks: only 4 clusters were used; average precision was recomputed for the flat networks.

51 Non-linear networks seem to perform better than the linear models, but the difference is very slight.

52 The LSI representation is able to equal or exceed TERMS performance for high-frequency topics, but performs poorly for low-frequency topics.

53 Task-directed LSI representations improve performance in the low-frequency domain. TD/LSI trade-off: computational cost. REL/LSI trade-off: lower performance on medium- and high-frequency topics.

54 Modular CD/LSI improves performance further for low-frequency topics, because the individual networks are trained only in the domain over which LSI was performed.

55 TERMS proves to be competitive with the more sophisticated LSI techniques: most topics are predictable by a small set of terms.

56 Discussion Rich solution: many representations and many models. Fully supervised approach. Results are lower than expected; is the dataset responsible? High computational overhead. Does NN deserve a place in DM tool boxes? Questions?

