HCC class lecture 13 comments


1 HCC class lecture 13 comments
John Canny 3/7/05

2 Administrivia

3 LSA LSA uses a standard matrix method (SVD) and a language weighting scheme (tfidf) to analyze texts as “BoW” (Bags of Words). Tfidf weights terms according to their “information content”. Common words like “at”, “to”, “the” have very low weight. Tfidf is used in many information retrieval algorithms. SVD computes the “principal components” of the document corpus.
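
As a rough illustration of the tfidf weighting described on this slide, here is a minimal sketch. The weighting formula (raw count times log(N / document frequency)), the function name `tfidf`, and the toy documents are assumptions chosen for illustration; many tfidf variants exist and the slides do not say which one is meant.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (bag of words, order ignored)."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(bags)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for bag in bags for term in bag)
    # Assumed variant: raw count * log(N / df). A term found in every
    # document gets log(1) = 0, i.e. no "information content".
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bag.items()}
        for bag in bags
    ]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tfidf(docs)
```

In this toy corpus “the” occurs in every document and so receives weight zero, while a word that occurs in only one document, such as “mat”, receives the largest weight.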

4 Why does LSA work? One useful property (second paper today) is that it recognizes words that are used in similar contexts. In other words, it performs paradigmatic analysis in Saussure’s terminology. Interestingly, it ignores syntagmatic structure (order and syntax) completely.
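
The “similar contexts” property can be seen on a toy example. The sketch below assumes NumPy, raw counts instead of tfidf, and a made-up four-block corpus in which “doctor” and “physician” never appear together but share the context words “hospital” and “patient”. The choice of k = 2 themes is also an assumption.

```python
import numpy as np

terms = ["doctor", "physician", "hospital", "patient", "guitar", "music"]
# Rows are text blocks, columns are terms (raw counts, no tfidf, for brevity).
M = np.array([
    [1, 0, 1, 1, 0, 0],   # "doctor hospital patient"
    [0, 1, 1, 1, 0, 0],   # "physician hospital patient"
    [0, 0, 0, 0, 1, 1],   # "guitar music"
    [0, 0, 0, 0, 1, 1],   # "guitar music"
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# SVD: M = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                     # number of themes kept (assumed)
term_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per term

i, j, g = (terms.index(t) for t in ("doctor", "physician", "guitar"))
raw_sim = cosine(M[:, i], M[:, j])                   # columns never overlap
lsa_sim = cosine(term_vecs[i], term_vecs[j])         # same contexts
unrelated_sim = cosine(term_vecs[i], term_vecs[g])   # different contexts
```

In the raw term space the two words have nothing in common, because they never occur in the same block. After truncation to two themes they land on essentially the same direction, since the dimension that told them apart is the one that was dropped. Word order plays no role anywhere in the computation.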

5 LSA review The input is a matrix M. Rows represent text blocks (sentences, paragraphs or documents). Columns are distinct terms. Matrix elements are term counts (x tfidf weight). The idea is to “factor” this matrix (D is diagonal): M = A D B, where M is (text blocks x terms), A is (text blocks x themes), D is (themes x themes) and B is (themes x terms).

6 LSA review A encodes the representation of each text block in a space of themes. B encodes each theme with term weights. It can be used to explicitly describe the theme. Diagram: M (text blocks x terms) = A (text blocks x themes) D (themes x themes) B (themes x terms).
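
The factorization reviewed on these two slides can be sketched with NumPy’s SVD routine. The term list, the count matrix, the choice of k = 2 themes and the helper `describe_theme` are all made up for illustration; only the roles of A, D and B come from the slides.

```python
import numpy as np

terms = ["ball", "team", "score", "vote", "party", "law"]
# Rows are text blocks, columns are terms (toy counts).
M = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
    [1, 0, 1, 1, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                 # number of themes to keep (assumed)
A = U[:, :k]          # text blocks x themes
D = np.diag(s[:k])    # themes x themes, diagonal
B = Vt[:k, :]         # themes x terms
M_k = A @ D @ B       # rank-k approximation of M

def describe_theme(theme, top=3):
    """List the terms with the largest absolute weight in one row of B."""
    order = np.argsort(-np.abs(B[theme]))[:top]
    return [terms[i] for i in order]
```

Keeping all singular values reproduces M exactly; keeping only the k largest gives the reduced theme space. Each row of A is then the theme-space representation of one text block, and `describe_theme` reads a theme off the corresponding row of B.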

7 Why does LSA work? LSA looks for meta-structure (topical or thematic or voice) that runs through an entire text block. It infers structure by comparison between texts rather than analysis within texts. The structures it discovers at this level are typically topics or themes within documents, or author voices.

8 Why does LSA work? LSA ignores a lot of important structure in a text (the syntactic and logical structure), but works reasonably well on many tasks anyway. An intuition for this is that there are many complex processes involved in text generation, but they still display distinct first- and second-order statistics. LSA efficiently encodes those statistics, and supports applications that compare them.

9 Why does LSA work? So far we have seen arguments for patterns of word use in large blocks of text, or entire documents – Bakhtin’s notion of voice. You can discover authorship using LSA. Other stable patterns of structure in documents include “activities” which we talk about next week. These structures fix particular actors, actions, and objects in the narrative. Genre is another attribute of corpora that might be amenable to LSA analysis.

10 LSA limitations LSA has a few assumptions that don’t make much sense:
If documents really do comprise different “themes” there shouldn’t be negative weights in the LSA matrices. LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian. SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
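
Two of these points, negative weights and forced orthogonality, can be checked directly on a small example. The sketch below assumes NumPy and a made-up non-negative count matrix; it says nothing about the Gaussian assumption.

```python
import numpy as np

# Toy non-negative count matrix: rows are text blocks, columns are terms.
M = np.array([
    [3, 2, 1, 0],
    [2, 3, 0, 1],
    [1, 0, 3, 2],
    [0, 1, 2, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                      # number of themes kept (assumed)
A, B = U[:, :k], Vt[:k, :]

# Although every count is non-negative, the factors contain negative weights.
has_negative_weights = bool((A < 0).any() or (B < 0).any())

# SVD returns orthonormal themes: B @ B.T and A.T @ A are identity matrices.
themes_orthogonal = bool(np.allclose(B @ B.T, np.eye(k)))
blocks_orthogonal = bool(np.allclose(A.T @ A, np.eye(k)))
```

The negative entries are not an accident of this matrix. Once the first theme has all weights of one sign, any theme orthogonal to it has to mix positive and negative weights, which is hard to read as “how much of this term belongs to this theme”.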

11 LSA limitations The consequences are:
LSA themes are not meaningful beyond the first few (the ones with the largest singular values). LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do). Other methods for text modeling address these limitations.

12 LSA apps – Summary Assistant
The student summary evaluator uses projection into an LSA space to judge the quality of the summary. Each sentence is checked for “relevancy” and “redundancy” and feedback is given to the student. An overall grade is assigned based on a section-by-section comparison with the original text.
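
The relevancy and redundancy checks could look roughly like the sketch below. This is not the Summary Assistant’s actual implementation, which the slides do not describe. The toy source sections, the helper names (`bag`, `project`, `evaluate`), the use of cosine similarity, the choice of k = 2 themes and the 0.9 redundancy threshold are all assumptions.

```python
import numpy as np

def bag(text, vocab):
    """Count vector over vocab; out-of-vocabulary words are ignored."""
    words = text.lower().split()
    return np.array([words.count(t) for t in vocab], dtype=float)

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0

# Toy "original text", one string per section.
sections = ["volcano lava eruption", "lava eruption ash", "glacier ice melt"]
vocab = sorted({w for sec in sections for w in sec.split()})
M = np.array([bag(sec, vocab) for sec in sections])

_, _, Vt = np.linalg.svd(M, full_matrices=False)
B = Vt[:2]                                   # keep k = 2 themes (assumed)

def project(text):
    """Map a piece of text into the theme space defined by B."""
    return bag(text, vocab) @ B.T

section_vecs = [project(sec) for sec in sections]

def evaluate(summary_sentences, redundancy_threshold=0.9):
    vecs = [project(sent) for sent in summary_sentences]
    # Relevancy: best match between a sentence and any source section.
    relevancy = [max(cosine(v, sv) for sv in section_vecs) for v in vecs]
    # Redundancy: pairs of summary sentences that say nearly the same thing.
    redundant = [
        (i, j)
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
        if cosine(vecs[i], vecs[j]) > redundancy_threshold
    ]
    return relevancy, redundant

relevancy, redundant = evaluate(
    ["volcano eruption", "lava eruption", "ice melt", "guitar music"]
)
```

Here the first two summary sentences are flagged as redundant with each other, and the last one scores zero relevancy because none of its words occur in the source. A section-by-section grade could be built the same way by comparing the summary against each entry of `section_vecs` separately.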

13 LSA macrostructures LSA models human understanding of texts (?).
Macrostructures are important for human understanding of texts. LSA representation facilitates comparison of text blocks and summaries or titles. LSA doesn’t generate the summaries itself – you need more of the logical/syntactic structure to do that.

14 Discussion Topics T1: Critique the assertion that LSA models human thought. Be careful not to oversimplify. Think also about situations where people aggregate information in an LSA-like way. T2: What would you change about LSA to fix its limitations?

