Latent Semantic Indexing and Beyond: presentation transcript (Leif Grönqvist, Borås, 24 March 2003)

Slide 1: Latent Semantic Indexing and Beyond
Leif Grönqvist, School of Mathematics and Systems Engineering, the Swedish Graduate School of Language Technology

Slide 2: Overview
- My background
- Introduction to vector space models and Latent Semantic Indexing
- A toy example
- Interpretation
- Some applications
- A concrete example and a small experiment
- Improvements of the model
- Various unsolved problems
- Conclusion: things I have to do

Slide 3: My Background
- "4-årig teknisk" (electrical engineering)
- MSc (official translation of "Filosofie Magister") in Computing Science, Göteborg University
- 62 points in mechanics, electronics, etc.
- Work at the Linguistics department in Göteborg: various projects related to corpus linguistics, and some teaching on statistical methods (Göteborg and Uppsala) and corpus linguistics (Göteborg, Sofia, and Beijing)
- 1995: Consultant at Redwood Research, in Sollentuna, working on information retrieval in medical databases
- Work at the department of Informatics in Göteborg (the Internet Project)
- PhD student in Computer Science / Language Technology

Slide 4: Vector Space Models
If we had a way to map any term to a vector in a high-dimensional space, such that similarity between the meanings of terms is reflected in the distance between their vectors, then we could:
- For a given term t, find an ordered list of the terms most similar to t
- For any two terms, compute the similarity between them
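For instance, with cosine similarity as the closeness measure, the ordered-list query takes only a few lines. A minimal sketch; the toy vectors here are hypothetical, just for illustration:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(term, vectors, top_n=5):
    """Rank all other terms by cosine similarity to `term`."""
    target = vectors[term]
    scores = [(other, cosine(target, v))
              for other, v in vectors.items() if other != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical toy vectors; in practice these come from a trained model.
vectors = {"graph": np.array([0.9, 0.1, 0.0]),
           "trees": np.array([0.8, 0.2, 0.1]),
           "user":  np.array([0.1, 0.9, 0.3])}
print(most_similar("graph", vectors))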

Slide 5: Vector Space Models, cont.
And if it is possible to add the meanings of terms, and this addition is reflected by adding the corresponding vectors, we could do some more things:
- If we assume that it is possible to extract terms from a document, we can map documents to vectors too!
- A set of terms (one or more) may be seen as a document as well

Slide 6: Vector Space Models, cont.
- Now it is possible, for any term or document d, to find an ordered list of the terms or documents most similar to d
- Further, for any two terms or documents, we can compute the similarity between them
- It is therefore meaningful to look at a term as a special case of a document: a very short one
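Under the additivity assumption from the previous slide, a document (or query) vector is just a sum of term vectors. A sketch, again with hypothetical toy vectors:

```python
import numpy as np

def document_vector(tokens, term_vectors):
    """One simple composition: a document's vector is the sum of its
    term vectors (weighted sums are a common refinement)."""
    known = [term_vectors[t] for t in tokens if t in term_vectors]
    return np.sum(known, axis=0)

# Hypothetical toy term vectors, as in the earlier sketch.
term_vectors = {"graph": np.array([0.9, 0.1]),
                "trees": np.array([0.8, 0.2]),
                "user":  np.array([0.1, 0.9])}

# A set of one or more terms is just a very short document:
query_vec = document_vector(["graph", "trees"], term_vectors)
```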

Slide 7: [figure only: a plot of the vector space showing word clusters, including a blue and a red cluster]

Slide 8: Zoom into the blue cluster [figure only]

Slide 9: And the red one [figure only]

Slide 10: Alternative data sources
A useful data source for this kind of information would be a thesaurus, a WordNet, or some other knowledge database. But:
- We don't have them for all languages
- Most of them are not domain specific, so domain-specific terms are not covered
- In such data sources most of the words are missing, especially names, compounds, technical terms, and numbers; my big newspaper corpus contains ~ unique words
- A vector space model, in contrast, can be trained from raw, unannotated corpus data!

Slide 11: Calculating a vector space
- The training process needs a large set of documents; the bigger the better. My data set used for experiments contains roughly 1.2 million newspaper articles and 0.4 billion running words, but I will collect more…
- Step 1: Create a word-by-document matrix, where each element is a (possibly weighted) frequency of a word type in a specific document
- From here there are several ways to find a good vector space
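A minimal sketch of step 1, assuming documents arrive as lists of word tokens; the weighting is left as raw counts here:

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a raw count matrix: rows = word types, columns = documents.
    Real systems apply a weighting such as tf-idf or log-entropy on top."""
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(documents) for _ in vocab]
    for j, doc in enumerate(documents):
        for word, freq in Counter(doc).items():
            matrix[index[word]][j] = freq
    return vocab, matrix

docs = [["human", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"]]
vocab, X = term_document_matrix(docs)
```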

Slide 12: Vector Space Algorithms
- Singular Value Decomposition (SVD)
  - A mathematically complicated method (based on eigenvalues) that finds an optimal vector space with a given number of dimensions
  - Computationally heavy: maybe 20 hours for my test set
  - Often uses the entire document as context
- Random Indexing (RI), sketched below
  - Selects some dimensions randomly
  - Not as heavy to calculate, but it is less clear (to me) why it works
  - Uses a small context, typically 1+1 to 5+5 words
- Neural nets, Hyperspace Analogue to Language, etc.
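For the Random Indexing branch, a minimal sketch; the dimensionality, sparsity, and window size below are illustrative choices, not values from the talk:

```python
import numpy as np

def random_index_vectors(corpus, dim=300, nonzero=10, window=2, seed=0):
    """Random Indexing sketch: each word gets a sparse random 'index
    vector' (+1/-1 in a few positions); a word's context vector is the
    sum of the index vectors of its neighbours within the window."""
    rng = np.random.default_rng(seed)
    vocab = {w for doc in corpus for w in doc}
    index_vec = {}
    for w in vocab:
        v = np.zeros(dim)
        pos = rng.choice(dim, size=nonzero, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=nonzero)
        index_vec[w] = v
    context_vec = {w: np.zeros(dim) for w in vocab}
    for doc in corpus:
        for i, w in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context_vec[w] += index_vec[doc[j]]
    return context_vec
```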

Slide 13: The terminology I use
Some people use these terms in a sloppy way. For me:
- LSI = LSA: Latent Semantic Indexing and Latent Semantic Analysis are used in roughly the same way by most people
- SVD and RI are two ways to obtain the model used in LSA – they both find the latent information

Slide 14: A toy example

Slide 15: What SVD gives us
X = T0 S0 D0^T, where X, T0, S0, and D0 are matrices: T0 and D0 hold the left and right singular vectors, and S0 is the diagonal matrix of singular values.
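A quick numerical check of this factorization, a sketch using numpy rather than anything from the talk:

```python
import numpy as np

# numpy returns T0 (left singular vectors), the diagonal of S0 (the
# singular values), and D0 transposed (right singular vectors).
X = np.random.default_rng(1).random((12, 9))   # a stand-in term-document matrix
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, T0 @ np.diag(s0) @ D0t)  # X = T0 S0 D0^T holds exactly
```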

Slide 16: And our example: T [matrix figure]

Slide 17: And our example: S [matrix figure]

Slide 18: And our example: D [matrix figure]

Slide 19: We can recalculate X with m = 2
[Table: the rank-2 approximation of X; rows = the terms human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors; columns = documents C1-C5 and M1-M4; the numeric values did not survive in this transcript.]
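The terms and documents match the well-known toy example from Deerwester et al. (1990). Assuming those counts (reproduced below from that paper, not from the slide), the m = 2 reconstruction is one line after the SVD:

```python
import numpy as np

# Term-document counts from Deerwester et al. (1990), which this toy
# example appears to follow (rows in the order listed on the slide).
X = np.array([
    [1,0,0,1,0,0,0,0,0],  # human
    [1,0,1,0,0,0,0,0,0],  # interface
    [1,1,0,0,0,0,0,0,0],  # computer
    [0,1,1,0,1,0,0,0,0],  # user
    [0,1,1,2,0,0,0,0,0],  # system
    [0,1,0,0,1,0,0,0,0],  # response
    [0,1,0,0,1,0,0,0,0],  # time
    [0,0,1,1,0,0,0,0,0],  # EPS
    [0,1,0,0,0,0,0,0,1],  # survey
    [0,0,0,0,0,1,1,1,0],  # trees
    [0,0,0,0,0,0,1,1,1],  # graph
    [0,0,0,0,0,0,0,1,1],  # minors
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
m = 2
X_hat = T0[:, :m] @ np.diag(s0[:m]) @ D0t[:m, :]  # best rank-2 approximation
print(np.round(X_hat, 2))
```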

Slide 20: What does the SVD give?
- Susan Dumais 1995: "The SVD program takes the ltc transformed term-document matrix as input, and calculates the best 'reduced-dimension' approximation to this matrix."
- Michael W. Berry 1992: "This important result indicates that A_k is the best k-rank approximation (in a least squares sense) to the matrix A."
- Leif 2003: What Berry says is that SVD gives the best projection from n down to k dimensions, that is, the projection that preserves distances in the best possible way, so there are no problems with local maxima.

Slide 21: The distance measure
Three easy-to-calculate distance measures:
- Cosine: the cosine of the angle between the vectors
- Euclidean distance: just the distance as we all know it
- Manhattan distance: the distance if you walk only along the orthogonal axes
All are just as easy to calculate in n dimensions, where n >> 3. The most used is the cosine.
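All three measures in a few lines of numpy; note that cosine is a similarity (larger means closer), while the other two are distances:

```python
import numpy as np

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    return np.linalg.norm(u - v)   # square root of summed squared differences

def manhattan(u, v):
    return np.abs(u - v).sum()     # sum of absolute coordinate differences

# The formulas are dimension-independent; n = 300 is as easy as n = 3.
u, v = np.random.default_rng(0).random((2, 300))
print(cosine_sim(u, v), euclidean(u, v), manhattan(u, v))
```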

Slide 22: What does it really mean then?
- The fact that a word w is represented by a specific vector v means exactly nothing!
- If two words a and b are represented by vectors close to each other (the angle between them is small), then:
  - a and b are often found in the same document, and/or
  - a is often found together with some word c, and c is often found together with b
  - and so on…

Slide 23: A naïve algorithm
It is not trivial to see why SVD and RI work. Here is a naive but more intuitive algorithm that obtains a result similar to SVD, although it is too slow for practical use:
1. For each unique word, select a random point in a space of the chosen dimensionality
2. For each document D in the set: move the points corresponding to the words in D towards the mass center of those points
3. If any point made a "big" move since the last iteration, go back to step 2
Steps 1-3 could be repeated several times to improve the chance of finding the global optimum. A sketch of the procedure follows below.
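A minimal sketch of this procedure; the step size, tolerance, and iteration cap are choices of mine, not from the slide:

```python
import numpy as np

def naive_embedding(docs, dim=2, step=0.1, tol=1e-3, max_iter=200, seed=0):
    """The naive algorithm above: random start, then repeatedly pull each
    word's point towards the mass center of the documents it occurs in."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for doc in docs for w in doc})
    points = {w: rng.normal(size=dim) for w in vocab}
    for _ in range(max_iter):
        biggest_move = 0.0
        for doc in docs:
            center = np.mean([points[w] for w in doc], axis=0)
            for w in set(doc):
                move = step * (center - points[w])
                points[w] += move
                biggest_move = max(biggest_move, np.linalg.norm(move))
        if biggest_move < tol:   # no point made a "big" move this iteration
            break
    return points
```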

Slide 24: Some applications
- Automatic generation of a domain-specific thesaurus
- Keyword extraction from documents
- Finding sets of similar documents in a collection
- Finding documents related to a given document or a set of terms

Slide 25: Problems and questions
- How can we interpret the similarities as different kinds of relations?
- How can we include document structure and phrases in the model?
- Terms are not really terms, but just words
- Ambiguous terms pollute the vector space
- How can we find the optimal number of dimensions for the vector space?

Slide 26: An example based on 5000 newspaper articles
Nearest neighbours for two queries:
- pelle svensson: svenssons, ödsligt, skandal, frikännande, polismannens, tjänstetid, slutkörd, munsex, avstyra
- bengt johansson: johansson, bengt, davidson, folkpartiledaren, kdsledaren, öresundsbroprojektet, centerledaren, irhammar, partiledarna, avgaser, lyckosamt

Slide 27: Bengt Johansson is just Bengt + Johansson – something is missing!
- bengt: folkpartiledaren, westerberg, kdsledaren, riksdagsledamot, ändrats, ingbritt, irhammar, tolkningen, tolkar, partiledarna
- johansson: olof, miljödepartementets, görel, thurdin, miljöminister, brofrågan, rosenbad, miljödepartementet, regeringssammanträdet, avgaser

Slide 28: A small experiment
I want the model to know the difference between Bengt and Bengt Johansson:
1. Make a frequency list of all n-tuples up to n = 5 with a frequency > 1
2. Keep all words in the bags, but add the tuples, with spaces replaced by _, as words
3. Run the LSI again
Now Bengt_Johansson is a word, and Bengt_Johansson is NOT Bengt + Johansson. The number of terms grows from to
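A minimal sketch of steps 1 and 2, assuming documents are lists of word tokens; the thresholds mirror the slide (n up to 5, frequency greater than 1):

```python
from collections import Counter

def add_phrases(docs, max_n=5, min_freq=2):
    """Count all n-grams (n = 2..max_n) over the corpus, then append the
    ones seen more than once to each document as extra '_'-joined words,
    while keeping all the original single words."""
    counts = Counter()
    for doc in docs:
        for n in range(2, max_n + 1):
            for i in range(len(doc) - n + 1):
                counts["_".join(doc[i:i + n])] += 1
    phrases = {p for p, f in counts.items() if f >= min_freq}
    out = []
    for doc in docs:
        extra = []
        for n in range(2, max_n + 1):
            for i in range(len(doc) - n + 1):
                gram = "_".join(doc[i:i + n])
                if gram in phrases:
                    extra.append(gram)
        out.append(doc + extra)   # keep all words in the bag, add the tuples
    return out
```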

Slide 29: New results
Some distances between pairs (the numeric values did not survive in this transcript):
bengt_johansson–johansson, bengt_johansson–bengt, bengt_johansson–olof, bengt_johansson–folkpartiledaren, johansson–olof, johansson–folkpartiledaren, johansson–bengt, bengt–folkpartiledaren, bengt–olof, folkpartiledaren–olof

Slide 30: And the top list for Bengt_Johansson
bengt_johansson, handbollslandslag, gunnar_blombäck, fyrnationsturneringen_i_östergötland, fyrnationsturneringen, förbundskapten_bengt_johansson, förbundskapten_bengt, blombäck, carlen, åtta_mål, bänken, magnus_wislander, wislander, målet_stod, svenske_förbundskaptenen, orutinerade, vinna_den_här, magnus_andersson, matchen_spelades, förbundskaptenen, landskamp, glädjeämnen, vmlaget, halvlek, världsstjärnor, bottenlaget, brolin, uppvisningen, offensivt, jörgensen, landslag

Slide 31: The new vector space model
- It is clear that it is now possible to find terms closely related to Bengt Johansson – the handball coach
- But is the model better for single words or for document comparison? What do you think?
- There are more "words" than before – hopefully this improves the result, just as more data does
- At least there is no reason for a worse result... Or?

Slide 32: An example document
"REGERINGSKRIS ELLER INTE: PARTILEDARNA I SISTAMINUTEN-ÖVERLÄGGNINGAR OM BRON. Under onsdagskvällen satt partiledarna i regeringen i sista minuten-överläggningar om Öresundsbron. Centerledaren Olof Johansson var den förste som lämnade överläggningarna. På torsdagen ska regeringen ge ett besked. Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …"
[Roughly: "Government crisis or not: party leaders in last-minute talks about the bridge. On Wednesday evening the party leaders in the government sat in last-minute talks about the Öresund bridge. Centre Party leader Olof Johansson was the first to leave the talks. On Thursday the government will give an answer, though according to Prime Minister Carl Bildt it need not mean a yes or a no to the bridge …"]

Slide 33: Closest terms in each model
First list:
0.986 underkänner, 0.982 irhammar, 0.977 partiledarna, 0.970 godkände, 0.962 delade_meningar, 0.960 regeringssammanträde, 0.957 riksdagsledamot, 0.957 bengt_westerberg, 0.954 materialet, 0.952 diskuterade, 0.950 folkpartiledaren, 0.949 medierna, 0.947 motsättningarna, 0.946 vilar
Second list:
socialminister_bengt_westerberg (score missing), 0.967 partiledarna, 0.921 miljökrav, 0.921 underkänner, 0.918 tolkar, 0.897 meningar, 0.888 centerledaren, 0.886 regeringssammanträde, 0.880 slottet, 0.880 rosenbad, 0.877 planminister, 0.866 folkpartiledaren, 0.855 thurdin, 0.845 brokonsortiet, 0.839 görel, 0.826 irhammar

Slide 34: Closest document in both models
"BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR. Det blir ett besked under torsdagen, men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet. Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter. Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris. De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna. Men nu menar Bildt att centern lämnar sned information utåt. Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …"
[Roughly: "Bildt promises an answer while a government crisis looms. There will be an answer on Thursday, but it need not be a yes or a no from the government to the Öresund bridge project. Meanwhile the tone sharpened between Bildt and environment minister Olof Johansson, suggesting an approaching government crisis …"]

Slide 35: [Table: for each document, Score and Rank under the basic model and under the model with phrases added; the cell values did not survive in this transcript.]

Slide 36: Documents with better ranking in the basic model
"BRON KAN BLI VALFRÅGA SÄGER JOHANSSON. Om det lutar åt ett ja i regeringen av politiska skäl, då är naturligtvis den här frågan en viktig valfråga …"
"INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET. En etappseger för miljön och centern. En eloge till Olof Johansson, Görel Thurdin och Carl Bildt …"
[Roughly: "The bridge may become an election issue, says Johansson" and "Not one critical voice among the Centre Party members about the bridge decision".]

Slide 37: Documents with better ranking in the phrase model
"ALF SVENSSON TOPPNAMN I STOCKHOLM. Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad, och Michael Stjernström, sakkunnig i statsrådsberedningen, har en valbar andra plats …"
"BENGT WESTERBERG: BARNPORREN MÅSTE STOPPAS. Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …"
[Roughly: "Alf Svensson top name in Stockholm" and "Bengt Westerberg: child pornography must be stopped".]

Slide 38: Hmm, adding n-grams was maybe too simple...
- If the bad result is due to overtraining, it could help to remove the words the phrases are built from, though maybe not all of them
- Another approach is to use a dependency parser to find more meaningful phrases, not just n-grams

Slide 39: The interpretation of similarities
I haven't tried to solve this problem at all, but one idea I have is to:
- Calculate vector spaces for various dimensionalities and context widths
- Check whether the different settings find different kinds of relations
- With a data source like WordNet this could be done in a systematic way

Slide 40: How to select the number of dimensions
- Susan T. Dumais 1995: "In previous experiments we found that performance improves as the number of dimensions is increased up to 200 or 300 dimensions, and decreases slowly after that to the level observed for the standard vector EC3 method (Dumais, 1991)."
- Jason I. Hong 2000: "There does not seem to be a general consensus for an optimal number of dimensions; instead, the size of the concept space must be determined based on the specific collection of documents used."
- Thomas K. Landauer 1997: "Near maximum performance of 45-53%, corrected for guessing, was obtained over a fairly broad region around 300 dimensions"
- Leif 2003: "We should try to do experiments similar to Dumais's and Landauer's, but relate the optimal dimensionality to measures like the number of documents, terms, nonzero elements, etc., because these could give us a formula that does not rely on hand-tagged data sets"
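Such experiments amount to a sweep over candidate dimensionalities. A sketch; `evaluate` is a hypothetical stand-in for whatever task-based score (synonym test, retrieval precision) one would use in the spirit of the Dumais/Landauer experiments:

```python
import numpy as np

def sweep_dimensions(X, ks=(50, 100, 200, 300, 400), evaluate=None):
    """Truncate the SVD of a term-document matrix X at each candidate k
    and score the resulting term space with a task-specific evaluation."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    results = {}
    for k in ks:
        term_vecs = T0[:, :k] * s0[:k]   # term coordinates in k dimensions
        results[k] = evaluate(term_vecs) if evaluate else None
    return results
```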

Slide 41: Performance for the SVD
- Dumais 1995: "The SVD takes only about 2 minutes on a Sparc10 for a 2k x 5k matrix, but this time increases to about hours for a 60k x 80k matrix."
- Hong 2000: "The SVD algorithm is O(N^2 k^3), where N is the number of terms plus documents, and k is the number of dimensions in the concept space"; "However, if the collection is stable, SVD will only need to be performed once, which may be an acceptable cost."
- Leif: So if a good computer today is 100 times faster than Dumais's in 1995, our data sets are 20 times bigger, and we use an optimized SVD function instead of a research prototype, it should still take around 20 hours.
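One way to read that estimate, assuming the number of dimensions k is held fixed so that Hong's bound is dominated by the N^2 factor:

$$t_{\text{now}} \approx t_{1995} \cdot \frac{(20N)^2}{N^2} \cdot \frac{1}{100} = 4\,t_{1995}$$

That is, 20 times more data costs 400 times more work, a 100-times-faster machine recovers all but a factor of 4, and the optimized implementation has to make up the rest.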

Slide 42: What I still have to do something about
- Find a better LSI/SVD package than the one I have (old C code from 1990), or maybe write it myself...
- Get the phrases into the model in some way
When these things are done I could:
- Try to interpret various relations from similarities in a vector space model
- Try to solve the "optimal number of dimensions" problem
- Explore what the lengths of the vectors mean

