20/07/2000, HYPERGEO 1st technical verification, ARISTOTLE UNIVERSITY OF THESSALONIKI: Corpus Processing & Feature Vector Extraction. A. Xafopoulos, C. Kotropoulos, A. Karydas


1 20/07/2000, Page 1 HYPERGEO 1st technical verification ARISTOTLE UNIVERSITY OF THESSALONIKI
Corpus Processing & Feature Vector Extraction
A. Xafopoulos, C. Kotropoulos, A. Karydas

2 Corpus Processing
– Corpus Checking (manual)
    Corpus files consisting of continuous non-English text or of browser-related material (frames etc.) were removed.
    For the results reported below, 135 HTML files (~1.15 MB) were used (word count 111,657).
– Merging (unix)
    Concatenation of the corpus HTML files into one file.
– Text processing (C code)
    HTML cleaning
    – HTML tags (e.g. ) & entities (e.g. ) are removed.
    – Special treatment for tags that influence word separation (e.g. ) & inclusion in output text (e.g. ).
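The HTML cleaning step could be sketched roughly as follows. This is an illustrative Python sketch, not the project's C code: the tag sets `SEPARATING_TAGS` and `SKIPPED_TAGS` are assumptions, since the slide's own tag examples did not survive, and the class name `HtmlCleaner` is hypothetical.

```python
from html.parser import HTMLParser

# Assumed tag sets; the slide's original examples are not available.
SEPARATING_TAGS = {"br", "p", "div", "td", "tr", "li", "h1", "h2", "h3"}
SKIPPED_TAGS = {"script", "style"}


class HtmlCleaner(HTMLParser):
    """Drops tags and entities, keeping only the visible text."""

    def __init__(self):
        # convert_charrefs=False routes entities to handle_entityref /
        # handle_charref, whose default implementations ignore them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in SEPARATING_TAGS:
            self.parts.append(" ")  # this tag separates words

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in SEPARATING_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Text inside skipped tags is excluded from the output.
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Inline tags such as `<b>` are dropped without inserting a space, so text on either side of them is joined, while the assumed block-level tags act as word separators.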

3 Corpus Processing (Cont.)
– Text processing (Cont.)
    Plain text cleaning
    – Email addresses & URLs → X (removed).
    – Word separators: isspace(), ( _.
    – Capital letters → lowercase.
    – Digits inside words → X.
    – Words beginning with a digit → X.
    – Multiple periods → one.
    – Periods are separated from words by a space character.
    – Multiple isspace() characters → one.
    – All other chars → X (e.g. hyphen or apostrophe).
    Result: 66,478 words.
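The cleaning rules above can be sketched in a few lines of Python. This is only an approximation of the C code, under stated assumptions: digits inside a word and all other characters are deleted from the word, words beginning with a digit are dropped whole, and the email and URL patterns are simplified guesses. The function name `clean_plain_text` is hypothetical.

```python
import re

# Simplified, assumed patterns for email addresses and URLs.
EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"(?:https?://|ftp://|www\.)\S+", re.IGNORECASE)


def clean_plain_text(text):
    text = EMAIL_RE.sub(" ", text)       # email addresses removed
    text = URL_RE.sub(" ", text)         # URLs removed
    text = text.lower()                  # capital letters to lowercase
    text = re.sub(r"[(_]", " ", text)    # "(" and "_" act as word separators
    text = re.sub(r"\.", " . ", text)    # periods separated from words
    tokens = []
    for token in text.split():           # split() also collapses whitespace
        if token == ".":
            # Multiple periods collapse into one.
            if not tokens or tokens[-1] != ".":
                tokens.append(token)
        elif token[0].isdigit():
            continue                     # words beginning with a digit dropped
        else:
            # Assumption: digits and all other characters are deleted.
            word = re.sub(r"[^a-z]", "", token)
            if word:
                tokens.append(word)
    return " ".join(tokens)
```

Under these assumptions a hyphenated word such as "well-known" becomes "wellknown" rather than two words; treating the hyphen as a separator would be an equally plausible reading of the slide.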

4 Corpus Processing (Cont.)
– Stemming (IRIS nice stemmer C code as base)
    Clustering by elimination of word suffixes.
    The Porter (derivational) algorithm was chosen.
    – Contains rules that iteratively remove or substitute known suffixes, e.g. -s, -ation, -ing.
    – Other tested stemmers: Krovetz (inflectional), Lovins.
    – Porter usually gives
        » fewer stems than Krovetz.
        » a “better” stem-word representation than Lovins.
    Stop list file: common words to be disregarded
    – articles, conjunctions, prepositions, pronouns, one-letter words, …
    Result: 6,458 stems out of 35,099 “valid” words.
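To illustrate the idea of iterative suffix rules combined with a stop list, here is a toy Python sketch. It is not the Porter algorithm, which has many more rules and conditions; the rule table `RULES`, the threshold `MIN_BASE`, and the tiny `STOP_WORDS` set are all assumptions made for the example.

```python
# Toy rule table (suffix, replacement); the first matching rule is applied.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # protects words such as "glass" from the "-s" rule
    ("s", ""),
    ("ation", ""),
    ("ing", ""),
]
MIN_BASE = 3  # assumed minimum length of what remains before the suffix

# Assumed miniature stop list; the real stop list file is much larger.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "it", "is"}


def stem(word):
    changed = True
    while changed:                      # rules are applied iteratively
        changed = False
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                base = word[: len(word) - len(suffix)]
                candidate = base + replacement
                if len(base) >= MIN_BASE and candidate != word:
                    word = candidate
                    changed = True
                break                   # first matching rule decides this pass
    return word


def stem_corpus(words):
    # Drop stop words and one-letter words, then stem what is left.
    return [stem(w) for w in words if len(w) > 1 and w not in STOP_WORDS]
```

For example, "buildings" is reduced in two passes (first "-s", then "-ing") to "build", and "cities" becomes "citi", the same form that appears in the frequency list of the next slide.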

5 Feature Vector Extraction
– Stem Frequencies Computation (CMU-Cambridge toolkit v2 C code as base)
    Hash table used.
    12 most frequent:
    Stem      Frequency    Stem      Frequency
    museum    305          pm        154
    citi      206          import    152
    town      196          great     147
    centuri   195          visit     141
    hotel     178          church    136
    spain     166          build     128
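A hash-table frequency count of this kind can be sketched in Python with `collections.Counter`, which is a dictionary (hash table) subclass. This stands in for the toolkit-based C code and is not a description of it; the function names are hypothetical.

```python
from collections import Counter


def stem_frequencies(stems):
    # Counter maps each stem to its number of occurrences.
    return Counter(stems)


def most_frequent(frequencies, k):
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

Calling `most_frequent(stem_frequencies(stems), 12)` on the stemmed corpus would yield a list in the same shape as the table above.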

6 Feature Vector Extraction (Cont.)
– Vocabulary Construction (CMU-Cambridge toolkit v2 C code as base)
    Input: stem frequencies file.
    Result: alphabetically sorted stem list (used to simplify the next task).
– Bigram Generation (CMU-Cambridge toolkit v2 C code as base)
    Bigram: probability of occurrence of two neighboring words.
    Its statistical measurement (estimator) assists word clustering.
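Vocabulary construction and bigram counting might look like the following Python sketch. The slide does not say which estimator is used, so the relative-frequency estimate of P(second | first) shown here is an assumption, as are the function names.

```python
from collections import Counter


def build_vocabulary(frequencies):
    # Alphabetically sorted list of distinct stems.
    return sorted(frequencies)


def bigram_counts(stems):
    # Count each pair of neighboring stems (first, second).
    return Counter(zip(stems, stems[1:]))


def bigram_probability(pair_counts, frequencies, first, second):
    # Assumed estimator: count(first, second) / count(first).
    if frequencies[first] == 0:
        return 0.0
    return pair_counts[(first, second)] / frequencies[first]
```

Both `pair_counts` and `frequencies` are expected to be `Counter` objects, so unseen stems or pairs simply count as zero.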

7 Feature Vector Extraction (Cont.)
– Bigram Generation (Cont.)
    Result: 259,09 pairs out of 6,458 stems.
    12 most frequent:
    Stem Pair         Frequency    Stem Pair         Frequency
    am pm             104          worth visit       41
    tel fax           51           singl room        32
    photo tour        51           index sightsee    32
    room dm           46           pm pm             31
    fiesta folklor    44           excurs fiesta     29
    dom pedro         42           folklor photo     28

8 Feature Vector Extraction (Cont.)
– Manipulation of stem frequencies & bigram files (unix)
    Prepares the input of the next task.
– Collection of Contextual Statistics (C code)
    The contextual statistics used to encode stem $i$ are:
    – $x_{il} = n_{il}/N_i$, $i, l = 1,\dots,N_w$.
    – $n_{il}$: number of times stem $l$ occurred before stem $i$.
    – $N_i$: number of times stem $i$ occurred; $N_w$: number of stems in the corpus.
    The vector that holds the overall contextual statistics for stem $i$ is $\mathbf{x}_i$ (average context vector, ACV):
    $\mathbf{x}_i = \sum_{l=1}^{N_w} x_{il}\,\mathbf{e}_l = \frac{1}{N_i} \sum_{l=1}^{N_w} n_{il}\,\mathbf{e}_l$
    – $\mathbf{e}_l$: vector with 1 as its $l$-th coordinate & 0s for the others.
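The average context vectors can be computed directly from the stem sequence, as in this Python sketch. It assumes that "before" means immediately before (i.e. the bigram counts), and it uses dense lists for readability; the function name `average_context_vectors` is hypothetical and the actual C code may be organised quite differently.

```python
from collections import Counter


def average_context_vectors(stems):
    vocabulary = sorted(set(stems))
    index = {stem: k for k, stem in enumerate(vocabulary)}
    occurrences = Counter(stems)                 # N_i
    # (l, i) -> n_il: times stem l occurred immediately before stem i.
    preceded_by = Counter(zip(stems, stems[1:]))
    vectors = {stem: [0.0] * len(vocabulary) for stem in vocabulary}
    for (l, i), n_il in preceded_by.items():
        # Component l of x_i is n_il / N_i.
        vectors[i][index[l]] = n_il / occurrences[i]
    return vocabulary, vectors
```

Note that the components of a vector sum to 1 only if every occurrence of the stem has a predecessor; the first stem of the corpus has none, so its vector can sum to slightly less.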

9 Future
– Annotated corpus handling.
– Code optimization.
– Bigram smoothing.

