The ILK Suite of Text Tools. Antal van den Bosch, ILK Research Group, Faculty of Humanities, Tilburg University. Political Mashup Meeting.


1 The ILK Suite of Text Tools. Antal van den Bosch, ILK Research Group, Faculty of Humanities, Tilburg University. Political Mashup Meeting, Amsterdam, March 19, 2008

2 The ILK Text Tools Text Quality Management –Text normalization –Spelling and grammar checking –Structured data cleaning Text Mining –Entity recognition –Relation finding Text Recommendation –Document recommendation –Expert recommendation

3 ILK Text Tools Applications Cultural Heritage –Historical texts: Royal Library, DBNL –Entity recognition: Naturalis field books –Structured data cleaning: Naturalis, Beeld & Geluid, Army Museum, Meertens Service and media industries –Text mining: Textkernel B.V. –Recommendation: Trouw

4 TICCL Text-induced corpus cleanup –Martin Reynaert Robust, scalable method for finding wordform variants Sensitive to morphology and context Knowledge-free Very large corpus Linked word list Dirty word list indexes

5 TICCL hartstochtelijk hartstochtelyk hartstochtelyke hartstochtlijk hartstochtlijke hartstochtlyk hartstogtelijk hartstogtelijke hartstogtelijks hartstogtelyk wenkbrauwen wenkbraauwen wenkbraeuwen wenkbrauwen winkbraauwen wynbraauwen wynbrauwen Nederland NEDERLANDEN Nederlan Nederland Nederlanden Nederlander Nederlandse Nederlandt Nederlandts Nederlandze Nederlansch Nederlanse Nederlant Nederlants Neederland Neerland Neerlands Neerlandts Neerlants Netherlands
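The variant groups above can be approximated with a much simpler stand-in. The sketch below is not TICCL's actual algorithm (the slides do not describe it); it uses plain Levenshtein edit distance over a small word list taken from the slide. The function names `levenshtein` and `variants` and the threshold `max_dist=2` are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def variants(focus, lexicon, max_dist=2):
    """Return lexicon words within max_dist edits of focus (case-insensitive), sorted."""
    focus_l = focus.lower()
    return sorted(w for w in lexicon
                  if 0 < levenshtein(focus_l, w.lower()) <= max_dist)


# Word forms copied from the slide; the selection is arbitrary.
lexicon = ["hartstochtelyk", "hartstochtlijk", "hartstogtelijk",
           "wenkbrauwen", "Nederland"]
print(variants("hartstochtelijk", lexicon))
```

A pairwise comparison like this does not scale to a very large corpus, which is presumably why the slide lists indexes as part of the method.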

6 Other Text QM Tools Knowledge-free, corpus-driven Tokenization and sentence splitting Grammar checking –All d/t/dt errors: gebeurd/gebeurt, word/wordt –Inflectional and derivational errors Run-on/split detection Word completion Dirty corpus Cleaner corpus Disambiguator
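One way to picture a corpus-driven disambiguator for confusable pairs such as word/wordt is a context counter: count which form follows each left-hand neighbour in a corpus, then suggest the most frequent form for that context. This is a minimal sketch under that assumption, not the ILK implementation; the toy corpus and the names `train` and `suggest` are made up for illustration.

```python
from collections import Counter, defaultdict


def train(sentences, confusables):
    """Count how often each confusable form follows each left-hand neighbour."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in confusables:
                left = tokens[i - 1] if i > 0 else "<s>"  # sentence-start marker
                counts[left][token] += 1
    return counts


def suggest(counts, left, written):
    """Return the form seen most often after `left`; keep `written` if unseen."""
    seen = counts.get(left.lower())
    if not seen:
        return written
    return seen.most_common(1)[0][0]


# Toy corpus (illustrative only).
corpus = ["ik word moe", "hij wordt moe", "ik word ziek", "zij wordt oud"]
counts = train(corpus, {"word", "wordt"})
print(suggest(counts, "hij", "word"))
```

A single left neighbour is too little context for cases such as inversion ("word jij"), so a practical checker would need a wider window.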

7 MITCH: Mining Natural History Piroska Lendvai, Marieke van Erp, Steve Hunt Field books and registers describe objects in many valuable facets, –In ambiguous, elliptic language –In multiple languages –Describing animals, people, biotopes, geographical names, time expressions

8 Cleaning and overhauling data Auteur Determinator Familie Genus Land Bewaarmethode (Daudin, 1802) Bataguridae Anolis Cambodja (Schild droog) (Schlegel) G. vd. Boog Colubridae Indonesia Schneider M.S. Hoogmoed Bufo Suriname (Horst, 1883) Tyler, M.J. Hylidae Litoria alcohol Geophis Geophis? ? Rhabdophis? Actual value: Geophis Expected: Rhapdophis
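The "actual value" versus "expected" comparison on this slide can be illustrated with a simple majority vote: predict one column from another and flag cells that disagree with the prediction. The slides do not say how the expected value is computed, so this is only a sketch; the records are toy data and `flag_suspect_cells` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def flag_suspect_cells(rows, key, target):
    """Flag rows whose `target` differs from the majority value among rows sharing `key`."""
    votes_by_key = defaultdict(Counter)
    for row in rows:
        if row[key] and row[target]:
            votes_by_key[row[key]][row[target]] += 1
    flags = []
    for i, row in enumerate(rows):
        votes = votes_by_key.get(row[key])
        if not votes:
            continue
        expected, n = votes.most_common(1)[0]
        # Flag only when the expected value is strictly better supported.
        if row[target] != expected and n > votes[row[target]]:
            flags.append((i, row[target], expected))
    return flags


# Toy records (illustrative only); row 2 is wrong and row 4 has an empty cell.
rows = [
    {"Genus": "Litoria", "Familie": "Hylidae"},
    {"Genus": "Litoria", "Familie": "Hylidae"},
    {"Genus": "Litoria", "Familie": "Colubridae"},
    {"Genus": "Bufo", "Familie": "Bufonidae"},
    {"Genus": "Bufo", "Familie": ""},
]
for index, actual, expected in flag_suspect_cells(rows, "Genus", "Familie"):
    print(f"row {index}: actual value {actual!r}, expected {expected!r}")
```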

9 Entity type correction


11 Entity detection in fieldbooks
1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, u. RMNH
Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, , 8.45 u., RMNH
Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij P. femoralis.
Gonyocephalus auritus Meyer, 3 ex. (1 juv.), Misool. Hoedt RMNH
Eleutherodactylus zeuctotylus 1 [vrouw] Lelygebergte, 4 km N.O. van airstrip, distr. Marowijne, Suriname, 19-VIII-1975, onder stuk hout, 610m, l [plus] d M. S. Hoogmoed.

12 Entity detection in fieldbooks → Number

13 Entity detection in fieldbooks → Number, Genus

14 Entity detection in fieldbooks → Number, Genus, Species

15 Entity detection in fieldbooks → Number, Genus, Species, Biotope

16 Entity detection in fieldbooks → Number, Genus, Species, Biotope, Collection Time

17 Training on labeled examples –Easy: short, regular entities –Hard: longer textual descriptions Metadata detection in description entities –Types of forest, soil, … in biotopes –Physical appearance, … in special comments By automatically learning the “grammar” of these entities (ABL)
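The "easy" case of short, regular entities can be illustrated without any training at all. The slide describes learning from labeled examples; the sketch below swaps in hand-written regular expressions purely to show why such entities are easy. The label names and patterns are assumptions: `Number` matches counts like "3 ex.", `CollectionTime` is assumed to mean a time of day like "8.45 u.", and `Date` (not a label on the slides) matches dates with a Roman-numeral month.

```python
import re

# Hypothetical patterns for short, regular entities in the field-book entries.
PATTERNS = {
    "Number": re.compile(r"\b\d+ ex\."),
    "CollectionTime": re.compile(r"\b\d{1,2}\.\d{2} u\."),
    "Date": re.compile(r"\b\d{1,2}-[IVX]+-\d{4}\b"),
}


def tag(text):
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]


print(tag("Gonyocephalus auritus Meyer, 3 ex. (1 juv.), Misool."))
```

Longer descriptions such as biotopes ("onder stuk rot hout") have no fixed shape like this, which is where a learned model becomes necessary.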


19 Expert search Toine Bogers, A Propos project Two types –Expert finding –Expert profiling Evidence of expertise –Content-based evidence –Evidence from social networks –Activity-based evidence Current results on academic workgroup –Content-based not better than citation-based –Number of citations just as good as PageRank –“authorship = expertise”? not 100%
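To make the comparison between citation counts and PageRank concrete, the sketch below ranks people in a tiny made-up citation graph both ways. The graph, the names, and the damping factor of 0.85 are assumptions for illustration and say nothing about the data or settings used in the A Propos project.

```python
from collections import Counter


def citation_counts(links):
    """Number of incoming citations per person."""
    return Counter(target for targets in links.values() for target in targets)


def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank; people who cite nobody spread their rank evenly."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank


# Toy graph: who cites whom (illustrative only).
cites = {"ann": ["cas"], "bob": ["cas"], "cas": ["ann"], "dee": ["cas", "ann"]}
ranks = pagerank(cites)
top_by_pagerank = sorted(ranks, key=ranks.get, reverse=True)[:2]
top_by_citations = [name for name, _ in citation_counts(cites).most_common(2)]
print(top_by_pagerank, top_by_citations)
```

On this small graph both methods put the same two people on top, which is the kind of agreement the slide reports.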

20 Trouw Recommender News article recommender for Trouw –Recommends related stories for an article posted online –Editors provide feedback on recommendations –Approved recommendations are automatically placed online
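A related-story recommender of this kind could be sketched with bag-of-words cosine similarity. The slides do not say which similarity measure the Trouw Recommender uses, so this is an assumed stand-in; the articles are toy data, and the `rejected` parameter is a hypothetical way to model editor feedback.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(article_id, articles, k=2, rejected=frozenset()):
    """Return up to k ids of the most similar articles, skipping editor-rejected ones."""
    vectors = {aid: Counter(text.lower().split()) for aid, text in articles.items()}
    query = vectors[article_id]
    scored = [(cosine(query, vec), aid) for aid, vec in vectors.items()
              if aid != article_id and aid not in rejected]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [aid for score, aid in scored[:k] if score > 0]


# Toy articles (illustrative only).
articles = {
    "a1": "election coalition parliament vote",
    "a2": "coalition talks after election vote",
    "a3": "football cup final goal",
}
print(recommend("a1", articles))
```

Passing `rejected={"a2"}` mimics an editor declining a suggestion, so it is never placed online.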


22 Other ILK Text Tools Translation –Memory-based, any pair of languages Morpho-syntactic analysis: Tadpole –Part-of-speech tagging, lemmatization –Dependency parsing, 20 languages Text-to-speech conversion –Dutch speech synthesizer: NeXTeNS Word sense disambiguation, co-reference resolution, paraphrasing, named entity recognition.

23 Thank you Toine Bogers, Martin Reynaert, Piroska Lendvai, Marieke van Erp, Steve Hunt, Peter Berck, Ko van der Sloot, Herman Stehouwer, Menno van Zaanen, Tanja Gaustad, Sebastiaan Tesink, Erwin Marsi, Iris Hendrickx, Antal van den Bosch, Walter Daelemans

