World class IT in a world-wide market. Text Mining Highlights. Marten Trautwein, Syllogic Research & Development.


1 World class IT in a world-wide market

2 Text Mining Highlights Marten Trautwein Syllogic Research & Development

3 RoadMap
  TextHub
    A parallel information retrieval tool
  Text Mine
    A document clustering extension
  Emile
    Grammar induction & clustering

4 What is TextHub?
  Intelligent parallel information retrieval tool
  Intuitive Web-based graphical user interface
  Compression / Decompression
  Indexing / Retrieval
  Document clustering & categorization

5 The star topology
  Master receives requests
  Master delegates tasks
  Slave performs tasks
  Master collects results
  Master returns answer
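
The five steps of the star topology can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `worker_search` and `master_search` are hypothetical, threads stand in for the slave processors, and the task (a single-term lookup) is chosen for brevity rather than taken from TextHub.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_search(documents, term):
    """Slave task: return ids of documents in this shard that contain the term."""
    return [doc_id for doc_id, text in documents.items()
            if term in text.lower().split()]


def master_search(shards, term):
    """Master: delegate the request to every slave, collect and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(worker_search, shards, [term] * len(shards))
    merged = []
    for hits in partial_results:
        merged.extend(hits)
    return sorted(merged)


# Hypothetical example data: two shards, one per slave.
shards = [
    {1: "mail cannot be read", 2: "printer queue error"},
    {3: "cannot send mail", 4: "password expired"},
]
print(master_search(shards, "mail"))
```

The slaves never talk to each other; all traffic passes through the master, which is what makes the topology a star.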

6 Use of parallelism
  Documents outnumber processors
  Divide and conquer
  Distribute documents
  Minimal communication overhead
  Linear speed-up (1 GB per hour)
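
Because documents outnumber processors, the collection can be split once and each processor can then work on its own share without further coordination. The slides do not say how TextHub assigns documents; the round-robin scheme and the name `distribute` below are assumptions made for illustration.

```python
def distribute(doc_ids, n_workers):
    """Assign document ids to workers round-robin so shard sizes differ by at most one."""
    shards = [[] for _ in range(n_workers)]
    for i, doc_id in enumerate(doc_ids):
        shards[i % n_workers].append(doc_id)
    return shards


print(distribute(list(range(7)), 3))
```

With evenly sized shards and no communication between workers, processing time is expected to fall roughly in proportion to the number of processors, which is the linear speed-up the slide refers to.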

7 Functionality details
  Compression / Decompression
    Canonical Huffman encoding
  Indexing
    Inverted file index with canonical terms
  Retrieval
    Boolean (AND, OR, MINUS)
    Search modifiers (stemming, case folding, stop list, synonyms, semantic network)
    Proximity (AT, FAR, NEAR)
  Relevance ranking
    Score documents
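
As a reminder of what "canonical" means in canonical Huffman encoding: once each symbol's code length is known, the codewords themselves are assigned in a fixed order, so a decoder only needs the lengths. The sketch below shows that assignment step. The code lengths are assumed to be given (they would normally come from a Huffman tree), and the function name `canonical_codes` is hypothetical.

```python
def canonical_codes(code_lengths):
    """Assign canonical Huffman codewords from a {symbol: length-in-bits} mapping."""
    codes = {}
    code = 0
    prev_len = 0
    # Canonical order: shorter codes first, ties broken by symbol.
    for symbol, length in sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)  # append zeros when the length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
```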

8 Retrieval (Boolean)
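
Boolean retrieval over an inverted file index reduces to set operations on posting lists. The following is a minimal sketch, not TextHub's implementation: the helper names and the sample documents are hypothetical, and terms are simply lower-cased whitespace tokens.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def q_and(index, a, b):
    return index.get(a, set()) & index.get(b, set())


def q_or(index, a, b):
    return index.get(a, set()) | index.get(b, set())


def q_minus(index, a, b):
    """Documents containing a but not b."""
    return index.get(a, set()) - index.get(b, set())


docs = {1: "cannot read mail", 2: "cannot send mail", 3: "printer error"}
index = build_index(docs)
print(q_and(index, "cannot", "mail"), q_or(index, "read", "error"), q_minus(index, "mail", "send"))
```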

9 Retrieval (Search modifiers)
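
Search modifiers act on the query before the index is consulted. The sketch below covers three of the modifiers named earlier (case folding, stop list, synonyms); stemming and the semantic network are left out. The stop list, the synonym table, and the function name `expand_query` are made-up examples, not TextHub data.

```python
# Hypothetical resources; a real system would load much larger lists.
STOP_WORDS = {"the", "a", "of", "in"}
SYNONYMS = {"mail": {"email", "e-mail"}}


def expand_query(query):
    """Return one set of acceptable terms per remaining query word."""
    terms = []
    for word in query.split():
        word = word.lower()                     # case folding
        if word in STOP_WORDS:                  # stop list
            continue
        terms.append({word} | SYNONYMS.get(word, set()))  # synonyms
    return terms


print(expand_query("The Mail of Outlook"))
```

Each set in the result can then be treated as an OR group, with the groups combined according to the Boolean operators of the query.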

10 Retrieval (Proximity)
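
The slides do not define the AT, FAR, and NEAR operators, so the sketch below illustrates only one plausible reading of NEAR: two terms match when they occur within a fixed number of word positions of each other. The window size and the function names are assumptions.

```python
def positions(text, term):
    """Word positions at which the term occurs in the text."""
    return [i for i, word in enumerate(text.lower().split()) if word == term]


def near(text, a, b, window=3):
    """True if some occurrence of a lies within `window` words of some occurrence of b."""
    return any(abs(i - j) <= window
               for i in positions(text, a)
               for j in positions(text, b))


sentence = "I cannot read my mail with Outlook"
print(near(sentence, "read", "mail", window=2))
print(near(sentence, "cannot", "outlook", window=2))
```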

11 Relevance ranking
  Rate relevance of document
  Score based on number of occurrences
  Score compensated for large documents
  TextHub marks where document is relevant
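
The slide gives no formula, so the following is only one simple way to realise "occurrences, compensated for large documents": divide the number of occurrences by the document length. The functions `score` and `rank` and the sample documents are hypothetical.

```python
def score(text, term):
    """Occurrences of the term divided by document length in words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term) / len(words)


def rank(documents, term):
    """Ids of matching documents, best score first (ties broken by id)."""
    scored = [(score(text, term), doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


docs = {
    1: "mail mail error",
    2: "mail server down again today",
    3: "printer error",
}
print(rank(docs, "mail"))
```

Without the division by length, a long document would outrank a short, focused one merely by containing more words.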

12 Text Mine - Document clustering
  Improve relevance feedback
  Clustering of related documents
  Categorization of documents
  Minimum spanning tree algorithm

13 Using minimum spanning tree
  Combine different measures
  Ordinary query retrieves relevant nodes
  Nodes serve as entry-points
  No global minimum spanning tree
  [Figure: graph with nodes labelled V, U, T, S, C, D, A, B, F, E and ?]
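
One way to read "nodes serve as entry-points" and "no global minimum spanning tree" is that a tree is grown locally from a retrieved document rather than computed over the whole collection. The sketch below does this with Prim's algorithm, stopping after a fixed number of nodes. This interpretation, the distance values, and the name `local_spanning_tree` are all assumptions.

```python
import heapq


def local_spanning_tree(distances, entry, max_nodes):
    """Grow a partial minimum spanning tree from `entry` using Prim's algorithm."""
    visited = {entry}
    edges = []
    frontier = [(d, entry, other) for other, d in distances[entry].items()]
    heapq.heapify(frontier)
    while frontier and len(visited) < max_nodes:
        d, u, v = heapq.heappop(frontier)   # cheapest edge leaving the tree
        if v in visited:
            continue
        visited.add(v)
        edges.append((u, v, d))
        for other, d2 in distances[v].items():
            if other not in visited:
                heapq.heappush(frontier, (d2, v, other))
    return edges


# Hypothetical symmetric document distances (smaller means more similar).
distances = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(local_spanning_tree(distances, "A", 3))
```

The distance between two documents could itself be a weighted combination of several similarity measures, in line with "combine different measures".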

14 Emile
  In cooperation with the University of Amsterdam
  Engine enabling
    Grammar induction
    Knowledge base construction
    Compound term separation
  Language independent

15 Grammar induction
  Rules
    [0] --> [1] kan ik geen mail lezen
    [0] --> [1] kan ik geen mail schrijven
    [0] --> met MS-Mail kan ik geen mail openen
    [0] --> met MS-Mail kan ik geen mail versturen
    [0] --> met MS-Outlook kan ik geen mail openen
    [0] --> met MS-Outlook kan ik geen mail versturen
    [1] --> met MS-Mail
    [1] --> met MS-Outlook
    [1] --> met Mail
    [1] --> met Outlook
    [1] --> met Outlook95
  Dictionary Type [6]: MS-Mail, MS-Outlook, Mail, Outlook, Outlook95
  Dictionary Type [16]: lezen, openen, versturen, schrijven
  (The examples are Dutch: "met MS-Mail kan ik geen mail openen" means "with MS-Mail I cannot open mail"; lezen = read, openen = open, versturen = send, schrijven = write.)
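
The dictionary types above group expressions that can replace one another in the same sentence frame. The toy sketch below shows that idea on four of the slide's sentences: words that occur in an identical left and right context are put into one type. It is a deliberately simplified illustration, not Emile's actual algorithm, and the name `induce_types` is hypothetical.

```python
from collections import defaultdict


def induce_types(sentences):
    """Group words that occur in exactly the same (left, right) sentence context."""
    by_context = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            context = (" ".join(words[:i]), " ".join(words[i + 1:]))
            by_context[context].add(word)
    # Only contexts shared by more than one word give evidence for a type.
    return {frozenset(words) for words in by_context.values() if len(words) > 1}


sentences = [
    "met MS-Mail kan ik geen mail openen",
    "met MS-Mail kan ik geen mail versturen",
    "met MS-Outlook kan ik geen mail openen",
    "met MS-Outlook kan ik geen mail versturen",
]
for word_type in induce_types(sentences):
    print(sorted(word_type))
```

On these four sentences the sketch groups the two product names into one type and the two verbs into another.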

16 Grammar induction
  Fragment of Phaistos disk
    * …
  Fragment of grammar
    [0] --> [3].
    [3] --> [16] [47]
    [14] --> 15 [40]
    [14] --> 2 12
    [16] --> 2 [57]
    [16] --> [14] 13 1
    [16] -->
    [40] --> 7
    [40] --> 29
    [47] --> 18
    [47] -->
    [57] --> 27
    [57] --> 29

17 Knowledge base construction
  Dictionary Type [35]: K033, k033, K105, k33
  Dictionary Type [87]: Vrachtgeb, vrachtgeb, Vrachtgebouw, Vracht
  Dictionary Type [89]: CGOADTP6, Printqueue
  Dictionary Type [114]: is, Userid, Password
  Dictionary Type [138]: status, Error
  Dictionary Type [196]: scarlos, vrachtbrieven
  Dictionary Type [215]: G239, g239
  Dictionary Type [237]: enorm, ontzettend, super
  Dictionary Type [290]: pingen, benaderen

18 Emile on Biomed (1)

19 Emile on Biomed (2)

20 Emile on Biomed (3)

21 Emile outcome
  [16] --> School of Medicine, University of Washington, Seattle 98195, USA
  [16] --> University of Kitasato Hospital, Sagamihara, Kanagawa, Japan
  [16] --> Heinrich-Heine-University, Dusseldorf, Germany
  [16] --> School of Medicine, Chiba University
  [5] --> Department of Urology, [16]
  [94] --> Chinese
  [94] --> Japanese
  [94] --> Polish
  [101] --> 32 : Cancer Res 1996 Oct
  [101] --> 35 : Genomics 1996 Aug
  [101] --> 44 : Cancer Res 1995 Dec
  [101] --> 50 : Cancer Res 1995 Feb
  [101] --> 54 : Eur J Biochem 1994 Sep
  [101] --> 58 : Cancer Res 1994 Mar
  [105] --> identified in 13 cases ( 72
  [105] --> detected in 9 of 87 informative cases ( 10
  [105] --> observed in 5 ( 55
  [11] --> LOH was [105] %

