


1. DTSI / Service Cognitique Robotique et Interaction
Automatic Concept Identification
Gregory Grefenstette
IRF Symposium 2007, Vienna
Contact: gregory.grefenstette@cea.fr

2. Automatic Concept Recognition
- Linguistic transformations and language models
- Statistics for salience, weirdness
- Concept finding using simple statistics
- Concept network extraction using patterns

3. Linguistic Models (lists, graphs, patterns, weights)
[Diagram: processing stages with the example attached to each one; some detail is not recoverable from the transcript]
- OCR, Speech: noisy character strings and weights, lexicon entries (coca, even, cocaine)
- Tokenization: text, abbrev -> abbrev [space]
- Stopwords: list (in, of, the, this, to, with); example "A retrospective study of 90 men with painful hip"
- Lemmatization: pained, pains -> pain +V (lexicon)
- Derivational Morphology: pained, painful -> pain (lexicon)
- Part-of-Speech Tagging: "their hip pains were", tags +Det +N +V +BeV reduced to +Det +N +BeV; tag weights shown for N, V, D, BeV (.85 .05 .35 .13)
- Noun Phrase Chunking: pattern Adj* N* N (prep Adj* N* N)*; "their hip pains were" -> "their [ hip pains ] were"

4. What Natural Language Processing Does
- It knows something and transforms one thing into another (looks in a list)
  Some things NLP knows:
  - thinks, thought, thinking -> think (list)
  - John is a common first name in English (list with frequency)
  - Lists of country names (typed lists)
  - Lists of medicines (ontologies)
- It guesses (uses a list with frequencies attached)
  Some things NLP has to guess:
  - Tokenization
  - Part-of-speech tagging
  - Parsing
  - Language identification
  - Domain classification
  - New concepts

5. Identifying concepts using NLP
- Tokenization -> extract "brines" from "... separate hydrate crystals from residual brines, then ..."
- Part-of-speech tagging -> extract nouns from "Separating/verb the/det hydrate/adj crystals/noun from/prep the/det residual/adj brines/noun ,/comma"
- Lemmatisation -> extract separate, brine, crystal from the above
- Noun phrase extraction -> extract hydrate crystal, residual brine
- Dependency extraction -> extract separate,crystal,VDOBJ
- Compare lists and frequencies -> retain patterns in a list, or those more frequent than "normal"
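The steps on this slide can be sketched end to end on the slide's own example sentence. This is only a minimal illustration, not the system used in the talk: the `LEXICON` table, the tag names, and the helper functions `tokenize`, `tag`, and `noun_phrases` are hypothetical, and the tagger is a plain dictionary lookup rather than a statistical model.

```python
import re

# Hypothetical toy lexicon: word form -> (lemma, part of speech).
# A real system would use a full lexicon and a statistical tagger.
LEXICON = {
    "separating": ("separate", "verb"),
    "the": ("the", "det"),
    "hydrate": ("hydrate", "adj"),
    "crystals": ("crystal", "noun"),
    "from": ("from", "prep"),
    "residual": ("residual", "adj"),
    "brines": ("brine", "noun"),
}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Return (token, lemma, pos) triples; unknown tokens get the tag 'unk'."""
    return [(tok, *LEXICON.get(tok.lower(), (tok.lower(), "unk"))) for tok in tokens]

def noun_phrases(tagged):
    """Collect maximal runs of adj/noun lemmas that end in a noun."""
    phrases, run = [], []
    for _, lemma, pos in tagged + [("", "", "end")]:
        if pos in ("adj", "noun"):
            run.append((lemma, pos))
            continue
        # Trim trailing adjectives so the phrase ends on its head noun.
        while run and run[-1][1] != "noun":
            run.pop()
        if len(run) > 1:
            phrases.append(" ".join(lemma for lemma, _ in run))
        run = []
    return phrases

tagged = tag(tokenize("Separating the hydrate crystals from the residual brines,"))
print(noun_phrases(tagged))  # ['hydrate crystal', 'residual brine']
```

The chunker only covers the "Adj* N* N" part of a noun phrase pattern; prepositional attachments and dependency extraction are left out to keep the sketch short.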

6. Statistics available
- Frequency of word in domain corpus
- Frequency of multiword term in domain corpus
- Frequency of word in background corpus
- Frequency of multiword term in background corpus
- Number of documents word is found in
- Length of the word

7. Comparing against background concepts
Weirdness factor
Khurshid Ahmad, Lee Gillam and Lena Tostevin (2000). Weirdness Indexing for Logical Document Extrapolation and Retrieval (WILDER). In E.M. Voorhees and D.K. Harman (Eds.), The 8th Text Retrieval Conference (TREC-8). Washington: National Institute of Standards and Technology, pp. 717-724.
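Weirdness is usually described as the ratio between a term's relative frequency in the domain corpus and its relative frequency in a background corpus. The sketch below follows that description; the smoothing term, the function name `weirdness`, and all counts are assumptions made up for illustration and are not taken from the slides or from the cited paper.

```python
from collections import Counter

def weirdness(term, domain_counts, background_counts, smoothing=1.0):
    """Relative frequency in the domain divided by relative frequency in the background."""
    domain_total = sum(domain_counts.values())
    background_total = sum(background_counts.values())
    domain_rel = domain_counts[term] / domain_total
    # Assumption: simple additive smoothing so that terms unseen in the
    # background corpus do not cause a division by zero.
    background_rel = (background_counts[term] + smoothing) / (background_total + smoothing)
    return domain_rel / background_rel

# Invented toy counts; Counter returns 0 for terms it has never seen.
domain = Counter({"present invention": 60, "freeze wells": 30, "heat source": 10})
background = Counter({"present invention": 500, "heat source": 40})

ranking = sorted(domain, key=lambda t: weirdness(t, domain, background), reverse=True)
print(ranking)  # ['freeze wells', 'heat source', 'present invention']
```

A frequent but generic phrase such as "present invention" drops to the bottom because it is just as common in the background corpus.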

8. Example: noun phrases according to weirdness

RAW (frequency, noun phrase)
22685  present invention
21515  heat sources
13672  certain embodiments
13627  heat source
12148  treatment area
11389  freeze wells
11300  carbon dioxide
9971   production wells
9860   formation fluid
9622   hydrocarbon layer
9403   selected section
8852   synthesis gas
8281   situ conversion process
7250   oil shale formation
7174   situ process
5844   method of claim
5785   oxidizing fluid
5104   aqueous solution
4751   condensable hydrocarbons
4660   permeable formation

WEIRDNESS RANKING (score with decimal comma, noun phrase)
189,817  freeze wells
45,2821  frozen barrier
24,6667  dewatering wells
22,9855  magnetic string
22,771   insulated conductor heater
22,3962  ICP wells
20,5966  oil shale formation
19,0847  surface treatment units
18,6949  tar sands formation
18,3359  synthesis gas generating fluid
17,9942  heater wells
17,7454  oxidizing fluid
17,4545  synthetic condensate
16,3158  perimeter barriers
15,5696  hydrocarbon layer
15,5183  perimeter barrier
11,8152  lean zones
11,5316  condensable hydrocarbons
11,3564  surface treatment unit
10,4157  heat recovery fluid
9,78417  overburden casing
9,48993  non-condensable hydrocarbons

Slide annotations: Domain vocabulary; Mixture; background vocabulary; Domain vocabulary

9. Simpler Method
- Take two orthogonal fields
- Subtract vocabulary found in both
- Example: filtering desalination patents against carburetor patents

10. Desalination / Carburetor / common

Desalination (frequency, noun phrase)
21515  heat sources
12148  treatment area
11389  freeze wells
9971   production wells
9860   formation fluid
9622   hydrocarbon layer
9403   selected section
7250   oil shale formation
5785   oxidizing fluid
4751   condensable hydrocarbons
4660   permeable formation
4619   simulation method
4129   heated portion
4116   formation fluids
3791   heat transfer fluid
3626   dewatering wells
3532   frozen barrier
3383   perimeter barrier
3383   API gravity
3177   heating rate
3113   heater wells
2983   insulated conductor heater
2870   heated formation
2685   resistance section
2566   average temperature

Carburetor (frequency, noun phrase)
5335  cylinder head
5028  fuel tank
3687  engine speed
3265  cylinder block
2473  fuel mixture
2320  air-fuel mixture
2142  choke valve
2091  crank chamber
2028  air cleaner
2016  intake manifold
1885  exhaust system
1668  exhaust valve
1583  output shaft
1571  spark plug
1567  intake valve
1456  engine body
1378  lubricating oil
1371  fuel pump
1204  fuel injector
1170  carburetor body
1155  idle position
1151  personal watercraft
1151  combustion chambers
1128  fuel system
1125  fuel-air mixture
1090  intake passage

Common (noun phrases found in both)
present invention, certain embodiments, heat source, carbon dioxide, synthesis gas, situ process, method of claim, aqueous solution, temperature zone, process of claim, reaction zone, heat transfer, operating conditions, room temperature, molecular weight, flow rate, another embodiment, computer system, preferred embodiment, thermal conductivity, sea water, reaction mixture, incorporated by reference, heat exchanger, upper portion, carbon monoxide
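The subtraction shown in these three columns amounts to a set difference over the two vocabularies. A minimal sketch follows, assuming each field is available as a dictionary from noun phrase to frequency; the function name `split_vocabularies` is hypothetical, and the counts given to the two shared phrases are invented because the slide lists the common column without frequencies.

```python
def split_vocabularies(field_a, field_b):
    """Return (only in A, only in B, common terms) for two term -> frequency dicts."""
    common = field_a.keys() & field_b.keys()
    only_a = {term: freq for term, freq in field_a.items() if term not in common}
    only_b = {term: freq for term, freq in field_b.items() if term not in common}
    return only_a, only_b, common

# Counts for "present invention" and "flow rate" are invented placeholders.
desalination = {"heat sources": 21515, "freeze wells": 11389,
                "present invention": 100, "flow rate": 50}
carburetor = {"cylinder head": 5335, "fuel tank": 5028,
              "present invention": 80, "flow rate": 40}

only_desal, only_carb, common = split_vocabularies(desalination, carburetor)
print(sorted(only_desal))  # ['freeze wells', 'heat sources']
print(sorted(common))      # ['flow rate', 'present invention']
```

Unlike the weirdness ratio, this needs no general background corpus, only a second field that is unlikely to share domain terms with the first.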

11. Concept Extraction via pattern matching and position
"... This syndrome is characterized by hip pain, limping, and osteoporosis of the femoral head with preservation of the joint space ..."
- NN( hip, pain ) -> hip pain
- ADJ( femoral, head ) -> femoral head

12. Concept Extraction via pattern matching and position
"... A sixteen year old girl presented with a four year history of hip pain followed subsequently by back pain radiating down her left leg ..."
- NN( hip, pain )
- NN( *, pain )
- NN( back, pain )
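The wildcard pattern NN( *, pain ) can be read as a filter over dependency tuples that a parser has already extracted. The sketch below assumes such tuples are given as (relation, modifier, head); that layout and the helper names `matches` and `find` are hypothetical, and the parsing step itself is not shown.

```python
def matches(pattern, dependency):
    """True if every pattern slot is '*' or equals the dependency's slot."""
    return len(pattern) == len(dependency) and all(
        p == "*" or p == d for p, d in zip(pattern, dependency)
    )

def find(pattern, dependencies):
    """Return the dependencies that match the pattern, in their original order."""
    return [dep for dep in dependencies if matches(pattern, dep)]

# Tuples corresponding to the two example sentences on the slides.
dependencies = [
    ("NN", "hip", "pain"),
    ("ADJ", "femoral", "head"),
    ("NN", "back", "pain"),
]

print(find(("NN", "*", "pain"), dependencies))
# [('NN', 'hip', 'pain'), ('NN', 'back', 'pain')]
```

Starting from a known concept such as "hip pain", relaxing one slot at a time is a simple way to surface sibling concepts such as "back pain".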

13. Concept Extraction Network
e.g. hip pain (457 documents MEDLINE)

Other local dependencies:
femoral head (79), stress fracture (43), Perthes disease (41), hip replacement (41), hip joint (41), septic arthritis (38), avascular necrosis (38), differential diagnosis (37), rheumatoid arthritis (35), femoral neck (35), patient with pain (34), transient osteoporosis (34), cause of pain (33), femoral fracture (32), back pain (32), total replacement (31), bone marrow (28), neck fracture (27), knee pain (26), hip arthroplasty (26), magnetic imaging (25), necrosis of head (23), ...

Other local 'pains':
back pain (32), knee pain (26), low pain (18), left pain (18), severe pain (17), groin pain (16), acute pain (16), persistent pain (14), right pain (13), thigh pain (12), chronic pain (11), anterior pain (9), lateral pain (8), bilateral pain (8), patient pain (5), joint pain (5), foot pain (5), extremity pain (5), buttock pain (5), abdominal pain (5), fever pain (4), musculoskeletal pain (4), ...

Other nearby 'hip things':
hip replacement (41), hip joint (41), hip arthroplasty (26), osteoporosis of hip (22), hip disease (15), hip fracture (14), arthritis of hip (13), hip joint (12), hip replacement (11), hip effusion (10), radiograph of hip (9), hip OA (9), hip dislocation (9), synovitis of hip (8), dislocation of hip (8), hip disease (8), examination of hip (7), hip arthroplasties (7), hip radiograph (7), hip prosthesis (7), patient with hip (6), hip score (6), ...

Derivational and syntactic variants:
NNPREP( pain in hip ), ADJ( painful hip ), ADJ( painless hip ), NNPREP( pain of hip ), NNPREP( pain around hip ), NNPREP( pain about hip )
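One plausible reading of the numbers in parentheses is the count of documents in which a term appears together with the seed concept. Under that assumption, which the slide does not state explicitly, a neighbour list can be sketched as follows; the function name `neighbours` and the four toy documents are invented for illustration.

```python
from collections import Counter

def neighbours(seed, documents):
    """Count, for each other term, the documents where it co-occurs with the seed."""
    counts = Counter()
    for terms in documents:
        if seed in terms:
            # Each term counts once per document, however often it occurs there.
            counts.update(term for term in set(terms) if term != seed)
    return counts

# Each document is represented by the set of concepts extracted from it.
documents = [
    {"hip pain", "femoral head", "back pain"},
    {"hip pain", "femoral head"},
    {"knee pain", "femoral head"},
    {"hip pain", "stress fracture"},
]

network = neighbours("hip pain", documents)
print(network.most_common(1))  # [('femoral head', 2)]
```

Grouping the resulting neighbours by shared head ("pain") or shared modifier ("hip") would give lists like the ones shown on the slide.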

14. [Diagram labels]
Topic Detection; Thesaurus Extraction; Trend Analysis; Summarization; Genre; Authorship; Info Retrieval; Part-of-Speech Tag; Anaphora Resolution; Entity Recognition; Question Answering; Concept Identification; Dependency Extraction; Hubs and Authorities; Morphological Analysis; Clustering; Classification; counting; pattern recognition; Grading

15. Conclusion
In NLP, concepts are
- Elements of a list (and their predictable variations) [ontology]
- Syntactic patterns (usually noun phrases)
- Salient sequences (frequency higher than background suggests)
- Current tendency
  - Vaster background models
  - Producing concept lists per topic (per concept)

16. Google Web 1T 5-gram
The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
Tokenization
- Similar to the tokenization of the Penn Treebank.
- Notable exceptions include the following:
  - Hyphenated words are usually separated, and hyphenated numbers usually form one token.
  - Sequences of numbers separated by slashes (e.g. in dates) form one token.
  - Sequences that look like URLs or email addresses form one token.
Data sizes
- Approx. 24 GB of compressed (gzip'ed) text files
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of fourgrams: 1,313,818,354
- Number of fivegrams: 1,176,470,663

17. Moving to large scale models
Web 1T 5-gram, sample of 3-grams (3-gram, count)
ceramics collection, 144
ceramics collection. 247
ceramics collection 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections, 66
ceramics collections. 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community, 109
ceramics community. 212
ceramics community for 61
ceramics companies. 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company, 133
ceramics company. 92
ceramics company 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92
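Counts like these can serve as a background model once they are loaded into a lookup structure. The sketch below assumes the n-gram files store one entry per line as space-separated tokens, a tab, and the count; the helper names `parse_ngrams` and `continuations` are hypothetical, and the sample reuses a few rows from the slide rather than reading the real files.

```python
from collections import Counter

# Assumption: one n-gram per line, tokens separated by spaces, then a tab and the count.
SAMPLE = """ceramics come from\t69
ceramics comes from\t660
ceramics company facing\t145
ceramics company in\t181
ceramics company started\t137
ceramics company that\t87
"""

def parse_ngrams(lines):
    """Yield (tuple_of_tokens, count) pairs from tab-separated n-gram lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, count = line.rsplit("\t", 1)
        yield tuple(ngram.split(" ")), int(count)

def continuations(lines, prefix):
    """Sum the counts of the token that directly follows the given prefix."""
    prefix = tuple(prefix)
    counts = Counter()
    for tokens, count in parse_ngrams(lines):
        if len(tokens) > len(prefix) and tokens[:len(prefix)] == prefix:
            counts[tokens[len(prefix)]] += count
    return counts

following = continuations(SAMPLE.splitlines(), ("ceramics", "company"))
print(following.most_common(2))  # [('in', 181), ('facing', 145)]
```

For the full data set a line-by-line scan like this would be slow; sorted files with binary search or an on-disk index would be the more practical choice.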


