
NLP Research at Internet Age An Overview of NLP at Microsoft Research Asia Ming Zhou Manager of Natural Language Group Microsoft Research Asia.


1 NLP Research at Internet Age An Overview of NLP at Microsoft Research Asia Ming Zhou Manager of Natural Language Group Microsoft Research Asia

2 Trends of Internet Services Ecosystems that work with third-party apps – Apple Apps, Facebook, Twitter, Baidu, Sina, QQ Real-time content collection and search – Twitter, Facebook, Del.icio.us, NYT, YouTube Mobile search – Contextual intent understanding – Towards decision making and action taking Social power – Social tags (like) for general search engines – Search engines in SNS – Social QA

3 Impact and Challenge to NLP Research Impact – Biggest database ever – connects data – Biggest social network – connects people – Harnessing collective intelligence – Contextual information processing: User, user’s social network, location, time – Real-time information processing: Collection, index, operation without delay Challenge – How to leverage data, people, contextual information to reach real-time information processing?

4 Problems of Traditional NLP Approaches (NLP 1.0) Deep in individual component technologies, but these are reaching their upper bounds; little consideration of scenarios, users' needs, and market needs; serious data sparseness under human annotation; evaluation bottleneck; slow deployment; no effective framework for incorporating users' feedback

5 New Strategy of NLP (NLP 2.0) Data collection from the web; domain-specific and open IE; contextual NLP; maximize at the system level, not at the level of the individual component; earlier deployment on the Internet; make the best use of social factors

6 Our Vision and Task Advanced NLP technologies – Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization – Chinese, Japanese, English Multi-language information access – Statistical machine translation – Multi-language search Semantic computing – Sentiment analysis, event extraction, ontology learning – Understanding query intent and document – Contextual NLP Goal: understand user and document in any language, for any device and any application

7 MSRA NLP Research Overview
Component techs
– Text analysis: skeleton parser, named entity identification, POS tagging, SLM
– Machine translation: translation evaluation, translation knowledge acquisition, web mining for MT, SMT
– Information extraction: annotation tool, machine learning, term extraction
– Information retrieval: paraphrasing, vertical search, cross-language IR, NLP-enriched indexing and search, query-doc relevance, text mining
Data
– NLP (C, J, E); MT (C, J, E); IR and IE (C, J, E)
– MRD, translation lexicon, bilingual corpus, bilingual tagged corpus, parsing lexicon, tagged corpus, balanced corpus
Applications
– Chinese IME, Japanese IME, query speller, English writing wizard, news search, Twitter search, pocket translator, meta data extraction, couplet generation, resume routing, general web search, chatbot, comparison shopping

8 Research Accomplishment Awards – MSRA Best Research Team (2010) – Finalist of WSJ Asian Innovation Awards (2010) – MS ARD Best Project (Engkoo) – MSRA Best Innovation (1998-2008): IME and Chinese couplets Academic impact – Best result in NIST 2008 SMT, CWMT 2008 and CWMT 2009 – Best result in SIGHAN 2006 bakeoff on Chinese word segmentation – Best result in cross-language information retrieval in TREC-9, NTCIR-III – 40 ACL papers, 9 SIGIR, 17 Coling papers (2000-2010) – PC Chair, area chair of ACL Collaboration with universities – HIT Joint lab on NLP, Speech and Search, Tsinghua Joint lab on Media and Network – 400 interns in 12 years – Summer schools since 2001 – PhD supervisors at universities

9 Summer School on Information Extraction (Harbin, June, 2005) Cheng Niu: Information extraction Frank Seide: Speech information extraction and search Hwee Tou Ng: Advanced topics of information extraction Chin-Yew Lin: Information extraction for automatic summarization

10 Projects based on NLP 2.0 Engkoo: Web-based English learning service – Data mining from the web Chinese couplets – Bring users' power into system evolution Semantic analysis and search of microblogging – Move to SNS, mobile

11 Engkoo Parallel data mining from the web Video: http://video.sina.com.cn/v/b/37417609-1286528122.html

12 Rapidly Changing Language Approximately 1.5 billion people speak English as a primary, secondary or business language China: The largest “English speaking” country with 250 million English learners and USD 60 billion annual expenses Problem: Live language: new words, new meanings Key Insight: With billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge

13 www.engkoo.com Major features: Endless Lexicon with Native Definitions; State-of-the-Art Machine Translation (NIST OpenMT Winner); Real-time Interactive Alignment; Human-Like TTS & Phonetic Search. Microsoft products: Bing, Office, MSN

14 Massive Dictionary Mined from the Web

15 Fresh and Diverse Examples

16 Advanced Search with Sentence Analysis

17

18 Sentence Classification

19

20

21 Learn Contextual Usage with Word Alignment

22

23

24 Hints for Easily Confused Words

25

26 Knowledge Mining Pipeline
Processes: Web Mining, Linguistic Parsing, Knowledge Mining, Multi-level Indexing
Data: Mined Data, Parsed Data, Linguistic Knowledge, Indexed Data
Models: Machine Translation Model, Paraphrasing Model
Parallel sentence: He could hardly afford to waste that golden time. 他无法浪费那样的好时光。
Linguistic parsing
– tokenizing: he could hardly afford to waste that golden time. 他 无法 浪费 那样 的 好 时光。
– skeleton parsing: (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time) (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
– alignment: he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)
Knowledge mining
1. word's idiomatic usage: Verb~Noun (decline~offer), Verb~Adv (greatly~improve), Adj~Noun (arduous~task), Adv~Adj (extremely~bad)
2. paraphrasing: turn_on~light, switch_on~light; laborious~task, hard~task; deeply~moved, deeply~touched
3. collocation translations: 订~计划, make~plan; 订~旅馆, book~room; 订~杂志, subscribe to~magazine
Multi-level indexing
1. single word: "he", "could", "hardly", "afford" etc.; "他", "无法", "浪费" etc.
2. single word with its POS: "he_Pron", "could_Verb", "hardly_Adv" etc.; "他_Pron", "无法_Adv", "浪费_Verb" etc.
3. collocation: "Tsub~he~afford", "Tobj~time~waste" etc.; "Tsub~他~浪费", "ModAdv~无法~浪费" etc.
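The three index levels on this slide (single word, word with POS, collocation) can be illustrated with a small inverted index. This is only a sketch under assumptions: the function names `index_keys` and `build_index` are hypothetical, the POS tag for "afford" is not given on the slide, and the real Engkoo index format is not described in the deck.

```python
from collections import defaultdict


def index_keys(tokens, pos_tags, triples):
    """Return the word, word_POS and collocation keys for one sentence."""
    words = list(tokens)
    words_with_pos = [f"{w}_{p}" for w, p in zip(tokens, pos_tags)]
    # Triples follow the slide's order: relation~dependent~head.
    collocations = [f"{rel}~{dep}~{head}" for rel, dep, head in triples]
    return words + words_with_pos + collocations


def build_index(sentences):
    """Map every key to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sentence_id, (tokens, pos_tags, triples) in enumerate(sentences):
        for key in index_keys(tokens, pos_tags, triples):
            index[key].add(sentence_id)
    return index


# Part of the slide's English example. The tag "Verb" for "afford" is an
# assumption; the other tags and both triples appear on the slide.
english = (
    ["he", "could", "hardly", "afford"],
    ["Pron", "Verb", "Adv", "Verb"],
    [("Tsub", "he", "afford"), ("ModAdv", "hardly", "afford")],
)
index = build_index([english])
print(sorted(index["Tsub~he~afford"]))
```

Keeping all three levels in one key space means a query can mix a plain word, a POS-constrained word, and a collocation and intersect the resulting sentence sets.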

27 Chinese Couplets Bring users' power into system evolution

28 Chinese Couplets (http://duilian.msra.cn) http://video.sina.com.cn/v/b/10937201-1452530713.html

29 FS (first sentence) and SS (second sentence) Share the Same Style: repetition of pronunciations (音韵联)
风 (wind) - 水 (water)
吹 (blow) - 使 (make)
荞 (buckwheat) - 舟 (ship)
动 (wave) - 流 (go)
桥 (bridge) - 洲 (island)
未 (not) - 不 (not)
动 (wave) - 流 (go)
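One part of this constraint is easy to check mechanically: where a character repeats in the first sentence (动 at positions 4 and 7), the second sentence must repeat a character at the same positions (流). The sketch below covers only literal character repetition. The homophone pairs on the slide (荞/桥, 舟/洲) would need a pronunciation dictionary, which is not assumed here, and the function names are hypothetical.

```python
def repetition_pattern(sentence):
    """Replace each character by the index of its first occurrence."""
    first_seen = {}
    pattern = []
    for position, char in enumerate(sentence):
        first_seen.setdefault(char, position)
        pattern.append(first_seen[char])
    return pattern


def same_repetition_style(first_sentence, second_sentence):
    """True if both sentences have equal length and repeat at the same positions."""
    return (
        len(first_sentence) == len(second_sentence)
        and repetition_pattern(first_sentence) == repetition_pattern(second_sentence)
    )


fs = "风吹荞动桥未动"  # first sentence from the slide
ss = "水使舟流洲不流"  # second sentence from the slide
print(same_repetition_style(fs, ss))
```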

30 FS and SS Share the Same Style: decomposition of characters (拆字联)
有 (have) - 缺 (lack)
子 (son) - 鱼 (fish)
有 (have) - 缺 (lack)
女 (daughter) - 羊 (mutton)
方 (so) - 敢 (dare)
称 (call) - 叫 (call)
好 (good) - 鲜 (fresh)
Decomposition: 好 = 女 + 子; 鲜 = 鱼 + 羊

31 FS and SS Share the Same Style: person name (人名联) and palindrome (回文联)
板桥 (Banqiao) - 东坡 (Dongpo)
造 (produce) - 居 (live)
桥 (bridge) - 坡 (mountain)
板 (board) - 东 (east)
Banqiao (板桥) and Dongpo (东坡) are famous literary figures. Each sentence reads the same from top to bottom as from bottom to top.

32 SS Generation Process
FS: 海 阔 凭 鱼 跃 (Sea wide allow fish jump)
Candidate characters: 山 (hill), 天 (sky), 高 (high), 深 (deep), 任 (permit), 倚 (depend), 虫 (insect), 鸟 (bird), 虎 (tiger), 飞 (fly), 舞 (dance), 鸣 (tweedle)
Candidate phrases: 天高 (sky high), 山高 (hill high), 鸟飞 (bird fly), 虎啸 (tiger roar)
Stages: SMT decoding, linguistic filtering, reranking
Candidate lists shown on the slide, in the order extracted:
– 山高任鸟飞 天高任鸟鸣 天高任鸟飞 山高靠虎啸 山高任虎啸 山深任鸟飞 天高任花香 ……
– 天高任鸟飞 山高任鸟飞 天高任鸟鸣 天高任鸟舞 山深任鸟飞 山高任花香 天高任花香 ……
– 山高任鸟飞 天高任鸟鸣 天高任鸟飞 山深任鸟飞 天高任花香 天高任鸟舞 山高任花香 ……

33 SS Generation Approach A multi-phase SMT approach – Phase 1: a phrase-based log-linear model – Phase 2: some linguistic filters – Phase 3: a Ranking SVM Pipeline: FS input, phrase-based log-linear model, N-best candidates, linguistic filters, Ranking SVM model, SS output
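The three phases can be sketched as a score, filter, and rerank chain. Everything concrete below is an assumption made for illustration: the feature names (`tm`, `lm`, `mi`), all weights and feature values, and the two filters (equal length and matching repetition positions). The deck does not list the actual features or filters, and a real Ranking SVM would learn its weights from data rather than take them as constants.

```python
def log_linear_score(features, weights):
    """Weighted sum of feature values, as in a log-linear model."""
    return sum(weights[name] * value for name, value in features.items())


def repetition_pattern(sentence):
    """Index of the first occurrence of each character."""
    return [sentence.index(char) for char in sentence]


def passes_filters(first_sentence, candidate):
    """Illustrative filters: same length and same repetition positions."""
    return (
        len(candidate) == len(first_sentence)
        and repetition_pattern(candidate) == repetition_pattern(first_sentence)
    )


def generate_second_sentences(first_sentence, candidates, model_weights,
                              rank_weights, n_best=3):
    """candidates: list of (text, decoder_features, ranking_features)."""
    # Phase 1: keep the n_best candidates under the log-linear model.
    decoded = sorted(candidates,
                     key=lambda c: log_linear_score(c[1], model_weights),
                     reverse=True)[:n_best]
    # Phase 2: drop candidates that violate the linguistic filters.
    kept = [c for c in decoded if passes_filters(first_sentence, c[0])]
    # Phase 3: rerank with a linear scorer, the form a trained Ranking SVM
    # applies at inference time.
    kept.sort(key=lambda c: log_linear_score(c[2], rank_weights), reverse=True)
    return [c[0] for c in kept]


# All feature values here are made up for the example.
candidates = [
    ("天高任鸟飞", {"tm": -1.0, "lm": -2.0}, {"mi": 0.9}),
    ("山高任鸟飞", {"tm": -1.2, "lm": -1.5}, {"mi": 0.7}),
    ("天高任花香", {"tm": -2.0, "lm": -2.5}, {"mi": 0.2}),
    ("天高鸟飞", {"tm": -0.5, "lm": -0.5}, {"mi": 0.95}),
]
result = generate_second_sentences(
    "海阔凭鱼跃", candidates, {"tm": 1.0, "lm": 1.0}, {"mi": 1.0}
)
print(result)
```

In this toy run the four-character candidate scores best in phase 1 but is removed by the length filter, which shows why the filters sit between decoding and reranking.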

34 Great Examples FS: 月落乌啼霜满天 SS: 风吹雁过雨连宵 FS: 千江有水千江月 SS: 万里无云万里星 FS: 秦淮河桨声灯影 SS: 松花江水色月光 FS: 此木为柴山山出 ( 此 + 木 = 柴 ; 山 + 山 = 出 ) SS: 白水作泉日日昌 ( 白 + 水 = 泉 ; 日 + 日 = 昌 )

35

36

37 User Log for Model Enhancement Motivation – Training data is not adequate – The user log is large (60k/m), growing, and diverse What logs we record – User inputs – User-finalized couplets: second sentences selected from the candidates provided by our system, and second sentences modified by the user

38 User's Log Analysis
Data source: log from http://couplet.msra.cn
Time period: Aug. 31 - Oct. 9, 2006
Number of input sentences: 12,322
Number of unique input sentences: 6,698
Users directly select from system output: 3,459
Users manually modify system output: 606
Save as favorite couplets: 109
Invalid user input: 618
No second sentence generated: 2,211
Banner generation: 2,687
Select the generated banner as favorite: 428
No banner output: 265
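To compare these counts, it helps to express them as shares. The sketch below assumes that every count except the banner ones is taken over the 12,322 input sentences, and that the banner counts are taken over the 2,687 banner generations. The slide does not state the denominators, so treat the ratios as illustrative.

```python
# Counts copied from the slide.
log_counts = {
    "input_sentences": 12322,
    "unique_input_sentences": 6698,
    "direct_select": 3459,
    "manual_modify": 606,
    "invalid_input": 618,
    "no_second_sentence": 2211,
    "banner_generation": 2687,
    "banner_favorite": 428,
}

total_inputs = log_counts["input_sentences"]
total_banners = log_counts["banner_generation"]

# Assumed denominators: inputs for sentence events, banner requests for banner events.
rates = {
    "unique_inputs": log_counts["unique_input_sentences"] / total_inputs,
    "direct_select": log_counts["direct_select"] / total_inputs,
    "manual_modify": log_counts["manual_modify"] / total_inputs,
    "no_second_sentence": log_counts["no_second_sentence"] / total_inputs,
    "banner_favorite": log_counts["banner_favorite"] / total_banners,
}

for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```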

39 New Framework with Log Data Diagram labels: training data; log data; user operation; first sentence input; source-channel model; translation model; language model; mutual information; N-best candidates; re-ranking; second sentence output

40 Twitter Search Move to social internet and mobile

41 Tweets Levels: individual tweet; a collection of tweets. Diagram labels: Raw Data, Noise Filtering, Semantic Role Labeling, Sentiment Analysis, NE Recognition, Dependency Parsing, Co-reference, Text Normalization, Classification, Sentence Boundary Detection, Tweets Cluster, Statistical Relationship Learning, News & Images Link Extraction, Community Extraction, User Influence Measure, Hot tag and topic Extraction, Popular Tweet Extraction, Top video, music, artists Extraction, Multi-level Indexing, Semantic Search

42 Conclusion Internet trends and impacts to NLP NLP2.0 strategy Web data mining: Engkoo User’s power: Couplets SNS and mobile: Twitter search

