What's Happened Since the First SIGDAT Meeting? Kenneth Ward Church AT&T Labs-Research


1 What's Happened Since the First SIGDAT Meeting? Kenneth Ward Church AT&T Labs-Research

2 The First SIGDAT Meeting WVLC-1 was held just before ACL-93 Great turnout! –More like a conference than a workshop We knew that corpora were “hot,” –but didn't appreciate just how hot they would turn out to be.

3 Sister meetings have also done very well since 1993 Information Retrieval –http://www.acm.org/sigir/ Digital Libraries –http://fox.cs.vt.edu/DL99/ Machine Learning –http://www.cs.cmu.edu/Web/Groups/NIPS Data-mining, Databases, Data Warehousing –http://www.acm.org/sigkdd/ –http://www.vldb.org/

4 Empiricism has a long history In the 1950’s, empiricism dominated a broad set of fields: –from psychology (behaviorism) –to electrical engineering (information theory). At the time, it was common practice in linguistics to classify words not only on the basis of their meanings –but also on the basis of their co-occurrence with other words. –“You shall know a word by the company it keeps” (Firth, 1957) Regrettably, interest in empiricism faded in the 1960’s: –Chomsky's criticism of ngrams in Syntactic Structures (1957) and –Minsky and Papert's criticism of neural nets in Perceptrons (1969).
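
Classifying a word by "the company it keeps" can be illustrated with a simple co-occurrence count. The following is a minimal Python sketch, not something from the talk: the helper name `cooccurrence_counts`, the symmetric window, and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """For each word, count the words seen within `window` positions of it.

    Hypothetical helper for illustration; the window is symmetric
    (it looks both left and right of each token).
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own neighbor
                counts[word][tokens[j]] += 1
    return counts


# Toy corpus (an assumption, not data from the talk).
tokens = "strong tea and strong coffee but powerful computers".split()
company = cooccurrence_counts(tokens, window=1)

# Words that share neighbors would be grouped together in a
# distributional classification; here we just print the neighbors.
print(company["strong"].most_common())
print(company["powerful"].most_common())
```

On a real corpus the same counts would be gathered over millions of tokens and usually normalized (for example by an association measure) before words are compared.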

5 1990’s Revival Empiricism regained a dominant position: –Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech. –Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning. Empiricism → Rationalism → Empiricism –Oscillates about once a career Mark Twain: Grandparents and Grandchildren have a natural alliance.

6 Why the Revival? “It was a bad idea then, and it is still a bad idea now” More powerful computers?? Availability of massive quantities of data!! –Text is available like never before. –Not long ago, the Brown Corpus was considered large. –But now, text is available like never before! First came collection efforts (www.ldc.upenn.org), And now everyone has access to the Web! Experiments are routinely carried out on gigabytes of text. Some researchers are even working with terabytes.

7 Big Changes Since 1993 The Web, stupid! –Demos –Data Research: –Shared resources + evaluation –Scale: How large is very large? –Increased breadth: Geography, Topics Commercial: Wall Street & Main Street

8 The Web, Stupid! If you publish a paper about neat stuff, it is expected that you will post it on the web. I’ll mention just a few examples of neat stuff on the web. –Demos –Data –Tools

9 Lots of Neat Demos on the Web Web Searching with Machine Translation –www.altavista.com (uses Systran) Cross-Language Information Retrieval (CLIR): –www.xrce.xerox.com Parallel Corpora: www-rali.iro.umontreal.ca Latent Semantic Indexing (LSI) –superbook.bellcore.com/~remde/lsi –lsa.colorado.edu Speech Synthesis: Dotplot:

10 Lots of Neat Data on the Web Wordnet: Linguistic Data Consortium (LDC): – SIGLEX: Discourse Resource Initiative (DRI) –www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html The Federalist Papers: –www.mcs.net/~knautzr/fed

11 More Neat Data on the Web (in Lots of Languages) Chinese: –rocling.iis.sinica.edu.tw –www.sinica.edu.tw Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html –Electronic Dictionary Research (EDR): –Advanced Telecommunications Research (ATR): –www.rdt.monash.edu.au/~jwb/japanese.html Korean: korterm.kaist.ac.kr European Language Resources Association (ELRA) –www.icp.grenet.fr/ELRA Parallel Text (Resnik, ACL-99) –Canadian Hansards: –Turkish: –Swedish: svenska.gu.se

12 Lots of Neat Tools on the Web Penntools (links to all over the world) –www.cis.upenn.edu/~adwait/penntools.html Part of Speech Taggers (see above) Juman/Chasen –pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html –cl.aist-nara.ac.jp/lab/nlt/chasen.html Suffix Arrays –http://cm.bell-labs.com/cm/cs/who/doug/ssort.c

13 Big Changes Since 1993 The Web, stupid! –Demos –Data → Research: –Shared resources + evaluation –Scale: How large is very large? –Increased breadth: Geography, Topics Commercial: Wall Street & Main Street

14 Shared Resources + Evaluation Common tasks: –TREC (trec.nist.gov), Tipster, MUC Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard Shared lexical resources: Wordnet ( Common labeling conventions/standards in all areas of NLP from Speech to Discourse Evaluation, evaluation, evaluation –Required to get a paper accepted anywhere.

15 In 1993, it wasn’t like this... Invited talks at ACL-93 –“Planning Multimodal Discourse” –“Transfers of Meaning” –“Quantificational Domains and Recursive Contexts” Less sharing of resources Evaluation not required

16 Empiricism vs. Rationalism Pluses: Clear measurable progress –Speech Recognition –Part of Speech Tagging –Parsing Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort –Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome.

17 Big Changes Since 1993 The Web, stupid! –Demos –Data Research: –Shared resources + evaluation –Scale: How large is very large? –Increased breadth: Geography, Topics → Commercial: Wall Street & Main Street

18 Main Street: Big change since 1993 Large corpora are now having an impact on ordinary users: –Web search engines/portals –Managing Gigabytes: not just a popular book, but something that ordinary users are beginning to take for granted.

19 Huge Commercial Successes (Since 1993) Information Retrieval & Digital Libraries –Web search engines/portals: highly successful on both Wall Street and Main Street Invited talks from Lycos (1997) & Infoseek (1998) Machine Translation & Speech –Available wherever software is sold –Can’t use a phone without talking to a computer

20 Big Changes Since 1993 The Web, stupid! –Demos –Data Research: –Shared resources + evaluation → Scale: How large is very large? –Increased breadth: Geography, Topics Commercial: Wall Street & Main Street

21 How Large is Very Large?

22 Mirror, mirror on the wall Who is the largest of them all? –The Web? –Lexis-Nexis? –West? We have had invited talks from all three –Web: Lycos (1997) & Infoseek (1998) –Lexis-Nexis (1993) –West (1997)

23 Big Changes Since 1993 The Web, stupid! –Demos –Data Research: –Shared resources + evaluation –Scale: How large is very large? → Increased breadth: Geography, Topics Commercial: Wall Street & Main Street

24 Internationalization SIGDAT-93: Nearly equal participation –America: 4 papers –Asia: 4 papers –Europe: 3 papers Great growth in activity around the world, especially Asia SIGDAT has met in a dozen cities (50% in America) –America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park –Asia: Kyoto, Beijing, Hong Kong –Europe: Dublin, Copenhagen, Granada

25 Some Topics that are Behind the International Expansion Classic Issues –Machine Translation (MT) / Tools –Input Method Editor (IME): MS-IME98 –Morphology: Juman, Chasen New Issues –Cross-language Information Retrieval (CLIR) –Browsing the Internet: integrate IME + CLIR + MT –Parallel and comparable corpora –Terminology Extraction & Alignment –Suffix Arrays

26 Big Changes Since 1993 The Web, stupid! –Demos –Data Research: –Shared resources + evaluation –Scale: How large is very large? → Increased breadth: Geography, Topics Commercial: Wall Street & Main Street

27 Broader (and More Applied) View of Computational Linguistics Data-mining, Databases, Data Warehousing Digital Libraries Information Retrieval, Categorization, Extraction Lexicography Machine Learning Machine Translation Speech Text Analysis

28 Data-Mining Issues (How Large is Very Large?) Similar technology to corpus-based methods But much larger datasets –Newswire (AP): 1 million words per week –Telephone calls: 1-10 billion per month –IP packets: expected to be even larger Tasks: Fraud, Marketing, Operations, Care –Identify knobs that business partners can turn Increase demand (buy TV ads, reduce price) Increase supply (buy network capacity, enhance operations) –Target opportunities for improvement (marketing prospects) –Track market response in real time (supply/demand by knob)
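
To put the rates on this slide on a common yearly footing, here is a rough back-of-the-envelope sketch in Python. The conversion factors (52 weeks and 12 months per year) and all variable names are assumptions for illustration; words and telephone calls are different units, so the ratio only indicates relative record counts, not data volume.

```python
# Conversion factors (assumed, for a rough yearly comparison).
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

# Rates as quoted on the slide.
newswire_words_per_week = 1_000_000
calls_per_month_low = 1_000_000_000
calls_per_month_high = 10_000_000_000

# Put both sources on a per-year basis.
newswire_words_per_year = newswire_words_per_week * WEEKS_PER_YEAR
calls_per_year_low = calls_per_month_low * MONTHS_PER_YEAR
calls_per_year_high = calls_per_month_high * MONTHS_PER_YEAR

# How many call records arrive per newswire word, low and high estimates.
ratio_low = calls_per_year_low / newswire_words_per_year
ratio_high = calls_per_year_high / newswire_words_per_year

print(f"Newswire: {newswire_words_per_year:,} words per year")
print(f"Calls: {calls_per_year_low:,} to {calls_per_year_high:,} per year")
print(f"Calls per newswire word: about {ratio_low:.0f} to {ratio_high:.0f}")
```

Under these assumptions the call data outnumbers the newswire stream by roughly two to three orders of magnitude, which is the gap the slide is pointing at.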

29 Best of SIGDAT Best Invited Talk Work of Note Work of Note (in Related Fields)

30 Best Invited Talk at a SIGDAT Meeting Henry Kučera and Nelson Francis –Third Workshop on Very Large Corpora (1995) –Massachusetts Institute of Technology (MIT) –Cambridge, MA, USA Described their work on the Brown Corpus –At a time when empiricism was out of fashion –especially at MIT –Personal & Touching (received standing ovation)

31 Work of Note Statistical Machine Translation / Alignment –Brown et al. Statistical Parsing (In 1993, poor use of lexical info) –Jelinek, Magerman, Charniak, Collins Statistical PP Attachment –Hindle and Rooth Word-sense Disambiguation –Yarowsky Text-tiling (Discourse Parsing) –Hearst

32 Work of Note (in Related Fields) Learning –Classification and Regression Trees (CART) –Ripper Web Tools –Managing Gigabytes, Harvest, SGML → XML Representation –Suffix Arrays –Latent Semantic Indexing

33 Summary: Reaching a Wider Audience Commercial Successes –Main Street & Wall Street Internationalization –Goal: equal rep from America, Asia & Europe More topic areas –Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining

34 Self-organizing vs. EDA Self-organizing: Learning, HMM –Statistics do it all Manual –Wilks’ Stone Soup: Statistics don’t do nothing Exploratory Data Analysis (EDA) –Hybrid of above

35 Time for a little controversy: Two types of Empiricism New Linguistic Insights vs. Methodology Reviewers do what reviewers do –Safe, conservative, seek precedents, case law –Reviewers go easy on methodology papers Grim historical reminder: –Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome. Shouldn’t let the methodology get in the way of what we are here to do.

