CIS 530 Orientation - November 2001
Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA 19104
Motivation
- There are several thousand languages. Over 320 of them have more than 1,000,000 speakers each.
- The ability to process foreign languages supports the global economy, internationalization of business, software localization, military roles, intelligence gathering, humanitarian efforts and foreign policy.
- Developing language technology requires large amounts of data, appropriately selected, sampled, organized and annotated in corpora.
- Corpus creation requires special equipment, unique legal arrangements and business models, and specialized skills not usually taught in the programs of users of language data.
- LDC exists to make language data broadly available for linguistic education, research and technology development.
LDC Role
- LDC began in 1993 as a specialized publisher of language data. The data was typically produced elsewhere.
  - Distributed over 14,000 copies of 196 corpora to more than 1,000 organizations worldwide
- LDC gradually developed the ability to create language resources locally
  - newswire/text collection, collection of conversational data via telephone, broadcast news collection
  - transcription, time-alignment, topic relevance annotation, named entity annotation, phonological/morphological resources
- LDC more recently extended its research program
  - TalkBank & Linguistic Exploration, Open Languages Archives, African Language Lexicons, DASL
- Linguistic technologies
  - Information Detection, Extraction and Summarization
  - Speech Recognition and Speech Synthesis
  - Machine Translation
  - Language and Speaker Identification
  - Language Teaching, Linguistics
Annotating LDC Corpora: TDT
Topic Detection & Tracking (TDT) Corpora
- TDT4 Corpus (most recent) contains 9 months of data in 6 languages
  - Subset of 4 months of English, Chinese, Arabic for annotation
- Topics selected and defined from all sources
  - A topic is a specific event or activity along with all directly related events (e.g., Hurricane Mitch)
- Multiple levels of annotation
  - segmentation of audio signal into individual stories
  - topic-story relevance judgements
  - first story identification
  - story-link identification
- Millions of annotation decisions
Audio Segmentation
Using commercial transcripts or closed captions, annotators
- assess existing story boundaries
- add, delete, move boundaries as needed
- classify units as “news” or “not news” (commercials, etc.)
- set and confirm timestamps for all story boundaries
Topic-Story Annotation
- Annotators read and evaluate news stories against the topic list
- Classify each story as directly, briefly or not at all related to a target topic
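The three-way judgement described on this slide can be pictured as a table keyed by topic and story. The following is only a sketch of that idea in Python: the label names, the `judge` and `tally` helpers, and the topic and story identifiers are all hypothetical and are not taken from the TDT annotation tools or file formats.

```python
from collections import Counter
from enum import Enum


class Relevance(Enum):
    """The three judgements named on the slide (label names are assumed)."""
    DIRECT = "directly related"
    BRIEF = "briefly related"
    NONE = "not related"


# (topic_id, story_id) -> Relevance; one entry per annotation decision.
relevance_table = {}


def judge(topic_id, story_id, label):
    """Record one annotator decision for a topic-story pair."""
    relevance_table[(topic_id, story_id)] = label


def tally(topic_id):
    """Count how many stories received each label for one topic."""
    return Counter(
        label for (topic, _story), label in relevance_table.items()
        if topic == topic_id
    )


# Hypothetical identifiers, for illustration only.
judge("T1", "story-001", Relevance.DIRECT)
judge("T1", "story-002", Relevance.BRIEF)
judge("T1", "story-003", Relevance.NONE)
judge("T2", "story-001", Relevance.NONE)
print(tally("T1"))
```

Because every story is judged against every target topic, the number of table entries grows as topics times stories, which is one way to see how the annotation decisions reach the millions.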
Annotating LDC Corpora: ACE
Automatic Content Extraction Project (ACE)
- Develop technology to support automatic processing of human language in text form
  - Classification, filtering, representing language content
- Four annotation tasks
  - Identify all nominal entities in news story
  - Categorize according to type
    - Persons, organizations, GPE, location, facility
    - Name, nominal, pronominal
  - Co-index all mentions of single entity within story
  - Classify relations among entities
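The first three ACE tasks (identify mentions, categorize them, co-index them) map naturally onto a small data structure. The sketch below is an illustration in Python, not the ACE annotation format: the class names, the `add_mention` helper, the character-offset convention and the example sentence are all assumptions made for this example. Relation classification, the fourth task, is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    """Entity types listed on the slide."""
    PERSON = "person"
    ORGANIZATION = "organization"
    GPE = "GPE"
    LOCATION = "location"
    FACILITY = "facility"


class MentionLevel(Enum):
    """Mention levels listed on the slide."""
    NAME = "name"
    NOMINAL = "nominal"
    PRONOMINAL = "pronominal"


@dataclass
class Mention:
    text: str
    start: int  # character offset into the story, inclusive (assumed convention)
    end: int    # character offset, exclusive
    level: MentionLevel


@dataclass
class Entity:
    """Co-indexing: all mentions of one entity share this object's ID."""
    entity_id: str
    entity_type: EntityType
    mentions: list = field(default_factory=list)


def add_mention(entity, story, surface, level):
    """Locate the first occurrence of `surface` in `story` and attach it."""
    start = story.index(surface)
    mention = Mention(surface, start, start + len(surface), level)
    entity.mentions.append(mention)
    return mention


# Hypothetical story text and annotation, for illustration only.
story = "Lisa Stark is in Washington. She reports for ABC."
reporter = Entity("E1", EntityType.PERSON)
add_mention(reporter, story, "Lisa Stark", MentionLevel.NAME)
add_mention(reporter, story, "She", MentionLevel.PRONOMINAL)
city = Entity("E2", EntityType.GPE)
add_mention(city, story, "Washington", MentionLevel.NAME)

for entity in (reporter, city):
    print(entity.entity_id, entity.entity_type.value,
          [(m.text, m.level.value) for m in entity.mentions])
```

Grouping mentions under one entity object is one simple way to represent co-indexing; a real annotation file would more likely store entity IDs alongside each mention span.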
Nominal Entity Tagging
- Best practices in use of large-scale corpora in study of linguistic variation
- Focus on -t/d deletion in American English (well-known variable)
- Four LDC corpora, all created for linguistic technology development
- All data already transcribed, segmented to provide fine-grained access
- Basic demographic information available (gender, age, education, region, race/ethnicity)
DASL Technology
- Create concordance
  - regular expression search of corpus
- Create tag set
  - specify which factors to code
- Create annotation file
  - combines data with tag set
- Annotate using web browser
  - play each example, tool supports common audio formats
  - code factors in each factor group, adding comments when needed
  - demographic information displayed
- Save results and output to text file
  - can be exported to Excel spreadsheet, statistical analysis package
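The concordance step is described as a regular expression search of the corpus. A minimal sketch of that idea in Python follows; it is not the DASL tool itself. The pattern, the `concordance` function and the sample lines are assumptions for illustration, and the pattern is deliberately crude: it treats any word spelled with a consonant letter followed by final t or d as a candidate site for -t/d deletion.

```python
import re

# Assumed pattern: a word ending in a consonant letter followed by t or d.
# Spelling only approximates pronunciation, so every hit is a candidate to
# be checked by ear during annotation, not a confirmed token of the variable.
CANDIDATE = re.compile(r"\b\w*[bcdfghjklmnpqrstvwxz][td]\b", re.IGNORECASE)


def concordance(lines, pattern=CANDIDATE, width=20):
    """Return (line_no, left_context, match, right_context) for each hit."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            hits.append((line_no, left, m.group(0), right))
    return hits


# Hypothetical transcript lines, for illustration only.
sample = ["I just went to the west side", "she told me about it"]
for hit in concordance(sample):
    print(hit)
```

Each hit keeps its line number and surrounding context, which is the information an annotation file would need in order to pair the example with its audio and its speaker's demographic record.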
TDT Overview
Transcripts
ABC19981001.1830.0750
NEWS STORY
10/01/1998 18:42:30.46
In the U.S. and Canada tonight, there is intense concern. It is fair to say, about the insulation used on 1,000 airplanes. It is the same insulation used on Swissair flight 111 and it has been linked to fires on three other planes. Swissair went down off Nova Scotia, which is why the Canadians are concerned. The company that made that planes warned of the fire hazard years ago. ABC's Lisa Stark is in Washington. Reporter: This is the type of insulation in question.... Lisa Stark, ABC News, Washington.
10/01/1998 18:44:37.14
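The sample story carries two timestamps. Assuming they mark the start and end of the story (the slide does not label them), the story length can be computed as in the sketch below. The function names are hypothetical, and the month/day/year reading is inferred from the document ID, whose date field reads 19981001.

```python
from datetime import datetime


def parse_timestamp(text):
    """Parse 'MM/DD/YYYY HH:MM:SS' with optional fractional seconds."""
    if "." in text:
        fmt = "%m/%d/%Y %H:%M:%S.%f"
    else:
        fmt = "%m/%d/%Y %H:%M:%S"
    return datetime.strptime(text, fmt)


def story_duration_seconds(start_text, end_text):
    """Elapsed time between two timestamps, in seconds."""
    delta = parse_timestamp(end_text) - parse_timestamp(start_text)
    return delta.total_seconds()


# The two timestamps shown in the sample transcript.
start = "10/01/1998 18:42:30.46"
end = "10/01/1998 18:44:37.14"
print(story_duration_seconds(start, end))
```

Under that assumption the sample story runs a little over two minutes. The same parser accepts timestamps without fractional seconds, the form that newswire records use.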
ASR Output
IN THE U. S. AND CANADA TONIGHT THERE IS INTENSE CONCERN IT IS FAIR TO SAY ABOUT THE INSULATION USED ON A THOUSAND AIRPLANES
In the U.S. and Canada tonight, there is intense concern. It is fair to say, about the insulation used on 1,000 airplanes.
Boundary Table
Tokenized Text
The most luxurious minivan you can buy... Chrysler town and country. We call it limited. You'll call it unlimited.
In the U.S. and Canada tonight, there is intense concern.
Relevance Table
Topic Definition
30016. SwissAir111 Crash
Seminal Event
WHAT: SwissAir Flight 111 crashes
WHERE: Off the coast of Halifax, Nova Scotia.
WHEN: The crash occurs on 9/2/98; the investigation continues through the fall of 1998.
Topic Explication
The MD-11 aircraft was en route from New York to Geneva, Switzerland when it crashed into the Atlantic Ocean, killing all 229 people on board.
On topic: Stories covering the crash and ensuing investigation; plans to compensate the victims' families; any safety measures proposed or adopted as a direct result of this crash.
Rule of Interpretation
Rule 5: Accidents
Examples: plane, car or train crashes, bridge collapses, accidental shootings, boats sinking. The event would be causal activities and unavoidable consequences like death tolls, injuries, loss of property. The topic includes mourners' pursuit of legal action, investigations, issues with responsible parties (like drug and alcohol tests for drivers, etc.).
Story Links
Story Link Table
Linked Story
APW19981122.0381
NEWS STORY
11/22/1998 09:21:00
(...)
Swissair CEO defends installation of in-flight entertainment
ZURICH, Switzerland (AP) _ Swissair ``did everything correctly'' in installing a state-of-the-art entertainment system switched off last month in the wake of the crash of Flight 111, the airline's chief executive said in an interview published Sunday. Swissair acted voluntarily to disconnect the video-on-demand system, connected to a power supply routed through the cockpit, after Canadian investigators detected signs of heat damage on wiring and other debris from the ceiling around the cockpit of the MD-11.
(...) (...)