Information extraction for multi-participant, task-oriented, synchronous, computer-mediated communication: a corpus study of chat data Cassandre Creswell.

Information extraction for multi-participant, task-oriented, synchronous, computer-mediated communication: a corpus study of chat data
Cassandre Creswell, Nicholas Schwartzmyer, Rohini Srihari
Janya, Inc.
www.janyainc.com
8 January 2007

Significance of the Problem
Synchronous computer-mediated communication (SCMC), or chat, is an increasingly important means of communication in many settings, including intelligence and military domains
Information extraction (IE) applied to chat could aid activities that rely on real-time decision-making (e.g. entity tracking/targeting, monitoring teenager chat rooms)
Most IE applications, including our Semantex™ system, have been developed for optimal performance on well-written text
This necessitates research into the unique characteristics of chat and how they affect IE performance
–Perform a corpus study, i.e. a gap analysis, to prioritize tasks for a chat IE system
–Add to the nascent study of chat as a discourse type
Focus on task-oriented chat discourse
–Involves participants exchanging information on a highly complex collaborative task in an unconstrained setting
–Task-oriented dialogs have played a significant role in understanding discourse-level coordination of meaning in conversation in the fields of psycholinguistics and computational linguistics

Task-oriented chat: The GeoTools corpus
Need to develop a task-oriented corpus of chat data for research
The GeoTools corpus
–56 IRC logs for the GeoTools project (http://geotools.codehaus.org/IRC+Logs)
–Approximately 180K and 18,000 participant turns (cf. the COCONUT corpus, < 14K)
–Interactions have the form of a business meeting
  Agenda of weekly problems/issues in the project, each discussed in some degree of detail
  Appropriate model of task-oriented discourse

Characteristics of chat data that distinguish it from narrative text
It is dynamic. This results in:
–Epistemological uncertainty. Propositions and entities may be challenged and/or revised
–Structural errors. Grammatical, spelling, and orthographic mistakes remain unedited in the text

Characteristics of chat data that distinguish it from narrative text
It is interactive. Content is contributed by multiple participants, allowing:
–Misunderstanding and disagreement, but also
–Extensive implicit, shared knowledge

Characteristics of chat data that distinguish it from narrative text
It has a relatively unconstrained turn-taking structure
–Distinguishes chat from conversation as well
–Allows for multi-threaded, interleaved conversations
–Complicates resolution tasks

Implications for Information Extraction: Surface Level
These characteristics complicate IE on two levels:
Surface-level (dynamic noise), affecting turn processing
–Spelling errors
  ex: rschulz: ...It gave wierd results. [sic]
  ex: jgarnett: I would love a repalcement [sic]
–Non-standard punctuation/orthographic decisions
  ex: jmacgill: sorry [,] gotta run [.]
–Ungrammatical/non-standard grammatical constructions
  ex: jmacgill: sorry [I] gotta run
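The surface-level repairs listed on this slide can be illustrated with a minimal Python sketch. It is not part of Semantex: the lookup tables `BLENDS` and `MISSPELLINGS` and the function `normalize_turn` are hypothetical names, and the expansions chosen for the blends are assumptions made only for illustration. The sketch repairs known misspellings and blends by dictionary lookup, then restores a missing turn-final punctuation mark.

```python
import re

# Hypothetical lookup tables; a real system would learn or curate these.
BLENDS = {"gotta": "got to", "dunno": "do not know", "gotcha": "got you"}
MISSPELLINGS = {"wierd": "weird", "repalcement": "replacement"}
TERMINAL = (".", "?", "!")


def normalize_turn(text: str) -> str:
    """Repair known misspellings and blends, then restore turn-final punctuation."""

    def repair(match: re.Match) -> str:
        word = match.group(0)
        lower = word.lower()
        fixed = MISSPELLINGS.get(lower, BLENDS.get(lower, word))
        # Keep an initial capital if the original word had one.
        if fixed != word and word[0].isupper():
            fixed = fixed[0].upper() + fixed[1:]
        return fixed

    repaired = re.sub(r"[A-Za-z']+", repair, text).strip()
    # Add a period only when the turn lacks terminal punctuation.
    if repaired and not repaired.endswith(TERMINAL):
        repaired += "."
    return repaired
```

A dictionary lookup of this kind only covers forms that have been seen before; subject drop, as in "sorry [I] gotta run", would need syntactic information that this sketch does not attempt to recover.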

Implications for Information Extraction: Discourse Level
Discourse-level (interactive noise), affecting 'document' processing
–Topic segmentation
–Time and event normalization
–Anaphora resolution
  jgarnett: While we wait dblasby I almost have the empty hsql datastore ready to commit
  cholmes: Cool, I'd be happy to see it
  polio: ah, that was me
  jeichar: and I added crs support to postgis
  cholmes: Successfully?
  dblasby: thanks jody
  cholmes: where did you get the changes?
  polio: he made a crs factory thingy and I hooked it into postgis
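To treat a chat log as a 'document', the turns first have to be made explicit. The following Python sketch parses the excerpt from this slide into turn records and detects addressees that are named by nick inside a turn. The `nick: text` line format, the `Turn` record, and `parse_log` are assumptions introduced for illustration, not the format of the GeoTools logs or any Semantex interface.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Turn:
    index: int            # position of the turn in the log
    speaker: str          # nick of the participant
    text: str             # turn content without the nick
    addressees: Set[str]  # other nicks mentioned verbatim in the text


def parse_log(lines: List[str]) -> List[Turn]:
    """Split 'nick: text' lines into turns and find nicks mentioned in each turn."""
    raw = [line.split(":", 1) for line in lines if ":" in line]
    speakers = {nick.strip() for nick, _ in raw}
    turns = []
    for i, (nick, text) in enumerate(raw):
        nick = nick.strip()
        tokens = {tok.strip(",.?!").lower() for tok in text.split()}
        addressees = {s for s in speakers if s.lower() in tokens and s != nick}
        turns.append(Turn(i, nick, text.strip(), addressees))
    return turns


# The excerpt from the slide, rendered in the assumed 'nick: text' format.
LOG = [
    "jgarnett: While we wait dblasby I almost have the empty hsql datastore ready to commit",
    "cholmes: Cool, I'd be happy to see it",
    "polio: ah, that was me",
    "jeichar: and I added crs support to postgis",
    "cholmes: Successfully?",
    "dblasby: thanks jody",
    "cholmes: where did you get the changes?",
    "polio: he made a crs factory thingy and I hooked it into postgis",
]
```

Only the first turn yields an addressee here. The turn "thanks jody" yields none, because linking a first name to a nick needs alias information that this sketch does not have, and pronouns such as "it", "that", and "he" are exactly the anaphora the slide flags as a discourse-level problem.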

Semantex 3.0
Semantex
–Tags key entities (people, places, organizations, ...), relationships, events
–Summarizes information in entity profiles
Hybrid model
–Combines statistical and grammar-based approaches in a cascade of over 60 modules
–FST grammars
Modular
–Semantex engine features plug-and-play ease of use
–Easily integrates additional modules, such as a DoD acronym tagger
–Supports a variety of data sources: FBIS, USMTF, Lexis-Nexis, HUMINT, Factiva, Dialog, etc.

Semantex Generates Entity Profiles & Events from Documents

Organization Profile (EP402)
  Profile Name: al-Barakaat
  Descriptors: money transfer terrorist
  Staff: Mohamed Barre (EP102)
  Founder: Osama bin Laden (EP103)
  Events: founding
    Who: EP103
    Org: EP402
  Source text: According to the Boston Globe, the al-Barakaat network was founded by Osama bin Laden…Mohamed Barre has been the money transfer agency's broker…
  Copyright 2001 Blethen Maine Newspapers, Inc. Portland Press Herald

Person Profile (EP101)
  Profile Name: Herman Cohen
  Mentions: official, Mr. Cohen
  Aliases: Cohen
  Position: assistant secretary of state
  Where From: US
  Source text: Assistant secretary of state, Herman Cohen,…as the US official currently visiting Khartoum, he has been …
  Copyright 1992 Guardian Newspapers Limited The Guardian (London)
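The profile and event structure shown on this slide can be pictured as records that refer to each other by ID. The Python sketch below is only an illustration of that idea: the class names `EntityProfile` and `Event`, the helper `resolve_roles`, and the choice of "Person" as the type of profile EP103 are assumptions, not the Semantex output format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityProfile:
    profile_id: str
    kind: str   # e.g. "Organization" or "Person"
    name: str
    # Attribute name -> values; values may be strings or IDs of other profiles.
    attributes: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Event:
    event_type: str
    roles: Dict[str, str]  # role name -> profile ID


def resolve_roles(event: Event, profiles: Dict[str, EntityProfile]) -> Dict[str, str]:
    """Replace the profile IDs in an event with the corresponding profile names."""
    return {role: profiles[pid].name for role, pid in event.roles.items()}


# Values taken from the slide; the record layout itself is hypothetical.
profiles = {
    "EP402": EntityProfile(
        "EP402", "Organization", "al-Barakaat",
        {"Staff": ["EP102"], "Founder": ["EP103"]},
    ),
    "EP103": EntityProfile("EP103", "Person", "Osama bin Laden"),
}
founding = Event("founding", {"Who": "EP103", "Org": "EP402"})
```

Keeping events as role-to-ID mappings means that a profile can be corrected or merged without rewriting every event that refers to it.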

Concept of Operations
2 x 2 approach
–2 levels of processing
  Turn processing
  Discourse processing
–2 channels for system I/O
  Mission channel: IE chatbot monitors chat session
  Alert channel: system reports flagged information to interested parties
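The 2 x 2 layout can be sketched as a small monitoring loop in Python: turns arrive on the mission channel, pass through turn-level and then discourse-level processing, and anything flagged is pushed to the alert channel. All names here (`turn_level`, `DiscourseLevel`, `monitor`) and the keyword watchlist used as the flagging criterion are assumptions for illustration; the slides do not specify how information is flagged.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def turn_level(turn: Tuple[str, str]) -> Dict[str, object]:
    """Turn processing: analyze one (speaker, text) pair in isolation."""
    speaker, text = turn
    return {"speaker": speaker, "tokens": text.lower().split()}


class DiscourseLevel:
    """Discourse processing: keeps state across turns and decides what to flag."""

    def __init__(self, watchlist: Iterable[str]) -> None:
        self.watchlist = set(watchlist)   # hypothetical flagging criterion
        self.history: List[Dict[str, object]] = []

    def update(self, analysis: Dict[str, object]) -> List[str]:
        self.history.append(analysis)
        turn_index = len(self.history) - 1
        hits = sorted(self.watchlist.intersection(analysis["tokens"]))
        return [f"turn {turn_index} ({analysis['speaker']}): {hit}" for hit in hits]


def monitor(
    mission_channel: Iterable[Tuple[str, str]],
    alert_channel: Callable[[str], None],
    watchlist: Iterable[str],
) -> None:
    """Read turns from the mission channel and report flagged items on the alert channel."""
    discourse = DiscourseLevel(watchlist)
    for turn in mission_channel:
        for alert in discourse.update(turn_level(turn)):
            alert_channel(alert)
```

For example, calling `monitor` with a list of `(speaker, text)` pairs and `alerts.append` as the alert channel collects the flagged items in a plain list; in a deployed system the two channels would be live connections rather than in-memory objects.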

The GeoTools corpus annotation schema
Surface-level phenomena annotation
–Turn-final punctuation mark-up (incl. omission)
–Misspelling, non-standard punctuation/orthography
  annotator provides correct form
–Ungrammatical constructions
  violations of syntactic rules
  constructions not found in narrative text
  –blends: gotcha, dunno, etc.
  –constituent drop (incl. apostrophes)
–Annotated 50% of the corpus due to density (23/56 logs)

The GeoTools corpus annotation schema
Discourse-level phenomena annotation
–Location deixis
  Potentially important inferential construction, esp. for location normalization
  ex: It's midnight here, what time is it there?
  mark deictics and describe what they resolve to
–Verb phrase ellipsis (VPE)
  Another potentially important inferential construction
  ex: Although Max thinks I'll leave soon, I do want to
  mark the VPE and its antecedent

The GeoTools corpus annotation schema
–Noun phrases (NP)
  20% of the corpus
  Inspired by DRAMA and MATE/GNOME schemas
  mark referentiality type
  –non-referential, anaphoric, non-NP antecedent
  –mark antecedent, if applicable
–Sentential and non-sentential utterances
  20% of the corpus
  Defined as a functionally independent clause
  Annotation activities include:
  –Sentential status (+/-)
  –Mark dialog act type (after DAMSL-SWDB schema)
  –Link dependent utterances

The GeoTools corpus annotation

Surface-level annotation analysis
Surface-form noise is in fact very common in chat
Orthographic errors, in particular, show a tendency toward reduced formality
This can also be seen in the most common ungrammatical construction annotations
–constituent drop (particularly the subject)
–apostrophe omission (in contractions & possessives)
–blends

Utterance annotation analysis

194 discrete utterance dependency chains
–Interleaved topics are still relatively local
  Median number of turns between linked utterances: 1
  Mean number of turns between linked utterances: 1.6
–These numbers hold for both sentential and non-sentential utterances
Wider distribution of dialog act types than in narrative discourse, as is to be expected, but statements remain the dominant type
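The locality figures above are summary statistics over the distances between linked utterances. A minimal Python sketch of that computation follows; the link list is invented toy data, not the corpus annotation, and the definition of distance as the difference between the turn indices of a dependent utterance and its antecedent is an assumption about how the slide's numbers were counted.

```python
from statistics import mean, median
from typing import List, Tuple


def link_distances(links: List[Tuple[int, int]]) -> List[int]:
    """Distance in turns between each (antecedent_turn, dependent_turn) pair."""
    return [dependent - antecedent for antecedent, dependent in links]


# Toy links for illustration only: (antecedent turn index, dependent turn index).
toy_links = [(0, 1), (2, 3), (1, 4), (5, 6), (6, 7)]

distances = link_distances(toy_links)
toy_median = median(distances)
toy_mean = mean(distances)
```

A median of 1 with a mean above 1, as reported on the slide, is the pattern produced when most dependent utterances follow their antecedent directly and a minority are separated by interleaved turns.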

NP annotation analysis
13,225 NP annotations (excl. turn-initial usernames)
Furthermore, ~7% of definite NPs were coreference chain-initial, meaning the rest will have an inferable antecedent
Some implications:
–Entities in chat discourse may not rely as heavily on implicit knowledge as hypothesized
–The majority of entities are introduced anew in every chat discourse
  This would therefore allow their antecedents to be recoverable via IE

Verb phrase ellipsis annotation analysis
Rare in our corpus (61 instances)
Paucity is likely not a domain effect
61/61 had their antecedent in a distinct utterance
53/61 had their antecedent in a distinct turn
Therefore, VPE resolution will be rather local, but necessitates discourse-level processing

Location deixis annotation analysis
Also rare in our corpus
This is likely not a domain effect
138 instances of deictic here
–59% refer to the present chat
–14% refer to the participant's location in the real world
84 instances of deictic there
–1% refer to the present chat
–10% refer to a location in the real world
This distribution may also be domain-dependent

Consequences for a chat IE system
Low-level chat IE (turn handling) requires few modifications to Semantex:
–Pre-existing case restoration modules can be retrained for punctuation
–Robust shallow parsing can handle many ungrammaticalities
Discourse-level chat IE system:
–Moderately local context of many inference phenomena makes for a more tractable problem
–Topic segmentation and discourse modelling aided by dialog act tagging (utterance type classification)
–Multi-threaded discourse model; a strictly linear model will not suffice
  Solution: a tree structure with turn-level processing outputs at the nodes
  Model updated with each utterance added
–Dynamic world model, parallel to the discourse model
  Consists of Concepts and Mentions
  –Concepts: Events, Entities, Relationships
  –Mentions: token-based references to these Concepts
  –Separation of concepts from mentions allows the truth value of a concept to be updated with each additional mention
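The two proposed structures can be sketched in Python as follows. This is an illustrative reading of the slide, not the Semantex design: the class names, the `responds_to` link, and the "latest assertion wins" policy for truth values are all assumptions. The discourse model is a forest of utterance nodes, so interleaved threads stay separate, and the world model keeps concepts apart from the mentions that refer to them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Mention:
    turn_index: int
    text: str
    asserts: Optional[bool] = None  # True/False if the mention affirms/denies, None if neutral


@dataclass
class Concept:
    concept_id: str
    kind: str  # "Entity", "Event", or "Relationship"
    truth_value: Optional[bool] = None
    mentions: List[Mention] = field(default_factory=list)

    def add_mention(self, mention: Mention) -> None:
        """Record a mention; a non-neutral mention updates the truth value."""
        self.mentions.append(mention)
        if mention.asserts is not None:
            self.truth_value = mention.asserts  # assumed policy: latest assertion wins


@dataclass
class DiscourseNode:
    turn_index: int
    speaker: str
    text: str  # stands in for the turn-level processing output
    children: List["DiscourseNode"] = field(default_factory=list)


class DiscourseModel:
    """Tree-structured (multi-threaded) discourse model, updated per utterance."""

    def __init__(self) -> None:
        self.roots: List[DiscourseNode] = []
        self.by_turn: Dict[int, DiscourseNode] = {}

    def add_utterance(self, turn_index: int, speaker: str, text: str,
                      responds_to: Optional[int] = None) -> DiscourseNode:
        node = DiscourseNode(turn_index, speaker, text)
        self.by_turn[turn_index] = node
        if responds_to is None:
            self.roots.append(node)  # starts a new thread
        else:
            self.by_turn[responds_to].children.append(node)
        return node

    def threads(self) -> List[List[int]]:
        """Turn indices of each thread, one sorted list per root."""
        def collect(node: DiscourseNode) -> List[int]:
            indices = [node.turn_index]
            for child in node.children:
                indices.extend(collect(child))
            return sorted(indices)
        return [collect(root) for root in self.roots]
```

As a usage example, six turns in which turns 1 and 5 respond to turn 0 while turns 3 and 4 continue a thread started at turn 2 yield the threads `[0, 1, 5]` and `[2, 3, 4]`, which a strictly linear model would have merged into one sequence. The links themselves would have to come from dialog act tagging or utterance linking, which this sketch takes as given.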
