UK DATA ARCHIVE-NLP COLLABORATION Louise Corti and Claire Grover UK Data Archive University of Essex Colchester, Essex CO4 3SQ


CONTACT

Louise Corti and Claire Grover
UK Data Archive, University of Essex, Colchester, Essex CO4 3SQ
Email: quads@esds.ac.uk
Tel: +44 (0)1206 872145
URL: quads.esds.ac.uk/squad

WHAT IS SQUAD?

SMART QUALITATIVE DATA: METHODS AND COMMUNITY TOOLS FOR DATA MARK-UP

Main aim: to explore methodological and technical solutions for 'exposing' digital qualitative data to make them fully shareable and exploitable.

Main objectives:
- specify, test and propose an eXtensible Markup Language (XML) schema for storing and marking up qualitative data
- investigate requirements for contextualising qualitative data and developing standards for data documentation
- develop semi-automated tools, using natural language processing, for preparing marked-up qualitative data for sharing
- research tools for publishing and interrogating data via the web: Qualitative Data Mark-Up Tools (QDMT)

Collaboration between:
- UK Data Archive, University of Essex (lead partner)
- Language Technology Group, Human Communication Research Centre, School of Informatics, University of Edinburgh

Duration: 18 months, 1 March 2005 to 31 October 2006

UK DATA ARCHIVE-NLP COLLABORATION

- new partnerships created: new methods, tools and jargon to learn
- new area of application for NLP: social science data
- growing interest in the UK in applying NLP and text mining to social science texts, both data and research outputs such as publications' abstracts

WHAT FEATURES DO WE NEED TO MARK-UP AND WHY?

Spoken interview texts provide the clearest and most common example of the types of encoding features needed. There are three basic groups of structural features:
- utterance, specific turn taker, defining idiosyncrasies in transcription
- links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher annotations)
- identifying information such as real names, company names, place names, occupations, temporal information

METADATA STANDARDS

A uniform format for richly encoding qualitative research is necessary as it:
- enables preservation and re-use of metadata, data and annotation
- ensures consistency of presentation and description of data
- supports the development of common web-based publishing and search tools
- facilitates data interchange (e.g. CAQDAS packages) and comparison among datasets

The XML schema will specify a 'reduced' set of Text Encoding Initiative (TEI) elements:
- core tag set for transcription
- names, numbers, dates
- links and cross references
- notes and annotations
- text structure
- unique to spoken texts
- linking, segmentation and alignment
- advanced pointing (XPointer framework)
- text and AV synchronisation
- contextual information (participants, setting, text)

DATA EXCHANGE STANDARDS

Progress: limited formal definition of a common XML vocabulary and Document Type Definition (DTD) based on the Text Encoding Initiative (TEI); testing of a new Qualitative Data Interchange Format (QDIF).

CAPTURING AND DEFINING DATA CONTEXT

Rich context enables informed re-use of data, but defining how to provide context for raw data to make it more 'usable' is complex. ESDS Qualidata has spent ten years working in the area of sharing qualitative data, and has done much to establish informal ways of documenting raw data. Both micro and macro level features should be considered, including how the research question was framed, the research application process, project progress, fieldwork situations and analysis processes. Fieldwork observations are useful, as are timelines and political chronologies. Equally, when undertaking a replication or restudy, detailed information on sampling procedures, fieldwork approaches and question guides will be essential.

Archiving and exposure of qualitative data in a way that faithfully represents its origins and context is important. Linking qualitative data to other distributed data sources, such as audio-visual or geo-coded sources (for example maps), can afford creative and exciting ways of visualising data.

SQUAD has identified a minimal generic set of elements that represent a baseline for contextualising data. QUADS has produced an edited collection on this issue as a special edition of the journal Methodological Innovations Online: sirius.soc.plymouth.ac.uk/~andyp/.

AUDIOVISUAL ARCHIVING

The formalised and systematic archiving and sharing of digital audio-visual data from qualitative research is fairly new. SQUAD is helping to explore XML representation and display of audio-visual data.

USING NLP TOOLS

ESDS Qualidata is applying semi-automated mark-up to some components of its data collections, using natural language processing (NLP) and information extraction.

Information Extraction (IE) is a sub-field of NLP which aims to identify key pieces of information in texts using 'shallow' analysis techniques. A typical IE system will perform Named Entity Recognition, where particular kinds of proper names and terms are identified, classified and marked up.

Identify atomic elements of information in text:
- personal names
- company/organisation names
- locations
- dates
- times
- percentages
- occupations
- monetary amounts

Example: Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Anderson.

TOOLS

The Java interface tool developed in SQUAD is called CME.

ANNOTATION TOOL - ANONYMISE

This tool imports marked-up data from the CME NLP system. Named entities are highlighted and co-reference chains (e.g. numerous references to a single person) are identified. Names can be anonymised with chosen pseudonyms, and the mapping from names to pseudonyms is saved.

Annotations are explored in an XML format in the NITE NXT model. NXT uses 'stand off' annotation, where annotation is linked to or referenced by words. This is a means of annotating documents with semantic metadata, enabling resource discovery and data exploration.

PROGRESS

- defined header metadata for a standardised transcript
- defined and tested generic XML models for qualitative data
- tested and refined NLP tools for qualitative data
- built front end to NLP named entity tools
- chosen software to enable annotation of data
- explored data export formats for longer-term archiving
- investigated powerful XML-based indexing tools for searching and retrieving data
- investigated web display of multimedia data and pointers to other resources using XML, extending the functionality of ESDS Qualidata

From Autumn 2006:
- formalising data exchange standard
- key word extraction systems to help conceptually index qualitative data: text mining collaboration
- exploring grid-enabling data: e-social science collaboration

XML: enabling a standardised format for interview transcripts

Raw interview text:

There's just one or two factual things first of all do you mind my asking how old you are? 49. And what schools did you go to? - King Street, Woodside and Hilton. Uh-huh.. and how old were you when you left the school? 14. And you work at the moment? What sort of work do you do? - Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at ARI... just pharmacy assistant

Information about interviewee:
Date of birth: 1930
Gender: female
Marital status: married
Occupation: pharmacy assistant
Geographic region: Scotland

Transcript with speaker turns identified:

LP:There's just one or two factual things first of all do you mind my asking how old you are?
G24:49.
LP:And what schools did you go to?
G24:King Street, Woodside and Hilton.
LP:Uh-huh.. and how old were you when you left the school?
G24:14.
LP:And you work at the moment? What sort of work do you do?
G24:Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at ARI... just pharmacy assistant. At least it was better than cleanin'! But then they've nae part-time workers there so..
LP:And did you work in the pharmacy long?

XML: enabling web-enabled display, search and browse

[Figure: interview text with XML tags embedded]
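As a rough illustration of how the speaker-labelled interview extract could be turned into XML, the Python sketch below wraps each turn in an utterance element. The element and attribute names (`body`, `u`, `who`) follow the TEI convention for transcribed speech but are assumptions here, not the SQUAD schema, and the function name `transcript_to_xml` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Speaker-labelled turns copied from the interview extract.
TRANSCRIPT = """\
LP:There's just one or two factual things first of all do you mind my asking how old you are?
G24:49.
LP:And what schools did you go to?
G24:King Street, Woodside and Hilton.
"""

# A turn starts with a speaker code (letters and digits) followed by a colon.
TURN = re.compile(r"^(?P<who>[A-Za-z0-9]+):(?P<text>.*)$")


def transcript_to_xml(transcript: str) -> ET.Element:
    """Wrap each speaker turn in a <u who="..."> element (assumed, TEI-like names)."""
    body = ET.Element("body")
    for line in transcript.splitlines():
        match = TURN.match(line.strip())
        if match is None:
            continue  # this sketch simply skips blank or unlabelled lines
        utterance = ET.SubElement(body, "u", who=match.group("who"))
        utterance.text = match.group("text").strip()
    return body


body = transcript_to_xml(TRANSCRIPT)
print(ET.tostring(body, encoding="unicode"))
```

Once turns are explicit elements, the speaker code becomes a searchable attribute, which is the kind of structure that web display, search and browse tools can build on.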
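The pseudonym step described under ANNOTATION TOOL - ANONYMISE can be sketched in a few lines of Python. This is not the SQUAD tool: in the project the named entities come from the NLP system, whereas here the names and pseudonyms are supplied by hand, and the function name `anonymise` and the pseudonyms themselves are hypothetical.

```python
import re


def anonymise(text, pseudonyms):
    """Replace each listed name with its pseudonym in a single pass.

    Returns the anonymised text and the name-to-pseudonym mapping
    that was actually applied, so the mapping can be saved.
    """
    if not pseudonyms:
        return text, {}
    # Longest names first, so a longer name is preferred over a shorter
    # one that happens to be contained in it.
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(n) + r"\b" for n in names))
    used = {}

    def swap(match):
        name = match.group(0)
        used[name] = pseudonyms[name]
        return pseudonyms[name]

    return pattern.sub(swap, text), used


# Hypothetical pseudonyms for the schools named in the interview extract.
pseudonyms = {"King Street": "School A", "Woodside": "School B", "Hilton": "School C"}
anonymised, mapping = anonymise("King Street, Woodside and Hilton.", pseudonyms)
print(anonymised)
print(mapping)
```

Doing the replacement in one pass avoids re-anonymising text that has already been swapped, and returning the mapping keeps a record of which real names each pseudonym stands for.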

