Smart Qualitative Data: Methods and Community Tools for Data Mark-Up (SQUAD) Louise Corti and Libby Bishop UK Data Archive, University of Essex IASSIST.

Presentation transcript:

Smart Qualitative Data: Methods and Community Tools for Data Mark-Up (SQUAD) Louise Corti and Libby Bishop UK Data Archive, University of Essex IASSIST 2006 May 06

2 Access to qualitative data
- access to qualitative research-based datasets
- resource discovery points – catalogues
- online data searching and browsing of multi-media data
- new publishing forms: re-presentation of research outputs combined with data – a guided tour
- text mining, natural language processing and e-science applications offer richer access to digital data banks
- underpinning these applications is the need for agreed methods, standards and tools

3 Applications of formats and standards
- standard for data producers to store and publish data in multiple formats
  - e.g. UK Data Archive and ESDS Qualidata Online
- data exchange and data sharing across dispersed repositories (cf. Nesstar)
- import/export functionality for qualitative analysis software (CAQDAS) based on a common interoperable standard
- more precise searching/browsing of archived qualitative data beyond the catalogue record
- researchers and archivists are requesting a standard they can follow – much demand

4 Our own needs
- ESDS Qualidata online system
  - limited functionality: currently keyword search, KWIC retrieval, and browse of texts
- wish to extend functionality
  - display of marked-up features (e.g. named entities)
  - linking between sources (e.g. text, annotations, analysis, audio etc.)
- for 5 years we have been developing a generic descriptive standard and format for data that is customised to social science research and which meets generic needs of varied data types
- some important progress through TEI and Australian collaboration

5 How useful is textual data?
dob: 1921
Place: Oldham
finalocc: Oldham [Welham]
U id='1' who='interviewer' Right, it starts with your grandparents. So give me the names and dates of birth of both. Do you remember those sets of grandparents?
U id='2' who='subject' Yes.
U id='3' who='interviewer' Well, we'll start with your mum's parents? Where did they live?
U id='4' who='subject' They lived in Widness, Lancashire.
U id='5' who='interviewer' How do you remember them?
U id='6' who='subject' When we Mum used to take me to see them and me Grandma came to live with us in the end, didn't she?
U id='7' who='Welham' Welham: Yes, when Granddad died - '48.
U id='8' who='interviewer' So he died when he was 48?
U id='9' who='Welham' Welham: No, he was 52. He died in
U id='10' who='interviewer' But I remember it. How old would I be then?
U id='11' who='Welham' Welham: Oh, you would have been little then.
U id='12' who='subject' I remember him, he used to have whiskers. He used to put me on his knee and give me a kiss....
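To show what even this light mark-up makes possible, here is a minimal Python sketch that splits text of the same shape into speaker turns. The regular expression, the function name parse_utterances and the shortened SAMPLE string are assumptions made for illustration; they are not part of the ESDS Qualidata tooling.

```python
import re
from collections import Counter

# Shortened sample in the same flattened shape as the slide text:
# each utterance starts with "U id='N' who='SPEAKER'".
SAMPLE = (
    "U id='1' who='interviewer' Where did they live? "
    "U id='2' who='subject' They lived in Widness, Lancashire. "
    "U id='3' who='interviewer' How do you remember them?"
)

# One utterance: id, speaker, then text up to the next marker or end of input.
UTTERANCE = re.compile(
    r"U id='(\d+)' who='([^']+)'\s*(.*?)(?=\s*U id='|\Z)", re.S
)


def parse_utterances(text):
    """Return a list of (id, speaker, text) tuples in document order."""
    return [(int(i), who, body.strip()) for i, who, body in UTTERANCE.findall(text)]


turns = parse_utterances(SAMPLE)
# Count how many turns each speaker takes.
turns_per_speaker = Counter(who for _, who, _ in turns)
```

Once turns are separated like this, a search can be restricted to one speaker, which a plain keyword search over the raw text cannot do.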

6 What are we interested in finding in data?
- short term:
  - how can we exploit the contents of our data?
  - how can data be shared?
  - what is currently useful to mark-up?
- long term:
  - what might be useful in the future?
  - who might want to use your data?
  - how might the data be linked to other data sets?

7 What features do we need to mark-up and why?
- spoken interview texts provide the clearest, and most common, example of the kinds of encoding features needed
- 3 basic groups of structural features
  - utterance, specific turn taker, defining idiosyncrasies in transcription
  - links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher annotations)
  - identifying information such as real names, company names, place names, occupations, temporal information

8 Identifying elements
- Identify atomic elements of information in text
  - Person names
  - Company/Organisation names
  - Locations
  - Dates
  - Times
  - Percentages
  - Occupations
  - Monetary amounts
- Example:
  - Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Anderson.

9 How do we annotate our data?
- human effort?
  - how long does one document take to mark up?
  - how much data do you want/need?
  - how many annotators do you have?
  - how well does a person do this job?
    - accuracy
    - novice/expert in subject area
    - boredom
    - subjective opinions
  - what if we decide to add more categories for mark-up at a later date?
- can we automate this?
  - the short answer: “it depends”
  - the long answer...

10 Automating content extraction using rules
- why don't we just write rules?
- persons:
  - lists of common names, useful to a point
  - lists of pronouns (I, he, she, me, my, they, them, etc.)
    - “me mum”; “them cats”, but which entities do pronouns refer to?
- rules regarding typical surface cues:
  - CapitalisedWord
    - probably a name of some sort, e.g. “John found it interesting…”
    - first word of sentences is useless though, e.g. “Italy’s business world…”
  - title CapitalisedWord
    - probably a person name, e.g. “Mr. Smith” or “Mr. Average”
- how well does this work?
  - not too bad, but…
  - requires several months for a person to write these rules
  - each new domain/entity type requires more time
  - requires experienced experts (linguists, biologists, etc.)
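The two surface-cue rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the title list, the function names title_rule and capitalised_rule, and the second test sentence are invented here and do not come from the SQUAD tools.

```python
import re

# Assumed list of titles; a real rule set would need a much longer one.
TITLES = r"(?:Mr|Mrs|Ms|Miss|Dr|Prof)\.?"
TITLE_NAME = re.compile(rf"\b{TITLES}\s+([A-Z][a-z]+)")
CAP_WORD = re.compile(r"\b[A-Z][a-z]+\b")


def title_rule(text):
    """Rule 'title CapitalisedWord': the word after a title is probably a person name."""
    return TITLE_NAME.findall(text)


def capitalised_rule(text):
    """Rule 'CapitalisedWord': probably a name, unless it starts a sentence."""
    hits = []
    for match in CAP_WORD.finditer(text):
        before = text[:match.start()].rstrip()
        # Words at the start of the text or after . ! ? carry no information.
        sentence_initial = before == "" or before[-1] in ".!?"
        if not sentence_initial:
            hits.append(match.group())
    return hits


verdi = title_rule("the announcement last Thursday that Mr. Verdi would leave his job")
names = capitalised_rule(
    "Yesterday the committee met Anna in Oldham. Nobody expected Mildred to speak."
)
```

The sketch also shows why such rules are costly to maintain: capitalised_rule cannot tell a person (Anna) from a place (Oldham), and it would treat the full stop in "Mr." as a sentence end, so every such case needs a further hand-written exception.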

11 What about more intelligent content extraction mechanisms?
- machine learning
  - manually annotate texts with entities
  - 100,000 words can be done in 1-3 days depending on experience
  - the more data you have, the higher the accuracy
  - the less annotated data you have, the poorer the results
  - if the system hasn’t seen it or hasn’t seen anything that looks like it, then it can’t tell what it is
  - garbage in, garbage out
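The dependence on annotated data can be illustrated with a deliberately minimal learner. The sketch below is a most-frequent-label baseline, not the machine learning method used in the project; the function names, the label set and the tiny TRAINING list are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(annotated):
    """Learn, for each token, the label it most often carries in the annotated data."""
    counts = defaultdict(Counter)
    for token, label in annotated:
        counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def tag(model, tokens, unknown="O"):
    """Label each token; anything never seen in training falls back to 'O' (no entity)."""
    return [(token, model.get(token, unknown)) for token in tokens]


# Hypothetical hand-annotated tokens ("O" means not an entity).
TRAINING = [
    ("Oldham", "LOCATION"),
    ("Lancashire", "LOCATION"),
    ("Welham", "PERSON"),
    ("lived", "O"),
    ("in", "O"),
]

model = train(TRAINING)
tagged = tag(model, ["Welham", "lived", "in", "Oldham", "near", "Widness"])
```

Here "Widness" is left unlabelled simply because it never appeared in the training data, which is the point the slide makes: the system cannot recognise what it has not seen. Real systems soften this by learning from features such as capitalisation and context rather than from the token alone.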

12 State of the Art
- use a mixture of rules and machine learning
- use other sources (e.g. the web) to find out if something is an entity
  - number of hits indicates likelihood something is true
  - e.g. finding if Capitalised Word X is a country
    - search Google for: “Country X”; “The prime minister of X”
- new focus on relation and event extraction
  - Mike Johnson is now head of the department of computing. Today he announced new funding opportunities.
    - person(Mike-Johnson)
    - head-of(the-department-of-computing, Mike-Johnson)
    - announced(Mike-Johnson, new funding opportunities, today)
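As a rough illustration of relation extraction, the Python sketch below recovers the first two facts from the example sentence with a single hand-written pattern. The pattern and the function name extract_head_of are assumptions for this example; state-of-the-art systems combine many such cues with learned models.

```python
import re

# "<First Last> is now head of <the ...>." : one hand-written relation pattern.
HEAD_OF = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) is now head of (?P<org>the [a-z ]+?)\."
)


def extract_head_of(text):
    """Return person(...) and head-of(...) facts found by the pattern."""
    facts = []
    for match in HEAD_OF.finditer(text):
        person = match.group("person").replace(" ", "-")
        org = match.group("org").replace(" ", "-")
        facts.append(f"person({person})")
        facts.append(f"head-of({org}, {person})")
    return facts


facts = extract_head_of(
    "Mike Johnson is now head of the department of computing. "
    "Today he announced new funding opportunities."
)
```

The third fact on the slide, announced(...), is not attempted here: it requires resolving "he" to Mike Johnson, which is a coreference problem that a single surface pattern cannot solve.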

13

14 UK Data Archive - NLP collaboration
- ESDS Qualidata making use of options for semi-automated mark-up of some components of its data collections using natural language processing and information extraction
- new partnerships created – new methods, tools and jargon to learn!
- new area of application for NLP to social science data
- growing interest in UK in applying NLP and text mining to social science texts – data and research outputs such as publications’ abstracts

15 SQUAD Project: Smart Qualitative Data
Primary aim:
- to explore methodological and technical solutions for exposing digital qualitative data to make them fully shareable and exploitable
- collaboration between
  - UK Data Archive, University of Essex (lead partner)
  - Language Technology Group, Human Communication Research Centre, School of Informatics, University of Edinburgh
- 18 months duration, 1 March 2005 – 31 August 2006

16 SQUAD: main objectives
- developing and testing universal standards and technologies
  - long-term digital archiving
  - publishing
  - data exchange
- user-friendly tools for semi-automating processes already used to prepare qualitative data and materials (Qualitative Data Mark-up Tools, QDMT)
  - formatted text documents ready for output
  - mark-up of structural features of textual data
  - annotation and anonymisation tool
  - automated coding/indexing linked to a domain ontology
- defining context for research data (e.g. interview settings and dynamics and micro/macro factors)
- providing demonstrators and guidance

17 Progress
- draft schema with mandatory elements
- chosen an existing NLP annotation tool - NITE XML Toolkit
- building a GUI – with step-by-step components for ‘data processing’
  - data clean up tool
  - named entity and annotation mark-up tool
  - anonymise tool
  - archiving tool – annotated data
  - publishing tool – transformation scripts for ESDS Qualidata Online
- extending functionality of ESDS Qualidata Online system to include audio-visual material and linking to research outputs and mapping system
- from summer:
  - key word extraction systems to help conceptually index qualitative data – text mining collaboration
  - exploring grid-enabling data – e-science collaboration

Annotation tool - anonymise

Annotation tool

Anonymised data

Formats - how stored?
- saves original file
- creates new anonymised version
- saved matrix of references - names to pseudonyms
- outputs annotations – who worked on the file etc?
- NITE NXT XML model
  - uses ‘stand off’ annotation – annotation linked to or references words
- also about to test Qualitative Data Exchange Format – ANU in Australia
  - non-proprietary exchangeable bundle - metadata, data and annotation
  - testing import and export from Atlas-ti and Nvivo
  - will probably be RDF
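The storage approach described here (original kept intact, a separate anonymised version, a names-to-pseudonyms matrix, and annotations that point into the text) can be sketched as follows. This is a simplified Python illustration and not the NITE NXT data model: the function name anonymise, the annotation fields and the pseudonym "Carter" are assumptions made for the example.

```python
import re


def anonymise(text, pseudonyms):
    """Replace real names with pseudonyms and record stand-off annotations.

    Returns (anonymised_text, annotations). Each annotation stores character
    offsets into the ORIGINAL text, so the original file is never modified.
    """
    if not pseudonyms:
        return text, []
    # Longest names first so that a full name wins over a name it contains.
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile("|".join(rf"\b{re.escape(name)}\b" for name in names))
    annotations = []

    def swap(match):
        annotations.append({
            "start": match.start(),
            "end": match.end(),
            "type": "person",
            "pseudonym": pseudonyms[match.group()],
        })
        return pseudonyms[match.group()]

    return pattern.sub(swap, text), annotations


original = "Welham: Yes, when Granddad died."
matrix = {"Welham": "Carter"}  # hypothetical names-to-pseudonyms matrix
anonymised, annotations = anonymise(original, matrix)
```

Because the annotations refer to offsets rather than being embedded in the text, new annotation layers can be added later without touching either version of the file, which is the appeal of the stand-off design.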

Metadata standards in use
- DDI for Study description, Data file description, Other study related materials, links to variable description for quantified parts (variables)
- for data content and data annotation: the Text Encoding Initiative
  - standard for text mark-up in humanities and social sciences
  - using consultant to help test the DTD
  - will then attempt to meld with the QDIF

23 “Reduced” set of TEI elements
- core tag set for transcription; editorial changes
- names, numbers, dates
- links and cross references
- notes and annotations
- text structure
- unique to spoken texts
- linking, segmentation and alignment
- advanced pointing - XPointer framework
- Synchronisation
- contextual information (participants, setting, text)
ESDS Qualidata XML Schema

24

25 Metadata for model transcript output
Study Name: Mothers and daughters
Depositor: Mildred Blaxter
Interview number: 4943int01
Date of interview: 3 May 1979
Interview ID: g24
Date of birth: 1930
Gender: Female
Occupation: pharmacy assistant
Geo region: Scotland
Marital status: Married
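As a sketch of how metadata like this could be carried in an XML header, the Python example below serialises the fields with the standard library. The element names interviewHeader and field are hypothetical placeholders chosen for illustration; they are not taken from the ESDS Qualidata XML Schema or from the TEI.

```python
import xml.etree.ElementTree as ET

# The interview-level metadata shown on the slide.
METADATA = {
    "Study Name": "Mothers and daughters",
    "Depositor": "Mildred Blaxter",
    "Interview number": "4943int01",
    "Date of interview": "3 May 1979",
    "Interview ID": "g24",
    "Date of birth": "1930",
    "Gender": "Female",
    "Occupation": "pharmacy assistant",
    "Geo region": "Scotland",
    "Marital status": "Married",
}


def build_header(metadata):
    """Build an XML header with one <field name="..."> element per metadata item."""
    header = ET.Element("interviewHeader")  # hypothetical element name
    for label, value in metadata.items():
        field = ET.SubElement(header, "field", name=label)  # hypothetical element name
        field.text = value
    return header


xml_text = ET.tostring(build_header(METADATA), encoding="unicode")

# Reading the header back, e.g. to display a search result for this interview.
parsed = ET.fromstring(xml_text)
gender = parsed.find("field[@name='Gender']").text
```

Keeping the metadata in the same XML file as the transcript means one source can feed both the search-result display and the downloadable formats.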

26 Transcript with recommended XML mark-up

27 XML is source for .rtf download

28 Metadata used to display search results

29 XML+XSL enables online publishing

30 Information
- ESDS Qualidata Online site:
- SQUAD website: quads.esds.ac.uk/projects/squad.asp
- NITE NXT toolkit:
- ESDS Qualidata site:
We would like collaboration and testers!