Presentation on theme: "Setting the scene: the ESRC and JISC vision for access to qualitative data Louise Corti, ESDS Qualidata Economic and Social Data Service, UK Data Archive."— Presentation transcript:
Setting the scene: the ESRC and JISC vision for access to qualitative data Louise Corti, ESDS Qualidata Economic and Social Data Service, UK Data Archive Online Access to Qualitative Data: Opportunities and Challenges Friday 5 December, 2003 Royal Statistical Society
Aim of workshop Present issues/problems Describe Qualidata Online system –publishing data processes –querying data tools Hear what web content others are creating Hear more on data/XML mark-up standards Hear about e-science and the potential to use the Grid to share qualitative data Discuss challenges and solutions and propose future work directions Help inform ESDS Qualidata's plan of work
Talk overview ESDS Qualidata – remit and expectations Why do we need data online? How do we get data online? XML standards within the international data archiving community Other XML metadata and data standards and tools
ESDS Overview new national data archiving and dissemination service, running from 1 Jan. 2003 – 2008 provides access and support for key economic and social data distributed service, bringing together centres of expertise in data creation, dissemination, preservation and use provides seamless and easier access to a range of disparate resources for UK Higher and Further Education sectors core archiving services plus four specialist data services
Partners: UK Data Archive, Essex; MIMAS, Manchester; Cathie Marsh Centre for Census and Survey Research (CCSR), Manchester; Institute for Social and Economic Research (ISER), Essex
ESDS Qualidata specialist function of the new UK Economic and Social Data Service (ESDS) hosted by the UK Data Archive provides access to, and support for, a range of qualitative datasets the work builds on Qualidata's expertise and international reputation in this area, developed over the past eight years
Qualidata: old remit originally an independent unit within Dept Sociology UK national service for acquisition, dissemination and re-use of social science qualitative research data used network of UK archives for deposit worked closely with the UK Research Council (ESRC) to operate its Datasets Policy outreach activities and support for creating and depositing data, resource discovery
Types of qualitative data diverse data types: in-depth interviews; semi-structured interviews; focus groups; oral histories; mixed methods data; open-ended survey questions; case notes/records of meetings; diaries/research diaries multi-media: audio, video, photos and text (most common is interview transcriptions) formats: digital, paper, analogue audio-visual data structures - differ across different document types
Qualitative data collections data from National Research Council (ESRC) individual and programme research grant awards data from classic social science studies other funders/sources focus on DIGITAL Collections, but also facilitate paper-based archiving
Classic datasets Peter Townsend – Poverty, old age and Katherine Buildings Paul Thompson – oral history and Edwardians Ray Pahl – Hertfordshire Villages studies National Social Policy and Social Change Archive
Facilitate greater usage encouraging: researchers to consult existing data sources Teachers to use real data for teaching and learning Students to explore original data sources make obtaining data more straightforward promotion… exploit existing and new networks
ESDS Qualidata: key activities enhanced user guides and digital samplers exemplars and case studies of re-use online access to qualitative data user support and training activities to support secondary analysis of qualitative data
Enhanced user guides and digital samplers providing a better understanding of the study and research methods enhanced user guides – detailed notes on study methodology and re-use; behind the scenes interviews with depositors; FAQs thematic pages – combining interviews digital samplers of classic sociology collections
Exemplars and case studies of re-use providing guidance on data resources and how to re-use them overview of ways of re-using data case studies of re-use including reflections and commentary full bibliography of re-use articles online packaged training resources user support and training programme
Online access to qualitative data new emphasis on providing direct access to collection content –supports more powerful resource discovery –greater scope for searching and browsing content of data (supplementary to higher level study-related metadata) –since users can search and explore content directly… can retrieve data immediately providing access to qualitative data via common interface (Edwardians Online) supporting tools for searching, retrieval, and analysis across different datasets
Exploring qualitative data on-line More than file download Access to content and structure –Structural content Speakers Coded textual/audio data Place names etc. –Links to other project material Audio files; fieldnotes; photos; analytical annotations etc –Links to other sources Micro data; aggregate statistics; maps; census data etc.
Describing qualitative data Full catalogue record Data listing (ID, biog details, date of interviews, media, format, transcript details) Online pdf User Guide Use/processing notes Archival listing for large collections
User Read File UK DATA ARCHIVE DOCUMENTATION 4594 - Policing, Cultural Change and 'Structures of Feeling' in Post-War England, 1945-1999 Access conditions Until 1 May 2008, the depositor's permission must be sought for access - please contact Qualidata at UKDA for further details. Users should note that no access at all is permitted to the Metropolitan Police Commissioner's interview transcript (int54) until 31st January 2005. Conversion of data and documentation formats All 65 interview/focus group transcript files were converted to both MS WORD 97 and rich text formats. Both the MS WORD 97 and the rich text files are available to users. The hard copy documentation was scanned and is available as a one volume Acrobat PDF user guide. Anonymisation Some limited edits have been made to interview transcripts during processing to protect the identity of respondents. Care has been taken to ensure that this does not compromise the quality of information available. Data and documentation problems There are some spelling mistakes in the interview transcriptions, (left in situ due to limited processing resources), and the format transfer to Word has produced odd characters within the files in a very few cases. These issues should not present problems for secondary users. Notes from data delivery and post-order corrections
The Standard Study Description UK Data Archive: Recommended key bibliographic elements Devised in 1970s to describe academically created sociological/political science datasets Informally adopted by CESSDA in 1980s; some controlled vocabulary Often adapted to suit local needs
The Standard Study Description recommended elements: subject category title depositor principal investigator abstract and main topics kind of data dimensions of dataset universe sampled sampling procedures method of data collection dates of coverage, fieldwork and deposit availability and access conditions references to reports and related datasets
The first step towards interoperability Driven by the need to search across European SSDA holdings Development of a core element set for The Integrated Data Catalogue (IDC) – included Dublin Core elements Catalogue records marked with standard tags for inclusion into WAIS indexes (Wide Area Information Servers) Enabled multisite searching via Z39.50-WAIS protocol But excluded links to additional metadata, documentation, thesaurus help, and browsing.
International standards for metadata schema To ensure that every element of information pertaining to the lifecycle of an object (collection) can be captured: –Creation, appraisal, accessioning, conservation, preservation, availability and access Produce quality machine readable metadata designed to be human AND computer understandable and processable Must be dynamic and must be open to amendment Aim to be consistent, appropriate and self-explanatory description (encourages full documentation of data) Facilitate the retrieval and exchange of information Enable the integration of descriptions from different locations into a unified information system Increases the depth of access to collections Encourages cooperative metadata collection development and shared discovery tools Enable the sharing of authority data
Popular metadata schemas DTD = Document Type Definition Dublin Core: minimum number of elements required to facilitate the discovery of document-like objects in a networked environment (e.g. the Internet). Currently 15: Content: Title, Subject, Description, Source, Language, Relation, Coverage Intellectual Property: Author/Creator, Publisher, Contributor, Rights Electronic/Physical Manifestation: Date, Type, Format, Identifier AACR2 (Anglo-American Cataloguing Rules) ISAD(G) (General International Standard Archival Description)
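The Dublin Core element set listed above can be illustrated with a minimal sketch that writes a few of the 15 elements as XML. The `record` wrapper element and all field values are hypothetical; the namespace URI is the standard Dublin Core elements namespace, and the slide's "Author/Creator" corresponds to the `creator` element.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the 15 Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    record = ET.Element("record")  # hypothetical wrapper, not part of Dublin Core
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only; they do not describe a real collection.
record = dublin_core_record({
    "title": "Example interview collection",
    "creator": "Example, A.",
    "type": "Text",
    "format": "text/xml",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every element is optional and repeatable in simple Dublin Core, a plain mapping of names to values is enough for this sketch; a real catalogue record would normally need repeated elements as well.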
A schema for social science data DTD = Document Type Definition XML = eXtensible Markup Language DDI = Data Documentation Initiative
XML basics XML = eXtensible Markup Language XML is to a document's intellectual content what HTML is to the physical structure of that document Elements Attributes Attribute types (imposing controls) Hierarchies and nesting
Example of XML Survey Good survey Louise Corti 13. June 2001
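The markup of the example on this slide did not survive in the transcript, so the following is only a guess at how such values might be tagged. It is meant to show elements, an attribute, and nesting; the tag names `study`, `title`, `author`, `date` and the `type` attribute are assumptions, not the original slide's markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag and attribute names are illustrative guesses.
SAMPLE = """
<study type="survey">
  <title>Good survey</title>
  <author>Louise Corti</author>
  <date>13. June 2001</date>
</study>
""".strip()

root = ET.fromstring(SAMPLE)

# The root element carries an attribute; its children are nested elements.
print(root.tag, root.attrib)
for child in root:
    print(child.tag, "=", child.text)
```

Each value is wrapped in an element that names what it is, which is the sense in which XML describes a document's intellectual content and not its appearance.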
Data Documentation Initiative (DDI) A Document Type Definition (DTD) for an international codebook standard Established in 1995 by ICPSR, USA Used SGML, made compliant with XML in 1997 Creating automated tools for publishing data and associated information into XML http://www.icpsr.umich.edu/DDI/
Tag Library - 5 levels Document Description: Items describing the marked-up document itself as well as its source documents Study Description: Items describing the overall data collection (title, citation, methodology, study scope, data access, etc.) Data Files Description: Items relating to the format, size, and structure of the data files (physical descriptions) Variables Description: Items relating to variables in the data collection (logical descriptions) Other Study-Related Materials: Other study-related material not included in the other sections (bibliography, separate questionnaire file, etc.)
Code book – level 1 XML codeBook for SN:2000 2002-02-14 Nesstar Publisher
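The level 1 example above has lost its tags in the transcript. As a rough sketch, the three surviving values could sit in a DDI document description as follows. The element names (`codeBook`, `docDscr`, `citation`, `titlStmt`, `titl`, `prodStmt`, `prodDate`, `software`) follow the DDI Codebook tag library as I understand it; the exact nesting used on the original slide is an assumption.

```python
import xml.etree.ElementTree as ET


def build_doc_description(title, prod_date, software):
    """Return a <codeBook> holding a minimal document description (level 1)."""
    code_book = ET.Element("codeBook")
    doc_dscr = ET.SubElement(code_book, "docDscr")
    citation = ET.SubElement(doc_dscr, "citation")

    # Title statement: the title of the marked-up document itself.
    titl_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(titl_stmt, "titl").text = title

    # Production statement: when and with what software the XML was produced.
    prod_stmt = ET.SubElement(citation, "prodStmt")
    ET.SubElement(prod_stmt, "prodDate", date=prod_date).text = prod_date
    ET.SubElement(prod_stmt, "software").text = software
    return code_book


# Values taken from the slide text; the structure around them is assumed.
code_book = build_doc_description(
    "XML codeBook for SN:2000", "2002-02-14", "Nesstar Publisher"
)
print(ET.tostring(code_book, encoding="unicode"))
```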
Qualidata and the DDI All catalogue records use the DDI for document and study level descriptions (see handout) Survey data uses lower levels for describing variables - NESSTAR Behind the scene tables held in relational SQL form Exported to XML for web search and retrieve tools and online data manipulation
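The export path described above, from relational tables to XML, might look roughly like the sketch below. The `study` table, its columns, the output element names, and the sample row are all hypothetical; the Archive's actual schema is not described in the slides, and SQLite stands in for whatever database is used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory stand-in for the catalogue database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE study (sn INTEGER PRIMARY KEY, title TEXT, depositor TEXT)"
)
conn.execute(
    "INSERT INTO study VALUES (?, ?, ?)", (1, "Example study", "Example, A.")
)


def export_studies(connection):
    """Turn each row of the study table into a <study> element."""
    catalogue = ET.Element("catalogue")
    cursor = connection.execute(
        "SELECT sn, title, depositor FROM study ORDER BY sn"
    )
    for sn, title, depositor in cursor:
        study = ET.SubElement(catalogue, "study", sn=str(sn))
        ET.SubElement(study, "title").text = title
        ET.SubElement(study, "depositor").text = depositor
    return catalogue


catalogue = export_studies(conn)
print(ET.tostring(catalogue, encoding="unicode"))
```

Keeping the master copy in tables and generating XML on export means the web search tools always see a consistent snapshot, at the cost of re-running the export whenever a record changes.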
Preparing data for NESSTAR I 1. Prepare SPSS files to agreed standard. The basic level of preparation is that all variables need to be as fully labelled as possible. 2. Prepare the Excel files used for inputting question text and other related information, e.g. interviewers' instructions. 3. Question text: Add question text and other related information to the Excel files and perform quality checks as necessary. 4. Variable groups: Decide on the variable groups for the study and add this information to the question text Excel files. Create the variable group name file. 5. Study description: Create an XML file using the study description information contained in the Archive's catalogue. Add to this any additional information that is required at the DDI document description (section 1) level and the DDI study description (section 2) level, e.g. weighting information.
Preparing data for NESSTAR II 6. XML Generator: Use the XML Generator software to create the basic XML file and NSDstat files. Various items of information relating to specific variables can also be entered at this stage. 7. Run the QLoader program to add the question text, and related information, to produce the final XML file. 8. Validate the final XML file and check the layout etc. using the NESSTAR Publisher. 9. Publish the required dataset to a NESSTAR server. 10. Create a readme file to accompany a study if downloaded through NESSTAR. This should include information a user needs to be aware of, e.g. citation information. 11. Run a web-based program to inform the ACU that a new dataset has been published.
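To make steps 1, 3 and 7 more concrete, here is a minimal sketch of merging question text into a variable-level XML file and flagging unlabelled variables. It is not the QLoader program, whose workings are not described in the slides. CSV text stands in for the Excel files, the sample variables and questions are invented, and the element names (`dataDscr`, `var`, `labl`, `qstn`, `qstnLit`) follow the DDI tag library as I understand it.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical "basic" XML file with labelled variables (cf. steps 1 and 6).
BASIC_XML = """<codeBook><dataDscr>
  <var name="age"><labl>Age of respondent</labl></var>
  <var name="sex"><labl>Sex of respondent</labl></var>
</dataDscr></codeBook>"""

# Hypothetical question-text sheet, shown as CSV instead of Excel (cf. step 3).
QUESTION_CSV = "variable,question\nage,How old are you?\nsex,Are you male or female?\n"


def add_question_text(xml_text, csv_text):
    """Attach question text to matching variables; report unlabelled ones."""
    root = ET.fromstring(xml_text)
    reader = csv.DictReader(io.StringIO(csv_text))
    questions = {row["variable"]: row["question"] for row in reader}

    unlabelled = []
    for var in root.iter("var"):
        name = var.get("name")
        labl = var.find("labl")
        # Quality check: every variable should carry a non-empty label.
        if labl is None or not (labl.text or "").strip():
            unlabelled.append(name)
        if name in questions:
            qstn = ET.SubElement(var, "qstn")
            ET.SubElement(qstn, "qstnLit").text = questions[name]
    return root, unlabelled


final_root, unlabelled = add_question_text(BASIC_XML, QUESTION_CSV)
print(ET.tostring(final_root, encoding="unicode"))
print("Unlabelled variables:", unlabelled)
```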
Two other main schemas for spoken text Text Encoding Initiative – TEI Consortium extensive tag set for many types of text: e.g. prose, drama, speech. –A subset of elements are suitable for our needs e.g. speaker ID place and organisation names elements of transcription e.g. uncertainty pointers to external objects, e.g. audio files Lou to tell us more…! CHILDES, Talk Bank and CHAT format –5-year NSF grant to Carnegie Mellon University and the University of Pennsylvania
CHILDES system Developed due to inconsistency in transcription methodology, coding schemas and cross-investigator reliability for child speech data CHILDES – data exchange system for child language via shared database & tools Sharing transcription format shared codes shared analysis 3 tools: –CHAT transcription and coding format –CLAN analysis programme –Database
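The TEI subset mentioned above (speaker ID, place names, uncertain transcription) can be illustrated with a short sketch. The element names `u` with its `who` attribute, `placeName` and `unclear` are TEI elements for transcribed speech; the transcript fragment is invented, and the TEI header and any namespace declaration are left out to keep the example small.

```python
import xml.etree.ElementTree as ET

# Invented two-turn interview fragment using a few TEI elements.
TRANSCRIPT = """<text><body>
  <u who="#INT">Where did you grow up?</u>
  <u who="#R1">In <placeName>Colchester</placeName>, near the <unclear>mill</unclear>.</u>
</body></text>"""

root = ET.fromstring(TRANSCRIPT)

# Speaker turns: (speaker ID, full text of the utterance including nested tags).
turns = [(u.get("who"), "".join(u.itertext())) for u in root.iter("u")]

# Structural content that a search tool could index directly.
places = [p.text for p in root.iter("placeName")]
uncertain = [e.text for e in root.iter("unclear")]

for who, text in turns:
    print(who, text)
print("Places:", places)
print("Uncertain transcription:", uncertain)
```

Once speakers and names are marked up like this, searching by speaker or by place becomes a structural query and not a free-text match.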
Codes for the Human Analysis of Transcripts (CHAT) format CHAT transcription system Capacity for highly detailed encoding – e.g. for conversation analysis Standardised format for face-to-face transcriptions XML web format via TALKBANK project Some elements map to TEI
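As a rough illustration of the CHAT format, the sketch below pulls speaker turns out of a heavily simplified transcript. In CHAT, header lines begin with `@`, main speaker tiers begin with `*` followed by a speaker code, a colon and a tab, and dependent tiers begin with `%`. The sample text is invented and omits headers that a complete CHAT file would need, and continuation lines are not handled.

```python
import re

# Invented, simplified CHAT fragment (a full file needs further @ headers).
CHAT_SAMPLE = """@Begin
@Participants:\tCHI Target_Child, MOT Mother
*MOT:\twhat is that ?
*CHI:\tmore cookie .
%com:\tpoints at the jar
@End
"""

# Main tier: '*', speaker code, ':', tab, then the utterance.
MAIN_TIER = re.compile(r"^\*([A-Z0-9]+):\t(.*)$")


def main_tiers(chat_text):
    """Return (speaker, utterance) pairs, skipping headers and dependent tiers."""
    turns = []
    for line in chat_text.splitlines():
        match = MAIN_TIER.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


for speaker, utterance in main_tiers(CHAT_SAMPLE):
    print(speaker, utterance)
```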
TalkBank aims to foster fundamental research in the study of human and animal communication by: –constructing sample databases within subfields studying communication. –using databases to advance the development of standards and tools for creating, sharing, searching, and commenting upon primary materials via web TalkBank DTD covers: –speaker ID –pauses, tone, paralinguistic events, etc. & other non-verbal actions, events required for conversation analysis –unintelligible, omitted speech –dialectal variations –audio and video time markers http://www.talkbank.org/
- The first half of this file is linked only to audio (1.wav and 2.wav). - got to be here and Adam A, J, S laugh
Future work on data formats and DTD Recommended DTD that takes elements from DDI, TEI and Talkbank for a data content element –speakers/turn takers identity –themes and provenance of themes etc. –question text from schedule –links to documents e.g. notes, audio clips, photos etc. –real names/organisations/places etc Recommended data format –Enable preservation/portable format for coded qualitative data (import/export from CAQDAS packages) –Libby….