Researching e-Science Analysis of Census Holdings www.ucl.ac.uk/reach/ Dr Melissa Terras School of Library, Archive and Information Studies University College London
e-Science and the Humanities Little use has been made of the computational grid in humanities research The aims of the ReACH project were To establish the potential of applying grid technologies to analyse a complex and rich humanities dataset Pre-digitised Historical census data Of interest to academic researchers and general public To investigate how e-Science technologies may be appropriated in the arts and humanities Academic, Technical, Legal, Managerial, aspects of analysing large scale pre-digitized datasets using e-Science technologies Understand the characteristics and features of large scale humanities datasets which differentiate them from scientific datasets How does this affect the application of e-Science for research in the arts and humanities?
Partners UCL SLAIS Digital humanities, informatics, archives and digital preservation UCL Research Computing World leading expertise in High Performance, Grid and e-Science computing “Research Computing” High Levels of SRIF funding The National Archives who select, preserve and provide access to, and advice on, historical records, e.g. the censuses of England and Wales (and also the Isle of Man, Channel Islands and Royal Navy censuses) Ancestry.co.uk who own a massive dataset of census holdings worldwide, and who have digitized the censuses of England and Wales under license from The National Archives
Historical Census Data England and Wales Census Data – – 7 different censuses taken at 10 year intervals –20 GB, 200 million records Complex data set –Fields vary between each census year –Errors from those supplying the data, from those writing down those answers, from those transcribing those answers into the enumerator returns, and from those entering the data into the digital version of the records
Overview of aims Ascertain whether it would be technically possible Ascertain whether access to the data would be feasible Ascertain whether it would be useful to historians Ascertain whether the results from the project would be worthy of the intellectual and financial investment And what financial investment would be required to undertake the project
Data How do humanities datasets differ from scientific datasets? Does this preclude them from utilising e-Science technologies in research? Understand issues pertaining to the historical census Quality of data Importance of data to historians and researchers What can be done to process the data to improve and facilitate research How feasible, or useful, will that processing be Understanding legal and managerial aspects of licensing pre- digitized datasets for analysis using grid technologies Security Who owns the research outcomes?
Methodology - ReACH Workshop Series Series of 3 AHRC funded Workshops at UCL from June – August 2006 All Hands Workshop -June 2006 Featuring input from Historians, Archivists, Digital Librarians, Computing Scientists, Physicists, and Humanities Computing Experts What is the research question? It may be technically feasible – but will outcomes be useful? Technical Workshop -June 2006 Computing scientists, physicists, archivists Determining input, output, processing techniques, workflow, and costings of potential project Managerial Workshop – July 2006 Legal, security, and managerial aspects to using pre-digitized commercially sensitive data for research purposes
Historical issues – will it be useful? If data quality/ computational complexity is not an issue: Longitudinal dataset Dictionaries of variants Probability modelling of variants Log analysis of how people are using census material Checking and cleansing of census data Generation of simple statistics Calculating and identifying individuals who have been missed out in various censuses. Reconstitution of missing data in the records through contextual information Develop OCR techniques which can be used on copperplate Techniques for social computing and family histories Geographically normalised dataset Mapping of geography to names Assign grid references to historical data Adding current geographical data to the census Visualisation techniques
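A minimal sketch of two of the ideas listed on this slide, a dictionary of name variants and the generation of simple statistics. The variant lists, function names, and sample surnames are illustrative assumptions, not data or code from the ReACH project.

```python
from collections import defaultdict

# Hypothetical variant dictionary: each canonical surname maps to spellings
# that might appear in transcribed census returns (illustrative values only).
VARIANTS = {
    "smith": ["smith", "smyth", "smythe"],
    "clark": ["clark", "clarke", "clerk"],
}


def build_lookup(variants):
    """Invert the variant dictionary into a spelling -> canonical form map."""
    lookup = {}
    for canonical, spellings in variants.items():
        for spelling in spellings:
            lookup[spelling.lower()] = canonical
    return lookup


def normalise(surname, lookup):
    """Return the canonical form of a surname, or the cleaned input if unknown."""
    key = surname.strip().lower()
    return lookup.get(key, key)


def count_by_canonical(surnames, lookup):
    """Simple statistic: how many records fall under each canonical surname."""
    counts = defaultdict(int)
    for surname in surnames:
        counts[normalise(surname, lookup)] += 1
    return dict(counts)
```

A lookup built once with build_lookup(VARIANTS) can then be reused across census years, so that records spelled "Smyth" and "Smith" are counted together.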
Is it technically possible? Implementing a project would be relatively straightforward Mount it on UCL Research Computing facilities SGI Altix Facility: 135 GFlops Access to data relatively straightforward Outputted to XML database 20 GB of data warrants use of grid computing for searching and analysis Computational Grid techniques (and CS algorithms) No real understanding of tools to benchmark cross dataset record matching Of great interest to physicists, astronomers, astrophysicists, computing scientists…. Further research could investigate how automated record linking could be initiated, using probability modelling of variants
Is it feasible? Managerial Issues Send in the lawyers… Major legal issues in gaining access to commercially sensitive digitized data sets Need for consortium agreements Need to safeguard intellectual property rights Need to ascertain who owns research outcomes –Datasets created in the process of analysing other datasets Arts and Humanities need institutional backing in this area Access to small subset of data in first instance to establish proof of concept Need to set up secure systems and data management to ensure limited access to commercial datasets –Following lead of medical sciences
But is this possible with the information available? Historical census material Complex, and flawed dataset For historical reasons The very fact it is complex provides interesting opportunities to investigate record matching techniques Also, access to other datasets needed for “triangulation”: Births, marriages and deaths Burials Parish registers In England and Wales, this data is not in the public domain (yet), and not available in digital form In order to undertake this project successfully, a massive digitisation project would have to be undertaken first Or wait a few years until others undertake the digitisation project.
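A rough sketch of what cross-census record matching might look like. The record fields (forename, surname, age), the threshold, and the age tolerance are all assumptions made for illustration, and plain string similarity from Python's difflib stands in for the probability modelling of variants mentioned above. It is not a method the project implemented.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, years_apart=10, age_tolerance=2):
    """Score a candidate link between records from two consecutive censuses.

    Assumed record shape: {"forename": str, "surname": str, "age": int}.
    Returns 0.0 when the later age is implausible, else the mean name similarity.
    """
    expected_age = rec_a["age"] + years_apart
    if abs(rec_b["age"] - expected_age) > age_tolerance:
        return 0.0
    forename = name_similarity(rec_a["forename"], rec_b["forename"])
    surname = name_similarity(rec_a["surname"], rec_b["surname"])
    return (forename + surname) / 2


def best_match(record, candidates, threshold=0.8):
    """Return the highest-scoring candidate at or above threshold, else None."""
    scored = [(match_score(record, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return candidate if score >= threshold else None
```

Comparing every record against every candidate grows quadratically with the number of records, which is one reason a dataset of 200 million records would call for grid or high performance computing rather than a single machine.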
Findings: e-Science and the Census There has been much financial, industrial and academic investment in the creation of digital records from the English and Welsh historical census data BUT there is not the quantity nor quality of information currently available to allow useful and usable results to be generated, checked, and assessed –will change as more data is digitised and becomes public The potential for high performance processing of large scale census data is large –may result in useful techniques and datasets (for historian, genealogist and beyond) –Only when adequate historical data becomes available. –This should be revisited in the future
Findings – e-Science and the A + H High performance computing and e-Science community were very welcoming to researchers in the Arts and Humanities Often the problems facing e-Science research in the arts and humanities are not technical Nature of humanities data means that novel computational techniques need to be developed to analyse and process them: fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers, as opposed to scientific datasets: large scale, homogeneous, numeric, and generated (or collected/sampled) automatically Arts and Humanities projects need to engage with the legal issues in using and creating commercially sensitive datasets Sensitive data sets and security: Arts and Humanities researchers should look towards Medical Sciences for their methodologies in data security and management, in particular utilising ISO to maintain data integrity and security
Conclusion Aimed to deliver a full project proposal for future funding rounds Had to decide not to take this forward Undertaking this pilot project prevented long term funding being wasted on a project which would have failed Highlighted issues, problems, solutions, and barriers to any humanities project that may wish to use the computational grid to do complex record analysis Report available from