The development of Cascot: Computer Aided Structured Coding Tool
Rob Jones
Institute for Employment Research, University of Warwick
Project History
Two programs: Casoc (SOC 2000 coding) and Casic (SIC 92 coding).
Undocumented, monolithic, legacy DOS code.
Development undertaken
– Testing framework
– Modularisation (object orientation)
– Integration of Casoc and Casic into one program
– New GUI
Current Status
Single program, Cascot
– Capable of SOC 2000 and SIC 92 coding
– Any classification
Single scoring engine
Loadable classifications
– Structure
– Index
– Rules
Optional interfaces: web page and desktop application.
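The split between one scoring engine and loadable classifications can be pictured as a small data structure. This is only a sketch: the class name `Classification`, its field layout, and the `code_text` helper are hypothetical and are not taken from Cascot itself.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    """Hypothetical container for the three loadable parts named on the slide."""
    name: str
    structure: dict = field(default_factory=dict)  # code -> title
    index: list = field(default_factory=list)      # (code, index text) pairs
    rules: dict = field(default_factory=dict)      # rule type -> entries


def code_text(text, classification, score):
    """Apply one scoring function to whichever classification is loaded.

    Returns (best score, code), or (0, None) if the index is empty.
    """
    scored = [(score(text, entry), code) for code, entry in classification.index]
    return max(scored, default=(0, None))
```

The point of the design is that `code_text` does not know whether it is coding to SOC 2000, SIC 92, or anything else; only the loaded data changes.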
Classification: Structure
Nature of the classification
– Example. SOC 2000: 4 levels, Code & Title
1     Managers and Senior Officials
11    Corporate Managers
111   Corporate Managers And Senior Officials
1111  Senior officials in national government
1112  Directors and chief executives of major organisations
1113  Senior officials in local government
1114  Senior officials of special interest organisations
112   Production Managers
1121  Production, works and maintenance managers
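In the SOC 2000 example each level adds one digit to the code, so the parent groups of a code are simply its prefixes. A minimal sketch, assuming that prefix relationship holds throughout (the function names are illustrative, not Cascot's):

```python
# Codes and titles copied from the slide above.
SOC2000 = {
    "1": "Managers and Senior Officials",
    "11": "Corporate Managers",
    "111": "Corporate Managers And Senior Officials",
    "1111": "Senior officials in national government",
    "1112": "Directors and chief executives of major organisations",
    "112": "Production Managers",
    "1121": "Production, works and maintenance managers",
}


def ancestors(code):
    """Return every prefix of `code`, from the top level down to `code` itself."""
    return [code[:n] for n in range(1, len(code) + 1)]


def describe(code, structure=SOC2000):
    """Pair each level of `code` with its title, skipping unknown codes."""
    return [(c, structure[c]) for c in ancestors(code) if c in structure]
```

For example, `describe("1121")` walks from "Managers and Senior Officials" through "Corporate Managers" and "Production Managers" to the unit group itself.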
Classification: Index
Series of texts associated with given codes.
2312  Teacher (educational establishments: college of education)
2312  Teacher (further education)
2312  Teacher (higher and further education)
2312  Teacher (tertiary college)
2312  Teacher, dance (further education)
2312  Teacher, music (further education)
2312  Teacher; head (educational establishments: further education)
      Tutor (further education)
2312  Tutor (higher and further education)
2313  Adviser (education)
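One way to make such an index searchable is to map each word to the set of codes whose index texts contain it. The sketch below assumes a simple lower-cased, letters-only tokenisation; how Cascot actually stores its index is not stated in the slides.

```python
import re
from collections import defaultdict

# A few entries copied from the slide above.
INDEX = [
    ("2312", "Teacher (further education)"),
    ("2312", "Teacher, dance (further education)"),
    ("2312", "Tutor (higher and further education)"),
    ("2313", "Adviser (education)"),
]


def words(text):
    """Split text into lower-case alphabetic words (an assumed tokenisation)."""
    return re.findall(r"[a-z]+", text.lower())


def build_word_index(entries):
    """Map each word to the set of codes whose index texts use it."""
    by_word = defaultdict(set)
    for code, text in entries:
        for w in words(text):
            by_word[w].add(code)
    return by_word
```

With this structure, `build_word_index(INDEX)["education"]` returns both 2312 and 2313, while "dance" points only to 2312.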
Classification: Rules
Abbreviations – e.g. deli = delicatessen
Misspellings – e.g. taylor = tailor
Thesaurus alternatives – e.g. cook = (95%) chef
Default values – e.g. BUSINESS MANAGER = company manager
Non concluding text – e.g. Owner
Classification: Rules
Downgraded words – e.g. Trainee, Assistant, Senior
Noise words – e.g. and, of, with, in, at, the
Noise phrases – e.g. My Mother is
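Several of these rule types can be read as a preprocessing pass over the input text. The sketch below covers abbreviations, misspellings, noise words, noise phrases and downgraded words, using only the examples from the slides. The order of the steps and the 0.5 weight for downgraded words are assumptions made for illustration; thesaurus alternatives, default values and non concluding texts are left out.

```python
import re

# Rule entries taken from the slide examples.
ABBREVIATIONS = {"deli": "delicatessen"}
MISSPELLINGS = {"taylor": "tailor"}
NOISE_WORDS = {"and", "of", "with", "in", "at", "the"}
NOISE_PHRASES = ["my mother is"]
DOWNGRADED = {"trainee", "assistant", "senior"}

DOWNGRADED_WEIGHT = 0.5  # assumed value, not from the slides


def normalise(text):
    """Return a list of (word, weight) pairs after applying the rules."""
    text = text.lower()
    for phrase in NOISE_PHRASES:
        text = text.replace(phrase, " ")
    result = []
    for token in re.findall(r"[a-z]+", text):
        if token in NOISE_WORDS:
            continue
        token = ABBREVIATIONS.get(token, token)
        token = MISSPELLINGS.get(token, token)
        weight = DOWNGRADED_WEIGHT if token in DOWNGRADED else 1.0
        result.append((token, weight))
    return result
```

Under these assumptions, "Trainee taylor in the deli" reduces to trainee (downweighted), tailor and delicatessen.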
Principles of operation
Identify words.
Select all codes where those words are used.
Score all index entries in all those codes.
Score is composed of
– Global component
– Record component
Two-way comparison (Text-2-Index & Index-2-Text)
Final score (0-100) is known as the 'Certainty Score'.
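The slides do not give the scoring formula, so the following is only a toy illustration of what a two-way comparison could look like: one fraction measures how much of the input text is covered by the index entry, the other how much of the index entry is covered by the text, and their average is scaled to 0-100. The averaging, the tokenisation, and the function names are all assumptions; the global and record components are not modelled.

```python
import re

INDEX = [
    ("2312", "Teacher (further education)"),
    ("2313", "Adviser (education)"),
]


def words(text):
    """Set of lower-case alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def two_way_score(text, entry):
    """Toy 0-100 score combining Text-2-Index and Index-2-Text coverage."""
    t, e = words(text), words(entry)
    if not t or not e:
        return 0
    shared = t & e
    text_to_index = len(shared) / len(t)   # share of input words found in entry
    index_to_text = len(shared) / len(e)   # share of entry words found in input
    return round(100 * (text_to_index + index_to_text) / 2)


def best_match(text, index=INDEX):
    """Score every index entry and return (score, code, entry) for the best."""
    return max((two_way_score(text, entry), code, entry) for code, entry in index)
```

Comparing in both directions penalises an entry that contains all the input words but also many others, which a one-way match would not.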
Complexities of Scoring
Rules
– Create alternatives.
– Non concluding texts (in rules) => 39
Words are 'pseudo matched' before being searched for
– e.g. miner matches mine, miner, mineral, minerals, mines
Final score is adjusted by the next closest score.
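The slides do not say how pseudo matching is defined. One rule that happens to reproduce the "miner" example is to treat two words as a match when they share their first four letters; the sketch below uses that rule purely as an assumption, and the real matching logic may well differ.

```python
def pseudo_matches(word, vocabulary, prefix_len=4):
    """Return vocabulary words sharing the first `prefix_len` letters with `word`.

    Words shorter than `prefix_len` only match themselves, to avoid
    very short prefixes matching almost everything.
    """
    word = word.lower()
    if len(word) < prefix_len:
        return sorted(v for v in vocabulary if v.lower() == word)
    key = word[:prefix_len]
    return sorted(v for v in vocabulary if v.lower().startswith(key))
```

Given a vocabulary containing mine, miner, mineral, minerals, mines, minister and mint, `pseudo_matches("miner", vocabulary)` keeps the first five and drops minister and mint.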
Performance
Can be measured in many ways.
– Speed, throughput, accuracy
Speed: approx. 1,000 texts / minute.
Main test data
– LFS 96/97
– Total Records :
– Compared to manual coding.
Certainty Score
[Figure: Automatic text processing (SOC 2000): throughput and error rates by certainty score]
Certainty Score
[Figure: Automatic text processing (SIC 92): throughput and error rates by certainty score]
[Figure: The relationship between matching at SOC 2000 unit group level and the certainty score. Axes: % matching at each value of certainty score, against certainty score]
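The quantities plotted in these figures can be computed from any set of records that carry a certainty score, an automatic code and a manual code. The sketch below assumes throughput means the share of records accepted at or above a threshold, and error rate means the share of accepted records whose automatic code disagrees with the manual one. The sample records in the usage note are invented for illustration and are not results from the LFS test data.

```python
def throughput_and_error(records, threshold):
    """Return (throughput, error_rate) for a certainty-score threshold.

    `records` is a list of (certainty, auto_code, manual_code) tuples.
    """
    if not records:
        return 0.0, 0.0
    accepted = [r for r in records if r[0] >= threshold]
    throughput = len(accepted) / len(records)
    if not accepted:
        return throughput, 0.0
    errors = sum(1 for _, auto, manual in accepted if auto != manual)
    return throughput, errors / len(accepted)


# Invented toy records: (certainty score, automatic code, manual code).
TOY_RECORDS = [
    (95, "2312", "2312"),
    (80, "2313", "2313"),
    (60, "1121", "1112"),
    (40, "2312", "2313"),
]
```

On the toy records, a threshold of 50 accepts three of four texts with one disagreement, while a threshold of 90 accepts one text with none, which is the trade-off between throughput and error rate that the charts describe.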
Future Work
Classification editor
– Editing of rules
– Editing of structure, index entries
– Creation of new classifications
Performance enhancements
– Integrated spell checker
Integration of SOC & SIC coding (output of SOC coding influenced by SIC code)
Cascot Website
Cascot is freely available over the web.
Desktop version (for high volume use) coming soon.
Please register on the website.