1 Error Detection and Correction in Metadata. Nilu Prahallad, Zhenkun Zhou, Ting Zhang and Vamshi Ambati. Carnegie Mellon University, USA and Zhejiang University, China

Presentation transcript:

1 Error Detection and Correction in Metadata
Nilu Prahallad, Zhenkun Zhou, Ting Zhang and Vamshi Ambati
Carnegie Mellon University, USA and Zhejiang University, China

2 Agenda
- Typical errors in Metadata
  - Title
  - Language
  - Subject
  - Other fields
- Correction Strategies
- Future Research directions
- Learning from Example

3 Universal Digital Library
- Large-scale digital collection and archive, the first of its kind
  - 1.46 million books
  - 21 different languages
- Large-scale distributed collaboration, the first of its kind
  - Four countries: USA, China, Egypt, India
  - 35 scanning locations
  - 3000 people (or more…)

4 What has kept us busy for the last year?
- We reached 1 million books at our last meeting in Egypt
- Aggregating and cleaning the metadata took us one complete year
- Metadata is the most important component in a library, more so in a digital library
- Humans work in strange ways that computers don't YET

5 What is metadata?
- Information to identify a book
  - Title, Author, Year, Language, Subject, Publisher, Copyright
- Dublin Core standard
- Structural metadata: METS standard

6 Why do we have problems in Metadata?
- Cataloguing in libraries by professionals is accurate but expensive
  - $100 per book?
- At ULIB we want to get things done on a large scale but economically
- We are limited not by our vision but by our funds
- To err is human

7 Nature of the Problems
- Data entry problems
  - Genuine confusion
  - Careless entry
- Data normalization
  - Multiple languages and standards
  - Not a problem in itself, but absolutely necessary for multilingual access

8 What are the solutions on the table?
- Manual effort
  - Reliable but expensive and time-consuming
- Original born-digital metadata records
  - Not all books have them, and coordinating to get them is time-consuming
- Completely automatic, unsupervised
  - Not reliable; more good than harm?
- Semi-supervised techniques
  - Manual 20%, automatic 80%
  - We think we know how to work in such a scenario

9 Going Semi-Automatic
- Computers are really good at anomaly detection
- We identify errors, automatically correct the records we are most confident about, and set aside all doubtful cases for manual inspection
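
The split between automatic correction and manual inspection can be sketched as a simple confidence threshold. This is only an illustration: the `Suggestion` record, the `triage` function and the 0.9 threshold are assumptions, not part of the ULIB tooling described in the slides.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    """A proposed metadata fix (hypothetical record layout)."""
    record_id: str
    field: str
    current: str
    proposed: str
    confidence: float  # in [0, 1], produced by some detector (assumed)


def triage(suggestions, threshold=0.9):
    """Split suggestions into auto-apply and manual-review lists."""
    auto, manual = [], []
    for s in suggestions:
        if s.proposed == s.current:
            continue  # the detector agrees with the existing value
        if s.confidence >= threshold:
            auto.append(s)    # confident: correct automatically
        else:
            manual.append(s)  # doubtful: leave for a human
    return auto, manual
```

Raising the threshold moves more records into the manual queue, which trades cost for reliability.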

10 Language Identification: Problems and Solutions
Work done by Nilu Prahallad

11 Scale of the Problem
- 1.46 million books in the digital library
- 0.4 million books were tagged with the wrong language or no language at all

12 Problems in Language
- Blank Language field
- Wrong language assigned
- Non-standard conventions
- Multilanguage confusion

13 Blank Language Field
This is a French book. The data entry operator may not have known the language and so presumably tagged it as unknown.

14 Wrong language assignment
- Data entry errors (copy/paste errors)
  - A batch of books is given an arbitrary language
- Lack of language knowledge
  - Not all data entry operators know, identify or speak all the languages that we intend to digitize

15 Wrong language assignment
The book shown is a Chinese book about Japanese ethics. The title mentions Japanese, which led the operator to tag it as a Japanese book instead of a Chinese one.

16 Non-Standard Conventions
- Different data entry conventions
  - Ex: English, ENGLISH, en, eng
- Typographic errors by the data entry operators
  - ENGLIS, ENGL, etc.
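
Variants like these can be folded onto one canonical label with a small lookup plus fuzzy matching. The sketch below is an assumption about how this might be done, not the project's code: the `CANONICAL` list, the `CODES` table and the 0.75 cutoff are illustrative choices.

```python
import difflib

# Illustrative subset of canonical language names (assumed, not the ULIB list).
CANONICAL = ["English", "Chinese", "French", "Japanese", "Hindi", "Sanskrit", "Arabic"]

# ISO 639 style codes mapped to the canonical names above.
CODES = {
    "en": "English", "eng": "English",
    "zh": "Chinese", "chi": "Chinese", "zho": "Chinese",
    "fr": "French", "fre": "French", "fra": "French",
    "ja": "Japanese", "jpn": "Japanese",
    "hi": "Hindi", "hin": "Hindi",
    "sa": "Sanskrit", "san": "Sanskrit",
    "ar": "Arabic", "ara": "Arabic",
}


def normalize_language(raw, cutoff=0.75):
    """Map a raw Language value to a canonical name, or None if unsure."""
    key = raw.strip().lower()
    if not key:
        return None
    if key in CODES:
        return CODES[key]
    by_lower = {name.lower(): name for name in CANONICAL}
    if key in by_lower:
        return by_lower[key]
    # Truncated entries such as "ENGL": accept an unambiguous prefix.
    prefixed = [name for low, name in by_lower.items()
                if len(key) >= 4 and low.startswith(key)]
    if len(prefixed) == 1:
        return prefixed[0]
    # Typos such as "Englsh": fall back to fuzzy string matching.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None
```

Values that return `None` are the ones worth routing to a human rather than guessing.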

17 Multilanguage confusion
This is a Chinese book about reading techniques and approaches. The Language field is wrongly tagged as English; it should be Chinese.

18 Impact on ULIB
- The errors described in the preceding slides hinder the goal of the digital library
- Accurate and complete access to online books is not possible even though the books are available on the servers

19 Solutions
- Automatic detection of the language
- Method:
  - The language is detected automatically using language models
  - The steps involved in building the models are:
    1. Obtain the unique letter trigrams in each document
    2. Compute TF-IDF weights for each term
  - To identify the language of a given title, the steps are:
    1. Obtain the terms from the query title
    2. Compute the cosine correlation between the query title and all the documents
    3. Find the document that produces the maximum correlation with the query title
    4. The language of the query title is taken to be the language of that document
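
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses character trigrams, one reference text per language, a smoothed IDF, and the tiny `SAMPLES` texts are made up for the example.

```python
import math
from collections import Counter

# Tiny made-up reference texts, one "document" per language (assumption).
SAMPLES = {
    "English": "the history of the world and the people of the nation in the "
               "books of the library with notes on language and science",
    "French": "l'histoire du monde et les peuples de la nation dans les livres "
              "de la bibliothèque avec des notes sur la langue et la science",
}


def trigrams(text):
    """Character trigrams of the lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_models(samples):
    """Return (TF-IDF vector per language, IDF table)."""
    counts = {lang: Counter(trigrams(t)) for lang, t in samples.items()}
    n_docs = len(counts)
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    # Smoothed IDF so trigrams shared by every document keep a positive weight.
    idf = {g: math.log((1 + n_docs) / (1 + d)) + 1.0 for g, d in df.items()}
    vectors = {lang: {g: tf * idf[g] for g, tf in c.items()}
               for lang, c in counts.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def identify(title, vectors, idf):
    """Return (best language, cosine score) for a query title."""
    tf = Counter(trigrams(title))
    query = {g: n * idf[g] for g, n in tf.items() if g in idf}
    scores = {lang: cosine(query, vec) for lang, vec in vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With realistic reference texts per language, the returned score could also serve as the confidence value that decides between automatic correction and manual review.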

20 Solutions
- Advantages:
  - Our program can detect the language the book actually belongs to even when multiple languages are mentioned in the title
  - Even when the language is tagged as unknown, we can find the language of the book programmatically
  - We can correct errors in the Language field using the language model and MMR (maximal marginal relevance), by taking the correlation factor for the title and the corresponding language and then finding the least probable occurrences in that language
- Disadvantages:
  - This procedure is not 100% accurate, but it gives the desired results in most cases

21 Subject Categorization: Problems and Solutions
Ting Zhang

22 General Information
- Total Chinese and English books: 1,027,840
- Total number of combinational subjects: 210,439

23 Need for Subject Categories
- Subject navigation
- Narrowing down the range of a search

24 Problems with Subject
- Wrong categorization
- Blank Subject field
- Non-English Subject field
- Mixed-language Subject field
- Very detailed Subject field

25 Wrong categorization
- A History book got classified under Geography

26 Blank Subject
- Almost 300K books have “NULL” subject information

27 Non-English subject
- An English-language book tagged with a Chinese subject
- A Chinese-language book tagged with a Chinese subject might be acceptable, but it would create issues for multilingual search and access
- Mixed-language subject

28 Non-English subject
A Chinese book with a Chinese subject

29 Mixed Language Subjects
The subject of this book is described in a mixture of English and Chinese

30 Very detailed subjects
- Almost every book is tagged with a distinct variation of the subject

31 What needs to be done?
Standardize the set of subjects, such as art, biology, medicine and physics. We defined 29 such standard subjects and made sure that every sub-subject maps to one main subject, so that most of the books fit into these 29 subjects. All 29 categories are based on the CLC (Chinese Library Classification). See Appendix 1.
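
A sub-subject to main-subject mapping of this kind can be represented as a plain lookup table. The sketch below is illustrative only: it covers four of the 29 categories, with sub-subjects taken from Appendix 1, and the function name and the fallback to "Miscellaneous" are assumptions.

```python
# A few main subjects with sub-subjects from Appendix 1 (illustrative subset).
MAIN_SUBJECTS = {
    "Agriculture": ["agronomy", "gardening", "forestry", "fishery"],
    "Biology": ["genetics", "botany", "zoology", "microbiology"],
    "Medicine": ["clinical medicine", "pharmacy", "hygiene"],
    "Geography": ["geology", "oceanography", "meteorology"],
}

# Invert the table: every sub-subject points at exactly one main subject.
SUB_TO_MAIN = {sub: main for main, subs in MAIN_SUBJECTS.items() for sub in subs}
# A main subject name also maps to itself.
SUB_TO_MAIN.update({main.lower(): main for main in MAIN_SUBJECTS})


def main_subject(raw):
    """Map a raw Subject value to a main subject (hypothetical helper)."""
    key = " ".join(raw.lower().split())  # normalize case and spacing
    return SUB_TO_MAIN.get(key, "Miscellaneous")
```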

32 Solution: Semi-Automatic
- A librarian manually assigns one book to a particular category
- A programmer writes a program that identifies all titles in the ULIB collection whose title words overlap with it and attaches the same subject tag
- The process continues for at least 20% of the books, and the remaining 80% get corrected automatically
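
The propagation step can be sketched as follows. The slides do not say how overlap is measured, so the Jaccard score, the 0.5 threshold, the stopword list and the function names here are all assumptions. The whitespace-based tokenization also assumes space-delimited titles; Chinese titles would first need segmentation.

```python
import re

STOPWORDS = {"the", "and", "for", "with"}  # illustrative, not exhaustive


def title_words(title):
    """Lower-cased content words of a title (short words and stopwords dropped)."""
    return {w for w in re.findall(r"\w+", title.lower())
            if len(w) > 2 and w not in STOPWORDS}


def propagate(seeds, untagged, min_overlap=0.5):
    """Copy subjects from librarian-tagged seeds to similar untagged titles.

    seeds: list of (title, subject) pairs tagged by hand.
    untagged: list of titles without a subject.
    Returns {title: subject or None}.
    """
    result = {}
    for title in untagged:
        words = title_words(title)
        best_subject, best_score = None, 0.0
        for seed_title, subject in seeds:
            seed_words = title_words(seed_title)
            if not words or not seed_words:
                continue
            # Jaccard overlap between the two word sets.
            score = len(words & seed_words) / len(words | seed_words)
            if score > best_score:
                best_subject, best_score = subject, score
        result[title] = best_subject if best_score >= min_overlap else None
    return result
```

Titles that come back as `None` stay in the manual queue for the librarian.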

33 Our progress with the solution
- More than 600K Chinese books got a main subject category

34 TITLE Correction: Problems and Solutions
Zhenkun Zhou

35 ‘Title’ Statistics
- There are more than 1,466,000 books
- More than 1 million titles are not in English but in 20 other languages

36 Issues with the TITLE field
- Illegal characters
- Incomplete and incorrect titles
- Varying character sets
- Spelling variations (old / new variations)
- Segmentation and tokenization
- Non-native language titles

37 Illegal characters
- Mostly punctuation marks
- Example: " Watch Out for the Foreign Guests! "

38 Incomplete Titles
- Incomplete or partial titles
- Example: there are about 37 books with the same title, “Annual report”
  - Their titles should in fact be something like “Hong Kong Immigration Department Annual Report of the Year”

39 Varying character sets
- Titles in different character sets
  - GBK, UTF-8, ASCII

40 Varying spelling style
- Example:
  - 明實錄 : 明太宗實錄 (traditional Chinese)
  - 明实录 : 明太宗实录 (simplified Chinese)
- The same is true of old and new Arabic

41 Segmentation and Tokenization
- Not a problem, but an issue
- Most languages have word-level segmentation (a space, “ ”), which helps text processing
- For Chinese, segmentation is not easy to deal with, which prevents word-level search on titles

42 Non-native language titles
- Standard transliteration notation for enabling cross-language search
  - Ex. “齐白石” → “Qi Bai Shi”
- Displaying the transliteration and an equivalent translation of a title would let us know what the book is about

43 Solutions
- For titles with punctuation mistakes or incomplete titles
  - Use parsing tools to correct them, e.g. Perl
  - Advantage: regular expressions can handle different situations
  - Disadvantage: cannot anticipate all situations, and is sometimes imprecise
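
A regular-expression cleanup of this kind might look like the sketch below. The slide names Perl; Python's `re` module is used here for the same idea. The specific rules (unwrapping quotes, stripping symbols at the edges, collapsing spaces) are assumptions chosen for illustration, and as the slide warns they will not cover every case.

```python
import re

# A title wrapped in one pair of quotes, e.g. '" Some Title! "'.
WRAPPED = re.compile(r"^[\"'“‘]\s*(.*?)\s*[\"'”’]$")
# Stray symbols and spaces at the very start or very end of a title.
EDGE_JUNK = re.compile(r"^[\s*#=_\-,;:]+|[\s*#=_\-,;:]+$")


def clean_title(title):
    """Remove common punctuation noise from a title (illustrative rules)."""
    t = title.strip()
    m = WRAPPED.match(t)
    if m:
        t = m.group(1)                   # drop the wrapping quotes
    t = EDGE_JUNK.sub("", t)             # drop leading/trailing symbols
    t = re.sub(r"\s+", " ", t)           # collapse runs of whitespace
    t = re.sub(r" ([,;:!?.])", r"\1", t)  # no space before punctuation
    return t
```

A known weakness of the quote rule: a title such as `"A" and "B"` would be unwrapped incorrectly, which is the sort of unpredicted situation the slide mentions.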

44 Solutions
- For titles in different character sets
  - Convert the book titles to a Unicode (UTF) encoding, e.g. UTF-8
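
One way to do such a conversion is to try a short list of candidate encodings in order. This is a sketch under assumptions: the candidate order and the function name are illustrative, and it relies on the fact that GBK-encoded Chinese text is rarely valid UTF-8, so a strict UTF-8 decode is tried first (ASCII is a subset of UTF-8 and needs no separate case).

```python
def to_unicode(raw, candidates=("utf-8", "gbk")):
    """Decode title bytes, returning (text, encoding used or None)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # Nothing matched: keep what we can and flag it for manual review.
    return raw.decode("utf-8", errors="replace"), None


def to_utf8_bytes(raw):
    """Re-encode title bytes as UTF-8 for storage."""
    text, _ = to_unicode(raw)
    return text.encode("utf-8")
```

Records where the second return value is `None` are good candidates for the manual queue.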

45 Solutions
- For titles in different spelling styles
  - Convert the different titles of the same book into one style
    - Ex. “中国”, “中國” → “中国”
    - Advantage: offline, easy
    - Disadvantage: poor extensibility, conceptually incorrect
  - Transform titles between styles
    - Ex. “中国” → “中國”, “中國” → “中国”
    - Advantage: online, good extensibility
    - Disadvantage: needs processing time
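
Both strategies can be sketched with a character mapping table. The three-character table below is a toy assumption for illustration; a real table is far larger, and the mapping is not strictly one-to-one (one simplified character can correspond to several traditional ones), so a per-character table is only an approximation.

```python
# Toy traditional -> simplified table (assumption; real tables are much larger).
T2S = {"國": "国", "實": "实", "錄": "录"}
S2T = {s: t for t, s in T2S.items()}

TO_SIMPLIFIED = str.maketrans(T2S)
TO_TRADITIONAL = str.maketrans(S2T)


def normalize_title(title):
    """Offline strategy: store every title in one style (simplified here)."""
    return title.translate(TO_SIMPLIFIED)


def query_variants(query):
    """Online strategy: expand a query into both styles at search time."""
    return {query,
            query.translate(TO_SIMPLIFIED),
            query.translate(TO_TRADITIONAL)}
```

The offline variant changes the stored data once, while the online variant leaves titles untouched and pays the conversion cost on every query.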

46 Solutions
- Title translation and transliteration
  - Translate titles from different languages, e.g. “中国历史” → “Chinese History”
- Automatic translation (Zhejiang Univ) and transliteration module (open source tool)

47 Future Research Directions
Vamshi Ambati

48 Subject Categorization
- Text categorization
  - Requires a large amount of text
  - At ULIB, not all languages have an OCR
- Can we do well with sparse data?
  - Semantics of words using WordNet
- Can we use contextual information?
  - Ex: Jane Austen, Charles Dickens - Literature
  - Ex: Swami Prabudha - Religion

49 Language Identification
- Our ‘byte frequency’ based language identification approach has a lot of problems when the languages are close
  - Hindi, Sanskrit
- Can we use larger context?
  - Longer character sequences
  - Function words: ‘of’, ‘the’ (English)
  - Dictionaries
- Language identification from images

50 Agents that learn by Example
- OCLC arguably has the most accurate data we have seen so far
  - Can we programmatically access it, compare it with our existing data and correct ours?
- Some of the information about books is available in multiple catalogues all over the web (including Wikipedia)
  - Can we benefit from this?

51 Language Translation
- Good-enough translation for titles and subjects
- Universal Dictionary of All Languages (Dr. Shamos) could be a starting point
- Google translation systems could help
- A system at Xiamen University in China has already helped us do the translation
- We at CMU and IISc will address most of the other languages

52 Thank you Suggestions/Questions?

53 Appendix Subjects list

54 Appendix 1
Catalogue: Content
Agriculture: agricultural engineering, agronomy, gardening, forestry, herding, veterinary, hunting, silkworm, bee, aquatic product, fishery, etc.
Architecture: art of building, architectural science (including architectural exploration, architectural design, architectural structure, soil mechanics, building foundations, building materials, construction technology, building equipment, regional planning, town planning, public works)
Art: painting, calligraphy, seal cutting, photographic art, industrial art, music, dance, drama, cinematic, television art

55 Appendix 1
Astronomy: astronomy
Biography: biography
Biology: general biology, cytology, genetics, biochemistry, biophysics, molecular biology, bioengineering, environmental biology, paleontology, microbiology, botany, zoology, entomology, anthropology
Chemistry: inorganic chemistry, organic chemistry, macromolecular (polymer) chemistry, physical chemistry, theoretical chemistry, analytical chemistry, applied chemistry
Computer Science: automation, computing technique
Economics: political economics, economic profile, economic history, economic geography, economic planning, economic management, agricultural economy, industry economy, traffic and transport economy, trade, marketing

56 Appendix 1
Education: education, education at all levels, all forms of education, information & knowledge dissemination, cultural activities
Engineering: general industrial technology, mineral engineering, petroleum and natural gas industry, metallurgical industry, metallographic & smith craft, machinery & meter craft, weapon industry, energy industry, atomic energy technology, electro engineering, radio electronics & telegraphy, chemical industry, light industry & handicraft, hydraulic engineering, transportation, aviation & space flight
Environmental Science
Geography: human geography, nature geography, geophysics, topography, meteorology, geology, oceanography

57 Appendix 1
History: archaeology, folkways
Language: linguistics, minority language, foreign language, all kinds of language systems
Literature: literary theory, Chinese literature, etc.
Mathematics: mathematics
Medicine: basic medicine, clinical medicine, preventive medicine, hygiene, pharmacy, etc.
Military: strategy, tactics, military campaign, military technology, military geography, etc.
Natural Science: system theory, methodology, etc.

58 Appendix 1
Philosophy: logic, ethics, aesthetics, etc.
Physics: dynamics, physics, etc.
Poetry: poetry
Politics & Law: diplomacy, political relations, law, etc.
Psychology: psychology
Religion: religion, divination, superstition, etc.
Social Science: management theory, statistics, sociology, demography, science of personnel, etc.
General: encyclopedia, dictionary, book catalogue & abstract & indexing, etc.
Miscellaneous: not included in the above catalogue

59 For one million books
Out of the 1 million books, it is mainly the Chinese and English books that are tagged wrongly