1 Error Detection and Correction in Metadata. Nilu Prahallad, Zhenkun Zhou, Ting Zhang and Vamshi Ambati. Carnegie Mellon University, USA and Zhejiang University, China

2 Agenda: typical errors in metadata (title, language, subject, other fields); correction strategies; future research directions; learning from example.

3 Universal Digital Library: a large-scale digital collection and archive, the first of its kind (1.46 million books; 21 different languages). A large-scale distributed collaboration, also the first of its kind (four countries: USA, China, Egypt, India; 35 scanning locations; 3,000 people or more).

4 What has kept us busy for the last year? We reached 1 million books at our last meeting in Egypt. Aggregating and cleaning the metadata took us one complete year. Metadata is the most important component in a library, more so in a digital library. Humans work in strange ways that computers don’t, yet.

5 What is metadata? Information to identify a book: title, author, year, language, subject, publisher, copyright. Dublin Core standard. Structural metadata: METS standard.

6 Why do we have problems in metadata? Cataloguing in libraries by professionals is accurate but expensive ($100 per book?). At ULIB we want to get things done on a large scale but economically. We are limited not by our vision but by our funds. To err is human.

7 Nature of the problems: data entry problems (genuine confusion; careless entry); data normalization (multiple languages and standards; not a problem in itself, but absolutely necessary for multilingual access).

8 What are the solutions on the table? Manual effort: reliable but expensive and time-consuming. Original born-digital metadata records: not all books have them, and coordinating to get them is time-consuming. Completely automatic, unsupervised: not reliable; more good than harm? Semi-supervised techniques: manual 20%, automatic 80%. We think we know how to work in such a scenario.

9 Going semi-automatic: computers are really good at anomaly detection. We identify the records we are most confident about and correct them automatically, and we set all doubtful cases aside for manual inspection.
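
The routing described on this slide could look roughly like the sketch below. It is only an illustration: the `Suggestion` record, the `triage` function and the 0.9 threshold are assumptions made for this example, not part of the ULIB tooling.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    """A proposed metadata fix (hypothetical record layout)."""
    record_id: str
    field: str
    current: str
    proposed: str
    confidence: float  # 0.0 to 1.0, as reported by some detector


def triage(suggestions, threshold=0.9):
    """Split proposed fixes into auto-applied ones and a manual review queue."""
    auto, manual = [], []
    for s in suggestions:
        if s.proposed == s.current:
            continue  # nothing to change for this record
        if s.confidence >= threshold:
            auto.append(s)    # confident enough to correct automatically
        else:
            manual.append(s)  # doubtful case, left for a human
    return auto, manual


suggestions = [
    Suggestion("b1", "language", "unknown", "French", 0.97),
    Suggestion("b2", "language", "Japanese", "Chinese", 0.55),
    Suggestion("b3", "language", "English", "English", 0.99),
]
auto, manual = triage(suggestions)
```

Raising the threshold moves more records into the manual queue, which is the knob that trades cost against reliability.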

10 Language Identification: Problems and Solutions. Work done by Nilu Prahallad.

11 Scale of the problem: 1.46 million books in the digital library; 0.4 million books were tagged with the wrong language or with no language at all.

12 Problems in the Language field: blank language field; wrong language assigned; non-standard conventions; multi-language confusion.

13 Blank Language field: this is a French book. The data entry operator may not have known the language and so tagged it as unknown.

14 Wrong language assignment: data entry errors (copy/paste errors), where a bulk of books is given a random language; lack of language knowledge, since not all data operators can identify or speak all the languages that we intend to digitize.

15 Wrong language assignment: the above is a Chinese book that talks about Japanese ethics. The word "Japanese" in the title led the operator to tag it as a Japanese book instead of a Chinese one.

16 Non-standard conventions: different data entry conventions (e.g. English, ENGLISH, en, eng); typographic errors by the data entry operators (ENGLIS, ENGL, etc.).
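
One way to fold such variants into a single label is an alias table plus fuzzy matching for typos. The sketch below is an assumption-laden illustration, not the method used at ULIB: the `ALIASES` table, the `normalize_language` function and the 0.8 cutoff are all made up for this example, and only `difflib` from the Python standard library is used.

```python
import difflib

# Hypothetical alias table: lower-cased variants mapped to one canonical label.
ALIASES = {
    "english": "English", "en": "English", "eng": "English",
    "chinese": "Chinese", "zh": "Chinese", "chi": "Chinese",
    "french": "French", "fr": "French", "fre": "French",
}


def normalize_language(raw, cutoff=0.8):
    """Return a canonical language label, or None if the value needs review."""
    key = raw.strip().lower()
    if not key:
        return None  # blank field: nothing to normalize
    if key in ALIASES:
        return ALIASES[key]
    # Tolerate typos and truncations by picking the closest known alias.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


labels = [normalize_language(v) for v in ["English", "ENGLISH", "en", "eng", "ENGLIS", "ENGL"]]
```

Values that return `None` are the ones that would go to a person, or to a content-based language identifier.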

17 Multi-language confusion: this is a Chinese book about the techniques of reading and its approaches. The language field is wrongly tagged as English; it should be Chinese.

18 Impact on ULIB: the errors described in the preceding slides hinder the goal of the digital library. Accurate and complete access to online books is not available even though the books are available on the servers.

19 Solutions: automatic detection of the language. Method: the language is detected automatically using language models.
The steps involved in building the models are:
1. Obtain the unique letter trigrams in each document.
2. Compute TF-IDF weights for each term.
To identify the language of a given title, the steps are:
1. Obtain the terms from the query title.
2. Compute the cosine correlation between the query title and all the documents.
3. Find the document that produces the maximum correlation with the query title.
4. The language of the query title is taken to be the language of the document producing the maximum correlation.
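
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' program: "tri letter" is read here as character trigrams, the IDF is smoothed so that trigrams shared by every document keep a positive weight, and the class name `TrigramLanguageIdentifier` and the two toy reference texts are invented for the example.

```python
import math
from collections import Counter


def trigrams(text):
    """Character trigrams of a lower-cased, whitespace-normalized string."""
    t = " ".join(text.lower().split())
    return [t[i:i + 3] for i in range(len(t) - 2)]


class TrigramLanguageIdentifier:
    def __init__(self, labelled_docs):
        # labelled_docs: list of (language, reference text) pairs.
        self.languages = [lang for lang, _ in labelled_docs]
        counts = [Counter(trigrams(text)) for _, text in labelled_docs]
        n = len(counts)
        df = Counter()
        for c in counts:
            df.update(c.keys())  # document frequency of each trigram
        # Smoothed IDF (an assumption of this sketch).
        self.idf = {g: math.log((1 + n) / (1 + d)) + 1.0 for g, d in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        # TF-IDF weights; trigrams never seen in the references are ignored.
        return {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b.get(g, 0.0) for g, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def identify(self, title):
        """Return (language, score) for the best-matching reference document."""
        query = self._weigh(Counter(trigrams(title)))
        scores = [self._cosine(query, v) for v in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] == 0.0:
            return None, 0.0  # no evidence at all: leave for manual review
        return self.languages[best], scores[best]


identifier = TrigramLanguageIdentifier([
    ("English", "the history of the world and the people of the land"),
    ("French", "le livre de la bibliothèque et les histoires du monde"),
])
language, score = identifier.identify("the story of the people")
```

The returned score can double as the confidence value that decides whether a correction is applied automatically or queued for review.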

20 Solutions. Advantages: our program can detect the language a book actually belongs to even when multiple languages are mentioned in the title. Even when the language is tagged as unknown, we can find the language of the book programmatically. We can correct errors in the language field using the language model and MMR (maximal marginal relevance), by taking the correlation factor for the title and the corresponding language and then finding the least likely occurrences in that language. Disadvantages: the procedure is not 100% accurate, but it gives the desired results in most cases.

21 Subject Categorization: Problems and Solutions. Ting Zhang.

22 General information: total Chinese and English books: 1,027,840; total number of combinational subjects: 210,439.

23 Need for subject categories: subject navigation; narrowing down the range of a search.

24 Problems with Subject: wrong categorization; blank subject field; non-English subject field; mixed-language subject field; very detailed subject field.

25 Wrong categorization: a History book got classified into Geography.

26 Blank subject: almost 300K books have “NULL” subject information.

27 Non-English subject: an English-language book tagged with a Chinese subject. A Chinese-language book tagged with a Chinese subject might be OK, but it would create issues for multilingual search and access. Mixed-language subject.

28 Non-English subject: Chinese book with Chinese subject.

29 Mixed-language subjects: the subject of this book is described in a mixture of English and Chinese.

30 Very detailed subjects: almost every book is tagged with a distinct variation of the subject.

31 What needs to be done? Standardize the set of subjects, such as art, biology, medicine and physics. We have defined 29 such standard subjects and made sure that every sub-subject maps to one main subject, so that most of the books fit into these 29 subjects. All 29 categories are based on the CLC (Chinese Library Classification). See Appendix 1.

32 Solution: semi-automatic. A librarian manually assigns one book to a particular category. A programmer writes a program that identifies all titles in the ULIB collection whose title words overlap with that book's title and attaches the same subject tag. The process continues for at least 20% of the books, and the remaining 80% get corrected automatically.
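
A rough sketch of that propagation step follows. The slide does not say how overlap is measured, so the Jaccard measure, the 0.5 threshold, the stop-word list and the function names are all assumptions of this example. It also relies on whitespace-separated words, so Chinese titles would first need segmentation.

```python
import re

# Hypothetical stop-word list; a real one would be larger and per language.
STOPWORDS = {"a", "an", "and", "for", "of", "the"}


def title_words(title):
    """Lower-cased content words of a title."""
    return {w for w in re.findall(r"\w+", title.lower()) if w not in STOPWORDS}


def propagate_subjects(seed, untagged, min_overlap=0.5):
    """Copy librarian-assigned subjects to titles with enough word overlap.

    seed: dict mapping a manually categorized title to its subject.
    untagged: iterable of titles without a trusted subject.
    Returns (assigned, review): a title-to-subject dict and the leftover titles.
    """
    seed_words = [(title_words(t), subject) for t, subject in seed.items()]
    assigned, review = {}, []
    for title in untagged:
        words = title_words(title)
        best_score, best_subject = 0.0, None
        for s_words, subject in seed_words:
            union = words | s_words
            score = len(words & s_words) / len(union) if union else 0.0  # Jaccard
            if score > best_score:
                best_score, best_subject = score, subject
        if best_subject is not None and best_score >= min_overlap:
            assigned[title] = best_subject
        else:
            review.append(title)  # not similar enough to any seed title
    return assigned, review


seed = {
    "A History of Ancient Rome": "History",
    "Organic Chemistry Laboratory Manual": "Chemistry",
}
assigned, review = propagate_subjects(
    seed, ["Ancient Rome: A Short History", "Gardening for Beginners"]
)
```

Titles left in `review` are natural candidates for the next round of manual categorization, which in turn extends the seed set.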

33 Our progress with the solution: more than 600K Chinese books got a main subject category.

34 Title Correction: Problems and Solutions. Zhenkun Zhou.

35 ‘Title’ statistics: there are more than 1,466,000 books; more than 1 million titles are not in English but in 20 other languages.

36 Issues with the Title field: illegal characters; incomplete and incorrect titles; varying character sets; spelling variations (old/new variations); segmentation and tokenization; non-native language titles.

37 Illegal characters: mostly punctuation marks. Example: " Watch Out for the Foreign Guests! "

38 Incomplete titles: incomplete or partial titles. Example: there are about 37 books with the same title, “Annual report”. In fact, their titles should be something like “Hong Kong Immigration Department Annual Report of the Year 2000-2001”.

39 Varying character sets: titles in different character sets (GBK, UTF-8, ASCII).

40 Varying spelling style. Example: 明實錄 : 明太宗實錄 (traditional Chinese); 明实录 : 明太宗实录 (simplified Chinese). The same is true of old and new Arabic.

41 Segmentation and tokenization: not a problem, but an issue. Most languages mark word boundaries with a space (“ ”), which helps text processing. For Chinese, segmentation is not easy to deal with, which prevents word-level search on titles.

42 Non-native language titles: a standard transliteration notation enables cross-language search, e.g. “齐白石” → “Qi Bai Shi”. Displaying the transliteration and the equivalent translation of a title would let us know what the book is about.

43 Solutions for titles with punctuation mistakes or incomplete titles: use parsing tools to correct them, e.g. Perl. Advantage: regular expressions can handle the different situations. Disadvantage: not all situations can be predicted, and the result is sometimes imprecise.
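
The slide names Perl; the same regular-expression idea is sketched here in Python for illustration. The set of characters treated as illegal at the edges of a title, the `GENERIC_TITLES` list and both function names are assumptions of this example.

```python
import re

# Whitespace and quotation marks to strip from both ends of a title
# (an assumed set; the real list of illegal characters would be longer).
_EDGE_CHARS = r"""\s"'“”‘’"""
_EDGES = re.compile(f"^[{_EDGE_CHARS}]+|[{_EDGE_CHARS}]+$")
_SPACES = re.compile(r"\s+")

# Hypothetical list of titles too generic to identify a book on their own.
GENERIC_TITLES = {"annual report"}


def clean_title(raw):
    """Strip stray quotes and spaces at the edges and collapse inner whitespace."""
    return _SPACES.sub(" ", _EDGES.sub("", raw))


def needs_review(title):
    """Flag titles that are probably incomplete, such as a bare 'Annual report'."""
    return clean_title(title).lower() in GENERIC_TITLES


cleaned = clean_title('" Watch Out for the Foreign Guests! "')
```

Punctuation inside the title, such as the exclamation mark here, is left alone because it may be part of the real title. A regular expression cannot fill in a missing title, so those records are only flagged.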

44 Solutions for titles in different character sets: convert the book titles into a UTF character set, e.g. UTF-8.
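
A minimal sketch of such a conversion, assuming the stored bytes are either valid UTF-8 (which covers plain ASCII) or GBK. The function names and the order in which encodings are tried are assumptions of this example; a byte string that happens to be valid in both encodings would be read as UTF-8.

```python
def decode_title(raw, fallbacks=("gbk",)):
    """Decode title bytes, trying UTF-8 first and then the fallback encodings."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("title is not valid in any of the expected encodings")


def to_utf8_bytes(raw):
    """Re-encode a title as UTF-8 bytes regardless of its original encoding."""
    return decode_title(raw).encode("utf-8")


gbk_bytes = "中国历史".encode("gbk")
title = decode_title(gbk_bytes)
```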

45 Solutions for titles in different spelling styles. One approach: change the different titles of the same book into one style, e.g. “中国”, “中國” → “中国”. Advantage: offline, easy. Disadvantage: poor extensibility, not conceptually correct. Another approach: transform titles between styles on demand, e.g. “中国” → “中國” and “中國” → “中国”. Advantage: online, good extensibility. Disadvantage: needs processing time.
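
The on-demand transformation can be illustrated with a character mapping. The three-character table below is a toy assumption covering only the examples in these slides. A real system would need a complete conversion table, and the mapping between traditional and simplified characters is not always one-to-one.

```python
# Toy mapping from traditional to simplified characters (illustration only).
TRAD_TO_SIMP = str.maketrans({"國": "国", "實": "实", "錄": "录"})


def to_simplified(text):
    """Map the traditional characters in the toy table to simplified ones."""
    return text.translate(TRAD_TO_SIMP)


def same_title(a, b):
    """Compare two titles while ignoring the traditional/simplified difference."""
    return to_simplified(a) == to_simplified(b)


simplified = to_simplified("明實錄 : 明太宗實錄")
```

Comparing titles through `same_title` at search time leaves the stored titles in their original style, which matches the "online" option on the slide.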

46 Solutions: title translation and transliteration. Translate titles from different languages, e.g. “中国历史” → “Chinese History”. Automatic translation (Zhejiang Univ) and transliteration module (open source tool).

47 Future Research Directions. Vamshi Ambati.

48 Subject categorization: text categorization requires a large amount of text, and at ULIB not all languages have an OCR. Can we do well with sparse data? Semantics of words using WordNet. Can we use contextual information? E.g. Jane Austen, Charles Dickens: Literature; Swami Prabudha: Religion.

49 Language identification: our ‘byte frequency’ based language identification approach has a lot of problems when the languages are close, e.g. Hindi and Sanskrit. Can we use larger context? Longer character sequences; function words such as ‘of’ and ‘the’ (English); dictionaries. Language identification from images.

50 Agents that learn by example: OCLC arguably has the most accurate data we have so far. Can we programmatically access it, compare it with our existing data and correct ours? Some of the information about books is available in multiple catalogues all over the web (including Wikipedia). Can we benefit from this?

51 Language translation: good-enough translation for titles and subjects. The Universal Dictionary of All Languages (Dr. Shamos) could be a starting point. Google translation systems could help. A system at Xiamen University in China has already helped us do the translation. We at CMU and IISc will address most of the other languages.

52 Thank you. Suggestions/Questions?

53 Appendix: subjects list

54 Appendix 1 (Catalogue: Content)
Agriculture: agricultural engineering, agronomy, gardening, forestry, herding, veterinary, hunting, silkworm, bee, aquatic product, fishery, etc.
Architecture: art of building; architectural science (including architectural exploration, architectural design, architectural structure, soil mechanics, building foundations, building materials, construction technology, building equipment, regional planning, town planning, public works)
Art: painting, calligraphy, seal cutting, photographic art, industrial art, music, dance, drama, cinema, television art

55 Appendix 1
Astronomy: astronomy
Biography: biography
Biology: general biology, cytology, genetics, biochemistry, biophysics, molecular biology, bioengineering, environmental biology, paleontology, microbiology, botany, zoology, entomology, anthropology
Chemistry: inorganic chemistry, organic chemistry, macromolecular (polymer) chemistry, physical chemistry, theoretical chemistry, analytical chemistry, applied chemistry
Computer Science: automation, computing techniques
Economics: political economics, economic profile, economic history, economic geography, economic planning, economic management, agricultural economy, industrial economy, traffic and transport economy, trade, marketing

56 Appendix 1
Education: education, education at all levels, all forms of education, information and knowledge dissemination, cultural activities
Engineering: general industrial technology, mineral engineering, petroleum and natural gas industry, metallurgical industry, metallography and metalworking, machinery and instrumentation, weapons industry, energy industry, atomic energy technology, electrical engineering, radio electronics and telegraphy, chemical industry, light industry and handicraft, hydraulic engineering, transportation, aviation and space flight
Environmental Science
Geography: human geography, physical geography, geophysics, topography, meteorology, geology, oceanography

57 Appendix 1
History: archaeology, folkways
Language: linguistics, minority languages, foreign languages, all kinds of language systems
Literature: literary theory, Chinese literature, etc.
Mathematics: mathematics
Medicine: basic medicine, clinical medicine, preventive medicine, hygiene, pharmacy, etc.
Military: strategy, tactics, military campaigns, military technology, military geography, etc.
Natural Science: system theory, methodology, etc.

58 Appendix 1
Philosophy: logic, ethics, aesthetics, etc.
Physics: dynamics, physics, etc.
Poetry: poetry
Politics & Law: diplomacy, political relations, law, etc.
Psychology: psychology
Religion: religion, divination, superstition, etc.
Social Science: management theory, statistics, sociology, demography, personnel science, etc.
General: encyclopedias, dictionaries, book catalogues, abstracts and indexes, etc.
Miscellaneous: not included in the above categories

59 For one million books: out of 1 million books, it is mainly the Chinese and English books that are tagged wrongly.

