Nederlandse Organisatie voor Wetenschappelijk Onderzoek Script-Analysis Tools for the Cultural Heritage – SCRATCH Project review.

Presentation transcript:


From Scratch
Lambert Schomaker, Henny van Schie, Tijn van der Zant, Fons Laan, Sveta Zinger

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  PhD student: Tijn van der Zant  Postdoc: Sveta Zinger  Scientific programmer: Fons Laan  Coordinator: Lambert Schomaker  Nationaal Archief: Henny van Schie  several ‘invisible’ transcription volunteers on WWW People

Nederlandse Organisatie voor Wetenschappelijk Onderzoek   develop Script-retrieval methods strive for Script-Google don’t promise automatic transcription (but develop traditional recognizers ‘under the hood’) Goal

 Experimentation platform: KdK archive

SCRipt Analysis Tools for the Cultural Heritage
- Traditional OCR: know all character shapes in advance, know the language, know the page layout, then start on page 1 and work towards the end
- Scratch philosophy: coarse to fine-grained access, in an interactive cycle between users and machine
  - pages: annotate -> retrieval
  - paragraphs: annotate -> retrieval
  - lines: annotate -> retrieval
  - words: annotate -> retrieval -> recognition
  - characters: annotate -> retrieval -> recognition

Retrieval vs Recognition
- Retrieval: Return(Set_of Image_objects | a_keyword)
- Recognition: Return(Sequence_of Letters | an_image)
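The two signatures on this slide can be written out as typed functions. The following is only a sketch of the distinction, not the project's code: `ImageObject`, `retrieve`, `recognize` and the toy lookup tables are hypothetical names, and a real system would use trained classifiers instead of dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ImageObject:
    """Hypothetical handle to a page, line, or word-zone image."""
    image_id: str


def retrieve(keyword: str, index: Dict[str, Set[ImageObject]]) -> Set[ImageObject]:
    """Retrieval: Return(Set_of Image_objects | a_keyword)."""
    return index.get(keyword, set())


def recognize(image: ImageObject, labels: Dict[ImageObject, str]) -> List[str]:
    """Recognition: Return(Sequence_of Letters | an_image)."""
    return list(labels.get(image, ""))


# Toy data (assumed): two word-zone images that both carry the word "zulks".
zone_a = ImageObject("zone-001")
zone_b = ImageObject("zone-002")
toy_index = {"zulks": {zone_a, zone_b}}
toy_labels = {zone_a: "zulks", zone_b: "zulks"}

hits = retrieve("zulks", toy_index)      # a set of images for one keyword
letters = recognize(zone_a, toy_labels)  # a letter sequence for one image
```

Retrieval goes from a query to many images, so an imperfect hit list can still be useful; recognition goes from one image to one letter sequence, where every error shows up in the transcript.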

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  User interface and data-base issues  Fons Laan  Pattern-recognition methods  Lambert Schomaker  Behavior-based Machine Learning, scalability issues  Tijn van der Zant  Language modeling  Sveta Zinger  Transcription/annotation  Henny van Schie and several volunteers Research and activity Tracks

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  developments:  line-to-line matching (finished, published)  word-to-line matching (finished, published)  word spotting (continuing, published)  word recognition (continuing, published)  character recognition (continuing, unpublished yet)  layout analysis (published)  parallel computing (continuing, published)  continuous transcription/annotation (continuing process)  GUI, XML and database (continuing)  other types of collections:  Cliwoc / Admiraliteit 1777  Ubbo Elzevirium  migration to public server and Target project Status of project

[Diagram: processing routes. Unsheared line strips (B/W) -> crude automatic or manual word-zone segmentation -> unsheared word zones (B/W) -> word labeling. B: word retrieval and recognition route (word classifiers). C: word and character recognition (character classifiers).]

Line-transcription web site

Example of hit list for a trained word spotter: find the machine errors for abbreviation “RvSt”!

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  ~1200 scanned pages / 29k lines  of which 30k lines have now been transcribed via our web interface  Then, of a set of 115k Word Zones (white-space delimited text chunks), 74k words have been labeled by man+machine in concerted action during the last six year: Incremental Machine learning Data

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  In order to demonstrate genericity of the approach:  new KdK books will be processed  ‘cliwoc’ admirality books, e.g., captain’s log of 1777, is being processed at the moment  Difficult machine-printed books are a new target, e.g., Ubbo Emmius (1616). Rerum Frisicarum Historia, published by Elzevirium. Data

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  Several non-KdK collections (captain’s logs) have been ‘ingested’ by the system Page and line segmentation methods could be used without much problems  A ‘JAVA’ tool was developed for manual segmentation and labeling of words (in addition to the automatic method). This tool allows for fine-tuned word and letter harvesting.  An independent tool was developed for high-quality refinement of line transcriptions. This tool will now be migrated to WWW  New books (~37 KdK, ~1000 pages each) will be scanned Data, developments

Patch9x9: Kohonen map of 33x33 cells, trained on 1.6M patches
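To make the idea of a Kohonen map over image patches concrete, here is a minimal pure-Python sketch of self-organizing-map training. It is an illustration under stated assumptions, not the project's implementation: the function names, the linear decay schedules, and the toy 3x3 patches on a 3x3 map are all hypothetical, whereas the slide reports 9x9 patches, a 33x33 map and 1.6M training patches.

```python
import math
import random


def best_matching_unit(weights, patch):
    """Return (row, col) of the map cell whose weight vector is closest to patch."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, cell in enumerate(row):
            dist = sum((w - p) ** 2 for w, p in zip(cell, patch))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(patches, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a Kohonen map; learning rate and radius decay linearly (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(patches[0])
    # One randomly initialised weight vector per map cell.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(patches)
    step = 0
    for _ in range(epochs):
        for patch in rng.sample(patches, len(patches)):  # one shuffled pass
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            radius = max(radius0 * (1.0 - frac), 0.5)
            br, bc = best_matching_unit(weights, patch)
            # Pull the winning cell and its grid neighbours towards the patch.
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                    cell = weights[r][c]
                    for i in range(dim):
                        cell[i] += lr * influence * (patch[i] - cell[i])
            step += 1
    return weights


# Toy data (assumed): flattened 3x3 patches, either all dark or all bright.
dark, bright = [0.0] * 9, [1.0] * 9
som = train_som([dark] * 10 + [bright] * 10, rows=3, cols=3)
dark_cell = best_matching_unit(som, dark)
bright_cell = best_matching_unit(som, bright)
```

After training, each patch can be described by the grid position of its best-matching cell, so counting which cells are hit within a word image gives a fixed-length feature vector.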

Examples of the Patch9x9Quadrants feature (4x33x33): Amsterdam, Groningen

“typical results” (difficult samples, correct): aanhangig, Officier, overweging, zijn, zulks

 for nearest-neighbour matching Lambert Schomaker – ICCS Computational Humanities 1/6/2010
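Nearest-neighbour matching can be sketched in a few lines. This is a generic 1-nearest-neighbour lookup with squared Euclidean distance, offered as an assumption about what the slide refers to; the two-dimensional feature vectors and their word labels are invented for the example and do not come from the project.

```python
def nearest_neighbour(query, labeled_vectors):
    """Return the label of the stored feature vector closest to query (1-NN)."""
    best_label, best_dist = None, float("inf")
    for vector, label in labeled_vectors:
        dist = sum((q - v) ** 2 for q, v in zip(query, vector))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy feature vectors (assumed) for three labeled word images.
labeled = [
    ([0.9, 0.1], "Officier"),
    ([0.1, 0.8], "zulks"),
    ([0.5, 0.5], "zijn"),
]
guess = nearest_neighbour([0.85, 0.2], labeled)
```

Each newly labeled word simply becomes one more stored vector, with no retraining step, which is one reason this kind of matching suits an incrementally labeled collection.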

Advances in writer identification and verification – Lambert Schomaker

Lambert Schomaker – ICCS Computational Humanities 1/6/2010  #Labels as a function of ‘days on-line’, KdK 1903

Lambert Schomaker – Digihist 1/7/2010  #Labels, total as a function of ‘days on-line’, KdK 1903,1893,1897

[Plot: start t=0 synchronized]

Can we go from pixel to meaning? ... without going through the language (=ASCII) bottleneck?
- Idea: take a semantic category or concept
- Collect positive examples of text-line images
- Sample random, negative examples
- Train a binary classifier (SVM) to detect the presence of lines containing the concept
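The recipe above (positive examples, random negatives, a binary SVM) can be sketched as follows. The slides do not say which SVM variant, kernel or image features were used, so this is only an assumed stand-in: a linear SVM trained by stochastic sub-gradient descent on the regularised hinge loss, in pure Python, on invented two-dimensional feature vectors. The names `train_linear_svm` and `contains_concept` are hypothetical.

```python
import random


def train_linear_svm(samples, labels, epochs=100, eta=0.1, lam=0.01, seed=0):
    """Linear SVM via sub-gradient descent on the L2-regularised hinge loss.

    labels: +1 for a line containing the concept, -1 for a random negative.
    """
    rng = random.Random(seed)
    # Append a constant 1.0 so that the last weight acts as the bias term.
    data = [list(x) + [1.0] for x in samples]
    w = [0.0] * len(data[0])
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Regularisation step: shrink all weights slightly.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step: only samples inside the margin move the weights.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def contains_concept(w, features):
    """Predict whether a line image (given its features) contains the concept."""
    score = sum(wj * xj for wj, xj in zip(w, list(features) + [1.0]))
    return score > 0.0


# Toy features (assumed): positives and negatives are well separated.
positives = [[1.0, 2.0], [2.0, 1.5], [1.5, 1.0]]
negatives = [[-1.0, -2.0], [-2.0, -1.0], [-1.5, -1.5]]
weights = train_linear_svm(positives + negatives, [1] * 3 + [-1] * 3)
```

The classifier never sees a transcription: it maps image features directly to "concept present" or "concept absent", which is the sense in which the ASCII bottleneck is bypassed.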

Image-feature based ‘municipality’ detector
[Table: Concept | N labeled | Precision (%) | Recall (%). Row: City Names. Values not preserved in the transcript.]

Image-feature based ‘proper name’ detector
[Table: Concept | N labeled | Precision (%) | Recall (%). Rows: City Names, Proper Names. Values not preserved in the transcript.]

Image-feature based ‘personal role’ detector
- “ambtenaren” (civil servants)
- “vervanger” (substitute)
- “president”
- “verpleegde” (person being nursed)
[Table: Concept | N labeled | Precision (%) | Recall (%). Rows: City Names, Proper Names, Personal Roles. Values not preserved in the transcript.]
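The two reported measures follow the standard definitions of precision and recall. A minimal sketch of how they are computed for a binary concept detector is given below; the prediction and ground-truth lists are invented for illustration and are not the project's results.

```python
def precision_recall(predicted, actual):
    """Return (precision %, recall %) for parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # true positives
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # false alarms
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # misses
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example (assumed): 5 text lines, detector output vs. human labels.
detector_says = [True, True, True, False, False]
human_says = [True, True, False, True, False]
precision, recall = precision_recall(detector_says, human_says)
```

Here the detector fires three times and is right twice, and it finds two of the three labeled lines, so both measures come out at two thirds.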

X-position and semantics, Bayesian modeling
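The slide does not spell out the model, so the following is only one plausible reading: use Bayes' rule to estimate how likely a semantic category is, given the horizontal position of a word on the line. The coarse position bins, the Laplace smoothing constant `alpha`, the function names and the toy counts are all assumptions made for the sketch.

```python
from collections import Counter, defaultdict


def fit(observations, bins, alpha=1.0):
    """Count (category, x_bin) pairs; alpha is the Laplace smoothing constant."""
    prior = Counter(category for category, _ in observations)
    likelihood = defaultdict(Counter)
    for category, x_bin in observations:
        likelihood[category][x_bin] += 1
    return prior, likelihood, list(bins), alpha


def posterior(model, x_bin):
    """P(category | x_bin), proportional to P(x_bin | category) * P(category)."""
    prior, likelihood, bins, alpha = model
    total = sum(prior.values())
    joint = {}
    for category, n_cat in prior.items():
        p_x_given_cat = (likelihood[category][x_bin] + alpha) / (n_cat + alpha * len(bins))
        joint[category] = (n_cat / total) * p_x_given_cat
    norm = sum(joint.values())
    return {category: p / norm for category, p in joint.items()}


# Toy counts (assumed): city names mostly at the left, personal roles at the right.
observations = (
    [("city name", "left")] * 6 + [("city name", "right")] * 2
    + [("personal role", "right")] * 6 + [("personal role", "left")] * 2
)
model = fit(observations, bins=["left", "middle", "right"])
left_posterior = posterior(model, "left")
```

With these toy counts, a word found at the left of a line is judged to be a city name with probability 0.7, which could then be combined with an image-based detector score.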

From pixel to meaning
- these are examples of ‘data mining’
- regular handwriting recognition (transcoding into ASCII) is not always needed to get at meaning
- with sufficient examples, abstract concepts can be learned
- interaction and iteration
- e-Science

[Figure: example entry, annotated with: wind & weather, ‘gps’ coordinates, geographical name, event]

Results
- Important paper in IEEE PAMI (vd Zant et al.): biologically inspired features for word recognition (89%)
- Dissertation: end 2009 (TvdZ)
- Preliminary word-recognition results for other collections:
  - Admiraliteit, 1777: 93% (KohSOM of patches, 4 quadrants)
  - Ubbo Emmius, Latin: 95% (idem)
  - KdK % (idem, +MLP)
  - KdK-51+ (vd Zant): 93+% (engineered version of IEEE PAMI)
- New books will be scanned
- Ubbo Emmius: scanning was improvised, with a digital camera. Obtained 15 kEuro from the Gratama Foundation for serious scanning (Arnold Meijster)

Future
- Target project: 32 MEuro, 15 MEuro funded, 500 kEuro for us
  - massive, long-term data storage and access methods, from bioinformatics and astronomy to cultural heritage
  - about 6 person-years
- New collections are being ‘ingested’: Qumran Scrolls; new institutions are contacting us
- we are aiming at a second ‘big’ journal article, to be submitted end 2009

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  1 [J] van der Zant, T., Schomaker, L.R.B. & Haak, K. (2008). Handwritten-word spotting using biologically inspired features, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol x(x), pp. xxx-xxx. [accepted]  2 [J] Schomaker, L.R.B., Franke, K., Bulacu, M. (2007). Using codebooks of fragmented connected- component contours in forensic and historic writer identification. In: Pattern Recognition Letters, 28(6), p  3 [C] Schomaker, L.R.B. (2008). Word mining in a sparsely-labeled handwritten collection, Document Recognition and Retrieval XV, IS&T/SPIE International Symposium on Electronic Imaging, pp. xxx-xxx.  4 [C] van der Zant, T., Schomaker, L.R.B., Valentijn, E. (2008). Large-scale parallel document-image processing, Document Recognition and Retrieval XV, IS&T/SPIE International Symposium on Electronic Imaging, pp. xxx-xxx.  5 [C] Schomaker, L.R.B. (2007). Retrieval of handwritten lines in historical documents. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR 2007), IEEE Computer Society, pp. xxx-xxx, vol. x, September, Curitiba, Brazil.  6 [C] Bulacu, M., van Koert, R., Schomaker, L.R.B., van der Zant, T. (2007). Layout analysis of handwritten historical documents for searching the archive of the Cabinet of the Dutch Queen. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR 2007), IEEE Computer  Society, pp. xxx-xxx, vol. x, September, Curitiba, Brazil. Publications [j]=journal, [c]=peer-reviewed proceedings

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  7 [C] Bulacu, M., Schomaker, L.R.B. (2007) Automatic handwriting identification on medieval documents. In: Proceedings of 14th International Conference on Image Analysis and Processing (ICIAP 2007), IEEE Computer Society, September, Modena, Italy, pp  8 [CB] Schomaker, L.R.B. (2007). Reading Systems: An introduction to Digital Document Processing. In: B.B. Chaudhuri (Ed.), Digital Document Processing, Springer-Verlag: Guildford, Surrey, UK. ISBN: , p  9 [C] Bulacu, M. & Schomaker, L.R.B. (2007). Automatic handwriting identification on medieval documents, Proc. of 14th Int. Conf. on Image Analysis and Processing (ICIAP 2007), IEEE Computer Society, pp , September, Modena, Italy.  10 [C] van der Zant, T., Schomaker, L.R.B., Wiering, M. & Brink, A. (2006). Cognitive Developmental Pattern Recognition: Learning to Learn. IEEE SMC 2006, p  11 [C] S. Zinger, J. Nerbonne, L.R.B. Schomaker, H. van Schie, "Script Analysis Tools for the Cultural Heritage: text line matching for historical handwritten document retrieval", SIREN 2007, Scientific Information and communication technology Research Event Netherlands, p. 49, Delft (the Netherlands), October Publications

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  12 [C] S. Zinger, J. Nerbonne, L. Schomaker, H. van Schie, "Content-based text line comparison for historical document retrieval", Computational Phonology workshop, Recent Advances in Natural Language Processing conference, RANLP-2007, pp , Borovets (Bulgaria), September  13 [C] S. Zinger, J. Nerbonne, L.R.B. Schomaker, H. van Schie, "Script Analysis Tools for the Cultural Heritage: statistics on queries and line matching", SIREN 2006, Scientific Information and communication technology Research Event Netherlands, p. 91, Utrecht (the Netherlands), October  14 [P] Schomaker, L.R.B., Zant, T. van der & Bogaarts, J. (2005). NWO/Catch project "Scratch" - Script-Analysis Tools for the Cultural Heritage. A Pilot Experiment on sparse- knowledge search methods for handwritten collections. Informatica Platform Nederland (IPN), Siren, October 6th, Eindhoven, The Netherlands.  15 [P] M. Bulacu, A. Brink, T. van der Zant, L. Schomaker (2009) Recognition of handwritten numerical fields in a large single-writer historical collection. In: Proceedings of the 10th Int. Conf. on Document Analysis and Recognition (ICDAR 2009), July, Barcelona, Spain Publications

Nederlandse Organisatie voor Wetenschappelijk Onderzoek  Bulacu, M. & Schomaker, L.R.B. (2007). Text-independent Writer Identification and Verification Using Textural and Allographic Features, IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Special Issue - Biometrics: Progress and Directions, April, 29(4), p  Schomaker, L.R.B., Franke, K. & Bulacu, M. (2007). Using codebooks of fragmented connected-component contours in forensic and historic writer identification, Pattern Recognition Letters, 28(6), p  L. Schomaker & M. Bulacu (2004). Automatic writer identification using connected-component contours and edge-based features of upper-case Western script. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 26(6), June 2004, pp  M. Aussems, A. Brink (2009) Digital palaeography In: M. Rehbein, P. Sahle, and T. Schassan (Eds.), Kodikologie und Paläographie im Digitalen Zeitalter / Codicology and Palaeography in the Digital Age, ser. Schriftenreihe des Instituts für Dokumentologie und Editorik, Norderstedt: Books on Demand GmbH, 2009, vol. 2. ISBN:  Article [forthcoming]. Axel Brink, Jinna Smit, Marius Bulacu & Lambert Schomaker, "Quill Dynamics Feature for writer identification in historical documents". Publications – writer identification