Document Content Analysis for Digital Archives Eric Saund Perceptual Document Analysis Area Intelligent Systems Laboratory Palo Alto Research Center.

1 Document Content Analysis for Digital Archives
Eric Saund
Perceptual Document Analysis Area, Intelligent Systems Laboratory
Palo Alto Research Center

2 Digital Archives
Tasks / Operations:
- casual browsing
- look up information
- follow trails
- compose narratives
- form and organize collections
- distribute
- assemble timelines
- browse by topic, type, etc.
- search for known items
- search for items meeting criteria
- find duplicate items
- find similar items
- follow links
- establish links
- apply logical rules
- edit metadata
All enabled by Metadata
Diagram labels: Content layer, Metadata layer, Index
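A minimal sketch of how a metadata layer with an index could support two of the operations listed above ("search for items meeting criteria" and "find duplicate items"). The class name `MetadataIndex`, its methods, and the use of a SHA-256 content hash to detect exact duplicates are illustrative assumptions, not something the slides specify.

```python
import hashlib
from collections import defaultdict


class MetadataIndex:
    """Hypothetical index sitting between the metadata layer and the content layer."""

    def __init__(self):
        # (field, value) -> ids of items whose metadata has that exact value
        self._by_field = defaultdict(set)
        # content digest -> ids of items with byte-identical content
        self._by_digest = defaultdict(set)

    def add(self, item_id, metadata, content):
        """Register an item's metadata fields and a hash of its raw content."""
        for field_name, value in metadata.items():
            self._by_field[(field_name, value)].add(item_id)
        digest = hashlib.sha256(content).hexdigest()
        self._by_digest[digest].add(item_id)

    def search(self, **criteria):
        """Return ids of items matching every field=value criterion."""
        matches = [self._by_field.get(pair, set()) for pair in criteria.items()]
        return set.intersection(*matches) if matches else set()

    def duplicates(self):
        """Return groups of item ids that share identical content."""
        return [sorted(ids) for ids in self._by_digest.values() if len(ids) > 1]


index = MetadataIndex()
index.add("a1", {"genre": "invoice", "format": "fax"}, b"scan-bytes-1")
index.add("a2", {"genre": "invoice", "format": "pdf"}, b"scan-bytes-2")
index.add("a3", {"genre": "letter", "format": "fax"}, b"scan-bytes-1")

fax_invoices = index.search(genre="invoice", format="fax")
duplicate_groups = index.duplicates()
```

Exact-match lookup only covers the simplest operations in the list; "find similar items" would need a similarity measure computed over the content itself rather than over stored field values.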

3 Metadata
Two major problems with metadata:
1. Extracting metadata from raw content items.
2. Metadata is always incomplete for some purposes.
Metadata as a static record:
Title: Sarix neob
Date:
Media: niobium
Format: jnb
Author: Rsi Liwer
Text: aliirn xeca sarlia isyb...
Index ID: 34962s
pointer to item
Metadata as an interface (functions applied to item content):
computeSimilarityTo()
containsEntity?()
fitsSlotInModel?()
extractTextAfterImageCleanup()
Automatic Content Analysis
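The contrast between the two views of metadata can be sketched in code. In this illustration, `StaticRecord` holds fields that were filled in once, while `TextItem` answers questions by running functions over the item content on demand. The Python method names are snake_case stand-ins for two of the slide's function names; the Jaccard token overlap and the substring test are placeholder techniques chosen for brevity and are assumptions, not the methods the presentation has in mind.

```python
import re
from dataclasses import dataclass


@dataclass
class StaticRecord:
    """Metadata as a static record: fixed fields, possibly left empty."""
    index_id: str
    title: str = ""
    author: str = ""
    text: str = ""


def _tokens(text):
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class TextItem:
    """Metadata as an interface: answers are computed from the content."""
    text: str

    def compute_similarity_to(self, other):
        """Jaccard overlap of word sets (an assumed, simplistic measure)."""
        a, b = _tokens(self.text), _tokens(other.text)
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def contains_entity(self, name):
        """Naive case-insensitive substring check (assumed stand-in)."""
        return name.lower() in self.text.lower()


invoice = TextItem("Sale invoice for GE Capital Corp")
letter = TextItem("Invoice from GE Capital")

similarity = invoice.compute_similarity_to(letter)
mentions_ge = invoice.contains_entity("ge capital")
```

A record that lacks a field cannot answer a question about it, whereas the interface can always be extended with another function over the content, which is one way to read the slide's second problem ("metadata is always incomplete for some purposes").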

4 State of the Art
Media types: document image analysis; photographic image analysis; video/film analysis; audio analysis; web site analysis
Attributes listed on the slide: text; appearance, layout; who, what, where, when; topics; entities; genre; category; functional roles; genre; scenes; who, what, ...; genre; speech/music; speaker ID; transcription

5 When OCR Works...
APR :38 FR TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -I Page: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp 10 Riverview Drive Danbury, CT PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: SERIAL NO.: SM069A 26, ,;- UNIT NO.: SERIAL NO.: SM0G9B 26, DOWN PAYMENT BUILDING DELIVERY BUILDING DELIVERY BLOCK AND LEVEL BLOCK AND LEVEL 2, ANCHOR/TIE DOWN DECKING / ELECTRICAL 1, / PLUMBING 3, INSTALLATION SITE MANAGEMENT 1, SKIRTING- VINYL 1, TOTAL DUE THIS INVOICE 63,

6 How People See a Document
Annotated elements:
- Header alignment
- Graphical logo
- Font / Layout / Symbol Pattern of Fax ID Line
- Redacting markings
- Address block
- Repeated elements
- Hand-drawn graphical annotation
- Handwritten Textual Annotation
- Textual Field Indicator
- Tabular Layout
- Graphic separator
- STST
- Amount Field
Levels: Category; Type; Structural Elements and Relations; Relational Context
Labels: Invoice; Construction project; Supplier relationship; Inventory & materials management; Bill; Itemized purchase listing; Annotated document

7 Technology Ecology
Academia: Computer Vision, Document Recognition, Information Retrieval, Machine Learning, Speech Recognition, Natural Language, Artificial Intelligence
Industry: Document Imaging, Transaction Processing, Workflow Systems, Database Vendors, Business Software, Business Process Outsourcing, Advertising/Search
Paying Customer: government, industry, businesses, consumers, government
Hobbyists: museums, schools, local governments, NGOs, individuals, startups, boutique companies, shoestring projects in Academia and Industry
Characteristics: science-based, toy problems, fragile; engineering-based, robust, limited capabilities

8 Wanted: A Hobby Project
Document Capture Station + Collection Comprehension Engine

9 Collection Comprehension Engine
- OCR
- Document Structure Modeling
- Document Collection Linking
- Image Processing
- Automatic Cataloging
- Genre Tagging
- Clustering
- Classification
- Visualization
- GUI

10 Conclusion
The hobby stage brings together kindred spirits.

