
Appraisal and Data Mining of Large Size Complex Documents
Rob Kooper, William McFadden and Peter Bajcsy
National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC)
{kooper, wmcfadd, pbajcsy}@ncsa.uiuc.edu

Acknowledgments
This research was partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI-9619019.

Abstract
This poster addresses the problems of comprehensive document comparison and the computational scalability of document mining using cluster computing and the Map/Reduce programming paradigm. While the volume of contemporary documents and the number of embedded object types have been growing steadily, there is a lack of understanding of (a) how to compare documents containing heterogeneous digital objects, and (b) what hardware and software configurations are cost-efficient for handling document processing operations such as document appraisals. The novelty of our work lies in designing a methodology and a mathematical framework for comprehensive document comparisons covering the text, image and vector graphics components of documents, and in supporting decisions about using the Hadoop implementation of the Map/Reduce paradigm to perform counting operations.

Motivation
From the Strategic Plan of the National Archives and Records Administration: "Assist in improving the efficiency with which archivists manage all holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use." The motivation is to provide support for answering appraisal criteria related to document relationships, the chronological order of information, storage requirements, and the incorporation of preservation constraints (e.g., storage cost).

Experiments
For illustration purposes we used the NASA Columbia accident report on the causes of the Feb. 1, 2003 Space Shuttle accident. The report is 10 MB (10,330,897 bytes) and contains 248 pages with 179,187 words, 236 images (average dimensions of 209x188 pixels; 16,655,776 pixels in total), and 30,924 vector graphics objects. We compared the time it took to extract the occurrence statistics from the Columbia report using Hadoop with the time taken by a stand-alone application (SA).

Conclusions
The graph shows the execution times in milliseconds (y-axis) needed to extract occurrences of all PDF elements using the CCT and NCSA clusters, and using multiple data splits. The number of nodes used in the NCSA cluster ranged between 1 and 4. The results provide input into decision support for hardware and software investments in domains that process a large volume of complex documents.

Objectives
Design a methodology, algorithms and a framework for conducting comprehensive document appraisals by:
- enabling exploratory document analyses and integrity/authenticity verification,
- supporting automation of appraisal analyses,
- evaluating the computational and storage requirements of computer-assisted appraisal processes.

Proposed Approach
Decompose the series of appraisal criteria into a set of focused analyses:
- find groups of records with similar content,
- rank records according to their creation/last modification time and digital volume (see the sketch below),
- detect inconsistencies between ranking and content within a group of records,
- compare sampling strategies for the preservation of records.
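As a concrete illustration of the ranking analysis above, the following sketch orders a set of record files by last-modification time and, for ties, by digital volume (file size). It is a minimal sketch under assumed conditions (records stored as files in a single directory), not the authors' implementation.

```java
// Illustrative ranking of records by last-modification time, with file size
// (digital volume) as a tie-breaker. Directory layout and tie-breaking rule
// are assumptions for this sketch.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecordRanking {
  public static List<Path> rank(Path recordDir) throws IOException {
    try (Stream<Path> files = Files.list(recordDir)) {
      return files
          .filter(Files::isRegularFile)
          .sorted(Comparator
              // chronological order: oldest modification time first
              .comparing((Path p) -> {
                try { return Files.getLastModifiedTime(p); }
                catch (IOException e) { throw new RuntimeException(e); }
              })
              // ties broken by digital volume: smaller files first
              .thenComparingLong(p -> {
                try { return Files.size(p); }
                catch (IOException e) { throw new RuntimeException(e); }
              }))
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    for (Path record : rank(Path.of(args[0]))) {
      System.out.println(record.getFileName());
    }
  }
}
```

A ranking of this kind can then be checked against the content-based grouping to detect the inconsistencies mentioned in the third analysis.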
[Figure: INTEGRITY VERIFICATION – two or more document versions within one group; SAMPLING – document versions in groups 1 and 2]

Methodology
The methodology starts with pair-wise comparisons of the text, image (raster) and vector graphics components, computes their weights, establishes the group relationships to permanent records, and then focuses on integrity verification and sampling.

Hadoop
Hadoop Map/Reduce is a software framework for writing applications that perform operations decomposable into Map and Reduce phases and that process vast amounts of data in parallel. Hadoop-based applications run on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Configurations of Hadoop Clusters
We used two different clusters. The first cluster, hadoop1, is at NCSA and consists of four identical machines; it can run 20 Map tasks and 4 Reduce tasks in parallel. The second cluster, the Illinois Cloud Computing Testbed (CCT), is in the Computer Science department at the University of Illinois and consists of 64 identical machines; it can run 384 Map tasks and 128 Reduce tasks in parallel.

Document Operations Suitable for Hadoop
Our goal is to count the occurrences of words per page, of colors in each image, and of vector graphics elements in the document. The counting operation is computationally intensive, especially for images, since each pixel is counted as if it were a word. While one can find about 900 words per page, even a relatively small 209x188 image contains 39,292 pixels, which is equivalent to roughly 44 pages of text. A Map/Reduce sketch of this counting step is given below, after the figure descriptions.

[Figures: Exploratory View of Color Occurrences in a Selected PDF File and Its Image (occurrence of colors, list of images, preview, loaded files, "ignore" colors); Display of Pair-wise Document Similarities; Input PDF File Viewed in Adobe Reader]

For more information: http://isda.ncsa.uiuc.edu/NARA/

[Figure: Computational Scalability Using Hadoop – execution time (Time [ms], y-axis) vs. Data Split [pages] (x-axis) for each cluster configuration]
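The sketch below illustrates the counting step as a Hadoop Map/Reduce job, patterned on the standard Hadoop word-count example. It assumes the PDF elements (words, color values, vector graphics element names) have already been extracted into plain-text records, one element per token; the class and job names are illustrative and are not the authors' published code.

```java
// Minimal Hadoop Map/Reduce sketch for counting occurrences of PDF elements
// (words, colors, vector graphics elements) that were pre-extracted as tokens.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ElementOccurrenceCount {

  // Map phase: emit (element, 1) for every token in the input split.
  public static class OccurrenceMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text element = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        element.set(tokens.nextToken());
        context.write(element, ONE);
      }
    }
  }

  // Reduce phase: sum the emitted counts for each distinct element.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "element occurrence count");
    job.setJarByClass(ElementOccurrenceCount.class);
    job.setMapperClass(OccurrenceMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // extracted PDF elements
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // occurrence statistics
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner sums counts locally on each node before the shuffle phase, which reduces the data transferred between the Map and Reduce tasks.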

