
Digas Digital Archiving System

Digas is the database program used for research and fact checking in the Research Department (“Dokumentation”, ~60 researchers) and by the journalists writing for DER SPIEGEL (~270 journalists), manager magazin, SPIEGEL TV and SPIEGEL Online. Since 1995 the majority of incoming documents, and since 1998 all of them, have been digital. Several archiving systems have been used since 1991 (BRS/Search, Trip). Today we manage the document workflow in an Oracle database with a Java application and use a separate application (servlet / HTML) for searching the interMedia index.

Research in dossiers vs. full text searching

The archive in the Research Department contains more than 25 million documents and about 4 million pictures. These documents are traditionally organized in categories (= dossiers regarding subjects), companies and individuals. Documents were indexed in these categories so that they could be found easily. Only relevant articles were categorized by hand, so being indexed at all acted as an implicit relevance weighting.

In the digital archive of DER SPIEGEL this system has been modified, but the basic idea is still the same: categorization makes it fast to find only the relevant articles. In the traditional archive, the part of a paper that was not indexed was lost. In the digital archive, full text search in field-structured documents supports efficient research strategies for the professional researcher. The combination of categorization and full text search can be used to simplify the categorization model and to reduce this very cost-intensive work.

The majority of our users are journalists, who are not necessarily professional database researchers familiar with Boolean operators or complicated research strategies. Categorizing articles therefore remains one of our methods for supporting successful research. The Digas program supports full text search and research in dossiers in separate, specialized front ends.

Digas is a multitier client-server application (thin client).

Client -- http --> Servlet -- corba --> Server -- jdbc --> Oracle DB

Application Server: NT 4 (2 servers)
Database Server: HP Superdome (16 processors, 48 GB RAM; Tetragon, 6 GB cache)

Digas - the user front end

How does research with Digas work?
Examples of a dossier search and a full text search

The Digas document base – different views

Today there are 10 million documents in our database, about 15% in dossiers:
– 1.9 million articles in dossiers regarding individuals; 50% of these are images, not full text
– 1.4 million articles in dossiers regarding subjects
– 0.2 million articles in dossiers regarding company information
– 1 million full text articles with image data (jpg, pdf)

The average size of a document is 4 KB.

We import about new documents weekly, with a peak on Mondays (weekend editions).

In peak times there is an index load of up to 2000 new or changed documents per hour.

[Chart: sync times on Monday (week 21)]

Users in document management and research

About 30 users work on documents and about 60 users do research. Right now we see up to 1000 searches per day. We expect the system to be used by around 500 parallel users, with several thousand searches per day and up to 1000 searches per hour in peak times.

Performance

Right now 25% of the searches are dossier searches, and nearly 20% of these take longer than 10 seconds. Of the full text searches, 10% take longer than 10 seconds. Wildcard searches and phrase searches are usually the problem.
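The two percentages above can be combined into one overall figure. The following is a minimal sketch, assuming that every search that is not a dossier search is a full text search and treating "nearly 20%" as exactly 20%; neither assumption is stated on the slide.

```python
# Shares taken from the slide; "nearly 20%" is rounded to 0.20 (assumption).
DOSSIER_SHARE = 0.25   # share of all searches that are dossier searches
SLOW_DOSSIER = 0.20    # share of dossier searches taking longer than 10 s
SLOW_FULLTEXT = 0.10   # share of full text searches taking longer than 10 s


def overall_slow_share(dossier_share, slow_dossier, slow_fulltext):
    """Weighted share of all searches that take longer than 10 seconds."""
    # Assumption: all remaining searches are full text searches.
    fulltext_share = 1.0 - dossier_share
    return dossier_share * slow_dossier + fulltext_share * slow_fulltext


share = overall_slow_share(DOSSIER_SHARE, SLOW_DOSSIER, SLOW_FULLTEXT)
print(f"{share:.1%} of all searches take longer than 10 seconds")
```

Under these assumptions roughly one search in eight exceeds the 10-second mark.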

Why Oracle

– Scalability, performance, easy support for Unix / HP-UX
– Professional support and commitment for our mission-critical application
– Integration in our document management
– Synergy effects in further developing the applications for research and document management
– Full text features were quite limited; we expect fast development

Intermedia index and execution of a search

– The index is built using USER_DATASTORE. A PL/SQL procedure creates a virtual XML document, which is then indexed.
– “Manual partitioning”: we have two sets of indexes, each divided into three parts (90 days, 270 days and rest). These indexes are kept on different columns of the document table. This improves index performance and manageability, but searches get more complex.
– The second index set is a safeguard: rebuilding an index takes the whole weekend.
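The idea of a virtual XML document can be sketched outside the database. In Digas this step is a PL/SQL procedure; the Python below only illustrates the principle of folding all searchable attributes of one document into a single field-structured text. The field names (`title`, `date`, `source`, `dossier`, `body`) and the `build_virtual_document` helper are hypothetical and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of searchable attributes; the real Digas schema is not shown.
FIELDS = ("title", "date", "source", "dossier", "body")


def build_virtual_document(row):
    """Fold the searchable attributes of one document row into one XML string."""
    root = ET.Element("doc")
    for field in FIELDS:
        value = row.get(field)
        if value is None:
            continue  # attribute not set for this document
        # Multi-valued attributes (e.g. several dossiers) become repeated elements.
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, field).text = str(item)
    return ET.tostring(root, encoding="unicode")


example_row = {
    "title": "Example headline",
    "date": "2001-05-21",
    "dossier": ["Individuals/Example", "Subjects/Example"],
    "body": "Full text of the article ...",
}
print(build_virtual_document(example_row))
```

Because every attribute ends up in its own element, a text index over this string can restrict a query to a single field, which is what makes the "all attributes in one index" approach workable.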

All searchable document attributes are kept in the interMedia index (performance). This results in a lot of database triggers and in complex search execution, e.g. to support fast searches that include date ranges.

– No stoplist is used
– No substring or prefix indexing is used (index size)
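The slides do not say how dates are represented inside the text index. One common way to make a date range searchable in a pure full text index, shown here purely as an assumption, is to store year, month and day tokens with every document and to expand a range into the smallest covering set of such tokens. The token format (`Y2001`, `M200105`, `D20010521`) and both function names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def date_tokens(day):
    """Tokens stored in the index for a document dated `day`."""
    return [f"Y{day:%Y}", f"M{day:%Y%m}", f"D{day:%Y%m%d}"]


def range_tokens(start, end):
    """Cover the inclusive range [start, end] with year, month and day tokens."""
    tokens = []
    current = start
    while current <= end:
        year_end = date(current.year, 12, 31)
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = date(current.year, current.month, last_day)
        if current.month == 1 and current.day == 1 and year_end <= end:
            tokens.append(f"Y{current:%Y}")       # a whole year fits
            current = year_end + timedelta(days=1)
        elif current.day == 1 and month_end <= end:
            tokens.append(f"M{current:%Y%m}")     # a whole month fits
            current = month_end + timedelta(days=1)
        else:
            tokens.append(f"D{current:%Y%m%d}")   # fall back to a single day
            current += timedelta(days=1)
    return tokens


print(" OR ".join(range_tokens(date(2001, 5, 30), date(2001, 7, 2))))
```

A scheme like this would explain both effects described in the talk: every document contributes extra items to the index, and a range that does not align with month boundaries expands into a long OR list in the search statement.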

Discussion

– The scalability we need for our application depends on a highly customized software solution for maintaining the index and executing searches.
– Sync and optimize of the different indexes are scheduled by a procedure written specifically for this application.
– All searchable attributes have to be kept in the interMedia index for performance reasons. The integration of structured searches (joins) and full text searches is weak; one might expect this to be different.

Date range search, the large number of attributes required by the dossier search, and the large document base with highly structured documents all increase the number of items in the index. Another consequence of keeping all attributes in the interMedia index is that search statements grow quite long. They have to be copied and optimized for the three separate indexes (time slices) and split further if they are longer than 4000 characters; the result sets are then combined and sorted for presentation.
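The copy, split and merge steps described above can be sketched as follows. This is a simplified illustration, not the Digas implementation: it only handles OR-joined terms (where splitting and taking the union of the results is valid), the index names are hypothetical, and `run_query` stands in for whatever actually executes a statement against one index.

```python
from datetime import date

# Hypothetical index names, one per time slice (90 days, 270 days, rest).
SLICES = ("idx_recent", "idx_middle", "idx_rest")
MAX_STATEMENT_LENGTH = 4000  # limit mentioned in the talk


def split_terms(terms, limit=MAX_STATEMENT_LENGTH, joiner=" OR "):
    """Pack OR-joined terms into statements of at most `limit` characters.

    A single term that is itself longer than `limit` is passed through unchanged.
    """
    statements, current = [], ""
    for term in terms:
        candidate = term if not current else current + joiner + term
        if len(candidate) > limit and current:
            statements.append(current)  # close the statement that is full
            current = term
        else:
            current = candidate
    if current:
        statements.append(current)
    return statements


def merge_results(result_sets):
    """Union the hits of all partial searches and sort them newest first."""
    newest = {}
    for hits in result_sets:
        for doc_id, doc_date in hits:
            newest[doc_id] = doc_date  # duplicates collapse to one entry
    return sorted(newest.items(), key=lambda item: item[1], reverse=True)


def search_all_slices(terms, run_query, slices=SLICES, limit=MAX_STATEMENT_LENGTH):
    """Run every partial statement against every time slice and merge the hits."""
    result_sets = []
    for index_name in slices:
        for statement in split_terms(terms, limit):
            result_sets.append(run_query(index_name, statement))
    return merge_results(result_sets)


def fake_run_query(index_name, statement):
    """Stand-in for the database call; returns (doc_id, date) pairs."""
    canned = {
        "idx_recent": [(1, date(2001, 5, 21))],
        "idx_middle": [(2, date(2001, 1, 3))],
        "idx_rest": [(3, date(1999, 8, 9))],
    }
    return canned[index_name]


print(search_all_slices(["spiegel", "oracle"], fake_run_query))
```

Queries that combine terms with AND cannot be split this way, which hints at why the real search execution is considerably more complex than this sketch.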

The locking issue

The documents being indexed are locked from the beginning to the end of a sync run. Keeping a full text index in sync is an asynchronous process, which should be done without any locking. We believe that this is a serious design issue. Even in our hardware environment a sync can run up to two minutes, which is in itself not a bad thing. This locking behavior is bound to be a severe problem in every large-scale environment where full text search and document management are done on the same document base. Together with Oracle we are currently working on workarounds.

These are the problems we consider most important to solve in the ongoing development of Oracle interMedia:

– No locking during sync (and optimize)
– Tighter integration of full text queries and structured constraints (e.g. date range)
– Archive log volume during ctx_ddl.optimize_index: 50 to 80 GB of archive log daily
– Better performance in wildcard and phrase searches
– Support for refining a search, e.g. running a search on the result set of another search
– Better support for getting the first rows of a result set ordered by date

Contact: Heiner Ulrich DER SPIEGEL phone