Oracle Text saves your time Oracle Text Search saves your time Anna Suwalska European Organization for Nuclear Research - Geneva OracleWorld Paris 2003.

Presentation transcript:

Oracle Text Search saves your time. Anna Suwalska, European Organization for Nuclear Research, Geneva. OracleWorld Paris 2003.

Content
- CERN
- Engineering Data Management System at CERN
- Oracle Text
- How we profit from this technology
- Conclusion

Content: CERN

CERN - European Organization for Nuclear Research
- The world's largest particle physics research laboratory
- Founded in 1954, CERN today has 20 member states
- 2400 staff
- Over 6500 scientists come here to use the research facilities
- 500 universities, over 80 nationalities
- CERN explores what matter is made of and what forces hold it together
- The WWW was born here

LHC - The Large Hadron Collider Project

LHC - Cryodipole

Content: EDMS, Engineering Data Management System

EDMS - Engineering Data Management System
- EDMS Portal
- EDMS Common layer
- Underlying systems: Axalant, MP5, other DBs
- Data covered: design data, documents and drawings, asset tracking, work management

EDMS - Engineering Data Management System
Managing:
- Structures
- Complete life cycle for single and compound documents
- Document versioning
- Document approval processes (comments collector)
- Assemblies
- Equipment workflow, data
- Installation (jobs, locations, etc.)

EDMS mandate
- Manage a full description of the LHC project's engineering data over its lifetime (>25 years)
- Support and coordinate engineering work, information and data workflow
- Establish knowledge transfer: evolving staff, many short-term visitors
- A full description of the machine and its components through their life cycle must be constantly available to all concerned parties
- Help trace solutions to all problems occurring in the machine
- Provide an efficient search tool to support the requirements above; our choice: Oracle Text
Life-cycle phases: Design, Installation, Operation, Dismantling

Our needs, and Oracle Text as our choice
- Index metadata & files: first-line search is done on metadata, but the possibility to index files is essential. Oracle Text supports most document formats.
- Bi-lingual: the official CERN languages are English and French, and we have to support both. Oracle Text supports 39 languages.
- Performance: response time is very important. Oracle Text is very efficient for searches within a big collection of data.
- Simplicity:
  - Simple for users: results come with a scoring methodology that helps navigate through a result
  - Simple to develop: standard SQL statements
  - Simple to maintain: easy to maintain with ALTER INDEX or the CTX_DDL package
- Oracle Text comes as an option in the RDBMS: no additional costs

Content: Oracle Text

Oracle Text
- Enables the building of a Text Query Application and a Document Classification Application
- Takes care of: indexing; searching (word and theme); viewing text
- Uses standard SQL

CONTEXT Index Creation
CREATE INDEX index_name ON table_name(column_name)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS('parameters string');
The parameters string may contain:
  [datastore datastore_pref]
  [filter filter_pref]
  [charset column charset_column_name]
  [format column format_column_name]
  [lexer lexer_pref]
  [language column language_column_name]
  [wordlist wordlist_pref]
  [storage storage_pref]
  [stoplist stoplist]
  [section group section_group]
  [memory memsize]
  [populate | nopopulate]

Types of indexes
- CONTEXT
  - Column types: CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML
  - Query operator: CONTAINS
  - Used for: large coherent documents
  - Characteristics: on a text column; the most complete of all 3 types
- CTXCAT
  - Column types: CHAR, VARCHAR2
  - Query operator: CATSEARCH
  - Used for: indexing small text fragments and related information; improving mixed query performance
  - Characteristics: combined index on a text column and one or more other columns. Transactional: no need for synchronizing after DML. Creation can take longer because of the sub-indexes. Supports: INDEX SET, LEXER*, STOPLIST, STORAGE, WORDLIST*. Has its own query language.
- CTXRULE
  - Column types: VARCHAR2, CLOB
  - Query operator: MATCHES
  - Used for: building a document classification application
  - Characteristics: on a column containing a set of queries. Supports: LEXER (only BASIC). Does not support a number of operators.

Index Maintenance & Optimization
ALTER INDEX index_name REBUILD [ONLINE] [PARAMETERS('parameters string')];
Examples:
ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize fast');
ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize full maxtime 10');
ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize full');

Index Maintenance & Optimization
- ALTER INDEX index_name REBUILD [ONLINE] [PARAMETERS('parameters string')];
- CTX_DDL package: CTX_DDL.OPTIMIZE_INDEX, CTX_DDL.SYNC_INDEX

DML processing
- INSERT: a new row is inserted in the DR$PENDING queue and is not available for query before synchronization.
- UPDATE: the existing ROWID is placed in DR$PENDING; neither the new nor the old content is available for query before synchronization.
- DELETE: the row is immediately unavailable for query (marked as invalid), but is only removed when optimization is complete.
- CTX_USER_PENDING (CTX_PENDING) view: to check records waiting for synchronization.
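
The DML rules above can be pictured with a small toy model. This is only an illustrative sketch in Python, not Oracle's implementation: the class `ToyTextIndex` and all of its attribute and method names are hypothetical, and the "index" is a plain dictionary rather than an inverted index.

```python
class ToyTextIndex:
    """Toy model of the DML behaviour described above (hypothetical names)."""

    def __init__(self):
        self.rows = {}        # rowid -> current text in the base table
        self.indexed = {}     # rowid -> text as currently seen by the index
        self.pending = []     # rowids waiting for synchronization
        self.deleted = set()  # rowids marked invalid, awaiting optimization

    def insert(self, rowid, text):
        # New rows only enter the pending queue; they are not searchable yet.
        self.rows[rowid] = text
        self.pending.append(rowid)

    def update(self, rowid, text):
        # Old content disappears at once, new content waits for sync.
        self.rows[rowid] = text
        self.indexed.pop(rowid, None)
        self.pending.append(rowid)

    def delete(self, rowid):
        # The row is hidden immediately, but its entry stays until optimize().
        del self.rows[rowid]
        self.deleted.add(rowid)

    def sync(self):
        # Index every pending row that still exists in the base table.
        for rowid in self.pending:
            if rowid in self.rows:
                self.indexed[rowid] = self.rows[rowid]
        self.pending.clear()

    def optimize(self):
        # Physically drop the entries of deleted rows.
        for rowid in self.deleted:
            self.indexed.pop(rowid, None)
        self.deleted.clear()

    def search(self, word):
        # Return rowids whose indexed text contains the word.
        return sorted(
            rowid
            for rowid, text in self.indexed.items()
            if rowid not in self.deleted and word in text.lower().split()
        )
```

In this model a freshly inserted row is found only after `sync()`, an updated row is found under neither its old nor its new text until the next `sync()`, and a deleted row disappears from results at once while its entry lingers until `optimize()`.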

Scoring
"To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole." (Oracle Text Reference, Release)
Example:
- In the data set: M occurrences of TERM1, N occurrences of TERM2, with M >> N
- A document has an equal number (n) of occurrences of TERM1 and TERM2
- Result: SCORE for querying TERM1 < SCORE for querying TERM2
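
The quoted rule can be made concrete with a short calculation. The sketch below is in Python and assumes the form of the formula given in the scoring appendix of the Oracle Text documentation, score = 3f(1 + log10(N/n)), where f is the frequency of the term in the document, N the number of rows in the table and n the number of rows containing the term. The function name, the rounding and the cap at 100 are illustrative assumptions, not Oracle's exact implementation.

```python
import math


def inverse_frequency_score(f, total_docs, docs_with_term):
    """Sketch of an inverse-frequency score: 3f(1 + log10(N/n)), capped at 100.

    f              -- occurrences of the term in the scored document
    total_docs     -- N, number of documents (rows) in the set
    docs_with_term -- n, number of documents containing the term (must be > 0)
    """
    raw = 3 * f * (1 + math.log10(total_docs / docs_with_term))
    # Rounding to an integer and capping at 100 are assumptions of this sketch.
    return min(100, round(raw))


# Same number of occurrences in the document, very different spread in the set:
score_common = inverse_frequency_score(f=5, total_docs=1000, docs_with_term=500)
score_rare = inverse_frequency_score(f=5, total_docs=1000, docs_with_term=5)
```

With equal occurrences in the document, the term that is rare in the document set receives the higher score, which is the behaviour described in the example above.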

Query Operators
- Boolean: OR, NOT, MINUS, AND. Example: lhc AND magnet AND NOT cryogenic
- Linguistics: SYNonym, ABOUT, STEM, Translation Term, Broader / Narrower / Preferred / Related Term. Examples: SYN (science), ABOUT (particle)
- Others: FUZZY, NEAR, SOUNDEX, WITHIN, SQE. Example:
  begin
    ctx_query.store_sqe('particle', 'atom, molecule proton');
  end;
  'SQE (particle)'

CTX packages
- CTX_ADM: administer servers and the data dictionary (only the ctxsys user)
- CTX_DDL: create and manage preferences, section groups and stoplists; manage indexes
- CTX_DOC: document presentation features (only for CONTEXT indexes)
- CTX_OUTPUT: manage logs for the indexes
- CTX_QUERY: generate query feedback, count hits and create SQEs (stored query expressions)
- CTX_THES: manage and browse the thesaurus

Content: How we profit from this technology

EDMS metadata index preferences

EDMS search for both languages
Example search terms: accelerateur, lhc, méthode (Version 1.5)

Escaping characters to query them
- To query on reserved words or symbols such as "minus", "-" or "near", they must be escaped.
- There are 2 methods to escape a character: using "{}" or "\".
- We had to hardcode it for each symbol and word. A standard "dictionary table" with the reserved characters would be useful.
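
One way to avoid hardcoding each case is to keep the reserved words and characters in a single lookup structure, in the spirit of the "dictionary table" wished for above. The following Python sketch is only an illustration: the function names are hypothetical, and the two sets are a partial, assumed list that should be checked against the Oracle Text documentation for the release in use.

```python
# Partial, illustrative lists; verify against the Oracle Text Reference.
RESERVED_WORDS = {
    "ABOUT", "ACCUM", "AND", "FUZZY", "MINUS",
    "NEAR", "NOT", "OR", "SQE", "SYN", "WITHIN",
}
RESERVED_CHARS = set("-,&=?{}\\()[];~|$!>*%_")


def escape_term(term):
    """Escape one search term so that it is treated as literal text."""
    if term.upper() in RESERVED_WORDS:
        # Method 1: braces turn a reserved word into plain text.
        return "{" + term + "}"
    # Method 2: a backslash in front of every reserved character.
    return "".join("\\" + ch if ch in RESERVED_CHARS else ch for ch in term)


def escape_query(text):
    """Escape every whitespace-separated term of a free-text query."""
    return " ".join(escape_term(term) for term in text.split())
```

For example, `escape_query("magnet near minus")` wraps the two reserved words in braces and leaves `magnet` unchanged, while a document number such as `LHC-Q-EI-0002` gets a backslash before each hyphen.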

EDMS metadata index preferences
It is important to know how users will search the data, and what kind of data you are going to index, before you actually index it.

EDMS Index Maintenance & Optimization: environment
- Hardware & system: two-node cluster based on two Sun SPARC 450, running Solaris, Sun Cluster 2.1
- RDBMS: ~500 MB SGA size
- concurrent users (during working hours)
Figures: 4 000 new documents (monthly); 4 000 document updates (monthly); files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total; 45 files (GB) / total; drawings; test documents; 78

EDMS Index Maintenance & Optimization: metadata
- Index synchronization: every 10 min, takes a few seconds
- Index optimization: every weekend, takes ~30 min

PROCEDURE rebuild_metedata_ctx IS
BEGIN
  EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(''sync'')');
END;

PROCEDURE optimize_metedata_ctx IS
BEGIN
  EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(''optimize full'')');
END;

EDMS Index Maintenance & Optimization
- Synchronize every 24 h?
- Optimize (fast, full) every month?

Scoring
SQL> SELECT c_id, score(10) FROM compound_doc_info WHERE CONTAINS(c_text, 'lhc', 10) > 0 AND c_id = ;
C_ID SCORE(10)
SQL> SELECT c_id, score(10) FROM compound_doc_info WHERE CONTAINS(c_text, 'evolution', 10) > 0 AND c_id = ;
C_ID SCORE(10)

Using the thesaurus
- Propose the RT (Related Term) if nothing is found with the original term(s).
- It would be nice to have a spelling corrector that uses the existing tokens.

DECLARE
  xtab ctx_thes.exp_tab;
  ….
BEGIN
  ctxsys.ctx_thes.rt(xtab, p_term, 'edms_thes');
  FOR i IN 1..xtab.COUNT LOOP
    IF xtab(i).xrel = C_RELETED_TERM THEN
      htp.anchor(
        L_DOC_SEARCH || '?cookie=' || cookie
          || '&p_search_type=' || p_search_type
          || '&p_free_text=' || LOWER(xtab(i).xphrase),
        LOWER(xtab(i).xphrase)
      );
    END IF;
  END LOOP;
END;

Using the thesaurus - example

Querying with Oracle Text versus standard SQL
p_free_text is a single word or an exact sentence.
- Oracle Text
  - Characteristics: fast
  - Statement: WHERE CONTAINS (c_text, p_free_text) > 0
  - Time*: 83 ms (821 ms)
- Standard SQL
  - Characteristics: underperforming
  - Statement: WHERE UPPER(c_text) LIKE '%'||UPPER(p_free_text)||'%'
  - Time*: 03.98 s (39.14 s)
* Tests done with TOra (in parentheses: repeated 10x)

Querying with Oracle Text versus standard SQL
p_free_text is an expression with the OR operator.
- Oracle Text
  - Characteristics: fast
  - Statement: WHERE CONTAINS (c_text, p_free_text) > 0
  - Time*: 103ms (01:03)
- Standard SQL
  - Characteristics: underperforming
  - Statement: WHERE ( UPPER(c_text) LIKE '%'||UPPER(p_text_1)||'%' OR UPPER(c_text) LIKE '%'||UPPER(p_text_2)||'%' )
  - Time*: 09:09 (1:22.09)
* Tests done with TOra (in parentheses: repeated 10x)

Querying with Oracle Text
Totals: 02.36 s, 02.31 s, 02.25 s

Mixed queries
"LHC-Q-EI-0002" is a document number. The search is done on:
1) the document number column, using a standard index
2) the context index

Indexing various file formats
- Formatted documents such as Microsoft Word or PDF have to be filtered.
- The File_format column stores the value "TEXT" or "BINARY".
- INSO_FILTER skips every row with "TEXT" in the format column.
- NULL_FILTER is used for plain text and HTML formats.

Some indexing problems we have
- The creation of an interMedia Text index (with URL_DATASTORE) fails with ORA-4030 (out of process memory).
- After successful indexing of the PDF files (using INSO_FILTER), some are indexed only "partially", without any error being recorded in the error table.
- In June 2002 this was identified to be a memory leak fixed in
- We observe now the same ORA-4030 error with OPS
- Result: it is very difficult to verify whether a document is correctly indexed.

Content: Conclusion

Conclusion
Oracle Text is worth using because of:
- Performance
- Simplicity of the code (integrated with Oracle, no external search engine)
- Simplicity of the index maintenance
- Functional features: bi-lingual support, special query operators, thesaurus
- Document presentation features

EDMS SERVICE
This presentation:
Contact:
Thank you