Presentation is loading. Please wait.

Presentation is loading. Please wait.

Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in OIMT Benchmarking, methodology and problems Results Conclusions.

Similar presentations


Presentation on theme: "Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in OIMT Benchmarking, methodology and problems Results Conclusions."— Presentation transcript:

1 Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in OIMT Benchmarking, methodology and problems Results Conclusions ODF 000419 Benchmarking Oracle8i Intermedia Text

2 Background for this benchmark The task of the thesis project The EDMS Search Engine CERN’s EDMS CADIM/EDB, managing product data documents MP5, managing physical components Oracle 7, database platform Implement Oracle8i in the future? ODF 000419 Benchmarking Oracle8i Intermedia Text

3 Oracle 8i Intermedia A product embedded in Oracle8i Intermedia allows a unified technique for accessing various types of data such as:  Text  Documents  Images  Audio  Video ODF 000419 Benchmarking Oracle8i Intermedia Text

4 Features in Intermedia Text ODF 000419 Benchmarking Oracle8i Intermedia Text Database integration Mixed queries against multiple text columns Single SQL API Indexes on most database columns All index data in the database Automatic triggers on textcolumns to detect changes Full text search Exact Word and phrase search, operators, removing stopwords XML and HTML document section searching

5 Features in Intermedia Text ODF 000419 Benchmarking Oracle8i Intermedia Text About Search (All languages) Parses any text search to perform an optimal search Complements full-text search Theme Identification (English only) Identifies strong themes in documents using a ”Theme base” ”Thematic” information is put into the index Themesearch is available via the about operator May be customized for specific terminology

6 Features in Intermedia Text ODF 000419 Benchmarking Oracle8i Intermedia Text Document services View documents as plain text or HTML format Store documents in a database, file system or at an URL adress Multilingual text search Full-text search in most languages including Japanese, Chinese and Korean Support for all Oracle-NLS character sets Stemming and Fuzzy search for Dutch, English, French, German and Spanish Base-letter support and alternate spelling for Western European languages

7 Features in Intermedia Text To be investigated: The text indexing technique The new query operators How to use this in the EDMS Search Engine ODF 000419 Benchmarking Oracle8i Intermedia Text

8 OIMT Textindexes An OIMT textindex can be created on: varchar2 columns Large object (LOB) columns B-file columns, indexing entire files Allowing queries like: Select id from table where contains(column,’Atlas’)>0; 1 One of the detectors in the LHC ring is ATLAS 2 The Atlasmountains are situated in the north of Africa -> 1 … contains(column,’Atlas inner detector’)>0; 1 The problems with the ATLAS inner detector... 2 The inner detector of ATLAS... -> 1 Query optimization Selects the best executing plan based on analyze table…compute statistics; ODF 000419 Benchmarking Oracle8i Intermedia Text

9 OIMT Textindexes An OIMT textindex is an ”inverted” index 1 the LHC accelerator 2 the LEP accelerator -> accelerator: row #1 position 2, row #2 position 2 LHC:row #1 position 1 LEP:row #2 position 1 A textindex is built up by five objects Four tables: I,K,N,R and one b-tree index: X Create OIMT textindex statement create index myindex on table(column) indextype is ctxsys.context; Note that table must have a primary key Updating, two choices Automaticly by starting ctxsrv and commit Rebuild ”manually” ODF 000419 Benchmarking Oracle8i Intermedia Text

10 OIMT Textindexes Datastore Filter Sectioner Lexer The indexing ”pipeline” Loops over the rows and reads data out of the column Or from remote servers, accessed via http or ftp via pointers Transformns the data into text representation Output can be in HTML or XML Takes the output from the filter and converts it to plain text Different sectioner for different formats Detects important section tags Separates text into tokens and words Remove stopwords ODF 000419 Benchmarking Oracle8i Intermedia Text

11 OIMT Textindexes ODF 000419 Benchmarking Oracle8i Intermedia Text The preference system allows customization of textindexes Classes of ”customizable” objects: DATASTORE, FILTER, SECTION_GROUP, LEXER, WORDLIST, STOPLIST, STORAGE Create a preference to customize an OIMT textindex : execute ctx_ddl.create_preference(’my_pref’, ’BASIC_LEXER’); execute ctx_ddl.set_attribute(’my_pref’, ’INDEX_THEMES’, ’YES’); create index my_index on table(column) indextype is ctxsys.context parameters(’LEXER my_pref’); Default preferences Unset preferences get their value from the default system

12 OIMT Query operators Scoring operators: WEIGHT(*), THRESHOLD(>), ACCUM(,), MINUS(-) Examples: …contains(column, ’(edms*2) AND cms’)>10; …contains(column, ’edms, cms’)>0; edms+cms scores higher than each word alone Word expansion operators: WILDCARD(%), FUZZY(?), STEM($), SOUNDEX(!), EQUIV(=) Examples: …contains(column, ’?mignets’)>0; Fuzzy correct missspellings 1 The magnets of the LHC accelerator… -> 1 …contains(column, ’$go’)>0; Stem considers e.g. go, went, gone as the ”same” word 1 I will go to the cinema. as well as plurals, magnet=magnets 2 I went back home. 3 The train has gone. -> 1,2,3 …contains(column, ’!dog’)>0; Soundex retrieves all word which sounds alike 1 I have a dog. 2 Someone has dug a hole. -> 1,2 ODF 000419 Benchmarking Oracle8i Intermedia Text

13 Proximity operator: NEAR(;) Examples: …contains(column, NEAR((edms,cms),4, TRUE))>0; 1 EDMS is available for CMS, ATLAS, LHCb... 2 The EDMS at CERN manages all product data documents of CMS -> 1 Section limiting and theme operators: WITHIN, ABOUT about is used as a ”general” operator, optimizing the query by including stem($) and theme search if available Thesaurus operator: SYN Examples: SYN(dog) == {boxer} | {rotweiler} | {terrier} Boolean operators: AND, OR, NOT OIMT Query operators ODF 000419 Benchmarking Oracle8i Intermedia Text

14 OIMT Benchmarks Is SELECT…CONTAINS(column,’word’)>0; faster than SELECT…LIKE(’%word%’); ? How do indexes on entire files perform? What is the prize in terms of memory storage, maintenance? Still fast retrieval? Is updating flexible? Do the query operators perform as expected? ODF 000419 Benchmarking Oracle8i Intermedia Text

15 Comparing CONTAINS and LIKE, retrieval times Using OIMT’s CONTAINS Fulltablescan using LIKE, as in EDMS Search today Y-axis: Retrieval time [s] X-axis: Tablesize [# of rows] ODF 000419 Benchmarking Oracle8i Intermedia Text

16 Comparing CONTAINS and LIKE, retrieval times Using OIMT’s CONTAINS Fulltablescan using LIKE, as in EDMS Search today Y-axis: Retrieval time [s] X-axis: Tablesize [# of rows] ODF 000419 Benchmarking Oracle8i Intermedia Text

17 Indexing entire files Tests with smaller amounts of files (<1000) works fine Encountered heavy problems when indexing up to 9000 files: Security bug A user gets the database owner’s OS privileges when accessing files through FILE DATASTORE. Special roles will be introduced in the 8.1.7 version to manage this problem. Not indexing certain Excel files, bug Said to be fixed in the 8.1.7 version Storage problems, tablespace, temp, shared pool, lobsegments A OIMT textindex is complex, several storage parameters have to be extended, which was non-trivial. Unrelevant errormessages Somewtimes when problems occur, unrelevant errormessages or no errormessages at all are returned, making troubleshooting difficult. ODF 000419 Benchmarking Oracle8i Intermedia Text

18 Indexing entire files Statistics about the 9000 file index: Total indexsize:1.0 GB Number of files:8942 Accumulated filesize:4.4 GB Tot indexsize/acc filesize:23.6 % Average indexsize per file:115.8 KB/file Average filesize:490.1 KB/file Creation time:7:06 hours Creation time per file:2.9 seconds/file Updating: Works fine, 2 seconds/inserted row Quering: 1-3 seconds/query (depends on the query!) Machine:SUN 4CPU 300MHz 1.5 GB RAM ODF 000419 Benchmarking Oracle8i Intermedia Text

19 A view of the testtable ODF 000419 Benchmarking Oracle8i Intermedia Text Index created on this file reference column

20 Conclusions ODF 000419 Benchmarking Oracle8i Intermedia Text Indexing entire files works, but is not trivial to do in version 8.1.6, to be improved in 8.1.7? Quering is fast, both for indexed varchar2 columns and filecolumns Updating textindexes can be done automatic Index preferences may be customized Multilingual Textindexes Query Operators Works mainly as expected, providing powerful tools for an advanced Search Engine Wildcards a very slow when executed on fileindexed columns O8i IMT A very interesting ”platform” for the future EDMS Search Engine Will provide tools for fast full-text searches, with both simple and advanced queries

21 Application interfaces are non-trivial to create Interface important to make userfriendly Many hits may be retrieved from full-text searches, limiting these may be crucial Some ideas: Ask the users for hints Keep the interface as simple as possible, avoid too many graphical objects etc A simple and an advanced search option Menus for choosing query operators, scoring etc The EDMS Search Engine Interface ODF 000419 Benchmarking Oracle8i Intermedia Text

22 An OIMT Test Search Engine: 9000 indexed files from the EDMS production database http://oraweb01.cern.ch:9000/anders/owa/edms_extended_search.enter_search


Download ppt "Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in OIMT Benchmarking, methodology and problems Results Conclusions."

Similar presentations


Ads by Google