Database Systems XML, DWH, In-Memory DBMS, IR

Gergely Lukács Pázmány Péter Catholic University Faculty of Information Technology Budapest, Hungary

XML – Extensible markup language

XML: eXtensible Markup Language
Web standard (W3C) for exchanging data.
XML describes the inputs and outputs of many applications (in most cases called services).
Industry has created and supports XML standards for applications, communication protocols, service descriptions, etc.

XML Syntax – XML Element
An element is defined by a pair of corresponding tags, such as <prof> (opening tag) and </prof> (closing tag).
Content of the element: text and other elements (subelements) enclosed between the tags.
Elements can be nested to any depth, but the nesting must be proper: an inner element has to be closed before its enclosing element.
Empty elements: <year></year> can be shortened to <year/>.

XML Syntax – XML Attribute
Name-value pair inside the start tag of an element.
Tied to a specific XML element.
An alternative notation to nested tags.
An element can have multiple attributes, but each attribute name occurs only once.

Advantages
Truly portable data
Easily readable by human users
Very expressive (semantics kept near the data)
Very flexible and customizable (no finite tag set)
Easy to use from programs (libraries available)
Easy to convert into other representations (XML transformation languages)
Many additional standards and tools
Widely used and supported

Well-formed XML document
There must be exactly one root element.
Every start tag has a matching end tag.
Elements may nest, but must not overlap.
An element may not have two attributes with the same name.
Certain characters (for example < and &) have to be represented in a special, escaped way.
…
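The rules above can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, one per rule.
good = '<prof id="1"><name>Ann</name><year/></prof>'
overlapping = "<a><b></a></b>"        # elements overlap instead of nesting
two_roots = "<a/><b/>"                # more than one root element
duplicate_attr = '<a x="1" x="2"/>'   # same attribute name twice
unescaped = "<a>1 < 2</a>"            # '<' must be written as &lt;

for sample in (good, overlapping, two_roots, duplicate_attr, unescaped):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted; each of the others violates exactly one of the listed rules.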

Schema for XML Documents
The tags (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document. (Extensible!)
Schema languages: Document Type Definition (DTD) and XML Schema (XSD); XML Schema 1.0 dates from May 2001.
An XML Schema defines:
elements that can appear in a document
attributes that can appear in a document
which elements are child elements
the order of child elements
the number of child elements
…

XML Schema Example

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="note">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="to" type="xs:string"/>
            <xs:element name="from" type="xs:string"/>
            <xs:element name="heading" type="xs:string"/>
            <xs:element name="body" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

Valid XML files
Valid XML files are well-formed files which have a schema description (DTD, XSD, …) and which conform to it.
Schemas are very important for XML data exchange; otherwise, a site cannot automatically interpret data received from another site.
Libraries exist for checking XML, both for well-formedness and for validity.

XML Model
An XML document can be represented as a directed graph.

XPath (W3C)
An XPath expression returns the collection of nodes that match the pattern specified in it.
The path notation resembles a URL.
XPath also has a number of comparison operators.

XPath: Examples
[Figure: tree of a bookstore document. The bookstore root has book children, each with title, author and price; an author has a name, which may contain firstname and lastname. Sample values: "The Autobiography of ..." (Benjamin Franklin, 8.99) and "The Gorgias" (Plato, 9.99).]
title or ./title: all title elements in the current element.
author/name/firstname: the firstname elements of the name elements of the author elements.
//title: all title elements in the document.
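The three path expressions can be tried out with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. This is only a sketch: the document below is a hypothetical reconstruction of the bookstore tree (the full title text is an assumption), and ElementTree needs the relative form .//title where full XPath would accept //title.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document modelled on the figure.
DOC = """
<bookstore>
  <book>
    <title>The Autobiography of Benjamin Franklin</title>
    <author><name><firstname>Benjamin</firstname><lastname>Franklin</lastname></name></author>
    <price>8.99</price>
  </book>
  <book>
    <title>The Gorgias</title>
    <author><name>Plato</name></author>
    <price>9.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(DOC)
first_book = root.find("book")  # use the first book as the current element

# ./title : title children of the current element
child_titles = [t.text for t in first_book.findall("./title")]

# author/name/firstname : firstname of name of author, relative to the current element
first_names = [f.text for f in first_book.findall("author/name/firstname")]

# .//title : all title elements below the root (ElementTree's spelling of //title)
all_titles = [t.text for t in root.findall(".//title")]

print(child_titles, first_names, all_titles)
```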

XQuery
XQuery is designed to be a small, easily implementable language in which queries are easily understood.
It is also flexible enough to query a broad spectrum of XML information sources, including both databases and documents.
It has both a human-readable query syntax and an XML-based query syntax.

XQuery – Example 1 (input document books.xml)

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <bookstore>
      <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <price>29.99</price>
      </book>
      <book category="WEB">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
      </book>
      <book category="WEB">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>

XQuery – Example 2

    for $x in doc("books.xml")/bookstore/book
    where $x/price>30
    order by $x/title
    return $x/title

Result (books priced above 30, ordered by title):

    <title lang="en">Learning XML</title>
    <title lang="en">XQuery Kick Start</title>

XQuery – FLWOR Expressions
FLWOR ("flower"): "For, Let, Where, Order by, Return".

    for var1 in expr1, ..., varn in exprn
    let varn+1 := exprn+1, ..., varn+m := exprn+m
    where condition
    order by expr ascending/descending
    return xml-expr
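To make the roles of the five clauses concrete, the sketch below mimics a FLWOR expression with a plain Python loop over a shortened copy of the bookstore data. It is an analogy under stated assumptions (Python's xml.etree.ElementTree instead of an XQuery engine, and a trimmed document with only title and price), not how an XQuery processor is implemented.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of books.xml (title and price only).
BOOKS = """<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(BOOKS)

selected = []
for book in root.findall("book"):                  # for $x in /bookstore/book
    price = float(book.findtext("price"))          # let $p := $x/price
    if price > 30:                                 # where $p > 30
        selected.append(book)

selected.sort(key=lambda b: b.findtext("title"))   # order by $x/title
titles = [b.findtext("title") for b in selected]   # return $x/title

print(titles)
```

Note that "Everyday Italian" is excluded because its price is exactly 30.00, which is not greater than 30.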

JSON: JavaScript Object Notation
JSON is a syntax for storing and exchanging text information, much like XML.
JSON is typically more compact than XML, and faster and easier to parse.

    {
      "employees": [
        { "firstName": "John",  "lastName": "Doe" },
        { "firstName": "Anna",  "lastName": "Smith" },
        { "firstName": "Peter", "lastName": "Jones" }
      ]
    }
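As an illustration of how little code parsing takes, here is a minimal sketch that reads the employees example with Python's standard json module; the variable names are illustrative assumptions.

```python
import json

text = """
{ "employees": [
    { "firstName": "John",  "lastName": "Doe" },
    { "firstName": "Anna",  "lastName": "Smith" },
    { "firstName": "Peter", "lastName": "Jones" }
] }
"""

data = json.loads(text)   # JSON objects become dicts, JSON arrays become lists
last_names = [e["lastName"] for e in data["employees"]]
print(last_names)

# Serializing back to text is a single call as well.
compact = json.dumps(data, separators=(",", ":"))
print(len(compact), "characters when written compactly")
```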

Data warehouse

Data Warehouse
"In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons." (Wikipedia)
Also used in non-traditional areas:
environmental monitoring
research
biomedical applications

Data warehouse architecture
Figure 11-2: Generic two-level architecture (with the E, T and L steps: extract, transform, load)
One company-wide warehouse
Periodic extraction → data is not completely current in the warehouse
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing

ETL process, example

Data cube/OLAP Cube

Multidimensional Data cube
Facts: numeric measures (revenue, amount sold, …)
Dimensions: (time period, area, product group, …)
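The relationship between facts and dimensions can be sketched as a roll-up: the fact (revenue) is summed over whichever dimensions are kept. The rows, the dimension names and the helper roll_up below are hypothetical illustrations, not data from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product_group, revenue)
FACTS = [
    (2023, "North", "Books", 120.0),
    (2023, "North", "Games", 80.0),
    (2023, "South", "Books", 50.0),
    (2024, "North", "Books", 70.0),
    (2024, "South", "Games", 30.0),
]
DIMENSIONS = ("year", "region", "product_group")


def roll_up(facts, keep):
    """Sum the revenue fact, grouped by the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]   # last column is the numeric measure
    return dict(totals)


by_year = roll_up(FACTS, ["year"])
by_year_region = roll_up(FACTS, ["year", "region"])
grand_total = roll_up(FACTS, [])   # no dimensions kept: one overall cell

print(by_year)
print(by_year_region)
print(grand_total)
```

Each call corresponds to one face or edge of the cube; keeping fewer dimensions gives coarser aggregates.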

Star Schema

Requirements in OLTP and in OLAP
OLTP:
Highly normalized, large number of tables
Read/Write
Few records (tens)
Standardized, simple queries

OLAP:
De-normalized, fewer tables
Read-only (batch update)
Large number of records (millions), aggregations
Ad-hoc queries, aggregations

DWH optimisations:
Bitmap index
Materialized view (with query rewrite!)
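The idea behind a bitmap index can be sketched in a few lines: for each distinct value of a column, keep one bit per row, and answer AND/OR conditions with bitwise operations. The sketch below uses Python integers as bitmaps; the column data and function names are hypothetical, and real DBMS implementations add compression and other refinements.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i holds that value."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """Decode a bitmap back into the list of matching row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality columns of a fact table (row id = list position).
region = ["North", "South", "North", "East", "South"]
product = ["Books", "Books", "Games", "Books", "Games"]

region_idx = build_bitmap_index(region)
product_idx = build_bitmap_index(product)

# WHERE region = 'North' AND product = 'Books'
north_books = rows_of(region_idx["North"] & product_idx["Books"])

# WHERE region = 'South' OR region = 'East'
south_or_east = rows_of(region_idx["South"] | region_idx["East"])

print(north_books, south_or_east)
```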

Column-oriented DBMS
Column-oriented: aggregates over many rows and few columns; compression …
Row-oriented: retrieving or changing few records with many attributes
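A small sketch of the two layouts, using plain Python structures as a stand-in for storage pages. The table contents and the run-length encoding helper are hypothetical; they only illustrate why aggregates and compression favour columns while whole-record access favours rows.

```python
# Row-oriented: one record per entry, all attributes stored together.
rows = [
    {"id": 1, "name": "Ann", "city": "Budapest", "amount": 10.0},
    {"id": 2, "name": "Bob", "city": "Budapest", "amount": 20.0},
    {"id": 3, "name": "Eve", "city": "Vienna", "amount": 5.0},
]

# Column-oriented: one list per attribute, values of a column stored together.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Aggregate over many rows and one column: touches a single list.
total_amount = sum(columns["amount"])

# Fetching one full record: direct in the row layout,
# but needs one lookup per column in the column layout.
record_from_rows = rows[1]
record_from_columns = {key: col[1] for key, col in columns.items()}


def run_length_encode(values):
    """Compress runs of equal neighbouring values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


# Similar values sit next to each other in a column, so they compress well.
city_runs = run_length_encode(columns["city"])
print(total_amount, record_from_columns, city_runs)
```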

(Zsófia Sípos, thesis)

https://demos.devexpress.com/ASPxPivotGridDemos/ PivotGrid

SAP HANA (High-Performance Analytic Appliance)
In-memory database

Information Retrieval

Information Retrieval
Information Retrieval (IR) is finding material of an unstructured nature that is relevant to the user’s information need and helps the user complete a task from within large collections (usually stored on computers).

Unstructured data
Text (documents)
Images
Audio data
Video
…?

Unstructured (text) vs. structured (database) data in 2009

Goal, evaluation
Task: given an information need, find the "relevant" documents (objects).
Evaluation (for Boolean retrieval)
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

                   Relevant   Nonrelevant
    Retrieved      tp         fp
    Not retrieved  fn         tn
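In terms of the contingency table, precision is tp / (tp + fp) and recall is tp / (tp + fn). A minimal sketch of the computation follows; the document ids are made up for illustration.

```python
def precision_recall(retrieved: set, relevant: set):
    """Compute (precision, recall) from sets of document ids."""
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents retrieved, 5 relevant, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here tp = 2, fp = 2 and fn = 3, so precision is 2/4 and recall is 2/5.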

Boolean retrieval model
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
In the Boolean retrieval model, a query is a Boolean expression: Boolean queries use AND, OR and NOT to join query terms.
The model views each document as a set of words; a document either matches the condition or not.
It is perhaps the simplest model to build an IR system on.
Some search systems you still use are Boolean: library catalog, …
Grep is line-oriented; IR is document-oriented.
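A query such as Brutus AND Caesar AND NOT Calpurnia maps directly onto set operations over an inverted index (term → set of documents). The following sketch uses three made-up toy documents, not the actual plays, and a deliberately naive whitespace tokenizer.

```python
# Hypothetical toy collection (not real play texts).
DOCS = {
    "doc1": "Brutus spoke with Caesar and Calpurnia",
    "doc2": "Brutus praised Caesar",
    "doc3": "Caesar met Cleopatra",
}


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(DOCS)
all_docs = set(DOCS)

# Brutus AND Caesar AND NOT Calpurnia
result = (
    index.get("brutus", set())
    & index.get("caesar", set())
    & (all_docs - index.get("calpurnia", set()))
)
print(sorted(result))
```

Each document either satisfies the expression or not; there is no notion of one match being better than another.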

Text Preprocessing

Tokenization
Input: "Friends, Romans, Countrymen"
Output: the tokens Friends, Romans, Countrymen
A token is a sequence of characters in a document.
Open questions:
Uppercase vs. lowercase
Hewlett-Packard → Hewlett and Packard as two tokens?
Dates such as Mar. 12, 1991
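A deliberately simple tokenizer makes the open questions visible. This sketch splits on anything that is not a word character, using Python's standard re module; it is one possible choice among many, and the function name is an assumption.

```python
import re


def tokenize(text, lowercase=False):
    """Split text into maximal runs of word characters (letters, digits, underscore)."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


print(tokenize("Friends, Romans, Countrymen"))
print(tokenize("Friends, Romans, Countrymen", lowercase=True))
print(tokenize("Hewlett-Packard"))   # the hyphen splits the name into two tokens
print(tokenize("Mar. 12, 1991"))     # the date falls apart into three tokens
```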

Lemmatization
Reduce inflectional/variant forms to the base form, e.g.:
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing "proper" reduction to the dictionary headword form.

Stemming
Reduce terms to their "roots" before indexing.
"Stemming" suggests crude affix chopping.
It is language dependent.
E.g., automate(s), automatic, automation are all reduced to automat.
Original text: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compress and compress ar both accept as equival to compress

Spell correction
Two principal uses:
Correcting document(s) being indexed
Correcting user queries to retrieve "right" answers
Two main flavors:
Isolated word: check each word on its own for misspelling. This will not catch typos resulting in correctly spelled words, e.g., from → form.
Context-sensitive: look at surrounding words, e.g., I flew form Heathrow to Narita.
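Isolated-word correction is often sketched as "pick the vocabulary word with the smallest edit distance". The example below uses Levenshtein distance and a tiny hypothetical vocabulary; it is one common technique, not necessarily the one the slides had in mind, and it also demonstrates the stated limitation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


# Hypothetical vocabulary of correctly spelled words.
VOCAB = ["from", "form", "flew", "heathrow", "narita"]


def suggest(word: str) -> str:
    """Return the vocabulary word closest to `word`."""
    return min(VOCAB, key=lambda v: edit_distance(word, v))


print(suggest("heathrw"))   # a real typo is corrected
print(suggest("form"))      # a valid word is left alone, even if "from" was meant
```

Because "form" is itself in the vocabulary, the isolated-word approach cannot flag it; only a context-sensitive method can.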

Thesauri
Do we handle synonyms and homonyms?
E.g., by hand-constructed equivalence classes:
car = automobile
color = colour
We can rewrite to form equivalence-class terms: when the document contains automobile, index it under car-automobile (and vice versa).
Or we can expand a query: when the query contains automobile, look under car as well.

Soundex
Class of heuristics to expand a query into phonetic equivalents
Language specific, mainly for names
E.g., chebyshev → tchebycheff
Invented for the U.S. census … in 1918

Ranked retrieval

Problem with Boolean search: feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" → 200,000 hits
Query 2: "standard user dlink 650 no card found" → 0 hits
Writing good Boolean queries takes experience and good knowledge of the items searched.
Cf. our discussion of how Westlaw Boolean queries didn't actually outperform free text querying.

Ranked retrieval models
Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query.
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries and vice versa.

Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
Assign a score, say in [0, 1], to each document with respect to a query.
This score measures how well document and query "match".

Bag of words model
The vector representation doesn't consider the ordering of words in a document.
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
This is called the bag of words model.

Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
Relevance does not increase proportionally with term frequency, so the raw count is flattened using some dampening function.

Document frequency
Rare terms are more informative than frequent terms.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
We will use document frequency (df) to capture this: the number of documents that contain t.
Inverse: the higher df is, the smaller its weight should be.
Dampening/flattening: log

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
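One common textbook instantiation, given here as an assumption since the slides do not fix the exact dampening functions, uses a base-10 logarithm for both factors, with N the number of documents in the collection:

```latex
\[
  w_{t,d} \;=\; \bigl(1 + \log_{10} \mathrm{tf}_{t,d}\bigr) \cdot \log_{10}\frac{N}{\mathrm{df}_t}
  \qquad \text{for } \mathrm{tf}_{t,d} > 0, \quad w_{t,d} = 0 \text{ otherwise.}
\]
% Worked example with made-up numbers:
% N = 1{,}000{,}000,\ \mathrm{df}_t = 1{,}000,\ \mathrm{tf}_{t,d} = 10
% \Rightarrow w_{t,d} = (1 + 1) \cdot 3 = 6
```

The first factor grows with the number of occurrences in the document, the second with the rarity of the term in the collection.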

Weight matrix
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
High-dimensional space (one dimension per term), sparse matrix.

Queries as vectors
Key idea 1: Represent queries as vectors in the same space.
Key idea 2: Rank documents according to their proximity to the query in this space.

Proximity
Euclidean distance? A bad idea, because Euclidean distance is large for vectors of different lengths.
Angle between vectors: rank documents according to their angle with respect to the query vector.
The following two are equivalent: ranking documents in increasing order of the angle between query and document, and ranking documents in decreasing order of cosine(query, document).
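For reference, the cosine of the angle between a query vector and a document vector is the standard normalized dot product; a smaller angle gives a larger cosine, so the best match has the largest cosine.

```latex
\[
  \cos(\vec{q}, \vec{d})
  \;=\; \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
  \;=\; \frac{\sum_{i=1}^{|V|} q_i \, d_i}
             {\sqrt{\sum_{i=1}^{|V|} q_i^{2}} \; \sqrt{\sum_{i=1}^{|V|} d_i^{2}}}
\]
```

Dividing by the vector lengths is what removes the document-length effect that makes plain Euclidean distance unsuitable.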

Summary – vector space ranking
Represent each document as a weighted tf-idf vector.
Represent the query as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
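The four steps can be sketched end to end in a few lines. The collection, the whitespace tokenization and the specific weighting (1 + log10 tf) · log10(N/df) are assumptions chosen for illustration; production systems such as Lucene differ in many details.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "car insurance auto insurance",
    "d2": "best car repair",
    "d3": "auto repair shop",
}

tokenized = {d: text.split() for d, text in DOCS.items()}
N = len(DOCS)
# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized.values() for term in set(tokens))


def tf_idf_vector(tokens):
    """Sparse tf-idf vector (term -> weight); terms unseen in the collection are dropped."""
    return {
        term: (1 + math.log10(tf)) * math.log10(N / df[term])
        for term, tf in Counter(tokens).items()
        if term in df
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_vectors = {d: tf_idf_vector(tokens) for d, tokens in tokenized.items()}


def rank(query):
    """Return document ids ordered by decreasing cosine similarity to the query."""
    q = tf_idf_vector(query.split())
    scores = {d: cosine(q, v) for d, v in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


ranking, scores = rank("car insurance")
print(ranking)
```

Unlike the Boolean model, every document receives a score, so partially matching documents still appear, just lower in the list.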

Evaluating ranked results: precision-recall curve

Relevant products

Apache Lucene
Text search engine library
High-performance, full-featured, written entirely in Java
Suitable for nearly any application that requires full-text search, especially cross-platform

Apache Lucene
Scalable, High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements: only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching: best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)
Cross-Platform Solution

Oracle Text
Full-text search, integrated in Oracle
Available in all Oracle editions, free of charge

    create table texttabelle (
      id       number(10),
      dokument clob
    );

    create sequence seq_texttabelle;

    insert into texttabelle values (seq_texttabelle.nextval, 'A-Partei gewinnt Wahl in Hansestadt');
    insert into texttabelle values (seq_texttabelle.nextval, 'Terror in Nahost: Kriminalität steigt immer weiter an');
    insert into texttabelle values (seq_texttabelle.nextval, 'Wirtschaft: Erneuter Gewinnzuwachs in diesem Jahr');
    insert into texttabelle values (seq_texttabelle.nextval, 'Olympia rückt näher: Der Fackellauf ist in vollem Gange');

Oracle Text
Search with AND

    SQL> select * from texttabelle where contains(dokument, 'Papst and Skandal')>0;
    ID DOKUMENT
     6 Papst bestürzt über jüngsten Skandal!

Word stem search

    SQL> select * from texttabelle where contains(dokument, '$lesen')>0;
    ID DOKUMENT
    10 Der Papst liest seine erste Messe in den USA!

Oracle Text
Fuzzy operator

    SQL> select * from texttabelle where contains(dokument, '?Wahlkrampf')>0;
    ID DOKUMENT
     5 Wer wird US-Präsident? Obama und Clinton machen Wahlkampf
     7 Wahlkampf in den USA geht weiter: Clinton und Obama ...

Near

    SQL> select * from texttabelle
         where contains(dokument, 'NEAR((Clinton, Wahlkampf),2)')>0;
