Published by Philippa Arleen Smith. Modified over 7 years ago.
1
Database Systems XML, DWH, In-Memory DBMS, IR
Gergely Lukács Pázmány Péter Catholic University Faculty of Information Technology Budapest, Hungary
2
XML – Extensible markup language
4
XML
eXtensible Markup Language
Web standard (W3C) for exchanging data:
  XML describes the inputs and outputs of many applications (in most cases called services)
  Industry has created and supports XML standards for applications, communication protocols, service descriptions, etc.
5
XML Syntax – XML Element
Object is defined by a pair of corresponding tags, like <prof> (opening tag) and </prof> (closing tag)
Content of the element: text and other elements (subelements) included between the tags
Elements can be nested (no depth restrictions)
Proper nesting!
Empty elements: <year></year> can be shortened to <year/>
6
XML Syntax – XML Attribute
Name-value pair inside the start tag of an element
Tied to a specific XML element
Alternative notation to nested tags
An element can have multiple attributes, but each occurs only once
7
Advantages
Truly portable data
Easily readable by human users
Very expressive (semantics near data)
Very flexible and customizable (no finite tag set)
Easy to use from programs (libraries available)
Easy to convert into other representations (XML transformation languages)
Many additional standards and tools
Widely used and supported
8
Well-formed XML document
There must be exactly one root element.
Every start tag has a matching end tag.
Elements may nest, but must not overlap.
An element may not have two attributes with the same name.
Specific characters in XML have to be represented in a special way.
…
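The rules above can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample documents are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule listed on the slide.
samples = {
    "ok": "<prof><name>Ada</name><year/></prof>",
    "two roots": "<a/><b/>",
    "missing end tag": "<prof><name>Ada</prof>",
    "overlapping": "<a><b></a></b>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "escaped character": "<p>1 &lt; 2</p>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label:20} {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first and the last sample are accepted; the parser rejects the other four.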
9
Schema for XML Documents
The tags (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document. (Extensible!)
Document Type Definition (DTD)
XML Schema (XSD), XML Schema 1.0 (May 2001), defines
  elements that can appear in a document
  attributes that can appear in a document
  which elements are child elements
  the order of child elements
  the number of child elements
  …
10
XML Schema Example
<?xml version="1.0"?>
<xs:schema xmlns:xs="
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
11
Valid XML files
Valid XML files are well-formed files which have a schema description (DTD, XSD, …) and which conform to it.
Schemas are very important for XML data exchange: otherwise, a site cannot automatically interpret data received from another site.
Libraries are available for checking XML, including validity.
12
XML Model
XML can be represented as a directed graph
13
XPath (W3C)
An XPath expression returns a collection of element nodes that match the pattern specified in it.
The notation is similar to a URL path.
XPath has a number of comparison operators.
14
XPath: Examples
[Figure: tree of a bookstore document. The bookstore has two book elements, each with title, author and price: "The Autobiography of ..." (author first-name "Benjamin", last-name "Franklin", price 8.99) and "The Gorgias" (author name "Plato", price 9.99).]
title or ./title: all title elements in the current element
author/name/firstname: the firstname elements of the name elements of the author elements
//title: all title elements in the document
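The three path expressions can be tried with the limited XPath subset that Python's `xml.etree.ElementTree` supports. This is only a sketch: the sample document is a hypothetical stand-in for the slide's figure, with element names chosen so that `author/name/firstname` matches, and ElementTree needs `.//title` where full XPath would accept `//title`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modelled on the bookstore tree.
doc = ET.fromstring("""
<bookstore>
  <book>
    <title>The Gorgias</title>
    <author><name>Plato</name></author>
    <price>9.99</price>
  </book>
  <book>
    <title>A Sample Autobiography</title>
    <author>
      <name><firstname>Benjamin</firstname><lastname>Franklin</lastname></name>
    </author>
    <price>8.99</price>
  </book>
</bookstore>
""")

books = doc.findall("book")

# "title" or "./title": title children of the current element (here: the first book)
titles_in_first_book = [t.text for t in books[0].findall("./title")]

# "author/name/firstname": evaluated relative to each book element
first_names = [n.text for b in books for n in b.findall("author/name/firstname")]

# "//title": all title elements in the document (".//title" in ElementTree)
all_titles = [t.text for t in doc.findall(".//title")]

print(titles_in_first_book, first_names, all_titles)
```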
15
XQuery
XQuery is designed to be a small, easily implementable language in which queries are easily understood.
It is also flexible enough to query a broad spectrum of XML information sources, including both databases and documents.
It has a human-readable query syntax and an XML-based query syntax.
16
XQuery – Example 1
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
17
XQuery – Example 2
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title

Result:
<title lang="en">Learning XML</title>
<title lang="en">XQuery Kick Start</title>
18
XQuery – FLWOR Expressions
FLWOR ("flower"): "For, Let, Where, Order by, Return"
for var1 in expr1, ..., varn in exprn
let varn+1 := exprn+1, ..., varn+m := exprn+m
where condition
order by expr ascending/descending
return xml-expr
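To make the role of each FLWOR clause concrete, here is a rough Python analogue, not XQuery itself. It runs the "price > 30, ordered by title" query over an abridged copy of the bookstore data (only title and price are kept, which is an assumption made for brevity); each Python line is annotated with the clause it mimics.

```python
import xml.etree.ElementTree as ET

# Abridged bookstore data: titles and prices only.
books_xml = """
<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""
bookstore = ET.fromstring(books_xml)

selected = []
for book in bookstore.findall("book"):          # for $x in .../bookstore/book
    price = float(book.findtext("price"))       # let $p := $x/price
    if price > 30:                              # where $p > 30
        selected.append(book)
selected.sort(key=lambda b: b.findtext("title"))    # order by $x/title
titles = [b.findtext("title") for b in selected]    # return $x/title

print(titles)
```

The book priced at exactly 30.00 is excluded because the condition is a strict comparison.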
20
JSON: JavaScript Object Notation
JSON is a syntax for storing and exchanging text information, much like XML.
JSON is smaller than XML, and faster and easier to parse.
{
  "employees": [
    { "firstName": "John", "lastName": "Doe" },
    { "firstName": "Anna", "lastName": "Smith" },
    { "firstName": "Peter", "lastName": "Jones" }
  ]
}
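As an illustration of how little code parsing takes, the following sketch reads the employees document with Python's standard `json` module; the variable names are illustrative.

```python
import json

text = """
{ "employees": [
    { "firstName": "John",  "lastName": "Doe" },
    { "firstName": "Anna",  "lastName": "Smith" },
    { "firstName": "Peter", "lastName": "Jones" }
] }
"""

data = json.loads(text)                       # parse into dicts and lists
last_names = [e["lastName"] for e in data["employees"]]
print(last_names)

# Serialise again in compact form (no extra whitespace).
round_trip = json.dumps(data, separators=(",", ":"))
print(round_trip)
```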
22
Data warehouse
23
Data Warehouse
“In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.” (Wikipedia)
Also non-traditional areas:
  environmental monitoring
  research
  biomedical applications
24
Data warehouse architecture
[Figure 11-2: Generic two-level architecture, with the stages labelled E (extract), T (transform) and L (load)]
One company-wide warehouse
Periodic extraction: data is not completely current in the warehouse
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
25
ETL process, example
26
Data cube/OLAP Cube
27
Multidimensional Data cube
Facts: numeric measures (revenue, amount sold, …)
Dimensions: (time period, area, product group, …)
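The distinction between facts and dimensions can be illustrated by a small roll-up: the measure is summed while some dimensions are kept and the others are aggregated away. This is a sketch over invented data; the fact rows, the dimension names and the helper `roll_up` are all assumptions.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product_group, revenue)
DIMENSIONS = ("year", "region", "product_group")
facts = [
    (2023, "North", "Books", 120.0),
    (2023, "South", "Books", 80.0),
    (2023, "North", "Games", 200.0),
    (2024, "North", "Books", 150.0),
    (2024, "South", "Games", 50.0),
]

def roll_up(rows, keep):
    """Sum the revenue measure, grouped by the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

by_year = roll_up(facts, ["year"])
by_year_region = roll_up(facts, ["year", "region"])
grand_total = roll_up(facts, [])          # all dimensions aggregated away

print(by_year)
print(by_year_region)
print(grand_total)
```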
28
Star Schema
29
Requirements in OLTP and in OLAP
OLTP                                        OLAP
Highly normalized, large number of tables   De-normalized, fewer tables
Read/Write                                  Read-only (batch update)
Few records (tens)                          Large number of records (millions), aggregations
Standardized, simple queries                Ad-hoc queries, aggregations

DWH optimisations
  Bitmap index
  Materialized view (with query rewrite!)
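The idea behind a bitmap index can be sketched in a few lines: for every distinct value of a column, keep one bit per row, and answer combined conditions with bitwise operations. The columns and helper names below are hypothetical, and Python integers stand in for the compressed bit vectors a real DBMS would use.

```python
# Hypothetical columns of a fact table with few distinct values each.
regions = ["North", "South", "North", "East", "South", "North"]
premium = [True, False, False, True, True, True]

def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set if row i holds it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region_idx = build_bitmap_index(regions)
premium_idx = build_bitmap_index(premium)

# WHERE region = 'North' AND premium: one bitwise AND
north_premium = rows_of(region_idx["North"] & premium_idx[True])

# WHERE region = 'South' OR region = 'East': one bitwise OR
south_or_east = rows_of(region_idx["South"] | region_idx["East"])

print(north_premium, south_or_east)
```

This works well for columns with few distinct values, which is typical for dimension attributes in a warehouse.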
30
Column-oriented DBMS
Column-oriented
  Aggregates over many rows and few columns
  Compression …
Row-oriented
  Retrieving or changing few records with many attributes
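A toy illustration of the two layouts, on invented data: the row layout keeps each record together, while the column layout keeps each attribute together, which makes single-column aggregates cheap and lets repetitive columns compress well. Run-length encoding is used here only as one simple example of compression.

```python
from itertools import groupby

# Row-oriented: one tuple per record (hypothetical data).
rows = [
    (1, "Books", 12.0),
    (2, "Books", 8.0),
    (3, "Games", 20.0),
    (4, "Games", 5.0),
]

# Column-oriented: one list per attribute.
names = ("id", "group", "amount")
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

total = sum(columns["amount"])    # aggregate: touches a single column
record = rows[2]                  # point lookup: touches a single row

def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

encoded = run_length_encode(columns["group"])
print(total, record, encoded)
```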
31
(Sípos Zsófia, thesis)
32
PivotGrid: https://demos.devexpress.com/ASPxPivotGridDemos/
33
In-memory database
SAP HANA (High-Performance Analytic Appliance)
39
Information Retrieval
40
Information Retrieval
Information Retrieval (IR) is finding material of an unstructured nature that is relevant to the user’s information need and helps the user complete a task from within large collections (usually stored on computers).
41
Unstructured data
Text (documents)
Images
Audio data
Video
…?
42
Unstructured (text) vs. structured (database) data in 2009
43
Goal, evaluation
Task: information need, “relevant” documents (objects)
Evaluation (for Boolean retrieval)
  Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved) = tp / (tp + fp)
  Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant) = tp / (tp + fn)

                Relevant   Nonrelevant
Retrieved       tp         fp
Not retrieved   fn         tn
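Both measures follow directly from the contingency table. A minimal sketch, with hypothetical document ids and an illustrative function name:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)     # retrieved and relevant
    fp = len(retrieved - relevant)     # retrieved but not relevant
    fn = len(relevant - retrieved)     # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical result: 4 documents retrieved, 5 documents actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4).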
44
Boolean retrieval model
Example: which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
In the Boolean retrieval model, a query is a Boolean expression:
  Boolean queries use AND, OR and NOT to join query terms
  Each document is viewed as a set of words
  A document either matches the condition or it does not
Perhaps the simplest model to build an IR system on
Some search systems you still use are Boolean, e.g. library catalogs, …
Grep is line-oriented; IR is document-oriented.
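A Boolean query of this kind maps naturally onto set operations over an inverted index. The sketch below uses four invented mini-documents rather than the actual plays; `build_index` and `postings` are illustrative names.

```python
# Hypothetical toy collection (not the real plays).
docs = {
    "d1": "Brutus spoke to Caesar and Calpurnia",
    "d2": "Caesar trusted Brutus",
    "d3": "Caesar marched on",
    "d4": "Calpurnia dreamed",
}

def build_index(collection):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = {}
    for doc_id, text in collection.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    return index.get(term.lower(), set())

# Brutus AND Caesar AND NOT Calpurnia
hits = (postings("Brutus") & postings("Caesar")) - postings("Calpurnia")
print(sorted(hits))
```

AND becomes set intersection, OR set union, and AND NOT set difference; every document either is in the result set or is not, with no ranking.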
45
Text Preprocessing
46
Tokenization
Input: “Friends, Romans, Countrymen”
Output: tokens
  Friends
  Romans
  Countrymen
A token is a sequence of characters in a document
Uppercase, lowercase
Hewlett-Packard: Hewlett and Packard as two tokens?
Mar. 12, 1991
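A simple regular-expression tokenizer makes the open questions on this slide tangible. This is only a sketch: the function name, its two switches and the ASCII-only patterns are assumptions chosen for the illustration.

```python
import re

def tokenize(text, lowercase=False, split_hyphens=False):
    """Split text into alphanumeric tokens; hyphenated words stay whole by default."""
    if split_hyphens:
        pattern = r"[A-Za-z0-9]+"
    else:
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    tokens = re.findall(pattern, text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenize("Friends, Romans, Countrymen"))
print(tokenize("Friends, Romans, Countrymen", lowercase=True))
print(tokenize("Hewlett-Packard"))
print(tokenize("Hewlett-Packard", split_hyphens=True))
print(tokenize("Mar. 12, 1991"))
```

Note that the date comes out as three unrelated tokens, which is exactly the kind of case a production tokenizer has to decide about.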
47
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
  am, are, is → be
  car, cars, car's, cars' → car
  the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
48
Stemming
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping
  language dependent
  e.g., automate(s), automatic, automation all reduced to automat
Original text: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compress and compress ar both accept as equival to compress
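"Crude affix chopping" can be shown with a deliberately naive suffix stripper. This is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions picked so that the slide's automat example works.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ion", "es", "ic", "e", "s")

def crude_stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["automate", "automates", "automatic", "automation"]
stems = [crude_stem(w) for w in words]
print(stems)
```

All four forms collapse to the same index term, although that term is not a dictionary word, which is the main difference from lemmatization.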
49
Spell correction
Two principal uses
  Correcting document(s) being indexed
  Correcting user queries to retrieve “right” answers
Two main flavors:
  Isolated word
    Check each word on its own for misspelling
    Will not catch typos resulting in correctly spelled words, e.g., from → form
  Context-sensitive
    Look at surrounding words, e.g., I flew form Heathrow to Narita.
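Isolated-word correction, and its blind spot, can be sketched with Python's standard `difflib`. The tiny lexicon and the function name `suggest` are assumptions; `difflib` ranks candidates by a string-similarity ratio, which is just one of several possible closeness measures.

```python
import difflib

# Hypothetical lexicon of correctly spelled words.
LEXICON = ["i", "flew", "form", "from", "heathrow", "to", "narita"]

def suggest(word, lexicon=LEXICON):
    """Return the word itself if it is in the lexicon, else close lexicon entries."""
    word = word.lower()
    if word in lexicon:
        return [word]
    return difflib.get_close_matches(word, lexicon, n=3, cutoff=0.6)

print(suggest("Heathrwo"))   # a misspelling that is not in the lexicon
print(suggest("form"))       # a real word, so the typo for "from" goes unnoticed
```

Catching "flew form Heathrow" requires the context-sensitive flavor, since every word in it is spelled correctly on its own.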
50
Thesauri
Do we handle synonyms and homonyms?
  E.g., by hand-constructed equivalence classes
    car = automobile
    color = colour
We can rewrite to form equivalence-class terms
  When the document contains automobile, index it under car-automobile (and vice-versa)
Or we can expand a query
  When the query contains automobile, look under car as well
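Query expansion with hand-built equivalence classes might look as follows. The two classes come from the slide; the data structure and the function name are illustrative assumptions.

```python
# Hand-constructed equivalence classes, as on the slide.
EQUIVALENCE_CLASSES = [{"car", "automobile"}, {"color", "colour"}]

# Lookup table: term -> the class it belongs to.
CLASS_OF = {term: cls for cls in EQUIVALENCE_CLASSES for term in cls}

def expand_query(terms):
    """Replace each query term by all members of its equivalence class."""
    expanded = []
    for term in terms:
        for variant in sorted(CLASS_OF.get(term, {term})):
            if variant not in expanded:
                expanded.append(variant)
    return expanded

print(expand_query(["automobile", "colour", "price"]))
```

Terms without a class, such as "price" here, pass through unchanged.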
51
Soundex
Class of heuristics to expand a query into phonetic equivalents
  Language specific, mainly for names
  E.g., chebyshev → tchebycheff
Invented for the U.S. census … in 1918
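For reference, here is a sketch of one member of this class, the classic four-character American Soundex code (first letter plus three digits). It is written from the commonly stated rules and should be read as an illustration, not as the exact procedure the slides have in mind.

```python
# Letter groups of classic American Soundex; vowels, y, h and w have no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of `name` ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                   # a vowel resets prev, so repeats count again
    return (result + "000")[:4]

for name in ["Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"]:
    print(name, soundex(name))
```

Because this variant always keeps the first letter, a pair such as chebyshev and tchebycheff still receives different codes; matching them needs a further heuristic from the same family.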
52
Ranked retrieval
53
Problem with Boolean search: feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
  Query 1: “standard user dlink 650” → 200,000 hits
  Query 2: “standard user dlink 650 no card found” → 0 hits
Experience and good knowledge of the items are needed
Cf. our discussion of how Westlaw Boolean queries didn’t actually outperform free text querying
54
Ranked retrieval models
Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
Free text queries: rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
In principle, these are two separate choices, but in practice ranked retrieval has normally been associated with free text queries and vice versa
55
Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher
Assign a score, say in [0, 1], to each document with respect to a query
This score measures how well document and query “match”.
56
Bag of words model
Vector representation doesn’t consider the ordering of words in a document
“John is quicker than Mary” and “Mary is quicker than John” have the same vectors
This is called the bag of words model.
57
Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
Relevance does not increase proportionally with term frequency → flattening using some function
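The slide leaves the flattening function open. One common choice, used here purely as an assumption, is the logarithmic weight 1 + log10(tf) for tf > 0 and 0 otherwise:

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """Dampened term-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Raw counts for a hypothetical document.
counts = Counter("to be or not to be".split())
weights = {term: log_tf_weight(tf) for term, tf in counts.items()}
print(counts)
print(weights)

# The weight grows far more slowly than the raw count.
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 3))
```

With this weight, a term that occurs 1000 times counts four times as much as a term that occurs once, not a thousand times as much.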
58
Document frequency
Rare terms are more informative than frequent terms
  Consider a term in the query that is rare in the collection (e.g., arachnocentric)
We will use document frequency (df) to capture this: the number of documents that contain t
Inverse: the higher df is, the smaller its weight should be
Dampening/flattening: log
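Putting "inverse" and "log dampening" together gives the usual inverse document frequency, idf = log10(N / df), where N is the number of documents in the collection. The collection size and the df values below are invented for illustration.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency with log dampening: log10(N / df)."""
    return math.log10(n_docs / df)

N = 1_000_000   # hypothetical collection size
for term, df in [("rare term", 1), ("medium term", 1_000), ("in every doc", 1_000_000)]:
    print(f"{term:14} df={df:>9,} idf={idf(df, N):.1f}")
```

A term that occurs in every document gets weight 0, so it cannot influence the ranking.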
59
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
Best known weighting scheme in information retrieval
  Note: the “-” in tf-idf is a hyphen, not a minus sign!
  Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
60
Weight matrix
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
High-dimensional space (number of terms), sparse matrix
61
Queries as vectors
Key idea 1: Represent queries as vectors in the same space
Key idea 2: Rank documents according to their proximity to the query in this space
62
Proximity
Euclidean distance?
  A bad idea, because Euclidean distance is large for vectors of different lengths.
Angle between vectors
  Rank documents according to their angle with respect to the query vector
  Rank documents in increasing order of the angle between query and document (smallest angle first)
  Equivalently, rank documents in decreasing order of cosine(query, document)
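A small numeric example shows why length matters. The three vectors are invented: one document has the same term distribution as the query but is ten times longer, the other is short but about a partly different topic.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equally long vectors (0.0 if one is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1.0, 1.0, 0.0]
doc_same_direction = [10.0, 10.0, 0.0]   # same distribution, much longer
doc_other_topic = [1.0, 0.0, 1.0]        # short, but only partly on topic

for name, doc in [("same direction", doc_same_direction), ("other topic", doc_other_topic)]:
    print(f"{name:15} euclidean={euclidean(query, doc):6.2f} cosine={cosine(query, doc):.2f}")
```

Euclidean distance prefers the off-topic document simply because it is short, whereas the cosine ranks the on-topic document first.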
63
Summary – vector space ranking
Represent each document as a weighted tf-idf vector
Represent the query as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each document vector
Rank documents with respect to the query by score
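The four steps can be put together in a compact sketch. Everything specific here is an assumption: the three toy documents, whitespace tokenization, and the weighting (1 + log10 tf) · log10(N/df), which is one common tf-idf variant among several.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "car insurance auto insurance",
    "d2": "best car",
    "d3": "auto repair",
}

def tokenize(text):
    return text.lower().split()

N = len(docs)
doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in doc_tokens.values() for term in set(tokens))

def tfidf_vector(tokens):
    """Sparse tf-idf vector as a dict; terms unknown to the collection are dropped."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if t in df}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query):
    """Return (doc_id, score) pairs, best match first."""
    q = tfidf_vector(tokenize(query))
    scores = {d: cosine(q, tfidf_vector(tokens)) for d, tokens in doc_tokens.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank("car insurance")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

A real engine would precompute the document vectors and use an inverted index instead of scoring every document for every query.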
64
Evaluating ranked results: precision-recall curve
65
Relevant products
66
Apache Lucene
Text search engine library
  high-performance
  full-featured
  written entirely in Java
Suitable for nearly any application that requires full-text search, especially cross-platform.
67
Apache Lucene
Scalable, High-Performance Indexing
  over 150GB/hour on modern hardware
  small RAM requirements: only 1MB heap
  incremental indexing as fast as batch indexing
  index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
  ranked searching: best results returned first
  many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  fielded searching (e.g. title, author, contents)
  sorting by any field
  multiple-index searching with merged results
  allows simultaneous update and searching
  flexible faceting, highlighting, joins and result grouping
  fast, memory-efficient and typo-tolerant suggesters
  pluggable ranking models, including the Vector Space Model and Okapi BM25
  configurable storage engine (codecs)
Cross-Platform Solution
68
Oracle Text
Full-text search, integrated in Oracle
All Oracle editions, free

create table texttabelle(
  id number(10),
  dokument clob
)
/
create sequence seq_texttabelle
/
insert into texttabelle values (seq_texttabelle.nextval, 'A-Partei gewinnt Wahl in Hansestadt');
insert into texttabelle values (seq_texttabelle.nextval, 'Terror in Nahost: Kriminalität steigt immer weiter an');
insert into texttabelle values (seq_texttabelle.nextval, 'Wirtschaft: Erneuter Gewinnzuwachs in diesem Jahr');
insert into texttabelle values (seq_texttabelle.nextval, 'Olympia rückt näher: Der Fackellauf ist in vollem Gange');
69
Oracle Text
SQL> select * from texttabelle where contains(dokument, 'Papst and Skandal')>0;
 6 Papst bestürzt über jüngsten Skandal!

Word stem search
SQL> select * from texttabelle where contains(dokument, '$lesen')>0;
ID DOKUMENT
10 Der Papst liest seine erste Messe in den USA!
70
Oracle Text
Fuzzy operator
SQL> select * from texttabelle where contains(dokument, '?Wahlkrampf')>0;
ID DOKUMENT
 5 Wer wird US-Präsident? Obama und Clinton machen Wahlkampf
 7 Wahlkampf in den USA geht weiter: Clinton und Obama ...

Near
SQL> select * from texttabelle
     where contains(dokument, 'NEAR((Clinton, Wahlkampf),2)')>0;