Presentation is loading. Please wait.

Presentation is loading. Please wait.

Scalable Semantic Web Data Management Using Vertical Partitioning Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach David Yona Seminar On.

Similar presentations


Presentation on theme: "Scalable Semantic Web Data Management Using Vertical Partitioning Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach David Yona Seminar On."— Presentation transcript:

1 Scalable Semantic Web Data Management Using Vertical Partitioning Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach David Yona Seminar On Databases & The Internet December 2010

2 2 Reminder The Semantic Web –“A web of data that can be processed directly and indirectly by machines.” (Tim Berners-Lee) –Enables sharing and integration of data across different applications and organizations –Allows users to issue structured queries and receive correct data from different data-sources

3 3 Reminder RDF (Resource Description Framework) –The Semantic Web’s choice of data model –Basic idea: Making statements about resources in the form of subject-predicate-object expressions –A single statement can take the form of a triple: –Multiple statements can be represented using a graph

4 4 Reminder RDF Example: Represent the fact that Serge Abiteboul, Rick Hull, and Victor Vianu wrote a book called “Foundations of Databases”: person1 isNamed "Serge Abiteboul" person2 isNamed "Rick Hull" person3 isNamed "Victor Vianu" book1 hasAuthor person1 book1 hasAuthor person2 book1 hasAuthor person3 book1 isTitled "Foundations of Databases" 7 triples

5 5 Reminder RDF Example: Represent the fact that Serge Abiteboul, Rick Hull, and Victor Vianu wrote a book called “Foundations of Databases”: Person1Person2Person3 Book1 Serge AbiteboulRick HullVictor Vianu isNamed hasAuthor Foundations of Databases isTitled

6 6 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

7 7 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

8 8 Main Issue How do we store and query RDF data in relational databases?

9 9 Introduction Main Issue: First Approach First approach: –Create a single three-column schema, and store all RDF triples using this table. –Advantages: Easy! –Disadvantages: Serious performance issues! Almost all interesting queries involve many self- joins over this single table.

10 10 Introduction Main Issue: First Approach First approach - Performance issues: –Example: Find all of the authors of books whose title contains the word “Transaction”: SELECT p5.obj FROM rdf AS p1, rdf AS p2, rdf AS p3, rdf AS p4, rdf AS p5 WHERE p1.prop = ’title’ AND p1.obj LIKE ’%Transaction%’ AND p1.subj = p2.subj AND p2.prop = ’type’ AND p2.obj = ’book’ AND p3.prop = ’type’ AND p3.obj = ’auth’ AND p4.prop = ’hasAuth’ AND p4.subj = p2.subj AND p4.obj = p3.subj AND p5.prop = ’isnamed’ AND p5.subj = p4.obj; Five self joins!

11 11 Introduction Main Issue: Other Approaches Solution? –Stop Using RDF? RDF has a great deal of momentum in the web community: Several international conferences, hundreds of attendees and dozens of papers –Explore ways to improve RDF query performance Focus on using a relational query processor to execute RDF queries Two different physical organization methods: –Property Table –Vertically partitioned database

12 12 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

13 13 Current State of the Art RDF in RDBMSs Most RDF data storage solutions are relational DBMSs (such as Jena, Oracle, Sesame, 3store) The common solution: A giant triples table, containing one row for each statement. Optimizations: –Store URIs and literal values in a different table, and use keys or shortend versions in the triple table –Indexes on all three columns

14 14 Current State of the Art RDF in RDBMSs Multi-layered architecture: Removes any dependence on the particular RDBMS used. –Top layer: RDF-specific functionality –Bottom Layer: RDBMS Queries are issued using RDF-specific query languages (SPARQL), converted to SQL and sent to the RDBMS –RDBMS optimizes and executes the SQL query over the triple store

15 15 Current State of the Art RDF in RDBMSs Subj.Prop.Obj. ID1typeBookType ID1title“XYZ” ID1author“Fox, Joe” ID1copyright“2001” ID2typeCDType ID2title“ABC” ID2artist“Orr, Tim” ID2copyright“1985” ID2language“French” ID3typeBookType ID3title“MNO” ID3language“English” ID4typeDVDType ID4title“DEF” ID5typeCDType ID5title“GHI” ID5copyright“1995” ID6typeBookType ID6copyright“2004” SELECT ?title FROM table WHERE { ?book author ‘‘Fox, Joe’’ ?book copyright ‘‘2001’’ ?book title ?title } SELECT C.obj FROM TRIPLES AS A, TRIPLES AS B,TRIPLES AS C WHERE A.subj = B.subj AND B.subj = C.subj AND A.prop = ‘copyright’ AND A.obj = ‘‘2001’’ AND B.prop = ‘author’ AND B.obj = ‘‘Fox, Joe’’ AND C.prop = ‘title’ SPARQL SQL

16 16 Current State of the Art Property Tables An attempt to speed up queries over triple stores. Implementation: Denormalized RDF tables are physically stored in a wider, flattened representation. –This reduces the amount of subject-subject self joins needed. Two types of property tables: –Clustered property tables –Property-class tables

17 17 Current State of the Art Property Tables: Clustered property tables Contains clusters of properties that tend to be defined together. Each table is composed of a “subject” attribute, together with a set of attributes from a certain property cluster. Multiple property tables with different clusters of properties may be created. –A particular property may only appear in at most one property table.

18 18 Subj.Prop.Obj. ID1typeBookType ID1title“XYZ” ID1author“Fox, Joe” ID1copyright“2001” ID2typeCDType ID2title“ABC” ID2artist“Orr, Tim” ID2copyright“1985” ID2language“French” ID3typeBookType ID3title“MNO” ID3language“English” ID4typeDVDType ID4title“DEF” ID5typeCDType ID5title“GHI” ID5copyright“1995” ID6typeBookType ID6copyright“2004” Current State of the Art Property Tables: Clustered property tables Subj.TypeTitleCopyright ID1BookType“XYZ”“2001” ID2CDType“ABC”“1985” ID3BookType“MNP”NULL ID4DVDType“DEF”NULL ID5CDType“GHI”“1995” ID6BookTypeNULL“2004” Property Table Subj.Prop.Obj. ID1author“Fox, Joe” ID2artist“Orr, Tim” ID2language“French” ID3language“English Left-Over Triples

19 19 Current State of the Art Property Tables: Property-class tables Exploits the “type” property of subjects to cluster similar sets of subjects together in the same table. A property may exist in multiple property-class tables.

20 20 Subj.TitleAuthorcopyright ID1“XYZ”“Fox, Joe”“2001” ID3“MNP”NULL ID6NULL “2004” Subj.Prop.Obj. ID1typeBookType ID1title“XYZ” ID1author“Fox, Joe” ID1copyright“2001” ID2typeCDType ID2title“ABC” ID2artist“Orr, Tim” ID2copyright“1985” ID2language“French” ID3typeBookType ID3title“MNO” ID3language“English” ID4typeDVDType ID4title“DEF” ID5typeCDType ID5title“GHI” ID5copyright“1995” ID6typeBookType ID6copyright“2004” Subj.TitleArtistcopyright ID2“ABC”“Orr, Tim”“1985” ID5“GHI”NULL“1995” Current State of the Art Property Tables: Property-class tables Class: BookType Left-Over Triples Class: CDType Subj.Prop.Obj. ID2language“French” ID3language“English” ID4typeDVDType ID4title“DEF”

21 21 Current State of the Art Property Tables: Advantages Most important advantage: Reduce subject-subject self joins. –“return the title of the book(s) Joe Fox wrote in 2001” Original table Property Table with “title”, “author” and “copyright”

22 22 Current State of the Art Property Tables: Disadvantages In the real world, most queries require joins or unions to combine data from several tables. –“Find out if there are any items in the catalog copyrighted before 1990 in a language other than English” SELECT T.subject, T.object FROM TRIPLES AS T, PROPTABLE AS P WHERE T.subject == P.subject AND P.copyright < 1990 AND T.property = "language" AND T.object != "English" (SELECT T.subject, T.object FROM TRIPLES AS T, BOOKS AS B WHERE T.subject == B.subject AND B.copyright < 1990 AND T.property = ‘language’ AND T.object != "English") UNION (SELECT T.subject, T.object FROM TRIPLES AS T, CDS AS C WHERE T.subject == C.subject AND C.copyright < 1990 AND T.property = ‘language’ AND T.object != "English") Class Clustered

23 23 Current State of the Art Property Tables: Disadvantages RDF data tends not to be very structured, and not every subject listed in the table will have all properties defined –Lots of NULL values. Multi-valued properties Subj.TitleAuthorcopyright ID1“XYZ”“Fox, Joe”“2001” ID3“MNP”NULL ID6NULL “2004” Book1 Person1 Person2 hasAuthor

24 24 Current State of the Art Property Tables: Summary Property tables can significantly improve performance by reducing the number of self- joins and typing attributes. Property tables introduce complexity by requiring property clustering to be carefully done to create property tables that are not too wide, while still being wide enough to answer most queries directly. Multi-valued attributes cause further complexity.

25 25 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

26 26 A Simpler Alternative Vertically Partitioned Approach Split the triples table into n two-column tables, where n is the number of unique properties in the data. In each table, the first column contains the subjects that define the property and the second column contains the object values for those subjects. Each table is sorted by subject.

27 27 A Simpler Alternative Vertically Partitioned Approach ID1BookType ID2CDType ID3BookType ID4DVDType ID5CDType ID6BookType ID1“XYZ” ID2“ABC” ID3“MNO” ID4“DEF” ID5“GHI” ID1“2001” ID2“1985” ID5“1995” ID6“2004” ID2“French” ID3“English ID1“Fox, Joe” ID2“Orr, Tim” TypeTitle Copyright Artist Language Author

28 28 A Simpler Alternative Vertically Partitioned Approach: Advantages Support for heterogeneous records. –Very important when the data isn’t well structured. No clustering algorithms are needed. –Schema design is straightforward. Only the properties accessed by a query need to be read. Fewer unions –All data for a particular property is located in the same table

29 29 A Simpler Alternative Vertically Partitioned Approach: Advantages Tables are sorted by subject: –Particular subjects can be located quickly. –Fast merge joins can be used to reconstruct information about multiple properties for subsets of subjects. Support for multi valued attributes –If ID1 has two authors: ID1“Fox, Joe” ID1“Green, John” Author

30 30 A Simpler Alternative Vertically Partitioned Approach: Disadvantages Queries that access several properties –Merge joins not expensive, but not free either. Inserts into vertically partitioned tables are slower: –Multiple tables need to be accessed.

31 31 A Simpler Alternative Extending a Column-Oriented DBMS Row-Oriented DBMSs: –Entire tuples are stored consecutively on disk or in memory. –Not efficient when a few attributes are accessed per query. Column-Oriented DBMSs: –Main idea: Store tables as collections of columns rather than as collections of rows. –Projections occur for free. –Inserts are slower! –Common use: Storing big, wide tables where only a few attributes are queried at once.

32 32 A Simpler Alternative Extending a Column-Oriented DBMS ID1“XYZ” ID2“ABC” ID3“MNO” ID4“DEF” ID5“GHI” ID1“XYZ” ID2“ABC” ID3“MNO” ID4“DEF” ID5“GHI” ID1, “XYZ” ID2, “ABC” ID3, “MNO” ID4, “DEF” ID5, “GHI” ID1, ID2, ID3, ID4, ID5 “XYZ”, “ABC”, “MNO”, “DEF”, “GHI” Row Oriented DB Column Oriented DB

33 33 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Tuple headers are stored separately –Tuple metadata includes transactions timestamps, number of attributes in tuple, null flags, etc. –Row-oriented DBMSs store headers at the beginning of each tuple. (Header size in Postgres: 27 bytes). –Data size in two-column tables: Usually up to 8 bytes (strings are dictionary encoded). –Column-stores put header information in separate columns and can selectively ignore it. –Result: Table scans perform 4-5 times faster in column stores.

34 34 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Optimization for fixed-length tuples –In row-stores, if any attribute is variable- length, then the entire tuple is variable- length. –This is the common case, and therefore row-stores are designed for this case: tuples are located through pointers in header. –In column-stores, fixed-length attributes are stored as arrays. In our case, both attributes are fixed-length

35 35 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Column-oriented data compression –Each attribute is stored separately. –Each attribute can be compressed separately using an algorithm best suited for that column. –Result: significant performance improvement.

36 36 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Carefully optimized column merge code –Merging columns is a very frequent operation in column stores. –Therefore, the merging code is carefully optimized to achieve high performance Example: Extensive pre-fetching is used when merging multiple columns

37 37 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Direct access to sorted files rather than indirection through B-Trees –In column-stores: The increased dependence on merge-sort joins necessitates that heap files are maintained in guaranteed sorted order. –In most row-stores: Order is guaranteed only through an index

38 38 A Simpler Alternative Extending a Column-Oriented DBMS Implementation Details An open-source column-oriented database system was extended in order to experiment with the ideas presented: –Name: C-Store –Characteristics: Each table is stored as a collection of columns. Each column is stored in a separate file. Each file contains a list of 64K blocks with as many values as possible packed in each block. Support for temporary tables, index-nested loop joins, unions, string-data operators wasn’t available.

39 39 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

40 40 Materialized Path Expressions In RDF data, object values can either be literals (“Fox, Joe”) or URIs (“http://preamble/FoxJoe”). In the latter case, the value can be further described using additional triples: – Path expressions identify an object by describing how to navigate to it in some graph of objects

41 41 Materialized Path Expressions To find all books whose authors were born in 1860, we need a path expression through the data. In all three RDF storage schemas described, querying path expressions is expensive. –Querying path expressions is a common operation on RDF data.

42 42 Materialized Path Expressions In a triple schema, a path of length n requires (n-1) subject-object self joins. –Find all books whose authors were born in 1860: For the other schemas, (n-1) joins are required as well. –For vertically-partitioned schemas, all tables of properties involved must be joined, but these can’t be merge-joins. SELECT B.subj FROM triples AS A, triples AS B WHERE A.prop = wasBorn AND A.obj = “1860” AND A.subj = B.obj AND B.prop = “Author”

43 43 Materialized Path Expressions A graphical representation Book 1 Joe Fox 1860 Author wasBorn

44 44 Materialized Path Expressions What we want: Book 1 Joe Fox 1860 Author wasBorn SELECT A.subj FROM proptable AS A, WHERE A.author:wasBorn = “1860” Author:wasBorn

45 45 Materialized Path Expressions Solution: –Using a vertically partitioned schema, this author:wasBorn path expression can be pre-calculated and the result stored in its own two column table as if it were a regular property. The joins don’t have to be performed in run time. –Multi-value support is achieved as well

46 46 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

47 47 Benchmark Used to evaluate the performance of all three RDF databases. Based on publicly available library data and a collection of queries generated from a web-based user interface for browsing RDF content.

48 48 Benchmark Barton Data Data taken from the Barton Library dataset. Contain records acquired from an RDF-formatted dump of the MIT Libraries Barton catalog. Converted from RDF/XML syntax to triples. Duplicate triples were eliminated. The numbers: –50,255,599 triples –221 unique properties –82 (37%) properties are multi-valued –77% of the triples have a multi-valued property A good demonstration of the relatively unstructured nature of Semantic Web data.

49 49 Benchmark Longwell Overview A tool which provides a GUI for generic RDF data exploration in a web browser. Shows the user a list of currently filtered resources (RDF subjects) and a list of filters in the side-panels. User can select a filter and narrow down the presented data. Assumption: A set of 28 interesting properties over which the queries will be run has been selected before hand. –There are 26,761,389 triples for these properties.

50 50 Benchmark Longwell Overview Longwell Opening Screen

51 51 Benchmark Longwell Overview Longwell Screen Shot After Clicking on “Text” in the Type Property Panel

52 52 Benchmark Longwell Overview Longwell Screen Shot After Clicking on “Text” in the Type Property Panel and Scrolling down

53 53 Benchmark Longwell Overview Longwell Screen Shot After Clicking on “fre” in the Language Property Panel

54 54 Benchmark Longwell Queries Query 1: –Calculate the opening panel displaying the counts of the different types of data in the RDF store. –This requires a search for the objects and counts of those objects with property Type. There are 30 such objects. –For example: “Type: Text” has a count of 1,542,280, and “Type: NotatedMusic” has a count of 36,441.

55 55 Benchmark Longwell Queries Query 2: –The user selects “Type: Text” from the previous panel. –Longwell must present him with a list of other defined properties for resources of “Type: Text”. –It must also calculate the frequency of these properties. –For example, the Language property is defined 1,028,826 times for resources that are of “Type: Text”.

56 56 Benchmark Longwell Queries Query 3: –For each property defined on items of “Type: Text”, populate the property panel with the counts of popular object values for that property (where popular means that an object value appears more than once). –For example, the property Edition has 8 items with value “[1st ed. reprinted].”

57 57 Benchmark Longwell Queries Query 4: –This query recalculates all of the property- object counts from Q3 after the user clicks on the “French” value in the “Language” property panel. –Essentially this is narrowing the working set of subjects to those whose Type is Text and Language is French. –This query has a much higher-selectivity than Q3.

58 58 Benchmark Longwell Queries Query 5: –Here we perform a type of inference. –If there are triples of the form (X Records Y) and (Y Type Z) then we can infer that X is of type Z. Here “X Records Y” means that X records information about Y (for example, X might be a web page with information on Y). –For this query, we want to find the inferred type of all subjects that have this “Records” property defined that also originated in the US Library of Congress (i.e. contain triples of the form (X origin “DLC”)). –The subject and inferred type is returned for all non-Text entities.

59 59 Benchmark Longwell Queries Query 6: –For this query, we combine the inference first step of Q5 with the property frequency calculation of Q2 to extract information in aggregate about items that are either directly known to be of “Type: Text” (as in Q2) or inferred to be of “Type: Text” through the Q5 Records inference.

60 60 Benchmark Longwell Queries Query 7: –Finally, we include a simple triple selection query with no aggregation or inference. –The user tries to learn what a particular property (in this case, “Point”) actually means by selecting other properties that are defined along with a particular value of this property. –The user wishes to retrieve subject, Encoding, and Type of all resources with a Point value of “end”. The result set indicates that all such resources are of the type Date. –This explains why these resources can have “start” and “end” values: each of these resources represents a start or end date, depending on the value of Point.

61 61 Benchmark Evaluation The performance of all 7 queries is compared using three different schemas: –Triples schema –Property tables schema –Vertically partitioned schema Performance of each of the three schemas is studied in a row-store (Postgres), and for the vertically- partitioned schema, also in a column- store.

62 62 Benchmark Store Implementation Details All implementations feature a dictionary encoding table that maps strings to integer IDs. –These IDs are used instead of strings to represent properties, subject and objects. Encoding table has a clustered B+Tree index on the IDs and an unclustered B+Tree on the strings.

63 63 Benchmark Store Implementation Details Triple Store: –Table with three columns – subject, property, object. –Three B+Tree indices: “subject, property, object” – clustered “property, object, subject” – unclustered “object, subject, property” – unclustered –The list of 28 interesting properties is stored in a separate table. –Total storage space needed: 8.3GB

64 64 Benchmark Store Implementation Details Property Table Store –A property table for each query was created – each table containing only the columns accessed by that query. A table with all 28 interesting properties A table with subject and Text Etc. –In almost all cases, multi-valued attributes were stored as integer arrays. –Single-valued attributes that are used as selection predicates received B+Tree indices –All tables had a clustered index on “subject” –Total storage space needed: Over 14 GB

65 65 Benchmark Store Implementation Details Vertically Partitioned Store (Postgres) –One table per property. Each table contains a subject and an object column. –A clustered B+Tree index on “subject” –An unclustered B+Tree index on “object” –Multi-valued attributes are represented through multiple rows in the table. –Total storage space needed: 5.2GB

66 66 Benchmark Store Implementation Details Vertically Partitioned Store (C-Store) –Properties are stored on disk in separate files, in blocks of 64K. –Table structure as before. –Each property has a clustered B+Tree on “subject” –Single valued, low cardinality properties have a bit-map index on “object” –Total storage space needed: 2.7GB

67 67 Benchmark Query Implementation Details Query 1: –Triple store: No join needed Aggregation can occur directly on the object column after “property=Type” selection –Vertically partitioned table / Column store: Aggregate the object values for the Type table. –Property table: Same as vertically partitioned table. SELECT A.obj, count(*) FROM triples AS A WHERE A.prop = " " GROUP BY A.obj

68 68 Benchmark Query Implementation Details Query 5: –Triple store: Selection on “property=Origin” and “object=DLC”, then self-join on subject. For subjects with a “Records” property, subject-object join is performed. –Property table: Selection is applied on “Origin=DLC”. “Records” column is projected and (self) joined with the “subject” column of the original table. The “Type” values of the join results return. –Vertically partitioned table / Column store: “Object=DLC” selection on “Origin” property, join subjects with “Records” table, perform subject-object join of “Type”-”Records” to get the results. SELECT B.subj, C.obj FROM triples AS A, triples AS B, triples AS C WHERE A.subj = B.subj AND A.prop = " “ AND A.obj = " " AND B.prop = " “ AND B.obj = C.subj AND C.prop = " “ AND C.obj != " "

69 69 Benchmark Query Implementation Details Query 6: –Triple store: The query first finds subjects that are directly of “Type=Text” using selection, then finds subjects that are inferred to be “Type=Text” through subject- object join of “Records” property. Next, the other properties are found using a self-join on “subject”, and finally a count is performed. –Property table, Vertically partitioned table, Column store: Create temporary tables in a manner similar to Query 5 SELECT A.prop, count(*) FROM triples AS A, properties AS P ( (SELECT B.subj FROM triples AS B WHERE B.prop = " " AND B.obj = " ") UNION (SELECT C.subj FROM triples AS C, triples AS D WHERE C.prop = " " AND C.obj = D.subject AND D.prop = " " AND D.obj = " ") ) AS uniontable WHERE A.subj = uniontable.subj AND P.prop = A.prop GROUP BY A.prop;

70 70 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

71 71 Results

72 72 Results Property table and vertical partitioning approaches perform 2-3 times faster than the triple store approach. C-Store added another factor of 10 performance improvement.

73 73 Results Performance Differences Query 1: –Property table and vertical partitioning numbers are identical: Idealized property tables were used. –Triple-store not too slow: No self-joins. –C-Store performs an order of magnitude better: Table size is 4 times smaller. Joining keys with string dictionary table: C-Store uses index nested loop joins, Postgres uses merge-join.

74 74 Results Performance Differences Query 2: –Triple-store: Many subject-subject joins. –Property table vs. vertical partitioning: Vertical partitioning approach must perform 28 merge-joins, property table doesn’t perform even one. Query 3: –Property table hurt by multiple sequential scans: Grouping must be per property and then object for each column, and therefore each column must group by the object values in that particular column. Query 4: –Highly selective. Query 5-7 –Subject-object joins hurt each of the stores significantly.

75 75 Results Performance Differences A vertically partitioned database provides a significant improvement over the triple store schema. Vertical partitioning vs. property tables in a row store: –Performance –Simplicity Vertical partitioning in a column-store

76 76 Results Scalability The magnitude of query performance is important. How performance scales with size of data is at least as important.

77 77 Results Materialized Path Expressions Expensive subject-object joins replaced with cheaper subject-subject joins. Queries 5-6 rerun using MPEs. Implementation: –Property table: New column. –Vertically partitioned table: New table. Q5Q6 Property Table39.49 (17.5% faster)62.6 (38% faster) Vertical Partitioning4.42 (92% faster)65.84 (22% faster) C-Store2.57 (84% faster)2.70 (75% faster)

78 78 Results The Effect of Further Widening The Semantic-Web content is likely to have an unstructured schema. Property tables in the benchmark were pre- optimized. Result of adding extra columns: QueryWide Property TableProperty Table % slowdown Q160.91381% Q233.9385% Q3584.841% Q444.9658% Q576.3460% Q6154.3353% Q724.25298%

79 79 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

80 80 Conclusion The emergence of the Semantic-Web necessitates high-performance data management tools to manage the tremendous collections of RDF data being produced. Current state of the art RDF databases – triple-stores – perform and scale extremely poorly. The previously proposed “property table” optimization has not been adopted in most RDF databases, and has many disadvantages. Vertically partitioning tables achieve similar performance as property tables in a row-oriented database, and outperform other solutions in a column-oriented database.

81 Questions?

82 Thank You


Download ppt "Scalable Semantic Web Data Management Using Vertical Partitioning Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach David Yona Seminar On."

Similar presentations


Ads by Google