CS 540 Database Management Systems Lecture 4: Project topics overview.

Outline How to choose a project topic? Broad topic areas How to pick a project, write your report, and present your work? Overview of sample project topics

What is a good research problem? A good research problem is a solvable challenge that is well connected to a real-world need/problem. Real-world challenges vs. imaginary challenges – Not all challenges are interesting (to society) – Real-world challenges are always interesting to work on – Imaginary challenges may (happen to) be interesting – Spend your effort on solving interesting challenges so that you’ll contribute more to society. However, not all real-world problems are challenges; some are straightforward to solve. Not all challenges/problems are solvable (with limited resources, time, money, tools, etc.)

Identify a Good Research Problem (chart: Level of Challenges vs. Impact/Usefulness; Known vs. Unknown)
– Good applications: not interesting for research
– High impact, low risk (easy): good short-term research problems
– High impact, high risk (hard): good long-term research problems
– Low impact, difficult: often publishable, but not good research problems
– Low impact, low risk: bad research problems; generally not publishable
– Chart label: Course project

Landscape of data management (figure) Axes: query capability (exact matching, inexact matching, inferences/mining); data complexity (structured data, unstructured data, multimedia data); scale. Labeled region: RDBMS

DB and related areas Databases Text Information Management (Information Retrieval) Multimedia Information Management Web/Bio Information Management Data Mining/Machine Learning

Map of general topic areas Databases IR Multimedia Web/Bio Data Mining Core/Traditional DB Web/Bio DB Applications DB+IR Data Mining, Decision Support Multimedia DB

The big challenge “... Our biggest challenge is a unification of approximate and exact reasoning. Most of us come from the exact-reasoning world – but most of our clients are asking questions with approximate or probabilistic answers….” - Jim Gray [SIGMOD 2004]

How to do a bad project and give a bad presentation! Slides from “How to Have a Bad Career!” by David A. Patterson

How to Do a Bad Project? Let Complexity Be Your Guide (Confuse Thine Enemies) Best compliment: “It’s so complicated, I can’t understand the ideas” Easier to claim credit for subsequent good ideas – If no one understands, how can they contradict your claim? It’s easier to be complicated If it were not unsimple then how could distinguished colleagues in departments around the world be positively appreciative of both your extraordinary intellectual grasp of the nuances of issues as well as the depth of your contribution?

How to Do a Bad Project? Never Be Proven Wrong Avoid Implementing Avoid Quantitative Experiments – If you’ve got good intuition, who needs experiments? – Takes too long to measure Avoid Benchmarks Projects whose payoff is ≥ 20 years away give you 19 safe years

How to Do a Bad Project? Use the Computer Scientific Method
Computer Scientific Method
– Hunch
– 1 experiment & change all parameters
– Discard if doesn’t support hunch
– Why waste time? We know this
Obsolete Scientific Method
– Hypothesis
– Sequence of experiments
– Change 1 parameter/exp.
– Prove/Disprove Hypothesis
– Document for others to reproduce results

5 Commandments for Bad Writing I. Thou shalt not define terms, nor explain anything. – That’s why there are dictionaries; it insults the readers. II. Thou shalt replace “will do” with “have done”. – After all, someone is likely to build it in the next 2 to 3 years. III. Thou shalt not mention drawbacks to your approach. – That’s not your job; let others find the flaws. IV. Thou shalt not reference any papers. – If they were good people, they’d be at your institution. V. Thou shalt write before implementing. – Highest performance.

7 Talk Commandments for a Bad Talk I. Thou shalt not illustrate. II. Thou shalt not covet brevity. – Do you want to continue the stereotype that engineers can't write? Always use complete sentences, never just key words. If possible, use whole paragraphs and read every word. III. Thou shalt not print large. – Be humble: use a small font. Important people sit in front. IV. Thou shalt not use color. V. Thou shalt cover thy naked slides. VI. Thou shalt not skip slides in a long talk. – You prepared the slides; people came for your whole talk; so just talk faster. VII. Thou shalt not practice. – Why waste research time practicing a talk?

Following all the commandments We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. Our compiling strategy is to exploit coarse-grain parallelism at function application level: and the function application level parallelism is implemented by fork-join mechanism. The compiler translates source programs into control flow graphs based on analyzing flow of control, and then serializes instructions within graphs according to flow arcs such that function applications, which have no control dependency, are executed in parallel. A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. 
Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. We assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain control flow execution. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism. How to Do a Bad Poster David Patterson University of California Berkeley, CA 94720

Alternatives to Bad Papers Do the opposite of the Bad Paper commandments: define terms, distinguish “will do” vs. “have done”, mention drawbacks, report real performance, reference other papers. Find related work. First read Strunk and White, then follow these steps:
1. Write a 1-page paper outline, with a tentative page budget per section
2. Paragraph map: 1 topic phrase/sentence per paragraph
3. (Re)Write draft – Long captions per figure can contain details – Use tables to contain facts that would make dreary prose
4. Read aloud, spell check & grammar check
5. Get feedback from friends and critics on the draft; go to 3.
www.cs.berkeley.edu/~pattrsn/talks/writingtips.html

Alternatives to Bad Talk Do the opposite of the Bad Talk commandments: I. Thou shalt not illustrate. II. Thou shalt not covet brevity. III. Thou shalt not print large. IV. Thou shalt not use color. V. Thou shalt cover thy naked slides. VI. Thou shalt not skip slides in a long talk. VII. Thou shalt not practice. Allocate 2 minutes per slide, leave time for questions. Don’t over-animate. Do dry runs with friends/critics for feedback, including tough audience questions. Record a practice talk (audio or video). Don’t memorize the speech, but have notes ready.

Sample Project Topics

Query and visualize RDF data Many graph datasets are in Resource Description Framework (RDF) format – Also called linked data. RDF database – a set of triples: subject, predicate, object. The number and size of data sets are rapidly growing: Wikidata, DBpedia, FOAF, Knowledge Graph, … You may find datasets at linkeddata.org, rdfdata.org, …
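As a minimal illustration of the triple model on this slide, the Python sketch below stores a few made-up triples and matches them against a pattern in which `None` acts as a wildcard. The data and the `match` helper are hypothetical and do not come from any RDF library.

```python
# A tiny in-memory "triple store": each fact is (subject, predicate, object).
# The triples below are invented purely for illustration.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every non-None component."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


if __name__ == "__main__":
    # Whom does alice know?
    print(match(TRIPLES, s="alice", p="knows"))
```

A real triple store would index the triples instead of scanning the whole list; the linear scan here only shows the data model.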

Query and visualize RDF data RDF database – No prescribed schema: easy to create and extend; a Semantic Web standard. – Hard to formulate queries! – Query processing is relatively inefficient. RDF data management systems / triple stores – Public: Apache Jena, KiWi, … – Proprietary: IBM DB2, Oracle, … SPARQL query language – Similar to SQL

Query and visualize RDF data Create an easy-to-use query interface for RDF data – There is some work on keyword search over RDF, but it has low precision and is slow. – You may combine SPARQL with some keyword search features. – Query suggestion, auto-completion, … for SPARQL or keyword queries.
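One simple starting point for the auto-completion idea is prefix lookup over a sorted vocabulary of terms. The sketch below assumes a small hand-picked vocabulary; `complete` and `VOCAB` are hypothetical names, and a real interface would draw its terms from the dataset itself.

```python
import bisect


def complete(vocabulary, prefix, limit=5):
    """Return up to `limit` terms from a sorted vocabulary that start with prefix."""
    # In a sorted list, all terms sharing a prefix are contiguous and begin
    # at the insertion point of the prefix itself.
    start = bisect.bisect_left(vocabulary, prefix)
    suggestions = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(term)
    return suggestions


# Example vocabulary, chosen by hand for illustration.
VOCAB = sorted(["foaf:knows", "foaf:name", "foaf:mbox", "rdf:type", "rdfs:label"])

if __name__ == "__main__":
    print(complete(VOCAB, "foaf:"))
```

Ranking suggestions by how often a term appears in past queries or in the data would be a natural next step; this sketch returns them in alphabetical order only.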

Query and visualize RDF data The results of RDF queries are usually not easy to understand – Large graphs. Create an interface that summarizes the results – Show the most important/relevant nodes/links first – Users can navigate over the results – You may do this for the whole database. It helps users understand the structure of the data and specify queries.
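The slide does not say how importance should be measured. As one possible baseline, the sketch below ranks nodes by degree (the number of triples they take part in), so an interface could show the highest-ranked nodes first. `top_nodes` and the sample edges are hypothetical.

```python
from collections import Counter


def top_nodes(edges, k=3):
    """Rank nodes by degree (number of incident edges) and return the top k."""
    degree = Counter()
    for subject, _predicate, obj in edges:
        degree[subject] += 1
        degree[obj] += 1
    # Sort by descending degree, then by name so that ties are deterministic.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]


# Invented query result: a small graph given as (subject, predicate, object).
EDGES = [
    ("a", "p", "b"),
    ("a", "p", "c"),
    ("a", "q", "d"),
    ("b", "p", "c"),
]

if __name__ == "__main__":
    print(top_nodes(EDGES, k=2))
```

Degree is only a stand-in; measures such as PageRank or relevance to the query could be swapped in without changing the interface.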

Query and visualize RDF data Create an interaction interface over RDF – Users usually interact with the database over a long period of time: submit query => explore the result => formulate the next query => explore the result => … – The interface makes it easier for users to formulate queries based on the current results. – It keeps a history of previous queries.

Querying relational data Most users do not know the schema and content of their relational databases. Create an interface that helps users write SQL queries – Query completion and suggestion – Create a visualization of the schema, with more important tables at a higher level.
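A query-completion interface first needs to know which tables and columns exist. The sketch below reads that from SQLite's catalog using Python's standard `sqlite3` module and filters by prefix. The `suggest` and `demo_connection` helpers and the demo schema are assumptions made for illustration; other database systems expose their catalogs differently.

```python
import sqlite3


def suggest(conn, prefix):
    """Suggest table names and table.column names that start with prefix."""
    names = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        names.append(table)
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        for column in conn.execute(f'PRAGMA table_info("{table}")'):
            names.append(f"{table}.{column[1]}")
    return [n for n in names if n.lower().startswith(prefix.lower())]


def demo_connection():
    """Build a throwaway in-memory database with a made-up schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Emp (E TEXT, D TEXT)")
    conn.execute("CREATE TABLE Manager (M TEXT, D TEXT)")
    return conn


if __name__ == "__main__":
    print(suggest(demo_connection(), "em"))
```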

Data independence The relational model is not access path independent. How can you make SQL more access path independent? – Map the schema of the query to the schema of the database. Database schema: EmpManager(E, M, D). User assumes the schema: Emp(E, D), Manager(M, D). User query: select E from Emp => transformed query: select E from EmpManager – Try all possible schemas. Slow! Data independent learning and inference
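To make the EmpManager example concrete, the sketch below hand-writes the mapping as SQL views in SQLite, so the user's query against Emp runs unchanged on a database that only stores EmpManager. This is one manual way to get the effect and is an assumption made here; the project idea on the slide is to find such mappings automatically. The sample rows are invented.

```python
import sqlite3


def build_database():
    """Create EmpManager(E, M, D) plus views for the schema the user assumes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE EmpManager (E TEXT, M TEXT, D TEXT);
        INSERT INTO EmpManager VALUES
            ('ann', 'max', 'sales'),
            ('bo',  'max', 'sales'),
            ('cy',  'zoe', 'ops');
        -- Views that present the stored table under the user's assumed schema.
        CREATE VIEW Emp AS SELECT DISTINCT E, D FROM EmpManager;
        CREATE VIEW Manager AS SELECT DISTINCT M, D FROM EmpManager;
        """
    )
    return conn


def employees(conn):
    """Run the user's query exactly as written against the assumed schema."""
    return [row[0] for row in conn.execute("SELECT E FROM Emp ORDER BY E")]


if __name__ == "__main__":
    print(employees(build_database()))
```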

Visualize relational data Create a visualization engine for SQL queries – Many users like to see charts and visualizations instead of tables. – Visualization engines do not normally work with relational databases. Create an interactive query interface for SQL – Keeps a history of previous queries

Data preparation Most data scientists spend about 80% of their time on data preparation! – Transforming data from one form to another: most data sets are in spreadsheets, flat files, XML, HTML tables, … We have to transform them to relational or RDF form. – Cleaning data: removing meaningless values, applying constraints, … – … Currently most data preparation is done manually.

Help users prepare their data Example: Data Wrangler (now part of Trifacta) http://vis.stanford.edu/wrangler/app/

Help users prepare their data Pick a widely used data format – spreadsheet, JSON, XML, log files, … Define natural and basic transformation operations for this format – Cleaning, re-organizing, transforming to relational or RDF format – Design a transformation interface. Design a Domain Specific Language (DSL). Predict/suggest transformation operations.
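As a rough idea of what a small transformation language could look like, the sketch below defines three basic operations over spreadsheet-like rows and a `run` function that applies a script of operation names, one per line. The operation names and the script format are hypothetical and are not taken from Data Wrangler.

```python
def drop_empty(rows):
    """Remove rows whose cells are all blank."""
    return [r for r in rows if any(cell.strip() for cell in r)]


def strip_cells(rows):
    """Trim surrounding whitespace from every cell."""
    return [[cell.strip() for cell in r] for r in rows]


def promote_header(rows):
    """Use the first row as column names; return one dict per remaining row."""
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


# The "language": each script line names one operation to apply in order.
OPERATIONS = {
    "drop_empty": drop_empty,
    "strip": strip_cells,
    "promote_header": promote_header,
}


def run(script, rows):
    """Apply the operations named in `script` (one per line) to `rows`."""
    for line in script.splitlines():
        name = line.strip()
        if name:
            rows = OPERATIONS[name](rows)
    return rows


if __name__ == "__main__":
    raw = [[" name ", " dept "], ["", ""], ["ann ", "sales"]]
    print(run("drop_empty\nstrip\npromote_header", raw))
```

Because a script is just a list of named steps, a suggestion feature could propose the next line of the script from the current state of the data.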

Theory projects Read some papers and approaches on a problem; analyze, compare, and/or extend them. – High technical depth / theory. – You may slightly extend an approach. Schema equivalency – One can represent the same data in different schemas: Emp(E, D), Manager(M, D) vs. EmpManager(E, M, D) – Given two relational schemas, how can we find out if they represent the same information? References: Representation dependence in probabilistic inference, J. Halpern, JAIR, 2004. Relative information capacity of simple relational schema, R. Hull, PODS, 1984.
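A quick way to build intuition for the schema question is to test, on one concrete instance, whether splitting EmpManager(E, M, D) into Emp(E, D) and Manager(M, D) and joining back on D returns the original rows. The sketch below does only that. It is an instance-level sanity check with invented data, not a test of schema equivalence, which has to hold over all instances.

```python
def decompose(emp_manager):
    """Split EmpManager(E, M, D) rows into Emp(E, D) and Manager(M, D)."""
    emp = {(e, d) for (e, m, d) in emp_manager}
    manager = {(m, d) for (e, m, d) in emp_manager}
    return emp, manager


def rejoin(emp, manager):
    """Natural join of Emp and Manager on D, giving (E, M, D) rows."""
    return {(e, m, d1) for (e, d1) in emp for (m, d2) in manager if d1 == d2}


def round_trips(emp_manager):
    """True if decomposing and rejoining reproduces exactly the original rows."""
    return rejoin(*decompose(emp_manager)) == set(emp_manager)


# One manager per department: the two representations agree on this instance.
ONE_MANAGER = {("ann", "max", "sales"), ("bo", "max", "sales"), ("cy", "zoe", "ops")}
# Two managers in one department: the join invents employee-manager pairs.
TWO_MANAGERS = {("ann", "max", "sales"), ("bo", "zoe", "sales")}

if __name__ == "__main__":
    print(round_trips(ONE_MANAGER), round_trips(TWO_MANAGERS))
```

The second instance shows why constraints matter: without something like "each department has one manager", the two schemas do not carry the same information.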

Good project Technically deep – More than building some forms over a database. Novel – Has some new ideas. Effectively presented. All within the scope of a term!

Project timeline Proposal due 1/19 – Group members, brief description of the problem. Midterm presentation due 2/3 – 2/4 – Clear definition of the problem, initial work and results, plan for the rest of the term. – Practice for the final presentation! Final presentation 8/4 – 10/4 – Final results, analysis of the results. Final report 11/4

What you should do Form teams. Evaluate possible topics for your project. Talk to the instructors and TAs. Submit your project proposal.

What is next Database system implementation – DBMS architecture, storage, and access methods You have two papers to review – rather short papers!
