CS 540 Database Management Systems Lecture 4: Project topics overview.

Outline How to choose a project topic? Broad topic areas How to pick a project, write your report, and present your work? Overview of sample project topics

What is a good research problem? A good research problem is a solvable challenge that is well connected to a real-world need/problem. Real-world challenges vs. imaginary challenges – Not all challenges are interesting (to society) – Real-world challenges are always interesting to work on – Imaginary challenges may (happen to) be interesting – Spend your effort on interesting challenges so that you contribute more to society. However, not all real-world problems are challenges; some are straightforward to solve. Not all challenges/problems are solvable (with limited resources, time, money, tools, etc.)

Identify a Good Research Problem (chart of level of challenge, from known to unknown, against impact/usefulness) – Known: good applications, not interesting for research – High impact, low risk (easy): good short-term research problems – High impact, high risk (hard): good long-term research problems – Low impact, difficult: often publishable, but not good research problems – Low impact, low risk: bad research problems, generally not publishable – Course project

Landscape of data management (figure) – Axes: query capability, scale, data complexity – Query capability: exact matching, inexact matching, inferences/mining – Data complexity: structured data, unstructured data, multimedia data – RDBMS

DB and related areas Databases Text Information Management (Information Retrieval) Multimedia Information Management Web/Bio Information Management Data Mining/Machine Learning

Map of general topic areas (figure) – Areas: Databases, IR, Multimedia, Web/Bio, Data Mining – Topics on the map: Core/Traditional DB; Web/Bio DB Applications; DB+IR; Data Mining, Decision Support; Multimedia DB

The big challenge “... Our biggest challenge is a unification of approximate and exact reasoning. Most of us come from the exact-reasoning world – but most of our clients are asking questions with approximate or probabilistic answers….” - Jim Gray [SIGMOD 2004]

How to do a bad project and give a bad presentation! Slides from “How to Have a Bad Career!” by David A. Patterson

How to Do a Bad Project? Let Complexity Be Your Guide (Confuse Thine Enemies) Best compliment: “It’s so complicated, I can’t understand the ideas” Easier to claim credit for subsequent good ideas – If no one understands, how can they contradict your claim? It’s easier to be complicated If it were not unsimple then how could distinguished colleagues in departments around the world be positively appreciative of both your extraordinary intellectual grasp of the nuances of issues as well as the depth of your contribution?

How to Do a Bad Project? Never be Proven Wrong Avoid Implementing Avoid Quantitative Experiments – If you’ve got good intuition, who needs experiments? – Takes too long to measure Avoid Benchmarks Projects whose payoff is ≥ 20 years give you 19 safe years

How to Do a Bad Project? Use the Computer Scientific Method Computer Scientific Method: – Hunch – 1 experiment & change all parameters – Discard if doesn’t support hunch – Why waste time? We know this Obsolete Scientific Method: – Hypothesis – Sequence of experiments – Change 1 parameter/exp. – Prove/Disprove Hypothesis – Document for others to reproduce results

5 Commandments for Bad Writing I. Thou shalt not define terms, nor explain anything. – That’s why there are dictionaries. It insults the readers. II. Thou shalt replace “will do” with “have done”. – After all, someone is likely to build it in the next 2 to 3 years. III. Thou shalt not mention drawbacks to your approach. – That’s not your job; let others find the flaws. IV. Thou shalt not reference any papers. – If they were good people, they’d be at your institution. V. Thou shalt write before implementing. – Highest performance.

7 Talk Commandments for a Bad Talk I. Thou shalt not illustrate. II. Thou shalt not covet brevity. – Do you want to continue the stereotype that engineers can't write? Always use complete sentences, never just key words. If possible, use whole paragraphs and read every word. III. Thou shalt not print large. – Be humble: use a small font. Important people sit in front. IV. Thou shalt not use color. V. Thou shalt cover thy naked slides. VI. Thou shalt not skip slides in a long talk. – You prepared the slides; people came for your whole talk; so just talk faster. VII. Thou shalt not practice. – Why waste research time practicing a talk?

Following all the commandments We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. 
In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism.

Following all the commandments We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. Our compiling strategy is to exploit coarse-grain parallelism at function application level: and the function application level parallelism is implemented by fork-join mechanism. The compiler translates source programs into control flow graphs based on analyzing flow of control, and then serializes instructions within graphs according to flow arcs such that function applications, which have no control dependency, are executed in parallel. A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. 
Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. We assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain control flow execution. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism. How to Do a Bad Poster David Patterson University of California Berkeley, CA 94720

Alternatives to Bad Papers Do the opposite of the Bad Paper commandments – Define terms, distinguish “will do” vs. “have done”, mention drawbacks, report real performance, reference other papers. Find related work. First read Strunk and White, then follow these steps: 1. 1-page paper outline, with tentative page budget per section 2. Paragraph map: 1 topic phrase/sentence per paragraph 3. (Re)write draft – Long captions per figure can contain details – Use tables to hold facts that would make dreary prose 4. Read aloud, spell check & grammar check 5. Get feedback from friends and critics on draft; go to 3.

Alternatives to Bad Talk Do the opposite of the Bad Talk commandments I. Thou shalt not illustrate. II. Thou shalt not covet brevity. III. Thou shalt not print large. IV. Thou shalt not use color. V. Thou shalt cover thy naked slides. VI. Thou shalt not skip slides in a long talk. VII. Thou shalt not practice. Allocate 2 minutes per slide, leave time for questions Don’t over animate Do dry runs with friends/critics for feedback, – including tough audience questions Record a practice talk (audio or video) – Don’t memorize speech, but have notes ready

Sample Project Topics

Query and visualize RDF data Many graph datasets are in Resource Description Framework (RDF) format – Also called linked data RDF database – A set of triples: subject predicate object The number and size of data sets are rapidly growing. Wikidata, DBpedia, FOAF, Knowledge graph, … You may find datasets at linkeddata.org, rdfdata.org, …
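To make the triple model concrete, here is a minimal sketch of an RDF-style database as a list of (subject, predicate, object) tuples, with a pattern match in which None is a wildcard. The data and the function name match are made up for illustration; a real triple store would use IRIs and indexes.

```python
# Hypothetical toy data; the names are invented for illustration.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# All "knows" edges in the graph
print(match(TRIPLES, p="knows"))
```

Each call corresponds roughly to a single triple pattern in a query language such as SPARQL.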

Query and visualize RDF data RDF database – No prescribed schema: easy to create and extend; Semantic Web standard – Hard to formulate queries! – Query processing is relatively inefficient. RDF data management systems / triple stores – Public: Apache Jena, KiWi, … – Proprietary: IBM DB2, Oracle, … SPARQL query language – Similar to SQL

Query and visualize RDF data Create an easy-to-use query interface for RDF data – There is some work on keyword search over RDF: low precision, slow – You may combine SPARQL with some keyword search features. – Query suggestion, auto-completion, … for SPARQL or keyword queries.
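As a starting point for such an interface, the sketch below shows two of the features in their simplest form: predicate auto-completion ranked by frequency, and a naive substring keyword search over triples. The data, function names, and ranking rule are assumptions for illustration, not a prescribed design.

```python
from collections import Counter

# Hypothetical toy data for illustration.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "keeps", "cat"),
]

def complete_predicate(triples, prefix):
    """Suggest predicates starting with prefix, most frequent first."""
    counts = Counter(p for _, p, _ in triples if p.startswith(prefix))
    return sorted(counts, key=lambda p: (-counts[p], p))

def keyword_search(triples, keyword):
    """Return triples in which any component contains the keyword."""
    kw = keyword.lower()
    return [t for t in triples if any(kw in part.lower() for part in t)]

print(complete_predicate(TRIPLES, "k"))
print(keyword_search(TRIPLES, "acme"))
```

The substring scan is linear in the number of triples, which is one reason naive keyword search is slow on large datasets.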

Query and visualize RDF data The results of RDF queries are usually not easy to understand – Large graphs Create an interface that summarizes the results – Show the most important/relevant nodes/links first – User can navigate over results – You may do this for the whole database It helps users to understand the structure of the data and specify queries.
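One simple way to decide which nodes to show first is to rank them by degree. The sketch below uses degree as a stand-in for importance, which is an assumption; the slide does not say how importance or relevance should be measured, and the function names are hypothetical.

```python
from collections import Counter

def top_nodes(triples, k):
    """Return the k nodes with the highest degree (ties broken by name)."""
    degree = Counter()
    for s, _, o in triples:
        degree[s] += 1
        degree[o] += 1
    return sorted(degree, key=lambda n: (-degree[n], n))[:k]

def summarize(triples, k):
    """Keep only the triples whose subject and object are both top-k nodes."""
    keep = set(top_nodes(triples, k))
    return [t for t in triples if t[0] in keep and t[2] in keep]

# Hypothetical query result
RESULT = [("a", "p", "b"), ("a", "p", "c"), ("b", "p", "c"), ("c", "p", "d")]
print(top_nodes(RESULT, 2))
print(summarize(RESULT, 2))
```

Raising k as the user navigates would progressively reveal the rest of the result graph.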

Query and visualize RDF data Create an interaction interface over RDF – Users usually interact with the database over a long period of time Submit query => explore the result => formulate the next query => explore the result => … – The interface makes it easier for users to formulate queries based on the current results. Keeps a history of previous queries

Querying relational data Most users do not know the schema and content of their relational databases. Create an interface that helps users write SQL queries – Query completion and suggestion – Create visualization of the schema More important tables at higher level.
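A basic form of query completion can be driven by the database catalog. The sketch below uses Python's sqlite3 module to collect table and column names and suggest those matching a typed prefix. The tables and the suggest function are hypothetical, and a real interface would also use the query context.

```python
import sqlite3

def schema_names(conn):
    """Collect table and column names from a SQLite database."""
    names = set()
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        names.add(table)
        # PRAGMA table_info returns one row per column; index 1 is its name
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            names.add(row[1])
    return names

def suggest(conn, prefix):
    """Suggest schema names that start with prefix (case-insensitive)."""
    p = prefix.lower()
    return sorted(n for n in schema_names(conn) if n.lower().startswith(p))

# Hypothetical schema for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (E TEXT, D TEXT)")
conn.execute("CREATE TABLE Department (D TEXT, Budget INTEGER)")
print(suggest(conn, "d"))
```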

Data independence The relational model is not access path independent. How can you make SQL more access path independent? – Map the schema of the query to the schema of the database. Database schema: EmpManager(E, M, D) User assumes the schema: Emp(E, D), Manager(M, D) User query: select E from Emp => transformed query: select E from EmpManager – Try all possible schemas. Slow! Data independent learning and inference
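The schema mapping in this example can be illustrated with ordinary SQL views. In the sketch below, which uses sqlite3, the views Emp and Manager are written by hand over EmpManager, so the user's query runs unchanged. This only illustrates the target of the mapping; finding such a mapping automatically is the open part of the project, and the sample rows and the user_query helper are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Actual database schema
conn.execute("CREATE TABLE EmpManager (E TEXT, M TEXT, D TEXT)")
conn.executemany(
    "INSERT INTO EmpManager VALUES (?, ?, ?)",
    [("ann", "mia", "sales"), ("bob", "mia", "sales"), ("cy", "ned", "ops")],
)
# Hand-written mapping from the schema the user assumes to the real one
conn.execute("CREATE VIEW Emp AS SELECT DISTINCT E, D FROM EmpManager")
conn.execute("CREATE VIEW Manager AS SELECT DISTINCT M, D FROM EmpManager")

def user_query(sql):
    """Run a query written against the schema the user assumes."""
    return conn.execute(sql).fetchall()

print(user_query("SELECT E FROM Emp ORDER BY E"))
```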

Visualize relational data Create a visualization engine for SQL queries – Many users like to see charts and visualizations instead of tables. – Visualization engines do not normally work with relational databases. Create an interactive query interface for SQL – Keeps a history of previous queries
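As a very small illustration of charting a query result, the sketch below renders a two-column (label, number) SQL result as text bars using sqlite3. The sales table and the bar_chart function are hypothetical; a real engine would choose the chart type from the result's shape.

```python
import sqlite3

def bar_chart(conn, sql, width=20):
    """Render the (label, number) rows of a query as text bars."""
    rows = conn.execute(sql).fetchall()
    top = max((value for _, value in rows), default=0)
    lines = []
    for label, value in rows:
        # Scale each bar relative to the largest value
        bar = "#" * (round(width * value / top) if top else 0)
        lines.append(f"{label:<8}{bar} {value}")
    return lines

# Hypothetical data for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 4), ("east", 6), ("west", 5)])
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
print("\n".join(bar_chart(conn, QUERY, width=10)))
```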

Data preparation Most data scientists spend about 80% of their time on data preparation! – Transforming data from one form to another Most data sets are in spreadsheets, flat files, XML, HTML tables, … We have to transform them to relational or RDF form. – Cleaning data Removing meaningless values, applying constraints, … – …. Currently most data preparation is done manually.

Help users prepare their data Example: Data Wrangler (now part of Trifacta)

Help users prepare their data Pick a widely used data format – Spreadsheet, JSON, XML, log files, … Define natural and basic transformation operations for this format – Cleaning, re-organizing, transforming to relational or RDF format – Design a transformation interface Design a Domain Specific Language (DSL). Predict/suggest transformation operations
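To give a feel for what a transformation DSL might look like, the sketch below treats a script as a list of (operation, arguments) steps applied to rows read from CSV. The operation names drop_blank, to_int, and rename are invented for this example; choosing the right set of basic operations is the actual design problem.

```python
import csv
import io

# Hypothetical operations; a real DSL would define many more.
def drop_blank(rows, column):
    """Cleaning: remove rows whose value in column is empty."""
    return [r for r in rows if r[column].strip()]

def to_int(rows, column):
    """Typing: convert a column from text to integers."""
    return [{**r, column: int(r[column])} for r in rows]

def rename(rows, old, new):
    """Re-organizing: rename a column."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

OPS = {"drop_blank": drop_blank, "to_int": to_int, "rename": rename}

def run(script, rows):
    """Apply each (operation, *arguments) step of the script in order."""
    for name, *args in script:
        rows = OPS[name](rows, *args)
    return rows

RAW = "name,yrs\nann,3\nbob,\ncy,7\n"
SCRIPT = [("drop_blank", "yrs"), ("to_int", "yrs"), ("rename", "yrs", "years")]
print(run(SCRIPT, list(csv.DictReader(io.StringIO(RAW)))))
```

Because a script is plain data, an interface could record the user's steps as a script, replay it on new files, or suggest the next step.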

Theory projects Read some papers and approaches on a problem, analyze, compare, and/or extend them. – High technical depth / theory. – You may slightly extend one approach. Schema equivalency – One can represent the same data in different schemas: Emp(E, D), Manager(M, D) vs. EmpManager(E, M, D) – Given two relational schemas, how can we find out if they represent the same information? Representation dependence in probabilistic inference, J. Halpern, JAIR, Relative information capacity of simple relational schema, R. Hull, PODS,
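A small experiment shows why the question is not trivial. The sketch below splits an EmpManager instance into Emp and Manager, joins them back on D, and checks whether the original instance is recovered. It tests single instances only, whereas schema equivalence is a statement about all instances; the data and function names are made up.

```python
def decompose(emp_manager):
    """Project EmpManager(E, M, D) onto Emp(E, D) and Manager(M, D)."""
    emp = {(e, d) for e, m, d in emp_manager}
    manager = {(m, d) for e, m, d in emp_manager}
    return emp, manager

def recompose(emp, manager):
    """Join Emp and Manager on the department attribute D."""
    return {(e, m, d) for e, d in emp for m, d2 in manager if d == d2}

def round_trips(emp_manager):
    """True if decomposing and rejoining returns the same instance."""
    return recompose(*decompose(emp_manager)) == set(emp_manager)

# One manager per department: no information is lost
ONE_MANAGER = {("ann", "mia", "sales"), ("bob", "mia", "sales")}
# Two managers in one department: the join invents employee-manager pairs
TWO_MANAGERS = {("ann", "mia", "sales"), ("bob", "ned", "sales")}
print(round_trips(ONE_MANAGER), round_trips(TWO_MANAGERS))
```

Whether the two schemas carry the same information therefore depends on the constraints assumed, for example whether each department has a single manager.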

Good project Technically deep – More than building some forms over a database Novel – Has some new ideas Effectively presented All in the scope of a term!

Project timeline Proposal due 1/19 – Group members, brief description of the problem. Midterm presentation due 2/3 – 2/4 – Clear definition of the problem, initial work and result, plan for the rest of the term. – A practice for the final presentation! Final presentation 8/4 – 10/4 – Final results, analysis of the results. Final report 11/4

What you should do Form teams. Evaluate possible topics for your project. Talk to the instructors and TAs. Submit your project proposal.

What is next Database system implementation – DBMS architecture, storage, and access methods You have two papers to review – rather short papers!