CPT-S 483-05 Topics in Computer Science Big Data 1 Yinghui Wu EME 49.


1 CPT-S 483-05 Topics in Computer Science: Big Data
Yinghui Wu, EME 49

2 Big data: conclusion (CPT_S 483-05 Big Data)
– Course summary
– Course project and presentation

3 Big picture
– Big Data model: Big V's; relational model; XML and RDF
– Big Data systems: relational DBMS; noSQL DBMS; parallel system architectures
– Search Big Data: query models; sequential search strategies; parallel and distributed query processing; parallel and distributed systems
– Special topics in Big Data: cloud computing; selected topics in data mining; visualization; data security and privacy
Questions covered: what Big Data is; how to represent and store Big Data; how to search Big Data; critical topics related to Big Data

4 Big data models and systems

5 Big data: the 4 big V's
Big data models: the 4 big V's, data types, applications, research trends and topics
– What is big data? A large, complex data set; a challenge; a trend; an approach to data analytics
– What is the volume of big data? Variety? Velocity? Veracity?
– Why do we care about big data?
– Is there any fundamental challenge introduced by querying big data?
– Why study Big Data?

6 Relational data models and DBMS
Relational DBMS
– Relational model concepts
– Relational model constraints and schemas
– Update operations and dealing with constraint violations
The relational model has rigorously defined query languages that are simple and powerful. Relational algebra is more operational, which makes it useful as an internal representation for query evaluation plans. A given query can be expressed in several ways; a query optimizer should choose the most efficient version.
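To make the point about equivalent plans concrete, here is a minimal sketch in plain Python. The relations `students` and `enrolled`, their rows, and the helper names `select`, `project`, and `join` are all hypothetical and chosen only for illustration. Plan A joins first and filters afterwards; Plan B pushes the selections below the join, so the join sees fewer rows.

```python
# Hypothetical toy relations; names and rows are illustrative only.
students = [
    {"sid": 1, "name": "Ann", "major": "CS"},
    {"sid": 2, "name": "Bob", "major": "EE"},
    {"sid": 3, "name": "Cat", "major": "CS"},
]
enrolled = [
    {"sid": 1, "course": "BigData"},
    {"sid": 2, "course": "BigData"},
    {"sid": 3, "course": "Databases"},
]

def select(rows, pred):
    """Relational selection: keep rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    """Relational projection with duplicate elimination."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def join(left, right, attr):
    """Nested-loop equi-join on a shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Plan A: join everything, then filter.
joined = join(students, enrolled, "sid")
plan_a = project(
    select(joined, lambda r: r["major"] == "CS" and r["course"] == "BigData"),
    ["name"],
)

# Plan B: filter each input first, then join the smaller inputs.
cs_students = select(students, lambda r: r["major"] == "CS")
big_data_rows = select(enrolled, lambda r: r["course"] == "BigData")
plan_b = project(join(cs_students, big_data_rows, "sid"), ["name"])
```

Both plans return the same answer, but the nested-loop join in Plan B compares 2 x 2 row pairs instead of 3 x 3. This is the kind of difference a query optimizer looks for.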

7 DBMS architectures & design
DBMS architecture
– Centralized
– Client-server
– Parallel
– Distributed
RDBMS design
– The normal forms: 1NF, 2NF, 3NF, BCNF
– Normal form transformation
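As an illustration of how a normal form can be checked, the following sketch computes attribute closures and reports the functional dependencies that violate BCNF. The schema and the dependencies are a hypothetical textbook-style example, not taken from the course material, and the function names are assumptions.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def violates_bcnf(schema, fds):
    """Return the nontrivial dependencies whose left-hand side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != set(schema)
    ]

# Hypothetical schema: a student takes a course with one instructor,
# and each instructor teaches exactly one course.
schema = {"sid", "course", "instructor"}
fds = [
    ({"sid", "course"}, {"instructor"}),
    ({"instructor"}, {"course"}),
]
bad = violates_bcnf(schema, fds)
```

Here `instructor -> course` is flagged because `instructor` alone does not determine `sid`, so it is not a superkey. A transformation to BCNF would decompose the relation along that dependency.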

8 Beyond Relational Data
Introduction to XML
– XML basics
– DTD
– XML Schema
– XML constraints
Introduction to RDF
– RDF data model and syntax
– RDF schemas
– RDF inferencing
Key points on XML
– XML is a prime data exchange format.
– DTD provides useful syntactic constraints on documents.
– XML Schema extends DTD by supporting a rich type system.
– Integrity constraints are important for XML, yet are nontrivial.
Key points on RDF
– RDF provides a foundation for representing and processing metadata.
– RDF has a graph-based data model.
– RDF has an XML-based syntax to support syntactic interoperability.
– RDF has a decentralized philosophy and allows incremental building of knowledge, and its sharing and reuse.
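The graph-based data model and RDF inferencing can be sketched without any RDF library by treating a graph as a set of (subject, predicate, object) triples. The `ex:` resources below are made up for illustration, and `match` and `infer_types` are hypothetical helper names. The only inference rule applied is the RDFS subclass rule: if x has type C and C is a subclass of D, then x has type D.

```python
# Hypothetical triples; every "ex:" name is invented for this example.
triples = {
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:GradStudent", "rdfs:subClassOf", "ex:Student"),
    ("ex:alice", "rdf:type", "ex:GradStudent"),
    ("ex:alice", "ex:takes", "ex:BigData"),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

def infer_types(graph):
    """Apply the subclass rule repeatedly until no new rdf:type triple appears."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        for x, _, c in match(closed, p="rdf:type"):
            for _, _, d in match(closed, s=c, p="rdfs:subClassOf"):
                if (x, "rdf:type", d) not in closed:
                    closed.add((x, "rdf:type", d))
                    changed = True
    return closed

inferred = infer_types(triples)
alice_types = sorted(o for _, _, o in match(inferred, s="ex:alice", p="rdf:type"))
```

After inference, `ex:alice` is typed as `ex:GradStudent`, `ex:Student`, and `ex:Person`, although only the first was stated explicitly.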

9 Beyond RDBMS: noSQL
noSQL: concept and theory
– CAP theorem
– ACID vs BASE
– noSQL vs RDBMS
noSQL databases
– Key-value stores
– Document DBs
– Column family
– Graph databases
Strengths of noSQL
– Cheap, easy to implement (open source)
– Data are replicated to multiple nodes (fault-tolerant)
– Easy to distribute
– Don't require a schema
– Can scale up and down
– Relax the data consistency requirement (CAP)
Strengths of RDBMS
– Joins, ACID transactions
– SQL as a sometimes frustrating but still powerful query language
– Easy integration with other applications that support SQL
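A schema-free key-value store with replication can be sketched in a few lines. The class below is a toy, in-memory illustration only: the class name, the placement rule (hash the key, then write to consecutive nodes), and the `failed` parameter used to simulate node failures are all assumptions, and no real noSQL system is being modeled.

```python
import hashlib

class ReplicatedKVStore:
    """Toy key-value store: each key is written to `replicas` consecutive nodes."""

    def __init__(self, num_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def _owners(self, key):
        # Use a stable hash so placement does not depend on Python's hash seed.
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        first = h % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Values are stored as-is; no schema is enforced.
        for n in self._owners(key):
            self.nodes[n][key] = value

    def get(self, key, failed=()):
        # Read from the first replica that is not marked as failed.
        for n in self._owners(key):
            if n not in failed:
                return self.nodes[n].get(key)
        raise RuntimeError("all replicas unavailable")

store = ReplicatedKVStore()
store.put("user:1", {"name": "Ann", "tags": ["admin"]})
owners = store._owners("user:1")
value_after_failure = store.get("user:1", failed={owners[0]})
```

The read still succeeds after the first owner is marked as failed, which is the fault tolerance the replication buys. This sketch writes to all replicas synchronously, so it does not show the relaxed consistency that real systems trade for availability.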

10 Big data search

11 Query Processing
– Querying framework overview
– Measures of query cost
– Basics of database operations
– Basics of graph queries
– Basics of graph algorithms
    Graph search (traversal)
    PageRank
    Nearest neighbors
    Keyword search
    Graph pattern matching
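Of the graph algorithms listed, PageRank is easy to sketch as a power iteration. The version below is a minimal illustration under stated assumptions: the damping factor 0.85 and the fixed iteration count are conventional choices rather than values from the course, the graph `web` is hypothetical, and every out-neighbor is assumed to appear as a key of the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; `graph` maps each node to its out-neighbors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts with the teleportation share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
top = max(ranks, key=ranks.get)
```

Page `c` collects links from every other page and ends up with the highest rank, while `d`, which nothing links to, keeps only the teleportation share.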

12 Big data search: Make big data small
Figure: an exact query on "Big Data" gives an exact answer but long response times; an approximate query on "Small Data" (KB/MB) gives an approximate answer, fast.
– Approximate query evaluation
    query driven: approximate query models
    data driven: synopses, histograms, sampling, sketches, spanners…
– View-based query evaluation
– Make big data small: indexing, sketches, sampling, spanners
– Cope with data streams: incremental query evaluation
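The data-driven idea can be illustrated with uniform sampling: answer an aggregate query on a small random sample instead of the full data set. The data, the sample size, and the function name `approximate_mean` below are hypothetical, and the seed is fixed only to make the run repeatable.

```python
import random

def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return sum(sample) / len(sample)

# Stand-in for a large data set.
big_data = list(range(1_000_000))

exact = sum(big_data) / len(big_data)                    # scans every value
approx = approximate_mean(big_data, sample_size=10_000)  # reads 1% of the values
relative_error = abs(approx - exact) / exact
```

The estimate touches only 10,000 of the 1,000,000 values. The price is a small error whose size depends on the sample size and on the variance of the data.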

13 Parallel data management
– Parallel DBMS
– Architectures
– 4 Parallelism: intraquery, interquery, intraoperation, interoperation
– MapReduce: mapper, reducer
Figure: a query Q(D) evaluated in parallel over fragments D1, D2, …, Dn of the data D.
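The mapper and reducer roles can be sketched with a single-process word count. This is only a simulation of the programming model: the fragments stand in for the partitions D1, …, Dn, the function names are assumptions, and nothing here runs in parallel or uses Hadoop.

```python
from collections import defaultdict

def mapper(fragment):
    """Emit (word, 1) for every word in one data fragment."""
    for line in fragment:
        for word in line.lower().split():
            yield word, 1

def reducer(word, counts):
    """Combine all values emitted for one key."""
    return word, sum(counts)

def map_reduce(fragments):
    # Map and shuffle: group the emitted values by key.
    # In a real system each fragment would be handled by a separate worker.
    groups = defaultdict(list)
    for fragment in fragments:
        for key, value in mapper(fragment):
            groups[key].append(value)
    # Reduce: one call per key.
    return dict(reducer(k, v) for k, v in groups.items())

# Two hypothetical fragments of the input.
fragments = [["big data is big"], ["data about data"]]
counts = map_reduce(fragments)
```

Because each mapper call sees only its own fragment and each reducer call sees only one key, both phases can be distributed across machines without changing the functions.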

14 Hadoop
– Hadoop: history, features and design
– Hadoop ecosystem
    HDFS
    Hive & Pig
    HBase
    ZooKeeper

15 Querying Big Graphs: Beyond MapReduce
Parallel graph programming models
– MapReduce for BFS, for distance queries, PageRank…
– Vertex-centric programming: GraphLab and Pregel
– Graph-centric programming: Giraph++
– GRAPE: hybrid models
Figure label: Virtual Processors
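The vertex-centric style can be sketched as a sequence of supersteps in which each active vertex reads its messages, updates its own state, and sends messages to its neighbors. The code below simulates that loop in one process for BFS distances. It is an illustration of the general idea only and does not use the Pregel or GraphLab APIs; the graph and the function name are hypothetical.

```python
def vertex_centric_bfs(graph, source):
    """Simulated supersteps; `graph` maps each vertex to its out-neighbors."""
    dist = {v: float("inf") for v in graph}
    # Messages to be delivered at the start of the next superstep.
    inbox = {source: [0]}
    supersteps = 0
    while inbox:
        outbox = {}
        # Conceptually, all vertices with messages run in parallel.
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                for w in graph[v]:
                    outbox.setdefault(w, []).append(best + 1)
        inbox = outbox
        supersteps += 1
    return dist, supersteps

# Hypothetical directed graph.
g = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
dist, steps = vertex_centric_bfs(g, "s")
```

A vertex that receives no improving message stays silent, so the computation halts once no messages are in flight. The number of supersteps tracks the depth of the search rather than the size of the graph.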

16 Cloud Computing
– Cloud computing concept
– Service models and architecture
– Features and characteristics
– Pros and cons

17 Big data: Special topics

18 Data Mining and Graph Mining Basics
– Data mining: from data to knowledge
– Graph mining: pattern mining
– Classification: decision tree
– Clustering: k-means, DBSCAN
Figure: a decision tree whose root tests age? (branches <=30, 31..40, >40), with inner tests student? (no/yes) and credit rating? (fair/excellent) and yes/no leaves.
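As a small illustration of the clustering item, here is a sketch of k-means (Lloyd's algorithm) on 2-D points in plain Python. The points and the starting centroids are hypothetical; the centroids are passed in explicitly so the run is deterministic, whereas practical implementations usually choose them at random.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(c)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this input the two groups are well separated, so the assignment is stable after the first update and the loop stops early.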

19 Veracity of Big Data
Figure: Discover rules → Reasoning → Detect errors → Repair
– Data quality management: an overview
– Central aspects of data quality
    Data consistency
    Entity resolution
    Information completeness
    Data currency
    Data accuracy
– Deducing the true values of objects in data fusion
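The "detect errors" step for data consistency can be sketched with a rule check. The example below assumes the rule is a functional dependency (rows that agree on `zip` must agree on `city`), which is one common kind of data quality rule and not necessarily the one used in the course. The records and the function name are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the groups of rows that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    return {
        key: grp
        for key, grp in groups.items()
        if len({tuple(r[a] for a in rhs) for r in grp}) > 1
    }

# Hypothetical records; the second row contains a misspelled city.
records = [
    {"zip": "99163", "city": "Pullman"},
    {"zip": "99163", "city": "Pulman"},
    {"zip": "98101", "city": "Seattle"},
]
conflicts = fd_violations(records, lhs=["zip"], rhs=["city"])
```

The check only reports that the two rows with zip `99163` conflict. Deciding which value is correct is the separate repair step.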

20 Data Visualization, security and privacy
– Information visualization
– Graph visualization
    Graph drawing and graph visualization
    Graph layout
– Information security: basic concepts
– Privacy: basic concepts and comparison with security

21 The future of Big Data

22 Machine BI: Business Intelligence

23 Smart Living

24 Big data & Healthcare

25 Future of Big Data techs (NSF National Priorities)

26 Project presentation
Presentation (15 minutes + 2-3 minutes Q&A)
– Background and motivation
    Why the problem is important
    Applications of the solutions
    Challenges and differences from related work
– Problem formulation
    Input and output
    Objective function, if any
– Algorithm description
    Correctness analysis
    Complexity analysis
    Properties, features, optimization techniques
– Experimental study
    Data sets or generation of data sets
    Algorithms implemented, baseline algorithms, platforms, test settings
    Figures: show the trends and explain them
    Summary of experimental results
– Conclusion and future work
    How your current work can be improved

27 General tips
– Every talk motivates a problem
– A talk is about ideas
– Simple slides are better
– A picture is worth a thousand words
– Keep the logic flowing
– Prepare for questions
– Practice makes perfect

28 Course project report
You have milestones 1-5. Combine them into your report and enrich it with experimental results, references and future work. Do not simply put them together.
Make the title concise and to the point.
Abstract: describe your problem, solution and experimental results.
Section 1: Introduction
– Background
– Challenges (why your problem is hard)
– Related work
Section 2: Problem statement
– Input/output
– Objective function
– Hardness of the problem
Section 3: Solution
– Algorithm description
– Correctness and time complexity analysis (try big O notation)
– Optimization techniques (e.g., applying Big Data search strategies)

29 Course project report
Section 4: Experimental study
– Data sets or generation of data sets
– Algorithms implemented, baseline algorithms, platforms, test settings
– Figures: show the trends and explain them
– Summary of experimental results
Section 5: Conclusion & future work
– What have you observed in your project?
– Which problems remain unresolved? What are possible extensions of your problem? What is your plan to solve them in the future?

30 CPT_S 483-05 Big Data
I hope you enjoyed this course and found it useful! Thank you.

