CC5212-1 Procesamiento Masivo de Datos, Otoño 2015. Lecture 11: Conclusion. Aidan Hogan



FULL-CIRCLE

The value of data …

Twitter architecture

Google architecture

Generalise concepts to …

Working with large datasets

Value/danger of distribution

Frameworks: for distributed processing, for distributed storage

The Big Data Buzz-word

Distributed Systems

GFS (HDFS) / MapReduce (Hadoop)

Information Retrieval

NoSQL and Querying

EXAM …

Course Marking
- 45% for Weekly Labs (~3% a lab!) [Easy]
- 20% for Small Class Project [Medium]
- 35% for Final Exam [Hard]

Final Exam (35%)
- Goal: test your understanding of concepts
  - Coding is covered by the labs/project
  - No syntax-writing questions, but there will be design and syntax-reading questions
- Max. three hours
- You are not marked on your English
  - If stuck, write in Spanish!
- Four questions (marked on best three) …

The following is not a legally binding agreement. It is just a helpful guide to what is important.

Q1: Distributed Systems

Q1: Distributed Systems (Slides)
Slides:
- Lecture 2: Distributed Systems I
- Lecture 3: Distributed Systems II
Names per the homepage:

Q1: Distributed Systems (Topics)
Possible Topics:
- Advantages/disadvantages of a distributed system
- Five distributed system design goals
- Distributed architectures (P2P vs. C–S, Fat/Thin, n-Tier, etc.)
- Java RMI (high-level)
- Eight fallacies of distributed computing
- Consensus basics (fail-stop vs. Byzantine, synchronous vs. asynchronous, goals)
- Consensus protocols (2PC, 3PC, Paxos)
Note: the CAP theorem may appear in Q4, but will not appear in Q1
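As a reminder of the simplest consensus protocol listed above, here is a minimal in-memory sketch of two-phase commit (2PC). It is not taken from the lecture slides: the `Participant` class, its `can_commit` flag, and the state names are hypothetical, and the sketch assumes no crashes, timeouts, or lost messages (exactly the cases that motivate 3PC and Paxos).

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant; can_commit simulates its local vote."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes (ready) or no (aborted) on the coordinator's request.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision: str) -> None:
        # Phase 2: apply the coordinator's global decision.
        self.state = decision


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (voting): ask every participant, without short-circuiting.
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(votes) else "aborted"
    # Phase 2 (completion): broadcast the decision to all participants.
    for p in participants:
        p.finish(decision)
    return decision
```

A single "no" vote is enough to abort the whole transaction, which is why a blocked or failed participant is a problem for 2PC in practice.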

GFS (HDFS) / MapReduce (Hadoop)

Q2: GFS (HDFS) / MapReduce (Hadoop) (Slides)
Slides:
- Lecture 4: DFS & MapReduce I
- Lecture 6: DFS & MapReduce III
- (Lecture 5 was whiteboard only, practicing MapReduce design)
Names per the homepage:

Q2: GFS (HDFS) / MapReduce (Hadoop) (Topics)
Possible Topics:
- Google File System (reads, writes, fault-tolerance)
- MapReduce (incl. design question)
- HDFS/Hadoop (architecture)
- Pig (high-level, e.g., explain what a given script does)
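For the MapReduce design question, it may help to recall the shape of a job: a map function emitting key-value pairs, a shuffle that groups values by key, and a reduce function per key. The sketch below simulates that flow for word count on a single machine. It is an illustration only, not Hadoop code; the function names are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit (word, 1) for every word in one input document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


result = word_count({"d1": "a b a", "d2": "b c"})
# result == {"a": 2, "b": 2, "c": 1}
```

In a real cluster the map and reduce calls run in parallel on different machines; only the two user-defined functions change from one design question to the next.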

Information Retrieval

Q3: Information Retrieval (Slides)
Slides:
- Lecture 7: Information Retrieval I
- Lecture 8: Information Retrieval II
Names per the homepage:

Q3: Information Retrieval (Topics)
Possible Topics:
- Crawling (high-level multi-threading, (D)DoS, robots.txt, sitemap, distribution, bow-tie)
- Inverted indexes (data structure, normalisation, Heaps' law, Zipf's law, Elias encoding, etc.)
- Ranking (relevance vs. importance, TF-IDF, Vector Space Model, etc.)
- PageRank (concept, random surfer, calculation)
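To practise the PageRank calculation, the following sketch runs the usual iterative computation on a small graph. It is one common formulation, not necessarily the exact one from the lectures: the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions of this example. Every link target must also appear as a key in `graph`.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict mapping page -> list of out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Random-surfer jump: every page gets an equal share of (1 - damping).
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, out_links in graph.items():
            if out_links:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[v] / len(out_links)
                for w in out_links:
                    new_rank[w] += share
            else:
                # Assumption: a page with no out-links spreads its rank over all pages.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

Here page `c` has no in-links, so its rank comes only from the random jump, (1 - 0.85) / 3 = 0.05, and the ranks of all pages sum to 1.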

Bring a Calculator!

NoSQL and Querying

Q4: NoSQL and Querying (Slides)
Slides:
- Lecture 9: NoSQL I
- Lecture 10: NoSQL II
- Lecture 3: Distributed Systems II (slides on CAP)
Names per the homepage:

Q4: NoSQL and Querying (Topics)
Possible Topics:
- CAP theorem (<- note out of order)
- The Database Landscape
- Key–Value stores (data model, operations, distribution, consistent hashing, replication, Dynamo, Merkle trees)
- Document stores (high-level)
- Tabular/column-families (data model, Bigtable, sorting, tablets, column families, SSTables, writes, reads, compactions, hierarchy, Bloom filters)
- Graph databases (high-level)
- Cassandra (high-level)
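Consistent hashing is easier to remember with a concrete ring in mind. The sketch below is a simplified illustration, not the Dynamo implementation: the `HashRing` class, the choice of SHA-256, and the default of 100 virtual points per node are assumptions of this example, and replication is left out.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring3 = HashRing(["n1", "n2", "n3"])
ring4 = HashRing(["n1", "n2", "n3", "n4"])
keys = [f"key{i}" for i in range(1000)]
moved = [k for k in keys if ring3.lookup(k) != ring4.lookup(k)]
```

Every key in `moved` is now owned by the new node `n4`; no key moves between the existing nodes, which is the property that makes adding or removing machines cheap in a key-value store.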

Final Exam (35%)
- Goal: test your understanding of concepts
  - Coding is covered by the labs/project
  - No syntax-writing questions, but there will be design and syntax-reading questions
- Max. three hours
- You are not marked on your English
  - If stuck, write in Span(gl)ish!
- Four questions (marked on best three) …

SOME ADVERTISING

Poll … lab on Wednesday? vs.

Next semester … La Web de Datos (Web of Data / Semantic Web / Linked Data)

November … teaching assistant (ayudante) wanted
- “Diplomado” (diploma course) version of this course
- 8 x 3 hour lectures over 3 weeks
- Help with:
  - Translating slides
  - Attending class
  - Laboratories
  - Marking
- 1 UF per hour! ($25,000/hour)