
1 Big Data a small introduction Prabhakar TV IIT Kanpur, India tvp@iitk.ac.in Much of this content is generously borrowed from all over the Internet

2 Let us start with a story: Can we predict that a customer is expecting a baby?

3 “As Pole's (a statistician at Target) computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.”

4 What is Big Data? How big is Big? A constantly moving target: more than 100 petabytes in 2012.

5 Big in What? Big in Volume, Big in Velocity, Big in Variety

6 Big Data Dimensions (Michael Schroeck et al., “Analytics: The real-world use of big data”, IBM Executive Report)

7 Big Data Dimensions – add more Vs (Michael Schroeck et al., “Analytics: The real-world use of big data”, IBM Executive Report)

8 Gartner’s Definition "Big data are high-volume, high-velocity, high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

9 Big data can be very small Thousands of sensors in planes, power stations, trains… These sensors have errors (tolerances). They monitor everything from engine efficiency to passenger safety and well-being. The size of the dataset is not very large – several gigabytes – but the number of permutations in the sources is very large. http://mike2.openmethodology.org/wiki/Big_Data_Definition

10 Large datasets that ain’t big Media streaming is generating very large volumes with increasing amounts of structured metadata. Telephone calls and internet connections: petabytes of data, but the content is extremely structured. Relational databases can handle well-structured data very well.

11 Who coined the term Big Data? Not clear. An economist (Prof. Francis Diebold of the University of Pennsylvania) has a claim to it. There is even a NYTimes article: http://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/

12 But generally speaking… originated as a tag for a class of technology with roots in high-performance computing, pioneered by Google in the early 2000s. Includes technologies such as distributed file and database management tools, led by the Apache Hadoop project; big data analytic platforms, also led by Apache; and integration technology for exposing data to other systems and services.

13 Big data Toolkit A/B testing, association rule learning, classification, cluster analysis, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modeling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis, visualisation

14 What is special about big data processing?

15 Big Volume - Little Analytics Well addressed by the data warehouse crowd, who are pretty good at SQL analytics on hundreds of nodes and petabytes of data. (From Stonebraker)

16 Big Data - Big Analytics Complex math operations (machine learning, clustering, trend detection, …). In the market, the world of the “quants”. Mostly specified as linear algebra on array data.

17 Big Data - Big Analytics: An Example Consider the closing price on all trading days for the last 5 years for two stocks A and B. What is the covariance between the two time series?

18 Now Make It Interesting … Do this for all pairs of 4000 stocks. The data is the following 4000 × 1000 matrix: one row per stock S1 … S4000, one column per trading day t1 … t1000. Hourly data? All securities?
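The pairwise computation above can be sketched in a few lines of NumPy. The price matrix here is hypothetical random-walk data standing in for real closing prices; only the 4000 × 1000 shape matches the slide.

```python
import numpy as np

# Hypothetical data: 4000 stocks x 1000 trading days of closing prices
# (a random walk stands in for real prices).
rng = np.random.default_rng(0)
prices = rng.normal(0.0, 1.0, size=(4000, 1000)).cumsum(axis=1)

# np.cov treats each row as one variable (one stock's time series) and
# returns the full 4000 x 4000 covariance matrix in a single call.
cov = np.cov(prices)

print(cov.shape)  # (4000, 4000)
# Covariance of a single pair, e.g. stocks 0 and 1:
print(cov[0, 1])
```

One call yields all pairs at once; restricting to, say, companies headquartered in Switzerland is then just a row selection before the call.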

19 And Now try it for companies headquartered in Switzerland!

20 Goal of Big Data Good data management integrated with complex analytics

21 How to manage big data? While big data technology may be quite advanced, everything else surrounding it – best practices, methodologies, organizational structures, etc. – is nascent.

22 What is wrong with Big data? The end of theory: traditional statistics start from a model – a distribution, say the normal – and compute its mean and variance. Here there is no a priori model – it is discovered. For example: how many clusters are there?

23 How companies learn your secrets Privacy issues: http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/ http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=2&hp

24 Will now talk about: Map reduce; Hadoop; Big data in India – the academic scene

25 Map reduce

26 Map Reduce Inspired by the Lisp programming language. A programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Many problems can be phrased this way. Easy to distribute across nodes. Google has a patent!! Will it hurt me?

27 The MapReduce Paradigm A platform for reliable, scalable parallel computing. Abstracts issues of the distributed and parallel environment away from the programmer. Runs over distributed file systems: Google File System, Hadoop Distributed File System (HDFS). Adapted from S. Sudarshan, IIT Bombay

28 MapReduce Consider the problem of counting the number of occurrences of each word in a large collection of documents. How would you do it in parallel? Solution: divide the documents among workers. Each worker parses its documents to find all words and outputs (word, count) pairs. Partition the (word, count) pairs across workers based on the word. For each word at a worker, locally add up the counts.

29 Map - Reduce Iterate over a large number of records. Map: extract something of interest from each. Shuffle and sort intermediate results. Reduce: aggregate intermediate results. Generate final output.

30 MapReduce Programming Model Input: a set of key/value pairs. User supplies two functions: map(k,v) → list(k1,v1) and reduce(k1, list(v1)) → v2, where (k1,v1) is an intermediate key/value pair. Output is the set of (k1,v2) pairs.

31 MapReduce: Execution overview

32 MapReduce: The Map Step [Diagram: input key-value pairs, e.g. (doc-id, doc-content), pass through map to produce intermediate key-value pairs, e.g. (word, wordcount-in-a-doc).] Adapted from Jeff Ullman’s course slides

33 MapReduce: The Reduce Step [Diagram: intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped into key-value groups (word, list-of-wordcount) – roughly SQL GROUP BY – and reduced to output key-value pairs (word, final-count) – roughly SQL aggregation.] Adapted from Jeff Ullman’s course slides

34 Pseudo-code

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

// The group-by step is done by the system on the key of the intermediate
// pairs emitted above; reduce is then called on the list of values in each group.

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
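A minimal, single-process Python rendering of the word-count pseudo-code may make the division of labour clearer. The names map_fn, reduce_fn, and the toy documents are illustrative, and the framework's hidden group-by step is simulated with a dictionary.

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Map: emit one (word, 1) pair per word in the document.
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum the counts collected for one word.
    return (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)  # the group-by step a real system does for us
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "big data big analytics", "d2": "big deal"}
print(mapreduce(docs))  # {'big': 3, 'data': 1, 'analytics': 1, 'deal': 1}
```

In a real cluster, map_fn runs on many machines at once and the groups dictionary is replaced by the distributed shuffle.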

35 Distributed Execution Overview [Diagram: the User Program forks a Master and Workers; the Master assigns map and reduce tasks; map workers read input splits (Split 0, 1, 2) from the distributed file system and write intermediate results locally; reduce workers do remote reads and sorts, then write Output File 0 and Output File 1.] From Jeff Ullman’s course slides

36 Map Reduce vs. Parallel Databases Map Reduce is widely used for parallel processing: Google, Yahoo, and hundreds of other companies. Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, …. Database people say: but parallel databases have been doing this for decades. Map Reduce people say: we operate at scales of 1000s of machines, we handle failures seamlessly, and we allow procedural code in map and reduce, on data of any type.

37 Implementations Google: not available outside Google. Hadoop: an open-source implementation in Java; uses HDFS for stable storage. Aster Data: a cluster-optimized SQL database that also implements MapReduce. And several others, such as Cassandra at Facebook, …

38 Reading Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, http://labs.google.com/papers/mapreduce.html. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System”, http://labs.google.com/papers/gfs.html

39 Map reduce in English Map: In this phase, a User Defined Function (UDF), also called Map, is executed on each record in a given file. The file is typically striped across many computers, and many processes (called Mappers) work on the file in parallel. The output of each call to Map is a list of (KEY, VALUE) pairs. Shuffle: This is a phase that is hidden from the programmer. All the pairs are sent to another group of computers, such that all pairs with the same KEY go to the same computer, chosen uniformly at random from this group, and independently of all other keys. At each destination computer, pairs with the same KEY are aggregated together. So if (x, v1), (x, v2), …, (x, vn) are all the key-value pairs produced by the Mappers with the same key x, at the destination computer for key x these get aggregated into a single large pair (x, [v1, v2, …, vn]); observe that there is no ordering guarantee on the values. The aggregated pair is typically called a Reduce Record, and its key is referred to as the Reduce Key. Reduce: In this phase, a UDF, also called Reduce, is applied to each Reduce Record, often by many parallel processes. Each process is called a Reducer. For each invocation of Reduce, one or more records may get written into a local output file.
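The routing rule of the Shuffle phase can be sketched as hashing the key modulo the number of reducers, so that all pairs with the same key land on the same reducer. The reducer count R and the fruit keys are assumptions for illustration, and md5 stands in for whatever stable hash a real framework uses.

```python
import hashlib
from collections import defaultdict

R = 4  # number of reducers (assumed for illustration)

def partition(key, num_reducers=R):
    # Stable hash of the key, modulo the reducer count. md5 is used
    # instead of Python's built-in hash(), which is randomized per run.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_reducers

mapper_output = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

# Route each pair to a reducer, then build that reducer's Reduce Records:
# (key, [v1, v2, ...]) -- with no ordering guarantee on the values.
reducers = defaultdict(lambda: defaultdict(list))
for key, value in mapper_output:
    reducers[partition(key)][key].append(value)

for r, records in sorted(reducers.items()):
    print(r, dict(records))
```

Because partition depends only on the key, the two ("apple", 1) pairs are guaranteed to meet at the same reducer, which is exactly what makes the per-key aggregation correct.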

40 Hadoop

41 Why is Hadoop exciting? Blazing speed at low cost on commodity hardware. Linear scalability. A highly scalable data store with a good parallel programming model, MapReduce. Doesn't solve all problems, but it is a strong solution for many tasks.

42 What is Hadoop? For the executives: Hadoop is an Apache open source software project that gives you value from the volume/velocity/variety of data you have.

43 What is Hadoop? For technical managers: an open source suite of software that mines your structured and unstructured Big Data.

44 What is Hadoop? For legal: an open source suite of software that is packaged and supported by multiple suppliers, licensed under the Apache v2 license.

45 Apache V2 A licensee of Apache License V2 software can: copy, modify and distribute the covered software in source and/or binary forms; and exercise patent rights that would normally only extend to the licensor; provided that: all copies, modified or unmodified, are accompanied by a copy of the license; all modifications are clearly marked as being the work of the modifier; all notices of copyright, trademark and patent rights are reproduced accurately in distributed copies; and the licensee does not use any trademarks that belong to the licensor. Furthermore, the grant of patent rights is specifically withdrawn if the licensee starts legal action against the licensor(s) over patent infringements within the covered software.

46 What is Hadoop? For engineering: a massively parallel, shared-nothing, Java-based map-reduce execution environment; hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities. A Kerberos-secured software suite.

47 What are the components of Hadoop? Two core components: a file store called the Hadoop Distributed File System (HDFS), and a programming framework called MapReduce.

48 HDFS, MapReduce, Hadoop Streaming, Hive and Hue, Pig, Sqoop, HBase, FlumeNG, Whirr, Mahout, Fuse, ZooKeeper

49 Hadoop Components HDFS: spreads data over thousands of nodes. The Datanodes store your data, and the Namenode keeps track of where stuff is stored.

50 Hadoop Components Pig: a higher-level programming environment to do MapReduce coding. Sqoop: data transfer between Hadoop and relational databases. HBase: a highly scalable key-value store. Whirr: cloud provisioning for Hadoop.

51 [image-only slide]

52 [Diagram of the Hadoop stack: HDFS at the bottom; Mappers read 64+ MB blocks; mapper output is shuffled/sorted; Reducers consume the sorted pairs.]

53 HDFS, the bottom layer, sits on a cluster of commodity hardware. For a map-reduce job, the mapper layer reads from the disks at very high speed. The mappers emit key-value pairs that are sorted and presented to the reducers, and the reducer layer summarizes the key-value pairs.

54 [image-only slide]

55 Hadoop and relational databases? Hadoop integrates very well with relational databases via Apache Sqoop, which is used for moving data between Hadoop and relational databases.

56 Some elementary references “Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL – How to get started with Hadoop and your favorite databases”, Marty Lurie (marty@cloudera.com), Systems Engineer, Cloudera. http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

57 Bigdata India Scene

58 Big data India Will restrict myself to the academic scene. Almost every institute has courses and researchers in this space, but not with the label Big Data. Found only one ‘course’ with this title.

59 Big data Toolkit A/B testing, association rule learning, classification, cluster analysis, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modeling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis, visualisation

60 Courses Machine Learning, Natural Language Processing, Data Mining, Soft computing, Statistics

61 MOOC on Big data Coursera, starting 24 March 2013, 10 weeks. Dr. Gautam Shroff, Department of Computer Science and Engineering, Indian Institute of Technology Delhi

62 http://www.mu-sigma.com/ Mu Sigma, one of the world’s largest Decision Sciences and analytics firms, helps companies institutionalize data-driven decision making and harness Big Data. http://www.veooz.com/

63 [image-only slide]

64 [image-only slide]

65 What is Veooz? Pronounced as "views”. Helps you get a quick overview of, and insights into, the views/opinions posted by users on different social media platforms like Facebook, Twitter, Google+, LinkedIn, news sites, blogs, … Track views/opinions expressed by social media users on people, places, products, movies, events, brands … billions of views on millions of topics in one place.

66 Goal: Organize thoughts and interactions in social media in real time

67 veooz: a Real time Social Media Search & Analytics Engine

68 Social media is a Good Proxy for the Real World

69 New Power … from Social media monitoring, to Social listening, to Social Intelligence

70 Not easy, because… Most Social Media data is Noisy

71 Noisy Data in Social Media Short forms and abbreviations; semantic equivalents; spelling errors and variations; the messenger and message problem; irony, sarcasm and negation detection; #HashTag mapping; social media SPAM

72 Why does noise matter? Because it makes sentiment computation and deeper text processing very hard. Fine-grained text analysis and context/semantic processing: topic-level aggregation vs. text-level processing; detecting variations of the topic; using prior global sentiment in computing current sentiment.

73 Sentiment Expression Axis Literal: non-opinions; opinions; intensity/graded expressions; special symbols (emoticons, punctuation transgression, grapheme stretching, abbreviations). Non-Literal: metaphor, sarcasm, irony, oxymoron. SPAM: incorrect/ill-intentioned content. Reputation/Influence: content, user engagement (user actions, user reactions), social relations.

74 http://www.bda2013.net/ Important Dates (Research, Tutorial, Industry): Abstract submission deadline: June 30, 2013. Paper submission deadline: July 7, 2013. Notification to authors: August 23, 2013. Camera ready submission: September 4, 2013.

75 Conferences on Big Data

76 Indian Institutes of Technology

77 Indian Institutes of Technology (IITs) IITs are a group of fifteen autonomous engineering- and technology-oriented institutes of higher education, established and declared as Institutes of National Importance by the Parliament of India.

78 IITs were created to train scientists and engineers, with the aim of developing a skilled workforce to support the economic and social development of India after independence in 1947.

79 Original IITs 1. As a step in this direction, the first IIT was established in 1951 in Kharagpur (near Kolkata), in the state of West Bengal.

80 2. IIT Bombay was founded in 1958 at Powai, Mumbai, with assistance from UNESCO and the Soviet Union, which provided technical expertise.

81 3. IIT Madras is located in the city of Chennai in Tamil Nadu. It was established in 1959 with technical assistance from the Government of West Germany.

82 4. IIT Kanpur was established in 1959 in the city of Kanpur, Uttar Pradesh. During its first 10 years, IIT Kanpur benefited from the Kanpur Indo-American Programme (KIAP), under which a consortium of nine US universities provided assistance.

83 5. Established as the College of Engineering in 1961 and located in Hauz Khas, it was later renamed IIT Delhi. 6. IIT Guwahati was established in 1994 near the city of Guwahati (Assam), on the bank of the Brahmaputra River.

84 7. IIT Roorkee, originally known as the University of Roorkee, was established in 1847 as the first engineering college of the British Empire. Located in Uttarakhand, the college was renamed the Thomson College of Civil Engineering in 1854. It became the first technical university of India in 1949, when it was renamed the University of Roorkee, and it was included in the IIT system in 2001.

85 New IITs 1. Patna (Bihar) 2. Jodhpur (Rajasthan) 3. Hyderabad (Andhra Pradesh) 4. Mandi (Himachal Pradesh) 5. Bhubaneshwar (Orissa) 6. Indore (Madhya Pradesh) 7. Gandhinagar (Gujarat) 8. Ropar (Punjab)

86 Admission Admission to undergraduate B.Tech., M.Sc., and dual degree (BT-MT) programs is through the Joint Entrance Examination (JEE). 1 out of 100 applicants gets in.

87 Features IITs receive large grants compared to other engineering colleges in India – about Rs. 1,000 million per year for each IIT.

88 Features (cont.) The availability of resources has translated into superior infrastructure and qualified faculty in the IITs, and consequently higher competition among students to gain admission into the IITs.

89 Features (cont.) The government has no direct control over internal policy decisions of IITs (such as faculty recruitment) but has representation on the IIT Council.

90 Features (cont.) All over the world, IIT degrees are respected, largely due to the prestige created by very successful alumni.

91 Success story Other factors contributing to the success of IITs are stringent faculty recruitment procedures and industry collaboration. This combination of success factors has led to the concept of the IIT Brand.

92 Success story (cont.) The IIT brand was reaffirmed when the United States House of Representatives passed a resolution honouring Indian Americans, and especially graduates of IIT, for their contributions to American society. Similarly, China has also recognised the value of the IITs and has planned to replicate the model.

93 Indian Institute of Technology Kanpur Indian Institute of Technology, Kanpur is one of the premier institutions, established in 1959 by the Government of India.

94 IITK (Cont.) “to provide meaningful education, to conduct original research of the highest standard and to provide leadership in technological innovation for the industrial growth of the country”

95 IITK (Cont.) Under the guidance of eminent economist John Kenneth Galbraith, IIT Kanpur was the first institute in India to start Computer Science education. The Institute now has its own residential campus spread over 420 hectares of land.

96 Statistics
Undergraduate: 3679
Postgraduate: 2039
Ph.D.: 1064
Faculty: 351
Research Staff: 30
Supporting Staff: 900
Alumni: 26900

97 Departments Sciences: Chemistry, Physics, Mathematics & Statistics. Engineering: Aerospace, Bio-Sciences and Bioengineering, Chemical, Civil, Computer Science & Engineering, Electrical, Industrial & Management Engineering, Mechanical, Materials Science & Engineering. Humanities and Social Sciences. Interdisciplinary: Environmental Engineering & Management, Laser Technology, Master of Design, Materials Science Programme, Nuclear Engineering & Technology.

98 Thank you

