Emerging Information Technologies 1
DCS860A Emerging Information Technologies 1, DPS2017 – Fall 2015
Team 3: Amir Ataee, Marvin Billings, Sharice Cannady, LLiver, José Egal Sanchez
Professor Chuck Tappert, Ph.D.
A brief overview of Big Data Analytics with an emphasis on enabling algorithms
Table of Contents
• Big Data Overview (Egal): Data Structures; Perspectives on Data Repositories
• State of the Practice in Analytics (Marvin): Current Analytical Architecture; Data Sources; Data Warehouse (DW); Enterprise Data Warehouse (EDW); Data Science Users; Drivers of Big Data
• Example of Big Data Analytics – IoT (Sharice)
• Data Analytics Lifecycle (Amir): Phase 1: Discovery; Phase 2: Data Preparation; Phase 3: Model Planning; Phase 4: Model Building; Phase 5: Communicate Results; Phase 6: Operationalize
• Algorithms (LLiver): Clustering (K-means); Association Rules (Apriori Algorithm); Regression (Linear and Logistic); Classification (Decision Trees, Naïve Bayes); Time Series Analysis; Text Analysis
Egal’s section
Big Data Definition: Big Data can be defined as all data that does not fit a traditional Relational Database Management System (RDBMS), regardless of whether it is used for online transaction processing (OLTP) or for analytic purposes. Big data is associated not only with the size of the data but, most importantly, with its format. However, there are ecosystems, such as Hadoop, that can utilize both structured and unstructured data.
The collection of data and the size of the data have gone through an evolutionary process that covers four stages:
• Engineers hard-wired computers to extract information (the introduction of the telegraph)
• Data input by computer operators (corporations and organizations logging corporate data)
• The introduction of the web (Web 1.0 and Web 2.0)
• The introduction of sensors (the Internet of Things)
The explosion of data collection has been exponential, going from KB to TB.
In computer science, a data structure is a methodical way of organizing data in a computer so that it can be used efficiently.
• Structured databases use B-tree indexes when retrieving small percentages of the data
• Compilers and databases use dynamic hash tables as look-up tables
Efficient data structures are usually essential to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design.
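To make the look-up-table point concrete, here is a minimal Python sketch (our own illustration; the symbol-table contents are hypothetical):

```python
# A hash table (Python's dict) used as a look-up table, as a compiler
# might map identifiers to their types. Average-case look-up is O(1),
# versus O(n) for a linear scan of an unindexed list.
symbol_table = {
    "count": "int",
    "price": "float",
    "name": "str",
}

# Constant-time membership test and retrieval by key.
if "price" in symbol_table:
    print(symbol_table["price"])  # -> float
```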
Relational Database Management System (RDBMS)
Popular RDBMSs: Microsoft SQL Server, Oracle, and MySQL.
Structured data is stored in a data warehouse; in a traditional Enterprise Data Warehouse (EDW), it feeds BI/analytics.
Popular NoSQL technologies: Cassandra, CouchDB, and MongoDB.
Unstructured data is stored in a big data repository, such as Hadoop, which serves big data applications.
Some big data ecosystems, such as Hadoop, can leverage both traditional (RDBMS) data repositories and NoSQL big data repositories.
Marvin's section
State of the Practice in Analytics - What Is Driving It?
• 2.5 quintillion bytes of data are created daily. [3]
• The data generated every 2 days equals all the data created from the beginning of time until 2003. [5]
• 90 percent of the data stored in the world was created in the last 2 years.
• All the data in storage in the world doubles every 1.2 years.
• From 3.2 zettabytes today to 40 zettabytes in 2020.
• 570 websites are created every minute.
• Google processes over searches per second.
• Every minute, Facebook processes 1.8 million likes and photo uploads.
• YouTube receives 100 hours of video every minute.
• The NSA monitors 60 petabytes daily (1.8% of internet traffic).
• There are 1.8 billion smartphones among the 7.1 billion people on earth; 25% of the people on this planet have a smartphone full of sensors.
• Typical smartphone sensors: motion, air temperature, light, proximity, humidity, location, etc.
State of the Practice in Analytics - Where Is the Data Coming From?
• Mobile device users
• Sensors
• Archives
• Social networks
• Internet of Things
• Enterprise applications
• Cameras
• Software logs
• Health data (e.g., Fitbit)
• Databases
• Etc.
State of the Practice in Analytics - Current General Architecture
Figure source: Mysore, D., Khupat, S., & Jain, S. [Photograph].
Sharice's section
Big Data & The Internet of Things
Data management software such as NoSQL databases and Hadoop was initially used to analyze website traffic and social media data. It turns out that NoSQL and Hadoop are also capable of analyzing data streams coming from sensors and controllers.
Big Data & The Internet of Things
Today, sensors and controllers are streaming data from:
• Jet engines
• Mobile devices
• Health monitors
• Shipment items
On the right is a simplified view of key Internet of Things components.
Big Data & The Internet of Things
Big Data & The Internet of Things
The “killer use cases” for Big Data: sensors and intelligent controllers increasingly provide data that is critical to running the business. A Big Data use case could be driven by the need to analyze data from the Internet of Things.
Amir's section
Data Analytics Life Cycle
Big Data is defined as “extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools.” Dealing with big data poses several challenges, including the acquisition, storage, searching, sharing, analysis, and visualization of the data. To overcome these issues, we need a process that facilitates the analysis of Big Data. For this purpose, the data analytics life cycle was designed.
Data Analytics Life Cycle (Continued)
This life cycle has six phases, but they do not have to occur in serial order: at any time, one or more phases can be active simultaneously, and most phases can move forward or backward depending on what new information becomes available.
Diagram: 1. Discovery → 2. Data Prep → 3. Model Planning → 4. Model Building → 5. Communicate Results → 6. Operationalize
Phase 1: Discovery
The team learns the business domain, assesses the resources available to support the project, and formulates initial hypotheses (IHs) to test as it begins learning the data.
Phase 2: Data Preparation
This phase requires the presence of an analytic sandbox in which the team can work with data. The team needs to execute extract, load, and transform (ELT) processes, familiarize itself with the data thoroughly, and take steps to condition the data, as in the sketch below.
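As a hedged sketch of what conditioning the data might look like inside the sandbox (pandas is assumed as the tool; the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw extract loaded into the analytic sandbox.
raw = pd.DataFrame({
    "age":    [34, None, 51, 29],
    "income": ["52,000", "61,500", None, "48,200"],
})

# Typical conditioning steps: fix types and handle missing values.
clean = raw.copy()
clean["income"] = clean["income"].str.replace(",", "").astype(float)
clean["age"] = clean["age"].fillna(clean["age"].median())
clean = clean.dropna(subset=["income"])
print(clean)
```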
Phase 3: Model Planning
The team determines the methods, techniques, and workflow it intends to follow for the subsequent model-building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Phase 4: Model Building
The team develops datasets for testing and production purposes, then builds and executes models based on the work done in the model planning phase.
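A minimal sketch of this phase under stated assumptions (scikit-learn as the toolkit, synthetic data standing in for the prepared dataset): develop separate training and test sets, then fit and score a candidate model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the team's prepared analytic dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out a test set so performance is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```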
Phase 5: Communicate Results
The team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize
The team delivers final reports, briefings, code, and technical documents. The team may run a pilot project to implement the models in a production environment.
LLiver's section
Reference text: Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000
What is an algorithm?
In mathematics and computer science, an algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist that perform calculation, data processing, automated reasoning, and so on.
The concept of an algorithm has existed for centuries; however, a partial formalization of what would become the modern algorithm began with attempts to solve the Entscheidungsproblem ("decision problem") posed by David Hilbert in 1928. Giving a formal definition of algorithms, corresponding to the intuitive notion, remains a challenging problem. [3]
Algorithms normally behave as they are designed, performing a number of tasks, but when left unsupervised they can and will do strange things. As we put more and more of our world under the control of algorithms, we can lose track of who, or what, is pulling the strings. Algorithms entered the general public's newscasts through the Flash Crash [12], but they did not leave. They soon showed up in stories about dating, shopping, entertainment, medicine, everything imaginable. Algorithms are taking over everything.
What is an algorithm? (Cont.)
The bounds of algorithms get pushed every day. They have displaced humans in a growing number of industries, something they often do well. They are faster than we are, and when things work as they should, they make far fewer mistakes than we do. But as algorithms acquire power and independence, there can be unexpected consequences. They observe, experiment, and learn, all independently of their human creators. Using advanced computer science techniques such as machine learning and neural networks, algorithms can even create new and improved algorithms based on observed results. Algorithms have already written symphonies, picked through legalese, diagnosed patients, written news articles, flown airplanes, and driven vehicles on urban highways with far better control than humans. [12]
It is no coincidence that the most upwardly mobile people in society right now are those who can manipulate code to create algorithms that can sprint through oceans of data. In the history of thought, before the discovery of calculus, mathematics had been a discipline of great interest; afterward, it became a discipline of great power. Not until the advent of the computer algorithm in the twentieth century did a mathematical idea of comparable influence appear. The calculus and the algorithm are the two leading ideas of Western science.
Algorithm: Discussion
What will become of our duties as humans, our employment, and so on? Since algorithms are here to stay, how can we remain in control of them, trust them, and so on?
Classification of algorithms
By implementation:
• Recursion or iteration (compared in the sketch below)
• Logical
• Serial, parallel, or distributed
• Deterministic or non-deterministic
• Exact or approximate
• Quantum
By design paradigm:
• Brute-force or exhaustive search
• Divide and conquer
• Search and enumeration
• Randomized
• Reduction of complexity
• Optimization problems: linear programming, dynamic programming, the greedy method, the heuristic method
Also classified by field of study and by complexity.
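To make the first implementation distinction concrete, here is a toy Python comparison (our own illustration) of recursion versus iteration for the same computation:

```python
# The same computation expressed two ways.
def factorial_recursive(n: int) -> int:
    # Base case stops the recursion; each call shrinks the problem.
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def factorial_iterative(n: int) -> int:
    # A loop with an accumulator avoids call-stack overhead.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

assert factorial_recursive(10) == factorial_iterative(10) == 3628800
```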
Major Big Data Analytics Algorithms: Clustering
Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structure within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. For large data sets, clustering is computationally expensive. Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to more focused analysis or decision processes. K-means is a simple and straightforward method for defining clusters; once clusters and their associated centroids are identified, it is easy to assign new objects to a cluster based on the object's distance from the closest centroid.
Use cases:
• Image processing: Video is one example of the growing volumes of unstructured data being collected. Within each frame of a video, k-means analysis can be used to identify objects in the video.
• Medical: Patient attributes such as age, height, weight, systolic and diastolic blood pressure, cholesterol level, and other attributes can identify naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering, in general, is useful in biology for the classification of plants and animals, as well as in the field of human genetics.
• Customer segmentation: Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
Major Big Data Analytics Algorithms: Clustering (Cont.)
If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

  initialize n, c, μ1, μ2, …, μc (randomly selected)
  do: classify the n samples according to the nearest μi
      recompute μi
  until no change in μi
  return μ1, μ2, …, μc

The question is how to evaluate whether the samples in one cluster are more similar among themselves than samples in other clusters. Two issues:
• How do we measure the similarity between samples?
• How do we evaluate a partitioning of a set into clusters?
The most obvious measure of the similarity between two samples is the distance between them, i.e., we define a metric. Once this measure is defined, one would expect the distance between samples of the same cluster to be significantly less than the distance between samples in different classes.
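A from-scratch sketch of this loop, assuming NumPy and synthetic 2-D data (empty-cluster handling and k-means++-style seeding are omitted for brevity):

```python
import numpy as np

def kmeans(samples, c, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the c centroids as randomly selected samples.
    centroids = samples[rng.choice(len(samples), size=c, replace=False)]
    for _ in range(iters):
        # Classify each sample according to the nearest centroid.
        dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned samples.
        new_centroids = np.array(
            [samples[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_centroids, centroids):  # no change in the centroids
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(pts, c=2)
print(centers)
```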
Major Big Data Analytics Algorithms: Association Rules (ARs)
This is a descriptive, not predictive, method often used to discover interesting relationships hidden in a large dataset: relationships that occur too frequently to be random and that are meaningful from a business perspective, whether or not they are obvious. ARs are commonly used for mining transactions in databases. Possible questions that ARs can answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what products do they tend to buy?
• Of those customers who have purchased this product, what other similar products do they tend to view or purchase?
Applications of association rules: market basket analysis refers to a specific implementation of association rules mining that many companies use for a variety of purposes, including these:
• Broad-scale approaches to better merchandising
• Cross-merchandising between products and high-margin or high-ticket items
• Physical or logical placement of products within related categories of products
• Promotional programs of multiple-product purchase incentives managed through a loyalty card program
ARs are also commonly used for recommender systems [9] and clickstream analysis. Many online service providers, such as Amazon and Netflix, use recommender systems, which can discover related products or identify customers who have similar interests. This observation provides valuable insight into how to better personalize and recommend content to site visitors. The framework has expanded to web contexts, such as mining path traversal patterns and usage patterns [7] to facilitate the organization of web pages.
Apriori [8] is one of the earliest and most fundamental algorithms for generating association rules; a toy sketch of its counting step follows.
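A toy sketch of the counting step at the heart of Apriori (pure Python; the baskets and the 0.5 support threshold are made up, and the subset-pruning that gives Apriori its efficiency is omitted):

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 0.5  # itemset must appear in at least half the baskets

# Count support for all 1- and 2-itemsets.
counts = Counter()
for basket in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
print(frequent)  # e.g. ('beer', 'diapers') appears in half the baskets
```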
Major Big Data Analytics Algorithms: Regression: Linear Regression
Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Often, the outcome variable is called a dependent variable because the outcome depends on the other variables; these additional variables are sometimes called the input variables or the independent variables. Regression can answer questions such as: What is a person's expected income? What is the probability that an applicant will default on a loan?
Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. Linear regression models are useful in physical and social science applications where there may be considerable variation in a particular outcome based on a given set of input values.
Use cases: linear regression is often used in business, government, and other scenarios. Some common practical applications:
• Real estate: A simple linear regression analysis can be used to model residential home prices as a function of the home's living area (see the sketch below). The model could be further improved by including other input variables such as the number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.
• Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services. Similar models can be built to predict retail sales, emergency room visits, and ambulance dispatches.
• Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes.
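A minimal sketch of the real-estate example (NumPy least squares; the area and price figures are invented for illustration):

```python
import numpy as np

# Hypothetical data: living area (sq ft) vs. sale price ($1000s).
area  = np.array([1100, 1400, 1800, 2100, 2500], dtype=float)
price = np.array([199, 245, 319, 360, 420], dtype=float)

# Fit price = b0 + b1 * area by ordinary least squares.
b1, b0 = np.polyfit(area, price, deg=1)
print(f"price ~= {b0:.1f} + {b1:.3f} * area")
print("predicted price for 2000 sq ft:", b0 + b1 * 2000)
```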
Major Big Data Analytics Algorithms: Regression: Logistic Regression
When the outcome variable is categorical, logistic regression is a better choice than linear regression. Both models assume a linear additive function of the input variables; if that assumption does not hold, both regression techniques perform poorly.
Use cases: the logistic regression model is applied to a variety of situations in both the public and the private sectors.
• Medical: Develop a model to determine the likelihood of a patient's successful response to a specific medical treatment or procedure.
• Finance: Using a loan applicant's credit history and the details of the loan, determine the probability that the applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.
• Marketing: Determine a wireless customer's probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts (see the sketch below).
• Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.
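A hedged sketch of the churn use case (scikit-learn; the features and the six training rows are fabricated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [age, family members on plan, months left on contract]
X = np.array([[25, 1, 2], [40, 4, 18], [33, 2, 1],
              [52, 3, 24], [29, 1, 0], [45, 5, 12]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)
# Probability of churn for a new customer, as the use case describes.
print(model.predict_proba([[30, 1, 3]])[0][1])
```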
Major Big Data Analytics Algorithms: Classification
Classification is another fundamental learning method that appears in applications related to data mining. The primary task performed by classifiers is to assign class labels to new observations. The set of labels for classifiers is predetermined, unlike in clustering, which discovers the structure without a training set and allows the data scientist optionally to create and assign labels to the clusters. Most classification methods are supervised: they start with a training set of pre-labeled observations to learn how the attributes of those observations may contribute to the classification of future unlabeled observations. Classification is widely used for prediction purposes. Two fundamental methods are decision trees and naïve Bayes.
Decision trees: A decision tree (also called a prediction tree) uses a tree structure to specify sequences of decisions and consequences. Classification trees usually apply to output variables that are categorical, often binary, in nature, such as yes or no. Regression trees, on the other hand, can apply to output variables that are numeric or continuous, such as the predicted price of a consumer good or the likelihood that a subscription will be purchased.
Naïve Bayes: Naïve Bayes is a probabilistic classification method based on Bayes' theorem (or Bayes' law) with a few tweaks. Bayes' theorem gives the relationship between the probabilities of two events and their conditional probabilities. Because Bayes classifiers are easy to implement and can execute efficiently even without prior knowledge of the data, they are among the most popular algorithms for classifying text documents. Spam filtering is a classic use case of Bayes text classification (sketched below). Naïve Bayes classifiers can also be used for fraud detection. [11]
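A minimal sketch showing the two methods side by side on a tiny spam-filtering task (scikit-learn; the word-count features are made up):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical word counts per document:
# [count("free"), count("winner"), count("meeting")]
X = [[3, 2, 0], [0, 0, 2], [4, 1, 0], [0, 1, 3], [2, 3, 0], [1, 0, 4]]
y = ["spam", "ham", "spam", "ham", "spam", "ham"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
nb = MultinomialNB().fit(X, y)

new_doc = [[2, 1, 0]]
print("decision tree:", tree.predict(new_doc)[0])
print("naive Bayes:  ", nb.predict(new_doc)[0])
```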
Major Big Data Analytics Algorithms: Time Series Analysis
Time series analysis attempts to model the underlying structure of observations taken over time. It has many applications in finance, economics, biology, engineering, retail, and manufacturing. The goals of time series analysis are to:
• Identify and model the structure of the time series.
• Forecast future values in the time series.
Use cases:
• Retail sales: For various product lines, a clothing retailer is looking to forecast future monthly sales (see the sketch below).
• Spare parts planning: Companies' service organizations have to forecast future spare part demand to ensure an adequate supply of parts to repair customer products.
• Stock trading: Some high-frequency stock traders utilize a technique called pairs trading, in which the historical pricing relationship between two correlated stocks is modeled and trades are placed when the prices temporarily diverge from it.
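A minimal sketch of the retail-sales idea using a simple moving average (pandas; the monthly figures are invented, and a production forecast would more likely use an ARIMA-style model):

```python
import pandas as pd

# Hypothetical monthly sales figures.
sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
                  index=pd.period_range("2015-01", periods=10, freq="M"))

# A 3-month moving average smooths the series; the mean of the last
# window serves as a naive one-step-ahead forecast.
smoothed = sales.rolling(window=3).mean()
print(smoothed.tail())
print("naive forecast for next month:", sales.iloc[-3:].mean())
```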
Major Big Data Analytics Algorithms: Text Analysis
Text analysis, sometimes called text analytics, refers to the representation, processing, and modeling of textual data to derive useful insights. An important component of text analysis is text mining, the process of discovering relationships and interesting patterns in large text collections. Text analysis suffers from the curse of high dimensionality (illustrated in the sketch below).
A corpus (plural: corpora) is a large collection of texts used for various purposes in natural language processing (NLP). The table below lists a few example corpora that are commonly used in NLP research.

Corpus                                     Word Count     Domain
Shakespeare                                0.88 million   Written
Brown Corpus                               1 million
Penn Treebank                                             Newswire
Switchboard Phone Conversations            3 million      Spoken
British National                           100 million    Written and spoken
NA News                                    350 million
European Parliament Proceedings Parallel   600 million    Legal
Google N-Grams                             1 trillion
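A minimal sketch of the first step in most text analysis: turning raw text into a term-document matrix (scikit-learn's CountVectorizer; the three documents are made up). Even this tiny corpus yields one feature per distinct term, hinting at the high dimensionality noted above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny hypothetical corpus.
docs = [
    "big data analytics with hadoop",
    "text mining discovers patterns in text",
    "hadoop stores big data",
]

# Bag-of-words representation: one row per document, one column per term.
vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(term_matrix.toarray())
```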
References (LLiver)
[1] P. Axt, "On a Sub-Recursive Hierarchy and Primitive Recursive Degrees," Transactions of the American Mathematical Society, vol. 92, pp. 85–105, 1959.
[2] C. G. Bell and A. Newell, Computer Structures: Readings and Examples, McGraw-Hill Book Company, New York, 1971.
[3] Y. N. Moschovakis, "What is an algorithm?" in B. Engquist and W. Schmid (eds.), Mathematics Unlimited — 2001 and Beyond, Springer, pp. 919–936 (Part II), 2001.
[4] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, 1967.
[5] P. Hajek, I. Havel, and M. Chytil, "The GUHA Method of Automatic Hypotheses Determination," Computing, vol. 1, no. 4, 1966.
[6] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," in SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993.
[7] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, 1997.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994.
[9] W. Lin, S. A. Alvarez, and C. Ruiz, "Efficient Adaptive-Support Association Rule Mining for Recommender Systems," Data Mining and Knowledge Discovery, vol. 6, no. 1, 2002.
[10] W. Lin, S. A. Alvarez, and C. Ruiz, "Collaborative Recommendation via Adaptive Association Rule Mining," in Proceedings of the International Workshop on Web Mining for E-Commerce (WEBKDD), Boston, MA, 2000.
[11] C. Phua, V. C. S. Lee, K. Smith, and R. W. Gayler, "A Comprehensive Survey of Data Mining-Based Fraud Detection," CoRR, 2010.
[12] C. Steiner, Automate This: How Algorithms Took Over Our Markets, Our Jobs, and the World, Portfolio/Penguin, New York, NY, 2012.
[13] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd ed.), John Wiley & Sons, 2000.
References (Marvin)
[1] Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3–15.
[2] Mysore, D., Khupat, S., & Jain, S. (2013, September 17). Big data architecture and patterns, Part 1: Introduction to big data
[3] classification and architecture. Retrieved from
[4] Wang, L., Ranjan, R., Kołodziej, J., Zomaya, A., & Alem, L. (2015). Software tools and techniques for big data computing in healthcare clouds. Future Generation Computer Systems, 43.
[5] Marr, B. (2015). Big Data: Using SMART big data, analytics and metrics to make better decisions and improve performance.
Questions?