Big data and predictive analytics – a scientific paradigm shift?
Bo Sundgren, Senior Professor, Dalarna University. Keynote speaker, NTTS 2017.
Contents
- Three examples of applications: Google Translate, Consumer Price Index, Google Flu Trends
- Concepts and definitions: Big Data, Predictive Analytics
- Analysis: What is new? Paradigm shift? Disruptive innovation?
- Synthesis
Three examples
- Google Translate
- Consumer Price Index (CPI)
- Google Flu Trends
Google Translate
For several decades, brilliant scientists in AI tried to solve the problem of automatic translation. Their attempts were based on domain-specific theories, that is, established linguistic theories, and the resulting software was typically rule-based, like the early expert systems. The practical results were not impressive. Then came Google Translate, the result of a new approach based on statistical methods. This new approach has provoked outcries from established domain experts such as Noam Chomsky: priority is given to operational results ("it works") rather than to insight and understanding based on theories and models.
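To make the contrast with rule-based translation concrete, here is a minimal sketch of the statistical idea: estimating word translations from co-occurrence counts in a tiny parallel corpus, with no grammar rules at all. The toy corpus and the crude co-occurrence estimator are illustrative assumptions, not Google's actual method.

```python
from collections import defaultdict

# Toy English-Swedish parallel corpus (an illustrative assumption, not real training data).
parallel_corpus = [
    ("the house is red", "huset är rött"),
    ("the house is big", "huset är stort"),
    ("the car is red", "bilen är röd"),
]

# Count how often each source word co-occurs with each target word in aligned
# sentence pairs -- a crude stand-in for statistical word alignment.
cooc = defaultdict(lambda: defaultdict(int))
for src, tgt in parallel_corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def translate_word(word):
    """Pick the target word that most often co-occurs with `word`."""
    candidates = cooc.get(word)
    if not candidates:
        return word  # fall back to copying unknown words
    return max(candidates, key=candidates.get)

print(translate_word("house"))  # prints the Swedish word seen most often with "house"
```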
Consumer Price Index (CPI)
Proxies of the official consumer price indexes have been produced with good results from price data available on the Internet (see figure). The proxies can be produced and published much faster. Some National Statistical Institutes now use internet robots to collect prices from the web as part of their data collection for the CPI.
National statistical institutes, as well as private opinion research institutes, have major problems collecting data by means of traditional sample surveys: response rates go down and costs go up. Using Big Data from the Internet may be one way of coping with this situation. Other methods are:
- using registers and administrative data
- using model-based estimation
- using self-selected panels
- using representative rather than probability-based samples
Some of these methods, including Big Data, may be used in combination.
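As an illustration of the idea, here is a minimal sketch of how prices collected by a web robot could be turned into an elementary price index: a Jevons-type index (the geometric mean of price relatives) per month. The products, prices, and months are hypothetical, and a real CPI involves much more (sampling, weighting, quality adjustment).

```python
import math

# Hypothetical price observations, as a web-scraping robot might collect them:
# {product: {month: price}}. In practice these would be scraped from retailer websites.
prices = {
    "milk 1l":     {"2017-01": 9.50, "2017-02": 9.70},
    "bread":       {"2017-01": 22.0, "2017-02": 21.5},
    "coffee 500g": {"2017-01": 35.0, "2017-02": 36.8},
}

def jevons_index(prices, base_month, current_month):
    """Elementary price index: geometric mean of price relatives for products
    observed in both months, with the base month set to 100."""
    relatives = [
        obs[current_month] / obs[base_month]
        for obs in prices.values()
        if base_month in obs and current_month in obs
    ]
    mean_log = sum(math.log(r) for r in relatives) / len(relatives)
    return 100 * math.exp(mean_log)

print(round(jevons_index(prices, "2017-01", "2017-02"), 1))  # index for February
```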
Google Flu Trends
Google Flu Trends was a web service operated by Google. By relating influenza-related search queries to influenza-related physician visits, it attempted to make predictions about flu activity. A linear regression model was used, based on historical data. Fifty million candidate search queries were tested to produce a top list of the queries giving the most accurate predictions; the top 45 queries, when aggregated together, fitted the historical data most accurately. The linear model was fitted to the weekly data between 2003 and 2007. Finally, the trained model was used to predict flu outbreaks across all regions in the United States as well as internationally.
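The modelling step can be illustrated with a minimal sketch: an ordinary least-squares fit of weekly physician-visit rates on an aggregated query frequency, using synthetic numbers. This shows only the general idea; the actual Google Flu Trends model was fitted on log-odds and on real CDC surveillance data.

```python
import numpy as np

# Synthetic weekly data (illustrative assumptions, not CDC or Google figures):
# x = aggregated frequency of the selected flu-related queries,
# y = percentage of physician visits that concern influenza-like illness (ILI).
rng = np.random.default_rng(0)
x = rng.uniform(0.01, 0.10, size=52)             # query fraction per week
y = 0.5 + 25.0 * x + rng.normal(0, 0.2, 52)      # assumed linear relationship plus noise

# Fit y = a + b*x by ordinary least squares.
b, a = np.polyfit(x, y, deg=1)

# "Nowcast" the ILI rate for a new week from the current query frequency.
new_query_fraction = 0.07
predicted_ili = a + b * new_query_fraction
print(f"predicted ILI visit share: {predicted_ili:.2f}%")
```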
Two major components in the new paradigm
- Big Data
- Predictive Analytics
Gartner’s definition of Big Data
"High-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight, decision-making, and process optimization."
This definition combines three major aspects:
- WHAT? Contents and characteristics: "high-volume, high-velocity and high-variety information assets" (3V)
- HOW? Methods and tools: "cost-effective, innovative forms of information processing" (data collection, data management, analysis and presentation)
- WHY? Purposes: "enhanced insight and decision making"
Analytical use of data for problem-solving, decision-making, and process control
Big Data: Volume, velocity, variety (3V)
Volume: Large sets of relevant data are available from the Internet and from operative systems such as business systems, public administrative systems, and systems monitoring operations like traffic control. There are new sources and data collection methods, for example social media and sensors (the Internet of Things). Data may be stored in traditional data warehouses before they are analysed, but they may also be analysed "on the fly", giving feedback to the processes that generate them.
Velocity: Data are often generated more or less continuously (streaming data) by events occurring with high frequency, and have to be taken care of in real time at high speed.
Variety: Big Data may exist in complex and heterogeneous structures and formats. Data may need to be combined, managed, and analysed in new ways: more or less structured data, free text, pictures and photos, sound, video, etc. New types of databases and data management systems may be required in addition to traditional SQL databases.
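A minimal sketch of the "analysed on the fly" idea: maintaining running statistics and flagging unusual values as a data stream arrives, instead of first loading everything into a warehouse. The sensor readings and the alert threshold are illustrative assumptions.

```python
def monitor_stream(stream, threshold=3.0):
    """Update a running mean and variance (Welford's algorithm) for each value
    in the stream and yield values that deviate strongly from the mean so far."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        std = (m2 / n) ** 0.5 if n > 1 else 0.0
        if n > 10 and std > 0 and abs(x - mean) > threshold * std:
            yield x, mean  # feed an alert back to the process generating the data

# Illustrative sensor readings with one spike.
readings = [20.1, 20.3, 19.8, 20.0, 20.2, 19.9, 20.1, 20.0, 20.2, 19.9, 20.1, 35.0, 20.0]
for value, running_mean in monitor_stream(readings):
    print(f"anomalous value {value} (running mean so far {running_mean:.1f})")
```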
Predictive analytics
The term "predictive analytics" is used for a class of statistical methods that are often used for discovering and analysing statistical relationships in "big data". The methods may be grouped into two main categories:
- Methods based on correlation and regression: linear regression, discrete choice models, logistic regression, multinomial logistic regression, probit regression, time series models, survival or duration analysis, classification and regression trees, multivariate adaptive regression splines.
- Methods based on machine learning, for example neural networks and pattern recognition: neural networks, multilayer perceptron (MLP), radial basis functions, naïve Bayes for classification tasks, pattern recognition methods, k-nearest neighbours, geospatial predictive modelling.
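A minimal sketch contrasting the two families on the same synthetic classification task: logistic regression (from the correlation/regression family) versus k-nearest neighbours (from the machine-learning family), assuming scikit-learn is available. The data and parameters are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for "big data" with a binary outcome.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression (regression family)": LogisticRegression(max_iter=1000),
    "k-nearest neighbours (machine-learning family)": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy {model.score(X_test, y_test):.2f}")
```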
Predictive analytics: Some application areas
Despite the word "predictive" in "predictive analytics", the methods are also used for other purposes than making predictions and prognoses. Some common application areas are:
- risk assessment in banking and insurance
- estimation of the potential of different customer categories in marketing
- estimation of security risks
- fraud detection
Typical characteristics of the methods and applications:
- They exploit patterns found in historical and transactional data to identify risks and opportunities.
- They capture relationships among many factors to allow assessment of the risks or potentials associated with particular sets of conditions.
- They provide a predictive score (probability) for each individual (customer, employee, patient, product, vehicle, machine, component, ...) in marketing, banking, insurance, security, fraud detection, health care, pharmaceuticals, manufacturing, law enforcement, ...
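The per-individual predictive score can be illustrated with a short, fraud-detection-flavoured sketch: a model trained on historical cases assigns each new case a probability, and the cases are then ranked by that score. The synthetic data, the hypothetical feature names, and the choice of classifier are assumptions for illustration only (scikit-learn is assumed to be available).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic historical transactions: the columns stand for hypothetical features
# such as amount, hour of day, and number of recent transactions.
rng = np.random.default_rng(1)
X_hist = rng.normal(size=(500, 3))
y_hist = (X_hist[:, 0] + 0.5 * X_hist[:, 2] + rng.normal(0, 0.5, 500) > 1.5).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X_hist, y_hist)

# Score new transactions and rank them by estimated fraud probability.
X_new = rng.normal(size=(5, 3))
scores = model.predict_proba(X_new)[:, 1]
for i in np.argsort(scores)[::-1]:
    print(f"transaction {i}: fraud score {scores[i]:.2f}")
```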
What is new?
"Big Data" and "Predictive Analytics" may be seen as the latest step in a development that started decades ago and includes:
- Decision Support Systems (DSS)
- Expert systems
- Data Mining
- Business Intelligence (BI)
All of these use the same basic process model for collecting, processing, and using analytical information. What is new is, among other things:
- The three Vs: the data volumes, the velocity with which data are generated, and the diversity of data sources and data types. The Internet, streaming data, and data-generating sensors (the Internet of Things) have created radically new conditions. New methods and tools for data management are needed.
- The decoupling of analyses from domain-specific theories and models: the successful applications of "big data" and "predictive analytics" indicate that, at least sometimes, it is possible to obtain very useful knowledge without necessarily building on domain theories and causal relationships.
What is new in the different steps of the process model?
Data sources and data collection methods:
- "Found data" rather than "made data"
- Available data sources and available data (from the web, administrative registers, and operational systems, including process control systems and sensor data) rather than probability-based sample surveys (with growing costs, measurement problems, and huge and biased non-response) or controlled experiments in laboratory-like environments
Data storage, data processing, and quality control of data:
- Streaming data and new types of databases rather than traditional databases alone
- Macro-editing (selective editing, or significance editing) for optimising the resources spent on identifying and modifying suspicious data and on replacing missing data with imputed values, rather than traditional editing based on rules and expert judgements (a sketch follows below)
Data analysis, problem solving, and decision-making:
- Correlation and regression analyses and other statistical relationships in the data, rather than theory-based testing of hypotheses, for generating problem solutions and decisions
- Self-learning systems that learn from experience and from corrections by experts, rather than systems based on domain theories and decision rules formulated by experts
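A minimal sketch of the macro-editing idea just mentioned: compute a significance score per record, send only the most suspicious records to manual review, and impute missing values automatically. The data, the score (weighted deviation from the median), and the review threshold are illustrative assumptions, with pandas assumed to be available.

```python
import numpy as np
import pandas as pd

# Hypothetical collected data: reported turnover per enterprise, with raising weights.
df = pd.DataFrame({
    "enterprise": ["A", "B", "C", "D", "E"],
    "turnover":   [120.0, 135.0, np.nan, 9000.0, 128.0],
    "weight":     [10, 10, 10, 10, 10],
})

# Replace missing values automatically (here: by the median of the reported values).
df["turnover_edited"] = df["turnover"].fillna(df["turnover"].median())

# Significance score: weighted deviation from the median, i.e. how much the record
# could distort the published aggregate if it happens to be wrong.
median = df["turnover_edited"].median()
df["score"] = df["weight"] * (df["turnover_edited"] - median).abs()

# Only the top-scoring records are sent to manual review; the rest pass through.
to_review = df.nlargest(1, "score")
print(to_review[["enterprise", "turnover_edited", "score"]])
```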
But certain basics are still necessary
Big Data does not eliminate the need for certain basics, such as data governance and data quality, data modelling, data architecture, and data management. All the steps typically needed to transform raw data into insights still have to happen; they may just happen in different ways and at different times in the process.
Paradigm shift? Scientific revolution?
The philosopher Thomas Kuhn introduced the terms "paradigm shift" and "scientific revolution" to describe revolutionary changes in scientific theories, changes which may also affect our worldview, for example:
- when the worldview with the earth at the centre was replaced by a worldview with the sun at the centre, or
- when Einstein replaced Newton's mechanics with new theories, which we will accept as true until they, too, have been falsified. All theories will sooner or later turn out to be false or incomplete.
Some scientists think that "Big Data and Predictive Analytics" really represents a paradigm shift:
- both those who are positive to using statistical methods of analysis that are not linked to domain-specific theories, for example Peter Norvig, AI giant and research director at Google, and Chris Anderson, journal editor and debater,
- and those who condemn such usage, such as prominent statisticians like Gary King at Harvard University, and domain theorists like the linguist Noam Chomsky.
Chris Anderson (with some hubris):
"'All models are wrong, but some are useful.' So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? … Until now … massive amounts of data and applied mathematics replace every other tool … Out with every theory of human behaviour, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it … With enough data, the numbers speak for themselves … Scientists are trained to recognise that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y … Instead, you must understand the underlying mechanisms that connect the two … There is now a better way … We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters … and let statistical algorithms find patterns where science cannot. The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world … There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?"
Criticisms
Noam Chomsky: Language-recognition programs use massive databases of words, and statistical correlations between those words, to translate or to recognise speech. But correlation is not causation. Do these statistical data-dredgings give any insight into how language works? Or are they a mere big-number trick, useful but adding nothing to understanding?
Gary King, a professor of statistics and a sceptic when it comes to the abilities of computers and algorithms to predict the future: People are influenced by their environment in innumerable ways. Trying to understand what people will do next assumes that all the influential variables can be known and measured accurately. People's environments change even more quickly than they themselves do. Everything from the weather to their relationship with their mother can change the way people think and act. All of those variables are unpredictable. How they will impact a person is even less predictable. If put in the same situation tomorrow, they may make a completely different decision. This means that a statistical prediction is only valid in sterile laboratory conditions, which suddenly isn't as useful as it seemed before. An example: Google Flu Trends does not work as well now as it did before.
Disruptive innovation
It is controversial whether "Big Data and Predictive Analytics" implies a paradigm shift and, if so, whether it is a desirable one. Both enthusiasts and critics seem to agree that the new methods may be useful, even if they are not (yet) properly understood, and even if they do not reflect how human beings think. Maybe one should instead regard "Big Data and Predictive Analytics" as a so-called "disruptive innovation", a radical change of methods and tools. We have witnessed many such disruptive changes during the last century: the refrigerator replacing ice distributors, digital technology replacing mechanics, and digital mass media and the Internet replacing paper media, gramophone records, video films, and the traditional distribution of these media.
Humans and computers in cooperation
Strategy 1: "Knowledge engineers" try to find out
- what knowledge domain experts use when they solve problems and make decisions (the "knowledge base"), and
- which theories, models, and rules the domain experts use, or at least say that they use, when they solve problems and make decisions (the "inference engine").
The knowledge base and the inference engine are represented in a computerised system, which is applied to concrete cases for problem-solving and decision-making.
Strategy 2: A computerised system, such as a learning machine or neural network, registers problem solutions and decisions (outputs) produced by a domain expert on the basis of certain data (inputs), which are also registered: the training set of data.
- No attempt is made to explain the reasoning behind the expert's decisions, if there is such a reasoning.
- The system then derives statistical relationships between inputs and outputs, and uses those relationships to solve new cases, based on new data.
- The system may learn from experience in different ways: supervised learning, unsupervised learning, reinforcement learning.
Both strategies may be useful in different situations, but the second strategy is the most typical of "Big Data and Predictive Analytics". A sketch contrasting the two strategies follows below.
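A minimal sketch of the contrast, assuming scikit-learn and an invented loan-approval example: Strategy 1 encodes the rule the expert says she uses, while Strategy 2 only records the expert's past inputs and decisions and lets a model derive the relationship. Both the rule and the training cases are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Strategy 1: a rule elicited from the expert is programmed explicitly
# (a one-line "knowledge base" plus "inference engine").
def approve_loan_rule_based(income, debt):
    """The rule the expert says she applies (hypothetical)."""
    return income > 30000 and debt / income < 0.4

# Strategy 2: only the expert's past cases and decisions are recorded (the training
# set); no attempt is made to capture her reasoning.
past_cases = [(50000, 10000), (20000, 15000), (40000, 25000), (80000, 20000), (25000, 5000)]
expert_decisions = [1, 0, 0, 1, 0]  # what the expert actually decided in each case

model = DecisionTreeClassifier(random_state=0).fit(past_cases, expert_decisions)

# Apply both strategies to a new case.
new_case = (45000, 12000)
print("rule-based decision:", approve_loan_rule_based(*new_case))
print("learned decision:   ", bool(model.predict([new_case])[0]))
```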
Alan Turing (1950): Can machines think?
There is reason to recall the wise words of Alan Turing, who suggested that the question "Can machines think?" should be replaced by the question "Can machines do what we, as thinking entities, can do?" Maybe computer-supported systems loaded with "big data" can help us achieve new results by using methods that are fundamentally different from the ways in which we, as human beings, are used to thinking, modelling, analysing, reasoning, and deriving new knowledge.
A synthesis
The new paradigm, based on Big Data and Predictive Analytics, produces better and more useful solutions to some important problems. But it does not improve our understanding of the mechanisms behind the problems and the solutions. Do we have to understand? Yes and no. Consider a patient with some medical problems, showing certain symptoms:
- A first priority is to cure the patient, or at least to improve the patient's situation.
- A doctor would use the symptoms in combination with medical theory and test results in order to arrive at a diagnosis and a treatment.
- According to the new paradigm, a piece of software might instead find statistical relationships in a database of symptoms, test results, expert diagnoses, treatments, and outcomes, and these relationships could be used to generate even better results than those achieved by doctors.
- However, there would still be a need to develop a better understanding of diseases and treatments, based on a better understanding of the processes in the human body.
Bottom line: Is there a new paradigm?
Short answer: yes, there is a new paradigm. The fast and easy availability, via the Internet, of huge amounts of relevant and useful data for many kinds of data-supported research and decision-making has paved the way for alternatives to traditional paradigms, which are based on domain-specific knowledge acquisition, modelling and theory building, logical reasoning, and understanding.
But the new paradigm does not make the traditional paradigms obsolete. They may continue to live side by side, with different purposes:
- The new paradigm will generate useful results for certain problems, for example automatic translation, where the solutions do not have to reflect human ways of solving similar problems, where domain-specific causal relationships are not important, and where correlations suffice.
- The traditional paradigms will continue to improve our understanding of reality, for example how human beings learn and use natural languages, how the human body functions, and how diseases may be cured.
Compare the co-existence of quantitative and qualitative methods. Compare, on the other hand, how disruptive technological changes make old technologies almost instantaneously obsolete and drive businesses into bankruptcy.