Critical Thinking and Argumentation in Software Engineering (CT60A7000)
Big Data, Chapter 3: Messy
Behnaz Norouzi, Francis Matheri


Increasing the volume of data opens the door to inexactitude. One of the fundamental shifts in moving from small data to big data is treating inexactitude as unavoidable and learning to live with it, instead of treating errors as problems and trying to eliminate them. In the world of small data, reducing errors and ensuring high-quality data are essential. In the world of sampling, the obsession with exactitude was even more critical.

The quest for exactitude began in Europe in the middle of the nineteenth century. The implicit belief was that if one could measure a phenomenon, one could understand it. Measurement later became tied to the scientific method of observation and explanation. Lord Kelvin: "To measure is to know." Francis Bacon: "Knowledge is power." By the nineteenth century France had developed a precise system of measurement to capture space, time, and more. Half a century later, the discovery of quantum mechanics shattered forever the dream of comprehensive and perfect measurement.

In many new situations, allowing for messiness may be a positive feature, not a shortcoming. It is a tradeoff: in return for allowing errors, one can get hold of much more data. It isn't just that "more trumps some," but that sometimes "more trumps better." The likelihood of errors does increase as you add more data points, but the payoff from the extra data outweighs it.

Messiness itself is messy. It can arise when we extract or process data, since in doing so we transform it, turning it into something else, such as when we perform sentiment analysis on Twitter messages to predict Hollywood box-office receipts. Example: measuring the temperature in a vineyard. If we have only one temperature sensor for the whole plot of land, we must make sure it is accurate; no messiness is allowed. If we have a sensor for every hundred vines, we can use cheaper sensors, and messiness is allowed. It is again a tradeoff: we sacrifice the accuracy of each data point for breadth, and in return we receive detail we otherwise could not have seen.
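To make the tradeoff concrete, here is a minimal sketch in Python, assuming an invented "true" plot temperature and invented sensor error levels: the average of many cheap, noisy sensors can land close to the truth while also revealing variation across the plot that a single precise sensor cannot show.

```python
# Hypothetical simulation of the vineyard example: one precise sensor vs.
# many cheap, noisy sensors. All numbers here are invented for illustration.
import random

random.seed(42)

TRUE_MEAN_TEMP = 18.0          # assumed "true" average temperature of the plot (deg C)

# One high-quality sensor: small error, but only a single reading.
single_reading = TRUE_MEAN_TEMP + random.gauss(0, 0.1)

# One hundred cheap sensors: each individually noisy (about +/- 1 deg C), but
# together they average out close to the truth and reveal per-row variation.
cheap_readings = [TRUE_MEAN_TEMP + random.gauss(0, 1.0) for _ in range(100)]
cheap_average = sum(cheap_readings) / len(cheap_readings)

print(f"single accurate sensor : {single_reading:.2f} C")
print(f"average of 100 cheap   : {cheap_average:.2f} C")
print(f"spread across the plot : {min(cheap_readings):.1f} to {max(cheap_readings):.1f} C")
```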

Big data transforms figures into something more probabilistic than precise. More data brings improvements in computing performance. Example: chess algorithms using complete endgame tables (N = all). Banko and Brill: "We want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development." The story behind this remark: feeding far more training data to simple algorithms improved performance on a natural-language task more than refining the algorithms did. The result: more data, better performance.
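A hedged sketch of the same effect, not Banko and Brill's actual experiment: a simple scikit-learn classifier on synthetic data, trained on progressively larger samples, typically shows accuracy climbing with corpus size even though the algorithm never changes.

```python
# Minimal sketch: same simple model, growing "corpus" sizes, rising accuracy.
# Synthetic data only; intended as an illustration, not a benchmark.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=60_000, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000,
                                                    random_state=0)

for n in (200, 2_000, 50_000):              # growing training-set sizes
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"training examples: {n:>6}  test accuracy: {acc:.3f}")
```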

Google's idea: language translation. An early result: in 1954, an IBM computer translated sixty Russian phrases into English. The problem, posed by a committee of machine-translation grandees: translation is not just about memorization and recall; it is about choosing the right words from many alternatives.

A novel idea from IBM researchers: instead of feeding the computer explicit linguistic rules, let it use statistical probabilities to calculate which word or phrase in one language is the most appropriate counterpart in another. Google's mission in 2006: organize the world's information and make it universally accessible and useful. The result: despite the messiness of the input, Google's translation service works best, and it is far, far richer. Why does it work so well? It was fed more data, not just data of higher quality.
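A toy illustration of the statistical approach, with invented alignment counts rather than a real parallel corpus: the "translation" is simply the candidate that the counts make most probable.

```python
# Toy sketch of the statistical-translation idea: no hand-written rules, just
# pick the candidate that the (tiny, invented) corpus counts favor.
from collections import Counter

# Hypothetical counts of how often each English phrase was aligned with the
# French word "banc" in a parallel corpus.
alignment_counts = {
    "banc": Counter({"bench": 74, "bank": 9, "school of fish": 3}),
}

def translate(phrase: str) -> str:
    """Return the most frequently aligned target phrase for `phrase`."""
    candidates = alignment_counts.get(phrase)
    if not candidates:
        return phrase                      # fall back: leave it untranslated
    total = sum(candidates.values())
    best, count = candidates.most_common(1)[0]
    print(f"P({best!r} | {phrase!r}) is about {count / total:.2f}")
    return best

print(translate("banc"))                   # -> bench
```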

Conventional sampling analysts find it difficult to accept messiness. They rely on multiple error-reducing strategies. The problem: such strategies are costly, and exacting standards of collection are unlikely to be achieved consistently at big-data scale.

More Trumps Better

Moving into a world of big data will require us to change our thinking about the merits of exactitude. In dealing with ever more comprehensive datasets, we no longer need to worry so much about individual data points biasing the overall analysis. Take the way sensors are making their way into factories. Example: at a factory in Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time.

Moving to a large scale changes both our expectations of precision and our practical ability to achieve exactitude. Technology is imperfect, so messiness is a practical reality we must deal with. To produce the inflation number, the Bureau of Labor Statistics employs hundreds of staff, at a cost of around $250 million a year. The problem: by the time the numbers come out, they are already a few weeks old. The solution: quicker access to inflation numbers, which cannot be achieved with conventional methods focused on sampling.

Two economists at the Massachusetts Institute of Technology used big data in the form of software that crawls the web and collects half a million prices of products sold in the U.S. every single day. The benefit: combining big-data collection with clever analysis led to the detection of a deflationary swing in prices immediately after Lehman Brothers filed for bankruptcy in September 2008.
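As a rough sketch of the idea (not the economists' actual methodology), daily scraped prices can be turned into a crude index relative to the first day, with sharp drops flagged; the prices below are invented.

```python
# Minimal sketch: build a daily price index from scraped prices and flag
# unusually large day-over-day declines. All figures are made up.
daily_prices = {                      # product -> list of daily prices
    "laptop":   [1000, 1000, 940, 900],
    "cereal":   [4.00, 3.95, 3.80, 3.70],
    "sneakers": [60.0, 60.0, 57.0, 55.0],
}

num_days = len(next(iter(daily_prices.values())))
index = []
for day in range(num_days):
    # average of each product's price relative to its day-0 price
    relatives = [prices[day] / prices[0] for prices in daily_prices.values()]
    index.append(100 * sum(relatives) / len(relatives))

for day, value in enumerate(index):
    change = value - index[day - 1] if day else 0.0
    flag = "  <- deflationary move" if change < -2 else ""
    print(f"day {day}: index {value:6.2f}{flag}")
```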

More and messy over fewer and exact. For categorizing content, hierarchical systems such as taxonomies and indexes are imperfect. The photo-sharing site Flickr in 2011 held more than six billion photos from more than 75 million users. Trying to label each photo according to preset categories does not work at that scale, so the preset categories were replaced by tagging, a mechanism that is messier but more flexible. The imprecision inherent in tagging is about accepting the natural messiness of the world.
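A tiny sketch of why tagging is more flexible than a single rigid category: each photo can carry several imprecise tags and be found under any of them. The photos and tags below are made up.

```python
# Build an inverted index from tags to photos, so a photo is reachable under
# every tag it carries rather than filed in exactly one folder.
from collections import defaultdict

photos = {
    "IMG_001": ["wedding", "beach", "sunset"],
    "IMG_002": ["beach", "surfing"],
    "IMG_003": ["wedding", "cake"],
}

tag_index = defaultdict(set)                  # tag -> photos carrying it
for photo, tags in photos.items():
    for tag in tags:
        tag_index[tag].add(photo)

print(sorted(tag_index["beach"]))             # ['IMG_001', 'IMG_002']
print(sorted(tag_index["wedding"]))           # ['IMG_001', 'IMG_003']
```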

Messiness in Action

Database design. Traditional databases require highly structured and precise data; they are suited to a world in which data is sparse and can therefore be curated carefully. That view of storage is increasingly at odds with reality. The big shift is toward NoSQL databases, which accept data of varying type and size and still allow it to be searched successfully. In return for permitting structural messiness, they require more processing and storage resources. Pat Helland: "It's OK if we have lossy answers. That's frequently what business needs."
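A plain-Python illustration of the schemaless idea (not a real NoSQL engine): records of different shapes sit side by side and can still be searched on whatever fields they happen to have.

```python
# Schemaless storage in miniature: heterogeneous records plus a generic search.
records = [
    {"type": "tweet", "user": "anna", "text": "vineyard harvest looks great"},
    {"type": "photo", "user": "ben", "tags": ["vineyard", "sunset"], "px": 12_000_000},
    {"type": "sensor", "plot": 7, "temp_c": 18.4, "ts": "2013-09-01T06:00:00Z"},
]

def search(field, predicate):
    """Return every record that has `field` and whose value passes `predicate`."""
    return [r for r in records if field in r and predicate(r[field])]

print(search("tags", lambda tags: "vineyard" in tags))
print(search("temp_c", lambda t: t > 18))
```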

Hadoop: an open-source rival to Google's MapReduce system. Why is Hadoop so good at processing large quantities of data? It takes for granted that the quantity of data is so breathtakingly enormous that it cannot be moved and must be analyzed where it is.
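A conceptual sketch, in Python rather than Hadoop itself, of the map/reduce split: the map step runs where each chunk of data lives and emits small (key, value) pairs, and only those pairs travel to the reduce step.

```python
# Word-count in the map/reduce shape that Hadoop popularized. Each "chunk"
# stands in for data sitting on a different node; only the intermediate
# (word, count) pairs would need to move across the network.
from collections import defaultdict

chunks = [
    "more data trumps better algorithms",
    "messy data is still data",
]

def map_phase(chunk):
    for word in chunk.split():
        yield word, 1                          # emit (key, value) pairs locally

def reduce_phase(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count                  # combine values that share a key
    return dict(totals)

intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(intermediate))
```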

By allowing for imprecision, we open a window onto an untapped universe of insights. In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. As big data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before: a sort of N=all of the mind. Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than our dependence on small data and accuracy did. Big data may require us to change, to become more comfortable with disorder and uncertainty.