2USING ALL AVAILABLE DATA BUT… CostInexactitudeWe never wanted consider them unavoidable and learn to live with themGoing to big data from small
3In a world of small data Reducing errors Ensuring high quality was a natural and essential impulseCollected a little information so it must be as accurate ass possibleScientific try to make their measurement more and more precise. (celestial bodies or microscope)Analyzing only a limited number of data points means errors may get amplified
4If one could measure a phenomenon, the implicit belief was, one could understand it. Later, measurement was tied to the scientific method of observation and explanation: the ability to quantify, record, and present reproducible results.
5By the nineteenth century France… developed a system of precisely defined units of measurement to capture space, time, and more, and had begun to get other nations to adopt the same standards.It was the apex of the age of measurementJust half a century later, in the 1920s, the discoveries of quantum mechanics shattered forever the dream of comprehensive and perfect measurement.
6Howeverin many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming
7you can also increase messiness by combining different types of information from different sources, which don’t always align perfectly.Messiness can also refer to the inconsistency of formatting, for which the data needs to be “cleaned” before being processed.
8Suppose we need to measure the temperature in a vineyard One temperature sensora sensor for every one of the hundreds of vinesNow suppose we increase the frequency of the sensor readingsThe information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude.
9in many cases it is more fruitful to tolerate error than it would be to work at preventing it For instance, we can accept some messiness in return for scaleAs Forrester, a technology consultancy, puts it, “Sometimes two plus two can equal 3.9, and that is good enough.”Of course the data can’t be completely incorrect, but we’re willing to sacrifice a bit of accuracy in return for knowing the general trend.
10Everyone knows how much processing power has increased over the years as predicted by Moore’s Law, which states that the number of transistors on a chip doubles roughly every two years.This continual improvement has made computers faster and memory more plentifulBUT…
11Fewer of us know that the performance of the algorithms that drive many of our systems has also increased—in many areas more than the improvement of processors under Moore’s Law.Many of the gains to society from big data, however, happen not so much because of faster chips or better algorithms but because there is more data.
12For example, chess algorithms have changed only slightly in the past few decades the way computers learn how to parse words as we use them in everyday speechAround 2000, Microsoft researchers Michele Banko and Eric Brill were looking for a method to improve the grammar checker that is part of the company’s Word program
13They weren’t sure whether it would be more useful to put their effort into improving existing algorithms,finding new techniques,or adding more sophisticated featuresThe results were astounding. As more data went in, the performance of all four types of algorithms improved dramatically.Simple algorithm 75% to 95%Large data algorithm 86% to 94%
14“These results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development,” Banko and Brill wrote in one of their research papers on the topic.
15language translation few years after Banko and Brill … researchers at rival Google were thinking along similar lines—but at an even larger scaleInstead of testing algorithms with a billion words, they used a trillion.Google did this not to develop a grammar checker but to crack an even more complex nut:language translation
16So-called machine translation has been a vision of computer pioneers since the dawn of computing in the 1940sThe idea took on a special urgency during the Cold WarAt first, computer scientists opted for a combination of grammatical rules and a bilingual dictionaryAn IBM computer translated sixty Russian phrases into English in 1954, using 250 word pairs in the computer’s vocabulary and six rules of grammar. The results were very promising
17The director of the research program, Leon Dostert of Georgetown University, predicted that… By 1966 a committee of machine-translation grandees had to admit failureThe problem was harder than they had realized it would beIn the late 1980s, researchers at IBM had a novel idea…
18In the 1990s IBM’s Candide project used ten years’ worth of Canadian parliamentary transcripts published in French and English—about three million sentence pairsSuddenly, computer translation got a lot betterEventually IBM pulled the plug.
19But less than a decade later, in 2006, Google got into translation, as part of its mission to “organize the world’s information and make it universally accessible and useful.”Instead of nicely translated pages of text in two languages, Google availed itself of a larger but also much messier dataset: the entire global Internet and moreIts trillion-word corpus amounted to 95 billion English sentences, albeit of dubious quality
20Despite the messiness of the input, Google’s service works the best By mid-2012 its dataset covered more than 60 languagesThe reason Google’s translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data—and not just of high qualityGoogle was able to use a dataset tens of thousands of times larger than IBM’s Candide because it accepted messiness
21More trumps betterMessiness is difficult to accept for the conventional sampling analysts, who for all their lives have focused on preventing and eradicating messinessThey work hard to reduce error rates when collecting samples, and to test the samples for potential biases before announcing their results
22They use multiple error-reducing strategies Such strategies are costly to implementand they are hardly feasible for big dataMoving into a world of big data will require us to change our thinking about the merits of exactitude we no longer need to worry so much about individual data points biasing the overall analysis
23At BP’s Cherry Point Refinery in Blaine, Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time.The environment of intense heat and electrical machinery might distort the readings, resulting in messy data.But the huge quantity of information generated from both wired and wireless sensors makes up for those hiccups
24When the quantity of data is vastly larger and is of a new type, exactitude in some cases is no longer the goalIt bears noting that messiness is not inherent to big data.Instead it is a function of the imperfection of the tools we use to measure, record, and analyze information.If the technology were to somehow become perfect, the problem of inexactitude would disappear
25Messiness in actionIn many areas of technology and society, we are leaning in favor of more and messy over fewer and exactThese hierarchical systems have always been imperfect, as everyone familiar with a library card catalogue
26For example, in 2011 the photo-sharing site Flickr held more than six billion photos from more than 75 million users. Trying to label each photo according to preset categories would have been useless. Would there really have been one entitled “Cats that look like Hitler”?Instead, clean taxonomies are being replaced by mechanisms that are messier but also eminently more flexible and adaptable to a world that evolves and changes
27When we upload photos to Flickr, we “tag” them Many of the Web’s most popular sites flaunt their admiration for imprecision over the pretense of exactitude For example likes in face book or twitter
28Traditional database engines required data to be highly structured and precise. Data wasn’t simply stored; it was broken up into “records” that contained fields. Each field held information of a particular type and length. For example…
29The most common language for accessing databases has long been SQL, or “structured query language.” The very name evokes its rigidity.But the big shift in recent years has been toward something called noSQL, which doesn’t require a preset record structure to work. It accepts data of varying type and size and allows it to be searched successfully.
30Processing big data entails an inevitable loss of information Pat Helland, one of the world’s foremost authorities on database design“lossy”Processing big data entails an inevitable loss of information
31Hadoop, an open-source rival to Google’s MapReduce system Traditional database design promises to deliver consistent results across time. If you ask for your bank account balance, for example, you expect to receive the exact amount….While traditional systems would have a delay until all updates are madeInstead, accepting messiness is a kind of solutionHadoop, an open-source rival to Google’s MapReduce system
32typical data analysis requires an operation called “extract, transfer, and load,” or ETL Hadoop: data can’t be moved and must be analyzed where it isHadoop’s output isn’t as precise as that of relational databases: it can’t be trusted to launch a spaceship or to certify bank-account details. But for many less critical tasks, where an ultra-precise answer isn’t needed, it does the trick far faster than the alternatives
33web pages and videos, remain dark In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. For example ZestFinance for loanAccording to some estimates only 5 percent of all digital data is “structured”—that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data soweb pages and videos, remain dark
34Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy
35big data may require us to change, to become more comfortable with disorder and uncertainty
36as the next chapter will explain, finding associations in data and acting on them may often be good enough