What Is Big Data?
Craig C. Douglas, University of Wyoming


What Is Big Data?... It Depends

Unit             Approximately    n (10^n bytes)   Related to
Kilobyte (KB)    1,000 bytes      3                Circa 1952 computer memory
Megabyte (MB)    1,000 KB         6                Circa 1976 supercomputer memory
Gigabyte (GB)    1,000 MB         9                2013 typical 16 GB memory stick
Terabyte (TB)    1,000 GB         12               2012 largest SSD in a laptop
Petabyte (PB)    1,000 TB         15               250,000 DVDs or the entire digital library of all known books written in all known languages
Exabyte (EB)     1,000 PB         18               175 EB copied to disk in 2010 (est.)
Zettabyte (ZB)   1,000 EB         21               2 ZB copied to disk in 2011 (est.)

32 KB: Apollo 11 computer memory (1969)
32 GB: Smart phone memory (2014)

What Is Big Data?... It Depends
What if time counts?
  – Given a time period t, how much data can be read and written?
  – This changes over time as technology changes.
What if the quantity of data counts?
  – How long does it take to read and write data?
  – This changes over time as technology changes.
The definition of Big Data is fluid, not static.
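The two questions on this slide are the same arithmetic read in opposite directions. A minimal Python sketch, assuming a single device with a constant sustained throughput; the function names and the 500 MB/s figure are hypothetical and chosen only for illustration:

```python
def bytes_in_time(throughput_bytes_per_s: float, seconds: float) -> float:
    """How much data can be read or written in a given time period t."""
    return throughput_bytes_per_s * seconds


def seconds_for_bytes(n_bytes: float, throughput_bytes_per_s: float) -> float:
    """How long it takes to read or write a given quantity of data."""
    return n_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    PETABYTE = 10**15
    ASSUMED_THROUGHPUT = 500 * 10**6  # hypothetical 500 MB/s device
    days = seconds_for_bytes(PETABYTE, ASSUMED_THROUGHPUT) / 86_400
    print(f"1 PB at the assumed rate: about {days:.1f} days")
```

Changing the assumed throughput changes both answers, which is one way to see why the definition shifts as technology changes.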

Some Sources of Big Data
Interactions with dynamic databases
Internet data
City or regional transportation flow control
Environment and disaster management
Oil/gas fields or pipelines, seismic imaging
Credit cards and online businesses
Government or industry regulation/statistics
Dynamic data-driven apps

Why is Big Data a Hot Topic?
Open positions in data analytics by 2020 (USA)
  – up to 200,000 open positions
  – might only be 140,000 open positions
The Bureau of Labor Statistics projects that 70% of all newly created jobs across all STEM fields during the 2010s
  – across engineering, the physical sciences, the life sciences, and the social sciences
  – will be in computer science

Unprecedented Opportunities
Significant contributions to the development of these transformative technologies have been made from diverse fields including:
  – mathematics
  – natural sciences
  – engineering
  – social sciences
  – arts and entertainment industries
  – business world

Unprecedented Opportunities
Algorithm and software development have belonged to computer science over the past 50 years:
  – Computer science researchers have designed and implemented the algorithms and data structures, languages, models, tools, and abstractions that have enabled these transformational technology developments.

Quick summary
Simulation-oriented computational science is transformational science, but is only a niche in the grand scheme of things.
Big data computing capabilities must be broadly available in any institution that strives to compete in the coming decade.
If not, an institution will simply cease to be competitive, similar to not joining the ARPAnet or CSnet in the 1970s and 1980s.

Similarities in Sentences in Big Files

Big File Format
One line per sentence with no punctuation
Each word is separated by one blank
All lower case
Multiple languages and gibberish
Watch for an extra blank at the end of some lines

Goals
In the big file of sentences:
  – Eliminate similar sentences
  – Find similar sentences of some distance or less
Either goal is hard work if the file has enough sentences
Both goals are about equally hard
Methods in Chapter 3 of Ullman et al.'s Data Mining book are useful

Goal 1
Eliminate all duplicate lines (distance 0)
Eliminate all sentences of distance 1
  – Two sentences S_1 and S_2 are within distance n if S_1 can be transformed into S_2 by adding, removing, or substituting at most n words.
  – What happens if you eliminate sentence S_i because of sentence S_{i-j}, but you later find a sentence S_k that has distance 0 or 1 from S_i? You need to define how you handle this case.
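The distance defined on this slide is an edit distance counted in words rather than characters. A minimal Python sketch under that reading, assuming sentences are already in the big file format (lower case, words separated by blanks). The names word_edit_distance and eliminate_similar are hypothetical, and comparing each sentence only against the ones kept so far is just one possible policy for the case the slide asks you to define:

```python
def word_edit_distance(s1: str, s2: str) -> int:
    """Fewest word additions, removals, or substitutions turning s1 into s2."""
    a, b = s1.split(), s2.split()  # split() also ignores a trailing blank
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # remove word_a
                            curr[j - 1] + 1,      # add word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def eliminate_similar(sentences, n=1):
    """Keep a sentence only if no already-kept sentence is within distance n.

    Assumed policy: eliminated sentences are never used for comparison.
    This is quadratic, so it is only meant for small test files.
    """
    kept = []
    for s in sentences:
        if all(word_edit_distance(s, k) > n for k in kept):
            kept.append(s)
    return kept
```

With the input "a b c", "a b d", "a b d e" and n = 1, the second sentence is eliminated because of the first, while the third is kept even though it is within distance 1 of the eliminated one. A different policy would give a different answer, which is why the choice has to be stated.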

Goal 2
List all sentences that have duplicates.
List all sentences that have distance 1 sentences.
List the first one followed by all distance 0 or 1 sentences related to it
  – Can do as separate lists or just one
  – Should be sorted
Redo for distance n
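For the distance 0 part of this goal, grouping lines by their exact word sequence is enough. A minimal Python sketch covering only that case; the name duplicate_groups and the choice to report 0-based line numbers are assumptions made for illustration, and the distance 1 lists would need a real similarity search on top of this:

```python
from collections import defaultdict


def duplicate_groups(sentences):
    """Return sorted (sentence, line_numbers) pairs for sentences seen more than once."""
    groups = defaultdict(list)
    for line_no, sentence in enumerate(sentences):
        # Rejoining the words removes the extra blank some lines end with.
        key = " ".join(sentence.split())
        groups[key].append(line_no)
    return sorted((key, nums) for key, nums in groups.items() if len(nums) > 1)
```

Each group lists the first occurrence first, followed by the later lines that duplicate it, and the groups themselves come out sorted by sentence text.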

Preprocessing
Read all of the file and build a dictionary with each word given a natural number as an index:
  – given sentence one here as the first one
    1 2 3 4 5 6 7 3
  – next sentence after sentence one
    8 2 9 2 3
  – and so on
    10 11 12
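The dictionary step can be sketched in a few lines of Python. This is an illustration, not the author's code: the name encode_sentences is hypothetical, and indices are assumed to be handed out in order of first appearance, starting at 1.

```python
def encode_sentences(sentences):
    """Map each distinct word to a natural number and encode every sentence."""
    word_to_id = {}
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1  # next unused index
            ids.append(word_to_id[word])
        encoded.append(ids)
    return word_to_id, encoded
```

Fed the three example sentences from the slide in order, the second and third come out as 8 2 9 2 3 and 10 11 12. Comparing lists of integers instead of strings makes the later distance computations cheaper.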

Implementation Suggestions
Use hash tables of considerable size
  – Hash table size should be a prime number
Build and debug your code with small files
  – Start with < 10 sentences
  – Next try 100, 1000, and 10,000 sentences
  – Then try 17,788,002 sentences
Consider using Hadoop (requires knowledge of Java, however) or MR-MPI (C/C++)
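A minimal Python sketch of the prime table size suggestion. Everything here is an assumption for illustration: next_prime uses plain trial division, and bucket_index is a simple polynomial hash over the word indices of a sentence with an arbitrary multiplier of 31. A production code would likely pick both differently.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for choosing a table size once."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    factor = 3
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 2
    return True


def next_prime(n: int) -> int:
    """Smallest prime that is at least n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def bucket_index(word_ids, table_size: int) -> int:
    """Polynomial hash of a sentence given as a list of word indices."""
    h = 0
    for word_id in word_ids:
        h = (h * 31 + word_id) % table_size  # 31 is an arbitrary multiplier
    return h
```

A typical use is table_size = next_prime(2 * expected_sentences), then bucket_index(ids, table_size) for each encoded sentence, so that only sentences landing in the same bucket need to be compared for exact equality.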

Tricky Part
Write code to do Goal 1 or 2.
Notes:
  – Shingling and minhash do not work well for edit distance
  – Two approaches:
      – Try Jaccard similarity or distance methodology on sentences considered as sets of words
      – Modify index-based and length-based methods
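The two approaches can be illustrated with a small Python sketch. The function names are hypothetical. The first treats a sentence as a set of words and computes Jaccard similarity; the second is the simplest possible length-based test, which relies on the fact that each added or removed word changes the length by one.

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    a, b = set(s1.split()), set(s2.split())
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


def could_be_within(s1: str, s2: str, n: int) -> bool:
    """Length filter: sentences whose word counts differ by more than n
    cannot be within distance n, so the expensive comparison is skipped."""
    return abs(len(s1.split()) - len(s2.split())) <= n
```

Jaccard similarity ignores word order and repeated words, so "a b" and "b a" score 1.0 even though two substitutions separate them. It is best used as a cheap candidate filter before an exact distance check.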

Generalizing
Substitute n for 1
  – Not much extra work to do so
  – Instead of looking at sentences of word length difference 1, look at ones of difference up to n
  – Makes a much more useful program
Take arbitrary sentences
  – Convert to one per line, each word separated by one blank
  – Take lower and upper case into account and convert to all lower case as preprocessing
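Converting arbitrary text into the big file format can be sketched as follows in Python. This is a deliberately naive illustration: the name normalize_text is hypothetical, sentences are assumed to end at a period, exclamation mark, or question mark (so abbreviations will be split incorrectly), and every other non-alphanumeric character is treated as a word separator.

```python
import re


def normalize_text(text: str):
    """Return one lower-case line per sentence, words separated by one blank."""
    lines = []
    for sentence in re.split(r"[.!?]+", text):
        # [^\W_]+ matches runs of letters and digits, including non-ASCII ones.
        words = re.findall(r"[^\W_]+", sentence.lower())
        if words:
            lines.append(" ".join(words))
    return lines
```

For example, normalize_text("Hello, World! This is  a TEST.") gives the two lines "hello world" and "this is a test".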

Some Interesting Problems
An Open Source, secure Hadoop replacement suitable for hospitals and medical records.
  – Must be HIPAA compliant.
  – Must scale well for very large databases.
  – Must have individual access capabilities.
  – Must not have complexity O(disk access) on a DFS. Should use OpenMP and MPI. Should use cache-aware hashing methods.
  – Will be useful well beyond medical records.

Some Interesting Problems
Dynamic Data-Driven Application Systems and Big Data
  – A natural fit, and there is no agreed upon software for DDDAS or DDDAS-BD or DBDDAS. DDDAS has been applied to many, many fields.
  – DDDAS researchers agree something should be produced: not considered an application and too applied to be considered networking research.
  – Need to find a niche or a program officer in a funding agency willing to think outside of the box.
  – Many Big Data issues have long been common to DDDAS.

Some Interesting Problems
Sensors and telemetry
  – SensorML was supposed to provide a standard way of describing sensor data and be able to get the data and deliver it to applications. It went commercial ($$$...$$$) after the original PI retired.
  – A true Open Source, internationally recognized standard would benefit one area of Big Data and DDDAS.

Some Interesting Problems
Reservoirs (oil, gas, water)
  – Dynamic reservoir meshing
      – Vertical wells with micro sensors provide updates to fracked reservoirs.
      – Speed up the meshing so that it can be included in reservoir simulator time (e.g., go from a year to a day).
      – Dynamically improve predictions.
  – Corporate oil/gas fields or pipelines (even small ones) produce excessive amounts of data
      – Open Source data mining tools for specific problems

Some Interesting Problems
Audio and photographic data mining
  – World's largest databases based on VoIP and phone monitoring by many governments (e.g., P.R. China, France, Germany, Kingdom of Saudi Arabia, United Kingdom, USA, …).
  – Keeps disk drive makers in business and lowers hard disk prices very significantly.
Another problem: Find all file duplicates in a file system efficiently. Similar to the sentence problem earlier.
  – Has commercial (e.g., Bing, satellite transmission) and research ramifications that are not nefarious.
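One common way to attack the file duplicate problem on a single machine is to group files by size first and hash only the files whose sizes collide. A minimal Python sketch of that idea, offered as an illustration rather than the method the slides have in mind; the name find_duplicate_files is hypothetical, SHA-256 is an assumed choice of digest, and symbolic links are skipped.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root: str):
    """Return sorted groups of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[(size, digest.hexdigest())].append(path)

    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)
```

As with the sentence problem, the cheap key (file size here, sentence length there) removes most candidates before the expensive comparison runs.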
