From data to information

Presentation transcript:

From data to information Data that’s organized can be processed Is this a requirement? What does “organized” mean? Purpose of map in Markov assignment? Properties of keys? Comparable v. Hashable TreeSet v. HashSet Speed v. order Memory considerations

Foundations for Hash- and Tree-Set Linked lists are typically used to implement hash tables List of frames for film: clip and insert without shifting Nodes that link to each other, not contiguous in memory Self-referential, indirect references, confusing? Why use linked lists? Insert and remove without shifting, add element in constant time, i.e., O(1) add to back Contrast to ArrayList which can double in size Master pointers and indirection Leads to trees and graphs which help structure data into information

Linked lists as recombinant DNA Splice three GTGATAATTC strands into DNA Use strings: length of result is N + 3*10 Generalize to N + B*S (# breaks x size-of-splice) We can use linked lists instead Use same GTGATAATTC if strands are immutable Generalize to N+ S + B, is this an improvement?

Getting in front Suppose we want to add a new element At the back of a string or an ArrayList or a … At the front of a string or an ArrayList or a … Is there a difference? Why? What's complexity? Suppose this is an important problem: we want to grow at the front (and perhaps at the back) Think editing film clips and film splicing Think DNA and gene splicing Self-referential data structures to the rescue References, reference problems, recursion, binky

Goldilocks and the Hashtable A hashtable is a collection of buckets Find the right bucket and search it Bucket organization? Array, linked list, search tree

Structuring Data: The inside story How does a hashtable work? (see SimpleHash.java) What happens with put(key,value) in a HashMap? What happens with getvalue(key)? What happens with remove(key)?
    ArrayList<ArrayList<Combo>> myTable;
    public void put(String key, int value) {
        int bucketIndex = getHash(key);
        ArrayList<Combo> list = myTable.get(bucketIndex);
        if (list == null) {
            list = new ArrayList<Combo>();
            myTable.set(bucketIndex, list);
        }
        list.add(new Combo(key, value));
        mySize++;
    }
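The put method above can be rounded out into a small, self-contained table to see how a bucket is found and then searched. This is only a sketch and not the course's SimpleHash.java: the class name BucketHashSketch, the fixed bucket count, the body of getHash, and the choice to overwrite the value of an existing key are all assumptions made for illustration.

```java
import java.util.ArrayList;

// Hypothetical sketch of a bucket-based hash table; not the course's SimpleHash.java.
class BucketHashSketch {
    // One key/value pair stored in a bucket (name borrowed from the slide).
    static class Combo {
        final String key;
        int value;
        Combo(String key, int value) { this.key = key; this.value = value; }
    }

    private final ArrayList<ArrayList<Combo>> myTable = new ArrayList<>();
    private int mySize = 0;

    BucketHashSketch(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            myTable.add(null);               // buckets are created lazily in put
        }
    }

    // Assumed helper: map a key to a bucket index in [0, bucketCount).
    private int getHash(String key) {
        return Math.floorMod(key.hashCode(), myTable.size());
    }

    public void put(String key, int value) {
        int bucketIndex = getHash(key);
        ArrayList<Combo> list = myTable.get(bucketIndex);
        if (list == null) {
            list = new ArrayList<>();
            myTable.set(bucketIndex, list);
        }
        for (Combo c : list) {               // assumption: an existing key is overwritten
            if (c.key.equals(key)) {
                c.value = value;
                return;
            }
        }
        list.add(new Combo(key, value));
        mySize++;
    }

    // Finds the right bucket, then searches it; returns null if the key is absent.
    public Integer get(String key) {
        ArrayList<Combo> list = myTable.get(getHash(key));
        if (list == null) return null;
        for (Combo c : list) {
            if (c.key.equals(key)) return c.value;
        }
        return null;
    }

    public int size() { return mySize; }
}
```

With few buckets and many keys, each bucket search becomes a long sequential scan, which is one way to think about the bucket-organization question on the Goldilocks slide.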

How do we compare times? Methods?
Dual 2 GHz Power PC, King James Bible (823K words): time to arraylist hash 5.524, default hash 6.137, link hash 4.933; hash size = 34027 for all three
Linux 2.4 GHz, Core Duo, Wordlist (354K words): time to arraylist hash 1.728, default hash 1.416, link hash 1.281; hash size = 354983 for all three
Linux 2.4 GHz, Core Duo, King James Bible (823K words): time to arraylist hash 1.497, default hash 1.128, link hash 1.03; hash size = 34027 for all three
OS X Laptop 2.4 GHz, Core Duo, King James Bible (823K words): time to arraylist hash 1.894, default hash 1.315, link hash 1.335; hash size = 34027 for all three

What’s the Difference Here? How does find-a-track work? Fast forward?

Contrast LinkedList and ArrayList See ISimpleList, SimpleLinkedList, SimpleArrayList Meant to illustrate concepts, not industrial-strength Very similar to industrial-strength, however ArrayList --- why is access O(1) or constant time? Storage in memory is contiguous, all elements same size Where is the 1st element? 40th? 360th? Doesn’t matter what’s in the ArrayList, everything is a pointer or a reference (what about null?)

What about LinkedList? Why is access of Nth element linear time? Why is adding to front constant-time O(1)?

ArrayLists and linked lists as ADTs As an ADT (abstract data type) ArrayLists support Constant-time or O(1) access to the k-th element Amortized constant-time or O(1) add: n adds take linear or O(n) total storage/time Total storage used in n-element vector is approx. 2n, spread over all accesses/additions (why?) Adding a new value in the middle of an ArrayList is expensive, linear or O(n) because shifting required Linked lists as ADT Constant-time or O(1) insertion/deletion anywhere, but… Linear or O(n) time to find where, sequential search Good for sparse structures: when data are scarce, allocate exactly as many list elements as needed, no wasted space/copying (e.g., what happens when vector grows?)

Linked list applications Remove element from middle of a collection, maintain order, no shifting. Add an element in the middle, no shifting What’s the problem with a vector (array)? Emacs visits several files, internally keeps a linked-list of buffers Naively keep characters in a linked list, but in practice too much storage, need more esoteric data structures What’s (3x^5 + 2x^3 + x + 5) + (2x^4 + 5x^3 + x^2 + 4x)? As a vector (3, 0, 2, 0, 1, 5) and (0, 2, 5, 1, 4, 0) As a list ((3,5), (2,3), (1,1), (5,0)) and ________? Most polynomial operations sequentially visit terms, don’t need random access, do need “splicing” What about (3x^100 + 5)?
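The polynomial example can be sketched as a linked list of (coefficient, exponent) terms that are visited sequentially and merged. This is an illustration only: the names PolySketch, Term, add, and show are hypothetical, and terms are assumed to be immutable and kept in decreasing exponent order.

```java
// Hypothetical sketch: a polynomial as a linked list of (coefficient, exponent) terms.
class PolySketch {
    // One term coeff * x^exp; lists are assumed sorted by decreasing exponent.
    static class Term {
        final int coeff;
        final int exp;
        final Term next;
        Term(int coeff, int exp, Term next) {
            this.coeff = coeff;
            this.exp = exp;
            this.next = next;
        }
    }

    // Adds two polynomials by merging their term lists; no random access needed.
    static Term add(Term a, Term b) {
        if (a == null) return b;             // sharing the tail is safe: terms are immutable
        if (b == null) return a;
        if (a.exp > b.exp) return new Term(a.coeff, a.exp, add(a.next, b));
        if (b.exp > a.exp) return new Term(b.coeff, b.exp, add(a, b.next));
        int sum = a.coeff + b.coeff;
        Term rest = add(a.next, b.next);
        return sum == 0 ? rest : new Term(sum, a.exp, rest);   // drop cancelled terms
    }

    // Renders a list in the slide's ((coeff,exp), ...) notation.
    static String show(Term t) {
        StringBuilder sb = new StringBuilder("(");
        for (; t != null; t = t.next) {
            sb.append("(").append(t.coeff).append(",").append(t.exp).append(")");
            if (t.next != null) sb.append(", ");
        }
        return sb.append(")").toString();
    }
}
```

A polynomial such as 3x^100 + 5 needs only two terms in this representation, whereas the vector form needs 101 slots.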

Linked list applications continued If programming in C, there are no “growable-arrays”, so typically linked lists used when # elements in a collection varies, isn’t known, can’t be fixed at compile time Could grow array, potentially expensive/wasteful especially if # elements is small. Also need # elements in array, requires extra parameter With linked list, one pointer used to access all the elements in a collection Simulation/modeling of DNA gene-splicing Given list of millions of CGTA… for DNA strand, find locations where new DNA/gene can be spliced in Remove target sequence, insert new sequence

Linked lists, CDT and ADT As an ADT A list is empty, or contains an element and a list ( ) or (x, (y, ( ) ) ) As a CDT (concrete data type) pojo: plain old Java object
    public class Node {
        String value;
        Node next;
    }
    Node p = new Node();
    p.value = "hello";
    p.next = null;

Building linked lists Add words to the front of a list (draw a picture) Create new node with next pointing to list, reset start of list
    public class Node {
        String value;
        Node next;
        Node(String s, Node link) {
            value = s;
            next = link;
        }
    }
    // … declarations here
    Node list = null;
    while (scanner.hasNext()) {
        list = new Node(scanner.next(), list);
    }
What about adding to the end of the list?

Dissection of add-to-front List initially empty First node has first word Each new word causes new node to be created New node added to front Rhs of operator = completely evaluated before assignment
    list = new Node(word, list);
    Node(String s, Node link) { info = s; next = link; }

Standard list processing (iterative) Visit all nodes once, e.g., count them or process them
    public int size(Node list) {
        int count = 0;
        while (list != null) {
            count++;
            list = list.next;
        }
        return count;
    }
What changes in code if we generalize what process means? Print nodes? Append “s” to all strings in list?
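One possible way to generalize what "process" means is to keep the traversal loop and pass the per-node work in as a parameter. The sketch below assumes Java 8 or later; the names ListVisitSketch and forEachNode are hypothetical and not part of the course code.

```java
import java.util.function.Consumer;

// Hypothetical sketch: the same loop as size(), with the per-node action passed in.
class ListVisitSketch {
    static class Node {
        String value;
        Node next;
        Node(String s, Node link) { value = s; next = link; }
    }

    // Visits every node once, front to back, applying process to each.
    static void forEachNode(Node list, Consumer<Node> process) {
        while (list != null) {
            process.accept(list);
            list = list.next;
        }
    }
}
```

Appending "s" to every string then becomes forEachNode(list, n -> n.value = n.value + "s"), and counting becomes an action that increments a counter; only the action changes, not the loop.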

Nancy Leveson: Software Safety Founded the field Mathematical and engineering aspects Air traffic control Microsoft Word "C++ is not state-of-the-art, it's only state-of-the-practice, which in recent years has been going backwards" Software and steam engines: once extremely dangerous? http://sunnyday.mit.edu/steam.pdf THERAC 25: Radiation therapy machine whose software errors caused massive overdoses that killed or seriously injured patients http://sunnyday.mit.edu/papers/therac.pdf

Building linked lists continued What about adding a node to the end of the list? Can we search and find the end? If we do this every time, what’s complexity of building an N-node list? Why? Alternatively, keep pointers to first and last nodes of list If we add node to end, which pointer changes? What about initially empty list: values of pointers? Will lead to consideration of header node to avoid special cases in writing code What about keeping list in order, adding nodes by splicing into list? Issues in writing code? When do we stop searching?
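The first-and-last-pointer idea can be sketched as follows. The class name TailListSketch and its methods are hypothetical; the empty-list special case that motivates a header node shows up as the if statement in addLast.

```java
// Hypothetical sketch: keep pointers to the first and last nodes so adding at the end is O(1).
class TailListSketch {
    static class Node {
        String value;
        Node next;
        Node(String s, Node link) { value = s; next = link; }
    }

    private Node first = null;   // both pointers are null for an empty list
    private Node last = null;

    // Adds to the end without searching for it.
    void addLast(String s) {
        Node node = new Node(s, null);
        if (first == null) {
            first = node;        // special case: the new node is also the first node
        } else {
            last.next = node;    // splice after the current last node
        }
        last = node;             // the last pointer changes on every add
    }

    // Joins the values front to back, separated by spaces.
    String join() {
        StringBuilder sb = new StringBuilder();
        for (Node n = first; n != null; n = n.next) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(n.value);
        }
        return sb.toString();
    }
}
```

Building an N-node list this way costs N constant-time adds, whereas searching for the end on every add costs 1 + 2 + … + N steps in total.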

Standard list processing (recursive) Visit all nodes once, e.g., count them public int recsize(Node list) { if (list == null) return 0; return 1 + recsize(list.next); } Base case is almost always empty list: null pointer Must return correct value, perform correct action Recursive calls use this value/state to anchor recursion Sometimes one node list also used, two “base” cases Recursive calls make progress towards base case Almost always using list.next as argument

Recursion with pictures Counting recursively
    int recsize(Node list) {
        if (list == null) return 0;
        return 1 + recsize(list.next);
    }
    System.out.println(recsize(ptr));
(Figure: a stack of four recsize(Node list) frames for the list that ptr refers to, each waiting on return 1 + recsize(list.next).)

Recursion and linked lists Print nodes in reverse order Print all but first node and… Print first node before or after other printing?
    public void print(Node list) {
        if (list != null) {
            // one of the two orderings below goes here
        }
    }
    // Ordering 1:
    print(list.next);
    System.out.println(list.info);
    // Ordering 2:
    System.out.println(list.info);
    print(list.next);
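To compare the two orderings, the sketch below runs both on the same list. It appends to a StringBuilder instead of calling System.out.println so the result can be inspected; the names PrintOrderSketch, forward, and reverse are hypothetical.

```java
// Hypothetical sketch: the two orderings of "print this node" and "recurse on the rest".
class PrintOrderSketch {
    static class Node {
        String info;
        Node next;
        Node(String s, Node link) { info = s; next = link; }
    }

    // Handle the first node, then the rest: nodes appear front to back.
    static void forward(Node list, StringBuilder out) {
        if (list != null) {
            out.append(list.info).append('\n');
            forward(list.next, out);
        }
    }

    // Handle the rest, then the first node: nodes appear back to front.
    static void reverse(Node list, StringBuilder out) {
        if (list != null) {
            reverse(list.next, out);
            out.append(list.info).append('\n');
        }
    }
}
```

The list itself is never modified; the reversal comes entirely from the order in which the pending calls return.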

Complexity Practice What is complexity of build? (what does it do?)
    public Node build(int n) {
        if (0 == n) return null;
        Node first = new Node(n, build(n-1));
        for (int k = 0; k < n-1; k++) {
            first = new Node(n, first);
        }
        return first;
    }
Write an expression for T(n) and for T(0), solve. Let T(n) be time for build to execute when called with n T(n) = T(n-1) + O(n)
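One way to solve the recurrence is to unroll it. The derivation below assumes that each node construction and loop iteration costs a constant, so the O(n) term is written as c n, and that T(0) is a constant c_0; both constants are introduced here for illustration.

```latex
\begin{align*}
T(0) &= c_0 \\
T(n) &= T(n-1) + c\,n \\
     &= T(n-2) + c\,(n-1) + c\,n \\
     &= T(0) + c\,(1 + 2 + \cdots + n) \\
     &= c_0 + c\,\frac{n(n+1)}{2} \;=\; O(n^2)
\end{align*}
```

The same sum counts the nodes: each call adds n nodes on top of the list returned by build(n-1), so the result has 1 + 2 + … + n = n(n+1)/2 nodes.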

Changing a linked list recursively Pass list to method, return altered list, assign to list Idiom for changing value parameters
    list = change(list, "apple");
    public Node change(Node list, String key) {
        if (list != null) {
            list.next = change(list.next, key);
            if (list.info.equals(key)) return list.next;
            else return list;
        }
        return null;
    }
What does this code do? How can we reason about it? Empty list, one-node list, two-node list, n-node list Similar to proof by induction
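Tracing change on a small list is one way to reason about it. The sketch below wraps the slide's method in a hypothetical class ChangeSketch with a helper join (also hypothetical) that shows the list contents.

```java
// Hypothetical harness around the slide's change method.
class ChangeSketch {
    static class Node {
        String info;
        Node next;
        Node(String s, Node link) { info = s; next = link; }
    }

    // Same logic as the slide: fix up the rest of the list, then decide about this node.
    static Node change(Node list, String key) {
        if (list != null) {
            list.next = change(list.next, key);
            if (list.info.equals(key)) return list.next;   // skip this node
            else return list;                              // keep this node
        }
        return null;
    }

    // Joins the info fields front to back, separated by spaces.
    static String join(Node list) {
        StringBuilder sb = new StringBuilder();
        for (Node n = list; n != null; n = n.next) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(n.info);
        }
        return sb.toString();
    }
}
```

Starting from apple, pear, apple, fig, the call list = change(list, "apple") leaves pear, fig: every node whose info equals the key is unlinked, including the first one, which is why the result must be assigned back to list.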

Analyzing Algorithms Consider three solutions to SortByFreqs Sort, then scan looking for changes Insert into Set, then count each unique string Find unique elements without sorting, sort these, then count each unique string Use a Map (TreeMap or HashMap) We want to discuss trade-offs of these solutions Ease to develop, debug, verify Runtime efficiency Vocabulary for discussion

What is big-Oh about? (preview) Intuition: avoid details when they don’t matter, and they don’t matter when input size (N) is big enough For polynomials, use only leading term, ignore coefficients y = 3x, y = 6x-2, y = 15x + 44; y = x^2, y = x^2-6x+9, y = 3x^2+4x The first family is O(n), the second is O(n^2) Intuition: family of curves, generally the same shape More formally: O(f(n)) is an upper-bound, when n is large enough the expression cf(n) is larger Intuition: linear function: double input, double time, quadratic function: double input, quadruple the time

Recall adding to list (class handout) Add one element to front of ArrayList Shift all elements Cost N for N-element list Cost 1 + 2 + … + N = N(N+1)/2 if repeated Add one element to front of LinkedList No shifting, add one link Cost is independent of N, constant-time cost Cost 1 + 1 + … + 1 = N if repeated
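The two sums can be checked by counting work directly. The sketch below uses a plain int array and a hand-written node rather than java.util.ArrayList and java.util.LinkedList, and it counts element writes and link creations as the unit of cost; the class and method names are hypothetical.

```java
// Hypothetical sketch: count the work done by n add-to-front operations.
class FrontInsertCostSketch {
    static class Node {
        int value;
        Node next;
        Node(int v, Node link) { value = v; next = link; }
    }

    // Array-backed list: shift everything right by one, then write the new element.
    static long arrayFrontCost(int n) {
        int[] a = new int[n];
        long writes = 0;
        for (int size = 0; size < n; size++) {
            for (int k = size; k > 0; k--) {
                a[k] = a[k - 1];
                writes++;
            }
            a[0] = size;
            writes++;
        }
        return writes;           // 1 + 2 + ... + n = n(n+1)/2
    }

    // Linked list: one new node and one link per add, independent of the list length.
    static long linkedFrontCost(int n) {
        Node list = null;
        long links = 0;
        for (int i = 0; i < n; i++) {
            list = new Node(i, list);
            links++;
        }
        return links;            // 1 + 1 + ... + 1 = n
    }
}
```

For n = 1000 the array version performs 500500 writes and the linked version creates 1000 links.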

More on O-notation, big-Oh Big-Oh hides/obscures some empirical analysis, but is good for general description of algorithm Allows us to compare algorithms in the limit 20N hours vs N^2 microseconds: which is better? O-notation is an upper-bound, this means that N is O(N), but it is also O(N^2); we try to provide tight bounds. Formally: A function g(N) is O(f(N)) if there exist constants c and n such that g(N) < cf(N) for all N > n (Figure: the curve cf(N) lies above g(N) to the right of x = n.)

Big-Oh calculations from code Search for element in an array: What is complexity of code (using O-notation)? What if array doubles, what happens to time?
    for (int k = 0; k < a.length; k++) {
        if (a[k].equals(target)) return true;
    }
    return false;
Complexity if we call N times on M-element vector? What about best case? Average case? Worst case?

Amortization: Expanding ArrayLists Expand capacity of list when add() called Calling add N times, doubling capacity as needed What if we grow size by one each time?

Item #               Resizing cost  Cumulative cost  Resizing cost per item  Capacity after add
1                    0              0                0                       1
2                    2              2                1                       2
3-4                  4              6                1.5                     4
5-8                  8              14               1.75                    8
2^m + 1 to 2^(m+1)   2^(m+1)        2^(m+2) - 2      around 2                2^(m+1)
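The question about growing by one can be explored with a small simulation. The sketch below assumes an initial capacity of 1 and counts the elements copied at each resize, which is a slightly different accounting from the table (the table charges the new capacity at each resize); the class and method names are hypothetical.

```java
// Hypothetical sketch: total elements copied during resizes for two growth policies.
class GrowthCostSketch {
    // Capacity doubles whenever the list is full.
    static long copiesWithDoubling(int n) {
        long copies = 0;
        int capacity = 1;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copies += size;      // copy every existing element to the new array
                capacity *= 2;
            }
        }
        return copies;
    }

    // Capacity grows by one whenever the list is full.
    static long copiesGrowByOne(int n) {
        long copies = 0;
        int capacity = 1;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copies += size;
                capacity += 1;
            }
        }
        return copies;
    }
}
```

For 1024 adds, doubling copies 1 + 2 + 4 + … + 512 = 1023 elements in total, about one per add, while growing by one copies 1 + 2 + … + 1023 = 523776 elements, which is quadratic in the number of adds.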

Some helpful mathematics
    1 + 2 + 3 + 4 + … + N = N(N+1)/2, exactly = N^2/2 + N/2, which is O(N^2). Why?
    N + N + N + … + N (total of N times) = N*N = N^2, which is O(N^2)
    N + N + N + … + N + … + N + … + N (total of 3N times) = 3N*N = 3N^2, which is O(N^2)
    1 + 2 + 4 + … + 2^N = 2^(N+1) - 1 = 2 x 2^N - 1, which is O(2^N)
Impact of last statement on adding 2^N + 1 elements to a vector: 1 + 2 + … + 2^N + 2^(N+1) = 2^(N+2) - 1 = 4 x 2^N - 1, which is O(2^N); resizing + copy = total (let x = 2^N)

Running times @ 10^6 instructions/sec (in seconds unless noted)

N              O(log N)   O(N)      O(N log N)  O(N^2)
10             0.000003   0.00001   0.000033    0.0001
100            0.000007   0.00010   0.000664    0.0100
1,000          0.000010   0.00100   0.010000    1.0
10,000         0.000013   0.01000   0.132900    1.7 min
100,000        0.000017   0.10000   1.661000    2.78 hr
1,000,000      0.000020   1.0       19.9        11.6 day
1,000,000,000  0.000030   16.7 min  8.3 hr      318 centuries