# Geometric Algorithms and Data Structures


Prof. Neeraj Suri, Andreas Johansson, Constantin Sarbu, Abdelmajid Khelil

Outline
- Introduction
- Geometric data structures
  - Quadtree: region quadtree, point quadtree
  - K-d tree
  - Strip tree
  - K-d trie, binary trie
- Multidimensional data
  - Z-order
  - Multidimensional data
  - Data mining

Geometric Problems (1)
- Algorithmic geometry: the study of the algorithmic complexity of elementary geometric problems.
- Geometric problems are often abstract formulations of practical problems (similar to graph theory).

Some geometric problems and their interpretation:
- Given a set of points in the plane, find all points within a rectangle.
  - "Clipping" in VR
  - Finding tuples in a database whose values for attributes A1 and A2 lie within given bounds
  - Generalization: searching in k dimensions (all points contained in a k-dimensional range)

Geometric Problems (2)
- Given a set of rectangles in the plane, find all pairwise intersecting rectangles.
  - Correctness checks in Very Large Scale Integration (VLSI) design, with chip layers modelled as rectangles
- Given a set of 3-dimensional objects (compounds), find all pairwise intersecting objects.
  - Ensuring the required minimum distance or safety margin in CAD
- Given a set of rectangles in the plane, find their intersection area.
  - Geographic Information Systems (GIS): generic shapes are approximated by rectangles, and areas with specific properties are determined across distinct maps (e.g. find regions that are sandy (map 1), wet (map 2), and between 200 and 300 m altitude (elevation map))

Geometric Problems (3)
- Given a set of polyhedrons in space, determine the edges or portions of edges that are visible or hidden from a viewpoint.
  - Computing a realistic view of a 3-dimensional scene
  - Determining the coverage area of a transmitter and the area with no reception
- Given a set of points in a k-dimensional space and a query point P, find the point S closest to P.
  - Voice recognition: a spoken word is characterized by features and compared with the vocabulary (a point set in a k-dimensional space)

Classification of Geometric Problems
There are 2 classes of problems:
- Set problems: compute a property of interest of a set of objects S, e.g. the outline of the area covered by S.
- Search problems: given a set of objects S and a query object q, find all objects in S that have a specific relation with q.

Set problems are often reducible to search problems. E.g. plane-sweep algorithms reduce a k-dimensional set problem to a (k-1)-dimensional search problem.

Search problems are solved by organizing S with the aid of appropriate data structures and indexing.

First Problem
How do we efficiently represent this figure?

Representing Figures (1)
How about a matrix representation? Black = 1, empty = 0. This is not very effective.

Representing Figures (2)
Idea: represent areas, not points. The areas are then represented using another structure; quadtrees do this.

Quadtree: a class of hierarchical data structures based on the recursive decomposition of space.

Quadtrees can be differentiated by:
- Data type represented: point data, regions, curves, surfaces, and volumes
- Principle of decomposition: regular vs. input-driven
- Resolution: fixed vs. variable number of decomposition steps

Examples:
- Region quadtree
- Point quadtree

Literature: Samet, H.; “The Quadtree and Related Hierarchical Data Structures”, ACM Comp. Surveys, Vol. 16, No. 2, June 1984 (available from ACM DL)

Region Quadtree
Successive subdivision of the image array into 4 equal-sized quadrants.

Basic idea:
- The figure is given as an image array, i.e. every pixel of the figure has a value of 1 and all other pixels have a value of 0.
- The entire area (image array) is subdivided into 4 equal-sized quadrants (the array usually has $2^k \times 2^k$ pixels).
- Upon each division one has to check whether the image array of a quadrant is homogeneous (i.e. only 1s or only 0s):
  - homogeneous → no further subdivision
  - heterogeneous → further subdivisions until homogeneous (possibly down to single pixels)

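Figure: the four quadrants NW, NE, SW, SE and the directions N, E, S, W.

Leaf nodes are said to be either BLACK or WHITE. Non-leaf nodes are said to be GREY.

Figure: example quadtree with a GREY root whose children (NW, NE, SW, SE) are BLACK or WHITE leaves.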

Figure: subdivision of the example image in three steps (Step 1, Step 2, Step 3).
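The recursive subdivision described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the node encoding (the strings "BLACK" and "WHITE" for leaves, a list of four children in the order NW, NE, SW, SE for GREY nodes), the function name `build`, and the sample image are assumptions made here. The image is assumed to be square with a side length that is a power of two, with row 0 at the top (north).

```python
def build(img, x=0, y=0, size=None):
    """Build a region quadtree for the square block of img at (x, y)."""
    if size is None:
        size = len(img)
    first = img[y][x]
    # Homogeneous block (only 1s or only 0s): no further subdivision.
    if all(img[r][c] == first
           for r in range(y, y + size)
           for c in range(x, x + size)):
        return "BLACK" if first == 1 else "WHITE"
    # Heterogeneous block: split into 4 equal quadrants (GREY node).
    h = size // 2
    return [
        build(img, x, y, h),          # NW
        build(img, x + h, y, h),      # NE
        build(img, x, y + h, h),      # SW
        build(img, x + h, y + h, h),  # SE
    ]

# Hypothetical 4 x 4 image used only for illustration.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
print(build(image))
# ['WHITE', 'BLACK', ['WHITE', 'BLACK', 'BLACK', 'BLACK'], 'WHITE']
```

Only the SW quadrant is heterogeneous here, so it is the only one that is subdivided down to single pixels.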

Quadtrees are especially useful for performing set operations:
- Overlap (intersection)
- Overlay (union)

Example: from data provided on forests, grassland, fields, nature reserves and polders, identify which areas are in agricultural use (a typical overlay problem).

Union (overlay) of two quadtrees QT1 and QT2: traverse QT1 top-down, beginning with the root, and compare each node with the corresponding node in QT2.
- If the node in QT1 is BLACK, the corresponding node in the resulting quadtree is also BLACK.
- If the node in QT1 is WHITE, the node in the resulting quadtree is set to the node in QT2.
- If the node in QT1 is GREY, the node in the resulting quadtree is set to
  - GREY if the node in QT2 is GREY,
  - GREY if the node in QT2 is WHITE (the subtree of QT1 is taken over),
  - BLACK if the node in QT2 is BLACK.
- If both nodes are GREY, the algorithm descends to the next level and, on return, consolidates the result if necessary.

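Decision table (rows: node in QT1, columns: node in QT2):

| QT1 \ QT2 | BLACK | WHITE | GREY |
|---|---|---|---|
| BLACK | BLACK | BLACK | BLACK |
| WHITE | BLACK | WHITE | GREY |
| GREY | BLACK | GREY | GREY 1) |

1) A check for a merger needs to be performed to determine whether all 4 sons are BLACK.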
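The union traversal and its decision table can be sketched in a few lines. This is a minimal sketch under assumptions made here: leaves are the strings "BLACK" and "WHITE", a GREY node is a list of four children in the order NW, NE, SW, SE, and the function name `union` is hypothetical. Subtrees are shared with the inputs rather than copied.

```python
def union(a, b):
    """Union (overlay) of two region quadtrees covering the same area."""
    if a == "BLACK" or b == "BLACK":
        return "BLACK"          # BLACK dominates
    if a == "WHITE":
        return b                # take over the node (or subtree) of QT2
    if b == "WHITE":
        return a                # take over the subtree of QT1
    # Both GREY: process the next level, then check for a merger.
    children = [union(ca, cb) for ca, cb in zip(a, b)]
    if all(c == "BLACK" for c in children):
        return "BLACK"          # all 4 sons BLACK, so consolidate
    return children

qt1 = ["BLACK", "WHITE", "WHITE", "BLACK"]
qt2 = ["WHITE", "BLACK", "BLACK", ["WHITE", "BLACK", "WHITE", "WHITE"]]
print(union(qt1, qt2))  # BLACK
```

In the example both roots are GREY, every pair of children yields BLACK, and the merger check collapses the result into a single BLACK node.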

Intersection (overlap) of two quadtrees QT1 and QT2: traverse QT1 top-down, beginning with the root, and compare each node with the corresponding node in QT2.
- If the node in QT1 is BLACK and the node in QT2 is BLACK, the corresponding node in the resulting quadtree is set to BLACK.
- If the node in QT1 or in QT2 is WHITE, the resulting node is WHITE.
- If the node in QT1 is GREY, the node in the resulting quadtree is set to
  - GREY if the node in QT2 is also GREY,
  - WHITE if the node in QT2 is WHITE,
  - GREY if the node in QT2 is BLACK (the subtree of QT1 is taken over).
- If both nodes are GREY, the algorithm descends to the next level and, on return, consolidates the result if necessary.

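Decision table (rows: node in QT1, columns: node in QT2):

| QT1 \ QT2 | BLACK | WHITE | GREY |
|---|---|---|---|
| BLACK | BLACK | WHITE | GREY |
| WHITE | WHITE | WHITE | WHITE |
| GREY | GREY | WHITE | GREY 1) |

1) A check for a merger needs to be performed to determine whether all 4 sons are WHITE.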
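The intersection mirrors the union with the roles of BLACK and WHITE exchanged. As before, this is only a sketch: the node encoding ("BLACK", "WHITE", or a list of four children in the order NW, NE, SW, SE for GREY) and the name `intersection` are assumptions made for illustration.

```python
def intersection(a, b):
    """Intersection (overlap) of two region quadtrees covering the same area."""
    if a == "WHITE" or b == "WHITE":
        return "WHITE"          # WHITE dominates
    if a == "BLACK":
        return b                # take over the node (or subtree) of QT2
    if b == "BLACK":
        return a                # take over the subtree of QT1
    # Both GREY: process the next level, then check for a merger.
    children = [intersection(ca, cb) for ca, cb in zip(a, b)]
    if all(c == "WHITE" for c in children):
        return "WHITE"          # all 4 sons WHITE, so consolidate
    return children

print(intersection(["BLACK", "WHITE", "BLACK", "WHITE"],
                   ["BLACK", "BLACK", "WHITE", "WHITE"]))
# ['BLACK', 'WHITE', 'WHITE', 'WHITE']
```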

Complexity Analysis
The complexity is proportional to the number of nodes in the quadtree.
- Best case: the whole area is unicolored (1 node).
- Worst case: “salt and pepper”, i.e. all inner nodes are GREY and one needs to go down to pixel level (depends on the resolution).

Point data
- 2-D points can be stored and indexed in a point quadtree.
- A point quadtree splits the space into 4 quadrants at the insertion point.
- The insertion order is thus important (it determines the structure of the tree).

Insertion order: Chicago, Mobile, Toronto, Buffalo, Denver, Omaha, Atlanta, Miami. The area spans (0,0) to (100,100).

| City | (x, y) |
|---|---|
| Chicago | (35,40) |
| Mobile | (50,10) |
| Toronto | (60,75) |
| Buffalo | (80,65) |
| Denver | (5,45) |
| Omaha | (25,35) |
| Atlanta | (85,15) |
| Miami | (90,5) |

Resulting point quadtree for the same insertion order:
- Chicago (root): NW Denver, NE Toronto, SW Omaha, SE Mobile
- Toronto: SE Buffalo
- Mobile: NE Atlanta, SE Miami

Typical query: “find all points (records) within a given distance from another point (record)”.

Example: find all the cities at most 8 units from the point (83,10).

Find all the cities at most 8 units from the point (83,10):
- The root is Chicago (35,40) → NW, NE and SW can be ignored.
- Next is Mobile (50,10) → NW and SW can be ignored.
- Are Atlanta or Miami within 8 units?

Solutions based on approximations with rectangles (bounding boxes) can contain false positives; the exact solution uses a circle. Here Miami (90,5) lies inside the bounding box of the query circle, but its distance to (83,10) is about 8.6, so only Atlanta (85,15), at a distance of about 5.4, is reported.
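Insertion and the distance query can be sketched together. The class and function names (`Node`, `quadrant`, `insert`, `within`) are assumptions made here, as is the convention that points with equal coordinates go to the N or E side. The search prunes quadrants using the bounding box of the query circle and then applies the exact circle test to every visited point.

```python
import math

CITIES = [
    ("Chicago", 35, 40), ("Mobile", 50, 10), ("Toronto", 60, 75),
    ("Buffalo", 80, 65), ("Denver", 5, 45), ("Omaha", 25, 35),
    ("Atlanta", 85, 15), ("Miami", 90, 5),
]

class Node:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y
        self.children = {}  # quadrant label ("NW", "NE", "SW", "SE") -> Node

def quadrant(node, x, y):
    """Quadrant of (x, y) relative to the point stored in node."""
    return ("N" if y >= node.y else "S") + ("E" if x >= node.x else "W")

def insert(root, name, x, y):
    if root is None:
        return Node(name, x, y)
    node = root
    while True:
        q = quadrant(node, x, y)
        if q not in node.children:
            node.children[q] = Node(name, x, y)
            return root
        node = node.children[q]

def within(node, cx, cy, r, found):
    """Collect the names of all points at most r away from (cx, cy)."""
    if math.hypot(node.x - cx, node.y - cy) <= r:   # exact circle test
        found.append(node.name)
    for q, child in node.children.items():
        # Skip quadrants that the bounding box of the circle cannot reach.
        if q[0] == "N" and cy + r < node.y:
            continue
        if q[0] == "S" and cy - r >= node.y:
            continue
        if q[1] == "E" and cx + r < node.x:
            continue
        if q[1] == "W" and cx - r >= node.x:
            continue
        within(child, cx, cy, r, found)
    return found

root = None
for name, x, y in CITIES:
    root = insert(root, name, x, y)

print(within(root, 83, 10, 8, []))  # ['Atlanta']
```

Only Chicago, Mobile, Atlanta and Miami are visited for this query; the other four cities are pruned without a distance computation.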

Point quadtrees are especially suitable for search problems of the following type: “find all points (records) within a given distance from another point (record)”.

Point quadtrees are quite efficient for 2 dimensions. In k > 2 dimensions, however, point quadtrees have a large branching factor and thus contain many NULL pointers.

K-d Trees
K-d trees store k-dimensional point data. We want to avoid the large fan-out of the point quadtree:
- Quadtrees: $2^2 = 4$-way split
- Octrees: $2^3 = 8$-way split
- In general: $2^k$-way split

A k-d tree is a binary search tree with the distinction that at each level a different coordinate (dimension) is tested to determine the direction of the branch (2-way split).

A node consists of:
- 2 child pointers
- Name
- Key

K-d Tree: Basic Idea
Construct a binary tree:
- At each step, choose one of the coordinates as the basis for dividing the rest of the points.
- For example, at the root, choose x as the basis.
  - As in binary search trees, all items to the left of the root have an x-coordinate less than that of the root.
  - All items to the right of the root have an x-coordinate greater than (or equal to) that of the root.
- Choose y as the basis for discrimination for the root’s children.
- Choose x again for the root’s grandchildren.

K-d Tree: Example
Insertion order: Chicago (35,40), Mobile (50,10), Toronto (60,75), Buffalo (80,65), Denver (5,45), Omaha (25,35), Atlanta (85,15), Miami (90,5). The area spans (0,0) to (100,100), and the discriminator alternates between x and y from level to level.

Resulting k-d tree:
- Chicago (discriminator x): left (x < x_Chicago) Denver, right (x ≥ x_Chicago) Mobile
- Denver (discriminator y): left Omaha
- Mobile (discriminator y): left (y < y_Mobile) Miami, right (y ≥ y_Mobile) Toronto
- Toronto (discriminator x): right Buffalo
- Buffalo (discriminator y): left Atlanta

Fewer NULL pointers!
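The insertion rule with an alternating discriminator can be sketched as follows. The names `KDNode` and `insert` are assumptions made here; the discriminator is chosen as `depth % k`, which gives x at the root, y at its children, and x again at the grandchildren.

```python
class KDNode:
    def __init__(self, name, point):
        self.name, self.point = name, point
        self.left = self.right = None

def insert(node, name, point, depth=0):
    """Insert a point; the tested coordinate alternates with the depth."""
    if node is None:
        return KDNode(name, point)
    axis = depth % len(point)            # 0 = x, 1 = y, then x again, ...
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, name, point, depth + 1)
    else:                                # greater than or equal: go right
        node.right = insert(node.right, name, point, depth + 1)
    return node

CITIES = [
    ("Chicago", (35, 40)), ("Mobile", (50, 10)), ("Toronto", (60, 75)),
    ("Buffalo", (80, 65)), ("Denver", (5, 45)), ("Omaha", (25, 35)),
    ("Atlanta", (85, 15)), ("Miami", (90, 5)),
]

root = None
for name, point in CITIES:
    root = insert(root, name, point)

print(root.name, root.left.name, root.right.name)  # Chicago Denver Mobile
```

Each node carries only two child pointers, regardless of the dimension k.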

Adaptive k-d Tree
Like the k-d tree, but:
- The division is between (not on) data points.
- The division does not alternate the discriminator but follows the dimension with the maximum spread (max - min).

Balanced k-d tree:
- Internal nodes contain only the split coordinate and its value (e.g. X=30).
- The records are stored at the terminal nodes (leaves).
- Insertion of one record requires rebuilding the tree (static structure).
- Deletion of one record is highly complex.
- Search works as in the k-d tree.

Example (area from (0,0) to (100,100)), with the split values shown in the figure:
- Root: x = 55
  - Left: x = 30
    - x = 15, separating Denver (5,45) and Omaha (25,35)
    - y = 25, separating Mobile (50,10) and Chicago (35,40)
  - Right: y = 40
    - y = 10, separating Miami (90,5) and Atlanta (85,15)
    - x = 70, separating Toronto (60,75) and Buffalo (80,65)
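A static build of an adaptive k-d tree might look like the sketch below. Two details are assumptions made here rather than statements from the slides: the points are split at the median so that the tree is balanced, and the split value is the midpoint between the two middle points. With the city data these choices reproduce the split values 55, 30, 15, 25, 40, 10 and 70. The dictionary layout and the name `build` are also hypothetical, and all points are assumed to be distinct in the chosen dimension.

```python
def build(points):
    """points: list of (name, (x, y, ...)). Leaves are the records themselves."""
    if len(points) == 1:
        return points[0]
    k = len(points[0][1])

    def spread(axis):
        values = [p[1][axis] for p in points]
        return max(values) - min(values)

    axis = max(range(k), key=spread)             # dimension with maximum spread
    pts = sorted(points, key=lambda p: p[1][axis])
    mid = len(pts) // 2
    # Divide between (not on) the two middle data points.
    split = (pts[mid - 1][1][axis] + pts[mid][1][axis]) / 2
    return {"axis": axis, "split": split,
            "low": build(pts[:mid]), "high": build(pts[mid:])}

CITIES = [
    ("Chicago", (35, 40)), ("Mobile", (50, 10)), ("Toronto", (60, 75)),
    ("Buffalo", (80, 65)), ("Denver", (5, 45)), ("Omaha", (25, 35)),
    ("Atlanta", (85, 15)), ("Miami", (90, 5)),
]

tree = build(CITIES)
print(tree["axis"], tree["split"])                  # 0 55.0
print(tree["high"]["axis"], tree["high"]["split"])  # 1 40.0
```

Because every insertion can change spreads and medians, the sketch rebuilds from the full point list, which matches the static character of the structure.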

- Point quadtree: parallelizable, dynamic
- K-d tree: not easily parallelizable, dynamic, better sequential data structure
- Adaptive k-d tree: not easily parallelizable, static, balanced, optimized search

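Curvilinear Data: Strip Tree (Example)
Basic idea: represent the curve by strips enclosing portions of it.
- The root strip A encloses the whole curve between its endpoints P and Q; Wl and Wr are the strip widths on the two sides of the line PQ.
- The splitting point for A is selected on the side with the larger width (here Wl > Wr).
- The strips become successively thinner.
- The splitting finishes when all strips are thinner than a predefined value.

Figure: curve from P to Q with root strip A, the splitting points for A and C, and the resulting strip tree with nodes A, B, C, D, E.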

Strip Tree: Algorithm
Recursive splitting:
- Join the endpoints of the curve (i.e. P and Q).
- The root corresponds to a rectangle that encloses the curve and whose sides are parallel to the line PQ.
- The next split point
  - lies on the curve and on one side of the strip rectangle,
  - has maximum distance to the line PQ.

Node structure: the node is an 8-tuple and contains
- 2 pairs of X,Y coordinates (the diagonal endpoints),
- the strip width on each side of the line connecting the endpoints,
- pointers to the 2 sons.
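The recursive splitting can be sketched for a curve given as a polyline. Everything below is an illustrative assumption: the curve is a list of (x, y) vertices with distinct endpoints in every sub-curve, a node is a dictionary holding the 8 values named above, and `tol` is the predefined strip width at which splitting stops.

```python
import math

def signed_dist(p, a, b):
    """Signed distance of p from the line through a and b (positive = left)."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    length = math.hypot(bx - ax, by - ay)
    return ((bx - ax) * (py - ay) - (by - ay) * (px - ax)) / length

def strip_tree(pts, tol):
    a, b = pts[0], pts[-1]                      # join the endpoints P and Q
    d = [signed_dist(p, a, b) for p in pts]
    wl, wr = max(0.0, max(d)), max(0.0, -min(d))
    node = {"p": a, "q": b, "wl": wl, "wr": wr, "left": None, "right": None}
    if wl + wr > tol and len(pts) > 2:
        # Split at the curve point with maximum distance to the line PQ.
        i = max(range(1, len(pts) - 1), key=lambda j: abs(d[j]))
        node["left"] = strip_tree(pts[:i + 1], tol)
        node["right"] = strip_tree(pts[i:], tol)
    return node

curve = [(0, 0), (1, 2), (2, 0), (3, -1), (4, 0)]   # hypothetical polyline
root = strip_tree(curve, tol=0.5)
print(root["wl"], root["wr"])  # 2.0 1.0
```

The root strip has width 2 on the left and 1 on the right of PQ, so the first split point is the vertex (1, 2).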

Representation of Arbitrary Curves
- Curves are well represented by chains; however, indexing them is difficult.
- A strip tree is a quadtree variant for representing arbitrary curves by hierarchical decomposition.
- Strip trees are useful in applications that involve search and set operations.

Trees and Tries
We have seen (normal) trees for storing figures. We can also use tries! Tries store the key “along the way”.

Kd-Tries: Example
- The key is stored along the path from the root, e.g. “RDRU”.
- Branch labels: L = left and R = right (x dimension), D = down and U = up (y dimension).
- The complete keys are located at the leaves.

Figure: kd-trie whose levels alternate between the x dimension (L/R) and the y dimension (U/D), with the leaf RDRU highlighted.

Binary Tries
A binary trie is a binary tree in which left sons correspond to a “0” at the corresponding position in the key and right sons correspond to a “1”.

Figure: binary trie containing the key 100101.

Geometric Interpretation of the Binary Trie
- A trie compresses a 1-dimensional space with $2^d$ addresses, through coding, to a string with d characters. In the previous example: d = 3 + 3 = 6.
- The root represents the complete space.
- The left son (first character = 0) represents the lower half of the search space.
- The right son (first character = 1) represents the upper half of the search space.

Binary Tries, Revisited
- In 2D each key is a pair of bit sequences (x, y): X0X1X2 is the binary x coordinate of the cell and Y0Y1Y2 is the binary y coordinate of the cell, each ranging from 000 to 111.
- The path to the key is composed of bits that are taken from the x and y coordinates on a rotating basis.

Figure: binary trie for the key 100101 next to the grid with X0X1X2 and Y0Y1Y2 axes.

Observations
- A kd-trie splits by rotating through the x and y coordinates.
- A kd-trie is unique for a given set of keys: the trie structure does not depend on the insertion order.
- Geometric kd-tries generate a total order of the search space: two points P1 and P2 in the kd-space will always have the same order.

Building a Linear Order
Given a 2D grid, how can we
1. find a linear order for the cells of the grid such that cells close together in space are also (as far as possible) close to each other in the linear order, and
2. define this order recursively for a grid that is obtained by a hierarchical subdivision of space?

The most popular solution is bit interleaving (Z-order).

Z-Order
- Addresses in a 2-dimensional space are identified by pairs (x, y) of values.
- Each x and y value is a sequence of d bits.
- This results in a grid with $2^d \times 2^d$ cells.
- How do we build the addresses using bit interleaving?

Start with a vertical split for X0 (Z = X0): in the 8 × 8 grid (X0X1X2 on the horizontal axis, Y0Y1Y2 on the vertical axis, each from 000 to 111) all cells in the left half (X0X1X2 = 000 to 011) get the value 0 and all cells in the right half (X0X1X2 = 100 to 111) get the value 1.

Z-Order
Horizontal split for Y0 (Z = X0Y0):

| Y0Y1Y2 \ X0X1X2 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| 100 to 111 | 01 | 01 | 01 | 01 | 11 | 11 | 11 | 11 |
| 000 to 011 | 00 | 00 | 00 | 00 | 10 | 10 | 10 | 10 |

Z-Order
Vertical split for X1 (Z = X0Y0X1):

| Y0Y1Y2 \ X0X1X2 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| 100 to 111 | 010 | 010 | 011 | 011 | 110 | 110 | 111 | 111 |
| 000 to 011 | 000 | 000 | 001 | 001 | 100 | 100 | 101 | 101 |

Z-Order
Horizontal split for Y1 (Z = X0Y0X1Y1):

| Y0Y1Y2 \ X0X1X2 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| 110 to 111 | 0101 | 0101 | 0111 | 0111 | 1101 | 1101 | 1111 | 1111 |
| 100 to 101 | 0100 | 0100 | 0110 | 0110 | 1100 | 1100 | 1110 | 1110 |
| 010 to 011 | 0001 | 0001 | 0011 | 0011 | 1001 | 1001 | 1011 | 1011 |
| 000 to 001 | 0000 | 0000 | 0010 | 0010 | 1000 | 1000 | 1010 | 1010 |

Z-Order
Vertical split for X2 (Z = X0Y0X1Y1X2):

| Y0Y1Y2 \ X0X1X2 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| 110 to 111 | 01010 | 01011 | 01110 | 01111 | 11010 | 11011 | 11110 | 11111 |
| 100 to 101 | 01000 | 01001 | 01100 | 01101 | 11000 | 11001 | 11100 | 11101 |
| 010 to 011 | 00010 | 00011 | 00110 | 00111 | 10010 | 10011 | 10110 | 10111 |
| 000 to 001 | 00000 | 00001 | 00100 | 00101 | 10000 | 10001 | 10100 | 10101 |

Z-Order
Horizontal split for Y2 (Z = X0Y0X1Y1X2Y2):

| Y0Y1Y2 \ X0X1X2 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| 111 | 010101 | 010111 | 011101 | 011111 | 110101 | 110111 | 111101 | 111111 |
| 110 | 010100 | 010110 | 011100 | 011110 | 110100 | 110110 | 111100 | 111110 |
| 101 | 010001 | 010011 | 011001 | 011011 | 110001 | 110011 | 111001 | 111011 |
| 100 | 010000 | 010010 | 011000 | 011010 | 110000 | 110010 | 111000 | 111010 |
| 011 | 000101 | 000111 | 001101 | 001111 | 100101 | 100111 | 101101 | 101111 |
| 010 | 000100 | 000110 | 001100 | 001110 | 100100 | 100110 | 101100 | 101110 |
| 001 | 000001 | 000011 | 001001 | 001011 | 100001 | 100011 | 101001 | 101011 |
| 000 | 000000 | 000010 | 001000 | 001010 | 100000 | 100010 | 101000 | 101010 |

The lowest z-value (z-low, 000000) and the highest z-value (z-hi, 111111) are located in the lower left and upper right corners.

Z-Order
If each possible z-value represents a cell in the grid, visiting the cells in increasing z-order yields a space-filling curve.

Figure: the Z-shaped space-filling curve drawn through the 8 × 8 grid of z-values.

Example: Point Data
Data point: A = (3, 5) = (011, 101)

Bit interleaving: z = X0Y0X1Y1X2Y2 = 011011, which is the cell in column X0X1X2 = 011 and row Y0Y1Y2 = 101 of the grid.

This gives a simple method for translating between (x, y) coordinates and z-values.
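The translation between (x, y) coordinates and z-values can be sketched with a few bit operations. The function names `interleave` and `deinterleave` are assumptions made here; the bit order follows Z = X0Y0X1Y1X2Y2, with X0 and Y0 being the most significant bits.

```python
def interleave(x, y, d):
    """Z-value of the cell (x, y), where x and y have d bits each."""
    z = 0
    for i in range(d - 1, -1, -1):      # from the most significant bit down
        z = (z << 1) | ((x >> i) & 1)   # next x bit
        z = (z << 1) | ((y >> i) & 1)   # next y bit
    return z

def deinterleave(z, d):
    """Inverse of interleave: recover (x, y) from a 2d-bit z-value."""
    x = y = 0
    for i in range(d - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y

z = interleave(3, 5, 3)
print(format(z, "06b"))       # 011011
print(deinterleave(z, 3))     # (3, 5)
```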

Example: Region Data
A region is identified by a z-prefix: the object with a z-value of 001 contains all elements with a prefix equal to 001.

Figure: the 8 × 8 grid of z-values with regions labelled by z-prefixes such as 10 and 0111.

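Bit Interleaving: Recursive Definition
- A vertical split differentiates values of X0.
- A horizontal split differentiates values of Y0.
- The address is given by the z-value (00, 01, 10, 11).

| | X0=0 | X0=1 |
|---|---|---|
| Y0=1 | 01 | 11 |
| Y0=0 | 00 | 10 |

The same scheme is applied recursively inside each quadrant, e.g. quadrant 11 is subdivided into 1101 and 1111 (upper row) and 1100 and 1110 (lower row).

- The z-value represents the path in the kd-trie.
- We can use the z-values alone, so that we do not need the kd-trie anymore.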

Explanation
- Z-order encoding largely preserves the spatial proximity of points.
  - Homogeneous regions are represented compactly.
  - The elements are clustered, which gives efficient access to secondary storage.
- Z-order coded data can be stored in secondary storage using conventional prefix B+ trees.
  - Efficient “range queries” are possible.
  - Direct access via the z-value is possible.
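Because all cells of a region share a z-prefix, a region query becomes a contiguous range scan over the ordered z-values. The sketch below uses a sorted Python list with `bisect` as a stand-in for an order-preserving index such as a B+ tree; the function names and the stored sample keys are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def prefix_range(prefix, total_bits):
    """Smallest and largest z-value that start with the given bit prefix."""
    free = total_bits - len(prefix)
    lo = int(prefix, 2) << free          # prefix followed by 0...0
    hi = lo | ((1 << free) - 1)          # prefix followed by 1...1
    return lo, hi

def query(sorted_keys, prefix, total_bits):
    """All stored z-values inside the region identified by the prefix."""
    lo, hi = prefix_range(prefix, total_bits)
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_right(sorted_keys, hi)]

# Hypothetical stored points, given directly as 6-bit z-values.
keys = sorted([0b000101, 0b001001, 0b001110, 0b011011, 0b100000])
print([format(z, "06b") for z in query(keys, "001", 6)])
# ['001001', '001110']
```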

Geometric Data Structures for non-Geometric Data?
- The application of geometric data structures to geometric problems is obvious:
  - Geographic Information Systems (GIS)
  - Computer graphics
- A further application of geometric data structures: multidimensional databases
  - OLAP (Online Analytical Processing)
  - Data mining

Multidimensional Data Space
Figure: data cube with the dimensions Region (East, West, North, South), Product (Coke, Fanta, Beer, Milk, Juice, Water) and Day.

Each cell corresponds to an observation point, described by the attributes of the individual cell. Each cell contains an observation, e.g. the sales value of Product “Coke” on Day “4” in Region “East”.

Multidimensional (MD) Data Space
Each observed fact w can be expressed as a function of the dimensions, which define the multidimensional data space:

$$w = f(x, y, z), \qquad \mathrm{DOM}(f) = \mathrm{DOM}(x) \times \mathrm{DOM}(y) \times \mathrm{DOM}(z)$$

A fact $w_0$ is the value of the function f for the specific values $(x_0, y_0, z_0)$:

$$w_0 = f(x_0, y_0, z_0)$$

Sparseness in the MD Space
- Typically, only a small fraction of the space defined by DOM(a) × … × DOM(z) is actually used.
- Addressing in the MD space (a multi-dimensional array) is easy and fast, but the memory usage is inefficient.
- We need mechanisms to compress the MD space:
  - Linearization of the data space by totally ordering the facts with the aid of space-filling curves
  - Extraction of all facts into a table, then joining this table with descriptive dimension tables
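The contrast between the dense array and a compressed representation can be illustrated with a small sketch. The sales values, the number of days, and the name `f` are made up for this illustration; only observed cells are stored, keyed by their coordinates.

```python
REGIONS = ["East", "West", "North", "South"]
PRODUCTS = ["Coke", "Fanta", "Beer", "Milk", "Juice", "Water"]
DAYS = list(range(1, 8))  # assumption: one week of observations

# Made-up facts: only the observed cells are stored.
facts = {
    ("East", "Coke", 4): 120.0,
    ("West", "Beer", 1): 75.5,
    ("North", "Milk", 7): 31.0,
}

def f(region, product, day):
    """w = f(x, y, z); returns None for cells without an observation."""
    return facts.get((region, product, day))

dense_cells = len(REGIONS) * len(PRODUCTS) * len(DAYS)
print(f("East", "Coke", 4), len(facts), dense_cells)  # 120.0 3 168
```

A dense array would reserve 168 cells for 3 facts in this toy case; the table of facts grows only with the number of observations.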

Linearization of the MD Space
- Linearization uses space-filling curves (e.g. the Z-order or the Hilbert construction).
- The principle is based on a coding that generates a total order of all points in the data space.
- The indexing is done by conventional, order-preserving indexing methods (e.g. B+ trees).
- The mechanism is well suited for 2 to 4 dimensions (x, y, z, t), for tracking applications and range queries.

Data Mining
- So far: storage and search of data.
- The evaluation and interpretation of results is done using data mining.
- Typical problem: “Where in the supermarket should we put the beer that should be sold as early as possible (approaching expiry date, low sales volume, ...)?”

Data Mining
Overview of basic techniques for data mining:
- Forecast, prediction
  - Classification
  - Numerical prediction
- Knowledge discovery
  - Association
  - Variance detection
  - Clustering

Prediction: Classification
- Data entries are classified according to a certain property.
- A new data entry is automatically assigned to a class.

Table: example entries with the columns “Purchased year”, “Lending total” and “Lending last year” and the class “to sort out” (yes/no); for the new data entry the class is unknown (?).

Prediction: Numerical Prediction
Numerical prediction is similar to classification; however, a value is predicted instead of a class. The most important application is the weather forecast.

Table: temperature and pressure for yesterday and today, and the temperature for tomorrow; for the last entry the temperature for tomorrow is unknown (?) and has to be predicted.

Knowledge Discovery: Association
Association tries to find common rules between the characteristics of the data. Interesting relations are returned.

Example: from the previous weather data one could derive the following rules:
- With a probability of 0.89: IF "Air pressure today" > "Air pressure yesterday" AND "Temperature today" > 12° THEN "Temperature tomorrow" > "Temperature today"
- With a probability of 0.75: IF "Air pressure today" < "Air pressure yesterday" AND "Temperature today" > 15° THEN "Temperature tomorrow" < "Temperature today"

Knowledge Discovery: Variance Detection
Given a data pool, variance detection tries to distinguish normal data entries from “outlier” entries.

Example: a home security system with 100 sensors (temperature, light barrier, sound detector, ...) should detect intruders. Flying birds, shadows in the moonlight or car headlights should not have any impact on the operation of the system.
- The system gets a database describing “safe” configurations (where no alarm has to be triggered).
- The system creates a model of the non-alarm cases. Data for real intrusions are not provided!
- Using this model, updates from the sensors can be checked: if they do not fit the non-alarm cases, an alarm is triggered.

Knowledge Discovery: Clustering
Find similar data entries and group them into clusters.

Example: in an exam, what percentage of the exercises E1 to E5 was answered correctly?

Table: percentage of correct answers per student (S1 to S5) and exercise (E1 to E5), with the averages (Ø) in the last row.

Clustering may divide the students taking the exam into 2 groups:
- G1 = {S1, S4, S5}: good at exercises E2 and E5
- G2 = {S2, S3}: good at exercises E1, E3 and E4

→ Possibility of individual support!

k-means Clustering: Example

k-means Clustering: Algorithm
1. Fix the number of desired clusters → parameter k.
2. Place k random points into the space → initial group centroids.
3. For all m data objects, determine the Euclidean distance of the object (as a vector) from all centroids and assign the object to the closest centroid.
4. For all k centroids, determine the real center of the assigned cluster (average). These are the new centroids.
5. Repeat steps 3 and 4 until the centroids no longer move (old and new ones are so close to each other that no noticeable improvement remains).
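The five steps can be sketched as follows. This is a simplified illustration with two deviations that are assumptions made here: when no initial centroids are passed via the hypothetical parameter `init`, they are sampled from the data points instead of being placed at random positions in space, and the loop stops when the centroids do not change at all (or after `max_iter` rounds).

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """points: list of equal-length tuples. Returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Step 2: initial centroids (sampled from the data unless given).
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign every object to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Step 4: the new centroid is the average of the assigned objects.
        new = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 5: stop as soon as the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # made-up points
centroids, clusters = kmeans(data, k=2, init=[(0, 0), (10, 10)])
print(clusters)
# [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
```

An empty cluster keeps its previous centroid in this sketch; other strategies, such as re-seeding the centroid, are equally possible.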

k-means Algorithm: Properties
- Finds a local optimum, but does not necessarily find the best configuration (global optimum) → it is a heuristic.
- Significantly sensitive to the initial, randomly selected cluster centers → optimizations:
  - Randomly modify the results between different rounds.
  - Run the k-means algorithm multiple times.
- Operates with linear optimization.
- Highly stable and frequently used approach.
- Also works for very large data sets, with a controllable complexity.

Literature: Ian H. Witten, Eibe Frank; “Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations”, Academic Press, San Diego, CA, 2000