Presentation on theme: "Michael A. Burr, Eynat Rafalin, and Diane L. Souvaine"— Presentation transcript:
1 Simplicial Depth: An Improved Definition, Analysis, and Efficiency in the Finite Sample Case
Michael A. Burr, Eynat Rafalin, and Diane L. Souvaine
Tufts University
CCCG 2004
NSF grant #EIA
2 Introduction
Introduction to Data Depth
  Why?
  Examples
  Desirable Properties
Simplicial Depth
  Definition
  Properties
  Problems
Revised Definition
Ongoing work
3 What is Data Depth and Why?
Measures how deep (central) a given point is relative to a distribution or a data cloud.
  Deals with the shape of the data.
  Can be thought of as a measure of how well a point characterizes a data set.
Provides an alternative to classical statistical analysis.
  No assumption about the underlying distribution of the data.
  Deals with outliers.
Why study?
  Many measures are geometric in nature.
  Depth can be computationally expensive to compute.
4 Examples
Half-Space (Tukey, Location) Depth (Tukey 75)
Regression Depth (Rousseeuw and Hubert 99)
Simplicial Depth (Liu 90)
… and many more.
[Figure: a line splits the data set; the labels read "2 Data Points in this Half-plane" and "3 Data Points in this Half-plane".]
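As a rough illustration of the half-space depth listed above, the following sketch counts the smallest number of data points in any closed half-plane whose boundary passes through p. It is a brute-force planar version written for this transcript, not an algorithm from the talk; the names `cross` and `halfspace_depth` are hypothetical, and integer coordinates are assumed so that the orientation tests are exact.

```python
def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); > 0 if b is left of the ray o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def halfspace_depth(p, points):
    """Minimum number of data points in a closed half-plane bounded by a line through p.

    Brute force, O(n^2): for every data point q != p, take the line through
    p and q, tilt it slightly in either direction, and count both sides.
    """
    coincident = sum(1 for x in points if x == p)  # copies of p lie in every half-plane
    others = [x for x in points if x != p]
    if not others:
        return coincident
    best = len(others)
    for q in others:
        left = right = ahead = behind = 0
        for x in others:
            c = cross(p, q, x)
            if c > 0:
                left += 1
            elif c < 0:
                right += 1
            else:
                # x is on the line through p and q: same side of p as q, or opposite?
                dot = (q[0] - p[0]) * (x[0] - p[0]) + (q[1] - p[1]) * (x[1] - p[1])
                if dot > 0:
                    ahead += 1
                else:
                    behind += 1
        # A slight tilt sends the collinear points "ahead" to one side and "behind" to the other.
        best = min(best, left + behind, left + ahead, right + ahead, right + behind)
    return best + coincident


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(halfspace_depth((1, 1), square))  # center of the square
print(halfspace_depth((5, 5), square))  # far outside the data
```

Here the depth is reported as a count of data points, matching the counts in the figure labels; dividing by the number of data points would give a normalized value.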
5 Desirable Properties of Data Depth
Liu (90) / Zuo and Serfling (00):
  P1: Affine Invariance
  P2: Maximality at Center
  P3: Monotonicity Relative to Deepest Point
  P4: Vanishing at Infinity
We propose (BRS 04):
  P5: Invariance Under Dimensions Change
6 Affine Invariance (P1)
A: affine transformation
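Stated as a formula, affine invariance might read as follows. The notation is an assumption made here, since the slide's own formula is not in the transcript: $D(p; S)$ denotes the depth of a point $p$ relative to a data set $S$, and the transformation is $x \mapsto Ax + b$ with $A$ nonsingular.

```latex
D\bigl(Ap + b;\ \{Ax + b : x \in S\}\bigr) = D(p;\ S)
```

In words, applying the same affine transformation to the point and to the data leaves the depth unchanged.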
7 Maximality at Center (P2)
p is the center
q is any point
8 Monotonicity Relative to Deepest Point (P3)
p is the deepest point
q is any point
point between p and q
9 Vanishing at Infinity (P4)
q is far from the data cloud
10 Invariance Under Dimensions Change (P5)
Is this an […] data set?
Is this an […] data set?
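11 Simplicial Depth (Liu 90)
The simplicial depth of a point p with respect to a probability distribution F in $\mathbb{R}^d$ is the probability that a random closed simplex in $\mathbb{R}^d$ contains p:
$SD(F; p) = P_F\left(p \in S[X_1, \dots, X_{d+1}]\right)$,
where $S[X_1, \dots, X_{d+1}]$ is a closed simplex formed by d+1 random observations from F.
The simplicial depth of a point p with respect to a data set $S = \{X_1, \dots, X_n\}$ in $\mathbb{R}^d$ is the fraction of closed simplices formed by d+1 points of S containing p:
$SD(S; p) = \binom{n}{d+1}^{-1} \sum_{1 \le i_1 < \dots < i_{d+1} \le n} I\left(p \in S[X_{i_1}, \dots, X_{i_{d+1}}]\right)$,
where I is the indicator function.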
12 Sample Version of Simplicial Depth
The simplicial depth of a point p with respect to a data set S in $\mathbb{R}^d$ is the fraction of closed simplices formed by d+1 points of S containing p.
Total number of simplices = $\binom{6}{3}$ = 20
p is contained in 6 simplices.
The depth of p = 6/20 = .3
[Figure: a planar data set with the surrounding regions labeled by their depth values.]
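The sample definition above can be tried out directly in the plane by enumerating every triangle. The sketch below is a brute-force illustration, not the algorithm from the talk: the names `orient`, `in_closed_triangle`, and `simplicial_depth` are hypothetical, the data set is a made-up square rather than the one in the slide's figure, and it assumes integer coordinates with no three data points collinear.

```python
from fractions import Fraction
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle (a, b, c); the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_closed_triangle(p, a, b, c):
    """True if p lies inside or on the boundary of the non-degenerate triangle abc."""
    d = (orient(a, b, p), orient(b, c, p), orient(c, a, p))
    has_neg = any(x < 0 for x in d)
    has_pos = any(x > 0 for x in d)
    return not (has_neg and has_pos)


def simplicial_depth(p, points):
    """Fraction of closed triangles with vertices in `points` that contain p (d = 2)."""
    triangles = list(combinations(points, 3))
    containing = sum(1 for t in triangles if in_closed_triangle(p, *t))
    return Fraction(containing, len(triangles))


square = [(0, 0), (4, 0), (0, 4), (4, 4)]
print(simplicial_depth((1, 2), square))    # a point inside a cell
print(simplicial_depth((10, 10), square))  # a point far from the data
print(simplicial_depth((0, 0), square))    # a data point
```

Enumerating all triangles costs O(n^3) time per query point, which is only practical for small examples like this one.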
13 Properties
Is a statistical depth function in the continuous case. (Liu 90)
Is affine invariant (P1) and vanishes at infinity (P4) in the sample case. (Zuo and Serfling 00)
14 Problems in the Sample Case
Does not always attain maximality at the center (P2) and does not always have monotonicity relative to the deepest point (P3). (Zuo and Serfling 00)
The depth on the boundary of cells is at least the depth in each of the adjacent cells, which causes discontinuities.
Does not have invariance under dimensions change (P5).
15 Simplicial Depth (Liu 90) (BRS 04)
Averaging the number of closed and open simplices containing a point
Total number of simplices = $\binom{5}{3}$ = 10
[Figure: a planar data set with points labeled A, B, C, D, E, X, and Y and depth values marked on the figure.]
16 Revised Definition (BRS 04)
The simplicial depth of a point p with respect to a data set S in $\mathbb{R}^d$ is the average of the fraction of closed simplices containing p and the fraction of open simplices containing p, formed by d+1 points of S.
Equivalently, $SD(S; p) = \rho(S, p) + \frac{1}{2}\sigma(S, p)$, where
  $\rho(S, p)$ is the fraction of simplices with data points as vertices which contain p in their open interior.
  $\sigma(S, p)$ is the fraction of simplices with data points as vertices which contain p in their boundary.
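A small planar sketch can make the difference between the two definitions concrete. It counts, for a query point, the triangles that contain it in their open interior and those that contain it on their boundary, then reports the closed-simplex fraction and the revised value (interior fraction plus half the boundary fraction). The function names and the square data set are assumptions made for illustration, and the code assumes integer coordinates with no three data points collinear.

```python
from fractions import Fraction
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def classify(p, a, b, c):
    """Return 'interior', 'boundary', or 'outside' for p relative to triangle abc."""
    d = (orient(a, b, p), orient(b, c, p), orient(c, a, p))
    if any(x < 0 for x in d) and any(x > 0 for x in d):
        return "outside"
    return "boundary" if 0 in d else "interior"


def depths(p, points):
    """Return (closed-simplex depth, revised depth) of p for a planar data set."""
    interior = boundary = total = 0
    for tri in combinations(points, 3):
        total += 1
        where = classify(p, *tri)
        if where == "interior":
            interior += 1
        elif where == "boundary":
            boundary += 1
    closed = Fraction(interior + boundary, total)
    # Average of the closed fraction and the open fraction.
    revised = Fraction(2 * interior + boundary, 2 * total)
    return closed, revised


square = [(0, 0), (4, 0), (0, 4), (4, 4)]
print(depths((1, 2), square))  # inside a cell: both values agree
print(depths((2, 2), square))  # on the diagonals, i.e. on cell boundaries
```

In this example the point (2, 2) lies on the boundary of all four triangles, so the closed-simplex count jumps above the value in the neighboring cells, while the revised value stays equal to it.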
17 Properties of the Revised Definition
Reduces to the original definition for continuous distributions and for points lying in the interior of cells.
Keeps the ranking order of data points.
Can be calculated using the existing algorithms, with slight modifications.
Fixes Zuo and Serfling's counterexamples.
The depth on the boundary of two cells is the average of the two adjacent cells.
Invariant under dimensions change (P5) for the change from […] to […].
18 Invariance Under Dimension Change (P5)
Degenerate simplices
  Both points C and A (a point between B and C) lie within the open (degenerate) simplex BCD; think of it as a very thin triangle.
  Both points B and D are vertices of the (degenerate) simplex BCD.
For a point p, consider the ratio: […]
  For both definitions, the ratio for a position (non-data point) is 2/3.
  For Liu's definition, the ratio for a data point is not 2/3.
  For the BRS definition, the ratio for a data point is 2/3.
19 Remaining Problems (P2 and P3)
20 Remaining Problems (Data Points)
Data points are still overcounted, so there can still be discontinuities at data points. To fix the depth at data points, more features need to be considered:
  Data points are inherently part of simplices (a point makes a triangle with every other pair of points), and edges are inherently part of simplices (the two endpoints of an edge make a triangle with every other vertex).
  To retain invariance under dimensions change (P5), given a data set in $\mathbb{R}^d$ that lies on a b-flat, the depth of a point when the data set is evaluated as a d-dimensional data set should be a multiple of the depth when the data set is evaluated as a b-dimensional data set.
Neither of the above ideas completely solves the problem, and it appears that the best solutions take into account the geometry of the entire data set.
21 Ongoing Work
The current algorithm for finding the median (the deepest point) is $O(n^4)$ to walk an arrangement of $O(n^2)$ segments.
  We can improve this algorithm by comparing simplicial depth and half-space depth.
  We are further improving this by considering simplicial depth in the dual.
The problems with data points are improved by generalizing this work to higher dimensions.
To find the depth at all points, we are using local information to form an approximation for the depth measure.
22 References
G. Aloupis, C. Cortes, F. Gomez, M. Soss, and G. Toussaint. Lower bounds for computing statistical depth. Computational Statistics & Data Analysis, 40(2): , 2002.
G. Aloupis, S. Langerman, M. Soss, and G. Toussaint. Algorithms for bivariate medians and a Fermat-Torricelli problem for lines. In Proc. 13th CCCG, pages 21-24, 2001.
M. Burr, E. Rafalin, and D. L. Souvaine. Simplicial depth: An improved definition, analysis, and efficiency for the sample case. Technical report , DIMACS, 2003.
A. Y. Cheng and M. Ouyang. On algorithms for simplicial depth. In Proc. 13th CCCG, pages 53-56, 2001.
J. Gil, W. Steiger, and A. Wigderson. Geometric medians. Discrete Math., 108(1-3):37-51. Topological, algebraical and combinatorial structures. Frolík's memorial volume.
S. Khuller and J. S. B. Mitchell. On a triangle counting problem. Inform. Process. Lett., 33(6): , 1990.
R. Liu. On a notion of data depth based on random simplices. Ann. Statist., 18: , 1990.
Y. Zuo and R. Serfling. General notions of statistical depth function. Ann. Statist., 28(2): , 2000.