Algebraic Topology in Data Science: GrC in Big Data. Tsau Young (‘T. Y.’) Lin, Institute of Data Science and Computing.


1 1 Algebraic Topology in Data Science: GrC in Big Data. Tsau Young (‘T. Y.’) Lin, Institute of Data Science and Computing, GrC Society and Computer Science Department, San Jose State University. Ty.lin@sjsu.edu ; prof.tylin@gmail.com

2 2 Outline 1.What is Data Science 2.What is Data Mining? 3.What is GrC? Simplicial Complexes 4. Information Table (IT) Bit IT theory (BIT) 5. Traces the Geometric Data Mining algorithm

3 3 Data Science Vasant Dhar, (New York University, E-i-C of Big Data ) defined: "Data Science is a study of generalizable extraction of knowledge from data."

4 4 Data Science I do not fully agree with his definition of Data Science, but I will adopt his idea in this talk.

5 5 Outline 1.What is Data Science 2.What is Data Mining? 3.What is GrC? Simplicial Complexes 4. Information Table (IT) Bit IT theory (BIT) 5. Traces the Geometric Data Mining algorithm

6 6 Data Mining Data Mining is a study about HOW TO EXTRACT PATTERNS from data.

7 7 Data Mining Core Methods of Data Mining: 1.Classification 2.Clustering 3.Association(rules)

8 8 Data Mining Methods for mining frequent patterns: Apriori (Rakesh Agrawal); FP-growth (Jiawei Han) (frequent pattern growth)

9 (skip) FP-growth uses an extended prefix-tree (FP-tree) structure to store the database. It adopts a divide-and-conquer strategy.

10 (skip) 1. First, compress the database into a frequent-pattern tree (FP-tree). 2. Then divide the FP-tree into a set of conditional FP-trees. 3. Next, mine each conditional FP-tree separately to obtain the complete set of frequent patterns of the database.
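Step 1 above (compressing the database into an FP-tree) can be sketched as follows. This is a minimal illustration only, not the published FP-growth implementation: the class `FPNode`, the function `build_fp_tree`, the tie-breaking rule, and the toy transactions are all assumptions made for this example, and steps 2 and 3 (conditional trees and mining) are not covered.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, a count, and child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = FPNode(None)
    # Pass 2: insert each transaction as a path, most frequent item first.
    for t in transactions:
        # Ties are broken alphabetically (an arbitrary choice for this sketch).
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

# Hypothetical toy transactions.
transactions = [
    ["Diaper", "Beer", "Milk", "Pen"],
    ["Beer", "Milk", "Pen", "E"],
    ["Diaper", "Milk", "F", "G"],
]
tree = build_fp_tree(transactions, min_support=2)
milk = tree.children["Milk"]
print(milk.count, sorted(milk.children))
```

Transactions that share a frequent prefix share a path in the tree, which is where the compression comes from.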

11 11 Outline 1.What is Data Science 2.What is Data Mining? 3.What is GrC? Simplicial Complexes 4. Information Table (IT) Bit IT theory (BIT) 5. Traces the Geometric Data Mining algorithm

12 12 GrC Is it a partition/granulation ?

13 13 Partition (Rough Sets) (usual pictures are vague)

14 14 Open Simplexes: 0-simplex (1 point)

15 15 Open Simplexes: 1-simplex (no end points)

16 16 Open Simplexes: 2-simplex (no boundary)

17 17 Closed 3-simplex: it summarizes all information of the first tuple. [Figure: tetrahedron with vertices A, B, C, D]

18 18 Open Simplexes: 3-simplex (no boundary), open tetrahedron

19 19 A New Partition (algebraic topology): 12 open triangles (2-simplexes); 23 open segments (1-simplexes); 12 vertices (0-simplexes). Simplicial complex.

20 20 Geometric Closed Tetrahedron U: the closed geometric tetrahedron. Simplexes = the set of all open simplexes = { ABCD; BCD, ACD, ABD, ABC; AB, AC, AD, BC, BD, CD; A, B, C, D }. Is it a rough set? Yes

21 21 Abstract Closed Tetrahedron U = {A, B, C, D}. Simplexes = { ABCD = {A, B, C, D}; BCD, ACD, ABD, ABC = {A, B, C}; AB, AC, AD, BC, BD, CD; A, B, C, D }. Is it a rough set? Yes

22 22 Geometric Simplicial Complex (U, S) is a Simplicial Complex if (1) U is decomposed into a set S of simplexes, and (2) every face of a simplex in S is also a simplex in S. (U is called a polyhedron.)

23 23 A Rough Set? [Figure: vertices A, B, C, D; red lines only]

24 24 Geometric Simplicial Complex U = the picture of red lines. Simplexes = { BC, BD, CD, AB; A, B, C, D }. It is a simplicial complex. Is (U, S) a RS? Infinite RS

25 25 Abstract Simplicial Complex U = {A, B, C, D}. Simplexes = { {B, C}, {B, D}, {C, D}, {A, B}; {A}, {B}, {C}, {D} }. It is a simplicial complex. Is (U, S) a RS? No
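The two questions on this slide can be checked mechanically. The sketch below assumes one reading of "Is (U, S) a RS?", namely "is S a partition of U?"; the function names `is_simplicial_complex` and `is_partition` are hypothetical helpers for this example, not part of the talk.

```python
from itertools import combinations

U = {"A", "B", "C", "D"}
S = [{"B", "C"}, {"B", "D"}, {"C", "D"}, {"A", "B"},
     {"A"}, {"B"}, {"C"}, {"D"}]

def is_simplicial_complex(simplexes):
    """Condition (2): every non-empty proper face of a simplex is in the family."""
    family = {frozenset(s) for s in simplexes}
    return all(
        frozenset(face) in family
        for s in family
        for k in range(1, len(s))
        for face in combinations(s, k)
    )

def is_partition(universe, blocks):
    """True when the blocks are pairwise disjoint and cover the universe."""
    blocks = [set(b) for b in blocks]
    covered = set().union(*blocks)
    # Sizes add up to the size of the union only if no element is repeated.
    disjoint = sum(len(b) for b in blocks) == len(covered)
    return covered == set(universe) and disjoint

print(is_simplicial_complex(S))  # closed under faces
print(is_partition(U, S))        # the abstract simplexes overlap as subsets of U
```

Under this reading, the abstract complex fails to be a partition because, for instance, {B} and {B, C} share the element B, whereas the open geometric simplexes are pairwise disjoint point sets.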

26 26 A Rough Set? [Figure: vertices A, B, C, D; red zone only]

27 27 Geometric Simplicial Complex U = the picture of the red zone. S = { BCD, AB and all their faces (descendants) }. Is (U, S) a RS? Yes

28 28 Abstract Simplicial complex U= {A, B, C, D} S= { {B, C, D}, {A, B} and all their descendants (we will skip them): 1){B, C}, {B, D}, {C, D}, {A, B} 2){A}, {B},{C}, {D} } Is (U, S) a RS? No

29 29 Geometric Simplicial Complex [Figure: vertices A, B, C, D, E; open tetrahedron]

30 30 Simplicial complex U= {A, B, C, D, E} S: { {A, B, C, D}, {A, E} and all their descendants. {A, B,C},{A, B,D},{A,C,D},{B, C, D}, {A, B},{A, C},{A,D},{B, C}, {B, D}, {C, D}, {A, E}, {A}, {B}, {C}, {D},{E} }.

31 31 Abstract Simplicial complex U= {A, B, C, D, E} S: { {A, B, C, D}, {A, E} and all their descendants. {A, B,C},{A, B,D},{A,C,D},{B, C, D}, {A, B},{A, C},{A,D},{B, C}, {B, D}, {C, D}, {A, E}, {A}, {B}, {C}, {D},{E} }.

32 32 Outline 1.What is GrC ? Simplicial Complexes 2. Bit Information Table(IT) 3. Traces the Geometric Data Mining algorithm

33 33 Outline 1.What is Data Science 2.What is Data Mining? 3.What is GrC? Simplicial Complexes 4. Information Table (IT) Bit IT theory (BIT) 5. Traces the Geometric Data Mining algorithm

34 Bit IT Theory The next table is a Bit Information Table of a DEPARTMENT STORE. Column names are item names. If a customer has purchased an item, the column of that item is marked with bit 1, otherwise with 0.

35 35 Bit IT (IT = a relation in a relational database)

    Diaper  Beer  Milk  Pen  ....
      1      1     1     1    0  0

36 Bit IT (Geometric View) The first row can be visualized GEOMETRICALLY as A CLOSED SIMPLEX

37 37 A closed 3-simplex is a pair ( U = {Diaper, Beer, Milk, Pen}, a set S of open simplexes ). S consists of the 3-simplex {D, B, M, P} and all its descendants

38 38 1st generation: 2-simplexes {B, M, P}, {D, M, P}, {D, B, P}, {D, B, M}. 2nd generation: 1-simplexes {D, B}, {D, M}, {D, P}, {B, M}, {B, P}, {M, P}. 3rd generation: 0-simplexes {D}, {B}, {M}, {P}. Total: $2^4 - 1 = 15$ OPEN simplexes
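The generations listed on this slide are simply the non-empty subsets of the four vertices, grouped by size. A short sketch (the function name `closed_simplex` is a hypothetical helper for this example) that enumerates them and confirms the count:

```python
from itertools import combinations

def closed_simplex(vertices):
    """Map each dimension to the open simplexes (vertex sets) of that dimension."""
    vertices = list(vertices)
    return {
        k - 1: [set(c) for c in combinations(vertices, k)]
        for k in range(len(vertices), 0, -1)
    }

# D = Diaper, B = Beer, M = Milk, P = Pen
faces = closed_simplex(["D", "B", "M", "P"])
for dim, simplexes in faces.items():
    print(dim, len(simplexes))
total = sum(len(s) for s in faces.values())
print(total == 2**4 - 1)
```

The 3-simplex itself plus 4 + 6 + 4 descendants gives the 15 open simplexes.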

39 Bit Geometric IT Instead of item names, we may use the names of the "unit vectors": the first item is A = (1, 0, 0, ...), the second item is B = (0, 1, 0, 0, ...)

40 40 Closed 3-simplex: it summarizes all information of the first tuple. [Figure: tetrahedron with vertices A, B, C, D]

41 41 [Figure: traversal of the simplex] C is the 1st visit; A is the 2nd visit; AC is the 3rd visit; D is the 4th visit; then DC, DA, DAC

42 42 Outline 1.What is Data Science 2.What is Data Mining? 3.What is GrC? Simplicial Complexes 4. Information Table (IT) Bit IT theory (BIT) 5. Traces the Geometric Data Mining algorithm

43 43 A Bit IT

    row     Diaper  Beer  Milk  Pen  E   F   G
    #1        1      1     1     1   0   0   0
    #2        0      1     1     1   1   0   0
    #3        1      0     1     0   0   1   1
    label     #4     #2    #1    #3  #5  #6  #7

44 44 Apriori (1-itemset)

    row     Diaper  Beer  Milk  Pen  E   F   G
    #1        1      1     1     1   0   0   0
    #2        0      1     1     1   1   0   0
    #3        1      0     1     0   0   1   1
    label     #4     #2    #1    #3  #5  #6  #7
    count     2      2     3     2   1   1   1
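The 1-itemset counts on this slide are column sums of the bit table. A minimal sketch, with the rows transcribed from the slide as bit strings (the names `COLUMNS`, `ROWS`, and `column_counts` are chosen for this example):

```python
COLUMNS = ["Diaper", "Beer", "Milk", "Pen", "E", "F", "G"]
ROWS = ["1111000",   # row #1
        "0111100",   # row #2
        "1010011"]   # row #3

def column_counts(columns, rows):
    """Support of each single item = number of rows with bit 1 in its column."""
    return {name: sum(int(r[j]) for r in rows) for j, name in enumerate(columns)}

counts = column_counts(COLUMNS, ROWS)
print(counts)
```

Milk has the largest support (3), which matches its label #1 on the slide.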

45 45 Apriori (2-itemset)

    row     Diaper  Beer  Milk  Pen  E   F   G
    #1        1      1     1     1   0   0   0
    #2        0      1     1     1   1   0   0
    #3        1      0     1     0   0   1   1
    label     #4     #2    #1    #3  #5  #6  #7
    ....

46 46 Its Simplicial Complex View [Figure: vertices a (#4), b (#2), c (#1), d (#3), e (#5), f (#6), g (#7); open tetrahedron 1, open tetrahedron 2, open tetrahedron 3]
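For the 2-itemset pass, the support of a pair is the number of rows in which both columns hold bit 1. The sketch below counts every pair directly and does not include Apriori's candidate pruning; the minimum support of 2 used at the end is an assumption for this example, as are the names `pair_supports` and `frequent_pairs`.

```python
from itertools import combinations

COLUMNS = ["Diaper", "Beer", "Milk", "Pen", "E", "F", "G"]
ROWS = ["1111000", "0111100", "1010011"]  # rows #1, #2, #3 of the Bit IT

def pair_supports(columns, rows):
    """Support of every item pair that occurs together in at least one row."""
    supports = {}
    for a, b in combinations(range(len(columns)), 2):
        support = sum(1 for r in rows if r[a] == "1" and r[b] == "1")
        if support:
            supports[(columns[a], columns[b])] = support
    return supports

supports = pair_supports(COLUMNS, ROWS)
# Assumed threshold: keep pairs bought together by at least 2 customers.
frequent_pairs = {pair for pair, s in supports.items() if s >= 2}
print(sorted(frequent_pairs))
```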

47 47 Main Idea Our main approach is to use a Weighted Simplicial Complex to find Frequent Itemsets. The results are very impressive: it is 200 times faster than FP-Growth on a real-world database (1257 columns and 65,536 rows)

48 Traversal of Weighted Simplicial Complex [1] 3, [4] 2, [4 1] 2, [3] 2, [3 1] 2, [3 4] 1, [3 4 1] 1, [2] 2, [2 1] 2, [2 4] 1, [2 4 1] 1, [2 3] 2, [2 3 1] 2, [2 3 4] 1, [2 3 4 1] 1

49 [7] 1, [7 1] 1, [7 4] 1, [7 4 1] 1, [6] 1, [6 1] 1, [6 4] 1, [6 4 1] 1, [6 7] 1, [6 7 1] 1, [6 7 4] 1, [6 7 4 1] 1

50 [5] 1, [5 1] 1, [5 3] 1, [5 3 1] 1, [5 2] 1, [5 2 1] 1, [5 2 3] 1, [5 2 3 1] 1
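The listing on these three slides can be reproduced by a simple recursive traversal over bitmasks. This is a reconstruction that happens to match the printed order and weights; it is an assumption that the author's algorithm works this way. Items are named by their labels #1 to #7, each row is the set of labels it contains, and the visiting order 1, 4, 3, 2, 7, 6, 5 is read off the slides.

```python
# Rows of the Bit IT, written as sets of column labels.
ROWS = [{4, 2, 1, 3},   # row #1: Diaper, Beer, Milk, Pen
        {2, 1, 3, 5},   # row #2: Beer, Milk, Pen, E
        {4, 1, 6, 7}]   # row #3: Diaper, Milk, F, G
VISIT_ORDER = [1, 4, 3, 2, 7, 6, 5]

def traverse(rows, order):
    """Return (simplex, weight) pairs; weight = number of rows containing the simplex."""
    all_rows = (1 << len(rows)) - 1
    # col[v] has bit i set when row i contains item v.
    col = {v: sum(1 << i for i, r in enumerate(rows) if v in r) for v in order}
    out = []

    def visit(prefix, mask, earlier):
        for i, v in enumerate(earlier):
            m = mask & col[v]
            if m == 0:          # no row contains this simplex: prune the branch
                continue
            simplex = prefix + [v]
            out.append((simplex, bin(m).count("1")))
            # Extend only with vertices visited before v.
            visit(simplex, m, earlier[:i])

    visit([], all_rows, order)
    return out

result = traverse(ROWS, VISIT_ORDER)
for simplex, weight in result[:7]:
    print(simplex, weight)
```

Each new vertex is combined only with simplexes built from earlier vertices, so every simplex is emitted once, and a bitwise AND of column masks gives its weight without rescanning the table.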

51 51 Data Science Here we have used a very simple example to illustrate the idea of data science. We extract not only the frequent itemsets but also the structure of their interactions. For example, we have the homology groups of the output.

52 Knowledge Complex [1] 3, [4] 2, [4 1] 2, [3] 2, [3 1] 2, [2] 2, [2 1] 2, [2 3] 2, [2 3 1] 2

53 53 Knowledge Complex [Figure: vertices A, B, C, D; red zone only]
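The nine entries on this slide are exactly the simplexes whose weight is at least 2. The threshold of 2 is inferred from the listed weights and is an assumption; it is not stated on the slide. A brute-force sketch (the function name `frequent_itemsets` is chosen for this example):

```python
from itertools import combinations

# Rows of the Bit IT as sets of column labels (#1 = Milk, #2 = Beer, ...).
ROWS = [{4, 2, 1, 3}, {2, 1, 3, 5}, {4, 1, 6, 7}]

def frequent_itemsets(rows, min_support):
    """Brute force: test every non-empty item subset against every row."""
    items = sorted({i for r in rows for i in r})
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            support = sum(1 for r in rows if set(combo) <= r)
            if support >= min_support:
                found[combo] = support
    return found

knowledge_complex = frequent_itemsets(ROWS, min_support=2)
print(knowledge_complex)
```

The result is the closed triangle on {1, 2, 3} together with the edge {1, 4}. Because support can only shrink as an itemset grows, the frequent itemsets are closed under taking faces, which is why they form a simplicial complex.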

54 Knowledge Complex $H_0(K) = \mathbb{Z}$, $H_i(K) = 0$ for $i \neq 0$
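These homology groups can be checked numerically. The sketch below computes Betti numbers with coefficients in Z/2 rather than Z (a simplification made for this example; for this small complex the ranks come out the same), taking the knowledge complex to be the triangle {1, 2, 3} plus the edge {1, 4}. The function names `rank_gf2` and `betti_numbers` are hypothetical helpers.

```python
from itertools import combinations

def rank_gf2(vectors):
    """Rank over GF(2) of vectors stored as integer bitmasks."""
    vectors = list(vectors)
    rank = 0
    while vectors:
        pivot = vectors.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot
        vectors = [v ^ pivot if v & low else v for v in vectors]
    return rank

def betti_numbers(maximal_simplexes):
    """Betti numbers (Z/2 coefficients) of the complex generated by the given simplexes."""
    simplexes = set()
    for s in maximal_simplexes:
        for k in range(1, len(s) + 1):
            simplexes.update(tuple(sorted(c)) for c in combinations(s, k))
    top = max(len(s) for s in simplexes) - 1
    by_dim = [sorted(s for s in simplexes if len(s) == d + 1) for d in range(top + 1)]
    ranks = [0] * (top + 2)  # ranks[d] = rank of the boundary map C_d -> C_(d-1)
    for d in range(1, top + 1):
        index = {s: i for i, s in enumerate(by_dim[d - 1])}
        boundaries = []
        for s in by_dim[d]:
            mask = 0
            for face in combinations(s, d):
                mask |= 1 << index[face]
            boundaries.append(mask)
        ranks[d] = rank_gf2(boundaries)
    return [len(by_dim[d]) - ranks[d] - ranks[d + 1] for d in range(top + 1)]

print(betti_numbers([(1, 2, 3), (1, 4)]))
```

A first Betti number of 1 and zeros above it correspond to one connected component and no holes, in agreement with the slide.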

55 55 A Bit IT

    row     A   B   C   D   E   F   G
    #1      1   1   1   1   0   0   0
    #21     0   1   1   0   1   0   0
    #22     0   1   0   1   1   0   0
    #23     0   0   1   1   1   0   0
    #3      1   0   1   0   0   1   1

56 56 Its Simplicial Complex View [Figure: vertices a (#4), b (#2), c (#1), d (#3), e (#5), f (#6), g (#7); open tetrahedron 1 and open tetrahedron 2; {b, c, d} in tetrahedron 2 is removed]

57 Traversal of Weighted Simplicial Complex [1] 4, [5] 3, [5 1] 2, [3] 3, [3 1] 2, [3 5] 2, [3 5 1] 1

58 [2] 3, [2 1] 2, [2 5] 2, [2 5 1] 1, [2 3] 2, [2 3 1] 1, [2 3 5] 1

59 [4] 2, [4 1] 2, [4 3] 1, [4 3 1] 1, [4 2] 1, [4 2 1] 1, [4 2 3] 1, [4 2 3 1] 1

60 [7] 1, [7 1] 1, [7 4] 1, [7 4 1] 1

61 [6] 1, [6 1] 1, [6 4] 1, [6 4 1] 1, [6 7] 1, [6 7 1] 1, [6 7 4] 1, [6 7 4 1] 1

62 Knowledge Complex [1] 4, [5] 3, [5 1] 2, [3] 3, [3 1] 2, [3 5] 2, [2] 3, [2 1] 2, [2 5] 2, [2 3] 2, [4] 2, [4 1] 2

63 Knowledge Complex [5 1] 2, [3 1] 2, [3 5] 2, [2 1] 2, [2 5] 2, [2 3] 2, [4 1] 2

64 64 Knowledge Complex [Figure: vertices a, b, c, d, e (#5); 7 segments and all the points]
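Since this knowledge complex has only points and segments, it is a graph, and its Betti numbers follow from counting connected components. The sketch below is an illustration added here, not a result stated in the talk; the edge list is transcribed from the 2-itemsets of weight 2, and `graph_betti` is a hypothetical helper.

```python
VERTICES = [1, 2, 3, 4, 5]
EDGES = [(5, 1), (3, 1), (3, 5), (2, 1), (2, 5), (2, 3), (4, 1)]

def graph_betti(vertices, edges):
    """Return (b0, b1) of a graph: components and independent cycles."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(vertices)
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    b0 = components
    b1 = len(edges) - len(vertices) + b0  # Euler characteristic: V - E = b0 - b1
    return b0, b1

print(graph_betti(VERTICES, EDGES))
```

With 5 points and 7 segments in one component, this gives three independent cycles, in contrast to the first example, where the filled triangle leaves no cycle.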

65 65 For a real-world database (1256 columns; 65,536 rows), the new algorithm (5.07 secs) runs nearly 200-300 times faster than the FP-growth of Professor Jiawei Han (1283.337036 secs).

66 66 Thanks !

67 67 Simplicial Complex in Web (21) A Web page is 1. a linearly ordered text; 2. a knowledge representation of human thoughts

68 68 1. Wall Street is a symbol for American financial industry. Most of the computer systems for those financial institutions have employed information flow security policy. 2. Wall Street is a shorthand for US financial industry. Its E-security has applied security policy that was based on the ancient intent of Chinese wall. 3. Wall Street represents an abstract concept of financial industry. Its information security policy is Chinese wall.

69 69 2-ary Relation Wall Street InformationSecurity FinanceIndustry

70 70 1. Wall Street is a symbol for American finance industry. Most of the computer systems for those financial institutions have employed information flow security policy. 2. Wall Street is a shorthand for US finance industry. Its E-security has applied security policy that was based on the ancient intent of Chinese wall. 3. Wall Street represents an abstract concept of finance industry. Its information security policy is Chinese wall.

71 71 4-ary Relation: security policy China wall

72 72 Concept Mining Here we used the same idea to do concept mining in Documents

73 73 Concept Analysis Simplex, as an ordered keyword set, represents a Concept in the web Simplicial complex is the knowledge structure of the web
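As a rough illustration of treating an ordered keyword set as a 1-simplex, the sketch below finds the adjacent word pairs that occur in all three "Wall Street" passages from slide 68. The adjacency rule, the requirement that a pair appear in every passage, and the names `ordered_pairs` and `common_pairs` are assumptions made for this example, not the author's concept-mining method.

```python
import re

DOCS = [
    "Wall Street is a symbol for American financial industry. Most of the "
    "computer systems for those financial institute have employed "
    "information flow security policy.",
    "Wall Street is a shorthand for US financial industry. Its E-security has "
    "applied security policy that was based on the ancient intent of Chinese wall.",
    "Wall Street represents an abstract concept of financial industry. Its "
    "information security policy is Chinese wall.",
]

def ordered_pairs(text):
    """Set of adjacent (ordered) word pairs in a text, lower-cased."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(zip(words, words[1:]))

def common_pairs(docs):
    """Ordered pairs that occur in every document: candidate 1-simplexes."""
    pair_sets = [ordered_pairs(d) for d in docs]
    return sorted(set.intersection(*pair_sets))

print(common_pairs(DOCS))
```

On these passages the surviving pairs are "wall street", "financial industry", and "security policy", which are the keyword sets the neighbouring slides treat as concepts.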

74 74 Knowledge Structure Concept: 1-simplex [Figure: a segment joining the vertices Wall and Street] "Wall Street" is a 1-simplex that represents the concept of the financial industry

75 75 Knowledge Structure Concept: 1-simplex [Figure: a segment joining the vertices Finance and Industry] Finance Industry (Stemming)

76 76 Knowledge Structure Indexing the Concepts: by indexing the concepts in the simplicial complex, ... a Knowledge Based Search Engine can be built.

77 77 Concepts will be clustered by Homology Theory. T. Y. LIN: Tung Yen Lin; Tsau Young Lin; ...

78 78 Conclusions

79 79 Key Components 1. GrC Model (U, β). 2. Two Operations (skip): Granulation and Integration. 3. Three Semantic Views on β: Knowledge Engineering (considering); Uncertainty Theory; How-to-solve/compute-it

80 80 Key Components 4. Four Structures: Granular structure/variable (Zadeh); Quotient Structure (QS - Zhang); Knowledge Structure (KS - Pawlak); Linguistic Structure/variable (Zadeh). http://xanadu.cs.sjsu.edu/~grc/grcinfo_center/1Linabs_william.pdf (From TY Lin’s home page -> granular computing conference 2009 -> GrC Information Center -> "Click here for a formal theory" in the first paragraph.)

81 81 Other Applications 2. Information Flow Security, 3rd GrC model. Solves a problem outstanding for 30 years; IEEE SMC 2009

82 82 Other Applications 3. Approximation Theory in the category of Turing machines 7 th GrC Model Expressing DNA sequences by finite automata 2014

83 83 Other Applications Approximation Theory in the category of Functions 6 th GrC Model Patterns in numerical sequences (1999)

84 84 Other “Applications” Interpreting Uncertainty in Quantum Mechanics as GrC 3rd GrC Model Interpreting Approximations in Big Data 1 st GrC Model

85 85 Thanks !

