
Output Sensitive Enumeration



Presentation on theme: "Output Sensitive Enumeration"— Presentation transcript:

1 Output Sensitive Enumeration
9. Data Polishing: Not to Exhaust Enumeration
Abstraction; Motivation on Big Data; Difficulty in Modeling; Graph Polishing; Bigraph Polishing

2 9-1 Enumeration in Practice

3 Enumeration in Practice
• Enumeration algorithms, especially for pattern mining, have been studied intensively. In practice, however, the algorithms have rarely been used. A big problem is the intractably numerous solutions: we do not want to get billions of solutions at hand. Yet "completeness" is exactly the justification for enumeration. And, of course, clustering the solutions does not work at such huge sizes.

Sampling doesn’t Work • The left part has one maximal clique and the right part has exponentially many maximal cliques, so sampling never finds the left one • Dirty areas are intensively investigated and focused on, • but clean, well-organized areas are ignored (figure label: missing edges)

5 Mining & other Methodologies on BigData
Mining finds structures, from big, shallow-in-meaning, sparse, noisy, and ill-granularity data, that are used by methods in the upper layers. (figure: layered picture from BigData — sensor data, GPS data, image data — through mining, feature generation, annotation, crowd sourcing, NLP, image processing, up to machine learning, statistics, visualization, privacy, testing, simulation, and finally Knowledge and Value; surrounding keywords: cloud, sensor, CPS, gamification, algorithm)

Data Abstraction ex.) trajectory: a sequence of points → a sequence of a few places ex.) data points: data objects → local groups of similar objects (figure labels: "was here at 10:40", "this street is common to many people", "stayed in this area for 2 hours", "passed through this part of the street") For example, in location data: we have point sequences with timestamps; there are places where people linger, and street-like parts that many people pass through; the trajectory becomes a sequence of meanings, and is concise.

7 Data Abstraction
trajectory: a sequence of points → a sequence of a few places POS data: customers' baskets → groups of customers and items • Solutions composed of raw data are difficult to understand • Abstract the data into particles so that solutions are composed of abstracted structures that are relatively easy to understand (figure: raw data, particles; DATA — point A, point B; KNOWLEDGE — shop, park)

8 Following the Way of Cognition
• When helping people think, a system's support should follow the way human recognition works: + set and change viewpoints + move the focus to neighbors or similar issues, iteratively + understand the mechanism and project it onto other cases… • For understanding something, abstraction/specification is also popular (figure: individual entity ↔ whole data, with "abstract" and "specify" arrows, leading to knowledge)

9 Abstraction / Specification
• Data is a collection of individual entities. For the collection as a whole, properties of the data are easy to get → size, mean, variance, density, … • Then, specify the data by dividing it, to understand something: segment, classify, look at an attribute, … • On the other side, deep observation and individuality come from the entities → common personalities, non-unique anomalies, diversity, … (figure: individual entity ↔ whole data, with "abstract" and "specify" arrows, leading to knowledge)

10 Abstraction / Specification
Specification: for global properties, the majority, popular attributes — segmentation, distribution, partition, classification, PCA, …
Abstraction: for local structures, minorities, diversity — micro-clustering, pattern mining, anomaly detection, …
(figure labels: minority, community, topics, diversity / segments, PCA, major attributes, global distribution)

11 Machine Learning without Abstraction
• Partition the data into two areas, one including more red points, and the rest • Even though this attains high accuracy, the mechanism of the solution is "hard to understand" (figure caption: Where are the red points concentrated?)

12 With Particles
With particles, it is easier to get some meaning, or to be inspired. (figure caption: Where are the red points? — here; here)

13 9-2 Difficulty in Abstraction

14 Why not Clustering?
Clustering finds (global) "classes", but particles are "structures" … so it has many problems: hugely many small solutions, unbalanced sizes, skewed granularity. Example clusters by graph cut on 10,000 news articles: "Small plane fails to land, Japanese woman…" (airplane accident); "Tokyo stocks up 152 yen at the morning close, 9… for the first time in 8 months" (stock market); "Othello's Matsushima in poor health, Shofukutei Tsurube's…" (entertainment); "Toyota's fiscal year ended March: compensation of President Toyoda and two other executives…" (economy); "Four bamboo-shoot pickers safe after briefly going missing in Senboku, Akita" (accident); … (figure: cluster sizes plotted in decreasing order; 10,000 articles, 1,000,000 topics)

15 Experiment on Reuter Articles
• The sizes follow Zipf's distribution. AMSTERDAM NETHERLANDS: AOT PROVISIONAL 1996 NET ALMOST DOUBLES. MANILA PHILIPPINES: Planters sets 275 to 300 pct stock dividend. SHANGHAI CHINA: Shanghai's tax income up 26.9 pct in 1996. TOKYO JAPAN: Japan bank lending fell 0.4 pct in December -- BOJ. CONCORD, Mass USA: GenRad Inc Q3 diluted shr rises to $0.21. BANGKOK THAILAND: Thai 1996 sugar exports 4.34 mln tonnes - surveyors. HOFFMAN ESTATES, Ill USA: Sears, Roebuck and Co Q3 net higher. HONG KONG HONG KONG: Trade in Chung Hwa Dev shares suspended. SEOUL SOUTH KOREA: S.Korean banks' operating profits rise 16.2 pct. ISLANDIA, N.Y USA: Computer Assocs Q2 shr rises to $0.59. HAMPTON, N.H USA: Wheelabrator Q3 shr down to $0.28. BASLE, Switzerland SWITZERLAND: Roche sees higher 1996 profit.

16 Graph Cut
Partition the graph into two parts

17 Graph Cut
Partition the graph into two parts

18 Graph Cut
Partition the graph into two parts

19 Modularity Maximization
Partition the graph while maximizing modularity
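For reference, the modularity being maximized can be taken as the standard Newman–Girvan quantity Q = Σ_c (e_c/m − (d_c/2m)²); the sketch below computes it (the example graph and partition are illustrative, not from the slide):

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity: Q = sum over communities c of
    (internal_edges_c / m) - (degree_sum_c / 2m)^2."""
    m = len(edges)
    deg = defaultdict(int)       # vertex -> degree
    internal = defaultdict(int)  # community -> #edges inside it
    deg_sum = defaultdict(int)   # community -> sum of member degrees
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    for node, d in deg.items():
        deg_sum[community[node]] += d
    return sum(internal[c] / m - (deg_sum[c] / (2 * m)) ** 2
               for c in deg_sum)

# Two triangles joined by a single bridge edge; splitting at the
# bridge gives Q = 2 * (3/7 - (7/14)^2) = 5/14.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q = modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
```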

20 Ex. Modularity Maximization
Partition the graph while maximizing modularity

21 Ex. Modularity Maximization
Partition the graph while maximizing modularity

22 Maximal Clique Enumeration
Clique: a subgraph that is a complete graph. A clique is maximal when it is not contained in any other clique → a basic model of communities and clusters

23 Maximal Clique Enumeration
Clique: a subgraph that is a complete graph. A clique is maximal when it is not contained in any other clique → a basic model of communities and clusters
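Since a clique is maximal exactly when no vertex can extend it, maximal cliques can be enumerated with the classical Bron–Kerbosch recursion; below is a minimal sketch (no pivoting, toy graph chosen for illustration):

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques via the Bron-Kerbosch recursion
    (minimal variant, no pivoting).  adj: vertex -> set of neighbors."""
    out = []
    def expand(R, P, X):
        # R: current clique; P: candidate extensions; X: already processed
        if not P and not X:
            out.append(R)        # nothing can extend R, so it is maximal
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    expand(set(), set(adj), set())
    return out

# Triangle 0-1-2 with a pendant edge 2-3: exactly two maximal cliques.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
found = sorted(sorted(c) for c in maximal_cliques(adj))
# found == [[0, 1, 2], [2, 3]]
```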

24 Maximal Clique Enumeration
How many maximal cliques are there in this graph? = 4096 #maximal subgraphs with density > 80%: > 20( ) = 81920
Data mining has long neglected practical use, and it has been a game of cat and mouse. It is very hard to efficiently represent what people want to find, and "strange" solutions have been given to users in the past.

25 Popular Approaches • Think about "MODELS" that would represent human thinking well and be practically/theoretically efficient → sophisticated mathematical models, at the cost of a lot of computation time! • Usually, O(n³) is already intractable, even with HPC. Nevertheless, this type of research is the most popular…

26 Algorithmic modeling • Many algorithms generate valuable solutions
→ whether the result of the computation is valuable or not is what matters. #issues that are computable > #issues that can be modeled mathematically • Then, how about skipping the "mathematical modeling" phase? i.e., start modeling by developing good algorithms, possibly followed by mathematical models (figure: problems & motivations → mathematical model → algorithm → users & usability; the "computable" class contains the "mathematically modelable" class)

27 Algorithmic modeling • Many algorithms generate valuable solutions
→ whether the result of the computation is valuable or not is what matters. #issues that are computable > #issues that can be modeled mathematically • Then, how about skipping the "mathematical modeling" phase? i.e., start modeling by developing good algorithms, possibly followed by mathematical models (figure: problems & motivations → algorithm → users & usability; the "computable" class contains the "mathematically modelable" class)

28 9-3 Abstraction by Clarification

29 Basic Idea : Clarify Structures
Why bad? … because the boundaries of the structures are not clear. The analogy: making a picture visually clearer by sharpening edges, erasing noise, removing shadows, … and rearranging objects. At the same time, the accuracy of recognizing, classifying, and segmenting the objects in the picture can be increased.
One cause of these weaknesses of mining is that the hidden structures are not distinct — the boundaries are ambiguous, like an out-of-focus photograph. Then, naturally, we sharpen the edges, smooth the scattered dots, and erase the shadows to clean it up; this is not only prettier for humans but also improves recognition accuracy. Our idea is to bring this clarification into data: arranging things more aggressively should make them easier to understand while raising recognition accuracy. The photo changes, but its content does not — we attack by this analogy.
(figure labels: books, bottles, PC, papers, pens) Do the same in Bigdata!

30 New Approach: Data Polishing
Remove ambiguity by local changes based on a feasible hypothesis (dense graph, sequence/string, pattern) • Modify only parts that should be so, thus no loss of solutions • Clear the ambiguities, unify similar solutions, decrease #solutions • No loss of comprehension

31 Data Polishing for Graph Clustering
Use a necessary condition instead of hard dense-graph enumeration: + if #common neighbors of A and B ≥ k → make edge (A,B); if < k → remove edge (A,B) (neighbor similarity ≥ θ, to make the granularity uniform) • Apply this to all vertex pairs at once • Repeat until convergence (the density of a dense subgraph always increases) Hypothesis: A and B belong to a group ↔ A and B share many neighbors (their neighborhoods are similar).
Turning dense subgraphs into cliques: enumerating dense subgraphs directly is hard → approximate it by a necessary condition based on a feasible hypothesis. As in link prediction, two vertices with many common friends belong to the same cluster, and otherwise they do not; if the count is at least k, draw the edge. Apply this to all vertex pairs at once to update the graph, and repeat until nothing changes. Under the "at least k" rule, a sufficiently large subgraph with density at least 1/3 always increases its density.
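The polishing loop described above can be sketched as follows; the use of closed neighborhoods, the threshold k, and the toy graph are illustrative assumptions, not the lecture's exact parameters:

```python
def polish(adj, k, max_rounds=100):
    """Graph polishing by intersection size: in each round, for every
    vertex pair (u, v) at once, keep/create the edge iff the closed
    neighborhoods share at least k vertices; repeat until convergence."""
    nodes = sorted(adj)
    for _ in range(max_rounds):
        closed = {v: adj[v] | {v} for v in nodes}   # N[v], including v itself
        new = {v: set() for v in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                if len(closed[u] & closed[v]) >= k:
                    new[u].add(v)
                    new[v].add(u)
        if new == adj:        # no edge changed: converged
            return adj
        adj = new
    return adj                # gave up after max_rounds

# A 4-clique missing one edge: with k = 2 the missing edge (0,3) is
# created and the dense subgraph becomes a clique.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
out = polish(adj, k=2)
```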

32 Preliminary Study for Graph Clustering
Companies and their business relations:
            original   polished
 #nodes     3,282      3,282
 #edges     35,168     73,132
 density    3.3‰       6.8‰
 #cliques   32,953     343
These are the results for business-relation data between companies: after polishing, the number of cliques drops and the graph becomes much easier to read, and the clusters indeed contain companies of the same industry.
Prediction accuracy (customer-attribute — here, health-consciousness — prediction by clustering of shopping data; polishing finds better attribute clusters):
            clique    Newman    graph cut
 original   60.60%    59.70%    60.03%
 particle   71.36%    62.76%    67.78%
Noise robustness (discovery rate of clusters (particles) on synthetic data: build a graph consisting only of cliques, add noise, and measure how many ground-truth clusters each method recovers):
            polishing  Newman   graph cut
 noise 10%  100.00%    68.74%   76.10%
 noise 40%   99.69%     7.91%   77.03%

33 Result: Data Polishing
• The number, sizes, and distribution are appropriate, and so are the contents. PARIS FRANCE: PRESS DIGEST - Algeria - Dec 2. PARIS FRANCE: PRESS DIGEST - Algeria - Dec 1. PARIS FRANCE: PRESS DIGEST - Algeria - Nov 27. PARIS FRANCE: PRESS DIGEST - Algeria - Nov 24. PARIS FRANCE: Single currency to be engine for Europe - Juppe. BUENOS AIRES ARGENTINA: TABLE - ARGENTINA POSTS $129 MLN TRADE GAP IN JUNE. BRASILIA BRAZIL: Brazil, Argentina resume talks over import rule. BUENOS AIRES ARGENTINA: Fiat opens $600 mln plant in Argentina. BRASILIA BRAZIL: Mercosur union to hold contest for logo - diplomat.

34 Algorithm
Input: adjacency matrix M (n×n)
1. Construct similarity matrix H where cell (x,y) is one ↔ rows x and y of M are similar (intersection of size > θ, Jaccard distance, etc.) — O(n³)
2. If H ≠ M, set M = H and go to 1
(3. Enumerate all maximal cliques in the resulting graph)

35 Computation Time
• Polishing needs #common neighbors for each pair → n² time is usually too much! • However, most pairs have no common neighbor in real data • v shares a neighbor u only with other neighbors of u → only the "neighbors of neighbors" of v need to be examined • Δ² time if the degree sequence follows a Zipf distribution (Δ: max. degree)
for each neighbor u of v:
  for each neighbor w ≠ v of u: cnt[v,w]++
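The neighbors-of-neighbors counting in the pseudocode above can be written as a small sketch (adjacency stored as sets; the path graph is an illustrative input):

```python
from collections import Counter

def common_neighbor_counts(adj, v):
    """For a vertex v, count |N(v) ∩ N(w)| for every w that shares at
    least one neighbor with v, touching only neighbors-of-neighbors."""
    cnt = Counter()
    for u in adj[v]:          # each neighbor u of v ...
        for w in adj[u]:      # ... is a witness for every other neighbor w of u
            if w != v:
                cnt[w] += 1
    return cnt                # vertices absent from cnt share no neighbor with v

# Path 0-1-2-3: vertex 0 shares exactly one neighbor (vertex 1) with vertex 2.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
c = common_neighbor_counts(adj, 0)
# c == Counter({2: 1})
```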

36 Computation Time
• The computation time for row x is O(Σ_{z∈N[x]} |N[z]|), so the total computation time is O(Σ_x Σ_{z∈N[x]} |N[z]|)
• |N[z]| is counted |{x | z∈N[x]}| times; since |{x | z∈N[x]}| = |N[z]|, we have O(Σ_x Σ_{z∈N[x]} |N[z]|) = O(Σ_z |N[z]|²)
• If the |N[z]| follow Zipf's distribution (the k-th largest satisfies |N[z]|² ≤ αΔ^k, Δ < 1), then O(Σ_z |N[z]|²) = O(Σ_z |N[z]| + Σ_k αΔ^k) = O(Σ_z |N[z]| + α²) — linear time, if α is small
for each z ∈ N[x]:
  for each y ∈ N[z]: c(x,y) = c(x,y) + 1

37 Exp. Web Link Data
• Web link data in Japan (from the Kitsuregawa lab, Univ. of Tokyo): 550M nodes (sites), 1,300M edges (links) (pages shrunk to sites) • Maximal clique enumeration doesn't work: it doesn't terminate • By data polishing, + the number of maximal cliques decreases to about 8,000!!!! (almost all of them are spam sites…) + each site is included in not so many cliques + many large-degree sites are included in some cliques, but not completely (roughly half)

38 9-4 Related Issues and Applications

39 Convergence
• In practice, graph polishing almost always converges • However, there are graphs on which k-intersection polishing does not converge (k-intersection polishing: the similarity measure is intersection size, and the threshold is k) • This example oscillates, and it can be extended to closed intersection and to Jaccard distance (figure: a clique of size 2k plus k edges)

40 Solution Distribution by Size
• In usual mining, when we classify the solutions by size, their number increases exponentially as the solution size grows • With our method, #solutions increases mildly, and there are large solutions → the time to find large solutions is quite short → large non-trivial solutions can be found efficiently • Perhaps we can extract something meaningful (figure: #maximal cliques vs. size)

41 Always Increase Density
• Polishing by "make edge if |N[u] ∩ N[v]| ≥ k" (intersection size)
→ a subgraph with γk vertices s.t. (min. degree) ≥ (γ+1)k/2 always becomes a clique by one application of data polishing
→ a subgraph with γk vertices s.t. (min. degree) ≥ (γ−1)k/3 increases its density if its density δ ≥ …

42 Bounding #Maximal Cliques
• Polishing by "make edge if |N[u] ∩ N[v]| ≥ k" (intersection size) → this defines a graph class s.t. (u,v) ∈ E → |N[u] ∩ N[v]| ≥ k • For this class, we can show that #maximal cliques is at most n^k • If k is constant, enumeration/optimization of cliques becomes polynomial time; other problems would, too

43 Summary; Data Polishing Approach
1. Choose the structures we want to see
2. Model the perfect case of the structures
3. Model the data modification by a feasible hypothesis
4. Construct algorithms for the modification

44 Targeting on Internet Ads.
• Learn which users may have interest in a particular site, using the users' web-browsing log data. Applying data polishing to site clustering made the features of user behavior helpful in the learning process • The similarity of sites is defined by the similarity of the sets of users visiting them • Up to 90% improvement in reducing the number of ads needed to get the required number of clicks

45 Matching Recommendation
• Profile search limits the candidates for marriage → based on the "similar favorites, similar personality" hypothesis, recommend persons from particles of similar favorites. The acceptance ratio increased 2.2-fold! (Ehime Marriage Support Center; Special Award, Regional Informatization Grand Prize of the Ministry of Internal Affairs and Communications) "The 'big data recommendation' introduced me to a nice guy I fell in love with, for the first time in the last 5 years."

46 Finding dense subgraphs and Bid rigging
• Extracting particles from real big data — Chiba-city bids, fiscal 2005 (172 items, 276 companies) — to find sets of items bid on by groups of companies, and groups of companies simultaneously bidding on sets of items • A combination of NMF for the matrix computation and merging of clusters according to "density-first search"

47 Herding in Finance
• Model building for the "herding" phenomenon using individual stock-price data • Representation by a graph sequence: a similarity graph connecting pairs of equities whose prices co-move • The edge density predicts major bottoms of market declines (figure: graph sequence around the Lehman collapse; TOPIX; edge density)

48 9-5 Bipartite Case

49 Bipartite Case • Consider the case of bipartite cliques (matrix);
+ how can we identify that an edge is in a dense subgraph? (finding a dense subgraph of a given size is NP-hard) + dense subgraphs can be exponentially many, which is too many for enumeration

50 Hypothesis in Bipartite Case
• Again, consider the case of bipartite cliques (a matrix) + we observe that usually each row/column is included in not so many biclusters + the clusters overlap with few others • Then, the rows/columns in a bicluster are similar (as 0/1 vectors)

51 Be Simple, then Pitfall
• A row/column is in a cluster → it has several similar ones • Consider the following idea: a cell (x,y) is "one" ↔ row x has t1 similar rows and column y has t2 similar columns → some space that should be empty will be filled (figure: the spurious filled block, where the rows/columns are all the same)

52 Avoiding the Pitfall
• A cell (x,y) should not be "one" if there are few "one"s in its similar rows/columns → so (x,y) is one only if row x is likely to be one at the columns similar to column y (resp., column y at the rows similar to row x) • That is, a cell (x,y) is one ↔ row x (resp. column y) of the matrix and row x (resp. column y) of the similarity matrix are similar

53 Formulae
• A cell (x,y) is one ↔ row x (resp. column y) of the matrix and row x (resp. column y) of the similarity matrix are similar ☆ cell (x1,x2) is 1 in the similarity matrix ↔ rows x1 and x2 are similar → computed by the product of the original matrix and the similarity matrix

54 Variants • We can consider three versions;
consider similarity on rows, on columns, or on both rows and columns • Rows and columns often have different characteristics, so using only rows or only columns may have an advantage (or, using only one side may be enough)

55 Experimental Result
• Randomly generate clusters in a matrix so that each row/column belongs to at most k1/k2 clusters, and a row or column belongs to 0.7·k1/0.7·k2 clusters on average • Delete (set to 0) cells of clusters randomly with probability p, and add cells to each row so that the row becomes t times larger

56 Experimental Result
• Set k1 = 2, k2 = 5, #vertices = 10,000, cluster size = 20×20 • Polish the matrix by using only rows, then enumerate concepts of size no less than 3 (in the column direction) (figure: histogram of #patterns classified by size, separating noise patterns from the original clusters)

57 Result by Spectral Co-Clustering
• Spectral Co-Clustering is a popular bi-clustering method based on a generalized eigenvalue decomposition of the graph Laplacian • The result of this algorithm is not good here: the sizes of the output biclusters are far from those of the planted solutions (figure: size distribution, with the good size marked)

58 Algorithm
Input: matrix M (n×m)
1. Construct similarity matrix H where cell (x,y) is one ↔ rows x and y of M are similar (in, e.g., Jaccard distance) — O(nm²)
2. Set M' to the similarity matrix between the rows of H and of M — O(nm²)
3. If M ≠ M', set M = M' and go to 1
(4. Enumerate all maximal complete submatrices (concepts))
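A minimal sketch of this bipartite (row-side) polishing iteration, assuming Jaccard similarity with an illustrative threshold θ = 0.6 and a toy 3×3 matrix:

```python
def similar(a, b, theta):
    """Jaccard similarity on index sets (one possible choice of measure)."""
    union = len(a | b)
    return union > 0 and len(a & b) / union >= theta

def polish_bipartite(rows, n_cols, theta, max_rounds=20):
    """Row-side polishing sketch for a 0/1 matrix M given as a list of
    column-index sets.  Iterate: H[x] = rows similar to row x; then
    M'[x] contains column y iff H[x] is similar to the set of rows
    supporting column y.  Stop when M no longer changes."""
    n = len(rows)
    for _ in range(max_rounds):
        H = [{x2 for x2 in range(n) if similar(rows[x1], rows[x2], theta)}
             for x1 in range(n)]
        support = [{x for x in range(n) if y in rows[x]}
                   for y in range(n_cols)]   # rows with a 1 in column y
        new = [{y for y in range(n_cols) if similar(H[x], support[y], theta)}
               for x in range(n)]
        if new == rows:
            return rows
        rows = new
    return rows

# Three noisy copies of the block {0,1,2}: the deleted cell (1,2) is refilled.
M = [{0, 1, 2}, {0, 1}, {0, 1, 2}]
out = polish_bipartite(M, n_cols=3, theta=0.6)
```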

59 Intersection Size
• Set similarities are usually computed from |A|, |B| and |A∩B|; usually, |A∩B| = 0 implies the sets are not similar → compute intersection sizes only for pairs of rows x and y having a non-empty intersection • The intersection sizes form M·Mᵀ → sparse matrix multiplication algorithms work • In particular, the computation time is linear when the column sizes follow Zipf's distribution
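The sparse computation of the non-zero cells of M·Mᵀ can be sketched with an inverted index (rows stored as sets of column indices; the example matrix is illustrative):

```python
from collections import defaultdict

def intersection_sizes(rows):
    """Non-zero cells of M @ M^T for a 0/1 matrix M given as a list of
    column-index sets: build an inverted index column -> rows, so only
    pairs with a non-empty intersection are ever touched."""
    by_col = defaultdict(list)
    for x, r in enumerate(rows):
        for y in r:
            by_col[y].append(x)
    sizes = defaultdict(int)
    for members in by_col.values():
        # each column contributes 1 to every pair of rows containing it
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                sizes[(a, b)] += 1
    return dict(sizes)

rows = [{0, 1, 2}, {1, 2}, {5}]
sz = intersection_sizes(rows)
# sz == {(0, 1): 2}: rows 0 and 1 share columns 1 and 2; row 2 shares none
```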

60 Exercise 9

61 Exercises
9-1. Prove that a graph polished by k-intersection polishing can have at most n^k maximal cliques
9-2. Show another example of a graph on which k-intersection polishing does not converge
9-3. What do you think happens when we use |A∩B| / min{|A|, |B|} as the similarity measure for graph polishing?
9-4. We have person-trip data for one day. Considering only the starting point and ending point of each trip, how would you abstract the data, as in clustering? Formulate a clustering problem (or a clarification problem, or an abstraction problem)

