Presentation is loading. Please wait.

Presentation is loading. Please wait.

CMU SCS Bio-informatics, Graph and Stream mining Christos Faloutsos CMU.

Similar presentations


Presentation on theme: "CMU SCS Bio-informatics, Graph and Stream mining Christos Faloutsos CMU."— Presentation transcript:

1 CMU SCS Bio-informatics, Graph and Stream mining Christos Faloutsos CMU

2 CMU SCS IC '07C. Faloutsos2 CONGRATULATIONS!

3 CMU SCS IC '07C. Faloutsos3 Outline Problem definition / Motivation Biological image mining Graphs and power laws Streams and forecasting [Scalability: Gb and Tb of data…] Conclusions

4 CMU SCS IC '07C. Faloutsos4 Motivation Data mining: ~ find patterns (rules, outliers) Trends in fly embryos gene expressions? How do real graphs look like? How do (numerical) streams look like?

5 CMU SCS IC '07C. Faloutsos5 FEMine: Mining Fly Embryos

6 CMU SCS IC '07C. Faloutsos6 Problem Given –fly embryo images –and their labels (eg., 1h, 2h, radiated, etc) Find patterns and trends, e.g., Ultimate goal: how genes affect each other.

7 CMU SCS IC '07C. Faloutsos7 With Eric Xing (CMU CS) Bob Murphy (CMU – Bio) Tim Pan (CMU -> Google) Andre Balan (U. Sao Paulo)

8 CMU SCS IC '07C. Faloutsos8 Outline Problem definition / Motivation Biological image mining Graphs and power laws Streams and forecasting [Scalability: Gb and Tb of data…] Conclusions

9 CMU SCS IC '07C. Faloutsos9 Graphs - why should we care?

10 CMU SCS IC '07C. Faloutsos10 Graphs - why should we care? Internet Map [lumeta.com] Food Web [Martinez ’91] Protein Interactions [genomebiology.com] Friendship Network [Moody ’01]

11 CMU SCS IC '07C. Faloutsos11 Joint work with Dr. Deepayan Chakrabarti (CMU/Yahoo R.L.)

12 CMU SCS IC '07C. Faloutsos12 Problem: network and graph mining How does the Internet look like? How does the web look like? What constitutes a ‘normal’ social network? What is ‘normal’/‘abnormal’? which patterns/laws hold?

13 CMU SCS IC '07C. Faloutsos13 Graph mining Are real graphs random?

14 CMU SCS IC '07C. Faloutsos14 Laws and patterns NO!! Diameter in- and out- degree distributions other (surprising) patterns

15 CMU SCS IC '07C. Faloutsos15 Laws – degree distributions Q: avg degree is ~3 - what is the most probable degree? degree count ?? 3

16 CMU SCS IC '07C. Faloutsos16 Laws – degree distributions Q: avg degree is ~3 - what is the most probable degree? degree count ?? 3 count 3

17 CMU SCS IC '07C. Faloutsos17 Solution: The plot is linear in log-log scale [FFF’99] freq = degree (-2.15) O = -2.15 Exponent = slope Outdegree Frequency Nov’97 -2.15

18 CMU SCS IC '07C. Faloutsos18 But: Q1: How about graphs from other domains? Q2: How about temporal evolution?

19 CMU SCS IC '07C. Faloutsos19 The Peer-to-Peer Topology Frequency versus degree Number of adjacent peers follows a power-law [Jovanovic+]

20 CMU SCS IC '07C. Faloutsos20 More power laws: citation counts: (citeseer.nj.nec.com 6/2001) log(#citations) log(count) Ullman

21 CMU SCS IC '07C. Faloutsos21 More power laws: web hit counts [w/ A. Montgomery] Web Site Traffic log(in-degree) log(count) Zipf users sites ``ebay’’

22 CMU SCS IC '07C. Faloutsos22 But: Q1: How about graphs from other domains? Q2: How about temporal evolution?

23 CMU SCS IC '07C. Faloutsos23 Time evolution with Jure Leskovec (CMU) and Jon Kleinberg (Cornell) (‘best paper’ KDD05)

24 CMU SCS IC '07C. Faloutsos24 Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: –diameter ~ O(log N) –diameter ~ O(log log N) What is happening in real data?

25 CMU SCS IC '07C. Faloutsos25 Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: –diameter ~ O(log N) –diameter ~ O(log log N) What is happening in real data? Diameter shrinks over time –As the network grows the distances between nodes slowly decrease

26 CMU SCS IC '07C. Faloutsos26 Diameter – ArXiv citation graph Citations among physics papers 1992 –2003 One graph per year time [years] diameter

27 CMU SCS IC '07C. Faloutsos27 Diameter – “Patents” Patent citation network 25 years of data time [years] diameter

28 CMU SCS IC '07C. Faloutsos28 Temporal Evolution of the Graphs N(t) … nodes at time t E(t) … edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t)

29 CMU SCS IC '07C. Faloutsos29 Temporal Evolution of the Graphs N(t) … nodes at time t E(t) … edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) A: over-doubled! –But obeying the ``Densification Power Law’’

30 CMU SCS IC '07C. Faloutsos30 Densification – Physics Citations Citations among physics papers 2003: –29,555 papers, 352,807 citations N(t) E(t) 1.69

31 CMU SCS IC '07C. Faloutsos31 Densification – Patent Citations Citations among patents granted 1999 –2.9 million nodes –16.5 million edges Each year is a datapoint N(t) E(t) 1.66

32 CMU SCS IC '07C. Faloutsos32 Outline Problem definition / Motivation Biological image mining Graphs and power laws –Time evolving graphs + tensors Streams and forecasting [Scalability: Gb and Tb of data…] Conclusions

33 CMU SCS IC '07C. Faloutsos33 Tensors for time evolving graphs [Jimeng Sun+ KDD’06] [ “, SMD’07] [ CF, Kolda, Sun, SDM’07 tutorial]

34 CMU SCS IC '07C. Faloutsos34 Application: Network Anomaly Detection Anomaly detection Data –TCP flows collected at CMU backbone –500GB with compression – –1200 timestamps (hours) –‘Tensor’ destination source

35 CMU SCS IC '07C. Faloutsos35 with Jimeng Sun Hui Zhang Yinglian Xie (Dave Anderson)

36 CMU SCS IC '07C. Faloutsos36 destination source Network anomaly detection Identify when and where anomalies occurred. Prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin). scanners Time (hour) destination source error AbnormalNormal

37 CMU SCS IC '07C. Faloutsos37 Outline Problem definition / Motivation Biological image mining Graphs and power laws Streams and forecasting [Scalability: Gb and Tb of data…] Conclusions

38 CMU SCS IC '07C. Faloutsos38 Why care about streams? Sensor devices –Temperature, weather measurements –Road traffic data –Geological observations –Patient physiological data –Chlorine in drinking water (*) Embedded devices –Network routers

39 CMU SCS IC '07C. Faloutsos39 SPIRIT / InteMon http://warsteiner.db.cs.cmu.edu/demo/intem on.jsphttp://warsteiner.db.cs.cmu.edu/demo/intem on.jsp http://localhost:8080/demo/graphs.jsp self-* storage system (PDL/CMU) with Jimeng Sun (CMU/CS) Evan Hoke (CMU/CS-ug) Prof. Greg Ganger (CMU/CS/ECE) John Strunk (CMU/ECE)

40 CMU SCS IC '07C. Faloutsos40 self-* system @ CMU >200 nodes 40 racks of computing equipment 774kw of power. target: 1 PetaByte goal: self-correcting, self- securing, self-monitoring, self-...

41 CMU SCS IC '07C. Faloutsos41 System Architecture

42 CMU SCS IC '07C. Faloutsos42 Data center room monitoring Abnormal dehumidification and reheating cycle is identified Temperature Humidity

43 CMU SCS IC '07C. Faloutsos43 Outline Problem definition / Motivation Biological image mining Graphs and power laws Streams and forecasting [Scalability: Gb and Tb of data…] Conclusions

44 CMU SCS IC '07C. Faloutsos44 Our emphasis: scalability Gb (->Tb) of data Dream: to exploit ‘map-reduce’/hadoop, By re-designing D.M. algorithms for such an environment, within the D.I.S.C. effort (Data Intensive Scientific Computing)

45 CMU SCS IC '07C. Faloutsos45 D.I.S.C. Randy Bryant (SCS Dean) Dave O’Hallaron (dir. of INTEL Pittsburgh) Garth Gibson Greg Ganger ++ INTEL: 15 nodes; Yahoo: ~100 nodes, running hadoop

46 CMU SCS IC '07C. Faloutsos46 DM for Tera- and Peta-bytes Two-way street: <- DM can use such infrastructures to find patterns -> DM can help such infrastructures become self-healing, self-adjusting, ‘self-*’

47 CMU SCS IC '07C. Faloutsos47 Conclusions Biological images, graphs & streams pose fascinating problems self-similarity, fractals and power laws work, when other methods fail! SCALABILITY: creates fascinating research problems!

48 CMU SCS IC '07C. Faloutsos48 Contact info christos@cs.cmu.edu www.cs.cmu.edu/~christos Wean Hall 7107 Ph#: x8.1457 and, again WELCOME!

49 CMU SCS IC '07C. Faloutsos49

50 CMU SCS IC '07C. Faloutsos50

51 CMU SCS IC '07C. Faloutsos51 A famous power law: Zipf’s law Bible - rank vs frequency (log-log) similarly, in many other languages; for customers and sales volume; city populations etc etc log(rank) log(freq)

52 CMU SCS IC '07C. Faloutsos52 Olympic medals (Sidney’00, Athens’04): log( rank) log(#medals)

53 CMU SCS IC '07C. Faloutsos53

54 CMU SCS IC '07C. Faloutsos54 Diameter – “Autonomous Systems” Graph of Internet One graph per day 1997 – 2000 number of nodes diameter

55 CMU SCS IC '07C. Faloutsos55 Diameter – “Affiliation Network” Graph of collaborations in physics – authors linked to papers 10 years of data time [years] diameter

56 CMU SCS IC '07C. Faloutsos56 Densification – Autonomous Systems Graph of Internet 2000 –6,000 nodes –26,000 edges One graph per day N(t) E(t) 1.18

57 CMU SCS IC '07C. Faloutsos57 Densification – Affiliation Network Authors linked to their publications 2002 –60,000 nodes 20,000 authors 38,000 papers –133,000 edges N(t) E(t) 1.15


Download ppt "CMU SCS Bio-informatics, Graph and Stream mining Christos Faloutsos CMU."

Similar presentations


Ads by Google