CAS May 24, 2010 Creating a science base to support new directions in computer science John Hopcroft Cornell University Ithaca, New York.

1 CAS May 24, 2010 Creating a science base to support new directions in computer science John Hopcroft Cornell University Ithaca, New York

2 CAS May 24, 2010 Time of change
• The information age is a fundamental revolution that is changing all aspects of our lives.
• Those individuals and nations who recognize this change and position themselves for the future will benefit enormously.

3 CAS May 24, 2010 Drivers of change
• Merging of computing and communications
• Data available in digital form
• Networked devices and sensors
• Computers becoming ubiquitous

4 CAS May 24, 2010 Internet search engines are changing When was Einstein born? Einstein was born at Ulm, in Wurttemberg, Germany, on March 14, 1879. List of relevant web pages

6 Internet queries will be different
• Which car should I buy?
• What are the key papers in Theoretical Computer Science?
• Construct an annotated bibliography on graph theory.
• Where should I go to college?
• How did the field of CS develop?

7 CAS May 24, 2010 Which car should I buy? Search engine response: Which criteria below are important to you?
• Fuel economy
• Crash safety
• Reliability
• Performance
• Etc.

8 CAS May 24, 2010
Make           Cost     Reliability   Fuel economy   Crash safety
Toyota Prius   23,780   Excellent     44 mpg         Fair
Honda Accord   28,695   Better        26 mpg         Excellent
Toyota Camry   29,839   Average       24 mpg         Good
Lexus 350      38,615   Excellent     23 mpg         Good
Infiniti M35   47,650   Excellent     19 mpg         Good
(A final column, "Links to photos/articles", accompanies each row.)

10 2010 Toyota Camry - Auto Shows
Toyota sneaks the new Camry into the Detroit Auto Show.
Usually, redesigns and facelifts of cars as significant as the hot-selling Toyota Camry are accompanied by a commensurate amount of fanfare. So we were surprised when, right about the time that we were walking by the Toyota booth, a chirp of our Blackberries brought us the press release announcing that the facelifted 2010 Toyota Camry and Camry Hybrid mid-sized sedans were appearing at the 2009 NAIAS in Detroit.
We’d have hardly noticed if they hadn’t told us: the headlamps are slightly larger, the grilles on the gas and hybrid models go their own way, and taillamps become primarily LED. Wheels are also new, but overall, the resemblance to the Corolla is downright uncanny. Let’s hear it for brand consistency!
Four-cylinder Camrys get Toyota’s new 2.5-liter four-cylinder with a boost in horsepower to 169 for LE and XLE grades, 179 for the Camry SE, all of which are available with six-speed manual or automatic transmissions. Camry V-6 and Hybrid models are relatively unchanged under the skin. Inside, changes are likewise minimal: the options list has been shaken up a bit, but the only visible change on any Camry model is the Hybrid’s new gauge cluster and softer seat fabrics. Pricing will be announced closer to the time it goes on sale this March.
Top Competitors: Chevrolet Malibu, Ford Fusion, Honda Accord sedan

11 CAS May 24, 2010 Which are the key papers in Theoretical Computer Science?
Hartmanis and Stearns, “On the computational complexity of algorithms”
Blum, “A machine-independent theory of the complexity of recursive functions”
Cook, “The complexity of theorem proving procedures”
Karp, “Reducibility among combinatorial problems”
Garey and Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness”
Yao, “Theory and Applications of Trapdoor Functions”
Shafi Goldwasser, Silvio Micali, Charles Rackoff, “The Knowledge Complexity of Interactive Proof Systems”
Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy, “Proof Verification and the Hardness of Approximation Problems”

13 Fed Ex package tracking

19 NEXRAD Radar: Binghamton, Base Reflectivity, 0.50 Degree Elevation, Range 124 NMI

20 CAS May 24, 2010 Collective Inference on Markov Models for Modeling Bird Migration Space time

21 CAS May 24, 2010 Daniel Sheldon, M. A. Saleh Elmohamed, Dexter Kozen

22 CAS May 24, 2010 Science base to support activities Track flow of ideas in scientific literature Track evolution of communities in social networks Extract information from unstructured data sources.

23 CAS May 24, 2010 Tracking the flow of ideas in scientific literature Yookyung Jo

24 CAS May 24, 2010 Tracking the flow of ideas in the scientific literature Yookyung Jo Page rank Web Link Graph Retrieval Query Search Text Web Page Search Rank Web Chord Usage Index Probabilistic Text File Retrieve Text Index Discourse Word Centering Anaphora

25 CAS May 24, 2010 Original papers

26 CAS May 24, 2010 Original papers cleaned up

27 CAS May 24, 2010 Referenced papers

28 CAS May 24, 2010 Referenced papers cleaned up. Three distinct categories of papers

30 Tracking communities in social networks Liaoruo Wang

31 CAS May 24, 2010 “Statistical Properties of Community Structure in Large Social and Information Networks”, Jure Leskovec; Kevin Lang; Anirban Dasgupta; Michael Mahoney Studied over 70 large sparse real-world networks. Best communities approximately size 100 to 150.

32 CAS May 24, 2010 Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like."

33 CAS May 24, 2010 (Plot of conductance versus size of community; 100 is marked on the size axis.)

34 CAS May 24, 2010 Giant component

35 CAS May 24, 2010 Whisker: A component with v vertices connected by edges

36 CAS May 24, 2010 Our view of a community TCS Me Colleagues at Cornell Classmates Family and friends More connections outside than inside

37 CAS May 24, 2010 Should we remove all whiskers and search for communities in core? Core

38 CAS May 24, 2010 Should we remove whiskers?
Does there exist a core in social networks? Yes. Experimentally it appears at p=1/n in the G(n,p) model.
Is the core unique? Yes and no.
In the G(n,p) model, should we require that a whisker have only a finite number of edges connecting it to the core?
Laura Wang

39 CAS May 24, 2010 Algorithms How do you find the core? Are there communities in the core of social networks?

40 CAS May 24, 2010 How do we find whiskers?
Deciding whether a graph has a whisker is NP-complete.
There exist graphs with whiskers for which neither the union of two whiskers nor the intersection of two whiskers is a whisker.

41 CAS May 24, 2010 Graph with no unique core 3 1 1

42 CAS May 24, 2010 Graph with no unique core 1

43 CAS May 24, 2010 What is a community? How do you find them?

44 CAS May 24, 2010 Communities
• Conductance
• Can we compress graph?
• Rosvall and Bergstrom, “An information-theoretic framework for resolving community structure in complex networks”
• Hypothesis testing
• Yookyung Jo
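
Conductance is named on this slide without a definition. The sketch below assumes one common definition: the number of edges leaving the candidate community divided by the smaller of the two total degrees (inside and outside). The function name `conductance` and the toy graph are illustrative and do not come from the talk.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected graph.

    adj maps each vertex to the set of its neighbours.
    Assumed definition: cut edges / min(volume inside, volume outside).
    """
    inside = set(community)
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_in = sum(len(adj[u]) for u in inside)
    vol_out = sum(len(adj[u]) for u in adj if u not in inside)
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else float("inf")


# Toy graph: two triangles joined by the single edge 2-3.
toy_graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
triangle_score = conductance(toy_graph, {0, 1, 2})
```

A low score means few edges leave the set relative to its total degree, which is the sense in which a set is "community-like" in the plots discussed earlier.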

45 CAS May 24, 2010 Description of graph with community structure Specify which vertices are in which communities Specify the number of edges between each pair of communities

46 CAS May 24, 2010 Information necessary to specify graph given community structure
$m$ = number of communities
$n_i$ = number of vertices in the $i$th community
$l_{ij}$ = number of edges between the $i$th and $j$th communities
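
One way to make "information necessary" concrete is to count the graphs consistent with a given structure and take a logarithm. The sketch below does that under stated assumptions: the graph is simple and undirected, `links[i][i]` is taken to mean the number of edges inside community i (the slide only defines the between-community case), and the function name is hypothetical.

```python
import math


def bits_given_structure(sizes, links):
    """Bits needed to pick one graph among all graphs with this structure.

    sizes[i]    = n_i, number of vertices in community i
    links[i][j] = l_ij for i < j; links[i][i] = edges inside community i
                  (the within-community reading is an assumption)
    """
    m = len(sizes)
    bits = 0.0
    for i in range(m):
        pairs_inside = sizes[i] * (sizes[i] - 1) // 2
        bits += math.log2(math.comb(pairs_inside, links[i][i]))
        for j in range(i + 1, m):
            pairs_between = sizes[i] * sizes[j]
            bits += math.log2(math.comb(pairs_between, links[i][j]))
    return bits


# Two triangles (3 internal edges each) joined by one edge.
example_bits = bits_given_structure([3, 3], [[3, 1], [1, 3]])
```

In this toy case the only freedom left is which of the 9 cross pairs carries the single connecting edge, so the cost is log2(9) bits.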

47 CAS May 24, 2010 Description of graph consists of description of community structure plus specification of graph given structure. Specify community for each edge and the number of edges between each community Can this method be used to specify more complex community structure where communities overlap?

48 CAS May 24, 2010 Hypothesis testing
Null hypothesis: All edges generated with some probability $p_0$.
Hypothesis: Edges in communities generated with probability $p_1$, other edges with probability $p_0$.
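
The two hypotheses can be compared with a log-likelihood ratio. This is a minimal sketch, not the method from the talk: it assumes a single candidate community, independent edges, and maximum-likelihood estimates for the probabilities. All function names are illustrative.

```python
import math


def log_lik(k, n, p):
    """Log-likelihood of k edges among n independent pairs at probability p.

    Intended for p = k / n (the maximum-likelihood estimate), so log(0)
    is never evaluated.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total


def community_log_likelihood_ratio(num_vertices, edges, community):
    """Alternative (p1 inside, p0 elsewhere) minus null (one p0 everywhere)."""
    inside = set(community)
    pairs_total = num_vertices * (num_vertices - 1) // 2
    pairs_in = len(inside) * (len(inside) - 1) // 2
    pairs_out = pairs_total - pairs_in
    edges_in = sum(1 for u, v in edges if u in inside and v in inside)
    edges_out = len(edges) - edges_in

    null = log_lik(len(edges), pairs_total, len(edges) / pairs_total)
    p1 = edges_in / pairs_in if pairs_in else 0.0
    p0 = edges_out / pairs_out if pairs_out else 0.0
    alternative = log_lik(edges_in, pairs_in, p1) + log_lik(edges_out, pairs_out, p0)
    return alternative - null


# Six vertices: a triangle on {0, 1, 2} plus the edge 3-4.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
ratio_triangle = community_log_likelihood_ratio(6, toy_edges, {0, 1, 2})
ratio_scattered = community_log_likelihood_ratio(6, toy_edges, {0, 3, 5})
```

The ratio is never negative because the null is a special case of the alternative; a real test would still need a threshold or a correction for having searched over many candidate sets.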

49 CAS May 24, 2010 Massively overlapping communities Are there a small number of massively overlapping communities that share a common core? Are there massively overlapping communities in which one can move from one community to a totally disjoint community?

50 CAS May 24, 2010 Massively overlapping communities with a common core

51 CAS May 24, 2010 Massively overlapping communities

52 CAS May 24, 2010 Clustering social networks (Mishra, Schreiber, Stanton, and Tarjan)
• Each member of the community is connected to at least a beta fraction of the community
• No member outside the community is connected to more than an alpha fraction of the community
• Some connectivity constraint
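
The two fraction conditions can be checked directly for a candidate set. The sketch below is an assumption-laden reading of the slide: fractions are measured against the community size, a vertex is not counted as its own neighbour, and the unspecified connectivity constraint is ignored. The function name is hypothetical.

```python
def is_alpha_beta_community(adj, community, alpha, beta):
    """Check the alpha and beta conditions for a candidate community.

    adj maps each vertex to the set of its neighbours (undirected graph).
    """
    members = set(community)
    size = len(members)
    for vertex, neighbours in adj.items():
        links_into_community = len(neighbours & members)
        if vertex in members:
            # Members need at least a beta fraction of the community.
            if links_into_community < beta * size:
                return False
        elif links_into_community > alpha * size:
            # Outsiders may reach at most an alpha fraction.
            return False
    return True


# Two triangles joined by the single edge 2-3.
toy_graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
accepted = is_alpha_beta_community(toy_graph, {0, 1, 2}, alpha=0.4, beta=0.6)
rejected = is_alpha_beta_community(toy_graph, {0, 1, 2}, alpha=0.2, beta=0.6)
```

This only verifies a given set; finding such sets, especially in sparse graphs, is the harder question.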

53 CAS May 24, 2010 In sparse graphs
How do you find alpha-beta communities?
What if each person in the community is connected to more members outside the community than inside?

54 CAS May 24, 2010 Sucheta Soundarajan Transmission paths for viruses, flow of ideas, or influence

55 CAS May 24, 2010 Trivial model Half of contacts Two thirds of contacts

56 CAS May 24, 2010 Time of first item Time of second item

57 CAS May 24, 2010 Theory to support new directions
• Large graphs
• Spectral analysis
• High dimensions and dimension reduction
• Clustering
• Collaborative filtering
• Extracting signal from noise

58 CAS May 24, 2010 Theory of Large Graphs
• Large graphs with billions of vertices
• Exact edges present not critical
• Invariant to small changes in definition
• Must be able to prove basic theorems

59 CAS May 24, 2010 Erdős-Rényi
• n vertices
• each of $n^2$ potential edges is present with independent probability p
Binomial degree distribution: $\binom{N}{n} p^n (1-p)^{N-n}$
(Plot: number of vertices versus vertex degree.)
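
The binomial degree distribution can be checked empirically. This sketch samples a simple undirected random graph, so each vertex has n - 1 potential neighbours; that choice, the parameter values, and the function names are assumptions for illustration.

```python
import math
import random


def sample_gnp_degrees(n, p, rng):
    """Sample one undirected G(n, p) graph and return each vertex's degree."""
    degree = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # each pair is an edge independently
                degree[u] += 1
                degree[v] += 1
    return degree


def binomial_pmf(trials, k, p):
    """Probability of exactly k successes in `trials` independent tries."""
    return math.comb(trials, k) * p**k * (1 - p) ** (trials - k)


rng = random.Random(0)
n, p = 400, 0.05
degrees = sample_gnp_degrees(n, p, rng)
mean_degree = sum(degrees) / n
expected_mean = (n - 1) * p  # mean of Binomial(n - 1, p)
```

The sampled mean degree should sit close to `expected_mean`, and a histogram of `degrees` should track `binomial_pmf(n - 1, k, p)` scaled by n.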

61 Generative models for graphs
• Vertices and edges added at each unit of time
• Rule to determine where to place edges
• Uniform probability
• Preferential attachment: gives rise to power law degree distributions

62 CAS May 24, 2010 Preferential attachment gives rise to the power law degree distribution common in many graphs. (Plot: number of vertices versus vertex degree.)
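
A generative model of this kind fits in a few lines. The sketch below assumes the simplest variant: one new vertex per time step, attaching a single edge to an existing vertex chosen with probability proportional to its current degree. The talk does not specify these details, so treat them as illustrative.

```python
import random


def preferential_attachment_degrees(num_vertices, rng):
    """Grow a graph by preferential attachment and return vertex degrees."""
    degree = [1, 1]     # start from one edge between vertices 0 and 1
    endpoints = [0, 1]  # one entry per edge endpoint, so a uniform draw
                        # from this list is a degree-proportional draw
    for new_vertex in range(2, num_vertices):
        target = rng.choice(endpoints)
        degree.append(1)
        degree[target] += 1
        endpoints.extend([new_vertex, target])
    return degree


rng = random.Random(1)
degrees = preferential_attachment_degrees(5000, rng)
largest_degree = max(degrees)
mean_degree = sum(degrees) / len(degrees)
```

The mean degree stays near 2 while a few early vertices collect far more edges, the heavy tail that a uniform placement rule does not produce.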

63 CAS May 24, 2010 Protein interactions 2730 proteins in data base 3602 interactions between proteins Science 1999 July 30; 285:751-753 Only 899 proteins in components. Where are the 1851 missing proteins?

64 CAS May 24, 2010 Protein interactions 2730 proteins in data base 3602 interactions between proteins Science 1999 July 30; 285:751-753

65 CAS May 24, 2010 Giant Component
1. Create n isolated vertices
2. Add edges randomly one by one
3. Compute number of connected components
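
The three steps above can be sketched with a union-find structure. This is an illustrative implementation, not the one used for the talk: edges are drawn uniformly between distinct vertices and may repeat, and the function names are assumptions.

```python
import random
from collections import Counter


def find(parent, x):
    """Return the root of x's component, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def component_sizes_after(n, num_edges, rng):
    """Add random edges to n isolated vertices; return {size: count}."""
    parent = list(range(n))  # step 1: n isolated vertices
    size = [1] * n
    for _ in range(num_edges):  # step 2: add edges one by one
        u, v = rng.sample(range(n), 2)
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:
            if size[root_u] < size[root_v]:
                root_u, root_v = root_v, root_u
            parent[root_v] = root_u
            size[root_u] += size[root_v]
    roots = {find(parent, x) for x in range(n)}  # step 3: count components
    return Counter(size[r] for r in roots)


for num_edges in (0, 250, 500, 750):
    histogram = component_sizes_after(1000, num_edges, random.Random(0))
    print(num_edges, sorted(histogram.items()))
```

Each printed line lists (component size, number of components) pairs, the same kind of table shown on the following slides.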

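66 CAS May 24, 2010 Giant Component
Size of component      1
Number of components   1000
Size of component      1     2
Number of components   998   1
Size of component      1     2    3    4    5   6   7   8   9   10   11
Number of components   548   89   28   14   9   5   3   1   1   1    1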

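67 CAS May 24, 2010 Giant Component
Size of component      1     2    3    4    5   6   7   8
Number of components   367   70   24   12   9   3   2   2
Size of component      9   10   12   13   14   20   55   101
Number of components   2   2    1    2    2    1    1    1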

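68 CAS May 24, 2010 Giant Component
Size of component      1     2    3    4   5   6   7   8   9   10
Number of components   252   39   13   6   3   6   2   1   1   0
Size of component      11   12   13   14   15   16   17   18   514
Number of components   1    0    0    0    0    0    0    0    1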

69 CAS May 24, 2010 Science base
What do we mean by science base?
• Example: High dimensions

70 CAS May 24, 2010 High dimension is fundamentally different from 2 or 3 dimensional space

71 CAS May 24, 2010 High dimensional data is inherently unstable. Given n random points in d-dimensional space, essentially all $n^2$ distances are equal.

72 CAS May 24, 2010 High Dimensions Intuition from two and three dimensions not valid for high dimension Volume of cube is one in all dimensions Volume of sphere goes to zero
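
The claim about the sphere can be checked with the standard closed form for the volume of a radius-1 ball in d dimensions, $\pi^{d/2} / \Gamma(d/2 + 1)$. The formula is a well-known result rather than something stated on the slide, and the function name is illustrative.

```python
import math


def unit_ball_volume(d):
    """Volume of the radius-1 ball in d dimensions."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


for d in (2, 3, 5, 10, 20, 50, 100):
    print(d, unit_ball_volume(d))
```

The printed volume rises slightly for small d and then falls rapidly toward zero, while a cube of side 1 has volume 1 in every dimension.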

73 CAS May 24, 2010 1 Unit sphere Unit square 2 Dimensions

74 CAS May 24, 2010 4 Dimensions

75 CAS May 24, 2010 d Dimensions 1

76 CAS May 24, 2010 Almost all area of the unit cube is outside the unit sphere

77 CAS May 24, 2010 Gaussian distribution Probability mass concentrated between dotted lines

78 CAS May 24, 2010 Gaussian in high dimensions

79 CAS May 24, 2010 Two Gaussians

81 Distance between two random points from same Gaussian
• Points on thin annulus of radius
• Approximate by sphere of radius
• Average distance between two points is
(Place one point at the north pole and the other at random. Almost surely the second point is near the equator.)
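
The radius and distance values on this slide did not survive in the transcript. As a hedged illustration only, assume a spherical Gaussian centred at the origin with unit variance in each of d coordinates (an assumption, not stated here). Under that assumption points concentrate near radius $\sqrt{d}$ and two independent points lie about $\sqrt{2d}$ apart, which matches the north pole and equator picture. The sketch checks both numerically; all names are illustrative.

```python
import math
import random


def gaussian_point(d, rng):
    """One sample from a d-dimensional unit-variance Gaussian at the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)
d = 2000
x = gaussian_point(d, rng)
y = gaussian_point(d, rng)
origin = [0.0] * d

norm_ratio = distance(x, origin) / math.sqrt(d)    # close to 1
pair_ratio = distance(x, y) / math.sqrt(2 * d)     # close to 1
```

Both ratios approach 1 as d grows, which is also why nearly all pairwise distances among random points look the same in high dimension.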

84 Expected distance between pts from two Gaussians separated by δ

85 CAS May 24, 2010 Can separate points from two Gaussians if

86 CAS May 24, 2010 Dimension reduction Project points onto subspace containing centers of Gaussians Reduce dimension from d to k, the number of Gaussians

87 CAS May 24, 2010 Centers retain separation Average distance between points reduced by

88 CAS May 24, 2010 Can separate Gaussians provided > some constant involving k and γ independent of the dimension

89 Finding centers of Gaussians First singular vector v 1 minimizes the perpendicular distance to points CAS May 24, 2010

90 The first singular vector goes through center of Gaussian and minimizes distance to points Gaussian CAS May 24, 2010
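
A small numerical check of this statement, sketched with power iteration on $A^T A$ in plain Python rather than a library SVD. The data are synthetic: a single unit-variance Gaussian in 20 dimensions whose center is placed at distance 5 along the first axis. Those parameters and the function name are assumptions for illustration.

```python
import math
import random


def first_singular_vector(points, iterations=50):
    """Approximate the top right singular vector of the point matrix A."""
    d = len(points[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # projections = A v
        projections = [sum(a * b for a, b in zip(p, v)) for p in points]
        # w = A^T (A v)
        w = [sum(proj * p[j] for proj, p in zip(projections, points))
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


rng = random.Random(0)
center = [5.0] + [0.0] * 19
points = [[c + rng.gauss(0.0, 1.0) for c in center] for _ in range(200)]

v1 = first_singular_vector(points)
alignment = abs(v1[0])  # |cosine| between v1 and the direction of the center
```

An `alignment` close to 1 means the line spanned by the first singular vector passes close to the Gaussian's center, as the slide states.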

91 Best k-dimensional space for Gaussian is any space containing the line through the center of the Gaussian CAS May 24, 2010

92 Given k Gaussians, the top k singular vectors define a k dimensional space that contains the k lines through the centers of the k Gaussian and hence contain the centers of the Gaussians CAS May 24, 2010

93 We have just seen what a science base for high dimensional data might look like. What other areas do we need to develop a science base for?

94 CAS May 24, 2010
• Ranking is important
• Restaurants, movies, books, web pages
• Multibillion-dollar industry
• Collaborative filtering
• When a customer buys a product, what else is he likely to buy?
• Dimension reduction
• Extracting information from large data sources
• Social networks

95 CAS May 24, 2010 Conclusions
• We are in an exciting time of change.
• Information technology is a big driver of that change.
• Computer science theory needs to be developed to support this information age.

