Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York.

Slides:



Advertisements
Similar presentations
Future directions in computer science research John Hopcroft Department of Computer Science Cornell University Heidelberg Laureate Forum Sept 27, 2013.
Advertisements

Sergey Bravyi, IBM Watson Center Robert Raussendorf, Perimeter Institute Perugia July 16, 2007 Exactly solvable models of statistical physics: applications.
The Theory of Zeta Graphs with an Application to Random Networks Christopher Ré Stanford.
Feature Selection as Relevant Information Encoding Naftali Tishby School of Computer Science and Engineering The Hebrew University, Jerusalem, Israel NIPS.
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Analysis and Modeling of Social Networks Foudalis Ilias.
Week 5 - Models of Complex Networks I Dr. Anthony Bonato Ryerson University AM8002 Fall 2014.
Minimum Energy Mobile Wireless Networks IEEE JSAC 2001/10/18.
Web as Network: A Case Study Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Beyond Trilateration: On the Localizability of Wireless Ad Hoc Networks Reported by: 莫斌.
Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements Raju Balakrishnan (Arizona State University)
Generative Models for the Web Graph José Rolim. Aim Reproduce emergent properties: –Distribution site size –Connectivity of the Web –Power law distriubutions.
Future directions in computer science research 23rd International Symposium on Algorithms and Computation John Hopcroft Cornell University ISAAC.
The influence of search engines on preferential attachment Dan Li CS3150 Spring 2006.
On the Spread of Viruses on the Internet Noam Berger Joint work with C. Borgs, J.T. Chayes and A. Saberi.
Future directions in computer science research John Hopcroft Department of Computer Science Cornell University CINVESTAV-IPN Dec 2,2013.
Small Worlds Presented by Geetha Akula For the Faculty of Department of Computer Science, CALSTATE LA. On 8 th June 07.
Algorithmic and Economic Aspects of Networks Nicole Immorlica.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Advanced Topics in Data Mining Special focus: Social Networks.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 7 May 14, 2006
Network Science and the Web: A Case Study Networked Life CIS 112 Spring 2009 Prof. Michael Kearns.
Computer Science 1 Web as a graph Anna Karpovsky.
1 Refined Search Tree Technique for Dominating Set on Planar Graphs Jochen Alber, Hongbing Fan, Michael R. Fellows, Henning Fernau, Rolf Niedermeier, Fran.
The Erdös-Rényi models
Information Networks Power Laws and Network Models Lecture 3.
Primal-Dual Meets Local Search: Approximating MST’s with Non-uniform Degree Bounds Author: Jochen Könemann R. Ravi From CMU CS 3150 Presentation by Dan.
Programming for Geographical Information Analysis: Advanced Skills Online mini-lecture: Introduction to Networks Dr Andy Evans.
Creating a science base to support new directions in computer science Cornell University Ithaca New York John Hopcroft.
Percolation in self-similar networks Dmitri Krioukov CAIDA/UCSD M. Á. Serrano, M. Boguñá UNT, March 2011.
Scalable and Efficient Data Streaming Algorithms for Detecting Common Content in Internet Traffic Minho Sung Networking & Telecommunications Group College.
Building a Science Base for the Information Age John Hopcroft Cornell University Ithaca, NY Xiamen University.
An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.
Expanders via Random Spanning Trees R 許榮財 R 黃佳婷 R 黃怡嘉.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
1 Burning a graph as a model of social contagion Anthony Bonato Ryerson University Institute of Software Chinese Academy of Sciences.
Random-Graph Theory The Erdos-Renyi model. G={P,E}, PNP 1,P 2,...,P N E In mathematical terms a network is represented by a graph. A graph is a pair of.
Lecture 5: Mathematics of Networks (Cont) CS 790g: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
2010 © University of Michigan 1 DivRank: Interplay of Prestige and Diversity in Information Networks Qiaozhu Mei 1,2, Jian Guo 3, Dragomir Radev 1,2 1.
Mathematics of Networks (Cont)
Random Dot Product Graphs Ed Scheinerman Applied Mathematics & Statistics Johns Hopkins University IPAM Intelligent Extraction of Information from Graphs.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Slides are modified from Lada Adamic
Lecture 10: Network models CS 765: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
Approximate Inference: Decomposition Methods with Applications to Computer Vision Kyomin Jung ( KAIST ) Joint work with Pushmeet Kohli (Microsoft Research)
The Structure of the Web. Getting to knowing the Web How big is the web and how do you measure it? How many people use the web? How many use search engines?
Topics Paths and Circuits (11.2) A B C D E F G.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Percolation in self-similar networks PRL 106:048701, 2011
Miniconference on the Mathematics of Computation
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Models of Web-Like Graphs: Integrated Approach
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
Topics In Social Computing (67810) Module 1 (Structure) Centrality Measures, Graph Clustering Random Walks on Graphs.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Cohesive Subgraph Computation over Large Graphs
Stochastic Streams: Sample Complexity vs. Space Complexity
Centralities (4) Ralucca Gera,
Modelling and Searching Networks Lecture 5 – Random graphs
Clustering.
Modelling and Searching Networks Lecture 6 – PA models
Discrete Mathematics and its Applications Lecture 5 – Random graphs
Network Models Michael Goodrich Some slides adapted from:
Discrete Mathematics and its Applications Lecture 6 – PA models
Presentation transcript:

Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Time of change There is a fundamental revolution taking place that is changing all aspects of our lives. Those individuals who recognize this and position themselves for the future will benefit enormously.

Drivers of change in Computer Science Computers are becoming ubiquitous Speed sufficient for word processing, , chat and spreadsheets Merging of computing and communications Data available in digital form Devices being networked

Computer Science departments are beginning to develop courses that cover underlying theory for Random graphs Phase transitions Giant components Spectral analysis Small world phenomena Grown graphs

What is the theory needed to support the future? Search Networks and sensors Large amounts of noisy, high dimensional data Integration of systems and sources of information

Internet queries are changing Today Autos Graph theory Colleges, universities Computer science Tomorrow Which car should I buy? Construct an annotated bibliography on graph theory Where should I go to college? How did the field of CS develop?

What car should I buy? List of makes  Cost  Reliability  Fuel economy  Crash safety Pertinent articles  Consumer reports  Car and driver

Where should I go to college? List of factors that might influence choice  Cost  Geographic location  Size  Type of institution Metrics  Ranking of programs  Student faculty ratios  Graduates from your high school/neighborhood

How did field develop? From ISI database create set of important papers in field? Extract important authors author of key paper author of several important papers thesis advisor of Ph.D.’s student(s) who is (are) important author(s) Extract key institutions institution of important author(s) Ph.D. granting institution of important authors Output Directed graph of key papers Flow of Ph.D.’s geographically with time Flow of ideas

Theory to support new directions Large graphs Spectral analysis High dimensions and dimension reduction Clustering Collaborative filtering Extracting signal from noise

Graph Theory of the 50’s Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction.

Theory of Large Graphs Large graphs Billion vertices Exact edges present not critical Theoretical basis for study of large graphs Maybe theory of graph generation Invariant to small changes in definition Must be able to prove basic theorems

Erdös-Renyi n vertices each of n 2 potential edges is present with independent probability NnNn p n (1-p) N-n vertex degree binomial degree distribution number of vertices

Generative models for graphs Vertices and edges added at each unit of time Rule to determine where to place edges  Uniform probability  Preferential attachment- gives rise to power law degree distributions

Vertex degree Number of vertices Preferential attachment gives rise to the power law degree distribution common in many graphs

Protein interactions 2730 proteins in data base 3602 interactions between proteins Science 1999 July 30; 285:

Giant Component 1.Create n isolated vertices 2.Add Edges randomly one by one 3.Compute number of connected components

Giant Component

Giant Component

Tubes Tendrils In 44 million Out 44 million SCC 56 million nodes Disconnected components Source: almaden.ibm.com

Phase transitions G(n,p)  Emergence of cycle  Giant component  Connected graph N(p)  Emergence of arithmetic sequence CNF satisfiability  Fix number of variables, increase number of clauses

Access to Information SMART Technology document aardvark0 abacus0. antitrust42. CEO17. microsoft61. windows14 wine0 wing0 winner3 winter0. zoo0 zoology0 Zurich0

Locating relevant documents Query: Where can I get information on gates? 2,060,000 hits Bill Gates593,000 Gates county177,000 baby gates170,000 gates of heaven169,000 automatic gates 83,000 fences and gates 43,000 Boolean gates 19,000

Clustering documents cluster documents refine cluster microsoft windows antitrust Boolean gates Gates County automatic gates Bill Gates

Refinement of another type: books mathematics physics chemistry astronomy children’s books textbooks reference books general population

High Dimensions Intuition from two and three dimensions not valid for high dimension Volume of cube is one in all dimensions Volume of sphere goes to zero

1 Unit sphere Unit square 2 Dimensions

4 Dimensions

d Dimensions 1

Almost all area of the unit cube is outside the unit sphere

High dimension is fundamentally different from 2 or 3 dimensional space

High dimensional data is inherently unstable Given n random points in d dimensional space essentially all n 2 distances are equal.

Gaussian distribution Probability mass concentrated between dotted lines

Gaussian in high dimensions

Two Gaussians

Distance between two random points from same Gaussian Points on thin annulus of radius Approximate by sphere of radius Average distance between two points is (Place one pt at N. Pole other at random. Almost surely second point near the equator.)

Expected distance between pts from two Gaussians separated by δ Rotate axis so first point is perpendicular to screen All other points close to plane of screen

Expected distance between pts from two Gaussians separated by δ

Can separate points from two Gaussians if

Dimension reduction Project points onto subspace containing centers of Gaussians Reduce dimension from d to k, the number of Gaussians

Centers retain separation Average distance between points reduced by

Can separate Gaussians provided > some constant involving k and γ independent of the dimension

Ranking is important  Restaurants  Movies  Web pages Multi billion dollar industry

Page rank equals stationary probability of random walk

restart 15%

restart 15% Restart yields strongly connected graph

Suppose you wish to increase the page rank of vertex v Capture restart web farm Capture random walk small cycles

Capture restart Buy 20,000 url’s and capture restart Can be countered by small restart value Small restart increases web rank of page that captures random walk by small cycles.

restart 0.23 restart Capture random walk 0.1 restart 0.66

1 X=1.56 Y= restart 0.1 restart X=1+0.85*y Y=0.85*x/

If one loop increases Pagerank from 1 to 1.56 why not add many self loops? Maximum increase in Pagerank is 6.67

Discovery time – time to first reach a vertex by random walk from uniform start Cannot lower discovery time of any page in S below minimum already in S S

Why not replace Pagerank by discovery time? No efficient algorithm for discovery time DiscoveryTime(v) remove edges out of v calculate Pagerank(v) in modified graph

Is there a way for a spammer to raise Pagerank in a way that is not statistically detectable

Information is important When a customer makes a purchase what else is he likely to buy? Camera  Memory card  Batteries  Carrying case  Etc. Knowing what a customer is likely to buy is important information.

How can we extract information from a customer’s visit to a web site?  What web pages were visited?  What order?  How long?

Detecting trends before they become obvious Is some category of customer changing their buying habits?  Purchases, travel destination, vacations Is there some new trend in the stock market? How do we detect changes in a large database over time?

Identifying Changing Patterns in a Large Data Set How soon can one detect a change in patterns in a large volume of information? How large must a change be in order to distinguish it from random fluctuations?

Conclusions We are in an exciting time of change. Information technology is a big driver of that change. The computer science theory of the last thirty years needs to be extended to cover the next thirty years.