Creating a science base to support new directions in computer science Cornell University Ithaca New York John Hopcroft.

Slides:



Advertisements
Similar presentations
Future directions in computer science research John Hopcroft Department of Computer Science Cornell University Heidelberg Laureate Forum Sept 27, 2013.
Advertisements

Scale Free Networks.
Feature Selection as Relevant Information Encoding Naftali Tishby School of Computer Science and Engineering The Hebrew University, Jerusalem, Israel NIPS.
Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
22C:19 Discrete Math Graphs Fall 2014 Sukumar Ghosh.
Network biology Wang Jie Shanghai Institutes of Biological Sciences.
Analysis and Modeling of Social Networks Foudalis Ilias.
Jure Leskovec, CMU Kevin Lang, Anirban Dasgupta and Michael Mahoney Yahoo! Research.
VL Netzwerke, WS 2007/08 Edda Klipp 1 Max Planck Institute Molecular Genetics Humboldt University Berlin Theoretical Biophysics Networks in Metabolism.
Beyond Trilateration: On the Localizability of Wireless Ad Hoc Networks Reported by: 莫斌.
Models of Network Formation Networked Life NETS 112 Fall 2013 Prof. Michael Kearns.
SILVIO LATTANZI, D. SIVAKUMAR Affiliation Networks Presented By: Aditi Bhatnagar Under the guidance of: Augustin Chaintreau.
Jure Leskovec, CMU Kevin Lang, Anirban Dasgupta, Michael Mahoney Yahoo! Research.
Future directions in computer science research 23rd International Symposium on Algorithms and Computation John Hopcroft Cornell University ISAAC.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
Hierarchy in networks Peter Náther, Mária Markošová, Boris Rudolf Vyjde : Physica A, dec
Future directions in computer science research John Hopcroft Department of Computer Science Cornell University CINVESTAV-IPN Dec 2,2013.
PCPs and Inapproximability Introduction. My T. Thai 2 Why Approximation Algorithms  Problems that we cannot find an optimal solution.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Network Statistics Gesine Reinert. Yeast protein interactions.
DISC-Finder: A distributed algorithm for identifying galaxy clusters.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
SING* and ToNC * Scientific Foundations for Internet’s Next Generation Sirin Tekinay Program Director Theoretical Foundations Communication Research National.
Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Computer Science 1 Web as a graph Anna Karpovsky.
Peer-to-Peer and Social Networks Random Graphs. Random graphs E RDÖS -R ENYI MODEL One of several models … Presents a theory of how social webs are formed.
Random Graph Models of Social Networks Paper Authors: M.E. Newman, D.J. Watts, S.H. Strogatz Presentation presented by Jessie Riposo.
Large-scale organization of metabolic networks Jeong et al. CS 466 Saurabh Sinha.
The Erdös-Rényi models
Information Networks Power Laws and Network Models Lecture 3.
Lecture 6 - Models of Complex Networks II Dr. Anthony Bonato Ryerson University AM8002 Fall 2014.
CAS May 24, 2010 Creating a science base to support new directions in computer science John Hopcroft Cornell University Ithaca, New York.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Eric Horvitz, Michael Mahoney,
All that remains is to connect the edges in the variable-setters to the appropriate clause-checkers in the way that we require. This is done by the convey.
LANGUAGE NETWORKS THE SMALL WORLD OF HUMAN LANGUAGE Akilan Velmurugan Computer Networks – CS 790G.
Building a Science Base for the Information Age John Hopcroft Cornell University Ithaca, NY Xiamen University.
An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
Clustering of protein networks: Graph theory and terminology Scale-free architecture Modularity Robustness Reading: Barabasi and Oltvai 2004, Milo et al.
Edge-disjoint induced subgraphs with given minimum degree Raphael Yuster 2012.
Random-Graph Theory The Erdos-Renyi model. G={P,E}, PNP 1,P 2,...,P N E In mathematical terms a network is represented by a graph. A graph is a pair of.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Jon Kleinberg (Cornell), Christos.
October Large networks: a new language for science László Lovász Eötvös Loránd University, Budapest
Random Dot Product Graphs Ed Scheinerman Applied Mathematics & Statistics Johns Hopkins University IPAM Intelligent Extraction of Information from Graphs.
Lecture 10: Network models CS 765: Complex Networks Slides are modified from Networks: Theory and Application by Lada Adamic.
Approximate Inference: Decomposition Methods with Applications to Computer Vision Kyomin Jung ( KAIST ) Joint work with Pushmeet Kohli (Microsoft Research)
Geometric Problems in High Dimensions: Sketching Piotr Indyk.
The Structure of the Web. Getting to knowing the Web How big is the web and how do you measure it? How many people use the web? How many use search engines?
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Jure Leskovec Kevin J. Lang Anirban Dasgupta Michael W. Mahoney WWW’ 2008 Statistical Properties of Community Structure in Large Social and Information.
Progress in New Computer Science Research Directions John Hopcroft Cornell University Ithaca, New York CAS May 21, 2010.
Graphs, Vectors, and Matrices Daniel A. Spielman Yale University AMS Josiah Willard Gibbs Lecture January 6, 2016.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Probabilistic Robotics Introduction. SA-1 2 Introduction  Robotics is the science of perceiving and manipulating the physical world through computer-controlled.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
11/25/03 3D Model Acquisition by Tracking 2D Wireframes Presenter: Jing Han Shiau M. Brown, T. Drummond and R. Cipolla Department of Engineering University.
Topics In Social Computing (67810) Module 1 (Structure) Centrality Measures, Graph Clustering Random Walks on Graphs.
Structural Properties of Networks: Introduction
Machine Learning Feature Creation and Selection
Statistical properties of network community structure
Goodfellow: Chapter 14 Autoencoders
Department of Computer Science University of York
Lecture 15: Least Square Regression Metric Embeddings
Network Models Michael Goodrich Some slides adapted from:
Goodfellow: Chapter 14 Autoencoders
Presentation transcript:

Creating a science base to support new directions in computer science Cornell University Ithaca New York John Hopcroft

Time of change  The information age is a fundamental revolution that is changing all aspects of our lives.  Those individuals, institutions, and nations who recognize this change and position themselves for the future will benefit enormously. 21 Century computing

Computer Science is changing Early years  Programming languages  Compilers  Operating systems  Algorithms  Data bases Emphasis on making computers useful 21 Century computing

Computer Science is changing The future years  Tracking the flow of ideas in scientific literature  Tracking evolution of communities in social networks  Extracting information from unstructured data sources  Processing massive data sets and streams  Extracting signals from noise  Dealing with high dimensional data and dimension reduction 21 Century computing

Drivers of change  Merging of computing and communications  Data available in digital form  Networked devices and sensors  Computers becoming ubiquitous 21 Century computing

Implications for TCS Need to develop theory to support the new directions Update computer science education 21 Century computing

Internet search engines are changing When was Einstein born? Einstein was born at Ulm, in Wurttemberg, Germany, on March 14, List of relevant web pages 21 Century computing

Internet queries will be different  Which car should I buy?  What are the key papers in Theoretical Computer Science?  Construct an annotated bibliography on graph theory.  Where should I go to college?  How did the field of computer science develop? 21 Century computing

Which car should I buy? Search engine response: Which criteria below are important to you?  Fuel economy  Crash safety  Reliability  Performance  Etc. 21 Century computing

MakeCostReliabilityFuel economy Crash safety Links to photos/ articles Toyota Prius23,780Excellent44 mpgFairphoto article Honda Accord28,695Better26 mpgExcellentphoto article Toyota Camry29,839Average24 mpgGoodphoto article Lexus 35038,615Excellent23 mpgGoodphoto article Infiniti M3547,650Excellent19 mpgGoodphoto article

21 Century computing

2010 Toyota Camry - Auto Shows Toyota sneaks the new Camry into the Detroit Auto Show. Usually, redesigns and facelifts of cars as significant as the hot-selling Toyota Camry are accompanied by a commensurate amount of fanfare. So we were surprised when, right about the time that we were walking by the Toyota booth, a chirp of our Blackberries brought us the press release announcing that the facelifted 2010 Toyota Camry and Camry Hybrid mid-sized sedans were appearing at the 2009 NAIAS in Detroit. Toyota CamryCamry Hybrid2009 NAIAS We’d have hardly noticed if they hadn’t told us—the headlamps are slightly larger, the grilles on the gas and hybrid models go their own way, and taillamps become primarily LED. Wheels are also new, but overall, the resemblance to the Corolla is downright uncanny. Let’s hear it for brand consistency! Four-cylinder Camrys get Toyota’s new 2.5-liter four-cylinder with a boost in horsepower to 169 for LE and XLE grades, 179 for the Camry SE, all of which are available with six-speed manual or automatic transmissions. Camry V-6 and Hybrid models are relatively unchanged under the skin. Inside, changes are likewise minimal: the options list has been shaken up a bit, but the only visible change on any Camry model is the Hybrid’s new gauge cluster and softer seat fabrics. Pricing will be announced closer to the time it goes on sale this March. Toyota Camry  › OverviewOverview  › SpecificationsSpecifications  › Price with OptionsPrice with Options  › Get a Free QuoteGet a Free Quote News & Reviews  2010 Toyota Camry - Auto Shows 2010 Toyota Camry - Auto Shows Top Competitors  Chevrolet Malibu Chevrolet Malibu  Ford Fusion Ford Fusion  Honda Accord sedan Honda Accord sedan

Which are the key papers in Theoretical Computer Science? Hartmanis and Stearns, “On the computational complexity of algorithms” Blum, “A machine-independent theory of the complexity of recursive functions” Cook, “The complexity of theorem proving procedures” Karp, “Reducibility among combinatorial problems” Garey and Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness” Yao, “Theory and Applications of Trapdoor Functions” Shafi Goldwasser, Silvio Micali, Charles Rackoff, “The Knowledge Complexity of Interactive Proof Systems” Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy, “Proof Verification and the Hardness of Approximation Problems” 21 Century computing

Fed Ex package tracking 21 Century computing

NEXRAD Radar Binghamton, Base Reflectivity 0.50 Degree Elevation Range 124 NMI — Map of All US Radar Sites Map of All US Radar Sites Animate Map Storm TracksTotal PrecipitationShow SevereRegional RadarZoom Map Click:Zoom InZoom OutPan Map ( Full Zoom Out ) Full Zoom Out »AdvancedRadarTypesCLICK»»AdvancedRadarTypesCLICK»

21 Century computing

Collective Inference on Markov Models for Modeling Bird Migration 21 Century computing Space Time

21 Century computing Daniel Sheldon, M. A. Saleh Elmohamed, Dexter Kozen

Science base to support activities Track flow of ideas in scientific literature Track evolution of communities in social networks Extract information from unstructured data sources. 21 Century computing

Tracking the flow of ideas in scientific literature Yookyung Jo

21 Century computing Tracking the flow of ideas in scientific literature Yookyung Jo Page rank Web Link Graph Retrieval Query Search Text Web Page Search Rank Web Chord Usage Index Probabilistic Text File Retrieve Text Index Discourse Word Centering Anaphora

21 Century computing Original papers

21 Century computing Original papers cleaned up

21 Century computing Referenced papers

21 Century computing Referenced papers cleaned up. Three distinct categories of papers

21 Century computing

Tracking communities in social networks 21 Century computing Liaoruo Wang

“Statistical Properties of Community Structure in Large Social and Information Networks”, Jure Leskovec; Kevin Lang; Anirban Dasgupta; Michael Mahoney Studied over 70 large sparse real-world networks. Best communities are of approximate size 100 to Century computing

Our most striking finding is that in nearly every network dataset we examined, we observed tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community- like". 21 Century computing

Conductance Size of community 100

21 Century computing Giant component

21 Century computing Whisker: A component with v vertices connected by edges

Our view of a community 21 Century computing TCS Me Colleagues at Cornell Classmates Family and friends More connections outside than inside

Early work  Min cut – two equal size communities  Conductance – minimizes cross edges Future work  Consider communities with more cross edges than internal edges  Find small communities  Track communities over time  Develop appropriate definitions for communities  Understand the structure of different types of social networks 21 Century computing

“Clustering Social networks” Mishra, Schreiber, Stanton, and Tarjan Each member of community is connected to a beta fraction of community No member outside the community is connected to more than an alpha fraction of the community Some connectivity constraint 21 Century computing

In sparse graphs How do you find alpha-beta communities? What if each person in the community is connected to more members outside the community than inside? 21 Century computing

Structure of social networks Different networks may have quite different structures. An algorithm for finding alpha-beta communities in sparse graphs. How many communities of a given size are there in a social network? Results on exploring a tweeter network of approximately 100,000 nodes. 21 Century computing Liaoruo Wang, Jing He, Hongyu Liang

200 randomly chosen nodes Apply alpha- beta algorithm Resulting alpha- beta community How many alpha-beta communities? 21 Century computing

Massively overlapping communities Are there a small number of massively overlapping communities that share a common core? Are there massively overlapping communities in which one can move from one community to a totally disjoint community? 21 Century computing

Massively overlapping communities with a common core

21 Century computing Massively overlapping communities

Define the core of a set of overlapping communities to be the intersection of the communities. There are a small number of cores in the tweeter data set. 21 Century computing

Size of initial setNumber of cores Century computing

Century computing

What is the graph structure that causes certain cores to merge and others to simply vanish? What is the structure of cores as they get larger? Do they consist of concentric layers that are less dense at the outside? 21 Century computing

Sucheta Soundarajan Transmission paths for viruses, flow of ideas, or influence

Sparse vectors 21 Century computing There are a number of situations where sparse vectors are important Tracking the flow of ideas in scientific literature Biological applications Signal processing

Sparse vectors in biology 21 Century computing plants Genotype Internal code Phenotype Observables Outward manifestation

Theory to support new directions 21 Century computing  Large graphs  Spectral analysis  High dimensions and dimension reduction  Clustering  Collaborative filtering  Extracting signal from noise

Theory of Large Graphs  Large graphs with billions of vertices  Exact edges present not critical  Invariant to small changes in definition  Must be able to prove basic theorems 21 Century computing

Erdös-Renyi  n vertices  each of n 2 potential edges is present with independent probability 21 Century computing NnNn p n (1-p) N-n vertex degree binomial degree distribution number of vertices

21 Century computing

Generative models for graphs  Vertices and edges added at each unit of time  Rule to determine where to place edges  Uniform probability  Preferential attachment- gives rise to power law degree distributions 21 Century computing

Vertex degree Number of vertices Preferential attachment gives rise to the power law degree distribution common in many graphs

21 Century computing Protein interactions 2730 proteins in data base 3602 interactions between proteins Science 1999 July 30; 285: Only 899 proteins in components. Where are the 1851 missing proteins?

21 Century computing Protein interactions 2730 proteins in data base 3602 interactions between proteins Science 1999 July 30; 285:

Science base What do we mean by science base?  Example: High dimensions 21 Century computing

High dimension is fundamentally different from 2 or 3 dimensional space 21 Century computing

High dimensional data is inherently unstable Given n random points in d dimensional space, essentially all n 2 distances are equal. 21 Century computing

High Dimensions Intuition from two and three dimensions not valid for high dimension Volume of cube is one in all dimensions Volume of sphere goes to zero

21 Century computing 1 Unit sphere Unit square 2 Dimensions

21 Century computing 4 Dimensions

21 Century computing d Dimensions 1

Almost all area of the unit cube is outside the unit sphere 21 Century computing

Gaussian distribution 21 Century computing Probability mass concentrated between dotted lines

Gaussian in high dimensions 21 Century computing

Two Gaussians 21 Century computing

Distance between two random points from same Gaussian  Points on thin annulus of radius  Approximate by a sphere of radius  Average distance between two points is (Place one point at N. Pole, the other point at random. Almost surely, the second point is near the equator.) 21 Century computing

Expected distance between points from two Gaussians separated by δ

Can separate points from two Gaussians if 21 Century computing

Dimension reduction Given n points in d dimensions, a random projection to log n dimensions will preserve all pair wise distances 21 Century computing

Dimension reduction for the Gaussians Project points onto subspace containing centers of Gaussians Reduce dimension from d to k, the number of Gaussians 21 Century computing

Centers retain separation Average distance between points reduced by

Can separate Gaussians provided 21 Century computing > some constant involving k and γ independent of the dimension

Finding centers of Gaussians 21 Century computing First singular vector v 1 minimizes the perpendicular distance to points

The first singular vector goes through center of Gaussian and minimizes distance to points Gaussian 21 Century computing

Best k-dimensional space for Gaussian is any space containing the line through the center of the Gaussian. 21 Century computing

Given k Gaussians, the top k singular vectors define a k dimensional space that contains the k lines through the centers of the k Gaussian and hence contain the centers of the Gaussians. 21 Century computing

We have just seen what a science base for high dimensional data might look like. What other areas do we need to develop a science base for?

21 Century computing  Ranking is important  Restaurants, movies, books, web pages  Multi-billion dollar industry  Collaborative filtering  When a customer buys a product, what else is he likely to buy?  Dimension reduction  Extracting information from large data sources  Social networks

Time of change  The information age is a fundamental revolution that is changing all aspects of our lives.  Those individuals and nations who recognize this change and position themselves for the future will benefit enormously. 21 Century computing

Conclusions  We are in an exciting time of change.  Information technology is a big driver of that change.  Computer science theory needs to be developed to support this information age. 21 Century computing

THANK YOU