
1 Extracting hidden information from knowledge networks. Sergei Maslov, Brookhaven National Laboratory, New York, USA. Hanse Institute for Advanced Study, March 2002.

2 Outline of the talk
– What is a knowledge network, and how is it different from an ordinary graph or network?
– Knowledge networks on the internet: matching products to customers
– Knowledge networks in biology: large ensembles of interacting biomolecules
– Empirical study of correlations in the network of interacting proteins
Collaborators: Y.-C. Zhang and K. Sneppen

3 Networks in complex systems
A network is the backbone of a complex system: it answers the question "who interacts with whom?" Examples:
– Internet and WWW
– Interacting biomolecules (metabolic, physical, regulatory)
– Food webs in ecosystems
– Economics: customers and products; social: people and their choice of partners

4 Predicting tastes of customers based on their opinions on products
– Each of us has personal tastes
– These tastes are sometimes unknown even to ourselves (hidden wants)
– The information is contained in our opinions on products
– Matchmaking: customers with similar tastes can be used to predict each other's future opinions
– The internet allows this to be done on a large scale

5 Types of networks
[Figure: a plain bipartite network of readers and books, side by side with a knowledge (opinion) network in which each reader node carries a vector of tastes, each book node a vector of features, and each edge an opinion.]

6 Storing opinions
The matrix of opinions Π_IJ and the corresponding network of opinions, here for 3 books (B) and 4 readers (R); X = opinion not applicable, ? = opinion unknown:

        B1 B2 B3 R1 R2 R3 R4
  B1:   X  X  X  2  9  ?  ?
  B2:   X  X  X  ?  8  ?  8
  B3:   X  X  X  ?  ?  1  ?
  R1:   2  ?  ?  X  X  X  X
  R2:   9  8  ?  X  X  X  X
  R3:   ?  ?  1  X  X  X  X
  R4:   ?  8  ?  X  X  X  X

7 Using correlations to reconstruct customers' tastes
Similar opinions imply similar tastes. Simplest model:
– Readers → an M-dimensional vector of tastes r_I
– Books → an M-dimensional vector of features b_J
– Opinions → the scalar product Π_IJ = r_I · b_J
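Not part of the original slides: a minimal numpy sketch of this model, with hypothetical sizes, assuming Gaussian taste/feature vectors. It illustrates the premise that readers with similar hidden vectors produce correlated opinions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_r, N_b, M = 40, 30, 5                 # hypothetical numbers of readers/books

r = rng.standard_normal((N_r, M))       # hidden reader tastes r_I
b = rng.standard_normal((N_b, M))       # hidden book features b_J
Pi = r @ b.T                            # opinions Pi_IJ = r_I . b_J

# Readers with similar taste vectors produce correlated rows of opinions:
overlap = r[0] @ r[1] / (np.linalg.norm(r[0]) * np.linalg.norm(r[1]))
corr = np.corrcoef(Pi[0], Pi[1])[0, 1]
print(f"taste overlap {overlap:+.2f} -> opinion correlation {corr:+.2f}")
```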

8 Loop correlation
An unknown opinion can be predicted from a loop of L known opinions. The predictive power of a single loop scales as 1/M^((L-1)/2), so one needs many loops to completely freeze the mutual orientation of the vectors.
[Figure: a loop in the customer-book graph in which L known opinions close around one unknown opinion.]

9 Field-theory approach
If all components of the vectors are Gaussian and uncorrelated, the generating functional is det(1 + ih)^(-M/2), with h the matrix of source fields. Consequently:
– All irreducible correlations are proportional to M
– All loop correlations equal M
– Since each Π_IJ ~ √M, the sign correlation scales as M^(-(L-1)/2)
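Not on the slides: a numerical check of this scaling under the stated assumptions (Gaussian, uncorrelated M-dimensional vectors; the loop length L and sample count are ours). The correlation between the opinion closing a loop and the product of the L known opinions should track M^(-(L-1)/2) up to an O(1) factor.

```python
import numpy as np

rng = np.random.default_rng(1)
L, n_samples = 3, 50_000                # L known opinions closing one loop

for M in (2, 8, 32):
    v = rng.standard_normal((n_samples, L + 1, M))        # loop vertices
    # edge k is the scalar product of vertices k and k+1 (mod L+1)
    edges = np.einsum("skm,skm->sk", v, np.roll(v, -1, axis=1))
    unknown = edges[:, -1]               # the edge closing the loop
    known_product = edges[:, :-1].prod(axis=1)
    corr = np.corrcoef(unknown, known_product)[0, 1]
    print(f"M = {M:2d}: measured {corr:.4f}, M^-(L-1)/2 = {M ** (-(L - 1) / 2):.4f}")
```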

10 Main parameter: the density of edges
The larger the density of edges p, the easier the prediction:
– At p_1 ≈ 1/N (N = N_readers + N_books) macroscopic prediction becomes possible. Nodes are connected, but the vectors r_I, b_J are not yet fixed: the ordinary percolation threshold.
– At p_2 ≈ 2M/N > p_1 all tastes and features (r_I and b_J) can be uniquely reconstructed: the rigidity percolation threshold.

11 Spectral properties of Π
For M < N the matrix Π_IJ has N − M zero eigenvalues and M positive ones: Π = R · R^+. Using SVD one can "diagonalize" R = U · D · V^+, where the matrices V and U are orthogonal (V^+ · V = 1, U · U^+ = 1) and D is diagonal. Then Π = U · D² · U^+.
The amount of information contained in Π is NM − M(M−1)/2 << N(N−1)/2, the number of off-diagonal elements.
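Not on the slides: a quick numpy verification of this spectral claim, with hypothetical sizes, stacking all N vectors into one matrix R (the bipartite block structure of Π is ignored here for simplicity).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 12, 3                         # hypothetical sizes, M < N
R = rng.standard_normal((N, M))      # all N vectors stacked row-wise
Pi = R @ R.T                         # the full scalar-product matrix

eig = np.linalg.eigvalsh(Pi)
print("zero eigenvalues:", np.sum(np.abs(eig) < 1e-9), "(expect N - M)")
print("positive eigenvalues:", np.sum(eig > 1e-9), "(expect M)")

# SVD: R = U @ diag(D) @ Vt, hence Pi = U @ diag(D**2) @ U.T
U, D, Vt = np.linalg.svd(R, full_matrices=False)
print("Pi == U D^2 U^T:", np.allclose(Pi, U @ np.diag(D ** 2) @ U.T))
```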

12 A practical recursive algorithm for predicting unknown opinions
1. Start with Π_0, in which all unknown elements are filled with a guess (zero in our case)
2. Diagonalize and keep only the M largest eigenvalues and eigenvectors
3. In the resulting truncated matrix Π'_0, replace all known elements with their exact values and go back to step 2
A sketch in code follows below.
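Not from the slides: a minimal sketch of this recursive algorithm on a synthetic symmetric opinion matrix; all sizes and names are our assumptions. Each pass truncates to rank M and re-imposes the known entries exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, p = 60, 3, 0.5                            # hypothetical sizes, p > p_2

R = rng.standard_normal((N, M))
Pi_true = R @ R.T                               # exact opinions Pi = R R^+
known = np.triu(rng.random((N, N)) < p, 1)
known = known | known.T                         # symmetric mask of known opinions

Pi = np.where(known, Pi_true, 0.0)              # step 1: unknowns set to zero
for step in range(100):
    # step 2: keep only the M largest eigenvalues and their eigenvectors
    vals, vecs = np.linalg.eigh(Pi)             # ascending order
    Pi_trunc = (vecs[:, -M:] * vals[-M:]) @ vecs[:, -M:].T
    # step 3: restore the known elements exactly, then iterate from step 2
    Pi = np.where(known, Pi_true, Pi_trunc)
    if step % 20 == 0:
        err = np.abs(Pi_trunc - Pi_true)[~known].max()
        print(f"step {step:3d}: max error on unknown entries = {err:.3e}")
```

Above the rigidity threshold the printed error shrinks from pass to pass, matching the exponential convergence described on the next slide.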

13 Convergence of the algorithm
Above p_2 the algorithm converges exponentially to the exact values of the unknown elements. The rate of convergence scales as (p − p_2)².

14 Reality check: sources of errors
– Customers are not rational! Π_IJ = r_I · b_J + ε_IJ (idiosyncrasy)
– Opinions are delivered to the matchmaker through a narrow channel: a binary channel S_IJ = sign(Π_IJ) (liked it or not), or at best an experience rated on a scale of 1 to 5 or 1 to 10
If the number of edges K and the size N are large while M is small, these errors can be reduced.

15 How to determine M?
In real systems M is not fixed: there are always finer and finer details of tastes. Given the number of known opinions K, one should choose M_eff ≈ K/(N_readers + N_books) so that the system stays below the second transition p_2. Tastes should therefore be determined hierarchically.

16 Avoid overfitting
Divide the known votes into a training set and a test set, and select M_eff so as to avoid overfitting; see the sketch below.
[Figure: predicted vs. actual opinions for a reasonable fit and for an overfit.]
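Not on the slides: a sketch of this selection procedure, reusing the iterative completion above on noisy synthetic opinions. All sizes, the noise level, and the split fractions are our assumptions; the test error should typically be smallest near the true rank and grow again when M_eff is too large (overfitting).

```python
import numpy as np

def complete(Pi_obs, known, M, n_steps=60):
    """Rank-M iterative completion (the algorithm of slide 12)."""
    Pi = np.where(known, Pi_obs, 0.0)
    for _ in range(n_steps):
        vals, vecs = np.linalg.eigh(Pi)
        Pi_trunc = (vecs[:, -M:] * vals[-M:]) @ vecs[:, -M:].T
        Pi = np.where(known, Pi_obs, Pi_trunc)
    return Pi

rng = np.random.default_rng(4)
N, M_true = 60, 3
R = rng.standard_normal((N, M_true))
E = rng.standard_normal((N, N))
Pi_noisy = R @ R.T + 0.3 * (E + E.T) / np.sqrt(2)   # opinions + idiosyncrasy

votes = np.triu(rng.random((N, N)) < 0.5, 1)        # the known votes
holdout = rng.random((N, N)) < 0.2                  # held-out fraction
train = votes & ~holdout; train = train | train.T
test = votes & holdout;   test = test | test.T

for M_eff in (1, 2, 3, 5, 10):
    Pi_hat = complete(Pi_noisy, train, M_eff)
    rms = np.sqrt(np.mean((Pi_hat - Pi_noisy)[test] ** 2))
    print(f"M_eff = {M_eff:2d}: test RMS error = {rms:.3f}")
```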

17 Knowledge networks in biology
– Interacting biomolecules: the key-and-lock principle
– Matrix of interactions (binding energies): Π_IJ = k_I · l_J + l_I · k_J
– The matchmaker (a bioinformatics researcher) tries to guess yet-unknown interactions based on the pattern of known ones
– Many experiments measure only S_IJ = Θ(Π_IJ − Π_th)
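Not part of the slides: a small sketch of the key-and-lock matrix under hypothetical sizes and threshold. Each protein carries a key vector k_I and a lock vector l_I, the construction is symmetric by design, and only thresholded detections S_IJ are "observed".

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, Pi_th = 200, 2, 3.0            # hypothetical proteins, rank, threshold

k = rng.standard_normal((N, M))      # key vectors k_I
l = rng.standard_normal((N, M))      # lock vectors l_I
Pi = k @ l.T + l @ k.T               # binding energies, symmetric by construction

S = (Pi > Pi_th).astype(int)         # S_IJ = Theta(Pi_IJ - Pi_th)
print("symmetric:", np.allclose(Pi, Pi.T))
print("detected interactions:", np.triu(S, 1).sum())
```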

18 Real systems
Internet commerce: the EachMovie dataset of opinions on movies collected by the Compaq Systems Research Center:
– 72,916 users entered a total of 2,811,983 numeric ratings (* to *****) for 1,628 different movies: M_eff ~ 40
– The default dataset for collaborative-filtering research
Biology: the table of interactions between yeast proteins from the Ito et al. high-throughput two-hybrid experiment:
– 6,000 proteins (~3,300 have at least one interaction partner) and 4,400 known interactions
– Binary (interact or not)
– M_eff ~ 1: too small!

19 Yeast protein interaction network
Data from T. Ito et al., PNAS (2001). The full set contains 4,549 interactions among 3,278 yeast proteins.
[Figure: the network; only nuclear proteins interacting with at least one other nuclear protein are shown.]

20 Correlations in connectivities
The basic design principles of a network can be revealed by comparing the frequency of a pattern in the real network with its frequency in random ones:
– P(k_0, k_1) is the probability that nodes with connectivities k_0 and k_1 directly interact
– It should be normalized by P_r(k_0, k_1), the same quantity in a randomized network in which each node keeps the same number of neighbors (connectivity) but those neighbors are selected at random; a whole ensemble of such random networks can be generated

21 Correlation profile of the protein interaction network
Two measures of the profile: the ratio P(k_0,k_1)/P_r(k_0,k_1) and the Z-score Z(k_0,k_1) = (P(k_0,k_1) − P_r(k_0,k_1))/σ_r(k_0,k_1). A sketch of the computation follows below.
[Figure: the two correlation-profile plots for the protein interaction network.]
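Not from the slides: a sketch of computing both measures on a toy scale-free graph (a Barabási–Albert stand-in, since the slide's yeast data is not reproduced here). The randomized ensemble uses networkx's degree-preserving double edge swaps; raw edge counts stand in for P and P_r, since the common normalization cancels in the ratio and the Z-score.

```python
import numpy as np
import networkx as nx

def edge_degree_counts(G, kmax):
    """Count edges between each pair of degree classes (k0 <= k1)."""
    counts = np.zeros((kmax + 1, kmax + 1))
    deg = dict(G.degree())
    for u, v in G.edges():
        k0, k1 = sorted((deg[u], deg[v]))
        counts[k0, k1] += 1
    return counts

G = nx.barabasi_albert_graph(500, 2, seed=6)    # toy stand-in for a real network
kmax = max(dict(G.degree()).values())
P = edge_degree_counts(G, kmax)

# Ensemble of degree-preserving randomizations: repeated double edge swaps
# keep every node's connectivity while reshuffling who connects to whom.
ensemble = []
for seed in range(20):
    Gr = G.copy()
    nx.double_edge_swap(Gr, nswap=10 * G.number_of_edges(),
                        max_tries=10**6, seed=seed)
    ensemble.append(edge_degree_counts(Gr, kmax))
Pr = np.mean(ensemble, axis=0)
sigma_r = np.std(ensemble, axis=0)

# Ratio and Z-score as on the slide, evaluated in the (k_min, k_max) bin
kmin = min(dict(G.degree()).values())
print("P/P_r  :", P[kmin, kmax] / Pr[kmin, kmax])
print("Z-score:", (P[kmin, kmax] - Pr[kmin, kmax]) / sigma_r[kmin, kmax])
```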

22 Correlation profile of the internet
[Figure: the same correlation profile, computed for the internet.]

23 What might it mean?
– Hubs avoid each other (as in the internet; R. Pastor-Satorras et al., Phys. Rev. Lett. (2001))
– Hubs prefer to connect to terminal ends (weakly connected nodes)
– Specificity: the network is organized into modules clustered around individual hubs
– Stability: the number of second-nearest neighbors is suppressed, so it is harder to propagate deleterious perturbations

24 Conclusion
Studies of networks are similar to paleontology: learning about an organism from its backbone. You can learn a lot about a complex system from its network, but not everything…

25 THE END

26 Entropy of unknown opinions
[Figure: the entropy of unknown opinions vs. the density of known opinions p, with the thresholds p_1 and p_2 marked.]

27 How to determine p_2?
There are K known elements of an N×N matrix Π_IJ = r_I · b_J (N = N_r + N_b). The vectors contribute approximately N × M degrees of freedom (minus M(M−1)/2 gauge parameters). For K > MN all missing elements can be reconstructed, so p_2 = MN/(N(N−1)/2) ≈ 2M/N.
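Not on the slides: the same degrees-of-freedom count in numbers, for hypothetical sizes of our choosing.

```python
# Hypothetical sizes; the count mirrors the slide's argument.
N, M = 1000, 10

dof = N * M - M * (M - 1) // 2        # parameters in the vectors, minus gauge
pairs = N * (N - 1) // 2              # possible off-diagonal opinions
p2 = dof / pairs                      # fraction of opinions that must be known
print(f"p2 = {p2:.4f},  2M/N = {2 * M / N:.4f}")
```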

28 What is a knowledge network?
– An undirected graph with N vertices and K edges
– Each vertex has a (hidden) M-dimensional vector of tastes/features
– Each edge carries the scalar product (opinion) of the vectors on the vertices it connects
– A centralized matchmaker tries to guess the vectors (tastes) from their scalar products (opinions) and to predict unknown opinions

29 Versions of knowledge networks
– Regular graph: every link is allowed. Example: recommending people to other people according to their areas of interest
– Bipartite graph. Example: customers to products
– Non-reciprocal opinions: each vertex has two vectors d_I and q_I, so that Π_IJ = d_I · q_J. Example: a real matchmaker recommending men to women

