
1
Universal Scaling of Semantic Information Revealed from IB Word Clusters, or: Human Language as Optimal Biological Adaptation
Naftali Tishby
School of Computer Science & Engineering & Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem, Israel
Workshop on Machine Learning in Natural Language Processing, CRI, Haifa University, December 2006

2
Outline:
- Language – a window into our cognitive processing
  - What can we learn from word statistics?
  - How can we quantify it?
  - Is there a "correct level" of description?
- Information Bottleneck (IB) and the representation of relevance
  - Finding approximate sufficient statistics
- Words, documents and meaning…
- Trading complexity and accuracy
  - Scaling of semantic information
- Possible models: small-world properties

3
What are words?
- acquired
- persistent neural activity associated with perception and cognitive functions
- appear in every language at a regular power-law, sub-linear rate
[Plot: log number of different words vs. log number of observed words; the data fit the line y = 0.64x]
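The sub-linear growth of vocabulary shown on this slide (slope 0.64 on log-log axes) can be estimated from any token stream. A minimal sketch, using a hypothetical Zipf-distributed toy corpus rather than real text (the word names `w1`, `w2`, … and all parameters are illustrative assumptions, not data from the talk):

```python
import math
import random

def vocab_growth(tokens):
    """Return (n, vocabulary size after n tokens) at each prefix length."""
    seen, curve = set(), []
    for i, w in enumerate(tokens, 1):
        seen.add(w)
        curve.append((i, len(seen)))
    return curve

# Hypothetical toy corpus: Zipf-weighted draws over 5,000 word types.
random.seed(0)
ranks = range(1, 5001)
weights = [1.0 / r for r in ranks]
tokens = random.choices([f"w{r}" for r in ranks], weights=weights, k=20000)

curve = vocab_growth(tokens)
n1, v1 = curve[999]    # after 1,000 observed tokens
n2, v2 = curve[-1]     # after 20,000 observed tokens
# Slope on log-log axes between the two points (the talk's data gave 0.64).
beta = math.log(v2 / v1) / math.log(n2 / n1)
print(f"estimated sub-linear exponent ~ {beta:.2f}")
```

Since repeated words add tokens but not types, the estimated exponent is necessarily below 1, matching the slide's sub-linear claim.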

4

5

6

7
Rank – Frequency of Words: words exhibit "scale-free" statistics – Zipf's law.
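As a concrete illustration of these rank–frequency statistics, here is a minimal sketch of how the Zipf plot is built from a token stream (the tiny corpus is hypothetical; a serious fit needs millions of tokens):

```python
from collections import Counter

def rank_frequency(tokens):
    """Sorted (rank, frequency) pairs for a token list."""
    counts = Counter(tokens)
    return [(r, f) for r, (_, f) in enumerate(counts.most_common(), start=1)]

# Hypothetical toy corpus, for illustration only.
tokens = ("the cat sat on the mat and the dog sat on the log "
          "and the cat saw the dog").split()
pairs = rank_frequency(tokens)

# Under Zipf's law, log(frequency) vs. log(rank) is roughly linear with
# slope ~ -1, i.e. frequency * rank is roughly constant.
for r, f in pairs[:4]:
    print(r, f)
```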

8
How are words/languages generated? Basic observations:
- serve for communication and representation
- adapt to variable world statistics
- collective (social) entity
- acquired continuously (individually and collectively)
Competition between communication efficiency and adaptability/learnability.

9
Complexity – Accuracy Tradeoff
[Diagram: possible representations in the complexity–accuracy plane]

10
Complexity – Accuracy Tradeoff
[Diagram: possible models/representations in the complexity–accuracy plane; limited data bounds accuracy, bounded computation bounds complexity]

11
Can we quantify it…? When there is a (relevant) prediction or distortion measure:
- Accuracy – good predictions (low distortion/error)
- Complexity – length of the minimal description (optimal codes)
A general tradeoff between distortion and compression: Information Theory.
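The general distortion–compression tradeoff the slide invokes is rate–distortion theory; the best achievable compression rate at allowed distortion level $D$ is (the standard definition, not a formula shown on the slide):

```latex
R(D) \;=\; \min_{p(t\mid x)\;:\;\mathbb{E}[d(x,t)] \le D} \; I(X;T)
```

Here $T$ is the compressed representation of $X$, and $d(x,t)$ is the given distortion measure; the IB principle introduced later replaces the fixed $d$ with a distortion that emerges from relevance to $Y$.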

12
What can we learn from word co-occurrence...?

13
Representation and Mutual Information: we need to index the maximum number of non-overlapping green blobs inside the blue blob (mutual information!)
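Mutual information, which the slide introduces as the log-count of distinguishable representations, can be computed directly from a joint distribution. A minimal sketch with a hypothetical word/topic joint (the words, topics, and probabilities are invented for illustration):

```python
import math

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint distribution given as a dict {(x, y): p}."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Hypothetical word/topic joint distribution.
joint = {("ball", "sport"): 0.4, ("ball", "politics"): 0.1,
         ("vote", "sport"): 0.1, ("vote", "politics"): 0.4}
print(round(mutual_information(joint), 3))
```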

14
IB: an Information-Theoretic Principle for Extracting Relevant Structure. The minimal representation of X that keeps as much information about another variable, Y, as possible. Generalizes the classical notion of "sufficient statistics".

15
The Self-Consistent Equations
[The slide's equations – the marginal, the Markov condition, and Bayes' rule – are images in the original]
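The three equations on this slide survive only as image placeholders in the scraped original. As a reconstruction, these are the standard IB self-consistent equations from the IB literature, which the slide's labels "Marginal", "Markov condition", and "Bayes' rule" refer to ($Z$ is a normalizer):

```latex
% Encoder (the self-consistent fixed point):
p(t \mid x) \;=\; \frac{p(t)}{Z(x,\beta)}\,
  \exp\!\Big(-\beta\, D_{\mathrm{KL}}\!\big[\,p(y\mid x)\,\big\|\,p(y\mid t)\,\big]\Big)

% Marginal:
p(t) \;=\; \sum_x p(x)\, p(t \mid x)

% Decoder, via the Markov condition T - X - Y and Bayes' rule:
p(y \mid t) \;=\; \frac{1}{p(t)} \sum_x p(y \mid x)\, p(t \mid x)\, p(x)
```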

16
The emergent effective distortion measure [an equation image in the original]: it is regular if p(y|x) is absolutely continuous w.r.t. p(y|t), and small if t predicts y as well as x does.

17
The Information Bottleneck Algorithm
[The slide's "free energy" functional is an equation image in the original]
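The "free energy" named on this slide is an equation image in the original. In the IB literature it is the variational functional the algorithm minimizes over encoders, with β trading compression against relevance (a reconstruction, not the slide's own rendering):

```latex
\mathcal{F}\big[p(t \mid x)\big] \;=\; I(X;T) \;-\; \beta\, I(T;Y)
```

Small β favors maximal compression of X; large β favors preserving information about Y.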

18
Generalized BA Algorithm: iterates the self-consistent equations with the emergent effective distortion measure [the update equations are images in the original].
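The generalized Blahut–Arimoto iteration alternates the three self-consistent updates until the encoder converges. A minimal, self-contained sketch (the function name `ib_iterate`, the toy word/topic joint, and all parameter values are my illustrative assumptions, not code from the talk):

```python
import math
import random

def ib_iterate(p_xy, n_t, beta, iters=200, seed=0):
    """Generalized Blahut-Arimoto iterations for the Information Bottleneck.
    p_xy: joint distribution as a nested list p_xy[x][y] (all entries > 0).
    Returns the soft encoder p(t|x) as a list of per-x rows."""
    rng = random.Random(seed)
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y_x = [[p_xy[x][y] / p_x[x] for y in range(n_y)] for x in range(n_x)]
    # Random soft initialization of the encoder p(t|x).
    q = []
    for _ in range(n_x):
        row = [rng.random() for _ in range(n_t)]
        z = sum(row)
        q.append([v / z for v in row])
    for _ in range(iters):
        # p(t) = sum_x p(x) p(t|x)
        p_t = [sum(p_x[x] * q[x][t] for x in range(n_x)) for t in range(n_t)]
        # p(y|t) = (1/p(t)) sum_x p(t|x) p(x,y)   (Markov chain T - X - Y)
        p_y_t = [[sum(q[x][t] * p_xy[x][y] for x in range(n_x)) / p_t[t]
                  for y in range(n_y)] for t in range(n_t)]
        # p(t|x) proportional to p(t) exp(-beta * KL[p(y|x) || p(y|t)])
        for x in range(n_x):
            logits = []
            for t in range(n_t):
                kl = sum(p_y_x[x][y] * math.log(p_y_x[x][y] / p_y_t[t][y])
                         for y in range(n_y) if p_y_x[x][y] > 0)
                logits.append(math.log(p_t[t]) - beta * kl)
            m = max(logits)
            w = [math.exp(v - m) for v in logits]
            z = sum(w)
            q[x] = [v / z for v in w]
    return q

# Hypothetical 4-word, 2-topic joint: words 0,1 lean topic 0; words 2,3 lean topic 1.
joint = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
enc = ib_iterate(joint, n_t=2, beta=5.0)
for row in enc:
    print([round(v, 2) for v in row])
```

With a moderate β the encoder should assign the two topic-0-leaning words to one cluster and the two topic-1-leaning words to the other; deterministic annealing, as used in the slides, would additionally sweep β upward from a small value.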

19
Information Curves: can be calculated analytically for Markov chains, Gaussian processes, etc., and numerically in general.
[Plot: information curves I(T;Y) vs. I(T;X) for representations of increasing cardinality]
The limit is always the convex envelope of increasing complexity.

20
Words and topics again...

21
Simple Example

22
A New Compact Representation: the document clusters preserve the relevant (mutual) information between the documents and words.

23
Analyzing Co-Occurrence Tables
[Figure: a topics × words counts matrix]

24
[Figure: the exact same counts matrix after permutation of its rows (topics) and columns (words)]

25
Word Clusters and Topic Clusters: the word clusters provide a compact representation that preserves the information about the topics.

26
Quantified by Mutual Information: the distinctions inside each cluster are less relevant for predicting the class.
[Figure: irrelevant distinctions among the words within a cluster]

27
Symmetric IB through Deterministic Annealing: P(T_C, T_W)
Newsgroups: alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian, talk.politics.*, comp.*, misc.forsale, sci.crypt, sci.electronics
Word clusters: {car, turkish, game, team, jesus, gun, hockey, …}, {x, file, image, encryption, window, dos, mac, …}

28
Symmetric IB through Deterministic Annealing: P(T_C, T_W)
Newsgroups: comp.graphics, comp.os.ms-windows.misc, comp.windows.x, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, sci.crypt, sci.electronics
Word clusters: {windows, image, window, jpeg, graphics, …}, {encryption, db, ide, escrow, monitor, …}


30
Symmetric IB through Deterministic Annealing: P(T_C, T_W)
Newsgroups: alt.atheism, rec.sport.baseball, rec.sport.hockey, soc.religion.christian, talk.politics.mideast, talk.religion.misc, rec.autos, rec.motorcycles, sci.med, sci.space, talk.politics.guns, talk.politics.misc
Word clusters: {armenian, turkish, jesus, hockey, israeli, armenians, …}, {car, q, gun, bike, fbi, health, …}



33
Symmetric IB through Deterministic Annealing: P(T_C, T_W)
Word cluster: {atheists, christianity, jesus, bible, sin, faith, …}
Newsgroup cluster: {alt.atheism, soc.religion.christian, talk.religion.misc}

34
We observe Semantic Scaling

35

36
Scaling exponents by language:
- Simplified Chinese: 2.09
- Traditional Chinese: 1.73
- Dutch: 2.3
- French: 2.22
- Hebrew: 1.63
- Italian: 2.35
- Japanese: 1.42
- Portuguese: 2.9
- Spanish: 1.89

37
[Plot: log(1 − I(T;Y)/I(X;Y)) vs. log(1 − I(T;X)/H(X)) for Simplified Chinese, Traditional Chinese, Dutch, French, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, and English corpora (20NG Jose, UTF, Reuters, 20NG Noam)]

38
[Plot: log(1 − I(T;Y)/I(X;Y)) vs. log(1 − I(T;X)/H(X)) for a random selection of 200 words]

39
Can we understand it? Any subset of the language has the same exponent!

40
But what does it tell us about language? "Efficiency of the words": the log-ratio of added word entropy that is transferred to meaningful information. Language appears to have a constant word efficiency of ~2!
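Reading the axes of the preceding scaling plots, one consistent way to write this "word efficiency" exponent is the following log-ratio (my reconstruction from the plot axes, not a formula displayed in the talk):

```latex
e \;=\; \frac{\log\!\big(1 - I(T;Y)/I(X;Y)\big)}{\log\!\big(1 - I(T;X)/H(X)\big)} \;\approx\; 2
```

The numerator measures the fraction of relevant (meaningful) information lost by clustering; the denominator measures the fraction of word entropy compressed away; a constant ratio across languages and subsets is the claimed universal scaling.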

41
Possible Explanations?
- Power laws are too common to mean anything… Zipf's law and similar; "never trust linear log-log plots…"
- It's a property of my analysis, not of language: how do I know it's not all in the way we cluster the words?
- Words are generated at a constant level of ambiguity: words are generated at a constant rate, depending only on the concept's (occurred) ambiguity in usage, irrespective of vocabulary size or domain.
- Small-world (scale-free) properties of word acquisition…

42
Many Thanks to…
- Bill Bialek
- Fernando Pereira
- Noam Slonim
- Dmitry Davidov
- Amir Navot
- Josemine Magdalen
- Banter Co. (z"l)
