
1 Social Network Inspired Models of NLP and Language Evolution Monojit Choudhury (Microsoft Research India) Animesh Mukherjee (IIT Kharagpur) Niloy Ganguly (IIT Kharagpur)

2 What is a Social Network? Nodes: social entities (people, organizations, etc.) Edges: interaction/relationship between entities (friendship, collaboration, sex) Courtesy: http://blogs.clickz.com

3 Social Network Inspired Computing Society and the nature of human interaction form a complex system. Complex networks: a generic tool to model complex systems.  There is a growing body of work on Complex Network Theory (CNT)  Applied to a variety of fields – Social, Biological, Physical & Cognitive sciences, Engineering & Technology Language is a complex system

4 Objective of this Tutorial To show that SNIC (Soc. Net. Inspired Comp.) is an emerging and promising technique Apply it to model Natural Languages  NLP, Quantitative Linguistics, Language Evolution, Historical Linguistics, Language acquisition Familiarize with tools and techniques in SNIC Compare it with other standard approaches to NLP

5 Outline of the Tutorial Part I: Background  Introduction [25 min]  Network Analysis Techniques [25 min]  Network Synthesis Techniques [25 min] Break [3:20pm – 3:40pm] Part II: Case Studies  Self-organization of Sound Systems [20 min]  Modeling the Lexicon [20 min]  Unsupervised Labeling (Syntax & Semantics) [20 min] Conclusion and Discussions [20 min]

6 Complex System Non-trivial properties and patterns emerging from the interaction of a large number of simple entities Self-organization: The process through which these patterns evolve without any external intervention or central control Emergent Property or Emergent Behavior: The pattern that emerges due to self-organization

7 The best example from nature A termite "cathedral" mound produced by a termite colony

8 Emergence of a networked life Atom → Molecule → Cell → Tissue → Organs → Organisms → Communities

9 Language – a complex system Language: a medium for communication through an arbitrary set of symbols Constantly evolving An outcome of self-organization at many levels  Neurons  Speakers and listeners  Phonemes, morphemes, words … The 80-20 rule holds at every level of structure

10 Three Views of a System Macroscopy: may not give a complete picture or explanation of what goes on. Microscopy: may be too difficult to analyze, or too difficult to simulate the macroscopic behavior from. Mesoscopy: a useful trade-off between the two.

11 Language as a physical system Microscopic: a collection of utterances by individual speakers Mesoscopic: an interaction between phonemes, syllables, words, phrases Macroscopic: A set of grammar rules with a lexicon

12 Syntactic Network of Words [Figure: a network over words such as light, color, red, blue, blood, sky, heavy and weight, with edge weights of 100, 20, 1]

13 Complex Network Theory Handy toolbox for modeling mesoscopy Marriage of Graph theory and Statistics Complex because:  Non-trivial topology  Difficult to specify completely  Usually large (in terms of nodes and edges) Provides insight into the nature and evolution of the system being modeled

14 Internet

15 Genetic interaction network

16 9-11 Terrorist Network Social Network Analysis is a mathematical methodology for connecting the dots -- using science to fight terrorism. Connecting multiple pairs of dots soon reveals an emergent network of organization.

17 CNT Examples: Road and Airlines Network

18 What Questions can be asked Do these networks display some symmetry? Are these networks the creation of intelligent agents, or have they emerged? How have these networks emerged? What are the underlying simple rules leading to their complex formation?

19 Bi-directional Approach Analysis of the real-world networks  Global topological properties  Community structure  Node-level properties Synthesis of the network by means of some simple rules  Preferential attachment models  Small-world models  …

20 Application of CNT in Linguistics - I Quantitative linguistics  Invariance and typology (Zipf’s law, syntactic dependencies) Natural Language Processing  Unsupervised methods for text labeling (POS tagging, NER, WSD, etc.)  Textual similarity (automatic evaluation, document clustering)  Evolutionary Models (NER, multi-document summarization)

21 Application of CNT in Linguistics - II Language Evolution  How did sound systems evolve?  Development of syntax Language Change  Innovation diffusion over social networks  Language as an evolving network Language Acquisition  Phonological acquisition  Evolution of the mental lexicon of the child

22 Linguistic Networks
Name | Nodes | Edges | Why?
PhoNet | Phonemes | Co-occurrence likelihood in languages | Evolution of sound systems
WordNet | Words | Ontological relation | Host of NLP applications
Syntactic Network | Words | Similarity between syntactic contexts | POS tagging
Semantic Network | Words, names | Semantic relation | IR, parsing, NER, WSD
Mental Lexicon | Words | Phonetic similarity and semantic relation | Cognitive modeling, spell checking
Tree-banks | Words | Syntactic dependency links | Evolution of syntax
Word Co-occurrence | Words | Co-occurrence | IR, WSD, LSA, …

23 Summarizing SNIC and CNT are emerging techniques for modeling complex systems at mesoscopic level Applied to Physics, Biology, Sociology, Economics, Logistics … Language - an ideal application domain for SNIC SNIC models in NLP, Quantitative linguistics, language change, evolution and acquisition

24 Topological Characterization of Networks

25 Types Of Networks and Representation Unipartite: binary/weighted, undirected/directed. Bipartite: binary/weighted, undirected/directed. Representation: 1. Adjacency matrix 2. Adjacency list. Example (triangle graph over a, b, c): adjacency list a → {b,c}, b → {a,c}, c → {a,b}; adjacency matrix rows (a, b, c) = [0 1 1], [1 0 1], [1 1 0]. A sketch of both representations follows.
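
A minimal sketch (plain Python, not from the original slides; the 3-node triangle graph is a hypothetical example) showing the two representations side by side:

```python
# Triangle graph over nodes a, b, c with edges a-b, a-c, b-c.
nodes = ["a", "b", "c"]
index = {v: i for i, v in enumerate(nodes)}

# Adjacency matrix: entry [i][j] = 1 iff there is an edge between i and j.
adj_matrix = [[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]]

# Adjacency list: each node maps to the set of its neighbors.
adj_list = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}

# The two views encode the same graph.
for u in nodes:
    for v in nodes:
        assert (v in adj_list[u]) == bool(adj_matrix[index[u]][index[v]])
```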

26 Properties of Adjacency Matrix A = {a_ij}, where i and j are nodes and a_ij = 1 if there is an edge between i and j. A² = A·A; its entries give the number of paths of length 2 between any two nodes: (A²)_ij = Σ_k a_ik a_kj. In general, Aⁿ gives the number of paths of length n. Trace(A) = Σ_i a_ii. Exercise: how is the trace of A³ related to the number of triangles in the n/w? (See the sketch below.)
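
A short numpy-based sketch (numpy assumed available; not from the slides) verifying the path-counting property and answering the exercise: each triangle contributes 6 closed length-3 paths (3 starting vertices × 2 directions), so #triangles = trace(A³)/6.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

A2 = A @ A    # (A^2)[i, j] = number of paths of length 2 from i to j
A3 = A2 @ A   # (A^3)[i, j] = number of paths of length 3 from i to j

# Each triangle is counted 6 times on the diagonal of A^3:
# 3 choices of starting vertex times 2 directions of traversal.
num_triangles = np.trace(A3) // 6
print(num_triangles)  # -> 1 for the triangle graph
```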

27 Characterization of Complex N/ws They have a non-trivial topological structure. Properties:  Heavy tail in the degree distribution (non-negligible probability mass towards the tail; more than in the case of an exponential distribution)  High clustering coefficient  Centrality properties  Social roles & equivalence  Assortativity  Community structure  Small average path length (vs. random graphs)  Preferential attachment  Small-world properties

28 Degree Distribution (DD) Let p_k be the fraction of vertices in the network that have degree k. The k versus p_k plot is defined as the degree distribution of the network. For most real-world networks these distributions are right-skewed, with a long right tail of values far above the mean – p_k varies as k^(−α). Due to noisy and insufficient data the definition is sometimes slightly modified:  the cumulative degree distribution is plotted instead – the probability that the degree of a node is greater than or equal to k. (A sketch for computing both follows.)
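
A sketch (networkx assumed; the Barabási-Albert test graph is just an illustrative choice) computing both the raw and the cumulative degree distribution:

```python
import networkx as nx
from collections import Counter

G = nx.barabasi_albert_graph(10000, 3)          # any graph would do here
N = G.number_of_nodes()
counts = Counter(d for _, d in G.degree())
p_k = {k: c / N for k, c in sorted(counts.items())}

# Cumulative distribution P(K >= k): more robust to noise than p_k itself.
P_geq, running = {}, 0.0
for k in sorted(p_k, reverse=True):
    running += p_k[k]
    P_geq[k] = running
```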

29 A Few Examples Power law: p_k ~ k^(−α)

30 Friend of Friends Consider the following scenario  Sourish and Ravi are friends  Sourish and Shaunak are friends  Are Shaunak and Ravi friends?  If so, then … This property is known as transitivity

31 Measuring Transitivity: Clustering Coefficient The clustering coefficient for a vertex v in a network is defined as the ratio of the total number of connections among the neighbors of v to the total number of possible connections between the neighbors. A high clustering coefficient means my friends know each other with high probability – a typical property of social networks.

32 Mathematically… The clustering coefficient of a vertex i with n neighbors is C_i = (# of links between the n neighbors) / (n(n−1)/2). The clustering coefficient of the whole network is the average C = (1/N) Σ_i C_i. Alternatively, C = 3 × (# triangles in the n/w) / (# triples in the n/w). (A sketch follows.)
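
A sketch implementing C_i directly from the formula above and checking it against networkx's built-in (networkx assumed; the karate-club graph is just a convenient test case):

```python
import networkx as nx

def clustering(G, v):
    nbrs = list(G.neighbors(v))
    n = len(nbrs)
    if n < 2:
        return 0.0                       # convention: C_i = 0 for degree < 2
    links = sum(1 for i in range(n) for j in range(i + 1, n)
                if G.has_edge(nbrs[i], nbrs[j]))
    return links / (n * (n - 1) / 2)     # realized / possible neighbor links

G = nx.karate_club_graph()
C = sum(clustering(G, v) for v in G) / G.number_of_nodes()
assert abs(C - nx.average_clustering(G)) < 1e-12
```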

33 Centrality Centrality measures are commonly described as indices of the 4 Ps – prestige, prominence, importance, and power  Degree – count of immediate neighbors  Betweenness – nodes that form a bridge between two regions of the n/w: C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st, where σ_st is the total number of shortest paths between s and t, and σ_st(v) is the number of those shortest paths that pass through v

34 Eigenvector centrality – Bonacich (1972) It is not just how many people know me that counts towards my popularity (or power), but how many people know the people who know me – this is recursive! In the context of HIV transmission: a person x with one sex partner is less prone to the disease than a person y with multiple partners  But imagine what happens if the partner of x has multiple partners  That is the basic idea of eigenvector centrality

35 Definition Eigenvector centrality is defined as the principal eigenvector of the adjacency matrix. An eigenvector of a symmetric matrix A = {a_ij} is any vector e such that Ae = λe, i.e., λe_i = Σ_j a_ij e_j, where λ is a constant and e_i is the centrality of node i. What does it imply – the centrality of a node is proportional to the centrality of the nodes it is connected to (recursively)… Practical example: Google PageRank. (A power-iteration sketch follows.)
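
A power-iteration sketch (numpy assumed; the 4-node example graph is hypothetical): repeatedly applying A and renormalizing drives e towards the principal eigenvector, which is exactly the recursion "my centrality is the sum of my neighbors' centralities".

```python
import numpy as np

def eigenvector_centrality(A, iters=1000, tol=1e-9):
    e = np.ones(A.shape[0])
    e /= np.linalg.norm(e)
    for _ in range(iters):
        e_next = A @ e                       # sum of neighbors' centralities
        e_next /= np.linalg.norm(e_next)     # renormalize at every step
        if np.linalg.norm(e_next - e) < tol:
            break
        e = e_next
    return e_next

# Triangle 0-1-2 plus a pendant node 3 attached to node 0.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(eigenvector_centrality(A))  # node 0 gets the largest score
```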

36 Node Equivalence Social Roles – Nodes (actors) in a social n/w who have similar patterns of relations (ties) with other nodes. Three Different ways to find equivalence classes:  Structural Equivalence  Automorphic Equivalence  Regular Equivalence

37 Structural Equivalence Two nodes are said to be exactly structurally equivalent if they have the same relationships to all other nodes. Computation: let A be the adjacency matrix. Compute the Euclidean distance / Pearson correlation between the pair of rows/columns representing the neighbor profiles of two nodes (say i and j). This value shows how structurally similar i and j are.

38 Automorphic Equivalence The idea of automorphic equivalence is that sets of actors can be equivalent by being embedded in local structures that have the same patterns of ties – "parallel" structures. [Figure: a top boss, the managers of 3 different stores, and their workers; swapping managers B and D together with all their neighbors leaves the distances among all the actors in the graph exactly identical] Path vectors of i: how many nodes are at distance 1, 2, … from node i. Amount of equivalence: distance between path vectors.

39 Regular Equivalence Two nodes are said to be regularly equivalent if they have the same profile of ties with members of other sets of actors that are also regularly equivalent. [Figure: three classes – Class I: no tie with Class I, 1 tie with Class II; Class II: 1 tie with Class I, 1/2 tie(s) with Class III; Class III: 1 tie with Class II, no tie with Class III]

40 Assortativity (homophily) Rich goes with the rich (selective linking)  A famous actor (e.g., Shah Rukh Khan) would prefer to pair up with some other famous actor (e.g., Rani Mukherjee) in a movie rather than a newcomer in the film industry. [Figure: an assortative scale-free network vs. a disassortative scale-free network]

41 Measures of Assortativity  ANND (average nearest neighbor degree): find the average degree of the neighbors of each node i with degree k, then find the Pearson correlation (r) between the degree of i and the average degree of its neighbors. For further reference see the supplementary material. (A sketch follows.)
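
A sketch of the node-level measure described above (networkx and numpy assumed; the BA graph is only an example):

```python
import networkx as nx
import numpy as np

G = nx.barabasi_albert_graph(5000, 3)
deg = dict(G.degree())
k = np.array([deg[v] for v in G])                                       # own degree
knn = np.array([np.mean([deg[u] for u in G.neighbors(v)]) for v in G])  # ANND

r = np.corrcoef(k, knn)[0, 1]   # r > 0: assortative; r < 0: disassortative
print(r)
```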

42 Community structure  Community structure: a group of vertices that have a high density of edges within them and a low density of edges between groups  Examples: friendship n/w of children; citation n/ws: research interest; World Wide Web: subject matter of pages; metabolic networks: functional units; linguistic n/ws: similar linguistic categories

43 Some Examples Community Structure in Political Books Community structure in a Social n/w of Students (American High School)

44 Community Identification Algorithms Hierarchical Girvan-Newman Radicchi et al. Chinese Whispers Spectral bisection See (Newman 2004) for a comprehensive survey (you will find the ref. in the supplementary material)

45 Girvan-Newman Algorithm Bisection method:  1. Calculate the betweenness for all edges in the network.  2. Remove the edge with the highest betweenness.  3. Recalculate betweenness for all edges affected by the removal.  4. Repeat from step 2 until no edges remain. (A sketch of one bisection follows.)
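
A sketch of a single bisection pass (networkx assumed; networkx also ships its own girvan_newman generator). On Zachary's karate-club graph this recovers the classic two-faction split.

```python
import networkx as nx

def girvan_newman_split(G):
    H = G.copy()
    while nx.number_connected_components(H) == 1:
        # Steps 1 & 3: (re)compute betweenness for all remaining edges.
        eb = nx.edge_betweenness_centrality(H)
        # Step 2: remove the edge with the highest betweenness.
        H.remove_edge(*max(eb, key=eb.get))
    return list(nx.connected_components(H))

print(girvan_newman_split(nx.karate_club_graph()))
```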

46 Evolution of Networks Processes on Networks

47 Random Graphs & Small Average Path Length Q: What do we mean by a "random graph"? A: The Erdős-Rényi random graph model: for every pair of nodes, draw an edge between them with equal probability p. The degrees follow a Poisson distribution, P(k) ~ e^(−z) z^k / k!. Degrees of separation in a random graph: with N nodes and z neighbors per node on average, z^D = N, so D = log N / log z. (A numerical check follows.)
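
A quick numerical check of D = log N / log z (networkx assumed; N and z are arbitrary illustrative values):

```python
import math
import networkx as nx

N, z = 2000, 10
G = nx.gnp_random_graph(N, z / (N - 1), seed=0)             # Erdos-Renyi G(N, p)
G = G.subgraph(max(nx.connected_components(G), key=len))    # giant component

print(nx.average_shortest_path_length(G))   # measured D, roughly 3.3
print(math.log(N) / math.log(z))            # predicted log N / log z = 3.3
```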

48 Degree Distributions

49 Degree distributions for various networks: (a) World-Wide Web (b) coauthorship networks: computer science, high energy physics, condensed matter physics, astrophysics (c) power grid of the western United States and Canada (d) social network of 43 Mormons in Utah

50 How do Power-law DDs arise? Barabási-Albert Model of Preferential Attachment (rich gets richer): (1) GROWTH: starting with a small number of nodes (m₀), at every timestep we add a new node with m (≤ m₀) edges (connected to the nodes already present in the system). (2) PREFERENTIAL ATTACHMENT: the probability Π that a new node will be connected to node i depends on the connectivity k_i of that node: Π(k_i) = k_i / Σ_j k_j. A.-L. Barabási, R. Albert, Science 286, 509 (1999). (A sketch follows.)
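
A minimal sketch of BA growth (plain Python; parameter values are illustrative). The repeated-endpoints pool implements linear preferential attachment: a node of degree k appears k times, so it is drawn with probability k_i / Σ_j k_j.

```python
import random

def ba_graph(n, m):
    edges = []
    pool = list(range(m))            # seed the pool with the m initial nodes
    for new in range(m, n):
        targets = set()
        while len(targets) < m:      # m distinct targets per new node
            targets.add(random.choice(pool))
        for t in targets:
            edges.append((new, t))
            pool += [new, t]         # both endpoints gain one degree
    return edges

edges = ba_graph(10000, 3)           # degree distribution ~ k^-3
```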

51 Mean Field Theory In the mean-field treatment of the BA model, a node's degree grows as ∂k_i/∂t = k_i/(2t), so k_i(t) = m (t/t_i)^(1/2), which yields P(k) ~ 2m² k^(−3) – a power law with exponent 3.

52 The World is Small! "Registration fees for IJCNLP 2008 are being waived for all participants – get it collected from the registration counter." How long do you think the above information will take to spread among yourselves? Experiments say it will spread very fast – within 6 hops from the initiator it would reach all. This is Milgram's famous six degrees of separation.

53 The Small World Effect Even in very large social networks, the average distance between nodes is usually quite short. Milgram’s small world experiment: Target individual in Boston Initial senders in Omaha, Nebraska Each sender was asked to forward a packet to a friend who was closer to the target Friends asked to do the same Result: Average of ‘six degrees’ of separation. S. Milgram, The small world problem, Psych. Today, 2 (1967), pp. 60-67.

54 Measure of Small-Worldness Low average geodesic path length and high clustering coefficient. Geodesic path – the shortest path through the network from one vertex to another. Mean path length ℓ = (2 / n(n+1)) Σ_{i≥j} d_ij, where d_ij is the geodesic distance from vertex i to vertex j. Most networks observed in the real world have ℓ ≤ 6: film actors 3.48; company directors 4.60; emails 4.95; Internet 3.33; electronic circuits 4.34.

55 Clustering C = probability that two of a node's neighbors are themselves connected. In a random graph: C_rand ~ 1/N (if the average degree is held constant).

56 Watts-Strogatz ‘Small World’ Model Watts and Strogatz introduced this simple model to show how networks can have both short path lengths and high clustering. D. J. Watts and S. H. Strogatz, Collective dynamics of “small-world” networks, Nature, 393 (1998), pp. 440–442.

57 Small-world model Used for modeling network transitivity. Many networks assume some kind of geographical proximity. Small-world model:  Start with a low-dimensional regular lattice  Rewire: add/remove edges to create shortcuts that join remote parts of the lattice – for each edge, with probability p, move the other end to a random vertex. Rewiring allows one to interpolate between a regular lattice and a random graph.

58 Small-world model Regular lattice (p = 0):  clustering coefficient C = (3k−3)/(4k−2) ≈ 3/4  mean distance ~ L/4k Almost random graph (p = 1):  clustering coefficient C ~ 2k/L  mean distance ~ log L / log k  no power-law degree distribution. [Figure: degree distribution as a function of the rewiring probability p] (A numerical sketch follows.)
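
A numerical sketch of the interpolation (networkx assumed; note that networkx's k is the total degree, i.e., twice the per-side k in the formulas above):

```python
import networkx as nx

n, k = 1000, 4
for p in (0.0, 0.01, 0.1, 1.0):
    G = nx.connected_watts_strogatz_graph(n, k, p, seed=0)
    print(p,
          round(nx.average_clustering(G), 3),            # stays high for small p
          round(nx.average_shortest_path_length(G), 2))  # drops quickly with p
```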

59 Resilience of Networks We consider the resilience of the network to the removal of its vertices (site percolation) or edges (bond percolation). As vertices (or edges) are removed from the network, the average path length will increase. Ultimately, the giant component will disintegrate. Networks vary according to their level of resilience to vertex (or edge) removal.

60 Stability Metric: Percolation Threshold Start from a single connected component. After a fraction f of the nodes is removed, the giant component still exists; beyond a critical fraction f_c of removed nodes, the entire graph breaks into small fragments. If q_c is the critical fraction of nodes that must remain for the giant component to survive, then f_c = 1 − q_c is the percolation threshold. (A simulation sketch follows.)
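
A simulation sketch of site percolation (networkx assumed). For an Erdős-Rényi graph with mean degree z, the giant component survives while (1 − f)z > 1, i.e., f_c = 1 − 1/z (0.75 for the z = 4 used here).

```python
import random
import networkx as nx

def giant_fraction(G, f, seed=0):
    rng = random.Random(seed)
    keep = [v for v in G if rng.random() > f]   # remove a fraction f of sites
    H = G.subgraph(keep)
    if H.number_of_nodes() == 0:
        return 0.0
    return len(max(nx.connected_components(H), key=len)) / G.number_of_nodes()

G = nx.gnp_random_graph(10000, 4 / 9999, seed=0)   # Poisson network, z ~ 4
for f in (0.0, 0.5, 0.75, 0.9):
    print(f, giant_fraction(G, f))   # collapses around f_c = 0.75
```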

61 Ordinary Percolation on Lattices Fill in each link (bond percolation) or site (site percolation) with probability p and ask questions about the sizes of connected components.

62 Percolation in Poisson and scale-free networks [Figure: percolation behavior of an exponential (Poisson) network vs. a scale-free network]

63 CASE STUDY I: Self-Organization of the Sound Inventories

64 Human Communication Human beings and many other living organisms produce sound signals. Unlike other organisms, humans can concatenate these sounds to produce new messages – Language. Language is one of the primary causes/effects of human intelligence.

65 Human Speech Sounds Human speech sounds are called phonemes – the smallest units of a language. Phonemes are characterized by certain distinctive features, as in Mermelstein's model: I. Place of articulation II. Manner of articulation III. Phonation

66 Types of Phonemes Vowels (e.g., /a/, /i/, /u/), consonants (e.g., /p/, /t/, /k/) and diphthongs (e.g., /ai/).

67 Choice of Phonemes How does a language choose a set of phonemes in order to build its sound inventory? Is the process arbitrary? Certainly not! What are the forces affecting this choice?

68 Forces of Choice A linguistic system – how does it look? The speaker desires "ease of articulation"; the listener/learner desires "perceptual contrast" / "ease of learnability". The forces shaping the choice are opposing – hence there has to be a non-trivial solution.

69 Vowels: A (Partially) Solved Mystery Languages choose vowels based on maximal perceptual contrast. For instance, if a language has three vowels, then in more than 95% of the cases they are /a/, /i/ and /u/ – the maximally distinct triple.

70 Consonants: A puzzle Research: from 1929 to date. No single satisfactory explanation of the organization of the consonant inventories  The set of features that characterize consonants is much larger than that of vowels  No single force is sufficient to explain this organization  Rather, a complex interplay of forces goes on in shaping these inventories

71 Principle of Occurrence PlaNet – the "Phoneme-Language Network"  A bipartite network N = (V_L, V_C, E)  V_L: nodes representing the languages of the world  V_C: nodes representing consonants  E: set of edges which run between V_L and V_C There is an edge e ∈ E between two nodes v_l ∈ V_L and v_c ∈ V_C if the consonant c occurs in the language l. [Figure: the structure of PlaNet – language nodes L1–L4 linked to consonant nodes such as /m/, /ŋ/, /p/, /d/, /s/, /θ/] Choudhury et al. 2006 ACL; Mukherjee et al. 2007 Int. Jnl. of Modern Physics C

72 Construction of PlaNet Data source: UCLA Phonological Segment Inventory Database (UPSID). Number of nodes in V_L is 317. Number of nodes in V_C is 541. Number of edges in E is 7022.

73 Degree Distribution of PlaNet The DD of the language nodes follows a β-distribution: p_k = (Γ(54.7) / (Γ(7.06) Γ(47.64))) k^6.06 (1−k)^46.64, i.e., beta(k) with α = 7.06 and β = 47.64 (k_min = 5, k_max = 173, k_avg = 21). The DD of the consonant nodes follows a power law with an exponential cut-off: P_k ∝ k^(−0.71). The distribution of consonants over languages thus follows a power law.

74 Synthesis of PlaNet Non-linear preferential attachment: iteratively construct the language inventories, given their inventory sizes, attaching consonants with probability Pr(C_i) = (d_i^α + ε) / Σ_{x ∈ V*} (d_x^α + ε), where d_i is the current degree of consonant i and V* is the set of candidate consonant nodes. [Figure: the growing network after steps 3 and 4] (A sketch follows.)
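
A sketch of the synthesis step (plain Python; function and variable names are mine, and the candidate-set bookkeeping is a simplification of the published model):

```python
import random

def synthesize(inventory_sizes, num_consonants, alpha=1.44, eps=0.5):
    degree = [0] * num_consonants          # current degree d_i of consonant i
    edges = []
    for lang, size in enumerate(inventory_sizes):
        chosen = set()
        while len(chosen) < size:
            # Pr(C_i) proportional to d_i^alpha + eps over unchosen consonants.
            weights = [0.0 if c in chosen else degree[c] ** alpha + eps
                       for c in range(num_consonants)]
            c = random.choices(range(num_consonants), weights=weights)[0]
            chosen.add(c)
        for c in chosen:
            degree[c] += 1
            edges.append((lang, c))
    return edges

edges = synthesize([21] * 317, 541)   # UPSID-like sizes, for illustration only
```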

75 Simulation Result The parameters α and ε are 1.44 and 0.5 respectively. The results are averaged over 100 runs. [Figure: degree distribution P_k vs. k for PlaNet, PlaNet_syn and PlaNet_rand]

76 Principle of Co-occurrence Consonants tend to co-occur in groups or communities. These groups tend to be organized around a few distinctive features (based on manner of articulation, place of articulation & phonation) – the principle of feature economy. For example, if a language has the voiced plosives /b/ (bilabial) and /d/ (dental) in its inventory, then it will also tend to have their voiceless counterparts /p/ and /t/.

77 How to Capture these Co-occurrences? PhoNet – the "Phoneme-Phoneme Network"  A weighted network N = (V_C, E)  V_C: nodes representing consonants  E: set of edges which run between the nodes in V_C There is an edge e ∈ E between two nodes v_c1, v_c2 ∈ V_C if the consonants c_1 and c_2 co-occur in a language. The number of languages in which c_1 and c_2 co-occur defines the edge weight of e; the number of languages in which c_1 occurs defines the node weight of v_c1. [Figure: a fragment of PhoNet around /k/, /kʷ/, /k′/ and /d′/, with edge weights such as 42, 14, 38, 13, 283, 17, 50, 39]

78 Construction of PhoNet Data source: UPSID. Number of nodes in V_C is 541. Number of edges is 34012.

79 Community Structures in PhoNet The Radicchi et al. algorithm (for unweighted networks) counts the number of triangles that an edge is a part of; inter-community edges have a low count, so remove them. Modification for a weighted network like PhoNet:  Look for triangles where the weights on the edges are comparable; if they are, the group of consonants co-occurs highly, else not.  Measure a strength S for each edge (u, v): S = w_uv / √(Σ_{i ∈ V_C − {u,v}} (w_ui − w_vi)²) if the denominator is greater than 0, else S = ∞.  Remove edges with S less than a threshold η. (A sketch follows.)
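
A sketch of the weighted strength measure and thresholding (plain Python; W is assumed to be a dict-of-dicts of co-occurrence weights, and node labels are assumed to be comparable strings):

```python
import math

def strength(W, u, v):
    # Compare the co-occurrence profiles of u and v over all other nodes.
    others = set(W) - {u, v}
    d = math.sqrt(sum((W[u].get(i, 0) - W[v].get(i, 0)) ** 2 for i in others))
    return W[u].get(v, 0) / d if d > 0 else math.inf

def prune(W, eta):
    # Keep only edges whose strength reaches the threshold eta.
    return [(u, v) for u in W for v in W[u]
            if u < v and strength(W, u, v) >= eta]
```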

80 Community Formation [Figure: a worked example on a 6-node graph – raw co-occurrence weights are converted into edge strengths S, and edges with S below the threshold η are removed, splitting the graph into communities] For different values of η we get different sets of communities.

81 Consonant Societies! [Figure: consonant communities obtained at η = 1.25, 0.72, 0.60 and 0.35] The fact that the communities are good can be quantitatively shown by measuring the feature entropy.

82 Problems to ponder on … Physical significance of PA:  Functional forces  Historical/Evolutionary process Labeled synthesis of PlaNet and PhoNet Language diversity vs. Preferential attachment

83 CASE STUDY II: Modeling the Mental Lexicon

84 Mental Lexicon (ML) – Basics It refers to the repository of the word forms that resides in the human brain. Two questions:  How are words stored in long-term memory, i.e., what is the organization of the ML?  How are words retrieved from the ML (lexical access)? The two questions are highly inter-related – to predict the organization, one can investigate how words are retrieved, and vice versa.

85 Different Possible Ways of Organization Un-organized (a bag full of words) or organized  By sound (phonological similarity): e.g., start the same: banana, bear, bean …; end the same: look, took, book …; number of phonological segments they share  By meaning (semantic similarity): banana, apple, pear, orange …  By age at which the word is acquired  By frequency of usage  By POS  Orthographically

86 The Hierarchical Model of ML Proposed by Collins and Quillian in 1969  Concepts are organized in a hierarchy  Taxonomic and attributive relations are represented  Cognitive economy: put the attributes at the highest of all appropriate levels – e.g., "reproduces" and "eats" apply to the whole animal kingdom. [Figure: Animal (reproduces, eats) branching into Mammal (warm-blooded, has mammary glands) and Fish (cold-blooded, has gills)]

87 Hierarchical Model According to the principle of cognitive economy  "Animals eat" < "mammals eat" < "humans eat"  However, "shark is a fish" = "salmon is a fish"  What do < and = mean? <: less time to judge; =: equal time to judge

88 Spreading Activation Model of ML Not a hierarchical structure but a web of inter-connected nodes (first proposed by Collins and Loftus in 1975). The distance between nodes is determined by the structural characteristics of the word forms (by sound, by meaning, by age, by …). Combining the above two: a plethora of complex networks.

89 Sound Similarity Relations in the Mental Lexicon: Modeling the Lexicon as a Complex Network Phonological Neighborhood Network (Vitevitch 2004) (Gruenenfelder & Pisoni, 2005) (Kapatsinski 2006)

90 N/W Definition Nodes: words. Edge: an edge is drawn from node A to node B if at least 2/3 of the segments that occur in the word represented by A also occur in the word represented by B  i.e., if the word represented by A is 6 segments long, then one can derive all its neighbors B from it by two phoneme changes (insertions, deletions or substitutions).

91 N/W Construction Database  Hoosier Mental Lexicon (Nusbaum et al., 1984): phonologically transcribed words  The n/w is built using the metric defined earlier  Nodes with no links correspond to hermit words, i.e., words that have no neighbors  Random networks (E-R) are used for comparison Directed n/w  a long word can have a short word as a neighbor, not vice versa  a link is kept only if the duration of the difference in the word pair ≤ (duration of a word)/3 (the factor 1/3 is experimentally derived … see the paper for further info.)

92 Neighborhood Density The nodes whose neighbors are searched are called base words. The neighborhood density of a base word is expressed as the out-degree of the node representing the base word. It is an estimate of the number of words activated by the base word when the base word is presented  spreading activation  something like semantic priming (but at the phonological level)

93 Results of the N/W Analysis Small-world properties: high clustering, but also a long average path length – like a SW network, the lexicon has densely connected neighborhoods, but a link between two nodes of different neighborhoods is harder to find than in SW networks.

94 Visualization – A Disconnected Graph with a Giant Component (GC) The GC is elongated – some nodes have really long chains of intermediates, and hence the mean path length is long.

95 Low Degree Nodes are Important!!! Removal of low-degree nodes renders the n/w almost disconnected. A bottleneck is formed between longer (more than 7 segments long) and shorter words  This bottleneck consists of the "-tion"-final words: coalition, passion, nation, fixation/fission – they form short-cuts between the high-degree nodes (i.e., they are low-degree stars that connect mega-neighborhoods)

96 Removal of Nodes with Degree ≤ 40 [Figure: the subnetworks of 2-4 segment words and of 8-10 segment words] Removal of low-degree nodes disconnects the n/w, as opposed to the removal of hubs like "pastor" (deg. = 112).

97 Why low connectivity between neighborhoods? Spreading activation should not reach neighbors of the stimulus' neighbors that are non-neighbors of the stimulus itself (and are therefore not similar to the stimulus). A low mean path favors complete traversal of n/ws, e.g., in general-purpose search. Search in the lexicon, however, does not need to traverse links between distant nodes; rather, it involves activation of a structured neighborhood sharing a single sub-lexical chunk that could be acoustically related during word-recognition (Marslen-Wilson, 1990).

98 Degree Distribution (DD) Exponential rather than power-law. [Figure: DD for the entire lexicon, for 5-7 segment words, and for 8-10 segment words]

99 Other Works (see supplementary material for references) Vitevitch (2005)  similar to the above work, but builds n/ws of nodes that are just one segment apart Choudhury et al. (2007)  Builds weighted n/ws in Hindi, Bengali and English based on orthographic proximity (nodes: words; edges: orthographic edit-distance) – SpellNet  Does thresholding (θ) to make the n/ws binary (at θ = 1, 3, 5)  Also obtains exponential DDs  Observes that the occurrence of real-word errors in a language is proportional to the avg. weighted degree of the SpellNet of that language

100 Other Works Sigman et al. (2002)  Analyzes the English WordNet  All semantic relationships are scale-invariant  Inclusion of polysemy makes the n/w small-world Ferrer i Cancho et al. (2000, 2001)  Word co-occurrence (in a sentence) based definitions of the lexicon  Lexicon = kernel lexicon + peripheral lexicon  Finds a 2-regime DD … one comprising words in the kernel lexicon and the other words in the peripheral lexicon  Finds that these n/ws are small-world

101 Some Unsolved Mysteries – You can Give it a Try What can be a model for the evolution of the ML? How is the ML acquired by a child learner? Is there a single optimal structure for the ML; or is it organized based on multiple criteria (i.e., a combination of the different n/ws) – Towards a single framework for studying ML!!!

102 CASE STUDY III: Syntax Unsupervised POS Tagging

103 Labeling of Text Lexical Category (POS tags) Syntactic Category (Phrases, chunks) Semantic Role (Agent, theme, …) Sense Domain dependent labeling (genes, proteins, …) How to define the set of labels? How to (learn to) predict them automatically?

104 “Nothing makes sense, unless in context” Distribution-based definition of  Lexical category  Sense (meaning) Example contexts: "The X is …", "If you X then I shall …", "… looking at the star"

105 General Approach Represent the context of a word (token). Define some notion of similarity between the contexts. Cluster the contexts of the tokens. Get the labels of the tokens from the clusters. [Figure: tokens w1 … w4 being grouped into clusters]

106 Issues How to define the context? How to define similarity? How to cluster? How to evaluate?

107 Unsupervised Parts-of-Speech Tagging Employing Efficient Graph Clustering Chris Biemann COLING-ACL 2006

108 Stages Input: raw text corpus. Identify feature words and define a graph over the high- and medium-frequency words (the top 10,000). Cluster the graph to identify the word classes. For low-frequency words, use context similarity. Lexicon of word classes → tag the same text → learn a Viterbi tagger.

109 Feature Words Estimate the unigram frequencies. Feature words: the 200 most frequent words.

110 Feature Vector Example sentence: "From the familiar to the exotic, the collection is a delight." For a target word, the context is encoded by which of the feature words fw_1 … fw_200 (e.g., "the", "to", "is", "from") occur at the positions p−2, p−1, p+1, p+2 around it, giving binary sub-vectors such as 00…01, 10…00, 01…00, 10…00. (A sketch follows.)
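
A sketch of the context encoding (plain Python; the 4-word feature list is a toy stand-in for the 200 most frequent words):

```python
def context_vector(tokens, i, feature_words):
    fw_index = {w: j for j, w in enumerate(feature_words)}
    vec = [0] * (4 * len(feature_words))             # one block per position
    for slot, offset in enumerate((-2, -1, 1, 2)):   # p-2, p-1, p+1, p+2
        j = i + offset
        if 0 <= j < len(tokens) and tokens[j] in fw_index:
            vec[slot * len(feature_words) + fw_index[tokens[j]]] = 1
    return vec

sent = "from the familiar to the exotic the collection is a delight".split()
feature_words = ["the", "to", "is", "from"]    # toy stand-in for the top 200
print(context_vector(sent, 2, feature_words))  # context vector for "familiar"
```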

111 Syntactic Network of Words [Figure: the word network over light, color, red, blue, blood, sky, heavy, weight] Words are linked by the similarity of their context vectors, e.g., with edge weight 1 − cos(red, blue).

112-114 The Chinese Whispers Algorithm [Figure: three iterations of Chinese Whispers on the word graph (light, color, red, blue, blood, sky, heavy, weight; edge weights 0.9, 0.5, 0.9, 0.7, 0.8, −0.5) – at each step a node adopts the strongest class among its neighbors, until the classes stabilize] (A sketch follows.)
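
A sketch of Chinese Whispers (plain Python; the toy word graph below mimics the figure, with weights of my choosing): every node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbors.

```python
import random
from collections import defaultdict

def chinese_whispers(nodes, weighted_edges, iters=20, seed=0):
    rng = random.Random(seed)
    nbrs = defaultdict(list)
    for u, v, w in weighted_edges:
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    label = {v: v for v in nodes}              # each node its own class
    for _ in range(iters):
        order = list(nodes)
        rng.shuffle(order)                     # randomized update order
        for v in order:
            votes = defaultdict(float)
            for u, w in nbrs[v]:
                votes[label[u]] += w           # neighbors vote by edge weight
            if votes:
                label[v] = max(votes, key=votes.get)
    return label

nodes = ["light", "color", "red", "blue", "blood", "sky", "heavy", "weight"]
edges = [("red", "blue", 0.9), ("blue", "sky", 0.7), ("red", "blood", 0.5),
         ("light", "heavy", 0.9), ("heavy", "weight", 0.8)]
print(chinese_whispers(nodes, edges))   # two word classes emerge
```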

115 Medium and Low Frequency Words Neighboring (window 4) co-occurrences, ranked by log-likelihood and thresholded by θ. Two words are connected iff they share at least 4 neighbors.
Language | Nodes | Edges
English | 52857 | 691241
Finnish | 85627 | 702349
German | 137951 | 1493571

116 Construction of Lexicon Each word is assigned a unique tag based on the word class it belongs to  Class 1: sky, color, blood, weight  Class 2: red, blue, light, heavy Ambiguous words:  High- and medium-frequency words that formed singleton clusters  Possible tags taken from neighboring clusters

117 Training and Evaluation Unsupervised training of a trigram HMM using the clusters and lexicon. Evaluation:  Tag a text for which a gold standard is available  Estimate the conditional entropy H(T|C) and the related perplexity 2^H(T|C) Final results:  English – 2.05 (619/345), Finnish – 3.22 (625/466), German – 1.79 (781/440)

118 Example
Sentence: From the familiar to the exotic the collection is a delight
Gold tags: Prep At JJ Prep At JJ At NN V At NN
Clusters: C200 C1 C331 C5 C1 C331 C1 C221 C3 C1 C220

119 Word Sense Disambiguation Véronis, J. 2004. HyperLex: lexical cartography for information retrieval. Computer Speech & Language 18(3):223-252. Let the word to be disambiguated be “light” Select a subcorpus of paragraphs which have at least one occurrence of “light” Construct the word co-occurrence graph

120 HyperLex Example contexts: "A beam of white light is dispersed into its component colors by its passage through a prism." "Energy efficient light fixtures including solar lights, night lights, energy star lighting, ceiling lighting, wall lighting, lamps." "What enables us to see the light and experience such wonderful shades of colors during the course of our everyday lives?" [Figure: the resulting co-occurrence graph over beam, colors, prism, dispersed, white, shades, energy, lamps, fixtures, efficient]

121 Hub Detection and MST [Figure: hubs such as "colors" and "lamps" are detected in the co-occurrence graph around "light", and a minimum spanning tree rooted at the target word separates the senses] Example: "White fluorescent lights consume less energy than incandescent lamps."

122 Other Related Works Solan, Z., Horn, D., Ruppin, E. and Edelman, S. 2005. Unsupervised learning of natural languages. PNAS, 102 (33): 11629-11634 Ferrer i Cancho, R. 2007. Why do syntactic links not cross? Europhysics Letters Also applied to: IR, Summarization, sentiment detection and categorization, script evaluation, author detection, …

123 Discussions & Conclusions What we learnt Advantages of SNIC in NLP Comparison to standard techniques Open problems Concluding remarks and Q&A

124 What we learnt What is SNIC and Complex Networks Analytical tools for SNIC Applications to human languages Three case studies:
Case study | Area | Perspective | Technique
I | Sound systems | Language evolution and change | Synthesis models
II | Lexicon | Psycholinguistic modeling and linguistic typology | Topology and search
III | Syntax & Semantics | Applications to NLP | Clustering

125 What we saw Language features complex structure at every level of organization Linguistic networks have non-trivial properties: scale-free & small-world Therefore, Language and Engineering systems involving language should be studied within the framework of complex systems, esp. CNT

126 Advantages of SNIC Fully Unsupervised techniques:  No labeled data required: A good solution to resources scarcity  Problem of evaluation: circumvented by semi- supervised techniques Ease of computation:  Simple and scalable  Distributed and parallel computable Holistic treatment:  Language evolution & psycho-linguistic theories

127 Comparison to Standard Techniques Rule-based vs. statistical NLP Graphical models  Generative models in machine learning  HMM, CRF, Bayesian belief networks [Figure: an HMM-style graphical model over POS states such as JJ, NN, RB, VF]

128 Graphical Models vs. SNIC GRAPHICAL MODEL: principled, based on Bayesian theory; structure is assumed and parameters are learnt; focus: decoding & parameter estimation; data-driven or computationally intensive; the generative process is easy to visualize, but there is no visualization of the data. COMPLEX NETWORK: heuristic, but with underlying principles of linear algebra; structure is discovered and studied; focus: topology and evolutionary dynamics; unsupervised and computationally easy; easy visualization of the data.

129 Language Modeling A network of words as a model of language vs. n-gram models Hierarchical, hyper-graph based models Smoothing through holistic analysis of the network topology Jedynak, B. and Karakos, D. 2007. Unigram Language Models using Diffusion Smoothing over Graphs. Proc. of TextGraphs - 2

130 Open Problems Universals and variables of linguistic networks Superimposition of networks: phonetic, syntactic, semantic Which clustering algorithm for which topology? Metrics for network comparison – important for language modeling Unsupervised dependency parsing using networks Mining translation equivalents

131 Resources Conferences  TextGraphs, Sunbelt, EvoLang, ECCS Journals  PRE, Physica A, IJMPC, EPL, PRL, PNAS, QL, ACS, Complexity, Social Networks Tools  Pajek, C#UNG, http://www.insna.org/INSNA/soft_inf.html Online Resources  Bibliographies, courses on CNT

132 Contact Monojit Choudhury  monojitc@microsoft.com  http://www.cel.iitkgp.ernet.in/~monojit/ Animesh Mukherjee  animeshm@cse.iitkgp.ernet.in  http://www.cel.iitkgp.ernet.in/~animesh/ Niloy Ganguly  niloy@cse.iitkgp.ernet.in  http://www.facweb.iitkgp.ernet.in/~niloy/

133 Thank you!!

