
1 Graph-based Algorithms for Information Retrieval and Natural Language Processing Rada Mihalcea University of North Texas rada@cs.unt.edu RANLP 2005

2 RANLP 2005 2 Introduction l Motivation –Graph theory is a well-studied discipline –So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP) –Often perceived as completely different disciplines l Goal of the tutorial: provide an overview of methods and applications in IR and NLP that rely on graph-based algorithms, e.g. –Graph-based algorithms: graph traversal, min-cut algorithms, node ranking –Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering

3 RANLP 2005 3 Outline l Part 1: Elements of graph-theory: –Terminology, notations, algorithms on graphs l Part 2: Graph-based algorithms for IR –Web search –Text clustering and classification l Part 3: Graph-based algorithms for NLP –Word Sense Disambiguation –Clustering of entities (names/numbers) –Thesaurus construction –Text summarization  content  style

4 RANLP 2005 4 What is a Graph? l A graph G = (V,E) is composed of: V: set of vertices E: set of edges connecting the vertices in V l An edge e = (u,v) is a pair of vertices l Example: a b c d e V= {a,b,c,d,e} E= {(a,b),(a,c),(a,d), (b,e),(c,d),(c,e), (d,e)}

5 RANLP 2005 5 Terminology: Adjacent and Incident l If (v 0, v 1 ) is an edge in an undirected graph, –v 0 and v 1 are adjacent –The edge (v 0, v 1 ) is incident on vertices v 0 and v 1 l If <v 0, v 1 > is an edge in a directed graph –v 0 is adjacent to v 1, and v 1 is adjacent from v 0 –The edge <v 0, v 1 > is incident on v 0 and v 1

6 RANLP 2005 6 Terminology: Degree of a Vertex l The degree of a vertex is the number of edges incident to that vertex l For a directed graph, –the in-degree of a vertex v is the number of edges that have v as the head –the out-degree of a vertex v is the number of edges that have v as the tail –if d i is the degree of a vertex i in a graph G with n vertices and e edges, the number of edges is e = ( Σ i d i ) / 2 Why? Since adjacent vertices each count the adjoining edge, it will be counted twice

7 RANLP 2005 7 Examples [Figure: example graphs G 1, G 2, G 3 annotated with the degree of each vertex; for the directed graph G 3, both in-degree and out-degree are shown, e.g. in: 1, out: 1; in: 1, out: 2; in: 1, out: 0]

8 RANLP 2005 8 Terminology: Path l path: sequence of vertices v 1,v 2,...v k such that consecutive vertices v i and v i+1 are adjacent.

9 RANLP 2005 9 More Terminology l simple path: no repeated vertices l cycle: simple path, except that the last vertex is the same as the first vertex [Figure: on the example graph with vertices a, b, c, d, e: a simple path b, e, c and a cycle a, c, d, a]

10 RANLP 2005 10 Even More Terminology l subgraph: subset of vertices and edges forming a graph l connected graph: any two vertices are connected by some path l connected component: maximal connected subgraph. [Figure: a connected graph, and a graph that is not connected and has 3 connected components]

11 RANLP 2005 11 Subgraph Examples [Figure: (a) some of the subgraphs of G 1 ; (b) some of the subgraphs of G 3 ]

12 RANLP 2005 12 More … l tree - connected graph without cycles l forest - collection of trees

13 RANLP 2005 13 Connectivity l Let n = #vertices, and m = #edges l A complete graph: one in which all pairs of vertices are adjacent l How many total edges in a complete graph? –Each of the n vertices is incident to n-1 edges, however, we would have counted each edge twice! Therefore, intuitively, m = n(n -1)/2. l Therefore, if a graph is not complete, m < n(n -1)/2

14 RANLP 2005 14 More Connectivity n = #vertices m = #edges l For a tree m = n - 1 If m < n - 1, G is not connected

15 RANLP 2005 15 Directed and Undirected Graphs l An undirected graph is one in which the pair of vertices in an edge is unordered, (v 0, v 1 ) = (v 1, v 0 ) l A directed graph is one in which each edge is an ordered (directed) pair of vertices, <v 0, v 1 > != <v 1, v 0 >, where v 0 is the tail and v 1 the head of the edge

16 RANLP 2005 16 Weighted and Unweighted Graphs l Edges have / do not have a weight associated with them [Figure: a weighted graph from source s to sink t with edge weights such as 15, 5, 10, 4, and the same graph unweighted]

17 RANLP 2005 17 Graph Representations: Adjacency Matrix l Let G=(V,E) be a graph with n vertices. l The adjacency matrix of G is a two-dimensional n by n array, say adj_mat l If the edge (v i, v j ) is in E(G), adj_mat[i][j]=1 l If there is no such edge in E(G), adj_mat[i][j]=0 l The adjacency matrix for an undirected graph is symmetric; the adjacency matrix for a digraph need not be symmetric
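A minimal sketch of this representation in Python, using the example graph from slide 4 (the variable names are illustrative and not part of the original slides):

vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "e"), ("c", "d"), ("c", "e"), ("d", "e")]

index = {v: i for i, v in enumerate(vertices)}
n = len(vertices)
adj_mat = [[0] * n for _ in range(n)]
for u, v in edges:
    adj_mat[index[u]][index[v]] = 1
    adj_mat[index[v]][index[u]] = 1   # undirected graph: keep the matrix symmetric

print(sum(adj_mat[index["a"]]))       # degree of 'a' = sum of its row = 3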

18 RANLP 2005 18 Examples of Adjacency Matrices [Figure: adjacency matrices for G 1, G 2 and G 3 ] –The matrix of an undirected graph is symmetric, so roughly n 2 /2 entries suffice; a directed graph may need all n 2 entries

19 RANLP 2005 19 Advantages of Adjacency Matrix l Starting with an adjacency matrix, it is easy to determine the links between vertices l The degree of a vertex is the sum of the entries in its row: deg(i) = Σ j adj_mat[i][j] l For a digraph (= directed graph), the row sum is the out_degree, while the column sum is the in_degree

20 RANLP 2005 20 Graph Representations: Adjacency Lists l Each row in an adjacency matrix is represented as a list l An undirected graph with n vertices and e edges => n head nodes and 2e list nodes l Pros: More compact representation –Memory efficient l Cons: Sometimes less efficient computation of in- or out- degrees
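A matching sketch for the adjacency-list representation (same illustrative example graph; a dictionary of lists stands in for the head nodes and list nodes):

from collections import defaultdict

edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "e"), ("c", "d"), ("c", "e"), ("d", "e")]

adj_list = defaultdict(list)
for u, v in edges:
    adj_list[u].append(v)   # undirected graph: each edge is stored in both lists,
    adj_list[v].append(u)   # which gives the 2e list nodes mentioned above

print(adj_list["a"])                                        # ['b', 'c', 'd']
print(sum(len(nbrs) for nbrs in adj_list.values()) // 2)    # e = 7 edges, counted in O(n+e)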

21 RANLP 2005 21 Examples of Adjacency Lists [Figure: adjacency lists for G 1, G 3 and G 4 ]

22 RANLP 2005 22 Operations on Adjacency Lists –degree of a vertex in an undirected graph: # of nodes in its adjacency list –# of edges in a graph: determined in O(n+e) –out-degree of a vertex in a directed graph: # of nodes in its adjacency list –in-degree of a vertex in a directed graph: traverse the whole data structure

23 RANLP 2005 23 Algorithms on Graphs l Graph traversal l Min-Cut / Max-Flow l Ranking algorithms l Spectral Clustering l Other: –Dijkstra’s Algorithm –Minimum Spanning Tree –…

24 RANLP 2005 24 Graph Traversal l Problem: Search for a certain node or traverse all nodes in the graph l Depth First Search –Once a possible path is found, continue the search until the end of the path l Breadth First Search –Start several paths at a time, and advance in each one step at a time

25 RANLP 2005 25 Depth-First Search

26 RANLP 2005 26 Depth-First Search Algorithm DFS(v)
Input: A vertex v in a graph
Output: A labeling of the edges as "discovery" edges and "back edges"
for each edge e incident on v do
    if edge e is unexplored then
        let w be the other endpoint of e
        if vertex w is unexplored then
            label e as a discovery edge
            recursively call DFS(w)
        else
            label e as a back edge
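A small runnable version of the pseudocode above, assuming the graph is given as a dictionary of adjacency lists (the example graph and names are illustrative):

def dfs(graph, v, visited=None, explored=None, labels=None):
    # Label each undirected edge as a "discovery" edge or a "back" edge.
    if visited is None:
        visited, explored, labels = set(), set(), {}
    visited.add(v)
    for w in graph[v]:
        e = frozenset((v, w))
        if e not in explored:            # edge e is unexplored
            explored.add(e)
            if w not in visited:         # vertex w is unexplored
                labels[e] = "discovery"
                dfs(graph, w, visited, explored, labels)
            else:
                labels[e] = "back"
    return labels

graph = {"a": ["b", "c", "d"], "b": ["a", "e"], "c": ["a", "d", "e"],
         "d": ["a", "c", "e"], "e": ["b", "c", "d"]}
print(dfs(graph, "a"))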

27 RANLP 2005 27 Breadth-First Search [Figure: stages a)–d) of a breadth-first traversal]

28 RANLP 2005 28 Breadth-First Search Algorithm BFS(s)
Input: A vertex s in a graph
Output: A labeling of the edges as "discovery" edges and "cross edges"
initialize container L 0 to contain vertex s
i ← 0
while L i is not empty do
    create container L i+1 to initially be empty
    for each vertex v in L i do
        for each edge e incident on v do
            if edge e is unexplored then
                let w be the other endpoint of e
                if vertex w is unexplored then
                    label e as a discovery edge
                    insert w into L i+1
                else
                    label e as a cross edge
    i ← i + 1
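A corresponding sketch of the level-by-level traversal (again assuming a dictionary of adjacency lists; the example graph is illustrative):

def bfs(graph, s):
    visited, explored, labels = {s}, set(), {}
    level = [s]                          # container L_0
    while level:
        next_level = []                  # container L_{i+1}
        for v in level:
            for w in graph[v]:
                e = frozenset((v, w))
                if e not in explored:    # edge e is unexplored
                    explored.add(e)
                    if w not in visited: # vertex w is unexplored
                        labels[e] = "discovery"
                        visited.add(w)
                        next_level.append(w)
                    else:
                        labels[e] = "cross"
        level = next_level
    return labels

graph = {"a": ["b", "c", "d"], "b": ["a", "e"], "c": ["a", "d", "e"],
         "d": ["a", "c", "e"], "e": ["b", "c", "d"]}
print(bfs(graph, "a"))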

29 RANLP 2005 29 Path Finding l Find a path from source vertex s to destination vertex d l Use graph search starting at s and terminating as soon as we reach d –Need to remember edges traversed l Use depth-first search l Use breadth-first search

30 RANLP 2005 30 Path Finding with Depth First Search [Figure: DFS trace on a graph with vertices A–G, start A, destination G: DFS on A; DFS on B; DFS on C; return to call on B; call DFS on D; call DFS on G – found destination, done] –the path is implicitly stored in the DFS recursion –path is: A, B, D, G

31 RANLP 2005 31 Path Finding with Breadth First Search [Figure: BFS trace on the same graph, start A, destination G: initial call to BFS on A, add A to queue; dequeue A, add B; dequeue B, add C, D; dequeue C, nothing to add; dequeue D, add G – found destination, done] –the path must be stored separately

32 RANLP 2005 32 Max Flow Network l Max flow network: G = (V, E, s, t, u) –(V, E) = directed graph, no parallel arcs –Two distinguished nodes: s = source, t = sink –u(e) = capacity of arc e [Figure: example network with source s, sink t, intermediate nodes 2–7, and arc capacities]

33 RANLP 2005 33 Flows l An s-t flow is a function f: E → R + that satisfies: –For each e ∈ E: 0 ≤ f(e) ≤ u(e) (capacity) –For each v ∈ V – {s, t}: Σ e into v f(e) = Σ e out of v f(e) (conservation) [Figure: the example network, with the capacity and the current flow shown on each arc]

34 RANLP 2005 34 Flows l An s-t flow is a function f: E → R + that satisfies: –For each e ∈ E: 0 ≤ f(e) ≤ u(e) (capacity) –For each v ∈ V – {s, t}: Σ e into v f(e) = Σ e out of v f(e) (conservation) [Figure: the same network with a flow of value 4] l MAX FLOW: find the s-t flow that maximizes the net flow out of the source

35 RANLP 2005 35 Flow Examples (Network – Nodes – Arcs – Flow)
–communication: telephone exchanges, computers, satellites – cables, fiber optics, microwave relays – voice, video, packets
–circuits: gates, registers, processors – wires – current
–mechanical: joints – rods, beams, springs – heat, energy
–hydraulic: reservoirs, pumping stations, lakes – pipelines – fluid, oil
–financial: stocks, currency – transactions – money
–transportation: airports, rail yards, street intersections – highways, railbeds, airway routes – freight, vehicles, passengers
–chemical: sites – bonds – energy

36 RANLP 2005 36 Cuts l An s-t cut is a node partition (S, T) such that s ∈ S, t ∈ T –The capacity of an s-t cut (S, T) is: cap(S, T) = Σ e from S to T u(e) l Min s-t cut: find an s-t cut of minimum capacity [Figure: an example cut in the network, with capacity = 30]

37 RANLP 2005 37 Cuts l An s-t cut is a node partition (S, T) such that s ∈ S, t ∈ T –The capacity of an s-t cut (S, T) is: cap(S, T) = Σ e from S to T u(e) l Min s-t cut: find an s-t cut of minimum capacity [Figure: a different cut in the same network, with capacity = 62]

38 RANLP 2005 38 Flows and Cuts l L1. Let f be a flow, and let (S, T) be a cut. Then, the net flow sent across the cut is equal to the amount reaching t. [Figure: the example network with a flow of value 24 and an s-t cut; the net flow across the cut is 24]

39 RANLP 2005 39 Flows and Cuts l L1. Let f be a flow, and let (S, T) be a cut. Then, the net flow sent across the cut is equal to the amount reaching t. [Figure: the same flow of value 24, shown with a different s-t cut; the net flow across this cut is again 24]

40 RANLP 2005 40 Flows and Cuts l Let f be a flow, and let (S, T) be a cut. Then the net flow sent across the cut equals the value of the flow, |f| l Proof by induction on |S| –Base case: S = { s } –Inductive hypothesis: assume true for |S'| = k-1  consider a cut (S, T) with |S| = k  S = S' ∪ { v } for some v ≠ s, t, |S'| = k-1, so the net flow across (S', T') is |f|  adding v to S' does not change the net flow across the cut, since by conservation the flow into v equals the flow out of v [Figure: the cut before (S') and after (S) adding v]

41 RANLP 2005 41 Max-Flow Min-Cut Theorem l Max-Flow Min-Cut Theorem (Ford-Fulkerson, 1956): In any network, the value of the max flow is equal to the value of the min cut. [Figure: the example network with a flow of value 28 and a cut of capacity 28]

42 RANLP 2005 42 Random Walk Algorithms l Algorithms for link analysis l Based on random walks –Result in a ranking of vertex "importance" l Used in many fields: –Social networks  co-authorship, citations –Web search engines –Clustering –NLP –…

43 RANLP 2005 43 PageRank Intuition l (Brin & Page 1998) l Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. l Occasionally, the surfer will get bored and instead of following a link pointing outward from the current page will jump to another random page. l At some point, the percentage of time spent at each page will converge to a fixed value. l This value is known as the PageRank of the page.

44 RANLP 2005 44 HITS Intuition l Hyperlink-Induced Topic Search (Kleinberg 1999) l Defines 2 types of importance: –Authority  Can be thought of as being an expert on the subject  "You are important if important hubs point to you" –Hub  Can be thought of as being knowledgeable about the subject  "You are important if you point to important authorities" l Hubs and Authorities mutually reinforce each other l Note: A node in a graph can be both an Authority and a Hub at the same time, e.g. cnbc.com l Note: this is equivalent to performing a random walk where: –Odd steps you walk from hubs to authorities (follow outlink) –Even steps you walk from authorities to hubs (follow inlink)

45 RANLP 2005 45 [Figure: example graph with vertices A–F and the ranking score attached to each vertex, e.g. 1.35, 0.63, 0.36, 1.17, 0.51, 1.47]

46 RANLP 2005 46 Graph-based Ranking Decide the importance of a vertex within a graph A link between two vertices = a vote –Vertex A links to vertex B = vertex A “votes” for vertex B –Iterative voting => Ranking over all vertices Model a random walk on the graph –A walker takes random steps –Converges to a stationary distribution of probabilities  Probability of finding the walker at a certain vertex –Guaranteed to converge if the graph is  Aperiodic – holds for non-bipartite graphs  Irreducible – holds for strongly connected graphs

47 RANLP 2005 47 Graph-based Ranking Traditionally applied on directed graphs –From a given vertex, the walker selects at random one of the out-edges Applies also to undirected graphs –From a given vertex, the walker selects at random one of the incident edges Adapted to weighted graphs –From a given vertex, the walker selects at random one of the out (directed graphs) or incident (undirected graphs) edges, with higher probability of selecting edges with higher weight

48 RANLP 2005 48 Algorithms G = (V,E) – a graph – In(V i ) = predecessors, Out(V i ) = successors PageRank – (Brin & Page 1998) – integrates incoming / outgoing links in one formula: PR(V i ) = (1 – d) + d * Σ Vj ∈ In(Vi) PR(V j ) / |Out(V j )|, where d ∈ [0,1] Adapted to weighted graphs: PR W (V i ) = (1 – d) + d * Σ Vj ∈ In(Vi) ( w ji / Σ Vk ∈ Out(Vj) w jk ) * PR W (V j )
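A minimal sketch of the unweighted iteration, written directly from the formula above (the damping factor 0.85 and the 0.0001 stopping threshold mirror values mentioned elsewhere in the tutorial; the toy web graph is illustrative):

def pagerank(out_links, d=0.85, tol=0.0001):
    # out_links: dict mapping each vertex to the list of vertices it points to
    vertices = list(out_links)
    in_links = {v: [u for u in vertices if v in out_links[u]] for v in vertices}
    score = {v: 1.0 for v in vertices}
    while True:
        new = {v: (1 - d) + d * sum(score[u] / len(out_links[u]) for u in in_links[v])
               for v in vertices}
        if max(abs(new[v] - score[v]) for v in vertices) < tol:
            return new
        score = new

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(web))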

49 RANLP 2005 49 Algorithms Positional Function – Positional Power / Weakness (Herings et al. 2001): POS P (V i ) = (1 / |V|) * Σ Vj ∈ Out(Vi) ( POS P (V j ) + 1 ), POS W (V i ) = (1 / |V|) * Σ Vj ∈ In(Vi) ( POS W (V j ) + 1 ) HITS – Hyperlink-Induced Topic Search (Kleinberg 1999) – authorities / hubs: HITS A (V i ) = Σ Vj ∈ In(Vi) HITS H (V j ), HITS H (V i ) = Σ Vj ∈ Out(Vi) HITS A (V j )
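A matching sketch of the HITS mutual-reinforcement iteration (scores normalized after each step; the fixed number of iterations and the toy graph are illustrative choices):

def hits(out_links, iterations=50):
    vertices = list(out_links)
    in_links = {v: [u for u in vertices if v in out_links[u]] for v in vertices}
    auth = {v: 1.0 for v in vertices}
    hub = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        auth = {v: sum(hub[u] for u in in_links[v]) for v in vertices}
        hub = {v: sum(auth[w] for w in out_links[v]) for v in vertices}
        a_norm = sum(auth.values()) or 1.0
        h_norm = sum(hub.values()) or 1.0
        auth = {v: a / a_norm for v, a in auth.items()}
        hub = {v: h / h_norm for v, h in hub.items()}
    return auth, hub

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
authorities, hubs = hits(web)
print(max(authorities, key=authorities.get))   # "C" attracts the most in-links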

50 RANLP 2005 50 Convergence l Iterate until convergence: the error – approximated as the difference between the scores computed at two successive iterations – drops below a small threshold value (e.g. 0.0001) l Text-based graphs – convergence usually achieved after 20-40 iterations –more iterations for undirected graphs

51 RANLP 2005 51 Convergence Curves (250 vertices, 250 edges)

52 RANLP 2005 52 Spectral Clustering l Algorithms that cluster data using graph representations l Similarity graph: –Distance decreases as similarity increases –Represent dataset as a weighted graph G(V,E) l Clustering can be viewed as partitioning a similarity graph l Bi-partitioning: divide vertices into two disjoint groups (A,B) [Figure: a similarity graph over six points with edge weights 0.1–0.8, partitioned into two groups A and B]

53 RANLP 2005 53 Clustering Objectives l Traditional definition of a "good" clustering: 1.Points assigned to same cluster should be highly similar. 2.Points assigned to different clusters should be highly dissimilar. l Apply these objectives to our graph representation: minimize the weight of between-group connections [Figure: the six-point similarity graph with a cut separating the two groups] l Options: 1.Find minimum cut between two clusters 2.Apply spectral graph theory

54 RANLP 2005 54 Spectral Graph Theory l Possible approach –Represent a similarity graph as a matrix –Apply knowledge from Linear Algebra… l Spectral Graph Theory –Analyse the “spectrum” of matrix representing a graph –Spectrum : The eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues l The eigenvalues and eigenvectors of a matrix provide global information about its structure.

55 RANLP 2005 55 Matrix Representations Adjacency matrix (A) – n x n matrix – A ij = edge weight between vertex x i and x j
        x1    x2    x3    x4    x5    x6
x1      0     0.8   0.6   0     0.1   0
x2      0.8   0     0.8   0     0     0
x3      0.6   0.8   0     0.2   0     0
x4      0     0     0.2   0     0.8   0.7
x5      0.1   0     0     0.8   0     0.8
x6      0     0     0     0.7   0.8   0
l Important properties: –Symmetric matrix [Figure: the six-point similarity graph with these edge weights]

56 RANLP 2005 56 Matrix Representations (continued) Degree matrix (D) – n x n diagonal matrix – D ii = total weight of edges incident to vertex x i
        x1    x2    x3    x4    x5    x6
x1      1.5   0     0     0     0     0
x2      0     1.6   0     0     0     0
x3      0     0     1.6   0     0     0
x4      0     0     0     1.7   0     0
x5      0     0     0     0     1.7   0
x6      0     0     0     0     0     1.5

57 RANLP 2005 57 Matrix Representations (continued) Laplacian matrix (L) – n x n symmetric matrix, L = D - A
        x1    x2    x3    x4    x5    x6
x1      1.5  -0.8  -0.6   0    -0.1   0
x2     -0.8   1.6  -0.8   0     0     0
x3     -0.6  -0.8   1.6  -0.2   0     0
x4      0     0    -0.2   1.7  -0.8  -0.7
x5     -0.1   0     0    -0.8   1.7  -0.8
x6      0     0     0    -0.7  -0.8   1.5
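A short NumPy sketch (NumPy assumed available) that builds A, D and L = D - A for the six-point example above and inspects the spectrum:

import numpy as np

A = np.array([[0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
              [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
              [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
              [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
              [0.0, 0.0, 0.0, 0.7, 0.8, 0.0]])

D = np.diag(A.sum(axis=1))        # degree matrix
L = D - A                         # unnormalized Laplacian

eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigh: L is symmetric
fiedler = eigenvectors[:, 1]      # eigenvector of the 2nd smallest eigenvalue
print(np.round(eigenvalues, 2))
print(fiedler > 0)                # the sign of the Fiedler vector bi-partitions the graph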

58 RANLP 2005 58 Spectral Clustering Algorithms l Three basic stages: 1.Pre-processing  Construct a matrix representation of the dataset. 2.Decomposition  Compute eigenvalues and eigenvectors of the matrix.  Map each point to a lower-dimensional representation based on one or more eigenvectors. 3.Grouping  Assign points to two or more clusters, based on the new representation.

59 RANLP 2005 59 Spectral Bi-partitioning Algorithm 1. Pre-processing –Build the Laplacian matrix L of the graph (as above) 2. Decomposition –Find the eigenvalues Λ and eigenvectors X of the matrix L [Figure: Λ = {0.0, 0.4, 2.2, 2.3, 2.5, 3.0} and the corresponding eigenvector matrix X] –Map each vertex to the corresponding component of the eigenvector for the second smallest eigenvalue λ 2 (= algebraic connectivity of the graph), the Fiedler vector

60 RANLP 2005 60 Spectral Bi-partitioning (continued) l Grouping –Sort the components of the reduced 1-dimensional vector –Identify clusters by splitting the sorted vector in two l How to choose a splitting point? –Naïve approaches:  Split at 0, mean or median value –More expensive approaches:  Attempt to minimise the (normalised) cut in 1 dimension l Example: split at 0 => Cluster A: points with positive components (x 1, x 2, x 3 ), Cluster B: points with negative components (x 4, x 5, x 6 )

61 RANLP 2005 61 3-Clusters l Let's assume the data points shown below [Figure: a data set with three clusters]

62 RANLP 2005 62 l If we use the 2 nd eigenvector [Figure: resulting partition] l If we use the 3 rd eigenvector [Figure: resulting partition]

63 RANLP 2005 63 K -Way Spectral Clustering How do we partition a graph into k clusters? l Two basic approaches: 1.Recursive bi-partitioning (Hagen et al.,’91)  Recursively apply bi-partitioning algorithm in a hierarchical divisive manner. 2.Cluster multiple eigenvectors  Build a reduced space from multiple eigenvectors.  Use 3 rd eigenvector to further partition the clustering output of the 2 nd eigenvector  Commonly used in recent work –Multiple eigenvectors prevent instability due to information loss

64 RANLP 2005 64 K -Eigenvector Clustering l K-eigenvector Algorithm (Ng et al.,’01) 1.Pre-processing –Construct the scaled adjacency matrix 2.Decomposition  Find the eigenvalues and eigenvectors of A'.  Build embedded space from the eigenvectors corresponding to the k largest eigenvalues. 3.Grouping  Apply k-means to reduced n x k space to produce k clusters. –Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids. –Assign each object to the group that has the closest centroid. –When all objects have been assigned, recalculate the positions of the K centroids. –Repeat Steps 2 and 3 until the centroids no longer move.
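A compact sketch of the three stages for k = 2 on the six-point example, using NumPy and scikit-learn's k-means (both assumed available; this is an illustrative sketch, not the original experimental code):

import numpy as np
from sklearn.cluster import KMeans

A = np.array([[0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
              [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
              [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
              [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
              [0.0, 0.0, 0.0, 0.7, 0.8, 0.0]])
k = 2

# 1. Pre-processing: scaled affinity matrix A' = D^-1/2 A D^-1/2
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_scaled = d_inv_sqrt @ A @ d_inv_sqrt

# 2. Decomposition: embed each point using the k largest eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(A_scaled)   # ascending order
X = eigenvectors[:, -k:]                               # k largest eigenvectors
X = X / np.linalg.norm(X, axis=1, keepdims=True)       # unit-length rows

# 3. Grouping: k-means in the reduced n x k space
labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
print(labels)   # {x1,x2,x3} and {x4,x5,x6} end up in separate clusters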

65 RANLP 2005 65 Outline l Part 1: Elements of graph-theory: –Terminology, notations, algorithms on graphs l Part 2: Graph-based algorithms for IR –Web search –Text clustering and classification l Part 3: Graph-based algorithms for NLP –Word Sense Disambiguation –Clustering of entities (names/numbers) –Thesaurus construction –Text summarization  content  style

66 RANLP 2005 66 Web-link Analysis for Improved IR l Web: heterogeneous and unstructured l Free of quality control on the web l Commercial interest to manipulate ranking => Use a model of Web page importance, based on link analysis (Brin and Page 1998), (Kleinberg 1999) l Related work –Academic citation analysis –Clustering methods of link structure –Hubs and Authorities model

67 RANLP 2005 67 Back-links l Link Structure of the Web l Approximates the importance / quality of Web pages l PageRank: –Pages with lots of back-links are important –Back-links coming from important pages convey more importance to a page –Problem: rank sink

68 RANLP 2005 68 Rank Sink l A cycle of pages that is pointed to by some incoming link l Problem: this loop will accumulate rank but never distribute any rank outside

69 RANLP 2005 69 Escape Term l Solution: Rank Source R(u) = c * Σ v ∈ Bu R(v) / N v + c * E(u), where B u = pages pointing to u and N v = number of out-links of v l E(u) is some vector over the web pages – uniform, favorite pages, etc. l Recall that the rank (PageRank) is the eigenvector corresponding to the largest eigenvalue

70 RANLP 2005 70 Random Surfer Model l PageRank corresponds to the probability distribution of a random walk on the web graph l E(u) can be re-phrased as follows: the random surfer periodically gets bored and jumps to a different page, and is therefore not kept in a loop forever

71 RANLP 2005 71 Implementation l Computing resources — 24 million pages — 75 million URLs l Unique integer ID for each URL l Sort and Remove dangling links l Rank initial assignment l Iteration until convergence l Add back dangling links and Re-compute

72 RANLP 2005 72 Ranking Web-pages l Combined PageRank + vectorial model l Vectorial model: –Finds similarity between query and documents –Uses term weighting schemes (TF.IDF) –Ranks documents based on similarity –Ranking is query specific l PageRank: –Ranks documents based on importance –Ranking is general (not query specific) l Combined model: –Weighted combination of vectorial and PageRank ranking

73 RANLP 2005 73 Issues l Users are not random walkers – Content based methods l Starting point distribution – Actual usage data as starting vector l Reinforcing effects/bias towards main pages l No query specific rank l Linkage spam – PageRank favors pages that managed to get other pages to link to them – Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)

74 RANLP 2005 74 Personalized PageRank l Initial value for rank source E can change ranking: 1. Uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives may result in overly high ranking 2. Total weight on a few sets of pages based on a-priori known popularity, e.g. Netscape, McCarthy 3. Different weight depending on the query or topic: 1.topic-sensitive PageRank (Haveliwala 2001) 2.query-sensitive PageRank (Richardson 2002)

75 RANLP 2005 75 Link Analysis for IR l PageRank is a global ranking based on the web's graph structure l PageRank relies on back-links information to “bring order to the web” l Proven in practice to exceed the performance of other search engines, which are exclusively based on vectorial / boolean models l Can be adapted to “personalized” searches –Although personalized search engines are still not very popular

76 RANLP 2005 76 Text Classification l Given: –A description of an instance, x ∈ X, where X is the instance language or instance space. –A fixed set of categories: C = {c 1, c 2,…, c n } l Determine: –The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C

77 RANLP 2005 77 Text Classification Examples Assign labels to each document or web-page: l Labels are most often topics such as Yahoo-categories e.g., "finance," "sports," "news>world>asia>business" l Labels may be genres e.g., "editorials" "movie-reviews" "news“ l Labels may be opinion e.g., “like”, “hate”, “neutral” l Labels may be domain-specific binary e.g., "interesting-to-me" : "not-interesting-to-me” e.g., “spam” : “not-spam” e.g., “is a toner cartridge ad” :“isn’t”

78 RANLP 2005 78 Traditional Methods l Manual classification –Used by Yahoo!, Looksmart, about.com, ODP, Medline –very accurate when job is done by experts –consistent when the problem size and team is small –difficult and expensive to scale l Automatic document classification –Hand-coded rule-based systems  Used by CS dept’s spam filter, Reuters, …  E.g., assign category if document contains a given boolean combination of words l Supervised learning of document-label assignment function –Many new systems rely on machine learning  k-Nearest Neighbors (simple, powerful)  Naive Bayes (simple, common method)  Support-vector machines (new, more powerful) …

79 RANLP 2005 79 Text Clustering l Goal similar to the task of text classification: classify text into categories l No a priori known set of categories l Cluster data into sets of similar documents l Pros: no annotated data is required l Cons: more difficult to assign labels l Traditional approaches: –All clustering methods have also been applied to text clustering  Agglomerative clustering  K-means  …

80 RANLP 2005 80 Spectral Learning / Clustering l Use graph representations of text collections for: –Text clustering: spectral clustering –Text classification: spectral learning l Recall bi-partitioning and k-way clustering: –Use the 2 nd or the first K eigenvectors as a reduced space –Map vertices to the reduced space –Cluster the reduced space  E.g. cluster the points in the 2 nd eigenvector in two classes: positive vs. negative => bi-partitioning  Use k-means to cluster the representation obtained using the top K eigenvectors

81 RANLP 2005 81 Spectral Learning / Clustering l Spectral representation: (Kamvar et al., 2003) –Find similarity graph for the collection –Find Laplacian matrix (alternatives: normalized L, LSA, etc.) –Find the K largest eigenvectors of the Laplacian => X –Normalize the rows in X to be unit length l Clustering: –Use k-means (or other clustering algorithm) to cluster the rows in X into K clusters l Classification: –Assign known labels to the training points in the collection, using the representation in X –Classify unlabeled rows corresponding to the test points in the collection, using the representation in X, with the help of any learning algorithm (SVM, Naïve Bayes, etc.)

82 RANLP 2005 82 Empirical Results l Data sets: –20 newsgroups: 1000 postings from each of 20 usenet newsgroups –3 newsgroups: 3 of the 20 newsgroups (crypt, politics, religion) –Lymphoma: gene expression profiles. 2 classes with 42 and 54 samples each –Soybean: the soybean-large data set from the UCI repository, with 15 classes

83 RANLP 2005 83 Text Clustering l Compare spectral clustering with k-means l Use document-to-document similarities to build similarity graph for each collection –Pairwise cosine similarity –Use k = 20
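One way such a document-to-document similarity graph might be built, sketched with scikit-learn's TF-IDF vectorizer and pairwise cosine similarity (the four toy documents are invented for illustration and are not from the data sets above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the hurricane hit the southern coast",
        "the storm reached the coast on sunday",
        "the team won the final game",
        "the players celebrated the win"]

tfidf = TfidfVectorizer().fit_transform(docs)
similarity = cosine_similarity(tfidf)    # n x n weighted adjacency matrix
print(similarity.round(2))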

84 RANLP 2005 84 Text Classification l Use the same spectral representation, but apply a learning algorithm trained on the labeled examples –Reinforce the prior knowledge: set A ij to 1 if i and j are known to belong to the same class, or 0 otherwise l Can naturally integrate unlabeled training examples – included in the initial matrix

85 RANLP 2005 85 Text Classification

86 RANLP 2005 86 References l Brin, S. and Page, L., The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 1998 l Herings, P.J. van der Laan, G. and Talman, D., Measuring the power of nodes in digraphs, Technical Report, Tinbergen Institute, 2001 l Kamvar, S. and Klein, D. and Manning, C. Spectral Learning, IJCAI 2003 l Kleinberg, J., Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999 l Ng, A. and Jordan, M. and Weiss, Y. On Spectral Clustering: Analysis and an algorithm, NIPS 14, 2002

87 RANLP 2005 87 Outline l Part 1: Elements of graph-theory: –Terminology, notations, algorithms on graphs l Part 2: Graph-based algorithms for IR –Web search –Text clustering and classification l Part 3: Graph-based algorithms for NLP –Word Sense Disambiguation –Clustering of entities (names/numbers) –Thesaurus construction –Text summarization  content  style

88 RANLP 2005 88 Word Sense Disambiguation l Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. –Sense Inventory usually comes from a dictionary or thesaurus. –Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches l Word polysemy (with respect to a dictionary) l Determine which sense of a word is used in a specific sentence –Ex: “chair” – furniture or person –Ex: “child” – young person or human offspring “Sit on a chair” “Take a seat on this chair” “The chair of the Math Department” “The chair of the meeting”

89 RANLP 2005 89 Graph-based Solutions for WSD l Use information derived from dictionaries / semantic networks to construct graphs l Build graphs using measures of "similarity" –Similarity determined between pairs of concepts, or between a word and its surrounding context  Distributional similarity (Lee 1999) (Lin 1999)  Dictionary-based similarity (Rada 1989) [Figure: fragment of a WordNet-style animal hierarchy: carnivore; canine, feline, fissiped mammal; dog, wolf, bear, hyena, dingo, dachshund, hunting dog, hyena dog, terrier]

90 RANLP 2005 90 Semantic Similarity Metrics l Input: two concepts (same part of speech) l Output: similarity measure l E.g. (Leacock and Chodorow 1998): Similarity(c 1, c 2 ) = – log ( path_length(c 1, c 2 ) / (2 * D) ), where D is the taxonomy depth –E.g. Similarity(wolf,dog) = 0.60 Similarity(wolf,bear) = 0.42 l Other metrics: –Similarity using information content (Resnik 1995) (Lin 1998) –Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999) –Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995)
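A self-contained sketch of the path-based measure above on a toy is-a hierarchy (the tiny taxonomy, its depth, and the printed values are made up for illustration; a real implementation would walk WordNet):

import math

parent = {"dog": "canine", "wolf": "canine", "bear": "carnivore",
          "canine": "carnivore", "feline": "carnivore", "carnivore": None}

def path_to_root(concept):
    path = [concept]
    while parent[concept] is not None:
        concept = parent[concept]
        path.append(concept)
    return path

def lch_similarity(c1, c2, depth=3):
    # Leacock & Chodorow: -log( path_length / (2 * D) ), D = taxonomy depth
    p1, p2 = path_to_root(c1), path_to_root(c2)
    common = next(c for c in p1 if c in p2)                 # lowest common subsumer
    path_length = p1.index(common) + p2.index(common) + 1   # nodes on the path
    return -math.log(path_length / (2.0 * depth))

print(round(lch_similarity("wolf", "dog"), 2))    # shorter path => higher similarity
print(round(lch_similarity("wolf", "bear"), 2))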

91 RANLP 2005 91 Lexical Chains for WSD l Apply measures of semantic similarity in a global context l Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976) l "A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse" Algorithm for finding lexical chains: 1. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. 2. For each such candidate word, and for each meaning of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain and the candidate word meaning. 3. If such a chain is found, insert the word in this chain; otherwise, create a new chain.

92 RANLP 2005 92 Lexical Chains "A very long train traveling along the rails with a constant velocity v in a certain direction …" –train: #1 public transport, #2 ordered set of things, #3 piece of cloth –travel: #1 change location, #2 undergo transportation –rail: #1 a barrier, #2 a bar of steel for trains, #3 a small bird

93 RANLP 2005 93 Lexical Chains for WSD l Identify lexical chains in a text –Usually target one part of speech at a time l Identify the meaning of words based on their membership to a lexical chain l Evaluation: –(Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09% –(Mihalcea and Moldovan 2000) on five SemCor texts give 90% with 60% recall  lexical chains “anchored” on monosemous words

94 RANLP 2005 94 Graph-based Ranking on Semantic Networks l Goal: build a semantic graph that represents the meaning of the text l Input: Any open text l Output: Graph of meanings (synsets) –“importance” scores attached to each synset –relations that connect them l Models text cohesion –(Halliday and Hasan 1979) –From a given concept, follow “links” to semantically related concepts l Graph-based ranking identifies the most recommended concepts

95 RANLP 2005 95 Two U.S. soldiers and an unknown number of civilian contractors are unaccounted for after a fuel convoy was attacked near the Baghdad International Airport today, a senior Pentagon official said. One U.S. soldier and an Iraqi driver were killed in the incident. … … …

96 RANLP 2005 96 Main Steps l Step 1: Preprocessing –SGML parsing, text tokenization, part of speech tagging, lemmatization l Step 2: Assume any possible meaning of a word in a text is potentially correct –Insert all corresponding synsets into the graph l Step 3: Draw connections (edges) between vertices l Step 4: Apply the graph-based ranking algorithm –PageRank, HITS, Positional power

97 RANLP 2005 97 Semantic Relations l Main relations provided by WordNet –ISA (hypernym/hyponym) –PART-OF (meronym/holonym) –causality –attribute –nominalizations –domain links l Derived relations –coord: synsets with common hypernym l Edges (connections) –directed (direction?) / undirected –Best results with undirected graphs l Output: Graph of concepts (synsets) identified in the text –“importance” scores attached to each synset –relations that connect them

98 RANLP 2005 98 Word Sense Disambiguation l Rank the synsets/meanings attached to each word l Unsupervised method for semantic ambiguity resolution of all words in unrestricted text (Mihalcea et al. 2004) l Related algorithms: –Lesk –Baseline (most frequent sense / random) l Hybrid: –Graph-based ranking + Lesk –Graph-based ranking + Most frequent sense l Evaluation –“Informed” (with sense ordering) –“Uninformed” (no sense ordering) l Data –Senseval-2 all words data (three texts, average size 600) –SemCor subset (five texts: law, sports, debates, education, entertainment)

99 RANLP 2005 99 Evaluation “uninformed” (no sense order)

100 RANLP 2005 100 Evaluation “informed” (sense order integrated)

101 RANLP 2005 101 Ambiguous Entities l Name ambiguity in research papers: –David S. Johnson, David Johnson, D. Johnson –David Johnson (Rice), David Johnson (AT & T) l Similar problem across entities: –Washington (person), Washington (state), Washington (city) l Number ambiguity in texts: –quantity: e.g. 100 miles –time: e.g. 100 years –money: e.g. 100 euro –misc.: anything else l Can be modeled as a clustering problem

102 RANLP 2005 102 Name Disambiguation l Extract attributes for each person – e.g. for research papers: –Co-authors –Paper titles –Venue titles l Each word in these attribute sets constitutes a binary feature l Apply a weighting scheme: –E.g. normalized TF, TF/IDF l Construct a vector for each occurrence of a person name l (Han et al., 2005)

103 RANLP 2005 103 Spectral Clustering l Apply k-way spectral clustering to name data sets l Two data sets: –DBLP: authors of 400,000 citation records – use top 14 ambiguous names –Web-based data set: 11 authors named “J. Smith” and 15 authors named “J. Anderson”, in a total of 567 citations l Clustering evaluated using confusion matrices –Disambiguation accuracy = sum of diagonal elements A[i,i] divided by the sum of all elements in the matrix
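The disambiguation accuracy described above is simply the trace of the confusion matrix divided by its total; a two-line sketch (the 3 x 3 counts are invented for illustration):

import numpy as np

confusion = np.array([[40,  3,  2],     # rows: true author, columns: predicted cluster
                      [ 5, 35,  4],
                      [ 1,  2, 28]])

accuracy = np.trace(confusion) / confusion.sum()
print(round(accuracy, 3))               # (40 + 35 + 28) / 120 = 0.858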

104 RANLP 2005 104 Name Disambiguation – Results l DBLP data set: l Web-based data set –11 “J.Smith”: 84.7% (k-means 75.4%) –15 “J.Anderson”: 71.2% (k-means 67.2%)

105 RANLP 2005 105 Number Disambiguation l Classify numbers into 4 classes: quantity, time, money, miscellaneous l Two algorithms –Spectral clustering on bipartite graphs –Tripartite updating l Features –Surrounding window of +/- 2 words, and the "range" of the number (w ≤ 3, 3 < w ≤ 10, 10 < w ≤ 30 …) –Map all numbers in the data set onto these features l (Radev, 2004)

106 RANLP 2005 106 Tripartite Updating l F – set of features, L / U – sets of labeled / unlabeled examples l T 1 : connectivity matrix between F and L l T 2 : connectivity matrix between F and U l The scores of F and U are iteratively updated using T 1 and T 2

107 RANLP 2005 107 Number Disambiguation – Results l Use a subset of the numbers from the APW section of the ACQUAINT corpus –5000 unlabeled examples, of which 300 were manually annotated –200 labeled examples for evaluation

108 RANLP 2005 108 Automatic Thesaurus Generation l Idea: Use an (online) traditional dictionary to generate a graph structure, and use it to create thesaurus-like entries for all words (stopwords included) –(Jannink and Wiederhold, 1999) l For instance: –Input: the 1912 Webster’s Dictionary –Output: a repository with rank relationships between terms l The repository is comparable with handcrafted efforts such as MindNet and WordNet

109 RANLP 2005 109 Algorithm 1. Extract a directed graph from the dictionary - e.g. relations between head-words and words included in definitions 2. Obtain the relative measure of arc importance 3. Rank the arcs with ArcRank (graph-based algorithm for edge ranking)

110 RANLP 2005 110 Algorithm (cont.) 1. Extract a directed graph from the dictionary l One arc from each head word to all words in its definition –E.g. "Transport. To carry or bear from one place to another; …" yields arcs from the head word transport to carry, bear, … l Potential problems, such as: syllable and accent markers in head words, misspelled head words, accents, special characters, mistagged fields, common abbreviations in definitions, stemming, multi-word head words, undefined words with common prefixes, undefined hyphenated words l Source words: words never used in definitions l Sink words: undefined words

111 RANLP 2005 111 Algorithm (cont.) 2. Obtain the relative measure of arc importance –For a single edge e from source node s to target node t: r e = ( p s / |a s | ) / p t, where r e is the rank of the edge, p s is the rank of the source node, p t is the rank of the target node, and |a s | is the number of outgoing arcs of s –For more than one arc (m) between s and t: r s,t = ( Σ e=1..m p s / |a s | ) / p t [Figure: example arc from transport to carry]

112 RANLP 2005 112 Algorithm (cont.) 3. Rank the arcs using ArcRank –Rank the importance of arcs with respect to both their source and target nodes –It promotes arcs that are important at both endpoints
ArcRank – Input: triples (source s, target t, importance v s,t )
1. given source s and target t nodes
2. at s, sort the values v s,t j and rank the arcs: rs(v s,t j )
3. at t, sort the values v s i,t and rank the arcs: rt(v s i,t )
4. compute ArcRank: mean( rs(v s,t ), rt(v s,t ) )
Rank Arcs – input: sorted arc importance, e.g. {0.9, 0.75, 0.75, 0.75, 0.6, 0.5, …, 0.1} –equal values take the same rank: {1, 2, 2, 2, …} –number ranks consecutively: {1, 2, 2, 2, 3, …}

113 RANLP 2005 113 Results – an automatically built thesaurus starting with Webster
l Webster: 96,800 terms, 112,897 distinct words; error rates: <1% of original input, <0.05% incorrect arcs (hyphenation), <0.05% incorrect terms (spelling), 0% artificial terms
l WordNet: 99,642 terms, 173,941 word senses; error rates: ~0.1% inappropriate classifications, ~1-10% artificial & repeated terms
l MindNet: 159,000 head words, 713,000 relationships between head words (not publicly available)

114 RANLP 2005 114 Results Analysis – the Webster's repository
PROS l It has a very general structure l It can also address stopwords l It has more relationships than WordNet l It allows for any relationships between words (not only within a lexical category) l It is more "natural": it does not include artificial concepts such as non-existent words and artificial categorizations
CONS l The type of relationships is not always evident l The accuracy increases with the amount of data; nevertheless, the dictionary contains sparse definitions l It only distinguishes senses based on usage, not grammar l It is less precise than other thesauri (e.g. WordNet)

115 RANLP 2005 115 Extractive Summarization Graph-based ranking algorithms for extractive summarization 1. Build a graph: – Sentences in a text = vertices – Similarity between sentences = weighted edges Model the cohesion of text using intersentential similarity 2. Run a graph-based ranking algorithm: – keep top N ranked sentences – = sentences most “recommended” by other sentences l (Mihalcea & Tarau, 2004) (Erkan & Radev 2004)

116 RANLP 2005 116 A Process of Recommendation l Underlying idea: a process of recommendation –A sentence that addresses certain concepts in a text gives the reader a recommendation to refer to other sentences in the text that address the same concepts –Text knitting (Hobbs 1974) “ repetition in text knits the discourse together” –Text cohesion (Halliday & Hasan 1979)

117 RANLP 2005 117 Sentence Similarity l Inter-sentential relationships –weighted edges l Count the number of common concepts between two sentences l Normalize with the lengths of the sentences: Similarity(S i, S j ) = |{w | w ∈ S i and w ∈ S j }| / ( log |S i | + log |S j | ) l Other similarity metrics are also possible: –cosine, string kernels, etc. –work in progress
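A compact sketch of the whole pipeline on a toy text: the overlap-based similarity above, an undirected weighted graph, and the weighted ranking iteration from Part 1 (the sentences, damping factor and threshold are illustrative choices):

import math

def similarity(s1, s2):
    # number of common words, normalized by the log of the sentence lengths
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    if overlap == 0 or min(len(w1), len(w2)) < 2:
        return 0.0
    return overlap / (math.log(len(w1)) + math.log(len(w2)))

def rank_sentences(sentences, d=0.85, tol=0.0001):
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    score = [1.0] * n
    while True:
        new = [(1 - d) + d * sum(w[j][i] / sum(w[j]) * score[j]
                                 for j in range(n) if sum(w[j]) > 0)
               for i in range(n)]
        if max(abs(new[i] - score[i]) for i in range(n)) < tol:
            return sorted(range(n), key=lambda i: -new[i])
        score = new

sentences = ["Hurricane Gilbert swept toward the Dominican Republic Sunday",
             "The storm was approaching the south coast with high winds",
             "The hurricane brought high winds and heavy rains to the coast",
             "Residents were told to prepare for the storm"]
print(rank_sentences(sentences))   # sentence indices, most "recommended" first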

118 RANLP 2005 118 Graph Structure Undirected –No direction established between sentences in the text –A sentence can "recommend" sentences that precede or follow in the text Directed forward –A sentence can "recommend" only sentences that follow in the text –Seems more appropriate for movie reviews, stories, etc. Directed backward –A sentence can "recommend" only sentences that precede in the text –More appropriate for news articles

119 RANLP 2005 119 An Example 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. " There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month. 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. " There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. 
The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month. 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. " There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month. 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. 
Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. " There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month. 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. " There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. 
Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month. 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. " There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month. 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert, 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO, Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. 11. 
" There is no need for alarm, " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday. 12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement. 13. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. 16. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm. 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday. 18. Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico 's south coast. 19. There were no reports of casualties. 20. San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night. 21. On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast. 22. Residents returned home, happy to find little damage from 80 mph winds and sheets of rain. 23. Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane. 24. The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month A text from DUC 2002 on “Hurricane Gilbert” 24 sentences

120 RANLP 2005 120 An Example (cont'd) [Figure: weighted similarity graph over the 24 sentences, with edge weights (e.g. 0.15, 0.27, 0.59) and a ranking score attached to each sentence vertex (e.g. [0.50], [1.83], [1.65])]

121 RANLP 2005 121 Evaluation l Task-based evaluation: automatic text summarization –Single document summarization  100-word summaries –Multiple document summarization  100-word multi-doc summaries  clusters of ~10 documents l Data from DUC (Document Understanding Conference) –DUC 2002 –567 single documents –59 clusters of related documents l Automatic evaluation with ROUGE (Lin & Hovy 2003) –n-gram based evaluations  unigrams found to have the highest correlations with human judgment –no stopwords, stemming

122 RANLP 2005 122 Results: Single Document Summarization l Single-doc summaries for 567 documents (DUC 2002)

123 RANLP 2005 123 Multiple Document Summarization Cascaded summarization (“meta” summarizer) –Use the best single-document summarization algorithms  PageRank (Undirected / Directed Backward)  HITS A (Undirected / Directed Backward) –100-word single document summaries –100-word “summary of summaries” Avoid sentence redundancy: –set a maximum threshold on inter-sentence similarity (0.5); see the sketch below Evaluation: –build summaries for 59 clusters of ~10 documents –compare with the top 5 performing systems at DUC 2002 –baseline: first sentence in each document
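A minimal sketch of the redundancy check mentioned above: sentences are taken in score order, and a sentence is skipped if it is more than 0.5 similar to anything already selected. The similarity function is left as a parameter; the names and the word-counting details are illustrative assumptions, not the tutorial's actual implementation.

```python
def greedy_summary(ranked_ids, sentences, sim, max_words=100, max_sim=0.5):
    """Select sentences in ranked order, skipping near-duplicates.

    ranked_ids: sentence indices ordered by importance.
    sentences:  list of token lists.
    sim:        any sentence-similarity function over two token lists.
    """
    chosen, words = [], 0
    for i in ranked_ids:
        if any(sim(sentences[i], sentences[j]) > max_sim for j in chosen):
            continue  # too similar to an already selected sentence
        chosen.append(i)
        words += len(sentences[i])
        if words >= max_words:
            break
    return [sentences[i] for i in sorted(chosen)]  # restore document order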

124 RANLP 2005 124 Results: Multiple Document Summarization – Multi-doc summaries for 59 clusters (DUC 2002)

125 RANLP 2005 125 Portuguese Single-Document Evaluation 100 news articles from Jornal do Brasil and Folha de São Paulo Create summaries of the same length as the manual summaries Automatic evaluation with ROUGE Same baseline (top sentences in each document)

126 RANLP 2005 126 Graph-based Ranking for Sentence Extraction Unsupervised method for sentence extraction – based exclusively on information drawn from the text itself Goes beyond simple sentence connectivity Gives a ranking over all sentences in a text – can be adapted to longer / shorter summaries No training data required – can be adapted to other languages –Already applied to Portuguese

127 RANLP 2005 127 Subjectivity Analysis for Sentiment Classification l The objective is to detect subjective expressions in text (opinions vs. facts) l Use this information to improve the polarity classification (positive vs. negative) –E.g. movie reviews (see: www.rottentomatoes.com) l Sentiment analysis can be considered a document classification problem, with target classes focusing on the author's sentiment rather than on topic-based categories –Standard machine learning classification techniques can be applied

128 RANLP 2005 128 Subjectivity Extraction

130 RANLP 2005 129 Subjectivity Detection/Extraction l Detecting the subjective sentences in a text may be useful for filtering out the objective sentences, creating a subjective extract l Subjective extracts facilitate the polarity analysis of the text (increased accuracy at reduced input size) l Subjectivity detection can use local and contextual features: –Local: relies on individual sentence classifications using standard machine learning techniques (SVM, Naïve Bayes, etc.) trained on an annotated data set –Contextual: uses context information, e.g. sentences occurring near each other tend to share the same subjectivity status (coherence) l (Pang and Lee, 2004)

130 RANLP 2005 130 Cut-based Subjectivity Classification l Standard classification techniques usually consider only individual features (they classify one sentence at a time). l Cut-based classification takes into account both individual and contextual (structural) features l Suppose we have n items x_1, …, x_n to divide into two classes: C_1 and C_2. l Individual scores: ind_j(x_i) – non-negative estimates of each x_i being in C_j, based on the features of x_i alone l Association scores: assoc(x_i, x_k) – non-negative estimates of how important it is that x_i and x_k be in the same class

131 RANLP 2005 131 Cut-based Classification l Maximize each item's assignment score (its individual score for the class it is assigned to, minus its individual score for the other class), while penalizing the assignment of different classes to highly associated items l Formulated as an optimization problem: assign the items x_i to classes C_1 and C_2 so as to minimize the partition cost: cost(C_1, C_2) = Σ_{x in C_1} ind_2(x) + Σ_{x in C_2} ind_1(x) + Σ_{x_i in C_1, x_k in C_2} assoc(x_i, x_k)
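In code, the partition cost follows directly from these definitions. The sketch below assumes ind1/ind2 are dictionaries of individual scores and assoc maps unordered item pairs to association scores; the names are illustrative only.

```python
def partition_cost(c1, c2, ind1, ind2, assoc):
    """Cost of a split: each item pays its score for the class it did NOT get,
    plus the association score of every pair separated by the split."""
    cost = sum(ind2[x] for x in c1) + sum(ind1[x] for x in c2)
    cost += sum(w for (xi, xk), w in assoc.items()
                if (xi in c1) != (xk in c1))
    return cost
```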

132 RANLP 2005 132 Cut-based Algorithm There are 2^n possible binary partitions of the n elements, so we need an efficient algorithm to solve the optimization problem Build an undirected graph G with vertices {v_1, …, v_n, s, t} and edges: – (s, v_i) with weight ind_1(x_i) – (v_i, t) with weight ind_2(x_i) – (v_i, v_k) with weight assoc(x_i, x_k)

133 RANLP 2005 133 Cut-based Algorithm (cont.) l Cut: a partition of the vertices into two sets S and T, with s in S and t in T. The cost of the cut is the sum of the weights of all edges crossing from S to T l A minimum cut is a cut with minimal cost l A minimum cut can be found using maximum-flow algorithms, with polynomial asymptotic running times l Use the min-cut / max-flow algorithm
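The construction above maps directly onto a max-flow/min-cut routine. The following sketch uses networkx's minimum_cut; items that end up on the source side of the cut are assigned to C_1, the rest to C_2. This is an illustrative implementation with invented names, not the code used in the tutorial or in Pang and Lee's system.

```python
import networkx as nx

def mincut_classify(ind1, ind2, assoc):
    """Classify items into two classes with a minimum s-t cut.

    ind1[x], ind2[x]: non-negative scores for item x belonging to C1 / C2.
    assoc[(x, y)]:    non-negative association score between items x and y.
    """
    g = nx.DiGraph()
    for x in ind1:
        g.add_edge('s', x, capacity=ind1[x])  # cutting this edge costs ind1(x)
        g.add_edge(x, 't', capacity=ind2[x])  # cutting this edge costs ind2(x)
    for (x, y), w in assoc.items():
        g.add_edge(x, y, capacity=w)          # association edges in both directions,
        g.add_edge(y, x, capacity=w)          # so splitting the pair costs w once
    _, (source_side, sink_side) = nx.minimum_cut(g, 's', 't')
    return source_side - {'s'}, sink_side - {'t'}
```

The capacity of the cut is exactly the partition cost defined on the previous slide, so the minimum cut gives the optimal class assignment.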

134 RANLP 2005 134 Cut-based Algorithm (cont.) Notice that without the structural information we would be undecided about the assignment of node M

135 RANLP 2005 135 Subjectivity Extraction l Assign every individual sentence a subjectivity score –e.g. the probability of a sentence being subjective, as assigned by a Naïve Bayes classifier, etc l Assign every sentence pair a proximity or similarity score –e.g. physical proximity = the inverse of the number of sentences between the two entities l Use the min-cut algorithm to classify the sentences into objective/subjective
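To make the pipeline concrete, one possible way to produce the min-cut inputs from the two scores described above, per-sentence subjectivity probabilities (e.g. from a Naïve Bayes classifier) and an inverse-distance proximity score, is sketched below. The window size and function names are assumptions for illustration; the resulting dictionaries can then be fed to a min-cut routine such as the one sketched earlier.

```python
def build_mincut_inputs(subj_probs, window=3):
    """Turn per-sentence subjectivity probabilities into min-cut inputs.

    subj_probs[i]: estimated probability that sentence i is subjective.
    Association between two sentences is the inverse of their distance,
    restricted to a small window (an assumed parameter).
    """
    n = len(subj_probs)
    ind_subj = {i: subj_probs[i] for i in range(n)}        # score for "subjective"
    ind_obj = {i: 1.0 - subj_probs[i] for i in range(n)}   # score for "objective"
    assoc = {}
    for i in range(n):
        for k in range(i + 1, min(i + 1 + window, n)):
            assoc[(i, k)] = 1.0 / (k - i)                  # proximity-based association
    return ind_subj, ind_obj, assoc
```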

136 RANLP 2005 136 Subjectivity Extraction with Min-Cut

137 RANLP 2005 137 Results l 2000 movie reviews (1000 positive / 1000 negative) l The use of subjective extracts improves or maintains the accuracy of the polarity analysis while reducing the input data size

138 RANLP 2005 138 References l Galley, M. and McKeown, K. Improving Word Sense Disambiguation in Lexical Chaining. IJCAI 2003. l Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, MIT Press, 1998. l Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976. l Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986. l Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000. l Mihalcea, R., Tarau, P. and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004.

139 RANLP 2005 139 References l Mihalcea, R. and Tarau, P. TextRank: Bringing Order Into Texts. EMNLP 2004. l Han, H., Zha, H. and Giles, C.L. Name disambiguation in author citations using a K-way spectral clustering method. ACM/IEEE JCDL 2005. l Radev, D. Weakly-Supervised Graph-based Methods for Classification. U. Michigan Technical Report CSE-TR-500-04, 2004. l Jannink, J. and Wiederhold, G. Thesaurus Entry Extraction from an On-line Dictionary. Fusion 1999, Sunnyvale, CA, 1999. l Pang, B. and Lee, L. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL 2004. l Erkan, G. and Radev, D. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 2004.

140 RANLP 2005 140 Conclusions l Graphs: encode information about nodes and the relations between nodes l Graph-based algorithms can lead to state-of-the-art results in several IR and NLP problems l Graph-based NLP/IR: –at the intersection of graph theory and NLP / IR –many opportunities for novel approaches to traditional problems

