Statistical Models of (Social) Networks Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada
Managing and Understanding Connections of People in our World
Workplace effectiveness ~ ability to leverage one's network of acquaintances.
But filling a Contacts DB by hand is tedious, and the result is incomplete.
[Figure: Inbox and WWW automatically filling the Contacts DB]
System Overview
[Figure: pipeline diagram. Components: Person Name Extraction, Name Coreference, Homepage Retrieval, Contact Info and Person Name Extraction, Keyword Extraction, Social Network Analysis. Other labels: CRF, WWW, names]
An Example
To: “Andrew McCallum” Subject...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
Job Title: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413)
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…
Search for new people
Summary of Results
Contact info and name extraction performance (25 fields)
Columns: Token Acc, Field Prec, Field Recall, Field F1
Row: CRF

Example keywords extracted
William Cohen: Logic programming; Text categorization; Data integration; Rule learning
Daphne Koller: Bayesian networks; Relational models; Probabilistic models; Hidden variables
Deborah McGuinness: Semantic web; Description logics; Knowledge representation; Ontologies
Tom Mitchell: Machine learning; Cognitive states; Learning apprentice; Artificial intelligence

1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
Social Network in an Email Dataset
Outline Social Network Analysis with (Language) Attributes –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers
Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example: θ = 70% Iraq war, 30% US election; z = Iraq war; w = “bombing”
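The generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic-word probabilities, the hyperparameter values, and the names `sample_dirichlet` and `generate_document` are made up for the example and do not come from the talk.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document: a topic mixture, then a (topic, word) pair per token.

    topic_word: list of {word: probability} dicts, one per topic (assumed given).
    alpha: Dirichlet hyperparameters, one per topic (assumed values).
    """
    theta = sample_dirichlet(alpha, rng)            # per-document topic mixture
    topic_ids = list(range(len(topic_word)))
    doc = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]    # topic for this word
        vocab = list(topic_word[z])
        probs = [topic_word[z][v] for v in vocab]
        w = rng.choices(vocab, weights=probs)[0]        # word from that topic
        doc.append((z, w))
    return theta, doc

# Toy topics echoing the slide's example; the probabilities are invented.
TOPIC_WORD = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},   # "Iraq war"
    {"ballot": 0.4, "candidate": 0.4, "poll": 0.2},    # "US election"
]

rng = random.Random(0)
theta, doc = generate_document(TOPIC_WORD, alpha=[1.0, 1.0], n_words=8, rng=rng)
print(theta)
print(doc)
```

Inference runs this process in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.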
Example topics induced from a large collection of text [Tenenbaum et al]
Topic 1: STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
Topic 2: MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
Topic 3: WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Topic 4: DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
Topic 5: FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
Topic 6: SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
Topic 7: BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
Topic 8: JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
From LDA to Author-Recipient-Topic (ART)
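As a rough sketch of the step from LDA to ART: in ART, as commonly described, each word's topic is drawn from a distribution tied to the message's author and one recipient picked uniformly from the recipient list. The code below illustrates that idea only; the data layout (`theta` keyed by author-recipient pairs, `phi` as per-topic word tables), the names, and all probabilities are hypothetical.

```python
import random

def generate_message(author, recipients, theta, phi, n_words, rng):
    """Generate the tokens of one message under an ART-style process.

    theta[(author, recipient)]: topic probabilities for that pair (hypothetical layout).
    phi[z]: {word: probability} for topic z (hypothetical layout).
    """
    tokens = []
    for _ in range(n_words):
        x = rng.choice(recipients)                  # recipient chosen uniformly
        topic_probs = theta[(author, x)]
        z = rng.choices(list(range(len(topic_probs))), weights=topic_probs)[0]
        vocab = list(phi[z])
        w = rng.choices(vocab, weights=[phi[z][v] for v in vocab])[0]
        tokens.append((x, z, w))
    return tokens

# Toy parameters; every number here is invented for illustration.
PHI = [
    {"contract": 0.6, "draft": 0.4},        # toy "legal" topic
    {"pipeline": 0.7, "capacity": 0.3},     # toy "operations" topic
]
THETA = {
    ("alice", "bob"): [0.9, 0.1],
    ("alice", "carol"): [0.2, 0.8],
}

tokens = generate_message("alice", ["bob", "carol"], THETA, PHI, 10, random.Random(1))
print(tokens)
```

Because the topic distribution depends on the author-recipient pair rather than on the document, the same author can talk about different topics with different people.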
Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast
Enron Corpus
250k messages
23k people

Date: Wed, 11 Apr :56: (PDT)
From:
To:
Subject: Enron/TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas
Topics, and prominent senders / receivers discovered by ART (topic names assigned by hand)
Topics, and prominent senders / receivers discovered by ART Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice President of Regulatory Affairs” Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
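Whichever distribution a model assigns to a person (over recipients or over authored topics), role similarity between two people comes down to comparing two probability vectors. The sketch below uses Jensen-Shannon divergence for that comparison; the slide does not name a measure, so that choice is an assumption, and the example vectors are invented.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 only for identical inputs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic distributions for three people (values are made up).
secretary = [0.7, 0.2, 0.1]
vice_president = [0.1, 0.2, 0.7]
other_secretary = [0.6, 0.3, 0.1]

print(js_divergence(secretary, vice_president))   # larger: dissimilar roles
print(js_divergence(secretary, other_secretary))  # smaller: similar roles
```

A small divergence means the two people look alike under the chosen representation, so the same function can be applied to recipient distributions and to topic distributions.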
Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Geaconne = “Secretary”
McCarty = “Vice President”
Models compared: Traditional SNA, Author-Topic, ART
Verdicts shown: Similar roles; Different roles
Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
Geaconne = “Secretary”
Hayslett = “Vice President & CTO”
Traditional SNA: Different roles
Author-Topic: Very similar
ART: Not very similar
Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
Traditional SNA: Different roles
Author-Topic: Very different
ART: Very similar
McCallum Corpus 2004
January - October
k messages
825 people

From:
Subject: NIPS and....
Date: June 14, :27:41 PM EDT
To:

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt.
Thanks, Kate
McCallum Blockstructure
Four most prominent topics in discussions with ____?
Two most prominent topics in discussions with ____?
Pairs with highest rank difference between ART & SNA 5 other professors 3 other ML researchers
Role-Author-Recipient-Topic Models
Results with RART: People in “Role #3” in Academic Email
olc: lead Linux sysadmin
gauthier: sysadmin for CIIR group
irsystem: mailing list CIIR sysadmins
system: mailing list for dept. sysadmins
allan: Prof., chair of “computing committee”
valerie: second Linux sysadmin
tech: mailing list for dept. hardware
steve: head of dept. I.T. support
Roles for allan (James Allan)
Role #3: I.T. support
Role #2: Natural Language researcher

Roles for pereira (Fernando Pereira)
Role #2: Natural Language researcher
Role #4: SRI CALO project participant
Role #6: Grant proposal writer
Role #10: Grant proposal coordinator
Role #8: Guests at McCallum’s house
Summary
Traditionally, SNA examines links, but not the language content on those links.
This talk introduced ART, a Bayesian network model for messages sent in a social network: it captures topics and role-similarity.
RART explicitly represents roles.
Future work:
–Explicitly model & discover roles and groups
–Integrate with coreference and relation extraction
–Model correlations and topic/group trends over time
ART: Roles but not Groups
Enron TransWestern Division
[Figure: matrices for Traditional SNA, Author-Topic, and ART; panel labels: Block structured, Not]
Outline Social Network Analysis with (Language) Attributes –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers
Groups and Topics
Input:
–Observed relations between people
–Attributes on those relations (text, or categorical)
Output:
–Attributes clustered into “topics”
–Groups of people, varying depending on topic
Discovering Groups from Observed Set of Relations
Admiration relations among six high school students.
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Adjacency Matrix Representing Relations
[Figure: 6×6 adjacency matrices over students A to F, first in roster order (A B C D E F), then reordered as A C B D E F with group labels G1, G2, G3]
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
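To make the reordering concrete, the sketch below builds the adjacency matrix for the Academic Admiration relations listed on the slide, once in roster order and once in the order A C B D E F. The function name `adjacency` is illustrative; the relations are copied from the slide.

```python
ROSTER = ["A", "B", "C", "D", "E", "F"]
ACAD = [
    ("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
    ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
    ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C"),
]

def adjacency(order, relations):
    """Return a 0/1 matrix where entry [i][j] is 1 if order[i] admires order[j]."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix

roster_view = adjacency(ROSTER, ACAD)
grouped_view = adjacency(["A", "C", "B", "D", "E", "F"], ACAD)

for row in grouped_view:
    print(" ".join(str(cell) for cell in row))
```

In the reordered matrix, rows that belong to the same pair of students are identical and the ones fall into 2×2 blocks, which is the block structure a group model tries to recover without being told the ordering.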
Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
[Figure: graphical model; node distributions: Beta, Dirichlet, Binomial, Multinomial]
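A minimal generative sketch of a stochastic blockstructure model follows, using the four distributions named on the slide. How they are wired together here (Dirichlet prior on group proportions, Multinomial group assignment, Beta prior on each group-pair link probability, single-trial Binomial for each relation) is an assumption for illustration, as are the function name and hyperparameter defaults.

```python
import random

def sample_blockmodel(n_entities, n_groups, rng, alpha=1.0, a=1.0, b=1.0):
    """Sample group assignments and a directed 0/1 relation matrix."""
    # Group proportions ~ Dirichlet(alpha), via normalised Gamma draws.
    raw = [rng.gammavariate(alpha, 1.0) for _ in range(n_groups)]
    total = sum(raw)
    proportions = [r / total for r in raw]

    # Each entity's group ~ Multinomial(proportions).
    group_ids = list(range(n_groups))
    groups = [rng.choices(group_ids, weights=proportions)[0]
              for _ in range(n_entities)]

    # Link probability for each ordered pair of groups ~ Beta(a, b).
    link_prob = [[rng.betavariate(a, b) for _ in range(n_groups)]
                 for _ in range(n_groups)]

    # Each relation is a single Bernoulli trial governed by the two groups.
    relations = [[0] * n_entities for _ in range(n_entities)]
    for i in range(n_entities):
        for j in range(n_entities):
            if i != j and rng.random() < link_prob[groups[i]][groups[j]]:
                relations[i][j] = 1
    return groups, link_prob, relations

groups, link_prob, relations = sample_blockmodel(6, 3, random.Random(0))
print(groups)
for row in relations:
    print(row)
```

The key property is that the probability of a relation depends only on the groups of the two entities, not on the entities themselves.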
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
[Figure: adjacency matrix ordered A C B D E F, with groups G1, G2, G3]
Social Admiration: Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E) Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: adjacency matrix ordered A C E B D F, with groups G1, G2]
Simple Topic Model: Good for Single Topic Documents
Mixture of Unigrams
D: number of documents
T: number of topics
N_d: number of tokens in document d
[Figure: graphical model; node distributions: Dirichlet, Multinomial, Uniform]
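The difference from LDA is that a mixture of unigrams assigns a single topic to the whole document. A minimal sketch, assuming the topic is chosen uniformly and all words are then drawn from that topic's word distribution; the toy word lists and probabilities are invented, and the function name is illustrative.

```python
import random

def generate_single_topic_document(topic_word, n_words, rng):
    """Pick one topic uniformly, then draw every word from that topic."""
    t = rng.randrange(len(topic_word))              # one topic for the document
    vocab = list(topic_word[t])
    weights = [topic_word[t][w] for w in vocab]
    words = rng.choices(vocab, weights=weights, k=n_words)
    return t, words

# Toy topics; the probabilities are made up.
TOPICS = [
    {"budget": 0.4, "funding": 0.3, "annual": 0.2, "cash": 0.1},
    {"document": 0.4, "corrections": 0.3, "review": 0.2, "annual": 0.1},
]

t, words = generate_single_topic_document(TOPICS, 6, random.Random(2))
print(t, words)
```

This restriction suits short attribute texts such as titles or index terms, where one topic per item is a reasonable simplification.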
Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
Example attribute words: budget, funding, annual, cash
Example attribute words: document, corrections, review, annual
The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Figure: graphical model combining the topic part (Dirichlet, Multinomial, Uniform) and the group part (Beta, Dirichlet, Binomial, Multinomial)]
Inference and Estimation
Gibbs Sampling:
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
Dataset #1: U.S. Senate
–16 years of voting records in the US Senate (1989 – 2005)
–A Senator may respond Yea or Nay to a resolution
–3423 resolutions with text attributes (index terms)
–191 Senators in total across 16 years

Example resolution: S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No:
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……
Topics Discovered (U.S. Senate)

Mixture of Unigrams
Education: education, school, aid, children, drug, students, elementary, prevention
Energy: energy, power, water, nuclear, gas, petrol, research, pollution
Military Misc.: government, military, foreign, tax, congress, aid, law, policy
Economic: federal, labor, insurance, aid, tax, business, employee, care

Group-Topic Model
Education + Domestic: education, school, federal, aid, government, tax, energy, research
Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
Economic: labor, insurance, tax, congress, income, minimum, wage, business
Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance
Groups Discovered (US Senate) Groups from topic Education + Domestic
Senators Who Change Coalition the Most Depending on Topic
e.g. Senator Shelby (D-AL) votes
–with the Republicans on Economic
–with the Democrats on Education + Domestic
–with a small group of maverick Republicans on Social Security + Medicare
Dataset #2: The UN General Assembly
–Voting records of the UN General Assembly ( )
–A country may choose to vote Yes, No or Abstain
–931 resolutions with text attributes (titles)
–192 countries in total
–Also experiments later with resolutions from

Example: Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)

Mixture of Unigrams
Everything Nuclear: nuclear, weapons, use, implementation, countries
Human Rights: rights, human, palestine, situation, israel
Security in Middle East: occupied, israel, syria, security, calls

Group-Topic Model
Nuclear Non-proliferation: nuclear, states, united, weapons, nations
Nuclear Arms Race: arms, prevention, race, space
Human Rights: rights, human, palestine, occupied, israel
Groups Discovered (UN)
The countries in each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
Do We Get Better Groups with the GT Model?
Baseline Model:
1. Cluster bills into topics using mixture of unigrams;
2. Apply group model on topic-specific subsets of bills.
GT Model:
1. Jointly cluster topics and groups at the same time using the GT model.
Agreement Index (AI) measures group cohesion. Higher is better.
Columns: Datasets, Avg. AI for Baseline, Avg. AI for GT, p-value
Senate: p-value <.01
UN: p-value <.01
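The slide does not spell out how the Agreement Index is computed. One commonly used definition, assumed here, scores a group's vote on a single resolution from its counts of Yes, No and Abstain votes: it equals 1 when the group is unanimous and 0 when the votes split evenly three ways. The function name is illustrative.

```python
def agreement_index(yes, no, abstain):
    """Agreement Index for one group on one vote (assumed definition).

    AI = (max - (total - max) / 2) / total, where max is the largest of
    the three counts and total is their sum.
    """
    total = yes + no + abstain
    if total == 0:
        raise ValueError("group cast no votes")
    top = max(yes, no, abstain)
    return (top - 0.5 * (total - top)) / total

print(agreement_index(10, 0, 0))  # unanimous group
print(agreement_index(4, 4, 4))   # evenly split group
print(agreement_index(6, 2, 2))   # partial agreement
```

Averaging this score over the resolutions in a topic, and over groups, gives a single cohesion number per model that can be compared between the baseline and the GT model.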
Groups and Topics, Trends over Time (UN)
Outline Social Network Analysis with (Language) Attributes –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers
Previous Systems
Previous Systems
[Figure: entity-relation diagram: Research Paper, Cites]
More Entities and Relations
[Figure: entity-relation diagram: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise]
Outline Examples of IE and Data Mining. Brief introduction of Conditional Random Fields Joint inference: Motivation and examples –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling for Transfer Learning (Piecewise Training & BP) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) Joint Topic Discovery and Social Network Analysis –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers
Summary
CRFs: conditional probability structured models
Joint inference can avoid accumulating errors in a pipeline from extraction to data mining
Early examples
–Factorial finite state models
–Jointly labeling distant entities
–Coreference analysis
–Segmentation uncertainty aiding coreference
Contact management, expert-finding, SNA
–Discover topics, roles, & groups from text and relational data.
New research paper search engine coming soon.
End of Talk
Summary
Traditionally, SNA examines links, but not the language content on those links.
Presented ART, a Bayesian network for messages sent in a social network: captures topics and role-similarity.
RART explicitly represents roles.
Additional work
–Group-Topic model discovers groups and clusters attributes of relations. [Wang, Mohanty, McCallum, LinkKDD 2005]