Statistical Models of (Social) Networks Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Xuerui Wang, Natasha.

1 Statistical Models of (Social) Networks Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada

3 Automatically Managing and Understanding Connections of People in our Email World
Workplace effectiveness ~ ability to leverage a network of acquaintances. But filling a Contacts DB by hand is tedious and incomplete.
[Figure labels: Email Inbox, Contacts DB, WWW]

4 System Overview
Components: Contact Info and Person Name Extraction, Person Name Extraction, Name Coreference, Homepage Retrieval, Social Network Analysis, Keyword Extraction
[Figure labels: CRF, WWW, names, Email]

5 An Example
To: “Andrew McCallum” mccallum@cs.umass.edu
Subject...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…
Search for new people

6 Summary of Results
Contact info and name extraction performance (25 fields):
CRF: Token Acc 94.50, Field Prec 85.73, Field Recall 76.33, Field F1 80.76
Example keywords extracted:
William Cohen: Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller: Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness: Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell: Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
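
As a quick consistency check on the CRF figures in slide 6, field F1 is the harmonic mean of field precision and field recall. The sketch below recomputes it from the reported numbers; the function name `f1_score` is illustrative and not part of the talk.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field precision and recall from the slide's CRF row (percentages).
field_f1 = f1_score(85.73, 76.33)
print(round(field_f1, 2))  # 80.76, matching the reported Field F1
```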

7 Social Network in an Email Dataset

8 Outline Social Network Analysis with (Language) Attributes –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers

10 Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example: 70% Iraq war, 30% US election (distribution over topics); Iraq war (topic); “bombing” (word)
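
The generative process on slide 10 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the fixed document length are all assumptions made for the example.

```python
import numpy as np


def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                        alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic (assumed symmetric Dirichlet prior).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, topics = [], []
    for _ in range(n_docs):
        # Per-document distribution over topics, theta.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # For each word position: sample a topic z, then a word w from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(w)
        topics.append(z)
    return docs, topics, phi


docs, topics, phi = generate_lda_corpus(n_docs=3, doc_len=10,
                                        n_topics=2, vocab_size=20)
```

Each returned document is an array of word ids, paired with the topic id that generated each word.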

11 Example topics induced from a large collection of text [Tenenbaum et al]
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

13 From LDA to Author-Recipient-Topic (ART)
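
A rough sketch of how ART extends LDA, based on the published description of the model rather than on anything shown in this transcript: each word in a message is generated by first picking one of the message's recipients uniformly, then a topic from the distribution specific to that (author, recipient) pair, then a word from the topic. The array layout (`theta[author, recipient]`, `phi[topic]`), the function name, and the toy sizes are assumptions for illustration.

```python
import numpy as np


def generate_art_message(author, recipients, theta, phi, n_words, rng):
    """Sample (recipient, topic, word) triples for one message.

    theta[a, r] is a topic distribution for the author-recipient pair (a, r);
    phi[t] is a word distribution for topic t.
    """
    triples = []
    for _ in range(n_words):
        x = rng.choice(recipients)                           # recipient, uniform
        z = rng.choice(theta.shape[2], p=theta[author, x])   # topic for the pair
        w = rng.choice(phi.shape[1], p=phi[z])               # word from the topic
        triples.append((int(x), int(z), int(w)))
    return triples


rng = np.random.default_rng(0)
n_people, n_topics, vocab_size = 4, 3, 10
# Toy parameters drawn from flat Dirichlet priors (an assumption).
theta = rng.dirichlet(np.ones(n_topics), size=(n_people, n_people))
phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)
message = generate_art_message(author=0, recipients=[1, 2],
                               theta=theta, phi=phi, n_words=5, rng=rng)
```

Compared with LDA, the only structural change in this sketch is that the topic distribution is indexed by the author and a sampled recipient instead of by the document.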

14 Inference and Estimation Gibbs Sampling: - Easy to implement - Reasonably fast
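
To give a feel for why Gibbs sampling is "easy to implement", here is a compact collapsed Gibbs sampler for plain LDA. Note that this is a stand-in: the slide refers to inference in ART, whose sampler also resamples a recipient per word, and the transcript does not give its update equations. Function name, hyperparameters, and the toy corpus are assumptions.

```python
import numpy as np


def lda_gibbs(docs, n_topics, vocab_size, n_iters=50,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the final count matrices."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # topic counts per document
    topic_word = np.zeros((n_topics, vocab_size))  # word counts per topic
    topic_total = np.zeros(n_topics)               # total words per topic
    assignments = []
    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        assignments.append(z_d)
        for w, z in zip(doc, z_d):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # Conditional over topics with theta and phi integrated out.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                z = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, vocab_size=4, n_iters=5)
```

The whole sampler is a loop of "decrement counts, sample, increment counts", which is the property the slide's bullet points allude to.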

15 Enron Email Corpus
250k email messages, 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp. Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com

16 Topics, and prominent senders / receivers discovered by ART Topic names, by hand

17 Topics, and prominent senders / receivers discovered by ART Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice President of Regulatory Affairs” Steffes = “Vice President of Government Affairs”

18 Comparing Role Discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
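
One way to turn the per-person distributions on slide 18 into a role-similarity score is to compare them with a divergence measure. The sketch below uses Jensen-Shannon divergence; the transcript does not say which measure was used, so treat that choice, the function name, and the toy distributions as assumptions.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical topic distributions for two people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
similar = js_divergence(person_a, person_b)      # small value: similar roles
different = js_divergence([1, 0, 0], [0, 0, 1])  # 1.0: disjoint topic usage
```

Lower divergence means the two people use topics (or address recipients) in a more similar way, which is the sense of "similar roles" in the comparisons that follow.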

19 Comparing Role Discovery: Tracy Geaconne ⇔ Dan McCarty
Traditional SNA / Author-Topic / ART
Similar roles / Different roles
Geaconne = “Secretary”; McCarty = “Vice President”

20 Comparing Role Discovery: Tracy Geaconne ⇔ Rod Hayslett
Traditional SNA / Author-Topic / ART
Different roles / Very similar / Not very similar
Geaconne = “Secretary”; Hayslett = “Vice President & CTO”

21 Comparing Role Discovery: Lynn Blair ⇔ Kimberly Watson
Traditional SNA / Author-Topic / ART
Different roles / Very different / Very similar
Blair = “Gas pipeline logistics”; Watson = “Pipeline facilities planning”

22 McCallum Email Corpus 2004
January - October 2004; 23k email messages; 825 people
From: kate@cs.umass.edu
Subject: NIPS and....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt.
Thanks, Kate

23 McCallum Email Blockstructure

24 Four most prominent topics in discussions with ____?

26 Two most prominent topics in discussions with ____?

28 Pairs with highest rank difference between ART & SNA 5 other professors 3 other ML researchers

29 Role-Author-Recipient-Topic Models

30 Results with RART: People in “Role #3” in Academic Email
olc: lead Linux sysadmin
gauthier: sysadmin for CIIR group
irsystem: mailing list CIIR sysadmins
system: mailing list for dept. sysadmins
allan: Prof., chair of “computing committee”
valerie: second Linux sysadmin
tech: mailing list for dept. hardware
steve: head of dept. I.T. support

31 Roles for allan (James Allan)
Role #3: I.T. support
Role #2: Natural Language researcher
Roles for pereira (Fernando Pereira)
Role #2: Natural Language researcher
Role #4: SRI CALO project participant
Role #6: Grant proposal writer
Role #10: Grant proposal coordinator
Role #8: Guests at McCallum’s house

32 Summary Traditionally, SNA examines links, but not the language content on those links. This talk introduced ART, a Bayesian network model for messages sent in a social network: it captures topics and role-similarity. RART explicitly represents roles. Future work: –Explicitly model & discover roles and groups –Integrate with coreference and relation extraction –Model correlations and topic/group trends over time

33 ART: Roles but not Groups
Enron TransWestern Division
Traditional SNA / Author-Topic / ART
Block structured / Not

34 Outline Social Network Analysis with (Language) Attributes –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers

35 Groups and Topics Input: –Observed relations between people –Attributes on those relations (text, or categorical) Output: –Attributes clustered into “topics” –Groups of people---varying depending on topic

36 Discovering Groups from Observed Set of Relations Admiration relations among six high school students. Student Roster Adams Bennett Carter Davis Edwards Frederking Academic Admiration Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)

37 Adjacency Matrix Representing Relations
[Figure: adjacency matrices over students A-F, shown in roster order (A B C D E F), then with group labels (A: G1, B: G2, C: G1, D: G2, E and F: G3), then reordered by group (A C B D E F).]
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
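
The adjacency matrices on slide 37 can be rebuilt from the listed Acad relations. The sketch below constructs the matrix in roster order and again in the grouped order A C B D E F; the helper name `adjacency` is illustrative.

```python
import numpy as np

students = ["A", "B", "C", "D", "E", "F"]
# Academic admiration relations as listed on the slide: (admirer, admired).
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]


def adjacency(order, relations):
    """Build a 0/1 matrix with rows and columns arranged in the given order."""
    index = {name: i for i, name in enumerate(order)}
    mat = np.zeros((len(order), len(order)), dtype=int)
    for src, dst in relations:
        mat[index[src], index[dst]] = 1
    return mat


roster_matrix = adjacency(students, acad)
grouped_matrix = adjacency(["A", "C", "B", "D", "E", "F"], acad)
print(grouped_matrix)
```

In the grouped ordering the ones fall into solid 2x2 blocks: {A, C} admire {B, D}, {B, D} admire {E, F}, and {E, F} admire {A, C}.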

39 Group Model: Partitioning Entities into Groups Stochastic Blockstructures for Relations [Nowicki, Snijders 2001] S: number of entities G: number of groups Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004] Beta Dirichlet Binomial Multinomial
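
The distribution labels on slide 39 (Dirichlet, Multinomial, Beta, Binomial) suggest the following generative reading of a stochastic blockstructure model, sketched here under assumed flat priors: group proportions come from a Dirichlet, each entity's group from a multinomial, each group pair's link probability from a Beta, and each relation from a single-trial binomial. Names and hyperparameters are illustrative, not taken from the talk.

```python
import numpy as np


def sample_blockmodel(n_entities, n_groups, seed=0):
    """Sample group assignments and a directed 0/1 relation matrix."""
    rng = np.random.default_rng(seed)
    # Dirichlet prior over group proportions, multinomial draw per entity.
    group_probs = rng.dirichlet(np.ones(n_groups))
    groups = rng.choice(n_groups, size=n_entities, p=group_probs)
    # Beta prior on the link probability for every ordered pair of groups.
    link_prob = rng.beta(1.0, 1.0, size=(n_groups, n_groups))
    # probs[i, j] is the link probability for the groups of entities i and j.
    probs = link_prob[groups][:, groups]
    relations = rng.binomial(1, probs)
    np.fill_diagonal(relations, 0)  # no self-relations in this sketch
    return groups, link_prob, relations


groups, link_prob, relations = sample_blockmodel(n_entities=6, n_groups=3)
```

Here S corresponds to `n_entities` and G to `n_groups`; inference runs this process in reverse, recovering `groups` from an observed relation matrix.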

40 Two Relations with Different Attributes
[Figure: adjacency matrices for the two relations. Academic admiration is ordered A C B D E F with groups G1, G2, G3; social admiration is ordered A C E B D F with groups G1, G2.]
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Social Admiration: Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E) Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)

41 D: number of documents T: number of topics : number of tokens in document d Simple Topic Model: Good for Single Topic Documents Mixture of Unigrams Dirichlet Multinomial Uniform

42 Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics. budget, funding, annual, cash document, corrections, review, annual

43 The Group-Topic Model: Discovering Groups and Topics Simultaneously Dirichlet Multinomial Uniform Beta Dirichlet Binomial Multinomial

44 Inference and Estimation Gibbs Sampling: - Many r.v.s can be integrated out - Easy to implement - Reasonably fast We assume the relationship is symmetric.

45 Dataset #1: U.S. Senate
- 16 years of voting records in the US Senate (1989 – 2005)
- A Senator may respond Yea or Nay to a resolution
- 3423 resolutions with text attributes (index terms)
- 191 Senators in total across 16 years
Example: S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……
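
A group model needs pairwise relations, while the Senate data records one vote per senator per resolution. One plausible conversion, assumed here rather than stated in the transcript, is to relate two senators on a resolution when they cast the same vote, which yields a symmetric relation. The sketch uses only the seven votes on S.543 that the slide lists; the function name is hypothetical.

```python
import numpy as np

# The seven votes on S.543 shown on the slide (the slide's list is truncated).
votes = {"Adams": "Nay", "Akaka": "Yea", "Bentsen": "Yea", "Biden": "Yea",
         "Bond": "Yea", "Bradley": "Nay", "Conrad": "Nay"}


def agreement_matrix(votes):
    """Return names and a symmetric 0/1 matrix: 1 where two voters agree."""
    names = sorted(votes)
    cast = np.array([votes[name] for name in names])
    agree = (cast[:, None] == cast[None, :]).astype(int)
    return names, agree


names, agree = agreement_matrix(votes)
print(names)
print(agree)
```

Stacking one such matrix per resolution, together with each resolution's index terms, gives the kind of input (relations plus text attributes) that slide 35 describes.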

46 Topics Discovered (U.S. Senate)
Mixture of Unigrams:
Education: education, school, aid, children, drug, students, elementary, prevention
Energy: energy, power, water, nuclear, gas, petrol, research, pollution
Military Misc.: government, military, foreign, tax, congress, aid, law, policy
Economic: federal, labor, insurance, aid, tax, business, employee, care
Group-Topic Model:
Education + Domestic: education, school, federal, aid, government, tax, energy, research
Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
Economic: labor, insurance, tax, congress, income, minimum, wage, business
Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance

47 Groups Discovered (US Senate) Groups from topic Education + Domestic

48 Senators Who Change Coalition the Most Dependent on Topic. E.g. Senator Shelby (D-AL) votes with the Republicans on Economic, with the Democrats on Education + Domestic, and with a small group of maverick Republicans on Social Security + Medicare.

49 Dataset #2: The UN General Assembly
- Voting records of the UN General Assembly (1990 - 2003)
- A country may choose to vote Yes, No or Abstain
- 931 resolutions with text attributes (titles)
- 192 countries in total
- Also experiments later with resolutions from 1960-2003
Example: Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

50 Topics Discovered (UN)
Mixture of Unigrams:
Everything Nuclear: nuclear, weapons, use, implementation, countries
Human Rights: rights, human, palestine, situation, israel
Security in Middle East: occupied, israel, syria, security, calls
Group-Topic Model:
Nuclear Non-proliferation: nuclear, states, united, weapons, nations
Nuclear Arms Race: arms, prevention, race, space
Human Rights: rights, human, palestine, occupied, israel

51 Groups Discovered (UN) The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.

52 Do We Get Better Groups with the GT Model?
Baseline Model: 1. Cluster bills into topics using mixture of unigrams; 2. Apply group model on topic-specific subsets of bills.
GT Model: 1. Jointly cluster topics and groups at the same time using the GT model.
Agreement Index (AI) measures group cohesion. Higher is better.
Dataset: Avg. AI for Baseline / Avg. AI for GT / p-value
Senate: 0.8198 / 0.8294 / <.01
UN: 0.8548 / 0.8664 / <.01
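
The transcript does not define the Agreement Index. A commonly used definition for a group voting on one resolution, assumed here, is AI = (max(y, n, a) - ½[(y + n + a) - max(y, n, a)]) / (y + n + a), where y, n, and a are the group's Yes, No, and Abstain counts. A minimal sketch with hypothetical counts:

```python
def agreement_index(yes: int, no: int, abstain: int = 0) -> float:
    """Agreement Index of one group on one vote (assumed definition).

    Returns 1.0 when the group is unanimous and 0.0 when it splits evenly
    three ways between Yes, No, and Abstain.
    """
    total = yes + no + abstain
    if total == 0:
        raise ValueError("group cast no votes")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total


unanimous = agreement_index(10, 0, 0)   # 1.0
three_way = agreement_index(4, 4, 4)    # 0.0
two_way = agreement_index(5, 5)         # 0.25
```

Averaging this value over groups and resolutions gives a cohesion figure of the kind reported in the table, under the assumption that this is the definition the authors used.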

53 Groups and Topics, Trends over Time (UN)

54 Outline Social Network Analysis with (Language) Attributes –Roles and Topics (Author-Recipient-Topic Model) –Groups and Topics (Group-Topic Model) Demo: Rexa, a Web portal for researchers

55 Previous Systems

57 Research Paper Cites Previous Systems

58 Research Paper Cites Person UniversityVenue Grant Groups Expertise More Entities and Relations

71 Outline
Examples of IE and Data Mining.
Brief introduction of Conditional Random Fields
Joint inference: Motivation and examples
–Joint Labeling of Cascaded Sequences (Belief Propagation)
–Joint Labeling for Transfer Learning (Piecewise Training & BP)
–Joint Labeling of Distant Entities (BP by Tree Reparameterization)
–Joint Co-reference Resolution (Graph Partitioning)
–Joint Segmentation and Co-ref (Sparse BP)
Joint Topic Discovery and Social Network Analysis
–Roles and Topics (Author-Recipient-Topic Model)
–Groups and Topics (Group-Topic Model)
Demo: Rexa, a Web portal for researchers

72 Summary
CRFs: conditional probability structured models
Joint inference can avoid accumulating errors in a pipeline from extraction to data mining
Early examples
–Factorial finite state models
–Jointly labeling distant entities
–Coreference analysis
–Segmentation uncertainty aiding coreference
Email, contact management, expert-finding, SNA
–Discover topics, roles, & groups from text and relational data.
New research paper search engine coming soon.

73 End of Talk

74 Summary Traditionally, SNA examines links, but not the language content on those links. Presented ART, a Bayesian network for messages sent in a social network: captures topics and role-similarity. RART explicitly represents roles. Additional work –Group-Topic model discovers groups and clusters attributes of relations. [Wang, Mohanty, McCallum, LinkKDD 2005]

