
Research Overview for Harvard Medical Library Andrew McCallum Associate Professor Computer Science Department University of Massachusetts Amherst.


1 Research Overview for Harvard Medical Library Andrew McCallum Associate Professor Computer Science Department University of Massachusetts Amherst

2 Outline Self & Lab Intro Information Extraction and Data Mining. Research vignette: Conditional Random Fields Research vignette: Social network analysis & Topic models Demo: Rexa, a new Web portal for research literature Future work, brainstorming, collaboration, next steps

3 Personal History PhD, University of Rochester –Machine Learning, “Reinforcement Learning”, –Eye movements and short-term memory Postdoc, Carnegie Mellon University –Machine Learning for Text, with Tom Mitchell –WebKB Project (met Hooman there) Research Scientist, Just Research Labs –Information extraction from text, clustering... –Built Cora, a precursor to CiteSeer, in 1997. VP Research & Development, WhizBang Labs –Information extraction from the Web, ~$50m VC funding, 170 people –Job search subsidiary, FlipDog.com, sold to Monster.com Associate Professor, UMass Amherst –#5 CS department in Artificial Intelligence –Strong ML & NSF Center for Intelligent Information Retrieval

4 Information Extraction & Synthesis Laboratory (IESL) Assoc. Prof. Andrew McCallum, Director 9 PhD students 2 postdocs 3 undergraduates 2 full-time staff programmers 40+ publications in the past 2 years Grants from NSF, DARPA, DHS, Microsoft, IBM, IMS. Collaborations with BBN, Aptima, BAE, IBM, SRI,... MIT, Stanford, CMU, Princeton, UPenn, UWash,... ~70 compute servers, ~10 Tb disk storage

5 Outline Self & Lab Intro Information Extraction and Data Mining. Research vignette: Conditional Random Fields Research vignette: Social network analysis & Topic models Demo: Rexa, a new Web portal for research literature Future work, brainstorming, collaboration, next steps

6 Goal of my research Mine actionable knowledge from unstructured text.

7 Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

8 A Portal for Job Openings

9 Job Openings: Category = High Tech Keyword = Java Location = U.S.

10 Data Mining the Extracted Job Information

11 IE from Chinese Documents regarding Weather Department of Terrestrial System, Chinese Academy of Sciences. 200k+ documents, several millennia old: Qing Dynasty archives, memos, newspaper articles, diaries

12 IE from Research Papers [McCallum et al ‘98]

13 IE from Research Papers

14 Mining Research Papers [Giles et al] [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

18 What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted records:
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..

19 Larger Context
Spider → Document collection → IE (segment, classify, associate, cluster) → Database → Data Mining (discover patterns: entity types, links / relations, events) → Actionable knowledge (filter, prediction, outlier detection, decision support)

20 Outline Self & Lab Intro Information Extraction and Data Mining. Research vignette: Conditional Random Fields Research vignette: Social network analysis & Topic models Demo: Rexa, a new Web portal for research literature Future work, brainstorming, collaboration, next steps

21 Hidden Markov Models
Finite state model / graphical model: hidden states S t-1, S t, S t+1, ... emit observations O t-1, O t, O t+1, ...
Parameters, for all states S = {s 1, s 2, …}:
–Start state probabilities: P(s 1)
–Transition probabilities: P(s t | s t-1)
–Observation (emission) probabilities: P(o t | s t), usually a multinomial over an atomic, fixed alphabet
Generates: a state sequence (transitions) and an observation sequence o 1 o 2 o 3 o 4 o 5 o 6 o 7 o 8 (observations).
Training: maximize probability of training observations (w/ prior).
HMMs: the standard sequence modeling tool in genomics, music, speech, NLP, …

22 IE with Hidden Markov Models
Given a sequence of observations: "Yesterday Rich Caruana spoke this example sentence."
and a trained HMM (states: person name, location name, background),
find the most likely state sequence (Viterbi).
Any words said to be generated by the designated "person name" state are extracted as a person name. Person name: Rich Caruana
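The Viterbi decoding step above can be sketched in a few lines. The two states, the tiny vocabulary, and every probability below are made-up toy numbers for illustration, not a trained model:

```python
import numpy as np

# Toy HMM for person-name extraction; all numbers are hypothetical.
states = ["background", "person"]
start = np.array([0.9, 0.1])
trans = np.array([[0.8, 0.2],   # background -> background, person
                  [0.4, 0.6]])  # person -> background, person
vocab = {"Yesterday": 0, "Rich": 1, "Caruana": 2, "spoke": 3}
emit = np.array([[0.45, 0.05, 0.05, 0.45],   # background emissions
                 [0.02, 0.49, 0.47, 0.02]])  # person emissions

def viterbi(tokens):
    """Most likely state sequence for the observed tokens (log-space DP)."""
    obs = [vocab[t] for t in tokens]
    T, S = len(obs), len(states)
    delta = np.zeros((T, S))            # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + np.log(trans[:, s])
            back[t, s] = scores.argmax()
            delta[t, s] = scores.max() + np.log(emit[s, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

labels = viterbi(["Yesterday", "Rich", "Caruana", "spoke"])
# Tokens decoded as "person" would be extracted as the name.
```

With these toy parameters, "Rich" and "Caruana" decode to the person state and the surrounding words to background.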

23 (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001] (500 citations)
Undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.
Finite state / graphical model: FSM states y t (output seq) over observations x t (input seq), e.g. input "said Jones a Microsoft VP …" with output "OTHER PERSON OTHER ORG TITLE …"
Wide-spread interest, positive experimental results in many applications:
–Noun phrase, Named entity [HLT'03], [CoNLL'03]
–Asian word segmentation [COLING'04], [ACL'04]
–IE from Research papers [HLT'04]
–IE from Bioinformatics text [Bioinformatics '04], …
–Protein structure prediction [ICML'04]
–Object classification in images [CVPR '04]

24 Linear-chain CRFs vs. HMMs Comparable computational efficiency for inference Features may be arbitrary functions of any or all observations –Parameters need not fully specify generation of observations; can require less training data –Easy to incorporate domain knowledge
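The "arbitrary feature functions" point is the heart of the CRF advantage. A minimal sketch of unnormalized linear-chain CRF scoring, with hand-written feature functions and invented weights (a real CRF learns the weights from data and decodes with Viterbi rather than brute force):

```python
from itertools import product

# Hypothetical feature functions f(y_prev, y, x, t); weights are made up.
def f_cap_person(y_prev, y, x, t):
    return 1.0 if y == "PER" and x[t][0].isupper() else 0.0

def f_person_person(y_prev, y, x, t):
    return 1.0 if y_prev == "PER" and y == "PER" else 0.0

def f_lower_other(y_prev, y, x, t):
    return 1.0 if y == "OTH" and x[t].islower() else 0.0

features = [f_cap_person, f_person_person, f_lower_other]
weights = [2.0, 1.0, 1.5]
labels = ["PER", "OTH"]

def score(y_seq, x):
    """Unnormalized log-score of one label sequence given the input."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y_seq[t - 1] if t > 0 else "START"
        total += sum(w * f(y_prev, y_seq[t], x, t)
                     for w, f in zip(weights, features))
    return total

def predict(x):
    """Brute-force argmax over all label sequences (fine for tiny inputs)."""
    return max(product(labels, repeat=len(x)), key=lambda y: score(y, x))

x = ["said", "Bill", "Veghte"]
y_best = predict(x)
```

Note how the capitalization feature inspects the raw observation directly, with no need to model how "Bill" was generated; that is the contrast with HMM emissions the slide describes.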

25 Table Extraction from Government Reports [sample fedstats.gov milk-production report shown as plain text; the same document is repeated with CRF labels and features on the next slide]

26 Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov. Sample document:
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
Year | Milk Cows (1,000 head) 1/ | Milk per cow (pounds) | Milkfat per cow (pounds) | Percentage of fat in all milk produced | Total milk (million pounds) | Total milkfat (million pounds) 2/
1993 | 9,589 | 15,704 | 575 | 3.66 | 150,582 | 5,514.4
1994 | 9,500 | 16,175 | 592 | 3.66 | 153,664 | 5,623.7
1995 | 9,461 | 16,451 | 602 | 3.66 | 155,644 | 5,694.3
1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.
CRF labels (12 in all): Non-Table, Table Title, Table Header, Table Data Row, Table Section Data Row, Table Footnote, ...
Features: percentage of digit chars; percentage of alpha chars; indented; contains 5+ consecutive spaces; whitespace in this line aligns with previous line; ...; conjunctions of all previous features at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.
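Several of the listed line features are easy to compute directly. The implementation below is a naive sketch of a few of them, not the paper's actual feature extractor:

```python
def line_features(line, prev_line=""):
    """A few of the slide's feature types, implemented naively (sketch)."""
    n = max(len(line), 1)
    digits = sum(c.isdigit() for c in line)
    alphas = sum(c.isalpha() for c in line)
    feats = {
        "pct_digit": digits / n,                 # percentage of digit chars
        "pct_alpha": alphas / n,                 # percentage of alpha chars
        "indented": line.startswith(" "),
        "five_spaces": "     " in line,          # 5+ consecutive spaces
    }
    # Whitespace alignment with the previous line: same columns are blank.
    aligned = [i for i, c in enumerate(line)
               if c == " " and i < len(prev_line) and prev_line[i] == " "]
    feats["aligned_ws"] = len(aligned) > 0
    return feats

# A simplified pair of data rows in the style of the milk table.
f = line_features("1995 :  9,461   16,451   602",
                  "1994 :  9,500   16,175   592")
```

Digit-heavy, whitespace-aligned lines like these are exactly what pushes the CRF toward the Table Data Row label.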

27 Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Line labels, percent correct:
HMM 65%
Stateless MaxEnt 85%
CRF 95%

28 IE from Research Papers [McCallum et al ‘99]

29 IE from Research Papers, field-level F1:
Hidden Markov Models (HMMs) 75.6 [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs) 89.7 [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs) 93.9 [Peng, McCallum, 2004], a ~40% relative error reduction
(Word-level accuracy is >99%)
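The ~40% figure is the relative error reduction implied by the F1 numbers on the slide, which is quick to check:

```python
# Relative error reduction from the SVM result to the CRF result,
# using the slide's field-level F1 numbers.
svm_err = 100 - 89.7   # remaining field-level error under the SVM
crf_err = 100 - 93.9   # remaining field-level error under the CRF
reduction = (svm_err - crf_err) / svm_err   # ~0.41, i.e. roughly 40%
```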

30 Other Successful Applications of CRFs Information Extraction of gene & protein names from text –Winning teams from UPenn, UWisc, Stanford –...using UMass CRF software Gene finding in DNA sequences –[Culotta, Kulp, McCallum 2005] –New work at UPenn Computer vision, OCR, music, robotics, reference matching, author resolution,... protein fold recognition,...

31 Automatically Annotating MedLine Abstracts

32 CRF String Edit Distance
[Figure: an alignment of x 1 = "William W. Cohon" and x 2 = "Willlleam Cohen", with per-character edit operations (copy, subst, insert, delete) and position indices into each string; the model defines the joint / conditional complete-data likelihood of the alignment and the strings.]
Want to train from a set of string pairs, each labeled one of {match, non-match}:
match "William W. Cohon" / "Willlleam Cohen"
non-match "Bruce D'Ambrosio" / "Bruce Croft"
match "Tommi Jaakkola" / "Tommi Jakola"
match "Stuart Russell" / "Stuart Russel"
non-match "Tom Deitterich" / "Tom Dean"

33 CRF String Edit Distance FSM: states copy, subst, insert, delete.

34 CRF String Edit Distance FSM: from a Start state, two copies of the edit FSM (copy, subst, insert, delete): a "match" branch (m = 1) and a "non-match" branch (m = 0). Trained by maximizing the conditional incomplete-data likelihood.

35 CRF String Edit Distance FSM: x 1 = "Tommi Jaakkola", x 2 = "Tommi Jakola". Probability summed over all alignments in match states: 0.8; in non-match states: 0.2.

36 CRF String Edit Distance FSM: x 1 = "Tom Dietterich", x 2 = "Tom Dean". Probability summed over all alignments in match states: 0.1; in non-match states: 0.9.
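For contrast with the learned CRF version, the classic fixed-cost edit distance with an operation traceback (copy / subst / insert / delete) looks like this; the CRF model replaces these unit costs with learned, feature-based weights and sums over alignments instead of taking the single best one:

```python
def edit_alignment(s1, s2):
    """Classic Levenshtein DP with operation traceback: the fixed-cost
    lattice that the CRF version learns weights over (sketch)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # Trace back into copy / subst / delete / insert operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            ops.append("copy" if s1[i - 1] == s2[j - 1] else "subst")
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    return D[n][m], list(reversed(ops))

dist, ops = edit_alignment("Tommi Jaakkola", "Tommi Jakola")
```

For the slide's "Tommi Jaakkola" / "Tommi Jakola" pair the minimal alignment is twelve copies and two deletions.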

37 Outline Self & Lab Intro Information Extraction and Data Mining. Research vignette: Conditional Random Fields Research vignette: Social network analysis & Topic models Demo: Rexa, a new Web portal for research literature Future work, brainstorming, collaboration, next steps

38 Automatically Managing and Understanding Connections of People in our Email World
Workplace effectiveness ~ ability to leverage network of acquaintances.
But filling a Contacts DB by hand is tedious, and incomplete.
[System diagram: Email Inbox → Contacts DB ← WWW]

39 System Overview [pipeline diagram]: Email → Contact Info and Person Name Extraction (CRF) → Name Coreference → Homepage Retrieval (WWW, names) → Keyword Extraction → Social Network Analysis

40 An Example
To: "Andrew McCallum" mccallum@cs.umass.edu Subject: ...
First Name: Andrew; Middle Name: Kachites; Last Name: McCallum; Job Title: Associate Professor; Company: University of Massachusetts; Street Address: 140 Governor's Dr.; City: Amherst; State: MA; Zip: 01003; Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, ...
Key Words: information extraction, social network, ...
Search for new people

41 Summary of Results
Contact info and name extraction performance (25 fields):
CRF: Token Acc 94.50 | Field Prec 85.73 | Field Recall 76.33 | Field F1 80.76
Example keywords extracted:
William Cohen: logic programming, text categorization, data integration, rule learning
Daphne Koller: Bayesian networks, relational models, probabilistic models, hidden variables
Deborah McGuinness: Semantic Web, description logics, knowledge representation, ontologies
Tom Mitchell: machine learning, cognitive states, learning apprentice, artificial intelligence
1. Expert finding: when solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social network analysis: understand the social structure of your organization. Suggest structural changes for improved efficiency.

42 Social Network in an Email Dataset

43 Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative process: for each document, sample a distribution over topics, θ (e.g. 70% Iraq war, 30% US election); for each word in the doc, sample a topic z, then sample a word w from that topic (e.g. "bombing" from the Iraq war topic).
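The generative story can be sketched directly. The two topics and their word distributions below are invented for illustration, and θ is fixed per document rather than drawn from a Dirichlet as full LDA would do:

```python
import random

random.seed(0)
# Tiny hypothetical topics: each a distribution over a small vocabulary.
topics = {
    "iraq_war":    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    "us_election": {"ballot": 0.4, "senate": 0.4, "campaign": 0.2},
}

def sample_doc(theta, n_words):
    """LDA's per-document generative story: draw a topic z for each
    word position, then a word w from that topic's distribution."""
    doc = []
    names = list(theta)
    for _ in range(n_words):
        z = random.choices(names, weights=[theta[t] for t in names])[0]
        w = random.choices(list(topics[z]),
                           weights=list(topics[z].values()))[0]
        doc.append((z, w))
    return doc

# A document that is 70% Iraq-war, 30% US-election, as in the slide.
doc = sample_doc({"iraq_war": 0.7, "us_election": 0.3}, n_words=10)
```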

44 STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN Example topics induced from a large collection of text FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE [Tennenbaum et al]


46 From LDA to Author-Recipient-Topic (ART)

47 Inference and Estimation Gibbs Sampling: - Easy to implement - Reasonably fast
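A minimal collapsed Gibbs sampler for plain LDA shows the machinery (ART conditions additionally on author and recipient, but the count-and-resample loop is the same). The documents and hyperparameters below are toy values:

```python
import random
from collections import defaultdict

random.seed(1)

def gibbs_lda(docs, K, iters=200, alpha=0.1, beta=0.01):
    """Minimal collapsed Gibbs sampler for LDA (sketch)."""
    V = len({w for d in docs for w in d})
    z = [[random.randrange(K) for _ in d] for d in docs]   # topic assignments
    ndk = [[0] * K for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]   # remove this word's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Collapsed conditional: p(z = j | everything else)
                p = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                     for j in range(K)]
                k = random.choices(range(K), weights=p)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, nkw

docs = [["war", "bombing", "war", "troops"],
        ["ballot", "senate", "ballot", "campaign"],
        ["war", "troops", "bombing"],
        ["senate", "campaign", "ballot"]]
z, nkw = gibbs_lda(docs, K=2)
```

The per-word resampling step is exactly the "easy to implement, reasonably fast" loop the slide refers to; θ and the topic-word multinomials are integrated out and recovered from the counts afterward.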

48 Enron Email Corpus 250k email messages 23k people Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT) From: debra.perlingiere@enron.com To: steve.hooser@enron.com Subject: Enron/TransAltaContract dated Jan 1, 2001 Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP Debra Perlingiere Enron North America Corp. Legal Department 1400 Smith Street, EB 3885 Houston, Texas 77002 dperlin@enron.com

49 Topics, and prominent senders / receivers discovered by ART (topic names assigned by hand)

50 Topics, and prominent senders / receivers discovered by ART Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice President of Regulatory Affairs” Steffes = “Vice President of Government Affairs”

51 Comparing Role Discovery: connection strength (A,B) defined as...
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
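One concrete way to turn two such distributions into a role similarity is a divergence-based score. The measure below (Jensen-Shannon) and the two example distributions are illustrative choices, not necessarily the ones used in the paper:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence: one reasonable way to turn
    two 'distribution over topics/recipients' vectors into a similarity."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 1 - 0.5 * (kl(p, m) + kl(q, m))

# Hypothetical topic distributions for two people.
person_a = [0.6, 0.3, 0.1]
person_b = [0.1, 0.3, 0.6]
sim_self = js_similarity(person_a, person_a)   # identical roles -> 1.0
sim_pair = js_similarity(person_a, person_b)   # different roles -> lower
```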

52 Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty (Geaconne = "Secretary", McCarty = "Vice President"). Traditional SNA: similar roles; ART: different roles.

53 Comparing Role Discovery: Lynn Blair vs. Kimberly Watson (Blair = "Gas pipeline logistics", Watson = "Pipeline facilities planning"). Traditional SNA: very different; ART: very similar.

54 ART: Roles but not Groups (Enron TransWestern Division). Traditional SNA: block structured; ART: not.

55 Groups and Topics Input: –Observed relations between people –Attributes on those relations (text, or categorical) Output: –Attributes clustered into "topics" –Groups of people, varying depending on topic

56 Adjacency Matrix Representing Relations
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking (A-F).
Academic admiration relation: Acad(A,B), Acad(C,B), Acad(A,D), Acad(C,D), Acad(B,E), Acad(D,E), Acad(B,F), Acad(D,F), Acad(E,A), Acad(F,A), Acad(E,C), Acad(F,C).
[Figure: the A-F adjacency matrix shown unordered, then with rows and columns permuted to expose block structure under different groupings (G1/G2 and G1/G2/G3).]

57 Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]: S = number of entities, G = number of groups. Enhanced with an arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004].
[Plate diagram with Beta-Binomial and Dirichlet-Multinomial components.]
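The blockmodel's generative story (draw group assignments, then draw each edge from a group-pair probability) can be sketched as follows. The uniform group prior and the fixed edge-probability matrix are simplifications of the Beta/Dirichlet priors on the slide:

```python
import random

random.seed(2)

def sample_blockmodel(n, G, eta):
    """Generative sketch of a stochastic blockmodel: assign each of n
    entities to one of G groups, then draw each directed edge i -> j
    as a Bernoulli with probability eta[g_i][g_j]."""
    g = [random.randrange(G) for _ in range(n)]   # group assignments
    adj = [[0] * n for _ in range(n)]             # adjacency matrix
    for i in range(n):
        for j in range(n):
            if i != j and random.random() < eta[g[i]][g[j]]:
                adj[i][j] = 1
    return g, adj

# Dense within groups, sparse across: yields a block-structured matrix
# once rows/columns are sorted by group, as in the adjacency-matrix slide.
eta = [[0.9, 0.1],
       [0.1, 0.9]]
g, adj = sample_blockmodel(6, 2, eta)
```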

58 Two Relations with Different Attributes
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking (A-F).
Academic admiration: Acad(A,B), Acad(C,B), Acad(A,D), Acad(C,D), Acad(B,E), Acad(D,E), Acad(B,F), Acad(D,F), Acad(E,A), Acad(F,A), Acad(E,C), Acad(F,C).
Social admiration: Soci(A,B), Soci(A,D), Soci(A,F), Soci(B,A), Soci(B,C), Soci(B,E), Soci(C,B), Soci(C,D), Soci(C,F), Soci(D,A), Soci(D,C), Soci(D,E), Soci(E,B), Soci(E,D), Soci(E,F), Soci(F,A), Soci(F,C), Soci(F,E).
[Figure: the same students yield different block structures under each relation (G1/G2/G3 for academic admiration, G1/G2 for social admiration).]

59 The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Plate diagram combining the group model (Beta-Binomial, Dirichlet-Multinomial) with a topic model (Uniform topic choice, Dirichlet-Multinomial words).]

60 Inference and Estimation Gibbs Sampling: - Many r.v.s can be integrated out - Easy to implement - Reasonably fast We assume the relationship is symmetric.

61 Dataset #1: U.S. Senate
16 years of voting records in the US Senate (1989 - 2005); a Senator may respond Yea or Nay to a resolution; 3423 resolutions with text attributes (index terms); 191 Senators in total across 16 years.
Example: S.543. Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes. Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991). Cosponsors (2). Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms.
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ...

62 Topics Discovered (U.S. Senate)
Mixture of Unigrams:
Education: education, school, aid, children, drug, students, elementary, prevention
Energy: energy, power, water, nuclear, gas, petrol, research, pollution
Military + Misc.: government, military, foreign, tax, congress, aid, law, policy
Economic: federal, labor, insurance, aid, tax, business, employee, care
Group-Topic Model:
Education + Domestic: education, school, federal, aid, government, tax, energy, research
Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
Economic: labor, insurance, tax, congress, income, minimum, wage, business
Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance

63 Groups Discovered (US Senate) Groups from topic Education + Domestic

64 Senators who change coalitions the most, depending on topic. E.g. Senator Shelby (D-AL) votes with the Republicans on Economic, with the Democrats on Education + Domestic, and with a small group of maverick Republicans on Social Security + Medicaid.

65 Dataset #2: The UN General Assembly
Voting records of the UN General Assembly (1990 - 2003); a country may choose to vote Yes, No or Abstain; 931 resolutions with text attributes (titles); 192 countries in total. Also experiments later with resolutions from 1960-2003.
Example: Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting. The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions. In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

66 Topics Discovered (UN)
Mixture of Unigrams:
Everything Nuclear: nuclear, weapons, use, implementation, countries
Human Rights: rights, human, palestine, situation, israel
Security in Middle East: occupied, israel, syria, security, calls
Group-Topic Model:
Nuclear Non-proliferation: nuclear, states, united, weapons, nations
Nuclear Arms Race: arms, prevention, race, space
Human Rights: rights, human, palestine, occupied, israel

67 Groups Discovered (UN) The countries in each group are ordered by their 2005 GDP (PPP); only 5 countries are shown for groups that have more than 5 members.

68 Do We Get Better Groups with the GT Model?
Baseline model: 1. Cluster bills into topics using mixture of unigrams; 2. Apply the group model on topic-specific subsets of bills.
GT model: jointly cluster topics and groups at the same time.
Agreement Index (AI) measures group cohesion. Higher is better.
Dataset | Avg. AI for Baseline | Avg. AI for GT | p-value
Senate | 0.8198 | 0.8294 | <.01
UN | 0.8548 | 0.8664 | <.01

69 Groups and Topics, Trends over Time (UN)

70 Outline Self & Lab Intro Information Extraction and Data Mining. Research vignette: Conditional Random Fields Research vignette: Social network analysis & Topic models Demo: Rexa, a new Web portal for research literature Future work, brainstorming, collaboration, next steps

71 Previous Systems

72 [screenshot of a previous system]

73 Research Paper Cites Previous Systems

74 More Entities and Relations: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise

75-94 [sequence of Rexa demo screenshots]

95 Neural Information Processing Conference Dataset 1740 Papers 13649 Unique words 2,301,375 Words Volumes 0-12 Spanning 1987 – 1999. Prepared by Sam Roweis.

96 Trends in 17 years of NIPS proceedings

97 Topic Distributions Conditioned on Time [plot: topic mass (in vertical height) vs. time]

98 Finding Topics in 1 million CS papers 200 topics & keywords automatically discovered.

99 Topical Bibliometric Impact Measures Topical Citation Counts Topical Impact Factors Topical Longevity Topical Diversity Topical Precedence Topical Transfer [Mann, Mimno, McCallum, 2006]

100 Topical Diversity

101 Entropy of the topic distribution among papers that cite this paper (this topic). [Examples shown, from low diversity to high diversity.]
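The diversity measure itself is just Shannon entropy over the citing-topic distribution; the two example distributions below are hypothetical:

```python
import math

def topic_entropy(dist):
    """Shannon entropy (nats) of the topic distribution among citing
    papers: low entropy means narrow impact, high means diverse impact."""
    return -sum(p * math.log(p) for p in dist if p > 0)

narrow = [0.9, 0.05, 0.05]    # hypothetical: citations from one community
diverse = [1/3, 1/3, 1/3]     # hypothetical: citations spread evenly
```

A uniform distribution over K topics attains the maximum, log K.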

102 Topical Precedence Within a topic, what are the earliest papers that received more than n citations? “Early-ness” Speech Recognition: Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953) Spectrographic study of vowel reduction, B. Lindblom (1963) Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965) Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974) Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

103 Topical Transfer Citation counts from one topic to another. Map “producers and consumers”

104 Outline Self & Lab Intro Information Extraction and Data Mining. Research vignette: Conditional Random Fields Research vignette: Social network analysis & Topic models Demo: Rexa, a new Web portal for research literature Future work, brainstorming, collaboration, next steps

105 Collaborative Avenues Run and extend Rexa on Countway data Research on bibliometric IE and matching –citation extraction in bio/med literature –author entity resolution (authority, coreference) in all PubMed –IE for proteins, genes, diseases, treatments,... from abstracts. Research on bibliometric / paper-body mining –Topic analysis, discovery, impact / influence mapping –Keyphrase, topic-hierarchy discovery –Group-topic discovery on diseases, drugs, treatments, success... or on genes, proteins, etc. –Routing CFPs to potential PIs. Expert-finding. Research on machine learning for bioinformatics –Research on bi-clustering gene data –Research on CRF string edit distance for bio seq How might I help i2b2? Some additional grant we might write together? Next concrete steps, responsibilities, deliverables, timeline.

106 Summary Traditionally, SNA examines links, but not the language content on those links. Presented ART, a Bayesian network for messages sent in a social network: captures topics and role-similarity. RART explicitly represents roles. Additional work –Group-Topic model discovers groups and clusters attributes of relations. [Wang, Mohanty, McCallum, LinkKDD 2005]

107 End of Talk

