News and Blog Analysis with Lydia Steven Skiena Dept. of Computer Science SUNY Stony Brook

Slides:



Advertisements
Similar presentations
Open Source Intelligence: Presented by Abe Lederman, President and CTO Deep Web Technologies, LLC IOP 06 Sheraton Premier, Tysons Corner, Virginia January.
Advertisements

Almaden Research Center © 2006 IBM Corporation IOP 06 Open Source Intelligence Lesson Learned.
Teaching Chinese in the Virtual World What tools can be used? By Xie Tianwei California State University, Long Beach Virtual office:
By Constanza Lermanda G.. Topic: Electronic NewspaperDuration of the lesson: 50 minutesGrade Level: 12th.
Tools for Unstructured Text
Introduction to Computational Linguistics
C6 Databases.
Automatic summarization Dragomir R. Radev University of Michigan
Transferable Skills beyond the academic training 22nd January, 14-18h, Building 3, Floor 1, Computer Room 9 (16.P1.E3) 29nd January, 14-18h, Building.
Technology of Data Analytics. INTRODUCTION OBJECTIVE  Data Analytics mindset – shallow and wide, deep when you need it  Quick overview, useful tidbits,
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Kira Radinsky, Sagie Davidovich, Shaul Markovitch Computer Science Department Technion – Israel Institute of technology.
Text Mining of Medical Documents Michael Elhadad - Raphael Cohen Dept of Computer Science.
Topic Extraction From Turkish News Articles Anıl Armağan Fuat Basık Fatih Çalışır Arif Usta.
NATIONAL LIBRARY OF MEDICINE The PubMed ID and Entrez, PubMed and PubMed Central Edwin Sequeira National Center for Biotechnology Information June 21,
1 Question Answering in Biomedicine Student: Andreea Tutos Id: Supervisor: Diego Molla.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Scientific Web Intelligence The Birth of a New Research Field Mike Thelwall Statistical Cybermetrics Research Group University of Wolverhampton, UK.
Columbia’s Vision for Tomorrow’s Global Intelligent Systems Henning Schulzrinne, Chair Department of Computer Science October 13, 2005 Bill Gates/CS Faculty.
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
Finding Information Online Objectives: Students will be able to distinguish between web search tools and library search tools and understand the types.
Open Information Extraction From The Web Rani Qumsiyeh.
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
How to find a worthy research subject Sadeghi Ramin, MD Nuclear Medicine Research Center, Mashhad University of Medical Sciences.
Tools to keep track of the literature Diego Mauricio Riaño-Pachón AG Mueller-Roeber University of Potsdam GabiPD team - MPIMP April 2008
Friday, March 7 Searching for and Acquiring Scientific Literature Writing Process Map.
Faculty of Computer Science © 2006 CMPUT 605March 31, 2008 Towards Applying Text Mining and Natural Language Processing for Biomedical Ontology Acquisition.
Analysing the link structures of the Web sites of national university systems Mike Thelwall Statistical Cybermetrics Research Group University of Wolverhampton,
SOCIAL NETWORKS Social Media Class. Review … What have we learned so far?
Big Data, Future of Computing, Parting Thoughts Slobodan Vucetic Associate Professor Department of Computer and Information Sciences Temple University.
I learn English because…
LIBRARY SERVICES Internet sources of information Paula Funnell Senior Academic Liaison Librarian (Medicine and Dentistry)
AIDA SURYANI K-1 / Useful Websites For Learning English.
9. Learning Objectives  How do companies utilize social media research? What are the primary approaches to social media research?  What is the research.
Datamining MEDLINE for Topics and Trends in Dental and Craniofacial Research William C. Bartling, D.D.S. NIDCR/NLM Fellow in Dental Informatics Center.
The Scientific Library of NUPh for students: services and electronic resources The presentation of services, provided by the Scientific Library of NUPh.
Branching Out: Introduction to Sources for Public Health Research Social Sciences Research Toolkit Kay Hogan Smith, MLS, MPH, CHES.
Lydia: Knowledge Extraction from Curated Text Steven Skiena Dept. of Computer Science SUNY Stony Brook
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
Social Content ASIDIC, Tampa Fl, March 2009 What is Social Content? How can we use Social Content? What is the future of Social Content?
1 How to find literature - A very short introduction SMED 8004 Medicine and Health Library October 2014.
Knowledge Discovery in the Digital Library Access tools for mining science ICSTI Public Workshop Presented by: Bernard Dumouchel, Director-General February.
©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)
Social Media Social media is not new – Opinion pages – Telephone – Others? What is new – Fewer intermediaries – Individual ability to broadcast – Massive.
Data Mining By Dave Maung.
Research Topics CSC Parallel Computing & Compilers CSC 3990.
Effects of Computing How have computers affected society?
Should Students in Primary Schools Learn Computer Science?
Copyright © 2008 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 5 Finding and Critiquing Evidence: Research Literature Reviews.
HSC Legal Studies From Google to Legal Studies Research Guide - May 2010.
ICT-enabled Agricultural Science for Development Scenarios, Opportunities, Issues by ICTs transforming agricultural science, research & technology generation.
What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.
1 Melanie Alexander. Agenda Define Big Data Trends Business Value Challenges What to consider Supplier Negotiation Contract Negotiation Summary 2.
Reference Collections: Collection Characteristics.
WEB 2.0 ELEMENTS BY: ANSHUL SINHA 8AA. CONTENT Back ground information Podcasts Blogs Tagging Rss- Really Simple Sindicación Sources Why I chose this.
Journalism online The World Wide Web, the past, the present, the future.
Peer Review in the Google Age Feb 23, 2006 RSS club meeting special session.
Realtime Financial Monitoring and Analysis System May 2010 Lietu Search Engine.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
Engaging with Scientific Literature Image adapted from xkcd.com.
INTRODUCTION TO APPLIED LINGUISTICS
King Faisal University جامعة الملك فيصل Deanship of E-Learning and Distance Education عمادة التعلم الإلكتروني والتعليم عن بعد [ ] 1 جامعة الملك فيصل عمادة.
Computing & Information Sciences Kansas State University An Overview of Big Data Analytics: Challenges & Selected Applications Guest Seminar Drake University.
TEL, Distance Education and virtual classrooms with the IEI Noelle Carroll Juley McGourty International Equine Institute.
DeepWalk: Online Learning of Social Representations
Huayi Zhagn and Haiyan Liang
Panagiotis G. Ipeirotis Luis Gravano
PURE Learning Plan Richard Lee, James Chen,.
Service Platform for Green European Transportation
Text Mining of Medical Documents
Presentation transcript:

News and Blog Analysis with Lydia Steven Skiena Dept. of Computer Science SUNY Stony Brook

Opportunities in Text Analysis The increasing volume of online information coupled with decreasing costs of communications and computation creates exciting new opportunities in text mining. We can analyze all of the online English- language newspapers daily on a single commodity computer! Our ultimate goal is to build a computational model of much of the worlds knowledge through analysis of news media, reference texts, and primary sources.

Learning From Text Sources Knowledge extraction becomes easier when you start with reliable sources: Online newspapers (both domestic and foreign) what is happening in the world? Scientific abstracts, e.g. Medline/Pubmed what is known about disease and medicine? Blogs and chat rooms what do people think is happening in the world?

Tracking research interest in diseases over time

Related Systems IBM's WebFountain Framework for web-scale text analytics Columbia's Newsblaster Provides summaries of online newspaper articles Google News Automatic aggregation and presentation of online news.

String Algorithms and NLP Since both fields deal with text there should be real connections, but NLP people parse sentences, not documents, so n<20. The weirdnesses of language dont tend to result in well-defined problems. Problems related to synonym set identification and phonetic hashing may be interesting to the CPM community.

Outline of Talk Lydia NLP pipeline Spatial and temporal analysis Blogs vs. news Current research Future visions

System Architecture Spidering – text is retrieved from a given site on a daily basis using semi-custom spidering agents. Normalization – clean text is extracted with semi- custom parsers and formatted for our pipeline Text Markup – annotates parts of the source text for storage and analysis. Back Office Operations – we aggregate entity frequency and relational data for a variety of statistical analyses. Levon Lloyd, Dimitrios Kechagias, and Steven Skiena. Lydia: A System for Large-Scale News Analysis. In String Processing and Information Retrieval: 12 th International Conference (SPIRE 2005).

Text Markup We apply natural language processing (NLP) techniques to annotate interesting features of the document. Full parsing techniques are too slow to keep up with our volume of text, so we employ shallow parsing instead. We can currently markup approximately 2000 newspapers per day per CPU. Analysis phases include…

Flowchart

Input Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York. She will take over in March 2005, succeeding Gordon Conway, the foundation's first non-American president. Mr. Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.

Sentence and Paragraph Identification Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York. She will take over in March 2005, succeeding Gordon Conway, the foundation's first non-American president. Mr. Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.

Part Of Speech Tagging Dr./NNP Judith/NNP Rodin/NNP,/, the/DT former/JJ president/NN of/IN the/DT University/NNP of/IN Pennsylvania/NNP,/, will/MD become/VB president/NN of/IN the/DT Rockefeller/NNP Foundation/NN next/JJ year/NN,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN New/NNP York/NNP./. She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD,/, succeeding/VBG Gordon/NNP Conway/NNP,/, the/DT foundation/NN 's/POS first/JJ non-American/JJ president/NN./. Mr./NNP Conway/NNP announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN to/TO Britain/NNP,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP./.

Proper Noun Extraction Dr./NNP Judith/NNP Rodin/NNP,/, the/DT former/JJ president/NN of/IN the/DT University/NNP of/IN Pennsylvania/NNP,/, will/MD become/VB president/NN of/IN the/DT Rockefeller/NNP Foundation/NN next/JJ year/NN,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN New/NNP York/NNP./. She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD,/, succeeding/VBG Gordon/NNP Conway/NNP,/, the/DT foundation/NN 's/POS first/JJ non-American/JJ president/NN./. Mr./NNP Conway/NNP announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN to/TO Britain/NNP,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP./.

Date and Number Extraction Dr./NNP Judith/NNP Rodin/NNP,/, the/DT former/JJ president/NN of/IN the/DT University/NNP of/IN Pennsylvania/NNP,/, will/MD become/VB president/NN of/IN the/DT Rockefeller/NNP Foundation/NN next/JJ year/NN,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN New/NNP York/NNP./. She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD,/, succeeding/VBG Gordon/NNP Conway/NNP,/, the/DT foundation/NN 's/POS first/JJ non- American/JJ president/NN./. Mr./NNP Conway/NNP announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN to/TO Britain/NNP,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP./.

Actor Classification Dr./NNP Judith/NNP Rodin/NNP,/, the/DT former/JJ president/NN of/IN the/DT University/NNP of/IN Pennsylvania/NNP,/, will/MD become/VB president/NN of/IN the/DT Rockefeller/NNP Foundation/NN next/JJ year/NN,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN New/NNP York/NNP./. She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD,/, succeeding/VBG Gordon/NNP Conway/NNP,/, the/DT foundation/NN 's/POS first/JJ non-American/JJ president/NN./. Mr./NNP Conway/NNP announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN to/TO Britain/NNP,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP./.

Rewrite Rules Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York. She will take over in March 2005, succeeding Gordon Conway, the foundation 's first non-American president. Mr. Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.

Alias Expansion Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York. She will take over in March 2005, succeeding Gordon Conway, the foundation 's first non-American president. Mr. Gordon Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.

Geography Normalization Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York City, New York, USA. She will take over in March 2005, succeeding Gordon Conway, the foundation 's first non-American president. Mr. Gordon Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.

Back Office Operations The most interesting analysis occurs after markup, using our MySQL database of all occurrences of interesting entities. Each days worth of analysis yields about 10 million occurrences of about 1 million different entities, so efficiency matters... Linkage of each occurrence to source and time facilitates a variety of interesting analysis.

Duplicate Article Elimination Supreme Court Justice David Souter suffered minor injuries when a group of young men assaulted him as he jogged on a city street, a court spokeswoman and Metropolitan Police said Saturday. Supreme Court Justice David Souter suffered minor injuries when a group of young men assaulted him as he jogged on a city street, a court spokeswoman and Metropolitan Police said. Hashing techniques can efficiently identify duplicate and near-duplicate articles appearing in different news sources.

Synonym Sets JFK, John Kennedy, John F. Kennedy, and John Fitzgerald Kennedy all refer to the same person. We need a mechanism to link multiple entities that have slightly different names but refer to the same thing. We say that two actors belong in the same synonym set if: Their names are morphologically compatible. If the sets of entities that they are related to are similar. Levon Lloyd, Andrew Mehler, and Steven Skiena. Identifying Co-referential Names Across Large Corpra. In Proc. Combinatorial Pattern Matching (CPM 2006)

Morphological Similarity Subsequence Similarity George Bush = George W. Bush Stemming New York Yankees = New York Yankee Pronunciation Similarity – Metaphone Victor Yanukovich = Viktor Yanukovych Abbreviations JFK = John F. Kennedy = John Fitzgerald Kennedy

Contextual Similarity Virginia Woolf The Hours Michael Cunningham Mrs. Dalloway Nicole Kidman Butterfield 8 Edward Albee Elizabeth Taylor Virginia Alexandria Maryland North Carolina John Allen Muhammad Falls Church Lee Boyd Malvo Tennessee Bush Iraq Al Gore Democrats Texas Ari Fleischer Florida Republican George W. Bush Al Gore Republican GOP Texas Arizona Sen. John McCain Florida Dick Cheney =

Algorithmics of Synsets All morphologically similar pairs of n entities could be found in O(n^2) similarity comparisons, but n is very large Instead, we use phonetic hashing techniques such that two similar strings are likely hashed to the same bucket, and compare within buckets. As an aside, we believe such hashing techniques should be used for correcting typing errors in password entry. Andrew Mehler and Steven Skiena. Improved Usability Through Password-Corrective Hashing. In String Processing and Information Retrieval: 13 th International Conference (SPIRE 2006).

Phonetic Hashing Soundex and Metaphone map similar sounding strings to the same codes: Kant/Knuth = K530. Optimally designing such schemes for particular notions of similarity seems like an interesting problem. We generalized such schemes to tradeoff precision/recall but not very well.

Outline of Talk Lydia NLP pipeline Spatial and temporal analysis Blogs vs. news Current research Future visions

Juxtaposition Analysis We want to compute the significance of the co-occurrences between two entities Similar to collaborative filtering, determining which customers are most similar in order to predict future buying preferences Just counting the number of co-occurrences causes the most popular entities to be related to everyone

Scoring Juxtapositions In collaborative filtering, correlation coefficients and dot products are often used, but they did not work well for our problem We use a Chernoff Bound to bound the probability of the number of co-occurrences P(X > (1+δ)E[X]) (e δ /(1+δ) (1+δ) )E[X]

Typical Juxtapositions

Time Series Analysis Martin Luther King Samuel Alito

Heatmaps Where are people are talking about particular topics? Newspapers have a sphere of influence based on: Power of the source – circulation, website popularity Population density of surrounding cities The heat a given entity generates in a particular location is a function of the frequency it is mentioned in local sources A. Mehler, Y. Bao, X. Li, Y. Wang, and S. Skiena. Spatial analysis of News Sources. InfoVis 2006

Donde Esta Mexico?

Who is running for president?

Where is Washington?

Outline of Talk Lydia NLP pipeline Spatial and temporal analysis Blogs vs. news Current research Future visions

Blog Analysis with Lydia Blogs represent a different view of the world than newspapers. Less objective Greater diversity of topics We adapted Lydia to process Livejournal blogs, and compared blog content to that of newspapers. Levon Lloyd, Prachi Kaulgud, and Steven Skiena. News vs. Blogs: Who Gets the Scoop?. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

Blog-specific Processing Blogs prove more difficult to analyze than professional news sources: Inconsistent capitalization Some write in uniform case. Some convey emotions through capitalization (I'm SO excited) Emoticons and unique abbreviations ;), :(, etc. b4 = before, 2nite = tonight, etc.

Co-reference Sets in Blogs Bloggers frequently misspell entity names. Using our co-reference set system, we can identify and correct such misspellings. Examples include Britney Spears, Brittany Spears Michael Jackson, Micheal Jackson Stephen King, Steven King

Heatmaps for Blogs Roughly half of all Livejournal users specify their geographic location. Modelling the geographic influence for blogs is harder than newspapers Many more sources/locations No published ratings of source power Individuals need not have local constituencies

Where's the nearest Steak n Shake?

Local Dialect: Who says Hella?

Popular Entities in News vs. Blogs

Blogs can track the news..

Or move independently.. Only one of Martha Stewart's new TV shows caused a blip in blogspace.

Who Gets the Scoop?

Outline of Talk Lydia NLP pipeline Spatial and temporal analysis Blogs vs. news Current research Future visions

Sentiment Analysis Sentiment analysis lets us to measure how positively/negatively an entity is regarded, not just how much it is talked about.

Most Positive Actors in News and Blogs News: Felicity Huffman, Fenando Alonso, Dan Rather, Warren Buffett, Joe Paterno, Ray Charles, Bill Frist, Ben Wallace, John Negroponte, George Clooney, Alicia Keys, Roy Moore, Jay Leno, Roger Federer Blogs: Joe Paterno, Phil Mickelson, Tom Brokow, Sasha Cohen, Ted Stevens, Rafael Nadal, Felicity Huffman, Warren Buffett, Fernando Alonso, Chauncey Billups, Maria Sharapova, Earl Woods, Kasey Kahne, Tom Brady

Most Negative Actors in News and Blogs News: Slobodan Milosevic, John Ashcroft, Zacarias Moussaoui, John Allen Muhammad, Lionel Tate, Charles Taylor, George Ryan, Al Sharpton, Peter Jennings, Saddam Hussein, Jose Padilla, Abdul Rahman, Adolf Hitler, Harriet Miers, King Gyanendra Blogs: John Allen Muhammad, Sammy Sosa, George Ryan, Lionel Tate, Esteban Loaiza, Slobodan Milosevic, Charles Schumer, Scott Peterson, Zacarias Moussaoui, William Jefferson, King Gyanemdra, Ricky Williams, Ernie Fletcher, Edward Kennedy, John Gotti

How Do We Do it? We use large-scale statistical analysis instead of careful NLP of individual reviews. We expand small seed lists of +/- terms into large vocabularies using Wordnet and path- counting algorithms. We correct for modifiers and negation. Statistical methods turn these counts into indicies. N. Godbole, M. Srinivasaiah, and S. Skiena. Large-Scale Sentiment Analysis for News and Blogs. Submited for publication.

Good to Bad in Three Hops Paths of WordNet synonyms can lead to contradictory results, requiring careful path selection.

What Does it Mean? Our scores corrolate very well with financial, political, and sporting events.

Who is the American Idol?

The Half-Life of Victory: Baseball

Seasonal Effects on Sentiment The low point is not September 2001 but April 2004, with the Madrid bombings and war in Iraq.

Social Network Analysis

Relationship Identification We use verb-frames and template-based methods to try to identify the nature of statistically-significant relationships, e.g devastated killed-in became not-watch

Description Extraction We use template-based methods and WordNet sense analysis to extract meaningful descriptions, such as: Warren Buffett, billionaire investor Giacomo, Kentucky Derby winner Kim Jong Il, North Korean leader

Question Answering Our Lydia question answering system competed in the TREC 2005 Question Answering Track, finishing with median scores using only 2000 lines of code. Jae Hong Kil, Levon Lloyd, and Steven Skiena. Question Answering with Lydia. In The Fourteenth Text Retrieval Conference Proceedings (TREC 2005). Q: Where was Herbert Hoover born? A: West Branch, Iowa Q: Who is the Governor of Iowa? A: Tom Vilsack

Outline of Talk Lydia NLP pipeline Spatial and temporal analysis Blogs vs. news Current research Future visions

Future Visions Entity-oriented (instead of document-based) search engines Market research: geographic/temporal analysis Legal document (deposition/filing) analysis Financial modelling and analysis Medical/Scientific applications Law enforcement/Homeland Security We actively seek industrial collaboration..

The Lydia Team

Namrata Godbole – NLP pipeline, sentiment analysis Prachi Kaulgud – NLP pipeline, web development Dimitris Kechagias – spidering Jae Hong Kil – question answering Levon Lloyd – systems architecture, co-reference sets Andrew Mehler – rule-based processing, heatmaps Manja Srinivasaiah – production, sentiment analysis Izzet Zorlu – web development

… and the Lydia Cluster