Tamil Summary Generation for a Cricket Match


Tamil Summary Generation for a Cricket Match J. Jai Hari Raju, P. Indhu Reka, K.K Nandavi, Dr. Madhan Karky

Table of Contents
Objective
Architecture
Data Gathering and Modeling module
Data Mining and Data Analytics module
Summary Generator
Content Determiner
Aggregator
Tamil Morphological Generator
Layout Determiner
Evaluator
Results and Conclusion
Future Work
Reference

Cricket and Tamil Summaries
Cricket is one of the most followed sports in the Indian subcontinent. The number of websites that let people follow and analyze cricket has increased manyfold. These sites involve the participation of experts, who present their views and summaries of cricket matches in English. There is a wide requirement for natural language descriptions that summarize a cricket match effectively, yet online services that provide Tamil summaries of cricket matches are absent.

Objective
To propose a framework for automatic analysis and summary generation for a cricket match in Tamil, with the scorecard of the match as the input. The framework proposes a method to evaluate the interestingness of a cricket match, a customization model for the summary, and methods for evaluating the humanness of the generated summary.

Background
Alice Oh et al. generated multiple stories about a single baseball game from different perspectives using a reordering algorithm [1]. Ehud Reiter et al. explain the difference between natural language generation and natural language processing [2]. Jacques Robin et al. presented a system called STREAK for summarizing basketball game data in natural language [3]. L. Bourbeau et al. developed FoG (Forecast Generator) using a streamlined version of the Meaning-Text linguistic model; the system was capable of generating weather forecasts in both English and French [4].

Architecture

Data Gathering and Modeling
This module obtains the statistical data from the internet. It provides a user interface for entering the input URL, crawls that URL, and downloads the web page containing the match data. A custom-designed parser, written for the tag structure of the data source, parses the page to obtain the statistics. The statistical data is then modeled in the form of predefined feature vectors.
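The parse-then-model step can be illustrated with a minimal sketch. The slides do not name the data source, its tag structure, or the fields of the feature vectors, so the sample markup, the ScorecardParser class, and the player/runs/balls fields below are all hypothetical. The download step is left out so that the sketch runs offline.

```python
from html.parser import HTMLParser

# Hypothetical scorecard markup: the real data source and its tag
# structure are not given in the slides.
SAMPLE_PAGE = """
<table class="batting">
  <tr><td>Player A</td><td>85</td><td>92</td></tr>
  <tr><td>Player B</td><td>12</td><td>20</td></tr>
</table>
"""


class ScorecardParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being read, if any
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_feature_vectors(page):
    """Turn each table row into an assumed (player, runs, balls) vector."""
    parser = ScorecardParser()
    parser.feed(page)
    parser.close()
    return [
        {"player": name, "runs": int(runs), "balls": int(balls)}
        for name, runs, balls in parser.rows
    ]


features = to_feature_vectors(SAMPLE_PAGE)
```

A parser written against one site's tag structure breaks when that structure changes, which is presumably why the module is described as custom designed for its data source.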

Data Mining and Analytics
A modified version of the Apriori algorithm is used to find association rules from the feature vectors. A mathematical analysis using the coefficient of variation (CoV) is performed: CoV is plotted against the average to indicate how consistent a player is. The interestingness of the match is calculated as a weighted average of the scores assigned to the identified factors, which include the winning margin, team history, individual records made, high run rate, series state, relative position in the international ranking, and reaction in social networks.
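The two numeric measures on this slide can be sketched as follows. The sketch reads CoV as the coefficient of variation (standard deviation divided by mean) and uses the population standard deviation; both are assumptions. The factor names, factor scores, and weights are illustrative only, since the slides do not list the weights that were found.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Standard deviation divided by mean; lower means more consistent."""
    avg = mean(scores)
    if avg == 0:
        raise ValueError("CoV is undefined when the average is zero")
    return pstdev(scores) / avg


def interestingness(factor_scores, weights):
    """Weighted average of per-factor scores (weights are assumed)."""
    total_weight = sum(weights[name] for name in factor_scores)
    weighted_sum = sum(
        score * weights[name] for name, score in factor_scores.items()
    )
    return weighted_sum / total_weight


# A player's runs over three innings (made-up numbers).
cov = coefficient_of_variation([40, 50, 60])

# Hypothetical factor scores in [0, 1] and hypothetical weights.
match_score = interestingness(
    {"winning_margin": 0.8, "high_run_rate": 0.5},
    {"winning_margin": 3.0, "high_run_rate": 1.0},
)
```

Plotting each player's CoV against the player's average separates high-scoring consistent players (high average, low CoV) from erratic ones (high CoV).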

Content Determiner
This module is responsible for identifying the facts that are worth mentioning in the summary. The events to be included are not predefined and are not the same for every match. Particular events are chosen for the summary based on the interestingness of the match as a whole, the interestingness of the individual events, and the expert level chosen by the user.
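One way the three inputs could combine is a threshold on event interestingness that is lowered for more interesting matches and for higher expert levels. The slides do not give the actual selection rule, so the threshold formula, its constants, and the event fields below are purely hypothetical.

```python
def select_events(events, match_interest, expert_level):
    """Keep events whose interest score clears an assumed threshold.

    match_interest is in [0, 1]; expert_level is 0, 1 or 2.
    The formula is an illustration, not the authors' rule.
    """
    threshold = max(0.0, 0.9 - 0.3 * match_interest - 0.1 * expert_level)
    return [event for event in events if event["interest"] >= threshold]


# Made-up candidate events with made-up interest scores.
candidates = [
    {"name": "century", "interest": 0.9},
    {"name": "five-wicket haul", "interest": 0.5},
    {"name": "dropped catch", "interest": 0.2},
]

casual_view = select_events(candidates, match_interest=0.5, expert_level=0)
expert_view = select_events(candidates, match_interest=1.0, expert_level=2)
```

Under this rule, a casual reader of an average match sees only the headline event, while an expert reader of a highly interesting match sees more detail.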

Aggregator
Including relevant events from other matches makes the summary more readable and interesting. The aggregator chooses such events based on their similarity and coherence, and aggregates them with the key events selected by the content determiner module.

Sentence Generation
The sentence most apt for the event under consideration is selected. The vocabulary used in the sentence and the depth to which an event is discussed are varied based on the expert level of the user. The nouns in the key events are passed to the morphological generator along with the desired case endings, and the generated variants are added to the sentences. The system uses the morphological generator developed at TaCoLa. For example: சச்சின் + இன் = சச்சினின்

Sample Summary

Evaluator
The humanness of a generated summary is evaluated as its degree of similarity to human-written summaries. The summaries are compared on two parameters: the nouns mentioned and the events mentioned. The nouns and the events are extracted along with their absolute positions. Each event in a summary is modeled as a set consisting of:
Performers: the persons who take part in the event.
Numeral: the numeric part involved in the event, e.g. 4 wickets.
Descriptor: the action connecting the Performers and the Numeral.
The absolute position of a noun or event is the number of the sentence in which it is mentioned. Absolute positions are normalized by the total number of sentences in the summary.

Three different scores are calculated:
Similarity Score: the ratio of the number of nouns and events mentioned in both summaries to the total number of nouns and events mentioned in at least one summary.
Count Score: the ratio of the number of nouns and events mentioned in the system-generated summary to the number of nouns and events mentioned in the human-written summary.
Closeness Score: the degree of closeness, in terms of normalized positions, of the nouns and events mentioned in both summaries.
A weighted average of these three scores yields the final humanness score.
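The three scores can be sketched under stated assumptions. Each summary is represented as a mapping from an item (a noun, or an event written as a performer/descriptor/numeral tuple) to its sentence number. The slides do not define the closeness formula or the weights, so closeness is taken here as one minus the mean absolute difference of normalized positions over shared items, and the weights are equal. All names and the sample data are hypothetical.

```python
def normalize_positions(items, total_sentences):
    """Map each item's 1-based sentence number into (0, 1]."""
    return {item: pos / total_sentences for item, pos in items.items()}


def similarity_score(system, human):
    """Items in both summaries over items in at least one summary."""
    union = set(system) | set(human)
    return len(set(system) & set(human)) / len(union) if union else 0.0


def count_score(system, human):
    """Item count of the system summary over that of the human summary."""
    return len(system) / len(human) if human else 0.0


def closeness_score(system, human):
    """Assumed form: 1 - mean |position difference| over shared items."""
    shared = set(system) & set(human)
    if not shared:
        return 0.0
    gaps = [abs(system[item] - human[item]) for item in shared]
    return 1.0 - sum(gaps) / len(gaps)


def humanness(system, human, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three scores (equal weights assumed)."""
    scores = (
        similarity_score(system, human),
        count_score(system, human),
        closeness_score(system, human),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


century = ("Sachin", "scored", "100 runs")  # performer, descriptor, numeral
system_summary = normalize_positions(
    {"Sachin": 1, century: 2, "India": 4}, total_sentences=4
)
human_summary = normalize_positions(
    {"Sachin": 1, century: 1, "Australia": 3}, total_sentences=5
)
final_score = humanness(system_summary, human_summary)
```

As defined, the count score exceeds 1 when the system summary mentions more items than the human one; whether it should be capped is not stated in the slides.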

Results
Scorecards of 90 One Day International matches were retrieved and their summaries were generated. These include matches between 9 countries. A large number of hidden patterns in the cricket domain were retrieved by the algorithm used. The factors contributing to the interestingness of a match have been identified, and the weights associated with them have been found. The consistency of a player has been modelled, and consistency analysis is used to analyse the player's performance.

The difference in the language used and the events mentioned in the summary is pronounced when the user opts for an expert level. Similar facts occurring in the past have been identified and added to the summary. Each summary was compared with two human-written summaries, one an expert summary and the other an average summary, and their cumulative scores were considered. The humanness scores of the summaries tend to be in the range of 70% to 85%. The recurrence of layouts is also minimal, which reflects the fact that the generated summaries are not monotonous.

GUI Screen Shot

Future work
Enhancement by adding machine learning capabilities to make the summaries more human and interesting.
Summary generation in multiple languages apart from Tamil.
Summary generation about the match in real time.
Summary generation in other sports.
Usage as a guideline to develop summary generation systems that can be applied to any domain where frequent numerical reports are used (weather prediction, industrial quality testing, etc.).

References
[1] Alice Oh and Howard Shrobe, "Generating baseball summaries from multiple perspectives by reordering content," in Proc. 5th International Natural Language Generation Conference, 2008, pp. 173-176.
[2] Ehud Reiter and Robert Dale, "Building natural language generation systems," Cambridge: Cambridge University Press, 2000.
[3] Jacques Robin and Kathleen McKeown, "Empirically Designing and Evaluating a New Revision-Based Model for Summary Generation," Department of Computer Science, Columbia University, 1996, vol. 85, pp. 135-179.
[4] L. Bourbeau, D. Carcagno, E. Goldberg, R. Kittredge and A. Polguere, "Bilingual generation of weather forecasts in an operations environment," in Proc. 13th International Conference on Computational Linguistics (COLING), Helsinki University, Finland, 1990.

Thank you
J. Jai Hari Raju, P. Indhu Reka, K.K Nandavi, Dr. Madhan Karky