AND Working Groups, July 24, 2008
AND: Second Workshop on Analytics for Noisy Unstructured Text Data
Group 1 Task: Data sets, benchmarks, and evaluation techniques for the analysis of noisy texts


Slide 1: Group 1
- Task: Data sets, benchmarks, and evaluation techniques for the analysis of noisy texts.
- Participants: Maarten de Rijke, Amaresh Pandey, Donna Harman, Venu Govindaraju, Aixin Sun, and Venkat Subramaniam.

Slide 2: Datasets
- It is important to list the datasets that are out there.
- A list of publicly available datasets can be added to the proceedings, along with descriptions and comments.
- Create a table: dataset name and source; application; usability; tools for creating and analyzing the datasets (a sketch of such a table as a data structure follows below).
- Take references from AND 07.
- List what is missing from the datasets.
- Datasets can be for speech, text, OCR, etc.
- LDC and ELDC can be sources for speech data.
- NIST can be a source for OCR data.
- List tools and sources that provide data for academic and industry research.
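As a sketch of how the proposed dataset catalogue might be represented, here is one Python record per table row, with the four column groups named on the slide. The two example entries (dataset names, licensing notes, tools) are invented placeholders, not vetted datasets.

```python
# A minimal sketch of the proposed dataset table, assuming one record
# per row; columns follow the slide: name/source, application,
# usability, and tools. Example rows are hypothetical.
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    name: str         # dataset name
    source: str       # distributing organization (e.g. LDC, NIST)
    application: str  # task the data supports (speech, OCR, ...)
    usability: str    # licensing / availability notes
    tools: str        # tools for creating and analyzing the data

catalogue = [
    # Hypothetical rows for illustration only.
    DatasetEntry("example-speech-corpus", "LDC", "speech recognition",
                 "license required", "transcription toolkit"),
    DatasetEntry("example-ocr-set", "NIST", "OCR",
                 "publicly available", "ground-truth alignment scripts"),
]

for row in catalogue:
    print(f"{row.name:24} {row.source:6} {row.application}")
```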

Slide 3: Benchmarks
- Identify popular tasks; organize competitions to create benchmarks.
- List past evaluations and benchmarks, for example from TREC, and list what can still be done: blogs, speech, OCR in TREC 5, legal, spam, cross-language text, historical texts, etc.
- Create a table: popular tasks; what benchmarks exist; new benchmarks (a sketch follows below).
- Give emphasis to certain types of datasets, such as blogs and OCR.
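A minimal sketch of the proposed benchmarks table as a simple mapping. The "existing" entries name well-known TREC tracks as illustrations; the "new" column is left as a placeholder for the group to fill in.

```python
# Sketch of the three-column benchmarks table from the slide:
# task, existing benchmarks, candidate new benchmarks ("TBD").
benchmarks = {
    "blog retrieval":        {"existing": ["TREC Blog track"], "new": ["TBD"]},
    "OCR text retrieval":    {"existing": ["TREC-5 Confusion track"], "new": ["TBD"]},
    "spam filtering":        {"existing": ["TREC Spam track"], "new": ["TBD"]},
    "cross-language search": {"existing": ["TREC cross-language track"], "new": ["TBD"]},
}

for task, cols in benchmarks.items():
    existing = ", ".join(cols["existing"])
    new = ", ".join(cols["new"])
    print(f"{task:24} existing: {existing:28} new: {new}")
```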

Slide 4: Evaluation
- Cascaded evaluation should be done: the noise itself, the effect of the noise, and the effects of the different stages of processing (see the sketch after this slide).
- Evaluation requires truth data, and creating labeled truth is costly. So create a common task on a given dataset; that way truth data gets generated.
- List evaluation techniques and metrics for common tasks.
- Create a table that contains: task, evaluation technique, source, and references.
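To make the evaluation discussion concrete, here is a minimal sketch of one standard metric for noisy text, word error rate (WER), computed as token-level Levenshtein distance divided by the reference length, applied in a cascaded fashion across processing stages. The stage outputs are made-up examples.

```python
# WER between a reference and a hypothesis, via token edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Cascaded evaluation: score the same reference against each stage's
# output to see where the noise enters (outputs are invented).
reference = "the quick brown fox"
for stage, output in [("OCR", "the qu1ck brown f0x"),
                      ("spell-corrected", "the quick brown fox")]:
    print(f"{stage:16} WER = {word_error_rate(reference, output):.2f}")
```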

Slide 5: Datasets, Benchmarks, Evaluation Techniques
- What datasets, benchmarks, and evaluation techniques are needed for the analysis of noisy texts?
- Datasets today comprise mostly newswire data. Blogs, SMS, email, voice, and other spontaneous-communication datasets are needed. TREC tracks have recently started including such datasets.
- Are benchmarks and evaluation dependent on the task?
  - QA over blogs: blogs are not factual.
  - Business intelligence over customer calls and emails.
  - Opinion and sentiment mining from emails and blogs.
- On such datasets, agreement between humans is also very low (a sketch of measuring it follows below).
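Low human agreement can itself be quantified. Here is a minimal sketch of Cohen's kappa, the usual chance-corrected agreement statistic for two annotators: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The two label sequences are invented for illustration.

```python
# Cohen's kappa for two annotators labeling the same items
# (e.g. sentiment labels on blog posts).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items where the two agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented annotations for illustration only.
annotator_1 = ["pos", "neg", "neg", "pos", "neu", "pos"]
annotator_2 = ["pos", "neg", "pos", "neu", "neu", "pos"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```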

