Final Project Presentation 11748 - Information Extraction Learning to Extract Signature and Reply Lines from Email Vitor R. Carvalho.


1 Final Project Presentation 11748 - Information Extraction Learning to Extract Signature and Reply Lines from Email Vitor R. Carvalho

2 Idea: Sig lines and reply lines

3 Directions
- Motivation: text-to-speech, automatic personal address management, anonymization of email corpora, preprocessing for email classification experiments
- Related work:
  - Sproat, Chen & Hu: "Emu: An e-mail preprocessor for text-to-speech"; "Integrating geometrical and linguistic analysis for e-mail signature block parsing"
  - Pinto et al., McCallum et al.: classification of lines on FAQ pages and tables in text documents using machine learning algorithms
- 2 tasks: sig detection and line extraction
- Compare state-of-the-art algorithms
- Supervised learning

4 Data
- 20 Newsgroups dataset
- Searched for pairs of messages from the same sender whose last K lines were identical.
- K ≤ 1: unlikely to have a sig. Manually checked: 586 messages without sigs.
- K ≥ 6: likely to have a sig. Manually checked, with sig and reply-to lines annotated: 617 messages.
- Total: 33013 lines (3321 sig lines, 5587 reply-to lines)
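The pair search described on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function names (`shared_tail_length`, `max_shared_tail`, `bucket`), the `(sender, body)` input format, and the choice to ignore blank lines and trailing whitespace are all assumptions. Only the thresholds K ≤ 1 and K ≥ 6 come from the slide.

```python
from collections import defaultdict
from itertools import combinations


def shared_tail_length(lines_a, lines_b):
    """Count how many trailing lines two messages have in common."""
    k = 0
    for a, b in zip(reversed(lines_a), reversed(lines_b)):
        if a != b:
            break
        k += 1
    return k


def max_shared_tail(messages):
    """For each (sender, body) message, return the largest K shared with
    any other message from the same sender.

    Assumption: blank lines and trailing whitespace are ignored.
    """
    by_sender = defaultdict(list)
    for idx, (sender, body) in enumerate(messages):
        lines = [ln.rstrip() for ln in body.splitlines() if ln.strip()]
        by_sender[sender].append((idx, lines))

    best = [0] * len(messages)
    for group in by_sender.values():
        for (i, lines_i), (j, lines_j) in combinations(group, 2):
            k = shared_tail_length(lines_i, lines_j)
            best[i] = max(best[i], k)
            best[j] = max(best[j], k)
    return best


def bucket(k):
    """Map K to the candidate pools named on the slide."""
    if k <= 1:
        return "unlikely_sig"
    if k >= 6:
        return "likely_sig"
    return "undecided"  # hypothetical label for 2 <= K <= 5


if __name__ == "__main__":
    sig = "--\nAl\nCMU\nPgh\nPA\n555\nal@x"
    msgs = [
        ("a@x", "hi\n" + sig),
        ("a@x", "other text\nmore\n" + sig),
        ("b@y", "hello\nbye"),
        ("b@y", "yo\nsee ya"),
    ]
    for (sender, _), k in zip(msgs, max_shared_tail(msgs)):
        print(sender, k, bucket(k))
```

Both pools still need the manual check the slide mentions: a shared tail can be a quoted reply or a boilerplate footer rather than a signature.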

5 Sig Detection Features

6 Sig Detection Results
Sproat et al. (1999): "SIG fields are rarely longer than ten lines".
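One way to act on this observation is to restrict signature detection to the tail of each message. The transcript does not say how the quote was used in the experiments, so the sketch below is only an assumption: `tail_window` is a hypothetical helper, and the default of ten lines is taken directly from the quoted remark.

```python
def tail_window(body, max_lines=10):
    """Return at most the last `max_lines` lines of a message body.

    Assumption: a sig detector only needs to inspect this window,
    following the remark that sig fields rarely exceed ten lines.
    """
    if max_lines <= 0:
        return []
    return body.splitlines()[-max_lines:]


if __name__ == "__main__":
    body = "\n".join(f"line {i}" for i in range(15))
    for line in tail_window(body):
        print(line)
```

Messages shorter than the window are returned whole, so short emails need no special handling.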

7 Sig Extraction Features

8 Sig Extraction Results

9 Reply Extraction Results

10 Sig & Reply Extraction Results

11 Last Lines
- Efficient method to extract sig and reply-to lines in email messages, based on a sequence-of-lines representation
- Comparison of state-of-the-art learning algorithms
- References:
  - R. Sproat, J. Hu, and H. Chen. Emu: An e-mail preprocessor for text-to-speech. In 1998 Workshop on Multimedia Signal Processing, pages 239-244, Redondo Beach, CA, December 1998.
  - H. Chen, J. Hu, and R. Sproat. Integrating geometrical and linguistic analysis for e-mail signature block parsing. ACM Transactions on Information Systems, 17(4):343-366, October 1999.
  - A. McCallum, D. Freitag, and F. Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. Proceedings of ICML-2000, 2000.
  - D. Pinto, A. McCallum, X. Wei, and W. B. Croft. Table Extraction Using Conditional Random Fields. SIGIR, ACM, Toronto, Canada, 2003.
