
1 ITTL.ppt-1 Information Technology & Telecommunications Laboratory Document Type Recognition and Content Summarization William Underwood Persistent Archives Testbed Working Meeting SDSC, La Jolla, CA Feb 17-18, 2005

2 Overview: Information Extraction; Machine Learning and Recognition of Document Types; Content Extraction; Summarization (Folder Titles and Content Notes); FOIA Review

3 Access Restriction Checker

4 Information Extraction: Information extraction (IE) is a procedure that selects, extracts, and combines data from text to produce structured information. The named entity (NE) task is to identify all named persons, organizations, locations, dates, times, monetary amounts, and percentages in text.
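As a rough illustration of the NE task, the sketch below tags only dates, monetary amounts, and percentages with regular expressions. The patterns and the `tag_entities` helper are assumptions made for this example, not the tool described in the presentation; recognizing persons, organizations, and locations would need name lists and context rules that are not shown.

```python
import re

# Hypothetical patterns for three of the NE categories named on the slide.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
PATTERNS = {
    "DATE": re.compile(MONTHS + r" \d{1,2}, \d{4}"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?(?:%| percent)"),
}


def tag_entities(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


# Sentence fragment taken from the annotated letter on slide 12.
sample = "Thank you very much for your letter of March 15, 1990"
for start, end, label, surface in tag_entities(sample):
    print(label, surface)
```

The structured output (a type plus character offsets for each mention) is what later steps such as template filling would consume.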

5 Letter From George Bush to Ronald Reagan

6 Named Entity Recognition

7 Content Extraction Tasks: The Template Element (TE) task is to fill in templates about persons and organizations from an automatic analysis of text. The Scenario Template (ST) task is to fill in templates about events and their participants (persons, organizations, etc.) from an automatic analysis of text.
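One way to picture the two template kinds is as simple record types. The sketch below is an assumption about their shape, not the schema used in the presentation; the class and field names are hypothetical, and the values are filled in by hand from the letter on slide 12 rather than extracted automatically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonTemplate:
    """Template Element (TE) record for a person (hypothetical fields)."""
    name: str
    job_title: Optional[str] = None
    organization: Optional[str] = None


@dataclass
class OrganizationTemplate:
    """Template Element (TE) record for an organization (hypothetical fields)."""
    name: str
    address: Optional[str] = None


@dataclass
class EventTemplate:
    """Scenario Template (ST) record: an event and its participants."""
    action: str
    agent: Optional[PersonTemplate] = None
    patient: Optional[PersonTemplate] = None
    obj: Optional[str] = None


# Hand-filled from the annotated White House letter, for illustration only.
org = OrganizationTemplate("American Cultural Traditions",
                           "P.O. Box 1895, Washington, D.C. 20013")
writer = PersonTemplate("Doug Wead",
                        "Special Assistant to the President for Public Liaison")
recipient = PersonTemplate("Ray Allen", "President", org.name)
event = EventTemplate(action="Thank", agent=writer, patient=recipient,
                      obj="letter of March 15, 1990")
```

Keeping persons and organizations as separate records lets one event template refer to participants that were filled in by the TE step.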

8 Content Extraction Applied to Recognizing Request for Confidential Advice

9 Content Extraction and Access Restriction Rules
  Template:
    Action: Request
    Agent: Person
    Job_Title: President
    Object: Analysis of the War Powers Resolution
    Patient: C Boyden Gray
    Job_Title: Counsel to the President
    Presidential_Advisor: C Boyden Gray
  Rule:
    If Document(X), and Action(X) = Request, and Agent(X) = Y, and (Job_Title(Y) = President, or Presidential_Advisor(Y)), and Patient(X) = Z, and Presidential_Advisor(Z), and Object(X) = Information, then Access_Restriction(X) = a(5).
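The rule on this slide can be sketched as a predicate over a filled template. The dictionary layout and the `access_restriction` function are assumptions for illustration. In particular, the slide does not say how "Analysis of the War Powers Resolution" is recognized as Information, so the sketch stores that classification in a hypothetical `object_type` field.

```python
def access_restriction(doc):
    """Return "a(5)" when the slide's rule fires, otherwise None."""
    agent = doc.get("agent") or {}
    patient = doc.get("patient") or {}
    agent_ok = (agent.get("job_title") == "President"
                or agent.get("presidential_advisor", False))
    if (doc.get("action") == "Request"
            and agent_ok
            and patient.get("presidential_advisor", False)
            and doc.get("object_type") == "Information"):
        return "a(5)"
    return None


# Template values from the slide; the field names are hypothetical.
request = {
    "action": "Request",
    "agent": {"job_title": "President"},
    "object": "Analysis of the War Powers Resolution",
    "object_type": "Information",  # assumed classification of the object
    "patient": {
        "name": "C Boyden Gray",
        "job_title": "Counsel to the President",
        "presidential_advisor": True,
    },
}
print(access_restriction(request))
```

Changing any one condition, for example an agent who is neither the President nor a presidential advisor, makes the function return `None`, which mirrors the conjunctive form of the rule.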

10 Some Document Types in Bush Presidential Electronic Records: Agenda; Biographical Information; Briefing Memo; Decision Memo; Executive Order; Information Memo; White House Letter; List of Candidates for Appointment to Federal Office; Mailing List; Minutes of Meeting; Nomination for Appointment to Federal Office; Press Release; Resume; Schedule; Telephone Call Recommendation

11 Document Type Recognition: (1) Convert document format to ASCII or HTML. (2) Use information extraction technology to mark up different document types. (3) Machine learning of document type through grammatical inference. (4) Evaluate performance. (5) Use for recognizing document types of other records.
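The slide does not say how performance is evaluated. A common choice for a document type recognizer is per-type precision and recall against hand-labeled records, and the sketch below assumes that choice. The `per_type_scores` function and the toy labels are hypothetical and are not results from the project.

```python
from collections import Counter


def per_type_scores(gold, predicted):
    """Return {doc_type: (precision, recall)} for parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted type p, but the record is not of type p
            fn[g] += 1  # record of type g was missed
    scores = {}
    for doc_type in sorted(set(gold) | set(predicted)):
        n_pred = tp[doc_type] + fp[doc_type]
        n_gold = tp[doc_type] + fn[doc_type]
        precision = tp[doc_type] / n_pred if n_pred else 0.0
        recall = tp[doc_type] / n_gold if n_gold else 0.0
        scores[doc_type] = (precision, recall)
    return scores


# Toy labels using type names from slide 10; not real experimental data.
gold = ["Press Release", "Resume", "Agenda", "Resume"]
predicted = ["Press Release", "Resume", "Resume", "Resume"]
scores = per_type_scores(gold, predicted)
```

Reporting the two numbers per type, rather than one overall accuracy, shows which document types the learned grammars confuse with one another.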

12 Annotated White House Correspondence
  March 27, 1990
  Dear Mr. Allen
  Thank you very much for your letter of March 15, 1990 which stated your concerns and suggestions regarding the Americans with Disabilities Act. In order to fulfill President Bush's campaign promise of bringing Americans with handicaps into the mainstream of American life, the Bush Administration supports the objectives of the A.D.A. As you may know, the bill is still in House Committee for consideration and change. You can be sure that your thoughts have been fully noted and are appreciated.
  Sincerely,
  Doug Wead
  Special Assistant to the President for Public Liaison
  Ray Allen, President
  American Cultural Traditions
  P.O. Box 1895
  Washington, D.C. 20013

13 Regular Grammar for the Layout of White House Correspondence
  Letter → … A
  A → … B
  B → … B
  B → … C
  C → … D
  D → … E
  E → … F
  F → …
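A regular (right-linear) grammar of this shape can be run as a finite automaton over a sequence of layout labels. The sketch below keeps the nonterminals from the slide, but the terminal names (`date`, `salutation`, `paragraph`, and so on) are assumptions guessed from the annotated letter on slide 12, because the slide's own terminal symbols are not available here. The `recognizes` function is likewise hypothetical.

```python
# Productions: nonterminal -> list of (terminal, next nonterminal).
# A next value of None means the production ends the derivation.
# All terminal names below are assumed for illustration.
GRAMMAR = {
    "Letter": [("date", "A")],
    "A": [("salutation", "B")],
    "B": [("paragraph", "B"), ("paragraph", "C")],
    "C": [("closing", "D")],
    "D": [("signer", "E")],
    "E": [("title", "F")],
    "F": [("address", None)],
}


def recognizes(labels, start="Letter"):
    """Return True if the label sequence is derivable from the start symbol."""
    states = {start}  # nonterminals reachable so far (NFA simulation)
    for label in labels:
        states = {
            nxt
            for state in states
            if state is not None
            for terminal, nxt in GRAMMAR[state]
            if terminal == label
        }
        if not states:
            return False
    return None in states


letter = ["date", "salutation", "paragraph", "paragraph",
          "closing", "signer", "title", "address"]
print(recognizes(letter))
```

The two `B` productions are what allow a letter body of one or more paragraphs. Tracking a set of states handles the choice between them without backtracking.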

14 Scope and Content Note for John Sununu’s Files: These files contain correspondence from senior level staff in the Executive Office of the President, and from every member of the Cabinet. The material covers issues that faced the Bush Administration from 1989 to 1990, including abortion / fetal research, the Exxon Valdez oil spill, the savings and loan industry, the Clean Air Act, the White House Conference on Global Climate Change, relations with China following the student demonstrations in Tiananmen Square, the National Drug Control Strategy, the 1990 Bipartisan Budget Agreement, the spotted owl issue, the Americans with Disabilities Act, and the nomination of Supreme Court Justice David Souter. It includes correspondence, routine reports, press releases, press clippings, papers produced by organizations outside the Administration, and speech drafts.

15 Relationship to Persistent Archives Testbed: Information extraction, document type learning and recognition, and series summarization will be provided as archival services within the NARA Persistent Archives Prototype, and could also be provided within the Persistent Archives Testbed (PAT).

16 Additional Information
  http://perpos.gtri.gatech.edu
  Archival Processing Tools: User Manual
  An Analysis of the Knowledge Required to Perform FOIA and PRA Review, PERPOS Technical Report ITTL/CSITD 04-1, Mar 2004
  PERPOS: Results of Laboratory Experiments and Use by Archivists, Nov 2003
  Recognizing Named Entities in Presidential Electronic Records, PERPOS Technical Report ITTL/CISTD 04-4, June 2004

