USPTO Patent Data Source and Data Extraction Mandy Dang MIS 580 University of Arizona 02-06-2008.


1 USPTO Patent Data Source and Data Extraction Mandy Dang MIS 580 University of Arizona 02-06-2008

2 2 Outline –Patent –USPTO –Search USPTO Patents –Data Extraction: Case Study of NSE Patents

3 3 Patent A “patent” usually refers to a right granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof. –A patent is not a right to practice or use the invention. Rather, it provides the right to exclude others from making, using, selling, or offering the invention for sale for a limited term, usually 20 years from the filing date. –It is a limited property right that the government grants to inventors in exchange for their agreement to share the details of their inventions with the public. A patent is also a special type of technical document, one that records many important innovations and technological advances.

4 4 USPTO The United States Patent and Trademark Office (USPTO) is an agency in the United States Department of Commerce that provides patent protection to inventors and businesses for their inventions, and trademark registration for product and intellectual property identification. Each year, the USPTO issues thousands of patents to companies and individuals worldwide. As of March 2006, the USPTO had issued over 7 million patents, with 3,500 to 4,500 patents newly granted each week. The USPTO provides online full-text access to patents issued since 1976. URLs: –USPTO Official Website: http://www.uspto.gov/ –USPTO Patent Search: http://www.uspto.gov/main/search.html

5 5 Search USPTO Patents http://www.uspto.gov/main/search.html


8 8 Data Extraction: Case Study of NSE Patents Nanoscale Science and Engineering (NSE) field –A fundamental technology area that is critical for a nation’s technological competence. –Expected to revolutionize a wide range of application domains. Nanotechnology –Is an applied science and technology field that is multidisciplinary and encompasses engineering and other work taking place at the nanoscale. –Is critical for a nation’s technological competence. –Its R&D status attracts interest from various communities.

9 9 Data Extraction Procedure The goal is to gather all the related patents from the USPTO Web site as free-text HTML pages, parse them into structured data, and store the results in a database. Procedure for extracting NSE patents from the USPTO: 1. Spider search results (summary pages) 2. Spider individual patent documents (detailed pages) 3. Noise filtering 4. Parsing

10 10 1. Spider search results (summary pages) A list of keywords is used to search for patents related to the NSE domain. The keywords were provided by domain experts. A spider program written in Perl was used to download the search result pages.

11 11 Example code: get keywords, set up the time range, download search pages

    use strict;
    use HTML::TokeParser;
    use LWP;
    use URI::Escape;

    sub query {
        … … … …
        # Get keywords: read them from the file named on the command line
        open(f, $ARGV[0]);
        my @keywords = <f>;
        close(f);
        … …
        # Download search pages: request one result page and save it to disk
        $query_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=$pno&u=%2Fnetahtml%2Fsearch-bool.html&r=0&f=S&l=50&TERM1=$kw&FIELD1=&co1=AND&TERM2=$start%3E$end&FIELD2=ISD&d=ptx";
        $response = $browser->get($query_url);
        $result = $response->content();
        open(f, "> $fpage-$pno.html");
        select(f);
        print $result;
        close(f);
    }

    # Set up time range: issue dates from 1/1/2007 through 12/31/2007
    query('1/1/2007', '12/31/2007');
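The URL construction in the Perl fragment above can be sketched in Java as follows. This is a minimal illustration, not the presenter's code: the class and method names are hypothetical, only a subset of the query parameters from the slide is reproduced, and the endpoint is passed in by the caller. The point it shows is that each keyword and date range should be percent-encoded before it is placed in the query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class SearchUrlBuilder {
    /** Percent-encodes one query value (form encoding: spaces become '+'). */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Builds the search URL for one keyword, one result page, and one
     * issue-date range. Parameter names are copied from the slide; the
     * endpoint is an input because this sketch does not hard-code a host.
     */
    static String buildQueryUrl(String endpoint, String keyword, int page,
                                String start, String end) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("p", Integer.toString(page)); // $pno in the Perl code
        params.put("l", "50");                   // copied from the slide
        params.put("TERM1", keyword);            // $kw in the Perl code
        params.put("co1", "AND");
        params.put("TERM2", start + ">" + end);  // $start%3E$end in the Perl code
        params.put("FIELD2", "ISD");

        StringBuilder url = new StringBuilder(endpoint).append('?');
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) url.append('&');
            url.append(p.getKey()).append('=').append(encode(p.getValue()));
            first = false;
        }
        return url.toString();
    }
}
```

A caller would loop over the keyword list and over page numbers, calling `buildQueryUrl` once per request and saving each response to its own file, as the Perl code does.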

12 12 Search result page example (callout: patent IDs)

13 13 2. Spider individual patent documents (detailed pages) In this step, we need to: –first, collect all the patent IDs from the search result pages; –second, download every patent document by its patent ID, using proxies. The data set is often very large, so spreading the downloads across proxies can save a lot of time.
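Collecting the patent IDs from the saved search result pages could look roughly like the Java sketch below. The page layout is an assumption made for illustration: the code supposes each ID appears as the entire text of an element in comma-grouped form (for example `>7,312,345<`), which should be checked against the real pages. The class name and the sample IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatentIdExtractor {
    // Assumed layout: a comma-grouped number that is the whole text of an element.
    private static final Pattern ID_CELL =
            Pattern.compile(">\\s*(\\d{1,3}(?:,\\d{3}){1,2})\\s*<");

    /** Returns the distinct IDs in page order, with the commas removed. */
    static List<String> extractIds(String html) {
        Set<String> ids = new LinkedHashSet<>(); // drops repeats, keeps order
        Matcher m = ID_CELL.matcher(html);
        while (m.find()) {
            ids.add(m.group(1).replace(",", ""));
        }
        return new ArrayList<>(ids);
    }
}
```

Running this over every saved summary page and merging the results gives the ID list that the download step consumes.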

14 14 Download detailed patent documents
Create several files, each of which contains a fixed number of patent IDs (e.g., 300 patent IDs).

Server: Send different patent ID files to different client threads.

    … …
    open(f, $ARGV[0]);
    my @theids = <f>;
    close(f);
    my $theid;
    foreach $theid (@theids){
        $new_sock = $sock->accept();       # wait for the next client to connect
        my $buf = <$new_sock>;             # read the client's request line
        print ($new_sock $theid."\n");     # send the client its next entry
        print $buf. " ". $theid."\n";      # log which client received which entry
        close $new_sock;
    … …

Client: Use a proxy to download the patents whose IDs are in the file sent from the server.

    … …
    do {
        $response = $browser->get($pat_url);
        if (!$response->is_success()){
            select(stdout);
            print $response->status_line, "\n\n";
            sleep(rand(7)+1);              # pause a random 1 to 7 seconds, then retry
        }
    } while (!$response->is_success());
    … …
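The two ideas on this slide, splitting the ID list into fixed-size batches and retrying a failed download after a random pause, can be sketched in Java as below. This is an illustration under stated assumptions rather than the original code: the class and method names are hypothetical, the retry loop is capped at `maxAttempts` (the Perl loop retries indefinitely), and a fetch signals failure by returning `null`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

final class BatchDownload {
    /** Splits items into consecutive batches of at most `size` elements. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Calls `attempt` until it returns a non-null result or `maxAttempts`
     * is reached, sleeping a random 1..maxPauseMillis ms between tries.
     * Returns null if every attempt failed or the thread was interrupted.
     */
    static <T> T fetchWithRetry(Supplier<T> attempt, int maxAttempts, long maxPauseMillis) {
        Random random = new Random();
        for (int i = 1; i <= maxAttempts; i++) {
            T result = attempt.get();
            if (result != null) return result;
            if (i == maxAttempts) break;
            try {
                Thread.sleep(1 + (long) (random.nextDouble() * maxPauseMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
        return null;
    }
}
```

With a batch size of 300, each batch from `partition` corresponds to one patent ID file handed to a client; the random pause keeps clients from retrying in lockstep.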

15 15 Patent document example


17 17 3. Noise filtering Some of the gathered patents match an NSE keyword only through a noise term (for example, a unit such as nanosecond), and some contain no NSE keyword at all. –Such patents need to be filtered out. Noise keywords include: –nanosecond –nanoliter –nano$ –nano-second –nano-liter –nano.sub –nano [space] –nano2
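One way to apply the noise list is sketched below in Java. It is a simplified reading, not the presenter's filter: it treats every occurrence of "nano" as a candidate keyword hit and keeps a patent only if at least one occurrence is not followed by a noise ending. Interpreting `nano$` as "nano at the end of the text" is an assumption, and the class and method names are hypothetical. A real filter would also check the expert-supplied keyword list, which the slides do not include.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NoiseFilter {
    private static final Pattern NANO =
            Pattern.compile("nano", Pattern.CASE_INSENSITIVE);

    // What may follow "nano" in a noise term, per the slide's list:
    // second, liter, -second, -liter, .sub, whitespace, the digit 2,
    // or end of text (assumed meaning of "nano$").
    private static final Pattern NOISE_TAIL =
            Pattern.compile("(?:-?second|-?liter|\\.sub|\\s|2|$)", Pattern.CASE_INSENSITIVE);

    /** True if some "nano..." occurrence in the text is not a noise term. */
    static boolean hasGenuineNanoTerm(String text) {
        Matcher nano = NANO.matcher(text);
        while (nano.find()) {
            String tail = text.substring(nano.end());
            // lookingAt() anchors the noise pattern at the start of the tail
            if (!NOISE_TAIL.matcher(tail).lookingAt()) {
                return true;
            }
        }
        return false;
    }
}
```

Patents for which `hasGenuineNanoTerm` returns false would be dropped before parsing.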

18 18 4. Parsing Extract the different data fields from the HTML patent documents and load them into the database.

19 19 Parsing example: parsing inventor data. Callouts on the slide: process inventor name; process inventor address; keep the ranking order of inventors.

    public static void processAssignees( ) throws IOException {
        … … … …
        String[] assignees = assigneeString.split(" ");
        for (int i = 0; i < assignees.length; i++) {
            currentassignee = assignees[i].trim();
            if (currentassignee.length() == 0) continue;
            currentassignee = currentassignee.replaceAll("\r\n", "");

            // Name: the text between the first pair of delimiters
            name = findBetween(currentassignee, 0, " ", " ");
            currPosition = currentassignee.indexOf(" ") + " ".length();

            // Address: the parenthesized text that follows the name
            address = findBetween(currentassignee, currPosition, "(", ")");
            if (address == null) {
                System.err.println("wrong address: " + patentId);
                continue; // skip this entry; the code below would dereference null
            }

            // City: the part before the last comma of the address
            int startIndex = 0, endIndex = 0;
            if ((endIndex = address.lastIndexOf(',')) >= 0) {
                city = address.substring(0, endIndex);
                if (city.lastIndexOf(',') >= 0) {
                    city = city.substring(city.lastIndexOf(',') + 1);
                    city = city.replaceAll("[^a-zA-Z]", ""); // assign the result; Strings are immutable
                }
                startIndex = endIndex + 1;
            } else city = "-";

            // Country or state: the part after the last comma
            address = address.substring(startIndex);
            country = findBetween(address, 0, " ", " ");
            if (country == null) {
                country = "US";
                state = address.trim();
            } else state = "-";

            name = name.trim();
            city = city.trim();
            state = state.trim();
            rank++; // keep the ranking order
        }
    }
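The parsing code calls a helper named `findBetween` that the slide does not show. A plausible version, inferred only from how it is called (text, start position, opening delimiter, closing delimiter, `null` on failure), might look like the Java sketch below. The enclosing class name and the sample string are hypothetical.

```java
final class TextSlices {
    /**
     * Returns the text between the first occurrence of open at or after
     * position from and the next occurrence of close after it, or null
     * if either delimiter is missing.
     */
    static String findBetween(String text, int from, String open, String close) {
        int start = text.indexOf(open, from);
        if (start < 0) return null;
        start += open.length();          // skip past the opening delimiter
        int end = text.indexOf(close, start);
        if (end < 0) return null;
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Illustrative input only; real entries come from the patent HTML.
        System.out.println(findBetween("Acme Corp. (Tucson, AZ)", 0, "(", ")"));
    }
}
```

Returning `null` rather than an empty string lets the caller tell "delimiters not found" apart from "delimiters found with nothing between them", which is what the `address == null` check relies on.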

20 20 Data Analysis Examples
Bibliographic analysis –Top 50 countries (the first 30 rows are shown)

    select c.countryName, count(distinct b.patentId)
    from usp_assignee a, usp_patentAssignee b, usp_countryName c
    where a.assigneeId = b.assigneeId
      and a.aCountry not in ('unknown', '')
      and a.aCountry = c.countryCode
    group by c.countryName
    order by count(distinct b.patentId) desc

    Rank  Assignee Country               Number of Patents
    1     United States                  13,506
    2     Japan                          2,653
    3     Federal Republic of Germany    836
    4     France                         534
    5     China (Taiwan)                 428
    6     Republic of Korea              406
    7     Canada                         333
    8     Netherlands                    325
    9     Australia                      276
    10    United Kingdom                 258
    11    Switzerland                    193
    12    Israel                         163
    13    Sweden                         108
    14    Belgium                        106
    15    Italy                          82
    16    Singapore                      70
    17    China                          66
    18    Denmark                        56
    19    Finland                        51
    20    India                          39
    21    Hong Kong                      33
    22    Bermuda                        28
    23    Ireland                        26
    24    Austria                        24
    25    Norway                         23
    26    Spain                          15
    27    Liechtenstein                  13
    28    Barbados                       13
    29    British Virgin Islands         7
    30    New Zealand                    7
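For readers less familiar with SQL, the grouping logic of the query (count distinct patents per assignee country, skip unknown or empty countries, sort by count descending) can be sketched over in-memory rows in Java. This is a simplified illustration: the class name and the row format (patent ID, country name) are assumptions, and the join against the country-name table is left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CountryRanking {
    /**
     * Each row is {patentId, country}. Returns (country, distinct patent
     * count) pairs, highest count first; ties are ordered by country name.
     */
    static List<Map.Entry<String, Integer>> rank(List<String[]> rows) {
        Map<String, Set<String>> patentsByCountry = new HashMap<>();
        for (String[] row : rows) {
            String patentId = row[0];
            String country = row[1];
            // Mirrors: a.aCountry not in ('unknown', '')
            if (country == null || country.isEmpty() || country.equals("unknown")) continue;
            patentsByCountry.computeIfAbsent(country, k -> new HashSet<>()).add(patentId);
        }
        List<Map.Entry<String, Integer>> ranking = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : patentsByCountry.entrySet()) {
            // The set size is the count(distinct patentId) for this country
            ranking.add(new AbstractMap.SimpleEntry<String, Integer>(e.getKey(), e.getValue().size()));
        }
        ranking.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranking;
    }
}
```

Using a set per country is what makes the count "distinct": a patent with two assignees in the same country is counted once for that country.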

21 21 Citation Network Analysis Software used: Graphviz http://www.pixelglow.com/graphviz/download/
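Graphviz draws a graph from a plain-text description in its DOT language, so a citation network can be produced by writing one edge per citation. The Java sketch below shows the idea under simple assumptions: each citation is a pair (citing patent ID, cited patent ID), IDs contain no quote characters, and the class name and graph name are hypothetical.

```java
import java.util.List;

final class CitationDot {
    /** Each edge is {citingPatentId, citedPatentId}; returns DOT source text. */
    static String toDot(List<String[]> edges) {
        StringBuilder dot = new StringBuilder("digraph citations {\n");
        for (String[] edge : edges) {
            // Quoted node names; assumes the IDs contain no '"' characters
            dot.append("  \"").append(edge[0]).append("\" -> \"")
               .append(edge[1]).append("\";\n");
        }
        return dot.append("}\n").toString();
    }
}
```

Saving the returned text to a file such as `citations.dot` and running `dot -Tpng citations.dot -o citations.png` renders the network as an image.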

22 22 Content Map Analysis Software used: a multi-level self-organizing map algorithm developed by the AI Lab at the University of Arizona

23 23 Thanks!

