Www.OASUS.ca Come out of the desert of ignorance to the OASUS of knowledge Scraping the Web with SAS Tom Kari Tom Kari Consulting OASUS, June 12 2013.


1 www.OASUS.ca: Come out of the desert of ignorance to the OASUS of knowledge
Scraping the Web with SAS
Tom Kari, Tom Kari Consulting
OASUS, June 12, 2013

2 Google is wonderful, but…
- The first page is full of junk!
- I can't tell how many pages I'm getting from each site.
- I KNOW the page I want is in here somewhere, how can I find it?
- I'm not using SAS when I use Google!
- How can I keep ALL the results to analyze?

3 The Basics
    data URL_Retrieval_Results;
      length HTML_Rec $32767;
      /* FILENAME is a global statement; the URL access method reads the page over HTTP */
      filename HTML_In url "http://www.dolphinsdance.ca";
      infile HTML_In lrecl=32767;
      /* Read one line of HTML per iteration and keep the raw record */
      input;
      HTML_Rec = _infile_;
    run;

4 The Process
1. What goes in the reference to Google?
2. Get results from Google
3. How do I find the web sites listed by Google?
4. Extract the web sites
5. Figure out how to get 1000 web site listings
6. Post-process the results (SAS data management)

5 1. How to send a search to Google?
In Internet Explorer:
- F12 to open Developer Tools
- Network
- Start Capturing
- Enter your search string
- Stop Capturing
- Dig around in the results
Captured request:
http://www.google.ca/s?gs_rn=14&gs_ri=psy-ab&cp=41&gs_id=a&xhr=t&q=beautiful%20vacation%20resort%20puerto%20vallarta&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.47008514,d.dmg&fp=5ad817295c2c0080&biw=1123&bih=374&tch=1&ech=1&psi=8xOlUdWjBOT84AO1iYCwDw.1369773041400.1
Simplified search URL:
http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta&start=1
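The simplified URL above can be assembled in a DATA step. The following is only a sketch under stated assumptions: the variable names `Query` and `SearchURL` are hypothetical, and only blanks are encoded (as plus signs). Other reserved characters would need percent-encoding, for which the SAS `URLENCODE` function may be a better fit.

```sas
/* Hypothetical sketch: build the search URL from a plain-text query. */
data _null_;
  length Query $200 SearchURL $500;
  Query = "beautiful vacation resort puerto vallarta";
  /* Collapse repeated blanks, trim, then turn each remaining blank into "+" */
  Query = translate(strip(compbl(Query)), "+", " ");
  /* Single quotes keep the macro processor from touching the ampersand */
  SearchURL = cats('http://www.google.ca/search?q=', Query, '&start=1');
  put SearchURL=;
run;
```

The log should show the same simplified URL that appears on the slide.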

6 2. Get Results from Google
    data GoogleResults;
      length HTML_Rec $32767;
      filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
    run;
Callout on the slide: 32,767 bytes

7 3. How do I find the web sites listed by Google?
What the results page shows:
Dreams Puerto Vallarta Resort & Spa - All-inclusive Resort Reviews...
www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-Puerto_Vallarta.html
What the HTML behind it contains (truncated on the slide):
    <a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:gaglProuhbkJ:http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-

8 3. How do I find the web sites listed by Google? (cont'd)
The magic of PRX routines!
"Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step. Pattern matching also enables you to make several substitutions in a string in one step. You do this by using the PRX functions and CALL routines in the DATA step. For example, you can search for multiple occurrences of a string and replace those strings with another string. You can search for a string in your source file and return the position of the match. You can find words in your file that are doubled."
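To make the quoted description concrete, here is a small sketch that is not from the slides. It uses `PRXMATCH` to return the position of a doubled word and `PRXCHANGE` to replace it. The sample sentence and the variable names `Text`, `Pos`, and `Fixed` are made up for illustration.

```sas
/* Illustrative only: find and repair a doubled word with PRX functions. */
data _null_;
  length Text Fixed $100;
  Text = "You can find find words that are doubled.";
  /* PRXMATCH returns the position of the first match, or 0 if there is none */
  Pos = prxmatch('/\b(\w+) \1\b/', Text);
  /* A times argument of -1 replaces every match; $1 keeps one copy of the word */
  Fixed = prxchange('s/\b(\w+) \1\b/$1/', -1, Text);
  put Pos= / Fixed=;
run;
```

In this sample the doubled word starts at position 9, and `Fixed` should contain the sentence with a single "find".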

9 4. Extract the web sites
    data GoogleHTMLResult;
      retain prxid;
      /* Compile the pattern once: the text between the /url?q= prefix and the next ampersand entity */
      if _n_=1 then prxid=prxparse('/(?<=<a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');
      length HTML_Rec $32767;
      filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
      /* Position and length of the first match in this record (position is 0 if none) */
      call prxsubstr(prxid,HTML_Rec,HTML_Pos,HTML_Len);
      /* Only extract when something matched; SUBSTR with position 0 is invalid */
      if HTML_Pos > 0 then CiteData=substr(HTML_Rec,HTML_Pos,HTML_Len);
      output;
    run;
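`CALL PRXSUBSTR` reports only the first match in a record, while a results page can carry several links on one line. One possible way to collect them all is a `CALL PRXNEXT` loop, sketched below. This is an assumption about how the extraction could be extended, not the presenter's code. The HTML record is a hard-coded, made-up sample so the step runs without a network connection, and `ExtractedLinks` is a hypothetical data set name.

```sas
/* Sketch: pull every matching link out of one record with CALL PRXNEXT. */
data ExtractedLinks;
  length HTML_Rec $500 CiteData $200;
  retain prxid;
  if _n_ = 1 then prxid = prxparse('/(?<=<a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');
  /* Made-up sample record standing in for a line of Google HTML */
  HTML_Rec = '<a href="/url?q=http://www.example.com/a.html&amp;sa=U">A</a> <a href="/url?q=http://www.example.org/b&amp;sa=U">B</a>';
  start = 1;
  stop = length(HTML_Rec);
  /* PRXNEXT advances start past each match; pos is 0 when no match is left */
  call prxnext(prxid, start, stop, HTML_Rec, pos, len);
  do while (pos > 0);
    CiteData = substr(HTML_Rec, pos, len);
    output;
    call prxnext(prxid, start, stop, HTML_Rec, pos, len);
  end;
  keep CiteData;
run;
```

With this sample record the data set should end up with two rows, one per link.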

10 5. Figure out how to get 1000 web site listings
Quirks to remember:
- Many characters can't appear in Google search strings, so they must be encoded (spaces to +, etc.).
- Ampersands in your URL need %nrstr, or SAS will try to resolve them as macro variable references.
- To use a new url infile in SAS, you need a new data step. This is easy with a macro loop.
- Every now and then it fails with "ERROR: Invalid reply received from the HTTP server. Use the debug option for more info." Beats me!
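The macro loop mentioned above might look roughly like the sketch below. It is not the presenter's Example 4. The macro name `GetPages`, its `Pages` parameter, the data set names `OnePage` and `AllPages`, and the assumption that the `start` value advances by 10 per page are all hypothetical.

```sas
/* Sketch: one FILENAME and one DATA step per results page, appended together. */
%macro GetPages(Pages=3);
  %local i Offset;
  %do i = 1 %to &Pages;
    /* Assumption: 10 results per page, first page at start=1 as on the slides */
    %let Offset = %eval((&i - 1) * 10 + 1);
    filename HTML_In url
      "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=&Offset";
    data OnePage;
      length HTML_Rec $32767;
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
      Page = &i;   /* remember which results page the record came from */
    run;
    /* PROC APPEND creates AllPages on the first pass and extends it afterwards */
    proc append base=AllPages data=OnePage;
    run;
    filename HTML_In clear;
  %end;
%mend GetPages;

%GetPages(Pages=3)
```

If `AllPages` is left over from an earlier run, it would need to be deleted first, since `PROC APPEND` keeps adding to it.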

11 5. Figure out how to get 1000 web site listings (cont'd)
Code is in Example 4: Extract 1000 URLs

12 6. Post-process the results
- Count how many times each URL appears
- For each unique URL, retain the page and index where it first appears
- Create a nice-looking HTML page
Code is in Example 5: Post-processed
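The three post-processing steps could be sketched as follows. This is not the presenter's Example 5. The input rows are invented, and the names `AllURLs`, `URLSummary`, `ResultIndex`, and the output file `url_summary.html` are hypothetical.

```sas
/* Invented sample: one row per extracted link, with its page and index. */
data AllURLs;
  length URL $200;
  input Page ResultIndex URL $;
  datalines;
1 1 http://www.example.com/a
1 2 http://www.example.org/b
2 1 http://www.example.com/a
3 4 http://www.example.com/a
2 7 http://www.example.org/b
;
run;

/* Sort so the earliest appearance of each URL comes first in its group */
proc sort data=AllURLs;
  by URL Page ResultIndex;
run;

/* One row per URL: how often it appears and where it first appears */
data URLSummary;
  set AllURLs;
  by URL;
  retain FirstPage FirstIndex;
  if first.URL then do;
    Count = 0;
    FirstPage = Page;
    FirstIndex = ResultIndex;
  end;
  Count + 1;
  if last.URL then output;
  keep URL Count FirstPage FirstIndex;
run;

/* Write the summary as an HTML page */
ods html file="url_summary.html";
proc print data=URLSummary noobs;
run;
ods html close;
```

For the sample rows, the summary should list the example.com link three times (first seen on page 1, index 1) and the example.org link twice (first seen on page 1, index 2).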

14 Appendix: PRX parse strings
    prxid=prxparse('parse string');
    /(?<=<a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o
Labels on the slide, in order: outer control, non-captured group, any-of, one or more, as-is, as-is, escaped, escaped, grouping

