www.OASUS.ca: "Come out of the desert of ignorance to the OASUS of knowledge". Scraping the Web with SAS. Tom Kari, Tom Kari Consulting. OASUS, June 12, 2013.

Presentation transcript:

Come out of the desert of ignorance to the OASUS of knowledge Scraping the Web with SAS Tom Kari Tom Kari Consulting OASUS, June

Google is wonderful, but…
The first page is full of junk!
I can't tell how many pages I'm getting from each site.
I KNOW the page I want is in here somewhere; how can I find it?
I'm not using SAS when I use Google!
How can I keep ALL the results to analyze?
June 12, 2013 Tom Kari, Tom Kari Consulting 2

The Basics
    data URL_Retrieval_Results;
        length HTML_Rec $32767;        /* room for one full input line */
        filename HTML_In url "...";
        infile HTML_In lrecl=32767;
        input;                         /* read the next line into the input buffer */
        HTML_Rec = _infile_;           /* keep the whole line in a variable */
    run;

The Process
1. What goes in the reference to Google?
2. Get results from Google
3. How do I find the web sites listed by Google?
4. Extract the web sites
5. Figure out how to get 1000 web site listings
6. Post-process the results (SAS data management)

1. How to send a search to Google?
In Internet Explorer:
F12 to open Developer Tools
Network
Start Capturing
Enter your search string
Stop Capturing
Dig around in the results
tion%20resort%20puerto%20vallarta&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv ,d.dmg&fp=5ad817295c2c0080&biw=1123&bih=374&tch=1&ech=1&psi=8xOlUdWjBOT84AO1iYCwDw
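A minimal sketch of turning a plain search string into a request URL, in the same DATA step style as the other slides. The base address https://www.google.com/search, the q= parameter, and the sample search string are assumptions for illustration; start is the parameter used on the later slides. URLENCODE writes a blank as %20, and the TRANWRD call swaps that for the + form.

```sas
/* Sketch only: base address, q= parameter and search string are assumed */
data _null_;
    length SearchString $200 Query $600 SearchURL $1000;
    SearchString = 'resort puerto vallarta';
    /* URLENCODE escapes reserved characters; a blank becomes %20 */
    Query = urlencode(strip(SearchString));
    /* Use + for blanks, as in the search strings on the slides */
    Query = tranwrd(Query, '%20', '+');
    /* Single quotes keep SAS from treating &start as a macro variable */
    SearchURL = cats('https://www.google.com/search?q=', Query, '&start=0');
    put SearchURL=;   /* write the finished URL to the log */
run;
```

Building the URL in single quotes inside a DATA step sidesteps the ampersand problem; once the URL is typed into a double-quoted FILENAME statement, the ampersand needs %nrstr.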

2. Get Results from Google
    data GoogleResults;
        length HTML_Rec $32767;
        filename HTML_In url "...o+vallarta%nrstr(&start)=1";
        infile HTML_In lrecl=32767;
        input;
        HTML_Rec = _infile_;
    run;
32,767 bytes

3. How do I find the web sites listed by Google?
Dreams Puerto Vallarta Resort & Spa - All-inclusive Resort Reviews...
Dreams_ Puerto _ Vallarta _ Resort _Spa- Puerto _ Vallarta.html -
<a href="/url?q= rouhbkJ: Dreams_Puerto_Vallarta_Resort_Spa-

3. How do I find the web sites listed by Google? (contd)
The magic of PRX routines!
"Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step. Pattern matching also enables you to make several substitutions in a string in one step. You do this by using the PRX functions and CALL routines in the DATA step. For example, you can search for multiple occurrences of a string and replace those strings with another string. You can search for a string in your source file and return the position of the match. You can find words in your file that are doubled."
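The three uses named in the quotation can be tried on a single string. This is a small sketch, not code from the presentation; the sample text and the patterns are made up for illustration.

```sas
/* Sketch only: sample text and patterns are illustrative assumptions */
data PRX_Demo;
    length Text Replaced $80;
    Text = 'The resort is is near the the beach';
    /* Return the position of the first match, or 0 when nothing matches */
    FirstPos = prxmatch('/beach/', Text);
    /* Replace every occurrence (-1 means all) of one string with another */
    Replaced = prxchange('s/resort/hotel/', -1, Text);
    /* Find a doubled word: (\w+) captures a word, \s+\1 expects it again */
    DoublePos = prxmatch('/\b(\w+)\s+\1\b/', Text);
    put FirstPos= DoublePos= Replaced=;
run;
```

PRXMATCH and PRXCHANGE accept the pattern directly as a string, which is handy for quick experiments; PRXPARSE with a retained pattern ID is the usual choice when the same pattern runs against every input line.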

4. Extract the web sites
    data GoogleHTMLResult;
        retain prxid;
        /* compile the pattern once, on the first iteration */
        if _n_=1 then prxid=prxparse('/(? <a href="\/url\?q=)[[:alnum:]- ...
        length HTML_Rec $32767;
        filename HTML_In url "...rto+vallarta%nrstr(&start)=1";
        infile HTML_In lrecl=32767;
        input;
        HTML_Rec = _infile_;
        /* position and length of the first match; position is 0 if none */
        call prxsubstr(prxid,HTML_Rec,HTML_Pos,HTML_Len);
        if HTML_Pos > 0 then do;
            CiteData=substr(HTML_Rec,HTML_Pos,HTML_Len);
            output;
        end;
    run;
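The extraction step can be tried without a network connection by feeding a few HTML lines through DATALINES4. This is a sketch under assumptions: the sample lines are invented, and the pattern is a simplified stand-in (capture everything after /url?q= up to the next & or "), not the parse string from the slide. It uses CALL PRXNEXT rather than CALL PRXSUBSTR so that more than one link on a line is picked up.

```sas
/* Sketch only: invented sample HTML and a simplified stand-in pattern */
data ExtractDemo;
    length HTML_Rec $400 CiteData $200;
    retain prxid;
    /* Compile once: capture group 1 is the target of the /url?q= link */
    if _n_ = 1 then prxid = prxparse('/<a href="\/url\?q=([^&"]+)/');
    infile datalines truncover;
    input;
    HTML_Rec = _infile_;
    start = 1;
    stop = length(HTML_Rec);
    /* Walk through every match on the line; pos is 0 when none is left */
    call prxnext(prxid, start, stop, HTML_Rec, pos, len);
    do while (pos > 0);
        CiteData = prxposn(prxid, 1, HTML_Rec);
        output;
        call prxnext(prxid, start, stop, HTML_Rec, pos, len);
    end;
    keep CiteData;
/* DATALINES4 is needed because the sample lines contain semicolons */
datalines4;
<h3><a href="/url?q=http://www.example.com/resort&amp;sa=U">Resort</a></h3>
<p>No link on this line</p>
<a href="/url?q=http://a.test/">A</a><a href="/url?q=http://b.test/">B</a>
;;;;
run;
```

Lines without a link produce no output rows, so the resulting data set holds only the extracted addresses.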

5. Figure out how to get 1000 web site listings
Quirks to remember
Many characters can't appear in Google search strings, so they must be encoded (spaces to +, etc.)
Ampersands in your URL need %nrstr or will fail in SAS
To use a new url infile in SAS, you need a new data step. This is easy with a macro loop.
Every now and then it fails: ERROR: Invalid reply received from the HTTP server. Use the debug option for more info. Beats me!

5. Figure out how to get 1000 web site listings (contd)
Code is in Example 4 Extract 1000 URLs
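The macro loop mentioned in the quirks could look roughly like the sketch below. This is not the Example 4 code. The macro name, the base address, the search string, and the step of ten results per page are all assumptions, and Google may refuse automated requests, so treat it as an outline of the technique: one FILENAME and one DATA step per page, with %nrstr protecting the ampersand.

```sas
/* Sketch only: macro name, URL, and page size of 10 are assumptions */
%macro GetGooglePages(Pages=3);
    %local i Offset;
    %do i = 1 %to &Pages;
        /* Offset of the first result on page i, assuming 10 per page */
        %let Offset = %eval((&i - 1) * 10);
        /* %nrstr keeps SAS from trying to resolve &start as a macro variable */
        filename HTML_In url
            "https://www.google.com/search?q=resort+puerto+vallarta%nrstr(&start)=&Offset"
            debug;
        /* A new URL needs a new DATA step, so each page gets its own */
        data Page&i;
            length HTML_Rec $32767;
            infile HTML_In lrecl=32767;
            input;
            HTML_Rec = _infile_;
            Page = &i;
        run;
        filename HTML_In clear;
    %end;
    /* Stack the per-page data sets into one */
    data AllPages;
        set Page1-Page&Pages;
    run;
%mend GetGooglePages;

%GetGooglePages(Pages=3)
```

Under the ten-per-page assumption, Pages=100 would cover 1000 listings. The DEBUG option on the FILENAME statement writes the HTTP exchange to the log, which is the first place to look when the "Invalid reply" error shows up.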

6. Post-process the results
Count how many times each URL appears
For each unique URL, retain the page and index where it first appears
Create a nice-looking HTML page
Code is in Example 5 Post-processed
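The counting and first-appearance logic can be sketched with a sort and BY-group processing. This is not the Example 5 code; the input data set, its variable names (Page, Index, URL), the sample rows, and the output file name are hypothetical.

```sas
/* Sketch only: data set, variable names, rows and file name are assumed */
data Extracted;
    length URL $60;
    input Page Index URL $;
datalines;
1 1 http://a.test/
1 2 http://b.test/
2 1 http://a.test/
2 2 http://c.test/
3 1 http://a.test/
;
run;

/* Order each URL's rows so its earliest appearance comes first */
proc sort data=Extracted;
    by URL Page Index;
run;

data URL_Summary;
    set Extracted;
    by URL;
    retain FirstPage FirstIndex;
    if first.URL then do;
        Count = 0;
        FirstPage = Page;      /* page where the URL first appears */
        FirstIndex = Index;    /* position on that page */
    end;
    Count + 1;                 /* sum statement: running count per URL */
    if last.URL then output;   /* one row per unique URL */
    keep URL Count FirstPage FirstIndex;
run;

/* Write the summary as an HTML page */
ods html file='url_summary.html';
proc print data=URL_Summary noobs;
run;
ods html close;
```

PROC PRINT gives a plain table; a nicer page is mostly a matter of ODS styles or PROC REPORT on the same summary data set.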


Appendix: PRX parse strings
prxid=prxparse('parse string');
/(? <a href="\/url\?q=)[[:alnum:]-
Labels on the parse string: outer control, non-captured group, any-of, one or more, as-is, as-is, escaped, escaped, grouping