
1 Data Collection

2 Challenges
How to collect the data?
How to store the data?
◦ Database or files?
◦ Cost of storage and bandwidth
What is the right data format?
◦ Improve readability or optimize storage?
◦ Human readable or computer processed?
How to present the data?
◦ Visualization, machine/human readable?

3 Outline
Data formats:
◦ CSV
◦ XML
◦ JSON
Data collection:
◦ Web crawlers
◦ wget
◦ APIs

4 CSV – Comma-Separated Values
Great for flat data, for example log data from web servers or sensors
Compact as text data
Easily imported into spreadsheets
Human readable
Easy sequential access

John, 2008, 20.50, Detroit, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75,, Wisconsin
Kelli, Kyle and Kat, 2010, 35.00, Miami, Florida

5 CSV – Comma-Separated Values
Strings that contain the field delimiter must be escaped
Lacks meta data; requires users to provide column information somewhere else
What if the data does not fit into discrete rows?
What if the row structure does not have a fixed size?

6
Raw data:
John, 2008, 20.50, Detroit, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75, Wisconsin
Kelli, Kyle and Kat, 2010, 35.00, Miami, Florida

Fixed data:
Name, Start Year, Hourly Pay, City, State          <- add meta data (header row)
John, 2008, 20.50, Detroit, Michigan
John, 2008, 20.50, Ann Arbor, Michigan
Michael, 2003, 55.00, San Francisco, California
Mary, 2014, 7.75,, Wisconsin                       <- add placeholder for the missing city
'Kelli, Kyle & Kat', 2010, 35.00, Miami, Florida   <- add delimiters (quotes) around a value containing commas
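These fixes need not be applied by hand. A minimal sketch (the file name mydata.csv is illustrative): Python's csv module adds the quoting automatically when writing.

import csv

rows = [
    ['Name', 'Start Year', 'Hourly Pay', 'City', 'State'],   # meta data (header row)
    ['John', 2008, 20.50, 'Detroit', 'Michigan'],
    ['Mary', 2014, 7.75, '', 'Wisconsin'],                   # empty string as placeholder
    ['Kelli, Kyle & Kat', 2010, 35.00, 'Miami', 'Florida'],  # name contains the delimiter
]

with open('mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
# The writer quotes "Kelli, Kyle & Kat" on its own, so the embedded
# commas are not mistaken for field delimiters.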

7 CSV Library in Python

# Load the csv library
import csv

# Open the file and create a reader object
with open('mydata.csv') as f:
    csv_f = csv.reader(f)

    # Loop through each row and print it
    for row in csv_f:
        print(row)

    # The reader is now exhausted; rewind the file before iterating again
    f.seek(0)

    # Loop through each row again and print the first value
    for row in csv_f:
        print(row[0])
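Once the file carries the header row from slide 6, it can also be read by column name. A short sketch assuming mydata.csv starts with that header:

import csv

with open('mydata.csv') as f:
    # skipinitialspace drops the blank after each comma;
    # the first row supplies the field names
    reader = csv.DictReader(f, skipinitialspace=True)
    for row in reader:
        print(row['Name'], row['State'])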

8 XML – eXtensible Markup Language
The data describes itself
Widely supported
Good for structured data
A tree-like model: one root element; each element may contain other elements
Verbose format => additional storage and bandwidth

<library>
  <book>
    <title>Data just right</title>
    <author>Michael Manoochehri</author>
  </book>
  <book>
    <title>Introduction to data mining</title>
    <authors>
      <author>Pang Ning Tan</author>
      <author>Michael Steinbach</author>
      <author>Vipin Kumar</author>
    </authors>
  </book>
</library>
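Reading such a document in Python is straightforward with the standard-library ElementTree module. A minimal sketch assuming the XML above is saved as library.xml (an illustrative file name):

import xml.etree.ElementTree as ET

tree = ET.parse('library.xml')
root = tree.getroot()                 # the single root element, <library>
for book in root.findall('book'):     # the elements it contains
    print(book.find('title').text)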

9 JSON – JavaScript Object Notation
A valid JavaScript object
Easy to use with JavaScript and other languages
Lighter-weight syntax than XML, so generally faster to parse
Verbose format

10 JSON – JavaScript Object Notation
The file type is .json
Data is in name/value pairs
Data is separated by commas
Curly braces designate objects
Brackets designate arrays

11 JSON – JavaScript Object Notation

{
  "books": [
    {"title": "Data just right", "author": "Michael Manoochehri"},
    {"title": "Introduction to data mining",
     "authors": [
       {"name": "Pang Ning Tan"},
       {"name": "Michael Steinbach"},
       {"name": "Vipin Kumar"}
     ]}
  ]
}

12 JSON – JavaScript Object Notation

var library = {"books": [
  {"title": "Data just right", "author": "Michael Manoochehri"},
  {"title": "Introduction to data mining",
   "authors": [{"name": "Pang Ning Tan"},
               {"name": "Michael Steinbach"},
               {"name": "Vipin Kumar"}]}
]};

// Read a value
library.books[0].title
// Assign a new value
library.books[0].title = "Data 2.0"
// Copy the first book's author into the second book's author list
library.books[1].authors[2] = library.books[0].author

13 JSON Library in Python

# Load the json library
import json

# Convert a Python object to a JSON string
var1 = ['x', {'y': ('Data Mining', 'C Programming')}]
print(json.dumps(var1))
# ["x", {"y": ["Data Mining", "C Programming"]}]

# Convert JSON data to a Python object
json_data = '["foo", {"bar":["baz", null, 1.0, 2]}]'
python_obj = json.loads(json_data)
print(python_obj)
# ['foo', {'bar': ['baz', None, 1.0, 2]}]

14 JSON Library in Python

json_data = '["foo", {"bar":["baz", null, 1.0, 2]}]'
# ['foo', {'bar': ['baz', None, 1.0, 2]}]

Type conversions performed by the json library:

JSON          Python
object { }    dict
array [ ]     list
string        str
int           int
real          float
true          True
false         False
null          None
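The table can be checked with a short round trip; note that a Python tuple goes out as a JSON array and comes back as a list:

import json

original = ('x', {'y': None, 'z': True})
encoded = json.dumps(original)    # '["x", {"y": null, "z": true}]'
decoded = json.loads(encoded)     # ['x', {'y': None, 'z': True}]
print(type(original).__name__, '->', type(decoded).__name__)   # tuple -> list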

15 Data on the Internet
Examples: sports.yahoo.com, travel.yahoo.com, finance.yahoo.com

16 Data on the Internet
[screenshot: a web page and the HTML file behind it]

17 HTML – Hypertext Markup Language

<html>
  <head>
    <title>Page Title</title>
    <script>
      // javascript code goes here
    </script>
  </head>
  <body>
    <h1>My Data</h1>
    <p>John 1987</p>
    <p>Mary 2001</p>
  </body>
</html>

18 HTTP Requests
HTTP: Hypertext Transfer Protocol
A protocol to deliver data (files/images/query results) on the World Wide Web
A browser is an HTTP client: it sends requests to a web server
The web server sends responses back to the client (user)
[diagram: Client <-> MSU Web Server]


20 Request Header
[screenshot of an HTTP request header]

21 Response Header
[screenshot of an HTTP response header]

22 Response Content
[screenshot of HTTP response content]
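The headers and content in these screenshots can also be inspected from code. A minimal sketch using Python's standard http.client (msu.edu is just the host from the diagram; any web server works):

import http.client

conn = http.client.HTTPConnection('msu.edu')
conn.request('GET', '/')                 # request line: GET / HTTP/1.1
resp = conn.getresponse()

print(resp.status, resp.reason)          # status line, e.g. 200 OK or a redirect
for name, value in resp.getheaders():    # the response header fields
    print(name + ':', value)

body = resp.read()                       # the response content (bytes)
conn.close()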

23 Outline
Data formats:
◦ CSV
◦ XML
◦ JSON
Data collection:
◦ Web crawlers
◦ wget
◦ APIs

24 Web crawlers (Spiders)
An internet program (bot) that browses the World Wide Web to collect data, e.g. to:
◦ Test if a web page has valid structure or is available
◦ Maintain mirrors of popular websites
◦ Monitor changes in content
◦ Build a special-purpose index
How it works:
◦ The bot is given a list of web pages called seeds
◦ Each seed is collected/indexed/parsed
◦ All links found inside a seed are added to the list to be visited
◦ To visit/collect a page, the bot sends an HTTP request to the server

25 Issues
Deal with a large number of pages: cannot download all of them
◦ Selection policy: which pages to visit?
Deal with changing content
◦ Re-visit policy: when to visit a page again?
Politeness policy: how often to visit the same server?
◦ How many requests per second? Do not overload the server
◦ Abide by robots.txt: a file on the server that states which pages are allowed/disallowed from being scraped (see the sketch below)
When to stop? How many levels?
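A sketch of the robots.txt check using the standard library; the user agent name 'MyBot' and the URLs are illustrative:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://msu.edu/robots.txt')
rp.read()                                # fetch and parse robots.txt

# True if 'MyBot' is allowed to scrape this page
print(rp.can_fetch('MyBot', 'http://msu.edu/somepage.html'))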

26 Breadth-First Search (BFS)
Finds pages along the shortest path from the root (see the crawler sketch below)
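A minimal BFS crawler sketch built from requests and BeautifulSoup (both covered later in this deck); the seed URL, page limit, and one-second delay are illustrative choices, and a real crawler would also apply the robots.txt check above:

import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed, max_pages=10):
    queue = deque([seed])                 # FIFO queue gives breadth-first order
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                      # skip unreachable pages
        soup = BeautifulSoup(page.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            queue.append(urljoin(url, a['href']))   # resolve relative links
        time.sleep(1)                     # politeness: one request per second
    return visited

Replacing popleft() with pop() turns the queue into a stack and the traversal into the DFS of the next slide.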

27 Depth-First Search (DFS)
Tends to wander away from the root

28 WGET
A utility to download files from the internet
Supports HTTP, HTTPS and FTP protocols
Follows links in html, xhtml and css pages: recursive download
Respects robots.txt
Many configurable options: logging, download (speed, attempts, progress), directory (inclusion/exclusion), http (username, password, user-agent, caching)
Format: wget [option] [url]
Help: wget -h

29 WGET

wget http://msu.edu
  Downloads index.html from msu.edu
wget <url of a pdf file>
  Downloads the pdf file
wget -t 5 <url>
  Retries 5 times when the attempt fails
wget -r statenews.com
  Recursively retrieves files under the hierarchy structure

30 Python requests

# A library to send HTTP requests
import requests

# Send a request
req = requests.get('http://www.espn.com')

# Examine the response
print(req.text)                       # examine content
print(req.status_code)                # e.g. 200
print(req.headers['content-type'])

31 Using the result
Parse the result received for something useful
Use libraries: json, ElementTree, MiniDom, lxml, HTMLParser (html.parser), BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

32 Using the result
Using BeautifulSoup:

from bs4 import BeautifulSoup

soupObj = BeautifulSoup(html_doc, 'html.parser')  # create a soup object from the html string
print(soupObj.prettify())   # prints with nice indentations

soupObj.title               # <title>The Dormouse's story</title>
soupObj.title.name          # 'title'
soupObj.title.string        # "The Dormouse's story"
soupObj.title.parent.name   # 'head'
soupObj.p                   # <p class="title"><b>The Dormouse's story</b></p>
soupObj.p['class']          # ['title']
soupObj.a                   # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

33 Using the result

soupObj.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soupObj.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soupObj.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

34 Using an API
API: application program interface
Uses the HTTP (HTTPS) protocol
Uses XML/JSON to represent the response
Provides a clean way to extract data
Some APIs are free
Usage limits: 5 calls/s, 1000 per day, … (see the pacing sketch below)
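Staying under such quotas is usually done by pacing calls on the client side. A generic sketch, where the 5-calls-per-second figure is the illustrative limit from above:

import time
import requests

def fetch_paced(urls, calls_per_second=5):
    """Fetch each URL while staying under the per-second quota."""
    responses = []
    for url in urls:
        responses.append(requests.get(url))
        time.sleep(1.0 / calls_per_second)   # pace the calls
    return responses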

35 Twitter
Tweets: short posts of 140 characters or less
Entities: users, hashtags, urls, media
Places
Streams: sample of public tweets flowing through Twitter
Timelines: chronologically sorted collections of tweets
◦ Home timeline: tweets from people you follow — https://twitter.com
◦ User timeline: tweets from a specific user — https://twitter.com/SocialWebMining
◦ Home timeline of someone else — https://twitter.com/SocialWebMining/following

36 Twitter Python API
Create a Twitter application account
◦ Create an app that you authorize to access your account data
◦ Obtain an app key instead of giving the password for your user account
Install the twitter library if you don't have it
Make calls to:
◦ Retrieve trends
◦ Search for tweets/retweets
◦ Search for users
Reference: Mining the Social Web, 2nd Edition

37 Authorize Twitter API

import twitter

# Obtain the values from your twitter app account
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a defined variable
print(twitter_api)

38 Get Trends

import json

WORLD_WOE_ID = 1         # Yahoo! Where On Earth ID for the entire world
US_WOE_ID = 23424977     # WOE ID for the United States

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)

print(world_trends)
print(us_trends)
# [{u'created_at': u' T11:50:40Z', u'trends': [{u'url': u'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'...

print(json.dumps(world_trends, indent=1))

39 [ { "created_at": " T11:50:40Z", "trends": [ { "url": "http://twitter.com/search?q=%23MentionSomeoneImportantForYou", "query": "%23MentionSomeoneImportantForYou", "name": "#MentionSomeoneImportantForYou", "promoted_content": null, "events": null },... ] } ] Get Trends 39

40 Search for Tweets

person = '#MentionSomeoneImportantForYou'
numberOfTweets = 20
search_results = twitter_api.search.tweets(q=person, count=numberOfTweets)
statuses = search_results['statuses']
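Each element of statuses is a nested dictionary following Twitter's JSON layout, so individual fields come out with ordinary indexing. A short sketch continuing from the search above:

for status in statuses:
    print(status['user']['screen_name'])   # who posted it
    print(status['text'])                   # the tweet itself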

