Google Refine Tutorial April, 08 2012 Sathishwaran.R - 10BM60079 Vijaya Prabhu - 10BM60097 Vinod Gupta School of Management, IIT Kharagpur This Tutorial.


1 Google Refine Tutorial, April 08, 2012. Sathishwaran.R - 10BM60079, Vijaya Prabhu - 10BM60097, Vinod Gupta School of Management, IIT Kharagpur. This tutorial was created using Google Refine version 2.5 on a Windows 7 platform.

2 Data Cleansing Data cleansing identifies wrong or inaccurate records in a data set and corrects them. It involves finding the incomplete, inaccurate, and incorrect parts of the data and then either replacing them with correct data or deleting them. The result is data that is consistent with other standard data and usable for analysis. Errors can come from data entry mistakes by the user, failures during data transmission, or improper data definitions.

3 Need for Data Cleansing Incorrect or inaccurate data can lead to false conclusions and, in finance, to misdirected investments. Governments likewise need accurate population and census data to direct funds to the areas that deserve them. Many organizations rely on customer information. If that data is inaccurate, for example if an address is wrong, the business risks sending wrong information and losing customers.

4 Challenges in Data Cleansing Loss of Information: In many cases a record is incomplete, so the whole record may have to be deleted, which loses information. This becomes costly if a large number of records are deleted. Maintenance of Data: Once the data is cleansed, any change in the data specification should affect only the new values. Data management solutions should therefore be designed so that the data entry and retrieval processes are altered to provide correct data. Data cleansing is an iterative process that requires significant work in exploring and correcting entries.

5 About Google Refine Google Refine is a powerful tool for data cleansing. It helps with working on raw data, cleaning it up, transforming it from one format to another, extending it with web services, and linking it to databases. It is easy to use, freely available, and has a web interface that works well in any browser. Google Refine is a desktop application: it runs a small web server on your system, and you point your browser at that server to use Refine.

6 Getting Started - Installation 1. Download the zip file (the appropriate Windows, Mac, or Linux version) from the link refine/wiki/Downloads?tm=2 2. Uncompress the zip file. 3. Run the “google-refine.exe” file. 4. A command window opens and Google Refine starts, taking the user to the home page in the default browser.

7 Google Refine Homepage

8 Importing Data Google Refine supports TSV, CSV, Excel (.xls and .xlsx), JSON, XML, and Google data document formats. Once imported, the data is stored in Google Refine’s own data format. This tutorial uses TSV data on disasters worldwide, available from s-worldwide-from
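
For readers who want to inspect a TSV file before importing it, the following is a minimal Python sketch of reading tab-separated data. It is not part of Google Refine. The sample text and the column names (Type, Country, Cost) are hypothetical stand-ins, not the actual disasters data set.

```python
import csv
import io

# Hypothetical sample standing in for a TSV file; the column names are assumptions.
SAMPLE_TSV = (
    "Type\tCountry\tCost\n"
    "Flood\tIndia\t1200\n"
    "flood\tChina\t\n"
    "Drought\tKenya\t300\n"
)

def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = read_tsv(SAMPLE_TSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

An empty field, such as the missing Cost in the second data row, is read as an empty string, which is the kind of gap the later cleansing steps deal with.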

9 Importing Data

10

11 Creating Project Data Uploaded

12 Creating Project Project Created

13 Faceting Faceting is about seeing the big picture and filtering down to the rows you want to change in bulk. Creating a facet on a column summarizes the values in that column, and selecting a value then filters the data to the subset of rows that satisfy that constraint. Refine offers text, numeric, timeline, and scatterplot facets, and customized facets can also be designed.
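
The idea behind a text facet can be sketched in a few lines of Python: count how often each distinct value occurs in a column, then filter to the rows matching one choice. This is an illustration of the concept only, not Refine's implementation, and the records and the column name Type are hypothetical.

```python
from collections import Counter

# Hypothetical records; in the tutorial these would come from the imported data.
records = [
    {"Type": "Flood", "Country": "India"},
    {"Type": "flood", "Country": "China"},
    {"Type": "Storm", "Country": "Japan"},
    {"Type": "Flood", "Country": "Nepal"},
]

def text_facet(rows, column):
    """Count the occurrences of each distinct value in one column."""
    return Counter(row[column] for row in rows)

def filter_rows(rows, column, choice):
    """Keep only the rows whose column value equals the selected facet choice."""
    return [row for row in rows if row[column] == choice]

facet = text_facet(records, "Type")
selected = filter_rows(records, "Type", "Flood")
print(facet)
print(selected)
```

Note that "Flood" and "flood" are counted as separate choices here, which is the redundancy the following slides remove.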

14 Faceting

15 Faceting The Type column has 18 unique options

16 Removing Redundancy Values of the same type show up as different options because they differ only in letter case

17 Removing Redundancy

18 Removing Redundancy

19 Removing Redundancy

20 Removing Redundancy Reduced to 15 unique options
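
The effect of removing case-based redundancy can be sketched in Python as follows. The slides do not state which transformation was applied, so converting to title case is an assumption made here for illustration, and the values are hypothetical.

```python
# Hypothetical column values that differ only in letter case or stray spaces.
types = ["Flood", "flood", "FLOOD", "Storm", "storm ", "Drought"]

def normalize_case(values):
    """Trim surrounding whitespace and convert each value to title case."""
    return [value.strip().title() for value in values]

before = set(types)
after = set(normalize_case(types))
print(len(before), "unique options before,", len(after), "after")
```

Normalizing the case collapses variants of the same value into one facet choice, which is why the number of unique options drops.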

21 Numeric Faceting

22 Numeric Faceting Highly clustered towards low values

23 Numeric Faceting

24 Numeric Faceting

25 Numeric Faceting The Cost column is blank and has no value

26 Numeric Faceting Calamities with low cost

27 Numeric Faceting Calamities with high cost
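
A numeric facet is essentially a histogram plus a count of blank cells. The sketch below shows that idea in Python under stated assumptions: the cost values and the bin width are hypothetical, and the function is an illustration rather than how Refine computes its facet.

```python
def numeric_facet(values, bin_width):
    """Return (histogram, blank_count); histogram maps each bin's start to a count."""
    histogram = {}
    blanks = 0
    for raw in values:
        text = raw.strip()
        if not text:
            # Blank cells are reported separately instead of being binned.
            blanks += 1
            continue
        number = float(text)  # raises ValueError on non-numeric text
        start = int(number // bin_width) * bin_width
        histogram[start] = histogram.get(start, 0) + 1
    return histogram, blanks

# Hypothetical cost values as read from a text file.
costs = ["10", "25", "", "40", "950", "", "15"]
histogram, blanks = numeric_facet(costs, 100)
low_cost = [c for c in costs if c.strip() and float(c) < 100]
print(histogram, blanks, low_cost)
```

With most values falling into the lowest bin, the histogram shows the same kind of clustering towards low values noted on the slides, and the range filter plays the role of dragging the facet's sliders.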

28 Clustering Clustering is used to find and merge choices that look similar.

29 Clustering

30 Clustering Data Merged
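
One common way to find values that "look similar" is a key-collision approach: reduce each value to a normalized key and group values that share a key. The slides do not say which clustering method was used, so the fingerprint function below is an assumed illustration of that general technique, with hypothetical values.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: lowercase, no punctuation, unique sorted tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group distinct values sharing a fingerprint; keep groups with 2+ members."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical values that differ in case, punctuation, and word order.
names = ["Mass movement wet", "mass movement, wet", "Wet mass movement", "Flood"]
clusters = cluster(names)
print(clusters)
```

Each resulting cluster is a candidate for merging into a single value; a person should still review the groups, since a shared key does not guarantee the values mean the same thing.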

31 Using Expressions Expressions are used to transform existing data to create new data.

32 Using Expressions

33 Using Expressions
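
The notion of deriving new data from existing data can be sketched in Python as a function that computes a new column from each row. This is only an analogy to Refine's expressions, written in Python rather than Refine's own expression language, and the column names (Killed, Affected, Total) are hypothetical.

```python
def add_column(rows, new_name, expression):
    """Return new rows with an extra column computed from each existing row."""
    return [{**row, new_name: expression(row)} for row in rows]

# Hypothetical rows; numbers arrive as text, as they would from a TSV import.
disasters = [
    {"Country": "India", "Killed": "120", "Affected": "3000"},
    {"Country": "Kenya", "Killed": "15", "Affected": "800"},
]

with_total = add_column(
    disasters,
    "Total",
    lambda row: int(row["Killed"]) + int(row["Affected"]),
)
print(with_total)
```

The original rows are left unchanged and the derived column is added alongside them, mirroring the way a new column is created from an existing one.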

34 Data Augmentation The Reconciliation option in Google Refine allows data to be linked to web pages. Suppose we want details on the country where a calamity struck; we can perform the following steps.

35 Reconciliation

36 Reconciliation

37 Reconciliation

38 Reconciliation

39 Reconciliation

40 Data Enrichment

41 Data Enrichment

42 Data Enrichment

43 Data Enrichment

44 Export

45 How to Use Twitter Data Step 1, Step 2

46 Step 3

47 Step 4, Step 5

48 Step 6

49 Step 7, Step 8

50 Output

51 Friends Events using Facebook data

52 Friends Events using Facebook data

53 Friends Events using Facebook data

54 Friends Events using Facebook data

55 Friends Events using Facebook data

56 Friends Events using Facebook data

57 Friends Events using Facebook data

58 Friends Events using Facebook data

59 Friends Events using Facebook data

60 Friends Events using Facebook data

61 Friends Events using Facebook data After splitting the cell using the separator },{
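
Splitting a cell on the separator },{ can be sketched in Python as shown below. The cell content and its field names (name, id) are hypothetical, since the slides show the actual Facebook data only as screenshots.

```python
import json

# Hypothetical cell holding several JSON objects joined by commas.
cell = '{"name":"Concert","id":"1"},{"name":"Meetup","id":"2"},{"name":"Quiz","id":"3"}'

# Split on the separator; each piece loses a brace on the side that was cut.
pieces = cell.split("},{")

# Strip any remaining braces, then restore them so each piece is valid JSON.
events = [json.loads("{" + piece.strip("{}") + "}") for piece in pieces]
print(events)
```

This simple split assumes flat objects. If an event contained nested objects or a text field containing },{ the pieces would be cut in the wrong place, and parsing the whole cell as a JSON array would be safer.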

62 Friends Events using Facebook data

63 Friends Events using Facebook data After updating the other columns and rearranging them, we get the events

64 Thank You

