Build a Text Dataset from AMAZON

Raymond ZHAO Wenlong (03/07/2018)

Collect data in the data age
In statistical ML/DL/NLP, large volumes of data are key. We can collect that data from the World Wide Web.

HTML HTML stands for Hyper Text Markup Language.
HTML describes the structure of web pages. HTML tags are element names surrounded by angle brackets. The browser uses them to determine how to display the document.

Web scraping Download the web page and parse it. This works for static web pages; forms can be automated with Selenium.
Amazon page: the page source is shown on the right.

The process Download: Requests is an HTTP library.
Parse: Beautiful Soup parses a web page. See the developed script amazon_scraper.py.
The Hypertext Transfer Protocol (HTTP) is an application protocol. We can use it to download static web pages.
The <tr> tag defines a row in an HTML table. The <td> tag defines a standard cell in an HTML table.
The id attribute specifies a unique id for an HTML element (the value must be unique within the HTML document). The id attribute is most often used to point to a style in a style sheet, and by JavaScript to manipulate the element with that id.
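The download-and-parse steps can be sketched with the Python standard library alone. This is not the content of amazon_scraper.py, which is not shown in the slides: urllib stands in for Requests and html.parser stands in for Beautiful Soup, and the names download, CellExtractor, parse_cells, SAMPLE_HTML and the id "review-1" are all made up for illustration. The sketch collects the text of each <td> cell row by row and indexes cells by their id attribute.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def download(url):
    """Fetch a static page over HTTP and return its text (not called below)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell strings per <tr>
        self.by_id = {}      # id attribute -> cell text
        self._in_cell = False
        self._cell_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self._cell_id = dict(attrs).get("id")
            self._text = []

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            text = "".join(self._text).strip()
            if not self.rows:          # tolerate a <td> outside any <tr>
                self.rows.append([])
            self.rows[-1].append(text)
            if self._cell_id is not None:
                self.by_id[self._cell_id] = text
            self._in_cell = False


def parse_cells(html):
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows, parser.by_id


# Hypothetical page fragment; a real Amazon page is far larger and laid out differently.
SAMPLE_HTML = """
<table>
  <tr><td>Reviewer</td><td>Review</td></tr>
  <tr><td>alice</td><td id="review-1">Great laptop, fast and light.</td></tr>
</table>
"""

rows, by_id = parse_cells(SAMPLE_HTML)
print(rows)
print(by_id["review-1"])
```

With a live page, the string returned by download(url) would be passed to parse_cells in place of SAMPLE_HTML. Because an id is unique within a document, looking a cell up by id is a more stable way to locate one element than counting rows and columns.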

The dataset There are about 12k reviews for 180 laptops, with about 712k review words in total. Reviews range from 2 words to about 600 words; the mean is about 60 words. See the AMAZON dataset amazon_reviews.json. The file is in JSON (pronounced "Jay-sawn") format, which loads as a dict.
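Summary figures such as the review count, total words and mean length can be computed once the JSON is loaded as a dict. The sketch below assumes a layout that the slides do not confirm: a dict mapping each product name to a list of review strings. The sample data and the name review_stats are made up for illustration, and words are counted by splitting on whitespace, which may differ from the counting used for the numbers above.

```python
import json
from statistics import mean


def review_stats(dataset):
    """Summarise a {product: [review text, ...]} dict by whitespace word counts."""
    lengths = [len(review.split())
               for reviews in dataset.values()
               for review in reviews]
    return {
        "products": len(dataset),
        "reviews": len(lengths),
        "words": sum(lengths),
        "min_words": min(lengths),
        "max_words": max(lengths),
        "mean_words": mean(lengths),
    }


# Hypothetical miniature of the file's contents; the real layout may differ.
sample_json = """
{
  "Laptop A": ["Great laptop", "Fast, light and quiet"],
  "Laptop B": ["Battery died after one month of use"]
}
"""

dataset = json.loads(sample_json)   # a JSON object becomes a Python dict
stats = review_stats(dataset)
print(stats)
```

To run this on the real file, json.load on an open handle to amazon_reviews.json would replace json.loads(sample_json), provided the file follows the assumed layout.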

Thanks Thanks Dr. Wong, David and Linkai
