36-651/751: Web Scraping in Practice

– Spring 2019, mini 3 (last updated February 7, 2019)

We’ve talked all about how the Internet and the Web work. But how do we put this into practice to obtain data from websites?

There are many interesting things to extract out of Web pages:

Datasets
Some websites display data tables about things, but don’t actually let you download a convenient CSV, so you have to extract data the hard way.
Attributes of things
Maybe you want to get the contributors to GitHub repositories, the editors of Wikipedia pages, or the numbers of citations received by scientific papers.
Text
You can, say, extract all the tweets about a specific topic, or all the reddit posts in a certain subreddit, or all the privacy policies on popular websites, and do interesting text analysis.

Sometimes websites make it easy to get these things; sometimes we have to do it the hard way.

REST and APIs

These days, lots of websites provide ways for programs to easily interact with them to extract data. Rather than forcing you to scrape the pages the hard way, a site might provide a simple interface to request data and perform common operations. These are usually referred to as APIs (Application Programming Interfaces), and work using ordinary HTTP requests.

For example, Twitter has APIs to search tweets or integrate with Direct Messages (e.g. so you can make a customer service robot). GitHub’s API lets you extract information about public repositories, receive notifications about events on repositories you have access to (we used this in Stat Computing to record pull request approvals for grading), make comments, open issues, and so on. Wikipedia’s API lets you fetch pages, make edits, upload files, and so on (it’s often used by robots that correct common formatting mistakes and detect vandalism). The arXiv API lets you search papers, download abstracts, and fetch PDFs.

Many of these APIs use the idea of REST, or Representational State Transfer. REST APIs often use HTTP requests that send and receive data in XML or JSON formats.

The basic idea is this: each resource – a user, a repository, a tweet – gets its own URL, and you operate on it with ordinary HTTP methods. A GET request fetches a representation of the resource, usually as JSON or XML; a POST creates a new one; PUT or PATCH updates it; and DELETE removes it. Each request is self-contained, so the server doesn't need to remember anything about your previous requests.

Because REST APIs use simple, well-defined data formats and HTTP requests, it’s often easy to make packages that wrap the API in the programming language of your choice. PyGitHub, for example, provides Python functions and classes that automatically do the necessary REST API calls to GitHub. If you want to use a well-known website’s API, check if there’s a package for your language.
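
For instance, a sketch of listing a user's repositories with PyGitHub might look something like this:

from github import Github  # the PyGitHub package

g = Github()  # anonymous access; pass a token here for private data or higher rate limits
user = g.get_user("capnrefsmmat")

for repo in user.get_repos():  # PyGitHub makes the REST calls to GitHub for us
    print(repo.full_name)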

Let’s try using GitHub’s API without using a package.

library(httr)

## This API endpoint fetches all of my repositories
## https://developer.github.com/v3/repos/#list-user-repositories
r <- GET("https://api.github.com/users/capnrefsmmat/repos")

status_code(r) # 200

headers(r)[["content-type"]]  # "application/json; charset=utf-8"

str(content(r))
## An enormous list, starting with
## List of 14
##  $ :List of 72
##   ..$ id               : int 51103677
##   ..$ node_id          : chr "MDEwOlJlcG9zaXRvcnk1MTEwMzY3Nw=="
##   ..$ name             : chr "confidence-hacking"
##   ..$ full_name        : chr "capnrefsmmat/confidence-hacking"
##   ..$ private          : logi FALSE
##   ..$ owner            :List of 18
##   .. ..$ login              : chr "capnrefsmmat"
##   .. ..$ id                 : int 711629
##   .. ..$ node_id            : chr "MDQ6VXNlcjcxMTYyOQ=="
##   .. ..$ avatar_url         : chr "https://avatars3.githubusercontent.com/u/711629?v=4"
##   .. ..$ gravatar_id        : chr ""
##   .. ..$ url                : chr "https://api.github.com/users/capnrefsmmat"
## ...

Notice that httr helpfully parses the JSON returned by GitHub into nested lists for us, using the jsonlite package. It did that by inspecting the Content-Type header.

jsonlite can also read from websites directly, so if all you need is a GET request that returns JSON, you can simply do

library(jsonlite)

r <- fromJSON("https://api.github.com/users/capnrefsmmat/repos")

jsonlite, being clever, notices that the data is a list of repositories, each with the same attribute names, and makes a data frame out of the result instead of lists of lists of lists. (Some of the columns of the data frame, like owner, are data frames themselves…)

I can also use a POST request to create a repository, though this requires me to prove to GitHub who I am, which I’ll leave out for simplicity:

library(httr)

r <- POST("https://api.github.com/user/repos",
          body = list(name = "new-repository",
                      description = "My cool repo"),
          encode = "json")

APIs often do authentication with “tokens” – basically a secret password you supply to the server with each request, often in a header – or with OAuth, which is somewhat complicated and best left to packages that handle all its details.
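
For example, if you had a GitHub personal access token stored in a variable called my_token (a made-up name here), you could attach it to the request above with an Authorization header, roughly like this:

library(httr)

r <- POST("https://api.github.com/user/repos",
          body = list(name = "new-repository",
                      description = "My cool repo"),
          encode = "json",
          add_headers(Authorization = paste("token", my_token)))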

Scraping HTML

Sometimes the data you want isn’t available through a convenient API, but it’s right there in the HTML, so surely you can get it out somehow!

This is scraping. It requires a few steps:

  1. Make an HTTP request for the right Web page.
  2. Parse that HTML.
  3. Extract out text and data from the parsed HTML structure.

Last time we talked about packages like Requests and httr that make HTTP requests, so let’s skip that step and talk about parsing HTML and extracting data.

Parsing HTML

HTML, the HyperText Markup Language, is a way of marking up text with various attributes and features to define a structure – a structure of paragraphs, boxes, headers, and so on, which can be given colors and styles and sizes with another language, CSS (Cascading Style Sheets).

For our purposes we don’t need to worry about CSS, since we don’t care what a page looks like, just what’s included in its HTML.

HTML defines a hierarchical tree structure. Let’s look at a minimal example:
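
<!DOCTYPE html>
<html>
  <head>
    <title>My example page</title>
  </head>
  <body>
    <h1>A heading</h1>

    <p>Here is a paragraph of text, which wraps across
      several lines in the source file.

    <p>Here is another paragraph.

    <table class="datatable">
      <tbody>
        <tr id="importantrow">
          <td>Some data</td>
          <td>More data</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>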

Notice some features of HTML:

Tags
Tags are named inside angle brackets, like <h1>. There is a set of tags with predefined meanings, like <p> for paragraphs and <table> for tables.
Tag hierarchy
Tags enclose text and other tags: tags open with a form like <p> and close with </p>, and everything in between is enclosed in those tags. All tags have to be closed, except those that don’t. (Since people writing web pages were very bad at remembering to close tags, browsers now have standard rules for inferring when you meant to close a tag; notice the paragraphs above aren’t closed.)
Attributes
Tags can have attributes, which are key-value pairs describing the content inside them. Many attributes have specific meaning: id is used for unique identifiers for elements, which can be used in JavaScript or CSS to modify those elements, and a class can be assigned to many elements which should somehow behave in the same way.
Escaping
Characters like <, >, and & have specific meanings in HTML. If you want to write < without it starting a new tag, you have to escape it by writing &lt;. There are many escapes, like &copy; for the copyright symbol, and numeric escapes for specifying arbitrary Unicode characters. These are called “HTML entities”.
Whitespace
The amount of whitespace has no meaning in HTML. You could put the above HTML document on one line or four hundred, if you’d like. Spaces are collapsed: writing two spaces between words is the same as writing one space between words. Hence lines that wrap, above, don’t have excess spaces because of the indentation.

HTML’s complex structure makes it difficult to parse; the HTML standard chapter on syntax has headings going as deep as “12.2.5.80 Numeric character reference end state”. Do not attempt to parse HTML with regular expressions.

If you use an HTML parsing package – rvest uses the libxml2 parser underneath, while Beautiful Soup can use html5lib, among other parsers – it will handle all the complexity for you.

For example, for the example HTML file above:
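
from bs4 import BeautifulSoup

# a sketch assuming the example above is saved as example.html
# (html5lib is a separate package you may need to install)
with open("example.html") as f:
    soup = BeautifulSoup(f, "html5lib")

soup.title            # <title>My example page</title>
soup.find_all("td")   # all the td tags, as a list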

Or, in R,
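
library(rvest)

# again assuming the example above is saved as example.html
page <- read_html("example.html")

html_nodes(page, "td")              # all the td tags
html_text(html_nodes(page, "td"))   # just the text inside them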

XPath, CSS selectors, and document trees

Often web pages are made of huge complicated HTML documents, and you only need the contents of a few specific tags. How do you extract them from the page?

There are several ways to do this, but they come down to needing a selector: some specification of the type or name of the tag we want.

There are several common types of selector. The simplest is the CSS selector, used when making Cascading Style Sheets. A CSS selector might look like this: .datatable tr#importantrow td.

That means:

.datatable
Any element with the class attribute “datatable”.
tr#importantrow
Any tr element with the ID “importantrow”.
td
Any td element.

These are interpreted hierarchically, so put together in one selector, this identifies all td elements inside a tr whose ID is “importantrow” inside some element with class “datatable”. This will match two td elements in the example above. (Note that the tbody is not in the selector, but that is not a problem; any td inside a tr#importantrow matches, even if there are enclosing tags in between.)

There are various other syntaxes. We can write p > a to find a tags immediately inside p without any enclosing tags, so specifically excluding a situation like
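
<p><em>Some text with <a href="https://www.example.com/">a link</a> inside an em tag.</em></p>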

We can use .class and #id on tags or without a tag name, depending on how specific we want to be. There are other kinds of selectors, like selectors for tags with specific attributes; the MDN selector tutorial is a good starting point to learn more.

rvest uses CSS selectors by default. The html_nodes function I used above takes a CSS selector and returns a list of HTML tags matching that selector, then lets you do things to them.

Beautiful Soup supports CSS selectors. You can use
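
soup.select(".datatable tr#importantrow td")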

to get a list of the two tags matching the selector.

Another common syntax is XPath, although people often don’t like this because it’s very complicated. rvest supports XPath if you want it. An XPath selector like

//table[@class='datatable']//tr[@id='importantrow']//td

does the same thing as the CSS selector above. XPath can express arbitrarily complicated queries with all kinds of conditions.

(I did not actually try this XPath selector to make sure it works.)

You can also try using your browser’s developer tools to find the HTML tags and selectors you need; let’s do a live demo of that.

Scraping politely

Typical scraping might involve something like this:

  1. You start with a specific Web page or list of pages.
  2. Your program scrapes those pages to extract data, and also extracts links to other relevant pages.
  3. The other relevant pages are put into a queue to scrape next, and scraping proceeds until there are no pages left in the queue.

This is quite a common pattern because we usually don’t have all the URLs we want to scrape in advance. If I’m scraping Wikipedia, rather than downloading a list of all the Wikipedia pages in advance, I’d rather start with a few pages and follow links to find others.
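
In code, a bare-bones version of that loop might look something like this sketch, here starting from an arbitrary Wikipedia page and leaving the actual data extraction as a comment:

from collections import deque
from urllib.parse import urljoin
import time

import requests
from bs4 import BeautifulSoup

queue = deque(["https://en.wikipedia.org/wiki/Web_scraping"])  # starting page
seen = set(queue)

while queue and len(seen) < 50:   # stop after 50 pages for this sketch
    url = queue.popleft()
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # ... extract whatever data you want from soup here ...

    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])   # resolve relative links
        if link.startswith("https://en.wikipedia.org/wiki/") and link not in seen:
            seen.add(link)
            queue.append(link)

    time.sleep(1)   # wait between requests; see below on scraping politely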

Web site owners, however, often don’t like this pattern. If you don’t build restraint into your scraping script, it might send dozens of requests per second to fetch new pages, and it may scrape parts of the website that are not intended to be accessible to robots – things like dynamically generated pages that are slow to produce, or private user profiles, or copyrighted images, or other things the site owner doesn’t want downloaded en masse.

To prevent this, Web site owners can use robots.txt, a standard file that specifies what robots should be allowed to do on a website. The standard format is quite simple and easy to read.

A robots.txt file is placed in the root directory of a website, like http://www.example.com/robots.txt. It is a plain text file (no formatting, not RTF, just text) with contents like

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

User-agent: *
Crawl-delay: 5

This asks Googlebot to stay out of certain directories entirely and asks all robots to wait 5 seconds between requests. (The Crawl-delay directive is unofficial and not respected by all robots.)

You should respect robots.txt if possible. The R package robotstxt can parse robots.txt files and tell you what pages you’re allowed to scrape, and Python’s urllib has a robotparser module as well.
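
For example, checking the hypothetical example.com robots.txt above with urllib might go something like this:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the file

rp.can_fetch("Googlebot", "http://www.example.com/tmp/foo.html")   # False: /tmp/ is disallowed for Googlebot
rp.can_fetch("MyScraper", "http://www.example.com/data/bar.html")  # True: nothing disallowed for other robots
rp.crawl_delay("MyScraper")                                        # 5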

Python users may prefer to use Scrapy, a package that automatically handles everything: processing robots.txt, maintaining a queue of pages to visit, extracting data from pages, and storing data in an output file. Here’s an example from the documentation:
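
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/tag/humor/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.xpath("span/small/text()").extract_first(),
            }

        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)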

This starts at a specific URL, selects elements from the requested page (using both CSS and XPath selectors), and yields a dictionary of values selected from the page, as well as yielding subsequent pages to visit. Scrapy handles the scheduling, writes the data to a JSON file (you specify the output file when you run Scrapy at the command line), and automatically skips requests forbidden by robots.txt (provided you set the option to do so).

Driving a web browser

Sometimes it’s not enough to scrape a website by sending it HTTP requests directly, or to use its API. Maybe the website relies on a bunch of JavaScript that has to be run by a Web browser, or it doesn’t like being accessed by robots.

In that case, you need a real browser.

Tools like Selenium let you automate a web browser. The Selenium WebDriver lets you start a Web browser – like Chrome or Firefox – and control it from a program, then reach in and inspect the contents of the web pages being displayed. Selenium can be used from within Python and R, and from many other languages. Here’s a Python example from the documentation:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.python.org")
assert "Python" in driver.title

elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

assert "No results found." not in driver.page_source
driver.close()

The element with name q is the search box, so we are literally typing the text pycon into that box, hitting Enter, and checking that the string “No results found” is not in the resulting page.

Selenium can find elements by name, but there are also methods like driver.find_element_by_css_selector and find_element_by_xpath, among others. You can even get screenshots of the page if you want to do some kind of image analysis or interact with some graphical thing, using driver.save_screenshot.
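
For instance, a hypothetical addition to the script above (placed before the driver.close() call) might grab the result links and save an image of the page:

links = driver.find_elements_by_css_selector("a")   # every link on the results page
driver.save_screenshot("results.png")               # writes a PNG of the current view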

In the example above, Selenium uses Firefox. (You need to have Firefox installed separately.) It also supports Chrome and Internet Explorer.

Just remember that you usually don’t need this. If all you need is the contents of Web pages, use a package for Web scraping or for HTTP requests; you don’t need the massive complexity of a Web browser unless you’re depending on it to do things like run JavaScript and play videos.