Web scraping (also referred to as web data extraction or web harvesting) is the process of using software to fetch the contents of a web page and extract information from it for use in some analysis. In Displayr, you may want to include a visualization or analysis of web-based data in your dashboard. In this article I show you how to use the rvest package in R to bring in some data from a web page, and then connect that data to a visualization.
The example scrapes a data table from a Wikipedia article. The data contains information about life expectancy for different groups within each US state and territory.
This is a relatively simple version of web scraping, because it only requires me to obtain a single web page and locate and store a single element of the page. Even better, the data itself is already quantified and tabular. I will focus on charting the overall life expectancy, as the other columns contain quite a bit of missing data.
Web scraping can be more ambitious than this too. With purpose-built software, or R packages like RSelenium, you can automate the process of navigating through a sequence of web pages, scraping data as you go. Alternatively, many web applications like Facebook and Twitter provide APIs, which is to say that you can write code to draw in data directly from their databases, rather than scraping it from their web pages.
Obtaining the data
For small examples like this, we can use an R Output to add a table of data to the Displayr document. If the data set is large, or you want to take advantage of Displayr’s tables and other features, you can add the data as an R Data Set instead. To do this, click Home > New Data Set, and use the R option. This will allow you to add the data as a collection of variables that can be modified and tabulated using Displayr's built-in statistical engine. For more on adding data sets, see Introduction to Displayr 2: Getting Your Data into Displayr.
In this example we will add the table of data as an R Output:
- Select Insert > R Output.
- Paste the code below into the R CODE field.
- Click Calculate.
The code for fetching and tidying the life expectancy information from the Wikipedia table is:
    library(rvest)

    # Reading in the table from Wikipedia
    page = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")

    # Obtain the piece of the web page that corresponds to the "wikitable" node
    my.table = html_node(page, ".wikitable")

    # Convert the html table element into a data frame
    my.table = html_table(my.table, fill = TRUE)

    # Extracting the columns from the table and turning into a named vector
    # ([[ ]] returns a plain vector for both data frames and tibbles)
    x = my.table[[4]]
    names(x) = my.table[[3]]

    # Excluding non-states and averages from the table
    x = x[!as.character(my.table[[3]]) %in% c("78.9", "Northern Mariana Islands", "Guam", "American Samoa", "Puerto Rico", "U.S. Virgin Islands")]
    names(x)[names(x) == "Georgia"] = "GA"

    # Outputting as an object with a better name
    life.expectancy = x
This code produces a table containing the data from Wikipedia.
In this code I have:
- Loaded the rvest package, which has functions for dealing with web pages (it also makes available functions from the xml2 package that are handy for processing html).
- Used the function read_html to obtain the html for the web page.
- Used the function html_node to obtain the part of the web page that corresponds to the table element (selected with the CSS class selector .wikitable).
- Used the function html_table to convert the html table into a data frame.
- Extracted the overall life expectancy column as a single vector, and named that vector according to the state/territory name.
- Excluded the territories and average rows, and tidied some of the state names (relabeling Georgia as GA). This is necessary for creating the visualization described in the next section.
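The steps above can also be tried without a network connection on a small inline page. The sketch below is only an illustration: the HTML string, the region names, and the values are made up, and the object names (toy.table, toy.values) are hypothetical. Only read_html, html_node, and html_table come from rvest.

```r
library(rvest)

# Made-up stand-in for the Wikipedia page: one table with class "wikitable".
# The region names and values are placeholders, not real data.
html = '<html><body>
<table class="wikitable">
<tr><th>Rank</th><th>Region</th><th>Life expectancy</th></tr>
<tr><td>1</td><td>Region A</td><td>80.5</td></tr>
<tr><td>2</td><td>Region B</td><td>78.0</td></tr>
</table>
</body></html>'

# Parse the HTML (read_html accepts a literal HTML string as well as a URL)
page = read_html(html)

# Select the first element matching the CSS class selector ".wikitable"
node = html_node(page, ".wikitable")

# Convert the html table element into a data frame
toy.table = html_table(node)

# Build a named numeric vector: values from column 3, names from column 2.
# [[ ]] returns a plain vector whether html_table gives a data frame or a tibble.
toy.values = as.numeric(toy.table[[3]])
names(toy.values) = as.character(toy.table[[2]])

print(toy.values)
```

Working on a small literal page like this makes it easier to confirm which column positions hold the names and the values before pointing the same code at a live URL.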
To see the original dashboard, click the button above. You can see the original R code by clicking on the table in the dashboard.
Data like this could be visualized using a bar or column chart, but we can go one step further. An attractive alternative is to plot the data using a geographic map (also known as a choropleth). Such a chart is like a heatmap, where the elements of the heatmap are placed on a map of the US. The shading indicates the relative life expectancy in each state.
We see the longest life expectancy in California, Hawaii, Minnesota, and Connecticut. The lowest are in Mississippi and West Virginia.
To add a map like this in Displayr:
- Select Insert > Visualization > Geographic Map.
- Click Inputs > DATA SOURCE > Output in 'Pages', and select the table of data you want to display.
- Tick Automatic. This way, if the input data changes, or the settings change, the map will keep itself up to date.
To get the map running smoothly, you need to be a bit careful with the labels. In the code above, I tidied some of the state names and removed some erroneous values so that the map could automatically detect the region that I wanted to plot. It is also possible to plot world maps, and maps of particular countries.
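One way to check the labels before mapping is to compare them against a list of expected names. The sketch below is an assumption on my part rather than anything Displayr requires: it uses R's built-in state.name vector (the 50 US state names) as the reference list, and the vector toy is a hypothetical input whose values are placeholders.

```r
# Placeholder values; only the names matter for this check
toy = c("Hawaii" = 1, "Guam" = 2, "Georgia" = 3, "78.9" = 4)

# state.name is a built-in vector of the 50 US state names
unexpected = setdiff(names(toy), state.name)
print(unexpected)

# Keep only the entries whose name is a recognized state
states.only = toy[names(toy) %in% state.name]
print(names(states.only))
```

Note that state.name does not include the District of Columbia, so a check like this would flag it along with the territories and any stray numeric labels.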
With a little R code, it's easy to supplement your report with some data scraped from the web!