Brendan Apfeld Data scientist and political science researcher

World Bank Data and wbstats Package

I’ve been doing a bit of work recently with World Bank data and thought that I would share how I’ve been using the wbstats package in R to help me do that.

wbstats is hosted on CRAN, so installing is simply install.packages("wbstats").

There are three primary functions that you’ll want to call from the package (most likely in this order): wbcache(), wbsearch(), and wb().

wbcache() will download a fresh cache of information about all of the variables available through the World Bank database. It takes a minute or two to run and downloads about 7mb of data, but it’s worth doing at the start of each project you’re working on. The variables do change and the built-in cache in the package is only updated about twice a year.

wbsearch() then allows you to search the cache for variables that fit a certain pattern. The return can include hundreds of variables, so you’ll want to store the results in a dataframe.

wb() is the heart of the package. The arguments are pretty straightforward. You specify the countries and indicators (variables) as character strings and start and end dates as integers. You can also specify if NA values are omitted and whether you want a POSICXct formatted date column. There are a few more options, though you probably won’t need to set them.

Specifying countries and indicators can be a little annoying because both require specific formatting. Countries can be entered as either iso3c or iso2c codes (see here for a nice list). Indicators have to be entered using their official World Bank variable names. This is where the wbsearch() is really going to come in handy. You can also specify “all” for countries and return all countries and all aggregates.1

Once you download your data, you’ll find that it’s in “long” format. There’s a good chance that you actually want “wide” format (where each country-year is a row in the dataframe and each indicator is a variable). I’ve found that dplyr::spread is the easiest way to deal with this problem.

So what does this look like in practice? Here’s a bit of recent code that I used, modified only so that if you run it, you won’t attempt to download 700+mb of data.

  1. Coming in version 0.1.1 will be the option to specify “countries” or “aggregates” for all countries and all aggregates, respectively. Currently CRAN only hosts version 0.1, but the development version can be installed using devtools::install_github("GIST-ORNL/wbstats")