#install.packages("gapminder")
#install.packages("plotly")
#library() stops with an error if a package is missing; require() only warns
library(gapminder)
library(tidyverse)
library(ggplot2)
library(plotly)
#we're also adding two options - one prints numbers with 2 significant digits, the other
#prevents printing numbers in scientific notation (1e...)
options(digits=2, scipen=99)
The package we’re going to use for our graphing is called ggplot2, which was written by Hadley Wickham before he developed the tidyverse. It is based on one of the most celebrated academic books on visualization, “The Grammar of Graphics” by Leland Wilkinson, first published in 1999. For our purposes, you just need to understand that any visualization is made up of several fundamental pieces. In ggplot2, the pieces we’ll use are the data, the aesthetics (which variables map to position, color and size), a geometry, facets and scales.
One quirk of ggplot2 is that it joins the pieces of a plot with + rather than the %>% pipe. ggplot2 predates the pipe, and Hadley has said he would use the pipe if he were writing the package today, but for now we have to use +.
The basic structure of the plot is:
ggplot ( data = df_name,
         aes (
           x = variable_name,
           y = variable_name,
           color = variable_name,
           size = variable_name
         )
       ) +
  a geometry (eg, geom_point, geom_bar, etc.) with options +
  facet_wrap ( variable ~ variable ) +
  any scale information
We’re going to skip the labeling and annotation parts for now. That would come under layering and legends.
Gapminder is a dataset made famous in a viral 2010 TED talk. The version in the gapminder package contains life expectancy, population and income (GDP per capita) by country, in 5-year increments from 1952 through 2007.
Just pull out the latest year (2007) for our practice.
#what is gapminder? Life expectancy by country, only goes to 2007. Take the latest year.
gapminder_2007 <-
  gapminder %>%
  filter ( year == 2007)
glimpse(gapminder_2007)
## Observations: 142
## Variables: 6
## $ country <fct> Afghanistan, Albania, Algeria, Angola, Argentina, Aust…
## $ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europ…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
## $ lifeExp <dbl> 44, 76, 72, 43, 75, 81, 80, 76, 64, 79, 57, 66, 75, 51…
## $ pop <int> 31889923, 3600523, 33333216, 12420476, 40301927, 20434…
## $ gdpPercap <dbl> 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 139…
One of the first pieces of information you often want about a dataset is its distribution. Do all of the values cluster around the center? Or do they spread out? Let’s do a histogram of the life expectancy variable.
Start your plot with the command ggplot, then add in the aesthetics and geometry:
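A minimal sketch of that first histogram. The 2007 subset is rebuilt so the snippet runs on its own, and the object name hist_plot is only illustrative. With no binwidth given, ggplot2 falls back to 30 bins and prints a message saying so.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# rebuild the 2007 subset so this snippet stands alone
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# aesthetics: life expectancy on the x axis; geometry: a histogram
# (hist_plot is an illustrative name, not one used elsewhere in this walkthrough)
hist_plot <-
  ggplot(data = gapminder_2007,
         aes(x = lifeExp)) +
  geom_histogram()

hist_plot
```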
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We’ll add a few options to this plot to make it a little easier to read:
#make it an outline, with fewer, wider bins (5 years each).
ggplot ( data= gapminder_2007,
         aes (x= lifeExp)
       ) +
  geom_histogram (binwidth=5, color="black", fill="white")
But I’m interested in how the different continents look. Try “faceting” by continent.
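One way to facet the histogram, sketched under the assumption that one small panel per continent is what we want. facet_wrap takes a formula, and ~continent means "one panel for each value of continent". The name facet_hist is illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# rebuild the 2007 subset so this snippet stands alone
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# same outlined histogram, split into one panel per continent
# (facet_hist is an illustrative name)
facet_hist <-
  ggplot(data = gapminder_2007,
         aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "black", fill = "white") +
  facet_wrap(~continent)

facet_hist
```

Oceania has only two countries in this dataset, so its panel will look nearly empty.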
You can save your plot to an object rather than print it immediately, making it a little easier to troubleshoot.
my_plot <-
  ggplot (
    data = gapminder_2007,
    aes (x= gdpPercap , y = lifeExp)
  )
#what does this look like?
my_plot
It doesn’t look like anything! The reason is that I didn’t include a geometry, or a shape, for the values. Add a point here:
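A sketch of adding the point geometry to the saved object. The base plot is rebuilt here so the snippet runs by itself.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# rebuild the 2007 subset and the empty base plot
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap, y = lifeExp))

# add one point per country and save the result back to the same object
my_plot <- my_plot +
  geom_point()

my_plot
```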
There are some built-in themes that take some best practices for mixes of colors and styles, so I’ll add one in.
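As a sketch, here is the scatterplot with theme_minimal(), the theme used in the interactive plots later on. Other built-in themes such as theme_bw() or theme_light() are added the same way.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# rebuild the 2007 subset and the scatterplot so this snippet stands alone
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# a theme is added like any other piece of the plot
my_plot <- my_plot +
  theme_minimal()

my_plot
```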
Let’s add some color. Remember, we keep adding elements to our existing plot, so we don’t have to start over each time.
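One way to do that, sketched here, is to add another aes() call to the saved plot. It merges the new mapping into the aesthetics already there, so x and y stay as they were.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# rebuild the plot as it stands so far
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  theme_minimal()

# map continent to color on top of the existing x and y mappings
my_plot <- my_plot +
  aes(color = continent)

my_plot
```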
And now add population for the size of the country
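A sketch of the same pattern for size, with the colored plot rebuilt first so the snippet is self-contained:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# rebuild the colored scatterplot
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  theme_minimal()

# scale each point by the country's population
my_plot <- my_plot +
  aes(size = pop)

my_plot
```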
Let’s build this from scratch, and then also make sure that the big points don’t overlap the little ones too much:
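A sketch of the full plot written out in one go. The alpha option makes the points partly transparent so small countries show through large ones. The value 0.7 matches the interactive plots later on, but any value between 0 and 1 works.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# all four aesthetics at once, plus semi-transparent points
my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap,
             y = lifeExp,
             color = continent,
             size = pop)) +
  geom_point(alpha = 0.7) +
  theme_minimal()

my_plot
```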
Now let’s make a little chart for each continent
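Faceting the scatterplot works the same way as faceting the histogram. A sketch, with the plot rebuilt so it runs on its own:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap,
             y = lifeExp,
             color = continent,
             size = pop)) +
  geom_point(alpha = 0.7) +
  theme_minimal()

# one small chart per continent, all sharing the same axes
my_plot <- my_plot +
  facet_wrap(~continent)

my_plot
```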
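It’s hard to see because GDP is so skewed: a few rich countries stretch the axis and squash everyone else into the left edge. That can be fixed with something called a log scale, which spaces the axis by powers of 10 (1,000, 10,000, 100,000) instead of in equal steps.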
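A sketch of putting the x axis on a base-10 log scale with scale_x_log10(). If a plot already has an x scale, ggplot2 prints a message that the new one replaces it.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

my_plot <-
  ggplot(data = gapminder_2007,
         aes(x = gdpPercap,
             y = lifeExp,
             color = continent,
             size = pop)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  facet_wrap(~continent)

# spread out the low-income countries by putting GDP on a log scale
my_plot <- my_plot +
  scale_x_log10()

my_plot
```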
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
ggplot2 plots are not interactive, so we need a different library, plotly, to let us hover over the points. To control what the hover label says, we’re going to use a function called paste, which joins pieces of text together, and put the result in an aesthetic called “text” that plotly reads:
#install.packages("plotly")
library(plotly)
my_plot <-
  ggplot (data = gapminder_2007,
          aes(text = paste("country: ", country),
              x= gdpPercap ,
              y = lifeExp,
              color= continent,
              size=pop)
          ) +
  geom_point( alpha= 0.7) +
  theme_minimal() +
  facet_wrap (~continent) +
  scale_x_log10()
#We have to make it a ggplotly to get it interactive.
my_plot <- ggplotly(my_plot)
my_plot
If you want to look at lots more examples, check out Matt Waite’s data visualization course on GitHub.
So far, we’ve only looked at the year 2007 – what if we wanted to look at it over time, say 50 or so years? Let’s take the years 1957 to 2007:
df_gap <-
  gapminder %>%
  filter (between (year, 1957, 2007))
# and let's just check it
df_gap %>%
  group_by (year) %>%
  summarise (n() ) %>%
  arrange (year)
## # A tibble: 11 x 2
## year `n()`
## <int> <int>
## 1 1957 142
## 2 1962 142
## 3 1967 142
## 4 1972 142
## 5 1977 142
## 6 1982 142
## 7 1987 142
## 8 1992 142
## 9 1997 142
## 10 2002 142
## 11 2007 142
Let’s just repeat what we did to get the original interactive plot, with one change: The plotly package lets you specify a “frame” aesthetic, which will create animation by whatever variable you specify. Here, we’ve specified the year as that frame:
library(plotly)
#this option forces it to show real numbers, not the 1e things.
options (scipen=99)
my_plot <-
  ggplot (data = df_gap,
          aes(text = paste("country: ", country),
              x= gdpPercap ,
              y = lifeExp,
              color= continent,
              size=pop,
              frame = year)
          ) +
  geom_point( alpha= 0.7) +
  theme_minimal() +
  facet_wrap (~continent) +
  scale_x_log10()
#We have to make it a ggplotly to get it interactive.
my_plot <- ggplotly(my_plot)
my_plot