Coloring Under the Lines in ggplot
I am tasked with explaining incredibly complex things to people who do not have a lot of time. Consequently, using visuals has been a life saver.
One day I was visiting a school explaining the Common Eurpoean Framework of Reference for Languages, which, in a nutshell, describes what language learners can do at different levels of proficiency AND the number of hours it takes for them to progress to each level.
During the presentation I used the following table in a slide:
Image Courtesy of Keep Calm and Teach English
While that image is informative, it is, in my humble opinion, a little hard to comprehend in comparison to this one:
So how do you make the plot above? Glad you asked :)
Step 1: Create the data frame
As the table above shows, there are seven levels we want to represent (A0 to C2) and a range of hours from 0 - 1200.
library(tidyverse)
library(knitr) #To make the table look pretty on HTML
cefr_hours <- tibble(cefr = as_factor(c("A0", "A1", "A2", "B1", "B2", "C1", "C2")),
hours = c(0, 100, 200, 400, 600, 800, 1200))
kable(cefr_hours)
cefr | hours |
---|---|
A0 | 0 |
A1 | 100 |
A2 | 200 |
B1 | 400 |
B2 | 600 |
C1 | 800 |
C2 | 1200 |
Step 2: Expand the data frame
In order to color the sections between the levels, we need to create groups so that ggplot()
divides the the plot based on the correct levels. To do that, we’ll simply double the data frame.
cefr_hours <- tibble(cefr = as_factor(c("A0", "A1", "A2", "B1", "B2", "C1", "C2")),
hours = c(0, 100, 200, 400, 600, 800, 1200))
cefr_hours <- cefr_hours %>%
bind_rows(cefr_hours)
kable(cefr_hours)
cefr | hours |
---|---|
A0 | 0 |
A1 | 100 |
A2 | 200 |
B1 | 400 |
B2 | 600 |
C1 | 800 |
C2 | 1200 |
A0 | 0 |
A1 | 100 |
A2 | 200 |
B1 | 400 |
B2 | 600 |
C1 | 800 |
C2 | 1200 |
Step 3: Create groups
Next, we rearrange the data frame by CEFR level (more on that later) and create a group for each level. To do so, we create a new column called group
using dplyr::mutate
.
cefr_hours <- tibble(cefr = as_factor(c("A0", "A1", "A2", "B1", "B2", "C1", "C2")),
hours = c(0, 100, 200, 400, 600, 800, 1200))
cefr_hours <- cefr_hours %>%
bind_rows(cefr_hours) %>%
arrange(cefr) %>%
mutate(group = ceiling((row_number() - 1) / 2))
kable(cefr_hours)
cefr | hours | group |
---|---|---|
A0 | 0 | 0 |
A0 | 0 | 1 |
A1 | 100 | 1 |
A1 | 100 | 2 |
A2 | 200 | 2 |
A2 | 200 | 3 |
B1 | 400 | 3 |
B1 | 400 | 4 |
B2 | 600 | 4 |
B2 | 600 | 5 |
C1 | 800 | 5 |
C1 | 800 | 6 |
C2 | 1200 | 6 |
C2 | 1200 | 7 |
If we don’t use arrange()
we get the following mess.
cefr_hours <- tibble(cefr = as_factor(c("A0", "A1", "A2", "B1", "B2", "C1", "C2")),
hours = c(0, 100, 200, 400, 600, 800, 1200))
cefr_hours <- cefr_hours %>%
bind_rows(cefr_hours) %>%
mutate(group = ceiling((row_number() - 1) / 2))
kable(cefr_hours)
cefr | hours | group |
---|---|---|
A0 | 0 | 0 |
A1 | 100 | 1 |
A2 | 200 | 1 |
B1 | 400 | 2 |
B2 | 600 | 2 |
C1 | 800 | 3 |
C2 | 1200 | 3 |
A0 | 0 | 4 |
A1 | 100 | 4 |
A2 | 200 | 5 |
B1 | 400 | 5 |
B2 | 600 | 6 |
C1 | 800 | 6 |
C2 | 1200 | 7 |
“What about ceiling()
?”
Good question!
We use ceiling()
in order to create the groups. Since we want “A1 to A2” to be one group, we need to return whole numbers. For more on how to use ceiling()
please click here.
Step 4: Remove Unecessary Groups
Since we don’t want the first or last level to be a group unto itself, we use dplyr::filter()
to remove the first and the last group by saying group
is equal to all rows except for the min()
and max()
(i.e., the first and the last).
cefr_hours <- tibble(cefr = as_factor(c("A0", "A1", "A2", "B1", "B2", "C1", "C2")),
hours = c(0, 100, 200, 400, 600, 800, 1200))
cefr_hours <- cefr_hours %>%
bind_rows(cefr_hours) %>%
arrange(cefr) %>%
mutate(group = ceiling((row_number() - 1) / 2)) %>%
filter(group != min(group), group != max(group))
kable(cefr_hours)
cefr | hours | group |
---|---|---|
A0 | 0 | 1 |
A1 | 100 | 1 |
A1 | 100 | 2 |
A2 | 200 | 2 |
A2 | 200 | 3 |
B1 | 400 | 3 |
B1 | 400 | 4 |
B2 | 600 | 4 |
B2 | 600 | 5 |
C1 | 800 | 5 |
C1 | 800 | 6 |
C2 | 1200 | 6 |
Step 5: Make the plot
From here, it is simply a matter of plugging the data into ggplot()
.
ggplot(data = cefr_hours, mapping =aes(x= cefr, y=hours, group = group, fill = group)) +
geom_ribbon(aes(ymin = 0, ymax = hours))
But, of course, when we’re talking about ggplot()
, that means we have no end of options at our disposal.
ggplot(data = cefr_hours, mapping =aes(x= cefr, y=hours, group = group, fill = group)) +
geom_ribbon(aes(ymin = 0, ymax = hours)) +
scale_color_brewer(palette = "Blues") +
theme_minimal() + # Set the theme
labs(title = "Hours of Guided Learning Per Level", # Give the plot a title
subtitle = "Source: Cambridge English Assessment", # Give it a subtitle
x = "", # Remove the title on the x axis
y = "") + # Remove the title on the y axis
theme(legend.position = "none", # Delete the legend
axis.text.x = element_text(size = 20), # Set the size to 20
axis.text.y = element_text(size = 20), # Set the size to 20
plot.title = element_text(size = 25)) # Set the size to 25
Finally, a special thanks to Jordo82 whose answer to my question enabled me to make this plot.
#Complete code
library(tidyverse)
library(knitr)
cefr_hours <- tibble(cefr = as_factor(c("A0", "A1", "A2", "B1", "B2", "C1", "C2")),
hours = c(0, 100, 200, 400, 600, 800, 1200))
cefr_hours <- cefr_hours %>%
bind_rows(cefr_hours) %>%
arrange(cefr) %>%
#create a group for A2 to B1, then B1 to B2, etc.
mutate(group = ceiling((row_number() - 1) / 2)) %>%
#exclude the first and last points
filter(group != min(group), group != max(group))
ggplot(data = cefr_hours, mapping =aes(x= cefr, y=hours, group = group, fill = group)) +
geom_ribbon(aes(ymin = 0, ymax = hours)) +
scale_color_brewer(palette = "Blues") +
theme_minimal() +
labs(title = "Hours of Guided Learning Per Level",
subtitle = "Source: Cambridge English Assessment",
x = "",
y = "") +
theme(legend.position = "none",
axis.text.x = element_text(size = 20),
axis.text.y = element_text(size = 20),
plot.title = element_text(size = 25))
Happy Coding!