In this guide, you will learn how to create a bar chart using the R statistical software. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have downloaded the relevant data files to a folder on your computer and that you are using the RStudio environment. The relevant code should, however, work in other environments too.
1. Bar Chart
2. An Example in R: Primary Energy Consumption of Oil by Regions in 2018
3. Your Turn
A horizontal bar chart is a common chart type that shows one value per category, with the categories listed on a qualitative scale along the vertical axis. A bar chart can be useful for relatively exact appraisals of differences between categories and for giving a good visual overview of the values in a dataset.
Bar charts should generally be ordered based on the data. Typically, the vertical axis is arranged by values in either ascending or descending order.
The horizontal bar chart is usually not interchangeable with the vertical bar chart, which is used to display time series and serves as an alternative to the line chart.
Figure 1 shows a bar chart of global primary energy consumption of oil by geographical regions in the year 2018.
The chart is made of horizontal bars, one for each region, placed one beneath the other. The horizontal axis is labeled oil in terawatt-hours from 0 to 20,000 in increments of 5,000. The vertical axis lists the regions. Approximate data from the graph are tabulated below.
Region | Consumption of Oil in terawatt-hours |
South and Central America | 3,500 |
Europe | 8,000 |
North America | 12,000 |
Middle East | 4,000 |
Africa | 2,200 |
Asia Pacific | 19,000 |
Commonwealth of Independent States | 2,200 |
A vertical line representing the mean oil consumption runs through the 7,500 terawatt-hours mark. Text at the bottom reads, “Source: Our World in Data, 2020.”
The bar chart gives an easy at-a-glance impression of the absolute primary energy consumption of oil for the regions, ordered by magnitude. A simple two-color scheme was used with dark brown for the bars and black for text and tick marks.
The headline offers an interpretation of the main visual message.
R is free, open-source software and a computing platform for statistical analysis with many charting options. R is not based on a graphical interface with pull-down menus. Rather, you input lines of code that execute functions and operations built into R or into add-on packages. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located here (http://cran.R-project.org/doc/manuals/R-release/R-intro.html).
For this example, we are using RStudio, a free, open-source user interface for R which makes working with R programming easier.
In this example, we write our code using R Script, found in the top left of the four windows in RStudio. This means that all actions can be recorded and kept for further use. It is helpful to do this to be able to trace back your steps and decisions made in the analysis. To run the code, you can either press Ctrl + Enter (or Command + Enter on a Mac) after each line of code or highlight the line(s) of code you wish to perform and click Run. (Code can also be written in the Console area in the bottom left, pressing Enter at the end of each line of code. This does not record your actions, however.)
Creating the bar chart requires installing some packages and loading them as libraries. Install them if you do not have them already, either by typing install.packages("packagename") in the console or by using the menu item Install Packages… under Tools. You will get an error message if something is missing when trying to run the script. If needed, just install the missing package, and everything should work after that.
The necessary packages in this case are all contained in one:
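The functions used in this guide (read_delim, filter, mutate, arrange, ggplot) are all provided by packages in the tidyverse collection, so a reasonable assumption is that the single package meant here is tidyverse. A minimal sketch under that assumption, which installs the package only if it is missing and then loads it:

```r
# Assumption: the one package referred to above is the tidyverse meta-package.
# Install it once if it is missing, then attach it for this session.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(tidyverse)
```

If you prefer a lighter setup, loading readr, dplyr, and ggplot2 individually should cover the calls used in this example.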
Install these as needed and save the tutorial csv data file primary-energy-consumption-by-source.csv to a folder on your computer. The example uses a folder called sage_r-sourcedata in the user root, where the table goes in subfolder tables. If you choose to save your files elsewhere, just update the import path accordingly at the beginning of your code.
Begin by running the code in the script file up to line 4 by marking the lines and pressing Ctrl + Enter (Command + Enter on a Mac); this will import the necessary libraries.
We will begin by importing our dataset:
data <- read_delim("~/sage_r-sourcedata/tables/primary-energy-consumption-by-source.csv", ",")
The read_delim function accepts a variety of parameters for reading in data, including, for example, the data locale and the field delimiter (in this case a comma). You can find more documentation on these by searching for “read_delim” in your Help panel. You can alternatively use the read_csv or read_tsv functions to read in your data. Both are specialized cases of the basic read_delim function and expect comma-separated and tab-separated values, respectively. There is also a read_csv2 function, which expects semicolon separators and commas as decimal marks, should this suit your particular dataset better.
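To see how read_delim and read_csv relate, the following sketch writes a tiny comma-separated file to a temporary location and reads it both ways. The temporary file and its single data row (copied from the 2018 Africa figure in this guide) are illustrative assumptions, not part of the tutorial data, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical two-line CSV written to a temporary file for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Entity,Year,Oil (terawatt-hours)",
             "Africa,2018,2225.136"), tmp)

# read_delim with an explicit comma delimiter ...
via_delim <- read_delim(tmp, delim = ",", show_col_types = FALSE)

# ... and read_csv, which assumes the comma for you.
via_csv <- read_csv(tmp, show_col_types = FALSE)

# Both calls should yield the same column names and values.
identical(names(via_delim), names(via_csv))
```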
You can take a look at the data table by typing View(data) into your console panel in the bottom left quadrant of the interface or by opening the data table from the top right Environment panel. You can also view just the column names with the command colnames(data). Our data should look something like Figure 2.
Data from the imported dataset are tabulated below.
Entity | Code | Year | Oil (terawatt-hours) | Natural gas (terawatt-hours) | Coal (terawatt-hours) | Nuclear (terawatt-hours) | Hydropower (terawatt-hours) | Solar (terawatt-hours) | Wind (terawatt-hours) | Other renewables (terawatt-hours) |
Africa | NA | 1965 | 328.24966 | 9.543754 | 335.580402 | 0.0000000 | 14.2788.56 | 0.00000e+00 | 0.000000 | 0.0000000 |
Africa | NA | 1966 | 359.14315 | 10.66916 | 331.361013 | 0.0000000 | 15.6490489 | 0.00000e+00 | 0.000000 | 0.0000000 |
Africa | NA | 1967 | 356.36648 | 10.545670 | 341.000471 | 0.0000000 | 16.1583330 | 0.00000e+00 | 0.000000 | 0.0000000 |
Africa | NA | 1968 | 376.00921 | 10.688970 | 355.393857 | 0.0000000 | 18.6229828 | 0.00000e+00 | 0.000000 | 0.0000000 |
Africa | NA | 1969 | 381.17900 | 12.492000 | 357.494212 | 0.0000000 | 21.5828968 | 0.00000e+00 | 0.000000 | 0.0000000 |
Africa | NA | 1970 | 421.12586 | 15.520325 | 367.958023 | 0.0000000 | 27.0672870 | 0.00000e+00 | 0.000000 | 0.1640000 |
Africa | NA | 1971 | 475.80862 | 18.405264 | 388.717309 | 0.0000000 | 25.8366053 | 0.00000e+00 | 0.000000 | 0.1650000 |
Africa | NA | 1972 | 515.50459 | 24.670657 | 392.431563 | 0.0000000 | 29.8368784 | 0.00000e+00 | 0.000000 | 0.1700000 |
Africa | NA | 1973 | 554.28050 | 39.551791 | 414.901083 | 0.0000000 | 29.8037420 | 0.00000e+00 | 0.000000 | 0.1750000 |
Africa | NA | 1974 | 572.99627 | 44.518487 | 432.131850 | 0.0000000 | 35.0891855 | 0.00000e+00 | 0.000000 | 0.1720000 |
Africa | NA | 1975 | 599.15275 | 53.638377 | 459.067310 | 0.0000000 | 36.8790484 | 0.00000e+00 | 0.000000 | 0.1850000 |
Africa | NA | 1976 | 665.14780 | 60.981390 | 481.936906 | 0.0000000 | 40.8611792 | 0.00000e+00 | 0.000000 | 0.1890000 |
Africa | NA | 1977 | 703.88885 | 68.234191 | 492.911858 | 0.0000000 | 45.0833511 | 0.00000e+00 | 0.000000 | 0.1950000 |
Africa | NA | 1978 | 740.60696 | 97.369898 | 477.924213 | 0.0000000 | 46.0850793 | 0.00000e+00 | 0.000000 | 0.2010000 |
Africa | NA | 1979 | 789.47579 | 148.842816 | 504.547560 | 0.0000000 | 47.2802068 | 0.00000e+00 | 0.000000 | 0.2070000 |
Africa | NA | 1980 | 835.99012 | 186.913977 | 543.879963 | 0.0000000 | 46.0301706 | 0.00000e+00 | 0.000000 | 0.2180000 |
Africa | NA | 1981 | 889.82084 | 228.491156 | 636.189973 | 0.0000000 | 48.3774465 | 0.00000e+00 | 0.000000 | 0.2360000 |
Africa | NA | 1982 | 932.52345 | 246.733549 | 702.560899 | 0.0000000 | 48.6327147 | 0.00000e+00 | 0.000000 | 0.2300000 |
Africa | NA | 1983 | 964.19883 | 271.275347 | 710.034781 | 0.0000000 | 45.5089147 | 0.00000e+00 | 0.000000 | 0.2340000 |
Africa | NA | 1984 | 984.73530 | 259.902780 | 762.605888 | 0.0000000 | 45.1258603 | 0.00000e+00 | 0.000000 | 0.0000000 |
Africa | NA | 1985 | 1006.97520 | 278.842747 | 787.165889 | 0.0000000 | 5.3150000 | 0.00000e+00 | 0.000000 | 0.2690000 |
Africa | NA | 1986 | 991.20937 | 313.271203 | 805.356096 | 0.0000000 | 8.8030000 | 0.00000e+00 | 0.000000 | 0.6590000 |
Africa | NA | 1987 | 1041.20937 | 323.384357 | 828.319732 | 0.0000000 | 6.1670000 | 0.00000e+00 | 0.000000 | 0.6580000 |
Africa | NA | 1988 | 1090.04827 | 358.444541 | 890.160805 | 0.0000000 | 10.4930000 | 0.00000e+00 | 0.000000 | 0.6260000 |
Africa | NA | 1989 | 1130.28790 | 371.4710041 | 835.620862 | 0.0000000 | 11.0990000 | 0.00000e+00 | 0.000000 | 0.6170000 |
Africa | NA | 1990 | 1146.16657 | 398.645384 | 877.809862 | 0.0000000 | 8.4490000 | 0.00000e+00 | 0.000000 | 0.7320000 |
Next, we want to format our data for plotting. We will begin by making a subset of our original dataset that only includes data for the year 2018, limited to the regional aggregates. Here the filter() function keeps the rows that meet the conditions for year and entity. NB! The name for the aggregated Eurasian post-Soviet republics, the Commonwealth of Independent States, is shortened to CIS in the data.
subset <- data %>% filter(Year == 2018)
subset <- subset %>% filter(Entity %in% c('Africa', 'Asia Pacific', 'CIS', 'Europe', 'Middle East', 'North America', 'South & Central America'))
Note: Using Boolean operators like & for “and” or | for “or”, you can add multiple conditions to a single filter, for example, filter(AGE == "15-64" & TIME == 2019). For more details, just search for the dplyr “filter” function in your Help panel.
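As a small illustration of combining conditions, the sketch below applies both the year and the entity condition in one filter() call. The toy data frame is an assumption made for the example; its values are copied from the tables in this guide.

```r
library(dplyr)

# Toy rows for illustration; Oil values are copied from this guide's tables.
toy <- data.frame(
  Entity = c("Africa", "Africa", "Europe", "Middle East"),
  Year   = c(1990, 2018, 2018, 2018),
  Oil    = c(1146.16657, 2225.136, 8629.147, 4792.163)
)

# One filter() call with both conditions joined by &.
combined <- toy %>%
  filter(Year == 2018 & Entity %in% c("Africa", "Europe"))

# Separating the conditions with a comma has the same effect as &.
with_comma <- toy %>%
  filter(Year == 2018, Entity %in% c("Africa", "Europe"))

identical(combined, with_comma)
```

Either form should give the same result as the two-step filtering used in this example; which one to use is mostly a matter of readability.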
The column names are somewhat cumbersome and long, so we will also shorten them for easier plotting later. We remove all extra text contained within parentheses with the following gsub function. The code looks more complicated than it really is. The basic premise is gsub("text to be replaced", "replacement", data), where we have just input some regular expression placeholders to account for the parentheses, any leading spaces, and the text within the parentheses. You can search for gsub or regular expression in your Help panel for more documentation.
names(subset) <- gsub("\\s*\\([^\\)]+\\)", "", names(subset))
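To see what the pattern does in isolation, here is a minimal sketch that applies it to three of the column names from the imported dataset. The vector cols is a stand-in created for this illustration.

```r
# Three column names as they appear in the imported dataset.
cols <- c("Entity", "Oil (terawatt-hours)", "Natural gas (terawatt-hours)")

# \\s*      any whitespace before the opening parenthesis
# \\(       a literal opening parenthesis
# [^\\)]+   one or more characters that are not a closing parenthesis
# \\)       a literal closing parenthesis
cleaned <- gsub("\\s*\\([^\\)]+\\)", "", cols)
cleaned
# "Entity" "Oil" "Natural gas"
```

Names without parentheses, such as Entity, pass through unchanged because the pattern finds nothing to replace.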
We will also rename CIS to Commonwealth of Independent States for clarity's sake:
subset$Entity <- gsub("CIS", "Commonwealth of\n Independent States", subset$Entity)
Our formatted subset should look something like Figure 3.
The data from the formatted data set are tabulated below.
Entity | Code | Year | Oil | Natural gas | Coal | Nuclear | Hydropower | Solar | Wind | Other renewables |
Africa | NA | 2018 | 2225.136 | 1499.912 | 1179.69841 | 11.090427 | 132.8408 | 9.0290890 | 14.685940 | 8.1610229 |
Asia Pacific | NA | 2018 | 19717.043 | 8253.226 | 33044.86437 | 553.584622 | 1718.5083 | 314.2085529 | 460.469456 | 221.3015173 |
Commonwealth of Independent States | NA | 2018 | 2250.969 | 5807.758 | 1568.74569 | 206.577070 | 244.8391 | 0.8813001 | 0.977083 | 0.6791778 |
Europe | NA | 2018 | 8629.147 | 5489.557 | 3571.61300 | 937.491630 | 642.0666 | 139.0520627 | 404.369480 | 217.6324969 |
Middle East | NA | 2018 | 4792.163 | 5531.019 | 92.445519 | 7.000133 | 15.1905 | 6.1211536 | 1.060386 | 0.2613104 |
North America | NA | 2018 | 12938.699 | 10223.409 | 3992.96555 | 963.183321 | 708.3523 | 102.9072327 | 322.528364 | 99.7459563 |
South and Central America | NA | 2018 | 3666.525 | 1683.690 | 419.13443 | 22.504796 | 731.3065 | 12.4315268 | 65.862666 | 78.0238802 |
We will begin with a very simple vertical bar chart of our subset data, choosing regions for the x-axis and oil consumption values for the y-axis. The simplest version of the bar chart is created by passing our data subset to ggplot(DATATABLE, aes(x=COLUMN, y=COLUMN)) + geom_col() with the desired columns filled in for the x- and y-axes.
ggplot(data=subset, aes(x=Entity, y=Oil)) + geom_col()
NB! There is also an alternate geom_bar function, which by default creates bars from data counts instead of data values. geom_bar can generally be used to perform statistical transformations to data before plotting, and the addition of a stat = "identity" parameter also allows for overriding the default case counting method to create the same output as geom_col. You may see both methods used for plotting bar charts, though, for our purposes, geom_col() is more direct and simple.
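The equivalence between geom_col() and geom_bar(stat = "identity") can be checked with a short sketch. The two-row data frame is an assumption for illustration (values copied from the 2018 table in this guide), and layer_data() is used here only to read back the bar heights that ggplot2 computed.

```r
library(ggplot2)

# Two regions from the 2018 table, used as a toy dataset.
toy <- data.frame(Entity = c("Africa", "Europe"),
                  Oil    = c(2225.136, 8629.147))

# geom_col() plots the values as given.
p_col <- ggplot(toy, aes(x = Entity, y = Oil)) + geom_col()

# geom_bar() counts rows by default; stat = "identity" switches that off.
p_bar <- ggplot(toy, aes(x = Entity, y = Oil)) + geom_bar(stat = "identity")

# Compare the computed bar heights of the two plots.
h_col <- layer_data(p_col)$ymax
h_bar <- layer_data(p_bar)$ymax
all.equal(h_col, h_bar)
```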
Our basic column chart should look something like Figure 4.
The horizontal axis is labeled entities and lists them in alphabetical order as follows: Africa, Asia Pacific, Commonwealth of Independent States, Europe, Middle East, North America, and South and Central America. The vertical axis is labeled oil data and ranges from 0 to 20,000 in increments of 5,000. Approximate data from the column chart are tabulated below.
Entity | Oil |
Africa | 2,200 |
Asia Pacific | 19,700 |
Commonwealth of Independent States | 2,200 |
Europe | 8,600 |
Middle East | 4,800 |
North America | 13,000 |
South and Central America | 3,600 |
You will notice that the bars are in a somewhat inconvenient alphabetical order. Sadly, you can sort your original data frame with order() or arrange() all you want, but ggplot will still plot the values alphabetically, because it treats a text column as a factor whose levels are sorted alphabetically by default. To fix this, we sort the rows by value and then explicitly define the factor order via mutate:
p <- subset %>%
  arrange(Oil) %>%
  mutate(Entity = factor(Entity, levels = Entity)) %>%
  ggplot(aes(x = Entity, y = Oil)) +
  geom_col()
p
After assigning the basic plot to the letter p, we can view it by simply typing “p” into the Console. This also creates a simple base plot on top of which we can draw additional elements (Figure 5).
The horizontal axis lists the entities in ascending order of oil consumption: Africa, Commonwealth of Independent States, South and Central America, Middle East, Europe, North America, and Asia Pacific. The vertical axis has oil data from 0 to 20,000 in increments of 5,000. Approximate data from the column chart are tabulated below.
Entity | Oil |
Africa | 2,200 |
Commonwealth of Independent States | 2,200 |
South and Central America | 3,600 |
Middle East | 4,800 |
Europe | 8,600 |
North America | 13,000 |
Asia Pacific | 19,700 |
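The reordering step relies on how R assigns factor levels. The following base R sketch, using four of the 2018 values from this guide as a stand-in dataset, contrasts the default alphabetical levels with levels set explicitly in order of value.

```r
# Four regions and their 2018 oil values, copied from this guide's table.
entity <- c("Africa", "Asia Pacific", "Europe", "Middle East")
oil    <- c(2225.136, 19717.043, 8629.147, 4792.163)

# Default: levels are sorted alphabetically, whatever the row order is.
default_levels <- levels(factor(entity))
default_levels
# "Africa" "Asia Pacific" "Europe" "Middle East"

# Explicit levels, ordered by oil value (smallest first).
by_value <- factor(entity, levels = entity[order(oil)])
levels(by_value)
# "Africa" "Middle East" "Europe" "Asia Pacific"
```

A common alternative, not used in this example, is to reorder inside the plot call with aes(x = reorder(Entity, Oil), y = Oil), which avoids changing the data frame itself.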
To make variations of this basic plot, we continue by adding to the p variable. In the following example, we flip the axes to create a horizontal bar chart with coord_flip() and add just a bit of custom styling for a simple theme and the necessary title and subtitle information in the form of labs(). The parameters passed to scale_y_continuous add a thousands separator to the values on the (now flipped) x-axis.
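Notice that the following code begins with p <-, which saves the result back into the p variable. Without it, the code would still draw the plot, but the changes would not be stored in p. Because the result is assigned rather than printed, type p into the Console afterwards to display the chart (Figure 6).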
p <- p + coord_flip() + theme_minimal() +
  labs(title = "Asia Pacific is the largest primary oil consumer",
       subtitle = "Primary energy consumption of oil by regions in 2018",
       caption = "Source: Our World in Data, 2020") +
  ylab("Terawatt-hours") +
  xlab("") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))
The horizontal axis is labeled oil in terawatt-hours and ranges from 0 to 20,000, in increments of 2,500. The vertical axis lists regions. The data from the chart are tabulated below.
Region | Oil in terawatt-hours |
Africa | 2,225 |
Commonwealth of Independent States | 2,251 |
South and Central America | 3,667 |
Middle East | 4,792 |
Europe | 8,629 |
North America | 12,939 |
Asia Pacific | 19,717 |
Text above the chart reads, “Asia Pacific is the largest primary oil consumer.” Text under the chart reads, “Source: Our World in Data, 2020.”
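The labelling function passed to scale_y_continuous can be tried on its own in the Console. This base R sketch applies it to three typical tick values. The trim = TRUE variant is an optional addition not used in the example; it removes the padding that format() adds to align numbers of different widths.

```r
# The formatter used for the axis labels in this example.
fmt <- function(x) format(x, big.mark = ",", scientific = FALSE)
padded <- fmt(c(0, 5000, 20000))
padded
# "     0" " 5,000" "20,000"

# Optional variant: trim = TRUE drops the leading padding.
fmt_trim <- function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
trimmed <- fmt_trim(c(0, 5000, 20000))
trimmed
# "0" "5,000" "20,000"
```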
We can also add a line indicating the mean for oil consumption using the geom_hline function and text annotation using geom_text:
p + theme_minimal() + geom_hline(yintercept=mean(subset$Oil), color="orange", size=1) +
geom_text(aes(x=1, y=11000, label="Mean oil consumption"), color="orange", size=4)
You can also experiment with labeling features directly with their value, such as p + geom_text(aes(x=subset$Entity, y=subset$Oil+3000, label=subset$Oil), color="orange", size=4), though you will need to play around with the text placement to avoid overlapping the bars themselves.
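Two details of the annotation code are worth knowing. A constant label placed inside aes() is drawn once for every row of the data, which can make the text look heavy, and recent ggplot2 versions (3.4.0 and later) prefer linewidth over size for lines. The sketch below is one possible variation under those assumptions, not a replacement for the tutorial script. It rebuilds the chart from the seven 2018 oil values in this guide, uses annotate() for the single text label, and takes the value labels from the plot's own columns instead of subset$.

```r
library(ggplot2)

# 2018 oil values copied from the formatted subset table in this guide.
oil <- data.frame(
  Entity = c("Africa", "Asia Pacific", "Commonwealth of\nIndependent States",
             "Europe", "Middle East", "North America",
             "South & Central America"),
  Oil = c(2225.136, 19717.043, 2250.969, 8629.147,
          4792.163, 12938.699, 3666.525)
)

# Order the factor levels by value so the bars are sorted.
oil$Entity <- factor(oil$Entity, levels = oil$Entity[order(oil$Oil)])
mean_oil <- mean(oil$Oil)

p2 <- ggplot(oil, aes(x = Entity, y = Oil)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  # linewidth is the line-thickness argument in ggplot2 >= 3.4.0.
  geom_hline(yintercept = mean_oil, color = "orange", linewidth = 1) +
  # annotate() draws the label once instead of once per data row.
  annotate("text", x = 1, y = 11000, label = "Mean oil consumption",
           color = "orange", size = 4) +
  # Value labels read Oil from the plot data; hjust nudges them past the bar end.
  geom_text(aes(label = round(Oil)), hjust = -0.1, size = 3)
p2
```

The label on the longest bar may be clipped at the plot edge, so you may still need to adjust the placement or widen the axis limits.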
Regardless of your method and parameters of output, our bar chart should now look something like Figure 7.
The vertical axis is labeled entity and lists the entities in descending order from top to bottom: Asia Pacific, North America, Europe, Middle East, South and Central America, Commonwealth of Independent States, and Africa. The horizontal axis is labeled oil data in terawatt-hours and ranges from 0 to 20,000 in increments of 5,000. Approximate data from the bar chart are tabulated below.
Entity | Oil |
Asia Pacific | 19,700 |
North America | 13,000 |
Europe | 8,600 |
Middle East | 4,800 |
South and Central America | 3,600 |
Commonwealth of Independent States | 2,200 |
Africa | 2,200 |
A vertical line representing mean oil consumption runs through the 7,600 mark. Text at the bottom reads, “Source: Our World in Data, 2020.”
The bar chart created in this demonstration shows clearly and at a glance that Asia Pacific is the largest primary oil consumer by far, using nearly 20,000 terawatt-hours worth of oil in 2018. The ordering by magnitude allows a fairly exact appraisal of relative differences. Still, the difference between the Commonwealth of Independent States and Africa is nearly indistinguishable at this scale without directly labeling the values themselves.
Adding the optional mean line to the chart gives an idea of how far Asia Pacific lies above the mean of the seven regional totals.
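The two observations above can be checked with a little arithmetic on the 2018 values from the formatted subset. This base R sketch assumes the mean in question is the simple unweighted mean of the seven regional totals, which is what mean(subset$Oil) computes.

```r
# 2018 oil consumption in terawatt-hours, copied from this guide's table.
oil <- c(
  "Africa" = 2225.136, "Asia Pacific" = 19717.043, "CIS" = 2250.969,
  "Europe" = 8629.147, "Middle East" = 4792.163,
  "North America" = 12938.699, "South & Central America" = 3666.525
)

# Unweighted mean of the seven regional totals: about 7,746 TWh.
mean_oil <- mean(oil)

# Asia Pacific relative to that mean: roughly 2.5 times as much.
ratio <- oil[["Asia Pacific"]] / mean_oil

# Gap between CIS and Africa: about 26 TWh, too small to see at this scale.
gap <- oil[["CIS"]] - oil[["Africa"]]
```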
For further context, one might want to create another bar chart to compare with another primary energy source side by side.
A source of ambiguity in this chart is that the reader cannot know for certain which countries belong to which group.
Now that you have been introduced to some of the basic operations necessary to complete this type of visualization, you may experiment with variations based on this same dataset. You can try plotting different variables, time periods, or another selection of values—how would you accomplish these tasks? Can you add values to the chart or make a graphic with multiple bar charts for different variables? How would you go about coloring the bars by some value? This will require reading up on some documentation, which you can find, for instance, by typing “geom_col” into your Help panel.