How-to Guide for R
Introduction

In this guide, you will learn how to create a bar chart using the R statistical software. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have downloaded the relevant data files to a folder on your computer and that you are using the RStudio environment. The relevant code should, however, work in other environments too.

Contents

1. Bar Chart

2. An Example in R: Primary Energy Consumption of Oil by Regions in 2018

• 2.1 The R Procedure
• 2.1.1 Preparing the Data
• 2.1.2 Plotting the Data
• 2.2 Exploring the Output

1. Bar Chart

A horizontal bar chart is a common chart type used to show different values on a qualitative scale on the vertical axis. A bar chart can be useful for relatively exact appraisals of differences between categories and giving a good visual overview of the values in a dataset.

Bar charts should generally be ordered based on the data. Typically, the vertical axis is arranged by values in either ascending or descending order.

The horizontal bar chart is usually not interchangeable with the vertical bar chart, which is used to display time series and serves as an alternative to the line chart.

2. An Example in R: Primary Energy Consumption of Oil by Regions in 2018

Figure 1 shows a bar chart of global primary energy consumption of oil by geographical regions in the year 2018.

The chart is made of horizontal bars, one for each region, stacked one beneath the other. The horizontal axis is labeled oil in terawatt-hours and runs from 0 to 20,000 in increments of 5,000. The vertical axis lists the regions. Approximate data from the graph are tabulated below.

| Region | Consumption of oil (terawatt-hours) |
|---|---|
| South and Central America | 3,500 |
| Europe | 8,000 |
| North America | 12,000 |
| Middle East | 4,000 |
| Africa | 2,200 |
| Asia Pacific | 19,000 |
| Commonwealth of Independent States | 2,200 |

A vertical line representing the mean oil consumption runs through the 7,500 terawatt-hours mark. Text at the bottom reads, “Source: Our World in Data, 2020.”

Figure 1. A Bar Chart of Global Oil Consumption

The bar chart gives an easy at-a-glance impression of the absolute primary energy consumption of oil for the regions, ordered by magnitude. A simple two-color scheme was used with dark brown for the bars and black for text and tick marks.

The headline offers an interpretation of the main visual message.

2.1 The R Procedure

R is free, open-source software and a computing platform for statistical analysis with many charting options. R is not based on a graphical interface with pull-down menus. Rather, you input lines of code that execute functions and operations built into R or provided by add-on packages. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located here (http://cran.R-project.org/doc/manuals/R-release/R-intro.html).

For this example, we are using RStudio, a free, open-source user interface for R which makes working with R programming easier.

In this example, we write our code in an R Script, found in the top left of the four windows in RStudio. This means that all actions are recorded and can be kept for further use. Doing so makes it possible to trace back your steps and the decisions made in the analysis. To run the code, you can either press Ctrl + Enter (or Command + Enter on a Mac) after each line of code or highlight the line(s) of code you wish to run and click Run. (Code can also be written in the Console area in the bottom left, pressing Enter at the end of each line of code. This does not record your actions, however.)

Creating the bar chart requires installing some packages and loading their libraries. Install them if you do not have them already, either by typing install.packages("packagename") in the console or by using the menu item Install Packages… under Tools. You will get an error message if something is missing when you try to run the script. If that happens, just install the missing package, and everything should work after that.

The necessary packages in this case are all contained in one:
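The listing that belongs here did not survive in this copy of the guide. A minimal sketch, assuming the single package meant is the tidyverse meta-package (an assumption, although it does bundle dplyr, which supplies filter(), arrange(), mutate() and %>%, and ggplot2, all of which are used below):

```r
# Assumption: the "one" package is tidyverse, which attaches dplyr and ggplot2.
# install.packages("tidyverse")  # run once if the package is not installed yet
library(tidyverse)
```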

Install these as needed and save the tutorial CSV data file primary-energy-consumption-by-source.csv to a folder on your computer. The example uses a folder called sage_r-sourcedata in the user root, with the table stored in the subfolder tables. If you choose to save your files elsewhere, just update the import path accordingly at the beginning of your code.

Begin by running the code in the script file up to line 4: select the lines and press Ctrl + Enter (Command + Enter on a Mac). This will load the necessary libraries.

2.1.1 Preparing the Data

We will begin by importing our dataset:
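The import listing is also missing from this copy. The following is only a sketch, assuming the folder layout described above and base R's read.csv. The object name data matches the View(data) call used in this section, but the exact path and the choice of reader function are assumptions. Setting check.names = FALSE keeps headers such as "Oil (terawatt-hours)" verbatim, which the later renaming step relies on.

```r
# Hypothetical path built from the folder names mentioned above; adjust to your setup.
csv_path <- file.path("~", "sage_r-sourcedata", "tables",
                      "primary-energy-consumption-by-source.csv")

if (file.exists(csv_path)) {
  # check.names = FALSE stops R from rewriting "Oil (terawatt-hours)"
  # into the syntactic name "Oil..terawatt.hours."
  data <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)
  print(colnames(data))
} else {
  message("File not found, check the import path: ", csv_path)
}
```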

You can take a look at the data table by typing View(data) into your console panel in the bottom left quadrant of the interface or by opening the data table from the top right Environment panel. You can also view just the column names with the command colnames(data). Our data should look something like Figure 2.

Data from the imported dataset are tabulated below.

| Entity | Code | Year | Oil (terawatt-hours) | Natural gas (terawatt-hours) | Coal (terawatt-hours) | Nuclear (terawatt-hours) | Hydropower (terawatt-hours) | Solar (terawatt-hours) | Wind (terawatt-hours) | Other renewables (terawatt-hours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Africa | NA | 1965 | 328.24966 | 9.543754 | 335.580402 | 0.0000000 | 14.2788.56 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1966 | 359.14315 | 10.66916 | 331.361013 | 0.0000000 | 15.6490489 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1967 | 356.36648 | 10.545670 | 341.000471 | 0.0000000 | 16.1583330 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1968 | 376.00921 | 10.688970 | 355.393857 | 0.0000000 | 18.6229828 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1969 | 381.17900 | 12.492000 | 357.494212 | 0.0000000 | 21.5828968 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1970 | 421.12586 | 15.520325 | 367.958023 | 0.0000000 | 27.0672870 | 0.00000e+00 | 0.000000 | 0.1640000 |
| Africa | NA | 1971 | 475.80862 | 18.405264 | 388.717309 | 0.0000000 | 25.8366053 | 0.00000e+00 | 0.000000 | 0.1650000 |
| Africa | NA | 1972 | 515.50459 | 24.670657 | 392.431563 | 0.0000000 | 29.8368784 | 0.00000e+00 | 0.000000 | 0.1700000 |
| Africa | NA | 1973 | 554.28050 | 39.551791 | 414.901083 | 0.0000000 | 29.8037420 | 0.00000e+00 | 0.000000 | 0.1750000 |
| Africa | NA | 1974 | 572.99627 | 44.518487 | 432.131850 | 0.0000000 | 35.0891855 | 0.00000e+00 | 0.000000 | 0.1720000 |
| Africa | NA | 1975 | 599.15275 | 53.638377 | 459.067310 | 0.0000000 | 36.8790484 | 0.00000e+00 | 0.000000 | 0.1850000 |
| Africa | NA | 1976 | 665.14780 | 60.981390 | 481.936906 | 0.0000000 | 40.8611792 | 0.00000e+00 | 0.000000 | 0.1890000 |
| Africa | NA | 1977 | 703.88885 | 68.234191 | 492.911858 | 0.0000000 | 45.0833511 | 0.00000e+00 | 0.000000 | 0.1950000 |
| Africa | NA | 1978 | 740.60696 | 97.369898 | 477.924213 | 0.0000000 | 46.0850793 | 0.00000e+00 | 0.000000 | 0.2010000 |
| Africa | NA | 1979 | 789.47579 | 148.842816 | 504.547560 | 0.0000000 | 47.2802068 | 0.00000e+00 | 0.000000 | 0.2070000 |
| Africa | NA | 1980 | 835.99012 | 186.913977 | 543.879963 | 0.0000000 | 46.0301706 | 0.00000e+00 | 0.000000 | 0.2180000 |
| Africa | NA | 1981 | 889.82084 | 228.491156 | 636.189973 | 0.0000000 | 48.3774465 | 0.00000e+00 | 0.000000 | 0.2360000 |
| Africa | NA | 1982 | 932.52345 | 246.733549 | 702.560899 | 0.0000000 | 48.6327147 | 0.00000e+00 | 0.000000 | 0.2300000 |
| Africa | NA | 1983 | 964.19883 | 271.275347 | 710.034781 | 0.0000000 | 45.5089147 | 0.00000e+00 | 0.000000 | 0.2340000 |
| Africa | NA | 1984 | 984.73530 | 259.902780 | 762.605888 | 0.0000000 | 45.1258603 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1985 | 1006.97520 | 278.842747 | 787.165889 | 0.0000000 | 5.3150000 | 0.00000e+00 | 0.000000 | 0.2690000 |
| Africa | NA | 1986 | 991.20937 | 313.271203 | 805.356096 | 0.0000000 | 8.8030000 | 0.00000e+00 | 0.000000 | 0.6590000 |
| Africa | NA | 1987 | 1041.20937 | 323.384357 | 828.319732 | 0.0000000 | 6.1670000 | 0.00000e+00 | 0.000000 | 0.6580000 |
| Africa | NA | 1988 | 1090.04827 | 358.444541 | 890.160805 | 0.0000000 | 10.4930000 | 0.00000e+00 | 0.000000 | 0.6260000 |
| Africa | NA | 1989 | 1130.28790 | 371.4710041 | 835.620862 | 0.0000000 | 11.0990000 | 0.00000e+00 | 0.000000 | 0.6170000 |
| Africa | NA | 1990 | 1146.16657 | 398.645384 | 877.809862 | 0.0000000 | 8.4490000 | 0.00000e+00 | 0.000000 | 0.7320000 |

Figure 2. The Imported Dataset

Next, we want to format our data for plotting. We will begin by making a subset of our original dataset that only includes data for the year 2018, limited to the regional aggregates. The filter() function keeps the rows that meet a condition; we apply it once for the year and once for the entity. NB! The name for the aggregated Eurasian post-Soviet republics, the Commonwealth of Independent States, is shortened to CIS in the data.

subset <- data %>% filter(Year == 2018)

subset <- subset %>% filter(Entity %in% c('Africa', 'Asia Pacific', 'CIS', 'Europe', 'Middle East', 'North America', 'South & Central America'))

Note: Using Boolean operators like & for “and” or | for “or”, you can combine multiple conditions in one filter, for example, filter(AGE == "15-64" & TIME == 2019) for a dataset that has AGE and TIME columns. For more details, just search for the dplyr “filter” function in your Help panel.
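As an illustration of the note above, the two filtering steps can be written as a single call. This is a sketch on a small made-up table (the object names toy, regions and toy_2018 and the Oil values are invented for the example), not the guide's own script:

```r
library(dplyr)

# Toy rows standing in for the imported table (values invented for illustration)
toy <- data.frame(
  Entity = c("Africa", "Africa", "Europe", "World"),
  Year   = c(2017, 2018, 2018, 2018),
  Oil    = c(1, 2, 3, 4)
)

regions <- c("Africa", "Europe")

# Both conditions must hold for a row to be kept
toy_2018 <- toy %>% filter(Year == 2018 & Entity %in% regions)
toy_2018
```

Separating the conditions with a comma, as in filter(Year == 2018, Entity %in% regions), has the same effect as joining them with &.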

The column names are somewhat cumbersome and long, so we will shorten them for easier plotting later. We remove all extra text contained within parentheses with the following gsub function. The code looks more complicated than it really is. The basic premise is gsub("text to be replaced", "replacement", data), where the text to be replaced is a regular expression that matches any leading spaces, the opening parenthesis, the text within the parentheses, and the closing parenthesis. You can search for gsub or regular expression in your Help panel for more documentation.

names(subset) <- gsub("\\s*\\([^\\)]+\\)", "", names(subset))
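To see what the pattern does, it can be tried on a few column names outside the data frame. A small sketch (old_names and new_names are names chosen for this illustration):

```r
# Column names as they appear in the imported table
old_names <- c("Entity", "Oil (terawatt-hours)", "Other renewables (terawatt-hours)")

# \\s*      optional leading spaces
# \\( \\)   literal opening and closing parentheses
# [^\\)]+   one or more characters that are not a closing parenthesis
new_names <- gsub("\\s*\\([^\\)]+\\)", "", old_names)
new_names
```

Names without parentheses, such as Entity, pass through unchanged.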

We will also rename CIS to Commonwealth of Independent States for clarity's sake:

subset$Entity <- gsub("CIS", "Commonwealth of\n Independent States", subset$Entity)

Our formatted subset should look something like Figure 3.

The data from the formatted dataset are tabulated below.

| Entity | Code | Year | Oil | Natural Gas | Coal | Nuclear | Hydropower | Solar | Wind | Other Renewables |
|---|---|---|---|---|---|---|---|---|---|---|
| Africa | NA | 2018 | 2225.136 | 1499.912 | 1179.69841 | 11.090427 | 132.8408 | 9.0290890 | 14.685940 | 8.1610229 |
| Asia Pacific | NA | 2018 | 19717.043 | 8253.226 | 33044.86437 | 553.584622 | 1718.5083 | 314.2085529 | 460.469456 | 221.3015173 |
| Commonwealth of Independent States | NA | 2018 | 2250.969 | 5807.758 | 1568.74569 | 206.577070 | 244.8391 | 0.8813001 | 0.977083 | 0.6791778 |
| Europe | NA | 2018 | 8629.147 | 5489.557 | 3571.61300 | 937.491630 | 642.0666 | 139.0520627 | 404.369480 | 217.6324969 |
| Middle East | NA | 2018 | 4792.163 | 5531.019 | 92.445519 | 7.000133 | 15.1905 | 6.1211536 | 1.060386 | 0.2613104 |
| North America | NA | 2018 | 12938.699 | 10223.409 | 3992.96555 | 963.183321 | 708.3523 | 102.9072327 | 322.528364 | 99.7459563 |
| South and Central America | NA | 2018 | 3666.525 | 1683.690 | 419.13443 | 22.504796 | 731.3065 | 12.4315268 | 65.862666 | 78.0238802 |

Figure 3. The Formatted Dataset

2.1.2 Plotting the Data

We will begin with a very simple vertical bar chart of our subset data, with the regions on the x-axis and the oil consumption values on the y-axis. The simplest version of the bar chart is made by passing our data subset to ggplot(DATATABLE, aes(x=COLUMN, y=COLUMN)) + geom_col() with the desired columns filled in for the x- and y-axes.

ggplot(data=subset, aes(x=Entity, y=Oil)) + geom_col()

NB! There is also an alternative geom_bar function, which by default creates bars from counts of cases rather than from data values. geom_bar can be used to apply statistical transformations to the data before plotting, and adding the parameter stat = "identity" overrides the default counting so that the output matches that of geom_col. You may see both methods used for plotting bar charts, but for our purposes geom_col() is simpler and more direct.
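The equivalence described above can be checked on a small example. This is a sketch with a toy data frame (oil, p_col and p_bar are names chosen here, with values rounded from the 2018 figures in this guide); layer_data() is the ggplot2 helper that returns the data a layer actually draws:

```r
library(ggplot2)

# Toy values rounded from the 2018 figures in this guide
oil <- data.frame(
  Entity = c("Africa", "Europe", "Asia Pacific"),
  Oil    = c(2225, 8629, 19717)
)

p_col <- ggplot(oil, aes(x = Entity, y = Oil)) + geom_col()
p_bar <- ggplot(oil, aes(x = Entity, y = Oil)) + geom_bar(stat = "identity")

# The computed bar heights are the same for both layers
all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)
```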

Our basic column chart should look something like Figure 4.

The horizontal axis is labeled Entity and lists the regions in alphabetical order: Africa, Asia Pacific, Commonwealth of Independent States, Europe, Middle East, North America, and South and Central America. The vertical axis is labeled Oil and ranges from 0 to 20,000 in increments of 5,000. Approximate data from the column chart are tabulated below.

| Entity | Oil |
|---|---|
| Africa | 2,200 |
| Asia Pacific | 19,700 |
| Commonwealth of Independent States | 2,200 |
| Europe | 8,600 |
| Middle East | 4,800 |
| North America | 13,000 |
| South and Central America | 3,600 |

Figure 4. A Simple Column Chart

You will notice that the bars are in alphabetical order, which is inconvenient here. Sorting the original data frame with sort() or arrange() does not help on its own, because ggplot still plots the values of a text column alphabetically. To fix this, we explicitly define a factor order via mutate:

p <- subset %>%
  arrange(Oil) %>%
  mutate(Entity = factor(Entity, levels = Entity)) %>%
  ggplot(aes(x = Entity, y = Oil)) +
  geom_col()

p

After assigning the basic plot to the letter p, we can view it by simply typing “p” into the Console. This also creates a simple base plot on top of which we can draw additional elements (Figure 5).
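A shorter alternative to the arrange() and mutate() steps, not used in this guide's script, is the base R function reorder(), which builds the ordered factor inside aes(). A sketch on a toy data frame (oil and p_alt are names chosen for this illustration):

```r
library(ggplot2)

# Toy values rounded from the 2018 figures in this guide
oil <- data.frame(
  Entity = c("Africa", "Europe", "Asia Pacific"),
  Oil    = c(2225, 8629, 19717)
)

# reorder() turns Entity into a factor whose levels follow ascending Oil
levels(reorder(oil$Entity, oil$Oil))

p_alt <- ggplot(oil, aes(x = reorder(Entity, Oil), y = Oil)) +
  geom_col() +
  xlab("Entity")  # otherwise the axis title would read "reorder(Entity, Oil)"
p_alt
```

The explicit factor() approach shown above is more verbose, but it keeps the ordering visible in the data itself, which can help when the same order is reused in several plots.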

The horizontal axis lists the entities in ascending order of oil consumption: Africa, Commonwealth of Independent States, South and Central America, Middle East, Europe, North America, and Asia Pacific. The vertical axis shows oil data from 0 to 20,000 in increments of 5,000. Approximate data from the column chart are tabulated below.

| Entity | Oil |
|---|---|
| Africa | 2,200 |
| Commonwealth of Independent States | 2,200 |
| South and Central America | 3,600 |
| Middle East | 4,800 |
| Europe | 8,600 |
| North America | 13,000 |
| Asia Pacific | 19,700 |

Figure 5. A Rearranged Column Chart

To make variations of this basic plot, we continue by adding to the p variable. In the following example, we flip the axes with coord_flip() to create a horizontal bar chart, apply a simple theme for a bit of custom styling, and add the necessary title and subtitle information with labs(). The function passed to scale_y_continuous adds a thousands separator to the values on the (now flipped) y-axis, which is drawn horizontally.

Notice that a statement such as p + coord_flip() on its own draws a plot but leaves the p variable unchanged. Because the following code begins with p <-, the result is stored back in p; type p afterwards to display it (Figure 6).

p <- p + coord_flip() + theme_minimal() +
  labs(title = "Asia Pacific is the largest primary oil consumer",
       subtitle = "Primary energy consumption of oil by regions in 2018",
       caption = "Source: Our World in Data, 2020") +
  ylab("Terawatt-hours") +
  xlab("") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))

The horizontal axis is labeled oil in terawatt-hours and ranges from 0 to 20,000, in increments of 2,500. The vertical axis lists regions. The data from the chart are tabulated below.

| Region | Oil in terawatt-hours |
|---|---|
| Africa | 2,225 |
| Commonwealth of Independent States | 2,251 |
| South and Central America | 3,667 |
| Middle East | 4,792 |
| Europe | 8,629 |
| North America | 12,939 |
| Asia Pacific | 19,717 |

Text above the chart reads, “Asia Pacific is the largest primary oil consumer.” Text under the chart reads, “Source: Our World in Data, 2020.”

Figure 6. A Horizontal Bar Chart
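The label function given to scale_y_continuous can be tried on its own to see what it returns. A small sketch (the name thousands is chosen for this illustration):

```r
# The same anonymous function that is passed to scale_y_continuous(labels = ...)
thousands <- function(x) format(x, big.mark = ",", scientific = FALSE)

# format() pads the strings to a common width; trimws() removes that padding
trimws(thousands(c(0, 5000, 20000)))
```

If the scales package is installed, scale_y_continuous(labels = scales::label_comma()) is a common shorthand for the same kind of axis labels.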

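We can also add a line indicating the mean oil consumption using the geom_hline function and a text annotation using geom_text. Because the axes have been flipped, the line defined by yintercept appears as a vertical line in the chart: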

p + theme_minimal() + geom_hline(yintercept = mean(subset$Oil), color = "orange", size = 1) +
  geom_text(aes(x = 1, y = 11000, label = "Mean oil consumption"), color = "orange", size = 4)

You can also experiment with labeling the bars directly with their values, for example with p + geom_text(aes(x=subset$Entity, y=subset$Oil+3000, label=subset$Oil), color="orange", size=4), though you will need to play around with the text placement to keep the labels from overlapping the bars themselves.
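One possible way to combine the mean line, its caption, and value labels is sketched below on a toy data frame (oil and p_demo are names chosen here, with values rounded from Figure 3). It swaps in annotate("text", ...) for the caption, which is a different technique from the geom_text call above: a constant label placed inside aes() is drawn once per data row, whereas annotate() draws it once. The offsets are guesses that will need adjusting for your own plot size.

```r
library(ggplot2)

# Toy stand-in for the 2018 subset (values rounded from Figure 3)
oil <- data.frame(
  Entity = c("Africa", "Europe", "Asia Pacific"),
  Oil    = c(2225, 8629, 19717)
)
# Order the bars by value, as in the guide
oil$Entity <- factor(oil$Entity, levels = oil$Entity[order(oil$Oil)])

p_demo <- ggplot(oil, aes(x = Entity, y = Oil)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept = mean(oil$Oil), color = "orange") +
  # annotate() draws a single label; it is not repeated for every data row
  annotate("text", x = 1, y = mean(oil$Oil) + 3500,
           label = "Mean oil consumption", color = "orange", size = 4) +
  # Value labels start just past the end of each bar (hjust < 0 shifts them right)
  geom_text(aes(label = format(Oil, big.mark = ",", trim = TRUE)), hjust = -0.1) +
  # Leave room on the value axis so the longest bar's label is not cut off
  expand_limits(y = max(oil$Oil) * 1.15)
p_demo
```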

Regardless of your method and parameters of output, our bar chart should now look something like Figure 7.

The vertical axis is labeled entity and lists the entities in descending order from top to bottom: Asia Pacific, North America, Europe, Middle East, South and Central America, Commonwealth of Independent States, and Africa. The horizontal axis is labeled oil data in terawatt-hours and ranges from 0 to 20,000 in increments of 5,000. Approximate data from the bar chart are tabulated below.

| Entity | Oil |
|---|---|
| Asia Pacific | 19,700 |
| North America | 13,000 |
| Europe | 8,600 |
| Middle East | 4,800 |
| South and Central America | 3,600 |
| Commonwealth of Independent States | 2,200 |
| Africa | 2,200 |

A vertical line representing mean oil consumption runs through the 7,600 mark. Text at the bottom reads, “Source: Our World in Data, 2020.”

Figure 7. Outputting the Bar Chart
2.2 Exploring the Output

The bar chart created in this demonstration shows clearly and at a glance that Asia Pacific is by far the largest primary oil consumer, using nearly 20,000 terawatt-hours' worth of oil in 2018. The ordering by magnitude allows a fairly exact appraisal of relative differences. Even so, the difference between the Commonwealth of Independent States and Africa is nearly indistinguishable at this scale without directly labeling the values themselves.

Adding the optional mean line to the chart gives an idea of how much over the aggregated global mean Asia Pacific is.
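The comparison with the mean can also be put in numbers. A short sketch using the 2018 oil values from the formatted dataset (Figure 3); note that this is the simple mean of the seven regional totals, and the names oil and mean_oil are chosen for this illustration:

```r
# 2018 oil consumption in terawatt-hours, copied from the formatted dataset (Figure 3)
oil <- c(
  "Africa" = 2225.136, "Asia Pacific" = 19717.043,
  "Commonwealth of Independent States" = 2250.969, "Europe" = 8629.147,
  "Middle East" = 4792.163, "North America" = 12938.699,
  "South and Central America" = 3666.525
)

mean_oil <- mean(oil)
round(mean_oil)                              # 7746
round(oil[["Asia Pacific"]] / mean_oil, 1)   # 2.5
```

On these figures, Asia Pacific consumed roughly two and a half times the regional mean.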

For further context, one might want to create another bar chart to compare with another primary energy source side by side.
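One way to sketch such a side-by-side comparison, assuming the tidyr package is available, is to reshape two energy columns into long form and draw one panel per source with facet_wrap(). The data frame below is a toy excerpt with values rounded from Figure 3, and the names energy, energy_long, Source and TWh are chosen for this illustration:

```r
library(ggplot2)
library(tidyr)

# Toy excerpt: oil and coal for three regions, rounded from Figure 3
energy <- data.frame(
  Entity = c("Africa", "Europe", "Asia Pacific"),
  Oil    = c(2225, 8629, 19717),
  Coal   = c(1180, 3572, 33045)
)

# One row per region and energy source
energy_long <- pivot_longer(energy, cols = c(Oil, Coal),
                            names_to = "Source", values_to = "TWh")

ggplot(energy_long, aes(x = reorder(Entity, TWh), y = TWh)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ Source) +   # one panel per energy source, shared value axis
  labs(x = "", y = "Terawatt-hours")
```

Because both panels share the same value axis, bar lengths remain directly comparable across the two sources.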

A source of ambiguity in this chart is that the reader cannot know for certain which countries belong to which group.