How-to Guide for R
Introduction

In this guide, you will learn how to create a bar chart using the R statistical software. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have downloaded the relevant data files to a folder on your computer and that you are using the RStudio environment. The relevant code should, however, work in other environments too.

Contents

1. Bar Chart

2. An Example in R: Primary Energy Consumption of Oil by Regions in 2018

  • 2.1 The R Procedure
    • 2.1.1 Preparing the Data
    • 2.1.2 Plotting the Data
  • 2.2 Exploring the Output

3. Your Turn

1. Bar Chart

A horizontal bar chart is a common chart type used to show different values on a qualitative scale on the vertical axis. A bar chart can be useful for relatively exact appraisals of differences between categories and giving a good visual overview of the values in a dataset.

Bar charts should generally be ordered based on the data. Typically, the vertical axis is arranged by values in either ascending or descending order.

The horizontal bar chart is usually not interchangeable with the vertical bar chart, which is used to display time series and serves as an alternative to the line chart.

2. An Example in R: Primary Energy Consumption of Oil by Regions in 2018

Figure 1 shows a bar chart of global primary energy consumption of oil by geographical regions in the year 2018.

The chart is made of bars representing each region, arranged horizontally, one beneath the other. The horizontal axis is labeled oil in terawatt-hours from 0 to 20,000 in increments of 5,000. The vertical axis lists the regions. Approximate data from the graph are tabulated below.

| Region | Consumption of oil (terawatt-hours) |
| --- | --- |
| South and Central America | 3,500 |
| Europe | 8,000 |
| North America | 12,000 |
| Middle East | 4,000 |
| Africa | 2,200 |
| Asia Pacific | 19,000 |
| Commonwealth of Independent States | 2,200 |

A vertical line representing the mean oil consumption runs through approximately the 7,700 terawatt-hours mark. Text at the bottom reads, “Source: Our World in Data, 2020.”

Figure 1. A Bar Chart of Global Oil Consumption
A bar chart titled “Asia Pacific is the largest primary oil consumer,” with the subtitle “Primary energy consumption of oil by regions in 2018.”

The bar chart gives an easy at-a-glance impression of the absolute primary energy consumption of oil for the regions, ordered by magnitude. A simple two-color scheme was used with dark brown for the bars and black for text and tick marks.

The headline offers an interpretation of the main visual message.

2.1 The R Procedure

R is free, open-source software and a computing platform for statistical analysis with many charting options. R is not based on a graphical interface with pull-down menus. Rather, you input lines of code that execute functions and operations built into R or into add-on packages. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located here (http://cran.R-project.org/doc/manuals/R-release/R-intro.html).

For this example, we are using RStudio, a free, open-source user interface for R which makes working with R programming easier.

In this example, we write our code in an R script, found in the top left of the four windows in RStudio. This means that all actions can be recorded and kept for further use. It is helpful to do this to be able to trace back your steps and the decisions made in the analysis. To run the code, you can either press Ctrl + Enter (or Command + Enter on a Mac) after each line of code or highlight the line(s) of code you wish to run and click Run. (Code can also be written in the Console area in the bottom left, pressing Enter at the end of each line of code. This does not record your actions, however.)

Creating the bar chart requires installing some packages and loading their libraries. Install them if you do not have them already, either by typing install.packages("packagename") in the Console or by using the menu item Install Packages… under Tools. You will get an error message if something is missing when trying to run the script. If needed, just install the missing package, and everything should work after that.

The necessary packages in this case are all contained in one:
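The package name itself does not appear in this copy of the guide. Because the functions used below (read_delim, filter, %>%, and ggplot) come from readr, dplyr, and ggplot2, one reasonable assumption is that the single package meant here is the tidyverse collection. Under that assumption, a minimal sketch for checking what is installed before loading might look like this (missing_packages is a hypothetical helper name, not part of R):

```r
# Assumption: the guide's single package is the tidyverse collection.
required <- c("tidyverse")

# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)

if (length(to_install) > 0) {
  # Uncomment the next line to install whatever is missing.
  # install.packages(to_install)
  message("Not installed yet: ", paste(to_install, collapse = ", "))
} else {
  library(tidyverse)  # attaches readr, dplyr, ggplot2, and others
}
```

The install line is left commented out so that nothing is downloaded without your say-so.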

Install these as needed and save the tutorial CSV data file primary-energy-consumption-by-source.csv to a folder on your computer. The example uses a folder called sage_r-sourcedata in the user root, where the table goes in subfolder tables. If you choose to save your files elsewhere, just update the import path accordingly at the beginning of your code.

Begin by running the code in the script file up to line 4 by selecting the lines and pressing Ctrl + Enter (or Command + Enter on a Mac); this will load the necessary libraries.

2.1.1 Preparing the Data

We will begin by importing our dataset:

data <- read_delim("~/sage_r-sourcedata/tables/primary-energy-consumption-by-source.csv", ",")

The read_delim function accepts a variety of parameters for reading in data, including, for example, the data locale and the field delimiter (in this case a comma). You can find more documentation on these by searching for “read_delim” in your Help panel. You can alternatively use the read_csv or read_tsv functions to read in your data. Both are specialized cases of the basic read_delim function and expect comma-separated and tab-separated values, respectively. There is also a read_csv2 function, which expects semicolon separators and commas as decimal points, should this suit your particular dataset better.
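As a rough illustration of how these readers relate, the sketch below writes two tiny sample files to a temporary location and reads them back. The file contents are a two-row stand-in built for this example, not the tutorial file, and col_types = cols() is used only to keep readr from printing its column guesses.

```r
library(readr)

# Two-row sample with comma separators and decimal points.
comma_file <- tempfile(fileext = ".csv")
writeLines(c("Entity,Year,Oil (terawatt-hours)",
             "Africa,2018,2225.136",
             "Europe,2018,8629.147"), comma_file)

# The same sample with semicolon separators and decimal commas.
semicolon_file <- tempfile(fileext = ".csv")
writeLines(c("Entity;Year;Oil (terawatt-hours)",
             "Africa;2018;2225,136",
             "Europe;2018;8629,147"), semicolon_file)

with_delim <- read_delim(comma_file, delim = ",", col_types = cols())
with_csv   <- read_csv(comma_file, col_types = cols())
with_csv2  <- read_csv2(semicolon_file, col_types = cols())

# Compare the oil column produced by each reader.
with_delim[["Oil (terawatt-hours)"]]
with_csv[["Oil (terawatt-hours)"]]
with_csv2[["Oil (terawatt-hours)"]]
```

Naming the delim argument explicitly, as done here, is optional but makes the call easier to read than passing the comma by position.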

You can take a look at the data table by typing View(data) into your console panel in the bottom left quadrant of the interface or by opening the data table from the top right Environment panel. You can also view just the column names with the command colnames(data). Our data should look something like Figure 2.

Data from the imported dataset are tabulated below.

| Entity | Code | Year | Oil (terawatt-hours) | Natural gas (terawatt-hours) | Coal (terawatt-hours) | Nuclear (terawatt-hours) | Hydropower (terawatt-hours) | Solar (terawatt-hours) | Wind (terawatt-hours) | Other renewables (terawatt-hours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Africa | NA | 1965 | 328.24966 | 9.543754 | 335.580402 | 0.0000000 | 14.2788.56 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1966 | 359.14315 | 10.66916 | 331.361013 | 0.0000000 | 15.6490489 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1967 | 356.36648 | 10.545670 | 341.000471 | 0.0000000 | 16.1583330 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1968 | 376.00921 | 10.688970 | 355.393857 | 0.0000000 | 18.6229828 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1969 | 381.17900 | 12.492000 | 357.494212 | 0.0000000 | 21.5828968 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1970 | 421.12586 | 15.520325 | 367.958023 | 0.0000000 | 27.0672870 | 0.00000e+00 | 0.000000 | 0.1640000 |
| Africa | NA | 1971 | 475.80862 | 18.405264 | 388.717309 | 0.0000000 | 25.8366053 | 0.00000e+00 | 0.000000 | 0.1650000 |
| Africa | NA | 1972 | 515.50459 | 24.670657 | 392.431563 | 0.0000000 | 29.8368784 | 0.00000e+00 | 0.000000 | 0.1700000 |
| Africa | NA | 1973 | 554.28050 | 39.551791 | 414.901083 | 0.0000000 | 29.8037420 | 0.00000e+00 | 0.000000 | 0.1750000 |
| Africa | NA | 1974 | 572.99627 | 44.518487 | 432.131850 | 0.0000000 | 35.0891855 | 0.00000e+00 | 0.000000 | 0.1720000 |
| Africa | NA | 1975 | 599.15275 | 53.638377 | 459.067310 | 0.0000000 | 36.8790484 | 0.00000e+00 | 0.000000 | 0.1850000 |
| Africa | NA | 1976 | 665.14780 | 60.981390 | 481.936906 | 0.0000000 | 40.8611792 | 0.00000e+00 | 0.000000 | 0.1890000 |
| Africa | NA | 1977 | 703.88885 | 68.234191 | 492.911858 | 0.0000000 | 45.0833511 | 0.00000e+00 | 0.000000 | 0.1950000 |
| Africa | NA | 1978 | 740.60696 | 97.369898 | 477.924213 | 0.0000000 | 46.0850793 | 0.00000e+00 | 0.000000 | 0.2010000 |
| Africa | NA | 1979 | 789.47579 | 148.842816 | 504.547560 | 0.0000000 | 47.2802068 | 0.00000e+00 | 0.000000 | 0.2070000 |
| Africa | NA | 1980 | 835.99012 | 186.913977 | 543.879963 | 0.0000000 | 46.0301706 | 0.00000e+00 | 0.000000 | 0.2180000 |
| Africa | NA | 1981 | 889.82084 | 228.491156 | 636.189973 | 0.0000000 | 48.3774465 | 0.00000e+00 | 0.000000 | 0.2360000 |
| Africa | NA | 1982 | 932.52345 | 246.733549 | 702.560899 | 0.0000000 | 48.6327147 | 0.00000e+00 | 0.000000 | 0.2300000 |
| Africa | NA | 1983 | 964.19883 | 271.275347 | 710.034781 | 0.0000000 | 45.5089147 | 0.00000e+00 | 0.000000 | 0.2340000 |
| Africa | NA | 1984 | 984.73530 | 259.902780 | 762.605888 | 0.0000000 | 45.1258603 | 0.00000e+00 | 0.000000 | 0.0000000 |
| Africa | NA | 1985 | 1006.97520 | 278.842747 | 787.165889 | 0.0000000 | 5.3150000 | 0.00000e+00 | 0.000000 | 0.2690000 |
| Africa | NA | 1986 | 991.20937 | 313.271203 | 805.356096 | 0.0000000 | 8.8030000 | 0.00000e+00 | 0.000000 | 0.6590000 |
| Africa | NA | 1987 | 1041.20937 | 323.384357 | 828.319732 | 0.0000000 | 6.1670000 | 0.00000e+00 | 0.000000 | 0.6580000 |
| Africa | NA | 1988 | 1090.04827 | 358.444541 | 890.160805 | 0.0000000 | 10.4930000 | 0.00000e+00 | 0.000000 | 0.6260000 |
| Africa | NA | 1989 | 1130.28790 | 371.4710041 | 835.620862 | 0.0000000 | 11.0990000 | 0.00000e+00 | 0.000000 | 0.6170000 |
| Africa | NA | 1990 | 1146.16657 | 398.645384 | 877.809862 | 0.0000000 | 8.4490000 | 0.00000e+00 | 0.000000 | 0.7320000 |

Figure 2. The Imported Dataset
An imported dataset of primary energy consumption by source.
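
The inspection commands mentioned before Figure 2 can be tried out on a small stand-in table. The sketch below rebuilds two rows and three columns of Figure 2 by hand as an assumption for illustration; with the real import you would simply call the same functions on data.

```r
# Stand-in for the imported table: two rows and three columns from Figure 2.
data <- data.frame(
  Entity = c("Africa", "Africa"),
  Year = c(1965, 1966),
  `Oil (terawatt-hours)` = c(328.24966, 359.14315),
  check.names = FALSE  # keep the original column name, spaces included
)

colnames(data)  # column names only
head(data, 5)   # first rows, printed in the Console
str(data)       # column types with a preview of the values
nrow(data)      # number of rows
```

Unlike View(data), these commands also work outside RStudio, which can be handy when the script is run from a terminal.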

Next, we want to format our data for plotting. We will begin by making a subset of our original dataset that only includes data for the year 2018, limited to regions. Here the filter() function keeps only the rows that meet the conditions for year and entity. NB! The name for the aggregated Eurasian post-Soviet republics, the Commonwealth of Independent States, is shortened to CIS in the data.

subset <- data %>% filter(Year == 2018)

subset <- subset %>% filter(Entity %in% c('Africa', 'Asia Pacific', 'CIS', 'Europe', 'Middle East', 'North America', 'South & Central America'))

Note: Using Boolean operators like & for “and” or | for “or,” you can add multiple conditions to your filter, for example, filter(AGE == "15-64" & TIME == 2019). For more details, just search for the dplyr “filter” function in your Help panel.
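
To see how the two filter steps relate to a single combined condition, here is a small sketch on a four-row table. The table name energy and its layout are assumptions for the example; the oil values are copied from Figures 2 and 3.

```r
library(dplyr)

# Miniature stand-in for the imported table.
energy <- tibble(
  Entity = c("Africa", "Africa", "Europe", "Asia Pacific"),
  Year   = c(1990, 2018, 2018, 2018),
  Oil    = c(1146.16657, 2225.136, 8629.147, 19717.043)
)

regions <- c("Africa", "Europe")

# Two separate steps, as in the guide.
two_steps <- energy %>%
  filter(Year == 2018) %>%
  filter(Entity %in% regions)

# One step with & ("and"); a comma between conditions means the same.
one_step   <- energy %>% filter(Year == 2018 & Entity %in% regions)
with_comma <- energy %>% filter(Year == 2018, Entity %in% regions)

# | ("or") keeps rows that meet at least one of the conditions.
either <- energy %>% filter(Year == 1990 | Entity == "Europe")

two_steps
either
```

All three "and" variants keep the same two rows (Africa and Europe in 2018), so which one to use is mostly a matter of readability.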

The column names are somewhat cumbersome and long, not to mention misspelled in places, so we will also shorten them for easier plotting later. We remove all extra text contained within parentheses with the following gsub function. The code looks more complicated than it really is. The basic premise is gsub("text to be replaced", "replacement", data), where we have just input some regular expression placeholders to account for the parentheses, any leading spaces, and the text within the parentheses. You can search for gsub or regular expression in your Help panel for more documentation.

names(subset) <- gsub("\\s*\\([^\\)]+\\)", "", names(subset))
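
The pattern can be tested on a plain character vector before it is applied to the real column names. The vector below is a hand-typed sample of the names shown in Figure 2, used here only for illustration.

```r
long_names <- c("Entity", "Code", "Year",
                "Oil (terawatt-hours)",
                "Natural gas (terawatt-hours)",
                "Other renewables (terawatt-hours)")

# Pieces of the pattern, read left to right:
#   \\s*      any spaces before the opening parenthesis
#   \\(       a literal opening parenthesis
#   [^\\)]+   one or more characters that are not a closing parenthesis
#   \\)       a literal closing parenthesis
short_names <- gsub("\\s*\\([^\\)]+\\)", "", long_names)

short_names
# "Entity" "Code" "Year" "Oil" "Natural gas" "Other renewables"
```

Names without parentheses, such as Entity, pass through unchanged because the pattern finds nothing to replace in them.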

We will also rename CIS to Commonwealth of Independent States for clarity's sake:

subset$Entity <- gsub("CIS", "Commonwealth of\n Independent States", subset$Entity)
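
One thing to keep in mind is that gsub replaces the pattern wherever it occurs inside a string. For the seven regions selected here that makes no difference, but a stricter variant can be sketched with the regular expression anchors ^ and $. The label "Other CIS" below is a hypothetical value used only to show the difference.

```r
entities  <- c("Africa", "CIS", "Other CIS")  # "Other CIS" is hypothetical
full_name <- "Commonwealth of\n Independent States"

# Unanchored: also rewrites the "CIS" inside "Other CIS".
loose <- gsub("CIS", full_name, entities)

# Anchored: rewrites a value only when it is exactly "CIS".
exact <- gsub("^CIS$", full_name, entities)

loose
exact
```

The \n in the replacement is a line break, which lets the long name wrap onto two lines on the chart axis.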

Our formatted subset should look something like Figure 3.

The data from the formatted data set are tabulated below.

| Entity | Code | Year | Oil | Natural gas | Coal | Nuclear | Hydropower | Solar | Wind | Other renewables |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Africa | NA | 2018 | 2225.136 | 1499.912 | 1179.69841 | 11.090427 | 132.8408 | 9.0290890 | 14.685940 | 8.1610229 |
| Asia Pacific | NA | 2018 | 19717.043 | 8253.226 | 33044.86437 | 553.584622 | 1718.5083 | 314.2085529 | 460.469456 | 221.3015173 |
| Commonwealth of Independent States | NA | 2018 | 2250.969 | 5807.758 | 1568.74569 | 206.577070 | 244.8391 | 0.8813001 | 0.977083 | 0.6791778 |
| Europe | NA | 2018 | 8629.147 | 5489.557 | 3571.61300 | 937.491630 | 642.0666 | 139.0520627 | 404.369480 | 217.6324969 |
| Middle East | NA | 2018 | 4792.163 | 5531.019 | 92.445519 | 7.000133 | 15.1905 | 6.1211536 | 1.060386 | 0.2613104 |
| North America | NA | 2018 | 12938.699 | 10223.409 | 3992.96555 | 963.183321 | 708.3523 | 102.9072327 | 322.528364 | 99.7459563 |
| South and Central America | NA | 2018 | 3666.525 | 1683.690 | 419.13443 | 22.504796 | 731.3065 | 12.4315268 | 65.862666 | 78.0238802 |

Figure 3. The Formatted Dataset
A formatted dataset of primary energy consumption by source.

2.1.2 Plotting the Data

We will begin with a very simple vertical bar chart of our subset data, choosing continents for the x-axis and oil consumption values for the y-axis. The simplest version of the bar chart is accomplished simply by passing our data subset to ggplot(DATATABLE, aes(x=COLUMN, y=COLUMN)) + geom_col() with the desired columns filled in for the x- and y-axes.

ggplot(data=subset, aes(x=Entity, y=Oil)) + geom_col()

NB! There is also an alternate geom_bar function, which by default creates bars from data counts instead of data values. geom_bar can generally be used to perform statistical transformations to data before plotting, and the addition of a stat = "identity" parameter also allows for overriding the default case counting method to create the same output as geom_col. You may see both methods used for plotting bar charts, though, for our purposes, geom_col() is more direct and simple.
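
The relationship between the two functions can be checked by comparing the bar heights that ggplot2 computes for each. The sketch below uses three regions with oil values from Figure 3; the data frame name regions is an assumption for the example.

```r
library(ggplot2)

# Oil values for three regions, taken from Figure 3.
regions <- data.frame(
  Entity = c("Africa", "Europe", "Middle East"),
  Oil    = c(2225.136, 8629.147, 4792.163)
)

with_col <- ggplot(regions, aes(x = Entity, y = Oil)) + geom_col()
with_bar <- ggplot(regions, aes(x = Entity, y = Oil)) +
  geom_bar(stat = "identity")

# Default geom_bar() counts rows per category, so no y aesthetic is given.
counted <- ggplot(regions, aes(x = Entity)) + geom_bar()

layer_data(with_col)$y  # bar heights equal the Oil values
layer_data(with_bar)$y  # same heights as geom_col()
layer_data(counted)$y   # 1 for each region: one row per category
```

layer_data() returns the values a layer will actually draw, which makes it a convenient way to confirm what a geom is doing without inspecting the picture.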

Our basic column chart should look something like Figure 4.

The horizontal axis is labeled entity and lists the entities in alphabetical order as follows: Africa, Asia Pacific, Commonwealth of Independent States, Europe, Middle East, North America, and South and Central America. The vertical axis is labeled oil and ranges from 0 to 20,000 in increments of 5,000. Approximate data from the column chart are tabulated below.

| Entity | Oil |
| --- | --- |
| Africa | 2,200 |
| Asia Pacific | 19,700 |
| Commonwealth of Independent States | 2,200 |
| Europe | 8,600 |
| Middle East | 4,800 |
| North America | 13,000 |
| South and Central America | 3,600 |

Figure 4. A Simple Column Chart
A simple column chart lists data of oil consumption by different entities.

You will notice that the bars are in a somewhat inconvenient alphabetical order. Sadly, you can sort your original data frame with sort() or arrange() all you want, but ggplot will still plot the values alphabetically, because a discrete axis follows the factor levels of the column, and for a text column these default to alphabetical order. In order to fix this, we will explicitly define a factor order via mutate:

p <- subset %>%
  arrange(Oil) %>%
  mutate(Entity = factor(Entity, levels = Entity)) %>%
  ggplot(aes(x = Entity, y = Oil)) +
  geom_col()

p

After assigning the basic plot to the letter p, we can view it by simply typing “p” into the Console. This also creates a simple base plot on top of which we can draw additional elements (Figure 5).
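
What the mutate step does to the factor levels can be seen without drawing anything. The sketch below uses three regions with oil values from Figure 3 and also shows reorder(), from the stats package that ships with R, as one possible shorthand; it is an alternative, not what the guide's script uses.

```r
entity <- c("Europe", "Africa", "Asia Pacific")
oil    <- c(8629.147, 2225.136, 19717.043)  # values from Figure 3

# Default: levels are sorted alphabetically, whatever the row order.
levels(factor(entity))

# Explicit levels, as in the mutate() call: sort the names by oil first.
by_oil <- factor(entity, levels = entity[order(oil)])
levels(by_oil)
# "Africa" "Europe" "Asia Pacific"

# reorder() arrives at the same ordering in one step.
levels(reorder(entity, oil))
```

Inside a plot, the shorthand would be written as aes(x = reorder(Entity, Oil), y = Oil), which avoids the separate arrange and mutate steps.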

The horizontal axis lists the following entities in ascending order: Africa, Commonwealth of Independent States, South and Central America, Middle East, Europe, North America, and Asia Pacific. The vertical axis shows oil from 0 to 20,000 in increments of 5,000. Approximate data from the column chart are tabulated below.

| Entity | Oil |
| --- | --- |
| Africa | 2,200 |
| Commonwealth of Independent States | 2,200 |
| South and Central America | 3,600 |
| Middle East | 4,800 |
| Europe | 8,600 |
| North America | 13,000 |
| Asia Pacific | 19,700 |

Figure 5. A Rearranged Column Chart
A rearranged column chart lists data of oil consumption by different entities.

To make variations of this basic plot, we continue by adding to the p variable. In the following example, we flip the axes to create a horizontal bar chart with coord_flip() and add just a bit of custom styling in the form of a simple theme, plus the necessary title and subtitle information via labs(). The parameters passed to scale_y_continuous add a thousands separator to the values on the (now flipped) y-axis, which is drawn horizontally.

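Notice that because the following code begins with p <-, the updated plot is saved back into the p variable rather than drawn immediately; type p into the Console to display it (Figure 6). Without the p <- at the beginning, the plot would be drawn, but nothing would be saved in the p variable.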

p <- p + coord_flip() + theme_minimal() +
  labs(title = "Asia Pacific is the largest primary oil consumer",
       subtitle = "Primary energy consumption of oil by regions in 2018",
       caption = "Source: Our World in Data, 2020") +
  ylab("Terawatt-hours") +
  xlab("") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))
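
The labeling function passed to scale_y_continuous can be tried on its own to see what it returns. In this sketch, ticks is a made-up vector of axis values and add_separator is a hypothetical name given to the same anonymous function used in the plot code.

```r
ticks <- c(0, 5000, 10000, 20000)

# The same function that is passed to scale_y_continuous(labels = ...).
add_separator <- function(x) format(x, big.mark = ",", scientific = FALSE)

add_separator(ticks)  # strings padded to a common width

# trim = TRUE drops the padding.
format(ticks, big.mark = ",", scientific = FALSE, trim = TRUE)
# "0" "5,000" "10,000" "20,000"

# Without scientific = FALSE, some large round numbers switch notation.
format(1e5)
# "1e+05"
```

The scientific = FALSE argument is what guards against labels such as 1e+05 appearing on the axis when the values get large.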

The horizontal axis is labeled oil in terawatt-hours and ranges from 0 to 20,000, in increments of 2,500. The vertical axis lists regions. The data from the chart are tabulated below.

| Region | Oil (terawatt-hours) |
| --- | --- |
| Africa | 2,225 |
| Commonwealth of Independent States | 2,251 |
| South and Central America | 3,667 |
| Middle East | 4,792 |
| Europe | 8,629 |
| North America | 12,939 |
| Asia Pacific | 19,717 |

Text above the chart reads, “Asia Pacific is the largest primary oil consumer.” Text under the chart reads, “Source: Our World in Data, 2020.”

Figure 6. A Horizontal Bar Chart
A horizontal bar chart titled “Asia Pacific is the largest primary oil consumer,” with the subtitle “Primary energy consumption of oil by regions in 2018.”

We can also add a line indicating the mean for oil consumption using the geom_hline function and text annotation using geom_text:

p + theme_minimal() +
  geom_hline(yintercept = mean(subset$Oil), color = "orange", size = 1) +
  geom_text(aes(x = 1, y = 11000, label = "Mean oil consumption"), color = "orange", size = 4)
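
As a possible variation, the label can be drawn with annotate() instead of geom_text(). With constants inside aes(), geom_text() draws one copy of the label for every row of the data, whereas annotate() draws it once. The sketch below is self-contained and rebuilds the seven oil values from Figure 3; the names regions, base_plot, and with_mean are assumptions for the example.

```r
library(ggplot2)

# Oil values by region, taken from Figure 3.
regions <- data.frame(
  Entity = c("Africa", "Asia Pacific", "CIS", "Europe",
             "Middle East", "North America", "South & Central America"),
  Oil = c(2225.136, 19717.043, 2250.969, 8629.147,
          4792.163, 12938.699, 3666.525)
)
mean_oil <- mean(regions$Oil)  # about 7,746 terawatt-hours

base_plot <- ggplot(regions, aes(x = reorder(Entity, Oil), y = Oil)) +
  geom_col() +
  coord_flip()

# The label is placed on the Africa row, to the right of the mean line.
with_mean <- base_plot +
  geom_hline(yintercept = mean_oil, color = "orange") +
  annotate("text", x = "Africa", y = mean_oil + 3500,
           label = "Mean oil consumption", color = "orange", size = 4)

with_mean
```

Computing the label position from mean_oil rather than typing a fixed number keeps the text next to the line if the data change.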

You can also experiment with labeling the bars directly with their values, such as p + geom_text(aes(y = Oil + 3000, label = Oil), color = "orange", size = 4), though you will need to play around with the text placement to avoid overlapping the bars themselves.
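
One way to handle the placement is to left-align each label just past the end of its bar and widen the axis a little. This is a sketch under assumptions rather than the guide's own method: it uses three regions with oil values from Figure 3, an offset of 100, and 15% of extra axis room, all of which you would tune by eye.

```r
library(ggplot2)

regions <- data.frame(
  Entity = c("Africa", "CIS", "Europe"),
  Oil    = c(2225.136, 2250.969, 8629.147)  # values from Figure 3
)

labelled <- ggplot(regions, aes(x = reorder(Entity, Oil), y = Oil)) +
  geom_col() +
  coord_flip() +
  # hjust = 0 starts each label at its anchor point; the offset adds a gap
  geom_text(aes(y = Oil + 100,
                label = format(round(Oil), big.mark = ",", trim = TRUE)),
            hjust = 0, color = "orange", size = 4) +
  # widen the value axis so the longest label is not clipped
  expand_limits(y = max(regions$Oil) * 1.15)

labelled
```

With the values printed, the small gap between Africa and the Commonwealth of Independent States becomes readable even though the bars look the same length.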

Regardless of your method and parameters of output, our bar chart should now look something like Figure 7.

The vertical axis is labeled entity and lists the following entities in descending order: Asia Pacific, North America, Europe, Middle East, South and Central America, Commonwealth of Independent States, and Africa. The horizontal axis is labeled oil in terawatt-hours and ranges from 0 to 20,000 in increments of 5,000. Approximate data from the bar chart are tabulated below.

| Entity | Oil |
| --- | --- |
| Asia Pacific | 19,700 |
| North America | 13,000 |
| Europe | 8,600 |
| Middle East | 4,800 |
| South and Central America | 3,600 |
| Commonwealth of Independent States | 2,200 |
| Africa | 2,200 |

A vertical line representing the mean oil consumption runs through approximately the 7,700 mark. Text at the bottom reads, “Source: Our World in Data, 2020.”

Figure 7. Outputting the Bar Chart
A bar chart is titled “Asia Pacific is the largest primary oil consumer. Primary energy consumption of oil by regions in 2018.”

2.2 Exploring the Output

The bar chart created in this demonstration shows clearly and at a glance how Asia Pacific is the largest primary oil consumer by far, using nearly 20,000 terawatt-hours worth of oil in 2018. The ordering by magnitude allows a fairly exact appraisal of relative differences, but still, the difference between the Commonwealth of Independent States and Africa is nearly indistinguishable at this scale without directly labeling the values themselves.

Adding the optional mean line to the chart gives an idea of how far above the mean of the seven regions Asia Pacific is.

For further context, one might want to create another bar chart to compare with another primary energy source side by side.
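
One possible way to set two sources side by side is to reshape the table to a long format and let ggplot2 draw one panel per source. The sketch below assumes the tidyr package is available and uses three regions with oil and coal values from Figure 3; the names regions, long, and side_by_side are assumptions for the example.

```r
library(ggplot2)
library(tidyr)

# Oil and coal values for three regions, taken from Figure 3.
regions <- data.frame(
  Entity = c("Africa", "Europe", "North America"),
  Oil    = c(2225.136, 8629.147, 12938.699),
  Coal   = c(1179.69841, 3571.61300, 3992.96555)
)

# Reshape to one row per region and source so the plot can be split by source.
long <- pivot_longer(regions, cols = c(Oil, Coal),
                     names_to = "Source", values_to = "TWh")

side_by_side <- ggplot(long, aes(x = reorder(Entity, TWh), y = TWh)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ Source) +
  labs(x = "", y = "Terawatt-hours")

side_by_side
```

Both panels share the same value axis by default, which keeps the bar lengths directly comparable across sources.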

A source of ambiguity in this chart is that the reader cannot know for certain which countries belong to which group.

3. Your Turn

Now that you have been introduced to some of the basic operations necessary to complete this type of visualization, you may experiment with variations based on this same dataset. You can try plotting different variables, time periods, or another selection of values—how would you accomplish these tasks? Can you add values to the chart or make a graphic with multiple bar charts for different variables? How would you go about coloring the bars by some value? This will require reading up on some documentation, which you can find, for instance, by typing “geom_col” into your Help panel.