How-to Guide
Introduction

In this guide, you will learn how to create a line chart in R using the tidyverse set of R packages. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have downloaded the relevant data files to a folder on your computer and that you are using the RStudio IDE. The relevant code should, however, work in other environments too.

Contents

1. Line Chart

2. An Example in R: Central American Greenhouse Gas Emissions 2000–2012

  • 2.1 The R Procedure
    • 2.1.1 Preparing the Data
    • 2.1.2 Plotting the Data
  • 2.2 Exploring the Output

3. Your Turn

1. Line Chart

The line chart is often considered the default option in statistical graphics and is probably the most commonly used chart type. Like the vertical bar chart, it can be used for displaying continuous data and is typically used to visualize data spanning a specific period of time. A line chart consists of individual data points connected by a line, where the line is an approximation of the values falling between recorded points. Most often this means drawing a straight line between points, but sometimes a curved or stepped line is used instead, the latter when changes in the value are abrupt. The points themselves are usually marked with dots, though when the chart includes a large number of data points, they are usually left unmarked to avoid cluttering the graphic. Markers are also not used with stepped or curved line charts. A line chart should always have a quantitative scale on both the x- and y-axes.
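As a rough illustration of the difference between a straight and a stepped line, the following sketch draws both from the same small dataset. The data frame readings and its values are made up for this illustration and are not part of the tutorial dataset.

```r
library(ggplot2)

# Made-up readings, for illustration only
readings <- data.frame(
  year  = 2000:2005,
  value = c(10, 12, 12, 18, 17, 21)
)

# Straight segments between recorded points, with point markers
straight_chart <- ggplot(readings, aes(x = year, y = value)) +
  geom_line() +
  geom_point()

# Stepped line for values that change abruptly; markers are omitted
stepped_chart <- ggplot(readings, aes(x = year, y = value)) +
  geom_step()

straight_chart
stepped_chart
```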

2. An Example in R: Central American Greenhouse Gas Emissions 2000–2012

Figure 1 shows a basic line chart of Central American countries’ greenhouse gas emissions during the period 2000–2012. The chart is generated with World Bank data, and a set of R packages called tidyverse.

The chart shows kt of CO2 equivalent of emissions by Guatemala, Honduras, Nicaragua, Panama, El Salvador, Costa Rica, and Belize. The horizontal axis ranges from 2000 to 2012 in increments of one year. The vertical axis ranges from 0 to 70,000 in increments of 10,000. The approximate data from the chart are tabulated below.

Year  Guatemala  Honduras  Nicaragua  Panama  El Salvador  Costa Rica  Belize
2000  70000      21500     14500      9000    10500        10000       2000
2001  21500      13000     15000      11000   12500        11000       1000
2002  31000      15000     15000      10000   12500        10000       2000
2003  55500      22000     17000      10500   12500        10500       4000
2004  25000      15500     15500      10000   12500        10500       2000
2005  42000      22500     17000      10000   12500        10000       2500
2006  27500      17000     1500       12000   13000        10500       2000
2007  34000      19500     15500      11500   13000        11000       1500
2008  33000      20000     16000      15500   13000        10500       2000
2009  38000      20000     16000      15750   13000        11000       2000
2010  30000      20000     16000      15750   13000        12500       2000
2011  30500      20000     16000      16000   13000        13000       2000
2012  31000      20500     16000      16000   13000        13000       2000

Text at the bottom of the chart reads, “Source: World Bank, 2020.”

Figure 1. Line Chart of Greenhouse Gas Emissions in Central America
A line chart shows “Total greenhouse gas emissions, 2000 to 2012” in Central America.

2.1 The R Procedure

R is free, open-source software and a computing platform for statistical analysis with many charting options. R is not based on a graphical interface with pull-down menus. Rather, you input lines of code that execute functions and operations built into R or into add-on packages. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located here (http://cran.R-project.org/doc/manuals/R-release/R-intro.html).

For this example, we are using RStudio, a free, open-source user interface for R which makes working with R programming easier.

In this example, we write our code in an R script, found in the top left of the four panes in RStudio. This means that all actions are recorded and can be kept for further use, which makes it possible to trace back your steps and the decisions made in the analysis. To run the code, you can either press Ctrl + Enter (or Command + Enter on a Mac) after each line of code or highlight the line(s) of code you wish to run and click Run. (Code can also be written in the Console pane in the bottom left, pressing Enter at the end of each line of code. This does not record your actions, however.)

Creating the line chart requires installing some packages and loading them as libraries. Install them if you do not have them already, either by typing install.packages("packagename") in the console or by using the menu item Install Packages… under Tools. You will get an error message if a package is missing when you try to run the script. If that happens, install the missing package and run the script again.

The code in this guide relies on the following packages:

  • tidyverse (includes readr, dplyr, tidyr, and ggplot2)
  • ggrepel

Install these as needed and save the tutorial CSV data file world-bank-emissions.csv to a folder on your computer. The example uses a folder called sage-dataset in the user's home directory, with the table stored in a subfolder called tables.

Begin by running the code in the script file up to line 4: select the lines and press Ctrl + Enter (Command + Enter on a Mac). This will load the necessary libraries.
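The opening lines of the script file are not reproduced in this guide. As a minimal sketch of what such a setup step might look like, the following checks whether the packages are installed before loading them. The package list in needed_pkgs is an assumption inferred from the functions used later in the guide, not a copy of the script.

```r
# Assumed package list, inferred from the functions used in this guide
needed_pkgs <- c("tidyverse", "ggrepel")

# Identify packages that are not installed yet
is_installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Report what to install instead of installing silently
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  library(tidyverse)  # readr, dplyr, tidyr, ggplot2
  library(ggrepel)    # geom_text_repel()
}
```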

2.1.1 Preparing the Data

We will begin by reading in our data. Adjust the path so that it matches the folder where you saved the file:

greenhouse_emissions <- read_delim("~/sage_r-sourcedata/tables/world-bank-emissions.csv", "," )

You can take a look at the data table by typing View(greenhouse_emissions) into your console panel in the bottom left quadrant of the interface or by opening the data table from the top right Environment panel. You can also view just the column names with the command colnames(greenhouse_emissions) (Figure 2).

The window consists of two tabs: linechart-in-worldbank-2020-project and greenhouse_emissions. The second tab is selected. The data table shows the following columns: Country Name, Country Code, Indicator Name, Indicator Code, 1970, 1971, 1973, and 1974. Text at the bottom of the window reads, “Showing 1 to 23 entries of 264 entries, 47 total columns.”

Figure 2.
A screenshot shows a window with a data table of greenhouse gas emissions by country and year.
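To try the import step without the full file, the sketch below writes a tiny comma-separated file in the same wide layout to a temporary location and reads it back with read_delim(). The object names csv_lines, tmp_file, and mini, and the values themselves, are made up for this illustration.

```r
library(readr)

# Made-up miniature file in the same wide layout; values are illustrative only
csv_lines <- c(
  "Country Name,Country Code,2000,2001",
  "Belize,BLZ,2000,1000",
  "Panama,PAN,9000,11000"
)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# Read the file, naming the delimiter explicitly
mini <- read_delim(tmp_file, delim = ",")

colnames(mini)  # the year headers are kept as text names such as "2000"
mini$`2000`     # names that start with a digit need backticks when used with $
```

The year columns keep their numeric-looking names, which is why they have to be wrapped in backticks or quotes in later steps.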

First, we will make a copy of the original dataset in case we later want to make several different subsets…

GE <- greenhouse_emissions

… and for ease of use, we will rename one column to a shorter form:

GE <- GE %>% rename("Name" = "Country Name")

Next, we will select the columns and rows we want to visualize in our chart. Here, we will take the years 2000–2012 and choose the countries of Central America by name, using select() for columns and filter() for rows. You could also choose rows by value, for example, with %>% filter(`2012` > 10000000). Note the backticks around the column name: because it starts with a digit, it cannot be written as GE$2012.

GE <- GE %>% select("Name", "2000":"2012") %>% filter(Name %in% c("Guatemala", "Belize", "Panama","Costa Rica","Nicaragua","Honduras","El Salvador"))
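The following sketch shows both kinds of row selection, by name and by value, on a small made-up table. The tibble wide and its numbers are hypothetical and only mimic the shape of the World Bank data.

```r
library(dplyr)

# Made-up wide table; values are illustrative only
wide <- tibble(
  Name   = c("Belize", "Panama", "World"),
  `1999` = c(1900, 8800, 39000000),
  `2000` = c(2000, 9000, 40000000),
  `2001` = c(1000, 11000, 41000000)
)

# Keep Name plus a range of year columns, then keep rows by name
by_name <- wide %>%
  select(Name, `2000`:`2001`) %>%
  filter(Name %in% c("Belize", "Panama"))

# Keep rows by value; the year column is wrapped in backticks
by_value <- wide %>%
  filter(`2001` > 10000000)

by_name
by_value
```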

Opening up our GE data frame now, it should look something like Figure 3.

The screenshot shows two tabs: linechart-in-worldbank-2020-project and GE. The second tab is selected. The table shows the following columns: Name, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, and 2008.

Figure 3.
A screenshot shows a window with a data table.

We will reformat the table for easier plotting, gathering all the individual year columns together into Year and all emission values into an Emissions column (Figure 4).

GE <- GE %>% gather(Year, Emissions, -Name)
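gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). As a sketch of the same reshaping with the newer function, the example below uses a small made-up table; the names wide and long are chosen for this illustration. It also converts Year to an integer, which the guide's own code does not do.

```r
library(tidyr)

# Made-up wide table; values are illustrative only
wide <- data.frame(
  Name = c("Belize", "Panama"),
  `2000` = c(2000, 9000),
  `2001` = c(1000, 11000),
  check.names = FALSE  # keep the year headers as they are
)

# Move every column except Name into Year/Emissions pairs
long <- pivot_longer(
  wide,
  cols = -Name,
  names_to = "Year",
  values_to = "Emissions"
)

# Column names arrive as text; convert them to whole numbers
long$Year <- as.integer(long$Year)

long
```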

The screenshot shows two tabs: linechart-in-worldbank-2020-project and GE. The second tab is selected. The table shows the following columns: Name, Year, and Emissions.

Figure 4.
A screenshot shows a window with a filtered subset of the original data table.

2.1.2 Plotting the Data

Now that we have our data ready, we can begin plotting the line chart. We will begin by initializing a ggplot plot and adding series lines and points.

lineChart <- ggplot(GE, aes(x=Year, y=Emissions, group=Name, color = Name, label = Name)) + geom_line()+ geom_point()

We will also assign the highly visible Dark2 color palette and a minimal chart theme at this point. To peruse possible color palettes and further documentation on colors, you can search for “scale_color_brewer” in your Help panel. You can also check on the current state of your plot at any time by typing lineChart in your Console panel (Figure 5).

lineChart<- lineChart + theme_minimal() + scale_color_brewer(palette = "Dark2")

The chart shows the emissions by Guatemala, Honduras, Nicaragua, Panama, El Salvador, Costa Rica, and Belize. The horizontal axis is labeled Year and ranges from 2000 to 2012 in increments of one year. The vertical axis is labeled Emissions and ranges from 0 to 70,000 in increments of 10,000. The approximate data from the chart are tabulated below.

Year  Belize  Costa Rica  El Salvador  Guatemala  Honduras  Nicaragua  Panama
2000  2000    10000       10500        70000      21500     14500      9000
2001  1000    11000       12500        21500      13000     15000      11000
2002  2000    10000       12500        31000      15000     15000      10000
2003  4000    10500       12500        55500      22000     17000      10500
2004  2000    10500       12500        25000      15500     15500      10000
2005  2500    10000       12500        42000      22500     17000      10000
2006  2000    10500       13000        27500      17000     1500       12000
2007  1500    11000       13000        34000      19500     15500      11500
2008  2000    10500       13000        33000      20000     16000      15500
2009  2000    11000       13000        38000      20000     16000      15750
2010  2000    12500       13000        30000      20000     16000      15750
2011  2000    13000       13000        30500      20000     16000      16000
2012  2000    13000       13000        31000      20500     16000      16000

Figure 5.
A line chart shows emissions by seven Central American countries between 2000 and 2012.
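In the code above, Year is stored as text, so ggplot treats the x-axis as discrete and needs group = Name to know which points to connect. As an alternative sketch, not the approach used in this guide, Year can be converted to a number so that the x-axis is a true quantitative scale. The small data frame below uses a few of the approximate values tabulated above.

```r
library(ggplot2)

# A few approximate values from the table above, in long format
long <- data.frame(
  Name = rep(c("Belize", "Panama"), each = 3),
  Year = rep(c("2000", "2001", "2002"), times = 2),
  Emissions = c(2000, 1000, 2000, 9000, 11000, 10000)
)

# Convert the text years to whole numbers for a continuous x-axis
long$Year <- as.integer(long$Year)

numeric_chart <- ggplot(long, aes(x = Year, y = Emissions, color = Name)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 2000:2002) +  # one tick per year
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

numeric_chart
```

With a numeric Year, any later scale_x_discrete() call would have to be replaced by its continuous counterpart.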

We will continue by removing extraneous gridlines and backgrounds and adding a margin for visual breathing space. If some of your chart elements are cut off by the edge of the plot area, you can experiment with the scale expansion parameters in scale_x_discrete(expand=c(0.1, 0.6)). If you want to label the x- and y-axes, you can do so in the ylab("Y AXIS TEXT") + xlab("X AXIS TEXT") calls at the end of this code, though in this case we prefer to leave them blank.

lineChart <- lineChart + scale_x_discrete(expand = c(0.1, 0.6)) +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank(),
        panel.background = element_blank(),
        plot.margin = unit(c(30, 80, 30, 30), "points")) +
  ylab("") + xlab("")

Next, we add our chart title, subtitle, and source information:

lineChart <- lineChart + labs(title = "Total greenhouse gas emissions, 2000-2012",
                              subtitle = "kt of CO2 equivalent",
                              caption = "Source: World Bank, 2020") +
  theme(plot.title = element_text(face = "bold", size = 14, color = "black"))

Finally, we add direct text labels for each series:

lineChart <- lineChart + geom_text_repel(label = ifelse(GE$Year == "2012", GE$Name, ""),
                                         direction = "y",
                                         box.padding = 0.1,
                                         segment.alpha = 0.2,
                                         nudge_x = 2,
                                         nudge_y = 2)

For the text labels, we use the ggrepel package’s geom_text_repel() function to avoid label overlap. You can also experiment with the more traditional geom_text() in future datasets if the series labels are not in any danger of overlapping. The ifelse() condition in the label argument blanks out every point label except those for the final year, 2012. Most other parameters are fairly self-explanatory: nudge_x and nudge_y move the text a certain distance from its starting position, box.padding sets the padding around each label, and direction determines in which direction the labels are allowed to move. For further documentation on these repelling labels, search for “ggrepel” in your Help panel.
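Another way to label only the final year, sketched here as an alternative rather than as the method of this guide, is to pass geom_text_repel() a data frame that contains only the 2012 rows. The data frame long and the helper final_year are made up for this illustration, using a few of the approximate values tabulated earlier.

```r
library(ggplot2)
library(ggrepel)

# A few approximate values, in long format with text years
long <- data.frame(
  Name = rep(c("Belize", "Panama"), each = 3),
  Year = rep(c("2010", "2011", "2012"), times = 2),
  Emissions = c(2000, 2000, 2000, 15750, 16000, 16000)
)

# Only the rows that should carry a label
final_year <- subset(long, Year == "2012")

labelled_chart <- ggplot(long, aes(x = Year, y = Emissions,
                                   group = Name, color = Name)) +
  geom_line() +
  geom_point() +
  geom_text_repel(
    data = final_year,   # label layer sees the 2012 rows only
    aes(label = Name),
    direction = "y",
    box.padding = 0.1,
    nudge_x = 0.3
  ) +
  theme(legend.position = "none")

labelled_chart
```

This keeps the label text inside the data frame instead of referring to GE$Year and GE$Name from outside the plot.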

Now that we have added all our elements to the lineChart variable, we can plot our finished chart (Figure 6).

lineChart

The chart shows emissions, in kt of CO2 equivalent, for Guatemala, Honduras, Nicaragua, Panama, El Salvador, Costa Rica, and Belize. The horizontal axis ranges from 2000 to 2012 in increments of one year. The vertical axis ranges from 0 to 70,000 in increments of 10,000. The approximate data are the same as those tabulated for Figure 1.

Text at the bottom of the chart reads, “Source: World Bank, 2020.”

Figure 6. Our Completed Line Chart
A line chart shows “Total greenhouse gas emissions, 2000 to 2012.”

If you want to fine-tune your plot visually, you can export it from R as a PDF (see the Export tab above your plot) and open it in a vector graphics editor of your choice, such as Adobe Illustrator or Inkscape. Figure 7 has been tweaked very slightly in this way for improved legibility and clean alignment. Many of these small adjustments can also be made within R, such as drawing a bolder zero baseline with lineChart <- lineChart + geom_segment(aes(x="2000", xend="2012", y=0, yend=0), color="black", size=0.5). Because Year is stored as text after gather(), the x positions are given as quoted strings.
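As a sketch of doing both steps in code, the example below draws a zero baseline with geom_hline() and writes the plot to a PDF with ggsave(). The data frame, the object names, the file name line-chart.pdf, and the page size are assumptions made for this illustration.

```r
library(ggplot2)

# A few approximate values for one series
long <- data.frame(
  Year = c("2010", "2011", "2012"),
  Emissions = c(15750, 16000, 16000)
)

chart <- ggplot(long, aes(x = Year, y = Emissions, group = 1)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "black") +  # zero baseline
  expand_limits(y = 0)  # geom_hline() does not stretch the axis itself

# Write a vector PDF to a temporary folder; width and height are in inches
out_file <- file.path(tempdir(), "line-chart.pdf")
ggsave(out_file, plot = chart, width = 8, height = 5)
```

The resulting PDF can be opened in a vector graphics editor in the same way as a file exported from the Export tab.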

The chart shows emissions, in kt of CO2 equivalent, for Guatemala, Honduras, Nicaragua, Panama, El Salvador, Costa Rica, and Belize. The horizontal axis ranges from 2000 to 2012 in increments of one year. The vertical axis ranges from 0 to 70,000 in increments of 10,000. The approximate data are the same as those tabulated for Figure 1.

Text at the bottom of the chart reads, “Source: World Bank, 2020.”

Figure 7. The Finished Line Chart After Some Tweaking in Illustrator
A line chart shows “Total greenhouse gas emissions, 2000 to 2012.”

2.2 Exploring the Output

The line chart created through this demonstration shows how wide a range of emissions is generated in a relatively limited geographical area, by countries of relatively similar size in terms of area. Guatemala differs greatly from the other Central American countries, emitting even at the best of times half again as much as the next country, Honduras. This is at least partially an indication of Guatemala’s relatively large population and economic output, but it may also reflect variables such as the means of generating electricity.
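The comparison between Guatemala and Honduras can be checked with a little arithmetic. The sketch below uses the approximate values read from the chart and tabulated for Figure 1, so the result is only as precise as those readings. The vector names are chosen for this illustration.

```r
# Approximate chart readings, 2000 to 2012 (kt of CO2 equivalent)
years <- 2000:2012
guatemala <- c(70000, 21500, 31000, 55500, 25000, 42000, 27500,
               34000, 33000, 38000, 30000, 30500, 31000)
honduras  <- c(21500, 13000, 15000, 22000, 15500, 22500, 17000,
               19500, 20000, 20000, 20000, 20000, 20500)

# How many times larger Guatemala's emissions are in each year
ratio <- guatemala / honduras

# The year in which the two countries are closest
smallest_gap <- data.frame(
  Year  = years[which.min(ratio)],
  Ratio = min(ratio)
)
smallest_gap
```

With these readings, the smallest ratio is 1.5, in 2010, which matches the description of Guatemala emitting at least half again as much as Honduras.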

Several countries seem to exhibit emission peaks in both 2003 and 2005, while emissions since then seem to have mostly stabilized. The data itself does not hint at the reason for these peaks but does highlight some interesting time periods for further study.

3. Your Turn

Now that you have been introduced to some of the basic operations necessary to complete this type of visualization, you may experiment with variations based on this same dataset. You can try plotting different time periods or different subsets of countries. How would you accomplish these tasks? Are you able to color code the series by some other interesting feature (e.g., which countries have a certain emissions trajectory)? What other information could you label the series by, besides just the country name?