Colouring Plots by Factor Variables in R with ggplot2: A Comprehensive Guide

Colouring Plot by Factor in R

====================================

In this article, we will explore how to colour a scatter plot by a factor variable in R. We will start with the basics of plotting data in R and then move on to more advanced techniques.

Introduction


R is a popular programming language for statistical computing and graphics. One of its key features is its ability to create high-quality plots that can help us visualize complex data. In this article, we will focus on colouring plots by factor variables, which is an essential technique in data visualization.

Basic Plotting in R


To get started with plotting in ggplot2, we load the package, build a small dataset, and draw a scatter plot. Here is a basic example:

# Load the ggplot2 package
library(ggplot2)

# Create a new dataset
data <- data.frame(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0, 5.5, 4.4, 4.9, 4.0, 3.5),
                   Sepal.Width = c(2.8, 3.0, 3.2, 2.5, 2.1, 2.8, 2.7, 2.8, 2.2, 2.1))

# Create a scatter plot
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

Colouring Plot by Factor Variable


Now that we have created a basic plot, let’s colour it by a factor variable. We can do this by mapping the color aesthetic to that variable inside the aes function.

# Create a new dataset with a factor variable
# (factor() is needed because data.frame() keeps strings as character vectors since R 4.0)
data <- data.frame(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0, 5.5, 4.4, 4.9, 4.0, 3.5),
                   Sepal.Width = c(2.8, 3.0, 3.2, 2.5, 2.1, 2.8, 2.7, 2.8, 2.2, 2.1),
                   Species = factor(c("Setosa", "Setosa", "Setosa",
                                      "Versicolor", "Versicolor", "Versicolor",
                                      "Virginica", "Virginica", "Virginica", "Virginica")))

# Create a scatter plot with colour by factor variable
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()

How Does It Work?


In the above example, we mapped the color aesthetic to Species inside the aes function. ggplot2 then assigns a different colour to each level of Species and adds a matching legend automatically.
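A grouping variable that is stored as a number is treated as continuous and gets a colour gradient rather than one colour per group. The sketch below illustrates the difference, assuming R's built-in mtcars dataset (not the data used in this article); the names cars, cyl_f, p_numeric and p_factor are made up for the illustration.

```r
library(ggplot2)

# mtcars ships with R; cyl is stored as a number (4, 6 or 8)
cars <- mtcars

# Mapping the numeric column gives a continuous colour gradient
p_numeric <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

# Converting it to a factor gives one distinct colour per level
cars$cyl_f <- factor(cars$cyl)
p_factor <- ggplot(cars, aes(x = wt, y = mpg, color = cyl_f)) +
  geom_point()

# The levels that will appear in the legend
levels(cars$cyl_f)
```

Printing p_numeric and p_factor side by side shows a gradient legend for the first and a three-entry legend for the second.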

By default, ggplot2 colours a factor with scale_color_hue, which picks evenly spaced hues around the colour wheel, one per level. With three levels this gives a reddish, a green, and a blue tone. However, you can specify your own custom colours using the scale_color_manual function.

# Create a scatter plot with custom colours for each level of the factor
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_manual(name = "Species", values = c("Setosa" = "red", "Versicolor" = "orange", "Virginica" = "blue"))
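It can be handy to start from the default colours and change only one of them. The following is a minimal sketch of that idea, assuming the scales package (installed as a dependency of ggplot2) and a shortened copy of the dataset above; species_cols and p are illustrative names.

```r
library(ggplot2)

# A shortened copy of the article's data, with Species as a factor
data <- data.frame(
  Sepal.Length = c(5.1, 4.7, 5.0, 5.5, 4.4, 4.0),
  Sepal.Width  = c(2.8, 3.2, 2.1, 2.8, 2.7, 2.2),
  Species      = factor(c("Setosa", "Setosa", "Versicolor",
                          "Versicolor", "Virginica", "Virginica"))
)

# The hues ggplot2 would pick by default, one per factor level
species_cols <- scales::hue_pal()(nlevels(data$Species))
names(species_cols) <- levels(data$Species)

# Keep the defaults but override a single level
species_cols["Virginica"] <- "black"

p <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_manual(name = "Species", values = species_cols)
```

Because the vector is named by level, the colours stay attached to the right species even if the order of the levels changes later.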

How to Interpret the Colour Map


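By default, ggplot2 uses a qualitative (categorical) colour scale for factors, so each level gets its own distinct hue. A sequential gradient is used only when a continuous variable is mapped to colour. With many levels, however, the default hues become hard to tell apart, so a clear legend matters.

To improve readability, you can set the legend title and the label shown for each level with the scale_color_discrete function, which keeps the default colours. Here is an example: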

# Create a scatter plot with the default discrete colours and an explicit legend title and labels
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + 
  scale_color_discrete(name = "Species", labels = c("Setosa", "Versicolor", "Virginica"))

In this example, scale_color_discrete keeps the default colour for each level of the Species factor and sets the legend title and labels. The labels are matched to the factor levels in order.
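When the default hues are hard to distinguish, one option is a ready-made qualitative palette. The sketch below uses scale_color_brewer with the ColorBrewer palette "Dark2", chosen here only as an illustration, and assumes R's built-in iris dataset rather than the small data frame above.

```r
library(ggplot2)

# iris ships with R; Species is already a factor with three levels
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  # Swap the default hues for a qualitative ColorBrewer palette
  scale_color_brewer(name = "Species", palette = "Dark2")
```

Typing p at the console draws the plot. Other qualitative palette names, such as "Set1" or "Set2", can be substituted in the same way.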

Conclusion


Colouring plots by factor variables is an essential technique in data visualization. By using different colours for different levels of the factor, you can create visualizations that are easier to interpret and understand.

In this article, we explored how to colour a scatter plot by a factor variable in R using the ggplot2 package. We discussed the basics of plotting data in R, including creating new datasets and specifying aesthetics. We also covered how to customise the colour of each level with scale_color_manual and how to adjust the legend with scale_color_discrete.

By following these steps, you can create high-quality visualizations that help you communicate your findings effectively to others.


Last modified on 2024-01-20