Data Wrangling with R

R is a powerful, open-source programming language designed particularly for statistical analysis and data visualization. It’s built with statistical terminology in mind, making it a popular choice for data scientists and statisticians. This guide will walk you through the basics of data wrangling with R – from installing RStudio to cleaning datasets.

Installing RStudio and Setup

Before diving into data wrangling, you’ll need R itself and RStudio, an integrated development environment (IDE) for R. RStudio does not bundle R, so install R first, then download RStudio and follow the on-screen instructions to set it up.

Once RStudio is installed, open it and create a new R Script document by going to “File -> New File -> R Script”.

In RStudio, you’re primarily working with four panels:

  • The Code Editor (top left) is where you write and save your R code.
  • The Environment tab (top right) shows the variables/objects you have defined in your current environment.
  • The Console tab (bottom left) is the R console, where you can type R commands directly and see the output of any code you run.
  • The Files tab (bottom right) shows the files in your working directory. In the same panel, the Plots tab shows any plots you’ve created in your R session, and the Packages tab lists the R packages available to you.

Basics of R

Simple Calculations

R can perform simple math calculations, both directly in the console and as part of your script. The basic operators are: + (addition), - (subtraction), * (multiplication), / (division), ^ (exponentiation), and %% (modulo). R follows the usual order of operations, so in (4 - 6 / 3) * 7 ^ 2 the division and the exponent are evaluated first, giving (4 - 2) * 49, which is 98.

(4 - 6 / 3) * 7 ^ 2
# Output: [1] 98

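R also includes built-in basic math functions, including abs(x), sqrt(x), log10(x), cos(x), sin(x), and tan(x). A full list can be found in R’s documentation; for example, ?Trig and ?log open the help pages for the trigonometric and logarithm functions.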
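
As a quick illustration of the operators and functions above, the lines below can be pasted into the console. The numbers are arbitrary examples, and the %/% (integer division) operator is an extra that the list above does not mention.

```r
# Modulo and integer division
remainder <- 17 %% 5    # remainder after dividing 17 by 5
quotient  <- 17 %/% 5   # whole-number part of 17 / 5 (extra operator, not listed above)

# A few of the built-in math functions
abs_val  <- abs(-3.5)    # absolute value
root     <- sqrt(16)     # square root
log_val  <- log10(1000)  # base-10 logarithm
cos_zero <- cos(0)       # cosine, with the angle given in radians

print(c(remainder, quotient, abs_val, root, log_val, cos_zero))
```

Note that the trigonometric functions expect radians, not degrees.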

Variables

A variable stores a value in memory. In R, we use the <- operator to assign a value to a variable.

year <- 2020

Variable names must begin with a letter or a period (and a leading period cannot be followed by a digit). They can include any combination of letters, numbers, periods, and underscores; no other special characters or symbols may be used. Remember, R is case-sensitive, so name and access your variables accordingly!
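
The naming rules and case sensitivity can be seen in a short sketch. All variable names here are made up for illustration; make.names() is a base R helper that shows how R would repair a name that breaks the rules.

```r
my_total  <- 10   # letters and an underscore: valid
.my.total <- 20   # leading period followed by a letter: valid

Year <- 2021
year <- 2020      # a different variable from Year, because R is case-sensitive

same_value <- (Year == year)
print(same_value)   # FALSE: the two names hold different values

# "2nd value" starts with a digit and contains a space, so it is not a valid name
fixed <- make.names("2nd value")
print(fixed)        # "X2nd.value"
```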

Basic Data Types

Variables can have different types. The three basic data types you’ll use most often in R are numeric, character, and logical.

  • Numeric types can store integers or floats (numbers involving decimals).

year <- 2020  # Numeric type

  • Character types can be created with either single or double quotes.

x <- "HODP is life"  # Character type

  • Logical types can store either TRUE or FALSE.

y <- TRUE  # Logical type
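
To check which type a variable has, base R provides class(). The sketch below reuses the three variables from above; the 2020L line is an extra illustration of R’s separate integer type, which the list above folds into numeric.

```r
year <- 2020
x <- "HODP is life"
y <- TRUE

class(year)   # "numeric"
class(x)      # "character"
class(y)      # "logical"

# Whole numbers are stored as doubles unless you add the L suffix
class(2020L)        # "integer"
is.numeric(2020L)   # TRUE: integers still count as numeric
```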

Data Structures in R

From these basic data types, we can start to build more complex structures in R.

Vectors

A vector contains elements all of the same data type. Vectors are created by wrapping elements inside c().

my.vector <- c("HODP", "is", "life")  # Vector of character types
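
A few common vector operations, sketched with made-up values (the ages vector is purely illustrative). It also shows what happens when types are mixed inside c(): R converts every element to one common type.

```r
my.vector <- c("HODP", "is", "life")
length(my.vector)   # 3
my.vector[1]        # "HODP" (indexing starts at 1, not 0)

# Mixing types coerces every element to a common type
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"

# Arithmetic on numeric vectors works element by element
ages <- c(24, 27, 22)
ages_next_year <- ages + 1   # 25 28 23
mean_age <- mean(ages)
```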

Lists

A list contains elements of multiple data types. Lists are created by wrapping elements inside list().

my.list <- list(FALSE, 3, "HODP")  # List with logical, numeric, and character elements
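
Getting elements back out of a list is a common stumbling block, so here is a minimal sketch. Double brackets return the element itself, while single brackets return a smaller list. The names in the second list (flag, count, org) are hypothetical.

```r
my.list <- list(FALSE, 3, "HODP")

second <- my.list[[2]]   # the element itself: the number 3
sub    <- my.list[2]     # a list of length 1 that contains 3

# Elements can also be named and then accessed with $
named <- list(flag = FALSE, count = 3, org = "HODP")
named$org                # "HODP"
```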

Factors

A factor stores categorical data: each element takes one of a fixed set of values, called levels. Factors can be created by wrapping a vector inside factor().

my.factor <- factor(c("HODP", "data", "data", "HODP", "life"))  # Factor
levels(my.factor)
# Output in a typical English locale: [1] "data" "HODP" "life"
# (levels are sorted, and how upper- and lowercase letters sort depends on your locale)
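
Two things you might do with a factor, as a sketch: count how often each category appears, and set the levels explicitly so their order does not depend on sorting. The level order chosen below is just an example.

```r
my.factor <- factor(c("HODP", "data", "data", "HODP", "life"))

nlevels(my.factor)           # 3 distinct categories
counts <- table(my.factor)   # how often each category appears
counts[["data"]]             # 2

# Passing levels explicitly fixes their order
ordered.factor <- factor(
  c("HODP", "data", "data", "HODP", "life"),
  levels = c("HODP", "data", "life")
)
levels(ordered.factor)       # "HODP" "data" "life"
as.integer(ordered.factor)   # 1 2 2 1 3: the position of each element's level
```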

Matrices

A matrix is a vector represented in a two-dimensional rectangular format. Matrices can be created by wrapping a vector inside matrix() and specifying the dimensions of the matrix.

my.matrix <- matrix(c("H", "O", "D", "P"), nrow = 2, ncol = 2)  # 2x2 matrix
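
One detail worth seeing in action: matrix() fills its cells column by column unless told otherwise. The sketch below compares the default with byrow = TRUE, using the same four letters.

```r
my.matrix <- matrix(c("H", "O", "D", "P"), nrow = 2, ncol = 2)
dim(my.matrix)     # 2 2
my.matrix[1, 2]    # "D": the first column was filled with "H", "O" first

# byrow = TRUE fills row by row instead
by.row <- matrix(c("H", "O", "D", "P"), nrow = 2, byrow = TRUE)
by.row[1, 2]       # "O"
```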

Arrays

An array extends the idea of a matrix into multiple dimensions. Arrays can be created by wrapping a vector inside array() and specifying the dimensions of the array.

my.array <- array(c("H", "O", "D", "P"), dim = c(2, 2, 3))  # 2x2x3 array; the 4 values are recycled to fill all 12 cells
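
Array elements are addressed with one index per dimension. The sketch below indexes the array from above and then a second, hypothetical array of the numbers 1 to 12, where the fill order is easier to follow.

```r
my.array <- array(c("H", "O", "D", "P"), dim = c(2, 2, 3))
dim(my.array)       # 2 2 3
length(my.array)    # 12: the four letters are repeated three times
my.array[2, 1, 3]   # "O": row 2, column 1 of the third 2x2 slice

# With distinct values the fill order is easier to see
numbers <- array(1:12, dim = c(2, 2, 3))
numbers[1, 2, 3]    # 11
```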

Data Frames

A data frame is like a table with rows and columns. Different columns can hold different data types (numeric, character, or logical), but all elements within a single column must be of the same type. You can create a data frame using the data.frame() function.

df <- data.frame(
  Name = c("Alice", "Bob", "Carol"),
  Age = c(24, 27, 22),
  Salary = c(50000, 54000, 58000)
)
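
Once a data frame exists, most day-to-day work is selecting columns, filtering rows, and adding derived columns. A minimal sketch using the same df; the Monthly column and the age cutoff of 23 are made up for illustration.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Carol"),
  Age = c(24, 27, 22),
  Salary = c(50000, 54000, 58000)
)

df$Age            # one column, returned as a vector
nrow(df)          # 3 rows
ncol(df)          # 3 columns

# Keep only the rows where Age is above 23 (Alice and Bob)
older <- df[df$Age > 23, ]

# Add a new column computed from an existing one
df$Monthly <- df$Salary / 12
```

The comma inside df[df$Age > 23, ] matters: the part before it selects rows, and the empty part after it means "all columns".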

Data Wrangling with R

Now that we’ve covered the basics, let’s delve into the actual process of data wrangling with R. Data wrangling involves cleaning and transforming raw data into a more suitable, easy-to-analyze format. It’s an essential step before performing any data analysis.

Loading Data

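The first step in data wrangling is to load the data into R. The read.csv() function is commonly used to read CSV files. In R versions before 4.0.0, read.csv() converted strings to factors by default, so set the stringsAsFactors argument to FALSE to prevent that. Since R 4.0.0 the default is already FALSE, and passing it explicitly simply keeps your script working the same way on older installations.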

data <- read.csv("data.csv", stringsAsFactors = FALSE)
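
If you don’t have a data.csv file at hand, the following self-contained sketch writes a tiny CSV to a temporary file and reads it back. The file contents and the csv_path variable are invented for the example.

```r
# Write a small CSV to a temporary location (hypothetical example data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,Name,Age",
    "1,Alice,24",
    "2,Bob,27",
    "3,Carol,22"),
  csv_path
)

# Read it back; the first line is used as the column names
data <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(data)          # 3 rows, 3 columns
class(data$Name)   # "character", not a factor
class(data$Age)    # "integer"

unlink(csv_path)   # clean up the temporary file
```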

Inspecting Data

Once the data is loaded, use the head(), tail(), str(), and summary() functions to inspect the structure and summary statistics of your data.

head(data)  # Displays the first 6 rows
tail(data)  # Displays the last 6 rows
str(data)   # Displays the structure
summary(data)  # Displays summary statistics
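
To try these functions without loading a file, you can use mtcars, a small dataset that ships with R. Using it as a stand-in for your own data is an assumption of this sketch.

```r
data <- mtcars   # built-in dataset: 32 cars, 11 variables

first_rows <- head(data, n = 3)   # n controls how many rows are shown
last_rows  <- tail(data, n = 3)

dim(data)            # 32 11
names(data)          # column names
str(data)            # type and first few values of every column
summary(data$mpg)    # summary statistics for a single column
```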

Cleaning Data

Data cleaning involves handling missing values, removing duplicates, and correcting inconsistent data types. The is.na(), na.omit(), duplicated(), and as.character() functions can be used for these tasks.

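# Count missing values in each column
colSums(is.na(data))

# Remove rows with missing values
data <- na.omit(data)

# Flag duplicated rows (TRUE marks a repeat of an earlier row)
duplicated(data)

# Drop the duplicated rows
data <- data[!duplicated(data), ]

# Convert a factor to character ("column" stands in for a real column name)
data$column <- as.character(data$column)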
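
Here is the same cleaning sequence run end to end on a small, invented data frame, so you can see what each step removes. The raw data frame and its values are hypothetical.

```r
raw <- data.frame(
  ID   = c(1, 2, 2, 3, 4),
  Name = c("Alice", "Bob", "Bob", "Carol", NA),
  Age  = c(24, 27, 27, NA, 30),
  stringsAsFactors = FALSE
)

# One missing Name and one missing Age
na_counts <- colSums(is.na(raw))

# Drops the two rows that contain an NA, leaving three rows
complete <- na.omit(raw)

# The second "Bob" row repeats the first, so it is dropped
deduped <- complete[!duplicated(complete), ]

# Treat IDs as labels rather than numbers
deduped$ID <- as.character(deduped$ID)

print(deduped)
```

Keep in mind that na.omit() drops a row if any column is missing, which can discard a lot of data; checking the counts first helps you decide whether that is acceptable.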

Transforming Data

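Transforming data involves reshaping data, combining datasets, and creating new variables. The merge(), rbind(), cbind(), melt(), and dcast() functions can be used for these tasks. merge(), rbind(), and cbind() are part of base R; melt() and dcast() come from the reshape2 package, which must be installed and loaded first.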

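library(reshape2)  # provides melt() and dcast()

# Merge two data frames on a shared "ID" column
merged_data <- merge(data1, data2, by = "ID")

# Add new rows (both data frames need the same columns)
combined_data <- rbind(data1, data2)

# Add new columns (both data frames need the same number of rows)
combined_data <- cbind(data1, data2)

# Reshape data from wide to long
melted_data <- melt(data, id.vars = "ID")

# Reshape data from long to wide: one row per ID, one column per variable
reshaped_data <- dcast(melted_data, ID ~ variable, value.var = "value")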
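
A worked sketch of these transformations on small, invented data frames (people, pay, more_people, ages, and scores are all hypothetical). The reshaping part only runs if the reshape2 package is installed, so the rest works with base R alone.

```r
people <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Carol"))
pay    <- data.frame(ID = c(1, 2, 4), Salary = c(50000, 54000, 61000))

# Default merge keeps only IDs present in both data frames (1 and 2)
inner <- merge(people, pay, by = "ID")

# all.x = TRUE keeps every row of people; Carol gets an NA salary
left <- merge(people, pay, by = "ID", all.x = TRUE)

# rbind() stacks rows; the column names must match
more_people <- data.frame(ID = 4, Name = "Dave")
all_people  <- rbind(people, more_people)

# cbind() places columns side by side; the row counts must match
ages     <- data.frame(Age = c(24, 27, 22))
with_age <- cbind(people, ages)

# Wide to long and back again, only if reshape2 is available
if (requireNamespace("reshape2", quietly = TRUE)) {
  scores <- data.frame(ID = c(1, 2), math = c(90, 80), art = c(70, 85))
  long <- reshape2::melt(scores, id.vars = "ID")   # columns: ID, variable, value
  wide <- reshape2::dcast(long, ID ~ variable, value.var = "value")
}
```

Note that cbind() matches rows purely by position, so use merge() whenever the two data frames might be in a different order.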

These are the basic steps involved in data wrangling with R. Keep in mind that the exact steps can vary depending on the specific dataset and the analysis you plan to perform. Happy wrangling!