6.1 mutate() | R for Graduate Students (2024)

6.1 mutate()

What it does: Adds new columns or modifies current variables in the dataset.

Let’s say I want to create three new variables in the diamonds dataset (described in Section5) :

  1. One variable called JustOne where all of the values inside the column are 1.

  2. One variable called Values where all of the values inside are something:

  3. One variable called Simple where all the values equal TRUE

diamonds %>%  mutate(JustOne = 1, Values = "something", Simple = TRUE)
## # A tibble: 53,940 x 13## carat cut color clarity depth table price x y z JustOne## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1## 2 0.21 Prem~ E SI1 59.8 61 326 3.89 3.84 2.31 1## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1## 4 0.290 Prem~ I VS2 62.4 58 334 4.2 4.23 2.63 1## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1## 6 0.24 Very~ J VVS2 62.8 57 336 3.94 3.96 2.48 1## 7 0.24 Very~ I VVS1 62.3 57 336 3.95 3.98 2.47 1## 8 0.26 Very~ H SI1 61.9 55 337 4.07 4.11 2.53 1## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1## 10 0.23 Very~ H VS1 59.4 61 338 4 4.05 2.39 1## # ... with 53,930 more rows, and 2 more variables: Values <chr>,## # Simple <lgl>

Need a refresher on how a pipe (%>%) works? Go back to section 3.3.9!

mutate() can be used to create variables based on existing variables from the dataset.

Let’s use the existing variable price from the diamonds dataset to create a new column/variable named price200. The values for price200 calculated from price minus $200 (price is a default variable in diamonds):

diamonds %>%  mutate(price200 = price - 200)

You can also create multiple columns at once, separating each new variable with a comma.

diamonds %>%  mutate(price200 = price - 200, # $200 OFF from the original price price20perc = price * 0.20, # 20% of the original price price20percoff = price * 0.80, # 20% OFF from the original price  pricepercarat = price / carat, # ratio of price to carat pizza = depth ^ 2) # Square the original depth

Notice that you can label the variables/new columns however you want (refer to 3.4.2 for naming conventions). In the example above, pizza is not a descriptive name for the new variable defined as the depth values squared. Make it easier for Future You to navigate your old code: choose simple, descriptive names.

Currently, the changes that we’ve made to the diamonds dataset have not been saved. We can see that diamonds does not have the newly created variables (price200, price20perc, pizza) when we execute str(diamonds). In order to save your changes, you must define the new dataset as an object using <-! In the code below, the modified diamonds dataset will be saved as a new object/dataset called diamonds.new. As a beginner, it is good practice to save your changes to a dataset under a new object name (e.g., diamonds.new) each time you make changes rather than saving over the original dataset name (e.g., diamonds). See 3.3.7 for more information on saving objects.

diamonds.new <- # saving changes to diamonds as a new object diamonds %>% # original dataset mutate(price200 = price - 200, # $200 OFF from the original price price20perc = price * .20, # 20% of the original price price20percoff = price * 0.80, # 20% OFF from the original price pricepercarat = price / carat, # ratio of price to carat pizza = depth ^ 2) # Square the original depth

6.1.1 Nesting Functions

We can also use other functions inside mutate() to create our new variable(s). For example, we might use the mean() function to calculate the average price value for all diamonds in the dataset. This is called nesting, where one function (mean()) “nests” inside another function (mutate()).

diamonds %>%  mutate(m = mean(price))

Nearly all of the functions used in this book are available by default or are loaded with library(tidyverse). However, you may come across many other kinds of functions that come from different packages. Make sure that the relevant packages are loaded in your library!

The example below creates four new columns to the diamonds dataset that calculate the mean, standard deviation, standard error, and median price value for all diamonds. The values in these columns will be the same for every row because R takes all of the values in price to calculate the mean/standard deviation/median.

diamonds %>%  mutate(m = mean(price), # calculates the mean price sd = sd(price), # calculates standard deviation med = median(price)) # calculates the median price
6.1.1.0.1 Exercises
  1. Using the built-in dataset, midwest (make sure tidyverse is loaded!). Each exercise comes with the desired output! That means you get the answer to each question but not how to get to the answer. To view the newly created variable, scroll all the way over to the right side of the output.
  1. Create a column named avg.pop.den which calculates average population density for the entire dataset (hint: use mean() and popdensity; all values in this column should be the same – Why?)
## # A tibble: 437 x 29## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 21 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # avg.pop.den <dbl>
  1. Create a column named avg.area which calculates the average area for the entire dataset
## # A tibble: 437 x 29## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 21 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # avg.area <dbl>
  1. Create a column called totadult which calculates the total number of adults in this dataset (hint: use popadults, sum(); requires nesting)
## # A tibble: 437 x 29## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 21 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # totadult <int>
  1. Create a new column called tot.minus.white calculating the difference between poptotal and popwhite. Using other variables in the dataset, how else could you create the tot.minus.white column without using poptotal and popwhite. (hint: popblack, popamerindian, popasian, popother)?
## # A tibble: 437 x 30## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 22 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # tot.minus.white <int>, tot.minus.white2 <int>
  1. Create a new column called child.to.adult that calculates the ratio of percchildbelowpovert to percadultpoverty (i.e., for every adult that is in poverty, what proportion of children are in poverty? Compare to the question from 1a: Why are the values in child.to.adult all different?)
## # A tibble: 437 x 29## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 21 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # child.to.adult <dbl>
  1. Create a new column named ratio.adult which calculates the ratio of adults in this dataset (hint: popadults, poptotal).
## # A tibble: 437 x 29## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 21 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # ratio.adult <dbl>
  1. Create a new column named perc.adult that calculates the percentage of the total population that are adults per county (hint: build from 1f; * 100).
## # A tibble: 437 x 29## PID county state area poptotal popdensity popwhite popblack## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496## 3 563 BOND IL 0.022 14991 681. 14477 429## 4 564 BOONE IL 0.017 30806 1812. 29344 127## 5 565 BROWN IL 0.018 5836 324. 5264 547## 6 566 BUREAU IL 0.05 35688 714. 35157 50## 7 567 CALHO~ IL 0.017 5322 313. 5298 1## 8 568 CARRO~ IL 0.027 16805 622. 16519 111## 9 569 CASS IL 0.024 13437 560. 13384 16## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559## # ... with 427 more rows, and 21 more variables: popamerindian <int>,## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,## # perc.adult <dbl>
  1. Using the built-in dataset, presidential, create a column named duration that calculates the total number of days each president held in office.
## # A tibble: 11 x 5## name start end party duration ## <chr> <date> <date> <chr> <drtn> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days## 3 Johnson 1963-11-22 1969-01-20 Democratic 1886 days## 4 Nixon 1969-01-20 1974-08-09 Republican 2027 days## 5 Ford 1974-08-09 1977-01-20 Republican 895 days## 6 Carter 1977-01-20 1981-01-20 Democratic 1461 days## 7 Reagan 1981-01-20 1989-01-20 Republican 2922 days## 8 Bush 1989-01-20 1993-01-20 Republican 1461 days## 9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days## 10 Bush 2001-01-20 2009-01-20 Republican 2922 days## 11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
  1. Using the built-in dataset, economics, create a column named perc.unemploy that calculates the percentage of the population that is unemployed.
## # A tibble: 574 x 7## date pce pop psavert uempmed unemploy perc.unemploy## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 1967-07-01 507. 198712 12.6 4.5 2944 1.48## 2 1967-08-01 510. 198911 12.6 4.7 2945 1.48## 3 1967-09-01 516. 199113 11.9 4.6 2958 1.49## 4 1967-10-01 512. 199311 12.9 4.9 3143 1.58## 5 1967-11-01 517. 199498 12.8 4.7 3066 1.54## 6 1967-12-01 525. 199657 11.8 4.8 3018 1.51## 7 1968-01-01 531. 199808 11.7 5.1 2878 1.44## 8 1968-02-01 534. 199920 12.3 4.5 3001 1.50## 9 1968-03-01 544. 200056 11.7 4.1 2877 1.44## 10 1968-04-01 544 200208 12.3 4.6 2709 1.35## # ... with 564 more rows
  1. Using the built-in dataset, txhousing,
  1. Create a column named successrate that calculates the percent of houses that sell of the total listings available (hint: sales, listings).
## # A tibble: 8,602 x 10## city year month sales volume median listings inventory date## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 Abil~ 2000 1 72 5.38e6 71400 701 6.3 2000 ## 2 Abil~ 2000 2 98 6.50e6 58700 746 6.6 2000.## 3 Abil~ 2000 3 130 9.28e6 58100 784 6.8 2000.## 4 Abil~ 2000 4 98 9.73e6 68600 785 6.9 2000.## 5 Abil~ 2000 5 141 1.06e7 67300 794 6.8 2000.## 6 Abil~ 2000 6 156 1.39e7 66900 780 6.6 2000.## 7 Abil~ 2000 7 152 1.26e7 73500 742 6.2 2000.## 8 Abil~ 2000 8 131 1.07e7 75000 765 6.4 2001.## 9 Abil~ 2000 9 104 7.62e6 64500 771 6.5 2001.## 10 Abil~ 2000 10 101 7.04e6 59300 764 6.6 2001.## # ... with 8,592 more rows, and 1 more variable: successrate <dbl>
  1. Create a column called failrate that calculates the percent of houses that do not sell of the total listings available (hint: build from 4a).
## # A tibble: 8,602 x 10## city year month sales volume median listings inventory date failrate## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 Abile~ 2000 1 72 5.38e6 71400 701 6.3 2000 89.7## 2 Abile~ 2000 2 98 6.50e6 58700 746 6.6 2000. 86.9## 3 Abile~ 2000 3 130 9.28e6 58100 784 6.8 2000. 83.4## 4 Abile~ 2000 4 98 9.73e6 68600 785 6.9 2000. 87.5## 5 Abile~ 2000 5 141 1.06e7 67300 794 6.8 2000. 82.2## 6 Abile~ 2000 6 156 1.39e7 66900 780 6.6 2000. 80 ## 7 Abile~ 2000 7 152 1.26e7 73500 742 6.2 2000. 79.5## 8 Abile~ 2000 8 131 1.07e7 75000 765 6.4 2001. 82.9## 9 Abile~ 2000 9 104 7.62e6 64500 771 6.5 2001. 86.5## 10 Abile~ 2000 10 101 7.04e6 59300 764 6.6 2001. 86.8## # ... with 8,592 more rows

Let’s take a look at some other useful functions.

6.1.1.1 recode()

What it does: modifies the values within a variable. Here is the basic structure for using recode:

data %>% mutate(Variable = recode(Variable, "old value" = "new value"))

Example

diamonds %>%  mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))

In the code below, we created a new variable named cut.new that is defined as the original cut variable except that values originally listed as "Ideal" now read "IDEAL".

We can also recode multiple values at a time.

diamonds %>%  mutate(cut.new = recode(cut, "Ideal" = "IDEAL", "Fair" = "Okay", "Premium" = "pizza"))

Of course, this new edit is intended to be a bit silly. When recoding data, it is important to be descriptive so that Future You can navigate your code. Again, remember that the changes made to diamonds in the above code have not been saved to the dataset; diamonds will not contain the cut.new variable.

Most commonly, I use recode() to fix inconsistent labeling.

Example

# creating a dataset with 2 variables (Sex , TestScore)Sex <- factor(c("male", "m", "M", "Female", "Female", "Female"))TestScore <- c(10, 20, 10, 25, 12, 5)dataset <- tibble(Sex, TestScore)str(dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 2 variables:## $ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1## $ TestScore: num 10 20 10 25 12 5

We see that Sex has 4 categories (Female, m, M, and male). We know that m, M, and male are not three different categorical values. We meant to have 2 categories representing the two most common biolgical sexes - male and female. Let’s recode all of the male values in Sex to match the female equivalent value, Female.

# creating a new variable (Sex.new) with recoded values # from the original variable (Sex)dataset %>%  mutate(Sex.new = recode(Sex, "m" = "Male", "M" = "Male", "male" = "Male"))
## # A tibble: 6 x 3## Sex TestScore Sex.new## <fct> <dbl> <fct> ## 1 male 10 Male ## 2 m 20 Male ## 3 M 10 Male ## 4 Female 25 Female ## 5 Female 12 Female ## 6 Female 5 Female

Notice that there are now three variables: Sex, TestScore, and Sex.new. The Sex.new variable contains two categorical values: Male and Female. You must save this new dataset’s edits as an object (using <-) to save these changes! Otherwise, the next time you use dataset, the Sex.new variable will not be available.

dataset.new <- # saving the changes to a new object dataset %>%  mutate(Sex.new = recode(Sex, "m" = "Male", "M" = "Male", "male" = "Male"))str(dataset.new)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 3 variables:## $ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1## $ TestScore: num 10 20 10 25 12 5## $ Sex.new : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1

The original object, dataset remains the same as before. Only dataset.new includes the additional variable, Sex.new.

str(dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 2 variables:## $ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1## $ TestScore: num 10 20 10 25 12 5
6.1 mutate() | R for Graduate Students (2024)

References

Top Articles
Latest Posts
Article information

Author: Chrissy Homenick

Last Updated:

Views: 6103

Rating: 4.3 / 5 (54 voted)

Reviews: 85% of readers found this page helpful

Author information

Name: Chrissy Homenick

Birthday: 2001-10-22

Address: 611 Kuhn Oval, Feltonbury, NY 02783-3818

Phone: +96619177651654

Job: Mining Representative

Hobby: amateur radio, Sculling, Knife making, Gardening, Watching movies, Gunsmithing, Video gaming

Introduction: My name is Chrissy Homenick, I am a tender, funny, determined, tender, glorious, fancy, enthusiastic person who loves writing and wants to share my knowledge and understanding with you.