Upcoming features in the tidyverse

We’re currently working on significant updates to five tidyverse packages, mostly inspired by the second edition of R4DS: https://r4ds.hadley.nz. We haven’t scheduled release dates for most of these packages yet, but they should all be out by early 2023.

All these features are still subject to change, so we’d love your feedback 😃

pak::pak(c("tidyverse/tidyverse", "tidyverse/dplyr", "tidyverse/tidyr#1304", "tidyverse/stringr", "tidyverse/purrr", "tidyverse/ggplot2"))

tidyverse 1.4.0

library(tidyverse)
#> ── Attaching core tidyverse packages ─────────────────── tidyverse 1.3.2.9000 ──
#> ✔ dplyr     1.0.99.9000       ✔ readr     2.1.3        
#> ✔ forcats   0.5.2             ✔ stringr   1.4.1.9000   
#> ✔ ggplot2   3.3.6.9000        ✔ tibble    3.1.8        
#> ✔ lubridate 1.8.0             ✔ tidyr     1.2.1.9001   
#> ✔ purrr     0.9000.0.9000     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Two big changes: lubridate is joining the core tidyverse and we now advertise the conflicted package.

Haven’t heard of conflicted before? Learn more at https://conflicted.r-lib.org.

library(conflicted)
library(MASS)

select
#> Error:
#> ! [conflicted] `select` found in 2 packages.
#> Either pick the one you want with `::` 
#> * MASS::select
#> * dplyr::select
#> Or declare a preference with `conflict_prefer()`
#> * conflict_prefer("select", "MASS")
#> * conflict_prefer("select", "dplyr")
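
Following the hint in that error message, here is a minimal sketch of resolving the clash by declaring a preference once near the top of a script. Picking dplyr as the winner and using `mtcars` as the data are assumptions made purely for illustration.

```r
library(conflicted)
library(dplyr)
library(MASS)

# Declare once that `select` should always mean dplyr::select()
conflict_prefer("select", "dplyr")

# `select` now resolves without an error
mpg_only <- select(mtcars, mpg)
names(mpg_only)
```

The same pattern applies to any other clash, such as `filter` between dplyr and stats.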

ggplot2 3.4.0

To be released on Oct 31

ggplot(airquality) + 
  geom_line(aes(Day, Temp, size = Month, group = Month)) + 
  scale_linewidth(range = c(0.5, 3))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> ℹ The deprecated feature was likely used in the ggplot2 package.
#>   Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
#> Warning in panel_params$guide: partial match of 'guide' to 'guides'
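
As the warning suggests, the deprecation should go away once the aesthetic is renamed. A minimal sketch of the same plot, assuming nothing else needs to change:

```r
library(ggplot2)

# Same plot, but mapping Month to `linewidth` rather than `size`
p <- ggplot(airquality) +
  geom_line(aes(Day, Temp, linewidth = Month, group = Month)) +
  scale_linewidth(range = c(0.5, 3))
p
```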

huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))
p <- ggplot(huron) +
  geom_line(aes(year, level)) + 
  geom_ribbon(aes(year, xmin = level - 5, xmax = level + 5))

p
#> Error in `geom_ribbon()`:
#> ! Problem while setting up geom.
#> ℹ Error occurred in the 2nd layer.
#> Caused by error in `compute_geom_1()`:
#> ! `geom_ribbon()` requires the following missing aesthetics: ymin and
#>   ymax or y
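
The error points at the fix: a band around a line is defined by `ymin` and `ymax`. A sketch of the corrected plot, where drawing the ribbon first and setting `alpha = 0.3` are assumptions made so the line stays visible:

```r
library(ggplot2)

huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))

# A vertical band around the line needs ymin/ymax, not xmin/xmax
p <- ggplot(huron, aes(year)) +
  geom_ribbon(aes(ymin = level - 5, ymax = level + 5), alpha = 0.3) +
  geom_line(aes(y = level))
p
```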

dplyr 1.1.0

arrange() and group_by() now sort character vectors in the C locale by default:

  • This is good because it’s waaaaaay faster
  • This is bad because it sorts a little differently from your system locale (e.g. all uppercase letters come before lowercase ones)

df <- tibble(x = stringi::stri_rand_strings(n = 5e5, length = 15))

bench::system_time(df |> arrange(x))
#> process    real 
#>   101ms  99.4ms

withr::with_options(list(dplyr.legacy_locale = TRUE), {
  bench::system_time(df |> arrange(x))  
})
#> process    real 
#>   1.93s    1.9s

# Can still request a specific locale if needed (the argument is `.locale`)
bench::system_time(df |> arrange(x, .locale = "fr"))
#> process    real 
#>   215ms   212ms

# Also affects group_by
bench::system_time(df |> group_by(x))
#> process    real 
#>   190ms   187ms
withr::with_options(list(dplyr.legacy_locale = TRUE), {
  bench::system_time(df |> group_by(x))  
})
#> process    real 
#>   1.07s   1.05s
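
The timings above cover the speed-up. The difference in ordering can be sketched on a tiny example. This assumes the `.locale` argument as it appears in released dplyr 1.1.0, which in turn needs the stringi package installed; the four-letter vector is made up for illustration.

```r
library(dplyr)

df <- tibble(x = c("a", "B", "b", "A"))

# Default: C locale, so uppercase letters sort before lowercase ones
c_order <- df |> arrange(x) |> pull(x)
c_order
#> [1] "A" "B" "a" "b"

# Explicit English locale: letters are grouped regardless of case
en_order <- df |> arrange(x, .locale = "en") |> pull(x)
en_order
#> [1] "a" "A" "b" "B"
```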

Lots of improvements to joins:

Quite a few functions have been rewritten to use vctrs behind the scenes: if_else(), first(), last(), nth(), between(), coalesce(). This brings performance improvements, consistency, and better error messages.

The most important change is that you no longer need to know about the different types of NA to use if_else():

x <- c(1:10, NA)
if_else(x %% 2 == 0, x, NA) # used to be an error
#>  [1] NA  2 NA  4 NA  6 NA  8 NA 10 NA

if_else(x %% 2 == 0, x, "x")
#> Error in `if_else()`:
#> ! Can't combine `true` <integer> and `false` <character>.
# cf base R
ifelse(x %% 2 == 0, x, "x")
#>  [1] "x"  "2"  "x"  "4"  "x"  "6"  "x"  "8"  "x"  "10" NA

# also
if_else(x %% 2 == 0, x, 0, missing = 1000)
#>  [1]    0    2    0    4    0    6    0    8    0   10 1000

New case_match()

x <- c("a", "b", "a", "d", "b", NA, "c", "e")

case_match(x,
  c("a", "b") ~ 1,
  "c" ~ 2,
  "d" ~ 3,
  .default = 0
)
#> [1] 1 1 1 3 1 0 2 0

# Or use it to just replace certain values
case_match(x,
  c("a", "b") ~ "a",
  .default = x
)
#> [1] "a" "a" "a" "d" "a" NA  "c" "e"
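
For comparison, the first recoding above could also be sketched with case_when(), which needs `x` repeated in every condition. This assumes the `.default` argument of case_when() that arrives in the same dplyr release.

```r
library(dplyr)

x <- c("a", "b", "a", "d", "b", NA, "c", "e")

via_match <- case_match(x, c("a", "b") ~ 1, "c" ~ 2, "d" ~ 3, .default = 0)

# The same recoding with case_when(): `x` has to be repeated in each condition
via_when <- case_when(
  x %in% c("a", "b") ~ 1,
  x == "c" ~ 2,
  x == "d" ~ 3,
  .default = 0
)

identical(via_match, via_when)
#> [1] TRUE
```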

purrr 1.0.0

This release is about finding the core of purrr: many older and less-used functions have been deprecated. We don’t expect this to affect much code in the wild, but it makes purrr much simpler: https://purrr.tidyverse.org/dev/news/index.html#lifecycle-updates-development-version

Three major new features: progress bars, better errors, and generalised map_vec():

out <- map(1:500, \(i) Sys.sleep(0.10), .progress = TRUE)
#>  ■■■                                6% |  ETA: 48s
#>  ■■■■■                             12% |  ETA: 46s
#>  ■■■■■■                            18% |  ETA: 43s
#>  ■■■■■■■■                          23% |  ETA: 40s
#>  ■■■■■■■■■■                        29% |  ETA: 37s
#>  ■■■■■■■■■■■                       35% |  ETA: 34s
#>  ■■■■■■■■■■■■■                     41% |  ETA: 31s
#>  ■■■■■■■■■■■■■■■                   47% |  ETA: 28s
#>  ■■■■■■■■■■■■■■■■■                 52% |  ETA: 25s
#>  ■■■■■■■■■■■■■■■■■■                58% |  ETA: 22s
#>  ■■■■■■■■■■■■■■■■■■■■              64% |  ETA: 19s
#>  ■■■■■■■■■■■■■■■■■■■■■■            70% |  ETA: 16s
#>  ■■■■■■■■■■■■■■■■■■■■■■■■          75% |  ETA: 13s
#>  ■■■■■■■■■■■■■■■■■■■■■■■■■         81% |  ETA: 10s
#>  ■■■■■■■■■■■■■■■■■■■■■■■■■■■       87% |  ETA:  7s
#>  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■     93% |  ETA:  4s
#>  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■   99% |  ETA:  1s
x <- sample(1:500)
map(x, \(x) if (x == 300) stop("An error!") else x * 2)
#> Error in `map()`:
#> ℹ In index: 363.
#> Caused by error in `.f()`:
#> ! An error!

# map_vec() generalises map_lgl(), map_int(), map_dbl(), and map_chr() to any vector type
x <- 1:10
map_vec(x, \(x) Sys.Date() + x)
#>  [1] "2022-10-29" "2022-10-30" "2022-10-31" "2022-11-01" "2022-11-02"
#>  [6] "2022-11-03" "2022-11-04" "2022-11-05" "2022-11-06" "2022-11-07"
map_vec(x, \(x) factor(letters[x]))
#>  [1] a b c d e f g h i j
#> Levels: a b c d e f g h i j
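
A small sketch of what "generalised" means in practice: map_vec() keeps the class of the results, and complains when the results have no common type. The fixed start date is an assumption chosen so the example is reproducible.

```r
library(purrr)

start <- as.Date("2022-10-28")

# Each call returns a length-1 Date, so the result is a Date vector
dates <- map_vec(1:3, \(i) start + i)
class(dates)
#> [1] "Date"

# Results without a common type are an error rather than a silent coercion
mixed <- tryCatch(
  map_vec(1:2, \(i) if (i == 1) "a" else 1),
  error = function(e) "error"
)
mixed
#> [1] "error"
```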

tidyr 1.3.0

New family of functions for splitting strings up into multiple variables:

                 Wider (new columns)         Longer (new rows)
By delimiter     separate_wider_delim()      separate_longer_delim()
By position      separate_wider_position()   separate_longer_position()
By regex         separate_wider_regex()

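They’re more consistent than the existing functions that do the same jobs:

                 Wider (new columns)              Longer (new rows)
By delimiter     separate(sep = string)           separate_rows()
By position      separate(sep = integer vector)   N/A
By regex         extract()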
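
For completeness, here is a minimal sketch of one of the longer variants, which adds rows instead of columns. The two-row data frame is made up for illustration.

```r
library(tidyr)

df <- tibble::tibble(id = 1:2, x = c("a,b", "c"))

# One row per piece: "a,b" becomes two rows, both keeping id = 1
long <- separate_longer_delim(df, x, delim = ",")
long$id
#> [1] 1 1 2
long$x
#> [1] "a" "b" "c"
```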

Today I’ll focus on the wider functions because they’re the most interesting:

vt_census <- tidycensus::get_decennial(
  geography = "block",
  state = "VT",
  county = "Washington",
  variables = "P1_001N",
  year = 2020
)
#> Getting data from the 2020 decennial Census
#> Using the PL 94-171 Redistricting Data summary file
#> Note: 2020 decennial Census data use differential privacy, a technique that
#> introduces errors into data to preserve respondent confidentiality.
#> ℹ Small counts should be interpreted with caution.
#> ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
#> This message is displayed once per session.
vt_census
#> # A tibble: 2,150 × 4
#>    GEOID           NAME                                            varia…¹ value
#>    <chr>           <chr>                                           <chr>   <dbl>
#>  1 500239541003033 Block 3033, Block Group 3, Census Tract 9541, … P1_001N     1
#>  2 500239541001055 Block 1055, Block Group 1, Census Tract 9541, … P1_001N    31
#>  3 500239543002047 Block 2047, Block Group 2, Census Tract 9543, … P1_001N    23
#>  4 500239545001022 Block 1022, Block Group 1, Census Tract 9545, … P1_001N     0
#>  5 500239545001087 Block 1087, Block Group 1, Census Tract 9545, … P1_001N     2
#>  6 500239546002032 Block 2032, Block Group 2, Census Tract 9546, … P1_001N     0
#>  7 500239551001009 Block 1009, Block Group 1, Census Tract 9551, … P1_001N    74
#>  8 500239550002001 Block 2001, Block Group 2, Census Tract 9550, … P1_001N     6
#>  9 500239552003006 Block 3006, Block Group 3, Census Tract 9552, … P1_001N    71
#> 10 500239551002030 Block 2030, Block Group 2, Census Tract 9551, … P1_001N    37
#> # … with 2,140 more rows, and abbreviated variable name ¹​variable

vt_census |>
  separate_wider_position(
    GEOID,
    widths = c(state = 2, county = 3, tract = 6, block = 4)
  )
#> # A tibble: 2,150 × 7
#>    state county tract  block NAME                                  varia…¹ value
#>    <chr> <chr>  <chr>  <chr> <chr>                                 <chr>   <dbl>
#>  1 50    023    954100 3033  Block 3033, Block Group 3, Census Tr… P1_001N     1
#>  2 50    023    954100 1055  Block 1055, Block Group 1, Census Tr… P1_001N    31
#>  3 50    023    954300 2047  Block 2047, Block Group 2, Census Tr… P1_001N    23
#>  4 50    023    954500 1022  Block 1022, Block Group 1, Census Tr… P1_001N     0
#>  5 50    023    954500 1087  Block 1087, Block Group 1, Census Tr… P1_001N     2
#>  6 50    023    954600 2032  Block 2032, Block Group 2, Census Tr… P1_001N     0
#>  7 50    023    955100 1009  Block 1009, Block Group 1, Census Tr… P1_001N    74
#>  8 50    023    955000 2001  Block 2001, Block Group 2, Census Tr… P1_001N     6
#>  9 50    023    955200 3006  Block 3006, Block Group 3, Census Tr… P1_001N    71
#> 10 50    023    955100 2030  Block 2030, Block Group 2, Census Tr… P1_001N    37
#> # … with 2,140 more rows, and abbreviated variable name ¹​variable

vt_census |>
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("block", "block_group", "tract", "county", "state")
  ) |>
  mutate(
    block = parse_number(block),
    block_group = parse_number(block_group),
    tract = parse_number(tract)
  )
#> # A tibble: 2,150 × 8
#>    GEOID           block block_group tract county            state varia…¹ value
#>    <chr>           <dbl>       <dbl> <dbl> <chr>             <chr> <chr>   <dbl>
#>  1 500239541003033  3033           3  9541 Washington County Verm… P1_001N     1
#>  2 500239541001055  1055           1  9541 Washington County Verm… P1_001N    31
#>  3 500239543002047  2047           2  9543 Washington County Verm… P1_001N    23
#>  4 500239545001022  1022           1  9545 Washington County Verm… P1_001N     0
#>  5 500239545001087  1087           1  9545 Washington County Verm… P1_001N     2
#>  6 500239546002032  2032           2  9546 Washington County Verm… P1_001N     0
#>  7 500239551001009  1009           1  9551 Washington County Verm… P1_001N    74
#>  8 500239550002001  2001           2  9550 Washington County Verm… P1_001N     6
#>  9 500239552003006  3006           3  9552 Washington County Verm… P1_001N    71
#> 10 500239551002030  2030           2  9551 Washington County Verm… P1_001N    37
#> # … with 2,140 more rows, and abbreviated variable name ¹​variable

vt_census |>
  separate_wider_regex(
    NAME,
    patterns = c(
      "Block ", block = "\\d+", ", ",
      "Block Group ", block_group = "\\d+", ", ",
      "Census Tract ", tract = "\\d+.\\d+", ", ",
      county = "[^,]+", ", ",
      state = ".*"
    )
  )
#> # A tibble: 2,150 × 8
#>    GEOID           block block_group tract county            state varia…¹ value
#>    <chr>           <chr> <chr>       <chr> <chr>             <chr> <chr>   <dbl>
#>  1 500239541003033 3033  3           9541  Washington County Verm… P1_001N     1
#>  2 500239541001055 1055  1           9541  Washington County Verm… P1_001N    31
#>  3 500239543002047 2047  2           9543  Washington County Verm… P1_001N    23
#>  4 500239545001022 1022  1           9545  Washington County Verm… P1_001N     0
#>  5 500239545001087 1087  1           9545  Washington County Verm… P1_001N     2
#>  6 500239546002032 2032  2           9546  Washington County Verm… P1_001N     0
#>  7 500239551001009 1009  1           9551  Washington County Verm… P1_001N    74
#>  8 500239550002001 2001  2           9550  Washington County Verm… P1_001N     6
#>  9 500239552003006 3006  3           9552  Washington County Verm… P1_001N    71
#> 10 500239551002030 2030  2           9551  Washington County Verm… P1_001N    37
#> # … with 2,140 more rows, and abbreviated variable name ¹​variable

They also have an improved way to report on problems:

df <- tibble(
  x = c("a", "a-b", "a-b-c")
)

# old functions warn
df %>% separate(x, c("x", "y"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> # A tibble: 3 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 a     <NA> 
#> 2 a     b    
#> 3 a     b

# new functions error
df %>% separate_wider_delim(x, delim = "-", names = c("x", "y"))
#> Error in `separate_wider_delim()`:
#> ! Expected 2 pieces in each element of `x`.
#> ! 1 value was too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
#> ! 1 value was too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
df %>% separate_wider_delim(x, delim = "-", names = c("x"))
#> Error in `separate_wider_delim()`:
#> ! Expected 1 pieces in each element of `x`.
#> ! 2 values were too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.

# and give you debugging tools:
probs <- df %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#> `x_remainder`.
probs
#> # A tibble: 3 × 6
#>   a     b     x     x_ok  x_pieces x_remainder
#>   <chr> <chr> <chr> <lgl>    <int> <chr>      
#> 1 a     <NA>  a     FALSE        1 ""         
#> 2 a     b     a-b   TRUE         2 ""         
#> 3 a     b     a-b-c FALSE        3 "-c"
# conflicted is loaded, so filter() needs its namespace
probs %>% dplyr::filter(!x_ok)
#> # A tibble: 2 × 6
#>   a     b     x     x_ok  x_pieces x_remainder
#>   <chr> <chr> <chr> <lgl>    <int> <chr>      
#> 1 a     <NA>  a     FALSE        1 ""         
#> 2 a     b     a-b-c FALSE        3 "-c"
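
Once the problem rows are understood, the other `too_few` and `too_many` options named in the error message let you state how they should be handled. A minimal sketch on the same three values:

```r
library(tidyr)

df <- tibble::tibble(x = c("a", "a-b", "a-b-c"))

# Pad short values with NA on the right and drop any extra pieces
dropped <- separate_wider_delim(
  df, x, delim = "-", names = c("x", "y"),
  too_few = "align_start", too_many = "drop"
)
dropped$y
#> [1] NA  "b" "b"

# Or keep the extra pieces by merging them into the last column
merged <- separate_wider_delim(
  df, x, delim = "-", names = c("x", "y"),
  too_few = "align_start", too_many = "merge"
)
merged$y
#> [1] NA    "b"   "b-c"
```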

stringr 1.5.0

stringr 1.4.0 was released in Feb 2019, more than 3 years ago, so this release is mostly an accumulation of small improvements and bug fixes.

Lots of new functions: str_escape(), str_equal(), str_flatten_comma(), str_split_1(), str_split_i(), str_like(), str_rank(), str_sub_all(), str_unique(), str_width().
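
A quick sketch of a few of these helpers on made-up inputs, to give a feel for what they do:

```r
library(stringr)

# Collapse to a single string, with a different separator before the last item
str_flatten_comma(c("a", "b", "c"), last = ", and ")
#> [1] "a, b, and c"

# str_split_1() takes a single string and returns a character vector, not a list
str_split_1("a-b-c", "-")
#> [1] "a" "b" "c"

# str_split_i() extracts the i-th piece of each string
str_split_i(c("a-b-c", "d-e"), "-", 2)
#> [1] "b" "e"

# str_equal() compares strings, optionally ignoring case
str_equal("a", "A", ignore_case = TRUE)
#> [1] TRUE

str_unique(c("a", "a", "b"))
#> [1] "a" "b"
```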

Handy new str_view() function (which uses colour where possible):

x <- "a\n'\b\n\"c"
x
#> [1] "a\n'\b\n\"c"
writeLines(x)
#> a
#> '
#> "c
str_view(x)
#> [1] │ a
#>     │ '
#>     │ "c

And display special white space:

nbsp <- "Hi\u00A0you"
nbsp
#> [1] "Hi you"
nbsp == "Hi you"
#> [1] FALSE

str_view(nbsp)
#> [1] │ Hi{\u00a0}you

And matches:

str_view(c("abc", "def", "fghi"), "[aeiou]")
#> [1] │ <a>bc
#> [2] │ d<e>f
#> [3] │ fgh<i>
str_view(c("abc", "def", "fghi"), ".$")
#> [1] │ ab<c>
#> [2] │ de<f>
#> [3] │ fgh<i>

str_view(fruit, "(.)\\1")
#>  [1] │ a<pp>le
#>  [5] │ be<ll> pe<pp>er
#>  [6] │ bilbe<rr>y
#>  [7] │ blackbe<rr>y
#>  [8] │ blackcu<rr>ant
#>  [9] │ bl<oo>d orange
#> [10] │ bluebe<rr>y
#> [11] │ boysenbe<rr>y
#> [16] │ che<rr>y
#> [17] │ chili pe<pp>er
#> [19] │ cloudbe<rr>y
#> [21] │ cranbe<rr>y
#> [23] │ cu<rr>ant
#> [28] │ e<gg>plant
#> [29] │ elderbe<rr>y
#> [32] │ goji be<rr>y
#> [33] │ g<oo>sebe<rr>y
#> [38] │ hucklebe<rr>y
#> [47] │ lych<ee>
#> [50] │ mulbe<rr>y
#> ... and 9 more

stringr also now uses the standard tidyverse recycling rules (only ever recycling length-1 vectors) and has more informative errors.
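
A minimal sketch of the recycling change, using str_detect() as a representative function. The error is caught with tryCatch() only so the script keeps running.

```r
library(stringr)

# A length-1 pattern is recycled across all of `letters`
sum(str_detect(letters, "x"))
#> [1] 1

# A length-2 pattern against 26 strings is no longer recycled; it errors
res <- tryCatch(
  str_detect(letters, c("x", "y")),
  error = function(e) "error"
)
res
#> [1] "error"
```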