Upcoming features in the tidyverse
We’re currently working on significant updates to five tidyverse packages, mostly inspired by the second edition of R4DS: https://r4ds.hadley.nz. We haven’t scheduled release dates for most of these packages yet, but they should all be out by early 2023.
All these features are still changeable so we’d love your feedback 😃
tidyverse 1.4.0
library(tidyverse)
#> ── Attaching core tidyverse packages ─────────────────── tidyverse ──
#> ✔ dplyr ✔ readr 2.1.3
#> ✔ forcats 0.5.2 ✔ stringr
#> ✔ ggplot2 ✔ tibble 3.1.8
#> ✔ lubridate 1.8.0 ✔ tidyr
#> ✔ purrr 0.9000.0.9000
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Two big changes: lubridate is joining the core tidyverse and we now advertise the conflicted package.
Haven’t heard of conflicted before? Learn more at https://conflicted.r-lib.org.
select
#> Error:
#> ! [conflicted] `select` found in 2 packages.
#> Either pick the one you want with `::`
#> * MASS::select
#> * dplyr::select
#> Or declare a preference with `conflict_prefer()`
#> * conflict_prefer("select", "MASS")
#> * conflict_prefer("select", "dplyr")
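The second option in that message can be sketched as follows. This is a minimal example that assumes conflicted and dplyr are installed; the mtcars data and the name small_cars are only illustrative.

```r
library(conflicted)
library(dplyr)

# Declare once that the bare name `filter` should mean dplyr::filter().
conflict_prefer("filter", "dplyr")

# With the preference in place this call is no longer ambiguous.
small_cars <- filter(mtcars, cyl == 4)
nrow(small_cars)
```

Declaring preferences near the top of a script keeps the choice visible to anyone reading the code later.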
ggplot2 3.4.0
To be released on Oct 31
ggplot(airquality) +
  geom_line(aes(Day, Temp, size = Month, group = Month)) +
  scale_linewidth(range = c(0.5, 3))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> ℹ The deprecated feature was likely used in the ggplot2 package.
#> Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
#> Warning in panel_params$guide: partial match of 'guide' to 'guides'
huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))
p <- ggplot(huron) +
  geom_line(aes(year, level)) +
  geom_ribbon(aes(year, xmin = level - 5, xmax = level + 5))
p
#> Error in `geom_ribbon()`:
#> ! Problem while setting up geom.
#> ℹ Error occurred in the 2nd layer.
#> Caused by error in `compute_geom_1()`:
#> ! `geom_ribbon()` requires the following missing aesthetics: ymin and
#> ymax or y
dplyr 1.1.0
arrange() and group_by() now sort character vectors in the C locale:
- This is good because it’s waaaaaay faster
- This is bad because it sorts a little differently from your system locale
df <- tibble(x = stringi::stri_rand_strings(n = 5e5, length = 15))
bench::system_time(df |> arrange(x))
#> process real
#> 101ms 99.4ms
withr::with_options(list(dplyr.legacy_locale = TRUE), {
  bench::system_time(df |> arrange(x))
})
#> process real
#> 1.93s 1.9s
# Can still request specific locale if needed
bench::system_time(df |> arrange(x, .locale = "fr"))
#> process real
#> 215ms 212ms
# Also affects group_by
bench::system_time(df |> group_by(x))
#> process real
#> 190ms 187ms
withr::with_options(list(dplyr.legacy_locale = TRUE), {
  bench::system_time(df |> group_by(x))
})
#> process real
#> 1.07s 1.05s
Lots of improvements to joins:
- https://r4ds.hadley.nz/joins.html#one-to-one-mapping
- https://r4ds.hadley.nz/joins.html#allow-multiple-rows
- https://r4ds.hadley.nz/joins.html#non-equi-joins
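A minimal sketch of two of these join improvements, assuming dplyr 1.1.0 or later. The tables sales, promos, and regions are made up for illustration. join_by() expresses a non-equi join, and unmatched = "error" turns silently dropped rows into an error.

```r
library(dplyr)

# Hypothetical data: three sales and two promotion windows.
sales <- tibble(
  id = 1:3,
  sale_date = as.Date(c("2022-01-15", "2022-03-02", "2022-06-20"))
)
promos <- tibble(
  promo = c("winter", "spring"),
  start = as.Date(c("2022-01-01", "2022-03-01")),
  end = as.Date(c("2022-01-31", "2022-03-31"))
)

# Non-equi join: keep each sale that falls inside a promotion window.
matched <- inner_join(
  sales,
  promos,
  by = join_by(sale_date >= start, sale_date <= end)
)

# Stricter matching: ask for an error instead of silently dropping rows.
regions <- tibble(id = 1:2, region = c("north", "south"))
strict <- tryCatch(
  inner_join(sales, regions, by = join_by(id), unmatched = "error"),
  error = function(e) "unmatched rows found"
)
```

Here the June sale matches no promotion and is dropped from matched, while the strict join reports the missing id rather than dropping it.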
Quite a few functions have been rewritten to use vctrs behind the scenes: if_else(), first(), last(), nth(), between(), coalesce(). This brings performance improvements, consistency, and better error messages.
The most important change is that you no longer need to know about the different types of NA to use if_else():
x <- c(1:10, NA)
if_else(x %% 2 == 0, x, NA) # used to be an error
#> [1] NA 2 NA 4 NA 6 NA 8 NA 10 NA
if_else(x %% 2 == 0, x, "x")
#> Error in `if_else()`:
#> ! Can't combine `true` <integer> and `false` <character>.
# cf base R
ifelse(x %% 2 == 0, x, "x")
#> [1] "x" "2" "x" "4" "x" "6" "x" "8" "x" "10" NA
# also
if_else(x %% 2 == 0, x, 0, missing = 1000)
#> [1] 0 2 0 4 0 6 0 8 0 10 1000
New case_match():
x <- c("a", "b", "a", "d", "b", NA, "c", "e")
case_match(
  x,
  c("a", "b") ~ 1,
  "c" ~ 2,
  "d" ~ 3,
  .default = 0
)
#> [1] 1 1 1 3 1 0 2 0
# Or use it to just replace certain values
case_match(
  x,
  c("a", "b") ~ "a",
  .default = x
)
#> [1] "a" "a" "a" "d" "a" NA "c" "e"
purrr 1.0.0
This release is about finding the core of purrr: many older and less used functions have been deprecated. We don’t expect this to affect much code in the wild, but it makes purrr much simpler: https://purrr.tidyverse.org/dev/news/index.html#lifecycle-updates-development-version
Three major new features: progress bars, better errors, and generalised map_vec()
out <- map(1:500, \(i) Sys.sleep(0.10), .progress = TRUE)
#> ■■■ 6% | ETA: 48s
#> ■■■■■ 12% | ETA: 46s
#> ■■■■■■ 18% | ETA: 43s
#> ■■■■■■■■ 23% | ETA: 40s
#> ■■■■■■■■■■ 29% | ETA: 37s
#> ■■■■■■■■■■■ 35% | ETA: 34s
#> ■■■■■■■■■■■■■ 41% | ETA: 31s
#> ■■■■■■■■■■■■■■■ 47% | ETA: 28s
#> ■■■■■■■■■■■■■■■■■ 52% | ETA: 25s
#> ■■■■■■■■■■■■■■■■■■ 58% | ETA: 22s
#> ■■■■■■■■■■■■■■■■■■■■ 64% | ETA: 19s
#> ■■■■■■■■■■■■■■■■■■■■■■ 70% | ETA: 16s
#> ■■■■■■■■■■■■■■■■■■■■■■■■ 75% | ETA: 13s
#> ■■■■■■■■■■■■■■■■■■■■■■■■■ 81% | ETA: 10s
#> ■■■■■■■■■■■■■■■■■■■■■■■■■■■ 87% | ETA: 7s
#> ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 93% | ETA: 4s
#> ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 99% | ETA: 1s
x <- sample(1:500)
map(x, \(x) if (x == 300) stop("An error!") else x * 2)
#> Error in `map()`:
#> ℹ In index: 363.
#> Caused by error in `.f()`:
#> ! An error!
# map_lgl, map_int, map_dbl, map_chr
x <- 1:10
map_vec(x, \(x) Sys.Date() + x)
#> [1] "2022-10-29" "2022-10-30" "2022-10-31" "2022-11-01" "2022-11-02"
#> [6] "2022-11-03" "2022-11-04" "2022-11-05" "2022-11-06" "2022-11-07"
map_vec(x, \(x) factor(letters[x]))
#> [1] a b c d e f g h i j
#> Levels: a b c d e f g h i j
tidyr 1.3.0
New family of functions for splitting strings up into multiple variables. They’re more consistent than the functions they replace:

| New function | Replaces |
|---|---|
| separate_wider_delim() | separate(sep = string) |
| separate_longer_delim() | separate_rows() |
| separate_wider_position() | separate(sep = integer vector) |
| separate_longer_position() | N/A |
| separate_wider_regex() | extract() |
Today I’ll focus on the wider functions because they’re the most interesting:
vt_census <- tidycensus::get_decennial(
  geography = "block",
  state = "VT",
  county = "Washington",
  variables = "P1_001N",
  year = 2020
)
#> Getting data from the 2020 decennial Census
#> Using the PL 94-171 Redistricting Data summary file
#> Note: 2020 decennial Census data use differential privacy, a technique that
#> introduces errors into data to preserve respondent confidentiality.
#> ℹ Small counts should be interpreted with caution.
#> ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
#> This message is displayed once per session.
vt_census
#> # A tibble: 2,150 × 4
#> GEOID NAME varia…¹ value
#> <chr> <chr> <chr> <dbl>
#> 1 500239541003033 Block 3033, Block Group 3, Census Tract 9541, … P1_001N 1
#> 2 500239541001055 Block 1055, Block Group 1, Census Tract 9541, … P1_001N 31
#> 3 500239543002047 Block 2047, Block Group 2, Census Tract 9543, … P1_001N 23
#> 4 500239545001022 Block 1022, Block Group 1, Census Tract 9545, … P1_001N 0
#> 5 500239545001087 Block 1087, Block Group 1, Census Tract 9545, … P1_001N 2
#> 6 500239546002032 Block 2032, Block Group 2, Census Tract 9546, … P1_001N 0
#> 7 500239551001009 Block 1009, Block Group 1, Census Tract 9551, … P1_001N 74
#> 8 500239550002001 Block 2001, Block Group 2, Census Tract 9550, … P1_001N 6
#> 9 500239552003006 Block 3006, Block Group 3, Census Tract 9552, … P1_001N 71
#> 10 500239551002030 Block 2030, Block Group 2, Census Tract 9551, … P1_001N 37
#> # … with 2,140 more rows, and abbreviated variable name ¹variable
vt_census %>%
  separate_wider_position(
    GEOID,
    widths = c(state = 2, county = 3, tract = 6, block = 4)
  )
#> # A tibble: 2,150 × 7
#> state county tract block NAME varia…¹ value
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 50 023 954100 3033 Block 3033, Block Group 3, Census Tr… P1_001N 1
#> 2 50 023 954100 1055 Block 1055, Block Group 1, Census Tr… P1_001N 31
#> 3 50 023 954300 2047 Block 2047, Block Group 2, Census Tr… P1_001N 23
#> 4 50 023 954500 1022 Block 1022, Block Group 1, Census Tr… P1_001N 0
#> 5 50 023 954500 1087 Block 1087, Block Group 1, Census Tr… P1_001N 2
#> 6 50 023 954600 2032 Block 2032, Block Group 2, Census Tr… P1_001N 0
#> 7 50 023 955100 1009 Block 1009, Block Group 1, Census Tr… P1_001N 74
#> 8 50 023 955000 2001 Block 2001, Block Group 2, Census Tr… P1_001N 6
#> 9 50 023 955200 3006 Block 3006, Block Group 3, Census Tr… P1_001N 71
#> 10 50 023 955100 2030 Block 2030, Block Group 2, Census Tr… P1_001N 37
#> # … with 2,140 more rows, and abbreviated variable name ¹variable
vt_census %>%
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("block", "block_group", "tract", "county", "state")
  ) %>%
  mutate(
    block = block %>% parse_number(),
    block_group = block_group %>% parse_number(),
    tract = tract %>% parse_number()
  )
#> # A tibble: 2,150 × 8
#> GEOID block block_group tract county state varia…¹ value
#> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
#> 1 500239541003033 3033 3 9541 Washington County Verm… P1_001N 1
#> 2 500239541001055 1055 1 9541 Washington County Verm… P1_001N 31
#> 3 500239543002047 2047 2 9543 Washington County Verm… P1_001N 23
#> 4 500239545001022 1022 1 9545 Washington County Verm… P1_001N 0
#> 5 500239545001087 1087 1 9545 Washington County Verm… P1_001N 2
#> 6 500239546002032 2032 2 9546 Washington County Verm… P1_001N 0
#> 7 500239551001009 1009 1 9551 Washington County Verm… P1_001N 74
#> 8 500239550002001 2001 2 9550 Washington County Verm… P1_001N 6
#> 9 500239552003006 3006 3 9552 Washington County Verm… P1_001N 71
#> 10 500239551002030 2030 2 9551 Washington County Verm… P1_001N 37
#> # … with 2,140 more rows, and abbreviated variable name ¹variable
vt_census %>%
  separate_wider_regex(
    NAME,
    patterns = c(
      "Block ", block = "\\d+", ", ",
      "Block Group ", block_group = "\\d+", ", ",
      "Census Tract ", tract = "\\d+.\\d+", ", ",
      county = "[^,]+", ", ",
      state = ".*"
    )
  )
#> # A tibble: 2,150 × 8
#> GEOID block block_group tract county state varia…¹ value
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 500239541003033 3033 3 9541 Washington County Verm… P1_001N 1
#> 2 500239541001055 1055 1 9541 Washington County Verm… P1_001N 31
#> 3 500239543002047 2047 2 9543 Washington County Verm… P1_001N 23
#> 4 500239545001022 1022 1 9545 Washington County Verm… P1_001N 0
#> 5 500239545001087 1087 1 9545 Washington County Verm… P1_001N 2
#> 6 500239546002032 2032 2 9546 Washington County Verm… P1_001N 0
#> 7 500239551001009 1009 1 9551 Washington County Verm… P1_001N 74
#> 8 500239550002001 2001 2 9550 Washington County Verm… P1_001N 6
#> 9 500239552003006 3006 3 9552 Washington County Verm… P1_001N 71
#> 10 500239551002030 2030 2 9551 Washington County Verm… P1_001N 37
#> # … with 2,140 more rows, and abbreviated variable name ¹variable
They also have an improved way to report on problems:
df <- tibble(
  x = c("a", "a-b", "a-b-c")
)
# old functions warn
df %>% separate(x, c("x", "y"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> # A tibble: 3 × 2
#> x y
#> <chr> <chr>
#> 1 a <NA>
#> 2 a b
#> 3 a b
# new functions error
df %>% separate_wider_delim(x, delim = "-", names = c("x", "y"))
#> Error in `separate_wider_delim()`:
#> ! Expected 2 pieces in each element of `x`.
#> ! 1 value was too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
#> ! 1 value was too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
df %>% separate_wider_delim(x, delim = "-", names = c("x"))
#> Error in `separate_wider_delim()`:
#> ! Expected 1 pieces in each element of `x`.
#> ! 2 values were too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
# and give you debugging tools:
probs <- df %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#> `x_remainder`.
probs
#> # A tibble: 3 × 6
#> a b x x_ok x_pieces x_remainder
#> <chr> <chr> <chr> <lgl> <int> <chr>
#> 1 a <NA> a FALSE 1 ""
#> 2 a b a-b TRUE 2 ""
#> 3 a b a-b-c FALSE 3 "-c"
probs %>% filter(!x_ok)
#> Error:
#> ! [conflicted] `filter` found in 2 packages.
#> Either pick the one you want with `::`
#> * dplyr::filter
#> * stats::filter
#> Or declare a preference with `conflict_prefer()`
#> * conflict_prefer("filter", "dplyr")
#> * conflict_prefer("filter", "stats")
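Once the problem rows are understood, the too_few and too_many options named in the error messages can resolve them. A minimal sketch, assuming tidyr 1.3.0 or later and the same three-row example; the names merged and dropped are only illustrative.

```r
library(tidyr)

df <- tibble::tibble(x = c("a", "a-b", "a-b-c"))

# Pad short values with NA on the right and fold extra pieces
# into the last column.
merged <- separate_wider_delim(
  df, x,
  delim = "-",
  names = c("x", "y"),
  too_few = "align_start",
  too_many = "merge"
)

# Alternatively, discard any extra pieces.
dropped <- separate_wider_delim(
  df, x,
  delim = "-",
  names = c("x", "y"),
  too_few = "align_start",
  too_many = "drop"
)
```

With "merge" the third row keeps "b-c" in y; with "drop" it keeps only "b".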
stringr 1.5.0
stringr 1.4.0 was released in Feb 2019, more than 3 years ago, so this release is mostly an accumulation of small improvements and bug fixes.
Lots of new functions: str_escape(), str_equal(), str_flatten_comma(), str_split_1(), str_split_i(), str_like(), str_rank(), str_sub_all(), str_unique(), str_width().
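A brief sketch of a few of these helpers, assuming stringr 1.5.0 or later is installed. The input strings and variable names are made up for illustration.

```r
library(stringr)

# Split a single string and get back a plain character vector.
pieces <- str_split_1("a-b-c", "-")

# Take the second piece of each string.
second <- str_split_i(c("a-b-c", "d-e"), "-", 2)

# Collapse into a comma-separated list with a custom final separator.
listed <- str_flatten_comma(c("x", "y", "z"), last = ", and ")

# Compare strings while ignoring case.
same <- str_equal("A", "a", ignore_case = TRUE)

# Drop duplicated strings.
uniq <- str_unique(c("a", "b", "a"))
```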
Handy new str_view() function (which uses colour where possible):
x <- "a\n'\b\n\"c"
x
#> [1] "a\n'\b\n\"c"
writeLines(x)
#> a
#> '
#> "c
str_view(x)
#> [1] │ a
#>     │ '
#>     │ "c
And display special white space:
nbsp <- "Hi\u00A0you"
nbsp
#> [1] "Hi you"
nbsp == "Hi you"
#> [1] FALSE
str_view(nbsp)
#> [1] │ Hi{\u00a0}you
And matches:
str_view(c("abc", "def", "fghi"), "[aeiou]")
#> [1] │ <a>bc
#> [2] │ d<e>f
#> [3] │ fgh<i>
str_view(c("abc", "def", "fghi"), ".$")
#> [1] │ ab<c>
#> [2] │ de<f>
#> [3] │ fgh<i>
str_view(fruit, "(.)\\1")
#> [1] │ a<pp>le
#> [5] │ be<ll> pe<pp>er
#> [6] │ bilbe<rr>y
#> [7] │ blackbe<rr>y
#> [8] │ blackcu<rr>ant
#> [9] │ bl<oo>d orange
#> [10] │ bluebe<rr>y
#> [11] │ boysenbe<rr>y
#> [16] │ che<rr>y
#> [17] │ chili pe<pp>er
#> [19] │ cloudbe<rr>y
#> [21] │ cranbe<rr>y
#> [23] │ cu<rr>ant
#> [28] │ e<gg>plant
#> [29] │ elderbe<rr>y
#> [32] │ goji be<rr>y
#> [33] │ g<oo>sebe<rr>y
#> [38] │ hucklebe<rr>y
#> [47] │ lych<ee>
#> [50] │ mulbe<rr>y
#> ... and 9 more
stringr also now uses standard tidyverse recycling rules (only ever recycling length-1 vectors) and has more informative errors.
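A minimal sketch of what the recycling rule means in practice, assuming stringr 1.5.0 or later. The fruits vector and the result names are illustrative.

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

# A length-1 pattern is recycled against every string.
has_an <- str_detect(fruits, "an")

# A length-2 pattern against three strings is not recycled;
# under the new rules this is expected to signal an error.
mismatch <- tryCatch(
  str_detect(fruits, c("a", "b")),
  error = function(e) "error"
)
```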