Interpolation

The !! ("bang bang") operator interpolates the values of variables from the parent environment into your code. This operator is borrowed from the R rlang package. At some point, we may switch to native Julia interpolation, but because native interpolation introduces some complexity of its own, we plan to continue to support !! interpolation.

To interpolate multiple variables, the rlang R package uses the !!! "triple bang" operator. However, in TidierData.jl, the !! "bang bang" operator can be used to interpolate either single or multiple values as shown in the examples below.

Note: You can only interpolate values from variables in the parent environment. If you would like to interpolate column names, you have two options: you can either use across(), or you can use @aside with @pull() to create variables in the parent environment containing the values of those columns, which can then be accessed using interpolation.

myvar = :b and myvar = Cols(:a, :b) both refer to columns with those names. On the other hand, myvar = "b", myvar = ("a", "b") and myvar = ["a", "b"] will interpolate the values. If you intend to interpolate column names, the preferred way is to use Cols() as in the examples below.

using TidierData

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)
10×3 DataFrame
 Row │ a       b      c
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ a           1     11
   2 │ a           1     12
   3 │ b           1     13
   4 │ b           2     14
   5 │ c           2     15
   6 │ c           2     16
   7 │ d           3     17
   8 │ d           3     18
   9 │ e           3     19
  10 │ e           4     20

Select the column (because myvar contains a symbol)

myvar = :b

@chain df begin
  @select(!!myvar)
end
10×1 DataFrame
 Row │ b
     │ Int64
─────┼───────
   1 │     1
   2 │     1
   3 │     1
   4 │     2
   5 │     2
   6 │     2
   7 │     3
   8 │     3
   9 │     3
  10 │     4

Select multiple variables

You can also use a vector as in [:a, :b], but Cols() is preferred because it lets you mix and match column names and column numbers.

myvars = Cols(:a, :b)

@chain df begin
  @select(!!myvars)
end
10×2 DataFrame
 Row │ a       b
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ b           1
   4 │ b           2
   5 │ c           2
   6 │ c           2
   7 │ d           3
   8 │ d           3
   9 │ e           3
  10 │ e           4

This is equivalent to selecting the second column by its position:

myvars = Cols(:a, 2)

@chain df begin
  @select(!!myvars)
end
10×2 DataFrame
 Row │ a       b
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
   3 │ b           1
   4 │ b           2
   5 │ c           2
   6 │ c           2
   7 │ d           3
   8 │ d           3
   9 │ e           3
  10 │ e           4

Filter rows containing the value of myvar_string

myvar_string = "b"

@chain df begin
  @filter(a == !!myvar_string)
end
2×3 DataFrame
 Row │ a       b      c
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ b           1     13
   2 │ b           2     14

Filtering rows works similarly using in.

Note that for in to work here, we have to wrap the interpolated string in []. Otherwise, the string would be treated as a collection of characters, which are a different data type from strings.

myvar_string = "b"

@chain df begin
  @filter(a in [!!myvar_string])
end
2×3 DataFrame
 Row │ a       b      c
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ b           1     13
   2 │ b           2     14
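
The reason for the [] wrapper can be checked in plain Julia, without TidierData. The following is a minimal sketch using only Base; the variable names are chosen purely for illustration. Iterating a string yields Char values, which never compare equal to a String, whereas a one-element vector holds the String itself.

```julia
needle = "b"

# Iterating a String yields Char values, not one-character Strings.
chars = collect(needle)
is_char = chars[1] isa Char

# A Char and a one-character String are different types and compare unequal.
char_equals_string = ('b' == "b")

# Wrapping the String in a vector gives `in` a collection of Strings to search.
found_in_vector = needle in [needle]
```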

You can also use this for a vector (or tuple) of strings.

myvars_string = ["a", "b"]

@chain df begin
  @filter(a in !!myvars_string)
end
4×3 DataFrame
 Row │ a       b      c
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ a           1     11
   2 │ a           1     12
   3 │ b           1     13
   4 │ b           2     14
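
The example above uses a vector. As a sketch of the tuple form mentioned in the heading, the same filter can be written with a tuple of strings. The name myvars_tuple is hypothetical, and this assumes tuples are interpolated as values in the same way as vectors, as described at the top of this page.

```julia
using TidierData

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)

# A tuple of strings (hypothetical name) should be interpolated as values,
# just like the vector in the previous example.
myvars_tuple = ("a", "b")

result = @chain df begin
  @filter(a in !!myvars_tuple)
end
```

If the assumption holds, result should contain the same four rows as the vector version.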

Mutate one variable

Remember: You cannot interpolate column names into @mutate() expressions. However, you can create a temporary variable containing the values of the column in question or you can use @mutate() with across().

Option 1: Create a temporary variable containing the values of the column.

myvar = :b

@chain df begin
  @aside(myvar_values = @pull(_, !!myvar))
  @mutate(d = !!myvar_values + 1)
end
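10×4 DataFrame
 Row │ a       b      c      d
     │ String  Int64  Int64  Int64
─────┼─────────────────────────────
   1 │ a           1     11      2
   2 │ a           1     12      2
   3 │ b           1     13      2
   4 │ b           2     14      3
   5 │ c           2     15      3
   6 │ c           2     16      3
   7 │ d           3     17      4
   8 │ d           3     18      4
   9 │ e           3     19      4
  10 │ e           4     20      5

Option 2: Use @mutate() with across()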

Note: when using across(), anonymous functions are not vectorized. This is intentional to allow users to specify their function exactly as desired.

@chain df begin
  @mutate(across(!!myvar, x -> x .+ 1))
  @rename(d = b_function)
end
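10×4 DataFrame
 Row │ a       b      c      d
     │ String  Int64  Int64  Int64
─────┼─────────────────────────────
   1 │ a           1     11      2
   2 │ a           1     12      2
   3 │ b           1     13      2
   4 │ b           2     14      3
   5 │ c           2     15      3
   6 │ c           2     16      3
   7 │ d           3     17      4
   8 │ d           3     18      4
   9 │ e           3     19      4
  10 │ e           4     20      5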
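
To see why the anonymous function above is written with .+ rather than +, here is a minimal sketch in plain Julia. It assumes, as the note on across() suggests, that the function receives the whole column as a vector; the vector b below is a stand-in for that column.

```julia
# Stand-in for a column passed to the anonymous function.
b = [1, 1, 1, 2]

# Broadcasting with `.+` applies the addition to each element.
add_one = x -> x .+ 1
d = add_one(b)

# Without the dot, adding a scalar to a vector is not defined in Julia
# and raises a MethodError.
scalar_add_fails = try
  (x -> x + 1)(b)
  false
catch err
  err isa MethodError
end
```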

Summarize across one variable

myvar = :b

@chain df begin
  @summarize(across(!!myvar, mean))
end
1×1 DataFrame
 Row │ b_mean
     │ Float64
─────┼─────────
   1 │     2.2

Summarize across multiple variables

myvars = Cols(:b, :c)

@chain df begin
  @summarize(across(!!myvars, (mean, minimum, maximum)))
end
1×6 DataFrame
 Row │ b_mean   c_mean   b_minimum  c_minimum  b_maximum  c_maximum
     │ Float64  Float64  Int64      Int64      Int64      Int64
─────┼──────────────────────────────────────────────────────────────
   1 │     2.2     15.5          1         11          4         20

Group by one interpolated variable

myvar = :a

@chain df begin
  @group_by(!!myvar)
  @summarize(c = mean(c))
end
5×2 DataFrame
 Row │ a       c
     │ String  Float64
─────┼─────────────────
   1 │ a          11.5
   2 │ b          13.5
   3 │ c          15.5
   4 │ d          17.5
   5 │ e          19.5

Group by multiple interpolated variables

Once again, you can mix and match column selectors within Cols().

myvars = Cols(:a, 2)

@chain df begin
  @group_by(!!myvars)
  @summarize(c = mean(c))
end

GroupedDataFrame with 5 groups based on key: a
First Group (1 row): a = "a"
 Row │ a       b      c
     │ String  Int64  Float64
─────┼────────────────────────
   1 │ a           1     11.5
⋮
Last Group (2 rows): a = "e"
 Row │ a       b      c
     │ String  Int64  Float64
─────┼────────────────────────
   1 │ e           3     19.0
   2 │ e           4     20.0

Notice that the result remains grouped by a because @summarize() peeled off only one layer of grouping.
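
If you want a plain DataFrame instead, one option is to drop the remaining grouping explicitly. The sketch below assumes the @ungroup macro exported by TidierData.jl and reuses the data frame from this page.

```julia
using TidierData

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)

myvars = Cols(:a, 2)

result = @chain df begin
  @group_by(!!myvars)
  @summarize(c = mean(c))
  # Remove the grouping by `a` that remains after @summarize().
  @ungroup
end
```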

Global constants

You can also use !! interpolation to access global variables like pi.

df = DataFrame(radius = 1:5)

@chain df begin
  @mutate(area = !!pi * radius^2)
end
5×2 DataFrame
 Row │ radius  area
     │ Int64   Float64
─────┼──────────────────
   1 │      1   3.14159
   2 │      2  12.5664
   3 │      3  28.2743
   4 │      4  50.2655
   5 │      5  78.5398

As of v0.14.0, global constants defined within the Base or Core modules (like missing, pi, and Real) can be directly referenced without any !!.

@chain df begin
  @mutate(area = pi * radius^2)
end
5×2 DataFrame
 Row │ radius  area
     │ Int64   Float64
─────┼──────────────────
   1 │      1   3.14159
   2 │      2  12.5664
   3 │      3  28.2743
   4 │      4  50.2655
   5 │      5  78.5398

Alternative interpolation syntax

Since pi is accessible from the Main module (it is exported by Base), we can also refer to it as Main.pi.

@chain df begin
  @mutate(area = Main.pi * radius^2)
end
5×2 DataFrame
 Row │ radius  area
     │ Int64   Float64
─────┼──────────────────
   1 │      1   3.14159
   2 │      2  12.5664
   3 │      3  28.2743
   4 │      4  50.2655
   5 │      5  78.5398

The key lesson with interpolation is that any bare unquoted variable is assumed to refer to a column name in the DataFrame. If you are referring to any variable outside of the DataFrame, you need to either use !!variable or [Module_name_here].variable syntax to refer to this variable.

Note: You can use !! interpolation anywhere, including inside of functions and loops.

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)

for col in [:b, :c]
  @chain df begin
    @summarize(across(!!col, mean))
    println
  end
end
1×1 DataFrame
 Row │ b_mean
     │ Float64
─────┼─────────
   1 │     2.2
1×1 DataFrame
 Row │ c_mean
     │ Float64
─────┼─────────
   1 │    15.5
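
The loop above covers one half of the note. As a sketch of the other half, the same pattern can be wrapped in a function. The helper name column_mean and its arguments are hypothetical, and this assumes a TidierData.jl version in which !! can see local variables such as function arguments, as the note states.

```julia
using TidierData

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)

# Hypothetical helper: `col` is a local variable holding a column name as a Symbol.
function column_mean(data, col)
  @chain data begin
    @summarize(across(!!col, mean))
  end
end

result = column_mean(df, :c)
```

If interpolation resolves col inside the function body, result should match the c_mean output printed by the loop.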

This page was generated using Literate.jl.