Interpolation
The !!
("bang bang") operator can be used to interpolate values of variables from the parent environment into your code. This operator is borrowed from the R rlang
package. At some point, we may switch to using native Julia interpolation, but for a variety of reasons that introduce some complexity with native interpolation, we plan to continue to support !!
interpolation.
To interpolate multiple variables, the rlang
R package uses the !!!
"triple bang" operator. However, in TidierData.jl
, the !!
"bang bang" operator can be used to interpolate either single or multiple values as shown in the examples below.
Note: You can only interpolate values from variables in the parent environment. If you would like to interpolate column names, you have two options: you can either use across()
or you can use @aside
with @pull()
to create variables in the parent environment containing the values of those columns which can then be accessed using interpolatino.
myvar = :band
myvar = Cols(:a, :b)both refer to *columns* with those names. On the other hand,
myvar = "b",
myvar = ("a", "b")and
myvar = ["a", "b"]will interpolate the *values*. If you intend to interpolate column names, the preferred way is to use
Cols()` as in the examples below.
using TidierData
df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
b = [1,1,1,2,2,2,3,3,3,4],
c = 11:20)
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | a | 1 | 11 |
2 | a | 1 | 12 |
3 | b | 1 | 13 |
4 | b | 2 | 14 |
5 | c | 2 | 15 |
6 | c | 2 | 16 |
7 | d | 3 | 17 |
8 | d | 3 | 18 |
9 | e | 3 | 19 |
10 | e | 4 | 20 |
Select the column (because myvar
contains a symbol)¤
myvar = :b
@chain df begin
@select(!!myvar)
end
Row | b |
---|---|
Int64 | |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 2 |
5 | 2 |
6 | 2 |
7 | 3 |
8 | 3 |
9 | 3 |
10 | 4 |
Select multiple variables¤
You can also use a vector as in [:a, :b]
, but Cols()
is preferred because it lets you mix and match numbers.
myvars = Cols(:a, :b)
@chain df begin
@select(!!myvars)
end
Row | a | b |
---|---|---|
String | Int64 | |
1 | a | 1 |
2 | a | 1 |
3 | b | 1 |
4 | b | 2 |
5 | c | 2 |
6 | c | 2 |
7 | d | 3 |
8 | d | 3 |
9 | e | 3 |
10 | e | 4 |
This is the same as this...
myvars = Cols(:a, 2)
@chain df begin
@select(!!myvars)
end
Row | a | b |
---|---|---|
String | Int64 | |
1 | a | 1 |
2 | a | 1 |
3 | b | 1 |
4 | b | 2 |
5 | c | 2 |
6 | c | 2 |
7 | d | 3 |
8 | d | 3 |
9 | e | 3 |
10 | e | 4 |
Filter rows containing the value of myvar_string
¤
myvar_string = "b"
@chain df begin
@filter(a == !!myvar_string)
end
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | b | 1 | 13 |
2 | b | 2 | 14 |
Filtering rows works similarly using in
.¤
Note that for in
to work here, we have to wrap it in []
because otherwise, the string will be converted into a collection of characters, which are a different data type.
myvar_string = "b"
@chain df begin
@filter(a in [!!myvar_string])
end
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | b | 1 | 13 |
2 | b | 2 | 14 |
You can also use this for a vector (or tuple) of strings.¤
myvars_string = ["a", "b"]
@chain df begin
@filter(a in !!myvars_string)
end
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | a | 1 | 11 |
2 | a | 1 | 12 |
3 | b | 1 | 13 |
4 | b | 2 | 14 |
Mutate one variable¤
Remember: You cannot interpolate column names into @mutate()
expressions. However, you can create a temporary variable containing the values of the column in question or you can use @mutate()
with across()
.
Option 1: Create a temporary variable containing the values of the column.¤
myvar = :b
@chain df begin
@aside(myvar_values = @pull(_, !!myvar))
@mutate(d = !!myvar_values + 1)
end
Row | a | b | c | d |
---|---|---|---|---|
String | Int64 | Int64 | Int64 | |
1 | a | 1 | 11 | 2 |
2 | a | 1 | 12 | 2 |
3 | b | 1 | 13 | 2 |
4 | b | 2 | 14 | 3 |
5 | c | 2 | 15 | 3 |
6 | c | 2 | 16 | 3 |
7 | d | 3 | 17 | 4 |
8 | d | 3 | 18 | 4 |
9 | e | 3 | 19 | 4 |
10 | e | 4 | 20 | 5 |
Option 2: Use @mutate()
with across()
¤
Note: when using across()
, anonymous functions are not vectorized. This is intentional to allow users to specify their function exactly as desired.
@chain df begin
@mutate(across(!!myvar, x -> x .+ 1))
@rename(d = b_function)
end
Row | a | b | c | d |
---|---|---|---|---|
String | Int64 | Int64 | Int64 | |
1 | a | 1 | 11 | 2 |
2 | a | 1 | 12 | 2 |
3 | b | 1 | 13 | 2 |
4 | b | 2 | 14 | 3 |
5 | c | 2 | 15 | 3 |
6 | c | 2 | 16 | 3 |
7 | d | 3 | 17 | 4 |
8 | d | 3 | 18 | 4 |
9 | e | 3 | 19 | 4 |
10 | e | 4 | 20 | 5 |
Summarize across one variable¤
myvar = :b
@chain df begin
@summarize(across(!!myvar, mean))
end
Row | b_mean |
---|---|
Float64 | |
1 | 2.2 |
Summarize across multiple variables¤
myvars = Cols(:b, :c)
@chain df begin
@summarize(across(!!myvars, (mean, minimum, maximum)))
end
Row | b_mean | c_mean | b_minimum | c_minimum | b_maximum | c_maximum |
---|---|---|---|---|---|---|
Float64 | Float64 | Int64 | Int64 | Int64 | Int64 | |
1 | 2.2 | 15.5 | 1 | 11 | 4 | 20 |
Group by one interpolated variable¤
myvar = :a
@chain df begin
@group_by(!!myvar)
@summarize(c = mean(c))
end
Row | a | c |
---|---|---|
String | Float64 | |
1 | a | 11.5 |
2 | b | 13.5 |
3 | c | 15.5 |
4 | d | 17.5 |
5 | e | 19.5 |
Group by multiple interpolated variables¤
Once again, you can mix and match column selectors within Cols()
myvars = Cols(:a, 2)
@chain df begin
@group_by(!!myvars)
@summarize(c = mean(c))
end
GroupedDataFrame with 5 groups based on key: a
Row | a | b | c |
---|---|---|---|
String | Int64 | Float64 | |
1 | a | 1 | 11.5 |
⋮
Row | a | b | c |
---|---|---|---|
String | Int64 | Float64 | |
1 | e | 3 | 19.0 |
2 | e | 4 | 20.0 |
Notice that df
remains grouped by a
because the @summarize()
peeled off one layer of grouping.
Global constants¤
You can also use !!
interpolation to access global variables like pi
.
df = DataFrame(radius = 1:5)
@chain df begin
@mutate(area = !!pi * radius^2)
end
Row | radius | area |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.14159 |
2 | 2 | 12.5664 |
3 | 3 | 28.2743 |
4 | 4 | 50.2655 |
5 | 5 | 78.5398 |
As of v0.14.0, global constants defined within the Base or Core modules (like missing
, pi
, and Real
can be directly referenced without any !!
)
@chain df begin
@mutate(area = pi * radius^2)
end
Row | radius | area |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.14159 |
2 | 2 | 12.5664 |
3 | 3 | 28.2743 |
4 | 4 | 50.2655 |
5 | 5 | 78.5398 |
Alternative interpolation syntax¤
Since we know that pi
is defined in the Main
module, we can also access it using Main.pi
.
@chain df begin
@mutate(area = Main.pi * radius^2)
end
Row | radius | area |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.14159 |
2 | 2 | 12.5664 |
3 | 3 | 28.2743 |
4 | 4 | 50.2655 |
5 | 5 | 78.5398 |
The key lesson with interpolation is that any bare unquoted variable is assumed to refer to a column name in the DataFrame. If you are referring to any variable outside of the DataFrame, you need to either use !!variable
or [Module_name_here].variable
syntax to refer to this variable.
Note: You can use !!
interpolation anywhere, including inside of functions and loops.
df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
b = [1,1,1,2,2,2,3,3,3,4],
c = 11:20)
for col in [:b, :c]
@chain df begin
@summarize(across(!!col, mean))
println
end
end
1×1 DataFrame
Row │ b_mean
│ Float64
─────┼─────────
1 │ 2.2
1×1 DataFrame
Row │ c_mean
│ Float64
─────┼─────────
1 │ 15.5
This page was generated using Literate.jl.