Interpolation
Native (and preferred) method of interpolating using @eval
and $
¤
TidierData relies on "non-standard evaluation," which has the side effect of making interpolation slightly more complicated. For example, in the expression @mutate(df, a = b + 1)
, the df
refers to a data frame, while a
and b
refer to column names within the data frame. What would happen if you created a variable var
that contains the value :a
. Would this interpolated expression work?
using TidierData
df = DataFrame(a = 1:5, b = 6:10)
var = :a
@mutate(df, $var = b + 1)
Unfortunately, this does not work because it produces @mutate(df, :a = b + 1)
. Since TidierData uses bare variables (and not symbols) to refer to column names, this will result in an error. However, there is a slight modification we can apply to make this code work: prefixing it with an @eval
.
using TidierData
df = DataFrame(a = 1:5, b = 6:10, c = 11:15)
var = :a
@eval @mutate(df, $var = b + 1)
Row | a | b | c |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 7 | 6 | 11 |
2 | 8 | 7 | 12 |
3 | 9 | 8 | 13 |
4 | 10 | 9 | 14 |
5 | 11 | 10 | 15 |
Why does adding an @eval
to the beginning of the expression make interpolation work?¤
Adding @eval
to the beginning causes the interpolated expressions to be evaluated prior to be interpolated. So $var
, which contains the value :a
, is evaluated to a
, which produces the desired expression @mutate(df, a = b + 1)
. The need of @eval
here then is primarily because TidierData expects an a
rather than an :a
to refer to the column "a" in a data frame.
How can I use @eval
with a chained set of expressions?¤
The answer is simple: use @eval @chain
instead of @chain
.
var = :a
@eval @chain df begin
@select($var)
@mutate($var = $var + 1)
end
Row | a |
---|---|
Int64 | |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 6 |
If you want to select multiple variables, just use a ...
to splat the vector (or tuple) of variables.
vars = [:a, :b]
@eval @chain df begin
@select($vars...)
end
Row | a | b |
---|---|---|
Int64 | Int64 | |
1 | 1 | 6 |
2 | 2 | 7 |
3 | 3 | 8 |
4 | 4 | 9 |
5 | 5 | 10 |
The @eval
-based interpolation syntax is highly flexible in that it should work anywhere you might need it across the entire package.
@eval @chain df begin
@summarize(across($vars..., mean))
end
Row | a_mean | b_mean |
---|---|---|
Float64 | Float64 | |
1 | 3.0 | 8.0 |
Does @eval
work inside of user-defined functions?¤
Yes. Here's an example of how you could roll up a new select_new
function wrapping the @select
macros.
function select_new(df, columns...)
@eval @select(df, $columns...)
end
select_new(df, :a, :c)
Row | a | c |
---|---|---|
Int64 | Int64 | |
1 | 1 | 11 |
2 | 2 | 12 |
3 | 3 | 13 |
4 | 4 | 14 |
5 | 5 | 15 |
Yes. Here's another example of an add_one()
function that adds one to all numeric columns and returns the result in a new set of columns.
function add_one(df)
@eval @mutate(df, across(where(is_number), x -> x .+ 1))
end
add_one(df)
Row | a | b | c | a_function | b_function | c_function |
---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 1 | 6 | 11 | 2 | 7 | 12 |
2 | 2 | 7 | 12 | 3 | 8 | 13 |
3 | 3 | 8 | 13 | 4 | 9 | 14 |
4 | 4 | 9 | 14 | 5 | 10 | 15 |
5 | 5 | 10 | 15 | 6 | 11 | 16 |
Note: the below documentation is included here only for historical reasons. It will be removed in the future.¤
Superseded method of interpolating using the !!
("bang bang") operator¤
The !!
("bang bang") operator can be used to interpolate values of variables from the parent environment into your code. This operator is borrowed from the R rlang
package. At some point, we may switch to using native Julia interpolation, but for a variety of reasons that introduce some complexity with native interpolation, we plan to continue to support !!
interpolation.
To interpolate multiple variables, the rlang
R package uses the !!!
"triple bang" operator. However, in TidierData.jl
, the !!
"bang bang" operator can be used to interpolate either single or multiple values as shown in the examples below.
Note: You can only interpolate values from variables in the parent environment. If you would like to interpolate column names, you have two options: you can either use across()
or you can use @aside
with @pull()
to create variables in the parent environment containing the values of those columns which can then be accessed using interpolatino.
myvar = :band
myvar = Cols(:a, :b)both refer to *columns* with those names. On the other hand,
myvar = "b",
myvar = ("a", "b")and
myvar = ["a", "b"]will interpolate the *values*. If you intend to interpolate column names, the preferred way is to use
Cols()` as in the examples below.
using TidierData
df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
b = [1,1,1,2,2,2,3,3,3,4],
c = 11:20)
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | a | 1 | 11 |
2 | a | 1 | 12 |
3 | b | 1 | 13 |
4 | b | 2 | 14 |
5 | c | 2 | 15 |
6 | c | 2 | 16 |
7 | d | 3 | 17 |
8 | d | 3 | 18 |
9 | e | 3 | 19 |
10 | e | 4 | 20 |
Select the column (because myvar
contains a symbol)¤
myvar = :b
@chain df begin
@select(!!myvar)
end
Row | b |
---|---|
Int64 | |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 2 |
5 | 2 |
6 | 2 |
7 | 3 |
8 | 3 |
9 | 3 |
10 | 4 |
Select multiple variables¤
You can also use a vector as in [:a, :b]
, but Cols()
is preferred because it lets you mix and match numbers.
myvars = Cols(:a, :b)
@chain df begin
@select(!!myvars)
end
Row | a | b |
---|---|---|
String | Int64 | |
1 | a | 1 |
2 | a | 1 |
3 | b | 1 |
4 | b | 2 |
5 | c | 2 |
6 | c | 2 |
7 | d | 3 |
8 | d | 3 |
9 | e | 3 |
10 | e | 4 |
This is the same as this...
myvars = Cols(:a, 2)
@chain df begin
@select(!!myvars)
end
Row | a | b |
---|---|---|
String | Int64 | |
1 | a | 1 |
2 | a | 1 |
3 | b | 1 |
4 | b | 2 |
5 | c | 2 |
6 | c | 2 |
7 | d | 3 |
8 | d | 3 |
9 | e | 3 |
10 | e | 4 |
Filter rows containing the value of myvar_string
¤
myvar_string = "b"
@chain df begin
@filter(a == !!myvar_string)
end
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | b | 1 | 13 |
2 | b | 2 | 14 |
Filtering rows works similarly using in
.¤
Note that for in
to work here, we have to wrap it in []
because otherwise, the string will be converted into a collection of characters, which are a different data type.
myvar_string = "b"
@chain df begin
@filter(a in [!!myvar_string])
end
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | b | 1 | 13 |
2 | b | 2 | 14 |
You can also use this for a vector (or tuple) of strings.¤
myvars_string = ["a", "b"]
@chain df begin
@filter(a in !!myvars_string)
end
Row | a | b | c |
---|---|---|---|
String | Int64 | Int64 | |
1 | a | 1 | 11 |
2 | a | 1 | 12 |
3 | b | 1 | 13 |
4 | b | 2 | 14 |
Mutate one variable¤
Remember: You cannot interpolate column names into @mutate()
expressions. However, you can create a temporary variable containing the values of the column in question or you can use @mutate()
with across()
.
Option 1: Create a temporary variable containing the values of the column.¤
myvar = :b
@chain df begin
@aside(myvar_values = @pull(_, !!myvar))
@mutate(d = !!myvar_values + 1)
end
Row | a | b | c | d |
---|---|---|---|---|
String | Int64 | Int64 | Int64 | |
1 | a | 1 | 11 | 2 |
2 | a | 1 | 12 | 2 |
3 | b | 1 | 13 | 2 |
4 | b | 2 | 14 | 3 |
5 | c | 2 | 15 | 3 |
6 | c | 2 | 16 | 3 |
7 | d | 3 | 17 | 4 |
8 | d | 3 | 18 | 4 |
9 | e | 3 | 19 | 4 |
10 | e | 4 | 20 | 5 |
Option 2: Use @mutate()
with across()
¤
Note: when using across()
, anonymous functions are not vectorized. This is intentional to allow users to specify their function exactly as desired.
@chain df begin
@mutate(across(!!myvar, x -> x .+ 1))
@rename(d = b_function)
end
Row | a | b | c | d |
---|---|---|---|---|
String | Int64 | Int64 | Int64 | |
1 | a | 1 | 11 | 2 |
2 | a | 1 | 12 | 2 |
3 | b | 1 | 13 | 2 |
4 | b | 2 | 14 | 3 |
5 | c | 2 | 15 | 3 |
6 | c | 2 | 16 | 3 |
7 | d | 3 | 17 | 4 |
8 | d | 3 | 18 | 4 |
9 | e | 3 | 19 | 4 |
10 | e | 4 | 20 | 5 |
Summarize across one variable¤
myvar = :b
@chain df begin
@summarize(across(!!myvar, mean))
end
Row | b_mean |
---|---|
Float64 | |
1 | 2.2 |
Summarize across multiple variables¤
myvars = Cols(:b, :c)
@chain df begin
@summarize(across(!!myvars, (mean, minimum, maximum)))
end
Row | b_mean | c_mean | b_minimum | c_minimum | b_maximum | c_maximum |
---|---|---|---|---|---|---|
Float64 | Float64 | Int64 | Int64 | Int64 | Int64 | |
1 | 2.2 | 15.5 | 1 | 11 | 4 | 20 |
Group by one interpolated variable¤
myvar = :a
@chain df begin
@group_by(!!myvar)
@summarize(c = mean(c))
end
Row | a | c |
---|---|---|
String | Float64 | |
1 | a | 11.5 |
2 | b | 13.5 |
3 | c | 15.5 |
4 | d | 17.5 |
5 | e | 19.5 |
Group by multiple interpolated variables¤
Once again, you can mix and match column selectors within Cols()
myvars = Cols(:a, 2)
@chain df begin
@group_by(!!myvars)
@summarize(c = mean(c))
end
GroupedDataFrame with 5 groups based on key: a
Row | a | b | c |
---|---|---|---|
String | Int64 | Float64 | |
1 | a | 1 | 11.5 |
⋮
Row | a | b | c |
---|---|---|---|
String | Int64 | Float64 | |
1 | e | 3 | 19.0 |
2 | e | 4 | 20.0 |
Notice that df
remains grouped by a
because the @summarize()
peeled off one layer of grouping.
Global constants¤
You can also use !!
interpolation to access global variables like pi
.
df = DataFrame(radius = 1:5)
@chain df begin
@mutate(area = !!pi * radius^2)
end
Row | radius | area |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.14159 |
2 | 2 | 12.5664 |
3 | 3 | 28.2743 |
4 | 4 | 50.2655 |
5 | 5 | 78.5398 |
As of v0.14.0, global constants defined within the Base or Core modules (like missing
, pi
, and Real
can be directly referenced without any !!
)
@chain df begin
@mutate(area = pi * radius^2)
end
Row | radius | area |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.14159 |
2 | 2 | 12.5664 |
3 | 3 | 28.2743 |
4 | 4 | 50.2655 |
5 | 5 | 78.5398 |
Alternative interpolation syntax¤
Since we know that pi
is defined in the Main
module, we can also access it using Main.pi
.
@chain df begin
@mutate(area = Main.pi * radius^2)
end
Row | radius | area |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.14159 |
2 | 2 | 12.5664 |
3 | 3 | 28.2743 |
4 | 4 | 50.2655 |
5 | 5 | 78.5398 |
The key lesson with interpolation is that any bare unquoted variable is assumed to refer to a column name in the DataFrame. If you are referring to any variable outside of the DataFrame, you need to either use !!variable
or [Module_name_here].variable
syntax to refer to this variable.
Note: You can use !!
interpolation anywhere, including inside of functions and loops.
df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
b = [1,1,1,2,2,2,3,3,3,4],
c = 11:20)
for col in [:b, :c]
@chain df begin
@summarize(across(!!col, mean))
println
end
end
1×1 DataFrame
Row │ b_mean
│ Float64
─────┼─────────
1 │ 2.2
1×1 DataFrame
Row │ c_mean
│ Float64
─────┼─────────
1 │ 15.5
This page was generated using Literate.jl.