Interpolation
Native (and preferred) method of interpolating using @eval and $¤
TidierData relies on "non-standard evaluation," which has the side effect of making interpolation slightly more complicated. For example, in the expression @mutate(df, a = b + 1), the df refers to a data frame, while a and b refer to column names within the data frame. What would happen if you created a variable var that contains the value :a. Would this interpolated expression work?
using TidierData
df = DataFrame(a = 1:5, b = 6:10)
var = :a
@mutate(df, $var = b + 1)
Unfortunately, this does not work because it produces @mutate(df, :a = b + 1). Since TidierData uses bare variables (and not symbols) to refer to column names, this will result in an error. However, there is a slight modification we can apply to make this code work: prefixing it with an @eval.
using TidierData
df = DataFrame(a = 1:5, b = 6:10, c = 11:15)
var = :a
@eval @mutate(df, $var = b + 1)
| Row | a | b | c |
|---|---|---|---|
| Int64 | Int64 | Int64 | |
| 1 | 7 | 6 | 11 |
| 2 | 8 | 7 | 12 |
| 3 | 9 | 8 | 13 |
| 4 | 10 | 9 | 14 |
| 5 | 11 | 10 | 15 |
Why does adding an @eval to the beginning of the expression make interpolation work?¤
Adding @eval to the beginning causes the interpolated expressions to be evaluated prior to be interpolated. So $var, which contains the value :a, is evaluated to a, which produces the desired expression @mutate(df, a = b + 1). The need of @eval here then is primarily because TidierData expects an a rather than an :a to refer to the column "a" in a data frame.
How can I use @eval with a chained set of expressions?¤
The answer is simple: use @eval @chain instead of @chain.
var = :a
@eval @chain df begin
@select($var)
@mutate($var = $var + 1)
end
| Row | a |
|---|---|
| Int64 | |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
If you want to select multiple variables, just use a ... to splat the vector (or tuple) of variables.
vars = [:a, :b]
@eval @chain df begin
@select($vars...)
end
| Row | a | b |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 6 |
| 2 | 2 | 7 |
| 3 | 3 | 8 |
| 4 | 4 | 9 |
| 5 | 5 | 10 |
The @eval-based interpolation syntax is highly flexible in that it should work anywhere you might need it across the entire package.
@eval @chain df begin
@summarize(across($vars..., mean))
end
| Row | a_mean | b_mean |
|---|---|---|
| Float64 | Float64 | |
| 1 | 3.0 | 8.0 |
Does @eval work inside of user-defined functions?¤
Yes. Here's an example of how you could roll up a new select_new function wrapping the @select macros.
function select_new(df, columns...)
@eval @select(df, $columns...)
end
select_new(df, :a, :c)
| Row | a | c |
|---|---|---|
| Int64 | Int64 | |
| 1 | 1 | 11 |
| 2 | 2 | 12 |
| 3 | 3 | 13 |
| 4 | 4 | 14 |
| 5 | 5 | 15 |
Yes. Here's another example of an add_one() function that adds one to all numeric columns and returns the result in a new set of columns.
function add_one(df)
@eval @mutate(df, across(where(is_number), x -> x .+ 1))
end
add_one(df)
| Row | a | b | c | a_function | b_function | c_function |
|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 6 | 11 | 2 | 7 | 12 |
| 2 | 2 | 7 | 12 | 3 | 8 | 13 |
| 3 | 3 | 8 | 13 | 4 | 9 | 14 |
| 4 | 4 | 9 | 14 | 5 | 10 | 15 |
| 5 | 5 | 10 | 15 | 6 | 11 | 16 |
Note: the below documentation is included here only for historical reasons. It will be removed in the future.¤
Superseded method of interpolating using the !! ("bang bang") operator¤
The !! ("bang bang") operator can be used to interpolate values of variables from the parent environment into your code. This operator is borrowed from the R rlang package. At some point, we may switch to using native Julia interpolation, but for a variety of reasons that introduce some complexity with native interpolation, we plan to continue to support !! interpolation.
To interpolate multiple variables, the rlang R package uses the !!! "triple bang" operator. However, in TidierData.jl, the !! "bang bang" operator can be used to interpolate either single or multiple values as shown in the examples below.
Note: You can only interpolate values from variables in the parent environment. If you would like to interpolate column names, you have two options: you can either use across() or you can use @aside with @pull() to create variables in the parent environment containing the values of those columns which can then be accessed using interpolatino.
myvar = :bandmyvar = Cols(:a, :b)both refer to *columns* with those names. On the other hand,myvar = "b",myvar = ("a", "b")andmyvar = ["a", "b"]will interpolate the *values*. If you intend to interpolate column names, the preferred way is to useCols()` as in the examples below.
using TidierData
df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
b = [1,1,1,2,2,2,3,3,3,4],
c = 11:20)
| Row | a | b | c |
|---|---|---|---|
| String | Int64 | Int64 | |
| 1 | a | 1 | 11 |
| 2 | a | 1 | 12 |
| 3 | b | 1 | 13 |
| 4 | b | 2 | 14 |
| 5 | c | 2 | 15 |
| 6 | c | 2 | 16 |
| 7 | d | 3 | 17 |
| 8 | d | 3 | 18 |
| 9 | e | 3 | 19 |
| 10 | e | 4 | 20 |
Select the column (because myvar contains a symbol)¤
myvar = :b
@chain df begin
@select(!!myvar)
end
| Row | b |
|---|---|
| Int64 | |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
| 6 | 2 |
| 7 | 3 |
| 8 | 3 |
| 9 | 3 |
| 10 | 4 |
Select multiple variables¤
You can also use a vector as in [:a, :b], but Cols() is preferred because it lets you mix and match numbers.
myvars = Cols(:a, :b)
@chain df begin
@select(!!myvars)
end
| Row | a | b |
|---|---|---|
| String | Int64 | |
| 1 | a | 1 |
| 2 | a | 1 |
| 3 | b | 1 |
| 4 | b | 2 |
| 5 | c | 2 |
| 6 | c | 2 |
| 7 | d | 3 |
| 8 | d | 3 |
| 9 | e | 3 |
| 10 | e | 4 |
This is the same as this...
myvars = Cols(:a, 2)
@chain df begin
@select(!!myvars)
end
| Row | a | b |
|---|---|---|
| String | Int64 | |
| 1 | a | 1 |
| 2 | a | 1 |
| 3 | b | 1 |
| 4 | b | 2 |
| 5 | c | 2 |
| 6 | c | 2 |
| 7 | d | 3 |
| 8 | d | 3 |
| 9 | e | 3 |
| 10 | e | 4 |
Filter rows containing the value of myvar_string¤
myvar_string = "b"
@chain df begin
@filter(a == !!myvar_string)
end
| Row | a | b | c |
|---|---|---|---|
| String | Int64 | Int64 | |
| 1 | b | 1 | 13 |
| 2 | b | 2 | 14 |
Filtering rows works similarly using in.¤
Note that for in to work here, we have to wrap it in [] because otherwise, the string will be converted into a collection of characters, which are a different data type.
myvar_string = "b"
@chain df begin
@filter(a in [!!myvar_string])
end
| Row | a | b | c |
|---|---|---|---|
| String | Int64 | Int64 | |
| 1 | b | 1 | 13 |
| 2 | b | 2 | 14 |
You can also use this for a vector (or tuple) of strings.¤
myvars_string = ["a", "b"]
@chain df begin
@filter(a in !!myvars_string)
end
| Row | a | b | c |
|---|---|---|---|
| String | Int64 | Int64 | |
| 1 | a | 1 | 11 |
| 2 | a | 1 | 12 |
| 3 | b | 1 | 13 |
| 4 | b | 2 | 14 |
Mutate one variable¤
Remember: You cannot interpolate column names into @mutate() expressions. However, you can create a temporary variable containing the values of the column in question or you can use @mutate() with across().
Option 1: Create a temporary variable containing the values of the column.¤
myvar = :b
@chain df begin
@aside(myvar_values = @pull(_, !!myvar))
@mutate(d = !!myvar_values + 1)
end
| Row | a | b | c | d |
|---|---|---|---|---|
| String | Int64 | Int64 | Int64 | |
| 1 | a | 1 | 11 | 2 |
| 2 | a | 1 | 12 | 2 |
| 3 | b | 1 | 13 | 2 |
| 4 | b | 2 | 14 | 3 |
| 5 | c | 2 | 15 | 3 |
| 6 | c | 2 | 16 | 3 |
| 7 | d | 3 | 17 | 4 |
| 8 | d | 3 | 18 | 4 |
| 9 | e | 3 | 19 | 4 |
| 10 | e | 4 | 20 | 5 |
Option 2: Use @mutate() with across()¤
Note: when using across(), anonymous functions are not vectorized. This is intentional to allow users to specify their function exactly as desired.
@chain df begin
@mutate(across(!!myvar, x -> x .+ 1))
@rename(d = b_function)
end
| Row | a | b | c | d |
|---|---|---|---|---|
| String | Int64 | Int64 | Int64 | |
| 1 | a | 1 | 11 | 2 |
| 2 | a | 1 | 12 | 2 |
| 3 | b | 1 | 13 | 2 |
| 4 | b | 2 | 14 | 3 |
| 5 | c | 2 | 15 | 3 |
| 6 | c | 2 | 16 | 3 |
| 7 | d | 3 | 17 | 4 |
| 8 | d | 3 | 18 | 4 |
| 9 | e | 3 | 19 | 4 |
| 10 | e | 4 | 20 | 5 |
Summarize across one variable¤
myvar = :b
@chain df begin
@summarize(across(!!myvar, mean))
end
| Row | b_mean |
|---|---|
| Float64 | |
| 1 | 2.2 |
Summarize across multiple variables¤
myvars = Cols(:b, :c)
@chain df begin
@summarize(across(!!myvars, (mean, minimum, maximum)))
end
| Row | b_mean | c_mean | b_minimum | c_minimum | b_maximum | c_maximum |
|---|---|---|---|---|---|---|
| Float64 | Float64 | Int64 | Int64 | Int64 | Int64 | |
| 1 | 2.2 | 15.5 | 1 | 11 | 4 | 20 |
Group by one interpolated variable¤
myvar = :a
@chain df begin
@group_by(!!myvar)
@summarize(c = mean(c))
end
| Row | a | c |
|---|---|---|
| String | Float64 | |
| 1 | a | 11.5 |
| 2 | b | 13.5 |
| 3 | c | 15.5 |
| 4 | d | 17.5 |
| 5 | e | 19.5 |
Group by multiple interpolated variables¤
Once again, you can mix and match column selectors within Cols()
myvars = Cols(:a, 2)
@chain df begin
@group_by(!!myvars)
@summarize(c = mean(c))
end
GroupedDataFrame with 5 groups based on key: a
| Row | a | b | c |
|---|---|---|---|
| String | Int64 | Float64 | |
| 1 | a | 1 | 11.5 |
⋮
| Row | a | b | c |
|---|---|---|---|
| String | Int64 | Float64 | |
| 1 | e | 3 | 19.0 |
| 2 | e | 4 | 20.0 |
Notice that df remains grouped by a because the @summarize() peeled off one layer of grouping.
Global constants¤
You can also use !! interpolation to access global variables like pi.
df = DataFrame(radius = 1:5)
@chain df begin
@mutate(area = !!pi * radius^2)
end
| Row | radius | area |
|---|---|---|
| Int64 | Float64 | |
| 1 | 1 | 3.14159 |
| 2 | 2 | 12.5664 |
| 3 | 3 | 28.2743 |
| 4 | 4 | 50.2655 |
| 5 | 5 | 78.5398 |
As of v0.14.0, global constants defined within the Base or Core modules (like missing, pi, and Real can be directly referenced without any !!)
@chain df begin
@mutate(area = pi * radius^2)
end
| Row | radius | area |
|---|---|---|
| Int64 | Float64 | |
| 1 | 1 | 3.14159 |
| 2 | 2 | 12.5664 |
| 3 | 3 | 28.2743 |
| 4 | 4 | 50.2655 |
| 5 | 5 | 78.5398 |
Alternative interpolation syntax¤
Since we know that pi is defined in the Main module, we can also access it using Main.pi.
@chain df begin
@mutate(area = Main.pi * radius^2)
end
| Row | radius | area |
|---|---|---|
| Int64 | Float64 | |
| 1 | 1 | 3.14159 |
| 2 | 2 | 12.5664 |
| 3 | 3 | 28.2743 |
| 4 | 4 | 50.2655 |
| 5 | 5 | 78.5398 |
The key lesson with interpolation is that any bare unquoted variable is assumed to refer to a column name in the DataFrame. If you are referring to any variable outside of the DataFrame, you need to either use !!variable or [Module_name_here].variable syntax to refer to this variable.
Note: You can use !! interpolation anywhere, including inside of functions and loops.
df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
b = [1,1,1,2,2,2,3,3,3,4],
c = 11:20)
for col in [:b, :c]
@chain df begin
@summarize(across(!!col, mean))
println
end
end
1×1 DataFrame
Row │ b_mean
│ Float64
─────┼─────────
1 │ 2.2
1×1 DataFrame
Row │ c_mean
│ Float64
─────┼─────────
1 │ 15.5
This page was generated using Literate.jl.