Interpolation

Native (and preferred) method of interpolating using `@eval` and `$`¤

TidierData relies on "non-standard evaluation," which has the side effect of making interpolation slightly more complicated. For example, in the expression @mutate(df, a = b + 1), the df refers to a data frame, while a and b refer to column names within the data frame. What would happen if you created a variable var that contains the value :a. Would this interpolated expression work?

using TidierData
df = DataFrame(a = 1:5, b = 6:10)

var = :a
@mutate(df, $var = b + 1)

Unfortunately, this does not work because it produces @mutate(df, :a = b + 1). Since TidierData uses bare variables (and not symbols) to refer to column names, this will result in an error. However, there is a slight modification we can apply to make this code work: prefixing it with an @eval.

using TidierData
df = DataFrame(a = 1:5, b = 6:10, c = 11:15)

var = :a
@eval @mutate(df, $var = b + 1)

5×3 DataFrame

Row	a	b	c
	Int64	Int64	Int64
1	7	6	11
2	8	7	12
3	9	8	13
4	10	9	14
5	11	10	15

Why does adding an `@eval` to the beginning of the expression make interpolation work?¤

Adding @eval to the beginning causes the interpolated expressions to be evaluated prior to be interpolated. So $var, which contains the value :a, is evaluated to a, which produces the desired expression @mutate(df, a = b + 1). The need of @eval here then is primarily because TidierData expects an a rather than an :a to refer to the column "a" in a data frame.

How can I use `@eval` with a chained set of expressions?¤

The answer is simple: use @eval @chain instead of @chain.

var = :a

@eval @chain df begin
  @select($var)
  @mutate($var = $var + 1)
end

5×1 DataFrame

Row	a
	Int64
1	2
2	3
3	4
4	5
5	6

If you want to select multiple variables, just use a ... to splat the vector (or tuple) of variables.

vars = [:a, :b]

@eval @chain df begin
  @select($vars...)
end

5×2 DataFrame

Row	a	b
	Int64	Int64
1	1	6
2	2	7
3	3	8
4	4	9
5	5	10

The @eval-based interpolation syntax is highly flexible in that it should work anywhere you might need it across the entire package.

@eval @chain df begin
  @summarize(across($vars..., mean))
end

1×2 DataFrame

Row	a_mean	b_mean
	Float64	Float64
1	3.0	8.0

Does `@eval` work inside of user-defined functions?¤

Yes. Here's an example of how you could roll up a new select_new function wrapping the @select macros.

function select_new(df, columns...)
  @eval @select(df, $columns...)
end

select_new(df, :a, :c)

5×2 DataFrame

Row	a	c
	Int64	Int64
1	1	11
2	2	12
3	3	13
4	4	14
5	5	15

Yes. Here's another example of an add_one() function that adds one to all numeric columns and returns the result in a new set of columns.

function add_one(df)
  @eval @mutate(df, across(where(is_number), x -> x .+ 1))
end

add_one(df)

5×6 DataFrame

Row	a	b	c	a_function	b_function	c_function
	Int64	Int64	Int64	Int64	Int64	Int64
1	1	6	11	2	7	12
2	2	7	12	3	8	13
3	3	8	13	4	9	14
4	4	9	14	5	10	15
5	5	10	15	6	11	16

Note: the below documentation is included here only for historical reasons. It will be removed in the future.¤

Superseded method of interpolating using the `!!` ("bang bang") operator¤

The !! ("bang bang") operator can be used to interpolate values of variables from the parent environment into your code. This operator is borrowed from the R rlang package. At some point, we may switch to using native Julia interpolation, but for a variety of reasons that introduce some complexity with native interpolation, we plan to continue to support !! interpolation.

To interpolate multiple variables, the rlang R package uses the !!! "triple bang" operator. However, in TidierData.jl, the !! "bang bang" operator can be used to interpolate either single or multiple values as shown in the examples below.

Note: You can only interpolate values from variables in the parent environment. If you would like to interpolate column names, you have two options: you can either use across() or you can use @aside with @pull() to create variables in the parent environment containing the values of those columns which can then be accessed using interpolatino.

myvar = :bandmyvar = Cols(:a, :b)both refer to *columns* with those names. On the other hand,myvar = "b",myvar = ("a", "b")andmyvar = ["a", "b"]will interpolate the *values*. If you intend to interpolate column names, the preferred way is to useCols()` as in the examples below.

using TidierData

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)

10×3 DataFrame

Row	a	b	c
	String	Int64	Int64
1	a	1	11
2	a	1	12
3	b	1	13
4	b	2	14
5	c	2	15
6	c	2	16
7	d	3	17
8	d	3	18
9	e	3	19
10	e	4	20

Select the column (because `myvar` contains a symbol)¤

myvar = :b

@chain df begin
  @select(!!myvar)
end

10×1 DataFrame

Row	b
	Int64
1	1
2	1
3	1
4	2
5	2
6	2
7	3
8	3
9	3
10	4

Select multiple variables¤

You can also use a vector as in [:a, :b], but Cols() is preferred because it lets you mix and match numbers.

myvars = Cols(:a, :b)

@chain df begin
  @select(!!myvars)
end

10×2 DataFrame

Row	a	b
	String	Int64
1	a	1
2	a	1
3	b	1
4	b	2
5	c	2
6	c	2
7	d	3
8	d	3
9	e	3
10	e	4

This is the same as this...

myvars = Cols(:a, 2)

@chain df begin
  @select(!!myvars)
end

10×2 DataFrame

Row	a	b
	String	Int64
1	a	1
2	a	1
3	b	1
4	b	2
5	c	2
6	c	2
7	d	3
8	d	3
9	e	3
10	e	4

Filter rows containing the value of `myvar_string`¤

myvar_string = "b"

@chain df begin
  @filter(a == !!myvar_string)
end

2×3 DataFrame

Row	a	b	c
	String	Int64	Int64
1	b	1	13
2	b	2	14

Filtering rows works similarly using `in`.¤

Note that for in to work here, we have to wrap it in [] because otherwise, the string will be converted into a collection of characters, which are a different data type.

myvar_string = "b"

@chain df begin
  @filter(a in [!!myvar_string])
end

2×3 DataFrame

Row	a	b	c
	String	Int64	Int64
1	b	1	13
2	b	2	14

You can also use this for a vector (or tuple) of strings.¤

myvars_string = ["a", "b"]

@chain df begin
  @filter(a in !!myvars_string)
end

4×3 DataFrame

Row	a	b	c
	String	Int64	Int64
1	a	1	11
2	a	1	12
3	b	1	13
4	b	2	14

Mutate one variable¤

Remember: You cannot interpolate column names into @mutate() expressions. However, you can create a temporary variable containing the values of the column in question or you can use @mutate() with across().

Option 1: Create a temporary variable containing the values of the column.¤

myvar = :b

@chain df begin
  @aside(myvar_values = @pull(_, !!myvar))
  @mutate(d = !!myvar_values + 1)
end

10×4 DataFrame

Row	a	b	c	d
	String	Int64	Int64	Int64
1	a	1	11	2
2	a	1	12	2
3	b	1	13	2
4	b	2	14	3
5	c	2	15	3
6	c	2	16	3
7	d	3	17	4
8	d	3	18	4
9	e	3	19	4
10	e	4	20	5

Option 2: Use `@mutate()` with `across()`¤

Note: when using across(), anonymous functions are not vectorized. This is intentional to allow users to specify their function exactly as desired.

@chain df begin
  @mutate(across(!!myvar, x -> x .+ 1))
  @rename(d = b_function)
end

10×4 DataFrame

Row	a	b	c	d
	String	Int64	Int64	Int64
1	a	1	11	2
2	a	1	12	2
3	b	1	13	2
4	b	2	14	3
5	c	2	15	3
6	c	2	16	3
7	d	3	17	4
8	d	3	18	4
9	e	3	19	4
10	e	4	20	5

Summarize across one variable¤

myvar = :b

@chain df begin
  @summarize(across(!!myvar, mean))
end

1×1 DataFrame

Row	b_mean
	Float64
1	2.2

Summarize across multiple variables¤

myvars = Cols(:b, :c)

@chain df begin
  @summarize(across(!!myvars, (mean, minimum, maximum)))
end

1×6 DataFrame

Row	b_mean	c_mean	b_minimum	c_minimum	b_maximum	c_maximum
	Float64	Float64	Int64	Int64	Int64	Int64
1	2.2	15.5	1	11	4	20

Group by one interpolated variable¤

myvar = :a

@chain df begin
  @group_by(!!myvar)
  @summarize(c = mean(c))
end

5×2 DataFrame

Row	a	c
	String	Float64
1	a	11.5
2	b	13.5
3	c	15.5
4	d	17.5
5	e	19.5

Group by multiple interpolated variables¤

Once again, you can mix and match column selectors within Cols()

myvars = Cols(:a, 2)

@chain df begin
  @group_by(!!myvars)
  @summarize(c = mean(c))
end

GroupedDataFrame with 5 groups based on key: a

First Group (1 row): a = "a"

Row	a	b	c
	String	Int64	Float64
1	a	1	11.5

&vellip;

Last Group (2 rows): a = "e"

Row	a	b	c
	String	Int64	Float64
1	e	3	19.0
2	e	4	20.0

Notice that df remains grouped by a because the @summarize() peeled off one layer of grouping.

Global constants¤

You can also use !! interpolation to access global variables like pi.

df = DataFrame(radius = 1:5)

@chain df begin
  @mutate(area = !!pi * radius^2)
end

5×2 DataFrame

Row	radius	area
	Int64	Float64
1	1	3.14159
2	2	12.5664
3	3	28.2743
4	4	50.2655
5	5	78.5398

As of v0.14.0, global constants defined within the Base or Core modules (like missing, pi, and Real can be directly referenced without any !!)

@chain df begin
  @mutate(area = pi * radius^2)
end

5×2 DataFrame

Row	radius	area
	Int64	Float64
1	1	3.14159
2	2	12.5664
3	3	28.2743
4	4	50.2655
5	5	78.5398

Alternative interpolation syntax¤

Since we know that pi is defined in the Main module, we can also access it using Main.pi.

@chain df begin
  @mutate(area = Main.pi * radius^2)
end

5×2 DataFrame

Row	radius	area
	Int64	Float64
1	1	3.14159
2	2	12.5664
3	3	28.2743
4	4	50.2655
5	5	78.5398

The key lesson with interpolation is that any bare unquoted variable is assumed to refer to a column name in the DataFrame. If you are referring to any variable outside of the DataFrame, you need to either use !!variable or [Module_name_here].variable syntax to refer to this variable.

Note: You can use !! interpolation anywhere, including inside of functions and loops.

df = DataFrame(a = string.(repeat('a':'e', inner = 2)),
               b = [1,1,1,2,2,2,3,3,3,4],
               c = 11:20)

for col in [:b, :c]
  @chain df begin
    @summarize(across(!!col, mean))
    println
  end
end

1×1 DataFrame
 Row │ b_mean
     │ Float64
─────┼─────────
   1 │     2.2
1×1 DataFrame
 Row │ c_mean
     │ Float64
─────┼─────────
   1 │    15.5

This page was generated using Literate.jl.

Interpolation

Native (and preferred) method of interpolating using @eval and $¤

Why does adding an @eval to the beginning of the expression make interpolation work?¤

How can I use @eval with a chained set of expressions?¤

Does @eval work inside of user-defined functions?¤

Note: the below documentation is included here only for historical reasons. It will be removed in the future.¤

Superseded method of interpolating using the !! ("bang bang") operator¤

Select the column (because myvar contains a symbol)¤

Select multiple variables¤

Filter rows containing the value of myvar_string¤

Filtering rows works similarly using in.¤

You can also use this for a vector (or tuple) of strings.¤

Mutate one variable¤

Option 1: Create a temporary variable containing the values of the column.¤

Option 2: Use @mutate() with across()¤

Summarize across one variable¤

Summarize across multiple variables¤

Group by one interpolated variable¤

Group by multiple interpolated variables¤

Global constants¤

Alternative interpolation syntax¤

Native (and preferred) method of interpolating using `@eval` and `$`¤

Why does adding an `@eval` to the beginning of the expression make interpolation work?¤

How can I use `@eval` with a chained set of expressions?¤

Does `@eval` work inside of user-defined functions?¤

Superseded method of interpolating using the `!!` ("bang bang") operator¤

Select the column (because `myvar` contains a symbol)¤

Filter rows containing the value of `myvar_string`¤

Filtering rows works similarly using `in`.¤

Option 2: Use `@mutate()` with `across()`¤