Auto-vectorization

Tidier.jl uses a lookup table to decide which functions not to vectorize. For example, mean() is listed as a function that should never be vectorized. In addition, no function used inside @summarize() is ever automatically vectorized. Any function that is not on this list and is used outside of @summarize() is automatically vectorized.
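To make the lookup-table idea concrete, here is a minimal sketch in plain Julia of how an expression could be rewritten into its broadcast form. This is not Tidier.jl's actual implementation: the names NOT_VECTORIZED and vectorize, and the contents of the set, are hypothetical and chosen only for illustration.

```julia
# Hypothetical lookup table of functions that should never be vectorized.
# The real list used by Tidier.jl is not reproduced here.
const NOT_VECTORIZED = Set([:mean, :median, :sum])

# Symbols and literals pass through unchanged.
vectorize(x) = x

# Recursively rewrite calls f(args...) into broadcast form unless f is
# listed in the lookup table.
function vectorize(ex::Expr)
    args = map(vectorize, ex.args)
    if ex.head == :call && args[1] isa Symbol && !(args[1] in NOT_VECTORIZED)
        f = args[1]
        if startswith(string(f), ".")
            # Already a dotted operator such as .- ; leave it alone.
            return Expr(:call, args...)
        elseif Base.isoperator(f)
            # Operators become their dotted counterpart: - becomes .-
            return Expr(:call, Symbol(".", f), args[2:end]...)
        else
            # Ordinary functions become f.(args...)
            return Expr(:., f, Expr(:tuple, args[2:end]...))
        end
    end
    return Expr(ex.head, args...)
end

listed   = vectorize(:(c - mean(c)))      # :(c .- mean(c))
unlisted = vectorize(:(c - new_mean(c)))  # :(c .- new_mean.(c))
```

In this sketch, mean(c) is left alone because :mean is in the table, while the unlisted new_mean(c) becomes new_mean.(c). The subtraction is dotted in both cases.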

This "auto-vectorization" makes working with Tidier.jl more R-like and convenient. However, if you define your own function and use it, Tidier.jl may vectorize it when you did not intend that. To prevent auto-vectorization, prefix the function call with a ~.

using Tidier
using RDatasets

df = DataFrame(a = repeat('a':'e', inner = 2), b = [1,1,1,2,2,2,3,3,3,4], c = 11:20)
10×3 DataFrame
 Row │ a     b      c
     │ Char  Int64  Int64
─────┼────────────────────
   1 │ a         1     11
   2 │ a         1     12
   3 │ b         1     13
   4 │ b         2     14
   5 │ c         2     15
   6 │ c         2     16
   7 │ d         3     17
   8 │ d         3     18
   9 │ e         3     19
  10 │ e         4     20

For example, let's define a function new_mean() that calculates a mean.

new_mean(exprs...) = mean(exprs...)
new_mean (generic function with 1 method)

If we try to use new_mean() inside of @mutate(), it gives the wrong result. Because new_mean() is not in the lookup table, it is vectorized, so the mean is calculated element-wise. The mean of a single element is the element itself, so d is 0.0 in every row, which is almost never what we want.

@chain df begin
    @mutate(d = c - new_mean(c))
end
10×4 DataFrame
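 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼─────────────────────────────
   1 │ a         1     11      0.0
   2 │ a         1     12      0.0
   3 │ b         1     13      0.0
   4 │ b         2     14      0.0
   5 │ c         2     15      0.0
   6 │ c         2     16      0.0
   7 │ d         3     17      0.0
   8 │ d         3     18      0.0
   9 │ e         3     19      0.0
  10 │ e         4     20      0.0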
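The same effect can be reproduced in plain Julia, without Tidier.jl, by writing the broadcast by hand. The sketch below assumes that auto-vectorization is equivalent to adding the dots shown here. The variable names vectorized and whole_column are only for illustration.

```julia
using Statistics

new_mean(xs...) = mean(xs...)
c = collect(11:20)

# Broadcasting calls new_mean once per element. The mean of a single
# number is that number, so every difference is zero.
vectorized = c .- new_mean.(c)

# Calling new_mean on the whole vector yields one scalar (15.5), which
# is then subtracted from every element.
whole_column = c .- new_mean(c)
```

Here vectorized contains only zeros, while whole_column runs from -4.5 to 4.5 in steps of 1.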

To prevent new_mean() from being vectorized, we need to prefix it with a ~ like this:

@chain df begin
    @mutate(d = c - ~new_mean(c))
end
10×4 DataFrame
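 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼─────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5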

This gives us the correct answer. Note that the ~ is not needed with mean(), because mean() is already in the lookup table of functions that are never vectorized.

@chain df begin
    @mutate(d = c - mean(c))
end
10×4 DataFrame
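 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼─────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5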

If you're not sure whether a function will be vectorized, you can always prefix it with a ~ to be safe. Even though mean() is never vectorized anyway, prefixing it with a ~ does no harm.

@chain df begin
    @mutate(d = c - ~mean(c))
end
10×4 DataFrame
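 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼─────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5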

If, for some reason, you do want to vectorize mean(), you can always do so explicitly with the dot syntax, and Tidier.jl won't un-vectorize it.

@chain df begin
    @mutate(d = c - mean.(c))
end
10×4 DataFrame
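 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼─────────────────────────────
   1 │ a         1     11      0.0
   2 │ a         1     12      0.0
   3 │ b         1     13      0.0
   4 │ b         2     14      0.0
   5 │ c         2     15      0.0
   6 │ c         2     16      0.0
   7 │ d         3     17      0.0
   8 │ d         3     18      0.0
   9 │ e         3     19      0.0
  10 │ e         4     20      0.0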

Note: ~ also works with operators. To keep an operator from being vectorized, prefix it with ~. For example, a ~* b performs matrix multiplication rather than element-wise multiplication. This is only needed outside of @summarize(), because @summarize() never performs auto-vectorization.
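The difference between the two kinds of multiplication can be seen in plain Julia, outside of any Tidier.jl macro. In the sketch below, a and b are hypothetical 2×2 matrices chosen for illustration, not data frame columns. The dotted operator .* corresponds to what auto-vectorization would produce, and the plain * to what ~* is described as preserving.

```julia
a = [1 2; 3 4]
b = [5 6; 7 8]

# Element-wise product: each entry of a is multiplied by the matching entry of b.
elementwise = a .* b   # [5 12; 21 32]

# Matrix product: rows of a are combined with columns of b.
matmul = a * b         # [19 22; 43 50]
```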


This page was generated using Literate.jl.