Auto-vectorization
TidierData.jl uses a lookup table to decide which functions not to vectorize. For example, mean() is listed as a function that should never be vectorized. Any function used inside of across() is also left un-vectorized. Any function that is not in this list and is used outside of across() is automatically vectorized.
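To make the lookup idea concrete, here is a minimal sketch of how a list-driven rewrite could work on a Julia expression. It is not TidierData.jl's actual implementation: the names `SKIP` and `toy_vectorize` are hypothetical, the list is a tiny stand-in, and special cases such as across() or already-dotted calls are ignored.

```julia
# Hypothetical stand-in for the package's lookup table (illustrative only).
const SKIP = Set([:mean, :sum])

# Recursively add broadcasting dots to calls whose name is not in `skip`.
function toy_vectorize(ex, skip = SKIP)
    ex isa Expr || return ex                       # symbols and literals pass through
    args = map(a -> toy_vectorize(a, skip), ex.args)
    if ex.head == :call && args[1] isa Symbol
        f = args[1]
        f in skip && return Expr(:call, args...)   # listed: leave the call un-dotted
        if Base.isoperator(f)
            # Operator call: a - b becomes a .- b
            return Expr(:call, Symbol(".", f), args[2:end]...)
        end
        # Ordinary call: f(x) becomes f.(x)
        return Expr(:., f, Expr(:tuple, args[2:end]...))
    end
    return Expr(ex.head, args...)
end

println(toy_vectorize(:(c - mean(c))))      # mean is in SKIP
println(toy_vectorize(:(c - new_mean(c))))  # new_mean is not in SKIP
```

In the first expression only the subtraction gains a dot, while in the second the call to the unlisted function is dotted as well, which is the behavior the rest of this page works around.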
Which functions are not vectorized? The set of non-vectorized functions is contained in the array TidierData.not_vectorized[]. Let's take a look at this array. We will wrap it in string() to make the output easier to read.
using TidierData
string(TidierData.not_vectorized[])
"[:getindex, :rand, :esc, :Ref, :Set, :Cols, :collect, :(:), :∘, :lag, :lead, :ntile, :repeat, :across, :desc, :mean, :std, :var, :median, :mad, :first, :last, :minimum, :maximum, :sum, :length, :skipmissing, :quantile, :passmissing, :cumsum, :cumprod, :accumulate, :is_float, :is_integer, :is_string, :cat_rev, :cat_relevel, :cat_infreq, :cat_lump, :cat_reorder, :cat_collapse, :cat_lump_min, :cat_lump_prop, :categorical, :as_categorical, :is_categorical, :unique, :iqr, :cat_other, :cat_replace_missing, :cat_recode]"
This "auto-vectorization" makes working with TidierData.jl more R-like and convenient. However, if you define your own function and use it, TidierData.jl may vectorize it even though you did not intend that. To prevent auto-vectorization, you can prefix your function with a ~.
df = DataFrame(a = repeat('a':'e', inner = 2), b = [1,1,1,2,2,2,3,3,3,4], c = 11:20)
Row | a (Char) | b (Int64) | c (Int64) |
---|---|---|---|
1 | a | 1 | 11 |
2 | a | 1 | 12 |
3 | b | 1 | 13 |
4 | b | 2 | 14 |
5 | c | 2 | 15 |
6 | c | 2 | 16 |
7 | d | 3 | 17 |
8 | d | 3 | 18 |
9 | e | 3 | 19 |
10 | e | 4 | 20 |
For example, let's define a function new_mean() that calculates a mean.
new_mean(exprs...) = mean(exprs...)
new_mean (generic function with 1 method)
If we try to use new_mean() inside of @mutate(), it will give us the wrong result. This is because new_mean() is vectorized, so the mean is calculated element-wise: the mean of a single value is the value itself, and every entry of d ends up as 0.0. This is almost never what we actually want.
@chain df begin
@mutate(d = c - new_mean(c))
end
Row | a (Char) | b (Int64) | c (Int64) | d (Float64) |
---|---|---|---|---|
1 | a | 1 | 11 | 0.0 |
2 | a | 1 | 12 | 0.0 |
3 | b | 1 | 13 | 0.0 |
4 | b | 2 | 14 | 0.0 |
5 | c | 2 | 15 | 0.0 |
6 | c | 2 | 16 | 0.0 |
7 | d | 3 | 17 | 0.0 |
8 | d | 3 | 18 | 0.0 |
9 | e | 3 | 19 | 0.0 |
10 | e | 4 | 20 | 0.0 |
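The zeros above can be reproduced in plain Julia, outside of any TidierData.jl macro. This is a small sketch using only the Statistics standard library; the variable names are illustrative, and the vector c stands in for column c of df.

```julia
using Statistics  # standard library; provides mean

c = 11:20  # same values as column c

# Broadcasting mean over c calls mean on one element at a time,
# and the mean of a single number is that number itself.
elementwise_means = mean.(c)
d_vectorized = c .- elementwise_means   # each value minus itself

# Calling mean once on the whole column gives a single number,
# which is then subtracted from every element.
d_whole_column = c .- mean(c)

println(d_vectorized)
println(collect(d_whole_column))
```

The first result matches the all-zero column d shown above; the second is the centered column that was actually wanted.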
To prevent new_mean() from being vectorized, we need to prefix it with a ~ like this:
@chain df begin
@mutate(d = c - ~new_mean(c))
end
Row | a (Char) | b (Int64) | c (Int64) | d (Float64) |
---|---|---|---|---|
1 | a | 1 | 11 | -4.5 |
2 | a | 1 | 12 | -3.5 |
3 | b | 1 | 13 | -2.5 |
4 | b | 2 | 14 | -1.5 |
5 | c | 2 | 15 | -0.5 |
6 | c | 2 | 16 | 0.5 |
7 | d | 3 | 17 | 1.5 |
8 | d | 3 | 18 | 2.5 |
9 | e | 3 | 19 | 3.5 |
10 | e | 4 | 20 | 4.5 |
Or you can modify the do-not-vectorize list like this:
push!(TidierData.not_vectorized[], :new_mean)
52-element Vector{Symbol}:
:getindex
:rand
:esc
:Ref
:Set
:Cols
:collect
:(:)
:∘
:lag
⋮
:categorical
:as_categorical
:is_categorical
:unique
:iqr
:cat_other
:cat_replace_missing
:cat_recode
:new_mean
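The trailing [] in the push! call is Julia's syntax for reading the value held by a Ref, which suggests the list is stored in such a container and is mutated in place. The sketch below imitates that pattern with a hypothetical `my_not_vectorized` so it runs without the package; the name and contents are illustrative only.

```julia
# Hypothetical stand-in for the package's list; name and contents are illustrative only.
const my_not_vectorized = Ref([:mean, :sum, :length])

# `[]` reads the vector stored in the Ref; push! mutates that same vector in place.
push!(my_not_vectorized[], :new_mean)

# Membership can then be checked with `in`.
is_listed = :new_mean in my_not_vectorized[]
sqrt_unlisted = !(:sqrt in my_not_vectorized[])

println(my_not_vectorized[])
```

Because a change like this only modifies a list held in memory, it would need to be repeated in each new Julia session.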
Now new_mean() should behave just like mean() in that it is treated as non-vectorized.
@chain df begin
@mutate(d = c - new_mean(c))
end
Row | a (Char) | b (Int64) | c (Int64) | d (Float64) |
---|---|---|---|---|
1 | a | 1 | 11 | -4.5 |
2 | a | 1 | 12 | -3.5 |
3 | b | 1 | 13 | -2.5 |
4 | b | 2 | 14 | -1.5 |
5 | c | 2 | 15 | -0.5 |
6 | c | 2 | 16 | 0.5 |
7 | d | 3 | 17 | 1.5 |
8 | d | 3 | 18 | 2.5 |
9 | e | 3 | 19 | 3.5 |
10 | e | 4 | 20 | 4.5 |
This gives us the correct answer. Notice that adding a ~ is not needed with mean() because mean() is already included in our lookup table of functions that should not be vectorized.
@chain df begin
@mutate(d = c - mean(c))
end
Row | a (Char) | b (Int64) | c (Int64) | d (Float64) |
---|---|---|---|---|
1 | a | 1 | 11 | -4.5 |
2 | a | 1 | 12 | -3.5 |
3 | b | 1 | 13 | -2.5 |
4 | b | 2 | 14 | -1.5 |
5 | c | 2 | 15 | -0.5 |
6 | c | 2 | 16 | 0.5 |
7 | d | 3 | 17 | 1.5 |
8 | d | 3 | 18 | 2.5 |
9 | e | 3 | 19 | 3.5 |
10 | e | 4 | 20 | 4.5 |
If you're not sure whether a function will be vectorized, you can always prefix it with a ~ to prevent vectorization. Even though mean() is not vectorized anyway, prefixing it with a ~ will not cause any harm.
@chain df begin
@mutate(d = c - ~mean(c))
end
Row | a (Char) | b (Int64) | c (Int64) | d (Float64) |
---|---|---|---|---|
1 | a | 1 | 11 | -4.5 |
2 | a | 1 | 12 | -3.5 |
3 | b | 1 | 13 | -2.5 |
4 | b | 2 | 14 | -1.5 |
5 | c | 2 | 15 | -0.5 |
6 | c | 2 | 16 | 0.5 |
7 | d | 3 | 17 | 1.5 |
8 | d | 3 | 18 | 2.5 |
9 | e | 3 | 19 | 3.5 |
10 | e | 4 | 20 | 4.5 |
If, for some crazy reason, you did want to vectorize mean(), you can always vectorize it explicitly with a dot, and TidierData.jl won't un-vectorize it.
@chain df begin
@mutate(d = c - mean.(c))
end
Row | a (Char) | b (Int64) | c (Int64) | d (Float64) |
---|---|---|---|---|
1 | a | 1 | 11 | 0.0 |
2 | a | 1 | 12 | 0.0 |
3 | b | 1 | 13 | 0.0 |
4 | b | 2 | 14 | 0.0 |
5 | c | 2 | 15 | 0.0 |
6 | c | 2 | 16 | 0.0 |
7 | d | 3 | 17 | 0.0 |
8 | d | 3 | 18 | 0.0 |
9 | e | 3 | 19 | 0.0 |
10 | e | 4 | 20 | 0.0 |
Note: ~ also works with operators, so if you do not want an operator to be vectorized, you can prefix it with ~. For example, a ~* b will perform a matrix multiplication rather than element-wise multiplication.
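For reference, the difference between the two kinds of multiplication looks like this in plain Julia, outside of any TidierData.jl macro. The matrices a and b are made-up values for illustration.

```julia
a = [1 2; 3 4]
b = [5 6; 7 8]

# Un-dotted *: ordinary matrix multiplication (rows of a times columns of b).
matrix_product = a * b

# Dotted .*: element-wise (broadcast) multiplication.
elementwise_product = a .* b

println(matrix_product)
println(elementwise_product)
```

Per the note above, writing a ~* b inside a TidierData.jl macro asks for the first behavior, whereas an auto-vectorized a * b gives the second.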
This page was generated using Literate.jl.