Auto-vectorization

TidierData.jl uses a lookup table to decide which functions not to vectorize. For example, mean() is on this list, so it is never vectorized automatically. Functions used inside across() are not automatically vectorized either. Any function that is not on the list and is used outside of across() is automatically vectorized.
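The rule above can be sketched in plain Julia. This is only an illustration of the decision logic as described here, not TidierData.jl's actual implementation: the names DEMO_NOT_VECTORIZED and would_vectorize are hypothetical, and the demo list is a small stand-in for the real one.

```julia
# Hypothetical stand-in for the lookup table; NOT the package's real list.
const DEMO_NOT_VECTORIZED = Set([:mean, :sum, :length])

# Hypothetical helper: decide whether a call to `fn` would be broadcast,
# following the rule described in the text.
function would_vectorize(fn::Symbol; inside_across::Bool = false)
    inside_across && return false          # functions inside across() are left alone
    return !(fn in DEMO_NOT_VECTORIZED)    # anything not on the list is broadcast
end

would_vectorize(:sqrt)                        # true: not on the list
would_vectorize(:mean)                        # false: on the list
would_vectorize(:sqrt; inside_across = true)  # false: inside across()
```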

Which functions are not vectorized? The set of non-vectorized functions is contained in the array TidierData.not_vectorized[]. Let's take a look at this array. We will wrap it in a string() to make the output easier to read.

using TidierData

string(TidierData.not_vectorized[])
"[:getindex, :rand, :esc, :Ref, :Set, :Cols, :collect, :(:), :∘, :lag, :lead, :ntile, :repeat, :across, :desc, :mean, :std, :var, :median, :mad, :first, :last, :minimum, :maximum, :sum, :length, :skipmissing, :quantile, :passmissing, :cumsum, :cumprod, :accumulate, :is_float, :is_integer, :is_string, :cat_rev, :cat_relevel, :cat_infreq, :cat_lump, :cat_reorder, :cat_collapse, :cat_lump_min, :cat_lump_prop, :categorical, :as_categorical, :is_categorical, :unique, :iqr, :cat_other, :cat_replace_missing, :cat_recode]"

This "auto-vectorization" makes working with TidierData.jl more R-like and convenient. However, if you ever define your own function and try to use it, TidierData.jl may unintentionally vectorize it for you. To prevent auto-vectorization, you can prefix your function with a ~.

df = DataFrame(a = repeat('a':'e', inner = 2), b = [1,1,1,2,2,2,3,3,3,4], c = 11:20)
10×3 DataFrame
 Row │ a     b      c
     │ Char  Int64  Int64
─────┼───────────────────
   1 │ a         1     11
   2 │ a         1     12
   3 │ b         1     13
   4 │ b         2     14
   5 │ c         2     15
   6 │ c         2     16
   7 │ d         3     17
   8 │ d         3     18
   9 │ e         3     19
  10 │ e         4     20

For example, let's define a function new_mean() that calculates a mean.

new_mean(exprs...) = mean(exprs...)
new_mean (generic function with 1 method)

If we try to use new_mean() inside of @mutate(), it will give us the wrong result. Because new_mean() is not on the do-not-vectorize list, it is auto-vectorized, so the mean is calculated element-wise: the "mean" of each value is the value itself, and d comes out as 0.0 in every row. This is almost never what we actually want.

@chain df begin
    @mutate(d = c - new_mean(c))
end
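10×4 DataFrame
 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼────────────────────────────
   1 │ a         1     11      0.0
   2 │ a         1     12      0.0
   3 │ b         1     13      0.0
   4 │ b         2     14      0.0
   5 │ c         2     15      0.0
   6 │ c         2     16      0.0
   7 │ d         3     17      0.0
   8 │ d         3     18      0.0
   9 │ e         3     19      0.0
  10 │ e         4     20      0.0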
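To see why every row is 0.0, it can help to reproduce the effect outside of @mutate(). The sketch below uses only Base Julia and the Statistics standard library, and assumes that auto-vectorization behaves like Julia's ordinary dot (broadcast) syntax for this call.

```julia
using Statistics  # standard library; provides mean()

new_mean(exprs...) = mean(exprs...)
c = collect(11:20)

# Broadcast call: new_mean is applied to each element separately,
# so every "mean" is just the element itself (as a Float64).
elementwise = new_mean.(c)
d_vectorized = c .- elementwise   # all zeros

# Ordinary call: new_mean sees the whole column and returns one number.
overall = new_mean(c)             # 15.5
d_whole = c .- overall            # -4.5, -3.5, ..., 4.5
```

The second form is the one we want inside @mutate().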

To prevent new_mean() from being vectorized, we need to prefix it with a ~ like this:

@chain df begin
    @mutate(d = c - ~new_mean(c))
end
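10×4 DataFrame
 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5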

Or you can modify the do-not-vectorize list like this:

push!(TidierData.not_vectorized[], :new_mean)
52-element Vector{Symbol}:
 :getindex
 :rand
 :esc
 :Ref
 :Set
 :Cols
 :collect
 :(:)
 :∘
 :lag
 ⋮
 :categorical
 :as_categorical
 :is_categorical
 :unique
 :iqr
 :cat_other
 :cat_replace_missing
 :cat_recode
 :new_mean
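Note that push!() mutates the list in place for the rest of the session, and running it again appends the same symbol a second time. The sketch below illustrates this with a hypothetical stand-in, demo_list, rather than the real list. It assumes the list is held in a Ref-like container, which the trailing [] suggests but this page does not state.

```julia
# Hypothetical stand-in: a Ref wrapping a Vector{Symbol}.
# This is NOT TidierData.not_vectorized itself.
demo_list = Ref([:mean, :sum])

# [] dereferences the Ref; push! then mutates the underlying vector in place.
push!(demo_list[], :new_mean)

# Guarded version: only add the name if it is not already present,
# so re-running the line does not create duplicates.
:new_mean in demo_list[] || push!(demo_list[], :new_mean)

:new_mean in demo_list[]   # true
length(demo_list[])        # 3
```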

Now new_mean() should behave just like mean() in that it is treated as non-vectorized.

@chain df begin
    @mutate(d = c - new_mean(c))
end
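10×4 DataFrame
 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5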

This gives us the correct answer. Notice that a ~ is not needed with mean() because mean() is already included in the lookup table of functions that are not vectorized.

@chain df begin
    @mutate(d = c - mean(c))
end
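10×4 DataFrame
 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5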

If you're not sure whether a function is vectorized and want to make sure it is not, you can always prefix it with a ~. Even though mean() is not vectorized anyway, prefixing it with a ~ does no harm.

@chain df begin
    @mutate(d = c - ~mean(c))
end
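10×4 DataFrame
 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼────────────────────────────
   1 │ a         1     11     -4.5
   2 │ a         1     12     -3.5
   3 │ b         1     13     -2.5
   4 │ b         2     14     -1.5
   5 │ c         2     15     -0.5
   6 │ c         2     16      0.5
   7 │ d         3     17      1.5
   8 │ d         3     18      2.5
   9 │ e         3     19      3.5
  10 │ e         4     20      4.5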

If, for some reason, you do want to vectorize mean(), you can always do so explicitly with a dot, and TidierData.jl won't un-vectorize it.

@chain df begin
    @mutate(d = c - mean.(c))
end
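10×4 DataFrame
 Row │ a     b      c      d
     │ Char  Int64  Int64  Float64
─────┼────────────────────────────
   1 │ a         1     11      0.0
   2 │ a         1     12      0.0
   3 │ b         1     13      0.0
   4 │ b         2     14      0.0
   5 │ c         2     15      0.0
   6 │ c         2     16      0.0
   7 │ d         3     17      0.0
   8 │ d         3     18      0.0
   9 │ e         3     19      0.0
  10 │ e         4     20      0.0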

Note: ~ also works with operators. To keep an operator from being vectorized, prefix it with ~. For example, a ~* b performs matrix multiplication rather than element-wise multiplication.
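
The difference between the two kinds of multiplication is easy to see in plain Julia, outside of any TidierData.jl macro. In this minimal sketch, A and B are made-up matrices; .* is Julia's element-wise (broadcast) multiplication and * is ordinary matrix multiplication.

```julia
A = [1 2; 3 4]
B = [5 6; 7 8]

# Element-wise product: each entry of A times the matching entry of B.
elementwise = A .* B   # [5 12; 21 32]

# Matrix product: rows of A combined with columns of B.
matmul = A * B         # [19 22; 43 50]
```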


This page was generated using Literate.jl.