Auto-vectorization

In general, Tidier.jl uses a lookup table to decide which functions not to vectorize. For example, mean() is listed as a function that should never be vectorized. In addition, functions used inside of @summarize() are never automatically vectorized. Any function that is not in the lookup table and is used outside of @summarize() is automatically vectorized.

This "auto-vectorization" makes working with Tidier.jl more R-like and convenient. However, if you define your own function and use it, Tidier.jl will not find it in the lookup table and may vectorize it even though you did not intend that. To prevent auto-vectorization, you can prefix your function with a ~.
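To make the lookup-table idea concrete, here is a toy model in plain Julia. It is only a sketch and not Tidier.jl's actual implementation: the table contents, the function autovec(), and the helper my_mean() are all hypothetical names chosen for illustration. The sketch rewrites every call f(x) into the broadcast form f.(x) unless f is in the table or the call is prefixed with ~.

```julia
using Statistics

# Hypothetical lookup table; Tidier.jl's real list is not reproduced here.
const NOT_VECTORIZED = Set([:mean, :median, :sum])

# Hypothetical user-defined function that is NOT in the table.
my_mean(x) = mean(x)

# Recursively rewrite f(args...) into f.(args...) unless f is in the table.
function autovec(ex)
    ex isa Expr || return ex                  # symbols and literals pass through
    if ex.head == :call
        f = ex.args[1]
        # `~g(args...)`: drop the tilde and keep the call to g un-vectorized
        if f == :~ && length(ex.args) == 2 &&
           ex.args[2] isa Expr && ex.args[2].head == :call
            inner = ex.args[2]
            return Expr(:call, inner.args[1], map(autovec, inner.args[2:end])...)
        end
        args = map(autovec, ex.args[2:end])
        if f isa Symbol && !(f in NOT_VECTORIZED)
            return Expr(:., f, Expr(:tuple, args...))   # builds f.(args...)
        end
        return Expr(:call, f, args...)
    end
    return Expr(ex.head, map(autovec, ex.args)...)
end

c = collect(11:20)

plain   = autovec(:(c - mean(c)))       # mean is in the table, so it stays as is
custom  = autovec(:(c - my_mean(c)))    # my_mean is not, so it gets a dot
escaped = autovec(:(c - ~my_mean(c)))   # the tilde opts out of the rewrite

println(plain)
println(custom)
println(escaped)
```

Evaluating the three rewritten expressions at top level with eval() shows the practical difference: the first and third subtract the overall mean from every element, while the second subtracts each element from itself.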

using Tidier
using RDatasets

df = DataFrame(a = repeat('a':'e', inner = 2), b = [1,1,1,2,2,2,3,3,3,4], c = 11:20)

10×3 DataFrame

Row	a	b	c
	Char	Int64	Int64
1	a	1	11
2	a	1	12
3	b	1	13
4	b	2	14
5	c	2	15
6	c	2	16
7	d	3	17
8	d	3	18
9	e	3	19
10	e	4	20

For example, let's define a function new_mean() that calculates a mean.

new_mean(exprs...) = mean(exprs...)

new_mean (generic function with 1 method)

If we try to use new_mean() inside of @mutate(), it will give us the wrong result. Because new_mean() is not in the lookup table, it is vectorized, so the mean is calculated element-wise: the "mean" of each individual value is the value itself, and subtracting it yields 0.0 in every row. This is almost never what we actually want.

@chain df begin
    @mutate(d = c - new_mean(c))
end

10×4 DataFrame

Row	a	b	c	d
	Char	Int64	Int64	Float64
1	a	1	11	0.0
2	a	1	12	0.0
3	b	1	13	0.0
4	b	2	14	0.0
5	c	2	15	0.0
6	c	2	16	0.0
7	d	3	17	0.0
8	d	3	18	0.0
9	e	3	19	0.0
10	e	4	20	0.0
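The column of zeros can be reproduced without Tidier.jl. The sketch below uses only Julia's Statistics standard library and assumes that the auto-vectorized call behaves like an explicit dot-broadcast, new_mean.(c); the variable names elementwise and whole are illustrative.

```julia
using Statistics

new_mean(exprs...) = mean(exprs...)
c = collect(11:20)

# Broadcasting calls new_mean once per element. The mean of a single
# number is that number, so every difference is zero.
elementwise = c .- new_mean.(c)

# Calling new_mean on the whole vector returns one number, which is
# then subtracted from every element.
whole = c .- new_mean(c)

println(elementwise)
println(whole)
```

The first result matches the table above; the second is the centered column that was actually intended.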

To prevent new_mean() from being vectorized, we need to prefix it with a ~ like this:

@chain df begin
    @mutate(d = c - ~new_mean(c))
end

10×4 DataFrame

Row	a	b	c	d
	Char	Int64	Int64	Float64
1	a	1	11	-4.5
2	a	1	12	-3.5
3	b	1	13	-2.5
4	b	2	14	-1.5
5	c	2	15	-0.5
6	c	2	16	0.5
7	d	3	17	1.5
8	d	3	18	2.5
9	e	3	19	3.5
10	e	4	20	4.5

This gives us the correct answer. Notice that adding a ~ is not needed with mean() because mean() is already included in the lookup table of functions that should not be vectorized.

@chain df begin
    @mutate(d = c - mean(c))
end

10×4 DataFrame

Row	a	b	c	d
	Char	Int64	Int64	Float64
1	a	1	11	-4.5
2	a	1	12	-3.5
3	b	1	13	-2.5
4	b	2	14	-1.5
5	c	2	15	-0.5
6	c	2	16	0.5
7	d	3	17	1.5
8	d	3	18	2.5
9	e	3	19	3.5
10	e	4	20	4.5

If you're not sure if a function is vectorized and want to prevent it from being vectorized, you can always prefix it with a ~ to prevent vectorization. Even though mean() is not vectorized anyway, prefixing it with a ~ will not cause any harm.

@chain df begin
    @mutate(d = c - ~mean(c))
end

10×4 DataFrame

Row	a	b	c	d
	Char	Int64	Int64	Float64
1	a	1	11	-4.5
2	a	1	12	-3.5
3	b	1	13	-2.5
4	b	2	14	-1.5
5	c	2	15	-0.5
6	c	2	16	0.5
7	d	3	17	1.5
8	d	3	18	2.5
9	e	3	19	3.5
10	e	4	20	4.5

If for some crazy reason you did want to vectorize mean(), you can always do so explicitly by writing mean.(), and Tidier.jl won't un-vectorize it.

@chain df begin
    @mutate(d = c - mean.(c))
end

10×4 DataFrame

Row	a	b	c	d
	Char	Int64	Int64	Float64
1	a	1	11	0.0
2	a	1	12	0.0
3	b	1	13	0.0
4	b	2	14	0.0
5	c	2	15	0.0
6	c	2	16	0.0
7	d	3	17	0.0
8	d	3	18	0.0
9	e	3	19	0.0
10	e	4	20	0.0

Note: ~ also works with operators, so to prevent an operator from being vectorized, prefix it with ~. For example, a ~* b performs matrix multiplication rather than element-wise multiplication. Remember that this is only needed outside of @summarize() because @summarize() never performs auto-vectorization.
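For reference, the difference between the two kinds of multiplication can be seen in base Julia, independent of Tidier.jl. In this small sketch the matrices A and B are made-up values; * is the un-vectorized operator and .* is its element-wise (broadcast) form.

```julia
A = [1 2; 3 4]
B = [5 6; 7 8]

# Un-vectorized operator: matrix multiplication (rows times columns)
matrix_product = A * B

# Vectorized operator: multiplies matching entries
elementwise_product = A .* B

println(matrix_product)
println(elementwise_product)
```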

This page was generated using Literate.jl.