@filter
Filtering is a mechanism to indicate which rows you want to keep in a dataset based on criteria. This is also referred to as subsetting. Filtering rows is normally a bit tricky in DataFrames.jl because comparison operators like >= actually need to be vectorized as .>=, which can catch new Julia users by surprise. @filter() mimics R's tidyverse behavior by auto-vectorizing the code and then only selecting those rows that evaluate to true. Similar to dplyr, rows that evaluate to missing are skipped.
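To make the contrast concrete, here is a minimal sketch in plain Julia, without DataFrames.jl or Tidier. The budget vector and the 20.0 threshold are made-up illustration values, and the coalesce step is only one way to treat missing as "do not keep"; it is not necessarily what @filter() does internally.

```julia
# Hypothetical data for illustration only (not taken from the movies dataset).
budget = [23.0, missing, 85.0, 5.0]

# `budget >= 20.0` would raise an error: `>=` compares a whole vector to a number.
# The dotted form broadcasts the comparison over each element instead.
mask = budget .>= 20.0            # elements: true, missing, true, false

# Treat missing comparisons as false so that those rows are dropped.
keep = coalesce.(mask, false)

kept = budget[keep]               # the values 23.0 and 85.0
```

The comparison against a missing budget yields missing rather than true or false, which is why a rule for missing rows is needed at all.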
using Tidier
using RDatasets
movies = dataset("ggplot2", "movies");
Let’s take a look at the movies whose budget was at or above the average. We first rescale Budget to millions of dollars, and we select only the first 5 rows for the sake of brevity.
@chain movies begin
    @mutate(Budget = Budget / 1_000_000)
    @filter(Budget >= mean(skipmissing(Budget)))
    @select(Title, Budget)
    @slice(1:5)
end
Row | Title | Budget |
---|---|---|
| | String | Float64? |
1 | 'Til There Was You | 23.0 |
2 | 10 Things I Hate About You | 16.0 |
3 | 102 Dalmatians | 85.0 |
4 | 13 Going On 30 | 37.0 |
5 | 13th Warrior, The | 85.0 |
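The skipmissing() wrapper matters because Budget contains missing values (its element type is shown as Float64? above). The following is a small sketch with made-up numbers, using only the Statistics standard library; the vector and variable names are hypothetical.

```julia
using Statistics

# Hypothetical budgets in millions; values are illustrative only.
budget = [23.0, missing, 85.0, 5.0, 37.0]

# A single missing value makes the plain mean missing as well.
plain_mean = mean(budget)                 # missing

# skipmissing() iterates over the non-missing values only.
avg = mean(skipmissing(budget))           # (23 + 85 + 5 + 37) / 4 = 37.5

# Broadcast the comparison, then drop rows where the result is missing.
above = budget[coalesce.(budget .>= avg, false)]   # only 85.0 remains
```

Without skipmissing(), the threshold itself would be missing, every comparison would evaluate to missing, and no rows would be kept.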
Now let's see how to use @filter() with in. Here's an example with a tuple.
@chain movies begin
    @filter(Title in ("101 Dalmatians", "102 Dalmatians"))
    @select(1:5)
end
Row | Title | Year | Length | Budget | Rating |
---|---|---|---|---|---|
| | String | Int32 | Int32 | Int32? | Float64 |
1 | 101 Dalmatians | 1996 | 103 | missing | 5.5 |
2 | 102 Dalmatians | 2000 | 100 | 85000000 | 4.7 |
We can also use @filter() with in using a vector, denoted by [].
@chain movies begin
    @filter(Title in ["101 Dalmatians", "102 Dalmatians"])
    @select(1:5)
end
Row | Title | Year | Length | Budget | Rating |
---|---|---|---|---|---|
| | String | Int32 | Int32 | Int32? | Float64 |
1 | 101 Dalmatians | 1996 | 103 | missing | 5.5 |
2 | 102 Dalmatians | 2000 | 100 | 85000000 | 4.7 |
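For comparison, a roughly equivalent membership test in plain Julia is sketched below. The titles vector is made up, and using Ref() to broadcast in over the column while treating the collection as a single value is an assumption about how one might write this by hand, not a description of what @filter() expands to.

```julia
# Hypothetical titles for illustration only.
titles = ["101 Dalmatians", "102 Dalmatians", "13 Going On 30"]

# Ref() protects the collection from being broadcast over,
# so each title is tested against the whole tuple or vector.
in_tuple  = in.(titles, Ref(("101 Dalmatians", "102 Dalmatians")))
in_vector = in.(titles, Ref(["101 Dalmatians", "102 Dalmatians"]))

matches = titles[in_tuple]   # the two Dalmatians titles
```

Both forms produce the same mask here, which mirrors the identical results of the tuple and vector examples above.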
This page was generated using Literate.jl.