@filter
Filtering is a mechanism to indicate which rows you want to keep in a dataset based on criteria. This is also referred to as subsetting. Filtering rows is normally a bit tricky in DataFrames.jl
because comparison operators like >=
actually need to be vectorized as .>=
, which can catch new Julia users by surprise. @filter()
mimics R's tidyverse
behavior by auto-vectorizing the code and then only selecting those rows that evaluate to true
. Similar to dplyr
, rows that evaluate to missing
are skipped.
using TidierData
using RDatasets
movies = dataset("ggplot2", "movies");
Let’s take a look at the movies whose budget was more than average. We will select only the first 5 rows for the sake of brevity.¤
@chain movies begin
@mutate(Budget = Budget / 1_000_000)
@filter(Budget >= mean(skipmissing(Budget)))
@select(Title, Budget)
@slice(1:5)
end
5×2 DataFrame
Row | Title | Budget |
---|---|---|
String | Float64? | |
1 | 'Til There Was You | 23.0 |
2 | 10 Things I Hate About You | 16.0 |
3 | 102 Dalmatians | 85.0 |
4 | 13 Going On 30 | 37.0 |
5 | 13th Warrior, The | 85.0 |
Let's search for movies that have at least 200 votes and a rating of greater than or equal to 8. There are 3 ways you can specify an "and" condition inside of TidierData.jl
.¤
The first option is to use the short-circuiting &&
operator as shown below. This is the preferred approach because the second expression is only evaluated (per element) if the first one is true.¤
@chain movies begin
@filter(Votes >= 200 && Rating >= 8)
@select(Title, Votes, Rating)
@slice(1:5)
end
5×3 DataFrame
Row | Title | Votes | Rating |
---|---|---|---|
String | Int32 | Float64 | |
1 | 12 Angry Men | 29278 | 8.7 |
2 | 12 stulev | 252 | 8.9 |
3 | 2001: A Space Odyssey | 64982 | 8.3 |
4 | 21 Grams | 21857 | 8.0 |
5 | 39 Steps, The | 7931 | 8.0 |
The second option is to use the bitwise &
operator. Note that there is a key difference in syntax between &
and &&
. Because the &
operator takes a higher operator precedence than >=
, you have to wrap the comparison expressions inside of parentheses to ensure that the overall expression is evaluated correctly.¤
@chain movies begin
@filter((Votes >= 200) & (Rating >= 8))
@select(Title, Votes, Rating)
@slice(1:5)
end
5×3 DataFrame
Row | Title | Votes | Rating |
---|---|---|---|
String | Int32 | Float64 | |
1 | 12 Angry Men | 29278 | 8.7 |
2 | 12 stulev | 252 | 8.9 |
3 | 2001: A Space Odyssey | 64982 | 8.3 |
4 | 21 Grams | 21857 | 8.0 |
5 | 39 Steps, The | 7931 | 8.0 |
The third option for "and" conditions only is to separate the expressions with commas. This is similar to the behavior of filter()
in tidyverse
.¤
@chain movies begin
@filter(Votes >= 200, Rating >= 8)
@select(Title, Votes, Rating)
@slice(1:5)
end
5×3 DataFrame
Row | Title | Votes | Rating |
---|---|---|---|
String | Int32 | Float64 | |
1 | 12 Angry Men | 29278 | 8.7 |
2 | 12 stulev | 252 | 8.9 |
3 | 2001: A Space Odyssey | 64982 | 8.3 |
4 | 21 Grams | 21857 | 8.0 |
5 | 39 Steps, The | 7931 | 8.0 |
Now let's see how to use @filter()
with in
. Here's an example with a tuple.¤
@chain movies begin
@filter(Title in ("101 Dalmatians",
"102 Dalmatians"))
@select(1:5)
end
2×5 DataFrame
Row | Title | Year | Length | Budget | Rating |
---|---|---|---|---|---|
String | Int32 | Int32 | Int32? | Float64 | |
1 | 101 Dalmatians | 1996 | 103 | missing | 5.5 |
2 | 102 Dalmatians | 2000 | 100 | 85000000 | 4.7 |
We can also use @filter()
with in
using a vector, denoted by a []
.¤
@chain movies begin
@filter(Title in ["101 Dalmatians",
"102 Dalmatians"])
@select(1:5)
end
2×5 DataFrame
Row | Title | Year | Length | Budget | Rating |
---|---|---|---|---|---|
String | Int32 | Int32 | Int32? | Float64 | |
1 | 101 Dalmatians | 1996 | 103 | missing | 5.5 |
2 | 102 Dalmatians | 2000 | 100 | 85000000 | 4.7 |
Finally, we can combine @filter
with row_number()
to retrieve the first 5 rows, which can be used to mimic the functionality provided by @slice
.¤
@chain movies begin
@filter(row_number() <= 5)
@select(1:5)
end
5×5 DataFrame
Row | Title | Year | Length | Budget | Rating |
---|---|---|---|---|---|
String | Int32 | Int32 | Int32? | Float64 | |
1 | $ | 1971 | 121 | missing | 6.4 |
2 | $1000 a Touchdown | 1939 | 71 | missing | 6.0 |
3 | $21 a Day Once a Month | 1941 | 7 | missing | 8.2 |
4 | $40,000 | 1996 | 70 | missing | 8.2 |
5 | $50,000 Climax Show, The | 1975 | 71 | missing | 3.4 |
This page was generated using Literate.jl.