Skip to content

@filter

Filtering is a mechanism to indicate which rows you want to keep in a dataset based on criteria. This is also referred to as subsetting. Filtering rows is normally a bit tricky in DataFrames.jl because comparison operators like >= actually need to be vectorized as .>=, which can catch new Julia users by surprise. @filter() mimics R's tidyverse behavior by auto-vectorizing the code and then only selecting those rows that evaluate to true. Similar to dplyr, rows that evaluate to missing are skipped.

using TidierData
using RDatasets

movies = dataset("ggplot2", "movies");

Let’s take a look at the movies whose budget was more than average. We will select only the first 5 rows for the sake of brevity.¤

@chain movies begin
  @mutate(Budget = Budget / 1_000_000)
  @filter(Budget >= mean(skipmissing(Budget)))
  @select(Title, Budget)
  @slice(1:5)
end
5×2 DataFrame
RowTitleBudget
StringFloat64?
1'Til There Was You23.0
210 Things I Hate About You16.0
3102 Dalmatians85.0
413 Going On 3037.0
513th Warrior, The85.0

Let's search for movies that have at least 200 votes and a rating of greater than or equal to 8. There are 3 ways you can specify an "and" condition inside of TidierData.jl.¤

The first option is to use the short-circuiting && operator as shown below. This is the preferred approach because the second expression is only evaluated (per element) if the first one is true.¤

@chain movies begin
  @filter(Votes >= 200 && Rating >= 8)
  @select(Title, Votes, Rating)
  @slice(1:5)
end
5×3 DataFrame
RowTitleVotesRating
StringInt32Float64
112 Angry Men292788.7
212 stulev2528.9
32001: A Space Odyssey649828.3
421 Grams218578.0
539 Steps, The79318.0

The second option is to use the bitwise & operator. Note that there is a key difference in syntax between & and &&. Because the & operator takes a higher operator precedence than >=, you have to wrap the comparison expressions inside of parentheses to ensure that the overall expression is evaluated correctly.¤

@chain movies begin
  @filter((Votes >= 200) & (Rating >= 8))
  @select(Title, Votes, Rating)
  @slice(1:5)
end
5×3 DataFrame
RowTitleVotesRating
StringInt32Float64
112 Angry Men292788.7
212 stulev2528.9
32001: A Space Odyssey649828.3
421 Grams218578.0
539 Steps, The79318.0

The third option for "and" conditions only is to separate the expressions with commas. This is similar to the behavior of filter() in tidyverse.¤

@chain movies begin
  @filter(Votes >= 200, Rating >= 8)
  @select(Title, Votes, Rating)
  @slice(1:5)
end
5×3 DataFrame
RowTitleVotesRating
StringInt32Float64
112 Angry Men292788.7
212 stulev2528.9
32001: A Space Odyssey649828.3
421 Grams218578.0
539 Steps, The79318.0

Now let's see how to use @filter() with in. Here's an example with a tuple.¤

@chain movies begin
  @filter(Title in ("101 Dalmatians",
                    "102 Dalmatians"))
  @select(1:5)
end
2×5 DataFrame
RowTitleYearLengthBudgetRating
StringInt32Int32Int32?Float64
1101 Dalmatians1996103missing5.5
2102 Dalmatians2000100850000004.7

We can also use @filter() with in using a vector, denoted by a [].¤

@chain movies begin
  @filter(Title in ["101 Dalmatians",
                    "102 Dalmatians"])
  @select(1:5)
end
2×5 DataFrame
RowTitleYearLengthBudgetRating
StringInt32Int32Int32?Float64
1101 Dalmatians1996103missing5.5
2102 Dalmatians2000100850000004.7

Finally, we can combine @filter with row_number() to retrieve the first 5 rows, which can be used to mimic the functionality provided by @slice.¤

@chain movies begin
  @filter(row_number() <= 5)
  @select(1:5)
end
5×5 DataFrame
RowTitleYearLengthBudgetRating
StringInt32Int32Int32?Float64
1$1971121missing6.4
2$1000 a Touchdown193971missing6.0
3$21 a Day Once a Month19417missing8.2
4$40,000199670missing8.2
5$50,000 Climax Show, The197571missing3.4

This page was generated using Literate.jl.