@filter

Filtering is a mechanism to indicate which rows you want to keep in a dataset based on criteria. This is also referred to as subsetting. Filtering rows is normally a bit tricky in DataFrames.jl because comparison operators like >= actually need to be vectorized as .>=, which can catch new Julia users by surprise. @filter() mimics R's tidyverse behavior by auto-vectorizing the code and then only selecting those rows that evaluate to true. Similar to dplyr, rows that evaluate to missing are skipped.

using TidierData
using RDatasets

movies = dataset("ggplot2", "movies");

Let’s take a look at the movies whose budget was more than average. We will select only the first 5 rows for the sake of brevity.¤

@chain movies begin
  @mutate(Budget = Budget / 1_000_000)
  @filter(Budget >= mean(skipmissing(Budget)))
  @select(Title, Budget)
  @slice(1:5)
end

5×2 DataFrame

Row	Title	Budget
	String	Float64?
1	'Til There Was You	23.0
2	10 Things I Hate About You	16.0
3	102 Dalmatians	85.0
4	13 Going On 30	37.0
5	13th Warrior, The	85.0

Let's search for movies that have at least 200 votes and a rating of greater than or equal to 8. There are 3 ways you can specify an "and" condition inside of `TidierData.jl`.¤

The first option is to use the short-circuiting `&&` operator as shown below. This is the preferred approach because the second expression is only evaluated (per element) if the first one is true.¤

@chain movies begin
  @filter(Votes >= 200 && Rating >= 8)
  @select(Title, Votes, Rating)
  @slice(1:5)
end

5×3 DataFrame

Row	Title	Votes	Rating
	String	Int32	Float64
1	12 Angry Men	29278	8.7
2	12 stulev	252	8.9
3	2001: A Space Odyssey	64982	8.3
4	21 Grams	21857	8.0
5	39 Steps, The	7931	8.0

The second option is to use the bitwise `&` operator. Note that there is a key difference in syntax between `&` and `&&`. Because the `&` operator takes a higher operator precedence than `>=`, you have to wrap the comparison expressions inside of parentheses to ensure that the overall expression is evaluated correctly.¤

@chain movies begin
  @filter((Votes >= 200) & (Rating >= 8))
  @select(Title, Votes, Rating)
  @slice(1:5)
end

5×3 DataFrame

Row	Title	Votes	Rating
	String	Int32	Float64
1	12 Angry Men	29278	8.7
2	12 stulev	252	8.9
3	2001: A Space Odyssey	64982	8.3
4	21 Grams	21857	8.0
5	39 Steps, The	7931	8.0

The third option for "and" conditions only is to separate the expressions with commas. This is similar to the behavior of `filter()` in `tidyverse`.¤

@chain movies begin
  @filter(Votes >= 200, Rating >= 8)
  @select(Title, Votes, Rating)
  @slice(1:5)
end

5×3 DataFrame

Row	Title	Votes	Rating
	String	Int32	Float64
1	12 Angry Men	29278	8.7
2	12 stulev	252	8.9
3	2001: A Space Odyssey	64982	8.3
4	21 Grams	21857	8.0
5	39 Steps, The	7931	8.0

Now let's see how to use `@filter()` with `in`. Here's an example with a tuple.¤

@chain movies begin
  @filter(Title in ("101 Dalmatians",
                    "102 Dalmatians"))
  @select(1:5)
end

2×5 DataFrame

Row	Title	Year	Length	Budget	Rating
	String	Int32	Int32	Int32?	Float64
1	101 Dalmatians	1996	103	missing	5.5
2	102 Dalmatians	2000	100	85000000	4.7

We can also use `@filter()` with `in` using a vector, denoted by a `[]`.¤

@chain movies begin
  @filter(Title in ["101 Dalmatians",
                    "102 Dalmatians"])
  @select(1:5)
end

2×5 DataFrame

Row	Title	Year	Length	Budget	Rating
	String	Int32	Int32	Int32?	Float64
1	101 Dalmatians	1996	103	missing	5.5
2	102 Dalmatians	2000	100	85000000	4.7

Finally, we can combine `@filter` with `row_number()` to retrieve the first 5 rows, which can be used to mimic the functionality provided by `@slice`.¤

@chain movies begin
  @filter(row_number() <= 5)
  @select(1:5)
end

5×5 DataFrame

Row	Title	Year	Length	Budget	Rating
	String	Int32	Int32	Int32?	Float64
1	$	1971	121	missing	6.4
2	$1000 a Touchdown	1939	71	missing	6.0
3	$21 a Day Once a Month	1941	7	missing	8.2
4	$40,000	1996	70	missing	8.2
5	$50,000 Climax Show, The	1975	71	missing	3.4

This page was generated using Literate.jl.

@filter

Let’s take a look at the movies whose budget was more than average. We will select only the first 5 rows for the sake of brevity.¤

Let's search for movies that have at least 200 votes and a rating of greater than or equal to 8. There are 3 ways you can specify an "and" condition inside of TidierData.jl.¤

The first option is to use the short-circuiting && operator as shown below. This is the preferred approach because the second expression is only evaluated (per element) if the first one is true.¤

The third option for "and" conditions only is to separate the expressions with commas. This is similar to the behavior of filter() in tidyverse.¤

Now let's see how to use @filter() with in. Here's an example with a tuple.¤

We can also use @filter() with in using a vector, denoted by a [].¤

Finally, we can combine @filter with row_number() to retrieve the first 5 rows, which can be used to mimic the functionality provided by @slice.¤

Let's search for movies that have at least 200 votes and a rating of greater than or equal to 8. There are 3 ways you can specify an "and" condition inside of `TidierData.jl`.¤

The first option is to use the short-circuiting `&&` operator as shown below. This is the preferred approach because the second expression is only evaluated (per element) if the first one is true.¤

The third option for "and" conditions only is to separate the expressions with commas. This is similar to the behavior of `filter()` in `tidyverse`.¤

Now let's see how to use `@filter()` with `in`. Here's an example with a tuple.¤

We can also use `@filter()` with `in` using a vector, denoted by a `[]`.¤

Finally, we can combine `@filter` with `row_number()` to retrieve the first 5 rows, which can be used to mimic the functionality provided by `@slice`.¤