@filter

Filtering keeps only the rows of a dataset that meet given criteria. This is also referred to as subsetting. Filtering rows is normally a bit tricky in DataFrames.jl because comparison operators like >= must be written in their vectorized (broadcast) form, .>=, which can catch new Julia users by surprise. @filter() mimics R's tidyverse behavior by auto-vectorizing the code and keeping only the rows for which the condition evaluates to true. As in dplyr, rows for which the condition evaluates to missing are dropped.
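For comparison, here is a rough sketch of the same kind of filter written by hand with DataFrames.jl. The toy data frame `df` and the names `avg`, `mask`, `by_mask`, and `by_subset` are hypothetical and chosen only for illustration. This is not a description of how @filter() is implemented internally.

```julia
using DataFrames
using Statistics

# Hypothetical toy data; one budget is missing
df = DataFrame(
    Title = ["A", "B", "C", "D"],
    Budget = [10.0, missing, 30.0, 20.0],
)

# Average over the non-missing budgets
avg = mean(skipmissing(df.Budget))

# The comparison must be broadcast with a dot; the missing budget
# produces a missing entry in the mask rather than true or false
mask = df.Budget .>= avg

# Option 1: treat missing as false, then index the rows
by_mask = df[coalesce.(mask, false), :]

# Option 2: subset() with skipmissing = true drops rows whose condition is missing
by_subset = subset(df, :Budget => ByRow(b -> b >= avg); skipmissing = true)
```

Both options keep rows "C" and "D". Writing `df.Budget >= avg` without the dot raises an error, because it compares a whole vector with a single number.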

using Tidier
using RDatasets

movies = dataset("ggplot2", "movies");

Let's take a look at the movies whose budget was at or above the average. The budget is first converted to millions, and we keep only the first 5 rows for the sake of brevity.

@chain movies begin
    @mutate(Budget = Budget / 1_000_000)          # express the budget in millions
    @filter(Budget >= mean(skipmissing(Budget)))  # keep rows at or above the mean
    @select(Title, Budget)
    @slice(1:5)                                   # first 5 rows only
end

5×2 DataFrame
 Row │ Title                       Budget
     │ String                      Float64?
─────┼─────────────────────────────────────
   1 │ 'Til There Was You              23.0
   2 │ 10 Things I Hate About You      16.0
   3 │ 102 Dalmatians                  85.0
   4 │ 13 Going On 30                  37.0
   5 │ 13th Warrior, The               85.0
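The filter above wraps Budget in skipmissing() before taking the mean. A minimal sketch of why this matters, using a hypothetical toy vector `budget` rather than the real dataset:

```julia
using Statistics

# Hypothetical budgets (in millions), one of them missing
budget = [23.0, missing, 85.0, 16.0]

# A single missing value makes the plain mean missing
plain_mean = mean(budget)

# skipmissing() iterates over the non-missing values only
clean_mean = mean(skipmissing(budget))

# Comparing against a missing mean yields missing for every element
all_missing = all(ismissing, budget .>= plain_mean)
```

Since rows whose condition evaluates to missing are dropped, a filter written against the plain mean would be expected to return no rows at all.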

Now let's see how to use @filter() with in. Here's an example with a tuple.

@chain movies begin
    @filter(Title in ("101 Dalmatians", "102 Dalmatians"))
    @select(1:5)
end

2×5 DataFrame
 Row │ Title           Year   Length  Budget    Rating
     │ String          Int32  Int32   Int32?    Float64
─────┼─────────────────────────────────────────────────
   1 │ 101 Dalmatians   1996     103   missing      5.5
   2 │ 102 Dalmatians   2000     100  85000000      4.7

We can also use @filter() with in and a vector, which is written with square brackets [].

@chain movies begin
    @filter(Title in ["101 Dalmatians", "102 Dalmatians"])
    @select(1:5)
end

2×5 DataFrame
 Row │ Title           Year   Length  Budget    Rating
     │ String          Int32  Int32   Int32?    Float64
─────┼─────────────────────────────────────────────────
   1 │ 101 Dalmatians   1996     103   missing      5.5
   2 │ 102 Dalmatians   2000     100  85000000      4.7
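For reference, a membership test like the two above can be written by hand in base Julia roughly as follows. The vector `titles` and the other variable names are hypothetical, and this sketch only shows the manual broadcast form; it makes no claim about what @filter() generates internally.

```julia
# Hypothetical sample of titles
titles = ["101 Dalmatians", "13 Going On 30", "102 Dalmatians"]

wanted_tuple = ("101 Dalmatians", "102 Dalmatians")
wanted_vector = ["101 Dalmatians", "102 Dalmatians"]

# Ref() marks the collection as a single value, so only `titles` is
# broadcast over and each title is checked against the whole collection
mask_tuple = in.(titles, Ref(wanted_tuple))
mask_vector = in.(titles, Ref(wanted_vector))

# Keep the titles that are members of the collection
matches = titles[mask_tuple]
```

The tuple and the vector give the same mask here, which is consistent with the two @filter() calls returning identical results.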

This page was generated using Literate.jl.