Skip to content

Reproduce a duckplyr example

In this example, we will reproduce a DuckDB and duckplyr blog post example to demonstrate TidierDB's v0.5.0 capability.

The example by Hannes that is being reproduced is exploring Open Data from the New Zealand government that is ~ 1GB.

Set up¤

First we will set up the local duckdb database and pull in the metadata for the files. Notice we are not reading this data into memory, only the paths and and column, and table names. To follow along, copy the set up code below after downloading the data, but add the directory to the local data.

import TidierDB as DB
db = DB.connect(DB.duckdb())

dir = "/Downloads/nzcensus/"
data   = dir * "Data8277.csv"
age    = dir * "DimenLookupAge8277.csv"
area   = dir * "DimenLookupArea8277.csv"
ethnic = dir * "DimenLookupEthnic8277.csv"
sex    = dir * "DimenLookupSex8277.csv"
year   = dir * "DimenLookupYear8277.csv"

data = DB.db_table(db, data);
age = DB.db_table(db, age);
area = DB.db_table(db, area);
ethnic = DB.db_table(db, ethnic);
sex = DB.db_table(db, sex);
year = DB.db_table(db, year);

Exploration¤

While this long chain could be broken up into multiple smaller chains, lets reproduce the duckplyr code from example and demonstrate how TidierDB also supports multiple joins after filtering, mutating, etc the joining tables. 6 different tables are being joined together through sequential inner joins.

@chain DB.t(data) begin
  DB.@filter(str_detect(count, r"^\d+$"))
  DB.@mutate(count_ = as_integer(count))
  DB.@filter(count_ > 0)
  DB.@inner_join(
    (@chain DB.t(age) begin
    DB.@filter(str_detect(Description, r"^\d+ years$"))
    DB.@mutate(age_ = as_integer(str_remove(Code, "years"))) end),
    Age = Code
  )
  DB.@inner_join((@chain DB.t(year) DB.@mutate(year_ = Description)), year = Code)
  DB.@inner_join((@chain DB.t(area) begin
    DB.@mutate(area_ = Description)
    DB.@filter(!str_detect(area_, r"^Total"))
  end)
    , Area = Code)
    DB.@inner_join((@chain DB.t(ethnic) begin
      DB.@mutate(ethnic_ = Description)
      DB.@filter(!str_detect( ethnic_, r"^Total",)) end), Ethnic = Code)
  DB.@inner_join((@chain DB.t(sex) begin
    DB.@mutate(sex_ = Description)
    DB.@filter(!str_detect( sex_, r"^Total"))
  end)
   , Sex = Code)
  DB.@inner_join((@chain DB.t(year) DB.@mutate(year_ = Description)), Year = Code)
  @aside DB.@show_query _
  DB.@create_view(joined_up)
end;

@chain DB.db_table(db, "joined_up") begin
  DB.@filter begin
    age_ >= 20
    age_ <= 40
    str_detect(area_, r"^Auckland")
    year_ == "2018"
    ethnic_ != "European"
    end
  DB.@group_by sex_
  DB.@summarise(group_count = sum(count_))
  DB.@collect
end

Results¤

When we collect this to a local dataframe, we can see that the results match the duckplyr/DuckDB example.

2×2 DataFrame
 Row │ sex_    group_count
     │ String  Int128
─────┼─────────────────────
   1 │ Female       398556
   2 │ Male         397326

This page was generated using Literate.jl.