Reusing a Query (and Views)
While using TidierDB, you may need to generate part of a query and reuse it multiple times. There are two ways to do this
from_query(query)
or its aliast(query)
@create_view(name)
Setup¤
import TidierDB as DB
con = DB.connect(DB.duckdb())
mtcars_path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
mtcars = DB.db_table(con, mtcars_path);
Start a query to analyze fuel efficiency by number of cylinders. However, to further build on this query later, end the chain without using @show_query
or @collect
query = DB.@chain DB.t(mtcars) begin
DB.@group_by cyl
DB.@summarize begin
across(mpg, (mean, minimum, maximum))
num_cars = n()
end
DB.@mutate begin
efficiency = case_when(
mpg_mean >= 25, "High",
mpg_mean >= 15, "Moderate",
"Low" )
end
end;
from_query()
or t(query)
¤
Now, from_query
, or t()
a convienece wrapper, will allow you to reuse the query to calculate the average horsepower for each efficiency category
DB.@chain DB.t(query) begin
DB.@left_join(DB.t(mtcars), cyl)
DB.@group_by(efficiency)
DB.@summarize(avg_hp = mean(hp))
DB.@collect
end
2×2 DataFrame
Row │ efficiency avg_hp
│ String? Float64?
─────┼──────────────────────
1 │ Moderate 180.238
2 │ High 82.6364
@create_view¤
This can also be done with @create_view
.
query2 = @chain DB.t(mtcars) DB.@filter(mpg>20) DB.@mutate(mpg = mpg *4);
DB.@chain DB.t(mtcars) begin
DB.@group_by cyl
DB.@summarize begin
across(mpg, (mean, minimum, maximum))
num_cars = n()
end
DB.@mutate begin
efficiency = case_when(
mpg_mean >= 25, "High",
mpg_mean >= 15, "Moderate",
"Low" )
end
DB.@create_view(viewer)
end;
DB.@chain DB.db_table(con, "viewer") begin
DB.@left_join(DB.t(query2), cyl == cyl)
DB.@summarize(avg_mean = mean(mpg), _by = efficiency)
DB.@mutate(mean = avg_mean / 4 )
@aside DB.@show_query _
DB.@collect
end
2×3 DataFrame
Row │ efficiency avg_mean mean
│ String Float64 Float64
─────┼───────────────────────────────
1 │ High 106.655 26.6636
2 │ Moderate 84.5333 21.1333
Preview or save an intermediate table¤
While querying a dataset, you may wish to see an intermediate table, or even save it. You can use @aside
and from_query(_)
, illustrated below, to do just that. While we opted to print the results in this simple example below, we could have saved them by using name = DB.@chain...
import ClickHouse;
conn = conn = DB.connect(DB.clickhouse(); host="localhost", port=19000, database="default", user="default", password="")
path = "https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"
DB.@chain DB.db_table(conn, path) begin
DB.@count(artists)
@aside println(DB.@chain DB.from_query(_) DB.@head(5) DB.@collect)
DB.@arrange(desc(count))
DB.@collect
end
5×2 DataFrame
Row │ artists count
│ String? UInt64
─────┼─────────────────
1 │ missing 1
2 │ Wizo 3
3 │ MAGIC! 3
4 │ Macaco 1
5 │ SOYOU 1
31438×2 DataFrame
Row │ artists count
│ String? UInt64
───────┼─────────────────────────
1 │ The Beatles 279
2 │ George Jones 271
3 │ Stevie Wonder 236
4 │ Linkin Park 224
5 │ Ella Fitzgerald 222
6 │ Prateek Kuhad 217
7 │ Feid 202
⋮ │ ⋮ ⋮
31432 │ Leonard 1
31433 │ marcos g 1
31434 │ BLVKSHP 1
31435 │ Memtrix 1
31436 │ SOYOU 1
31437 │ Macaco 1
31438 │ missing 1
31424 rows omitted
This page was generated using Literate.jl.