Working With Larger than RAM Datasets
While using the DuckDB backend, TidierDB's lazy intferace enables querying datasets larger than your available RAM.
To illustrate this, we will recreate the Hugging Face x Polars example. The final table results are shown below and in this Hugging Face x DuckDB example
First we will load TidierDB, set up a local database and then set the URLs for the 2 training datasets from huggingface.co
using TidierDB
db = connect(duckdb())
urls = ["https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet",
"https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet"];
Here, we pass the vector of URLs to db_table
, which will not copy them into memory. Since these datasets are so large, we will also set stream = true
in @collect
to stream the results. If we wanted to read all the files in the folder we could have replace the 0000
with *
(wildcard) db_table(db, "Path/to/folder/*.parquet")
Of note, reading these files from URLs is not as rapid as reading them from local files.
@chain db_table(db, urls) begin
@group_by(horoscope)
@summarise(count = n(), avg_blog_length = mean(length(text)))
@arrange(desc(count))
@aside @show_query _
@collect(stream = true)
end
Placing @aside @show_query _
before @collect
above lets us see the SQL query and collect it to a local DataFrame at the same time.
SELECT horoscope, COUNT(*) AS count, AVG(length(text)) AS avg_blog_length
FROM read_parquet(['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet', 'https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet'])
GROUP BY horoscope
ORDER BY avg_blog_length DESC
12×3 DataFrame
Row │ horoscope count avg_blog_length
│ String? Int64? Float64?
─────┼──────────────────────────────────────
1 │ Aquarius 49568 1125.83
2 │ Cancer 63512 1097.96
3 │ Libra 60304 1060.61
4 │ Capricorn 49402 1059.56
5 │ Sagittarius 50431 1057.46
6 │ Leo 58010 1049.6
7 │ Taurus 61571 1022.69
8 │ Gemini 52925 1020.26
9 │ Scorpio 56495 1014.03
10 │ Pisces 53812 1011.75
11 │ Virgo 64629 996.684
12 │ Aries 69134 918.081
To learn more about memory efficient queries on larger than RAM files, this blog from DuckDB will help maximize your local db
This page was generated using Literate.jl.