Getting Started
If you plan to use TidierDB.jl with TidierData.jl or Tidier.jl, it is most convenenient to load the packages as follows:
using TidierData
import TidierDB as DB
Alternatively, using Tidier
will import TidierDB in the above manner for you, where TidierDB functions and macros will be available as DB.@mutate()
and so on, and the TidierData equivalent would be @mutate()
.
Connecting¤
To use TidierDB, a connection must first be established. To connect to a database, you can uset the connect
function as shown below, or establish your own connection through the respecitve libraries.
For example Connecting to DuckDB
conn = DB.connect(DB.duckdb())
versus connecting to MySQL
conn = DB.connect(DB.mysql(); host="localhost", user="root", password="password", db="mydb")
The connect
docstring has many examples for how to use the connect function to connect to various backends or to MotherDuck.
Connect to a local database file with DuckDB¤
You can also connect to an existing database by passing the database file path as a string.
db = DB.connect(DB.duckdb(), "path/to/mydb.duckdb")
You can also establish any DuckDB connection through an alternate method that you prefer, and use that as your connection as well.
Package Extensions¤
The following backends utilize package extensions. To use one of backends listed below, you will need to write using Library
- ClickHouse:
import ClickHouse
- MySQL and MariaDB:
using MySQL
- MSSQL:
using ODBC
- Postgres:
using LibPQ
- SQLite:
using SQLite
- Athena:
using AWS
- Oracle:
using ODBC
- Google BigQuery:
using GoogleCloud
db_table
¤
What does db_table
do?
db_table
starts the underlying SQL query struct, in addition to pulling the table metadata and storing it there. Storing metadata is what enables a lazy interface that also supports tidy selection.
db_table
has two required arguments:connection
andtable
-
table
can be a table name on a database or a path/url to file to read. When passingdb_table
a path or url, the table is not copied into memory.- Of note,
db_table
only support direct file paths to a table. It does not support database file paths such asdbname.duckdb
ordbname.sqlite
. Such files must be used withconnect
first. - With DuckDB and ClickHouse, if you have a folder of multiple files to read, you can use
*
read in all files matching the pattern, with an optionalalias
argument for what the data should be referred to. - For example, the below would read all files that end in
.csv
in the given folder.
- Of note,
db_table(db, "folder/path/*.csv")
db_table
also supports iceberg, delta, and S3 file paths via DuckDB.
Minimizing Compute Costs and Keystrokes¤
If you are working with a backend where compute cost is important, it will be important to minimize using db_table
as this will requery for metadata each time. Compute costs are relevant to backends such as AWS, databricks and Snowflake.
To do this, save the results of db_table
and use them with t
. Using t
pulls the relevant information (metadata, con, etc) from the mutable SQLquery struct, allowing you to repeatedly query and collect the table without requerying for the metadata each time
!Tip:
t()
is an alias forfrom_query
This means after saving the results ofdb_table
, uset(table)
to refer to the table or prior query
table = DB.db_table(con, "path")
@chain DB.t(table) begin
## data wrangling here
end
This page was generated using Literate.jl.