Dataframes are one of the most important objects in data science.
A dataframe is a table where each row is an observation and each column is a variable.
A dataframe df is a list of vectors, all with the same length.
A column of df is just one of its vectors.
The i-th row of df is the vector formed by the i-th coordinate of each of its columns.
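The description above can be sketched with a tiny hand-made table. The values and the variable names (mass, row_as_vector) are made up for illustration; only DataFrame, eachcol and the indexing syntax come from DataFrames.jl.

```julia
using DataFrames

# A small toy table (made-up values): two columns of equal length.
df = DataFrame(
    species = ["Adelie", "Gentoo", "Chinstrap"],
    body_mass_g = [3750, 5000, 3800],
)

# A column is one of the underlying vectors.
mass = df.body_mass_g

# The i-th row collects the i-th entry of every column.
i = 2
row_as_vector = [col[i] for col in eachcol(df)]

# DataFrames.jl also gives direct access to a row.
row = df[i, :]
```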
We will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.
using DataFrames, PalmerPenguins
using Tidier, Chain
import DataFramesMeta as DFM
penguins = PalmerPenguins.load() |> DataFrame
344×7 DataFrame
 Row │ species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     │ String15   String15   Float64?        Float64?       Int64?             Int64?       String7?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Adelie     Torgersen            39.1           18.7                181         3750  male
   2 │ Adelie     Torgersen            39.5           17.4                186         3800  female
   3 │ Adelie     Torgersen            40.3           18.0                195         3250  female
   4 │ Adelie     Torgersen         missing        missing            missing      missing  missing
   5 │ Adelie     Torgersen            36.7           19.3                193         3450  female
   6 │ Adelie     Torgersen            39.3           20.6                190         3650  male
   7 │ Adelie     Torgersen            38.9           17.8                181         3625  female
   8 │ Adelie     Torgersen            39.2           19.6                195         4675  male
   9 │ Adelie     Torgersen            34.1           18.1                193         3475  missing
  10 │ Adelie     Torgersen            42.0           20.2                190         4250  missing
  11 │ Adelie     Torgersen            37.8           17.1                186         3300  missing
  12 │ Adelie     Torgersen            37.8           17.3                180         3700  missing
  13 │ Adelie     Torgersen            41.1           17.6                182         3200  female
  ⋮  │     ⋮          ⋮            ⋮               ⋮                ⋮               ⋮          ⋮
 333 │ Chinstrap  Dream                45.2           16.6                191         3250  female
 334 │ Chinstrap  Dream                49.3           19.9                203         4050  male
 335 │ Chinstrap  Dream                50.2           18.8                202         3800  male
 336 │ Chinstrap  Dream                45.6           19.4                194         3525  female
 337 │ Chinstrap  Dream                51.9           19.5                206         3950  male
 338 │ Chinstrap  Dream                46.8           16.5                189         3650  female
 339 │ Chinstrap  Dream                45.7           17.0                195         3650  female
 340 │ Chinstrap  Dream                55.8           19.8                207         4000  male
 341 │ Chinstrap  Dream                43.5           18.1                202         3400  female
 342 │ Chinstrap  Dream                49.6           18.2                193         3775  male
 343 │ Chinstrap  Dream                50.8           19.0                210         4100  male
 344 │ Chinstrap  Dream                50.2           18.7                198         3775  female
                                                                                    319 rows omitted
Libraries
DataFrames
DataFrames.jl is the main package for working with dataframes in Julia. You can use it directly to manipulate tables, but there are also two alternatives built on top of it: DataFramesMeta and Tidier.
DataFramesMeta
DataFramesMeta is a collection of macros based on DataFrames. It provides many syntactic helpers to slice rows, create columns and summarise data.
Tidier
Tidier is inspired by the tidyverse ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this “tidy” heritage, we will often mention the R packages that inspired the Julia ones (such as dplyr and tidyr).
In this book, whenever possible, we will show the different approaches in a tabset so you can compare them, with more emphasis on Tidier.
Operations
Let’s start with some unary operations, i.e., operations that take only one dataframe as input and return one dataframe as output. We can divide these operations into a few categories:
Row operations
These are operations that only affect rows, leaving all columns untouched.
Filtering or subsetting is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.
Arranging or ordering is when we reorder the rows of a dataframe using some criteria.
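As a minimal sketch of the two row operations in plain DataFrames.jl, using a made-up four-row table rather than the full penguins data: subset keeps the rows that satisfy every condition, and sort reorders them. The names male_adelie and by_mass are illustrative only.

```julia
using DataFrames

# Toy data (made-up values, no missing entries).
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    sex = ["male", "female", "male", "male"],
    body_mass_g = [3750, 3800, 5000, 3650],
)

# Filtering: keep only the male Adelie penguins.
male_adelie = subset(
    df,
    :species => ByRow(==("Adelie")),
    :sex => ByRow(==("male")),
)

# Arranging: reorder the rows by body mass, lightest first.
by_mass = sort(df, :body_mass_g)
```

Both results keep the same columns as df; only the set or the order of rows changes.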
Column operations
These are operations that only affect columns, leaving all rows untouched.
Selecting is when we select some columns of a dataframe, while keeping all the rows. Example: select the species and sex columns.
Mutating or transforming is when we create new columns. Example: a new column body_mass_kg can be obtained by dividing the column body_mass_g by 1000.
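The two column operations can be sketched in plain DataFrames.jl as follows. The table is a made-up toy example, and species_sex and with_kg are illustrative names.

```julia
using DataFrames

# Toy data (made-up values).
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo"],
    sex = ["male", "female", "male"],
    body_mass_g = [3750, 3800, 5000],
)

# Selecting: keep only the species and sex columns, with all rows.
species_sex = select(df, :species, :sex)

# Mutating: add body_mass_kg, computed row by row from body_mass_g.
with_kg = transform(df, :body_mass_g => ByRow(g -> g / 1000) => :body_mass_kg)
```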
Reshaping operations
These operations change the shape of a dataframe, making it wider or longer.
Widening
Lengthening
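A small sketch of both directions in plain DataFrames.jl, where lengthening is done with stack and widening with unstack. The id column and the values are made up for illustration.

```julia
using DataFrames

# Toy wide table: one row per penguin, one column per measurement.
wide = DataFrame(
    id = [1, 2, 3],
    bill_length_mm = [39.1, 39.5, 40.3],
    bill_depth_mm = [18.7, 17.4, 18.0],
)

# Lengthening: one row per (id, measurement) pair,
# stored in the columns `variable` and `value`.
long = stack(wide, [:bill_length_mm, :bill_depth_mm])

# Widening: back to one column per measurement.
wide_again = unstack(long, :id, :variable, :value)
```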
Grouping operations
Grouping is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by species gives us 3 dataframes, each with only one species.
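In plain DataFrames.jl, grouping is done with groupby, which returns a GroupedDataFrame that behaves like a collection of sub-dataframes. A minimal sketch on made-up data:

```julia
using DataFrames

# Toy data (made-up values) with three species.
df = DataFrame(
    species = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"],
    body_mass_g = [3750, 5000, 3800, 3650, 5400],
)

# Split the table by species.
groups = groupby(df, :species)

# One group per species.
n_groups = length(groups)

# Look up a single group by the value of its grouping column.
adelie = groups[(species = "Adelie",)]
```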
Mixed operations
These operations can possibly change rows and columns at the same time.
Distinct;
Counting;
Summarising or combining is when we apply some function to some columns in order to reduce the number of rows to some kind of summary (like a mean, median, max, and so on). Example: for each species, apply the mean function to the column body_mass_g. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated separately for each of the groups.
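The three mixed operations can be sketched in plain DataFrames.jl with unique, nrow and combine. The data and the result names (distinct_pairs, counts, mean_mass) are made up for illustration.

```julia
using DataFrames, Statistics

# Toy data (made-up values).
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    island = ["Torgersen", "Torgersen", "Biscoe", "Biscoe", "Dream"],
    body_mass_g = [3750, 3800, 5000, 5400, 3650],
)

# Distinct: one row per unique (species, island) combination.
distinct_pairs = unique(select(df, :species, :island))

# Counting: number of rows for each species.
counts = combine(groupby(df, :species), nrow => :n)

# Summarising: mean body mass for each species.
mean_mass = combine(groupby(df, :species), :body_mass_g => mean => :mean_body_mass_g)
```

In each case the result has fewer rows than df, and its columns differ from the original ones.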
??? Keep grouping and summarising together?
Since all these operations return a dataframe (or an array of dataframes, in the case of grouping), we can chain them together, with the convention that on a grouped dataframe the operation is applied to each of the groups.
Now for binary operations (i.e., operations that take two dataframes as input), we have all the joins:
Left join;
Right join;
Inner join;
Outer join;
Full join.
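A minimal sketch of the joins in plain DataFrames.jl, using two made-up tables that share an id column. Note that DataFrames.jl has a single function, outerjoin, for what dplyr calls full_join.

```julia
using DataFrames

# Two toy tables (made-up values) that overlap on ids 2 and 3.
penguin_info = DataFrame(id = [1, 2, 3], species = ["Adelie", "Gentoo", "Chinstrap"])
weights = DataFrame(id = [2, 3, 4], body_mass_g = [5000, 3650, 4100])

# Left join: every row of the left table, matched where possible.
left = leftjoin(penguin_info, weights, on = :id)

# Right join: every row of the right table, matched where possible.
right = rightjoin(penguin_info, weights, on = :id)

# Inner join: only the ids present in both tables.
inner = innerjoin(penguin_info, weights, on = :id)

# Outer (full) join: the ids present in either table.
outer = outerjoin(penguin_info, weights, on = :id)
```

Unmatched cells are filled with missing, for example body_mass_g for id 1 in the left join.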
Comparing Tidier with DataFramesMeta
The following table lists the corresponding operations in each package:

dplyr      Tidier      DataFramesMeta            DataFrames
filter     @filter     @subset / @rsubset        subset
arrange    @arrange    @orderby / @rorderby      sort!
select     @select     @select                   array syntax
mutate     @mutate     @transform / @rtransform  array syntax
group_by   @group_by   @groupby                  groupby
summarise  @summarise  @combine                  combine
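To make the first row of the table concrete, here is a hedged sketch of the same filter written in the three styles, assuming all three packages are installed. The toy data and the result names are made up; DataFramesMeta is imported as DFM, as at the beginning of the chapter.

```julia
using DataFrames
using Tidier
import DataFramesMeta as DFM

# Toy data (made-up values).
df = DataFrame(
    species = ["Adelie", "Gentoo", "Adelie"],
    body_mass_g = [3750, 5000, 3800],
)

# DataFrames.jl: column => function pairs.
via_dataframes = subset(df, :species => ByRow(==("Adelie")))

# DataFramesMeta: columns are written as symbols inside the macro.
via_dfm = DFM.@rsubset(df, :species == "Adelie")

# Tidier: columns are written as bare names, as in dplyr.
via_tidier = @filter(df, species == "Adelie")
```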
It is clear that for those coming from R, Tidier will look like the most natural approach.
Notice that we have a name clash with @select: that is why we import DataFramesMeta as DFM at the beginning.
We will see each operation in more detail in the following chapters.
Chaining operations
We can chain (or pipe) dataframe operations as follows with the @chain macro:
@chain penguins begin
    @filter !ismissing(sex)
    @group_by sex
    @summarise mean = mean(bill_length_mm)
    @arrange mean
end
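For comparison, a similar pipeline can be sketched with @chain and plain DataFrames.jl functions. @chain passes the result of each line as the first argument of the next call. The data is a made-up five-row table, and the column name mean_bill_length_mm is chosen here only for illustration.

```julia
using DataFrames, Chain, Statistics

# Toy data (made-up values) with one missing sex.
df = DataFrame(
    sex = ["male", "female", missing, "male", "female"],
    bill_length_mm = [39.1, 39.5, 42.0, 49.3, 45.2],
)

result = @chain df begin
    dropmissing(:sex)                                         # drop rows without a sex
    groupby(:sex)                                             # one group per sex
    combine(:bill_length_mm => mean => :mean_bill_length_mm)  # mean for each group
    sort(:mean_bill_length_mm)                                # smallest mean first
end
```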
Using variables as column names
In Tidier, column names are written as if they were variables in the environment, which leads to some complications when we want to use other variables that are not column names.
For example, suppose you want to arrange penguins by a column that is stored in a variable.
When this happens, we add @eval before the Tidier code and add a $ to force evaluation of the variable, as in the following example: