Part 2: Dataframes

Dataframes are one of the most important objects in data science.

A dataframe is a table where each row is an observation and each column is a variable.

A dataframe df is a list of vectors, all with the same length.

A column of df is just one of its vectors.

The i-th row of df is the vector formed by the i-th coordinate of each of its columns.
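The definition above can be checked on a tiny table. The following is a minimal sketch using DataFrames.jl; the dataframe `df` and its values are hypothetical toy data, not taken from the penguins dataset.

```julia
using DataFrames

# Toy data (hypothetical values): two columns of equal length.
df = DataFrame(name = ["a", "b", "c"], mass = [3.5, 4.0, 2.8])

# Every column is a vector with as many entries as the dataframe has rows.
same_length = all(length(col) == nrow(df) for col in eachcol(df))

# A column is just one of these vectors.
mass_column = df.mass

# The second row is formed by the second entry of each column.
second_row = df[2, :]                               # a DataFrameRow
second_entries = [col[2] for col in eachcol(df)]    # the same values, built by hand
```

Here `second_row.name` and `second_entries[1]` refer to the same value, which is exactly the "i-th coordinate of each column" description of a row.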

We will use the Palmer Penguins dataset as a toy example for the remainder of the chapter.

using DataFrames, PalmerPenguins
using Tidier, Chain
import DataFramesMeta as DFM

penguins = PalmerPenguins.load() |> DataFrame
344×7 DataFrame
319 rows omitted
 Row  species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
      String15   String15   Float64?        Float64?       Int64?             Int64?       String7?
   1  Adelie     Torgersen  39.1            18.7           181                3750         male
   2  Adelie     Torgersen  39.5            17.4           186                3800         female
   3  Adelie     Torgersen  40.3            18.0           195                3250         female
   4  Adelie     Torgersen  missing         missing        missing            missing      missing
   5  Adelie     Torgersen  36.7            19.3           193                3450         female
   6  Adelie     Torgersen  39.3            20.6           190                3650         male
   7  Adelie     Torgersen  38.9            17.8           181                3625         female
   8  Adelie     Torgersen  39.2            19.6           195                4675         male
   9  Adelie     Torgersen  34.1            18.1           193                3475         missing
  10  Adelie     Torgersen  42.0            20.2           190                4250         missing
  11  Adelie     Torgersen  37.8            17.1           186                3300         missing
  12  Adelie     Torgersen  37.8            17.3           180                3700         missing
  13  Adelie     Torgersen  41.1            17.6           182                3200         female
   ⋮
 333  Chinstrap  Dream      45.2            16.6           191                3250         female
 334  Chinstrap  Dream      49.3            19.9           203                4050         male
 335  Chinstrap  Dream      50.2            18.8           202                3800         male
 336  Chinstrap  Dream      45.6            19.4           194                3525         female
 337  Chinstrap  Dream      51.9            19.5           206                3950         male
 338  Chinstrap  Dream      46.8            16.5           189                3650         female
 339  Chinstrap  Dream      45.7            17.0           195                3650         female
 340  Chinstrap  Dream      55.8            19.8           207                4000         male
 341  Chinstrap  Dream      43.5            18.1           202                3400         female
 342  Chinstrap  Dream      49.6            18.2           193                3775         male
 343  Chinstrap  Dream      50.8            19.0           210                4100         male
 344  Chinstrap  Dream      50.2            18.7           198                3775         female

Libraries

DataFrames

DataFrames.jl is the main package for working with dataframes in Julia. You can use it directly to manipulate tables, but there are also two alternatives built on top of it: DataFramesMeta and Tidier.

DataFramesMeta

DataFramesMeta is a collection of macros based on DataFrames. It provides many syntactic helpers to slice rows, create columns and summarise data.

Tidier

Tidier is inspired by the tidyverse ecosystem in R. Tidier uses macros to rewrite your code into DataFrames.jl code. Because of this “tidy” heritage, we will often mention the R packages that inspired the Julia ones (like dplyr, tidyr and many others).

In this book, whenever possible, we will show the different approaches in a tabset so you can compare them, with more emphasis on Tidier.

Operations

Let’s start with some unary operations, i.e., operations that take only one dataframe as input and return one dataframe as output [1]. We can divide these operations into a few categories:

Row operations

These are operations that only affect rows, leaving all columns untouched.

  • Filtering or subsetting is when we select a subset of rows based on some criteria. Example: all male penguins of species Adelie. The output is a dataframe with the exact same columns, but possibly fewer rows.

  • Arranging or ordering is when we reorder the rows of a dataframe using some criteria.
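The two row operations above can be sketched with plain DataFrames.jl functions (`subset` and `sort`). The dataframe below is hypothetical toy data that only borrows the penguins column names.

```julia
using DataFrames

# Toy data (hypothetical values) with penguin-like columns.
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    sex = ["male", "female", "male", "male"],
    body_mass_g = [3750, 3800, 5000, 3650],
)

# Filtering: keep only the male Adelie rows; the columns are unchanged.
male_adelie = subset(df, :species => ByRow(==("Adelie")), :sex => ByRow(==("male")))

# Arranging: reorder the rows by body mass, lightest first.
by_mass = sort(df, :body_mass_g)
```

Both results keep the same three columns as `df`; only the set of rows or their order changes.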

Column operations

These are operations that only affect columns, leaving all rows untouched.

  • Selecting is when we select some columns of a dataframe, while keeping all the rows. Example: select the species and sex columns.

  • Mutating or transforming is when we create new columns. Example: a new column body_mass_kg can be obtained by dividing the column body_mass_g by 1000.
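As a minimal sketch of the two column operations, here is how they look with the DataFrames.jl functions `select` and `transform`. The data is a hypothetical toy table, and the name `body_mass_kg` follows the example in the text.

```julia
using DataFrames

# Toy data (hypothetical values) with penguin-like columns.
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo"],
    sex = ["male", "female", "male"],
    body_mass_g = [3750, 3800, 5000],
)

# Selecting: keep only two columns; every row is kept.
selected = select(df, :species, :sex)

# Mutating: add body_mass_kg, computed row by row from body_mass_g.
with_kg = transform(df, :body_mass_g => ByRow(g -> g / 1000) => :body_mass_kg)
```

Neither call modifies `df`; each returns a new dataframe with the same number of rows.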

Reshaping operations

These operations change the shape of a dataframe, making it wider or longer.

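  • Widening is when we spread the values of one column across several new columns, so the dataframe gets more columns and fewer rows.

  • Lengthening is the opposite: several columns are gathered into a pair of name and value columns, so the dataframe gets fewer columns and more rows.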
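To make the two reshaping directions concrete, here is a small sketch with the DataFrames.jl functions `stack` (towards a longer table) and `unstack` (towards a wider table). The `id` column and the column names `measure` and `value` are assumptions made for this illustration.

```julia
using DataFrames

# A small "wide" table: one row per individual, one column per measurement.
wide = DataFrame(id = [1, 2], bill_length_mm = [39.1, 39.5], bill_depth_mm = [18.7, 17.4])

# Making it longer: one row per (individual, measurement) pair.
long = stack(wide, [:bill_length_mm, :bill_depth_mm];
             variable_name = :measure, value_name = :value)

# Making it wider again: one column per distinct value of :measure.
back = unstack(long, :id, :measure, :value)
```

The long version has 4 rows and 3 columns (`id`, `measure`, `value`), and unstacking it recovers the original 2×3 layout.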

Grouping operations

  • Grouping is when we split the dataframe into a collection (array) of dataframes using some criteria. Example: grouping by species gives us 3 dataframes, each with only one species.
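A minimal sketch of grouping with the DataFrames.jl function `groupby`, on hypothetical toy data. The result is a `GroupedDataFrame`, which behaves like a collection of sub-dataframes, one per group.

```julia
using DataFrames

# Toy data (hypothetical values): three species, five rows.
df = DataFrame(
    species = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"],
    body_mass_g = [3750, 5000, 3800, 3650, 5200],
)

# Split the rows into one group per species.
groups = groupby(df, :species)

# Each group is itself a (sub-)dataframe containing a single species.
adelie = groups[("Adelie",)]
group_sizes = [nrow(g) for g in groups]
```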

Mixed operations

These operations can possibly change rows and columns at the same time.

  • Distinct is when we keep only the unique rows (possibly with respect to some columns), dropping the duplicates;
  • Counting is when we count how many rows there are for each value (or combination of values) of some columns;
  • Summarising or combining is when we apply some function to some columns in order to reduce the number of rows to some kind of summary (like a mean, median, max, and so on). Example: for each species, apply the mean function to the column body_mass_g. This will yield a dataframe with 3 rows, one for each species. Summarising is usually done after a grouping, so the summary is calculated for each of the groups.
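These three operations can be sketched in plain DataFrames.jl with `unique`, `groupby` and `combine`. The data is a hypothetical toy table, and the output column names `n` and `mean_mass` are chosen only for this illustration.

```julia
using DataFrames, Statistics

# Toy data (hypothetical values): three species, five rows.
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    body_mass_g = [3750, 3800, 5000, 5200, 3650],
)

# Distinct: keep the first row found for each distinct species.
distinct_species = unique(df, :species)

# Counting: number of rows per species.
counts = combine(groupby(df, :species), nrow => :n)

# Summarising: mean body mass per species, one row per group.
means = combine(groupby(df, :species), :body_mass_g => mean => :mean_mass)
```

In all three cases the result has one row per species, so both the rows and the columns differ from those of `df`.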

Since all these operations return a dataframe (or a collection of dataframes, in the case of grouping), we can chain them together, with the convention that on a grouped dataframe each operation is applied to each one of the groups.

Now for binary operations (i.e., operations that take two dataframes as input), we have all the joins:

  • Left join;
  • Right join;
  • Inner join;
  • Outer join (also called full join).
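As a rough sketch of how the joins differ, the following uses the DataFrames.jl functions `leftjoin`, `rightjoin`, `innerjoin` and `outerjoin` on two hypothetical toy tables that share an `id` key. In DataFrames.jl, `outerjoin` plays the role of dplyr's `full_join`.

```julia
using DataFrames

# Two toy tables (hypothetical values) sharing the key column :id.
left = DataFrame(id = [1, 2, 3], species = ["Adelie", "Gentoo", "Chinstrap"])
right = DataFrame(id = [2, 3, 4], island = ["Biscoe", "Dream", "Torgersen"])

lj = leftjoin(left, right; on = :id)    # every id from `left`
rj = rightjoin(left, right; on = :id)   # every id from `right`
ij = innerjoin(left, right; on = :id)   # only ids present in both
oj = outerjoin(left, right; on = :id)   # ids present in either table
```

Keys without a match on the other side get `missing` in the columns coming from that side; for example, the row with `id == 1` in `lj` has a missing `island`.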

Comparing Tidier with DataFramesMeta

The following table lists the name of each operation in each package:

dplyr      Tidier      DataFramesMeta            DataFrames
filter     @filter     @subset / @rsubset        subset
arrange    @arrange    @orderby / @rorderby      sort / sort!
select     @select     @select                   select or array syntax
mutate     @mutate     @transform / @rtransform  transform or array syntax
group_by   @group_by   @groupby                  groupby
summarise  @summarise  @combine                  combine
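To see how two columns of this table relate, here is a small sketch that runs the same operations with DataFramesMeta macros and with plain DataFrames.jl functions. It assumes a standalone session in which only DataFramesMeta is loaded (with `using`, so there is no clash with Tidier), and the data is a hypothetical toy table.

```julia
using DataFrames, DataFramesMeta, Statistics

# Toy data (hypothetical values) with penguin-like columns.
df = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo"],
    sex = ["male", "female", "male"],
    body_mass_g = [3750, 3800, 5000],
)

# filter: row-wise macro versus function with an explicit ByRow wrapper.
males_dfm = @rsubset(df, :sex == "male")
males_df = subset(df, :sex => ByRow(==("male")))

# mutate: new column computed row by row.
kg_dfm = @rtransform(df, :body_mass_kg = :body_mass_g / 1000)
kg_df = transform(df, :body_mass_g => ByRow(g -> g / 1000) => :body_mass_kg)

# group_by + summarise: one mean per species.
means_dfm = @combine(groupby(df, :species), :mean_mass = mean(:body_mass_g))
means_df = combine(groupby(df, :species), :body_mass_g => mean => :mean_mass)
```

The `r`-prefixed macros (`@rsubset`, `@rtransform`) operate row by row, which is what the explicit `ByRow` wrapper does on the DataFrames.jl side.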

It is clear that for those coming from R, Tidier will look like the most natural approach.

Notice that we have a name clash with @select: that is why we import DataFramesMeta as DFM at the beginning.

We will see each operation in more detail in the following chapters.

Chaining operations

We can chain (or pipe) dataframe operations as follows with the @chain macro:

@chain penguins begin
    @filter !ismissing(sex)
    @group_by sex
    @summarise mean = mean(bill_length_mm)
    @arrange mean
end
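For comparison, the same kind of pipeline can be sketched with `@chain` and plain DataFrames.jl functions instead of Tidier macros. `@chain` passes the result of each step as the first argument of the next call. The data below is a hypothetical toy table, and `dropmissing` stands in for the `!ismissing(sex)` filter.

```julia
using DataFrames, Chain, Statistics

# Toy data (hypothetical values); one row has a missing sex.
df = DataFrame(
    sex = ["male", "female", missing, "female", "male"],
    bill_length_mm = [40.0, 38.0, 36.0, 36.0, 44.0],
)

result = @chain df begin
    dropmissing(:sex)                          # drop rows where sex is missing
    groupby(:sex)                              # one group per sex
    combine(:bill_length_mm => mean => :mean)  # mean bill length per group
    sort(:mean)                                # smallest mean first
end
```

Each line inside the block is an ordinary function call with its first argument omitted, so the block reads top to bottom in the order the operations are applied.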

Using variables as column names

In Tidier, column names are written as if they were variables in the environment. This leads to some complications when we want to use other variables that are not column names.

For example, suppose you want to arrange penguins by a column that is stored in a variable.

When this happens, we add @eval before the Tidier code and prefix the variable with $ to force its evaluation, as in the following example:

my_arrange_column = :body_mass_g;

@eval @arrange penguins $my_arrange_column
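For contrast, this complication does not arise with plain DataFrames.jl functions, because there column names are ordinary values (Symbols or strings) that can be stored in variables and passed around. A minimal sketch on hypothetical toy data:

```julia
using DataFrames

# Toy data (hypothetical values) with penguin-like columns.
df = DataFrame(species = ["Adelie", "Gentoo", "Chinstrap"], body_mass_g = [3750, 5000, 3650])

# The column name is stored in a regular variable...
my_arrange_column = :body_mass_g

# ...and passed directly to the function, with no interpolation needed.
sorted = sort(df, my_arrange_column)
one_column = df[:, my_arrange_column]
```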

Documentation

https://dataframes.juliadata.org/stable/man/working_with_dataframes/

https://juliadata.org/DataFramesMeta.jl/stable

https://tidierorg.github.io/TidierData.jl/latest/reference/


  1. Join operations will be dealt with later.