Movies dataset
To get started, we will load the movies dataset from the RDatasets.jl package.
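using Tidier     # provides the @chain and @slice macros used below
using RDatasets  # provides dataset()
movies = dataset("ggplot2", "movies");  # the trailing semicolon suppresses printed output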
To work with this dataset, we will use the @chain macro. This macro initiates a pipe, and every function or macro placed between the begin and end keywords operates on the result of the previous step, starting with the data frame named at the beginning of the pipe. You don't necessarily have to spread a chain over multiple lines of code, but when working with data frames it's often easiest to do so. Before going further, take a look at the Chain.jl GitHub page to see everything this macro makes possible, including mid-chain side effects using @aside and mid-chain assignment of variables.
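As a minimal sketch of how a chain passes results along, the example below loads Chain.jl directly and pipes a plain vector rather than the movies data, so it runs without any dataset. The input vector and the printed message are illustrative assumptions, not part of the original tutorial.

```julia
using Chain

result = @chain [1, 2, 3, 4, 5] begin
    filter(isodd, _)                        # `_` marks where the previous result is inserted
    @aside println("After filtering: ", _)  # side effect only; the filtered vector passes through unchanged
    sum                                     # a bare function name receives the previous result as its argument
end
```

Here `result` holds the sum of the odd numbers. The `@aside` line prints the intermediate vector without altering what the next step receives.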
Let's take a look at the first 5 rows of the movies dataset using @slice().
@chain movies begin
    @slice(1:5)
end
| Row | Title | Year | Length | Budget | Rating | Votes | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 | MPAA | Action | Animation | Comedy | Drama | Documentary | Romance | Short |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int32 | Int32 | Int32? | Float64 | Int32 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Cat… | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 |
| 1 | $ | 1971 | 121 | missing | 6.4 | 348 | 4.5 | 4.5 | 4.5 | 4.5 | 14.5 | 24.5 | 24.5 | 14.5 | 4.5 | 4.5 | | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | $1000 a Touchdown | 1939 | 71 | missing | 6.0 | 20 | 0.0 | 14.5 | 4.5 | 24.5 | 14.5 | 14.5 | 14.5 | 4.5 | 4.5 | 14.5 | | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | $21 a Day Once a Month | 1941 | 7 | missing | 8.2 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.5 | 0.0 | 44.5 | 24.5 | 24.5 | | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | $40,000 | 1996 | 70 | missing | 8.2 | 6 | 14.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.5 | 45.5 | | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | $50,000 Climax Show, The | 1975 | 71 | missing | 3.4 | 17 | 24.5 | 4.5 | 0.0 | 14.5 | 14.5 | 4.5 | 0.0 | 0.0 | 0.0 | 24.5 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
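For comparison, selecting rows by position can also be written in plain DataFrames.jl. The sketch below uses a small hand-built data frame (a hypothetical stand-in for the movies data, with made-up titles) so that it runs without RDatasets. It shows two ways to take the first five rows, which is roughly what @slice(1:5) does.

```julia
using DataFrames

# Hypothetical stand-in for the movies data, built by hand for illustration.
toy = DataFrame(
    Title = ["A", "B", "C", "D", "E", "F"],
    Year  = [1971, 1939, 1941, 1996, 1975, 2001],
)

by_index = toy[1:5, :]     # positional row indexing, keeping all columns
by_first = first(toy, 5)   # convenience function for the leading rows
```

Both calls return a new data frame containing rows 1 through 5.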
Let's use the describe() function, which is re-exported from the DataFrames.jl package, to describe the dataset.
describe(movies)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | Title | | $ | | xXx: State of the Union | 0 | String |
| 2 | Year | 1976.13 | 1893 | 1983.0 | 2005 | 0 | Int32 |
| 3 | Length | 82.3379 | 1 | 90.0 | 5220 | 0 | Int32 |
| 4 | Budget | 1.34125e7 | 0 | 3.0e6 | 200000000 | 53573 | Union{Missing, Int32} |
| 5 | Rating | 5.93285 | 1.0 | 6.1 | 10.0 | 0 | Float64 |
| 6 | Votes | 632.13 | 5 | 30.0 | 157608 | 0 | Int32 |
| 7 | R1 | 7.01438 | 0.0 | 4.5 | 100.0 | 0 | Float64 |
| 8 | R2 | 4.02238 | 0.0 | 4.5 | 84.5 | 0 | Float64 |
| 9 | R3 | 4.72116 | 0.0 | 4.5 | 84.5 | 0 | Float64 |
| 10 | R4 | 6.37485 | 0.0 | 4.5 | 100.0 | 0 | Float64 |
| 11 | R5 | 9.79669 | 0.0 | 4.5 | 100.0 | 0 | Float64 |
| 12 | R6 | 13.0392 | 0.0 | 14.5 | 84.5 | 0 | Float64 |
| 13 | R7 | 15.5481 | 0.0 | 14.5 | 100.0 | 0 | Float64 |
| 14 | R8 | 13.876 | 0.0 | 14.5 | 100.0 | 0 | Float64 |
| 15 | R9 | 8.95421 | 0.0 | 4.5 | 100.0 | 0 | Float64 |
| 16 | R10 | 16.854 | 0.0 | 14.5 | 100.0 | 0 | Float64 |
| 17 | MPAA | | | | R | 0 | CategoricalValue{String, UInt8} |
| 18 | Action | 0.0797442 | 0 | 0.0 | 1 | 0 | Int32 |
| 19 | Animation | 0.0627679 | 0 | 0.0 | 1 | 0 | Int32 |
| 20 | Comedy | 0.293784 | 0 | 0.0 | 1 | 0 | Int32 |
| 21 | Drama | 0.371011 | 0 | 0.0 | 1 | 0 | Int32 |
| 22 | Documentary | 0.0590597 | 0 | 0.0 | 1 | 0 | Int32 |
| 23 | Romance | 0.0806967 | 0 | 0.0 | 1 | 0 | Int32 |
| 24 | Short | 0.160883 | 0 | 0.0 | 1 | 0 | Int32 |
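When only a few of these summary columns are of interest, describe() in DataFrames.jl also accepts the names of the statistics to compute. The sketch below requests just the mean and the missing-value count on a small hand-built data frame (hypothetical values chosen for illustration, not drawn from the movies data).

```julia
using DataFrames

# Hypothetical toy data with a column containing missing values.
toy = DataFrame(
    Length = [121, 71, 7],
    Budget = [missing, 3_000_000, missing],
)

# Request only the :mean and :nmissing statistics for each column.
summary_stats = describe(toy, :mean, :nmissing)
```

The result has one row per column of the input, with missing values skipped when the mean is computed.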
This page was generated using Literate.jl.