DataFrames
Julia has a library to handle tabular data, in a way similar to R or Pandas dataframes. The name is, no surprises, DataFrames. The approach and the function names are similar, although the way of actually accessing the API may be a bit different. For complex analysis, DataFramesMeta adds some helper macros.
Documentation:
DataFrames: https://github.com/JuliaData/DataFrames.jl
DataFramesMeta: https://github.com/JuliaStats/DataFramesMeta.jl
Stats in Julia in general: http://juliastats.github.io/
Install and import the library
Install the library:
using Pkg; Pkg.add("DataFrames")
Load the library:
using DataFrames
Create a df or load data:
From a table defined in code:
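For example (a minimal sketch, with hypothetical column names and values):

```julia
using DataFrames
df = DataFrame(colour = ["green", "blue", "green"],
               shape  = ["circle", "triangle", "square"],
               value  = [1.1, 2.3, 3.2])
```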
Read a CSV file:
myData = CSV.read(file, DataFrame; delim=';', missingstring="NA", decimal=',', copycols=true)
(use CSV.read(file, DataFrame; delim='\t') for tab-delimited files)
If a column contains only missing values in the first top rows used by the type autodetection, but non-missing values in subsequent rows, an error may appear. The trick is to manually specify the column type with the types parameter (a Vector or a Dict, e.g. types=Dict("freeDim" => Union{Missing,Int64})).
If you need to edit the values of your imported dataframe, do not forget the copycols=true option.
From a stream, use the HTTP package.
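A minimal sketch of reading a remote CSV file (the URL is hypothetical):

```julia
using HTTP, CSV, DataFrames
res = HTTP.get("https://example.com/data.csv")
df  = CSV.read(IOBuffer(res.body), DataFrame)
```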
From an OpenDocument Spreadsheet file (OpenOffice, LibreOffice, MS Excel and others): use the OdsIO package together with the retType="DataFrame" argument:
df = ods_read("spreadsheet.ods"; sheetName="Sheet2", retType="DataFrame", range=((tl_row,tl_col),(br_row,br_col)))
Create a df from scratch:
Create an empty df:
df = DataFrame(A = Int64[], B = Float64[])
Convert from a Matrix of data and a vector of column names:
df = DataFrame([[mat[:,i]...] for i in 1:size(mat,2)], Symbol.(headerstrs))
Convert from a Matrix with headers in the first row:
df = DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
Get insights about your data:
first(df, 6)
last(df, 6)
show(df, allrows=true, allcols=true)
describe(df)
unique(df.fieldName) or [unique(c) for c in eachcol(df)]
names(df) returns an array of column names
[eltype(col) for col in eachcol(df)] returns an array of column types
size(df) returns a (r,c) tuple; size(df)[1] the number of rows, size(df)[2] the number of columns
ENV["LINES"] = 60 changes the default number of lines after which the content is truncated (default 30). Also COLUMNS. May not work with the terminal.
for c in eachcol(df) iterates over each column
for r in eachrow(df) iterates over each row
Column names are Julia symbols. To programmatically compose a column name you hence need to use the Symbol(String) constructor, e.g.:
Symbol("value_", 0)
Filter (aka "selection" or "query")
There are two ways to access some column(s) of a DataFrame: by referencing the stored values in the new object, or by copying them into the new object.
Referencing is obtained using the exclamation mark in the row position (to emphasize that the referenced data could be changed through the new object) or using the dot syntax: myObjWithReferencedData = df[!, [cNames]] or myObjWithReferencedData.cName.
Copying uses instead the usual colon syntax: myObjWithCopiedData = df[:, [cNames]].
Filter by value, based on a field being in a list of values, using boolean selection through list comprehension:
df[ [i in ["blue","green"] for i in df.colour], :]
Combined boolean selection:
df[([i in ["blue","green"] for i in df.colour] .> 0) .& (df.shape .== "triangle"), :]
(the dot is needed to vectorize the operation; note the usage of the bitwise and, the single ampersand).
Filter using @where (DataFramesMeta package):
@where(df, :x .> 2, :y .== "a") # the two expressions are "and-ed"
If the column name is stored in a variable, you need to wrap it using the cols() function, e.g. col = Symbol("x"); @where(df, cols(col) .> 2)
Change a single value by filtering columns:
df[(df.product .== "hardWSawnW") .& (df.year .== 2010), :consumption] .= 200
Filter based on initial pattern:
filteredDf = df[startswith.(df.field,pattern),:]
A benchmark note: using @with() or boolean selection is roughly the same, while "querying" an equivalent Dict with categorical variables as tuple keys is around 20% faster than querying the dataframe.
A further (and perhaps more elegant, although longer) way to query a DataFrame is to use the Query package. The first example below lets you select a subset of both rows and columns, the second one highlights instead how you can mix multiple selection criteria:
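Minimal sketches (the column names :name, :age and :colour are assumptions):

```julia
using Query, DataFrames

# select a subset of both rows and columns
dfOut1 = @from r in df begin
    @where r.age > 30
    @select {r.name, r.age}
    @collect DataFrame
end

# mix multiple selection criteria
dfOut2 = @from r in df begin
    @where r.age > 30 && r.colour in ["blue", "green"]
    @select r
    @collect DataFrame
end
```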
Edit data
Replace values based on a dictionary:
mydf.col1 = map(akey -> myDict[akey], mydf.col1)
(the original data to replace can be in a different column or in a totally different DataFrame).
Concatenate (string) values of several columns to create the values of a new column:
df.c = df.a .* " " .* df.b
To compute the values of a column based on other columns you need to use elementwise operations with the dot, e.g.
df.a = df.b .* df.c
(note that the equal sign doesn't have the dot, but if you have to make a comparison, the == operator also wants the dot, i.e. .==).
Append a row:
push!(df, [1, 2, 3])
Delete a given row: use deleterows!(df, rowIdx) or just copy a df without the rows that are not needed, e.g. df2 = df[[1:(i-1); (i+1):end], :]
Empty a dataframe:
df = similar(df,0)
Edit structure
Delete columns by name:
select!(df, Not([:col1, :col2]))
Rename columns:
rename!(df, [:c1,:c2,:c3]) (all the columns) or rename!(df, Dict(:c1 => :newCol)) (a selection)
Change column order:
df = df[:, [:b, :a]]
Add an "id" column (useful for unstacking):
df.id = 1:size(df, 1)
Add a Float64 column (all filled with missing by default):
df.a = Array{Union{Missing,Float64},1}(missing,size(df,1))
Add a column based on values of other columns:
df.c = df.a .+ df.b
(as an alternative use map: df.c = map((x,y) -> x + y, df.a, df.b))
Insert a column at position i:
insertcols!(df, i, :colName => colContent)
Convert columns:
from Int to Float:
df.A = convert(Array{Float64,1},df.A)
from Float to Int:
df.A = convert(Array{Int64,1},df.A)
from Int (or Float) to String:
df.A = map(string, df.A)
from String to Float:
string_to_float(str) = try parse(Float64, str) catch; return(missing) end; df.A = map(string_to_float, df.A)
from Any to T (including String, if the individual elements are already strings):
df.A = convert(Array{T,1},df.A)
You can "pool" specific columns in order to efficiently store repeated categorical variables with
categorical!(df, [:A, :B])
. Attention that while the memory decrease, filtering with categorical values is not quicker (indeed it is a bit slower). You can go back to normal arrays wihcollect(df.A)
.
Merge/Join/Copy datasets
Concatenate different dataframes (with the same structure):
df = vcat(df1, df2, df3) or df = vcat([df1, df2, df3]...)
(note the three dots at the end, i.e. the splat operator).
Join dataframes horizontally:
fullDf = join(df1, df2, on = :commonCol)
The on parameter can also be an array of common columns. There are many possible types of join (chosen with the kind parameter); the most common ones are:
:inner (the default, only rows with keys on both sides are returned)
:left (additionally, rows on the left df with keys not present on the df on the right are also returned)
:right (the opposite of :left)
:outer (rows with elements missing in either of the two df are also returned)
Copy the structure of a DataFrame (to an empty one):
df2 = similar(df1, 0)
Manage Missing values
Starting from Julia 1, the Missing type is defined in core (with some additional functionality still provided by the additional package Missings.jl). At the same time, a DataFrame changed from being a collection of DataArrays to a collection of standard Arrays, eventually of type Union{T,Missing} if missing data is present.
The missing value is simply missing.
Remove missing values with:
a = collect(skipmissing(df.col1)) (returns an Array)
b = dropmissing(df[[:col1,:col2]]) (returns a DataFrame even for a single column)
dropmissing(df) (in both its versions, with or without the exclamation mark) and completecases(df) select only rows without missing values. The first returns the skimmed DataFrame, while the second returns a boolean array; you can also specify on which columns you want to limit the application of this filter, completecases(df[[:col1,:col2]]), and then get the df with df2 = df[completecases(df[[:col1,:col2]]), :].
Within an operation (e.g. a sum) you can use skipmissing() in order to skip missing values before the operation takes place.
Replace missing values in all numeric columns with 0 and in all string columns with the empty string:
[df[ismissing.(df[!,i]), i] .= 0 for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: Number]
[df[ismissing.(df[!,i]), i] .= "" for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: String]
To make comparisons (e.g. for boolean selection or within the @where macro in DataFramesMeta) where missing values could be present, you can use isequal.(a,b) to NOT propagate the missing (i.e. isequal("green", missing) is false rather than missing) or the comparison operator (==) to preserve missingness (i.e. "green" == missing is neither true nor false but missing).
Count the missing values:
nMissings = length(findall(x -> ismissing(x), df.col))
Split-Apply-Combine strategy
The DataFrames package supports the Split-Apply-Combine strategy through the by function, which takes three arguments: (1) a DataFrame, (2) a column (or columns) to split the DataFrame on, and (3) a function or expression to apply to each subset of the DataFrame.
The function can return a value, a vector, or a DataFrame. For a value or vector, these are merged into a column along with the cols keys. For a DataFrame, cols are combined along columns with the resulting DataFrame. Returning a DataFrame is the clearest option because it allows column labelling.
The by function can take the function as its first argument, so as to allow the usage of do blocks, as in the example below.
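A minimal sketch of the do-block form (the column names :region and :value are assumptions):

```julia
using DataFrames
df = DataFrame(region = ["a", "a", "b"], value = [1, 2, 3])
# the function comes first and is applied to each sub-dataframe
summaryDf = by(df, :region) do subdf
    DataFrame(tot = sum(subdf.value), n = size(subdf, 1))
end
```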
Internally it uses the groupby() function, as in the code it is defined as nothing more than:
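A simplified sketch of that definition, as it appeared in older versions of the package (the actual method also handles a sort keyword argument):

```julia
# split with groupby, apply f to each group, combine the results
by(d::AbstractDataFrame, cols, f::Function) = combine(map(f, groupby(d, cols)))
```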
Aggregate
Aggregate by several fields:
aggregate(df, [:field1, :field2], sum)
Attention that all the categorical fields have to be included in the list of fields over which to aggregate, otherwise Julia will try to compute a sum also over them (but, being strings, it will raise an error) instead of just ignoring them. The workaround is to remove the fields you don't want before doing the operation.
Alternatively (and without the problem of the previous point):
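For example, a minimal sketch using by (the column names :field1, :field2 and :value are assumptions):

```julia
# non-grouping columns are simply ignored, not summed
by(df, [:field1, :field2], d -> DataFrame(value = sum(d.value)))
```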
Compute cumulative sum by categories
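The sketches below all assume a df with a categorical column :cat and a numeric column :value (hypothetical names), with rows already ordered within each category.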
Manual method (very slow):
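```julia
# recompute the whole partial sum for every single row
df.cumValue = [sum(df.value[(df.cat .== df.cat[i]) .& (1:size(df,1) .<= i)])
               for i in 1:size(df,1)]
```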
Using by and the split-apply-combine strategy (fast):
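```julia
# each group returns a DataFrame, and the results are combined back together
df2 = by(df, :cat, d -> DataFrame(value = d.value, cumValue = cumsum(d.value)))
```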
Using @linq (from DataFramesMeta) and the split-apply-combine strategy (fast):
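```julia
using DataFramesMeta
# older DataFramesMeta pipeline syntax
df2 = @linq df |> groupby(:cat) |> transform(cumValue = cumsum(:value))
```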
Using groupby (fast):
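```julia
df.cumValue = zeros(size(df,1))
for subdf in groupby(df, :cat)
    # writing to the SubDataFrame updates the parent df
    subdf[:, :cumValue] = cumsum(subdf.value)
end
```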
Pivot
Stack
Move columns to rows of a "variable" column, i.e. moving from wide to long format.
For stack(df, [cols]) you have to specify the column(s) that have to be stacked; for melt(df, [cols]), at the opposite, you specify the other columns, i.e. the id columns that are already in stacked form.
Finally stack(df) - without column names - automatically stacks all the float columns.
Note that the stacked column names are inserted as data in a "variable" column (with the names of the variables as symbols, not strings) and the corresponding values in a "value" column.
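A minimal sketch (the column names are assumptions):

```julia
df = DataFrame(id = [1, 2], x = [1.0, 2.0], y = [3.0, 4.0])
longdf = stack(df, [:x, :y])   # creates :variable and :value columns, keeps :id
```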
Unstack
You can specify the dataframe, the column name whose content will become the row index (id variable), the column name whose content will become the names of the new columns (column variable names) and the column name containing the values that will be placed in the new table (column values):
widedf = unstack(longdf, [:ids], :variable, :value)
Alternatively you can omit the :id parameter, and all the existing columns, except the one defining column names and the one defining column values, will be preserved as index (row) variables:
widedf = unstack(longdf, :variable, :value)
Sorting
sort!(df, [:col1, :col2], rev = [false, false])
The (optional) reverse order parameter (rev) must be of the same size as the cols parameter.
Use LAJuliaUtils.jl
You can use (my own utility module) LAJuliaUtils.jl in order to pivot, and optionally filter and sort, with a single function, in a spreadsheet-like Pivot Tables fashion. See the relevant section.
Export your data
Export to CSV
CSV.write("file.csv", df, delim = ';', header = true)
(from the CSV package)
Export to ods (OpenDocument Spreadsheet file - OpenOffice, LibreOffice, MS Excel and others)
Use the OdsIO package:
ods_write("spreadsheet.ods", Dict(("MyDestSheet",3,2) => myDf))
Export to Dict
This exports to a dictionary where the keys are the unique elements of a df column and the values are the split dataframes:
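A minimal sketch, assuming a grouping column :region (hypothetical name):

```julia
# one entry per group; DataFrame(sdf) materialises a copy of each sub-dataframe
subDfs = Dict(sdf.region[1] => DataFrame(sdf) for sdf in groupby(df, :region))
```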
Export to hdf5 format
To use hdf5 with the HDF5 package, some systems may require system-wide hdf5 binaries, e.g. in Ubuntu Linux sudo apt-get install hdf5-tools.
h5write("out.h5", "mygroup/myDf", convert(Array, df[:, [list_of_cols]]))
The HDF5 package doesn't yet directly support dataframes, so you first need to export them as a Matrix (a further limitation is that it doesn't accept a matrix of Any type, so you may want to export a DataFrame in two pieces, the string and the numeric columns separately). You can read back the data with data = h5read("out.h5", "mygroup/myDf").
While an updated, expanded and revised version of this chapter is available in "Chapter 9 - Working with Data" of Antonello Lobianco (2019), "Julia Quick Syntax Reference", Apress, this tutorial remains in active development.