Workflow with data
As a part of working on large data projects, I have settled into a kind of workflow and I started wondering where I can improve it.
Step 0: ETL
I get the data and convert it into a form I can read. Since I have machines with 1.5 TB of RAM at my disposal (re: MPI-SWS), this translates to SQLite for most datasets of ~500 GB in size. My favourite tools for the job are csv2sqlite and some other related scripts.
Most of the scripts work on compressed versions (i.e. gzip or bz2) of the raw data files.
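To make the idea concrete, here is a minimal sketch of streaming a gzip-compressed CSV file straight into SQLite using only the Python standard library. It is not how csv2sqlite works internally; the function name load_gzipped_csv and the table and file names are made up for illustration, and the sketch assumes the first row of the file is a header.

```python
import csv
import gzip
import sqlite3


def load_gzipped_csv(csv_gz_path, db_path, table):
    """Stream a gzip-compressed CSV file (with a header row) into a SQLite table."""
    with gzip.open(csv_gz_path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first row holds the column names

        # Quote identifiers so unusual column or table names do not break the SQL.
        columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
        quoted_table = '"{}"'.format(table.replace('"', '""'))
        placeholders = ", ".join("?" for _ in header)

        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    f"CREATE TABLE IF NOT EXISTS {quoted_table} ({columns})"
                )
                # The reader is consumed lazily, so the file is never fully in memory.
                conn.executemany(
                    f"INSERT INTO {quoted_table} VALUES ({placeholders})", reader
                )
        finally:
            conn.close()
```

The columns are created without declared types, so every value is stored as text; converting types is left to a later step. For bz2 files, bz2.open accepts the same "rt" mode.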
Step 1: Exploration
I start by loading the data into a Jupyter notebook. This means loading the whole dataset into memory as a Pandas DataFrame.
Then I explore the data a bit with matplotlib.
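As a rough sketch of this step, the snippet below reads an entire SQLite table into a DataFrame with pandas.read_sql_query. The helper name load_table is hypothetical, and the approach assumes the table fits in RAM.

```python
import sqlite3

import pandas as pd


def load_table(db_path, table):
    """Read an entire SQLite table into an in-memory DataFrame."""
    quoted = '"{}"'.format(table.replace('"', '""'))
    conn = sqlite3.connect(db_path)
    try:
        return pd.read_sql_query(f"SELECT * FROM {quoted}", conn)
    finally:
        conn.close()
```

From there, calls such as df.describe() or df["value"].hist(bins=50) (where "value" stands in for whatever column is of interest) give a first look at the data through matplotlib.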
Step 2: Modeling
The task of modelling slowly takes form with preprocessing, sub-setting, model creation, model testing, and finally with running the models.
Depending on the kind of model, there may be other steps involved here. For example, figuring out the cross-validation strategy or creation of synthetic-data.
Step 2.1: Subsetting, Preprocessing and Model Creation
These go hand in hand because the model dictates what the intermediate results look like. The data is sub-setted to make testing the models feasible, so that everything can be done in-memory.
At various points, it becomes apparent that some pre-processing can save a bunch of time.
Here, I like to save data in mat form, usually as sparse arrays, because the format is interoperable between MATLAB and Python. This precludes the possibility of using recipy since it does not work with scipy.io. However, if I move to pickle or even np.savetxt, I can reap the benefits of the library. This is something I can try to improve.
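For reference, a small sketch of the round trip through a mat file with scipy.io.savemat and scipy.io.loadmat. The variable names, file name, and tiny toy matrix are placeholders, not data from an actual project.

```python
import os
import tempfile

import numpy as np
import scipy.io
import scipy.sparse

# A toy 4x3 sparse feature matrix with three non-zero entries.
rows = np.array([0, 1, 3])
cols = np.array([0, 2, 1])
vals = np.array([1.0, 2.5, -4.0])
features = scipy.sparse.csc_matrix((vals, (rows, cols)), shape=(4, 3))
labels = np.array([0, 1, 0, 1])

# Hypothetical output location; any writable path works.
mat_path = os.path.join(tempfile.mkdtemp(), "subset.mat")

# savemat takes a dict of variable names to arrays; sparse matrices stay sparse.
scipy.io.savemat(mat_path, {"features": features, "labels": labels},
                 do_compression=True)

loaded = scipy.io.loadmat(mat_path)
restored_features = loaded["features"]      # still a sparse matrix
restored_labels = loaded["labels"].ravel()  # 1-D arrays come back as 2-D rows
```

One thing to keep in mind is that 1-D arrays are written as row vectors by default, hence the ravel() on the way back in.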
Step 2.2: Testing the model
At this point, the methodology changes a bit.
I move all the code the model needs to run to one single Jupyter cell, save it to a file and then make it configurable using argparse.
Then I invoke the file from Jupyter using the %run -i magic command.
This gives me the best of both worlds. Placing code in a file is good for version control and for sharing the script with other collaborators. However, the script still has all its variables in the global scope, and %run -i lets me import those into the current Jupyter session, inspect them and figure out what the internal state of the program was at the point of exit (either through an exception or normally).
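A stripped-down sketch of what such a script can look like is shown below. The file name train_model.py, the flags --n-iter and --seed, and the toy "model" are all invented for illustration; the point is only that everything lives at module level.

```python
# train_model.py (hypothetical file name)
import argparse
import random

parser = argparse.ArgumentParser(description="Run one model experiment.",
                                 allow_abbrev=False)
parser.add_argument("--n-iter", type=int, default=100, help="number of iterations")
parser.add_argument("--seed", type=int, default=42, help="random seed")

# parse_known_args ignores arguments it does not recognise, so the same code
# also runs when pasted into a notebook cell, where sys.argv holds kernel flags.
args, _unknown = parser.parse_known_args()

# Everything below stays at module level on purpose: after `%run -i`, these
# names are ordinary variables in the notebook and can be inspected directly.
rng = random.Random(args.seed)
samples = [rng.gauss(0.0, 1.0) for _ in range(args.n_iter)]
sample_mean = sum(samples) / len(samples)
```

Running %run -i train_model.py --n-iter 1000 in a cell leaves args, samples and sample_mean defined in the session, even if the script stops partway with an exception (in which case only the names assigned before the failure exist).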
Hence, using a persistence framework like sacred is out of the question, since I explicitly want all the local variables available in Jupyter for debugging, while sacred requires the code to be inside a function.
Here, to persist data, I use seqfile which lets me save files from different processes and even different clusters on an NFS without worrying about losing data.
Moreover, since I usually run the script with %run -i, I usually don't have to read the data from the disk again and can carry straight on to the Results phase. Nevertheless, if I have to look at a previous experiment's results, I only have to load the last file saved by seqfile and look into it.
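The underlying idea of claiming the next free file name in a sequence can be sketched with the standard library alone. This is not seqfile's API or implementation; the function names, the prefix and the suffix are placeholders, and the sketch assumes a file system on which exclusive creation (O_EXCL) is atomic, which is worth verifying for a given NFS setup.

```python
import os


def open_next_file(folder, prefix="result-", suffix=".pickle", max_attempts=1000):
    """Create and open the first unused file named <prefix><n><suffix>."""
    for n in range(max_attempts):
        path = os.path.join(folder, f"{prefix}{n}{suffix}")
        try:
            # O_EXCL makes creation fail if the file already exists, so two
            # processes cannot both claim the same name.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # somebody else owns this name; try the next number
        return path, os.fdopen(fd, "wb")
    raise RuntimeError(f"no free file name found in {max_attempts} attempts")


def latest_file(folder, prefix="result-", suffix=".pickle"):
    """Return the path of the highest-numbered file, or None if there is none."""
    numbered = []
    for name in os.listdir(folder):
        if name.startswith(prefix) and name.endswith(suffix):
            middle = name[len(prefix):len(name) - len(suffix)]
            if middle.isdigit():
                numbered.append((int(middle), name))
    return os.path.join(folder, max(numbered)[1]) if numbered else None
```

Each run then writes to its own file via open_next_file, and latest_file finds the most recent result when an older experiment needs another look.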
One thing I usually do run into is that reading local variables from a file clobbers some variables that I have already declared in the Jupyter notebook, and if I restart the kernel and run only some blocks, I usually lose track of which variables are declared. A Notebook extension which lists all variables in the current global environment (like the Workspace pane in MATLAB) would help me here.
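Short of a proper extension, IPython's built-in %who and %whos magics print the names currently defined in the session. A similar listing can be sketched in a few lines; the helper name user_variables and the set of hidden names are choices made here for illustration.

```python
import types


def user_variables(namespace):
    """Return {name: type name} for user-defined variables in a namespace."""
    hidden = {"In", "Out", "exit", "quit", "get_ipython"}  # IPython bookkeeping
    return {
        name: type(value).__name__
        for name, value in sorted(namespace.items())
        if not name.startswith("_")
        and name not in hidden
        # Skip modules, functions and classes; only data is of interest here.
        and not isinstance(value, (types.ModuleType, types.FunctionType, type))
    }
```

Calling user_variables(globals()) before and after a %run -i and comparing the two dictionaries shows which names the script added or overwrote with a different type.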
There is also the inverse problem of having undefined variables in the script which are conveniently supplied by the Jupyter notebook when it is run from inside a Notebook. Using Syntastic helps with weeding out this class of problems.
Step 3: Results
Finally, to analyse and process the results, I again rely on the Jupyter notebook, albeit a new one. The new notebook only runs the scripts as %time %run -i script.py --args in individual cells and receives the results loaded in a local variable. Plotting is again done using seaborn and the results are saved as pdf or eps figures.
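Since seaborn draws onto matplotlib figures, saving comes down to matplotlib's savefig, which picks the output format from the file extension. The sketch below uses plain matplotlib and made-up placeholder numbers rather than real results; the file names are likewise arbitrary.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

# Placeholder values for illustration only, not actual experiment output.
x_values = [1, 2, 4, 8]
y_values = [0.9, 0.7, 0.6, 0.55]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x_values, y_values, marker="o")
ax.set_xlabel("parameter")
ax.set_ylabel("error")

out_dir = tempfile.mkdtemp()  # hypothetical output folder
pdf_path = os.path.join(out_dir, "result.pdf")
eps_path = os.path.join(out_dir, "result.eps")

# The extension decides the format; bbox_inches="tight" trims extra whitespace.
fig.savefig(pdf_path, bbox_inches="tight")
fig.savefig(eps_path, bbox_inches="tight")
plt.close(fig)
```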