Cookiecutter GPN: A flexible template for implementing and sharing machine learning models.
Project GitHub Repo: https://github.com/gpnlab/ccgpn
Why this template?
A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don't necessarily have to read 100% of the code before knowing where to look for very specific things.
Well organized code tends to be self-documenting in that the organization
itself provides context for your code without much overhead. A great example is
the Filesystem Hierarchy
Standard for
Unix-like systems. The /etc
directory has a very specific purpose, as does
the /tmp
folder, and everybody (more or less) agrees to honor that social
contract. That means a Red Hat user and an Ubuntu user both know roughly where
to look for certain types of files, even when using each other's system — or
any other standards-compliant system for that matter! Ideally, that's how it
should be when a colleague opens up your data science project.
The recommendations we make here follow the standards and conventions of much of the scientific Python ecosystem. Following these standards and recommendations will make it easier
- for others to use your code,
- to port your code into other projects,
- to collaborate with other users of this ecosystem,
- to learn about the analysis process and the domain, and
- to feel confident in analysis conclusions.
We're not talking about devoting a disproportionate amount of time to menial and trivial matters like indentation aesthetics or hairsplitting formatting standards — after all, what really matters is code correctness and reproducibility.
Disagree with a couple of the default folder names? Working on a project that's a little nonstandard and doesn't exactly fit with the current structure? Prefer to use a different package than one of the (few) defaults? Go for it! This is a lightweight structure, and is intended to be a good starting point for many projects.
Requirements to use this cookiecutter:
- Python 2.7 or 3.8+
- PyTorch >= 1.9
- Tensorboard >= 2.6
- Cookiecutter Python package >= 1.7.0:
- This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
To start a new project, run:
$ cookiecutter -c v1.0.0 https://github.com/gpnlab/ccgpn
Do not name your repo (the repo_name prompt) anything that starts with a number: Python cannot import modules whose names begin with a digit, and working around that is a hassle.
Add source control using Git and GitHub to repo
- Add local source control using Git
$ make git
- Track the repo remotely using GitHub
$ make github
The resulting directory structure
The directory structure of your new project looks like this:
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
│
├── references/ <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures/ <- Generated graphics and figures to be used in reporting
│
├── docs/ <- A default Sphinx project; see sphinx-doc.org for details
│
├── data <- directory for storing input data
│ ├── external/ <- Data from third party sources.
│ ├── interim/ <- Intermediate data that has been transformed.
│ ├── processed/ <- The final, canonical data sets for modeling.
│ └── raw/ <- The original, immutable data dump.
│
├── experiments <- directory for storing experimental setups
│ ├── A01_MLP-XOR-implementation/
│ ├── A02_clustering-interleaving-moons/
│ ├── A03_MNIST-handwritten-digit-classification/
│ └── A04_fashion-MNIST-clothing-classification <- Aims. Naming convention is
│ │ aim ID (A + ordered two-digit integer)
│ │ + "_" + short "-" delimited description
│ ├── E01_linear-model/
│ ├── E02_MLP/
│ └── E03_CNN <- Experiments. Naming convention is
│ │ experiment ID
│ │ (E + ordered two-digit integer)
│ │ + "_" + short "-" delimited description
│ ├── S0001_pilot/
│ ├── S0002_adam/
│ ├── S0003_dec-lr/
│ └── S0004_inc-batch <- Setups. Naming convention is setup ID
│ │ (S + ordered four-digit integer)
│ │ + "_" + short "-" delimited description
│ └── A04-E03-S0004.yml <- YAML configuration file.
│ Naming convention is aim ID +
│ "-" + experiment ID + "-" setup ID
│
│
├── notebooks/ <- Jupyter notebooks. Naming convention is
│ a number (for ordering) + "_" + the creator's initials
│ + "_" experiment ID + "_" + short `-` delimited
│ description, e.g. `1.0_ed_E01_results-visualization`
│
├── trials <- directory for storing experimental trials
│ ├── A01-E01-S0001/ <- Name of the YAML config file
│ ├── A02-E01-S0001/
│ ├── A03-E01-S0001/
│ └── A04-E03-S0004
│ ├── timestamp1/ <- timestamp, in YYYYMMDDhhmmssUTC format,
│ │ i.e., year (YYYY), month (MM), day (DD),
│ │ hour (hh), minute (mm), second (ss), all according to
│ │ the Coordinated Universal Time (UTC) standard
│ └── timestamp2
│ ├── etc/ <- stores all information necessary to reproduce this trial,
│ │ e.g., the YAML config file, resuming session, seeds
│ ├── logs/ <- stores all logging information
│ ├── runs/ <- stores Tensorboard runs
│ └── ckpts/ <- stores checkpoints
│
│
├── observations <- directory for storing experimental observations
│ ├── A01-E01-S0001/
│ ├── A02-E01-S0001/
│ ├── A03-E01-S0001/
│ └── A04-E03-S0004
│ ├── timestamp1/
│ └── timestamp2/
│
│
├── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
│
├── requirements.txt <- The requirements file for reproducing the analysis environment,
│ e.g., generated with `pip freeze > requirements.txt`
├── requirements-dev.txt <- The requirements file with the devel dependencies
│
├── setup.py <- makes project pip installable (pip install -e .)
│ so package_name can be imported
├── tests/ <- folder with test code
│
├── logging.yml <- logging configuration
│
└── package_name <- package directory
├── __init__.py <- Makes package_name a Python package
├── cli.py <- command line interface
├── package_name.py <- package script
├── main_foo.py <- training script for model "foo"
├── main_bar.py <- training script for model "bar"
│
├── base <- abstract base classes
│ ├── __init__.py <- Makes the abstract base class a Python subpackage
│ ├── transform.py <- abstract base class for data transformations
│ ├── dataset.py <- abstract base class for datasets
│ ├── dataloader.py <- abstract base class for data loaders
│ ├── arch.py <- abstract base class for models' architectures
│ ├── loss.py <- abstract base class for losses
│ ├── metric.py <- abstract base class for metrics
│ ├── optimizer.py <- abstract base class for optimizers
│ ├── scheduler.py <- abstract base class for schedulers
│ └── trainer.py <- abstract base class for trainers
│
├── utils <- utilities
│ ├── seeder.py <- manages reproducibility
│ ├── logger.py <- training logger
│ ├── monitor.py <- Tensorboard visualization support
│ └── backer.py <- manages paths for saving models + logs
│
├── model_foo/ <- each model is a subpackage
└── model_bar <- model "bar"
├── __init__.py <- Makes model_bar a Python subpackage
├── dataset.py
├── transform.py
├── dataloader.py
├── loss.py
├── metric.py
├── arch.py
├── optimizer.py
├── scheduler.py
└── trainer.py
Test the newly created repo
- Create virtual environment
$ make virtual_environment
- Install development requirements
Note: Before installing the requirements, update `environment.yml` and/or `requirements.txt` and `requirements-dev.txt` to suit your project needs.
$ make install_requirements
- Run MNIST hello world example
$ conda activate <your_repo_name>
$ <your_repo_name> train-hello
To see all available rules, run `make help`.
As of this version, the following commands are available:
$ make help
Available rules:
clean Delete all compiled Python files
git Set up Git
github Set up GitHub
install_requirements Install Python Dependencies
lint Lint using flake8
sync_data_from_s3 Download Data from S3
sync_data_to_s3 Upload Data to S3
test_environment Test python environment is setup correctly
virtual_environment Set up python interpreter environment
Repository Structure
Features:
- Clear folder structure which is suitable for many deep learning projects.
- Runs are configured via `.yml` files, allowing for easy experimentation.
- Checkpoint saving and resuming.
- Tensorboard logging.
We now examine the newly created repository in more detail.
Config file format
Config files are in .yml
format:
name: A04-E03-S0004
seed: 12345
target_devices: [0]
save_dir: trials/

arch:
  type: my_arch
  args: {}

transform:
  type: my_transform
  args: {}

dataloader:
  type: my_dataloader
  args:
    batch_size: 128
    data_dir: data/
    num_workers: 2
    shuffle: true
    validation_split: 0.1

loss: my_loss

lr_scheduler:
  type: my_lr_scheduler
  args:
    gamma: 0.1
    step_size: 50

metrics:
  - my_metric
  - my_metric2

optimizer:
  type: my_optimizer
  args:
    lr: 0.001
    weight_decay: 0

training:
  early_stop: 10
  epochs: 100
  monitor: min val_loss
  save_period: 1
  tensorboard: true

testing:
  data_dir: data/
  batch_size: 128
  num_workers: 8
Add additional configurations as needed.
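If you want to inspect such a file programmatically, it can be read into a plain dictionary with PyYAML. This is only a minimal sketch, assuming PyYAML is installed; `load_config` is a hypothetical helper, not the template's own loader.
# Minimal sketch of reading a trial configuration with PyYAML.
# `load_config` is a hypothetical helper, not the template's own loader.
from pathlib import Path
import yaml

def load_config(path):
    """Read a YAML config file into a nested dictionary."""
    with open(path) as f:
        return yaml.safe_load(f)

config = load_config(Path("experiments/A04_fashion-MNIST-clothing-classification"
                          "/E03_CNN/S0004_inc-batch/A04-E03-S0004.yml"))
print(config["name"])                               # "A04-E03-S0004"
print(config["dataloader"]["args"]["batch_size"])   # 128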
Checkpoints
You can specify the name of the training session in config files:
name: A04-E03-S0004
The checkpoints will be saved in `<save_dir>/<trial>/<timestamp>/ckpts/ckpt_epoch-<n>_<trial>_<timestamp>`, where `<save_dir>` and `<trial>` are as defined in the config file, `<n>` is an integer identifying the checkpoint, and `<timestamp>` is a datetime stamp in the `YYYYMMDDhhmmssUTC` format, with year (YYYY), month (MM), day (DD), hour (hh), minute (mm), second (ss), all according to the Coordinated Universal Time (UTC) standard. A copy of the config file will be saved in `<save_dir>/<trial>/<timestamp>/etc/config_<trial>_<timestamp>.yml`.
Note: checkpoints contain:
checkpoint = {
'arch': arch,
'epoch': epoch,
'state_dict': self.model.state_dict(),
'optimizer': self.optimizer.state_dict(),
'monitor_best': self.mnt_best,
'config': self.config
}
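For illustration, here is a minimal sketch of how such a dictionary can be saved and restored with `torch.save`/`torch.load`. The tiny model and optimizer are placeholders; the template's trainer builds the real dictionary shown above and writes it under `<save_dir>/<trial>/<timestamp>/ckpts/`.
# Minimal sketch of saving and resuming a checkpoint dictionary with PyTorch.
# The tiny model and optimizer are placeholders; the template's trainer
# builds the real dictionary and saves it in the ckpts/ directory.
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

checkpoint = {
    'arch': type(model).__name__,
    'epoch': 10,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'monitor_best': 0.42,                 # e.g. best validation loss so far
    'config': {'name': 'A04-E03-S0004'},
}
torch.save(checkpoint, 'ckpt_epoch-10.pth')

# Resuming later
checkpoint = torch.load('ckpt_epoch-10.pth', map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch'] + 1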
Tensorboard Visualization
This template supports Tensorboard visualization.
- Run training
  Set the `tensorboard` option in the config file to true.
- Open the Tensorboard server
  Type `tensorboard --logdir <save_dir>/` at the project root; the server will open at http://localhost:6006
By default, the values of the loss and the metrics specified in the config file, input images, and histograms of model parameters will be logged. If you need more visualizations, use `add_scalar('tag', data)`, `add_image('tag', image)`, etc., in the `trainer._train_epoch` method. The `add_something()` methods in this template are essentially wrappers around those of the `tensorboard.SummaryWriter` module.
Note: You don't have to specify the current step, since the `TensorboardWriter` class defined in `utils/monitor.py` tracks the current step automatically.
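For reference, the equivalent calls with `torch.utils.tensorboard.SummaryWriter` used directly look roughly like this (a sketch only; inside the template you would call the wrapper methods from `trainer._train_epoch` and could omit the step argument):
# Minimal sketch using torch.utils.tensorboard directly; the template's
# TensorboardWriter wraps these methods and tracks the global step for you.
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/demo')   # the template writes under <save_dir>/<trial>/<timestamp>/runs/

for step in range(100):
    loss = 1.0 / (step + 1)                          # placeholder training loss
    writer.add_scalar('loss/train', loss, step)      # scalar curve

writer.add_image('inputs/example', torch.rand(3, 28, 28), 0)   # CHW image tensor
writer.add_histogram('weights/fc1', torch.randn(1000), 0)      # parameter histogram
writer.close()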
The `data` folder
Don't ever edit your raw data, especially not manually, and especially not in
Excel. Don't overwrite your raw data. Don't save multiple versions of the raw
data. Treat the data (and its format) as immutable. The code you write should
move the raw data through a pipeline to your final analysis. You shouldn't have
to run all of the steps every time you want to make a new figure, but anyone
should be able to reproduce the final products with only the code in package_name
and the data in data/raw
.
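As a minimal, hypothetical illustration of such a pipeline step (the file names and the cleaning rule are invented), reading from `data/raw` and writing to `data/interim` without touching the raw file:
# Hypothetical pipeline step: read the immutable raw data, clean it, and
# write the result to data/interim/ -- never modifying data/raw/ itself.
# The file names and the cleaning rule are invented for illustration.
import csv
from pathlib import Path

RAW = Path("data/raw/measurements.csv")
INTERIM = Path("data/interim/measurements_clean.csv")

def clean_raw_data(raw_path=RAW, out_path=INTERIM):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row.get("value"):  # drop rows with a missing value
                writer.writerow(row)

if __name__ == "__main__":
    clean_raw_data()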
Analysis should be a DAG. Ever tried to reproduce an analysis that you
did a few months ago or even a few years ago? You may have written the code,
but it's now impossible to decipher whether you should use
make_figures.py.old
, make_figures_working.py
or new_make_figures01.py
to
get things done. Here are some questions we've learned to ask with a sense of
existential dread:
- Are we supposed to go in and join the column X to the data before we get started or did that come from one of the notebooks?
- Come to think of it, which notebook do we have to run first before running the plotting code: was it "process data" or "clean data"?
- Where did the shapefiles get downloaded from for the geographic plots?
- \(\{\textrm{et cetera}\}\times \infty\)
These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work, for example separation of concerns, abstracting analysis as a DAG, and engineering best practices like version control.
If Analysis is a DAG, it can be streamlined. Often in an analysis you have
long-running steps that preprocess data or train models. If these steps have
been run already (and you have stored the output somewhere like the
data/interim
directory), you don't want to wait to rerun them every time. We
prefer make
for managing steps that
depend on each other, especially the long-running ones. Make is a common tool
on Unix-based platforms (and is available for Windows). Following the make
documentation, Makefile
conventions,
and portability
guide
will help ensure your Makefiles work effectively across systems. Here are
some
examples to get
started.
A number of data folks use make
as their tool of choice, including Mike
Bostock. There are other tools for managing
DAGs that are written in Python instead of a DSL (e.g.,
Paver,
Luigi,
Airflow,
Snakemake,
Ruffus, or
Joblib). Feel free to use these
if they are more appropriate for your analysis.
Also, if data is immutable, it doesn't need source control in the same way that
code does. Therefore, by default, the data folder is included in the
.gitignore
file. If you have a small amount of data that rarely changes,
you may want to include the data in the repository. Github currently warns if
files are over 50MB and rejects files over 100MB. Some other options for
storing/syncing large data include AWS S3 with a
syncing tool (e.g., s3cmd
), Git Large File
Storage, Git
Annex, and dat.
Currently, by default, we support instantiating an S3 bucket and using the AWS CLI to sync data in the data folder with the server (you can also sync data from a local server).
AWS CLI configuration. When using Amazon S3 to store data, a simple method of managing AWS access is to set your access keys to environment variables. However, when managing multiple sets of keys on a single machine (e.g., when working on multiple projects), it is best to use a credentials file, typically located in `~/.aws/credentials`. A typical file might look like:
[default]
aws_access_key_id=myaccesskey
aws_secret_access_key=mysecretkey
[another_project]
aws_access_key_id=myprojectaccesskey
aws_secret_access_key=myprojectsecretkey
You can add the profile name when initialising a project; assuming no applicable environment variables are set, the default profile credentials will be used.
Finally, a handy code snippet to get the standard file-system location for the
data
folder:
import os.path as op
from pathlib import Path
import package_name as pkg
root = Path(pkg.__path__[0]).parent.absolute()
data_path = op.join(root, 'data')
The `package_name` folder
We place the main module for the source code in a file called `package_name.py` in a directory called `package_name`. This structure is a bit confusing at first, but it is a simple way to ensure that when we type `import package_name as pkg` in an interactive Python session, the classes and functions defined inside the `package_name.py` file are available in the `pkg` namespace. For this to work, we also need to create a file called `__init__.py` which contains code that imports everything in that file into the namespace of the project:
from .package_name import *
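To make the mechanism concrete, here is a tiny, hypothetical function (the name `transform_data` is invented for illustration) and how it becomes reachable through the package:
# package_name/package_name.py -- `transform_data` is a hypothetical example
def transform_data(x, scale=1.0):
    """Toy function; anything defined here ends up in the `pkg` namespace."""
    return [scale * value for value in x]

# In an interactive session, thanks to `from .package_name import *`
# in __init__.py, the function is reachable directly from the package:
#
#   >>> import package_name as pkg
#   >>> pkg.transform_data([1, 2, 3], scale=2.0)
#   [2.0, 4.0, 6.0]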
As a general rule, in module code, we follow the convention that all functions are either imported from other places, or are defined in lines that precede the lines that use that function. This helps readability of the code, because you know that if you see some name, the definition of that name will appear earlier in the file, either as a function/variable definition, or as an import from some other module or package.
Environment
The first step in reproducing an analysis is always reproducing the computational environment it was run in. You need the same tools, the same libraries, and the same versions to make everything play nicely together.
One effective approach to this is to use a virtual environment. We included a target in the Makefile (`make virtual_environment`) which will create a conda environment (see the Miniconda software distribution, not to be confused with Anaconda, though they are similar and both produced by Continuum) if conda is installed, and otherwise create a virtualenv (virtualenvwrapper is recommended for managing virtualenvs).
By listing all of your requirements in the repository (we include a
requirements.txt
file) you can easily track the packages needed to recreate
the analysis. Here is a good workflow:
- run `make virtual_environment` when creating a new project. This will create a conda environment if conda is available, else create a virtualenv.
- run `conda activate project_name` (or `workon project_name`) to activate the virtual environment.
- `pip install` / `conda install` the packages that your analysis needs.
- run `conda list --export > requirements.txt` (or `pip freeze > requirements.txt`) to pin the exact package versions used to recreate the analysis.
- If you find you need to install another package, run `conda list --export > requirements.txt` (or `pip freeze > requirements.txt`) again and commit the changes to version control.
You can easily duplicate your environment and the associated dependencies with `conda create --name <env_name> --file requirements.txt` (or `mkvirtualenv <env_name>; workon <env_name>; pip install -r requirements.txt`).
This awesome conversion table lists operation equivalences between `conda`, `pip` and `virtualenv`. For more on the differences between `conda` and `pip`, see this article by Jake VanderPlas.
If you have more complex requirements for recreating your environment, consider a container-based approach such as Docker or Vagrant. Both of these tools use text-based formats (Dockerfile and Vagrantfile, respectively) you can easily add to source control to describe how to create a container with the requirements you need.
Making the Repository an Installable Package
For installation and distribution we will use the Python setuptools package. This package uses a `setup.py` file to figure out how to install your software on a particular system. For a small project such as this one, managing installation of the software modules and the data is rather simple.
A package_name/version.py
contains all of the information needed for the
installation and for setting up the PyPI
page for the software. This also makes
it possible to install your software using pip
and easy_install
, which
are package managers for Python software. The setup.py
file reads this
information from there and passes it to the setup
function which takes care
of the rest.
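A rough sketch of what such a `setup.py` can look like; the metadata keys read from `version.py` (NAME, VERSION, and so on) are assumptions for illustration, not necessarily the template's verbatim file:
# setup.py -- illustrative sketch; the file generated by the template may differ.
import os
from setuptools import setup, find_packages

# Read metadata (NAME, VERSION, ...) from package_name/version.py without
# importing the package itself; these key names are assumptions.
info = {}
with open(os.path.join("package_name", "version.py")) as f:
    exec(f.read(), info)

setup(
    name=info["NAME"],
    version=info["VERSION"],
    description=info["DESCRIPTION"],
    author=info["AUTHOR"],
    packages=find_packages(),
    install_requires=info.get("REQUIRES", []),
)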
Much more information on packaging Python software can be found in the Hitchhiker's guide to packaging.
Continuous integration
Travis-CI is a system that can be used to automatically test every revision of
your code directly from github, including testing of github pull requests,
before they are merged into the master
branch. This provides you with
information needed in order to evaluate contributions made by others. It also
serves as a source of information for others interested in using or contributing
to your project about the degree of test coverage of your project.
You will need a .travis.yml file in your repo. This file contains the
configuration of your testing environment. This includes the different
environments in which you will test the source code (for example, we test
package_name
against Python 2.7, Python 3.7 and Python 3.8). It includes steps
that need to be taken before installation of the software. For example,
installation of the software dependencies.
For details on setting up Travis-CI with github, see Travis-CI's getting started page. To summarize:
- Go to the Travis-CI website and get a Travis user account, linked to your github user account.
- You will need to set up your github repo to talk to Travis.
- You will need to go back to travis-ci, and flip on the switch on that side as well.
- The travis output will also report to you about test coverage, if you set it up that way.
- You will start getting emails telling you the state of the testing suite on every pull request for the software, and also when you break the test suite on the master branch. That way, you can be pretty sure that master is working (or at least know when it isn't...).
You can also continuously test your code on a Windows system. This is done on
another CI system called Appveyor. In principle, it
does something that is very similar to what Travis does: downloads your code,
installs it on a Windows machine, with various versions of python, and runs the
tests. Appveyor is controlled through another configuration file: the
appveyor.yml
. In addition to committing this file into the repository, you
will need to activate Appveyor for your project. This is done by signing into
the Appveyor interface with your Github account, clicking on the "projects" tab
at the top of the page, then clicking on the "+" sign for "New project" and
selecting the project you would like to add from the menu that appears (you
might need to give Appveyor the permission to see projects in your Github
account).
Distribution
The main venue for distribution of Python software is the Python Package Index, or PyPI, also lovingly known as "the cheese-shop". To distribute your software on PyPI, you will need to create a user account on PyPI. It is recommended that you upload your software using twine. Using Travis, you can automatically upload your software to PyPI, every time you push a tag of your software to github. The instructions on setting this up can be found here. You will need to install the travis command-line interface.
Version Control
Source code versioning control is managed with git. Currently there are two files in the repository which help working with this repository, and which you could extend further:
- `.gitignore` -- specifies intentionally untracked files (such as compiled `*.pyc` files), which should not typically be committed to git (see `man gitignore`).
- `.mailmap` -- if any of the contributors used multiple names/email addresses, or their git commit identity is just an alias, you can specify the canonical name/email(s) for each contributor, so that commands such as `git shortlog -sn` can take them into account (see `git shortlog --help`).
Keep secrets and configuration out of version control. You really don't
want to leak your AWS secret key or Postgres username and password on Github.
Enough said — see the Twelve Factor App
principles on this point. One solution is to create a .env
file in the
project root folder. Thanks to the .gitignore
, this file should never get
committed into the version control repository. Here's an example:
# example .env file
DATABASE_URL=postgres://username:password@localhost:5432/dbname
AWS_ACCESS_KEY=myaccesskey
AWS_SECRET_ACCESS_KEY=mysecretkey
OTHER_VARIABLE=something
If you look at the stub script in package_name/data/make_dataset.py
, it uses
a package called python-dotenv to
load up all the entries in this file as environment variables so they are
accessible with os.environ.get
. Here's an example snippet adapted from the
python-dotenv
documentation:
# package_name/data/dotenv_example.py
import os
from dotenv import load_dotenv, find_dotenv
# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)
database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")
Styling
Remember that code will probably be read more times than it will be written.
Make it easy to read (for others, but also for yourself when you come back to
it), by following a consistent formatting style. We strongly recommend
following the PEP8 code formatting
standard, and we enforce this by
running a code-linter called flake8
(it combines the tools pep8
and
pyflakes
), which automatically checks the code and reports any violations of
the PEP8 standard (and checks for other general code hygiene issues).
Some projects include flake8
inside their automated tests, so that every pull
request is examined for code cleanliness. In this project, we run flake8 on
most (but not all) files, with most (but not all) checks:
flake8 --ignore N802,N806 `find . -name *.py | grep -v setup.py | grep -v /doc/`
This means, check all .py files, but exclude setup.py and everything in directories named "docs". Do all checks except N802 and N806, which enforce lowercase-only names for variables and functions.
The Makefile contains a rule for running this command as well:
make lint
Documentation
Documenting your software is a good idea. Not only as a way to communicate to others about how to use the software, but also as a way of reminding yourself what the issues are that you faced, and how you dealt with them, in a few months/years, when you return to look at the code.
The first step in this direction is to document every function in your module code. We recommend following the numpy docstring standard, which specifies in detail the inputs/outputs of every function, and specifies how to document additional details, such as references to scientific articles, notes about the mathematics behind the implementation, etc.
This standard also plays well with a system that allows you to create more comprehensive documentation of your project. Writing such documentation allows you to provide more elaborate explanations of the decisions you made when you were developing the software, as well as provide some examples of usage, explanations of the relevant scientific concepts, and references to the relevant literature.
To document package_name
we use the Sphinx documentation
system. You can follow the instructions on the sphinx
website, and the example here to set up the
system, but we have also already initialized and committed a skeleton
documentation system in the docs
directory, that you can build upon.
Sphinx uses a Makefile
to build different outputs of your documentation. For
example, if you want to generate the HTML rendering of the documentation (web
pages that you can upload to a website to explain the software), you will type:
make html
This will generate a set of static webpages in the docs/_build/html
, which you
can then upload to a website of your choice.
Alternatively, readthedocs.org (careful,
not readthedocs.com) is a service that will run sphinx for you,
and upload the documentation to their website. To use this service,
you will need to register with RTD. After you have done that, you will
need to "import your project" from your github account, through the
RTD web interface. To make things run smoothly, you also will need to
go to the "admin" panel of the project on RTD, and navigate into the
"advanced settings" so that you can tell it that your Python
configuration file is in docs/conf.py
.
Testing
Most scientists who write software constantly test their code. That is, if you are a scientist writing software, I am sure that you have tried to see how well your code works by running every new function you write, examining the inputs and the outputs of the function, to see if the code runs properly (without error), and to see whether the results make sense.
Automated code testing takes this informal practice, makes it formal, and automates it, so that you can make sure that your code does what it is supposed to do, even as you go about making changes around it.
Most scientists writing code are not really in a position to write a complete specification of their software, because when they start writing their code they don't quite know what they will discover in their data, and these chance discoveries might affect how the software evolves. Nor do most scientists have the inclination to write complete specs - scientific code often needs to be good enough to cover our use-case, and not any possible use-case. Testing the code serves as a way to provide a reader of the code with very rough specification, in the sense that it at least specifies certain input/output relationships that will certainly hold in your code.
We recommend using the 'pytest' library for
testing. The py.test
application traverses the directory tree in which it is
issued, looking for files with the names that match the pattern test_*.py
(typically, something like our package_name/tests/test_package_name.py
).
Within each of these files, it looks for functions with names that match the
pattern test_*
. Typically each function in the module would have a
corresponding test (e.g. test_transform_data
). This is sometimes called 'unit
testing', because it independently tests each atomic unit in the software.
Other tests might run a more elaborate sequence of functions ('end-to-end
testing' if you run through the entire analysis), and check that particular
values in the code evaluate to the same values over time. This is sometimes
called 'regression testing'. Regressions in the code are often canaries in the
coal mine, telling you that you need to examine changes in your software
dependencies, the platform on which you are running your software, etc.
Test functions should contain assertion statements that check certain relations
in the code. Most typically, they will test for equality between an explicit
calculation of some kind and a return of some function. We recommend using
functions from the numpy.testing
module (which we import as npt
) to assert
certain relations on arrays and floating point numbers. This is because npt
contains functions that are specialized for handling numpy
arrays, and they
allow you to specify the tolerance of the comparison through the `decimal` keyword argument.
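For example, a minimal test in this style (the function under test, `transform_data`, is the hypothetical example used earlier in this README, and the expected values are an explicit hand calculation):
# package_name/tests/test_package_name.py -- illustrative unit test;
# `transform_data` is the hypothetical example function used earlier.
import numpy as np
import numpy.testing as npt

from package_name import transform_data

def test_transform_data():
    data = np.array([1.0, 2.0, 3.0])
    result = transform_data(data, scale=2.0)
    # Compare against an explicit calculation, up to 6 decimal places.
    npt.assert_almost_equal(result, np.array([2.0, 4.0, 6.0]), decimal=6)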
To run the tests on the command line, change your present working directory to
the top-level directory of the repository (e.g. /home/user/code/package_name
),
and type:
py.test package_name
This will exercise all of the tests in your code directory. If a test fails, you will see a message such as:
package_name/tests/test_package_name.py .F...
=================================== FAILURES =============================
.
.
.
E AssertionError:
.
.
.
package_name/tests/test_package_name.py:49: AssertionError
====================== 1 failed, 4 passed in 0.82 seconds ================
As your code grows and becomes more complicated, you might develop new features that interact with your old features in all kinds of unexpected and surprising ways. As you develop new features of your code, keep running the tests, to make sure that you haven't broken the old features. Keep writing new tests for your new code, and recording these tests in your testing scripts. That way, you can be confident that even as the software grows, it still keeps doing correctly at least the few things that are codified in the tests.
We have also provided a Makefile
that allows you to run the tests with more
verbose and informative output from the top-level directory, by issuing the
following from the command line:
make test
Notebooks
Notebook packages like the Jupyter notebook, Beaker notebook, Zeppelin, and other literate programming tools are very effective for exploratory data analysis. Hence, the notebooks directory can be used as a place to experiment with your module code, and as a place to produce scripts that contain a narrative structure, demonstrating the use of the code, or producing scientific results from your code and your data and telling a story with these elements. For example, an IPython notebook that reads in some data, and creates a figure. Maybe this is Figure 1 from some future article?
When we use notebooks in our work, we often subdivide the notebooks
folder.
For example, notebooks/exploratory
contains initial explorations, whereas
notebooks/reports
is more polished work that can be exported as html to the
reports
directory.
Since notebooks are challenging objects for source control (e.g., diffs of the
json
are often not human-readable and merging is near impossible), we
recommended not collaborating directly with others on Jupyter notebooks. There
are two steps we recommend for using notebooks effectively:
- Follow a naming convention that shows the owner and the order the analysis was done in. We use the format `<step>_<ghuser>_<description>.ipynb` (e.g., `1.0-ed-visualize-distributions.ipynb`).
- Refactor the good parts. Don't write code to do the same task in multiple notebooks. If it's a data preprocessing task, put it in the pipeline at `package_name/data` and load data from `data/interim`. If it's useful utility code, refactor it to `package_name`.
Now, by default we turn the project into a Python package, so you can import your code and use it in notebooks with a cell like the following:
# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload
# OPTIONAL: always reload modules so that as you change code in package_name, it gets loaded
%autoreload 2
import package_name.data as data
Licensing
License your code! A repository like this without a license maintains
copyright to the author, but does not provide others with any
conditions under which they can use the software. In this case, we use
the BSD-3 license. You can read the conditions of the license in the
LICENSE
file. As you can see, this is not an Apple software license
agreement (has anyone ever actually tried to read one of those?). It's
actually all quite simple, and boils down to "You can do whatever you
want with my software, but I take no responsibility for what you do
with my software".
For more details on what you need to think about when considering choosing a license, see this article!
Getting cited
When others use your code in their research, they should probably cite you. To
make their life easier, we use duecredit. This is a software
library that allows you to annotate your code with the correct way to cite it.
To enable duecredit
, we have added a file due.py
into the main directory.
This file does not need to change at all (though you might want to occasionally
update it from duecredit itself. It's
here,
under the name stub.py
).
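Annotating your code then looks roughly like this; the DOI, description, and tags below are placeholders, and you should check the duecredit documentation for the exact call signature:
# package_name/package_name.py -- sketch of a duecredit annotation.
# The DOI, description, and tags are placeholders; adapt them to your paper.
from .due import due, Doi

due.cite(Doi("10.xxxx/placeholder"),
         description="Short description of what this code implements",
         tags=["implementation"],
         path="package_name")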
In addition, you will want to provide a digital object identifier (DOI) to the article you want people to cite.
To get a DOI, use the instructions in this page.
Another way to get your software cited is by writing a paper. There are several journals that publish papers about software.
Contributing
Best practices change, tools evolve, and lessons are learned. The goal of this project is to make it easier to start, structure, and share an analysis. Pull requests and filed issues are encouraged. We'd love to hear what works for you, and what doesn't!
Links to related projects and references
Project structure and reproducibility is talked about more in the R research community. Here are some projects and blog posts if you're working in R that may help you out.
- Project Template - An R data analysis template.
- "Designing projects" on Nice R Code.
- "My research workflow" on Carlboettiger.info.
- "A Quick Guide to Organizing Computational Biology Projects" in PLOS Computational Biology.
Acknowledgements
Many thanks to the Cookiecutter project (github), which is helping us all spend less time molding and more time getting things baked.
This cookiecutter was developed alongside the awesome Cookiecutter Data
Science
project (github),
borrowing features from
shablona (e.g. duecredit, travis CI)
and inspired by the directory scaffold of
Cookiecutter-Pytorch. Here
are some examples of projects using the cookiecutter-pytorch template: