
Cookiecutter GPN: A flexible template for implementing and sharing machine learning models.

Project GitHub Repo: https://github.com/gpnlab/ccgpn

Why this template?


A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don't necessarily have to read 100% of the code before knowing where to look for very specific things.

Well organized code tends to be self-documenting in that the organization itself provides context for your code without much overhead. A great example is the Filesystem Hierarchy Standard for Unix-like systems. The /etc directory has a very specific purpose, as does the /tmp folder, and everybody (more or less) agrees to honor that social contract. That means a Red Hat user and an Ubuntu user both know roughly where to look for certain types of files, even when using each other's system — or any other standards-compliant system for that matter! Ideally, that's how it should be when a colleague opens up your data science project.

The recommendations we make here follow the standards and conventions of much of the scientific Python ecosystem. Following these standards and recommendations will make it easier

  • for others to use your code,
  • to port your code into other projects,
  • to collaborate with other users of this ecosystem,
  • to learn about the analysis process and the domain, and
  • to feel confident in analysis conclusions.

We're not talking about devoting a disproportionate amount of time to menial and trivial matters like indentation aesthetics or hairsplitting formatting standards; after all, what really matters is code correctness and reproducibility.

Disagree with a couple of the default folder names? Working on a project that's a little nonstandard and doesn't exactly fit with the current structure? Prefer to use a different package than one of the (few) defaults? Go for it! This is a lightweight structure, and is intended to be a good starting point for many projects.

Requirements to use this cookiecutter:


  • Python 3.8+
  • PyTorch >= 1.9
  • Tensorboard >= 2.6
  • Cookiecutter Python package >= 1.7.0, which can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter

or

$ conda config --add channels conda-forge
$ conda install cookiecutter

To start a new project, run:


$ cookiecutter -c v1.0.0 https://github.com/gpnlab/ccgpn


Do not give your repo a name (the repo_name prompt) that starts with a number: Python cannot import modules whose names start with a digit, and working around that everywhere is a hassle.

Add source control to the repo using Git and GitHub


  1. Add local source control using Git
$ make git
  2. Track the repo remotely using GitHub
$ make github

The resulting directory structure


The directory structure of your new project looks like this:

├── LICENSE
├── Makefile             <- Makefile with commands like `make data` or `make train`
├── README.md            <- The top-level README for developers using this project.
│
├── references/          <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports              <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures/         <- Generated graphics and figures to be used in reporting
│
├── docs/                <- A default Sphinx project; see sphinx-doc.org for details
│
├── data                 <- directory for storing input data
│   ├── external/        <- Data from third party sources.
│   ├── interim/         <- Intermediate data that has been transformed.
│   ├── processed/       <- The final, canonical data sets for modeling.
│   └── raw/             <- The original, immutable data dump.
│
├── experiments                                     <- directory for storing experimental setups
│   ├── A01_MLP-XOR-implementation/
│   ├── A02_clustering-interleaving-moons/
│   ├── A03_MNIST-handwritten-digit-classification/
│   └── A04_fashion-MNIST-clothing-classification   <- Aims. Naming convention is
│           │                                          aim ID (A + ordered two-digit integer)
│           │                                          + "_" + short "-" delimited description
│           ├── E01_linear-model/
│           ├── E02_MLP/
│           └── E03_CNN                             <- Experiments. Naming convention is
│                │                                     experiment ID
│                │                                     (E + ordered two-digit integer)
│                │                                     + "_" + short "-" delimited description
│                ├── S0001_pilot/
│                ├── S0002_adam/
│                ├── S0003_dec-lr/
│                └── S0004_inc-batch                <- Setups. Naming convention is setup ID
│                      │                               (S + ordered four-digit integer)
│                      │                               + "_" + short "-" delimited description
│                      └── A04-E03-S0004.yml        <- YAML configuration file.
│                                                      Naming convention is aim ID +
│                                                      "-" + experiment ID + "-" + setup ID
│
│
├── notebooks/                     <- Jupyter notebooks. Naming convention is
│                                     a number (for ordering) + "_" + the creator's initials
│                                     + "_" + experiment ID + "_" + short `-` delimited
│                                     description, e.g. `1.0_ed_E01_results-visualization`
│
├── trials                         <- directory for storing experimental trials
│    ├── A01-E01-S0001/            <- Name of the YAML config file
│    ├── A02-E01-S0001/
│    ├── A03-E01-S0001/
│    └── A04-E03-S0004
│         ├── timestamp1/           <- timestamp, in YYYYMMDDhhmmssUTC format,
│         │                            i.e., year (YYYY), month (MM), day (DD),
│         │                            hour (hh), minute (mm), second (ss), all according to
│         │                            the Coordinated Universal Time (UTC) standard
│         └── timestamp2
│                    ├── etc/       <- stores all information necessary to reproduce this trial,
│                    │                 e.g., the YAML config file, resuming session, seeds
│                    ├── logs/      <- stores all logging information
│                    ├── runs/      <- stores Tensorboard runs
│                    └── ckpts/     <- stores checkpoints
│
│
├── observations                    <- directory for storing experimental observations
│    ├── A01-E01-S0001/
│    ├── A02-E01-S0001/
│    ├── A03-E01-S0001/
│    └── A04-E03-S0004
│         ├── timestamp1/
│         └── timestamp2/
│
│
├── tox.ini              <- tox file with settings for running tox; see tox.readthedocs.io
│
├── requirements.txt     <- The requirements file for reproducing the analysis environment,
│                           e.g., generated with `pip freeze > requirements.txt`
├── requirements-dev.txt <- The requirements file with the devel dependencies
│
├── setup.py             <- makes project pip installable (pip install -e .)
│                           so package_name can be imported
├── tests/               <- folder with test code
│
├── logging.yml          <- logging configuration
│
└── package_name                  <- package directory
    ├── __init__.py               <- Makes package_name a Python package
    ├── cli.py                    <- command line interface
    ├── package_name.py           <- package script
    ├── main_foo.py               <- training script for model "foo"
    ├── main_bar.py               <- training script for model "bar"
    │
    ├── base                      <- abstract base classes
    │   ├── __init__.py           <- Makes the abstract base class a Python subpackage
    │   ├── transform.py          <- abstract base class for data transformations
    │   ├── dataset.py            <- abstract base class for datasets
    │   ├── dataloader.py         <- abstract base class for data loaders
    │   ├── arch.py               <- abstract base class for models' architectures
    │   ├── loss.py               <- abstract base class for losses
    │   ├── metric.py             <- abstract base class for metrics
    │   ├── optimizer.py          <- abstract base class for optimizers
    │   ├── scheduler.py          <- abstract base class for schedulers
    │   └── trainer.py            <- abstract base class for trainers
    │
    ├── utils                     <- utilities
    │    ├── seeder.py            <- manages reproducibility
    │    ├── logger.py            <- training logger
    │    ├── monitor.py           <- Tensorboard visualization support
    │    └── backer.py            <- manages paths for saving models + logs
    │
    ├── model_foo/                <- each model is a subpackage
    └── model_bar                 <- model "bar"
         ├── __init__.py          <- Makes model_bar a Python subpackage
         ├── dataset.py
         ├── transform.py
         ├── dataloader.py
         ├── loss.py
         ├── metric.py
         ├── arch.py
         ├── optimizer.py
         ├── scheduler.py
         └── trainer.py

Test the newly created repo


  1. Create a virtual environment
$ make virtual_environment
  2. Install the development requirements. Note: before installing the requirements, update environment.yml and/or requirements.txt and requirements-dev.txt to suit your project needs
$ make install_requirements
  3. Run the MNIST hello world example
$ conda activate <your_repo_name>
$ <your_repo_name> train-hello

To see all available rules, use the make help command.

As of this version, the following commands are available:

$ make help

Available rules:

clean                Delete all compiled Python files
git                  Set up Git
github               Set up GitHub
install_requirements Install Python Dependencies
lint                 Lint using flake8
sync_data_from_s3    Download Data from S3
sync_data_to_s3      Upload Data to S3
test_environment     Test python environment is setup correctly
virtual_environment  Set up python interpreter environment

Repository Structure


Features:

  • Clear folder structure which is suitable for many deep learning projects.
  • Runs are configured via .yml files allowing for easy experimentation.
  • Checkpoint saving and resuming.
  • Tensorboard logging

We now examine the newly created repository in more detail.

Config file format


Config files are in .yml format:

name: A04-E03-S0004
seed: 12345
target_devices: [0]
save_dir: trials/

arch:
    type: my_arch
    args: {}

transform:
    type: my_transform
    args: {}

dataloader:
    type: my_dataloader
    args:
        batch_size: 128
        data_dir: data/
        num_workers: 2
        shuffle: true
        validation_split: 0.1

loss: my_loss

lr_scheduler:
    type: my_lr_scheduler
    args:
        gamma: 0.1
        step_size: 50

metrics:
    - my_metric
    - my_metric2

optimizer:
    type: my_optimizer
    args:
        lr: 0.001
        weight_decay: 0

training:
    early_stop: 10
    epochs: 100
    monitor: min val_loss
    save_period: 1
    tensorboard: true

testing:
    data_dir: data/
    batch_size: 128
    num_workers: 8

Add additional configurations as needed.
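
As a rough sketch of how such a file can be consumed (the template ships its own configuration loader; the build helper and registry below are purely illustrative, not part of the template), PyYAML can parse the file and the type/args pairs can be used to instantiate objects:

# Illustrative sketch only -- the template ships its own loader.
# `build` and `registry` are hypothetical helpers, not part of the template.
import yaml

with open("A04-E03-S0004.yml") as f:
    config = yaml.safe_load(f)

def build(section, registry):
    """Instantiate the class named in config[section]['type']
    with the keyword arguments in config[section]['args']."""
    spec = config[section]
    return registry[spec["type"]](**spec.get("args", {}))

# e.g., dataloader = build("dataloader", {"my_dataloader": MyDataLoader})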

Checkpoints


You can specify the name of the training session in config files:

name: A04-E03-S0004

The checkpoints will be saved in

<save_dir>/<trial>/<timestamp>/ckpts/ckpt_epoch-<n>_<trial>_<timestamp>,

where <save_dir> and <trial> are as defined in the config file, <n> is an integer identifying the checkpoint, and <timestamp> is a datetime stamp in the YYYYMMDDhhmmssUTC format, i.e., year (YYYY), month (MM), day (DD), hour (hh), minute (mm), second (ss), all according to the Coordinated Universal Time (UTC) standard. A copy of the config file will be saved in

<save_dir>/<trial>/<timestamp>/etc/config_<trial>_<timestamp>.yml.
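
For concreteness, here is a minimal sketch of how such paths could be assembled; in the template this bookkeeping is handled by the backer.py utility, so the snippet below is only illustrative:

# Illustrative only: assembling the paths described above.
from datetime import datetime, timezone

save_dir, trial, epoch = "trials", "A04-E03-S0004", 3
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S") + "UTC"
ckpt_path = f"{save_dir}/{trial}/{timestamp}/ckpts/ckpt_epoch-{epoch}_{trial}_{timestamp}"
config_copy = f"{save_dir}/{trial}/{timestamp}/etc/config_{trial}_{timestamp}.yml"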

Note: checkpoints contain:

checkpoint = {
    'arch': arch,
    'epoch': epoch,
    'state_dict': self.model.state_dict(),
    'optimizer': self.optimizer.state_dict(),
    'monitor_best': self.mnt_best,
    'config': self.config
}
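
A minimal sketch of how such a dictionary is typically written and restored with plain PyTorch calls; the template's trainer implements its own save/resume logic, so this is only for orientation:

# Illustrative only: saving and resuming a checkpoint dictionary like the one above.
import torch

def save_checkpoint(checkpoint, path):
    torch.save(checkpoint, path)

def resume(model, optimizer, path, device="cpu"):
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["epoch"] + 1  # epoch to resume training from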

Tensorboard Visualization


This template supports Tensorboard visualization.

  1. Run training

    Set the tensorboard option in the config file to true.

  2. Open the Tensorboard server

    Run tensorboard --logdir <save_dir>/ at the project root; the server will start at http://localhost:6006

By default, the values of the loss and metrics specified in the config file, input images, and histograms of model parameters will be logged. If you need more visualizations, use add_scalar('tag', data), add_image('tag', image), etc. in the trainer._train_epoch method. The add_*() methods in this template are basically wrappers around those of the tensorboard SummaryWriter module.

Note: You don't have to specify the current step, since the TensorboardWriter class defined in utils/monitor.py tracks the current step automatically.
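
For orientation, here is what the underlying calls look like when using torch.utils.tensorboard directly; inside the template you would call the wrapper methods from trainer._train_epoch and omit the explicit step, as noted above. The log directory shown is just an example:

# Illustrative only: the raw SummaryWriter calls that the template's writer wraps.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="trials/A04-E03-S0004/20210901120000UTC/runs")
for step, loss in enumerate([0.9, 0.7, 0.5]):
    writer.add_scalar("loss/train", loss, global_step=step)
writer.close()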

The data folder


Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure, but anyone should be able to reproduce the final products with only the code in package_name and the data in data/raw.

Analysis should be a DAG. Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done. Here are some questions we've learned to ask with a sense of existential dread:

  • Are we supposed to go in and join the column X to the data before we get started or did that come from one of the notebooks?
  • Come to think of it, which notebook do we have to run first before running the plotting code: was it "process data" or "clean data"?
  • Where did the shapefiles get downloaded from for the geographic plots?
  • \(\{\textrm{et cetera}\}\times \infty\)

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work, for example separation of concerns, abstracting analysis as a DAG, and engineering best practices like version control.

If the analysis is a DAG, it can be streamlined. Often in an analysis you have long-running steps that preprocess data or train models. If these steps have been run already (and you have stored the output somewhere like the data/interim directory), you don't want to wait to rerun them every time. We prefer make for managing steps that depend on each other, especially the long-running ones. Make is a common tool on Unix-based platforms (and is available for Windows). Following the make documentation, Makefile conventions, and portability guide will help ensure your Makefiles work effectively across systems. Here are some examples to get started. A number of data folks use make as their tool of choice, including Mike Bostock. There are other tools for managing DAGs that are written in Python instead of a DSL (e.g., Paver, Luigi, Airflow, Snakemake, Ruffus, or Joblib). Feel free to use these if they are more appropriate for your analysis.

Also, if data is immutable, it doesn't need source control in the same way that code does. Therefore, by default, the data folder is included in the .gitignore file. If you have a small amount of data that rarely changes, you may want to include the data in the repository. GitHub currently warns if files are over 50MB and rejects files over 100MB. Some other options for storing/syncing large data include AWS S3 with a syncing tool (e.g., s3cmd), Git Large File Storage, Git Annex, and dat. Currently, by default, we support instantiating an S3 bucket and using the AWS CLI to sync data in the data folder with the server (you can also sync data from a local server).

AWS CLI configuration. When using Amazon S3 to store data, a simple method of managing AWS access is to set your access keys to environment variables. However, when managing multiple sets of keys on a single machine (e.g., when working on multiple projects), it is best to use a credentials file, typically located in ~/.aws/credentials. A typical file might look like:

[default]
aws_access_key_id=myaccesskey
aws_secret_access_key=mysecretkey

[another_project]
aws_access_key_id=myprojectaccesskey
aws_secret_access_key=myprojectsecretkey

You can add the profile name when initialising a project; assuming no applicable environment variables are set, the default profile credentials will be used.

Finally, a handy code snippet to get the standard file-system location for the data folder:

import os.path as op
from pathlib import Path
import package_name as pkg
root = Path(pkg.__path__[0]).parent.absolute()
data_path = op.join(root, 'data')

The package_name folder


We place the main module for the source code in a file called package_name.py in a directory called package_name. This structure is a bit confusing at first, but it is a simple way to create a structure where, when we type import package_name as pkg in an interactive Python session, the classes and functions defined inside the package_name.py file are available in the pkg namespace. For this to work, we also need to create an __init__.py file that imports everything from package_name.py into the namespace of the project:

from .package_name import *

As a general rule, in module code, we follow the convention that all functions are either imported from other places, or are defined in lines that precede the lines that use that function. This helps readability of the code, because you know that if you see some name, the definition of that name will appear earlier in the file, either as a function/variable definition, or as an import from some other module or package.
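
As a hypothetical example (the function name is made up), if package_name.py defines a transform_data function, the wildcard import in __init__.py makes it available directly on the package:

# package_name/package_name.py (hypothetical content)
def transform_data(x):
    """Square the input; stands in for your real analysis code."""
    return x ** 2

# In an interactive session, thanks to `from .package_name import *`:
#   >>> import package_name as pkg
#   >>> pkg.transform_data(3)
#   9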

Environment


The first step in reproducing an analysis is always reproducing the computational environment it was run in. You need the same tools, the same libraries, and the same versions to make everything play nicely together.

One effective approach to this is to use a virtual environment. We include a target in the Makefile (make virtual_environment) which will create a conda environment (see the Miniconda software distribution, not to be confused with Anaconda, though they are similar and both produced by Continuum) if conda is installed, and otherwise create a virtualenv (virtualenvwrapper is recommended for managing virtualenvs).

By listing all of your requirements in the repository (we include a requirements.txt file) you can easily track the packages needed to recreate the analysis. Here is a good workflow:

  • run make virtual_environment when creating a new project. This will create a conda environment if conda is available, else a virtualenv.
  • run conda activate project_name (or workon project_name) to activate the virtual environment.
  • pip install/conda install the packages that your analysis needs.
  • run conda list --export > requirements.txt (or pip freeze > requirements.txt) to pin the exact package versions used to recreate the analysis.
  • If you find you need to install another package, run conda list --export > requirements.txt (or pip freeze > requirements.txt) again and commit the changes to version control.

You can easily duplicate your environment and the associated dependencies with conda create --name <env_name> --file requirements.txt (or mkvirtualenv <env_name>; workon <env_name>; pip install -r requirements.txt).

This awesome conversion table lists operation equivalences between conda, pip and virtualenv. For more on the differences between conda and pip, see this article by Jake VanderPlas.

If you have more complex requirements for recreating your environment, consider a container-based approach such as Docker or Vagrant. Both of these tools use text-based formats (Dockerfile and Vagrantfile, respectively) you can easily add to source control to describe how to create a container with the requirements you need.

Making the Repository an Installable Package


For installation and distribution we will use the setuptools package. It uses a setup.py file to figure out how to install your software on a particular system. For a small project such as this one, managing installation of the software modules and the data is rather simple.

The package_name/version.py file contains all of the information needed for the installation and for setting up the PyPI page for the software. This also makes it possible to install your software using pip and easy_install, which are package managers for Python software. The setup.py file reads this information and passes it to the setup function, which takes care of the rest.
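
A minimal sketch of what such a setup.py might look like; the file generated by the template is more complete, and the attributes read from version.py (NAME, VERSION, and so on) are illustrative:

# Minimal illustrative setup.py; the generated file is more complete.
from setuptools import setup, find_packages

# Read metadata (NAME, VERSION, ...) from package_name/version.py
version = {}
with open("package_name/version.py") as f:
    exec(f.read(), version)

setup(
    name=version.get("NAME", "package_name"),
    version=version.get("VERSION", "0.1.0"),
    packages=find_packages(exclude=["tests"]),
)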

Much more information on packaging Python software can be found in the Hitchhiker's guide to packaging.

Continuous integration


Travis-CI is a system that can be used to automatically test every revision of your code directly from github, including testing of github pull requests, before they are merged into the master branch. This provides you with information needed in order to evaluate contributions made by others. It also serves as a source of information for others interested in using or contributing to your project about the degree of test coverage of your project.

You will need a .travis.yml file in your repo. This file contains the configuration of your testing environment. This includes the different environments in which you will test the source code (for example, we test package_name against Python 3.7 and Python 3.8). It also includes steps that need to be taken before installation of the software, for example, installation of the software dependencies.

For details on setting up Travis-CI with github, see Travis-CI's getting started page. To summarize:

  1. Go to the Travis-CI website and get a Travis user account, linked to your github user account.
  2. You will need to set up your github repo to talk to Travis.
  3. You will need to go back to travis-ci, and flip on the switch on that side as well.
  4. The travis output will also report to you about test coverage, if you set it up that way.
  5. You will start getting emails telling you the state of the testing suite on every pull request for the software, and also when you break the test suite on the master branch. That way, you can be pretty sure that the master is working (or at least know when it isn't...).

You can also continuously test your code on a Windows system. This is done on another CI system called Appveyor. In principle, it does something very similar to what Travis does: downloads your code, installs it on a Windows machine with various versions of Python, and runs the tests. Appveyor is controlled through another configuration file: the appveyor.yml. In addition to committing this file into the repository, you will need to activate Appveyor for your project. This is done by signing into the Appveyor interface with your GitHub account, clicking on the "projects" tab at the top of the page, then clicking on the "+" sign for "New project" and selecting the project you would like to add from the menu that appears (you might need to give Appveyor permission to see projects in your GitHub account).

Distribution


The main venue for distribution of Python software is the Python Package Index, or PyPI, also lovingly known as "the cheese-shop". To distribute your software on PyPI, you will need to create a user account on PyPI. It is recommended that you upload your software using twine. Using Travis, you can automatically upload your software to PyPI, every time you push a tag of your software to github. The instructions on setting this up can be found here. You will need to install the travis command-line interface.

Version Control



Source code version control is managed with git. Currently there are two files in the repository which help with working in this repository, and which you could extend further:

  • .gitignore -- specifies intentionally untracked files (such as compiled *.pyc files), which should not typically be committed to git (see man gitignore)
  • .mailmap -- if any of the contributors used multiple names/email addresses, or their git commit identity is just an alias, you could specify the ultimate name/email(s) for each contributor, so that commands such as git shortlog -sn can take them into account (see git shortlog --help)

Keep secrets and configuration out of version control. You really don't want to leak your AWS secret key or Postgres username and password on Github. Enough said — see the Twelve Factor App principles on this point. One solution is to create a .env file in the project root folder. Thanks to the .gitignore, this file should never get committed into the version control repository. Here's an example:

# example .env file
DATABASE_URL=postgres://username:password@localhost:5432/dbname
AWS_ACCESS_KEY=myaccesskey
AWS_SECRET_ACCESS_KEY=mysecretkey
OTHER_VARIABLE=something

If you look at the stub script in package_name/data/make_dataset.py, it uses a package called python-dotenv to load up all the entries in this file as environment variables so they are accessible with os.environ.get. Here's an example snippet adapted from the python-dotenv documentation:

# package_name/data/dotenv_example.py
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")

Styling



Remember that code will probably be read more times than it is written. Make it easy to read (for others, but also for yourself when you come back to it) by following a consistent formatting style. We strongly recommend following the PEP8 code formatting standard, and we enforce this by running a code linter called flake8 (it combines the tools pep8 and pyflakes), which automatically checks the code and reports any violations of the PEP8 standard (and checks for other general code hygiene issues).

Some projects include flake8 in their automated tests, so that every pull request is examined for code cleanliness. In this project, we run flake8 on most (but not all) files, with most (but not all) checks:

flake8 --ignore N802,N806 `find . -name "*.py" | grep -v setup.py | grep -v /docs/`

This means: check all .py files, but exclude setup.py and everything in directories named "docs". Do all checks except N802 and N806, which enforce lowercase-only names for variables and functions.

The Makefile contains a rule for running this check as well:

make lint

Documentation



Documenting your software is a good idea, not only as a way to communicate to others how to use it, but also as a way of reminding yourself, a few months or years from now when you return to the code, what issues you faced and how you dealt with them.

The first step in this direction is to document every function in your module code. We recommend following the numpy docstring standard, which specifies in detail the inputs/outputs of every function, and specifies how to document additional details, such as references to scientific articles, notes about the mathematics behind the implementation, etc.
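
For example, a small function documented in the numpy style might look like this (the function itself is just an illustration, not part of the template):

import numpy as np

def zscore(x, axis=0):
    """Standardize an array to zero mean and unit variance.

    Parameters
    ----------
    x : ndarray
        Input data.
    axis : int, optional
        Axis along which to standardize. Default is 0.

    Returns
    -------
    z : ndarray
        Array of the same shape as `x`, standardized along `axis`.
    """
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)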

This standard also plays well with a system that allows you to create more comprehensive documentation of your project. Writing such documentation allows you to provide more elaborate explanations of the decisions you made when you were developing the software, as well as provide some examples of usage, explanations of the relevant scientific concepts, and references to the relevant literature.

To document package_name we use the Sphinx documentation system. You can follow the instructions on the Sphinx website, and the example here, to set up the system, but we have also already initialized and committed a skeleton documentation system in the docs directory that you can build upon.

Sphinx uses a Makefile to build different outputs of your documentation. For example, if you want to generate the HTML rendering of the documentation (web pages that you can upload to a website to explain the software), you will type:

make html

This will generate a set of static webpages in the docs/_build/html directory, which you can then upload to a website of your choice.

Alternatively, readthedocs.org (careful, not readthedocs.com) is a service that will run sphinx for you, and upload the documentation to their website. To use this service, you will need to register with RTD. After you have done that, you will need to "import your project" from your github account, through the RTD web interface. To make things run smoothly, you also will need to go to the "admin" panel of the project on RTD, and navigate into the "advanced settings" so that you can tell it that your Python configuration file is in docs/conf.py.

Testing



Most scientists who write software constantly test their code. That is, if you are a scientist writing software, I am sure that you have tried to see how well your code works by running every new function you write, examining the inputs and the outputs of the function, to see if the code runs properly (without error), and to see whether the results make sense.

Automated code testing takes this informal practice, makes it formal, and automates it, so that you can make sure that your code does what it is supposed to do, even as you go about making changes around it.

Most scientists writing code are not really in a position to write a complete specification of their software, because when they start writing their code they don't quite know what they will discover in their data, and these chance discoveries might affect how the software evolves. Nor do most scientists have the inclination to write complete specs - scientific code often needs to be good enough to cover our use-case, and not every possible use-case. Testing the code serves as a way to provide a reader of the code with a very rough specification, in the sense that it at least specifies certain input/output relationships that will certainly hold in your code.

We recommend using the 'pytest' library for testing. The py.test application traverses the directory tree in which it is issued, looking for files with the names that match the pattern test_*.py (typically, something like our package_name/tests/test_package_name.py). Within each of these files, it looks for functions with names that match the pattern test_*. Typically each function in the module would have a corresponding test (e.g. test_transform_data). This is sometimes called 'unit testing', because it independently tests each atomic unit in the software. Other tests might run a more elaborate sequence of functions ('end-to-end testing' if you run through the entire analysis), and check that particular values in the code evaluate to the same values over time. This is sometimes called 'regression testing'. Regressions in the code are often canaries in the coal mine, telling you that you need to examine changes in your software dependencies, the platform on which you are running your software, etc.

Test functions should contain assertion statements that check certain relations in the code. Most typically, they will test for equality between an explicit calculation of some kind and the return value of some function. We recommend using functions from the numpy.testing module (which we import as npt) to assert certain relations on arrays and floating point numbers. This is because npt contains functions that are specialized for handling numpy arrays, and they allow you to specify the tolerance of the comparison through the decimal keyword argument.
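
A sketch of what such a test might look like, assuming a hypothetical pkg.zscore function like the one illustrated in the Documentation section; adapt it to your own module's functions:

# package_name/tests/test_package_name.py -- illustrative test only,
# assuming a hypothetical pkg.zscore function.
import numpy as np
import numpy.testing as npt

import package_name as pkg

def test_zscore():
    x = np.array([1.0, 2.0, 3.0])
    # explicit calculation to compare against the function's return value
    expected = (x - x.mean()) / x.std()
    npt.assert_almost_equal(pkg.zscore(x), expected, decimal=6)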

To run the tests on the command line, change your present working directory to the top-level directory of the repository (e.g. /home/user/code/package_name), and type:

py.test package_name

This will exercise all of the tests in your code directory. If a test fails, you will see a message such as:

package_name/tests/test_package_name.py .F...

=================================== FAILURES =============================

.
.
.


E       AssertionError:

.
.
.

package_name/tests/test_package_name.py:49: AssertionError
====================== 1 failed, 4 passed in 0.82 seconds ================

As your code grows and becomes more complicated, you might develop new features that interact with your old features in all kinds of unexpected and surprising ways. As you develop new features of your code, keep running the tests, to make sure that you haven't broken the old features. Keep writing new tests for your new code, and recording these tests in your testing scripts. That way, you can be confident that even as the software grows, it still keeps doing correctly at least the few things that are codified in the tests.

We have also provided a Makefile that allows you to run the tests with more verbose and informative output from the top-level directory, by issuing the following from the command line:

make test

Notebooks



Notebook packages like the Jupyter notebook, Beaker notebook, Zeppelin, and other literate programming tools are very effective for exploratory data analysis. Hence, the notebooks directory can be used as a place to experiment with your module code, and as a place to produce scripts that contain a narrative structure, demonstrating the use of the code, or producing scientific results from your code and your data and telling a story with these elements. For example, an IPython notebook that reads in some data, and creates a figure. Maybe this is Figure 1 from some future article?

When we use notebooks in our work, we often subdivide the notebooks folder. For example, notebooks/exploratory contains initial explorations, whereas notebooks/reports is more polished work that can be exported as html to the reports directory.

Since notebooks are challenging objects for source control (e.g., diffs of the json are often not human-readable and merging is near impossible), we recommend not collaborating directly with others on Jupyter notebooks. There are two steps we recommend for using notebooks effectively:

  1. Follow a naming convention that shows the owner and the order the analysis was done in. We use the format <number>_<initials>_<experiment ID>_<description>.ipynb (e.g., 1.0_ed_E01_results-visualization.ipynb), as described in the directory structure above.
  2. Refactor the good parts. Don't write code to do the same task in multiple notebooks. If it's a data preprocessing task, put it in the pipeline at package_name/data and load data from data/interim. If it's useful utility code, refactor it to package_name.

Now, by default we turn the project into a Python package, so you can import your code and use it in notebooks with a cell like the following:

# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload

# OPTIONAL: always reload modules so that as you change code in package_name, it gets loaded
%autoreload 2

import package_name.data as data

Licensing



License your code! A repository like this without a license maintains copyright to the author, but does not provide others with any conditions under which they can use the software. In this case, we use the BSD-3 license. You can read the conditions of the license in the LICENSE file. As you can see, this is not an Apple software license agreement (has anyone ever actually tried to read one of those?). It's actually all quite simple, and boils down to "You can do whatever you want with my software, but I take no responsibility for what you do with my software".

For more details on what you need to think about when choosing a license, see this article!

Getting cited



When others use your code in their research, they should probably cite you. To make their life easier, we use duecredit. This is a software library that allows you to annotate your code with the correct way to cite it. To enable duecredit, we have added a file due.py to the main directory. This file does not need to change at all (though you might want to occasionally update it from duecredit itself; it's available there under the name stub.py).

In addition, you will want to provide a digital object identifier (DOI) to the article you want people to cite.

To get a DOI, use the instructions in this page.

Another way to get your software cited is by writing a paper. There are several journals that publish papers about software.

Contributing



Best practices change, tools evolve, and lessons are learned. The goal of this project is to make it easier to start, structure, and share an analysis. Pull requests and filed issues are encouraged. We'd love to hear what works for you, and what doesn't!


Project structure and reproducibility are discussed more in the R research community. Here are some projects and blog posts that may help you out if you're working in R.

Acknowledgements



Many thanks to the Cookiecutter project (github), which is helping us all spend less time molding and more time getting things baked.

This cookiecutter was developed alongside the awesome Cookiecutter Data Science project (github), borrowing features from shablona (e.g., duecredit, Travis CI) and inspired by the directory scaffold of Cookiecutter-Pytorch. Here are some examples of projects using the cookiecutter-pytorch template:

  1. Severstal Steel Defect Detection (Kaggle)
  2. Aptos Blindness Detection (Kaggle)
  3. Understanding Cloud Organization (Kaggle)