Review of Reimer et al.: Neurodata Without Borders and DataJoint in Practice


Introduction

This blog post is a first-pass review of:

Reimer M, Lakshmi B, Waxman S, Tan A. Core principles for the implementation of the neurodata without borders data standards. Journal of Neuroscience Methods. Volume 348, 15 January 2021, 108972.

The paper will henceforth be called Reimer et al. It shows how to use Neurodata Without Borders (NWB) and DataJoint in practice. Software analysis tools and standardization are becoming indispensable in neuroscience/neuroengineering research. The paper provides a guide and practical examples of how to use NWB together with DataJoint in your research. This work is a foundational paper for me, and I am grateful that the authors published it and released the source code.

This blog post presents notes I took during a quick read of Reimer et al. It is just a first-pass review; I will post a more detailed analysis in the future. This blog post is a critique of the paper and NOT a criticism or a mockery. When I study papers, I look for things that I could do differently in order to build upon the work of others. I took the time to write this blog post because I find the work valuable.

Value-Add

The paper starts off by introducing NWB. NWB is a data standard for formatting and structuring neurophysiological data, intended to enable better data sharing, reuse, and analysis between neuroscientists and laboratories. Currently there is a large reproducibility and understandability problem in neuroscience: experiments are hard to reproduce, and data formats are hard to understand for people outside the laboratory that generated the data. NWB was made to reduce these problems. Reimer et al. provides validation of NWB with practical code examples.
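To make the NWB idea concrete, here is a minimal sketch of my own (not taken from the paper) of reading an NWB file with pynwb; the file name is a placeholder.

```python
# Minimal sketch (not from Reimer et al.) of opening an NWB file with pynwb.
# "session.nwb" is a placeholder file name.
from pynwb import NWBHDF5IO

with NWBHDF5IO("session.nwb", mode="r") as io:
    nwbfile = io.read()
    print(nwbfile.session_description)  # free-text description of the session
    print(nwbfile.acquisition)          # raw acquired data, e.g. TimeSeries objects
```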

DataJoint is an object-relational mapper (see https://en.wikipedia.org/wiki/Object–relational_mapping for a theoretical discussion) that provides a query language on top of MySQL (or MariaDB). It removes the need to write SQL to access data in a relational database and instead provides a query language in Python. This paper shows how DataJoint can be used with NWB. I will discuss database design and DataJoint in a future post.
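As a rough illustration of the query-language point, here is a sketch of my own (not the authors' code) of querying through DataJoint instead of raw SQL. The connection settings and the Session table are placeholders.

```python
# Sketch (not from Reimer et al.) of how DataJoint replaces hand-written SQL.
# The connection settings and the Session table are placeholders.
import datajoint as dj

dj.config["database.host"] = "localhost"   # assumption: a local MySQL/MariaDB server
dj.config["database.user"] = "user"
dj.config["database.password"] = "password"

schema = dj.schema("demo")

@schema
class Session(dj.Manual):
    definition = """
    session_id   : int        # unique session number
    ---
    session_date : date       # date the session was recorded
    """

# Instead of: SELECT session_id FROM session WHERE session_date > "2021-01-01"
recent_ids = (Session & 'session_date > "2021-01-01"').proj().fetch("session_id")
```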

Supporting Concepts and Companies

Big Data, to me, involves working with petabytes. It sometimes has needs that relational databases don't meet, which is why Big Data work frequently uses NoSQL databases. MySQL, which powers DataJoint, is a relational database. I feel neuroscience is currently a data science problem, true, but not yet commonly a Big Data problem, as Reimer et al. implies.

DataJoint is developed by DataJoint Neuro (formerly known as Vathes LLC).

The classic Jupyter Notebook interface is being superseded by JupyterLab, so I would have pushed to use JupyterLab if I were one of the authors.

Experimental Setup

The experimental setup involves tracking a mouse's location. Figure 1 of the paper shows a Sprague Dawley rat even though the experimental setup involves a mouse; I believe that is a typo. Figure 2 shows an entity-relationship model illustrating how the tables in the database are related. In DataJoint lingo, an entity-relationship model is called a DataJoint pipeline.

Codebase

The codebase is divided into four parts. The first creates the DataJoint pipelines (essentially, it creates the database). The second shows how NWB works with DataJoint. The third shows different ways to feed the DataJoint pipeline with data. The fourth shows analysis work.
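Below is a small sketch, my own and not the authors' code, of what "creating a DataJoint pipeline" typically looks like: a schema plus table classes whose definitions encode their dependencies. The table and attribute names here are hypothetical, not those used in the paper.

```python
# Hypothetical sketch of a DataJoint pipeline: a schema and dependent tables.
import datajoint as dj

schema = dj.schema("tutorial_pipeline")

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int                # unique animal ID
    ---
    dob      : date               # date of birth
    sex      : enum('M', 'F', 'U')
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Mouse                      # foreign key: each session belongs to one mouse
    session_id : int
    ---
    session_datetime : datetime   # stored in UTC
    """

dj.Diagram(schema)  # renders the entity-relationship diagram ("pipeline") in a notebook
```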

I feel it is bad practice to store codebases meant for papers in a personal GitHub account. I prefer that they be stored in a laboratory, department, or institute GitHub account. This helps ensure permanence: an individual user can delete or change a codebase at will, whereas a centralized account would be more diligent about deletions and changes. I also feel codebases that are part of papers should be tagged, so the exact version described in the paper can always be retrieved. Codebases in papers should be treated the way codebases are treated in large open source projects like React.js; React.js code is stored in Facebook's corporate account along with other projects. Disclaimer: I don't feel the same rigor should be applied to codebases that are part of blog posts, as blog post code is not code reviewed.

I may do a code review in a future blog post. For now: I am a fan of f-strings, and using them would have made things easier. The authors could have used a cookiecutter template to set up the codebase. It would also have been better if the authors had Dockerized the codebase, so users are freed from downloading dependencies manually; since that was not done, it is good practice to provide links in the README on where to get the dependencies. In database design, it is best to store times in UTC so there is no confusion about time zones. It is best to automate creation of the DataJoint pipeline so users can focus on experiments. It is a good idea to use a code analysis tool to automate checking of code standards; I personally like pylint. It is also good to have unit tests to help verify the code works properly; a tool I like is pytest.
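As a small illustration of the UTC point, here is a sketch in Python (my own, not from the paper) of keeping timestamps timezone-aware in UTC and converting to local time only for display:

```python
# Store and compare timestamps in UTC; convert to local time only when displaying.
from datetime import datetime, timezone

recorded_at = datetime.now(timezone.utc)  # timezone-aware UTC timestamp
print(recorded_at.isoformat())            # unambiguous ISO 8601 string with +00:00 offset
print(recorded_at.astimezone())           # converted to the local time zone for display
```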

Data sources

You can get NWB data from the Open Neurophysiology Environment (ONE), developed by the International Brain Laboratory (IBL), or from the Distributed Archives for Neurophysiology Data Integration (DANDI). These resources, although in early stages, are a good start. I foresee a world where neuroscientists first check whether the data they need already exists in a repository like DANDI, rather than starting with animal experiments. While I feel animal research is essential for progress for all living things, I strongly feel it should be done as a last resort due to the sacrifice it requires. Sharing data can help relieve the need for some animal sacrifice.

Downside of NWB

No tool lacks weaknesses. NWB supports Python and MATLAB only; if your laboratory uses R, C, or C++, you need to investigate a third-party tool, and I do not know what to say if you still use Fortran or an esoteric language. NWB does not force users to "fall into the pit of success": it does not enforce a specific data structure or best practices. I personally prefer this flexibility, but some do not, so it has to be duly noted. Bad design can make it harder to answer scientific questions, so users have to follow good NWB design practices. NWB is essentially a wrapper around HDF5, and failure to recognize this can lead to poor design choices. DataJoint likewise requires users to have an idea of good database design; a bad database design can make answering scientific questions harder.
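To illustrate the "wrapper around HDF5" point, here is a small sketch of my own (not from the paper): the same NWB file that pynwb reads can be opened directly with h5py, since it is an HDF5 file underneath. The file name is a placeholder.

```python
# An NWB file is HDF5 underneath; h5py can walk its hierarchy directly.
# "session.nwb" is a placeholder file name.
import h5py

with h5py.File("session.nwb", "r") as f:
    f.visit(print)  # print every group/dataset path in the HDF5 hierarchy
```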

Summary

Overall, I enjoyed reading Reimer et al. It is a good starting point for learning how to use NWB and DataJoint in the real world. I intend to read it again and do a more detailed study of both the code and the paper. I am thankful to Reimer et al. for releasing this work and making the code freely available.

Reference

This blog post is based on:

Reimer M, Lakshmi B, Waxman S, Tan A. Core principles for the implementation of the neurodata without borders data standards. Journal of Neuroscience Methods. Volume 348, 15 January 2021, 108972.
