Code Review in Laboratories and Scientific Journals


Abstract
Scientific research, especially in Neuroscience and Genomics, increasingly demands that researchers write code. Many laboratories write their own software to address specific problems no one else has addressed. Empirically, code from papers too often has not borrowed the honed methodologies commonly used by great and productive Software Development teams in corporate America to improve code quality and grow the programming skills of their developers. This blog post is focused on Code Review. It provides a big-picture overview of Code Review and is a humble attempt to get researchers interested in it.

Assumptions
This blog post assumes readers work in a wet lab and write scientific programs to address research questions. It also assumes readers are familiar with Github. Readers without this background can still understand the key points but may struggle with some parts; there are many Github tutorials online, like the one on TutorialsPoint. This blog post is a starting point. Researchers trying to get started with Code Review need to experiment to see what suits their laboratories and do a web search for ideas and available complementary tools.

Figure 1. Code Review in progress. Beautiful people sharpening each other’s programming skills via Code Review. Picture freely provided by Lagos Tech via Unsplash

Introduction
Code Review is a critique of code to ensure quality standards are met. It can be a self-review, whereby a researcher examines his/her own code for flaws in quality and areas that can be improved. It can be automated, as with a linting tool that examines code to ensure proper formatting and structure. It can also be done by a human other than the programmer who wrote the code. Human code review can be done with two or more people studying code together (like in Figure 1 above) or offline, where a reviewer writes notes on a Pull Request.

The benefits of Code Review cannot be overemphasized. It is crucial in growing the skill set of researchers at all levels. Although it adds time to the software development timeline, the value it adds is tremendous and worth the extra time. It also saves time in the long run by resulting in fewer bugs and sharpening researchers’ programming skills over time.

I personally read papers in Neuroscience and Genomics. Empirically, Code Review is a step not given due emphasis in the scientific literature. This blog post is an attempt to spark more interest in Code Review and contribute to any discussion of adding it as a criterion for accepting or rejecting papers in Scientific Journals. There is no debate that Software Development is an essential part of Neuroscience and Genomics research. In turn, there is a crucial need to adopt best practices developed and honed by companies that have built trillion-dollar businesses, like Amazon and Microsoft, and entire regions, like Silicon Valley in California and the tech hubs in Austin, TX and Seattle, WA. There are lots of such topics, but this blog will only address Code Review.

Types of Code review
This blog classifies four types of Code Review: Self (done by the researcher), Automated (done by a linting tool), Peer (done by a person or a group as a prerequisite for considering the code complete and test-ready) and Lab (done as part of a laboratory’s preparation to submit a paper to a journal). It is recommended that all four be done. They are listed in order of precedence: it is better to do a self-review before an automated one, and better to do an automated review before investing a human’s time in a Peer review; and so on. Here are descriptions of the types –

Self Review
It is a grave error to consider code complete and ready to be submitted along with a paper the instant it works. Researchers should take the time to review their own code to determine if it meets quality standards. This saves a lot of time and also helps reduce bugs, as the programmer can catch mistakes and save others from having to do so.

Automated Review
There are linters like Pylint that can automate verification of coding standards. Pylint ships with a set of pre-set rules that you can use, modify, or extend with new rules. Researchers are expected to satisfy these rules. The rules can be configured to emit warning messages, which are optional to address, or errors, which must be addressed. It is best for the laboratory to get together as a group and settle on these rules by consensus. Linting is great at detecting bad code and saves time because it catches issues a human reviewer need not worry about. There are many linters for many languages. You can do information-gathering to determine which is best for your needs and how to configure it.
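To make this concrete, here is a minimal sketch of running Pylint from the command line, assuming Pylint is installed and your script is named analysis.py (both are assumptions for illustration) –

pylint analysis.py                          # check a script against the configured rules
pylint --disable=invalid-name analysis.py   # example: waive one specific rule for this run

An exit code of 0 means Pylint raised no complaints, which makes it easy to wire into scripts.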

Peer Review
Peer Code Review is done either by two or more researchers meeting and discussing code in person or over video teleconferencing (on Zoom, Slack or some other system), or offline, where a reviewer writes comments on the code for later review by the researcher. The general idea is to ensure good code quality and to provide feedback on any work needed to reach that goal. Github has a system where a reviewer can comment on a Pull Request. After the review is completed, the reviewer can either request changes on the Pull Request or approve it.
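If your lab uses Github’s gh command-line tool (an assumption; the web interface works just as well), a review verdict can also be recorded from the terminal. The Pull Request number 42 and the comment text below are made up for illustration –

gh pr review 42 --request-changes --body "Please use a more descriptive function name"
gh pr review 42 --approve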

Lab Review
The Principal Investigator and lab members review papers before submitting them to a journal for publication. It would make sense for Code Review to be part of this process. It would be a good idea for the lab to at least smoke test the code, to ensure it does what it is expected to do independent of formal testing, and to verify the code satisfies the standards of a funding agency, like carrying a specific kind of license or being HIPAA-compliant.

Recommended Code Review by Scientific Journals
Scientific progress has benefited a lot from reviews – peer review of pre-released papers, reviews of grants, reviews of PhD proposals, etc. Researchers at all levels typically write proposals, have other researchers (especially more experienced ones) read them, and gain valuable feedback. Reviews can surface great ideas the writer has not thought of, help clarify ideas, point out fallacies in methodology, help prevent a researcher from wasting years on a poorly designed research project and increase the likelihood of gaining funding. It is recommended you never start an experimental session without a review, especially if you are a PhD student without years of research/real-world experience. It makes sense that code be given this same courtesy.

It would be great if Scientific Journals had selected linters that all code submitted as part of potentially published papers must satisfy prior to publication. It could be Pylint, a modified version of it, or something else. It makes no difference as long as the linter has strong community support and linting is enforced consistently. Community support is subjective and can be gauged by which potential linters have more stars and forks, or by some other means, like which one has worked well for others. Because linting is automated, it reduces the burden on human reviewers.
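One way a journal (or a lab) could enforce this consistently is to run the linter in Continuous Integration. Below is a hypothetical GitHub Actions workflow for a Python code base; the workflow name, Python version and file layout are assumptions for illustration, not a prescription –

name: lint
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install pylint
      - run: pylint $(git ls-files '*.py')   # lint every tracked Python file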

As mentioned earlier, it would be a great idea if Peer Reviewers were also made responsible for doing a code review, or if a dedicated group of Peer Reviewers did a Code Review of the code in papers. This would improve code quality and make the code in papers more useful.

Recommended Code Reviews Steps by Laboratories
Step 1 – Self Review
The first check should be a self-review. The researcher examines his/her own code for errors and areas that can be improved. This alleviates the need for others to do it. Linters may take time to run, so time is saved if, ideally, the linter then finds no errors or warnings and the researcher can progress to the next step. Otherwise, the code has to be improved and the linter re-run.

Step 2 – Automated review
Linters are automated tools that check code for quality based on configured rules. There are a variety of linters for a variety of popular programming languages. The beauty of linters is that they automate code review, freeing human code reviewers to focus less on styling and more on algorithms and efficiency. You can do information-gathering on the web on how to set one up for your particular code.
To illustrate the core idea of linters with a simple example, imagine this is code a researcher needs linted. Typically it is best to run the linter as part of Continuous Integration; smaller code bases may run it inside an IDE, and in a JupyterLab/Jupyter Notebook application a linting extension can be used. To keep things simple, let’s try the code in Code Snippet 1 –

def apple_engine():
    return ['Apple', 'orange', 'pear'];

Code Snippet 1. Code that fails linting.

The code in Code Snippet 1 fails linting. It is bad practice to use semi-colons in Python. The researcher needs to remove the semi-colon at the end of the line, producing the code in Code Snippet 2 –

def apple_engine():
    return ['Apple', 'orange', 'pear']

Code Snippet 2. Code that passes linting.

Now, linting passes and the code is ready for a human to review.

Step 3 – Performing Manual Code Review
There are no hard-and-fast rules. Empirically, I leave styling up to the linter; Code Review by humans examines the logic and flow of the code. Comments are made on anything questionable. There are two grades – Approve or Request changes. Approve means any change recommendations are optional and the code looks fine. Request changes means the researcher is required to make at least one change. The focus is typically on a single Pull Request and not the code base’s overall structure, like Readme.md and license.md files.

The reviewer doing the Code Review will see Code Snippet 2 in a Pull Request. The code is not of good quality because the function’s name is not intuitive. This is a change the researcher must make in order to pass the review. The code reviewer can also point out that fruit names should use consistent casing, but mark this as an optional change. If this is done offline, the reviewer can comment on the Pull Request, as illustrated here, to convey this to the researcher. The “Finish your review” modal popup can say something like –

Figure 2. Requesting changes. Potential use of github to request code changes.

The researcher makes changes. The function name is now more descriptive. The casing was untouched, but that does not matter, as changing it was optional. The code change is illustrated in Code Snippet 3 below –

def get_list_of_rat_feed():
    return ['Apple', 'orange', 'pear']

Code Snippet 3. Code with a descriptive function name.

The reviewer, satisfied with the change, can approve the code as illustrated in Figure 3.

Figure 3. Approved. Potential way to approve a Pull Request.

Code Review takes time, but the benefits more than make up for it. The researcher learned not to use semi-colons to end lines in Python and to name functions appropriately. Teaching is a great way to reinforce concepts, so the reviewer benefits too. The team benefits because code quality was improved. This example does not illustrate Code Review catching bugs, but empirically, the team I worked on that produced the fewest bugs was the team with mandatory Code Review. This saved time and enhanced the user experience.

This is a trivial example, but it illustrates the core ideas of Code Review and a workflow. Some code reviewers smoke test the code and read the task description to better understand the requirements and ensure the code satisfies them. But this is optional, and testing can be left to the testing team. It is ideal not to use the code reviewer as a tester, although in small teams there may be no choice.
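For what it is worth, a smoke test of the function from Code Snippet 3 can be as small as a one-line assertion; this sketch simply calls the function and sanity-checks the result –

# Minimal smoke test: the function runs and returns the expected list
assert get_list_of_rat_feed() == ['Apple', 'orange', 'pear']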

Step 4 – Lab review
Lab Review focuses more on the overall code base: Readme.md, license.md, ensuring there is a means for other researchers to access the data, and other factors needed to ensure grant requirements are satisfied. It is primarily done by the Principal Investigator or someone responsible for compliance, like a director. It tends not to be focused on individual Pull Requests. It also has no hard-and-fast rules and varies per lab and per grant.


Personal Anecdote
I foresee a future where big research breakthroughs are made possible by code. I truly believe in this. It is why I left Chicago to move to Silicon Valley in California – to be around smart people who push the envelope in programming, Neuroscience and Genomics. Silicon Valley has among the smartest people on the planet and it is great to be near them.


I am proud to have contributed to the growth of others by doing Code Review for them. It is time well spent, and I will never work in a laboratory where Code Review is not standard. When I do code reviews, I am brutally honest and expect the next review, if one is needed, to go better. I expect others to do the same for me. It is important to be honest and have frank, open communication to help people grow. A culture of gossiping about others behind their backs while not providing them feedback does not help growth; it is counterproductive and a waste of time. I recommend leaving such a job rather than trying to salvage the upcoming disaster the company/group may become; I know first hand why not to make that mistake. Sometimes a programmer takes an approach the reviewer has no experience with, so it is good to talk with the programmer and discuss why the approach was taken, rather than conclude the programmer does not know what he or she is doing and make others believe so too. Code Review is an opportunity for both the programmer and the reviewer to learn. With honest Code Review, egos can get bruised, but the benefits of learning and growth heal the wounds. I bring this up because the beauty of Code Review will be lost if open and honest communication is hindered by political correctness.

Figure 4 – The future. I foresee a world where great software comes not just from a geek hacking on a business problem, like how to get people to click on ads, but from researchers who code to contribute towards answering pressing scientific questions. Picture provided by National Cancer Institute via Unsplash.

Additional Topics
There are many other topics to consider in improving the quality of code in scientific papers – unit testing, Continuous Integration and Dockerization among them – several of which are covered elsewhere on this blog.

Conclusion
Code Review is a great way to enhance research quality and help researchers learn good programming habits. There are four categories of Code Review, involving automated or manual verification. The time it takes to do Code Review is worth it, as there can be tremendous gains in code quality and in the programming skills of researchers. It would be great if Scientific Journals required linting of all code that is part of a paper.


Demonstration of Cookiecutters in Neuroscience Research Onboarding


Abstract
This post demonstrates using Cookiecutter to streamline the onboarding of experiment- and session-code bases. The example comes from a Neuroscience paper about Neurodata Without Borders and DataJoint, but it is applicable to almost any research area where code shared between researchers and groups has a lot in common.

Prerequisites

  • Familiarity with Python
  • Familiarity with virtual environments
  • Interest in the big-picture story of using a Cookiecutter, rather than creating one
  • Familiarity with Github and basic terms like fork

Relevance
Setting up a code base requires generating files and making decisions about the license, requirements, base code, programming language, file structure and other factors. Researchers who need to use code with an already agreed-upon configuration can streamline onboarding by using a Cookiecutter. This saves time and helps reduce bugs caused by manually updating code.

Introduction
Sometimes in Neuroscience, different researchers, like grad students and post-docs, need to use the same equipment, experimental setup and code base, but with slight modifications, like different credentials or file directories. A common approach to handling differences like credentials is to copy-and-paste existing code and search-and-replace the areas that are unique to a researcher’s experimental setup. A better approach is to use a templating system that requests information from the researcher and then uses that information to create the code base. This blog post highlights one potential solution – Cookiecutter – and demonstrates its efficacy with an example. The example is taken from Reimer et al from the Tan lab at Yale. This post does not show you how to develop a Cookiecutter – it just demonstrates how to use one.

What is a Cookiecutter?
Cookiecutter is a templating system that gets information from users and uses that information to autogenerate and populate parts of a code base. The Cookiecutter this post discusses is the one from Audrey Roy Greenfeld, who is famous in the Django community. It is commonly used for Python projects and is written mostly in Python, but it can be used for other languages like Matlab, C++ and more. This post focuses on consuming a ready-made Cookiecutter and will not discuss creating a Cookiecutter project from scratch. Matias Calderini wrote an excellent post on building a Cookiecutter project, and you can study his work to learn how to make one yourself – https://maticalderini.github.io/blog/tutorial/2020/04/13/cookiecutter.html.
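The core mechanism is simple: a cookiecutter.json file declares the questions and their defaults, and template files reference the answers. A stripped-down sketch, mirroring a few of the prompts you will see later in this post –

{
  "name": "Phil",
  "project_name": "Session",
  "project_slug": "{{ cookiecutter.project_name.lower() }}"
}

Any file in the template can then embed {{ cookiecutter.name }} or {{ cookiecutter.project_slug }}, and Cookiecutter substitutes the answers at generation time.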

Demonstration
In the paper by Reimer et al, there are a number of Jupyter notebook files a researcher can modify freely and use for his or her own experiments (it is recommended you read the paper and explore the code). To use the code in the paper for a specific experimental session, a researcher needs to fork the code off https://github.com/MarikeReimer/Big-Data-with-DataJoint-and-NWB/ and add credentials manually. To demonstrate the power of Cookiecutter, two of the files from the paper will be examined – Chapter2WorkingWithNWBAndDataJointData.ipynb and ImportsAndTableDefinitions.py.

Using a Cookiecutter simply involves pulling it down, setting values and using the autogenerated code. To start off, create a virtual environment. You can use your favorite tool, but this demonstration will use conda.

conda create -n openFieldStudy python=3.8

Now, activate the virtual environment

conda activate openFieldStudy

Next, download Cookiecutter.

conda install cookiecutter

This Cookiecutter code base, https://github.com/zoldello/OpenFieldStudyCookieCutter, was made specifically for this post. Again, a detailed discussion of how it was made is beyond the scope of this post, but you can find material via a web search, YouTube or the documentation if you are interested in making your own.

When creating a code base off https://github.com/zoldello/OpenFieldStudyCookieCutter, you will be asked a series of questions. You can either keep pressing the enter (return) key to accept the defaults or supply values yourself. The command to create the code base off the Cookiecutter is –

cookiecutter https://github.com/zoldello/OpenFieldStudyCookieCutter.git

You will see a list of options; they come up one at a time. After you answer them all, the final screen display is –

name [Phil]:
email [phil@example.com]:
twitter_handle [zoldello]:
github_username [zoldello]:
project_name [Session]:
lab [Tan Lab]:
project_slug [session]:
project_short_description [Python Boilerplate contains all the boilerplate you need to create a NWB and DataJoint Pipeline to process mouse location preference-experiments.]:
pypi_username [zoldello]:
version [0.1.0]:
Select open_source_license:
1 - MIT license
2 - BSD license
3 - ISC license
4 - Apache Software License 2.0
5 - GNU General Public License v3
6 - Not open source
Choose from 1, 2, 3, 4, 5, 6 [1]:
database_host [tutorial-db.datajoint.io]:
database_user [marikelreimer]:
database_password [AreRA3c5yc]:
subject_id [Mouse_5025]:
session_id [Session_22]:
experiment_path [C:/Users/meowm/OneDrive/DataWarehouse/Experimenter1]:

The code is stored in an autogenerated folder named Session. Enter it –

cd Session

Look at the content –

ls

You should see four files –

  • Chapter2WorkingWithNWBAndDataJointData.ipynb
  • ImportsAndTableDefinitions.py
  • notes.txt
  • requirements.txt

Pull down dependencies with –

pip install -r requirements.txt

While it is best to stick with conda when using a conda virtual environment (conda install --file requirements.txt), some of the packages could not be installed with conda, so I reverted to using pip.

Examine the code base. The Cookiecutter did things like –

  • Populated notes.txt with the options you selected.
  • Added your credentials to Chapter2WorkingWithNWBAndDataJointData.ipynb.
  • Added your credentials to ImportsAndTableDefinitions.py (sketched below).
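To see how this works under the hood, here is a hypothetical excerpt of what a templated credentials section might look like before generation; the exact contents of the real template may differ, and dj.config is DataJoint’s standard configuration object –

import datajoint as dj

# Placeholders below are filled in from your answers to the prompts
dj.config['database.host'] = '{{ cookiecutter.database_host }}'
dj.config['database.user'] = '{{ cookiecutter.database_user }}'
dj.config['database.password'] = '{{ cookiecutter.database_password }}'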

Anyone can get the code, answer the questions as suits them, and be up and running very quickly.

There are a variety of Cookiecutters freely available online. You can search on Github – https://github.com/search?q=cookiecutter&type=Repositories – albeit most are not directly geared toward Neuroscience. If you create one you feel others will value, you can post it on PyPI.

Conclusion
This post illustrates the merits of using Cookiecutter. As Neuroscience and Genetics research moves ever deeper into heavy Software Development usage, a lot of practices honed in professional Software Development can be borrowed to streamline scientific research. If you find the Cookiecutter technique useful, you can adopt it in your own experiment/session code base. Otherwise, it is good to be familiar with it in case it ever becomes useful in the future for you or a colleague.

Notes

  • The associated code – https://github.com/zoldello/OpenFieldStudyCookieCutter – is meant for illustrative purposes. It is not production ready, nor intended to be. You have to understand the principle and apply it to your system if you so desire. Please, don’t blindly copy-and-paste it and hope for a miracle. Science is not a religion.
  • If you are wary of using plain-text passwords, you can store them in environment variables or in .bashrc and read them off there, or set up a public/private key system (see the sketch after this list). This blog post’s goal is to illustrate Cookiecutter, not to dwell on computer security principles.
  • I am not affiliated with the Tan lab, nor a contributor to Reimer et al. Any errors or omissions in this post are entirely my own – @zoldello
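As promised above, a minimal sketch of the environment-variable approach; the variable name DJ_PASS is made up for this example –

import os

# Read the password from an environment variable instead of hard-coding it.
# Set it beforehand, e.g. in .bashrc:  export DJ_PASS='your-password'
database_password = os.environ.get('DJ_PASS')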

Homework

  • Try to dockerize the setup illustrated in this blog post.
  • Try adding an underscore to the project name option. The generation fails because of a hook that validates the project name. You can explore hooks more.


Review of Reimer et al, which shows how to use Neurodata without Borders and DataJoint in Practice


Introduction

This Blog post is a first-pass review of-

Reimer M, B Lakshmi, Waxman S, Tan A. Core principles for the implementation of the neurodata without borders data standards. Journal of Neuroscience Methods. Volume 348, 15 January 2021, 108972

The paper will henceforth be called Reimer et al. It shows how to use Neurodata without Borders (NWB) and DataJoint in practice. Software analysis tools and standardization are becoming indispensable in Neuroscience/Neuroengineering research. The paper provides a guide and practical examples on how to use NWB with DataJoint in your research. This is a foundational paper for me, and I am grateful that the authors published it and released the source code.

This blog post presents notes I took during a quick read of Reimer et al. It is just a first-pass review; I will post a more detailed analysis in the future. This blog post is a critique of the paper and NOT a criticism nor a mockery. When I study papers, I look for things that I could do differently in order to build upon the works of others. I took the time to write this blog post because I find the work of value.

Value-Add

The paper starts off introducing NWB. NWB is a data standard for neurophysiological data formatting and structure, made to enable better data sharing, reuse and analysis between neuroscientists and laboratories. Currently there is a huge reproducibility and understandability problem in Neuroscience: experiments are hard to reproduce, and data formats are hard to understand for people outside the laboratory that generated the data. NWB was made to reduce these problems. Reimer et al provides validation of NWB with practical code examples.

DataJoint is an Object Relational Mapper (see https://en.wikipedia.org/wiki/Object–relational_mapping for a theoretical discussion) that provides a query language on top of MySQL (or MariaDB). It removes the need to write SQL to access data in a relational database, providing a query language in Python that can be used instead. This paper shows how DataJoint can be used with NWB. I will discuss database design and DataJoint in a future post.
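To give a flavor of what that looks like, here is a minimal sketch of defining and querying a DataJoint table; the schema name and fields are invented for illustration –

import datajoint as dj

schema = dj.schema('tutorial')  # hypothetical schema name

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id: int          # unique animal identifier
    ---
    dob: date              # date of birth
    """

# Insert a row and query it back without writing any SQL
Mouse.insert1({'mouse_id': 5025, 'dob': '2020-01-01'})
print(Mouse & 'mouse_id = 5025')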

Supporting Concepts and Companies

Big Data, to me, involves working with petabytes. It sometimes has needs relational databases don’t meet, which is why Big Data frequently uses NoSQL databases. MySQL, which powers DataJoint, is a relational database. I feel Neuroscience is currently a Data Science problem, true, but not yet commonly a Big Data problem, as Reimer et al implies.

DataJoint is developed by DataJoint Neuro. Vatheless LLC is the old name of the company.

Jupyter Notebook is being replaced by JupyterLab, so I would have pushed to use JupyterLab if I were one of the authors.

Experimental Setup

The experimental setup involves tracking a mouse’s location. Figure 1 of the paper shows a Sprague-Dawley rat even though the experimental setup involves a mouse; I believe that was a typo. Figure 2 shows an Entity Relationship model illustrating how the tables in the database are related. In DataJoint lingo, an Entity Relationship model is called a DataJoint Pipeline.

Codebase

The codebase is divided into four parts. The first creates the DataJoint Pipelines (essentially, it creates the database). The second shows how NWB works with DataJoint. The third shows different means of feeding the DataJoint Pipeline with data. The fourth shows analysis work.

I feel it is bad practice to store codebases meant for papers in a personal Github account. I prefer they be stored in a laboratory, department or institute Github account. This ensures permanence: a user can delete or change a codebase at will, but a centralized account would be more diligent with deletions and changes. I also feel codebases that are part of papers should be tagged. They should be treated the way codebases are treated in large open source projects like React.js; React.js code is stored in Facebook’s corporate account along with other projects. Disclaimer: I don’t feel the same rigor should be applied to codebases that are part of blog posts, as blog post code is not code reviewed.

I may do a code review in a future blog post. For now: I am a fan of using f-strings, and they would have made things easier. The authors could have used a Cookiecutter to set up the codebase. It would also have been better if the authors had Dockerized the codebase, so users are freed from having to download dependencies manually; since that was not done, it is good practice to provide users links in the README on where to get the dependencies. In database design, it is best to use UTC times so that there is no confusion about timezones. It is best to automate creation of the DataJoint Pipeline so users can focus on experiments. It is a good idea to use a code analysis tool to automate checking of code standards; I personally like Pylint. It is also good to have unit tests to help verify the code works properly; a tool I like is PyTest.
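To illustrate the f-string point, a small sketch; the variable names echo the Cookiecutter prompts from my earlier post and are only illustrative –

subject_id = 'Mouse_5025'
session_id = 'Session_22'

# f-strings (Python 3.6+) interpolate variables directly into the string
nwb_filename = f"{subject_id}_{session_id}.nwb"
print(nwb_filename)  # Mouse_5025_Session_22.nwb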

Data sources

You can get NWB data from the Open Neurophysiology Environment (ONE), developed by the International Brain Laboratory (IBL), or from the Distributed Archives for Neurophysiology Data Integration (DANDI). These resources, although in early stages, are a good start. I foresee a world where Neuroscientists first check whether there is data in a repository like DANDI, rather than going straight to animal experiments. While I feel animal research is essential for progress for all living things, I strongly feel it should be done as a last resort due to the sacrifice it requires. Sharing data can help relieve the need for some animal sacrifice.

Downside of NWB

No tool lacks weaknesses. NWB supports Python and Matlab only; if your laboratory uses R, C or C++, you need to investigate a third-party tool (I do not know what to say if you still use Fortran or an esoteric language). NWB does not force users to “fall into the pit of success” – it does not enforce a specific data structure or best practice. I personally prefer this, but some do not, so it has to be duly noted: bad design can make it harder to answer scientific questions, and users have to follow good NWB design practices. NWB is essentially a wrapper around HDF5, and failure to recognize this can lead to poor design choices. DataJoint requires users to have an idea of good database design; a bad database design can make answering scientific questions harder.

Summary

Overall, I enjoyed reading Reimer et al. It is a good starting point for learning how to use NWB and DataJoint in the real world. I intend to read it again and do a more detailed study of the code and the paper. I am thankful to Reimer et al for releasing this work and making the code freely available.

Reference

This Blog post is based off-

Reimer M, B Lakshmi, Waxman S, Tan A. Core principles for the implementation of the neurodata without borders data standards. Journal of Neuroscience Methods. Volume 348, 15 January 2021, 108972

Getting Started With Docker


Introduction

This Blog post is a very basic introduction to Docker. It is intended for researchers new to Docker but with some programming experience; you should be familiar with Python and Git. Of course, non-researchers can benefit from it too. There are other resources online that offer more advanced information once you finish going through this Blog post.

What is Docker?

Docker is a platform for running, building and shipping applications in a manner that frees researchers (and professional developers) from needing to be concerned with application configuration and setup. So, if code is made in one laboratory, another laboratory can use the code without worrying about software versions and other mundane issues unrelated to using the application. It is a way of sharing not just the code, but also the instructions that tell a computer what to install to run the code, and the means of running it.

Docker is essentially LIKE a process that mimics an operating system. It is not a Virtual Machine: Docker does not have hardware dedicated to it as a Virtual Machine does. Again, Docker is a specialized process. A Docker container can be started in seconds and does not require hardware configuration, which can take far longer.

You can find a glossary of key Docker terms at- https://docs.microsoft.com/en-us/dotnet/architecture/microservices/container-docker-introduction/docker-terminology

Relevance of Docker

Pulling code off Github without Docker or a similar system requires you to have the correct version of the operating system and dependencies and to know the right commands. Docker sets up the dependencies so other researchers don’t have to be concerned with setup. It saves a lot of time, and it will resonate with anyone who has experienced software not working on their system right when they needed it.

Dockerizing an Application

I will start off by explaining a very simple application and introducing Docker terms along the way. The application is a Python application that prints a single line of text. Here is the source code- https://github.com/zoldello/DockerHelloWorldWordpress. The README page has instructions on how to run it. In summary, clone the code locally-

git clone https://github.com/zoldello/DockerHelloWorldWordpress.git

You can open the code with any text editor. I personally use Visual Studio Code with an extension that makes the chrome green. After you download the code, run it with-

python app.py

The code should print-

Possible Nucleotides for a DNA `Sequence: 'A', 'C', 'T', 'G'

It is just that simple.

The code is Dockerized because of the presence of a Dockerfile at the root. A Dockerfile contains instructions on how to construct an Image. Image is Docker lingo for a blueprint of what Docker should do to set up the environment and application. Images can hold a variety of information, like what operating system to run, what third-party libraries to pull down, environment variables and much more. Images are used to build containers. Containers are like processes, but for hosting code.

The Dockerfile in the sample code contains

FROM python:alpine
COPY . /app
WORKDIR /app
CMD python app.py

The line FROM python:alpine tells Docker to go to Docker Hub (a repository for pre-built images) and pull down Python running on Alpine. Alpine is a lightweight Linux distribution. We will discuss pulling down images from Docker Hub later in this Blog post.

The line COPY . /app tells Docker to copy the content of the application into a Docker container folder called app that is created on the fly.

The command WORKDIR /app tells Docker to make the newly created app folder the active folder.

The line CMD python app.py runs the code.

Building an Image

A Docker image is a blueprint that describes how to set up an application. You tag an image with a name while building it. The command format for building an image is:

sudo docker build -t <image_name> .

This tells Docker to build an image named <image_name> (this can be whatever name you want, as long as the text is all lower case). The code for the image is obtained from the current directory (‘.’). You can get the code from a different directory if you specify it.

Let’s build an image called “dna-nucleotides” (remember, the image name has to be in lower case). To create the image, run:

sudo docker build -t dna-nucleotides .

Was the image built successfully? Check by running:

sudo docker image ls

You should see your image under the REPOSITORY column. The IMAGE ID column is also important.

The image itself is just a blueprint; it has to be instantiated, and the instantiated image is called a container. You create a container by running-

sudo docker run dna-nucleotides

This can be run from any location on your local machine. You should see this output-

Possible Nucleotides for a DNA `Sequence: 'A', 'C', 'T', 'G'

To see for yourself the container was created run-

sudo docker container ls -a

You should see the container you made, with your image name in one of the columns. To keep this tutorial simple, please create only one container from the image.

Clean up

Let’s say you are done using the container. Now, how do you delete it? Again, run the command to see a listing of containers-

sudo docker container ls -a

Identify the container you want to delete; the CREATED and IMAGE columns can help you determine which it is. When you identify the desired row, take note of the CONTAINER ID. You first stop the container, then you delete it. To stop the container, run-

sudo docker stop <container_id>

where <container_id> is the container id of the container you want deleted.

Now, delete the container by running-

sudo docker rm <container_id>

You have deleted the container. If you created multiple containers off the image (NOT ALL your containers, just the ones related to this Blog post), I recommend you delete them too, because an image with active containers cannot be deleted without force-deleting, an issue beyond the scope of this tutorial.

The container is deleted. Let’s say now you want to delete the image. To see a list of images run-

sudo docker image ls -a

Take note of the IMAGE ID of the image you want to delete. To delete the image run-

sudo docker rmi <image_id>

Using a Third-Party Docker Image

There are many images that others have created and that are freely available to people who have signed up for a Docker account. They are hosted on Docker Hub.

To experiment with Docker Hub, let’s say we want to pull down an image of Debian (a popular distribution of Linux) and be able to run commands on the console. All you have to do is pull down the image and run it interactively. Let’s see how this is done.

First, you need to sign up for a Docker Hub account. After you do that, search for Debian- https://hub.docker.com/search?q=debian&type=image. Let’s use the official image- https://hub.docker.com/_/debian. On the page, on the far right, you will see the pull instruction. Duplicated here for convenience, it is-

sudo docker pull debian

To verify we pulled down the image, run-

sudo docker image ls -a

You should see a REPOSITORY named debian.

Remember, this is an image– a blueprint. You need to create a container off the image– an implementation/instantiation. In addition, you want it to be interactive and to run continuously; the Dockerized code we pulled off Github ran just once. The -it flag below makes the container interactive by keeping input open and attaching a terminal. To run Debian interactively, run this command-

sudo docker run -it debian

You should see something like-

root@b7ade9958799:/#

NOTE: yours will be slightly different but will follow the same structure. The root part tells you that you are the root user, and / that you are in the root directory; the id after the @, here b7ade9958799, identifies the container.

Let’s experiment with the Debian container. Let’s install python, create a very simple python sample code and run it.

First of all, update the package lists, as per Linux best practice-

apt update

Now, install python

apt install python

Create a file with the simple python code

echo "print('Hello, World')" > app.py

Do not worry about fully understanding what this does. At a high level, echo displays the text that follows it, and “>” redirects the output to a file, in this case called app.py. Basically, this creates a file with text in it without using a text editor. Again, just know it is a way to write to a file and do not worry about the Linux commands themselves.

To verify we have text in app.py, run

more app.py

This displays the content of a file. If you see nothing, you did something wrong. If you are a die-hard Linux user, the less command is a replacement for more, but the Debian image used in this demonstration did not ship with less, and it is outside our scope to install it.

Now, let’s run the python code-

python app.py

Done. We installed a pre-made image of the Debian operating system off Docker Hub, made a container, installed Python, wrote Python code and ran it. Great.

To kill (yes, that is the technical term) the Debian container, run the command-

exit

Or, use Ctrl-D (hold Ctrl key down and press ‘D’ key [hold both keys down simultaneously]).

To see the container run-

sudo docker container ls -a

To delete the container, find the Debian one, take note of the CONTAINER ID and run-

sudo docker rm <container_id>

To delete the image, display the list of images and take note of the Debian IMAGE ID

sudo docker image ls -a

Once you identify the IMAGE ID, delete the image-

sudo docker rmi <image_id>

Useful Docker Commands

Here is a summary of useful Docker commands

  • sudo docker stop <container_id> # stop container
  • sudo docker container ls -a # list all containers
  • sudo docker rm <container_id> # delete a container
  • sudo docker image ls -a # list all images
  • sudo docker rmi <image_id> # delete an image
  • sudo docker system prune -a #delete everything related to docker

Summary

This blog post introduces you to Docker and very basic commands. You can find the complete code at- https://github.com/zoldello/DockerHelloWorldWordpress. The README provides a summary of the steps to build a Docker container.

This blog post is a starting point, and I recommend you explore other sources: study sample code and sample Dockerfiles, read the official Docker documentation, watch some of the many free videos on YouTube, and search the web for anything you do not understand.

I also recommend you play with the code provided and try new things. For example, on Debian, rather than using echo to update a file, install a text editor like nano and create a Python file with it.

Neuroscience and Genomics now require programming skills, and the need to program will only grow in the future. I hope this Blog post helps you navigate the paradigm shift.

Reference

This blog post was based on Programming with Mosh‘s Docker YouTube video. I made alterations to try out new things myself, and so that this Blog Post is not a mere replication of that work.

Revamping this Blog


I am revamping this blog and will try to post more frequently (no promises though). This blog is a collection of references and notes I make for myself. It serves as a quick reminder and reference on how to perform certain techniques, and writing it also serves as a memory aid. Of course, everyone else is free to benefit from this material, albeit I make no guarantees of any kind.

Currently, my interest is in the intersection of Neuroscience, Genomics and Data Science. Most of the topics will be related to these three areas. I have plans to post information on Docker, Neurodata without Borders, DataJoint, Conda, advances in Neuroscience, literature reviews, Python tips and tricks, the scientific process and a host of other areas that reflect my interests.

I am pumped, primed and ready, let’s go…

Quickly Setting up a Python Virtual Environment in Ubuntu


Notes on using Python’s virtualenv

One-time work
Installing virtualenv: sudo pip install virtualenv

Getting Up and Going (Most important)
Creating virtual environment: virtualenv <project_name>
Activating: source <project_name>/bin/activate
Deactivate: deactivate
Documenting Dependencies: pip freeze --local > requirements.txt
Destroy virtualenv: rm -rf <project_name>

Misc.
Packages in virtualenv: pip list
Determining if virtualenv is used: which python
Python Path (DO NOT USE IN VIRTUAL ENVIRONMENT): which python
Specific Python-version virtualenv: virtualenv -p <python_version_path> <project_name>
Pulling dependencies from requirements.txt: pip install -r requirements.txt

Reference

Docker as an Alternative


Common Github Markdown and Hidden Files


The power of Github lies not merely in using it as source control. Rather, the power lies in using it as source control bundled with a means of social media. Different files, features and the very nature of git are designed to be a social, communication tool that conveys information to developers and scientists (they use git too). For example, tagging lets people know you have releases. Seeing a branch communicates to others that you are working on something and do not want to affect a major branch (like fixing a bug or adding a new feature).

In looking at numerous git projects, I have seen a number of files that are hidden (beginning with a period) or are markdown (ending with “.md”) that occur repeatedly and are not meant for third-party dependencies (like .babelrc for babel.js). I do not believe there is a standard. Nevertheless, they tend to hold similar information across different projects. Here are the ones I frequently see:

.gitignore: Files you do not want git to pull into source control.
You can find a pre-made file for your current project off this site: https://www.gitignore.io
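For illustration, a minimal .gitignore for a Python project might contain entries like these (the exact list depends on your project) –

__pycache__/
*.pyc
venv/
.env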

.editorconfig: Instructs text editors that support it on how to format code. It is best to have this to prevent a colossal geek-battle between the guy who uses 2-spaces and the guy who uses tabs – this file is the definitive answer. A tutorial and sample file can be found at- http://editorconfig.org
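A small sketch of what such a file can look like (the settings here are invented for illustration) –

root = true

[*]
indent_style = space
indent_size = 2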

README.md: Information users see at the bottom of a git project page. It informs them of the project’s purpose, and it is one of the first things they read despite being at the bottom of the page. I have seen this vary from a basic introduction to full-blown documentation.

CONTRIBUTING.md: Provides guidelines everyone contributing to the code needs to follow (primarily developers).

COMMITTER.md: Provides guidelines to anyone with admin power, like people who can merge pull requests into the Master branch. This might not be a consideration in a small project, but it is something to consider in big codebases with core contributors scattered across different time zones.

ChangeLog.md: A listing of what changed in each release.

DEVELOPER.md: Instructs potential contributors on how to get the codebase running locally.

Author.txt: Listing of contributors

LICENSE.md or License.txt or License: This expresses to consumers of the code how the creator wants it to be used. THIS IS IMPORTANT! It is vital for code creators to write a license and for code consumers to read it. If you can remember only one thing from this blog, remember that the license file should be given serious consideration. Some licenses basically say you can use the code but the creator is not liable for damages (MIT). Some other licenses say that if you use the code, you must open source the code you use it with – which may be your company’s $5 million codebase (like GPL). Others say you can use the code for free if you do not modify it or if you contribute modifications back to open source; otherwise you must pay the creators (I know MySQL used to have this). Again, read the license. Ask the creator if you do not see a license.

You can do a web search and find information on licenses. One I like is wikipedia’s comparison- https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses

This list does not cover them all. There are others, like .gitattributes (used for configuring individual folders) and .mailmap (an e-mail listing of contributors). Some third-party dependencies use their own, like npm’s .npmignore. You can also create one yourself; I personally like to use .Acknowledgement to recognize any people, codebases or sites that helped me out a lot while working on a project in Github. Some files do not end with .md and are not hidden but still seem important, like PATENTS in react.js.

Creating a tag in git


This blog post shows you how to create an annotated tag in git, using Github. I assume you already have git installed and have at least a working knowledge of the basics.

Steps
Create the tag
git tag -a v1.0.0 -m 'some message'

This is how you push tags to the server (a regular push does not move tags over)
git push --tags
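Alternatively, you can push a single tag by name (assuming your remote is named origin) –
git push origin v1.0.0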

Miscellaneous
To view all tags:
git tag -l

To delete a tag (use the view option above to get the tag name)
git tag -d v1.0.0

ChicagoVeg Website’s Release #2 now in Prod


The ChicagoVeg site – http://www.chicagoveg.com – has been updated with Release #2. The fixes include the following:

  • Added new carousel images
  • Images now on most pages
  • Made Facebook page plugin responsive
  • Updated text

To see a list of issues addressed, see:
https://github.com/ChicagoVeg/ChicagoVegSite/issues?utf8=✓&q=%20milestone%3A%22Release%202%22%20

ChicagoVeg is the largest Chicagoland community for vegetarians, vegans, and raw foodists to socialize and meet like-minded people. We organize a range of fun activities and educational events in the city and the suburbs. Whether you are vegan or not vegan, we welcome everyone who is interested in a vegan/vegetarian/raw lifestyle for any reason – ethical, environmental, health, or spiritual. We support a nonjudgmental community where everyone can feel welcome to express their opinions, socialize, and have fun!

You can find meetings at: http://www.meetup.com/ChicagoVeg/