Review of Reimer et al, which shows how to use Neurodata Without Borders and DataJoint in practice


Introduction

This Blog post is a first-pass review of-

Reimer M, B Lakshmi, Waxman S, Tan A. Core principles for the implementation of the neurodata without borders data standards. Journal of Neuroscience Methods. Volume 348, 15 January 2021, 108972

The paper will henceforth be called Reimer et al. It shows how to use Neurodata Without Borders (NWB) and DataJoint in practice. Software analysis tools and standardization are becoming indispensable in Neuroscience/Neuroengineering research. The paper provides a guide and practical examples of how to use NWB together with DataJoint in your research. This work is a foundational paper for me and I am grateful that the authors published it and released the source code.

This blog post presents notes I took during a quick read of Reimer et al. It is just a first-pass review; I will post a more detailed analysis in the future. This blog post is a critique of the paper and NOT a criticism nor a mockery. When I study papers, I look for things that I could do differently in order to build upon the work of others. I took the time to write this blog post because I find the work valuable.

Value-Add

The paper starts off introducing NWB. NWB is a data standard for formatting and structuring neurophysiological data to enable better data sharing, reuse, and analysis between neuroscientists and laboratories. Currently there is a huge reproducibility and understandability problem in Neuroscience: experiments are hard to reproduce and data formats are hard to understand for people outside the laboratory that generated the data. NWB was made to reduce these problems. Reimer et al provides validation of NWB with practical code examples.

DataJoint is an Object Relational Mapper (see https://en.wikipedia.org/wiki/Object–relational_mapping for a theoretical discussion) that provides a query language around MySQL (or MariaDB). It removes the need to write SQL to access data in a relational database and provides a query language in Python that can be used instead. This paper shows how DataJoint can be used with NWB. I will discuss database design and DataJoint in a future post.
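
To give a flavor of what that looks like, here is a minimal DataJoint sketch in Python. The schema name, table and connection settings are made up for illustration and are not taken from the paper:

import datajoint as dj

# assumed local MySQL settings; adjust to your own database
dj.config['database.host'] = '127.0.0.1'
dj.config['database.user'] = 'root'
dj.config['database.password'] = 'simple'

schema = dj.schema('tutorial_mouse')  # creates the schema if it does not exist

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int                 # unique lab-assigned id
    ---
    dob      : date                # date of birth
    sex      : enum('M', 'F', 'U') # sex of the mouse
    """

# insert a row and query it back without writing any SQL
Mouse().insert1({'mouse_id': 0, 'dob': '2021-01-15', 'sex': 'M'})
print(Mouse() & "sex = 'M'")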

Supporting Concepts and Companies

Big Data to me involves working with petabytes. It sometimes has needs that relational databases do not meet; that is why Big Data work frequently uses NoSQL databases. MySQL, which powers DataJoint, is a relational database. I feel neuroscience is currently a Data Science problem, true, but not yet commonly a Big Data problem as Reimer et al implies.

DataJoint is developed by DataJoint Neuro (the company was formerly called Vathes LLC).

Jupyter Notebook is being superseded by JupyterLab. So, I would have pushed to use JupyterLab if I were one of the authors.

Experimental Setup

The experimental setup involves tracking a mouse’s location. Figure 1 of the paper shows a Sprague Dawley rat even though the experimental setup involves a mouse; I believe that was a typo. Figure 2 shows an Entity Relationship model illustrating how the tables in the database are related. In DataJoint lingo, an Entity Relationship model is called a DataJoint pipeline.

Codebase

The codebase is divided into four parts. The first part creates DataJoint pipelines (essentially, it creates the database). The second shows how NWB works with DataJoint. The third shows different ways to feed the DataJoint pipeline with data. The fourth shows analysis work.

I feel it is bad practice to store the codebase for a paper in a personal GitHub account. I prefer it be stored in a laboratory, department or institute GitHub account. This helps ensure permanence, since a user can delete or change a personal repository at will, whereas a centralized account would be more diligent with deletions and changes. I also feel codebases that are part of papers should be tagged. Codebases in papers should be treated the way codebases are treated in large open-source projects like React.js. React.js code is stored in Facebook’s corporate account along with other projects. Disclaimer: I don’t feel the same rigor should be applied to codebases that are part of blog posts, as blog-post code is not code reviewed.

I may do a code review in a future blog post. For now: I am a fan of using f-strings, and they would have made things easier. The authors could have used a cookiecutter to set up the codebase. It would have been better if the authors had Dockerized the codebase so users would be freed from having to download dependencies manually; since that was not done, it is good practice to provide users links in the README on where to get the dependencies. In database design, it is best to use UTC times so that there is no confusion about time zones. It is best to automate creation of the DataJoint pipeline so users can focus on experiments. It is a good idea to use a code analysis tool to automate checking of code standards; I personally like pylint. It is also good to have unit tests to help verify the code works properly; a tool I like is pytest.
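
As a small illustration of the f-string and UTC points (the variable names here are made up for this sketch, not taken from the paper's code):

from datetime import datetime, timezone

session_id = 42                           # made-up session identifier
recorded_at = datetime.now(timezone.utc)  # store timestamps in UTC to avoid time-zone confusion

# f-strings keep string building readable
print(f"session_{session_id} recorded at {recorded_at.isoformat()}")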

Data sources

You can get NWB data from the International Brain Laboratory (IBL), which developed the Open Neurophysiology Environment (ONE), or from the Distributed Archives for Neurophysiology Data Integration (DANDI). These resources, although in early stages, are a good start. I foresee a world where neuroscientists first check whether data already exists in a repository like DANDI, rather than going straight to animal experiments. While I feel animal research is essential for progress for all living things, I strongly feel it should be done as a last resort due to the sacrifice it requires. Sharing data can help relieve the need for some animal sacrifice.

Downside of NWB

No tool lacks weaknesses. NWB supports Python and MATLAB only. If your laboratory uses R, C or C++, you need to investigate a third-party tool. I do not know what to say if you still use Fortran or an esoteric language. NWB does not push users to “fall into the pit of success”: it does not enforce a specific data structure or best practices. I personally prefer this flexibility but some do not, therefore it has to be duly noted. Bad design can make it harder to answer scientific questions, so users have to follow good NWB design practices. NWB is essentially a wrapper around HDF5, and failure to recognize this can lead to poor design choices. DataJoint requires users to have an idea of good database design; a bad database design can make answering scientific questions harder.
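
To make the HDF5 point concrete, here is a minimal PyNWB sketch that writes an empty NWB file; the file name and metadata are placeholders for illustration. The resulting .nwb file is an ordinary HDF5 file under the hood:

from datetime import datetime, timezone
from pynwb import NWBFile, NWBHDF5IO

# minimal required metadata (values are made up)
nwbfile = NWBFile(
    session_description="example session",
    identifier="session_001",
    session_start_time=datetime.now(timezone.utc),
)

# write the file; NWB stores it as HDF5
with NWBHDF5IO("session_001.nwb", mode="w") as io:
    io.write(nwbfile)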

Summary

Overall I enjoyed reading Reimer et al. It is a good starting point to learn how to use NWB and DataJoint in the real world. I intend to read it again and do a more detailed study of the code and the paper. I am thankful to Reimer et al for releasing this work and making the code freely available.

Reference

This blog post is based on-

Reimer M, B Lakshmi, Waxman S, Tan A. Core principles for the implementation of the neurodata without borders data standards. Journal of Neuroscience Methods. Volume 348, 15 January 2021, 108972

Getting Started With Docker


Introduction

This blog post is a very basic introduction to Docker. It is intended for researchers new to Docker but with some programming experience; you should be familiar with Python and Git. Of course, non-researchers can benefit from it too. There are other resources you can find online that give you more advanced information after you finish going through this blog post.

What is Docker?

Docker is a platform for running, building and shipping applications in a manner that frees researchers (and professional developers) from needing to be concerned with application configuration and setup. So, if code is made in one laboratory, another laboratory can use the code and not worry about software versions and other mundane issues unrelated to using the application. It is a way of not just sharing code, but also sharing the instructions that tell a computer what to install and how to run the code.

Docker is essentially LIKE a process that mimics an operating system. It is not a Virtual Machine, in that Docker does not have hardware dedicated to it as a Virtual Machine does. Again, Docker is a specialized process. A Docker container can be started up in seconds and does not require hardware configuration, which can take far longer.

You can find a glossary of key Docker terms at- https://docs.microsoft.com/en-us/dotnet/architecture/microservices/container-docker-introduction/docker-terminology

Relevance of Docker

Pulling code off GitHub without Docker or a similar system requires you to have the correct version of the operating system and dependencies and to know the right commands. Docker is a tool that sets up the dependencies so other researchers don’t have to be concerned with setup. It saves a lot of time, and it will resonate with anyone who has received a crucial application only to find the software does not work on their system.

Dockerizing an Application

I will start off by explaining a very simple application and introducing Docker terms along the way. The application is a Python application that prints a single line of text. Here is the source code- https://github.com/zoldello/DockerHelloWorldWordpress. The README page has instructions on how to run it. In summary, clone the code locally-

git clone https://github.com/zoldello/DockerHelloWorldWordpress.git

You can open the code with any text editor. I personally use Visual Studio Code with an extension that makes the chrome green. After you download the code, run it with-

python app.py

The code should print-

Possible Nucleotides for a DNA `Sequence: 'A', 'C', 'T', 'G'

It is just that simple.
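
For reference, a minimal sketch of what app.py might look like, assuming it does nothing more than print that line (see the repository for the actual source):

# app.py - hypothetical sketch of the sample application
print("Possible Nucleotides for a DNA Sequence: 'A', 'C', 'T', 'G'")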

The code is Dockerized because of the presence of a Dockerfile at the root. The Dockerfile contains instructions on how to construct an image. An image is Docker lingo for a blueprint of what Docker should do to set up the environment and application. Images can hold a variety of information, like what operating system to run, what third-party libraries to pull down, environment variables and much more. Images are used to build containers. Containers are like processes, but for hosting code.

The Dockerfile in the sample code contains

FROM python:alpine
COPY . /app
WORKDIR /app
CMD python app.py

The line FROM python:alpine tells Docker to go to Docker Hub (a repository for pre-built images) and pull down Python running on Alpine. Alpine is a lightweight Linux distribution. We will discuss pulling images down from Docker Hub later on in this blog post.

The line COPY . /app tells Docker to copy the content of the application into a Docker container folder called app that is created on the fly.

The command WORKDIR /app tells Docker to make the newly created app folder the active folder.

The line CMD python app.py runs the code.

Building an Image

A Docker image is a blueprint that describes how to set up an application. You tag (name) an image while building it. The command format for building an image is:

sudo docker build -t <image_name> .

This tells Docker to build an image named <image_name> (whatever name you want, as long as the text is all lower case). The code for the image is obtained from the current directory (‘.’). You can use a different directory if you specify it.

Let’s build an image called “dna-nucleotides” (image names must be all lower case). To create the image, run this from the repository’s root folder:

sudo docker build -t dna-nucleotides .

Was the image built successfully? Check by running:

sudo docker image ls

You should see your image under the column- REPOSITORY. The column- IMAGE ID is also important.

The image itself is a blueprint. The image has to be instantiated. The instantiated image is called a container. You create a container by running-

sudo docker run dna-nucleotides

This can be run from any location on your local machine. You should see this output-

Possible Nucleotides for a DNA `Sequence: 'A', 'C', 'T', 'G'

To see for yourself the container was created run-

sudo docker container ls -a

You should see the container you made with your image name as one of the columns. To keep this tutorial simple, please, create only one container from the image.

Clean up

Let’s say you are done using the container. Now, how do you delete it? Again, run the command to see a listing of containers-

sudo docker container ls -a

Identify the container you want to delete. You can use the CREATED and IMAGE columns to help you decide. When you identify the desired row, take note of the CONTAINER ID. You first stop the container, then you delete it. To stop the container run-

sudo docker stop <container_id>

where <container_id> is the container id of the container you want deleted.

Now, delete the container by running-

sudo docker rm <container_id>

You have deleted the container. If you created multiple containers off the practice image (NOT ALL your containers, just the ones related to this blog post), I recommend you delete them too, because an image cannot be deleted while containers still reference it without force-deleting, which is beyond the scope of this tutorial.

The container is deleted. Let’s say now you want to delete the image. To see a list of images run-

sudo docker image ls -a

Take note of the IMAGE ID of the image you want to delete. To delete the image run-

sudo docker rmi <image_id>

Using a Third-Party Docker Image

There are many images others have created and made freely available. They are hosted on Docker Hub.

To experiment with Docker Hub, let’s say we want to pull down an image of Debian (a popular distribution of Linux) and be able to run commands on the console. All you have to do is pull down the image and run it interactively. Let’s see how this is done.

First, sign up for a Docker Hub account. After you do that, search for Debian- https://hub.docker.com/search?q=debian&type=image. Let’s use the official image- https://hub.docker.com/_/debian. On the page, on the far right, you will see the pull instruction, duplicated here for convenience-

sudo docker pull debian

To verify we pulled down the image, run-

sudo docker image ls -a

You should see a REPOSITORY named debian.

Remember, this is an image, a blueprint. You need to create a container off the image, an instantiation. In addition, you want it to be interactive and to keep running; the Dockerized code we pulled off GitHub earlier ran just once. To run Debian interactively, run this command-

sudo docker run -it debian

You should see something like-

root@b7ade9958799:/#

Note: yours will be slightly different but will follow the same structure. The root part tells you that you are the root user, the /# part shows you are in the root directory, and b7ade9958799 is the container id.

Let’s experiment with the Debian container. Let’s install python, create a very simple python sample code and run it.

First of all, update the package lists, as per Linux best practice-

apt update

Now, install python

apt install python

Create a file with the simple python code

echo "print('Hello, World')" > app.py

Do not worry about fully understanding what this does. At a high level, echo displays the text that follows it, and “>” redirects that output into a file, in this case called app.py. Basically, this creates a file containing text without using a text editor. Again, just know it is a way to write to a file and do not worry about the Linux commands themselves.

To verify we have text in app.py, run

more app.py

This displays the content of a file. If you see nothing, you did something wrong. If you are a die-hard Linux user, the less command is an alternative to more. But the Debian image used in this demonstration does not ship with less, and installing it is outside the scope of this post.

Now, let’s run the python code-

python app.py

Done. We pulled a pre-made image of the Debian operating system off Docker Hub, made a container, installed Python, wrote Python code and ran it. Great.

To kill (yes, that is the technical term) the Debian container, run the command-

exit

Or use Ctrl-D (hold the Ctrl key down and press the ‘D’ key).

To see the container run-

sudo docker container ls -a

To delete the container, find the Debian one, take note of the CONTAINER ID and run-

sudo docker rm <container_id>

To delete the image, display the list of images and take note of the Debian IMAGE ID

sudo docker image ls -a

Once you identify the IMAGE ID, delete the image-

sudo docker rmi <image_id>

Useful Docker Commands

Here is a summary of useful Docker commands

  • sudo docker stop <container_id> # stop container
  • sudo docker container ls -a # list all containers
  • sudo docker rm <container_id> # delete a container
  • sudo docker image ls -a # list all images
  • sudo docker rmi <image_id> # delete an image
  • sudo docker system prune -a # delete all unused containers, images, networks and build cache

Summary

This blog post introduces you to Docker and very basic commands. You can find the complete code at- https://github.com/zoldello/DockerHelloWorldWordpress. The README provides a summary of the steps to build a Docker container.

This blog post is a starting point. I recommend you explore other sources, study sample code and Dockerfiles, and search for things you do not understand.

There is also the official Docker documentation, and there are lots of free videos on YouTube.

I also recommend you play with the code provided and try new things. For example, on Debian, rather than using echo to create a file, install a text editor like nano and create a Python file with the text editor.

Neuroscience and Genomics now require programming skills and the need to program will grow more in the future. I hope this Blog post helps you in your journey to navigate the paradigm shift.

Reference

This blog post was based on Programming with Mosh‘s Docker YouTube video. I made alterations to try out new things myself and so that this blog post is not merely replicating that work.

Revamping this Blog


I am revamping this blog and will try to post more frequently (no promises though). This blog is a collection of references and notes I make for myself. It serves as quick reminders and references on how to perform certain techniques. Writing it also serves as a memory aid. Of course, everyone else is free to benefit from this material, albeit I make no guarantees of any kind.

Currently, my interest is in the intersection of Neuroscience, Genomics and Data Science. Most of the topics will be related to these three areas. I have plans to post information on Docker, Neurodata Without Borders, DataJoint, Conda, advances in Neuroscience, literature reviews, Python tips and tricks, the scientific process and a host of other areas that reflect my interests.

I am pumped, primed and ready, let’s go…

Quickly Setting up a Python Virtual Environment in Ubuntu


Notes on using Python’s virtualenv

One-time work
Installing virtualenv: sudo pip install virtualenv

Getting Up and Going (Most important)
Creating virtual environment: virtualenv <project_name>
Activating: source <project_name>/bin/activate
Deactivate: deactivate
Documenting Dependencies: pip freeze --local > requirements.txt
Destroy virtualenv: rm -rf <project_name>

Misc.
Packages in virtualenv: pip list
Determining if virtualenv is used: which python
Python Path (DO NOT USE IN VIRTUAL ENVIRONMENT): which python
Specific Python-version virtualenv: virtualenv -p <python_version_path> <project_name>
Pulling dependencies from requirements.txt: pip install -r requirements.txt
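
Putting the key commands together, a minimal end-to-end session might look like this (myproject and numpy are placeholder names for this sketch):

virtualenv myproject
source myproject/bin/activate
pip install numpy
pip freeze --local > requirements.txt
deactivate
rm -rf myproject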

Reference

Docker as an Alternative

 

 

Common Github Markdown and Hidden Files


The power of GitHub lies not merely in using it as source control. Rather, the power lies in using it as source control bundled with a means of social media. Different files, features and the very nature of git are designed to be a communication tool that conveys information to developers and scientists (they use git too). For example, tagging lets people know you have releases. Seeing a branch communicates to others that you are working on something and do not want to affect a major branch (like fixing a bug or adding a new feature).

In looking at numerous git projects, I have seen a number of files that are hidden (beginning with a period) or are markdown (ending with “.md”) that occur repeatedly and are not meant for third-party dependencies (like .babelrc for babel.js). I do not believe there is a standard. Nevertheless, they tend to hold similar information across different projects. Here are the ones I frequently see:

.gitignore: Files you do not want git to pull into source control.
You can find a pre-made file for your current project off this site: https://www.gitignore.io

.editorconfig: Instructs text editors that support it on how to format code. It is best to have this to prevent a colossal geek-battle between the person who uses 2 spaces and the person who uses tabs; this file is the definitive answer. A tutorial and sample file can be found at- http://editorconfig.org. A minimal example is shown below.
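
For reference, a minimal .editorconfig might look like this (the 2-space choice is just an example):

root = true

[*]
charset = utf-8
end_of_line = lf
indent_style = space
indent_size = 2
insert_final_newline = true
trim_trailing_whitespace = true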

README.md: Information users see at the bottom of a GitHub project page. It informs them of the project’s purpose. It is one of the first things they see despite it being at the bottom of the page. I have seen this vary from a basic introduction to full-blown documentation.

CONTRIBUTING.md: Provides guidelines everyone contributing to the code needs to follow (primarily developers).

COMMITTER.md: Provides guidelines to anyone with admin powers, like people who can merge pull requests into the master branch. This might not be a consideration in a small project, but it is something to consider in big codebases with core contributors scattered across different time zones.

ChangeLog.md: A listing of the changes in every release.

DEVELOPER.md: Instructs potential contributors on how to get the codebase running locally.

Author.txt: Listing of contributors

LICENSE.md or License.txt or License: This expresses to consumers of the code how the creator wants it to be used. THIS IS IMPORTANT! It is vital for the code creator to write a license and for the code consumer to read it. If you can remember only one thing from this blog, remember that the license file should be given serious consideration. Some licenses basically say you can use the code but the creator is not liable for damages (MIT). Some other licenses say if you use this code, you must open source the code you use it with, which may be your company’s $5 million codebase (like GPL). Others say you can use the code for free if you do not modify it or if you contribute modifications back to open source; otherwise you must pay the creators (I know MySQL used to have this). Again, read the license. Ask the creator if you do not see a license.

You can do a web search and find information on licenses. One I like is wikipedia’s comparison- https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses

This list does not cover them all. There are others like .gitattributes (used for configuring individual folders) and .mailmap (an e-mail listing of contributors). Some third-party tools use their own, like npm’s .npmignore. You can also create one yourself; I personally like to use .Acknowledgement to recognize any people, codebases or sites that helped me out a lot while working on a project on GitHub. Some are neither markdown nor hidden files but still seem important, like PATENTS in react.js.

Creating a tag in git


This blog post shows you how to create an annotated tag in git, using GitHub. I assume you already have git installed and have at least a working knowledge of the basics.

Steps
Create the tag
git tag -a v1.0.0 -m 'some message'

This is how you push tags to the server (a regular push does not move tags over)
git push --tags

Miscellaneous
To view all tags:
git tag -l

To delete a tag (use the view option above to get the tag name )
git tag -d v1.0.0

ChicagoVeg Website’s Release #2 now in Prod


The ChicagoVeg site: http://www.chicagoveg.com has been updated with Release #2. The fixes include the following:

  • Added new carousel images
  • Images now on most pages
  • Made Facebook page plugin responsive
  • Updated text

To see a list of issues addressed, see:
https://github.com/ChicagoVeg/ChicagoVegSite/issues?utf8=✓&q=%20milestone%3A”Release%202″%20

ChicagoVeg is the largest Chicagoland community for vegetarians, vegans, and raw foodists to socialize and meet like-minded people. We organize a range of fun activities and educational events in the city and the suburbs. Whether you are vegan or not vegan, we welcome everyone who is interested in a vegan/vegetarian/raw lifestyle for any reason – ethical, environmental, health, or spiritual. We support a nonjudgmental community where everyone can feel welcome to express their opinions, socialize, and have fun!

You can find meetings at: http://www.meetup.com/ChicagoVeg/

 

My comments on Tyra Bank’s career


I watched a YouTube video on Tyra Banks (https://www.youtube.com/watch?v=FE057BWoLh4) and commented on it. My comments were inspired partially by words I read from Microsoft’s CEO Satya Nadella. I decided to rephrase it a bit and write it in a blog post. This way, I can remind myself of it periodically. Here it is:

Tyra Banks did some things right. What gives you success today probably will not tomorrow. Therefore, you need career redirections periodically. She did that. She left the catwalk on top and when, I believe, her career with it would soon be on the downswing. Many have redirected their careers. Michael Jackson (bubblegum kid singer to teen pop/disco singer to adult R&B/funk/pop singer to rock/pop/New Jack Swing singer), Michael Jordan (slasher/leaper/dunker to outside shooter who passed the ball a little bit more), Microsoft (PC software maker to cloud provider) and many others have rebranded and reinvented themselves.

You plan switches while still on top with the current direction. And you do little experiments to see what’s next. Then you slowly but surely switch to the next thing. It is not necessarily done in this order or way. Nevertheless, the point is: you do not want to be stuck without a wave to ride; you always need something big in your career to keep you in demand and on people’s minds and radar. I do not think she did this to a good degree. Kodak had nothing to fall back on when digital photography made its business model obsolete. Its top brass did not see beyond what was making money for them when they were big and profitable, and they were slow to change with changing consumer demand. Google currently has nothing to fall back on if its search business fails. Android, Google Maps and others of its big products do not directly bring in income; they just help increase Google Search usage (its primary money-maker). Gmail and Google Cloud, I believe, are not profitable enough to run a huge company off of. Alphabet can help, just not right now.

Sometimes, despite best efforts, things just do not work out for a while. Abraham Lincoln had a terrible life and was a straight-up failure before finally hitting it big and eventually becoming President and revered. John Travolta had a downturn prior to “Pulp Fiction” bringing him back up. Tyra now needs to find her “Pulp Fiction.” Plus, at some point, you have to use more wit and creativity and less beauty and other time-limited things to sell your goods. Beauty fades or gets stale to the eye, but wit and creativity are not limited by age or physicality.

 

 

Getting Up and Going with Elasticsearch


Introduction
This blog post explains how to get up and going with elasticsearch on Ubuntu. Although written for Ubuntu, many of the concepts are applicable to Windows. This post focuses on the fundamentals and the bare necessities needed to learn them. Therefore, while it will not get you ready for a production release, it will help prepare you for the next level. Also, we will be using simple tooling, to reduce the extra learning that may otherwise occur. The pros and cons and comparison with other search servers are beyond the scope of this blog post. There are other introductory blog posts and a PluralSight video link if you desire different elasticsearch material to complement this blog post; they are listed in the references. There is a minor lean towards Bioinformatics/Molecular Biology. However, knowledge of these fields is not required to understand this blog post.

Elasticsearch is a search server for running full-text search. It is a NoSQL, distributed database engine. Search queries are performed over an index rather than the actual documents. This makes it fast and reduces the need to store a duplicate copy of documents after indexing is complete. Lucene is used in the backend, akin to Solr.

Elasticsearch provides a RESTful interface for performing search tasks like querying and indexing. It can be used for searching blog content, analyzing logs and any other search needs. A growing use of it is in performing searches for pairwise matches in genomics. This is what got me interested in this technology.

It is recommended that you read the text AND try out the activities in this blog post.

 

Prerequisite Knowledge
There are some tools and standards you need a working knowledge of as prerequisites. The first two are “RESTful” API calls and “curl“. In addition to the hyperlinks on the terms, here is an additional link: http://code.tutsplus.com/tutorials/a-beginners-guide-to-http-and-rest—net-16340. You also need a basic understanding of JSON (http://www.secretgeek.net/json_3mins). You need not master them; just be aware of their definitions and have a big-picture understanding.

 

Prerequisite Tooling
As per best practices, make sure your system is updated. Run this command in the command line:

sudo apt-get update

Next, get Java if you do not already have it. Run this command to check whether or not its installed:

java -version

Skip this step if it’s installed. If not, run this command to install it (as of mid-2016):

sudo apt-get install openjdk-8-jre

 

Installation
There are many ways to install elasticsearch. One way is to use the wget tool; the other is to manually download and move the files to the right folder-

Option 1: Using command line

Run this command:

wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.3.1.tar.gz

Move the content to a folder you desire.

Note, the latest version at post time was 2.3.1. Adjust the above URL based on the latest version. The latest version number can be found at:

https://www.elastic.co/downloads/elasticsearch

Option 2: Manually downloading files

Another way to install is to download the relevant files and move them to an appropriate folder.

  1. Go to: https://www.elastic.co/downloads/elasticsearch
  2. Download the zip or tz.tar version to your local hard drive.
  3. Uncompress
  4. Move content to a folder you desire.

The names of the folders are self-explanatory. The config folder contains config files, the logs folder holds logging information, etc.

 

Configuration Adjustments
It is common to adjust the configuration a bit while learning elasticsearch, to make identifying things easier. Here are some changes:

  1. Open up the Elastic search folder
  2. Open up the config folder
  3. Look for a file called: “elasticsearch.yml” and open
  4. Look for “cluster.name”. Change it to a name you will recognize.
  5. Look for “node.name” and change it to a name you will recognize.
  6. Look for “bootstrap.mlockall”. Change it to “true”. This prevents memory swapping; the details are beyond the scope of this blog and something we need to eliminate from consideration for our intents and purposes.
  7. Look for network.host. Change it to “127.0.0.1”. This is your localhost.

These changes are useful for debugging (like node.name) or running queries (change network.host to localhost.)

 

Verify Elastic Search is Running
Open up the folder where Elasticsearch was installed.

Run this command in the command line:

./bin/elasticsearch

This will start up the elasticsearch service. You can kill the process anytime by using CTRL-C. However, leave it running for the sake of this tutorial, until completed.

Open up a new command prompt, while leaving the previous one running. In the new command line, type in this command:

curl -XGET 'http://localhost:9200'

If you see a JSON response containing a build hash and version information, Elasticsearch is running. If the connection is refused, the service is not running.

 

Big Picture Story
You are now ready to try out elasticsearch. The big-picture story of how it works: you get your data into JSON, you use RESTful calls to store the data in elasticsearch, and elasticsearch indexes it and makes it available for search (again via RESTful calls). I recommend you keep this working idea in mind as you learn to use elasticsearch.

 

Data to Store
Data is stored in elasticsearch as JSON. There is no restriction on what is a key and what is a value, as long as the data is JSON. For illustration purposes, we are going to examine a simple genetic-search problem involving DNA sequences.

For our purposes, a DNA string is a string containing a combination of the letters ‘A’, ‘C’, ‘G’ and ‘T’. A sequence is a string with biologically significant ordering (something measured).

Here is the made-up data we will be using for illustration purposes:

 

{
  "id": "1",
  "name": "Sample 1",
  "read_date": "2014-01-02",
  "sequence": "CAGCATACGAAAGAGCTTAAGGCACGCCAATTCGCACTGTCAGGGTCACTTGGGTGTTTTGCACTACCGTCAGGTACGCTAGTATGCGTTCTTCCTTCCA"
}

{
  "id": "2",
  "name": "Sample 2",
  "read_date": "2015-03-04",
  "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG"
}

{
  "id": "3",
  "name": "Sample 3",
  "read_date": "2016-04-05",
  "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
}

 

 

Preparation for Indexing
Note, "curl" does not handle newline characters well; the multi-line snippets are shown for illustration purposes. However, a copy of every code snippet without them is also provided, so you can run it on a command line.

Before indexing documents, you need to create an index with a mapping. Here is the structure elasticsearch needs:

curl -XPOST '<URL>/<index>' -d '<mapping JSON>'

A "mapping" is like a data schema: it tells elasticsearch the data fields and how to store the data (number, boolean, etc.). A "type" can be considered something like a relational database's table declaration: it tells elasticsearch what to expect. Properties are like a relational database's columns.

Here is a way of modeling the mapping for the DNA string used in this blog:

curl -XPOST "localhost:9200/dna" -d '
{
  "mappings": {
    "string": {
      "properties": {
        "id": { "type": "integer" },
        "name": { "type": "string" },
        "sequence": { "type": "string" },
        "read_date": { "type": "date" }
      }
    }
  }
}'

Here is the text with white space removed; more appropriate for the command line:

curl -XPOST "localhost:9200/dna" -d '{ "mappings": { "string": { "properties": { "id": { "type": "integer" }, "name": { "type": "string" }, "sequence": { "type": "string" }, "read_date": { "type": "date" } } } } }'

There are other data types like binary (used for storing content like blobs), date (used for storing dates) and boolean (used for storing boolean values)

Run this command to see the indices:

curl -XGET localhost:9200/_cat/indices

You should see the dna entry. Now to verify the appropriate mapping was applied, run:

curl -XGET localhost:9200/dna/_mappings

 

Indexing Data
A POST command (used for creating new content) is used for adding content to the search index. The  structure of the command is:

[Elasticsearch URL]/<index>/<type>/[<id>]

The id is optional; it will be autogenerated if not provided.

1st content to add

curl -XPOST "localhost:9200/dna/string" -d '
{
  "id": "1",
  "name": "Sample 1",
  "read_date": "2014-01-02",
  "sequence": "CAGCATACGAAAGAGCTTAAGGCACGCCAATTCGCACTGTCAGGGTCACTTGGGTGTTTTGCACTACCGTCAGGTACGCTAGTATGCGTTCTTCCTTCCA"
}'

Here is the entry with whitespace removed, more appropriate for the command line:

curl -XPOST "localhost:9200/dna/string" -d '{ "id": "1", "name": "Sample 1", "read_date": "2014-01-02", "sequence": "CAGCATACGAAAGAGCTTAAGGCACGCCAATTCGCACTGTCAGGGTCACTTGGGTGTTTTGCACTACCGTCAGGTACGCTAGTATGCGTTCTTCCTTCCA" }'

2nd content to add

curl -XPOST "localhost:9200/dna/string" -d '
{
  "id": "2",
  "name": "Sample 2",
  "read_date": "2015-03-04",
  "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG"
}'

Here is the entry with whitespace removed, more appropriate for the command line:

curl -XPOST "localhost:9200/dna/string" -d '{ "id": "2", "name": "Sample 2", "read_date": "2015-03-04", "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG" }'

3rd content to add

curl -XPOST "localhost:9200/dna/string" -d '
{
  "id": "3",
  "Name": "Sample 3",
  "read_date": "2016-04-05",
  "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
}'

Here is the entry with whitespace removed, more appropriate for the command line:

curl -XPOST "localhost:9200/dna/string" -d '{ "id": "3", "Name": "Sample 3", "read_date": "2016-04-05", "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA" }'

 

 

Basic Search
A Basic Search is used when you want to quickly search for data and do not need any add-ons. The structure of the Basic Search is:

curl -XGET <URL>/_search?q=<field>:<search term>

To illustrate, let’s say there is a need to find which DNA string contains the sequence: "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA".

curl -XGET localhost:9200/dna/_search?q=sequence:AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA

The result you get back (formatted) would be similar to:

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRjGaZcdfh2Ha9smfL",
        "_score": 0.30685282,
        "_source": {
          "id": "3",
          "Name": "Sample 3",
          "read_date": "2016-04-05",
          "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
        }
      }
    ]
  }
}

The value "_id" (in hits) will depend on your machine (note, we set "id" and not "_id"). The important keys in the result are "hits" and "_source". The "hits" tell you what matched, and "_source" gives you back the data.

You can also find a substring in the sequence. Let’s say you want all entries that contain the substring "AAAA"; you use the wildcard ‘*’ to tell elasticsearch to look for a substring.

Here is the query:

curl -XGET localhost:9200/dna/_search?q=sequence:*AAAA*

The result you see should be similar to:

{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRjGaZcdfh2Ha9smfL",
        "_score": 1,
        "_source": {
          "id": "3",
          "Name": "Sample 3",
          "read_date": "2016-04-05",
          "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
        }
      },
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRij7Tcdfh2Ha9smcT",
        "_score": 1,
        "_source": {
          "id": "2",
          "name": "Sample 2",
          "read_date": "2015-03-04",
          "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG"
        }
      }
    ]
  }
}

Query DSL
The Basic Search is fine for some cases, but other cases require more sophisticated techniques like filtering, changing the search algorithm (analyzer), searching for phrases and other options. Elasticsearch has a DSL (domain-specific language) that can help with this. This blog post will not go into detail on them all, but it will help you enough to go find out about others and use them as needed.

The Basic Search we did earlier can be re-written with the DSL in this structure:

<GET command> <URL>/<index name>/_search
{
  "query": {
    "match": {
      <field>: <search term>
    }
  }
}

The query key tells Elasticsearch to perform a search. The match key tells Elasticsearch to look for each term in the search term, divided into tokens (details are beyond the scope of this blog; for conversation’s sake, let’s say each word in a phrase).

Here is an example:

 


curl -XGET "localhost:9200/dna/_search" -d '
{
  "query": {
    "match": {
      "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
    }
  }
}'

Command line:

curl -XGET "localhost:9200/dna/_search" -d '{ "query": { "match": { "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA" } } }'

The result:

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRjGaZcdfh2Ha9smfL",
        "_score": 0.30685282,
        "_source": {
          "id": "3",
          "Name": "Sample 3",
          "read_date": "2016-04-05",
          "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
        }
      }
    ]
  }
}

You can search for a phrase rather than for individual words. The match_phrase key helps with this.
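
For example, here is a sketch following the same structure as the match query above, requiring the exact phrase "Sample 1" in the name field:

curl -XGET "localhost:9200/dna/_search" -d '{ "query": { "match_phrase": { "name": "Sample 1" } } }'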

This can be expanded to perform filtering. This can eliminate irrelevant data from the search results. The format of it is:

<GET command> <URL>/<index name>/_search
{
  "query": {
    "filtered": {
      "filter": {
        <filter option>: {
          <field>: { <filter command>: <filtered-by term> }
        }
      },
      "query": {
        "match": {
          <field>: <search term>
        }
      }
    }
  }
}

To illustrate, let’s say that from the search results we want only entries read after January 1st, 2016 (the selected date reads the same in both the British and American date conventions).
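
Here is a sketch of such a query, using the Elasticsearch 2.x filtered-query syntax with a range filter on read_date (only Sample 3, read on 2016-04-05, should come back):

curl -XGET "localhost:9200/dna/_search" -d '{ "query": { "filtered": { "query": { "match_all": {} }, "filter": { "range": { "read_date": { "gt": "2016-01-01" } } } } } }'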

 

Homework
These are a collection of exercises to test your understanding of this blog and help you get started in discovering another feature of Elastic search.

Indices need to change in the real world as users update content. Elasticsearch provides this functionality by supporting the PUT command.

Use the PUT command to update the third index’s sequence to:

GAGTATCATGGAGCCCATGGTTGAATGAGTCCAATAACGAACTTCGACATGATAAAGTCCCCCCCTCGCGACTTCCAGAGAAGAAGACTACTGAGTTGAG

Once done, do these search exercises:

  1. Search for:  AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA
  2.  Search for:  GAGTATCATGGAGCCCATGGTTGAATGAGTCCAATAACGAACTTCGACATGATAAAGTCCCCCCCTCGCGACTTCCAGAGAAGAAGACTACTGAGTTGAG
  3.  Search for: *AAAA*
  4. Search for: *TTCGACATG*

Note: for (3) and (4), the ‘*’ tells elasticsearch to search for a substring.

Being able to complete these tasks shows you understand this blog post. If you are unable to, I recommend you re-read this blog post and follow along with the tutorial on a computer.

Notes

  1. https://www.elastic.co: This is Elasticsearch’s URL. Note, there is no ‘m’ at the end.
  2. https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html: This is a free online book about Elasticsearch. You have to buy an e-book or physical book if you want to read it offline (this trend is becoming popular). I recommend you look at this book after reading this blog post, when you are ready for more advanced material.
  3. My current thought is that developing an analyzer for processing genomic data would be a great Bioinformatics problem and helpful in Precision Medicine studies. I have not investigated it much as I am occupied studying other topics. But it is in the back of my mind.
  4. Zachary Tong (https://www.elastic.co/blog/author/zachary-tong) is a name I saw a number of times when I began studying Elasticsearch. Like him, I too am interested in Neuroscience, Bioinformatics and Software Development. He posted a slide deck involving gene sequencing and Bioinformatics that is worth a glance in my opinion- http://www.slideshare.net/ZacharyTong/boston-meetupgoingorganic
  5. Elastic BV (https://www.elastic.co/subscriptions) provides commercial support. I believe paid support makes an open source project more appealing. It tells you that there are a few people with a vested interest in maintaining the code base who can help when a problem has reached a C-level executive during bonus-paying time and you need a miracle.
  6. https://aws.amazon.com/elasticsearch-service/ Amazon Web Services supports hosting Elasticsearch in the cloud.
  7. Netflix, GitHub, Quora, Foursquare, Stack Exchange (who own stackoverflow.com) are a few of the companies using elasticsearch. That is proof it is a valuable tool that can handle huge search tasks.

 

References

  1. http://joelabrahamsson.com/ A great tutorial about how to get started with Elasticsearch. It has a lot of details.
  2. https://troubleshootguru.wordpress.com/2014/11/19/how-to-install-elasticsearch-on-centos-rhel-2/ This is another tutorial. It has less cruft, so it is good if you just need to know what to put in the command line.
  3. http://faculty.ucr.edu/~mmaduro/random.htm This is a tool I used to generate random genetic sequences.
  4. https://www.pluralsight.com/courses/elasticsearch-for-dotnet-developers A great PluralSight training video on Elasticsearch.
  5. http://code.tutsplus.com/tutorials/a-beginners-guide-to-http-and-rest—net-16340 Overview of RESTful calls and curl.
  6. http://www.secretgeek.net/json_3mins JSON tutorial