
Quickly Setting up a Python Virtual Environment in Ubuntu


Notes on using Python’s virtualenv

One-time work
Installing virtualenv: sudo pip install virtualenv

Getting Up and Going (Most important)
Creating virtual environment: virtualenv <project_name>
Activating: source <project_name>/bin/activate
Deactivate: deactivate
Documenting Dependencies: pip freeze --local > requirements.txt
Destroy virtualenv: rm -rf <project_name>
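As a sanity check, the whole lifecycle above can be run end-to-end in a few lines. This sketch uses the standard library's python3 -m venv as a stand-in for virtualenv (the activate/freeze/teardown steps are identical for both tools), and "myproject" is a made-up name:

```shell
# Full lifecycle: create, activate, document dependencies, deactivate, destroy.
# Uses python3 -m venv as a stand-in for virtualenv; the steps are the same.
PROJECT="$(mktemp -d)/myproject"          # hypothetical project location
python3 -m venv "$PROJECT"                # create the virtual environment
. "$PROJECT/bin/activate"                 # activate it
pip freeze --local > requirements.txt     # document dependencies
deactivate                                # leave the environment
rm -rf "$PROJECT"                         # destroy it
```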

Misc.
Packages in virtualenv: pip list
Determining whether a virtualenv is active: which python (inside an active environment this points to <project_name>/bin/python; outside, it shows the system Python path)
Specific Python-version virtualenv: virtualenv -p <python_version_path> <project_name>
Pulling dependencies from requirements.txt: pip install -r requirements.txt

Reference

Docker as an Alternative

 

 


Getting Up and Going with Elasticsearch


Introduction
This blog post explains how to get up and going with Elasticsearch on Ubuntu. Although written for Ubuntu, many of the concepts are applicable to Windows. The post focuses on the fundamentals and the bare necessities needed to learn them. It will not get you ready for a production release, but it will prepare you for the next level. We will also use simple tooling, to reduce the extra learning that would otherwise incur. The pros and cons of Elasticsearch, and comparisons with other search servers, are beyond the scope of this post. There are other introductory blog posts and a PluralSight video if you want different Elasticsearch material to complement this one; they are listed in the references. There is a minor lean towards Bioinformatics/Molecular Biology, but knowledge of these fields is not required to understand this post.

Elasticsearch is a search server for running full-text search. It is a NoSQL, distributed database engine. Search queries are performed over an index rather than the actual documents. This makes searching fast and reduces the need to store a duplicate copy of documents after indexing is complete. Like Solr, it uses Lucene in the backend.

Elasticsearch provides a RESTful interface for performing search tasks like querying and indexing. It can be used for searching blog content, analyzing logs and any other search need. A growing use is performing pairwise-match searches in genomics, which is what got me interested in this technology.

It is recommended that you read the text AND try out the activities in this blog post.

 

Prerequisite Knowledge
There are some tools and standards you need a working knowledge of as prerequisites. The first two are "RESTful" API calls and "curl". In addition to the hyperlinks on the terms, here is an additional link: http://code.tutsplus.com/tutorials/a-beginners-guide-to-http-and-rest—net-16340. You also need a basic understanding of JSON (http://www.secretgeek.net/json_3mins). You need not master them; just be aware of their definitions and have a big-picture understanding.
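If JSON is new to you, one low-friction way to experiment is from the command line: python3's built-in json.tool module pretty-prints a document and rejects invalid ones, so it doubles as a quick validator (the sample document here is made up):

```shell
# Pretty-print a small JSON document; json.tool exits non-zero on
# invalid JSON, so this doubles as a quick validity check.
echo '{ "id": "1", "name": "Sample 1" }' | python3 -m json.tool
```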

 

Prerequisite Tooling
As per best practices, make sure your system is updated. Run this command in the command line:

sudo apt-get update

Next, get Java if you do not already have it. Run this command to check whether it is installed:

java -version

Skip this step if it is installed. If not, run this command to install it (as of mid-2016):

sudo apt-get install openjdk-8-jre

 

Installation
There are many ways to install Elasticsearch. One way is to use the wget tool; the other is to manually download and move the files to the right folder.

Option 1: Using command line

Run this command:

wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.3.1.tar.gz

Uncompress the archive (tar -xzf elasticsearch-2.3.1.tar.gz) and move the contents to a folder of your choice.

Note, the latest version at the time of this post was 2.3.1. Adjust the above URL based on the latest version number, which can be found at:

https://www.elastic.co/downloads/elasticsearch

Option 2: Manually downloading files

Another way to install is to download the relevant files manually and move them to an appropriate folder.

  1. Go to: https://www.elastic.co/downloads/elasticsearch
  2. Download the zip or tar.gz version to your local hard drive.
  3. Uncompress
  4. Move content to a folder you desire.

The names of the folders are self-explanatory: the config folder contains the configuration files, the logs folder holds logging information, and so on.

 

Configuration Adjustments
It is common to adjust the configuration a bit while learning Elasticsearch, to make identifying things easier. Here are some adjustments:

  1. Open up the Elastic search folder
  2. Open up the config folder
  3. Look for a file called: “elasticsearch.yml” and open
  4. Look for "cluster.name". Change it to a name you will recognize.
  5. Look for “node.name” and change it to a name you will recognize.
  6. Look for "bootstrap.mlockall". Change it to "true". This prevents memory swapping; the details are beyond the scope of this blog, and swapping is something we want to eliminate from consideration for our intents and purposes.
  7. Look for network.host. Change it to “127.0.0.1”. This is your localhost.

These changes are useful for debugging (like node.name) or running queries (change network.host to localhost.)
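Putting steps 4 through 7 together, the edited lines in config/elasticsearch.yml would look something like the following sketch (the cluster and node names here are invented; pick your own):

```yaml
# config/elasticsearch.yml (edited lines only)
cluster.name: my-learning-cluster   # step 4: a name you will recognize
node.name: my-node-1                # step 5: handy when reading logs
bootstrap.mlockall: true            # step 6: prevent memory swapping
network.host: 127.0.0.1             # step 7: bind to localhost
```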

 

Verify Elasticsearch is Running
Open up the folder where Elasticsearch was installed.

Run this command in the command line:

./bin/elasticsearch

This will start up the Elasticsearch service. You can kill this process at any time with CTRL-C. However, for the sake of this tutorial, leave it running until completed.

Open up a new command prompt, while leaving the previous one running. In the new command line, type in this command:

curl -XGET 'http://localhost:9200'

If you see a JSON response containing a hash with a long id, that means Elasticsearch is running; if the connection is refused, the service is not running.

 

Big Picture Story
You are now ready to try out Elasticsearch. The big-picture story of how it works: you get your data stored as JSON, you use RESTful calls to store the data into Elasticsearch, and Elasticsearch indexes it and makes it available for search (done, again, via RESTful calls). I recommend you keep this working idea in mind as you learn to use Elasticsearch.

 

Data to Store
Data is stored in Elasticsearch as JSON. There is no restriction on what is a key and what is a value, as long as the data is valid JSON. For illustration purposes, we are going to examine a simple genetic search problem involving DNA sequences.

For our purposes, a DNA string is a string over the alphabet 'A', 'C', 'G' and 'T'. A sequence is a string with biologically significant ordering (something measured).
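The sample sequences below came from a random-sequence generator (see the references). If you want to make your own test data, a rough command-line equivalent is:

```shell
# Generate a random 100-character DNA string over the A/C/G/T alphabet.
# 100000 random bytes leave roughly 1500 A/C/G/T characters after
# filtering, comfortably more than the 100 we keep.
SEQ="$(head -c 100000 /dev/urandom | LC_ALL=C tr -dc 'ACGT' | cut -c 1-100)"
echo "$SEQ"
```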

Here is the made-up data we will be using for illustration purposes:

 

{
  "id": "1",
  "name": "Sample 1",
  "read_date": "2014-01-02",
  "sequence": "CAGCATACGAAAGAGCTTAAGGCACGCCAATTCGCACTGTCAGGGTCACTTGGGTGTTTTGCACTACCGTCAGGTACGCTAGTATGCGTTCTTCCTTCCA"
}

{
  "id": "2",
  "name": "Sample 2",
  "read_date": "2015-03-04",
  "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG"
}

{
  "id": "3",
  "name": "Sample 3",
  "read_date": "2016-04-05",
  "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
}

 

 

Preparation for Indexing
Note, "curl" does not accept newline characters; they are shown here for readability. A copy of every code snippet without them is also provided, so you can run it on a command line.

You need to run a POST command to create the index with its mapping. Here is the structure Elasticsearch needs:

curl -XPOST '<URL>/<index name>' -d '<mapping>'

"Mapping" is like a data schema: it tells Elasticsearch the data fields and how to store the data (number, boolean, etc.). "Type" can be considered something like a relational database's table declaration: it tells Elasticsearch what to expect. Properties are like a relational database's columns.

Here is a way of modeling the mapping for the DNA string used in this blog:

curl -XPOST "localhost:9200/dna" -d
'{
  "mappings": {
    "string": {
      "properties": {
        "id":        { "type": "integer" },
        "name":      { "type": "string" },
        "sequence":  { "type": "string" },
        "read_date": { "type": "date" }
      }
    }
  }
}'

Here is the text with whitespace removed, more appropriate for the command line:

curl -XPOST "localhost:9200/dna" -d '{ "mappings": { "string": { "properties": { "id": { "type": "integer" }, "name": { "type": "string" }, "sequence": { "type": "string" }, "read_date": { "type": "date" } } } } }'

There are other data types, like binary (used for storing content like blobs), date (used for storing dates) and boolean (used for storing boolean values).

Run this command to see the indices:

curl -XGET localhost:9200/_cat/indices

You should see the dna entry. Now to verify the appropriate mapping was applied, run:

curl -XGET localhost:9200/dna/_mappings

 

Indexing Data
A POST command (used for creating new content) is used for adding content to the search index. The  structure of the command is:

[Elasticsearch URL]/<index>/<type>/[<id>]

The id is optional; it will be autogenerated if not provided.

1st content to add

curl -XPOST "localhost:9200/dna/string" -d
'{
  "id": "1",
  "name": "Sample 1",
  "read_date": "2014-01-02",
  "sequence": "CAGCATACGAAAGAGCTTAAGGCACGCCAATTCGCACTGTCAGGGTCACTTGGGTGTTTTGCACTACCGTCAGGTACGCTAGTATGCGTTCTTCCTTCCA"
}'

Here is the entry with whitespace removed, more appropriate for the command line:

 

curl -XPOST "localhost:9200/dna/string" -d '{ "id": "1", "name": "Sample 1", "read_date": "2014-01-02", "sequence": "CAGCATACGAAAGAGCTTAAGGCACGCCAATTCGCACTGTCAGGGTCACTTGGGTGTTTTGCACTACCGTCAGGTACGCTAGTATGCGTTCTTCCTTCCA" }'

2nd content to add

curl -XPOST "localhost:9200/dna/string" -d
'{
  "id": "2",
  "name": "Sample 2",
  "read_date": "2015-03-04",
  "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG"
}'

Here is the entry with whitespace removed, more appropriate for the command line:

curl -XPOST "localhost:9200/dna/string" -d '{ "id": "2", "name": "Sample 2", "read_date": "2015-03-04", "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG" }'

3rd content to add

curl -XPOST "localhost:9200/dna/string" -d
'{
  "id": "3",
  "name": "Sample 3",
  "read_date": "2016-04-05",
  "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
}'

Here is the entry with whitespace removed, more appropriate for the command line:

curl -XPOST "localhost:9200/dna/string" -d '{ "id": "3", "name": "Sample 3", "read_date": "2016-04-05", "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA" }'

 

 

Basic Search
A Basic Search is used when you want to quickly search for data and do not need any add-ons. The structure of a Basic Search is:

curl -XGET '<URL>/_search?q=<field>:<search term>'

To illustrate, let's say there is a need to find which DNA string contains the sequence: "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA".

curl -XGET 'localhost:9200/dna/_search?q=sequence:AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA'

The result you get back (formatted) would be similar to:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRjGaZcdfh2Ha9smfL",
        "_score": 0.30685282,
        "_source": {
          "id": "3",
          "name": "Sample 3",
          "read_date": "2016-04-05",
          "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
        }
      }
    ]
  }
}

The value "_id" (in hits) will depend on your machine (note, we set "id", not "_id"). The important keys in the result are "hits" and "_source": "hits" tells you what matched, and "_source" gives you back the data.

You can also find a substring in the sequence. Let's say you want all entries that contain the substring "AAAA"; you use the wildcard '*' to tell Elasticsearch to look for a substring.

Here is the query:

curl -XGET 'localhost:9200/dna/_search?q=sequence:*AAAA*'

The result you see should be similar to:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRjGaZcdfh2Ha9smfL",
        "_score": 1,
        "_source": {
          "id": "3",
          "name": "Sample 3",
          "read_date": "2016-04-05",
          "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
        }
      },
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRij7Tcdfh2Ha9smcT",
        "_score": 1,
        "_source": {
          "id": "2",
          "name": "Sample 2",
          "read_date": "2015-03-04",
          "sequence": "GAGGTATGTGGCTGCGTGGTCAAAAGTGCGGCATTCGTATTTGCTCCTCGTGTTTACTCTCACAAACTTGACCTGGAGATCAAGGAGATGCTTCTTGTGG"
        }
      }
    ]
  }
}

Query DSL
The Basic Search is fine for some cases, but other cases require more sophisticated techniques like filtering, changing the search algorithm (analyzer), searching for phrases, and other options. Elasticsearch provides a DSL (domain-specific language) that can help with this. This blog post will not go into all the details, but will prepare you enough to find out about the others and use them as needed.

The Basic Search we did earlier can be re-written with the DSL in this structure:

<GET command> <URL>/<index name>/_search
{
  "query": {
    "match": {
      <field>: <search term>
    }
  }
}

The query key tells Elasticsearch to perform a search. The match key tells Elasticsearch to look for each term in the search term, divided by token (the details are beyond the scope of this blog; for conversation's sake, let's say each word in a phrase).


Here is an example:

curl -XGET "localhost:9200/dna/_search" -d
'{
  "query": {
    "match": {
      "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
    }
  }
}'

Command line:

curl -XGET "localhost:9200/dna/_search" -d '{ "query": { "match": { "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA" } } }'

The result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "dna",
        "_type": "string",
        "_id": "AVSRjGaZcdfh2Ha9smfL",
        "_score": 0.30685282,
        "_source": {
          "id": "3",
          "name": "Sample 3",
          "read_date": "2016-04-05",
          "sequence": "AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA"
        }
      }
    ]
  }
}

You can search for a phrase rather than for individual words. The match_phrase key helps with this.
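For example, a hypothetical match_phrase query on the name field might look like the following sketch. The body is only validated locally here; the commented-out curl line shows how it would be sent to a live node:

```shell
# Hypothetical match_phrase query: match the exact phrase "Sample 2"
# in the name field, rather than its individual words.
BODY='{ "query": { "match_phrase": { "name": "Sample 2" } } }'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "body is valid JSON"
# To run it for real (assumes Elasticsearch is up on localhost:9200):
# curl -XGET "localhost:9200/dna/_search" -d "$BODY"
```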

This can be expanded to perform filtering, which can eliminate irrelevant data from the search results. The format is:

<GET command> <URL>/<index name>/_search
{
  "query": {
    "filtered": {
      "filter": {
        <filter option>: {
          <field>: {
            <filter command>: <filtered-by term>
          }
        }
      },
      "query": {
        "match": {
          <field>: <search term>
        }
      }
    }
  }
}

To illustrate, let's say we want only the results read after January 1st, 2016 (a date that is written the same in both the British and American date conventions):
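Here is a sketch of such a query, using a range filter with gt (greater than) on read_date inside the filtered structure shown earlier. As with the other snippets, the body is validated locally, and the commented-out curl line shows how to run it against a live node:

```shell
# Hypothetical date filter: keep only documents whose read_date is
# after 2016-01-01, while matching everything ("match_all").
BODY='{ "query": { "filtered": { "filter": { "range": { "read_date": { "gt": "2016-01-01" } } }, "query": { "match_all": {} } } } }'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "body is valid JSON"
# curl -XGET "localhost:9200/dna/_search" -d "$BODY"
```

Against the sample data, only "Sample 3" (read 2016-04-05) should survive this filter.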

 

Homework
These are a collection of exercises to test your understanding of this blog post and help you get started discovering other features of Elasticsearch.

Indices need to change in the real world as users update content. Elasticsearch provides this functionality by supporting the PUT command.

Use the PUT command to update the third index’s sequence to:

GAGTATCATGGAGCCCATGGTTGAATGAGTCCAATAACGAACTTCGACATGATAAAGTCCCCCCCTCGCGACTTCCAGAGAAGAAGACTACTGAGTTGAG

Once done, do these exercises:

  1. Search for:  AACTGGACAACGCATCAACGCAACGGATCTACGTTACAGCGTGCATAGTGAAAACGGAGTTGCTGACGACGAAAGCGACATTGGGATCTGTCTGTTGTCA
  2.  Search for:  GAGTATCATGGAGCCCATGGTTGAATGAGTCCAATAACGAACTTCGACATGATAAAGTCCCCCCCTCGCGACTTCCAGAGAAGAAGACTACTGAGTTGAG
  3.  Search for: *AAAA*
  4. Search for: *TTCGACATG*

Note: for (3) and (4), the ‘*’ tells elasticsearch to search for a substring.

Being able to complete these tasks shows you understand this blog post. If you are unable to, I recommend you re-read it and follow along with the tutorial on a computer.

Notes

  1. https://www.elastic.co: This is Elasticsearch's URL. Note, there is no 'm' at the end.
  2. https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html: This is a free online book about Elasticsearch. You have to buy an e-book or physical copy if you want to read it offline (a trend that is becoming popular). I recommend looking at this book after reading this blog post, once you are ready for more advanced material.
  3. My current thought is that developing an analyzer for processing genomic data would be a great Bioinformatics problem and helpful in Precision Medicine studies. I have not investigated it much, as I am occupied studying other topics, but it is in the back of my mind.
  4. Zachary Tong (https://www.elastic.co/blog/author/zachary-tong) is a name I saw a number of times when I began studying Elasticsearch. Like him, I too am interested in Neuroscience, Bioinformatics and Software Development. He posted a slide deck involving gene sequencing with Bioinformatics that is worth a glance, in my opinion: http://www.slideshare.net/ZacharyTong/boston-meetupgoingorganic
  5. Elastic BV (https://www.elastic.co/subscriptions) provides commercial support. I believe paid support makes an open-source project more appealing: it tells you there are people with a vested interest in maintaining the code base, who can help when a problem has reached a C-Level executive during bonus-paying time and you need a miracle.
  6. https://aws.amazon.com/elasticsearch-service/ Amazon web services supports hosting Elastic search in the cloud.
  7. Netflix, Github, Quora, Foursquare and Stack Exchange (who own stackoverflow.com) are a few of the companies using Elasticsearch, proof that it is a valuable tool that can handle huge search tasks.

 

References

  1. http://joelabrahamsson.com/ A great tutorial on getting started with Elasticsearch, with a lot of detail.
  2. https://troubleshootguru.wordpress.com/2014/11/19/how-to-install-elasticsearch-on-centos-rhel-2/ Another tutorial. It has less cruft, so it is good if you just need to know what to type in the command line.
  3. http://faculty.ucr.edu/~mmaduro/random.htm This is a tool I used to generate random genetic sequences.
  4. https://www.pluralsight.com/courses/elasticsearch-for-dotnet-developers A great PluralSight training video on Elastic Search.
  5. http://code.tutsplus.com/tutorials/a-beginners-guide-to-http-and-rest—net-16340 Overview of RESTful calls and curl.
  6. http://www.secretgeek.net/json_3mins JSON tutorial