Efficient Use of TigerGraph and Docker

TigerGraph can be combined with Docker to run on nearly any OS. In this article, we tackle the issue of the large TigerGraph image and its unused potential.

David Baker Effendi
Towards Data Science

Image by author. Logos are from TigerGraph and Docker’s websites respectively.

TigerGraph is my graph database and graph analytics platform of choice as it is fast, scalable, and has an active open-source community. I regularly make use of TigerGraph locally due to my location not having nearby TigerGraph Cloud servers.

At the time of writing, the TigerGraph software requirements specify support for the following operating systems:

  • Red Hat and CentOS versions 6.5–6.9, 7.0–7.4, and 8.0–8.2
  • Ubuntu 14.04, 16.04, and 18.04
  • Debian 8

For anyone using operating systems beyond this list, a logical solution would be to make use of containerization: Docker, in the case of this article.

In this article we will cover:

  1. How to make use of the official TigerGraph image and what’s inside
  2. Stripping the official Docker image of unnecessary bloat
  3. Modifying the ENTRYPOINT to:
     • run gadmin on startup
     • run GSQL scripts bound at a certain directory
     • route the output from a log file to STDOUT
  4. Using Docker Compose to run TigerGraph images

The Official TigerGraph Image

The official TigerGraph image, running the developer edition, can be obtained by the following command:

docker pull docker.tigergraph.com/tigergraph-dev:latest

Run with:

docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph_dev --ulimit nofile=1000000:1000000 -v ~/data:/home/tigergraph/mydata -t docker.tigergraph.com/tigergraph-dev:latest

This source gives more in-depth instructions on how the image is constructed, but in summary:

  • A base image of Ubuntu 16.04 is used
  • All required software, such as tar and curl, is installed
  • Optional software, such as emacs, vim, and wget, is installed
  • The GSQL 101 and 102 tutorials and the GSQL Algorithms library are downloaded
  • Three notable ports can be exposed and used to communicate with the server: the SSH server, the REST++ API, and GraphStudio

The total image is a 1.8–2.0GB download (version dependent), which puts considerable strain on bandwidth, especially in resource-sensitive use cases like CI/CD. Another notable point is that all one needs in order to use TigerGraph is a GSQL socket connection, which tools such as Giraffle and pyTigerGraph can interface with.

I’ve identified two large sources of bloat which are:

  • The optional and unnecessary software e.g. vim and GSQL Tutorial 101
  • GraphStudio and binaries not necessary for the minimal operation of TigerGraph Developer Edition

Stripping the TigerGraph Image

I’ve replaced the base image ubuntu:16.04 with Bitnami’s MiniDeb image in order to shave off a few megabytes of unnecessary space. This runs Debian Jessie.

The next step was to remove unnecessary binaries installed during the apt-get stage of the official image. I’ve kept Vim as the only command-line text editor but binaries such as wget, git, unzip, emacs, etc. are no longer installed.

During the TigerGraph installation, the hardware requirements are strictly enforced and the installation will fail if they are not met. Since I want DockerHub runners to automatically build and push my image, I hacked the check such that the low-resource runners can continue to build the image.

This is done by replacing the os_utils binary with my version, which makes the check_cpu_number() and check_memory_capacity() functions more lenient. This binary can be found under:

/home/tigergraph/tigergraph-${DEV_VERSION}-developer/utils/os_utils

This has already reduced the bloat by around 400MB, and my DockerHub image for TigerGraph 3.0.0 reports a compressed size of 1.52GB (although downloading the layers indicates that it comes to around 1.62GB).

Note: I have attempted to haphazardly delete GraphStudio binaries, but this breaks the gadmin start script, so more meticulous adjustments, e.g. editing the gadmin Python scripts, will be needed in order to remove more from TigerGraph.

The final result between the two images once I’ve downloaded and uncompressed them can be seen by calling docker images:

The source code for my build can be found here and I encourage anyone with suggestions to contact me!

UPDATE (2/11/2020): Thanks to Bruno Šimić for reaching out and building on this work to slim down the TigerGraph Enterprise Edition and share his code with me. The following additional stripping is his work that has been implemented in this image.

There appears to be documentation and unnecessary build artifacts (such as many node_modules) under the installation directory. Examples of where these can be found are under:

  • ${INSTALL_DIR}/app/${DEV_VERSION}/document/
  • ${INSTALL_DIR}/app/${DEV_VERSION}/bin/gui/server/node_modules
  • ${INSTALL_DIR}/app/${DEV_VERSION}/bin/gui/node/lib/node_modules
  • ${INSTALL_DIR}/app/${DEV_VERSION}/gui/server/node_modules/
  • ${INSTALL_DIR}/app/${DEV_VERSION}/.syspre/usr/share/
  • ${INSTALL_DIR}/app/${DEV_VERSION}/.syspre/usr/lib/jvm/java-8-openjdk-amd64-1.8.0.171/

Where INSTALL_DIR is /home/tigergraph/tigergraph for TigerGraph Developer Edition v3 and DEV_VERSION is the specific version e.g. 3.0.0.
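
As a rough illustration of the cleanup described above, the following shell sketch deletes some of the listed directories. It is not taken from the actual build: the function name strip_tigergraph_bloat is hypothetical, and in a Dockerfile this would more likely be a single RUN step. The JVM directory is left out here because its name is tied to a specific Java version.

```shell
# Sketch: remove documentation and build artifacts from a TigerGraph install.
# strip_tigergraph_bloat is a hypothetical helper, not part of TigerGraph.
strip_tigergraph_bloat() {
  install_dir="${1:?install directory required}"
  dev_version="${2:?version required}"
  app_dir="${install_dir}/app/${dev_version}"
  # Relative paths are taken from the list above.
  for rel in \
    "document" \
    "bin/gui/server/node_modules" \
    "bin/gui/node/lib/node_modules" \
    "gui/server/node_modules" \
    ".syspre/usr/share"; do
    # rm -rf does nothing when the path does not exist.
    rm -rf -- "${app_dir}/${rel}"
  done
}
```

With the values given above, this would be called as strip_tigergraph_bloat /home/tigergraph/tigergraph 3.0.5.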

Performing a fresh pull of the TigerGraph Developer Edition v3.0.5 and comparing it to my new v3.0.5 we see the following amount of disk space used:

Modifying the ENTRYPOINT

Before we add features let’s have a look at the original ENTRYPOINT:

ENTRYPOINT /usr/sbin/sshd && su - tigergraph bash -c "tail -f /dev/null"

This does two things:

  1. The SSH server is started by running /usr/sbin/sshd.
  2. The container is kept alive by running the tail command as user tigergraph. This constantly reads output from /dev/null, which is also why the container’s STDOUT is empty.

Starting “gadmin” on Startup

To improve the user experience, I’ve added a command that starts the gadmin services in the Docker entry point, changing it from

ENTRYPOINT /usr/sbin/sshd && su - tigergraph bash -c "tail -f /dev/null"

to

ENTRYPOINT /usr/sbin/sshd && su - tigergraph bash -c "/home/tigergraph/tigergraph/app/cmd/gadmin start all && tail -f /dev/null"

A very simple but valuable change!

Run GSQL Scripts on Startup Using Volumes

Something that the TigerGraph Docker image lacks (which other database images such as MySQL, MariaDB, and PostgreSQL have) is a directory named something along the lines of docker-entrypoint-initdb.d, where a user can bind database scripts to run at startup, e.g. for schema creation or database population.

There are various ways to go about this but I’ve chosen a fairly simple way of implementing this by adding the following line between the gadmin and tail command:

How this command works is:

  1. The if-command will check if a directory called /docker-entrypoint-initdb.d exists and will not perform the next step unless this is true.
  2. The for file in /docker-entrypoint-initdb.d/*.gsql; do line will start a for-each loop of all the files ending with the gsql extension in the entry point folder.
  3. The su tigergraph bash -c line will run the GSQL command on the file given by the for-each loop.
  4. By appending || continue, neither the container nor the loop will stop if a script fails to execute.
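
The steps above can be sketched as a small shell function. This is a minimal sketch rather than the exact line used in the image: it assumes it is already running as the tigergraph user (so the su wrapper is omitted), that gsql accepts a script path as its argument, and the GSQL_CMD variable is a hypothetical override that makes the loop easy to try without TigerGraph installed.

```shell
# Sketch of the init-script loop; run_init_scripts and GSQL_CMD are
# hypothetical names introduced for illustration.
run_init_scripts() {
  init_dir="${1:-/docker-entrypoint-initdb.d}"
  gsql_cmd="${GSQL_CMD:-gsql}"
  # Step 1: only proceed if the directory exists.
  if [ -d "$init_dir" ]; then
    # Step 2: visit every file ending in .gsql.
    for file in "$init_dir"/*.gsql; do
      # With no matches the pattern stays literal, so skip non-files.
      [ -f "$file" ] || continue
      # Steps 3 and 4: run the script and keep going even if it fails.
      "$gsql_cmd" "$file" || continue
    done
  fi
}
```

Calling run_init_scripts with no arguments uses the /docker-entrypoint-initdb.d directory described above.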

This will most likely look neater if placed into an entrypoint.sh but this is up to you! The final result now looks something like this:

Routing logs to STDOUT

To find out where the logs are located, one can run gadmin log, which will return something along the lines of

ADMIN  : /home/tigergraph/tigergraph/log/admin/ADMIN#1.out
ADMIN : /home/tigergraph/tigergraph/log/admin/ADMIN.INFO
CTRL : /home/tigergraph/tigergraph/log/controller/CTRL#1.log
CTRL : /home/tigergraph/tigergraph/log/controller/CTRL#1.out
DICT : /home/tigergraph/tigergraph/log/dict/DICT#1.out
DICT : /home/tigergraph/tigergraph/log/dict/DICT.INFO
ETCD : /home/tigergraph/tigergraph/log/etcd/ETCD#1.out
EXE : /home/tigergraph/tigergraph/log/executor/EXE_1.log
EXE : /home/tigergraph/tigergraph/log/executor/EXE_1.out
...etc

I’m mostly interested in the admin logs so I will change the tail command to read from /home/tigergraph/tigergraph/log/admin/ADMIN.INFO instead of /dev/null.
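
One thing that may be worth guarding against is the log file not existing yet when tail starts. The sketch below waits for the file before tailing it; the helper name wait_for_log and the 60-second default are assumptions made for illustration and are not part of the original entry point.

```shell
# Sketch: wait until a log file exists, giving up after a timeout in seconds.
# wait_for_log is a hypothetical helper name.
wait_for_log() {
  log_file="$1"
  timeout="${2:-60}"
  waited=0
  until [ -f "$log_file" ]; do
    # Give up once the timeout has elapsed.
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible use in an entry point (path taken from the gadmin log output above):
#   ADMIN_LOG=/home/tigergraph/tigergraph/log/admin/ADMIN.INFO
#   wait_for_log "$ADMIN_LOG" && exec tail -f "$ADMIN_LOG"
```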

Now anything written to the admin logs will be piped to the container’s logs automatically. The final product from all three steps is now:

Using Docker Compose with TigerGraph

If you would like a GSQL script to run on startup, add the following entry under volumes:

- my_script.gsql:/docker-entrypoint-initdb.d/my_script.gsql

Note that I have added a health check which calls the REST++ echo endpoint every 5 seconds to determine if the container is healthy or not. This is useful for many applications and, in my use-case, is used to check if the container is ready during integration testing before starting the tests.
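
For the integration-testing use case, the same readiness check can also be run from a test script. The following is a sketch under assumptions: it presumes REST++ is published on localhost:9000 and that the echo endpoint is reachable at /echo, and the names wait_until_ready and CHECK_CMD are hypothetical (CHECK_CMD only exists so the loop can be exercised without a running container).

```shell
# Sketch: poll an HTTP endpoint until it responds or the retries run out.
# wait_until_ready and CHECK_CMD are hypothetical names.
wait_until_ready() {
  url="${1:-http://localhost:9000/echo}"
  retries="${2:-60}"
  interval="${3:-5}"   # seconds between attempts, matching the health check
  check_cmd="${CHECK_CMD:-curl}"
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # -f makes curl return a non-zero status on HTTP errors.
    if "$check_cmd" -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}
```

A test runner could then call wait_until_ready and abort if it returns a non-zero status.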

If you use the official image, you would need to SSH into the container to manually start all services:

The default password is “tigergraph” after which you would call the command

gadmin start all (v3.0.0 and later)
gadmin start (before v3.0.0)
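
If a script has to handle both cases, the choice between the two commands can be made from the version string. This is only a sketch: the function name gadmin_start_args is hypothetical, and it assumes a purely numeric version such as 3.0.5.

```shell
# Sketch: print the gadmin arguments that start all services for a version.
# gadmin_start_args is a hypothetical helper name.
gadmin_start_args() {
  version="${1:?version required}"
  major="${version%%.*}"   # text before the first dot, e.g. 3 for 3.0.5
  case "$major" in
    ''|*[!0-9]*) return 2 ;;   # reject non-numeric versions such as "latest"
  esac
  if [ "$major" -ge 3 ]; then
    echo "start all"
  else
    echo "start"
  fi
}

# Possible use after logging in, assuming the 14022:22 port mapping from the
# docker run command above (ssh -p 14022 tigergraph@localhost):
#   gadmin $(gadmin_start_args 3.0.5)
```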

GraphStudio can then be found at localhost:14240 in your web browser and REST++ at localhost:9000.

Conclusion

In this article we have:

  • inspected the official Dockerfile,
  • identified and removed obvious unnecessary files,
  • built a slimmer version of the TigerGraph image saving a considerable amount of disk space,
  • modified the ENTRYPOINT to add additional automation to the container, and
  • used Docker Compose to run this image.

If you have any suggestions or thoughts on ways to further reduce the size of the image, then leave a comment, issue, or fork and give a pull request on the GitHub repository for the code in this article. If you would like to see more Docker-related guides for building custom setups for databases such as JanusGraph then leave a comment below.

You can find these images on Docker Hub, and I will continue to update this as new versions come out or until TigerGraph makes an official slimmer version.

If you would like to join the TigerGraph community and contribute, or start awesome projects and get a deeper look into what’s coming for TigerGraph, then join us on the following platforms:

If you are interested in seeing some of my other work, then have a look at my personal page at https://davidbakereffendi.github.io/.

Credits go to Jon Herke at TigerGraph for his leadership in the community and equipping us to contribute in meaningful ways and Bruno Šimić for sharing his findings in slimming down the TigerGraph Enterprise Edition image which can be found at https://hub.docker.com/r/xpertmind/tigergraph.

Computer science graduate from Stellenbosch University focusing on static program analysis