
Nvidia gave me a $15K Data Science Workstation — here’s what I did with it

Recreating a massive Pubmed literature search project on a Data Science WhisperStation in hours versus weeks

Kyle Gallatin
Towards Data Science
13 min read · Feb 25, 2020


When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

As a machine learning engineer, I do a lot of deep learning, but I’m definitely no Google Brain researcher. I could run benchmarking tests, time jobs, etc. … but I don’t work at Nvidia, and honestly, that didn’t sound like much fun.

Damn — NVIDIA-Powered Data Science Workstations

I kept pitching myself ideas and started to think about the true value of powerful compute for a data scientist. Well-engineered GPU compute can lead to cost savings, low latency serving, and the easy training of large models — but what I was most interested in was rapid iteration.

Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally intensive process, only to be blindsided by an unforeseen error hours into the run. Then you have to correct it and start all over again.

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

And so I tried to answer that question by building a Pubmed search engine using only GPU resources.

The Project

The initial project came from a team in China, where there exists a burden on pharmaceutical companies to provide non-branded medical information to health care practitioners (HCPs). The team wanted to build a novel search tool which needed to be:

  • multilingual (en/zh)
  • available via multiple channels (web/WeChat)
  • customizable (able to tune algorithm for optimal results)
  • low latency (fast search results)
  • big data (all Pubmed abstracts — here I pilot with the last 10 years)
  • accurate (better search results than Pubmed)

My job was to engineer a solution that would meet and support all of these requirements. Moreover, this solution would need a relatively quick turnaround between iterations so that we could rapidly test, evaluate, and update the system based on input from subject matter experts (SMEs).

However, as a data scientist I had one big issue…

The Data

For those who don’t know, Pubmed is a database of biomedical literature containing more than 30 million citations. While not every full-text article is open access, each citation has an abstract. All of this information is available via API or as a bulk download of XML files, which, unfortunately for me, totals about 300GB spread across a thousand files.

Does this photo exemplify “search”?

After you parse each XML file to extract the useful fields (title, abstract, publication year, etc…), the data size drops to roughly 25GB. However, this still isn’t super manageable locally. Like I said before, data science is about experimentation — and I needed to do a lot of that. As a somewhat new data scientist (and awful engineer), I basically had one resource: Python. There were no 20-node Spark or Elasticsearch clusters coming to my rescue.
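
For reference, pulling those fields out of one bulk file looks roughly like this; a sketch assuming the standard Pubmed XML tags, with the file name and extracted fields purely illustrative rather than the original parsing code.

import gzip
import xml.etree.ElementTree as ET

records = []
with gzip.open("pubmed20n0001.xml.gz") as f:      # one of the ~1,000 bulk files
    root = ET.parse(f).getroot()

for article in root.iter("PubmedArticle"):
    # Grab a few useful fields from each citation
    title = article.findtext(".//ArticleTitle")
    abstract = article.findtext(".//AbstractText")
    year = article.findtext(".//PubDate/Year")
    records.append((title, abstract, year))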

But, what if I had a GPU or two?

The Workstation

The Data Science WhisperStation that Microway & Nvidia granted me access to had the following features:

  • Dual Intel Xeon 10-core CPUs
  • 192GB memory
  • High-speed 1TB scratch drive
  • Dual NVIDIA Quadro RTX 6000 GPUs with NVLink
  • Preinstalled and configured with Python, Docker, RAPIDS and all the machine learning libraries I’d need for this project

What we really care about are these last two bullets. As you’re likely aware, GPU compute is hugely popular in data science. Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run.

Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDs, tensorflow, pytorch and everything else installed and configured out-of-the-box saved hours in setup time.

Until recently, GPUs in data science were mostly used for deep learning. However, Nvidia has since released RAPIDS — a suite of general-purpose GPU data science libraries. RAPIDS is composed of:

  • cudf (basically pandas)
  • cuml (basically scikit-learn)
  • cugraph (basically NetworkX)

With these general-purpose data science libraries offering massive computational speedups for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc…), the path is paved to an entirely new frontier of data science.

Pandas to RAPIDS

These days, simple search is a fairly straightforward process. Words can be represented as numerical vectors, and then the distance between those vectors can be computed to see how “similar” the passages are. The simplest methodology for this uses cosine similarity with TF-IDF word vectors.
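
As a toy illustration of that idea (the two “abstracts” here are made up, and this is plain sklearn on CPU rather than anything GPU-specific):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["deep brain stimulation improves motor symptoms in parkinson disease",
        "autophagy is associated with lewy body disease"]
tfidf = TfidfVectorizer().fit_transform(docs)   # one sparse TF-IDF vector per doc
print(cosine_similarity(tfidf[0], tfidf[1]))    # cosine similarity between the two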

RAPIDS

However, to “vectorize” our text, we first need to actually read it in and preprocess it. Assume the XML → csv preprocessing has already been completed, and now we just need to read these in as dataframes and perform the associated preprocessing.

Now, we have about 23GB of data and only two GPUs. I have no illusions about being able to fit all of this in the GPU memory with Python. Fortunately, as is the case with scientific literature, only recent articles tend to be relevant. To evaluate the accuracy of the search with experts, I really only need the last 10 years of Pubmed data — which, with my dataset, was around 8 million articles.

Using CPU-bound processes would be pretty taxing, but speeding things up with CUDA makes all the difference. I want to:

  1. Read in the dataframe
  2. Clean the strings in the “Abstract” column
  3. Keep only years ≥ 2009
  4. Rewrite to csv

This is where cudf comes in. I literally wrote pandas code, did a find-and-replace, and had GPU-accelerated code!
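
Roughly, that looks like the sketch below. It assumes the parsed csv files have “Abstract” and “Year” columns; the file name and cleaning steps are illustrative, not the exact original code. Swapping cudf back to pandas gives the CPU version.

import time
import cudf   # swap for pandas to run the same logic on CPU

start = time.time()
df = cudf.read_csv("pubmed_chunk_0001.csv")             # one parsed Pubmed csv
df = df.dropna(subset=["Abstract"])                     # drop rows with no abstract
df["Abstract"] = df["Abstract"].str.lower()             # simple string cleanup on GPU
df = df[df["Year"] >= 2009]                             # keep only the last ~10 years
df.to_csv("pubmed_chunk_0001_clean.csv", index=False)   # rewrite to csv
print("Processed %d abstracts in %s seconds" % (len(df), time.time() - start))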

This processes the dataframes much faster than they would be processed locally. Here is a sample output locally using pandas:

Processed 13783 abstracts in 0.84604811668396 seconds
Processed 21714 abstracts in 1.2190630435943604 seconds
Processed 20259 abstracts in 1.1971170902252197 seconds

And here is output from the process on the workstation using cudf:

Processed 23818 abstracts in 0.3909769058227539 seconds
Processed 23609 abstracts in 0.5951714515686035 seconds
Processed 23929 abstracts in 0.3672349452972412 seconds

Each file is being processed more than twice as fast and the code only needs one GPU! Even better, we can use all the memory in both GPUs by creating a cuda cluster and employing dask.

If I want to just read in all the abstracts and do something else with them, dask makes this highly efficient with minimal lines of code.
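
A rough sketch of that multi-GPU read with dask and cudf (the file glob and column name are placeholders):

import time
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()                 # spins up one worker per visible GPU
client = Client(cluster)

start = time.time()
ddf = dask_cudf.read_csv("pubmed_*.csv")     # lazily read every parsed csv
abstracts = ddf["Abstract"].compute()        # pull the column into GPU memory
print("Read %d abstracts in %f seconds" % (len(abstracts), time.time() - start))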

On my subset of Pubmed data, this step produced the following output (all of Pubmed throws a memory error — no surprise).

Read 7141779 abstract in 64.332682 seconds

Checking the output of the GPU usage in a separate window with watch -n 0.5 nvidia-smi, you can watch your processes run and monitor the memory usage.

Monitoring GPU Usage on the Workstation

GPU Accelerated Cosine Similarity

Since I now know I can load the last ten years of Pubmed data into GPU memory, I can move on to the fun part: the actual TF-IDF vectorization. In scikit-learn this is pretty easy (see my full CPU implementation here). Using cuml, we should be able to just find and replace like we did with pandas, but unfortunately…

Failure :(

According to this GitHub issue, the text feature extraction libraries for cuml are still in the works at the time of writing (but once finished I’ll update with code!). This means our vectorizer still needs to be implemented with scikit-learn, and we can’t yet get GPU acceleration on this TF-IDF task. It only affects the training step, but it means our TF-IDF vectorizer remains CPU-bound and therefore inefficient.

However, this wasn’t over just yet. Even if the training itself is inefficient, that step really only needs to happen one time. Fortunately, the output of sklearn’s TF-IDF vectorizer is just a sparse matrix — and once we’re back to dealing with matrices, we can get help from some classic tensor libraries. I decided to go with tensorflow.

As one would expect, matrix multiplication is an implicit part of any tensor library. After training my vectorizer in sklearn, I could port the actual vectors back over to the GPU with tensorflow to perform the matrix multiplication.
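
As a hedged sketch of that handoff (the toy corpus and variable names are mine, not the original code): train the vectorizer on CPU with sklearn, then run the similarity matmul on GPU with tensorflow. TfidfVectorizer L2-normalizes rows by default, so the dot products below are already cosine similarities.

import numpy as np
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["deep brain stimulation in parkinson disease",
          "autophagy and lewy body pathology"]             # stand-in abstracts

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus).tocoo()      # sparse TF-IDF, on CPU

# Move the document matrix onto the GPU as a tf SparseTensor
indices = np.column_stack([doc_matrix.row, doc_matrix.col]).astype(np.int64)
sp_docs = tf.sparse.reorder(tf.sparse.SparseTensor(
    indices, doc_matrix.data.astype(np.float32), doc_matrix.shape))

# Encode a query and compute its similarity against every document
query = vectorizer.transform(["parkinsons disease"]).toarray().astype(np.float32)
scores = tf.sparse.sparse_dense_matmul(sp_docs, tf.transpose(tf.constant(query)))
print(scores.numpy())                                      # one similarity per abstract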

Now, in theory, this worked great with small portions of Pubmed — but it doesn’t scale. In all of Pubmed (and even in our subset), there are quite a few unique words. Since we also have one vector for every citation in Pubmed after 2009, our sparse matrices become massive. I think the matrix ended up at roughly 8 million by 1 million.

Yeah big surprise Kyle nice job dude

Thwarted not by the hardware, but by the software. Moving back and forth between sklearn and tensorflow was leading to a host of issues. Realizing this approach would take more time and skill than I had readily available, I could either become a better engineer or move on. It was time to move on to deep learning representations.

Creating a GPU Accelerated BERT Index

Vectorizing Pubmed using BERT

Recent advancements with transformers in NLP have shown massive improvements in a variety of tasks. While numerous models have come since, the origin of this revolution was Google’s BERT. Like some other DL-based models, BERT produces a contextual vector for sentences. The number of dimensions (the length of the vector) is equal to the hidden layer size, which in the latest recommended BERT-large model is 1024.

This is huge. Even if we can’t use sparse matrices anymore, the size of our matrix goes from millions × millions to millions × thousands. On GPUs, where space can be somewhat limited, this makes all the difference.

Normally BERT is used for classification tasks, but in our case, we just want to use it to extract the vectorized representation of our Pubmed abstracts so they can be indexed and searched. Thanks to Tencent Research, we already have a well-engineered, GPU-capable library: bert-as-service.

You can follow the instructions in the repo to actually install the service. Once you have it available in your environment, all you have to do is download your preferred BERT model and start it up.

Download model and start service

Now that you have the service running, simple Python can be invoked to get the vectors for any text you want the BERT representation for.

Vectorize text with bert-as-service
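
A minimal sketch of that client call, assuming the service is already running locally (the example abstracts are placeholders):

from bert_serving.client import BertClient

bc = BertClient()                          # connects to the running BERT service
abstracts = ["Deep brain stimulation improves motor symptoms in Parkinson's disease.",
             "Autophagy is associated with the pathogenesis of Lewy body disease."]
vectors = bc.encode(abstracts)             # one fixed-length vector per abstract
print(vectors.shape)                       # (num_abstracts, hidden_size)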

Easy enough. With the BERT service using the two GPUs on the workstation, large batches of abstracts pass through the model blazingly fast. Below is the output when I time it for each csv:

Vectorized 23727 abstracts in 53.800883 seconds
Vectorized 25402 abstracts in 56.999314 seconds
Vectorized 25402 abstracts in 57.235494 seconds
Vectorized 23575 abstracts in 50.786675 seconds
Vectorized 17773 abstracts in 33.936309 seconds
Vectorized 24190 abstracts in 53.914434 seconds

Even with the workstation this process takes a while — which gives you an idea of how long it takes without it. It’s also worth noting this time includes reading in the data with cudf. To illustrate just how large the gap between GPU acceleration and local compute is, here’s the same process using my personal laptop instead:

Vectorized 13172 abstracts in 2048.069033 seconds

Over 30 minutes. It took my laptop more than half an hour to vectorize half as many abstracts as the workstation processes in < 60 seconds. Even for just getting vectors out of such a large model, GPU compute saves me full days of twiddling my thumbs while my code runs.

Index Using Faiss

This time, instead of doing the matrix multiplication myself, I’m going to hand it off to a well-engineered fast-index library. Facebook’s faiss is easy to use and GPU-capable, making it the perfect tool for indexing our BERT vectors. To create a flat GPU-based index in faiss, we only need ~10 lines of code.
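
A rough sketch of what that looks like; the dimensionality matches BERT-large, and the variable names are illustrative rather than the original code:

import faiss

dim = 1024                                              # BERT-large hidden size
res = faiss.StandardGpuResources()                      # allocate GPU resources
cpu_index = faiss.IndexFlatL2(dim)                      # exact (brute-force) L2 index on CPU...
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)   # ...moved onto GPU 0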

Once you have the index itself, all you have to do is toss the vectors in. To save GPU memory, I recommend vectorizing the text using the BERT service separately first and saving to disk. Then, you can load and index the vectors without the service also cruising in the background. However, you can also do it all at once if you choose.
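
Continuing the sketch above, adding previously saved vectors might look like this (the .npy file name is hypothetical):

import numpy as np

vectors = np.load("pubmed_bert_vectors.npy").astype("float32")   # faiss wants float32
gpu_index.add(vectors)                                           # index every abstract vector
print("Indexed %d abstracts" % gpu_index.ntotal)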

After creating the index itself, searches can be done in a single line. But will this scale? If I wanted to use this code to retrieve results or even put a model into production, I want to make sure that searches run as quickly as possible. I benchmarked searches on up to ~3 million abstracts, and results still came back in < 0.1 seconds.

Even at 2.5M abstracts, the search query time using faiss is still less than 10 ms on the workstation

Finally: a sanity check. This whole time I’ve been performing searches under the assumption that SMEs will be able to evaluate them for accuracy. However, if the searches were so bad as to be pointless, I’d have to start all over again or completely refactor my approach. Fortunately, this isn’t the case.

>>> search_term = "parkinsons disease"  # search parkinsons
>>> search_vector = bc.encode([search_term])  # encode term
>>> distances, indicies = index.search(  # get top 3 results
...     search_vector.astype('float32'),
...     k=3)
>>> for i in indicies[0]:
...     print(text[i][0:500], sep="\n")  # print first 500 char
Deep brain stimulation (DBS) improves motor symptoms in Parkinson's disease (PD), but questions remain regarding neuropsychological decrements sometimes associated with this treatment, including rates of statistically and clinically meaningful change, and whether there are differences in outcome related to surgical target.Neuropsychological functioning was assessed in patients with Parkinson's disease (PD) at baseline and after 6 months in a prospective, randomised, controlled study comparing
Kennedy's disease (KD) is a progressive degenerative disorder affecting lower motor neurons. We investigated the correlation between disease severity and whole brain white matter microstructure, including upper motor neuron tracts, by using diffusion-tensor imaging (DTI) in eight patients with KD in whom disease severity was evaluated using the Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS).From DTI acquisitions we obtained maps of fractional anisotropy (FA), mean diffusivity (
Autophagy is associated with the pathogenesis of Lewy body disease, including Parkinson's disease (PD) and dementia with Lewy bodies (DLB). It is known that several downstream autophagosomal proteins are incorporated into Lewy bodies (LBs). We performed immunostaining and Western blot analysis using a cellular model of PD and human brain samples to investigate the involvement of upstream autophagosomal proteins (ULK1, ULK2, Beclin1, VPS34 and AMBRA1), which initiate autophagy and form autophago

A quick look shows that a contextual search for “Parkinson’s Disease” returns relevant abstracts in the field (to my layman’s evaluation).

So, let’s look back and see whether this approach met all of the requirements for this project:

Multilingual (en/zh) ✅: BERT supports 104 languages!

Available via multiple channels (web/WeChat) ✅: Wrap it up in an API and serve away.

Customizable (able to tune algorithm for optimal results) ✅: I used BERT base, but it’s possible to use Bio-BERT or any other fine-tuned BERT here. Additionally, we can stack lightweight classification algorithms or heuristics on these results to improve accuracy even more.

Low latency (fast search results) ✅: Using almost 1/3 of the Pubmed abstracts, latency was still < 0.1 seconds and looked to be scaling reasonably.

Support large data (all Pubmed abstracts and more) ✅: We only used citations ≥ the year 2009 for validation, but with more GPUs and a better engineer you could easily scale this to all of Pubmed.

Accurate (better search results than Pubmed) 🤔: Remains to be seen. SMEs would have to rate and compare search results against Pubmed search, and tune the algorithm over time. However, with such quick turnaround on ~7 million abstracts, the workstation makes this very feasible. Additionally, while the sanity check lacked scale, it at least shows this approach may be worth exploring.

Conclusion

Information retrieval is huge in large corporations that are overflowing with disorganized documents. Intelligent solutions to retrieve these documents are in high demand. While many vendors offer robust enterprise-grade solutions, organizing information at such a grand scale in such a short period of time is only possible now thanks to the hardware and software advances of the late 2010s.

I created, iterated, and revised my approach to this problem in my spare time over a few weeks. Thanks to the power of the workstation and open source, I actually managed to accomplish my goal in that time. Rather than wait weeks for code to run, I received constant feedback and tackled errors early. As a result, my code and this personal project progressed exponentially faster.

I love this lil dude so much, they have saved me hours of headaches

Since I’m basically already working in a production environment, it’s also easy to transition to more managed cloud hosts for deployment. While my code was nowhere near production code, using Docker allowed me to ensure everything I built could be prepackaged and shipped off to whatever image registry and deployment scheme I liked.

Obviously, $15K is a lot to put down on some hardware. But, if you’re an enterprise organization looking for quick experimentation and turnover, it makes sense. As a comparison, here’s a quote for a dedicated AWS p3.8xlarge (4 Tesla V100s): $75K for 1 year, plus the headache of installing all the libraries and tools yourself.

AWS Pricing on Dedicated GPU Resource

There are solutions to this problem that don’t involve GPUs. Since Elasticsearch now has support for vector scoring, you can deploy the same solution on a 20-node cluster pretty easily, with a lot more bells and whistles than the ~30 lines of code I used here.

Nvidia DGX

However, the efficiency and scale achieved here on just two GPUs should show what’s possible with GPU compute. Accessibility through high-level Python APIs now enables the average data scientist to perform highly optimized tasks with minimal effort. Thanks for reading, and by all means please improve on these solutions!

Shameless plug: working on my Twitter game and feel free to connect on LinkedIn!

