
DataOps is very important in data science, and in my opinion data scientists should pay more attention to it. It is the most neglected practice in data science projects. At the moment we normally version code (with something like Git), and more people and organizations are starting to version their models. But what about data?
In an upcoming article, I’ll cover in detail how to use Git and DVC together with other tools for versioning almost everything that goes into a data science (and scientific) project.
Recently, the DVC project creator Dmitry Petrov gave an interview to Tobias Macey on Podcast.__init__, a Python podcast. In this blog post, I provide a transcript of the interview. You might find the ideas behind DVC interesting, as well as how Dmitry sees the future of data science and data engineering.
You can hear the podcast here:
TL;DR
We need to pay more attention to how we organize our work. We need to pay more attention to how we structure our projects, and we need to find the places where we waste our time instead of doing actual work. It is very important to be more organized and more productive as a data scientist, because today we are still in the Wild West.
The transcript
Disclaimer: This transcript is the result of listening to the podcast and writing down what I heard. I used some software to help me with the transcription, but most of the work was done by my ears and hands, so if you can improve this transcription, please feel free to leave a comment below 🙂
Tobias:
Your host as usual is Tobias Macey, and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects. So Dmitry, could you start by introducing yourself?
Dmitry:
Sure. Hi, Tobias. It is a pleasure to be on your show. I am Dmitry Petrov. I have a mixed background in data science and software engineering. About 10 years ago I worked in academia, and sometimes I say I have worked with machine learning for more than 10 years — but you probably know that 10 years ago machine learning was mostly about linear regression and logistic regression (laughing). That is pretty much what I was working on.
Then I switched to software engineering and wrote production code. At that time, data science was not a thing yet. Around five years ago, I switched back to the quantitative area and became a data scientist at Microsoft, and I saw what modern data science looks like. Recently, I moved back to software engineering. Today we are working on DVC, and I basically build tools for machine learning.
Tobias:
Do you remember how you first got introduced to Python?
Dmitry:
It happened in 2004, I believe. It happened accidentally: I got an internship during my master’s program, and I worked on a Python project. It was my first scripting language — I don’t count Lisp from my functional programming class, of course. It was powered by Python 2.2, and then we switched to 2.3.
I spent a number of sleepless nights debugging Unicode issues (laughing) in Python 2 — if you have worked with Python 2, you probably know what I’m talking about. Later I found myself using Python in every research project I worked on. We were still using MATLAB for machine learning — it was quite popular 10 years ago — and we used plain C for low-level software, for example a driver that tracks some signals, but in general Python was the primary language in my laboratory. When I worked as a software engineer my primary language was C++, but I still used Python for ad-hoc projects and automation.
At that time there was a lot of discussion about Python vs. Perl, and to me it was not a question — I used Python all the time. During my transition back to data science, Python became my primary language, because in data science it is one of the top languages people use today. We build tools for data scientists, and we use Python.
Tobias:
So one of the tools that you’ve been working on is this project called DVC or data version control, can you start by explaining a bit about what it is and how that project got started?
Dmitry:
DVC is basically a version control system for machine learning projects. You can think of DVC as a command line tool, a kind of Git for ML. It gives you three basic things. First, it helps you track large data files — for example, 10 GB data files, 100 GB data sets, ML models. Second, it helps you version pipelines — machine learning pipelines, which are not the same as data engineering pipelines; it is a lightweight kind of pipeline. Third, DVC helps you organize your experiments and make them self-descriptive, self-documented: basically, you know everything about how a model was produced, what commands you need to run, and what metrics were produced as a result of the experiment.
You will have all this information in your Git history, and DVC helps you navigate through your experiments. From a technical point of view, we use Git as a foundation: DVC works on top of Git plus cloud storage. You can use S3, Google Storage, Azure, or just a random SSH server where you store data. DVC basically orchestrates Git and cloud storage.
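The division of labor Dmitry describes — small metafiles in Git, big blobs in remote storage — can be sketched in a few lines of Python. This is an illustrative toy, not DVC’s actual implementation: the `track` helper and the `.meta` file format are invented for the example, and a local directory stands in for S3 or an SSH server.

```python
import hashlib
import json
import shutil
from pathlib import Path

def file_md5(path: Path) -> str:
    """Hash the file contents, streaming so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def track(data_file: Path, remote: Path) -> Path:
    """Copy the data to a content-addressed 'remote' and write a tiny
    pointer file that is safe to commit to Git."""
    digest = file_md5(data_file)
    remote.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, remote / digest)  # the big blob goes to storage
    pointer = data_file.with_suffix(data_file.suffix + ".meta")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer  # only this small file goes into Git
```

Git then versions only the small pointer file, while the blob is retrieved from storage by its hash — the same split Dmitry describes, with cloud storage in place of the local `remote` directory.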
You also asked how DVC started. That is actually a separate story, because initially, when we started the project, we were not thinking about Git for ML. What we were thinking about was the data science experience and best practices, and we were inspired by the ideas of data science platforms. You have probably heard about Michelangelo, Uber’s data science platform, and about Domino Data Lab.
Today, actually, every large technology company has its own data science platform, because they need to work somehow in large teams, right? They need to collaborate, and the software engineering toolset is not perfect for ML or data science projects. When we looked at those platforms, we were thinking: how should the platform of the future look? How can we create a platform which data scientists can use, and which can be widely adopted, highly available? We came up with the principles of the data science platform of the future. It has to be open source — that is the way it can be community driven, the way you can create a common scenario which every data scientist can use.
And it has to be distributed. That is important because sometimes you work on your desktop, because you have a GPU, for example; sometimes you still need your laptop; and sometimes you need cloud resources to run a specific model, for example when you need a huge amount of memory. It is important to be everywhere.
With these principles, we came up with the idea: why don’t we reuse Git as the foundation for a data science platform, for our tool? Then we realized it is very close to the idea of versioning — we just need to version datasets, we need to version experiments. This is how DVC got started.
Tobias:
You mentioned that the traditional version control systems that are used for regular software projects aren’t quite sufficient for machine learning because of the need to track the data sets along with the code itself and some of the resulting models. So can you walk through the overall workflow of working with DVC, and how that differs from a regular Git workflow for a software engineering project that might be focused on something like a Django application?
Dmitry:
First of all, we need to understand the difference between software engineering and machine learning. Machine learning is driven by experiments. The problem is, you have a bunch of experiments — hundreds, sometimes thousands — and you need to communicate those experiments to your colleagues, and basically to yourself, because tomorrow you won’t remember what happened today. In one month, there is no way you will remember what you did today and why this idea produced such a poor result; you need to track everything. You can be in the situation where one of your colleagues comes to your office and says — hey, you know what? I spent two days trying this idea, and it doesn’t work. And you’re like — yeah, I tried the same idea two weeks ago; I know it doesn’t work. It is hard to communicate these ideas because you have dozens of them, sometimes hundreds. And this is the difference: you need a framework to communicate a huge number of ideas, a huge number of experiments.
In software engineering, workflows are different. You have a limited number of ideas: feature requests and bug fixes. I can make one controversial statement — in software engineering, you almost never fail. If you decide to implement some feature or fix a bug, you create a branch, and in 9 cases out of 10 the branch will be merged to the mainline. You can fail in terms of the quality of the software you produced, or in terms of budget — because you thought it was a one-day project and you end up spending two weeks implementing the feature — but finally, it will be merged.
In data science, you try 10 ideas; one works at the very best, and the rest don’t. But you still need to communicate all of them. DVC basically helps you do this: it helps you version your experiments and their metrics. You see the metrics that were produced, and you see the code that was used for this particular experiment with this particular version of the data set.
This works way better than, for example, an Excel spreadsheet — because today many teams use just Excel spreadsheets to track their experiments. So this is the basic difference: self-documented experiments, and a clear way of collaborating on experiments and looking into the results.
Data versioning is kind of a separate thing, which is important for experiments — a must-have for experiments. Sometimes people use DVC just for data versioning. For example, engineers who work on the deployment side use DVC to deploy models, and sometimes data engineers and data scientists use DVC just to version data sets. Sometimes people use DVC as a replacement for Git LFS, because Git LFS has some limitations and DVC was optimized for dozens or hundreds of gigabytes.
Tobias:
For the types of experiments that you’re working with, I know that some of the inputs can be in the form of feature engineering, where you’re trying to extract different attributes of the data, or changing the hyperparameters that you’re tuning to try and get a certain output of the model. So can you discuss how DVC is able to track those, and some of the communication paradigms that it enables to make sure that you don’t have that replicated effort or loss of time and energy, with people duplicating the same work because of not being able to identify ahead of time that a particular set of features, hyperparameters, and data inputs has already been tried together?
Dmitry:
Oh, from the data file and data set point of view, DVC just tracks them. DVC treats every file as just a blob; it usually doesn’t go into the semantics or the structure of your data files. The same for a model — models are just binaries. What DVC can understand is that the model changed; it can understand that this particular version of a binary file was produced at this particular step and that this particular input was consumed. But it doesn’t look inside the features; it doesn’t go into the semantics of the data.
Tobias:
As far as the overall communication pattern, what are some of the things that a data scientist would be looking at as they’re working within a project that’s managed through DVC to make sure that they’re not duplicating that effort — any sort of signals that they would be identifying that would be lost otherwise when using just Git by itself?
Dmitry:
First of all, the structure of the experiment. This is what is super important in data science — I mean, not only in data science; in engineering the case is pretty much the same. You need to understand what code was used, what exact version of your code, and what exact version of the data set was used. This is how you can trust an experiment that you did, let’s say, three weeks ago.
One more thing is metrics. If you know that this code with this version of the data set produces that particular value of the metric, it creates trust — you don’t need to redo the job again. If you only see the result in your Excel spreadsheet, there might be some discrepancy, right? There can be an error in your process, and you might end up redoing the same stuff.
For documentation, you can use just Git commits. When you commit your result — which means you commit the pipeline, a version of your data set, and the output result — you put a message. A message in Git is a very simple form of documentation, and it is actually very powerful. What DVC does is basically make this work for data projects. Because in a regular workflow you can commit code, but you don’t have a connection with the data, you don’t have a clear connection with your output metrics, and sometimes you don’t even have a connection with your pipelines — because a commit may be related to one particular change in one particular part of the pipeline, not the entire pipeline.
Tobias:
And as far as the metrics that you’re tracking, can you discuss a bit about how those are captured, and some of the ways that they factor into the overall evaluation and experimentation cycle to ensure that the model is producing the types of outputs that you’re desiring in the overall project?
Dmitry:
DVC is a data-format-agnostic and framework-agnostic tool, and it is metrics agnostic as well. That means we track metrics as simple text files: your metrics are TSV files with a header, or CSV files, or JSON files. You just output a file with metric names and values, and DVC lets you look across your entire set of experiments at which metrics were produced. You can find that an idea which failed — because one metric was not good — actually produced a better value of another metric, and you probably need to deep-dive into that experiment.
So it shows you metrics across your branches, across your commits, and it basically helps you navigate through your complicated Git tree by metrics. It is a data-driven way to navigate through your Git repository. In a data project, you might have thousands of commits and hundreds of branches; DVC basically helps you navigate through this complicated structure. This is why we need metrics, and why DVC has special support for metrics.
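Since the metrics files Dmitry describes are plain text, comparing experiments can be as simple as reading them back and ranking. A toy sketch — the per-experiment JSON file layout and the `best_experiment` helper are invented for illustration, not part of DVC:

```python
import json
from pathlib import Path

def best_experiment(metrics_dir: Path, metric: str, maximize: bool = True):
    """Scan <experiment>.json files and return (name, value) of the best run."""
    scores = {}
    for f in metrics_dir.glob("*.json"):
        values = json.loads(f.read_text())
        if metric in values:
            scores[f.stem] = values[metric]  # file stem names the experiment
    if not scores:
        raise ValueError(f"no experiment reports metric {metric!r}")
    pick = max if maximize else min
    name = pick(scores, key=scores.get)
    return name, scores[name]
```

In real DVC the same idea is surfaced through commands that display metrics across branches and commits, rather than through a helper like this.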
Tobias:
Is there built-in support for being able to search across those metric values and compare them across branches? Or is that something that you would need to implement on a per-project basis, as far as a specific function call that determines, you know, whether it’s a positive or negative value across those different comparisons?
Dmitry:
Today, we basically just show the metrics. If you need to navigate, you need to implement something yourself, but I won’t be surprised if we implement some metric-specific logic — for example, “show me the max value of this particular metric,” or “show me the max value of some combination of those metrics,” something like this.
Tobias:
In addition to version control, which is typically used for regular software applications, there are also a number of other types of tooling that are useful to ensure that the projects you’re building are healthy and that you don’t have code regressions — things along the lines of linting or unit test support. I’m wondering what some of those adjacent concerns are that should be considered when you’re building machine learning projects, and whether there are any ways that the work you’re doing, either with DVC or any of your other projects, ties into that.
Dmitry:
It’s a good question, because in general the toolset for data projects is not in the same state as the toolset for software projects. There was an interview with Wes McKinney about a month ago in which he said that in data science we are still in the Wild West. This is actually what is happening (laughing): we don’t have great support for many scenarios.
But from the tooling point of view, what I am seeing today is that it has become quite mature in terms of algorithms, because we have PyTorch, we have TensorFlow, and we have a bunch of other algorithms, like tree-based methods. Today there is a race of online monitoring tools — for example TensorBoard, where you can report your metrics online while you train and see what is actually going on in the training phase. This is especially important for deep learning, because the algorithms are still quite slow, I would say. There are a bunch of commercial products in this space, and MLflow, one of the open source projects that is becoming popular, helps you track your metrics and visualize the training process. This is one trend today.
Another trend is how to visualize your models, how to understand what is inside them. Again, there are a bunch of tools for this, but their state is still not perfect. In terms of unit tests, you can use just a regular unit test framework, but I couldn’t say it works really well for ML projects specifically. What I have seen many times are unit tests — or probably not unit tests, but functional tests — for data engineering: when a new set of data comes into your system, you compute basic metrics and make sure there is no drift in the metrics, no big changes. This is how tests can work in the data world. But tools in general are still in the Wild West.
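The functional tests for data that Dmitry describes can be as simple as comparing summary statistics of a new batch against a baseline and failing on large drift. A minimal sketch — the 10% tolerance and the choice of mean and standard deviation here are arbitrary, picked only to make the idea concrete:

```python
from statistics import mean, stdev

def check_drift(baseline, new_batch, rel_tol=0.1):
    """Compare basic statistics of a new data batch against a baseline
    sample; return a list of human-readable problems (empty = OK)."""
    problems = []
    for name, stat in (("mean", mean), ("stdev", stdev)):
        old, new = stat(baseline), stat(new_batch)
        if old and abs(new - old) / abs(old) > rel_tol:
            problems.append(f"{name} drifted: {old:.3f} -> {new:.3f}")
    return problems
```

A pipeline step would run this on each incoming batch and stop (or alert) when the returned list is non-empty.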
Tobias:
Moving on to the data versioning that is built into DVC: as I was reading through the documentation, it mentions that the actual data files themselves are stored in external systems, such as S3, Google Cloud Storage, or something like NFS. So I’m wondering if you can talk through how DVC actually manages the different versions of that data, any kind of incremental changes that it is able to track, and any difficulties or challenges that you faced in the process of building that system and integrating it with the source control that you use Git for in the overall project structure.
Dmitry:
Yeah, of course, we don’t commit data (laughs) to the repository. We push data to your servers, to your clouds, usually, and you can configure where it goes. As I said before, we treat data as binary blobs. For each particular commit, we can bring you the actual datasets and all the data artifacts that were in use. We don’t do any per-file diffs, because you need to understand the semantics of a file in order to diff it. It’s not like Git, where it makes sense to diff every file.
In data science, you would need to know the exact format of the data file. However, we track directories as a separate type of structure. If you have a directory with, let’s imagine, 100,000 files, and then you add a few more files to the directory and commit this as a new version of your data set, we understand that only a small portion of the files changed — let’s say 2,000 files were modified and 1,000 were added — and we version only the diff. So you can easily add new labels to your dataset on a weekly basis without any concern about the size of your directory. We do this kind of optimization.
Another important optimization that we do is optimizing your workspace. When Git checks out files from its internal structure, it creates a copy in your workspace. In the data world, sometimes that just does not make sense, because you don’t want to create another copy of, let’s say, 100 gigabytes of data. We optimize this process by using references, so you don’t have duplication of datasets, and you can easily work with dozens or hundreds of gigabytes without these concerns.
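The “reference instead of a copy” idea Dmitry mentions can be sketched with hard links: the workspace file shares the same bytes on disk as the cache entry, so “checking out” a huge file costs almost nothing. This is a toy illustration; DVC’s real link strategies (reflinks, symlinks, hardlinks, copies) are more involved.

```python
import os
import shutil
from pathlib import Path

def checkout(cache_file: Path, workspace_file: Path) -> None:
    """Materialize a cached blob in the workspace without copying it.
    A hard link points at the same data blocks, so no extra disk is used."""
    if workspace_file.exists():
        workspace_file.unlink()
    try:
        os.link(cache_file, workspace_file)  # instant, zero-copy
    except OSError:
        # hard links don't work across filesystems; fall back to a copy
        shutil.copy2(cache_file, workspace_file)
```

One caveat of hard links (and a reason real tools protect the cache) is that editing the workspace file in place would also change the cached copy, so linked files should be treated as read-only.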
Tobias:
For somebody who is onboarding onto an existing project and checking out the state of the repository for the first time, is there any built-in capacity for saying “I only want to pull the code, I’m not ready to pull down all the data yet,” or “I just want a subset of the data”? Somebody who’s working on a multi-hundred-gigabyte dataset doesn’t necessarily want all of that on their laptop as they’re working through these experiments. I’m just curious what that overall workflow looks like as you’re training the models and working locally — how it handles and interacts with these large data repositories to make sure that it doesn’t just completely fill up your local disk.
Dmitry:
This is a good question. We do granular pulls; we optimized this as well. As a data scientist, you can decide exactly what you need. For example, if you’d like to deploy your model, which is probably within 100 megabytes, you probably don’t need to waste time on the 100 GB data set that was used to produce the model.
You can specify any of the data files. You clone a new repository of yours, with the code and metadata, and then say “I just need the model file.” Only the model file will be delivered to your production system, to your deployment machine — and the same goes for datasets.
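A granular pull like the one Dmitry describes falls out naturally from the pointer-file scheme: the Git clone carries only the small metafiles, and you fetch just the blob you ask for from the content-addressed remote. Again a toy sketch — the `pull` helper and the JSON metafile format are invented for this example:

```python
import json
import shutil
from pathlib import Path

def pull(pointer_file: Path, remote: Path, dest_dir: Path) -> Path:
    """Fetch a single artifact named by its pointer file, leaving every
    other blob in the remote untouched."""
    meta = json.loads(pointer_file.read_text())
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / meta["path"]
    shutil.copy2(remote / meta["md5"], target)  # fetch just this one blob
    return target
```

A deployment machine would run this for the model’s pointer file only, never touching the training data blobs sitting in the same remote.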
Tobias:
There are some other projects that I’ve talked about with the people who are building them, such as Quilt or Pachyderm, that have built-in support for versioning data. I’m wondering if you have any current plans to integrate with those systems, and what the overall process is for adding additional support for the data storage piece of DVC.
Dmitry:
For some of the systems, integration can be done easily. For example, Pachyderm is a project that is mostly about data engineering: they have a concept of pipelines, a kind of data engineering pipeline. DVC can be used inside a data engineering pipeline. It has its own notion of ML pipelines, a lightweight concept optimized specifically for machine learning engineers; it doesn’t have all the complexity of data engineering pipelines, but it can easily be used as a single step in an engineering pipeline.
We have seen that many times, when people take, for example, a DVC pipeline and put it inside Airflow as a single step. This kind of design is actually a good design, because you give a lightweight tool to ML engineers and data scientists so they can easily produce a result and iterate faster, and then the DVC pipeline can easily be injected into the production system.
There is a term — I don’t remember what company uses it, Netflix probably — “DAGs inside a DAG”: you have a DAG of data pipelines, you have an ML DAG for one particular problem, and you basically inject a lot of ML DAGs inside the data engineering DAG. So from this point of view, there is no problem integrating with Pachyderm or Airflow or other systems. Regarding Quilt, they do versioning and work with S3 as well; potentially we could integrate with them. We are thinking about this — we try to be customer driven. The biggest need today is probably integration with MLflow, because MLflow shines really well on the online metrics tracking side. Sometimes people like to use MLflow for tracking metrics online and DVC for versioning data files. This is one of the integrations we are thinking about today.
Tobias:
In terms of the actual implementation of DVC itself, I know that it’s primarily written in Python, and you mentioned that’s largely driven by the fact that Python is becoming the lingua franca for data science. So I’m wondering, now that you have gone a bit further in the overall implementation and maintenance of DVC, whether you think that is still the right choice — and if you were to start over today, what are some of the things that you would do differently, either in terms of language choice, system design, or overall project structure?
Dmitry:
I believe Python is a good choice for this kind of project, for two major reasons. First, we are targeting data scientists, and most of them are comfortable with Python. We expect data scientists to contribute to our code base. If you write this kind of project in, let’s say, C or C++, or Golang, you probably won’t see a lot of contributions from the community, because the community speaks a different language. For us it works perfectly — data scientists are contributing code, which is great (laughs).
The second reason was programmatic APIs. From the beginning, we were thinking about exposing DVC through APIs as another way of using it. If you write your code in Python, that kind of comes out of the box: you can reuse DVC as a library and inject it into your project. If you use a different language, it just creates some overhead — you need to think about how to expose it in a nice form. Those were the reasons. So far we are happy with Python, and it works nicely.
Tobias:
You mentioned being able to use DVC as a library as well. So I’m wondering if there are any use cases that you’ve seen that were particularly interesting, unexpected, or novel — either in that library use case, or just in the command-line-oriented way that it was originally designed.
Dmitry:
Sometimes people ask for library support because they need to implement some crazier scenarios (laughing). For example, people use DVC to build their own platforms — data science platforms, if you wish — or continuous integration frameworks, where DVC plays the role of glue between your local experience and the CI experience, and they ask for a library. But we have such good command line support that people just switch back to the command line experience. One day, though, I won’t be surprised if someone uses DVC purely as a library.
Tobias:
I’m also interested in what you were talking about — integrating DVC into data engineering pipelines by wrapping the model training piece as a single step. So I’m wondering if you can talk a bit more about that and some of the specific implementations that you’ve seen.
Dmitry:
Yeah, absolutely. This is actually a really good question. I believe that data engineers need pipelines, right? Data scientists and machine learning engineers need pipelines too. But the fact is, their needs are absolutely different. Data engineers care about stable systems: if something fails, the system needs to do something, it has to recover. This is the primary goal of a data engineering framework.
In data science, it works kind of the opposite way. You fail all the time (laughing): you come up with some idea, you write code, you run the code, it fails, you fix it, it fails again, etc. Your goal is to have a framework that lets you check ideas fast, fail fast. This is the goal of ML engineers. It is good practice to separate the two kinds of pipeline frameworks: one stable, for engineering, and a second fast, lightweight one for experimentation, if you wish.
When you separate these two worlds, you simplify the life of ML engineers a lot. They don’t need to deal with complicated stuff, they don’t need to waste time understanding how Airflow or Luigi works; they just live in their world and produce models, and once a model is ready, they have a clear way to inject their pipeline into the data pipelines. You can build a very simple tool to do this. I remember when I worked at Microsoft, it took me maybe a couple of hours to productionize my pipeline, because I had a separate workflow, a separate tool for ML pipelines. This works nicely. I believe in this kind of future: in engineering, we need to separate these two things.
Tobias:
I’m also interested in the deployment capabilities DVC provides, as far as being able to put models into production or revert the state of a model in the event that it’s producing erroneous output, or that the predictions it’s providing are causing problems for the business — and just the overall tooling and workflow involved in running machine learning models in production, particularly as far as metrics tracking to know when a model needs to be retrained, closing the loop of the overall process of building and deploying these models.
Dmitry:
Yeah, deployment — we are waiting to get to it, because it’s close to the business. There’s a little funny story about ML model deployment. Sometimes it goes like this: a software engineer asks the data science team, “Can we do a review of our model — not the previous one, but the model from the week before?” And the data scientists are like, “Yeah, sure, we have the dataset, and you need five hours to retrain it” (laughs).
It does not make any sense to spend five hours just to review a model. In the software engineering world, it does not work this way. You need to have everything available and be able to review it right away, because waiting five hours means wasting money for the business.
DVC basically helps you organize this process. It creates a common language between the data scientists who produce a model and the ML engineers who take the models and deploy them. So next time, with proper data management, you won’t even need to ask a data scientist for a previous model — you should have everything in your system. With DVC or without DVC, it doesn’t matter: what you need is to have all the artifacts available.
From the metrics tracking point of view, this is actually a separate question, because when you’re talking about metrics tracking in production, it usually means online metrics — metrics based on feedback from users. That is a separate topic. DVC is not really about deployment; it’s mostly about the development phase, and it does basically nothing with online metrics.
Tobias:
So you are building and maintaining this project under the auspices of iterative.ai, which is a venture-backed company. I’m curious what the overall value equation is for your investors that makes it worthwhile for them to fund your efforts on building and releasing this open source project, and what the overall strategy is for the business.
Dmitry:
You would be surprised (laughs) how interested investors are in open source projects today; last year especially was super successful for open source. Last year MuleSoft was acquired for billions of dollars, and Elastic went IPO — a purely open source company.
And when you do open source, it usually means that you are doing IT infrastructure, and in many cases IT infrastructure is good for monetization. Around a successful open source project there are a bunch of companies monetizing it. It is very important to understand your business model, because with open source there are a few common models.
One is the service model, a kind of consultancy model. The second is the open core model, where you build software and sell a version with advanced features, or a different edition of your software, as a product for enterprises.
And the third model is the ecosystem model, where you build an open source product and create services around it as a separate product. One example might be Git and GitHub: there is the open source project and a SaaS service, which is an absolutely different product with absolutely different experiences and use cases. You need to understand which model you fit in. Around a successful project, there will be a lot of people interested in this kind of business.
Initially, I started the project as my pet project, for about a year. And then I was thinking: how to make something big out of this, how to spend more time on it, how to find more resources to do it. It was clear that if this project is successful, there will be a few businesses monetizing this area. Why shouldn't we be the business that builds the product and monetizes it? So it's a natural path in the modern open source world, I would say.
Tobias:
As far as the overall experience of building and maintaining the DVC project and the community, I’m wondering what you have found to be some of the most interesting or challenging or unexpected lessons learned in the process.
Dmitry:
One of the lessons I learned is, I think, a usual business lesson, actually. When you build your project, you know what you are doing, you know your roadmap, you know your goal, and you're just building. But one day users come to you, and they ask for a lot of different stuff.
Then you have tension between your vision and your plans on one side, and the demands from the user side. And today we are at the point where every day we get a few requests from users; sometimes we have had like 10 requests per day (laughing). It's not easy to balance things, because if you do everything people ask, you have no time for your roadmap, and actually you have no time to fix and implement everything people ask for anyway. So you need to learn how to prioritize. You need to learn how to say no to users sometimes, to say: we will do this, but probably not right now. This is not easy to do; it's something you learn in the process. And as I said, the experience in open source is much the same as in a business environment. I have seen that many times.
Tobias:
Looking forward in terms of the work that you have planned for DVC, I'm curious what types of features or improvements you have on the roadmap.
Dmitry:
Features for the near future: we are going to release better support for dataset versioning and ML model versioning. We are introducing new commands into DVC which simplify the experience. Today some companies are using mono-repos with a bunch of datasets inside, and we need new commands to handle this better, because these datasets evolve at different speeds. Sometimes you need to work with one version of one dataset and another version of another. So basically, this is one of the steps we are taking.
And another use case for datasets is cross-repository references. Other companies are not using a mono-repo; they use a set of repos. For example, they might have 10 or 20 repos with datasets and 20 more with models, and they need to cross-reference the datasets. This is the next command we are going to implement, to support these cross-reference, cross-repository scenarios. This is our near future.
And in the longer-term vision, the next step is implementing more features for better experiment support, especially when people deal with scenarios such as hyper-parameter tuning. They need to run, let's say, 1000 experiments, and they still need to control them; they don't want to have 1000 branches (laughing). This is the experience we need to improve, and we have a clear plan for how to do it. That is pretty much the plan for the next half a year. Eventually, we believe DVC can be a platform where people in one team can work in the same environment and share ideas with each other. In the future, we believe we can create a great experience where people can share ideas even between companies and between teams.
This is the big future of DVC that I believe in.
Tobias:
Are there any other aspects of the work that you're doing on DVC, or the overall workflow of machine learning projects, that we didn't discuss yet that you think we should cover before we close out the show?
Dmitry:
I don't think I have anything to add. But what I believe is that we need to pay more attention to how we organize our work. We need to pay more attention to how we structure our projects; we need to find the places where we waste our time instead of doing actual work. It is very important to be more organized and more productive as data scientists, because today we are still in the Wild West, and this needs to change as soon as possible. It is important to pay attention to this, and important to understand this problem set.
Tobias:
All right. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose a tool that I started using and experimenting with recently, called otter.ai.
It's billed as a voice note-taking service that will transcribe your meeting notes, or just mental notes to yourself, into text so that they're searchable. I've been experimenting with using it to generate transcriptions for the podcast, and I'm looking forward to using it more frequently and adding transcripts to the show. So it's definitely worth checking out if you're looking for something that does a pretty good job of generating transcripts automatically and at a reasonable price. With that, I'll pass it over to you, Dmitry. Do you have any picks this week?
Dmitry:
So I thought the open source part, open source vs. venture capital, was the question we would discuss. Actually, I have nothing special to suggest, but the weather is nice today and spring has just started. So just spend more time outside, walking around your city or town.
Tobias:
All right, well, thank you very much for taking the time today to join me and discuss the work that you're doing on DVC and adding better structure to the overall machine learning project life cycle. So thank you for that. I hope you enjoy the rest of your day.
Dmitry:
Oh, thank you, Tobias. Thank you.
Conclusion

We need testing and versioning to understand what we're doing. When we are programming, we do a lot of different things all the time: we test new ideas, try new libraries, and more, and it's not uncommon to mess things up. That's the path of data science: fail fast and iterate to success.
DVC is one of the best tools we have at the moment for version-controlling your data projects, and as you can see, it can be combined with other great tools. I hope this interview and article helped you get started with them.
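To give you a taste before the detailed article, here is a minimal sketch of the basic Git + DVC workflow. It assumes you have `dvc` installed, a file called `data.csv` in your project, and (for the last step) a DVC remote configured; the remote path is just an illustrative example.

```shell
# Initialize Git and DVC in a fresh project
git init
dvc init

# Track the dataset with DVC instead of Git. This writes data.csv.dvc,
# a small metafile that Git can version while DVC stores the data itself.
dvc add data.csv

# Commit the metafile (and the .gitignore entry DVC created) to Git,
# so the dataset version travels with your code history
git add data.csv.dvc .gitignore
git commit -m "Track data.csv with DVC"

# Push the data to remote storage (S3, GCS, SSH, a local directory, ...)
# after configuring a default remote, for example:
dvc remote add -d myremote /tmp/dvc-storage
dvc push
```

From here, `git checkout` plus `dvc checkout` lets you move between versions of both code and data together, which is exactly the gap Dmitry describes in the interview.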
If you have any questions please write me here:
Favio Vazquez – Founder / Chief Data Scientist – Ciencia y Datos | LinkedIn
Have fun learning 🙂