A Journey in Learning

Data Versioning

Recently I was tasked with building a system that would automatically deploy pipelines using Kubeflow and Pachyderm. This system needed data versioning to ensure that results would be reproducible in the future. Data Version Control (DVC) is one tool for this, and its core ideas can be summed up in the diagrams below.

Data Version Control (DVC) has many features. Some of the most relevant include:

  • Versioned Data Collaboration lets multiple teammates share and modify models and data files in an agile way.
  • Data Registries allow users to import saved models or pull data from a single location.
  • Data Deduplication reduces network load when multiple instances require the same data to function. This is accomplished by serving DVC data from a locally attached network file system.

DVC: Versioned Collaboration

  • Distributed Collaboration eliminates the issues with sharing model data between different local development environments.
  • Each local environment has access to separate repositories that house data and code. Changes to the data are referenced by metadata files stored in the code repository.
  • Model data, being much larger than code, can be stored in hybrid environments (on-premises or cloud), giving developers a scalable and flexible way to pull data into a remote testing environment.
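The collaboration workflow above can be sketched with DVC's command-line interface. The file paths and commit message here are hypothetical; the key point is that Git tracks only the small `.dvc` metadata file while DVC pushes the actual data to remote storage:

```shell
# Track a large data file with DVC instead of Git (paths are illustrative)
dvc add data/training_set.csv

# Git tracks only the small .dvc metadata file that DVC generates
git add data/training_set.csv.dvc .gitignore
git commit -m "Track training data with DVC"

# Push the actual data to the configured remote storage, and the code to Git
dvc push
git push
```

A teammate then runs `git pull` followed by `dvc pull` to retrieve exactly the data version referenced by that commit.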

DVC: Data Registry

  • Data Registries act like a package management system for data and models in our pipeline.
  • This system acts as a data management middleware between machine learning projects and cloud storage. 
  • With this registry a user can pull data or import models when necessary. This is especially important when a project depends on data from another repository, or when rerunning a specific part of a pipeline during testing.
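Pulling from a registry can be sketched with DVC's `get` and `import` commands. The repository URL, file path, and tag below are hypothetical placeholders for whatever your registry actually contains:

```shell
# Download a one-off copy of a dataset from a registry repo
# (no dependency is recorded in the current project)
dvc get https://github.com/example/data-registry datasets/images.zip

# Import the same artifact, recording its source and revision in a
# .dvc file so the dependency is versioned and can be updated later
dvc import https://github.com/example/data-registry datasets/images.zip \
    --rev v1.2 -o data/images.zip
```

`dvc import` is the registry-style option: the pinned revision makes it explicit which version of the upstream data the pipeline was run against.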

DVC: Data Deduplication

  • Containerized deployments may have multiple models running on the same set of data. Data and model artifacts are served through local caching servers.
  • This process is termed Data Deduplication, and it prevents unnecessary network congestion, allowing the production environment to scale elastically.
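A minimal sketch of the shared-cache setup that makes this deduplication work, assuming a network file system is mounted at a path like `/mnt/nfs/dvc-cache` (the mount point is an assumption for illustration):

```shell
# Point this project's DVC cache at a shared network mount
dvc cache dir /mnt/nfs/dvc-cache

# Allow group members (e.g. several containers) to share cached files
dvc config cache.shared group

# Link workspace files out of the cache instead of copying them,
# so identical data is stored once regardless of how many consumers use it
dvc config cache.type symlink
```

With this configuration, each container links against the single cached copy rather than pulling its own duplicate over the network.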

Eventually this technique of data versioning became a component in a larger agile deployment of machine learning pipelines.

Note: all graphics in this article relating to DVC were taken from the DVC website dvc.org.