Customised solutions are crucial in tackling any machine learning challenge, as there is no universal fix that can be applied to every problem. While the development of an effective solution tailored to specific circumstances can be demanding in terms of time and resources, it is essential to invest in it for optimal results.
Businesses often face the challenge of monitoring experiments to achieve optimal results. However, manually tracking all the data generated by these experiments can be a time-consuming task. Additionally, managing significant amounts of data on a spreadsheet manually only compounds the difficulty.
To overcome this challenge, we rely on tools that other organisations have already built for tracking experiments rather than on manual bookkeeping. To keep everything in one place, we are integrating DagsHub and DVC (Data Version Control) into our workflow.
This article explains what DagsHub and DVC are and how DagsHub simplifies running and comparing experiments for machine learning novices. We will use the well-known Iris dataset to train our model and run a series of experiments based on ensemble models.
That is only the starting point. To understand the outcomes more fully, we will use DagsHub’s dashboards to display and compare the results. Before we continue, let us take a quick look at FastDS, DVC, and DagsHub.
What exactly is DagsHub?
DagsHub is a platform built on open-source tools, namely Git and DVC, that serves as a central repository for data science projects. We use it extensively at MindMeld to maintain a central model registry, so that anyone, regardless of location, can access our models quickly. It works much like GitHub, but with features aimed at collaboration among data scientists: experiments, code, data, and models can all be shared in one place.
To sum it up, this platform caters specifically to the open-source community by offering a user interface that closely resembles GitHub’s interface. This enables users to easily share, explore, and reuse completed projects without any hassle.
The DagsHub platform offers a wide range of features, including experiment tracking, MLflow integration, and pipeline visualisation. For machine learning developers, its main benefit is that code can be versioned alongside shared data, models, and experiments. Here are a few of its features.
- ML Comparison Metrics
- Data Verification and Rollback
- Machine Learning Experiment Logging
- Visualisation and Integration of ML Flows and Pipelines
- Integration of DVC and Git
- Label Annotation Using Tools Such as Label Studio
These features make DagsHub a valuable tool for Machine Learning Operations (MLOps). Its primary objective is to help data scientists and AI developers streamline their workflow so that they reach their goals more efficiently. In particular, DagsHub simplifies:
- Archiving Experiment Data
- Data Pipeline Development
- Repository Copy Creation
Next, let us look at DVC and its role in machine learning.
So, what exactly is DVC?
Data Version Control (DVC) is a Python-based tool for versioning datasets and machine learning models. It works alongside the well-known version control system Git and lets you track data and model files without disturbing your regular workflow. With DVC, Git stores small metadata files that describe each version of your data, while the data itself is kept in separate storage. At MindMeld, we use DVC to track the progress of our trained models.
DVC’s commands closely mirror Git’s, so anyone familiar with Git can get up to speed with DVC quickly. To use DVC effectively, however, you need remote storage for the data, and DagsHub makes this easier to set up.
DVC is distributed as a Python package, so it can be installed with pip (for example, pip install dvc).
Apart from DagsHub and DVC, you might encounter the term FastDS. According to the official FastDS website, it is an open-source command-line tool that integrates with Git and DVC. It is intended to reduce the risk of human error, automate routine tasks, and make it easier for new users to get started.
In essence, FastDS streamlines code and data versioning for machine learning practitioners.
DagsHub: A Comprehensive Guide to Monitoring Your ML Experiments
If you’ve been using a spreadsheet to manage and document your machine learning experiments, DagsHub might be a more suitable solution for you. With DagsHub, you can avoid the risk of errors that can occur when manually keeping track of numerous parameters.
As a data scientist or machine learning practitioner, you are likely to find DagsHub, an online platform built using open-source technologies, very valuable. This blog post introduces the platform and shows how to monitor your experiments and visualise the results graphically.
Alright, let’s get started.
We assume that you have a basic grasp of both scikit-learn and Git. We first cover the necessary background and then explain how DagsHub can be used to track your experiments.
Establishing a Data Repository
In the following example, we set up the directory structure that is used throughout this tutorial. Follow the steps outlined below to get started.
As shown in the provided directory tree, the file named ‘model.ipynb’ is used to produce the models. The working directory also includes two subfolders, one named ‘data’ and the other named ‘models’. The dataset used in this experiment is in the ‘data’ folder, and the ‘models’ folder is where the pickle files for the models developed in this experiment are saved.
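The same layout can also be created programmatically. The following is a minimal sketch using only Python’s standard library. The root folder name ‘dagshub-tutorial’ is an assumption borrowed from the repository name used later in this post, and the empty ‘model.ipynb’ is only a placeholder until the notebook is created in Jupyter.

```python
from pathlib import Path


def create_layout(root):
    """Create the tutorial's working directory and return its sorted contents."""
    root = Path(root)
    # One folder for the dataset, one for the pickled models.
    (root / "data").mkdir(parents=True, exist_ok=True)
    (root / "models").mkdir(parents=True, exist_ok=True)
    # Placeholder only: an empty file is not a valid notebook yet.
    (root / "model.ipynb").touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())


# "dagshub-tutorial" is an assumed folder name, not taken from the original tree.
print(create_layout("dagshub-tutorial"))
```

Because every call uses exist_ok=True, the function can be re-run safely on a directory that already exists.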
Creating a Technique to Efficiently Produce Experimental Models
Our aim is to use the Iris dataset, in which iris plants are classified into one of three species based on their physical traits, to generate models for our chain of experiments. Each sample records four measurements: sepal length, sepal width, petal length, and petal width.
In the next step, the dataset is divided into features and labels and then split into a “train” set and a “test” set. We use the “train_test_split” function from scikit-learn to hold out 30% of the available data for evaluating the model’s accuracy.
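The split described here might look roughly as follows. This is a sketch under two assumptions: it loads scikit-learn’s bundled copy of the Iris data instead of the file in the ‘data’ folder (whose name is not given in this post), and it fixes random_state to 42 purely to make the split repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled Iris data rather than a file in data/.
iris = load_iris()
X, y = iris.data, iris.target  # features and labels

# Hold out 30% of the samples for testing; random_state is an assumed value.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```

With 150 samples in total, a 30% hold-out leaves 105 samples for training and 45 for testing.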
Next, the DagsHub logger is integrated to record the model’s metrics and hyperparameters. The model’s fit function is called on the training data, and the trained model is saved as a Python pickle file in the ‘models’ folder.
Once the code has run successfully, a .sav file, metrics.csv, and params.yml should be present in the ‘models’ folder. To verify that these files have been generated correctly and are in the correct directory, please refer to the following directory tree.
Using the DagsHub Logger to Monitor an Experiment
Our first step is to transfer all of our data to DagsHub. Before uploading, we must create a remote repository and initialise Git and DVC in our current working directory. For more details, please see the accompanying image.
Once you have logged in, click the “+Create” button to start creating your own content on DagsHub.
Clicking this button opens a menu, from which you can select “new repository.” Once it is selected, the window depicted below appears.
To create the remote repository, first enter a repository name. Afterwards, we set up Git in the current directory.
Git URL: https://dagshub.com/srishti.chaudhary/dagshub-tutorial.git
Next, we set up DVC and carry out a few additional steps to configure DagsHub as the DVC remote storage, so that the experiments can be logged.
Once the experiment files are tracked by Data Version Control (DVC) and pushed to the DagsHub remote, we must commit and push the remaining files with Git. The ‘models’ and ‘data’ directories each contain .gitignore and .dvc files that must be included in the Git push as well.
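To stay in Python, the sketch below assembles the Git and DVC commands for these steps as argument lists and prints them without running anything. It rests on several assumptions: that the DVC remote URL is the Git URL with ‘.git’ replaced by ‘.dvc’, that the user name matches the one in the Git URL, and that the tracked file names (‘data/iris.csv’, ‘models/random_forest.sav’) are hypothetical stand-ins. The token is a placeholder.

```python
import shlex


def dagshub_setup_commands(git_url, user, token, dvc_paths):
    """Return the Git and DVC commands as argument lists, without running them."""
    # Assumption: the DVC remote URL is the Git URL with ".git" swapped for ".dvc".
    base = git_url[: -len(".git")] if git_url.endswith(".git") else git_url
    dvc_url = base + ".dvc"
    return [
        ["git", "init"],
        ["git", "remote", "add", "origin", git_url],
        ["dvc", "init"],
        ["dvc", "remote", "add", "origin", dvc_url],
        # --local keeps credentials in a config file that Git ignores.
        ["dvc", "remote", "modify", "origin", "--local", "auth", "basic"],
        ["dvc", "remote", "modify", "origin", "--local", "user", user],
        ["dvc", "remote", "modify", "origin", "--local", "password", token],
        # Track the large files with DVC, then upload them to the remote.
        ["dvc", "add", *dvc_paths],
        ["dvc", "push", "-r", "origin"],
        # Commit everything else, including the generated .dvc and .gitignore files.
        ["git", "add", "."],
        ["git", "commit", "-m", "Track first experiment"],
        ["git", "push", "-u", "origin", "HEAD"],
    ]


commands = dagshub_setup_commands(
    git_url="https://dagshub.com/srishti.chaudhary/dagshub-tutorial.git",
    user="srishti.chaudhary",
    token="<your-dagshub-token>",  # placeholder, never commit a real token
    dvc_paths=["data/iris.csv", "models/random_forest.sav"],  # hypothetical names
)
for cmd in commands:
    print(shlex.join(cmd))
```

Once Git and DVC are installed, each list could be passed to subprocess.run(cmd, check=True) to execute the steps in order.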
After completing these steps, you can browse the repository’s files on DagsHub. Then go to the ‘Experiments’ menu option, where our Random Forest Classifier experiment appears as the first entry. You can run and log as many further experiments as you like.
This article introduced DagsHub and DVC and showed how DagsHub can be used to track and compare the results of different machine learning experiments. Experimenting further with the platform is the best way to get a fuller picture of what it can do.