Every machine learning problem requires a solution that is tailored to its specific characteristics. Unfortunately, there is no single solution that can address all machine learning issues. Developing an optimal approach to an individual problem can be resource-intensive and time-consuming, but it is worth the effort to ensure the best possible outcome.
To address this, many teams track their experiments so that they can identify the best-performing approach. Tracking everything these experiments produce by hand is extremely time-consuming, however, and a spreadsheet quickly becomes unmanageable as the volume of data grows.
In this article, we take a different approach. To design more effective experiments, we want to build on the insights gained from earlier ones, and to keep that process consistent we integrate DagsHub and DVC (Data Version Control) into our workflow.
We will look at what DagsHub and DVC are and how DagsHub makes it easier for machine learning beginners to run and track experiments. We will also train a model on the Titanic dataset and run a set of experiments based on ensemble models.
That is only the beginning. To better understand the results, we will use DagsHub’s dashboards to visualise and compare them. Before moving on, let us briefly review DVC, FastDS, and DagsHub.
What is DagsHub?
DagsHub is a platform built on top of the open-source tools Git and DVC that serves as a central repository for data science projects. At MindMeld, we use it as a central model registry, which lets us share our models quickly with anyone, wherever they are. It works much like GitHub, but it is tailored to data scientists: it provides a collaborative environment in which experiments, code, data, and models can all be shared.
In short, the platform caters to the open-source community. Its interface resembles GitHub’s, so users can easily share, explore, and reuse existing projects.
Beyond hosting, DagsHub offers experiment tracking, MLflow integration, ML pipeline visualisation, and more. It lets machine learning developers version their code, data, models, and experiments in one place. Some of its main features are:
- Comparison of ML experiment metrics
- Data versioning and rollback
- Logging of machine learning experiments
- Visualisation and integration of ML flows and pipelines
- Integration of DVC with Git
- Label annotation with tools such as Label Studio
DagsHub provides a comprehensive set of features that make it easier for MLOps teams to tackle machine learning challenges. Its core aim is to give data scientists and AI developers a more efficient workflow so that they can reach their objectives faster. In practice, it simplifies tasks such as:
- Keeping a record of each experiment’s data
- Building the data pipeline
- Creating a copy of the repository
Let us now turn to DVC and its role in machine learning.
So, what exactly is DVC?
Data Version Control (DVC) is an open-source tool, written in Python, for versioning data and machine learning models. It works alongside Git and tracks data and model files without disrupting the usual workflow. Git stores small metadata files that record each version of your data, while the data itself is kept in separate storage. At MindMeld, we use DVC to track the progress of our trained models.
DVC’s commands closely mirror Git’s, so anyone who already knows Git can become proficient in DVC quickly. DVC does, however, need remote storage to be configured for the data, and DagsHub makes this easy.
DVC can be installed with pip by running pip install dvc.
Alongside DagsHub and DVC, you may come across FastDS. According to the official FastDS website, it is an open-source command-line tool that integrates with Git and DVC and is designed to minimise human error, automate repetitive tasks, and make onboarding easier for new users.
In other words, FastDS helps machine learning practitioners keep their code and data under version control.
Tracking Your ML Experiments with DagsHub
If you have been relying on a spreadsheet to record your machine learning experiments, DagsHub may suit you better. It removes much of the risk of error that comes with keeping track of hundreds of parameters by hand.
Data scientists and machine learning practitioners will find DagsHub, an online platform built on open-source technologies, very useful. This post gives an overview of the platform and shows how to use it to track your experiments and visualise the results.
Okay, so let’s begin.
We assume a basic understanding of both scikit-learn and Git. We first cover the necessary background and then show how to use DagsHub to track your experiments.
Building a Data Repository
In this example, we set up the directory structure that is used throughout the tutorial.
The working directory contains a file named ‘model.ipynb’, which is used to create the various models, and two subfolders, ‘data’ and ‘models’. The ‘data’ folder holds the dataset used in this experiment, and the ‘models’ folder is where the pickle files for the models built during the experiment are saved.
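The layout described above can be created by hand, but a few lines of Python do the same job. The following is only a minimal sketch: the helper name create_project_layout is made up for illustration, and the notebook it writes is an empty placeholder rather than the tutorial's actual notebook.

```python
import json
from pathlib import Path


def create_project_layout(root):
    """Create data/, models/ and a placeholder model.ipynb under root."""
    root = Path(root)

    # The two subfolders described in the text.
    for folder in ("data", "models"):
        (root / folder).mkdir(parents=True, exist_ok=True)

    # Write an empty but valid notebook so Jupyter can open it.
    # Existing notebooks are left untouched.
    notebook = root / "model.ipynb"
    if not notebook.exists():
        empty_notebook = {
            "cells": [],
            "metadata": {},
            "nbformat": 4,
            "nbformat_minor": 5,
        }
        notebook.write_text(json.dumps(empty_notebook, indent=1))

    # Return the top-level entries so the caller can inspect the result.
    return sorted(p.name for p in root.iterdir())
```

Calling create_project_layout(".") from the project folder sets everything up in place, and running it a second time is harmless because existing folders and files are kept.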
Creating a function to generate models for experiments
We will write a function that trains models on the iris dataset, in which each plant belongs to one of three species determined by its physical measurements. This function produces the models for our series of experiments, and the physical attributes are the features from which each model predicts the species.
The code first imports the required Python libraries and then reads the data with pandas.
Next, the dataset is separated into features and labels and then divided into a training set and a test set. scikit-learn’s train_test_split function holds out 30% of the data as the test set, which is used to evaluate the model’s accuracy.
The DagsHub logger then records the model’s metrics and hyperparameters. The model’s fit function is called on the training data, and the fitted model is saved as a Python pickle file in the ‘models’ folder.
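The steps described above (splitting, fitting, pickling, and logging) can be sketched roughly as follows. This is not the tutorial's original code, and several things are assumptions: the iris data comes from scikit-learn's bundled loader instead of a CSV read with pandas, the file name random_forest.sav and the function name run_experiment are made up, and the DagsHub logger is not called at all. Instead, metrics.csv and params.yml are written by hand as a stand-in, with a column layout that is assumed rather than taken from the source.

```python
import csv
import pickle
import time
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_experiment(models_dir="models", n_estimators=100, seed=42):
    # Features (X) and labels (y) from the bundled iris data.
    X, y = load_iris(return_X_y=True)

    # Hold out 30% of the rows as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # Fit the model on the training data and score it on the test data.
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = float(accuracy_score(y_test, model.predict(X_test)))

    out = Path(models_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Save the fitted model as a pickle file (hypothetical file name).
    with open(out / "random_forest.sav", "wb") as fh:
        pickle.dump(model, fh)

    # Hand-written stand-ins for the logger's output (assumed layout).
    with open(out / "params.yml", "w") as fh:
        fh.write(f"n_estimators: {n_estimators}\nrandom_state: {seed}\n")
    with open(out / "metrics.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Name", "Value", "Timestamp", "Step"])
        writer.writerow(["accuracy", accuracy, int(time.time() * 1000), 1])

    return len(X_train), len(X_test), accuracy
```

Calling run_experiment() with different values of n_estimators gives one model and one set of logged values per run, which is the basic loop behind a series of experiments.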
If the code runs successfully, the ‘models’ folder should contain a .sav file, metrics.csv, and params.yml. Check the directory to confirm that these files were generated and are in the right place.
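One quick way to confirm that the expected files are in place is a small check like the one below. It is a sketch only: the helper name missing_outputs is invented for illustration, and because the text does not give the exact name of the .sav file, any file with that extension is accepted.

```python
from pathlib import Path


def missing_outputs(models_dir="models"):
    """Return the expected experiment files that are absent from models_dir."""
    folder = Path(models_dir)

    # metrics.csv and params.yml are named explicitly in the text.
    missing = [
        name
        for name in ("metrics.csv", "params.yml")
        if not (folder / name).is_file()
    ]

    # The pickle file's name is not given, so accept any *.sav file.
    if not any(folder.glob("*.sav")):
        missing.append("*.sav")

    return missing
```

An empty list means all three kinds of file were found; anything else names what still needs to be generated.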
Tracking an experiment with the DagsHub logger
To begin, we need to upload our files to DagsHub. Before we can do this, we must create a remote repository and set up DVC and Git in our working directory. For more detail, refer to the attached image.
After logging in, click the “+Create” button to start creating your own DagsHub content.
A menu appears; choose “new repository” from it, which opens the window shown below.
To create the remote repository, first give it a name. Then configure Git in the current working directory.
Git URL: https://dagshub.com/srishti.chaudhary/dagshub-tutorial.git
Next, we initialise DVC and configure DagsHub as its remote storage so that the experiment’s files can be logged.
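The Git and DVC setup is normally typed into a terminal. To keep the examples in one language, the sketch below builds the same commands as Python lists and runs them with subprocess. Several points are assumptions rather than facts from this article: the helper names are invented, the remote is called origin, the DVC remote address is taken to be the repository URL with a .dvc suffix, and the credentials are placeholders for your own DagsHub username and access token.

```python
import subprocess

# Repository address from the article, without the .git suffix.
REPO_URL = "https://dagshub.com/srishti.chaudhary/dagshub-tutorial"


def setup_commands(repo_url, user, token):
    """Build the Git and DVC commands that connect a folder to DagsHub."""
    return [
        ["git", "init"],
        ["git", "remote", "add", "origin", f"{repo_url}.git"],
        ["dvc", "init"],
        # Assumption: the DVC remote is the repository URL ending in .dvc.
        ["dvc", "remote", "add", "origin", f"{repo_url}.dvc"],
        # --local writes to .dvc/config.local, which is not committed to Git.
        ["dvc", "remote", "modify", "origin", "--local", "auth", "basic"],
        ["dvc", "remote", "modify", "origin", "--local", "user", user],
        ["dvc", "remote", "modify", "origin", "--local", "password", token],
    ]


def run_all(commands, cwd="."):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

With Git and DVC installed, run_all(setup_commands(REPO_URL, "your-username", "your-token")) performs the setup in the current directory. Building the list first makes it easy to print and review the commands before anything is executed.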
Once the experiment’s files have been added to DVC and pushed to the DagsHub remote, the remaining files must be pushed with Git. The .gitignore and .dvc files in the ‘models’ and ‘data’ directories must be included in this push.
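The order of operations here matters: large files go to DVC first, and Git then records the small .dvc pointer files and .gitignore entries that DVC generates. The sketch below shows one plausible sequence, again as command lists run from Python. The file paths, the commit message, the function names, and the remote name origin are all hypothetical, so substitute the names used in your own project.

```python
import subprocess


def publish_commands(tracked_files, message="Log experiment"):
    """Build the commands that push data to DVC storage and code to Git."""
    return [
        # Track the large files with DVC; this creates a .dvc file next to
        # each one and adds the original to a .gitignore in the same folder.
        ["dvc", "add", *tracked_files],
        # Upload the tracked data to the DVC remote.
        ["dvc", "push", "-r", "origin"],
        # Stage everything Git should see, including .dvc and .gitignore.
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        # Push the current branch and set its upstream.
        ["git", "push", "-u", "origin", "HEAD"],
    ]


def publish(tracked_files, cwd="."):
    """Run the publish commands in order, stopping at the first failure."""
    for argv in publish_commands(tracked_files):
        subprocess.run(argv, cwd=cwd, check=True)
```

For example, publish(["data/iris.csv", "models/random_forest.sav"]) would track those two files with DVC and push the rest with Git, leaving small text files such as metrics.csv and params.yml under Git's control.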
Once these steps are complete, the repository’s files are visible on DagsHub. Open the ‘Experiments’ tab, where our Random Forest Classifier experiment is listed as the first entry. There is no limit on the number of experiments you can run.
In this article, we gave an overview of DagsHub and DVC and looked at how DagsHub can be used to compare the outcomes of different machine learning experiments. Experimenting with the platform yourself is the best way to appreciate the full potential of this approach to version control.