Improving Dataset Creation for Machine Learning

Written in collaboration with Andreas Putz, Computer Scientist at MistyWest.

Over the past decade, one of the hottest areas of innovation has been applying machine learning to real-world applications. Google and Amazon now use machine learning models to transcribe speech to text, interpret language, and analyze images and videos. However, achieving a high degree of accuracy in these applications means training state-of-the-art deep neural networks, and that requires very large annotated datasets.

In machine learning, the more high-quality data you have, the better your model tends to be. But how can you build a business when the dataset for your application doesn’t exist? This post looks at the challenges involved in creating and organizing a dataset, before offering some solutions.


Problem #1: New datasets are expensive

Creating new datasets can be costly. You need both the data (image, sensor value sequence, etc.) and the ground truth (label). Labelling can be time consuming, but when it doesn’t require specialist skill it can be outsourced. In other cases, you need the result of a lab test or the opinion of a specialist to acquire the label, and this process can quickly become expensive, especially when working to create a minimum viable product.


Problem #2: Dataset quality is a challenge

Dataset quality is a major challenge in machine learning. A dataset with mislabelled data will yield poor classification results. Your dataset should accurately represent the reality you want to describe. To create your dataset, you want to take a relevant statistical sample of your target population. This sample should be independent and identically distributed (i.i.d.), even if the population isn’t.

For example, if you want to categorize images of dogs into different breeds, you’ll need roughly the same number of images for each breed. To start, every piece of data collected should meet a minimum level of quality: it needs to contain the subject (signal) and have a minimal amount of “noise”. Noise refers to irrelevant information that can mask the signal you’re trying to detect. This noise could be a blurred image or any number of other anomalies in the measured values. For example, when looking for dental anomalies in a patient image, each image should include the teeth, be in focus, and be properly magnified.

Problem #3: Keeping the dataset organized

Another challenge is keeping the dataset organized by using relevant metadata. How do you link the input data with the labels, especially when there can be a time delay between when the data is collected and when the label is applied (e.g. if the label is provided by a lab test several days later)?

Metadata such as camera settings, firmware versions, or sensor settings is key information to associate with the data. These settings can dramatically affect the nature of the data collected, and at some point you will need to retrieve them. Being able to efficiently filter on metadata can save major headaches during the model creation process.
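As a rough illustration of linking data to labels that arrive later, the sketch below gives each sample a stable ID, stores its metadata alongside it, and attaches labels by ID once they come in. The `Sample` class, the field names, and the example values are all hypothetical; a real system would more likely sit on top of a database.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Sample:
    """One collected data point plus the metadata captured with it."""
    sample_id: str
    captured_at: datetime
    data_path: str                      # where the raw image or sensor file lives
    metadata: dict = field(default_factory=dict)
    label: Optional[str] = None         # filled in later, e.g. from a lab test


def attach_labels(samples, lab_results):
    """Attach labels that arrive after collection, matched on sample_id."""
    by_id = {s.sample_id: s for s in samples}
    unmatched = []
    for sample_id, label in lab_results.items():
        if sample_id in by_id:
            by_id[sample_id].label = label
        else:
            unmatched.append(sample_id)  # surface these instead of silently dropping them
    return unmatched


def filter_by_metadata(samples, **criteria):
    """Return samples whose metadata matches every key/value in criteria."""
    return [
        s for s in samples
        if all(s.metadata.get(k) == v for k, v in criteria.items())
    ]


# Made-up example data.
samples = [
    Sample("s-001", datetime(2024, 1, 10), "raw/s-001.png", {"firmware": "1.0", "gain": 2}),
    Sample("s-002", datetime(2024, 1, 11), "raw/s-002.png", {"firmware": "1.1", "gain": 2}),
]
unmatched = attach_labels(samples, {"s-001": "positive", "s-999": "negative"})
labelled_fw10 = [s for s in filter_by_metadata(samples, firmware="1.0") if s.label]
```

Keeping the ID as the only link between data and label means a lab result that shows up days later can still be matched, and results that match nothing are reported rather than lost.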

With large datasets, storing and managing all these data streams can become overwhelming. For example, MistyWest built a model for a client project using data collected from a device in January, and it had a classification accuracy of 97% when tested. More data was added two months later, and the accuracy dropped to 78%. We then had to figure out what caused the drop. Because the dataset was well organized, we could analyze the metadata to look for a logical reason for the change. In this case, the culprit was a configuration change made to the sensor, so we excluded the data from the second batch and reverted the sensor setting.
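One way to run this kind of investigation is to break accuracy down by a metadata field and look for the group that stands out. The sketch below is a minimal version of that idea, not the tooling used on the project; the record layout and the `sensor_config` field are assumptions made for illustration, and the records are made up.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group evaluation records by one metadata field and compute accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec["metadata"].get(key, "<missing>")
        total[group] += 1
        correct[group] += int(rec["predicted"] == rec["actual"])
    return {group: correct[group] / total[group] for group in total}


# Made-up evaluation records; "sensor_config" is a hypothetical metadata field.
records = [
    {"predicted": "a", "actual": "a", "metadata": {"sensor_config": "v1"}},
    {"predicted": "b", "actual": "b", "metadata": {"sensor_config": "v1"}},
    {"predicted": "a", "actual": "b", "metadata": {"sensor_config": "v2"}},
    {"predicted": "b", "actual": "b", "metadata": {"sensor_config": "v2"}},
]
per_config = accuracy_by(records, "sensor_config")
```

Running the same breakdown over firmware version, collection date, or device ID is a cheap first step before digging into the model itself.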

Problem #4: Obtaining a balanced dataset

It can be difficult to create a balanced dataset, especially in medical applications, as it requires that you obtain similar amounts of healthy and unhealthy sample data. In many cases, you may have many more healthy subjects than unhealthy ones, and with an unbalanced dataset your model will be biased towards whatever data you have more of. Data augmentation techniques as well as specialized techniques for unbalanced datasets can be used in these cases, but nothing replaces good input data. Therefore, if unhealthy sample data occurs infrequently, it is critical that you don’t miss it.
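Two common techniques for unbalanced datasets are inverse-frequency class weighting and random oversampling of the minority class. The sketch below shows both using only the Python standard library; the function names are ours, the 8-to-2 split is made up, and neither technique is claimed to be what any particular project used.

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((sample, label) for sample in items + extra)
    rng.shuffle(out)
    return out


# Made-up unbalanced dataset: 8 healthy samples, 2 unhealthy.
labels = ["healthy"] * 8 + ["unhealthy"] * 2
samples = list(range(len(labels)))

weights = class_weights(labels)         # rarer class gets the larger weight
balanced = oversample(samples, labels)  # list of (sample, label) pairs
```

Oversampling should be applied to the training split only, after the test data has been set aside; otherwise duplicated samples leak into evaluation and inflate the reported accuracy.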


Solution #1: Bound the problem

The key to creating a cost-effective dataset is to bound the problem for the use case. By introducing constraints, such as image quality and orientation, you can drastically limit the amount of training data needed to create an effective model. Ensuring that the incoming data is high quality will make it easier to train your model and validate its use.

In a simple, controlled environment, does the data contain enough information to build an adequate model? It’s like creating a playground. If the answer is “yes”, the playground can be made more complex by reducing the number of constraints applied.
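To make the playground idea concrete, constraints can be written as named checks that are switched on or off as the problem is widened. This is only a sketch: the constraint names, thresholds, and metadata fields below are hypothetical.

```python
# Each constraint is a named check on a sample's metadata (all hypothetical).
CONSTRAINTS = {
    "sharp": lambda m: m["sharpness"] >= 0.8,
    "upright": lambda m: abs(m["rotation_deg"]) <= 5,
    "fixed_distance": lambda m: 28 <= m["distance_cm"] <= 32,
}


def select(samples, active):
    """Keep samples that satisfy every currently active constraint."""
    checks = [CONSTRAINTS[name] for name in active]
    return [s for s in samples if all(check(s) for check in checks)]


# Made-up samples described only by their metadata.
samples = [
    {"id": 1, "sharpness": 0.9, "rotation_deg": 2, "distance_cm": 30},
    {"id": 2, "sharpness": 0.9, "rotation_deg": 40, "distance_cm": 30},
    {"id": 3, "sharpness": 0.5, "rotation_deg": 1, "distance_cm": 45},
]

# Start with a tightly bounded playground...
strict = select(samples, ["sharp", "upright", "fixed_distance"])
# ...then relax it by dropping the orientation constraint once the model works.
relaxed = select(samples, ["sharp", "fixed_distance"])
```

Training first on `strict` and then on `relaxed` shows how much extra data and model capacity each removed constraint costs.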


Solution #2: Build tools to automate dataset creation

You can build tools that collect data faster and ease logistical challenges. This usually consists of a system that:

  • Collects the input data (from a video or sensor stream, for example)
  • Creates an inventory of the input data
  • Allows easy access to the data for exploratory data analysis (do the basics: run statistical analysis, visualize your data, and get a feel for it)
  • Provides easy and convenient access for labelling
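The inventory step in the list above can start out very simple. The sketch below walks a directory and records each file’s path, size, and SHA-256 checksum, which is enough to spot missing or accidentally duplicated files. The function names and the directory layout are assumptions; for very large files you would hash in chunks rather than read each file into memory at once.

```python
import hashlib
import tempfile
from pathlib import Path


def build_inventory(data_dir):
    """Record path, size, and SHA-256 checksum for every file under data_dir."""
    root = Path(data_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            # Reads the whole file; fine for a sketch, chunk it for large files.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return entries


def find_duplicates(entries):
    """Group paths that share a checksum, i.e. byte-identical files."""
    by_hash = {}
    for entry in entries:
        by_hash.setdefault(entry["sha256"], []).append(entry["path"])
    return [paths for paths in by_hash.values() if len(paths) > 1]


# Usage with throwaway files standing in for collected data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.bin").write_bytes(b"sample-1")
    (Path(tmp) / "b.bin").write_bytes(b"sample-1")   # accidental duplicate
    (Path(tmp) / "c.bin").write_bytes(b"sample-2")
    inventory = build_inventory(tmp)
    duplicates = find_duplicates(inventory)
```

The resulting list of dictionaries can be written out as JSON or loaded into a table for the exploratory analysis and labelling steps.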

The method of labelling the data depends on the use case – it could include additional sensors (i.e., the ones you’re trying to replace), or a system for user input. An automated system can also pull relevant metadata and store it alongside the data. The availability of metadata can help to:

  • Allow or simplify anomaly detection
  • Explain variability in the data


Solution #3: Instantaneous feedback

How do you ensure that you’re capturing quality data, especially when a rare sample is encountered? One answer is to ensure that every collected data sample is of high quality. In some cases this can be achieved by a lightweight quality analysis, either directly on the device or through a low-latency, cloud-based assessment. This fast-turnaround feedback loop can trigger a re-acquisition of the data.

For example, with sensor data, the analysis could be targeted at assessing the noise level in the data. For an image, it could be assessing the focus or exposure of the images. If the data doesn’t pass this initial assessment, a prompt can be created to reacquire the data before moving on to the next data sample.
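The checks described above can be quite small. The sketch below, written in plain Python so it stays self-contained, estimates sensor noise from sample-to-sample differences and scores an image for focus (variance of a simple Laplacian) and exposure (mean brightness). The thresholds are placeholder assumptions that would need tuning per device, and a production version would more likely use an image library than nested lists.

```python
import math
from statistics import mean, pstdev


def noise_level(signal):
    """Estimate noise as the spread of sample-to-sample differences.

    For a slowly varying signal with independent noise, std(diff) / sqrt(2)
    approximates the noise standard deviation.
    """
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return pstdev(diffs) / math.sqrt(2)


def laplacian_variance(img):
    """Focus measure: variance of a 4-neighbour Laplacian over interior pixels.

    Blurry or featureless images score low. img is a 2D list of grey levels.
    """
    h, w = len(img), len(img[0])
    lap = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pstdev(lap) ** 2


def check_image(img, min_focus=100.0, exposure_range=(40, 215)):
    """Return the reasons to reacquire; an empty list means the image passes."""
    problems = []
    brightness = mean(v for row in img for v in row)
    if not exposure_range[0] <= brightness <= exposure_range[1]:
        problems.append("exposure")
    if laplacian_variance(img) < min_focus:
        problems.append("focus")
    return problems


# Synthetic 8x8 test images: a high-contrast checkerboard and two flat frames.
sharp = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]
dark = [[5] * 8 for _ in range(8)]

sharp_problems = check_image(sharp)   # passes both checks
flat_problems = check_image(flat)     # well exposed but no detail
dark_problems = check_image(dark)     # underexposed and no detail
```

A non-empty result from `check_image` is the signal to prompt for a retake before moving on to the next sample.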



MistyWest client case study

One of MistyWest’s clients was developing a way to help diagnose a medical condition using images instead of lab tests. Creating a dataset of both images and corresponding lab test results was difficult, and our client was paying the lab for results, regardless of whether the image contained any useful information. This was costly for our client.

Our team developed a prototype of a handheld device for capturing images with a cloud backend, and built an automated system that introduced constraints into the data collection process. These constraints guarded against blurry images and verified the content of the image to ensure it was capturing the target area.

A clinician could simply take a series of images with the device. From this series we extracted a validation image that, using the device’s IoT capabilities, was analyzed in the cloud. If the image passed the validation process, the lab result was requested. If the image failed validation, the clinician was asked to take another set of images. This process eliminated the risk and cost of getting unusable data back from a lab.
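The control flow of that process can be sketched in a few lines: validate first, and only request the paid lab result once the images pass. This is an illustration of the gating idea rather than the client’s actual system; the function names, the retry limit, and the stand-in capture, validation, and lab functions are all hypothetical.

```python
def collect_labelled_sample(capture_series, validate, request_lab_result, max_attempts=3):
    """Request the (paid) lab result only once an image series passes validation."""
    for attempt in range(1, max_attempts + 1):
        series = capture_series()
        if validate(series):
            return {"images": series, "label": request_lab_result(), "attempts": attempt}
    return None  # no usable images: skip the lab request entirely


# Stand-ins for the real device, cloud check, and lab (all hypothetical).
captures = iter([["blurry"], ["sharp", "sharp"]])
lab_calls = []


def fake_lab():
    lab_calls.append("requested")
    return "positive"


result = collect_labelled_sample(
    capture_series=lambda: next(captures),
    validate=lambda series: "blurry" not in series,
    request_lab_result=fake_lab,
)
```

In this run the first series is rejected, the second passes, and the lab is contacted exactly once.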



Building a dataset can be a challenging process. You must keep data quality and organization in mind, while not spending more than you can afford. At MistyWest, we help clients establish systems that effectively build their datasets. Email us at [email protected] to discover more about how we can help you with this important task.
