Devil's in the details: Data Leakage
In this blog post I assume that you know what data leakage is, and I will provide a perspective on solving it for a use case I come across often. As a reminder, here is the definition from Wikipedia:
In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.
So you trained your machine learning model and had a look at the performance metrics. They are close to 100%. There you are, with a problem-solved face and a confident smile. Then you start thinking again about those high scores. Was the problem your machine learning model solved really an easy one? Or, more probably, does your test dataset include (leaked) datapoints that should only be part of your training dataset, so that the trained model is simply remembering datapoints from training? It is this more probable case that I would like to discuss in this blog post.
And if you still want to keep your confident smile, it is time to rethink your data splitting strategy to avoid data leakage.
I encounter this type of data leakage very often, especially when working on data-driven problems for IoT devices. In such problems, there are usually multiple datapoints collected by multiple IoT devices, or entities, so to speak. In other words, the dataset contains many datapoints associated with each entity. From this point on, I will refer to the IoT devices as entities, since this is the most granular level of detail that the datapoints can be linked to.
As an example use case, assume we are working on a predictive battery maintenance problem for an IoT device, and the objective of our machine learning model is to estimate the battery state of the device given the daily event counts it collects. Each datapoint in this example includes daily event counts, device-specific data and cumulative event data over time.
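To make this concrete, here is a minimal sketch of what such a dataset could look like. All column names and values are illustrative assumptions, not the actual schema:

```python
import pandas as pd

# Hypothetical daily datapoints, one row per device per day (names are illustrative).
datapoints = pd.DataFrame(
    {
        "device_id": ["A", "A", "B"],
        "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01"]),
        "daily_event_count": [12, 9, 30],           # events triggered that day
        "cumulative_event_count": [812, 821, 150],  # running total, resets on battery replacement
        "hardware_revision": [2, 2, 1],             # device-specific data
        "battery_state": ["ok", "ok", "low"],       # the target to estimate
    }
)
```

Note that `device_id`, `hardware_revision` and `cumulative_event_count` are all tied to the entity rather than to a single day, which is exactly what will matter for splitting later.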
Not good: Only random split
Creating separate datasets for training and test is part of the standard machine learning modeling procedure. Generally, the random split method is preferred: the data is shuffled first and then split. The idea is to have a representation of the data distribution in both the training and test sets, so random splitting works most of the time. However, in this case, random splitting alone will not be a good choice. The reason is that each datapoint most likely includes features of an entity, and the datapoints of an entity are similar to each other. If the datapoints of an entity are shared between the training and test datasets, the model will not need to make an educated guess and generalize; instead it will remember these entity features.
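A quick way to see the problem is to perform a plain random split on a toy dataset and then check how many entities end up on both sides. The data and column names below are assumptions for illustration:

```python
import pandas as pd

# Toy dataset: 4 devices (the entities), 5 datapoints each.
df = pd.DataFrame(
    {
        "device_id": [d for d in "ABCD" for _ in range(5)],
        "daily_event_count": range(20),
    }
)

# Plain random split: shuffle the rows, take 80% for training.
shuffled = df.sample(frac=1.0, random_state=42)
train, test = shuffled.iloc[:16], shuffled.iloc[16:]

# Leakage check: which entities appear in both datasets?
shared = set(train["device_id"]) & set(test["device_id"])
print(f"entities shared between train and test: {sorted(shared)}")
```

With only a handful of entities and many datapoints per entity, every entity in the test set almost inevitably also has datapoints in the training set, which is the leakage described above.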
Coming back to the use case example above: features like device-specific data and cumulative data will make the model remember the entity if they end up in both the training and test datasets, hence the data leakage.
Better: Entitywise split & random split
If we only use random splitting, we can cause data leakage. The way to avoid this is to split the dataset by entity first: randomly assign each entity to the training or the test dataset, then include the entity's datapoints only in the dataset to which the entity is assigned. Never let the training and test datasets include datapoints from the same entity.
Here we can declare the entity as device_id, or even a more granular entity like battery_id, because the cumulative data for a given device resets at each battery replacement.
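A sketch of such an entity-wise split, here using scikit-learn's GroupShuffleSplit with battery_id as the grouping column. The toy data and column names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy dataset: 5 batteries (the entities), 4 datapoints each.
df = pd.DataFrame(
    {
        "battery_id": [b for b in ["b1", "b2", "b3", "b4", "b5"] for _ in range(4)],
        "daily_event_count": range(20),
    }
)

# Entity-wise split: whole entities are assigned to train or test, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["battery_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No entity appears on both sides of the split.
assert set(train["battery_id"]).isdisjoint(set(test["battery_id"]))
```

The assignment of entities to the two sets is still random, so the "random split" intuition is preserved; only the unit being shuffled changes from datapoint to entity.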
How to identify the entity for your problem:
This is the most crucial bit, and it does not have an easy answer. You need to understand the problem you are trying to solve and make sure that this problem translates into the objective of your model. I will share some examples below to clarify how.
Use case: Predictive maintenance
At my client Salto Clay, a smart lock manufacturer, I built a machine learning model whose objective was to estimate whether a smart lock would reject customers at the door due to low battery, since it was not possible to read the remaining battery state directly. A smart lock is a network-enabled device that lets users leave their keys behind, locking and unlocking doors with mobile phones, tags and fingerprints. The initial objective at my client was: Which smart locks will trigger lock rejection due to low battery? The data source consisted of daily event counts triggered by the devices and some device specifics.
To train a machine learning model on this dataset, it is necessary to split the data into training and test datasets. If I simply split the datapoints randomly, this will cause data leakage, because there will be datapoints from the same smart lock, the entity, in both the training and test datasets. I should follow a wiser splitting approach and make sure the training and test datasets have no smart lock datapoints in common, because according to the initial objective definition above the smart locks are the entities.
Indeed, splitting datapoints by smart locks as entities is an improvement and will avoid the data leakage to some extent. Still, it is not the best solution, because a smart lock as an entity is too general to cover the features that depend on battery runtime. So it is necessary to define the entity at a finer granularity, like a battery session of a smart lock. With such an entity definition, the datapoints collected by one device during one of its battery sessions will belong to either the training or the test dataset, and thus the data leakage is avoided.
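Since a battery session is usually not an explicit column in the data, one way to derive it is from the resets of the cumulative counter. This is a hypothetical sketch, assuming the counter drops back towards zero whenever a battery is replaced:

```python
import pandas as pd

# Hypothetical datapoints for one device; the cumulative counter resets after index 2.
df = pd.DataFrame(
    {
        "device_id": ["A"] * 6,
        "cumulative_event_count": [100, 150, 210, 5, 40, 90],
    }
)

# A new battery session starts wherever the cumulative count drops
# below its predecessor within the same device.
reset = df.groupby("device_id")["cumulative_event_count"].diff() < 0
session_idx = reset.groupby(df["device_id"]).cumsum()
df["battery_session"] = df["device_id"] + "-" + session_idx.astype(str)
```

The resulting battery_session column can then be used as the grouping key for the entity-wise split, instead of device_id.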
One more thing! For this type of problem and dataset, with multiple datapoints per entity, it is not advisable to create a rolling dataset and use cross-validation as in time-series problems, because simple rolling will not eliminate the cumulative data, which resets over time. Cross-validation through time is also not necessary: we can use later datapoints collected for a given battery to make estimations for previously used batteries, and vice versa. Thus the order in time does not matter, as long as each entity's datapoints land in either the training or the test dataset.
Use case: Customer discovery
At my previous client, an electricity/gas provider, the objective was: Which heating source do customers have? The dataset consisted of daily datapoints of electricity/gas consumption and data specific to the residences, e.g. building age and type.
If datapoints are randomly assigned to the training and test datasets, there will be data leakage and the performance metrics will be misleadingly high, because the model remembers building-specific data instead of making an educated guess. It is better to define our entity as the residence and assign each residence's datapoints to either the training or the test dataset.
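For evaluation, the same idea extends to cross-validation: scikit-learn's GroupKFold holds out whole entities in each fold, so no residence ever appears in both the training and the validation side. A minimal sketch with hypothetical residence IDs and features:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical daily datapoints: 12 rows from 4 residences (the entities).
X = np.arange(24).reshape(12, 2)                       # consumption features
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])     # heating source label
residence_id = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# Entity-wise cross-validation: each fold holds out whole residences.
cv = GroupKFold(n_splits=4)
for train_idx, test_idx in cv.split(X, y, groups=residence_id):
    held_out = set(residence_id[test_idx])
    assert held_out.isdisjoint(set(residence_id[train_idx]))
```

This gives leakage-free performance estimates with the same convenience as ordinary K-fold cross-validation.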