Data Cleansing Paradox

Avoid Data Errors and Data Cleansing

The importance of high quality data is obvious. But is data cleansing really necessary?

Surely, the obvious way to improve data quality is to prevent incorrect data entering your system in the first place.

Seems obvious, right? So IT automatically include strong data validation and checks in all IT systems to ensure only clean data is added.

Data Checks Reduce Data Quality

The paradox is that rigorous data checks can actually reduce data quality.

Rigorous Checks – Theory Vs. Practice

In theory, rigorous data checking would always work. When you design your system, you define rules for the data.
e.g.

All employees have an annual salary. This must have a currency and be greater than zero.
All customers must have a delivery address, email address, and phone number.
All products sold must have dimensions and weight.

You can build in checks to ensure this logic is adhered to. When data is inputted or imported into the system, red error messages prevent invalid data being stored.

This works – for a while.

In practice, there are exceptions. Always. Your business changes, the world around you changes… some of your assumptions are no longer true. For instance

You start hiring interns. They must be added to the system so they can get an ID and claim expenses, but they get no salary. HR cannot wait for IT to change the system so they enter a salary of $0.001 knowing that will be too small to be paid.
Some customers placing orders over the phone refuse to give an email address. So sales staff enter “noemail@example.com” just to accept the order.
The company starts selling software. The weight and dimensions of the CD case used to work fine. But when software started to be downloaded they kept the CD measurements as the system would not allow zero.

In practice, these small exceptions always creep in. They cannot all be anticipated.

For the people inputting data, they just need to get on with their job. They cannot stop and get IT to change the system, so they ‘force’ it in.

Now data errors are in the system. And the worst part is the errors are hidden, as the system thinks all the data is valid. So you need to data cleanup but it will be hard.

Permissive Errors

The permissive errors approach allows “errors” to be entered.

It acknowledges there will be exceptions. Instead of the errors been hidden away amongst valid data, the exceptions stand out ready to be fixed.

Fixing the error may involve

Completing some information later.
An administrator changing some related data.
An administrator or IT changing the validation rules to allow this.

Example

Here is an example. John is creating next year’s IT budget, including a network project for Asia Pacific. This new region is not on the system yet.

If strict data controls were in place, he would have force the data into somehow or wait for a system update.

But with permissive errors he can add the data – knowing it will get fixed later. Until then, the data has a warning next to it.

Later, Jane checks the data and sees the warning. She investigates and sees a new business unit is needed.

She has admin permission for this data and can add Asia Pacific as a new region. Now the data is all correct.

See this example in action.

When to Use Permissive Approach

There will be times when data validation must be strict. For example if John’s budget values were being used by executives in a report showing budget by region, the data would be inconsistent.

You could work around that by only including data in the report that had been approved – and making the approval process strict.

You may want to use a permissive approach for a new business process. To give people time to adjust the old data and iron out inconsistencies, before switching to a strict approach.

Or you may want to import a batch of historical data or onboard a new division. In which case, you could allow that data to be added and create a view for cleaning up the data.

Or you may decide to use a permissive approach just for people in certain roles.

Permissive or Strict

LiveDataset lets you configure data validation to be strict or permissive. And gives you the flexibility to change between approaches.

If strict validation is enforced, there needs to be a way to handle exceptions. So we typically start with a permissive approach and only switch to strict validation when needed.

Don’t think of data cleanup as a one-off task – plan for it as a normal part of any business process.