Bad, bad data. No donut for you.

Unfortunately, the data you are looking for is missing, duplicated, incorrect, outdated, or non-conforming. Hopefully, it will return to its helpful state once you fix things.


Imagine driving down an expressway as fast as you are legally allowed to. At some point, your windshield fogs up so much that you can barely make out the taillights of the vehicles ahead. Can you keep driving at the same pace? Are you just as safe as before?

That’s what bad data does to an organization.

IBM estimates that bad data costs the US economy upwards of three trillion dollars a year. That’s just the direct cost, though. Factor in the cost of repairs – product recalls, customer outreach programs, penalties and corrections – and the true monetary impact of bad data could be a multiple of that $3Tn figure. Gartner reports that at least 40% of businesses fail to achieve their business objectives due to bad data.


But what exactly is bad data?

Bad data is any data that is missing, corrupted, inaccurate or invalid. In other words, it is the exact opposite of data that is useful and reliable.

Bad data can be broadly classified into two categories based on their origins: accidental and deliberate.

Accidentally bad data is what you get when data is corrupted by:

  • A malfunctioning storage device.
  • Data being overwritten with junk values.
  • Two or more systems exchanging information in different and/or incompatible formats (see the sketch after this list).
  • A system design that prevents it from correctly interpreting the data passed to it.
  • Careless data entry and/or processing.
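
To make that format-mismatch bullet concrete, here is a minimal Python sketch (the systems, values and conventions are hypothetical): two systems can “successfully” exchange a date and still end up storing the wrong one.

```python
from datetime import datetime

# Hypothetical scenario: System A exports dates as DD/MM/YYYY,
# while System B assumes MM/DD/YYYY. The exchange "succeeds",
# but ambiguous values are silently corrupted.
exported_by_system_a = "04/07/2023"   # 4 July 2023 under System A's convention

parsed_by_system_b = datetime.strptime(exported_by_system_a, "%m/%d/%Y")
print(parsed_by_system_b.date())      # 2023-04-07, i.e. April 7th, not July 4th

# Agreeing on an unambiguous interchange format such as ISO 8601
# removes this entire class of accidental corruption.
print(datetime(2023, 7, 4).date().isoformat())   # 2023-07-04
```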

Deliberately bad data, as the name implies, involves active human agency. This happens when:

  • Customers/users knowingly give you wrong data, either to protect their privacy or to mislead you.
  • Data is masked to protect identities or intellectual property (a small masking sketch follows this list).
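
Masking is usually done in code. Below is a minimal sketch, assuming a simple customer record and a hypothetical salt: the hash keeps records joinable for analytics while hiding who the person actually is.

```python
import hashlib

def mask_email(email: str, salt: str = "replace-with-a-secret-salt") -> str:
    """Replace an email address with a one-way hash so records stay
    joinable for analytics but no longer identify the person."""
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return f"user-{digest[:12]}@masked.invalid"

record = {"name": "Jane Doe", "email": "jane.doe@example.com", "ltv": 1800}
masked = {**record, "name": "REDACTED", "email": mask_email(record["email"])}
print(masked)
```

Deliberately masked data is “bad” only for anyone who expects the original values back; the trade-off is privacy over fidelity.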

A trillion data points a second

There was a time, not too long ago, when companies complained that there was simply no way to collect the data they needed to make informed decisions. Except for a few organizations with pockets deep enough to fund extensive IT divisions, most of those that did manage to collect data had no idea how to process large volumes of it. As a result, a premium was attached to data collection, along the lines of the old hunter’s adage, “Kill only what you eat.”

That’s not the case anymore. We are in an era where we double the data we possess every two years. Advancements in mobility and IoT mean that practically every step of an organization’s business day can be captured in a digital format. Machine learning and artificial intelligence can take hundreds of data points and extrapolate them into another hundred data points, drawing correlations and patterns hitherto invisible to the human intellect.

The downstream multiplier effect

What that also means is that if a piece of data – a datum – is bad, it will corrupt every bit of downstream data as well. Even if the original datum is identified and corrected, you can never be fully confident that there isn’t a report out there that’s still carrying the original, bad data. Just like college photos that turn up at the most inopportune moments or fake news showing up in your WhatsApp inbox long after it’s been debunked, bad data can also be, unfortunately, Permanent Data.

The downstream effect isn’t restricted to individual pieces of data either. Bad data affects the decisions of 88% of companies and reduces top-line revenue by about 12% on average. Divisions across the length and breadth of the organization are affected in one way or another. For example, bad data forces you to devote additional resources to verifying details in your CRM, a drain on your staffing and marketing budgets. Bad data in production can lead you to dismiss an emerging problem as a temporary glitch, or to chase a glitch as if it were a real problem.

As mentioned earlier, factor in the cost of damage control and you are looking at a hit of anywhere from 10 to 40% on your bottom line.

Avoiding Bad Data

To borrow from an age-old proverb, “a byte in time saves nine.” Investing in the right set of controls might seem cumbersome, and perhaps unnecessarily expensive, at the beginning, especially if you are dealing with manageable volumes of data. But given the astronomical rate at which data collection can grow, not to mention the near-unrealistic expectations customers have of the companies they buy from (“you must know everything about what I need without knowing anything about me”), it’s an effort that will quickly pay for itself.

1. Have backups and failsafes

Most organizations suffer from bad data when they are hit by a data outage. With cloud storage and hosting becoming increasingly cheap, the cloud is an excellent option for organizations that don’t want the hassle of investing in their own IT infrastructure. But whatever you do, ensure that you run periodic backups of your databases and have mirrors you can switch to when needed.
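
As a minimal sketch of the periodic-backup advice, assuming a PostgreSQL database named “crm” and pg_dump available on the PATH (both are illustrative assumptions), a nightly job might look like this; in practice you would schedule it with cron or your orchestrator and copy the dump off-site.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

BACKUP_DIR = Path("/var/backups/crm")   # hypothetical location

def run_backup() -> Path:
    """Dump the database to a timestamped file and return its path."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = BACKUP_DIR / f"crm-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "--dbname=crm", "--format=custom", f"--file={target}"],
        check=True,  # fail loudly so a silent backup failure never goes unnoticed
    )
    return target

if __name__ == "__main__":
    print(f"Backup written to {run_backup()}")
```

The point is less the specific tool than the habit: backups that run on a schedule, fail loudly, and end up somewhere a single outage cannot reach.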

2. Build for the future, not for the present

When you are designing your system, plan ahead in years, not months. A rule of thumb we like: budget for 10x your current requirements in the short term, 50x over the medium term and 100x over the next three years. The multiplier is not just in terms of users, but also in terms of spread – geography, product, partner, etc. Think flexible, think scalable.

For instance, if there’s a chance you might work with Chinese firms in the future, you might be better off planning a database system that can work with Chinese characters; if you are looking to operate in two different countries, you might want to ensure that you capture timestamps in UTC instead of local times.
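
Here is a small sketch of that UTC-and-Unicode advice in Python 3.9+ (the zone name and customer value are purely illustrative): store the timezone-aware UTC instant and UTF-8 text, and convert only for display.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Capture the instant in UTC; convert to a local zone only when displaying it.
event_time_utc = datetime.now(timezone.utc)
print(event_time_utc.isoformat())                                    # ...+00:00
print(event_time_utc.astimezone(ZoneInfo("Asia/Shanghai")).isoformat())

# The same future-proofing applies to text: insist on UTF-8 end to end so
# Chinese (or any other) characters survive storage and interchange.
customer_name = "王小明"
encoded = customer_name.encode("utf-8")   # what actually goes over the wire
assert encoded.decode("utf-8") == customer_name
```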

3. Promote a culture that respects data first and foremost

Sales divisions often make the mistake of marking a customer ‘lost’ when he or she declines on a call, and then counting the loss against the salesperson who made the call. This is a short-sighted approach that is ripe for gaming – it tempts the salesperson into keeping the customer’s status as ‘active’ instead of ‘DNC’ (Do Not Contact), which means the company keeps contacting the customer and loses both money and goodwill in the process.

Creating a culture that focuses on accurate data, on the other hand, means that the door to future value-creation is always open. Encouraging an employee to reject bad data, as opposed to forcing him/her to accept data just so that there is some data in the system, sends the message that you take the veracity of the data you possess very seriously.
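
One way this shows up in practice is a data model with an honest “unknown” state, so nobody is forced to invent a value just to fill a field. The statuses and mapping below are purely illustrative, not a prescribed CRM schema.

```python
from enum import Enum
from typing import Optional

class ContactStatus(Enum):
    ACTIVE = "active"
    LOST = "lost"
    DO_NOT_CONTACT = "dnc"
    UNKNOWN = "unknown"   # an honest "we don't know" beats a forced guess

def record_call_outcome(raw_outcome: Optional[str]) -> ContactStatus:
    """Map a salesperson's notes to a status without inventing data."""
    mapping = {
        "interested": ContactStatus.ACTIVE,
        "declined": ContactStatus.LOST,
        "asked not to be called": ContactStatus.DO_NOT_CONTACT,
    }
    if not raw_outcome or raw_outcome.strip().lower() not in mapping:
        return ContactStatus.UNKNOWN   # reject bad input rather than defaulting to ACTIVE
    return mapping[raw_outcome.strip().lower()]

print(record_call_outcome("asked not to be called"))   # ContactStatus.DO_NOT_CONTACT
print(record_call_outcome(""))                         # ContactStatus.UNKNOWN
```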

4. And one policy to rule them all

In the near future, almost every organization – and certainly all of your worthy competitors – will have access to all the data needed to make business decisions. What will then separate the leaders from the rest of the pack is how quickly they identify how bad data gets into their systems and how effectively they deal with it. It will be the ones with robust, mature data governance policies that set the pace for everyone else.

Data governance policies cover everything from how data should be collected and how it should be processed, to who can access it and at what levels. When you have clear controls in place for design and access, it helps you identify gaps and entry-points for bad data faster. It will also help you build checkpoints that can guard against bad data, such as when talking to a third-party system or when running batch operations.
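
As an illustration of such a checkpoint, here is a minimal sketch that validates records arriving from a third-party system before they reach your main tables. The field names and rules are hypothetical; a real governance policy would define them centrally.

```python
from datetime import datetime

REQUIRED_FIELDS = {"customer_id", "email", "created_at"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record may pass."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "email" in record and "@" not in str(record["email"]):
        problems.append("email does not look valid")
    try:
        datetime.fromisoformat(str(record.get("created_at", "")))
    except ValueError:
        problems.append("created_at is not an ISO 8601 timestamp")
    return problems

incoming = {"customer_id": 42, "email": "jane@example.com",
            "created_at": "2024-05-01T09:30:00+00:00"}
issues = validate_record(incoming)
if issues:
    print("Quarantined:", issues)   # route to a review queue, not the main table
else:
    print("Accepted")
```

Rejected records go to a quarantine or review queue rather than being silently dropped or silently accepted, which is exactly the kind of entry-point control a governance policy should spell out.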

For more insights, check out some of our other posts on the topic of data governance.

