Big Data x 5NF = Really Big Data Errors
It is true that Big Data may offer huge opportunities for enterprises to gain hitherto unimagined insights. Alas, it also true that is has the potential to tell enterprises really big lies!
This has nothing to do with the quality of the Big Data. These ‘lies’ can arise from 100% correct data. They arise due to data structure anomaly.
This huge risk is all down to the fact that the structures of data that you will get from external sources arehighly likely to break Fifth Normal Form (5NF). In truth, it would be almost a fluke if such data did not! I call this propensity for merged Big Data sets to lie 5NF Syndrome!
What is Fifth Normal Form?
A good definition for 5NF (that is understandable) is hard to find. The best way to explain it is by an example from real life.
A distribution company in the UK had a data table in its corporate distribution system that looked like Fig1 below. This showed which manufacturer’s products the enterprise was sanctioned to distribute and to which retailers.
Over time, three regional divisions extracted parts of the data from the corporate distribution system for use in their local standalone systems in order to do some regional analysis. The extracted data looked like this in the three regional systems.
A regional manager, who had responsibility for the three regions, had some creative ideas for expanding his local distribution infrastructure. In order to ensure that his business case was sound, he needed to do some analysis that would enable him to know the overall form and spread of distribution services across these regions.
What easier way to do it than to run a report using the data in the three standalone regional systems? After all, they contained all of the data elements that he required.
All that was required was a simple SQL query joining the three.
Before long his analysis was giving him lots of new insights. It suggested that the enterprise was actually distributing many more manufacturers’ products to many more retailers than they had previously imagined
Just to make sure that this was not due to an error in the SQL, the code was checked and proved to be working correctly.
It seemed that the data in the three regional systems had actually been concealing valuable sales intelligence. Excited by these new insights, the regional manager built a compelling business case for expanding his distribution facilities and presented it to head office.
Doing due diligence, head office staff checked the ‘exciting new insights’ against their own distribution analysis taken from the corporate distribution system – the system that had been the original source of the data for the three regional databases.
The distribution analyses figures did not match. The regional system was showing much more activity than the corporate system. Something was wrong, but what?
Together IT and the business checked the source data against the data that had been originally extracted into the three regional systems and found that it matched.
The SQL query that produced the analysis was checked and double checked. It too was correct.
Many hundreds of man-hours were consumed in trying to solve this huge anomaly.
It’s All Academic
It was not until an external consultant was brought in, as a last ditch effort, that the mystery was solved. He listened to the history, looked at the tables and asked, “Have you heard about Fifth Normal Form?” A few had heard of it but no one had any idea what it was, other than some esoteric data rule that might have some relevance in academia, but none in a commercial enterprise. How wrong they were!
Violations of Fifth Normal Form (5NF) trip up innumerable data projects around the globe every year. So what exactly is 5NF? It is a normalization error that occurs when you ‘over normalize’ data tables and then try to recombine them.
In the distribution company this ‘over normalization’ occurred when the original three-column table (Fig 1) was split into three separate two-column tables (Fig 2). Performing queries on the individual tables will not result in any errors as, in standalone mode, there is nothing wrong with their structure.
The violation of 5NF occurs when you try to combine the columns through a query that joins all three tables. This is not just an academic error. It is a fatal error that consistently produces false values as demonstrated in Fig 3 below.
The row highlighted by the red arrows did not exist in the original table in Fig 1. Yet every time that you run a query that joins these three separate tables this extra phantom row is created.
This is just one phantom row generated from just four rows in the original table. Imagine how many would be generated if the original table had contained 10,000 or 1,000,000 rows!
Once fragmented in this way, any attempt to recombine the tables will always result in phantom rows being generated – and there is absolutely no way of ascertaining which rows these are.
What’s 5NF got to do with Big Data?
The fact is that there are probably billions of Big Data sets out there which, when queried on a standalone basis, will give totally accurate results. However, these same data sets, if joined together through a query, will always produce spurious extra rows. This means that the new ‘insights’ that Big Data is throwing up for your enterprise may well be the biggest lies that you could tell yourself!
When these structures exist, Big Data will always lie to you and you will have no way of telling which elements of data generated by the query are the lies!
The only thing you can do to prevent these errors is to test to see if the data structures violate 5NF – before joining the data! How to do this is far too much to cover here. I show one technique for checking this in my book IMM Data Structure Modeling.