Data and its V’s
Volume, variety, velocity, and veracity are all certainly important considerations when we talk about data, but I want to focus on a different data V today: data and its Value.
A most valuable resource, probably
When folks talk about data value, sometimes they emphasize the value to consumers. The development of consumer-controlled data markets comes up a lot, and there is usually some talk about compensating users for their specific data. Recently, some politicians have argued that tech companies should pay a data dividend to consumers (or at least to US-based consumers).
Others focus on the value of data to a business. This may be quantifiable-by-proxy, like when a company is sold to another for the value of its data. Or it may be quantifiable by looking at the amount spent on data and associated processing costs.
More recently, researchers have begun to theorize about the value of data inside the context of a model. Here, data valuation is tied to the information gain resulting from exposing some data to a model. In these sorts of frameworks, the data is worth whatever can be learned from it.
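One way to picture this kind of in-model valuation is a leave-one-out sketch: retrain without each point and measure the change in held-out error. Everything below is illustrative, using a deliberately trivial "model" (predict the mean of the training labels) and invented numbers; real frameworks such as Data Shapley are far more sophisticated, but the intuition is the same.

```python
# Leave-one-out data valuation sketch. The "model" is just the mean of
# the training labels, and the dataset is made up for illustration.

def fit_mean(points):
    """'Train' the simplest possible model: the mean of the labels."""
    return sum(points) / len(points)

def val_error(model, val_points):
    """Mean absolute error of the constant model on validation labels."""
    return sum(abs(model - y) for y in val_points) / len(val_points)

def leave_one_out_values(train, val):
    """Value of each point = held-out error without it minus error with it.
    Positive means the point helped; negative means it hurt."""
    full_err = val_error(fit_mean(train), val)
    values = []
    for i in range(len(train)):
        rest = train[:i] + train[i + 1:]
        err_without = val_error(fit_mean(rest), val)
        values.append(err_without - full_err)
    return values

train = [1.0, 1.2, 0.9, 8.0]   # the 8.0 is an outlier
val = [1.0, 1.1]
values = leave_one_out_values(train, val)
# The outlier's value comes out negative: the model learns more
# (i.e., validates better) without it.
```

The useful takeaway from even a toy like this is that valuation under such a framework is signed: some data makes the model worse, which is exactly the possibility the next section turns to.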
But it’s not quite that simple.
It’s so immediately obvious to most of us that data are valuable that it’s easy to overlook the irony affecting our position: we have almost no data on the value of data. We hardly even have any theories about it, let alone evidence to test them if we did.
This lack of insight into the accounting and/or economic value of data matters because it highlights an important assumption that we need to do a better job of examining.
In particular, when people talk about data and its value, they tend to start from a cognitive bias: whatever the data are worth, they must be worth something, and that something must be positive.
That bias, it turns out, is very often wrong.
Log everything and prosper, or, All your bases are belong to swamp
The prevailing wisdom of logging and measuring everything gained industry adoption around the same time that more and more people were becoming aware that more data tends to beat better algorithms.
“Log everything and keep it forever,” the logic went. “Storage is cheap and you never know when the data might be useful somehow. It’s better to have it, just in case.”
It was an alluring argument, and it would have been nice if it were true.
The problem is that the storage and processing costs of data (which are indeed vanishingly small at the margin) account for only a fraction of the total cost that must be paid for that data to exist.
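To make the gap concrete, here is a back-of-envelope comparison. Every number in it is a made-up assumption for illustration (the storage price, dataset size, engineer hours, and hourly rate), not a benchmark:

```python
# Back-of-envelope sketch: the marginal storage bill vs. the other
# costs required for a dataset to exist. All numbers are hypothetical.

dataset_gb = 500
storage_per_gb_month = 0.02       # assumed $/GB-month for object storage
storage_per_year = dataset_gb * storage_per_gb_month * 12   # $120/year

# Costs the "storage is cheap" argument tends to ignore (all assumed):
pipeline_hours = 40               # yearly engineer time: schema changes,
                                  # broken pipelines, "what is this table?"
hourly_cost = 100                 # assumed loaded hourly rate
maintenance_per_year = pipeline_hours * hourly_cost         # $4,000/year

total_per_year = storage_per_year + maintenance_per_year
storage_share = storage_per_year / total_per_year           # a small slice
```

Under these assumptions the storage line item is only a few percent of the yearly total, and that is before counting compliance, security, and opportunity costs.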
It’s a particularly easy fallacy to notice if instead of thinking about the data you are putting in a database, you think about the code you are putting in the codebase.
Nobody would seriously argue that a codebase has been made better by how big it is or how quickly it grows. If anything, they would insist the opposite is true.
But with data, we tend to be much more forgiving about what we allow in.
And our data lakes often grow into data swamps as a result.
The Five L’s of Data Value
Instead of logging everything and keeping it for as long as possible, data professionals should have focused on logging everything valuable and holding on to it only as long as its value to the business remains non-negative.
We’re still a long way away from being able to quantify “data value” with the sort of precision and rigor we try to apply to everything else, but we should still begin to orient our thinking around value and whether or not we are adding any with our data.
Towards that end, I would like to propose the Five L’s of Data Value as a thinking device to go alongside “The Three/Four/Five V’s of Data.” They are:
Lost: How much noise/clutter will be added to the data warehouse with the addition of these data? How many users will be able to find these new data? How many of those users have any interest, desire, or need to?
Leaked: How sensitive are the data? If they were stolen or involved in a breach, what would be the impact? How valuable would the data be to an attacker?
Lied: Are these data true, and what is the impact if they are not? Would any system or user benefit from lying and attempting to poison the data set? Alternatively, what are the impact and likelihood of the data being used as an invalid source of truth?
Liability: What is the compliance overhead of keeping these data in existence? What is the risk/cost of a compliance mistake being made?
Learned: What do we think we can learn from these data? What has been learned? How important is it to continue learning from these data? When would a lesson learned from these data expire or become invalid?
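The Five L’s can be turned into a rough scorecard, and that scorecard into a value-based retention rule. The sketch below is purely illustrative: the attribute names, 0–10 scoring scale, uniform weights, and example assets are all hypothetical, not a calibrated methodology.

```python
# Hypothetical Five L's scorecard: four cost dimensions and one
# benefit dimension, each scored 0-10. Names and weights are assumed.

LS = ("lost", "leaked", "lied", "liability", "learned")

def net_value(scores, weights=None):
    """Crude net value of a data asset: learning benefit minus the
    summed cost-side scores. Higher is better; negative suggests purge."""
    weights = weights or {l: 1.0 for l in LS}
    cost = sum(weights[l] * scores[l] for l in LS if l != "learned")
    benefit = weights["learned"] * scores["learned"]
    return benefit - cost

# Two invented example assets:
assets = {
    "clickstream": {"lost": 7, "leaked": 3, "lied": 4,
                    "liability": 2, "learned": 5},
    "orders":      {"lost": 1, "leaked": 4, "lied": 1,
                    "liability": 3, "learned": 9},
}

# A value-based retention rule: keep only while net value is non-negative.
keep = {name: net_value(scores) >= 0 for name, scores in assets.items()}
```

Even this crude version forces the right conversation: the hypothetical clickstream asset scores high on clutter and poisoning risk relative to what is being learned from it, so the rule flags it for purging.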
Thoughts for the future
Our databases, like our codebases, require careful and frequent pruning.
As the theory and practice of data valuation advances, we can expect to see the emergence of value-based data retention strategies, for example, and of ML frameworks that make intelligent, optimized decisions against the costs of accessing training data.
Alongside this, we need to develop better habits of working with our data, and to get into the practice of purging worthless and negatively valued data.
If it helps, maybe try using the Five L’s of Data Value and think about what they have to say about some of your existing data assets. Which ones are most valuable in this framework? Which are most costly? Is this what you would have expected, and does it ring true to you and your stakeholders? Is it consistent with whatever existing cost/resource tracking you have, or does this framework say something else?
I’ll be back in the next two weeks with some thoughts on the architecture of an epidemic: our relationships with data and technology. Stay tuned, and let me know your thoughts!
Has data been on your mind a lot lately? Mine too! If you’d like, feel free to snag a time on my calendar and let’s have a chat about data!