The advent of big data has led organizations to invest in systems capable of storing and processing data at unprecedented scale. Some enterprises have also reshaped their existing IT infrastructure to take advantage of this trend, and these systems have yielded tangible results: increased revenue and lower costs. Impressive as those results may be, to truly scale the value these platforms deliver, they must be governed. This brings us to data governance. While the term sends shivers down the spine of many data practitioners, data governance, properly implemented, greatly improves the outcomes of big data systems. What many enterprises do out of fear of data governance, however, is telling: they try to solve the problem with a technology-only approach. It is well known that technology alone rarely suffices. The pitfall of a technology-only approach is that once you commit to a particular technology stack, that stack must be revisited when results plateau and you want to scale, especially when you want to optimize for data governance.
What is Data Governance?
Before we try to understand what data governance is, let's clear up some misconceptions about it. Data governance is not data lineage, master data management (MDM), or data stewardship, though these terms are often used in conjunction with it. Each is a component of an organization's data governance practice, but none of them encompasses it or gives the whole picture.
So, what is data governance? At its heart, data governance is the formal management of an organization's data to derive value from it, through a combination of people and processes, with technology used to simplify and automate aspects of the work. Consider data security as an example. An enterprise's most important asset is its sensitive and confidential data, and that data must be protected by processes that keep it confidential while exposing all or parts of it to users with a legitimate need to know. People decide who should and should not have access to which parts of the data; technologies such as identity management systems and permission management capabilities simplify and automate key parts of the task.

As the speed and volume of incoming data increase (especially in the age of big data), it becomes nearly impossible for humans (data stewards or security analysts) to reconcile and classify new data and integrate it with the data already present. A typical approach in many enterprises is to lock new data in a holding cell until someone finds time to classify it. Fortunately, your organization does not need to work this way. Technology providers have devised ways to classify data as it arrives, or very soon thereafter. By leveraging such technologies, a key prerequisite of authorization is met while timely access to data is preserved, minimizing time to insight.
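To make classify-on-arrival concrete, here is a minimal sketch of rule-based tagging at ingest. The patterns and labels are hypothetical; a real deployment would use a vendor's classifier or a richer rule set maintained by data stewards.

```python
import re

# Illustrative sensitivity patterns (assumed, not from any particular product).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def classify_record(record: str) -> set[str]:
    """Tag a record with sensitivity labels as it is ingested,
    instead of parking it in a holding cell for later review."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(record)}

# Labels are attached the moment the record arrives.
labels = classify_record("Contact jane@example.com, SSN 123-45-6789")
```

Because records are labeled on arrival, authorization checks can key off the labels immediately, which is what preserves timely access while meeting the need-to-know prerequisite.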
The 3 Vs of Big Data
Let’s consider the three Vs of Big Data:
Volume is the sheer amount of data stored in big data systems, which can run beyond petabytes.
Variety reflects the shift in the data itself: prior to the big data age, relational databases were the mainstream. In the era of big data, the data housed in these systems can be structured, semi-structured, or unstructured (images, video, audio, and so on).
Velocity is the speed at which data moves. Companies today need rapid ingestion of data from devices around the globe, which enables real-time analytics and visualization.
Big Data Governance
Governing the data stored in big data systems such as Hadoop can be complicated. The typical approach in most organizations is to stitch together different clusters, each of which serves a separate purpose or stores and processes data differently, as files, tables, and streams. This approach runs against the very spirit of data governance: even when the stitching and securing are done correctly, the gaps between clusters remain exposed.
The solution is a converged architecture, in which the different types of sub-systems are integrated into a single data repository where security and governance become easier.
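One way to see why convergence helps: a single repository allows a single policy store that governs files, tables, and streams alike, instead of one mechanism per cluster. The roles, object types, and actions below are purely illustrative assumptions.

```python
# One policy store for the whole converged repository (illustrative names).
# In a stitched-together deployment, each cluster would carry its own
# separate version of this table, and the gaps between them are where
# governance breaks down.
POLICIES = {
    ("analyst", "table"): {"read"},
    ("analyst", "file"): {"read"},
    ("pipeline", "stream"): {"read", "write"},
    ("admin", "table"): {"read", "write"},
}

def is_allowed(role: str, object_type: str, action: str) -> bool:
    """Single authorization check covering files, tables, and streams."""
    return action in POLICIES.get((role, object_type), set())
```

With one check for every object type, an auditor reviews one policy table rather than reconciling several per-cluster mechanisms.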
A related problem arises when these systems are used for AI. In many enterprises, big data platforms for AI are an amalgamation of machine learning engines sitting atop different systems; Spark and Hive are two of the common engines, and most enterprises simply pick one of the two. The trouble is that such technology-first solutions fall short from a data governance perspective. These engines lack audit features, so they cannot answer questions such as "Who did this?", "Where did this come from?", and "How did this happen?". Nor do they share the same security mechanisms, which causes inconsistencies in data lineage, a key aspect of data governance.
Fortunately, new technological advances have produced systems that solve for data lineage with a more prescriptive approach: stream-based architecture. In a stream-based architecture, data streams are published to all publishers and subscribers of a particular stream of metadata, and consumers can propagate streams of data down to lower tiers of consumers and sub-systems throughout the cluster. This makes questions of data lineage easy to answer: because all actions and events about the data are tracked and recorded, administrators can rewind and replay the stream.
A few conditions must be met, however, in order to comply with regulatory requirements:
⦁ Published events cannot be changed; that is, the stream of metadata itself is immutable.
⦁ Adequate permissions are set for all users based on their role and level in the hierarchy.
⦁ Proper audit logs capture the metadata and track it at all levels.
⦁ Adequate replication clusters are configured to allow for the global replication of data streams.
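The stream properties above can be sketched with an append-only event log: events are immutable once published, nothing is updated or deleted, and lineage questions are answered by replaying the log. This is a toy model under those stated assumptions, not the API of any real streaming product.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)  # frozen: a published event cannot be changed
class Event:
    user: str
    action: str
    dataset: str

@dataclass
class MetadataStream:
    """Append-only stream of metadata events (illustrative model)."""
    _log: List[Event] = field(default_factory=list)

    def publish(self, event: Event) -> None:
        self._log.append(event)  # append-only: no update or delete path

    def replay(self) -> Tuple[Event, ...]:
        """Rewind and replay every recorded event, e.g. for an audit."""
        return tuple(self._log)

stream = MetadataStream()
stream.publish(Event("etl-job", "write", "sales_raw"))
stream.publish(Event("alice", "read", "sales_raw"))

# "Who did this?" / "Where did this come from?" answered from the log:
writers = [e.user for e in stream.replay() if e.action == "write"]
```

Replaying the log here recovers that `etl-job` produced `sales_raw`; in a real deployment the same log would also feed the audit trail and be replicated globally, per the conditions above.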
The Bottom Line
The right tools and technology are critical to a robust data governance practice, but that practice remains rooted in the combination of people and processes. Technology can automate aspects of data governance and seal the fissures that would otherwise open up in fragmented systems lacking a sound governance practice.