
It takes ZERO technical skills to accomplish.
The employment outlook for data careers is booming. Whether you are looking to be a data scientist, data analyst, data architect or BI engineer, there is no shortage of openings at companies all over the world. Because of this boom, many analysts are honing their technical skills to compete in the often IT-centered data world. But while Python, SQL and R are important skills, do not forget that businesses need an accurate view of the who, what, where, when and how of their data assets. Getting into the weeds of the processes that create business data is at the heart of data curation, and it does not take any technical skill.
What is Data Curation?
The word curation evokes visions of museums and the documentation of priceless artifacts. That is a perfect way to think about it, because your stored data is somewhat like a museum of your organization or research. Data curation involves documenting how and where a dataset is stored, how it is created, who should have access to it and who manages it. Curation should tell the story of a dataset in a way that is accessible for non-technical users. Essentially, it is a plain language metadata catalog for datasets, because it does not help anyone to store inaccessible, unintelligible data.
Data curation concerns:
- Describing analysis variables
- Identifying datasets and collections
- Describing data pipelines and data lineage
- Documenting business processes being described
- Describing business rules
- Reasoning behind data decisions
- Which departments have access to the collections and datasets
- Ways the data is being used
So now that we have covered the what, we can cover the why, and the top 5 reasons you should incorporate data curation in your overall strategy.
1. It Helps You Explain Data Naming and Transformation Choices
Have you ever found different departments in your organization using different names for the same thing? Do you need to consult data owners for their insights whenever you look at a dataset? When aggregating data from several sources, many changes occur between the sources and the final product. Column names could be changed, or product names or types might be combined. All of this can lead to confusion within an organization or, even worse, duplication across datasets. Creating a curated catalog connects knowledge bases across teams, which is what you want when planning a mature data strategy that supports self-service. A single point of truth and a clear definition of what each variable represents are vital to self-service BI.
2. Analysts Cannot Use Data They Do Not Know About
You would never think of holding any other asset without properly documenting it. When an organization goes out of its way to create data warehouses or data lakes, it is a huge investment. The only way to get any return on this investment is to use the data. If not, the whole thing is a waste. Organizational data lakes and warehouses can be vast repositories, and it is easy to overlook useful datasets. Having a simple searchable catalog ensures your analytics teams know how to find what they are looking for so they can best leverage your information assets.
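For readers who do want to script it, here is a minimal sketch of what a "simple searchable catalog" could look like. Everything in it is an assumption made for illustration: the CatalogEntry fields, the search_catalog helper, and the two sample datasets are hypothetical and do not come from any particular catalog tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    # Hypothetical fields; adapt them to whatever your team documents.
    name: str
    description: str
    owner: str
    tags: List[str] = field(default_factory=list)


def search_catalog(entries, term):
    """Return entries whose name, description, or tags contain the term (case-insensitive)."""
    needle = term.lower()
    return [
        entry
        for entry in entries
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


# Made-up sample entries, for illustration only.
catalog = [
    CatalogEntry("sales_orders", "One row per customer order", "Sales Ops", ["orders", "revenue"]),
    CatalogEntry("bus_route_survey", "Rider sentiment questionnaire responses", "Transit Planning", ["survey"]),
]

matches = search_catalog(catalog, "survey")
print([entry.name for entry in matches])
```

A plain substring match like this is enough for a small collection. A spreadsheet filter does the same job without any code at all.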
3. Curation can Improve Data Security
Curation should not only tell the story of your data up to the final storage, but it should also give information about how and where it is currently being used and who is using it. Keeping accurate details on data asset access is vital to support your security efforts. We do not give everyone in the organization keys to supply closets and the same goes for important data. Being on top of documenting access will support security audits, threat modeling, and strengthen controls. Another bonus is that documenting how the data is already being used and by whom gives others a great basis for creating their own use cases when they need to request access to a dataset. Or they could find that what they need already exists (no duplicate effort!). Data curation is a must for a solid security plan.
4. Curation Can Improve the Quality of Your Data
It is obvious why data quality is important. Basing decisions on bad data is bad business. Sometimes quality issues are not immediately obvious. In many cases how a dataset is created determines how it can be appropriately used. Proper documentation and curation ensure that analysts understand how a dataset is meant to be used. Data curation techniques prevent users from misinterpreting the dataset or misusing it. Inconsistent use can result in two teams getting different answers to the same question. That creates confusion and lost faith in the analysis.
5. Helps with Ethical Interpretation
The ethical issues surrounding big data, machine learning, and AI come up constantly in recent discussions. Computers are not inherently biased, but data and the processes used to collect or create it can be. One way to get ahead of these concerns is to have clear data provenance. When we know the motivation for collection, who collected the data, the methods used, and the conditions under which it was collected, it is easier to understand where ethical concerns could arise. Much data collection directly or indirectly concerns human subjects, and it is important to remember the humans behind the numbers. When analysts fully understand the history of the data, it becomes clear who or what is included or not included, and why. For example, imagine using an English-only questionnaire to understand sentiment about the usefulness of a new bus route in your city, when your city also has a large population that does not speak English at all. Because the questionnaire was only distributed in English, chances are the responses exclude the population that does not speak English. This is an important limitation of the dataset because it is not representative. Adequate documentation gives your team the tools it needs to interpret data ethically.
Getting Started
Now that you know why data curation is important, here are some tools to help you get started. Depending on the size of your collection and team, you could start with something as simple as a spreadsheet detailing all available datasets, along with how, when and why they were created, who can access them and how they are currently being used. Check out this data biography template from We All Count to get an idea of how this looks:
https://docs.google.com/spreadsheets/d/1Ych5dzBfGLoQGYb-Jtq6VMn0PKdj_Y_tk6nGjopEduw/edit?usp=sharing
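If you would rather generate the starter spreadsheet with a script, the sketch below writes one as CSV text that any spreadsheet program can open. It is only an illustration: the column names mirror the how, when, why, access and usage questions above, but they are assumptions and are not taken from the We All Count template, and the sample row is made up.

```python
import csv
import io

# Hypothetical column names based on the questions listed above.
COLUMNS = [
    "dataset",
    "how_created",
    "when_created",
    "why_created",
    "who_can_access",
    "current_uses",
]


def catalog_to_csv(rows):
    """Render a list of dataset descriptions (dicts) as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example row, for illustration only.
rows = [
    {
        "dataset": "bus_route_survey",
        "how_created": "English-only paper questionnaire",
        "when_created": "2023",
        "why_created": "Gauge sentiment about a new bus route",
        "who_can_access": "Transit Planning",
        "current_uses": "Route planning report",
    }
]

csv_text = catalog_to_csv(rows)
print(csv_text)
```

Saving csv_text to a file ending in .csv gives you a starting catalog that non-technical colleagues can keep editing by hand.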
If you are in a larger organization, check the tools you are currently using. Most offer data catalog software that plays nicely with the tools you already have set up. In either case, happy curating!