How to improve Data Quality with GCP Data Prep?

Tudip

04 June 2020

Data is the basic building block of every application. More precisely, clean data is what every application needs.

Every business partner, including suppliers, distributors, warehouses and other retail stores, may provide data in various shapes and levels of quality. One company may use pallets instead of boxes as a unit of storage, or pounds instead of kilograms. It may use a different category nomenclature or a different date format, and it will most likely have product stock keeping units (SKUs) that combine internal IDs with other suppliers' IDs. Some data may also be missing or may have been entered incorrectly.

Each of these data issues represents an important risk to forecasting. Analysts must clean, standardize, and gain trust in the data before they can report on it and model it accurately. This article covers key techniques for cleaning data with Cloud Dataprep, along with some features that may help improve your data quality with minimal effort.

Developers rarely want to write code just to correct errors, omissions, or inconsistencies in data. This is where Cloud Dataprep comes in: it can be used directly from the Google Cloud Console. The very first step in a Dataprep solution is selecting the datasets that need to be wrangled. Dataprep supports two main types of datasets: imported and wrangled. Imported datasets are references to source data residing in data storage systems such as files, databases, and big data file systems. Wrangled datasets are used to transform that source data using recipes.

The following are some principles to know before using Google Cloud Dataprep:

Create baseline dataset before profiling source data:

Before you start cleaning your dataset, it is helpful to create a visual profile of the source data. First, create a minimal recipe on the dataset, then click Run Job to generate a profile of the data. This profile can be used as a baseline for validating the data and for tracing the origin of any problems you discover later.
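To make the idea of a baseline concrete, here is a minimal sketch in plain Python of the kind of per-column statistics a profile captures. It is not Dataprep's profiler; the `profile` function, the column names, and the sample rows are all hypothetical and only illustrate what is worth recording before any cleaning starts.

```python
def profile(rows):
    """Return per-column counts of rows, missing values, and distinct values."""
    columns = sorted({key for row in rows for key in row})
    stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing (an assumption for this sketch).
        present = [v for v in values if v not in (None, "")]
        stats[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return stats


# Hypothetical source data, profiled before any cleaning step is applied.
source_rows = [
    {"sku": "A-100", "weight_kg": "2.5"},
    {"sku": "A-101", "weight_kg": ""},
    {"sku": "A-100", "weight_kg": "2.5"},
]
baseline = profile(source_rows)
```

Keeping such a baseline around means that later surprises, for example a sudden jump in missing values, can be checked against what the source looked like originally.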

Normalize data before applying DeDuplicate Transform:

Checking for uniqueness is a common step in data preparation, which means removing identical rows from your dataset. Cloud Dataprep provides a single transform, deduplicate, which removes identical rows from your dataset.

 However, this transform has two limitations:

  • This transform is case-sensitive. Rows containing the same values in different cases are not considered duplicates and are not removed by this transform.
  • Whitespace at the beginning and end of values is not ignored.

 It is therefore necessary to normalize your data before applying the deduplicate transform.
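The effect of these two limitations can be sketched in plain Python. This is not Dataprep's deduplicate transform, only an exact-match deduplication with the same two properties; the `normalize` and `deduplicate` helpers and the sample rows are hypothetical.

```python
def normalize(value):
    """Trim surrounding whitespace and lowercase string values."""
    return value.strip().lower() if isinstance(value, str) else value


def deduplicate(rows):
    """Drop rows that are exactly identical, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical rows that differ only in case and surrounding whitespace.
rows = [
    ("Acme", "Boxes"),
    (" acme ", "boxes"),
    ("ACME", "BOXES "),
]

# Exact matching treats all three rows as different.
without_normalizing = deduplicate(rows)

# After normalizing, the three rows collapse into one.
normalized = [tuple(normalize(v) for v in row) for row in rows]
with_normalizing = deduplicate(normalized)
```

Without the normalization step all three rows survive; with it, only one remains.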

Join early and Union later:

You can enrich your data by joining or unioning datasets from multiple sources. Join operations should be performed early in your recipe, which reduces the chance that later changes or updates to your join keys affect the results of the join. Union operations should be performed later. By following this sequence, you minimize the chance that changes to the union operation, including dataset refreshes, affect the rest of the recipe and the output.
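The ordering can be illustrated with a small plain-Python sketch. It does not use any Dataprep API; `left_join`, `union`, `recipe`, and the sample datasets are hypothetical stand-ins. The join is the first step of the recipe, while the union is applied only at the very end.

```python
def left_join(rows, lookup, key):
    """Attach matching lookup columns to each row; unmatched rows are kept as-is."""
    indexed = {item[key]: item for item in lookup}
    return [{**row, **indexed.get(row[key], {})} for row in rows]


def union(*datasets):
    """Stack datasets on top of each other, preserving order."""
    return [row for dataset in datasets for row in dataset]


def recipe(rows, products):
    # Step 1: join early, while the join key is still untouched.
    joined = left_join(rows, products, "sku")
    # Step 2: later cleaning steps work on the already-joined rows.
    return [
        {**row, "category": row.get("category", "unknown").title()}
        for row in joined
    ]


# Hypothetical datasets from two stores plus a product lookup table.
products = [{"sku": "A-100", "category": "tools"}]
store_a = [{"sku": "A-100", "qty": 4}]
store_b = [{"sku": "B-200", "qty": 1}]

# Last step: union the prepared datasets.
combined = union(recipe(store_a, products), recipe(store_b, products))
```

Because the union comes last, refreshing or replacing one of the store datasets does not change how the earlier steps behave for the other.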

Use statistical information to evaluate generated data:

Once you have finished your recipe and run the job, you can open the profile of the generated output and the profile you created for the source data in separate browser tabs. This lets you evaluate how consistent and complete your data remains from the beginning to the end of the wrangling process.

Instead of comparing the data row by row, compare the statistical information in the generated profile with the statistics generated from the source. This helps you identify whether your recipe has introduced any unwanted changes to these values.
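As a rough illustration of comparing statistics rather than rows, the sketch below diffs two profile summaries in plain Python. The `compare_profiles` helper, the statistic names, and all the numbers are made up for the example; in practice the values would be read from the profiles Dataprep generates.

```python
def compare_profiles(before, after):
    """Return {column: {stat: (before, after)}} for every statistic that changed."""
    changes = {}
    for column in sorted(set(before) | set(after)):
        old = before.get(column, {})
        new = after.get(column, {})
        diff = {
            stat: (old.get(stat), new.get(stat))
            for stat in sorted(set(old) | set(new))
            if old.get(stat) != new.get(stat)
        }
        if diff:
            changes[column] = diff
    return changes


# Hypothetical statistics copied from a source profile and an output profile.
source_profile = {
    "sku": {"rows": 1000, "missing": 0, "distinct": 940},
    "weight": {"rows": 1000, "missing": 12, "distinct": 310},
}
output_profile = {
    "sku": {"rows": 1000, "missing": 0, "distinct": 940},
    "weight": {"rows": 1000, "missing": 57, "distinct": 298},
}

changes = compare_profiles(source_profile, output_profile)
```

Here only the `weight` column is reported, because its missing and distinct counts changed, which is the kind of signal that suggests reviewing the recipe steps touching that column.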

After profiling source data, keep Recipe records:

For record-keeping purposes, you can click View Recipe to copy and paste the recipe used to create the profile. You can also use Download Recipe to save the recipe as a text file.

A few major benefits of Google Cloud Dataprep:

  1. It is easy to use and to ramp up on.
  2. You do not need to run any code on a VM yourself.
  3. Its convenient GUI lets you see the statistics, the whole process, and the metadata for your data.
  4. It runs jobs efficiently, with performance in mind.

Cloud Dataprep completes the circle in Google Cloud's data science pipeline, which includes services in areas such as analytics, transformation, storage, and data science. More importantly, data wrangling services such as Dataprep are a strong differentiator for Google Cloud against competing platforms such as AWS or Azure.
