Data Deduplication

Data Deduplication, also referred to as Dedup, is the process of identifying and removing duplicate data from a dataset. By eliminating redundant records, Dedup improves the quality and accuracy of the data, ensuring more reliable and efficient analysis. It also optimizes storage by removing redundant copies, lowering storage costs and improving the performance of data processes.

The Calibo Accelerate platform supports data deduplication using Databricks in the data quality stage of a data pipeline. Currently you can perform deduplication on data in an Amazon S3 data lake.

After you create a data deduplication job, the job runs in two parts:

  • In the first part of the job run, you select the deduplication column, match ratio, and unique identifier. The job processes these inputs and returns a list of duplicate records. From this list, you select the records that you want to retain, and then run the second part of the job.

  • In the second part of the job run, the output data is generated based on the records that you chose to retain.
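The first part of this flow can be sketched in plain Python. This is a minimal, hypothetical illustration of how duplicates might be identified from a deduplication column, a match ratio, and a unique identifier; the function name, field names, and sample data are assumptions, not the platform's actual implementation.

```python
# Hypothetical sketch: flag pairs of records as duplicates when the
# values in the chosen deduplication column are similar enough.
from difflib import SequenceMatcher

def find_duplicates(records, dedup_column, unique_id, match_ratio=0.9):
    """Return (id_a, id_b, ratio) for record pairs whose dedup_column
    values meet or exceed the match-ratio threshold."""
    duplicates = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a = records[i][dedup_column].lower()
            b = records[j][dedup_column].lower()
            ratio = SequenceMatcher(None, a, b).ratio()
            if ratio >= match_ratio:
                duplicates.append(
                    (records[i][unique_id], records[j][unique_id], round(ratio, 2))
                )
    return duplicates

records = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Globex Inc"},
]
print(find_duplicates(records, "name", "id", match_ratio=0.9))
# → [(1, 2, 1.0)]
```

The match ratio acts as a similarity threshold: at 1.0 only exact (case-insensitive, here) matches are flagged, while lower values also catch near-duplicates such as typos.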

To create a data deduplication pipeline

  1. On the home page of Data Pipeline Studio, add the following stages and connect them as shown below:

    • Data Lake - Amazon S3

    • Data Deduplication - Databricks

    [Image: Create Data Deduplication Pipeline]

  2. Configure the Amazon S3 node.

  3. Click the data deduplication node and click Create Job.

  4. Provide the following inputs for each stage of the data deduplication job:

To run the data deduplication job

Now that the data deduplication job is created, you can run the job:

  • In the first part of the job run, the algorithm identifies and fetches duplicate records based on inputs such as the deduplication column, match ratio, and unique identifier.

  • In the second part, deduplication is performed based on your inputs about which duplicate records to retain, and the output data is generated.

  1. Click the Databricks node and click Start.

  2. After the job run is successful, the Inputs for Deduplication option is enabled. Click Inputs for Deduplication.

Provide Inputs for the Data Deduplication Job

In this step, you view the output of the first job run: a list of duplicate records based on the deduplication column and unique identifier that you selected. Review the data and do the following:

  1. Deselect the records that you want to retain.

  2. Click Save if you want to review the records again.

  3. Click Save and Run to save your deselection and initiate the second part of the job run. The Save and Run button is enabled once the first job run is complete.

  4. After the job run is complete, click the Output Data tab.

  5. Click the Modified Data and Source Data tabs to compare the deduplicated output with the original data.
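The retain-and-generate step above can be sketched as follows. This is a hypothetical illustration only: given the IDs flagged as duplicates and the IDs the user chose to retain, the remaining duplicates are dropped to produce the output data. The function and field names are assumptions, not the platform's implementation.

```python
# Hypothetical sketch of the second part of the job run: drop every
# flagged duplicate except the records the user chose to retain.
def generate_output(records, unique_id, duplicate_ids, retained_ids):
    """Return the deduplicated output dataset."""
    drop = set(duplicate_ids) - set(retained_ids)
    return [r for r in records if r[unique_id] not in drop]

records = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Globex Inc"},
]
# Records 1 and 2 were flagged as duplicates; the user retained record 1.
print(generate_output(records, "id", duplicate_ids={1, 2}, retained_ids={1}))
# → [{'id': 1, 'name': 'Acme Corporation'}, {'id': 3, 'name': 'Globex Inc'}]
```

Record 3 was never flagged as a duplicate, so it passes through to the output untouched; only the non-retained duplicate (record 2) is removed.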

What's next? Databricks Data Profiler