Databricks Issue Resolver
In the issue resolver stage, you improve data quality in various ways: handling duplicate data, missing data, and outliers; specifying the partitioning order; handling case sensitivity; performing string operations; and so on.
- In the data quality stage, add an issue resolver node and connect it to and from the data lake.
- Click the issue resolver node and click Create Job to create an issue resolver job.
- Provide the following information to create an Issue Resolver job:
Job Name
- Template - this is automatically selected depending on the selected stages.
- Job Name - provide a name for the issue resolver job.
- Node Rerun Attempts - the number of times the job is rerun in case of failure. The default value is set at the pipeline level.
- Fault Tolerance - select the behavior of the pipeline upon failure of a node. The options are:
  - Default - subsequent nodes are placed in a pending state, and the overall pipeline shows a failed status.
  - Skip on Failure - the descendant nodes stop and skip execution.
  - Proceed on Failure - the descendant nodes continue their normal operation on failure.
- Click Next.
Source
- Source - this is automatically selected depending on the type of source added in the pipeline.
- Datastore - this is automatically selected depending on the configured datastore.
- Source Format - select either Parquet or Delta table.
- Choose Base Path - you can select one file at a time.
- Issue Resolver Constraints - select the following constraints based on your requirements (see the sketches after this list):
  - Handle Duplicate Data - select a column with a unique key for handling duplicate data.
  - Partitioning Order - select the partitioning order for the required columns; choose ascending or descending.
  - Handle Missing Data - select the action to take on columns with missing data; choose Remove Null/Empty or Update Null/Empty.
  - Handle Outliers - specify the integer value for handling outliers.
  - Handle String Operations - perform string operations on the selected columns. String operations include Trim, LTrim, RTrim, LPad, RPad, Regex Replace, and Sub-string.
  - Handle Case Sensitivity - specify whether to handle case sensitivity. Options are lower case, upper case, and proper case.
  - Replace Selective Data - specify values to replace existing values. You can specify multiple comma-separated values.
  - Handle Data Against Master Table - perform a lookup against the master table in S3 and then perform an append, overwrite, or upsert operation on the selected columns.
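The platform applies these resolutions for you when the job runs; you do not write any code. Purely as an illustration of what the individual constraints correspond to in PySpark, here is a minimal sketch that assumes a hypothetical source path and hypothetical columns (order_id, amount, city):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and column names, for illustration only.
df = spark.read.format("delta").load("s3://my-bucket/base/path")

# Handle Duplicate Data: keep one row per unique key column.
df = df.dropDuplicates(["order_id"])

# Handle Missing Data: either remove or update null/empty values.
df = df.na.drop(subset=["amount"])            # Remove Null/Empty
df = df.na.fill({"city": "UNKNOWN"})          # Update Null/Empty

# Handle Outliers: filter out values beyond an integer threshold.
df = df.filter(F.col("amount") <= 10000)

# Handle String Operations: Trim, LPad, Regex Replace, Sub-string, and so on.
df = (df.withColumn("city", F.trim("city"))
        .withColumn("order_id", F.lpad("order_id", 10, "0"))
        .withColumn("city", F.regexp_replace("city", r"\s+", " "))
        .withColumn("city_code", F.substring("city", 1, 3)))

# Handle Case Sensitivity: lower, upper, or proper (initcap) case.
df = df.withColumn("city", F.initcap("city"))

# Replace Selective Data: swap specific existing values for new ones.
df = df.replace({"N/A": "UNKNOWN", "-": "UNKNOWN"}, subset=["city"])

# Partitioning Order: order rows by the chosen columns.
df = df.orderBy(F.col("order_id").asc(), F.col("amount").desc())
```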
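For Handle Data Against Master Table specifically, the lookup followed by an append, overwrite, or upsert can be pictured as a Delta Lake merge. This is a sketch only, with hypothetical paths and a hypothetical key column (customer_id), not the platform's actual implementation:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations; the platform derives the real ones from your selections.
master = DeltaTable.forPath(spark, "s3://my-bucket/master/customers")
updates = spark.read.format("delta").load("s3://my-bucket/issue-resolver/output")

# Upsert: update master rows that match on the key column, insert the rest.
(master.alias("m")
       .merge(updates.alias("u"), "m.customer_id = u.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```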
Target
- Target - this is automatically selected depending on the type of target you select in the pipeline.
- Datastore - this is automatically selected depending on the configured datastores to which you have access.
- Choose Target Format - select either Parquet or Delta table.
- Target Folder - select the target folder where you want to store the issue resolver job output. This is an optional step.
- Folder - create a folder structure inside the target folder.
- Subfolder - create a folder inside the folder that you created in the previous step.
- Audit Tables Path - this path is formed based on the selected folders. A folder named Data_Issue_Resolver_Job_audit_table is created for the data issue resolver job.
- Operation Type - select the type of operation to perform on the data while storing it in the target table (see the write sketch after this list). Choose one of the following options:
  - Append - adds new data at the end of the table without erasing the existing content.
  - Overwrite - replaces the entire content of the table with new data.
- Enable Partitioning - enable this option if you want to use partitioning for the target data. Select from the following options:
  - Data Partition - select the filename and column details, enter the column value, and click Add.
  - Date Based Partitioning - select the type of partitioning to use for the target data from the options Yearly, Monthly, and Daily. Optionally, add a prefix to the partition folder name.
- Final File Path - review the final target path. This is created based on the inputs you provide for folder, subfolder, and partitioning.
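The target format, Operation Type, and partitioning selections correspond conceptually to options of the standard Spark DataFrame writer. A minimal sketch, assuming a hypothetical date column (order_date) and a hypothetical target path:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical resolved output with a date column, for illustration only.
resolved = spark.read.format("delta").load("s3://my-bucket/issue-resolver/output")
resolved = (resolved
            .withColumn("year", F.year("order_date"))
            .withColumn("month", F.month("order_date")))

# Operation Type and partitioning map onto the DataFrame writer options.
(resolved.write
         .format("delta")               # or "parquet", per Choose Target Format
         .mode("append")                # Append; use "overwrite" to replace the table
         .partitionBy("year", "month")  # monthly, date-based partitioning
         .save("s3://my-bucket/target-folder/folder/subfolder/"))
```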
Cluster Configuration
You can select an all-purpose cluster or a job cluster to run the configured job. If your Databricks cluster is not created through the Calibo Accelerate platform and you want to update custom environment variables, refer to the following:
All Purpose Clusters
Cluster - Select the all-purpose cluster that you want to use for the issue resolver job from the dropdown list.
Note:
If you do not see a cluster configuration in the dropdown list, it is possible that the configured Databricks cluster has been deleted.
In this case, you must create a new Databricks cluster configuration in the Data Integration section of Cloud Platform Tools and Technologies. Delete the issue resolver node from the data pipeline, add a new node with the newly created configuration, and configure the job again. You can then select the newly configured Databricks cluster.
Job Cluster
Cluster Details
- Choose Cluster - Provide a name for the job cluster that you want to create.
- Job Configuration Name - Provide a name for the job cluster configuration.
- Databricks Runtime Version - Select the appropriate Databricks version.
- Worker Type - Select the worker type for the job cluster.
- Workers - Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling.
- Enable Autoscaling - Autoscaling scales the number of workers up or down within the range that you specify. This helps reallocate workers to a job during its compute-intensive phase; once the compute requirement reduces, the excess workers are removed, which helps control your resource costs.
Cloud Infrastructure Details
- First on Demand - Provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances.
- Availability - Choose the type of EC2 instances used to launch your Apache Spark clusters, from the following options:
  - Spot
  - On-demand
  - Spot with fallback
- Zone - Identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.
- Instance Profile ARN - Provide an instance profile ARN that can access the target Amazon S3 bucket.
- EBS Volume Type - The type of EBS volume that is launched with this cluster.
- EBS Volume Count - The number of volumes launched for each instance of the cluster.
- EBS Volume Size - The size of the EBS volume to be used for the cluster.
Additional Details
- Spark Config - To fine-tune Spark jobs, provide custom Spark configuration properties in key-value pairs.
- Environment Variables - Configure custom environment variables that you can use in init scripts.
- Logging Path (DBFS Only) - Provide the logging path to deliver the logs for the Spark jobs.
- Init Scripts - Provide the init (initialization) scripts that run during the startup of each cluster.
Click Complete.
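The job cluster fields above broadly mirror the new_cluster object of the Databricks Jobs API. The following is an illustrative sketch only, with placeholder values; the Calibo Accelerate platform builds and submits the actual specification for you.

```python
# Illustrative only: field names follow the Databricks Jobs API "new_cluster"
# object; all values below are placeholders, not what the platform sends.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",                 # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                         # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # or "num_workers": 4 for a fixed size
    "aws_attributes": {
        "first_on_demand": 1,                            # First on Demand
        "availability": "SPOT_WITH_FALLBACK",            # SPOT | ON_DEMAND | SPOT_WITH_FALLBACK
        "zone_id": "us-east-1a",                         # Zone
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/s3-access",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},      # Spark Config
    "spark_env_vars": {"ENVIRONMENT": "dev"},                   # Environment Variables
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},  # Logging Path
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/setup.sh"}}],   # Init Scripts
}
```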
- With this, the job creation is complete. Click Start to run the Data Issue Resolver job.
- On completion of the job, click the Data Issue Resolver Result tab and then click View Resolver Results.
- View the output of the Issue Resolver job. Click the download option to download and save the Issue Resolver results to a CSV file.
What's next? Data Quality