Databricks Issue Resolver
In the issue resolver stage, you improve data quality in various ways: handling duplicate data, missing data, and outliers; specifying the partitioning order; handling case sensitivity; performing string operations; and so on.
- In the data quality stage, add an issue resolver node and connect it to and from the data lake.
- Click the issue resolver node and click Create Job to create an issue resolver job.
- Provide the following information to create an Issue Resolver job:
Job Name
- Template - this is automatically selected depending on the selected stages.
- Job Name - provide a name for the issue resolver job.
- Node Rerun Attempts - the number of times the job is rerun in case of failure. The default value is set at the pipeline level.
- Fault Tolerance - select the behavior of the pipeline upon failure of a node. The options are:
  - Default - subsequent nodes are placed in a pending state, and the overall pipeline shows a failed status.
  - Skip on Failure - the descendant nodes stop and skip execution.
  - Proceed on Failure - the descendant nodes continue their normal operation on failure.
- Click Next.
Source
- Source - this is automatically selected depending on the type of source added in the pipeline.
- Datastore - this is automatically selected depending on the configured datastore.
- Source Format - select either Parquet or Delta table.
- Choose Base Path - you can select one file at a time.
- Issue Resolver Constraints - select the following constraints based on your requirements (see the sketches after this list):
  - Handle Duplicate Data - select a column with a unique key for handling duplicate data.
  - Partitioning Order - select the partitioning order for the required columns; choose ascending or descending.
  - Handle Missing Data - select the action to take on columns with missing data; choose Remove Null/Empty or Update Null/Empty.
  - Handle Outliers - specify the integer value for handling outliers.
  - Handle String Operations - perform string operations on the selected columns. String operations include Trim, LTrim, RTrim, LPad, RPad, Regex Replace, and Sub-string.
  - Handle Case Sensitivity - specify whether to handle case sensitivity. Options are lower case, upper case, and proper case.
  - Replace Selective Data - specify values to replace existing values. You can specify multiple comma-separated values.
  - Handle Data Against Master Table - perform a lookup against the master table in S3 and then perform an append, overwrite, or upsert operation on the selected columns.
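The platform applies these resolutions for you when the job runs; you do not write any code. Purely as an illustration of what the individual constraints correspond to in PySpark, here is a minimal sketch that assumes a hypothetical source path and hypothetical columns (order_id, amount, city):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and column names, for illustration only.
df = spark.read.format("delta").load("s3://my-bucket/base/path")

# Handle Duplicate Data: keep one row per unique key column.
df = df.dropDuplicates(["order_id"])

# Handle Missing Data: either remove or update null/empty values.
df = df.na.drop(subset=["amount"])            # Remove Null/Empty
df = df.na.fill({"city": "UNKNOWN"})          # Update Null/Empty

# Handle Outliers: filter out values beyond an integer threshold.
df = df.filter(F.col("amount") <= 10000)

# Handle String Operations: Trim, LPad, Regex Replace, Sub-string, and so on.
df = (df.withColumn("city", F.trim("city"))
        .withColumn("order_id", F.lpad("order_id", 10, "0"))
        .withColumn("city", F.regexp_replace("city", r"\s+", " "))
        .withColumn("city_code", F.substring("city", 1, 3)))

# Handle Case Sensitivity: lower, upper, or proper (initcap) case.
df = df.withColumn("city", F.initcap("city"))

# Replace Selective Data: swap specific existing values for new ones.
df = df.replace({"N/A": "UNKNOWN", "-": "UNKNOWN"}, subset=["city"])

# Partitioning Order: order rows by the chosen columns.
df = df.orderBy(F.col("order_id").asc(), F.col("amount").desc())
```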
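For Handle Data Against Master Table specifically, the lookup followed by an append, overwrite, or upsert can be pictured as a Delta Lake merge. This is a sketch only, with hypothetical paths and a hypothetical key column (customer_id), not the platform's actual implementation:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations; the platform derives the real ones from your selections.
master = DeltaTable.forPath(spark, "s3://my-bucket/master/customers")
updates = spark.read.format("delta").load("s3://my-bucket/issue-resolver/output")

# Upsert: update master rows that match on the key column, insert the rest.
(master.alias("m")
       .merge(updates.alias("u"), "m.customer_id = u.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```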
Target
- Target - this is automatically selected depending on the type of target you select in the pipeline.
- Datastore - this is automatically selected depending on the configured datastores to which you have access.
- Choose Target Format - select either Parquet or Delta table.
- Target Folder - select the target folder where you want to store the issue resolver job output. This is an optional step.
- Folder - create a folder structure inside the target folder.
- Subfolder - create a folder inside the folder that you created in the previous step.
- Audit Tables Path - this path is formed based on the selected folders. A folder named Data_Issue_Resolver_Job_audit_table is created for the data issue resolver job.
- Operation Type - select the type of operation to perform on the data while storing it in the target table (see the write sketch after this list). Choose one of the following options:
  - Append - adds new data at the end of the table without erasing the existing content.
  - Overwrite - replaces the entire content of the table with new data.
- Enable Partitioning - enable this option if you want to use partitioning for the target data. Select from the following options:
  - Data Partition - select the filename and column details, enter the column value, and click Add.
  - Date Based Partitioning - select the type of partitioning to use for the target data from the options Yearly, Monthly, and Daily. Optionally, add a prefix to the partition folder name.
- Final File Path - review the final target path. This is created based on the inputs you provide for folder, subfolder, and partitioning.
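The target format, Operation Type, and partitioning selections correspond conceptually to options of the standard Spark DataFrame writer. A minimal sketch, assuming a hypothetical date column (order_date) and a hypothetical target path:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical resolved output with a date column, for illustration only.
resolved = spark.read.format("delta").load("s3://my-bucket/issue-resolver/output")
resolved = (resolved
            .withColumn("year", F.year("order_date"))
            .withColumn("month", F.month("order_date")))

# Operation Type and partitioning map onto the DataFrame writer options.
(resolved.write
         .format("delta")               # or "parquet", per Choose Target Format
         .mode("append")                # Append; use "overwrite" to replace the table
         .partitionBy("year", "month")  # monthly, date-based partitioning
         .save("s3://my-bucket/target-folder/folder/subfolder/"))
```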
Cluster Configuration
You can select an all-purpose cluster or a job cluster to run the configured job. If your Databricks cluster is not created through the Calibo Accelerate platform and you want to update custom environment variables, refer to the following:
All Purpose Clusters
Cluster - Select the all-purpose cluster that you want to use for the issue resolver job from the dropdown list.
Note:
If you do not see a cluster configuration in the dropdown list, it is possible that the configured Databricks cluster has been deleted.
In this case, you must create a new Databricks cluster configuration in the Data Integration section of Cloud Platform Tools and Technologies. Delete the issue resolver node from the data pipeline, add a new node with the newly created configuration, and configure the job again. You can then select the newly configured Databricks cluster.
Job Cluster
Cluster Details
- Choose Cluster - Provide a name for the job cluster that you want to create.
- Job Configuration Name - Provide a name for the job cluster configuration.
- Databricks Runtime Version - Select the appropriate Databricks version.
- Worker Type - Select the worker type for the job cluster.
- Workers - Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling.
- Enable Autoscaling - Autoscaling scales the number of workers up or down within the range that you specify. This helps reallocate workers to a job during its compute-intensive phase; once the compute requirement reduces, the excess workers are removed, which helps control your resource costs.
Cloud Infrastructure Details
- First on Demand - Provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances.
- Availability - Choose the type of EC2 instances used to launch your Apache Spark clusters, from the following options:
  - Spot
  - On-demand
  - Spot with fallback
- Zone - Identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.
- Instance Profile ARN - Provide an instance profile ARN that can access the target Amazon S3 bucket.
- EBS Volume Type - The type of EBS volume that is launched with this cluster.
- EBS Volume Count - The number of volumes launched for each instance of the cluster.
- EBS Volume Size - The size of the EBS volume to be used for the cluster.
Additional Details
- Spark Config - To fine-tune Spark jobs, provide custom Spark configuration properties in key-value pairs.
- Environment Variables - Configure custom environment variables that you can use in init scripts.
- Logging Path (DBFS Only) - Provide the logging path to deliver the logs for the Spark jobs.
- Init Scripts - Provide the init (initialization) scripts that run during the startup of each cluster.
Click Complete.
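The job cluster fields above broadly mirror the new_cluster object of the Databricks Jobs API. The following is an illustrative sketch only, with placeholder values; the Calibo Accelerate platform builds and submits the actual specification for you.

```python
# Illustrative only: field names follow the Databricks Jobs API "new_cluster"
# object; all values below are placeholders, not what the platform sends.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",                 # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                         # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # or "num_workers": 4 for a fixed size
    "aws_attributes": {
        "first_on_demand": 1,                            # First on Demand
        "availability": "SPOT_WITH_FALLBACK",            # SPOT | ON_DEMAND | SPOT_WITH_FALLBACK
        "zone_id": "us-east-1a",                         # Zone
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/s3-access",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},      # Spark Config
    "spark_env_vars": {"ENVIRONMENT": "dev"},                   # Environment Variables
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},  # Logging Path
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/setup.sh"}}],   # Init Scripts
}
```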
- With this, the job creation is complete. Click Start to run the Data Issue Resolver job.
- On completion of the job, click the Data Issue Resolver Result tab and then click View Resolver Results.
- View the output of the Issue Resolver job. Click the download option to download and save the Issue Resolver results to a CSV file.
What's next? Data Quality