Databricks Issue Resolver
In the issue resolver stage, you improve data quality in several ways: handling duplicate data, missing data, and outliers, specifying the partitioning order, handling case sensitivity, performing string operations, and so on.
- In the data quality stage, add an issue resolver node and connect it to and from the data lake.
- Click the issue resolver node and then click Create Job to create an issue resolver job.
- Provide the following information to create an Issue Resolver job:
Job Name
- Template - this is automatically selected depending on the selected stages.
- Job Name - provide a name for the issue resolver job.
- Node Rerun Attempts - the number of times the job is rerun if it fails. The default value is set at the pipeline level.

Click Next.
Source
- Source - this is automatically selected depending on the type of source added in the pipeline.
- Datastore - this is automatically selected depending on the configured datastore.
- Source Format - select either Parquet or Delta table (see the read sketch after this list).
- Choose Base Path - you can select one file at a time.
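For context, the sketch below shows roughly how a Parquet or Delta source at the configured base path can be read with PySpark. The base path, format, and column names are placeholder assumptions; the Lazsa Platform performs this step for you based on the Source settings above.

```python
from pyspark.sql import SparkSession

# Placeholder values; the actual base path and format come from the Source step above.
BASE_PATH = "s3://your-bucket/curated/orders/"   # hypothetical base path
SOURCE_FORMAT = "delta"                          # or "parquet"

spark = SparkSession.builder.getOrCreate()

# Read the source data in the selected format.
source_df = spark.read.format(SOURCE_FORMAT).load(BASE_PATH)
source_df.printSchema()
```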
Issue Resolver Constraints - select the following constraints based on your requirements (an illustrative PySpark sketch of some of these operations follows the list):
- Handle Duplicate Data - select a column with a unique key for handling duplicate data.
- Partitioning Order - select the partitioning order for the required columns; choose from ascending and descending.
- Handle Missing Data - select the action to take on columns with missing data; choose from Remove Null/Empty and Update Null/Empty.
- Handle Outliers - specify the integer value for handling outliers.
- Handle String Operations - perform string operations on the selected columns. String operations include Trim, LTrim, RTrim, LPad, RPad, Regex Replace, and Sub-string.
- Handle Case Sensitivity - specify how to handle case sensitivity. Options are lower case, upper case, and proper case.
- Replace Selective Data - specify values to replace existing values. You can specify multiple comma-separated values.
- Handle Data Against Master Table - perform a lookup against a master table in S3 and then perform an append, overwrite, or upsert operation on the selected columns (see the upsert sketch after this list).
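These constraints are configured entirely in the UI. The following is only a minimal PySpark sketch of what a few of them conceptually do (duplicate handling, missing data, string operations, and case handling); the column names and fill values are hypothetical, and it continues from the read sketch above.

```python
from pyspark.sql import functions as F

# Hypothetical input columns: order_id, customer_name, country.

# Handle Duplicate Data: keep one row per unique key column.
deduped_df = source_df.dropDuplicates(["order_id"])

# Handle Missing Data: either remove or update null/empty values.
removed_nulls_df = deduped_df.na.drop(subset=["customer_name"])       # Remove Null/Empty
updated_nulls_df = deduped_df.na.fill({"customer_name": "UNKNOWN"})   # Update Null/Empty

# Handle String Operations: e.g. Trim and Regex Replace on selected columns.
cleaned_df = (
    updated_nulls_df
    .withColumn("customer_name", F.trim("customer_name"))
    .withColumn("country", F.regexp_replace("country", r"\s+", " "))
)

# Handle Case Sensitivity: lower, upper, or proper (initcap) case.
final_df = cleaned_df.withColumn("country", F.initcap("country"))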
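For the Handle Data Against Master Table constraint, append and overwrite map to standard write modes, while an upsert is typically expressed as a Delta Lake merge. The sketch below is an assumption-laden illustration: the master table path and the customer_id key are hypothetical, and final_df and spark come from the previous sketch.

```python
from delta.tables import DeltaTable

MASTER_PATH = "s3://your-bucket/master/customers/"   # hypothetical master table location

# Append or overwrite the selected columns into the master table.
final_df.write.format("delta").mode("append").save(MASTER_PATH)        # append
# final_df.write.format("delta").mode("overwrite").save(MASTER_PATH)   # overwrite

# Upsert: merge new records into the master table on the lookup key.
master = DeltaTable.forPath(spark, MASTER_PATH)
(
    master.alias("m")
    .merge(final_df.alias("u"), "m.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```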
Cluster Configuration

You can select an all-purpose cluster or a job cluster to run the configured job (an illustrative cluster specification follows this section). If your Databricks cluster is not created through the Lazsa Platform and you want to update custom environment variables, refer to the following:

All Purpose Clusters
- Cluster - select the all-purpose cluster that you want to use for this job from the dropdown list.

Job Cluster
- Choose Cluster - provide a name for the job cluster that you want to create.
- Job Configuration Name - provide a name for the job cluster configuration.
- Databricks Runtime Version - select the appropriate Databricks runtime version.
- Worker Type - select the worker type for the job cluster.
- Workers - enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling.
- Enable Autoscaling - autoscaling scales the number of workers up or down within the range that you specify. Workers are reallocated to a job during its compute-intensive phase, and the excess workers are removed once the compute requirement reduces. This helps control your resource costs.

Cloud Infrastructure Details
- First on Demand - provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances.
- Availability - choose the type of EC2 instances on which to launch your Apache Spark clusters: Spot, On-demand, or Spot with fallback.
- Zone - identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.
- Instance Profile ARN - provide an instance profile ARN that can access the target Amazon S3 bucket.
- EBS Volume Type - the type of EBS volume that is launched with this cluster.
- EBS Volume Count - the number of volumes launched for each instance of the cluster.
- EBS Volume Size - the size of the EBS volume to be used for the cluster.

Additional Details
- Spark Config - to fine-tune Spark jobs, provide custom Spark configuration properties as key-value pairs.
- Environment Variables - configure custom environment variables that you can use in init scripts.
- Logging Path (DBFS Only) - provide the logging path to deliver the logs for the Spark jobs.
- Init Scripts - provide the init (initialization) scripts that run during the startup of each cluster.

Click Complete.
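For reference, the job cluster fields above correspond roughly to a Databricks cluster specification of the kind used by the Databricks REST and Jobs APIs, sketched below in Python. Every value is an illustrative assumption, not a Lazsa default.

```python
# Illustrative Databricks job-cluster specification; all values are placeholders.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",                  # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                          # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},    # Enable Autoscaling (or use "num_workers")
    "aws_attributes": {                                   # Cloud Infrastructure Details
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",             # SPOT, ON_DEMAND, or SPOT_WITH_FALLBACK
        "zone_id": "us-east-1a",
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/your-profile",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},                # Spark Config key-value pairs
    "spark_env_vars": {"MY_ENV_VAR": "value"},                            # Environment Variables
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},  # Logging Path (DBFS Only)
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/setup.sh"}}],   # Init Scripts
}
```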
With this, the job creation is complete. Click Start to run the Data Issue Resolver job.
- On completion of the job, click the Data Issue Resolver Result tab and then click View Resolver Results.
- View the output of the Issue Resolver job. Click to download and save the Issue Resolver results to a CSV file.
What's next? Data Quality