Databricks Custom Transformation Job
Data transformation is the process of converting, cleansing, and structuring data into a usable format that can be analyzed to support decision-making. The process converts raw data into a usable format by removing duplicates, converting data types, and enriching the dataset. The resulting dataset can then be used for data analysis or as an input for AI/ML processes.
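To make these cleansing steps concrete, the following is a minimal PySpark sketch of the kind of logic a transformation job might run in a Databricks notebook. The sample data, column names, and derived column are assumptions for illustration only; spark refers to the SparkSession that Databricks notebooks provide by default.

```python
from pyspark.sql import functions as F

# Small in-memory sample so the sketch is self-contained; a real job would read
# from the configured data lake instead.
raw_df = spark.createDataFrame(
    [("1001", "250.50"), ("1001", "250.50"), ("1002", "99.99")],
    ["order_id", "order_amount"],
)

clean_df = (
    raw_df
    .dropDuplicates(["order_id"])                                       # remove duplicate records
    .withColumn("order_amount", F.col("order_amount").cast("double"))   # convert data types
    .withColumn("ingestion_date", F.current_date())                     # enrich with a derived column
)
clean_df.show()
```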
Lazsa Data Pipeline Studio (DPS) provides templates for creating transformation jobs. These jobs include join, union, and aggregate functions that can be performed to group or combine data for analysis.
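As an illustration of the kind of grouping these templates cover, here is a hedged PySpark sketch of a join followed by an aggregation. The DataFrames and column names are hypothetical and exist only for this example.

```python
from pyspark.sql import functions as F

# Hypothetical in-memory inputs; in a pipeline these would come from upstream nodes.
orders_df = spark.createDataFrame(
    [(1, "C1", 120.0), (2, "C2", 80.0), (3, "C1", 50.0)],
    ["order_id", "customer_id", "order_amount"],
)
customers_df = spark.createDataFrame(
    [("C1", "EMEA"), ("C2", "APAC")],
    ["customer_id", "region"],
)

# Join the two datasets, then aggregate per region.
summary_df = (
    orders_df
    .join(customers_df, on="customer_id", how="inner")
    .groupBy("region")
    .agg(
        F.count("order_id").alias("order_count"),
        F.sum("order_amount").alias("total_amount"),
    )
)
summary_df.show()
```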
For complex operations on data, Lazsa DPS provides the option of creating custom transformation jobs. For custom queries, a template with placeholder code is provided. You can navigate to the Databricks Notebook and replace the placeholder code with your own custom code.
To create a Databricks custom transformation job
- Log on to the Lazsa Platform and navigate to Products.
- Select a product and feature. Click the Develop stage of the feature and navigate to Data Pipeline Studio.
- Create a pipeline with the following nodes:
  Note: The stages and technologies used in this pipeline are merely for the sake of example.
  - Data Lake - Amazon S3
  - Data Transformation - Databricks
- Configure the data lake and data transformation nodes.
- In the data transformation stage, click the Databricks node and select Create Custom Job to create a custom transformation job.
- Complete the following steps to create the Databricks custom transformation job:
Job Name
Provide job details for the data transformation job:
- Job Name - Provide a name for the data transformation job that you are creating.
- Node Rerun Attempts - The number of times a pipeline rerun is attempted on this node in case of failure. The default is set at the pipeline level. You can set the rerun attempts for this node; if you do not, the pipeline-level default is used.
Click Next.
Repository
In this step, you configure the branch template and the source code repository. This helps you define your branching structure and create the branches in the source code repository accordingly.
To configure the branch template
- Click Configure Branch Template. You are navigated to the Develop stage of the feature in which the pipeline is created. Click Configure.
- On the Configure Branch Template screen, do one of the following:
  - Select an existing template from the dropdown list and add or delete the required branches.
  - Click + Add More and create branches as per your requirement.
- Click Save.
To configure the source code repository
- Navigate to Data Pipeline Studio and click the data transformation node. You have the following two options to configure the source code repository:
  - Create a new repository - Provide the following information:
    - Repository Name - A repository name is provided in the default format Technology Name - Product Name. For example, if the technology being used is Databricks and the product name is ABC, the repository name is Databricks-ABC. You can edit this name to create a custom name for the repository.
      Note: You can edit the repository name when you use the instance for the first time. Once the repository name is set, you cannot change it thereafter.
    - Group - Select a group to which you want to add this repository. Groups help you organize, manage, and share the repositories for the product.
    - Visibility - Select the type of visibility that the repository must have: Public or Private.
      Note: The Group and Visibility fields are visible only if you configure GitLab as the source code repository.
    - Click Create Repository. Once the repository is created, the repository path is displayed.
    - Select Source Code Branch from the dropdown list.
  - Use Existing Repository - Enable the toggle to use an existing repository. Provide the following information:
    - Title - The technology title is added.
    - Repository Name - Select a repository from the dropdown list.
- Click Next.
In this step, you can either provide system-defined parameters or custom parameters to the transformation job.
- System-defined parameters - Uncheck the parameters that you do not want to include with the transformation job and click Next.
- Custom Parameters - You can either add custom parameters manually or import them from a JSON file.
  - Add Manually - Provide a key and value. Mark the parameter as sensitive or mandatory depending on your requirement.
  - Import from JSON - You can download a template and create a JSON file with the required parameters, or upload a JSON file directly.
Click Next.
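If the job passes the configured parameters to the notebook as Databricks job or widget parameters (an assumption for this sketch, not something stated above), they can be read inside the notebook roughly as follows. The parameter names env and source_path are hypothetical.

```python
# Hypothetical parameter names; replace them with the keys you defined in this step.
# dbutils is available by default in Databricks notebooks.
env = dbutils.widgets.get("env")
source_path = dbutils.widgets.get("source_path")

print(f"Running transformation for env={env}, reading from {source_path}")
```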
You can select an all-purpose cluster or a job cluster to run the configured job. Because you are creating a custom transformation job, you may require specific library versions to run it successfully. To update the library versions, see Updating Cluster Libraries for Databricks.
If your Databricks cluster is not created through the Lazsa Platform and you want to update custom environment variables, refer to the following:
Cluster - Select the all-purpose cluster that you want to use for the data transformation job from the dropdown list.
Cluster Details | Description |
---|---|
Choose Cluster | Provide a name for the job cluster that you want to create. |
Job Configuration Name | Provide a name for the job cluster configuration. |
Databricks Runtime Version | Select the appropriate Databricks version. |
Worker Type | Select the worker type for the job cluster. |
Workers | Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling. |
Enable Autoscaling | Autoscaling scales the number of workers up or down within the range that you specify. This helps reallocate workers to a job during its compute-intensive phase; once the compute requirement reduces, the excess workers are removed. This helps control your resource costs. |
Cloud Infrastructure Details | |
First on Demand | Lets you pay for the compute capacity by the second. |
Availability | Select from the following options: Spot, On-demand, Spot with fallback. |
Zone | Select a zone from the available options. |
Instance Profile ARN | Provide an instance profile ARN that can access the target S3 bucket. |
EBS Volume Type | The type of EBS volume that is launched with this cluster. |
EBS Volume Count | The number of volumes launched for each instance of the cluster. |
EBS Volume Size | The size of the EBS volume to be used for the cluster. |
Additional Details | |
Spark Config | To fine-tune Spark jobs, provide custom Spark configuration properties as key-value pairs. |
Environment Variables | Configure custom environment variables that you can use in init scripts. |
Logging Path (DBFS Only) | Provide the logging path to deliver the logs for the Spark jobs. |
Init Scripts | Provide the init or initialization scripts that run during the startup of each cluster. |
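As a rough illustration of how the Spark Config and Environment Variables settings surface inside the notebook, the sketch below reads one standard Spark property and one environment variable. The DEPLOY_ENV name is a hypothetical example of a variable you might set on the cluster.

```python
import os

# spark.sql.shuffle.partitions is a standard Spark property; DEPLOY_ENV is a
# hypothetical custom environment variable configured on the cluster.
shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions")
deploy_env = os.environ.get("DEPLOY_ENV", "not set")

print(f"spark.sql.shuffle.partitions={shuffle_partitions}, DEPLOY_ENV={deploy_env}")
```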
To replace the placeholder code with custom code
After you have created the custom transformation job, click the Databricks Notebook icon. This takes you to the custom transformation job in the Databricks UI. Replace the placeholder code with your custom code and then run the job.
Note:
If you delete the job run parameters, such as Records Processed and Time Taken, from the sample code provided in the Databricks Notebook, the time taken for the job run that is displayed in View Details on the UI may not be accurate. This is because the reported time then includes the time required to start the Databricks cluster, if it is not already running.
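For orientation only, here is a minimal sketch of what the custom code might look like once you replace the placeholder: read from the Amazon S3 data lake configured earlier, apply a transformation, and write the result back. The paths and column names are hypothetical, and, as the note above explains, any job run reporting in the template (such as Records Processed and Time Taken) should be left in place.

```python
from pyspark.sql import functions as F

# Hypothetical S3 locations; substitute the paths configured for your data lake node.
source_path = "s3://example-bucket/raw/orders/"
target_path = "s3://example-bucket/curated/orders/"

orders_df = spark.read.option("header", True).csv(source_path)

curated_df = (
    orders_df
    .dropDuplicates(["order_id"])                                        # cleanse
    .withColumn("order_amount", F.col("order_amount").cast("double"))    # convert types
    .filter(F.col("order_amount") > 0)                                   # basic validation
)

curated_df.write.mode("overwrite").parquet(target_path)
```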
What's next? Snowflake Custom Transformation Job