Configuring Databricks for Unity Catalog
Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions. Among its various other features, Databricks is used for ETL and for managing security, governance, and data discovery.
The Calibo Accelerate platform uses Databricks for various operations such as data integration, data transformation, and data quality. After you save the connection details for Databricks, you can use the configuration in any of the nodes in a data pipeline.
Prerequisites
The following permissions are required for configuring Databricks:
- s3:ListBucket
- s3:PutObject
- s3:GetObject
- s3:DeleteObject
- s3:PutObjectAcl
To access an Amazon S3 bucket from Databricks, you must create an instance profile with read, write, and delete permissions. For detailed instructions on how to create an instance profile, refer to the following link: instance-profile-tutorial.html
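If you prefer to script the S3 access setup, the following is a minimal boto3 sketch (not an official setup script) that attaches the permissions listed above to the IAM role behind a Databricks instance profile. The bucket name, role name, and policy name are placeholders.

```python
# Hedged sketch: attach the S3 permissions required by Databricks to the IAM role
# that backs the instance profile. Bucket, role, and policy names are placeholders.
import json

import boto3

iam = boto3.client("iam")

BUCKET = "my-databricks-data-bucket"            # placeholder bucket name
ROLE_NAME = "databricks-instance-profile-role"  # placeholder role name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Bucket-level permission
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {   # Object-level permissions
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObjectAcl",
            ],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

# Attach the policy inline to the role used by the instance profile.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="databricks-s3-access",  # placeholder policy name
    PolicyDocument=json.dumps(policy),
)
```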
To configure the connection details of your Databricks instance, do the following:
- Sign in to the Calibo Accelerate platform and click Configuration in the left navigation pane.
- On the Platform Setup screen, on the Cloud Platform, Tools & Technologies tile, click Configure.
- On the Cloud Platform, Tools & Technologies screen, in the Data Integration section, click Configure.
- On the Databricks screen, do the following:
In the Details section, provide the following details:
Name - Provide a unique name for your Databricks configuration. This name is used to save and identify your specific Databricks connection details within the Calibo Accelerate platform.
Description - Provide a brief description that helps you identify the purpose or context of this Databricks configuration.
In the Configuration section, provide the following information:
Workspace URL - Provide the URL of the configured Databricks instance.
Select Cloud Provider - Select the cloud provider on which the Databricks cluster is deployed.
Note:
You must select the same cloud service provider on which the Databricks cluster is deployed. For example, if Databricks is deployed on Azure, do not select AWS as the cloud service provider.
Depending on the cloud provider on which the Databricks cluster is deployed and how you want to retrieve the credentials to connect to Databricks, do one of the following:
Connect using Lazsa Orchestrator Agent - Enable this option to resolve your Databricks credentials within your private network via the Calibo Accelerate Orchestrator Agent, without sharing them with the Calibo Accelerate platform.
If the Cloud Service Provider is AWS, provide the following details:
Select the Calibo Accelerate Orchestrator Agent for AWS.
Secret Name - Provide the name with which the secret is stored in AWS Secrets Manager.
If the Cloud Service Provider is Azure, provide the following details:
Select the Calibo Accelerate Orchestrator Agent for Azure.
Vault Name - Provide the name of the vault in which the secret is stored in Azure Key Vault.
Select Secret Manager - Do one of the following:
- Select Lazsa. The user credentials are securely stored in the Lazsa-managed secrets store. Databricks API Token Key - Provide the key for the Databricks API token.
- Select AWS Secrets Manager. In the Secret Management Tool dropdown list, the AWS Secrets Manager configurations that you save and activate in the Secret Management section on the Cloud Platform, Tools & Technologies screen are listed for selection. Select the configuration of your choice. Secret Name - Provide the name with which the secrets for Databricks are stored.
Databricks API Token - Provide the Databricks API token.
Test Connection - Click Test Connection to verify the provided details and ensure that the Calibo Accelerate platform can successfully connect to the configured Databricks instance. (See the connectivity sketch after this procedure.)
In the Databricks Resource Configuration section, provide the following information:
Organization ID - Provide the organization ID. When you log in to your Databricks workspace, observe the URL in the address bar of the browser. The number following o= in the URL is the organization ID.
For example, if the URL is https://abcd-teste2-test-spcse2.cloud.databricks.com/?o=2281745829657864#, the organization ID is 2281745829657864.
Cluster Name - Do one of the following:
Select an existing cluster from the dropdown list. If you select a Databricks cluster that has the Unity Catalog feature enabled, additional fields related to Unity Catalog settings are displayed.
Click + New Cluster to create a new cluster. See Create a Unity Catalog-enabled Cluster.
Enable Unity Catalog - Do one of the following:
If you have selected a Databricks cluster that runs Databricks Runtime version 14.3 LTS or above, but the Unity Catalog feature is not enabled in Databricks, enable this toggle to use the Unity Catalog feature. Once enabled, all the available metastores are displayed in the dropdown list.
If you have selected a Databricks cluster that runs Databricks Runtime version 14.3 LTS or above, and the Unity Catalog feature is enabled in Databricks, no action is required.
Note:
Once the Unity Catalog feature is enabled in Databricks, the toggle is enabled but cannot be edited from the Calibo Accelerate platform.
SQL Warehouse - Select a SQL warehouse that you want to use for this Databricks instance. (This is optional; you can also select a SQL warehouse when you create a Databricks job using this instance.)
See Add New SQL Warehouse.
Metastore - Select the metastore for this Databricks instance. If Unity Catalog is enabled, the metastore is preselected and no action is required.
External Location - An external location in a Unity Catalog-enabled Databricks environment is a data storage location, such as Amazon S3 or Azure Blob Storage, that is linked to Databricks. These locations allow you to access data stored outside of the Databricks workspace. Click Manage to view the existing external locations or create a new one. See Create an External Location.
Workspace Parent Folder - Do one of the following:
Provide the name of the parent folder in the Databricks workspace that you are configuring.
Click + New Parent Folder. Provide a Workspace Parent Folder Name and click Create.
Auto Clone Custom Code to Databricks Repos - Enable this option if you want your custom code to be automatically cloned to the repository in Databricks. If you disable this option, Lazsa uploads the custom code to the repository in Databricks.
Secret Scope - A secret scope is a collection of secrets that is stored in an encrypted database owned and managed by Databricks. It allows Databricks to use the credentials stored in it locally, eliminating the need to connect to an external secrets management tool every time a job is run in a data pipeline. You can either use an existing secret scope already created in Databricks or create a new one. Do one of the following:
Select a secret scope from the dropdown list.
Click + New Secret Scope to create a new secret scope.
Provide a name for the new secret scope and click Create.
- Secure configuration details with a password
To password-protect your Databricks connection details, turn on this toggle, enter a password, and then retype it to confirm. This is optional but recommended. When you share the connection details with multiple users, password protection helps ensure that only authorized users can access the connection details.
- Click Save Configuration. The configuration is listed on the Data Integration screen.
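For reference, the following is a minimal Python sketch of a comparable connectivity check run outside the platform. It assumes the Databricks API token is stored in AWS Secrets Manager as a JSON secret under a key named databricks_api_token; the secret name, key name, and workspace URL are placeholders, not values defined by the Calibo Accelerate platform.

```python
# Hedged connectivity sketch: fetch the Databricks API token from AWS Secrets Manager
# and call a lightweight Databricks REST endpoint to confirm the URL and token work.
import json

import boto3
import requests

WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"  # placeholder
SECRET_NAME = "databricks/connection"        # placeholder secret name
TOKEN_KEY = "databricks_api_token"           # assumed key inside the JSON secret

# Retrieve the API token from AWS Secrets Manager.
secrets = boto3.client("secretsmanager")
secret_value = secrets.get_secret_value(SecretId=SECRET_NAME)
token = json.loads(secret_value["SecretString"])[TOKEN_KEY]

# Any authenticated call that returns HTTP 200 confirms connectivity.
response = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()
print("Connected. Clusters visible:", len(response.json().get("clusters", [])))
```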
Add New SQL Warehouse
To create a new SQL warehouse, provide the following information:
SQL Warehouse Name - Provide a unique name for the SQL warehouse that you want to create.
Cluster Size - Select the compute power that you require for the SQL warehouse.
Auto Stop - Enable this option to stop the warehouse if it is idle for the specified number of minutes. Specify the number of minutes of inactivity after which the warehouse is stopped.
Scaling - Select the minimum and maximum number of clusters that the warehouse can scale between for compute power.
Type - Select the type of warehouse to be created from the options given below:
Serverless
Pro
Classic
Tags (Optional) - Tags are specified as key-value pairs and are used to monitor the cost of cloud resources used in the organization by users or groups. To add a tag, click + Add More. Specify the key name and value for the key.
Type (Optional) - Select either Current or Preview.
Click Save to save the configuration details for creating a SQL warehouse.
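For context, the fields above map closely onto the Databricks SQL Warehouses REST API. The following hedged sketch creates a comparable warehouse with POST /api/2.0/sql/warehouses; it is an illustration only, not the Calibo Accelerate implementation, and the workspace URL, token, warehouse name, and tag values are placeholders.

```python
# Hedged sketch: create a SQL warehouse directly via the Databricks REST API,
# using settings equivalent to the form fields above. Values are placeholders.
import requests

WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"  # placeholder
TOKEN = "<databricks-api-token>"                                        # placeholder

payload = {
    "name": "calibo-demo-warehouse",   # SQL Warehouse Name
    "cluster_size": "Small",           # Cluster Size
    "auto_stop_mins": 30,              # Auto Stop (minutes of inactivity)
    "min_num_clusters": 1,             # Scaling: minimum clusters
    "max_num_clusters": 2,             # Scaling: maximum clusters
    "warehouse_type": "PRO",           # Type: PRO or CLASSIC
    "tags": {                          # Optional cost-tracking tags
        "custom_tags": [{"key": "team", "value": "data-engineering"}]
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print("Created SQL warehouse:", response.json()["id"])
```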
Create a Unity Catalog-enabled Cluster
To create a new Unity Catalog-enabled cluster, provide the following details:
Cluster Details
Cluster Name - Provide a name for the Databricks cluster that you are creating.
Enable Unity Catalog - Enable the toggle to use Unity Catalog functionality with this Databricks cluster.
Access Mode - For a Databricks data integration job, you can select either of the following access modes:
Dedicated - This option allows a single user to run SQL, Python, R, and Scala workloads, with access to data secured in Unity Catalog.
Standard - This option allows multiple users to share the cluster to run SQL, Python, and Scala workloads on data secured in Unity Catalog.
Databricks Runtime Version - Select a version based on the type of job for which you want to use the Databricks cluster.
Worker Type - Select the processing capacity that you need, based on your use case.
Workers - The number of worker nodes. This depends on the kind of cluster you want to create. For example, a multi-node cluster requires a greater number of workers.
Enable Autoscaling - Turn on this toggle to control infrastructure costs. If you enable this option and provide values for Min Workers and Max Workers, the cluster starts with the minimum number of workers and can scale up to the maximum number of workers as required.
Terminate after below minutes of inactivity - Use this option to control infrastructure costs. Provide a value in minutes. If the Databricks cluster is inactive for the specified number of minutes, it is automatically terminated.
Use latest Calibo UTIL version - Enable this option to use the required utility.
Cloud Infrastructure Details
First on Demand - Lets you pay for the compute capacity by the second.
Availability - The availability zone for the cluster. The instance type you want may be available only in certain zones.
Zone - Select a zone from the available options.
Instance Profile ARN - Instance profiles allow you to access your data from Databricks clusters. Specify the instance profile ARN.
EBS Volume Type, EBS Volume Count, EBS Volume Size - Databricks provisions EBS volumes by default for efficient functioning of the cluster. You can specify the type of EBS volume, the EBS volume count, and the EBS volume size.
Additional Details
Spark Config - You can fine-tune your Spark jobs by providing custom Spark configuration properties. For more information, see Spark configuration properties.
Enter the configuration properties as one key-value pair per line (see the cluster creation sketch at the end of this section).
Environment Variables - You can add the required environment variables.
Environment variables are used to configure various aspects of the behavior of the Databricks cluster and environment, for example, the Python version to use, the number of worker instances to start on each worker node, the location of the Apache Spark installation on the cluster nodes, the logging level to set for the cluster, and so on.
Logging Path - When you create a cluster, you specify the location where the logs for the Spark driver node, worker nodes, and events are stored.
Select the destination from the following options:
DBFS
S3
Log Level - Select the logging level that you want to set for this cluster. Choose from the following options:
ALL
DEBUG
ERROR
FATAL
INFO
OFF
TRACE
WARN
NONE
Init Scripts - Set the destination and the path of the location where the init scripts for the cluster are stored. Init scripts customize the initialization process of the cluster by executing specified commands or scripts when the cluster spins up. These scripts perform various tasks such as configuring the cluster environment, installing additional software packages, setting environment variables, and so on. Specify the destination and the actual path for the init scripts:
Destination - Select from the following options:
Workspace
DBFS
S3
Init script path - Specify the actual path on the selected destination.
Type - A table is created with the options selected in the previous step, and the information is summarized in the columns Type, File Path, Region, and Delete. For example, a DBFS logging destination appears as Type dbfs, File Path dbfs:/logging, Region -, with a delete icon to remove the entry.
Tags - Tags allow you to monitor the cost of cloud resources used by various groups within your organization. The tags that you add as key-value pairs are applied by Databricks to cloud resources such as VMs, disk volumes, and so on.
To add a tag, provide the following information:
Tag - Provide a name for the tag.
Value - Provide a value for the tag.
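The cluster fields described in this section correspond closely to the payload of the Databricks Clusters API. The following hedged sketch creates a comparable Unity Catalog-enabled cluster with POST /api/2.0/clusters/create. It is an illustration, not the Calibo Accelerate implementation; the workspace URL, token, runtime version, node type, instance profile ARN, paths, and tag values are placeholders.

```python
# Hedged sketch: create a Unity Catalog-capable cluster via the Databricks Clusters API,
# mapping the UI fields above to API fields. All concrete values are placeholders.
import requests

WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"  # placeholder
TOKEN = "<databricks-api-token>"                                        # placeholder

payload = {
    "cluster_name": "uc-demo-cluster",                 # Cluster Name
    "spark_version": "14.3.x-scala2.12",               # Databricks Runtime Version (placeholder)
    "node_type_id": "i3.xlarge",                       # Worker Type (placeholder)
    "autoscale": {"min_workers": 1, "max_workers": 4}, # Enable Autoscaling (Min/Max Workers)
    "autotermination_minutes": 30,                     # Terminate after minutes of inactivity
    "data_security_mode": "SINGLE_USER",               # Access Mode: Dedicated (USER_ISOLATION for Standard)
    "aws_attributes": {                                # Cloud Infrastructure Details
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "us-east-1a",
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/databricks",  # placeholder
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
    },
    "spark_conf": {                                    # Spark Config (one key-value pair per line)
        "spark.sql.shuffle.partitions": "200",
    },
    "spark_env_vars": {                                # Environment Variables
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
    },
    "cluster_log_conf": {                              # Logging Path (DBFS destination)
        "dbfs": {"destination": "dbfs:/logging"},
    },
    "init_scripts": [                                  # Init Scripts (workspace path is a placeholder)
        {"workspace": {"destination": "/Shared/init/install-libs.sh"}},
    ],
    "custom_tags": {"team": "data-engineering"},       # Tags
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```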
Create an external location

Note:
If you are creating an external location on a Unity Catalog-enabled Databricks instance that is deployed on Azure, ensure that you have the CREATE EXTERNAL LOCATION permission.
- On the Databricks cluster configuration, under External Location, click Manage.
- On the External Location side drawer, click Add External Location and provide the following information:
External Location Name - A name for the external location that you want to create.
Storage Credential - Credentials that authorize access to the external storage location.
Location URL - The path of the external storage location.
Read-only Access - Select this option to give users read-only access to this external storage location.
Comment - Add any comments about this external storage location.
- Click Add.
The newly created external location is listed in the table.
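For comparison, an external location can also be created directly through the Databricks Unity Catalog REST API (POST /api/2.1/unity-catalog/external-locations), mirroring the fields in the side drawer above. The following sketch assumes the storage credential already exists in Unity Catalog; the names, URL, and token are placeholders.

```python
# Hedged sketch: create a Unity Catalog external location via the Databricks REST API.
# The storage credential referenced here must already exist; values are placeholders.
import requests

WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"  # placeholder
TOKEN = "<databricks-api-token>"                                        # placeholder

payload = {
    "name": "demo_external_location",              # External Location Name
    "credential_name": "demo_storage_cred",        # Storage Credential (must already exist)
    "url": "s3://my-databricks-data-bucket/raw",   # Location URL (placeholder)
    "read_only": True,                             # Read-only Access
    "comment": "Raw zone for the demo pipeline",   # Comment
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/unity-catalog/external-locations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print("Created external location:", response.json()["name"])
```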