Unity Catalog Integration with Calibo Accelerate platform
Databricks Unity Catalog is a fine-grained data governance solution built into the Databricks lakehouse platform. It addresses the complexities of data governance by providing centralized access control, auditing, data discovery, and metadata management across Databricks workspaces.
Calibo Accelerate platform integrates with Unity Catalog and uses its centralized data organization (through the 3-tier namespace), granular data governance, tagging, and audit logs to support the data integration, transformation, and quality workflows within our platform.
Unity Catalog improves data management, governance, security, and usability within the Databricks environment. Its built-in fine-grained access control and audit logs streamline operations and reduce the need for the separate governance tools that traditional data lakes on object storage, such as Amazon S3, often require.
Let us go through the features of Unity Catalog:
- Unity Catalog provides a centralized and consistent data governance layer, enforcing uniform security, auditing, and fine-grained access controls across all managed data assets within the Databricks account.
- Leveraging the intuitive Catalog Explorer, users can discover and explore data objects based on their granted permissions. Powerful search capabilities allow filtering by attributes such as keywords, creation timestamp, data type, and even tags.
- Unity Catalog serves as a centralized metadata repository for all data assets managed within Databricks, including tables, views, volumes, and external locations. It automatically maintains a comprehensive audit log, tracking all actions performed on these data assets at the account level.
- Unity Catalog enables robust tagging of data assets, allowing for classification based on sensitivity levels (e.g., PII, confidential) and adherence to regulatory requirements (e.g., GDPR, HIPAA).
- Unity Catalog seamlessly integrates with Delta Lake, the underlying storage layer in Databricks, enabling efficient incremental data processing. This significantly reduces resource consumption and processing times for data updates and transformations.
- Unity Catalog employs a well-defined 3-tier namespace (catalog.schema.table) to provide a structured and logical organization of data assets. This hierarchical structure is crucial for administrators to implement and manage fine-grained access control policies at the catalog, schema, and table levels.
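To make the 3-tier namespace and its role in access control and tagging more concrete, the following Python sketch builds the SQL statements an administrator might run for a single table. The helper names (`qualified_name`, `grant_read_access`, `tag_table`) and the catalog, schema, table, and group names are hypothetical. The `GRANT` and `ALTER TABLE ... SET TAGS` statements are intended to follow Databricks SQL syntax and should be verified against the Databricks documentation for your environment. The sketch only produces strings; it does not connect to a workspace and does not escape special characters in names or tag values.

```python
from typing import Dict, List


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table name, backtick-quoting each level."""
    return ".".join(f"`{part}`" for part in (catalog, schema, table))


def grant_read_access(catalog: str, schema: str, table: str, principal: str) -> List[str]:
    """Return the statements that let `principal` read one table.

    Reading a table requires USE CATALOG and USE SCHEMA on the parent
    levels in addition to SELECT on the table itself.
    """
    return [
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA `{catalog}`.`{schema}` TO `{principal}`",
        f"GRANT SELECT ON TABLE {qualified_name(catalog, schema, table)} TO `{principal}`",
    ]


def tag_table(catalog: str, schema: str, table: str, tags: Dict[str, str]) -> str:
    """Return a statement that attaches key/value tags to a table."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in tags.items())
    return f"ALTER TABLE {qualified_name(catalog, schema, table)} SET TAGS ({pairs})"


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    for statement in grant_read_access("clinical", "drug_xyz", "trial_results", "data-scientists"):
        print(statement)
    print(tag_table("clinical", "drug_xyz", "trial_results", {"sensitivity": "pii"}))
```

On a Unity Catalog-enabled cluster, each generated string could be passed to `spark.sql(...)`; here the statements are only printed so that they can be reviewed first.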
Calibo Accelerate platform currently provides the following capabilities for Unity Catalog:
- Data Integration: Unity Catalog supports a wide range of data formats, which makes it more flexible to integrate diverse data sources within the Databricks environment. By using Unity Catalog-enabled Databricks for data integration, you ensure that all ingested data is centrally registered and governed, and benefits from consistent data discovery and access control policies. Calibo Accelerate platform supports the following types of data ingestion jobs:
- Data Transformation: Unity Catalog provides granular access control at the level of tables, columns, and even rows. This ensures that users can run transformations only on the data they are authorized to access, which enforces organizational policies and supports adherence to regulatory requirements.
- Data Quality: By leveraging Unity Catalog's data discovery and metadata, our platform helps you understand data characteristics. This enables you to effectively set up and enforce data quality rules, ensuring reliable data for decision-making.
Data Profiler using Unity Catalog
Let us take an example of clinical trial data for a specific drug and see how the features of Unity Catalog are useful:
- You can organize data in a structured manner in Unity Catalog, allowing you to quickly navigate to the section that you require. You can filter and find data based on various attributes like keywords, creation date, or data types. Instead of searching through multiple files and folders, you can search the catalog for "clinical trial data for Drug XYZ" and quickly find the exact datasets you need.
- Your clinical trial data might be organized by drug name, trial phase, or study site. Unity Catalog organizes data assets in a structured way, making it easy to browse through categories or projects.
- Unity Catalog allows you to centrally set permissions and manage who can access which data assets and what they can do with them. This ensures that only authorized personnel, like data scientists or regulatory compliance officers, have access to sensitive clinical trial data.
- Unity Catalog provides audit logs that track every action taken on data, with details about who accessed it, what they did with it, and when. You can review the audit logs to see exactly which user accessed a particular dataset and what actions they took, which supports accountability and transparency.
- Tagging, another important feature of Unity Catalog, is especially useful for PII data: you can mark such data as confidential and apply governance policies that restrict access to it.
- By setting up data quality rules, you can automatically flag or prevent the use of data that doesn't meet your quality standards, ensuring that decisions are based on reliable data.
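As an illustration of the last point, the sketch below shows one way that rule-based flagging of clinical trial records could work in plain Python. It is not how Calibo Accelerate or Unity Catalog implements data quality checks. The `QualityRule` class, the `apply_rules` function, the rule names, and the record fields (`patient_id`, `age`, `trial_phase`) are all hypothetical and chosen only to match the clinical trial example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass(frozen=True)
class QualityRule:
    """A named check; `check` returns True when a record passes."""
    name: str
    check: Callable[[Record], bool]


def apply_rules(
    records: List[Record], rules: List[QualityRule]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into those that pass every rule and those that are flagged.

    Each flagged record is paired with the names of the rules it failed.
    """
    passed: List[Record] = []
    flagged: List[Tuple[Record, List[str]]] = []
    for record in records:
        failures = [rule.name for rule in rules if not rule.check(record)]
        if failures:
            flagged.append((record, failures))
        else:
            passed.append(record)
    return passed, flagged


# Hypothetical rules for the clinical trial example.
RULES = [
    QualityRule("patient_id_present", lambda r: bool(r.get("patient_id"))),
    QualityRule(
        "age_in_range",
        lambda r: isinstance(r.get("age"), int) and 18 <= r["age"] <= 100,
    ),
    QualityRule("phase_valid", lambda r: r.get("trial_phase") in {"I", "II", "III", "IV"}),
]

if __name__ == "__main__":
    sample = [
        {"patient_id": "P001", "age": 42, "trial_phase": "II"},
        {"patient_id": "", "age": 130, "trial_phase": "II"},
    ]
    ok, bad = apply_rules(sample, RULES)
    print(f"{len(ok)} record(s) passed, {len(bad)} flagged")
    for record, failures in bad:
        print(record, "failed:", ", ".join(failures))
```

Keeping each rule as a small named check means a flagged record carries the list of rules it failed, so downstream steps can decide whether to exclude the record or only report it.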
What's next? Data Integration using Unity Catalog