Add a New Crawler

Purpose

This topic describes how to add a new crawler for Azure Blob Storage to the Calibo Accelerate platform.

Azure Blob Storage is an object store that holds files of different types, such as CSV, TSV, and Excel. The metadata and data formats of file objects differ from those of traditional RDBMS systems, so RDBMS crawlers do not support such file systems.

To support such storage, new crawlers must be developed and integrated with the Calibo Accelerate platform.

The diagram below shows the related microservices. Implementing any new crawler affects this set of services.

Microservice                 Change Impact
plf-elab-service             Configuration changes
plf-configuration-service    Configuration changes
plf-data-pipeline-designer   Code changes
plf-common-orchestrator      Code changes

 

Microservices impacted by Crawler development

In the Calibo Accelerate platform ecosystem, several microservices provide the crawler functionality. To develop a new crawler, say AzureCrawler, implement the exposed interfaces and add the implementation classes to the respective folders for compilation.

Once the developed classes are compiled with the plf-data-pipeline-designer service, the crawler has to be integrated with the platform.

Integration involves creating settings and entries in other modules of the platform so that they can consume the newly developed code. Adjustments must be made in plf-elab-service and plf-configuration-service to consume the new crawler.

Similarly, if the developed crawler needs to be added to the Lazsa orchestrator framework, then in addition to the above changes, the crawler classes must be compiled with plf-common-orchestrator.

In the above diagram, AzureCrawler is the developed piece of code that extends the plf-data-pipeline-designer service interface and is compiled with different components of the Lazsa platform.

Steps for adding Azure Blob Storage as a crawler

You need to make the following changes to the different microservices to add Azure Blob Storage as a crawler.

Each step below lists the required change, an example, and the team that performs the action.
Microservice: plf-elab-service

This service orchestrates the business logic of the platform that deals with product portfolios and products.

Add an entry to the enum com.calibo.platform.elab.enums.ProviderEnum:

public enum ProviderEnum {
    // ... existing providers ...
    AZURE_BLOB;
}
Action performed by Lazsa team.
Microservice: plf-data-pipeline-designer

This service is responsible for all business logic related to data management of the platform, such as data crawlers, data visualizers, and the data catalog.

Add an entry to the crawler utility class in the package com.calibo.platform.dpd.util:
public RelationalDatasourceGenericRepository getRelationalDatasourceRepository(ProviderEnum type) {
  switch (type) {
    case RDBMS:
    case MYSQL:
    case AWS_RDS_MYSQL:
    case AWS_S3:
    case POSTGRE_SQL:
    case AWS_RDS_POSTGRE_SQL:
    case AZURE_RDS_POSTGRE_SQL:
    case SNOWFLAKE:
    case MS_SQL_SERVER:
    case AWS_RDS_MARIA_DB:
    case ORACLE:
    case REST_API:
      return restAPICrawler;
    case CSV:
      return csvCrawler;
    case EXCEL:
      return excelCrawler;
    case SFTP:
      return sftpCrawler;
    case FTP:
      return ftpCrawler;
    case PARQUET:
      return parquetCrawler;
    case AZURE_BLOB:
      return azureBlobCrawler;
    default:
      throw new IllegalArgumentException("Datastore not supported.");
  }
}
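The crawler fields returned above (csvCrawler, azureBlobCrawler, and so on) are the implementation beans held by the utility class. A minimal, hypothetical caller, assuming Spring-style injection (the CrawlerUtil name and wiring are assumptions for illustration, not the platform's actual API):

@Autowired
private CrawlerUtil crawlerUtil; // utility class exposing getRelationalDatasourceRepository (name assumed)

// Resolve the crawler for the datastore type, then delegate metadata crawling to it.
public List<MetadataDetails> crawl(ElabDataStoreBeanWithAttributesMap dataStore) throws Exception {
  RelationalDatasourceGenericRepository crawler =
      crawlerUtil.getRelationalDatasourceRepository(ProviderEnum.AZURE_BLOB);
  return crawler.fetchMetadata(dataStore);
}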
Implement AzureBlobCrawler under the package com.calibo.platform.dpd.service.impl.crawler.

AzureBlobCrawler implements all the methods of the interface com.calibo.platform.dpd.repository.RelationalDatasourceGenericRepository:
public class AzureBlobCrawler implements RelationalDatasourceGenericRepository { 
 
  @Override 
  public List<String> getTableList(ElabDataStoreBeanWithAttributesMap elabDataStore) 
      throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getTableDescription(ElabDataStoreBeanWithAttributesMap elabDataStore, 
      String tableName) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getJoinedPreview(ElabDataStoreBeanWithAttributesMap datastore, 
      List<RelationalDatasourceNodeConfigSelectAttribute> selectRefs, 
      List<RelationalDatasourceNodeConfigJoinAttribute> joinRefs) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public String generateQuery(ElabDataStoreBeanWithAttributesMap datastore, 
      List<RelationalDatasourceNodeConfigSelectAttribute> selectRefs, 
      List<RelationalDatasourceNodeConfigJoinAttribute> joinRefs) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getTableDescriptionFromQuery(ElabDataStoreBeanWithAttributesMap datastore, 
      String query) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getCustomJoinQueryPreview(Node node, 
      ElabDataStoreBeanWithAttributesMap attributesMap) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public Map<String, String> getCrawlerFiles(FilePathRequestBean filePathRequestBean) { 
    return null; 
  } 
 

 // The mandatory method to implement for proper working of any generic crawler 
  @Override 
  public List<MetadataDetails> fetchMetadata(ElabDataStoreBeanWithAttributesMap dataStore) 
      throws SQLException, IOException, JSchException, SftpException { 
    return null; 
  } 
}
Action performed by Adapter Development Team.
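For orientation, here is a minimal sketch of what the mandatory fetchMetadata method could look like for Azure Blob Storage, using the Azure Storage SDK for Java (com.azure:azure-storage-blob). The attribute keys, the accessor on the datastore bean, and the mapping into MetadataDetails are assumptions for illustration, not the platform's actual API:

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import java.util.ArrayList;
import java.util.List;

@Override
public List<MetadataDetails> fetchMetadata(ElabDataStoreBeanWithAttributesMap dataStore) {
  // Connection details come from the datastore attributes configured in the platform;
  // the accessor and attribute keys below are assumed for this sketch.
  String connectionString = dataStore.getAttributesMap().get("connectionString");
  String containerName = dataStore.getAttributesMap().get("container");

  BlobContainerClient container = new BlobServiceClientBuilder()
      .connectionString(connectionString)
      .buildClient()
      .getBlobContainerClient(containerName);

  List<MetadataDetails> metadata = new ArrayList<>();
  for (BlobItem blob : container.listBlobs()) {
    // Translate each blob's name, size, and format into the platform's
    // MetadataDetails bean; its exact fields are platform-specific.
    // metadata.add(toMetadataDetails(blob.getName(), blob.getProperties().getContentLength()));
  }
  return metadata;
}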
Microservice: plf-configuration-service

This service is responsible for managing the configurations of the platform, such as user settings, tool settings, and tenant settings.

Insert DB entries under databases and data stores:

INSERT INTO `setting` (`name`, `description`, `config_code`, `section`, `sub_section`, `config_version`, `selected`, `default`, `provider_code`, `logo`, `version`, `created_by`, `created_on`, `updated_by`, `updated_on`)
VALUES ('Azure Blob', 'Azure Blob object storage.', 'DATA_STORES', NULL, NULL, NULL, false, false, 'AZURE_BLOB', '/techx.png', 0, 'system', now(), 'system', now());

Action performed by Lazsa team.
Add AZURE_BLOB to the enum com.calibo.platform.core.enums.ProviderEnum:

public enum ProviderEnum {
    // ... existing providers ...
    AZURE_BLOB;
}

Action performed by Lazsa team.

Microservice: plf-common-orchestrator

This is a Lazsa agent that executes data-related activities against tools that are not exposed outside the client network. This component abstracts the tools defined and added on the client side from the platform.

Implement AzureBlobCrawler under the package com.calibo.platform.common.service.

AzureBlobCrawler implements all the methods of the interface com.calibo.platform.common.service.RelationalDatasourceGenericRepository:
public class AzureBlobCrawler implements RelationalDatasourceGenericRepository {

  @Override
  public List<String> getTableList(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public TableDescription getTableDescription(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public TableDescription getJoinedPreview(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public String generateQuery(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public TableDescription getTableDescriptionFromQuery(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  // The mandatory method to implement for proper working of any generic crawler
  @Override
  public List<MetadataDetails> fetchMetadata(RdsOrchestratorBean rdsOrchestratorBean)
      throws SQLException, IOException, JSchException, SftpException {
    return null;
  }
}

Action performed by Adapter Development Team.

Compile all the above services and deploy them in the ecosystem to start the integration process.

Deployment process of the impacted microservices

Once placed in the proper packages of the services, the developed code has to be pushed to the source code repositories through the standard approval process. As new code reaches the source code repository, the build server detects the changes and triggers automated build and deploy jobs.

The build server pulls the changes from the specific repositories and branches and builds the code, producing the packaged JAR that is published to the artifactory.

The build server then triggers the deployment of the packaged JAR on the specific environment.

Steps to verify the Azure Blob crawler in the Calibo Accelerate platform

The newly developed crawler must be added as a data store in Cloud Platform, Tools & Technologies under the Configuration section.

Each step below lists the action and its details.
Log in to the Calibo Accelerate platform
  1. Log in to the platform using your credentials.

  2. Keep the TenantID and access token ready for authentication and authorization.

Get a list of the supported data types

Fetch a list of supported data types by using the following API. The response provides all the datastore types, including the newly added data store to be consumed by the crawler.

curl 'https://lazsa<env>.calibo.com/configuration/settings/list?configCode=DATA_STORES' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>'
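A trimmed, hypothetical response entry for the new type might look like the following (field names inferred from the setting table shown earlier):

{
    "name": "Azure Blob",
    "configCode": "DATA_STORES",
    "providerCode": "AZURE_BLOB",
    "logo": "/techx.png"
}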
Add an instance of the new data type to be consumed in the platform for crawling

Trigger the following API to add an instance of AZURE_BLOB to the platform configuration. In the payload, "attributes" is a list of key-value pairs required to connect to the technology and perform the required operations.

curl 'https://lazsa<env>.calibo.com/configuration/datastores' \
  -X 'PUT' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  -H 'Accept: application/json, text/plain, */*' \
  --data-raw '{"attributes":[{"attributeName":"host","attributeValue":"xyz"},{"attributeName":"userName","attributeValue":"abc"},{"attributeName":"password","attributeValue":"abc"}],"description":"P Demo FTP to be deleted","isSelected":true,"name":"P Demo FTP","isPasswordProtectEnabled":false,"type":"AZURE_BLOB","subType":null,"usage":"SYSTEM","isOrchestratorConfiguration":false,"identitySecurityProvider":"LAZSA"}'
Fetch the instances added for consumption by the crawler

Retrieve the newly added datastore by using the following API:

curl 'https://lazsa<env>.calibo.com/configuration/v2/settings/dataStores' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --compressed

The response is a list of the following objects:

[
    {
        "id": "b458692f-0649-4807-88e1-63c27c51b956",
        "name": "P Demo FTP",
        "description": "P Demo FTP to be deleted",
        "configType": null,
        "type": "AZURE_BLOB",
        "subType": null,
        "usage": "SYSTEM",
        "isSelected": true,
        "isPasswordProtectEnabled": false,
        "logo": "/ftp.png",
        "createdBy": "psadhukhan@calibo.com",
        "createdOn": "2023-08-23T05:48:08",
        "updatedBy": null,
        "updatedOn": "2023-08-23T05:48:08",
        "attributes": [],
        "isAdmin": null,
        "accessMode": null,
        "validityTime": null,
        "identitySecurityProvider": "LAZSA",
        "isOrchestratorConfiguration": false,
        "pendingRequestCount": 0,
        "isAccessRequested": false,
        "updatedByUsername": null,
        "is_manage_access_allowed": true,
        "created_by_username": "Pritam Sadhukhan"
    }
]

Add a new crawler using the new instance added in Configuration

Add a supported crawler that uses the existing configuration by using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{"name":"P Demo Crawler","sourceType":"AZURE_BLOB","subType":"","attributes":[{"dataStoreId":"b458692f-0649-4807-88e1-63c27c51b956","attributeName":"host","attributeValue":"xyz"},{"dataStoreId":"b458692f-0649-4807-88e1-63c27c51b956","attributeName":"userName","attributeValue":"abc"},{"dataStoreId":"b458692f-0649-4807-88e1-63c27c51b956","attributeName":"password","attributeValue":"abc"}]}'
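The response typically echoes the created crawler along with its generated id, which the later steps use to run the crawler and check its status. A trimmed, hypothetical example:

{
    "id": "aa10a268-901e-4c8f-a1ed-7c6c6ff6d935",
    "name": "P Demo Crawler",
    "sourceType": "AZURE_BLOB"
}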
Fetch a list of crawlers added to the Platform

Fetch a list of all the added crawlers by using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>'

The response is a list of objects:

[
    {
        "id": "aa10a268-901e-4c8f-a1ed-7c6c6ff6d935",
        "name": "P demo crawler",
        "createdBy": "Pritam Sadhukhan",
        "createdDate": "2023-08-22T12:07:48",
        "updatedBy": "Pritam Sadhukhan",
        "updatedDate": "2023-08-22T12:08:02",
        "lastExecutionTime": "2023-08-22T12:08:02",
        "sourceType": "AZURE_BLOB",
        "subSourceType": "",
        "status": "SUCCESS",
        "attributes": null,
        "attributesJson": null,
        "projectId": null,
        "releaseId": null,
        "workstreamId": null,
        "noOfRuns": 1
    }
]
Execute the recently added crawler

Execute the crawler using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers/run?crawlerId=aa10a268-901e-4c8f-a1ed-7c6c6ff6d935' \
  -X 'PUT' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{}'
Fetch the status of the executed crawler

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers/run?crawlerId=aa10a268-901e-4c8f-a1ed-7c6c6ff6d935' \
  -X 'PUT' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{}'

The response contains the details of the crawler:

{
    "id": "aa10a268-901e-4c8f-a1ed-7c6c6ff6d935",
    "name": "P demo crawler",
    "createdBy": "Pritam Sadhukhan",
    "createdDate": "2023-08-22T12:07:48",
    "updatedBy": "Pritam Sadhukhan",
    "updatedDate": "2023-08-23T06:39:40",
    "lastExecutionTime": "2023-08-23T06:39:40",
    "sourceType": "AZURE_BLOB",
    "subSourceType": "",
    "status": "SUCCESS",
    "attributes": null,
    "attributesJson": null,
    "projectId": null,
    "releaseId": null,
    "workstreamId": null,
    "noOfRuns": 3
}

The status attribute of the response indicates the status of the crawler run.
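For example, to extract just the status in a shell (assuming jq is installed):

curl -s 'https://lazsa-dis.calibo.com/datapipeline/crawlers/run?crawlerId=aa10a268-901e-4c8f-a1ed-7c6c6ff6d935' \
  -X 'PUT' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{}' | jq -r '.status'
# prints, e.g., SUCCESS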

Analyze the crawled data from the crawler

The crawler provides data in the standard format, which can be checked using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers/aa10a268-901e-4c8f-a1ed-7c6c6ff6d935/details' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>'

The response is a list of details of the crawled data:

[
    {
        "schema": "CSV",
        "tables": [
            {
                "id": "9ac686f6-80b0-4788-b7c9-f62ee676ad80",
                "tableName": "argentina",
                "owner": null,
                "fields": [
                    { "id": "a6b04036-c696-433f-bb37-4ecd8872dd6d", "name": "c_7", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "ee391a3a-5ac2-4e34-80fc-6240d304a11e", "name": "c_5", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "96628874-7f29-444c-ae96-414edfc44dc9", "name": "c_10", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "2187b8dd-e54d-4d07-975f-d3dc7309b1ee", "name": "c_6", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "97d173bd-7b1a-4ffc-9826-a005c2ef25ed", "name": "c_2", "dataType": "int", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "f4b8f4c5-1e51-4630-a4ee-7406f9bb5de9", "name": "c_1", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "dde4575b-21f6-4a33-87d8-076697b37dd7", "name": "c_9", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "15e910cc-9dad-42d5-a4ad-6c8b3d444c7f", "name": "c_3", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "ea010927-b765-4bd3-8c22-972ce5f57c88", "name": "c_8", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "27db5f13-0882-4998-ac81-f78b23dd4920", "name": "c_4", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "83cc94b5-f8a3-4a67-80fa-dca5cde3a56f", "name": "c_0", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] }
                ],
                "status": "UNCHANGED",
                "type": null
            }
        ],
        "joins": [],
        "status": "UNCHANGED",
        "name": null,
        "id": null
    }
]
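To quickly inspect the crawled schema from a shell (assuming jq is installed), you can list each table with its field names:

curl -s 'https://lazsa-dis.calibo.com/datapipeline/crawlers/aa10a268-901e-4c8f-a1ed-7c6c6ff6d935/details' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  | jq -r '.[].tables[] | "\(.tableName): \(.fields | map(.name) | join(", "))"'
# prints, e.g., argentina: c_7, c_5, c_10, c_6, c_2, c_1, c_9, c_3, c_8, c_4, c_0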

 


What's next? Integrate a New Technology in an Existing Crawler Category