Add a New Crawler

Purpose

This topic describes how to add a new crawler for Azure Blob Storage to the Calibo Accelerate platform.

Azure Blob Storage is an object store that holds files of different types, such as CSV, TSV, and Excel. The metadata and data formats of file objects differ from those of traditional RDBMS systems, so RDBMS crawlers do not support such file systems.

To support such storage, new crawlers must be developed and integrated with the Calibo Accelerate platform.

The diagram below shows the related microservices. Implementing any new crawler affects this set of services.

Microservice                 Change Impact
plf-elab-service             Configuration changes
plf-configuration-service    Configuration changes
plf-data-pipeline-designer   Code changes
plf-common-orchestrator      Code changes

 

Microservices impacted by Crawler development

In the Calibo Accelerate platform ecosystem, several microservices provide the crawler functionality. To develop a new crawler, say AzureCrawler, implement the exposed interfaces and add the implementation classes to the respective folders for compilation.

Once the developed classes are compiled with the plf-data-pipeline-designer service, the crawler has to be integrated with the platform.

Integration involves creating settings and entries in other modules of the platform so that they can consume the newly developed code. Adjustments must be made in plf-elab-service and plf-configuration-service to consume the new crawler.

Similarly, if the developed crawler needs to be added to the Lazsa orchestrator framework, then in addition to the above changes, the crawler classes must be compiled with plf-common-orchestrator.

In the above diagram, AzureCrawler is the developed piece of code that extends the plf-data-pipeline-designer service interface and is compiled with different components of the Lazsa platform.

Steps for adding Azure Blob Storage as a crawler

You need to make the following changes to the different microservices to add Azure Blob Storage as a crawler.

Each step below lists the required change, an example, and the team that performs the action.
Microservice: plf-elab-service

This service orchestrates the business logic of the platform that deals with product portfolios and products.

Add an entry to the enum com.calibo.platform.elab.enums.ProviderEnum:

public enum ProviderEnum {
    // ... existing providers ...
    AZURE_BLOB;
}
Action performed by Lazsa team.
Microservice: plf-data-pipeline-designer

This service is responsible for all business logic related to data management of the platform, such as data crawlers, data visualizers, and the data catalog.

Add an entry to the crawler utility class in the package com.calibo.platform.dpd.util:
public RelationalDatasourceGenericRepository getRelationalDatasourceRepository(ProviderEnum type) {
  switch (type) {
    case RDBMS:
    case MYSQL:
    case AWS_RDS_MYSQL:
    case AWS_S3:
    case POSTGRE_SQL:
    case AWS_RDS_POSTGRE_SQL:
    case AZURE_RDS_POSTGRE_SQL:
    case SNOWFLAKE:
    case MS_SQL_SERVER:
    case AWS_RDS_MARIA_DB:
    case ORACLE:
    case REST_API:
      return restAPICrawler;
    case CSV:
      return csvCrawler;
    case EXCEL:
      return excelCrawler;
    case SFTP:
      return sftpCrawler;
    case FTP:
      return ftpCrawler;
    case PARQUET:
      return parquetCrawler;
    case AZURE_BLOB:
      return azureBlobCrawler;
    default:
      throw new IllegalArgumentException("Datastore not supported.");
  }
}
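The crawler fields returned above (csvCrawler, azureBlobCrawler, and so on) are the implementation beans held by the utility class. A minimal, hypothetical caller, assuming Spring-style injection (the CrawlerUtil name and wiring are assumptions for illustration, not the platform's actual API):

@Autowired
private CrawlerUtil crawlerUtil; // utility class exposing getRelationalDatasourceRepository (name assumed)

// Resolve the crawler for the datastore type, then delegate metadata crawling to it.
public List<MetadataDetails> crawl(ElabDataStoreBeanWithAttributesMap dataStore) throws Exception {
  RelationalDatasourceGenericRepository crawler =
      crawlerUtil.getRelationalDatasourceRepository(ProviderEnum.AZURE_BLOB);
  return crawler.fetchMetadata(dataStore);
}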
Implement AzureBlobCrawler under the package com.calibo.platform.dpd.service.impl.crawler.

AzureBlobCrawler implements all the methods of the interface com.calibo.platform.dpd.repository.RelationalDatasourceGenericRepository:
public class AzureBlobCrawler implements RelationalDatasourceGenericRepository { 
 
  @Override 
  public List<String> getTableList(ElabDataStoreBeanWithAttributesMap elabDataStore) 
      throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getTableDescription(ElabDataStoreBeanWithAttributesMap elabDataStore, 
      String tableName) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getJoinedPreview(ElabDataStoreBeanWithAttributesMap datastore, 
      List<RelationalDatasourceNodeConfigSelectAttribute> selectRefs, 
      List<RelationalDatasourceNodeConfigJoinAttribute> joinRefs) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public String generateQuery(ElabDataStoreBeanWithAttributesMap datastore, 
      List<RelationalDatasourceNodeConfigSelectAttribute> selectRefs, 
      List<RelationalDatasourceNodeConfigJoinAttribute> joinRefs) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getTableDescriptionFromQuery(ElabDataStoreBeanWithAttributesMap datastore, 
      String query) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public TableDescription getCustomJoinQueryPreview(Node node, 
      ElabDataStoreBeanWithAttributesMap attributesMap) throws SQLException { 
    return null; 
  } 
 
  @Override 
  public Map<String, String> getCrawlerFiles(FilePathRequestBean filePathRequestBean) { 
    return null; 
  } 
 

 // The mandatory method to implement for proper working of any generic crawler 
  @Override 
  public List<MetadataDetails> fetchMetadata(ElabDataStoreBeanWithAttributesMap dataStore) 
      throws SQLException, IOException, JSchException, SftpException { 
    return null; 
  } 
}
Action performed by Adapter Development Team.
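For orientation, here is a minimal sketch of what the mandatory fetchMetadata method could look like for Azure Blob Storage, using the Azure Storage SDK for Java (com.azure:azure-storage-blob). The attribute keys, the accessor on the datastore bean, and the mapping into MetadataDetails are assumptions for illustration, not the platform's actual API:

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import java.util.ArrayList;
import java.util.List;

@Override
public List<MetadataDetails> fetchMetadata(ElabDataStoreBeanWithAttributesMap dataStore) {
  // Connection details come from the datastore attributes configured in the platform;
  // the accessor and attribute keys below are assumed for this sketch.
  String connectionString = dataStore.getAttributesMap().get("connectionString");
  String containerName = dataStore.getAttributesMap().get("container");

  BlobContainerClient container = new BlobServiceClientBuilder()
      .connectionString(connectionString)
      .buildClient()
      .getBlobContainerClient(containerName);

  List<MetadataDetails> metadata = new ArrayList<>();
  for (BlobItem blob : container.listBlobs()) {
    // Translate each blob's name, size, and format into the platform's
    // MetadataDetails bean; its exact fields are platform-specific.
    // metadata.add(toMetadataDetails(blob.getName(), blob.getProperties().getContentLength()));
  }
  return metadata;
}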
Microservice: plf-configuration-service

This service is responsible for managing the configurations of the platform, such as user settings, tool settings, and tenant settings.

Insert DB entries under databases and data stores:

INSERT INTO `setting` (`name`, `description`, `config_code`, `section`, `sub_section`, `config_version`, `selected`, `default`, `provider_code`, `logo`, `version`, `created_by`, `created_on`, `updated_by`, `updated_on`)
VALUES ('Azure Blob', 'Azure Blob object storage.', 'DATA_STORES', NULL, NULL, NULL, false, false, 'AZURE_BLOB', '/techx.png', 0, 'system', now(), 'system', now());

Action performed by Lazsa team.
Add AZURE_BLOB to the enum com.calibo.platform.core.enums.ProviderEnum:

public enum ProviderEnum {
    // ... existing providers ...
    AZURE_BLOB;
}

Action performed by Lazsa team.

Microservice: plf-common-orchestrator

This is a Lazsa agent that executes data-related activities against tools that are not exposed outside the client network. This component abstracts the tools defined and added on the client side from the platform.

Implement AzureBlobCrawler under the package com.calibo.platform.common.service.

AzureBlobCrawler implements all the methods of the interface com.calibo.platform.common.service.RelationalDatasourceGenericRepository:
public class AzureBlobCrawler implements RelationalDatasourceGenericRepository {

  @Override
  public List<String> getTableList(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public TableDescription getTableDescription(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public TableDescription getJoinedPreview(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public String generateQuery(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  @Override
  public TableDescription getTableDescriptionFromQuery(RdsOrchestratorBean rdsOrchestratorBean) throws SQLException {
    return null;
  }

  // The mandatory method to implement for proper working of any generic crawler
  @Override
  public List<MetadataDetails> fetchMetadata(RdsOrchestratorBean rdsOrchestratorBean)
      throws SQLException, IOException, JSchException, SftpException {
    return null;
  }
}

Action performed by Adapter Development Team.

Compile all the above services and deploy them in the ecosystem to start the integration process.

Deployment process of the impacted microservices

Once placed in the proper packages of the services, the developed code has to be pushed to the source code repositories through the standard approval process. As new code reaches the source code repository, the build server detects the changes and triggers automated build and deploy jobs.

The build server pulls the changes from the specific repositories and branches and builds the code, producing the packaged JAR that is published to the artifactory.

The build server then triggers the deployment of the packaged JAR on the specific environment.

Steps to verify the Azure Blob crawler in the Calibo Accelerate platform

The newly developed crawler must be added as a data store in Cloud Platform, Tools & Technologies under the Configuration section.

Each step below lists the action and its details.
Log in to the Calibo Accelerate platform
  1. Log in to the platform using your credentials.

  2. Keep the TenantID and access token ready for authentication and authorization.

Get a list of the supported data types

Fetch a list of supported data types by using the following API. The response provides all the datastore types, including the newly added data store to be consumed by the crawler.

curl 'https://lazsa<env>.calibo.com/configuration/settings/list?configCode=DATA_STORES' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>'
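A trimmed, hypothetical response entry for the new type might look like the following (field names inferred from the setting table shown earlier):

{
    "name": "Azure Blob",
    "configCode": "DATA_STORES",
    "providerCode": "AZURE_BLOB",
    "logo": "/techx.png"
}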
Add an instance of the new data type to be consumed in the platform for crawling

Trigger the following API to add an instance of AZURE_BLOB to the platform configuration. In the payload, "attributes" is a list of key-value pairs required to connect to the technology and perform the required operations.

curl 'https://lazsa<env>.calibo.com/configuration/datastores' \
  -X 'PUT' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  -H 'Accept: application/json, text/plain, */*' \
  --data-raw '{"attributes":[{"attributeName":"host","attributeValue":"xyz"},{"attributeName":"userName","attributeValue":"abc"},{"attributeName":"password","attributeValue":"abc"}],"description":"P Demo FTP to be deleted","isSelected":true,"name":"P Demo FTP","isPasswordProtectEnabled":false,"type":"AZURE_BLOB","subType":null,"usage":"SYSTEM","isOrchestratorConfiguration":false,"identitySecurityProvider":"LAZSA"}'
Fetch the instances added for consumption by the crawler

Retrieve the newly added datastore by using the following API:

curl 'https://lazsa<env>.calibo.com/configuration/v2/settings/dataStores' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --compressed

The response is a list of the following objects:

[
    {
        "id": "b458692f-0649-4807-88e1-63c27c51b956",
        "name": "P Demo FTP",
        "description": "P Demo FTP to be deleted",
        "configType": null,
        "type": "AZURE_BLOB",
        "subType": null,
        "usage": "SYSTEM",
        "isSelected": true,
        "isPasswordProtectEnabled": false,
        "logo": "/ftp.png",
        "createdBy": "psadhukhan@calibo.com",
        "createdOn": "2023-08-23T05:48:08",
        "updatedBy": null,
        "updatedOn": "2023-08-23T05:48:08",
        "attributes": [],
        "isAdmin": null,
        "accessMode": null,
        "validityTime": null,
        "identitySecurityProvider": "LAZSA",
        "isOrchestratorConfiguration": false,
        "pendingRequestCount": 0,
        "isAccessRequested": false,
        "updatedByUsername": null,
        "is_manage_access_allowed": true,
        "created_by_username": "Pritam Sadhukhan"
    }
]

Add a new crawler using the new instance added in Configuration

Add a supported crawler that uses the existing configuration by using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{"name":"P Demo Crawler","sourceType":"AZURE_BLOB","subType":"","attributes":[{"dataStoreId":"b458692f-0649-4807-88e1-63c27c51b956","attributeName":"host","attributeValue":"xyz"},{"dataStoreId":"b458692f-0649-4807-88e1-63c27c51b956","attributeName":"userName","attributeValue":"abc"},{"dataStoreId":"b458692f-0649-4807-88e1-63c27c51b956","attributeName":"password","attributeValue":"abc"}]}'
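The response typically echoes the created crawler along with its generated id, which the later steps use to run the crawler and check its status. A trimmed, hypothetical example:

{
    "id": "aa10a268-901e-4c8f-a1ed-7c6c6ff6d935",
    "name": "P Demo Crawler",
    "sourceType": "AZURE_BLOB"
}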
Fetch a list of crawlers added to the Platform

Fetch a list of all the added crawlers by using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>'

The response is a list of objects:

[
    {
        "id": "aa10a268-901e-4c8f-a1ed-7c6c6ff6d935",
        "name": "P demo crawler",
        "createdBy": "Pritam Sadhukhan",
        "createdDate": "2023-08-22T12:07:48",
        "updatedBy": "Pritam Sadhukhan",
        "updatedDate": "2023-08-22T12:08:02",
        "lastExecutionTime": "2023-08-22T12:08:02",
        "sourceType": "AZURE_BLOB",
        "subSourceType": "",
        "status": "SUCCESS",
        "attributes": null,
        "attributesJson": null,
        "projectId": null,
        "releaseId": null,
        "workstreamId": null,
        "noOfRuns": 1
    }
]
Execute the recently added crawler

Execute the crawler using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers/run?crawlerId=aa10a268-901e-4c8f-a1ed-7c6c6ff6d935' \
  -X 'PUT' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{}'
Fetch the status of the executed crawler

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers/run?crawlerId=aa10a268-901e-4c8f-a1ed-7c6c6ff6d935' \
  -X 'PUT' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{}'

The response contains the details of the crawler:

{
    "id": "aa10a268-901e-4c8f-a1ed-7c6c6ff6d935",
    "name": "P demo crawler",
    "createdBy": "Pritam Sadhukhan",
    "createdDate": "2023-08-22T12:07:48",
    "updatedBy": "Pritam Sadhukhan",
    "updatedDate": "2023-08-23T06:39:40",
    "lastExecutionTime": "2023-08-23T06:39:40",
    "sourceType": "AZURE_BLOB",
    "subSourceType": "",
    "status": "SUCCESS",
    "attributes": null,
    "attributesJson": null,
    "projectId": null,
    "releaseId": null,
    "workstreamId": null,
    "noOfRuns": 3
}

The status attribute of the response indicates the status of the crawler run.
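For example, to extract just the status in a shell (assuming jq is installed):

curl -s 'https://lazsa-dis.calibo.com/datapipeline/crawlers/run?crawlerId=aa10a268-901e-4c8f-a1ed-7c6c6ff6d935' \
  -X 'PUT' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  --data-raw '{}' | jq -r '.status'
# prints, e.g., SUCCESS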

Analyze the crawled data from the crawler

The crawler provides data in the standard format, which can be checked using the following API:

curl 'https://lazsa-dis.calibo.com/datapipeline/crawlers/aa10a268-901e-4c8f-a1ed-7c6c6ff6d935/details' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>'

The response is a list of details of the crawled data:

[
    {
        "schema": "CSV",
        "tables": [
            {
                "id": "9ac686f6-80b0-4788-b7c9-f62ee676ad80",
                "tableName": "argentina",
                "owner": null,
                "fields": [
                    { "id": "a6b04036-c696-433f-bb37-4ecd8872dd6d", "name": "c_7", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "ee391a3a-5ac2-4e34-80fc-6240d304a11e", "name": "c_5", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "96628874-7f29-444c-ae96-414edfc44dc9", "name": "c_10", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "2187b8dd-e54d-4d07-975f-d3dc7309b1ee", "name": "c_6", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "97d173bd-7b1a-4ffc-9826-a005c2ef25ed", "name": "c_2", "dataType": "int", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "f4b8f4c5-1e51-4630-a4ee-7406f9bb5de9", "name": "c_1", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "dde4575b-21f6-4a33-87d8-076697b37dd7", "name": "c_9", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "15e910cc-9dad-42d5-a4ad-6c8b3d444c7f", "name": "c_3", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "ea010927-b765-4bd3-8c22-972ce5f57c88", "name": "c_8", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "27db5f13-0882-4998-ac81-f78b23dd4920", "name": "c_4", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] },
                    { "id": "83cc94b5-f8a3-4a67-80fa-dca5cde3a56f", "name": "c_0", "dataType": "string", "comment": null, "isPrimaryKey": null, "status": "UNCHANGED", "conditionAdded": null, "conditions": [] }
                ],
                "status": "UNCHANGED",
                "type": null
            }
        ],
        "joins": [],
        "status": "UNCHANGED",
        "name": null,
        "id": null
    }
]
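To quickly inspect the crawled schema from a shell (assuming jq is installed), you can list each table with its field names:

curl -s 'https://lazsa-dis.calibo.com/datapipeline/crawlers/aa10a268-901e-4c8f-a1ed-7c6c6ff6d935/details' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-TenantID: <tenant>' \
  | jq -r '.[].tables[] | "\(.tableName): \(.fields | map(.name) | join(", "))"'
# prints, e.g., argentina: c_7, c_5, c_10, c_6, c_2, c_1, c_9, c_3, c_8, c_4, c_0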

 


What's next? Integrate a New Technology in an Existing Crawler Category