Data Quality Adapters

Data Quality ensures that the information used to make key business decisions is reliable, accurate, and complete. It is therefore crucial to maintain data quality throughout the data management process.

  • Data quality measures the extent to which a dataset satisfies specific benchmarks, gauging its alignment with predetermined standards.

  • Ensuring data quality maintains a robust standard throughout data processing, so that the data in use is both accurate and dependable.

Ensuring high data quality involves a series of key considerations and practices. The journey towards optimal data quality entails the following sequential steps (a minimal code sketch follows the list):

  • Data Profiling - Examining data sources to understand how the data is structured and to assess its quality. This helps catch problems before they cause trouble downstream.

  • Data Cleansing - Removing or correcting errors, inconsistencies, and inaccuracies in the data. This may involve processes like deduplication, standardization, and validation.

  • Data Transformation Rules - Applying business rules and transformations to ensure that data is properly formatted, normalized, and conforms to the target schema.

  • Data Validation - Checking the transformed data to ensure that it meets predefined quality criteria and business rules.

  • Data Enrichment - Enhancing data with additional information from external sources to improve its completeness and relevance. For example, handling null and empty values.

  • Error Handling - Establishing mechanisms to identify and address data quality issues during data processing.
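
As a minimal illustration of the cleansing and validation steps above, the sketch below replaces empty strings with nulls and then checks a simple completeness rule using Snowpark for Java, the same library used in the snippets later on this page. The connection profile, table, and column names are hypothetical; this is a sketch, not the platform's implementation.

import com.snowflake.snowpark_java.*;

public class DataQualitySketch {
    public static void main(String[] args) {
        // Hypothetical connection profile; replace with your own settings.
        Session session = Session.builder().configFile("profile.properties").create();
        DataFrame df = session.table("CUSTOMERS"); // hypothetical table

        // Cleansing: treat empty FIRST_NAME strings as missing values.
        df = df.withColumn("FIRST_NAME",
                Functions.when(Functions.col("FIRST_NAME").equal_to(Functions.lit("")),
                                Functions.lit(null))
                        .otherwise(Functions.col("FIRST_NAME")));

        // Validation: completeness rule - both name parts must be present in every record.
        long total = df.count();
        long complete = df.filter(Functions.col("FIRST_NAME").is_not_null()
                .and(Functions.col("LAST_NAME").is_not_null())).count();
        System.out.printf("Completeness: %.2f%n", total == 0 ? 1.0 : (double) complete / total);
    }
}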

Poor data quality can lead to incorrect insights, inaccurate analysis, and unreliable business decisions. Organizations therefore invest significant effort in establishing robust data quality practices.

APIs and Interfaces used in Data Quality

Microservice: plf-snowflake-integration

Add a constraints entry to the Data Analyser switch case and call the applyConstraint() method with the right set of parameters.

Note: Other switch cases are already implemented.

Class: com.calibo.platform.snowflake.service.impl.DataAnalyserImpl

processDataAnalyser(JobRunStatus jobRunStatus, DataStoreBean target, Map<String, Object> targetAttrMap, Map<String, Object> sourceAttrMap)

switch (constraint.getConstraintName()) {
    case "MaxLength":
        // Build the aggregate expression for the MaxLength constraint.
        applyConstraint("max(length(%s)) as %s,", constraint.getColumn1(), constraint.getCustomColumnName(), "", columnQuery);
        break;
}
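
For example, a MinLength entry (listed among the supported constraints below) could follow the same pattern. The format string and parameters here are illustrative, mirroring the MaxLength case rather than reproducing the actual implementation:

    case "MinLength":
        // Hypothetical entry, mirroring the MaxLength case above.
        applyConstraint("min(length(%s)) as %s,", constraint.getColumn1(), constraint.getCustomColumnName(), "", columnQuery);
        break;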

Add a constraints entry to the Issue Resolver switch case and add the appropriate implementation.

Note: Other switch cases are already implemented.

Class: com.calibo.platform.snowflake.service.impl.ResolverServiceImpl

DataFrame handleOutlierData(DataFrame df, Map<String, List<Outlier>> outlierData)

switch (action) {
    case "max":
        // Replace values flagged as outliers (exp) with the column's maximum value.
        Object max = cloneDf.select(Functions.max(Functions.col(columnName))).collect()[0].get(0);
        df = df.withColumn(columnName,
                Functions.when(exp, Functions.lit(max.toString())).otherwise(Functions.col(columnName)));
        break;
}
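
For example, a min action (one of the outlier actions listed below) could be handled symmetrically to the max case. This is a sketch reusing the same Snowpark calls; the exp condition and columnName come from the surrounding method:

    case "min":
        // Hypothetical entry: replace values flagged by exp with the column's minimum.
        Object min = cloneDf.select(Functions.min(Functions.col(columnName))).collect()[0].get(0);
        df = df.withColumn(columnName,
                Functions.when(exp, Functions.lit(min.toString())).otherwise(Functions.col(columnName)));
        break;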

Add a constraints entry to the Data Profiler switch case and add the appropriate implementation.

Note: Other switch cases are already implemented.

Class: com.calibo.platform.snowflake.service.impl.ProfilerServiceImpl

processDataProfiler(JobRunStatus jobRunStatus, DataStoreBean target, Map<String, Object> targetAttrMap, Map<String, Object> sourceAttrMap)

switch (profilerFeature) {
    case "characterCount":
        // Compute character counts for the selected columns.
        pf.calculateCharCounts(session, tableName, columnList);
        break;
}
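
For example, a completeness feature could be wired the same way. The exact call is an assumption based on the calculateCompleteness() description below, not the actual implementation:

    case "completeness":
        // Hypothetical entry, mirroring the characterCount case above;
        // calculateCompleteness() takes tableName and profiler per the API notes below.
        pf.calculateCompleteness(tableName, profiler);
        break;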

 

Following are the APIs and interfaces used in data quality:

  • calculateCompleteness()

    • This method takes tableName and profiler as parameters.

    • It calculates the completeness profiler result data.

  • processDataProfiler()

    • This method takes the SourceDataStore and TargetDataStore beans as request parameters.

    • It processes and generates the profiler output.

    • It returns the profiler result set count.

  • handleMissingData()

    • This method takes a dataframe, dropNullValues, and replaceNullValues as parameters.

    • It handles and replaces missing data.

    • It returns the resulting Issue Resolver dataframe.

  • handleOutlierData()

    • This Issue Resolver method takes a dataframe and an outlierData list as parameters.

    • It handles outliers with actions such as drop, mean, max, and min.

    • It returns the resulting dataframe with outliers handled.

  • handleCaseSensitiveData()

    • This method takes a dataframe and caseSensitiveData as parameters.

    • It handles data with functions such as upper, proper, and lower.

    • It returns the resulting dataframe.

  • analyserAndValidator()

    • This method takes a dataframe, constraints, and a table name as request parameters.

    • It handles data with constraints such as Size, ApproxCountDistinct, MinLength, MaxLength, CountDistinct, Sum, Mean, StandardDeviation, Correlation, Completeness, Compliance, Uniqueness, Entropy, isLessThan, and isGreaterThan.

    • It returns the AnalyserAndValidator output.
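
A minimal sketch of how the Issue Resolver methods above might be chained. The resolverService instance, the input dataframe df, and the parameter collections (dropNullValues, replaceNullValues, outlierData, caseSensitiveData) are illustrative assumptions:

    // Illustrative only: run the resolver steps in sequence on a dataframe.
    DataFrame resolved = resolverService.handleMissingData(df, dropNullValues, replaceNullValues);
    resolved = resolverService.handleOutlierData(resolved, outlierData);
    resolved = resolverService.handleCaseSensitiveData(resolved, caseSensitiveData);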

 

Data Quality constraints

Based on the exact use case, data quality constraints or rules can be added. The supported constraints are as follows:

ApproxCountDistinct - Returns the approximate number of distinct values in a column.
Completeness - Checks whether the data fulfills the expectation of comprehensiveness. For example, if a customer name is requested, both the first name and the last name must be present for all records; if either is missing, the record is incomplete.
Compliance - Calculates the fraction of rows that match the given column constraint.
Correlation - Calculates the Pearson correlation coefficient between the selected columns.
CountDistinct - Returns the count of distinct elements in a column.
DataType - Returns the data type of the column, for example boolean, fraction, or integer.
Distinctness - Returns the fraction of distinct values in a column.
Entropy - Returns a measure of the disorder contained in a message.
Maximum - Returns the maximum value of a numeric column.
MaxLength - Returns the maximum length of a column with string data type.
Mean - Returns the average value of a numeric column.
Minimum - Returns the minimum value of a numeric column.
MinLength - Returns the minimum length of a column with string data type.
MutualInformation - Returns how much information about one column can be inferred from another column. Applicable to numeric and string data types.
PatternMatch - Checks column values against a given regular expression pattern.
Size - Returns the size (number of records) of the dataset.
StandardDeviation - Shows the variation from the mean value of a column.
Sum - Provides the sum of the column values.
UniqueValueRatio - Returns the unique value ratio of a column. Applicable to numeric and string data types.
Uniqueness - Returns the ratio of unique values against all values of a column. Applicable to numeric and string data types.
hasSize - Confirms that the dataset has the expected size.
isComplete - Confirms whether a column is complete.
hasCompleteness - Confirms whether a column is complete based on the historical completeness of the column.
isUnique - Confirms whether a column is unique.
hasUniqueness - Confirms whether a column or set of columns has uniqueness. Uniqueness is the fraction of unique values in a column.
hasDistinctness - Confirms whether a column or set of columns has distinctness. Distinctness is the fraction of distinct values in a column.
hasUniqueValueRatio - Confirms whether a column or set of columns has the expected unique value ratio.
hasEntropy - Confirms whether a column has entropy. Entropy is a measure of the disorder contained in a message.
hasMutualInformation - Confirms whether two columns have mutual information, that is, how much information about one column can be inferred from the other.
hasMinLength - Confirms the minimum length of a column with string data type.
hasMaxLength - Confirms the maximum length of a column with string data type.
hasMin - Confirms the minimum of a column that contains a long, integer, or float data type.
hasMax - Confirms the maximum of a column that contains a long, integer, or float data type.
hasSum - Confirms the sum of the column.
hasMean - Confirms the mean of the column.
hasStandardDeviation - Confirms that the column has the expected variation from its mean value.
hasApproxCountDistinct - Confirms that the column has the expected approximate distinct count.
hasCorrelation - Confirms that a Pearson correlation exists between two columns.
hasPattern - Confirms whether the values of a column match a given regular expression.
containsCreditCardNumber - Checks and confirms whether a column matches a credit card number pattern.
containsEmail - Checks and confirms whether a column matches an email pattern.
containsURL - Checks and confirms whether a column matches a URL pattern.
containsSocialSecurityNumber - Checks and confirms whether a column matches the Social Security Number pattern for the USA.
isNonNegative - Checks and confirms that a column does not contain any negative values.
isPositive - Checks and confirms that a column contains only values greater than 0.
isLessThan - Checks and confirms that in each row, the value of column A is less than the value of column B.
isLessThanOrEqualTo - Checks and confirms that in each row, the value of column A is less than or equal to the value of column B.
isGreaterThan - Checks and confirms that in each row, the value of column A is greater than the value of column B.
isGreaterThanOrEqualTo - Checks and confirms that in each row, the value of column A is greater than or equal to the value of column B.
isContainedIn - Checks and confirms that the values in a column are contained in a set of predefined values.
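
To make the row-level constraints concrete, the sketch below evaluates Compliance and isLessThan with the same Snowpark DataFrame calls used elsewhere on this page. The dataframe df, the column names, and the predicate are hypothetical:

    long total = df.count(); // assumes a non-empty dataframe

    // Compliance: fraction of rows where AGE >= 18 (illustrative predicate).
    double compliance = (double) df.filter(Functions.col("AGE").geq(Functions.lit(18))).count() / total;

    // isLessThan: every row must satisfy DISCOUNT < PRICE (illustrative columns).
    boolean isLessThan = df.filter(Functions.col("DISCOUNT").lt(Functions.col("PRICE"))).count() == total;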

 


What's next? Data Visualization Adapters