Data Quality Constraints
Select the data quality constraints that suit your use case. The Lazsa Platform supports the following constraints:
Constraint | Description | Applicable Data Type |
---|---|---|
Profiler Constraints | ||
ApproxCountDistinct | Returns the approximate number of distinct values in a column. | Numeric, string |
Completeness | Returns the fraction of non-null values in a column. For example, if customer name is required, it checks whether first name and last name are present for all records. If either of the two is missing, the record is incomplete. | Numeric, string |
DataType | Returns the data type of the column, for example, boolean, fraction, integer, and so on. | Numeric, string |
Validity | Assesses whether the data in a column adheres to the specified format or constraints, based on the provided regex. | String |
Count | Checks distinct, filled and null counts. | Numeric, string |
Character Count | Calculates counts of values that contain numbers, numbers only, letters only, both numbers and letters, and special characters. | Numeric, string |
Statistical Value | Computes statistical measures (for example, minimum, maximum, mean, and standard deviation) that describe the distribution of numeric data. | Numeric |
Recommendation | Provides suggestions or recommendations based on the profiling results, indicating potential improvements or actions. | Numeric, string |
Analyzer Constraints | ||
ApproxCountDistinct | Returns the approximate number of distinct values in a column. | Numeric, string |
Completeness | Returns the fraction of non-null values in a column. For example, if customer name is required, it checks whether first name and last name are present for all records. If either of the two is missing, the record is incomplete. | Numeric, string |
Compliance | Calculates the fraction of rows that match the given column constraint. | Numeric, string |
Correlation | Calculates the Pearson correlation coefficient between the selected columns. | Numeric |
CountDistinct | Returns the count of distinct elements in a column. | Numeric, string |
DataType | Returns the data type of the column, for example, boolean, fraction, integer, and so on. | Numeric, string |
Distinctness | Returns the fraction of distinct values in a column. | Numeric, string |
Entropy | Returns the measure of disorder contained in a message. | Numeric, string |
Maximum | Returns the maximum value of a numeric column. | Numeric |
MaxLength | Returns the maximum length of a column with the string data type. | String |
Mean | Returns the average value of a numeric column. | Numeric |
Minimum | Returns the minimum value of a numeric column. | Numeric |
MinLength | Returns the minimum length of a column with the string data type. | String |
MutualInformation | Information about one column that can be inferred from another column. | Numeric, string |
PatternMatch | Returns the fraction of values in a column that match the given regex pattern. | String |
Size | Returns the size of data. | N/A |
StandardDeviation | Shows the variation from the mean value of a column. | Numeric |
Sum | Provides the sum of the column values. | Numeric |
UniqueValueRatio | Returns the ratio of unique values to distinct values in a column. | Numeric, string |
Uniqueness | Returns the ratio of unique values against all values of a column. | Numeric, string |
Validator Constraints | ||
hasSize | Calculates the data frame size. | N/A |
isComplete | Confirms whether a column is complete. | Numeric, string |
hasCompleteness | Confirms whether a column is complete based on the historical completeness of the column. | Numeric, string |
isUnique | Confirms whether a column is unique. | Numeric, string |
hasUniqueness | Confirms whether a column or set of columns have uniqueness. Uniqueness is a fraction of unique values of a column. | Numeric, string |
hasDistinctness | Confirms whether a column or set of columns have distinctness. Distinctness is a fraction of distinct values of a column. | Numeric, string |
hasUniqueValueRatio | Confirms whether there is a unique value ratio in a column or set of columns. | Numeric, string |
hasEntropy | Confirms whether a column has entropy. Entropy is a measure of disorder contained in a message. | Numeric, string |
hasMutualInformation | Confirms whether two columns have mutual information. Mutual information means how much information about one column can be inferred from another column. | Numeric, string |
hasMinLength | Confirms the minimum length of a column with string data type. | String |
hasMaxLength | Confirms the maximum length of a column with string data type. | String |
hasMin | Confirms the minimum of a column that contains a long, integer, or float data type. | Numeric |
hasMax | Confirms the maximum of a column that contains a long, integer, or float data type. | Numeric |
hasSum | Confirms the sum of the column. | Numeric |
hasMean | Confirms the mean of the column. | Numeric |
hasStandardDeviation | Confirms that the column has variation from the mean value. | Numeric |
hasApproxCountDistinct | Confirms that the column has approximate distinct count. | Numeric, string |
hasCorrelation | Confirms that a Pearson correlation exists between two columns. | Numeric |
hasPattern | Confirms whether the values of a column match the regular expression. | String |
containsCreditCardNumber | Checks and confirms whether a column contains a credit card number pattern. | String |
containsEmail | Checks and confirms whether a column contains an email pattern. | String |
containsURL | Checks and confirms whether a column contains a URL pattern. | String |
containsSocialSecurityNumber | Checks and confirms whether a column contains the pattern of a Social Security Number for the USA. | String |
isNonNegative | Checks and confirms that a column does not contain any negative values. | Numeric |
isPositive | Checks and confirms that every value in a column is greater than 0. | Numeric |
isLessThan | Checks and confirms that in each row, the value of column A is less than the value of column B. | Numeric |
isLessThanOrEqualTo | Checks and confirms that in each row, the value of column A is less than or equal to the value of column B. | Numeric |
isGreaterThan | Checks and confirms that in each row, the value of column A is greater than the value of column B. | Numeric |
isGreaterThanOrEqualTo | Checks and confirms that in each row, the value of column A is greater than or equal to the value of column B. | Numeric |
isContainedIn | Checks and confirms that the value in a column is contained in a set of predefined values. | Numeric |
Issue Resolver Constraints | ||
Handle Duplicate Data | Choose the column with a unique key to address duplicate entries. This operation can be applied to one or more columns. If duplicates exist, those records will be filtered out, and the filtered records can be stored in the rejected records path if the user opts for it. | Numeric, string |
Replace Selective Data | Specify multiple values separated by commas to replace targeted values in the dataset. | Numeric, string |
Handle Missing Data | Manage null or empty values by choosing to either fill or remove records. If the user opts to remove records, the discarded entries will be stored in the rejected records path. | Numeric, string |
Handle Outliers | Input an integer value to address outliers. Records that meet the specified conditions either have their column values replaced with the mean, maximum, or minimum value, or are dropped, depending on user preferences. Rejected records will be stored in the rejected path. | Numeric |
Handle String Operations | Execute selected string operations on specific columns, including trim, rtrim, lpad, rpad, substring, and regexp_replace. | String |
Handle Case Sensitivity | Define whether to account for case sensitivity, including options for upper case, lower case, and proper case. | String |
Handle Data Against Master Table | Conduct a lookup against the master table, allowing users to provide either static values or master table column names. Entries that do not match the master table or static values will be filtered out and saved in the rejected records path. | Numeric, string |
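
Several of the constraints above are ratios that are easy to confuse: Completeness, Distinctness, Uniqueness, and UniqueValueRatio. The following Python sketch is illustrative only and is not the Lazsa Platform's implementation. It assumes that missing values are represented by `None`, that Distinctness and Uniqueness are measured against the total row count, and that entropy uses the natural logarithm. The function name `column_metrics` is hypothetical.

```python
import math
from collections import Counter


def column_metrics(values):
    """Compute ratio-style metrics for one column (None marks a missing value).

    Illustrative definitions only; the platform's exact formulas may differ.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    if not non_null:
        raise ValueError("column has no non-null values")

    counts = Counter(non_null)
    distinct = len(counts)                              # values seen at least once
    unique = sum(1 for c in counts.values() if c == 1)  # values seen exactly once

    # Shannon entropy of the value distribution (natural logarithm assumed).
    n = len(non_null)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())

    return {
        "completeness": n / total,               # non-null rows / all rows
        "count_distinct": distinct,
        "distinctness": distinct / total,        # distinct values / all rows
        "uniqueness": unique / total,            # unique values / all rows
        "unique_value_ratio": unique / distinct, # unique values / distinct values
        "entropy": entropy,
    }


metrics = column_metrics(["a", "b", "b", "c", None])
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In the sample column, two values ("a" and "c") occur exactly once and there are three distinct values, so the unique value ratio is 2/3 while uniqueness is 2/5. The single `None` among five rows gives a completeness of 4/5.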
What's next? Databricks Data Analyzer