Tutorial: CrossContracts
This tutorial illustrates the basic usage of CrossContract: how to create a contract and how to validate data against it.
Packages and options
import pandas as pd
from crosscontract import CrossContract, SchemaValidationError
Creating the contract
Contracts can be created from dictionaries or from YAML or JSON files. Here we
keep the notebook self-contained and use a dictionary. To load a contract from
a file instead, use the from_file method.
contract_data = {
    "name": "contract_gdp",
    "title": "Gross Domestic Product (GDP)",
    "description": "Gross Domestic Product (GDP) by country and years.\n",
    "tableschema": {
        "fields": [
            {
                "name": "country",
                "type": "string",
                "title": "Country Name",
                "description": "Name of the country",
                "constraints": {
                    "required": True,
                    "maxLength": 100
                }
            },
            {
                "name": "year",
                "type": "integer",
                "title": "Year",
                "description": "Year of the GDP data",
                "constraints": {
                    "required": True,
                    "minimum": 2000,
                    "maximum": 2050
                }
            },
            {
                "name": "gdp",
                "type": "number",
                "title": "GDP Value",
                "description": "Gross Domestic Product value in USD",
                "constraints": {
                    "required": True,
                    "minimum": 0
                }
            }
        ]
    }
}
gdp_contract = CrossContract(**contract_data)
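For the file-based route, the same dictionary can be stored as JSON first. The sketch below uses only the standard library to write and re-read such a file. The file name, the temporary directory, and the trimmed-down `small_contract` dictionary are hypothetical, and the exact signature of `from_file` is an assumption, which is why that call is left commented out.

```python
import json
import tempfile
from pathlib import Path

# A trimmed-down contract with the same shape as contract_data above
# (hypothetical name, used only for this illustration).
small_contract = {
    "name": "contract_gdp",
    "title": "Gross Domestic Product (GDP)",
    "tableschema": {
        "fields": [
            {
                "name": "country",
                "type": "string",
                "constraints": {"required": True, "maxLength": 100},
            }
        ]
    },
}

# Hypothetical file location inside a fresh temporary directory.
contract_path = Path(tempfile.mkdtemp()) / "contract_gdp.json"
contract_path.write_text(json.dumps(small_contract, indent=2), encoding="utf-8")

# Reading the file back yields the same dictionary.
loaded = json.loads(contract_path.read_text(encoding="utf-8"))
print(loaded == small_contract)

# Assumption: from_file accepts the path of such a file.
# gdp_contract = CrossContract.from_file(contract_path)
```

Keeping the contract in a file makes it easy to share between notebooks, while the dictionary form used in this tutorial avoids any dependency on the file system.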
The contract object can be used to inspect the contract. The fields and their constraints are available through the tableschema property:
print(f"Name: {gdp_contract.name}")
print(f"Title: {gdp_contract.title}")
print(f"Description: {gdp_contract.description}")
print("\nTable Schema Fields:")
for field in gdp_contract.tableschema.field_iterator():
    print("------")
    print(f"Field Name: {field.name}, Type: {field.type}, Title: {field.title}")
    print(f"Constraints: {field.constraints}")
Name: contract_gdp
Title: Gross Domestic Product (GDP)
Description: Gross Domestic Product (GDP) by country and years.

Table Schema Fields:
------
Field Name: country, Type: string, Title: Country Name
Constraints: required=True unique=False pattern=None minLength=None maxLength=100 enum=None
------
Field Name: year, Type: integer, Title: Year
Constraints: required=True unique=False minimum=2000 maximum=2050 enum=None
------
Field Name: gdp, Type: number, Title: GDP Value
Constraints: required=True unique=False minimum=0.0 maximum=None enum=None
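The constraints printed above can be read as simple per-value rules with inclusive bounds. The following plain-Python sketch spells them out by hand. It is not how CrossContract implements validation, and the names `RULES` and `violated` are hypothetical helpers introduced only for this illustration.

```python
# Hand-written rules mirroring the constraints of the three fields above.
# Illustration only: CrossContract performs these checks itself.
RULES = {
    "country": {"maxLength": 100},
    "year": {"minimum": 2000, "maximum": 2050},
    "gdp": {"minimum": 0},
}


def violated(field, value):
    """Return the names of the constraints that `value` violates."""
    rules = RULES[field]
    failed = []
    if "maxLength" in rules and len(value) > rules["maxLength"]:
        failed.append("maxLength")
    if "minimum" in rules and value < rules["minimum"]:
        failed.append("minimum")
    if "maximum" in rules and value > rules["maximum"]:
        failed.append("maximum")
    return failed


print(violated("year", 2020))                 # []
print(violated("year", 2072))                 # ['maximum']
print(violated("gdp", -500))                  # ['minimum']
print(violated("country", "CountryA" * 100))  # ['maxLength']
```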
Data validation
With the contract in place, we can validate data against it. We create a pandas
DataFrame and pass it to the validate_dataframe() method, which is part of the
contract's table schema (tableschema).
df_valid = pd.DataFrame({
    "country": ["CountryA", "CountryB", "CountryC"],
    "year": [2020, 2021, 2022],
    "gdp": [500, 600, 700]
})
df_invalid = pd.DataFrame({
    "country": ["CountryA"*100, "CountryB", "CountryC"],
    "year": [2020, 2021, 2072],
    "gdp": [500, -500, 700]
})
In case of valid data, we simply get an empty response.
gdp_contract.tableschema.validate_dataframe(df=df_valid)
In case of invalid data, a SchemaValidationError is raised. The error can be
inspected more closely with its to_list and to_pandas methods: to_list returns
a list of dictionaries and to_pandas a pandas DataFrame. To find out what went
wrong, we catch the error and inspect the DataFrame:
try:
    gdp_contract.tableschema.validate_dataframe(df=df_invalid)
except SchemaValidationError as e:
    print("Validation errors:")
    print(e.to_pandas())
Validation errors:
   schema_context   column                          check  check_number  \
0          Column      gdp  greater_than_or_equal_to(0.0)             0
1          Column     year    less_than_or_equal_to(2050)             1
2          Column  country          str_length(None, 100)             0

                                        failure_case  index
0                                             -500.0      1
1                                               2072      2
2  CountryACountryACountryACountryACountryACountr...      0
The index and column columns identify the row and column in which each error occurred. The check column lists the violated constraint, and failure_case shows the value that violates it. In our example, we have three errors:
- First row (index = 0): Country string is too long
- Second row (index = 1): GDP is negative
- Third row (index = 2): Year is too high
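When there are many errors, it can help to group them by row. The sketch below works on a list of dictionaries typed out by hand from the table above. That to_list returns dictionaries with exactly these keys is an assumption, so adapt the key names to the actual output.

```python
from collections import defaultdict

# Records copied by hand from the error table above. Assumption: the
# dictionaries returned by to_list() use the same keys as the columns
# of the to_pandas() output.
errors = [
    {"column": "gdp", "check": "greater_than_or_equal_to(0.0)",
     "failure_case": -500.0, "index": 1},
    {"column": "year", "check": "less_than_or_equal_to(2050)",
     "failure_case": 2072, "index": 2},
    {"column": "country", "check": "str_length(None, 100)",
     "failure_case": "CountryA" * 100, "index": 0},
]

# Collect a short "column: check" message for every affected row.
by_row = defaultdict(list)
for err in errors:
    by_row[err["index"]].append(f'{err["column"]}: {err["check"]}')

# Report the rows in ascending order.
for row in sorted(by_row):
    print(row, by_row[row])
```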