Introduction to the CrossClient¶
In this notebook you learn how to connect to the CrossPlatform using the CrossClient, retrieve a contract, and validate your data against that contract.
Packages and data¶
from crosscontract import CrossClient, CrossContract
import pandas as pd
Determine user¶
Here we assume that you have a .env file that stores your credentials, and we read them from there.
from dotenv import load_dotenv
import os
load_dotenv(".env")
username = os.getenv("CROSSUSER")
password = os.getenv("PASSWORD")
# we explicitly set the domain here as we want to connect to our staging instance
domain = "https://backstage.sweetcross.link"
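Because `os.getenv` returns `None` for a variable that is not set, a misspelled key in the .env file only shows up later as a failed login. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this illustration and is not part of crosscontract, and the variable names in the demo are made up:

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; not part of crosscontract.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Demo with made-up variable names so the sketch runs on its own.
os.environ["EXAMPLE_SETTING"] = "some-value"
os.environ.pop("EXAMPLE_MISSING_SETTING", None)

print(require_env("EXAMPLE_SETTING"))
try:
    require_env("EXAMPLE_MISSING_SETTING")
except RuntimeError as err:
    print(err)
```

In the notebook, the same helper could wrap the lookups of `CROSSUSER` and `PASSWORD` after `load_dotenv(".env")` has run.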
Connect to the CrossPlatform¶
To connect to the platform using CrossClient you need a registered user. To create the client, pass it the username and password.
my_client = CrossClient(username=username, password=password, base_url=domain)
That's it. The platform knows who you are and how you want to log in. So let's get a contract and use it for data validation.
Getting an overview¶
First we want an overview of which contracts are on the CrossPlatform and
what they contain. For this we use the method client.contracts.overview, which
returns a pandas DataFrame with the metadata of each contract (as well as its
status).
We call it directly on the client created above; the connection is closed explicitly at the end of this notebook.
df_overview = my_client.contracts.overview()
df_overview[["name", "description"]].head(10)
| | name | description |
|---|---|---|
| 0 | dim_tech_hydrogen | List of technologies used to produce hydrogen |
| 1 | scenass_hdd | Heating Degree Days (HDD) by climate scenario ... |
| 2 | scenass_households | Household data used as assumptions for scenari... |
| 3 | scenass_import_prices | Import prices by fuel type, year, and country ... |
| 4 | dim_tech_liquids | List of technologies used to produce liquid fuels |
| 5 | result_district_heat_energy_production | Useful energy production of distric heat as su... |
| 6 | result_electricity_consumption | Electricity consumption as submitted from scen... |
| 7 | result_elec_cons_typical_day | Electricity consumption as submitted from scen... |
| 8 | dim_tech_methane | List of technologies used to produce methane |
| 9 | result_electricity_supply | Electricity supply as submitted from scenario ... |
Contract creation¶
Suppose we want to add our test contract given as:
test_contract = {
    "name": "test_contract",
    "title": "Test Contract",
    "description": "A simple test contract",
    "tableschema": {
        "primaryKey": ["year", "country"],
        "fields": [
            {
                "name": "value",
                "type": "number",
                "constraints": {
                    "required": True,
                    "minimum": 0.0,
                    "maximum": 100.0,
                    "unique": True,
                },
            },
            {
                "name": "year",
                "type": "integer",
                "constraints": {"required": True, "minimum": 2000, "maximum": 2025},
            },
            {
                "name": "country",
                "type": "string",
                "constraints": {"required": False, "maxLength": 6, "minLength": 2},
            },
        ],
    },
}
To add the contract to the platform, we create the CrossContract and use the client.contracts.create
method. This creates a contract in Draft mode, in which we are not allowed to
submit data. Therefore we activate the contract directly, which puts it into the Active state
and allows data submission.
If you run the cell below, you will most likely get a ConflictError because the contract
already exists. Alternatively you get a PermissionDeniedError if you are not allowed to
create contracts. We can catch these errors using the usual try/except logic:
from crosscontract.crossclient.exceptions import ConflictError, PermissionDeniedError
contract = CrossContract(**test_contract)
try:
    created_contract = my_client.contracts.create(contract, activate=True)
except (ConflictError, PermissionDeniedError) as e:
    # catch the expected errors here
    print(f"Expected error creating contract: {e}")
except Exception as e:
    # but raise any unexpected errors
    raise e
Getting a contract¶
Let's now get our test_contract.
To get the contract we use client.contracts.get. If the contract is found, we
will get back a ContractResource. A ContractResource is a CrossContract that
lives on the CrossPlatform. As the contract is saved on the CrossPlatform the contract
is read-only and also provides some additional information like the status of
the contract on the platform.
The ContractResource is the central object to work with remote contracts and
allows you to get, add, and delete data for a contract.
contract_name = test_contract["name"]
my_contract_resource = my_client.contracts.get(name=contract_name)
print(f"Retrieved contract resource: {my_contract_resource}")
Retrieved contract resource: ContractResource(name=test_contract, status=Active)
The ContractResource contains the contract, but usually we do not deal with
the contract directly. Instead we validate our local data against it, or add data to
and get data from the platform.
Validate local data¶
Validation of data follows exactly the same steps as in the CrossContract case.
We simply call the validate_dataframe method with our data given as a pandas DataFrame.
df_test = pd.DataFrame({
    "year": [2020, 2021, 2022],
    "country": ["US", "CA", "MX"],
    "value": [50.5, 60.0, 70.2],
})
my_contract_resource.validate_dataframe(df_test)
If nothing happens, the data are locally valid. In the case of validation errors, however,
validate_dataframe raises a ValidationError. To get more information about
which data violate the contract, we can catch the error and use its to_pandas method
to get a DataFrame with detailed error messages by row:
from crosscontract.crossclient.exceptions import ValidationError
df_fail = pd.DataFrame({
    "year": [1820, 2021, 2022],
    "country": ["US", "CA", "ThisCountryNameIsWayTooLong"],
    "value": [50.5, 100000, 70.2],
})
try:
    my_contract_resource.validate_dataframe(df_fail)
except ValidationError as e:
    df_errors = e.to_pandas()
df_errors
| | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | Column | year | greater_than_or_equal_to(2000) | 0 | 1820 | 0 |
| 1 | Column | value | less_than_or_equal_to(100.0) | 1 | 100000.0 | 1 |
| 2 | Column | country | str_length(2, 6) | 0 | ThisCountryNameIsWayTooLong | 2 |
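Each row of the error table corresponds to one constraint from the contract's tableschema. To make that mapping concrete, here is a minimal pure-Python sketch that applies the same minimum, maximum, and length constraints row by row. It is not how crosscontract validates internally; `check_row` is a hypothetical function written only for this illustration:

```python
def check_row(row, fields):
    """Return human-readable constraint violations for a single row."""
    problems = []
    for field in fields:
        name = field["name"]
        rules = field.get("constraints", {})
        value = row.get(name)
        if value is None:
            # A missing value only matters if the field is required.
            if rules.get("required"):
                problems.append(f"{name}: required value is missing")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            problems.append(f"{name}: {value} < minimum {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            problems.append(f"{name}: {value} > maximum {rules['maximum']}")
        if "minLength" in rules and len(value) < rules["minLength"]:
            problems.append(f"{name}: shorter than {rules['minLength']} characters")
        if "maxLength" in rules and len(value) > rules["maxLength"]:
            problems.append(f"{name}: longer than {rules['maxLength']} characters")
    return problems


# Field constraints copied from the test contract above.
fields = [
    {"name": "value", "constraints": {"required": True, "minimum": 0.0, "maximum": 100.0}},
    {"name": "year", "constraints": {"required": True, "minimum": 2000, "maximum": 2025}},
    {"name": "country", "constraints": {"required": False, "minLength": 2, "maxLength": 6}},
]

# The same rows as in df_fail.
rows = [
    {"year": 1820, "country": "US", "value": 50.5},
    {"year": 2021, "country": "CA", "value": 100000},
    {"year": 2022, "country": "ThisCountryNameIsWayTooLong", "value": 70.2},
]

for index, row in enumerate(rows):
    print(index, check_row(row, fields))
```

Each of the three rows produces exactly one violation, matching the three rows of the error table.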
There is one difference between validation using the CrossContract and the ContractResource: CrossContract raises a SchemaValidationError, whereas ContractResource raises a ValidationError. The two behave the same in terms of error details, but the ValidationError unifies validation errors that occur locally with those that occur on the CrossPlatform. More on this below.
Adding data¶
To add data, we again use our ContractResource and its add_data method that does two things:
- Validate the data locally
- Submit the data to the server
df_test = pd.DataFrame({
    "year": [2020, 2021, 2022],
    "country": ["US", "CA", "MX"],
    "value": [50.5, 60.0, 70.2],
})
my_contract_resource.add_data(df_test)
What happens if we submit the data again? The contract has a primary key constraint that restricts the combination of year and country to be unique:
# Local validation succeeds
my_contract_resource.validate_dataframe(df_test)
# but adding the same data again raises a ValidationError due to unique constraint violation
try:
    my_contract_resource.add_data(df_test)
except ValidationError as e:
    df_errors = e.to_pandas()
    print(e)
df_errors
ValidationError (422): Data validation against contract 'test_contract' failed. To get detailed error information, catch the ValidationError and use its .to_list() or .to_pandas() methods.
| | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | DataFrameSchema | year, country | PrimaryKeyError: Primary key ['year', 'country... | 0 | [2020, US] | 0 |
| 1 | DataFrameSchema | year, country | PrimaryKeyError: Primary key ['year', 'country... | 0 | [2021, CA] | 1 |
| 2 | DataFrameSchema | year, country | PrimaryKeyError: Primary key ['year', 'country... | 0 | [2022, MX] | 2 |
So the local validation passes, but the server raises a validation error. This illustrates the two concepts of validity in the context of the CrossClient:
- Local validity: The data are consistent in themselves. We do not check whether our local data are consistent with the data on the server.
- Remote validity: When we submit data, the CrossPlatform checks the new data together with the data already stored on the platform. In the case of foreign key references, the platform also tries to resolve and check these references.
The difference between local and remote validity mostly matters in two cases: (a) uniqueness constraints, when a local data point already exists on the server, and (b) foreign key constraints, when the data contain a reference to data in another contract and the respective value is not found in that other contract.
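The uniqueness case can be reproduced without a server. The following sketch checks the primary key once on the new rows alone (the local view) and once on the new rows combined with the rows already stored (the remote view). It only illustrates the idea and is not the platform's implementation; `duplicate_keys` and `stored_rows` are hypothetical names:

```python
def duplicate_keys(rows, key):
    """Return primary-key tuples that occur more than once in rows."""
    seen = set()
    duplicates = []
    for row in rows:
        pk = tuple(row[column] for column in key)
        if pk in seen and pk not in duplicates:
            duplicates.append(pk)
        seen.add(pk)
    return duplicates


key = ["year", "country"]
new_rows = [
    {"year": 2020, "country": "US", "value": 50.5},
    {"year": 2021, "country": "CA", "value": 60.0},
    {"year": 2022, "country": "MX", "value": 70.2},
]
# Stand-in for the rows stored on the platform after the first submission.
stored_rows = [dict(row) for row in new_rows]

local_conflicts = duplicate_keys(new_rows, key)
remote_conflicts = duplicate_keys(stored_rows + new_rows, key)

print("local conflicts: ", local_conflicts)
print("remote conflicts:", remote_conflicts)
```

The local check finds no conflicts, while the combined check reports all three key combinations, which mirrors the PrimaryKeyError rows returned by the server.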
Getting data¶
To get data back from the platform, we use the ContractResource and its get_data
method. The method allows us to impose a simple filter on the data.
Currently, however, filtering is restricted to scalar string values, i.e., filtering on numerical data or using lists is not possible at the moment:
df_data = my_contract_resource.get_data()
# or for illustration purposes with a filter
my_contract_resource.get_data(filters={"country": "US"})
| | value | year | country |
|---|---|---|---|
| 0 | 50.5 | 2020 | US |
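Since only scalar string filters are supported, it can help to check a filter dictionary before sending it. A minimal sketch of such a guard follows; `check_filters` is a hypothetical helper, not part of crosscontract, and it encodes only the restriction described above:

```python
def check_filters(filters):
    """Raise TypeError unless every filter value is a single string.

    Hypothetical helper for illustration; not part of crosscontract.
    """
    for column, value in filters.items():
        if not isinstance(value, str):
            raise TypeError(
                f"Filter on {column!r} must be a single string, "
                f"got {type(value).__name__}"
            )
    return filters


# A scalar string filter passes through unchanged.
print(check_filters({"country": "US"}))

# Numerical values and lists are rejected before any request is made.
for bad_filters in ({"year": 2020}, {"country": ["US", "CA"]}):
    try:
        check_filters(bad_filters)
    except TypeError as err:
        print(err)
```

For conditions that cannot be expressed this way, one option is to fetch the data without a filter and select rows locally with ordinary pandas indexing, for example `df_data[df_data["year"] == 2020]`.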
Deleting the contract¶
Deleting the contract requires multiple steps:
- Change the contract status to "retired"
- Drop the data table associated with the contract. That deletes all data and is only possible for contracts that are in state retired.
- Use the client to delete the contract.
my_contract_resource.change_status("retired")
my_contract_resource.drop_data()
my_client.contracts.delete(name=contract_name)
Close the client¶
After using the client, you should close the connection properly:
my_client.close()
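If a cell raises an error before this line is reached, the connection stays open. For any object with a close() method, the standard library's `contextlib.closing` guarantees the call even when an exception occurs. The sketch below uses a stand-in class rather than the real CrossClient so that it runs on its own; `DemoClient` is a hypothetical name:

```python
from contextlib import closing


class DemoClient:
    """Stand-in with a close() method; this is not the real CrossClient."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


demo_client = DemoClient()
try:
    with closing(demo_client) as client:
        # Simulate a failure while working with the client.
        raise ValueError("something went wrong while working with the client")
except ValueError:
    pass

# close() ran on leaving the with block, despite the exception.
print(demo_client.closed)
```

The same pattern should apply to `my_client`, assuming only that it exposes the close() method used above.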