Wimsey 🔍
A fast, lightweight and easy data contract library.
Ideally, all data would be usable when you recieve it, but you probably already have figured that's not always the case. That's where data contracts come in.
A data contract is an expression of what should be true of some data, such as that it should 'only have columns x and y' or 'the values of column a should never exceed 1'. Wimsey is a library built to run these contracts on a dataframe during python runtime.
Wimsey is built on top of the awesome Narwhals and natively supports any dataframes that Narwhal's does. At the time of writing, that includes Polars, Pandas, Arrow, Dask, Rapids and Modin.
Run through
As an example, let's work through a simple example, and imagine we recieve "top-6-sleuths.csv" daily, over sftp. It's meant to look something like this:
first_name | last_name | rating | cases_solved |
---|---|---|---|
Peter | Wimsey | 9 | 11 |
Jane | Marple | 9 | 12 |
Father | Brown | 7 | 53 |
Hercule | Poirot | 10 | 33 |
Beatrice | Bradley | 8 | 66 |
It's meant to contain the top 5 sleuths, only sometimes, it has the wrong number of entries; othertimes first names are missing; and whilst ratings should be out of 10, sometimes they are over that. To make things worse every now and then, someone puts "lots" into cases_solved
meaning it's no longer a number, and that causes all kinds of trouble.
Writing Tests
We can convert those concerns we just mentioned into four tests to carry out on our dataset:
- The row count should be 5
first_name
should never be nullrating
should be be less than 10cases_solved
should be a number
We can test for a lot more than that, but that works for our example. Our first move is to write this out as a "contract". This can be a yaml or json file, or alternatively, we can code it directly into python.
See Possible Tests for a full catalogue of runnable tests and their configurations.
Executing Tests
Now that we've written out tests, we just need to actually run them on the actual data. There's two functions wimsey
gives you to carry out checks: validate
and test
. These both carry out checks in the same way, but behave slightly differently based on the results.
test
will return aFinalResult
type of object. It's a dataclasses containing asuccess
boolean, alongside further details on the individual tests in aresults
lists.validate
will run the checks and then just return the initial dataframe assuming everything passed. If any tests failed, it'll stop execution and throw aDataValidationException
.
These are designed to cover a couple different use cases, test
will provide more details if you want to dig into problems in a dataset, whilst validate
is helpful if you just want to use wimsey
as a "guard" to catch bad data from being processed.
We'll cover test
first, it's called the same regardless of what type your dataframe is:
Wimsey uses fsspec under the hood, so configs can be from any filesystem supported by fsspec (such as S3, SSH, Azure, Google Cloud etc) - use the fsspec prefix and pass in the appropriate storage options using test
's storage_options
keyword. See fsspec documentation for more details on this.
Validate, will run tests in the exact same way as test
, but simply raises an error if data fails expectations. This, in conjunction with Wimsey's compatibility with multiple dataframe types can make it a convenient tool for providing guarantees in a data pipeline.
import pandas as pd
from wimsey import validate
from settings import sleuth_storage_options
top_sleuth: str = (
pd.read_csv(
"sshfs://sleuthwatch/top-5-sleuths.csv",
storage_options=sleuth_storage_options,
)
.pipe(validate, "sleuth-checks.json") # <- this is the wimsey bit
.assign(name=lambda df: df["first_name"] + df["last_name"])
.sort_values("rating", ascending=False)
["name"][0]
)
print(f"{top_sleuth} is the best sleuth!")
import polars as pl
from wimsey import validate
from settings import sleuth_storage_options
top_sleuth: str = (
pl.read_csv(
"sshfs://sleuthwatch/top-5-sleuths.csv",
storage_options=storage_options,
)
.pipe(validate, "sleuth-checks.json") # <- this is the wimsey bit
.with_columns(name=pl.col("first_name") + " " + pl.col("last_name"))
.sort("rating", descending=True)
.select("name")
.to_series()[0]
)
print(f"{top_sleuth} is the best sleuth!")
import dask.dataframe as dd
from wimsey import validate
from settings import sleuth_storage_options
top_sleuth: str = (
dd.read_csv(
"sshfs://sleuthwatch/top-5-sleuths.csv",
storage_options=sleuth_storage_options,
)
.pipe(validate, "sleuth-checks.json") # <- this is the wimsey bit
.assign(name=lambda df: df["first_name"] + " " + df["last_name"])
.sort_values("rating", ascending=False)
["name"]
.compute()[0]
)
print(f"{top_sleuth} is the best sleuth!")
from pyarrow import compute, csv
from wimsey import validate
from settings import download_sleuth_file
download_sleuth_file(to="local-5-sleuths.csv")
df = csv.read_csv("local-5-sleuths.csv")
validate(df, "sleuth-checks.json") # <- this is the wimsey bit
name = compute.binary_join_element_wise(df["first_name"], df["last_name"], " ")
df = df.append("name", name).sort_by("rating")
top_sleuth = str(df["name"][-1])
print(f"{top_sleuth} is the best sleuth!")
And that's it, to keep things simple validate
and test
are the only public-intended functions in Wimsey, aside from test creation, which is covered further in the possible tests section.