Using pytest for verifying PySpark transformations - ADR 010
Context
The Data Platform team has been writing Apache Spark jobs using PySpark to transform data within the platform.
Examples include:
- Address matching
- Address cleaning
- Repairs sheets data cleaning
These jobs have no automated tests, so debugging them has involved slow feedback loops and running against real data within the platform.
By introducing testing practices, frameworks, and tools, we hope to:
- Improve the speed at which PySpark scripts can be developed
- Provide documentation for each script, with examples of the data it expects and the results it outputs
- Increase the proportion of defects found before they reach the staging environment
Decision
We will:
- Use a Python testing framework, pytest
- Use the same Docker container we use for the Jupyter Notebook to run the tests, as it replicates the AWS Glue Spark environment locally
- Integrate that framework with Apache Spark, and provide example test code (see the sketch after this list)
- Create documentation and guidance around how to productively test PySpark scripts
- Run the suite of Python tests as part of the deployment pipeline, and block deployment to staging when any test fails
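To illustrate the kind of example test code we intend to provide, below is a minimal sketch of a pytest test for a PySpark transformation. The `address_cleaning` module, its `clean_addresses` function, and the column names are hypothetical placeholders rather than the team's actual code; the fixture simply creates a local SparkSession standing in for the AWS Glue Spark environment.

```python
# test_address_cleaning.py - a minimal sketch; the transformation module,
# function name, and column names below are hypothetical.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession standing in for the AWS Glue Spark environment.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pyspark-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_clean_addresses_normalises_postcodes(spark):
    # Example input documenting the data the script expects.
    input_df = spark.createDataFrame(
        [("1 High Street", " sw1a 1aa ")],
        ["address_line_1", "postcode"],
    )

    # Hypothetical transformation under test.
    from address_cleaning import clean_addresses
    output_df = clean_addresses(input_df)

    # Example output documenting the results the script produces.
    result = output_df.collect()
    assert result[0]["postcode"] == "SW1A 1AA"
```

Tests written in this style also serve as the per-script documentation mentioned above, since each test pairs example input data with the expected output.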
Consequences
Building and maintaining PySpark scripts should become easier and faster as a result.
Writing PySpark scripts will require some additional learning for anyone who has not used unit testing practices before.