Data Platform Training Module - Transforming data using PySpark and writing to the refined zone

In this module you will manipulate the data that you loaded into the Data Platform raw zone in module 1. You will save the result of your work into the refined zone of the data platform.

The refined zone is for data that has been enhanced or enriched and is ‘ready’ for analysis.

You will use PySpark to create and test your script, working in a Jupyter Notebook environment running in Amazon SageMaker.

Jupyter notebooks are a convenient way to test code locally as Glue Studio is not so straightforward for code development (despite what the name implies!). You can read more about Jupyter notebooks here.

This module is all about prototyping within a sandbox environment, and therefore we won’t be pushing any code to the Data Platform GitHub repository.

Overview

You’ll write a few lines of PySpark code that will read data from S3, join the two Covid datasets together, and then write the newly created dataset back to S3 in the refined zone.

PySpark is an interface that uses the Python programming language to access the Apache Spark cluster-computing framework; therefore the syntax will seem a bit different if you have used Python packages such as Pandas or NumPy before.
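As a small illustration of that difference, here is a self-contained sketch contrasting the two styles (the DataFrame and its columns are made up for the example):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny made-up DataFrame, just to illustrate the syntax
df = spark.createDataFrame(
    [("United Kingdom", 10), ("France", 0)],
    ["location", "new_cases"],
)

# Pandas would use boolean indexing: df[df["new_cases"] > 0][["location", "new_cases"]]
# PySpark expresses the same idea through DataFrame methods:
df.filter(F.col("new_cases") > 0).select("location", "new_cases").show()
```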

Prerequisites

Please ensure that both the 'Getting set up' module and the 'Ingesting data' module are completed before proceeding. You will need the names of the tables that were created from completing the 'Ingesting data' module.

Step-by-step instructions

1. Developing the job script in a Jupyter notebook

  1. Launch the sandbox notebook within SageMaker by following this guide.

  2. Open the notebook template called using-pyspark.ipynb.

    • Familiarise yourself with the steps (variables, reading, transforming and writing data).
    • Duplicate the template, saving it in the sandbox folder, and rename it (for example, append your name to the file name).
  3. Prototype your own ETL (Extract, Transform, Load) job to transform the data in PySpark. The PySpark By Examples website is a handy resource when writing your code.

    Please try the following things on your data (a PySpark sketch of these steps follows this list):

    • Load the Covid Locations dataset into the notebook.
      • You will first need to set the variables in the first cell; see the "TODO" comments. The datasets are the same ones you created in the last step of the Module 1 training.
    • Print out the schema and the first few rows of data to check it has imported correctly.
    • Change a column name: rename location to country.
    • Load the Covid Vaccinations dataset into the notebook and repeat the two steps above.
    • Cast a column to a different type, e.g. a string date to a date type.
    • Join the two tables into a single output using the country column.
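
A minimal sketch of those transformations, assuming the template's earlier cells have already read the two datasets into DataFrames called locations_df and vaccinations_df (these names, and the exact column names, are illustrative and may differ slightly in your tables):

```python
from pyspark.sql import functions as F

# Assumes locations_df and vaccinations_df were read from the raw zone by the
# notebook template's earlier cells.

# Check the locations data has imported correctly
locations_df.printSchema()
locations_df.show(5)

# Rename the location column to country
locations_df = locations_df.withColumnRenamed("location", "country")

# Repeat for the vaccinations dataset
vaccinations_df.printSchema()
vaccinations_df.show(5)
vaccinations_df = vaccinations_df.withColumnRenamed("location", "country")

# Cast a string date column to a proper date type
# (assumes a column called "date" stored as a string in yyyy-MM-dd format)
vaccinations_df = vaccinations_df.withColumn("date", F.to_date("date", "yyyy-MM-dd"))

# Join the two tables into a single output on the country column
joined_df = locations_df.join(vaccinations_df, on="country", how="inner")
joined_df.show(5)
```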
  4. Write to the refined zone. When you are confident that your transformation is working, you can use the last cell of the notebook to write the data into the Data Platform. (Please note that a notebook will only ever write to the pre-production environment, never to the production environment, as it is meant for prototyping.)
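
The template's last cell handles this write for you; under the hood, a plain PySpark version of the step looks roughly like the sketch below (the bucket name and path are placeholders, not the real refined zone location, which comes from the variables set in the first cell):

```python
# Placeholder path: the real refined zone bucket and prefix are set by the
# variables in the notebook's first cell.
output_path = "s3://dataplatform-sandbox-refined-zone/sandbox/covid_locations_vaccinations/"

# Write the joined dataset as Parquet, overwriting any previous prototype output
joined_df.write.mode("overwrite").parquet(output_path)
```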

  5. Find and view the data in the refined zone bucket within S3. If the previous step has worked correctly, you should be able to see your newly created Parquet files in the refined zone S3 bucket, within the sandbox department folder. We won't try to visualise or query the data in this module.

🎉 Congratulations! You have completed this module!

In the next module, you will turn your script into a Glue job and deploy it into the Data Platform.