Getting started with the local PySpark environment

Introduction

Why develop PySpark scripts on your local machine?

Prototyping your jobs in your local machine's IDE is the cheapest and quickest way to work. It has the following advantages:

  • Seamless process: The script you're writing and testing locally is exactly the script that will end up in the repository. You won't need to copy/paste and transform code from Sagemaker or Glue Studio into GitHub.
  • Speed: you can test your script on a data sample rather than the full data hosted in the pre-production S3 buckets. Your local environment is also quicker to start than Glue.
  • Cost: you are not consuming any AWS resources or compute power while prototyping your job.

Who is this tutorial for?

This tutorial is for people who need to prototype Glue jobs. If you just need to explore or analyse data on an ad-hoc basis, use a notebook environment like Sagemaker instead.

Pre-requisites

You should be familiar with Python, Git, and the use of an IDE like PyCharm or VS Code. This guide uses PyCharm.

Windows or Mac

This guide covers Windows and Mac setup. You'll need admin permissions.

Data Platform Access?

You don't need Data Platform access to use the local environment.

Set up instructions for Mac

Install pre-requisites

Java

Spark needs Java. Check whether you have Java installed by running java -version in your command prompt. If you do not have Java installed, go to https://adoptopenjdk.net, select OpenJDK 8 (LTS) with the HotSpot JVM, and choose the macOS x64 version.

Python

Check if you have Python3 installed by running python3 --version in your command prompt. If you do not have it, then go to https://www.python.org/downloads/ and install it.

Git

Check if you have Git installed by running git --version in your command prompt. If you do not have it, then get it from https://git-scm.com/book/en/v2/Getting-Started-Installing-Git.

PyCharm

Install the Community free version from https://www.jetbrains.com/pycharm/.

Additional Python packages

For the Data Platform code to run, you need at least the packages boto3, pytest, pyspark and pydeequ. You can install them now by typing in your command line: python3 -m pip install --user boto3 pytest pyspark pydeequ.
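To confirm the install worked, you can run a short check like the one below (a minimal sketch, assuming you run it with python3); each import should succeed:

# Minimal sanity check: each import should succeed if the install worked.
import boto3
import pytest
import pyspark
import pydeequ  # just importing is enough here; using pydeequ needs extra setup

print("boto3:", boto3.__version__)
print("pytest:", pytest.__version__)
print("pyspark:", pyspark.__version__)
print("pydeequ imported OK")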

Alternatively, these libraries can be installed inside PyCharm as shown below.

Glue libraries

We need these libraries to simulate the Glue environment. Go to the project GitHub page: https://github.com/awslabs/aws-glue-libs and download the code (Code > Download zip). Unzip it and leave it there for now; we will move it in a later step.

Create the Data Platform local environment using PyCharm

Install the project in PyCharm

Open PyCharm and clone the Data Platform project: https://github.com/LBHackney-IT/Data-Platform.

Alternatively, if you already have the project, pull the latest changes by running git pull in the PyCharm terminal window.

Set the Python interpreter for the project

Open Preferences > Project: Data_platform > Python Interpreter, click on the settings icon, then Add.

In this screen, set the Python interpreter to the version you've installed.

pycharm_python_interpreter.png

Add extra Python libraries

The following packages are necessary for the unit tests. You can install them in the window below, or alternatively use pip install.

  • gspread
  • freezegun
  • pytest-mock

pycharm_additional_python_packages.png

Import the Glue library as an external library

Paste the full awsglue folder you downloaded into External Libraries > Python X.X > site-packages, as shown on the following screen.

pycharm_external_libraries.png
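To check that PyCharm picked the folder up, a quick hedged test is to make sure this import resolves. Actually creating a GlueContext needs the Glue Java libraries, which are not part of this local setup, so we stop at the import:

# If the awsglue folder is in site-packages, this import should resolve.
from awsglue.context import GlueContext

print("awsglue is visible to the interpreter")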

Check your environment

Open the file scripts/jobs/env-context.py.

If you have installed all the dependencies, the imports at the top of this script should all be fine.

If some imports are underlined, try closing and reopening PyCharm. If you still have problems, try re-installing the packages you installed via pip from the Python Packages tab, inside the Data Platform environment.

When the script looks fine, go to the section 'Running the Levenshtein address matching script'.

Set up instructions for Windows

Install pre-requisites

Java

Check if you have Java by running the command java -version. We want version 8. If you have the wrong version, uninstall it from the Start menu. Install OpenJDK 8 from https://adoptopenjdk.net. In the installer, activate the option to set JAVA_HOME. After the install, the command java -version should return something.

Python

Install Python from the Microsoft Store: it will set the right environment variables. After the install, the command python --version should return something.

NB: on some Windows versions, installing Python from python.org also works fine. If you install it this way, the Python command is py rather than python.

Hadoop

This is specific to Windows and provides the Hadoop binaries Spark needs to work with local file paths. Download the full project from https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/. Unzip it and locate the two files we need:

  • winutils.exe
  • hadoop.dll

We will now place these files in different locations.

  1. Create a bin folder for winutils.exe and save it there, e.g. C:\winutils\bin or C:\users\sballey\winutils\bin.
  2. Create the environment variable HADOOP_HOME and set it to the path (omitting bin at the end), e.g. C:\users\sballey\winutils (see windows_hadoop_env_variable.png). You can verify this with the small check shown after this list.
  3. Copy hadoop.dll into C:\Windows\System32.
    NB: you can try without step 3 – it is critical for some users (it was for a Hackney Lenovo Thinkpad running Windows 10) but it appears some others don't need it. Everything you need should in theory be in winutils.exe.
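To confirm the variable and files are in place, here is a minimal check, assuming you used one of the example paths above:

# Prints HADOOP_HOME and checks winutils.exe is where Spark will look for it.
import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)
if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe found:", os.path.exists(winutils))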

Python libraries

For the data platform we need at least:

  • boto3
  • pytest
  • pyspark
  • pydeequ
  • gspread
  • freezegun
  • pytest-mock

If you have installed Python through the Microsoft Store, you must use the command python -m pip install …. If not, the normal command is py -m pip install ...

It is recommended to install these packages one by one, so that you can see each install complete:

python -m pip install boto3
python -m pip install pydeequ
python -m pip install pytest

The most problematic package is PySpark: python -m pip install pyspark

If this one fails and complains about a legacy install, it's because pip needs to build it as a wheel file. In this case, run the command python -m pip install wheel

and then again python -m pip install pyspark

NB: Alternatively, you can install Python libraries later in PyCharm at project level.
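Once PySpark installs successfully, a quick smoke test (a minimal sketch, nothing project-specific) is to start a local session and print its version:

# Starts a throwaway local Spark session; if this prints a version, PySpark works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("SmokeTest").getOrCreate()
print(spark.version)
spark.stop()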

Glue libraries

We need these libraries to simulate the Glue environment. Download them as a zip from https://github.com/awslabs/aws-glue-libs. Unzip them anywhere; you will move them later.

Create the Data Platform local environment using PyCharm

Install the project in PyCharm

Open PyCharm and clone the Data Platform project: https://github.com/LBHackney-IT/Data-Platform.

Alternatively, if you already have the project, pull the latest changes by running git pull in the PyCharm terminal window.

Set the Python interpreter for the project

In File > Settings, choose your project in the left panel and set the Python interpreter to the version you've installed. Tick the 'Inherit' box but not the other box, then close the Settings window.

Import the glue libraries

In the left panel, go to External Libraries > site-packages and paste the awsglue folder you unzipped in the prerequisites step.

Check your environment

At this point, if you open the scripts/jobs/env-context.py script, you shouldn’t see any library highlighted in red.

If you open File > Settings, you should see (after it loads) all the Python libraries you've installed that are available for this project. Click the + button to install more.

pycharm_win_python_packages.PNG

If you still have underlined imports, closing and reopening PyCharm may help, or even restarting your laptop.

When your installation is complete, try to run a script by following the next section!

Running a PySpark script (Mac and Windows)

Running the Levenshtein address matching script

This is the first script that was converted for the local PySpark environment and can be considered a model for other scripts.

Get sample data

You'll need some [sample data](https://drive.google.com/file/d/1a18qvDoJPGavNjfLNvnrY_jOmVTTWhY7/view?usp=sharing) (downloadable by Hackney only): some source addresses to match, and some target addresses from the address gazetteer to match against. These two datasets are in Parquet format and should be saved with a folder structure mirroring S3.

Example folder structure for source addresses:

sample_data_address_matching_source

Example folder structure for target addresses:

sample_data_address_matching_target
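Once the samples are saved, you can sanity-check one of them with a short snippet like this (a hedged sketch; the path reuses the placeholder from the parameters section below and should point to wherever you saved the data):

# Reads the target addresses sample and shows a few rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.parquet("<your local path to>/dataplatform-stg-raw-zone/unrestricted/addresses_api/dbo.hackney_address")
df.show(5)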

Set the script parameters

Open the file scripts/jobs/levenshtein_address_matching.py and check that the imports are fine. If you run the script now, you will get an error message because we haven't specified the parameters yet. Do this from the Configuration window, which you can open from the drop-down list just to the right of the green Run arrow.

pycharm_run_config

pycharm_edit_levenshtein_config

Here are the parameter values to add to the parameters text box:

--execution_mode=local
--addresses_data_path="<your local path to>\dataplatform-stg-raw-zone\unrestricted\addresses_api\dbo.hackney_address"
--source_data_path="<your local path to>\dataplatform-stg-refined-zone\housing-repairs\repairs-electrical-mechanical-fire\communal-lighting\with-cleaned-addresses"
--target_destination="<your local path to>\output\levenshtein"
--match_to_property_shell=forbid

Note for Windows users: when copying the path, do not include C: in your path, and replace all backslashes "\" with forward slashes "/":

--execution_mode=local
--addresses_data_path="<your local path to>/dataplatform-stg-raw-zone/unrestricted/addresses_api/dbo.hackney_address"
--source_data_path="<your local path to>/dataplatform-stg-refined-zone/housing-repairs/repairs-electrical-mechanical-fire/communal-lighting/with-cleaned-addresses"
--target_destination="<your local path to>/output/levenshtein"
--match_to_property_shell=forbid

You can now run the job, see its progress in the console, and find the result file in the output folder.
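For reference, parameters in the --key=value form above are ordinary command-line arguments. The sketch below is only an illustration of how such arguments can be read in Python; the actual script has its own parameter handling:

# Hedged illustration: reading --key=value style arguments with argparse.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--execution_mode")
parser.add_argument("--addresses_data_path")
parser.add_argument("--source_data_path")
parser.add_argument("--target_destination")
parser.add_argument("--match_to_property_shell")
args = parser.parse_args()
print(args.execution_mode, args.target_destination)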

A minimal test script

If you are blocked on the Levenshtein address matching script, here is a minimal script that you can save as test.py (do not commit it to the DP repository!). It creates a Spark data frame in memory and then another one from a CSV file. It is useful for troubleshooting basic issues.

Note for Windows users: the file paths inside the code need double backslashes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("LocalTest").getOrCreate()
print(spark)

def main():
    # Create a small data frame in memory
    address_data_raw = spark.createDataFrame(
        [(2022, "06", "01"), (2022, "6", "01"), (2022, "07", "01")])
    address_data_raw.show()

    # Create another data frame from a local CSV file
    address_data_raw = spark.read.csv(
        "<your local path \\ to a folder \\containing a csv file>")
    address_data_raw.show()

if __name__ == '__main__':
    main()
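To run it, use python test.py from a terminal (or py test.py if you installed Python from python.org on Windows), or run it directly from PyCharm.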

Running the unit tests

Running the unit tests of the Data Platform project is another way to test your configuration. To do this, navigate to the scripts/test folder and right-click on it. This gives you the option to modify the Run Configuration for this folder. Rename this configuration to 'Run all tests' and make sure the settings are as below.

pycharm_unit_tests_configuration
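Alternatively, you should be able to run the same tests from the PyCharm terminal with python -m pytest scripts/test (replace python with py if that is your Python command).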

Troubleshooting

These are solutions to assorted problems encountered with the local environment.

Issue: After a system upgrade, scripts wouldn’t run in PyCharm because python.exe was not found.

Environment: Windows

Solution: From the Start menu, launch 'Manage App Execution Aliases'. Make sure the aliases for Python are switched on, but not the one for the Python App Installer.

windows_restoring_aliases_for_python

Issue: A Python package fails to install because it needs compiling (the pip error mentions Microsoft Visual C++).

Environment: Windows

Solution: Download the Build Tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/ and install the first three tools.

Issue: Various PySpark functions can no longer be found.

Environment: Mac

Solution: Uninstall PySpark and reinstall it. The latest version should generally be fine, but if in doubt reinstall the version you were using previously.

Issue: Python packages cannot be imported.

Environment: Windows

Solution: First of all, open a command prompt and enter where python. If multiple paths are returned, there are multiple instances of Python installed, which will conflict with each other. Identify the instance you wish to use, and delete all traces of the version that is no longer required. You can verify this was successful by running where python again and seeing how many paths are returned.

This may solve the issue above, and you should now be able to import packages as intended, using your Python environment set up for the specific project you are working on.
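You can also double-check which interpreter PyCharm is actually using by running this small snippet from within the project:

# Prints the full path of the interpreter executing this script.
import sys

print(sys.executable)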

If the problem persists, then the next step is to try editing the Configuration for your local development environment by setting additional environment variables as in the screenshot below. This will instruct your PyCharm Python environment on where to look for Python.

Environment variables in PyCharm configuration