Tips on writing an API Ingestion script for AWS Lambda
Prerequisites
- Have the secret/API key stored in both the Production and Pre-Production environments, following the naming convention "/[Department Name]/[secret name]", for example /customer-services/vonage
Introduction
This article covers tips on how to write and format your Python script so that it can be used in a Lambda on the Data Platform.
Writing the Code for the Data Platform
You can use whatever tools you want to write the Python script. However, if you intend for the script to run in AWS Lambda on the Data Platform, there are a few rules to keep in mind.
Desired End Output
- A .py file
- Code which can be run frequently. Lambdas have a timeout limit of 15 minutes, so the code must complete within this time. It's therefore likely you'll want your code to run daily.
- Outputs the RAW CONTENT of the API call. No transformations
- Outputs the content to {folder}/import_year={year}/import_month={month}/import_day={day}/import_date={date}/
- The dates refer to the date of import, not the date of the data
Basic Rules
- The Python script needs to be able to determine, on its own, which date range to pull data for (see the sketch after this list)
- The Lambda will run the lambda_handler function. This means that in place of a __main__() entry point you will have def lambda_handler(event, lambda_context):. To run the script locally, you then just add a lambda_handler("","") line
- If you have variables which you want the Terraform or Glue job to handle, you can pass them through as environment variables instead. For example, os.environ["TARGET_S3_BUCKET_NAME"] = "landing-zone" will set the environment variable TARGET_S3_BUCKET_NAME to landing-zone, and s3_bucket = os.getenv("TARGET_S3_BUCKET_NAME") will then read that value back into a variable in your script
- Have no secrets (e.g. API keys) hard coded into the script
- Try to use as few external packages as possible. Pandas is around 50mb after you install all of its dependencies, which is very expensive for a Lambda
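To illustrate these rules, here is a minimal sketch of a handler that works out its own date range (the previous day, assuming a daily schedule) and reads its configuration from environment variables. The fetch_from_api function and the exact date logic are hypothetical placeholders for your own API call, not a prescribed implementation.
import os
from datetime import date, timedelta

def fetch_from_api(start_date, end_date):
    # Hypothetical placeholder - replace with your actual API call
    print(f"Fetching data from {start_date} to {end_date}")
    return "{}"

def lambda_handler(event, lambda_context):
    # The script decides its own date range: yesterday's data, assuming a daily run
    end_date = date.today()
    start_date = end_date - timedelta(days=1)

    # Configuration comes from environment variables set by Terraform, never hard coded
    s3_bucket = os.getenv("TARGET_S3_BUCKET_NAME")

    raw_content = fetch_from_api(start_date, end_date)
    # ... output raw_content to s3_bucket here (see "Integrating S3" below)

# Local trigger: set the environment variable yourself when testing
os.environ["TARGET_S3_BUCKET_NAME"] = "landing-zone"
lambda_handler("", "")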
Some tips
- If you are coding using .py files, the way I have found to structure the script is to have two separate scripts: one main.py file which has the code to go into the Lambda, and another script which triggers it and sets the environment variables
- In Jupyter (so Google Colab), instead of making multiple scripts, use multiple cells. This is what I would suggest:
- An import cell
- A functions cell
- A def lambda_handler(event, lambda_context): cell, which acts as your main() and interacts with the functions cell
- A cell with lambda_handler("","") and environment variable setters to trigger the above. This cell will not be copied into your .py script when it comes to making it in the Data Platform, but it is what triggers your lambda_handler while you develop
Example Jupyter Script Format
# Import cell
import os
from dotenv import load_dotenv  # only needed when loading a local .env file

# Functions cell
def print_text(text_string):
    print(text_string)

def lambda_handler(event, lambda_context):
    load_dotenv()  # load environment variables from a local .env file
    string_to_print = os.getenv("STRING_TO_PRINT")
    print_text(string_to_print)

# Trigger cell (not copied into the final .py script)
os.environ["STRING_TO_PRINT"] = "Bacon"  # Variable to be passed via Terraform
lambda_handler("", "")
Once you have your script able to output data from the API locally, we need to modify the script to output to S3.
Integrating S3
For our Python script to output data into the Data Platform, we will use a Python module called Boto3.
While it has many features, we will focus on three of them:
- Obtaining secrets from Secrets Manager
- Reading files in S3 Bucket
- Outputting to S3
However, Boto3 does not work unless you authenticate it. There are many ways to do so; follow one of the methods found here.
Once you have authenticated Boto3, let's use some AWS functionality.
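For local development, one common option (just one of the methods referenced above) is to authenticate with a named profile you have already configured via the AWS CLI; the profile name below is a hypothetical example. Once the script runs inside the Lambda, the Lambda's execution role provides credentials automatically, so this local setup is not copied into the final script.
import boto3

# Local development only: authenticate against a named AWS CLI profile
# ("pre-prod" is a hypothetical profile name)
boto3.setup_default_session(profile_name="pre-prod")

# Subsequent boto3.client(...) calls will now use that profile's credentials
s3_client = boto3.client("s3")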
Obtaining Secrets from the Secrets Manager
Code to get secrets from Secrets Manager
import json
import os
import boto3

secrets_manager_client = boto3.client('secretsmanager')
secret_name = os.getenv("SECRET_NAME")
secret_manager_response = secrets_manager_client.get_secret_value(SecretId=secret_name)
api_credentials = json.loads(secret_manager_response['SecretString'])
api_key = api_credentials.get("api_key")
secret = api_credentials.get("secret")
- Create a Secrets Manager client with boto3
- Pull the secret_name from environment variables
- In a local environment, we can give this to the script using os.environ["SECRET_NAME"] = "Some Value"
- In the Data Platform, we will give the secret name to the script via Terraform later
- Make an API request with the Secrets Manager client, which will return a response
- Load the SecretString portion of the response (['SecretString'])
- Get the API key and secret
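For example, when testing locally you could point the script at the secret from the prerequisites. Note that the code above assumes the stored SecretString is a JSON object containing api_key and secret keys; the secret name below is a hypothetical example following the naming convention.
import os

# Hypothetical secret name following the "/[Department Name]/[secret name]" convention
os.environ["SECRET_NAME"] = "/customer-services/vonage"

# The code above assumes the SecretString stored in Secrets Manager is JSON like:
# {"api_key": "<your api key>", "secret": "<your secret>"}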
Reading files in S3 Bucket
You may want to list the files that you have in an S3 bucket, perhaps to determine what data you already have, or to actually read one of them.
Code to List Folders
import boto3

s3_client = boto3.client('s3')

def list_subfolders_in_directory(s3_client, bucket, directory):
    # Delimiter="/" groups keys under the prefix into "folders" (CommonPrefixes)
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=directory,
        Delimiter="/")
    subfolders = response.get('CommonPrefixes')
    return subfolders
Returns a list of folders at a specific path.
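A quick usage sketch: CommonPrefixes is a list of dictionaries (or None if nothing was found), so you pull out each 'Prefix' value. The bucket and prefix values here are placeholders.
# Placeholder bucket/prefix values for illustration
subfolders = list_subfolders_in_directory(s3_client, "landing-zone", "crm/")
if subfolders:
    folder_names = [subfolder['Prefix'] for subfolder in subfolders]
    print(folder_names)  # e.g. ['crm/import_year=2024/']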
Code to List Files
import re
import boto3

s3_client = boto3.client('s3')
bucket = "Bucket name"
directory = "Path to where you want to list the files, ending with /"

def list_s3_files_in_folder_using_client(s3_client, bucket, directory):
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=directory)
    files = response.get("Contents")
    for file in files:
        # Strip the directory prefix so only the file name remains
        file['Key'] = re.sub(string=file['Key'],
                             pattern=re.escape(directory),
                             repl="")
    # Returns a list of dictionaries with file metadata
    return files
Returns a list of files at a specific path.
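If you then want to actually read one of those files, a minimal sketch looks like the following; the bucket and key values are placeholders.
# Placeholder bucket/key values for illustration
response = s3_client.get_object(Bucket="landing-zone", Key="crm/some_file.json")

# The Body is a streaming object, so read and decode it
file_contents = response['Body'].read().decode('utf-8')
print(file_contents)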
Output to S3 Landing Zone
Here I will supply and explain a function which will help you put files into S3
Output to Landing zone with Formatting
import boto3
from datetime import date

s3_client = boto3.client('s3')

def output_to_landing_zone(s3_bucket, data, output_folder, filename):
    todays_date = date.today()
    day = str(todays_date.day).zfill(2)
    month = str(todays_date.month).zfill(2)
    year = str(todays_date.year)
    import_date = todays_date.strftime("%Y%m%d")  # import_date partition uses YYYYMMDD
    return s3_client.put_object(
        Bucket=s3_bucket,
        Body=str(data),
        Key=f"{output_folder}/import_year={year}/import_month={month}/import_day={day}/import_date={import_date}/{filename}.json")
So if you wanted to put a JSON file into the "Sandbox" bucket, and within that bucket you want the data to sit within the "CRM" folder, you would call the function with
output_to_landing_zone("Sandbox", <the json data>, "CRM", "<filename>").
It will then put the data into the Data Platform, using today as the import day and creating the correct folder structure for it to work. Note that the import_date is in YYYYMMDD format, not YYYY-MM-DD.
Once you have the code working, try running it using Pre-Prod credentials to output to Pre-Prod
Generating a Pipfile.lock and requirements.txt File
For your Python script to work, you will most likely need to download extra packages which are not built into Python
To do this, we will create a requirements.txt file, and from the requirements.txt file we will generate a Pipfile.lock. This Pipfile.lock is what tells the Terraform (which creates the Lambdas) which external packages to give to the environment that will run your Python script
- Install pipreqs. Generally something like pip install pipreqs would be fine
- Open a command prompt at the location of your Python script
- Use the command pipreqs. This creates a requirements.txt file containing all of the packages and dependencies you need for your script
- Install pipenv. pip install pipenv works
- Use the command pipenv lock. This will create a Pipfile and Pipfile.lock. These are what the Terraform needs to create your Lambda
- Open your Pipfile
- Move the boto3 and botocore rows to [dev-packages]
- Replace all of the package versions with "*" instead. For example, boto3 = "==1.24.89" would become boto3 = "*"
- Change the python version to "3" (see the example Pipfile after this list)
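After those edits, your Pipfile should look roughly like the following sketch; the requests entry is just a hypothetical example of an external package your script might use.
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
requests = "*"

[dev-packages]
boto3 = "*"
botocore = "*"

[requires]
python_version = "3"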
So now we have a script which outputs files into AWS S3, and we have the Pipfile.lock which lets the Terraform know what packages we need. We now need to push the script to the Data Platform so that we can begin to use it
Update the Data Platform with our Script
The best practice is to clone the project into your IDE, create the files in the right place, then push it back to the Data Platform to be merged into the main branch
- If you do not have a copy of the Data Platform in your IDE environment already, clone the Data Platform
- Navigate to lambdas
- Create a folder for your script, named something like something_api_ingestion. All lowercase, with underscores replacing spaces
- Either create or copy your main.py script into this folder. Do the same with the Pipfile and Pipfile.lock
- Commit and push into the Data Platform, then make a Pull Request. Your code will need to be reviewed by someone else before it can be put into Live.