Time series documentation - Step by Step
Introduction
This guide is for people who are very new to the Data Platform and want to do some forecasting. Can be used alone or conjunction with the other bits of documentation
Using public data which is stored on the Platform
Prerequisites
- Have access to Pre-Production in AWS Glue (PowerUser)
- Your Data ready in Pandas Dataframe format. For this guide, the sample script has already gotten to this stage
1. Set up
First let's set up our test Glue Job. There is already one in the AWS Glue Jobs list for us to copy
Time Series Forecasting StepByStep
Clone and rename it to Time Series Forecasting StepByStep-[Your Name]
This script will load some public data about Bike Rentals
Viewing the Data using Print in AWS Glue
Most people will have no idea what this data looks like, so let's print the data and see. The print statements are already there to show some of the data and to show its information. So just Run the script
Once the script runs, look at the Output Logs at the bottom of the run, and select the log
You should be able to see the python print outputs at the bottom, click on these to expand and take a look
In particular, the info is interesting
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4781 entries, 0 to 4780
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Day 4781 non-null object
1 Number of Bicycle Hires 4781 non-null object
2 import_datetime 4781 non-null datetime64[ns]
3 import_timestamp 4781 non-null object
4 import_year 4781 non-null object
5 import_month 4781 non-null object
6 import_day 4781 non-null object
7 import_date 4781 non-null object
dtypes: datetime64[ns](1), object(7)
This shows us all the columns, how many values we have and their types. We have a few issues from here
- The number of Bicycle Hires is not an int or float, but an object.
- There are multiple columns
- The Date is in a separate column rather than in the Index
Adding the Helper Functions to our Glue Job
You can import the helper functions by putting this line of code underneath your other imports
from scripts.helpers.time_series_helpers import "Your Function"
For what we want to do in this script, let's use these functions
- reshape_time_series_data: Most Time series functions use a 1 dimensional dataframe with the Date in the Index
- forecast_ets: The exponential smoothing ETS method. Requires a start and end date
- get_start_end_date: This will give us the start and end date needed
Details
Completed Import code Spoiler
from scripts.helpers.time_series_helpers import reshape_time_series_data, forecast_ets, get_start_end_dateNext, we want to add the libraries needed into our Glue environment
Follow this guide on how to do so: Link
Now save and run the Glue Script again. There should be no errors, and now you have some functions to use
Reshaping the Data for Forecasting
So lets reshape the data so that its in the right shape
You probably noticed that there is this line in the code at the moment
bike_data['Number of Bicycle Hires'] = bike_data['Number of Bicycle Hires'].apply(clean_int)
Since this guide is not about cleaning data, we have applied this cleaning step for you
First let's get this data into the desired shape for Time Series Analysis which means having just one metric and the date in the index
reshape_time_series_data
is the helper to use here, its arguments are
- The Dataframe
- The name of the Date Column so it can move the date into the index. In this case its
"Day"
- The list of columns to include, so for this case its
["Number of Bicycle Hires"]
- Format of the Date, for this dataset its american and is not MM/DD/YYYY so its
"%m/%d/%y"
Details
Completed Reshape code Spoiler
reshaped_data = reshape_time_series_data(bike_data,"Day",["Number of Bicycle Hires"], "%m/%d/%y")We then need the start date and end date. get_start_end_date
is what we want to use for this.
- The reshaped dataframe
- The period of your time data in the index. For this its
"D"
for days - How many multiples of that period do you want, so for months you can put 6 for 6 months of data.
Details
Completed Start End Date code Spoiler
start_date, end_date = get_start_end_date(reshaped_data,"D",180)To get 180 days of data
Putting the reshaped data into a Function
Now let's use the forecast_ets
function to get our forecast
- The dataset as a SERIES. This is important because the reshape function calls the single variable Y. So if your dataframe is called
reshaped_dataset
you will need to putreshaped_dataset.y
- Start date
- End Date
Details
Completed Start End Date code Spoiler
forecast = forecast_ets(reshaped_data.y,start_date,end_date)This will return a dataframe with our forecast mean results.
Using other functions instead
For other functions simply follow the documentation for their respective helper instead of using the forecast_ets helpers
and that's all we need. Just a few lines of code to copy and you get some forecasting. Now you need to figure out what you want to do with this forecast, but that won't be covered here