Ingesting data from Google Sheets
Objective
Ingest data contained within a Google Sheet for use on the Data Platform, optionally setting a recurring schedule to ingest new data.
Intended audience
- Data Analyst
Prerequisites
- You have a GitHub account - you can create one yourself using your Hackney email
- You have been added to the 'LBHackney-IT' team - you can request this from Rashmi Shetty
- You have a service account email address, such as one listed in the table below:
Department specific information
Preparing a Google sheet for ingestion
- Open the Google sheet you would like to ingest
- Ensure that all columns have headers; columns without headers will be lost
- Click "Share" in the top right corner of the sheet
- If the document is unnamed, name it
- Paste the service account email address you have been provided into the email box
- Ensure the suggested email matches the service account email and select it
- In the new window, select "Viewer" from the dropdown on the right-hand side
- Uncheck the "Notify people" checkbox
- Click "Share"
- You will be asked to confirm sharing outside the organisation; click "Share anyway"
- Your Google sheet is now ready to ingest
Getting Google sheet detail
- You will need to obtain the document ID from the URL. The document ID is the portion of the URL between `https://docs.google.com/spreadsheets/d/` and `/edit#gid=0`. For example, for a URL of the form `https://docs.google.com/spreadsheets/d/1AbCdEfGh/edit#gid=0`, the document ID is `1AbCdEfGh`.
- You will also need the name of the worksheet you wish to ingest. The worksheet name is shown on its tab at the bottom left of the screen and, unless it has been renamed or other worksheets have been added, it will be called `Sheet1`.
- To ingest multiple worksheets from the same Google sheet, repeat the instructions in the section below for each worksheet.
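As a sketch of what that repetition looks like in the Terraform file covered in the next section (module names, the document ID, worksheet names and dataset names below are all illustrative placeholders, and the other module arguments are elided), two worksheets from the same document become two module entries sharing one `google_sheets_document_id`:

```hcl
# Illustrative only: two worksheets from the same Google Sheets document,
# each ingested by its own module block with a unique module name.
module "parking_permits_2021" {
  # ...other arguments as in the existing modules in the file...
  google_sheets_document_id    = "1AbCdEfGh"      # placeholder document ID
  google_sheets_worksheet_name = "Permits 2021"   # first worksheet tab
  dataset_name                 = "parking-permits-2021"
}

module "parking_permits_2022" {
  # ...other arguments as in the existing modules in the file...
  google_sheets_document_id    = "1AbCdEfGh"      # same document as above
  google_sheets_worksheet_name = "Permits 2022"   # second worksheet tab
  dataset_name                 = "parking-permits-2022"
}
```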
Setting up the AWS Glue job
This job handles the ingestion of the data from Google Sheets into the Data Platform.
- Open the Data Platform project. If you don't have the correct permissions, you'll get a '404' error (see prerequisites).
- Navigate to the main `terraform/etl` directory and open `08-google-sheets-imports.tf`.
- Switch to edit mode (using the edit button at the top right).
- Copy one of the modules above, paste it at the bottom of the file and update the following fields:
  - `module` - in the format `your_unique_module_name`, unique among all other `module` names in this file (it is helpful to keep the same naming convention as your dataset/folder)
  - `glue_catalog_database_name` - use `module.department_DEPARTMENT-NAME.raw_zone_catalog_database_name` (e.g. `module.department_parking.raw_zone_catalog_database_name`)
  - `google_sheets_document_id` - your Google Sheets document ID - see the Getting Google sheet detail section above
  - `google_sheets_worksheet_name` - the name of the worksheet within your Google Sheets document - see the Getting Google sheet detail section above. Note that the worksheet name must be unique within the Terraform file.
  - `department` - `module.department_DEPARTMENT-NAME` (the department name should appear as in the table above, e.g. `module.department_housing_repairs`)
  - `dataset_name` - the name of the dataset as you'd like it to appear within the Data Platform, e.g. `housing-repair`
- Optional: stop your Google sheet from ingesting automatically. The ingest job will run every weekday at 11pm if `enable_glue_trigger` is not specified (i.e. there is no line for it in your module) or it is set to `true`. If it is set to `false`, your job will not run automatically on a schedule and will have to be run manually within AWS; see the Running the ingestion manually section below for instructions.
- Optional: update the time schedule for the Google sheet ingestion. If a value for `google_sheet_import_schedule` is not provided, the ingestion will run at 11pm on weekdays.
  - To override and set a new time schedule, add a new row to the respective module with the new cron time, e.g. `google_sheet_import_schedule = "cron(0 23 ? * 1-5 *)"`
  - To create a new cron expression, follow the guidance provided by the AWS cron expression documentation.
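Putting the fields above together, a completed module entry might look like the sketch below. The module name, document ID, department and dataset name are illustrative placeholders, and any arguments not described above should be copied unchanged from the module you duplicated:

```hcl
# Illustrative sketch only: copy a real module from 08-google-sheets-imports.tf
# and edit the fields shown here.
module "parking_permit_returns" {
  # ...arguments copied unchanged from the module you duplicated...

  glue_catalog_database_name   = module.department_parking.raw_zone_catalog_database_name
  google_sheets_document_id    = "1AbCdEfGh"   # placeholder document ID from the sheet's URL
  google_sheets_worksheet_name = "Sheet1"      # worksheet tab name, unique within this file
  department                   = module.department_parking
  dataset_name                 = "parking-permit-returns"

  # Optional: uncomment to stop the job running automatically every weekday at 11pm.
  # enable_glue_trigger          = false

  # Optional: uncomment to override the schedule (6am on weekdays in this example).
  # google_sheet_import_schedule = "cron(0 6 ? * MON-FRI *)"
}
```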
- Submit your changes by referring to the Committing changes section of the Using GitHub guide.
- The Data Platform team needs to approve any changes you make to the code, so your change won't take effect automatically. Once your changes have been approved, the job will run at the next scheduled time.
Running the ingestion manually
Once you have been notified that your pull request has been merged, you can run the ingestion manually from the AWS console or wait until the scheduled time (if you set one).
To run it manually, navigate to AWS Glue workflows, select the workflow named `<department_name>-<dataset_name>`, click the "Actions" dropdown and then "Run".