Ingest manually uploaded CSV files
Prerequisites
- You have some structured data files you wish to have available from the Data Platform
- You have access to the Hackney Data Platform
- The department whose area of the Data Platform you are placing this data into has the manual CSV upload functionality enabled.
Steps
- Ensure your data meets the following requirements before proceeding:
  - You have saved the data in the CSV file format, using a comma to separate fields. When exporting from Excel, select the "CSV UTF-8" option.
  - The first row of the CSV contains the names of the columns as you wish them to appear within the Data Platform. If there are any empty rows before the column names, delete them before exporting the file.
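If you want to sanity-check a file before uploading, the sketch below uses Python's built-in csv module to confirm the file is comma-separated, UTF-8 encoded, and starts with a non-empty header row. The file name is a hypothetical example.

```python
import csv

# Hypothetical local file; replace with the CSV you intend to upload.
path = "cake_designs.csv"

# encoding="utf-8-sig" tolerates the BOM that Excel's "CSV UTF-8" export
# adds at the start of the file.
with open(path, newline="", encoding="utf-8-sig") as f:
    reader = csv.reader(f)  # comma is the default delimiter
    header = next(reader)

    # The first row should contain the column names you want to see in the
    # Data Platform, with no blank rows above it.
    if not header or all(cell.strip() == "" for cell in header):
        raise ValueError("First row is empty; remove blank rows above the header")

    print("Detected columns:", header)
```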
- Sign in to the AWS Management Console and open the Amazon S3 console.
- In the Buckets list, select the landing zone bucket, dataplatform-stg-landing-zone. Navigate to your department's manual upload folder; see below for the structure.

<department>/
└── manual/

- Inside the manual folder, create a new folder for the dataset you wish to create inside the Data Platform. The name of the folder you create here will be used as the table name for the data source throughout the platform. If you are appending data to an already existing dataset, you can skip this step.
- Inside your dataset folder, upload a CSV containing your dataset by clicking the "Add files" button and selecting the local files you want to upload, then click the "Upload" button. Alternatively, you can use "Add folder", taking care not to have any files in your local folder that you do not want to upload. Any CSVs uploaded within this folder will be combined into one dataset, so they should have a matching set of columns. Once the status of each file shows "Upload succeeded", click the "Close" button.
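If you prefer to script the upload instead of using the console, a minimal boto3 sketch follows. The bucket name comes from this guide; the department, dataset folder and file names are hypothetical placeholders. In S3 the dataset "folder" is just a key prefix, so uploading an object under it also creates it.

```python
import boto3

s3 = boto3.client("s3")

# Bucket from this guide; department, dataset and file names are placeholders.
bucket = "dataplatform-stg-landing-zone"
department = "my-department"
dataset = "cake_designs"        # becomes the table name in the Data Platform
local_file = "cake_designs.csv"

# The key prefix <department>/manual/<dataset>/ acts as the dataset folder.
key = f"{department}/manual/{dataset}/{local_file}"

s3.upload_file(Filename=local_file, Bucket=bucket, Key=key)
print(f"Uploaded to s3://{bucket}/{key}")
```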
- Now go to AWS Glue Jobs, where you will run the job called <department> copy manually uploaded CSVs to raw. Select this job and click the "Run job" option in the Action pull-down menu. If a "Parameters" dialogue box appears, click "Run job".
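The same job can also be started from a script. A minimal boto3 sketch, assuming the job name follows the pattern above with a hypothetical department substituted in:

```python
import boto3

glue = boto3.client("glue")

# Placeholder: substitute your department into the job name from this guide.
job_name = "my-department copy manually uploaded CSVs to raw"

response = glue.start_job_run(JobName=job_name)
print("Started job run:", response["JobRunId"])
```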
- Reselect the job, observe the progress of its run within the "History" tab, and wait for the "Run status" to reach "Succeeded". The job will have created an S3 folder structure, as shown below, inside the raw zone, with Apache Parquet file(s) containing your CSV data inside it. You can check the output by navigating to the S3 bucket dataplatform-stg-raw-zone, confirming the folder structure, and observing that the Parquet file(s) have been created; there is also a scripted sketch of this check after the folder structure below.
<department>/
└── manual/
    └── <dataset-name>/
        └── import_year=<year>/
            └── import_month=<month>/
                └── import_day=<day>/
                    └── import_date=<YYYYMMDD>/
                        └── part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.xxxx.snappy.parquet
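If you would rather check the run and its output from a script, the sketch below polls the most recent run of the copy job and then lists the Parquet objects written under the raw zone prefix. The department and dataset names are the same hypothetical placeholders used earlier.

```python
import time
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

# Placeholders reused from the earlier sketches.
job_name = "my-department copy manually uploaded CSVs to raw"
department = "my-department"
dataset = "cake_designs"

# Poll the most recent run of the job until it finishes.
while True:
    runs = glue.get_job_runs(JobName=job_name, MaxResults=1)
    state = runs["JobRuns"][0]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print("Run status:", state)

# List the Parquet output written under the raw zone prefix.
prefix = f"{department}/manual/{dataset}/"
objects = s3.list_objects_v2(Bucket="dataplatform-stg-raw-zone", Prefix=prefix)
for obj in objects.get("Contents", []):
    if obj["Key"].endswith(".parquet"):
        print(obj["Key"])
```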
Files uploaded to the CSV folder will only be copied over to the raw zone once. The job uses each CSV's "created at" timestamp to process only the CSVs which have been uploaded since the last time the job successfully ran.
Confirm the ingestion worked
Next, we'll confirm the data has been ingested correctly by crawling it with an AWS Glue crawler and then inspecting it in Amazon Athena.
- Before accessing the data within Athena, you will need to run the crawler for this data. Navigate to the AWS Glue Crawlers interface, select the crawler named <department>-copy-manually-uploaded-csvs-to-raw, then click "Run crawler" and wait until its status returns to "Ready". Check the "Last runtime" of previous runs to get an idea of how long you might have to wait.
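Running the crawler and waiting for it to finish can also be scripted. A boto3 sketch, assuming the crawler name follows the pattern above with a hypothetical department substituted in:

```python
import time
import boto3

glue = boto3.client("glue")

# Placeholder: substitute your department into the crawler name from this guide.
crawler_name = "my-department-copy-manually-uploaded-csvs-to-raw"

glue.start_crawler(Name=crawler_name)

# The crawler is finished once its state returns to READY.
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(30)

last_crawl = glue.get_crawler(Name=crawler_name)["Crawler"]["LastCrawl"]
print("Crawler finished with status:", last_crawl["Status"])
```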
- Once crawled, there will be a newly created table within the database, which you can access in Athena as <department>-raw-zone-manual. You should see a table, e.g. "cake_designs", with the column names as per the CSV header. You should also see various "import_..." columns at the end, some of which are marked (Partitioned). You can then view the newly imported tables under the Tables tab. Note: the original names of the files, as they were uploaded, are not captured here at the moment.
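To run the same check from a script rather than the Athena console, a boto3 sketch along these lines should work. The database name follows the pattern above; the department, table name and query results location are hypothetical placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

# Database name pattern from this guide; table and results bucket are
# hypothetical examples.
database = "my-department-raw-zone-manual"
query = 'SELECT * FROM "cake_designs" LIMIT 10;'

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": database},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish, then print the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```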