Homework - write a partitioned calendar aggregate
Goal
Create a second script that:
- reads
"unrestricted-raw-zone"."universal_calendar" - creates a monthly aggregate
- writes to
s3://dataplatform-stg-refined-zone/data-and-insight/testing/demo/ - registers a new table in
"data-and-insight-refined-zone" - partitions the output by
["import_year", "import_month", "import_day", "import_date"]
Use your own name in the table name. For example, Jane Smith should use:
TABLE_NAME = "test_jane_smith_demo_calendar_monthly_partitioned"
Do not use tian unless your name is Tian.
Task
Copy universal_calendar_demo.py into a new file:
universal_calendar_homework.py
Change these values:
TABLE_NAME = "test_your_name_demo_calendar_monthly_partitioned"
PARTITION_COLUMNS = ["import_year", "import_month", "import_day", "import_date"]
Change the SQL so it aggregates by year and month:
SELECT
year(CAST(date_iso AS timestamp)) AS calendar_year,
month(CAST(date_iso AS timestamp)) AS calendar_month,
count(*) AS day_count,
sum(CASE WHEN bank_holiday_flag THEN 1 ELSE 0 END) AS bank_holiday_count
FROM "unrestricted-raw-zone"."universal_calendar"
WHERE date_iso IS NOT NULL
GROUP BY 1, 2
ORDER BY 1, 2
Before writing the parquet, add the partition columns:
from datetime import date
today = date.today()
df["import_year"] = today.year
df["import_month"] = today.month
df["import_day"] = today.day
df["import_date"] = today.isoformat()
Add this line to wr.s3.to_parquet:
partition_cols=PARTITION_COLUMNS,
Check
Run:
py .\universal_calendar_homework.py
Then check the table in Athena:
SELECT *
FROM "data-and-insight-refined-zone"."test_your_name_demo_calendar_monthly_partitioned"
LIMIT 10;
You should also see partition folders in S3, for example:
import_year=2026/import_month=5/import_day=19/import_date=2026-05-19/