How can I use Amazon Athena to prototype a simple table-join data transformation?
📚Appendix
A. Partition Columns
Because partition-column data is not written to the underlying Parquet files in the S3 data lake, it is strongly recommended to add a further column, import_timestamp or original_import_date, containing the same information, in case it is ever necessary to deduce where misplaced or exported Parquet files were originally written.
Here are a few common ways of inserting the partition columns into transform queries:
a) Generated by the "now" variables:
CAST(CURRENT_TIMESTAMP AS TIMESTAMP(3)) AS import_timestamp,
/* Partition columns */
FORMAT_DATETIME(CURRENT_DATE, 'yyyy') AS import_year,
FORMAT_DATETIME(CURRENT_DATE, 'MM') AS import_month,
FORMAT_DATETIME(CURRENT_DATE, 'dd') AS import_day,
FORMAT_DATETIME(CURRENT_DATE, 'yyyyMMdd') AS import_date
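Dropped into a complete transform, option (a) might look like the following sketch. The table names my_service_raw_zone.claims and my_service_refined_zone.claims_summary, and the claim columns, are purely hypothetical; note that Athena expects the partition columns to come last in the SELECT list when inserting into a partitioned table:

```sql
/* Sketch only: all table and business-column names are hypothetical. */
INSERT INTO "my_service_refined_zone"."claims_summary"
SELECT
    c.claim_id,
    c.claim_amount,
    CAST(CURRENT_TIMESTAMP AS TIMESTAMP(3)) AS import_timestamp,
    /* Partition columns (must be last) */
    FORMAT_DATETIME(CURRENT_DATE, 'yyyy') AS import_year,
    FORMAT_DATETIME(CURRENT_DATE, 'MM') AS import_month,
    FORMAT_DATETIME(CURRENT_DATE, 'dd') AS import_day,
    FORMAT_DATETIME(CURRENT_DATE, 'yyyyMMdd') AS import_date
FROM "my_service_raw_zone"."claims" c;
```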
"import" is really a misnomer here, since it is theoretically possible to generate these partitions the day after the data was actually ingested. However this still remains the convention, even though it has been suggested to use transform_date
or even the "old school" generation_date
instead. In practice the new DAP⇨flow orchestration's triggering of transforms should prevent these partition columns being out of alignment. Before DAP⇨flow, a scheduler might have still triggered a similar Glue ETL job even though the underlying ingestion had failed, leading to some confusion about what the partition data actually meant when it was later queried.
b) Derived directly from the [service raw zone].[parent table] p, for example:
p.import_date AS original_import_date, -- preserved in the Parquet file
/* Partition columns */
p.import_year,
p.import_month,
p.import_day,
p.import_date
This is the preferred option for many as it makes the most logical sense. But you might still want to capture the CURRENT_TIMESTAMP, as in (a) above.
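As a sketch, a full option (b) query might then look like this (the table and business columns are again hypothetical):

```sql
SELECT
    p.claim_id,                                                  -- hypothetical business columns
    p.claim_amount,
    CAST(CURRENT_TIMESTAMP AS TIMESTAMP(3)) AS import_timestamp, -- when THIS transform ran
    p.import_date AS original_import_date,                       -- preserved in the Parquet file
    /* Partition columns */
    p.import_year,
    p.import_month,
    p.import_day,
    p.import_date
FROM "my_service_raw_zone"."claims" p;
```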
c) Passed down from [service refined zone].[parent table] p, for example:
p.import_timestamp,
-- This column might be named differently, eg...
-- p.ImportDateTime AS import_timestamp,
/* Partition columns */
p.import_year,
p.import_month,
p.import_day,
p.import_date
Tables in the refined zone are themselves the products of transforms with partitions, so you would want to align their downstream products to them. If import_timestamp was called something else, or needs casting to a TIMESTAMP, check [service refined zone].[parent table] first!
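For instance, if the parent table stores its timestamp under a different name, or as an ISO-8601 string, a cast along these lines keeps the downstream schema consistent (the column names ImportDateTime and import_datetime_text are hypothetical):

```sql
-- Parent stores the timestamp under a different name:
p.ImportDateTime AS import_timestamp,
-- ...or as an ISO-8601 string that needs parsing first:
CAST(FROM_ISO8601_TIMESTAMP(p.import_datetime_text) AS TIMESTAMP(3)) AS import_timestamp,
```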
B. Conversations about style
💡Migrating to new technologies or a new platform offers teams the opportunity to raise coding standards and shed the common bad habits of the past.
- You have complete freedom over SQL code formatting; however, we recommend stacking tables and column expressions each on their own line, which helps readability and line-by-line debugging. Endeavour to be consistent in your style and agree on it with your service colleagues. Our examples use 4-space tabulation and indent column expressions after the statement, which you are welcome to adopt.
- Indentation has no syntactic significance to SQL interpreters, so the practice is purely aesthetic, for readability. That is especially true of JOIN statements, which join not only the previous table but everything before them to the table that follows. Further indentation of subsequent JOINs cannot imply nested right-to-left execution of table joins; beginning every JOIN statement at the left emphasizes the true top-to-bottom execution order. Nesting the execution order can only be achieved with bracketed sub-queries, which should naturally be indented for readability.
- The use of leading commas is generally not compatible with -- trailing comments in the same SQL script, because merely auto-reformatting such SQL can easily cause a syntactic failure that leaves it no longer executable. There exists much legacy SQL, from way back last century, that uses leading commas. If the intention is to maintain such SQL as it is, then you should also consider banning -- trailing comments and only allowing /* encapsulated comments */. Alternatively, use SQL formatting tools to banish those leading commas forever!
- Everyone disagrees about their favorite SQL formatting style. SQL formatting tools can help us quickly get past that problem. But beware: SQL formatting tools let you enforce your bad habits as well as your good ones, and it is all too easy to be swayed by the popularity of some bad habits.
- Are all-caps SQL keywords considered a redundant practice? Well, that depends! The style is coming back into fashion as a way to emphasize SQL code embedded within C or Python code, which is generally lower-case. People will never agree, so you must decide for yourself; or, if there is a consensus among your colleagues, why not just go with that? Just be consistent!
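Pulling the points above together, here is one illustrative query (all table and column names hypothetical) showing trailing commas, /* encapsulated comments */, every JOIN beginning at the left to reflect top-to-bottom execution, and a bracketed sub-query indented where nesting really is intended:

```sql
SELECT
    p.person_id,        /* encapsulated comments survive re-formatting */
    p.full_name,
    a.postcode,
    h.latest_import_date
FROM my_service_raw_zone.person p
INNER JOIN my_service_raw_zone.address a
    ON a.person_id = p.person_id
LEFT JOIN (
    SELECT
        person_id,
        MAX(import_date) AS latest_import_date
    FROM my_service_raw_zone.person_history
    GROUP BY person_id
) h
    ON h.person_id = p.person_id;
```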