đź“šIntroduction
What is DAP⇨flow?
DAP⇨flow is an integration of Apache Airflow with Amazon Athena, built on Hackney's Data Analytics Platform.
DAP⇨flow lets Data Analysts develop and run data pipelines using their own service's data, in the simplest way possible, and create data products for their service and its users.
Building data pipelines used to be hard, complex, and time-consuming.
After prototyping their SQL queries in Amazon Athena, Data Analysts had to convert their Athena SQL to Spark SQL, a different SQL dialect, embed that code within an AWS Glue job, and then deploy the job using Terraform.
Data Analysts were also forced to query across multiple generations of the same data stored in the Amazon S3 data lake, when all they actually wanted was their current data. That meant they could not simply take legacy SQL queries and run them directly in Amazon Athena.
PREVIOUSLY: Too hard + too complex = too time consuming...
FIREBREAK: Deciding what we wanted to change...
OUTCOME: DAP re-imagined using Airflow = DAP⇨flow...
How DAP⇨flow solved our problems
Firstly, Data Analysts no longer need to convert and re-test their prototype SQL transforms to run in the separate and more complex AWS Glue runtime environment.
Instead, Apache Airflow runs exactly the same Amazon Athena engine to transform data in production, with the outputs going directly into data products. The prototype SQL transforms that Data Analysts have already tested until they work can simply be reused rather than discarded.
That cuts development time by more than half, and Data Analysts no longer need to context-switch between two SQL dialects.
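To make this concrete, here is a minimal sketch of an Airflow DAG that runs a prototype Athena query unchanged in production. The DAG name, database, table, and S3 output location are illustrative assumptions, not DAP⇨flow's actual configuration.

```python
# A minimal sketch: reusing a prototype Athena SQL transform in production.
# All names below (DAG id, database, table, bucket) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

with DAG(
    dag_id="example_service_transform",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # The same SQL that was prototyped in the Athena console is pasted in here.
    build_data_product = AthenaOperator(
        task_id="build_data_product",
        query="""
            SELECT case_id, status, updated_at
            FROM service_db.cases          -- hypothetical service table
            WHERE status = 'open'
        """,
        database="service_db",                           # hypothetical database
        output_location="s3://example-bucket/results/",  # hypothetical bucket
    )
```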
Secondly, Data Analysts no longer have to adapt their legacy SQL queries to the Amazon S3 data lake's partitioning architecture.
Instead, Apache Airflow is configured to generate views over the underlying table data, presenting Data Analysts with only the current ingested service data. The same views serve both for prototyping and testing, and later when the working transforms are deployed and run automatically by Airflow.
That further cuts development time: Data Analysts can take the legacy SQL code from their service database system and run it directly on Amazon Athena with few changes.
Data Analysts can also migrate their existing Athena SQL prototypes, previously adapted to the data lake's partitioning architecture, because the same table history is still available to them; those tables are now suffixed "_history", which is more intuitive for new users.
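As an illustration, here is a minimal sketch of how such a current-only view could be maintained from Airflow, assuming history tables are partitioned by an import date column. The database, table, view, and column names are hypothetical, not DAP⇨flow's real schema.

```python
# A minimal sketch: maintaining a current-only view over a partitioned
# history table. All names (DAG id, database, tables, columns) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

with DAG(
    dag_id="refresh_current_views",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    refresh_cases_view = AthenaOperator(
        task_id="refresh_cases_view",
        query="""
            CREATE OR REPLACE VIEW service_db.cases AS
            SELECT *
            FROM service_db.cases_history      -- full ingestion history
            WHERE import_date = (
                SELECT max(import_date)        -- latest generation only
                FROM service_db.cases_history
            )
        """,
        database="service_db",                           # hypothetical database
        output_location="s3://example-bucket/results/",  # hypothetical bucket
    )
```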
Lastly, Data Analysts no longer need to use Terraform to deploy their data pipeline jobs, because Apache Airflow takes care of deployment as soon as they commit their transform queries to DAP⇨flow's GitHub repository.
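One way this commit-to-deploy flow can work (a sketch under assumptions, not a description of DAP⇨flow's internals) is for a DAG to pick up every .sql file committed alongside it and create one Athena task per file. The folder layout and all names here are hypothetical.

```python
# A minimal sketch: turning committed .sql transform files into Athena tasks.
# The sql/ folder, DAG id, database, and bucket are hypothetical assumptions.
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

SQL_DIR = Path(__file__).parent / "sql"  # hypothetical folder of committed transforms

with DAG(
    dag_id="run_committed_transforms",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for sql_file in sorted(SQL_DIR.glob("*.sql")):
        # One Athena task per committed transform file.
        AthenaOperator(
            task_id=f"run_{sql_file.stem}",
            query=sql_file.read_text(),
            database="service_db",                           # hypothetical database
            output_location="s3://example-bucket/results/",  # hypothetical bucket
        )
```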
📚Onboarding
A series of onboarding documents is available here to help Data Analysts get started with DAP⇨flow.
Anyone new to DAP⇨flow should start with 📚Before you begin, followed by 📚Welcome!.
Thereafter, Data Analysts do not need to read every document in the order listed below, especially if they are already familiar with the AWS Management Console and have used Amazon Athena before.
Data Analysts are encouraged to think about what they need to do before deciding which document to read next. For example, if they have a legacy SQL query that they want to migrate to DAP⇨flow, they could jump straight to 📚Prototype legacy transforms.
"We ♡ your feedback!"​
Your continuous feedback enables us to improve DAP⇨flow and our Data Analytics Platform service. Survey links are provided at the end of each onboarding document.
Below is the full list of topics currently on offer...
More topics will be added as they are ready. Skip to the end to discover what's coming next!