Using DataHub as a Data Catalogue - ADR 011

Context

We need users to be able to browse and search the datasets within the platform so that they can find what they need and so that we can break down data silos. There are several ways to implement this.

Decision Drivers

  • Ease of set up and maintenance
  • Features and user experience
  • Ability to ingest metadata from the data lake
  • Cost

Considered Options

  • Bespoke, in-house solution
  • Open source solutions (DataHub, Amundsen, Metacat, Marquez)
  • Paid-for solutions (Atlan, Qlik, Google Cloud Data Catalog)

Decision

We have decided to use DataHub, an open source tool, as our data catalogue because:

  • We have been able to successfully set it up in our AWS environment without much difficulty [please add here]
  • It has the basic features we require to catalogue datasets (e.g. a range of metadata that users can add to, plus search and browse functionality), as well as additional functionality that may enable us to catalogue data pipelines, dashboards, models, etc. in future.
  • Users responded positively to it in user research.
  • It is capable of ingesting metadata from Hive/AWS Glue and thus our data lake.
  • It is open source, so there are no licence fees and cost is limited.
  • It has an active community where we have an opportunity to influence its future development.
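
To illustrate the Glue ingestion point above, here is a minimal sketch of what an ingestion recipe could look like when run through DataHub's Python package. It is not our actual configuration: the region and server URL in the usage note are placeholder assumptions, the helper names build_glue_recipe and run_ingestion are hypothetical, and it assumes the package has been installed with the Glue plugin (pip install 'acryl-datahub[glue]') and that AWS credentials are available in the environment.

```python
from typing import Any, Dict


def build_glue_recipe(aws_region: str, datahub_server: str) -> Dict[str, Any]:
    """Build a recipe that reads table metadata from AWS Glue and sends it
    to a DataHub instance over REST. Both arguments are placeholders to be
    supplied by the caller."""
    return {
        "source": {
            "type": "glue",
            "config": {"aws_region": aws_region},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
    }


def run_ingestion(recipe: Dict[str, Any]) -> None:
    """Run the recipe. Requires the acryl-datahub package with the Glue plugin."""
    # Imported here so the recipe can be built and inspected without the package.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Raise an error if the source or sink reported failures.
    pipeline.raise_from_status()
```

For example, run_ingestion(build_glue_recipe("eu-west-2", "http://localhost:8080")) would attempt to ingest the Glue catalogue for that region into a locally running DataHub; both values are illustrative only.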

Consequences

The team will need to dedicate resource to maintaining DataHub and to contributing to any further development we wish to influence.