Ingesting Metadata to DataHub
Overview
DataHub metadata is managed through YAML configuration files in the yaml_config/ directory of the dap-datahub repository. Once changes are merged to the main branch, the files are synced to a dedicated S3 bucket, and the DataHub ingestion DAG processes them daily at 10:00 AM.
Configuration File Structure
yaml_config/
├── ingestion/ # Source ingestion recipes
│ ├── glue.yaml # AWS Glue Catalog ingestion
│ └── qlik_cloud.yaml # Qlik Cloud ingestion
├── metadata/
│ ├── domains/ # Domain definitions
│ │ └── domains.yaml
│ ├── dataplatforms/ # Platform definitions
│ │ └── dataplatforms.yaml
│ ├── datasets/ # Individual dataset metadata
│ │ └── *.yaml
│ └── dataproducts/ # Data product definitions (grouped by domain)
│   ├── child-fam-services/
│   │ └── *.yaml
│   └── housing/
│     └── *.yaml
How to Add Tables to Glue Ingestion
File: ingestion/glue.yaml
Add a Database
Add to the database_pattern.allow list:
database_pattern:
  allow:
    - your-database-name
Add a Table
Add to the table_pattern.allow list:
table_pattern:
  allow:
    - database-name.table-name
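The following is a minimal sketch of how allow-list entries might be evaluated, assuming (as is generally the case for DataHub allow/deny patterns) that each entry is treated as a regular expression matched from the start of the database.table name. The helper is_allowed and the database and table names are hypothetical and exist only for illustration. Under that assumption, an unescaped dot matches any character and an entry without a trailing $ also matches longer names that share the same prefix.

```python
import re


def is_allowed(name: str, allow_patterns: list[str]) -> bool:
    """Return True if `name` matches any allow pattern from its first character."""
    # re.match anchors at the start of the string only, not at the end.
    return any(re.match(pattern, name) for pattern in allow_patterns)


# Hypothetical allow-list entries, for illustration only.
allow = [r"housing_db\.tenancies", r"finance_db\..*"]

print(is_allowed("housing_db.tenancies", allow))          # a single listed table
print(is_allowed("finance_db.invoices", allow))           # any table in finance_db
print(is_allowed("parking_db.permits", allow))            # not listed
print(is_allowed("housing_db.tenancies_archive", allow))  # shares a listed prefix
```

If only the exact table should be ingested, ending the entry with $ (for example housing_db\.tenancies$) prevents the prefix match shown in the last call.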
Remove a Table
- SQL-based sources such as Glue support the remove_stale_metadata parameter. With it enabled, simply remove the table from the table_pattern.allow list; the table is automatically soft-deleted from DataHub on the next ingestion run.
- Qlik Cloud ingestion does not support automatic deletion of stale metadata. To remove a dataset from the metadata store, delete it manually through the DataHub UI or CLI.
How to Add Qlik Spaces to Ingestion
File: ingestion/qlik_cloud.yaml
Add space names to the space_pattern.allow list:
space_pattern:
  allow:
    - 'Your Space Name'
How to Add Ingestion from Other Sources
If you need to add ingestion from a source other than Glue or Qlik Cloud, contact the DAP team, who will create the initial YAML template for you.
How to Add a New Domain
File: metadata/domains/domains.yaml
Add a new entry to the list:
- id: your-domain-id
  display_name: Your Domain Display Name
  description: Description of what this domain covers
Fields:
- id: Unique identifier (kebab-case)
- display_name: Name shown in DataHub UI
- description: What the domain covers
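A quick way to check the kebab-case requirement for id before opening a pull request is sketched below. The helper is_kebab_case is hypothetical and not part of the repository, and the exact rule (lowercase letters and digits separated by single hyphens) is an assumption about what kebab-case means here.

```python
import re

# Assumed rule: lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_kebab_case(domain_id: str) -> bool:
    """Return True if `domain_id` is entirely kebab-case under the assumed rule."""
    return KEBAB_CASE.fullmatch(domain_id) is not None


for candidate in ["child-fam-services", "housing", "Housing", "housing_services"]:
    print(candidate, is_kebab_case(candidate))
```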
How to Add a New Data Platform
File: metadata/dataplatforms/dataplatforms.yaml
Add a new platform to the platforms list:
platforms:
  - id: glue
    display_name: Data Analytics Platform
    description: Platform description
    logo: https://url-to-logo.png
Fields:
- id: Must match the source type used in the ingestion configuration (e.g., glue, qlik-sense, athena), or a custom platform ID if you are adding metadata for a non-ingested source
- display_name: Custom name shown in DataHub UI
- logo: URL to logo image (optional)
Note: A new platform does not appear in the DataHub UI until at least one dataset is attached to it.
How to Add a Data Product
- Navigate to the correct domain folder: metadata/dataproducts/{domain-name}/
- Create a new YAML file, e.g., my-product.yaml
- Add the data product definition:
id: urn:li:dataProduct:domain-name.product-name
properties:
  name: Product Display Name
  description: "Business description of what this data product provides"
domain: urn:li:domain:domain-name
assets:
  - urn:li:dataset:(urn:li:dataPlatform:glue,database.table1,${ENV})
  - urn:li:dataset:(urn:li:dataPlatform:glue,database.table2,${ENV})
customProperties:
  data_quality: "High"
  refresh_frequency: "Daily"
Fields:
- id: URN format: urn:li:dataProduct:domain.product-name
- domain: Must match an existing domain
- assets: List of dataset URNs to include
- customProperties: Optional metadata (key-value pairs)
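Because the URNs above are easy to mistype, a small helper can assemble them from their parts. This is only a sketch that mirrors the URN shapes shown in the example; the function names are hypothetical and are not part of the repository or of the DataHub client library.

```python
def dataset_urn(platform: str, name: str, env: str = "${ENV}") -> str:
    """Build a dataset URN in the shape used in the assets list."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def data_product_urn(domain: str, product: str) -> str:
    """Build a data product URN in the shape used for the id field."""
    return f"urn:li:dataProduct:{domain}.{product}"


def domain_urn(domain: str) -> str:
    """Build a domain URN in the shape used for the domain field."""
    return f"urn:li:domain:{domain}"


print(data_product_urn("domain-name", "product-name"))
print(domain_urn("domain-name"))
print(dataset_urn("glue", "database.table1"))
```

The env argument defaults to the literal ${ENV} placeholder so that the output can be pasted into the YAML file unchanged.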
How to Add Dataset Metadata (Advanced)
Note: In most cases you do NOT need to add dataset metadata manually. Datasets are created automatically during ingestion, and the source systems remain the single source of truth. Only use this option to override or add metadata on existing datasets, and consult the DAP team before adding custom dataset metadata.
File: Create a new YAML file in metadata/datasets/ (e.g., my_dataset.yaml) with the following content:
id: database.table_name
platform: glue
env: '${ENV}'
editableProperties:
  description: "Your dataset description"
globalTags:
  tags:
    - tag: urn:li:tag:YourTag
ownership:
  owners:
    - owner: urn:li:corpuser:your.email@hackney.gov.uk
      type: TECHNICAL_OWNER
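This page does not describe how the ${ENV} placeholder is resolved. Assuming it is replaced with an environment name (for example PROD) by environment-variable-style substitution before the file is processed, the effect can be previewed locally with Python's standard string.Template. The helper expand_placeholders and the value PROD are illustrative assumptions, not the pipeline's actual mechanism.

```python
from string import Template


def expand_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace ${NAME} placeholders in `text`; raises KeyError if a value is missing."""
    return Template(text).substitute(values)


snippet = "env: '${ENV}'"
print(expand_placeholders(snippet, {"ENV": "PROD"}))
```

Using substitute rather than safe_substitute means a missing value fails loudly instead of leaving a stray placeholder in the output.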