
What is DataHub, and How to Deploy It in an Enterprise Data Stack?

Last updated on May 12, 2026

Overview

DataHub is an open-source metadata platform used to catalog data assets, document ownership, track lineage, and make datasets easier to discover and govern.

In a Shakudo environment, DataHub sits at the data discovery and governance layer. It connects to warehouses, BI tools, orchestration tools, and databases so teams can understand what data exists, who owns it, and how it is used.

This page is written for onboarding and deployment calls. It focuses on what customers need to understand, provide, validate, and troubleshoot in a real environment.

Where it fits in the stack

  • Primary role: DataHub provides a reusable platform capability rather than a one-off application.
  • Typical deployment model: Kubernetes + Helm, with customer-specific values and secrets.
  • Typical access model: private internal endpoint or customer-approved external route.
  • Typical support model: validate deployment health first, then validate user workflow and integrations.

Getting Started

Start with one safe workflow in DataHub before enabling production usage. The goal is to prove connectivity, permissions, and operational ownership.

What the customer needs to provide

  • Metadata sources such as warehouses, databases, Airbyte, dbt, Superset, or Kafka
  • Ingestion credentials with read-only metadata access (an example recipe follows this list)
  • Search/index backend such as Elasticsearch or OpenSearch
  • Kafka and SQL metadata store configuration, either bundled or external
  • Initial admin users and ownership model
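
As a concrete starting point, the read-only credentials above can be exercised with a small ingestion recipe. The sketch below assumes a Postgres staging database and the secret created later in this runbook; the host, database name, and GMS service address are placeholders to adjust:

cat > /tmp/postgres-staging-recipe.yaml <<'EOF_RECIPE'
# Hypothetical source; replace host_port and database with real values
source:
  type: postgres
  config:
    host_port: staging-db.<customer-domain>:5432
    database: analytics_staging
    username: ${WAREHOUSE_USER}       # resolved from environment at ingest time
    password: ${WAREHOUSE_PASSWORD}
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms:8080   # adjust to the actual GMS service
EOF_RECIPE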

First workflow

  • Open the DataHub UI
  • Create or import the first ingestion source
  • Run ingestion against one safe source first, such as a staging database (a CLI sketch follows this list)
  • Review datasets, schemas, ownership, and glossary terms
  • Add owners, tags, and documentation for high-value assets
  • Schedule ingestion after the initial result is validated
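
A minimal way to run that first ingestion by hand, assuming the DataHub CLI (acryl-datahub) is available in an environment that can reach GMS and the recipe sketched earlier on this page:

pip install 'acryl-datahub[postgres]'
export WAREHOUSE_USER='<read-only-user>'
export WAREHOUSE_PASSWORD='<password>'
datahub ingest -c /tmp/postgres-staging-recipe.yaml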

Deployment Runbook

📌 Command-first runbook for customer deployment calls. Replace placeholders before running. For production environments, run changes through the customer-approved change process.

Step 1 — Confirm cluster access

Run:

export KUBECONFIG=/path/to/customer-kubeconfig
export KUBE_CONTEXT=<customer-context>

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" config current-context
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get nodes
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get namespace hyperplane-datahub || kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" create namespace hyperplane-datahub

Step 2 — Clone the Shakudo chart

Run:

git clone --depth=1 --branch <release-branch> https://github.com/devsentient/monorepo.git /tmp/monorepo
cd /tmp/monorepo/stack-components/datahub/helm
helm dependency update .

Step 3 — Create required secrets

Run:

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" create secret generic datahub-ingestion-secrets -n hyperplane-datahub --from-literal=WAREHOUSE_USER='<read-only-user>' --from-literal=WAREHOUSE_PASSWORD='<password>'
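
To confirm the secret landed with the expected keys without printing the values:

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" describe secret datahub-ingestion-secrets -n hyperplane-datahub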

Step 4 — Prepare values.yaml

Run:

cat > /tmp/datahub-values.yaml <<'EOF_VALUES'
datahub:
  frontend:
    enabled: true
  gms:
    enabled: true

prerequisites:
  elasticsearch:
    enabled: true
  kafka:
    enabled: true
  mysql:
    enabled: true

ingestion:
  enabled: true

global:
  graph_service_impl: elasticsearch
  datahub_analytics_enabled: false

ingress:
  enabled: true
  host: datahub.<customer-domain>
EOF_VALUES

Step 5 — Deploy or upgrade

Run:

helm --kubeconfig "$KUBECONFIG" --kube-context "$KUBE_CONTEXT" upgrade --install datahub /tmp/monorepo/stack-components/datahub/helm \
  --namespace hyperplane-datahub \
  --create-namespace \
  --values /tmp/datahub-values.yaml \
  --timeout 15m \
  --wait

Step 6 — Validate Kubernetes resources

Run:

helm --kubeconfig "$KUBECONFIG" --kube-context "$KUBE_CONTEXT" status datahub -n hyperplane-datahub
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get pods,svc,pvc,ingress,virtualservice -n hyperplane-datahub
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get events -n hyperplane-datahub --sort-by=.lastTimestamp | tail -n 60
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" logs -n hyperplane-datahub -l app.kubernetes.io/instance=datahub --tail=100
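
If pods are still starting, a bounded wait can confirm readiness before moving on; the label selector assumes the release-name label used in the log command above:

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" wait --for=condition=Ready pod -l app.kubernetes.io/instance=datahub -n hyperplane-datahub --timeout=10m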

Step 7 — Smoke test the service

Run:

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" port-forward -n hyperplane-datahub svc/datahub-frontend 9002:9002

# In another terminal:
curl -I http://localhost:9002
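
If the frontend responds, it is worth checking GMS directly as well. The service name and health path below are typical DataHub defaults and may differ in this chart:

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" port-forward -n hyperplane-datahub svc/datahub-datahub-gms 8080:8080

# In another terminal:
curl -s http://localhost:8080/health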

Rollback

Run:

helm --kubeconfig "$KUBECONFIG" --kube-context "$KUBE_CONTEXT" history datahub -n hyperplane-datahub
helm --kubeconfig "$KUBECONFIG" --kube-context "$KUBE_CONTEXT" rollback datahub <REVISION> -n hyperplane-datahub
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" rollout status deployment/datahub -n hyperplane-datahub || true

Troubleshooting & FAQ

Use this section during customer debugging calls. Format: Problem → What to check → Fix.

Ingestion job fails

  • What to check: Connector credentials, network access, and the ingestion pod logs
  • Fix: Correct the source config and rerun the ingestion recipe manually (commands below)
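
For example, assuming UI-scheduled ingestion runs as pods in the DataHub namespace, the most recent run's logs can be pulled like this (the pod name is a placeholder):

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get pods -n hyperplane-datahub --sort-by=.metadata.creationTimestamp | tail -n 10
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" logs -n hyperplane-datahub <ingestion-pod-name> --tail=200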

Assets do not appear in search

  • What to check: GMS health, search backend health, and whether ingestion completed
  • Fix: Rerun ingestion and confirm the Elasticsearch/OpenSearch indexes are healthy (commands below)
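
One way to check the search backend, assuming the bundled Elasticsearch exposes the usual elasticsearch-master service (adjust the service name to what actually exists in the namespace):

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get svc -n hyperplane-datahub | grep -iE 'elastic|opensearch'
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" port-forward -n hyperplane-datahub svc/elasticsearch-master 9200:9200

# In another terminal:
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cat/indices?v'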

Lineage is missing

  • What to check: Whether the source supports lineage and whether dbt/BI metadata was ingested
  • Fix: Add the relevant source connector or ingest the dbt manifest (recipe sketch below)
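
If dbt is the missing piece, a minimal dbt ingestion recipe looks roughly like the sketch below; the artifact paths, target_platform, and GMS address are illustrative and must point at the customer's actual dbt artifacts:

cat > /tmp/dbt-recipe.yaml <<'EOF_RECIPE'
source:
  type: dbt
  config:
    manifest_path: /path/to/target/manifest.json
    catalog_path: /path/to/target/catalog.json
    target_platform: postgres      # platform the dbt models materialize into
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms:8080
EOF_RECIPE

datahub ingest -c /tmp/dbt-recipe.yaml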

UI loads but metadata pages error

  • What to check: DataHub GMS logs and metadata store connectivity
  • Fix: Restart GMS after confirming the database and Kafka are healthy (commands below)
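
Once the metadata store and Kafka check out, GMS can be restarted with a rollout. The deployment name below is the common DataHub default; list the deployments first to confirm what this chart uses:

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" get deploy -n hyperplane-datahub
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" rollout restart deployment/datahub-datahub-gms -n hyperplane-datahub
kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" rollout status deployment/datahub-datahub-gms -n hyperplane-datahub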

Administration and Best Practices

Use these practices to keep DataHub reliable after the initial deployment.

  • Start with a small number of high-value sources before cataloging everything
  • Use read-only ingestion credentials
  • Define owner and domain conventions before asking teams to contribute
  • Schedule metadata ingestion during low-traffic windows
  • Monitor Elasticsearch/OpenSearch storage because metadata indexes grow over time
  • Back up the DataHub metadata store before upgrades (a backup sketch follows this list)
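
As a sketch of the backup step, assuming the bundled MySQL prerequisite runs as a pod named mysql-0 with the root password exposed to the container and a database named datahub (all three are assumptions; confirm against the actual deployment):

kubectl --kubeconfig "$KUBECONFIG" --context "$KUBE_CONTEXT" exec -n hyperplane-datahub mysql-0 -- sh -c 'mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" datahub' > /tmp/datahub-metadata-backup.sql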

Watch in action

Why is DataHub better on Shakudo?
