Apache DolphinScheduler and SeaTunnel: Comprehensive Guide

Learn what Apache DolphinScheduler and SeaTunnel are, why they are used together, and how to deploy, migrate, manage logs, enable auto‑start, upgrade, and optimize metadata caching.

4 February 2026 by Suraj Barman

What is Apache DolphinScheduler?

Apache DolphinScheduler is an open‑source distributed workflow scheduling system designed for large‑scale data processing and ETL pipelines.

Provides visual DAG (Directed Acyclic Graph) design.
Supports task dependencies, fault tolerance, and resource isolation.
Integrates with major big‑data components (Spark, Flink, Hadoop, etc.).

What is Apache SeaTunnel?

Apache SeaTunnel (formerly Waterdrop) is a unified, high‑performance data integration platform for batch and streaming workloads.

Offers a pluggable connector ecosystem.
Supports real‑time synchronization between heterogeneous data sources.
Can run alongside DataX, a separate batch synchronization tool, for batch data migration.

Why Use DolphinScheduler and SeaTunnel Together?

Combining DolphinScheduler’s orchestration capabilities with SeaTunnel’s data integration strengths creates a robust end‑to‑end data pipeline solution.

Schedule and monitor SeaTunnel jobs as first‑class tasks.
Leverage DolphinScheduler’s retry and alert mechanisms for SeaTunnel failures.
Achieve consistent metadata management and caching across pipelines.

How to Deploy a Production‑Grade DolphinScheduler Cluster (3.2.0)

Follow these steps to set up a reliable, scalable DolphinScheduler environment.

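Prepare the node types: Master, Worker, a Database (MySQL/PostgreSQL), and a registry center such as ZooKeeper.
Install a supported JDK (8 or 11 for the 3.x releases), Docker (optional), and required system packages.
Configure bin/env/dolphinscheduler_env.sh (the location used by the 3.x releases) with JAVA_HOME, database, and registry settings, and tune JVM options and resource limits per service.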
Initialize the database schema using the provided SQL scripts.
Start services in order: Master, then Workers, and finally the API server.
Verify cluster health via the web UI and health check endpoints.
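
The final verification step can be scripted. The sketch below assumes each service exposes a Spring Boot style health endpoint that returns JSON with a "status" field. The host names, ports, and the /actuator/health path are assumptions for illustration, so check them against your own configuration before relying on them.

```python
import json
import urllib.request

# Hypothetical hosts and assumed ports/paths; adjust to your deployment.
HEALTH_URLS = {
    "master": "http://master-host:5679/actuator/health",
    "worker": "http://worker-host:1235/actuator/health",
    "api": "http://api-host:12345/dolphinscheduler/actuator/health",
}


def is_up(body: str) -> bool:
    """Return True when a health payload reports status UP."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def probe(url: str, timeout: float = 3.0) -> bool:
    """Fetch a health endpoint; any network or decoding error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_up(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False


def cluster_report(urls: dict) -> dict:
    """Map each service name to True (healthy) or False (unreachable or down)."""
    return {name: probe(url) for name, url in urls.items()}
```

Calling cluster_report(HEALTH_URLS) after startup gives a quick per-service view that can be wired into a deployment script.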

How to Transfer Workflows from Apache Airflow to DolphinScheduler (Air2phin)

Air2phin is a rule‑based migration tool that converts Airflow DAG files into workflow definitions for the DolphinScheduler Python SDK (PyDolphinScheduler).

Export Airflow DAG files (Python) to a local directory.
Run Air2phin against the source directory to convert the DAG files.
Review the generated Python workflow definitions for compatibility.
Submit the converted definitions to DolphinScheduler through the Python SDK, which connects to the API server.
Test the imported workflow and adjust task parameters as needed.
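
Before running the migration, it helps to know which exported files actually contain Airflow code. The following sketch is a hypothetical helper, not part of Air2phin: it scans a directory for Python files that import airflow so you can scope the migration and review list.

```python
import re
from pathlib import Path

# Matches "import airflow..." or "from airflow... import ..." at line start.
AIRFLOW_IMPORT = re.compile(r"^\s*(from|import)\s+airflow\b", re.MULTILINE)


def find_airflow_dags(root: str) -> list:
    """Return paths (relative to root) of .py files that import airflow."""
    matches = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if AIRFLOW_IMPORT.search(text):
            matches.append(str(path.relative_to(root)))
    return matches
```

The returned list can double as a checklist when testing each migrated workflow.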

How to Regularly Delete Log Instances in DolphinScheduler

A log retention policy prevents storage bloat and keeps the scheduler responsive.

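Configure log.cleaner.enable=true in conf/dolphinscheduler.properties (confirm that your release supports this property, since configuration file names and keys changed between versions).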
Set log.cleaner.days to the desired retention period (e.g., 30).
Optionally schedule a cron job that runs the built‑in log-cleaner.sh script.
Monitor the dolphinscheduler_log table to ensure old entries are purged.
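
If you schedule your own cleanup instead, the retention logic is simple to sketch. The example below assumes logs are plain *.log files under a directory you pass in; it does not touch database tables and is not a DolphinScheduler script. It defaults to a dry run so you can inspect what would be removed.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def expired_logs(log_dir, retention_days, now=None):
    """List *.log files whose modification time is older than the retention window."""
    current = time.time() if now is None else now
    cutoff = current - retention_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(log_dir).rglob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge(log_dir, retention_days, dry_run=True):
    """Delete expired logs unless dry_run is set; return the affected paths."""
    victims = expired_logs(log_dir, retention_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Run purge(path, 30) first to review the list, then repeat with dry_run=False from a cron job or a scheduled workflow.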

How to Enable Auto‑Start for DolphinScheduler Services

Auto‑start ensures services recover after a reboot.

Create systemd unit files for each component (master, worker, api, alert).
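Set Restart=on-failure in the [Service] section and WantedBy=multi-user.target in the [Install] section.
Reload systemd with systemctl daemon-reload, then enable the services: systemctl enable dolphinscheduler-master (repeat for others).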
Start them immediately with systemctl start and verify status.
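
Since the unit files differ only in the component name, they can be generated from one template. This sketch assumes an install path of /opt/dolphinscheduler, a dolphinscheduler service user, and that services are started through bin/dolphinscheduler-daemon.sh with component names such as master-server; all of these are assumptions to verify against your installation.

```python
# Assumed daemon script location, component names, and service type.
UNIT_TEMPLATE = """\
[Unit]
Description=Apache DolphinScheduler {component}
After=network-online.target

[Service]
Type=forking
User={user}
ExecStart={home}/bin/dolphinscheduler-daemon.sh start {component}
ExecStop={home}/bin/dolphinscheduler-daemon.sh stop {component}
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""

COMPONENTS = ["master-server", "worker-server", "api-server", "alert-server"]


def render_unit(component, home="/opt/dolphinscheduler", user="dolphinscheduler"):
    """Render the text of a systemd unit file for one component."""
    return UNIT_TEMPLATE.format(component=component, home=home, user=user)


def render_all(home="/opt/dolphinscheduler", user="dolphinscheduler"):
    """Map a unit file name to its content for every component."""
    return {
        f"dolphinscheduler-{c.split('-')[0]}.service": render_unit(c, home, user)
        for c in COMPONENTS
    }
```

Write each entry of render_all() to /etc/systemd/system/, then reload systemd and enable the units.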

How to Upgrade DolphinScheduler from 1.3.4 to 3.1.2

Upgrading across major versions requires careful planning.

Back up the existing database and configuration files.
Review the release notes for breaking changes (e.g., schema modifications, removed APIs).
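Stop all services, then upgrade the database schema using the migration scripts provided in the 3.x release.
Install the new binaries and re‑apply custom settings to the new configuration files rather than copying old files over, since the configuration layout changed between 1.x and 3.x.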
Restart services and validate functionality through the UI and API.
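
The backup step is the one most worth scripting, because it must succeed before anything else runs. The sketch below assumes the metadata lives in MySQL and that mysqldump is on the PATH; the database name, user, and output directory are hypothetical. It only builds the command, so you can review it before executing.

```python
from datetime import date


def backup_command(database, user, host="localhost", out_dir="/backup", today=None):
    """Build a mysqldump argument list that writes a dated SQL dump."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--host={host}",
        f"--user={user}",
        "--password",            # no value given, so mysqldump prompts for it
        f"--result-file={out_dir}/{database}-{stamp}.sql",
        database,
    ]
```

Pass the result to subprocess.run(cmd, check=True) so the upgrade script stops if the dump fails.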

How SeaTunnel Metadata Caching Works

Metadata caching reduces latency when accessing schema information from source/target systems.

During job initialization, SeaTunnel queries source metadata (tables, columns, types).
The metadata is stored in an in‑memory cache (e.g., Guava Cache) with a configurable TTL.
Subsequent tasks reuse the cached metadata, avoiding repeated network calls.
Cache invalidation occurs when schema changes are detected or TTL expires.
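
The caching behavior described above can be illustrated with a small TTL cache. This is a generic Python sketch of the pattern, not SeaTunnel's implementation; the loader function stands in for a metadata query against the source system, and the class and parameter names are hypothetical.

```python
import time


class TtlCache:
    """Cache loader results per key and reload them after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. a schema lookup
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh entry: no call to the source
        value = self._loader(key)  # miss or expired: reload and store
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key):
        """Drop an entry explicitly, e.g. after a detected schema change."""
        self._entries.pop(key, None)
```

Repeated get calls for the same table within the TTL hit the cache, while invalidate forces the next call to reload, which mirrors the two invalidation paths listed above.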

How to Optimize SeaTunnel and DataX Integration

Effective integration maximizes throughput and reliability.

Align batch size and parallelism settings between SeaTunnel and DataX.
Enable SeaTunnel’s checkpointing so a failed job can resume from the last checkpoint instead of reprocessing from the start.
Use column projection to transfer only required fields, reducing data volume.
Monitor connector metrics (read/write rates, error counts) via Prometheus or built‑in dashboards.
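
Column projection is easy to get wrong when queries are assembled by hand. As a minimal sketch, the hypothetical helper below builds a projected SELECT statement and rejects names that are not plain identifiers; it assumes the source connector accepts a SQL query string, which you should confirm for the connector you use.

```python
import re

# Plain identifiers only: letters, digits, underscore, not starting with a digit.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def projection_query(table, columns):
    """Build a SELECT that reads only the listed columns from one table."""
    if not columns:
        raise ValueError("at least one column is required")
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT {', '.join(columns)} FROM {table}"
```

Generating the query from an explicit column list also documents which fields the pipeline depends on.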