What is Apache DolphinScheduler?
Apache DolphinScheduler is an open‑source distributed workflow scheduling system designed for large‑scale data processing and ETL pipelines.
- Provides visual DAG (Directed Acyclic Graph) design.
- Supports task dependencies, fault tolerance, and resource isolation.
- Integrates with major big‑data components (Spark, Flink, Hadoop, etc.).
What is Apache SeaTunnel?
Apache SeaTunnel (formerly Waterdrop) is a unified, high‑performance data integration platform for batch and streaming workloads.
- Offers a pluggable connector ecosystem.
- Supports real‑time synchronization between heterogeneous data sources.
- Can be used alongside DataX for batch data migration.
Why Use DolphinScheduler and SeaTunnel Together?
DolphinScheduler handles orchestration (scheduling, dependencies, retries, alerting) and SeaTunnel handles data movement, so together they cover an end‑to‑end data pipeline.
- Schedule and monitor SeaTunnel jobs as first‑class tasks.
- Leverage DolphinScheduler’s retry and alert mechanisms for SeaTunnel failures.
- Achieve consistent metadata management and caching across pipelines.
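As a rough illustration of what "retry on SeaTunnel failure" means in practice, the sketch below wraps a SeaTunnel launch in a retry loop. The install path, job config name, and retry settings are assumptions made for the example; inside DolphinScheduler the retry count and interval are set on the task rather than coded by hand.

```python
import subprocess
import time
from typing import Callable, List


def seatunnel_command(home: str, config_path: str) -> List[str]:
    """Build the command line for one SeaTunnel job (both paths are assumptions)."""
    return [f"{home}/bin/seatunnel.sh", "--config", config_path]


def run_with_retry(
    command: List[str],
    max_retries: int = 3,
    retry_interval_s: float = 60.0,
    runner: Callable[[List[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the command until it exits with 0 or the retries are used up.

    Returns the number of attempts made; raises RuntimeError if all attempts fail.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if runner(command) == 0:
            return attempts
        if attempts <= max_retries:
            # Wait before the next attempt, as a scheduler retry interval would.
            sleep(retry_interval_s)
    raise RuntimeError(f"job failed after {attempts} attempts: {' '.join(command)}")
```

The `runner` and `sleep` parameters exist only so the loop can be exercised without launching a real job; an alerting hook would go where the `RuntimeError` is raised.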
How to Deploy a Production‑Grade DolphinScheduler Cluster (3.2.0)
Follow these steps to set up a reliable, scalable DolphinScheduler environment.
- Prepare three node types: Master, Worker, and Database (MySQL/PostgreSQL).
- Install a supported JDK (the 3.2.0 documentation lists JDK 1.8 or 11), Docker (optional), and required system packages.
- Configure `bin/env/dolphinscheduler_env.sh` (the environment file in 3.x releases) with proper JVM options and resource limits.
- Initialize the database schema using the provided SQL scripts.
- Start services in order: Master, then Workers, and finally the API server.
- Verify cluster health via the web UI and health check endpoints.
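The health check in the last step can be scripted. The sketch below polls one health URL per service and reports which ones answer with status `UP`. The host names, ports, and paths are assumptions (Spring Boot actuator style endpoints); substitute the values from the actual deployment.

```python
import json
import urllib.request
from typing import Callable, Dict

# Assumed endpoints: hosts, ports, and paths must be adapted to the real cluster.
HEALTH_URLS = {
    "master": "http://master-1:5679/actuator/health",
    "worker": "http://worker-1:1235/actuator/health",
    "api": "http://api-1:12345/dolphinscheduler/actuator/health",
}


def fetch_json(url: str, timeout_s: float = 5.0) -> dict:
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def check_cluster(
    urls: Dict[str, str],
    fetch: Callable[[str], dict] = fetch_json,
) -> Dict[str, bool]:
    """Return {service: healthy}; unreachable or malformed responses count as unhealthy."""
    result = {}
    for name, url in urls.items():
        try:
            result[name] = fetch(url).get("status") == "UP"
        except (OSError, ValueError):
            # Network errors (OSError) and invalid JSON (ValueError) both mean "not healthy".
            result[name] = False
    return result


if __name__ == "__main__":
    for service, healthy in check_cluster(HEALTH_URLS).items():
        print(f"{service}: {'UP' if healthy else 'DOWN'}")
```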
How to Transfer Workflows from Apache Airflow to DolphinScheduler (Air2phin)
Air2phin is a rule‑based migration tool that converts Airflow DAG files into workflow definitions for the DolphinScheduler Python API (PyDolphinScheduler).
- Export Airflow DAG files (Python) to a local directory.
- Run Air2phin against the source directory to generate the converted definitions.
- Review the generated definitions for compatibility; operators without a conversion rule need manual changes.
- Submit the converted workflows to DolphinScheduler through the Python API.
- Test the imported workflow and adjust task parameters as needed.
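Before running the migration, it can help to know which Airflow classes the exported DAG files actually use, since anything unusual is likely to need manual attention afterwards. The sketch below is a standalone inventory script built on Python's `ast` module; it is not part of Air2phin, and the directory layout is an assumption.

```python
import ast
from collections import Counter
from pathlib import Path
from typing import List


def airflow_imports(source: str) -> List[str]:
    """Return fully qualified names imported via `from airflow... import ...` in one file."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and (node.module or "").split(".")[0] == "airflow":
            names.extend(f"{node.module}.{alias.name}" for alias in node.names)
    return names


def inventory(dag_dir: str) -> Counter:
    """Count Airflow imports across every .py file under dag_dir."""
    counts = Counter()
    for path in Path(dag_dir).rglob("*.py"):
        counts.update(airflow_imports(path.read_text(encoding="utf-8")))
    return counts


if __name__ == "__main__":
    # "./dags" is a placeholder for the directory holding the exported DAG files.
    for name, count in inventory("./dags").most_common():
        print(f"{count:4d}  {name}")
```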
How to Regularly Delete Log Instances in DolphinScheduler
Log retention prevents storage bloat and maintains performance.
- Configure `log.cleaner.enable=true` in `conf/dolphinscheduler.properties`.
- Set `log.cleaner.days` to the desired retention period (e.g., 30).
- Optionally schedule a cron job that runs the built‑in `log-cleaner.sh` script.
- Monitor the `dolphinscheduler_log` table to ensure old entries are purged.
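For log files on disk, the same retention idea can be sketched as a small script run from cron. This is an independent illustration, not DolphinScheduler's own cleaner: the log directory, the `*.log` pattern, and the 30-day default are all assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_old_logs(log_dir: str, retention_days: int, now: Optional[float] = None) -> List[str]:
    """Delete *.log files under log_dir that were last modified before the retention window.

    Returns the sorted list of deleted paths.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    # Materialize the listing first so files are not deleted while the tree is being walked.
    for path in list(Path(log_dir).rglob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return sorted(removed)


if __name__ == "__main__":
    # Hypothetical log location; point this at the real worker/master log directory.
    for deleted in purge_old_logs("/opt/dolphinscheduler/logs", retention_days=30):
        print("removed", deleted)
```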
How to Enable Auto‑Start for DolphinScheduler Services
Auto‑start ensures services recover after a reboot.
- Create systemd unit files for each component (master, worker, api, alert).
- Set `WantedBy=multi-user.target` and `Restart=on-failure`.
- Enable the services: `systemctl enable dolphinscheduler-master` (repeat for others).
- Start them immediately with `systemctl start` and verify status.
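Since the unit files differ only in the component name, they can be generated from one template. The sketch below renders the text of a unit file per component. The install path, service user, `Type=forking`, and the use of `dolphinscheduler-daemon.sh` as the start/stop command are assumptions to adapt; only `Restart=on-failure` and `WantedBy=multi-user.target` come from the steps above.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=Apache DolphinScheduler {component}
After=network-online.target

[Service]
Type=forking
User={user}
ExecStart={home}/bin/dolphinscheduler-daemon.sh start {component}-server
ExecStop={home}/bin/dolphinscheduler-daemon.sh stop {component}-server
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""

COMPONENTS = ("master", "worker", "api", "alert")


def unit_name(component: str) -> str:
    """File name of the unit, e.g. dolphinscheduler-master.service."""
    return f"dolphinscheduler-{component}.service"


def render_unit(component: str, home: str, user: str) -> str:
    """Fill the template for one component; home and user are deployment-specific."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    return UNIT_TEMPLATE.format(component=component, home=home, user=user)


if __name__ == "__main__":
    for name in COMPONENTS:
        print(f"# {unit_name(name)}")
        print(render_unit(name, home="/opt/dolphinscheduler", user="dolphinscheduler"))
```

Each rendered file would be saved under `/etc/systemd/system/` and followed by `systemctl daemon-reload` before enabling the service.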
How to Upgrade DolphinScheduler from 1.3.4 to 3.1.2
Upgrading across major versions requires careful planning.
- Backup the existing database and configuration files.
- Review the release notes for breaking changes (e.g., schema modifications, removed APIs).
- Upgrade the database schema using the migration scripts provided in the 3.x release.
- Install the new binaries and copy over custom configurations.
- Restart services and validate functionality through the UI and API.
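The backup step is the one most worth scripting, because everything after it modifies state. The sketch below copies the configuration directory to a timestamped folder and builds a `mysqldump` command for the metadata database. All paths, the host, the user, and the database name are placeholders, and a PostgreSQL deployment would need `pg_dump` instead.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def backup_config(conf_dir: str, backup_root: str, stamp: Optional[str] = None) -> Path:
    """Copy conf_dir to <backup_root>/conf-<timestamp> and return the new path."""
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"conf-{stamp}"
    shutil.copytree(conf_dir, target)  # fails if the target already exists
    return target


def mysqldump_command(host: str, user: str, database: str, out_file: str) -> List[str]:
    """Build a mysqldump invocation; -p makes mysqldump prompt for the password."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        "-h", host,
        "-u", user,
        "-p",
        f"--result-file={out_file}",
        database,
    ]


if __name__ == "__main__":
    # Placeholder values; replace with the real installation details.
    print(backup_config("/opt/dolphinscheduler/conf", "/backup/dolphinscheduler"))
    print(" ".join(mysqldump_command("db-1", "ds_user", "dolphinscheduler", "/backup/ds.sql")))
```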
How SeaTunnel Metadata Caching Works
Metadata caching reduces latency when accessing schema information from source/target systems.
- During job initialization, SeaTunnel queries source metadata (tables, columns, types).
- The metadata is stored in an in‑memory cache (e.g., Guava Cache) with a configurable TTL.
- Subsequent tasks reuse the cached metadata, avoiding repeated network calls.
- Cache invalidation occurs when schema changes are detected or TTL expires.
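The load, reuse, and expire cycle described above can be sketched as a small TTL cache. This is a Python illustration of the idea, not SeaTunnel's implementation; the class name, the loader function, and the table key format are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCache:
    """In-memory cache whose entries expire ttl_s seconds after they were loaded."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader   # fetches metadata from the source system
        self._ttl_s = ttl_s
        self._clock = clock     # injectable so expiry can be tested without waiting
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        """Return the cached value, reloading it if it is missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop one entry, e.g. after a schema change is detected."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    def load_schema(table: str) -> dict:
        # Stand-in for a real metadata query against the source database.
        print(f"querying metadata for {table}")
        return {"id": "BIGINT", "name": "VARCHAR"}

    cache = TtlCache(load_schema, ttl_s=300)
    cache.get("shop.orders")  # triggers the metadata query
    cache.get("shop.orders")  # served from the cache, no second query
```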
How to Optimize SeaTunnel and DataX Integration
The settings below target throughput and failure recovery when SeaTunnel and DataX run in the same pipeline.
- Align batch size and parallelism settings between SeaTunnel and DataX.
- Enable SeaTunnel’s checkpointing to recover from failures without reprocessing.
- Use column projection to transfer only required fields, reducing data volume.
- Monitor connector metrics (read/write rates, error counts) via Prometheus or built‑in dashboards.
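Two of these ideas, column projection and a shared batch size, are easy to show in isolation. The sketch below is generic Python rather than SeaTunnel or DataX configuration: it drops unneeded fields from each row and groups the rows into fixed-size batches. The column names and the batch size are made up for the example.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, object]


def project(rows: Iterable[Row], columns: Sequence[str]) -> Iterator[Row]:
    """Keep only the requested columns from each row."""
    for row in rows:
        yield {name: row[name] for name in columns}


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Group rows into lists of batch_size; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    source = [{"id": i, "name": f"user{i}", "notes": "x" * 100} for i in range(5)]
    # Only id and name are transferred; the wide notes column never leaves the source.
    for batch in batched(project(source, ["id", "name"]), batch_size=2):
        print(batch)
```

Using the same `batch_size` on the reading and writing side avoids one tool buffering far more rows than the other is ready to accept.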