Data · Free · Open Source
APACHE HUDI
Incremental data processing for data lakes
Apache-2.0
ABOUT
Traditional data lakes lack ACID transactions and cannot efficiently handle incremental updates, deletions, or point-in-time queries. Apache Hudi addresses this by bringing database-style upserts, deletes, and incremental pulls to data lakes. This enables near-real-time ingestion, streaming ETL, and time-travel queries on cloud object stores such as S3 and on distributed file systems such as HDFS, using existing Spark and Flink pipelines.
INSTALL
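# Via Spark packages (bundle names include the Spark and Scala versions;
# the Spark 3.5 / Scala 2.12 bundle is shown, pick the one matching your cluster):
spark-shell --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
# Via Maven:
# Add the matching Hudi Spark bundle (groupId org.apache.hudi) as a dependency in pom.xml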
INTEGRATION GUIDE
1. Streaming data ingestion: upsert real-time event streams into data lakes with ACID guarantees and no data loss
2. Change data capture (CDC): capture and apply database changelogs to data lakes for analytics without full recomputes
3. Data lake de-duplication: remove duplicate records from large-scale data lakes without reprocessing entire datasets
4. Time-travel queries: query historical versions of data for audit trails, regulatory compliance, and debugging
5. Incremental ETL: process only the records that changed since the last pipeline run, which can substantially reduce compute costs compared with full recomputation
6. Unified batch and streaming: run both batch and streaming workloads on the same dataset with consistent guarantees
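Several of the use cases above (upserts, deletes, incremental ETL, time-travel queries) are driven by options passed to Hudi's Spark datasource. The following is a minimal Python sketch of how those options might be assembled. The helper function names, the table name, and the field names are hypothetical and chosen only for illustration; the option keys are the ones documented for the Hudi 0.x Spark datasource and should be checked against the Hudi version in use.

```python
from typing import Dict, Optional

# Write operations accepted by this sketch (a subset of what Hudi supports).
_WRITE_OPERATIONS = {"upsert", "insert", "bulk_insert", "delete"}


def hudi_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: Optional[str] = None,
    operation: str = "upsert",
) -> Dict[str, str]:
    """Build write options for a Hudi table (hypothetical helper).

    record_key identifies a record; when several incoming rows share a key,
    the row with the largest precombine_field value is kept.
    """
    if operation not in _WRITE_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }
    if partition_field is not None:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    return options


def hudi_incremental_read_options(
    begin_instant: str, end_instant: Optional[str] = None
) -> Dict[str, str]:
    """Build read options that return only records changed after begin_instant."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def hudi_time_travel_options(as_of_instant: str) -> Dict[str, str]:
    """Build read options that query the table as of a past instant."""
    return {"as.of.instant": as_of_instant}


# Usage with PySpark (requires a Spark session started with a Hudi bundle;
# `df`, `spark`, and `base_path` are assumed to exist):
#
#   df.write.format("hudi") \
#       .options(**hudi_write_options("events", "event_id", "ts", "event_date")) \
#       .mode("append").save(base_path)
#
#   changed = spark.read.format("hudi") \
#       .options(**hudi_incremental_read_options("20240101000000")) \
#       .load(base_path)
```

Keeping the options in plain dictionaries means the same settings can be reused by batch and streaming jobs that write to the same table, and they can be unit-tested without a running Spark cluster.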
TAGS
data-lake, big-data, spark, apache, java, data-processing, incremental, etl