
15 posts tagged with "Data insights"

How to Prevent Loops in Redis Bidirectional Sync

· 7 min read
Barry

In scenarios such as cross-data-center deployments, master-slave failover, and hybrid cloud architectures, bidirectional Redis sync is a common requirement. The hardest part isn’t setting up the sync itself, but preventing data from bouncing endlessly between two instances.

This blog walks you through why loops happen and two approaches to stop them, and then shows you how to set up an anti-loop bidirectional pipeline step by step.

Why Do Loops Happen in Bidirectional Sync?

Take two Redis instances, A and B, as an example, with sync tasks configured in both directions: A→B and B→A.

Data written to A is synchronized to B. Once B receives it, the data is sent back to A. Without a loop detection mechanism, the same event just ping-pongs between A and B endlessly.

BladePipe already solves this for MySQL and PostgreSQL, using incremental event tags and transaction records respectively to filter loop events. Each sync task checks whether a transaction contains a marker, and if so, filters it out, breaking the data loop.

But Redis makes things trickier:

  • Redis commands can be very granular (e.g., INCR key) and are not always executed within a transaction.
  • Redis transactions (MULTI/EXEC) differ from traditional relational database transactions and do not have full atomicity.

So, how can we design Redis bidirectional sync?

Solution 1: Auxiliary Tags

Based on the approach used in traditional database bidirectional sync, a straightforward loop-prevention method in Redis is to use auxiliary commands for loop detection. When a normal command is received, its hash value is calculated, and an auxiliary command key is generated. By checking whether the corresponding auxiliary key exists in the opposite direction, BladePipe can determine if a loop has occurred; if it exists, the event is filtered.
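
To make the idea concrete, here is a minimal Python sketch of auxiliary-tag loop detection using the redis-py client. The marker key prefix, hash function, and TTL are illustrative assumptions, not BladePipe's actual implementation.

```python
import hashlib

import redis

# Assumed marker naming scheme and expiry; BladePipe's real scheme may differ.
MARKER_PREFIX = "__sync_marker__:"
MARKER_TTL_SEC = 60

def auxiliary_key(command: str, *args) -> str:
    """Hash the command and its arguments to derive the auxiliary key."""
    raw = " ".join([command, *map(str, args)])
    return MARKER_PREFIX + hashlib.sha1(raw.encode()).hexdigest()

def forward_apply(target: redis.Redis, command: str, *args) -> None:
    """Forward task: write an auxiliary marker alongside the real command."""
    target.set(auxiliary_key(command, *args), 1, ex=MARKER_TTL_SEC)
    target.execute_command(command, *args)

def reverse_should_skip(source: redis.Redis, command: str, *args) -> bool:
    """Reverse task: if the marker exists, this event came from the forward task."""
    marker = auxiliary_key(command, *args)
    if source.get(marker) is not None:
        source.delete(marker)  # consume the marker so later identical writes still sync
        return True
    return False
```

The extra SET/GET/DELETE round trips per event are exactly where the 3-4x command overhead mentioned below comes from.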

The advantages of this approach are:

  • Simplicity: It's straightforward to implement.
  • High adaptability: It works for both standalone and clustered Redis deployments.

However, there are also drawbacks:

  • High performance overhead: For each event, the number of commands is theoretically increased by 3 to 4 times, adding write pressure on Redis.
  • Ambiguity in edge cases: In certain extreme scenarios, such as when an application performs similar write operations on the target instance, the reverse sync task may have difficulty distinguishing the source of the commands, which could lead to false positives or even lost updates.

Solution 2: Transaction Tags

Another approach leverages Redis transactions.

Redis transactions (MULTI ... EXEC) differ from those in relational databases: they do not support rollback, so they lack full atomicity. However, they have a key property: all commands within a transaction are executed in order, and no commands from other clients are interleaved during execution.

Based on this property, the forward sync task wraps each source command in a transaction and inserts a marker command as the first operation. When the reverse task encounters a transaction, it treats it as a potential loop event from the forward task. By checking whether the first command in the transaction is a marker, BladePipe can determine whether the entire transaction is part of a loop. If it is, the transaction is filtered out entirely.
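
Below is a rough Python sketch of the transaction-tag idea using redis-py. The marker key and the parsing of the replicated transaction are assumptions for illustration; they stand in for what BladePipe does internally when deCycleMode is set to TX_SIGN.

```python
import redis

LOOP_MARKER = "__bp_tx_sign__"  # hypothetical marker key

def forward_write(target: redis.Redis, command: str, *args) -> None:
    """Forward task: wrap each replicated command in MULTI/EXEC with a leading marker."""
    pipe = target.pipeline(transaction=True)  # issues MULTI ... EXEC
    pipe.set(LOOP_MARKER, 1)                  # marker is always the first command
    pipe.execute_command(command, *args)      # the actual replicated change
    pipe.execute()

def reverse_filter(transaction_commands):
    """Reverse task: drop the whole transaction if its first command is the marker.

    `transaction_commands` is a list of (command, args) tuples parsed from one
    MULTI/EXEC block in the source's replication stream.
    """
    if transaction_commands:
        first_cmd, first_args = transaction_commands[0]
        if first_cmd.upper() == "SET" and first_args and first_args[0] == LOOP_MARKER:
            return []  # loop event originating from the forward task: filter it out
    return transaction_commands
```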

The advantages of this approach are:

  • Better performance: There is no need to maintain additional markers for each command, reducing system overhead.
  • Simple logic: By checking the beginning of a transaction, BladePipe can quickly determine loop events without comparing commands one by one.
  • Lower Redis pressure: Filtering is handled within BladePipe, reducing the load on Redis.

However, it is important to note that in sharded cluster mode, Redis transactions don’t work across shards. Therefore, the transaction-tag approach is best suited for standalone or master-slave scenarios.

Hands-On Demo with BladePipe

BladePipe supports both approaches mentioned above. You can adjust the filtering mode via the deCycleMode parameter in the console.

Let’s look at how to quickly set up Redis bidirectional sync with the transaction tag method using BladePipe.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Step 2: Add DataSources

Log in to the BladePipe Cloud. Click DataSource > Add DataSource.

It is recommended to edit the DataSource descriptions so that the two instances are easy to tell apart when you configure the two-way DataJobs.

Step 3: Create Forward DataJob

  1. Click DataJob > Create DataJob.
  2. Select the source and target DataSources, and click Test Connection to ensure that the connections to both the source and target DataSources are successful.
  3. In the Properties page:
    1. Select Incremental for DataJob Type, together with the Full Data option.
    2. Grey out Start Automatically so that you can set parameters after the DataJob is created.
  4. Confirm the DataJob creation.
  5. Click Details > Functions > Modify DataJob Params:
    1. Choose the Source tab, and set deCycle to true and deCycleMode to TX_SIGN.
    2. Click Save.
  6. Start the DataJob.

Step 4: Create Reverse DataJob

  1. Click DataJob > Create DataJob.
  2. Select the source and target DataSources (the reverse of the forward DataJob), and click Test Connection to ensure that the connections to both the source and target DataSources are successful.
  3. In the Properties page:
    1. Select Incremental, and DO NOT check the Full Data option.
    2. Grey out Start Automatically so that you can set parameters after the DataJob is created.
  4. Confirm the DataJob creation.
  5. Click Details > Functions > Modify DataJob Params:
    1. Choose the Source tab, and set deCycle to true and deCycleMode to TX_SIGN.
    2. Click Save.
  6. Start the DataJob. The forward and reverse DataJobs are now running with sub-second latency.

Step 5: Verify the Results

Make changes in the source Redis and check the monitoring charts. You'll find that the forward DataJob registers the changes, while the reverse one does not, indicating that no data loop has occurred.

Make changes in the target Redis and check the monitoring charts. You'll find that the reverse DataJob registers the change, while the forward one does not, indicating that no data loop has occurred.

Create a data verification task, and you can see that the data in both instances remains consistent.

Conclusion

The hardest part of Redis bidirectional sync isn’t syncing. It’s stopping the endless loop of changes. We analyzed two approaches:

  • Auxiliary tag: simple and universal, but with performance overhead. For sharded clusters, auxiliary markers may still be the practical choice.
  • Transaction tag: lightweight and efficient, recommended for most standalone and master-slave setups.

If you are planning or designing Redis bidirectional sync, give BladePipe a try to get started quickly. If you have further questions about bidirectional data sync, feel free to join the discussion.

Kafka vs RabbitMQ vs RocketMQ vs Pulsar in 2025 - Key Differences

· 6 min read
John Li

Message brokers are the backbone of modern distributed systems. Whether it’s log ingestion, order processing, or building a real-time data warehouse, they ensure data flows reliably between services. Among the open-source options, Kafka, RabbitMQ, RocketMQ, and Pulsar are the most widely discussed. Each has its strengths and trade-offs, and developers often struggle with which one to pick.

In this post, I’ll break down these four systems across architecture, performance, scalability, and reliability, and provide a clear side-by-side comparison to help you make an informed decision.

Architecture at a Glance

Kafka
Kafka is built around a distributed log. Producers write to Brokers, which store messages in partitioned logs. Consumers pull messages sequentially. Kafka originally relied on ZooKeeper for metadata but is moving toward its own metadata service (KRaft).
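
As a quick illustration of this log-based, pull model, here is a minimal producer/consumer pair using the kafka-python client; the broker address and topic name are placeholders.

```python
from kafka import KafkaConsumer, KafkaProducer

BOOTSTRAP = "localhost:9092"  # placeholder broker address

# The producer appends records to a partitioned log on the broker.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
producer.send("orders", key=b"order-123", value=b'{"status": "paid"}')
producer.flush()

# The consumer pulls records sequentially, starting from the earliest offset.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new records arrive
)
for record in consumer:
    print(record.offset, record.key, record.value)
```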

RabbitMQ
RabbitMQ implements the AMQP protocol. Messages first go to an Exchange, which routes them to Queues based on rules. Consumers then receive messages from these queues. Its flexible routing (direct, topic, fanout, headers) makes it a great fit for complex messaging patterns.
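
The exchange-and-queue routing model can be sketched with the pika client as follows; the exchange, queue, and routing key names are arbitrary examples.

```python
import pika

# Connect to a local broker (placeholder address).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Messages are published to an exchange; a binding routes them to a queue.
channel.exchange_declare(exchange="orders", exchange_type="direct")
channel.queue_declare(queue="billing")
channel.queue_bind(queue="billing", exchange="orders", routing_key="order.paid")

# The exchange, not the queue, decides where this message lands.
channel.basic_publish(exchange="orders", routing_key="order.paid",
                      body=b'{"order_id": 123}')

# RabbitMQ pushes deliveries to the registered callback.
def handle(ch, method, properties, body):
    print("received:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="billing", on_message_callback=handle)
# channel.start_consuming()  # blocks; omitted in this sketch
connection.close()
```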

RocketMQ
RocketMQ uses a lightweight NameServer and Broker architecture. Producers fetch routing information from NameServers, then write to Broker queues. It supports transactional and ordered messages, making it popular in e-commerce and finance.

Pulsar
Pulsar features an architecture with separated compute (Brokers) and storage (BookKeeper). This design enables near-unlimited storage scaling and tiered storage, and it is cloud-native by default.

Performance

When it comes to performance, three aspects matter most: throughput, latency, and backlog handling.

| Metric | Kafka | RabbitMQ | RocketMQ | Pulsar |
| --- | --- | --- | --- | --- |
| Throughput | Very high (hundreds of thousands to millions TPS) | Moderate (tens of thousands per node) | High (hundreds of thousands TPS) | High (hundreds of thousands TPS) |
| Latency | Low (tens of ms) | Very low (single-digit ms) | Low (tens of ms) | Low (tens of ms) |
| Backlog handling | Excellent, supports long-term storage and replay | Limited, backlog can cause performance issues | Strong, supports large-scale backlogs | Strong, with tiered storage for long-term retention |

PS: The numbers are for reference. For precise performance statistics, please check official benchmark reports.

Scalability

Kafka
Kafka scales horizontally via partitions. A single topic can be split into many partitions, processed in parallel across brokers and consumers. Production clusters can grow to thousands of brokers to support real-time data streaming.

RabbitMQ
RabbitMQ scales through clustering, but queues must replicate across nodes, adding significant overhead. This makes it less ideal for massive-scale workloads.

RocketMQ
RocketMQ scales by adding brokers and queues. Storage and consumers can expand independently, and nodes can be added without downtime, making it well-suited for large distributed systems.

Pulsar
Pulsar leverages compute-storage separation, which gives it excellent scalability. To increase throughput, you can add brokers. To expand storage, you can add BookKeeper nodes. Combined with multi-tenancy, Pulsar scales smoothly in cloud-native environments.

Reliability

Kafka
Kafka relies on partition replicas for durability. It guarantees at-least-once delivery by default, with exactly-once possible via idempotence and transactions. Kafka is very mature in large-scale distributed environments.

RabbitMQ
RabbitMQ uses message persistence and replicated queues. Since 3.8, Quorum Queues (based on Raft) have been available to improve reliability. It guarantees at-least-once delivery, but duplicates are possible, which requires idempotent consumers.

RocketMQ
RocketMQ uses master-slave replication and configurable flush strategies (sync/async). The DLedger mode, based on Raft, enables automatic leader failover and stronger fault tolerance.

Pulsar
Pulsar stores messages in BookKeeper with multi-replica persistence. That means broker failures don’t affect stored data. Its multi-tenancy and strong isolation make it a natural fit for cloud-native setups.

Feature Comparison Table

| Feature | Kafka | RabbitMQ | RocketMQ | Pulsar |
| --- | --- | --- | --- | --- |
| Language | Java/Scala | Erlang | Java | Java |
| Message consumption | Pull | Push | Pull | Pull + Push |
| Throughput | Very high | Moderate | High | High |
| Latency | Low | Very low | Low | Low |
| Backlog handling | Excellent (replayable) | Limited | Strong | Strong (tiered storage) |
| Scalability | Excellent (partitions) | Moderate | Strong | Excellent (compute-storage separation) |
| Reliability | Excellent (replication, EOS support) | Good (Quorum Queue) | Strong (DLedger) | Excellent (BookKeeper) |
| Protocols | Kafka protocol | AMQP, MQTT, STOMP | Native + extensions | Native + extensions |
| Ecosystem | Richest, strongest community | Stable, plugin-rich | Strong in Asia, good cloud support | Growing fast, cloud-native |
| Use cases | Log ingestion, real-time analytics, data bus | Real-time communication, task scheduling, RPC | E-commerce, finance, payments | SaaS platforms, multi-datacenter streaming |

How to Choose Between Them

Choosing the right broker depends heavily on your use case and priorities:

  • Choose Kafka if you need extremely high throughput, large-scale data ingestion, or replayable logs for analytics. It’s the de facto standard in big data ecosystems.

  • Choose RabbitMQ if your workloads demand very low latency, flexible routing, or traditional message queue patterns like task scheduling or RPC. It’s also beginner-friendly and battle-tested in smaller systems.

  • Choose RocketMQ if you need strict ordering, transactional messaging, or operate in financial/e-commerce domains where consistency is critical.

  • Choose Pulsar if you’re building cloud-native, multi-tenant, or geo-distributed systems. Its compute-storage separation and tiered storage make it ideal for modern, elastic deployments.

BladePipe: Simplifying Data Streaming into Message Brokers

Picking a message broker is only half the battle. The next challenge is moving data into it reliably and in real time.

That’s where BladePipe comes in. BladePipe is a real-time end-to-end data integration platform built for developers and DBAs. Key benefits include:

  • Real-time, low latency: It captures database changes via CDC and syncs them into Kafka, RabbitMQ, RocketMQ, and Pulsar within seconds.
  • One-stop support: A single tool to feed multiple brokers, no custom sync pipelines required.
  • Automation & visibility: A clean UI for configuration, monitoring, and operations, reducing maintenance overhead.
  • Flexible deployment: It is available in both self-hosted and SaaS versions, fitting startups and enterprises alike.

Read more: Stream Data from MySQL to Kafka

With BladePipe, teams can focus less on building fragile data pipelines and more on building value on top of their data. Whether you’re powering a real-time data warehouse or supporting multi-cloud active-active systems, BladePipe ensures your data keeps flowing smoothly.

DynamoDB vs MongoDB in 2025 - Key Differences, Use Cases

· 6 min read
Zoe

Choosing the right database for a given application is always a problem for data engineers. Two popular NoSQL database options that frequently come up are AWS DynamoDB and MongoDB. Both offer scalability and flexibility but differ significantly in their architecture, features, and operational characteristics. This blog provides a comprehensive comparison to help you make an informed decision.

What is Amazon DynamoDB?

Amazon DynamoDB is Amazon’s fully managed, serverless NoSQL service. It supports both key–value and document data, scales automatically, and delivers single-digit millisecond response times at any size. Features like global tables, on-demand scaling, and tight integration with AWS services make it a go-to for high-scale workloads.

Key Strengths:

  • Fully managed service: No server to manage. DynamoDB automatically partitions data and scales throughput, eliminating operational overhead.
  • Low-latency at scale: It is designed for consistent millisecond latency for reads and writes, even under heavy load.
  • Deep AWS integration: It integrates natively with Lambda, API Gateway, Kinesis, CloudWatch, and IAM, simplifying the building of serverless architectures.
  • Global replication: Global tables offer multi-region, active-active replication that automatically keeps multiple copies of a DynamoDB table in sync across different AWS Regions.

Pricing:
DynamoDB has two pricing modes: On‑Demand (pay per request) and Provisioned (buy read/write capacity units). On-demand is simple for unpredictable or spiky traffic, while provisioned is more cost-efficient for steady high throughput.

For storage, the first 25 GB per month is free, and then $0.25 per GB per month is charged.

Additional costs apply for backup, global tables, change data capture, etc.

What is MongoDB?

MongoDB is a document database that stores data as BSON (binary JSON) documents. It’s flexible, schema-optional, and supports rich queries, secondary indexes, and powerful aggregation pipelines. You can self-host it or use MongoDB Atlas, the managed service that runs on AWS, Azure, or GCP.

Key Strengths:

  • Flexible Data Model: Documents allow for embedding and nested structures, accommodating complex and evolving data.
  • Various ad-hoc queries: It supports a wide range of queries, including field-based queries, regular expressions, and geospatial queries.
  • Rich indexing & analytics: It supports compound, text, geospatial, wildcard and partial indexes. Aggregation pipeline enables complex transformations and analytics inside the DB.
  • ACID Transactions: It supports multi-document ACID transactions (since v4.0), ensuring data consistency even when operations encounter unexpected errors.

Pricing:
For self-managed MongoDB, you pay the infrastructure costs (servers, storage, networking) on your chosen platform; the Enterprise edition additionally requires a commercial subscription.

MongoDB Atlas (managed service) has a free tier, shared tiers, and dedicated clusters billed hourly (pay‑as‑you‑go). Pricing depends on cloud provider, instance family, vCPU/RAM, storage, backup retention, and data transfer.

DynamoDB vs MongoDB At a Glance

| Feature | DynamoDB | MongoDB |
| --- | --- | --- |
| Type | Fully managed NoSQL database (AWS) | Document NoSQL database |
| Deployment | AWS only | On-premise / MongoDB Atlas (managed on multiple cloud providers) |
| Data Model | Key-value and document | Document |
| Max Document Size | 400 KB per item | 16 MB per document |
| Query Language | Primary key lookups, range queries, secondary indexes; limited aggregation | Support ad-hoc queries, joins, and advanced aggregation pipeline |
| Scalability | Automatic partitioning and scaling | Manual or automated scaling via sharding and replica sets |
| Consistency | Eventually consistent by default, optional strong consistency; multi-item ACID transactions | Tunable consistency levels; multi-document ACID transactions |
| Performance | Single-digit millisecond response time | Varies based on configuration |
| Security | Integrated with AWS IAM | Role-Based Access Control |
| Multi-Region Support | Built-in via global tables (active-active) | Atlas Global Clusters or custom sharding |
| Integration | Deep AWS integration | Broad ecosystem, multi-cloud support |
| Vendor Lock-in | High (AWS only) | Lower (run on multiple clouds or on-prem) |

Core Features Comparison

Data Model & Query

DynamoDB:

  • Employ a key-value store with support for document structures.
  • Optimized for fast lookups based on the primary key.
  • Global and local secondary indexes for additional access paths.
  • Limited aggregation support.

MongoDB:

  • A document-oriented database where data is stored in BSON documents within collections.
  • Expressive query language that supports many operators.
  • Powerful aggregation pipelines allow for complex in-database transformations.
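
To make the contrast concrete, the sketch below pairs a primary-key lookup with boto3 against DynamoDB and an ad-hoc aggregation with PyMongo against MongoDB. Table, collection, and field names are hypothetical.

```python
import boto3
from pymongo import MongoClient

# --- DynamoDB: optimized for known-key access patterns ---
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # assumed region
orders_table = dynamodb.Table("orders")                          # hypothetical table
item = orders_table.get_item(Key={"order_id": "123"}).get("Item")

# --- MongoDB: ad-hoc queries and in-database aggregation ---
client = MongoClient("mongodb://localhost:27017")  # placeholder URI
orders = client["shop"]["orders"]                  # hypothetical collection
top_products = list(orders.aggregate([
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$product_id", "sales": {"$sum": "$amount"}}},
    {"$sort": {"sales": -1}},
    {"$limit": 10},
]))
```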

Scalability and Performance

DynamoDB:

  • Automatic horizontal scaling of both storage and throughput.
  • Single-digit millisecond latency at any scale.
  • Handle huge throughput with AWS-managed partitioning.

MongoDB:

  • Scale via sharding and replica sets.
  • Effort is required to set up and manage sharding.
  • Performance depends on query patterns, indexing, and the chosen consistency level.

Consistency

DynamoDB:

  • Eventually consistent reads by default or strongly consistent reads at a cost of higher latency.
  • ACID transactions across one or more tables within a single AWS region.

MongoDB:

  • Offer various read concerns to control the consistency and isolation of read operations.
  • ACID transactions for multi-document operations.
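
As a small illustration of these knobs, the snippet below requests a strongly consistent read in DynamoDB and a majority read concern in MongoDB; resource names are placeholders.

```python
import boto3
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern

# DynamoDB: opt into a strongly consistent read per request (at a latency/cost premium).
table = boto3.resource("dynamodb").Table("orders")  # hypothetical table
item = table.get_item(Key={"order_id": "123"}, ConsistentRead=True).get("Item")

# MongoDB: choose a read concern on the collection handle.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"].get_collection("orders", read_concern=ReadConcern("majority"))
doc = orders.find_one({"order_id": "123"})
```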

Availability

DynamoDB:

  • Automatic multi-AZ replication within a region.
  • Automatic regional failover.
  • Global tables for automated multi-region, active-active replication.

MongoDB:

  • Replica sets provide high availability, with one primary node and multiple secondary nodes.
  • Automatic failover via replica set elections when the primary becomes unavailable; Atlas automates operations further in managed clusters.
  • Atlas Global Clusters enable zone sharding to partition data and pin it to specific regions.

How to Choose Between Them

There’s no universal winner. Both are mature, battle-tested products. You may consider the following cases:

Choose DynamoDB if:

  • You are all-in on AWS. DynamoDB integrates seamlessly with other AWS services, making it a natural choice for serverless services built within the AWS ecosystem.
  • Your query patterns are simple and predictable. The ideal use case for DynamoDB is fetching data using a known primary key. It's not designed for complex, ad-hoc queries.
  • You prefer minimal operational burden. DynamoDB is fully managed by AWS, minimizing the operational overhead.

Real-world case: How Disney+ scales globally on Amazon DynamoDB

Choose MongoDB if:

  • You require complex querying and data aggregation. MongoDB's rich query language and aggregation pipelines are well suited to performing data searches and analysis.
  • You need a flexible schema. MongoDB's document model easily accommodates data structure changes.
  • You want deployment flexibility. MongoDB can be run on-premises, on any cloud provider (AWS, GCP, Azure), or as a fully managed service via MongoDB Atlas.

Real-world case: How Novo Nordisk accelerates time to value with GenAI and MongoDB

Stream Data to DynamoDB and MongoDB Easily

In real-world architectures, DynamoDB and MongoDB don’t exist in isolation. They’re part of a larger data ecosystem that needs to move information in and out in real time.

This is where BladePipe fits perfectly. As a real-time, end-to-end data replication tool, it supports 40+ out-of-the-box connectors. It captures data changes (CDC) from multiple sources and continuously syncs them into DynamoDB or MongoDB with sub-second latency. This ensures both databases always have fresh, consistent data without manual ETL jobs or complex pipelines. Both on-premises and cloud deployments are supported.

With BladePipe, teams only need to focus on building applications, not moving data.

How to Build a Real-Time Lakehouse with BladePipe, Paimon and StarRocks

· 6 min read
Barry

In the age of real-time analytics, more businesses want to ingest data into their data lake with low latency and high consistency, and run unified analysis downstream. Apache Paimon, a next-gen lakehouse table format born from the Apache Flink community, is built for exactly this. With fast writes, real-time updates, and strong compatibility, Paimon is an ideal foundation for building a streaming lakehouse architecture.

In this article, we’ll walk through how to build a fully real-time, flexible, and easy-to-maintain lakehouse stack using BladePipe, Apache Paimon, and StarRocks.

What is Apache Paimon?

Apache Paimon is a lakehouse storage format designed for stream processing. It innovatively combines lake format and LSM-tree (Log-Structured Merge Tree), enabling real-time data updates, high-throughput ingestion, and efficient change tracking.

Key Features:

  • Streaming and batch processing: Support streaming write and snapshot read.
  • Primary key support: Enable fast upserts and deletes.
  • Schema evolution: Add, drop, or modify columns without rewriting old data.
  • ACID compliance: Ensure consistency in concurrent reads and writes.
  • Extensive ecosystem: Work with Flink, Spark, StarRocks, and more.
  • Object storage compatibility: Support S3, OSS, and other file systems.

Example: Real-Time Order Tracking
Imagine a large e-commerce platform with a real-time dashboard. Order status changes (e.g., from "Paid" to "Shipped") are supposed to be reflected on the dashboard instantly. How can such real-time data ingestion be achieved?

Traditional Approach (Merge-on-Read):

  • Changes are appended to log files and merged later in batch jobs.
  • Updates are delayed until the merge is complete — often several minutes.

With Paimon (LSM-tree):
Paimon tackles this issue by introducing database-like primary key support.

  • When order statuses change in a transactional database (e.g., MySQL), updates (like UPDATE orders SET status='Shipped' WHERE order_id='123') are immediately written to Paimon.
  • Paimon uses LSM-tree to allow these updates to be read within seconds.

Result: Downstream systems like StarRocks can query updated results in seconds.

Paimon vs. Iceberg: What’s the Difference?

Both Apache Paimon and Apache Iceberg are modern table formats for data lakes, but optimized for different needs.

Paimon is designed for stream processing with an LSM-tree architecture, suitable for cases requiring high-frequency updates and real-time data ingestion. Iceberg focuses on snapshot mechanisms with an emphasis on data consistency, though it is evolving to support near real-time ingestion.

| Feature | Paimon | Iceberg |
| --- | --- | --- |
| Update mechanism | LSM-tree | Copy-on-Write / Merge-on-Read |
| Primary key support | Native support for upsert | Support upsert via Merge-on-Read |
| Streaming write | Yes | Yes |
| Update latency | Seconds or less | Minutes (typically) |
| Ecosystem | New, Flink-native | More mature, broad ecosystem |
| Best for | Real-time data warehouse, CDC, unified streaming and batch processing | Data warehouse, large batch processing, general data lake |

In short, Paimon is better suited for real-time, high-frequency updates. Iceberg is ideal for general-purpose batch workloads and governance.

Building a Real-Time Lakehouse Stack

While you can use Flink to ingest data into Paimon, it often requires managing job state, recovery, and checkpoints, which is a high barrier for many teams.

BladePipe solves this with a lightweight, fully automated solution for real-time ingestion into Paimon.

How it Works:

  • Data sources: Core transaction databases (e.g. MySQL, PostgreSQL), logs (Kafka) and more.

  • BladePipe:

    • Capture changes via log-based CDC, bringing sub-second latency.
    • Support automated structure migration and DDL sync.
    • Offer built-in verification, monitoring, alerting and recovery.
  • Apache Paimon:

    • Ingest real-time data as the lakehouse base.
    • Handle deduplication, partitioning, and compaction using LSM-tree.
    • Store data in S3, OSS, etc., separating storage and computation.

  • StarRocks: Read real-time data directly from Paimon without the need for transformation.

Hands-on Guide

Here’s how to set up a real-time pipeline from MySQL to Paimon and query the results via StarRocks.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Alternatively, you can deploy BladePipe on-premises.

Step 2: Add Data Sources

  1. Log in to the BladePipe Cloud.
  2. Click DataSource > Add DataSource, and add MySQL and Paimon instances.
  3. When adding a Paimon instance, special configuration is required. See Add a Paimon DataSource.

Step 3: Create a Sync DataJob

  1. Click DataJob > Create DataJob.
  2. Select the source and target DataSources, and click Test Connection to ensure the connection to the source and target DataSources are both successful.
  3. Select Incremental for DataJob Type, together with the Full Data option.
  4. Select the tables to be replicated.
  5. Select the columns to be replicated.
  6. Confirm the DataJob creation.

BladePipe will perform full data migration and continue capturing real-time changes to write into Paimon.

Step 4: Query Data from StarRocks

The final step is to query and analyze the data in Paimon. StarRocks supports Paimon Catalog natively. It can query real-time data in Paimon without data transformation or importing.

1. Create External Catalog
Run the CREATE EXTERNAL CATALOG statement in StarRocks to map all Paimon data into StarRocks.

CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES
(
"type" = "paimon",
"paimon.catalog.type" = "filesystem",
"paimon.catalog.warehouse" = "<s3_paimon_warehouse_path>",
"aws.s3.use_instance_profile" = "true",
"aws.s3.endpoint" = "<s3_endpoint>"
);

2. Query Real-Time Data

-- Show available databases
SHOW DATABASES FROM paimon_catalog;

-- Switch catalog
SET CATALOG paimon_catalog;

-- Switch to a specific database
USE your_database;

-- Query data
SELECT * FROM your_table LIMIT 10;

Now, any update in MySQL is reflected in Paimon in real time and instantly queryable in StarRocks. No ETL is needed.
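
If you would rather run these checks from an application than a SQL client, StarRocks speaks the MySQL protocol, so any MySQL driver works. Below is a hedged PyMySQL example; the host, port (9030 is the usual FE query port), credentials, and database name are assumptions for your environment.

```python
import pymysql

# Connection details are environment-specific assumptions.
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="root", password="", autocommit=True)

with conn.cursor() as cur:
    cur.execute("SET CATALOG paimon_catalog")         # switch to the Paimon catalog
    cur.execute("USE your_database")                  # pick the mapped database
    cur.execute("SELECT * FROM your_table LIMIT 10")  # read fresh lakehouse data
    for row in cur.fetchall():
        print(row)

conn.close()
```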

Final Thoughts

Apache Paimon unlocks real-time capabilities for modern data lakes. With BladePipe, teams can automate ingestion without writing a single line of code. And when paired with StarRocks, the full pipeline from source to query is truly real-time and production-ready.

If you're building a streaming lakehouse, this stack is worth trying.

A Comprehensive Guide to Wide Table

· 7 min read
John Li

In real-world business scenarios, even a basic report often requires joining 7 or 8 tables. This can severely impact query performance. Sometimes it takes hours for business teams to get a simple analysis done.

This article dives into how wide table technology helps solve this pain point. We’ll also show you how to build wide tables with zero code, making real-time cross-table data integration easier than ever.

The Challenge with Complex Queries

As business systems grow more complex, so do their data models. In an e-commerce system, for instance, tables recording orders, products, and users are naturally interrelated:

  • Order table: product ID (linked to Product table), quantity, discount, total price, buyer ID (linked to User table), etc.
  • Product table: name, color, texture, inventory, seller (linked to User table), etc.
  • User table: account info, phone numbers, emails, passwords, etc.

Relational databases are great at normalizing data and ensuring efficient storage and transaction performance. But when it comes to analytics, especially queries involving filtering, aggregation, and multi-table JOINs, the traditional schema becomes a performance bottleneck.

Take a query like "Top 10 products by sales in the last month": the more JOINs involved, the more complex and slower the query. And the number of possible query plans grows rapidly:

| Tables Joined | Possible Query Plans |
| --- | --- |
| 2 | 2 |
| 4 | 24 |
| 6 | 720 |
| 8 | 40320 |
| 10 | 3628800 |
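
The pattern behind these numbers is the factorial of the number of tables: with n tables there are roughly n! possible join orders, which is why the search space explodes so quickly. A two-line check reproduces the column above.

```python
import math

for n in (2, 4, 6, 8, 10):
    print(n, math.factorial(n))  # 2, 24, 720, 40320, 3628800
```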

For CRM or ERP systems, joining 5+ tables is standard. Then, the real question becomes: How to find the best query plan efficiently?

To tackle this, two main strategies have emerged: Query Optimization and Precomputation, with wide tables being a key form of the latter.

Query Optimization vs Precomputation

Query Optimization

One of the solutions is to reduce the number of possible query plans to accelerate query speed. This is called pruning. Two common approaches derive from this idea:

  • RBO (Rule-Based Optimizer): RBO doesn't consider the actual distribution of your data. Instead, it tweaks SQL query plans based on a set of predefined, static rules. Most databases have some common optimization rules built in, like predicate pushdown. Depending on their specific business needs and architectural design, different databases also have their own unique optimization rules. Take SAP HANA, for instance: it powers SAP ERP operations and is designed for in-memory processing with lots of joins. Because of this, its optimizer rules are noticeably different from other databases'.
  • CBO (Cost-Based Optimizer): CBO evaluates I/O, CPU and other resource consumption, and picks the plan with the lowest cost. This type of optimization dynamically adjusts based on the specific data distribution and the features of your SQL query. Even two identical SQL queries might end up with completely different query plans if the parameter values are different. CBO typically relies on a sophisticated and complex statistics subsystem, including crucial information like the volume of data in each table and data distribution histograms based on primary keys.

Most modern databases combine both RBO and CBO.

Precomputation

Precomputation assumes the relationships between tables are stable, so instead of joining on every query, it pre-joins data ahead of time into a wide table. When data is changed, only changes are delivered to the wide table. The idea has been around since the early days of materialized views in relational databases.

Compared with live queries, precomputation massively reduces runtime computation. But it's not perfect:

  • Limited JOIN semantics: Hard to handle anything beyond LEFT JOIN efficiently.
  • Heavy updates: A single change on the “1” side of a 1-to-N relation can cause cascading updates, challenging service reliability.
  • Functionality trade-offs: Precomputed tables lack the full flexibility of live queries (e.g. JOINs, filters, functions).

Best Practice: Combine Both

In the real world, a hybrid approach works best: use precomputation to generate intermediate wide tables, and use live queries on top of those to apply filters and aggregations.

  • Precomputation: A popular approach is stream computing, with stream processing databases emerging in recent years. Materialized views in traditional relational databases or data warehouses also offer an excellent solution.

  • Live queries: Real-time analytics databases deliver significant performance boosts in data filtering and aggregation, thanks to columnar and hybrid row-column storage, new instruction sets like AVX-512, high-performance computing hardware such as FPGAs and GPUs, and software techniques like distributed computing.

BladePipe's Wide Table Evolution

BladePipe started with a high-code approach: users had to write scripts to fetch related table data and construct wide tables manually during data sync. It worked, but it required too much effort to scale.

Now, BladePipe supports visual wide table building, enabling zero-code configuration. Users can select a driving table and the lookup tables directly in the UI to define JOINs. The system handles both initial data migration and real-time updates.

It currently supports visual wide table creation in the following pipelines:

  • MySQL -> MySQL/StarRocks/Doris/SelectDB
  • PostgreSQL/SQL Server/Oracle/MySQL -> MySQL
  • PostgreSQL -> StarRocks/Doris/SelectDB

More supported pipelines are coming soon.

How Visual Wide Table Building Works in BladePipe

Key Definitions

In BladePipe, a wide table consists of:

  • Driving Table: The main table used as the data source. Only one driving table can be selected.
  • Lookup Tables: Additional tables joined to the driving table. Multiple lookup tables are supported.

By default, the join behavior follows Left Join semantics: all records from the driving table are preserved, regardless of whether corresponding records exist in lookup tables.

BladePipe currently supports two types of join structures:

  • Linear: e.g., A.b_id = B.id AND B.c_id = C.id. Each table can only be selected once, and circular references are not allowed.
  • Star: e.g., A.b_id = B.id AND A.c_id = C.id. Each lookup table connects directly to the driving table. Cycles are not allowed.

In both cases, table A is the driving table, while B, C, etc. are lookup tables.

Data Change Rule

If the target is a relational DB (e.g. MySQL):

  • Driving table INSERT: Fields from lookup tables are automatically filled in.
  • Driving table UPDATE/DELETE: Lookup fields are not updated.
  • Lookup table INSERT: If downstream tables exist, the operation is converted to an UPDATE to refresh Lookup fields.
  • Lookup table UPDATE: If downstream tables exist, no changes are applied to related fields.
  • Lookup table DELETE: If downstream tables exist, the operation is converted to an UPDATE with the lookup fields set to NULL.

If the target is an overwrite-style DB (e.g. StarRocks, Doris):

  • All operations (INSERT, UPDATE, DELETE) on the Driving table will auto-fill Lookup fields.

  • All operations on Lookup tables are ignored.

    info

    If you want to include lookup table updates when the target is an overwrite-style database, set up a two-stage pipeline:

    1. Source DB → relational DB wide table
    2. Wide table → overwrite-style DB

Step-by-Step Guide

  1. Log in to BladePipe. Go to DataJob > Create DataJob.

  2. In the Tables step:
    1. Choose the tables that will participate in the wide table.
    2. Click Batch Modify Target Names > Unified table name, and enter a name as the wide table name.

  3. In the Data Processing step:
    1. On the left panel, select the Driving Table and click Operation > Wide Table to define the join.
      • Specify Lookup Columns (multiple columns are supported).
      • Select additional fields from the Lookup Table and define how they map to wide table columns. This helps avoid naming conflicts across different source tables.

    info

    1. If a Lookup Table joins to another table, make sure to include the relevant Lookup columns. For example, in A.b_id = B.id AND B.c_id = C.id, when selecting fields from B, c_id must be included.
    2. When multiple Driving or Lookup tables contain fields with the same name, always map them to different target column names to avoid collisions.

    2. Click Submit to save the configuration.
    3. Click Lookup Tables on the left panel to check whether field mappings are correct.

  4. Continue with the DataJob creation process, and start the DataJob.

Wrapping up

Wide tables are a powerful way to speed up analytics by precomputing complex JOINs. With BladePipe’s visual builder, even non-engineers can set up and maintain real-time wide tables across multiple data systems.

Whether you're a data architect or a DBA, this tool helps streamline your analytics layer and power up your dashboards with near-instant queries.

BladePipe vs. Airbyte: Features, Pricing and More (2025)

· 7 min read
Zoe

In today’s data-driven landscape, building reliable pipelines is a business imperative, and the right integration tool can make a difference.

Two modern tools are BladePipe and Airbyte. BladePipe focuses on real-time end-to-end replication, while Airbyte offers a rich connector ecosystem for ELT pipelines. So, which one fits your use case?

In this blog, we break down the core differences between BladePipe and Airbyte to help you make an informed choice.

Intro

What is BladePipe?

BladePipe is a real-time end-to-end data replication tool. Founded in 2019, it’s built for high-throughput, low-latency environments, powering real-time analytics, AI applications, or microservices that require always-fresh data.

The key features include:

  • Real-time replication, with a latency less than 10 seconds.
  • End-to-end pipeline for great reliability and easy maintenance.
  • One-stop management of the whole lifecycle from schema evolution to monitoring and alerting.
  • Zero-code RAG building for simpler and smarter AI.

What is Airbyte?

Airbyte was founded in 2020. It is an open-source data integration platform focused on ELT pipelines, offering a large library of pre-built and marketplace connectors for moving batch data from various sources to popular data warehouses and other destinations.

The key features include:

  • Focus on batch-based ELT pipelines.
  • Extensive connector ecosystem.
  • Open-source core with paid enterprise version.
  • Support for custom connectors with minimal code.

Feature Comparison

| Features | BladePipe | Airbyte |
| --- | --- | --- |
| Sync Mode | Real-time CDC-first/ETL | ELT-first/(Batch) CDC |
| Batch and Streaming | Batch and Streaming | Batch only |
| Sync Latency | ≤ 10 seconds | ≥ 1 minute |
| Data Connectors | 60+ connectors built by BladePipe | 50+ maintained connectors, 500+ marketplace connectors |
| Source Data Fetch | Pull and push hybrid | Pull-based |
| Data Transformation | Built-in transformations and custom code | dbt and SQL |
| Schema Evolution | Strong support | Limited |
| Verification & Correction | Yes | No |
| Deployment Options | Cloud (BYOC)/Self-hosted | Self-hosted (OSS)/Cloud (Managed) |
| Security | SOC 2, ISO 27001, GDPR | SOC 2, ISO 27001, GDPR, HIPAA Conduit |
| Support | Enterprise-level support | Community (free) and Enterprise-level support |

Pipeline Latency

Airbyte realizes data movement through batch-based extraction and loading. It supports Debezium-based CDC, which is applicable to only a few sources, and only for tables with primary keys. In Airbyte CDC, changes are pulled and loaded in scheduled batches (e.g., every 5 minutes or 1 hour), which puts latency at minutes or even hours depending on the sync frequency.

BladePipe is built around real-time Change Data Capture (CDC). Unlike batch-based CDC, BladePipe captures changes in the source instantly and delivers them to the destination with sub-second latency. Real-time CDC is applicable to almost all of its 60+ connectors.

In summary, Airbyte usually has a high latency. BladePipe CDC is more suitable for real-time architectures where freshness, latency, and data integrity are essential.

Data Connectors

Airbyte clearly leads in the breadth of supported sources and destinations. It currently supports over 550 connectors, most of which are API-based. Airbyte also allows custom connector building through its Connector Builder, greatly extending its connector reach. However, only around 50 connectors are official Airbyte connectors backed by an SLA; the rest are open-source connectors maintained by the community.

BladePipe, on the other hand, focuses on depth over breadth. It now supports 60+ connectors, which are all self-built and actively maintained. It targets critical real-time infrastructure: OLTPs, OLAPs, message middleware, search engines, data warehouses/lakes, vector databases, etc. This makes it a better fit for real-time applications, where data freshness and change tracking matter more than diversity of sources.

In summary, Airbyte stands out for its extensive coverage of connectors, while BladePipe focuses on real-time change delivery among multiple sources. Choose the suitable tool based on your specific need.

Data Transformation

Airbyte, as an ELT-first platform, uses a post-load transformation model, where data is loaded into the target first and transformation is applied afterwards. It offers two options: a serialized JSON object or a normalized version as tables. For advanced users, custom transformations can be done via SQL and through integration with dbt. But the transformation capabilities are limited because data is transformed only after being loaded.

BladePipe finishes data transformation in real time before data loading. Configure the transformation method when creating a pipeline, and all is done automatically. BladePipe supports built-in data transformations in a visualized way, including data filtering, data masking, column pruning, mapping, etc. Complex transformations can be done via custom code. With BladePipe, data gets ready when it flows through the pipeline.

In summary, Airbyte's data transformation capabilities are limited due to its ELT approach to data replication. BladePipe offers both built-in transformations and custom code to satisfy various needs, and the transformations happen in real time.

Support

Airbyte provides free and paid technical support. Open-source users can seek help in the community or solve issues themselves; this is free of charge but can be time-consuming for urgent production issues. Cloud customers can get help by chatting with Airbyte team members and contributors. Enterprise-level support is a separate paid tier, with custom SLAs and access to training.

BladePipe offers a more white-glove support experience. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs. Its technical team is closely involved in onboarding and tuning pipelines. Besides, for all customers, alert notifications can be sent via email and webhook to ensure pipeline reliability.

In summary, both Airbyte and BladePipe provide documentation and technical support for better understanding and use. Just think about your needs and make the right choice.

Use Case Comparison

Based on the features stated above, the performance of the two tools varies in different use cases.

| Use Case | BladePipe | Airbyte |
| --- | --- | --- |
| Data sync between relational databases | Excellent | Average |
| Data sync between online business databases (RDB, data warehouse, message, cache, search engine) | Excellent | Average |
| Data lakehouse support | Average | Excellent |
| SaaS sources support | Average | Average |
| Multi-cloud data sync | Excellent | Average |

Pricing Model Comparison

Pricing is one of the key factors to consider when evaluating tools, especially for startups and organizations with large amounts of data to replicate. BladePipe and Airbyte differ greatly in their pricing models.

BladePipe

BladePipe offers two plans to choose from:

  • Cloud: $0.01 per million rows of full data or $10 per million rows of incremental data. You can easily evaluate the costs via the price calculator. It is available at AWS Marketplace.
  • Enterprise: The costs are based on the number of pipelines and duration you need. Talk to the sales team on specific costs.

Airbyte

Airbyte has four plans to consider:

  • Open Source: Free to use for self-hosted deployment.
  • Cloud: $2.50 per credit, starting at $10/month (4 credits).
  • Team: Custom pricing for cloud deployment. Talk to the sales team on specific costs.
  • Enterprise: Custom pricing for self-hosted deployment. Talk to the sales team on specific costs.

Summary

Here's a quick comparison of costs between BladePipe BYOC and Airbyte Cloud.

| Million Rows per Month | BladePipe* (BYOC) | Airbyte (Cloud) |
| --- | --- | --- |
| 1 M | $210 | $450 |
| 10 M | $300 | $1000 |
| 100 M | $1200 | $3000 |
| 1000 M | $10200 | $14000 |

*: includes one AWS EC2 t2.xlarge for the Worker, $200/month.
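
Under the assumptions in this table ($10 per million rows of incremental data plus the ~$200/month worker instance, and ignoring the negligible $0.01/million full-data charge), the BladePipe column can be reproduced with a few lines of arithmetic:

```python
EC2_WORKER_PER_MONTH = 200          # assumed t2.xlarge cost from the footnote above
INCREMENTAL_PER_MILLION_ROWS = 10   # $10 per million rows of incremental data

def bladepipe_byoc_cost(million_rows_per_month: float) -> float:
    """Estimated monthly cost under the stated assumptions."""
    return EC2_WORKER_PER_MONTH + INCREMENTAL_PER_MILLION_ROWS * million_rows_per_month

for volume in (1, 10, 100, 1000):
    print(f"{volume} M rows/month -> ${bladepipe_byoc_cost(volume):,.0f}")
# 1 M -> $210, 10 M -> $300, 100 M -> $1,200, 1000 M -> $10,200
```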

In summary, BladePipe is much cheaper than Airbyte, and the cost gap widens as more data is moved per month. If you have a tight budget or need to integrate billions of rows of data, BladePipe is a cost-effective option.

Final Thoughts

A right tool is critical for any business, and the choice should depend on your use case. This article lists a number of considerations and key differences. To summarize, Airbyte excels at extensive connectors and an open ecosystem, while BladePipe is designed for real-time end-to-end data use cases.

If your organization is building applications that rely on always-fresh data, such as AI assistants, real-time search, or event streaming, BladePipe is likely a better fit.

If your organization needs to integrate data from various APIs or would like to maintain connectors by in-house staff, you may try Airbyte.

BladePipe vs. Fivetran: Features, Pricing and More (2025)

· 7 min read
John Li

In today’s data-driven landscape, businesses rely heavily on efficient data integration platforms to consolidate and transform data from multiple sources. Two prominent players in this space are Fivetran and BladePipe, both offering solutions to automate and streamline data movement across cloud and on-premises environments.

This blog provides a clear comparison of BladePipe and Fivetran as of 2025, covering their core features, pricing models, deployment options, and suitability for different business needs.

Quick Intro

What is BladePipe?

BladePipe is a data integration platform known for its extremely low latency and high performance, facilitating efficient migration and sync of data across both on-premises and cloud databases. Founded in 2019, it's built for analytics, microservices, and AI-focused use cases that emphasize real-time data.

The key features include:

  • Real-time replication, with a latency less than 10 seconds.
  • End-to-end pipeline for great reliability and easy maintenance.
  • One-stop management of the whole lifecycle from schema evolution to monitoring and alerting.
  • Zero-code RAG building for simpler and smarter AI.

What is Fivetran?

Fivetran is a global leader in automated data movement and is widely trusted by many companies. It offers a fully managed ELT (Extract-Load-Transform) service that automates data pipelines with prebuilt connectors, ensuring robust data sync and automatic adaptation to source schema changes.

The key features include:

  • Managed ELT pipelines, automating the entire Extract-Load-Transform process.
  • Extensive connectors (700+ prebuilt connectors).
  • Strong data transformation ability with dbt integration and built-in models.
  • Automatic schema handling, reducing human efforts.

Feature Comparison

| Features | BladePipe | Fivetran |
| --- | --- | --- |
| Sync Mode | Real-time CDC-first/ETL | ELT/Batch CDC |
| Batch and Streaming | Batch and Streaming | Batch only |
| Sync Latency | ≤ 10 seconds | ≥ 1 minute |
| Data Connectors | 60+ connectors built by BladePipe | 700+ connectors, 450+ are Lite (API) connectors |
| Source Data Fetch | Pull and Push hybrid | Pull-based |
| Data Transformation | Built-in transformations and custom code | Post-load transformation and dbt integration |
| Schema Evolution | Strong support | Strong support |
| Verification & Correction | Yes | No |
| Deployment Options | Self-hosted/Cloud (BYOC) | Self-hosted/Hybrid/SaaS |
| Security | SOC 2, ISO 27001, GDPR | SOC 2, ISO 27001, GDPR, HIPAA |
| Support | Enterprise-level support | Tiered support (Standard, Enterprise, Business Critical) |
| SLA | Available | Available |

Pipeline Latency

Fivetran adopts batch-based CDC, which means the data is read in batch intervals. It offers a range of sync frequencies, from as low as 1 minute (for Enterprise/Business Critical plans) to 24 hours. In practice, latency is often around 10 minutes. Batch pulls also put extra pressure on the source end.

BladePipe uses real-time Change Data Capture (CDC) for data integration. That means it instantly grabs data changes from your source and delivers them to the destination within seconds, a big shift from traditional batch-based CDC methods. In BladePipe, real-time CDC works with nearly all of its 60+ connectors.

In summary, BladePipe outperforms Fivetran on latency, making it ideal for use cases requiring always-fresh data.

Data Connectors

Fivetran offers an extensive library (700+) of pre-built connectors, covering databases, APIs, files and more. A variety of connectors satisfy diverse business needs. Among all the connectors, around 450 of them are lite connectors built for specific use cases with limited endpoints.

BladePipe offers 60+ pre-built connectors. It focuses on essential systems for real-time needs, like OLTPs, OLAPs, messaging tools, search engines, data warehouses/lakes, and vector databases. This makes it a great choice for real-time projects where getting fresh data quickly is a fundamental requirement.

In summary, Fivetran excels with its broad range of connectors, while BladePipe focuses on data delivery for critical real-time infrastructure. Choose the right tool that works for you.

Reliability

Fivetran's reliability has been a point of concern. Its status page shows 15 or more incidents per month, including connector failures, third-party service errors, and other service degradations. It has even experienced an outage lasting more than two days.

BladePipe is built with production-grade reliability at its core. It provides real-time dashboards for monitoring every step of data movement. Alert notifications can be triggered for latency, failures, or data loss. That makes it easy to maintain pipelines and solve problems, enhancing reliability.

In summary, BladePipe shows a more reliable system performance than Fivetran, and its monitoring and alerting mechanism brings even stronger support for stable pipelines.

Support

Fivetran offers documentation, support portal, and email support for Standard plan. However, some customers complain about the long time waiting for response. Enterprise and Business Critical plans enjoy 1-hour support response, but the costs are much higher.

BladePipe offers a more white-glove support experience. For both Cloud and Enterprise customers, BladePipe provides the according SLAs. Its technical team works closely with clients during onboarding and when fine-tuning data pipelines.

In summary, both Fivetran and BladePipe provide documentation and technical support for better understanding and use.

Use Case Comparison

Based on the features stated above, the performance of the two tools varies in different use cases.

| Use Case | BladePipe | Fivetran |
| --- | --- | --- |
| Data sync between relational databases | Excellent | Average |
| Data sync between online business databases (RDB, data warehouse, message, cache, search engine) | Excellent | Average |
| Data lakehouse support | Average | Average |
| SaaS sources support | Average | Excellent |
| Multi-cloud data sync | Excellent | Average |

Pricing Model Comparison

Pricing is a crucial consideration when evaluating data integration tools, especially for startups and organizations with extensive data replication needs. Fivetran and BladePipe employ significantly different pricing models.

Fivetran

Fivetran has four plans to consider: Free, Standard, Enterprise, and Business Critical. The free plan covers low volumes (e.g., up to 500,000 MAR). The other three plans adopt MAR-based tiered pricing. See more details on the pricing page.

Besides, Fivetran separately charges for data transformation based on the models users run in a month, making the costs even higher.

As of March 2025, Fivetran has changed to connector-level pricing: pricing and discounts are often applied per individual connector instead of across the entire account. This means that if you have many connectors, your total cost might increase even if your overall data volume hasn't changed.

BladePipe

BladePipe offers two plans to choose:

  • Cloud: $0.01 per million rows of full data and $10 per million rows of incremental data. You can easily evaluate the costs via the price calculator. It is available at AWS Marketplace.
  • Enterprise: The costs are based on the number of pipelines and duration you need. Talk to the sales team on specific costs.

Summary

Here's a quick comparison of costs between BladePipe BYOC and Fivetran(Standard).

| Million Rows per Month | BladePipe* (BYOC) | Fivetran (Standard) |
| --- | --- | --- |
| 1 M | $210 | $500+ |
| 10 M | $300 | $1350+ |
| 100 M | $1200 | $2900+ |

*: includes one AWS EC2 t2.xlarge for the BladePipe Worker, $200/month.

In summary, BladePipe is a better choice when it comes to costs, considering the following factors:

  • Cost-effectiveness: BladePipe is much cheaper than Fivetran when moving the same amount of data. Besides, BladePipe doesn't charge separately for data transformation.

  • Cost Predictability: BladePipe's direct per-million-row pricing offers more immediate cost predictability, especially for large, consistent data volumes. Fivetran's MAR can be less predictable due to the nature of "active rows", the data transformation charge and the new connector-level pricing.

Final Thoughts

Choosing between Fivetran and BladePipe depends heavily on your organization's specific data integration needs and priorities. Fivetran provides extensive coverage of connectors and an automated ELT experience for analytics. BladePipe features real-time CDC, ideal for mission-critical data syncs. In terms of pricing, BladePipe is a cost-effective choice for start-ups and organizations with a tight budget.

Evaluate your specific data sources, latency requirements, budget, internal team resources, and desired level of support to make the most suitable choice.

Redis Sync at Scale: A Smarter Way to Handle Big Keys

· 4 min read
Barry

In enterprise-grade data replication workflows, Redis is widely adopted thanks to its blazing speed and flexible data structures. But as data grows, so do the keys in Redis—literally. Over time, it’s common to see Redis keys ballooning with hundreds of thousands of elements in structures like Lists, Sets, or Hashes.

These “big keys” are among the main culprits behind poor performance in a full data migration or sync, slowing down processes or even bringing them to a crashing halt.

That’s why BladePipe, a professional data replication platform, recently rolled out a fresh round of enhancements to its Redis support. These include expanded command coverage, a data verification feature and, more importantly, major improvements to big key sync.

Let’s dig into how these improvements work and how they keep Redis migrations smooth and reliable.

Challenges of Big Key Sync

In high-throughput, real-time applications, it’s common for a single Redis key to contain a massive amount of elements. When it comes to syncing that data, a few serious issues can pop up:

  • Out-of-Memory (OOM) Crashes: Reading big keys all at once can cause the sync process to blow up memory usage, sometimes leading to OOM.
  • Protocol Size Limits: Redis commands and payloads have strict limits (e.g., 512MB for a single command via the RESP protocol). Exceed those limits, and Redis will reject the operation.
  • Target-Side Write Failures: Even if the source syncs properly, the target Redis might fail to process oversized writes, leading to data sync interruption.

How BladePipe Tackles Big Key Syncs

To address these issues, BladePipe introduces lazy loading and sharded sync mechanisms specifically tailored for big keys without sacrificing data integrity.

Lazy Loading

Traditional data sync tools often attempt to load an entire key into memory in one go. BladePipe flips the script by using on-demand loading. Instead of stuffing the entire key into memory, BladePipe streams it shard-by-shard during the sync process.

This dramatically reduces memory usage and minimizes the risk of OOM crashes.

Sharded Sync

The heart of BladePipe’s big key optimization lies in breaking big keys into smaller shards. Each shard contains a configurable number of elements, and the shards are sent to the target Redis as multiple smaller commands.

  • Configurable parameter: parseFullEventBatchSize
  • Default value: 1024 elements per shard
  • Supported types: List, Set, ZSet, Hash

Example: If a Set contains 500,000 elements, BladePipe will divide it into ~490 shards, each with up to 1024 elements, and send them as separate SADD commands.
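
To make the idea concrete, here is a minimal, hypothetical sketch of lazy loading plus sharded writes using the redis-py client. It is an illustration of the technique, not BladePipe's implementation; the host names and key name are made up, and the shard size of 1024 simply mirrors the default parseFullEventBatchSize described above.

```python
import redis

SHARD_SIZE = 1024  # mirrors the default parseFullEventBatchSize

def sync_big_set(source: redis.Redis, target: redis.Redis, key: str) -> None:
    """Stream a large Set from source to target shard by shard."""
    buffer = []
    # SSCAN iterates the set incrementally, so the whole key is never
    # held in memory at once (lazy loading).
    for member in source.sscan_iter(key, count=SHARD_SIZE):
        buffer.append(member)
        if len(buffer) >= SHARD_SIZE:
            target.sadd(key, *buffer)   # one SADD per shard (sharded sync)
            buffer.clear()
    if buffer:
        target.sadd(key, *buffer)       # flush the final partial shard

if __name__ == "__main__":
    src = redis.Redis(host="source-redis", port=6379)  # hypothetical hosts
    dst = redis.Redis(host="target-redis", port=6379)
    sync_big_set(src, dst, "big:set:key")
```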

Shard-by-Shard Sync Process

Here’s a breakdown of how it works:

  1. Shard Planning: BladePipe inspects the total number of elements in a big key and calculates how many shards are needed based on the parameter parseFullEventBatchSize.
  2. Shard Construction & Dispatch: Each shard is formatted into a Redis-compatible command and sent to the target sequentially.
  3. Order & Integrity Guarantees: Shards are written in the correct order, preserving data consistency on the target Redis.

Real-World Results

To benchmark the improvements, BladePipe ran sync tests with a mixed dataset:

  • 1 million regular keys (String, List, Hash, Set, ZSet)
  • 50,000 large keys (~30MB each; max ~35MB)

Here’s what performance looked like: even with big keys in the mix, BladePipe achieved a steady sync throughput of 4–5K RPS from Redis to Redis, which is enough to handle the daily production workloads of most businesses without compromising accuracy.

Wrapping Up

Big keys don’t have to be big problems. With lazy loading and sharded sync, BladePipe provides a reliable and memory-safe way to handle full Redis migrations—even for your biggest keys.

Real-Time Data Sync: 4 Questions We Get All the Time

· 5 min read
John Li

We work closely with teams building real-time systems, migrating databases, or bridging heterogeneous data platforms. Along the way, we hear a lot of recurring questions. So we figured—why not write them down?

This is Part 1 of a practical Q&A series on real-time data sync. In this post, I'd like to share thoughts on the following questions:

How should I choose between official and third-party tools?

Mature database vendors typically provide their own tools for data migration or cold/hot backup, like Oracle GoldenGate or MySQL's built-in dump utilities.

Official tools often deliver:

  • The best possible performance for the migration and sync of that database.
  • Compatibility with obscure engine-specific features.
  • Support for special cases that third-party tools often cannot (e.g., Oracle GoldenGate parsing Redo logs).

But they also tend to:

  • Offer limited or no support for other databases.
  • Be less flexible for niche or custom workflows.
  • Lock you in, making data exit harder than data entry.

Third-party tools shine when:

  • You're syncing across platforms (e.g. MySQL > Kafka/Iceberg/Elasticsearch).
  • You need advanced features like filtering and transformation.
  • The official tool simply doesn't support your use case.

In short:

  • If it’s a homogeneous migration or backup, use the official tool.
  • If it’s heterogeneous sync or anything custom, go with a third-party tool.

Can my project rely on “real-time” sync latency?

In short: any data sync process that doesn't guarantee distributed transaction consistency comes with some latency risk. Even distributed transactions come at a cost—usually via redundant replication and sacrificing write performance or availability.

Latency typically falls into two categories: fault-induced latency and business-induced latency.

Fault-induced Latency:

  • Issues with the sync tool itself, such as memory limits or bugs.
  • Source/target database failures—data can't be pulled or written properly.
  • Constraint conflicts on the target side, leading to write errors.
  • Incomplete schema on the target side causing insert failures.

Business-induced Latency:

  • Bulk data imports or data corrections on the source side.
  • Traffic spikes during business peaks exceeding the tool’s processing capacity.

You can reduce the chances of delays (via task tuning, schema change rule setting, and database resource planning), but you’ll never fully eliminate them. So the real question becomes:

Do you have a fallback plan (e.g. graceful degradation) when latency hits?

That would significantly mitigate the risks brought by high latency.

What does real-time data sync mean to my project?

Two words: incremental + real-time.

Unlike traditional batch-based ETL, a good real-time sync tool:

  • Captures only what changes, saving massive bandwidth.
  • Delivers changes within seconds, enabling use cases like fraud detection or live analytics.
  • Preserves deletes and DDLs, whereas traditional ETL often relies on external metadata services.

Think of it like this: You don’t want to re-copy 1 billion rows every night when only 100 changed. Real-time sync gives you the speed and precision needed to power fast, reliable data products.

And with modern architectures—where one DB handles transactions, another serves queries, and a third powers ML—real-time sync is the glue holding it all together.

How do I keep pipeline stability and data integrity over time?

Most stability issues come from three factors: schema changes, traffic pattern shifts, and network environment issues. Mitigating or planning for these risks greatly improves stability.

Schema Changes:

  • Incompatibilities between schema change methods (e.g., native DDL, online tools like pt-osc or gh-ost) and the sync tool’s capabilities.
  • Uncoordinated changes to target schemas may cause errors or schema misalignment.
  • Changes on the target side (e.g., schema changes or writes) may conflict with the sync logic, causing inconsistencies between the source and target schemas or constraint conflicts.

Traffic Shifts:

  • Business surges causing unexpected peak loads that outstrip the sync tool’s capacity, leading to memory exhaustion or lag.
  • Ops activities like mass data corrections causing large data volumes and sync bottlenecks.

Network Environment:

  • Missing database whitelisting for sync nodes. Sync tasks may fail due to connection issues.
  • High latency in cross-region setups causing read/write problems.

You can reduce these risks significantly via change control setting, load testing during peak traffic, and pre-launch resource validation.

Data loss issues typically result from:

  • A mismatched parallelism strategy causing out-of-order writes.
  • Conflicting writes on the target side.
  • Excessive latency not handled in time, causing source-side logs to be purged before sync.

How to fight back:

  • Parallelism strategy mismatches often occur due to cascading updates or primary key reuse. You may need to fall back to table-level sync granularity, then verify and correct the data to ensure consistency.
  • Target-side writes should be prevented via access control and database usage standardization.
  • Excessive latency must be caught via robust alerting. Also, extend log retention (ideally 24+ hours) on the source database.

With these measures in place, you can significantly enhance sync stability and data reliability—laying a solid foundation for data-driven business operations.

Intercontinental Data Sync - A Comparative Study for Performance Tuning

· 5 min read
John Li

When it comes to moving data across vast distances, particularly between continents, businesses often face a range of challenges that can impact performance. At BladePipe, we regularly help enterprises tackle these hurdles. The most common question we receive is: What’s the best way to deploy BladePipe for optimal performance?

While we can offer general advice based on our experience, the reality is that these tasks come with many variables. This article explores the best practice for intercontinental data migration and sync, blending theory with hands-on insights from real-world experiments.

Challenges of Intercontinental Data Sync

Intercontinental data migration is no easy feat. There are two primary challenges that stand in the way of fast and reliable data transfers:

  • Unavoidable network latency: For instance, network latency between Singapore and the U.S. typically ranges from 150ms to 300ms, which is significantly higher compared to the sub-5ms latency of typical relational database INSERT/UPDATE operations.

  • Complex factors affecting network quality: Factors such as packet loss and routing paths can degrade the performance of intercontinental data transfers. Unlike intranet communication, intercontinental transfers pass through multiple layers of switches and routers in data centers and backbone networks.

Beyond these, it’s critical to consider the load on both the source and target databases, network bandwidth, and the volume of data being transferred.

When using BladePipe, understanding its data extraction and writing mechanisms is essential to determine the best deployment strategy.

BladePipe Migration & Sync Techniques

Data Migration Techniques

For relational databases, BladePipe uses JDBC-based data scanning, with support for resumable migration using techniques like pagination. Additionally, it supports parallel data migration—both inter-table and intra-table parallelism (via multiple tasks with specific filters).
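
As a rough illustration of how pagination-based, resumable scanning works (not BladePipe's internal code), the sketch below pages through a table in primary-key order, so a restart can resume from the last key seen. The table and column names are hypothetical, and sqlite3 stands in for any JDBC/DB-API source.

```python
import sqlite3  # any DB-API driver works; sqlite3 keeps the sketch self-contained

BATCH_SIZE = 5000

def scan_table_resumable(conn, table: str, last_id: int = 0):
    """Yield rows in primary-key order, one page at a time (keyset pagination)."""
    while True:
        cur = conn.execute(
            f"SELECT id, payload FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        )
        rows = cur.fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # checkpoint: resume from here after a restart

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO orders (id, payload) VALUES (?, ?)",
                     [(i, f"row-{i}") for i in range(1, 12001)])
    for page in scan_table_resumable(conn, "orders"):
        print(f"migrated {len(page)} rows, up to id {page[-1][0]}")
```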

On the target side, since all data is inserted via INSERT operations, BladePipe uses several batch writing techniques:

  • Batching
  • Splitting and parallel writing
  • Bulk inserts
  • INSERT rewriting (e.g., converting multiple rows into INSERT ... VALUES (...), (...), (...)); see the sketch below
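
For illustration only (not BladePipe's code), here is a minimal sketch of INSERT rewriting: rows are buffered and emitted as one multi-value statement per batch. The table and column names are hypothetical, and the %s placeholder style assumes a MySQL-like Python driver.

```python
from typing import Iterable, Sequence

def build_multi_value_insert(table: str, columns: Sequence[str],
                             rows: Iterable[Sequence[object]]) -> tuple[str, list[object]]:
    """Rewrite many single-row inserts as one INSERT ... VALUES (...), (...), ... statement."""
    rows = list(rows)
    placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([placeholders] * len(rows))
    )
    params = [value for row in rows for value in row]  # flatten for the driver
    return sql, params

# Example: three rows collapse into a single round trip to the target database.
sql, params = build_multi_value_insert(
    "orders", ["id", "amount"], [(1, 9.9), (2, 19.9), (3, 29.9)]
)
print(sql)     # INSERT INTO orders (id, amount) VALUES (%s, %s), (%s, %s), (%s, %s)
print(params)  # [1, 9.9, 2, 19.9, 3, 29.9]
```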

Data Sync Techniques

BladePipe supports different methods for capturing incremental changes depending on the source database. Here’s a quick look:

| Source Database | Incremental Capture Method |
| --------------- | -------------------------- |
| MySQL           | Binlog parsing             |
| PostgreSQL      | Logical WAL subscription   |
| Oracle          | LogMiner parsing           |
| SQL Server      | SQL Server CDC table scan  |
| MongoDB         | Oplog scan / ChangeStream  |
| Redis           | PSYNC command              |
| SAP Hana        | Trigger                    |
| Kafka           | Message subscription       |
| StarRocks       | Periodic incremental scan  |
| ...             | ...                        |

These methods largely rely on the source database to emit incremental changes, which can vary based on network conditions.

On the target side, unlike in data migration, more operation types (INSERT/UPDATE/DELETE) need to be handled, and order consistency must be preserved during data sync. BladePipe offers a variety of techniques to improve data sync performance:

| Optimization | Description |
| ------------ | ----------- |
| Batching | Reduce network overhead and help with merge performance |
| Partitioning by unique key | Ensure data order consistency |
| Partitioning by table | Looser method when unique key changes occur |
| Multi-statement execution | Reduce network latency by concatenating SQL |
| Bulk load | For data sources with full-image and upsert capabilities, INSERT/UPDATE operations are converted into INSERT for batch overwriting |
| Distributed tasks | Allow parallel writes of the same amount of data using multiple tasks |
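
To show why partitioning by unique key (listed above) preserves ordering, here is a small, hypothetical sketch: events are routed to worker queues by a hash of their key, so changes to the same row always land in the same queue and are applied in order. This illustrates the general technique, not BladePipe's implementation; the event fields are made up.

```python
from hashlib import md5
from queue import Queue

NUM_WORKERS = 4
queues = [Queue() for _ in range(NUM_WORKERS)]

def route(event: dict) -> None:
    """Send an event to a worker queue chosen by its unique key."""
    key = str(event["primary_key"]).encode()
    worker = int(md5(key).hexdigest(), 16) % NUM_WORKERS
    queues[worker].put(event)  # same key -> same queue -> ordered apply

# Two updates to row 42 go to the same queue and keep their order;
# row 7 can be applied in parallel by a different worker.
route({"primary_key": 42, "op": "UPDATE", "amount": 10})
route({"primary_key": 7,  "op": "INSERT", "amount": 5})
route({"primary_key": 42, "op": "UPDATE", "amount": 12})
```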

Exploring the Best Practice

BladePipe’s design emphasizes performance optimizations on the target side, which are more controllable. Typically, we recommend deploying BladePipe near the source data source to mitigate the impact of network quality on data extraction.

But does this theory hold up in practice? To test this, we conducted an intercontinental MySQL-to-MySQL migration and sync experiment.

Experimental Setup

Resources:

  • Source MySQL: located in Singapore (4 cores, 8GB RAM)
  • Target MySQL: located in Silicon Valley, USA (4 cores, 8GB RAM)
  • BladePipe: deployed on VMs in both Singapore and Silicon Valley (8 cores, 16GB RAM)

Test Plan: We migrated and synchronized the same data twice to compare performance with BladePipe deployed in different locations.

Process

  1. Generate 1.3 million rows of data in Singapore MySQL.
  2. Use BladePipe deployed in Singapore to migrate data to the U.S. and record performance.
  3. Make data changes (INSERT/UPDATE) at Singapore MySQL and record sync performance.
  4. Stop the DataJob and delete target data.
  5. Use BladePipe deployed in the U.S. to migrate the data again from Singapore MySQL and record performance.
  6. Make data changes at Singapore MySQL and record sync performance again.

Results & Analysis

| Deployment Location | Task Type | Performance |
| ------------------- | --------- | ----------- |
| Source (Singapore) | Migration | 6.5k records/sec |
| Target (Silicon Valley) | Migration | 15k records/sec |
| Source (Singapore) | Sync | 8k records/sec |
| Target (Silicon Valley) | Sync | 32k records/sec |

Surprisingly, deploying BladePipe at the target (Silicon Valley) significantly outperformed the source-side deployment.

Potential Reasons:

  • Network policies and bandwidth differences between the two locations.
  • Target-side batch writes are less affected by poor network conditions compared to binlog/logical scanning on the source side.
  • Other unpredictable network variables.

Recommendations

While the experiment offers valuable insights into intercontinental data migration and sync, real-world environments can differ:

  • Production databases may be under heavy load, impacting the ability to push incremental changes efficiently.
  • Dedicated network lines may offer more consistent network quality.
  • Gateway rules and security policies vary across data centers, affecting performance.

Our recommendation: During the POC phase, deploy BladePipe on both the source and target sides, compare performance, and choose the best deployment strategy based on real-world results.

Data Transformation in ETL (2025 Guide)

· 4 min read
John Li

ETL (Extract, Transform, Load) is a fundamental process in data integration and data warehousing. In this process, data transformation is a key step. It’s the stage where raw, messy data gets cleaned up and reorganized so it’s ready for analysis, business use and decision-making.

In this blog, we will break down data transformation to help you better understand and process data in ETL.

What is Data Transformation in ETL?

In the ETL process, data transformation is the middle step that turns extracted data from various sources into a consistent, usable format for the target system (like a data warehouse or analytics tool). This step applies rules, logic, and algorithms to:

  • Clean up errors and inconsistencies
  • Standardize formats (like dates and currencies)
  • Enrich data with new calculations or derived fields
  • Restructure data to fit the needs of the business or target system

Without transformation, data from different sources would be incompatible, error-prone, or simply not useful for downstream processing like reporting, analytics, or machine learning.

Why is Data Transformation Important?

  • Ensure Data Quality: Fix errors, fill in missing values, and remove duplicates so the data is accurate and trustworthy.
  • Improve Compatibility: Convert data into a format compatible with the target system, and handle schema differences, which are vital for combining data from different sources.
  • Enhance Performance & Efficiency: Filter unnecessary data early, reducing storage and processing costs. Optimize data structure through partitioning and indexing for faster queries.
  • Enable Better Analytics & Reporting: Aggregate, summarize, and structure data so it’s ready for dashboards and reports.

10 Types of Data Transformation

Here are the most common types of data transformation you’ll find in ETL pipelines, with simple explanations and examples:

| Transformation Type | Explanation | Example/Use Case |
| ------------------- | ----------- | ---------------- |
| Data Cleaning | Remove errors and fix inconsistencies to improve quality | Replace missing values in a "Country" column with "Unknown" |
| Data Mapping | Match source data fields to the target schema so data lands in the right place | Map “cust_id” from source to “customer_id” in target |
| Data Aggregation | Summarize detailed data into a higher-level view | Sum daily sales into monthly totals |
| Bucketing/Binning | Group continuous data into ranges or categories for easier analysis | Group ages into ranges (18–25, 26–35, etc.) |
| Data Derivation | Create new fields by applying formulas or rules to existing fields | Derive "Profit" by subtracting "Cost" from "Revenue" in a sales dataset |
| Filtering | Select only relevant or necessary records | Keep only 2024 sales records from the entire sales table |
| Joining | Combine data from multiple sources or tables based on a common key | Join a "Customers" table with an "Orders" table on "CustomerID" to analyze order history |
| Splitting | Break up fields into multiple columns for granularity or clarity | Split “Full Name” into “First Name” and “Last Name” |
| Normalization | Standardize scales or units | Convert currencies to USD |
| Sorting and Ordering | Arrange records based on one or more fields, either ascending or descending | Sort a customer list by "Signup Date" in descending order to identify recent users |
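
As a small, hypothetical illustration, here are three of these transformations (cleaning, derivation, and filtering) expressed in plain Python; the sample rows are made up, and real pipelines would typically run such logic in SQL, dbt, or a replication tool.

```python
from datetime import date

raw_rows = [
    {"country": None, "revenue": 120.0, "cost": 80.0, "order_date": date(2024, 3, 1)},
    {"country": "US", "revenue": 95.0,  "cost": 40.0, "order_date": date(2023, 11, 5)},
]

transformed = []
for row in raw_rows:
    # Data cleaning: replace missing country values with "Unknown"
    row["country"] = row["country"] or "Unknown"
    # Data derivation: compute profit from existing fields
    row["profit"] = row["revenue"] - row["cost"]
    # Filtering: keep only 2024 records
    if row["order_date"].year == 2024:
        transformed.append(row)

print(transformed)
# [{'country': 'Unknown', 'revenue': 120.0, 'cost': 80.0,
#   'order_date': datetime.date(2024, 3, 1), 'profit': 40.0}]
```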

Automate Data Transformation with BladePipe

BladePipe is a real-time end-to-end data replication tool. It supports various ways to transform data. With a user-friendly interface, complex end-to-end transformations can be done in a few clicks.

Compared with traditional data transformation approaches, BladePipe offers the following features:

  • Real-time Transformation: Any incremental data is captured, transformed and loaded in real time, critical in projects requiring extremely low latency.

  • Flexibility: BladePipe offers multiple built-in transformations without manual scripting. For special transformations, custom code can cater to personalized needs.

  • Ease of Use: Most operations are done in an intuitive interface with wizards. Except for transformation via custom code, the other data transformations don't require any code.

Data Filtering

BladePipe lets you specify a filter condition as a SQL WHERE clause, so that only relevant records are processed and loaded, improving ETL performance.

Data Cleaning

BladePipe has several built-in data transformation scripts, covering common use cases. For example, you can simply remove leading and trailing spaces from strings, standardizing the data format.

Data Mapping

In BladePipe, table names and field names can be mapped to the target instance based on certain rules. You can also rename each table as you like.

Wrapping Up

Data transformation is the engine that powers the effective ETL process. By cleaning, standardizing, and enriching raw data, it ensures organizations have reliable, actionable information for decision-making. Whether you’re combining sales data, cleaning up customer lists, or preparing data for machine learning, transformation is what makes your data truly useful.

Data Masking in Real-time Replication

· 6 min read
Zoe

In today’s data-driven world, keeping sensitive information safe is more important than ever. That’s where data masking comes in. It hides or replaces private data so teams can work freely without risking exposure. In this blog, we’ll dive into data masking—what it is, when to use it, and how modern tools make it easy to mask your data as you move it.

What is Data Masking?

When moving or syncing data, especially personally identifiable information (PII), data masking is a key step. It keeps your data safe, private, and compliant—especially when you're migrating, testing, or sharing data. Any time sensitive data is being transferred, data masking should be part of the plan. It helps prevent leaks and protects your business.

There are two main types of data masking: static and dynamic.

Static data masking means masking data in bulk. It creates a new dataset where sensitive information is hidden or replaced. This masked data is safe to use in non-production environments like development, testing, or analytics.

Dynamic data masking happens in real-time. It shows different data to different users based on their roles or permissions. It is usually used in live production systems.

In this blog, we'll focus on static data masking, and how to statically mask data in data replication.

Use Cases

Data masking is useful in many situations where there’s a risk of data breach. It’s especially important when people from different departments—or even outside the organization—need to access the data. Masking keeps private information safe and secure.

Once data is statically masked and separated from the live production system, teams of different departments can use it freely—read it, write it, test with it—without risking the real data. Here are some common use cases for static data masking:

  • Software development and testing: Developers often need real data to test new features or troubleshoot bugs. But dev environments usually aren’t as secure as production environments. Static masking hides the sensitive parts of the data, so developers can work safely without seeing private info.

  • Scientific research: Researchers need lots of real-world data to get meaningful results. But using raw data with personal or sensitive info is not compliant with privacy laws. With data masking, researchers get access to realistic data, just without the sensitive details, keeping things both useful and compliant.

  • Data sharing: Businesses often need to share data with partners or third-party vendors. Sharing raw data is risky due to the potential for data breaches. Masking it first removes that risk. Partners get the insights they need, but none of the sensitive stuff. It’s a win-win for privacy and collaboration.

Common Static Data Masking Techniques

There are several ways to apply static data masking. Each method helps hide sensitive information.

| Masking Type | How It Works | Example |
| ------------ | ------------ | ------- |
| Substitution | Replace real data with fake but seemingly realistic values | Rose → Monica |
| Shuffling | Mix up the order of characters or fields | 12345 → 54123 |
| Encryption | Use algorithms like AES or RSA to encrypt the data | 123456 → Xy1#Rt |
| Masking | Hide part of the data with asterisks | 13812345678 → 138**5678 |
| Truncation | Keep only part of the original data | 622712345678 → 6227 |
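
Below is a small, hypothetical Python sketch of three of these techniques (masking with asterisks, truncation, and shuffling). It only illustrates the ideas; it is not BladePipe's built-in implementation, and the sample values are made up.

```python
import random

def mask_middle(value: str, keep_head: int = 3, keep_tail: int = 4) -> str:
    """Hide the middle of a string with asterisks (e.g., phone numbers)."""
    hidden = len(value) - keep_head - keep_tail
    return value if hidden <= 0 else value[:keep_head] + "*" * hidden + value[-keep_tail:]

def truncate(value: str, keep: int = 4) -> str:
    """Keep only the leading part of the original value."""
    return value[:keep]

def shuffle(value: str, seed: int = 42) -> str:
    """Mix up the order of characters (repeatable with a fixed seed)."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

print(mask_middle("13812345678"))   # 138****5678
print(truncate("622712345678"))     # 6227
print(shuffle("12345"))             # e.g. a reordering such as 41352
```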

Data Masking in Real-time Replication

In the use cases mentioned above, we often need both data migration/syncing and data masking. The best approach? Mask the data during the sync process itself. That way, teams get masked data right away—no need for extra tools. It’s faster, simpler, and safer. Plus, it lowers the risk of leaks and helps you stay compliant.

BladePipe, a professional end-to-end data replication tool, makes this easy. It supports data transformation during sync. Before, users had to write custom code to do masking while syncing, which is not ideal for non-developers. Now, with BladePipe’s new scripting support, masking can be done with built-in scripts. You can set masking rules for specific fields. When the data sync task runs, it automatically calls the script and applies the transformation. That means: “Sync and mask data at the same time.”

This works for full data migration, incremental sync, data verification and correction.

BladePipe now supports built-in masking rules, including masking and truncation. You can mask your data in several flexible ways:

  • Keep only the part after a certain character
  • Keep only the part before a certain character
  • Mask the part after a certain character
  • Mask the part before a certain character
  • Mask a specific part of the string

Procedure

Here we show how to mask data in real time while replicating data from MySQL to MySQL.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Step 2: Add DataSources

  1. Log in to the BladePipe Cloud.
  2. Click DataSource > Add DataSource.
  3. Select the source and target DataSource type, and fill out the setup form respectively.

Step 3: Create a DataJob

  1. Click DataJob > Create DataJob.
  2. Select the source and target DataSources.
  3. Select Incremental for DataJob Type, together with the Full Data option.
  4. Select the tables to be replicated.
  5. In the Data Processing step, select the table on the left side of the page and click Operation > Data Transform.
  6. Select the column(s) that need data transformation, and click the icon next to Expression on the right side of the dialog box. Select the data transformation script in the pop-up dialog box, and click it to automatically copy the script.
  7. Paste the copied script into the Expression input box, and replace col in @params['col'] of the script with the corresponding column name.
  8. In the Test Value input box, enter a test value and click Test. Then you can view how the data is masked.
  9. Confirm the DataJob creation.
  10. Now the DataJob is created and started. The selected data is being masked in real time when moving to the target instance.

Wrapping Up

Data masking isn’t just a checkbox for compliance—it’s a smart move to protect your business and your users. Especially when working with real data in non-production environments or sharing it with others, static data masking gives you the safety net you need without slowing things down.

By integrating data masking directly into the data migration and sync process, tools like BladePipe make it easier than ever. No more juggling extra tools or writing custom code. You get clean, safe, ready-to-use data—all in one smooth step.

Whether you're testing, analyzing, or sharing data, masking should be part of your workflow. And now, it’s finally simple enough for everyone to use.

7 Best Change Data Capture (CDC) Tools in 2025

· 7 min read
John Li

Change Data Capture (CDC) is a technique that identifies and tracks changes to data stored in a database, such as inserts, updates, and deletes. By capturing these changes, CDC enables efficient data replication between systems without full data reloads. It’s widely used in modern data pipelines to power real-time analytics, maintain data lakes, update caches, and support event-driven architectures.

Why do You Need CDC?

  • Real-time Data Flow: As the name implies, data changes are captured as they happen in near real-time. So, when something updates in the source database, it's reflected almost immediately elsewhere. This feature perfectly suits the use cases requiring real-time change sync across different databases or systems.
  • Reduced Resource Requirements: By monitoring and extracting only database changes in real time, CDC requires fewer computing resources and delivers better performance, reducing operational costs.
  • Greater Efficiency: Only data that has changed is synchronized, which is exponentially more efficient than replicating an entire database and enhances the accuracy of data and analytics.
  • Agile Business Insights: CDC enables data collection in real-time, allowing teams across organizations to access recent data for making data-driven decisions quickly and improving accuracy of decision-making.

7 Best CDC Tools in 2025

Debezium

Debezium is an open-source distributed platform for change data capture. Built on top of Apache Kafka, Debezium captures row-level changes from various databases, like MySQL, PostgreSQL, MongoDB, and others, and streams these changes to Kafka for downstream processing.

Key Features:

  • Open source: Debezium is actively developed with a strong community, and it's free of cost.
  • Kafka Integration: It is built on Apache Kafka, enabling scalable, fault-tolerant streaming of change events.
  • Snapshot & Stream Modes: It can take an initial snapshot of existing data and then continue with real-time streaming.

Fivetran

Fivetran is a fully managed data integration platform that simplifies and automates the process of moving data from various sources into centralized destinations like data warehouses or lakes. It handles schema changes, data normalization, and continuous updates without manual intervention.

Key Features:

  • Real-Time Data Movement: It continuously updates data with low latency, using CDC where supported to reduce load and improve speed.
  • Data Normalization: It standardizes data structures and formats across sources to ensure consistency in your data warehouse.
  • Transformations with dbt Integration: It enables in-warehouse transformations using SQL or dbt, making it easy to prepare data for analytics.

Airbyte

Airbyte is an open-source data integration platform that supports log-based CDC from databases like Postgres, MySQL, and SQL Server. To assist log-based CDC, Airbyte uses Debezium to capture various operations like INSERT and UPDATE.

Key Features:

  • Open-Source & Extensible: It is fully open-source with a modular design that allows users to build and customize connectors easily.
  • A Wide Range of Connector Support: It supports over 300 connectors, enabling data ingestion from APIs, databases, SaaS tools, and more.
  • Orchestration Integration: It is compatible with Airflow and Dagster, allowing integration into existing workflows.

BladePipe

BladePipe is a real-time end-to-end data replication tool that moves data between 30+ databases, message queues, search engines, caching, real-time data warehouses, data lakes, etc.

BladePipe tracks, captures and delivers data changes automatically and accurately with ultra-low latency (less than 3 seconds), greatly improving the efficiency of data integration. It provides sound solutions for use cases requiring real-time data replication, fueling data-driven decision-making and business agility.

Key Features:

  • Real-time Data Sync: The latency is extremely low, less than 3 seconds in most cases.
  • Intuitive Operation: It offers a visual management interface for easy creation and monitoring of DataJobs. Almost all operations can be done by clicking the mouse.
  • Flexibility of Transformation: It supports filtering and mapping, and has multiple built-in data transformation scripts, which is friendly for non-developers. Users can also implement special transformations using custom code.
  • Data Accuracy: It supports data verification and correction right after replication, making it easy for users to check the accuracy and integrity of data in the target instance.
  • Monitoring & Alerting: It has built-in tools for monitoring task health, performance metrics, and error handling. It also supports various ways for alert notification.

Qlik Replicate

Qlik Replicate is a high-performance data replication and change data capture (CDC) solution designed to enable real-time data movement across diverse systems. It supports a wide range of source and target platforms, including relational databases, data warehouses, cloud services, and big data environments.

Key Features:

  • Cloud and Hybrid Support: It works across on-premises, cloud, and hybrid environments, suitable for building modern data architectures.
  • High Performance & Scalability: It is optimized for high-volume data replication with minimal impact on source systems.
  • Broad Source and Target Support: It supports a wide range of platforms including Oracle, SQL Server, MySQL, PostgreSQL, SAP, Mainframe, Snowflake, Amazon Redshift, Google BigQuery, and more.

Striim

Striim is a real-time data integration and streaming platform. With built-in change data capture (CDC) capabilities, Striim enables low-latency replication from transactional databases to modern destinations such as data warehouses, lakes, and analytics platforms.

Key Features:

  • Real-Time Data Integration: It captures and delivers data changes instantly using log-based CDC.
  • Source & Target Support: It supports a wide range of sources and destinations, including databases, data warehouses, lakes, etc.
  • User-friendly UI: It offers a drag-and-drop interface and SQL support for building, deploying, and managing data pipelines.

Oracle GoldenGate

Oracle GoldenGate is a software package for enabling the replication of data in heterogeneous data environments. It enables continuous replication of transactional data between databases, whether on-premises or in the cloud, with minimal impact on source systems.

Key Features:

  • Log-Based Replication: It uses transaction logs for non-intrusive, high-performance data capture without impacting source systems.
  • Cloud Integration: It integrates seamlessly with Oracle Cloud Infrastructure (OCI) and other cloud platforms for hybrid and multi-cloud deployments.
  • Data Transformation: It allows filtering, mapping, and transformation of data during replication.

How to Choose the CDC Tool that Works for You?

Choosing the right CDC tool depends on the specific needs and requirements of your organization. Here are some factors to consider:

  • Data Sources and Targets: Ensure that the CDC tool supports the data sources and targets you need to integrate.
  • Real-time Requirements: Evaluate the latency requirements of your applications and choose a CDC tool that can meet those needs.
  • Scalability: Consider the volume of data you need to process and choose a CDC tool that can scale to handle your workload.
  • Ease of Use: Look for a CDC tool that is easy to set up, configure, and manage.
  • Cost: Compare the pricing of different CDC tools and choose one that fits your budget.
  • Existing Infrastructure: Assess how well the CDC tool integrates with your current data infrastructure and tools.
  • Specific Use Cases: Align the tool's capabilities with your specific use cases, such as real-time analytics, data warehousing, or application integration.
  • Security and Compliance: Ensure the tool meets your organization's security and compliance requirements.
  • Support and Documentation: Check for comprehensive documentation, community support, and vendor support options.

Wrapping Up

CDC tools are about efficiency. They maintain consistency between systems without the cost of bulk data transfers, making real-time business insights possible. To choose the right CDC tool for your project, you have to consider multiple factors. Align a tool’s capabilities with your technical requirements and business goals, and select a CDC solution that ensures reliable, real-time data replication tailored to your project.

If you are looking for an efficient, stable and easy-to-use CDC tool, BladePipe is well-placed as it offers an out-of-the-box solution for real-time data movement. Whether you're building real-time analysis, syncing data across services, or preparing datasets for machine learning, BladePipe helps you move and shape data quickly, reliably, and efficiently.

What is Geo-Redundancy? A Comprehensive Guide

· 4 min read
John Li

Geo-redundancy is the practice of replicating and storing your critical IT infrastructure and data across multiple locations strategically.

Why Geo-Redundancy is Needed?

The main aim is to ensure continuous availability and resilience against local failures or disasters. Imagine your system is built in a single data center or region: what happens if a power outage hits that region? A catastrophe for your business. However, if you replicate systems and data across different regions in advance, your workload can fail over to another available data center, and the service will always stay online.

Another vital purpose of geo-redundancy is backup and data protection. Compared with single-location data storage, geo-redundancy safeguards data by replicating and maintaining copies of data in multiple places, minimizing the risk of data loss.

How Geo-Redundancy Works?

Geo-redundancy can be implemented using two primary patterns:

  • Active-Active: All regions are operational and handle requests simultaneously. This ensures load balancing and fault tolerance but requires robust synchronization mechanisms to maintain data consistency.

  • Active-Passive: A secondary region remains on standby and takes over only if the primary region fails. This is simpler to implement but may result in underutilized resources.

How to Set Up Geo-Redundancy?

To establish an effective geo-redundant system, the following steps can be considered:

  1. Assess Business Requirements: Determine the number of data centers to be deployed based on the scale and impact of the business. Then, decide the locations of the data centers according to the distribution of users and their access needs.

  2. Replicate Data: Select the data replication mode that is right for your business, and start to replicate data across chosen geographic locations, ensuring that replication methods align with the consistency requirements.

  3. Establish Failover Procedures: Develop and document procedures for automatic or manual failover to secondary systems, ensuring minimal downtime during transitions.

  4. Monitor and Regularly Test: Establish a monitoring system to monitor each data center and system components in real time to promptly detect and handle potential problems. Conduct failover and disaster recovery tests periodically to validate the effectiveness of geo-redundant configurations and update procedures based on test outcomes.

Common Challenges

Setting up and maintaining a running geo-redundant system is a complex process, and the challenges you may be concerned about include:

  • Data Consistency: Data is replicated among several data centers, making it hard to track and verify data consistency.

  • Cost Management: Deploying and maintaining multiple data centers can significantly increase operational costs.

  • Complexity of Configuration: Setting up geo-redundancy requires careful planning and expertise to avoid misconfigurations that could compromise system integrity.

  • Latency and Performance: Long distances between regions can introduce latency, affecting your system's performance.

How BladePipe Helps to Achieve Geo-Redundancy?

BladePipe, a real-time end-to-end data replication tool, presents various features to reduce the complexity of a geo-redundancy solution.

  • Real-time Data Sync: BladePipe replicates data between databases, data warehouses and other data sources using the change data capture (CDC) technique. Only changed data is replicated, keeping latency extremely low.

  • Bidirectional Data Flow: BladePipe can realize two-way data sync without circular data replication. This functionality plays a key role in realizing Active-Active geo-redundancy.

  • Data Verification and Correction: The built-in data verification and correction functionality helps to check the data on a regular basis, safeguarding data integrity and consistency.

  • User-friendly Interface: All operations in BladePipe are done in an intuitive way by clicking the mouse. No code is required.

Conclusion

Geo-redundancy is an essential component of modern IT infrastructure. By understanding its key concepts, organizations can build resilient systems capable of withstanding regional failures and minimizing downtime. BladePipe, as a real-time data movement tool, is a perfect choice to help establish a robust geo-redundant system, making the whole process efficient, time-saving and effortless.

Data Verification - Definition, Benefits and Best Practice

· 5 min read
Zoe

When data moves from one system to another, you may have a question: is all the data stored correctly in the target system? If not, how can I identify the missing or wrong data? Data verification is introduced to resolve this concern. Verification acts as a safeguard, ensuring that all data is accurately replicated, intact, and functional in the new system.

What is Data Verification?

Data verification is the process of ensuring that all data has been accurately and completely replicated from the source instance to the target instance. It involves validating data integrity, consistency, and correctness to confirm that no data is lost, altered, or corrupted during the replication process.

Why Data Verification is Needed?

Ensuring Data Quality

In data replication, some data records may be skipped or fail to reach the target instance, resulting in data loss and inconsistencies. Verification plays a key role in ensuring that data is completely and accurately moved from the source to the target.

Key aspects of data verification:

  • Completeness: Ensure that all data of the source instance is present in the target instance.
  • Integrity: Confirm that the data has not been altered or tampered with.
  • Consistency: Verify that the data in the source instance is in line with that in the target instance.

Enhancing Data Reliability

Stakeholders, including users and management, need confidence that the data replication is successfully done. Data verification provides solid evidence on data reliability. When data is verified, users have more trust in what they get, and more confidence to use the data for analytics.

Supporting Decision-making

Accurate and complete data is the backbone of data-driven insights. Any minor inconsistency, if not identified and corrected, may lead to misunderstanding and huge costs. Data verification ensures that the data represents the accurate, real situation, offering a basis for wise decision-making.

How to Verify Data?

Manual Verification

Manual verification involves human effort to check data integrity, completeness, and consistency. For small datasets or specific cases requiring human judgment, you may find it a cost-effective choice, because no specialized tools are needed. However, when there are hundreds of thousands of records to be verified, the manual way is time-consuming and labor-intensive, and human errors tend to occur. That makes it hard to trust the data quality even after verification.

Automated Verification

Compared with the manual way, automated tools are faster and more efficient, especially for large datasets. A large volume of data can be verified in only a few seconds, helping accelerate your data replication project. No human intervention is needed in this process, reducing human errors and ensuring the consistency of every verification. Also, automated tools can usually correct discrepancies automatically, saving much of your time and energy.

Best Practice

Here, we introduce a tool for automatic data verification and correction after data replication -- BladePipe.

BladePipe fetches data from the source instance batch by batch, then uses the primary key to fetch the corresponding data from the target instance using SQL IN or RANGE queries. Rows with no matching data in the target are marked as Loss, and each remaining row is compared field by field.

By default, all data is verified. You can also narrow the range of data to be verified using filtering conditions. For discrepancies, BladePipe performs two additional re-checks to minimize false results caused by data sync latency, which significantly improves the quality of the verification results.
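
A rough, hypothetical sketch of this batch-and-compare approach in Python might look like the following. It is meant to illustrate the idea, not BladePipe's code: the table, column names, and sqlite3 connections are stand-ins for real source and target databases.

```python
import sqlite3

BATCH_SIZE = 1000

def verify_table(source, target, table: str, pk: str = "id"):
    """Compare one table batch by batch: report lost rows and field-level diffs."""
    lost, diffs, last_pk = [], [], 0
    while True:
        src_rows = source.execute(
            f"SELECT * FROM {table} WHERE {pk} > ? ORDER BY {pk} LIMIT ?",
            (last_pk, BATCH_SIZE)).fetchall()
        if not src_rows:
            break
        ids = [row[0] for row in src_rows]
        placeholders = ", ".join("?" * len(ids))
        tgt_rows = {row[0]: row for row in target.execute(
            f"SELECT * FROM {table} WHERE {pk} IN ({placeholders})", ids)}
        for row in src_rows:
            match = tgt_rows.get(row[0])
            if match is None:
                lost.append(row[0])      # no matching row in target -> Loss
            elif tuple(row) != tuple(match):
                diffs.append(row[0])     # field-by-field mismatch
        last_pk = ids[-1]
    return lost, diffs

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    tgt = sqlite3.connect(":memory:")
    for conn in (src, tgt):
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    src.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
    tgt.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (3, "x")])
    print(verify_table(src, tgt, "users"))  # ([2], [3]): row 2 lost, row 3 differs
```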

With BladePipe, data can be verified and corrected in a few clicks.

Step 1: Install BladePipe

Follow the instructions in Install Worker (Docker) or Install Worker (Binary) to download and install a BladePipe Worker.

Step 2: Add DataSources

  1. Log in to the BladePipe Cloud.
  2. Click DataSource > Add DataSource.
  3. Select the source and target DataSource type, and fill out the setup form respectively.

Step 3: Create a DataJob

  1. Click DataJob > Create DataJob.

  2. Select the source and target DataSources, and click Test Connection to ensure the connection to the source and target DataSources are both successful.

  3. Select Verification and Correction for DataJob Type, and configure the following items:

    • Select One-time for Verification.
    • Select Correction Mode: Revise after Check / NONE.
      • Revise after Check: The data will be automatically corrected after the verification is completed.
      • NONE: The data will not be automatically corrected after the verification is completed.

  4. Select the tables to be verified. Only existing tables can be selected.
  5. Select the columns to be verified.
  6. Confirm the DataJob creation. Then go back to the DataJob page, and check the data verification result.

Summary

Data verification is a vital process in data migration and sync to ensure data accuracy, consistency, and completeness. With automated tools like BladePipe, data verification is easier than ever before. In just a few clicks, data can be verified and corrected right after migration and sync.