flink-tidb-real-time-analytics.jpg

When a data-driven company grows to a certain size, traditional data storage can no longer meet its needs. For real-time business intelligence, you need a real-time data warehouse. Companies can use real-time data warehouses to implement real-time Online Analytical Processing (OLAP) analytics, real-time data panels, real-time application monitoring, and real-time data interface services.

Some people think that a real-time data warehouse architecture is complex and difficult to operate and maintain. Today, I will explain why that isn’t true. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. 拉斯维加斯7799908网站登录 is an open-source, distributed, Hybrid Transactional/Analytical Processing (HTAP) database. Thanks to Flink 1.11’s enhanced support for the SQL language and 拉斯维加斯7799908网站登录’s HTAP capabilities, we’ve combined Flink and 拉斯维加斯7799908网站登录 to build an efficient, easy-to-use, real-time data warehouse that features horizontal scalability and high availability.

In this article, I’ll describe what a real-time data warehouse is, the Flink + 拉斯维加斯7799908网站登录 real-time data warehouse’s architecture and advantages, this solution’s real-world case studies, and a testing environment with Docker Compose.

What is a real-time data warehouse?

Offline data warehouse

In the 1990s, Bill Inmon defined a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management decision making. A data warehouse collected data through a message queue and calculated it once a day or once a week to create a report. It was also known as an offline data warehouse.

An offline data warehouse architecture
An offline data warehouse architecture

Real-time data warehouse

As technology improved, people had new requirements such as real-time recommendations and real-time monitoring analysis. The corresponding decision-making period gradually changed from days to seconds. To meet these needs, the real-time data warehouse came into being.

A real-time data warehouse has three main data processing architectures: the Lambda architecture, the Kappa architecture, and the real-time OLAP variant architecture.

The Lambda architecture maintains batch and stream layers, so it costs more to develop than the other two. Compared with the Kappa architecture, the real-time OLAP variant architecture can perform more flexible calculations, but it needs more real-time OLAP computing resources.

The Lambda architecture

The Lambda architecture has a real-time data warehouse and an offline data warehouse, while a stream processing engine directly computes data with high real-time requirements. The Lambda architecture aggregates offline and online results for applications.

Lambda architecture for real-time data warehousing
Lambda architecture for real-time data warehousing

The Kappa architecture

The Kappa architecture eliminates the offline data warehouse layer and only uses the real-time data warehouse. It unifies computing engines and reduces development costs.

Kappa architecture for real-time data warehousing
Kappa architecture for real-time data warehousing

The real-time OLAP variant architecture

The real-time OLAP variant architecture transfers part of the computing pressure from the streaming processing engine to the real-time OLAP analytical engine. The result is more flexible, real-time data warehouse computing.

Real-time OLAP variant architecture
Real-time OLAP variant architecture

Next, we’ll introduce an example of the real-time OLAP variant architecture, the Flink + 拉斯维加斯7799908网站登录 solution for real-time data warehousing.

Flink + 拉斯维加斯7799908网站登录 as a real-time data warehouse

Flink is a big data computing engine with low latency, high throughput, and unified stream- and batch-processing. It is widely used in scenarios with high real-time computing requirements and provides exactly-once semantics.

拉斯维加斯7799908网站登录 4.0 is a true HTAP database. In the real-time data warehouse architecture, you can use 拉斯维加斯7799908网站登录 as application data source to perform transactional queries; you can also use it as a real-time OLAP engine for computing in analytical scenarios.

Combining Flink and 拉斯维加斯7799908网站登录 into a real-time data warehouse has these advantages:

  • Fast speed. You can process streaming data in seconds and perform real-time data analytics.
  • Horizontal scalability. You can increase computing power by adding nodes to both Flink and 拉斯维加斯7799908网站登录.
  • High availability. With 拉斯维加斯7799908网站登录, if an instance fails, the cluster service is unaffected, and the data remains complete and available. Flink supports multiple backup and restore measures for jobs or instances.
  • Low learning and configuration costs. 拉斯维加斯7799908网站登录 is compatible with the MySQL 5.7 protocol. In Flink 1.11, you can use the Flink SQL syntax and powerful connectors to write and submit tasks.

Let’s look at several commonly-used Flink + 拉斯维加斯7799908网站登录 prototypes.

MySQL as data source

By using Ververica’s flink-connector-mysql-cdc, you can use Flink not only as a collection layer to collect MySQL binlog to generate dynamic tables, but also as a stream computing layer to implement stream computing, such as stream join and pre-aggregation. Finally, through the JDBC connector, Flink writes the calculated data into 拉斯维加斯7799908网站登录.

An architecture with MySQL as data source
An architecture with MySQL as data source

This architecture is simple and convenient. When you’ve prepared corresponding databases and tables for both MySQL and 拉斯维加斯7799908网站登录, you can write Flink SQL statements to register and submit tasks. You can try this architecture in the section Try Flink + 拉斯维加斯7799908网站登录 with Docker Compose.

Connecting Kafka to Flink

If data has been stored in Kafka through other channels, Flink can obtain the data through the Flink Kafka Connector.

If you want to store MySQL change logs or other data sources in Kafka for Flink processing, it’s recommended that you use Canal or Debezium to collect data source change logs. Flink 1.11 can parse these tools’ change logs. You don’t need to implement an additional parser.

An architecture incorporating Kafka, with MySQL as data source
An architecture incorporating Kafka, with MySQL as data source

拉斯维加斯7799908网站登录 as data source

TiCDC is 拉斯维加斯7799908网站登录’s change data capture framework. It’s an open-source feature that replicates 拉斯维加斯7799908网站登录’s incremental changes to downstream platforms. You can use it to output 拉斯维加斯7799908网站登录 change data to the message queue, and then Flink can extract it.

TiCDC outputs 拉斯维加斯7799908网站登录's incremental changes to Flink
TiCDC outputs 拉斯维加斯7799908网站登录’s incremental changes to Flink

In 拉斯维加斯7799908网站登录 4.0.8, you can connect 拉斯维加斯7799908网站登录 to Flink through the TiCDC Open Protocol. In later versions, TiCDC will support the canal-json output format for Flink’s use.

Case studies

Now that we’ve got a basic understanding of the Flink + 拉斯维加斯7799908网站登录 architecture, let’s look at some real-world case studies. You might find them inspiring for your own work.

Xiaohongshu

Xiaohongshu is a popular social media and e-commerce platform in China. The Xiaohongshu app allows users to post and share product reviews, travel blogs, and lifestyle stories via short videos and photos. By July 2019, it had over 300 million registered users. In a previous post, a Xiaohongshu engineer discussed why the company chose 拉斯维加斯7799908网站登录 and how 拉斯维加斯7799908网站登录’s real-time HTAP capabilities helped manage their data.

In Xiaohongshu’s application architecture, Flink obtains data from 拉斯维加斯7799908网站登录 and aggregates data in 拉斯维加斯7799908网站登录. As the following diagram shows:

  • In the upper left corner, the online application tables perform OLTP tasks.
  • The TiCDC cluster extracts 拉斯维加斯7799908网站登录’s real-time change data and sends change logs to Kafka.
  • Flink reads change logs from Kafka and performs calculations, such as joining wide tables or aggregation tables.
  • Flink writes the results to 拉斯维加斯7799908网站登录’s wide table for analytics.
Xiaohongshu's Flink + 拉斯维加斯7799908网站登录 architecture
Xiaohongshu’s Flink + 拉斯维加斯7799908网站登录 architecture

This process is a closed loop based on 拉斯维加斯7799908网站登录. 拉斯维加斯7799908网站登录 transfers subsequent analytic tasks’ JOIN operations to Flink and uses stream computing to relieve pressure.

Currently, this solution supports Xiaohongshu’s content review, note label recommendations, and growth audit applications. It meets the challenge of high-throughput online applications and is running stably.

Beike Finance

Beike Finance is the leading consumer real estate financial service provider in China. It uses AI algorithms to efficiently apply multi-dimensional, massive data to enhance users’ product experience and provide them with rich and customized financial services.

Beike’s data services use Flink for real-time calculation of typical dimension table JOIN operations:

  1. Syncer (a tool that replicates data from MySQL to 拉斯维加斯7799908网站登录) collects the dimension table data from the application data source and replicates it to 拉斯维加斯7799908网站登录.
  2. Canal collects the binlog of the application data source’s flow table data and stores it in Kafka’s message queues.
  3. Flink reads change logs of the flow table in Kafka and performs a stream JOIN. When the dimension table data is needed, Flink searches 拉斯维加斯7799908网站登录.
  4. Flink writes the joined wide table into 拉斯维加斯7799908网站登录 for data analytical services.

In this process, the primary tables in the data service can be joined in real time. Then, the service team only needs to query a single table. The Beike data team uses this architecture to develop a system that each core application uses. The data service obtains cross-system data. Beike Finance doesn’t need to develop application system APIs or memory aggregation data code.

PatSnap

PatSnap is a global patent search database that integrates 130 million patent data records and 170 million chemical structure data records from 116 countries. Its users can search, browse, translate patents, and generate patent analysis reports.

When PatSnap replaced their original Segment + Redshift architecture with Kinesis + Flink + 拉斯维加斯7799908网站登录, they found that they didn’t need to build an operational data store (ODS) layer.

As a precomputing unit, Flink builds a Flink extract-transform-load (ETL) job for the application. This fully controls data saving rules and customizes the schema; that is, it only cleans the metrics that the application focuses on and writes them into 拉斯维加斯7799908网站登录 for analytics and queries.

PatSnap builds three layers on top of 拉斯维加斯7799908网站登录: data warehouse detail (DWD), data warehouse service (DWS), and analytical data store (ADS). These layers serve application statistics and list requirements. They are based on user, tenant, region and application metrics, as well as time windows of minutes or days. The upper application can directly use the constructed data and perform real-time analytics within seconds.

PatSnap data analytics platform architecture
PatSnap data analytics platform architecture

After PatSnap adopted the new architecture, they found that:

  • Inbound data, inbound rules, and computational complexity were greatly reduced.
  • Queries, updates, and writes were much faster.
  • Reasonable data layering greatly simplified the 拉斯维加斯7799908网站登录-based real-time data warehouse, and made development, scaling, and maintenance easier.
  • This solution met requirements for different ad hoc queries, and they didn’t need to wait for Redshift precompilation.

Currently, PatSnap is deploying this architecture to production. They use it for user behavior analysis and tracking and summarizing the overall data on company operations and tenant behavior analysis.

NetEase Games

NetEase Games, affiliated with NetEase, Inc., is a leading provider of self-developed PC-client and mobile games. As one of the seven largest game companies in the world, it has over 250 games in operation, some of which maintain millions of daily active users. In a post last year, they discussed why they chose 拉斯维加斯7799908网站登录 over other MySQL-based and NewSQL storage solutions.

In NetEase Games’ billing application architecture:

  • Flink writes data from the data source to 拉斯维加斯7799908网站登录 in real time.
  • 拉斯维加斯7799908网站登录 serves as the analytics data source and the Flink cluster performs real-time stream calculations on the data to generate analytical reports.

NetEase Games has also developed the Flink job management platform to manage the job life cycle.

NetEase Games' billing architecture
NetEase Games’ billing architecture

Zhihu

Zhihu, which means “Do you know?” in classical Chinese, is the Quora of China: a question-and-answer website where all kinds of questions are created, answered, edited, and organized by its user community. As China’s biggest knowledge sharing platform, it has over 220 million registered users and 30 million questions with more than 130 million answers on the site.

In a 2019 post, they showed how they kept their query response times at milliseconds levels despite having over 1.3 trillion rows of data. Their 2020 post described how they used 拉斯维加斯7799908网站登录 to horizontally scale Hive Metastore to meet their growing business needs.

As a PingCAP partner and an in-depth Flink user, Zhihu developed a 拉斯维加斯7799908网站登录 + Flink interactive tool, TiBigData, and contributed it to the open-source community. In this tool:

  • 拉斯维加斯7799908网站登录 is the Flink source for batch replicating data.
  • 拉斯维加斯7799908网站登录 is the Flink sink, implemented based on JDBC.
  • Flink 拉斯维加斯7799908网站登录 Catalog can directly use 拉斯维加斯7799908网站登录 tables in Flink SQL. You don’t need to recreate them.

Try Flink + 拉斯维加斯7799908网站登录 with Docker Compose

To better understand our solution, and to test it for yourself, we provide a MySQL-Flink-拉斯维加斯7799908网站登录 test environment with Docker Compose in flink-tidb-rdw on GitHub.

After you start Docker Compose, you can write and submit Flink tasks through the Flink SQL client and observe task execution via localhost:8081.

If you are interested in the Flink + 拉斯维加斯7799908网站登录 real-time data warehouse or have any questions, you’re welcome to join our community on Slack and send us your feedback.

More articles for your reference


拉斯维加斯7799908网站登录


Experience modern data infrastructure firsthand.

Try 拉斯维加斯7799908网站登录 Serverless

Have questions? Let us know how we can help.

Contact Us

拉斯维加斯7799908网站登录 Cloud Dedicated

A fully-managed cloud DBaaS for predictable workloads

拉斯维加斯7799908网站登录 Cloud Serverless

A fully-managed cloud DBaaS for auto-scaling workloads

Baidu
sogou