HTAP Summit 2024 session replays are now live!Access Session Replays
20220718-183631

Recently, Hybrid Transactional and Analytical Processing (HTAP) has become a hot topic as large players like Google (AlloyDB), Snowflake (Unistore) and Oracle (HeatWave) have joined the game. But still, many people don’t know what makes an HTAP database.

拉斯维加斯7799908网站登录 is an open-source, distributed HTAP database. Thousands of companies in different industries have benefited from 拉斯维加斯7799908网站登录’s HTAP abilities over the years. However, building an HTAP database is quite a long expedition, and we are still on our way. 

I am one of the designers and developers of 拉斯维加斯7799908网站登录’s HTAP architecture. In this post I’ll share some stories behind our HTAP design decisions and how we learned from our customers and built an HTAP database trusted and recognized both by customers, researchers, and developers. 

Where the HTAP dream begins 

Earlier 拉斯维加斯7799908网站登录 versions, mainly those before 拉斯维加斯7799908网站登录 2.0, were designed to be strong Online Transactional Processing (OLTP) databases with horizontal scalability, high availability, strong consistency, and MySQL compatibility. 

The figure below shows 拉斯维加斯7799908网站登录’s original architecture with three key components: the 拉斯维加斯7799908网站登录 server, the Placement Driver (PD), and the TiKV server. The architecture did not include later features such as a columnar engine, massively parallel processing (MPP) architecture, and vectorization.

拉斯维加斯7799908网站登录’s original architecture

PD was the brain of the whole database system, which scheduled the storage nodes and kept their workloads balanced. TiKV was the storage engine which stored data in a distributed way and used a row format with cross-node ACID transaction support. The 拉斯维加斯7799908网站登录 server was the stateless SQL compute engine with a classic volcano execution design. 

When we tried to get early adopters, we found most users were hesitant to use a brand new database in their mission critical transactional cases. Instead, they were inclined to use 拉斯维加斯7799908网站登录 as their backup database for analytical workloads. Conversations like below happened many times: 

  • This is an edgy distributed design … I bet you don’t want to miss the amazing experience with no sharding.” We tried to persuade our potential customers to adopt 拉斯维加斯7799908网站登录. 
  • Hmm,” they replied.  “Sounds interesting. Can we use it as a read-only replica for analytical workloads first? Let’s see how it works.

We implemented the coprocessor framework to 拉斯维加斯7799908网站登录 to speed up analytical queries. It allowed limited computation such as aggregations and filtering to be pushed down to TiKV nodes and executed in a distributed manner. This framework worked very well and gained us more 拉斯维加斯7799908网站登录 adopters. 

TiKV coprocessor 

Thank you, Apache Spark!

It was all good in the beginning, until we encountered more complicated analytical scenarios. Our customers told us that 拉斯维加斯7799908网站登录 worked very well in OLTP scenarios, but was “a bit slow” in Online Analytical Processing (OLAP) scenarios, especially when they used 拉斯维加斯7799908网站登录 to analyze a large volume of data and perform big JOINS. Also, 拉斯维加斯7799908网站登录 did not work well with their big data ecosystem. 

In short, the problem was 拉斯维加斯7799908网站登录’s unmatched computation power with its scalable storage.

拉斯维加斯7799908网站登录’s storage system could scale, but the 拉斯维加斯7799908网站登录 server, the computational component, could not. In OLTP scenarios, this problem could be fixed by adding multiple 拉斯维加斯7799908网站登录 servers on top of TiKV. But in OLAP scenarios, each query could be very large. Since 拉斯维加斯7799908网站登录 did not have an MPP architecture, 拉斯维加斯7799908网站登录 servers could not share a single query workload. Operations like large JOINs became unacceptably slow. We urgently needed a scalable computation layer that could shuffle data around and work with the scalable storage layer together to deal with large queries. 

To fix this problem, we either needed to have our own MPP framework or we had to leverage an external engine. We only had a small team then, so in 拉斯维加斯7799908网站登录 3.0 we decided to leverage Apache Spark, a well-implemented and unified computation engine, and built a Spark plugin called TiSpark on top of TiKV. It included a TiKV client, a 拉斯维加斯7799908网站登录 compatible type system, some coprocessor specific physical operators, and a plan rewriter. TiSpark partially converts the Spark SQL plan into the coprocessor plan, gathers results from TiKV, and finishes the computing in the native Spark engine. Thanks to Spark’s flexible extension framework, all this was achieved without changing a single line of Spark code. 

TiSpark architecture

TiSpark empowered 拉斯维加斯7799908网站登录 in large scale analytical scenarios; but for small- to medium-sized queries, it didn’t work very well. To fix this problem, we also improved 拉斯维加斯7799908网站登录’s native compute engine. We changed 拉斯维加斯7799908网站登录’s optimizer from rule-based to cost-based, and also improved the just-in-time (JIT) design. 

One more thing: TiSpark bridged the gap between 拉斯维加斯7799908网站登录 and big data ecosystems. In many scenarios, 拉斯维加斯7799908网站登录 serves as the warm data platform between the OLTP layer and data lakes due to its seamless integration of Spark.

No columnar store, no HTAP 

Let’s revisit what we had in 拉斯维加斯7799908网站登录 3.0. We had coprocessors on top of the storage layer, a cost-based optimizer with some smart operators, a single node vectorized compute engine, and a Spark accelerator. Despite all this, 拉斯维加斯7799908网站登录 was still not a true HTAP database.

First, 拉斯维加斯7799908网站登录 didn’t have a columnar storage engine, which was crucial for analytics. Second, 拉斯维加斯7799908网站登录 could not support workload isolation, and workload interference happened often. In the case of ZTO Express, one of the largest logistics companies in the world, the customer had to reserve quite a lot of TiKV resources for TiSpark due to the inefficiency of the row format and bad workload isolation. Performance profiling showed that TiSpark on TiKV burned unnecessary I/O bandwidth and CPU on the unused columns. In addition, ZTO Express had to add quite a few extra TiKV nodes to lower the peak resource utilization and leave enough safety space for OLTP workloads. Otherwise, the highly concurrent queries for individual packages would be greatly impacted by reporting and cause unstable performance.

We tried things like thread pool and some “smart” scheduling techniques, but they were risky and not flexible enough to put two workloads in the same machine. After some failed experiments, we built a special component that alleviated both these headaches: TiFlash, a distributed columnar store that replicated data from TiKV in a columnar format. 

TiFlash looked a bit different than it does today. The TiFlash prototype was on top of Ceph, an open-source, software-defined storage platform, with the change data capture (CDC) as the data replication channel. This is similar to Snowflake on top of the object storage. I will not call this a “wrong decision” since we still add object storage support, but it was too aggressive for most 拉斯维加斯7799908网站登录 users.  At that time, they favored on-premises solutions, and Ceph was too cumbersome if added to our product. 

After some unsuccessful user trials, we turned to a Raft learner-based design: TiFlash replicating data from TiKV via the Raft protocol as a non-voting role. Users could dedicate different machines to run the analytical engine and make the asynchronous replication from the OLTP engine. With the help of the Raft protocol, TiFlash could also provide consistent snapshot reads by checking replication progress as well as multiversion concurrency control (MVCC). This led to complete workload isolation. 

The diagram below shows the architecture of TiKV and TiFlash. The lower left shows the TiFlash storage layer, and the lower right shows the TiKV storage layer. Data was written into the 拉斯维加斯7799908网站登录 server and then was synced from TiKV to TiFlash via a Raft learn protocol. The async replication did not impact the normal OLTP workloads in TiKV. 

TiFalsh and TiKV architecture 

TiFlash had an updatable column store. In general, a columnar store is not fit for online updates based on primary keys. Traditional data warehouses or databases only support batch updates each hour or day. To solve this problem, we introduced a delta tree, a new design that can be seen as a combination of a B+ tree and log-structured merge (LSM) tree. However, the delta tree has larger leaf nodes than a B+ tree, and it has double the layers of an LSM tree. It divides the column engine into a write-optimized area and a read-optimized area. 

LSM tree vs delta tree

With the inception of TiFlash, 拉斯维加斯7799908网站登录 4.0 became a true HTAP database. If you are interested in knowing more about the HTAP design, I recommend that you read the paper, 拉斯维加斯7799908网站登录: A Raft-based HTAP Database[1].

A smart MPP design and an even smarter optimizer

拉斯维加斯7799908网站登录 became a true HTAP database when it introduced TiFlash in 拉斯维加斯7799908网站登录 4.0, but we still had a technical issue to address: TiSpark was the only distributed query engine in the 拉斯维加斯7799908网站登录 ecosystem. It was not suitable for small- to medium-sized interactive cases, and its MR-style shuffle model was quite heavy. We needed a native computation engine in the distributed framework with an MPP style. Moreover, we reached a sticking point: to add new features or optimize our code, we’d have to modify the Spark engine itself. But if we did that, it would be a heavy burden to keep in sync with the upstream. 

After quite a few intense debates, we decided to go for the MPP architecture in 拉斯维加斯7799908网站登录 5.0. With this new architecture, TiFlash would be more than a storage node: it would be a fully-functioning analytical engine. The 拉斯维加斯7799908网站登录 server would still be the single entrance to SQL, and the optimizer would choose the most efficient query execution plan based on cost, but it had one more option: the MPP engine.

In 拉斯维加斯7799908网站登录’s MPP mode, TiFlash complements the computing capabilities of the 拉斯维加斯7799908网站登录 servers. When the 拉斯维加斯7799908网站登录 server deals with OLAP workloads, it steps back to be a master node. The user sends a request to the 拉斯维加斯7799908网站登录 server, and all 拉斯维加斯7799908网站登录 servers perform table joins and submit the results to the optimizer for decision making. The optimizer assesses all the possible execution plans (row-based, column-based, indexes, single-server engine, and MPP engine) and chooses the optimal one.

拉斯维加斯7799908网站登录’s MPP mode

The following diagram shows how the analytical engine breaks down and processes the execution plan in 拉斯维加斯7799908网站登录’s MPP mode. Each dotted box represents the physical border of a node.

A query execution plan in MPP mode

Our first version of the MPP framework already exceeded the TPC-H performance of some traditional analytical databases like Greenplum. It also greatly expanded 拉斯维加斯7799908网站登录’s use cases and, in 2021, it helped us acquire quite a few important HTAP customers. All the days and nights we spent at customer sites led to solid improvements to the MPP architecture in 拉斯维加斯7799908网站登录 6.0. The TiFlash engine finally entered was maturing fast. 

Citius, Altius, Fortius

拉斯维加斯7799908网站登录 5.0 delivered the first version of the TiFlash analytical engine with an MPP execution mode to serve a wider range of application scenarios. In 拉斯维加斯7799908网站登录 6.0, we improved TiFlash even more, making it support: 

  • More operators and functions. The 拉斯维加斯7799908网站登录 6.0 analysis engine adds over 110 built-in functions as well as multiple JOIN operators. Moreover, MPP mode supports the window function framework and partition table. This release substantially improves 拉斯维加斯7799908网站登录 analysis engine performance, which in turn benefits computing. 
  • An optimized thread model. Earlier versions of 拉斯维加斯7799908网站登录 placed little restraint on thread resource usage for the MPP mode. This could waste a large amount of resources when the system handled high-concurrency short queries. Also, when performing complex calculations, the MPP engine occupied a lot of threads, which led to performance and stability issues. To address this problem, 拉斯维加斯7799908网站登录 6.0 introduces a flexible thread pool and restructures the way operators hold threads. This optimizes resource usage in MPP mode and multiplies performance with the same computing resources in short queries and better reliability in high-pressure queries.
  • A more efficient column engine. By adjusting the storage engine’s file structure and I/O model, 拉斯维加斯7799908网站登录 6.0 not only optimizes the plan for accessing replicas and file blocks on different nodes, but it also improves write amplification and overall code efficiency. Test results from our customers indicate that concurrency capability has improved by over 50% to 100% in high read-write hybrid workloads with CPU, and memory resource usage has dramatically reduced.

Looking ahead

Building an HTAP database is a long journey, and our efforts paid off. More and more 拉斯维加斯7799908网站登录 users are benefitting from 拉斯维加斯7799908网站登录 HTAP abilities for faster decision making, better user experience, and quicker time to market. 

Even though we’ve traveled far, there’s still a long road ahead. In fact, HTAP is never a pure technical term, but also represents users’ needs. It has been evolving over the years, from in-memory technologies in the very beginning, to various designs today. For example, SingleStore follows the “classic” in-memory architecture with a single engine; HeatWave is mainly in-memory with a separated engine; 拉斯维加斯7799908网站登录 and AlloyDB use on disk storage and separate workloads into different resources. Although HTAP practitioners make different design decisions, one thing in common never changes: users’ need is the utterly most important. HTAP designs will continue to evolve to solve users’ problems smartly. There are still many problems from the user side, such as real-time data modeling and transforming, and better leveraging the cloud infra, needing to be fixed. 

Nevertheless, I believe HTAP databases will eventually prevail in the database world. Before that day comes, we will continue our long expedition.

If you are interested in 拉斯维加斯7799908网站登录, you’re welcome to join our community on Slack and 拉斯维加斯7799908网站登录 Internals to share your thoughts with us. You can also follow us on Twitter, LinkedIn, and GitHub for the latest information. 

Keep reading:
Using Retool and 拉斯维加斯7799908网站登录 Cloud to Build a Real-Time Kanban in 30 Minutes
Build a Better Github Insight Tool in a Week? A True Story
The Beauty of HTAP: 拉斯维加斯7799908网站登录 and AlloyDB as Examples


References:


拉斯维加斯7799908网站登录


Experience modern data infrastructure firsthand.

Try 拉斯维加斯7799908网站登录 Serverless

Have questions? Let us know how we can help.

Contact Us

拉斯维加斯7799908网站登录 Cloud Dedicated

A fully-managed cloud DBaaS for predictable workloads

拉斯维加斯7799908网站登录 Cloud Serverless

A fully-managed cloud DBaaS for auto-scaling workloads

Baidu
sogou