# Crab-tree: A Crash Recoverable B+-tree Variant for Persistent Memory with ARMv8 Architecture

#### CHUNDONG WANG\*, ShanghaiTech University, China

SUDIPTA CHATTOPADHYAY, Singapore University of Technology and Design, Singapore

GUNAVARAN BRIHADISWARN<sup>†</sup>, University of Moratuwa, Sri Lanka

In recent years, the next-generation non-volatile memory (NVM) technologies have emerged with DRAM-like byte addressability and disk-like durability. Computer architects have proposed to use them to build *persistent* memory that blurs the conventional boundary between volatile memory and non-volatile storage. On the other hand, ARM processors, ones that are widely used in embedded computing systems, start providing architectural supports to utilize NVM since ARMv8. In this paper, we consider tailoring B+-tree for NVM operated by a 64-bit ARMv8 processor. We first conduct an empirical study of performance overhead in writing and reading data for a B+-tree with an ARMv8 processor, including the time cost of cache line flushes and memory fences for crash consistency as well as the execution time of binary search compared to that of linear search. We hence identify the key weaknesses in the design of B+-tree with ARMv8 architecture. Accordingly, we develop a new B+-tree variant, namely, crash recoverable ARMv8-oriented B+-tree (Crab-tree). To insert and delete data at runtime, Crab-tree selectively chooses one of two strategies, i.e., copy on write and shifting in place, depending on which one causes less consistency cost. Crab-tree regulates a strict execution order in both strategies and recovers the tree structure in case of crashes. To further improve the performance of Crab-tree, we employ three methods to reduce software overhead, cache misses, and consistency cost, respectively. We have implemented and evaluated Crab-tree in Raspberry Pi 3 Model B+ with emulated NVM. Experiments show that Crab-tree significantly outperforms state-of-the-art B+-trees designed for persistent memory by up to 2.2× and 3.7× in write and read performances, respectively, with both consistency and scalability achieved.

CCS Concepts: • **Information systems** → **B-trees**; **Storage class memory**; • **Software and its engineering** → **Consistency**; • **Hardware** → **Non-volatile memory**.

Additional Key Words and Phrases: B+-tree, Non-volatile Memory, ARMv8, Persistent Memory

#### **ACM Reference Format:**

Chundong Wang, Sudipta Chattopadhyay, and Gunavaran Brihadiswarn. 2020. Crab-tree: A Crash Recoverable B+-tree Variant for Persistent Memory with ARMv8 Architecture. *ACM Trans. Embedd. Comput. Syst.* XX, Y, Article ZZZ (April 2020), 26 pages. https://doi.org/10.1234/0123456.7890123

\*A part of this work was done when Chundong Wang worked in Singapore University of Technology and Design. †This work was done when Gunavaran Brihadiswarn was an intern in Singapore University of Technology and Design.

Authors' addresses: Chundong Wang, cd\_wang@outlook.com, ShanghaiTech University, Shanghai, China; Sudipta Chattopadhyay, Singapore University of Technology and Design, Singapore, sudipta\_chattopadhyay@sutd.edu.sg; Gunavaran Brihadiswarn, gunavaran.15@cse.mrt.ac.lk, University of Moratuwa, Colombo, Western Province, Sri Lanka.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

1539-9087/2020/4-ARTZZZ \$15.00

https://doi.org/10.1234/0123456.7890123

<sup>© 2020</sup> Association for Computing Machinery.

# 1 INTRODUCTION

Flash memory-based storage devices have been widely used in both general-purpose and embedded computing systems. The development of next-generation non-volatile memory (NVM) technologies, such as Magnetoresistive RAM (MRAM) [1], Resistive RAM (ReRAM) [2, 3], 3D XPoint [4] and Phase-Change Memory (PCM) [5, 6], provides promising new storage media. As NVM technologies have DRAM-like byte addressability and disk-like durability, computer architects have proposed to place them on the memory bus to build *persistent memory* that blurs the boundary between volatile memory and durable storage. A large body of system and application software has been designed to leverage persistent memory in x86-based computing systems [7–21].

In general, NVM technologies are much denser and more energy-efficient than DRAM. They also have much shorter access latency and better write endurance than flash memory. Bringing NVM into embedded computing systems is likely to benefit system performance and capacity, energy consumption and device lifetime. ARM architectures have recently started to include architectural support for persistent memory like x86, particularly for 64-bit ARMv8 [22, 23]. It might seem natural to reuse, for ARMv8-based systems, the existing work that deploys NVM in x86-based systems. In this paper, we challenge such an approach and show that deploying NVM with ARMv8 requires rethinking and redesigning critical data structures. Concretely, we presume an embedded system equipped with an ARMv8-A 64-bit processor and byte-addressable NVM. We place our emphasis on tailoring a crucial data structure, i.e., B+-tree [24–28], for such a system.

The differences between x86 and ARMv8 introduce new challenges in designing an in-NVM B+-tree. First, ARMv8 processors that are widely used in smartphones and Raspberry Pi boards have relatively fewer computational resources. For example, a commonplace x86 processor installed in desktop computers can have a shared L3 cache of 8192KB while the ARMv8 processor of Raspberry Pi 3 Model B+ just has a shared L2 cache of 512KB. Different computational resources lead to different write and read performance for B+-tree. Secondly, although the time cost of using cache line flushes and memory fences has been known for x86 processors [25, 28–33], such cost has not been thoroughly evaluated for ARMv8. We note that cache line flushes and memory fences are required for the crash consistency of an in-NVM data structure, i.e., to correctly recover it after a system crash. Processors or memory controllers may alter the programmed order of writing multiple cache lines from CPU cache to memory for high memory bandwidth [7, 28–30, 33–35]. A cache line flush explicitly writes a dirty cache line back to memory, while a memory fence acts as a barrier that stalls memory operations after the barrier until the operations before the barrier complete. A series of cache line flushes and memory fences thus secures a desired writing order to NVM and maintains crash consistency. Thirdly, x86 processors guarantee total store ordering (TSO) [36] that prevents store-after-store instructions from being reordered, but ARMv8, which is non-TSO, may reorder independent store instructions. TSO cannot prevent multiple cache lines from being written back to memory in an altered order, but it allows x86-based systems to omit memory fences between consecutive stores that demand an order. The state-of-the-art design of in-NVM B+-tree exploits TSO to improve performance, but this assumption breaks completely when moving to an ARM-based system, with inferior write performance reported [28].

To evaluate these concerns, we have performed an empirical study with an ARMv8-A processor. Our evaluation reveals several instructive observations and opens the door to a plethora of research opportunities in deploying NVM with ARMv8. Specifically, we have the following observations:

- Using cache line flushes and memory fences to preserve a writing order of dirty cache lines is significantly time consuming with ARMv8. For example, the insertion performance dropped by 5.1× when cache line flushes and memory fences were enabled for B+-tree.

- Interposing extra memory fences to counteract non-TSO incurs substantial performance overhead. In particular, the insertion performance further dropped by 3.1× after adding such memory fences.
- Flushing multiple cache lines of a contiguous memory space in a batch ended by one eventual memory fence improves performance by 2.7× compared to flushing each of them through one cache line flush and memory fence.

In line with these observations, we have designed a new <u>c</u>rash <u>r</u>ecoverable <u>A</u>RMv8-oriented <u>B</u>+-tree (Crab-tree). Crab-tree keeps all keys sorted in a tree node. It only considers crash consistency of leaf nodes (LNs), as upper-level index nodes (INs) can be reconstructed from LNs. The key ideas of Crab-tree to suit ARMv8 are summarized as follows.

- Crab-tree uses two consistency strategies: 1) copy on write (COW) copies key-value (KV) pairs from an LN to a new LN, flushing them in a batch with one ending memory fence, and then replaces the original LN with the new LN, and 2) shifting in place (SIP) shifts KV pairs in the original LN, flushing dirty cache lines and counteracting non-TSO with memory fences where necessary.
- Crab-tree selects a strategy when it receives an insertion or deletion request. Given a key to be inserted or deleted, Crab-tree estimates the consistency cost of COW and SIP. It chooses one strategy that incurs less cost.
- For both COW and SIP, Crab-tree enforces a strict modification and store order, and always ends an insertion or deletion transaction with an 8B atomic write supported by ARMv8. In such a fashion, Crab-tree is able to identify and recover the exact inconsistent LN after a crash.

COW and SIP complement each other. COW is applied if using SIP has to shift a substantial number of KV pairs. SIP is chosen when shifting only a few KV pairs in-situ can complete an insertion or deletion. To further improve the performance of Crab-tree, we have looked into the design and implementation of Crab-tree. We find that, 1) software overhead caused by allocating/deallocating NVM space impairs performance, 2) choosing SIP or COW of Crab-tree demands binary search that incurs cache misses in determining the location for insertion/deletion, and 3) with the latest 128-bit compare-and-swap instructions available since ARMv8.1, additional memory fences employed by SIP to counteract non-TSO can be reduced. Accordingly, we have proposed three methods to resolve these problems, respectively.

We have prototyped Crab-tree in a Raspberry Pi 3 Model B+ [37] with ARM Cortex-A53 and emulated NVM. Experiments show that the optimized Crab-tree significantly outperforms state-of-the-art B+-trees for persistent memory by up to 2.2× and 3.7× in write and read performances, respectively, without losing consistency or scalability.

The remainder of this paper is organized as follows. In Section 2 we show the background of NVM and in-NVM B+-tree variants. In Section 3 we present our empirical study. In Section 4 we detail Crab-tree on how it achieves crash consistency with low time cost with ARMv8. In Section 5 we show optimization methods used to further improve the performance of Crab-tree. In Section 6 we evaluate Crab-tree and conclude the paper in Section 7.

# 2 BACKGROUND AND RELATED WORK

#### 2.1 NVM and Architectural Supports

Front runners of NVM technologies include MRAM, ReRAM, 3D XPoint and PCM. In order to exploit the superb characteristics of NVM, such as the byte-addressability, non-volatility and relatively short access latency, a number of proposals that optimize, design or even revolutionize computer systems with NVM have been considered from micro-architectural level to application level [2, 5, 7, 9, 11, 12, 14–19, 29, 38–54]. One such approach that works with existing architectural support is to build persistent memory by placing NVM on the memory bus, either with or without DRAM. In this fashion, processors directly load and store data with NVM.

Fig. 1. An Illustration of Using Memory Fence for a Barrier

Most of the proposals to deploy NVM are x86-oriented due to the maturity of architectural support in x86 for NVM. ARM architecture has started to provide instructions for cache line flushes<sup>1</sup> (e.g., dc civac, dc cvac and dc cvap) and memory fences (e.g., dmb) since ARMv8. These instructions help programmers to guarantee the consistency and persistency of data stored in NVM when developing system- and application-level software for embedded systems.

Special architectural support is necessary because persistent memory blurs the boundary between memory and storage. Traditional main memory is volatile by the very nature of DRAM while data storage relies on block-based devices like hard disks or SSDs [55]. Operating data structures in the persistent memory is non-trivial and needs to take into account characteristics of both memory and storage. For example, when processors store data into memory, they only guarantee at most 8B of data to be atomically (i.e., all-or-nothing) written. By contrast, a hard disk supports a sector-level (e.g., 512B) atomic write [9, 25, 28, 56].

Moreover, multiple writes to the persistent memory may not happen in the same order as they are issued by a program running on the CPU. While the processor preserves the functional correctness of the program, there might be many in-flight writes to the persistent memory that break the order in which they occurred during program execution [7, 10, 28, 30, 34]. If all such in-flight writes complete, the program runs as expected. However, if a system crash happens before the completion of all write transactions to the persistent memory, the program might reach an inconsistent state. For example, in a correct program execution, the allocation of a variable must be performed before the pointer to it is recorded. Assume that the write corresponding to the allocation happens at cache line $L_A$ and the write corresponding to recording the pointer happens at cache line $L_R$. The processor ensures that the write to $L_A$ precedes the write to $L_R$ to maintain a consistent program state. However, when the respective cache lines are written back to the persistent memory, the order of writes might be reversed depending on the processor. If the system crashes after writing $L_R$ to the persistent memory, but before writing $L_A$, then the program's data structure will remain in an inconsistent state, basically creating a *dangling pointer*. We note that volatile memories, such as DRAM, do not have such consistency issues, because the contents of $L_R$ written to volatile memory are discarded naturally when the system restarts.

In summary, we must ensure that memory writes happen in the order *intended* by the programmer. To this end, in-NVM data structures need to leverage memory fences and cache line flushes. Memory fences guarantee that memory operations after a fence will not be performed until memory operations before the fence complete. There are different types of memory fences. Figure 1 illustrates how memory fence (e.g., mfence of x86 and dmb of ARMv8) and store fence (e.g., sfence of x86 and dsb of ARMv8) enforce respective barriers. As shown in Figure 1(a), neither load nor store operations can cross the barrier of a memory fence over time. In Figure 1(b), a store fence only affects store operations while load operations that do not depend on previous stores can proceed either before or after the store fence.

<sup>1</sup>These instructions are called cache line *clean* in the ARMv8 manual [22]. We refer to them as cache line flush, following the convention of x86 processors.

On the other hand, a cache line flush forcefully flushes a cache line to memory, i.e., storing the cache line's data into NVM. Thus, a combination of cache line flush and memory fence following each memory write ensures a desired order of writing data to the persistent memory. However, for x86 processors, the overhead of using clflush (cache line flush) and mfence (memory fence) is known to be significant, leading to severe performance degradation [10, 25, 28].
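The ordering hazard and its remedy can be illustrated with a toy model. The sketch below is not ARMv8 code: `ToyPersistentMemory`, its methods and the line names `L_A` and `L_R` are illustrative assumptions that model a write-back cache whose dirty lines may reach NVM in any order before a crash. It enumerates every possible write-back subset and checks whether the pointer line can become durable without the allocation line.

```python
from itertools import combinations


class ToyPersistentMemory:
    """Toy model (an assumption for illustration): a volatile cache in front of NVM."""

    def __init__(self):
        self.cache = {}  # dirty cache lines, lost on a crash
        self.nvm = {}    # durable contents

    def store(self, line, value):
        self.cache[line] = value  # a store only dirties the cache

    def flush_fence(self, line):
        # Models a cache line flush followed by a memory fence:
        # the line is durable before any later store is issued.
        if line in self.cache:
            self.nvm[line] = self.cache.pop(line)

    def crash(self, evicted):
        # Lines in `evicted` happened to be written back before the crash;
        # every other dirty line is lost.
        for line in evicted:
            if line in self.cache:
                self.nvm[line] = self.cache[line]
        self.cache.clear()
        return dict(self.nvm)


def dangling_pointer_possible(use_flush_fence):
    """Try every write-back subset; report whether L_R can persist without L_A."""
    lines = ["L_A", "L_R"]
    for r in range(len(lines) + 1):
        for evicted in combinations(lines, r):
            pm = ToyPersistentMemory()
            pm.store("L_A", "allocated object")
            if use_flush_fence:
                pm.flush_fence("L_A")
            pm.store("L_R", "pointer to object")
            state = pm.crash(evicted)
            if "L_R" in state and "L_A" not in state:
                return True
    return False
```

In this model, `dangling_pointer_possible(False)` finds the bad interleaving in which only `L_R` is written back, whereas `dangling_pointer_possible(True)` finds none, because `L_A` is already durable when the pointer is stored.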

#### 2.2 B+-tree Variants for Persistent Memory

Several crash-consistent B+-tree variants have been proposed for persistent memory [24–28]. In a B+-tree, a key and a value (an 8B pointer to a record) form a KV pair. The KV pair is stored in a leaf node (LN) while upper-level index nodes (INs) assist in searching a specific key. LNs are critical to a B+-tree, as keys and pointers maintained in INs can be reconstructed by traversing LNs [25, 27].

Access performance and crash consistency are of paramount importance for an in-NVM B+-tree. Owing to the significant overhead of using cache line flushes and memory fences, existing B+-tree variants all try to reduce the consistency cost. It was discovered initially that maintaining *sorted* keys in a B+-tree node entails a large number of cache line flushes and memory fences. As a result, NV-Tree [25], wB+-tree [26] and FPTree [27] append unsorted keys with values into tree nodes to reduce the use of cache line flushes and memory fences. Although this improves write performance, searching a key in an unsorted node is suboptimal, as a linear scan must be performed over unsorted keys.

Later Hwang et al. [28] found that, on insertion or deletion with a sorted B+-tree node, shifting neighboring KV pairs in one cache line does not need cache line flush or memory fence because of the store dependencies between them ($key_i \rightarrow key_{i+1}$, $key_{i-1} \rightarrow key_i$, and $value_i \rightarrow value_{i+1}$, $value_{i-1} \rightarrow value_i$). But shifting KV pairs from one cache line to another does not guarantee that these two cache lines will be written back to memory in their modification order. Also, store dependencies only exist between keys or between values, and the key and value of a pair are separately stored ($key_i \leftrightarrow value_i$); for x86 with TSO, consecutive stores of the key and value of one KV pair preserve their writing order. These observations indicate that clflush and mfence only need to be used for orderly flushing dirty cache lines, instead of each KV pair, in inserting and deleting data with B+-tree in an x86-based system. Unfortunately, for non-TSO architectures, such as ARM, an additional memory fence must be called to preserve the order in storing the key and value of a KV pair so as to avoid a mismatch within a KV pair after a system crash.

Recently we have developed a new B+-tree variant with sorted nodes, namely Crab-tree [57], specifically for persistent memory that is operated by ARMv8 processor. By doing a quantitative study, we learn and analyze the consistency cost due to using cache line flush and memory fence for a B+-tree with a real-world ARMv8 processor. Accordingly, we design Crab-tree that chooses one of two consistency strategies, i.e., shifting in place (SIP) and copy on write (COW), at runtime to incur the least performance overhead. In this paper, we consider further optimizing Crab-tree in three dimensions, i.e., minimizing software overhead, improving cache efficiency, and reducing consistency cost in SIP. Experimental results have confirmed the effectiveness of these optimization methods.



Fig. 2. A Study of Consistency Costs and Search Performance for B+-tree with x86 and ARMv8 Processors

# 3 MOTIVATIONAL STUDY

Designing a crash-consistent B+-tree in NVM demands cache line flushes and memory fences, the cost of which has not been quantitatively analyzed for ARMv8. Moreover, for ARMv8 with non-TSO, a memory fence is needed to orderly store the key and value of each KV pair to avoid any mismatch after a crash. However, the cost of such extra memory fences is also unknown.

**Setup.** We have performed an empirical study with a use case. We choose a 2KB B+-tree node in which every key/value takes 8B. We simulate a scenario where we shift 63 KV pairs to insert a new smallest key with a value. We have repeated the insertion 10,000 times with an x86 processor (Intel Core i7-7700) and an ARMv8 processor (ARM Cortex-A53). We perform these operations for several different configurations of B+-trees. We first consider a volatile B+-tree without cache line flush or memory fence (Volatile). Another configuration is to orderly flush dirty cache lines while shifting KV pairs (FLUSH). For ARMv8, the third configuration adds a memory fence to FLUSH to orderly shift the key and value of each KV pair (FLUSH+non\_TSO). This is to counteract non-TSO of ARMv8. FLUSH and FLUSH+non\_TSO follow [28] to guarantee crash consistency.

**Key observations.** We have measured the execution time for different configurations with two processors and the left two diagrams of Figure 2 capture all the results. Let us first analyze the performance overhead of using cache line flushes and memory fences through a comparison between Volatile and FLUSH for both processors. Compared to Volatile, FLUSH took 17.1× and 5.1× as long to insert KV pairs on x86 and ARMv8, respectively. Therefore, *the consistency cost due to using cache line flushes and memory fences, despite being seemingly less than that of x86, is still significant for ARMv8.* Hence, the design of an in-NVM B+-tree with ARMv8 should minimize the use of cache line flushes and memory fences.

FLUSH is sufficient in preserving crash consistency for x86 processors. But for ARMv8, FLUSH+non\_TSO is the one that guarantees crash consistency with additional memory fences to counteract non-TSO. To the best of our knowledge, there does not exist quantitative analysis of such extra consistency cost incurred for non-TSO architectures. In Figure 2(b), the execution time of FLUSH+non\_TSO is 3.1× that of FLUSH. In other words, *the time cost for counteracting non-TSO using additional memory fences is substantial and might badly degrade the performance of a consistent in-NVM B+-tree.* With this additional overhead, the overall consistency cost stretches the execution time by 15.8× from Volatile to FLUSH+non\_TSO for ARMv8, which is close to the 17.1× performance gap between Volatile and FLUSH for x86.

**Reducing consistency cost with ARMv8.** As revealed with x86 processors [24, 25], a batched flush of a contiguous memory space, i.e., flushing its cache lines with a series of cache line flushes but only one ending memory fence, consumes much less time than flushing the space cache line by cache line through many cache line flushes and memory fences. We first did a test to verify whether ARMv8 complies with this observation. We allocated a cache line-aligned memory space of 512MB, and flushed it in the batched way and by each cache line, respectively, with both x86 and ARMv8 processors. Figure 2(c) shows the execution time. With the x86 processor, batched flush improves performance by a factor of 1.7. For ARMv8, such improvement is more significant, i.e., by a factor of 3.5. As a result, *with ARMv8 processors, flushing in an aggregated batch should save more time than individually flushing cache lines.*
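The difference between the two flushing styles is easiest to see by counting operations. The following sketch is only a counting model, not a benchmark; the 64B cache line size and the function name `flush_op_counts` are assumptions made for illustration.

```python
CACHE_LINE_SIZE = 64  # bytes; assumed cache line size for this sketch


def flush_op_counts(region_bytes, batched, line_size=CACHE_LINE_SIZE):
    """Return (cache line flushes, memory fences) needed to persist a region.

    Batched: one flush per cache line, then a single ending fence.
    Per line: every flush is followed by its own fence.
    """
    lines = -(-region_bytes // line_size)  # ceiling division
    fences = 1 if batched else lines
    return lines, fences
```

For a 512MB region under the 64B assumption, both styles issue 8,388,608 cache line flushes. The batched style ends with one memory fence, whereas the per-line style issues 8,388,608 fences, so the measured gap is attributable to the fences rather than to the flushes.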

Inspired by this observation, we designed a new configuration called Flush\_new and applied it to the insertion use case. In Flush\_new, we allocated a new tree node and copied all KV pairs, including the new KV pair, into the newly-allocated node. We then flushed all KV pairs in the new node to NVM using the batched flush. Finally, we replaced the original node in the B+-tree with the new node through an 8B atomic write supported by ARMv8. Prior to replacing the original node, there is no need to counteract non-TSO via extra memory fences, because the new node is still *outside* the tree and flushing all its KV pairs to NVM with one ending memory fence resolves all ordering issues. As shown in Figures 2(a) and 2(b), Flush\_new marginally reduces the execution time with x86, but, with ARMv8, it significantly reduces the execution time by 4.0× compared against FLUSH+non\_TSO. This performance boost motivates the design of our in-NVM B+-tree on ARMv8 with minimized consistency cost.

**Cost of search in ARMv8.** Searching a key is required for both write and read operations. State-of-the-art B+-tree variants have different strategies to accelerate the process of search. NV-Tree performs binary search in INs and linear scan in LNs since its LNs are unsorted [25]. FAST-FAIR applies linear search in both INs and LNs although all of its nodes keep the keys sorted, because linear search yields better performance with node size no greater than 4KB with x86 processor [28]. Since embedded systems generally have fewer computational resources and employ different micro-architectural features (e.g., non-inclusive cache and in-order pipeline), we have empirically compared binary search to linear search on our x86 and ARMv8 processors with varied node sizes. Figure 2(d) shows the execution time of searching 1 million keys in a volatile B+-tree with all nodes sorted. With ARMv8, binary search either costs much less than linear search or performs comparably. Therefore, we consider binary search in designing our B+-tree for ARMv8.
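For reference, the two search styles compared above can be written as follows for a sorted node. This is a minimal sketch; the function names and the flat `keys` array with `n` valid entries are illustrative assumptions, not the paper's implementation.

```python
from bisect import bisect_left


def insert_location_binary(keys, n, k):
    """Position of the first key >= k among the first n sorted keys (binary search)."""
    return bisect_left(keys, k, 0, n)


def insert_location_linear(keys, n, k):
    """Same position, found by scanning the keys from left to right."""
    p = 0
    while p < n and keys[p] < k:
        p += 1
    return p
```

Both functions return the same position for any sorted node. Binary search inspects about log2(n) keys while the linear scan inspects up to n, which is the trade-off measured in Figure 2(d).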

Note that we have used an economical ARMv8 Cortex-A53 for study. The cost of using cache line flushes and memory fences to guarantee ordered writes and counteract non-TSO persists with high-end ARMv8 processors. In fact, such overhead shall incur more severe performance degradation for processors with higher frequencies and larger caches.

# 4 DESIGN OF CRAB-TREE

#### 4.1 Overview of Crab-tree

Figure 3 shows the architecture of <u>c</u>rash <u>r</u>ecoverable <u>A</u>RMv8-oriented <u>B</u>+-tree (Crab-tree). As INs can be rebuilt by scanning LNs, Crab-tree only enforces crash consistency to LNs. Figure 3 also captures a logical view of an LN. An LN has KV pairs, a counter that records the number of valid KV pairs in the LN, and a pointer pointing to the right sibling LN. Crab-tree organizes LNs in an LN linked list using this pointer. Both counter and pointer can be modified with an 8B atomic write. INs can be maintained in DRAM with a hybrid DRAM/NVM system.

Insertion, update and deletion are write operations to a target LN for B+-tree. To update a KV pair, Crab-tree replaces the value (pointer to a new record) via an 8B atomic write. As to insertion and deletion, existing KV pairs in the target LN should be shifted unless the key to be inserted or deleted is the largest. Crab-tree employs two strategies for insertion and deletion: copy on write (COW) and shifting in place (SIP). COW copies all KV pairs to a new LN, flushes them in a batch, and then replaces the original LN with the new LN. SIP shifts KV pairs in the original LN by orderly flushing dirty cache lines. Crab-tree chooses one of them to handle an insertion or deletion depending on their consistency costs. Given a specific key, the target LN it belongs to is deterministic, so the number of KV pairs that should be shifted is calculable, and this in turn determines the number of cache line flushes and memory fences required to orderly flush dirty cache lines and counteract non-TSO. The consistency cost of SIP is hence determined. Also, the cost of COW in copying and flushing all KV pairs to a new LN as well as replacing the original LN can be estimated. Crab-tree picks the one with lower cost.

Both COW and SIP abide by strict orders of modification and store. To do COW, Crab-tree first creates a new LN and copies KV pairs. For insertion, Crab-tree copies and puts the new KV pair in the proper position of new LN. For deletion it omits the KV pair to be deleted in copying. After copying, Crab-tree links the new LN's pointer to the right sibling of the original LN. Crab-tree replaces the original LN by modifying its left sibling's pointer to point to the new LN. Modifying this pointer incorporates the new LN into the LN linked list. Doing so via an 8B atomic write does not compromise Crab-tree's consistency, as the atomic write either catches the original LN or the new LN in the LN linked list.

SIP leverages a property of B+-tree: B+-tree nodes do not allow duplicate values. On shifting KV pairs, SIP shifts a value to make duplicate values prior to shifting its corresponding key. In the final step of storing a new KV pair, SIP stores the key before storing the value so as to retain duplicate values for consistency check. However, using duplicate values alone cannot exactly recover an inconsistent LN, because both insertion and deletion shift KV pairs but in opposite directions (right for insertion and left for deletion). It is also difficult to tell which key actually matches the duplicate value. To resolve this issue, after shifting KV pairs, our SIP enforces an atomic write to modify the number of KV pairs, by which Crab-tree can correctly recover from a crash.

In summary, Crab-tree tries to minimize consistency costs by online selecting COW or SIP. In either way it always ends a write transaction by an 8B atomic write, which secures the crash recoverability of Crab-tree. As a result, Crab-tree achieves both high performance and crash consistency.

#### 4.2 Insertion

Algorithm 1 illustrates how Crab-tree inserts a KV pair. Inserting a new KV pair starts by traversing from the root node to the target LN (Line 1). Then Crab-tree checks if the current LN is full (Line 2, D means the maximum number of KV pairs allowed to be stored in one LN). If so, the LN will be split with the newly-arrived KV pair (Line 3). Otherwise, Crab-tree uses binary search to locate the position where the new KV pair will be put (Lines 5 to 6). Given a system with a specific ARMv8 processor, the time cost of cache line flushes and memory fences as well as copying data can be estimated through a quick test. With the insertion position and the respective numbers of KV pairs to be copied or shifted, Crab-tree estimates the costs of both COW and SIP (Line 7). If using COW is likely to spend less time, Crab-tree will follow COW to insert the newly-arrived KV pair (Lines 8 to 21). Otherwise, Crab-tree will perform SIP (Lines 23 to 33).
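The cost comparison at Line 7 can be sketched as a small counting model. Everything below is an assumption made for illustration: the `OpCosts` calibration record, the 64B cache line, the 16B header (8B counter plus 8B pointer), and the approximation that SIP flushes one cache line for every four shifted KV pairs. The source does not spell out the exact cost formula used by Crab-tree.

```python
from dataclasses import dataclass

KV_SIZE = 16          # bytes: 8B key + 8B value, as in the paper's use case
CACHE_LINE_SIZE = 64  # bytes: assumed cache line size
HEADER_SIZE = 16      # bytes: assumed 8B counter + 8B right-sibling pointer


@dataclass
class OpCosts:
    """Per-operation times from a calibration run (units are arbitrary)."""
    copy_kv: float
    flush: float
    fence: float


def cow_cost(n, c):
    """Estimated cost of COW for an LN holding n KV pairs before the insertion."""
    payload = KV_SIZE * (n + 1) + HEADER_SIZE
    lines = -(-payload // CACHE_LINE_SIZE)  # flushes issued by the batched flush
    flushes = lines + 1                     # plus the left sibling's pointer
    fences = 2                              # one per FLUSH_FENCE call
    return c.copy_kv * (n + 1) + c.flush * flushes + c.fence * fences


def sip_cost(n, p, c):
    """Estimated cost of SIP when inserting at position p of an LN with n KV pairs."""
    shifted = n - p
    crossings = shifted // (CACHE_LINE_SIZE // KV_SIZE)  # approximate boundary flushes
    flushes = crossings + 2                 # boundaries, new KV pair, counter
    fences = 2 * shifted + 1 + flushes      # non-TSO fences plus one per flush
    return c.copy_kv * shifted + c.flush * flushes + c.fence * fences


def cow_is_more_efficient(n, p, c):
    """Choose COW only when its estimated cost is strictly lower than SIP's."""
    return cow_cost(n, c) < sip_cost(n, p, c)
```

With made-up calibration values `OpCosts(copy_kv=1.0, flush=20.0, fence=10.0)` and an LN holding 63 KV pairs, the model estimates 444.0 for COW, 1843.0 for SIP at position 0, and 91.0 for SIP at position 62. It therefore picks COW for a new smallest key and SIP for a key near the right end, which matches the intuition that SIP pays two fences per shifted KV pair.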



Fig. 3. The Structures of Crab-tree and its Leaf Node

Crab-tree first allocates a new LN for COW (Line 8). pmalloc at Line 8 particularly fills the allocated space with nulls (zeros). Crab-tree then copies existing KV pairs from the original LN to the newly-allocated LN while leaving the appropriate position (i.e., *p* at Line 6) vacant for the new KV pair (Lines 9 to 11). After the new KV pair is stored into the newly-allocated LN (Line 12), Crab-tree sets the number of KV pairs (Line 13) and makes the newly-allocated LN's pointer point to the right sibling of the original LN (Line 14). By doing so, Crab-tree 'half' links the new LN into the LN linked list. Crab-tree flushes the new LN into NVM to persist all copied KV pairs, the right sibling pointer and the number of KV pairs (Line 16). Crab-tree then finds the left sibling of the original LN (Line 17), and alters the left sibling's pointer to point to the new LN (Line 18). After flushing the left sibling's pointer (Line 19), the new LN successfully replaces the original LN in the LN linked list. Next Crab-tree updates parental INs with the pointer of the new LN (Line 20) and frees the space occupied by the original LN (Line 21).
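The COW steps described above can be mirrored in a few lines of Python. This is a behavioural sketch only: `LeafNode`, `cow_insert` and `keys_in_list` are hypothetical names, cache line flushes and fences cannot be issued from Python and appear as comments, and the case of an LN without a left sibling is not handled.

```python
class LeafNode:
    """Minimal in-memory stand-in for an LN (names are illustrative)."""

    def __init__(self, kvs=None, right_sibling=None):
        self.kvs = list(kvs or [])          # sorted (key, value) pairs
        self.right_sibling = right_sibling  # next LN in the LN linked list


def cow_insert(left_ln, ln, k, v):
    """Insert (k, v) into ln by copy on write; left_ln is ln's left sibling."""
    p = sum(1 for key, _ in ln.kvs if key < k)  # insertion position
    new_ln = LeafNode(ln.kvs[:p] + [(k, v)] + ln.kvs[p:], ln.right_sibling)
    # A real implementation would flush new_ln in one batch with a single
    # ending fence here, while new_ln is still unreachable from the tree.
    left_ln.right_sibling = new_ln              # the single 8B pointer update
    # A real implementation would then flush and fence this pointer, update
    # the parental IN, and free the original ln.
    return new_ln


def keys_in_list(head):
    """Collect all keys by walking the LN linked list from head."""
    keys, node = [], head
    while node is not None:
        keys.extend(key for key, _ in node.kvs)
        node = node.right_sibling
    return keys
```

Until the assignment to `left_ln.right_sibling` executes, a traversal from `left_ln` still sees the original LN; afterwards it sees the new one. The original LN itself is never modified, which is why no ordering inside the new LN matters before the pointer update.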

A pointer in NVM has a size of 8B with ARMv8 64-bit processors, so storing an in-NVM pointer can be atomically done. Modifying the pointer of the left sibling LN either completes or fails. If a crash occurs after it completes, Crab-tree observes the new LN. Otherwise it finds the original LN. In either case, the crash consistency of Crab-tree is not impaired. Moreover, before the batched flush at Line 16, Crab-tree does not execute any cache line flush or memory fence in copying KV pairs and linking the right sibling LN. The reason is that the new LN is still outside the tree before the pointer of the left sibling is modified, so any changes and any store order in the new LN have no impact on Crab-tree's crash consistency. In addition, on locating the left sibling of the original LN (the function get\_left\_sibling at Line 17), Crab-tree does not need to traverse again from the root to the parental IN. At the beginning of each insertion, Crab-tree has traversed INs to find the target LN (Line 1). At that time Crab-tree can maintain an array, say traversed\_INs[h - 1] (h is the height of the tree), in which an element records a traversed IN and the position of the node traversed at the next level. To get an LN's left sibling, Crab-tree just checks the last element of traversed\_INs. Unless the recorded position is zero, which means that upper-level traversed IN(s) should be considered, Crab-tree promptly locates the LN's left sibling. traversed\_INs is an auxiliary array to temporarily record traversed INs, and does not need consistency guarantee.
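The left-sibling lookup through the recorded path can be sketched as follows. The `IndexNode` class and the representation of traversed\_INs as a list of `(IN, position)` tuples ordered from the root downwards are assumptions made for illustration; the descent along rightmost children when the recorded position is zero is one plausible reading of "upper-level traversed IN(s) should be considered".

```python
class IndexNode:
    """Minimal IN stand-in: children are INs, or LNs at the lowest IN level."""

    def __init__(self, children):
        self.children = children


def get_left_sibling(traversed_ins):
    """Return the left sibling of the LN reached via traversed_ins.

    traversed_ins lists (IN, position of the child taken), root first.
    """
    for level in range(len(traversed_ins) - 1, -1, -1):
        node, pos = traversed_ins[level]
        if pos > 0:
            sibling = node.children[pos - 1]
            # Descend along rightmost children back down to the leaf level.
            for _ in range(len(traversed_ins) - 1 - level):
                sibling = sibling.children[-1]
            return sibling
    return None  # the LN is the leftmost leaf and has no left sibling
```

In the common case the last recorded position is non-zero and the loop returns after inspecting a single element, so no second traversal from the root is required.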

#### **Algorithm 1** Insertion of Crab-Tree (Insert(< k, v >))

**Input:** A KV pair ⟨k, v⟩ to be inserted

     1: Search from the root until the target LN ln is obtained //ln has the number of KV pairs n
     2: if (ln.n ≥ D) then //D is the degree (max No. of keys) of an LN
     3:     split(ln, <k, v>); //We must split ln with <k, v>
     4: else
     5:     //Get the insertion location and decide to do COW or SIP
     6:     p := get_insert_location(ln, ln.n, k);
     7:     if (COW_is_more_efficient(ln.n, p) == True) then
     8:         new_ln := pmalloc(LN_SIZE);
     9:         for (i := 0; i < ln.n; i := i + 1) do //Copy KV pairs
    10:             new_ln.KVs[i < p ? i : i + 1].k := ln.KVs[i].k;
    11:             new_ln.KVs[i < p ? i : i + 1].v := ln.KVs[i].v;
    12:         new_ln.KVs[p].k := k, new_ln.KVs[p].v := v;
    13:         new_ln.n := ln.n + 1;
    14:         new_ln.rightSibling := ln.rightSibling;
    15:         //Let us flush new_ln to NVM
    16:         FLUSH_FENCE(new_ln, (KV_SIZE × new_ln.n) + sizeof(new_ln.rightSibling) + sizeof(new_ln.n));
    17:         left_ln := get_left_sibling(ln);
    18:         left_ln.rightSibling := new_ln;
    19:         FLUSH_FENCE((left_ln.rightSibling), sizeof(left_ln.rightSibling));
    20:         Update parental INs of ln with new_ln if necessary;
    21:         pfree(ln);
    22:     else //We need to do SIP
    23:         for (i := ln.n + 1; i > p; i := i - 1) do //Shift KV pairs
    24:             ln.KV[i].v := ln.KV[i-1].v; FENCE();
    25:             ln.KV[i].k := ln.KV[i-1].k; FENCE();
    26:             if (ln.KV[i] is at boundary of cache lines) then
    27:                 //We need to flush a dirty cache line
    28:                 FLUSH_FENCE((ln.KV[i]), CL_SIZE);
    29:         ln.KV[p].k := k; FENCE();
    30:         ln.KV[p].v := v;
    31:         FLUSH_FENCE((ln.KV[p]), KV_SIZE);
    32:         ln.n := ln.n + 1;
    33:         FLUSH_FENCE((ln.n), sizeof(ln.n));
    34: Return the completion of insertion;

If the cost of SIP is less, Crab-tree first shifts KV pairs to the right (Lines 23 to 28) to make space. It shifts each value before its key, with extra memory fences, to maintain duplicate values (Lines 24 to 25). While shifting KV pairs, whenever Crab-tree moves from one cache line to the one on its left, it flushes the dirty cache line it leaves (Lines 26 to 28). After shifting, Crab-tree stores the new key with a memory fence and then the new value (Lines 29 to 30), followed by a cache line flush and memory fence to persist the new KV pair (Line 31). Crab-tree increases and persists the number of KV pairs to end the insertion (Lines 32 to 33). This is also done via an 8B atomic write for crash recovery (cf. Section 4.5).
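The SIP branch can be sketched in C as below. It is an illustration under assumptions rather than the authors' code: `leaf_t`, `D`, and `KV_PER_CL` are hypothetical, the KV array is assumed to start on a cache line boundary with four 16B pairs per 64B line, `fence` and `flush_fence` are counting stubs, and the shift loop starts at the first free slot `n` under 0-based indexing so that it stays inside the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define D 8            /* hypothetical capacity of an LN */
#define CL_SIZE 64u    /* assumed cache line size in bytes */
#define KV_PER_CL 4u   /* assumes 16B KV pairs and a line-aligned kvs[] */

typedef struct { uint64_t k; void *v; } kv_t;
typedef struct leaf { uint64_t n; struct leaf *right; kv_t kvs[D]; } leaf_t;

/* Counting stubs for FENCE and FLUSH_FENCE. */
static int fences = 0, flushes = 0;
static void fence(void) { fences++; }
static void flush_fence(const void *addr, size_t len) {
    (void)addr; (void)len;
    flushes++;
    fences++;              /* every FLUSH_FENCE ends with a fence */
}

/* SIP insertion of <k, v> at position p (Lines 23 to 33 of Algorithm 1). */
static void sip_insert(leaf_t *ln, uint64_t p, uint64_t k, void *v) {
    assert(ln != NULL && ln->n < D && p <= ln->n);

    for (uint64_t i = ln->n; i > p; i--) {
        ln->kvs[i].v = ln->kvs[i - 1].v; fence();  /* value first: duplicate value */
        ln->kvs[i].k = ln->kvs[i - 1].k; fence();  /* then the key */
        if (i % KV_PER_CL == 0)                    /* slot i is the first in its line */
            flush_fence(&ln->kvs[i], CL_SIZE);     /* flush before moving left */
    }
    ln->kvs[p].k = k; fence();                     /* new key, then new value */
    ln->kvs[p].v = v;
    flush_fence(&ln->kvs[p], sizeof(kv_t));

    ln->n = ln->n + 1;                             /* 8B atomic store ends the insertion */
    flush_fence(&ln->n, sizeof ln->n);
}
```

Each shifted pair costs two fences here, which is the overhead that makes SIP expensive when many pairs must move.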

Fig. 4. An Illustration of Inserting KV Pairs by Crab-tree

Figure 4 shows two insertions with Crab-tree. We assume that a cache line holds two KV pairs. In Figure 4(a), Crab-tree first decides where the new KV pair should be inserted (①). If Crab-tree chose SIP, four KV pairs would have to be shifted with three cache line flushes and memory fences (②), as well as memory fences to counteract non-TSO. So Crab-tree chooses COW. In Figure 4(b), it allocates a new LN and copies all six KV pairs to the new LN (③). Then it replaces the original LN with the new LN by setting two pointers (④ and ⑤). Figure 4(c) illustrates another scenario in which the key to be inserted is the new largest key. COW, which would copy seven KV pairs, takes more time than SIP. So Crab-tree performs SIP (⑥ and ⑦) and increases the number of KV pairs (⑧).
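The decision in the Figure 4 example can be reproduced with a toy cost model. The sketch below is an assumption made for illustration only: the text does not give the formula behind COW\_is\_more\_efficient, so the flush and fence counts, the equal weights `COST_FLUSH` and `COST_FENCE`, and the function names are hypothetical, and allocation and copy time are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical relative costs; real values must be measured on the target CPU. */
#define COST_FLUSH 1u
#define COST_FENCE 1u

/* Cache lines covered by KV slots first..last (inclusive), assuming the
 * KV array starts on a cache line boundary. */
static uint64_t lines_spanned(uint64_t first, uint64_t last, uint64_t kv_per_cl) {
    return last / kv_per_cl - first / kv_per_cl + 1;
}

/* SIP: dirty lines from slot p to slot n, plus the counter n. */
static uint64_t sip_cost(uint64_t n, uint64_t p, uint64_t kv_per_cl) {
    uint64_t flushes = lines_spanned(p, n, kv_per_cl) + 1;
    /* two fences per shifted pair, one between new key and value,
     * and one fence per flush */
    uint64_t fences = 2 * (n - p) + 1 + flushes;
    return flushes * COST_FLUSH + fences * COST_FENCE;
}

/* COW: all lines of the new LN, its header, and the left sibling pointer. */
static uint64_t cow_cost(uint64_t n, uint64_t kv_per_cl) {
    uint64_t flushes = lines_spanned(0, n, kv_per_cl) + 2;
    uint64_t fences = 2;                 /* one per batched FLUSH_FENCE */
    return flushes * COST_FLUSH + fences * COST_FENCE;
}

/* n: current number of KV pairs; p: insert position (0-based). */
static bool cow_is_more_efficient(uint64_t n, uint64_t p, uint64_t kv_per_cl) {
    assert(kv_per_cl > 0 && p <= n);
    return cow_cost(n, kv_per_cl) < sip_cost(n, p, kv_per_cl);
}
```

With two KV pairs per cache line, this model prefers COW for five existing pairs and insert position 1, as in Figure 4(a), and prefers SIP when the new key is appended after six existing pairs, as in Figure 4(c).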

**Split.** Crab-tree splits an LN when the LN is full on inserting a new KV pair. Crab-tree still aims to minimize the use of cache line flushes and memory fences in splitting. Previous B+-tree variants for NVM split an LN into two LNs and insert the new KV pair into one of them through a normal insertion [25, 28]. Crab-tree works differently. It first determines the split point for the LN, from which the half of the KV pairs with larger keys will be copied to a new LN. If the new KV pair falls into the larger half, Crab-tree just copies it along with the other KV pairs and does a batched flush. It then modifies the original LN's number of KV pairs. Next, Crab-tree adds the new LN into the LN linked list and adjusts parental IN(s) if necessary. If the new KV pair should join the smaller half, after copying the larger half, Crab-tree assesses the costs of COW and SIP as in a normal insertion. If it chooses COW, two new LNs will be linked into the LN linked list.

Splitting also demands strict execution orders. If SIP adds one new LN, Crab-tree first lets this LN point to the right sibling of the original LN. Then Crab-tree sets the original LN's pointer to the new LN's address, making the new LN the original LN's new right sibling. Both pointers are modified by 8B atomic writes. If COW adds two new LNs, Crab-tree first links them in ascending order of keys. It then makes the right new LN point to the original LN's right sibling. Next, Crab-tree finds the original LN's left sibling and changes its pointer to link the left new LN. With this order, the two LNs are consistently incorporated into the LN linked list.

#### 4.3 Search

Looking up a value given a key for Crab-tree follows a binary search as justified in Section 3. In the beginning, Crab-tree gets the tree root. Then it traverses from the root until the target LN by comparing keys in binary search. Next Crab-tree does binary search in the LN. It will return the value if the key is found or a code for miss (-1). The same procedure is also used in inserting, updating and deleting a KV pair.
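The in-LN step of the lookup can be sketched as a plain binary search over sorted KV pairs. The type `kv_t` and the function name `ln_find` are hypothetical; the sketch returns the slot index of the key, or -1 as the miss code mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t k; void *v; } kv_t;

/* Binary search among the n sorted KV pairs of an LN.
 * Returns the slot index holding key k, or -1 on a miss. */
static int64_t ln_find(const kv_t *kvs, uint64_t n, uint64_t k) {
    assert(kvs != NULL || n == 0);
    uint64_t lo = 0, hi = n;                 /* search window is [lo, hi) */
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (kvs[mid].k == k) return (int64_t)mid;
        if (kvs[mid].k < k) lo = mid + 1;
        else hi = mid;
    }
    return -1;
}
```

The same routine, applied to the keys of an IN, would select the child to descend into during the traversal from the root.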

#### 4.4 Deletion

Deleting a KV pair is a reverse operation of insertion and requires shifting KV pairs in a tree node. Crab-tree considers two cases when removing a KV pair from an LN.

- If shifting KV pairs in the LN via SIP incurs less consistency cost than COW, Crab-tree will call SIP to shift KV pairs in-situ.
- Otherwise, Crab-tree allocates a new LN and copies to it KV pairs excluding the one to be deleted. Crab-tree then replaces the original LN using the new LN.

We note that Crab-tree takes a particular step when using SIP to delete a KV pair. With SIP for deletion, KV pairs are shifted to the left, so the largest KV pair ends up duplicated at the end of the valid KV pairs. Prior to decreasing the number of KV pairs of the LN, Crab-tree clears the value (pointer) of the largest key in its original position to null (zero). As mentioned in Section 4.2, when allocating a new LN, Crab-tree calls pmalloc to fill the LN with nulls. On deletion Crab-tree again sets a vacated value to null. Null pointers serve as a landmark to identify valid KV pairs and are especially useful in recovery (cf. Section 4.5).
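A minimal C sketch of SIP deletion with the null-clearing step follows. It rests on assumptions not stated in the text: the value-before-key store order and the cache line boundary flush mirror the insertion path, `leaf_t`, `D`, and `KV_PER_CL` are hypothetical, and `fence`/`flush_fence` are counting stubs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define D 8            /* hypothetical capacity of an LN */
#define CL_SIZE 64u    /* assumed cache line size in bytes */
#define KV_PER_CL 4u   /* assumes 16B KV pairs and a line-aligned kvs[] */

typedef struct { uint64_t k; void *v; } kv_t;
typedef struct leaf { uint64_t n; struct leaf *right; kv_t kvs[D]; } leaf_t;

/* Counting stubs for FENCE and FLUSH_FENCE. */
static int fences = 0, flushes = 0;
static void fence(void) { fences++; }
static void flush_fence(const void *addr, size_t len) {
    (void)addr; (void)len;
    flushes++;
    fences++;
}

/* SIP deletion of the KV pair at position p. */
static void sip_delete(leaf_t *ln, uint64_t p) {
    assert(ln != NULL && ln->n > 0 && ln->n <= D && p < ln->n);
    uint64_t last = ln->n - 1;

    for (uint64_t i = p; i < last; i++) {          /* shift to the left */
        ln->kvs[i].v = ln->kvs[i + 1].v; fence();  /* assumed order: value first */
        ln->kvs[i].k = ln->kvs[i + 1].k; fence();
        if ((i + 1) % KV_PER_CL == 0)              /* slot i is the last in its line */
            flush_fence(&ln->kvs[i], CL_SIZE);
    }
    /* The largest pair is now duplicated: clear its original value to null
     * before the counter is decreased. */
    ln->kvs[last].v = NULL;
    flush_fence(&ln->kvs[last], sizeof(kv_t));

    ln->n = last;                                  /* 8B atomic store */
    flush_fence(&ln->n, sizeof ln->n);
}
```

Clearing the vacated value before decreasing the counter keeps the number of non-null values meaningful for the recovery checks of Section 4.5.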

*Merge & Redistribution.* Continuous deletions decrease the occupancy of LNs. When an LN becomes underutilized (the number of valid KV pairs is less than $\frac{D}{2}$), Crab-tree considers merging or redistributing its KV pairs with its siblings. Merging is performed if either sibling of the LN has sufficient vacant space to absorb all of the underutilized LN's KV pairs. After merging, the LN linked list is adjusted by detaching the underutilized LN, and parental INs are updated where necessary. In addition, Crab-tree prefers merging KV pairs from an LN into its left sibling if possible, so that it can perform a batched flush with less time cost.

If neither sibling of the underutilized LN has enough space, Crab-tree redistributes KV pairs from the two siblings to the underutilized LN until it has more than $\frac{D}{2}$ KV pairs. Crab-tree applies SIP to shift KV pairs for redistribution. It also needs to update parental IN(s) after the redistribution.

#### 4.5 Rebuilding and Recovery at Startup

After a normal exit or crash, Crab-tree rebuilds INs. It rebuilds only once at startup, unlike NV-Tree, which frequently rebuilds INs at runtime and stalls write transactions [25].

The rebuilding of Crab-tree is also its recovery procedure. On rebuilding, Crab-tree scans every LN in the LN linked list with two tasks. One task is to identify the smallest and largest keys of the LN for filling upper-level INs. The other task is to check crash consistency. COW does not cause inconsistency because the ending atomic write of COW leaves either the original LN or the new LN in the list. However, to avoid memory leakage caused by a crash, the addresses of the newly-allocated and original LNs involved in COW can be recorded. In recovery, if either address is not found in the LN linked list, Crab-tree asks the memory library to reclaim that space.

SIP might cause inconsistency. When scanning an LN, Crab-tree tries to find duplicate values. It also obtains the number of KV pairs stored in the LN and counts the number of non-null values. There are three inconsistent scenarios.

- No duplicate value is found, but the recorded number of KV pairs differs from the number of non-null values.
- Crab-tree finds a duplicate value and the recorded number of KV pairs differs from the number of non-null values.
- Crab-tree finds a duplicate value but the recorded number of KV pairs is the same as the number of non-null values.

For the first scenario, if the recorded number is one smaller than the number of non-null values, the crash must have happened in insertion before increasing the number of KV pairs. If it is one larger than the number of non-null values, the crash must have happened in deletion before decreasing the number of KV pairs. In either case Crab-tree just needs to modify the number of KV pairs through an 8B atomic write.

The other two scenarios are caused when Crab-tree is shifting KV pairs. Insertion and deletion shift KV pairs in opposite directions. As shifting in insertion always moves KV pairs to the right to make space, a crash that has happened at that moment leaves one more non-null and duplicate value, which corresponds to the second scenario. Crab-tree shifts KV pairs to the left for *roll-back* recovery. On the other hand, a crash that has happened during shifting KV pairs in deletion would have one duplicate value but the number of KV pairs has not been changed yet, i.e., being equal to the number of non-null values, which matches the third scenario. Crab-tree shifts KV pairs to the left for *roll-forward* recovery.
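The classification of an LN during the startup scan can be sketched as follows. The sketch only decides which of the scenarios applies; the shifting that performs roll-back or roll-forward is not shown. The names `leaf_t`, `ln_state_t`, and `ln_check` are hypothetical, and the code assumes that valid values are distinct pointers, so that two adjacent equal non-null values can only result from an interrupted shift.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define D 8  /* hypothetical capacity of an LN */

typedef struct { uint64_t k; void *v; } kv_t;
typedef struct leaf { uint64_t n; struct leaf *right; kv_t kvs[D]; } leaf_t;

typedef enum {
    LN_CONSISTENT,           /* nothing to repair */
    LN_FIX_COUNT,            /* scenario 1: only the recorded count is stale */
    LN_ROLL_BACK_INSERT,     /* scenario 2: crash while shifting for insertion */
    LN_ROLL_FORWARD_DELETE   /* scenario 3: crash while shifting for deletion */
} ln_state_t;

/* Number of slots whose value is non-null. */
static uint64_t count_non_null(const leaf_t *ln) {
    uint64_t c = 0;
    for (uint64_t i = 0; i < D; i++)
        if (ln->kvs[i].v != NULL) c++;
    return c;
}

/* True if two adjacent slots hold the same non-null value. */
static bool has_duplicate_value(const leaf_t *ln) {
    for (uint64_t i = 0; i + 1 < D; i++)
        if (ln->kvs[i].v != NULL && ln->kvs[i].v == ln->kvs[i + 1].v)
            return true;
    return false;
}

static ln_state_t ln_check(const leaf_t *ln) {
    assert(ln != NULL);
    uint64_t non_null = count_non_null(ln);
    if (!has_duplicate_value(ln))
        return ln->n == non_null ? LN_CONSISTENT : LN_FIX_COUNT;
    return ln->n != non_null ? LN_ROLL_BACK_INSERT : LN_ROLL_FORWARD_DELETE;
}
```

For `LN_FIX_COUNT`, the repair is a single 8B store of the non-null count into `n`, followed by a flush; the other two states call for shifting KV pairs to the left as described above.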

Crab-tree and FAST-FAIR both use duplicate values to spot inconsistency, but they differ considerably in checking consistency and recovering from crashes. First, FAST-FAIR does lazy recovery and fixes inconsistencies in future write transactions, whereas Crab-tree actively recovers inconsistent LNs at startup. More importantly, Crab-tree combines the number of valid KV pairs with duplicate values to swiftly and precisely recover an inconsistent LN, which is not covered by FAST-FAIR.

#### 5 OPTIMIZATION OF CRAB-TREE

Crab-tree is a self-contained B+-tree variant for persistent memory with ARMv8 regarding crash consistency and high performance. To further improve the performance of Crab-tree, we have optimized it in three dimensions.

#### 5.1 Self-managed NVM Space

Software layers, including both application and system software, inflict performance overhead due to scheduling, memory allocation/deallocation, and I/O operations in processing data. Conventional computer systems are built on slow hard disks and are less affected by the software overhead. With the much faster byte-addressable NVM as both memory and storage, the software overhead contributes more significantly to the overall performance overhead. For example, Caulfield et al. [58] found that up to 62% of the response time is contributed by software overhead.

Fig. 5. An Illustration of Approximate Probe to Avoid Searching in an LN

In the design and implementation of Crab-tree, memory allocation and deallocation are frequently executed, especially for COW. As a result, reducing the memory allocations and deallocations made through the NVM library is likely to boost the performance of Crab-tree. At Line 8 of Algorithm 1, Crab-tree asks for NVM space by calling pmalloc, which is a demand-based approach. Instead, we make Crab-tree manage a memory pool by itself. As Crab-tree only stores LNs of a fixed and uniform size in NVM space, the pool is easy to maintain. In brief, at start, Crab-tree requests and initializes a large memory pool by calling pmalloc. At runtime, all requests for NVM space allocations are satisfied from the self-managed memory pool. Whenever the remaining free space of the pool is below a threshold, say, 64 LNs, Crab-tree calls pmalloc to refill the pool. By doing so, Crab-tree avoids waiting for the NVM library to allocate/deallocate space.
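A self-managed pool of fixed-size LNs can be sketched as a free list that is refilled in batches. This is an illustration under assumptions: `ln_pool_t`, `POOL_BATCH`, and the function names are hypothetical, `calloc` stands in for pmalloc, only the 64-LN threshold comes from the text, and the sketch never returns chunks to the underlying allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define LN_SIZE 1024u        /* one of the evaluated node sizes */
#define POOL_BATCH 256u      /* hypothetical number of LNs requested per refill */
#define POOL_THRESHOLD 64u   /* refill when fewer free LNs remain */

typedef struct free_ln { struct free_ln *next; } free_ln_t;

typedef struct {
    free_ln_t *free_list;    /* singly linked list of free LNs */
    size_t free_count;
    size_t refills;          /* calls made to the underlying allocator */
} ln_pool_t;

/* Request one large zero-filled chunk and carve it into LNs. */
static int pool_refill(ln_pool_t *pool) {
    unsigned char *chunk = calloc(POOL_BATCH, LN_SIZE);  /* stand-in for pmalloc */
    if (chunk == NULL) return -1;
    for (size_t i = 0; i < POOL_BATCH; i++) {
        free_ln_t *ln = (free_ln_t *)(chunk + i * LN_SIZE);
        ln->next = pool->free_list;
        pool->free_list = ln;
    }
    pool->free_count += POOL_BATCH;
    pool->refills++;
    return 0;
}

/* Hand out one zero-filled LN, refilling first if the pool runs low. */
static void *pool_alloc(ln_pool_t *pool) {
    assert(pool != NULL);
    if (pool->free_count < POOL_THRESHOLD)
        (void)pool_refill(pool);             /* best effort; may fail */
    if (pool->free_list == NULL) return NULL;
    free_ln_t *ln = pool->free_list;
    pool->free_list = ln->next;
    pool->free_count--;
    memset(ln, 0, sizeof *ln);               /* erase the list link */
    return ln;
}

/* Return an LN to the pool, restoring the all-null invariant. */
static void pool_free(ln_pool_t *pool, void *p) {
    assert(pool != NULL && p != NULL);
    memset(p, 0, LN_SIZE);
    free_ln_t *ln = p;
    ln->next = pool->free_list;
    pool->free_list = ln;
    pool->free_count++;
}
```

In this sketch the underlying allocator is called once per `POOL_BATCH` allocations instead of once per COW operation, which is the software overhead the optimization targets.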

#### 5.2 Approximate Probe to Avoid Searching

Before choosing between SIP and COW for inserting a new KV pair (resp. deleting an existing KV pair) in an LN, Crab-tree first needs to determine the proper position for the KV pair, which is done through the function get\_insert\_location() in Algorithm 1. This function performs a binary search among existing KV pairs of the LN. Figure 5(a) reiterates the example shown in Figure 4(a) which inserts a new KV pair into an LN. As shown by ① to ③ in Figure 5(b), Crab-tree has to try three times to find the appropriate location. Assuming that one cache line holds two KV pairs and the LN has not been cached before calling get\_insert\_location(), two cache misses take place. Considering a large node size, e.g., 4KB or 8KB, in practice, searching out the proper position inevitably degrades performance.

We consider an approximate probe to accelerate the searching process. In brief, we only compare the key to be inserted or deleted with the middle key. If the key under insertion or deletion is no greater than the middle key, which means that SIP would have to shift more than half of the existing KV pairs, Crab-tree chooses COW; otherwise, Crab-tree proceeds with Algorithm 1 by calling get\_insert\_location() in the greater half of KV pairs and choosing between SIP and COW. Figure 5(c) illustrates the approximate probe for the aforementioned example, which entails choosing COW as suggested by the detailed analysis with Figure 4(b). In our empirical evaluation (cf. Section 6.4), we show that despite being approximate, such an approach manages to reduce cache misses in practice.
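The probe can be sketched in C as a single comparison followed, only when needed, by a binary search restricted to the greater half. The function name `probe_chooses_cow` and its interface are hypothetical, and the middle key is taken at index `(n - 1) / 2` as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t k; void *v; } kv_t;

/* Approximate probe over the n sorted KV pairs of an LN.
 * Returns true if COW is chosen from the middle-key comparison alone;
 * in that case *p is left untouched. Otherwise stores the exact insert
 * position in *p, so the caller can choose between SIP and COW. */
static bool probe_chooses_cow(const kv_t *kvs, uint64_t n, uint64_t k,
                              uint64_t *p) {
    assert(p != NULL && (kvs != NULL || n == 0));
    if (n == 0) { *p = 0; return false; }

    uint64_t mid = (n - 1) / 2;
    if (k <= kvs[mid].k)
        return true;                 /* one comparison, one cache line touched */

    uint64_t lo = mid + 1, hi = n;   /* search only the greater half */
    while (lo < hi) {
        uint64_t m = lo + (hi - lo) / 2;
        if (kvs[m].k < k) lo = m + 1;
        else hi = m;
    }
    *p = lo;
    return false;
}
```

When the probe returns true, the exact position can still be discovered while copying KV pairs into the new LN, so no separate search over the lower half is required in this sketch.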

[Figure 6 panels. (a) Shifting one KV pair by SIP with memory fence: ① move the pointer &e with a memory fence; ② move the key 61. (b) The process of CAS instruction: ③ compare the 5th memory location to <null, null>; ④ store the desired <61, &e> to the 5th location.]

#### 5.3 The Reduction of Consistency Cost

Because the ARM architecture has a non-TSO memory model, additional memory fences are required when shifting the key and value of each KV pair for consistency. Figure 6(a) shows an example. As mentioned in Section 3, such additional memory fences degrade the performance of a B+-tree variant relying solely on SIP. By contrast, Crab-tree selectively chooses between COW and SIP at runtime to reduce the performance overhead caused by preserving crash consistency for KV pairs. However, additional memory fences are still needed when Crab-tree conducts SIP.

The ARMv8 architecture, as an evolving design, brings about a possibility of removing such additional memory fences with its upgrade to version ARMv8.1. That is, ARMv8.1 extends the atomic compare-and-swap (CAS) instructions [59] from 32-bit and 64-bit to 128-bit (16B) [60–62]. In brief, a CAS instruction first compares the contents of a memory location with a given value; if they are the same, CAS modifies the memory location with a desired new value. The CPU performs these two steps in an all-or-nothing fashion, so the 128-bit CAS provides a 16B atomic write. As a result, an ARMv8.1 processor can atomically modify 16B stored in an in-NVM location. In fact, such 128-bit CAS instructions have been available on x86 since 2004 [63], and the 16B atomic write has been leveraged in designing system software with persistent memory [9, 12].

Using the 128-bit CAS instruction enables us to enhance Crab-tree when the key is no larger than 8B. As the size of a pointer (value) is 8B on a 64-bit processor, a key of no more than 8B and an 8B pointer form a 16B KV pair that can be atomically shifted as one entity without an additional memory fence. Since the CAS instruction is atomic, the crash consistency of Crab-tree is never impaired. Figure 6 compares SIP with memory fences against SIP with the 128-bit CAS instruction for shifting one KV pair to the right for insertion. Although it needs no memory fence, the atomic CAS instruction involves one comparison and one store operation. In addition, we note that cache lines modified by 128-bit CAS instructions still rely on cache line flushes to be written back to NVM. With a high-end fast ARMv8 processor with efficient comparison and cache line flush, the performance gain of using such 128-bit CAS instructions should be significant.

Fig. 6. A Comparison between SIP with Memory Fence and 128-bit CAS instruction

#### **6 EVALUATION**

#### 6.1 Experimental Setup

We have chosen Raspberry Pi 3 Model B+ [37] as the evaluation platform. It has a 64-bit quad-core ARM Cortex-A53 processor. We installed openSUSE Leap 15.0 ARMv8 version [64] with GCC/G++ 7.3.1. As there is no NVM shipped for Raspberry Pi, we assume that all the 1GB memory of Raspberry Pi is made of NVM with the same access latency as DRAM. This is in line with Everspin's Toggle MRAM products, which have good write endurance as well as symmetrical write and read latencies that are comparable to those of DRAM [1, 65]. Everspin has shipped a few products made of Toggle MRAM and STT-MRAM for embedded applications [66]. We note that some NVM technologies like PCM have limited write endurance that requires efforts to reduce memory writes and conduct wear leveling; however, MRAM technologies are considered to have practically 'unlimited' endurance [1, 67–69].

We have implemented two versions of Crab-tree in C, i.e., Crab-tree and Crab\_opt. Crab-tree is the basic version while Crab\_opt is the enhanced version with the three optimization methods presented in Section 5. The instructions for cache line flush and memory fence are dc cvac and dmb, respectively, because dc cvac corresponds to the newest clwb of x86 [17]. The atomic 128-bit compare-and-swap (CAS) instruction for Crab\_opt is issued by calling the function \_\_atomic\_compare\_exchange and linking with GCC/G++'s '-latomic' option. As to competitors in evaluation, besides a volatile B+-tree without consistency guarantee (Volatile), we have considered NV-Tree and FAST-FAIR. We implemented NV-Tree (NV-Tree) according to its original design [25]. We downloaded the x86 source code of FAST-FAIR, replaced x86's cache line flush and memory fence with ARMv8's, and added dmb to counteract non-TSO following the algorithms of FAST-FAIR [28] (FAST-FAIR). Volatile and Crab-tree use binary search while FAST-FAIR uses linear search. NV-Tree performs binary search in INs and linear search in LNs as its LNs are unsorted.
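A flush-and-fence primitive built on dc cvac and dmb can be sketched as below. This is a sketch under assumptions, not the code used in the evaluation: the barrier variant `dmb ish`, the 64B line size `CL_SIZE`, and the function names are hypothetical choices, and on non-AArch64 targets the flush is a no-op and the fence falls back to a compiler-provided full barrier so that the code still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CL_SIZE 64u   /* assumed data cache line size in bytes */

/* Memory fence. */
static inline void fence(void) {
#if defined(__aarch64__)
    __asm__ volatile("dmb ish" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* portable stand-in */
#endif
}

/* Clean (write back) the cache line containing addr. */
static inline void flush_line(const void *addr) {
#if defined(__aarch64__)
    __asm__ volatile("dc cvac, %0" : : "r"(addr) : "memory");
#else
    (void)addr;                                /* no-op off AArch64 */
#endif
}

/* Flush every cache line overlapping [addr, addr + len), then fence.
 * Returns the number of lines flushed. */
static size_t flush_fence(const void *addr, size_t len) {
    assert(addr != NULL && len > 0);
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CL_SIZE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;
    for (uintptr_t p = start; p < end; p += CL_SIZE) {
        flush_line((const void *)p);
        lines++;
    }
    fence();
    return lines;
}
```

A batched flush, as used by COW, is one call over the whole node: many line flushes followed by a single fence.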

For each tree, IN and LN have the same size, i.e., 1KB, 2KB, 4KB and 8KB. We do not choose smaller node sizes (<1KB) concerning both time- and space-efficiencies. First, a smaller node is likely to trigger more splits that incur more performance overhead. Second, a smaller node means smaller capacity; given the same quantity of stored data, the tree will have a larger height, which impairs searching from the root to a target LN for every operation. Third, still with the same quantity of stored data, a smaller LN requires many more INs, which underutilizes the system's memory space.

As to the workloads, we have considered keys following a uniform distribution without write skewness. This distribution was used by FAST-FAIR and is also favored by NV-Tree [25, 28]. We have inserted, searched and deleted 1 million and 10 million keys, respectively, with the aforementioned five trees. The performance of each tree is measured using the average execution time per operation in microseconds ($\mu$s).

#### 6.2 Write and Read Performance

The three diagrams in Figure 7 capture the average execution time of inserting, searching and deleting one million keys with five trees, respectively. Since Volatile has no consistency cost and employs binary search, its write and read performances are optimal. NV-Tree maintains unsorted keys in an LN and does not shift KV pairs; it just appends to the end of an LN a KV pair with a positive flag for insertion or a negative flag for deletion. As a result, it shows higher write performance than Crab-tree, by 5.5%-30.0% with the four node sizes. However, owing to the unsorted keys, NV-Tree has to linearly scan all KV pairs in an LN for searching. That explains why its search performance is much worse than Crab-tree's. The average search time of NV-Tree is 1.6×, 2.0×, 2.6× and 3.7× that of Crab-tree, respectively, with the four node sizes. The gap in average search performance between them widens because a larger node requires more time for a linear scan.

Fig. 7. A Comparison of Insertion, Search and Deletion Performance with 1 Million Keys

Crab-tree outperforms FAST-FAIR in both write and read performances. For example, with 8KB node size, the average execution times of FAST-FAIR for insertion and deletion are 1.6× and 1.9× those of Crab-tree, while its average search time is 2.8× that of Crab-tree. The reason for Crab-tree's higher performance is threefold. First, on insertions and deletions, Crab-tree selectively uses COW and SIP to minimize consistency costs, but FAST-FAIR always shifts in situ. Second, Crab-tree enforces consistency only on LNs while FAST-FAIR covers all nodes. Third, the linear search of FAST-FAIR is slower than the binary search of Crab-tree, which affects the efficiency of all write and read operations of the B+-tree. The design of FAST-FAIR, especially its crash recovery, is however closely bound to linear search [28].

[Figure 8, panel (c): The Number of Memory Fences]

For a quantitative analysis, we have recorded the number of cache line flushes, the number of bytes flushed to NVM and the number of memory fences while inserting 1 million keys. Figure 8 shows the results. As NV-Tree always appends a new KV pair for insertion, it flushes the least data with the fewest memory fences. A comparison between FAST-FAIR and Crab-tree, however, may seem contradictory to the results shown in Figure 7, because Crab-tree executed more cache line flushes and flushed more data than FAST-FAIR, i.e., at most 22.7% more with 8KB node size. Note that one idea of Crab-tree is to leverage COW with batched flushes to flush KV pairs in an aggregated manner instead of flushing cache line by cache line. The large amount of data flushed by Crab-tree is just entailed by COW. This is also justified by Figure 8(c), as the number of memory fences executed by Crab-tree is merely 22.5% that of FAST-FAIR. In practice, COW just needs two memory fences to flush a large number of cache lines in a batch.

Fig. 9. A Comparison of Insertion, Search and Deletion Performances with 10 Million Keys

We have also made a breakdown of the data flushed by COW and SIP when Crab-tree was inserting 1 million keys. Figure 10(a) captures the percentages contributed by COW and SIP. It is evident that at least 71.9% of the flushed data was written by COW through batched flushes, which enables Crab-tree to achieve higher write performance than FAST-FAIR.

Crab\_opt, as an enhanced version of Crab-tree, yields higher performance for insertion and deletion than Crab-tree. Take the insertion performance for example. With the four node sizes, Crab\_opt reduces the average execution time for insertion by 19.5%, 13.5%, 13.7%, and 16.8%, respectively, compared with the original Crab-tree. In particular, when deleting 1 million KV pairs with 1KB node size, Crab\_opt is up to 22.9% faster than Crab-tree. These performance improvements confirm the effectiveness of the three optimization methods we have employed to upgrade the design and implementation of Crab-tree. Concretely, the performance gap between Crab\_opt and FAST-FAIR rises to up to 2.2× on deleting 1 million KV pairs with 8KB node size, while the average execution time of NV-Tree is as much as 3.7× that of Crab\_opt with 8KB node size on searching 1 million KV pairs.

(b) The Breakdown of Memory Fences Executed by COW and SIP

Fig. 10. The Breakdowns of COW and SIP in Using Cache Line Flushes and Memory Fences

One method used by Crab\_opt is removing additional memory fences in SIP by calling 128-bit CAS instructions. In Figure 8(c), the number of memory fences of Crab\_opt is much less than that of Crab-tree. However, the cost of executing 128-bit CAS instructions is non-negligible. Moreover, as mentioned, using 128-bit CAS instructions still causes the same quantity of dirty cache lines that must be flushed to NVM through costly cache line flush instructions. As a result, this method helps Crab\_opt gain marginal performance improvement. A detailed discussion of three optimization methods can be found in Section 6.4.

Furthermore, we increased the number of keys to be operated on from 1 million to 10 million. However, the design of NV-Tree has to allocate a huge contiguous memory space when rebuilding INs at runtime [25]. Given a large volume of keys to be processed, the ever-increasing memory space required by NV-Tree will eventually exceed the capability of an embedded system. Thus, NV-Tree, with good write performance but suboptimal read performance under small workloads, is neither scalable nor robust for large workloads. We thus compare Volatile, FAST-FAIR and Crab-tree. Figure 9 presents their execution time. The write and read performances of the three trees degrade with more keys. However, Crab-tree still significantly outperforms FAST-FAIR. For example, with 8KB node size, FAST-FAIR spent 1.6×, 1.9× and 2.3× the time of Crab-tree to complete an insertion, search and deletion request, respectively. More importantly, when the workload scales up, the performance of Crab-tree does not fluctuate badly. Take the 8KB node size for example again. The average time for Crab-tree to delete a KV pair is 10.3 $\mu$s and 10.9 $\mu$s with 1 million and 10 million keys, respectively. As a result, Crab-tree has good robustness and scalability.

The optimization methods used by Crab\_opt are still effective with heavier workloads. As shown by Figure 9, Crab\_opt still outperforms Crab-tree. For example, Crab\_opt completes deleting 10 million KV pairs with 1KB node size by spending 22.1% less time than Crab-tree.

#### 6.3 The Impact of COW and SIP

The high performance of Crab-tree, especially the write performance, is attributed to the selection between COW and SIP. Crab-tree makes a decision when a key is to be inserted into an LN. The breakdowns of flushed data and executed memory fences in Figure 10(a) and 10(b) show that COW and SIP act in their respective ways for crash consistency. Moreover, we used each strategy alone to insert 1 million and 10 million keys. In Figure 11, COW-only and SIP-only mean using COW and SIP alone for inserting keys, respectively. The execution time of Crab-tree with a runtime combination of COW and SIP is much less than that of COW-only and SIP-only. For example, on inserting 1 million keys with 8KB node size, the time costs of COW-only and SIP-only are 90.5% and 22.2% more than that of Crab-tree. These results clearly demonstrate that employing a single strategy yields suboptimal performance compared to an integration of and runtime selection between them.

Fig. 11. A Comparison among COW-only, SIP-only and Crab-tree

Fig. 12. The Contributions of Three Optimization Methods

#### 6.4 Contributions of Three Optimization Methods

Crab\_opt includes three optimization methods. We further did a test to measure the respective and joint contributions of the three methods through different configurations. Figure 12 illustrates the average execution time of inserting 1 million KV pairs with 1KB node size. In Figure 12, the suffix self\_space means Crab-tree is enhanced with the method of self-managed NVM space. The suffix probe means Crab-tree is enhanced with the method of approximate probe to reduce cache misses in searching. The suffix 16B\_AW means Crab-tree is enhanced with the method of removing the additional memory fences in SIP with 16B atomic writes. Two suffixes mean that two methods are added to Crab-tree. The rightmost Crab\_opt in Figure 12 includes all three optimization methods. The first observation from a comparison among the eight bars corresponding to the eight configurations in Figure 12 is that the method of self-managed NVM space is more effective than the other two methods. We note that, during an insertion or deletion, either SIP or COW executes a large number of cache line flushes to orderly flush multiple KV pairs in an LN. Also, SIP with 128-bit CAS instructions does not waive the use of cache line flushes to orderly write back dirty cache lines. Those cache line flushes in fact dominate the insertion and deletion performance, so replacing the shifting of KV pairs with 128-bit CAS instructions in SIP and using the approximate probe to reduce binary search can only have marginal impacts. On the other hand, software overhead is considerable in application and system software [58, 70], so minimizing software overhead is likely to promote performance. In addition, the low-end ARMv8 processor used in our experiments has a time-consuming cache line flush. Considering a high-end ARMv8 processor with lightweight instructions for 128-bit CAS and cache line flush, the contributions of reducing memory fences and cache misses would be much more significant.

Fig. 13. The Rebuilding Time of Crab-tree

The second observation obtained from Figure 12 is that a combined use of the three methods yields more performance gain than using any single one of them. For example, Crab+16B\_AW, Crab+probe, and Crab+self\_space are 2.3%, 3.0%, and 12.4% faster than Crab-tree, respectively, on completing 1 million insertions. Crab\_opt, which jointly employs all three methods, is however up to 19.5% faster than Crab-tree. The reason behind this observation is that mixing multiple methods lets them benefit from each other in minimizing performance overhead and consequently magnifies each one's respective efficacy.
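The joint effect can be sanity-checked against the individual gains. The sketch below uses only the percentages reported above; the three ways of combining them (adding the gains, compounding time reductions, compounding speedups) and the function name are assumptions chosen for illustration, since the text does not state how "faster" is defined.

```python
def combined_estimates(gains):
    """Estimate a joint gain from individual gains under three assumptions."""
    additive = sum(gains)

    remaining_time = 1.0
    speedup = 1.0
    for g in gains:
        remaining_time *= 1.0 - g  # each method removes a fraction of the time
        speedup *= 1.0 + g         # each method multiplies the throughput

    return {
        "additive": additive,
        "compound_time": 1.0 - remaining_time,
        "compound_speedup": speedup - 1.0,
    }


if __name__ == "__main__":
    individual = [0.023, 0.030, 0.124]  # 16B_AW, probe, self_space
    reported_joint = 0.195              # Crab_opt

    for name, value in combined_estimates(individual).items():
        print(f"{name}: {value:.1%} (reported joint gain: {reported_joint:.1%})")
```

Each of the three estimates (roughly 17.7%, 17.0%, and 18.4%) stays below the reported 19.5%, so under any of these readings the combined configuration gains more than the individual methods would suggest on their own.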

### 6.5 Rebuilding Time

Crab-tree rebuilds INs at startup. We measured the rebuilding time after a normal shutdown with 1 million and 10 million keys stored, respectively. Because the NVM we used is emulated with DRAM, we saved the LNs of Crab-tree to a file stored on the Micro-SD card of the Raspberry Pi. Crab-tree loaded the LNs from the file and initiated rebuilding. It scanned each LN to check whether any inconsistency existed. Figure 13 illustrates the rebuilding time with four node sizes. The average time over all keys is about $8.3\mu s$ and $9.7\mu s$ for 1 million and 10 million stored keys, respectively. Crab-tree is hence robust and scalable in rebuilding, with only a marginal increase in time cost (16.9%) for ten times more stored data.
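The 16.9% figure follows directly from the two reported averages. As a worked check, and reading those averages as per-key costs (an interpretation rather than something stated explicitly), the implied total rebuilding times would be about:

$$
\frac{9.7\,\mu s - 8.3\,\mu s}{8.3\,\mu s} \approx 0.169 = 16.9\%, \qquad
10^{6} \times 8.3\,\mu s \approx 8.3\,s, \qquad
10^{7} \times 9.7\,\mu s \approx 97\,s.
$$

Under that reading, ten times more keys take roughly 11.7 times longer to rebuild in total, which is close to linear scaling.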

### 7 CONCLUSION

The byte-addressable NVM promises persistent memory that can replace both volatile memory and persistent storage. In this paper, we conduct an empirical study to figure out the weaknesses of a state-of-the-art in-NVM B+-tree by porting it to ARMv8. We accordingly develop Crab-tree, which guarantees crash consistency with reduced temporal cost. Crab-tree selects one of two strategies, i.e., copy on write and shifting in place, for inserting and deleting data, depending on their respective consistency costs. Crab-tree also defines strict execution orders to rule out inconsistency after a system crash. To further improve performance, Crab-tree incorporates methods to reduce software overhead, cache misses due to searching, and consistency cost in shifting KV pairs. Extensive experiments confirm that Crab-tree achieves both high performance and good scalability.

### 8 ACKNOWLEDGMENTS

This work is partially supported by the Ministry of Education of Singapore under the grant MOE2018-T2-1-098.

### REFERENCES

- [1] Everspin. MRAM technology attributes. https://www.everspin.com/mram-technology-attributes, December 2018.
- [2] Shimin Chen, Phillip B. Gibbons, and Suman Nath. Rethinking database algorithms for phase change memory. In 5th Biennial Conference on Innovative Data Systems Research (CIDR '11), pages 1–11, January 2011.
- [3] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. GraphR: Accelerating graph processing using ReRAM. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 531–543, Feb 2018.
- [4] Micron and Intel. 3D XPoint technology. http://www.micron.com/about/innovations/3d-xpoint-technology.
- [5] Xiangyu Dong, Naveen Muralimanohar, Norm Jouppi, Richard Kaufmann, and Yuan Xie. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 57:1–57:12, New York, NY, USA, 2009. ACM.
- [6] Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Zhiyong Liu, and Frederic T. Chong. Balancing performance and lifetime of MLC PCM by using a region retention monitor. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 385–396, Feb 2017.
- [7] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In *Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles*, SOSP '09, pages 133–146, New York, NY, USA, 2009. ACM.
- [8] Eunji Lee, Hyokyung Bahn, and Sam H. Noh. Unioning of the buffer cache and journaling layers with non-volatile memory. In Presented as part of the 11th USENIX Conference on File and Storage Technologies (FAST 13), pages 73–80, San Jose, CA, 2013. USENIX.
- [9] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.
- [10] Youyou Lu, Jiwu Shu, and Long Sun. Blurred persistence in transactional persistent memory. In 2015 31st Symposium on Mass Storage Systems and Technologies (MSST), pages 1–13, May 2015.
- [11] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 323–338, Santa Clara, CA, February 2016. USENIX Association.
- [12] Qingsong Wei, Chundong Wang, Cheng Chen, Yechao Yang, Jun Yang, and Mingdi Xue. Transactional NVM cache with high performance and crash consistency. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17, pages 56:1–56:12, New York, NY, USA, 2017. ACM.
- [13] Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, Chih-Ching Kuo, Ming-Chang Yang, Hsin-Wen Wei, and Wei-Kuan Shih. Enabling write-reduction strategy for journaling file systems over byte-addressable NVRAM. In *Proceedings of the 54th Annual Design Automation Conference 2017*, DAC '17, pages 44:1–44:6, New York, NY, USA, 2017. ACM.
- [14] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 12:1–12:16, New York, NY, USA, 2016. ACM.
- [15] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 105–118, New York, NY, USA, 2011. ACM.
- [16] Deshan Zhang, Lei Ju, Mengying Zhao, Xiang Gao, and Zhiping Jia. Write-back aware shared last-level cache management for hybrid main memory. In *Proceedings of the 53rd Annual Design Automation Conference*, DAC '16, pages 172:1–172:6, New York, NY, USA, 2016. ACM.
- [17] Intel. Persistent memory development kit. http://pmem.io/pmdk/.
- [18] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: Lightweight persistent memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 91–104, New York, NY, USA, 2011. ACM.

- [19] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 14:1–14:14, New York, NY, USA, 2014. ACM.
- [20] Chundong Wang and Sudipta Chattopadhyay. LAWN: Boosting the performance of NVMM file system through reducing write amplification. In *Proceedings of the 55th Annual Design Automation Conference*, DAC '18, pages 6:1–6:6, New York, NY, USA, 2018. ACM.
- [21] Fang Wang, Zhaoyan Shen, Lei Han, and Zili Shao. ReRAM-based processing-in-memory architecture for blockchain platforms. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC '19, pages 615–620, New York, NY, USA, 2019. ACM.
- [22] ARM. ARM architecture reference manual ARMv8, for ARMv8-A architecture profile. https://static.docs.arm.com/ddi0487/ca/DDI0487C\_a\_armv8\_arm.pdf, December 2017.
- [23] ARM. ARMv8-A architecture evolution. https://community.arm.com/processors/b/blog/posts/armv8-a-architectureevolution, January 2016.
- [24] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell. Consistent and durable data structures for non-volatile byte-addressable memory. In *Proceedings of the 9th USENIX Conference on File and Storage Technologies*, FAST'11, pages 1–15, Berkeley, CA, USA, 2011. USENIX Association.
- [25] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. NV-Tree: Reducing consistency cost for NVM-based single level systems. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 167–181, Santa Clara, CA, 2015. USENIX Association.
- [26] Shimin Chen and Qin Jin. Persistent B+-trees in non-volatile main memory. *Proc. VLDB Endow.*, 8(7):786–797, February 2015.
- [27] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. FPTree: A hybrid SCM-DRAM persistent and concurrent B-Tree for storage class memory. In *Proceedings of the 2016 International Conference on Management of Data*, SIGMOD '16, pages 371–386, New York, NY, USA, 2016. ACM.
- [28] Deukyeon Hwang, Wook-Hee Kim, Youjip Won, and Beomseok Nam. Endurable transient inconsistency in byte-addressable persistent B+-Tree. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 187–200, Oakland, CA, 2018. USENIX Association.
- [29] Ren-Shuo Liu, De-Yu Shen, Chia-Lin Yang, Shun-Chih Yu, and Cheng-Yuan Michael Wang. NVM Duet: Unified working memory and persistent store architecture. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 455–470, New York, NY, USA, 2014. ACM.
- [30] Youyou Lu, Jiwu Shu, Long Sun, and Onur Mutlu. Loose-ordering consistency for persistent memory. In Computer Design (ICCD), 2014 32nd IEEE International Conference on, pages 216–223, Oct 2014.
- [31] Arpit Joshi, Vijay Nagarajan, Stratis Viglas, and Marcelo Cintra. ATOM: Atomic durability in non-volatile memory through hardware logging. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 361–372, Feb 2017.
- [32] Seunghee Shin, Satish Kumar Tirukkovalluri, James Tuck, and Yan Solihin. Proteus: A flexible and fast software supported hardware logging approach for NVM. In *Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO-50 '17, pages 178–190, New York, NY, USA, 2017. ACM.
- [33] M. A. Ogleari, E. L. Miller, and J. Zhao. Steal but no force: Efficient hardware undo+redo logging for persistent memory systems. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 336–349, Feb 2018.
- [34] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 128–138, New York, NY, USA, 2000. ACM.
- [35] Chundong Wang, Qingsong Wei, Jun Yang, Cheng Chen, and Mingdi Xue. How to be consistent with persistent memory? An evaluation approach. In *Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on*, pages 186–194, Aug 2015.
- [36] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. X86-TSO: A rigorous and usable programmer's model for x86 multiprocessors. *Commun. ACM*, 53(7):89–97, July 2010.
- [37] The Raspberry Pi Foundation. Raspberry Pi 3 Model B+, 2018. https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/.
- [38] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In *Proceedings of the 36th Annual International Symposium on Computer Architecture*, ISCA '09, pages 24–33, New York, NY, USA, 2009. ACM.
- [39] Zhenyu Sun, Xiuyuan Bi, Hai (Helen) Li, Weng-Fai Wong, Zhong-Liang Ong, Xiaochun Zhu, and Wenqing Wu. Multi retention level STT-RAM cache designs with a dynamic refresh scheme. In *Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO-44, pages 329–338, New York, NY, USA, 2011. ACM.

ACM Trans. Embedd. Comput. Syst., Vol. XX, No. Y, Article ZZZ. Publication date: April 2020.


- [40] Dushyanth Narayanan and Orion Hodson. Whole-system persistence. SIGPLAN Not., 47(4):401–410, March 2012.
- [41] Jishen Zhao, Sheng Li, Doe Hyun Yoon, Yuan Xie, and Norman P. Jouppi. Kiln: Closing the performance gap between systems with and without persistence support. In *Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO-46, pages 421–432, New York, NY, USA, 2013. ACM.
- [42] Hyojun Kim, Sangeetha Seshadri, Clement L. Dickey, and Lawrence Chiu. Evaluating phase change memory for enterprise storage systems: A study of caching and tiering approaches. In *Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14)*, pages 33–45, Santa Clara, CA, 2014. USENIX.
- [43] Rujia Wang, Lei Jiang, Youtao Zhang, Linzhang Wang, and Jun Yang. Selective restore: An energy efficient read disturbance mitigation scheme for future STT-MRAM. In Proceedings of the 52nd Annual Design Automation Conference, DAC '15, pages 21:1–21:6, New York, NY, USA, 2015. ACM.
- [44] Mingkai Dong and Haibo Chen. Soft updates made simple and fast on non-volatile memory. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 719–731, Santa Clara, CA, 2017. USENIX Association.
- [45] Clinton W. Smullen, Vidyabhushan Mohan, Anurag Nigam, Sudhanva Gurumurthi, and Mircea R. Stan. Relaxing nonvolatility for fast and energy-efficient STT-RAM caches. In *Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture*, HPCA '11, pages 50–61, Washington, DC, USA, 2011. IEEE Computer Society.
- [46] Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu. ThyNVM: Enabling software-transparent crash consistency in persistent memory systems. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 672–685, Dec 2015.
- [47] Chen Liu and Chengmo Yang. Secure and durable (SEDURA): An integrated encryption and wear-leveling framework for PCM-based main memory. In Proceedings of the 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems 2015, LCTES'15, pages 12:1–12:10, New York, NY, USA, 2015. ACM.
- [48] Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. Mojim: A reliable and highly-available non-volatile memory system. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 3–18, New York, NY, USA, 2015. ACM.
- [49] Duo Liu, Kan Zhong, Tianzheng Wang, Yi Wang, Zili Shao, Edwin Hsing-Mean Sha, and Jingling Xue. Durable address translation in PCM-Based flash storage systems. *IEEE Transactions on Parallel and Distributed Systems*, 28(2):475–490, Feb 2017.
- [50] Tianzheng Wang, Duo Liu, Yi Wang, and Zili Shao. Towards write-activity-aware page table management for non-volatile main memories. ACM Trans. Embed. Comput. Syst., 14(2):34:1–34:23, February 2015.
- [51] Daeyoung Lee and Hyunok Oh. A lifetime aware buffer assignment method for streaming applications on DRAM/PRAM hybrid memory. ACM Trans. Embed. Comput. Syst., 12(1s):36:1–36:17, March 2013.
- [52] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. Memory persistency. In Proceeding of the 41st Annual International Symposium on Computer Architecture, ISCA '14, pages 265–276, Piscataway, NJ, USA, 2014. IEEE Press.
- [53] Chen Pan, Mimi Xie, Chengmo Yang, Yiran Chen, and Jingtong Hu. Exploiting multiple write modes of nonvolatile main memory in embedded systems. ACM Trans. Embed. Comput. Syst., 16(4):110:1–110:26, May 2017.
- [54] Cheng Chen, Qingsong Wei, Weng-Fai Wong, and Chundong Wang. NV-Journaling: Locality-aware journaling using byte-addressable non-volatile memory. *IEEE Transactions on Computers*, pages 1–12, 2019.
- [55] Lei Han, Zhaoyan Shen, Zili Shao, and Tao Li. Optimizing RAID/SSD controllers with lifetime extension for flash-based SSD array. In Proceedings of the 19th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES 2018, pages 44–54, New York, NY, USA, 2018. ACM.
- [56] Vijay Chidambaram, Tushar Sharma, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Consistency without ordering. In *Proceedings of the 10th USENIX Conference on File and Storage Technologies*, FAST'12, pages 101–116, Berkeley, CA, USA, 2012. USENIX Association.
- [57] Chundong Wang, Sudipta Chattopadhyay, and Gunavaran Brihadiswarn. Crash recoverable ARMv8-oriented B+-tree for byte-addressable persistent memory. In *Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems*, LCTES 2019, pages 33–44, New York, NY, USA, 2019. ACM.
- [58] Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, and Steven Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In *Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO '43, pages 385–395, Washington, DC, USA, 2010. IEEE Computer Society.
- [59] Wikipedia. Compare-and-swap. https://en.wikipedia.org/wiki/Compare-and-swap, November 2019. Last accessed on 15 November 2019.
- [60] ARM Holdings. Release notes for arm compiler 6.10. https://en.wikipedia.org/wiki/Compare-and-swap, March 2018. Last accessed on 11 August 2019.
- [61] Phil Yang. Add 128-bit atomic compare exchange. http://patchwork.dpdk.org/patch/61468/, October 2019. Last accessed on 11 November 2019.


- [62] WikiChip. ARMv8.1. https://en.wikichip.org/wiki/arm/armv8.1, December 2017.
- [63] David A Patterson and John L Hennessy. Computer Organization and Design ARM Edition: The Hardware Software Interface. Morgan Kaufmann, 2016.
- [64] openSUSE. openSUSE:AArch64, 2018. https://en.opensuse.org/openSUSE:AArch64.
- [65] Everspin. Spin-transfer torque MRAM technology. https://www.everspin.com/spin-transfer-torque-mram-technology, December 2018.
- [66] Everspin. Everspin technologies announces development of STT-MRAM for industrial and IoT applications. https://www.everspin.com/sites/default/files/pressdocs/Everspin\_industrial\_IoT\_STTMRAM.pdf, February 2020.
- [67] Jimmy J. Kan, Chando Park, Chi Ching, Jaesoo Ahn, Yuan Xie, Mahendra Pakala, and Seung H. Kang. A study on practically unlimited endurance of STT-MRAM. *IEEE Transactions on Electron Devices*, 64(9):3639–3646, 2017.
- [68] Ping Chi, Wang-Chien Lee, and Yuan Xie. Making B+-tree efficient in PCM-based main memory. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, ISLPED '14, pages 69–74, New York, NY, USA, 2014. Association for Computing Machinery.
- [69] Mazen Alwadi, Aziz Mohaisen, and Amro Awad. Phoenix: Towards persistently secure, recoverable, and NVM friendly tree of counters, 2019.
- [70] Adrian M. Caulfield, Todor I. Mollov, Louis Alex Eisner, Arup De, Joel Coburn, and Steven Swanson. Providing safe, user space access to fast, solid state disks. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 387–400, New York, NY, USA, 2012. ACM.