02 Cache Replacement Policies Explained: LRU, FIFO, LFU, MRU & More اصلي

When a cache miss occurs, the CPU needs to load the data from the main memory into the cache. However, if the cache is already full, it must decide which existing data in the cache to evict. This decision is made using a replacement algorithm, also known as a replacement policy.

There are many types of cache replacement policies, each with several variations. Every cache uses the best policy for its specific purpose. In this post, we'll briefly go over common replacement algorithms, and at the end, we will focus on the least recently used policy and its variations, as it's one of the most common cache replacement policies in modern systems.

Why Do We Need Cache Replacement Policies?

Limited space: Caches are small; eviction is inevitable.
Frequent data requests: Prioritize "hot" (frequently used) data.
Optimizing hit rate: Minimize misses to improve performance.

Disclaimer: The algorithms we're about to explain are simplified versions of the ones used in real-world systems. Each cache component has its own specific replacement policy based on its purpose, architecture, and predicted access patterns. Covering them all is impossible.

Common Cache Replacement Policies

Least Recently Used

The least recently used cache replacement policy is a widely used algorithm for managing cache memory and it's arguably the most intuitive one. It operates on the principle that when a cache is full and new data needs to be loaded, the data that hasn't been used for the longest time is evicted. To implement this, the cache tracks the order in which data is accessed.

When a cache miss occurs and a replacement is necessary, the data that was accessed the longest time ago is removed to make space for the new data.

Least Frequently Used

The least frequently used algorithm is designed to track how often data is accessed and keep frequently used data in the cache. In a full implementation of the LFU algorithm, each cache line has a frequency counter that represents the exact number of times the cache line has been accessed for reading or writing.

When data eviction is necessary, the line with the lowest frequency counter is replaced. However, the full LFU algorithm has several major drawbacks that make it inefficient in cache components, so it is rarely used in real systems.

Random Replacement

The random replacement algorithm randomly selects any cache line for replacement. When a cache miss occurs and replacement is needed, a random block of data in the cache is chosen to be evicted. This simple algorithm avoids the overhead of maintaining additional tracking data. It is sometimes used in systems where cache hits and misses are unpredictable or in systems where simplicity is the highest priority.

First-In First-Out

The first-in-first-out algorithm is a simple policy where cache lines are evicted in the order they were loaded, regardless of how frequently or recently they are accessed. When a replacement is needed, the oldest data line, the one that has been in the cache the longest, is removed to make space.

Unlike more sophisticated algorithms, first-in-first-out doesn't account for frequency or recency, which can lead to worse performance in some situations. However, its simplicity makes it easy to implement with minimal overhead.

Most Recently Used

The most recently used algorithm replaces the data that was accessed most recently. It's useful in systems where older data is more likely to be accessed than newer data. Repeated access to large files benefit from this policy, since the cache tends to keep the repeated access data.

Adaptive Replacement

Adaptive replacement dynamically balances between two algorithms based on the access patterns of the workload and the actual hit and miss rates for each. It does this by splitting cache sets into groups. Typically, the cache sets are divided into three groups.

One dedicated for the first algorithm, another to the second algorithm, and a third group called follower sets, which dynamically adjust their algorithm based on which dedicated set performs better. The algorithms are chosen by the manufacturer for optimal results.

For example, some Intel L3 caches use adaptive replacement with two variations of the least frequently used policy.

See this table to understand more.

Policy	Principle	Pros	Cons	Use Case
LRU	Evict least recently used	High accuracy	High overhead	L1 cache
LFU	Evict least frequently used	Keeps popular data	Complex tracking	Rare
FIFO	Evict oldest data	Simple	Ignores access patterns	Embedded systems
Random	Random eviction	Fast	Unpredictable	Low-resource systems
MRU	Evict most recent	Good for streaming	Inefficient for loops	Specialized apps

Least Recently Used Variations

Okay, we've covered all the common cache replacement algorithms, but now I want to focus on the least recently used algorithm, since it's the most commonly used policy in CPU caches.

Full LRU

A full LRU algorithm requires storing the exact order in which the data lines were accessed. Since each address from the main memory can only be mapped to a specific set in the cache, the access order needs to be tracked and compared among the data lines within that set.

Each set has an additional field that represents the current access order. In a four-way cache, there are 4! or 24 possible access order permutations. If each arrangement is marked by a number, the LRU field must be 5 bits in size to represent all 24 permutations. In general, for a cache with n ways, the number of LRU bits required for each set is log of n!

When data is accessed and the cache hit occurs, the LRU bits in the relevant set are updated to reflect the new order. If the cache is full and a cache miss occurs, the LRU bits are decoded and the last cache line in that order, which is guaranteed to be the least recently used, is evicted.

The new data is then loaded into that slot and the LRU bits are updated to reflect the new access order. Full LRU works well when the data that was used recently is likely to be needed again soon, but it comes with a high cost in the form of hardware complexity and overhead of reading and updating the LRU bits on every data access. For this reason, many CPU caches use approximations of the LRU algorithm.

Tree-Based Pseudo-LRU

In an LRU approximation, the cache doesn't know the exact order of data access. Instead, it attempts to choose older data lines for eviction. There are several ways to approximate the LRU algorithm, but a common method is to use a simple binary tree.

Each node in the tree is represented by a single bit, and the leaves of the tree represent the cache lines in a set. The value in each bit indicates the path to the most recently used data, where a value of 0 means the left subtree has more recently accessed data, and a value of 1 means the right subtree is more recent. When a cache line is accessed, and the requested address is in the cache, the algorithm updates the tree bits along the path from the root to that line to mark it as the most recently used data.

When the requested address is not in the cache, the tree algorithm traverses the tree in the opposite direction to find the least recently used data. It then evicts the old data line and copies the new data into this slot. Finally, it updates the tree so that the path points to the new block of data, marking it as the most recent one in that set.

The number of bits required for a 4-way cache is 3, which is lower than the 5 bits needed for each set in the full LRU implementation. In general, the number of bits required for a tree-based pseudo-LRU is the number of ways minus 1 for each set in the cache. This algorithm provides an approximation of LRU because it doesn't track the exact order of all cache line accesses.

Instead, it relies on the tree structure to give a good enough approximation. The pseudo-LRU replacement policy is used in many L1 caches in modern CPUs and much less frequently in larger caches.

Not Recently Used

In the not-recently-used algorithm, each cache line has a single bit that indicates whether it has been used recently. A bit set to 0 means the associated cache line was recently used, while a bit set to 1 means the cache line was not recently used.

Initially, all NRU bits are set to 1, indicating that the lines have not been recently used. When a cache line is accessed, its NRU bit is set to 0, marking it as recently used. When a cache miss occurs, the system looks for the first cache line with its bit still set to 1, replaces it, and flips its NRU bit to 0 to mark it as recently used.

Eventually, most NRU bits will be set to 0. If a cache miss occurs and there is only one cache line with an NRU value of 1, the data in that line is replaced, its NRU bit is set to 0, and all other NRU bits are flipped back to 1, indicating that all other lines are now considered not recently used. In a situation where all of the NRU bits are set to 0, they are all immediately flipped back to 1.

Quad-Age LRU

The quad-age LRU algorithm is typically used in caches with higher associativity, such as L3 CPU caches. Each cache line has an additional 2 bits that represent its relative age compared to the other lines in the same set. A cache line with a lower age means it was accessed more recently than a cache line with a higher age. Initially, data is inserted into the cache from left to right with an age of 1. When a cache hit occurs, the age of the accessed line is modified based on its current age.

A value of 2 or 3 will become 1, while a value of 1 will be reduced to 0. If, after an access, hit or miss, there are no lines with an age of 3, the ages of all lines are increased by 1. This process is repeated until a cache line with an age of 3 is found. Upon a cache miss, the first cache line with an age of 3 is evicted, the new data is written, and the age is updated to 1.

If you found this article helpful, please consider sharing it with others who might benefit from this information. Thank you for reading!

Athar 4 Editor EN