04 Cache Hierarchy: How Modern CPU Caches Are Organized (L1, L2 and L3)

Mustafa NJ

CPU Cache Essentials:

Efficient memory access is critical for high-performance computing, and at the heart of this process lies the CPU cache. In this comprehensive guide, we explore why caches are essential, the structure of modern cache hierarchies, and how read and write operations work within them. Whether you're a student, engineer, or tech enthusiast, this detailed breakdown will give you a complete understanding of how CPU caches work.

Why Caches Matter

Did you know that accessing main memory can take hundreds of CPU cycles? The processor operates at a very high speed, but every time it needs to fetch data from main memory, it's forced to wait until the requested data arrives. This delay is why computers have caches: small, fast memories that keep the data the CPU needs close by and cut down on those costly stalls.

Modern CPUs feature a hierarchical cache system, where the cache closest to the processor core is the smallest and fastest, while the furthest cache is the largest but slowest.

Modern Cache Hierarchy Structure

Large caches are inherently more complex, which increases their access times. So to maximize performance while reducing latency and cost, the first level of cache, known as L1, is designed to be very small to match the speed of the processor.

L1 caches are typically divided into two separate components: one for data (the L1 data cache) and one for instructions (the L1 instruction cache). Because L1 has to stay small to stay fast, its capacity quickly becomes a limiting factor. To solve this, many CPU architectures add a second cache that is larger but slower.

Components of a Cache Hierarchy

This second, larger cache is known as the L2 cache. The L2 cache is usually a unified cache, which means it can store both data and instructions. It is typically dedicated to a single processor core and communicates directly with that core's L1 caches.

But most modern systems are multi-core and need a fast way to share data between cores. That's why CPUs usually have another cache: L3. This cache is larger but slower than L2, and it is typically shared by all the cores.

It serves two main purposes:

  • It allows data sharing between processor cores without accessing main memory.
  • It provides an additional layer in the memory hierarchy.

When both L1 and L2 caches miss, the L3 cache is checked before resorting to main memory.

Cache Level Properties

Some specialized systems add an L4 cache on top of the usual L1, L2, and L3 caches to boost performance even more. Here's a breakdown of the cache levels:

L1 Cache:

  • Size: 16 KB to 128 KB per core
  • Associativity: 2 to 8 ways
  • Latency: A few CPU cycles
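
To make the size and associativity figures a bit more concrete, here is a minimal Python sketch of how a set-associative cache could split an address into a tag, a set index, and a byte offset. The 32 KB size, 8 ways, and 64-byte line are assumed example values (the line size is not given above), and the function names are made up for this illustration.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    num_sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2, valid for powers of two
    index_bits = num_sets.bit_length() - 1
    return num_sets, offset_bits, index_bits


def split_address(addr, size_bytes, ways, line_bytes=64):
    """Split an address into (tag, set index, byte offset within the line)."""
    num_sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Assumed example: a 32 KB, 8-way L1 with 64-byte lines.
print(cache_geometry(32 * 1024, 8))          # (64, 6, 6)
print(split_address(0x12345, 32 * 1024, 8))  # (18, 13, 5)
```

With these assumed numbers the cache has 64 sets, and each set can hold 8 different lines that share the same index bits.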

L2 Cache:

  • Size: 256 KB to 2 MB per core (older machines may have more)
  • Associativity: 4 to 16 ways
  • Latency: Roughly 10 to 20 CPU cycles

L3 Cache:

  • Size: 2 MB to 32 MB, usually shared among cores (some Apple and AMD CPUs exceed this)
  • Associativity: Typically 16 ways
  • Latency: Roughly 30 to 70 CPU cycles
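
As a rough illustration of why the extra levels pay off, the sketch below estimates the average access time of a three-level hierarchy. The latencies and hit rates are assumed round numbers picked for the example, not measurements of any particular CPU, and the function name is hypothetical.

```python
def average_access_time(levels, memory_latency):
    """Estimate the average access time in cycles.

    levels: list of (latency_cycles, hit_rate) tuples, ordered from L1 outward.
    Every access that reaches a level pays its latency; misses move on.
    """
    total = 0.0
    reach_probability = 1.0  # fraction of accesses that get this far
    for latency, hit_rate in levels:
        total += reach_probability * latency
        reach_probability *= 1.0 - hit_rate
    return total + reach_probability * memory_latency


# Assumed values: (latency in cycles, local hit rate) for L1, L2, L3.
hierarchy = [(4, 0.90), (12, 0.50), (40, 0.50)]
with_caches = average_access_time(hierarchy, memory_latency=200)
print(f"average access time: {with_caches:.1f} cycles")  # 12.2 cycles
```

Under these assumptions the average access costs about 12 cycles instead of the 200 cycles every access would take without any cache.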

Cache Inclusion Policies

Cache hierarchies can be categorized by their inclusion policies, which decide whether a data block is stored in just one cache level, copied across multiple levels, or handled in a more flexible manner.

The three main inclusion policies are:

Inclusive

Data stored in a higher-level cache (like L1) is also stored in lower-level caches (like L2 or L3). In this case, L2 includes L1.

Exclusive

A data block can only exist in one cache level at a time. If it's in L1, it won't be in L2 or L3, and vice versa. Here, L2 is exclusive of L1.

Non-Inclusive, Non-Exclusive (NINE)

A hybrid approach. There's no strict rule for duplication. Data may or may not exist across multiple cache levels, depending on the system's design.
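
One way to see the difference between the three policies is to look at what happens when a block moves from L2 into L1, and when a block is later evicted from L2. The sketch below is a toy model under some assumptions: caches are plain Python sets with no capacity limit, the function names are invented for the example, and the inclusive case uses back-invalidation (removing the block from L1 when L2 evicts it), which is a common way to keep L1 a subset of L2.

```python
def fill_l1_from_l2(block, l1, l2, policy):
    """Bring `block` from L2 into L1 according to the inclusion policy."""
    l1.add(block)
    if policy == "exclusive":
        l2.discard(block)  # a block may live in only one level


def evict_from_l2(block, l1, l2, policy):
    """Evict `block` from L2 according to the inclusion policy."""
    l2.discard(block)
    if policy == "inclusive":
        l1.discard(block)  # back-invalidate so L1 stays a subset of L2


def demo(policy):
    """Fill block A into L1, then evict A from L2; return both snapshots."""
    l1, l2 = set(), {"A", "B"}
    fill_l1_from_l2("A", l1, l2, policy)
    after_fill = (sorted(l1), sorted(l2))
    evict_from_l2("A", l1, l2, policy)
    after_evict = (sorted(l1), sorted(l2))
    return after_fill, after_evict


print(demo("inclusive"))  # ((['A'], ['A', 'B']), ([], ['B']))
print(demo("exclusive"))  # ((['A'], ['B']), (['A'], ['B']))
print(demo("nine"))       # ((['A'], ['A', 'B']), (['A'], ['B']))
```

In the NINE case the block is duplicated after the fill but survives in L1 after L2 drops it, which neither strict policy allows.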

In real-world systems, CPU cache hierarchies often combine inclusion policies. For instance, Intel client processors such as Sandy Bridge, Ivy Bridge, and Skylake have an inclusive L3 cache and a non-inclusive, non-exclusive L2 cache.

Read Operation Examples

Each inclusion policy has its benefits and drawbacks, but they all play a role in how data is retrieved or written within the cache hierarchy. Let's look at an example with an inclusive cache hierarchy that includes three levels:

  • A read request always begins at the highest cache level, L1.
  • If the requested address is found in L1, the data is forwarded directly to the processor core.
  • If the address is not found in L1, the search continues in L2.
  • If found in L2, the data is copied to L1 and then sent to the core. This improves hit rate for future accesses.
  • If the address is still not found, the search moves to L3.
  • If L3 contains the data, it is copied to L2, then L1, and finally to the core.
  • If none of the caches contain the data, the request is sent to main memory, and the data is filled into L3, then L2, then L1 before being delivered to the processor.
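
The steps above can be sketched as a small lookup function. This is a simplified model, not how hardware is built: each cache is a Python dict from address to value, capacity and evictions are ignored, and the function name and the returned source label are assumptions made for the example.

```python
def read(address, l1, l2, l3, memory):
    """Return (value, level that supplied it), filling the levels closer to the core."""
    if address in l1:
        return l1[address], "L1"
    if address in l2:
        l1[address] = l2[address]  # copy into L1 for future accesses
        return l1[address], "L2"
    if address in l3:
        l2[address] = l3[address]  # copy into L2, then L1
        l1[address] = l2[address]
        return l1[address], "L3"
    value = memory[address]        # missed everywhere: go to main memory
    l3[address] = value
    l2[address] = value
    l1[address] = value
    return value, "memory"


l1, l2, l3 = {}, {}, {}
memory = {0x40: 7}
print(read(0x40, l1, l2, l3, memory))  # (7, 'memory')
print(read(0x40, l1, l2, l3, memory))  # (7, 'L1')
```

The second read hits in L1 because the first one left a copy in every level on its way back to the core.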

Write Operation Example

When the CPU issues a write request, the cache response depends on the system's write policy. Let’s assume all caches use the same policy.

Write-Through Policy:

Data written to L1 is immediately propagated to L2, L3, and main memory. This keeps all levels synchronized.
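
A minimal sketch of that behavior, assuming each level is a Python dict from address to value and the function name is made up for the example: every cache that holds the block is updated, and main memory is always updated.

```python
def write_through(address, value, caches, memory):
    """Update every cache level that holds the block, and always update memory."""
    for cache in caches:
        if address in cache:
            cache[address] = value
    memory[address] = value


l1, l2, l3 = {0x10: 0}, {0x10: 0}, {0x10: 0}
memory = {0x10: 0}
write_through(0x10, 5, [l1, l2, l3], memory)
print(l1[0x10], l2[0x10], l3[0x10], memory[0x10])  # 5 5 5 5
```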

Write-Back Policy:

Updates to lower levels are delayed. If a data block is modified in L1, it's marked as dirty. The update to L2 (and beyond) only occurs when the block is evicted from L1.

For example, a dirty block evicted from L1 is written to L2, where it is also marked dirty, and waits for its own eviction to move further down.
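
Here is a toy sketch of that dirty-bit behavior. It makes several simplifying assumptions: L1 holds a fixed number of blocks and evicts the oldest one (FIFO), each write replaces a whole block so nothing has to be fetched first, L2 is an unbounded dict, and the class name is hypothetical.

```python
class WriteBackL1:
    """Toy L1 that holds `capacity` blocks and writes dirty victims to L2."""

    def __init__(self, capacity, l2):
        self.capacity = capacity
        self.lines = {}  # address -> [value, dirty]; dicts keep insertion order
        self.l2 = l2     # address -> [value, dirty]

    def write(self, address, value):
        if address not in self.lines and len(self.lines) >= self.capacity:
            self._evict()
        self.lines[address] = [value, True]  # modified, so mark it dirty

    def _evict(self):
        victim = next(iter(self.lines))      # oldest inserted block (FIFO assumption)
        value, dirty = self.lines.pop(victim)
        if dirty:
            self.l2[victim] = [value, True]  # the L2 copy is now dirty too


l2 = {}
l1 = WriteBackL1(capacity=1, l2=l2)
l1.write(0x10, 1)
print(l2)  # {} (the update to L2 is delayed)
l1.write(0x20, 2)
print(l2)  # {16: [1, True]} (0x10 was evicted and written back)
```

L2 only learns about the new value of block 0x10 when L1 has to make room for another block.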

How a write miss is handled depends on the cache’s allocation policy:

Write-Allocate:

On a write miss, the block is first brought into the cache and then updated there.

No Write-Allocate:

The cache is bypassed, and the data is written directly to the next level. If all caches are configured this way, the data goes straight to main memory.
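
The two allocation policies can be contrasted with a short sketch. The assumptions here are that a block is a list of four words, both levels are dicts from block number to block, the write-allocate case is paired with write-back (so the next level is not updated right away), and the names are invented for the example.

```python
BLOCK_WORDS = 4  # assumed block size, in words


def write_miss(address, value, cache, next_level, write_allocate):
    """Handle a write to a word whose block is not present in `cache`."""
    block, word = divmod(address, BLOCK_WORDS)
    if write_allocate:
        cache[block] = list(next_level[block])  # fetch the whole block first
        cache[block][word] = value              # then update it in the cache
    else:
        next_level[block][word] = value         # bypass the cache entirely


cache, memory = {}, {0: [0, 0, 0, 0]}
write_miss(2, 9, cache, memory, write_allocate=True)
print(cache, memory)  # {0: [0, 0, 9, 0]} {0: [0, 0, 0, 0]}

cache, memory = {}, {0: [0, 0, 0, 0]}
write_miss(2, 9, cache, memory, write_allocate=False)
print(cache, memory)  # {} {0: [0, 0, 9, 0]}
```

With write-allocate the cache has to fetch the block because only one word is being written and the rest of the block must still be valid.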

Final Thoughts

Understanding cache hierarchy is key to optimizing CPU performance. From the size and speed of different cache levels to the intricacies of inclusion and write policies, each aspect plays a vital role in how data is accessed and stored. With increasing complexity in modern multi-core processors, mastering cache behavior is more important than ever for system architects, developers, and computer science enthusiasts.
