Understanding the Principle of Locality

Writing high-performance programs requires developers to understand a concept called locality: the tendency of programs to access data items that are close to other recently accessed items. This notion, widely known as the principle of locality, has a significant impact on the design of performance-critical applications. In this article, I would like to briefly describe the fundamental idea of locality and provide some examples demonstrating how to write and evaluate locality-aware programs in C++.

Fundamental principles of locality

The principle of locality has two fundamental forms: temporal locality and spatial locality.

Temporal locality

Temporal locality refers to the tendency of a program to access the same memory locations repeatedly over a short period of time. In other words, if a program accesses a particular memory location, it is likely to access the same location again in the near future. This principle originates from the observation that programs often exhibit repetitive behavior, such as loops or frequently accessed variables. By exploiting temporal locality, programs can benefit from caching mechanisms that store recently accessed data in fast memory, reducing the time spent retrieving data from slower main memory or storage devices.

For example, consider a loop that iterates over an array to perform calculations. The loop repeatedly accesses elements of the array, exhibiting strong temporal locality. Caching mechanisms, such as CPU caches, take advantage of this repetitive access pattern by storing frequently accessed array elements in cache memory, thereby speeding up subsequent accesses and improving overall performance.
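To make this concrete, here is a minimal sketch (the function scaledSum and its names are mine, purely for illustration): the scalars acc and factor are touched on every iteration, so accesses to them exhibit strong temporal locality and will almost certainly be served from a register or the L1 cache.

#include <cstddef>
#include <vector>

// Illustrative only: acc and factor are reused on every iteration,
// so the accesses to them show strong temporal locality.
double scaledSum(const std::vector<double>& data, double factor)
{
    double acc = 0.0;

    for (std::size_t i = 0u; i < data.size(); ++i)
        acc += data[i] * factor;   // same acc and factor touched each time

    return acc;
}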

Spatial locality

Spatial locality refers to the tendency of a program to access data items that are stored close to each other in memory. When a program accesses a particular memory location, it is likely to access nearby memory locations in the near future. This principle originates from the observation that data elements in memory are often accessed sequentially or in close proximity due to the nature of program execution, such as iterating over arrays or traversing linked data structures.

For example, consider iterating over a two-dimensional array. As the program accesses elements of the array row by row, it exhibits strong spatial locality because nearby elements in memory are accessed sequentially. Spatial locality allows caching mechanisms to prefetch and cache contiguous blocks of memory, reducing the latency of memory accesses and improving overall efficiency.
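As a rough sketch (countPositive is a hypothetical example of mine, not code from this post's benchmark), sequential traversal of a contiguous container exhibits exactly this kind of spatial locality: consecutive elements share cache lines, so most accesses touch data that was already brought in by a previous access.

#include <cstddef>
#include <vector>

// Illustrative only: the elements of a std::vector are contiguous in memory,
// so the stride-1 traversal below has good spatial locality.
long countPositive(const std::vector<int>& values)
{
    long count = 0;

    for (std::size_t i = 0u; i < values.size(); ++i)
        if (values[i] > 0)
            ++count;

    return count;
}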

Programmers should understand the principle of locality because, in general, programs with good locality run faster than programs with poor locality.

Examples

Consider the following function that simply returns the sum of the elements of an input array of size N:

// N is assumed to be a compile-time constant (e.g., constexpr size_t N = 1024;)
auto sumArray(const int arr[N])
{
    int sum = 0;

    for (size_t i = 0u; i < N; ++i)
        sum += arr[i];

    return sum;
}

Let’s evaluate this function to determine whether it has good locality by analyzing the access patterns of its variables. First, consider the sum variable. It is referenced in every loop iteration, so the function sumArray demonstrates good temporal locality with respect to sum. However, there is no spatial locality to speak of, since sum is a scalar variable.

In the case of the array variable arr, its elements are accessed sequentially, one after another, in successive loop iterations. We can therefore infer that sumArray exhibits good spatial locality with respect to arr, as the elements of the array are stored contiguously in memory. However, it has poor temporal locality, since each element is accessed exactly once. Because sumArray shows either good temporal or good spatial locality with respect to each of its variables, we can conclude that the function as a whole has good locality.
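For contrast, imagine a hypothetical variant of sumArray (not part of the original example) that visits the same array with a stride of k instead of stride 1. It still reads every element, but consecutive accesses are k elements apart, so fewer of the bytes brought in by each cache-line fill are used before the line is evicted, and spatial locality degrades as k grows:

// Hypothetical stride-k variant for comparison; as before, N is assumed
// to be a compile-time constant. Spatial locality worsens as k increases.
auto sumArrayStrided(const int arr[N], size_t k)
{
    int sum = 0;

    for (size_t start = 0u; start < k; ++start)
        for (size_t i = start; i < N; i += k)
            sum += arr[i];

    return sum;
}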

Now, let's look at another example. Consider the following functions, which sum the elements of a two-dimensional array using two different access patterns:

// M and N are assumed to be compile-time constants, as before
auto sumArrayByRows(const int arr[M][N])
{
    int sum = 0;

    for (size_t i = 0u; i < M; ++i)
        for (size_t j = 0u; j < N; ++j)
            sum += arr[i][j];

    return sum;
}

auto sumArrayByCols(const int arr[M][N])
{
    int sum = 0;

    for (size_t j = 0u; j < N; ++j)
        for (size_t i = 0u; i < M; ++i)
            sum += arr[i][j];

    return sum;
}

Both functions correctly return the sum of the two-dimensional array; however, they visit its elements in different orders. The first function, sumArrayByRows, traverses the array in row-major order, i.e., the inner loop visits all the elements of a row before moving on to the next row. The second function, sumArrayByCols, traverses the array in column-major order, i.e., the inner loop walks down a column before moving on to the next column.

Since 2D arrays in C/C++ are stored in row-major order (i.e., the memory layout consists of all the values in row 0 first, followed by the values in row 1, and so on), the first function sumArrayByRows enjoys good spatial locality as it accesses the array’s contents row by row. However, by simply changing the iteration pattern, as we can see in the second function sumArrayByCols, we end up with poor spatial locality because the array is accessed column-wise instead of row-wise.
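A quick way to convince yourself of this layout is to check the addresses directly. The following is a minimal, self-contained sketch (the sizes 4 and 3 are arbitrary choices of mine): element (i, j) of an M×N array lives (i * N + j) elements past the start of the array, which is why the row-by-row traversal walks memory one element at a time.

#include <cassert>
#include <cstddef>

int main()
{
    constexpr std::size_t M = 4, N = 3;
    int arr[M][N] = {};

    // Element (i, j) sits (i * N + j) * sizeof(int) bytes past the start of
    // the array: rows are laid out one after another in memory.
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            assert(reinterpret_cast<const char*>(&arr[i][j]) ==
                   reinterpret_cast<const char*>(arr) + (i * N + j) * sizeof(int));

    return 0;
}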

The following are the benchmark results for the two functions, tested with square 2D arrays of sizes 8×8, 32×32, 128×128, 512×512, and 1024×1024. As the data shows, sumArrayByRows, which performs the row-major traversal, clearly outperforms the column-major traversal as the array size increases. Refer to the Appendix section if you would like to replicate the results.

Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 1.77, 2.08, 2.57
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_sumArrayByRows_8x8            0.404 ns        0.404 ns   1000000000
BM_sumArrayByRows_32x32           26.6 ns         26.6 ns     26290586
BM_sumArrayByRows_128x128         1321 ns         1321 ns       528031
BM_sumArrayByRows_512x512        12571 ns        12571 ns        55531
BM_sumArrayByRows_1024x1024      45409 ns        45409 ns        15409
BM_sumArrayByCols_8x8            0.404 ns        0.404 ns   1000000000
BM_sumArrayByCols_32x32            271 ns          271 ns      2586643
BM_sumArrayByCols_128x128         4192 ns         4192 ns       168758
BM_sumArrayByCols_512x512       150941 ns       150941 ns         4632
BM_sumArrayByCols_1024x1024     776507 ns       776507 ns          897

Performance comparison of row-major vs. column-major array traversal

Conclusion

So, how does this work? Why do programs with good temporal and spatial locality run faster? To answer these questions, we need to understand the memory hierarchy that modern computing systems use to organize memory and reduce the number of CPU clock cycles required to access data. In upcoming articles, we will look deeper into this memory hierarchy, in particular cache memories, and explore how to quantify locality in terms of cache hits and misses.

In any case, this article highlights the value of being able to review source code quickly and form a general judgment of a program’s locality, an essential skill for programmers aiming to develop high-performance applications.

Appendix

If you would like to replicate the results, you can build and run the benchmark with the command below. Make sure to link against the Google Benchmark library during the build. The numbers may vary from machine to machine; I ran these benchmarks on my MacBook Air with an M2 chip, compiling as C++17 with Clang 15.0.0 at optimization level -O3:

$ clang++ -std=c++17 -O3 -I <path-to-google-benchmark-include> -L <path-to-google-benchmark-lib> -lbenchmark sum_2darray_benchmark.cpp -o sum_2darray_benchmark
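The benchmark source itself, sum_2darray_benchmark.cpp, is not listed in this post. If you want a starting point, the following is a minimal sketch of how such a harness could look with Google Benchmark; it only covers the 1024×1024 case (the other sizes follow the same pattern), and details such as the static array and the use of benchmark::DoNotOptimize are my choices rather than necessarily those of the original file:

#include <benchmark/benchmark.h>
#include <cstddef>

// Minimal sketch of a possible benchmark harness; only the 1024x1024 case is
// shown, and the real sum_2darray_benchmark.cpp may differ in its details.
constexpr std::size_t M = 1024;
constexpr std::size_t N = 1024;
static int arr[M][N];  // zero-initialized, lives in static storage

static void BM_sumArrayByRows_1024x1024(benchmark::State& state)
{
    for (auto _ : state) {
        int sum = 0;
        for (std::size_t i = 0; i < M; ++i)
            for (std::size_t j = 0; j < N; ++j)
                sum += arr[i][j];
        benchmark::DoNotOptimize(sum);  // keep the compiler from discarding the loop
    }
}

static void BM_sumArrayByCols_1024x1024(benchmark::State& state)
{
    for (auto _ : state) {
        int sum = 0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < M; ++i)
                sum += arr[i][j];
        benchmark::DoNotOptimize(sum);
    }
}

BENCHMARK(BM_sumArrayByRows_1024x1024);
BENCHMARK(BM_sumArrayByCols_1024x1024);
BENCHMARK_MAIN();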
