The cache is a small amount of high-speed memory, usually with a memory cycle time comparable to the time required by the CPU to fetch one instruction. The cache is usually filled from main memory when instructions or data are fetched into the CPU. Often the main memory will supply a wider data word to the cache than the CPU requires, to fill the cache more rapidly. The amount of information which is replaced at one time in the cache is called the line size for the cache. This is normally the width of the data bus between the cache memory and the main memory. A wide line size for the cache means that several instruction or data words are loaded into the cache at one time, providing a kind of prefetching for instructions or data. Since the cache is small, the effectiveness of the cache relies on the following properties of most programs:
  • Spatial locality -- most programs are highly sequential; the next instruction usually comes from the next memory location. Data is usually structured, and data in these structures is normally stored in contiguous memory locations.
  • Temporal locality -- short loops are a common program structure, especially in the innermost sets of nested loops, so the same small set of instructions is used over and over. Similarly, several operations are generally performed on the same data values, or variables.
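As a small illustration of both properties (a hypothetical C fragment, not taken from any particular program), consider a loop that sums the elements of an array: the references to the array sweep through consecutive memory locations (spatial locality), while the few loop instructions and the running total are reused on every iteration (temporal locality).

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static int a[N];
        int sum = 0;

        for (int i = 0; i < N; i++)
            a[i] = i;            /* sequential writes to consecutive locations */

        for (int i = 0; i < N; i++)
            sum += a[i];         /* sequential reads; sum and the loop code are reused */

        printf("sum = %d\n", sum);
        return 0;
    }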
When a cache is used, there must be some way in which the memory controller determines whether the value currently being addressed in memory is available from the cache. There are several ways that this can be accomplished. One possibility is to store both the address and the value from main memory in the cache, with the address stored in a type of memory called associative memory or, more descriptively, content addressable memory.
An associative memory, or content addressable memory, has the property that when a value is presented to the memory, the address of the value is returned if the value is stored in the memory, otherwise an indication that the value is not in the associative memory is returned. All of the comparisons are done simultaneously, so the search is performed very quickly. This type of memory is very expensive, because each memory location must have both a comparator and a storage element. A cache memory can be implemented with a block of associative memory, together with a block of ``ordinary'' memory. The associative memory would hold the address of the data stored in the cache, and the ordinary memory would contain the data at that address. Such a cache memory might be configured as shown in Figure  .
     
CACHE IMPLEMENTED WITH ASSOCIATIVE MEMORY

If the address is not found in the associative memory, then the value is obtained from main memory.
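The lookup performed by such a cache can be sketched in software. The fragment below is only an illustration (the size, the array layout, and the name assoc_lookup are assumptions); it models the associative part as an array of stored addresses and the ordinary part as an array of data words, and declares a hit when a valid entry's stored address matches the presented address. In hardware all of the comparisons would be performed simultaneously, whereas the loop here performs them one at a time.

    #include <stdbool.h>
    #include <stdint.h>

    #define ASSOC_SIZE 8             /* number of cache entries (illustrative) */

    /* The "associative" part: one stored address per entry, plus a valid flag. */
    static uint32_t stored_addr[ASSOC_SIZE];
    static bool     valid[ASSOC_SIZE];

    /* The "ordinary" part: the data word kept for each stored address. */
    static uint16_t stored_data[ASSOC_SIZE];

    /* Present an address to the associative memory.  Returns true on a hit and
       places the cached word in *data; on a miss the value would have to be
       fetched from main memory and loaded into the cache. */
    bool assoc_lookup(uint32_t addr, uint16_t *data)
    {
        for (int i = 0; i < ASSOC_SIZE; i++) {
            if (valid[i] && stored_addr[i] == addr) {
                *data = stored_data[i];
                return true;         /* cache hit */
            }
        }
        return false;                /* cache miss: go to main memory */
    }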
Associative memory is very expensive, because a comparator is required for every word in the memory, to perform all the comparisons in parallel. A cheaper way to implement a cache memory, without using expensive associative memory, is to use direct mapping. Here, part of the memory address (usually the low-order bits of the address) is used to address a word in the cache. This part of the address is called the index. The remaining high-order bits in the address, called the tag, are stored in the cache memory along with the data.
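The index and tag fields of a direct mapped cache are simply bit fields of the memory address, so they can be extracted with shifts and masks. The following sketch uses, purely for illustration, the field widths of the example in the next paragraph (an 18 bit byte address and a cache of 1K two-byte words, giving a 1 bit byte select, a 10 bit index, and a 7 bit tag); the function names are hypothetical.

    #include <stdint.h>

    #define INDEX_BITS 10
    #define INDEX_MASK ((1u << INDEX_BITS) - 1)

    /* Bit 0 selects the byte within the 16-bit word, bits 1-10 are the
       index into the 1K-entry cache, and bits 11-17 are the tag. */

    static inline uint32_t cache_index(uint32_t addr)
    {
        return (addr >> 1) & INDEX_MASK;   /* discard byte-select bit, keep 10 index bits */
    }

    static inline uint32_t cache_tag(uint32_t addr)
    {
        return addr >> (1 + INDEX_BITS);   /* remaining high-order 7 bits */
    }

On each access, the controller reads the cache entry selected by cache_index(addr); the access is a hit only if that entry is valid and its stored tag equals cache_tag(addr).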
For example, if a processor has an 18 bit memory address and a cache of 1K words, each 2 bytes (16 bits) long, and the processor can address single bytes or 2 byte words, we might have the memory address fields and cache organized as in Figure  .
     
DIRECT MAPPED CACHE CONFIGURATION

This was, in fact, the way the cache was organized in the PDP-11/60. In the 11/60, however, there are 4 other bits used to ensure that the data in the cache is valid. 3 of these are parity bits: one for each byte and one for the tag. The parity bits are used to detect a single bit error in the data while it is held in the cache. A fourth bit, called the valid bit, is used to indicate whether or not a given location in the cache is valid. In the PDP-11/60, and in many other processors, the cache is not updated if memory is altered by a device other than the CPU (for example, when a disk stores new data in memory). When such a memory operation occurs at a location whose value is stored in the cache, the valid bit is reset to show that the data is ``stale'' and no longer corresponds to the data in main memory. As well, the valid bit is reset when power is first applied to the processor or when the processor recovers from a power failure, because the data found in the cache at that time will be invalid.
In the PDP-11/60, the data path from memory to cache was the same size (16 bits) as from cache to the CPU. (In the PDP-11/70, a faster machine, the data path from the CPU to the cache was 16 bits, while the path from memory to the cache was 32 bits, which means that the cache had effectively prefetched the next instruction approximately half of the time.) The amount of information (instructions or data) stored with each tag in the cache is called the line size of the cache. (It is usually the same size as the data path from main memory to the cache.) A large line size allows the prefetching of a number of instructions or data words. All items in a line of the cache are replaced simultaneously, however, so a larger block of data is replaced on each cache miss.
The MIPS R2000/R3000 had a built-in cache controller which could control a cache of up to 64K bytes. For a similar 2K word (or 8K byte) cache, the MIPS processor would typically have a cache configuration as shown in Figure  . In practice, the MIPS cache would generally be larger (64K bytes would be typical), with a line size of 1, 2, or 4 words.
     
ONE POSSIBLE MIPS CACHE ORGANIZATION

A characteristic of the direct mapped cache is that a particular memory address can be mapped into only one cache location. Many memory addresses are mapped to the same cache location (in fact, all addresses with the same index field are mapped to the same cache location). Whenever a ``cache miss'' occurs, the cache line is replaced by a new line of information fetched from main memory at an address with the same index but with a different tag.
Note that if the program ``jumps around'' in memory, this cache organization will likely not be effective, because the index range is limited. Also, if both instructions and data are stored in the cache, they may well map into the same area of the cache and cause each other to be replaced very frequently. This could happen, for example, if the code for a matrix operation and the matrix data itself happened to have the same index values.
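To see how two such references collide, note that any two addresses whose index fields are identical compete for the same cache entry, however far apart they are in memory. Using the illustrative field layout from the earlier sketch (a 1 bit byte select and a 10 bit index), the two hypothetical addresses below have different tags but the same index, so each reference would evict the other's line:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t code_addr = 0x04812;   /* say, an instruction inside a loop */
        uint32_t data_addr = 0x1C812;   /* say, a matrix element             */

        /* Both addresses have index 0x009, so they share one cache entry. */
        printf("index of code_addr = 0x%03X\n", (code_addr >> 1) & 0x3FF);
        printf("index of data_addr = 0x%03X\n", (data_addr >> 1) & 0x3FF);
        return 0;
    }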
A more interesting configuration for a cache is the set associative cache, which uses a set associative mapping. In this cache organization, a given memory location can be mapped to more than one cache location. Here, each index corresponds to two or more data words, each with a corresponding tag. A set associative cache with n tag and data fields is called an ``n-way set associative cache''. Usually n = 2^k, for k = 1, 2, or 3, is chosen for a set associative cache (k = 0 corresponds to direct mapping). Such n-way set associative caches allow interesting tradeoff possibilities; cache performance can be improved by increasing the number of ``ways'', or by increasing the line size, for a given total amount of memory. An example of a 2-way set associative cache is shown in Figure  , which shows a cache containing a total of 2K lines, or 1K sets, each set being 2-way associative. (The sets correspond to the rows in the figure.)
     
A SET-ASSOCIATIVE CACHE ORGANIZATION

In a 2-way set associative cache, if one of the two entries at a particular index is empty when a read miss occurs, then that entry is filled. If both entries are filled, then one must be overwritten by the new data. Similarly, in an n-way set associative cache, if all n data and tag fields in a set are filled, then one value in the set must be overwritten, or replaced, in the cache by the new tag and data values. Note that an entire line must be replaced each time. The most common replacement algorithms are:
  • Random -- the location for the value to be replaced is chosen at random from all n of the cache locations at that index position. In a 2-way set associative cache, this can be accomplished with a single modulo 2 random variable obtained, say, from an internal clock.
  • First in, first out (FIFO) -- here the first value stored in the cache, at each index position, is the value to be replaced. For a 2-way set associative cache, this replacement strategy can be implemented by setting a pointer to the previously loaded word each time a new word is stored in the cache; this pointer need only be a single bit. (For set sizes > 2, this algorithm can be implemented with a counter value stored for each ``line'', or index in the cache, and the cache can be filled in a ``round robin'' fashion).
  • Least recently used (LRU) -- here the value which was actually used least recently is replaced, since in general the most recently used value is the one most likely to be required again in the near future. For a 2-way set associative cache, this strategy can be implemented by adding a single ``USED'' bit to each cache location: whenever a value is accessed, the USED bit for the other word in the set is set and the bit for the accessed word is reset. The value to be replaced is then the value whose USED bit is set. For an n-way set associative cache, the strategy can be implemented by storing a modulo n counter with each data word. (It is an interesting exercise to determine exactly what must be done in this case; the required circuitry may become somewhat complex for large n.)
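The USED-bit scheme for a 2-way set associative cache is compact enough to sketch in full. The fragment below is only an illustration (the number of sets, the structure layout, and names such as way_t and lookup are assumptions, and fetch_from_memory stands in for the main memory access); it keeps one USED bit per way, marks the other way on every hit, and on a miss fills either an empty way or the way whose USED bit is set.

    #include <stdbool.h>
    #include <stdint.h>

    #define NSETS 1024               /* number of sets (illustrative) */
    #define WAYS  2                  /* 2-way set associative         */

    /* One way of a set: tag, data word, valid bit, and the USED bit described
       above (USED set means "the other way was referenced more recently",
       i.e. this way is the one to replace). */
    typedef struct {
        uint32_t tag;
        uint16_t data;
        bool     valid;
        bool     used;
    } way_t;

    static way_t cache[NSETS][WAYS];

    /* Mark way 'w' of set 'set' as most recently used: clear its USED bit
       and set the other way's USED bit. */
    static void mark_mru(uint32_t set, int w)
    {
        cache[set][w].used     = false;
        cache[set][1 - w].used = true;
    }

    /* Look up a (tag, set) pair; on a miss, fill an empty way if one exists,
       otherwise the way whose USED bit is set, then mark it most recently used. */
    uint16_t lookup(uint32_t tag, uint32_t set, uint16_t (*fetch_from_memory)(void))
    {
        for (int w = 0; w < WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                mark_mru(set, w);                    /* cache hit */
                return cache[set][w].data;
            }
        }

        /* Cache miss: choose the victim way. */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!cache[set][w].valid) { victim = w; break; }
            if (cache[set][w].used)   { victim = w; }
        }

        cache[set][victim].tag   = tag;
        cache[set][victim].data  = fetch_from_memory();
        cache[set][victim].valid = true;
        mark_mru(set, victim);
        return cache[set][victim].data;
    }

With only two ways, the single bit per way records the full recency order; for larger n, the modulo n counters mentioned above would take its place.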
Cache memories normally allow one of two things to happen when data is written into a memory location for which there is a value stored in cache:
  • Write through cache -- both the cache and main memory are updated at the same time. This may slow down the execution of instructions which write data to memory, because of the relatively longer write time to main memory. Buffering memory writes can help speed them up, however, if writes are relatively infrequent.
  • Write back cache -- here only the cache is updated directly by the CPU; the cache memory controller marks the value so that it can be written back into memory when the word is removed from the cache. This method is used because a memory location is often altered several times while it is still in the cache, and each of those alterations need not be written into main memory. The method is often implemented using an ``ALTERED'' bit in the cache, which is set whenever a cache value is written into by the processor. Only if the ALTERED bit is set is it necessary to write the value back into main memory (i.e., only values which have been altered must be written back into main memory). The value is written back immediately before it is replaced in the cache.
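The ALTERED bit bookkeeping can also be sketched in software. The fragment below is a simplified illustration (a direct mapped line structure is assumed, memory_read and memory_write are stand-ins for the main memory interface, and the old address is reconstructed from the stored tag and index using the earlier 7/10/1 bit layout); a processor write updates only the cache and sets the ALTERED bit, and the old contents are copied back to main memory only when the line is about to be replaced and that bit is set.

    #include <stdbool.h>
    #include <stdint.h>

    #define NLINES 1024

    typedef struct {
        uint32_t tag;
        uint16_t data;
        bool     valid;
        bool     altered;        /* set when the CPU has written this line */
    } line_t;

    static line_t cache[NLINES];

    /* Stand-ins for the main memory interface (illustrative only). */
    static void     memory_write(uint32_t addr, uint16_t value) { (void)addr; (void)value; }
    static uint16_t memory_read(uint32_t addr)                  { (void)addr; return 0;   }

    /* CPU write hit: only the cache is updated; main memory is now out of date. */
    static void cache_write_hit(uint32_t index, uint16_t value)
    {
        cache[index].data    = value;
        cache[index].altered = true;
    }

    /* Replacing a line: write the old contents back only if they were altered. */
    static void cache_replace(uint32_t index, uint32_t new_tag, uint32_t new_addr)
    {
        if (cache[index].valid && cache[index].altered) {
            uint32_t old_addr = (cache[index].tag << 11) | (index << 1);
            memory_write(old_addr, cache[index].data);      /* write back       */
        }
        cache[index].tag     = new_tag;
        cache[index].data    = memory_read(new_addr);       /* fill from memory */
        cache[index].valid   = true;
        cache[index].altered = false;
    }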
The MIPS R2000/3000 processors used the write-through approach, with a buffer for the memory writes. (This was also the approach taken by the VAX-11/780 processor.) In practice, memory writes are less frequent than memory reads; typically, for each memory write, an instruction must be fetched from main memory, and usually two operands are fetched as well. Therefore we might expect about three times as many read operations as write operations. In fact, there are often many more memory read operations than memory write operations.
Figure   shows the behaviour (actually the miss ratio, which is equal to 1 - the hit ratio) of cache memories with various combinations of total cache memory capacity and line size. The results are from simulations of the behaviour of several ``typical'' program mixes. Several interesting things can be seen from these figures: Figure   shows that the miss ratio drops consistently with increasing cache size. Note, also, that increasing the line size is not always effective in increasing the throughput of the processor, even though it does decrease the miss ratio, because of the additional time required to transfer larger lines of data from the main memory to the cache.
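One way to see why a lower miss ratio need not mean higher throughput is a simple effective access time estimate: roughly, effective access time = hit time + miss ratio x miss penalty, where the miss penalty grows with the line size because the whole line must be transferred on every miss. The numbers in the sketch below are made-up assumptions for illustration, not values taken from the figures.

    #include <stdio.h>

    int main(void)
    {
        double t_hit  = 1.0;                 /* cache hit time, in cycles   */
        double t_word = 4.0;                 /* cycles to transfer one word */

        /* Hypothetical miss ratios for the same capacity at two line sizes. */
        double miss_1word = 0.10, miss_4word = 0.06;

        double t_eff_1 = t_hit + miss_1word * (1 * t_word);   /* 1.40 cycles */
        double t_eff_4 = t_hit + miss_4word * (4 * t_word);   /* 1.96 cycles */

        /* The 4-word line has the lower miss ratio but, with these numbers,
           the higher effective access time. */
        printf("1-word line: %.2f cycles, 4-word line: %.2f cycles\n", t_eff_1, t_eff_4);
        return 0;
    }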