


BadRAM: Results

Results up to now

Well... what to say except... it works!

It is included by default in some distributions: Mandrake 9.2, Debian, Caldera (I think).

I have run tests for several years, without flaws. Many people around the world have expressed their gratitude, clearly stating that it works for them as well. And everybody thinks this is so cool they want to get their hands on bad memory, just to tease the less fortunate.

I used the very thorough RAM checker Memtest86 to find the erroneous addresses on my chips, but basically any tester would do. Specifically for memtest86, I made an extension that derives badram=... arguments for the LILO command line (go to the configure/printmode menu). Since LILO can also offer memtest86, handling your bad RAM can be done with two reboots and no screwdriver. I've seen operating systems that need more reboots for far less impressive results ;-)
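For illustration only (the addresses below are made up), the patterns come out as pairs of an address and a mask, which go straight onto the kernel command line, for example in /etc/lilo.conf:

  append="badram=0x00a01e90,0xffffdffc,0x07d21388,0xfffffffc"

Each address/mask pair describes a whole set of faulty addresses at once, which keeps the argument short even for chips with hundreds of holes.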
Hey, why is this thing not part of standard distributions?!? Oh well, Debian is going to do it soon, perhaps others will follow later on...

Benchmarks

I performed benchmarks to prove that BadRAM has no influence on system performance. This is indeed the case, as shown on the benchmark page that provides all the information of interest (and probably much more too).

I performed benchmarks to compare a system with 64 MB of flawless memory and 64 MB of faulty memory. The result: Performance differences are negligible. The conclusion: BadRAM forms a good extension to the Linux kernel.

Read about the benchmarks below.

Terms: I shall coin the term hole for a faulty byte in RAM, and refer to such faulty RAM modules as BadRAM. By contrast, classical (hole-free) RAM will be referred to here as OK RAM. Many BadRAMs contain holes all over, spread in a regular pattern, but I have developed a patch that makes Linux run smoothly on such RAMs.
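To give an idea of how such a regular spread of holes can be captured, here is a small standalone C sketch (not the patch itself) of the address/mask idea behind the badram=F,M arguments: a byte address A counts as bad when it agrees with F on every bit position selected by M. The pattern below is made up; it merely shows how a single F,M pair can cover hundreds of holes.

  #include <stdio.h>

  /* One badram=F,M pair: byte address a is faulty when
   * (a & m) == (f & m), i.e. a agrees with f on every bit selected by m. */
  static int is_bad(unsigned long a, unsigned long f, unsigned long m)
  {
      return (a & m) == (f & m);
  }

  int main(void)
  {
      /* Made-up pattern for a 32 MB module: one bad byte every 16 kB
       * throughout its 8 MB - 16 MB range (module-relative addresses). */
      unsigned long f = 0x00800123UL, m = 0x01803fffUL;
      unsigned long a, holes = 0;

      for (a = 0; a < (32UL << 20); a++)   /* scan the whole 32 MB module */
          if (is_bad(a, f, m))
              holes++;
      /* here every hole falls on its own 4 kB page */
      printf("%lu holes, %lu kB lost in 4 kB pages\n", holes, holes * 4);
      return 0;
  }

Running this prints 512 holes, i.e. two numbers suffice to describe a hole in every 16 kB block of an 8 MB region.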

Description of available hardware

My computer used to run with 128 MB of flawless RAM, with a CAS timing of 2. It has a TLB and caches, as does any Pentium-II system. In the new situation, I added two RAM modules of 32 MB each, each with holes, and each with a CAS timing of 3.

The first BadRAM has 512 holes, spread through the 8MB-16MB range of its 32MB. The second BadRAM has 256 holes, spread through the 0MB-8MB range of its 32MB.

The two interesting cases to compare would be:

  1. The OK RAM only,
  2. The BadRAM only.

Correction of Influential Factors

Factors of influence on this measurement are:

  1. The memory size influences buffering and so on,
  2. The different CAS timing for the OK RAM and the BadRAM,
  3. The pages sacrificed because they contain a hole reduce the size of available RAM,
  4. Networking, daemons, the weather and quantum-mechanical non-determinism.

These factors are dealt with as follows:

  1. The usable memory size will be made equal for the two tests (using the LILO boot option mem=...),
  2. The BIOS will be instructed to assume a CAS timing of 3 in all cases; leaving a BadRAM module installed beyond the used region of RAM helps to convince the BIOS that this is the right value,
  3. The amount of flawless memory offered to Linux will be reduced to the amount actually available in the BadRAM case (the 512 + 256 = 768 holes each sacrifice one 4 kB page on i386, so 64MB minus 3MB in this case),
  4. The tests are performed in single user mode (no networking) by root, and the reported figures are the averages of 5 independent measurements.
I hope and expect this accounts for all possible problems.

Note: Why reduce the flawless memory by the pages that are sacrificed from the BadRAMs? Well, the point I intend to demonstrate is that BadRAM performs just as well as normal RAM after the bad pages have been taken out. So, I should compare the 61MB of usable BadRAM with 61MB out of the flawless RAM.
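For instance (the kernel image name below is made up, the size is the one discussed above), the OK RAM boot can be limited to the same 61MB with a lilo.conf entry along these lines:

  image=/boot/vmlinuz-badram
      label=ok61
      append="mem=61M"

so that both boots offer Linux the same amount of usable memory.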

Software: The measurements are performed with lmbench-2alpha10.
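For those who want to repeat this: lmbench of that vintage is driven from its top-level Makefile, roughly as follows (exact targets may vary slightly between versions):

  make results   # run the benchmark suite
  make see       # format the result tables referred to below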

Measurements

The raw results of these measurements (dmesg output and lmbench tables) are available separately; the subsections below discuss the interesting parts.

The following subsections deal with the latency tables in the lmbench make see results. Bandwidths are not discussed, as they are more likely to be influenced by the fact that different RAMs are addressed than by the BadRAM code (which takes no part in them).

The dmesg values reported for memory differ between the two boots.

The first line shows that no pages received a `BadRAM' treatment, and therefore, that no influence of BadRAM routines on runtime performance is possible. Note the difference in data segment for the kernel; no doubt, this is because bad pages are stored in the page tables, even though the memory is never made available.

Processor, Processes

From the make see results, one table of interest is the process(or) table, which comes out almost equal for both measurements. The only notable difference is in the last line, which reads 8K for OK RAM and 9K for BadRAM. What does that mean?

Context Switching

The tables for context switching times, for OK RAM and for BadRAM, again come out very close. The last measurement came out a little lower, but the result was rounded; I have some difficulty believing in more than 3 significant digits for a measurement of 5 minutes. To my utter surprise, BadRAM even seems to improve the other values! I am inclined to attribute that to measurement noise.

Local Communication Latencies

The latency tables for local communication, for OK RAM and for BadRAM, are downright boring: any differences fall below the benchmark's resolution :).

Virtual Memory Latencies

The tables for virtual memory latencies, for OK RAM and for BadRAM, show no signs of worse performance caused by BadRAM either. We are not interested in whether BadRAM performs better than OK RAM, only in whether there is a performance loss when replacing OK RAM with the same amount of usable memory on BadRAM.

Memory Latency

The tables for memory latency, for OK RAM and for BadRAM, show no distinction between OK RAM and BadRAM performance either.

Note: I am unsure what to do with the `check graphs' message.

Conclusion

BadRAM performs just as well as normal RAM after the bad pages have been taken out.

This is as expected. No influence is to be expected, because the bad pages of a BadRAM are never supplied to the kernel allocation routines of Linux. Although the regular occurrence of holes in a RAM leads to increased fragmentation of page ranges, this is not a major problem: most memory is user-space memory, which is allocated page by page anyway, and in user space contiguous memory regions are assembled from single pages through the MMU.
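Conceptually, all the work happens once, at boot time, roughly as in the standalone sketch below (the function name free_page_to_allocator is made up, not the kernel's): every page frame is tested against the badram patterns, and only clean pages are handed to the allocator, so the bad pages simply do not exist as far as the allocator is concerned.

  #include <stdio.h>

  #define PAGE_SIZE 4096UL

  struct badram_pattern { unsigned long addr, mask; };

  /* A 4 kB page contains a faulty byte as soon as its address bits above
   * the page offset agree with the pattern on all masked bits there; the
   * page-offset bits can always be chosen to complete the match.        */
  static int page_is_bad(unsigned long page,
                         const struct badram_pattern *p, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          unsigned long himask = p[i].mask & ~(PAGE_SIZE - 1);
          if (((page ^ p[i].addr) & himask) == 0)
              return 1;
      }
      return 0;
  }

  /* Stand-in for handing a clean page to the kernel's free lists. */
  static void free_page_to_allocator(unsigned long page) { (void)page; }

  int main(void)
  {
      /* Same made-up pattern as before: 512 holes in a 32 MB module. */
      struct badram_pattern pat[] = { { 0x00800123UL, 0x01803fffUL } };
      unsigned long page, freed = 0, skipped = 0;

      for (page = 0; page < (32UL << 20); page += PAGE_SIZE) {
          if (page_is_bad(page, pat, 1)) {
              skipped++;                 /* never reaches the allocator */
          } else {
              free_page_to_allocator(page);
              freed++;
          }
      }
      printf("freed %lu pages, skipped %lu bad pages\n", freed, skipped);
      return 0;
  }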

Later amendments

From version 2.4.0 on, there is not even any code overhead in the BadRAM patch: all the code is put into the __init segment of the kernel, which is flushed out of memory after booting. From this version onward, the only overhead of BadRAM is therefore a little boot time (milliseconds) and kernel size (kilobytes).
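In kernel terms this means, as a sketch rather than a literal quote from the patch, that the boot-parameter handler carries the __init marker, so it lives in a section the kernel throws away once booting is complete:

  #include <linux/init.h>

  /* Handler for the badram= boot option; the __init marker puts it in
   * the init section, which is freed after booting, so it costs no RAM
   * at runtime.  Parsing and page reservation are left out here.       */
  static int __init badram_setup(char *str)
  {
          /* parse the "F,M,F,M,..." pairs and reserve matching pages */
          return 1;
  }
  __setup("badram=", badram_setup);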