diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5014bfa
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,30 @@
+#
+# NOTE! Don't add files that are generated in specific
+# subdirectories here. Add them in the ".gitignore" file
+# in that subdirectory instead.
+#
+# Normal rules
+#
+.*
+*.o
+*.a
+*.s
+*.ko
+*.mod.c
+
+#
+# Top-level generic files
+#
+vmlinux*
+System.map
+Module.symvers
+
+#
+# Generated include files
+#
+include/asm
+include/config
+include/linux/autoconf.h
+include/linux/compile.h
+include/linux/version.h
+
diff --git a/COPYING b/COPYING
index 2a7e338..ca442d3 100644
--- a/COPYING
+++ b/COPYING
@@ -18,7 +18,7 @@
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@@ -321,7 +321,7 @@ the "copyright" line and a pointer to wh
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
- Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Also add information on how to contact you by electronic and paper mail.
diff --git a/CREDITS b/CREDITS
index f553f8c..a347520 100644
--- a/CREDITS
+++ b/CREDITS
@@ -2211,6 +2211,15 @@ D: OV511 driver
S: (address available on request)
S: USA
+N: Ian McDonald
+E: iam4@cs.waikato.ac.nz
+E: imcdnzl@gmail.com
+W: http://wand.net.nz/~iam4
+W: http://imcdnzl.blogspot.com
+D: DCCP, CCID3
+S: Hamilton
+S: New Zealand
+
N: Patrick McHardy
E: kaber@trash.net
P: 1024D/12155E80 B128 7DE6 FF0A C2B2 48BE AB4C C9D4 964E 1215 5E80
@@ -2246,19 +2255,12 @@ S: D-90453 Nuernberg
S: Germany
N: Arnaldo Carvalho de Melo
-E: acme@conectiva.com.br
-E: acme@kernel.org
-E: acme@gnu.org
-W: http://bazar2.conectiva.com.br/~acme
-W: http://advogato.org/person/acme
+E: acme@mandriva.com
+E: acme@ghostprotocols.net
+W: http://oops.ghostprotocols.net:81/blog/
P: 1024D/9224DF01 D5DF E3BB E3C8 BCBB F8AD 841A B6AB 4681 9224 DF01
-D: wanrouter hacking
-D: misc Makefile, Config.in, drivers and network stacks fixes
-D: IPX & LLC network stacks maintainer
-D: Cyclom 2X synchronous card driver
-D: wl3501 PCMCIA wireless card driver
-D: i18n for minicom, net-tools, util-linux, fetchmail, etc
-S: Conectiva S.A.
+D: IPX, LLC, DCCP, cyc2x, wl3501_cs, net/ hacks
+S: Mandriva
S: R. Tocantins, 89 - Cristo Rei
S: 80050-430 - Curitiba - Paraná
S: Brazil
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index f28a24e..433cf5e 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -46,6 +46,8 @@ SubmittingPatches
- procedure to get a source patch included into the kernel tree.
VGA-softcursor.txt
- how to change your VGA cursor from a blinking underscore.
+applying-patches.txt
+ - description of various trees and how to apply their patches.
arm/
- directory with info about Linux on the ARM architecture.
basic_profiling.txt
@@ -275,7 +277,7 @@ tty.txt
unicode.txt
- info on the Unicode character/font mapping used in Linux.
uml/
- - directory with infomation about User Mode Linux.
+ - directory with information about User Mode Linux.
usb/
- directory with info regarding the Universal Serial Bus.
video4linux/
diff --git a/Documentation/Changes b/Documentation/Changes
index 5eaab04..27232be 100644
--- a/Documentation/Changes
+++ b/Documentation/Changes
@@ -237,6 +237,12 @@ udev
udev is a userspace application for populating /dev dynamically with
only entries for devices actually present. udev replaces devfs.
+FUSE
+----
+
+Needs libfuse 2.4.0 or later. Absolute minimum is 2.3.0 but mount
+options 'direct_io' and 'kernel_cache' won't work.
+
Networking
==========
@@ -390,6 +396,10 @@ udev
----
o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html>
+FUSE
+----
+o <http://sourceforge.net/projects/fuse>
+
Networking
**********
diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle
index f25b395..eb7db3c 100644
--- a/Documentation/CodingStyle
+++ b/Documentation/CodingStyle
@@ -236,6 +236,9 @@ ugly), but try to avoid excess. Instead
of the function, telling people what it does, and possibly WHY it does
it.
+When commenting the kernel API functions, please use the kerneldoc format.
+See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
+for details.
Chapter 8: You've made a mess of it
@@ -407,7 +410,26 @@ Kernel messages do not have to be termin
Printing numbers in parentheses (%d) adds no value and should be avoided.
- Chapter 13: References
+ Chapter 13: Allocating memory
+
+The kernel provides the following general purpose memory allocators:
+kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API
+documentation for further information about them.
+
+The preferred form for passing a size of a struct is the following:
+
+ p = kmalloc(sizeof(*p), ...);
+
+The alternative form where struct name is spelled out hurts readability and
+introduces an opportunity for a bug when the pointer variable type is changed
+but the corresponding sizeof that is passed to a memory allocator is not.
+
+Casting the return value which is a void pointer is redundant. The conversion
+from void pointer to any other pointer type is guaranteed by the C programming
+language.
+
+
+ Chapter 14: References
The C Programming Language, Second Edition
by Brian W. Kernighan and Dennis M. Ritchie.
diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index 6ee3cd6..1af0f2d 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -121,7 +121,7 @@ pool's device.
dma_addr_t addr);
This puts memory back into the pool. The pool is what was passed to
-the the pool allocation routine; the cpu and dma addresses are what
+the pool allocation routine; the cpu and dma addresses are what
were returned when that routine allocated the memory being freed.
diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
new file mode 100644
index 0000000..705f6be
--- /dev/null
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -0,0 +1,151 @@
+ DMA with ISA and LPC devices
+ ============================
+
+ Pierre Ossman <drzeus@drzeus.cx>
+
+This document describes how to do DMA transfers using the old ISA DMA
+controller. Even though ISA is more or less dead today the LPC bus
+uses the same DMA system so it will be around for quite some time.
+
+Part I - Headers and dependencies
+---------------------------------
+
+To do ISA style DMA you need to include two headers:
+
+#include <linux/dma-mapping.h>
+#include <asm/dma.h>
+
+The first is the generic DMA API used to convert virtual addresses to
+physical addresses (see Documentation/DMA-API.txt for details).
+
+The second contains the routines specific to ISA DMA transfers. Since
+this is not present on all platforms make sure you construct your
+Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
+to build your driver on unsupported platforms.
+
+Part II - Buffer allocation
+---------------------------
+
+The ISA DMA controller has some very strict requirements on which
+memory it can access so extra care must be taken when allocating
+buffers.
+
+(You usually need a special buffer for DMA transfers instead of
+transferring directly to and from your normal data structures.)
+
+The DMA-able address space is the lowest 16 MB of _physical_ memory.
+Also the transfer block may not cross page boundaries (which are 64
+or 128 KiB depending on which channel you use).
+
+In order to allocate a piece of memory that satisfies all these
+requirements you pass the flag GFP_DMA to kmalloc.
+
+Unfortunately the memory available for ISA DMA is scarce so unless you
+allocate the memory during boot-up it's a good idea to also pass
+__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder.
+
+(This scarcity also means that you should allocate the buffer as
+early as possible and not release it until the driver is unloaded.)
+
+Part III - Address translation
+------------------------------
+
+To translate the virtual address to a physical one, use the normal DMA
+API. Do _not_ use isa_virt_to_phys() even though it does the same
+thing. The reason for this is that the function isa_virt_to_phys()
+will require a Kconfig dependency to ISA, not just ISA_DMA_API which
+is really all you need. Remember that even though the DMA controller
+has its origins in ISA it is used elsewhere.
+
+Note: x86_64 had a broken DMA API when it came to ISA but has since
+been fixed. If your arch has problems then fix the DMA API instead of
+reverting to the ISA functions.
+
+Part IV - Channels
+------------------
+
+A normal ISA DMA controller has 8 channels. The lower four are for
+8-bit transfers and the upper four are for 16-bit transfers.
+
+(Actually the DMA controller is really two separate controllers where
+channel 4 is used to give DMA access for the second controller (0-3).
+This means that of the four 16-bit channels only three are usable.)
+
+You allocate these in a similar fashion as all basic resources:
+
+extern int request_dma(unsigned int dmanr, const char * device_id);
+extern void free_dma(unsigned int dmanr);
+
+The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
+driver author but depends on what the hardware supports. Check your
+specs or test different channels.
+
+Part V - Transfer data
+----------------------
+
+Now for the good stuff, the actual DMA transfer. :)
+
+Before you use any ISA DMA routines you need to claim the DMA lock
+using claim_dma_lock(). The reason is that some DMA operations are
+not atomic so only one driver may fiddle with the registers at a
+time.
+
+The first time you use the DMA controller you should call
+clear_dma_ff(). This clears an internal register in the DMA
+controller that is used for the non-atomic operations. As long as you
+(and everyone else) use the locking functions then you only need to
+reset this once.
+
+Next, you tell the controller in which direction you intend to do the
+transfer using set_dma_mode(). Currently you have the options
+DMA_MODE_READ and DMA_MODE_WRITE.
+
+Set the address from where the transfer should start (this needs to
+be 16-bit aligned for 16-bit transfers) and how many bytes to
+transfer. Note that it's _bytes_. The DMA routines will do all the
+required translation to values that the DMA controller understands.
+
+The final step is enabling the DMA channel and releasing the DMA
+lock.
+
+Once the DMA transfer is finished (or timed out) you should disable
+the channel again. You should also check get_dma_residue() to make
+sure that all data has been transferred.
+
+Example:
+
+unsigned long flags;
+int residue;
+
+flags = claim_dma_lock();
+clear_dma_ff(channel);
+
+set_dma_mode(channel, DMA_MODE_WRITE);
+set_dma_addr(channel, phys_addr);
+set_dma_count(channel, num_bytes);
+
+enable_dma(channel);
+
+release_dma_lock(flags);
+
+while (!device_done());
+
+flags = claim_dma_lock();
+
+disable_dma(channel);
+
+residue = get_dma_residue(channel);
+if (residue != 0)
+ printk(KERN_ERR "driver: Incomplete DMA transfer!"
+ " %d bytes left!\n", residue);
+
+release_dma_lock(flags);
+
+Part VI - Suspend/resume
+------------------------
+
+It is the driver's responsibility to make sure that the machine isn't
+suspended while a DMA transfer is in progress. Also, all DMA settings
+are lost when the system suspends so if your driver relies on the DMA
+controller being in a certain state then you have to restore these
+registers upon resume.
diff --git a/Documentation/DocBook/journal-api.tmpl b/Documentation/DocBook/journal-api.tmpl
index 1ef6f43..341aaa4 100644
--- a/Documentation/DocBook/journal-api.tmpl
+++ b/Documentation/DocBook/journal-api.tmpl
@@ -116,7 +116,7 @@ filesystem. Almost.
You still need to actually journal your filesystem changes, this
is done by wrapping them into transactions. Additionally you
-also need to wrap the modification of each of the the buffers
+also need to wrap the modification of each of the buffers
with calls to the journal layer, so it knows what the modifications
you are actually making are. To do this use journal_start() which
returns a transaction handle.
@@ -128,7 +128,7 @@ and its counterpart journal_stop(), whic
are nestable calls, so you can reenter a transaction if necessary,
but remember you must call journal_stop() the same number of times as
journal_start() before the transaction is completed (or more accurately
-leaves the the update phase). Ext3/VFS makes use of this feature to simplify
+leaves the update phase). Ext3/VFS makes use of this feature to simplify
quota support.
</para>
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index 49a9ef8..582032e 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -8,8 +8,7 @@
<authorgroup>
<author>
- <firstname>Paul</firstname>
- <othername>Rusty</othername>
+ <firstname>Rusty</firstname>
<surname>Russell</surname>
<affiliation>
<address>
@@ -20,7 +19,7 @@
</authorgroup>
<copyright>
- <year>2001</year>
+ <year>2005</year>
<holder>Rusty Russell</holder>
</copyright>
@@ -64,7 +63,7 @@
<chapter id="introduction">
<title>Introduction</title>
<para>
- Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
+ Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and
general requirements for kernel code: its goal is to serve as a
primer for Linux kernel development for experienced C
@@ -96,13 +95,13 @@
<listitem>
<para>
- not associated with any process, serving a softirq, tasklet or bh;
+ not associated with any process, serving a softirq or tasklet;
</para>
</listitem>
<listitem>
<para>
- running in kernel space, associated with a process;
+ running in kernel space, associated with a process (user context);
</para>
</listitem>
@@ -114,11 +113,12 @@
</itemizedlist>
<para>
- There is a strict ordering between these: other than the last
- category (userspace) each can only be pre-empted by those above.
- For example, while a softirq is running on a CPU, no other
- softirq will pre-empt it, but a hardware interrupt can. However,
- any other CPUs in the system execute independently.
+ There is an ordering between these. The bottom two can preempt
+ each other, but above that is a strict hierarchy: each can only be
+ preempted by the ones above it. For example, while a softirq is
+ running on a CPU, no other softirq will preempt it, but a hardware
+ interrupt can. However, any other CPUs in the system execute
+ independently.
</para>
<para>
@@ -130,10 +130,10 @@
<title>User Context</title>
<para>
- User context is when you are coming in from a system call or
- other trap: you can sleep, and you own the CPU (except for
- interrupts) until you call <function>schedule()</function>.
- In other words, user context (unlike userspace) is not pre-emptable.
+ User context is when you are coming in from a system call or other
+ trap: like userspace, you can be preempted by more important tasks
+ and by interrupts. You can sleep, by calling
+ <function>schedule()</function>.
</para>
<note>
@@ -153,7 +153,7 @@
<caution>
<para>
- Beware that if you have interrupts or bottom halves disabled
+ Beware that if you have preemption or softirqs disabled
(see below), <function>in_interrupt()</function> will return a
false positive.
</para>
@@ -168,10 +168,10 @@
<hardware>keyboard</hardware> are examples of real
hardware which produce interrupts at any time. The kernel runs
interrupt handlers, which services the hardware. The kernel
- guarantees that this handler is never re-entered: if another
+ guarantees that this handler is never re-entered: if the same
interrupt arrives, it is queued (or dropped). Because it
disables interrupts, this handler has to be fast: frequently it
- simply acknowledges the interrupt, marks a `software interrupt'
+ simply acknowledges the interrupt, marks a 'software interrupt'
for execution and exits.
</para>
@@ -188,60 +188,52 @@
</sect1>
<sect1 id="basics-softirqs">
- <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>
+ <title>Software Interrupt Context: Softirqs and Tasklets</title>
<para>
Whenever a system call is about to return to userspace, or a
- hardware interrupt handler exits, any `software interrupts'
+ hardware interrupt handler exits, any 'software interrupts'
which are marked pending (usually by hardware interrupts) are
run (<filename>kernel/softirq.c</filename>).
</para>
<para>
Much of the real interrupt handling work is done here. Early in
- the transition to <acronym>SMP</acronym>, there were only `bottom
+ the transition to <acronym>SMP</acronym>, there were only 'bottom
halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
after we switched from wind-up computers made of match-sticks and snot,
- we abandoned this limitation.
+ we abandoned this limitation and switched to 'softirqs'.
</para>
<para>
<filename class="headerfile">include/linux/interrupt.h</filename> lists the
- different BH's. No matter how many CPUs you have, no two BHs will run at
- the same time. This made the transition to SMP simpler, but sucks hard for
- scalable performance. A very important bottom half is the timer
- BH (<filename class="headerfile">include/linux/timer.h</filename>): you
- can register to have it call functions for you in a given length of time.
+ different softirqs. A very important softirq is the
+ timer softirq (<filename
+ class="headerfile">include/linux/timer.h</filename>): you can
+ register to have it call functions for you in a given length of
+ time.
</para>
<para>
- 2.3.43 introduced softirqs, and re-implemented the (now
- deprecated) BHs underneath them. Softirqs are fully-SMP
- versions of BHs: they can run on as many CPUs at once as
- required. This means they need to deal with any races in shared
- data using their own locks. A bitmask is used to keep track of
- which are enabled, so the 32 available softirqs should not be
- used up lightly. (<emphasis>Yes</emphasis>, people will
- notice).
- </para>
-
- <para>
- tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>)
- are like softirqs, except they are dynamically-registrable (meaning you
- can have as many as you want), and they also guarantee that any tasklet
- will only run on one CPU at any time, although different tasklets can
- run simultaneously (unlike different BHs).
+ Softirqs are often a pain to deal with, since the same softirq
+ will run simultaneously on more than one CPU. For this reason,
+ tasklets (<filename
+ class="headerfile">include/linux/interrupt.h</filename>) are more
+ often used: they are dynamically-registrable (meaning you can have
+ as many as you want), and they also guarantee that any tasklet
+ will only run on one CPU at any time, although different tasklets
+ can run simultaneously.
</para>
<caution>
<para>
- The name `tasklet' is misleading: they have nothing to do with `tasks',
+ The name 'tasklet' is misleading: they have nothing to do with 'tasks',
and probably more to do with some bad vodka Alexey Kuznetsov had at the
time.
</para>
</caution>
<para>
- You can tell you are in a softirq (or bottom half, or tasklet)
+ You can tell you are in a softirq (or tasklet)
using the <function>in_softirq()</function> macro
(<filename class="headerfile">include/linux/interrupt.h</filename>).
</para>
@@ -288,11 +280,10 @@
<term>A rigid stack limit</term>
<listitem>
<para>
- The kernel stack is about 6K in 2.2 (for most
- architectures: it's about 14K on the Alpha), and shared
- with interrupts so you can't use it all. Avoid deep
- recursion and huge local arrays on the stack (allocate
- them dynamically instead).
+ Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
+ about 14K on most 64-bit archs, and often shared with interrupts
+ so you can't use it all. Avoid deep recursion and huge local
+ arrays on the stack (allocate them dynamically instead).
</para>
</listitem>
</varlistentry>
@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
<para>
If all your routine does is read or write some parameter, consider
- implementing a <function>sysctl</function> interface instead.
+ implementing a <function>sysfs</function> interface instead.
</para>
<para>
@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
</para>
<para>
- You will eventually lock up your box if you break these rules.
+ You should always compile your kernel with
+ <symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn
+ you if you break these rules. If you <emphasis>do</emphasis> break
+ the rules, you will eventually lock up your box.
</para>
<para>
@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
success).
</para>
</caution>
- [Yes, this moronic interface makes me cringe. Please submit a
- patch and become my hero --RR.]
+ [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
</para>
<para>
The functions may sleep implicitly. This should never be called
@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
</variablelist>
<para>
- If you see a <errorname>kmem_grow: Called nonatomically from int
- </errorname> warning message you called a memory allocation function
- from interrupt context without <constant>GFP_ATOMIC</constant>.
- You should really fix that. Run, don't walk.
+ If you see a <errorname>sleeping function called from invalid
+ context</errorname> warning message, then maybe you called a
+ sleeping allocation function from interrupt context without
+ <constant>GFP_ATOMIC</constant>. You should really fix that.
+ Run, don't walk.
</para>
<para>
@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
</sect1>
<sect1 id="routines-udelay">
- <title><function>udelay()</function>/<function>mdelay()</function>
+ <title><function>mdelay()</function>/<function>udelay()</function>
<filename class="headerfile">include/asm/delay.h</filename>
<filename class="headerfile">include/linux/delay.h</filename>
</title>
<para>
- The <function>udelay()</function> function can be used for small pauses.
- Do not use large values with <function>udelay()</function> as you risk
+ The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses.
+ Do not use large values with them as you risk
overflow - the helper function <function>mdelay()</function> is useful
- here, or even consider <function>schedule_timeout()</function>.
+ here, or consider <function>msleep()</function>.
</para>
</sect1>
@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
These routines disable soft interrupts on the local CPU, and
restore them. They are reentrant; if soft interrupts were
disabled before, they will still be disabled after this pair
- of functions has been called. They prevent softirqs, tasklets
- and bottom halves from running on the current CPU.
+ of functions has been called. They prevent softirqs and tasklets
+ from running on the current CPU.
</para>
</sect1>
@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<filename class="headerfile">include/asm/smp.h</filename></title>
<para>
- <function>smp_processor_id()</function> returns the current
- processor number, between 0 and <symbol>NR_CPUS</symbol> (the
- maximum number of CPUs supported by Linux, currently 32). These
- values are not necessarily continuous.
+ <function>get_cpu()</function> disables preemption (so you won't
+ suddenly get moved to another CPU) and returns the current
+ processor number, between 0 and <symbol>NR_CPUS</symbol>. Note
+ that the CPU numbers are not necessarily contiguous. You return
+ it again with <function>put_cpu()</function> when you are done.
+ </para>
+ <para>
+ If you know you cannot be preempted by another task (ie. you are
+ in interrupt context, or have preemption disabled) you can use
+ <function>smp_processor_id()</function>.
</para>
</sect1>
@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<para>
After boot, the kernel frees up a special section; functions
marked with <type>__init</type> and data structures marked with
- <type>__initdata</type> are dropped after boot is complete (within
- modules this directive is currently ignored). <type>__exit</type>
+ <type>__initdata</type> are dropped after boot is complete: similarly
+ modules discard this memory after initialization. <type>__exit</type>
is used to declare a function which is only required on exit: the
function will be dropped if this file is not compiled as a module.
See the header file for use. Note that it makes no sense for a function
marked with <type>__init</type> to be exported to modules with
<function>EXPORT_SYMBOL()</function> - this will break.
</para>
- <para>
- Static data structures marked as <type>__initdata</type> must be initialised
- (as opposed to ordinary static data which is zeroed BSS) and cannot be
- <type>const</type>.
- </para>
</sect1>
@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<para>
The function can return a negative error number to cause
module loading to fail (unfortunately, this has no effect if
- the module is compiled into the kernel). For modules, this is
- called in user context, with interrupts enabled, and the
- kernel lock held, so it can sleep.
+ the module is compiled into the kernel). This function is
+ called in user context with interrupts enabled, so it can sleep.
</para>
</sect1>
@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
reached zero. This function can also sleep, but cannot fail:
everything must be cleaned up by the time it returns.
</para>
+
+ <para>
+ Note that this macro is optional: if it is not present, your
+ module will not be removable (except for 'rmmod -f').
+ </para>
+ </sect1>
+
+ <sect1 id="routines-module-use-counters">
+ <title> <function>try_module_get()</function>/<function>module_put()</function>
+ <filename class="headerfile">include/linux/module.h</filename></title>
+
+ <para>
+ These manipulate the module usage count, to protect against
+ removal (a module also can't be removed if another module uses one
+ of its exported symbols: see below). Before calling into module
+ code, you should call <function>try_module_get()</function> on
+ that module: if it fails, then the module is being removed and you
+ should act as if it wasn't there. Otherwise, you can safely enter
+ the module, and call <function>module_put()</function> when you're
+ finished.
+ </para>
+
+ <para>
+ Most registerable structures have an
+ <structfield>owner</structfield> field, such as in the
+ <structname>file_operations</structname> structure. Set this field
+ to the macro <symbol>THIS_MODULE</symbol>.
+ </para>
</sect1>
<!-- add info on new-style module refcounting here -->
@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
There is a macro to do this:
<function>wait_event_interruptible()</function>
- <filename class="headerfile">include/linux/sched.h</filename> The
+ <filename class="headerfile">include/linux/wait.h</filename> The
first argument is the wait queue head, and the second is an
expression which is evaluated; the macro returns
<returnvalue>0</returnvalue> when this expression is true, or
@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<para>
Call <function>wake_up()</function>
- <filename class="headerfile">include/linux/sched.h</filename>;,
+ <filename class="headerfile">include/linux/wait.h</filename>,
which will wake up every process in the queue. The exception is
if one has <constant>TASK_EXCLUSIVE</constant> set, in which case
- the remainder of the queue will not be woken.
+ the remainder of the queue will not be woken. There are other variants
+ of this basic function available in the same header.
</para>
</sect1>
</chapter>
@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
first class of operations work on <type>atomic_t</type>
<filename class="headerfile">include/asm/atomic.h</filename>; this
- contains a signed integer (at least 24 bits long), and you must use
+ contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read atomic_t variables.
<function>atomic_read()</function> and
<function>atomic_set()</function> get and set the counter,
@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<para>
Note that these functions are slower than normal arithmetic, and
- so should not be used unnecessarily. On some platforms they
- are much slower, like 32-bit Sparc where they use a spinlock.
+ so should not be used unnecessarily.
</para>
<para>
- The second class of atomic operations is atomic bit operations on a
- <type>long</type>, defined in
+ The second class of atomic operations is atomic bit operations on an
+ <type>unsigned long</type>, defined in
<filename class="headerfile">include/linux/bitops.h</filename>. These
operations generally take a pointer to the bit pattern, and a bit
@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<function>test_and_clear_bit()</function> and
<function>test_and_change_bit()</function> do the same thing,
except return true if the bit was previously set; these are
- particularly useful for very simple locking.
+ particularly useful for atomically setting flags.
</para>
<para>
@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
than BITS_PER_LONG. The resulting behavior is strange on big-endian
platforms though so it is a good idea not to do this.
</para>
-
- <para>
- Note that the order of bits depends on the architecture, and in
- particular, the bitfield passed to these operations must be at
- least as large as a <type>long</type>.
- </para>
</chapter>
<chapter id="symbols">
@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<filename class="headerfile">include/linux/module.h</filename></title>
<para>
- This is the classic method of exporting a symbol, and it works
- for both modules and non-modules. In the kernel all these
- declarations are often bundled into a single file to help
- genksyms (which searches source files for these declarations).
- See the comment on genksyms and Makefiles below.
+ This is the classic method of exporting a symbol: dynamically
+ loaded modules will be able to use the symbol as normal.
</para>
</sect1>
@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can
only be seen by modules with a
<function>MODULE_LICENSE()</function> that specifies a GPL
- compatible license.
+ compatible license. It implies that the function is considered
+ an internal implementation issue, and not really an interface.
</para>
</sect1>
</chapter>
@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
<filename class="headerfile">include/linux/list.h</filename></title>
<para>
- There are three sets of linked-list routines in the kernel
- headers, but this one seems to be winning out (and Linus has
- used it). If you don't have some particular pressing need for
- a single list, it's a good choice. In fact, I don't care
- whether it's a good choice or not, just use it so we can get
- rid of the others.
+ There used to be three sets of linked-list routines in the kernel
+ headers, but this one is the winner. If you don't have some
+ particular pressing need for a single list, it's a good choice.
+ </para>
+
+ <para>
+ In particular, <function>list_for_each_entry</function> is useful.
</para>
</sect1>
@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n",
convention, and return <returnvalue>0</returnvalue> for success,
and a negative error number
(eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be
- unintuitive at first, but it's fairly widespread in the networking
- code, for example.
+ unintuitive at first, but it's fairly widespread in the kernel.
</para>
<para>
- The filesystem code uses <function>ERR_PTR()</function>
+ Using <function>ERR_PTR()</function>
- <filename class="headerfile">include/linux/fs.h</filename>; to
+ <filename class="headerfile">include/linux/err.h</filename>; to
encode a negative error number into a pointer, and
<function>IS_ERR()</function> and <function>PTR_ERR()</function>
to get it back out again: avoids a separate pointer parameter for
@@ -1040,7 +1054,7 @@ static struct block_device_operations op
supported, due to lack of general use, but the following are
considered standard (see the GCC info page section "C
Extensions" for more details - Yes, really the info page, the
- man page is only a short summary of the stuff in info):
+ man page is only a short summary of the stuff in info).
</para>
<itemizedlist>
<listitem>
@@ -1091,7 +1105,7 @@ static struct block_device_operations op
</listitem>
<listitem>
<para>
- Function names as strings (__FUNCTION__)
+ Function names as strings (__FUNCTION__).
</para>
</listitem>
<listitem>
@@ -1164,63 +1178,35 @@ static struct block_device_operations op
<listitem>
<para>
Usually you want a configuration option for your kernel hack.
- Edit <filename>Config.in</filename> in the appropriate directory
- (but under <filename>arch/</filename> it's called
- <filename>config.in</filename>). The Config Language used is not
- bash, even though it looks like bash; the safe way is to use only
- the constructs that you already see in
- <filename>Config.in</filename> files (see
- <filename>Documentation/kbuild/kconfig-language.txt</filename>).
- It's good to run "make xconfig" at least once to test (because
- it's the only one with a static parser).
- </para>
-
- <para>
- Variables which can be Y or N use <type>bool</type> followed by a
- tagline and the config define name (which must start with
- CONFIG_). The <type>tristate</type> function is the same, but
- allows the answer M (which defines
- <symbol>CONFIG_foo_MODULE</symbol> in your source, instead of
- <symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol>
- is enabled.
+ Edit <filename>Kconfig</filename> in the appropriate directory.
+ The Config language is simple to use by cut and paste, and there's
+ complete documentation in
+ <filename>Documentation/kbuild/kconfig-language.txt</filename>.
</para>
<para>
You may well want to make your CONFIG option only visible if
<symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a
warning to users. There many other fancy things you can do: see
- the various <filename>Config.in</filename> files for ideas.
+ the various <filename>Kconfig</filename> files for ideas.
</para>
- </listitem>
- <listitem>
<para>
- Edit the <filename>Makefile</filename>: the CONFIG variables are
- exported here so you can conditionalize compilation with `ifeq'.
- If your file exports symbols then add the names to
- <varname>export-objs</varname> so that genksyms will find them.
- <caution>
- <para>
- There is a restriction on the kernel build system that objects
- which export symbols must have globally unique names.
- If your object does not have a globally unique name then the
- standard fix is to move the
- <function>EXPORT_SYMBOL()</function> statements to their own
- object with a unique name.
- This is why several systems have separate exporting objects,
- usually suffixed with ksyms.
- </para>
- </caution>
+ In your description of the option, make sure you address both the
+ expert user and the user who knows nothing about your feature. Mention
+ incompatibilities and issues here. <emphasis> Definitely
+ </emphasis> end your description with <quote> if in doubt, say N
+ </quote> (or, occasionally, `Y'); this is for people who have no
+ idea what you are talking about.
</para>
</listitem>
<listitem>
<para>
- Document your option in Documentation/Configure.help. Mention
- incompatibilities and issues here. <emphasis> Definitely
- </emphasis> end your description with <quote> if in doubt, say N
- </quote> (or, occasionally, `Y'); this is for people who have no
- idea what you are talking about.
+ Edit the <filename>Makefile</filename>: the CONFIG variables are
+ exported here so you can usually just add an "obj-$(CONFIG_xxx) +=
+ xxx.o" line. The syntax is documented in
+ <filename>Documentation/kbuild/makefiles.txt</filename>.
</para>
</listitem>
@@ -1253,20 +1239,12 @@ static struct block_device_operations op
</para>
<para>
- <filename>include/linux/brlock.h:</filename>
+ <filename>include/asm-i386/delay.h:</filename>
</para>
<programlisting>
-extern inline void br_read_lock (enum brlock_indices idx)
-{
- /*
- * This causes a link-time bug message if an
- * invalid index is used:
- */
- if (idx >= __BR_END)
- __br_lock_usage_bug();
-
- read_lock(&__brlock_array[smp_processor_id()][idx]);
-}
+#define ndelay(n) (__builtin_constant_p(n) ? \
+ ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
+ __ndelay(n))
</programlisting>
<para>
diff --git a/Documentation/DocBook/mcabook.tmpl b/Documentation/DocBook/mcabook.tmpl
index 4367f46..42a760c 100644
--- a/Documentation/DocBook/mcabook.tmpl
+++ b/Documentation/DocBook/mcabook.tmpl
@@ -96,7 +96,7 @@
<chapter id="pubfunctions">
<title>Public Functions Provided</title>
-!Earch/i386/kernel/mca.c
+!Edrivers/mca/mca-legacy.c
</chapter>
<chapter id="dmafunctions">
diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl
index f3ef0bf..705c442 100644
--- a/Documentation/DocBook/usb.tmpl
+++ b/Documentation/DocBook/usb.tmpl
@@ -841,7 +841,7 @@ usbdev_ioctl (int fd, int ifno, unsigned
File modification time is not updated by this request.
</para><para>
Those struct members are from some interface descriptor
- applying to the the current configuration.
+ applying to the current configuration.
The interface number is the bInterfaceNumber value, and
the altsetting number is the bAlternateSetting value.
(This resets each endpoint in the interface.)
diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt
index 84d3d4d..bf1cf98 100644
--- a/Documentation/IPMI.txt
+++ b/Documentation/IPMI.txt
@@ -605,12 +605,13 @@ is in the ipmi_poweroff module. When th
it will send the proper IPMI commands to do this. This is supported on
several platforms.
-There is a module parameter named "poweroff_control" that may either be zero
-(do a power down) or 2 (do a power cycle, power the system off, then power
-it on in a few seconds). Setting ipmi_poweroff.poweroff_control=x will do
-the same thing on the kernel command line. The parameter is also available
-via the proc filesystem in /proc/ipmi/poweroff_control. Note that if the
-system does not support power cycling, it will always to the power off.
+There is a module parameter named "poweroff_powercycle" that may
+either be zero (do a power down) or non-zero (do a power cycle, power
+the system off, then power it on in a few seconds). Setting
+ipmi_poweroff.poweroff_powercycle=x will do the same thing on the kernel
+command line. The parameter is also available via the proc filesystem
+in /proc/sys/dev/ipmi/poweroff_powercycle. Note that if the system
+does not support power cycling, it will always do the power off.
Note that if you have ACPI enabled, the system will prefer using ACPI to
power off.
diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt
index d5032eb..63edc5f 100644
--- a/Documentation/MSI-HOWTO.txt
+++ b/Documentation/MSI-HOWTO.txt
@@ -430,7 +430,7 @@ which may result in system hang. The sof
MSI-capable hardware is responsible for whether calling
pci_enable_msi or not. A return of zero indicates the kernel
successfully initializes the MSI/MSI-X capability structure of the
-device funtion. The device function is now running on MSI/MSI-X mode.
+device function. The device function is now running on MSI/MSI-X mode.
5.6 How to tell whether MSI/MSI-X is enabled on device function
diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt
new file mode 100644
index 0000000..d0634a5
--- /dev/null
+++ b/Documentation/RCU/NMI-RCU.txt
@@ -0,0 +1,112 @@
+Using RCU to Protect Dynamic NMI Handlers
+
+
+Although RCU is usually used to protect read-mostly data structures,
+it is possible to use RCU to provide dynamic non-maskable interrupt
+handlers, as well as dynamic irq handlers. This document describes
+how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
+work in "arch/i386/oprofile/nmi_timer_int.c" and in
+"arch/i386/kernel/traps.c".
+
+The relevant pieces of code are listed below, each followed by a
+brief explanation.
+
+ static int dummy_nmi_callback(struct pt_regs *regs, int cpu)
+ {
+ return 0;
+ }
+
+The dummy_nmi_callback() function is a "dummy" NMI handler that does
+nothing, but returns zero, thus saying that it did nothing, allowing
+the NMI handler to take the default machine-specific action.
+
+ static nmi_callback_t nmi_callback = dummy_nmi_callback;
+
+This nmi_callback variable is a global function pointer to the current
+NMI handler.
+
+ fastcall void do_nmi(struct pt_regs * regs, long error_code)
+ {
+ int cpu;
+
+ nmi_enter();
+
+ cpu = smp_processor_id();
+ ++nmi_count(cpu);
+
+ if (!rcu_dereference(nmi_callback)(regs, cpu))
+ default_do_nmi(regs);
+
+ nmi_exit();
+ }
+
+The do_nmi() function processes each NMI. It first disables preemption
+in the same way that a hardware irq would, then increments the per-CPU
+count of NMIs. It then invokes the NMI handler stored in the nmi_callback
+function pointer. If this handler returns zero, do_nmi() invokes the
+default_do_nmi() function to handle a machine-specific NMI. Finally,
+preemption is restored.
+
+Strictly speaking, rcu_dereference() is not needed, since this code runs
+only on i386, which does not need rcu_dereference() anyway. However,
+it is a good documentation aid, particularly for anyone attempting to
+do something similar on Alpha.
+
+Quick Quiz: Why might the rcu_dereference() be necessary on Alpha,
+ given that the code referenced by the pointer is read-only?
+
+
+Back to the discussion of NMI and RCU...
+
+ void set_nmi_callback(nmi_callback_t callback)
+ {
+ rcu_assign_pointer(nmi_callback, callback);
+ }
+
+The set_nmi_callback() function registers an NMI handler. Note that any
+data that is to be used by the callback must be initialized -before-
+the call to set_nmi_callback(). On architectures that do not order
+writes, the rcu_assign_pointer() ensures that the NMI handler sees the
+initialized values.
+
+ void unset_nmi_callback(void)
+ {
+ rcu_assign_pointer(nmi_callback, dummy_nmi_callback);
+ }
+
+This function unregisters an NMI handler, restoring the original
+dummy_nmi_callback(). However, there may well be an NMI handler
+currently executing on some other CPU. We therefore cannot free
+up any data structures used by the old NMI handler until execution
+of it completes on all other CPUs.
+
+One way to accomplish this is via synchronize_sched(), perhaps as
+follows:
+
+ unset_nmi_callback();
+ synchronize_sched();
+ kfree(my_nmi_data);
+
+This works because synchronize_sched() blocks until all CPUs complete
+any preemption-disabled segments of code that they were executing.
+Since NMI handlers disable preemption, synchronize_sched() is guaranteed
+not to return until all ongoing NMI handlers exit. It is therefore safe
+to free up the handler's data as soon as synchronize_sched() returns.
+
+
+Answer to Quick Quiz
+
+ Why might the rcu_dereference() be necessary on Alpha, given
+ that the code referenced by the pointer is read-only?
+
+ Answer: The caller to set_nmi_callback() might well have
+ initialized some data that is to be used by the
+ new NMI handler. In this case, the rcu_dereference()
+ would be needed, because otherwise a CPU that received
+ an NMI just after the new handler was set might see
+ the pointer to the new NMI handler, but the old
+ pre-initialized version of the handler's data.
+
+ More important, the rcu_dereference() makes it clear
+ to someone reading the code that the pointer is being
+ protected by RCU.
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt
index 9c6d450..fcbcbc3 100644
--- a/Documentation/RCU/RTFP.txt
+++ b/Documentation/RCU/RTFP.txt
@@ -2,7 +2,8 @@ Read the F-ing Papers!
This document describes RCU-related publications, and is followed by
-the corresponding bibtex entries.
+the corresponding bibtex entries. A number of the publications may
+be found at http://www.rdrop.com/users/paulmck/RCU/.
The first thing resembling RCU was published in 1980, when Kung and Lehman
[Kung80] recommended use of a garbage collector to defer destruction
@@ -113,6 +114,10 @@ describing how to make RCU safe for soft
and a paper describing SELinux performance with RCU [JamesMorris04b].
+2005 has seen further adaptation of RCU to realtime use, permitting
+preemption of RCU realtime critical sections [PaulMcKenney05a,
+PaulMcKenney05b].
+
Bibtex Entries
@article{Kung80
@@ -410,3 +415,32 @@ Oregon Health and Sciences University"
\url{http://www.livejournal.com/users/james_morris/2153.html}
[Viewed December 10, 2004]"
}
+
+@unpublished{PaulMcKenney05a
+,Author="Paul E. McKenney"
+,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress"
+,month="May"
+,year="2005"
+,note="Available:
+\url{http://lkml.org/lkml/2005/5/9/185}
+[Viewed May 13, 2005]"
+,annotation="
+ First publication of working lock-based deferred free patches
+ for the CONFIG_PREEMPT_RT environment.
+"
+}
+
+@conference{PaulMcKenney05b
+,Author="Paul E. McKenney and Dipankar Sarma"
+,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware"
+,Booktitle="linux.conf.au 2005"
+,month="April"
+,year="2005"
+,address="Canberra, Australia"
+,note="Available:
+\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf}
+[Viewed May 13, 2005]"
+,annotation="
+ Realtime turns into making RCU yet more realtime friendly.
+"
+}
diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt
index 3bfb84b..aab4a9e 100644
--- a/Documentation/RCU/UP.txt
+++ b/Documentation/RCU/UP.txt
@@ -8,7 +8,7 @@ is that since there is only one CPU, it
wait for anything else to get done, since there are no other CPUs for
anything else to be happening on. Although this approach will -sort- -of-
work a surprising amount of the time, it is a very bad idea in general.
-This document presents two examples that demonstrate exactly how bad an
+This document presents three examples that demonstrate exactly how bad an
idea this is.
@@ -26,6 +26,9 @@ from softirq, the list scan would find i
element B. This situation can greatly decrease the life expectancy of
your kernel.
+This same problem can occur if call_rcu() is invoked from a hardware
+interrupt handler.
+
Example 2: Function-Call Fatality
@@ -44,8 +47,37 @@ its arguments would cause it to fail to
underlying RCU, namely that call_rcu() defers invoking its arguments until
all RCU read-side critical sections currently executing have completed.
-Quick Quiz: why is it -not- legal to invoke synchronize_rcu() in
-this case?
+Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in
+ this case?
+
+
+Example 3: Death by Deadlock
+
+Suppose that call_rcu() is invoked while holding a lock, and that the
+callback function must acquire this same lock. In this case, if
+call_rcu() were to directly invoke the callback, the result would
+be self-deadlock.
+
+In some cases, it would be possible to restructure the code so that
+the call_rcu() is delayed until after the lock is released. However,
+there are cases where this can be quite ugly:
+
+1. If a number of items need to be passed to call_rcu() within
+ the same critical section, then the code would need to create
+ a list of them, then traverse the list once the lock was
+ released.
+
+2. In some cases, the lock will be held across some kernel API,
+ so that delaying the call_rcu() until the lock is released
+ requires that the data item be passed up via a common API.
+ It is far better to guarantee that callbacks are invoked
+ with no locks held than to have to modify such APIs to allow
+ arbitrary data items to be passed back up through them.
+
+If call_rcu() directly invokes the callback, painful locking restrictions
+or API changes would be required.
+
+Quick Quiz #2: What locking restriction must RCU callbacks respect?
Summary
@@ -53,12 +85,35 @@ Summary
Permitting call_rcu() to immediately invoke its arguments or permitting
synchronize_rcu() to immediately return breaks RCU, even on a UP system.
So do not do it! Even on a UP system, the RCU infrastructure -must-
-respect grace periods.
+respect grace periods, and -must- invoke callbacks from a known environment
+in which no locks are held.
-Answer to Quick Quiz
+Answer to Quick Quiz #1:
+ Why is it -not- legal to invoke synchronize_rcu() in this case?
-The calling function is scanning an RCU-protected linked list, and
-is therefore within an RCU read-side critical section. Therefore,
-the called function has been invoked within an RCU read-side critical
-section, and is not permitted to block.
+ Because the calling function is scanning an RCU-protected linked
+ list, and is therefore within an RCU read-side critical section.
+ Therefore, the called function has been invoked within an RCU
+ read-side critical section, and is not permitted to block.
+
+Answer to Quick Quiz #2:
+ What locking restriction must RCU callbacks respect?
+
+ Any lock that is acquired within an RCU callback must be
+ acquired elsewhere using an _irq variant of the spinlock
+ primitive. For example, if "mylock" is acquired by an
+ RCU callback, then a process-context acquisition of this
+ lock must use something like spin_lock_irqsave() to
+ acquire the lock.
+
+ If the process-context code were to simply use spin_lock(),
+ then, since RCU callbacks can be invoked from softirq context,
+ the callback might be called from a softirq that interrupted
+ the process-context critical section. This would result in
+ self-deadlock.
+
+ This restriction might seem gratuitous, since very few RCU
+ callbacks acquire locks directly. However, a great many RCU
+ callbacks do acquire locks -indirectly-, for example, via
+ the kfree() primitive.
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 8f3fb77..e118a7c 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -43,6 +43,10 @@ over a rather long period of time, but i
rcu_read_lock_bh()) in the read-side critical sections,
and are also an excellent aid to readability.
+ As a rough rule of thumb, any dereference of an RCU-protected
+ pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
+ or by the appropriate update-side lock.
+
3. Does the update code tolerate concurrent accesses?
The whole point of RCU is to permit readers to run without
@@ -90,7 +94,11 @@ over a rather long period of time, but i
The rcu_dereference() primitive is used by the various
"_rcu()" list-traversal primitives, such as the
- list_for_each_entry_rcu().
+ list_for_each_entry_rcu(). Note that it is perfectly
+ legal (if redundant) for update-side code to use
+ rcu_dereference() and the "_rcu()" list-traversal
+ primitives. This is particularly useful in code
+ that is common to readers and updaters.
b. If the list macros are being used, the list_add_tail_rcu()
and list_add_rcu() primitives must be used in order
@@ -150,16 +158,9 @@ over a rather long period of time, but i
Use of the _rcu() list-traversal primitives outside of an
RCU read-side critical section causes no harm other than
- a slight performance degradation on Alpha CPUs and some
- confusion on the part of people trying to read the code.
-
- Another way of thinking of this is "If you are holding the
- lock that prevents the data structure from changing, why do
- you also need RCU-based protection?" That said, there may
- well be situations where use of the _rcu() list-traversal
- primitives while the update-side lock is held results in
- simpler and more maintainable code. The jury is still out
- on this question.
+ a slight performance degradation on Alpha CPUs. It can
+ also be quite helpful in reducing code bloat when common
+ code is shared between readers and updaters.
10. Conversely, if you are in an RCU read-side critical section,
you -must- use the "_rcu()" variants of the list macros.
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
index eb44400..6fa0922 100644
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -64,6 +64,54 @@ o I hear that RCU is patented? What is
Of these, one was allowed to lapse by the assignee, and the
others have been contributed to the Linux kernel under GPL.
+o I hear that RCU needs work in order to support realtime kernels?
+
+ Yes, work in progress.
+
o Where can I find more information on RCU?
See the RTFP.txt file in this directory.
+ Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.
+
+o What are all these files in this directory?
+
+
+ NMI-RCU.txt
+
+ Describes how to use RCU to implement dynamic
+ NMI handlers, which can be revectored on the fly,
+ without rebooting.
+
+ RTFP.txt
+
+ List of RCU-related publications and web sites.
+
+ UP.txt
+
+ Discussion of RCU usage in UP kernels.
+
+ arrayRCU.txt
+
+ Describes how to use RCU to protect arrays, with
+ resizeable arrays whose elements reference other
+ data structures being of the most interest.
+
+ checklist.txt
+
+ Lists things to check for when inspecting code that
+ uses RCU.
+
+ listRCU.txt
+
+ Describes how to use RCU to protect linked lists.
+ This is the simplest and most common use of RCU
+ in the Linux kernel.
+
+ rcu.txt
+
+ You are reading it!
+
+ whatisRCU.txt
+
+ Overview of how the RCU implementation works. Along
+ the way, presents a conceptual view of RCU.
diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt
new file mode 100644
index 0000000..a23fee6
--- /dev/null
+++ b/Documentation/RCU/rcuref.txt
@@ -0,0 +1,74 @@
+Refcounter framework for elements of lists/arrays protected by
+RCU.
+
+Refcounting on elements of lists which are protected by traditional
+reader/writer spinlocks or semaphores is straightforward, as in:
+
+1.                                      2.
+add()                                   search_and_reference()
+{                                       {
+    alloc_object                            read_lock(&list_lock);
+    ...                                     search_for_element
+    atomic_set(&el->rc, 1);                 atomic_inc(&el->rc);
+    write_lock(&list_lock);                 ...
+    add_element                             read_unlock(&list_lock);
+    ...                                     ...
+    write_unlock(&list_lock);           }
+}
+
+3.                                      4.
+release_referenced()                    delete()
+{                                       {
+    ...                                     write_lock(&list_lock);
+    atomic_dec(&el->rc, relfunc)            ...
+    ...                                     delete_element
+}                                           write_unlock(&list_lock);
+                                            ...
+                                            if (atomic_dec_and_test(&el->rc))
+                                                kfree(el);
+                                            ...
+                                        }
+
+If this list/array is made lock free using RCU, by changing the
+write_lock in add() and delete() to spin_lock and changing read_lock
+in search_and_reference() to rcu_read_lock(), the atomic_inc in
+search_and_reference() could potentially take a reference to an element
+which has already been deleted from the list/array. rcuref_inc_lf takes
+care of this scenario. search_and_reference() should look as follows:
+
+1.                                      2.
+add()                                   search_and_reference()
+{                                       {
+    alloc_object                            rcu_read_lock();
+    ...                                     search_for_element
+    atomic_set(&el->rc, 1);                 if (rcuref_inc_lf(&el->rc)) {
+    write_lock(&list_lock);                     rcu_read_unlock();
+                                                return FAIL;
+    add_element                             }
+    ...                                     ...
+    write_unlock(&list_lock);               rcu_read_unlock();
+}                                       }
+3.                                      4.
+release_referenced()                    delete()
+{                                       {
+    ...                                     write_lock(&list_lock);
+    rcuref_dec(&el->rc, relfunc)            ...
+    ...                                     delete_element
+}                                           write_unlock(&list_lock);
+                                            ...
+                                            if (rcuref_dec_and_test(&el->rc))
+                                                call_rcu(&el->head, el_free);
+                                            ...
+                                        }
+
+Sometimes, a reference to the element needs to be obtained in the
+update (write) stream. In such cases, rcuref_inc_lf might be overkill
+since the spinlock serialising list updates is held. rcuref_inc
+is to be used in such cases.
+For arches which do not have cmpxchg, the rcuref_inc_lf
+api uses a hashed spinlock implementation, and the same hashed spinlock
+is acquired in all rcuref_xxx primitives to preserve atomicity.
+Note: Use the rcuref_inc api only if you need to use rcuref_inc_lf on the
+refcounter in at least one place. Mixing the rcuref_inc and atomic_xxx apis
+might lead to races. rcuref_inc_lf() must be used in lockfree
+RCU critical sections only.
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
new file mode 100644
index 0000000..354d89c
--- /dev/null
+++ b/Documentation/RCU/whatisRCU.txt
@@ -0,0 +1,902 @@
+What is RCU?
+
+RCU is a synchronization mechanism that was added to the Linux kernel
+during the 2.5 development effort and that is optimized for read-mostly
+situations. Although RCU is actually quite simple once you understand it,
+getting there can sometimes be a challenge. Part of the problem is that
+most of the past descriptions of RCU have been written with the mistaken
+assumption that there is "one true way" to describe RCU. Instead,
+the experience has been that different people must take different paths
+to arrive at an understanding of RCU. This document provides several
+different paths, as follows:
+
+1. RCU OVERVIEW
+2. WHAT IS RCU'S CORE API?
+3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
+4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
+5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
+6. ANALOGY WITH READER-WRITER LOCKING
+7. FULL LIST OF RCU APIs
+8. ANSWERS TO QUICK QUIZZES
+
+People who prefer starting with a conceptual overview should focus on
+Section 1, though most readers will profit by reading this section at
+some point. People who prefer to start with an API that they can then
+experiment with should focus on Section 2. People who prefer to start
+with example uses should focus on Sections 3 and 4. People who need to
+understand the RCU implementation should focus on Section 5, then dive
+into the kernel source code. People who reason best by analogy should
+focus on Section 6. Section 7 serves as an index to the docbook API
+documentation, and Section 8 is the traditional answer key.
+
+So, start with the section that makes the most sense to you and your
+preferred method of learning. If you need to know everything about
+everything, feel free to read the whole thing -- but if you are really
+that type of person, you have perused the source code and will therefore
+never need this document anyway. ;-)
+
+
+1. RCU OVERVIEW
+
+The basic idea behind RCU is to split updates into "removal" and
+"reclamation" phases. The removal phase removes references to data items
+within a data structure (possibly by replacing them with references to
+new versions of these data items), and can run concurrently with readers.
+The reason that it is safe to run the removal phase concurrently with
+readers is that the semantics of modern CPUs guarantee that readers will see
+either the old or the new version of the data structure rather than a
+partially updated reference. The reclamation phase does the work of reclaiming
+(e.g., freeing) the data items removed from the data structure during the
+removal phase. Because reclaiming data items can disrupt any readers
+concurrently referencing those data items, the reclamation phase must
+not start until readers no longer hold references to those data items.
+
+Splitting the update into removal and reclamation phases permits the
+updater to perform the removal phase immediately, and to defer the
+reclamation phase until all readers active during the removal phase have
+completed, either by blocking until they finish or by registering a
+callback that is invoked after they finish. Only readers that are active
+during the removal phase need be considered, because any reader starting
+after the removal phase will be unable to gain a reference to the removed
+data items, and therefore cannot be disrupted by the reclamation phase.
+
+So the typical RCU update sequence goes something like the following:
+
+a. Remove pointers to a data structure, so that subsequent
+ readers cannot gain a reference to it.
+
+b. Wait for all previous readers to complete their RCU read-side
+ critical sections.
+
+c. At this point, there cannot be any readers who hold references
+ to the data structure, so it now may safely be reclaimed
+ (e.g., kfree()d).
+
+Step (b) above is the key idea underlying RCU's deferred destruction.
+The ability to wait until all readers are done allows RCU readers to
+use much lighter-weight synchronization, in some cases, absolutely no
+synchronization at all. In contrast, in more conventional lock-based
+schemes, readers must use heavy-weight synchronization in order to
+prevent an updater from deleting the data structure out from under them.
+This is because lock-based updaters typically update data items in place,
+and must therefore exclude readers. In contrast, RCU-based updaters
+typically take advantage of the fact that writes to single aligned
+pointers are atomic on modern CPUs, allowing atomic insertion, removal,
+and replacement of data items in a linked structure without disrupting
+readers. Concurrent RCU readers can then continue accessing the old
+versions, and can dispense with the atomic operations, memory barriers,
+and communications cache misses that are so expensive on present-day
+SMP computer systems, even in the absence of lock contention.
+
+In the three-step procedure shown above, the updater is performing both
+the removal and the reclamation step, but it is often helpful for an
+entirely different thread to do the reclamation, as is in fact the case
+in the Linux kernel's directory-entry cache (dcache). Even if the same
+thread performs both the update step (step (a) above) and the reclamation
+step (step (c) above), it is often helpful to think of them separately.
+For example, RCU readers and updaters need not communicate at all,
+but RCU provides implicit low-overhead communication between readers
+and reclaimers, namely, in step (b) above.
+
+So how the heck can a reclaimer tell when a reader is done, given
+that readers are not doing any sort of synchronization operations???
+Read on to learn about how RCU's API makes this easy.
+
+
+2. WHAT IS RCU'S CORE API?
+
+The core RCU API is quite small:
+
+a. rcu_read_lock()
+b. rcu_read_unlock()
+c. synchronize_rcu() / call_rcu()
+d. rcu_assign_pointer()
+e. rcu_dereference()
+
+There are many other members of the RCU API, but the rest can be
+expressed in terms of these five, though most implementations instead
+express synchronize_rcu() in terms of the call_rcu() callback API.
+
+The five core RCU APIs are described below; the other 18 will be enumerated
+later. See the kernel docbook documentation for more info, or look directly
+at the function header comments.
+
+rcu_read_lock()
+
+ void rcu_read_lock(void);
+
+ Used by a reader to inform the reclaimer that the reader is
+ entering an RCU read-side critical section. It is illegal
+ to block while in an RCU read-side critical section, though
+ kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side
+ critical sections. Any RCU-protected data structure accessed
+ during an RCU read-side critical section is guaranteed to remain
+ unreclaimed for the full duration of that critical section.
+ Reference counts may be used in conjunction with RCU to maintain
+ longer-term references to data structures.
+
+rcu_read_unlock()
+
+ void rcu_read_unlock(void);
+
+ Used by a reader to inform the reclaimer that the reader is
+ exiting an RCU read-side critical section. Note that RCU
+ read-side critical sections may be nested and/or overlapping.
+
+synchronize_rcu()
+
+ void synchronize_rcu(void);
+
+ Marks the end of updater code and the beginning of reclaimer
+ code. It does this by blocking until all pre-existing RCU
+ read-side critical sections on all CPUs have completed.
+ Note that synchronize_rcu() will -not- necessarily wait for
+ any subsequent RCU read-side critical sections to complete.
+ For example, consider the following sequence of events:
+
+                CPU 0                  CPU 1                 CPU 2
+          ----------------- ------------------------- ---------------
+      1.  rcu_read_lock()
+      2.                    enters synchronize_rcu()
+      3.                                              rcu_read_lock()
+      4.  rcu_read_unlock()
+      5.                     exits synchronize_rcu()
+      6.                                              rcu_read_unlock()
+
+ To reiterate, synchronize_rcu() waits only for ongoing RCU
+ read-side critical sections to complete, not necessarily for
+ any that begin after synchronize_rcu() is invoked.
+
+ Of course, synchronize_rcu() does not necessarily return
+ -immediately- after the last pre-existing RCU read-side critical
+ section completes. For one thing, there might well be scheduling
+ delays. For another thing, many RCU implementations process
+ requests in batches in order to improve efficiencies, which can
+ further delay synchronize_rcu().
+
+ Since synchronize_rcu() is the API that must figure out when
+ readers are done, its implementation is key to RCU. For RCU
+ to be useful in all but the most read-intensive situations,
+ synchronize_rcu()'s overhead must also be quite small.
+
+ The call_rcu() API is a callback form of synchronize_rcu(),
+ and is described in more detail in a later section. Instead of
+ blocking, it registers a function and argument which are invoked
+ after all ongoing RCU read-side critical sections have completed.
+ This callback variant is particularly useful in situations where
+ it is illegal to block.
+
+rcu_assign_pointer()
+
+ typeof(p) rcu_assign_pointer(p, typeof(p) v);
+
+ Yes, rcu_assign_pointer() -is- implemented as a macro, though it
+ would be cool to be able to declare a function in this manner.
+ (Compiler experts will no doubt disagree.)
+
+ The updater uses this function to assign a new value to an
+ RCU-protected pointer, in order to safely communicate the change
+ in value from the updater to the reader. This function returns
+ the new value, and also executes any memory-barrier instructions
+ required for a given CPU architecture.
+
+ Perhaps more important, it serves to document which pointers
+ are protected by RCU. That said, rcu_assign_pointer() is most
+ frequently used indirectly, via the _rcu list-manipulation
+ primitives such as list_add_rcu().
+
+rcu_dereference()
+
+ typeof(p) rcu_dereference(p);
+
+ Like rcu_assign_pointer(), rcu_dereference() must be implemented
+ as a macro.
+
+ The reader uses rcu_dereference() to fetch an RCU-protected
+ pointer; the macro returns a value that may then be safely
+ dereferenced. Note that rcu_dereference() does not actually
+ dereference the pointer; instead, it protects the pointer for
+ later dereferencing. It also executes any needed memory-barrier
+ instructions for a given CPU architecture. Currently, only Alpha
+ needs memory barriers within rcu_dereference() -- on other CPUs,
+ it compiles to nothing, not even a compiler directive.
+
+ Common coding practice uses rcu_dereference() to copy an
+ RCU-protected pointer to a local variable, then dereferences
+ this local variable, for example as follows:
+
+ p = rcu_dereference(head.next);
+ return p->data;
+
+ However, in this case, one could just as easily combine these
+ into one statement:
+
+ return rcu_dereference(head.next)->data;
+
+ If you are going to be fetching multiple fields from the
+ RCU-protected structure, using the local variable is of
+ course preferred. Repeated rcu_dereference() calls look
+ ugly and incur unnecessary overhead on Alpha CPUs.
+
+ Note that the value returned by rcu_dereference() is valid
+ only within the enclosing RCU read-side critical section.
+ For example, the following is -not- legal:
+
+ rcu_read_lock();
+ p = rcu_dereference(head.next);
+ rcu_read_unlock();
+ x = p->address;
+ rcu_read_lock();
+ y = p->data;
+ rcu_read_unlock();
+
+ Holding a reference from one RCU read-side critical section
+ to another is just as illegal as holding a reference from
+ one lock-based critical section to another! Similarly,
+ using a reference outside of the critical section in which
+ it was acquired is just as illegal as doing so with normal
+ locking.
+
+ As with rcu_assign_pointer(), an important function of
+ rcu_dereference() is to document which pointers are protected
+ by RCU. And, again like rcu_assign_pointer(), rcu_dereference()
+ is typically used indirectly, via the _rcu list-manipulation
+ primitives, such as list_for_each_entry_rcu().
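A user-space sketch of the publish/subscribe pairing that rcu_assign_pointer() and rcu_dereference() provide, offered as an illustration rather than as text from the patch. It assumes C11 <stdatomic.h>; publish_foo() and read_a() are hypothetical names, and the acquire load is a conservative stand-in for the dependency ordering that the kernel primitive relies on. It shows only the ordering between initialization and publication, not grace periods or reclamation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
	int b;
};

/* Stand-in for an RCU-protected global pointer (hypothetical). */
static _Atomic(struct foo *) gbl_foo;

/*
 * Updater side: initialize the structure first, then publish it.
 * The release store keeps the initializing stores from being
 * reordered after the pointer store, which is the job that
 * rcu_assign_pointer() does in the kernel.
 */
static void publish_foo(struct foo *fp, int a, int b)
{
	fp->a = a;
	fp->b = b;
	atomic_store_explicit(&gbl_foo, fp, memory_order_release);
}

/*
 * Reader side: fetch the pointer once into a local variable, then
 * dereference the local copy.  Returns -1 if nothing is published.
 */
static int read_a(void)
{
	struct foo *fp = atomic_load_explicit(&gbl_foo, memory_order_acquire);

	return fp ? fp->a : -1;
}
```

A reader that calls read_a() sees either no structure at all or a fully initialized one, never a pointer to a half-initialized structure.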
+
+The following diagram shows how each API communicates among the
+reader, updater, and reclaimer.
+
+
+           rcu_assign_pointer()
+                                 +--------+
+         +---------------------->| reader |---------+
+         |                       +--------+         |
+         |                           |              |
+         |                           |              | Protect:
+         |                           |              | rcu_read_lock()
+         |                           |              | rcu_read_unlock()
+         |        rcu_dereference()  |              |
+    +---------+                      |              |
+    | updater |<---------------------+              |
+    +---------+                                     V
+         |                                    +-----------+
+         +----------------------------------->| reclaimer |
+                                              +-----------+
+           Defer:
+           synchronize_rcu() & call_rcu()
+
+
+The RCU infrastructure observes the time sequence of rcu_read_lock(),
+rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
+order to determine when (1) synchronize_rcu() invocations may return
+to their callers and (2) call_rcu() callbacks may be invoked. Efficient
+implementations of the RCU infrastructure make heavy use of batching in
+order to amortize their overhead over many uses of the corresponding APIs.
+
+There are no fewer than three RCU mechanisms in the Linux kernel; the
+diagram above shows the first one, which is by far the most commonly used.
+The rcu_dereference() and rcu_assign_pointer() primitives are used for
+all three mechanisms, but different defer and protect primitives are
+used as follows:
+
+        Defer                   Protect
+
+a.      synchronize_rcu()       rcu_read_lock() / rcu_read_unlock()
+        call_rcu()
+
+b.      call_rcu_bh()           rcu_read_lock_bh() / rcu_read_unlock_bh()
+
+c.      synchronize_sched()     preempt_disable() / preempt_enable()
+                                local_irq_save() / local_irq_restore()
+                                hardirq enter / hardirq exit
+                                NMI enter / NMI exit
+
+These three mechanisms are used as follows:
+
+a. RCU applied to normal data structures.
+
+b. RCU applied to networking data structures that may be subjected
+ to remote denial-of-service attacks.
+
+c. RCU applied to scheduler and interrupt/NMI-handler tasks.
+
+Again, most uses will be of (a). The (b) and (c) cases are important
+for specialized uses, but are relatively uncommon.
+
+
+3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
+
+This section shows a simple use of the core RCU API to protect a
+global pointer to a dynamically allocated structure. More typical
+uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt.
+
+ struct foo {
+ int a;
+ char b;
+ long c;
+ };
+ DEFINE_SPINLOCK(foo_mutex);
+
+ struct foo *gbl_foo;
+
+ /*
+ * Create a new struct foo that is the same as the one currently
+ * pointed to by gbl_foo, except that field "a" is replaced
+ * with "new_a". Points gbl_foo to the new structure, and
+ * frees up the old structure after a grace period.
+ *
+ * Uses rcu_assign_pointer() to ensure that concurrent readers
+ * see the initialized version of the new structure.
+ *
+ * Uses synchronize_rcu() to ensure that any readers that might
+ * have references to the old structure complete before freeing
+ * the old structure.
+ */
+ void foo_update_a(int new_a)
+ {
+ struct foo *new_fp;
+ struct foo *old_fp;
+
+ new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
+ spin_lock(&foo_mutex);
+ old_fp = gbl_foo;
+ *new_fp = *old_fp;
+ new_fp->a = new_a;
+ rcu_assign_pointer(gbl_foo, new_fp);
+ spin_unlock(&foo_mutex);
+ synchronize_rcu();
+ kfree(old_fp);
+ }
+
+ /*
+ * Return the value of field "a" of the current gbl_foo
+ * structure. Use rcu_read_lock() and rcu_read_unlock()
+ * to ensure that the structure does not get deleted out
+ * from under us, and use rcu_dereference() to ensure that
+ * we see the initialized version of the structure (important
+ * for DEC Alpha and for people reading the code).
+ */
+ int foo_get_a(void)
+ {
+ int retval;
+
+ rcu_read_lock();
+ retval = rcu_dereference(gbl_foo)->a;
+ rcu_read_unlock();
+ return retval;
+ }
+
+So, to sum up:
+
+o Use rcu_read_lock() and rcu_read_unlock() to guard RCU
+ read-side critical sections.
+
+o Within an RCU read-side critical section, use rcu_dereference()
+ to fetch RCU-protected pointers before dereferencing them.
+
+o Use some solid scheme (such as locks or semaphores) to
+ keep concurrent updates from interfering with each other.
+
+o Use rcu_assign_pointer() to update an RCU-protected pointer.
+ This primitive protects concurrent readers from the updater,
+ -not- concurrent updates from each other! You therefore still
+ need to use locking (or something similar) to keep concurrent
+ rcu_assign_pointer() primitives from interfering with each other.
+
+o Use synchronize_rcu() -after- removing a data element from an
+ RCU-protected data structure, but -before- reclaiming/freeing
+ the data element, in order to wait for the completion of all
+ RCU read-side critical sections that might be referencing that
+ data item.
+
+See checklist.txt for additional rules to follow when using RCU.
+
+
+4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
+
+In the example above, foo_update_a() blocks until a grace period elapses.
+This is quite simple, but in some cases one cannot afford to wait so
+long -- there might be other high-priority work to be done.
+
+In such cases, one uses call_rcu() rather than synchronize_rcu().
+The call_rcu() API is as follows:
+
+ void call_rcu(struct rcu_head * head,
+ void (*func)(struct rcu_head *head));
+
+This function invokes func(head) after a grace period has elapsed.
+This invocation might happen from either softirq or process context,
+so the function is not permitted to block. The foo struct needs to
+have an rcu_head structure added, perhaps as follows:
+
+ struct foo {
+ int a;
+ char b;
+ long c;
+ struct rcu_head rcu;
+ };
+
+The foo_update_a() function might then be written as follows:
+
+ /*
+ * Create a new struct foo that is the same as the one currently
+ * pointed to by gbl_foo, except that field "a" is replaced
+ * with "new_a". Points gbl_foo to the new structure, and
+ * frees up the old structure after a grace period.
+ *
+ * Uses rcu_assign_pointer() to ensure that concurrent readers
+ * see the initialized version of the new structure.
+ *
+ * Uses call_rcu() to ensure that any readers that might have
+ * references to the old structure complete before freeing the
+ * old structure.
+ */
+ void foo_update_a(int new_a)
+ {
+ struct foo *new_fp;
+ struct foo *old_fp;
+
+ new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
+ spin_lock(&foo_mutex);
+ old_fp = gbl_foo;
+ *new_fp = *old_fp;
+ new_fp->a = new_a;
+ rcu_assign_pointer(gbl_foo, new_fp);
+ spin_unlock(&foo_mutex);
+ call_rcu(&old_fp->rcu, foo_reclaim);
+ }
+
+The foo_reclaim() function might appear as follows:
+
+ void foo_reclaim(struct rcu_head *rp)
+ {
+ struct foo *fp = container_of(rp, struct foo, rcu);
+
+ kfree(fp);
+ }
+
+The container_of() primitive is a macro that, given a pointer into a
+struct, the type of the struct, and the pointed-to field within the
+struct, returns a pointer to the beginning of the struct.
+
+The use of call_rcu() permits the caller of foo_update_a() to
+immediately regain control, without needing to worry further about the
+old version of the newly updated element. It also clearly shows the
+RCU distinction between updater, namely foo_update_a(), and reclaimer,
+namely foo_reclaim().
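A single-threaded, user-space sketch of the callback flow described above, offered as an illustration rather than as text from the patch. The names toy_call_rcu(), toy_grace_period_elapsed(), toy_cb_list, and toy_reclaimed are hypothetical, and the two-field struct rcu_head is an assumption made for the sketch, not the kernel's definition. The grace period is simulated by an explicit function call; a real implementation has to detect it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed minimal layout, for this sketch only. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* Map a pointer to a member back to its enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo {
	int a;
	char b;
	long c;
	struct rcu_head rcu;
};

static struct rcu_head *toy_cb_list;	/* callbacks awaiting a grace period */
static int toy_reclaimed;		/* number of structures freed so far */

/* Register func(head) for later invocation; never blocks. */
static void toy_call_rcu(struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = toy_cb_list;
	toy_cb_list = head;
}

/* Stand-in for "a grace period has elapsed": run all callbacks. */
static void toy_grace_period_elapsed(void)
{
	struct rcu_head *head = toy_cb_list;

	toy_cb_list = NULL;
	while (head) {
		/* Fetch next first: the callback may free head. */
		struct rcu_head *next = head->next;

		head->func(head);
		head = next;
	}
}

/* Reclaimer: recover the enclosing struct foo and free it. */
static void foo_reclaim(struct rcu_head *rp)
{
	struct foo *fp = container_of(rp, struct foo, rcu);

	free(fp);
	toy_reclaimed++;
}
```

The updater returns as soon as toy_call_rcu() has queued the callback; the free happens later, in foo_reclaim(), once the simulated grace period ends.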
+
+The summary of advice is the same as for the previous section, except
+that we are now using call_rcu() rather than synchronize_rcu():
+
+o Use call_rcu() -after- removing a data element from an
+ RCU-protected data structure in order to register a callback
+ function that will be invoked after the completion of all RCU
+ read-side critical sections that might be referencing that
+ data item.
+
+Again, see checklist.txt for additional rules governing the use of RCU.
+
+
+5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
+
+One of the nice things about RCU is that it has extremely simple "toy"
+implementations that are a good first step towards understanding the
+production-quality implementations in the Linux kernel. This section
+presents two such "toy" implementations of RCU, one that is implemented
+in terms of familiar locking primitives, and another that more closely
+resembles "classic" RCU. Both are way too simple for real-world use,
+lacking both functionality and performance. However, they are useful
+in getting a feel for how RCU works. See kernel/rcupdate.c for a
+production-quality implementation, and see:
+
+ http://www.rdrop.com/users/paulmck/RCU
+
+for papers describing the Linux kernel RCU implementation. The OLS'01
+and OLS'02 papers are a good introduction, and the dissertation provides
+more details on the current implementation.
+
+
+5A. "TOY" IMPLEMENTATION #1: LOCKING
+
+This section presents a "toy" RCU implementation that is based on
+familiar locking primitives. Its overhead makes it a non-starter for
+real-life use, as does its lack of scalability. It is also unsuitable
+for realtime use, since it allows scheduling latency to "bleed" from
+one read-side critical section to another.
+
+However, it is probably the easiest implementation to relate to, so is
+a good starting point.
+
+It is extremely simple:
+
+ static DEFINE_RWLOCK(rcu_gp_mutex);
+
+ void rcu_read_lock(void)
+ {
+ read_lock(&rcu_gp_mutex);
+ }
+
+ void rcu_read_unlock(void)
+ {
+ read_unlock(&rcu_gp_mutex);
+ }
+
+ void synchronize_rcu(void)
+ {
+ write_lock(&rcu_gp_mutex);
+ write_unlock(&rcu_gp_mutex);
+ }
+
+[You can ignore rcu_assign_pointer() and rcu_dereference() without
+missing much. But here they are anyway. And whatever you do, don't
+forget about them when submitting patches making use of RCU!]
+
+ #define rcu_assign_pointer(p, v) ({ \
+ smp_wmb(); \
+ (p) = (v); \
+ })
+
+ #define rcu_dereference(p) ({ \
+ typeof(p) _________p1 = p; \
+ smp_read_barrier_depends(); \
+ (_________p1); \
+ })
+
+
+The rcu_read_lock() and rcu_read_unlock() primitives read-acquire
+and release a global reader-writer lock. The synchronize_rcu()
+primitive write-acquires this same lock, then immediately releases
+it. This means that once synchronize_rcu() exits, all RCU read-side
+critical sections that were in progress before synchronize_rcu() was
+called are guaranteed to have completed -- there is no way that
+synchronize_rcu() would have been able to write-acquire the lock
+otherwise.
+
+It is possible to nest rcu_read_lock(), since reader-writer locks may
+be recursively acquired. Note also that rcu_read_lock() is immune
+from deadlock (an important property of RCU). The reason for this is
+that the only thing that can block rcu_read_lock() is a synchronize_rcu().
+But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
+so there can be no deadlock cycle.
+
+Quick Quiz #1: Why is this argument naive? How could a deadlock
+ occur when using this algorithm in a real-world Linux
+ kernel? How could this deadlock be avoided?
+
+
+5B. "TOY" EXAMPLE #2: CLASSIC RCU
+
+This section presents a "toy" RCU implementation that is based on
+"classic RCU". It is also short on performance (but only for updates) and
+on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
+kernels. The definitions of rcu_dereference() and rcu_assign_pointer()
+are the same as those shown in the preceding section, so they are omitted.
+
+ void rcu_read_lock(void) { }
+
+ void rcu_read_unlock(void) { }
+
+ void synchronize_rcu(void)
+ {
+ int cpu;
+
+ for_each_cpu(cpu)
+ run_on(cpu);
+ }
+
+Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
+This is the great strength of classic RCU in a non-preemptive kernel:
+read-side overhead is precisely zero, at least on non-Alpha CPUs.
+And there is absolutely no way that rcu_read_lock() can possibly
+participate in a deadlock cycle!
+
+The implementation of synchronize_rcu() simply schedules itself on each
+CPU in turn. The run_on() primitive can be implemented straightforwardly
+in terms of the sched_setaffinity() primitive. Of course, a somewhat less
+"toy" implementation would restore the affinity upon completion rather
+than just leaving all tasks running on the last CPU, but when I said
+"toy", I meant -toy-!
+
+So how the heck is this supposed to work???
+
+Remember that it is illegal to block while in an RCU read-side critical
+section. Therefore, if a given CPU executes a context switch, we know
+that it must have completed all preceding RCU read-side critical sections.
+Once -all- CPUs have executed a context switch, then -all- preceding
+RCU read-side critical sections will have completed.
+
+So, suppose that we remove a data item from its structure and then invoke
+synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed
+that there are no RCU read-side critical sections holding a reference
+to that data item, so we can safely reclaim it.
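The context-switch argument can be modeled in a few lines of single-threaded C, offered as an illustration rather than as text from the patch. All names here (NR_CPUS as a fixed constant, ctxt_switches, gp_start(), gp_done(), note_context_switch()) are hypothetical. The model only counts simulated context switches per CPU; it does not schedule anything.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* assumed CPU count for the model */

/* Per-CPU count of context switches observed so far. */
static unsigned long ctxt_switches[NR_CPUS];

/* Counter values recorded when a grace period begins. */
struct gp_snapshot {
	unsigned long seen[NR_CPUS];
};

/* Called whenever the modeled CPU performs a context switch. */
static void note_context_switch(int cpu)
{
	ctxt_switches[cpu]++;
}

/* Begin a grace period: remember where every CPU's counter stood. */
static void gp_start(struct gp_snapshot *snap)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		snap->seen[cpu] = ctxt_switches[cpu];
}

/*
 * The grace period is over once every CPU has switched context at
 * least once since gp_start().  Readers may not block, so such a CPU
 * cannot still be inside a read-side critical section that began
 * before the snapshot was taken.
 */
static bool gp_done(const struct gp_snapshot *snap)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (ctxt_switches[cpu] == snap->seen[cpu])
			return false;
	return true;
}
```

In this model, a grace period that starts with gp_start() stays open until the last of the four CPUs reports a context switch, no matter how many times the others have switched.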
+
+Quick Quiz #2: Give an example where Classic RCU's read-side
+ overhead is -negative-.
+
+Quick Quiz #3: If it is illegal to block in an RCU read-side
+ critical section, what the heck do you do in
+ PREEMPT_RT, where normal spinlocks can block???
+
+
+6. ANALOGY WITH READER-WRITER LOCKING
+
+Although RCU can be used in many different ways, a very common use of
+RCU is analogous to reader-writer locking. The following unified
+diff shows how closely related RCU and reader-writer locking can be.
+
+ @@ -13,15 +14,15 @@
+ struct list_head *lp;
+ struct el *p;
+
+ - read_lock(&listmutex);
+ - list_for_each_entry(p, head, lp) {
+ + rcu_read_lock();
+ + list_for_each_entry_rcu(p, head, lp) {
+ if (p->key == key) {
+ *result = p->data;
+ - read_unlock(&listmutex);
+ + rcu_read_unlock();
+ return 1;
+ }
+ }
+ - read_unlock(&listmutex);
+ + rcu_read_unlock();
+ return 0;
+ }
+
+ @@ -29,15 +30,16 @@
+ {
+ struct el *p;
+
+ - write_lock(&listmutex);
+ + spin_lock(&listmutex);
+ list_for_each_entry(p, head, lp) {
+ if (p->key == key) {
+ list_del(&p->list);
+ - write_unlock(&listmutex);
+ + spin_unlock(&listmutex);
+ + synchronize_rcu();
+ kfree(p);
+ return 1;
+ }
+ }
+ - write_unlock(&listmutex);
+ + spin_unlock(&listmutex);
+ return 0;
+ }
+
+Or, for those who prefer a side-by-side listing:
+
+ struct el {                             struct el {
+   struct list_head list;                  struct list_head list;
+   long key;                               long key;
+   spinlock_t mutex;                       spinlock_t mutex;
+   int data;                               int data;
+   /* Other data fields */                 /* Other data fields */
+ };                                      };
+ rwlock_t listmutex;                     spinlock_t listmutex;
+ struct el head;                         struct el head;
+
+ int search(long key, int *result)       int search(long key, int *result)
+ {                                       {
+   struct list_head *lp;                   struct list_head *lp;
+   struct el *p;                           struct el *p;
+
+   read_lock(&listmutex);                  rcu_read_lock();
+   list_for_each_entry(p, head, lp) {      list_for_each_entry_rcu(p, head, lp) {
+     if (p->key == key) {                    if (p->key == key) {
+       *result = p->data;                      *result = p->data;
+       read_unlock(&listmutex);                rcu_read_unlock();
+       return 1;                               return 1;
+     }                                       }
+   }                                       }
+   read_unlock(&listmutex);                rcu_read_unlock();
+   return 0;                               return 0;
+ }                                       }
+
+ int delete(long key)                    int delete(long key)
+ {                                       {
+   struct el *p;                           struct el *p;
+
+   write_lock(&listmutex);                 spin_lock(&listmutex);
+   list_for_each_entry(p, head, lp) {      list_for_each_entry(p, head, lp) {
+     if (p->key == key) {                    if (p->key == key) {
+       list_del(&p->list);                     list_del(&p->list);
+       write_unlock(&listmutex);               spin_unlock(&listmutex);
+                                               synchronize_rcu();
+       kfree(p);                               kfree(p);
+       return 1;                               return 1;
+     }                                       }
+   }                                       }
+   write_unlock(&listmutex);               spin_unlock(&listmutex);
+   return 0;                               return 0;
+ }                                       }
+
+Either way, the differences are quite small. Read-side locking moves
+to rcu_read_lock() and rcu_read_unlock(), update-side locking moves
+from a reader-writer lock to a simple spinlock, and a synchronize_rcu()
+precedes the kfree().
+
+However, there is one potential catch: the read-side and update-side
+critical sections can now run concurrently. In many cases, this will
+not be a problem, but it is necessary to check carefully regardless.
+For example, if multiple independent list updates must be seen as
+a single atomic update, converting to RCU will require special care.
+
+Also, the presence of synchronize_rcu() means that the RCU version of
+delete() can now block. If this is a problem, there is a callback-based
+mechanism that never blocks, namely call_rcu(), that can be used in
+place of synchronize_rcu().
+
+
+7. FULL LIST OF RCU APIs
+
+The RCU APIs are documented in docbook-format header comments in the
+Linux-kernel source code, but it helps to have a full list of the
+APIs, since there does not appear to be a way to categorize them
+in docbook. Here is the list, by category.
+
+Markers for RCU read-side critical sections:
+
+ rcu_read_lock
+ rcu_read_unlock
+ rcu_read_lock_bh
+ rcu_read_unlock_bh
+
+RCU pointer/list traversal:
+
+ rcu_dereference
+ list_for_each_rcu (to be deprecated in favor of
+ list_for_each_entry_rcu)
+ list_for_each_safe_rcu (deprecated, not used)
+ list_for_each_entry_rcu
+ list_for_each_continue_rcu (to be deprecated in favor of new
+ list_for_each_entry_continue_rcu)
+ hlist_for_each_rcu (to be deprecated in favor of
+ hlist_for_each_entry_rcu)
+ hlist_for_each_entry_rcu
+
+RCU pointer update:
+
+ rcu_assign_pointer
+ list_add_rcu
+ list_add_tail_rcu
+ list_del_rcu
+ list_replace_rcu
+ hlist_del_rcu
+ hlist_add_head_rcu
+
+RCU grace period:
+
+ synchronize_kernel (deprecated)
+ synchronize_net
+ synchronize_sched
+ synchronize_rcu
+ call_rcu
+ call_rcu_bh
+
+See the comment headers in the source code (or the docbook generated
+from them) for more information.
+
+
+8. ANSWERS TO QUICK QUIZZES
+
+Quick Quiz #1: Why is this argument naive? How could a deadlock
+ occur when using this algorithm in a real-world Linux
+ kernel? [Referring to the lock-based "toy" RCU
+ algorithm.]
+
+Answer: Consider the following sequence of events:
+
+ 1. CPU 0 acquires some unrelated lock, call it
+ "problematic_lock".
+
+ 2. CPU 1 enters synchronize_rcu(), write-acquiring
+ rcu_gp_mutex.
+
+ 3. CPU 0 enters rcu_read_lock(), but must wait
+ because CPU 1 holds rcu_gp_mutex.
+
+ 4. CPU 1 is interrupted, and the irq handler
+ attempts to acquire problematic_lock.
+
+ The system is now deadlocked.
+
+ One way to avoid this deadlock is to use an approach like
+ that of CONFIG_PREEMPT_RT, where all normal spinlocks
+ become blocking locks, and all irq handlers execute in
+ the context of special tasks. In this case, in step 4
+ above, the irq handler would block, allowing CPU 1 to
+ release rcu_gp_mutex, avoiding the deadlock.
+
+ Even in the absence of deadlock, this RCU implementation
+ allows latency to "bleed" from readers to other
+ readers through synchronize_rcu(). To see this,
+ consider task A in an RCU read-side critical section
+ (thus read-holding rcu_gp_mutex), task B blocked
+ attempting to write-acquire rcu_gp_mutex, and
+ task C blocked in rcu_read_lock() attempting to
+ read-acquire rcu_gp_mutex. Task A's RCU read-side
+ latency is holding up task C, albeit indirectly via
+ task B.
+
+ Realtime RCU implementations therefore use a counter-based
+ approach where tasks in RCU read-side critical sections
+ cannot be blocked by tasks executing synchronize_rcu().
+
+Quick Quiz #2: Give an example where Classic RCU's read-side
+ overhead is -negative-.
+
+Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT
+ kernel where a routing table is used by process-context
+ code, but can be updated by irq-context code (for example,
+ by an "ICMP REDIRECT" packet). The usual way of handling
+ this would be to have the process-context code disable
+ interrupts while searching the routing table. Use of
+ RCU allows such interrupt-disabling to be dispensed with.
+ Thus, without RCU, you pay the cost of disabling interrupts,
+ and with RCU you don't.
+
+ One can argue that the overhead of RCU in this
+ case is negative with respect to the single-CPU
+ interrupt-disabling approach. Others might argue that
+ the overhead of RCU is merely zero, and that replacing
+ the positive overhead of the interrupt-disabling scheme
+ with the zero-overhead RCU scheme does not constitute
+ negative overhead.
+
+ In real life, of course, things are more complex. But
+ even the theoretical possibility of negative overhead for
+ a synchronization primitive is a bit unexpected. ;-)
+
+Quick Quiz #3: If it is illegal to block in an RCU read-side
+ critical section, what the heck do you do in
+ PREEMPT_RT, where normal spinlocks can block???
+
+Answer: Just as PREEMPT_RT permits preemption of spinlock
+ critical sections, it permits preemption of RCU
+ read-side critical sections. It also permits
+ spinlocks blocking while in RCU read-side critical
+ sections.
+
+ Why the apparent inconsistency? Because it is
+ possible to use priority boosting to keep the RCU
+ grace periods short if need be (for example, if running
+ short of memory). In contrast, if blocking waiting
+ for (say) network reception, there is no way to know
+ what should be boosted. Especially given that the
+ process we need to boost might well be a human being
+ who just went out for a pizza or something. And although
+ a computer-operated cattle prod might arouse serious
+ interest, it might also provoke serious objections.
+ Besides, how does the computer know what pizza parlor
+ the human being went to???
+
+
+ACKNOWLEDGEMENTS
+
+My thanks to the people who helped make this human-readable, including
+Jon Walpole, Josh Triplett, Serge Hallyn, and Suzanne Wood.
+
+
+For more information, see http://www.rdrop.com/users/paulmck/RCU.
diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index 7f43b04..237d54c 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -301,8 +301,84 @@ now, but you can do this to mark interna
point out some special detail about the sign-off.
+12) The canonical patch format
-12) More references for submitting patches
+The canonical patch subject line is:
+
+ Subject: [PATCH 001/123] subsystem: summary phrase
+
+The canonical patch message body contains the following:
+
+ - A "from" line specifying the patch author.
+
+ - An empty line.
+
+ - The body of the explanation, which will be copied to the
+ permanent changelog to describe this patch.
+
+ - The "Signed-off-by:" lines, described above, which will
+ also go in the changelog.
+
+ - A marker line containing simply "---".
+
+ - Any additional comments not suitable for the changelog.
+
+ - The actual patch (diff output).
+
+The Subject line format makes it very easy to sort the emails
+alphabetically by subject line - pretty much any email reader will
+support that - since, with a zero-padded sequence number,
+the numerical and alphabetic sort orders are the same.
+
+The "subsystem" in the email's Subject should identify which
+area or subsystem of the kernel is being patched.
+
+The "summary phrase" in the email's Subject should concisely
+describe the patch which that email contains. The "summary
+phrase" should not be a filename. Do not use the same "summary
+phrase" for every patch in a whole patch series.
+
+Bear in mind that the "summary phrase" of your email becomes
+a globally-unique identifier for that patch. It propagates
+all the way into the git changelog. The "summary phrase" may
+later be used in developer discussions which refer to the patch.
+People will want to google for the "summary phrase" to read
+discussion regarding that patch.
+
+A couple of example Subjects:
+
+ Subject: [patch 2/5] ext2: improve scalability of bitmap searching
+ Subject: [PATCHv2 001/207] x86: fix eflags tracking
+
+The "from" line must be the very first line in the message body,
+and has the form:
+
+ From: Original Author <author@example.com>
+
+The "from" line specifies who will be credited as the author of the
+patch in the permanent changelog. If the "from" line is missing,
+then the "From:" line from the email header will be used to determine
+the patch author in the changelog.
+
+The explanation body will be committed to the permanent source
+changelog, so should make sense to a competent reader who has long
+since forgotten the immediate details of the discussion that might
+have led to this patch.
+
+The "---" marker line serves the essential purpose of marking for patch
+handling tools where the changelog message ends.
+
+One good use for the additional comments after the "---" marker is for
+a diffstat, to show what files have changed, and the number of inserted
+and deleted lines per file. A diffstat is especially useful on bigger
+patches. Other comments relevant only to the moment or the maintainer,
+not suitable for the permanent changelog, should also go here.
+
+See more details on the proper patch format in the following
+references.
+
+
+13) More references for submitting patches
Andrew Morton, "The perfect patch" (tpp).
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
@@ -310,6 +386,14 @@ Andrew Morton, "The perfect patch" (tpp)
Jeff Garzik, "Linux kernel patch submission format."
<http://linux.yyz.us/patch-format.html>
+Greg KH, "How to piss off a kernel subsystem maintainer"
+ <http://www.kroah.com/log/2005/03/31/>
+
+Kernel Documentation/CodingStyle
+ <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
+
+Linus Torvalds's mail on the canonical patch format:
+ <http://lkml.org/lkml/2005/4/7/183>
-----------------------------------
diff --git a/Documentation/acpi-hotkey.txt b/Documentation/acpi-hotkey.txt
index 0acdc80..744f1ae 100644
--- a/Documentation/acpi-hotkey.txt
+++ b/Documentation/acpi-hotkey.txt
@@ -35,4 +35,4 @@ created. Please use command "cat /proc/
to retrieve it.
Note: Use cmdline "acpi_generic_hotkey" to over-ride
-loading any platform specific drivers.
+platform-specific with generic driver.
diff --git a/Documentation/aoe/mkshelf.sh b/Documentation/aoe/mkshelf.sh
index 8bacf9f..3261581 100644
--- a/Documentation/aoe/mkshelf.sh
+++ b/Documentation/aoe/mkshelf.sh
@@ -8,13 +8,15 @@ fi
n_partitions=${n_partitions:-16}
dir=$1
shelf=$2
+nslots=16
+maxslot=`echo $nslots 1 - p | dc`
MAJOR=152
set -e
-minor=`echo 10 \* $shelf \* $n_partitions | bc`
+minor=`echo $nslots \* $shelf \* $n_partitions | bc`
endp=`echo $n_partitions - 1 | bc`
-for slot in `seq 0 9`; do
+for slot in `seq 0 $maxslot`; do
for part in `seq 0 $endp`; do
name=e$shelf.$slot
test "$part" != "0" && name=${name}p$part
diff --git a/Documentation/applying-patches.txt b/Documentation/applying-patches.txt
new file mode 100644
index 0000000..681e426
--- /dev/null
+++ b/Documentation/applying-patches.txt
@@ -0,0 +1,439 @@
+
+ Applying Patches To The Linux Kernel
+ ------------------------------------
+
+ (Written by Jesper Juhl, August 2005)
+
+
+
+A frequently asked question on the Linux Kernel Mailing List is how to apply
+a patch to the kernel or, more specifically, what base kernel a patch for
+one of the many trees/branches should be applied to. Hopefully this document
+will explain this to you.
+
+In addition to explaining how to apply and revert patches, a brief
+description of the different kernel trees (and examples of how to apply
+their specific patches) is also provided.
+
+
+What is a patch?
+---
+ A patch is a small text document containing a delta of changes between two
+different versions of a source tree. Patches are created with the `diff'
+program.
+To correctly apply a patch you need to know what base it was generated from
+and what new version the patch will change the source tree into. These
+should both be present in the patch file metadata or be possible to deduce
+from the filename.
+
+
+How do I apply or revert a patch?
+---
+ You apply a patch with the `patch' program. The patch program reads a diff
+(or patch) file and makes the changes to the source tree described in it.
+
+Patches for the Linux kernel are generated relative to the parent directory
+holding the kernel source dir.
+
+This means that paths to files inside the patch file contain the name of the
+kernel source directories it was generated against (or some other directory
+names like "a/" and "b/").
+Since this is unlikely to match the name of the kernel source dir on your
+local machine (but is often useful info to see what version an otherwise
+unlabeled patch was generated against) you should change into your kernel
+source directory and then strip the first element of the path from filenames
+in the patch file when applying it (the -p1 argument to `patch' does this).
+
+To revert a previously applied patch, use the -R argument to patch.
+So, if you applied a patch like this:
+ patch -p1 < ../patch-x.y.z
+
+You can revert (undo) it like this:
+ patch -R -p1 < ../patch-x.y.z
+
+
+How do I feed a patch/diff file to `patch'?
+---
+ This (as usual with Linux and other UNIX-like operating systems) can be
+done in several different ways.
+In all the examples below I feed the file (in uncompressed form) to patch
+via stdin using the following syntax:
+ patch -p1 < path/to/patch-x.y.z
+
+If you just want to be able to follow the examples below and don't want to
+know of more than one way to use patch, then you can stop reading this
+section here.
+
+Patch can also get the name of the file to use via the -i argument, like
+this:
+ patch -p1 -i path/to/patch-x.y.z
+
+If your patch file is compressed with gzip or bzip2 and you don't want to
+uncompress it before applying it, then you can feed it to patch like this
+instead:
+ zcat path/to/patch-x.y.z.gz | patch -p1
+ bzcat path/to/patch-x.y.z.bz2 | patch -p1
+
+If you wish to uncompress the patch file by hand first before applying it
+(what I assume you've done in the examples below), then you simply run
+gunzip or bunzip2 on the file - like this:
+ gunzip patch-x.y.z.gz
+ bunzip2 patch-x.y.z.bz2
+
+This will leave you with a plain text patch-x.y.z file that you can feed to
+patch via stdin or the -i argument, as you prefer.
+
+A few other nice arguments for patch are -s, which causes patch to be silent
+except for errors (nice for keeping errors from scrolling off the screen
+too fast), and --dry-run, which causes patch to just print a listing of
+what would happen without actually making any changes. Finally, --verbose
+tells patch to print more information about the work being done.
+
+
+Common errors when patching
+---
+ When patch applies a patch file it attempts to verify the sanity of the
+file in different ways.
+Checking that the file looks like a valid patch file and checking that the
+code around the bits being modified matches the context provided in the
+patch are just two of the basic sanity checks patch does.
+
+If patch encounters something that doesn't look quite right it has two
+options. It can either refuse to apply the changes and abort or it can try
+to find a way to make the patch apply with a few minor changes.
+
+One example of something that's not 'quite right' that patch will attempt to
+fix up is if all the context matches, the lines being changed match, but the
+line numbers are different. This can happen, for example, if the patch makes
+a change in the middle of the file but for some reason a few lines have
+been added or removed near the beginning of the file. In that case
+everything looks good; it has just moved up or down a bit, and patch will
+usually adjust the line numbers and apply the patch.
+
+Whenever patch applies a patch that it had to modify a bit to make it fit
+it'll tell you about it by saying the patch applied with 'fuzz'.
+You should be wary of such changes since even though patch probably got it
+right it doesn't /always/ get it right, and the result will sometimes be
+wrong.
+
+When patch encounters a change that it can't fix up with fuzz it rejects it
+outright and leaves a file with a .rej extension (a reject file). You can
+read this file to see exactly what change couldn't be applied, so you can
+go fix it up by hand if you wish.
+
+If you don't have any third party patches applied to your kernel source, but
+only patches from kernel.org and you apply the patches in the correct order,
+and have made no modifications yourself to the source files, then you should
+never see a fuzz or reject message from patch. If you do see such messages
+anyway, then there's a high risk that either your local source tree or the
+patch file is corrupted in some way. In that case you should probably try
+redownloading the patch and if things are still not OK then you'd be advised
+to start with a fresh tree downloaded in full from kernel.org.
+
+Let's look a bit more at some of the messages patch can produce.
+
+If patch stops and presents a "File to patch:" prompt, then patch could not
+find a file to be patched. Most likely you forgot to specify -p1 or you are
+in the wrong directory. Less often, you'll find patches that need to be
+applied with -p0 instead of -p1 (reading the patch file should reveal if
+this is the case - if so, then this is an error by the person who created
+the patch but is not fatal).
+
+If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
+message similar to that, then it means that patch had to adjust the location
+of the change (in this example it needed to move 7 lines from where it
+expected to make the change to make it fit).
+The resulting file may or may not be OK, depending on the reason the file
+was different than expected.
+This often happens if you try to apply a patch that was generated against a
+different kernel version than the one you are trying to patch.
+
+If you get a message like "Hunk #3 FAILED at 2387.", then it means that the
+patch could not be applied correctly and the patch program was unable to
+fuzz its way through. This will generate a .rej file with the change that
+caused the patch to fail and also a .orig file showing you the original
+content that couldn't be changed.
+
+If you get "Reversed (or previously applied) patch detected! Assume -R? [n]"
+then patch detected that the change contained in the patch seems to have
+already been made.
+If you actually did apply this patch previously and you just re-applied it
+in error, then just say [n]o and abort this patch. If you applied this patch
+previously and actually intended to revert it, but forgot to specify -R,
+then you can say [y]es here to make patch revert it for you.
+This can also happen if the creator of the patch reversed the source and
+destination directories when creating the patch, and in that case reverting
+the patch will in fact apply it.
+
+A message similar to "patch: **** unexpected end of file in patch" or "patch
+unexpectedly ends in middle of line" means that patch could make no sense of
+the file you fed to it. Either your download is broken or you tried to feed
+patch a compressed patch file without uncompressing it first.
+
+As I already mentioned above, these errors should never happen if you apply
+a patch from kernel.org to the correct version of an unmodified source tree.
+So if you get these errors with kernel.org patches then you should probably
+assume that either your patch file or your tree is broken and I'd advise you
+to start over with a fresh download of a full kernel tree and the patch you
+wish to apply.
+
+
+Are there any alternatives to `patch'?
+---
+ Yes there are alternatives. You can use the `interdiff' program
+(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
+differences between two patches and then apply the result.
+This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
+step. The -z flag to interdiff will even let you feed it patches in gzip or
+bzip2 compressed form directly without the use of zcat or bzcat or manual
+decompression.
+
+Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step:
+ interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1
+
+Although interdiff may save you a step or two you are generally advised to
+do the additional steps since interdiff can get things wrong in some cases.
+
+ Another alternative is `ketchup', which is a Python script for automatic
+downloading and applying of patches (http://www.selenic.com/ketchup/).
+
+Other nice tools are diffstat, which shows a summary of changes made by a
+patch; lsdiff, which displays a short listing of affected files in a patch
+file, along with (optionally) the line numbers of the start of each patch;
+and grepdiff, which displays a list of the files modified by a patch where
+the patch contains a given regular expression.
+
+
+Where can I download the patches?
+---
+ The patches are available at http://kernel.org/
+Most recent patches are linked from the front page, but they also have
+specific homes.
+
+The 2.6.x.y (-stable) and 2.6.x patches live at
+ ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
+
+The -rc patches live at
+ ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/
+
+The -git patches live at
+ ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/
+
+The -mm kernels live at
+ ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/
+
+In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
+country code. This way you'll be downloading from a mirror site that's most
+likely geographically closer to you, resulting in faster downloads for you,
+less bandwidth used globally and less load on the main kernel.org servers -
+these are good things, do use mirrors when possible.
+
+
+The 2.6.x kernels
+---
+ These are the base stable releases released by Linus. The highest numbered
+release is the most recent.
+
+If regressions or other serious flaws are found then a -stable fix patch
+will be released (see below) on top of this base. Once a new 2.6.x base
+kernel is released, a patch is made available that is a delta between the
+previous 2.6.x kernel and the new one.
+
+To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note
+that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
+base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to
+first revert the 2.6.x.y patch).
+
+Here are some examples:
+
+# moving from 2.6.11 to 2.6.12
+$ cd ~/linux-2.6.11 # change to kernel source dir
+$ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch
+$ cd ..
+$ mv linux-2.6.11 linux-2.6.12 # rename source dir
+
+# moving from 2.6.11.1 to 2.6.12
+$ cd ~/linux-2.6.11.1 # change to kernel source dir
+$ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
+ # source dir is now 2.6.11
+$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
+$ cd ..
+$ mv linux-2.6.11.1 linux-2.6.12 # rename source dir
+
+
+The 2.6.x.y kernels
+---
+ Kernels with 4 digit versions are -stable kernels. They contain small(ish)
+critical fixes for security problems or significant regressions discovered
+in a given 2.6.x kernel.
+
+This is the recommended branch for users who want the most recent stable
+kernel and are not interested in helping test development/experimental
+versions.
+
+If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
+the current stable kernel.
+
+These patches are not incremental, meaning that for example the 2.6.12.3
+patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
+of the base 2.6.12 kernel source.
+So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
+source you have to first back out the 2.6.12.2 patch (so you are left with a
+base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
+
+Here's a small example:
+
+$ cd ~/linux-2.6.12.2 # change into the kernel source dir
+$ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch
+$ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch
+$ cd ..
+$ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir
+
+
+The -rc kernels
+---
+ These are release-candidate kernels. These are development kernels released
+by Linus whenever he deems the current git (the kernel's source management
+tool) tree to be in a reasonably sane state adequate for testing.
+
+These kernels are not stable and you should expect occasional breakage if
+you intend to run them. This is however the most stable of the main
+development branches and is also what will eventually turn into the next
+stable kernel, so it is important that it be tested by as many people as
+possible.
+
+This is a good branch to run for people who want to help out testing
+development kernels but do not want to run some of the really experimental
+stuff (such people should see the sections about -git and -mm kernels below).
+
+The -rc patches are not incremental, they apply to a base 2.6.x kernel, just
+like the 2.6.x.y patches described above. The kernel version before the -rcN
+suffix denotes the version of the kernel that this -rc kernel will eventually
+turn into.
+So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13
+kernel and the patch should be applied on top of the 2.6.12 kernel source.
+
+Here are 3 examples of how to apply these patches:
+
+# first an example of moving from 2.6.12 to 2.6.13-rc3
+$ cd ~/linux-2.6.12 # change into the 2.6.12 source dir
+$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
+$ cd ..
+$ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir
+
+# now let's move from 2.6.13-rc3 to 2.6.13-rc5
+$ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir
+$ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch
+$ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch
+$ cd ..
+$ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir
+
+# finally let's try and move from 2.6.12.3 to 2.6.13-rc5
+$ cd ~/linux-2.6.12.3 # change to the kernel source dir
+$ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch
+$ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch
+$ cd ..
+$ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir
+
+
+The -git kernels
+---
+ These are daily snapshots of Linus' kernel tree (managed in a git
+repository, hence the name).
+
+These patches are usually released daily and represent the current state of
+Linus' tree. They are more experimental than -rc kernels since they are
+generated automatically without even a cursory glance to see if they are
+sane.
+
+-git patches are not incremental and apply either to a base 2.6.x kernel or
+a base 2.6.x-rc kernel - you can see which from their name.
+A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
+named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
+
+Here are some examples of how to apply these patches:
+
+# moving from 2.6.12 to 2.6.12-git1
+$ cd ~/linux-2.6.12 # change to the kernel source dir
+$ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch
+$ cd ..
+$ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir
+
+# moving from 2.6.12-git1 to 2.6.13-rc2-git3
+$ cd ~/linux-2.6.12-git1 # change to the kernel source dir
+$ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch
+ # we now have a 2.6.12 kernel
+$ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch
+ # the kernel is now 2.6.13-rc2
+$ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch
+ # the kernel is now 2.6.13-rc2-git3
+$ cd ..
+$ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir
+
+
+The -mm kernels
+---
+ These are experimental kernels released by Andrew Morton.
+
+The -mm tree serves as a sort of proving ground for new features and other
+experimental patches.
+Once a patch has proved its worth in -mm for a while Andrew pushes it on to
+Linus for inclusion in mainline.
+
+Although it's encouraged that patches flow to Linus via the -mm tree, this
+is not always enforced.
+Subsystem maintainers (or individuals) sometimes push their patches directly
+to Linus, even though (or after) they have been merged and tested in -mm (or
+sometimes even without prior testing in -mm).
+
+You should generally strive to get your patches into mainline via -mm to
+ensure maximum testing.
+
+This branch is in constant flux and contains many experimental features and
+a lot of debugging patches not appropriate for mainline, and it is the most
+experimental of the branches described in this document.
+
+These kernels are not appropriate for use on systems that are supposed to be
+stable and they are more risky to run than any of the other branches (make
+sure you have up-to-date backups - that goes for any experimental kernel but
+even more so for -mm kernels).
+
+These kernels, in addition to all the other experimental patches they
+contain, usually also contain any changes in the mainline -git kernels
+available at the time of release.
+
+Testing of -mm kernels is greatly appreciated since the whole point of the
+tree is to weed out regressions, crashes, data corruption bugs, build
+breakage (and any other bug in general) before changes are merged into the
+more stable mainline Linus tree.
+But testers of -mm should be aware that breakage in this tree is more common
+than in any other tree.
+
+The -mm kernels are not released on a fixed schedule, but usually a few -mm
+kernels are released in between each -rc kernel (1 to 3 is common).
+The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels
+have been released yet) or to a Linus -rc kernel.
+
+Here are some examples of applying the -mm patches:
+
+# moving from 2.6.12 to 2.6.12-mm1
+$ cd ~/linux-2.6.12 # change to the 2.6.12 source dir
+$ patch -p1 < ../2.6.12-mm1 # apply the 2.6.12-mm1 patch
+$ cd ..
+$ mv linux-2.6.12 linux-2.6.12-mm1 # rename the source appropriately
+
+# moving from 2.6.12-mm1 to 2.6.13-rc3-mm3
+$ cd ~/linux-2.6.12-mm1
+$ patch -p1 -R < ../2.6.12-mm1 # revert the 2.6.12-mm1 patch
+ # we now have a 2.6.12 source
+$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
+ # we now have a 2.6.13-rc3 source
+$ patch -p1 < ../2.6.13-rc3-mm3 # apply the 2.6.13-rc3-mm3 patch
+$ cd ..
+$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
+
+
+This concludes this list of explanations of the various kernel trees and I
+hope you are now crystal clear on how to apply the various patches and help
+test the kernel.
+
diff --git a/Documentation/cciss.txt b/Documentation/cciss.txt
index c8f9a73..68a711f 100644
--- a/Documentation/cciss.txt
+++ b/Documentation/cciss.txt
@@ -17,7 +17,9 @@ This driver is known to work with the fo
* SA P600
* SA P800
* SA E400
- * SA E300
+ * SA P400i
+ * SA E200
+ * SA E200i
If nodes are not already created in the /dev/cciss directory, run as root:
diff --git a/Documentation/cdrom/sonycd535 b/Documentation/cdrom/sonycd535
index 59581a4..b81e109 100644
--- a/Documentation/cdrom/sonycd535
+++ b/Documentation/cdrom/sonycd535
@@ -68,7 +68,8 @@ it a better device citizen. Further tha
Porfiri Claudio <C.Porfiri@nisms.tei.ericsson.se> for patches
to make the driver work with the older CDU-510/515 series, and
Heiko Eissfeldt <heiko@colossus.escape.de> for pointing out that
-the verify_area() checks were ignoring the results of said checks.
+the verify_area() checks were ignoring the results of said checks
+(note: verify_area() has since been replaced by access_ok()).
(Acknowledgments from Ron Jeppesen in the 0.3 release:)
Thanks to Corey Minyard who wrote the original CDU-31A driver on which
diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c
new file mode 100644
index 0000000..b7de82e
--- /dev/null
+++ b/Documentation/connector/cn_test.c
@@ -0,0 +1,194 @@
+/*
+ * cn_test.c
+ *
+ * 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * Al