Merge with /home/shaggy/git/linus-clean/
This commit is contained in:
commit
0a0fc0ddbe
2
.gitignore
vendored
2
.gitignore
vendored
@ -10,6 +10,7 @@
|
||||
*.a
|
||||
*.s
|
||||
*.ko
|
||||
*.so
|
||||
*.mod.c
|
||||
|
||||
#
|
||||
@ -23,6 +24,7 @@ Module.symvers
|
||||
# Generated include files
|
||||
#
|
||||
include/asm
|
||||
include/asm-*/asm-offsets.h
|
||||
include/config
|
||||
include/linux/autoconf.h
|
||||
include/linux/compile.h
|
||||
|
17
CREDITS
17
CREDITS
@ -611,8 +611,7 @@ S: USA
|
||||
N: Randolph Chung
|
||||
E: tausq@debian.org
|
||||
D: Linux/PA-RISC hacker
|
||||
S: Los Altos, CA 94022
|
||||
S: USA
|
||||
S: Hong Kong
|
||||
|
||||
N: Juan Jose Ciarlante
|
||||
W: http://juanjox.kernelnotes.org/
|
||||
@ -1097,7 +1096,7 @@ S: 80050-430 - Curitiba - Paran
|
||||
S: Brazil
|
||||
|
||||
N: Kumar Gala
|
||||
E: kumar.gala@freescale.com
|
||||
E: galak@kernel.crashing.org
|
||||
D: Embedded PowerPC 6xx/7xx/74xx/82xx/83xx/85xx support
|
||||
S: Austin, Texas 78729
|
||||
S: USA
|
||||
@ -1884,6 +1883,7 @@ N: Jaya Kumar
|
||||
E: jayalk@intworks.biz
|
||||
W: http://www.intworks.biz
|
||||
D: Arc monochrome LCD framebuffer driver, x86 reboot fixups
|
||||
D: pirq addr, CS5535 alsa audio driver
|
||||
S: Gurgaon, India
|
||||
S: Kuala Lumpur, Malaysia
|
||||
|
||||
@ -3203,7 +3203,7 @@ N: Eugene Surovegin
|
||||
E: ebs@ebshome.net
|
||||
W: http://kernel.ebshome.net/
|
||||
P: 1024D/AE5467F1 FF22 39F1 6728 89F6 6E6C 2365 7602 F33D AE54 67F1
|
||||
D: Embedded PowerPC 4xx: I2C, PIC and random hacks/fixes
|
||||
D: Embedded PowerPC 4xx: EMAC, I2C, PIC and random hacks/fixes
|
||||
S: Sunnyvale, California 94085
|
||||
S: USA
|
||||
|
||||
@ -3405,6 +3405,15 @@ S: Chudenicka 8
|
||||
S: 10200 Prague 10, Hostivar
|
||||
S: Czech Republic
|
||||
|
||||
N: Thibaut Varene
|
||||
E: T-Bone@parisc-linux.org
|
||||
W: http://www.parisc-linux.org/
|
||||
P: 1024D/B7D2F063 E67C 0D43 A75E 12A5 BB1C FA2F 1E32 C3DA B7D2 F063
|
||||
D: PA-RISC port minion, PDC and GSCPS2 drivers, debuglocks and other bits
|
||||
D: Some bits in an ARM port, S1D13XXX FB driver, random patches here and there
|
||||
D: AD1889 sound driver
|
||||
S: Paris, France
|
||||
|
||||
N: Heikki Vatiainen
|
||||
E: hessu@cs.tut.fi
|
||||
D: Co-author of Multi-Protocol Over ATM (MPOA), some LANE hacks
|
||||
|
@ -24,6 +24,8 @@ DMA-mapping.txt
|
||||
- info for PCI drivers using DMA portably across all platforms.
|
||||
DocBook/
|
||||
- directory with DocBook templates etc. for kernel documentation.
|
||||
HOWTO
|
||||
- The process and procedures of how to do Linux kernel development.
|
||||
IO-mapping.txt
|
||||
- how to access I/O mapped memory from within device drivers.
|
||||
IPMI.txt
|
||||
@ -256,6 +258,10 @@ specialix.txt
|
||||
- info on hardware/driver for specialix IO8+ multiport serial card.
|
||||
spinlocks.txt
|
||||
- info on using spinlocks to provide exclusive access in kernel.
|
||||
stable_api_nonsense.txt
|
||||
- info on why the kernel does not have a stable in-kernel api or abi.
|
||||
stable_kernel_rules.txt
|
||||
- rules and procedures for the -stable kernel releases.
|
||||
stallion.txt
|
||||
- info on using the Stallion multiport serial driver.
|
||||
svga.txt
|
||||
|
@ -31,8 +31,6 @@ al espa
|
||||
Eine deutsche Version dieser Datei finden Sie unter
|
||||
<http://www.stefan-winter.de/Changes-2.4.0.txt>.
|
||||
|
||||
Last updated: October 29th, 2002
|
||||
|
||||
Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu).
|
||||
|
||||
Current Minimal Requirements
|
||||
@ -48,7 +46,7 @@ necessary on all systems; obviously, if you don't have any ISDN
|
||||
hardware, for example, you probably needn't concern yourself with
|
||||
isdn4k-utils.
|
||||
|
||||
o Gnu C 2.95.3 # gcc --version
|
||||
o Gnu C 3.2 # gcc --version
|
||||
o Gnu make 3.79.1 # make --version
|
||||
o binutils 2.12 # ld -v
|
||||
o util-linux 2.10o # fdformat --version
|
||||
@ -74,26 +72,7 @@ GCC
|
||||
---
|
||||
|
||||
The gcc version requirements may vary depending on the type of CPU in your
|
||||
computer. The next paragraph applies to users of x86 CPUs, but not
|
||||
necessarily to users of other CPUs. Users of other CPUs should obtain
|
||||
information about their gcc version requirements from another source.
|
||||
|
||||
The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it
|
||||
should be used when you need absolute stability. You may use gcc 3.0.x
|
||||
instead if you wish, although it may cause problems. Later versions of gcc
|
||||
have not received much testing for Linux kernel compilation, and there are
|
||||
almost certainly bugs (mainly, but not exclusively, in the kernel) that
|
||||
will need to be fixed in order to use these compilers. In any case, using
|
||||
pgcc instead of plain gcc is just asking for trouble.
|
||||
|
||||
The Red Hat gcc 2.96 compiler subtree can also be used to build this tree.
|
||||
You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build
|
||||
the kernel correctly.
|
||||
|
||||
In addition, please pay attention to compiler optimization. Anything
|
||||
greater than -O2 may not be wise. Similarly, if you choose to use gcc-2.95.x
|
||||
or derivatives, be sure not to use -fstrict-aliasing (which, depending on
|
||||
your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing).
|
||||
computer.
|
||||
|
||||
Make
|
||||
----
|
||||
@ -322,9 +301,9 @@ Getting updated software
|
||||
Kernel compilation
|
||||
******************
|
||||
|
||||
gcc 2.95.3
|
||||
----------
|
||||
o <ftp://ftp.gnu.org/gnu/gcc/gcc-2.95.3.tar.gz>
|
||||
gcc
|
||||
---
|
||||
o <ftp://ftp.gnu.org/gnu/gcc/>
|
||||
|
||||
Make
|
||||
----
|
||||
|
@ -199,7 +199,7 @@ The rationale is:
|
||||
modifications are prevented
|
||||
- saves the compiler work to optimize redundant code away ;)
|
||||
|
||||
int fun(int )
|
||||
int fun(int a)
|
||||
{
|
||||
int result = 0;
|
||||
char *buffer = kmalloc(SIZE);
|
||||
@ -344,7 +344,7 @@ Remember: if another thread can find your data structure, and you don't
|
||||
have a reference count on it, you almost certainly have a bug.
|
||||
|
||||
|
||||
Chapter 11: Macros, Enums, Inline functions and RTL
|
||||
Chapter 11: Macros, Enums and RTL
|
||||
|
||||
Names of macros defining constants and labels in enums are capitalized.
|
||||
|
||||
@ -429,7 +429,35 @@ from void pointer to any other pointer type is guaranteed by the C programming
|
||||
language.
|
||||
|
||||
|
||||
Chapter 14: References
|
||||
Chapter 14: The inline disease
|
||||
|
||||
There appears to be a common misperception that gcc has a magic "make me
|
||||
faster" speedup option called "inline". While the use of inlines can be
|
||||
appropriate (for example as a means of replacing macros, see Chapter 11), it
|
||||
very often is not. Abundant use of the inline keyword leads to a much bigger
|
||||
kernel, which in turn slows the system as a whole down, due to a bigger
|
||||
icache footprint for the CPU and simply because there is less memory
|
||||
available for the pagecache. Just think about it; a pagecache miss causes a
|
||||
disk seek, which easily takes 5 miliseconds. There are a LOT of cpu cycles
|
||||
that can go into these 5 miliseconds.
|
||||
|
||||
A reasonable rule of thumb is to not put inline at functions that have more
|
||||
than 3 lines of code in them. An exception to this rule are the cases where
|
||||
a parameter is known to be a compiletime constant, and as a result of this
|
||||
constantness you *know* the compiler will be able to optimize most of your
|
||||
function away at compile time. For a good example of this later case, see
|
||||
the kmalloc() inline function.
|
||||
|
||||
Often people argue that adding inline to functions that are static and used
|
||||
only once is always a win since there is no space tradeoff. While this is
|
||||
technically correct, gcc is capable of inlining these automatically without
|
||||
help, and the maintenance issue of removing the inline when a second user
|
||||
appears outweighs the potential value of the hint that tells gcc to do
|
||||
something it would have done anyway.
|
||||
|
||||
|
||||
|
||||
Chapter 15: References
|
||||
|
||||
The C Programming Language, Second Edition
|
||||
by Brian W. Kernighan and Dennis M. Ritchie.
|
||||
@ -444,10 +472,13 @@ ISBN 0-201-61586-X.
|
||||
URL: http://cm.bell-labs.com/cm/cs/tpop/
|
||||
|
||||
GNU manuals - where in compliance with K&R and this text - for cpp, gcc,
|
||||
gcc internals and indent, all available from http://www.gnu.org
|
||||
gcc internals and indent, all available from http://www.gnu.org/manual/
|
||||
|
||||
WG14 is the international standardization working group for the programming
|
||||
language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/
|
||||
language C, URL: http://www.open-std.org/JTC1/SC22/WG14/
|
||||
|
||||
Kernel CodingStyle, by greg@kroah.com at OLS 2002:
|
||||
http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
|
||||
|
||||
--
|
||||
Last updated on 16 February 2004 by a community effort on LKML.
|
||||
Last updated on 30 December 2005 by a community effort on LKML.
|
||||
|
6
Documentation/DocBook/.gitignore
vendored
Normal file
6
Documentation/DocBook/.gitignore
vendored
Normal file
@ -0,0 +1,6 @@
|
||||
*.xml
|
||||
*.ps
|
||||
*.pdf
|
||||
*.html
|
||||
*.9.gz
|
||||
*.9
|
@ -20,6 +20,12 @@ DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \
|
||||
# +--> DIR=file (htmldocs)
|
||||
# +--> man/ (mandocs)
|
||||
|
||||
|
||||
# for PDF and PS output you can choose between xmlto and docbook-utils tools
|
||||
PDF_METHOD = $(prefer-db2x)
|
||||
PS_METHOD = $(prefer-db2x)
|
||||
|
||||
|
||||
###
|
||||
# The targets that may be used.
|
||||
.PHONY: xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs
|
||||
@ -93,27 +99,39 @@ C-procfs-example = procfs_example.xml
|
||||
C-procfs-example2 = $(addprefix $(obj)/,$(C-procfs-example))
|
||||
$(obj)/procfs-guide.xml: $(C-procfs-example2)
|
||||
|
||||
###
|
||||
# Rules to generate postscript, PDF and HTML
|
||||
# db2html creates a directory. Generate a html file used for timestamp
|
||||
notfoundtemplate = echo "*** You have to install docbook-utils or xmlto ***"; \
|
||||
exit 1
|
||||
db2xtemplate = db2TYPE -o $(dir $@) $<
|
||||
xmltotemplate = xmlto TYPE $(XMLTOFLAGS) -o $(dir $@) $<
|
||||
|
||||
quiet_cmd_db2ps = XMLTO $@
|
||||
cmd_db2ps = xmlto ps $(XMLTOFLAGS) -o $(dir $@) $<
|
||||
# determine which methods are available
|
||||
ifeq ($(shell which db2ps >/dev/null 2>&1 && echo found),found)
|
||||
use-db2x = db2x
|
||||
prefer-db2x = db2x
|
||||
else
|
||||
use-db2x = notfound
|
||||
prefer-db2x = $(use-xmlto)
|
||||
endif
|
||||
ifeq ($(shell which xmlto >/dev/null 2>&1 && echo found),found)
|
||||
use-xmlto = xmlto
|
||||
prefer-xmlto = xmlto
|
||||
else
|
||||
use-xmlto = notfound
|
||||
prefer-xmlto = $(use-db2x)
|
||||
endif
|
||||
|
||||
# the commands, generated from the chosen template
|
||||
quiet_cmd_db2ps = PS $@
|
||||
cmd_db2ps = $(subst TYPE,ps, $($(PS_METHOD)template))
|
||||
%.ps : %.xml
|
||||
@(which xmlto > /dev/null 2>&1) || \
|
||||
(echo "*** You need to install xmlto ***"; \
|
||||
exit 1)
|
||||
$(call cmd,db2ps)
|
||||
|
||||
quiet_cmd_db2pdf = XMLTO $@
|
||||
cmd_db2pdf = xmlto pdf $(XMLTOFLAGS) -o $(dir $@) $<
|
||||
quiet_cmd_db2pdf = PDF $@
|
||||
cmd_db2pdf = $(subst TYPE,pdf, $($(PDF_METHOD)template))
|
||||
%.pdf : %.xml
|
||||
@(which xmlto > /dev/null 2>&1) || \
|
||||
(echo "*** You need to install xmlto ***"; \
|
||||
exit 1)
|
||||
$(call cmd,db2pdf)
|
||||
|
||||
quiet_cmd_db2html = XMLTO $@
|
||||
quiet_cmd_db2html = HTML $@
|
||||
cmd_db2html = xmlto xhtml $(XMLTOFLAGS) -o $(patsubst %.html,%,$@) $< && \
|
||||
echo '<a HREF="$(patsubst %.html,%,$(notdir $@))/index.html"> \
|
||||
Goto $(patsubst %.html,%,$(notdir $@))</a><p>' > $@
|
||||
@ -127,7 +145,7 @@ quiet_cmd_db2html = XMLTO $@
|
||||
@if [ ! -z "$(PNG-$(basename $(notdir $@)))" ]; then \
|
||||
cp $(PNG-$(basename $(notdir $@))) $(patsubst %.html,%,$@); fi
|
||||
|
||||
quiet_cmd_db2man = XMLTO $@
|
||||
quiet_cmd_db2man = MAN $@
|
||||
cmd_db2man = if grep -q refentry $<; then xmlto man $(XMLTOFLAGS) -o $(obj)/man $< ; gzip -f $(obj)/man/*.9; fi
|
||||
%.9 : %.xml
|
||||
@(which xmlto > /dev/null 2>&1) || \
|
||||
|
@ -53,6 +53,11 @@
|
||||
!Iinclude/linux/sched.h
|
||||
!Ekernel/sched.c
|
||||
!Ekernel/timer.c
|
||||
</sect1>
|
||||
<sect1><title>High-resolution timers</title>
|
||||
!Iinclude/linux/ktime.h
|
||||
!Iinclude/linux/hrtimer.h
|
||||
!Ekernel/hrtimer.c
|
||||
</sect1>
|
||||
<sect1><title>Internal Functions</title>
|
||||
!Ikernel/exit.c
|
||||
@ -68,9 +73,7 @@ X!Iinclude/linux/kobject.h
|
||||
|
||||
<sect1><title>Kernel utility functions</title>
|
||||
!Iinclude/linux/kernel.h
|
||||
<!-- This needs to clean up to make kernel-doc happy
|
||||
X!Ekernel/printk.c
|
||||
-->
|
||||
!Ekernel/printk.c
|
||||
!Ekernel/panic.c
|
||||
!Ekernel/sys.c
|
||||
!Ekernel/rcupdate.c
|
||||
@ -239,8 +242,10 @@ X!Ilib/string.c
|
||||
<sect1><title>Driver Support</title>
|
||||
!Enet/core/dev.c
|
||||
!Enet/ethernet/eth.c
|
||||
!Einclude/linux/etherdevice.h
|
||||
!Enet/core/wireless.c
|
||||
!Iinclude/linux/etherdevice.h
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Enet/core/wireless.c
|
||||
-->
|
||||
</sect1>
|
||||
<sect1><title>Synchronous PPP</title>
|
||||
!Edrivers/net/wan/syncppp.c
|
||||
@ -369,6 +374,7 @@ X!Edrivers/acpi/motherboard.c
|
||||
X!Edrivers/acpi/bus.c
|
||||
-->
|
||||
!Edrivers/acpi/scan.c
|
||||
!Idrivers/acpi/scan.c
|
||||
<!-- No correct structured comments
|
||||
X!Edrivers/acpi/pci_bind.c
|
||||
-->
|
||||
@ -388,7 +394,7 @@ X!Edrivers/pnp/system.c
|
||||
|
||||
<chapter id="blkdev">
|
||||
<title>Block Devices</title>
|
||||
!Edrivers/block/ll_rw_blk.c
|
||||
!Eblock/ll_rw_blk.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="miscdev">
|
||||
|
@ -222,7 +222,7 @@
|
||||
<title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
|
||||
|
||||
<para>
|
||||
There are two main types of kernel locks. The fundamental type
|
||||
There are three main types of kernel locks. The fundamental type
|
||||
is the spinlock
|
||||
(<filename class="headerfile">include/asm/spinlock.h</filename>),
|
||||
which is a very simple single-holder lock: if you can't get the
|
||||
@ -230,16 +230,22 @@
|
||||
very small and fast, and can be used anywhere.
|
||||
</para>
|
||||
<para>
|
||||
The second type is a semaphore
|
||||
The second type is a mutex
|
||||
(<filename class="headerfile">include/linux/mutex.h</filename>): it
|
||||
is like a spinlock, but you may block holding a mutex.
|
||||
If you can't lock a mutex, your task will suspend itself, and be woken
|
||||
up when the mutex is released. This means the CPU can do something
|
||||
else while you are waiting. There are many cases when you simply
|
||||
can't sleep (see <xref linkend="sleeping-things"/>), and so have to
|
||||
use a spinlock instead.
|
||||
</para>
|
||||
<para>
|
||||
The third type is a semaphore
|
||||
(<filename class="headerfile">include/asm/semaphore.h</filename>): it
|
||||
can have more than one holder at any time (the number decided at
|
||||
initialization time), although it is most commonly used as a
|
||||
single-holder lock (a mutex). If you can't get a semaphore,
|
||||
your task will put itself on the queue, and be woken up when the
|
||||
semaphore is released. This means the CPU will do something
|
||||
else while you are waiting, but there are many cases when you
|
||||
simply can't sleep (see <xref linkend="sleeping-things"/>), and so
|
||||
have to use a spinlock instead.
|
||||
single-holder lock (a mutex). If you can't get a semaphore, your
|
||||
task will be suspended and later on woken up - just like for mutexes.
|
||||
</para>
|
||||
<para>
|
||||
Neither type of lock is recursive: see
|
||||
|
@ -3,4 +3,5 @@
|
||||
<param name="chunk.quietly">1</param>
|
||||
<param name="funcsynopsis.style">ansi</param>
|
||||
<param name="funcsynopsis.tabular.threshold">80</param>
|
||||
<!-- <param name="paper.type">A4</param> -->
|
||||
</stylesheet>
|
||||
|
@ -253,6 +253,7 @@
|
||||
!Edrivers/usb/core/urb.c
|
||||
!Edrivers/usb/core/message.c
|
||||
!Edrivers/usb/core/file.c
|
||||
!Edrivers/usb/core/driver.c
|
||||
!Edrivers/usb/core/usb.c
|
||||
!Edrivers/usb/core/hub.c
|
||||
</chapter>
|
||||
|
@ -229,7 +229,7 @@ int __init myradio_init(struct video_init *v)
|
||||
|
||||
static int users = 0;
|
||||
|
||||
static int radio_open(stuct video_device *dev, int flags)
|
||||
static int radio_open(struct video_device *dev, int flags)
|
||||
{
|
||||
if(users)
|
||||
return -EBUSY;
|
||||
@ -949,7 +949,7 @@ int __init mycamera_init(struct video_init *v)
|
||||
|
||||
static int users = 0;
|
||||
|
||||
static int camera_open(stuct video_device *dev, int flags)
|
||||
static int camera_open(struct video_device *dev, int flags)
|
||||
{
|
||||
if(users)
|
||||
return -EBUSY;
|
||||
|
618
Documentation/HOWTO
Normal file
618
Documentation/HOWTO
Normal file
@ -0,0 +1,618 @@
|
||||
HOWTO do Linux kernel development
|
||||
---------------------------------
|
||||
|
||||
This is the be-all, end-all document on this topic. It contains
|
||||
instructions on how to become a Linux kernel developer and how to learn
|
||||
to work with the Linux kernel development community. It tries to not
|
||||
contain anything related to the technical aspects of kernel programming,
|
||||
but will help point you in the right direction for that.
|
||||
|
||||
If anything in this document becomes out of date, please send in patches
|
||||
to the maintainer of this file, who is listed at the bottom of the
|
||||
document.
|
||||
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
So, you want to learn how to become a Linux kernel developer? Or you
|
||||
have been told by your manager, "Go write a Linux driver for this
|
||||
device." This document's goal is to teach you everything you need to
|
||||
know to achieve this by describing the process you need to go through,
|
||||
and hints on how to work with the community. It will also try to
|
||||
explain some of the reasons why the community works like it does.
|
||||
|
||||
The kernel is written mostly in C, with some architecture-dependent
|
||||
parts written in assembly. A good understanding of C is required for
|
||||
kernel development. Assembly (any architecture) is not required unless
|
||||
you plan to do low-level development for that architecture. Though they
|
||||
are not a good substitute for a solid C education and/or years of
|
||||
experience, the following books are good for, if anything, reference:
|
||||
- "The C Programming Language" by Kernighan and Ritchie [Prentice Hall]
|
||||
- "Practical C Programming" by Steve Oualline [O'Reilly]
|
||||
|
||||
The kernel is written using GNU C and the GNU toolchain. While it
|
||||
adheres to the ISO C89 standard, it uses a number of extensions that are
|
||||
not featured in the standard. The kernel is a freestanding C
|
||||
environment, with no reliance on the standard C library, so some
|
||||
portions of the C standard are not supported. Arbitrary long long
|
||||
divisions and floating point are not allowed. It can sometimes be
|
||||
difficult to understand the assumptions the kernel has on the toolchain
|
||||
and the extensions that it uses, and unfortunately there is no
|
||||
definitive reference for them. Please check the gcc info pages (`info
|
||||
gcc`) for some information on them.
|
||||
|
||||
Please remember that you are trying to learn how to work with the
|
||||
existing development community. It is a diverse group of people, with
|
||||
high standards for coding, style and procedure. These standards have
|
||||
been created over time based on what they have found to work best for
|
||||
such a large and geographically dispersed team. Try to learn as much as
|
||||
possible about these standards ahead of time, as they are well
|
||||
documented; do not expect people to adapt to you or your company's way
|
||||
of doing things.
|
||||
|
||||
|
||||
Legal Issues
|
||||
------------
|
||||
|
||||
The Linux kernel source code is released under the GPL. Please see the
|
||||
file, COPYING, in the main directory of the source tree, for details on
|
||||
the license. If you have further questions about the license, please
|
||||
contact a lawyer, and do not ask on the Linux kernel mailing list. The
|
||||
people on the mailing lists are not lawyers, and you should not rely on
|
||||
their statements on legal matters.
|
||||
|
||||
For common questions and answers about the GPL, please see:
|
||||
http://www.gnu.org/licenses/gpl-faq.html
|
||||
|
||||
|
||||
Documentation
|
||||
------------
|
||||
|
||||
The Linux kernel source tree has a large range of documents that are
|
||||
invaluable for learning how to interact with the kernel community. When
|
||||
new features are added to the kernel, it is recommended that new
|
||||
documentation files are also added which explain how to use the feature.
|
||||
When a kernel change causes the interface that the kernel exposes to
|
||||
userspace to change, it is recommended that you send the information or
|
||||
a patch to the manual pages explaining the change to the manual pages
|
||||
maintainer at mtk-manpages@gmx.net.
|
||||
|
||||
Here is a list of files that are in the kernel source tree that are
|
||||
required reading:
|
||||
README
|
||||
This file gives a short background on the Linux kernel and describes
|
||||
what is necessary to do to configure and build the kernel. People
|
||||
who are new to the kernel should start here.
|
||||
|
||||
Documentation/Changes
|
||||
This file gives a list of the minimum levels of various software
|
||||
packages that are necessary to build and run the kernel
|
||||
successfully.
|
||||
|
||||
Documentation/CodingStyle
|
||||
This describes the Linux kernel coding style, and some of the
|
||||
rationale behind it. All new code is expected to follow the
|
||||
guidelines in this document. Most maintainers will only accept
|
||||
patches if these rules are followed, and many people will only
|
||||
review code if it is in the proper style.
|
||||
|
||||
Documentation/SubmittingPatches
|
||||
Documentation/SubmittingDrivers
|
||||
These files describe in explicit detail how to successfully create
|
||||
and send a patch, including (but not limited to):
|
||||
- Email contents
|
||||
- Email format
|
||||
- Who to send it to
|
||||
Following these rules will not guarantee success (as all patches are
|
||||
subject to scrutiny for content and style), but not following them
|
||||
will almost always prevent it.
|
||||
|
||||
Other excellent descriptions of how to create patches properly are:
|
||||
"The Perfect Patch"
|
||||
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
|
||||
"Linux kernel patch submission format"
|
||||
http://linux.yyz.us/patch-format.html
|
||||
|
||||
Documentation/stable_api_nonsense.txt
|
||||
This file describes the rationale behind the conscious decision to
|
||||
not have a stable API within the kernel, including things like:
|
||||
- Subsystem shim-layers (for compatibility?)
|
||||
- Driver portability between Operating Systems.
|
||||
- Mitigating rapid change within the kernel source tree (or
|
||||
preventing rapid change)
|
||||
This document is crucial for understanding the Linux development
|
||||
philosophy and is very important for people moving to Linux from
|
||||
development on other Operating Systems.
|
||||
|
||||
Documentation/SecurityBugs
|
||||
If you feel you have found a security problem in the Linux kernel,
|
||||
please follow the steps in this document to help notify the kernel
|
||||
developers, and help solve the issue.
|
||||
|
||||
Documentation/ManagementStyle
|
||||
This document describes how Linux kernel maintainers operate and the
|
||||
shared ethos behind their methodologies. This is important reading
|
||||
for anyone new to kernel development (or anyone simply curious about
|
||||
it), as it resolves a lot of common misconceptions and confusion
|
||||
about the unique behavior of kernel maintainers.
|
||||
|
||||
Documentation/stable_kernel_rules.txt
|
||||
This file describes the rules on how the stable kernel releases
|
||||
happen, and what to do if you want to get a change into one of these
|
||||
releases.
|
||||
|
||||
Documentation/kernel-docs.txt
|
||||
A list of external documentation that pertains to kernel
|
||||
development. Please consult this list if you do not find what you
|
||||
are looking for within the in-kernel documentation.
|
||||
|
||||
Documentation/applying-patches.txt
|
||||
A good introduction describing exactly what a patch is and how to
|
||||
apply it to the different development branches of the kernel.
|
||||
|
||||
The kernel also has a large number of documents that can be
|
||||
automatically generated from the source code itself. This includes a
|
||||
full description of the in-kernel API, and rules on how to handle
|
||||
locking properly. The documents will be created in the
|
||||
Documentation/DocBook/ directory and can be generated as PDF,
|
||||
Postscript, HTML, and man pages by running:
|
||||
make pdfdocs
|
||||
make psdocs
|
||||
make htmldocs
|
||||
make mandocs
|
||||
respectively from the main kernel source directory.
|
||||
|
||||
|
||||
Becoming A Kernel Developer
|
||||
---------------------------
|
||||
|
||||
If you do not know anything about Linux kernel development, you should
|
||||
look at the Linux KernelNewbies project:
|
||||
http://kernelnewbies.org
|
||||
It consists of a helpful mailing list where you can ask almost any type
|
||||
of basic kernel development question (make sure to search the archives
|
||||
first, before asking something that has already been answered in the
|
||||
past.) It also has an IRC channel that you can use to ask questions in
|
||||
real-time, and a lot of helpful documentation that is useful for
|
||||
learning about Linux kernel development.
|
||||
|
||||
The website has basic information about code organization, subsystems,
|
||||
and current projects (both in-tree and out-of-tree). It also describes
|
||||
some basic logistical information, like how to compile a kernel and
|
||||
apply a patch.
|
||||
|
||||
If you do not know where you want to start, but you want to look for
|
||||
some task to start doing to join into the kernel development community,
|
||||
go to the Linux Kernel Janitor's project:
|
||||
http://janitor.kernelnewbies.org/
|
||||
It is a great place to start. It describes a list of relatively simple
|
||||
problems that need to be cleaned up and fixed within the Linux kernel
|
||||
source tree. Working with the developers in charge of this project, you
|
||||
will learn the basics of getting your patch into the Linux kernel tree,
|
||||
and possibly be pointed in the direction of what to go work on next, if
|
||||
you do not already have an idea.
|
||||
|
||||
If you already have a chunk of code that you want to put into the kernel
|
||||
tree, but need some help getting it in the proper form, the
|
||||
kernel-mentors project was created to help you out with this. It is a
|
||||
mailing list, and can be found at:
|
||||
http://selenic.com/mailman/listinfo/kernel-mentors
|
||||
|
||||
Before making any actual modifications to the Linux kernel code, it is
|
||||
imperative to understand how the code in question works. For this
|
||||
purpose, nothing is better than reading through it directly (most tricky
|
||||
bits are commented well), perhaps even with the help of specialized
|
||||
tools. One such tool that is particularly recommended is the Linux
|
||||
Cross-Reference project, which is able to present source code in a
|
||||
self-referential, indexed webpage format. An excellent up-to-date
|
||||
repository of the kernel code may be found at:
|
||||
http://sosdg.org/~coywolf/lxr/
|
||||
|
||||
|
||||
The development process
|
||||
-----------------------
|
||||
|
||||
Linux kernel development process currently consists of a few different
|
||||
main kernel "branches" and lots of different subsystem-specific kernel
|
||||
branches. These different branches are:
|
||||
- main 2.6.x kernel tree
|
||||
- 2.6.x.y -stable kernel tree
|
||||
- 2.6.x -git kernel patches
|
||||
- 2.6.x -mm kernel patches
|
||||
- subsystem specific kernel trees and patches
|
||||
|
||||
2.6.x kernel tree
|
||||
-----------------
|
||||
2.6.x kernels are maintained by Linus Torvalds, and can be found on
|
||||
kernel.org in the pub/linux/kernel/v2.6/ directory. Its development
|
||||
process is as follows:
|
||||
- As soon as a new kernel is released a two weeks window is open,
|
||||
during this period of time maintainers can submit big diffs to
|
||||
Linus, usually the patches that have already been included in the
|
||||
-mm kernel for a few weeks. The preferred way to submit big changes
|
||||
is using git (the kernel's source management tool, more information
|
||||
can be found at http://git.or.cz/) but plain patches are also just
|
||||
fine.
|
||||
- After two weeks a -rc1 kernel is released it is now possible to push
|
||||
only patches that do not include new features that could affect the
|
||||
stability of the whole kernel. Please note that a whole new driver
|
||||
(or filesystem) might be accepted after -rc1 because there is no
|
||||
risk of causing regressions with such a change as long as the change
|
||||
is self-contained and does not affect areas outside of the code that
|
||||
is being added. git can be used to send patches to Linus after -rc1
|
||||
is released, but the patches need to also be sent to a public
|
||||
mailing list for review.
|
||||
- A new -rc is released whenever Linus deems the current git tree to
|
||||
be in a reasonably sane state adequate for testing. The goal is to
|
||||
release a new -rc kernel every week.
|
||||
- Process continues until the kernel is considered "ready", the
|
||||
process should last around 6 weeks.
|
||||
|
||||
It is worth mentioning what Andrew Morton wrote on the linux-kernel
|
||||
mailing list about kernel releases:
|
||||
"Nobody knows when a kernel will be released, because it's
|
||||
released according to perceived bug status, not according to a
|
||||
preconceived timeline."
|
||||
|
||||
2.6.x.y -stable kernel tree
|
||||
---------------------------
|
||||
Kernels with 4 digit versions are -stable kernels. They contain
|
||||
relatively small and critical fixes for security problems or significant
|
||||
regressions discovered in a given 2.6.x kernel.
|
||||
|
||||
This is the recommended branch for users who want the most recent stable
|
||||
kernel and are not interested in helping test development/experimental
|
||||
versions.
|
||||
|
||||
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x
|
||||
kernel is the current stable kernel.
|
||||
|
||||
2.6.x.y are maintained by the "stable" team <stable@kernel.org>, and are
|
||||
released almost every other week.
|
||||
|
||||
The file Documentation/stable_kernel_rules.txt in the kernel tree
|
||||
documents what kinds of changes are acceptable for the -stable tree, and
|
||||
how the release process works.
|
||||
|
||||
2.6.x -git patches
|
||||
------------------
|
||||
These are daily snapshots of Linus' kernel tree which are managed in a
|
||||
git repository (hence the name.) These patches are usually released
|
||||
daily and represent the current state of Linus' tree. They are more
|
||||
experimental than -rc kernels since they are generated automatically
|
||||
without even a cursory glance to see if they are sane.
|
||||
|
||||
2.6.x -mm kernel patches
|
||||
------------------------
|
||||
These are experimental kernel patches released by Andrew Morton. Andrew
|
||||
takes all of the different subsystem kernel trees and patches and mushes
|
||||
them together, along with a lot of patches that have been plucked from
|
||||
the linux-kernel mailing list. This tree serves as a proving ground for
|
||||
new features and patches. Once a patch has proved its worth in -mm for
|
||||
a while Andrew or the subsystem maintainer pushes it on to Linus for
|
||||
inclusion in mainline.
|
||||
|
||||
It is heavily encouraged that all new patches get tested in the -mm tree
|
||||
before they are sent to Linus for inclusion in the main kernel tree.
|
||||
|
||||
These kernels are not appropriate for use on systems that are supposed
|
||||
to be stable and they are more risky to run than any of the other
|
||||
branches.
|
||||
|
||||
If you wish to help out with the kernel development process, please test
|
||||
and use these kernel releases and provide feedback to the linux-kernel
|
||||
mailing list if you have any problems, and if everything works properly.
|
||||
|
||||
In addition to all the other experimental patches, these kernels usually
|
||||
also contain any changes in the mainline -git kernels available at the
|
||||
time of release.
|
||||
|
||||
The -mm kernels are not released on a fixed schedule, but usually a few
|
||||
-mm kernels are released in between each -rc kernel (1 to 3 is common).
|
||||
|
||||
Subsystem Specific kernel trees and patches
|
||||
-------------------------------------------
|
||||
A number of the different kernel subsystem developers expose their
|
||||
development trees so that others can see what is happening in the
|
||||
different areas of the kernel. These trees are pulled into the -mm
|
||||
kernel releases as described above.
|
||||
|
||||
Here is a list of some of the different kernel trees available:
|
||||
git trees:
|
||||
- Kbuild development tree, Sam Ravnborg <sam@ravnborg.org>
|
||||
kernel.org:/pub/scm/linux/kernel/git/sam/kbuild.git
|
||||
|
||||
- ACPI development tree, Len Brown <len.brown@intel.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6.git
|
||||
|
||||
- Block development tree, Jens Axboe <axboe@suse.de>
|
||||
kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
|
||||
|
||||
- DRM development tree, Dave Airlie <airlied@linux.ie>
|
||||
kernel.org:/pub/scm/linux/kernel/git/airlied/drm-2.6.git
|
||||
|
||||
- ia64 development tree, Tony Luck <tony.luck@intel.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6.git
|
||||
|
||||
- ieee1394 development tree, Jody McIntyre <scjody@modernduck.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/scjody/ieee1394.git
|
||||
|
||||
- infiniband, Roland Dreier <rolandd@cisco.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git
|
||||
|
||||
- libata, Jeff Garzik <jgarzik@pobox.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
|
||||
|
||||
- network drivers, Jeff Garzik <jgarzik@pobox.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git
|
||||
|
||||
- pcmcia, Dominik Brodowski <linux@dominikbrodowski.net>
|
||||
kernel.org:/pub/scm/linux/kernel/git/brodo/pcmcia-2.6.git
|
||||
|
||||
- SCSI, James Bottomley <James.Bottomley@SteelEye.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
|
||||
|
||||
Other git kernel trees can be found listed at http://kernel.org/git
|
||||
|
||||
quilt trees:
|
||||
- USB, PCI, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de>
|
||||
kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/
|
||||
|
||||
|
||||
Bug Reporting
|
||||
-------------
|
||||
|
||||
bugzilla.kernel.org is where the Linux kernel developers track kernel
|
||||
bugs. Users are encouraged to report all bugs that they find in this
|
||||
tool. For details on how to use the kernel bugzilla, please see:
|
||||
http://test.kernel.org/bugzilla/faq.html
|
||||
|
||||
The file REPORTING-BUGS in the main kernel source directory has a good
|
||||
template for how to report a possible kernel bug, and details what kind
|
||||
of information is needed by the kernel developers to help track down the
|
||||
problem.
|
||||
|
||||
|
||||
Mailing lists
|
||||
-------------
|
||||
|
||||
As some of the above documents describe, the majority of the core kernel
|
||||
developers participate on the Linux Kernel Mailing list. Details on how
|
||||
to subscribe and unsubscribe from the list can be found at:
|
||||
http://vger.kernel.org/vger-lists.html#linux-kernel
|
||||
There are archives of the mailing list on the web in many different
|
||||
places. Use a search engine to find these archives. For example:
|
||||
http://dir.gmane.org/gmane.linux.kernel
|
||||
It is highly recommended that you search the archives about the topic
|
||||
you want to bring up, before you post it to the list. A lot of things
|
||||
already discussed in detail are only recorded at the mailing list
|
||||
archives.
|
||||
|
||||
Most of the individual kernel subsystems also have their own separate
|
||||
mailing list where they do their development efforts. See the
|
||||
MAINTAINERS file for a list of what these lists are for the different
|
||||
groups.
|
||||
|
||||
Many of the lists are hosted on kernel.org. Information on them can be
|
||||
found at:
|
||||
http://vger.kernel.org/vger-lists.html
|
||||
|
||||
Please remember to follow good behavioral habits when using the lists.
|
||||
Though a bit cheesy, the following URL has some simple guidelines for
|
||||
interacting with the list (or any list):
|
||||
http://www.albion.com/netiquette/
|
||||
|
||||
If multiple people respond to your mail, the CC: list of recipients may
|
||||
get pretty large. Don't remove anybody from the CC: list without a good
|
||||
reason, or don't reply only to the list address. Get used to receiving the
|
||||
mail twice, one from the sender and the one from the list, and don't try
|
||||
to tune that by adding fancy mail-headers, people will not like it.
|
||||
|
||||
Remember to keep the context and the attribution of your replies intact,
|
||||
keep the "John Kernelhacker wrote ...:" lines at the top of your reply, and
|
||||
add your statements between the individual quoted sections instead of
|
||||
writing at the top of the mail.
|
||||
|
||||
If you add patches to your mail, make sure they are plain readable text
|
||||
as stated in Documentation/SubmittingPatches. Kernel developers don't
|
||||
want to deal with attachments or compressed patches; they may want
|
||||
to comment on individual lines of your patch, which works only that way.
|
||||
Make sure you use a mail program that does not mangle spaces and tab
|
||||
characters. A good first test is to send the mail to yourself and try
|
||||
to apply your own patch by yourself. If that doesn't work, get your
|
||||
mail program fixed or change it until it works.
|
||||
|
||||
Above all, please remember to show respect to other subscribers.
|
||||
|
||||
|
||||
Working with the community
|
||||
--------------------------
|
||||
|
||||
The goal of the kernel community is to provide the best possible kernel
|
||||
there is. When you submit a patch for acceptance, it will be reviewed
|
||||
on its technical merits and those alone. So, what should you be
|
||||
expecting?
|
||||
- criticism
|
||||
- comments
|
||||
- requests for change
|
||||
- requests for justification
|
||||
- silence
|
||||
|
||||
Remember, this is part of getting your patch into the kernel. You have
|
||||
to be able to take criticism and comments about your patches, evaluate
|
||||
them at a technical level and either rework your patches or provide
|
||||
clear and concise reasoning as to why those changes should not be made.
|
||||
If there are no responses to your posting, wait a few days and try
|
||||
again, sometimes things get lost in the huge volume.
|
||||
|
||||
What should you not do?
|
||||
- expect your patch to be accepted without question
|
||||
- become defensive
|
||||
- ignore comments
|
||||
- resubmit the patch without making any of the requested changes
|
||||
|
||||
In a community that is looking for the best technical solution possible,
|
||||
there will always be differing opinions on how beneficial a patch is.
|
||||
You have to be cooperative, and willing to adapt your idea to fit within
|
||||
the kernel. Or at least be willing to prove your idea is worth it.
|
||||
Remember, being wrong is acceptable as long as you are willing to work
|
||||
toward a solution that is right.
|
||||
|
||||
It is normal that the answers to your first patch might simply be a list
|
||||
of a dozen things you should correct. This does _not_ imply that your
|
||||
patch will not be accepted, and it is _not_ meant against you
|
||||
personally. Simply correct all issues raised against your patch and
|
||||
resend it.
|
||||
|
||||
|
||||
Differences between the kernel community and corporate structures
|
||||
-----------------------------------------------------------------
|
||||
|
||||
The kernel community works differently than most traditional corporate
|
||||
development environments. Here are a list of things that you can try to
|
||||
do to try to avoid problems:
|
||||
Good things to say regarding your proposed changes:
|
||||
- "This solves multiple problems."
|
||||
- "This deletes 2000 lines of code."
|
||||
- "Here is a patch that explains what I am trying to describe."
|
||||
- "I tested it on 5 different architectures..."
|
||||
- "Here is a series of small patches that..."
|
||||
- "This increases performance on typical machines..."
|
||||
|
||||
Bad things you should avoid saying:
|
||||
- "We did it this way in AIX/ptx/Solaris, so therefore it must be
|
||||
good..."
|
||||
- "I've being doing this for 20 years, so..."
|
||||
- "This is required for my company to make money"
|
||||
- "This is for our Enterprise product line."
|
||||
- "Here is my 1000 page design document that describes my idea"
|
||||
- "I've been working on this for 6 months..."
|
||||
- "Here's a 5000 line patch that..."
|
||||
- "I rewrote all of the current mess, and here it is..."
|
||||
- "I have a deadline, and this patch needs to be applied now."
|
||||
|
||||
Another way the kernel community is different than most traditional
|
||||
software engineering work environments is the faceless nature of
|
||||
interaction. One benefit of using email and irc as the primary forms of
|
||||
communication is the lack of discrimination based on gender or race.
|
||||
The Linux kernel work environment is accepting of women and minorities
|
||||
because all you are is an email address. The international aspect also
|
||||
helps to level the playing field because you can't guess gender based on
|
||||
a person's name. A man may be named Andrea and a woman may be named Pat.
|
||||
Most women who have worked in the Linux kernel and have expressed an
|
||||
opinion have had positive experiences.
|
||||
|
||||
The language barrier can cause problems for some people who are not
|
||||
comfortable with English. A good grasp of the language can be needed in
|
||||
order to get ideas across properly on mailing lists, so it is
|
||||
recommended that you check your emails to make sure they make sense in
|
||||
English before sending them.
|
||||
|
||||
|
||||
Break up your changes
|
||||
---------------------
|
||||
|
||||
The Linux kernel community does not gladly accept large chunks of code
|
||||
dropped on it all at once. The changes need to be properly introduced,
|
||||
discussed, and broken up into tiny, individual portions. This is almost
|
||||
the exact opposite of what companies are used to doing. Your proposal
|
||||
should also be introduced very early in the development process, so that
|
||||
you can receive feedback on what you are doing. It also lets the
|
||||
community feel that you are working with them, and not simply using them
|
||||
as a dumping ground for your feature. However, don't send 50 emails at
|
||||
one time to a mailing list, your patch series should be smaller than
|
||||
that almost all of the time.
|
||||
|
||||
The reasons for breaking things up are the following:
|
||||
|
||||
1) Small patches increase the likelihood that your patches will be
|
||||
applied, since they don't take much time or effort to verify for
|
||||
correctness. A 5 line patch can be applied by a maintainer with
|
||||
barely a second glance. However, a 500 line patch may take hours to
|
||||
review for correctness (the time it takes is exponentially
|
||||
proportional to the size of the patch, or something).
|
||||
|
||||
Small patches also make it very easy to debug when something goes
|
||||
wrong. It's much easier to back out patches one by one than it is
|
||||
to dissect a very large patch after it's been applied (and broken
|
||||
something).
|
||||
|
||||
2) It's important not only to send small patches, but also to rewrite
|
||||
and simplify (or simply re-order) patches before submitting them.
|
||||
|
||||
Here is an analogy from kernel developer Al Viro:
|
||||
"Think of a teacher grading homework from a math student. The
|
||||
teacher does not want to see the student's trials and errors
|
||||
before they came up with the solution. They want to see the
|
||||
cleanest, most elegant answer. A good student knows this, and
|
||||
would never submit her intermediate work before the final
|
||||
solution."
|
||||
|
||||
The same is true of kernel development. The maintainers and
|
||||
reviewers do not want to see the thought process behind the
|
||||
solution to the problem one is solving. They want to see a
|
||||
simple and elegant solution."
|
||||
|
||||
It may be challenging to keep the balance between presenting an elegant
|
||||
solution and working together with the community and discussing your
|
||||
unfinished work. Therefore it is good to get early in the process to
|
||||
get feedback to improve your work, but also keep your changes in small
|
||||
chunks that they may get already accepted, even when your whole task is
|
||||
not ready for inclusion now.
|
||||
|
||||
Also realize that it is not acceptable to send patches for inclusion
|
||||
that are unfinished and will be "fixed up later."
|
||||
|
||||
|
||||
Justify your change
|
||||
-------------------
|
||||
|
||||
Along with breaking up your patches, it is very important for you to let
|
||||
the Linux community know why they should add this change. New features
|
||||
must be justified as being needed and useful.
|
||||
|
||||
|
||||
Document your change
|
||||
--------------------
|
||||
|
||||
When sending in your patches, pay special attention to what you say in
|
||||
the text in your email. This information will become the ChangeLog
|
||||
information for the patch, and will be preserved for everyone to see for
|
||||
all time. It should describe the patch completely, containing:
|
||||
- why the change is necessary
|
||||
- the overall design approach in the patch
|
||||
- implementation details
|
||||
- testing results
|
||||
|
||||
For more details on what this should all look like, please see the
|
||||
ChangeLog section of the document:
|
||||
"The Perfect Patch"
|
||||
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
|
||||
|
||||
|
||||
|
||||
|
||||
All of these things are sometimes very hard to do. It can take years to
|
||||
perfect these practices (if at all). It's a continuous process of
|
||||
improvement that requires a lot of patience and determination. But
|
||||
don't give up, it's possible. Many have done it before, and each had to
|
||||
start exactly where you are now.
|
||||
|
||||
|
||||
|
||||
|
||||
----------
|
||||
Thanks to Paolo Ciarrocchi who allowed the "Development Process" section
|
||||
to be based on text he had written, and to Randy Dunlap and Gerrit
|
||||
Huizenga for some of the list of things you should and should not say.
|
||||
Also thanks to Pat Mochel, Hanna Linder, Randy Dunlap, Kay Sievers,
|
||||
Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook, Andrew Morton, Andi
|
||||
Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk, Keri Harris, Frans Pop,
|
||||
David A. Wheeler, Junio Hamano, Michael Kerrisk, and Alex Shepard for
|
||||
their review, comments, and contributions. Without their help, this
|
||||
document would not have been possible.
|
||||
|
||||
|
||||
|
||||
Maintainer: Greg Kroah-Hartman <greg@kroah.com>
|
@ -1,74 +1,67 @@
|
||||
Refcounter framework for elements of lists/arrays protected by
|
||||
RCU.
|
||||
Refcounter design for elements of lists/arrays protected by RCU.
|
||||
|
||||
Refcounting on elements of lists which are protected by traditional
|
||||
reader/writer spinlocks or semaphores are straight forward as in:
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object read_lock(&list_lock);
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
write_lock(&list_lock); ...
|
||||
add_element read_unlock(&list_lock);
|
||||
... ...
|
||||
write_unlock(&list_lock); }
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object read_lock(&list_lock);
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
write_lock(&list_lock); ...
|
||||
add_element read_unlock(&list_lock);
|
||||
... ...
|
||||
write_unlock(&list_lock); }
|
||||
}
|
||||
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
atomic_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
... write_lock(&list_lock);
|
||||
atomic_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
|
||||
If this list/array is made lock free using rcu as in changing the
|
||||
write_lock in add() and delete() to spin_lock and changing read_lock
|
||||
in search_and_reference to rcu_read_lock(), the rcuref_get in
|
||||
in search_and_reference to rcu_read_lock(), the atomic_get in
|
||||
search_and_reference could potentially hold reference to an element which
|
||||
has already been deleted from the list/array. rcuref_lf_get_rcu takes
|
||||
has already been deleted from the list/array. atomic_inc_not_zero takes
|
||||
care of this scenario. search_and_reference should look as;
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) {
|
||||
write_lock(&list_lock); rcu_read_unlock();
|
||||
return FAIL;
|
||||
add_element }
|
||||
... ...
|
||||
write_unlock(&list_lock); rcu_read_unlock();
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); if (atomic_inc_not_zero(&el->rc)) {
|
||||
write_lock(&list_lock); rcu_read_unlock();
|
||||
return FAIL;
|
||||
add_element }
|
||||
... ...
|
||||
write_unlock(&list_lock); rcu_read_unlock();
|
||||
} }
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
rcuref_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (rcuref_dec_and_test(&el->rc))
|
||||
call_rcu(&el->head, el_free);
|
||||
...
|
||||
... write_lock(&list_lock);
|
||||
atomic_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
call_rcu(&el->head, el_free);
|
||||
...
|
||||
}
|
||||
|
||||
Sometimes, reference to the element need to be obtained in the
|
||||
update (write) stream. In such cases, rcuref_inc_lf might be an overkill
|
||||
since the spinlock serialising list updates are held. rcuref_inc
|
||||
update (write) stream. In such cases, atomic_inc_not_zero might be an
|
||||
overkill since the spinlock serialising list updates are held. atomic_inc
|
||||
is to be used in such cases.
|
||||
For arches which do not have cmpxchg rcuref_inc_lf
|
||||
api uses a hashed spinlock implementation and the same hashed spinlock
|
||||
is acquired in all rcuref_xxx primitives to preserve atomicity.
|
||||
Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the
|
||||
refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api
|
||||
might lead to races. rcuref_inc_lf() must be used in lockfree
|
||||
RCU critical sections only.
|
||||
|
||||
|
@ -27,18 +27,17 @@ Who To Submit Drivers To
|
||||
------------------------
|
||||
|
||||
Linux 2.0:
|
||||
No new drivers are accepted for this kernel tree
|
||||
No new drivers are accepted for this kernel tree.
|
||||
|
||||
Linux 2.2:
|
||||
No new drivers are accepted for this kernel tree.
|
||||
|
||||
Linux 2.4:
|
||||
If the code area has a general maintainer then please submit it to
|
||||
the maintainer listed in MAINTAINERS in the kernel file. If the
|
||||
maintainer does not respond or you cannot find the appropriate
|
||||
maintainer then please contact the 2.2 kernel maintainer:
|
||||
Marc-Christian Petersen <m.c.p@wolk-project.de>.
|
||||
|
||||
Linux 2.4:
|
||||
The same rules apply as 2.2. The final contact point for Linux 2.4
|
||||
submissions is Marcelo Tosatti <marcelo.tosatti@cyclades.com>.
|
||||
maintainer then please contact Marcelo Tosatti
|
||||
<marcelo.tosatti@cyclades.com>.
|
||||
|
||||
Linux 2.6:
|
||||
The same rules apply as 2.4 except that you should follow linux-kernel
|
||||
@ -53,6 +52,7 @@ Licensing: The code must be released to us under the
|
||||
of exclusive GPL licensing, and if you wish the driver
|
||||
to be useful to other communities such as BSD you may well
|
||||
wish to release under multiple licenses.
|
||||
See accepted licenses at include/linux/module.h
|
||||
|
||||
Copyright: The copyright owner must agree to use of GPL.
|
||||
It's best if the submitter and copyright owner
|
||||
@ -143,5 +143,13 @@ KernelNewbies:
|
||||
http://kernelnewbies.org/
|
||||
|
||||
Linux USB project:
|
||||
http://sourceforge.net/projects/linux-usb/
|
||||
http://www.linux-usb.org/
|
||||
|
||||
How to NOT write kernel driver by arjanv@redhat.com
|
||||
http://people.redhat.com/arjanv/olspaper.pdf
|
||||
|
||||
Kernel Janitor:
|
||||
http://janitor.kernelnewbies.org/
|
||||
|
||||
--
|
||||
Last updated on 17 Nov 2005.
|
||||
|
@ -78,7 +78,9 @@ Randy Dunlap's patch scripts:
|
||||
http://www.xenotime.net/linux/scripts/patching-scripts-002.tar.gz
|
||||
|
||||
Andrew Morton's patch scripts:
|
||||
http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.20
|
||||
http://www.zip.com.au/~akpm/linux/patches/
|
||||
Instead of these scripts, quilt is the recommended patch management
|
||||
tool (see above).
|
||||
|
||||
|
||||
|
||||
@ -97,7 +99,7 @@ need to split up your patch. See #3, next.
|
||||
|
||||
3) Separate your changes.
|
||||
|
||||
Separate each logical change into its own patch.
|
||||
Separate _logical changes_ into a single patch file.
|
||||
|
||||
For example, if your changes include both bug fixes and performance
|
||||
enhancements for a single driver, separate those changes into two
|
||||
@ -112,6 +114,10 @@ If one patch depends on another patch in order for a change to be
|
||||
complete, that is OK. Simply note "this patch depends on patch X"
|
||||
in your patch description.
|
||||
|
||||
If you cannot condense your patch set into a smaller set of patches,
|
||||
then only post say 15 or so at a time and wait for review and integration.
|
||||
|
||||
|
||||
|
||||
4) Select e-mail destination.
|
||||
|
||||
@ -124,6 +130,10 @@ your patch to the primary Linux kernel developer's mailing list,
|
||||
linux-kernel@vger.kernel.org. Most kernel developers monitor this
|
||||
e-mail list, and can comment on your changes.
|
||||
|
||||
|
||||
Do not send more than 15 patches at once to the vger mailing lists!!!
|
||||
|
||||
|
||||
Linus Torvalds is the final arbiter of all changes accepted into the
|
||||
Linux kernel. His e-mail address is <torvalds@osdl.org>. He gets
|
||||
a lot of e-mail, so typically you should do your best to -avoid- sending
|
||||
@ -149,6 +159,9 @@ USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the
|
||||
MAINTAINERS file for a mailing list that relates specifically to
|
||||
your change.
|
||||
|
||||
Majordomo lists of VGER.KERNEL.ORG at:
|
||||
<http://vger.kernel.org/vger-lists.html>
|
||||
|
||||
If changes affect userland-kernel interfaces, please send
|
||||
the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
|
||||
a man-pages patch, or at least a notification of the change,
|
||||
@ -158,7 +171,7 @@ Even if the maintainer did not respond in step #4, make sure to ALWAYS
|
||||
copy the maintainer when you change their code.
|
||||
|
||||
For small patches you may want to CC the Trivial Patch Monkey
|
||||
trivial@rustcorp.com.au set up by Rusty Russell; which collects "trivial"
|
||||
trivial@kernel.org managed by Adrian Bunk; which collects "trivial"
|
||||
patches. Trivial patches must qualify for one of the following rules:
|
||||
Spelling fixes in documentation
|
||||
Spelling fixes which could break grep(1).
|
||||
@ -171,7 +184,7 @@ patches. Trivial patches must qualify for one of the following rules:
|
||||
since people copy, as long as it's trivial)
|
||||
Any fix by the author/maintainer of the file. (ie. patch monkey
|
||||
in re-transmission mode)
|
||||
URL: <http://www.kernel.org/pub/linux/kernel/people/rusty/trivial/>
|
||||
URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/>
|
||||
|
||||
|
||||
|
||||
@ -373,27 +386,14 @@ a diffstat, to show what files have changed, and the number of inserted
|
||||
and deleted lines per file. A diffstat is especially useful on bigger
|
||||
patches. Other comments relevant only to the moment or the maintainer,
|
||||
not suitable for the permanent changelog, should also go here.
|
||||
Use diffstat options "-p 1 -w 70" so that filenames are listed from the
|
||||
top of the kernel source tree and don't use too much horizontal space
|
||||
(easily fit in 80 columns, maybe with some indentation).
|
||||
|
||||
See more details on the proper patch format in the following
|
||||
references.
|
||||
|
||||
|
||||
13) More references for submitting patches
|
||||
|
||||
Andrew Morton, "The perfect patch" (tpp).
|
||||
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
|
||||
|
||||
Jeff Garzik, "Linux kernel patch submission format."
|
||||
<http://linux.yyz.us/patch-format.html>
|
||||
|
||||
Greg KH, "How to piss off a kernel subsystem maintainer"
|
||||
<http://www.kroah.com/log/2005/03/31/>
|
||||
|
||||
Kernel Documentation/CodingStyle
|
||||
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
|
||||
|
||||
Linus Torvald's mail on the canonical patch format:
|
||||
<http://lkml.org/lkml/2005/4/7/183>
|
||||
|
||||
|
||||
-----------------------------------
|
||||
@ -466,3 +466,31 @@ and 'extern __inline__'.
|
||||
Don't try to anticipate nebulous future cases which may or may not
|
||||
be useful: "Make it as simple as you can, and no simpler."
|
||||
|
||||
|
||||
|
||||
----------------------
|
||||
SECTION 3 - REFERENCES
|
||||
----------------------
|
||||
|
||||
Andrew Morton, "The perfect patch" (tpp).
|
||||
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
|
||||
|
||||
Jeff Garzik, "Linux kernel patch submission format."
|
||||
<http://linux.yyz.us/patch-format.html>
|
||||
|
||||
Greg Kroah-Hartman "How to piss off a kernel subsystem maintainer".
|
||||
<http://www.kroah.com/log/2005/03/31/>
|
||||
<http://www.kroah.com/log/2005/07/08/>
|
||||
<http://www.kroah.com/log/2005/10/19/>
|
||||
<http://www.kroah.com/log/2006/01/11/>
|
||||
|
||||
NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!.
|
||||
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2>
|
||||
|
||||
Kernel Documentation/CodingStyle
|
||||
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
|
||||
|
||||
Linus Torvald's mail on the canonical patch format:
|
||||
<http://lkml.org/lkml/2005/4/7/183>
|
||||
--
|
||||
Last updated on 17 Nov 2005.
|
||||
|
@ -2,8 +2,8 @@
|
||||
Applying Patches To The Linux Kernel
|
||||
------------------------------------
|
||||
|
||||
(Written by Jesper Juhl, August 2005)
|
||||
|
||||
Original by: Jesper Juhl, August 2005
|
||||
Last update: 2006-01-05
|
||||
|
||||
|
||||
A frequently asked question on the Linux Kernel Mailing List is how to apply
|
||||
@ -76,7 +76,7 @@ instead:
|
||||
|
||||
If you wish to uncompress the patch file by hand first before applying it
|
||||
(what I assume you've done in the examples below), then you simply run
|
||||
gunzip or bunzip2 on the file - like this:
|
||||
gunzip or bunzip2 on the file -- like this:
|
||||
gunzip patch-x.y.z.gz
|
||||
bunzip2 patch-x.y.z.bz2
|
||||
|
||||
@ -94,7 +94,7 @@ Common errors when patching
|
||||
---
|
||||
When patch applies a patch file it attempts to verify the sanity of the
|
||||
file in different ways.
|
||||
Checking that the file looks like a valid patch file, checking the code
|
||||
Checking that the file looks like a valid patch file & checking the code
|
||||
around the bits being modified matches the context provided in the patch are
|
||||
just two of the basic sanity checks patch does.
|
||||
|
||||
@ -118,16 +118,16 @@ wrong.
|
||||
|
||||
When patch encounters a change that it can't fix up with fuzz it rejects it
|
||||
outright and leaves a file with a .rej extension (a reject file). You can
|
||||
read this file to see exactely what change couldn't be applied, so you can
|
||||
read this file to see exactly what change couldn't be applied, so you can
|
||||
go fix it up by hand if you wish.
|
||||
|
||||
If you don't have any third party patches applied to your kernel source, but
|
||||
If you don't have any third-party patches applied to your kernel source, but
|
||||
only patches from kernel.org and you apply the patches in the correct order,
|
||||
and have made no modifications yourself to the source files, then you should
|
||||
never see a fuzz or reject message from patch. If you do see such messages
|
||||
anyway, then there's a high risk that either your local source tree or the
|
||||
patch file is corrupted in some way. In that case you should probably try
|
||||
redownloading the patch and if things are still not OK then you'd be advised
|
||||
re-downloading the patch and if things are still not OK then you'd be advised
|
||||
to start with a fresh tree downloaded in full from kernel.org.
|
||||
|
||||
Let's look a bit more at some of the messages patch can produce.
|
||||
@ -136,7 +136,7 @@ If patch stops and presents a "File to patch:" prompt, then patch could not
|
||||
find a file to be patched. Most likely you forgot to specify -p1 or you are
|
||||
in the wrong directory. Less often, you'll find patches that need to be
|
||||
applied with -p0 instead of -p1 (reading the patch file should reveal if
|
||||
this is the case - if so, then this is an error by the person who created
|
||||
this is the case -- if so, then this is an error by the person who created
|
||||
the patch but is not fatal).
|
||||
|
||||
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
|
||||
@ -167,22 +167,28 @@ the patch will in fact apply it.
|
||||
|
||||
A message similar to "patch: **** unexpected end of file in patch" or "patch
|
||||
unexpectedly ends in middle of line" means that patch could make no sense of
|
||||
the file you fed to it. Either your download is broken or you tried to feed
|
||||
patch a compressed patch file without uncompressing it first.
|
||||
the file you fed to it. Either your download is broken, you tried to feed
|
||||
patch a compressed patch file without uncompressing it first, or the patch
|
||||
file that you are using has been mangled by a mail client or mail transfer
|
||||
agent along the way somewhere, e.g., by splitting a long line into two lines.
|
||||
Often these warnings can easily be fixed by joining (concatenating) the
|
||||
two lines that had been split.
|
||||
|
||||
As I already mentioned above, these errors should never happen if you apply
|
||||
a patch from kernel.org to the correct version of an unmodified source tree.
|
||||
So if you get these errors with kernel.org patches then you should probably
|
||||
assume that either your patch file or your tree is broken and I'd advice you
|
||||
assume that either your patch file or your tree is broken and I'd advise you
|
||||
to start over with a fresh download of a full kernel tree and the patch you
|
||||
wish to apply.
|
||||
|
||||
|
||||
Are there any alternatives to `patch'?
|
||||
---
|
||||
Yes there are alternatives. You can use the `interdiff' program
|
||||
(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
|
||||
differences between two patches and then apply the result.
|
||||
Yes there are alternatives.
|
||||
|
||||
You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to
|
||||
generate a patch representing the differences between two patches and then
|
||||
apply the result.
|
||||
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
|
||||
step. The -z flag to interdiff will even let you feed it patches in gzip or
|
||||
bzip2 compressed form directly without the use of zcat or bzcat or manual
|
||||
@ -197,10 +203,10 @@ do the additional steps since interdiff can get things wrong in some cases.
|
||||
Another alternative is `ketchup', which is a python script for automatic
|
||||
downloading and applying of patches (http://www.selenic.com/ketchup/).
|
||||
|
||||
Other nice tools are diffstat which shows a summary of changes made by a
|
||||
patch, lsdiff which displays a short listing of affected files in a patch
|
||||
file, along with (optionally) the line numbers of the start of each patch
|
||||
and grepdiff which displays a list of the files modified by a patch where
|
||||
Other nice tools are diffstat, which shows a summary of changes made by a
|
||||
patch; lsdiff, which displays a short listing of affected files in a patch
|
||||
file, along with (optionally) the line numbers of the start of each patch;
|
||||
and grepdiff, which displays a list of the files modified by a patch where
|
||||
the patch contains a given regular expression.
|
||||
|
||||
|
||||
@ -225,8 +231,8 @@ The -mm kernels live at
|
||||
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
|
||||
country code. This way you'll be downloading from a mirror site that's most
|
||||
likely geographically closer to you, resulting in faster downloads for you,
|
||||
less bandwidth used globally and less load on the main kernel.org servers -
|
||||
these are good things, do use mirrors when possible.
|
||||
less bandwidth used globally and less load on the main kernel.org servers --
|
||||
these are good things, so do use mirrors when possible.
|
||||
|
||||
|
||||
The 2.6.x kernels
|
||||
@ -234,14 +240,14 @@ The 2.6.x kernels
|
||||
These are the base stable releases released by Linus. The highest numbered
|
||||
release is the most recent.
|
||||
|
||||
If regressions or other serious flaws are found then a -stable fix patch
|
||||
If regressions or other serious flaws are found, then a -stable fix patch
|
||||
will be released (see below) on top of this base. Once a new 2.6.x base
|
||||
kernel is released, a patch is made available that is a delta between the
|
||||
previous 2.6.x kernel and the new one.
|
||||
|
||||
To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note
|
||||
To apply a patch moving from 2.6.11 to 2.6.12, you'd do the following (note
|
||||
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
|
||||
base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to
|
||||
base 2.6.x kernel -- if you need to move from 2.6.x.y to 2.6.x+1 you need to
|
||||
first revert the 2.6.x.y patch).
|
||||
|
||||
Here are some examples:
|
||||
@ -258,12 +264,12 @@ $ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
|
||||
# source dir is now 2.6.11
|
||||
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.11.1 inux-2.6.12 # rename source dir
|
||||
$ mv linux-2.6.11.1 linux-2.6.12 # rename source dir
|
||||
|
||||
|
||||
The 2.6.x.y kernels
|
||||
---
|
||||
Kernels with 4 digit versions are -stable kernels. They contain small(ish)
|
||||
Kernels with 4-digit versions are -stable kernels. They contain small(ish)
|
||||
critical fixes for security problems or significant regressions discovered
|
||||
in a given 2.6.x kernel.
|
||||
|
||||
@ -274,9 +280,14 @@ versions.
|
||||
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
|
||||
the current stable kernel.
|
||||
|
||||
note: the -stable team usually do make incremental patches available as well
|
||||
as patches against the latest mainline release, but I only cover the
|
||||
non-incremental ones below. The incremental ones can be found at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/incr/
|
||||
|
||||
These patches are not incremental, meaning that for example the 2.6.12.3
|
||||
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
|
||||
of the base 2.6.12 kernel source.
|
||||
of the base 2.6.12 kernel source .
|
||||
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
|
||||
source you have to first back out the 2.6.12.2 patch (so you are left with a
|
||||
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
|
||||
@ -342,12 +353,12 @@ The -git kernels
|
||||
repository, hence the name).
|
||||
|
||||
These patches are usually released daily and represent the current state of
|
||||
Linus' tree. They are more experimental than -rc kernels since they are
|
||||
Linus's tree. They are more experimental than -rc kernels since they are
|
||||
generated automatically without even a cursory glance to see if they are
|
||||
sane.
|
||||
|
||||
-git patches are not incremental and apply either to a base 2.6.x kernel or
|
||||
a base 2.6.x-rc kernel - you can see which from their name.
|
||||
a base 2.6.x-rc kernel -- you can see which from their name.
|
||||
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
|
||||
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
|
||||
|
||||
@ -390,12 +401,12 @@ You should generally strive to get your patches into mainline via -mm to
|
||||
ensure maximum testing.
|
||||
|
||||
This branch is in constant flux and contains many experimental features, a
|
||||
lot of debugging patches not appropriate for mainline etc and is the most
|
||||
lot of debugging patches not appropriate for mainline etc., and is the most
|
||||
experimental of the branches described in this document.
|
||||
|
||||
These kernels are not appropriate for use on systems that are supposed to be
|
||||
stable and they are more risky to run than any of the other branches (make
|
||||
sure you have up-to-date backups - that goes for any experimental kernel but
|
||||
sure you have up-to-date backups -- that goes for any experimental kernel but
|
||||
even more so for -mm kernels).
|
||||
|
||||
These kernels in addition to all the other experimental patches they contain
|
||||
@ -433,7 +444,11 @@ $ cd ..
|
||||
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
|
||||
|
||||
|
||||
This concludes this list of explanations of the various kernel trees and I
|
||||
hope you are now crystal clear on how to apply the various patches and help
|
||||
testing the kernel.
|
||||
This concludes this list of explanations of the various kernel trees.
|
||||
I hope you are now clear on how to apply the various patches and help testing
|
||||
the kernel.
|
||||
|
||||
Thank you's to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert,
|
||||
Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have
|
||||
forgotten for their reviews and contributions to this document.
|
||||
|
||||
|
@ -16,5 +16,7 @@ empeg
|
||||
- Empeg documentation
|
||||
mem_alignment
|
||||
- alignment abort handler documentation
|
||||
memory.txt
|
||||
- description of the virtual memory layout
|
||||
nwfpe
|
||||
- NWFPE floating point emulator documentation
|
||||
|
@ -12,7 +12,7 @@ This release has been validated against the SoftFloat-2b library by
|
||||
John R. Hauser using the TestFloat-2a test suite. Details of this
|
||||
library and test suite can be found at:
|
||||
|
||||
http://www.cs.berkeley.edu/~jhauser/arithmetic/SoftFloat.html
|
||||
http://www.jhauser.us/arithmetic/SoftFloat.html
|
||||
|
||||
The operations which have been tested with this package are:
|
||||
|
||||
|
@ -1,7 +1,7 @@
|
||||
Kernel Memory Layout on ARM Linux
|
||||
|
||||
Russell King <rmk@arm.linux.org.uk>
|
||||
May 21, 2004 (2.6.6)
|
||||
November 17, 2005 (2.6.15)
|
||||
|
||||
This document describes the virtual memory layout which the Linux
|
||||
kernel uses for ARM processors. It indicates which regions are
|
||||
@ -37,6 +37,8 @@ ff000000 ffbfffff Reserved for future expansion of DMA
|
||||
mapping region.
|
||||
|
||||
VMALLOC_END feffffff Free for platform use, recommended.
|
||||
VMALLOC_END must be aligned to a 2MB
|
||||
boundary.
|
||||
|
||||
VMALLOC_START VMALLOC_END-1 vmalloc() / ioremap() space.
|
||||
Memory returned by vmalloc/ioremap will
|
||||
|
@ -115,6 +115,33 @@ boolean is return which indicates whether the resulting counter value
|
||||
is negative. It requires explicit memory barrier semantics around the
|
||||
operation.
|
||||
|
||||
Then:
|
||||
|
||||
int atomic_cmpxchg(atomic_t *v, int old, int new);
|
||||
|
||||
This performs an atomic compare exchange operation on the atomic value v,
|
||||
with the given old and new values. Like all atomic_xxx operations,
|
||||
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
|
||||
other accesses of *v are performed through atomic_xxx operations.
|
||||
|
||||
atomic_cmpxchg requires explicit memory barriers around the operation.
|
||||
|
||||
The semantics for atomic_cmpxchg are the same as those defined for 'cas'
|
||||
below.
|
||||
|
||||
Finally:
|
||||
|
||||
int atomic_add_unless(atomic_t *v, int a, int u);
|
||||
|
||||
If the atomic value v is not equal to u, this function adds a to v, and
|
||||
returns non zero. If v is equal to u then it returns zero. This is done as
|
||||
an atomic operation.
|
||||
|
||||
atomic_add_unless requires explicit memory barriers around the operation.
|
||||
|
||||
atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)
|
||||
|
||||
|
||||
If a caller requires memory barrier semantics around an atomic_t
|
||||
operation which does not return a value, a set of interfaces are
|
||||
defined which accomplish this:
|
||||
|
271
Documentation/block/barrier.txt
Normal file
271
Documentation/block/barrier.txt
Normal file
@ -0,0 +1,271 @@
|
||||
I/O Barriers
|
||||
============
|
||||
Tejun Heo <htejun@gmail.com>, July 22 2005
|
||||
|
||||
I/O barrier requests are used to guarantee ordering around the barrier
|
||||
requests. Unless you're crazy enough to use disk drives for
|
||||
implementing synchronization constructs (wow, sounds interesting...),
|
||||
the ordering is meaningful only for write requests for things like
|
||||
journal checkpoints. All requests queued before a barrier request
|
||||
must be finished (made it to the physical medium) before the barrier
|
||||
request is started, and all requests queued after the barrier request
|
||||
must be started only after the barrier request is finished (again,
|
||||
made it to the physical medium).
|
||||
|
||||
In other words, I/O barrier requests have the following two properties.
|
||||
|
||||
1. Request ordering
|
||||
|
||||
Requests cannot pass the barrier request. Preceding requests are
|
||||
processed before the barrier and following requests after.
|
||||
|
||||
Depending on what features a drive supports, this can be done in one
|
||||
of the following three ways.
|
||||
|
||||
i. For devices which have queue depth greater than 1 (TCQ devices) and
|
||||
support ordered tags, block layer can just issue the barrier as an
|
||||
ordered request and the lower level driver, controller and drive
|
||||
itself are responsible for making sure that the ordering contraint is
|
||||
met. Most modern SCSI controllers/drives should support this.
|
||||
|
||||
NOTE: SCSI ordered tag isn't currently used due to limitation in the
|
||||
SCSI midlayer, see the following random notes section.
|
||||
|
||||
ii. For devices which have queue depth greater than 1 but don't
|
||||
support ordered tags, block layer ensures that the requests preceding
|
||||
a barrier request finishes before issuing the barrier request. Also,
|
||||
it defers requests following the barrier until the barrier request is
|
||||
finished. Older SCSI controllers/drives and SATA drives fall in this
|
||||
category.
|
||||
|
||||
iii. Devices which have queue depth of 1. This is a degenerate case
|
||||
of ii. Just keeping issue order suffices. Ancient SCSI
|
||||
controllers/drives and IDE drives are in this category.
|
||||
|
||||
2. Forced flushing to physcial medium
|
||||
|
||||
Again, if you're not gonna do synchronization with disk drives (dang,
|
||||
it sounds even more appealing now!), the reason you use I/O barriers
|
||||
is mainly to protect filesystem integrity when power failure or some
|
||||
other events abruptly stop the drive from operating and possibly make
|
||||
the drive lose data in its cache. So, I/O barriers need to guarantee
|
||||
that requests actually get written to non-volatile medium in order.
|
||||
|
||||
There are four cases,
|
||||
|
||||
i. No write-back cache. Keeping requests ordered is enough.
|
||||
|
||||
ii. Write-back cache but no flush operation. There's no way to
|
||||
gurantee physical-medium commit order. This kind of devices can't to
|
||||
I/O barriers.
|
||||
|
||||
iii. Write-back cache and flush operation but no FUA (forced unit
|
||||
access). We need two cache flushes - before and after the barrier
|
||||
request.
|
||||
|
||||
iv. Write-back cache, flush operation and FUA. We still need one
|
||||
flush to make sure requests preceding a barrier are written to medium,
|
||||
but post-barrier flush can be avoided by using FUA write on the
|
||||
barrier itself.
|
||||
|
||||
|
||||
How to support barrier requests in drivers
|
||||
------------------------------------------
|
||||
|
||||
All barrier handling is done inside block layer proper. All low level
|
||||
drivers have to are implementing its prepare_flush_fn and using one
|
||||
the following two functions to indicate what barrier type it supports
|
||||
and how to prepare flush requests. Note that the term 'ordered' is
|
||||
used to indicate the whole sequence of performing barrier requests
|
||||
including draining and flushing.
|
||||
|
||||
typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
|
||||
|
||||
int blk_queue_ordered(request_queue_t *q, unsigned ordered,
|
||||
prepare_flush_fn *prepare_flush_fn,
|
||||
unsigned gfp_mask);
|
||||
|
||||
int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
|
||||
prepare_flush_fn *prepare_flush_fn,
|
||||
unsigned gfp_mask);
|
||||
|
||||
The only difference between the two functions is whether or not the
|
||||
caller is holding q->queue_lock on entry. The latter expects the
|
||||
caller is holding the lock.
|
||||
|
||||
@q : the queue in question
|
||||
@ordered : the ordered mode the driver/device supports
|
||||
@prepare_flush_fn : this function should prepare @rq such that it
|
||||
flushes cache to physical medium when executed
|
||||
@gfp_mask : gfp_mask used when allocating data structures
|
||||
for ordered processing
|
||||
|
||||
For example, SCSI disk driver's prepare_flush_fn looks like the
|
||||
following.
|
||||
|
||||
static void sd_prepare_flush(request_queue_t *q, struct request *rq)
|
||||
{
|
||||
memset(rq->cmd, 0, sizeof(rq->cmd));
|
||||
rq->flags |= REQ_BLOCK_PC;
|
||||
rq->timeout = SD_TIMEOUT;
|
||||
rq->cmd[0] = SYNCHRONIZE_CACHE;
|
||||
}
|
||||
|
||||
The following seven ordered modes are supported. The following table
|
||||
shows which mode should be used depending on what features a
|
||||
device/driver supports. In the leftmost column of table,
|
||||
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
|
||||
|
||||
The table is followed by description of each mode. Note that in the
|
||||
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
|
||||
used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
|
||||
preceding step must be complete before proceeding to the next step.
|
||||
'->' indicates that the next step can start as soon as the previous
|
||||
step is issued.
|
||||
|
||||
write-back cache ordered tag flush FUA
|
||||
-----------------------------------------------------------------------
|
||||
NONE yes/no N/A no N/A
|
||||
DRAIN no no N/A N/A
|
||||
DRAIN_FLUSH yes no yes no
|
||||
DRAIN_FUA yes no yes yes
|
||||
TAG no yes N/A N/A
|
||||
TAG_FLUSH yes yes yes no
|
||||
TAG_FUA yes yes yes yes
|
||||
|
||||
|
||||
QUEUE_ORDERED_NONE
|
||||
I/O barriers are not needed and/or supported.
|
||||
|
||||
Sequence: N/A
|
||||
|
||||
QUEUE_ORDERED_DRAIN
|
||||
Requests are ordered by draining the request queue and cache
|
||||
flushing isn't needed.
|
||||
|
||||
Sequence: drain => barrier
|
||||
|
||||
QUEUE_ORDERED_DRAIN_FLUSH
|
||||
Requests are ordered by draining the request queue and both
|
||||
pre-barrier and post-barrier cache flushings are needed.
|
||||
|
||||
Sequence: drain => preflush => barrier => postflush
|
||||
|
||||
QUEUE_ORDERED_DRAIN_FUA
|
||||
Requests are ordered by draining the request queue and
|
||||
pre-barrier cache flushing is needed. By using FUA on barrier
|
||||
request, post-barrier flushing can be skipped.
|
||||
|
||||
Sequence: drain => preflush => barrier
|
||||
|
||||
QUEUE_ORDERED_TAG
|
||||
Requests are ordered by ordered tag and cache flushing isn't
|
||||
needed.
|
||||
|
||||
Sequence: barrier
|
||||
|
||||
QUEUE_ORDERED_TAG_FLUSH
|
||||
Requests are ordered by ordered tag and both pre-barrier and
|
||||
post-barrier cache flushings are needed.
|
||||
|
||||
Sequence: preflush -> barrier -> postflush
|
||||
|
||||
QUEUE_ORDERED_TAG_FUA
|
||||
Requests are ordered by ordered tag and pre-barrier cache
|
||||
flushing is needed. By using FUA on barrier request,
|
||||
post-barrier flushing can be skipped.
|
||||
|
||||
Sequence: preflush -> barrier
|
||||
|
||||
|
||||
Random notes/caveats
|
||||
--------------------
|
||||
|
||||
* SCSI layer currently can't use TAG ordering even if the drive,
|
||||
controller and driver support it. The problem is that SCSI midlayer
|
||||
request dispatch function is not atomic. It releases queue lock and
|
||||
switch to SCSI host lock during issue and it's possible and likely to
|
||||
happen in time that requests change their relative positions. Once
|
||||
this problem is solved, TAG ordering can be enabled.
|
||||
|
||||
* Currently, no matter which ordered mode is used, there can be only
|
||||
one barrier request in progress. All I/O barriers are held off by
|
||||
block layer until the previous I/O barrier is complete. This doesn't
|
||||
make any difference for DRAIN ordered devices, but, for TAG ordered
|
||||
devices with very high command latency, passing multiple I/O barriers
|
||||
to low level *might* be helpful if they are very frequent. Well, this
|
||||
certainly is a non-issue. I'm writing this just to make clear that no
|
||||
two I/O barrier is ever passed to low-level driver.
|
||||
|
||||
* Completion order. Requests in ordered sequence are issued in order
|
||||
but not required to finish in order. Barrier implementation can
|
||||
handle out-of-order completion of ordered sequence. IOW, the requests
|
||||
MUST be processed in order but the hardware/software completion paths
|
||||
are allowed to reorder completion notifications - eg. current SCSI
|
||||
midlayer doesn't preserve completion order during error handling.
|
||||
|
||||
* Requeueing order. Low-level drivers are free to requeue any request
|
||||
after they removed it from the request queue with
|
||||
blkdev_dequeue_request(). As barrier sequence should be kept in order
|
||||
when requeued, generic elevator code takes care of putting requests in
|
||||
order around barrier. See blk_ordered_req_seq() and
|
||||
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
|
||||
|
||||
Note that block drivers must not requeue preceding requests while
|
||||
completing latter requests in an ordered sequence. Currently, no
|
||||
error checking is done against this.
|
||||
|
||||
* Error handling. Currently, block layer will report error to upper
|
||||
layer if any of requests in an ordered sequence fails. Unfortunately,
|
||||
this doesn't seem to be enough. Look at the following request flow.
|
||||
QUEUE_ORDERED_TAG_FLUSH is in use.
|
||||
|
||||
[0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
|
||||
still in elevator
|
||||
|
||||
Let's say request [2], [3] are write requests to update file system
|
||||
metadata (journal or whatever) and [barrier] is used to mark that
|
||||
those updates are valid. Consider the following sequence.
|
||||
|
||||
i. Requests [0] ~ [post] leaves the request queue and enters
|
||||
low-level driver.
|
||||
ii. After a while, unfortunately, something goes wrong and the
|
||||
drive fails [2]. Note that any of [0], [1] and [3] could have
|
||||
completed by this time, but [pre] couldn't have been finished
|
||||
as the drive must process it in order and it failed before
|
||||
processing that command.
|
||||
iii. Error handling kicks in and determines that the error is
|
||||
unrecoverable and fails [2], and resumes operation.
|
||||
iv. [pre] [barrier] [post] gets processed.
|
||||
v. *BOOM* power fails
|
||||
|
||||
The problem here is that the barrier request is *supposed* to indicate
|
||||
that filesystem update requests [2] and [3] made it safely to the
|
||||
physical medium and, if the machine crashes after the barrier is
|
||||
written, filesystem recovery code can depend on that. Sadly, that
|
||||
isn't true in this case anymore. IOW, the success of a I/O barrier
|
||||
should also be dependent on success of some of the preceding requests,
|
||||
where only upper layer (filesystem) knows what 'some' is.
|
||||
|
||||
This can be solved by implementing a way to tell the block layer which
|
||||
requests affect the success of the following barrier request and
|
||||
making lower lever drivers to resume operation on error only after
|
||||
block layer tells it to do so.
|
||||
|
||||
As the probability of this happening is very low and the drive should
|
||||
be faulty, implementing the fix is probably an overkill. But, still,
|
||||
it's there.
|
||||
|
||||
* In previous drafts of barrier implementation, there was fallback
|
||||
mechanism such that, if FUA or ordered TAG fails, less fancy ordered
|
||||
mode can be selected and the failed barrier request is retried
|
||||
automatically. The rationale for this feature was that as FUA is
|
||||
pretty new in ATA world and ordered tag was never used widely, there
|
||||
could be devices which report to support those features but choke when
|
||||
actually given such requests.
|
||||
|
||||
This was removed for two reasons 1. it's an overkill 2. it's
|
||||
impossible to implement properly when TAG ordering is used as low
|
||||
level drivers resume after an error automatically. If it's ever
|
||||
needed adding it back and modifying low level drivers accordingly
|
||||
shouldn't be difficult.
|
@ -31,7 +31,7 @@ The following people helped with review comments and inputs for this
|
||||
document:
|
||||
Christoph Hellwig <hch@infradead.org>
|
||||
Arjan van de Ven <arjanv@redhat.com>
|
||||
Randy Dunlap <rddunlap@osdl.org>
|
||||
Randy Dunlap <rdunlap@xenotime.net>
|
||||
Andre Hedrick <andre@linux-ide.org>
|
||||
|
||||
The following people helped with fixes/contributions to the bio patches
|
||||
@ -263,14 +263,8 @@ A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
|
||||
The generic i/o scheduler would make sure that it places the barrier request and
|
||||
all other requests coming after it after all the previous requests in the
|
||||
queue. Barriers may be implemented in different ways depending on the
|
||||
driver. A SCSI driver for example could make use of ordered tags to
|
||||
preserve the necessary ordering with a lower impact on throughput. For IDE
|
||||
this might be two sync cache flush: a pre and post flush when encountering
|
||||
a barrier write.
|
||||
|
||||
There is a provision for queues to indicate what kind of barriers they
|
||||
can provide. This is as of yet unmerged, details will be added here once it
|
||||
is in the kernel.
|
||||
driver. For more details regarding I/O barriers, please read barrier.txt
|
||||
in this directory.
|
||||
|
||||
1.2.2 Request Priority/Latency
|
||||
|
||||
@ -1063,8 +1057,8 @@ Aside:
|
||||
4.4 I/O contexts
|
||||
I/O contexts provide a dynamically allocated per process data area. They may
|
||||
be used in I/O schedulers, and in the block layer (could be used for IO statis,
|
||||
priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and
|
||||
as-iosched.c for an example of usage in an i/o scheduler.
|
||||
priorities for example). See *io_context in block/ll_rw_blk.c, and as-iosched.c
|
||||
for an example of usage in an i/o scheduler.
|
||||
|
||||
|
||||
5. Scalability related changes
|
||||
|
82
Documentation/block/stat.txt
Normal file
82
Documentation/block/stat.txt
Normal file
@ -0,0 +1,82 @@
|
||||
Block layer statistics in /sys/block/<dev>/stat
|
||||
===============================================
|
||||
|
||||
This file documents the contents of the /sys/block/<dev>/stat file.
|
||||
|
||||
The stat file provides several statistics about the state of block
|
||||
device <dev>.
|
||||
|
||||
Q. Why are there multiple statistics in a single file? Doesn't sysfs
|
||||
normally contain a single value per file?
|
||||
A. By having a single file, the kernel can guarantee that the statistics
|
||||
represent a consistent snapshot of the state of the device. If the
|
||||
statistics were exported as multiple files containing one statistic
|
||||
each, it would be impossible to guarantee that a set of readings
|
||||
represent a single point in time.
|
||||
|
||||
The stat file consists of a single line of text containing 11 decimal
|
||||
values separated by whitespace. The fields are summarized in the
|
||||
following table, and described in more detail below.
|
||||
|
||||
Name units description
|
||||
---- ----- -----------
|
||||
read I/Os requests number of read I/Os processed
|
||||
read merges requests number of read I/Os merged with in-queue I/O
|
||||
read sectors sectors number of sectors read
|
||||
read ticks milliseconds total wait time for read requests
|
||||
write I/Os requests number of write I/Os processed
|
||||
write merges requests number of write I/Os merged with in-queue I/O
|
||||
write sectors sectors number of sectors written
|
||||
write ticks milliseconds total wait time for write requests
|
||||
in_flight requests number of I/Os currently in flight
|
||||
io_ticks milliseconds total time this block device has been active
|
||||
time_in_queue milliseconds total wait time for all requests
|
||||
|
||||
read I/Os, write I/Os
|
||||
=====================
|
||||
|
||||
These values increment when an I/O request completes.
|
||||
|
||||
read merges, write merges
|
||||
=========================
|
||||
|
||||
These values increment when an I/O request is merged with an
|
||||
already-queued I/O request.
|
||||
|
||||
read sectors, write sectors
|
||||
===========================
|
||||
|
||||
These values count the number of sectors read from or written to this
|
||||
block device. The "sectors" in question are the standard UNIX 512-byte
|
||||
sectors, not any device- or filesystem-specific block size. The
|
||||
counters are incremented when the I/O completes.
|
||||
|
||||
read ticks, write ticks
|
||||
=======================
|
||||
|
||||
These values count the number of milliseconds that I/O requests have
|
||||
waited on this block device. If there are multiple I/O requests waiting,
|
||||
these values will increase at a rate greater than 1000/second; for
|
||||
example, if 60 read requests wait for an average of 30 ms, the read_ticks
|
||||
field will increase by 60*30 = 1800.
|
||||
|
||||
in_flight
|
||||
=========
|
||||
|
||||
This value counts the number of I/O requests that have been issued to
|
||||
the device driver but have not yet completed. It does not include I/O
|
||||
requests that are in the queue but not yet issued to the device driver.
|
||||
|
||||
io_ticks
|
||||
========
|
||||
|
||||
This value counts the number of milliseconds during which the device has
|
||||
had I/O requests queued.
|
||||
|
||||
time_in_queue
|
||||
=============
|
||||
|
||||
This value counts the number of milliseconds that I/O requests have waited
|
||||
on this block device. If there are multiple I/O requests waiting, this
|
||||
value will increase as the product of the number of milliseconds times the
|
||||
number of requests waiting (see "read ticks" above for an example).
|
@ -136,7 +136,7 @@ changes occur:
|
||||
8) void lazy_mmu_prot_update(pte_t pte)
|
||||
This interface is called whenever the protection on
|
||||
any user PTEs change. This interface provides a notification
|
||||
to architecture specific code to take appropiate action.
|
||||
to architecture specific code to take appropriate action.
|
||||
|
||||
|
||||
Next, we have the cache flushing interfaces. In general, when Linux
|
||||
|
@ -133,3 +133,32 @@ hardware and it is important to prevent the kernel from attempting to directly
|
||||
access these devices too, as if the array controller were merely a SCSI
|
||||
controller in the same way that we are allowing it to access SCSI tape drives.
|
||||
|
||||
SCSI error handling for tape drives and medium changers
|
||||
-------------------------------------------------------
|
||||
|
||||
The linux SCSI mid layer provides an error handling protocol which
|
||||
kicks into gear whenever a SCSI command fails to complete within a
|
||||
certain amount of time (which can vary depending on the command).
|
||||
The cciss driver participates in this protocol to some extent. The
|
||||
normal protocol is a four step process. First the device is told
|
||||
to abort the command. If that doesn't work, the device is reset.
|
||||
If that doesn't work, the SCSI bus is reset. If that doesn't work
|
||||
the host bus adapter is reset. Because the cciss driver is a block
|
||||
driver as well as a SCSI driver and only the tape drives and medium
|
||||
changers are presented to the SCSI mid layer, and unlike more
|
||||
straightforward SCSI drivers, disk i/o continues through the block
|
||||
side during the SCSI error recovery process, the cciss driver only
|
||||
implements the first two of these actions, aborting the command, and
|
||||
resetting the device. Additionally, most tape drives will not oblige
|
||||
in aborting commands, and sometimes it appears they will not even
|
||||
obey a reset coommand, though in most circumstances they will. In
|
||||
the case that the command cannot be aborted and the device cannot be
|
||||
reset, the device will be set offline.
|
||||
|
||||
In the event the error handling code is triggered and a tape drive is
|
||||
successfully reset or the tardy command is successfully aborted, the
|
||||
tape drive may still not allow i/o to continue until some command
|
||||
is issued which positions the tape to a known position. Typically you
|
||||
must rewind the tape (by issuing "mt -f /dev/st0 rewind" for example)
|
||||
before i/o can proceed again to a tape drive which was reset.
|
||||
|
||||
|
@ -27,6 +27,7 @@ Contents:
|
||||
2.2 Powersave
|
||||
2.3 Userspace
|
||||
2.4 Ondemand
|
||||
2.5 Conservative
|
||||
|
||||
3. The Governor Interface in the CPUfreq Core
|
||||
|
||||
@ -110,9 +111,64 @@ directory.
|
||||
|
||||
The CPUfreq govenor "ondemand" sets the CPU depending on the
|
||||
current usage. To do this the CPU must have the capability to
|
||||
switch the frequency very fast.
|
||||
switch the frequency very quickly. There are a number of sysfs file
|
||||
accessible parameters:
|
||||
|
||||
sampling_rate: measured in uS (10^-6 seconds), this is how often you
|
||||
want the kernel to look at the CPU usage and to make decisions on
|
||||
what to do about the frequency. Typically this is set to values of
|
||||
around '10000' or more.
|
||||
|
||||
show_sampling_rate_(min|max): the minimum and maximum sampling rates
|
||||
available that you may set 'sampling_rate' to.
|
||||
|
||||
up_threshold: defines what the average CPU usaged between the samplings
|
||||
of 'sampling_rate' needs to be for the kernel to make a decision on
|
||||
whether it should increase the frequency. For example when it is set
|
||||
to its default value of '80' it means that between the checking
|
||||
intervals the CPU needs to be on average more than 80% in use to then
|
||||
decide that the CPU frequency needs to be increased.
|
||||
|
||||
sampling_down_factor: this parameter controls the rate that the CPU
|
||||
makes a decision on when to decrease the frequency. When set to its
|
||||
default value of '5' it means that at 1/5 the sampling_rate the kernel
|
||||
makes a decision to lower the frequency. Five "lower rate" decisions
|
||||
have to be made in a row before the CPU frequency is actually lower.
|
||||
If set to '1' then the frequency decreases as quickly as it increases,
|
||||
if set to '2' it decreases at half the rate of the increase.
|
||||
|
||||
ignore_nice_load: this parameter takes a value of '0' or '1', when set
|
||||
to '0' (its default) then all processes are counted towards towards the
|
||||
'cpu utilisation' value. When set to '1' then processes that are
|
||||
run with a 'nice' value will not count (and thus be ignored) in the
|
||||
overal usage calculation. This is useful if you are running a CPU
|
||||
intensive calculation on your laptop that you do not care how long it
|
||||
takes to complete as you can 'nice' it and prevent it from taking part
|
||||
in the deciding process of whether to increase your CPU frequency.
|
||||
|
||||
|
||||
2.5 Conservative
|
||||
----------------
|
||||
|
||||
The CPUfreq governor "conservative", much like the "ondemand"
|
||||
governor, sets the CPU depending on the current usage. It differs in
|
||||
behaviour in that it gracefully increases and decreases the CPU speed
|
||||
rather than jumping to max speed the moment there is any load on the
|
||||
CPU. This behaviour more suitable in a battery powered environment.
|
||||
The governor is tweaked in the same manner as the "ondemand" governor
|
||||
through sysfs with the addition of:
|
||||
|
||||
freq_step: this describes what percentage steps the cpu freq should be
|
||||
increased and decreased smoothly by. By default the cpu frequency will
|
||||
increase in 5% chunks of your maximum cpu frequency. You can change this
|
||||
value to anywhere between 0 and 100 where '0' will effectively lock your
|
||||
CPU at a speed regardless of its load whilst '100' will, in theory, make
|
||||
it behave identically to the "ondemand" governor.
|
||||
|
||||
down_threshold: same as the 'up_threshold' found for the "ondemand"
|
||||
governor but for the opposite direction. For example when set to its
|
||||
default value of '20' it means that if the CPU usage needs to be below
|
||||
20% between samples to have the frequency decreased.
|
||||
|
||||
3. The Governor Interface in the CPUfreq Core
|
||||
=============================================
|
||||
|
357
Documentation/cpu-hotplug.txt
Normal file
357
Documentation/cpu-hotplug.txt
Normal file
@ -0,0 +1,357 @@
|
||||
CPU hotplug Support in Linux(tm) Kernel
|
||||
|
||||
Maintainers:
|
||||
CPU Hotplug Core:
|
||||
Rusty Russell <rusty@rustycorp.com.au>
|
||||
Srivatsa Vaddagiri <vatsa@in.ibm.com>
|
||||
i386:
|
||||
Zwane Mwaikambo <zwane@arm.linux.org.uk>
|
||||
ppc64:
|
||||
Nathan Lynch <nathanl@austin.ibm.com>
|
||||
Joel Schopp <jschopp@austin.ibm.com>
|
||||
ia64/x86_64:
|
||||
Ashok Raj <ashok.raj@intel.com>
|
||||
|
||||
Authors: Ashok Raj <ashok.raj@intel.com>
|
||||
Lots of feedback: Nathan Lynch <nathanl@austin.ibm.com>,
|
||||
Joel Schopp <jschopp@austin.ibm.com>
|
||||
|
||||
Introduction
|
||||
|
||||
Modern advances in system architectures have introduced advanced error
|
||||
reporting and correction capabilities in processors. CPU architectures permit
|
||||
partitioning support, where compute resources of a single CPU could be made
|
||||
available to virtual machine environments. There are couple OEMS that
|
||||
support NUMA hardware which are hot pluggable as well, where physical
|
||||
node insertion and removal require support for CPU hotplug.
|
||||
|
||||
Such advances require CPUs available to a kernel to be removed either for
|
||||
provisioning reasons, or for RAS purposes to keep an offending CPU off
|
||||
system execution path. Hence the need for CPU hotplug support in the
|
||||
Linux kernel.
|
||||
|
||||
A more novel use of CPU-hotplug support is its use today in suspend
|
||||
resume support for SMP. Dual-core and HT support makes even
|
||||
a laptop run SMP kernels which didn't support these methods. SMP support
|
||||
for suspend/resume is a work in progress.
|
||||
|
||||
General Stuff about CPU Hotplug
|
||||
--------------------------------
|
||||
|
||||
Command Line Switches
|
||||
---------------------
|
||||
maxcpus=n Restrict boot time cpus to n. Say if you have 4 cpus, using
|
||||
maxcpus=2 will only boot 2. You can choose to bring the
|
||||
other cpus later online, read FAQ's for more info.
|
||||
|
||||
additional_cpus=n [x86_64 only] use this to limit hotpluggable cpus.
|
||||
This option sets
|
||||
cpu_possible_map = cpu_present_map + additional_cpus
|
||||
|
||||
CPU maps and such
|
||||
-----------------
|
||||
[More on cpumaps and primitive to manipulate, please check
|
||||
include/linux/cpumask.h that has more descriptive text.]
|
||||
|
||||
cpu_possible_map: Bitmap of possible CPUs that can ever be available in the
|
||||
system. This is used to allocate some boot time memory for per_cpu variables
|
||||
that aren't designed to grow/shrink as CPUs are made available or removed.
|
||||
Once set during boot time discovery phase, the map is static, i.e no bits
|
||||
are added or removed anytime. Trimming it accurately for your system needs
|
||||
upfront can save some boot time memory. See below for how we use heuristics
|
||||
in x86_64 case to keep this under check.
|
||||
|
||||
cpu_online_map: Bitmap of all CPUs currently online. Its set in __cpu_up()
|
||||
after a cpu is available for kernel scheduling and ready to receive
|
||||
interrupts from devices. Its cleared when a cpu is brought down using
|
||||
__cpu_disable(), before which all OS services including interrupts are
|
||||
migrated to another target CPU.
|
||||
|
||||
cpu_present_map: Bitmap of CPUs currently present in the system. Not all
|
||||
of them may be online. When physical hotplug is processed by the relevant
|
||||
subsystem (e.g ACPI) can change and new bit either be added or removed
|
||||
from the map depending on the event is hot-add/hot-remove. There are currently
|
||||
no locking rules as of now. Typical usage is to init topology during boot,
|
||||
at which time hotplug is disabled.
|
||||
|
||||
You really dont need to manipulate any of the system cpu maps. They should
|
||||
be read-only for most use. When setting up per-cpu resources almost always use
|
||||
cpu_possible_map/for_each_cpu() to iterate.
|
||||
|
||||
Never use anything other than cpumask_t to represent bitmap of CPUs.
|
||||
|
||||
#include <linux/cpumask.h>
|
||||
|
||||
for_each_cpu - Iterate over cpu_possible_map
|
||||
for_each_online_cpu - Iterate over cpu_online_map
|
||||
for_each_present_cpu - Iterate over cpu_present_map
|
||||
for_each_cpu_mask(x,mask) - Iterate over some random collection of cpu mask.
|
||||
|
||||
#include <linux/cpu.h>
|
||||
lock_cpu_hotplug() and unlock_cpu_hotplug():
|
||||
|
||||
The above calls are used to inhibit cpu hotplug operations. While holding the
|
||||
cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
|
||||
cpus going away, you could also use preempt_disable() and preempt_enable()
|
||||
for those sections. Just remember the critical section cannot call any
|
||||
function that can sleep or schedule this process away. The preempt_disable()
|
||||
will work as long as stop_machine_run() is used to take a cpu down.
|
||||
|
||||
CPU Hotplug - Frequently Asked Questions.
|
||||
|
||||
Q: How to i enable my kernel to support CPU hotplug?
|
||||
A: When doing make defconfig, Enable CPU hotplug support
|
||||
|
||||
"Processor type and Features" -> Support for Hotpluggable CPUs
|
||||
|
||||
Make sure that you have CONFIG_HOTPLUG, and CONFIG_SMP turned on as well.
|
||||
|
||||
You would need to enable CONFIG_HOTPLUG_CPU for SMP suspend/resume support
|
||||
as well.
|
||||
|
||||
Q: What architectures support CPU hotplug?
|
||||
A: As of 2.6.14, the following architectures support CPU hotplug.
|
||||
|
||||
i386 (Intel), ppc, ppc64, parisc, s390, ia64 and x86_64
|
||||
|
||||
Q: How to test if hotplug is supported on the newly built kernel?
|
||||
A: You should now notice an entry in sysfs.
|
||||
|
||||
Check if sysfs is mounted, using the "mount" command. You should notice
|
||||
an entry as shown below in the output.
|
||||
|
||||
....
|
||||
none on /sys type sysfs (rw)
|
||||
....
|
||||
|
||||
if this is not mounted, do the following.
|
||||
|
||||
#mkdir /sysfs
|
||||
#mount -t sysfs sys /sys
|
||||
|
||||
now you should see entries for all present cpu, the following is an example
|
||||
in a 8-way system.
|
||||
|
||||
#pwd
|
||||
#/sys/devices/system/cpu
|
||||
#ls -l
|
||||
total 0
|
||||
drwxr-xr-x 10 root root 0 Sep 19 07:44 .
|
||||
drwxr-xr-x 13 root root 0 Sep 19 07:45 ..
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu0
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu1
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu2
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu3
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu4
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu5
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:44 cpu6
|
||||
drwxr-xr-x 3 root root 0 Sep 19 07:48 cpu7
|
||||
|
||||
Under each directory you would find an "online" file which is the control
|
||||
file to logically online/offline a processor.
|
||||
|
||||
Q: Does hot-add/hot-remove refer to physical add/remove of cpus?
|
||||
A: The usage of hot-add/remove may not be very consistently used in the code.
|
||||
CONFIG_CPU_HOTPLUG enables logical online/offline capability in the kernel.
|
||||
To support physical addition/removal, one would need some BIOS hooks and
|
||||
the platform should have something like an attention button in PCI hotplug.
|
||||
CONFIG_ACPI_HOTPLUG_CPU enables ACPI support for physical add/remove of CPUs.
|
||||
|
||||
Q: How do i logically offline a CPU?
|
||||
A: Do the following.
|
||||
|
||||
#echo 0 > /sys/devices/system/cpu/cpuX/online
|
||||
|
||||
once the logical offline is successful, check
|
||||
|
||||
#cat /proc/interrupts
|
||||
|
||||
you should now not see the CPU that you removed. Also online file will report
|
||||
the state as 0 when a cpu if offline and 1 when its online.
|
||||
|
||||
#To display the current cpu state.
|
||||
#cat /sys/devices/system/cpu/cpuX/online
|
||||
|
||||
Q: Why cant i remove CPU0 on some systems?
|
||||
A: Some architectures may have some special dependency on a certain CPU.
|
||||
|
||||
For e.g in IA64 platforms we have ability to sent platform interrupts to the
|
||||
OS. a.k.a Corrected Platform Error Interrupts (CPEI). In current ACPI
|
||||
specifications, we didn't have a way to change the target CPU. Hence if the
|
||||
current ACPI version doesn't support such re-direction, we disable that CPU
|
||||
by making it not-removable.
|
||||
|
||||
In such cases you will also notice that the online file is missing under cpu0.
|
||||
|
||||
Q: How do i find out if a particular CPU is not removable?
|
||||
A: Depending on the implementation, some architectures may show this by the
|
||||
absence of the "online" file. This is done if it can be determined ahead of
|
||||
time that this CPU cannot be removed.
|
||||
|
||||
In some situations, this can be a run time check, i.e if you try to remove the
|
||||
last CPU, this will not be permitted. You can find such failures by
|
||||
investigating the return value of the "echo" command.
|
||||
|
||||
Q: What happens when a CPU is being logically offlined?
|
||||
A: The following happen, listed in no particular order :-)
|
||||
|
||||
- A notification is sent to in-kernel registered modules by sending an event
|
||||
CPU_DOWN_PREPARE
|
||||
- All process is migrated away from this outgoing CPU to a new CPU
|
||||
- All interrupts targeted to this CPU is migrated to a new CPU
|
||||
- timers/bottom half/task lets are also migrated to a new CPU
|
||||
- Once all services are migrated, kernel calls an arch specific routine
|
||||
__cpu_disable() to perform arch specific cleanup.
|
||||
- Once this is successful, an event for successful cleanup is sent by an event
|
||||
CPU_DEAD.
|
||||
|
||||
"It is expected that each service cleans up when the CPU_DOWN_PREPARE
|
||||
notifier is called, when CPU_DEAD is called its expected there is nothing
|
||||
running on behalf of this CPU that was offlined"
|
||||
|
||||
Q: If i have some kernel code that needs to be aware of CPU arrival and
|
||||
departure, how to i arrange for proper notification?
|
||||
A: This is what you would need in your kernel code to receive notifications.
|
||||
|
||||
#include <linux/cpu.h>
|
||||
static int __cpuinit foobar_cpu_callback(struct notifier_block *nfb,
|
||||
unsigned long action, void *hcpu)
|
||||
{
|
||||
unsigned int cpu = (unsigned long)hcpu;
|
||||
|
||||
switch (action) {
|
||||
case CPU_ONLINE:
|
||||
foobar_online_action(cpu);
|
||||
break;
|
||||
case CPU_DEAD:
|
||||
foobar_dead_action(cpu);
|
||||
break;
|
||||
}
|
||||
return NOTIFY_OK;
|
||||
}
|
||||
|
||||
static struct notifier_block foobar_cpu_notifer =
|
||||
{
|
||||
.notifier_call = foobar_cpu_callback,
|
||||
};
|
||||
|
||||
|
||||
In your init function,
|
||||
|
||||
register_cpu_notifier(&foobar_cpu_notifier);
|
||||
|
||||
You can fail PREPARE notifiers if something doesn't work to prepare resources.
|
||||
This will stop the activity and send a following CANCELED event back.
|
||||
|
||||
CPU_DEAD should not be failed, its just a goodness indication, but bad
|
||||
things will happen if a notifier in path sent a BAD notify code.
|
||||
|
||||
Q: I don't see my action being called for all CPUs already up and running?
|
||||
A: Yes, CPU notifiers are called only when new CPUs are on-lined or offlined.
|
||||
If you need to perform some action for each cpu already in the system, then
|
||||
|
||||
for_each_online_cpu(i) {
|
||||
foobar_cpu_callback(&foobar_cpu_notifier, CPU_UP_PREPARE, i);
|
||||
foobar_cpu_callback(&foobar-cpu_notifier, CPU_ONLINE, i);
|
||||
}
|
||||
|
||||
Q: If i would like to develop cpu hotplug support for a new architecture,
|
||||
what do i need at a minimum?
|
||||
A: The following are what is required for CPU hotplug infrastructure to work
|
||||
correctly.
|
||||
|
||||
- Make sure you have an entry in Kconfig to enable CONFIG_HOTPLUG_CPU
|
||||
- __cpu_up() - Arch interface to bring up a CPU
|
||||
- __cpu_disable() - Arch interface to shutdown a CPU, no more interrupts
|
||||
can be handled by the kernel after the routine
|
||||
returns. Including local APIC timers etc are
|
||||
shutdown.
|
||||
- __cpu_die() - This actually supposed to ensure death of the CPU.
|
||||
Actually look at some example code in other arch
|
||||
that implement CPU hotplug. The processor is taken
|
||||
down from the idle() loop for that specific
|
||||
architecture. __cpu_die() typically waits for some
|
||||
per_cpu state to be set, to ensure the processor
|
||||
dead routine is called to be sure positively.
|
||||
|
||||
Q: I need to ensure that a particular cpu is not removed when there is some
|
||||
work specific to this cpu is in progress.
|
||||
A: First switch the current thread context to preferred cpu
|
||||
|
||||
int my_func_on_cpu(int cpu)
|
||||
{
|
||||
cpumask_t saved_mask, new_mask = CPU_MASK_NONE;
|
||||
int curr_cpu, err = 0;
|
||||
|
||||
saved_mask = current->cpus_allowed;
|
||||
cpu_set(cpu, new_mask);
|
||||
err = set_cpus_allowed(current, new_mask);
|
||||
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
/*
|
||||
* If we got scheduled out just after the return from
|
||||
* set_cpus_allowed() before running the work, this ensures
|
||||
* we stay locked.
|
||||
*/
|
||||
curr_cpu = get_cpu();
|
||||
|
||||
if (curr_cpu != cpu) {
|
||||
err = -EAGAIN;
|
||||
goto ret;
|
||||
} else {
|
||||
/*
|
||||
* Do work : But cant sleep, since get_cpu() disables preempt
|
||||
*/
|
||||
}
|
||||
ret:
|
||||
put_cpu();
|
||||
set_cpus_allowed(current, saved_mask);
|
||||
return err;
|
||||
}
|
||||
|
||||
|
||||
Q: How do we determine how many CPUs are available for hotplug.
|
||||
A: There is no clear spec defined way from ACPI that can give us that
|
||||
information today. Based on some input from Natalie of Unisys,
|
||||
that the ACPI MADT (Multiple APIC Description Tables) marks those possible
|
||||
CPUs in a system with disabled status.
|
||||
|
||||
Andi implemented some simple heuristics that count the number of disabled
|
||||
CPUs in MADT as hotpluggable CPUS. In the case there are no disabled CPUS
|
||||
we assume 1/2 the number of CPUs currently present can be hotplugged.
|
||||
|
||||
Caveat: Today's ACPI MADT can only provide 256 entries since the apicid field
|
||||
in MADT is only 8 bits.
|
||||
|
||||
User Space Notification
|
||||
|
||||
Hotplug support for devices is common in Linux today. Its being used today to
|
||||
support automatic configuration of network, usb and pci devices. A hotplug
|
||||
event can be used to invoke an agent script to perform the configuration task.
|
||||
|
||||
You can add /etc/hotplug/cpu.agent to handle hotplug notification user space
|
||||
scripts.
|
||||
|
||||
#!/bin/bash
|
||||
# $Id: cpu.agent
|
||||
# Kernel hotplug params include:
|
||||
#ACTION=%s [online or offline]
|
||||
#DEVPATH=%s
|
||||
#
|
||||
cd /etc/hotplug
|
||||
. ./hotplug.functions
|
||||
|
||||
case $ACTION in
|
||||
online)
|
||||
echo `date` ":cpu.agent" add cpu >> /tmp/hotplug.txt
|
||||
;;
|
||||
offline)
|
||||
echo `date` ":cpu.agent" remove cpu >>/tmp/hotplug.txt
|
||||
;;
|
||||
*)
|
||||
debug_mesg CPU $ACTION event not supported
|
||||
exit 1
|
||||
;;
|
||||
esac
|
@ -14,7 +14,10 @@ CONTENTS:
|
||||
1.1 What are cpusets ?
|
||||
1.2 Why are cpusets needed ?
|
||||
1.3 How are cpusets implemented ?
|
||||
1.4 How do I use cpusets ?
|
||||
1.4 What are exclusive cpusets ?
|
||||
1.5 What does notify_on_release do ?
|
||||
1.6 What is memory_pressure ?
|
||||
1.7 How do I use cpusets ?
|
||||
2. Usage Examples and Syntax
|
||||
2.1 Basic Usage
|
||||
2.2 Adding/removing cpus
|
||||
@ -49,29 +52,6 @@ its cpus_allowed vector, and the kernel page allocator will not
|
||||
allocate a page on a node that is not allowed in the requesting tasks
|
||||
mems_allowed vector.
|
||||
|
||||
If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
|
||||
ancestor or descendent, may share any of the same CPUs or Memory Nodes.
|
||||
A cpuset that is cpu exclusive has a sched domain associated with it.
|
||||
The sched domain consists of all cpus in the current cpuset that are not
|
||||
part of any exclusive child cpusets.
|
||||
This ensures that the scheduler load balacing code only balances
|
||||
against the cpus that are in the sched domain as defined above and not
|
||||
all of the cpus in the system. This removes any overhead due to
|
||||
load balancing code trying to pull tasks outside of the cpu exclusive
|
||||
cpuset only to be prevented by the tasks' cpus_allowed mask.
|
||||
|
||||
A cpuset that is mem_exclusive restricts kernel allocations for
|
||||
page, buffer and other data commonly shared by the kernel across
|
||||
multiple users. All cpusets, whether mem_exclusive or not, restrict
|
||||
allocations of memory for user space. This enables configuring a
|
||||
system so that several independent jobs can share common kernel
|
||||
data, such as file system pages, while isolating each jobs user
|
||||
allocation in its own cpuset. To do this, construct a large
|
||||
mem_exclusive cpuset to hold all the jobs, and construct child,
|
||||
non-mem_exclusive cpusets for each individual job. Only a small
|
||||
amount of typical kernel memory, such as requests from interrupt
|
||||
handlers, is allowed to be taken outside even a mem_exclusive cpuset.
|
||||
|
||||
User level code may create and destroy cpusets by name in the cpuset
|
||||
virtual file system, manage the attributes and permissions of these
|
||||
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
|
||||
@ -155,7 +135,7 @@ Cpusets extends these two mechanisms as follows:
|
||||
The implementation of cpusets requires a few, simple hooks
|
||||
into the rest of the kernel, none in performance critical paths:
|
||||
|
||||
- in main/init.c, to initialize the root cpuset at system boot.
|
||||
- in init/main.c, to initialize the root cpuset at system boot.
|
||||
- in fork and exit, to attach and detach a task from its cpuset.
|
||||
- in sched_setaffinity, to mask the requested CPUs by what's
|
||||
allowed in that tasks cpuset.
|
||||
@ -166,7 +146,7 @@ into the rest of the kernel, none in performance critical paths:
|
||||
and related changes in both sched.c and arch/ia64/kernel/domain.c
|
||||
- in the mbind and set_mempolicy system calls, to mask the requested
|
||||
Memory Nodes by what's allowed in that tasks cpuset.
|
||||
- in page_alloc, to restrict memory to allowed nodes.
|
||||
- in page_alloc.c, to restrict memory to allowed nodes.
|
||||
- in vmscan.c, to restrict page recovery to the current cpuset.
|
||||
|
||||
In addition a new file system, of type "cpuset" may be mounted,
|
||||
@ -192,9 +172,15 @@ containing the following files describing that cpuset:
|
||||
|
||||
- cpus: list of CPUs in that cpuset
|
||||
- mems: list of Memory Nodes in that cpuset
|
||||
- memory_migrate flag: if set, move pages to cpusets nodes
|
||||
- cpu_exclusive flag: is cpu placement exclusive?
|
||||
- mem_exclusive flag: is memory placement exclusive?
|
||||
- tasks: list of tasks (by pid) attached to that cpuset
|
||||
- notify_on_release flag: run /sbin/cpuset_release_agent on exit?
|
||||
- memory_pressure: measure of how much paging pressure in cpuset
|
||||
|
||||
In addition, the root cpuset only has the following file:
|
||||
- memory_pressure_enabled flag: compute memory_pressure?
|
||||
|
||||
New cpusets are created using the mkdir system call or shell
|
||||
command. The properties of a cpuset, such as its flags, allowed
|
||||
@ -228,7 +214,108 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
|
||||
to represent the cpuset hierarchy provides for a familiar permission
|
||||
and name space for cpusets, with a minimum of additional kernel code.
|
||||
|
||||
1.4 How do I use cpusets ?
|
||||
|
||||
1.4 What are exclusive cpusets ?
|
||||
--------------------------------
|
||||
|
||||
If a cpuset is cpu or mem exclusive, no other cpuset, other than
|
||||
a direct ancestor or descendent, may share any of the same CPUs or
|
||||
Memory Nodes.
|
||||
|
||||
A cpuset that is cpu_exclusive has a scheduler (sched) domain
|
||||
associated with it. The sched domain consists of all CPUs in the
|
||||
current cpuset that are not part of any exclusive child cpusets.
|
||||
This ensures that the scheduler load balancing code only balances
|
||||
against the CPUs that are in the sched domain as defined above and
|
||||
not all of the CPUs in the system. This removes any overhead due to
|
||||
load balancing code trying to pull tasks outside of the cpu_exclusive
|
||||
cpuset only to be prevented by the tasks' cpus_allowed mask.
|
||||
|
||||
A cpuset that is mem_exclusive restricts kernel allocations for
|
||||
page, buffer and other data commonly shared by the kernel across
|
||||
multiple users. All cpusets, whether mem_exclusive or not, restrict
|
||||
allocations of memory for user space. This enables configuring a
|
||||
system so that several independent jobs can share common kernel data,
|
||||
such as file system pages, while isolating each jobs user allocation in
|
||||
its own cpuset. To do this, construct a large mem_exclusive cpuset to
|
||||
hold all the jobs, and construct child, non-mem_exclusive cpusets for
|
||||
each individual job. Only a small amount of typical kernel memory,
|
||||
such as requests from interrupt handlers, is allowed to be taken
|
||||
outside even a mem_exclusive cpuset.
|
||||
|
||||
|
||||
1.5 What does notify_on_release do ?
|
||||
------------------------------------
|
||||
|
||||
If the notify_on_release flag is enabled (1) in a cpuset, then whenever
|
||||
the last task in the cpuset leaves (exits or attaches to some other
|
||||
cpuset) and the last child cpuset of that cpuset is removed, then
|
||||
the kernel runs the command /sbin/cpuset_release_agent, supplying the
|
||||
pathname (relative to the mount point of the cpuset file system) of the
|
||||
abandoned cpuset. This enables automatic removal of abandoned cpusets.
|
||||
The default value of notify_on_release in the root cpuset at system
|
||||
boot is disabled (0). The default value of other cpusets at creation
|
||||
is the current value of their parents notify_on_release setting.
|
||||
|
||||
|
||||
1.6 What is memory_pressure ?
|
||||
-----------------------------
|
||||
The memory_pressure of a cpuset provides a simple per-cpuset metric
|
||||
of the rate that the tasks in a cpuset are attempting to free up in
|
||||
use memory on the nodes of the cpuset to satisfy additional memory
|
||||
requests.
|
||||
|
||||
This enables batch managers monitoring jobs running in dedicated
|
||||
cpusets to efficiently detect what level of memory pressure that job
|
||||
is causing.
|
||||
|
||||
This is useful both on tightly managed systems running a wide mix of
|
||||
submitted jobs, which may choose to terminate or re-prioritize jobs that
|
||||
are trying to use more memory than allowed on the nodes assigned them,
|
||||
and with tightly coupled, long running, massively parallel scientific
|
||||
computing jobs that will dramatically fail to meet required performance
|
||||
goals if they start to use more memory than allowed to them.
|
||||
|
||||
This mechanism provides a very economical way for the batch manager
|
||||
to monitor a cpuset for signs of memory pressure. It's up to the
|
||||
batch manager or other user code to decide what to do about it and
|
||||
take action.
|
||||
|
||||
==> Unless this feature is enabled by writing "1" to the special file
|
||||
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
|
||||
code of __alloc_pages() for this metric reduces to simply noticing
|
||||
that the cpuset_memory_pressure_enabled flag is zero. So only
|
||||
systems that enable this feature will compute the metric.
|
||||
|
||||
Why a per-cpuset, running average:
|
||||
|
||||
Because this meter is per-cpuset, rather than per-task or mm,
|
||||
the system load imposed by a batch scheduler monitoring this
|
||||
metric is sharply reduced on large systems, because a scan of
|
||||
the tasklist can be avoided on each set of queries.
|
||||
|
||||
Because this meter is a running average, instead of an accumulating
|
||||
counter, a batch scheduler can detect memory pressure with a
|
||||
single read, instead of having to read and accumulate results
|
||||
for a period of time.
|
||||
|
||||
Because this meter is per-cpuset rather than per-task or mm,
|
||||
the batch scheduler can obtain the key information, memory
|
||||
pressure in a cpuset, with a single read, rather than having to
|
||||
query and accumulate results over all the (dynamically changing)
|
||||
set of tasks in the cpuset.
|
||||
|
||||
A per-cpuset simple digital filter (requires a spinlock and 3 words
|
||||
of data per-cpuset) is kept, and updated by any task attached to that
|
||||
cpuset, if it enters the synchronous (direct) page reclaim code.
|
||||
|
||||
A per-cpuset file provides an integer number representing the recent
|
||||
(half-life of 10 seconds) rate of direct page reclaims caused by
|
||||
the tasks in the cpuset, in units of reclaims attempted per second,
|
||||
times 1000.
|
||||
|
||||
|
||||
1.7 How do I use cpusets ?
|
||||
--------------------------
|
||||
|
||||
In order to minimize the impact of cpusets on critical kernel
|
||||
@ -277,6 +364,30 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid
|
||||
impacting the scheduler code in the kernel with a check for changes
|
||||
in a tasks processor placement.
|
||||
|
||||
Normally, once a page is allocated (given a physical page
|
||||
of main memory) then that page stays on whatever node it
|
||||
was allocated, so long as it remains allocated, even if the
|
||||
cpusets memory placement policy 'mems' subsequently changes.
|
||||
If the cpuset flag file 'memory_migrate' is set true, then when
|
||||
tasks are attached to that cpuset, any pages that task had
|
||||
allocated to it on nodes in its previous cpuset are migrated
|
||||
to the tasks new cpuset. Depending on the implementation,
|
||||
this migration may either be done by swapping the page out,
|
||||
so that the next time the page is referenced, it will be paged
|
||||
into the tasks new cpuset, usually on the node where it was
|
||||
referenced, or this migration may be done by directly copying
|
||||
the pages from the tasks previous cpuset to the new cpuset,
|
||||
where possible to the same node, relative to the new cpuset,
|
||||
as the node that held the page, relative to the old cpuset.
|
||||
Also if 'memory_migrate' is set true, then if that cpusets
|
||||
'mems' file is modified, pages allocated to tasks in that
|
||||
cpuset, that were on nodes in the previous setting of 'mems',
|
||||
will be moved to nodes in the new setting of 'mems.' Again,
|
||||
depending on the implementation, this might be done by swapping,
|
||||
or by direct copying. In either case, pages that were not in
|
||||
the tasks prior cpuset, or in the cpusets prior 'mems' setting,
|
||||
will not be moved.
|
||||
|
||||
There is an exception to the above. If hotplug functionality is used
|
||||
to remove all the CPUs that are currently assigned to a cpuset,
|
||||
then the kernel will automatically update the cpus_allowed of all
|
||||
|
@ -2903,14 +2903,14 @@ Your cooperation is appreciated.
|
||||
196 = /dev/dvb/adapter3/video0 first video decoder of fourth card
|
||||
|
||||
|
||||
216 char USB BlueTooth devices
|
||||
0 = /dev/ttyUB0 First USB BlueTooth device
|
||||
1 = /dev/ttyUB1 Second USB BlueTooth device
|
||||
216 char Bluetooth RFCOMM TTY devices
|
||||
0 = /dev/rfcomm0 First Bluetooth RFCOMM TTY device
|
||||
1 = /dev/rfcomm1 Second Bluetooth RFCOMM TTY device
|
||||
...
|
||||
|
||||
217 char USB BlueTooth devices (alternate devices)
|
||||
0 = /dev/cuub0 Callout device for ttyUB0
|
||||
1 = /dev/cuub1 Callout device for ttyUB1
|
||||
217 char Bluetooth RFCOMM TTY devices (alternate devices)
|
||||
0 = /dev/curf0 Callout device for rfcomm0
|
||||
1 = /dev/curf1 Callout device for rfcomm1
|
||||
...
|
||||
|
||||
218 char The Logical Company bus Unibus/Qbus adapters
|
||||
|
673
Documentation/drivers/edac/edac.txt
Normal file
673
Documentation/drivers/edac/edac.txt
Normal file
@ -0,0 +1,673 @@
|
||||
|
||||
|
||||
EDAC - Error Detection And Correction
|
||||
|
||||
Written by Doug Thompson <norsk5@xmission.com>
|
||||
7 Dec 2005
|
||||
|
||||
|
||||
EDAC was written by:
|
||||
Thayne Harbaugh,
|
||||
modified by Dave Peterson, Doug Thompson, et al,
|
||||
from the bluesmoke.sourceforge.net project.
|
||||
|
||||
|
||||
============================================================================
|
||||
EDAC PURPOSE
|
||||
|
||||
The 'edac' kernel module goal is to detect and report errors that occur
|
||||
within the computer system. In the initial release, memory Correctable Errors
|
||||
(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
|
||||
|
||||
Detecting CE events, then harvesting those events and reporting them,
|
||||
CAN be a predictor of future UE events. With CE events, the system can
|
||||
continue to operate, but with less safety. Preventive maintainence and
|
||||
proactive part replacement of memory DIMMs exhibiting CEs can reduce
|
||||
the likelihood of the dreaded UE events and system 'panics'.
|
||||
|
||||
|
||||
In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
|
||||
in order to determine if errors are occurring on data transfers.
|
||||
The presence of PCI Parity errors must be examined with a grain of salt.
|
||||
There are several addin adapters that do NOT follow the PCI specification
|
||||
with regards to Parity generation and reporting. The specification says
|
||||
the vendor should tie the parity status bits to 0 if they do not intend
|
||||
to generate parity. Some vendors do not do this, and thus the parity bit
|
||||
can "float" giving false positives.
|
||||
|
||||
The PCI Parity EDAC device has the ability to "skip" known flakey
|
||||
cards during the parity scan. These are set by the parity "blacklist"
|
||||
interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
|
||||
section below.) There is also a parity "whitelist" which is used as
|
||||
an explicit list of devices to scan, while the blacklist is a list
|
||||
of devices to skip.
|
||||
|
||||
EDAC will have future error detectors that will be added or integrated
|
||||
into EDAC in the following list:
|
||||
|
||||
MCE Machine Check Exception
|
||||
MCA Machine Check Architecture
|
||||
NMI NMI notification of ECC errors
|
||||
MSRs Machine Specific Register error cases
|
||||
and other mechanisms.
|
||||
|
||||
These errors are usually bus errors, ECC errors, thermal throttling
|
||||
and the like.
|
||||
|
||||
|
||||
============================================================================
|
||||
EDAC VERSIONING
|
||||
|
||||
EDAC is composed of a "core" module (edac_mc.ko) and several Memory
|
||||
Controller (MC) driver modules. On a given system, the CORE
|
||||
is loaded and one MC driver will be loaded. Both the CORE and
|
||||
the MC driver have individual versions that reflect current release
|
||||
level of their respective modules. Thus, to "report" on what version
|
||||
a system is running, one must report both the CORE's and the
|
||||
MC driver's versions.
|
||||
|
||||
|
||||
LOADING
|
||||
|
||||
If 'edac' was statically linked with the kernel then no loading is
|
||||
necessary. If 'edac' was built as modules then simply modprobe the
|
||||
'edac' pieces that you need. You should be able to modprobe
|
||||
hardware-specific modules and have the dependencies load the necessary core
|
||||
modules.
|
||||
|
||||
Example:
|
||||
|
||||
$> modprobe amd76x_edac
|
||||
|
||||
loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
|
||||
core module.
|
||||
|
||||
|
||||
============================================================================
|
||||
EDAC sysfs INTERFACE
|
||||
|
||||
EDAC presents a 'sysfs' interface for control, reporting and attribute
|
||||
reporting purposes.
|
||||
|
||||
EDAC lives in the /sys/devices/system/edac directory. Within this directory
|
||||
there currently reside 2 'edac' components:
|
||||
|
||||
mc memory controller(s) system
|
||||
pci PCI status system
|
||||
|
||||
|
||||
============================================================================
|
||||
Memory Controller (mc) Model
|
||||
|
||||
First a background on the memory controller's model abstracted in EDAC.
|
||||
Each mc device controls a set of DIMM memory modules. These modules are
|
||||
layed out in a Chip-Select Row (csrowX) and Channel table (chX). There can
|
||||
be multiple csrows and two channels.
|
||||
|
||||
Memory controllers allow for several csrows, with 8 csrows being a typical value.
|
||||
Yet, the actual number of csrows depends on the electrical "loading"
|
||||
of a given motherboard, memory controller and DIMM characteristics.
|
||||
|
||||
Dual channels allows for 128 bit data transfers to the CPU from memory.
|
||||
|
||||
|
||||
Channel 0 Channel 1
|
||||
===================================
|
||||
csrow0 | DIMM_A0 | DIMM_B0 |
|
||||
csrow1 | DIMM_A0 | DIMM_B0 |
|
||||
===================================
|
||||
|
||||
===================================
|
||||
csrow2 | DIMM_A1 | DIMM_B1 |
|
||||
csrow3 | DIMM_A1 | DIMM_B1 |
|
||||
===================================
|
||||
|
||||
In the above example table there are 4 physical slots on the motherboard
|
||||
for memory DIMMs:
|
||||
|
||||
DIMM_A0
|
||||
DIMM_B0
|
||||
DIMM_A1
|
||||
DIMM_B1
|
||||
|
||||
Labels for these slots are usually silk screened on the motherboard. Slots
|
||||
labeled 'A' are channel 0 in this example. Slots labled 'B'
|
||||
are channel 1. Notice that there are two csrows possible on a
|
||||
physical DIMM. These csrows are allocated their csrow assignment
|
||||
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
|
||||
is placed in each Channel, the csrows cross both DIMMs.
|
||||
|
||||
Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
|
||||
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
|
||||
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
|
||||
when 2 dual ranked DIMMs are similiaryly placed, then both csrow0 and
|
||||
csrow1 will be populated. The pattern repeats itself for csrow2 and
|
||||
csrow3.
|
||||
|
||||
The representation of the above is reflected in the directory tree
|
||||
in EDAC's sysfs interface. Starting in directory
|
||||
/sys/devices/system/edac/mc each memory controller will be represented
|
||||
by its own 'mcX' directory, where 'X" is the index of the MC.
|
||||
|
||||
|
||||
..../edac/mc/
|
||||
|
|
||||
|->mc0
|
||||
|->mc1
|
||||
|->mc2
|
||||
....
|
||||
|
||||
Under each 'mcX' directory each 'csrowX' is again represented by a
|
||||
'csrowX', where 'X" is the csrow index:
|
||||
|
||||
|
||||
.../mc/mc0/
|
||||
|
|
||||
|->csrow0
|
||||
|->csrow2
|
||||
|->csrow3
|
||||
....
|
||||
|
||||
Notice that there is no csrow1, which indicates that csrow0 is
|
||||
composed of a single ranked DIMMs. This should also apply in both
|
||||
Channels, in order to have dual-channel mode be operational. Since
|
||||
both csrow2 and csrow3 are populated, this indicates a dual ranked
|
||||
set of DIMMs for channels 0 and 1.
|
||||
|
||||
|
||||
Within each of the 'mc','mcX' and 'csrowX' directories are several
|
||||
EDAC control and attribute files.
|
||||
|
||||
|
||||
============================================================================
|
||||
DIRECTORY 'mc'
|
||||
|
||||
In directory 'mc' are EDAC system overall control and attribute files:
|
||||
|
||||
|
||||
Panic on UE control file:
|
||||
|
||||
'panic_on_ue'
|
||||
|
||||
An uncorrectable error will cause a machine panic. This is usually
|
||||
desirable. It is a bad idea to continue when an uncorrectable error
|
||||
occurs - it is indeterminate what was uncorrected and the operating
|
||||
system context might be so mangled that continuing will lead to further
|
||||
corruption. If the kernel has MCE configured, then EDAC will never
|
||||
notice the UE.
|
||||
|
||||
LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
|
||||
|
||||
RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue
|
||||
|
||||
|
||||
Log UE control file:
|
||||
|
||||
'log_ue'
|
||||
|
||||
Generate kernel messages describing uncorrectable errors. These errors
|
||||
are reported through the system message log system. UE statistics
|
||||
will be accumulated even when UE logging is disabled.
|
||||
|
||||
LOAD TIME: module/kernel parameter: log_ue=[0|1]
|
||||
|
||||
RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
|
||||
|
||||
|
||||
Log CE control file:
|
||||
|
||||
'log_ce'
|
||||
|
||||
Generate kernel messages describing correctable errors. These
|
||||
errors are reported through the system message log system.
|
||||
CE statistics will be accumulated even when CE logging is disabled.
|
||||
|
||||
LOAD TIME: module/kernel parameter: log_ce=[0|1]
|
||||
|
||||
RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
|
||||
|
||||
|
||||
Polling period control file:
|
||||
|
||||
'poll_msec'
|
||||
|
||||
The time period, in milliseconds, for polling for error information.
|
||||
Too small a value wastes resources. Too large a value might delay
|
||||
necessary handling of errors and might loose valuable information for
|
||||
locating the error. 1000 milliseconds (once each second) is about
|
||||
right for most uses.
|
||||
|
||||
LOAD TIME: module/kernel parameter: poll_msec=[0|1]
|
||||
|
||||
RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
|
||||
|
||||
|
||||
Module Version read-only attribute file:
|
||||
|
||||
'mc_version'
|
||||
|
||||
The EDAC CORE modules's version and compile date are shown here to
|
||||
indicate what EDAC is running.
|
||||
|
||||
|
||||
|
||||
============================================================================
|
||||
'mcX' DIRECTORIES
|
||||
|
||||
|
||||
In 'mcX' directories are EDAC control and attribute files for
|
||||
this 'X" instance of the memory controllers:
|
||||
|
||||
|
||||
Counter reset control file:
|
||||
|
||||
'reset_counters'
|
||||
|
||||
This write-only control file will zero all the statistical counters
|
||||
for UE and CE errors. Zeroing the counters will also reset the timer
|
||||
indicating how long since the last counter zero. This is useful
|
||||
for computing errors/time. Since the counters are always reset at
|
||||
driver initialization time, no module/kernel parameter is available.
|
||||
|
||||
RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/counter_reset
|
||||
|
||||
This resets the counters on memory controller 0
|
||||
|
||||
|
||||
Seconds since last counter reset control file:
|
||||
|
||||
'seconds_since_reset'
|
||||
|
||||
This attribute file displays how many seconds have elapsed since the
|
||||
last counter reset. This can be used with the error counters to
|
||||
measure error rates.
|
||||
|
||||
|
||||
|
||||
DIMM capability attribute file:
|
||||
|
||||
'edac_capability'
|
||||
|
||||
The EDAC (Error Detection and Correction) capabilities/modes of
|
||||
the memory controller hardware.
|
||||
|
||||
|
||||
DIMM Current Capability attribute file:
|
||||
|
||||
'edac_current_capability'
|
||||
|
||||
The EDAC capabilities available with the hardware
|
||||
configuration. This may not be the same as "EDAC capability"
|
||||
if the correct memory is not used. If a memory controller is
|
||||
capable of EDAC, but DIMMs without check bits are in use, then
|
||||
Parity, SECDED, S4ECD4ED capabilities will not be available
|
||||
even though the memory controller might be capable of those
|
||||
modes with the proper memory loaded.
|
||||
|
||||
|
||||
Memory Type supported on this controller attribute file:
|
||||
|
||||
'supported_mem_type'
|
||||
|
||||
This attribute file displays the memory type, usually
|
||||
buffered and unbuffered DIMMs.
|
||||
|
||||
|
||||
Memory Controller name attribute file:
|
||||
|
||||
'mc_name'
|
||||
|
||||
This attribute file displays the type of memory controller
|
||||
that is being utilized.
|
||||
|
||||
|
||||
Memory Controller Module name attribute file:
|
||||
|
||||
'module_name'
|
||||
|
||||
This attribute file displays the memory controller module name,
|
||||
version and date built. The name of the memory controller
|
||||
hardware - some drivers work with multiple controllers and
|
||||
this field shows which hardware is present.
|
||||
|
||||
|
||||
Total memory managed by this memory controller attribute file:
|
||||
|
||||
'size_mb'
|
||||
|
||||
This attribute file displays, in count of megabytes, of memory
|
||||
that this instance of memory controller manages.
|
||||
|
||||
|
||||
Total Uncorrectable Errors count attribute file:
|
||||
|
||||
'ue_count'
|
||||
|
||||
This attribute file displays the total count of uncorrectable
|
||||
errors that have occurred on this memory controller. If panic_on_ue
|
||||
is set this counter will not have a chance to increment,
|
||||
since EDAC will panic the system.
|
||||
|
||||
|
||||
Total UE count that had no information attribute fileY:
|
||||
|
||||
'ue_noinfo_count'
|
||||
|
||||
This attribute file displays the number of UEs that
|
||||
have occurred have occurred with no informations as to which DIMM
|
||||
slot is having errors.
|
||||
|
||||
|
||||
Total Correctable Errors count attribute file:
|
||||
|
||||
'ce_count'
|
||||
|
||||
This attribute file displays the total count of correctable
|
||||
errors that have occurred on this memory controller. This
|
||||
count is very important to examine. CEs provide early
|
||||
indications that a DIMM is beginning to fail. This count
|
||||
field should be monitored for non-zero values and report
|
||||
such information to the system administrator.
|
||||
|
||||
|
||||
Total Correctable Errors count attribute file:
|
||||
|
||||
'ce_noinfo_count'
|
||||
|
||||
This attribute file displays the number of CEs that
|
||||
have occurred wherewith no informations as to which DIMM slot
|
||||
is having errors. Memory is handicapped, but operational,
|
||||
yet no information is available to indicate which slot
|
||||
the failing memory is in. This count field should be also
|
||||
be monitored for non-zero values.
|
||||
|
||||
Device Symlink:
|
||||
|
||||
'device'
|
||||
|
||||
Symlink to the memory controller device
|
||||
|
||||
|
||||
|
||||
============================================================================
|
||||
'csrowX' DIRECTORIES
|
||||
|
||||
In the 'csrowX' directories are EDAC control and attribute files for
|
||||
this 'X" instance of csrow:
|
||||
|
||||
|
||||
Total Uncorrectable Errors count attribute file:
|
||||
|
||||
'ue_count'
|
||||
|
||||
This attribute file displays the total count of uncorrectable
|
||||
errors that have occurred on this csrow. If panic_on_ue is set
|
||||
this counter will not have a chance to increment, since EDAC
|
||||
will panic the system.
|
||||
|
||||
|
||||
Total Correctable Errors count attribute file:
|
||||
|
||||
'ce_count'
|
||||
|
||||
This attribute file displays the total count of correctable
|
||||
errors that have occurred on this csrow. This
|
||||
count is very important to examine. CEs provide early
|
||||
indications that a DIMM is beginning to fail. This count
|
||||
field should be monitored for non-zero values and report
|
||||
such information to the system administrator.
|
||||
|
||||
|
||||
Total memory managed by this csrow attribute file:
|
||||
|
||||
'size_mb'
|
||||
|
||||
This attribute file displays, in count of megabytes, of memory
|
||||
that this csrow contatins.
|
||||
|
||||
|
||||
Memory Type attribute file:
|
||||
|
||||
'mem_type'
|
||||
|
||||
This attribute file will display what type of memory is currently
|
||||
on this csrow. Normally, either buffered or unbuffered memory.
|
||||
|
||||
|
||||
EDAC Mode of operation attribute file:
|
||||
|
||||
'edac_mode'
|
||||
|
||||
This attribute file will display what type of Error detection
|
||||
and correction is being utilized.
|
||||
|
||||
|
||||
Device type attribute file:
|
||||
|
||||
'dev_type'
|
||||
|
||||
This attribute file will display what type of DIMM device is
|
||||
being utilized. Example: x4
|
||||
|
||||
|
||||
Channel 0 CE Count attribute file:
|
||||
|
||||
'ch0_ce_count'
|
||||
|
||||
This attribute file will display the count of CEs on this
|
||||
DIMM located in channel 0.
|
||||
|
||||
|
||||
Channel 0 UE Count attribute file:
|
||||
|
||||
'ch0_ue_count'
|
||||
|
||||
This attribute file will display the count of UEs on this
|
||||
DIMM located in channel 0.
|
||||
|
||||
|
||||
Channel 0 DIMM Label control file:
|
||||
|
||||
'ch0_dimm_label'
|
||||
|
||||
This control file allows this DIMM to have a label assigned
|
||||
to it. With this label in the module, when errors occur
|
||||
the output can provide the DIMM label in the system log.
|
||||
This becomes vital for panic events to isolate the
|
||||
cause of the UE event.
|
||||
|
||||
DIMM Labels must be assigned after booting, with information
|
||||
that correctly identifies the physical slot with its
|
||||
silk screen label. This information is currently very
|
||||
motherboard specific and determination of this information
|
||||
must occur in userland at this time.
|
||||
|
||||
|
||||
Channel 1 CE Count attribute file:
|
||||
|
||||
'ch1_ce_count'
|
||||
|
||||
This attribute file will display the count of CEs on this
|
||||
DIMM located in channel 1.
|
||||
|
||||
|
||||
Channel 1 UE Count attribute file:
|
||||
|
||||
'ch1_ue_count'
|
||||
|
||||
This attribute file will display the count of UEs on this
|
||||
DIMM located in channel 0.
|
||||
|
||||
|
||||
Channel 1 DIMM Label control file:
|
||||
|
||||
'ch1_dimm_label'
|
||||
|
||||
This control file allows this DIMM to have a label assigned
|
||||
to it. With this label in the module, when errors occur
|
||||
the output can provide the DIMM label in the system log.
|
||||
This becomes vital for panic events to isolate the
|
||||
cause of the UE event.
|
||||
|
||||
DIMM Labels must be assigned after booting, with information
|
||||
that correctly identifies the physical slot with its
|
||||
silk screen label. This information is currently very
|
||||
motherboard specific and determination of this information
|
||||
must occur in userland at this time.
|
||||
|
||||
|
||||
============================================================================
|
||||
SYSTEM LOGGING
|
||||
|
||||
If logging for UEs and CEs are enabled then system logs will have
|
||||
error notices indicating errors that have been detected:
|
||||
|
||||
MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
|
||||
channel 1 "DIMM_B1": amd76x_edac
|
||||
|
||||
MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
|
||||
channel 1 "DIMM_B1": amd76x_edac
|
||||
|
||||
|
||||
The structure of the message is:
|
||||
the memory controller (MC0)
|
||||
Error type (CE)
|
||||
memory page (0x283)
|
||||
offset in the page (0xce0)
|
||||
the byte granularity (grain 8)
|
||||
or resolution of the error
|
||||
the error syndrome (0xb741)
|
||||
memory row (row 0)
|
||||
memory channel (channel 1)
|
||||
DIMM label, if set prior (DIMM B1
|
||||
and then an optional, driver-specific message that may
|
||||
have additional information.
|
||||
|
||||
Both UEs and CEs with no info will lack all but memory controller,
|
||||
error type, a notice of "no info" and then an optional,
|
||||
driver-specific error message.
|
||||
|
||||
|
||||
|
||||
============================================================================
|
||||
PCI Bus Parity Detection
|
||||
|
||||
|
||||
On Header Type 00 devices the primary status is looked at
|
||||
for any parity error regardless of whether Parity is enabled on the
|
||||
device. (The spec indicates parity is generated in some cases).
|
||||
On Header Type 01 bridges, the secondary status register is also
|
||||
looked at to see if parity ocurred on the bus on the other side of
|
||||
the bridge.
|
||||
|
||||
|
||||
SYSFS CONFIGURATION
|
||||
|
||||
Under /sys/devices/system/edac/pci are control and attribute files as follows:
|
||||
|
||||
|
||||
Enable/Disable PCI Parity checking control file:
|
||||
|
||||
'check_pci_parity'
|
||||
|
||||
|
||||
This control file enables or disables the PCI Bus Parity scanning
|
||||
operation. Writing a 1 to this file enables the scanning. Writing
|
||||
a 0 to this file disables the scanning.
|
||||
|
||||
Enable:
|
||||
echo "1" >/sys/devices/system/edac/pci/check_pci_parity
|
||||
|
||||
Disable:
|
||||
echo "0" >/sys/devices/system/edac/pci/check_pci_parity
|
||||
|
||||
|
||||
|
||||
Panic on PCI PARITY Error:
|
||||
|
||||
'panic_on_pci_parity'
|
||||
|
||||
|
||||
This control files enables or disables panic'ing when a parity
|
||||
error has been detected.
|
||||
|
||||
|
||||
module/kernel parameter: panic_on_pci_parity=[0|1]
|
||||
|
||||
Enable:
|
||||
echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity
|
||||
|
||||
Disable:
|
||||
echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity
|
||||
|
||||
|
||||
Parity Count:
|
||||
|
||||
'pci_parity_count'
|
||||
|
||||
This attribute file will display the number of parity errors that
|
||||
have been detected.
|
||||
|
||||
|
||||
|
||||
PCI Device Whitelist:
|
||||
|
||||
'pci_parity_whitelist'
|
||||
|
||||
This control file allows for an explicit list of PCI devices to be
|
||||
scanned for parity errors. Only devices found on this list will
|
||||
be examined. The list is a line of hexadecimel VENDOR and DEVICE
|
||||
ID tuples:
|
||||
|
||||
1022:7450,1434:16a6
|
||||
|
||||
One or more can be inserted, seperated by a comma.
|
||||
|
||||
To write the above list doing the following as one command line:
|
||||
|
||||
echo "1022:7450,1434:16a6"
|
||||
> /sys/devices/system/edac/pci/pci_parity_whitelist
|
||||
|
||||
|
||||
|
||||
To display what the whitelist is, simply 'cat' the same file.
|
||||
|
||||
|
||||
PCI Device Blacklist:
|
||||
|
||||
'pci_parity_blacklist'
|
||||
|
||||
This control file allows for a list of PCI devices to be
|
||||
skipped for scanning.
|
||||
The list is a line of hexadecimel VENDOR and DEVICE ID tuples:
|
||||
|
||||
1022:7450,1434:16a6
|
||||
|
||||
One or more can be inserted, seperated by a comma.
|
||||
|
||||
To write the above list doing the following as one command line:
|
||||
|
||||
echo "1022:7450,1434:16a6"
|
||||
> /sys/devices/system/edac/pci/pci_parity_blacklist
|
||||
|
||||
|
||||
To display what the whitelist current contatins,
|
||||
simply 'cat' the same file.
|
||||
|
||||
=======================================================================
|
||||
|
||||
PCI Vendor and Devices IDs can be obtained with the lspci command. Using
|
||||
the -n option lspci will display the vendor and device IDs. The system
|
||||
adminstrator will have to determine which devices should be scanned or
|
||||
skipped.
|
||||
|
||||
|
||||
|
||||
The two lists (white and black) are prioritized. blacklist is the lower
|
||||
priority and will NOT be utilized when a whitelist has been set.
|
||||
Turn OFF a whitelist by an empty echo command:
|
||||
|
||||
echo > /sys/devices/system/edac/pci/pci_parity_whitelist
|
||||
|
||||
and any previous blacklist will be utililzed.
|
||||
|
@ -50,12 +50,12 @@ http://www.linuxtv.org/wiki/index.php/DVB_USB
|
||||
0. History & News:
|
||||
2005-06-30 - added support for WideView WT-220U (Thanks to Steve Chang)
|
||||
2005-05-30 - added basic isochronous support to the dvb-usb-framework
|
||||
added support for Conexant Hybrid reference design and Nebula DigiTV USB
|
||||
added support for Conexant Hybrid reference design and Nebula DigiTV USB
|
||||
2005-04-17 - all dibusb devices ported to make use of the dvb-usb-framework
|
||||
2005-04-02 - re-enabled and improved remote control code.
|
||||
2005-03-31 - ported the Yakumo/Hama/Typhoon DVB-T USB2.0 device to dvb-usb.
|
||||
2005-03-30 - first commit of the dvb-usb-module based on the dibusb-source. First device is a new driver for the
|
||||
TwinhanDTV Alpha / MagicBox II USB2.0-only DVB-T device.
|
||||
TwinhanDTV Alpha / MagicBox II USB2.0-only DVB-T device.
|
||||
|
||||
(change from dvb-dibusb to dvb-usb)
|
||||
2005-03-28 - added support for the AVerMedia AverTV DVB-T USB2.0 device (Thanks to Glen Harris and Jiun-Kuei Jung, AVerMedia)
|
||||
@ -64,50 +64,50 @@ http://www.linuxtv.org/wiki/index.php/DVB_USB
|
||||
2005-02-02 - added support for the Hauppauge Win-TV Nova-T USB2
|
||||
2005-01-31 - distorted streaming is gone for USB1.1 devices
|
||||
2005-01-13 - moved the mirrored pid_filter_table back to dvb-dibusb
|
||||
- first almost working version for HanfTek UMT-010
|
||||
- found out, that Yakumo/HAMA/Typhoon are predecessors of the HanfTek UMT-010
|
||||
- first almost working version for HanfTek UMT-010
|
||||
- found out, that Yakumo/HAMA/Typhoon are predecessors of the HanfTek UMT-010
|
||||
2005-01-10 - refactoring completed, now everything is very delightful
|
||||
- tuner quirks for some weird devices (Artec T1 AN2235 device has sometimes a
|
||||
Panasonic Tuner assembled). Tunerprobing implemented. Thanks a lot to Gunnar Wittich.
|
||||
- tuner quirks for some weird devices (Artec T1 AN2235 device has sometimes a
|
||||
Panasonic Tuner assembled). Tunerprobing implemented. Thanks a lot to Gunnar Wittich.
|
||||
2004-12-29 - after several days of struggling around bug of no returning URBs fixed.
|
||||
2004-12-26 - refactored the dibusb-driver, splitted into separate files
|
||||
- i2c-probing enabled
|
||||
- i2c-probing enabled
|
||||
2004-12-06 - possibility for demod i2c-address probing
|
||||
- new usb IDs (Compro, Artec)
|
||||
- new usb IDs (Compro, Artec)
|
||||
2004-11-23 - merged changes from DiB3000MC_ver2.1
|
||||
- revised the debugging
|
||||
- possibility to deliver the complete TS for USB2.0
|
||||
- revised the debugging
|
||||
- possibility to deliver the complete TS for USB2.0
|
||||
2004-11-21 - first working version of the dib3000mc/p frontend driver.
|
||||
2004-11-12 - added additional remote control keys. Thanks to Uwe Hanke.
|
||||
2004-11-07 - added remote control support. Thanks to David Matthews.
|
||||
2004-11-05 - added support for a new devices (Grandtec/Avermedia/Artec)
|
||||
- merged my changes (for dib3000mb/dibusb) to the FE_REFACTORING, because it became HEAD
|
||||
- moved transfer control (pid filter, fifo control) from usb driver to frontend, it seems
|
||||
better settled there (added xfer_ops-struct)
|
||||
- created a common files for frontends (mc/p/mb)
|
||||
- merged my changes (for dib3000mb/dibusb) to the FE_REFACTORING, because it became HEAD
|
||||
- moved transfer control (pid filter, fifo control) from usb driver to frontend, it seems
|
||||
better settled there (added xfer_ops-struct)
|
||||
- created a common files for frontends (mc/p/mb)
|
||||
2004-09-28 - added support for a new device (Unkown, vendor ID is Hyper-Paltek)
|
||||
2004-09-20 - added support for a new device (Compro DVB-U2000), thanks
|
||||
to Amaury Demol for reporting
|
||||
- changed usb TS transfer method (several urbs, stopping transfer
|
||||
before setting a new pid)
|
||||
to Amaury Demol for reporting
|
||||
- changed usb TS transfer method (several urbs, stopping transfer
|
||||
before setting a new pid)
|
||||
2004-09-13 - added support for a new device (Artec T1 USB TVBOX), thanks
|
||||
to Christian Motschke for reporting
|
||||
to Christian Motschke for reporting
|
||||
2004-09-05 - released the dibusb device and dib3000mb-frontend driver
|
||||
|
||||
(old news for vp7041.c)
|
||||
2004-07-15 - found out, by accident, that the device has a TUA6010XS for
|
||||
PLL
|
||||
PLL
|
||||
2004-07-12 - figured out, that the driver should also work with the
|
||||
CTS Portable (Chinese Television System)
|
||||
CTS Portable (Chinese Television System)
|
||||
2004-07-08 - firmware-extraction-2.422-problem solved, driver is now working
|
||||
properly with firmware extracted from 2.422
|
||||
- #if for 2.6.4 (dvb), compile issue
|
||||
- changed firmware handling, see vp7041.txt sec 1.1
|
||||
properly with firmware extracted from 2.422
|
||||
- #if for 2.6.4 (dvb), compile issue
|
||||
- changed firmware handling, see vp7041.txt sec 1.1
|
||||
2004-07-02 - some tuner modifications, v0.1, cleanups, first public
|
||||
2004-06-28 - now using the dvb_dmx_swfilter_packets, everything
|
||||
runs fine now
|
||||
runs fine now
|
||||
2004-06-27 - able to watch and switching channels (pre-alpha)
|
||||
- no section filtering yet
|
||||
- no section filtering yet
|
||||
2004-06-06 - first TS received, but kernel oops :/
|
||||
2004-05-14 - firmware loader is working
|
||||
2004-05-11 - start writing the driver
|
||||
|
@ -174,7 +174,7 @@ Debugging
|
||||
Everything which is identical in the following table, can be put into a common
|
||||
flexcop-module.
|
||||
|
||||
PCI USB
|
||||
PCI USB
|
||||
-------------------------------------------------------------------------------
|
||||
Different:
|
||||
Register access: accessing IO memory USB control message
|
||||
|
@ -1,6 +1,6 @@
|
||||
|
||||
HOWTO: Get An Avermedia DVB-T working under Linux
|
||||
______________________________________________
|
||||
______________________________________________
|
||||
|
||||
Table of Contents
|
||||
Assumptions and Introduction
|
||||
@ -150,7 +150,8 @@ Getting the card going
|
||||
|
||||
The frontend module sp887x.o, requires an external firmware.
|
||||
Please use the command "get_dvb_firmware sp887x" to download
|
||||
it. Then copy it to /usr/lib/hotplug/firmware.
|
||||
it. Then copy it to /usr/lib/hotplug/firmware or /lib/firmware/
|
||||
(depending on configuration of firmware hotplug).
|
||||
|
||||
Receiving DVB-T in Australia
|
||||
|
||||
|
@ -16,7 +16,7 @@ Hardware supported by the linuxtv.org DVB drivers
|
||||
shielding, and the whole metal box has its own part number.
|
||||
|
||||
|
||||
o Frontends drivers:
|
||||
o Frontends drivers:
|
||||
- dvb_dummy_fe: for testing...
|
||||
DVB-S:
|
||||
- ves1x93 : Alps BSRV2 (ves1893 demodulator) and dbox2 (ves1993)
|
||||
@ -24,7 +24,7 @@ o Frontends drivers:
|
||||
- grundig_29504-491 : Grundig 29504-491 (Philips TDA8083 demodulator), tsa5522 PLL
|
||||
- mt312 : Zarlink mt312 or Mitel vp310 demodulator, sl1935 or tsa5059 PLL
|
||||
- stv0299 : Alps BSRU6 (tsa5059 PLL), LG TDQB-S00x (tsa5059 PLL),
|
||||
LG TDQF-S001F (sl1935 PLL), Philips SU1278 (tua6100 PLL),
|
||||
LG TDQF-S001F (sl1935 PLL), Philips SU1278 (tua6100 PLL),
|
||||
Philips SU1278SH (tsa5059 PLL), Samsung TBMU24112IMB
|
||||
DVB-C:
|
||||
- ves1820 : various (ves1820 demodulator, sp5659c or spXXXX PLL)
|
||||
@ -35,8 +35,8 @@ o Frontends drivers:
|
||||
- grundig_29504-401 : Grundig 29504-401 (LSI L64781 demodulator), tsa5060 PLL
|
||||
- tda1004x : Philips tda10045h (td1344 or tdm1316l PLL)
|
||||
- nxt6000 : Alps TDME7 (MITEL SP5659 PLL), Alps TDED4 (TI ALP510 PLL),
|
||||
Comtech DVBT-6k07 (SP5730 PLL)
|
||||
(NxtWave Communications NXT6000 demodulator)
|
||||
Comtech DVBT-6k07 (SP5730 PLL)
|
||||
(NxtWave Communications NXT6000 demodulator)
|
||||
- sp887x : Microtune 7202D
|
||||
- dib3000mb : DiBcom 3000-MB demodulator
|
||||
DVB-S/C/T:
|
||||
|
@ -15,7 +15,7 @@ Michael Holzt <kju@debian.org>
|
||||
|
||||
Diego Picciani <d.picciani@novacomp.it>
|
||||
for CyberLogin for Linux which allows logging onto EON
|
||||
(in case you are wondering where CyberLogin is, EON changed its login
|
||||
(in case you are wondering where CyberLogin is, EON changed its login
|
||||
procedure and CyberLogin is no longer used.)
|
||||
|
||||
Martin Schaller <martin@smurf.franken.de>
|
||||
@ -57,7 +57,7 @@ Augusto Cardoso <augusto@carhil.net>
|
||||
Davor Emard <emard@softhome.net>
|
||||
for his work on the budget drivers, the demux code,
|
||||
the module unloading problems, ...
|
||||
|
||||
|
||||
Hans-Frieder Vogt <hfvogt@arcor.de>
|
||||
for his work on calculating and checking the crc's for the
|
||||
TechnoTrend/Hauppauge DEC driver firmware
|
||||
|
@ -60,7 +60,6 @@ Some very frequently asked questions about linuxtv-dvb
|
||||
Metzler Bros. DVB development; alternate drivers and
|
||||
DVB utilities, include dvb-mpegtools and tuxzap.
|
||||
|
||||
http://www.linuxstb.org/
|
||||
http://sourceforge.net/projects/dvbtools/
|
||||
Dave Chapman's dvbtools package, including
|
||||
dvbstream and dvbtune
|
||||
|
@ -23,7 +23,7 @@ use IO::Handle;
|
||||
|
||||
@components = ( "sp8870", "sp887x", "tda10045", "tda10046", "av7110", "dec2000t",
|
||||
"dec2540t", "dec3000s", "vp7041", "dibusb", "nxt2002", "nxt2004",
|
||||
"or51211", "or51132_qam", "or51132_vsb");
|
||||
"or51211", "or51132_qam", "or51132_vsb", "bluebird");
|
||||
|
||||
# Check args
|
||||
syntax() if (scalar(@ARGV) != 1);
|
||||
@ -34,7 +34,11 @@ for ($i=0; $i < scalar(@components); $i++) {
|
||||
if ($cid eq $components[$i]) {
|
||||
$outfile = eval($cid);
|
||||
die $@ if $@;
|
||||
print STDERR "Firmware $outfile extracted successfully. Now copy it to either /lib/firmware or /usr/lib/hotplug/firmware/ (depending on your hotplug version).\n";
|
||||
print STDERR <<EOF;
|
||||
Firmware $outfile extracted successfully.
|
||||
Now copy it to either /usr/lib/hotplug/firmware or /lib/firmware
|
||||
(depending on configuration of firmware hotplug).
|
||||
EOF
|
||||
exit(0);
|
||||
}
|
||||
}
|
||||
@ -243,7 +247,7 @@ sub nxt2002 {
|
||||
my $tmpdir = tempdir(DIR => "/tmp", CLEANUP => 1);
|
||||
|
||||
checkstandard();
|
||||
|
||||
|
||||
wgetfile($sourcefile, $url);
|
||||
unzip($sourcefile, $tmpdir);
|
||||
verify("$tmpdir/SkyNETU.sys", $hash);
|
||||
@ -308,6 +312,19 @@ sub or51132_vsb {
|
||||
$fwfile;
|
||||
}
|
||||
|
||||
sub bluebird {
|
||||
my $url = "http://www.linuxtv.org/download/dvb/firmware/dvb-usb-bluebird-01.fw";
|
||||
my $outfile = "dvb-usb-bluebird-01.fw";
|
||||
my $hash = "658397cb9eba9101af9031302671f49d";
|
||||
|
||||
checkstandard();
|
||||
|
||||
wgetfile($outfile, $url);
|
||||
verify($outfile,$hash);
|
||||
|
||||
$outfile;
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------
|
||||
# Utilities
|
||||
|
||||
|
@ -20,7 +20,7 @@ http://linuxtv.org/downloads/
|
||||
|
||||
What's inside this directory:
|
||||
|
||||
"cards.txt"
|
||||
"cards.txt"
|
||||
contains a list of supported hardware.
|
||||
|
||||
"contributors.txt"
|
||||
@ -37,7 +37,7 @@ that require it.
|
||||
contains detailed informations about the
|
||||
TT DEC2000/DEC3000 USB DVB hardware.
|
||||
|
||||
"bt8xx.txt"
|
||||
"bt8xx.txt"
|
||||
contains detailed installation instructions for the
|
||||
various bt8xx based "budget" DVB cards
|
||||
(Nebula, Pinnacle PCTV, Twinhan DST)
|
||||
|
@ -41,4 +41,5 @@ Hotplug Firmware Loading for 2.6 kernels
|
||||
For 2.6 kernels the firmware is loaded at the point that the driver module is
|
||||
loaded. See linux/Documentation/dvb/firmware.txt for more information.
|
||||
|
||||
Copy the three files downloaded above into the /usr/lib/hotplug/firmware directory.
|
||||
Copy the three files downloaded above into the /usr/lib/hotplug/firmware or
|
||||
/lib/firmware directory (depending on configuration of firmware hotplug).
|
||||
|
@ -28,7 +28,7 @@ the image from specifications.
|
||||
CPIO ARCHIVE method
|
||||
|
||||
You can create a cpio archive that contains the early userspace image.
|
||||
Youre cpio archive should be specified in CONFIG_INITRAMFS_SOURCE and it
|
||||
Your cpio archive should be specified in CONFIG_INITRAMFS_SOURCE and it
|
||||
will be used directly. Only a single cpio file may be specified in
|
||||
CONFIG_INITRAMFS_SOURCE and directory and file names are not allowed in
|
||||
combination with a cpio archive.
|
||||
|
@ -11,4 +11,3 @@ Untested features
|
||||
|
||||
All LCD stuff is untested. If it worked in tridentfb, it should work in
|
||||
cyblafb. Please test and report the results to Knut_Petersen@t-online.de.
|
||||
|
||||
|
@ -14,142 +14,141 @@
|
||||
#
|
||||
|
||||
mode "640x480-50"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 47619 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-60"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 39682 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-70"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 34013 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-72"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 33068 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-75"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 31746 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-80"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 29761 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-85"
|
||||
geometry 640 480 640 3756 8
|
||||
geometry 640 480 2048 4096 8
|
||||
timings 28011 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "800x600-50"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 30303 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-60"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 25252 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-70"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 21645 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-72"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 21043 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-75"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 20202 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-80"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 18939 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-85"
|
||||
geometry 800 600 800 3221 8
|
||||
geometry 800 600 2048 4096 8
|
||||
timings 17825 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "1024x768-50"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 19054 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-60"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 15880 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-70"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 13610 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-72"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 13232 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-75"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 12703 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-80"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 11910 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-85"
|
||||
geometry 1024 768 1024 2815 8
|
||||
geometry 1024 768 2048 4096 8
|
||||
timings 11209 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-50"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 11114 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-60"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 9262 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-70"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 7939 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-72"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 7719 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-75"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 7410 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-80"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 6946 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-85"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
geometry 1280 1024 2048 4096 8
|
||||
timings 6538 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
|
@ -77,4 +77,3 @@ patch that speeds up kernel bitblitting a lot ( > 20%).
|
||||
| | | | |
|
||||
| | | | |
|
||||
+-----------+-----------------+-----------------+-----------------+
|
||||
|
||||
|
@ -22,11 +22,10 @@ accelerated color blitting Who needs it? The console driver does use color
|
||||
everything else is done using color expanding
|
||||
blitting of 1bpp character bitmaps.
|
||||
|
||||
xpanning Who needs it?
|
||||
|
||||
ioctls Who needs it?
|
||||
|
||||
TV-out Will be done later
|
||||
TV-out Will be done later. Use "vga= " at boot time
|
||||
to set a suitable video mode.
|
||||
|
||||
??? Feel free to contact me if you have any
|
||||
feature requests
|
||||
|
@ -40,6 +40,16 @@ Selecting Modes
|
||||
None of the modes possible to select as startup modes are affected by
|
||||
the problems described at the end of the next subsection.
|
||||
|
||||
For all startup modes cyblafb chooses a virtual x resolution of 2048,
|
||||
the only exception is mode 1280x1024 in combination with 32 bpp. This
|
||||
allows ywrap scrolling for all those modes if rotation is 0 or 2, and
|
||||
also fast scrolling if rotation is 1 or 3. The default virtual y reso-
|
||||
lution is 4096 for bpp == 8, 2048 for bpp==16 and 1024 for bpp == 32,
|
||||
again with the only exception of 1280x1024 at 32 bpp.
|
||||
|
||||
Please do set your video memory size to 8 Mb in the Bios setup. Other
|
||||
values will work, but performace is decreased for a lot of modes.
|
||||
|
||||
Mode changes using fbset
|
||||
========================
|
||||
|
||||
@ -54,20 +64,26 @@ Selecting Modes
|
||||
- if a flat panel is found, cyblafb does not allow you
|
||||
to program a resolution higher than the physical
|
||||
resolution of the flat panel monitor
|
||||
- cyblafb does not allow xres to differ from xres_virtual
|
||||
- cyblafb does not allow vclk to exceed 230 MHz. As 32 bpp
|
||||
and (currently) 24 bit modes use a doubled vclk internally,
|
||||
the dotclock limit as seen by fbset is 115 MHz for those
|
||||
modes and 230 MHz for 8 and 16 bpp modes.
|
||||
- cyblafb will allow you to select very high resolutions as
|
||||
long as the hardware can be programmed to these modes. The
|
||||
documented limit 1600x1200 is not enforced, but don't expect
|
||||
perfect signal quality.
|
||||
|
||||
Any request that violates the rules given above will be ignored and
|
||||
fbset will return an error.
|
||||
Any request that violates the rules given above will be either changed
|
||||
to something the hardware supports or an error value will be returned.
|
||||
|
||||
If you program a virtual y resolution higher than the hardware limit,
|
||||
cyblafb will silently decrease that value to the highest possible
|
||||
value.
|
||||
value. The same is true for a virtual x resolution that is not
|
||||
supported by the hardware. Cyblafb tries to adapt vyres first because
|
||||
vxres decides if ywrap scrolling is possible or not.
|
||||
|
||||
Attempts to disable acceleration are ignored.
|
||||
Attempts to disable acceleration are ignored, I believe that this is
|
||||
safe.
|
||||
|
||||
Some video modes that should work do not work as expected. If you use
|
||||
the standard fb.modes, fbset 640x480-60 will program that mode, but
|
||||
@ -129,10 +145,6 @@ mode 640x480 or 800x600 or 1024x768 or 1280x1024
|
||||
verbosity 0 is the default, increase to at least 2 for every
|
||||
bug report!
|
||||
|
||||
vesafb allows cyblafb to be loaded after vesafb has been
|
||||
loaded. See sections "Module unloading ...".
|
||||
|
||||
|
||||
Development hints
|
||||
=================
|
||||
|
||||
@ -195,7 +207,7 @@ a graphics mode.
|
||||
After booting, load cyblafb without any mode and bpp parameter and assign
|
||||
cyblafb to individual ttys using con2fb, e.g.:
|
||||
|
||||
modprobe cyblafb vesafb=1
|
||||
modprobe cyblafb
|
||||
con2fb /dev/fb1 /dev/tty1
|
||||
|
||||
Unloading cyblafb works without problems after you assign vesafb to all
|
||||
@ -203,4 +215,3 @@ ttys again, e.g.:
|
||||
|
||||
con2fb /dev/fb0 /dev/tty1
|
||||
rmmod cyblafb
|
||||
|
||||
|
29
Documentation/fb/cyblafb/whatsnew
Normal file
29
Documentation/fb/cyblafb/whatsnew
Normal file
@ -0,0 +1,29 @@
|
||||
0.62
|
||||
====
|
||||
|
||||
- the vesafb parameter has been removed as I decided to allow the
|
||||
feature without any special parameter.
|
||||
|
||||
- Cyblafb does not use the vga style of panning any longer, now the
|
||||
"right view" register in the graphics engine IO space is used. Without
|
||||
that change it was impossible to use all available memory, and without
|
||||
access to all available memory it is impossible to ywrap.
|
||||
|
||||
- The imageblit function now uses hardware acceleration for all font
|
||||
widths. Hardware blitting across pixel column 2048 is broken in the
|
||||
cyberblade/i1 graphics core, but we work around that hardware bug.
|
||||
|
||||
- modes with vxres != xres are supported now.
|
||||
|
||||
- ywrap scrolling is supported now and the default. This is a big
|
||||
performance gain.
|
||||
|
||||
- default video modes use vyres > yres and vxres > xres to allow
|
||||
almost optimal scrolling speed for normal and rotated screens
|
||||
|
||||
- some features mainly usefull for debugging the upper layers of the
|
||||
framebuffer system have been added, have a look at the code
|
||||
|
||||
- fixed: Oops after unloading cyblafb when reading /proc/io*
|
||||
|
||||
- we work around some bugs of the higher framebuffer layers.
|
@ -47,17 +47,6 @@ Who: Paul E. McKenney <paulmck@us.ibm.com>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: IEEE1394 Audio and Music Data Transmission Protocol driver,
|
||||
Connection Management Procedures driver
|
||||
When: November 2005
|
||||
Files: drivers/ieee1394/{amdtp,cmp}*
|
||||
Why: These are incomplete, have never worked, and are better implemented
|
||||
in userland via raw1394 (see http://freebob.sourceforge.net/ for
|
||||
example.)
|
||||
Who: Jody McIntyre <scjody@steamballoon.com>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: raw1394: requests of type RAW1394_REQ_ISO_SEND, RAW1394_REQ_ISO_LISTEN
|
||||
When: November 2005
|
||||
Why: Deprecated in favour of the new ioctl-based rawiso interface, which is
|
||||
@ -82,15 +71,6 @@ Who: Mauro Carvalho Chehab <mchehab@brturbo.com.br>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: i2c sysfs name change: in1_ref, vid deprecated in favour of cpu0_vid
|
||||
When: November 2005
|
||||
Files: drivers/i2c/chips/adm1025.c, drivers/i2c/chips/adm1026.c
|
||||
Why: Match the other drivers' name for the same function, duplicate names
|
||||
will be available until removal of old names.
|
||||
Who: Grant Coady <gcoady@gmail.com>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: remove EXPORT_SYMBOL(panic_timeout)
|
||||
When: April 2006
|
||||
Files: kernel/panic.c
|
||||
@ -140,3 +120,31 @@ What: EXPORT_SYMBOL(lookup_hash)
|
||||
When: January 2006
|
||||
Why: Too low-level interface. Use lookup_one_len or lookup_create instead.
|
||||
Who: Christoph Hellwig <hch@lst.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: CONFIG_FORCED_INLINING
|
||||
When: June 2006
|
||||
Why: Config option is there to see if gcc is good enough. (in january
|
||||
2006). If it is, the behavior should just be the default. If it's not,
|
||||
the option should just go away entirely.
|
||||
Who: Arjan van de Ven
|
||||
|
||||
---------------------------
|
||||
|
||||
What: START_ARRAY ioctl for md
|
||||
When: July 2006
|
||||
Files: drivers/md/md.c
|
||||
Why: Not reliable by design - can fail when most needed.
|
||||
Alternatives exist
|
||||
Who: NeilBrown <neilb@suse.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: au1x00_uart driver
|
||||
When: January 2006
|
||||
Why: The 8250 serial driver now has the ability to deal with the differences
|
||||
between the standard 8250 family of UARTs and their slightly strange
|
||||
brother on Alchemy SOCs. The loss of features is not considered an
|
||||
issue.
|
||||
Who: Ralf Baechle <ralf@linux-mips.org>
|
||||
|
@ -12,14 +12,16 @@ cifs.txt
|
||||
- description of the CIFS filesystem
|
||||
coda.txt
|
||||
- description of the CODA filesystem.
|
||||
configfs/
|
||||
- directory containing configfs documentation and example code.
|
||||
cramfs.txt
|
||||
- info on the cram filesystem for small storage (ROMs etc)
|
||||
devfs/
|
||||
- directory containing devfs documentation.
|
||||
dlmfs.txt
|
||||
- info on the userspace interface to the OCFS2 DLM.
|
||||
ext2.txt
|
||||
- info, mount options and specifications for the Ext2 filesystem.
|
||||
fat_cvf.txt
|
||||
- info on the Compressed Volume Files extension to the FAT filesystem
|
||||
hpfs.txt
|
||||
- info and mount options for the OS/2 HPFS.
|
||||
isofs.txt
|
||||
@ -32,6 +34,8 @@ ntfs.txt
|
||||
- info and mount options for the NTFS filesystem (Windows NT).
|
||||
proc.txt
|
||||
- info on Linux's /proc filesystem.
|
||||
ocfs2.txt
|
||||
- info and mount options for the OCFS2 clustered filesystem.
|
||||
romfs.txt
|
||||
- Description of the ROMFS filesystem.
|
||||
smbfs.txt
|
||||
|
@ -216,4 +216,4 @@ due to an incompatibility with the Amiga floppy controller.
|
||||
|
||||
If you are interested in an Amiga Emulator for Linux, look at
|
||||
|
||||
http://www-users.informatik.rwth-aachen.de/~crux/uae.html
|
||||
http://www.freiburg.linux.de/~uae/
|
||||
|
434
Documentation/filesystems/configfs/configfs.txt
Normal file
434
Documentation/filesystems/configfs/configfs.txt
Normal file
@ -0,0 +1,434 @@
|
||||
|
||||
configfs - Userspace-driven kernel object configuation.
|
||||
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
|
||||
Updated: 31 March 2005
|
||||
|
||||
Copyright (c) 2005 Oracle Corporation,
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
|
||||
|
||||
[What is configfs?]
|
||||
|
||||
configfs is a ram-based filesystem that provides the converse of
|
||||
sysfs's functionality. Where sysfs is a filesystem-based view of
|
||||
kernel objects, configfs is a filesystem-based manager of kernel
|
||||
objects, or config_items.
|
||||
|
||||
With sysfs, an object is created in kernel (for example, when a device
|
||||
is discovered) and it is registered with sysfs. Its attributes then
|
||||
appear in sysfs, allowing userspace to read the attributes via
|
||||
readdir(3)/read(2). It may allow some attributes to be modified via
|
||||
write(2). The important point is that the object is created and
|
||||
destroyed in kernel, the kernel controls the lifecycle of the sysfs
|
||||
representation, and sysfs is merely a window on all this.
|
||||
|
||||
A configfs config_item is created via an explicit userspace operation:
|
||||
mkdir(2). It is destroyed via rmdir(2). The attributes appear at
|
||||
mkdir(2) time, and can be read or modified via read(2) and write(2).
|
||||
As with sysfs, readdir(3) queries the list of items and/or attributes.
|
||||
symlink(2) can be used to group items together. Unlike sysfs, the
|
||||
lifetime of the representation is completely driven by userspace. The
|
||||
kernel modules backing the items must respond to this.
|
||||
|
||||
Both sysfs and configfs can and should exist together on the same
|
||||
system. One is not a replacement for the other.
|
||||
|
||||
[Using configfs]
|
||||
|
||||
configfs can be compiled as a module or into the kernel. You can access
|
||||
it by doing
|
||||
|
||||
mount -t configfs none /config
|
||||
|
||||
The configfs tree will be empty unless client modules are also loaded.
|
||||
These are modules that register their item types with configfs as
|
||||
subsystems. Once a client subsystem is loaded, it will appear as a
|
||||
subdirectory (or more than one) under /config. Like sysfs, the
|
||||
configfs tree is always there, whether mounted on /config or not.
|
||||
|
||||
An item is created via mkdir(2). The item's attributes will also
|
||||
appear at this time. readdir(3) can determine what the attributes are,
|
||||
read(2) can query their default values, and write(2) can store new
|
||||
values. Like sysfs, attributes should be ASCII text files, preferably
|
||||
with only one value per file. The same efficiency caveats from sysfs
|
||||
apply. Don't mix more than one attribute in one attribute file.
|
||||
|
||||
Like sysfs, configfs expects write(2) to store the entire buffer at
|
||||
once. When writing to configfs attributes, userspace processes should
|
||||
first read the entire file, modify the portions they wish to change, and
|
||||
then write the entire buffer back. Attribute files have a maximum size
|
||||
of one page (PAGE_SIZE, 4096 on i386).
|
||||
|
||||
When an item needs to be destroyed, remove it with rmdir(2). An
|
||||
item cannot be destroyed if any other item has a link to it (via
|
||||
symlink(2)). Links can be removed via unlink(2).
|
||||
|
||||
[Configuring FakeNBD: an Example]
|
||||
|
||||
Imagine there's a Network Block Device (NBD) driver that allows you to
|
||||
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
|
||||
for its configuration. Obviously, there will be a nice program that
|
||||
sysadmins use to configure FakeNBD, but somehow that program has to tell
|
||||
the driver about it. Here's where configfs comes in.
|
||||
|
||||
When the FakeNBD driver is loaded, it registers itself with configfs.
|
||||
readdir(3) sees this just fine:
|
||||
|
||||
# ls /config
|
||||
fakenbd
|
||||
|
||||
A fakenbd connection can be created with mkdir(2). The name is
|
||||
arbitrary, but likely the tool will make some use of the name. Perhaps
|
||||
it is a uuid or a disk name:
|
||||
|
||||
# mkdir /config/fakenbd/disk1
|
||||
# ls /config/fakenbd/disk1
|
||||
target device rw
|
||||
|
||||
The target attribute contains the IP address of the server FakeNBD will
|
||||
connect to. The device attribute is the device on the server.
|
||||
Predictably, the rw attribute determines whether the connection is
|
||||
read-only or read-write.
|
||||
|
||||
# echo 10.0.0.1 > /config/fakenbd/disk1/target
|
||||
# echo /dev/sda1 > /config/fakenbd/disk1/device
|
||||
# echo 1 > /config/fakenbd/disk1/rw
|
||||
|
||||
That's it. That's all there is. Now the device is configured, via the
|
||||
shell no less.
|
||||
|
||||
[Coding With configfs]
|
||||
|
||||
Every object in configfs is a config_item. A config_item reflects an
|
||||
object in the subsystem. It has attributes that match values on that
|
||||
object. configfs handles the filesystem representation of that object
|
||||
and its attributes, allowing the subsystem to ignore all but the
|
||||
basic show/store interaction.
|
||||
|
||||
Items are created and destroyed inside a config_group. A group is a
|
||||
collection of items that share the same attributes and operations.
|
||||
Items are created by mkdir(2) and removed by rmdir(2), but configfs
|
||||
handles that. The group has a set of operations to perform these tasks
|
||||
|
||||
A subsystem is the top level of a client module. During initialization,
|
||||
the client module registers the subsystem with configfs, the subsystem
|
||||
appears as a directory at the top of the configfs filesystem. A
|
||||
subsystem is also a config_group, and can do everything a config_group
|
||||
can.
|
||||
|
||||
[struct config_item]
|
||||
|
||||
struct config_item {
|
||||
char *ci_name;
|
||||
char ci_namebuf[UOBJ_NAME_LEN];
|
||||
struct kref ci_kref;
|
||||
struct list_head ci_entry;
|
||||
struct config_item *ci_parent;
|
||||
struct config_group *ci_group;
|
||||
struct config_item_type *ci_type;
|
||||
struct dentry *ci_dentry;
|
||||
};
|
||||
|
||||
void config_item_init(struct config_item *);
|
||||
void config_item_init_type_name(struct config_item *,
|
||||
const char *name,
|
||||
struct config_item_type *type);
|
||||
struct config_item *config_item_get(struct config_item *);
|
||||
void config_item_put(struct config_item *);
|
||||
|
||||
Generally, struct config_item is embedded in a container structure, a
|
||||
structure that actually represents what the subsystem is doing. The
|
||||
config_item portion of that structure is how the object interacts with
|
||||
configfs.
|
||||
|
||||
Whether statically defined in a source file or created by a parent
|
||||
config_group, a config_item must have one of the _init() functions
|
||||
called on it. This initializes the reference count and sets up the
|
||||
appropriate fields.
|
||||
|
||||
All users of a config_item should have a reference on it via
|
||||
config_item_get(), and drop the reference when they are done via
|
||||
config_item_put().
|
||||
|
||||
By itself, a config_item cannot do much more than appear in configfs.
|
||||
Usually a subsystem wants the item to display and/or store attributes,
|
||||
among other things. For that, it needs a type.
|
||||
|
||||
[struct config_item_type]
|
||||
|
||||
struct configfs_item_operations {
|
||||
void (*release)(struct config_item *);
|
||||
ssize_t (*show_attribute)(struct config_item *,
|
||||
struct configfs_attribute *,
|
||||
char *);
|
||||
ssize_t (*store_attribute)(struct config_item *,
|
||||
struct configfs_attribute *,
|
||||
const char *, size_t);
|
||||
int (*allow_link)(struct config_item *src,
|
||||
struct config_item *target);
|
||||
int (*drop_link)(struct config_item *src,
|
||||
struct config_item *target);
|
||||
};
|
||||
|
||||
struct config_item_type {
|
||||
struct module *ct_owner;
|
||||
struct configfs_item_operations *ct_item_ops;
|
||||
struct configfs_group_operations *ct_group_ops;
|
||||
struct configfs_attribute **ct_attrs;
|
||||
};
|
||||
|
||||
The most basic function of a config_item_type is to define what
|
||||
operations can be performed on a config_item. All items that have been
|
||||
allocated dynamically will need to provide the ct_item_ops->release()
|
||||
method. This method is called when the config_item's reference count
|
||||
reaches zero. Items that wish to display an attribute need to provide
|
||||
the ct_item_ops->show_attribute() method. Similarly, storing a new
|
||||
attribute value uses the store_attribute() method.
|
||||
|
||||
[struct configfs_attribute]
|
||||
|
||||
struct configfs_attribute {
|
||||
char *ca_name;
|
||||
struct module *ca_owner;
|
||||
mode_t ca_mode;
|
||||
};
|
||||
|
||||
When a config_item wants an attribute to appear as a file in the item's
|
||||
configfs directory, it must define a configfs_attribute describing it.
|
||||
It then adds the attribute to the NULL-terminated array
|
||||
config_item_type->ct_attrs. When the item appears in configfs, the
|
||||
attribute file will appear with the configfs_attribute->ca_name
|
||||
filename. configfs_attribute->ca_mode specifies the file permissions.
|
||||
|
||||
If an attribute is readable and the config_item provides a
|
||||
ct_item_ops->show_attribute() method, that method will be called
|
||||
whenever userspace asks for a read(2) on the attribute. The converse
|
||||
will happen for write(2).
|
||||
|
||||
[struct config_group]
|
||||
|
||||
A config_item cannot live in a vaccum. The only way one can be created
|
||||
is via mkdir(2) on a config_group. This will trigger creation of a
|
||||
child item.
|
||||
|
||||
struct config_group {
|
||||
struct config_item cg_item;
|
||||
struct list_head cg_children;
|
||||
struct configfs_subsystem *cg_subsys;
|
||||
struct config_group **default_groups;
|
||||
};
|
||||
|
||||
void config_group_init(struct config_group *group);
|
||||
void config_group_init_type_name(struct config_group *group,
|
||||
const char *name,
|
||||
struct config_item_type *type);
|
||||
|
||||
|
||||
The config_group structure contains a config_item. Properly configuring
|
||||
that item means that a group can behave as an item in its own right.
|
||||
However, it can do more: it can create child items or groups. This is
|
||||
accomplished via the group operations specified on the group's
|
||||
config_item_type.
|
||||
|
||||
struct configfs_group_operations {
|
||||
struct config_item *(*make_item)(struct config_group *group,
|
||||
const char *name);
|
||||
struct config_group *(*make_group)(struct config_group *group,
|
||||
const char *name);
|
||||
int (*commit_item)(struct config_item *item);
|
||||
void (*drop_item)(struct config_group *group,
|
||||
struct config_item *item);
|
||||
};
|
||||
|
||||
A group creates child items by providing the
|
||||
ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
|
||||
config_item (or more likely, its container structure), initializes it,
|
||||
and returns it to configfs. Configfs will then populate the filesystem
|
||||
tree to reflect the new item.
|
||||
|
||||
If the subsystem wants the child to be a group itself, the subsystem
|
||||
provides ct_group_ops->make_group(). Everything else behaves the same,
|
||||
using the group _init() functions on the group.
|
||||
|
||||
Finally, when userspace calls rmdir(2) on the item or group,
|
||||
ct_group_ops->drop_item() is called. As a config_group is also a
|
||||
config_item, it is not necessary for a seperate drop_group() method.
|
||||
The subsystem must config_item_put() the reference that was initialized
|
||||
upon item allocation. If a subsystem has no work to do, it may omit
|
||||
the ct_group_ops->drop_item() method, and configfs will call
|
||||
config_item_put() on the item on behalf of the subsystem.
|
||||
|
||||
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
|
||||
is called, configfs WILL remove the item from the filesystem tree
|
||||
(assuming that it has no children to keep it busy). The subsystem is
|
||||
responsible for responding to this. If the subsystem has references to
|
||||
the item in other threads, the memory is safe. It may take some time
|
||||
for the item to actually disappear from the subsystem's usage. But it
|
||||
is gone from configfs.
|
||||
|
||||
A config_group cannot be removed while it still has child items. This
|
||||
is implemented in the configfs rmdir(2) code. ->drop_item() will not be
|
||||
called, as the item has not been dropped. rmdir(2) will fail, as the
|
||||
directory is not empty.
|
||||
|
||||
[struct configfs_subsystem]
|
||||
|
||||
A subsystem must register itself, ususally at module_init time. This
|
||||
tells configfs to make the subsystem appear in the file tree.
|
||||
|
||||
struct configfs_subsystem {
|
||||
struct config_group su_group;
|
||||
struct semaphore su_sem;
|
||||
};
|
||||
|
||||
int configfs_register_subsystem(struct configfs_subsystem *subsys);
|
||||
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
|
||||
|
||||
A subsystem consists of a toplevel config_group and a semaphore.
|
||||
The group is where child config_items are created. For a subsystem,
|
||||
this group is usually defined statically. Before calling
|
||||
configfs_register_subsystem(), the subsystem must have initialized the
|
||||
group via the usual group _init() functions, and it must also have
|
||||
initialized the semaphore.
|
||||
When the register call returns, the subsystem is live, and it
|
||||
will be visible via configfs. At that point, mkdir(2) can be called and
|
||||
the subsystem must be ready for it.
|
||||
|
||||
[An Example]
|
||||
|
||||
The best example of these basic concepts is the simple_children
|
||||
subsystem/group and the simple_child item in configfs_example.c It
|
||||
shows a trivial object displaying and storing an attribute, and a simple
|
||||
group creating and destroying these children.
|
||||
|
||||
[Hierarchy Navigation and the Subsystem Semaphore]
|
||||
|
||||
There is an extra bonus that configfs provides. The config_groups and
|
||||
config_items are arranged in a hierarchy due to the fact that they
|
||||
appear in a filesystem. A subsystem is NEVER to touch the filesystem
|
||||
parts, but the subsystem might be interested in this hierarchy. For
|
||||
this reason, the hierarchy is mirrored via the config_group->cg_children
|
||||
and config_item->ci_parent structure members.
|
||||
|
||||
A subsystem can navigate the cg_children list and the ci_parent pointer
|
||||
to see the tree created by the subsystem. This can race with configfs'
|
||||
management of the hierarchy, so configfs uses the subsystem semaphore to
|
||||
protect modifications. Whenever a subsystem wants to navigate the
|
||||
hierarchy, it must do so under the protection of the subsystem
|
||||
semaphore.
|
||||
|
||||
A subsystem will be prevented from acquiring the semaphore while a newly
|
||||
allocated item has not been linked into this hierarchy. Similarly, it
|
||||
will not be able to acquire the semaphore while a dropping item has not
|
||||
yet been unlinked. This means that an item's ci_parent pointer will
|
||||
never be NULL while the item is in configfs, and that an item will only
|
||||
be in its parent's cg_children list for the same duration. This allows
|
||||
a subsystem to trust ci_parent and cg_children while they hold the
|
||||
semaphore.
|
||||
|
||||
[Item Aggregation Via symlink(2)]
|
||||
|
||||
configfs provides a simple group via the group->item parent/child
|
||||
relationship. Often, however, a larger environment requires aggregation
|
||||
outside of the parent/child connection. This is implemented via
|
||||
symlink(2).
|
||||
|
||||
A config_item may provide the ct_item_ops->allow_link() and
|
||||
ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
|
||||
symlink(2) may be called with the config_item as the source of the link.
|
||||
These links are only allowed between configfs config_items. Any
|
||||
symlink(2) attempt outside the configfs filesystem will be denied.
|
||||
|
||||
When symlink(2) is called, the source config_item's ->allow_link()
|
||||
method is called with itself and a target item. If the source item
|
||||
allows linking to target item, it returns 0. A source item may wish to
|
||||
reject a link if it only wants links to a certain type of object (say,
|
||||
in its own subsystem).
|
||||
|
||||
When unlink(2) is called on the symbolic link, the source item is
|
||||
notified via the ->drop_link() method. Like the ->drop_item() method,
|
||||
this is a void function and cannot return failure. The subsystem is
|
||||
responsible for responding to the change.
|
||||
|
||||
A config_item cannot be removed while it links to any other item, nor
|
||||
can it be removed while an item links to it. Dangling symlinks are not
|
||||
allowed in configfs.
|
||||
|
||||
[Automatically Created Subgroups]
|
||||
|
||||
A new config_group may want to have two types of child config_items.
|
||||
While this could be codified by magic names in ->make_item(), it is much
|
||||
more explicit to have a method whereby userspace sees this divergence.
|
||||
|
||||
Rather than have a group where some items behave differently than
|
||||
others, configfs provides a method whereby one or many subgroups are
|
||||
automatically created inside the parent at its creation. Thus,
|
||||
mkdir("parent) results in "parent", "parent/subgroup1", up through
|
||||
"parent/subgroupN". Items of type 1 can now be created in
|
||||
"parent/subgroup1", and items of type N can be created in
|
||||
"parent/subgroupN".
|
||||
|
||||
These automatic subgroups, or default groups, do not preclude other
|
||||
children of the parent group. If ct_group_ops->make_group() exists,
|
||||
other child groups can be created on the parent group directly.
|
||||
|
||||
A configfs subsystem specifies default groups by filling in the
|
||||
NULL-terminated array default_groups on the config_group structure.
|
||||
Each group in that array is populated in the configfs tree at the same
|
||||
time as the parent group. Similarly, they are removed at the same time
|
||||
as the parent. No extra notification is provided. When a ->drop_item()
|
||||
method call notifies the subsystem the parent group is going away, it
|
||||
also means every default group child associated with that parent group.
|
||||
|
||||
As a consequence of this, default_groups cannot be removed directly via
|
||||
rmdir(2). They also are not considered when rmdir(2) on the parent
|
||||
group is checking for children.
|
||||
|
||||
[Committable Items]
|
||||
|
||||
NOTE: Committable items are currently unimplemented.
|
||||
|
||||
Some config_items cannot have a valid initial state. That is, no
|
||||
default values can be specified for the item's attributes such that the
|
||||
item can do its work. Userspace must configure one or more attributes,
|
||||
after which the subsystem can start whatever entity this item
|
||||
represents.
|
||||
|
||||
Consider the FakeNBD device from above. Without a target address *and*
|
||||
a target device, the subsystem has no idea what block device to import.
|
||||
The simple example assumes that the subsystem merely waits until all the
|
||||
appropriate attributes are configured, and then connects. This will,
|
||||
indeed, work, but now every attribute store must check if the attributes
|
||||
are initialized. Every attribute store must fire off the connection if
|
||||
that condition is met.
|
||||
|
||||
Far better would be an explicit action notifying the subsystem that the
|
||||
config_item is ready to go. More importantly, an explicit action allows
|
||||
the subsystem to provide feedback as to whether the attibutes are
|
||||
initialized in a way that makes sense. configfs provides this as
|
||||
committable items.
|
||||
|
||||
configfs still uses only normal filesystem operations. An item is
|
||||
committed via rename(2). The item is moved from a directory where it
|
||||
can be modified to a directory where it cannot.
|
||||
|
||||
Any group that provides the ct_group_ops->commit_item() method has
|
||||
committable items. When this group appears in configfs, mkdir(2) will
|
||||
not work directly in the group. Instead, the group will have two
|
||||
subdirectories: "live" and "pending". The "live" directory does not
|
||||
support mkdir(2) or rmdir(2) either. It only allows rename(2). The
|
||||
"pending" directory does allow mkdir(2) and rmdir(2). An item is
|
||||
created in the "pending" directory. Its attributes can be modified at
|
||||
will. Userspace commits the item by renaming it into the "live"
|
||||
directory. At this point, the subsystem recieves the ->commit_item()
|
||||
callback. If all required attributes are filled to satisfaction, the
|
||||
method returns zero and the item is moved to the "live" directory.
|
||||
|
||||
As rmdir(2) does not work in the "live" directory, an item must be
|
||||
shutdown, or "uncommitted". Again, this is done via rename(2), this
|
||||
time from the "live" directory back to the "pending" one. The subsystem
|
||||
is notified by the ct_group_ops->uncommit_object() method.
|
||||
|
||||
|
474
Documentation/filesystems/configfs/configfs_example.c
Normal file
474
Documentation/filesystems/configfs/configfs_example.c
Normal file
@ -0,0 +1,474 @@
|
||||
/*
|
||||
* vim: noexpandtab ts=8 sts=0 sw=8:
|
||||
*
|
||||
* configfs_example.c - This file is a demonstration module containing
|
||||
* a number of configfs subsystems.
|
||||
*
|
||||
* This program is free software; you can redistribute it and/or
|
||||
* modify it under the terms of the GNU General Public
|
||||
* License as published by the Free Software Foundation; either
|
||||
* version 2 of the License, or (at your option) any later version.
|
||||
*
|
||||
* This program is distributed in the hope that it will be useful,
|
||||
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
||||
* General Public License for more details.
|
||||
*
|
||||
* You should have received a copy of the GNU General Public
|
||||
* License along with this program; if not, write to the
|
||||
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
||||
* Boston, MA 021110-1307, USA.
|
||||
*
|
||||
* Based on sysfs:
|
||||
* sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
|
||||
*
|
||||
* configfs Copyright (C) 2005 Oracle. All rights reserved.
|
||||
*/
|
||||
|
||||
#include <linux/init.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/slab.h>
|
||||
|
||||
#include <linux/configfs.h>
|
||||
|
||||
|
||||
|
||||
/*
|
||||
* 01-childless
|
||||
*
|
||||
* This first example is a childless subsystem. It cannot create
|
||||
* any config_items. It just has attributes.
|
||||
*
|
||||
* Note that we are enclosing the configfs_subsystem inside a container.
|
||||
* This is not necessary if a subsystem has no attributes directly
|
||||
* on the subsystem. See the next example, 02-simple-children, for
|
||||
* such a subsystem.
|
||||
*/
|
||||
|
||||
struct childless {
|
||||
struct configfs_subsystem subsys;
|
||||
int showme;
|
||||
int storeme;
|
||||
};
|
||||
|
||||
struct childless_attribute {
|
||||
struct configfs_attribute attr;
|
||||
ssize_t (*show)(struct childless *, char *);
|
||||
ssize_t (*store)(struct childless *, const char *, size_t);
|
||||
};
|
||||
|
||||
static inline struct childless *to_childless(struct config_item *item)
|
||||
{
|
||||
return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
|
||||
}
|
||||
|
||||
static ssize_t childless_showme_read(struct childless *childless,
|
||||
char *page)
|
||||
{
|
||||
ssize_t pos;
|
||||
|
||||
pos = sprintf(page, "%d\n", childless->showme);
|
||||
childless->showme++;
|
||||
|
||||
return pos;
|
||||
}
|
||||
|
||||
static ssize_t childless_storeme_read(struct childless *childless,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page, "%d\n", childless->storeme);
|
||||
}
|
||||
|
||||
static ssize_t childless_storeme_write(struct childless *childless,
|
||||
const char *page,
|
||||
size_t count)
|
||||
{
|
||||
unsigned long tmp;
|
||||
char *p = (char *) page;
|
||||
|
||||
tmp = simple_strtoul(p, &p, 10);
|
||||
if (!p || (*p && (*p != '\n')))
|
||||
return -EINVAL;
|
||||
|
||||
if (tmp > INT_MAX)
|
||||
return -ERANGE;
|
||||
|
||||
childless->storeme = tmp;
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static ssize_t childless_description_read(struct childless *childless,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page,
|
||||
"[01-childless]\n"
|
||||
"\n"
|
||||
"The childless subsystem is the simplest possible subsystem in\n"
|
||||
"configfs. It does not support the creation of child config_items.\n"
|
||||
"It only has a few attributes. In fact, it isn't much different\n"
|
||||
"than a directory in /proc.\n");
|
||||
}
|
||||
|
||||
static struct childless_attribute childless_attr_showme = {
|
||||
.attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
|
||||
.show = childless_showme_read,
|
||||
};
|
||||
static struct childless_attribute childless_attr_storeme = {
|
||||
.attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
|
||||
.show = childless_storeme_read,
|
||||
.store = childless_storeme_write,
|
||||
};
|
||||
static struct childless_attribute childless_attr_description = {
|
||||
.attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
|
||||
.show = childless_description_read,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *childless_attrs[] = {
|
||||
&childless_attr_showme.attr,
|
||||
&childless_attr_storeme.attr,
|
||||
&childless_attr_description.attr,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t childless_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
struct childless *childless = to_childless(item);
|
||||
struct childless_attribute *childless_attr =
|
||||
container_of(attr, struct childless_attribute, attr);
|
||||
ssize_t ret = 0;
|
||||
|
||||
if (childless_attr->show)
|
||||
ret = childless_attr->show(childless, page);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static ssize_t childless_attr_store(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
const char *page, size_t count)
|
||||
{
|
||||
struct childless *childless = to_childless(item);
|
||||
struct childless_attribute *childless_attr =
|
||||
container_of(attr, struct childless_attribute, attr);
|
||||
ssize_t ret = -EINVAL;
|
||||
|
||||
if (childless_attr->store)
|
||||
ret = childless_attr->store(childless, page, count);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static struct configfs_item_operations childless_item_ops = {
|
||||
.show_attribute = childless_attr_show,
|
||||
.store_attribute = childless_attr_store,
|
||||
};
|
||||
|
||||
static struct config_item_type childless_type = {
|
||||
.ct_item_ops = &childless_item_ops,
|
||||
.ct_attrs = childless_attrs,
|
||||
.ct_owner = THIS_MODULE,
|
||||
};
|
||||
|
||||
static struct childless childless_subsys = {
|
||||
.subsys = {
|
||||
.su_group = {
|
||||
.cg_item = {
|
||||
.ci_namebuf = "01-childless",
|
||||
.ci_type = &childless_type,
|
||||
},
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* 02-simple-children
|
||||
*
|
||||
* This example merely has a simple one-attribute child. Note that
|
||||
* there is no extra attribute structure, as the child's attribute is
|
||||
* known from the get-go. Also, there is no container for the
|
||||
* subsystem, as it has no attributes of its own.
|
||||
*/
|
||||
|
||||
struct simple_child {
|
||||
struct config_item item;
|
||||
int storeme;
|
||||
};
|
||||
|
||||
static inline struct simple_child *to_simple_child(struct config_item *item)
|
||||
{
|
||||
return item ? container_of(item, struct simple_child, item) : NULL;
|
||||
}
|
||||
|
||||
static struct configfs_attribute simple_child_attr_storeme = {
|
||||
.ca_owner = THIS_MODULE,
|
||||
.ca_name = "storeme",
|
||||
.ca_mode = S_IRUGO | S_IWUSR,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *simple_child_attrs[] = {
|
||||
&simple_child_attr_storeme,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t simple_child_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
ssize_t count;
|
||||
struct simple_child *simple_child = to_simple_child(item);
|
||||
|
||||
count = sprintf(page, "%d\n", simple_child->storeme);
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static ssize_t simple_child_attr_store(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
const char *page, size_t count)
|
||||
{
|
||||
struct simple_child *simple_child = to_simple_child(item);
|
||||
unsigned long tmp;
|
||||
char *p = (char *) page;
|
||||
|
||||
tmp = simple_strtoul(p, &p, 10);
|
||||
if (!p || (*p && (*p != '\n')))
|
||||
return -EINVAL;
|
||||
|
||||
if (tmp > INT_MAX)
|
||||
return -ERANGE;
|
||||
|
||||
simple_child->storeme = tmp;
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static void simple_child_release(struct config_item *item)
|
||||
{
|
||||
kfree(to_simple_child(item));
|
||||
}
|
||||
|
||||
static struct configfs_item_operations simple_child_item_ops = {
|
||||
.release = simple_child_release,
|
||||
.show_attribute = simple_child_attr_show,
|
||||
.store_attribute = simple_child_attr_store,
|
||||
};
|
||||
|
||||
static struct config_item_type simple_child_type = {
|
||||
.ct_item_ops = &simple_child_item_ops,
|
||||
.ct_attrs = simple_child_attrs,
|
||||
.ct_owner = THIS_MODULE,
|
||||
};
|
||||
|
||||
|
||||
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
|
||||
{
|
||||
struct simple_child *simple_child;
|
||||
|
||||
simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL);
|
||||
if (!simple_child)
|
||||
return NULL;
|
||||
|
||||
memset(simple_child, 0, sizeof(struct simple_child));
|
||||
|
||||
config_item_init_type_name(&simple_child->item, name,
|
||||
&simple_child_type);
|
||||
|
||||
simple_child->storeme = 0;
|
||||
|
||||
return &simple_child->item;
|
||||
}
|
||||
|
||||
static struct configfs_attribute simple_children_attr_description = {
|
||||
.ca_owner = THIS_MODULE,
|
||||
.ca_name = "description",
|
||||
.ca_mode = S_IRUGO,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *simple_children_attrs[] = {
|
||||
&simple_children_attr_description,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t simple_children_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page,
|
||||
"[02-simple-children]\n"
|
||||
"\n"
|
||||
"This subsystem allows the creation of child config_items. These\n"
|
||||
"items have only one attribute that is readable and writeable.\n");
|
||||
}
|
||||
|
||||
static struct configfs_item_operations simple_children_item_ops = {
|
||||
.show_attribute = simple_children_attr_show,
|
||||
};
|
||||
|
||||
/*
|
||||
* Note that, since no extra work is required on ->drop_item(),
|
||||
* no ->drop_item() is provided.
|
||||
*/
|
||||
static struct configfs_group_operations simple_children_group_ops = {
|
||||
.make_item = simple_children_make_item,
|
||||
};
|
||||
|
||||
static struct config_item_type simple_children_type = {
|
||||
.ct_item_ops = &simple_children_item_ops,
|
||||
.ct_group_ops = &simple_children_group_ops,
|
||||
.ct_attrs = simple_children_attrs,
|
||||
};
|
||||
|
||||
static struct configfs_subsystem simple_children_subsys = {
|
||||
.su_group = {
|
||||
.cg_item = {
|
||||
.ci_namebuf = "02-simple-children",
|
||||
.ci_type = &simple_children_type,
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* 03-group-children
|
||||
*
|
||||
* This example reuses the simple_children group from above. However,
|
||||
* the simple_children group is not the subsystem itself, it is a
|
||||
* child of the subsystem. Creation of a group in the subsystem creates
|
||||
* a new simple_children group. That group can then have simple_child
|
||||
* children of its own.
|
||||
*/
|
||||
|
||||
struct simple_children {
|
||||
struct config_group group;
|
||||
};
|
||||
|
||||
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
|
||||
{
|
||||
struct simple_children *simple_children;
|
||||
|
||||
simple_children = kmalloc(sizeof(struct simple_children),
|
||||
GFP_KERNEL);
|
||||
if (!simple_children)
|
||||
return NULL;
|
||||
|
||||
memset(simple_children, 0, sizeof(struct simple_children));
|
||||
|
||||
config_group_init_type_name(&simple_children->group, name,
|
||||
&simple_children_type);
|
||||
|
||||
return &simple_children->group;
|
||||
}
|
||||
|
||||
static struct configfs_attribute group_children_attr_description = {
|
||||
.ca_owner = THIS_MODULE,
|
||||
.ca_name = "description",
|
||||
.ca_mode = S_IRUGO,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *group_children_attrs[] = {
|
||||
&group_children_attr_description,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t group_children_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page,
|
||||
"[03-group-children]\n"
|
||||
"\n"
|
||||
"This subsystem allows the creation of child config_groups. These\n"
|
||||
"groups are like the subsystem simple-children.\n");
|
||||
}
|
||||
|
||||
static struct configfs_item_operations group_children_item_ops = {
|
||||
.show_attribute = group_children_attr_show,
|
||||
};
|
||||
|
||||
/*
|
||||
* Note that, since no extra work is required on ->drop_item(),
|
||||
* no ->drop_item() is provided.
|
||||
*/
|
||||
static struct configfs_group_operations group_children_group_ops = {
|
||||
.make_group = group_children_make_group,
|
||||
};
|
||||
|
||||
static struct config_item_type group_children_type = {
|
||||
.ct_item_ops = &group_children_item_ops,
|
||||
.ct_group_ops = &group_children_group_ops,
|
||||
.ct_attrs = group_children_attrs,
|
||||
};
|
||||
|
||||
static struct configfs_subsystem group_children_subsys = {
|
||||
.su_group = {
|
||||
.cg_item = {
|
||||
.ci_namebuf = "03-group-children",
|
||||
.ci_type = &group_children_type,
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
/* ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* We're now done with our subsystem definitions.
|
||||
* For convenience in this module, here's a list of them all. It
|
||||
* allows the init function to easily register them. Most modules
|
||||
* will only have one subsystem, and will only call register_subsystem
|
||||
* on it directly.
|
||||
*/
|
||||
static struct configfs_subsystem *example_subsys[] = {
|
||||
&childless_subsys.subsys,
|
||||
&simple_children_subsys,
|
||||
&group_children_subsys,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static int __init configfs_example_init(void)
|
||||
{
|
||||
int ret;
|
||||
int i;
|
||||
struct configfs_subsystem *subsys;
|
||||
|
||||
for (i = 0; example_subsys[i]; i++) {
|
||||
subsys = example_subsys[i];
|
||||
|
||||
config_group_init(&subsys->su_group);
|
||||
init_MUTEX(&subsys->su_sem);
|
||||
ret = configfs_register_subsystem(subsys);
|
||||
if (ret) {
|
||||
printk(KERN_ERR "Error %d while registering subsystem %s\n",
|
||||
ret,
|
||||
subsys->su_group.cg_item.ci_namebuf);
|
||||
goto out_unregister;
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
|
||||
out_unregister:
|
||||
for (; i >= 0; i--) {
|
||||
configfs_unregister_subsystem(example_subsys[i]);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void __exit configfs_example_exit(void)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; example_subsys[i]; i++) {
|
||||
configfs_unregister_subsystem(example_subsys[i]);
|
||||
}
|
||||
}
|
||||
|
||||
module_init(configfs_example_init);
|
||||
module_exit(configfs_example_exit);
|
||||
MODULE_LICENSE("GPL");
|
130
Documentation/filesystems/dlmfs.txt
Normal file
130
Documentation/filesystems/dlmfs.txt
Normal file
@ -0,0 +1,130 @@
|
||||
dlmfs
|
||||
==================
|
||||
A minimal DLM userspace interface implemented via a virtual file
|
||||
system.
|
||||
|
||||
dlmfs is built with OCFS2 as it requires most of its infrastructure.
|
||||
|
||||
Project web page: http://oss.oracle.com/projects/ocfs2
|
||||
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
|
||||
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
|
||||
|
||||
All code copyright 2005 Oracle except when otherwise noted.
|
||||
|
||||
CREDITS
|
||||
=======
|
||||
|
||||
Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
|
||||
and Transmeta Corp.
|
||||
|
||||
Mark Fasheh <mark.fasheh@oracle.com>
|
||||
|
||||
Caveats
|
||||
=======
|
||||
- Right now it only works with the OCFS2 DLM, though support for other
|
||||
DLM implementations should not be a major issue.
|
||||
|
||||
Mount options
|
||||
=============
|
||||
None
|
||||
|
||||
Usage
|
||||
=====
|
||||
|
||||
If you're just interested in OCFS2, then please see ocfs2.txt. The
|
||||
rest of this document will be geared towards those who want to use
|
||||
dlmfs for easy to setup and easy to use clustered locking in
|
||||
userspace.
|
||||
|
||||
Setup
|
||||
=====
|
||||
|
||||
dlmfs requires that the OCFS2 cluster infrastructure be in
|
||||
place. Please download ocfs2-tools from the above url and configure a
|
||||
cluster.
|
||||
|
||||
You'll want to start heartbeating on a volume which all the nodes in
|
||||
your lockspace can access. The easiest way to do this is via
|
||||
ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
|
||||
that an OCFS2 file system be in place so that it can automatically
|
||||
find it's heartbeat area, though it will eventually support heartbeat
|
||||
against raw disks.
|
||||
|
||||
Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
|
||||
with ocfs2-tools.
|
||||
|
||||
Once you're heartbeating, DLM lock 'domains' can be easily created /
|
||||
destroyed and locks within them accessed.
|
||||
|
||||
Locking
|
||||
=======
|
||||
|
||||
Users may access dlmfs via standard file system calls, or they can use
|
||||
'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
|
||||
system calls and presents a more traditional locking api.
|
||||
|
||||
dlmfs handles lock caching automatically for the user, so a lock
|
||||
request for an already acquired lock will not generate another DLM
|
||||
call. Userspace programs are assumed to handle their own local
|
||||
locking.
|
||||
|
||||
Two levels of locks are supported - Shared Read, and Exlcusive.
|
||||
Also supported is a Trylock operation.
|
||||
|
||||
For information on the libo2dlm interface, please see o2dlm.h,
|
||||
distributed with ocfs2-tools.
|
||||
|
||||
Lock value blocks can be read and written to a resource via read(2)
|
||||
and write(2) against the fd obtained via your open(2) call. The
|
||||
maximum currently supported LVB length is 64 bytes (though that is an
|
||||
OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
|
||||
small amounts of data amongst their nodes.
|
||||
|
||||
mkdir(2) signals dlmfs to join a domain (which will have the same name
|
||||
as the resulting directory)
|
||||
|
||||
rmdir(2) signals dlmfs to leave the domain
|
||||
|
||||
Locks for a given domain are represented by regular inodes inside the
|
||||
domain directory. Locking against them is done via the open(2) system
|
||||
call.
|
||||
|
||||
The open(2) call will not return until your lock has been granted or
|
||||
an error has occurred, unless it has been instructed to do a trylock
|
||||
operation. If the lock succeeds, you'll get an fd.
|
||||
|
||||
open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
|
||||
not automatically create inodes for existing lock resources.
|
||||
|
||||
Open Flag Lock Request Type
|
||||
--------- -----------------
|
||||
O_RDONLY Shared Read
|
||||
O_RDWR Exclusive
|
||||
|
||||
Open Flag Resulting Locking Behavior
|
||||
--------- --------------------------
|
||||
O_NONBLOCK Trylock operation
|
||||
|
||||
You must provide exactly one of O_RDONLY or O_RDWR.
|
||||
|
||||
If O_NONBLOCK is also provided and the trylock operation was valid but
|
||||
could not lock the resource then open(2) will return ETXTBUSY.
|
||||
|
||||
close(2) drops the lock associated with your fd.
|
||||
|
||||
Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
|
||||
supported locally as well. This means you can use them to restrict
|
||||
access to the resources via dlmfs on your local node only.
|
||||
|
||||
The resource LVB may be read from the fd in either Shared Read or
|
||||
Exclusive modes via the read(2) system call. It can be written via
|
||||
write(2) only when open in Exclusive mode.
|
||||
|
||||
Once written, an LVB will be visible to other nodes who obtain Read
|
||||
Only or higher level locks on the resource.
|
||||
|
||||
See Also
|
||||
========
|
||||
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
|
||||
|
||||
For more information on the VMS distributed locking API.
|
@ -369,9 +369,8 @@ The kernel source file:/usr/src/linux/fs/ext2/
|
||||
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
|
||||
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
|
||||
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
|
||||
Hashed Directories http://kernelnewbies.org/~phillips/htree/
|
||||
Filesystem Resizing http://ext2resize.sourceforge.net/
|
||||
Compression (*) http://www.netspace.net.au/~reiter/e2compr/
|
||||
Compression (*) http://e2compr.sourceforge.net/
|
||||
|
||||
Implementations for:
|
||||
Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
|
||||
|
@ -2,11 +2,11 @@
|
||||
Ext3 Filesystem
|
||||
===============
|
||||
|
||||
ext3 was originally released in September 1999. Written by Stephen Tweedie
|
||||
for 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
|
||||
Ext3 was originally released in September 1999. Written by Stephen Tweedie
|
||||
for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
|
||||
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
|
||||
|
||||
ext3 is ext2 filesystem enhanced with journalling capabilities.
|
||||
Ext3 is the ext2 filesystem enhanced with journalling capabilities.
|
||||
|
||||
Options
|
||||
=======
|
||||
@ -14,76 +14,81 @@ Options
|
||||
When mounting an ext3 filesystem, the following option are accepted:
|
||||
(*) == default
|
||||
|
||||
jounal=update Update the ext3 file system's journal to the
|
||||
current format.
|
||||
journal=update Update the ext3 file system's journal to the current
|
||||
format.
|
||||
|
||||
journal=inum When a journal already exists, this option is
|
||||
ignored. Otherwise, it specifies the number of
|
||||
the inode which will represent the ext3 file
|
||||
system's journal file.
|
||||
journal=inum When a journal already exists, this option is ignored.
|
||||
Otherwise, it specifies the number of the inode which
|
||||
will represent the ext3 file system's journal file.
|
||||
|
||||
journal_dev=devnum When the external journal device's major/minor numbers
|
||||
have changed, this option allows the user to specify
|
||||
the new journal location. The journal device is
|
||||
identified through its new major/minor numbers encoded
|
||||
in devnum.
|
||||
|
||||
noload Don't load the journal on mounting.
|
||||
|
||||
data=journal All data are committed into the journal prior
|
||||
to being written into the main file system.
|
||||
data=journal All data are committed into the journal prior to being
|
||||
written into the main file system.
|
||||
|
||||
data=ordered (*) All data are forced directly out to the main file
|
||||
system prior to its metadata being committed to
|
||||
the journal.
|
||||
system prior to its metadata being committed to the
|
||||
journal.
|
||||
|
||||
data=writeback Data ordering is not preserved, data may be
|
||||
written into the main file system after its
|
||||
metadata has been committed to the journal.
|
||||
data=writeback Data ordering is not preserved, data may be written
|
||||
into the main file system after its metadata has been
|
||||
committed to the journal.
|
||||
|
||||
commit=nrsec (*) Ext3 can be told to sync all its data and metadata
|
||||
every 'nrsec' seconds. The default value is 5 seconds.
|
||||
This means that if you lose your power, you will lose,
|
||||
as much, the latest 5 seconds of work (your filesystem
|
||||
will not be damaged though, thanks to journaling). This
|
||||
default value (or any low value) will hurt performance,
|
||||
but it's good for data-safety. Setting it to 0 will
|
||||
have the same effect than leaving the default 5 sec.
|
||||
This means that if you lose your power, you will lose
|
||||
as much as the latest 5 seconds of work (your
|
||||
filesystem will not be damaged though, thanks to the
|
||||
journaling). This default value (or any low value)
|
||||
will hurt performance, but it's good for data-safety.
|
||||
Setting it to 0 will have the same effect as leaving
|
||||
it at the default (5 seconds).
|
||||
Setting it to very large values will improve
|
||||
performance.
|
||||
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables it,
|
||||
barrier=1 enables it.
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables
|
||||
it, barrier=1 enables it.
|
||||
|
||||
orlov (*) This enables the new Orlov block allocator. It's enabled
|
||||
by default.
|
||||
orlov (*) This enables the new Orlov block allocator. It is
|
||||
enabled by default.
|
||||
|
||||
oldalloc This disables the Orlov block allocator and enables the
|
||||
old block allocator. Orlov should have better performance,
|
||||
we'd like to get some feedback if it's the contrary for
|
||||
you.
|
||||
oldalloc This disables the Orlov block allocator and enables
|
||||
the old block allocator. Orlov should have better
|
||||
performance - we'd like to get some feedback if it's
|
||||
the contrary for you.
|
||||
|
||||
user_xattr (*) Enables POSIX Extended Attributes. It's enabled by
|
||||
default, however you need to confifure its support
|
||||
(CONFIG_EXT3_FS_XATTR). This is neccesary if you want
|
||||
to use POSIX Acces Control Lists support. You can visit
|
||||
http://acl.bestbits.at to know more about POSIX Extended
|
||||
attributes.
|
||||
user_xattr Enables Extended User Attributes. Additionally, you
|
||||
need to have extended attribute support enabled in the
|
||||
kernel configuration (CONFIG_EXT3_FS_XATTR). See the
|
||||
attr(5) manual page and http://acl.bestbits.at/ to
|
||||
learn more about extended attributes.
|
||||
|
||||
nouser_xattr Disables POSIX Extended Attributes.
|
||||
nouser_xattr Disables Extended User Attributes.
|
||||
|
||||
acl (*) Enables POSIX Access Control Lists support. This is
|
||||
enabled by default, however you need to configure
|
||||
its support (CONFIG_EXT3_FS_POSIX_ACL). If you want
|
||||
to know more about ACLs visit http://acl.bestbits.at
|
||||
acl Enables POSIX Access Control Lists support.
|
||||
Additionally, you need to have ACL support enabled in
|
||||
the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
|
||||
See the acl(5) manual page and http://acl.bestbits.at/
|
||||
for more information.
|
||||
|
||||
noacl This option disables POSIX Access Control List support.
|
||||
noacl This option disables POSIX Access Control List
|
||||
support.
|
||||
|
||||
reservation
|
||||
|
||||
noreservation
|
||||
|
||||
resize=
|
||||
|
||||
bsddf (*) Make 'df' act like BSD.
|
||||
minixdf Make 'df' act like Minix.
|
||||
|
||||
check=none Don't do extra checking of bitmaps on mount.
|
||||
nocheck
|
||||
nocheck
|
||||
|
||||
debug Extra debugging information is sent to syslog.
|
||||
|
||||
@ -92,7 +97,7 @@ errors=continue Keep going on a filesystem error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
|
||||
grpid Give objects the same group ID as their creator.
|
||||
bsdgroups
|
||||
bsdgroups
|
||||
|
||||
nogrpid (*) New objects have the group ID of their creator.
|
||||
sysvgroups
|
||||
@ -103,81 +108,83 @@ resuid=n The user ID which may use the reserved blocks.
|
||||
|
||||
sb=n Use alternate superblock at this location.
|
||||
|
||||
quota Quota options are currently silently ignored.
|
||||
noquota (see fs/ext3/super.c, line 594)
|
||||
quota
|
||||
noquota
|
||||
grpquota
|
||||
usrquota
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
ext3 shares all disk implementation with ext2 filesystem, and add
|
||||
transactions capabilities to ext2. Journaling is done by the
|
||||
Journaling block device layer.
|
||||
Ext3 shares all disk implementation with the ext2 filesystem, and adds
|
||||
transactions capabilities to ext2. Journaling is done by the Journaling Block
|
||||
Device layer.
|
||||
|
||||
Journaling Block Device layer
|
||||
-----------------------------
|
||||
The Journaling Block Device layer (JBD) isn't ext3 specific. It was
|
||||
design to add journaling capabilities on a block device. The ext3
|
||||
filesystem code will inform the JBD of modifications it is performing
|
||||
(Call a transaction). the journal support the transactions start and
|
||||
stop, and in case of crash, the journal can replayed the transactions
|
||||
to put the partition on a consistent state fastly.
|
||||
The Journaling Block Device layer (JBD) isn't ext3 specific. It was design to
|
||||
add journaling capabilities on a block device. The ext3 filesystem code will
|
||||
inform the JBD of modifications it is performing (called a transaction). The
|
||||
journal supports the transactions start and stop, and in case of crash, the
|
||||
journal can replayed the transactions to put the partition back in a
|
||||
consistent state fast.
|
||||
|
||||
handles represent a single atomic update to a filesystem. JBD can
|
||||
handle external journal on a block device.
|
||||
Handles represent a single atomic update to a filesystem. JBD can handle an
|
||||
external journal on a block device.
|
||||
|
||||
Data Mode
|
||||
---------
|
||||
There's 3 different data modes:
|
||||
There are 3 different data modes:
|
||||
|
||||
* writeback mode
|
||||
In data=writeback mode, ext3 does not journal data at all. This mode
|
||||
provides a similar level of journaling as XFS, JFS, and ReiserFS in its
|
||||
default mode - metadata journaling. A crash+recovery can cause
|
||||
incorrect data to appear in files which were written shortly before the
|
||||
crash. This mode will typically provide the best ext3 performance.
|
||||
In data=writeback mode, ext3 does not journal data at all. This mode provides
|
||||
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
|
||||
mode - metadata journaling. A crash+recovery can cause incorrect data to
|
||||
appear in files which were written shortly before the crash. This mode will
|
||||
typically provide the best ext3 performance.
|
||||
|
||||
* ordered mode
|
||||
In data=ordered mode, ext3 only officially journals metadata, but it
|
||||
logically groups metadata and data blocks into a single unit called a
|
||||
transaction. When it's time to write the new metadata out to disk, the
|
||||
associated data blocks are written first. In general, this mode
|
||||
perform slightly slower than writeback but significantly faster than
|
||||
journal mode.
|
||||
In data=ordered mode, ext3 only officially journals metadata, but it logically
|
||||
groups metadata and data blocks into a single unit called a transaction. When
|
||||
it's time to write the new metadata out to disk, the associated data blocks
|
||||
are written first. In general, this mode performs slightly slower than
|
||||
writeback but significantly faster than journal mode.
|
||||
|
||||
* journal mode
|
||||
data=journal mode provides full data and metadata journaling. All new
|
||||
data is written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both
|
||||
data and metadata into a consistent state. This mode is the slowest
|
||||
except when data needs to be read from and written to disk at the same
|
||||
time where it outperform all others mode.
|
||||
data=journal mode provides full data and metadata journaling. All new data is
|
||||
written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both data and
|
||||
metadata into a consistent state. This mode is the slowest except when data
|
||||
needs to be read from and written to disk at the same time where it
|
||||
outperforms all others modes.
|
||||
|
||||
Compatibility
|
||||
-------------
|
||||
|
||||
Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`.
|
||||
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be
|
||||
mounted as Ext2.
|
||||
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as
|
||||
Ext2.
|
||||
|
||||
|
||||
External Tools
|
||||
==============
|
||||
see manual pages to know more.
|
||||
See manual pages to learn more.
|
||||
|
||||
tune2fs: create a ext3 journal on a ext2 partition with the -j flag.
|
||||
mke2fs: create a ext3 partition with the -j flag.
|
||||
debugfs: ext2 and ext3 file system debugger.
|
||||
ext2online: online (mounted) ext2 and ext3 filesystem resizer
|
||||
|
||||
tune2fs: create a ext3 journal on a ext2 partition with the -j flags
|
||||
mke2fs: create a ext3 partition with the -j flags
|
||||
debugfs: ext2 and ext3 file system debugger
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
kernel source: file:/usr/src/linux/fs/ext3
|
||||
file:/usr/src/linux/fs/jbd
|
||||
kernel source: <file:fs/ext3/>
|
||||
<file:fs/jbd/>
|
||||
|
||||
programs: http://e2fsprogs.sourceforge.net
|
||||
programs: http://e2fsprogs.sourceforge.net/
|
||||
http://ext2resize.sourceforge.net
|
||||
|
||||
useful link:
|
||||
http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
|
||||
useful links: http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
|
||||
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
|
||||
http://www-106.ibm.com/developerworks/linux/library/l-fs8/
|
||||
|
@ -86,6 +86,62 @@ Mount options
|
||||
The default is infinite. Note that the size of read requests is
|
||||
limited anyway to 32 pages (which is 128kbyte on i386).
|
||||
|
||||
Sysfs
|
||||
~~~~~
|
||||
|
||||
FUSE sets up the following hierarchy in sysfs:
|
||||
|
||||
/sys/fs/fuse/connections/N/
|
||||
|
||||
where N is an increasing number allocated to each new connection.
|
||||
|
||||
For each connection the following attributes are defined:
|
||||
|
||||
'waiting'
|
||||
|
||||
The number of requests which are waiting to be transfered to
|
||||
userspace or being processed by the filesystem daemon. If there is
|
||||
no filesystem activity and 'waiting' is non-zero, then the
|
||||
filesystem is hung or deadlocked.
|
||||
|
||||
'abort'
|
||||
|
||||
Writing anything into this file will abort the filesystem
|
||||
connection. This means that all waiting requests will be aborted an
|
||||
error returned for all aborted and new requests.
|
||||
|
||||
Only a privileged user may read or write these attributes.
|
||||
|
||||
Aborting a filesystem connection
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
It is possible to get into certain situations where the filesystem is
|
||||
not responding. Reasons for this may be:
|
||||
|
||||
a) Broken userspace filesystem implementation
|
||||
|
||||
b) Network connection down
|
||||
|
||||
c) Accidental deadlock
|
||||
|
||||
d) Malicious deadlock
|
||||
|
||||
(For more on c) and d) see later sections)
|
||||
|
||||
In either of these cases it may be useful to abort the connection to
|
||||
the filesystem. There are several ways to do this:
|
||||
|
||||
- Kill the filesystem daemon. Works in case of a) and b)
|
||||
|
||||
- Kill the filesystem daemon and all users of the filesystem. Works
|
||||
in all cases except some malicious deadlocks
|
||||
|
||||
- Use forced umount (umount -f). Works in all cases but only if
|
||||
filesystem is still attached (it hasn't been lazy unmounted)
|
||||
|
||||
- Abort filesystem through the sysfs interface. Most powerful
|
||||
method, always works.
|
||||
|
||||
How do non-privileged mounts work?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
@ -313,3 +369,10 @@ faulted with get_user_pages(). The 'req->locked' flag indicates
|
||||
when the copy is taking place, and interruption is delayed until
|
||||
this flag is unset.
|
||||
|
||||
Scenario 3 - Tricky deadlock with asynchronous read
|
||||
---------------------------------------------------
|
||||
|
||||
The same situation as above, except thread-1 will wait on page lock
|
||||
and hence it will be uninterruptible as well. The solution is to
|
||||
abort the connection with forced umount (if mount is attached) or
|
||||
through the abort attribute in sysfs.
|
||||
|
55
Documentation/filesystems/ocfs2.txt
Normal file
55
Documentation/filesystems/ocfs2.txt
Normal file
@ -0,0 +1,55 @@
|
||||
OCFS2 filesystem
|
||||
==================
|
||||
OCFS2 is a general purpose extent based shared disk cluster file
|
||||
system with many similarities to ext3. It supports 64 bit inode
|
||||
numbers, and has automatically extending metadata groups which may
|
||||
also make it attractive for non-clustered use.
|
||||
|
||||
You'll want to install the ocfs2-tools package in order to at least
|
||||
get "mount.ocfs2" and "ocfs2_hb_ctl".
|
||||
|
||||
Project web page: http://oss.oracle.com/projects/ocfs2
|
||||
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
|
||||
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
|
||||
|
||||
All code copyright 2005 Oracle except when otherwise noted.
|
||||
|
||||
CREDITS:
|
||||
Lots of code taken from ext3 and other projects.
|
||||
|
||||
Authors in alphabetical order:
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
Zach Brown <zach.brown@oracle.com>
|
||||
Mark Fasheh <mark.fasheh@oracle.com>
|
||||
Kurt Hackel <kurt.hackel@oracle.com>
|
||||
Sunil Mushran <sunil.mushran@oracle.com>
|
||||
Manish Singh <manish.singh@oracle.com>
|
||||
|
||||
Caveats
|
||||
=======
|
||||
Features which OCFS2 does not support yet:
|
||||
- sparse files
|
||||
- extended attributes
|
||||
- shared writeable mmap
|
||||
- loopback is supported, but data written will not
|
||||
be cluster coherent.
|
||||
- quotas
|
||||
- cluster aware flock
|
||||
- Directory change notification (F_NOTIFY)
|
||||
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
|
||||
- POSIX ACLs
|
||||
- readpages / writepages (not user visible)
|
||||
|
||||
Mount options
|
||||
=============
|
||||
|
||||
OCFS2 supports the following mount options:
|
||||
(*) == default
|
||||
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables it,
|
||||
barrier=1 enables it.
|
||||
errors=remount-ro(*) Remount the filesystem read-only on an error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
intr (*) Allow signals to interrupt cluster operations.
|
||||
nointr Do not allow signals to interrupt cluster
|
||||
operations.
|
@ -418,7 +418,7 @@ VmallocChunk: 111088 kB
|
||||
Dirty: Memory which is waiting to get written back to the disk
|
||||
Writeback: Memory which is actively being written back to the disk
|
||||
Mapped: files which have been mmaped, such as libraries
|
||||
Slab: in-kernel data structures cache
|
||||
Slab: in-kernel data structures cache
|
||||
CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
|
||||
this is the total amount of memory currently available to
|
||||
be allocated on the system. This limit is only adhered to
|
||||
@ -1302,6 +1302,23 @@ VM has token based thrashing control mechanism and uses the token to prevent
|
||||
unnecessary page faults in thrashing situation. The unit of the value is
|
||||
second. The value would be useful to tune thrashing behavior.
|
||||
|
||||
drop_caches
|
||||
-----------
|
||||
|
||||
Writing to this will cause the kernel to drop clean caches, dentries and
|
||||
inodes from memory, causing that memory to become free.
|
||||
|
||||
To free pagecache:
|
||||
echo 1 > /proc/sys/vm/drop_caches
|
||||
To free dentries and inodes:
|
||||
echo 2 > /proc/sys/vm/drop_caches
|
||||
To free pagecache, dentries and inodes:
|
||||
echo 3 > /proc/sys/vm/drop_caches
|
||||
|
||||
As this is a non-destructive operation and dirty objects are not freeable, the
|
||||
user should run `sync' first.
|
||||
|
||||
|
||||
2.5 /proc/sys/dev - Device specific parameters
|
||||
----------------------------------------------
|
||||
|
||||
|
@ -143,12 +143,26 @@ as the following example:
|
||||
dir /mnt 755 0 0
|
||||
file /init initramfs/init.sh 755 0 0
|
||||
|
||||
Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
|
||||
documenting the above file format.
|
||||
|
||||
One advantage of the text file is that root access is not required to
|
||||
set permissions or create device nodes in the new archive. (Note that those
|
||||
two example "file" entries expect to find files named "init.sh" and "busybox" in
|
||||
a directory called "initramfs", under the linux-2.6.* directory. See
|
||||
Documentation/early-userspace/README for more details.)
|
||||
|
||||
The kernel does not depend on external cpio tools, gen_init_cpio is created
|
||||
from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
|
||||
boot-time extractor is also (obviously) self-contained. However, if you _do_
|
||||
happen to have cpio installed, the following command line can extract the
|
||||
generated cpio image back into its component files:
|
||||
|
||||
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
|
||||
|
||||
Contents of initramfs:
|
||||
----------------------
|
||||
|
||||
If you don't already understand what shared libraries, devices, and paths
|
||||
you need to get a minimal root filesystem up and running, here are some
|
||||
references:
|
||||
@ -161,13 +175,69 @@ designed to be a tiny C library to statically link early userspace
|
||||
code against, along with some related utilities. It is BSD licensed.
|
||||
|
||||
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
|
||||
myself. These are LGPL and GPL, respectively.
|
||||
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
|
||||
package is planned for the busybox 1.2 release.)
|
||||
|
||||
In theory you could use glibc, but that's not well suited for small embedded
|
||||
uses like this. (A "hello world" program statically linked against glibc is
|
||||
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
|
||||
name lookups, even when otherwise statically linked.)
|
||||
|
||||
Why cpio rather than tar?
|
||||
-------------------------
|
||||
|
||||
This decision was made back in December, 2001. The discussion started here:
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
|
||||
|
||||
And spawned a second thread (specifically on tar vs cpio), starting here:
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
|
||||
|
||||
The quick and dirty summary version (which is no substitute for reading
|
||||
the above threads) is:
|
||||
|
||||
1) cpio is a standard. It's decades old (from the AT&T days), and already
|
||||
widely used on Linux (inside RPM, Red Hat's device driver disks). Here's
|
||||
a Linux Journal article about it from 1996:
|
||||
|
||||
http://www.linuxjournal.com/article/1213
|
||||
|
||||
It's not as popular as tar because the traditional cpio command line tools
|
||||
require _truly_hideous_ command line arguments. But that says nothing
|
||||
either way about the archive format, and there are alternative tools,
|
||||
such as:
|
||||
|
||||
http://freshmeat.net/projects/afio/
|
||||
|
||||
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
|
||||
thus easier to create and parse) than any of the (literally dozens of)
|
||||
various tar archive formats. The complete initramfs archive format is
|
||||
explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
|
||||
extracted in init/initramfs.c. All three together come to less than 26k
|
||||
total of human-readable text.
|
||||
|
||||
3) The GNU project standardizing on tar is approximately as relevant as
|
||||
Windows standardizing on zip. Linux is not part of either, and is free
|
||||
to make its own technical decisions.
|
||||
|
||||
4) Since this is a kernel internal format, it could easily have been
|
||||
something brand new. The kernel provides its own tools to create and
|
||||
extract this format anyway. Using an existing standard was preferable,
|
||||
but not essential.
|
||||
|
||||
5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
|
||||
supported on the kernel side"):
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
|
||||
|
||||
explained his reasoning:
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
|
||||
|
||||
and, most importantly, designed and implemented the initramfs code.
|
||||
|
||||
Future directions:
|
||||
------------------
|
||||
|
||||
|
@ -44,30 +44,41 @@ relayfs can operate in a mode where it will overwrite data not yet
|
||||
collected by userspace, and not wait for it to consume it.
|
||||
|
||||
relayfs itself does not provide for communication of such data between
|
||||
userspace and kernel, allowing the kernel side to remain simple and not
|
||||
impose a single interface on userspace. It does provide a separate
|
||||
helper though, described below.
|
||||
userspace and kernel, allowing the kernel side to remain simple and
|
||||
not impose a single interface on userspace. It does provide a set of
|
||||
examples and a separate helper though, described below.
|
||||
|
||||
klog, relay-app & librelay
|
||||
==========================
|
||||
klog and relay-apps example code
|
||||
================================
|
||||
|
||||
relayfs itself is ready to use, but to make things easier, two
|
||||
additional systems are provided. klog is a simple wrapper to make
|
||||
writing formatted text or raw data to a channel simpler, regardless of
|
||||
whether a channel to write into exists or not, or whether relayfs is
|
||||
compiled into the kernel or is configured as a module. relay-app is
|
||||
the kernel counterpart of userspace librelay.c, combined these two
|
||||
files provide glue to easily stream data to disk, without having to
|
||||
bother with housekeeping. klog and relay-app can be used together,
|
||||
with klog providing high-level logging functions to the kernel and
|
||||
relay-app taking care of kernel-user control and disk-logging chores.
|
||||
relayfs itself is ready to use, but to make things easier, a couple
|
||||
simple utility functions and a set of examples are provided.
|
||||
|
||||
It is possible to use relayfs without relay-app & librelay, but you'll
|
||||
have to implement communication between userspace and kernel, allowing
|
||||
both to convey the state of buffers (full, empty, amount of padding).
|
||||
The relay-apps example tarball, available on the relayfs sourceforge
|
||||
site, contains a set of self-contained examples, each consisting of a
|
||||
pair of .c files containing boilerplate code for each of the user and
|
||||
kernel sides of a relayfs application; combined these two sets of
|
||||
boilerplate code provide glue to easily stream data to disk, without
|
||||
having to bother with mundane housekeeping chores.
|
||||
|
||||
The 'klog debugging functions' patch (klog.patch in the relay-apps
|
||||
tarball) provides a couple of high-level logging functions to the
|
||||
kernel which allow writing formatted text or raw data to a channel,
|
||||
regardless of whether a channel to write into exists or not, or
|
||||
whether relayfs is compiled into the kernel or is configured as a
|
||||
module. These functions allow you to put unconditional 'trace'
|
||||
statements anywhere in the kernel or kernel modules; only when there
|
||||
is a 'klog handler' registered will data actually be logged (see the
|
||||
klog and kleak examples for details).
|
||||
|
||||
It is of course possible to use relayfs from scratch i.e. without
|
||||
using any of the relay-apps example code or klog, but you'll have to
|
||||
implement communication between userspace and kernel, allowing both to
|
||||
convey the state of buffers (full, empty, amount of padding).
|
||||
|
||||
klog and the relay-apps examples can be found in the relay-apps
|
||||
tarball on http://relayfs.sourceforge.net
|
||||
|
||||
klog, relay-app and librelay can be found in the relay-apps tarball on
|
||||
http://relayfs.sourceforge.net
|
||||
|
||||
The relayfs user space API
|
||||
==========================
|
||||
@ -125,6 +136,8 @@ Here's a summary of the API relayfs provides to in-kernel clients:
|
||||
relay_reset(chan)
|
||||
relayfs_create_dir(name, parent)
|
||||
relayfs_remove_dir(dentry)
|
||||
relayfs_create_file(name, parent, mode, fops, data)
|
||||
relayfs_remove_file(dentry)
|
||||
|
||||
channel management typically called on instigation of userspace:
|
||||
|
||||
@ -141,6 +154,8 @@ Here's a summary of the API relayfs provides to in-kernel clients:
|
||||
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
|
||||
buf_mapped(buf, filp)
|
||||
buf_unmapped(buf, filp)
|
||||
create_buf_file(filename, parent, mode, buf, is_global)
|
||||
remove_buf_file(dentry)
|
||||
|
||||
helper functions:
|
||||
|
||||
@ -320,6 +335,71 @@ forces a sub-buffer switch on all the channel buffers, and can be used
|
||||
to finalize and process the last sub-buffers before the channel is
|
||||
closed.
|
||||
|
||||
Creating non-relay files
|
||||
------------------------
|
||||
|
||||
relay_open() automatically creates files in the relayfs filesystem to
|
||||
represent the per-cpu kernel buffers; it's often useful for
|
||||
applications to be able to create their own files alongside the relay
|
||||
files in the relayfs filesystem as well e.g. 'control' files much like
|
||||
those created in /proc or debugfs for similar purposes, used to
|
||||
communicate control information between the kernel and user sides of a
|
||||
relayfs application. For this purpose the relayfs_create_file() and
|
||||
relayfs_remove_file() API functions exist. For relayfs_create_file(),
|
||||
the caller passes in a set of user-defined file operations to be used
|
||||
for the file and an optional void * to a user-specified data item,
|
||||
which will be accessible via inode->u.generic_ip (see the relay-apps
|
||||
tarball for examples). The file_operations are a required parameter
|
||||
to relayfs_create_file() and thus the semantics of these files are
|
||||
completely defined by the caller.
|
||||
|
||||
See the relay-apps tarball at http://relayfs.sourceforge.net for
|
||||
examples of how these non-relay files are meant to be used.
|
||||
|
||||
Creating relay files in other filesystems
|
||||
-----------------------------------------
|
||||
|
||||
By default of course, relay_open() creates relay files in the relayfs
|
||||
filesystem. Because relay_file_operations is exported, however, it's
|
||||
also possible to create and use relay files in other pseudo-filesytems
|
||||
such as debugfs.
|
||||
|
||||
For this purpose, two callback functions are provided,
|
||||
create_buf_file() and remove_buf_file(). create_buf_file() is called
|
||||
once for each per-cpu buffer from relay_open() to allow the client to
|
||||
create a file to be used to represent the corresponding buffer; if
|
||||
this callback is not defined, the default implementation will create
|
||||
and return a file in the relayfs filesystem to represent the buffer.
|
||||
The callback should return the dentry of the file created to represent
|
||||
the relay buffer. Note that the parent directory passed to
|
||||
relay_open() (and passed along to the callback), if specified, must
|
||||
exist in the same filesystem the new relay file is created in. If
|
||||
create_buf_file() is defined, remove_buf_file() must also be defined;
|
||||
it's responsible for deleting the file(s) created in create_buf_file()
|
||||
and is called during relay_close().
|
||||
|
||||
The create_buf_file() implementation can also be defined in such a way
|
||||
as to allow the creation of a single 'global' buffer instead of the
|
||||
default per-cpu set. This can be useful for applications interested
|
||||
mainly in seeing the relative ordering of system-wide events without
|
||||
the need to bother with saving explicit timestamps for the purpose of
|
||||
merging/sorting per-cpu files in a postprocessing step.
|
||||
|
||||
To have relay_open() create a global buffer, the create_buf_file()
|
||||
implementation should set the value of the is_global outparam to a
|
||||
non-zero value in addition to creating the file that will be used to
|
||||
represent the single buffer. In the case of a global buffer,
|
||||
create_buf_file() and remove_buf_file() will be called only once. The
|
||||
normal channel-writing functions e.g. relay_write() can still be used
|
||||
- writes from any cpu will transparently end up in the global buffer -
|
||||
but since it is a global buffer, callers should make sure they use the
|
||||
proper locking for such a buffer, either by wrapping writes in a
|
||||
spinlock, or by copying a write function from relayfs_fs.h and
|
||||
creating a local version that internally does the proper locking.
|
||||
|
||||
See the 'exported-relayfile' examples in the relay-apps tarball for
|
||||
examples of creating and using relay files in debugfs.
|
||||
|
||||
Misc
|
||||
----
|
||||
|
||||
|
521
Documentation/filesystems/spufs.txt
Normal file
521
Documentation/filesystems/spufs.txt
Normal file
@ -0,0 +1,521 @@
|
||||
SPUFS(2) Linux Programmer's Manual SPUFS(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spufs - the SPU file system
|
||||
|
||||
|
||||
DESCRIPTION
|
||||
The SPU file system is used on PowerPC machines that implement the Cell
|
||||
Broadband Engine Architecture in order to access Synergistic Processor
|
||||
Units (SPUs).
|
||||
|
||||
The file system provides a name space similar to posix shared memory or
|
||||
message queues. Users that have write permissions on the file system
|
||||
can use spu_create(2) to establish SPU contexts in the spufs root.
|
||||
|
||||
Every SPU context is represented by a directory containing a predefined
|
||||
set of files. These files can be used for manipulating the state of the
|
||||
logical SPU. Users can change permissions on those files, but not actu-
|
||||
ally add or remove files.
|
||||
|
||||
|
||||
MOUNT OPTIONS
|
||||
uid=<uid>
|
||||
set the user owning the mount point, the default is 0 (root).
|
||||
|
||||
gid=<gid>
|
||||
set the group owning the mount point, the default is 0 (root).
|
||||
|
||||
|
||||
FILES
|
||||
The files in spufs mostly follow the standard behavior for regular sys-
|
||||
tem calls like read(2) or write(2), but often support only a subset of
|
||||
the operations supported on regular file systems. This list details the
|
||||
supported operations and the deviations from the behaviour in the
|
||||
respective man pages.
|
||||
|
||||
All files that support the read(2) operation also support readv(2) and
|
||||
all files that support the write(2) operation also support writev(2).
|
||||
All files support the access(2) and stat(2) family of operations, but
|
||||
only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
|
||||
contain reliable information.
|
||||
|
||||
All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
|
||||
tions, but will not be able to grant permissions that contradict the
|
||||
possible operations, e.g. read access on the wbox file.
|
||||
|
||||
The current set of files is:
|
||||
|
||||
|
||||
/mem
|
||||
the contents of the local storage memory of the SPU. This can be
|
||||
accessed like a regular shared memory file and contains both code and
|
||||
data in the address space of the SPU. The possible operations on an
|
||||
open mem file are:
|
||||
|
||||
read(2), pread(2), write(2), pwrite(2), lseek(2)
|
||||
These operate as documented, with the exception that seek(2),
|
||||
write(2) and pwrite(2) are not supported beyond the end of the
|
||||
file. The file size is the size of the local storage of the SPU,
|
||||
which normally is 256 kilobytes.
|
||||
|
||||
mmap(2)
|
||||
Mapping mem into the process address space gives access to the
|
||||
SPU local storage within the process address space. Only
|
||||
MAP_SHARED mappings are allowed.
|
||||
|
||||
|
||||
/mbox
|
||||
The first SPU to CPU communication mailbox. This file is read-only and
|
||||
can be read in units of 32 bits. The file can only be used in non-
|
||||
blocking mode and it even poll() will not block on it. The possible
|
||||
operations on an open mbox file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. If there is no data available in the mail
|
||||
box, the return value is set to -1 and errno becomes EAGAIN.
|
||||
When data has been read successfully, four bytes are placed in
|
||||
the data buffer and the value four is returned.
|
||||
|
||||
|
||||
/ibox
|
||||
The second SPU to CPU communication mailbox. This file is similar to
|
||||
the first mailbox file, but can be read in blocking I/O mode, and the
|
||||
poll familiy of system calls can be used to wait for it. The possible
|
||||
operations on an open ibox file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. If there is no data available in the mail
|
||||
box and the file descriptor has been opened with O_NONBLOCK, the
|
||||
return value is set to -1 and errno becomes EAGAIN.
|
||||
|
||||
If there is no data available in the mail box and the file
|
||||
descriptor has been opened without O_NONBLOCK, the call will
|
||||
block until the SPU writes to its interrupt mailbox channel.
|
||||
When data has been read successfully, four bytes are placed in
|
||||
the data buffer and the value four is returned.
|
||||
|
||||
poll(2)
|
||||
Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
|
||||
data is available for reading.
|
||||
|
||||
|
||||
/wbox
|
||||
The CPU to SPU communation mailbox. It is write-only can can be written
|
||||
in units of 32 bits. If the mailbox is full, write() will block and
|
||||
poll can be used to wait for it becoming empty again. The possible
|
||||
operations on an open wbox file are: write(2) If a count smaller than
|
||||
four is requested, write returns -1 and sets errno to EINVAL. If there
|
||||
is no space available in the mail box and the file descriptor has been
|
||||
opened with O_NONBLOCK, the return value is set to -1 and errno becomes
|
||||
EAGAIN.
|
||||
|
||||
If there is no space available in the mail box and the file descriptor
|
||||
has been opened without O_NONBLOCK, the call will block until the SPU
|
||||
reads from its PPE mailbox channel. When data has been read success-
|
||||
fully, four bytes are placed in the data buffer and the value four is
|
||||
returned.
|
||||
|
||||
poll(2)
|
||||
Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever
|
||||
space is available for writing.
|
||||
|
||||
|
||||
/mbox_stat
|
||||
/ibox_stat
|
||||
/wbox_stat
|
||||
Read-only files that contain the length of the current queue, i.e. how
|
||||
many words can be read from mbox or ibox or how many words can be
|
||||
written to wbox without blocking. The files can be read only in 4-byte
|
||||
units and return a big-endian binary integer number. The possible
|
||||
operations on an open *box_stat file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is placed in
|
||||
the data buffer, containing the number of elements that can be
|
||||
read from (for mbox_stat and ibox_stat) or written to (for
|
||||
wbox_stat) the respective mail box without blocking or resulting
|
||||
in EAGAIN.
|
||||
|
||||
|
||||
/npc
|
||||
/decr
|
||||
/decr_status
|
||||
/spu_tag_mask
|
||||
/event_mask
|
||||
/srr0
|
||||
Internal registers of the SPU. The representation is an ASCII string
|
||||
with the numeric value of the next instruction to be executed. These
|
||||
can be used in read/write mode for debugging, but normal operation of
|
||||
programs should not rely on them because access to any of them except
|
||||
npc requires an SPU context save and is therefore very inefficient.
|
||||
|
||||
The contents of these files are:
|
||||
|
||||
npc Next Program Counter
|
||||
|
||||
decr SPU Decrementer
|
||||
|
||||
decr_status Decrementer Status
|
||||
|
||||
spu_tag_mask MFC tag mask for SPU DMA
|
||||
|
||||
event_mask Event mask for SPU interrupts
|
||||
|
||||
srr0 Interrupt Return address register
|
||||
|
||||
|
||||
The possible operations on an open npc, decr, decr_status,
|
||||
spu_tag_mask, event_mask or srr0 file are:
|
||||
|
||||
read(2)
|
||||
When the count supplied to the read call is shorter than the
|
||||
required length for the pointer value plus a newline character,
|
||||
subsequent reads from the same file descriptor will result in
|
||||
completing the string, regardless of changes to the register by
|
||||
a running SPU task. When a complete string has been read, all
|
||||
subsequent read operations will return zero bytes and a new file
|
||||
descriptor needs to be opened to read the value again.
|
||||
|
||||
write(2)
|
||||
A write operation on the file results in setting the register to
|
||||
the value given in the string. The string is parsed from the
|
||||
beginning to the first non-numeric character or the end of the
|
||||
buffer. Subsequent writes to the same file descriptor overwrite
|
||||
the previous setting.
|
||||
|
||||
|
||||
/fpcr
|
||||
This file gives access to the Floating Point Status and Control Regis-
|
||||
ter as a four byte long file. The operations on the fpcr file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is placed in
|
||||
the data buffer, containing the current value of the fpcr regis-
|
||||
ter.
|
||||
|
||||
write(2)
|
||||
If a count smaller than four is requested, write returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is copied
|
||||
from the data buffer, updating the value of the fpcr register.
|
||||
|
||||
|
||||
/signal1
|
||||
/signal2
|
||||
The two signal notification channels of an SPU. These are read-write
|
||||
files that operate on a 32 bit word. Writing to one of these files
|
||||
triggers an interrupt on the SPU. The value writting to the signal
|
||||
files can be read from the SPU through a channel read or from host user
|
||||
space through the file. After the value has been read by the SPU, it
|
||||
is reset to zero. The possible operations on an open signal1 or sig-
|
||||
nal2 file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is placed in
|
||||
the data buffer, containing the current value of the specified
|
||||
signal notification register.
|
||||
|
||||
write(2)
|
||||
If a count smaller than four is requested, write returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is copied
|
||||
from the data buffer, updating the value of the specified signal
|
||||
notification register. The signal notification register will
|
||||
either be replaced with the input data or will be updated to the
|
||||
bitwise OR or the old value and the input data, depending on the
|
||||
contents of the signal1_type, or signal2_type respectively,
|
||||
file.
|
||||
|
||||
|
||||
/signal1_type
|
||||
/signal2_type
|
||||
These two files change the behavior of the signal1 and signal2 notifi-
|
||||
cation files. The contain a numerical ASCII string which is read as
|
||||
either "1" or "0". In mode 0 (overwrite), the hardware replaces the
|
||||
contents of the signal channel with the data that is written to it. in
|
||||
mode 1 (logical OR), the hardware accumulates the bits that are subse-
|
||||
quently written to it. The possible operations on an open signal1_type
|
||||
or signal2_type file are:
|
||||
|
||||
read(2)
|
||||
When the count supplied to the read call is shorter than the
|
||||
required length for the digit plus a newline character, subse-
|
||||
quent reads from the same file descriptor will result in com-
|
||||
pleting the string. When a complete string has been read, all
|
||||
subsequent read operations will return zero bytes and a new file
|
||||
descriptor needs to be opened to read the value again.
|
||||
|
||||
write(2)
|
||||
A write operation on the file results in setting the register to
|
||||
the value given in the string. The string is parsed from the
|
||||
beginning to the first non-numeric character or the end of the
|
||||
buffer. Subsequent writes to the same file descriptor overwrite
|
||||
the previous setting.
|
||||
|
||||
|
||||
EXAMPLES
|
||||
/etc/fstab entry
|
||||
none /spu spufs gid=spu 0 0
|
||||
|
||||
|
||||
AUTHORS
|
||||
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
|
||||
Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPUFS(2)
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spu_run - execute an spu context
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_run(int fd, unsigned int *npc, unsigned int *event);
|
||||
|
||||
DESCRIPTION
|
||||
The spu_run system call is used on PowerPC machines that implement the
|
||||
Cell Broadband Engine Architecture in order to access Synergistic Pro-
|
||||
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
|
||||
ate(2) to address a specific SPU context. When the context gets sched-
|
||||
uled to a physical SPU, it starts execution at the instruction pointer
|
||||
passed in npc.
|
||||
|
||||
Execution of SPU code happens synchronously, meaning that spu_run does
|
||||
not return while the SPU is still running. If there is a need to exe-
|
||||
cute SPU code in parallel with other code on either the main CPU or
|
||||
other SPUs, you need to create a new thread of execution first, e.g.
|
||||
using the pthread_create(3) call.
|
||||
|
||||
When spu_run returns, the current value of the SPU instruction pointer
|
||||
is written back to npc, so you can call spu_run again without updating
|
||||
the pointers.
|
||||
|
||||
event can be a NULL pointer or point to an extended status code that
|
||||
gets filled when spu_run returns. It can be one of the following con-
|
||||
stants:
|
||||
|
||||
SPE_EVENT_DMA_ALIGNMENT
|
||||
A DMA alignment error
|
||||
|
||||
SPE_EVENT_SPE_DATA_SEGMENT
|
||||
A DMA segmentation error
|
||||
|
||||
SPE_EVENT_SPE_DATA_STORAGE
|
||||
A DMA storage error
|
||||
|
||||
If NULL is passed as the event argument, these errors will result in a
|
||||
signal delivered to the calling process.
|
||||
|
||||
RETURN VALUE
|
||||
spu_run returns the value of the spu_status register or -1 to indicate
|
||||
an error and set errno to one of the error codes listed below. The
|
||||
spu_status register value contains a bit mask of status codes and
|
||||
optionally a 14 bit code returned from the stop-and-signal instruction
|
||||
on the SPU. The bit masks for the status codes are:
|
||||
|
||||
0x02 SPU was stopped by stop-and-signal.
|
||||
|
||||
0x04 SPU was stopped by halt.
|
||||
|
||||
0x08 SPU is waiting for a channel.
|
||||
|
||||
0x10 SPU is in single-step mode.
|
||||
|
||||
0x20 SPU has tried to execute an invalid instruction.
|
||||
|
||||
0x40 SPU has tried to access an invalid channel.
|
||||
|
||||
0x3fff0000
|
||||
The bits masked with this value contain the code returned from
|
||||
stop-and-signal.
|
||||
|
||||
There are always one or more of the lower eight bits set or an error
|
||||
code is returned from spu_run.
|
||||
|
||||
ERRORS
|
||||
EAGAIN or EWOULDBLOCK
|
||||
fd is in non-blocking mode and spu_run would block.
|
||||
|
||||
EBADF fd is not a valid file descriptor.
|
||||
|
||||
EFAULT npc is not a valid pointer or status is neither NULL nor a valid
|
||||
pointer.
|
||||
|
||||
EINTR A signal occured while spu_run was in progress. The npc value
|
||||
has been updated to the new program counter value if necessary.
|
||||
|
||||
EINVAL fd is not a file descriptor returned from spu_create(2).
|
||||
|
||||
ENOMEM Insufficient memory was available to handle a page fault result-
|
||||
ing from an MFC direct memory access.
|
||||
|
||||
ENOSYS the functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
|
||||
NOTES
|
||||
spu_run is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
CONFORMING TO
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
BUGS
|
||||
The code does not yet fully implement all features lined out here.
|
||||
|
||||
|
||||
AUTHOR
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_create(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPU_RUN(2)
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spu_create - create a new spu context
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
#include <sys/types.h>
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_create(const char *pathname, int flags, mode_t mode);
|
||||
|
||||
DESCRIPTION
|
||||
The spu_create system call is used on PowerPC machines that implement
|
||||
the Cell Broadband Engine Architecture in order to access Synergistic
|
||||
Processor Units (SPUs). It creates a new logical context for an SPU in
|
||||
pathname and returns a handle to associated with it. pathname must
|
||||
point to a non-existing directory in the mount point of the SPU file
|
||||
system (spufs). When spu_create is successful, a directory gets cre-
|
||||
ated on pathname and it is populated with files.
|
||||
|
||||
The returned file handle can only be passed to spu_run(2) or closed,
|
||||
other operations are not defined on it. When it is closed, all associ-
|
||||
ated directory entries in spufs are removed. When the last file handle
|
||||
pointing either inside of the context directory or to this file
|
||||
descriptor is closed, the logical SPU context is destroyed.
|
||||
|
||||
The parameter flags can be zero or any bitwise or'd combination of the
|
||||
following constants:
|
||||
|
||||
SPU_RAWIO
|
||||
Allow mapping of some of the hardware registers of the SPU into
|
||||
user space. This flag requires the CAP_SYS_RAWIO capability, see
|
||||
capabilities(7).
|
||||
|
||||
The mode parameter specifies the permissions used for creating the new
|
||||
directory in spufs. mode is modified with the user's umask(2) value
|
||||
and then used for both the directory and the files contained in it. The
|
||||
file permissions mask out some more bits of mode because they typically
|
||||
support only read or write access. See stat(2) for a full list of the
|
||||
possible mode values.
|
||||
|
||||
|
||||
RETURN VALUE
|
||||
spu_create returns a new file descriptor. It may return -1 to indicate
|
||||
an error condition and set errno to one of the error codes listed
|
||||
below.
|
||||
|
||||
|
||||
ERRORS
|
||||
EACCESS
|
||||
The current user does not have write access on the spufs mount
|
||||
point.
|
||||
|
||||
EEXIST An SPU context already exists at the given path name.
|
||||
|
||||
EFAULT pathname is not a valid string pointer in the current address
|
||||
space.
|
||||
|
||||
EINVAL pathname is not a directory in the spufs mount point.
|
||||
|
||||
ELOOP Too many symlinks were found while resolving pathname.
|
||||
|
||||
EMFILE The process has reached its maximum open file limit.
|
||||
|
||||
ENAMETOOLONG
|
||||
pathname was too long.
|
||||
|
||||
ENFILE The system has reached the global open file limit.
|
||||
|
||||
ENOENT Part of pathname could not be resolved.
|
||||
|
||||
ENOMEM The kernel could not allocate all resources required.
|
||||
|
||||
ENOSPC There are not enough SPU resources available to create a new
|
||||
context or the user specific limit for the number of SPU con-
|
||||
texts has been reached.
|
||||
|
||||
ENOSYS the functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
ENOTDIR
|
||||
A part of pathname is not a directory.
|
||||
|
||||
|
||||
|
||||
NOTES
|
||||
spu_create is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
FILES
|
||||
pathname must point to a location beneath the mount point of spufs. By
|
||||
convention, it gets mounted in /spu.
|
||||
|
||||
|
||||
CONFORMING TO
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
BUGS
|
||||
The code does not yet fully implement all features lined out here.
|
||||
|
||||
|
||||
AUTHOR
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_run(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPU_CREATE(2)
|
@ -1,4 +1,5 @@
|
||||
Accessing PCI device resources through sysfs
|
||||
--------------------------------------------
|
||||
|
||||
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
|
||||
that support it. For example, a given bus might look like this:
|
||||
@ -47,14 +48,21 @@ files, each with their own function.
|
||||
binary - file contains binary data
|
||||
cpumask - file contains a cpumask type
|
||||
|
||||
The read only files are informational, writes to them will be ignored.
|
||||
Writable files can be used to perform actions on the device (e.g. changing
|
||||
config space, detaching a device). mmapable files are available via an
|
||||
mmap of the file at offset 0 and can be used to do actual device programming
|
||||
from userspace. Note that some platforms don't support mmapping of certain
|
||||
resources, so be sure to check the return value from any attempted mmap.
|
||||
The read only files are informational, writes to them will be ignored, with
|
||||
the exception of the 'rom' file. Writable files can be used to perform
|
||||
actions on the device (e.g. changing config space, detaching a device).
|
||||
mmapable files are available via an mmap of the file at offset 0 and can be
|
||||
used to do actual device programming from userspace. Note that some platforms
|
||||
don't support mmapping of certain resources, so be sure to check the return
|
||||
value from any attempted mmap.
|
||||
|
||||
The 'rom' file is special in that it provides read-only access to the device's
|
||||
ROM file, if available. It's disabled by default, however, so applications
|
||||
should write the string "1" to the file to enable it before attempting a read
|
||||
call, and disable it following the access by writing "0" to the file.
|
||||
|
||||
Accessing legacy resources through sysfs
|
||||
----------------------------------------
|
||||
|
||||
Legacy I/O port and ISA memory resources are also provided in sysfs if the
|
||||
underlying platform supports them. They're located in the PCI class heirarchy,
|
||||
@ -75,6 +83,7 @@ simply dereference the returned pointer (after checking for errors of course)
|
||||
to access legacy memory space.
|
||||
|
||||
Supporting PCI access on new platforms
|
||||
--------------------------------------
|
||||
|
||||
In order to support PCI resource mapping as described above, Linux platform
|
||||
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
|
||||
|
@ -78,6 +78,18 @@ use up all the memory on the machine; but enhances the scalability of
|
||||
that instance in a system with many cpus making intensive use of it.
|
||||
|
||||
|
||||
tmpfs has a mount option to set the NUMA memory allocation policy for
|
||||
all files in that instance:
|
||||
mpol=interleave prefers to allocate memory from each node in turn
|
||||
mpol=default prefers to allocate memory from the local node
|
||||
mpol=bind prefers to allocate from mpol_nodelist
|
||||
mpol=preferred prefers to allocate from first node in mpol_nodelist
|
||||
|
||||
The following mount option is used in conjunction with mpol=interleave,
|
||||
mpol=bind or mpol=preferred:
|
||||
mpol_nodelist: nodelist suitable for parsing with nodelist_parse.
|
||||
|
||||
|
||||
To specify the initial root directory you can use the following mount
|
||||
options:
|
||||
|
||||
|
@ -162,9 +162,8 @@ get_sb() method fills in is the "s_op" field. This is a pointer to
|
||||
a "struct super_operations" which describes the next level of the
|
||||
filesystem implementation.
|
||||
|
||||
Usually, a filesystem uses generic one of the generic get_sb()
|
||||
implementations and provides a fill_super() method instead. The
|
||||
generic methods are:
|
||||
Usually, a filesystem uses one of the generic get_sb() implementations
|
||||
and provides a fill_super() method instead. The generic methods are:
|
||||
|
||||
get_sb_bdev: mount a filesystem residing on a block device
|
||||
|
||||
|
@ -4,7 +4,7 @@ FAQ list:
|
||||
=========
|
||||
|
||||
A FAQ list may be found in the fdutils package (see below), and also
|
||||
at http://fdutils.linux.lu/FAQ.html
|
||||
at <http://fdutils.linux.lu/faq.html>.
|
||||
|
||||
|
||||
LILO configuration options (Thinkpad users, read this)
|
||||
@ -217,10 +217,10 @@ It also contains additional documentation about the floppy driver.
|
||||
The latest version can be found at fdutils homepage:
|
||||
http://fdutils.linux.lu
|
||||
|
||||
The fdutils-5.4 release can be found at:
|
||||
http://fdutils.linux.lu/fdutils-5.4.src.tar.gz
|
||||
http://www.tux.org/pub/knaff/fdutils/fdutils-5.4.src.tar.gz
|
||||
ftp://metalab.unc.edu/pub/Linux/utils/disk-management/fdutils-5.4.src.tar.gz
|
||||
The fdutils releases can be found at:
|
||||
http://fdutils.linux.lu/download.html
|
||||
http://www.tux.org/pub/knaff/fdutils/
|
||||
ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
|
||||
|
||||
Reporting problems about the floppy driver
|
||||
==========================================
|
||||
|
@ -2,7 +2,7 @@
|
||||
|
||||
The High Precision Event Timer (HPET) hardware is the future replacement
|
||||
for the 8254 and Real Time Clock (RTC) periodic timer functionality.
|
||||
Each HPET can have up two 32 timers. It is possible to configure the
|
||||
Each HPET can have up to 32 timers. It is possible to configure the
|
||||
first two timers as legacy replacements for 8254 and RTC periodic timers.
|
||||
A specification done by Intel and Microsoft can be found at
|
||||
<http://www.intel.com/hardwaredesign/hpetspec.htm>.
|
||||
|
178
Documentation/hrtimers.txt
Normal file
178
Documentation/hrtimers.txt
Normal file
@ -0,0 +1,178 @@
|
||||
|
||||
hrtimers - subsystem for high-resolution kernel timers
|
||||
----------------------------------------------------
|
||||
|
||||
This patch introduces a new subsystem for high-resolution kernel timers.
|
||||
|
||||
One might ask the question: we already have a timer subsystem
|
||||
(kernel/timers.c), why do we need two timer subsystems? After a lot of
|
||||
back and forth trying to integrate high-resolution and high-precision
|
||||
features into the existing timer framework, and after testing various
|
||||
such high-resolution timer implementations in practice, we came to the
|
||||
conclusion that the timer wheel code is fundamentally not suitable for
|
||||
such an approach. We initially didnt believe this ('there must be a way
|
||||
to solve this'), and spent a considerable effort trying to integrate
|
||||
things into the timer wheel, but we failed. In hindsight, there are
|
||||
several reasons why such integration is hard/impossible:
|
||||
|
||||
- the forced handling of low-resolution and high-resolution timers in
|
||||
the same way leads to a lot of compromises, macro magic and #ifdef
|
||||
mess. The timers.c code is very "tightly coded" around jiffies and
|
||||
32-bitness assumptions, and has been honed and micro-optimized for a
|
||||
relatively narrow use case (jiffies in a relatively narrow HZ range)
|
||||
for many years - and thus even small extensions to it easily break
|
||||
the wheel concept, leading to even worse compromises. The timer wheel
|
||||
code is very good and tight code, there's zero problems with it in its
|
||||
current usage - but it is simply not suitable to be extended for
|
||||
high-res timers.
|
||||
|
||||
- the unpredictable [O(N)] overhead of cascading leads to delays which
|
||||
necessiate a more complex handling of high resolution timers, which
|
||||
in turn decreases robustness. Such a design still led to rather large
|
||||
timing inaccuracies. Cascading is a fundamental property of the timer
|
||||
wheel concept, it cannot be 'designed out' without unevitably
|
||||
degrading other portions of the timers.c code in an unacceptable way.
|
||||
|
||||
- the implementation of the current posix-timer subsystem on top of
|
||||
the timer wheel has already introduced a quite complex handling of
|
||||
the required readjusting of absolute CLOCK_REALTIME timers at
|
||||
settimeofday or NTP time - further underlying our experience by
|
||||
example: that the timer wheel data structure is too rigid for high-res
|
||||
timers.
|
||||
|
||||
- the timer wheel code is most optimal for use cases which can be
|
||||
identified as "timeouts". Such timeouts are usually set up to cover
|
||||
error conditions in various I/O paths, such as networking and block
|
||||
I/O. The vast majority of those timers never expire and are rarely
|
||||
recascaded because the expected correct event arrives in time so they
|
||||
can be removed from the timer wheel before any further processing of
|
||||
them becomes necessary. Thus the users of these timeouts can accept
|
||||
the granularity and precision tradeoffs of the timer wheel, and
|
||||
largely expect the timer subsystem to have near-zero overhead.
|
||||
Accurate timing for them is not a core purpose - in fact most of the
|
||||
timeout values used are ad-hoc. For them it is at most a necessary
|
||||
evil to guarantee the processing of actual timeout completions
|
||||
(because most of the timeouts are deleted before completion), which
|
||||
should thus be as cheap and unintrusive as possible.
|
||||
|
||||
The primary users of precision timers are user-space applications that
|
||||
utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
|
||||
users like drivers and subsystems which require precise timed events
|
||||
(e.g. multimedia) can benefit from the availability of a seperate
|
||||
high-resolution timer subsystem as well.
|
||||
|
||||
While this subsystem does not offer high-resolution clock sources just
|
||||
yet, the hrtimer subsystem can be easily extended with high-resolution
|
||||
clock capabilities, and patches for that exist and are maturing quickly.
|
||||
The increasing demand for realtime and multimedia applications along
|
||||
with other potential users for precise timers gives another reason to
|
||||
separate the "timeout" and "precise timer" subsystems.
|
||||
|
||||
Another potential benefit is that such a seperation allows even more
|
||||
special-purpose optimization of the existing timer wheel for the low
|
||||
resolution and low precision use cases - once the precision-sensitive
|
||||
APIs are separated from the timer wheel and are migrated over to
|
||||
hrtimers. E.g. we could decrease the frequency of the timeout subsystem
|
||||
from 250 Hz to 100 HZ (or even smaller).
|
||||
|
||||
hrtimer subsystem implementation details
|
||||
----------------------------------------
|
||||
|
||||
the basic design considerations were:
|
||||
|
||||
- simplicity
|
||||
|
||||
- data structure not bound to jiffies or any other granularity. All the
|
||||
kernel logic works at 64-bit nanoseconds resolution - no compromises.
|
||||
|
||||
- simplification of existing, timing related kernel code
|
||||
|
||||
another basic requirement was the immediate enqueueing and ordering of
|
||||
timers at activation time. After looking at several possible solutions
|
||||
such as radix trees and hashes, we chose the red black tree as the basic
|
||||
data structure. Rbtrees are available as a library in the kernel and are
|
||||
used in various performance-critical areas of e.g. memory management and
|
||||
file systems. The rbtree is solely used for time sorted ordering, while
|
||||
a separate list is used to give the expiry code fast access to the
|
||||
queued timers, without having to walk the rbtree.
|
||||
|
||||
(This seperate list is also useful for later when we'll introduce
|
||||
high-resolution clocks, where we need seperate pending and expired
|
||||
queues while keeping the time-order intact.)
|
||||
|
||||
Time-ordered enqueueing is not purely for the purposes of
|
||||
high-resolution clocks though, it also simplifies the handling of
|
||||
absolute timers based on a low-resolution CLOCK_REALTIME. The existing
|
||||
implementation needed to keep an extra list of all armed absolute
|
||||
CLOCK_REALTIME timers along with complex locking. In case of
|
||||
settimeofday and NTP, all the timers (!) had to be dequeued, the
|
||||
time-changing code had to fix them up one by one, and all of them had to
|
||||
be enqueued again. The time-ordered enqueueing and the storage of the
|
||||
expiry time in absolute time units removes all this complex and poorly
|
||||
scaling code from the posix-timer implementation - the clock can simply
|
||||
be set without having to touch the rbtree. This also makes the handling
|
||||
of posix-timers simpler in general.
|
||||
|
||||
The locking and per-CPU behavior of hrtimers was mostly taken from the
|
||||
existing timer wheel code, as it is mature and well suited. Sharing code
|
||||
was not really a win, due to the different data structures. Also, the
|
||||
hrtimer functions now have clearer behavior and clearer names - such as
|
||||
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
|
||||
equivalent to del_timer() and del_timer_sync()] - so there's no direct
|
||||
1:1 mapping between them on the algorithmical level, and thus no real
|
||||
potential for code sharing either.
|
||||
|
||||
Basic data types: every time value, absolute or relative, is in a
|
||||
special nanosecond-resolution type: ktime_t. The kernel-internal
|
||||
representation of ktime_t values and operations is implemented via
|
||||
macros and inline functions, and can be switched between a "hybrid
|
||||
union" type and a plain "scalar" 64bit nanoseconds representation (at
|
||||
compile time). The hybrid union type optimizes time conversions on 32bit
|
||||
CPUs. This build-time-selectable ktime_t storage format was implemented
|
||||
to avoid the performance impact of 64-bit multiplications and divisions
|
||||
on 32bit CPUs. Such operations are frequently necessary to convert
|
||||
between the storage formats provided by kernel and userspace interfaces
|
||||
and the internal time format. (See include/linux/ktime.h for further
|
||||
details.)
|
||||
|
||||
hrtimers - rounding of timer values
|
||||
-----------------------------------
|
||||
|
||||
the hrtimer code will round timer events to lower-resolution clocks
|
||||
because it has to. Otherwise it will do no artificial rounding at all.
|
||||
|
||||
one question is, what resolution value should be returned to the user by
|
||||
the clock_getres() interface. This will return whatever real resolution
|
||||
a given clock has - be it low-res, high-res, or artificially-low-res.
|
||||
|
||||
hrtimers - testing and verification
|
||||
----------------------------------
|
||||
|
||||
We used the high-resolution clock subsystem ontop of hrtimers to verify
|
||||
the hrtimer implementation details in praxis, and we also ran the posix
|
||||
timer tests in order to ensure specification compliance. We also ran
|
||||
tests on low-resolution clocks.
|
||||
|
||||
The hrtimer patch converts the following kernel functionality to use
|
||||
hrtimers:
|
||||
|
||||
- nanosleep
|
||||
- itimers
|
||||
- posix-timers
|
||||
|
||||
The conversion of nanosleep and posix-timers enabled the unification of
|
||||
nanosleep and clock_nanosleep.
|
||||
|
||||
The code was successfully compiled for the following platforms:
|
||||
|
||||
i386, x86_64, ARM, PPC, PPC64, IA64
|
||||
|
||||
The code was run-tested on the following platforms:
|
||||
|
||||
i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
|
||||
|
||||
hrtimers were also integrated into the -rt tree, along with a
|
||||
hrtimers-based high-resolution clock implementation, so the hrtimers
|
||||
code got a healthy amount of testing and use in practice.
|
||||
|
||||
Thomas Gleixner, Ingo Molnar
|
@ -54,13 +54,16 @@ If you really want i2c accesses for these Super I/O chips,
|
||||
use the w83781d driver. However this is not the preferred method
|
||||
now that this ISA driver has been developed.
|
||||
|
||||
Technically, the w83627thf does not support a VID reading. However, it's
|
||||
possible or even likely that your mainboard maker has routed these signals
|
||||
to a specific set of general purpose IO pins (the Asus P4C800-E is one such
|
||||
board). The w83627thf driver now interprets these as VID. If the VID on
|
||||
your board doesn't work, first see doc/vid in the lm_sensors package. If
|
||||
that still doesn't help, email us at lm-sensors@lm-sensors.org.
|
||||
The w83627_HF_ uses pins 110-106 as VID0-VID4. The w83627_THF_ uses the
|
||||
same pins as GPIO[0:4]. Technically, the w83627_THF_ does not support a
|
||||
VID reading. However the two chips have the identical 128 pin package. So,
|
||||
it is possible or even likely for a w83627thf to have the VID signals routed
|
||||
to these pins despite their not being labeled for that purpose. Therefore,
|
||||
the w83627thf driver interprets these as VID. If the VID on your board
|
||||
doesn't work, first see doc/vid in the lm_sensors package[1]. If that still
|
||||
doesn't help, you may just ignore the bogus VID reading with no harm done.
|
||||
|
||||
For further information on this driver see the w83781d driver
|
||||
documentation.
|
||||
For further information on this driver see the w83781d driver documentation.
|
||||
|
||||
[1] http://www2.lm-sensors.nu/~lm78/cvs/browse.cgi/lm_sensors2/doc/vid
|
||||
|
||||
|
@ -5,7 +5,8 @@ Supported adapters:
|
||||
* nForce2 Ultra 400 MCP 10de:0084
|
||||
* nForce3 Pro150 MCP 10de:00D4
|
||||
* nForce3 250Gb MCP 10de:00E4
|
||||
* nForce4 MCP 10de:0052
|
||||
* nForce4 MCP 10de:0052
|
||||
* nForce4 MCP-04 10de:0034
|
||||
|
||||
Datasheet: not publically available, but seems to be similar to the
|
||||
AMD-8111 SMBus 2.0 adapter.
|
||||
|
@ -17,6 +17,7 @@ It currently supports the following devices:
|
||||
* Velleman K8000 adapter
|
||||
* ELV adapter
|
||||
* Analog Devices evaluation boards (ADM1025, ADM1030, ADM1031, ADM1032)
|
||||
* Barco LPT->DVI (K5800236) adapter
|
||||
|
||||
These devices use different pinout configurations, so you have to tell
|
||||
the driver what you have, using the type module parameter. There is no
|
||||
|
@ -1,10 +1,13 @@
|
||||
Revision 5, 2005-07-29
|
||||
Revision 6, 2005-11-20
|
||||
Jean Delvare <khali@linux-fr.org>
|
||||
Greg KH <greg@kroah.com>
|
||||
|
||||
This is a guide on how to convert I2C chip drivers from Linux 2.4 to
|
||||
Linux 2.6. I have been using existing drivers (lm75, lm78) as examples.
|
||||
Then I converted a driver myself (lm83) and updated this document.
|
||||
Note that this guide is strongly oriented towards hardware monitoring
|
||||
drivers. Many points are still valid for other type of drivers, but
|
||||
others may be irrelevant.
|
||||
|
||||
There are two sets of points below. The first set concerns technical
|
||||
changes. The second set concerns coding policy. Both are mandatory.
|
||||
@ -22,16 +25,20 @@ Technical changes:
|
||||
#include <linux/module.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/jiffies.h>
|
||||
#include <linux/i2c.h>
|
||||
#include <linux/i2c-isa.h> /* for ISA drivers */
|
||||
#include <linux/hwmon.h> /* for hardware monitoring drivers */
|
||||
#include <linux/hwmon-sysfs.h>
|
||||
#include <linux/hwmon-vid.h> /* if you need VRM support */
|
||||
#include <linux/err.h> /* for class registration */
|
||||
#include <asm/io.h> /* if you have I/O operations */
|
||||
Please respect this inclusion order. Some extra headers may be
|
||||
required for a given driver (e.g. "lm75.h").
|
||||
|
||||
* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, ISA addresses
|
||||
are no more handled by the i2c core.
|
||||
are no more handled by the i2c core. Address ranges are no more
|
||||
supported either, define each individual address separately.
|
||||
SENSORS_INSMOD_<n> becomes I2C_CLIENT_INSMOD_<n>.
|
||||
|
||||
* [Client data] Get rid of sysctl_id. Try using standard names for
|
||||
@ -48,23 +55,23 @@ Technical changes:
|
||||
int kind);
|
||||
static void lm75_init_client(struct i2c_client *client);
|
||||
static int lm75_detach_client(struct i2c_client *client);
|
||||
static void lm75_update_client(struct i2c_client *client);
|
||||
static struct lm75_data lm75_update_device(struct device *dev);
|
||||
|
||||
* [Sysctl] All sysctl stuff is of course gone (defines, ctl_table
|
||||
and functions). Instead, you have to define show and set functions for
|
||||
each sysfs file. Only define set for writable values. Take a look at an
|
||||
existing 2.6 driver for details (lm78 for example). Don't forget
|
||||
existing 2.6 driver for details (it87 for example). Don't forget
|
||||
to define the attributes for each file (this is that step that
|
||||
links callback functions). Use the file names specified in
|
||||
Documentation/i2c/sysfs-interface for the individual files. Also
|
||||
Documentation/hwmon/sysfs-interface for the individual files. Also
|
||||
convert the units these files read and write to the specified ones.
|
||||
If you need to add a new type of file, please discuss it on the
|
||||
sensors mailing list <lm-sensors@lm-sensors.org> by providing a
|
||||
patch to the Documentation/i2c/sysfs-interface file.
|
||||
patch to the Documentation/hwmon/sysfs-interface file.
|
||||
|
||||
* [Attach] For I2C drivers, the attach function should make sure
|
||||
that the adapter's class has I2C_CLASS_HWMON, using the
|
||||
following construct:
|
||||
that the adapter's class has I2C_CLASS_HWMON (or whatever class is
|
||||
suitable for your driver), using the following construct:
|
||||
if (!(adapter->class & I2C_CLASS_HWMON))
|
||||
return 0;
|
||||
ISA-only drivers of course don't need this.
|
||||
@ -72,63 +79,72 @@ Technical changes:
|
||||
|
||||
* [Detect] As mentioned earlier, the flags parameter is gone.
|
||||
The type_name and client_name strings are replaced by a single
|
||||
name string, which will be filled with a lowercase, short string
|
||||
(typically the driver name, e.g. "lm75").
|
||||
name string, which will be filled with a lowercase, short string.
|
||||
In i2c-only drivers, drop the i2c_is_isa_adapter check, it's
|
||||
useless. Same for isa-only drivers, as the test would always be
|
||||
true. Only hybrid drivers (which are quite rare) still need it.
|
||||
The errorN labels are reduced to the number needed. If that number
|
||||
is 2 (i2c-only drivers), it is advised that the labels are named
|
||||
exit and exit_free. For i2c+isa drivers, labels should be named
|
||||
ERROR0, ERROR1 and ERROR2. Don't forget to properly set err before
|
||||
The labels used for error paths are reduced to the number needed.
|
||||
It is advised that the labels are given descriptive names such as
|
||||
exit and exit_free. Don't forget to properly set err before
|
||||
jumping to error labels. By the way, labels should be left-aligned.
|
||||
Use kzalloc instead of kmalloc.
|
||||
Use i2c_set_clientdata to set the client data (as opposed to
|
||||
a direct access to client->data).
|
||||
Use strlcpy instead of strcpy to copy the client name.
|
||||
Use strlcpy instead of strcpy or snprintf to copy the client name.
|
||||
Replace the sysctl directory registration by calls to
|
||||
device_create_file. Move the driver initialization before any
|
||||
sysfs file creation.
|
||||
Register the client with the hwmon class (using hwmon_device_register)
|
||||
if applicable.
|
||||
Drop client->id.
|
||||
Drop any 24RF08 corruption prevention you find, as this is now done
|
||||
at the i2c-core level, and doing it twice voids it.
|
||||
Don't add I2C_CLIENT_ALLOW_USE to client->flags, it's the default now.
|
||||
|
||||
* [Init] Limits must not be set by the driver (can be done later in
|
||||
user-space). Chip should not be reset default (although a module
|
||||
parameter may be used to force is), and initialization should be
|
||||
parameter may be used to force it), and initialization should be
|
||||
limited to the strictly necessary steps.
|
||||
|
||||
* [Detach] Get rid of data, remove the call to
|
||||
i2c_deregister_entry. Do not log an error message if
|
||||
i2c_detach_client fails, as i2c-core will now do it for you.
|
||||
* [Detach] Remove the call to i2c_deregister_entry. Do not log an
|
||||
error message if i2c_detach_client fails, as i2c-core will now do
|
||||
it for you.
|
||||
Unregister from the hwmon class if applicable.
|
||||
|
||||
* [Update] Don't access client->data directly, use
|
||||
i2c_get_clientdata(client) instead.
|
||||
* [Update] The function prototype changed, it is now
|
||||
passed a device structure, which you have to convert to a client
|
||||
using to_i2c_client(dev). The update function should return a
|
||||
pointer to the client data.
|
||||
Don't access client->data directly, use i2c_get_clientdata(client)
|
||||
instead.
|
||||
Use time_after() instead of direct jiffies comparison.
|
||||
|
||||
* [Interface] Init function should not print anything. Make sure
|
||||
there is a MODULE_LICENSE() line, at the bottom of the file
|
||||
(after MODULE_AUTHOR() and MODULE_DESCRIPTION(), in this order).
|
||||
* [Interface] Make sure there is a MODULE_LICENSE() line, at the bottom
|
||||
of the file (after MODULE_AUTHOR() and MODULE_DESCRIPTION(), in this
|
||||
order).
|
||||
|
||||
* [Driver] The flags field of the i2c_driver structure is gone.
|
||||
I2C_DF_NOTIFY is now the default behavior.
|
||||
The i2c_driver structure has a driver member, which is itself a
|
||||
structure, those name member should be initialized to a driver name
|
||||
string. i2c_driver itself has no name member anymore.
|
||||
|
||||
Coding policy:
|
||||
|
||||
* [Copyright] Use (C), not (c), for copyright.
|
||||
|
||||
* [Debug/log] Get rid of #ifdef DEBUG/#endif constructs whenever you
|
||||
can. Calls to printk/pr_debug for debugging purposes are replaced
|
||||
by calls to dev_dbg. Here is an example on how to call it (taken
|
||||
from lm75_detect):
|
||||
can. Calls to printk for debugging purposes are replaced by calls to
|
||||
dev_dbg where possible, else to pr_debug. Here is an example of how
|
||||
to call it (taken from lm75_detect):
|
||||
dev_dbg(&client->dev, "Starting lm75 update\n");
|
||||
Replace other printk calls with the dev_info, dev_err or dev_warn
|
||||
function, as appropriate.
|
||||
|
||||
* [Constants] Constants defines (registers, conversions, initial
|
||||
values) should be aligned. This greatly improves readability.
|
||||
Same goes for variables declarations. Alignments are achieved by the
|
||||
means of tabs, not spaces. Remember that tabs are set to 8 in the
|
||||
Linux kernel code.
|
||||
|
||||
* [Structure definition] The name field should be standardized. All
|
||||
lowercase and as simple as the driver name itself (e.g. "lm75").
|
||||
* [Constants] Constants defines (registers, conversions) should be
|
||||
aligned. This greatly improves readability.
|
||||
Alignments are achieved by the means of tabs, not spaces. Remember
|
||||
that tabs are set to 8 in the Linux kernel code.
|
||||
|
||||
* [Layout] Avoid extra empty lines between comments and what they
|
||||
comment. Respect the coding style (see Documentation/CodingStyle),
|
||||
|
@ -25,9 +25,9 @@ routines, a client structure specific information like the actual I2C
|
||||
address.
|
||||
|
||||
static struct i2c_driver foo_driver = {
|
||||
.owner = THIS_MODULE,
|
||||
.name = "Foo version 2.3 driver",
|
||||
.flags = I2C_DF_NOTIFY,
|
||||
.driver = {
|
||||
.name = "foo",
|
||||
},
|
||||
.attach_adapter = &foo_attach_adapter,
|
||||
.detach_client = &foo_detach_client,
|
||||
.command = &foo_command /* may be NULL */
|
||||
@ -36,10 +36,6 @@ static struct i2c_driver foo_driver = {
|
||||
The name field must match the driver name, including the case. It must not
|
||||
contain spaces, and may be up to 31 characters long.
|
||||
|
||||
Don't worry about the flags field; just put I2C_DF_NOTIFY into it. This
|
||||
means that your driver will be notified when new adapters are found.
|
||||
This is almost always what you want.
|
||||
|
||||
All other fields are for call-back functions which will be explained
|
||||
below.
|
||||
|
||||
@ -496,17 +492,13 @@ Note that some functions are marked by `__init', and some data structures
|
||||
by `__init_data'. Hose functions and structures can be removed after
|
||||
kernel booting (or module loading) is completed.
|
||||
|
||||
|
||||
Command function
|
||||
================
|
||||
|
||||
A generic ioctl-like function call back is supported. You will seldom
|
||||
need this. You may even set it to NULL.
|
||||
|
||||
/* No commands defined */
|
||||
int foo_command(struct i2c_client *client, unsigned int cmd, void *arg)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
need this, and its use is deprecated anyway, so newer design should not
|
||||
use it. Set it to NULL.
|
||||
|
||||
|
||||
Sending and receiving
|
||||
|
@ -185,7 +185,7 @@ VII. Getting Parameters
|
||||
ENOMEM Kernel memory allocation error
|
||||
|
||||
A return value of 0 does not mean that the value was actually
|
||||
properly retreived. The user should check the result list
|
||||
properly retrieved. The user should check the result list
|
||||
to determine the specific status of the transaction.
|
||||
|
||||
VIII. Downloading Software
|
||||
|
@ -3,7 +3,7 @@ Apple Touchpad Driver (appletouch)
|
||||
Copyright (C) 2005 Stelian Pop <stelian@popies.net>
|
||||
|
||||
appletouch is a Linux kernel driver for the USB touchpad found on post
|
||||
February 2005 Apple Alu Powerbooks.
|
||||
February 2005 and October 2005 Apple Aluminium Powerbooks.
|
||||
|
||||
This driver is derived from Johannes Berg's appletrackpad driver[1], but it has
|
||||
been improved in some areas:
|
||||
@ -13,7 +13,8 @@ been improved in some areas:
|
||||
|
||||
Credits go to Johannes Berg for reverse-engineering the touchpad protocol,
|
||||
Frank Arnold for further improvements, and Alex Harper for some additional
|
||||
information about the inner workings of the touchpad sensors.
|
||||
information about the inner workings of the touchpad sensors. Michael
|
||||
Hanselmann added support for the October 2005 models.
|
||||
|
||||
Usage:
|
||||
------
|
||||
|
@ -120,7 +120,7 @@ to the unique id assigned by the driver. This data is required for performing
|
||||
some operations (removing an effect, controlling the playback).
|
||||
This if field must be set to -1 by the user in order to tell the driver to
|
||||
allocate a new effect.
|
||||
See <linux/input.h> for a description of the ff_effect stuct. You should also
|
||||
See <linux/input.h> for a description of the ff_effect struct. You should also
|
||||
find help in a few sketches, contained in files shape.fig and interactive.fig.
|
||||
You need xfig to visualize these files.
|
||||
|
||||
|
@ -133,7 +133,7 @@ Code Seq# Include File Comments
|
||||
'l' 00-3F linux/tcfs_fs.h transparent cryptographic file system
|
||||
<http://mikonos.dia.unisa.it/tcfs>
|
||||
'l' 40-7F linux/udf_fs_i.h in development:
|
||||
<http://www.trylinux.com/projects/udf/>
|
||||
<http://sourceforge.net/projects/linux-udf/>
|
||||
'm' all linux/mtio.h conflict!
|
||||
'm' all linux/soundcard.h conflict!
|
||||
'm' all linux/synclink.h conflict!
|
||||
|
@ -946,7 +946,7 @@ HDIO_SCAN_HWIF register and (re)scan interface
|
||||
|
||||
This ioctl initializes the addresses and irq for a disk
|
||||
controller, probes for drives, and creates /proc/ide
|
||||
interfaces as appropiate.
|
||||
interfaces as appropriate.
|
||||
|
||||
|
||||
|
||||
|
@ -1033,9 +1033,9 @@ When kbuild executes the following steps are followed (roughly):
|
||||
|
||||
Example:
|
||||
#arch/i386/Makefile
|
||||
GCC_VERSION := $(call cc-version)
|
||||
cflags-y += $(shell \
|
||||
if [ $(GCC_VERSION) -ge 0300 ] ; then echo "-mregparm=3"; fi ;)
|
||||
if [ $(call cc-version) -ge 0300 ] ; then \
|
||||
echo "-mregparm=3"; fi ;)
|
||||
|
||||
In the above example -mregparm=3 is only used for gcc version greater
|
||||
than or equal to gcc 3.0.
|
||||
|
@ -18,6 +18,7 @@ In this document you will find information about:
|
||||
=== 5. Include files
|
||||
--- 5.1 How to include files from the kernel include dir
|
||||
--- 5.2 External modules using an include/ dir
|
||||
--- 5.3 External modules using several directories
|
||||
=== 6. Module installation
|
||||
--- 6.1 INSTALL_MOD_PATH
|
||||
--- 6.2 INSTALL_MOD_DIR
|
||||
@ -38,7 +39,7 @@ included in the kernel tree.
|
||||
What is covered within this file is mainly information to authors
|
||||
of modules. The author of an external modules should supply
|
||||
a makefile that hides most of the complexity so one only has to type
|
||||
'make' to buld the module. A complete example will be present in
|
||||
'make' to build the module. A complete example will be present in
|
||||
chapter ¤. Creating a kbuild file for an external module".
|
||||
|
||||
|
||||
@ -69,7 +70,7 @@ when building an external module.
|
||||
|
||||
--- 2.2 Available targets
|
||||
|
||||
$KDIR refers to path to kernel source top-level directory
|
||||
$KDIR refers to the path to the kernel source top-level directory
|
||||
|
||||
make -C $KDIR M=`pwd`
|
||||
Will build the module(s) located in current directory.
|
||||
@ -87,11 +88,11 @@ when building an external module.
|
||||
make -C $KDIR M=$PWD modules_install
|
||||
Install the external module(s).
|
||||
Installation default is in /lib/modules/<kernel-version>/extra,
|
||||
but may be prefixed with INSTALL_MOD_PATH - see separate chater.
|
||||
but may be prefixed with INSTALL_MOD_PATH - see separate chapter.
|
||||
|
||||
make -C $KDIR M=$PWD clean
|
||||
Remove all generated files for the module - the kernel
|
||||
source directory is not moddified.
|
||||
source directory is not modified.
|
||||
|
||||
make -C $KDIR M=`pwd` help
|
||||
help will list the available target when building external
|
||||
@ -99,7 +100,7 @@ when building an external module.
|
||||
|
||||
--- 2.3 Available options:
|
||||
|
||||
$KDIR refer to path to kernel src
|
||||
$KDIR refers to the path to the kernel source top-level directory
|
||||
|
||||
make -C $KDIR
|
||||
Used to specify where to find the kernel source.
|
||||
@ -206,11 +207,11 @@ following files:
|
||||
|
||||
KERNELDIR := /lib/modules/`uname -r`/build
|
||||
all::
|
||||
$(MAKE) -C $KERNELDIR M=`pwd` $@
|
||||
$(MAKE) -C $(KERNELDIR) M=`pwd` $@
|
||||
|
||||
# Module specific targets
|
||||
genbin:
|
||||
echo "X" > 8123_bini.o_shipped
|
||||
echo "X" > 8123_bin.o_shipped
|
||||
|
||||
endif
|
||||
|
||||
@ -341,13 +342,52 @@ directory and therefore needs to deal with this in their kbuild file.
|
||||
EXTRA_CFLAGS := -Iinclude
|
||||
8123-y := 8123_if.o 8123_pci.o 8123_bin.o
|
||||
|
||||
Note that in the assingment there is no space between -I and the path.
|
||||
This is a kbuild limitation and no space must be present.
|
||||
Note that in the assignment there is no space between -I and the path.
|
||||
This is a kbuild limitation: there must be no space present.
|
||||
|
||||
--- 5.3 External modules using several directories
|
||||
|
||||
If an external module does not follow the usual kernel style but
|
||||
decide to spread files over several directories then kbuild can
|
||||
support this too.
|
||||
|
||||
Consider the following example:
|
||||
|
||||
|
|
||||
+- src/complex_main.c
|
||||
| +- hal/hardwareif.c
|
||||
| +- hal/include/hardwareif.h
|
||||
+- include/complex.h
|
||||
|
||||
To build a single module named complex.ko we then need the following
|
||||
kbuild file:
|
||||
|
||||
Kbuild:
|
||||
obj-m := complex.o
|
||||
complex-y := src/complex_main.o
|
||||
complex-y += src/hal/hardwareif.o
|
||||
|
||||
EXTRA_CFLAGS := -I$(src)/include
|
||||
EXTRA_CFLAGS += -I$(src)src/hal/include
|
||||
|
||||
|
||||
kbuild knows how to handle .o files located in another directory -
|
||||
although this is NOT reccommended practice. The syntax is to specify
|
||||
the directory relative to the directory where the Kbuild file is
|
||||
located.
|
||||
|
||||
To find the .h files we have to explicitly tell kbuild where to look
|
||||
for the .h files. When kbuild executes current directory is always
|
||||
the root of the kernel tree (argument to -C) and therefore we have to
|
||||
tell kbuild how to find the .h files using absolute paths.
|
||||
$(src) will specify the absolute path to the directory where the
|
||||
Kbuild file are located when being build as an external module.
|
||||
Therefore -I$(src)/ is used to point out the directory of the Kbuild
|
||||
file and any additional path are just appended.
|
||||
|
||||
=== 6. Module installation
|
||||
|
||||
Modules which are included in the kernel is installed in the directory:
|
||||
Modules which are included in the kernel are installed in the directory:
|
||||
|
||||
/lib/modules/$(KERNELRELEASE)/kernel
|
||||
|
||||
@ -365,7 +405,7 @@ External modules are installed in the directory:
|
||||
=> Install dir: /frodo/lib/modules/$(KERNELRELEASE)/kernel
|
||||
|
||||
INSTALL_MOD_PATH may be set as an ordinary shell variable or as in the
|
||||
example above be specified on the commandline when calling make.
|
||||
example above be specified on the command line when calling make.
|
||||
INSTALL_MOD_PATH has effect both when installing modules included in
|
||||
the kernel as well as when installing external modules.
|
||||
|
||||
@ -384,7 +424,7 @@ External modules are installed in the directory:
|
||||
|
||||
=== 7. Module versioning
|
||||
|
||||
Module versioning are enabled by the CONFIG_MODVERSIONS tag.
|
||||
Module versioning is enabled by the CONFIG_MODVERSIONS tag.
|
||||
|
||||
Module versioning is used as a simple ABI consistency check. The Module
|
||||
versioning creates a CRC value of the full prototype for an exported symbol and
|
||||
|
@ -177,3 +177,25 @@ document trapinfo
|
||||
'trapinfo <pid>' will tell you by which trap & possibly
|
||||
addresthe kernel paniced.
|
||||
end
|
||||
|
||||
|
||||
define dmesg
|
||||
set $i = 0
|
||||
set $end_idx = (log_end - 1) & (log_buf_len - 1)
|
||||
|
||||
while ($i < logged_chars)
|
||||
set $idx = (log_end - 1 - logged_chars + $i) & (log_buf_len - 1)
|
||||
|
||||
if ($idx + 100 <= $end_idx) || \
|
||||
($end_idx <= $idx && $idx + 100 < log_buf_len)
|
||||
printf "%.100s", &log_buf[$idx]
|
||||
set $i = $i + 100
|
||||
else
|
||||
printf "%c", log_buf[$idx]
|
||||
set $i = $i + 1
|
||||
end
|
||||
end
|
||||
end
|
||||
document dmesg
|
||||
print the kernel ring buffer
|
||||
end
|
||||
|
@ -4,10 +4,10 @@ Documentation for kdump - the kexec-based crash dumping solution
|
||||
DESIGN
|
||||
======
|
||||
|
||||
Kdump uses kexec to reboot to a second kernel whenever a dump needs to be taken.
|
||||
This second kernel is booted with very little memory. The first kernel reserves
|
||||
the section of memory that the second kernel uses. This ensures that on-going
|
||||
DMA from the first kernel does not corrupt the second kernel.
|
||||
Kdump uses kexec to reboot to a second kernel whenever a dump needs to be
|
||||
taken. This second kernel is booted with very little memory. The first kernel
|
||||
reserves the section of memory that the second kernel uses. This ensures that
|
||||
on-going DMA from the first kernel does not corrupt the second kernel.
|
||||
|
||||
All the necessary information about Core image is encoded in ELF format and
|
||||
stored in reserved area of memory before crash. Physical address of start of
|
||||
@ -35,77 +35,82 @@ In the second kernel, "old memory" can be accessed in two ways.
|
||||
SETUP
|
||||
=====
|
||||
|
||||
1) Download http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz
|
||||
and apply http://lse.sourceforge.net/kdump/patches/kexec-tools-1.101-kdump.patch
|
||||
and after that build the source.
|
||||
1) Download the upstream kexec-tools userspace package from
|
||||
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz.
|
||||
|
||||
2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernel.
|
||||
Apply the latest consolidated kdump patch on top of kexec-tools-1.101
|
||||
from http://lse.sourceforge.net/kdump/. This arrangment has been made
|
||||
till all the userspace patches supporting kdump are integrated with
|
||||
upstream kexec-tools userspace.
|
||||
|
||||
2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernels.
|
||||
Two kernels need to be built in order to get this feature working.
|
||||
Following are the steps to properly configure the two kernels specific
|
||||
to kexec and kdump features:
|
||||
|
||||
A) First kernel:
|
||||
A) First kernel or regular kernel:
|
||||
----------------------------------
|
||||
a) Enable "kexec system call" feature (in Processor type and features).
|
||||
CONFIG_KEXEC=y
|
||||
b) This kernel's physical load address should be the default value of
|
||||
0x100000 (0x100000, 1 MB) (in Processor type and features).
|
||||
CONFIG_PHYSICAL_START=0x100000
|
||||
c) Enable "sysfs file system support" (in Pseudo filesystems).
|
||||
CONFIG_SYSFS=y
|
||||
CONFIG_KEXEC=y
|
||||
b) Enable "sysfs file system support" (in Pseudo filesystems).
|
||||
CONFIG_SYSFS=y
|
||||
c) make
|
||||
d) Boot into first kernel with the command line parameter "crashkernel=Y@X".
|
||||
Use appropriate values for X and Y. Y denotes how much memory to reserve
|
||||
for the second kernel, and X denotes at what physical address the reserved
|
||||
memory section starts. For example: "crashkernel=64M@16M".
|
||||
for the second kernel, and X denotes at what physical address the
|
||||
reserved memory section starts. For example: "crashkernel=64M@16M".
|
||||
|
||||
B) Second kernel:
|
||||
a) Enable "kernel crash dumps" feature (in Processor type and features).
|
||||
CONFIG_CRASH_DUMP=y
|
||||
b) Specify a suitable value for "Physical address where the kernel is
|
||||
loaded" (in Processor type and features). Typically this value
|
||||
should be same as X (See option d) above, e.g., 16 MB or 0x1000000.
|
||||
CONFIG_PHYSICAL_START=0x1000000
|
||||
c) Enable "/proc/vmcore support" (Optional, in Pseudo filesystems).
|
||||
CONFIG_PROC_VMCORE=y
|
||||
d) Disable SMP support and build a UP kernel (Until it is fixed).
|
||||
CONFIG_SMP=n
|
||||
e) Enable "Local APIC support on uniprocessors".
|
||||
CONFIG_X86_UP_APIC=y
|
||||
f) Enable "IO-APIC support on uniprocessors"
|
||||
CONFIG_X86_UP_IOAPIC=y
|
||||
|
||||
Note: i) Options a) and b) depend upon "Configure standard kernel features
|
||||
(for small systems)" (under General setup).
|
||||
ii) Option a) also depends on CONFIG_HIGHMEM (under Processor
|
||||
type and features).
|
||||
iii) Both option a) and b) are under "Processor type and features".
|
||||
B) Second kernel or dump capture kernel:
|
||||
---------------------------------------
|
||||
a) For i386 architecture enable Highmem support
|
||||
CONFIG_HIGHMEM=y
|
||||
b) Enable "kernel crash dumps" feature (under "Processor type and features")
|
||||
CONFIG_CRASH_DUMP=y
|
||||
c) Make sure a suitable value for "Physical address where the kernel is
|
||||
loaded" (under "Processor type and features"). By default this value
|
||||
is 0x1000000 (16MB) and it should be same as X (See option d above),
|
||||
e.g., 16 MB or 0x1000000.
|
||||
CONFIG_PHYSICAL_START=0x1000000
|
||||
d) Enable "/proc/vmcore support" (Optional, under "Pseudo filesystems").
|
||||
CONFIG_PROC_VMCORE=y
|
||||
|
||||
3) Boot into the first kernel. You are now ready to try out kexec-based crash
|
||||
dumps.
|
||||
|
||||
4) Load the second kernel to be booted using:
|
||||
3) After booting to regular kernel or first kernel, load the second kernel
|
||||
using the following command:
|
||||
|
||||
kexec -p <second-kernel> --args-linux --elf32-core-headers
|
||||
--append="root=<root-dev> init 1 irqpoll"
|
||||
--append="root=<root-dev> init 1 irqpoll maxcpus=1"
|
||||
|
||||
Note: i) <second-kernel> has to be a vmlinux image. bzImage will not work,
|
||||
as of now.
|
||||
ii) By default ELF headers are stored in ELF64 format. Option
|
||||
--elf32-core-headers forces generation of ELF32 headers. gdb can
|
||||
not open ELF64 headers on 32 bit systems. So creating ELF32
|
||||
headers can come handy for users who have got non-PAE systems and
|
||||
hence have memory less than 4GB.
|
||||
iii) Specify "irqpoll" as command line parameter. This reduces driver
|
||||
initialization failures in second kernel due to shared interrupts.
|
||||
iv) <root-dev> needs to be specified in a format corresponding to
|
||||
the root device name in the output of mount command.
|
||||
v) If you have built the drivers required to mount root file
|
||||
system as modules in <second-kernel>, then, specify
|
||||
--initrd=<initrd-for-second-kernel>.
|
||||
Notes:
|
||||
======
|
||||
i) <second-kernel> has to be a vmlinux image ie uncompressed elf image.
|
||||
bzImage will not work, as of now.
|
||||
ii) --args-linux has to be speicfied as if kexec it loading an elf image,
|
||||
it needs to know that the arguments supplied are of linux type.
|
||||
iii) By default ELF headers are stored in ELF64 format to support systems
|
||||
with more than 4GB memory. Option --elf32-core-headers forces generation
|
||||
of ELF32 headers. The reason for this option being, as of now gdb can
|
||||
not open vmcore file with ELF64 headers on a 32 bit systems. So ELF32
|
||||
headers can be used if one has non-PAE systems and hence memory less
|
||||
than 4GB.
|
||||
iv) Specify "irqpoll" as command line parameter. This reduces driver
|
||||
initialization failures in second kernel due to shared interrupts.
|
||||
v) <root-dev> needs to be specified in a format corresponding to the root
|
||||
device name in the output of mount command.
|
||||
vi) If you have built the drivers required to mount root file system as
|
||||
modules in <second-kernel>, then, specify
|
||||
--initrd=<initrd-for-second-kernel>.
|
||||
vii) Specify maxcpus=1 as, if during first kernel run, if panic happens on
|
||||
non-boot cpus, second kernel doesn't seem to be boot up all the cpus.
|
||||
The other option is to always built the second kernel without SMP
|
||||
support ie CONFIG_SMP=n
|
||||
|
||||
5) System reboots into the second kernel when a panic occurs. A module can be
|
||||
written to force the panic or "ALT-SysRq-c" can be used initiate a crash
|
||||
dump for testing purposes.
|
||||
4) After successfully loading the second kernel as above, if a panic occurs
|
||||
system reboots into the second kernel. A module can be written to force
|
||||
the panic or "ALT-SysRq-c" can be used initiate a crash dump for testing
|
||||
purposes.
|
||||
|
||||
6) Write out the dump file using
|
||||
5) Once the second kernel has booted, write out the dump file using
|
||||
|
||||
cp /proc/vmcore <dump-file>
|
||||
|
||||
@ -119,9 +124,9 @@ SETUP
|
||||
|
||||
Entire memory: dd if=/dev/oldmem of=oldmem.001
|
||||
|
||||
|
||||
ANALYSIS
|
||||
========
|
||||
|
||||
Limited analysis can be done using gdb on the dump file copied out of
|
||||
/proc/vmcore. Use vmlinux built with -g and run
|
||||
|
||||
@ -132,15 +137,19 @@ work fine.
|
||||
|
||||
Note: gdb cannot analyse core files generated in ELF64 format for i386.
|
||||
|
||||
Latest "crash" (crash-4.0-2.18) as available on Dave Anderson's site
|
||||
http://people.redhat.com/~anderson/ works well with kdump format.
|
||||
|
||||
|
||||
TODO
|
||||
====
|
||||
|
||||
1) Provide a kernel pages filtering mechanism so that core file size is not
|
||||
insane on systems having huge memory banks.
|
||||
2) Modify "crash" tool to make it recognize this dump.
|
||||
2) Relocatable kernel can help in maintaining multiple kernels for crashdump
|
||||
and same kernel as the first kernel can be used to capture the dump.
|
||||
|
||||
|
||||
CONTACT
|
||||
=======
|
||||
|
||||
Vivek Goyal (vgoyal@in.ibm.com)
|
||||
Maneesh Soni (maneesh@in.ibm.com)
|
||||
|
@ -196,7 +196,7 @@
|
||||
|
||||
* Title: "Writing Linux Device Drivers"
|
||||
Author: Michael K. Johnson.
|
||||
URL: http://people.redhat.com/johnsonm/devices.html
|
||||
URL: http://users.evitech.fi/~tk/rtos/writing_linux_device_d.html
|
||||
Keywords: files, VFS, file operations, kernel interface, character
|
||||
vs block devices, I/O access, hardware interrupts, DMA, access to
|
||||
user memory, memory allocation, timers.
|
||||
@ -284,7 +284,7 @@
|
||||
|
||||
* Title: "Linux Kernel Module Programming Guide"
|
||||
Author: Ori Pomerantz.
|
||||
URL: http://www.tldp.org/LDP/lkmpg/mpg.html
|
||||
URL: http://tldp.org/LDP/lkmpg/2.6/html/index.html
|
||||
Keywords: modules, GPL book, /proc, ioctls, system calls,
|
||||
interrupt handlers .
|
||||
Description: Very nice 92 pages GPL book on the topic of modules
|
||||
@ -292,7 +292,7 @@
|
||||
|
||||
* Title: "Device File System (devfs) Overview"
|
||||
Author: Richard Gooch.
|
||||
URL: http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.txt
|
||||
URL: http://www.atnf.csiro.au/people/rgooch/linux/docs/devfs.html
|
||||
Keywords: filesystem, /dev, devfs, dynamic devices, major/minor
|
||||
allocation, device management.
|
||||
Description: Document describing Richard Gooch's controversial
|
||||
@ -316,9 +316,8 @@
|
||||
|
||||
* Title: "The Kernel Hacking HOWTO"
|
||||
Author: Various Talented People, and Rusty.
|
||||
URL:
|
||||
http://www.lisoleg.net/doc/Kernel-Hacking-HOWTO/kernel-hacking-HOW
|
||||
TO.html
|
||||
Location: in kernel tree, Documentation/DocBook/kernel-hacking/
|
||||
(must be built as "make {htmldocs | psdocs | pdfdocs})
|
||||
Keywords: HOWTO, kernel contexts, deadlock, locking, modules,
|
||||
symbols, return conventions.
|
||||
Description: From the Introduction: "Please understand that I
|
||||
@ -332,13 +331,13 @@
|
||||
originally written for the 2.3 kernels, but nearly all of it
|
||||
applies to 2.2 too; 2.0 is slightly different".
|
||||
|
||||
* Title: "ALSA 0.5.0 Developer documentation"
|
||||
Author: Stephan 'Jumpy' Bartels .
|
||||
URL: http://www.math.TU-Berlin.de/~sbartels/alsa/
|
||||
* Title: "Writing an ALSA Driver"
|
||||
Author: Takashi Iwai <tiwai@suse.de>
|
||||
URL: http://www.alsa-project.org/~iwai/writing-an-alsa-driver/index.html
|
||||
Keywords: ALSA, sound, soundcard, driver, lowlevel, hardware.
|
||||
Description: Advanced Linux Sound Architecture for developers,
|
||||
both at kernel and user-level sides. Work in progress. ALSA is
|
||||
supposed to be Linux's next generation sound architecture.
|
||||
both at kernel and user-level sides. ALSA is the Linux kernel
|
||||
sound architecture in the 2.6 kernel version.
|
||||
|
||||
* Title: "Programming Guide for Linux USB Device Drivers"
|
||||
Author: Detlef Fliegl.
|
||||
@ -369,8 +368,8 @@
|
||||
filesystems, IPC and Networking Code.
|
||||
|
||||
* Title: "Linux Kernel Mailing List Glossary"
|
||||
Author: John Levon.
|
||||
URL: http://www.movement.uklinux.net/glossary.html
|
||||
Author: various
|
||||
URL: http://kernelnewbies.org/glossary/
|
||||
Keywords: glossary, terms, linux-kernel.
|
||||
Description: From the introduction: "This glossary is intended as
|
||||
a brief description of some of the acronyms and terms you may hear
|
||||
@ -378,9 +377,8 @@
|
||||
|
||||
* Title: "Linux Kernel Locking HOWTO"
|
||||
Author: Various Talented People, and Rusty.
|
||||
URL:
|
||||
http://netfilter.kernelnotes.org/unreliable-guides/kernel-locking-
|
||||
HOWTO.html
|
||||
Location: in kernel tree, Documentation/DocBook/kernel-locking/
|
||||
(must be built as "make {htmldocs | psdocs | pdfdocs})
|
||||
Keywords: locks, locking, spinlock, semaphore, atomic, race
|
||||
condition, bottom halves, tasklets, softirqs.
|
||||
Description: The title says it all: document describing the
|
||||
@ -490,7 +488,7 @@
|
||||
|
||||
* Title: "Get those boards talking under Linux."
|
||||
Author: Alex Ivchenko.
|
||||
URL: http://www.ednmag.com/ednmag/reg/2000/06222000/13df2.htm
|
||||
URL: http://www.edn.com/article/CA46968.html
|
||||
Keywords: data-acquisition boards, drivers, modules, interrupts,
|
||||
memory allocation.
|
||||
Description: Article written for people wishing to make their data
|
||||
@ -498,7 +496,7 @@
|
||||
overview on writing drivers, from the naming of functions to
|
||||
interrupt handling.
|
||||
Notes: Two-parts article. Part II is at
|
||||
http://www.ednmag.com/ednmag/reg/2000/07062000/14df.htm
|
||||
URL: http://www.edn.com/article/CA46998.html
|
||||
|
||||
* Title: "Linux PCMCIA Programmer's Guide"
|
||||
Author: David Hinds.
|
||||
@ -529,7 +527,7 @@
|
||||
definitive guide for hackers, virus coders and system
|
||||
administrators."
|
||||
Author: pragmatic/THC.
|
||||
URL: http://packetstorm.securify.com/groups/thc/LKM_HACKING.html
|
||||
URL: http://packetstormsecurity.org/docs/hack/LKM_HACKING.html
|
||||
Keywords: syscalls, intercept, hide, abuse, symbol table.
|
||||
Description: Interesting paper on how to abuse the Linux kernel in
|
||||
order to intercept and modify syscalls, make
|
||||
@ -537,8 +535,7 @@
|
||||
write kernel modules based virus... and solutions for admins to
|
||||
avoid all those abuses.
|
||||
Notes: For 2.0.x kernels. Gives guidances to port it to 2.2.x
|
||||
kernels. Also available in txt format at
|
||||
http://www.blacknemesis.org/hacking/txt/cllkm.txt
|
||||
kernels.
|
||||
|
||||
BOOKS: (Not on-line)
|
||||
|
||||
@ -557,7 +554,17 @@
|
||||
ISBN: 0-59600-008-1
|
||||
Notes: Further information in
|
||||
http://www.oreilly.com/catalog/linuxdrive2/
|
||||
|
||||
|
||||
* Title: "Linux Device Drivers, 3nd Edition"
|
||||
Authors: Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman
|
||||
Publisher: O'Reilly & Associates.
|
||||
Date: 2005.
|
||||
Pages: 636.
|
||||
ISBN: 0-596-00590-3
|
||||
Notes: Further information in
|
||||
http://www.oreilly.com/catalog/linuxdrive3/
|
||||
PDF format, URL: http://lwn.net/Kernel/LDD3/
|
||||
|
||||
* Title: "Linux Kernel Internals"
|
||||
Author: Michael Beck.
|
||||
Publisher: Addison-Wesley.
|
||||
@ -766,12 +773,15 @@
|
||||
documents, FAQs...
|
||||
|
||||
* Name: "linux-kernel mailing list archives and search engines"
|
||||
URL: http://vger.kernel.org/vger-lists.html
|
||||
URL: http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html
|
||||
URL: http://www.kernelnotes.org/lnxlists/linux-kernel/
|
||||
URL: http://www.geocrawler.com
|
||||
URL: http://marc.theaimsgroup.com/?l=linux-kernel
|
||||
URL: http://groups.google.com/group/mlist.linux.kernel
|
||||
URL: http://www.cs.helsinki.fi/linux/linux-kernel/
|
||||
URL: http://www.lib.uaa.alaska.edu/linux-kernel/
|
||||
Keywords: linux-kernel, archives, search.
|
||||
Description: Some of the linux-kernel mailing list archivers. If
|
||||
you have a better/another one, please let me know.
|
||||
_________________________________________________________________
|
||||
|
||||
Document last updated on Thu Jun 28 15:09:39 CEST 2001
|
||||
Document last updated on Sat 2005-NOV-19
|
||||
|
@ -471,14 +471,15 @@ running once the system is up.
|
||||
arch/i386/kernel/cpu/cpufreq/elanfreq.c.
|
||||
|
||||
elevator= [IOSCHED]
|
||||
Format: {"as" | "cfq" | "deadline" | "noop"}
|
||||
Format: {"anticipatory" | "cfq" | "deadline" | "noop"}
|
||||
See Documentation/block/as-iosched.txt and
|
||||
Documentation/block/deadline-iosched.txt for details.
|
||||
|
||||
elfcorehdr= [IA-32]
|
||||
elfcorehdr= [IA-32, X86_64]
|
||||
Specifies physical address of start of kernel core
|
||||
image elf header.
|
||||
See Documentation/kdump.txt for details.
|
||||
image elf header. Generally kexec loader will
|
||||
pass this option to capture kernel.
|
||||
See Documentation/kdump/kdump.txt for details.
|
||||
|
||||
enforcing [SELINUX] Set initial enforcing status.
|
||||
Format: {"0" | "1"}
|
||||
@ -633,6 +634,14 @@ running once the system is up.
|
||||
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
|
||||
Format: <irq>
|
||||
|
||||
combined_mode= [HW] control which driver uses IDE ports in combined
|
||||
mode: legacy IDE driver, libata, or both
|
||||
(in the libata case, libata.atapi_enabled=1 may be
|
||||
useful as well). Note that using the ide or libata
|
||||
options may affect your device naming (e.g. by
|
||||
changing hdc to sdb).
|
||||
Format: combined (default), ide, or libata
|
||||
|
||||
inttest= [IA64]
|
||||
|
||||
io7= [HW] IO7 for Marvel based alpha systems
|
||||
@ -703,9 +712,17 @@ running once the system is up.
|
||||
load_ramdisk= [RAM] List of ramdisks to load from floppy
|
||||
See Documentation/ramdisk.txt.
|
||||
|
||||
lockd.udpport= [NFS]
|
||||
lockd.nlm_grace_period=P [NFS] Assign grace period.
|
||||
Format: <integer>
|
||||
|
||||
lockd.tcpport= [NFS]
|
||||
lockd.nlm_tcpport=N [NFS] Assign TCP port.
|
||||
Format: <integer>
|
||||
|
||||
lockd.nlm_timeout=T [NFS] Assign timeout value.
|
||||
Format: <integer>
|
||||
|
||||
lockd.nlm_udpport=M [NFS] Assign UDP port.
|
||||
Format: <integer>
|
||||
|
||||
logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver
|
||||
Format: <irq>
|
||||
@ -824,7 +841,7 @@ running once the system is up.
|
||||
mem=nopentium [BUGS=IA-32] Disable usage of 4MB pages for kernel
|
||||
memory.
|
||||
|
||||
memmap=exactmap [KNL,IA-32] Enable setting of an exact
|
||||
memmap=exactmap [KNL,IA-32,X86_64] Enable setting of an exact
|
||||
E820 memory map, as specified by the user.
|
||||
Such memmap=exactmap lines can be constructed based on
|
||||
BIOS output or other requirements. See the memmap=nn@ss
|
||||
@ -847,6 +864,49 @@ running once the system is up.
|
||||
|
||||
mga= [HW,DRM]
|
||||
|
||||
migration_cost=
|
||||
[KNL,SMP] debug: override scheduler migration costs
|
||||
Format: <level-1-usecs>,<level-2-usecs>,...
|
||||
This debugging option can be used to override the
|
||||
default scheduler migration cost matrix. The numbers
|
||||
are indexed by 'CPU domain distance'.
|
||||
E.g. migration_cost=1000,2000,3000 on an SMT NUMA
|
||||
box will set up an intra-core migration cost of
|
||||
1 msec, an inter-core migration cost of 2 msecs,
|
||||
and an inter-node migration cost of 3 msecs.
|
||||
|
||||
WARNING: using the wrong values here can break
|
||||
scheduler performance, so it's only for scheduler
|
||||
development purposes, not production environments.
|
||||
|
||||
migration_debug=
|
||||
[KNL,SMP] migration cost auto-detect verbosity
|
||||
Format=<0|1|2>
|
||||
If a system's migration matrix reported at bootup
|
||||
seems erroneous then this option can be used to
|
||||
increase verbosity of the detection process.
|
||||
We default to 0 (no extra messages), 1 will print
|
||||
some more information, and 2 will be really
|
||||
verbose (probably only useful if you also have a
|
||||
serial console attached to the system).
|
||||
|
||||
migration_factor=
|
||||
[KNL,SMP] multiply/divide migration costs by a factor
|
||||
Format=<percent>
|
||||
This debug option can be used to proportionally
|
||||
increase or decrease the auto-detected migration
|
||||
costs for all entries of the migration matrix.
|
||||
E.g. migration_factor=150 will increase migration
|
||||
costs by 50%. (and thus the scheduler will be less
|
||||
eager migrating cache-hot tasks)
|
||||
migration_factor=80 will decrease migration costs
|
||||
by 20%. (thus the scheduler will be more eager to
|
||||
migrate tasks)
|
||||
|
||||
WARNING: using the wrong values here can break
|
||||
scheduler performance, so it's only for scheduler
|
||||
development purposes, not production environments.
|
||||
|
||||
mousedev.tap_time=
|
||||
[MOUSE] Maximum time between finger touching and
|
||||
leaving touchpad surface for touch to be considered
|
||||
@ -902,6 +962,14 @@ running once the system is up.
|
||||
nfsroot= [NFS] nfs root filesystem for disk-less boxes.
|
||||
See Documentation/nfsroot.txt.
|
||||
|
||||
nfs.callback_tcpport=
|
||||
[NFS] set the TCP port on which the NFSv4 callback
|
||||
channel should listen.
|
||||
|
||||
nfs.idmap_cache_timeout=
|
||||
[NFS] set the maximum lifetime for idmapper cache
|
||||
entries.
|
||||
|
||||
nmi_watchdog= [KNL,BUGS=IA-32] Debugging features for SMP kernels
|
||||
|
||||
no387 [BUGS=IA-32] Tells the kernel to use the 387 maths
|
||||
@ -982,6 +1050,8 @@ running once the system is up.
|
||||
|
||||
nowb [ARM]
|
||||
|
||||
nr_uarts= [SERIAL] maximum number of UARTs to be registered.
|
||||
|
||||
opl3= [HW,OSS]
|
||||
Format: <io>
|
||||
|
||||
@ -1160,6 +1230,10 @@ running once the system is up.
|
||||
Limit processor to maximum C-state
|
||||
max_cstate=9 overrides any DMI blacklist limit.
|
||||
|
||||
processor.nocst [HW,ACPI]
|
||||
Ignore the _CST method to determine C-states,
|
||||
instead using the legacy FADT method
|
||||
|
||||
prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
|
||||
before loading.
|
||||
See Documentation/ramdisk.txt.
|
||||
|
@ -56,10 +56,12 @@ A request proceeds in the following manner:
|
||||
(4) request_key() then forks and executes /sbin/request-key with a new session
|
||||
keyring that contains a link to auth key V.
|
||||
|
||||
(5) /sbin/request-key execs an appropriate program to perform the actual
|
||||
(5) /sbin/request-key assumes the authority associated with key U.
|
||||
|
||||
(6) /sbin/request-key execs an appropriate program to perform the actual
|
||||
instantiation.
|
||||
|
||||
(6) The program may want to access another key from A's context (say a
|
||||
(7) The program may want to access another key from A's context (say a
|
||||
Kerberos TGT key). It just requests the appropriate key, and the keyring
|
||||
search notes that the session keyring has auth key V in its bottom level.
|
||||
|
||||
@ -67,19 +69,19 @@ A request proceeds in the following manner:
|
||||
UID, GID, groups and security info of process A as if it was process A,
|
||||
and come up with key W.
|
||||
|
||||
(7) The program then does what it must to get the data with which to
|
||||
(8) The program then does what it must to get the data with which to
|
||||
instantiate key U, using key W as a reference (perhaps it contacts a
|
||||
Kerberos server using the TGT) and then instantiates key U.
|
||||
|
||||
(8) Upon instantiating key U, auth key V is automatically revoked so that it
|
||||
(9) Upon instantiating key U, auth key V is automatically revoked so that it
|
||||
may not be used again.
|
||||
|
||||
(9) The program then exits 0 and request_key() deletes key V and returns key
|
||||
(10) The program then exits 0 and request_key() deletes key V and returns key
|
||||
U to the caller.
|
||||
|
||||
This also extends further. If key W (step 5 above) didn't exist, key W would be
|
||||
created uninstantiated, another auth key (X) would be created [as per step 3]
|
||||
and another copy of /sbin/request-key spawned [as per step 4]; but the context
|
||||
This also extends further. If key W (step 7 above) didn't exist, key W would be
|
||||
created uninstantiated, another auth key (X) would be created (as per step 3)
|
||||
and another copy of /sbin/request-key spawned (as per step 4); but the context
|
||||
specified by auth key X will still be process A, as it was in auth key V.
|
||||
|
||||
This is because process A's keyrings can't simply be attached to
|
||||
@ -138,8 +140,8 @@ until one succeeds:
|
||||
|
||||
(3) The process's session keyring is searched.
|
||||
|
||||
(4) If the process has a request_key() authorisation key in its session
|
||||
keyring then:
|
||||
(4) If the process has assumed the authority associated with a request_key()
|
||||
authorisation key then:
|
||||
|
||||
(a) If extant, the calling process's thread keyring is searched.
|
||||
|
||||
|
@ -308,6 +308,8 @@ process making the call:
|
||||
KEY_SPEC_USER_KEYRING -4 UID-specific keyring
|
||||
KEY_SPEC_USER_SESSION_KEYRING -5 UID-session keyring
|
||||
KEY_SPEC_GROUP_KEYRING -6 GID-specific keyring
|
||||
KEY_SPEC_REQKEY_AUTH_KEY -7 assumed request_key()
|
||||
authorisation key
|
||||
|
||||
|
||||
The main syscalls are:
|
||||
@ -498,7 +500,11 @@ The keyctl syscall functions are:
|
||||
keyring is full, error ENFILE will result.
|
||||
|
||||
The link procedure checks the nesting of the keyrings, returning ELOOP if
|
||||
it appears to deep or EDEADLK if the link would introduce a cycle.
|
||||
it appears too deep or EDEADLK if the link would introduce a cycle.
|
||||
|
||||
Any links within the keyring to keys that match the new key in terms of
|
||||
type and description will be discarded from the keyring as the new one is
|
||||
added.
|
||||
|
||||
|
||||
(*) Unlink a key or keyring from another keyring:
|
||||
@ -628,6 +634,41 @@ The keyctl syscall functions are:
|
||||
there is one, otherwise the user default session keyring.
|
||||
|
||||
|
||||
(*) Set the timeout on a key.
|
||||
|
||||
long keyctl(KEYCTL_SET_TIMEOUT, key_serial_t key, unsigned timeout);
|
||||
|
||||
This sets or clears the timeout on a key. The timeout can be 0 to clear
|
||||
the timeout or a number of seconds to set the expiry time that far into
|
||||
the future.
|
||||
|
||||
The process must have attribute modification access on a key to set its
|
||||
timeout. Timeouts may not be set with this function on negative, revoked
|
||||
or expired keys.
|
||||
|
||||
|
||||
(*) Assume the authority granted to instantiate a key
|
||||
|
||||
long keyctl(KEYCTL_ASSUME_AUTHORITY, key_serial_t key);
|
||||
|
||||
This assumes or divests the authority required to instantiate the
|
||||
specified key. Authority can only be assumed if the thread has the
|
||||
authorisation key associated with the specified key in its keyrings
|
||||
somewhere.
|
||||
|
||||
Once authority is assumed, searches for keys will also search the
|
||||
requester's keyrings using the requester's security label, UID, GID and
|
||||
groups.
|
||||
|
||||
If the requested authority is unavailable, error EPERM will be returned,
|
||||
likewise if the authority has been revoked because the target key is
|
||||
already instantiated.
|
||||
|
||||
If the specified key is 0, then any assumed authority will be divested.
|
||||
|
||||
The assumed authorititive key is inherited across fork and exec.
|
||||
|
||||
|
||||
===============
|
||||
KERNEL SERVICES
|
||||
===============
|
||||
@ -860,24 +901,6 @@ The structure has a number of fields, some of which are mandatory:
|
||||
It is safe to sleep in this method.
|
||||
|
||||
|
||||
(*) int (*duplicate)(struct key *key, const struct key *source);
|
||||
|
||||
If this type of key can be duplicated, then this method should be
|
||||
provided. It is called to copy the payload attached to the source into the
|
||||
new key. The data length on the new key will have been updated and the
|
||||
quota adjusted already.
|
||||
|
||||
This method will be called with the source key's semaphore read-locked to
|
||||
prevent its payload from being changed, thus RCU constraints need not be
|
||||
applied to the source key.
|
||||
|
||||
This method does not have to lock the destination key in order to attach a
|
||||
payload. The fact that KEY_FLAG_INSTANTIATED is not set in key->flags
|
||||
prevents anything else from gaining access to the key.
|
||||
|
||||
It is safe to sleep in this method.
|
||||
|
||||
|
||||
(*) int (*update)(struct key *key, const void *data, size_t datalen);
|
||||
|
||||
If this type of key can be updated, then this method should be provided.
|
||||
|
@ -411,7 +411,8 @@ int init_module(void)
|
||||
printk("Couldn't find %s to plant kprobe\n", "do_fork");
|
||||
return -1;
|
||||
}
|
||||
if ((ret = register_kprobe(&kp) < 0)) {
|
||||
ret = register_kprobe(&kp);
|
||||
if (ret < 0) {
|
||||
printk("register_kprobe failed, returned %d\n", ret);
|
||||
return -1;
|
||||
}
|
||||
|
@ -3,7 +3,7 @@ How to conserve battery power using laptop-mode
|
||||
|
||||
Document Author: Bart Samwel (bart@samwel.tk)
|
||||
Date created: January 2, 2004
|
||||
Last modified: July 10, 2004
|
||||
Last modified: December 06, 2004
|
||||
|
||||
Introduction
|
||||
------------
|
||||
@ -33,7 +33,7 @@ or anything. Simply install all the files included in this document, and
|
||||
laptop mode will automatically be started when you're on battery. For
|
||||
your convenience, a tarball containing an installer can be downloaded at:
|
||||
|
||||
http://www.xs4all.nl/~bsamwel/laptop_mode/tools
|
||||
http://www.xs4all.nl/~bsamwel/laptop_mode/tools/
|
||||
|
||||
To configure laptop mode, you need to edit the configuration file, which is
|
||||
located in /etc/default/laptop-mode on Debian-based systems, or in
|
||||
@ -357,7 +357,7 @@ MAX_AGE=${MAX_AGE:-'600'}
|
||||
# Read-ahead, in kilobytes
|
||||
READAHEAD=${READAHEAD:-'4096'}
|
||||
|
||||
# Shall we remount journaled fs. with appropiate commit interval? (1=yes)
|
||||
# Shall we remount journaled fs. with appropriate commit interval? (1=yes)
|
||||
DO_REMOUNTS=${DO_REMOUNTS:-'1'}
|
||||
|
||||
# And shall we add the "noatime" option to that as well? (1=yes)
|
||||
@ -912,7 +912,7 @@ void usage()
|
||||
exit(0);
|
||||
}
|
||||
|
||||
int main(int ac, char **av)
|
||||
int main(int argc, char **argv)
|
||||
{
|
||||
int fd;
|
||||
char *disk = 0;
|
||||
|
@ -65,20 +65,3 @@ The default is to disallow mandatory locking. The intention is that
|
||||
mandatory locking only be enabled on a local filesystem as the specific need
|
||||
arises.
|
||||
|
||||
Until an updated version of mount(8) becomes available you may have to apply
|
||||
this patch to the mount sources (based on the version distributed with Rick
|
||||
Faith's util-linux-2.5 package):
|
||||
|
||||
*** mount.c.orig Sat Jun 8 09:14:31 1996
|
||||
--- mount.c Sat Jun 8 09:13:02 1996
|
||||
***************
|
||||
*** 100,105 ****
|
||||
--- 100,107 ----
|
||||
{ "noauto", 0, MS_NOAUTO }, /* Can only be mounted explicitly */
|
||||
{ "user", 0, MS_USER }, /* Allow ordinary user to mount */
|
||||
{ "nouser", 1, MS_USER }, /* Forbid ordinary user to mount */
|
||||
+ { "mand", 0, MS_MANDLOCK }, /* Allow mandatory locks on this FS */
|
||||
+ { "nomand", 1, MS_MANDLOCK }, /* Forbid mandatory locks on this FS */
|
||||
/* add new options here */
|
||||
#ifdef MS_NOSUB
|
||||
{ "sub", 1, MS_NOSUB }, /* allow submounts */
|
||||
|
@ -252,7 +252,7 @@ their names here, but I don't have a list handy. Check the MCA Linux
|
||||
home page (URL below) for a perpetually out-of-date list.
|
||||
|
||||
=====================================================================
|
||||
MCA Linux Home Page: http://glycerine.itsmm.uni.edu/mca/
|
||||
MCA Linux Home Page: http://www.dgmicro.com/mca/
|
||||
|
||||
Christophe Beauregard
|
||||
chrisb@truespectra.com
|
||||
|
@ -51,6 +51,30 @@ superblock can be autodetected and run at boot time.
|
||||
The kernel parameter "raid=partitionable" (or "raid=part") means
|
||||
that all auto-detected arrays are assembled as partitionable.
|
||||
|
||||
Boot time assembly of degraded/dirty arrays
|
||||
-------------------------------------------
|
||||
|
||||
If a raid5 or raid6 array is both dirty and degraded, it could have
|
||||
undetectable data corruption. This is because the fact that it is
|
||||
'dirty' means that the parity cannot be trusted, and the fact that it
|
||||
is degraded means that some datablocks are missing and cannot reliably
|
||||
be reconstructed (due to no parity).
|
||||
|
||||
For this reason, md will normally refuse to start such an array. This
|
||||
requires the sysadmin to take action to explicitly start the array
|
||||
desipite possible corruption. This is normally done with
|
||||
mdadm --assemble --force ....
|
||||
|
||||
This option is not really available if the array has the root
|
||||
filesystem on it. In order to support this booting from such an
|
||||
array, md supports a module parameter "start_dirty_degraded" which,
|
||||
when set to 1, bypassed the checks and will allows dirty degraded
|
||||
arrays to be started.
|
||||
|
||||
So, to boot with a root filesystem of a dirty degraded raid[56], use
|
||||
|
||||
md-mod.start_dirty_degraded=1
|
||||
|
||||
|
||||
Superblock formats
|
||||
------------------
|
||||
@ -141,6 +165,70 @@ All md devices contain:
|
||||
in a fully functional array. If this is not yet known, the file
|
||||
will be empty. If an array is being resized (not currently
|
||||
possible) this will contain the larger of the old and new sizes.
|
||||
Some raid level (RAID1) allow this value to be set while the
|
||||
array is active. This will reconfigure the array. Otherwise
|
||||
it can only be set while assembling an array.
|
||||
|
||||
chunk_size
|
||||
This is the size if bytes for 'chunks' and is only relevant to
|
||||
raid levels that involve striping (1,4,5,6,10). The address space
|
||||
of the array is conceptually divided into chunks and consecutive
|
||||
chunks are striped onto neighbouring devices.
|
||||
The size should be atleast PAGE_SIZE (4k) and should be a power
|
||||
of 2. This can only be set while assembling an array
|
||||
|
||||
component_size
|
||||
For arrays with data redundancy (i.e. not raid0, linear, faulty,
|
||||
multipath), all components must be the same size - or at least
|
||||
there must a size that they all provide space for. This is a key
|
||||
part or the geometry of the array. It is measured in sectors
|
||||
and can be read from here. Writing to this value may resize
|
||||
the array if the personality supports it (raid1, raid5, raid6),
|
||||
and if the component drives are large enough.
|
||||
|
||||
metadata_version
|
||||
This indicates the format that is being used to record metadata
|
||||
about the array. It can be 0.90 (traditional format), 1.0, 1.1,
|
||||
1.2 (newer format in varying locations) or "none" indicating that
|
||||
the kernel isn't managing metadata at all.
|
||||
|
||||
level
|
||||
The raid 'level' for this array. The name will often (but not
|
||||
always) be the same as the name of the module that implements the
|
||||
level. To be auto-loaded the module must have an alias
|
||||
md-$LEVEL e.g. md-raid5
|
||||
This can be written only while the array is being assembled, not
|
||||
after it is started.
|
||||
|
||||
new_dev
|
||||
This file can be written but not read. The value written should
|
||||
be a block device number as major:minor. e.g. 8:0
|
||||
This will cause that device to be attached to the array, if it is
|
||||
available. It will then appear at md/dev-XXX (depending on the
|
||||
name of the device) and further configuration is then possible.
|
||||
|
||||
sync_speed_min
|
||||
sync_speed_max
|
||||
This are similar to /proc/sys/dev/raid/speed_limit_{min,max}
|
||||
however they only apply to the particular array.
|
||||
If no value has been written to these, of if the word 'system'
|
||||
is written, then the system-wide value is used. If a value,
|
||||
in kibibytes-per-second is written, then it is used.
|
||||
When the files are read, they show the currently active value
|
||||
followed by "(local)" or "(system)" depending on whether it is
|
||||
a locally set or system-wide value.
|
||||
|
||||
sync_completed
|
||||
This shows the number of sectors that have been completed of
|
||||
whatever the current sync_action is, followed by the number of
|
||||
sectors in total that could need to be processed. The two
|
||||
numbers are separated by a '/' thus effectively showing one
|
||||
value, a fraction of the process that is complete.
|
||||
|
||||
sync_speed
|
||||
This shows the current actual speed, in K/sec, of the current
|
||||
sync_action. It is averaged over the last 30 seconds.
|
||||
|
||||
|
||||
As component devices are added to an md array, they appear in the 'md'
|
||||
directory as new directories named
|
||||
@ -167,6 +255,38 @@ Each directory contains:
|
||||
of being recoverred to
|
||||
This list make grow in future.
|
||||
|
||||
errors
|
||||
An approximate count of read errors that have been detected on
|
||||
this device but have not caused the device to be evicted from
|
||||
the array (either because they were corrected or because they
|
||||
happened while the array was read-only). When using version-1
|
||||
metadata, this value persists across restarts of the array.
|
||||
|
||||
This value can be written while assembling an array thus
|
||||
providing an ongoing count for arrays with metadata managed by
|
||||
userspace.
|
||||
|
||||
slot
|
||||
This gives the role that the device has in the array. It will
|
||||
either be 'none' if the device is not active in the array
|
||||
(i.e. is a spare or has failed) or an integer less than the
|
||||
'raid_disks' number for the array indicating which possition
|
||||
it currently fills. This can only be set while assembling an
|
||||
array. A device for which this is set is assumed to be working.
|
||||
|
||||
offset
|
||||
This gives the location in the device (in sectors from the
|
||||
start) where data from the array will be stored. Any part of
|
||||
the device before this offset us not touched, unless it is
|
||||
used for storing metadata (Formats 1.1 and 1.2).
|
||||
|
||||
size
|
||||
The amount of the device, after the offset, that can be used
|
||||
for storage of data. This will normally be the same as the
|
||||
component_size. This can be written while assembling an
|
||||
array. If a value less than the current component_size is
|
||||
written, component_size will be reduced to this value.
|
||||
|
||||
|
||||
An active md device will also contain and entry for each active device
|
||||
in the array. These are named
|
||||
|
135
Documentation/mutex-design.txt
Normal file
135
Documentation/mutex-design.txt
Normal file
@ -0,0 +1,135 @@
|
||||
Generic Mutex Subsystem
|
||||
|
||||
started by Ingo Molnar <mingo@redhat.com>
|
||||
|
||||
"Why on earth do we need a new mutex subsystem, and what's wrong
|
||||
with semaphores?"
|
||||
|
||||
firstly, there's nothing wrong with semaphores. But if the simpler
|
||||
mutex semantics are sufficient for your code, then there are a couple
|
||||
of advantages of mutexes:
|
||||
|
||||
- 'struct mutex' is smaller on most architectures: .e.g on x86,
|
||||
'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes.
|
||||
A smaller structure size means less RAM footprint, and better
|
||||
CPU-cache utilization.
|
||||
|
||||
- tighter code. On x86 i get the following .text sizes when
|
||||
switching all mutex-alike semaphores in the kernel to the mutex
|
||||
subsystem:
|
||||
|
||||
text data bss dec hex filename
|
||||
3280380 868188 396860 4545428 455b94 vmlinux-semaphore
|
||||
3255329 865296 396732 4517357 44eded vmlinux-mutex
|
||||
|
||||
that's 25051 bytes of code saved, or a 0.76% win - off the hottest
|
||||
codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%)
|
||||
Smaller code means better icache footprint, which is one of the
|
||||
major optimization goals in the Linux kernel currently.
|
||||
|
||||
- the mutex subsystem is slightly faster and has better scalability for
|
||||
contended workloads. On an 8-way x86 system, running a mutex-based
|
||||
kernel and testing creat+unlink+close (of separate, per-task files)
|
||||
in /tmp with 16 parallel tasks, the average number of ops/sec is:
|
||||
|
||||
Semaphores: Mutexes:
|
||||
|
||||
$ ./test-mutex V 16 10 $ ./test-mutex V 16 10
|
||||
8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
|
||||
checking VFS performance. checking VFS performance.
|
||||
avg loops/sec: 34713 avg loops/sec: 84153
|
||||
CPU utilization: 63% CPU utilization: 22%
|
||||
|
||||
i.e. in this workload, the mutex based kernel was 2.4 times faster
|
||||
than the semaphore based kernel, _and_ it also had 2.8 times less CPU
|
||||
utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
|
||||
performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
|
||||
performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
|
||||
more efficient.)
|
||||
|
||||
the scalability difference is visible even on a 2-way P4 HT box:
|
||||
|
||||
Semaphores: Mutexes:
|
||||
|
||||
$ ./test-mutex V 16 10 $ ./test-mutex V 16 10
|
||||
4 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
|
||||
checking VFS performance. checking VFS performance.
|
||||
avg loops/sec: 127659 avg loops/sec: 181082
|
||||
CPU utilization: 100% CPU utilization: 34%
|
||||
|
||||
(the straight performance advantage of mutexes is 41%, the per-cycle
|
||||
efficiency of mutexes is 4.1 times better.)
|
||||
|
||||
- there are no fastpath tradeoffs, the mutex fastpath is just as tight
|
||||
as the semaphore fastpath. On x86, the locking fastpath is 2
|
||||
instructions:
|
||||
|
||||
c0377ccb <mutex_lock>:
|
||||
c0377ccb: f0 ff 08 lock decl (%eax)
|
||||
c0377cce: 78 0e js c0377cde <.text.lock.mutex>
|
||||
c0377cd0: c3 ret
|
||||
|
||||
the unlocking fastpath is equally tight:
|
||||
|
||||
c0377cd1 <mutex_unlock>:
|
||||
c0377cd1: f0 ff 00 lock incl (%eax)
|
||||
c0377cd4: 7e 0f jle c0377ce5 <.text.lock.mutex+0x7>
|
||||
c0377cd6: c3 ret
|
||||
|
||||
- 'struct mutex' semantics are well-defined and are enforced if
|
||||
CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have
|
||||
virtually no debugging code or instrumentation. The mutex subsystem
|
||||
checks and enforces the following rules:
|
||||
|
||||
* - only one task can hold the mutex at a time
|
||||
* - only the owner can unlock the mutex
|
||||
* - multiple unlocks are not permitted
|
||||
* - recursive locking is not permitted
|
||||
* - a mutex object must be initialized via the API
|
||||
* - a mutex object must not be initialized via memset or copying
|
||||
* - task may not exit with mutex held
|
||||
* - memory areas where held locks reside must not be freed
|
||||
* - held mutexes must not be reinitialized
|
||||
* - mutexes may not be used in irq contexts
|
||||
|
||||
furthermore, there are also convenience features in the debugging
|
||||
code:
|
||||
|
||||
* - uses symbolic names of mutexes, whenever they are printed in debug output
|
||||
* - point-of-acquire tracking, symbolic lookup of function names
|
||||
* - list of all locks held in the system, printout of them
|
||||
* - owner tracking
|
||||
* - detects self-recursing locks and prints out all relevant info
|
||||
* - detects multi-task circular deadlocks and prints out all affected
|
||||
* locks and tasks (and only those tasks)
|
||||
|
||||
Disadvantages
|
||||
-------------
|
||||
|
||||
The stricter mutex API means you cannot use mutexes the same way you
|
||||
can use semaphores: e.g. they cannot be used from an interrupt context,
|
||||
nor can they be unlocked from a different context that which acquired
|
||||
it. [ I'm not aware of any other (e.g. performance) disadvantages from
|
||||
using mutexes at the moment, please let me know if you find any. ]
|
||||
|
||||
Implementation of mutexes
|
||||
-------------------------
|
||||
|
||||
'struct mutex' is the new mutex type, defined in include/linux/mutex.h
|
||||
and implemented in kernel/mutex.c. It is a counter-based mutex with a
|
||||
spinlock and a wait-list. The counter has 3 states: 1 for "unlocked",
|
||||
0 for "locked" and negative numbers (usually -1) for "locked, potential
|
||||
waiters queued".
|
||||
|
||||
the APIs of 'struct mutex' have been streamlined:
|
||||
|
||||
DEFINE_MUTEX(name);
|
||||
|
||||
mutex_init(mutex);
|
||||
|
||||
void mutex_lock(struct mutex *lock);
|
||||
int mutex_lock_interruptible(struct mutex *lock);
|
||||
int mutex_trylock(struct mutex *lock);
|
||||
void mutex_unlock(struct mutex *lock);
|
||||
int mutex_is_locked(struct mutex *lock);
|
||||
|
@ -945,7 +945,6 @@ bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
|
||||
collisions:0 txqueuelen:0
|
||||
|
||||
eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
|
||||
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
|
||||
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
|
||||
RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
|
||||
TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
|
||||
@ -953,7 +952,6 @@ eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
|
||||
Interrupt:10 Base address:0x1080
|
||||
|
||||
eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
|
||||
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
|
||||
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
|
||||
RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
|
||||
TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
|
||||
|
@ -1,7 +1,4 @@
|
||||
Documents about softnet driver issues in general can be found
|
||||
at:
|
||||
|
||||
http://www.firstfloor.org/~andi/softnet/
|
||||
Document about softnet driver issues
|
||||
|
||||
Transmit path guidelines:
|
||||
|
||||
|
72
Documentation/networking/gianfar.txt
Normal file
72
Documentation/networking/gianfar.txt
Normal file
@ -0,0 +1,72 @@
|
||||
The Gianfar Ethernet Driver
|
||||
Sysfs File description
|
||||
|
||||
Author: Andy Fleming <afleming@freescale.com>
|
||||
Updated: 2005-07-28
|
||||
|
||||
SYSFS
|
||||
|
||||
Several of the features of the gianfar driver are controlled
|
||||
through sysfs files. These are:
|
||||
|
||||
bd_stash:
|
||||
To stash RX Buffer Descriptors in the L2, echo 'on' or '1' to
|
||||
bd_stash, echo 'off' or '0' to disable
|
||||
|
||||
rx_stash_len:
|
||||
To stash the first n bytes of the packet in L2, echo the number
|
||||
of bytes to buf_stash_len. echo 0 to disable.
|
||||
|
||||
WARNING: You could really screw these up if you set them too low or high!
|
||||
fifo_threshold:
|
||||
To change the number of bytes the controller needs in the
|
||||
fifo before it starts transmission, echo the number of bytes to
|
||||
fifo_thresh. Range should be 0-511.
|
||||
|
||||
fifo_starve:
|
||||
When the FIFO has less than this many bytes during a transmit, it
|
||||
enters starve mode, and increases the priority of TX memory
|
||||
transactions. To change, echo the number of bytes to
|
||||
fifo_starve. Range should be 0-511.
|
||||
|
||||
fifo_starve_off:
|
||||
Once in starve mode, the FIFO remains there until it has this
|
||||
many bytes. To change, echo the number of bytes to
|
||||
fifo_starve_off. Range should be 0-511.
|
||||
|
||||
CHECKSUM OFFLOADING
|
||||
|
||||
The eTSEC controller (first included in parts from late 2005 like
|
||||
the 8548) has the ability to perform TCP, UDP, and IP checksums
|
||||
in hardware. The Linux kernel only offloads the TCP and UDP
|
||||
checksums (and always performs the pseudo header checksums), so
|
||||
the driver only supports checksumming for TCP/IP and UDP/IP
|
||||
packets. Use ethtool to enable or disable this feature for RX
|
||||
and TX.
|
||||
|
||||
VLAN
|
||||
|
||||
In order to use VLAN, please consult Linux documentation on
|
||||
configuring VLANs. The gianfar driver supports hardware insertion and
|
||||
extraction of VLAN headers, but not filtering. Filtering will be
|
||||
done by the kernel.
|
||||
|
||||
MULTICASTING
|
||||
|
||||
The gianfar driver supports using the group hash table on the
|
||||
TSEC (and the extended hash table on the eTSEC) for multicast
|
||||
filtering. On the eTSEC, the exact-match MAC registers are used
|
||||
before the hash tables. See Linux documentation on how to join
|
||||
multicast groups.
|
||||
|
||||
PADDING
|
||||
|
||||
The gianfar driver supports padding received frames with 2 bytes
|
||||
to align the IP header to a 16-byte boundary, when supported by
|
||||
hardware.
|
||||
|
||||
ETHTOOL
|
||||
|
||||
The gianfar driver supports the use of ethtool for many
|
||||
configuration options. You must run ethtool only on currently
|
||||
open interfaces. See ethtool documentation for details.
|
@ -693,13 +693,7 @@ static int enslave(char *master_ifname, char *slave_ifname)
|
||||
/* Older bonding versions would panic if the slave has no IP
|
||||
* address, so get the IP setting from the master.
|
||||
*/
|
||||
res = set_if_addr(master_ifname, slave_ifname);
|
||||
if (res) {
|
||||
fprintf(stderr,
|
||||
"Slave '%s': Error: set address failed\n",
|
||||
slave_ifname);
|
||||
return res;
|
||||
}
|
||||
set_if_addr(master_ifname, slave_ifname);
|
||||
} else {
|
||||
res = clear_if_addr(slave_ifname);
|
||||
if (res) {
|
||||
@ -1085,7 +1079,6 @@ static int set_if_addr(char *master_ifname, char *slave_ifname)
|
||||
slave_ifname, ifra[i].req_name,
|
||||
strerror(saved_errno));
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
ipaddr = ifr.ifr_addr.sa_data;
|
||||
|
@ -46,6 +46,29 @@ ipfrag_secret_interval - INTEGER
|
||||
for the hash secret) for IP fragments.
|
||||
Default: 600
|
||||
|
||||
ipfrag_max_dist - INTEGER
|
||||
ipfrag_max_dist is a non-negative integer value which defines the
|
||||
maximum "disorder" which is allowed among fragments which share a
|
||||
common IP source address. Note that reordering of packets is
|
||||
not unusual, but if a large number of fragments arrive from a source
|
||||
IP address while a particular fragment queue remains incomplete, it
|
||||
probably indicates that one or more fragments belonging to that queue
|
||||
have been lost. When ipfrag_max_dist is positive, an additional check
|
||||
is done on fragments before they are added to a reassembly queue - if
|
||||
ipfrag_max_dist (or more) fragments have arrived from a particular IP
|
||||
address between additions to any IP fragment queue using that source
|
||||
address, it's presumed that one or more fragments in the queue are
|
||||
lost. The existing fragment queue will be dropped, and a new one
|
||||
started. An ipfrag_max_dist value of zero disables this check.
|
||||
|
||||
Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can
|
||||
result in unnecessarily dropping fragment queues when normal
|
||||
reordering of packets occurs, which could lead to poor application
|
||||
performance. Using a very large value, e.g. 50000, increases the
|
||||
likelihood of incorrectly reassembling IP fragments that originate
|
||||
from different IP datagrams, which could result in data corruption.
|
||||
Default: 64
|
||||
|
||||
INET peer storage:
|
||||
|
||||
inet_peer_threshold - INTEGER
|
||||
|
@ -22,7 +22,7 @@ The features and limitations of this driver are as follows:
|
||||
- All variants of Interphase ATM PCI (i)Chip adapter cards are supported,
|
||||
including x575 (OC3, control memory 128K , 512K and packet memory 128K,
|
||||
512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See
|
||||
http://www.iphase.com/products/ClassSheet.cfm?ClassID=ATM
|
||||
http://www.iphase.com/site/iphase-web/?epi_menuItemID=e196f04b4b3b40502f150882e21046a0
|
||||
for details.
|
||||
- Only x86 platforms are supported.
|
||||
- SMP is supported.
|
||||
|
@ -3,12 +3,8 @@ of the IrDA Utilities. More detailed information about these and associated
|
||||
programs can be found on http://irda.sourceforge.net/
|
||||
|
||||
For more information about how to use the IrDA protocol stack, see the
|
||||
Linux Infared HOWTO (http://www.tuxmobil.org/Infrared-HOWTO/Infrared-HOWTO.html)
|
||||
by Werner Heuser <wehe@tuxmobil.org>
|
||||
Linux Infrared HOWTO by Werner Heuser <wehe@tuxmobil.org>:
|
||||
<http://www.tuxmobil.org/Infrared-HOWTO/Infrared-HOWTO.html>
|
||||
|
||||
There is an active mailing list for discussing Linux-IrDA matters called
|
||||
irda-users@lists.sourceforge.net
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -29,8 +29,7 @@ with nondefault parameters, they can be edited in
|
||||
will find them all.
|
||||
|
||||
Information on card services is available at:
|
||||
ftp://hyper.stanford.edu/pub/pcmcia/doc
|
||||
http://hyper.stanford.edu/HyperNews/get/pcmcia/home.html
|
||||
http://pcmcia-cs.sourceforge.net/
|
||||
|
||||
|
||||
Card services user programs are still required for PCMCIA devices.
|
||||
|
@ -91,7 +91,7 @@ To use the driver as a module, proceed as follows:
|
||||
with (M)
|
||||
5. Execute the command "make modules".
|
||||
6. Execute the command "make modules_install".
|
||||
The appropiate modules will be installed.
|
||||
The appropriate modules will be installed.
|
||||
7. Reboot your system.
|
||||
|
||||
|
||||
@ -245,7 +245,7 @@ Default: Both
|
||||
This parameters is only relevant if auto-negotiation for this port is
|
||||
not set to "Sense". If auto-negotiation is set to "On", all three values
|
||||
are possible. If it is set to "Off", only "Full" and "Half" are allowed.
|
||||
This parameter is usefull if your link partner does not support all
|
||||
This parameter is useful if your link partner does not support all
|
||||
possible combinations.
|
||||
|
||||
Flow Control
|
||||
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user