forked from Minki/linux
Merge branch 'linus' into core/generic-dma-coherent
Conflicts: kernel/Makefile Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit is contained in:
commit
f6dc8ccaab
12
.gitignore
vendored
12
.gitignore
vendored
@ -3,6 +3,10 @@
|
||||
# subdirectories here. Add them in the ".gitignore" file
|
||||
# in that subdirectory instead.
|
||||
#
|
||||
# NOTE! Please use 'git-ls-files -i --exclude-standard'
|
||||
# command after changing this file, to see if there are
|
||||
# any tracked files which get ignored after the change.
|
||||
#
|
||||
# Normal rules
|
||||
#
|
||||
.*
|
||||
@ -18,19 +22,21 @@
|
||||
*.lst
|
||||
*.symtypes
|
||||
*.order
|
||||
*.elf
|
||||
*.bin
|
||||
*.gz
|
||||
|
||||
#
|
||||
# Top-level generic files
|
||||
#
|
||||
tags
|
||||
TAGS
|
||||
vmlinux*
|
||||
!vmlinux.lds.S
|
||||
!vmlinux.lds.h
|
||||
vmlinux
|
||||
System.map
|
||||
Module.markers
|
||||
Module.symvers
|
||||
!.gitignore
|
||||
!.mailmap
|
||||
|
||||
#
|
||||
# Generated include files
|
||||
|
5
CREDITS
5
CREDITS
@ -2611,8 +2611,9 @@ S: Perth, Western Australia
|
||||
S: Australia
|
||||
|
||||
N: Miguel Ojeda Sandonis
|
||||
E: maxextreme@gmail.com
|
||||
W: http://maxextreme.googlepages.com/
|
||||
E: miguel.ojeda.sandonis@gmail.com
|
||||
W: http://miguelojeda.es
|
||||
W: http://jair.lab.fi.uva.es/~migojed/
|
||||
D: Author of the ks0108, cfag12864b and cfag12864bfb auxiliary display drivers.
|
||||
D: Maintainer of the auxiliary display drivers tree (drivers/auxdisplay/*)
|
||||
S: C/ Mieses 20, 9-B
|
||||
|
@ -26,3 +26,37 @@ Description:
|
||||
I/O statistics of partition <part>. The format is the
|
||||
same as the above-written /sys/block/<disk>/stat
|
||||
format.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/format
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Metadata format for integrity capable block device.
|
||||
E.g. T10-DIF-TYPE1-CRC.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/read_verify
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Indicates whether the block layer should verify the
|
||||
integrity of read requests serviced by devices that
|
||||
support sending integrity metadata.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/tag_size
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Number of bytes of integrity tag space available per
|
||||
512 bytes of data.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/integrity/write_generate
|
||||
Date: June 2008
|
||||
Contact: Martin K. Petersen <martin.petersen@oracle.com>
|
||||
Description:
|
||||
Indicates whether the block layer should automatically
|
||||
generate checksums for write requests bound for
|
||||
devices that support receiving integrity metadata.
|
||||
|
35
Documentation/ABI/testing/sysfs-bus-css
Normal file
35
Documentation/ABI/testing/sysfs-bus-css
Normal file
@ -0,0 +1,35 @@
|
||||
What: /sys/bus/css/devices/.../type
|
||||
Date: March 2008
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the subchannel type, as reported by the hardware.
|
||||
This attribute is present for all subchannel types.
|
||||
|
||||
What: /sys/bus/css/devices/.../modalias
|
||||
Date: March 2008
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the module alias as reported with uevents.
|
||||
It is of the format css:t<type> and present for all
|
||||
subchannel types.
|
||||
|
||||
What: /sys/bus/css/drivers/io_subchannel/.../chpids
|
||||
Date: December 2002
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the ids of the channel paths used by this
|
||||
subchannel, as reported by the channel subsystem
|
||||
during subchannel recognition.
|
||||
Note: This is an I/O-subchannel specific attribute.
|
||||
Users: s390-tools, HAL
|
||||
|
||||
What: /sys/bus/css/drivers/io_subchannel/.../pimpampom
|
||||
Date: December 2002
|
||||
Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
|
||||
linux-s390@vger.kernel.org
|
||||
Description: Contains the PIM/PAM/POM values, as reported by the
|
||||
channel subsystem when last queried by the common I/O
|
||||
layer (this implies that this attribute is not neccessarily
|
||||
in sync with the values current in the channel subsystem).
|
||||
Note: This is an I/O-subchannel specific attribute.
|
||||
Users: s390-tools, HAL
|
@ -29,46 +29,46 @@ Description:
|
||||
|
||||
$ cd /sys/firmware/acpi/interrupts
|
||||
$ grep . *
|
||||
error:0
|
||||
ff_gbl_lock:0
|
||||
ff_pmtimer:0
|
||||
ff_pwr_btn:0
|
||||
ff_rt_clk:0
|
||||
ff_slp_btn:0
|
||||
gpe00:0
|
||||
gpe01:0
|
||||
gpe02:0
|
||||
gpe03:0
|
||||
gpe04:0
|
||||
gpe05:0
|
||||
gpe06:0
|
||||
gpe07:0
|
||||
gpe08:0
|
||||
gpe09:174
|
||||
gpe0A:0
|
||||
gpe0B:0
|
||||
gpe0C:0
|
||||
gpe0D:0
|
||||
gpe0E:0
|
||||
gpe0F:0
|
||||
gpe10:0
|
||||
gpe11:60
|
||||
gpe12:0
|
||||
gpe13:0
|
||||
gpe14:0
|
||||
gpe15:0
|
||||
gpe16:0
|
||||
gpe17:0
|
||||
gpe18:0
|
||||
gpe19:7
|
||||
gpe1A:0
|
||||
gpe1B:0
|
||||
gpe1C:0
|
||||
gpe1D:0
|
||||
gpe1E:0
|
||||
gpe1F:0
|
||||
gpe_all:241
|
||||
sci:241
|
||||
error: 0
|
||||
ff_gbl_lock: 0 enable
|
||||
ff_pmtimer: 0 invalid
|
||||
ff_pwr_btn: 0 enable
|
||||
ff_rt_clk: 2 disable
|
||||
ff_slp_btn: 0 invalid
|
||||
gpe00: 0 invalid
|
||||
gpe01: 0 enable
|
||||
gpe02: 108 enable
|
||||
gpe03: 0 invalid
|
||||
gpe04: 0 invalid
|
||||
gpe05: 0 invalid
|
||||
gpe06: 0 enable
|
||||
gpe07: 0 enable
|
||||
gpe08: 0 invalid
|
||||
gpe09: 0 invalid
|
||||
gpe0A: 0 invalid
|
||||
gpe0B: 0 invalid
|
||||
gpe0C: 0 invalid
|
||||
gpe0D: 0 invalid
|
||||
gpe0E: 0 invalid
|
||||
gpe0F: 0 invalid
|
||||
gpe10: 0 invalid
|
||||
gpe11: 0 invalid
|
||||
gpe12: 0 invalid
|
||||
gpe13: 0 invalid
|
||||
gpe14: 0 invalid
|
||||
gpe15: 0 invalid
|
||||
gpe16: 0 invalid
|
||||
gpe17: 1084 enable
|
||||
gpe18: 0 enable
|
||||
gpe19: 0 invalid
|
||||
gpe1A: 0 invalid
|
||||
gpe1B: 0 invalid
|
||||
gpe1C: 0 invalid
|
||||
gpe1D: 0 invalid
|
||||
gpe1E: 0 invalid
|
||||
gpe1F: 0 invalid
|
||||
gpe_all: 1192
|
||||
sci: 1194
|
||||
|
||||
sci - The total number of times the ACPI SCI
|
||||
has claimed an interrupt.
|
||||
@ -89,6 +89,13 @@ Description:
|
||||
|
||||
error - an interrupt that can't be accounted for above.
|
||||
|
||||
invalid: it's either a wakeup GPE or a GPE/Fixed Event that
|
||||
doesn't have an event handler.
|
||||
|
||||
disable: the GPE/Fixed Event is valid but disabled.
|
||||
|
||||
enable: the GPE/Fixed Event is valid and enabled.
|
||||
|
||||
Root has permission to clear any of these counters. Eg.
|
||||
# echo 0 > gpe11
|
||||
|
||||
@ -97,3 +104,43 @@ Description:
|
||||
|
||||
None of these counters has an effect on the function
|
||||
of the system, they are simply statistics.
|
||||
|
||||
Besides this, user can also write specific strings to these files
|
||||
to enable/disable/clear ACPI interrupts in user space, which can be
|
||||
used to debug some ACPI interrupt storm issues.
|
||||
|
||||
Note that only writting to VALID GPE/Fixed Event is allowed,
|
||||
i.e. user can only change the status of runtime GPE and
|
||||
Fixed Event with event handler installed.
|
||||
|
||||
Let's take power button fixed event for example, please kill acpid
|
||||
and other user space applications so that the machine won't shutdown
|
||||
when pressing the power button.
|
||||
# cat ff_pwr_btn
|
||||
0
|
||||
# press the power button for 3 times;
|
||||
# cat ff_pwr_btn
|
||||
3
|
||||
# echo disable > ff_pwr_btn
|
||||
# cat ff_pwr_btn
|
||||
disable
|
||||
# press the power button for 3 times;
|
||||
# cat ff_pwr_btn
|
||||
disable
|
||||
# echo enable > ff_pwr_btn
|
||||
# cat ff_pwr_btn
|
||||
4
|
||||
/*
|
||||
* this is because the status bit is set even if the enable bit is cleared,
|
||||
* and it triggers an ACPI fixed event when the enable bit is set again
|
||||
*/
|
||||
# press the power button for 3 times;
|
||||
# cat ff_pwr_btn
|
||||
7
|
||||
# echo disable > ff_pwr_btn
|
||||
# press the power button for 3 times;
|
||||
# echo clear > ff_pwr_btn /* clear the status bit */
|
||||
# echo disable > ff_pwr_btn
|
||||
# cat ff_pwr_btn
|
||||
7
|
||||
|
||||
|
71
Documentation/ABI/testing/sysfs-firmware-memmap
Normal file
71
Documentation/ABI/testing/sysfs-firmware-memmap
Normal file
@ -0,0 +1,71 @@
|
||||
What: /sys/firmware/memmap/
|
||||
Date: June 2008
|
||||
Contact: Bernhard Walle <bwalle@suse.de>
|
||||
Description:
|
||||
On all platforms, the firmware provides a memory map which the
|
||||
kernel reads. The resources from that memory map are registered
|
||||
in the kernel resource tree and exposed to userspace via
|
||||
/proc/iomem (together with other resources).
|
||||
|
||||
However, on most architectures that firmware-provided memory
|
||||
map is modified afterwards by the kernel itself, either because
|
||||
the kernel merges that memory map with other information or
|
||||
just because the user overwrites that memory map via command
|
||||
line.
|
||||
|
||||
kexec needs the raw firmware-provided memory map to setup the
|
||||
parameter segment of the kernel that should be booted with
|
||||
kexec. Also, the raw memory map is useful for debugging. For
|
||||
that reason, /sys/firmware/memmap is an interface that provides
|
||||
the raw memory map to userspace.
|
||||
|
||||
The structure is as follows: Under /sys/firmware/memmap there
|
||||
are subdirectories with the number of the entry as their name:
|
||||
|
||||
/sys/firmware/memmap/0
|
||||
/sys/firmware/memmap/1
|
||||
/sys/firmware/memmap/2
|
||||
/sys/firmware/memmap/3
|
||||
...
|
||||
|
||||
The maximum depends on the number of memory map entries provided
|
||||
by the firmware. The order is just the order that the firmware
|
||||
provides.
|
||||
|
||||
Each directory contains three files:
|
||||
|
||||
start : The start address (as hexadecimal number with the
|
||||
'0x' prefix).
|
||||
end : The end address, inclusive (regardless whether the
|
||||
firmware provides inclusive or exclusive ranges).
|
||||
type : Type of the entry as string. See below for a list of
|
||||
valid types.
|
||||
|
||||
So, for example:
|
||||
|
||||
/sys/firmware/memmap/0/start
|
||||
/sys/firmware/memmap/0/end
|
||||
/sys/firmware/memmap/0/type
|
||||
/sys/firmware/memmap/1/start
|
||||
...
|
||||
|
||||
Currently following types exist:
|
||||
|
||||
- System RAM
|
||||
- ACPI Tables
|
||||
- ACPI Non-volatile Storage
|
||||
- reserved
|
||||
|
||||
Following shell snippet can be used to display that memory
|
||||
map in a human-readable format:
|
||||
|
||||
-------------------- 8< ----------------------------------------
|
||||
#!/bin/bash
|
||||
cd /sys/firmware/memmap
|
||||
for dir in * ; do
|
||||
start=$(cat $dir/start)
|
||||
end=$(cat $dir/end)
|
||||
type=$(cat $dir/type)
|
||||
printf "%016x-%016x (%s)\n" $start $[ $end +1] "$type"
|
||||
done
|
||||
-------------------- >8 ----------------------------------------
|
@ -377,7 +377,7 @@ Bug Reporting
|
||||
bugzilla.kernel.org is where the Linux kernel developers track kernel
|
||||
bugs. Users are encouraged to report all bugs that they find in this
|
||||
tool. For details on how to use the kernel bugzilla, please see:
|
||||
http://test.kernel.org/bugzilla/faq.html
|
||||
http://bugzilla.kernel.org/page.cgi?id=faq.html
|
||||
|
||||
The file REPORTING-BUGS in the main kernel source directory has a good
|
||||
template for how to report a possible kernel bug, and details what kind
|
||||
|
@ -1,17 +1,26 @@
|
||||
ChangeLog:
|
||||
Started by Ingo Molnar <mingo@redhat.com>
|
||||
Update by Max Krasnyansky <maxk@qualcomm.com>
|
||||
|
||||
SMP IRQ affinity, started by Ingo Molnar <mingo@redhat.com>
|
||||
|
||||
SMP IRQ affinity
|
||||
|
||||
/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
|
||||
for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
|
||||
to turn off all CPUs, and if an IRQ controller does not support IRQ
|
||||
affinity then the value will not change from the default 0xffffffff.
|
||||
|
||||
Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
|
||||
the IRQ to CPU4-7 (this is an 8-CPU SMP box):
|
||||
/proc/irq/default_smp_affinity specifies default affinity mask that applies
|
||||
to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask
|
||||
will be set to the default mask. It can then be changed as described above.
|
||||
Default mask is 0xffffffff.
|
||||
|
||||
Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
|
||||
it to CPU4-7 (this is an 8-CPU SMP box):
|
||||
|
||||
[root@moon 44]# cd /proc/irq/44
|
||||
[root@moon 44]# cat smp_affinity
|
||||
ffffffff
|
||||
|
||||
[root@moon 44]# echo 0f > smp_affinity
|
||||
[root@moon 44]# cat smp_affinity
|
||||
0000000f
|
||||
@ -21,17 +30,27 @@ PING hell (195.4.7.3): 56 data bytes
|
||||
--- hell ping statistics ---
|
||||
6029 packets transmitted, 6027 packets received, 0% packet loss
|
||||
round-trip min/avg/max = 0.1/0.1/0.4 ms
|
||||
[root@moon 44]# cat /proc/interrupts | grep 44:
|
||||
44: 0 1785 1785 1783 1783 1
|
||||
1 0 IO-APIC-level eth1
|
||||
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
|
||||
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
|
||||
44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1
|
||||
|
||||
As can be seen from the line above IRQ44 was delivered only to the first four
|
||||
processors (0-3).
|
||||
Now lets restrict that IRQ to CPU(4-7).
|
||||
|
||||
[root@moon 44]# echo f0 > smp_affinity
|
||||
[root@moon 44]# cat smp_affinity
|
||||
000000f0
|
||||
[root@moon 44]# ping -f h
|
||||
PING hell (195.4.7.3): 56 data bytes
|
||||
..
|
||||
--- hell ping statistics ---
|
||||
2779 packets transmitted, 2777 packets received, 0% packet loss
|
||||
round-trip min/avg/max = 0.1/0.5/585.4 ms
|
||||
[root@moon 44]# cat /proc/interrupts | grep 44:
|
||||
44: 1068 1785 1785 1784 1784 1069 1070 1069 IO-APIC-level eth1
|
||||
[root@moon 44]#
|
||||
[root@moon 44]# cat /proc/interrupts | 'CPU\|44:'
|
||||
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
|
||||
44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1
|
||||
|
||||
This time around IRQ44 was delivered only to the last four processors.
|
||||
i.e counters for the CPU0-3 did not change.
|
||||
|
||||
|
@ -93,6 +93,9 @@ Since NMI handlers disable preemption, synchronize_sched() is guaranteed
|
||||
not to return until all ongoing NMI handlers exit. It is therefore safe
|
||||
to free up the handler's data as soon as synchronize_sched() returns.
|
||||
|
||||
Important note: for this to work, the architecture in question must
|
||||
invoke irq_enter() and irq_exit() on NMI entry and exit, respectively.
|
||||
|
||||
|
||||
Answer to Quick Quiz
|
||||
|
||||
|
@ -52,6 +52,10 @@ of each iteration. Unfortunately, chaotic relaxation requires highly
|
||||
structured data, such as the matrices used in scientific programs, and
|
||||
is thus inapplicable to most data structures in operating-system kernels.
|
||||
|
||||
In 1992, Henry (now Alexia) Massalin completed a dissertation advising
|
||||
parallel programmers to defer processing when feasible to simplify
|
||||
synchronization. RCU makes extremely heavy use of this advice.
|
||||
|
||||
In 1993, Jacobson [Jacobson93] verbally described what is perhaps the
|
||||
simplest deferred-free technique: simply waiting a fixed amount of time
|
||||
before freeing blocks awaiting deferred free. Jacobson did not describe
|
||||
@ -138,6 +142,13 @@ blocking in read-side critical sections appeared [PaulEMcKenney2006c],
|
||||
Robert Olsson described an RCU-protected trie-hash combination
|
||||
[RobertOlsson2006a].
|
||||
|
||||
2007 saw the journal version of the award-winning RCU paper from 2006
|
||||
[ThomasEHart2007a], as well as a paper demonstrating use of Promela
|
||||
and Spin to mechanically verify an optimization to Oleg Nesterov's
|
||||
QRCU [PaulEMcKenney2007QRCUspin], a design document describing
|
||||
preemptible RCU [PaulEMcKenney2007PreemptibleRCU], and the three-part
|
||||
LWN "What is RCU?" series [PaulEMcKenney2007WhatIsRCUFundamentally,
|
||||
PaulEMcKenney2008WhatIsRCUUsage, and PaulEMcKenney2008WhatIsRCUAPI].
|
||||
|
||||
Bibtex Entries
|
||||
|
||||
@ -202,6 +213,20 @@ Bibtex Entries
|
||||
,Year="1991"
|
||||
}
|
||||
|
||||
@phdthesis{HMassalinPhD
|
||||
,author="H. Massalin"
|
||||
,title="Synthesis: An Efficient Implementation of Fundamental Operating
|
||||
System Services"
|
||||
,school="Columbia University"
|
||||
,address="New York, NY"
|
||||
,year="1992"
|
||||
,annotation="
|
||||
Mondo optimizing compiler.
|
||||
Wait-free stuff.
|
||||
Good advice: defer work to avoid synchronization.
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{Jacobson93
|
||||
,author="Van Jacobson"
|
||||
,title="Avoid Read-Side Locking Via Delayed Free"
|
||||
@ -635,3 +660,86 @@ Revised:
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{PaulEMcKenney2007PreemptibleRCU
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="The design of preemptible read-copy-update"
|
||||
,month="October"
|
||||
,day="8"
|
||||
,year="2007"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/253651/}
|
||||
[Viewed October 25, 2007]"
|
||||
,annotation="
|
||||
LWN article describing the design of preemptible RCU.
|
||||
"
|
||||
}
|
||||
|
||||
########################################################################
|
||||
#
|
||||
# "What is RCU?" LWN series.
|
||||
#
|
||||
|
||||
@unpublished{PaulEMcKenney2007WhatIsRCUFundamentally
|
||||
,Author="Paul E. McKenney and Jonathan Walpole"
|
||||
,Title="What is {RCU}, Fundamentally?"
|
||||
,month="December"
|
||||
,day="17"
|
||||
,year="2007"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/262464/}
|
||||
[Viewed December 27, 2007]"
|
||||
,annotation="
|
||||
Lays out the three basic components of RCU: (1) publish-subscribe,
|
||||
(2) wait for pre-existing readers to complete, and (2) maintain
|
||||
multiple versions.
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{PaulEMcKenney2008WhatIsRCUUsage
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="What is {RCU}? Part 2: Usage"
|
||||
,month="January"
|
||||
,day="4"
|
||||
,year="2008"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/263130/}
|
||||
[Viewed January 4, 2008]"
|
||||
,annotation="
|
||||
Lays out six uses of RCU:
|
||||
1. RCU is a Reader-Writer Lock Replacement
|
||||
2. RCU is a Restricted Reference-Counting Mechanism
|
||||
3. RCU is a Bulk Reference-Counting Mechanism
|
||||
4. RCU is a Poor Man's Garbage Collector
|
||||
5. RCU is a Way of Providing Existence Guarantees
|
||||
6. RCU is a Way of Waiting for Things to Finish
|
||||
"
|
||||
}
|
||||
|
||||
@unpublished{PaulEMcKenney2008WhatIsRCUAPI
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="{RCU} part 3: the {RCU} {API}"
|
||||
,month="January"
|
||||
,day="17"
|
||||
,year="2008"
|
||||
,note="Available:
|
||||
\url{http://lwn.net/Articles/264090/}
|
||||
[Viewed January 10, 2008]"
|
||||
,annotation="
|
||||
Gives an overview of the Linux-kernel RCU API and a brief annotated RCU
|
||||
bibliography.
|
||||
"
|
||||
}
|
||||
|
||||
@article{DinakarGuniguntala2008IBMSysJ
|
||||
,author="D. Guniguntala and P. E. McKenney and J. Triplett and J. Walpole"
|
||||
,title="The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with {Linux}"
|
||||
,Year="2008"
|
||||
,Month="April"
|
||||
,journal="IBM Systems Journal"
|
||||
,volume="47"
|
||||
,number="2"
|
||||
,pages="@@-@@"
|
||||
,annotation="
|
||||
RCU, realtime RCU, sleepable RCU, performance.
|
||||
"
|
||||
}
|
||||
|
@ -13,10 +13,13 @@ over a rather long period of time, but improvements are always welcome!
|
||||
detailed performance measurements show that RCU is nonetheless
|
||||
the right tool for the job.
|
||||
|
||||
The other exception would be where performance is not an issue,
|
||||
and RCU provides a simpler implementation. An example of this
|
||||
situation is the dynamic NMI code in the Linux 2.6 kernel,
|
||||
at least on architectures where NMIs are rare.
|
||||
Another exception is where performance is not an issue, and RCU
|
||||
provides a simpler implementation. An example of this situation
|
||||
is the dynamic NMI code in the Linux 2.6 kernel, at least on
|
||||
architectures where NMIs are rare.
|
||||
|
||||
Yet another exception is where the low real-time latency of RCU's
|
||||
read-side primitives is critically important.
|
||||
|
||||
1. Does the update code have proper mutual exclusion?
|
||||
|
||||
@ -39,9 +42,10 @@ over a rather long period of time, but improvements are always welcome!
|
||||
|
||||
2. Do the RCU read-side critical sections make proper use of
|
||||
rcu_read_lock() and friends? These primitives are needed
|
||||
to suppress preemption (or bottom halves, in the case of
|
||||
rcu_read_lock_bh()) in the read-side critical sections,
|
||||
and are also an excellent aid to readability.
|
||||
to prevent grace periods from ending prematurely, which
|
||||
could result in data being unceremoniously freed out from
|
||||
under your read-side code, which can greatly increase the
|
||||
actuarial risk of your kernel.
|
||||
|
||||
As a rough rule of thumb, any dereference of an RCU-protected
|
||||
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
|
||||
@ -54,15 +58,30 @@ over a rather long period of time, but improvements are always welcome!
|
||||
be running while updates are in progress. There are a number
|
||||
of ways to handle this concurrency, depending on the situation:
|
||||
|
||||
a. Make updates appear atomic to readers. For example,
|
||||
a. Use the RCU variants of the list and hlist update
|
||||
primitives to add, remove, and replace elements on an
|
||||
RCU-protected list. Alternatively, use the RCU-protected
|
||||
trees that have been added to the Linux kernel.
|
||||
|
||||
This is almost always the best approach.
|
||||
|
||||
b. Proceed as in (a) above, but also maintain per-element
|
||||
locks (that are acquired by both readers and writers)
|
||||
that guard per-element state. Of course, fields that
|
||||
the readers refrain from accessing can be guarded by the
|
||||
update-side lock.
|
||||
|
||||
This works quite well, also.
|
||||
|
||||
c. Make updates appear atomic to readers. For example,
|
||||
pointer updates to properly aligned fields will appear
|
||||
atomic, as will individual atomic primitives. Operations
|
||||
performed under a lock and sequences of multiple atomic
|
||||
primitives will -not- appear to be atomic.
|
||||
|
||||
This is almost always the best approach.
|
||||
This can work, but is starting to get a bit tricky.
|
||||
|
||||
b. Carefully order the updates and the reads so that
|
||||
d. Carefully order the updates and the reads so that
|
||||
readers see valid data at all phases of the update.
|
||||
This is often more difficult than it sounds, especially
|
||||
given modern CPUs' tendency to reorder memory references.
|
||||
@ -123,18 +142,22 @@ over a rather long period of time, but improvements are always welcome!
|
||||
when publicizing a pointer to a structure that can
|
||||
be traversed by an RCU read-side critical section.
|
||||
|
||||
5. If call_rcu(), or a related primitive such as call_rcu_bh(),
|
||||
is used, the callback function must be written to be called
|
||||
from softirq context. In particular, it cannot block.
|
||||
5. If call_rcu(), or a related primitive such as call_rcu_bh() or
|
||||
call_rcu_sched(), is used, the callback function must be
|
||||
written to be called from softirq context. In particular,
|
||||
it cannot block.
|
||||
|
||||
6. Since synchronize_rcu() can block, it cannot be called from
|
||||
any sort of irq context.
|
||||
any sort of irq context. Ditto for synchronize_sched() and
|
||||
synchronize_srcu().
|
||||
|
||||
7. If the updater uses call_rcu(), then the corresponding readers
|
||||
must use rcu_read_lock() and rcu_read_unlock(). If the updater
|
||||
uses call_rcu_bh(), then the corresponding readers must use
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up
|
||||
will result in confusion and broken kernels.
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh(). If the updater
|
||||
uses call_rcu_sched(), then the corresponding readers must
|
||||
disable preemption. Mixing things up will result in confusion
|
||||
and broken kernels.
|
||||
|
||||
One exception to this rule: rcu_read_lock() and rcu_read_unlock()
|
||||
may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
|
||||
@ -143,9 +166,9 @@ over a rather long period of time, but improvements are always welcome!
|
||||
such cases is a must, of course! And the jury is still out on
|
||||
whether the increased speed is worth it.
|
||||
|
||||
8. Although synchronize_rcu() is a bit slower than is call_rcu(),
|
||||
it usually results in simpler code. So, unless update
|
||||
performance is critically important or the updaters cannot block,
|
||||
8. Although synchronize_rcu() is slower than is call_rcu(), it
|
||||
usually results in simpler code. So, unless update performance
|
||||
is critically important or the updaters cannot block,
|
||||
synchronize_rcu() should be used in preference to call_rcu().
|
||||
|
||||
An especially important property of the synchronize_rcu()
|
||||
@ -187,23 +210,23 @@ over a rather long period of time, but improvements are always welcome!
|
||||
number of updates per grace period.
|
||||
|
||||
9. All RCU list-traversal primitives, which include
|
||||
list_for_each_rcu(), list_for_each_entry_rcu(),
|
||||
rcu_dereference(), list_for_each_rcu(), list_for_each_entry_rcu(),
|
||||
list_for_each_continue_rcu(), and list_for_each_safe_rcu(),
|
||||
must be within an RCU read-side critical section. RCU
|
||||
must be either within an RCU read-side critical section or
|
||||
must be protected by appropriate update-side locks. RCU
|
||||
read-side critical sections are delimited by rcu_read_lock()
|
||||
and rcu_read_unlock(), or by similar primitives such as
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh().
|
||||
|
||||
Use of the _rcu() list-traversal primitives outside of an
|
||||
RCU read-side critical section causes no harm other than
|
||||
a slight performance degradation on Alpha CPUs. It can
|
||||
also be quite helpful in reducing code bloat when common
|
||||
code is shared between readers and updaters.
|
||||
The reason that it is permissible to use RCU list-traversal
|
||||
primitives when the update-side lock is held is that doing so
|
||||
can be quite helpful in reducing code bloat when common code is
|
||||
shared between readers and updaters.
|
||||
|
||||
10. Conversely, if you are in an RCU read-side critical section,
|
||||
you -must- use the "_rcu()" variants of the list macros.
|
||||
Failing to do so will break Alpha and confuse people reading
|
||||
your code.
|
||||
and you don't hold the appropriate update-side lock, you -must-
|
||||
use the "_rcu()" variants of the list macros. Failing to do so
|
||||
will break Alpha and confuse people reading your code.
|
||||
|
||||
11. Note that synchronize_rcu() -only- guarantees to wait until
|
||||
all currently executing rcu_read_lock()-protected RCU read-side
|
||||
@ -230,6 +253,14 @@ over a rather long period of time, but improvements are always welcome!
|
||||
must use whatever locking or other synchronization is required
|
||||
to safely access and/or modify that data structure.
|
||||
|
||||
RCU callbacks are -usually- executed on the same CPU that executed
|
||||
the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(),
|
||||
but are by -no- means guaranteed to be. For example, if a given
|
||||
CPU goes offline while having an RCU callback pending, then that
|
||||
RCU callback will execute on some surviving CPU. (If this was
|
||||
not the case, a self-spawning RCU callback would prevent the
|
||||
victim CPU from ever going offline.)
|
||||
|
||||
14. SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu())
|
||||
may only be invoked from process context. Unlike other forms of
|
||||
RCU, it -is- permissible to block in an SRCU read-side critical
|
||||
|
@ -10,23 +10,30 @@ status messages via printk(), which can be examined via the dmesg
|
||||
command (perhaps grepping for "torture"). The test is started
|
||||
when the module is loaded, and stops when the module is unloaded.
|
||||
|
||||
However, actually setting this config option to "y" results in the system
|
||||
running the test immediately upon boot, and ending only when the system
|
||||
is taken down. Normally, one will instead want to build the system
|
||||
with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control
|
||||
the test, perhaps using a script similar to the one shown at the end of
|
||||
this document. Note that you will need CONFIG_MODULE_UNLOAD in order
|
||||
to be able to end the test.
|
||||
CONFIG_RCU_TORTURE_TEST_RUNNABLE
|
||||
|
||||
It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will
|
||||
result in the tests being loaded into the base kernel. In this case,
|
||||
the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify
|
||||
whether the RCU torture tests are to be started immediately during
|
||||
boot or whether the /proc/sys/kernel/rcutorture_runnable file is used
|
||||
to enable them. This /proc file can be used to repeatedly pause and
|
||||
restart the tests, regardless of the initial state specified by the
|
||||
CONFIG_RCU_TORTURE_TEST_RUNNABLE config option.
|
||||
|
||||
You will normally -not- want to start the RCU torture tests during boot
|
||||
(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing
|
||||
this can sometimes be useful in finding boot-time bugs.
|
||||
|
||||
|
||||
MODULE PARAMETERS
|
||||
|
||||
This module has the following parameters:
|
||||
|
||||
nreaders This is the number of RCU reading threads supported.
|
||||
The default is twice the number of CPUs. Why twice?
|
||||
To properly exercise RCU implementations with preemptible
|
||||
read-side critical sections.
|
||||
irqreaders Says to invoke RCU readers from irq level. This is currently
|
||||
done via timers. Defaults to "1" for variants of RCU that
|
||||
permit this. (Or, more accurately, variants of RCU that do
|
||||
-not- permit this know to ignore this variable.)
|
||||
|
||||
nfakewriters This is the number of RCU fake writer threads to run. Fake
|
||||
writer threads repeatedly use the synchronous "wait for
|
||||
@ -37,6 +44,16 @@ nfakewriters This is the number of RCU fake writer threads to run. Fake
|
||||
to trigger special cases caused by multiple writers, such as
|
||||
the synchronize_srcu() early return optimization.
|
||||
|
||||
nreaders This is the number of RCU reading threads supported.
|
||||
The default is twice the number of CPUs. Why twice?
|
||||
To properly exercise RCU implementations with preemptible
|
||||
read-side critical sections.
|
||||
|
||||
shuffle_interval
|
||||
The number of seconds to keep the test threads affinitied
|
||||
to a particular subset of the CPUs, defaults to 3 seconds.
|
||||
Used in conjunction with test_no_idle_hz.
|
||||
|
||||
stat_interval The number of seconds between output of torture
|
||||
statistics (via printk()). Regardless of the interval,
|
||||
statistics are printed when the module is unloaded.
|
||||
@ -44,10 +61,11 @@ stat_interval The number of seconds between output of torture
|
||||
be printed -only- when the module is unloaded, and this
|
||||
is the default.
|
||||
|
||||
shuffle_interval
|
||||
The number of seconds to keep the test threads affinitied
|
||||
to a particular subset of the CPUs, defaults to 5 seconds.
|
||||
Used in conjunction with test_no_idle_hz.
|
||||
stutter The length of time to run the test before pausing for this
|
||||
same period of time. Defaults to "stutter=5", so as
|
||||
to run and pause for (roughly) five-second intervals.
|
||||
Specifying "stutter=0" causes the test to run continuously
|
||||
without pausing, which is the old default behavior.
|
||||
|
||||
test_no_idle_hz Whether or not to test the ability of RCU to operate in
|
||||
a kernel that disables the scheduling-clock interrupt to
|
||||
|
@ -1,3 +1,11 @@
|
||||
Please note that the "What is RCU?" LWN series is an excellent place
|
||||
to start learning about RCU:
|
||||
|
||||
1. What is RCU, Fundamentally? http://lwn.net/Articles/262464/
|
||||
2. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/
|
||||
3. RCU part 3: the RCU API http://lwn.net/Articles/264090/
|
||||
|
||||
|
||||
What is RCU?
|
||||
|
||||
RCU is a synchronization mechanism that was added to the Linux kernel
|
||||
@ -772,26 +780,18 @@ Linux-kernel source code, but it helps to have a full list of the
|
||||
APIs, since there does not appear to be a way to categorize them
|
||||
in docbook. Here is the list, by category.
|
||||
|
||||
Markers for RCU read-side critical sections:
|
||||
|
||||
rcu_read_lock
|
||||
rcu_read_unlock
|
||||
rcu_read_lock_bh
|
||||
rcu_read_unlock_bh
|
||||
srcu_read_lock
|
||||
srcu_read_unlock
|
||||
|
||||
RCU pointer/list traversal:
|
||||
|
||||
rcu_dereference
|
||||
list_for_each_rcu (to be deprecated in favor of
|
||||
list_for_each_entry_rcu)
|
||||
list_for_each_entry_rcu
|
||||
list_for_each_continue_rcu (to be deprecated in favor of new
|
||||
list_for_each_entry_continue_rcu)
|
||||
hlist_for_each_entry_rcu
|
||||
|
||||
RCU pointer update:
|
||||
list_for_each_rcu (to be deprecated in favor of
|
||||
list_for_each_entry_rcu)
|
||||
list_for_each_continue_rcu (to be deprecated in favor of new
|
||||
list_for_each_entry_continue_rcu)
|
||||
|
||||
RCU pointer/list update:
|
||||
|
||||
rcu_assign_pointer
|
||||
list_add_rcu
|
||||
@ -799,16 +799,36 @@ RCU pointer update:
|
||||
list_del_rcu
|
||||
list_replace_rcu
|
||||
hlist_del_rcu
|
||||
hlist_add_after_rcu
|
||||
hlist_add_before_rcu
|
||||
hlist_add_head_rcu
|
||||
hlist_replace_rcu
|
||||
list_splice_init_rcu()
|
||||
|
||||
RCU grace period:
|
||||
RCU: Critical sections Grace period Barrier
|
||||
|
||||
rcu_read_lock synchronize_net rcu_barrier
|
||||
rcu_read_unlock synchronize_rcu
|
||||
call_rcu
|
||||
|
||||
|
||||
bh: Critical sections Grace period Barrier
|
||||
|
||||
rcu_read_lock_bh call_rcu_bh rcu_barrier_bh
|
||||
rcu_read_unlock_bh
|
||||
|
||||
|
||||
sched: Critical sections Grace period Barrier
|
||||
|
||||
[preempt_disable] synchronize_sched rcu_barrier_sched
|
||||
[and friends] call_rcu_sched
|
||||
|
||||
|
||||
SRCU: Critical sections Grace period Barrier
|
||||
|
||||
srcu_read_lock synchronize_srcu N/A
|
||||
srcu_read_unlock
|
||||
|
||||
synchronize_net
|
||||
synchronize_sched
|
||||
synchronize_rcu
|
||||
synchronize_srcu
|
||||
call_rcu
|
||||
call_rcu_bh
|
||||
|
||||
See the comment headers in the source code (or the docbook generated
|
||||
from them) for more information.
|
||||
|
@ -24,6 +24,8 @@ There are three different groups of fields in the struct taskstats:
|
||||
|
||||
4) Per-task and per-thread context switch count statistics
|
||||
|
||||
5) Time accounting for SMT machines
|
||||
|
||||
Future extension should add fields to the end of the taskstats struct, and
|
||||
should not change the relative position of each field within the struct.
|
||||
|
||||
@ -164,4 +166,8 @@ struct taskstats {
|
||||
__u64 nvcsw; /* Context voluntary switch counter */
|
||||
__u64 nivcsw; /* Context involuntary switch counter */
|
||||
|
||||
5) Time accounting for SMT machines
|
||||
__u64 ac_utimescaled; /* utime scaled on frequency etc */
|
||||
__u64 ac_stimescaled; /* stime scaled on frequency etc */
|
||||
__u64 cpu_scaled_run_real_total; /* scaled cpu_run_real_total */
|
||||
}
|
||||
|
@ -3,7 +3,7 @@
|
||||
===================================
|
||||
|
||||
License: GPLv2
|
||||
Author & Maintainer: Miguel Ojeda Sandonis <maxextreme@gmail.com>
|
||||
Author & Maintainer: Miguel Ojeda Sandonis
|
||||
Date: 2006-10-27
|
||||
|
||||
|
||||
@ -22,7 +22,7 @@ Date: 2006-10-27
|
||||
1. DRIVER INFORMATION
|
||||
---------------------
|
||||
|
||||
This driver support one cfag12864b display at time.
|
||||
This driver supports a cfag12864b LCD.
|
||||
|
||||
|
||||
---------------------
|
||||
|
@ -4,7 +4,7 @@
|
||||
* Description: cfag12864b LCD userspace example program
|
||||
* License: GPLv2
|
||||
*
|
||||
* Author: Copyright (C) Miguel Ojeda Sandonis <maxextreme@gmail.com>
|
||||
* Author: Copyright (C) Miguel Ojeda Sandonis
|
||||
* Date: 2006-10-31
|
||||
*
|
||||
* This program is free software; you can redistribute it and/or modify
|
||||
|
@ -3,7 +3,7 @@
|
||||
==========================================
|
||||
|
||||
License: GPLv2
|
||||
Author & Maintainer: Miguel Ojeda Sandonis <maxextreme@gmail.com>
|
||||
Author & Maintainer: Miguel Ojeda Sandonis
|
||||
Date: 2006-10-27
|
||||
|
||||
|
||||
@ -21,7 +21,7 @@ Date: 2006-10-27
|
||||
1. DRIVER INFORMATION
|
||||
---------------------
|
||||
|
||||
This driver support the ks0108 LCD controller.
|
||||
This driver supports the ks0108 LCD controller.
|
||||
|
||||
|
||||
---------------------
|
||||
|
327
Documentation/block/data-integrity.txt
Normal file
327
Documentation/block/data-integrity.txt
Normal file
@ -0,0 +1,327 @@
|
||||
----------------------------------------------------------------------
|
||||
1. INTRODUCTION
|
||||
|
||||
Modern filesystems feature checksumming of data and metadata to
|
||||
protect against data corruption. However, the detection of the
|
||||
corruption is done at read time which could potentially be months
|
||||
after the data was written. At that point the original data that the
|
||||
application tried to write is most likely lost.
|
||||
|
||||
The solution is to ensure that the disk is actually storing what the
|
||||
application meant it to. Recent additions to both the SCSI family
|
||||
protocols (SBC Data Integrity Field, SCC protection proposal) as well
|
||||
as SATA/T13 (External Path Protection) try to remedy this by adding
|
||||
support for appending integrity metadata to an I/O. The integrity
|
||||
metadata (or protection information in SCSI terminology) includes a
|
||||
checksum for each sector as well as an incrementing counter that
|
||||
ensures the individual sectors are written in the right order. And
|
||||
for some protection schemes also that the I/O is written to the right
|
||||
place on disk.
|
||||
|
||||
Current storage controllers and devices implement various protective
|
||||
measures, for instance checksumming and scrubbing. But these
|
||||
technologies are working in their own isolated domains or at best
|
||||
between adjacent nodes in the I/O path. The interesting thing about
|
||||
DIF and the other integrity extensions is that the protection format
|
||||
is well defined and every node in the I/O path can verify the
|
||||
integrity of the I/O and reject it if corruption is detected. This
|
||||
allows not only corruption prevention but also isolation of the point
|
||||
of failure.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
2. THE DATA INTEGRITY EXTENSIONS
|
||||
|
||||
As written, the protocol extensions only protect the path between
|
||||
controller and storage device. However, many controllers actually
|
||||
allow the operating system to interact with the integrity metadata
|
||||
(IMD). We have been working with several FC/SAS HBA vendors to enable
|
||||
the protection information to be transferred to and from their
|
||||
controllers.
|
||||
|
||||
The SCSI Data Integrity Field works by appending 8 bytes of protection
|
||||
information to each sector. The data + integrity metadata is stored
|
||||
in 520 byte sectors on disk. Data + IMD are interleaved when
|
||||
transferred between the controller and target. The T13 proposal is
|
||||
similar.
|
||||
|
||||
Because it is highly inconvenient for operating systems to deal with
|
||||
520 (and 4104) byte sectors, we approached several HBA vendors and
|
||||
encouraged them to allow separation of the data and integrity metadata
|
||||
scatter-gather lists.
|
||||
|
||||
The controller will interleave the buffers on write and split them on
|
||||
read. This means that the Linux can DMA the data buffers to and from
|
||||
host memory without changes to the page cache.
|
||||
|
||||
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
|
||||
is somewhat heavy to compute in software. Benchmarks found that
|
||||
calculating this checksum had a significant impact on system
|
||||
performance for a number of workloads. Some controllers allow a
|
||||
lighter-weight checksum to be used when interfacing with the operating
|
||||
system. Emulex, for instance, supports the TCP/IP checksum instead.
|
||||
The IP checksum received from the OS is converted to the 16-bit CRC
|
||||
when writing and vice versa. This allows the integrity metadata to be
|
||||
generated by Linux or the application at very low cost (comparable to
|
||||
software RAID5).
|
||||
|
||||
The IP checksum is weaker than the CRC in terms of detecting bit
|
||||
errors. However, the strength is really in the separation of the data
|
||||
buffers and the integrity metadata. These two distinct buffers much
|
||||
match up for an I/O to complete.
|
||||
|
||||
The separation of the data and integrity metadata buffers as well as
|
||||
the choice in checksums is referred to as the Data Integrity
|
||||
Extensions. As these extensions are outside the scope of the protocol
|
||||
bodies (T10, T13), Oracle and its partners are trying to standardize
|
||||
them within the Storage Networking Industry Association.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
3. KERNEL CHANGES
|
||||
|
||||
The data integrity framework in Linux enables protection information
|
||||
to be pinned to I/Os and sent to/received from controllers that
|
||||
support it.
|
||||
|
||||
The advantage to the integrity extensions in SCSI and SATA is that
|
||||
they enable us to protect the entire path from application to storage
|
||||
device. However, at the same time this is also the biggest
|
||||
disadvantage. It means that the protection information must be in a
|
||||
format that can be understood by the disk.
|
||||
|
||||
Generally Linux/POSIX applications are agnostic to the intricacies of
|
||||
the storage devices they are accessing. The virtual filesystem switch
|
||||
and the block layer make things like hardware sector size and
|
||||
transport protocols completely transparent to the application.
|
||||
|
||||
However, this level of detail is required when preparing the
|
||||
protection information to send to a disk. Consequently, the very
|
||||
concept of an end-to-end protection scheme is a layering violation.
|
||||
It is completely unreasonable for an application to be aware whether
|
||||
it is accessing a SCSI or SATA disk.
|
||||
|
||||
The data integrity support implemented in Linux attempts to hide this
|
||||
from the application. As far as the application (and to some extent
|
||||
the kernel) is concerned, the integrity metadata is opaque information
|
||||
that's attached to the I/O.
|
||||
|
||||
The current implementation allows the block layer to automatically
|
||||
generate the protection information for any I/O. Eventually the
|
||||
intent is to move the integrity metadata calculation to userspace for
|
||||
user data. Metadata and other I/O that originates within the kernel
|
||||
will still use the automatic generation interface.
|
||||
|
||||
Some storage devices allow each hardware sector to be tagged with a
|
||||
16-bit value. The owner of this tag space is the owner of the block
|
||||
device. I.e. the filesystem in most cases. The filesystem can use
|
||||
this extra space to tag sectors as they see fit. Because the tag
|
||||
space is limited, the block interface allows tagging bigger chunks by
|
||||
way of interleaving. This way, 8*16 bits of information can be
|
||||
attached to a typical 4KB filesystem block.
|
||||
|
||||
This also means that applications such as fsck and mkfs will need
|
||||
access to manipulate the tags from user space. A passthrough
|
||||
interface for this is being worked on.
|
||||
|
||||
|
||||
----------------------------------------------------------------------
|
||||
4. BLOCK LAYER IMPLEMENTATION DETAILS
|
||||
|
||||
4.1 BIO
|
||||
|
||||
The data integrity patches add a new field to struct bio when
|
||||
CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
|
||||
to a struct bip which contains the bio integrity payload. Essentially
|
||||
a bip is a trimmed down struct bio which holds a bio_vec containing
|
||||
the integrity metadata and the required housekeeping information (bvec
|
||||
pool, vector count, etc.)
|
||||
|
||||
A kernel subsystem can enable data integrity protection on a bio by
|
||||
calling bio_integrity_alloc(bio). This will allocate and attach the
|
||||
bip to the bio.
|
||||
|
||||
Individual pages containing integrity metadata can subsequently be
|
||||
attached using bio_integrity_add_page().
|
||||
|
||||
bio_free() will automatically free the bip.
|
||||
|
||||
|
||||
4.2 BLOCK DEVICE
|
||||
|
||||
Because the format of the protection data is tied to the physical
|
||||
disk, each block device has been extended with a block integrity
|
||||
profile (struct blk_integrity). This optional profile is registered
|
||||
with the block layer using blk_integrity_register().
|
||||
|
||||
The profile contains callback functions for generating and verifying
|
||||
the protection data, as well as getting and setting application tags.
|
||||
The profile also contains a few constants to aid in completing,
|
||||
merging and splitting the integrity metadata.
|
||||
|
||||
Layered block devices will need to pick a profile that's appropriate
|
||||
for all subdevices. blk_integrity_compare() can help with that. DM
|
||||
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
|
||||
will require extra work due to the application tag.
|
||||
|
||||
|
||||
----------------------------------------------------------------------
|
||||
5.0 BLOCK LAYER INTEGRITY API
|
||||
|
||||
5.1 NORMAL FILESYSTEM
|
||||
|
||||
The normal filesystem is unaware that the underlying block device
|
||||
is capable of sending/receiving integrity metadata. The IMD will
|
||||
be automatically generated by the block layer at submit_bio() time
|
||||
in case of a WRITE. A READ request will cause the I/O integrity
|
||||
to be verified upon completion.
|
||||
|
||||
IMD generation and verification can be toggled using the
|
||||
|
||||
/sys/block/<bdev>/integrity/write_generate
|
||||
|
||||
and
|
||||
|
||||
/sys/block/<bdev>/integrity/read_verify
|
||||
|
||||
flags.
|
||||
|
||||
|
||||
5.2 INTEGRITY-AWARE FILESYSTEM
|
||||
|
||||
A filesystem that is integrity-aware can prepare I/Os with IMD
|
||||
attached. It can also use the application tag space if this is
|
||||
supported by the block device.
|
||||
|
||||
|
||||
int bdev_integrity_enabled(block_device, int rw);
|
||||
|
||||
bdev_integrity_enabled() will return 1 if the block device
|
||||
supports integrity metadata transfer for the data direction
|
||||
specified in 'rw'.
|
||||
|
||||
bdev_integrity_enabled() honors the write_generate and
|
||||
read_verify flags in sysfs and will respond accordingly.
|
||||
|
||||
|
||||
int bio_integrity_prep(bio);
|
||||
|
||||
To generate IMD for WRITE and to set up buffers for READ, the
|
||||
filesystem must call bio_integrity_prep(bio).
|
||||
|
||||
Prior to calling this function, the bio data direction and start
|
||||
sector must be set, and the bio should have all data pages
|
||||
added. It is up to the caller to ensure that the bio does not
|
||||
change while I/O is in progress.
|
||||
|
||||
bio_integrity_prep() should only be called if
|
||||
bio_integrity_enabled() returned 1.
|
||||
|
||||
|
||||
int bio_integrity_tag_size(bio);
|
||||
|
||||
If the filesystem wants to use the application tag space it will
|
||||
first have to find out how much storage space is available.
|
||||
Because tag space is generally limited (usually 2 bytes per
|
||||
sector regardless of sector size), the integrity framework
|
||||
supports interleaving the information between the sectors in an
|
||||
I/O.
|
||||
|
||||
Filesystems can call bio_integrity_tag_size(bio) to find out how
|
||||
many bytes of storage are available for that particular bio.
|
||||
|
||||
Another option is bdev_get_tag_size(block_device) which will
|
||||
return the number of available bytes per hardware sector.
|
||||
|
||||
|
||||
int bio_integrity_set_tag(bio, void *tag_buf, len);
|
||||
|
||||
After a successful return from bio_integrity_prep(),
|
||||
bio_integrity_set_tag() can be used to attach an opaque tag
|
||||
buffer to a bio. Obviously this only makes sense if the I/O is
|
||||
a WRITE.
|
||||
|
||||
|
||||
int bio_integrity_get_tag(bio, void *tag_buf, len);
|
||||
|
||||
Similarly, at READ I/O completion time the filesystem can
|
||||
retrieve the tag buffer using bio_integrity_get_tag().
|
||||
|
||||
|
||||
6.3 PASSING EXISTING INTEGRITY METADATA
|
||||
|
||||
Filesystems that either generate their own integrity metadata or
|
||||
are capable of transferring IMD from user space can use the
|
||||
following calls:
|
||||
|
||||
|
||||
struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
|
||||
|
||||
Allocates the bio integrity payload and hangs it off of the bio.
|
||||
nr_pages indicate how many pages of protection data need to be
|
||||
stored in the integrity bio_vec list (similar to bio_alloc()).
|
||||
|
||||
The integrity payload will be freed at bio_free() time.
|
||||
|
||||
|
||||
int bio_integrity_add_page(bio, page, len, offset);
|
||||
|
||||
Attaches a page containing integrity metadata to an existing
|
||||
bio. The bio must have an existing bip,
|
||||
i.e. bio_integrity_alloc() must have been called. For a WRITE,
|
||||
the integrity metadata in the pages must be in a format
|
||||
understood by the target device with the notable exception that
|
||||
the sector numbers will be remapped as the request traverses the
|
||||
I/O stack. This implies that the pages added using this call
|
||||
will be modified during I/O! The first reference tag in the
|
||||
integrity metadata must have a value of bip->bip_sector.
|
||||
|
||||
Pages can be added using bio_integrity_add_page() as long as
|
||||
there is room in the bip bio_vec array (nr_pages).
|
||||
|
||||
Upon completion of a READ operation, the attached pages will
|
||||
contain the integrity metadata received from the storage device.
|
||||
It is up to the receiver to process them and verify data
|
||||
integrity upon completion.
|
||||
|
||||
|
||||
6.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
|
||||
METADATA
|
||||
|
||||
To enable integrity exchange on a block device the gendisk must be
|
||||
registered as capable:
|
||||
|
||||
int blk_integrity_register(gendisk, blk_integrity);
|
||||
|
||||
The blk_integrity struct is a template and should contain the
|
||||
following:
|
||||
|
||||
static struct blk_integrity my_profile = {
|
||||
.name = "STANDARDSBODY-TYPE-VARIANT-CSUM",
|
||||
.generate_fn = my_generate_fn,
|
||||
.verify_fn = my_verify_fn,
|
||||
.get_tag_fn = my_get_tag_fn,
|
||||
.set_tag_fn = my_set_tag_fn,
|
||||
.tuple_size = sizeof(struct my_tuple_size),
|
||||
.tag_size = <tag bytes per hw sector>,
|
||||
};
|
||||
|
||||
'name' is a text string which will be visible in sysfs. This is
|
||||
part of the userland API so chose it carefully and never change
|
||||
it. The format is standards body-type-variant.
|
||||
E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
|
||||
|
||||
'generate_fn' generates appropriate integrity metadata (for WRITE).
|
||||
|
||||
'verify_fn' verifies that the data buffer matches the integrity
|
||||
metadata.
|
||||
|
||||
'tuple_size' must be set to match the size of the integrity
|
||||
metadata per sector. I.e. 8 for DIF and EPP.
|
||||
|
||||
'tag_size' must be set to identify how many bytes of tag space
|
||||
are available per hardware sector. For DIF this is either 2 or
|
||||
0 depending on the value of the Control Mode Page ATO bit.
|
||||
|
||||
See 6.2 for a description of get_tag_fn and set_tag_fn.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
|
@ -390,6 +390,10 @@ If you have several tasks to attach, you have to do it one after another:
|
||||
...
|
||||
# /bin/echo PIDn > tasks
|
||||
|
||||
You can attach the current shell task by echoing 0:
|
||||
|
||||
# echo 0 > tasks
|
||||
|
||||
3. Kernel API
|
||||
=============
|
||||
|
||||
|
@ -13,7 +13,7 @@ either an integer or * for all. Access is a composition of r
|
||||
The root device cgroup starts with rwm to 'all'. A child device
|
||||
cgroup gets a copy of the parent. Administrators can then remove
|
||||
devices from the whitelist or add new entries. A child cgroup can
|
||||
never receive a device access which is denied its parent. However
|
||||
never receive a device access which is denied by its parent. However
|
||||
when a device access is removed from a parent it will not also be
|
||||
removed from the child(ren).
|
||||
|
||||
@ -29,7 +29,11 @@ allows cgroup 1 to read and mknod the device usually known as
|
||||
|
||||
echo a > /cgroups/1/devices.deny
|
||||
|
||||
will remove the default 'a *:* mrw' entry.
|
||||
will remove the default 'a *:* rwm' entry. Doing
|
||||
|
||||
echo a > /cgroups/1/devices.allow
|
||||
|
||||
will add the 'a *:* rwm' entry to the whitelist.
|
||||
|
||||
3. Security
|
||||
|
||||
|
@ -154,13 +154,15 @@ browsing and modifying the cpusets presently known to the kernel. No
|
||||
new system calls are added for cpusets - all support for querying and
|
||||
modifying cpusets is via this cpuset file system.
|
||||
|
||||
The /proc/<pid>/status file for each task has two added lines,
|
||||
The /proc/<pid>/status file for each task has four added lines,
|
||||
displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
|
||||
and mems_allowed (on which Memory Nodes it may obtain memory),
|
||||
in the format seen in the following example:
|
||||
in the two formats seen in the following example:
|
||||
|
||||
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
|
||||
Cpus_allowed_list: 0-127
|
||||
Mems_allowed: ffffffff,ffffffff
|
||||
Mems_allowed_list: 0-63
|
||||
|
||||
Each cpuset is represented by a directory in the cgroup file system
|
||||
containing (on top of the standard cgroup files) the following
|
||||
@ -544,6 +546,9 @@ otherwise initial value -1 that indicates the cpuset has no request.
|
||||
( 4 : search nodes in a chunk of node [on NUMA system] )
|
||||
( 5 : search system wide [on NUMA system] )
|
||||
|
||||
The system default is architecture dependent. The system default
|
||||
can be changed using the relax_domain_level= boot parameter.
|
||||
|
||||
This file is per-cpuset and affect the sched domain where the cpuset
|
||||
belongs to. Therefore if the flag 'sched_load_balance' of a cpuset
|
||||
is disabled, then 'sched_relax_domain_level' have no effect since
|
||||
|
@ -14,9 +14,8 @@ represent the thread siblings to cpu X in the same physical package;
|
||||
To implement it in an architecture-neutral way, a new source file,
|
||||
drivers/base/topology.c, is to export the 4 attributes.
|
||||
|
||||
If one architecture wants to support this feature, it just needs to
|
||||
implement 4 defines, typically in file include/asm-XXX/topology.h.
|
||||
The 4 defines are:
|
||||
For an architecture to support this feature, it must define some of
|
||||
these macros in include/asm-XXX/topology.h:
|
||||
#define topology_physical_package_id(cpu)
|
||||
#define topology_core_id(cpu)
|
||||
#define topology_thread_siblings(cpu)
|
||||
@ -25,17 +24,10 @@ The 4 defines are:
|
||||
The type of **_id is int.
|
||||
The type of siblings is cpumask_t.
|
||||
|
||||
To be consistent on all architectures, the 4 attributes should have
|
||||
default values if their values are unavailable. Below is the rule.
|
||||
1) physical_package_id: If cpu has no physical package id, -1 is the
|
||||
default value.
|
||||
2) core_id: If cpu doesn't support multi-core, its core id is 0.
|
||||
3) thread_siblings: Just include itself, if the cpu doesn't support
|
||||
HT/multi-thread.
|
||||
4) core_siblings: Just include itself, if the cpu doesn't support
|
||||
multi-core and HT/Multi-thread.
|
||||
|
||||
So be careful when declaring the 4 defines in include/asm-XXX/topology.h.
|
||||
|
||||
If an attribute isn't defined on an architecture, it won't be exported.
|
||||
|
||||
To be consistent on all architectures, include/linux/topology.h
|
||||
provides default definitions for any of the above macros that are
|
||||
not defined by include/asm-XXX/topology.h:
|
||||
1) physical_package_id: -1
|
||||
2) core_id: 0
|
||||
3) thread_siblings: just the given CPU
|
||||
4) core_siblings: just the given CPU
|
||||
|
@ -222,13 +222,6 @@ Who: Thomas Gleixner <tglx@linutronix.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: i2c-i810, i2c-prosavage and i2c-savage4
|
||||
When: May 2008
|
||||
Why: These drivers are superseded by i810fb, intelfb and savagefb.
|
||||
Who: Jean Delvare <khali@linux-fr.org>
|
||||
|
||||
---------------------------
|
||||
|
||||
What (Why):
|
||||
- include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files
|
||||
(superseded by xt_TOS/xt_tos target & match)
|
||||
@ -312,3 +305,12 @@ When: 2.6.26
|
||||
Why: Implementation became generic; users should now include
|
||||
linux/semaphore.h instead.
|
||||
Who: Matthew Wilcox <willy@linux.intel.com>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: CONFIG_THERMAL_HWMON
|
||||
When: January 2009
|
||||
Why: This option was introduced just to allow older lm-sensors userspace
|
||||
to keep working over the upgrade to 2.6.26. At the scheduled time of
|
||||
removal fixed lm-sensors (2.x or 3.x) should be readily available.
|
||||
Who: Rene Herman <rene.herman@gmail.com>
|
||||
|
@ -233,10 +233,12 @@ accomplished via the group operations specified on the group's
|
||||
config_item_type.
|
||||
|
||||
struct configfs_group_operations {
|
||||
struct config_item *(*make_item)(struct config_group *group,
|
||||
const char *name);
|
||||
struct config_group *(*make_group)(struct config_group *group,
|
||||
const char *name);
|
||||
int (*make_item)(struct config_group *group,
|
||||
const char *name,
|
||||
struct config_item **new_item);
|
||||
int (*make_group)(struct config_group *group,
|
||||
const char *name,
|
||||
struct config_group **new_group);
|
||||
int (*commit_item)(struct config_item *item);
|
||||
void (*disconnect_notify)(struct config_group *group,
|
||||
struct config_item *item);
|
||||
|
@ -273,13 +273,13 @@ static inline struct simple_children *to_simple_children(struct config_item *ite
|
||||
return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
|
||||
}
|
||||
|
||||
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
|
||||
static int simple_children_make_item(struct config_group *group, const char *name, struct config_item **new_item)
|
||||
{
|
||||
struct simple_child *simple_child;
|
||||
|
||||
simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL);
|
||||
if (!simple_child)
|
||||
return NULL;
|
||||
return -ENOMEM;
|
||||
|
||||
|
||||
config_item_init_type_name(&simple_child->item, name,
|
||||
@ -287,7 +287,8 @@ static struct config_item *simple_children_make_item(struct config_group *group,
|
||||
|
||||
simple_child->storeme = 0;
|
||||
|
||||
return &simple_child->item;
|
||||
*new_item = &simple_child->item;
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct configfs_attribute simple_children_attr_description = {
|
||||
@ -359,20 +360,21 @@ static struct configfs_subsystem simple_children_subsys = {
|
||||
* children of its own.
|
||||
*/
|
||||
|
||||
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
|
||||
static int group_children_make_group(struct config_group *group, const char *name, struct config_group **new_group)
|
||||
{
|
||||
struct simple_children *simple_children;
|
||||
|
||||
simple_children = kzalloc(sizeof(struct simple_children),
|
||||
GFP_KERNEL);
|
||||
if (!simple_children)
|
||||
return NULL;
|
||||
return -ENOMEM;
|
||||
|
||||
|
||||
config_group_init_type_name(&simple_children->group, name,
|
||||
&simple_children_type);
|
||||
|
||||
return &simple_children->group;
|
||||
*new_group = &simple_children->group;
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct configfs_attribute group_children_attr_description = {
|
||||
|
@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
|
||||
1. Quick usage instructions:
|
||||
===========================
|
||||
|
||||
- Grab updated e2fsprogs from
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
|
||||
This is a patchset on top of e2fsprogs-1.39, which can be found at
|
||||
- Compile and install the latest version of e2fsprogs (as of this
|
||||
writing version 1.41) from:
|
||||
|
||||
http://sourceforge.net/project/showfiles.php?group_id=2406
|
||||
|
||||
or
|
||||
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
|
||||
|
||||
- It's still mke2fs -j /dev/hda1
|
||||
or grab the latest git repository from:
|
||||
|
||||
- mount /dev/hda1 /wherever -t ext4dev
|
||||
git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
|
||||
|
||||
- To enable extents,
|
||||
- Create a new filesystem using the ext4dev filesystem type:
|
||||
|
||||
mount /dev/hda1 /wherever -t ext4dev -o extents
|
||||
# mke2fs -t ext4dev /dev/hda1
|
||||
|
||||
- The filesystem is compatible with the ext3 driver until you add a file
|
||||
which has extents (ie: `mount -o extents', then create a file).
|
||||
Or configure an existing ext3 filesystem to support extents and set
|
||||
the test_fs flag to indicate that it's ok for an in-development
|
||||
filesystem to touch this filesystem:
|
||||
|
||||
NOTE: The "extents" mount flag is temporary. It will soon go away and
|
||||
extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
|
||||
# tune2fs -O extents -E test_fs /dev/hda1
|
||||
|
||||
If the filesystem was created with 128 byte inodes, it can be
|
||||
converted to use 256 byte for greater efficiency via:
|
||||
|
||||
# tune2fs -I 256 /dev/hda1
|
||||
|
||||
(Note: we currently do not have tools to convert an ext4dev
|
||||
filesystem back to ext3; so please do not do try this on production
|
||||
filesystems.)
|
||||
|
||||
- Mounting:
|
||||
|
||||
# mount -t ext4dev /dev/hda1 /wherever
|
||||
|
||||
- When comparing performance with other filesystems, remember that
|
||||
ext3/4 by default offers higher data integrity guarantees than most. So
|
||||
when comparing with a metadata-only journalling filesystem, use `mount -o
|
||||
data=writeback'. And you might as well use `mount -o nobh' too along
|
||||
with it. Making the journal larger than the mke2fs default often helps
|
||||
performance with metadata-intensive workloads.
|
||||
ext3/4 by default offers higher data integrity guarantees than most.
|
||||
So when comparing with a metadata-only journalling filesystem, such
|
||||
as ext3, use `mount -o data=writeback'. And you might as well use
|
||||
`mount -o nobh' too along with it. Making the journal larger than
|
||||
the mke2fs default often helps performance with metadata-intensive
|
||||
workloads.
|
||||
|
||||
2. Features
|
||||
===========
|
||||
|
||||
2.1 Currently available
|
||||
|
||||
* ability to use filesystems > 16TB
|
||||
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
|
||||
* extent format reduces metadata overhead (RAM, IO for access, transactions)
|
||||
* extent format more robust in face of on-disk corruption due to magics,
|
||||
* internal redunancy in tree
|
||||
|
||||
2.1 Previously available, soon to be enabled by default by "mkefs.ext4":
|
||||
|
||||
* dir_index and resize inode will be on by default
|
||||
* large inodes will be used by default for fast EAs, nsec timestamps, etc
|
||||
* improved file allocation (multi-block alloc)
|
||||
* fix 32000 subdirectory limit
|
||||
* nsec timestamps for mtime, atime, ctime, create time
|
||||
* inode version field on disk (NFSv4, Lustre)
|
||||
* reduced e2fsck time via uninit_bg feature
|
||||
* journal checksumming for robustness, performance
|
||||
* persistent file preallocation (e.g for streaming media, databases)
|
||||
* ability to pack bitmaps and inode tables into larger virtual groups via the
|
||||
flex_bg feature
|
||||
* large file support
|
||||
* Inode allocation using large virtual block groups via flex_bg
|
||||
* delayed allocation
|
||||
* large block (up to pagesize) support
|
||||
* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
|
||||
the ordering)
|
||||
|
||||
2.2 Candidate features for future inclusion
|
||||
|
||||
There are several under discussion, whether they all make it in is
|
||||
partly a function of how much time everyone has to work on them:
|
||||
* Online defrag (patches available but not well tested)
|
||||
* reduced mke2fs time via lazy itable initialization in conjuction with
|
||||
the uninit_bg feature (capability to do this is available in e2fsprogs
|
||||
but a kernel thread to do lazy zeroing of unused inode table blocks
|
||||
after filesystem is first mounted is required for safety)
|
||||
|
||||
* improved file allocation (multi-block alloc, delayed alloc; basically done)
|
||||
* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
|
||||
* nsec timestamps for mtime, atime, ctime, create time (patch exists,
|
||||
needs some e2fsck work)
|
||||
* inode version field on disk (NFSv4, Lustre; prototype exists)
|
||||
* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
|
||||
* journal checksumming for robustness, performance (prototype exists)
|
||||
* persistent file preallocation (e.g for streaming media, databases)
|
||||
There are several others under discussion, whether they all make it in is
|
||||
partly a function of how much time everyone has to work on them. Features like
|
||||
metadata checksumming have been discussed and planned for a bit but no patches
|
||||
exist yet so I'm not sure they're in the near-term roadmap.
|
||||
|
||||
Features like metadata checksumming have been discussed and planned for
|
||||
a bit but no patches exist yet so I'm not sure they're in the near-term
|
||||
roadmap.
|
||||
The big performance win will come with mballoc, delalloc and flex_bg
|
||||
grouping of bitmaps and inode tables. Some test results available here:
|
||||
|
||||
The big performance win will come with mballoc and delalloc. CFS has
|
||||
been using mballoc for a few years already with Lustre, and IBM + Bull
|
||||
did a lot of benchmarking on it. The reason it isn't in the first set of
|
||||
patches is partly a manageability issue, and partly because it doesn't
|
||||
directly affect the on-disk format (outside of much better allocation)
|
||||
so it isn't critical to get into the first round of changes. I believe
|
||||
Alex is working on a new set of patches right now.
|
||||
- http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
|
||||
- http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
|
||||
|
||||
3. Options
|
||||
==========
|
||||
@ -222,9 +243,11 @@ stripe=n Number of filesystem blocks that mballoc will try
|
||||
to use for allocation size and alignment. For RAID5/6
|
||||
systems this should be the number of data
|
||||
disks * RAID chunk size in file system blocks.
|
||||
|
||||
delalloc (*) Deferring block allocation until write-out time.
|
||||
nodelalloc Disable delayed allocation. Blocks are allocation
|
||||
when data is copied from user to page cache.
|
||||
Data Mode
|
||||
---------
|
||||
=========
|
||||
There are 3 different data modes:
|
||||
|
||||
* writeback mode
|
||||
@ -236,10 +259,10 @@ typically provide the best ext4 performance.
|
||||
|
||||
* ordered mode
|
||||
In data=ordered mode, ext4 only officially journals metadata, but it logically
|
||||
groups metadata and data blocks into a single unit called a transaction. When
|
||||
it's time to write the new metadata out to disk, the associated data blocks
|
||||
are written first. In general, this mode performs slightly slower than
|
||||
writeback but significantly faster than journal mode.
|
||||
groups metadata information related to data changes with the data blocks into a
|
||||
single unit called a transaction. When it's time to write the new metadata
|
||||
out to disk, the associated data blocks are written first. In general,
|
||||
this mode performs slightly slower than writeback but significantly faster than journal mode.
|
||||
|
||||
* journal mode
|
||||
data=journal mode provides full data and metadata journaling. All new data is
|
||||
@ -247,7 +270,8 @@ written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both data and
|
||||
metadata into a consistent state. This mode is the slowest except when data
|
||||
needs to be read from and written to disk at the same time where it
|
||||
outperforms all others modes.
|
||||
outperforms all others modes. Curently ext4 does not have delayed
|
||||
allocation support if this data journalling mode is selected.
|
||||
|
||||
References
|
||||
==========
|
||||
@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
|
||||
<file:fs/jbd2/>
|
||||
|
||||
programs: http://e2fsprogs.sourceforge.net/
|
||||
http://ext2resize.sourceforge.net
|
||||
|
||||
useful links: http://fedoraproject.org/wiki/ext3-devel
|
||||
http://www.bullopensource.org/ext4/
|
||||
http://ext4.wiki.kernel.org/index.php/Main_Page
|
||||
http://fedoraproject.org/wiki/Features/Ext4
|
||||
|
114
Documentation/filesystems/gfs2-glocks.txt
Normal file
114
Documentation/filesystems/gfs2-glocks.txt
Normal file
@ -0,0 +1,114 @@
|
||||
Glock internal locking rules
|
||||
------------------------------
|
||||
|
||||
This documents the basic principles of the glock state machine
|
||||
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
|
||||
has two main (internal) locks:
|
||||
|
||||
1. A spinlock (gl_spin) which protects the internal state such
|
||||
as gl_state, gl_target and the list of holders (gl_holders)
|
||||
2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
|
||||
threads from making calls to the DLM, etc. at the same time. If a
|
||||
thread takes this lock, it must then call run_queue (usually via the
|
||||
workqueue) when it releases it in order to ensure any pending tasks
|
||||
are completed.
|
||||
|
||||
The gl_holders list contains all the queued lock requests (not
|
||||
just the holders) associated with the glock. If there are any
|
||||
held locks, then they will be contiguous entries at the head
|
||||
of the list. Locks are granted in strictly the order that they
|
||||
are queued, except for those marked LM_FLAG_PRIORITY which are
|
||||
used only during recovery, and even then only for journal locks.
|
||||
|
||||
There are three lock states that users of the glock layer can request,
|
||||
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
|
||||
to the following DLM lock modes:
|
||||
|
||||
Glock mode | DLM lock mode
|
||||
------------------------------
|
||||
UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
|
||||
SH | PR (Protected read)
|
||||
DF | CW (Concurrent write)
|
||||
EX | EX (Exclusive)
|
||||
|
||||
Thus DF is basically a shared mode which is incompatible with the "normal"
|
||||
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
|
||||
operations. The glocks are basically a lock plus some routines which deal
|
||||
with cache management. The following rules apply for the cache:
|
||||
|
||||
Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
|
||||
--------------------------------------------------------------------------
|
||||
UN | No | No | No | No
|
||||
SH | Yes | Yes | No | No
|
||||
DF | No | Yes | No | No
|
||||
EX | Yes | Yes | Yes | Yes
|
||||
|
||||
These rules are implemented using the various glock operations which
|
||||
are defined for each type of glock. Not all types of glocks use
|
||||
all the modes. Only inode glocks use the DF mode for example.
|
||||
|
||||
Table of glock operations and per type constants:
|
||||
|
||||
Field | Purpose
|
||||
----------------------------------------------------------------------------
|
||||
go_xmote_th | Called before remote state change (e.g. to sync dirty data)
|
||||
go_xmote_bh | Called after remote state change (e.g. to refill cache)
|
||||
go_inval | Called if remote state change requires invalidating the cache
|
||||
go_demote_ok | Returns boolean value of whether its ok to demote a glock
|
||||
| (e.g. checks timeout, and that there is no cached data)
|
||||
go_lock | Called for the first local holder of a lock
|
||||
go_unlock | Called on the final local unlock of a lock
|
||||
go_dump | Called to print content of object for debugfs file, or on
|
||||
| error to dump glock to the log.
|
||||
go_type; | The type of the glock, LM_TYPE_.....
|
||||
go_min_hold_time | The minimum hold time
|
||||
|
||||
The minimum hold time for each lock is the time after a remote lock
|
||||
grant for which we ignore remote demote requests. This is in order to
|
||||
prevent a situation where locks are being bounced around the cluster
|
||||
from node to node with none of the nodes making any progress. This
|
||||
tends to show up most with shared mmaped files which are being written
|
||||
to by multiple nodes. By delaying the demotion in response to a
|
||||
remote callback, that gives the userspace program time to make
|
||||
some progress before the pages are unmapped.
|
||||
|
||||
There is a plan to try and remove the go_lock and go_unlock callbacks
|
||||
if possible, in order to try and speed up the fast path though the locking.
|
||||
Also, eventually we hope to make the glock "EX" mode locally shared
|
||||
such that any local locking will be done with the i_mutex as required
|
||||
rather than via the glock.
|
||||
|
||||
Locking rules for glock operations:
|
||||
|
||||
Operation | GLF_LOCK bit lock held | gl_spin spinlock held
|
||||
-----------------------------------------------------------------
|
||||
go_xmote_th | Yes | No
|
||||
go_xmote_bh | Yes | No
|
||||
go_inval | Yes | No
|
||||
go_demote_ok | Sometimes | Yes
|
||||
go_lock | Yes | No
|
||||
go_unlock | Yes | No
|
||||
go_dump | Sometimes | Yes
|
||||
|
||||
N.B. Operations must not drop either the bit lock or the spinlock
|
||||
if its held on entry. go_dump and do_demote_ok must never block.
|
||||
Note that go_dump will only be called if the glock's state
|
||||
indicates that it is caching uptodate data.
|
||||
|
||||
Glock locking order within GFS2:
|
||||
|
||||
1. i_mutex (if required)
|
||||
2. Rename glock (for rename only)
|
||||
3. Inode glock(s)
|
||||
(Parents before children, inodes at "same level" with same parent in
|
||||
lock number order)
|
||||
4. Rgrp glock(s) (for (de)allocation operations)
|
||||
5. Transaction glock (via gfs2_trans_begin) for non-read operations
|
||||
6. Page lock (always last, very important!)
|
||||
|
||||
There are two glocks per inode. One deals with access to the inode
|
||||
itself (locking order as above), and the other, known as the iopen
|
||||
glock is used in conjunction with the i_nlink field in the inode to
|
||||
determine the lifetime of the inode in question. Locking of inodes
|
||||
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
|
||||
|
@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
|
||||
Of some interest is the introduction of the /proc/irq directory to 2.4.
|
||||
It could be used to set IRQ to CPU affinity, this means that you can "hook" an
|
||||
IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
|
||||
irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask
|
||||
irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
|
||||
prof_cpu_mask.
|
||||
|
||||
For example
|
||||
> ls /proc/irq/
|
||||
0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
|
||||
1 11 13 15 17 19 3 5 7 9
|
||||
1 11 13 15 17 19 3 5 7 9 default_smp_affinity
|
||||
> ls /proc/irq/0/
|
||||
smp_affinity
|
||||
|
||||
The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ
|
||||
is the same by default:
|
||||
smp_affinity is a bitmask, in which you can specify which CPUs can handle the
|
||||
IRQ, you can set it by doing:
|
||||
|
||||
> cat /proc/irq/0/smp_affinity
|
||||
> echo 1 > /proc/irq/10/smp_affinity
|
||||
|
||||
This means that only the first CPU will handle the IRQ, but you can also echo
|
||||
5 which means that only the first and fourth CPU can handle the IRQ.
|
||||
|
||||
The contents of each smp_affinity file is the same by default:
|
||||
|
||||
> cat /proc/irq/0/smp_affinity
|
||||
ffffffff
|
||||
|
||||
It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can
|
||||
set it by doing:
|
||||
The default_smp_affinity mask applies to all non-active IRQs, which are the
|
||||
IRQs which have not yet been allocated/activated, and hence which lack a
|
||||
/proc/irq/[0-9]* directory.
|
||||
|
||||
> echo 1 > /proc/irq/prof_cpu_mask
|
||||
|
||||
This means that only the first CPU will handle the IRQ, but you can also echo 5
|
||||
which means that only the first and fourth CPU can handle the IRQ.
|
||||
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
|
||||
profiler. Default value is ffffffff (all cpus).
|
||||
|
||||
The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
|
||||
between all the CPUs which are allowed to handle it. As usual the kernel has
|
||||
|
164
Documentation/filesystems/ubifs.txt
Normal file
164
Documentation/filesystems/ubifs.txt
Normal file
@ -0,0 +1,164 @@
|
||||
Introduction
|
||||
=============
|
||||
|
||||
UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
|
||||
Block Images". UBIFS is a flash file system, which means it is designed
|
||||
to work with flash devices. It is important to understand, that UBIFS
|
||||
is completely different to any traditional file-system in Linux, like
|
||||
Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems
|
||||
which work with MTD devices, not block devices. The other Linux
|
||||
file-system of this class is JFFS2.
|
||||
|
||||
To make it more clear, here is a small comparison of MTD devices and
|
||||
block devices.
|
||||
|
||||
1 MTD devices represent flash devices and they consist of eraseblocks of
|
||||
rather large size, typically about 128KiB. Block devices consist of
|
||||
small blocks, typically 512 bytes.
|
||||
2 MTD devices support 3 main operations - read from some offset within an
|
||||
eraseblock, write to some offset within an eraseblock, and erase a whole
|
||||
eraseblock. Block devices support 2 main operations - read a whole
|
||||
block and write a whole block.
|
||||
3 The whole eraseblock has to be erased before it becomes possible to
|
||||
re-write its contents. Blocks may be just re-written.
|
||||
4 Eraseblocks become worn out after some number of erase cycles -
|
||||
typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC
|
||||
NAND flashes. Blocks do not have the wear-out property.
|
||||
5 Eraseblocks may become bad (only on NAND flashes) and software should
|
||||
deal with this. Blocks on hard drives typically do not become bad,
|
||||
because hardware has mechanisms to substitute bad blocks, at least in
|
||||
modern LBA disks.
|
||||
|
||||
It should be quite obvious why UBIFS is very different to traditional
|
||||
file-systems.
|
||||
|
||||
UBIFS works on top of UBI. UBI is a separate software layer which may be
|
||||
found in drivers/mtd/ubi. UBI is basically a volume management and
|
||||
wear-leveling layer. It provides so called UBI volumes which is a higher
|
||||
level abstraction than a MTD device. The programming model of UBI devices
|
||||
is very similar to MTD devices - they still consist of large eraseblocks,
|
||||
they have read/write/erase operations, but UBI devices are devoid of
|
||||
limitations like wear and bad blocks (items 4 and 5 in the above list).
|
||||
|
||||
In a sense, UBIFS is a next generation of JFFS2 file-system, but it is
|
||||
very different and incompatible to JFFS2. The following are the main
|
||||
differences.
|
||||
|
||||
* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on
|
||||
top of UBI volumes.
|
||||
* JFFS2 does not have on-media index and has to build it while mounting,
|
||||
which requires full media scan. UBIFS maintains the FS indexing
|
||||
information on the flash media and does not require full media scan,
|
||||
so it mounts many times faster than JFFS2.
|
||||
* JFFS2 is a write-through file-system, while UBIFS supports write-back,
|
||||
which makes UBIFS much faster on writes.
|
||||
|
||||
Similarly to JFFS2, UBIFS supports on-the-flight compression which makes
|
||||
it possible to fit quite a lot of data to the flash.
|
||||
|
||||
Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts.
|
||||
It does not need stuff like ckfs.ext2. UBIFS automatically replays its
|
||||
journal and recovers from crashes, ensuring that the on-flash data
|
||||
structures are consistent.
|
||||
|
||||
UBIFS scales logarithmically (most of the data structures it uses are
|
||||
trees), so the mount time and memory consumption do not linearly depend
|
||||
on the flash size, like in case of JFFS2. This is because UBIFS
|
||||
maintains the FS index on the flash media. However, UBIFS depends on
|
||||
UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly.
|
||||
Nevertheless, UBI/UBIFS scales considerably better than JFFS2.
|
||||
|
||||
The authors of UBIFS believe, that it is possible to develop UBI2 which
|
||||
would scale logarithmically as well. UBI2 would support the same API as UBI,
|
||||
but it would be binary incompatible to UBI. So UBIFS would not need to be
|
||||
changed to use UBI2
|
||||
|
||||
|
||||
Mount options
|
||||
=============
|
||||
|
||||
(*) == default.
|
||||
|
||||
norm_unmount (*) commit on unmount; the journal is committed
|
||||
when the file-system is unmounted so that the
|
||||
next mount does not have to replay the journal
|
||||
and it becomes very fast;
|
||||
fast_unmount do not commit on unmount; this option makes
|
||||
unmount faster, but the next mount slower
|
||||
because of the need to replay the journal.
|
||||
|
||||
|
||||
Quick usage instructions
|
||||
========================
|
||||
|
||||
The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
|
||||
where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
|
||||
UBI volume name.
|
||||
|
||||
Mount volume 0 on UBI device 0 to /mnt/ubifs:
|
||||
$ mount -t ubifs ubi0_0 /mnt/ubifs
|
||||
|
||||
Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
|
||||
name):
|
||||
$ mount -t ubifs ubi0:rootfs /mnt/ubifs
|
||||
|
||||
The following is an example of the kernel boot arguments to attach mtd0
|
||||
to UBI and mount volume "rootfs":
|
||||
ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs
|
||||
|
||||
|
||||
Module Parameters for Debugging
|
||||
===============================
|
||||
|
||||
When UBIFS has been compiled with debugging enabled, there are 3 module
|
||||
parameters that are available to control aspects of testing and debugging.
|
||||
The parameters are unsigned integers where each bit controls an option.
|
||||
The parameters are:
|
||||
|
||||
debug_msgs Selects which debug messages to display, as follows:
|
||||
|
||||
Message Type Flag value
|
||||
|
||||
General messages 1
|
||||
Journal messages 2
|
||||
Mount messages 4
|
||||
Commit messages 8
|
||||
LEB search messages 16
|
||||
Budgeting messages 32
|
||||
Garbage collection messages 64
|
||||
Tree Node Cache (TNC) messages 128
|
||||
LEB properties (lprops) messages 256
|
||||
Input/output messages 512
|
||||
Log messages 1024
|
||||
Scan messages 2048
|
||||
Recovery messages 4096
|
||||
|
||||
debug_chks Selects extra checks that UBIFS can do while running:
|
||||
|
||||
Check Flag value
|
||||
|
||||
General checks 1
|
||||
Check Tree Node Cache (TNC) 2
|
||||
Check indexing tree size 4
|
||||
Check orphan area 8
|
||||
Check old indexing tree 16
|
||||
Check LEB properties (lprops) 32
|
||||
Check leaf nodes and inodes 64
|
||||
|
||||
debug_tsts Selects a mode of testing, as follows:
|
||||
|
||||
Test mode Flag value
|
||||
|
||||
Force in-the-gaps method 2
|
||||
Failure mode for recovery testing 4
|
||||
|
||||
For example, set debug_msgs to 5 to display General messages and Mount
|
||||
messages.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
UBIFS documentation and FAQ/HOWTO at the MTD web site:
|
||||
http://www.linux-mtd.infradead.org/doc/ubifs.html
|
||||
http://www.linux-mtd.infradead.org/faq/ubifs.html
|
1360
Documentation/ftrace.txt
Normal file
1360
Documentation/ftrace.txt
Normal file
File diff suppressed because it is too large
Load Diff
@ -1,47 +0,0 @@
|
||||
Kernel driver i2c-i810
|
||||
|
||||
Supported adapters:
|
||||
* Intel 82810, 82810-DC100, 82810E, and 82815 (GMCH)
|
||||
* Intel 82845G (GMCH)
|
||||
|
||||
Authors:
|
||||
Frodo Looijaard <frodol@dds.nl>,
|
||||
Philip Edelbrock <phil@netroedge.com>,
|
||||
Kyösti Mälkki <kmalkki@cc.hut.fi>,
|
||||
Ralph Metzler <rjkm@thp.uni-koeln.de>,
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>
|
||||
|
||||
Main contact: Mark Studebaker <mdsxyz123@yahoo.com>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
WARNING: If you have an '810' or '815' motherboard, your standard I2C
|
||||
temperature sensors are most likely on the 801's I2C bus. You want the
|
||||
i2c-i801 driver for those, not this driver.
|
||||
|
||||
Now for the i2c-i810...
|
||||
|
||||
The GMCH chip contains two I2C interfaces.
|
||||
|
||||
The first interface is used for DDC (Data Display Channel) which is a
|
||||
serial channel through the VGA monitor connector to a DDC-compliant
|
||||
monitor. This interface is defined by the Video Electronics Standards
|
||||
Association (VESA). The standards are available for purchase at
|
||||
http://www.vesa.org .
|
||||
|
||||
The second interface is a general-purpose I2C bus. It may be connected to a
|
||||
TV-out chip such as the BT869 or possibly to a digital flat-panel display.
|
||||
|
||||
Features
|
||||
--------
|
||||
|
||||
Both busses use the i2c-algo-bit driver for 'bit banging'
|
||||
and support for specific transactions is provided by i2c-algo-bit.
|
||||
|
||||
Issues
|
||||
------
|
||||
|
||||
If you enable bus testing in i2c-algo-bit (insmod i2c-algo-bit bit_test=1),
|
||||
the test may fail; if so, the i2c-i810 driver won't be inserted. However,
|
||||
we think this has been fixed.
|
@ -1,23 +0,0 @@
|
||||
Kernel driver i2c-prosavage
|
||||
|
||||
Supported adapters:
|
||||
|
||||
S3/VIA KM266/VT8375 aka ProSavage8
|
||||
S3/VIA KM133/VT8365 aka Savage4
|
||||
|
||||
Author: Henk Vergonet <henk@god.dyndns.org>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
The Savage4 chips contain two I2C interfaces (aka a I2C 'master' or
|
||||
'host').
|
||||
|
||||
The first interface is used for DDC (Data Display Channel) which is a
|
||||
serial channel through the VGA monitor connector to a DDC-compliant
|
||||
monitor. This interface is defined by the Video Electronics Standards
|
||||
Association (VESA). The standards are available for purchase at
|
||||
http://www.vesa.org . The second interface is a general-purpose I2C bus.
|
||||
|
||||
Usefull for gaining access to the TV Encoder chips.
|
||||
|
@ -1,26 +0,0 @@
|
||||
Kernel driver i2c-savage4
|
||||
|
||||
Supported adapters:
|
||||
* Savage4
|
||||
* Savage2000
|
||||
|
||||
Authors:
|
||||
Alexander Wold <awold@bigfoot.com>,
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
The Savage4 chips contain two I2C interfaces (aka a I2C 'master'
|
||||
or 'host').
|
||||
|
||||
The first interface is used for DDC (Data Display Channel) which is a
|
||||
serial channel through the VGA monitor connector to a DDC-compliant
|
||||
monitor. This interface is defined by the Video Electronics Standards
|
||||
Association (VESA). The standards are available for purchase at
|
||||
http://www.vesa.org . The DDC bus is not yet supported because its register
|
||||
is not directly memory-mapped.
|
||||
|
||||
The second interface is a general-purpose I2C bus. This is the only
|
||||
interface supported by the driver at the moment.
|
||||
|
@ -49,7 +49,7 @@ $ modprobe max6875 force=0,0x50
|
||||
|
||||
The MAX6874/MAX6875 ignores address bit 0, so this driver attaches to multiple
|
||||
addresses. For example, for address 0x50, it also reserves 0x51.
|
||||
The even-address instance is called 'max6875', the odd one is 'max6875 subclient'.
|
||||
The even-address instance is called 'max6875', the odd one is 'dummy'.
|
||||
|
||||
|
||||
Programming the chip using i2c-dev
|
||||
|
@ -7,7 +7,7 @@ drivers/gpio/pca9539.c instead.
|
||||
Supported chips:
|
||||
* Philips PCA9539
|
||||
Prefix: 'pca9539'
|
||||
Addresses scanned: 0x74 - 0x77
|
||||
Addresses scanned: none
|
||||
Datasheet:
|
||||
http://www.semiconductors.philips.com/acrobat/datasheets/PCA9539_2.pdf
|
||||
|
||||
@ -23,6 +23,14 @@ The input sense can also be inverted.
|
||||
The 16 lines are split between two bytes.
|
||||
|
||||
|
||||
Detection
|
||||
---------
|
||||
|
||||
The PCA9539 is difficult to detect and not commonly found in PC machines,
|
||||
so you have to pass the I2C bus and address of the installed PCA9539
|
||||
devices explicitly to the driver at load time via the force=... parameter.
|
||||
|
||||
|
||||
Sysfs entries
|
||||
-------------
|
||||
|
||||
|
@ -4,13 +4,13 @@ Kernel driver pcf8574
|
||||
Supported chips:
|
||||
* Philips PCF8574
|
||||
Prefix: 'pcf8574'
|
||||
Addresses scanned: I2C 0x20 - 0x27
|
||||
Addresses scanned: none
|
||||
Datasheet: Publicly available at the Philips Semiconductors website
|
||||
http://www.semiconductors.philips.com/pip/PCF8574P.html
|
||||
|
||||
* Philips PCF8574A
|
||||
Prefix: 'pcf8574a'
|
||||
Addresses scanned: I2C 0x38 - 0x3f
|
||||
Addresses scanned: none
|
||||
Datasheet: Publicly available at the Philips Semiconductors website
|
||||
http://www.semiconductors.philips.com/pip/PCF8574P.html
|
||||
|
||||
@ -38,12 +38,10 @@ For more informations see the datasheet.
|
||||
Accessing PCF8574(A) via /sys interface
|
||||
-------------------------------------
|
||||
|
||||
! Be careful !
|
||||
The PCF8574(A) is plainly impossible to detect ! Stupid chip.
|
||||
So every chip with address in the interval [20..27] and [38..3f] are
|
||||
detected as PCF8574(A). If you have other chips in this address
|
||||
range, the workaround is to load this module after the one
|
||||
for your others chips.
|
||||
So, you have to pass the I2C bus and address of the installed PCF857A
|
||||
and PCF8574A devices explicitly to the driver at load time via the
|
||||
force=... parameter.
|
||||
|
||||
On detection (i.e. insmod, modprobe et al.), directories are being
|
||||
created for each detected PCF8574(A):
|
||||
|
@ -40,12 +40,9 @@ Detection
|
||||
---------
|
||||
|
||||
There is no method known to detect whether a chip on a given I2C address is
|
||||
a PCF8575 or whether it is any other I2C device. So there are two alternatives
|
||||
to let the driver find the installed PCF8575 devices:
|
||||
- Load this driver after any other I2C driver for I2C devices with addresses
|
||||
in the range 0x20 .. 0x27.
|
||||
- Pass the I2C bus and address of the installed PCF8575 devices explicitly to
|
||||
the driver at load time via the probe=... or force=... parameters.
|
||||
a PCF8575 or whether it is any other I2C device, so you have to pass the I2C
|
||||
bus and address of the installed PCF8575 devices explicitly to the driver at
|
||||
load time via the force=... parameter.
|
||||
|
||||
/sys interface
|
||||
--------------
|
||||
|
127
Documentation/i2c/fault-codes
Normal file
127
Documentation/i2c/fault-codes
Normal file
@ -0,0 +1,127 @@
|
||||
This is a summary of the most important conventions for use of fault
|
||||
codes in the I2C/SMBus stack.
|
||||
|
||||
|
||||
A "Fault" is not always an "Error"
|
||||
----------------------------------
|
||||
Not all fault reports imply errors; "page faults" should be a familiar
|
||||
example. Software often retries idempotent operations after transient
|
||||
faults. There may be fancier recovery schemes that are appropriate in
|
||||
some cases, such as re-initializing (and maybe resetting). After such
|
||||
recovery, triggered by a fault report, there is no error.
|
||||
|
||||
In a similar way, sometimes a "fault" code just reports one defined
|
||||
result for an operation ... it doesn't indicate that anything is wrong
|
||||
at all, just that the outcome wasn't on the "golden path".
|
||||
|
||||
In short, your I2C driver code may need to know these codes in order
|
||||
to respond correctly. Other code may need to rely on YOUR code reporting
|
||||
the right fault code, so that it can (in turn) behave correctly.
|
||||
|
||||
|
||||
I2C and SMBus fault codes
|
||||
-------------------------
|
||||
These are returned as negative numbers from most calls, with zero or
|
||||
some positive number indicating a non-fault return. The specific
|
||||
numbers associated with these symbols differ between architectures,
|
||||
though most Linux systems use <asm-generic/errno*.h> numbering.
|
||||
|
||||
Note that the descriptions here are not exhaustive. There are other
|
||||
codes that may be returned, and other cases where these codes should
|
||||
be returned. However, drivers should not return other codes for these
|
||||
cases (unless the hardware doesn't provide unique fault reports).
|
||||
|
||||
Also, codes returned by adapter probe methods follow rules which are
|
||||
specific to their host bus (such as PCI, or the platform bus).
|
||||
|
||||
|
||||
EAGAIN
|
||||
Returned by I2C adapters when they lose arbitration in master
|
||||
transmit mode: some other master was transmitting different
|
||||
data at the same time.
|
||||
|
||||
Also returned when trying to invoke an I2C operation in an
|
||||
atomic context, when some task is already using that I2C bus
|
||||
to execute some other operation.
|
||||
|
||||
EBADMSG
|
||||
Returned by SMBus logic when an invalid Packet Error Code byte
|
||||
is received. This code is a CRC covering all bytes in the
|
||||
transaction, and is sent before the terminating STOP. This
|
||||
fault is only reported on read transactions; the SMBus slave
|
||||
may have a way to report PEC mismatches on writes from the
|
||||
host. Note that even if PECs are in use, you should not rely
|
||||
on these as the only way to detect incorrect data transfers.
|
||||
|
||||
EBUSY
|
||||
Returned by SMBus adapters when the bus was busy for longer
|
||||
than allowed. This usually indicates some device (maybe the
|
||||
SMBus adapter) needs some fault recovery (such as resetting),
|
||||
or that the reset was attempted but failed.
|
||||
|
||||
EINVAL
|
||||
This rather vague error means an invalid parameter has been
|
||||
detected before any I/O operation was started. Use a more
|
||||
specific fault code when you can.
|
||||
|
||||
One example would be a driver trying an SMBus Block Write
|
||||
with block size outside the range of 1-32 bytes.
|
||||
|
||||
EIO
|
||||
This rather vague error means something went wrong when
|
||||
performing an I/O operation. Use a more specific fault
|
||||
code when you can.
|
||||
|
||||
ENODEV
|
||||
Returned by driver probe() methods. This is a bit more
|
||||
specific than ENXIO, implying the problem isn't with the
|
||||
address, but with the device found there. Driver probes
|
||||
may verify the device returns *correct* responses, and
|
||||
return this as appropriate. (The driver core will warn
|
||||
about probe faults other than ENXIO and ENODEV.)
|
||||
|
||||
ENOMEM
|
||||
Returned by any component that can't allocate memory when
|
||||
it needs to do so.
|
||||
|
||||
ENXIO
|
||||
Returned by I2C adapters to indicate that the address phase
|
||||
of a transfer didn't get an ACK. While it might just mean
|
||||
an I2C device was temporarily not responding, usually it
|
||||
means there's nothing listening at that address.
|
||||
|
||||
Returned by driver probe() methods to indicate that they
|
||||
found no device to bind to. (ENODEV may also be used.)
|
||||
|
||||
EOPNOTSUPP
|
||||
Returned by an adapter when asked to perform an operation
|
||||
that it doesn't, or can't, support.
|
||||
|
||||
For example, this would be returned when an adapter that
|
||||
doesn't support SMBus block transfers is asked to execute
|
||||
one. In that case, the driver making that request should
|
||||
have verified that functionality was supported before it
|
||||
made that block transfer request.
|
||||
|
||||
Similarly, if an I2C adapter can't execute all legal I2C
|
||||
messages, it should return this when asked to perform a
|
||||
transaction it can't. (These limitations can't be seen in
|
||||
the adapter's functionality mask, since the assumption is
|
||||
that if an adapter supports I2C it supports all of I2C.)
|
||||
|
||||
EPROTO
|
||||
Returned when slave does not conform to the relevant I2C
|
||||
or SMBus (or chip-specific) protocol specifications. One
|
||||
case is when the length of an SMBus block data response
|
||||
(from the SMBus slave) is outside the range 1-32 bytes.
|
||||
|
||||
ETIMEDOUT
|
||||
This is returned by drivers when an operation took too much
|
||||
time, and was aborted before it completed.
|
||||
|
||||
SMBus adapters may return it when an operation took more
|
||||
time than allowed by the SMBus specification; for example,
|
||||
when a slave stretches clocks too far. I2C has no such
|
||||
timeouts, but it's normal for I2C adapters to impose some
|
||||
arbitrary limits (much longer than SMBus!) too.
|
||||
|
@ -42,8 +42,8 @@ Count (8 bits): A data byte containing the length of a block operation.
|
||||
[..]: Data sent by I2C device, as opposed to data sent by the host adapter.
|
||||
|
||||
|
||||
SMBus Quick Command: i2c_smbus_write_quick()
|
||||
=============================================
|
||||
SMBus Quick Command
|
||||
===================
|
||||
|
||||
This sends a single bit to the device, at the place of the Rd/Wr bit.
|
||||
|
||||
|
@ -25,14 +25,29 @@ routines, and should be zero-initialized except for fields with data you
|
||||
provide. A client structure holds device-specific information like the
|
||||
driver model device node, and its I2C address.
|
||||
|
||||
/* iff driver uses driver model ("new style") binding model: */
|
||||
|
||||
static struct i2c_device_id foo_idtable[] = {
|
||||
{ "foo", my_id_for_foo },
|
||||
{ "bar", my_id_for_bar },
|
||||
{ }
|
||||
};
|
||||
|
||||
MODULE_DEVICE_TABLE(i2c, foo_idtable);
|
||||
|
||||
static struct i2c_driver foo_driver = {
|
||||
.driver = {
|
||||
.name = "foo",
|
||||
},
|
||||
|
||||
/* iff driver uses driver model ("new style") binding model: */
|
||||
.id_table = foo_ids,
|
||||
.probe = foo_probe,
|
||||
.remove = foo_remove,
|
||||
/* if device autodetection is needed: */
|
||||
.class = I2C_CLASS_SOMETHING,
|
||||
.detect = foo_detect,
|
||||
.address_data = &addr_data,
|
||||
|
||||
/* else, driver uses "legacy" binding model: */
|
||||
.attach_adapter = foo_attach_adapter,
|
||||
@ -173,10 +188,9 @@ handle may be used during foo_probe(). If foo_probe() reports success
|
||||
(zero not a negative status code) it may save the handle and use it until
|
||||
foo_remove() returns. That binding model is used by most Linux drivers.
|
||||
|
||||
Drivers match devices when i2c_client.driver_name and the driver name are
|
||||
the same; this approach is used in several other busses that don't have
|
||||
device typing support in the hardware. The driver and module name should
|
||||
match, so hotplug/coldplug mechanisms will modprobe the driver.
|
||||
The probe function is called when an entry in the id_table name field
|
||||
matches the device's name. It is passed the entry that was matched so
|
||||
the driver knows which one in the table matched.
|
||||
|
||||
|
||||
Device Creation (Standard driver model)
|
||||
@ -207,6 +221,31 @@ in the I2C bus driver. You may want to save the returned i2c_client
|
||||
reference for later use.
|
||||
|
||||
|
||||
Device Detection (Standard driver model)
|
||||
----------------------------------------
|
||||
|
||||
Sometimes you do not know in advance which I2C devices are connected to
|
||||
a given I2C bus. This is for example the case of hardware monitoring
|
||||
devices on a PC's SMBus. In that case, you may want to let your driver
|
||||
detect supported devices automatically. This is how the legacy model
|
||||
was working, and is now available as an extension to the standard
|
||||
driver model (so that we can finally get rid of the legacy model.)
|
||||
|
||||
You simply have to define a detect callback which will attempt to
|
||||
identify supported devices (returning 0 for supported ones and -ENODEV
|
||||
for unsupported ones), a list of addresses to probe, and a device type
|
||||
(or class) so that only I2C buses which may have that type of device
|
||||
connected (and not otherwise enumerated) will be probed. The i2c
|
||||
core will then call you back as needed and will instantiate a device
|
||||
for you for every successful detection.
|
||||
|
||||
Note that this mechanism is purely optional and not suitable for all
|
||||
devices. You need some reliable way to identify the supported devices
|
||||
(typically using device-specific, dedicated identification registers),
|
||||
otherwise misdetections are likely to occur and things can get wrong
|
||||
quickly.
|
||||
|
||||
|
||||
Device Deletion (Standard driver model)
|
||||
---------------------------------------
|
||||
|
||||
@ -559,7 +598,6 @@ SMBus communication
|
||||
in terms of it. Never use this function directly!
|
||||
|
||||
|
||||
extern s32 i2c_smbus_write_quick(struct i2c_client * client, u8 value);
|
||||
extern s32 i2c_smbus_read_byte(struct i2c_client * client);
|
||||
extern s32 i2c_smbus_write_byte(struct i2c_client * client, u8 value);
|
||||
extern s32 i2c_smbus_read_byte_data(struct i2c_client * client, u8 command);
|
||||
@ -568,30 +606,31 @@ SMBus communication
|
||||
extern s32 i2c_smbus_read_word_data(struct i2c_client * client, u8 command);
|
||||
extern s32 i2c_smbus_write_word_data(struct i2c_client * client,
|
||||
u8 command, u16 value);
|
||||
extern s32 i2c_smbus_read_block_data(struct i2c_client * client,
|
||||
u8 command, u8 *values);
|
||||
extern s32 i2c_smbus_write_block_data(struct i2c_client * client,
|
||||
u8 command, u8 length,
|
||||
u8 *values);
|
||||
extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client,
|
||||
u8 command, u8 length, u8 *values);
|
||||
|
||||
These ones were removed in Linux 2.6.10 because they had no users, but could
|
||||
be added back later if needed:
|
||||
|
||||
extern s32 i2c_smbus_read_block_data(struct i2c_client * client,
|
||||
u8 command, u8 *values);
|
||||
extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client,
|
||||
u8 command, u8 length,
|
||||
u8 *values);
|
||||
|
||||
These ones were removed from i2c-core because they had no users, but could
|
||||
be added back later if needed:
|
||||
|
||||
extern s32 i2c_smbus_write_quick(struct i2c_client * client, u8 value);
|
||||
extern s32 i2c_smbus_process_call(struct i2c_client * client,
|
||||
u8 command, u16 value);
|
||||
extern s32 i2c_smbus_block_process_call(struct i2c_client *client,
|
||||
u8 command, u8 length,
|
||||
u8 *values)
|
||||
|
||||
All these transactions return -1 on failure. The 'write' transactions
|
||||
return 0 on success; the 'read' transactions return the read value, except
|
||||
for read_block, which returns the number of values read. The block buffers
|
||||
need not be longer than 32 bytes.
|
||||
All these transactions return a negative errno value on failure. The 'write'
|
||||
transactions return 0 on success; the 'read' transactions return the read
|
||||
value, except for block transactions, which return the number of values
|
||||
read. The block buffers need not be longer than 32 bytes.
|
||||
|
||||
You can read the file `smbus-protocol' for more information about the
|
||||
actual SMBus protocol.
|
||||
|
@ -1,887 +0,0 @@
|
||||
THE LINUX/I386 BOOT PROTOCOL
|
||||
----------------------------
|
||||
|
||||
H. Peter Anvin <hpa@zytor.com>
|
||||
Last update 2007-05-23
|
||||
|
||||
On the i386 platform, the Linux kernel uses a rather complicated boot
|
||||
convention. This has evolved partially due to historical aspects, as
|
||||
well as the desire in the early days to have the kernel itself be a
|
||||
bootable image, the complicated PC memory model and due to changed
|
||||
expectations in the PC industry caused by the effective demise of
|
||||
real-mode DOS as a mainstream operating system.
|
||||
|
||||
Currently, the following versions of the Linux/i386 boot protocol exist.
|
||||
|
||||
Old kernels: zImage/Image support only. Some very early kernels
|
||||
may not even support a command line.
|
||||
|
||||
Protocol 2.00: (Kernel 1.3.73) Added bzImage and initrd support, as
|
||||
well as a formalized way to communicate between the
|
||||
boot loader and the kernel. setup.S made relocatable,
|
||||
although the traditional setup area still assumed
|
||||
writable.
|
||||
|
||||
Protocol 2.01: (Kernel 1.3.76) Added a heap overrun warning.
|
||||
|
||||
Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol.
|
||||
Lower the conventional memory ceiling. No overwrite
|
||||
of the traditional setup area, thus making booting
|
||||
safe for systems which use the EBDA from SMM or 32-bit
|
||||
BIOS entry points. zImage deprecated but still
|
||||
supported.
|
||||
|
||||
Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible
|
||||
initrd address available to the bootloader.
|
||||
|
||||
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
|
||||
|
||||
Protocol 2.05: (Kernel 2.6.20) Make protected mode kernel relocatable.
|
||||
Introduce relocatable_kernel and kernel_alignment fields.
|
||||
|
||||
Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
|
||||
the boot command line.
|
||||
|
||||
Protocol 2.07: (Kernel 2.6.24) Added paravirtualised boot protocol.
|
||||
Introduced hardware_subarch and hardware_subarch_data
|
||||
and KEEP_SEGMENTS flag in load_flags.
|
||||
|
||||
Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format
|
||||
payload. Introduced payload_offset and payload length
|
||||
fields to aid in locating the payload.
|
||||
|
||||
Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical
|
||||
pointer to single linked list of struct setup_data.
|
||||
|
||||
**** MEMORY LAYOUT
|
||||
|
||||
The traditional memory map for the kernel loader, used for Image or
|
||||
zImage kernels, typically looks like:
|
||||
|
||||
| |
|
||||
0A0000 +------------------------+
|
||||
| Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
|
||||
09A000 +------------------------+
|
||||
| Command line |
|
||||
| Stack/heap | For use by the kernel real-mode code.
|
||||
098000 +------------------------+
|
||||
| Kernel setup | The kernel real-mode code.
|
||||
090200 +------------------------+
|
||||
| Kernel boot sector | The kernel legacy boot sector.
|
||||
090000 +------------------------+
|
||||
| Protected-mode kernel | The bulk of the kernel image.
|
||||
010000 +------------------------+
|
||||
| Boot loader | <- Boot sector entry point 0000:7C00
|
||||
001000 +------------------------+
|
||||
| Reserved for MBR/BIOS |
|
||||
000800 +------------------------+
|
||||
| Typically used by MBR |
|
||||
000600 +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
|
||||
When using bzImage, the protected-mode kernel was relocated to
|
||||
0x100000 ("high memory"), and the kernel real-mode block (boot sector,
|
||||
setup, and stack/heap) was made relocatable to any address between
|
||||
0x10000 and end of low memory. Unfortunately, in protocols 2.00 and
|
||||
2.01 the 0x90000+ memory range is still used internally by the kernel;
|
||||
the 2.02 protocol resolves that problem.
|
||||
|
||||
It is desirable to keep the "memory ceiling" -- the highest point in
|
||||
low memory touched by the boot loader -- as low as possible, since
|
||||
some newer BIOSes have begun to allocate some rather large amounts of
|
||||
memory, called the Extended BIOS Data Area, near the top of low
|
||||
memory. The boot loader should use the "INT 12h" BIOS call to verify
|
||||
how much low memory is available.
|
||||
|
||||
Unfortunately, if INT 12h reports that the amount of memory is too
|
||||
low, there is usually nothing the boot loader can do but to report an
|
||||
error to the user. The boot loader should therefore be designed to
|
||||
take up as little space in low memory as it reasonably can. For
|
||||
zImage or old bzImage kernels, which need data written into the
|
||||
0x90000 segment, the boot loader should make sure not to use memory
|
||||
above the 0x9A000 point; too many BIOSes will break above that point.
|
||||
|
||||
For a modern bzImage kernel with boot protocol version >= 2.02, a
|
||||
memory layout like the following is suggested:
|
||||
|
||||
~ ~
|
||||
| Protected-mode kernel |
|
||||
100000 +------------------------+
|
||||
| I/O memory hole |
|
||||
0A0000 +------------------------+
|
||||
| Reserved for BIOS | Leave as much as possible unused
|
||||
~ ~
|
||||
| Command line | (Can also be below the X+10000 mark)
|
||||
X+10000 +------------------------+
|
||||
| Stack/heap | For use by the kernel real-mode code.
|
||||
X+08000 +------------------------+
|
||||
| Kernel setup | The kernel real-mode code.
|
||||
| Kernel boot sector | The kernel legacy boot sector.
|
||||
X +------------------------+
|
||||
| Boot loader | <- Boot sector entry point 0000:7C00
|
||||
001000 +------------------------+
|
||||
| Reserved for MBR/BIOS |
|
||||
000800 +------------------------+
|
||||
| Typically used by MBR |
|
||||
000600 +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
... where the address X is as low as the design of the boot loader
|
||||
permits.
|
||||
|
||||
|
||||
**** THE REAL-MODE KERNEL HEADER
|
||||
|
||||
In the following text, and anywhere in the kernel boot sequence, "a
|
||||
sector" refers to 512 bytes. It is independent of the actual sector
|
||||
size of the underlying medium.
|
||||
|
||||
The first step in loading a Linux kernel should be to load the
|
||||
real-mode code (boot sector and setup code) and then examine the
|
||||
following header at offset 0x01f1. The real-mode code can total up to
|
||||
32K, although the boot loader may choose to load only the first two
|
||||
sectors (1K) and then examine the bootup sector size.
|
||||
|
||||
The header looks like:
|
||||
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
01F1/1 ALL(1 setup_sects The size of the setup in sectors
|
||||
01F2/2 ALL root_flags If set, the root is mounted readonly
|
||||
01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras
|
||||
01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only
|
||||
01FA/2 ALL vid_mode Video mode control
|
||||
01FC/2 ALL root_dev Default root device number
|
||||
01FE/2 ALL boot_flag 0xAA55 magic number
|
||||
0200/2 2.00+ jump Jump instruction
|
||||
0202/4 2.00+ header Magic signature "HdrS"
|
||||
0206/2 2.00+ version Boot protocol version supported
|
||||
0208/4 2.00+ realmode_swtch Boot loader hook (see below)
|
||||
020C/2 2.00+ start_sys The load-low segment (0x1000) (obsolete)
|
||||
020E/2 2.00+ kernel_version Pointer to kernel version string
|
||||
0210/1 2.00+ type_of_loader Boot loader identifier
|
||||
0211/1 2.00+ loadflags Boot protocol option flags
|
||||
0212/2 2.00+ setup_move_size Move to high memory size (used with hooks)
|
||||
0214/4 2.00+ code32_start Boot loader hook (see below)
|
||||
0218/4 2.00+ ramdisk_image initrd load address (set by boot loader)
|
||||
021C/4 2.00+ ramdisk_size initrd size (set by boot loader)
|
||||
0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only
|
||||
0224/2 2.01+ heap_end_ptr Free memory after setup end
|
||||
0226/2 N/A pad1 Unused
|
||||
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
|
||||
022C/4 2.03+ initrd_addr_max Highest legal initrd address
|
||||
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
|
||||
0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not
|
||||
0235/3 N/A pad2 Unused
|
||||
0238/4 2.06+ cmdline_size Maximum size of the kernel command line
|
||||
023C/4 2.07+ hardware_subarch Hardware subarchitecture
|
||||
0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data
|
||||
0248/4 2.08+ payload_offset Offset of kernel payload
|
||||
024C/4 2.08+ payload_length Length of kernel payload
|
||||
0250/8 2.09+ setup_data 64-bit physical pointer to linked list
|
||||
of struct setup_data
|
||||
|
||||
(1) For backwards compatibility, if the setup_sects field contains 0, the
|
||||
real value is 4.
|
||||
|
||||
(2) For boot protocol prior to 2.04, the upper two bytes of the syssize
|
||||
field are unusable, which means the size of a bzImage kernel
|
||||
cannot be determined.
|
||||
|
||||
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
|
||||
the boot protocol version is "old". Loading an old kernel, the
|
||||
following parameters should be assumed:
|
||||
|
||||
Image type = zImage
|
||||
initrd not supported
|
||||
Real-mode kernel must be located at 0x90000.
|
||||
|
||||
Otherwise, the "version" field contains the protocol version,
|
||||
e.g. protocol version 2.01 will contain 0x0201 in this field. When
|
||||
setting fields in the header, you must make sure only to set fields
|
||||
supported by the protocol version in use.
|
||||
|
||||
|
||||
**** DETAILS OF HEADER FIELDS
|
||||
|
||||
For each field, some are information from the kernel to the bootloader
|
||||
("read"), some are expected to be filled out by the bootloader
|
||||
("write"), and some are expected to be read and modified by the
|
||||
bootloader ("modify").
|
||||
|
||||
All general purpose boot loaders should write the fields marked
|
||||
(obligatory). Boot loaders who want to load the kernel at a
|
||||
nonstandard address should fill in the fields marked (reloc); other
|
||||
boot loaders can ignore those fields.
|
||||
|
||||
The byte order of all fields is littleendian (this is x86, after all.)
|
||||
|
||||
Field name: setup_sects
|
||||
Type: read
|
||||
Offset/size: 0x1f1/1
|
||||
Protocol: ALL
|
||||
|
||||
The size of the setup code in 512-byte sectors. If this field is
|
||||
0, the real value is 4. The real-mode code consists of the boot
|
||||
sector (always one 512-byte sector) plus the setup code.
|
||||
|
||||
Field name: root_flags
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1f2/2
|
||||
Protocol: ALL
|
||||
|
||||
If this field is nonzero, the root defaults to readonly. The use of
|
||||
this field is deprecated; use the "ro" or "rw" options on the
|
||||
command line instead.
|
||||
|
||||
Field name: syssize
|
||||
Type: read
|
||||
Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
|
||||
Protocol: 2.04+
|
||||
|
||||
The size of the protected-mode code in units of 16-byte paragraphs.
|
||||
For protocol versions older than 2.04 this field is only two bytes
|
||||
wide, and therefore cannot be trusted for the size of a kernel if
|
||||
the LOAD_HIGH flag is set.
|
||||
|
||||
Field name: ram_size
|
||||
Type: kernel internal
|
||||
Offset/size: 0x1f8/2
|
||||
Protocol: ALL
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
Field name: vid_mode
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x1fa/2
|
||||
|
||||
Please see the section on SPECIAL COMMAND LINE OPTIONS.
|
||||
|
||||
Field name: root_dev
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1fc/2
|
||||
Protocol: ALL
|
||||
|
||||
The default root device device number. The use of this field is
|
||||
deprecated, use the "root=" option on the command line instead.
|
||||
|
||||
Field name: boot_flag
|
||||
Type: read
|
||||
Offset/size: 0x1fe/2
|
||||
Protocol: ALL
|
||||
|
||||
Contains 0xAA55. This is the closest thing old Linux kernels have
|
||||
to a magic number.
|
||||
|
||||
Field name: jump
|
||||
Type: read
|
||||
Offset/size: 0x200/2
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains an x86 jump instruction, 0xEB followed by a signed offset
|
||||
relative to byte 0x202. This can be used to determine the size of
|
||||
the header.
|
||||
|
||||
Field name: header
|
||||
Type: read
|
||||
Offset/size: 0x202/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains the magic number "HdrS" (0x53726448).
|
||||
|
||||
Field name: version
|
||||
Type: read
|
||||
Offset/size: 0x206/2
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains the boot protocol version, in (major << 8)+minor format,
|
||||
e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
|
||||
10.17.
|
||||
|
||||
Field name: readmode_swtch
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x208/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
Field name: start_sys
|
||||
Type: read
|
||||
Offset/size: 0x20c/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The load low segment (0x1000). Obsolete.
|
||||
|
||||
Field name: kernel_version
|
||||
Type: read
|
||||
Offset/size: 0x20e/2
|
||||
Protocol: 2.00+
|
||||
|
||||
If set to a nonzero value, contains a pointer to a NUL-terminated
|
||||
human-readable kernel version number string, less 0x200. This can
|
||||
be used to display the kernel version to the user. This value
|
||||
should be less than (0x200*setup_sects).
|
||||
|
||||
For example, if this value is set to 0x1c00, the kernel version
|
||||
number string can be found at offset 0x1e00 in the kernel file.
|
||||
This is a valid value if and only if the "setup_sects" field
|
||||
contains the value 15 or higher, as:
|
||||
|
||||
0x1c00 < 15*0x200 (= 0x1e00) but
|
||||
0x1c00 >= 14*0x200 (= 0x1c00)
|
||||
|
||||
0x1c00 >> 9 = 14, so the minimum value for setup_secs is 15.
|
||||
|
||||
Field name: type_of_loader
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x210/1
|
||||
Protocol: 2.00+
|
||||
|
||||
If your boot loader has an assigned id (see table below), enter
|
||||
0xTV here, where T is an identifier for the boot loader and V is
|
||||
a version number. Otherwise, enter 0xFF here.
|
||||
|
||||
Assigned boot loader ids:
|
||||
0 LILO (0x00 reserved for pre-2.00 bootloader)
|
||||
1 Loadlin
|
||||
2 bootsect-loader (0x20, all other values reserved)
|
||||
3 SYSLINUX
|
||||
4 EtherBoot
|
||||
5 ELILO
|
||||
7 GRuB
|
||||
8 U-BOOT
|
||||
9 Xen
|
||||
A Gujin
|
||||
B Qemu
|
||||
|
||||
Please contact <hpa@zytor.com> if you need a bootloader ID
|
||||
value assigned.
|
||||
|
||||
Field name: loadflags
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x211/1
|
||||
Protocol: 2.00+
|
||||
|
||||
This field is a bitmask.
|
||||
|
||||
Bit 0 (read): LOADED_HIGH
|
||||
- If 0, the protected-mode code is loaded at 0x10000.
|
||||
- If 1, the protected-mode code is loaded at 0x100000.
|
||||
|
||||
Bit 6 (write): KEEP_SEGMENTS
|
||||
Protocol: 2.07+
|
||||
- if 0, reload the segment registers in the 32bit entry point.
|
||||
- if 1, do not reload the segment registers in the 32bit entry point.
|
||||
Assume that %cs %ds %ss %es are all set to flat segments with
|
||||
a base of 0 (or the equivalent for their environment).
|
||||
|
||||
Bit 7 (write): CAN_USE_HEAP
|
||||
Set this bit to 1 to indicate that the value entered in the
|
||||
heap_end_ptr is valid. If this field is clear, some setup code
|
||||
functionality will be disabled.
|
||||
|
||||
Field name: setup_move_size
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x212/2
|
||||
Protocol: 2.00-2.01
|
||||
|
||||
When using protocol 2.00 or 2.01, if the real mode kernel is not
|
||||
loaded at 0x90000, it gets moved there later in the loading
|
||||
sequence. Fill in this field if you want additional data (such as
|
||||
the kernel command line) moved in addition to the real-mode kernel
|
||||
itself.
|
||||
|
||||
The unit is bytes starting with the beginning of the boot sector.
|
||||
|
||||
This field is can be ignored when the protocol is 2.02 or higher, or
|
||||
if the real-mode code is loaded at 0x90000.
|
||||
|
||||
Field name: code32_start
|
||||
Type: modify (optional, reloc)
|
||||
Offset/size: 0x214/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The address to jump to in protected mode. This defaults to the load
|
||||
address of the kernel, and can be used by the boot loader to
|
||||
determine the proper load address.
|
||||
|
||||
This field can be modified for two purposes:
|
||||
|
||||
1. as a boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
2. if a bootloader which does not install a hook loads a
|
||||
relocatable kernel at a nonstandard address it will have to modify
|
||||
this field to point to the load address.
|
||||
|
||||
Field name: ramdisk_image
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x218/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The 32-bit linear address of the initial ramdisk or ramfs. Leave at
|
||||
zero if there is no initial ramdisk/ramfs.
|
||||
|
||||
Field name: ramdisk_size
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x21c/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Size of the initial ramdisk or ramfs. Leave at zero if there is no
|
||||
initial ramdisk/ramfs.
|
||||
|
||||
Field name: bootsect_kludge
|
||||
Type: kernel internal
|
||||
Offset/size: 0x220/4
|
||||
Protocol: 2.00+
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
Field name: heap_end_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x224/2
|
||||
Protocol: 2.01+
|
||||
|
||||
Set this field to the offset (from the beginning of the real-mode
|
||||
code) of the end of the setup stack/heap, minus 0x0200.
|
||||
|
||||
Field name: cmd_line_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x228/4
|
||||
Protocol: 2.02+
|
||||
|
||||
Set this field to the linear address of the kernel command line.
|
||||
The kernel command line can be located anywhere between the end of
|
||||
the setup heap and 0xA0000; it does not have to be located in the
|
||||
same 64K segment as the real-mode code itself.
|
||||
|
||||
Fill in this field even if your boot loader does not support a
|
||||
command line, in which case you can point this to an empty string
|
||||
(or better yet, to the string "auto".) If this field is left at
|
||||
zero, the kernel will assume that your boot loader does not support
|
||||
the 2.02+ protocol.
|
||||
|
||||
Field name: initrd_addr_max
|
||||
Type: read
|
||||
Offset/size: 0x22c/4
|
||||
Protocol: 2.03+
|
||||
|
||||
The maximum address that may be occupied by the initial
|
||||
ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this
|
||||
field is not present, and the maximum address is 0x37FFFFFF. (This
|
||||
address is defined as the address of the highest safe byte, so if
|
||||
your ramdisk is exactly 131072 bytes long and this field is
|
||||
0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
|
||||
|
||||
Field name: kernel_alignment
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x230/4
|
||||
Protocol: 2.05+
|
||||
|
||||
Alignment unit required by the kernel (if relocatable_kernel is true.)
|
||||
|
||||
Field name: relocatable_kernel
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x234/1
|
||||
Protocol: 2.05+
|
||||
|
||||
If this field is nonzero, the protected-mode part of the kernel can
|
||||
be loaded at any address that satisfies the kernel_alignment field.
|
||||
After loading, the boot loader must set the code32_start field to
|
||||
point to the loaded code, or to a boot loader hook.
|
||||
|
||||
Field name: cmdline_size
|
||||
Type: read
|
||||
Offset/size: 0x238/4
|
||||
Protocol: 2.06+
|
||||
|
||||
The maximum size of the command line without the terminating
|
||||
zero. This means that the command line can contain at most
|
||||
cmdline_size characters. With protocol version 2.05 and earlier, the
|
||||
maximum size was 255.
|
||||
|
||||
Field name: hardware_subarch
|
||||
Type: write
|
||||
Offset/size: 0x23c/4
|
||||
Protocol: 2.07+
|
||||
|
||||
In a paravirtualized environment the hardware low level architectural
|
||||
pieces such as interrupt handling, page table handling, and
|
||||
accessing process control registers needs to be done differently.
|
||||
|
||||
This field allows the bootloader to inform the kernel we are in one
|
||||
one of those environments.
|
||||
|
||||
0x00000000 The default x86/PC environment
|
||||
0x00000001 lguest
|
||||
0x00000002 Xen
|
||||
|
||||
Field name: hardware_subarch_data
|
||||
Type: write
|
||||
Offset/size: 0x240/8
|
||||
Protocol: 2.07+
|
||||
|
||||
A pointer to data that is specific to hardware subarch
|
||||
|
||||
Field name: payload_offset
|
||||
Type: read
|
||||
Offset/size: 0x248/4
|
||||
Protocol: 2.08+
|
||||
|
||||
If non-zero then this field contains the offset from the end of the
|
||||
real-mode code to the payload.
|
||||
|
||||
The payload may be compressed. The format of both the compressed and
|
||||
uncompressed data should be determined using the standard magic
|
||||
numbers. Currently only gzip compressed ELF is used.
|
||||
|
||||
Field name: payload_length
|
||||
Type: read
|
||||
Offset/size: 0x24c/4
|
||||
Protocol: 2.08+
|
||||
|
||||
The length of the payload.
|
||||
|
||||
**** THE IMAGE CHECKSUM
|
||||
|
||||
From boot protocol version 2.08 onwards the CRC-32 is calculated over
|
||||
the entire file using the characteristic polynomial 0x04C11DB7 and an
|
||||
initial remainder of 0xffffffff. The checksum is appended to the
|
||||
file; therefore the CRC of the file up to the limit specified in the
|
||||
syssize field of the header is always 0.
|
||||
|
||||
**** THE KERNEL COMMAND LINE
|
||||
|
||||
The kernel command line has become an important way for the boot
|
||||
loader to communicate with the kernel. Some of its options are also
|
||||
relevant to the boot loader itself, see "special command line options"
|
||||
below.
|
||||
|
||||
The kernel command line is a null-terminated string. The maximum
|
||||
length can be retrieved from the field cmdline_size. Before protocol
|
||||
version 2.06, the maximum was 255 characters. A string that is too
|
||||
long will be automatically truncated by the kernel.
|
||||
|
||||
If the boot protocol version is 2.02 or later, the address of the
|
||||
kernel command line is given by the header field cmd_line_ptr (see
|
||||
above.) This address can be anywhere between the end of the setup
|
||||
heap and 0xA0000.
|
||||
|
||||
If the protocol version is *not* 2.02 or higher, the kernel
|
||||
command line is entered using the following protocol:
|
||||
|
||||
At offset 0x0020 (word), "cmd_line_magic", enter the magic
|
||||
number 0xA33F.
|
||||
|
||||
At offset 0x0022 (word), "cmd_line_offset", enter the offset
|
||||
of the kernel command line (relative to the start of the
|
||||
real-mode kernel).
|
||||
|
||||
The kernel command line *must* be within the memory region
|
||||
covered by setup_move_size, so you may need to adjust this
|
||||
field.
|
||||
|
||||
Field name: setup_data
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x250/8
|
||||
Protocol: 2.09+
|
||||
|
||||
The 64-bit physical pointer to NULL terminated single linked list of
|
||||
struct setup_data. This is used to define a more extensible boot
|
||||
parameters passing mechanism. The definition of struct setup_data is
|
||||
as follow:
|
||||
|
||||
struct setup_data {
|
||||
u64 next;
|
||||
u32 type;
|
||||
u32 len;
|
||||
u8 data[0];
|
||||
};
|
||||
|
||||
Where, the next is a 64-bit physical pointer to the next node of
|
||||
linked list, the next field of the last node is 0; the type is used
|
||||
to identify the contents of data; the len is the length of data
|
||||
field; the data holds the real payload.
|
||||
|
||||
|
||||
**** MEMORY LAYOUT OF THE REAL-MODE CODE
|
||||
|
||||
The real-mode code requires a stack/heap to be set up, as well as
|
||||
memory allocated for the kernel command line. This needs to be done
|
||||
in the real-mode accessible memory in bottom megabyte.
|
||||
|
||||
It should be noted that modern machines often have a sizable Extended
|
||||
BIOS Data Area (EBDA). As a result, it is advisable to use as little
|
||||
of the low megabyte as possible.
|
||||
|
||||
Unfortunately, under the following circumstances the 0x90000 memory
|
||||
segment has to be used:
|
||||
|
||||
- When loading a zImage kernel ((loadflags & 0x01) == 0).
|
||||
- When loading a 2.01 or earlier boot protocol kernel.
|
||||
|
||||
-> For the 2.00 and 2.01 boot protocols, the real-mode code
|
||||
can be loaded at another address, but it is internally
|
||||
relocated to 0x90000. For the "old" protocol, the
|
||||
real-mode code must be loaded at 0x90000.
|
||||
|
||||
When loading at 0x90000, avoid using memory above 0x9a000.
|
||||
|
||||
For boot protocol 2.02 or higher, the command line does not have to be
|
||||
located in the same 64K segment as the real-mode setup code; it is
|
||||
thus permitted to give the stack/heap the full 64K segment and locate
|
||||
the command line above it.
|
||||
|
||||
The kernel command line should not be located below the real-mode
|
||||
code, nor should it be located in high memory.
|
||||
|
||||
|
||||
**** SAMPLE BOOT CONFIGURATION
|
||||
|
||||
As a sample configuration, assume the following layout of the real
|
||||
mode segment:
|
||||
|
||||
When loading below 0x90000, use the entire segment:
|
||||
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0xdfff Stack and heap
|
||||
0xe000-0xffff Kernel command line
|
||||
|
||||
When loading at 0x90000 OR the protocol version is 2.01 or earlier:
|
||||
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0x97ff Stack and heap
|
||||
0x9800-0x9fff Kernel command line
|
||||
|
||||
Such a boot loader should enter the following fields in the header:
|
||||
|
||||
unsigned long base_ptr; /* base address for real-mode segment */
|
||||
|
||||
if ( setup_sects == 0 ) {
|
||||
setup_sects = 4;
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0200 ) {
|
||||
type_of_loader = <type code>;
|
||||
if ( loading_initrd ) {
|
||||
ramdisk_image = <initrd_address>;
|
||||
ramdisk_size = <initrd_size>;
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0202 && loadflags & 0x01 )
|
||||
heap_end = 0xe000;
|
||||
else
|
||||
heap_end = 0x9800;
|
||||
|
||||
if ( protocol >= 0x0201 ) {
|
||||
heap_end_ptr = heap_end - 0x200;
|
||||
loadflags |= 0x80; /* CAN_USE_HEAP */
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0202 ) {
|
||||
cmd_line_ptr = base_ptr + heap_end;
|
||||
strcpy(cmd_line_ptr, cmdline);
|
||||
} else {
|
||||
cmd_line_magic = 0xA33F;
|
||||
cmd_line_offset = heap_end;
|
||||
setup_move_size = heap_end + strlen(cmdline)+1;
|
||||
strcpy(base_ptr+cmd_line_offset, cmdline);
|
||||
}
|
||||
} else {
|
||||
/* Very old kernel */
|
||||
|
||||
heap_end = 0x9800;
|
||||
|
||||
cmd_line_magic = 0xA33F;
|
||||
cmd_line_offset = heap_end;
|
||||
|
||||
/* A very old kernel MUST have its real-mode code
|
||||
loaded at 0x90000 */
|
||||
|
||||
if ( base_ptr != 0x90000 ) {
|
||||
/* Copy the real-mode kernel */
|
||||
memcpy(0x90000, base_ptr, (setup_sects+1)*512);
|
||||
base_ptr = 0x90000; /* Relocated */
|
||||
}
|
||||
|
||||
strcpy(0x90000+cmd_line_offset, cmdline);
|
||||
|
||||
/* It is recommended to clear memory up to the 32K mark */
|
||||
memset(0x90000 + (setup_sects+1)*512, 0,
|
||||
(64-(setup_sects+1))*512);
|
||||
}
|
||||
|
||||
|
||||
**** LOADING THE REST OF THE KERNEL
|
||||
|
||||
The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
|
||||
in the kernel file (again, if setup_sects == 0 the real value is 4.)
|
||||
It should be loaded at address 0x10000 for Image/zImage kernels and
|
||||
0x100000 for bzImage kernels.
|
||||
|
||||
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
|
||||
bit (LOAD_HIGH) in the loadflags field is set:
|
||||
|
||||
is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
|
||||
load_address = is_bzImage ? 0x100000 : 0x10000;
|
||||
|
||||
Note that Image/zImage kernels can be up to 512K in size, and thus use
|
||||
the entire 0x10000-0x90000 range of memory. This means it is pretty
|
||||
much a requirement for these kernels to load the real-mode part at
|
||||
0x90000. bzImage kernels allow much more flexibility.
|
||||
|
||||
|
||||
**** SPECIAL COMMAND LINE OPTIONS
|
||||
|
||||
If the command line provided by the boot loader is entered by the
|
||||
user, the user may expect the following command line options to work.
|
||||
They should normally not be deleted from the kernel command line even
|
||||
though not all of them are actually meaningful to the kernel. Boot
|
||||
loader authors who need additional command line options for the boot
|
||||
loader itself should get them registered in
|
||||
Documentation/kernel-parameters.txt to make sure they will not
|
||||
conflict with actual kernel options now or in the future.
|
||||
|
||||
vga=<mode>
|
||||
<mode> here is either an integer (in C notation, either
|
||||
decimal, octal, or hexadecimal) or one of the strings
|
||||
"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
|
||||
(meaning 0xFFFD). This value should be entered into the
|
||||
vid_mode field, as it is used by the kernel before the command
|
||||
line is parsed.
|
||||
|
||||
mem=<size>
|
||||
<size> is an integer in C notation optionally followed by
|
||||
(case insensitive) K, M, G, T, P or E (meaning << 10, << 20,
|
||||
<< 30, << 40, << 50 or << 60). This specifies the end of
|
||||
memory to the kernel. This affects the possible placement of
|
||||
an initrd, since an initrd should be placed near end of
|
||||
memory. Note that this is an option to *both* the kernel and
|
||||
the bootloader!
|
||||
|
||||
initrd=<file>
|
||||
An initrd should be loaded. The meaning of <file> is
|
||||
obviously bootloader-dependent, and some boot loaders
|
||||
(e.g. LILO) do not have such a command.
|
||||
|
||||
In addition, some boot loaders add the following options to the
|
||||
user-specified command line:
|
||||
|
||||
BOOT_IMAGE=<file>
|
||||
The boot image which was loaded. Again, the meaning of <file>
|
||||
is obviously bootloader-dependent.
|
||||
|
||||
auto
|
||||
The kernel was booted without explicit user intervention.
|
||||
|
||||
If these options are added by the boot loader, it is highly
|
||||
recommended that they are located *first*, before the user-specified
|
||||
or configuration-specified command line. Otherwise, "init=/bin/sh"
|
||||
gets confused by the "auto" option.
|
||||
|
||||
|
||||
**** RUNNING THE KERNEL
|
||||
|
||||
The kernel is started by jumping to the kernel entry point, which is
|
||||
located at *segment* offset 0x20 from the start of the real mode
|
||||
kernel. This means that if you loaded your real-mode kernel code at
|
||||
0x90000, the kernel entry point is 9020:0000.
|
||||
|
||||
At entry, ds = es = ss should point to the start of the real-mode
|
||||
kernel code (0x9000 if the code is loaded at 0x90000), sp should be
|
||||
set up properly, normally pointing to the top of the heap, and
|
||||
interrupts should be disabled. Furthermore, to guard against bugs in
|
||||
the kernel, it is recommended that the boot loader sets fs = gs = ds =
|
||||
es = ss.
|
||||
|
||||
In our example from above, we would do:
|
||||
|
||||
/* Note: in the case of the "old" kernel protocol, base_ptr must
|
||||
be == 0x90000 at this point; see the previous sample code */
|
||||
|
||||
seg = base_ptr >> 4;
|
||||
|
||||
cli(); /* Enter with interrupts disabled! */
|
||||
|
||||
/* Set up the real-mode kernel stack */
|
||||
_SS = seg;
|
||||
_SP = heap_end;
|
||||
|
||||
_DS = _ES = _FS = _GS = seg;
|
||||
jmp_far(seg+0x20, 0); /* Run the kernel */
|
||||
|
||||
If your boot sector accesses a floppy drive, it is recommended to
|
||||
switch off the floppy motor before running the kernel, since the
|
||||
kernel boot leaves interrupts off and thus the motor will not be
|
||||
switched off, especially if the loaded kernel has the floppy driver as
|
||||
a demand-loaded module!
|
||||
|
||||
|
||||
**** ADVANCED BOOT LOADER HOOKS
|
||||
|
||||
If the boot loader runs in a particularly hostile environment (such as
|
||||
LOADLIN, which runs under DOS) it may be impossible to follow the
|
||||
standard memory location requirements. Such a boot loader may use the
|
||||
following hooks that, if set, are invoked by the kernel at the
|
||||
appropriate time. The use of these hooks should probably be
|
||||
considered an absolutely last resort!
|
||||
|
||||
IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
|
||||
%edi across invocation.
|
||||
|
||||
realmode_swtch:
|
||||
A 16-bit real mode far subroutine invoked immediately before
|
||||
entering protected mode. The default routine disables NMI, so
|
||||
your routine should probably do so, too.
|
||||
|
||||
code32_start:
|
||||
A 32-bit flat-mode routine *jumped* to immediately after the
|
||||
transition to protected mode, but before the kernel is
|
||||
uncompressed. No segments, except CS, are guaranteed to be
|
||||
set up (current kernels do, but older ones do not); you should
|
||||
set them up to BOOT_DS (0x18) yourself.
|
||||
|
||||
After completing your hook, you should jump to the address
|
||||
that was in this field before your boot loader overwrote it
|
||||
(relocated, if appropriate.)
|
||||
|
||||
|
||||
**** 32-bit BOOT PROTOCOL
|
||||
|
||||
For machine with some new BIOS other than legacy BIOS, such as EFI,
|
||||
LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
|
||||
based on legacy BIOS can not be used, so a 32-bit boot protocol needs
|
||||
to be defined.
|
||||
|
||||
In 32-bit boot protocol, the first step in loading a Linux kernel
|
||||
should be to setup the boot parameters (struct boot_params,
|
||||
traditionally known as "zero page"). The memory for struct boot_params
|
||||
should be allocated and initialized to all zero. Then the setup header
|
||||
from offset 0x01f1 of kernel image on should be loaded into struct
|
||||
boot_params and examined. The end of setup header can be calculated as
|
||||
follow:
|
||||
|
||||
0x0202 + byte value at offset 0x0201
|
||||
|
||||
In addition to read/modify/write the setup header of the struct
|
||||
boot_params as that of 16-bit boot protocol, the boot loader should
|
||||
also fill the additional fields of the struct boot_params as that
|
||||
described in zero-page.txt.
|
||||
|
||||
After setupping the struct boot_params, the boot loader can load the
|
||||
32/64-bit kernel in the same way as that of 16-bit boot protocol.
|
||||
|
||||
In 32-bit boot protocol, the kernel is started by jumping to the
|
||||
32-bit kernel entry point, which is the start address of loaded
|
||||
32/64-bit kernel.
|
||||
|
||||
At entry, the CPU must be in 32-bit protected mode with paging
|
||||
disabled; a GDT must be loaded with the descriptors for selectors
|
||||
__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
|
||||
segment; __BOOS_CS must have execute/read permission, and __BOOT_DS
|
||||
must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
|
||||
must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
|
||||
address of the struct boot_params; %ebp, %edi and %ebx must be zero.
|
@ -117,6 +117,7 @@ Code Seq# Include File Comments
|
||||
<mailto:natalia@nikhefk.nikhef.nl>
|
||||
'c' 00-7F linux/comstats.h conflict!
|
||||
'c' 00-7F linux/coda.h conflict!
|
||||
'c' 80-9F asm-s390/chsc.h
|
||||
'd' 00-FF linux/char/drm/drm/h conflict!
|
||||
'd' 00-DF linux/video_decoder.h conflict!
|
||||
'd' F0-FF linux/digi1.h
|
||||
|
@ -508,12 +508,13 @@ HDIO_DRIVE_RESET execute a device reset
|
||||
|
||||
error returns:
|
||||
EACCES Access denied: requires CAP_SYS_ADMIN
|
||||
ENXIO No such device: phy dead or ctl_addr == 0
|
||||
EIO I/O error: reset timed out or hardware error
|
||||
|
||||
notes:
|
||||
|
||||
Abort any current command, prevent anything else from being
|
||||
queued, execute a reset on the device, and issue BLKRRPART
|
||||
ioctl on the block device.
|
||||
Execute a reset on the device as soon as the current IO
|
||||
operation has completed.
|
||||
|
||||
Executes an ATAPI soft reset if applicable, otherwise
|
||||
executes an ATA soft reset on the controller.
|
||||
|
@ -109,7 +109,7 @@ There are two possible methods of using Kdump.
|
||||
2) Or use the system kernel binary itself as dump-capture kernel and there is
|
||||
no need to build a separate dump-capture kernel. This is possible
|
||||
only with the architecutres which support a relocatable kernel. As
|
||||
of today i386 and ia64 architectures support relocatable kernel.
|
||||
of today, i386, x86_64 and ia64 architectures support relocatable kernel.
|
||||
|
||||
Building a relocatable kernel is advantageous from the point of view that
|
||||
one does not have to build a second kernel for capturing the dump. But
|
||||
|
@ -147,10 +147,14 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
default: 0
|
||||
|
||||
acpi_sleep= [HW,ACPI] Sleep options
|
||||
Format: { s3_bios, s3_mode, s3_beep }
|
||||
Format: { s3_bios, s3_mode, s3_beep, old_ordering }
|
||||
See Documentation/power/video.txt for s3_bios and s3_mode.
|
||||
s3_beep is for debugging; it makes the PC's speaker beep
|
||||
as soon as the kernel's real-mode entry point is called.
|
||||
old_ordering causes the ACPI 1.0 ordering of the _PTS
|
||||
control method, wrt putting devices into low power
|
||||
states, to be enforced (the ACPI 2.0 ordering of _PTS is
|
||||
used by default).
|
||||
|
||||
acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode
|
||||
Format: { level | edge | high | low }
|
||||
@ -271,6 +275,17 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
aic79xx= [HW,SCSI]
|
||||
See Documentation/scsi/aic79xx.txt.
|
||||
|
||||
amd_iommu= [HW,X86-84]
|
||||
Pass parameters to the AMD IOMMU driver in the system.
|
||||
Possible values are:
|
||||
isolate - enable device isolation (each device, as far
|
||||
as possible, will get its own protection
|
||||
domain)
|
||||
amd_iommu_size= [HW,X86-64]
|
||||
Define the size of the aperture for the AMD IOMMU
|
||||
driver. Possible values are:
|
||||
'32M', '64M' (default), '128M', '256M', '512M', '1G'
|
||||
|
||||
amijoy.map= [HW,JOY] Amiga joystick support
|
||||
Map of devices attached to JOY0DAT and JOY1DAT
|
||||
Format: <a>,<b>
|
||||
@ -295,7 +310,7 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
when initialising the APIC and IO-APIC components.
|
||||
|
||||
apm= [APM] Advanced Power Management
|
||||
See header of arch/i386/kernel/apm.c.
|
||||
See header of arch/x86/kernel/apm_32.c.
|
||||
|
||||
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
|
||||
Format: <io>,<irq>,<nodeID>
|
||||
@ -560,6 +575,8 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
|
||||
debug_objects [KNL] Enable object debugging
|
||||
|
||||
debugpat [X86] Enable PAT debugging
|
||||
|
||||
decnet.addr= [HW,NET]
|
||||
Format: <area>[,<node>]
|
||||
See also Documentation/networking/decnet.txt.
|
||||
@ -599,6 +616,29 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
See drivers/char/README.epca and
|
||||
Documentation/digiepca.txt.
|
||||
|
||||
disable_mtrr_cleanup [X86]
|
||||
enable_mtrr_cleanup [X86]
|
||||
The kernel tries to adjust MTRR layout from continuous
|
||||
to discrete, to make X server driver able to add WB
|
||||
entry later. This parameter enables/disables that.
|
||||
|
||||
mtrr_chunk_size=nn[KMG] [X86]
|
||||
used for mtrr cleanup. It is largest continous chunk
|
||||
that could hold holes aka. UC entries.
|
||||
|
||||
mtrr_gran_size=nn[KMG] [X86]
|
||||
Used for mtrr cleanup. It is granularity of mtrr block.
|
||||
Default is 1.
|
||||
Large value could prevent small alignment from
|
||||
using up MTRRs.
|
||||
|
||||
mtrr_spare_reg_nr=n [X86]
|
||||
Format: <integer>
|
||||
Range: 0,7 : spare reg number
|
||||
Default : 1
|
||||
Used for mtrr cleanup. It is spare mtrr entries number.
|
||||
Set to 2 or more if your graphical card needs more.
|
||||
|
||||
disable_mtrr_trim [X86, Intel and AMD only]
|
||||
By default the kernel will trim any uncacheable
|
||||
memory out of your available memory pool based on
|
||||
@ -638,7 +678,7 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
|
||||
elanfreq= [X86-32]
|
||||
See comment before function elanfreq_setup() in
|
||||
arch/i386/kernel/cpu/cpufreq/elanfreq.c.
|
||||
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
|
||||
|
||||
elevator= [IOSCHED]
|
||||
Format: {"anticipatory" | "cfq" | "deadline" | "noop"}
|
||||
@ -722,9 +762,6 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
hd= [EIDE] (E)IDE hard drive subsystem geometry
|
||||
Format: <cyl>,<head>,<sect>
|
||||
|
||||
hd?= [HW] (E)IDE subsystem
|
||||
hd?lun= See Documentation/ide/ide.txt.
|
||||
|
||||
highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
|
||||
size of <nn>. This works even on boxes that have no
|
||||
highmem otherwise. This also works to reduce highmem
|
||||
@ -785,7 +822,7 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
See Documentation/ide/ide.txt.
|
||||
|
||||
idle= [X86]
|
||||
Format: idle=poll or idle=mwait
|
||||
Format: idle=poll or idle=mwait, idle=halt, idle=nomwait
|
||||
Poll forces a polling idle loop that can slightly improves the performance
|
||||
of waking up a idle CPU, but will use a lot of power and make the system
|
||||
run hot. Not recommended.
|
||||
@ -793,6 +830,9 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
to not use it because it doesn't save as much power as a normal idle
|
||||
loop use the MONITOR/MWAIT idle loop anyways. Performance should be the same
|
||||
as idle=poll.
|
||||
idle=halt. Halt is forced to be used for CPU idle.
|
||||
In such case C2/C3 won't be used again.
|
||||
idle=nomwait. Disable mwait for CPU C-states
|
||||
|
||||
ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem
|
||||
Claim all unknown PCI IDE storage controllers.
|
||||
@ -1208,6 +1248,11 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
mtdparts= [MTD]
|
||||
See drivers/mtd/cmdlinepart.c.
|
||||
|
||||
mtdset= [ARM]
|
||||
ARM/S3C2412 JIVE boot control
|
||||
|
||||
See arch/arm/mach-s3c2412/mach-jive.c
|
||||
|
||||
mtouchusb.raw_coordinates=
|
||||
[HW] Make the MicroTouch USB driver use raw coordinates
|
||||
('y', default) or cooked coordinates ('n')
|
||||
@ -1496,6 +1541,9 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
Use with caution as certain devices share
|
||||
address decoders between ROMs and other
|
||||
resources.
|
||||
norom [X86-32,X86_64] Do not assign address space to
|
||||
expansion ROMs that do not already have
|
||||
BIOS assigned address ranges.
|
||||
irqmask=0xMMMM [X86-32] Set a bit mask of IRQs allowed to be
|
||||
assigned automatically to PCI devices. You can
|
||||
make the kernel exclude IRQs of your ISA cards
|
||||
@ -1571,6 +1619,10 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
Format: { parport<nr> | timid | 0 }
|
||||
See also Documentation/parport.txt.
|
||||
|
||||
pmtmr= [X86] Manual setup of pmtmr I/O Port.
|
||||
Override pmtimer IOPort with a hex value.
|
||||
e.g. pmtmr=0x508
|
||||
|
||||
pnpacpi= [ACPI]
|
||||
{ off }
|
||||
|
||||
@ -1679,6 +1731,10 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
Format: <reboot_mode>[,<reboot_mode2>[,...]]
|
||||
See arch/*/kernel/reboot.c or arch/*/kernel/process.c
|
||||
|
||||
relax_domain_level=
|
||||
[KNL, SMP] Set scheduler's default relax_domain_level.
|
||||
See Documentation/cpusets.txt.
|
||||
|
||||
reserve= [KNL,BUGS] Force the kernel to ignore some iomem area
|
||||
|
||||
reservetop= [X86-32]
|
||||
@ -2112,6 +2168,9 @@ and is between 256 and 4096 characters. It is defined in the file
|
||||
usbhid.mousepoll=
|
||||
[USBHID] The interval which mice are to be polled at.
|
||||
|
||||
add_efi_memmap [EFI; x86-32,X86-64] Include EFI memory map in
|
||||
kernel's map of available physical RAM.
|
||||
|
||||
vdso= [X86-32,SH,x86-64]
|
||||
vdso=2: enable compat VDSO (default with COMPAT_VDSO)
|
||||
vdso=1: enable VDSO (default)
|
||||
|
@ -172,6 +172,7 @@ architectures:
|
||||
- ia64 (Does not support probes on instruction slot1.)
|
||||
- sparc64 (Return probes not yet implemented.)
|
||||
- arm
|
||||
- ppc
|
||||
|
||||
3. Configuring Kprobes
|
||||
|
||||
|
@ -174,8 +174,6 @@ The LED is exposed through the LED subsystem, and can be found in:
|
||||
The mail LED is autodetected, so if you don't have one, the LED device won't
|
||||
be registered.
|
||||
|
||||
If you have a mail LED that is not green, please report this to me.
|
||||
|
||||
Backlight
|
||||
*********
|
||||
|
||||
|
@ -81,23 +81,23 @@ inet_peer_minttl - INTEGER
|
||||
Minimum time-to-live of entries. Should be enough to cover fragment
|
||||
time-to-live on the reassembling side. This minimum time-to-live is
|
||||
guaranteed if the pool size is less than inet_peer_threshold.
|
||||
Measured in jiffies(1).
|
||||
Measured in seconds.
|
||||
|
||||
inet_peer_maxttl - INTEGER
|
||||
Maximum time-to-live of entries. Unused entries will expire after
|
||||
this period of time if there is no memory pressure on the pool (i.e.
|
||||
when the number of entries in the pool is very small).
|
||||
Measured in jiffies(1).
|
||||
Measured in seconds.
|
||||
|
||||
inet_peer_gc_mintime - INTEGER
|
||||
Minimum interval between garbage collection passes. This interval is
|
||||
in effect under high memory pressure on the pool.
|
||||
Measured in jiffies(1).
|
||||
Measured in seconds.
|
||||
|
||||
inet_peer_gc_maxtime - INTEGER
|
||||
Minimum interval between garbage collection passes. This interval is
|
||||
in effect under low (or absent) memory pressure on the pool.
|
||||
Measured in jiffies(1).
|
||||
Measured in seconds.
|
||||
|
||||
TCP variables:
|
||||
|
||||
@ -148,9 +148,9 @@ tcp_available_congestion_control - STRING
|
||||
but not loaded.
|
||||
|
||||
tcp_base_mss - INTEGER
|
||||
The initial value of search_low to be used by Packetization Layer
|
||||
Path MTU Discovery (MTU probing). If MTU probing is enabled,
|
||||
this is the inital MSS used by the connection.
|
||||
The initial value of search_low to be used by the packetization layer
|
||||
Path MTU discovery (MTU probing). If MTU probing is enabled,
|
||||
this is the initial MSS used by the connection.
|
||||
|
||||
tcp_congestion_control - STRING
|
||||
Set the congestion control algorithm to be used for new
|
||||
@ -185,10 +185,9 @@ tcp_frto - INTEGER
|
||||
timeouts. It is particularly beneficial in wireless environments
|
||||
where packet loss is typically due to random radio interference
|
||||
rather than intermediate router congestion. F-RTO is sender-side
|
||||
only modification. Therefore it does not require any support from
|
||||
the peer, but in a typical case, however, where wireless link is
|
||||
the local access link and most of the data flows downlink, the
|
||||
faraway servers should have F-RTO enabled to take advantage of it.
|
||||
only modification. Therefore it does not require any support from
|
||||
the peer.
|
||||
|
||||
If set to 1, basic version is enabled. 2 enables SACK enhanced
|
||||
F-RTO if flow uses SACK. The basic version can be used also when
|
||||
SACK is in use though scenario(s) with it exists where F-RTO
|
||||
@ -276,7 +275,7 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max
|
||||
memory.
|
||||
|
||||
tcp_moderate_rcvbuf - BOOLEAN
|
||||
If set, TCP performs receive buffer autotuning, attempting to
|
||||
If set, TCP performs receive buffer auto-tuning, attempting to
|
||||
automatically size the buffer (no greater than tcp_rmem[2]) to
|
||||
match the size required by the path for full throughput. Enabled by
|
||||
default.
|
||||
@ -336,7 +335,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
|
||||
pressure.
|
||||
Default: 8K
|
||||
|
||||
default: default size of receive buffer used by TCP sockets.
|
||||
default: initial size of receive buffer used by TCP sockets.
|
||||
This value overrides net.core.rmem_default used by other protocols.
|
||||
Default: 87380 bytes. This value results in window of 65535 with
|
||||
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
|
||||
@ -344,8 +343,10 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
|
||||
|
||||
max: maximal size of receive buffer allowed for automatically
|
||||
selected receiver buffers for TCP socket. This value does not override
|
||||
net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
|
||||
Default: 87380*2 bytes.
|
||||
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
|
||||
automatic tuning of that socket's receive buffer size, in which
|
||||
case this value is ignored.
|
||||
Default: between 87380B and 4MB, depending on RAM size.
|
||||
|
||||
tcp_sack - BOOLEAN
|
||||
Enable select acknowledgments (SACKS).
|
||||
@ -358,7 +359,7 @@ tcp_slow_start_after_idle - BOOLEAN
|
||||
Default: 1
|
||||
|
||||
tcp_stdurg - BOOLEAN
|
||||
Use the Host requirements interpretation of the TCP urg pointer field.
|
||||
Use the Host requirements interpretation of the TCP urgent pointer field.
|
||||
Most hosts use the older BSD interpretation, so if you turn this on
|
||||
Linux might not communicate correctly with them.
|
||||
Default: FALSE
|
||||
@ -371,12 +372,12 @@ tcp_synack_retries - INTEGER
|
||||
tcp_syncookies - BOOLEAN
|
||||
Only valid when the kernel was compiled with CONFIG_SYNCOOKIES
|
||||
Send out syncookies when the syn backlog queue of a socket
|
||||
overflows. This is to prevent against the common 'syn flood attack'
|
||||
overflows. This is to prevent against the common 'SYN flood attack'
|
||||
Default: FALSE
|
||||
|
||||
Note, that syncookies is fallback facility.
|
||||
It MUST NOT be used to help highly loaded servers to stand
|
||||
against legal connection rate. If you see synflood warnings
|
||||
against legal connection rate. If you see SYN flood warnings
|
||||
in your logs, but investigation shows that they occur
|
||||
because of overload with legal connections, you should tune
|
||||
another parameters until this warning disappear.
|
||||
@ -386,7 +387,7 @@ tcp_syncookies - BOOLEAN
|
||||
to use TCP extensions, can result in serious degradation
|
||||
of some services (f.e. SMTP relaying), visible not by you,
|
||||
but your clients and relays, contacting you. While you see
|
||||
synflood warnings in logs not being really flooded, your server
|
||||
SYN flood warnings in logs not being really flooded, your server
|
||||
is seriously misconfigured.
|
||||
|
||||
tcp_syn_retries - INTEGER
|
||||
@ -419,19 +420,21 @@ tcp_window_scaling - BOOLEAN
|
||||
Enable window scaling as defined in RFC1323.
|
||||
|
||||
tcp_wmem - vector of 3 INTEGERs: min, default, max
|
||||
min: Amount of memory reserved for send buffers for TCP socket.
|
||||
min: Amount of memory reserved for send buffers for TCP sockets.
|
||||
Each TCP socket has rights to use it due to fact of its birth.
|
||||
Default: 4K
|
||||
|
||||
default: Amount of memory allowed for send buffers for TCP socket
|
||||
by default. This value overrides net.core.wmem_default used
|
||||
by other protocols, it is usually lower than net.core.wmem_default.
|
||||
default: initial size of send buffer used by TCP sockets. This
|
||||
value overrides net.core.wmem_default used by other protocols.
|
||||
It is usually lower than net.core.wmem_default.
|
||||
Default: 16K
|
||||
|
||||
max: Maximal amount of memory allowed for automatically selected
|
||||
send buffers for TCP socket. This value does not override
|
||||
net.core.wmem_max, "static" selection via SO_SNDBUF does not use this.
|
||||
Default: 128K
|
||||
max: Maximal amount of memory allowed for automatically tuned
|
||||
send buffers for TCP sockets. This value does not override
|
||||
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
|
||||
automatic tuning of that socket's send buffer size, in which case
|
||||
this value is ignored.
|
||||
Default: between 64K and 4MB, depending on RAM size.
|
||||
|
||||
tcp_workaround_signed_windows - BOOLEAN
|
||||
If set, assume no receipt of a window scaling option means the
|
||||
@ -794,10 +797,6 @@ tag - INTEGER
|
||||
Allows you to write a number, which can be used as required.
|
||||
Default value is 0.
|
||||
|
||||
(1) Jiffie: internal timeunit for the kernel. On the i386 1/100s, on the
|
||||
Alpha 1/1024s. See the HZ define in /usr/include/asm/param.h for the exact
|
||||
value on your system.
|
||||
|
||||
Alexey Kuznetsov.
|
||||
kuznet@ms2.inr.ac.ru
|
||||
|
||||
@ -1064,24 +1063,193 @@ bridge-nf-filter-pppoe-tagged - BOOLEAN
|
||||
Default: 1
|
||||
|
||||
|
||||
proc/sys/net/sctp/* Variables:
|
||||
|
||||
addip_enable - BOOLEAN
|
||||
Enable or disable extension of Dynamic Address Reconfiguration
|
||||
(ADD-IP) functionality specified in RFC5061. This extension provides
|
||||
the ability to dynamically add and remove new addresses for the SCTP
|
||||
associations.
|
||||
|
||||
1: Enable extension.
|
||||
|
||||
0: Disable extension.
|
||||
|
||||
Default: 0
|
||||
|
||||
addip_noauth_enable - BOOLEAN
|
||||
Dynamic Address Reconfiguration (ADD-IP) requires the use of
|
||||
authentication to protect the operations of adding or removing new
|
||||
addresses. This requirement is mandated so that unauthorized hosts
|
||||
would not be able to hijack associations. However, older
|
||||
implementations may not have implemented this requirement while
|
||||
allowing the ADD-IP extension. For reasons of interoperability,
|
||||
we provide this variable to control the enforcement of the
|
||||
authentication requirement.
|
||||
|
||||
1: Allow ADD-IP extension to be used without authentication. This
|
||||
should only be set in a closed environment for interoperability
|
||||
with older implementations.
|
||||
|
||||
0: Enforce the authentication requirement
|
||||
|
||||
Default: 0
|
||||
|
||||
auth_enable - BOOLEAN
|
||||
Enable or disable Authenticated Chunks extension. This extension
|
||||
provides the ability to send and receive authenticated chunks and is
|
||||
required for secure operation of Dynamic Address Reconfiguration
|
||||
(ADD-IP) extension.
|
||||
|
||||
1: Enable this extension.
|
||||
0: Disable this extension.
|
||||
|
||||
Default: 0
|
||||
|
||||
prsctp_enable - BOOLEAN
|
||||
Enable or disable the Partial Reliability extension (RFC3758) which
|
||||
is used to notify peers that a given DATA should no longer be expected.
|
||||
|
||||
1: Enable extension
|
||||
0: Disable
|
||||
|
||||
Default: 1
|
||||
|
||||
max_burst - INTEGER
|
||||
The limit of the number of new packets that can be initially sent. It
|
||||
controls how bursty the generated traffic can be.
|
||||
|
||||
Default: 4
|
||||
|
||||
association_max_retrans - INTEGER
|
||||
Set the maximum number for retransmissions that an association can
|
||||
attempt deciding that the remote end is unreachable. If this value
|
||||
is exceeded, the association is terminated.
|
||||
|
||||
Default: 10
|
||||
|
||||
max_init_retransmits - INTEGER
|
||||
The maximum number of retransmissions of INIT and COOKIE-ECHO chunks
|
||||
that an association will attempt before declaring the destination
|
||||
unreachable and terminating.
|
||||
|
||||
Default: 8
|
||||
|
||||
path_max_retrans - INTEGER
|
||||
The maximum number of retransmissions that will be attempted on a given
|
||||
path. Once this threshold is exceeded, the path is considered
|
||||
unreachable, and new traffic will use a different path when the
|
||||
association is multihomed.
|
||||
|
||||
Default: 5
|
||||
|
||||
rto_initial - INTEGER
|
||||
The initial round trip timeout value in milliseconds that will be used
|
||||
in calculating round trip times. This is the initial time interval
|
||||
for retransmissions.
|
||||
|
||||
Default: 3000
|
||||
|
||||
rto_max - INTEGER
|
||||
The maximum value (in milliseconds) of the round trip timeout. This
|
||||
is the largest time interval that can elapse between retransmissions.
|
||||
|
||||
Default: 60000
|
||||
|
||||
rto_min - INTEGER
|
||||
The minimum value (in milliseconds) of the round trip timeout. This
|
||||
is the smallest time interval the can elapse between retransmissions.
|
||||
|
||||
Default: 1000
|
||||
|
||||
hb_interval - INTEGER
|
||||
The interval (in milliseconds) between HEARTBEAT chunks. These chunks
|
||||
are sent at the specified interval on idle paths to probe the state of
|
||||
a given path between 2 associations.
|
||||
|
||||
Default: 30000
|
||||
|
||||
sack_timeout - INTEGER
|
||||
The amount of time (in milliseconds) that the implementation will wait
|
||||
to send a SACK.
|
||||
|
||||
Default: 200
|
||||
|
||||
valid_cookie_life - INTEGER
|
||||
The default lifetime of the SCTP cookie (in milliseconds). The cookie
|
||||
is used during association establishment.
|
||||
|
||||
Default: 60000
|
||||
|
||||
cookie_preserve_enable - BOOLEAN
|
||||
Enable or disable the ability to extend the lifetime of the SCTP cookie
|
||||
that is used during the establishment phase of SCTP association
|
||||
|
||||
1: Enable cookie lifetime extension.
|
||||
0: Disable
|
||||
|
||||
Default: 1
|
||||
|
||||
rcvbuf_policy - INTEGER
|
||||
Determines if the receive buffer is attributed to the socket or to
|
||||
association. SCTP supports the capability to create multiple
|
||||
associations on a single socket. When using this capability, it is
|
||||
possible that a single stalled association that's buffering a lot
|
||||
of data may block other associations from delivering their data by
|
||||
consuming all of the receive buffer space. To work around this,
|
||||
the rcvbuf_policy could be set to attribute the receiver buffer space
|
||||
to each association instead of the socket. This prevents the described
|
||||
blocking.
|
||||
|
||||
1: rcvbuf space is per association
|
||||
0: recbuf space is per socket
|
||||
|
||||
Default: 0
|
||||
|
||||
sndbuf_policy - INTEGER
|
||||
Similar to rcvbuf_policy above, this applies to send buffer space.
|
||||
|
||||
1: Send buffer is tracked per association
|
||||
0: Send buffer is tracked per socket.
|
||||
|
||||
Default: 0
|
||||
|
||||
sctp_mem - vector of 3 INTEGERs: min, pressure, max
|
||||
Number of pages allowed for queueing by all SCTP sockets.
|
||||
|
||||
min: Below this number of pages SCTP is not bothered about its
|
||||
memory appetite. When amount of memory allocated by SCTP exceeds
|
||||
this number, SCTP starts to moderate memory usage.
|
||||
|
||||
pressure: This value was introduced to follow format of tcp_mem.
|
||||
|
||||
max: Number of pages allowed for queueing by all SCTP sockets.
|
||||
|
||||
Default is calculated at boot time from amount of available memory.
|
||||
|
||||
sctp_rmem - vector of 3 INTEGERs: min, default, max
|
||||
See tcp_rmem for a description.
|
||||
|
||||
sctp_wmem - vector of 3 INTEGERs: min, default, max
|
||||
See tcp_wmem for a description.
|
||||
|
||||
UNDOCUMENTED:
|
||||
|
||||
dev_weight FIXME
|
||||
discovery_slots FIXME
|
||||
discovery_timeout FIXME
|
||||
fast_poll_increase FIXME
|
||||
ip6_queue_maxlen FIXME
|
||||
lap_keepalive_time FIXME
|
||||
lo_cong FIXME
|
||||
max_baud_rate FIXME
|
||||
max_dgram_qlen FIXME
|
||||
max_noreply_time FIXME
|
||||
max_tx_data_size FIXME
|
||||
max_tx_window FIXME
|
||||
min_tx_turn_time FIXME
|
||||
mod_cong FIXME
|
||||
no_cong FIXME
|
||||
no_cong_thresh FIXME
|
||||
slot_timeout FIXME
|
||||
warn_noreply_time FIXME
|
||||
/proc/sys/net/core/*
|
||||
dev_weight FIXME
|
||||
|
||||
/proc/sys/net/unix/*
|
||||
max_dgram_qlen FIXME
|
||||
|
||||
/proc/sys/net/irda/*
|
||||
fast_poll_increase FIXME
|
||||
warn_noreply_time FIXME
|
||||
discovery_slots FIXME
|
||||
slot_timeout FIXME
|
||||
max_baud_rate FIXME
|
||||
discovery_timeout FIXME
|
||||
lap_keepalive_time FIXME
|
||||
max_noreply_time FIXME
|
||||
max_tx_data_size FIXME
|
||||
max_tx_window FIXME
|
||||
min_tx_turn_time FIXME
|
||||
|
@ -83,9 +83,9 @@ Valid range: Limited by memory on system
|
||||
Default: 30
|
||||
|
||||
e. intr_type
|
||||
Specifies interrupt type. Possible values 1(INTA), 2(MSI), 3(MSI-X)
|
||||
Valid range: 1-3
|
||||
Default: 1
|
||||
Specifies interrupt type. Possible values 0(INTA), 2(MSI-X)
|
||||
Valid values: 0, 2
|
||||
Default: 2
|
||||
|
||||
5. Performance suggestions
|
||||
General:
|
||||
|
@ -10,7 +10,7 @@ us to generate 'watchdog NMI interrupts'. (NMI: Non Maskable Interrupt
|
||||
which get executed even if the system is otherwise locked up hard).
|
||||
This can be used to debug hard kernel lockups. By executing periodic
|
||||
NMI interrupts, the kernel can monitor whether any CPU has locked up,
|
||||
and print out debugging messages if so.
|
||||
and print out debugging messages if so.
|
||||
|
||||
In order to use the NMI watchdog, you need to have APIC support in your
|
||||
kernel. For SMP kernels, APIC support gets compiled in automatically. For
|
||||
@ -22,8 +22,7 @@ CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain
|
||||
kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
|
||||
may implicitly disable the NMI watchdog.]
|
||||
|
||||
For x86-64, the needed APIC is always compiled in, and the NMI watchdog is
|
||||
always enabled with I/O-APIC mode (nmi_watchdog=1).
|
||||
For x86-64, the needed APIC is always compiled in.
|
||||
|
||||
Using local APIC (nmi_watchdog=2) needs the first performance register, so
|
||||
you can't use it for other purposes (such as high precision performance
|
||||
@ -63,16 +62,15 @@ when the system is idle), but if your system locks up on anything but the
|
||||
"hlt", then you are out of luck -- the event will not happen at all and the
|
||||
watchdog won't trigger. This is a shortcoming of the local APIC watchdog
|
||||
-- unfortunately there is no "clock ticks" event that would work all the
|
||||
time. The I/O APIC watchdog is driven externally and has no such shortcoming.
|
||||
time. The I/O APIC watchdog is driven externally and has no such shortcoming.
|
||||
But its NMI frequency is much higher, resulting in a more significant hit
|
||||
to the overall system performance.
|
||||
|
||||
NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default,
|
||||
you have to enable it with a boot time parameter. Prior to 2.4.2-ac18
|
||||
the NMI-oopser is enabled unconditionally on x86 SMP boxes.
|
||||
On x86 nmi_watchdog is disabled by default so you have to enable it with
|
||||
a boot time parameter.
|
||||
|
||||
On x86-64 the NMI oopser is on by default. On 64bit Intel CPUs
|
||||
it uses IO-APIC by default and on AMD it uses local APIC.
|
||||
NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally
|
||||
on x86 SMP boxes.
|
||||
|
||||
[ feel free to send bug reports, suggestions and patches to
|
||||
Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
|
||||
|
File diff suppressed because it is too large
Load Diff
141
Documentation/powerpc/bootwrapper.txt
Normal file
141
Documentation/powerpc/bootwrapper.txt
Normal file
@ -0,0 +1,141 @@
|
||||
The PowerPC boot wrapper
|
||||
------------------------
|
||||
Copyright (C) Secret Lab Technologies Ltd.
|
||||
|
||||
PowerPC image targets compresses and wraps the kernel image (vmlinux) with
|
||||
a boot wrapper to make it usable by the system firmware. There is no
|
||||
standard PowerPC firmware interface, so the boot wrapper is designed to
|
||||
be adaptable for each kind of image that needs to be built.
|
||||
|
||||
The boot wrapper can be found in the arch/powerpc/boot/ directory. The
|
||||
Makefile in that directory has targets for all the available image types.
|
||||
The different image types are used to support all of the various firmware
|
||||
interfaces found on PowerPC platforms. OpenFirmware is the most commonly
|
||||
used firmware type on general purpose PowerPC systems from Apple, IBM and
|
||||
others. U-Boot is typically found on embedded PowerPC hardware, but there
|
||||
are a handful of other firmware implementations which are also popular. Each
|
||||
firmware interface requires a different image format.
|
||||
|
||||
The boot wrapper is built from the makefile in arch/powerpc/boot/Makefile and
|
||||
it uses the wrapper script (arch/powerpc/boot/wrapper) to generate target
|
||||
image. The details of the build system is discussed in the next section.
|
||||
Currently, the following image format targets exist:
|
||||
|
||||
cuImage.%: Backwards compatible uImage for older version of
|
||||
U-Boot (for versions that don't understand the device
|
||||
tree). This image embeds a device tree blob inside
|
||||
the image. The boot wrapper, kernel and device tree
|
||||
are all embedded inside the U-Boot uImage file format
|
||||
with boot wrapper code that extracts data from the old
|
||||
bd_info structure and loads the data into the device
|
||||
tree before jumping into the kernel.
|
||||
Because of the series of #ifdefs found in the
|
||||
bd_info structure used in the old U-Boot interfaces,
|
||||
cuImages are platform specific. Each specific
|
||||
U-Boot platform has a different platform init file
|
||||
which populates the embedded device tree with data
|
||||
from the platform specific bd_info file. The platform
|
||||
specific cuImage platform init code can be found in
|
||||
arch/powerpc/boot/cuboot.*.c. Selection of the correct
|
||||
cuImage init code for a specific board can be found in
|
||||
the wrapper structure.
|
||||
dtbImage.%: Similar to zImage, except device tree blob is embedded
|
||||
inside the image instead of provided by firmware. The
|
||||
output image file can be either an elf file or a flat
|
||||
binary depending on the platform.
|
||||
dtbImages are used on systems which do not have an
|
||||
interface for passing a device tree directly.
|
||||
dtbImages are similar to simpleImages except that
|
||||
dtbImages have platform specific code for extracting
|
||||
data from the board firmware, but simpleImages do not
|
||||
talk to the firmware at all.
|
||||
PlayStation 3 support uses dtbImage. So do Embedded
|
||||
Planet boards using the PlanetCore firmware. Board
|
||||
specific initialization code is typically found in a
|
||||
file named arch/powerpc/boot/<platform>.c; but this
|
||||
can be overridden by the wrapper script.
|
||||
simpleImage.%: Firmware independent compressed image that does not
|
||||
depend on any particular firmware interface and embeds
|
||||
a device tree blob. This image is a flat binary that
|
||||
can be loaded to any location in RAM and jumped to.
|
||||
Firmware cannot pass any configuration data to the
|
||||
kernel with this image type and it depends entirely on
|
||||
the embedded device tree for all information.
|
||||
The simpleImage is useful for booting systems with
|
||||
an unknown firmware interface or for booting from
|
||||
a debugger when no firmware is present (such as on
|
||||
the Xilinx Virtex platform). The only assumption that
|
||||
simpleImage makes is that RAM is correctly initialized
|
||||
and that the MMU is either off or has RAM mapped to
|
||||
base address 0.
|
||||
simpleImage also supports inserting special platform
|
||||
specific initialization code to the start of the bootup
|
||||
sequence. The virtex405 platform uses this feature to
|
||||
ensure that the cache is invalidated before caching
|
||||
is enabled. Platform specific initialization code is
|
||||
added as part of the wrapper script and is keyed on
|
||||
the image target name. For example, all
|
||||
simpleImage.virtex405-* targets will add the
|
||||
virtex405-head.S initialization code (This also means
|
||||
that the dts file for virtex405 targets should be
|
||||
named (virtex405-<board>.dts). Search the wrapper
|
||||
script for 'virtex405' and see the file
|
||||
arch/powerpc/boot/virtex405-head.S for details.
|
||||
treeImage.%; Image format for used with OpenBIOS firmware found
|
||||
on some ppc4xx hardware. This image embeds a device
|
||||
tree blob inside the image.
|
||||
uImage: Native image format used by U-Boot. The uImage target
|
||||
does not add any boot code. It just wraps a compressed
|
||||
vmlinux in the uImage data structure. This image
|
||||
requires a version of U-Boot that is able to pass
|
||||
a device tree to the kernel at boot. If using an older
|
||||
version of U-Boot, then you need to use a cuImage
|
||||
instead.
|
||||
zImage.%: Image format which does not embed a device tree.
|
||||
Used by OpenFirmware and other firmware interfaces
|
||||
which are able to supply a device tree. This image
|
||||
expects firmware to provide the device tree at boot.
|
||||
Typically, if you have general purpose PowerPC
|
||||
hardware then you want this image format.
|
||||
|
||||
Image types which embed a device tree blob (simpleImage, dtbImage, treeImage,
|
||||
and cuImage) all generate the device tree blob from a file in the
|
||||
arch/powerpc/boot/dts/ directory. The Makefile selects the correct device
|
||||
tree source based on the name of the target. Therefore, if the kernel is
|
||||
built with 'make treeImage.walnut simpleImage.virtex405-ml403', then the
|
||||
build system will use arch/powerpc/boot/dts/walnut.dts to build
|
||||
treeImage.walnut and arch/powerpc/boot/dts/virtex405-ml403.dts to build
|
||||
the simpleImage.virtex405-ml403.
|
||||
|
||||
Two special targets called 'zImage' and 'zImage.initrd' also exist. These
|
||||
targets build all the default images as selected by the kernel configuration.
|
||||
Default images are selected by the boot wrapper Makefile
|
||||
(arch/powerpc/boot/Makefile) by adding targets to the $image-y variable. Look
|
||||
at the Makefile to see which default image targets are available.
|
||||
|
||||
How it is built
|
||||
---------------
|
||||
arch/powerpc is designed to support multiplatform kernels, which means
|
||||
that a single vmlinux image can be booted on many different target boards.
|
||||
It also means that the boot wrapper must be able to wrap for many kinds of
|
||||
images on a single build. The design decision was made to not use any
|
||||
conditional compilation code (#ifdef, etc) in the boot wrapper source code.
|
||||
All of the boot wrapper pieces are buildable at any time regardless of the
|
||||
kernel configuration. Building all the wrapper bits on every kernel build
|
||||
also ensures that obscure parts of the wrapper are at the very least compile
|
||||
tested in a large variety of environments.
|
||||
|
||||
The wrapper is adapted for different image types at link time by linking in
|
||||
just the wrapper bits that are appropriate for the image type. The 'wrapper
|
||||
script' (found in arch/powerpc/boot/wrapper) is called by the Makefile and
|
||||
is responsible for selecting the correct wrapper bits for the image type.
|
||||
The arguments are well documented in the script's comment block, so they
|
||||
are not repeated here. However, it is worth mentioning that the script
|
||||
uses the -p (platform) argument as the main method of deciding which wrapper
|
||||
bits to compile in. Look for the large 'case "$platform" in' block in the
|
||||
middle of the script. This is also the place where platform specific fixups
|
||||
can be selected by changing the link order.
|
||||
|
||||
In particular, care should be taken when working with cuImages. cuImage
|
||||
wrapper bits are very board specific and care should be taken to make sure
|
||||
the target you are trying to build is supported by the wrapper bits.
|
29
Documentation/powerpc/dts-bindings/fsl/board.txt
Normal file
29
Documentation/powerpc/dts-bindings/fsl/board.txt
Normal file
@ -0,0 +1,29 @@
|
||||
* Board Control and Status (BCSR)
|
||||
|
||||
Required properties:
|
||||
|
||||
- device_type : Should be "board-control"
|
||||
- reg : Offset and length of the register set for the device
|
||||
|
||||
Example:
|
||||
|
||||
bcsr@f8000000 {
|
||||
device_type = "board-control";
|
||||
reg = <f8000000 8000>;
|
||||
};
|
||||
|
||||
* Freescale on board FPGA
|
||||
|
||||
This is the memory-mapped registers for on board FPGA.
|
||||
|
||||
Required properities:
|
||||
- compatible : should be "fsl,fpga-pixis".
|
||||
- reg : should contain the address and the lenght of the FPPGA register
|
||||
set.
|
||||
|
||||
Example (MPC8610HPCD):
|
||||
|
||||
board-control@e8000000 {
|
||||
compatible = "fsl,fpga-pixis";
|
||||
reg = <0xe8000000 32>;
|
||||
};
|
67
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm.txt
Normal file
67
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm.txt
Normal file
@ -0,0 +1,67 @@
|
||||
* Freescale Communications Processor Module
|
||||
|
||||
NOTE: This is an interim binding, and will likely change slightly,
|
||||
as more devices are supported. The QE bindings especially are
|
||||
incomplete.
|
||||
|
||||
* Root CPM node
|
||||
|
||||
Properties:
|
||||
- compatible : "fsl,cpm1", "fsl,cpm2", or "fsl,qe".
|
||||
- reg : A 48-byte region beginning with CPCR.
|
||||
|
||||
Example:
|
||||
cpm@119c0 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
#interrupt-cells = <2>;
|
||||
compatible = "fsl,mpc8272-cpm", "fsl,cpm2";
|
||||
reg = <119c0 30>;
|
||||
}
|
||||
|
||||
* Properties common to mulitple CPM/QE devices
|
||||
|
||||
- fsl,cpm-command : This value is ORed with the opcode and command flag
|
||||
to specify the device on which a CPM command operates.
|
||||
|
||||
- fsl,cpm-brg : Indicates which baud rate generator the device
|
||||
is associated with. If absent, an unused BRG
|
||||
should be dynamically allocated. If zero, the
|
||||
device uses an external clock rather than a BRG.
|
||||
|
||||
- reg : Unless otherwise specified, the first resource represents the
|
||||
scc/fcc/ucc registers, and the second represents the device's
|
||||
parameter RAM region (if it has one).
|
||||
|
||||
* Multi-User RAM (MURAM)
|
||||
|
||||
The multi-user/dual-ported RAM is expressed as a bus under the CPM node.
|
||||
|
||||
Ranges must be set up subject to the following restrictions:
|
||||
|
||||
- Children's reg nodes must be offsets from the start of all muram, even
|
||||
if the user-data area does not begin at zero.
|
||||
- If multiple range entries are used, the difference between the parent
|
||||
address and the child address must be the same in all, so that a single
|
||||
mapping can cover them all while maintaining the ability to determine
|
||||
CPM-side offsets with pointer subtraction. It is recommended that
|
||||
multiple range entries not be used.
|
||||
- A child address of zero must be translatable, even if no reg resources
|
||||
contain it.
|
||||
|
||||
A child "data" node must exist, compatible with "fsl,cpm-muram-data", to
|
||||
indicate the portion of muram that is usable by the OS for arbitrary
|
||||
purposes. The data node may have an arbitrary number of reg resources,
|
||||
all of which contribute to the allocatable muram pool.
|
||||
|
||||
Example, based on mpc8272:
|
||||
muram@0 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
ranges = <0 0 10000>;
|
||||
|
||||
data@0 {
|
||||
compatible = "fsl,cpm-muram-data";
|
||||
reg = <0 2000 9800 800>;
|
||||
};
|
||||
};
|
21
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/brg.txt
Normal file
21
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/brg.txt
Normal file
@ -0,0 +1,21 @@
|
||||
* Baud Rate Generators
|
||||
|
||||
Currently defined compatibles:
|
||||
fsl,cpm-brg
|
||||
fsl,cpm1-brg
|
||||
fsl,cpm2-brg
|
||||
|
||||
Properties:
|
||||
- reg : There may be an arbitrary number of reg resources; BRG
|
||||
numbers are assigned to these in order.
|
||||
- clock-frequency : Specifies the base frequency driving
|
||||
the BRG.
|
||||
|
||||
Example:
|
||||
brg@119f0 {
|
||||
compatible = "fsl,mpc8272-brg",
|
||||
"fsl,cpm2-brg",
|
||||
"fsl,cpm-brg";
|
||||
reg = <119f0 10 115f0 10>;
|
||||
clock-frequency = <d#25000000>;
|
||||
};
|
41
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/i2c.txt
Normal file
41
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/i2c.txt
Normal file
@ -0,0 +1,41 @@
|
||||
* I2C
|
||||
|
||||
The I2C controller is expressed as a bus under the CPM node.
|
||||
|
||||
Properties:
|
||||
- compatible : "fsl,cpm1-i2c", "fsl,cpm2-i2c"
|
||||
- reg : On CPM2 devices, the second resource doesn't specify the I2C
|
||||
Parameter RAM itself, but the I2C_BASE field of the CPM2 Parameter RAM
|
||||
(typically 0x8afc 0x2).
|
||||
- #address-cells : Should be one. The cell is the i2c device address with
|
||||
the r/w bit set to zero.
|
||||
- #size-cells : Should be zero.
|
||||
- clock-frequency : Can be used to set the i2c clock frequency. If
|
||||
unspecified, a default frequency of 60kHz is being used.
|
||||
The following two properties are deprecated. They are only used by legacy
|
||||
i2c drivers to find the bus to probe:
|
||||
- linux,i2c-index : Can be used to hard code an i2c bus number. By default,
|
||||
the bus number is dynamically assigned by the i2c core.
|
||||
- linux,i2c-class : Can be used to override the i2c class. The class is used
|
||||
by legacy i2c device drivers to find a bus in a specific context like
|
||||
system management, video or sound. By default, I2C_CLASS_HWMON (1) is
|
||||
being used. The definition of the classes can be found in
|
||||
include/i2c/i2c.h
|
||||
|
||||
Example, based on mpc823:
|
||||
|
||||
i2c@860 {
|
||||
compatible = "fsl,mpc823-i2c",
|
||||
"fsl,cpm1-i2c";
|
||||
reg = <0x860 0x20 0x3c80 0x30>;
|
||||
interrupts = <16>;
|
||||
interrupt-parent = <&CPM_PIC>;
|
||||
fsl,cpm-command = <0x10>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
|
||||
rtc@68 {
|
||||
compatible = "dallas,ds1307";
|
||||
reg = <0x68>;
|
||||
};
|
||||
};
|
18
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/pic.txt
Normal file
18
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/pic.txt
Normal file
@ -0,0 +1,18 @@
|
||||
* Interrupt Controllers
|
||||
|
||||
Currently defined compatibles:
|
||||
- fsl,cpm1-pic
|
||||
- only one interrupt cell
|
||||
- fsl,pq1-pic
|
||||
- fsl,cpm2-pic
|
||||
- second interrupt cell is level/sense:
|
||||
- 2 is falling edge
|
||||
- 8 is active low
|
||||
|
||||
Example:
|
||||
interrupt-controller@10c00 {
|
||||
#interrupt-cells = <2>;
|
||||
interrupt-controller;
|
||||
reg = <10c00 80>;
|
||||
compatible = "mpc8272-pic", "fsl,cpm2-pic";
|
||||
};
|
15
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/usb.txt
Normal file
15
Documentation/powerpc/dts-bindings/fsl/cpm_qe/cpm/usb.txt
Normal file
@ -0,0 +1,15 @@
|
||||
* USB (Universal Serial Bus Controller)
|
||||
|
||||
Properties:
|
||||
- compatible : "fsl,cpm1-usb", "fsl,cpm2-usb", "fsl,qe-usb"
|
||||
|
||||
Example:
|
||||
usb@11bc0 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
compatible = "fsl,cpm2-usb";
|
||||
reg = <11b60 18 8b00 100>;
|
||||
interrupts = <b 8>;
|
||||
interrupt-parent = <&PIC>;
|
||||
fsl,cpm-command = <2e600000>;
|
||||
};
|
45
Documentation/powerpc/dts-bindings/fsl/cpm_qe/network.txt
Normal file
45
Documentation/powerpc/dts-bindings/fsl/cpm_qe/network.txt
Normal file
@ -0,0 +1,45 @@
|
||||
* Network
|
||||
|
||||
Currently defined compatibles:
|
||||
- fsl,cpm1-scc-enet
|
||||
- fsl,cpm2-scc-enet
|
||||
- fsl,cpm1-fec-enet
|
||||
- fsl,cpm2-fcc-enet (third resource is GFEMR)
|
||||
- fsl,qe-enet
|
||||
|
||||
Example:
|
||||
|
||||
ethernet@11300 {
|
||||
device_type = "network";
|
||||
compatible = "fsl,mpc8272-fcc-enet",
|
||||
"fsl,cpm2-fcc-enet";
|
||||
reg = <11300 20 8400 100 11390 1>;
|
||||
local-mac-address = [ 00 00 00 00 00 00 ];
|
||||
interrupts = <20 8>;
|
||||
interrupt-parent = <&PIC>;
|
||||
phy-handle = <&PHY0>;
|
||||
fsl,cpm-command = <12000300>;
|
||||
};
|
||||
|
||||
* MDIO
|
||||
|
||||
Currently defined compatibles:
|
||||
fsl,pq1-fec-mdio (reg is same as first resource of FEC device)
|
||||
fsl,cpm2-mdio-bitbang (reg is port C registers)
|
||||
|
||||
Properties for fsl,cpm2-mdio-bitbang:
|
||||
fsl,mdio-pin : pin of port C controlling mdio data
|
||||
fsl,mdc-pin : pin of port C controlling mdio clock
|
||||
|
||||
Example:
|
||||
mdio@10d40 {
|
||||
device_type = "mdio";
|
||||
compatible = "fsl,mpc8272ads-mdio-bitbang",
|
||||
"fsl,mpc8272-mdio-bitbang",
|
||||
"fsl,cpm2-mdio-bitbang";
|
||||
reg = <10d40 14>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
fsl,mdio-pin = <12>;
|
||||
fsl,mdc-pin = <13>;
|
||||
};
|
58
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe.txt
Normal file
58
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe.txt
Normal file
@ -0,0 +1,58 @@
|
||||
* Freescale QUICC Engine module (QE)
|
||||
This represents qe module that is installed on PowerQUICC II Pro.
|
||||
|
||||
NOTE: This is an interim binding; it should be updated to fit
|
||||
in with the CPM binding later in this document.
|
||||
|
||||
Basically, it is a bus of devices, that could act more or less
|
||||
as a complete entity (UCC, USB etc ). All of them should be siblings on
|
||||
the "root" qe node, using the common properties from there.
|
||||
The description below applies to the qe of MPC8360 and
|
||||
more nodes and properties would be extended in the future.
|
||||
|
||||
i) Root QE device
|
||||
|
||||
Required properties:
|
||||
- compatible : should be "fsl,qe";
|
||||
- model : precise model of the QE, Can be "QE", "CPM", or "CPM2"
|
||||
- reg : offset and length of the device registers.
|
||||
- bus-frequency : the clock frequency for QUICC Engine.
|
||||
|
||||
Recommended properties
|
||||
- brg-frequency : the internal clock source frequency for baud-rate
|
||||
generators in Hz.
|
||||
|
||||
Example:
|
||||
qe@e0100000 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
#interrupt-cells = <2>;
|
||||
compatible = "fsl,qe";
|
||||
ranges = <0 e0100000 00100000>;
|
||||
reg = <e0100000 480>;
|
||||
brg-frequency = <0>;
|
||||
bus-frequency = <179A7B00>;
|
||||
}
|
||||
|
||||
* Multi-User RAM (MURAM)
|
||||
|
||||
Required properties:
|
||||
- compatible : should be "fsl,qe-muram", "fsl,cpm-muram".
|
||||
- mode : the could be "host" or "slave".
|
||||
- ranges : Should be defined as specified in 1) to describe the
|
||||
translation of MURAM addresses.
|
||||
- data-only : sub-node which defines the address area under MURAM
|
||||
bus that can be allocated as data/parameter
|
||||
|
||||
Example:
|
||||
|
||||
muram@10000 {
|
||||
compatible = "fsl,qe-muram", "fsl,cpm-muram";
|
||||
ranges = <0 00010000 0000c000>;
|
||||
|
||||
data-only@0{
|
||||
compatible = "fsl,qe-muram-data",
|
||||
"fsl,cpm-muram-data";
|
||||
reg = <0 c000>;
|
||||
};
|
||||
};
|
@ -0,0 +1,24 @@
|
||||
* Uploaded QE firmware
|
||||
|
||||
If a new firwmare has been uploaded to the QE (usually by the
|
||||
boot loader), then a 'firmware' child node should be added to the QE
|
||||
node. This node provides information on the uploaded firmware that
|
||||
device drivers may need.
|
||||
|
||||
Required properties:
|
||||
- id: The string name of the firmware. This is taken from the 'id'
|
||||
member of the qe_firmware structure of the uploaded firmware.
|
||||
Device drivers can search this string to determine if the
|
||||
firmware they want is already present.
|
||||
- extended-modes: The Extended Modes bitfield, taken from the
|
||||
firmware binary. It is a 64-bit number represented
|
||||
as an array of two 32-bit numbers.
|
||||
- virtual-traps: The virtual traps, taken from the firmware binary.
|
||||
It is an array of 8 32-bit numbers.
|
||||
|
||||
Example:
|
||||
firmware {
|
||||
id = "Soft-UART";
|
||||
extended-modes = <0 0>;
|
||||
virtual-traps = <0 0 0 0 0 0 0 0>;
|
||||
};
|
51
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/par_io.txt
Normal file
51
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/par_io.txt
Normal file
@ -0,0 +1,51 @@
|
||||
* Parallel I/O Ports
|
||||
|
||||
This node configures Parallel I/O ports for CPUs with QE support.
|
||||
The node should reside in the "soc" node of the tree. For each
|
||||
device that using parallel I/O ports, a child node should be created.
|
||||
See the definition of the Pin configuration nodes below for more
|
||||
information.
|
||||
|
||||
Required properties:
|
||||
- device_type : should be "par_io".
|
||||
- reg : offset to the register set and its length.
|
||||
- num-ports : number of Parallel I/O ports
|
||||
|
||||
Example:
|
||||
par_io@1400 {
|
||||
reg = <1400 100>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
device_type = "par_io";
|
||||
num-ports = <7>;
|
||||
ucc_pin@01 {
|
||||
......
|
||||
};
|
||||
|
||||
Note that "par_io" nodes are obsolete, and should not be used for
|
||||
the new device trees. Instead, each Par I/O bank should be represented
|
||||
via its own gpio-controller node:
|
||||
|
||||
Required properties:
|
||||
- #gpio-cells : should be "2".
|
||||
- compatible : should be "fsl,<chip>-qe-pario-bank",
|
||||
"fsl,mpc8323-qe-pario-bank".
|
||||
- reg : offset to the register set and its length.
|
||||
- gpio-controller : node to identify gpio controllers.
|
||||
|
||||
Example:
|
||||
qe_pio_a: gpio-controller@1400 {
|
||||
#gpio-cells = <2>;
|
||||
compatible = "fsl,mpc8360-qe-pario-bank",
|
||||
"fsl,mpc8323-qe-pario-bank";
|
||||
reg = <0x1400 0x18>;
|
||||
gpio-controller;
|
||||
};
|
||||
|
||||
qe_pio_e: gpio-controller@1460 {
|
||||
#gpio-cells = <2>;
|
||||
compatible = "fsl,mpc8360-qe-pario-bank",
|
||||
"fsl,mpc8323-qe-pario-bank";
|
||||
reg = <0x1460 0x18>;
|
||||
gpio-controller;
|
||||
};
|
60
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/pincfg.txt
Normal file
60
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/pincfg.txt
Normal file
@ -0,0 +1,60 @@
|
||||
* Pin configuration nodes
|
||||
|
||||
Required properties:
|
||||
- linux,phandle : phandle of this node; likely referenced by a QE
|
||||
device.
|
||||
- pio-map : array of pin configurations. Each pin is defined by 6
|
||||
integers. The six numbers are respectively: port, pin, dir,
|
||||
open_drain, assignment, has_irq.
|
||||
- port : port number of the pin; 0-6 represent port A-G in UM.
|
||||
- pin : pin number in the port.
|
||||
- dir : direction of the pin, should encode as follows:
|
||||
|
||||
0 = The pin is disabled
|
||||
1 = The pin is an output
|
||||
2 = The pin is an input
|
||||
3 = The pin is I/O
|
||||
|
||||
- open_drain : indicates the pin is normal or wired-OR:
|
||||
|
||||
0 = The pin is actively driven as an output
|
||||
1 = The pin is an open-drain driver. As an output, the pin is
|
||||
driven active-low, otherwise it is three-stated.
|
||||
|
||||
- assignment : function number of the pin according to the Pin Assignment
|
||||
tables in User Manual. Each pin can have up to 4 possible functions in
|
||||
QE and two options for CPM.
|
||||
- has_irq : indicates if the pin is used as source of external
|
||||
interrupts.
|
||||
|
||||
Example:
|
||||
ucc_pin@01 {
|
||||
linux,phandle = <140001>;
|
||||
pio-map = <
|
||||
/* port pin dir open_drain assignment has_irq */
|
||||
0 3 1 0 1 0 /* TxD0 */
|
||||
0 4 1 0 1 0 /* TxD1 */
|
||||
0 5 1 0 1 0 /* TxD2 */
|
||||
0 6 1 0 1 0 /* TxD3 */
|
||||
1 6 1 0 3 0 /* TxD4 */
|
||||
1 7 1 0 1 0 /* TxD5 */
|
||||
1 9 1 0 2 0 /* TxD6 */
|
||||
1 a 1 0 2 0 /* TxD7 */
|
||||
0 9 2 0 1 0 /* RxD0 */
|
||||
0 a 2 0 1 0 /* RxD1 */
|
||||
0 b 2 0 1 0 /* RxD2 */
|
||||
0 c 2 0 1 0 /* RxD3 */
|
||||
0 d 2 0 1 0 /* RxD4 */
|
||||
1 1 2 0 2 0 /* RxD5 */
|
||||
1 0 2 0 2 0 /* RxD6 */
|
||||
1 4 2 0 2 0 /* RxD7 */
|
||||
0 7 1 0 1 0 /* TX_EN */
|
||||
0 8 1 0 1 0 /* TX_ER */
|
||||
0 f 2 0 1 0 /* RX_DV */
|
||||
0 10 2 0 1 0 /* RX_ER */
|
||||
0 0 2 0 1 0 /* RX_CLK */
|
||||
2 9 1 0 3 0 /* GTX_CLK - CLK10 */
|
||||
2 8 2 0 1 0>; /* GTX125 - CLK9 */
|
||||
};
|
||||
|
||||
|
70
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/ucc.txt
Normal file
70
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/ucc.txt
Normal file
@ -0,0 +1,70 @@
|
||||
* UCC (Unified Communications Controllers)
|
||||
|
||||
Required properties:
|
||||
- device_type : should be "network", "hldc", "uart", "transparent"
|
||||
"bisync", "atm", or "serial".
|
||||
- compatible : could be "ucc_geth" or "fsl_atm" and so on.
|
||||
- cell-index : the ucc number(1-8), corresponding to UCCx in UM.
|
||||
- reg : Offset and length of the register set for the device
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and level
|
||||
information for the interrupt. This should be encoded based on
|
||||
the information in section 2) depending on the type of interrupt
|
||||
controller you have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
- pio-handle : The phandle for the Parallel I/O port configuration.
|
||||
- port-number : for UART drivers, the port number to use, between 0 and 3.
|
||||
This usually corresponds to the /dev/ttyQE device, e.g. <0> = /dev/ttyQE0.
|
||||
The port number is added to the minor number of the device. Unlike the
|
||||
CPM UART driver, the port-number is required for the QE UART driver.
|
||||
- soft-uart : for UART drivers, if specified this means the QE UART device
|
||||
driver should use "Soft-UART" mode, which is needed on some SOCs that have
|
||||
broken UART hardware. Soft-UART is provided via a microcode upload.
|
||||
- rx-clock-name: the UCC receive clock source
|
||||
"none": clock source is disabled
|
||||
"brg1" through "brg16": clock source is BRG1-BRG16, respectively
|
||||
"clk1" through "clk24": clock source is CLK1-CLK24, respectively
|
||||
- tx-clock-name: the UCC transmit clock source
|
||||
"none": clock source is disabled
|
||||
"brg1" through "brg16": clock source is BRG1-BRG16, respectively
|
||||
"clk1" through "clk24": clock source is CLK1-CLK24, respectively
|
||||
The following two properties are deprecated. rx-clock has been replaced
|
||||
with rx-clock-name, and tx-clock has been replaced with tx-clock-name.
|
||||
Drivers that currently use the deprecated properties should continue to
|
||||
do so, in order to support older device trees, but they should be updated
|
||||
to check for the new properties first.
|
||||
- rx-clock : represents the UCC receive clock source.
|
||||
0x00 : clock source is disabled;
|
||||
0x1~0x10 : clock source is BRG1~BRG16 respectively;
|
||||
0x11~0x28: clock source is QE_CLK1~QE_CLK24 respectively.
|
||||
- tx-clock: represents the UCC transmit clock source;
|
||||
0x00 : clock source is disabled;
|
||||
0x1~0x10 : clock source is BRG1~BRG16 respectively;
|
||||
0x11~0x28: clock source is QE_CLK1~QE_CLK24 respectively.
|
||||
|
||||
Required properties for network device_type:
|
||||
- mac-address : list of bytes representing the ethernet address.
|
||||
- phy-handle : The phandle for the PHY connected to this controller.
|
||||
|
||||
Recommended properties:
|
||||
- phy-connection-type : a string naming the controller/PHY interface type,
|
||||
i.e., "mii" (default), "rmii", "gmii", "rgmii", "rgmii-id" (Internal
|
||||
Delay), "rgmii-txid" (delay on TX only), "rgmii-rxid" (delay on RX only),
|
||||
"tbi", or "rtbi".
|
||||
|
||||
Example:
|
||||
ucc@2000 {
|
||||
device_type = "network";
|
||||
compatible = "ucc_geth";
|
||||
cell-index = <1>;
|
||||
reg = <2000 200>;
|
||||
interrupts = <a0 0>;
|
||||
interrupt-parent = <700>;
|
||||
mac-address = [ 00 04 9f 00 23 23 ];
|
||||
rx-clock = "none";
|
||||
tx-clock = "clk9";
|
||||
phy-handle = <212000>;
|
||||
phy-connection-type = "gmii";
|
||||
pio-handle = <140001>;
|
||||
};
|
22
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/usb.txt
Normal file
22
Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/usb.txt
Normal file
@ -0,0 +1,22 @@
|
||||
* USB (Universal Serial Bus Controller)
|
||||
|
||||
Required properties:
|
||||
- compatible : could be "qe_udc" or "fhci-hcd".
|
||||
- mode : the could be "host" or "slave".
|
||||
- reg : Offset and length of the register set for the device
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and level
|
||||
information for the interrupt. This should be encoded based on
|
||||
the information in section 2) depending on the type of interrupt
|
||||
controller you have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
|
||||
Example(slave):
|
||||
usb@6c0 {
|
||||
compatible = "qe_udc";
|
||||
reg = <6c0 40>;
|
||||
interrupts = <8b 0>;
|
||||
interrupt-parent = <700>;
|
||||
mode = "slave";
|
||||
};
|
21
Documentation/powerpc/dts-bindings/fsl/cpm_qe/serial.txt
Normal file
21
Documentation/powerpc/dts-bindings/fsl/cpm_qe/serial.txt
Normal file
@ -0,0 +1,21 @@
|
||||
* Serial
|
||||
|
||||
Currently defined compatibles:
|
||||
- fsl,cpm1-smc-uart
|
||||
- fsl,cpm2-smc-uart
|
||||
- fsl,cpm1-scc-uart
|
||||
- fsl,cpm2-scc-uart
|
||||
- fsl,qe-uart
|
||||
|
||||
Example:
|
||||
|
||||
serial@11a00 {
|
||||
device_type = "serial";
|
||||
compatible = "fsl,mpc8272-scc-uart",
|
||||
"fsl,cpm2-scc-uart";
|
||||
reg = <11a00 20 8000 100>;
|
||||
interrupts = <28 8>;
|
||||
interrupt-parent = <&PIC>;
|
||||
fsl,cpm-brg = <1>;
|
||||
fsl,cpm-command = <00800000>;
|
||||
};
|
18
Documentation/powerpc/dts-bindings/fsl/diu.txt
Normal file
18
Documentation/powerpc/dts-bindings/fsl/diu.txt
Normal file
@ -0,0 +1,18 @@
|
||||
* Freescale Display Interface Unit
|
||||
|
||||
The Freescale DIU is a LCD controller, with proper hardware, it can also
|
||||
drive DVI monitors.
|
||||
|
||||
Required properties:
|
||||
- compatible : should be "fsl-diu".
|
||||
- reg : should contain at least address and length of the DIU register
|
||||
set.
|
||||
- Interrupts : one DIU interrupt should be describe here.
|
||||
|
||||
Example (MPC8610HPCD):
|
||||
display@2c000 {
|
||||
compatible = "fsl,diu";
|
||||
reg = <0x2c000 100>;
|
||||
interrupts = <72 2>;
|
||||
interrupt-parent = <&mpic>;
|
||||
};
|
127
Documentation/powerpc/dts-bindings/fsl/dma.txt
Normal file
127
Documentation/powerpc/dts-bindings/fsl/dma.txt
Normal file
@ -0,0 +1,127 @@
|
||||
* Freescale 83xx DMA Controller
|
||||
|
||||
Freescale PowerPC 83xx have on chip general purpose DMA controllers.
|
||||
|
||||
Required properties:
|
||||
|
||||
- compatible : compatible list, contains 2 entries, first is
|
||||
"fsl,CHIP-dma", where CHIP is the processor
|
||||
(mpc8349, mpc8360, etc.) and the second is
|
||||
"fsl,elo-dma"
|
||||
- reg : <registers mapping for DMA general status reg>
|
||||
- ranges : Should be defined as specified in 1) to describe the
|
||||
DMA controller channels.
|
||||
- cell-index : controller index. 0 for controller @ 0x8100
|
||||
- interrupts : <interrupt mapping for DMA IRQ>
|
||||
- interrupt-parent : optional, if needed for interrupt mapping
|
||||
|
||||
|
||||
- DMA channel nodes:
|
||||
- compatible : compatible list, contains 2 entries, first is
|
||||
"fsl,CHIP-dma-channel", where CHIP is the processor
|
||||
(mpc8349, mpc8350, etc.) and the second is
|
||||
"fsl,elo-dma-channel"
|
||||
- reg : <registers mapping for channel>
|
||||
- cell-index : dma channel index starts at 0.
|
||||
|
||||
Optional properties:
|
||||
- interrupts : <interrupt mapping for DMA channel IRQ>
|
||||
(on 83xx this is expected to be identical to
|
||||
the interrupts property of the parent node)
|
||||
- interrupt-parent : optional, if needed for interrupt mapping
|
||||
|
||||
Example:
|
||||
dma@82a8 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
compatible = "fsl,mpc8349-dma", "fsl,elo-dma";
|
||||
reg = <82a8 4>;
|
||||
ranges = <0 8100 1a4>;
|
||||
interrupt-parent = <&ipic>;
|
||||
interrupts = <47 8>;
|
||||
cell-index = <0>;
|
||||
dma-channel@0 {
|
||||
compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
|
||||
cell-index = <0>;
|
||||
reg = <0 80>;
|
||||
};
|
||||
dma-channel@80 {
|
||||
compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
|
||||
cell-index = <1>;
|
||||
reg = <80 80>;
|
||||
};
|
||||
dma-channel@100 {
|
||||
compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
|
||||
cell-index = <2>;
|
||||
reg = <100 80>;
|
||||
};
|
||||
dma-channel@180 {
|
||||
compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
|
||||
cell-index = <3>;
|
||||
reg = <180 80>;
|
||||
};
|
||||
};
|
||||
|
||||
* Freescale 85xx/86xx DMA Controller
|
||||
|
||||
Freescale PowerPC 85xx/86xx have on chip general purpose DMA controllers.
|
||||
|
||||
Required properties:
|
||||
|
||||
- compatible : compatible list, contains 2 entries, first is
|
||||
"fsl,CHIP-dma", where CHIP is the processor
|
||||
(mpc8540, mpc8540, etc.) and the second is
|
||||
"fsl,eloplus-dma"
|
||||
- reg : <registers mapping for DMA general status reg>
|
||||
- cell-index : controller index. 0 for controller @ 0x21000,
|
||||
1 for controller @ 0xc000
|
||||
- ranges : Should be defined as specified in 1) to describe the
|
||||
DMA controller channels.
|
||||
|
||||
- DMA channel nodes:
|
||||
- compatible : compatible list, contains 2 entries, first is
|
||||
"fsl,CHIP-dma-channel", where CHIP is the processor
|
||||
(mpc8540, mpc8560, etc.) and the second is
|
||||
"fsl,eloplus-dma-channel"
|
||||
- cell-index : dma channel index starts at 0.
|
||||
- reg : <registers mapping for channel>
|
||||
- interrupts : <interrupt mapping for DMA channel IRQ>
|
||||
- interrupt-parent : optional, if needed for interrupt mapping
|
||||
|
||||
Example:
|
||||
dma@21300 {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
compatible = "fsl,mpc8540-dma", "fsl,eloplus-dma";
|
||||
reg = <21300 4>;
|
||||
ranges = <0 21100 200>;
|
||||
cell-index = <0>;
|
||||
dma-channel@0 {
|
||||
compatible = "fsl,mpc8540-dma-channel", "fsl,eloplus-dma-channel";
|
||||
reg = <0 80>;
|
||||
cell-index = <0>;
|
||||
interrupt-parent = <&mpic>;
|
||||
interrupts = <14 2>;
|
||||
};
|
||||
dma-channel@80 {
|
||||
compatible = "fsl,mpc8540-dma-channel", "fsl,eloplus-dma-channel";
|
||||
reg = <80 80>;
|
||||
cell-index = <1>;
|
||||
interrupt-parent = <&mpic>;
|
||||
interrupts = <15 2>;
|
||||
};
|
||||
dma-channel@100 {
|
||||
compatible = "fsl,mpc8540-dma-channel", "fsl,eloplus-dma-channel";
|
||||
reg = <100 80>;
|
||||
cell-index = <2>;
|
||||
interrupt-parent = <&mpic>;
|
||||
interrupts = <16 2>;
|
||||
};
|
||||
dma-channel@180 {
|
||||
compatible = "fsl,mpc8540-dma-channel", "fsl,eloplus-dma-channel";
|
||||
reg = <180 80>;
|
||||
cell-index = <3>;
|
||||
interrupt-parent = <&mpic>;
|
||||
interrupts = <17 2>;
|
||||
};
|
||||
};
|
31
Documentation/powerpc/dts-bindings/fsl/gtm.txt
Normal file
31
Documentation/powerpc/dts-bindings/fsl/gtm.txt
Normal file
@ -0,0 +1,31 @@
|
||||
* Freescale General-purpose Timers Module
|
||||
|
||||
Required properties:
|
||||
- compatible : should be
|
||||
"fsl,<chip>-gtm", "fsl,gtm" for SOC GTMs
|
||||
"fsl,<chip>-qe-gtm", "fsl,qe-gtm", "fsl,gtm" for QE GTMs
|
||||
"fsl,<chip>-cpm2-gtm", "fsl,cpm2-gtm", "fsl,gtm" for CPM2 GTMs
|
||||
- reg : should contain gtm registers location and length (0x40).
|
||||
- interrupts : should contain four interrupts.
|
||||
- interrupt-parent : interrupt source phandle.
|
||||
- clock-frequency : specifies the frequency driving the timer.
|
||||
|
||||
Example:
|
||||
|
||||
timer@500 {
|
||||
compatible = "fsl,mpc8360-gtm", "fsl,gtm";
|
||||
reg = <0x500 0x40>;
|
||||
interrupts = <90 8 78 8 84 8 72 8>;
|
||||
interrupt-parent = <&ipic>;
|
||||
/* filled by u-boot */
|
||||
clock-frequency = <0>;
|
||||
};
|
||||
|
||||
timer@440 {
|
||||
compatible = "fsl,mpc8360-qe-gtm", "fsl,qe-gtm", "fsl,gtm";
|
||||
reg = <0x440 0x40>;
|
||||
interrupts = <12 13 14 15>;
|
||||
interrupt-parent = <&qeic>;
|
||||
/* filled by u-boot */
|
||||
clock-frequency = <0>;
|
||||
};
|
25
Documentation/powerpc/dts-bindings/fsl/guts.txt
Normal file
25
Documentation/powerpc/dts-bindings/fsl/guts.txt
Normal file
@ -0,0 +1,25 @@
|
||||
* Global Utilities Block
|
||||
|
||||
The global utilities block controls power management, I/O device
|
||||
enabling, power-on-reset configuration monitoring, general-purpose
|
||||
I/O signal configuration, alternate function selection for multiplexed
|
||||
signals, and clock control.
|
||||
|
||||
Required properties:
|
||||
|
||||
- compatible : Should define the compatible device type for
|
||||
global-utilities.
|
||||
- reg : Offset and length of the register set for the device.
|
||||
|
||||
Recommended properties:
|
||||
|
||||
- fsl,has-rstcr : Indicates that the global utilities register set
|
||||
contains a functioning "reset control register" (i.e. the board
|
||||
is wired to reset upon setting the HRESET_REQ bit in this register).
|
||||
|
||||
Example:
|
||||
global-utilities@e0000 { /* global utilities block */
|
||||
compatible = "fsl,mpc8548-guts";
|
||||
reg = <e0000 1000>;
|
||||
fsl,has-rstcr;
|
||||
};
|
32
Documentation/powerpc/dts-bindings/fsl/i2c.txt
Normal file
32
Documentation/powerpc/dts-bindings/fsl/i2c.txt
Normal file
@ -0,0 +1,32 @@
|
||||
* I2C
|
||||
|
||||
Required properties :
|
||||
|
||||
- device_type : Should be "i2c"
|
||||
- reg : Offset and length of the register set for the device
|
||||
|
||||
Recommended properties :
|
||||
|
||||
- compatible : Should be "fsl-i2c" for parts compatible with
|
||||
Freescale I2C specifications.
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and level
|
||||
information for the interrupt. This should be encoded based on
|
||||
the information in section 2) depending on the type of interrupt
|
||||
controller you have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
- dfsrr : boolean; if defined, indicates that this I2C device has
|
||||
a digital filter sampling rate register
|
||||
- fsl5200-clocking : boolean; if defined, indicated that this device
|
||||
uses the FSL 5200 clocking mechanism.
|
||||
|
||||
Example :
|
||||
i2c@3000 {
|
||||
interrupt-parent = <40000>;
|
||||
interrupts = <1b 3>;
|
||||
reg = <3000 18>;
|
||||
device_type = "i2c";
|
||||
compatible = "fsl-i2c";
|
||||
dfsrr;
|
||||
};
|
35
Documentation/powerpc/dts-bindings/fsl/lbc.txt
Normal file
35
Documentation/powerpc/dts-bindings/fsl/lbc.txt
Normal file
@ -0,0 +1,35 @@
|
||||
* Chipselect/Local Bus
|
||||
|
||||
Properties:
|
||||
- name : Should be localbus
|
||||
- #address-cells : Should be either two or three. The first cell is the
|
||||
chipselect number, and the remaining cells are the
|
||||
offset into the chipselect.
|
||||
- #size-cells : Either one or two, depending on how large each chipselect
|
||||
can be.
|
||||
- ranges : Each range corresponds to a single chipselect, and cover
|
||||
the entire access window as configured.
|
||||
|
||||
Example:
|
||||
localbus@f0010100 {
|
||||
compatible = "fsl,mpc8272-localbus",
|
||||
"fsl,pq2-localbus";
|
||||
#address-cells = <2>;
|
||||
#size-cells = <1>;
|
||||
reg = <f0010100 40>;
|
||||
|
||||
ranges = <0 0 fe000000 02000000
|
||||
1 0 f4500000 00008000>;
|
||||
|
||||
flash@0,0 {
|
||||
compatible = "jedec-flash";
|
||||
reg = <0 0 2000000>;
|
||||
bank-width = <4>;
|
||||
device-width = <1>;
|
||||
};
|
||||
|
||||
board-control@1,0 {
|
||||
reg = <1 0 20>;
|
||||
compatible = "fsl,mpc8272ads-bcsr";
|
||||
};
|
||||
};
|
36
Documentation/powerpc/dts-bindings/fsl/msi-pic.txt
Normal file
36
Documentation/powerpc/dts-bindings/fsl/msi-pic.txt
Normal file
@ -0,0 +1,36 @@
|
||||
* Freescale MSI interrupt controller
|
||||
|
||||
Reguired properities:
|
||||
- compatible : compatible list, contains 2 entries,
|
||||
first is "fsl,CHIP-msi", where CHIP is the processor(mpc8610, mpc8572,
|
||||
etc.) and the second is "fsl,mpic-msi" or "fsl,ipic-msi" depending on
|
||||
the parent type.
|
||||
- reg : should contain the address and the length of the shared message
|
||||
interrupt register set.
|
||||
- msi-available-ranges: use <start count> style section to define which
|
||||
msi interrupt can be used in the 256 msi interrupts. This property is
|
||||
optional, without this, all the 256 MSI interrupts can be used.
|
||||
- interrupts : each one of the interrupts here is one entry per 32 MSIs,
|
||||
and routed to the host interrupt controller. the interrupts should
|
||||
be set as edge sensitive.
|
||||
- interrupt-parent: the phandle for the interrupt controller
|
||||
that services interrupts for this device. for 83xx cpu, the interrupts
|
||||
are routed to IPIC, and for 85xx/86xx cpu the interrupts are routed
|
||||
to MPIC.
|
||||
|
||||
Example:
|
||||
msi@41600 {
|
||||
compatible = "fsl,mpc8610-msi", "fsl,mpic-msi";
|
||||
reg = <0x41600 0x80>;
|
||||
msi-available-ranges = <0 0x100>;
|
||||
interrupts = <
|
||||
0xe0 0
|
||||
0xe1 0
|
||||
0xe2 0
|
||||
0xe3 0
|
||||
0xe4 0
|
||||
0xe5 0
|
||||
0xe6 0
|
||||
0xe7 0>;
|
||||
interrupt-parent = <&mpic>;
|
||||
};
|
29
Documentation/powerpc/dts-bindings/fsl/sata.txt
Normal file
29
Documentation/powerpc/dts-bindings/fsl/sata.txt
Normal file
@ -0,0 +1,29 @@
|
||||
* Freescale 8xxx/3.0 Gb/s SATA nodes
|
||||
|
||||
SATA nodes are defined to describe on-chip Serial ATA controllers.
|
||||
Each SATA port should have its own node.
|
||||
|
||||
Required properties:
|
||||
- compatible : compatible list, contains 2 entries, first is
|
||||
"fsl,CHIP-sata", where CHIP is the processor
|
||||
(mpc8315, mpc8379, etc.) and the second is
|
||||
"fsl,pq-sata"
|
||||
- interrupts : <interrupt mapping for SATA IRQ>
|
||||
- cell-index : controller index.
|
||||
1 for controller @ 0x18000
|
||||
2 for controller @ 0x19000
|
||||
3 for controller @ 0x1a000
|
||||
4 for controller @ 0x1b000
|
||||
|
||||
Optional properties:
|
||||
- interrupt-parent : optional, if needed for interrupt mapping
|
||||
- reg : <registers mapping>
|
||||
|
||||
Example:
|
||||
sata@18000 {
|
||||
compatible = "fsl,mpc8379-sata", "fsl,pq-sata";
|
||||
reg = <0x18000 0x1000>;
|
||||
cell-index = <1>;
|
||||
interrupts = <2c 8>;
|
||||
interrupt-parent = < &ipic >;
|
||||
};
|
68
Documentation/powerpc/dts-bindings/fsl/sec.txt
Normal file
68
Documentation/powerpc/dts-bindings/fsl/sec.txt
Normal file
@ -0,0 +1,68 @@
|
||||
Freescale SoC SEC Security Engines
|
||||
|
||||
Required properties:
|
||||
|
||||
- compatible : Should contain entries for this and backward compatible
|
||||
SEC versions, high to low, e.g., "fsl,sec2.1", "fsl,sec2.0"
|
||||
- reg : Offset and length of the register set for the device
|
||||
- interrupts : the SEC's interrupt number
|
||||
- fsl,num-channels : An integer representing the number of channels
|
||||
available.
|
||||
- fsl,channel-fifo-len : An integer representing the number of
|
||||
descriptor pointers each channel fetch fifo can hold.
|
||||
- fsl,exec-units-mask : The bitmask representing what execution units
|
||||
(EUs) are available. It's a single 32-bit cell. EU information
|
||||
should be encoded following the SEC's Descriptor Header Dword
|
||||
EU_SEL0 field documentation, i.e. as follows:
|
||||
|
||||
bit 0 = reserved - should be 0
|
||||
bit 1 = set if SEC has the ARC4 EU (AFEU)
|
||||
bit 2 = set if SEC has the DES/3DES EU (DEU)
|
||||
bit 3 = set if SEC has the message digest EU (MDEU/MDEU-A)
|
||||
bit 4 = set if SEC has the random number generator EU (RNG)
|
||||
bit 5 = set if SEC has the public key EU (PKEU)
|
||||
bit 6 = set if SEC has the AES EU (AESU)
|
||||
bit 7 = set if SEC has the Kasumi EU (KEU)
|
||||
bit 8 = set if SEC has the CRC EU (CRCU)
|
||||
bit 11 = set if SEC has the message digest EU extended alg set (MDEU-B)
|
||||
|
||||
remaining bits are reserved for future SEC EUs.
|
||||
|
||||
- fsl,descriptor-types-mask : The bitmask representing what descriptors
|
||||
are available. It's a single 32-bit cell. Descriptor type information
|
||||
should be encoded following the SEC's Descriptor Header Dword DESC_TYPE
|
||||
field documentation, i.e. as follows:
|
||||
|
||||
bit 0 = set if SEC supports the aesu_ctr_nonsnoop desc. type
|
||||
bit 1 = set if SEC supports the ipsec_esp descriptor type
|
||||
bit 2 = set if SEC supports the common_nonsnoop desc. type
|
||||
bit 3 = set if SEC supports the 802.11i AES ccmp desc. type
|
||||
bit 4 = set if SEC supports the hmac_snoop_no_afeu desc. type
|
||||
bit 5 = set if SEC supports the srtp descriptor type
|
||||
bit 6 = set if SEC supports the non_hmac_snoop_no_afeu desc.type
|
||||
bit 7 = set if SEC supports the pkeu_assemble descriptor type
|
||||
bit 8 = set if SEC supports the aesu_key_expand_output desc.type
|
||||
bit 9 = set if SEC supports the pkeu_ptmul descriptor type
|
||||
bit 10 = set if SEC supports the common_nonsnoop_afeu desc. type
|
||||
bit 11 = set if SEC supports the pkeu_ptadd_dbl descriptor type
|
||||
|
||||
..and so on and so forth.
|
||||
|
||||
Optional properties:
|
||||
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
|
||||
Example:
|
||||
|
||||
/* MPC8548E */
|
||||
crypto@30000 {
|
||||
compatible = "fsl,sec2.1", "fsl,sec2.0";
|
||||
reg = <0x30000 0x10000>;
|
||||
interrupts = <29 2>;
|
||||
interrupt-parent = <&mpic>;
|
||||
fsl,num-channels = <4>;
|
||||
fsl,channel-fifo-len = <24>;
|
||||
fsl,exec-units-mask = <0xfe>;
|
||||
fsl,descriptor-types-mask = <0x12b0ebf>;
|
||||
};
|
24
Documentation/powerpc/dts-bindings/fsl/spi.txt
Normal file
24
Documentation/powerpc/dts-bindings/fsl/spi.txt
Normal file
@ -0,0 +1,24 @@
|
||||
* SPI (Serial Peripheral Interface)
|
||||
|
||||
Required properties:
|
||||
- cell-index : SPI controller index.
|
||||
- compatible : should be "fsl,spi".
|
||||
- mode : the SPI operation mode, it can be "cpu" or "cpu-qe".
|
||||
- reg : Offset and length of the register set for the device
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and level
|
||||
information for the interrupt. This should be encoded based on
|
||||
the information in section 2) depending on the type of interrupt
|
||||
controller you have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
|
||||
Example:
|
||||
spi@4c0 {
|
||||
cell-index = <0>;
|
||||
compatible = "fsl,spi";
|
||||
reg = <4c0 40>;
|
||||
interrupts = <82 0>;
|
||||
interrupt-parent = <700>;
|
||||
mode = "cpu";
|
||||
};
|
38
Documentation/powerpc/dts-bindings/fsl/ssi.txt
Normal file
38
Documentation/powerpc/dts-bindings/fsl/ssi.txt
Normal file
@ -0,0 +1,38 @@
|
||||
Freescale Synchronous Serial Interface
|
||||
|
||||
The SSI is a serial device that communicates with audio codecs. It can
|
||||
be programmed in AC97, I2S, left-justified, or right-justified modes.
|
||||
|
||||
Required properties:
|
||||
- compatible : compatible list, containing "fsl,ssi"
|
||||
- cell-index : the SSI, <0> = SSI1, <1> = SSI2, and so on
|
||||
- reg : offset and length of the register set for the device
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and
|
||||
level information for the interrupt. This should be
|
||||
encoded based on the information in section 2)
|
||||
depending on the type of interrupt controller you
|
||||
have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
- fsl,mode : the operating mode for the SSI interface
|
||||
"i2s-slave" - I2S mode, SSI is clock slave
|
||||
"i2s-master" - I2S mode, SSI is clock master
|
||||
"lj-slave" - left-justified mode, SSI is clock slave
|
||||
"lj-master" - l.j. mode, SSI is clock master
|
||||
"rj-slave" - right-justified mode, SSI is clock slave
|
||||
"rj-master" - r.j., SSI is clock master
|
||||
"ac97-slave" - AC97 mode, SSI is clock slave
|
||||
"ac97-master" - AC97 mode, SSI is clock master
|
||||
|
||||
Optional properties:
|
||||
- codec-handle : phandle to a 'codec' node that defines an audio
|
||||
codec connected to this SSI. This node is typically
|
||||
a child of an I2C or other control node.
|
||||
|
||||
Child 'codec' node required properties:
|
||||
- compatible : compatible list, contains the name of the codec
|
||||
|
||||
Child 'codec' node optional properties:
|
||||
- clock-frequency : The frequency of the input clock, which typically
|
||||
comes from an on-board dedicated oscillator.
|
69
Documentation/powerpc/dts-bindings/fsl/tsec.txt
Normal file
69
Documentation/powerpc/dts-bindings/fsl/tsec.txt
Normal file
@ -0,0 +1,69 @@
|
||||
* MDIO IO device
|
||||
|
||||
The MDIO is a bus to which the PHY devices are connected. For each
|
||||
device that exists on this bus, a child node should be created. See
|
||||
the definition of the PHY node below for an example of how to define
|
||||
a PHY.
|
||||
|
||||
Required properties:
|
||||
- reg : Offset and length of the register set for the device
|
||||
- compatible : Should define the compatible device type for the
|
||||
mdio. Currently, this is most likely to be "fsl,gianfar-mdio"
|
||||
|
||||
Example:
|
||||
|
||||
mdio@24520 {
|
||||
reg = <24520 20>;
|
||||
compatible = "fsl,gianfar-mdio";
|
||||
|
||||
ethernet-phy@0 {
|
||||
......
|
||||
};
|
||||
};
|
||||
|
||||
|
||||
* Gianfar-compatible ethernet nodes
|
||||
|
||||
Required properties:
|
||||
|
||||
- device_type : Should be "network"
|
||||
- model : Model of the device. Can be "TSEC", "eTSEC", or "FEC"
|
||||
- compatible : Should be "gianfar"
|
||||
- reg : Offset and length of the register set for the device
|
||||
- mac-address : List of bytes representing the ethernet address of
|
||||
this controller
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and level
|
||||
information for the interrupt. This should be encoded based on
|
||||
the information in section 2) depending on the type of interrupt
|
||||
controller you have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
- phy-handle : The phandle for the PHY connected to this ethernet
|
||||
controller.
|
||||
- fixed-link : <a b c d e> where a is emulated phy id - choose any,
|
||||
but unique to the all specified fixed-links, b is duplex - 0 half,
|
||||
1 full, c is link speed - d#10/d#100/d#1000, d is pause - 0 no
|
||||
pause, 1 pause, e is asym_pause - 0 no asym_pause, 1 asym_pause.
|
||||
|
||||
Recommended properties:
|
||||
|
||||
- phy-connection-type : a string naming the controller/PHY interface type,
|
||||
i.e., "mii" (default), "rmii", "gmii", "rgmii", "rgmii-id", "sgmii",
|
||||
"tbi", or "rtbi". This property is only really needed if the connection
|
||||
is of type "rgmii-id", as all other connection types are detected by
|
||||
hardware.
|
||||
|
||||
|
||||
Example:
|
||||
ethernet@24000 {
|
||||
#size-cells = <0>;
|
||||
device_type = "network";
|
||||
model = "TSEC";
|
||||
compatible = "gianfar";
|
||||
reg = <24000 1000>;
|
||||
mac-address = [ 00 E0 0C 00 73 00 ];
|
||||
interrupts = <d 3 e 3 12 3>;
|
||||
interrupt-parent = <40000>;
|
||||
phy-handle = <2452000>
|
||||
};
|
59
Documentation/powerpc/dts-bindings/fsl/usb.txt
Normal file
59
Documentation/powerpc/dts-bindings/fsl/usb.txt
Normal file
@ -0,0 +1,59 @@
|
||||
Freescale SOC USB controllers
|
||||
|
||||
The device node for a USB controller that is part of a Freescale
|
||||
SOC is as described in the document "Open Firmware Recommended
|
||||
Practice : Universal Serial Bus" with the following modifications
|
||||
and additions :
|
||||
|
||||
Required properties :
|
||||
- compatible : Should be "fsl-usb2-mph" for multi port host USB
|
||||
controllers, or "fsl-usb2-dr" for dual role USB controllers
|
||||
- phy_type : For multi port host USB controllers, should be one of
|
||||
"ulpi", or "serial". For dual role USB controllers, should be
|
||||
one of "ulpi", "utmi", "utmi_wide", or "serial".
|
||||
- reg : Offset and length of the register set for the device
|
||||
- port0 : boolean; if defined, indicates port0 is connected for
|
||||
fsl-usb2-mph compatible controllers. Either this property or
|
||||
"port1" (or both) must be defined for "fsl-usb2-mph" compatible
|
||||
controllers.
|
||||
- port1 : boolean; if defined, indicates port1 is connected for
|
||||
fsl-usb2-mph compatible controllers. Either this property or
|
||||
"port0" (or both) must be defined for "fsl-usb2-mph" compatible
|
||||
controllers.
|
||||
- dr_mode : indicates the working mode for "fsl-usb2-dr" compatible
|
||||
controllers. Can be "host", "peripheral", or "otg". Default to
|
||||
"host" if not defined for backward compatibility.
|
||||
|
||||
Recommended properties :
|
||||
- interrupts : <a b> where a is the interrupt number and b is a
|
||||
field that represents an encoding of the sense and level
|
||||
information for the interrupt. This should be encoded based on
|
||||
the information in section 2) depending on the type of interrupt
|
||||
controller you have.
|
||||
- interrupt-parent : the phandle for the interrupt controller that
|
||||
services interrupts for this device.
|
||||
|
||||
Example multi port host USB controller device node :
|
||||
usb@22000 {
|
||||
compatible = "fsl-usb2-mph";
|
||||
reg = <22000 1000>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
interrupt-parent = <700>;
|
||||
interrupts = <27 1>;
|
||||
phy_type = "ulpi";
|
||||
port0;
|
||||
port1;
|
||||
};
|
||||
|
||||
Example dual role USB controller device node :
|
||||
usb@23000 {
|
||||
compatible = "fsl-usb2-dr";
|
||||
reg = <23000 1000>;
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
interrupt-parent = <700>;
|
||||
interrupts = <26 1>;
|
||||
dr_mode = "otg";
|
||||
phy = "ulpi";
|
||||
};
|
@ -61,10 +61,7 @@ builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your
|
||||
arch_init_sched_domains function. This function will attach domains to all
|
||||
CPUs using cpu_attach_domain.
|
||||
|
||||
Implementors should change the line
|
||||
#undef SCHED_DOMAIN_DEBUG
|
||||
to
|
||||
#define SCHED_DOMAIN_DEBUG
|
||||
in kernel/sched.c as this enables an error checking parse of the sched domains
|
||||
The sched-domains debugging infrastructure can be enabled by enabling
|
||||
CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains
|
||||
which should catch most possible errors (described above). It also prints out
|
||||
the domain structure in a visual format.
|
||||
|
@ -51,9 +51,9 @@ needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
|
||||
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
|
||||
of 0.00015s.
|
||||
|
||||
The remaining CPU time will be used for user input and other tass. Because
|
||||
The remaining CPU time will be used for user input and other tasks. Because
|
||||
realtime tasks have explicitly allocated the CPU time they need to perform
|
||||
their tasks, buffer underruns in the graphocs or audio can be eliminated.
|
||||
their tasks, buffer underruns in the graphics or audio can be eliminated.
|
||||
|
||||
NOTE: the above example is not fully implemented as of yet (2.6.25). We still
|
||||
lack an EDF scheduler to make non-uniform periods usable.
|
||||
|
@ -56,19 +56,33 @@ Supported Cards/Chipsets
|
||||
9005:0285:9005:02d1 Adaptec 5405 (Voodoo40)
|
||||
9005:0285:15d9:02d2 SMC AOC-USAS-S8i-LP
|
||||
9005:0285:15d9:02d3 SMC AOC-USAS-S8iR-LP
|
||||
9005:0285:9005:02d4 Adaptec 2045 (Voodoo04 Lite)
|
||||
9005:0285:9005:02d5 Adaptec 2405 (Voodoo40 Lite)
|
||||
9005:0285:9005:02d6 Adaptec 2445 (Voodoo44 Lite)
|
||||
9005:0285:9005:02d7 Adaptec 2805 (Voodoo80 Lite)
|
||||
9005:0285:9005:02d4 Adaptec ASR-2045 (Voodoo04 Lite)
|
||||
9005:0285:9005:02d5 Adaptec ASR-2405 (Voodoo40 Lite)
|
||||
9005:0285:9005:02d6 Adaptec ASR-2445 (Voodoo44 Lite)
|
||||
9005:0285:9005:02d7 Adaptec ASR-2805 (Voodoo80 Lite)
|
||||
9005:0285:9005:02d8 Adaptec 5405G (Voodoo40 PM)
|
||||
9005:0285:9005:02d9 Adaptec 5445G (Voodoo44 PM)
|
||||
9005:0285:9005:02da Adaptec 5805G (Voodoo80 PM)
|
||||
9005:0285:9005:02db Adaptec 5085G (Voodoo08 PM)
|
||||
9005:0285:9005:02dc Adaptec 51245G (Voodoo124 PM)
|
||||
9005:0285:9005:02dd Adaptec 51645G (Voodoo164 PM)
|
||||
9005:0285:9005:02de Adaptec 52445G (Voodoo244 PM)
|
||||
9005:0285:9005:02df Adaptec ASR-2045G (Voodoo04 Lite PM)
|
||||
9005:0285:9005:02e0 Adaptec ASR-2405G (Voodoo40 Lite PM)
|
||||
9005:0285:9005:02e1 Adaptec ASR-2445G (Voodoo44 Lite PM)
|
||||
9005:0285:9005:02e2 Adaptec ASR-2805G (Voodoo80 Lite PM)
|
||||
1011:0046:9005:0364 Adaptec 5400S (Mustang)
|
||||
1011:0046:9005:0365 Adaptec 5400S (Mustang)
|
||||
9005:0287:9005:0800 Adaptec Themisto (Jupiter)
|
||||
9005:0200:9005:0200 Adaptec Themisto (Jupiter)
|
||||
9005:0286:9005:0800 Adaptec Callisto (Jupiter)
|
||||
1011:0046:9005:1364 Dell PERC 2/QC (Quad Channel, Mustang)
|
||||
1011:0046:9005:1365 Dell PERC 2/QC (Quad Channel, Mustang)
|
||||
1028:0001:1028:0001 Dell PERC 2/Si (Iguana)
|
||||
1028:0003:1028:0003 Dell PERC 3/Si (SlimFast)
|
||||
1028:0002:1028:0002 Dell PERC 3/Di (Opal)
|
||||
1028:0004:1028:0004 Dell PERC 3/DiF (Iguana)
|
||||
1028:0004:1028:0004 Dell PERC 3/SiF (Iguana)
|
||||
1028:0004:1028:00d0 Dell PERC 3/DiF (Iguana)
|
||||
1028:0002:1028:00d1 Dell PERC 3/DiV (Viper)
|
||||
1028:0002:1028:00d9 Dell PERC 3/DiL (Lexus)
|
||||
1028:000a:1028:0106 Dell PERC 3/DiJ (Jaguar)
|
||||
|
@ -753,8 +753,11 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
|
||||
|
||||
[Multiple options for each card instance]
|
||||
model - force the model name
|
||||
position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size)
|
||||
position_fix - Fix DMA pointer (0 = auto, 1 = use LPIB, 2 = POSBUF)
|
||||
probe_mask - Bitmask to probe codecs (default = -1, meaning all slots)
|
||||
bdl_pos_adj - Specifies the DMA IRQ timing delay in samples.
|
||||
Passing -1 will make the driver to choose the appropriate
|
||||
value based on the controller chip.
|
||||
|
||||
[Single (global) options]
|
||||
single_cmd - Use single immediate commands to communicate with
|
||||
@ -845,7 +848,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
|
||||
ALC269
|
||||
basic Basic preset
|
||||
|
||||
ALC662
|
||||
ALC662/663
|
||||
3stack-dig 3-stack (2-channel) with SPDIF
|
||||
3stack-6ch 3-stack (6-channel)
|
||||
3stack-6ch-dig 3-stack (6-channel) with SPDIF
|
||||
@ -853,6 +856,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
|
||||
lenovo-101e Lenovo laptop
|
||||
eeepc-p701 ASUS Eeepc P701
|
||||
eeepc-ep20 ASUS Eeepc EP20
|
||||
m51va ASUS M51VA
|
||||
g71v ASUS G71V
|
||||
h13 ASUS H13
|
||||
g50v ASUS G50V
|
||||
auto auto-config reading BIOS (default)
|
||||
|
||||
ALC882/885
|
||||
@ -1091,7 +1098,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
|
||||
This occurs when the access to non-existing or non-working codec slot
|
||||
(likely a modem one) causes a stall of the communication via HD-audio
|
||||
bus. You can see which codec slots are probed by enabling
|
||||
CONFIG_SND_DEBUG_DETECT, or simply from the file name of the codec
|
||||
CONFIG_SND_DEBUG_VERBOSE, or simply from the file name of the codec
|
||||
proc files. Then limit the slots to probe by probe_mask option.
|
||||
For example, probe_mask=1 means to probe only the first slot, and
|
||||
probe_mask=4 means only the third slot.
|
||||
@ -2267,6 +2274,10 @@ case above again, the first two slots are already reserved. If any
|
||||
other driver (e.g. snd-usb-audio) is loaded before snd-interwave or
|
||||
snd-ens1371, it will be assigned to the third or later slot.
|
||||
|
||||
When a module name is given with '!', the slot will be given for any
|
||||
modules but that name. For example, "slots=!snd-pcsp" will reserve
|
||||
the first slot for any modules but snd-pcsp.
|
||||
|
||||
|
||||
ALSA PCM devices to OSS devices mapping
|
||||
=======================================
|
||||
|
@ -6127,8 +6127,8 @@ struct _snd_pcm_runtime {
|
||||
|
||||
<para>
|
||||
<function>snd_printdd()</function> is compiled in only when
|
||||
<constant>CONFIG_SND_DEBUG_DETECT</constant> is set. Please note
|
||||
that <constant>DEBUG_DETECT</constant> is not set as default
|
||||
<constant>CONFIG_SND_DEBUG_VERBOSE</constant> is set. Please note
|
||||
that <constant>CONFIG_SND_DEBUG_VERBOSE</constant> is not set as default
|
||||
even if you configure the alsa-driver with
|
||||
<option>--with-debug=full</option> option. You need to give
|
||||
explicitly <option>--with-debug=detect</option> option instead.
|
||||
|
164
Documentation/tracers/mmiotrace.txt
Normal file
164
Documentation/tracers/mmiotrace.txt
Normal file
@ -0,0 +1,164 @@
|
||||
In-kernel memory-mapped I/O tracing
|
||||
|
||||
|
||||
Home page and links to optional user space tools:
|
||||
|
||||
http://nouveau.freedesktop.org/wiki/MmioTrace
|
||||
|
||||
MMIO tracing was originally developed by Intel around 2003 for their Fault
|
||||
Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
|
||||
Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau
|
||||
project in mind. Since then many people have contributed.
|
||||
|
||||
Mmiotrace was built for reverse engineering any memory-mapped IO device with
|
||||
the Nouveau project as the first real user. Only x86 and x86_64 architectures
|
||||
are supported.
|
||||
|
||||
Out-of-tree mmiotrace was originally modified for mainline inclusion and
|
||||
ftrace framework by Pekka Paalanen <pq@iki.fi>.
|
||||
|
||||
|
||||
Preparation
|
||||
-----------
|
||||
|
||||
Mmiotrace feature is compiled in by the CONFIG_MMIOTRACE option. Tracing is
|
||||
disabled by default, so it is safe to have this set to yes. SMP systems are
|
||||
supported, but tracing is unreliable and may miss events if more than one CPU
|
||||
is on-line, therefore mmiotrace takes all but one CPU off-line during run-time
|
||||
activation. You can re-enable CPUs by hand, but you have been warned, there
|
||||
is no way to automatically detect if you are losing events due to CPUs racing.
|
||||
|
||||
|
||||
Usage Quick Reference
|
||||
---------------------
|
||||
|
||||
$ mount -t debugfs debugfs /debug
|
||||
$ echo mmiotrace > /debug/tracing/current_tracer
|
||||
$ cat /debug/tracing/trace_pipe > mydump.txt &
|
||||
Start X or whatever.
|
||||
$ echo "X is up" > /debug/tracing/marker
|
||||
$ echo none > /debug/tracing/current_tracer
|
||||
Check for lost events.
|
||||
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
Make sure debugfs is mounted to /debug. If not, (requires root privileges)
|
||||
$ mount -t debugfs debugfs /debug
|
||||
|
||||
Check that the driver you are about to trace is not loaded.
|
||||
|
||||
Activate mmiotrace (requires root privileges):
|
||||
$ echo mmiotrace > /debug/tracing/current_tracer
|
||||
|
||||
Start storing the trace:
|
||||
$ cat /debug/tracing/trace_pipe > mydump.txt &
|
||||
The 'cat' process should stay running (sleeping) in the background.
|
||||
|
||||
Load the driver you want to trace and use it. Mmiotrace will only catch MMIO
|
||||
accesses to areas that are ioremapped while mmiotrace is active.
|
||||
|
||||
[Unimplemented feature:]
|
||||
During tracing you can place comments (markers) into the trace by
|
||||
$ echo "X is up" > /debug/tracing/marker
|
||||
This makes it easier to see which part of the (huge) trace corresponds to
|
||||
which action. It is recommended to place descriptive markers about what you
|
||||
do.
|
||||
|
||||
Shut down mmiotrace (requires root privileges):
|
||||
$ echo none > /debug/tracing/current_tracer
|
||||
The 'cat' process exits. If it does not, kill it by issuing 'fg' command and
|
||||
pressing ctrl+c.
|
||||
|
||||
Check that mmiotrace did not lose events due to a buffer filling up. Either
|
||||
$ grep -i lost mydump.txt
|
||||
which tells you exactly how many events were lost, or use
|
||||
$ dmesg
|
||||
to view your kernel log and look for "mmiotrace has lost events" warning. If
|
||||
events were lost, the trace is incomplete. You should enlarge the buffers and
|
||||
try again. Buffers are enlarged by first seeing how large the current buffers
|
||||
are:
|
||||
$ cat /debug/tracing/trace_entries
|
||||
gives you a number. Approximately double this number and write it back, for
|
||||
instance:
|
||||
$ echo 128000 > /debug/tracing/trace_entries
|
||||
Then start again from the top.
|
||||
|
||||
If you are doing a trace for a driver project, e.g. Nouveau, you should also
|
||||
do the following before sending your results:
|
||||
$ lspci -vvv > lspci.txt
|
||||
$ dmesg > dmesg.txt
|
||||
$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt
|
||||
and then send the .tar.gz file. The trace compresses considerably. Replace
|
||||
"pciid" and "nick" with the PCI ID or model name of your piece of hardware
|
||||
under investigation and your nick name.
|
||||
|
||||
|
||||
How Mmiotrace Works
|
||||
-------------------
|
||||
|
||||
Access to hardware IO-memory is gained by mapping addresses from PCI bus by
|
||||
calling one of the ioremap_*() functions. Mmiotrace is hooked into the
|
||||
__ioremap() function and gets called whenever a mapping is created. Mapping is
|
||||
an event that is recorded into the trace log. Note, that ISA range mappings
|
||||
are not caught, since the mapping always exists and is returned directly.
|
||||
|
||||
MMIO accesses are recorded via page faults. Just before __ioremap() returns,
|
||||
the mapped pages are marked as not present. Any access to the pages causes a
|
||||
fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace
|
||||
marks the page present, sets TF flag to achieve single stepping and exits the
|
||||
fault handler. The instruction that faulted is executed and debug trap is
|
||||
entered. Here mmiotrace again marks the page as not present. The instruction
|
||||
is decoded to get the type of operation (read/write), data width and the value
|
||||
read or written. These are stored to the trace log.
|
||||
|
||||
Setting the page present in the page fault handler has a race condition on SMP
|
||||
machines. During the single stepping other CPUs may run freely on that page
|
||||
and events can be missed without a notice. Re-enabling other CPUs during
|
||||
tracing is discouraged.
|
||||
|
||||
|
||||
Trace Log Format
|
||||
----------------
|
||||
|
||||
The raw log is text and easily filtered with e.g. grep and awk. One record is
|
||||
one line in the log. A record starts with a keyword, followed by keyword
|
||||
dependant arguments. Arguments are separated by a space, or continue until the
|
||||
end of line. The format for version 20070824 is as follows:
|
||||
|
||||
Explanation Keyword Space separated arguments
|
||||
---------------------------------------------------------------------------
|
||||
|
||||
read event R width, timestamp, map id, physical, value, PC, PID
|
||||
write event W width, timestamp, map id, physical, value, PC, PID
|
||||
ioremap event MAP timestamp, map id, physical, virtual, length, PC, PID
|
||||
iounmap event UNMAP timestamp, map id, PC, PID
|
||||
marker MARK timestamp, text
|
||||
version VERSION the string "20070824"
|
||||
info for reader LSPCI one line from lspci -v
|
||||
PCI address map PCIDEV space separated /proc/bus/pci/devices data
|
||||
unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID
|
||||
|
||||
Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual
|
||||
is a kernel virtual address. Width is the data width in bytes and value is the
|
||||
data value. Map id is an arbitrary id number identifying the mapping that was
|
||||
used in an operation. PC is the program counter and PID is process id. PC is
|
||||
zero if it is not recorded. PID is always zero as tracing MMIO accesses
|
||||
originating in user space memory is not yet supported.
|
||||
|
||||
For instance, the following awk filter will pass all 32-bit writes that target
|
||||
physical addresses in the range [0xfb73ce40, 0xfb800000[
|
||||
|
||||
$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 &&
|
||||
adr < 0xfb800000) print; }'
|
||||
|
||||
|
||||
Tools for Developers
|
||||
--------------------
|
||||
|
||||
The user space tools include utilities for:
|
||||
- replacing numeric addresses and values with hardware register names
|
||||
- replaying MMIO logs, i.e., re-executing the recorded writes
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
0 -> Unknown board (au0828)
|
||||
1 -> Hauppauge HVR950Q (au0828) [2040:7200]
|
||||
1 -> Hauppauge HVR950Q (au0828) [2040:7200,2040:7210,2040:7217,2040:721b,2040:721f,2040:7280,0fd9:0008]
|
||||
2 -> Hauppauge HVR850 (au0828) [2040:7240]
|
||||
3 -> DViCO FusionHDTV USB (au0828) [0fe9:d620]
|
||||
|
@ -1,7 +1,7 @@
|
||||
/*
|
||||
* Slabinfo: Tool to get reports about slabs
|
||||
*
|
||||
* (C) 2007 sgi, Christoph Lameter <clameter@sgi.com>
|
||||
* (C) 2007 sgi, Christoph Lameter
|
||||
*
|
||||
* Compile by:
|
||||
*
|
||||
@ -99,7 +99,7 @@ void fatal(const char *x, ...)
|
||||
|
||||
void usage(void)
|
||||
{
|
||||
printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
|
||||
printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
|
||||
"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
|
||||
"-a|--aliases Show aliases\n"
|
||||
"-A|--activity Most active slabs first\n"
|
||||
|
@ -266,4 +266,4 @@ of other objects.
|
||||
|
||||
slub_debug=FZ,dentry
|
||||
|
||||
Christoph Lameter, <clameter@sgi.com>, May 30, 2007
|
||||
Christoph Lameter, May 30, 2007
|
||||
|
900
Documentation/x86/i386/boot.txt
Normal file
900
Documentation/x86/i386/boot.txt
Normal file
@ -0,0 +1,900 @@
|
||||
THE LINUX/x86 BOOT PROTOCOL
|
||||
---------------------------
|
||||
|
||||
On the x86 platform, the Linux kernel uses a rather complicated boot
|
||||
convention. This has evolved partially due to historical aspects, as
|
||||
well as the desire in the early days to have the kernel itself be a
|
||||
bootable image, the complicated PC memory model and due to changed
|
||||
expectations in the PC industry caused by the effective demise of
|
||||
real-mode DOS as a mainstream operating system.
|
||||
|
||||
Currently, the following versions of the Linux/x86 boot protocol exist.
|
||||
|
||||
Old kernels: zImage/Image support only. Some very early kernels
|
||||
may not even support a command line.
|
||||
|
||||
Protocol 2.00: (Kernel 1.3.73) Added bzImage and initrd support, as
|
||||
well as a formalized way to communicate between the
|
||||
boot loader and the kernel. setup.S made relocatable,
|
||||
although the traditional setup area still assumed
|
||||
writable.
|
||||
|
||||
Protocol 2.01: (Kernel 1.3.76) Added a heap overrun warning.
|
||||
|
||||
Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol.
|
||||
Lower the conventional memory ceiling. No overwrite
|
||||
of the traditional setup area, thus making booting
|
||||
safe for systems which use the EBDA from SMM or 32-bit
|
||||
BIOS entry points. zImage deprecated but still
|
||||
supported.
|
||||
|
||||
Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible
|
||||
initrd address available to the bootloader.
|
||||
|
||||
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
|
||||
|
||||
Protocol 2.05: (Kernel 2.6.20) Make protected mode kernel relocatable.
|
||||
Introduce relocatable_kernel and kernel_alignment fields.
|
||||
|
||||
Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
|
||||
the boot command line.
|
||||
|
||||
Protocol 2.07: (Kernel 2.6.24) Added paravirtualised boot protocol.
|
||||
Introduced hardware_subarch and hardware_subarch_data
|
||||
and KEEP_SEGMENTS flag in load_flags.
|
||||
|
||||
Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format
|
||||
payload. Introduced payload_offset and payload length
|
||||
fields to aid in locating the payload.
|
||||
|
||||
Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical
|
||||
pointer to single linked list of struct setup_data.
|
||||
|
||||
**** MEMORY LAYOUT
|
||||
|
||||
The traditional memory map for the kernel loader, used for Image or
|
||||
zImage kernels, typically looks like:
|
||||
|
||||
| |
|
||||
0A0000 +------------------------+
|
||||
| Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
|
||||
09A000 +------------------------+
|
||||
| Command line |
|
||||
| Stack/heap | For use by the kernel real-mode code.
|
||||
098000 +------------------------+
|
||||
| Kernel setup | The kernel real-mode code.
|
||||
090200 +------------------------+
|
||||
| Kernel boot sector | The kernel legacy boot sector.
|
||||
090000 +------------------------+
|
||||
| Protected-mode kernel | The bulk of the kernel image.
|
||||
010000 +------------------------+
|
||||
| Boot loader | <- Boot sector entry point 0000:7C00
|
||||
001000 +------------------------+
|
||||
| Reserved for MBR/BIOS |
|
||||
000800 +------------------------+
|
||||
| Typically used by MBR |
|
||||
000600 +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
|
||||
When using bzImage, the protected-mode kernel was relocated to
|
||||
0x100000 ("high memory"), and the kernel real-mode block (boot sector,
|
||||
setup, and stack/heap) was made relocatable to any address between
|
||||
0x10000 and end of low memory. Unfortunately, in protocols 2.00 and
|
||||
2.01 the 0x90000+ memory range is still used internally by the kernel;
|
||||
the 2.02 protocol resolves that problem.
|
||||
|
||||
It is desirable to keep the "memory ceiling" -- the highest point in
|
||||
low memory touched by the boot loader -- as low as possible, since
|
||||
some newer BIOSes have begun to allocate some rather large amounts of
|
||||
memory, called the Extended BIOS Data Area, near the top of low
|
||||
memory. The boot loader should use the "INT 12h" BIOS call to verify
|
||||
how much low memory is available.
|
||||
|
||||
Unfortunately, if INT 12h reports that the amount of memory is too
|
||||
low, there is usually nothing the boot loader can do but to report an
|
||||
error to the user. The boot loader should therefore be designed to
|
||||
take up as little space in low memory as it reasonably can. For
|
||||
zImage or old bzImage kernels, which need data written into the
|
||||
0x90000 segment, the boot loader should make sure not to use memory
|
||||
above the 0x9A000 point; too many BIOSes will break above that point.
|
||||
|
||||
For a modern bzImage kernel with boot protocol version >= 2.02, a
|
||||
memory layout like the following is suggested:
|
||||
|
||||
~ ~
|
||||
| Protected-mode kernel |
|
||||
100000 +------------------------+
|
||||
| I/O memory hole |
|
||||
0A0000 +------------------------+
|
||||
| Reserved for BIOS | Leave as much as possible unused
|
||||
~ ~
|
||||
| Command line | (Can also be below the X+10000 mark)
|
||||
X+10000 +------------------------+
|
||||
| Stack/heap | For use by the kernel real-mode code.
|
||||
X+08000 +------------------------+
|
||||
| Kernel setup | The kernel real-mode code.
|
||||
| Kernel boot sector | The kernel legacy boot sector.
|
||||
X +------------------------+
|
||||
| Boot loader | <- Boot sector entry point 0000:7C00
|
||||
001000 +------------------------+
|
||||
| Reserved for MBR/BIOS |
|
||||
000800 +------------------------+
|
||||
| Typically used by MBR |
|
||||
000600 +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
... where the address X is as low as the design of the boot loader
|
||||
permits.
|
||||
|
||||
|
||||
**** THE REAL-MODE KERNEL HEADER
|
||||
|
||||
In the following text, and anywhere in the kernel boot sequence, "a
|
||||
sector" refers to 512 bytes. It is independent of the actual sector
|
||||
size of the underlying medium.
|
||||
|
||||
The first step in loading a Linux kernel should be to load the
|
||||
real-mode code (boot sector and setup code) and then examine the
|
||||
following header at offset 0x01f1. The real-mode code can total up to
|
||||
32K, although the boot loader may choose to load only the first two
|
||||
sectors (1K) and then examine the bootup sector size.
|
||||
|
||||
The header looks like:
|
||||
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
01F1/1 ALL(1 setup_sects The size of the setup in sectors
|
||||
01F2/2 ALL root_flags If set, the root is mounted readonly
|
||||
01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras
|
||||
01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only
|
||||
01FA/2 ALL vid_mode Video mode control
|
||||
01FC/2 ALL root_dev Default root device number
|
||||
01FE/2 ALL boot_flag 0xAA55 magic number
|
||||
0200/2 2.00+ jump Jump instruction
|
||||
0202/4 2.00+ header Magic signature "HdrS"
|
||||
0206/2 2.00+ version Boot protocol version supported
|
||||
0208/4 2.00+ realmode_swtch Boot loader hook (see below)
|
||||
020C/2 2.00+ start_sys The load-low segment (0x1000) (obsolete)
|
||||
020E/2 2.00+ kernel_version Pointer to kernel version string
|
||||
0210/1 2.00+ type_of_loader Boot loader identifier
|
||||
0211/1 2.00+ loadflags Boot protocol option flags
|
||||
0212/2 2.00+ setup_move_size Move to high memory size (used with hooks)
|
||||
0214/4 2.00+ code32_start Boot loader hook (see below)
|
||||
0218/4 2.00+ ramdisk_image initrd load address (set by boot loader)
|
||||
021C/4 2.00+ ramdisk_size initrd size (set by boot loader)
|
||||
0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only
|
||||
0224/2 2.01+ heap_end_ptr Free memory after setup end
|
||||
0226/2 N/A pad1 Unused
|
||||
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
|
||||
022C/4 2.03+ initrd_addr_max Highest legal initrd address
|
||||
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
|
||||
0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not
|
||||
0235/3 N/A pad2 Unused
|
||||
0238/4 2.06+ cmdline_size Maximum size of the kernel command line
|
||||
023C/4 2.07+ hardware_subarch Hardware subarchitecture
|
||||
0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data
|
||||
0248/4 2.08+ payload_offset Offset of kernel payload
|
||||
024C/4 2.08+ payload_length Length of kernel payload
|
||||
0250/8 2.09+ setup_data 64-bit physical pointer to linked list
|
||||
of struct setup_data
|
||||
|
||||
(1) For backwards compatibility, if the setup_sects field contains 0, the
|
||||
real value is 4.
|
||||
|
||||
(2) For boot protocol prior to 2.04, the upper two bytes of the syssize
|
||||
field are unusable, which means the size of a bzImage kernel
|
||||
cannot be determined.
|
||||
|
||||
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
|
||||
the boot protocol version is "old". Loading an old kernel, the
|
||||
following parameters should be assumed:
|
||||
|
||||
Image type = zImage
|
||||
initrd not supported
|
||||
Real-mode kernel must be located at 0x90000.
|
||||
|
||||
Otherwise, the "version" field contains the protocol version,
|
||||
e.g. protocol version 2.01 will contain 0x0201 in this field. When
|
||||
setting fields in the header, you must make sure only to set fields
|
||||
supported by the protocol version in use.
|
||||
|
||||
|
||||
**** DETAILS OF HEADER FIELDS
|
||||
|
||||
For each field, some are information from the kernel to the bootloader
|
||||
("read"), some are expected to be filled out by the bootloader
|
||||
("write"), and some are expected to be read and modified by the
|
||||
bootloader ("modify").
|
||||
|
||||
All general purpose boot loaders should write the fields marked
|
||||
(obligatory). Boot loaders who want to load the kernel at a
|
||||
nonstandard address should fill in the fields marked (reloc); other
|
||||
boot loaders can ignore those fields.
|
||||
|
||||
The byte order of all fields is littleendian (this is x86, after all.)
|
||||
|
||||
Field name: setup_sects
|
||||
Type: read
|
||||
Offset/size: 0x1f1/1
|
||||
Protocol: ALL
|
||||
|
||||
The size of the setup code in 512-byte sectors. If this field is
|
||||
0, the real value is 4. The real-mode code consists of the boot
|
||||
sector (always one 512-byte sector) plus the setup code.
|
||||
|
||||
Field name: root_flags
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1f2/2
|
||||
Protocol: ALL
|
||||
|
||||
If this field is nonzero, the root defaults to readonly. The use of
|
||||
this field is deprecated; use the "ro" or "rw" options on the
|
||||
command line instead.
|
||||
|
||||
Field name: syssize
|
||||
Type: read
|
||||
Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
|
||||
Protocol: 2.04+
|
||||
|
||||
The size of the protected-mode code in units of 16-byte paragraphs.
|
||||
For protocol versions older than 2.04 this field is only two bytes
|
||||
wide, and therefore cannot be trusted for the size of a kernel if
|
||||
the LOAD_HIGH flag is set.
|
||||
|
||||
Field name: ram_size
|
||||
Type: kernel internal
|
||||
Offset/size: 0x1f8/2
|
||||
Protocol: ALL
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
Field name: vid_mode
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x1fa/2
|
||||
|
||||
Please see the section on SPECIAL COMMAND LINE OPTIONS.
|
||||
|
||||
Field name: root_dev
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1fc/2
|
||||
Protocol: ALL
|
||||
|
||||
The default root device device number. The use of this field is
|
||||
deprecated, use the "root=" option on the command line instead.
|
||||
|
||||
Field name: boot_flag
|
||||
Type: read
|
||||
Offset/size: 0x1fe/2
|
||||
Protocol: ALL
|
||||
|
||||
Contains 0xAA55. This is the closest thing old Linux kernels have
|
||||
to a magic number.
|
||||
|
||||
Field name: jump
|
||||
Type: read
|
||||
Offset/size: 0x200/2
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains an x86 jump instruction, 0xEB followed by a signed offset
|
||||
relative to byte 0x202. This can be used to determine the size of
|
||||
the header.
|
||||
|
||||
Field name: header
|
||||
Type: read
|
||||
Offset/size: 0x202/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains the magic number "HdrS" (0x53726448).
|
||||
|
||||
Field name: version
|
||||
Type: read
|
||||
Offset/size: 0x206/2
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains the boot protocol version, in (major << 8)+minor format,
|
||||
e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
|
||||
10.17.
|
||||
|
||||
Field name: readmode_swtch
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x208/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
Field name: start_sys
|
||||
Type: read
|
||||
Offset/size: 0x20c/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The load low segment (0x1000). Obsolete.
|
||||
|
||||
Field name: kernel_version
|
||||
Type: read
|
||||
Offset/size: 0x20e/2
|
||||
Protocol: 2.00+
|
||||
|
||||
If set to a nonzero value, contains a pointer to a NUL-terminated
|
||||
human-readable kernel version number string, less 0x200. This can
|
||||
be used to display the kernel version to the user. This value
|
||||
should be less than (0x200*setup_sects).
|
||||
|
||||
For example, if this value is set to 0x1c00, the kernel version
|
||||
number string can be found at offset 0x1e00 in the kernel file.
|
||||
This is a valid value if and only if the "setup_sects" field
|
||||
contains the value 15 or higher, as:
|
||||
|
||||
0x1c00 < 15*0x200 (= 0x1e00) but
|
||||
0x1c00 >= 14*0x200 (= 0x1c00)
|
||||
|
||||
0x1c00 >> 9 = 14, so the minimum value for setup_secs is 15.
|
||||
|
||||
Field name: type_of_loader
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x210/1
|
||||
Protocol: 2.00+
|
||||
|
||||
If your boot loader has an assigned id (see table below), enter
|
||||
0xTV here, where T is an identifier for the boot loader and V is
|
||||
a version number. Otherwise, enter 0xFF here.
|
||||
|
||||
Assigned boot loader ids:
|
||||
0 LILO (0x00 reserved for pre-2.00 bootloader)
|
||||
1 Loadlin
|
||||
2 bootsect-loader (0x20, all other values reserved)
|
||||
3 SYSLINUX
|
||||
4 EtherBoot
|
||||
5 ELILO
|
||||
7 GRuB
|
||||
8 U-BOOT
|
||||
9 Xen
|
||||
A Gujin
|
||||
B Qemu
|
||||
|
||||
Please contact <hpa@zytor.com> if you need a bootloader ID
|
||||
value assigned.
|
||||
|
||||
Field name: loadflags
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x211/1
|
||||
Protocol: 2.00+
|
||||
|
||||
This field is a bitmask.
|
||||
|
||||
Bit 0 (read): LOADED_HIGH
|
||||
- If 0, the protected-mode code is loaded at 0x10000.
|
||||
- If 1, the protected-mode code is loaded at 0x100000.
|
||||
|
||||
Bit 5 (write): QUIET_FLAG
|
||||
- If 0, print early messages.
|
||||
- If 1, suppress early messages.
|
||||
This requests to the kernel (decompressor and early
|
||||
kernel) to not write early messages that require
|
||||
accessing the display hardware directly.
|
||||
|
||||
Bit 6 (write): KEEP_SEGMENTS
|
||||
Protocol: 2.07+
|
||||
- If 0, reload the segment registers in the 32bit entry point.
|
||||
- If 1, do not reload the segment registers in the 32bit entry point.
|
||||
Assume that %cs %ds %ss %es are all set to flat segments with
|
||||
a base of 0 (or the equivalent for their environment).
|
||||
|
||||
Bit 7 (write): CAN_USE_HEAP
|
||||
Set this bit to 1 to indicate that the value entered in the
|
||||
heap_end_ptr is valid. If this field is clear, some setup code
|
||||
functionality will be disabled.
|
||||
|
||||
Field name: setup_move_size
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x212/2
|
||||
Protocol: 2.00-2.01
|
||||
|
||||
When using protocol 2.00 or 2.01, if the real mode kernel is not
|
||||
loaded at 0x90000, it gets moved there later in the loading
|
||||
sequence. Fill in this field if you want additional data (such as
|
||||
the kernel command line) moved in addition to the real-mode kernel
|
||||
itself.
|
||||
|
||||
The unit is bytes starting with the beginning of the boot sector.
|
||||
|
||||
This field is can be ignored when the protocol is 2.02 or higher, or
|
||||
if the real-mode code is loaded at 0x90000.
|
||||
|
||||
Field name: code32_start
|
||||
Type: modify (optional, reloc)
|
||||
Offset/size: 0x214/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The address to jump to in protected mode. This defaults to the load
|
||||
address of the kernel, and can be used by the boot loader to
|
||||
determine the proper load address.
|
||||
|
||||
This field can be modified for two purposes:
|
||||
|
||||
1. as a boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
2. if a bootloader which does not install a hook loads a
|
||||
relocatable kernel at a nonstandard address it will have to modify
|
||||
this field to point to the load address.
|
||||
|
||||
Field name: ramdisk_image
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x218/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The 32-bit linear address of the initial ramdisk or ramfs. Leave at
|
||||
zero if there is no initial ramdisk/ramfs.
|
||||
|
||||
Field name: ramdisk_size
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x21c/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Size of the initial ramdisk or ramfs. Leave at zero if there is no
|
||||
initial ramdisk/ramfs.
|
||||
|
||||
Field name: bootsect_kludge
|
||||
Type: kernel internal
|
||||
Offset/size: 0x220/4
|
||||
Protocol: 2.00+
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
Field name: heap_end_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x224/2
|
||||
Protocol: 2.01+
|
||||
|
||||
Set this field to the offset (from the beginning of the real-mode
|
||||
code) of the end of the setup stack/heap, minus 0x0200.
|
||||
|
||||
Field name: cmd_line_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x228/4
|
||||
Protocol: 2.02+
|
||||
|
||||
Set this field to the linear address of the kernel command line.
|
||||
The kernel command line can be located anywhere between the end of
|
||||
the setup heap and 0xA0000; it does not have to be located in the
|
||||
same 64K segment as the real-mode code itself.
|
||||
|
||||
Fill in this field even if your boot loader does not support a
|
||||
command line, in which case you can point this to an empty string
|
||||
(or better yet, to the string "auto".) If this field is left at
|
||||
zero, the kernel will assume that your boot loader does not support
|
||||
the 2.02+ protocol.
|
||||
|
||||
Field name: initrd_addr_max
|
||||
Type: read
|
||||
Offset/size: 0x22c/4
|
||||
Protocol: 2.03+
|
||||
|
||||
The maximum address that may be occupied by the initial
|
||||
ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this
|
||||
field is not present, and the maximum address is 0x37FFFFFF. (This
|
||||
address is defined as the address of the highest safe byte, so if
|
||||
your ramdisk is exactly 131072 bytes long and this field is
|
||||
0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
|
||||
|
||||
Field name: kernel_alignment
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x230/4
|
||||
Protocol: 2.05+
|
||||
|
||||
Alignment unit required by the kernel (if relocatable_kernel is true.)
|
||||
|
||||
Field name: relocatable_kernel
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x234/1
|
||||
Protocol: 2.05+
|
||||
|
||||
If this field is nonzero, the protected-mode part of the kernel can
|
||||
be loaded at any address that satisfies the kernel_alignment field.
|
||||
After loading, the boot loader must set the code32_start field to
|
||||
point to the loaded code, or to a boot loader hook.
|
||||
|
||||
Field name: cmdline_size
|
||||
Type: read
|
||||
Offset/size: 0x238/4
|
||||
Protocol: 2.06+
|
||||
|
||||
The maximum size of the command line without the terminating
|
||||
zero. This means that the command line can contain at most
|
||||
cmdline_size characters. With protocol version 2.05 and earlier, the
|
||||
maximum size was 255.
|
||||
|
||||
Field name: hardware_subarch
|
||||
Type: write (optional, defaults to x86/PC)
|
||||
Offset/size: 0x23c/4
|
||||
Protocol: 2.07+
|
||||
|
||||
In a paravirtualized environment the hardware low level architectural
|
||||
pieces such as interrupt handling, page table handling, and
|
||||
accessing process control registers needs to be done differently.
|
||||
|
||||
This field allows the bootloader to inform the kernel we are in one
|
||||
one of those environments.
|
||||
|
||||
0x00000000 The default x86/PC environment
|
||||
0x00000001 lguest
|
||||
0x00000002 Xen
|
||||
|
||||
Field name: hardware_subarch_data
|
||||
Type: write (subarch-dependent)
|
||||
Offset/size: 0x240/8
|
||||
Protocol: 2.07+
|
||||
|
||||
A pointer to data that is specific to hardware subarch
|
||||
This field is currently unused for the default x86/PC environment,
|
||||
do not modify.
|
||||
|
||||
Field name: payload_offset
|
||||
Type: read
|
||||
Offset/size: 0x248/4
|
||||
Protocol: 2.08+
|
||||
|
||||
If non-zero then this field contains the offset from the end of the
|
||||
real-mode code to the payload.
|
||||
|
||||
The payload may be compressed. The format of both the compressed and
|
||||
uncompressed data should be determined using the standard magic
|
||||
numbers. Currently only gzip compressed ELF is used.
|
||||
|
||||
Field name: payload_length
|
||||
Type: read
|
||||
Offset/size: 0x24c/4
|
||||
Protocol: 2.08+
|
||||
|
||||
The length of the payload.
|
||||
|
||||
Field name: setup_data
|
||||
Type: write (special)
|
||||
Offset/size: 0x250/8
|
||||
Protocol: 2.09+
|
||||
|
||||
The 64-bit physical pointer to NULL terminated single linked list of
|
||||
struct setup_data. This is used to define a more extensible boot
|
||||
parameters passing mechanism. The definition of struct setup_data is
|
||||
as follow:
|
||||
|
||||
struct setup_data {
|
||||
u64 next;
|
||||
u32 type;
|
||||
u32 len;
|
||||
u8 data[0];
|
||||
};
|
||||
|
||||
Where, the next is a 64-bit physical pointer to the next node of
|
||||
linked list, the next field of the last node is 0; the type is used
|
||||
to identify the contents of data; the len is the length of data
|
||||
field; the data holds the real payload.
|
||||
|
||||
This list may be modified at a number of points during the bootup
|
||||
process. Therefore, when modifying this list one should always make
|
||||
sure to consider the case where the linked list already contains
|
||||
entries.
|
||||
|
||||
|
||||
**** THE IMAGE CHECKSUM
|
||||
|
||||
From boot protocol version 2.08 onwards the CRC-32 is calculated over
|
||||
the entire file using the characteristic polynomial 0x04C11DB7 and an
|
||||
initial remainder of 0xffffffff. The checksum is appended to the
|
||||
file; therefore the CRC of the file up to the limit specified in the
|
||||
syssize field of the header is always 0.
|
||||
|
||||
|
||||
**** THE KERNEL COMMAND LINE
|
||||
|
||||
The kernel command line has become an important way for the boot
|
||||
loader to communicate with the kernel. Some of its options are also
|
||||
relevant to the boot loader itself, see "special command line options"
|
||||
below.
|
||||
|
||||
The kernel command line is a null-terminated string. The maximum
|
||||
length can be retrieved from the field cmdline_size. Before protocol
|
||||
version 2.06, the maximum was 255 characters. A string that is too
|
||||
long will be automatically truncated by the kernel.
|
||||
|
||||
If the boot protocol version is 2.02 or later, the address of the
|
||||
kernel command line is given by the header field cmd_line_ptr (see
|
||||
above.) This address can be anywhere between the end of the setup
|
||||
heap and 0xA0000.
|
||||
|
||||
If the protocol version is *not* 2.02 or higher, the kernel
|
||||
command line is entered using the following protocol:
|
||||
|
||||
At offset 0x0020 (word), "cmd_line_magic", enter the magic
|
||||
number 0xA33F.
|
||||
|
||||
At offset 0x0022 (word), "cmd_line_offset", enter the offset
|
||||
of the kernel command line (relative to the start of the
|
||||
real-mode kernel).
|
||||
|
||||
The kernel command line *must* be within the memory region
|
||||
covered by setup_move_size, so you may need to adjust this
|
||||
field.
|
||||
|
||||
|
||||
**** MEMORY LAYOUT OF THE REAL-MODE CODE
|
||||
|
||||
The real-mode code requires a stack/heap to be set up, as well as
|
||||
memory allocated for the kernel command line. This needs to be done
|
||||
in the real-mode accessible memory in bottom megabyte.
|
||||
|
||||
It should be noted that modern machines often have a sizable Extended
|
||||
BIOS Data Area (EBDA). As a result, it is advisable to use as little
|
||||
of the low megabyte as possible.
|
||||
|
||||
Unfortunately, under the following circumstances the 0x90000 memory
|
||||
segment has to be used:
|
||||
|
||||
- When loading a zImage kernel ((loadflags & 0x01) == 0).
|
||||
- When loading a 2.01 or earlier boot protocol kernel.
|
||||
|
||||
-> For the 2.00 and 2.01 boot protocols, the real-mode code
|
||||
can be loaded at another address, but it is internally
|
||||
relocated to 0x90000. For the "old" protocol, the
|
||||
real-mode code must be loaded at 0x90000.
|
||||
|
||||
When loading at 0x90000, avoid using memory above 0x9a000.
|
||||
|
||||
For boot protocol 2.02 or higher, the command line does not have to be
|
||||
located in the same 64K segment as the real-mode setup code; it is
|
||||
thus permitted to give the stack/heap the full 64K segment and locate
|
||||
the command line above it.
|
||||
|
||||
The kernel command line should not be located below the real-mode
|
||||
code, nor should it be located in high memory.
|
||||
|
||||
|
||||
**** SAMPLE BOOT CONFIGURATION
|
||||
|
||||
As a sample configuration, assume the following layout of the real
|
||||
mode segment:
|
||||
|
||||
When loading below 0x90000, use the entire segment:
|
||||
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0xdfff Stack and heap
|
||||
0xe000-0xffff Kernel command line
|
||||
|
||||
When loading at 0x90000 OR the protocol version is 2.01 or earlier:
|
||||
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0x97ff Stack and heap
|
||||
0x9800-0x9fff Kernel command line
|
||||
|
||||
Such a boot loader should enter the following fields in the header:
|
||||
|
||||
unsigned long base_ptr; /* base address for real-mode segment */
|
||||
|
||||
if ( setup_sects == 0 ) {
|
||||
setup_sects = 4;
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0200 ) {
|
||||
type_of_loader = <type code>;
|
||||
if ( loading_initrd ) {
|
||||
ramdisk_image = <initrd_address>;
|
||||
ramdisk_size = <initrd_size>;
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0202 && loadflags & 0x01 )
|
||||
heap_end = 0xe000;
|
||||
else
|
||||
heap_end = 0x9800;
|
||||
|
||||
if ( protocol >= 0x0201 ) {
|
||||
heap_end_ptr = heap_end - 0x200;
|
||||
loadflags |= 0x80; /* CAN_USE_HEAP */
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0202 ) {
|
||||
cmd_line_ptr = base_ptr + heap_end;
|
||||
strcpy(cmd_line_ptr, cmdline);
|
||||
} else {
|
||||
cmd_line_magic = 0xA33F;
|
||||
cmd_line_offset = heap_end;
|
||||
setup_move_size = heap_end + strlen(cmdline)+1;
|
||||
strcpy(base_ptr+cmd_line_offset, cmdline);
|
||||
}
|
||||
} else {
|
||||
/* Very old kernel */
|
||||
|
||||
heap_end = 0x9800;
|
||||
|
||||
cmd_line_magic = 0xA33F;
|
||||
cmd_line_offset = heap_end;
|
||||
|
||||
/* A very old kernel MUST have its real-mode code
|
||||
loaded at 0x90000 */
|
||||
|
||||
if ( base_ptr != 0x90000 ) {
|
||||
/* Copy the real-mode kernel */
|
||||
memcpy(0x90000, base_ptr, (setup_sects+1)*512);
|
||||
base_ptr = 0x90000; /* Relocated */
|
||||
}
|
||||
|
||||
strcpy(0x90000+cmd_line_offset, cmdline);
|
||||
|
||||
/* It is recommended to clear memory up to the 32K mark */
|
||||
memset(0x90000 + (setup_sects+1)*512, 0,
|
||||
(64-(setup_sects+1))*512);
|
||||
}
|
||||
|
||||
|
||||
**** LOADING THE REST OF THE KERNEL
|
||||
|
||||
The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
|
||||
in the kernel file (again, if setup_sects == 0 the real value is 4.)
|
||||
It should be loaded at address 0x10000 for Image/zImage kernels and
|
||||
0x100000 for bzImage kernels.
|
||||
|
||||
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
|
||||
bit (LOAD_HIGH) in the loadflags field is set:
|
||||
|
||||
is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
|
||||
load_address = is_bzImage ? 0x100000 : 0x10000;
|
||||
|
||||
Note that Image/zImage kernels can be up to 512K in size, and thus use
|
||||
the entire 0x10000-0x90000 range of memory. This means it is pretty
|
||||
much a requirement for these kernels to load the real-mode part at
|
||||
0x90000. bzImage kernels allow much more flexibility.
|
||||
|
||||
|
||||
**** SPECIAL COMMAND LINE OPTIONS
|
||||
|
||||
If the command line provided by the boot loader is entered by the
|
||||
user, the user may expect the following command line options to work.
|
||||
They should normally not be deleted from the kernel command line even
|
||||
though not all of them are actually meaningful to the kernel. Boot
|
||||
loader authors who need additional command line options for the boot
|
||||
loader itself should get them registered in
|
||||
Documentation/kernel-parameters.txt to make sure they will not
|
||||
conflict with actual kernel options now or in the future.
|
||||
|
||||
vga=<mode>
|
||||
<mode> here is either an integer (in C notation, either
|
||||
decimal, octal, or hexadecimal) or one of the strings
|
||||
"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
|
||||
(meaning 0xFFFD). This value should be entered into the
|
||||
vid_mode field, as it is used by the kernel before the command
|
||||
line is parsed.
|
||||
|
||||
mem=<size>
|
||||
<size> is an integer in C notation optionally followed by
|
||||
(case insensitive) K, M, G, T, P or E (meaning << 10, << 20,
|
||||
<< 30, << 40, << 50 or << 60). This specifies the end of
|
||||
memory to the kernel. This affects the possible placement of
|
||||
an initrd, since an initrd should be placed near end of
|
||||
memory. Note that this is an option to *both* the kernel and
|
||||
the bootloader!
|
||||
|
||||
initrd=<file>
|
||||
An initrd should be loaded. The meaning of <file> is
|
||||
obviously bootloader-dependent, and some boot loaders
|
||||
(e.g. LILO) do not have such a command.
|
||||
|
||||
In addition, some boot loaders add the following options to the
|
||||
user-specified command line:
|
||||
|
||||
BOOT_IMAGE=<file>
|
||||
The boot image which was loaded. Again, the meaning of <file>
|
||||
is obviously bootloader-dependent.
|
||||
|
||||
auto
|
||||
The kernel was booted without explicit user intervention.
|
||||
|
||||
If these options are added by the boot loader, it is highly
|
||||
recommended that they are located *first*, before the user-specified
|
||||
or configuration-specified command line. Otherwise, "init=/bin/sh"
|
||||
gets confused by the "auto" option.
|
||||
|
||||
|
||||
**** RUNNING THE KERNEL
|
||||
|
||||
The kernel is started by jumping to the kernel entry point, which is
|
||||
located at *segment* offset 0x20 from the start of the real mode
|
||||
kernel. This means that if you loaded your real-mode kernel code at
|
||||
0x90000, the kernel entry point is 9020:0000.
|
||||
|
||||
At entry, ds = es = ss should point to the start of the real-mode
|
||||
kernel code (0x9000 if the code is loaded at 0x90000), sp should be
|
||||
set up properly, normally pointing to the top of the heap, and
|
||||
interrupts should be disabled. Furthermore, to guard against bugs in
|
||||
the kernel, it is recommended that the boot loader sets fs = gs = ds =
|
||||
es = ss.
|
||||
|
||||
In our example from above, we would do:
|
||||
|
||||
/* Note: in the case of the "old" kernel protocol, base_ptr must
|
||||
be == 0x90000 at this point; see the previous sample code */
|
||||
|
||||
seg = base_ptr >> 4;
|
||||
|
||||
cli(); /* Enter with interrupts disabled! */
|
||||
|
||||
/* Set up the real-mode kernel stack */
|
||||
_SS = seg;
|
||||
_SP = heap_end;
|
||||
|
||||
_DS = _ES = _FS = _GS = seg;
|
||||
jmp_far(seg+0x20, 0); /* Run the kernel */
|
||||
|
||||
If your boot sector accesses a floppy drive, it is recommended to
|
||||
switch off the floppy motor before running the kernel, since the
|
||||
kernel boot leaves interrupts off and thus the motor will not be
|
||||
switched off, especially if the loaded kernel has the floppy driver as
|
||||
a demand-loaded module!
|
||||
|
||||
|
||||
**** ADVANCED BOOT LOADER HOOKS
|
||||
|
||||
If the boot loader runs in a particularly hostile environment (such as
|
||||
LOADLIN, which runs under DOS) it may be impossible to follow the
|
||||
standard memory location requirements. Such a boot loader may use the
|
||||
following hooks that, if set, are invoked by the kernel at the
|
||||
appropriate time. The use of these hooks should probably be
|
||||
considered an absolutely last resort!
|
||||
|
||||
IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
|
||||
%edi across invocation.
|
||||
|
||||
realmode_swtch:
|
||||
A 16-bit real mode far subroutine invoked immediately before
|
||||
entering protected mode. The default routine disables NMI, so
|
||||
your routine should probably do so, too.
|
||||
|
||||
code32_start:
|
||||
A 32-bit flat-mode routine *jumped* to immediately after the
|
||||
transition to protected mode, but before the kernel is
|
||||
uncompressed. No segments, except CS, are guaranteed to be
|
||||
set up (current kernels do, but older ones do not); you should
|
||||
set them up to BOOT_DS (0x18) yourself.
|
||||
|
||||
After completing your hook, you should jump to the address
|
||||
that was in this field before your boot loader overwrote it
|
||||
(relocated, if appropriate.)
|
||||
|
||||
|
||||
**** 32-bit BOOT PROTOCOL
|
||||
|
||||
For machine with some new BIOS other than legacy BIOS, such as EFI,
|
||||
LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
|
||||
based on legacy BIOS can not be used, so a 32-bit boot protocol needs
|
||||
to be defined.
|
||||
|
||||
In 32-bit boot protocol, the first step in loading a Linux kernel
|
||||
should be to setup the boot parameters (struct boot_params,
|
||||
traditionally known as "zero page"). The memory for struct boot_params
|
||||
should be allocated and initialized to all zero. Then the setup header
|
||||
from offset 0x01f1 of kernel image on should be loaded into struct
|
||||
boot_params and examined. The end of setup header can be calculated as
|
||||
follow:
|
||||
|
||||
0x0202 + byte value at offset 0x0201
|
||||
|
||||
In addition to read/modify/write the setup header of the struct
|
||||
boot_params as that of 16-bit boot protocol, the boot loader should
|
||||
also fill the additional fields of the struct boot_params as that
|
||||
described in zero-page.txt.
|
||||
|
||||
After setupping the struct boot_params, the boot loader can load the
|
||||
32/64-bit kernel in the same way as that of 16-bit boot protocol.
|
||||
|
||||
In 32-bit boot protocol, the kernel is started by jumping to the
|
||||
32-bit kernel entry point, which is the start address of loaded
|
||||
32/64-bit kernel.
|
||||
|
||||
At entry, the CPU must be in 32-bit protected mode with paging
|
||||
disabled; a GDT must be loaded with the descriptors for selectors
|
||||
__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
|
||||
segment; __BOOS_CS must have execute/read permission, and __BOOT_DS
|
||||
must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
|
||||
must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
|
||||
address of the struct boot_params; %ebp, %edi and %ebx must be zero.
|
28
Documentation/x86/x86_64/mm.txt
Normal file
28
Documentation/x86/x86_64/mm.txt
Normal file
@ -0,0 +1,28 @@
|
||||
|
||||
<previous description obsolete, deleted>
|
||||
|
||||
Virtual memory map with 4 level page tables:
|
||||
|
||||
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
|
||||
hole caused by [48:63] sign extension
|
||||
ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
|
||||
ffff810000000000 - ffffc0ffffffffff (=46 bits) direct mapping of all phys. memory
|
||||
ffffc10000000000 - ffffc1ffffffffff (=40 bits) hole
|
||||
ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space
|
||||
ffffe20000000000 - ffffe2ffffffffff (=40 bits) virtual memory map (1TB)
|
||||
... unused hole ...
|
||||
ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
|
||||
ffffffffa0000000 - fffffffffff00000 (=1536 MB) module mapping space
|
||||
|
||||
The direct mapping covers all memory in the system up to the highest
|
||||
memory address (this means in some cases it can also include PCI memory
|
||||
holes).
|
||||
|
||||
vmalloc space is lazily synchronized into the different PML4 pages of
|
||||
the processes using the page fault handler, with init_level4_pgt as
|
||||
reference.
|
||||
|
||||
Current X86-64 implementations only support 40 bits of address space,
|
||||
but we support up to 46 bits. This expands into MBZ space in the page tables.
|
||||
|
||||
-Andi Kleen, Jul 2004
|
42
Documentation/x86/x86_64/uefi.txt
Normal file
42
Documentation/x86/x86_64/uefi.txt
Normal file
@ -0,0 +1,42 @@
|
||||
General note on [U]EFI x86_64 support
|
||||
-------------------------------------
|
||||
|
||||
The nomenclature EFI and UEFI are used interchangeably in this document.
|
||||
|
||||
Although the tools below are _not_ needed for building the kernel,
|
||||
the needed bootloader support and associated tools for x86_64 platforms
|
||||
with EFI firmware and specifications are listed below.
|
||||
|
||||
1. UEFI specification: http://www.uefi.org
|
||||
|
||||
2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
|
||||
support. Elilo with x86_64 support can be used.
|
||||
|
||||
3. x86_64 platform with EFI/UEFI firmware.
|
||||
|
||||
Mechanics:
|
||||
---------
|
||||
- Build the kernel with the following configuration.
|
||||
CONFIG_FB_EFI=y
|
||||
CONFIG_FRAMEBUFFER_CONSOLE=y
|
||||
If EFI runtime services are expected, the following configuration should
|
||||
be selected.
|
||||
CONFIG_EFI=y
|
||||
CONFIG_EFI_VARS=y or m # optional
|
||||
- Create a VFAT partition on the disk
|
||||
- Copy the following to the VFAT partition:
|
||||
elilo bootloader with x86_64 support, elilo configuration file,
|
||||
kernel image built in first step and corresponding
|
||||
initrd. Instructions on building elilo and its dependencies
|
||||
can be found in the elilo sourceforge project.
|
||||
- Boot to EFI shell and invoke elilo choosing the kernel image built
|
||||
in first step.
|
||||
- If some or all EFI runtime services don't work, you can try following
|
||||
kernel command line parameters to turn off some or all EFI runtime
|
||||
services.
|
||||
noefi turn off all EFI runtime services
|
||||
reboot_type=k turn off EFI reboot runtime service
|
||||
- If the EFI memory map has additional entries not in the E820 map,
|
||||
you can include those entries in the kernels memory map of available
|
||||
physical RAM by using the following kernel command line parameter.
|
||||
add_efi_memmap include EFI memory map of available physical RAM
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user