[PATCH] Documentation update for relay interface
Here's updated documentation for the relay interface, rewritten to match the relayfs->relay changes. It also moves relayfs.txt to relay.txt in the process. It includes the changes to relayfs.txt previously posted by Randy Dunlap, thanks for those. The relay-apps examples have also been updated to match, and can be found on the sourceforge relayfs website. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This commit is contained in:
		
							parent
							
								
									4edb9a143e
								
							
						
					
					
						commit
						e88d78f6ba
					
				| @ -62,8 +62,8 @@ ramfs-rootfs-initramfs.txt | ||||
| 	- info on the 'in memory' filesystems ramfs, rootfs and initramfs. | ||||
| reiser4.txt | ||||
| 	- info on the Reiser4 filesystem based on dancing tree algorithms. | ||||
| relayfs.txt | ||||
| 	- info on relayfs, for efficient streaming from kernel to user space. | ||||
| relay.txt | ||||
| 	- info on relay, for efficient streaming from kernel to user space. | ||||
| romfs.txt | ||||
| 	- description of the ROMFS filesystem. | ||||
| smbfs.txt | ||||
|  | ||||
							
								
								
									
										479
									
								
								Documentation/filesystems/relay.txt
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										479
									
								
								Documentation/filesystems/relay.txt
									
									
									
									
									
										Normal file
									
								
							| @ -0,0 +1,479 @@ | ||||
| relay interface (formerly relayfs) | ||||
| ================================== | ||||
| 
 | ||||
| The relay interface provides a means for kernel applications to | ||||
| efficiently log and transfer large quantities of data from the kernel | ||||
| to userspace via user-defined 'relay channels'. | ||||
| 
 | ||||
| A 'relay channel' is a kernel->user data relay mechanism implemented | ||||
| as a set of per-cpu kernel buffers ('channel buffers'), each | ||||
| represented as a regular file ('relay file') in user space.  Kernel | ||||
| clients write into the channel buffers using efficient write | ||||
| functions; these automatically log into the current cpu's channel | ||||
| buffer.  User space applications mmap() or read() from the relay files | ||||
| and retrieve the data as it becomes available.  The relay files | ||||
| themselves are files created in a host filesystem, e.g. debugfs, and | ||||
| are associated with the channel buffers using the API described below. | ||||
| 
 | ||||
| The format of the data logged into the channel buffers is completely | ||||
| up to the kernel client; the relay interface does however provide | ||||
| hooks which allow kernel clients to impose some structure on the | ||||
| buffer data.  The relay interface doesn't implement any form of data | ||||
| filtering - this also is left to the kernel client.  The purpose is to | ||||
| keep things as simple as possible. | ||||
| 
 | ||||
| This document provides an overview of the relay interface API.  The | ||||
| details of the function parameters are documented along with the | ||||
| functions in the relay interface code - please see that for details. | ||||
| 
 | ||||
| Semantics | ||||
| ========= | ||||
| 
 | ||||
| Each relay channel has one buffer per CPU, each buffer has one or more | ||||
| sub-buffers.  Messages are written to the first sub-buffer until it is | ||||
| too full to contain a new message, in which case it it is written to | ||||
| the next (if available).  Messages are never split across sub-buffers. | ||||
| At this point, userspace can be notified so it empties the first | ||||
| sub-buffer, while the kernel continues writing to the next. | ||||
| 
 | ||||
| When notified that a sub-buffer is full, the kernel knows how many | ||||
| bytes of it are padding i.e. unused space occurring because a complete | ||||
| message couldn't fit into a sub-buffer.  Userspace can use this | ||||
| knowledge to copy only valid data. | ||||
| 
 | ||||
| After copying it, userspace can notify the kernel that a sub-buffer | ||||
| has been consumed. | ||||
| 
 | ||||
| A relay channel can operate in a mode where it will overwrite data not | ||||
| yet collected by userspace, and not wait for it to be consumed. | ||||
| 
 | ||||
| The relay channel itself does not provide for communication of such | ||||
| data between userspace and kernel, allowing the kernel side to remain | ||||
| simple and not impose a single interface on userspace.  It does | ||||
| provide a set of examples and a separate helper though, described | ||||
| below. | ||||
| 
 | ||||
| The read() interface both removes padding and internally consumes the | ||||
| read sub-buffers; thus in cases where read(2) is being used to drain | ||||
| the channel buffers, special-purpose communication between kernel and | ||||
| user isn't necessary for basic operation. | ||||
| 
 | ||||
| One of the major goals of the relay interface is to provide a low | ||||
| overhead mechanism for conveying kernel data to userspace.  While the | ||||
| read() interface is easy to use, it's not as efficient as the mmap() | ||||
| approach; the example code attempts to make the tradeoff between the | ||||
| two approaches as small as possible. | ||||
| 
 | ||||
| klog and relay-apps example code | ||||
| ================================ | ||||
| 
 | ||||
| The relay interface itself is ready to use, but to make things easier, | ||||
| a couple simple utility functions and a set of examples are provided. | ||||
| 
 | ||||
| The relay-apps example tarball, available on the relay sourceforge | ||||
| site, contains a set of self-contained examples, each consisting of a | ||||
| pair of .c files containing boilerplate code for each of the user and | ||||
| kernel sides of a relay application.  When combined these two sets of | ||||
| boilerplate code provide glue to easily stream data to disk, without | ||||
| having to bother with mundane housekeeping chores. | ||||
| 
 | ||||
| The 'klog debugging functions' patch (klog.patch in the relay-apps | ||||
| tarball) provides a couple of high-level logging functions to the | ||||
| kernel which allow writing formatted text or raw data to a channel, | ||||
| regardless of whether a channel to write into exists or not, or even | ||||
| whether the relay interface is compiled into the kernel or not.  These | ||||
| functions allow you to put unconditional 'trace' statements anywhere | ||||
| in the kernel or kernel modules; only when there is a 'klog handler' | ||||
| registered will data actually be logged (see the klog and kleak | ||||
| examples for details). | ||||
| 
 | ||||
| It is of course possible to use the relay interface from scratch, | ||||
| i.e. without using any of the relay-apps example code or klog, but | ||||
| you'll have to implement communication between userspace and kernel, | ||||
| allowing both to convey the state of buffers (full, empty, amount of | ||||
| padding).  The read() interface both removes padding and internally | ||||
| consumes the read sub-buffers; thus in cases where read(2) is being | ||||
| used to drain the channel buffers, special-purpose communication | ||||
| between kernel and user isn't necessary for basic operation.  Things | ||||
| such as buffer-full conditions would still need to be communicated via | ||||
| some channel though. | ||||
| 
 | ||||
| klog and the relay-apps examples can be found in the relay-apps | ||||
| tarball on http://relayfs.sourceforge.net | ||||
| 
 | ||||
| The relay interface user space API | ||||
| ================================== | ||||
| 
 | ||||
| The relay interface implements basic file operations for user space | ||||
| access to relay channel buffer data.  Here are the file operations | ||||
| that are available and some comments regarding their behavior: | ||||
| 
 | ||||
| open()	    enables user to open an _existing_ channel buffer. | ||||
| 
 | ||||
| mmap()      results in channel buffer being mapped into the caller's | ||||
| 	    memory space. Note that you can't do a partial mmap - you | ||||
| 	    must map the entire file, which is NRBUF * SUBBUFSIZE. | ||||
| 
 | ||||
| read()      read the contents of a channel buffer.  The bytes read are | ||||
| 	    'consumed' by the reader, i.e. they won't be available | ||||
| 	    again to subsequent reads.  If the channel is being used | ||||
| 	    in no-overwrite mode (the default), it can be read at any | ||||
| 	    time even if there's an active kernel writer.  If the | ||||
| 	    channel is being used in overwrite mode and there are | ||||
| 	    active channel writers, results may be unpredictable - | ||||
| 	    users should make sure that all logging to the channel has | ||||
| 	    ended before using read() with overwrite mode.  Sub-buffer | ||||
| 	    padding is automatically removed and will not be seen by | ||||
| 	    the reader. | ||||
| 
 | ||||
| sendfile()  transfer data from a channel buffer to an output file | ||||
| 	    descriptor. Sub-buffer padding is automatically removed | ||||
| 	    and will not be seen by the reader. | ||||
| 
 | ||||
| poll()      POLLIN/POLLRDNORM/POLLERR supported.  User applications are | ||||
| 	    notified when sub-buffer boundaries are crossed. | ||||
| 
 | ||||
| close()     decrements the channel buffer's refcount.  When the refcount | ||||
| 	    reaches 0, i.e. when no process or kernel client has the | ||||
| 	    buffer open, the channel buffer is freed. | ||||
| 
 | ||||
| In order for a user application to make use of relay files, the | ||||
| host filesystem must be mounted.  For example, | ||||
| 
 | ||||
| 	mount -t debugfs debugfs /debug | ||||
| 
 | ||||
| NOTE:   the host filesystem doesn't need to be mounted for kernel | ||||
| 	clients to create or use channels - it only needs to be | ||||
| 	mounted when user space applications need access to the buffer | ||||
| 	data. | ||||
| 
 | ||||
| 
 | ||||
| The relay interface kernel API | ||||
| ============================== | ||||
| 
 | ||||
| Here's a summary of the API the relay interface provides to in-kernel clients: | ||||
| 
 | ||||
| TBD(curr. line MT:/API/) | ||||
|   channel management functions: | ||||
| 
 | ||||
|     relay_open(base_filename, parent, subbuf_size, n_subbufs, | ||||
|                callbacks) | ||||
|     relay_close(chan) | ||||
|     relay_flush(chan) | ||||
|     relay_reset(chan) | ||||
| 
 | ||||
|   channel management typically called on instigation of userspace: | ||||
| 
 | ||||
|     relay_subbufs_consumed(chan, cpu, subbufs_consumed) | ||||
| 
 | ||||
|   write functions: | ||||
| 
 | ||||
|     relay_write(chan, data, length) | ||||
|     __relay_write(chan, data, length) | ||||
|     relay_reserve(chan, length) | ||||
| 
 | ||||
|   callbacks: | ||||
| 
 | ||||
|     subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | ||||
|     buf_mapped(buf, filp) | ||||
|     buf_unmapped(buf, filp) | ||||
|     create_buf_file(filename, parent, mode, buf, is_global) | ||||
|     remove_buf_file(dentry) | ||||
| 
 | ||||
|   helper functions: | ||||
| 
 | ||||
|     relay_buf_full(buf) | ||||
|     subbuf_start_reserve(buf, length) | ||||
| 
 | ||||
| 
 | ||||
| Creating a channel | ||||
| ------------------ | ||||
| 
 | ||||
| relay_open() is used to create a channel, along with its per-cpu | ||||
| channel buffers.  Each channel buffer will have an associated file | ||||
| created for it in the host filesystem, which can be and mmapped or | ||||
| read from in user space.  The files are named basename0...basenameN-1 | ||||
| where N is the number of online cpus, and by default will be created | ||||
| in the root of the filesystem (if the parent param is NULL).  If you | ||||
| want a directory structure to contain your relay files, you should | ||||
| create it using the host filesystem's directory creation function, | ||||
| e.g. debugfs_create_dir(), and pass the parent directory to | ||||
| relay_open().  Users are responsible for cleaning up any directory | ||||
| structure they create, when the channel is closed - again the host | ||||
| filesystem's directory removal functions should be used for that, | ||||
| e.g. debugfs_remove(). | ||||
| 
 | ||||
| In order for a channel to be created and the host filesystem's files | ||||
| associated with its channel buffers, the user must provide definitions | ||||
| for two callback functions, create_buf_file() and remove_buf_file(). | ||||
| create_buf_file() is called once for each per-cpu buffer from | ||||
| relay_open() and allows the user to create the file which will be used | ||||
| to represent the corresponding channel buffer.  The callback should | ||||
| return the dentry of the file created to represent the channel buffer. | ||||
| remove_buf_file() must also be defined; it's responsible for deleting | ||||
| the file(s) created in create_buf_file() and is called during | ||||
| relay_close(). | ||||
| 
 | ||||
| Here are some typical definitions for these callbacks, in this case | ||||
| using debugfs: | ||||
| 
 | ||||
| /* | ||||
|  * create_buf_file() callback.  Creates relay file in debugfs. | ||||
|  */ | ||||
| static struct dentry *create_buf_file_handler(const char *filename, | ||||
|                                               struct dentry *parent, | ||||
|                                               int mode, | ||||
|                                               struct rchan_buf *buf, | ||||
|                                               int *is_global) | ||||
| { | ||||
|         return debugfs_create_file(filename, mode, parent, buf, | ||||
| 	                           &relay_file_operations); | ||||
| } | ||||
| 
 | ||||
| /* | ||||
|  * remove_buf_file() callback.  Removes relay file from debugfs. | ||||
|  */ | ||||
| static int remove_buf_file_handler(struct dentry *dentry) | ||||
| { | ||||
|         debugfs_remove(dentry); | ||||
| 
 | ||||
|         return 0; | ||||
| } | ||||
| 
 | ||||
| /* | ||||
|  * relay interface callbacks | ||||
|  */ | ||||
| static struct rchan_callbacks relay_callbacks = | ||||
| { | ||||
|         .create_buf_file = create_buf_file_handler, | ||||
|         .remove_buf_file = remove_buf_file_handler, | ||||
| }; | ||||
| 
 | ||||
| And an example relay_open() invocation using them: | ||||
| 
 | ||||
|   chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks); | ||||
| 
 | ||||
| If the create_buf_file() callback fails, or isn't defined, channel | ||||
| creation and thus relay_open() will fail. | ||||
| 
 | ||||
| The total size of each per-cpu buffer is calculated by multiplying the | ||||
| number of sub-buffers by the sub-buffer size passed into relay_open(). | ||||
| The idea behind sub-buffers is that they're basically an extension of | ||||
| double-buffering to N buffers, and they also allow applications to | ||||
| easily implement random-access-on-buffer-boundary schemes, which can | ||||
| be important for some high-volume applications.  The number and size | ||||
| of sub-buffers is completely dependent on the application and even for | ||||
| the same application, different conditions will warrant different | ||||
| values for these parameters at different times.  Typically, the right | ||||
| values to use are best decided after some experimentation; in general, | ||||
| though, it's safe to assume that having only 1 sub-buffer is a bad | ||||
| idea - you're guaranteed to either overwrite data or lose events | ||||
| depending on the channel mode being used. | ||||
| 
 | ||||
| The create_buf_file() implementation can also be defined in such a way | ||||
| as to allow the creation of a single 'global' buffer instead of the | ||||
| default per-cpu set.  This can be useful for applications interested | ||||
| mainly in seeing the relative ordering of system-wide events without | ||||
| the need to bother with saving explicit timestamps for the purpose of | ||||
| merging/sorting per-cpu files in a postprocessing step. | ||||
| 
 | ||||
| To have relay_open() create a global buffer, the create_buf_file() | ||||
| implementation should set the value of the is_global outparam to a | ||||
| non-zero value in addition to creating the file that will be used to | ||||
| represent the single buffer.  In the case of a global buffer, | ||||
| create_buf_file() and remove_buf_file() will be called only once.  The | ||||
| normal channel-writing functions, e.g. relay_write(), can still be | ||||
| used - writes from any cpu will transparently end up in the global | ||||
| buffer - but since it is a global buffer, callers should make sure | ||||
| they use the proper locking for such a buffer, either by wrapping | ||||
| writes in a spinlock, or by copying a write function from relay.h and | ||||
| creating a local version that internally does the proper locking. | ||||
| 
 | ||||
| Channel 'modes' | ||||
| --------------- | ||||
| 
 | ||||
| relay channels can be used in either of two modes - 'overwrite' or | ||||
| 'no-overwrite'.  The mode is entirely determined by the implementation | ||||
| of the subbuf_start() callback, as described below.  The default if no | ||||
| subbuf_start() callback is defined is 'no-overwrite' mode.  If the | ||||
| default mode suits your needs, and you plan to use the read() | ||||
| interface to retrieve channel data, you can ignore the details of this | ||||
| section, as it pertains mainly to mmap() implementations. | ||||
| 
 | ||||
| In 'overwrite' mode, also known as 'flight recorder' mode, writes | ||||
| continuously cycle around the buffer and will never fail, but will | ||||
| unconditionally overwrite old data regardless of whether it's actually | ||||
| been consumed.  In no-overwrite mode, writes will fail, i.e. data will | ||||
| be lost, if the number of unconsumed sub-buffers equals the total | ||||
| number of sub-buffers in the channel.  It should be clear that if | ||||
| there is no consumer or if the consumer can't consume sub-buffers fast | ||||
| enough, data will be lost in either case; the only difference is | ||||
| whether data is lost from the beginning or the end of a buffer. | ||||
| 
 | ||||
| As explained above, a relay channel is made of up one or more | ||||
| per-cpu channel buffers, each implemented as a circular buffer | ||||
| subdivided into one or more sub-buffers.  Messages are written into | ||||
| the current sub-buffer of the channel's current per-cpu buffer via the | ||||
| write functions described below.  Whenever a message can't fit into | ||||
| the current sub-buffer, because there's no room left for it, the | ||||
| client is notified via the subbuf_start() callback that a switch to a | ||||
| new sub-buffer is about to occur.  The client uses this callback to 1) | ||||
| initialize the next sub-buffer if appropriate 2) finalize the previous | ||||
| sub-buffer if appropriate and 3) return a boolean value indicating | ||||
| whether or not to actually move on to the next sub-buffer. | ||||
| 
 | ||||
| To implement 'no-overwrite' mode, the userspace client would provide | ||||
| an implementation of the subbuf_start() callback something like the | ||||
| following: | ||||
| 
 | ||||
| static int subbuf_start(struct rchan_buf *buf, | ||||
|                         void *subbuf, | ||||
| 			void *prev_subbuf, | ||||
| 			unsigned int prev_padding) | ||||
| { | ||||
| 	if (prev_subbuf) | ||||
| 		*((unsigned *)prev_subbuf) = prev_padding; | ||||
| 
 | ||||
| 	if (relay_buf_full(buf)) | ||||
| 		return 0; | ||||
| 
 | ||||
| 	subbuf_start_reserve(buf, sizeof(unsigned int)); | ||||
| 
 | ||||
| 	return 1; | ||||
| } | ||||
| 
 | ||||
| If the current buffer is full, i.e. all sub-buffers remain unconsumed, | ||||
| the callback returns 0 to indicate that the buffer switch should not | ||||
| occur yet, i.e. until the consumer has had a chance to read the | ||||
| current set of ready sub-buffers.  For the relay_buf_full() function | ||||
| to make sense, the consumer is reponsible for notifying the relay | ||||
| interface when sub-buffers have been consumed via | ||||
| relay_subbufs_consumed().  Any subsequent attempts to write into the | ||||
| buffer will again invoke the subbuf_start() callback with the same | ||||
| parameters; only when the consumer has consumed one or more of the | ||||
| ready sub-buffers will relay_buf_full() return 0, in which case the | ||||
| buffer switch can continue. | ||||
| 
 | ||||
| The implementation of the subbuf_start() callback for 'overwrite' mode | ||||
| would be very similar: | ||||
| 
 | ||||
| static int subbuf_start(struct rchan_buf *buf, | ||||
|                         void *subbuf, | ||||
| 			void *prev_subbuf, | ||||
| 			unsigned int prev_padding) | ||||
| { | ||||
| 	if (prev_subbuf) | ||||
| 		*((unsigned *)prev_subbuf) = prev_padding; | ||||
| 
 | ||||
| 	subbuf_start_reserve(buf, sizeof(unsigned int)); | ||||
| 
 | ||||
| 	return 1; | ||||
| } | ||||
| 
 | ||||
| In this case, the relay_buf_full() check is meaningless and the | ||||
| callback always returns 1, causing the buffer switch to occur | ||||
| unconditionally.  It's also meaningless for the client to use the | ||||
| relay_subbufs_consumed() function in this mode, as it's never | ||||
| consulted. | ||||
| 
 | ||||
| The default subbuf_start() implementation, used if the client doesn't | ||||
| define any callbacks, or doesn't define the subbuf_start() callback, | ||||
| implements the simplest possible 'no-overwrite' mode, i.e. it does | ||||
| nothing but return 0. | ||||
| 
 | ||||
| Header information can be reserved at the beginning of each sub-buffer | ||||
| by calling the subbuf_start_reserve() helper function from within the | ||||
| subbuf_start() callback.  This reserved area can be used to store | ||||
| whatever information the client wants.  In the example above, room is | ||||
| reserved in each sub-buffer to store the padding count for that | ||||
| sub-buffer.  This is filled in for the previous sub-buffer in the | ||||
| subbuf_start() implementation; the padding value for the previous | ||||
| sub-buffer is passed into the subbuf_start() callback along with a | ||||
| pointer to the previous sub-buffer, since the padding value isn't | ||||
| known until a sub-buffer is filled.  The subbuf_start() callback is | ||||
| also called for the first sub-buffer when the channel is opened, to | ||||
| give the client a chance to reserve space in it.  In this case the | ||||
| previous sub-buffer pointer passed into the callback will be NULL, so | ||||
| the client should check the value of the prev_subbuf pointer before | ||||
| writing into the previous sub-buffer. | ||||
| 
 | ||||
| Writing to a channel | ||||
| -------------------- | ||||
| 
 | ||||
| Kernel clients write data into the current cpu's channel buffer using | ||||
| relay_write() or __relay_write().  relay_write() is the main logging | ||||
| function - it uses local_irqsave() to protect the buffer and should be | ||||
| used if you might be logging from interrupt context.  If you know | ||||
| you'll never be logging from interrupt context, you can use | ||||
| __relay_write(), which only disables preemption.  These functions | ||||
| don't return a value, so you can't determine whether or not they | ||||
| failed - the assumption is that you wouldn't want to check a return | ||||
| value in the fast logging path anyway, and that they'll always succeed | ||||
| unless the buffer is full and no-overwrite mode is being used, in | ||||
| which case you can detect a failed write in the subbuf_start() | ||||
| callback by calling the relay_buf_full() helper function. | ||||
| 
 | ||||
| relay_reserve() is used to reserve a slot in a channel buffer which | ||||
| can be written to later.  This would typically be used in applications | ||||
| that need to write directly into a channel buffer without having to | ||||
| stage data in a temporary buffer beforehand.  Because the actual write | ||||
| may not happen immediately after the slot is reserved, applications | ||||
| using relay_reserve() can keep a count of the number of bytes actually | ||||
| written, either in space reserved in the sub-buffers themselves or as | ||||
| a separate array.  See the 'reserve' example in the relay-apps tarball | ||||
| at http://relayfs.sourceforge.net for an example of how this can be | ||||
| done.  Because the write is under control of the client and is | ||||
| separated from the reserve, relay_reserve() doesn't protect the buffer | ||||
| at all - it's up to the client to provide the appropriate | ||||
| synchronization when using relay_reserve(). | ||||
| 
 | ||||
| Closing a channel | ||||
| ----------------- | ||||
| 
 | ||||
| The client calls relay_close() when it's finished using the channel. | ||||
| The channel and its associated buffers are destroyed when there are no | ||||
| longer any references to any of the channel buffers.  relay_flush() | ||||
| forces a sub-buffer switch on all the channel buffers, and can be used | ||||
| to finalize and process the last sub-buffers before the channel is | ||||
| closed. | ||||
| 
 | ||||
| Misc | ||||
| ---- | ||||
| 
 | ||||
| Some applications may want to keep a channel around and re-use it | ||||
| rather than open and close a new channel for each use.  relay_reset() | ||||
| can be used for this purpose - it resets a channel to its initial | ||||
| state without reallocating channel buffer memory or destroying | ||||
| existing mappings.  It should however only be called when it's safe to | ||||
| do so, i.e. when the channel isn't currently being written to. | ||||
| 
 | ||||
| Finally, there are a couple of utility callbacks that can be used for | ||||
| different purposes.  buf_mapped() is called whenever a channel buffer | ||||
| is mmapped from user space and buf_unmapped() is called when it's | ||||
| unmapped.  The client can use this notification to trigger actions | ||||
| within the kernel application, such as enabling/disabling logging to | ||||
| the channel. | ||||
| 
 | ||||
| 
 | ||||
| Resources | ||||
| ========= | ||||
| 
 | ||||
| For news, example code, mailing list, etc. see the relay interface homepage: | ||||
| 
 | ||||
|     http://relayfs.sourceforge.net | ||||
| 
 | ||||
| 
 | ||||
| Credits | ||||
| ======= | ||||
| 
 | ||||
| The ideas and specs for the relay interface came about as a result of | ||||
| discussions on tracing involving the following: | ||||
| 
 | ||||
| Michel Dagenais		<michel.dagenais@polymtl.ca> | ||||
| Richard Moore		<richardj_moore@uk.ibm.com> | ||||
| Bob Wisniewski		<bob@watson.ibm.com> | ||||
| Karim Yaghmour		<karim@opersys.com> | ||||
| Tom Zanussi		<zanussi@us.ibm.com> | ||||
| 
 | ||||
| Also thanks to Hubertus Franke for a lot of useful suggestions and bug | ||||
| reports. | ||||
| @ -1,442 +0,0 @@ | ||||
| 
 | ||||
| relayfs - a high-speed data relay filesystem | ||||
| ============================================ | ||||
| 
 | ||||
| relayfs is a filesystem designed to provide an efficient mechanism for | ||||
| tools and facilities to relay large and potentially sustained streams | ||||
| of data from kernel space to user space. | ||||
| 
 | ||||
| The main abstraction of relayfs is the 'channel'.  A channel consists | ||||
| of a set of per-cpu kernel buffers each represented by a file in the | ||||
| relayfs filesystem.  Kernel clients write into a channel using | ||||
| efficient write functions which automatically log to the current cpu's | ||||
| channel buffer.  User space applications mmap() the per-cpu files and | ||||
| retrieve the data as it becomes available. | ||||
| 
 | ||||
| The format of the data logged into the channel buffers is completely | ||||
| up to the relayfs client; relayfs does however provide hooks which | ||||
| allow clients to impose some structure on the buffer data.  Nor does | ||||
| relayfs implement any form of data filtering - this also is left to | ||||
| the client.  The purpose is to keep relayfs as simple as possible. | ||||
| 
 | ||||
| This document provides an overview of the relayfs API.  The details of | ||||
| the function parameters are documented along with the functions in the | ||||
| filesystem code - please see that for details. | ||||
| 
 | ||||
| Semantics | ||||
| ========= | ||||
| 
 | ||||
| Each relayfs channel has one buffer per CPU, each buffer has one or | ||||
| more sub-buffers. Messages are written to the first sub-buffer until | ||||
| it is too full to contain a new message, in which case it it is | ||||
| written to the next (if available).  Messages are never split across | ||||
| sub-buffers.  At this point, userspace can be notified so it empties | ||||
| the first sub-buffer, while the kernel continues writing to the next. | ||||
| 
 | ||||
| When notified that a sub-buffer is full, the kernel knows how many | ||||
| bytes of it are padding i.e. unused.  Userspace can use this knowledge | ||||
| to copy only valid data. | ||||
| 
 | ||||
| After copying it, userspace can notify the kernel that a sub-buffer | ||||
| has been consumed. | ||||
| 
 | ||||
| relayfs can operate in a mode where it will overwrite data not yet | ||||
| collected by userspace, and not wait for it to consume it. | ||||
| 
 | ||||
| relayfs itself does not provide for communication of such data between | ||||
| userspace and kernel, allowing the kernel side to remain simple and | ||||
| not impose a single interface on userspace. It does provide a set of | ||||
| examples and a separate helper though, described below. | ||||
| 
 | ||||
| klog and relay-apps example code | ||||
| ================================ | ||||
| 
 | ||||
| relayfs itself is ready to use, but to make things easier, a couple | ||||
| simple utility functions and a set of examples are provided. | ||||
| 
 | ||||
| The relay-apps example tarball, available on the relayfs sourceforge | ||||
| site, contains a set of self-contained examples, each consisting of a | ||||
| pair of .c files containing boilerplate code for each of the user and | ||||
| kernel sides of a relayfs application; combined these two sets of | ||||
| boilerplate code provide glue to easily stream data to disk, without | ||||
| having to bother with mundane housekeeping chores. | ||||
| 
 | ||||
| The 'klog debugging functions' patch (klog.patch in the relay-apps | ||||
| tarball) provides a couple of high-level logging functions to the | ||||
| kernel which allow writing formatted text or raw data to a channel, | ||||
| regardless of whether a channel to write into exists or not, or | ||||
| whether relayfs is compiled into the kernel or is configured as a | ||||
| module.  These functions allow you to put unconditional 'trace' | ||||
| statements anywhere in the kernel or kernel modules; only when there | ||||
| is a 'klog handler' registered will data actually be logged (see the | ||||
| klog and kleak examples for details). | ||||
| 
 | ||||
| It is of course possible to use relayfs from scratch i.e. without | ||||
| using any of the relay-apps example code or klog, but you'll have to | ||||
| implement communication between userspace and kernel, allowing both to | ||||
| convey the state of buffers (full, empty, amount of padding). | ||||
| 
 | ||||
| klog and the relay-apps examples can be found in the relay-apps | ||||
| tarball on http://relayfs.sourceforge.net | ||||
| 
 | ||||
| 
 | ||||
| The relayfs user space API | ||||
| ========================== | ||||
| 
 | ||||
| relayfs implements basic file operations for user space access to | ||||
| relayfs channel buffer data.  Here are the file operations that are | ||||
| available and some comments regarding their behavior: | ||||
| 
 | ||||
| open()	 enables user to open an _existing_ buffer. | ||||
| 
 | ||||
| mmap()	 results in channel buffer being mapped into the caller's | ||||
| 	 memory space. Note that you can't do a partial mmap - you must | ||||
| 	 map the entire file, which is NRBUF * SUBBUFSIZE. | ||||
| 
 | ||||
| read()	 read the contents of a channel buffer.  The bytes read are | ||||
| 	 'consumed' by the reader i.e. they won't be available again | ||||
| 	 to subsequent reads.  If the channel is being used in | ||||
| 	 no-overwrite mode (the default), it can be read at any time | ||||
| 	 even if there's an active kernel writer.  If the channel is | ||||
| 	 being used in overwrite mode and there are active channel | ||||
| 	 writers, results may be unpredictable - users should make | ||||
| 	 sure that all logging to the channel has ended before using | ||||
| 	 read() with overwrite mode. | ||||
| 
 | ||||
| poll()	 POLLIN/POLLRDNORM/POLLERR supported.  User applications are | ||||
| 	 notified when sub-buffer boundaries are crossed. | ||||
| 
 | ||||
| close() decrements the channel buffer's refcount.  When the refcount | ||||
| 	reaches 0 i.e. when no process or kernel client has the buffer | ||||
| 	open, the channel buffer is freed. | ||||
| 
 | ||||
| 
 | ||||
| In order for a user application to make use of relayfs files, the | ||||
| relayfs filesystem must be mounted.  For example, | ||||
| 
 | ||||
| 	mount -t relayfs relayfs /mnt/relay | ||||
| 
 | ||||
| NOTE:	relayfs doesn't need to be mounted for kernel clients to create | ||||
| 	or use channels - it only needs to be mounted when user space | ||||
| 	applications need access to the buffer data. | ||||
| 
 | ||||
| 
 | ||||
| The relayfs kernel API | ||||
| ====================== | ||||
| 
 | ||||
| Here's a summary of the API relayfs provides to in-kernel clients: | ||||
| 
 | ||||
| 
 | ||||
|   channel management functions: | ||||
| 
 | ||||
|     relay_open(base_filename, parent, subbuf_size, n_subbufs, | ||||
|                callbacks) | ||||
|     relay_close(chan) | ||||
|     relay_flush(chan) | ||||
|     relay_reset(chan) | ||||
|     relayfs_create_dir(name, parent) | ||||
|     relayfs_remove_dir(dentry) | ||||
|     relayfs_create_file(name, parent, mode, fops, data) | ||||
|     relayfs_remove_file(dentry) | ||||
| 
 | ||||
|   channel management typically called on instigation of userspace: | ||||
| 
 | ||||
|     relay_subbufs_consumed(chan, cpu, subbufs_consumed) | ||||
| 
 | ||||
|   write functions: | ||||
| 
 | ||||
|     relay_write(chan, data, length) | ||||
|     __relay_write(chan, data, length) | ||||
|     relay_reserve(chan, length) | ||||
| 
 | ||||
|   callbacks: | ||||
| 
 | ||||
|     subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | ||||
|     buf_mapped(buf, filp) | ||||
|     buf_unmapped(buf, filp) | ||||
|     create_buf_file(filename, parent, mode, buf, is_global) | ||||
|     remove_buf_file(dentry) | ||||
| 
 | ||||
|   helper functions: | ||||
| 
 | ||||
|     relay_buf_full(buf) | ||||
|     subbuf_start_reserve(buf, length) | ||||
| 
 | ||||
| 
 | ||||
| Creating a channel | ||||
| ------------------ | ||||
| 
 | ||||
| relay_open() is used to create a channel, along with its per-cpu | ||||
| channel buffers.  Each channel buffer will have an associated file | ||||
| created for it in the relayfs filesystem, which can be opened and | ||||
| mmapped from user space if desired.  The files are named | ||||
| basename0...basenameN-1 where N is the number of online cpus, and by | ||||
| default will be created in the root of the filesystem.  If you want a | ||||
| directory structure to contain your relayfs files, you can create it | ||||
| with relayfs_create_dir() and pass the parent directory to | ||||
| relay_open().  Clients are responsible for cleaning up any directory | ||||
| structure they create when the channel is closed - use | ||||
| relayfs_remove_dir() for that. | ||||
| 
 | ||||
| The total size of each per-cpu buffer is calculated by multiplying the | ||||
| number of sub-buffers by the sub-buffer size passed into relay_open(). | ||||
| The idea behind sub-buffers is that they're basically an extension of | ||||
| double-buffering to N buffers, and they also allow applications to | ||||
| easily implement random-access-on-buffer-boundary schemes, which can | ||||
| be important for some high-volume applications.  The number and size | ||||
| of sub-buffers is completely dependent on the application and even for | ||||
| the same application, different conditions will warrant different | ||||
| values for these parameters at different times.  Typically, the right | ||||
| values to use are best decided after some experimentation; in general, | ||||
| though, it's safe to assume that having only 1 sub-buffer is a bad | ||||
| idea - you're guaranteed to either overwrite data or lose events | ||||
| depending on the channel mode being used. | ||||
| 
 | ||||
| Channel 'modes' | ||||
| --------------- | ||||
| 
 | ||||
| relayfs channels can be used in either of two modes - 'overwrite' or | ||||
| 'no-overwrite'.  The mode is entirely determined by the implementation | ||||
| of the subbuf_start() callback, as described below.  In 'overwrite' | ||||
| mode, also known as 'flight recorder' mode, writes continuously cycle | ||||
| around the buffer and will never fail, but will unconditionally | ||||
| overwrite old data regardless of whether it's actually been consumed. | ||||
| In no-overwrite mode, writes will fail i.e. data will be lost, if the | ||||
| number of unconsumed sub-buffers equals the total number of | ||||
| sub-buffers in the channel.  It should be clear that if there is no | ||||
| consumer or if the consumer can't consume sub-buffers fast enought, | ||||
| data will be lost in either case; the only difference is whether data | ||||
| is lost from the beginning or the end of a buffer. | ||||
| 
 | ||||
| As explained above, a relayfs channel is made of up one or more | ||||
| per-cpu channel buffers, each implemented as a circular buffer | ||||
| subdivided into one or more sub-buffers.  Messages are written into | ||||
| the current sub-buffer of the channel's current per-cpu buffer via the | ||||
| write functions described below.  Whenever a message can't fit into | ||||
| the current sub-buffer, because there's no room left for it, the | ||||
| client is notified via the subbuf_start() callback that a switch to a | ||||
| new sub-buffer is about to occur.  The client uses this callback to 1) | ||||
| initialize the next sub-buffer if appropriate 2) finalize the previous | ||||
| sub-buffer if appropriate and 3) return a boolean value indicating | ||||
| whether or not to actually go ahead with the sub-buffer switch. | ||||
| 
 | ||||
| To implement 'no-overwrite' mode, the userspace client would provide | ||||
| an implementation of the subbuf_start() callback something like the | ||||
| following: | ||||
| 
 | ||||
| static int subbuf_start(struct rchan_buf *buf, | ||||
|                         void *subbuf, | ||||
| 			void *prev_subbuf, | ||||
| 			unsigned int prev_padding) | ||||
| { | ||||
| 	if (prev_subbuf) | ||||
| 		*((unsigned *)prev_subbuf) = prev_padding; | ||||
| 
 | ||||
| 	if (relay_buf_full(buf)) | ||||
| 		return 0; | ||||
| 
 | ||||
| 	subbuf_start_reserve(buf, sizeof(unsigned int)); | ||||
| 
 | ||||
| 	return 1; | ||||
| } | ||||
| 
 | ||||
| If the current buffer is full i.e. all sub-buffers remain unconsumed, | ||||
| the callback returns 0 to indicate that the buffer switch should not | ||||
| occur yet i.e. until the consumer has had a chance to read the current | ||||
| set of ready sub-buffers.  For the relay_buf_full() function to make | ||||
| sense, the consumer is reponsible for notifying relayfs when | ||||
| sub-buffers have been consumed via relay_subbufs_consumed().  Any | ||||
| subsequent attempts to write into the buffer will again invoke the | ||||
| subbuf_start() callback with the same parameters; only when the | ||||
| consumer has consumed one or more of the ready sub-buffers will | ||||
| relay_buf_full() return 0, in which case the buffer switch can | ||||
| continue. | ||||
| 
 | ||||
| The implementation of the subbuf_start() callback for 'overwrite' mode | ||||
| would be very similar: | ||||
| 
 | ||||
| static int subbuf_start(struct rchan_buf *buf, | ||||
|                         void *subbuf, | ||||
| 			void *prev_subbuf, | ||||
| 			unsigned int prev_padding) | ||||
| { | ||||
| 	if (prev_subbuf) | ||||
| 		*((unsigned *)prev_subbuf) = prev_padding; | ||||
| 
 | ||||
| 	subbuf_start_reserve(buf, sizeof(unsigned int)); | ||||
| 
 | ||||
| 	return 1; | ||||
| } | ||||
| 
 | ||||
| In this case, the relay_buf_full() check is meaningless and the | ||||
| callback always returns 1, causing the buffer switch to occur | ||||
| unconditionally.  It's also meaningless for the client to use the | ||||
| relay_subbufs_consumed() function in this mode, as it's never | ||||
| consulted. | ||||
| 
 | ||||
| The default subbuf_start() implementation, used if the client doesn't | ||||
| define any callbacks, or doesn't define the subbuf_start() callback, | ||||
| implements the simplest possible 'no-overwrite' mode i.e. it does | ||||
| nothing but return 0. | ||||
| 
 | ||||
| Header information can be reserved at the beginning of each sub-buffer | ||||
| by calling the subbuf_start_reserve() helper function from within the | ||||
| subbuf_start() callback.  This reserved area can be used to store | ||||
| whatever information the client wants.  In the example above, room is | ||||
| reserved in each sub-buffer to store the padding count for that | ||||
| sub-buffer.  This is filled in for the previous sub-buffer in the | ||||
| subbuf_start() implementation; the padding value for the previous | ||||
| sub-buffer is passed into the subbuf_start() callback along with a | ||||
| pointer to the previous sub-buffer, since the padding value isn't | ||||
| known until a sub-buffer is filled.  The subbuf_start() callback is | ||||
| also called for the first sub-buffer when the channel is opened, to | ||||
| give the client a chance to reserve space in it.  In this case the | ||||
| previous sub-buffer pointer passed into the callback will be NULL, so | ||||
| the client should check the value of the prev_subbuf pointer before | ||||
| writing into the previous sub-buffer. | ||||
| 
 | ||||
| Writing to a channel | ||||
| -------------------- | ||||
| 
 | ||||
| kernel clients write data into the current cpu's channel buffer using | ||||
| relay_write() or __relay_write().  relay_write() is the main logging | ||||
| function - it uses local_irqsave() to protect the buffer and should be | ||||
| used if you might be logging from interrupt context.  If you know | ||||
| you'll never be logging from interrupt context, you can use | ||||
| __relay_write(), which only disables preemption.  These functions | ||||
| don't return a value, so you can't determine whether or not they | ||||
| failed - the assumption is that you wouldn't want to check a return | ||||
| value in the fast logging path anyway, and that they'll always succeed | ||||
| unless the buffer is full and no-overwrite mode is being used, in | ||||
| which case you can detect a failed write in the subbuf_start() | ||||
| callback by calling the relay_buf_full() helper function. | ||||
| 
 | ||||
| relay_reserve() is used to reserve a slot in a channel buffer which | ||||
| can be written to later.  This would typically be used in applications | ||||
| that need to write directly into a channel buffer without having to | ||||
| stage data in a temporary buffer beforehand.  Because the actual write | ||||
| may not happen immediately after the slot is reserved, applications | ||||
| using relay_reserve() can keep a count of the number of bytes actually | ||||
| written, either in space reserved in the sub-buffers themselves or as | ||||
| a separate array.  See the 'reserve' example in the relay-apps tarball | ||||
| at http://relayfs.sourceforge.net for an example of how this can be | ||||
| done.  Because the write is under control of the client and is | ||||
| separated from the reserve, relay_reserve() doesn't protect the buffer | ||||
| at all - it's up to the client to provide the appropriate | ||||
| synchronization when using relay_reserve(). | ||||
| 
 | ||||
| Closing a channel | ||||
| ----------------- | ||||
| 
 | ||||
| The client calls relay_close() when it's finished using the channel. | ||||
| The channel and its associated buffers are destroyed when there are no | ||||
| longer any references to any of the channel buffers.  relay_flush() | ||||
| forces a sub-buffer switch on all the channel buffers, and can be used | ||||
| to finalize and process the last sub-buffers before the channel is | ||||
| closed. | ||||
| 
 | ||||
| Creating non-relay files | ||||
| ------------------------ | ||||
| 
 | ||||
| relay_open() automatically creates files in the relayfs filesystem to | ||||
| represent the per-cpu kernel buffers; it's often useful for | ||||
| applications to be able to create their own files alongside the relay | ||||
| files in the relayfs filesystem as well e.g. 'control' files much like | ||||
| those created in /proc or debugfs for similar purposes, used to | ||||
| communicate control information between the kernel and user sides of a | ||||
| relayfs application.  For this purpose the relayfs_create_file() and | ||||
| relayfs_remove_file() API functions exist.  For relayfs_create_file(), | ||||
| the caller passes in a set of user-defined file operations to be used | ||||
| for the file and an optional void * to a user-specified data item, | ||||
| which will be accessible via inode->u.generic_ip (see the relay-apps | ||||
| tarball for examples).  The file_operations are a required parameter | ||||
| to relayfs_create_file() and thus the semantics of these files are | ||||
| completely defined by the caller. | ||||
| 
 | ||||
| See the relay-apps tarball at http://relayfs.sourceforge.net for | ||||
| examples of how these non-relay files are meant to be used. | ||||
| 
 | ||||
| Creating relay files in other filesystems | ||||
| ----------------------------------------- | ||||
| 
 | ||||
| By default of course, relay_open() creates relay files in the relayfs | ||||
| filesystem.  Because relay_file_operations is exported, however, it's | ||||
| also possible to create and use relay files in other pseudo-filesytems | ||||
| such as debugfs. | ||||
| 
 | ||||
| For this purpose, two callback functions are provided, | ||||
| create_buf_file() and remove_buf_file().  create_buf_file() is called | ||||
| once for each per-cpu buffer from relay_open() to allow the client to | ||||
| create a file to be used to represent the corresponding buffer; if | ||||
| this callback is not defined, the default implementation will create | ||||
| and return a file in the relayfs filesystem to represent the buffer. | ||||
| The callback should return the dentry of the file created to represent | ||||
| the relay buffer.  Note that the parent directory passed to | ||||
| relay_open() (and passed along to the callback), if specified, must | ||||
| exist in the same filesystem the new relay file is created in.  If | ||||
| create_buf_file() is defined, remove_buf_file() must also be defined; | ||||
| it's responsible for deleting the file(s) created in create_buf_file() | ||||
| and is called during relay_close(). | ||||
| 
 | ||||
| The create_buf_file() implementation can also be defined in such a way | ||||
| as to allow the creation of a single 'global' buffer instead of the | ||||
| default per-cpu set.  This can be useful for applications interested | ||||
| mainly in seeing the relative ordering of system-wide events without | ||||
| the need to bother with saving explicit timestamps for the purpose of | ||||
| merging/sorting per-cpu files in a postprocessing step. | ||||
| 
 | ||||
| To have relay_open() create a global buffer, the create_buf_file() | ||||
| implementation should set the value of the is_global outparam to a | ||||
| non-zero value in addition to creating the file that will be used to | ||||
| represent the single buffer.  In the case of a global buffer, | ||||
| create_buf_file() and remove_buf_file() will be called only once.  The | ||||
| normal channel-writing functions e.g. relay_write() can still be used | ||||
| - writes from any cpu will transparently end up in the global buffer - | ||||
| but since it is a global buffer, callers should make sure they use the | ||||
| proper locking for such a buffer, either by wrapping writes in a | ||||
| spinlock, or by copying a write function from relayfs_fs.h and | ||||
| creating a local version that internally does the proper locking. | ||||
| 
 | ||||
| See the 'exported-relayfile' examples in the relay-apps tarball for | ||||
| examples of creating and using relay files in debugfs. | ||||
| 
 | ||||
| Misc | ||||
| ---- | ||||
| 
 | ||||
| Some applications may want to keep a channel around and re-use it | ||||
| rather than open and close a new channel for each use.  relay_reset() | ||||
| can be used for this purpose - it resets a channel to its initial | ||||
| state without reallocating channel buffer memory or destroying | ||||
| existing mappings.  It should however only be called when it's safe to | ||||
| do so i.e. when the channel isn't currently being written to. | ||||
| 
 | ||||
| Finally, there are a couple of utility callbacks that can be used for | ||||
| different purposes.  buf_mapped() is called whenever a channel buffer | ||||
| is mmapped from user space and buf_unmapped() is called when it's | ||||
| unmapped.  The client can use this notification to trigger actions | ||||
| within the kernel application, such as enabling/disabling logging to | ||||
| the channel. | ||||
| 
 | ||||
| 
 | ||||
| Resources | ||||
| ========= | ||||
| 
 | ||||
| For news, example code, mailing list, etc. see the relayfs homepage: | ||||
| 
 | ||||
|     http://relayfs.sourceforge.net | ||||
| 
 | ||||
| 
 | ||||
| Credits | ||||
| ======= | ||||
| 
 | ||||
| The ideas and specs for relayfs came about as a result of discussions | ||||
| on tracing involving the following: | ||||
| 
 | ||||
| Michel Dagenais		<michel.dagenais@polymtl.ca> | ||||
| Richard Moore		<richardj_moore@uk.ibm.com> | ||||
| Bob Wisniewski		<bob@watson.ibm.com> | ||||
| Karim Yaghmour		<karim@opersys.com> | ||||
| Tom Zanussi		<zanussi@us.ibm.com> | ||||
| 
 | ||||
| Also thanks to Hubertus Franke for a lot of useful suggestions and bug | ||||
| reports. | ||||
		Loading…
	
		Reference in New Issue
	
	Block a user