Because one dirty seg can only be mapped to one dirty_type. Otherwise, it's a bug.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
[Jaegeuk Kim: modify a comment related to this patch]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, do_checkpoint() will call congestion_wait() for waiting the pages
(previous submitted node/meta/data pages) to be written back.
Because congestion_wait() will set a regular period (e.g. HZ / 50 ) for waiting, and
no additional wake up mechanism was introduced if IO ends up before regular period costed.
Yuan Zhong found there is a situation that after the pages have been written back,
but the checkpoint thread still wait for congestion_wait to exit.
So here we store checkpoint task into f2fs_sb when doing checkpoint, it'll wait for IO completes
if there's IO going on, and in the end IO path, wake up checkpoint task when IO ends up.
Thanks to Yuan Zhong's pre work about this problem.
Reported-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch add macro MAX_BIO_BLOCKS to limit value of npages in
f2fs_bio_alloc, it can avoid allocating failure in bio_alloc caused by
npages is larger than BIO_MAX_PAGES.
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
An error "label at end of compound statement" will occur if CONFIG_F2FS_STAT_FS
disabled.
fs/f2fs/segment.c:556:1: error: label at end of compound statement
So clean up the 'out' label to fix it.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes a deadlock bug that occurs quite often when there are
concurrent write and fsync on a same file.
Following is the simplified call trace when tasks get hung.
fsync thread:
- f2fs_sync_file
...
- f2fs_write_data_pages
...
- update_extent_cache
...
- update_inode
- wait_on_page_writeback
bdi writeback thread
- __writeback_single_inode
- f2fs_write_data_pages
- mutex_lock(sbi->writepages)
The deadlock happens when the fsync thread waits on a inode page that has
been added to the f2fs' cached bio sbi->bio[NODE], and unfortunately,
no one else could be able to submit the cached bio to block layer for
writeback. This is because the fsync thread already hold a sbi->fs_lock and
the sbi->writepages lock, causing the bdi thread being blocked when attempt
to write data pages for the same inode. At the same time, f2fs_gc thread
does not notice the situation and could not help. Even the sync syscall
gets blocked.
To fix it, we could submit the cached bio first before waiting on a inode page
that is being written back.
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: add more cases to use f2fs_wait_on_page_writeback]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
bio->bi_private is not always needed. As in the reading data path,
end_read_io does not need bio_private for further using, so moving
bio_private allocation out of f2fs_bio_alloc(). Alloc it in the
submit_write_page(), and ignore it in the f2fs_readpage().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes check_prefree_segments initially designed to enhance the
performance by narrowing the range of LBA usage across the whole block device.
When allocating a new segment, previous f2fs tries to find proper prefree
segments, and then, if finds a segment, it reuses the segment for further
data or node block allocation.
However, I found that this was totally wrong approach since the prefree segments
have several data or node blocks that will be used by the roll-forward mechanism
operated after sudden-power-off.
Let's assume the following scenario.
/* write 8MB with fsync */
for (i = 0; i < 2048; i++) {
offset = i * 4096;
write(fd, offset, 4KB);
fsync(fd);
}
In this case, naive segment allocation sequence will be like:
data segment: x, x+1, x+2, x+3
node segment: y, y+1, y+2, y+3.
But, if we can reuse prefree segments, the sequence can be like:
data segment: x, x+1, y, y+1
node segment: y, y+1, y+2, y+3.
Because, y, y+1, and y+2 became prefree segments one by one, and those are
reused by data allocation.
After conducting this workload, we should consider how to recover the latest
inode with its data.
If we reuse the prefree segments such as y or y+1, we lost the old node blocks
so that f2fs even cannot start roll-forward recovery.
Therefore, I suggest that we should remove reusing prefree segments.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Optimize the while loop condition
Since this condition will always be true and while loop will
be terminated by the following condition in code:
if (segno >= TOTAL_SEGS(sbi))
break;
Hence we can replace the while loop condition with while(1)
instead of always checking for segno to be less than Total segs.
Also we do not need to use TOTAL_SEGS() everytime. We can store
this value in a local variable since this value is constant.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a file is linked, f2fs loose its parent inode number so that fsync calls
for the linked file should do checkpoint all the time.
But, if we can recover its parent inode number after the checkpoint, we can
adjust roll-forward mechanism for the further fsync calls, which is able to
improve the fsync performance significatly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
It's used only locally and could be static.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We can get the value directly from pointer "curseg".
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Some, counters are needed only for the statistical information
while debugging.
So, those can be controlled using CONFIG_F2FS_STAT_FS,
pushing the usage for few variables under this flag.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When testing f2fs on an SSD, I found some 128 page IOs followed by 1 page IO
were issued by f2fs_write_node_pages.
This means that there were some mishandling flows which degrades performance.
Previous f2fs_write_node_pages determines the number of pages to be written,
nr_to_write, as follows.
1. The bio_get_nr_vecs returns 129 pages.
2. The bio_alloc makes a room for 128 pages.
3. The initial 128 pages go into one bio.
4. The existing bio is submitted, and a new bio is prepared for the last 1 page.
5. Finally, sync_node_pages submits the last 1 page bio.
The problem is from the use of bio_get_nr_vecs, so this patch replace it
with max_hw_blocks using queue_max_sectors.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Adding REQ_META for all the metadata requests can help in improving the
FS performance, if the underlying device supports TAGGING.
So, when considering the submit_bio path for all the f2fs requests. We can
add REQ_META for all the META requests.
As a precursor to this change we considered the commit
4265900e0b 'mmc: MMC-4.5 Data Tag Support'
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add tracepoints to debug the various page write operation
like data pages, meta pages.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
[Jaegeuk: remove unnecessary tracepoints]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Like below, there are 8 segment bitmaps for SSR victim candidates.
enum dirty_type {
DIRTY_HOT_DATA, /* dirty segments assigned as hot data logs */
DIRTY_WARM_DATA, /* dirty segments assigned as warm data logs */
DIRTY_COLD_DATA, /* dirty segments assigned as cold data logs */
DIRTY_HOT_NODE, /* dirty segments assigned as hot node logs */
DIRTY_WARM_NODE, /* dirty segments assigned as warm node logs */
DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */
DIRTY, /* to count # of dirty segments */
PRE, /* to count # of entirely obsolete segments */
NR_DIRTY_TYPE
};
The upper 6 bitmaps indicates segments dirtied by active log areas respectively.
And, the DIRTY bitmap integrates all the 6 bitmaps.
For example,
o DIRTY_HOT_DATA : 1010000
o DIRTY_WARM_DATA: 0100000
o DIRTY_COLD_DATA: 0001000
o DIRTY_HOT_NODE : 0000010
o DIRTY_WARM_NODE: 0000001
o DIRTY_COLD_NODE: 0000000
In this case,
o DIRTY : 1111011,
which means that we should guarantee the consistency between DIRTY and other
bitmaps concreately.
However, the SSR mode selects victims freely from any log types, which can set
multiple bits across the various bitmap types.
So, this patch eliminates this inconsistency.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a new condition that allocates free segments in the current
active section even if SSR is needed.
Otherwise, f2fs cannot allocate remained free segments in the section since
SSR finds dirty segments only.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes a bitmap for victim segments selected by foreground GC, and
modifies the other bitmap for victim segments selected by background GC.
1) foreground GC bitmap
: We don't need to manage this, since we just only one previous victim section
number instead of the whole victim history.
The f2fs uses the victim section number in order not to allocate currently
GC'ed section to current active logs.
2) background GC bitmap
: This bitmap is used to avoid selecting victims repeatedly by background GCs.
In addition, the victims are able to be selected by foreground GCs, since
there is no need to read victim blocks during foreground GCs.
By the fact that the foreground GC reclaims segments in a section unit, it'd
be better to manage this bitmap based on the section granularity.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When allocating a new segment under the LFS mode, we should keep the section
boundary.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Let's use a macro to get the total number of sections.
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Use kmemdup instead of kzalloc and memcpy.
Signed-off-by: Alexandru Gheorghiu <gheorghiuandru@gmail.com>
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch makes clearer the ambiguous f2fs_gc flow as follows.
1. Remove intermediate checkpoint condition during f2fs_gc
(i.e., should_do_checkpoint() and GC_BLOCKED)
2. Remove unnecessary return values of f2fs_gc because of #1.
(i.e., GC_NODE, GC_OK, etc)
3. Simplify write_checkpoint() because of #2.
4. Clarify the main f2fs_gc flow.
o monitor how many freed sections during one iteration of do_garbage_collect().
o do GC more without checkpoints if we can't get enough free sections.
o do checkpoint once we've got enough free sections through forground GCs.
5. Adopt thread-logging (Slack-Space-Recycle) scheme more aggressively on data
log types. See. get_ssr_segement()
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enhances the checkpoint routine to cope with IO errors.
Basically f2fs detects IO errors from end_io_write, and the errors are able to
be occurred during one of data, node, and meta page writes.
In the previous code, when an IO error is occurred during writes, f2fs sets a
flag, CP_ERROR_FLAG, in the raw ckeckpoint buffer which will be written to disk.
Afterwards, write_checkpoint() will check the flag and remount f2fs as a
read-only (ro) mode.
However, even once f2fs is remounted as a ro mode, dirty checkpoint pages are
freely able to be written to disk by flusher or kswapd in background.
In such a case, after cold reboot, f2fs would restore the checkpoint data having
CP_ERROR_FLAG, resulting in disabling write_checkpoint and remounting f2fs as
a ro mode again.
Therefore, let's prevent any checkpoint page (meta) writes once an IO error is
occurred, and remount f2fs as a ro mode right away at that moment.
Reported-by: Oliver Winker <oliver@oli1170.net>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
I'd like to revisit the f2fs_gc flow and rewrite as follows.
1. In practical, the nGC parameter of f2fs_gc is meaningless. So, let's
remove it.
2. Background GC marks victim blocks as dirty one at a time.
3. Foreground GC should do cleaning job until acquiring enough free
sections. Afterwards, it needs to do checkpoint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Since, the memory for the object of dirty_seglist_info is allocated
using kzalloc - which returns zeroed out memory. So, there is no need
to initialize the nr_dirty values with zeroes.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Practically, has_not_enough_free_secs() should calculate with the numbers of
current node and directory data blocks together.
Actually the equation was implemented in need_to_flush().
So, this patch removes need_flush() and moves the equation into
has_not_enough_free_secs().
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch resolves a build warning reported by kbuild test robot.
"
fs/f2fs/segment.c: In function '__get_segment_type':
fs/f2fs/segment.c:806:1: warning: control reaches end of non-void
function [-Wreturn-type]
"
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
m68k allmodconfig:
fs/f2fs/data.c: In function ‘read_end_io’:
fs/f2fs/data.c:311: error: implicit declaration of function ‘prefetchw’
fs/f2fs/segment.c: In function ‘f2fs_end_io_write’:
fs/f2fs/segment.c:628: error: implicit declaration of function ‘prefetchw’
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should guarantee not to do *scheduling while atomic*.
I found, in atomic f2fs_end_io_write(), there is a set_page_dirty() call
to deal with IO errors.
But, set_page_dirty() calls:
-> f2fs_set_data_page_dirty()
-> set_dirty_dir_page()
-> cond_resched() which results in scheduling.
In order to avoid this, I'd like to remove simply set_page_dirty(),
since the page is already marked as ERROR and f2fs will be operated
as the read-only mode as well.
So, there is no recovery issue with this.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Do cleanup more for better code readability.
- Change the parameter set of f2fs_bio_alloc()
This function should allocate a bio only since it is not something like
f2fs_bio_init(). Instead, the caller should initialize the allocated bio.
- Introduce SECTOR_FROM_BLOCK
This macro translates a block address to its sector address.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Since, GFP_NOFS(__GFP_WAIT) is used for allocation requests of bio in f2fs.
So, there is no chance of returning NULL from the BIO allocation.
Making the bio allocation routine for f2fs simpler.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
No need to initialize "struct f2fs_gc_kthread *gc_th = NULL",
as gc_th = NULL, will be taken care by the return values of kmalloc().
And fix codes in other places.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
As pointed out by Randy Dunlap, this patch removes all usage of "/**" for comment
blocks. Instead, just use "/*".
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the bugs reported by the sparse tool.
Initial reports were written by "kbuild test robot" managed by fengguang.wu.
In my local machines, I've tested also by running:
> make C=2 CF="-D__CHECK_ENDIAN__"
Accordingly, I've found lots of warnings and bugs related to the endian
conversion. And I've fixed all at this moment.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This adds specific functions not only to manage dirty/free segments, SIT pages,
a cache for SIT entries, and summary entries, but also to allocate free blocks
and write three types of pages: data, node, and meta.
- F2FS maintains three types of bitmaps in memory, which indicate free, prefree,
and dirty segments respectively.
- The key information of an SIT entry consists of a segment number, the number
of valid blocks in the segment, a bitmap to identify there-in valid or invalid
blocks.
- An SIT page is composed of a certain range of SIT entries, which is maintained
by the address space of meta_inode.
- To cache SIT entries, a simple array is used. The index for the array is the
segment number.
- A summary entry for data contains the parent node information. A summary entry
for node contains its node offset from the inode.
- F2FS manages information about six active logs and those summary entries in
memory. Whenever one of them is changed, its summary entries are flushed to
its SIT page maintained by the address space of meta_inode.
- This patch adds a default block allocation function which supports heap-based
allocation policy.
- This patch adds core functions to write data, node, and meta pages. Since LFS
basically produces a series of sequential writes, F2FS merges sequential bios
with a single one as much as possible to reduce the IO scheduling overhead.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>