linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-25 21:51:40 +00:00

Author	SHA1	Message	Date
Gabriel Krisman Bertazi	5c26d2f1d3	unicode: Don't special case ignorable code points We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>	2024-10-09 13:34:01 -04:00
Jeff Johnson	68318904a7	unicode: add MODULE_DESCRIPTION() macros Currently 'make W=1' reports: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/unicode/utf8data.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/unicode/utf8-selftest.o Add a MODULE_DESCRIPTION() to utf8-selftest.c and utf8data.c_shipped, and update mkutf8data.c to add a MODULE_DESCRIPTION() to any future generated utf8data file. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240524-md-unicode-v1-1-e2727ce8574d@quicinc.com Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>	2024-06-20 19:30:02 -04:00
Ben Dooks	7bd3d76a1f	unicode: make utf8 test count static The variables failed_tests and total_tests are not used outside of the utf8-selftest.c file so make them static to avoid the following warnings: fs/unicode/utf8-selftest.c:17:14: warning: symbol 'failed_tests' was not declared. Should it be static? fs/unicode/utf8-selftest.c:18:14: warning: symbol 'total_tests' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Link: https://lore.kernel.org/r/20240308183215.1924331-1-ben.dooks@codethink.co.uk Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>	2024-06-20 19:28:05 -04:00
Masahiro Yamada	b1992c3772	kbuild: use $(src) instead of $(srctree)/$(src) for source directory Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>	2024-05-10 04:34:52 +09:00
Nick Alcock	573858e85d	unicode: remove MODULE_LICENSE in non-modules Since commit `8b41fc4454` ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations are used to identify modules. As a consequence, uses of the macro in non-modules will cause modprobe to misidentify their containing object file as a module when it is not (false positives), and modprobe might succeed rather than failing with a suitable error message. So remove it in the files in this commit, none of which can be built as modules. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Gabriel Krisman Bertazi <krisman@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-modules@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Gabriel Krisman Bertazi <krisman@collabora.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-04-13 13:13:54 -07:00
Masahiro Yamada	a5575df580	kbuild: unify cmd_copy and cmd_shipped cmd_copy and cmd_shipped have similar functionality. The difference is that cmd_copy uses 'cp' while cmd_shipped 'cat'. Unify them into cmd_copy because this macro name is more intuitive. Going forward, cmd_copy will use 'cat' to avoid the permission issue. I also thought of 'cp --no-preserve=mode' but this option is not mentioned in the POSIX spec [1], so I am keeping the 'cat' command. [1]: https://pubs.opengroup.org/onlinepubs/009695299/utilities/cp.html Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2022-02-14 10:37:32 +09:00
Linus Torvalds	630c12862c	Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the previous CONFIG_UNICODE. It is -rc material since we don't want to expose the former symbol on 5.17. This has been living on linux-next for the past week. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8jAUPq50yNjPBCi4QEuZqsMcppQFAmH4lC0ACgkQQEuZqsMc ppRl1Q/+Lyba+DORs26C4p1GDS5ezHOCdbBUE8RFwWjIl+h5ckQ/8kndaXPRLorZ 1S9E6h5RfqhekGKOhMTXyfzqcW8qMzUy4i3J2lmJpDwATqLt+4Wu/M2BBH2CaIIL EhhW8D+WduAEM/TFYihH9LJ0RopvIsqcy8qdu+oSBGfPAdxJ0f2+Yx0pNTRfqVmi 8+Dry0nRhP12o9wXElpZ0/BYEZTlY+Zo6L/heT6/GKDLpz/YmZp18GAc/0TWb3LL ASujr+anU2LxSFskkyuMu+rbFE8eDshvHEuBZLxlD2o+tG6lAi4mNWZYc0/+jPMw 8TdJ5MEX3IlljXLRKuYctoCdsFQKLxH5IN5wLkiLvM5fBpeb/sWqNolx8f2s/f9R TaUdjwiqFnML4VnlEH3hd3/hUUVbnE+xJo6g1iRGgJY3eecimvwl8P5H7k9Sn3OS 4zh0bHT9pfg+vUR0BVnfdWi4OpPxSrdqCgFhHsmKaGMvTApm0qMKK1Cg4OPNtYwr d1RMqsqEBSJTHzr0nHoiWLhkIo8npRPy+LMK51D8j6wg0kOj4GGYerWm1MD9ZlbI rhPy7nDgdcH48Gk1m6o7dROZKCvkZK+/QDPelBgZHGcGB94lUugYVJQrlBjI+2+7 Wx5oQLgQgeabeMtDZ/YNy5Dsre20vas2oLj5cs6uuoWNOcBO6Ew= =YVNN -----END PGP SIGNATURE----- Merge tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode cleanup from Gabriel Krisman Bertazi: "A fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the previous CONFIG_UNICODE. It is -rc material since we don't want to expose the former symbol on 5.17. This has been living on linux-next for the past week" * tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: clean up the Kconfig symbol confusion	2022-02-01 11:13:24 -08:00
Christoph Hellwig	5298d4bfe8	unicode: clean up the Kconfig symbol confusion Turn the CONFIG_UNICODE symbol into a tristate that generates some always built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol. Note that a lot of the IS_ENABLED() checks could be turned from cpp statements into normal ifs, but this change is intended to be fairly mechanic, so that should be cleaned up later. Fixes: `2b3d047870` ("unicode: Add utf8-data module") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2022-01-20 19:57:24 -05:00
Linus Torvalds	98f2345773	unicode: fix .gitignore for generated utfdata file Commit `2b3d047870` ("unicode: Add utf8-data module") changed the generated utf8data file from 'utf8data.h' to 'utf8data.c', but didn't change the comments or the .gitignore to match. The comments should be updated too, but at least they don't cause any visible breakage. But the gitignore file needs changing to avoid git complaining about untracked files. Fixes: `2b3d047870` ("unicode: Add utf8-data module") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-17 07:26:43 +02:00
Christoph Hellwig	e2a58d2d34	unicode: only export internal symbols for the selftests The exported symbols in utf8-norm.c are not needed for normal file system consumers, so move them to conditional _GPL exports just for the selftest. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-12 11:41:39 -03:00
Christoph Hellwig	2b3d047870	unicode: Add utf8-data module utf8data.h contains a large database table which is an auto-generated decodification trie for the unicode normalization functions. Allow building it into a separate module. Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-12 11:41:39 -03:00
Christoph Hellwig	6ca99ce756	unicode: cache the normalization tables in struct unicode_map Instead of repeatedly looking up the version add pointers to the NFD and NFD+CF tables to struct unicode_map, and pass a unicode_map plus index to the functions using the normalization tables. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-11 17:02:02 -03:00
Christoph Hellwig	fbc59d6505	unicode: move utf8cursor to utf8-selftest.c Only used by the tests, so no need to keep it in the core. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-11 17:01:58 -03:00
Christoph Hellwig	9012d79cf0	unicode: simplify utf8len Just use the utf8nlen implementation with a (size_t)-1 len argument, similar to utf8_lookup. Also move the function to utf8-selftest.c, as it isn't used anywhere else. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-11 17:01:54 -03:00
Christoph Hellwig	379210db48	unicode: remove the unused utf8{,n}age{min,max} functions No actually used anywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-11 17:01:50 -03:00
Christoph Hellwig	49bd03cc7e	unicode: pass a UNICODE_AGE() tripple to utf8_load Don't bother with pointless string parsing when the caller can just pass the version in the format that the core expects. Also remove the fallback to the latest version that none of the callers actually uses. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-11 17:01:46 -03:00
Christoph Hellwig	a440943e68	unicode: remove the charset field from struct unicode_map It is hardcoded and only used for a f2fs sysfs file where it can be hardcoded just as easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-10-11 17:01:35 -03:00
Masahiro Yamada	9009b45581	.gitignore: prefix local generated files with a slash The pattern prefixed with '/' matches files in the same directory, but not ones in sub-directories. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Andra Paraschiv <andraprs@amazon.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2021-05-02 00:43:35 +09:00
Daniel Rosenberg	3d7bfea8b8	unicode: Add utf8_casefold_hash This adds a case insensitive hash function to allow taking the hash without needing to allocate a casefolded copy of the string. The existing d_hash implementations for casefolding allocate memory within rcu-walk, by avoiding it we can be more efficient and avoid worrying about a failed allocation. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-09-10 14:03:31 -07:00
Masahiro Yamada	d198b34f38	.gitignore: add SPDX License Identifier Add SPDX License Identifier to all .gitignore files. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-25 11:50:48 +01:00
Masahiro Yamada	5f2fb52fac	kbuild: rename hostprogs-y/always to hostprogs/always-y In old days, the "host-progs" syntax was used for specifying host programs. It was renamed to the current "hostprogs-y" in 2004. It is typically useful in scripts/Makefile because it allows Kbuild to selectively compile host programs based on the kernel configuration. This commit renames like follows: always -> always-y hostprogs-y -> hostprogs So, scripts/Makefile will look like this: always-$(CONFIG_BUILD_BIN2C) += ... always-$(CONFIG_KALLSYMS) += ... ... hostprogs := $(always-y) $(always-m) I think this makes more sense because a host program is always a host program, irrespective of the kernel configuration. We want to specify which ones to compile by CONFIG options, so always-y will be handier. The "always", "hostprogs-y", "hostprogs-m" will be kept for backward compatibility for a while. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2020-02-04 01:53:07 +09:00
Colin Ian King	aa28b98d6d	unicode: make array 'token' static const, makes object smaller Don't populate the array 'token' on the stack but instead make it static const. Makes the object code smaller by 234 bytes. Before: text data bss dec hex filename 5371 272 0 5643 160b fs/unicode/utf8-core.o After: text data bss dec hex filename 5041 368 0 5409 1521 fs/unicode/utf8-core.o (gcc version 9.2.1, amd64) Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2019-09-17 11:48:24 -04:00
Krzysztof Wilczynski	334b427e96	unicode: Move static keyword to the front of declarations Move the static keyword to the front of declarations of nfdi_test_data and nfdicf_test_data, and resolve the following compiler warnings that can be seen when building with warnings enabled (W=1): fs/unicode/utf8-selftest.c:38:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] fs/unicode/utf8-selftest.c:92:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: Krzysztof Wilczynski <kw@linux.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>	2019-09-17 11:48:11 -04:00
Linus Torvalds	2e756758e5	Many bug fixes and cleanups, and an optimization for case-insensitive lookups. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl0lFIoACgkQ8vlZVpUN gaOwNQf/aJxFxHVf4t3lga8kfoMhlbwINQknsGUVwg32HporMa1NxQXjbEMMhs6V A31gBJ44nYVz1enz7nvbE4kx4quF4E8rDVprEetphv4i8GSdUAihwJwY5/H0oSd8 rxzTZzNKddoyN/j7H4LgAh7bo6IFk54kUuaAWuZDJnJtfLNQ6RBaIwg6u6Z8Fael 9H3u/RtFHqWPQp5j50PMUG06abr26GKi1gLL+yeoFD1tuzC54B5i6uy34amrXlon 5agIQ7YuB9bigK4VaLoF4df7o+7+Oa6ENaQ9O/TQc9Uy9ngdVlPpNb2bVDizRLNn e369sBFTf3C8sMycJy6x9TCqg2B7Hw== =EpCF -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many bug fixes and cleanups, and an optimization for case-insensitive lookups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix coverity warning on error path of filename setup ext4: replace ktype default_attrs with default_groups ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree() ext4: refactor initialize_dirent_tail() ext4: rename "dirent_csum" functions to use "dirblock" ext4: allow directory holes jbd2: drop declaration of journal_sync_buffer() ext4: use jbd2_inode dirty range scoping jbd2: introduce jbd2_inode dirty range scoping mm: add filemap_fdatawait_range_keep_errors() ext4: remove redundant assignment to node ext4: optimize case-insensitive lookups ext4: make __ext4_get_inode_loc plug ext4: clean up kerneldoc warnigns when building with W=1 ext4: only set project inherit bit for directory ext4: enforce the immutable flag on open files ext4: don't allow any modifications to an immutable file jbd2: fix typo in comment of journal_submit_inode_data_buffers jbd2: fix some print format mistakes ext4: gracefully handle ext4_break_layouts() failure during truncate	2019-07-10 21:06:01 -07:00
Gabriel Krisman Bertazi	3ae72562ad	ext4: optimize case-insensitive lookups Temporarily cache a casefolded version of the file name under lookup in ext4_filename, to avoid repeatedly casefolding it. I got up to 30% speedup on lookups of large directories (>100k entries), depending on the length of the string under lookup. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-06-19 23:45:09 -04:00
Thomas Gleixner	9f8068503d	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 294 Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation this program is distributed in the hope that it [would] be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 9 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141901.804956444@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-06-05 17:36:38 +02:00
Thomas Gleixner	9c92ab6191	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 282 Based on 1 normalized pattern(s): this software is licensed under the terms of the gnu general public license version 2 as published by the free software foundation and may be copied distributed and modified under those terms this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 285 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141900.642774971@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-06-05 17:36:37 +02:00
Thomas Gleixner	ec8f24b7fa	treewide: Add SPDX license identifier - Makefile/Kconfig Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-21 10:50:46 +02:00
Theodore Ts'o	7fb6413336	unicode: update to Unicode 12.1.0 final Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>	2019-05-12 13:26:08 -04:00
Theodore Ts'o	15f0d8d0ba	unicode: add missing check for an error return from utf8lookup() Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>	2019-05-12 04:56:51 -04:00
Masahiro Yamada	28ba53c076	unicode: refactor the rule for regenerating utf8data.h scripts/mkutf8data is used only when regenerating utf8data.h, which never happens in the normal kernel build. However, it is irrespectively built if CONFIG_UNICODE is enabled. Moreover, there is no good reason for it to reside in the scripts/ directory since it is only used in fs/unicode/. Hence, move it from scripts/ to fs/unicode/. In some cases, we bypass build artifacts in the normal build. The conventional way to do so is to surround the code with ifdef REGENERATE_. For example, - `7373f4f83c` ("kbuild: add implicit rules for parser generation") - `6aaf49b495` ("crypto: arm,arm64 - Fix random regeneration of S_shipped") I rewrote the rule in a more kbuild'ish style. In the normal build, utf8data.h is just shipped from the check-in file. $ make [ snip ] SHIPPED fs/unicode/utf8data.h CC fs/unicode/utf8-norm.o CC fs/unicode/utf8-core.o CC fs/unicode/utf8-selftest.o AR fs/unicode/built-in.a If you want to generate utf8data.h based on UCD, put .txt files into fs/unicode/, then pass REGENERATE_UTF8DATA=1 from the command line. The mkutf8data tool will be automatically compiled to generate the utf8data.h from the *.txt files. $ make REGENERATE_UTF8DATA=1 [ snip ] HOSTCC fs/unicode/mkutf8data GEN fs/unicode/utf8data.h CC fs/unicode/utf8-norm.o CC fs/unicode/utf8-core.o CC fs/unicode/utf8-selftest.o AR fs/unicode/built-in.a I renamed the check-in utf8data.h to utf8data.h_shipped so that this will work for the out-of-tree build. You can update it based on the latest UCD like this: $ make REGENERATE_UTF8DATA=1 fs/unicode/ $ cp fs/unicode/utf8data.h fs/unicode/utf8data.h_shipped Also, I added entries to .gitignore and dontdiff. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-28 13:45:36 -04:00
Gabriel Krisman Bertazi	1215d239e7	unicode: update unicode database unicode version 12.1.0 Regenerate utf8data.h based on the latest UCD files and run tests against the latest version. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-25 13:59:17 -04:00
Gabriel Krisman Bertazi	f0d6cc0032	unicode: introduce test module for normalized utf8 implementation This implements a in-kernel sanity test module for the utf8 normalization core. At probe time, it will run basic sequences through the utf8n core, to identify problems will equivalent sequences and normalization/casefold code. This is supposed to be useful for regression testing when adding support for a new version of utf8 to linux. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-25 13:56:01 -04:00
Gabriel Krisman Bertazi	9d53690f0d	unicode: implement higher level API for string handling This patch integrates the utf8n patches with some higher level API to perform UTF-8 string comparison, normalization and casefolding operations. Implemented is a variation of NFD, and casefold is performed by doing full casefold on top of NFD. These algorithms are based on the core implemented by Olaf Weber from SGI. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-25 13:51:22 -04:00
Olaf Weber	a8384c6879	unicode: reduce the size of utf8data[] Remove the Hangul decompositions from the utf8data trie, and do algorithmic decomposition to calculate them on the fly. To store the decomposition the caller of utf8lookup()/utf8nlookup() must provide a 12-byte buffer, which is used to synthesize a leaf with the decomposition. This significantly reduces the size of the utf8data[] array. Changes made by Gabriel: Rebase to mainline Fix checkpatch errors Extract robustness fixes and merge back to original mkutf8data.c patch Regenerate utf8data.h Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-25 13:49:18 -04:00
Olaf Weber	44594c2fbf	unicode: introduce code for UTF-8 normalization Supporting functions for UTF-8 normalization are in utf8norm.c with the header utf8norm.h. Two normalization forms are supported: nfdi and nfdicf. nfdi: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. nfdicf: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. - Apply a full casefold (C + F). For the purposes of the code, a string is valid UTF-8 if: - The values encoded are 0x1..0x10FFFF. - The surrogate codepoints 0xD800..0xDFFFF are not encoded. - The shortest possible encoding is used for all values. The supporting functions work on null-terminated strings (utf8 prefix) and on length-limited strings (utf8n prefix). From the original SGI patch and for conformity with coding standards, the utf8data_t typedef was dropped, since it was just masking the struct keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I decided to keep it, since they are simple pointers to memory buffers, and using uchars here wouldn't provide any more meaningful information. From the original submission, we also converted from the compatibility form to canonical. Changes made by Gabriel: Rebase to Mainline Fix up checkpatch.pl warnings Drop typedefs move out of libxfs Convert from NFKD to NFD Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-25 13:45:46 -04:00
Gabriel Krisman Bertazi	955405d117	unicode: introduce UTF-8 character database The decomposition and casefolding of UTF-8 characters are described in a prefix tree in utf8data.h, which is a generate from the Unicode Character Database (UCD), published by the Unicode Consortium, and should not be edited by hand. The structures in utf8data.h are meant to be used for lookup operations by the unicode subsystem, when decoding a utf-8 string. mkutf8data.c is the source for a program that generates utf8data.h. It was written by Olaf Weber from SGI and originally proposed to be merged into Linux in 2014. The original proposal performed the compatibility decomposition, NFKD, but the current version was modified by me to do canonical decomposition, NFD, as suggested by the community. The changes from the original submission are: * Rebase to mainline. * Fix out-of-tree-build. * Update makefile to build 11.0.0 ucd files. * drop references to xfs. * Convert NFKD to NFD. * Merge back robustness fixes from original patch. Requested by Dave Chinner. The original submission is archived at: <https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs> The utf8data.h file can be regenerated using the instructions in fs/unicode/README.utf8data. - Notes on the update from 8.0.0 to 11.0: The structure of the ucd files and special cases have not experienced any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee LC characters, which is an interesting case for case-folding. The update is accompanied by new tests on the test_ucd module to catch specific cases. No changes to mkutf8data script were required for the updates. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-04-25 13:38:44 -04:00

37 Commits