The locale generation are issues in parallel to try speed locale
generation. The maximum number of jobs are limited to the online
CPU (in hope to not overcommit on environments with lower cores
than tests).
On a Ryzen 9, the test execution improves from ~6.7s to ~1.4s.
Tested-by: Mark Wielaard <mark@klomp.org>
Previously, they were assumed to be in ISO-8859-1, and that the output
charset overlapped with ISO-8859-1 for the characters actually used.
However, this did not work as intended on many architectures even for
an ISO-8859-1 output encoding because of the char signedness bug in
lr_getc. Therefore, this commit switches to UTF-8 without making
provisions for backwards compatibility.
The following Elisp code can be used to convert locale definition files
to UTF-8:
(defun glibc/convert-localedef (from to)
(interactive "r")
(save-excursion
(save-restriction
(narrow-to-region from to)
(goto-char (point-min))
(save-match-data
(while (re-search-forward "<U\\([0-9a-fA-F]+\\)>" nil t)
(let* ((codepoint (string-to-number (match-string 1) 16))
(converted
(cond
((memq codepoint '(?/ ?\ ?< ?>))
(string ?/ codepoint))
((= codepoint ?\") "<U0022>")
(t (string codepoint)))))
(replace-match converted t)))))))
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
This will permit reusing the Unicode character processing for
different character encodings, not just the current <U...> encoding.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
The array lr->buf contains characters, which can be signed. A 0xff
byte in the input could be incorrectly reported as EOF. More
importantly, get_string in linereader.c converts a signed input byte
to a Unicode code point using ADDWC ((uint32_t) ch), under the
assumption that this decodes the ISO-8859-1 input encoding. If char
is signed, this does not give the correct result. This means that
ISO-8859-1 input files for localedef are not actually supported,
contrary to the comment in get_string. This is a happy accident because
we can therefore change the file encoding to UTF-8 without impacting
backwards compatibility.
While at it, remove the \32 check for MS-DOS end-of-file character (^Z).
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
And introduce struct lr_buffer. The functions addc and adds can
be called from functions, enabling subsequent refactoring.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
This avoids an alias violation later. This commit also fixes
an incorrect double-checked locking idiom in _nl_init_era_entries.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
We can call the cleanup functions directly from _nl_unload_locale
if we pass the category to it.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The function performs the same steps for ld_archive locales
(mapped from an archive), and this code is not performance-critical,
so the specialization does not add value.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
On 32-bit machines this has no affect. On 64-bit machines
{u}int_fast{16|32} are set as {u}int64_t which is often not
ideal. Particularly x86_64 this change both saves code size and
may save instruction cost.
Full xcheck passes on x86_64.
ISO C17, POSIX Issue 7, and ISO 30112 all allow the char*
types to be empty strings i.e. "", integer or char values to
be -1 or CHAR_MAX respectively, with the exception of
decimal_point which must be non-empty in ISO C. Note that
the defaults for mon_grouping vary, but are functionaly
equivalent e.g. "\177" (no further grouping reuqired) vs.
"" (no grouping defined for all groups).
We include a broad comment talking about harmonizing ISO C,
POSIX, ISO 30112, and the default C/POSIX locale for glibc.
We reorder all setting based on locale/categories.def order.
We soften all missing definitions from errors to warnings when
defaults exist.
Given that ISO C, POSIX and ISO 30112 allow the empty string
we change LC_MONETARY handling of mon_decimal_point to allow
the empty string. If mon_decimal_point is not defined at all
then we pick the existing legacy glibc default value of
<U002E> i.e. ".".
We also set the default for mon_thousands_sep_wc at the
same time as mon_thousands_sep, but this is not a change in
behaviour, it is always either a matching value or L'\0',
but if in the future we change the default to a non-empty
string we would need to update both at the same time.
Tested on x86_64 and i686 without regressions.
Tested with install-locale-archive target.
Tested with install-locale-files target.
Reviewed-by: DJ Delorie <dj@redhat.com>
Whenever locale data for any locale included symbolic links, localedef
would throw the error "incomplete set of locale files" and exclude it
from the generated locale archive. This commit fixes that.
Co-authored-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The handling of mon_decimal_point is incorrect when it comes to
handling the empty "" value. The existing parser in monetary_read()
will correctly handle setting the non-wide-character value and the
wide-character value e.g. STR_ELEM_WC(mon_decimal_point) if they are
set in the locale definition. However, in monetary_finish() we have
conflicting TEST_ELEM() which sets a default value (if the locale
definition doesn't include one), and subsequent code which looks for
mon_decimal_point to be NULL to issue a specific error message and set
the defaults. The latter is unused because TEST_ELEM() always sets a
default. The simplest solution is to remove the TEST_ELEM() check,
and allow the existing check to look to see if mon_decimal_point is
NULL and set an appropriate default. The final fix is to move the
setting of mon_decimal_point_wc so it occurs only when
mon_decimal_point is being set to a default, keeping both values
consistent. There is no way to tell the difference between
mon_decimal_point_wc having been set to the empty string and not
having been defined at all, for that distinction we must use
mon_decimal_point being NULL or "", and so we must logically set
the default together with mon_decimal_point.
Lastly, there are more fixes similar to this that could be made to
ld-monetary.c, but we avoid that in order to fix just the code
required for mon_decimal_point, which impacts the ability for C.UTF-8
to set mon_decimal_point to "", since without this fix we end up with
an inconsistent setting of mon_decimal_point set to "", but
mon_decimal_point_wc set to "." which is incorrect.
Tested on x86_64 and i686 without regression.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
I've updated copyright dates in glibc for 2022. This is the patch for
the changes not generated by scripts/update-copyrights and subsequent
build / regeneration of generated files. As well as the usual annual
updates, mainly dates in --version output (minus csu/version.c which
previously had to be handled manually but is now successfully updated
by update-copyrights), there is a small change to the copyright notice
in NEWS which should let NEWS get updated automatically next year.
Please remember to include 2022 in the dates for any new files added
in future (which means updating any existing uncommitted patches you
have that add new files to use the new copyright dates in them).
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
localedef currently blindly trust the archive header. When passed an
archive file with the wrong endianess, this leads to a segmentation
fault:
$ localedef --big-endian --list-archive /usr/lib/locale/locale-archive
Segmentation fault (core dumped)
When passed non-archive files, asserts are reported on the best case,
but sometimes it can lead to a segmentation fault:
$ localedef --list-archive /bin/true
localedef: programs/locarchive.c:1643: show_archive_content: Assertion `used < GET (head->namehash_used)' failed.
Aborted (core dumped)
$ localedef --list-archive /usr/lib/locale/C.utf8/LC_COLLATE
Segmentation fault (core dumped)
This patch improves the user experience by looking at the magic value,
which is always written, but never checked. It should still be possible
to trigger a segmentation fault with crafted files, but this already
catch many cases.
Support a new directive 'codepoint_collation' in the LC_COLLATE
section of a locale source file. This new directive causes all
collation rules to be dropped and instead STRCMP (strcmp or
wcscmp) is used for collation of the input character set. This
is required to allow for a C.UTF-8 that contains zero collation
rules (minimal size) and sorts using code point sorting.
To date the only implementation of a locale with zero collation
rules is the C/POSIX locale. The C/POSIX locale provides
identity tables for _NL_COLLATE_COLLSEQMB and
_NL_COLLATE_COLLSEQWC that map to ASCII even though it has zero
rules. This has lead to existing fnmatch, regexec, and regcomp
implementations that require these tables. It is not correct
to use these tables when nrules == 0, but the conservative fix
is to provide these tables when nrules == 0. This assures that
existing static applications using a new C.UTF-8 locale with
'codepoint_collation' at least have functional range expressions
with ASCII e.g. [0-9] or [a-z]. Such static applications would
not have the fixes to fnmatch, regexec and regcomp that avoid
the use of the tables when nrules == 0. Future fixes to fnmatch,
regexec, and regcomp would allow range expressions to use the
full set of code points for such ranges.
Tested on x86_64 and i686 without regression.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
We stopped adding "Contributed by" or similar lines in sources in 2012
in favour of git logs and keeping the Contributors section of the
glibc manual up to date. Removing these lines makes the license
header a bit more consistent across files and also removes the
possibility of error in attribution when license blocks or files are
copied across since the contributed-by lines don't actually reflect
reality in those cases.
Move all "Contributed by" and similar lines (Written by, Test by,
etc.) into a new file CONTRIBUTED-BY to retain record of these
contributions. These contributors are also mentioned in
manual/contrib.texi, so we just maintain this additional record as a
courtesy to the earlier developers.
The following scripts were used to filter a list of files to edit in
place and to clean up the CONTRIBUTED-BY file respectively. These
were not added to the glibc sources because they're not expected to be
of any use in future given that this is a one time task:
https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dchttps://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
For the legacy ABI with supports 32-bit time_t it calls the 64-bit
time directly, since the LFS symbols calls the 64-bit time_t ones
internally.
Checked on i686-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Lukasz Majewski <lukma@denx.de>
Fix trivial leak identified by coverity. The program runs to exit and
the leak doesn't grow, but it's just cleaner to free the allocated
memory.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
During ellipsis processing the collation cursor was not correctly
moved to the end of the ellipsis after processing.
The code inserted the new entry after the cursor, but before the
real end of the ellipsis:
[cursor]
... element_t <-> element_t <-> element_t <-> element_t
"<U0000>" "<U0001>" "<U007F>"
startp endp
At the end of the function we have:
[cursor]
... element_t <-> element_t <-> element_t
"<U007E>" "<U007F>"
endp
The cursor should be pointing at endp, the last element in the
doubly-linked list, otherwise when execution returns to the
caller we will start inserting the next line after <U007E>.
Subsequent operations end up unlinking the ellipsis end entry or
just leaving it in the list dangling from the end. This kind of
dangling is immediately visible in C.UTF-8 with the following
sorting from strcoll:
<U0010FFFF>
<U0000FFFF>
<U000007FF>
<U0000007F>
With the cursor correctly adjusted the end entry is correctly given
the right location and thus the right weight.
Retested and no regressions on x86_64 and i686.
Co-authored-by: Carlos O'Donell <carlos@redhat.com>
These symbol usages are not definitions, so compat_symbol_reference is
more appropriate than compat_symbol. compat_symbol_reference is also
safe to emit multiple times (in case the inline assembly is
duplicated; this is possible because it is nested in a function).
compat_symbol does not necessarily have this property because it is
intended to provide a symbol definition.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
I've updated copyright dates in glibc for 2021. This is the patch for
the changes not generated by scripts/update-copyrights and subsequent
build / regeneration of generated files. As well as the usual annual
updates, mainly dates in --version output (minus csu/version.c which
previously had to be handled manually but is now successfully updated
by update-copyrights), there is a small change to the copyright notice
in NEWS which should let NEWS get updated automatically next year.
Please remember to include 2021 in the dates for any new files added
in future (which means updating any existing uncommitted patches you
have that add new files to use the new copyright dates in them).
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
Among the warnings causing a glibc build with GCC 11 to fail is one
for a call new_composite_name in setlocale.c. The newnames argument
is declared as an array with __LC_LAST elements, but when the category
argument is not LC_ALL, it actually only has one element. Since the
number of elements depends on the first argument to the function, it
seems clearer to declare the argument as a pointer.
Tested with build-many-glibcs.py for arm-linux-gnueabi, where this
allows the build to get further.
Reviewed-by: DJ Delorie <dj@redhat.com>
It replaces the internal usage of __{f,l}xstat{at}{64} with the
__{f,l}stat{at}{64}. It should not change the generate code since
sys/stat.h explicit defines redirections to internal calls back to
xstat* symbols.
Checked with a build for all affected ABIs. I also check on
x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Lukasz Majewski <lukma@denx.de>
ISO-8859-8-based locales will need this transliteration if the locale
files contain these Unicode characters.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
We automatically create $(complocaledir) in the testroot.root
now and so we don't need to create it in the test.
Tested on x86_64.
Reviewed-by: DJ Delorie <dj@redhat.com>
On the current AArch64 buildbot the default 20s timeout is not
enough to run this test, it seems
make test t=locale/tst-localedef-path-norm
takes about 25s, so i increased the timeout to 30s.
During testing of localedef running in a minimal container
there were several error cases which were hard to diagnose
since they appeared as strerror (errno) values printed by the
higher level functions. This change adds three new verbose
messages for potential failure paths. The new messages give
the user the opportunity to use -v and display additional
information about why localedef might be failing. I found
these messages useful myself while writing a localedef
container test for --no-hard-links.
Since the changes cleanup the code that handle codeset
normalization we add tst-localedef-path-norm which contains
many sub-tests to verify the correct expected normalization of
codeset strings both when installing to default paths (the
only time normalization is enabled) and installing to absolute
paths. During the refactoring I created at least one
buffer-overflow which valgrind caught, but these tests did not
catch because the exec in the container had a very clean heap
with zero-initialized memory. However, between valgrind and
the tests the results are clean.
The new tst-localedef-path-norm passes without regression on
x86_64.
Change-Id: I28b9f680711ff00252a2cb15625b774cc58ecb9d
The locale/tst-locale-locpath test unsets LANG, then runs a test with
test_wrapper_env and expects LANG to remain unset for that test. This
does not work for cross-testing with cross-test-ssh.sh when sshd (on
the system specified as an argument to cross-test-ssh.sh) is
configured to have a default LANG setting.
The general design used in cross testing, after commit
8540f6d2a7 ("Don't require test wrappers
to preserve environment variables, use more consistent environment.",
6 June 2014), is that environment settings required by tests should be
passed explicitly to $(test-wrapper-env). This patch changes
tst-locale-locpath.sh to pass an explicit LANG= rather than expecting
"unset LANG" to be in effect for the program run under
test_wrapper_env. Note that this does slightly change the environment
in which the test is run natively (empty LANG instead of unset LANG)
but that difference does not appear relevant to what it is trying to
test.
Tested for Arm that this fixes the failure seen for that test in
cross-testing.
I've updated copyright dates in glibc for 2020. This is the patch for
the changes not generated by scripts/update-copyrights and subsequent
build / regeneration of generated files. As well as the usual annual
updates, mainly dates in --version output (minus libc.texinfo which
previously had to be handled manually but is now successfully updated
by update-copyrights), there is a fix to
sysdeps/unix/sysv/linux/powerpc/bits/termios-c_lflag.h where a typo in
the copyright notice meant it failed to be updated automatically.
Please remember to include 2020 in the dates for any new files added
in future (which means updating any existing uncommitted patches you
have that add new files to use the new copyright dates in them).
_nl_load_locale_from_archive() checks for a zero size, but
divides by both (size) and (size-2). Extend the check to
guard against a size of two or less.
Tested by manually corrupting locale-archive and running a program
that calls setlocale() with LOCPATH unset (size is typically very
large).
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This propagates the recent http->https URL changes.
Since I used gperf 3.1 to regenerate, this is also a minor
internal-to-localedef API change.
URL problem reported by Joseph Myers in:
https://www.sourceware.org/ml/libc-alpha/2019-09/msg00143.html
* locale/programs/charmap-kw.h, locale/programs/locfile-kw.h:
Regenerate with gperf 3.1.
* locale/programs/linereader.h (kw_hash_fct_t):
* locale/programs/repertoire.c (repertoiremap_hash):
2nd arg is now size_t not unsigned, for compatibility with gperf 3.1.