Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.
The old implementation had fallen behind the Unicode UAX-29 specification for
graphem clusters, and could not be easily updated.
The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.
Also note that this change adds a new dependency on Break Iteration. Regex
patterns that previously would work with ICU builds that were configured with
no break iteration will now fail. But only if they include \X for matching
grapheme cluster boundaries.
This change builds on Vincent Torri's changes.
This installs the ICU DLL files in $prefix/bin instead of $prefix/lib.
Note: In order to disable this change in behavior you can edit
the "mh-mingw*" file(s). If you set the variable MINGW_MOVEDLLSTOBINDIR
to NO instead of YES, then it will retain the previous behavior of
installing the DLLs into the bin folder.
Adds `ICU_TIMEZONE_FILES_DIR_PREFIX_ENV_VAR`, similar to
`ICU_DATA_DIR_PREFIX_ENV_VAR`, that specifies an environment variable
to retrieve and prepend to the ICU time zone data file path.
In regular expressions, when testing for word boundaries with \b, the
boundaries were incorrect when in Unicode mode, meaning that an ICU word break
iterator is being used to find the boundaries, and the text being matched is
UTF-8 encoded.
The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
Compiled regular expression patterns make use of several shared common
UnicodeSets. This change simplifies the creation and use of these
static UnicodeSets.
- Pointer fields to the static sets are removed from the compiled patterns,
and the static variables are accessed directly. The deleted pointers
were a hold-over from earlier code that did not use shared statics.
- The UnicodeSet pattern literals are changed from hex constants to
u"string literals".
- The size of fRuleSets (from regexst.h) is changed from a hard-coded 10
to the number of UnicodeSets actually required. Doing this required
a change to regexcst.pl to export the required size. Changing and
rerunning this perl code resulted in massive but benign changes to
the generated file regexcst.h, the result of perl having changed its
order of enumeration of hashes since the file was last regenerated.
- UnicodeSets are frozen when possible. Should result in faster matching.
The definition of max_align_t is not guaranteed to be available unless
the appropriate header is included. Since use of <stddef.h> from C++ is
deprecated, that's <cstddef>, and max_align_t is thus defined under the
std namespace rather than in the global namespace.
DateTimePatternGenerator needs to consider the hour-cycle preferred by
Locale. This means that we need to to override the hour-cycle when a
locale contains "hc" keyword. This patch is adding such functionality.
In addition, "DateTimePatternGenerator::adjustFieldTypes" should adjust
hour field to properly follow tr35
spec(https://www.unicode.org/reports/tr35/tr35-dates.html#dfst-hour).
This would cause failures during cross compilation cases such as:
make[6]: Leaving directory '/spksrc/spk/bazarr/work-qoriq-6.1/icu/source/data'
make[5]: *** No rule to make target 'out', needed by 'out/icudt64b.dat'. Stop.
- equivalent but originally unequal
- locale distance shifted left for additional fraction bits with micro distance
- Java more verbose matcher debug output
See #949
- Merge the look-ahead results slots used when multiple rules share a common accepting state.
- Sequentially number the look-ahead result slot. Will eventually allow replacing the runtime map with an array.
- Inhibit chaining out of look-ahead rules. This could never actually happen; when a hard break
rule matches, the engine is stopped immediately, but the state table was being constructed
as if it could happen. Reduces table size for line break rules.
- Remove incorrect handling of fAccepting and fLookAhead fields of a state table row
when removing duplicate states. Look-ahead slot number was being mis-interpreted as a state number.
It's not available with some libc implementations. Specifically,
BIONIC and uClibc-ng. uprv_ variants are available.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
When C code includes the ICU headers, the UChar type is defined to be
uint16_t. But when C++ code includes the headers, UChar is char16_t
even when U_SHOW_CPLUSPLUS_API has been set to 0. Apart from arguably
being an inconsistency in the API, this means that C++98 or C++03 code
can't use the C API even though C99 code can.
So, change unicode/umachine.h to check not just whether __cplusplus is
defined but the value of U_CPLUSPLUS_VERSION when deciding how to
typedef UChar.
See PR #915, see changes.txt
- Unicode 13 beta data as of 2019-nov-21
- uprops.icu format version 7.7 with more bits for Script/Script_Extensions
- more bits in spoof checker ScriptSet
- root line break rules adjusted for UAX 14 changes, from Andy
- line break tailorings not yet in sync with root
The purpose of the FIELD_NAME_STR() macro is to create a string literal
by using the # preprocessing operator and then skip the first 5 chars of
this string by using the +5 pointer arithmetic. This is all intentional,
but if the parentheses are misplaced the compiler might think that this
is a mistake, a failed string concatenation (-Wstring-plus-int).
- includes new APIChangeReport.md
- the 'simplification' section is no longer new
- uses tools from [ICU-20162] #807
Includes non-promotion fixes:
- utrace.h: UTRACE_UDATA_BUNDLE
The test fails for "America/Argentina/ComodRivadavia", which has length
of 32, the same as the buffer. The results in the string being not
terminated with a NUL.
This increases the buffer size used by the test, and also checks for the
string not terminated warning.
This enables "classic" desktop builds of ICU4C for both ARM (32-bit)
and ARM64 (64-bit) on Windows.
All but the two samples "cal" and "date" in the "allinone" project now
have ARM and ARM64 project configurations, and build for Windows Desktop
ARM/ARM64.
Note: In order to build the ARM/ARM64 data DLL, you need to first build
x64/Release, as the ARM/ARM build uses the x64 bits in order to be able
to cross-compile for ARM/ARM64. This allows for completely building
ARM/ARM64 binaries using only x64 hardware.
The ARM/ARM64 builds require using a newer version of the Windows SDK
than 8.1, so they have a separate WindowsTargetPlatformVersion which
uses Windows 10 SDK version 10.0.16299.0 (aka RS3), which is the first
version of the Windows SDK to support building ARM64 desktop applications.
In addition this also greatly cleans-up the ICU4C ".vcxproj" files, in
order to remove redundant parts, fix inconsistencies, and make them more
readable. This introduces two new variables in the shared `*.props`
files: `IcuBinOutputDir` and `IcuLibOutputDir` in order to further
reduce the amount of duplicated lines in the individual ".vcxproj"
files themselves.
In the case several internal builtins defined by clang were not available,
ICU was replacing them by a stub implementation. But that was breaking
detection of availability of same methods in other parts of Chromium
(specifically in base/location.h).
Instead of that, this change creates ICU specific macros that will
map to those builtins when available, or to stub implementation if
not.
I.e. for the case of __has_builtin, previous implementation was
defining __has_builtin(x) as 0 in case it was not declared. With
new implementation, there is a macro UPRV_HAS_BUILTIN that maps
to __has_builtin if available, or 0 if not.
Adds `U_ASSERT` check before using `m->lock()` to make allocation issues
more apparent at least in debug builds.
There is probably quite a few places like this, but let's try fixing
broken windows.
This is the normal way of making compound macros behave like statements
and allowing macro invocations that look like function calls to work
just like if they were function calls.
Before ICU 65, function-like, multi-statement ICU macros were just
defined as series of statements wrapped in { } blocks and the caller
could choose to either treat them as if they were actual functions and
end the invocation with a trailing ; creating an empty statement after
the block or else omit this trailing ; using the knowledge that the
macro would expand to { }.
To continue supporting code that depends on that old behaviour, on being
able to omit the trailing ; when invoking one of these macros, the new
surrounding do { } while is done through preprocessor defines that can
easily be overridden at compile time, when needed.
Using the C++ SFINAE (substitution failure is not an error) technique,
it's possible to provide an icu::StringPiece constructor that accepts
any string view type that might be available in the current compilation
unit, without adding any additional dependencies or configuration flags
to ICU.
This also changes the pkgdata tool command line options to decouple the
ARM/ARM64 flags from the UWP flag, and adds a generic architecture
option (-a) for passing in the value of the linker "/MACHINE:" argument
on the command line. (Ex: -a ARM64).
Additionally this change also adds another option to pkgdata (-b) to
set /DYNAMICBASE on the data DLL, which is required for the ARM64 builds.
Both of these changes also work towards future work in order to enable
full ARM64 desktop builds without UWP.
This change also removes a number of ifdefs/forks in the ICU code based
on U_PLATFORM_HAS_WINUWP_API, and changes them to use ICU_DATA_DIR_WINDOWS
instead. This is needed to unblock the usage of the data DLL in the
UWP scenario, but also helps to further reduce the divergence of the UWP
projects from regular Windows builds.
Related tickets:
ICU4C: Remove fixed DLL base addresses when building Windows DLLs
https://unicode-org.atlassian.net/browse/ICU-20768
Add support to generate ICU data DLL for Windows arm64
https://unicode-org.atlassian.net/browse/ICU-20670
The way these macros currently are defined, the code compiles also
without the final semicolons. But for consistency and in order to allow
the macro definitions to be updated also these macro invocations should
have final semicolons just like in the rest of this code base.
This adds a separate CI build that enables -Werror for clang.
This also fixes all of the -Wall -Wextra warnings in the tests, and all the
-Wextra-semi warnings as well.
This change adds support for building the ICU4C UWP projects for ARM64.
Additionally it adds CI builds for ARM32 and ARM64.
Co-Authored-By: Daniel Ju <daju@microsoft.com>
The Azure DevOps VM images that are used in the CI builds recently
updated the version of MSYS that is installed, which is causing the
MSYS builds to fail. It seems that this new version has issues with
using the shell cmd method in the makefiles to obtain the current path.
However, we can avoid this entirely by using the cygpath utility, which
also avoids the overhead of launching a new cmd instance just to get
the path. Unfortunately, it seems that older versions of MSYS don't
have this utility program included by default, though versions of
MSYS 2 and up do have it included by default.
We can continue using the old approach for the older versions and use
the cygpath approach for new versions.
RulebasedCollator().
ICU-20652 Adds test/fuzzer/Makefile (auto-generated upon ICU4C configuration)
to .gitignore.
ICU-20652 In response to PR#693 review, corrects allocation size of char16_t
buffer.
While at it, adds generated files to .gitignore.
targets to test/fuzzer/ directory. This will enable compilation and
smoke test of fuzzer targets as part of the ICU continuous build.
ICU-20652 Fixed exit-on-error behaviour of fuzzer targets execution.
Minor clean-ups and improvements
ICU-20652 Modifies fuzzer/Makefile.in to fix Windows build issue.
ICU-20627 Adds explicit enablement of fuzzer targets build to ICU4C
configuration and Makefile.in. File 'configure' was created from
'configure.ac' by executing 'autoreconf'.
autoreconf added some new entries into 'configure' about runstatedir. Not sure
why it did this, they are not related to fuzzer.
ICU now uses namespaces. When trying to use the scrptrun.h header, the
header cannot compile stand-alone as it is inheriting from UObject which
is namespaced now. Add namespace macros to properly inherit.
ICU-20566 Add scrptrun to the extra high level Makefile
The scrptrun project was not being added to the extra portion of the ICU
build. Add it as a sub project so it can always be built.
ICU-20566 Properly namespace the ScriptRun test
The ScriptRun class is now namespaced. Update the test reference.
ICU-20566 Use int instead of void to compile
Clang will complain about having a void main without an int main to
invoke. Change void to int and return 0.