scuffed-code/icu4c/source/common/utrie_swap.cpp
Markus Scherer fe3eb3ed5c
ICU-13530 add UCPTrie/CodePointTrie, switch normalization to use it (#48)
* ICU-13530 copy C/C++ files UTrie2 -> UTrie3

X-SVN-Rev: 40754

* ICU-13530 UTrie3 new files copied from UTrie2: rename types/functions/macros

X-SVN-Rev: 40755

* ICU-13530 debug-print building each UTrie2

X-SVN-Rev: 40756

* ICU-13530 remove two-byte-UTF-8 errorValue block; move highValue from end of data array into header; add errorValue to header

X-SVN-Rev: 40762

* ICU-13530 UTrie3 U16_NEXT/PREV: errorValue for unpaired surrogates

X-SVN-Rev: 40763

* ICU-13530 no more separate values for lead surrogate code units

X-SVN-Rev: 40764

* ICU-13530 change from 11:5 trie bits to 10:6 for simpler UTF-8 code

X-SVN-Rev: 40766

* ICU-13530 UTrie2 build UTrie3 as well, print sizes

X-SVN-Rev: 40767

* ICU-13530 debug-print countSame, sumOverlaps, countInitial

X-SVN-Rev: 40768

* ICU-13530 debug-print whether trie is for CanonIterData

X-SVN-Rev: 40769

* ICU-13530 no index-shift for BMP data, no separate index-2 for 2-byte UTF-8; builder changes incomplete

X-SVN-Rev: 40777

* ICU-13530 remove errorValue and highStart from UNewTrie3

X-SVN-Rev: 40778

* ICU-13530 rewrite UTrie3 builder code

X-SVN-Rev: 40783

* ICU-13530 UTrie3 bug fixes

X-SVN-Rev: 40788

* ICU-13530 fully re-inline _UTRIE3_U8_NEXT()

X-SVN-Rev: 40790

* ICU-13530 find most common all-same data block for dataNullBlock and initialValue

X-SVN-Rev: 40792

* ICU-13530 UTrie3 iterator functions take start and return the end of a range, rather than callback call for each range

X-SVN-Rev: 40800

* ICU-13530 mask off unused data value bits before building a UTrie3 with values less than 32 bits wide

X-SVN-Rev: 40803

* ICU-13530 split utrie3builder.h out of utrie3.h

X-SVN-Rev: 40804

* ICU-13530 separate types UTrie3 vs. UTrie3Builder, implement builder as wrapper over C++ class Trie3Builder in .cpp

X-SVN-Rev: 40809

* ICU-13530 function to make a UTrie3Builder from a UTrie3

X-SVN-Rev: 40810

* ICU-13530 debug-print some data; some cleanup

X-SVN-Rev: 40865

* ICU-13530 BMP 10:6 but supplementary 10:6:4

X-SVN-Rev: 40984

* ICU-13530 move errorValue & highValue to the end of the data table, minimal padding to 4 bytes

X-SVN-Rev: 41011

* ICU-13530 index-1 table gap of index-2 null blocks

X-SVN-Rev: 41018

* ICU-13530 test with more than 128k compacted data

X-SVN-Rev: 41034

* ICU-13530 supplementary bits 11:5:4 saves a little space

X-SVN-Rev: 41039

* ICU-13530 supplementary bits 6:5:5:4 instead of gap: about same size but simpler

X-SVN-Rev: 41050

* ICU-13530 remove unnecessary utrie3_clone(built trie)

X-SVN-Rev: 41058

* ICU-13530 remove unnecessary UTrie3StringIterator

X-SVN-Rev: 41059

* ICU-13530 back to UTRIE3_GET...() macros *returning* data values

X-SVN-Rev: 41060

* ICU-13530 fast vs. small

X-SVN-Rev: 41066

* ICU-13530 always load NFC data, add simple normalization performance test

X-SVN-Rev: 41110

* ICU-13530 change normalization main trie to UTrie3 with special values for lead surrogates; forbid non-inert surrogate code *points* because unable to store values different from code *units*; runtime code work around that for code point lookup and iteration; adjust UTS 46 for normalization no longer mapping unpaired surrogates to U+FFFD

X-SVN-Rev: 41122

* ICU-13530 simplenormperf bug fix and NFC base line

X-SVN-Rev: 41126

* ICU-13530 move normalization getRange skipping lead surrogates to API getRangeSkipLead()

X-SVN-Rev: 41182

* ICU-13530 switch CanonIterData and gennorm2 Norms to UTrie3

X-SVN-Rev: 41183

* ICU-13530 remove unused overwrite parameter from setRange()

X-SVN-Rev: 41184

* ICU-13530 getRange skip lead -> fixed surrogates

X-SVN-Rev: 41219

* ICU-13530 minor cleanup

X-SVN-Rev: 41221

* ICU-13530 UTS 46 code map unpaired surrogates to U+FFFD before normalization

X-SVN-Rev: 41224

* ICU-13530 minor internal-docs cleanup

X-SVN-Rev: 41225

* ICU-13530 rename UTrie3 to UCPTrie, and other name changes

X-SVN-Rev: 41226

* ICU-13530 add 8-bit data option; add type-any & valueBits-any for fromBinary(); macros consistently source type then data width

X-SVN-Rev: 41234

* ICU-13530 scrub the API docs for the proposal

X-SVN-Rev: 41319

* ICU-13530 tag internal definitions as such, or move them to an internal header

X-SVN-Rev: 41320

* ICU-13530 Java API skeleton

X-SVN-Rev: 41326

* ICU-13530 API feedback: ValueWidth, MutableCodePointTrie, base CodePointMap, ...

X-SVN-Rev: 41382

* ICU-13530 add UCPTrie valueWidth field and padding, and combine data pointers into a union

X-SVN-Rev: 41408

* ICU-13530 switch some macros to using dataAccess parameter: separate index vs. data lookups, no macro variant for each value width

X-SVN-Rev: 41409

* ICU-13530 StringIterator is no longer a java.util.Iterator (bad fit)

X-SVN-Rev: 41455

* ICU-13530 CodePointTrie.java code complete

X-SVN-Rev: 41518

* ICU-13530 finish Java port incl test; keep C++ parallel

* ICU-13530 adjust API for feedback: rename HandleValue to FilterValue, change getRange+getRangeFixedSurr(bool allSurr) to enum RangeOption+getRange(enum option); change remaining C macros to use dataAccess for 16/32/8-bit value widths; fix/clarify some API docs

* ICU-13530 add javadoc

* ICU-13530 document UCPTrie binary data format

* ICU-13530 update .nrm formatVersion 3->4, document change in surrogate handling with new trie

* ICU-13530 re-hardcode NFC data

* move trie swapper code into new file; add new files to Windows project files; turn off trie debugging

* ICU-13530 minor cleanup

* ICU-13530 test more range starts; fix a C test leak

* ICU-13530 regenerate Java data from scratch

* ICU-13530 review feedback changes: API docs typos, more @internal, C++11 field initializers, fix potential leak in MutableCodePointTrie::fromUCPTrie()

* ICU-13530 rename interface FilterValue to ValueFilter
2018-09-27 14:27:38 -07:00

// © 2018 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// utrie_swap.cpp
// created: 2018aug08 Markus W. Scherer
#include "unicode/utypes.h"
#include "cmemory.h"
#include "ucptrie_impl.h"
#include "udataswp.h"
#include "utrie.h"
#include "utrie2_impl.h"
// These functions for swapping different generations of ICU code point tries are here
// so that their implementation files need not depend on swapper code,
// need not depend on each other, and so that other swapper code
// need not depend on other trie code.
namespace {
constexpr int32_t ASCII_LIMIT = 0x80;
} // namespace
U_CAPI int32_t U_EXPORT2
utrie_swap(const UDataSwapper *ds,
           const void *inData, int32_t length, void *outData,
           UErrorCode *pErrorCode) {
    const UTrieHeader *inTrie;
    UTrieHeader trie;
    int32_t size;
    UBool dataIs32;

    if(pErrorCode==NULL || U_FAILURE(*pErrorCode)) {
        return 0;
    }
    if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
        *pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
        return 0;
    }

    /* setup and swapping */
    if(length>=0 && (uint32_t)length<sizeof(UTrieHeader)) {
        *pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
        return 0;
    }

    inTrie=(const UTrieHeader *)inData;
    trie.signature=ds->readUInt32(inTrie->signature);
    trie.options=ds->readUInt32(inTrie->options);
    trie.indexLength=udata_readInt32(ds, inTrie->indexLength);
    trie.dataLength=udata_readInt32(ds, inTrie->dataLength);

    if( trie.signature!=0x54726965 ||
        (trie.options&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_SHIFT ||
        ((trie.options>>UTRIE_OPTIONS_INDEX_SHIFT)&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_INDEX_SHIFT ||
        trie.indexLength<UTRIE_BMP_INDEX_LENGTH ||
        (trie.indexLength&(UTRIE_SURROGATE_BLOCK_COUNT-1))!=0 ||
        trie.dataLength<UTRIE_DATA_BLOCK_LENGTH ||
        (trie.dataLength&(UTRIE_DATA_GRANULARITY-1))!=0 ||
        ((trie.options&UTRIE_OPTIONS_LATIN1_IS_LINEAR)!=0 && trie.dataLength<(UTRIE_DATA_BLOCK_LENGTH+0x100))
    ) {
        *pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
        return 0;
    }

    dataIs32=(UBool)((trie.options&UTRIE_OPTIONS_DATA_IS_32_BIT)!=0);
    size=sizeof(UTrieHeader)+trie.indexLength*2+trie.dataLength*(dataIs32?4:2);

    if(length>=0) {
        UTrieHeader *outTrie;

        if(length<size) {
            *pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
            return 0;
        }

        outTrie=(UTrieHeader *)outData;

        /* swap the header */
        ds->swapArray32(ds, inTrie, sizeof(UTrieHeader), outTrie, pErrorCode);

        /* swap the index and the data */
        if(dataIs32) {
            ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
            ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, trie.dataLength*4,
                                     (uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
        } else {
            ds->swapArray16(ds, inTrie+1, (trie.indexLength+trie.dataLength)*2, outTrie+1, pErrorCode);
        }
    }

    return size;
}
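
// Illustration only, not part of ICU: a minimal standalone sketch of the size
// computation and the preflight convention used by utrie_swap() above, where a
// negative length means "only report the required size". ToyHeader,
// kDataIs32Bit, serializedSize() and preflightOrCheck() are hypothetical names,
// and the value 0x100 for UTRIE_OPTIONS_DATA_IS_32_BIT is an assumption about
// utrie.h rather than something shown in this file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for UTrieHeader: four 32-bit fields, 16 bytes in total.
struct ToyHeader {
    uint32_t signature;
    uint32_t options;
    int32_t indexLength;  // number of 16-bit index entries
    int32_t dataLength;   // number of 16-bit or 32-bit data entries
};
static_assert(sizeof(ToyHeader) == 16, "header is expected to be 16 bytes");

// Assumed value of UTRIE_OPTIONS_DATA_IS_32_BIT.
constexpr uint32_t kDataIs32Bit = 0x100;

// Header + 16-bit index + data, where the data width depends on the options bit.
constexpr int32_t serializedSize(const ToyHeader &h) {
    return static_cast<int32_t>(sizeof(ToyHeader)) + h.indexLength * 2 +
           h.dataLength * ((h.options & kDataIs32Bit) != 0 ? 4 : 2);
}

// Mirrors the calling convention only: a negative length preflights,
// a non-negative length must cover the whole serialized form.
// Returns the required size, or -1 where the real code sets an error code.
inline int32_t preflightOrCheck(const ToyHeader &h, int32_t length) {
    const int32_t size = serializedSize(h);
    if (length >= 0 && length < size) {
        return -1;
    }
    return size;
}
```

// With 2048 index entries and 512 32-bit data entries this gives
// 16 + 4096 + 2048 = 6160 bytes, whether the caller preflights or swaps.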
U_CAPI int32_t U_EXPORT2
utrie2_swap(const UDataSwapper *ds,
            const void *inData, int32_t length, void *outData,
            UErrorCode *pErrorCode) {
    const UTrie2Header *inTrie;
    UTrie2Header trie;
    int32_t dataLength, size;
    UTrie2ValueBits valueBits;

    if(U_FAILURE(*pErrorCode)) {
        return 0;
    }
    if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
        *pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
        return 0;
    }

    /* setup and swapping */
    if(length>=0 && length<(int32_t)sizeof(UTrie2Header)) {
        *pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
        return 0;
    }

    inTrie=(const UTrie2Header *)inData;
    trie.signature=ds->readUInt32(inTrie->signature);
    trie.options=ds->readUInt16(inTrie->options);
    trie.indexLength=ds->readUInt16(inTrie->indexLength);
    trie.shiftedDataLength=ds->readUInt16(inTrie->shiftedDataLength);

    valueBits=(UTrie2ValueBits)(trie.options&UTRIE2_OPTIONS_VALUE_BITS_MASK);
    dataLength=(int32_t)trie.shiftedDataLength<<UTRIE2_INDEX_SHIFT;

    if( trie.signature!=UTRIE2_SIG ||
        valueBits<0 || UTRIE2_COUNT_VALUE_BITS<=valueBits ||
        trie.indexLength<UTRIE2_INDEX_1_OFFSET ||
        dataLength<UTRIE2_DATA_START_OFFSET
    ) {
        *pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie2 */
        return 0;
    }

    size=sizeof(UTrie2Header)+trie.indexLength*2;
    switch(valueBits) {
    case UTRIE2_16_VALUE_BITS:
        size+=dataLength*2;
        break;
    case UTRIE2_32_VALUE_BITS:
        size+=dataLength*4;
        break;
    default:
        *pErrorCode=U_INVALID_FORMAT_ERROR;
        return 0;
    }

    if(length>=0) {
        UTrie2Header *outTrie;

        if(length<size) {
            *pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
            return 0;
        }

        outTrie=(UTrie2Header *)outData;

        /* swap the header: one 32-bit signature, then six 16-bit fields (12 bytes) */
        ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
        ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);

        /* swap the index and the data */
        switch(valueBits) {
        case UTRIE2_16_VALUE_BITS:
            ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
            break;
        case UTRIE2_32_VALUE_BITS:
            ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
            ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
                                     (uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
            break;
        default:
            *pErrorCode=U_INVALID_FORMAT_ERROR;
            return 0;
        }
    }

    return size;
}
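
// Illustration only, not part of ICU: a standalone sketch of why the UTrie2
// header above is swapped in two steps. The 16-byte header mixes field widths,
// so the signature is reversed as one 32-bit unit and the remaining 12 bytes
// as six 16-bit units. swapUnits() and swapHeader16() are hypothetical
// stand-ins for ds->swapArray32() and ds->swapArray16(), which are not
// reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Reverses the bytes of each `unit`-byte group in bytes[0..length).
inline void swapUnits(uint8_t *bytes, size_t length, size_t unit) {
    for (size_t i = 0; i + unit <= length; i += unit) {
        for (size_t lo = 0, hi = unit - 1; lo < hi; ++lo, --hi) {
            std::swap(bytes[i + lo], bytes[i + hi]);
        }
    }
}

// Swaps a 16-byte header laid out as one uint32_t followed by six uint16_t.
inline void swapHeader16(uint8_t *header) {
    swapUnits(header, 4, 4);       // signature
    swapUnits(header + 4, 12, 2);  // options, lengths, offsets
}
```

// Applying swapHeader16() twice restores the original bytes, which is what
// makes the same routine usable in both directions.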
U_CAPI int32_t U_EXPORT2
ucptrie_swap(const UDataSwapper *ds,
             const void *inData, int32_t length, void *outData,
             UErrorCode *pErrorCode) {
    const UCPTrieHeader *inTrie;
    UCPTrieHeader trie;
    int32_t dataLength, size;
    UCPTrieValueWidth valueWidth;

    if(U_FAILURE(*pErrorCode)) {
        return 0;
    }
    if(ds==nullptr || inData==nullptr || (length>=0 && outData==nullptr)) {
        *pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
        return 0;
    }

    /* setup and swapping */
    if(length>=0 && length<(int32_t)sizeof(UCPTrieHeader)) {
        *pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
        return 0;
    }

    inTrie=(const UCPTrieHeader *)inData;
    trie.signature=ds->readUInt32(inTrie->signature);
    trie.options=ds->readUInt16(inTrie->options);
    trie.indexLength=ds->readUInt16(inTrie->indexLength);
    trie.dataLength = ds->readUInt16(inTrie->dataLength);

    UCPTrieType type = (UCPTrieType)((trie.options >> 6) & 3);
    valueWidth = (UCPTrieValueWidth)(trie.options & UCPTRIE_OPTIONS_VALUE_BITS_MASK);
    dataLength = ((int32_t)(trie.options & UCPTRIE_OPTIONS_DATA_LENGTH_MASK) << 4) | trie.dataLength;

    int32_t minIndexLength = type == UCPTRIE_TYPE_FAST ?
        UCPTRIE_BMP_INDEX_LENGTH : UCPTRIE_SMALL_INDEX_LENGTH;
    if( trie.signature!=UCPTRIE_SIG ||
        type > UCPTRIE_TYPE_SMALL ||
        (trie.options & UCPTRIE_OPTIONS_RESERVED_MASK) != 0 ||
        valueWidth > UCPTRIE_VALUE_BITS_8 ||
        trie.indexLength < minIndexLength ||
        dataLength < ASCII_LIMIT
    ) {
        *pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UCPTrie */
        return 0;
    }

    size=sizeof(UCPTrieHeader)+trie.indexLength*2;
    switch(valueWidth) {
    case UCPTRIE_VALUE_BITS_16:
        size+=dataLength*2;
        break;
    case UCPTRIE_VALUE_BITS_32:
        size+=dataLength*4;
        break;
    case UCPTRIE_VALUE_BITS_8:
        size+=dataLength;
        break;
    default:
        *pErrorCode=U_INVALID_FORMAT_ERROR;
        return 0;
    }

    if(length>=0) {
        UCPTrieHeader *outTrie;

        if(length<size) {
            *pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
            return 0;
        }

        outTrie=(UCPTrieHeader *)outData;

        /* swap the header: one 32-bit signature, then six 16-bit fields (12 bytes) */
        ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
        ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);

        /* swap the index and the data */
        switch(valueWidth) {
        case UCPTRIE_VALUE_BITS_16:
            ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
            break;
        case UCPTRIE_VALUE_BITS_32:
            ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
            ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
                                     (uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
            break;
        case UCPTRIE_VALUE_BITS_8:
            ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
            /*
             * 8-bit data needs no byte swapping; copy it unless swapping in place.
             * The data follows the 16-bit index, so the offset must be counted in
             * uint16_t units, not in header-sized units.
             */
            if(inTrie!=outTrie) {
                uprv_memmove((uint16_t *)(outTrie+1)+trie.indexLength,
                             (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength);
            }
            break;
        default:
            *pErrorCode=U_INVALID_FORMAT_ERROR;
            return 0;
        }
    }

    return size;
}
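
// Illustration only, not part of ICU: a standalone sketch of how the 16-bit
// options field is unpacked in ucptrie_swap() above, and of where the data
// array starts relative to the header. The mask values (7, 0x38, 0xf000), the
// 16-byte header size and the names DecodedOptions, decodeOptions() and
// dataByteOffset() are assumptions made for this example; the authoritative
// definitions live in ucptrie_impl.h, which is not shown here.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t kValueBitsMask = 7;        // assumed UCPTRIE_OPTIONS_VALUE_BITS_MASK
constexpr uint16_t kReservedMask = 0x38;      // assumed UCPTRIE_OPTIONS_RESERVED_MASK
constexpr uint16_t kDataLengthMask = 0xf000;  // assumed UCPTRIE_OPTIONS_DATA_LENGTH_MASK
constexpr int32_t kHeaderSize = 16;           // assumed sizeof(UCPTrieHeader)

struct DecodedOptions {
    int type;            // bits 7..6 of options
    int valueWidth;      // bits 2..0 of options
    int32_t dataLength;  // bits 15..12 of options become bits 19..16 of the length
    bool reservedClear;  // true if bits 5..3 are all zero
};

// Unpacks the options field the same way as the function above does.
constexpr DecodedOptions decodeOptions(uint16_t options, uint16_t dataLength16) {
    return DecodedOptions{
        (options >> 6) & 3,
        options & kValueBitsMask,
        (static_cast<int32_t>(options & kDataLengthMask) << 4) | dataLength16,
        (options & kReservedMask) == 0};
}

// Byte offset of the data array: header, then indexLength 16-bit index entries.
constexpr int32_t dataByteOffset(uint16_t indexLength) {
    return kHeaderSize + indexLength * 2;
}
```

// For options 0x3042 and a 16-bit length field of 0x1234 this yields type 1,
// value width 2 and a 20-bit data length of 0x31234.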
namespace {

/**
 * Gets the trie version from 32-bit-aligned memory containing the serialized form
 * of a UTrie (version 1), a UTrie2 (version 2), or a UCPTrie (version 3).
 *
 * @param data a pointer to 32-bit-aligned memory containing the serialized form of a trie
 * @param length the number of bytes available at data;
 *               can be more than necessary (see return value)
 * @param anyEndianOk If FALSE, only platform-endian serialized forms are recognized.
 *                    If TRUE, opposite-endian serialized forms are recognized as well.
 * @return the trie version of the serialized form, or 0 if it is not
 *         recognized as a serialized trie
 */
int32_t
getVersion(const void *data, int32_t length, UBool anyEndianOk) {
    uint32_t signature;
    if(length<16 || data==nullptr || (U_POINTER_MASK_LSB(data, 3)!=0)) {
        return 0;
    }
    signature=*(const uint32_t *)data;
    if(signature==UCPTRIE_SIG) {
        return 3;
    }
    if(anyEndianOk && signature==UCPTRIE_OE_SIG) {
        return 3;
    }
    if(signature==UTRIE2_SIG) {
        return 2;
    }
    if(anyEndianOk && signature==UTRIE2_OE_SIG) {
        return 2;
    }
    if(signature==UTRIE_SIG) {
        return 1;
    }
    if(anyEndianOk && signature==UTRIE_OE_SIG) {
        return 1;
    }
    return 0;
}

}  // namespace
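
// Illustration only, not part of ICU: a standalone sketch of the signature
// check in getVersion() above. An opposite-endian signature is simply the
// platform-endian one with its four bytes reversed. The three signature
// values are assumptions ("Trie" matches the literal 0x54726965 used in
// utrie_swap(); "Tri2" and "Tri3" are assumed values of UTRIE2_SIG and
// UCPTRIE_SIG), and byteSwap32() and versionFromSignature() are
// hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Reverses the four bytes of a 32-bit value.
constexpr uint32_t byteSwap32(uint32_t x) {
    return (x << 24) | ((x & 0xff00u) << 8) | ((x >> 8) & 0xff00u) | (x >> 24);
}

constexpr uint32_t kTrieSig = 0x54726965;    // "Trie"
constexpr uint32_t kTrie2Sig = 0x54726932;   // "Tri2" (assumed UTRIE2_SIG)
constexpr uint32_t kCPTrieSig = 0x54726933;  // "Tri3" (assumed UCPTRIE_SIG)

static_assert(byteSwap32(kTrieSig) == 0x65697254, "opposite-endian form of \"Trie\"");

// Returns 3, 2 or 1 for a recognized signature, or 0 otherwise.
inline int versionFromSignature(uint32_t signature, bool anyEndianOk) {
    if (signature == kCPTrieSig || (anyEndianOk && signature == byteSwap32(kCPTrieSig))) {
        return 3;
    }
    if (signature == kTrie2Sig || (anyEndianOk && signature == byteSwap32(kTrie2Sig))) {
        return 2;
    }
    if (signature == kTrieSig || (anyEndianOk && signature == byteSwap32(kTrieSig))) {
        return 1;
    }
    return 0;
}
```

// The sketch leaves out the length and alignment checks that getVersion()
// performs before it reads the signature.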
U_CAPI int32_t U_EXPORT2
utrie_swapAnyVersion(const UDataSwapper *ds,
                     const void *inData, int32_t length, void *outData,
                     UErrorCode *pErrorCode) {
    if(U_FAILURE(*pErrorCode)) { return 0; }
    switch(getVersion(inData, length, TRUE)) {
    case 1:
        return utrie_swap(ds, inData, length, outData, pErrorCode);
    case 2:
        return utrie2_swap(ds, inData, length, outData, pErrorCode);
    case 3:
        return ucptrie_swap(ds, inData, length, outData, pErrorCode);
    default:
        *pErrorCode=U_INVALID_FORMAT_ERROR;
        return 0;
    }
}