Correct interpretation of utf-8 0xf8-0xff
In consuming this useful string utility, it was discovered that the interpretation of leading byte codes 0xf8-0xff conformed to neither the RFC 3629 nor the ISO/IEC 10646 definition of utf-8.

The IETF RFC describes only 1-4 byte encodings (and a limited number of 4-byte encodings at that), and plainly states in Section 1, Introduction:

  o The octet values C0, C1, F5 to FF never appear.

Alternatively, the ISO definition "R.2 Specification of UTF-8" presented in the original IETF RFC 2279 clearly defines the meaning of leading byte values F5 through FD, and RFC 3629 Section 10 (Security Considerations), paragraph 3, calls out this alternate reading (alternative to "never appear"). F5-F7 begin a 4-byte UTF-8 sequence (as F0-F4 do) that is invalid in the domain of Unicode code points, while F8-FB begin a 5-byte sequence and FC-FD begin a 6-byte sequence.

The current code is wrong under either reading: it neither treats the codes F8-FF as invalid 1-byte characters, nor assigns the codes F8-FD the correct number of bytes. No valid parser will land 4 bytes forward of these lead bytes. Most will consume them as the lead of a 5- or 6-byte sequence and may then reject the resulting character as invalid, while some parsers may reject every leading F5-FF byte as a single byte of erroneous input, followed by each invalid continuation byte.

We propose the conventional reading of F8-FD as 5- and 6-byte sequences as originally defined, while FE-FF must be read as single-byte invalid code points.

Signed-off-by: William A Rowe Jr <wrowe@pivotal.io>
Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>
parent 6d087c256c
commit 961c0e6b86
@@ -2292,7 +2292,7 @@ static const unsigned char kUTF8LenTbl[256] = {
   1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,
   1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,
   2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,
-  3,3,3,3,3,3,3,3, 3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4, 4,4,4,4,4,4,4,4
+  3,3,3,3,3,3,3,3, 3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4, 5,5,5,5,6,6,1,1
 };
 
 // Return length of a single UTF-8 source character
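For reviewers who want to check the new table values, the lead-byte classes described in the commit message can be written out as a branch-based function. This is only a minimal sketch: `utf8_lead_len` and `utf8_count` are hypothetical names used for illustration and are not part of the patched source, and stepping over a stray continuation byte (0x80-0xBF) one byte at a time is assumed from the 1 entries in the table rows shown in the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the patched source): sequence length implied
 * by a lead byte under the original RFC 2279 / ISO 10646 reading. */
static size_t utf8_lead_len(unsigned char c)
{
    if (c < 0x80) return 1;  /* 00-7F: ASCII */
    if (c < 0xC0) return 1;  /* 80-BF: stray continuation byte, skipped alone */
    if (c < 0xE0) return 2;  /* C0-DF */
    if (c < 0xF0) return 3;  /* E0-EF */
    if (c < 0xF8) return 4;  /* F0-F7 (F5-F7 lie beyond U+10FFFF) */
    if (c < 0xFC) return 5;  /* F8-FB */
    if (c < 0xFE) return 6;  /* FC-FD */
    return 1;                /* FE-FF: never valid, one invalid byte */
}

/* Hypothetical helper: count characters in buf[0..n) by stepping over each
 * sequence using the length implied by its lead byte. */
static size_t utf8_count(const unsigned char *buf, size_t n)
{
    size_t i = 0, count = 0;

    assert(buf != NULL || n == 0);
    while (i < n) {
        size_t len = utf8_lead_len(buf[i]);
        if (len > n - i)
            len = n - i;     /* truncated sequence at end of input */
        i += len;
        count++;
    }
    return count;
}
```

Given the six bytes F8 88 80 80 80 followed by 'A', this sketch counts two characters. Stepping only 4 bytes for F8, as the pre-patch table entry did, would land on the final continuation byte and count three.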