patch 9.1.1258: regexp: max \U and \%U value is limited by INT_MAX
Problem:  regexp: max \U and \%U value is limited by INT_MAX but gives a confusing error message (related: v8.1.0985).
Solution: Give a better error message when the value reaches INT_MAX.

When searching, Vim accepts up to 8 hex characters after the /\U and /\%U regex atoms. However, with "/\UFFFFFFFF" the code point is already above what an integer variable can hold, which is 2,147,483,647.

Since patch v8.1.0985, Vim has limited the max codepoint to INT_MAX (otherwise it caused a crash in the NFA regex engine), but instead of erroring out it silently fell back to parsing the number as a backslash value rather than as a codepoint value. As a result, "/[\UFFFFFFFF]" will happily find a "\" or a literal "F", and "/[\d127-\UFFFFFFFF]" will error out with "reverse range in character class".

Interestingly, the max Unicode codepoint value is U+10FFFF, which still fits into an ordinary integer value, so we don't even need to parse 8 hex characters; 6 would have been enough.

However, let's not limit Vim to searching for at most 6 hex characters (which would be a backward-incompatible change). Instead, allow all 8 characters and, only if the codepoint reaches INT_MAX, give a more precise error message (about what the max Unicode codepoint value is).

This allows searching for "[\U7FFFFFFE]" (which will likely return "E486 Pattern not found"), while "[\U7FFFFFFF]" now errors with "E1541: Value too large, max Unicode codepoint is U+10FFFF".

While this change is straightforward on architectures where long is 8 bytes, it is not so simple on Windows or 32-bit architectures where long is 4 bytes (and therefore the test fails there). To account for that, let's make use of the vimlong_T number type, make a few corresponding changes in the regex engine code and cast the value to the expected data type. This, however, may not work correctly on systems that don't have the long long datatype (e.g. OpenVMS), and the test will probably fail there.

fixes: #16949
closes: #16994
Signed-off-by: Christian Brabandt <cb@256bit.org>
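
The range check described in the commit message can be sketched roughly as follows. This is only an illustration and not the code from the patch: parse_hex(), parse_codepoint() and the CP_* result codes are hypothetical names, and plain long long stands in for Vim's vimlong_T. The idea is to accumulate up to 8 hex digits in a type wider than 32 bits, so that 0xFFFFFFFF can be represented even where long is 4 bytes, and then to reject any value that reaches INT_MAX.

```c
#include <limits.h>

/* Hypothetical result codes, for illustration only. */
enum { CP_OK, CP_NO_DIGITS, CP_TOO_LARGE };

/*
 * Read up to "max_digits" hex digits from "s" into a 64-bit value.
 * Returns the number of digits consumed.
 * "long long" is an assumption standing in for Vim's vimlong_T.
 */
static int parse_hex(const char *s, int max_digits, long long *out)
{
    long long nr = 0;
    int n = 0;

    while (n < max_digits) {
        int c = (unsigned char)s[n];
        int digit;

        if (c >= '0' && c <= '9')
            digit = c - '0';
        else if (c >= 'a' && c <= 'f')
            digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            digit = c - 'A' + 10;
        else
            break;              /* first non-hex character ends the number */
        nr = nr * 16 + digit;   /* at most 8 digits, so no 64-bit overflow */
        n++;
    }
    *out = nr;
    return n;
}

/*
 * Parse the digits following \U or \%U.  Values that reach INT_MAX are
 * rejected instead of being silently reinterpreted.
 */
static int parse_codepoint(const char *s, int *cp)
{
    long long nr;

    if (parse_hex(s, 8, &nr) == 0)
        return CP_NO_DIGITS;
    if (nr >= INT_MAX)
        return CP_TOO_LARGE;    /* caller would report the "too large" error */
    *cp = (int)nr;              /* safe: nr is below INT_MAX */
    return CP_OK;
}
```

Under these assumptions, parse_codepoint("7FFFFFFE", &cp) succeeds, while "7FFFFFFF" and "FFFFFFFF" are both reported as too large, which mirrors the two search examples in the commit message.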
@@ -1,4 +1,4 @@
-*pattern.txt* For Vim version 9.1. Last change: 2025 Mar 21
+*pattern.txt* For Vim version 9.1. Last change: 2025 Mar 28
 
 
 VIM REFERENCE MANUAL by Bram Moolenaar
@@ -1222,7 +1222,8 @@ x A single character, with no special meaning, matches itself
 \o40 octal number of character up to 0o377
 \x20 hexadecimal number of character up to 0xff
 \u20AC hex. number of multibyte character up to 0xffff
-\U1234 hex. number of multibyte character up to 0xffffffff
+\U1234 hex. number of multibyte character up to 8 characters
+0xffffffff |E1541|
 NOTE: The other backslash codes mentioned above do not work inside
 []!
 - Matching with a collection can be slow, because each character in
@@ -1263,7 +1264,8 @@ x A single character, with no special meaning, matches itself
 \%u20AC Matches the character specified with up to four hexadecimal
 characters.
 \%U1234abcd Matches the character specified with up to eight hexadecimal
-characters, up to 0x7fffffff
+characters, up to 0x7fffffff (the maximum allowed value is INT_MAX
+|E1541|, but the maximum valid Unicode codepoint is U+10FFFF).
 
 ==============================================================================
 7. Ignoring case in a pattern */ignorecase*