Commit Graph

296 Commits

Author SHA1 Message Date
f2b16986a1 patch 9.1.1258: regexp: max \U and \%U value is limited by INT_MAX
Problem:  regexp: max \U and \%U value is limited by INT_MAX but gives a
          confusing error message (related: v8.1.0985).
Solution: give a better error message when the value reaches INT_MAX

When searching Vim allows to get up to 8 hex characters using the /\V
and /\%V regex atoms.  However, when using "/\UFFFFFFFF" the code point is
already above what an integer variable can hold, which is 2,147,483,647.

Since patch v8.1.0985, Vim already limited the max codepoint to INT_MAX
(otherwise it caused a crash in the nfa regex engine), but instead of
error'ing out it silently fell back to parse the number as a backslash
value and not as a codepoint value and as such this "/[\UFFFFFFFF]" will
happily find a "\" or an literal "F".  And this "/[\d127-\UFFFFFFFF]"
will error out as "reverse range in character class).

Interestingly, the max Unicode codepoint value is U+10FFFF which still
fits into an ordinary integer value,  which means, that we don't even
need to parse 8 hex characters, but 6 should have been enough.

However, let's not limit Vim to search for only max 6 hex characters
(which would be a backward incompatible change), but instead allow all 8
characters and only if the codepoint reaches INT_MAX, give a more
precise error message (about what the max unicode codepoint value is).
This allows to search for "[\U7FFFFFFE]" (will likely return "E486
Pattern not found") and "[/\U7FFFFFF]" now errors "E1517: Value too
large, max Unicode codepoint is U+10FFFF".

While this change is straight forward on architectures where long is 8
bytes, this is not so simple on Windows or 32bit architectures where long
is 4 bytes (and therefore the test fails there).  To account for that,
let's make use of the vimlong_T number type and make a few corresponding
changes in the regex engine code and cast the value to the expected data
type. This however may not work correctly on systems that doesn't have
the long long datatype (e.g. OpenVMS) and probably the test will fail
there.

fixes: #16949
closes: #16994

Signed-off-by: Christian Brabandt <cb@256bit.org>
2025-03-29 09:08:58 +01:00
c3a02d78bd patch 9.1.0701: crash with NFA regex engine when searching for composing chars
Problem:  crash with NFA regex engine when searching for composing chars
          (SuyueGuo)
Solution: When there is no composing character, break out of the loop
          and check that out1 state is not null

fixes: #15583

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-08-28 23:17:52 +02:00
22e8e12d9f patch 9.1.0645: regex: wrong match when searching multi-byte char case-insensitive
Problem:  regex: wrong match when searching multi-byte char
          case-insensitive (diffsetter)
Solution: Apply proper case-folding for characters and search-string

This patch does the following 4 things:

1) When the regexp engine compares two utf-8 codepoints case
   insensitive it may match an adjacent character, because it assumes
   it can step over as many bytes as the pattern contains.

   This however is not necessarily true because of case-folding, a
   multi-byte UTF-8 character can be considered equal to some
   single-byte value.

   Let's consider the pattern 'ſ' and the string 's'. When comparing and
   ignoring case, the single character 's' matches, and since it matches
   Vim will try to step over the match (by the amount of bytes of the
   pattern), assuming that since it matches, the length of both strings is
   the same.

   However in that case, it should only step over the single byte value
   's' by 1 byte and try to start matching after it again. So for the
   backtracking engine we need to ensure:
   * we try to match the correct length for the pattern and the text
   * in case of a match, we step over it correctly

   There is one tricky thing for the backtracing engine. We also need to
   calculate correctly the number of bytes to compare the 2 different
   utf-8 strings s1 and s2. So we will count the number of characters in
   s1 that the byte len specified. Then we count the number of bytes to
   step over the same number of characters in string s2 and then we can
   correctly compare the 2 utf-8 strings.

2) A similar thing can happen for the NFA engine, when skipping to the
   next character to test for a match. We are skipping over the regstart
   pointer, however we do not consider the case that because of
   case-folding we may need to adjust the number of bytes to skip over.
   So this needs to be adjusted in find_match_text() as well.

3) A related issue turned out, when prog->match_text is actually empty.
   In that case we should try to find the next match and skip this
   condition.

4) When comparing characters using collections, we must also apply case
   folding to each character in the collection and not just to the
   current character from the search string.  This doesn't apply to the
   NFA engine, because internally it converts collections to branches
   [abc] -> a\|b\|c

fixes: #14294
closes: #14756

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-07-30 20:39:18 +02:00
82792db631 patch 9.1.0409: too many strlen() calls in the regexp engine
Problem:  too many strlen() calls in the regexp engine
Solution: refactor code to retrieve strlen differently, make use
          of bsearch() for getting the character class
          (John Marriott)

closes: #14648

Signed-off-by: John Marriott <basilisk@internode.on.net>
Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-05-12 00:07:17 +02:00
c97f4d61cd patch 9.1.0297: Patch 9.1.0296 causes too many issues
Problem:  Patch 9.1.0296 causes too many issues
          (Tony Mechelynck, @chdiza, CI)
Solution: Back out the change for now

Revert "patch 9.1.0296: regexp: engines do not handle case-folding well"

This reverts commit 7a27c108e0 it causes
issues with syntax highlighting and breaks the FreeBSD and MacOS CI. It
needs more work.

fixes: #14487

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-10 16:22:17 +02:00
7a27c108e0 patch 9.1.0296: regexp: engines do not handle case-folding well
Problem:  Regex engines do not handle case-folding well
Solution: Correctly calculate byte length of characters to skip

When the regexp engine compares two utf-8 codepoints case insensitively
it may match an adjacent character, because it assumes it can step over
as many bytes as the pattern contains.

This however is not necessarily true because of case-folding, a
multi-byte UTF-8 character can be considered equal to some single-byte
value.

Let's consider the pattern 'ſ' and the string 's'. When comparing and
ignoring case, the single character 's' matches, and since it matches
Vim will try to step over the match (by the amount of bytes of the
pattern), assuming that since it matches, the length of both strings is
the same.

However in that case, it should only step over the single byte
value 's' so by 1 byte and try to start matching after it again. So for the
backtracking engine we need to ensure:
- we try to match the correct length for the pattern and the text
- in case of a match, we step over it correctly

The same thing can happen for the NFA engine, when skipping to the next
character to test for a match. We are skipping over the regstart
pointer, however we do not consider the case that because of
case-folding we may need to adjust the number of bytes to skip over. So
this needs to be adjusted in find_match_text() as well.

A related issue turned out, when prog->match_text is actually empty. In
that case we should try to find the next match and skip this condition.

fixes: #14294
closes: #14433

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-09 22:53:19 +02:00
b64cec217f patch 9.1.0229: Error E877 is not translated
Problem:  Error E877 is not translated (RestorerZ)
Solution: Declare the error with N_ to mark it as translatable, add _()
          around the error message in regexp_nfa.c

fixes: #14333

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-03-31 17:56:17 +02:00
46fa3c7e27 patch 9.1.0217: regexp: verymagic cannot match before/after a mark
Problem:  regexp: verymagic cannot match before/after a mark
Solution: Correctly check for the very magic check (Julio B)

Fix regexp parser for \v%>'m and \v%<'m
Currently \v%'m works fine, but it is unable to match before or after
the position of mark m.

closes: #14309

Signed-off-by: Julio B <julio.bacel@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-03-28 10:23:37 +01:00
d2cc51f9a1 patch 9.1.0011: regexp cannot match combining chars in collection
Problem:  regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the
          NFA and BT Regex Engine

Also, while at it, make debug mode work again.

fixes #10286
closes: #12871

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-01-04 22:54:08 +01:00
be07caa071 patch 9.0.1777: patch 9.0.1771 causes problems
Problem:  patch 9.0.1771 causes problems
Solution: revert it

Revert "patch 9.0.1771: regex: combining chars in collections not handled"
This reverts commit ca22fc36a4.

Signed-off-by: Christian Brabandt <cb@256bit.org>
2023-08-20 22:28:28 +02:00
ca22fc36a4 patch 9.0.1771: regex: combining chars in collections not handled
Problem:  regex: combining chars in collections not handled
Solution: Check for following combining characters for NFA and BT engine

closes: #10459
closes: #10286

Signed-off-by: Christian Brabandt <cb@256bit.org>
2023-08-20 20:38:56 +02:00
68ebcee023 patch 9.0.1594: some internal error messages are translated
Problem:    Some internal error messages are translated.
Solution:   Consistently do not translate internal error messages.
            (closes #12459)
2023-05-31 17:12:14 +01:00
097c5370ea patch 9.0.1576: users may not know what to do with an internal error
Problem:    Users may not know what to do with an internal error.
Solution:   Add a translated message with instructions.
2023-05-24 21:02:24 +01:00
c9471b1872 patch 9.0.1529: code style test doesn't check for space after "if"
Problem:    Code style test doesn't check for space after "if".
Solution:   Add a test for space.
2023-05-09 15:00:00 +01:00
1f76138ff1 patch 9.0.1427: warning for uninitialized variable
Problem:    Warning for uninitialized variable. (Tony Mechelynck)
Solution:   Add #ifdef.
2023-03-25 11:31:32 +00:00
1b438a8228 patch 9.0.1271: using sizeof() and subtract array size is tricky
Problem:    Using sizeof() and subtract array size is tricky.
Solution:   Use offsetof() instead. (closes #11926)
2023-02-01 13:11:15 +00:00
ebfec1c531 patch 9.0.1234: the code style has to be checked manually
Problem:    The code style has to be checked manually.
Solution:   Add basic code style checks in a test.  Fix or avoid uncovered
            problems.
2023-01-22 21:14:53 +00:00
f97a295cca patch 9.0.1221: code is indented more than necessary
Problem:    Code is indented more than necessary.
Solution:   Use an early return where it makes sense. (Yegappan Lakshmanan,
            closes #11833)
2023-01-18 18:17:48 +00:00
79336e19cb patch 9.0.1047: matchparen is slow
Problem:    Matchparen is slow.
Solution:   Actually use the position where the match started, not the
            position where the search started. (closes #11644)
2022-12-11 14:18:31 +00:00
4c5678ff0c patch 9.0.0977: it is not easy to see what client-server commands are doing
Problem:    It is not easy to see what client-server commands are doing.
Solution:   Add channel log messages if ch_log() is available.  Move the
            channel logging and make it available with the +eval feature.
2022-11-30 18:12:19 +00:00
01105b37a1 patch 9.0.0951: trying every character position for a match is inefficient
Problem:    Trying every character position for a match is inefficient.
Solution:   Use the start position of the match ignoring "\zs".
2022-11-26 11:47:10 +00:00
c96311b5be patch 9.0.0950: the pattern "\_s\zs" matches at EOL
Problem:    The pattern "\_s\zs" matches at EOL.
Solution:   Make the pattern "\_s\zs" match at the start of the next line.
            (closes #11617)
2022-11-25 21:13:47 +00:00
88456cd3c4 patch 9.0.0904: various comment and indent flaws
Problem:    Various comment and indent flaws.
Solution:   Improve comments and indenting.
2022-11-18 22:14:09 +00:00
753aead960 patch 9.0.0414: matchstr() still does not match column offset
Problem:    matchstr() still does not match column offset when done after a
            text search.
Solution:   Only use the line number for a multi-line search.  Fix the test.
            (closes #10938)
2022-09-08 12:17:06 +01:00
75a115e8d6 patch 9.0.0407: matchstr() does match column offset
Problem:    matchstr() does match column offset. (Yasuhiro Matsumoto)
Solution:   Accept line number zero. (closes #10938)
2022-09-07 18:21:24 +01:00
13ed494bb5 patch 9.0.0228: crash when pattern looks below the last line
Problem:    Crash when pattern looks below the last line.
Solution:   Consider invalid lines to be empty. (closes #10938)
2022-08-19 13:59:25 +01:00
7f9969c559 patch 9.0.0067: cannot show virtual text
Problem:    Cannot show virtual text.
Solution:   Initial changes for virtual text support, using text properties.
2022-07-25 18:13:54 +01:00
509ce03831 patch 8.2.5137: cannot build without the +channel feature
Problem:    Cannot build without the +channel feature. (Dominique Pellé)
Solution:   Add #ifdef around ch_log() calls. (closes #10598)
2022-06-20 11:23:01 +01:00
616592e081 patch 8.2.5115: search timeout is overrun with some patterns
Problem:    Search timeout is overrun with some patterns.
Solution:   Check for timeout in more places.  Make the flag volatile and
            atomic.  Use assert_inrange() to see what happened.
2022-06-17 15:17:10 +01:00
6574577cac patch 8.2.5057: using gettimeofday() for timeout is very inefficient
Problem:    Using gettimeofday() for timeout is very inefficient.
Solution:   Set a platform dependent timer. (Paul Ollis, closes #10505)
2022-06-05 16:55:54 +01:00
305abc6123 patch 8.2.5036: using two counters for timeout check in NFA engine
Problem:    Using two counters for timeout check in NFA engine.
Solution:   Use only one counter.  Tune the counts based on guessing.
2022-05-28 11:08:40 +01:00
02e8d4e4ff patch 8.2.5028: syntax regexp matching can be slow
Problem:    Syntax regexp matching can be slow.
Solution:   Adjust the counters for checking the timeout to check about once
            per msec. (closes #10487, closes #2712)
2022-05-27 15:35:28 +01:00
360da40b47 patch 8.2.4978: no error if engine selection atom is not at the start
Problem:    No error if engine selection atom is not at the start.
Solution:   Give an error. (Christian Brabandt, closes #10439)
2022-05-18 15:04:02 +01:00
72bb10df1f patch 8.2.4693: new regexp does not accept pattern "\%>0v"
Problem:    new regexp does not accept pattern "\%>0v".
Solution:   Do accept digit zero.
2022-04-05 14:00:32 +01:00
91ff3d4f52 patch 8.2.4688: new regexp engine does not give an error for "\%v"
Problem:    New regexp engine does not give an error for "\%v".
Solution:   Check for a value argument. (issue #10079)
2022-04-04 18:32:32 +01:00
b4ad3b0dea patch 8.2.4649: various formatting problems
Problem:    Various formatting problems.
Solution:   Improve the code formatting.
2022-03-30 10:57:45 +01:00
b10ff5c1b3 patch 8.2.4592: search continues after giving E1204
Problem:    Search continues after giving E1204.
Solution:   Return failure after giving E1204. (closes #9972)
2022-03-19 11:31:38 +00:00
0a4e098f32 patch 8.2.4546: duplicate #undef
Problem:    Duplicate #undef.
Solution:   Remove one #undef. (closes #9932)
2022-03-11 15:33:53 +00:00
424bcae1fb patch 8.2.4273: the EBCDIC support is outdated
Problem:    The EBCDIC support is outdated.
Solution:   Remove the EBCDIC support.
2022-01-31 14:59:41 +00:00
b2d85e3784 patch 8.2.4029: debugging NFA regexp my crash, cached indent may be wrong
Problem:    Debugging NFA regexp my crash, cached indent may be wrong.
Solution:   Fix some debug warnings in the NFA regexp code.  Make sure log_fd
            is set when used.  Fix breakindent and indent caching. (Christian
            Brabandt, closes #9482)
2022-01-07 16:55:32 +00:00
d82a47dd04 patch 8.2.4012: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move the last error messages to errors.h.
2022-01-05 20:24:39 +00:00
9d00e4a814 patch 8.2.4010: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more error messages to errors.h.
2022-01-05 17:49:15 +00:00
677658ae49 patch 8.2.4008: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more error messages to errors.h.
2022-01-05 16:09:06 +00:00
a6f7929e62 patch 8.2.4005: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more error messages to errors.h.
2022-01-04 21:30:47 +00:00
74409f6279 patch 8.2.3970: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more errors to errors.h.
2022-01-01 15:58:22 +00:00
af4a61a85d patch 8.2.3914: various spelling mistakes in comments
Problem:    Various spelling mistakes in comments.
Solution:   Fix the mistakes. (Dominique Pellé, closes #9416)
2021-12-27 17:21:41 +00:00
bc404bfb32 patch 8.2.3855: illegal memory access when displaying a blob
Problem:    Illegal memory access when displaying a blob.
Solution:   Append a NUL at the end. (Yegappan Lakshmanan, closes #9372)
2021-12-19 19:19:31 +00:00
52797bae17 patch 8.2.3825: various comments could be improved
Problem:    Various comments could be improved.
Solution:   Improve the comments.
2021-12-16 14:45:13 +00:00
12f3c1b77f patch 8.2.3749: error messages are everywhere
Problem:    Error messages are everywhere.
Solution:   Move more error messages to errors.h and adjust the names.
2021-12-05 21:46:34 +00:00
64066b9acd patch 8.2.3612: using freed memory with regexp using a mark
Problem:    Using freed memory with regexp using a mark.
Solution:   Get the line again after getting the mark position.
2021-11-17 18:22:56 +00:00