Commit Graph

51 Commits

Author SHA1 Message Date
f2b16986a1 patch 9.1.1258: regexp: max \U and \%U value is limited by INT_MAX
Problem:  regexp: max \U and \%U value is limited by INT_MAX but gives a
          confusing error message (related: v8.1.0985).
Solution: give a better error message when the value reaches INT_MAX

When searching Vim allows to get up to 8 hex characters using the /\V
and /\%V regex atoms.  However, when using "/\UFFFFFFFF" the code point is
already above what an integer variable can hold, which is 2,147,483,647.

Since patch v8.1.0985, Vim already limited the max codepoint to INT_MAX
(otherwise it caused a crash in the nfa regex engine), but instead of
error'ing out it silently fell back to parse the number as a backslash
value and not as a codepoint value and as such this "/[\UFFFFFFFF]" will
happily find a "\" or an literal "F".  And this "/[\d127-\UFFFFFFFF]"
will error out as "reverse range in character class).

Interestingly, the max Unicode codepoint value is U+10FFFF which still
fits into an ordinary integer value,  which means, that we don't even
need to parse 8 hex characters, but 6 should have been enough.

However, let's not limit Vim to search for only max 6 hex characters
(which would be a backward incompatible change), but instead allow all 8
characters and only if the codepoint reaches INT_MAX, give a more
precise error message (about what the max unicode codepoint value is).
This allows to search for "[\U7FFFFFFE]" (will likely return "E486
Pattern not found") and "[/\U7FFFFFF]" now errors "E1517: Value too
large, max Unicode codepoint is U+10FFFF".

While this change is straight forward on architectures where long is 8
bytes, this is not so simple on Windows or 32bit architectures where long
is 4 bytes (and therefore the test fails there).  To account for that,
let's make use of the vimlong_T number type and make a few corresponding
changes in the regex engine code and cast the value to the expected data
type. This however may not work correctly on systems that doesn't have
the long long datatype (e.g. OpenVMS) and probably the test will fail
there.

fixes: #16949
closes: #16994

Signed-off-by: Christian Brabandt <cb@256bit.org>
2025-03-29 09:08:58 +01:00
22e8e12d9f patch 9.1.0645: regex: wrong match when searching multi-byte char case-insensitive
Problem:  regex: wrong match when searching multi-byte char
          case-insensitive (diffsetter)
Solution: Apply proper case-folding for characters and search-string

This patch does the following 4 things:

1) When the regexp engine compares two utf-8 codepoints case
   insensitive it may match an adjacent character, because it assumes
   it can step over as many bytes as the pattern contains.

   This however is not necessarily true because of case-folding, a
   multi-byte UTF-8 character can be considered equal to some
   single-byte value.

   Let's consider the pattern 'ſ' and the string 's'. When comparing and
   ignoring case, the single character 's' matches, and since it matches
   Vim will try to step over the match (by the amount of bytes of the
   pattern), assuming that since it matches, the length of both strings is
   the same.

   However in that case, it should only step over the single byte value
   's' by 1 byte and try to start matching after it again. So for the
   backtracking engine we need to ensure:
   * we try to match the correct length for the pattern and the text
   * in case of a match, we step over it correctly

   There is one tricky thing for the backtracing engine. We also need to
   calculate correctly the number of bytes to compare the 2 different
   utf-8 strings s1 and s2. So we will count the number of characters in
   s1 that the byte len specified. Then we count the number of bytes to
   step over the same number of characters in string s2 and then we can
   correctly compare the 2 utf-8 strings.

2) A similar thing can happen for the NFA engine, when skipping to the
   next character to test for a match. We are skipping over the regstart
   pointer, however we do not consider the case that because of
   case-folding we may need to adjust the number of bytes to skip over.
   So this needs to be adjusted in find_match_text() as well.

3) A related issue turned out, when prog->match_text is actually empty.
   In that case we should try to find the next match and skip this
   condition.

4) When comparing characters using collections, we must also apply case
   folding to each character in the collection and not just to the
   current character from the search string.  This doesn't apply to the
   NFA engine, because internally it converts collections to branches
   [abc] -> a\|b\|c

fixes: #14294
closes: #14756

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-07-30 20:39:18 +02:00
6043024cd4 patch 9.1.0412: typo in regexp_bt.c in DEBUG code
Problem:  typo in regexp_bt.c in DEBUG code, causing
          compile error (@kfleong7, after v9.1.0409)
Solution: Replace bulen by buflen

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-05-14 11:19:47 +02:00
82792db631 patch 9.1.0409: too many strlen() calls in the regexp engine
Problem:  too many strlen() calls in the regexp engine
Solution: refactor code to retrieve strlen differently, make use
          of bsearch() for getting the character class
          (John Marriott)

closes: #14648

Signed-off-by: John Marriott <basilisk@internode.on.net>
Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-05-12 00:07:17 +02:00
c97f4d61cd patch 9.1.0297: Patch 9.1.0296 causes too many issues
Problem:  Patch 9.1.0296 causes too many issues
          (Tony Mechelynck, @chdiza, CI)
Solution: Back out the change for now

Revert "patch 9.1.0296: regexp: engines do not handle case-folding well"

This reverts commit 7a27c108e0 it causes
issues with syntax highlighting and breaks the FreeBSD and MacOS CI. It
needs more work.

fixes: #14487

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-10 16:22:17 +02:00
7a27c108e0 patch 9.1.0296: regexp: engines do not handle case-folding well
Problem:  Regex engines do not handle case-folding well
Solution: Correctly calculate byte length of characters to skip

When the regexp engine compares two utf-8 codepoints case insensitively
it may match an adjacent character, because it assumes it can step over
as many bytes as the pattern contains.

This however is not necessarily true because of case-folding, a
multi-byte UTF-8 character can be considered equal to some single-byte
value.

Let's consider the pattern 'ſ' and the string 's'. When comparing and
ignoring case, the single character 's' matches, and since it matches
Vim will try to step over the match (by the amount of bytes of the
pattern), assuming that since it matches, the length of both strings is
the same.

However in that case, it should only step over the single byte
value 's' so by 1 byte and try to start matching after it again. So for the
backtracking engine we need to ensure:
- we try to match the correct length for the pattern and the text
- in case of a match, we step over it correctly

The same thing can happen for the NFA engine, when skipping to the next
character to test for a match. We are skipping over the regstart
pointer, however we do not consider the case that because of
case-folding we may need to adjust the number of bytes to skip over. So
this needs to be adjusted in find_match_text() as well.

A related issue turned out, when prog->match_text is actually empty. In
that case we should try to find the next match and skip this condition.

fixes: #14294
closes: #14433

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-09 22:53:19 +02:00
46fa3c7e27 patch 9.1.0217: regexp: verymagic cannot match before/after a mark
Problem:  regexp: verymagic cannot match before/after a mark
Solution: Correctly check for the very magic check (Julio B)

Fix regexp parser for \v%>'m and \v%<'m
Currently \v%'m works fine, but it is unable to match before or after
the position of mark m.

closes: #14309

Signed-off-by: Julio B <julio.bacel@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-03-28 10:23:37 +01:00
d2cc51f9a1 patch 9.1.0011: regexp cannot match combining chars in collection
Problem:  regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the
          NFA and BT Regex Engine

Also, while at it, make debug mode work again.

fixes #10286
closes: #12871

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-01-04 22:54:08 +01:00
be07caa071 patch 9.0.1777: patch 9.0.1771 causes problems
Problem:  patch 9.0.1771 causes problems
Solution: revert it

Revert "patch 9.0.1771: regex: combining chars in collections not handled"
This reverts commit ca22fc36a4.

Signed-off-by: Christian Brabandt <cb@256bit.org>
2023-08-20 22:28:28 +02:00
ca22fc36a4 patch 9.0.1771: regex: combining chars in collections not handled
Problem:  regex: combining chars in collections not handled
Solution: Check for following combining characters for NFA and BT engine

closes: #10459
closes: #10286

Signed-off-by: Christian Brabandt <cb@256bit.org>
2023-08-20 20:38:56 +02:00
68ebcee023 patch 9.0.1594: some internal error messages are translated
Problem:    Some internal error messages are translated.
Solution:   Consistently do not translate internal error messages.
            (closes #12459)
2023-05-31 17:12:14 +01:00
ebfec1c531 patch 9.0.1234: the code style has to be checked manually
Problem:    The code style has to be checked manually.
Solution:   Add basic code style checks in a test.  Fix or avoid uncovered
            problems.
2023-01-22 21:14:53 +00:00
f97a295cca patch 9.0.1221: code is indented more than necessary
Problem:    Code is indented more than necessary.
Solution:   Use an early return where it makes sense. (Yegappan Lakshmanan,
            closes #11833)
2023-01-18 18:17:48 +00:00
4c5678ff0c patch 9.0.0977: it is not easy to see what client-server commands are doing
Problem:    It is not easy to see what client-server commands are doing.
Solution:   Add channel log messages if ch_log() is available.  Move the
            channel logging and make it available with the +eval feature.
2022-11-30 18:12:19 +00:00
01105b37a1 patch 9.0.0951: trying every character position for a match is inefficient
Problem:    Trying every character position for a match is inefficient.
Solution:   Use the start position of the match ignoring "\zs".
2022-11-26 11:47:10 +00:00
753aead960 patch 9.0.0414: matchstr() still does not match column offset
Problem:    matchstr() still does not match column offset when done after a
            text search.
Solution:   Only use the line number for a multi-line search.  Fix the test.
            (closes #10938)
2022-09-08 12:17:06 +01:00
75a115e8d6 patch 9.0.0407: matchstr() does match column offset
Problem:    matchstr() does match column offset. (Yasuhiro Matsumoto)
Solution:   Accept line number zero. (closes #10938)
2022-09-07 18:21:24 +01:00
13ed494bb5 patch 9.0.0228: crash when pattern looks below the last line
Problem:    Crash when pattern looks below the last line.
Solution:   Consider invalid lines to be empty. (closes #10938)
2022-08-19 13:59:25 +01:00
7f9969c559 patch 9.0.0067: cannot show virtual text
Problem:    Cannot show virtual text.
Solution:   Initial changes for virtual text support, using text properties.
2022-07-25 18:13:54 +01:00
509ce03831 patch 8.2.5137: cannot build without the +channel feature
Problem:    Cannot build without the +channel feature. (Dominique Pellé)
Solution:   Add #ifdef around ch_log() calls. (closes #10598)
2022-06-20 11:23:01 +01:00
616592e081 patch 8.2.5115: search timeout is overrun with some patterns
Problem:    Search timeout is overrun with some patterns.
Solution:   Check for timeout in more places.  Make the flag volatile and
            atomic.  Use assert_inrange() to see what happened.
2022-06-17 15:17:10 +01:00
6574577cac patch 8.2.5057: using gettimeofday() for timeout is very inefficient
Problem:    Using gettimeofday() for timeout is very inefficient.
Solution:   Set a platform dependent timer. (Paul Ollis, closes #10505)
2022-06-05 16:55:54 +01:00
02e8d4e4ff patch 8.2.5028: syntax regexp matching can be slow
Problem:    Syntax regexp matching can be slow.
Solution:   Adjust the counters for checking the timeout to check about once
            per msec. (closes #10487, closes #2712)
2022-05-27 15:35:28 +01:00
360da40b47 patch 8.2.4978: no error if engine selection atom is not at the start
Problem:    No error if engine selection atom is not at the start.
Solution:   Give an error. (Christian Brabandt, closes #10439)
2022-05-18 15:04:02 +01:00
72bb10df1f patch 8.2.4693: new regexp does not accept pattern "\%>0v"
Problem:    new regexp does not accept pattern "\%>0v".
Solution:   Do accept digit zero.
2022-04-05 14:00:32 +01:00
91ff3d4f52 patch 8.2.4688: new regexp engine does not give an error for "\%v"
Problem:    New regexp engine does not give an error for "\%v".
Solution:   Check for a value argument. (issue #10079)
2022-04-04 18:32:32 +01:00
b55986c52d patch 8.2.4646: using buffer line after it has been freed
Problem:    Using buffer line after it has been freed in old regexp engine.
Solution:   After getting mark get the line again.
2022-03-29 13:24:58 +01:00
6456fae9ba patch 8.2.4440: crash with specific regexp pattern and string
Problem:    Crash with specific regexp pattern and string.
Solution:   Stop at the start of the string.
2022-02-22 13:37:31 +00:00
424bcae1fb patch 8.2.4273: the EBCDIC support is outdated
Problem:    The EBCDIC support is outdated.
Solution:   Remove the EBCDIC support.
2022-01-31 14:59:41 +00:00
b2810f123c patch 8.2.4046: some error messages not in the right place
Problem:    Some error messages not in the right place.
Solution:   Adjust the errors file.  Fix typo.
2022-01-08 21:38:52 +00:00
677658ae49 patch 8.2.4008: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more error messages to errors.h.
2022-01-05 16:09:06 +00:00
a6f7929e62 patch 8.2.4005: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more error messages to errors.h.
2022-01-04 21:30:47 +00:00
eaaac014a0 patch 8.2.3983: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more error messages to errors.h.
2022-01-02 17:00:40 +00:00
74409f6279 patch 8.2.3970: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move more errors to errors.h.
2022-01-01 15:58:22 +00:00
d0819d11ec patch 8.2.3962: build fails for missing error message
Problem:    Build fails for missing error message.
Solution:   Add changes in missed file.
2021-12-31 23:15:53 +00:00
12f3c1b77f patch 8.2.3749: error messages are everywhere
Problem:    Error messages are everywhere.
Solution:   Move more error messages to errors.h and adjust the names.
2021-12-05 21:46:34 +00:00
d8e44476d8 patch 8.2.3197: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move a few more error messages to errors.h.
2021-07-21 22:20:33 +02:00
e29a27f6f8 patch 8.2.3190: error messages are spread out
Problem:    Error messages are spread out.
Solution:   Move error messages to errors.h and give them a clear name.
2021-07-20 21:07:36 +02:00
04db26b360 patch 8.2.3110: a pattern that matches the cursor position is complicated
Problem:    A pattern that matches the cursor position is bit complicated.
Solution:   Use a dot to indicate the cursor line and column. (Christian
            Brabandt, closes #8497, closes #8179)
2021-07-05 20:15:23 +02:00
872bee557e patch 8.2.2885: searching for \%'> does not match linewise end of line
Problem:    searching for \%'> does not match linewise end of line. (Tim Chase)
Solution:   Match end of line if column is MAXCOL. (closes #8238)
2021-05-24 22:56:15 +02:00
df36514a64 patch 8.2.2829: some comments are not correct or clear
Problem:    Some comments are not correct or clear.
Solution:   Adjust the comments.  Add test for cursor position.
2021-05-03 20:01:45 +02:00
0b94e297af patch 8.2.2716: the equivalent class regexp is missing some characters
Problem:    The equivalent class regexp is missing some characters.
Solution:   Update the list of equivalent characters. (Dominique Pellé,
            closes #8029)
2021-04-05 13:59:53 +02:00
a3d10a508c patch 8.2.2181: valgrind warnings for using uninitialized value
Problem:    Valgrind warnings for using uninitialized value.
Solution:   Do not use "start" or "end" unless there is a match.
2020-12-21 18:24:00 +01:00
a7a691cc14 patch 8.2.2121: internal error when using \ze before \zs in a pattern
Problem:    Internal error when using \ze before \zs in a pattern.
Solution:   Check the end is never before the start. (closes #7442)
2020-12-09 16:36:04 +01:00
e83cca2911 patch 8.2.1633: some error messages are internal but do not use iemsg()
Problem:    Some error messages are internal but do not use iemsg().
Solution:   Use iemsg(). (Dominique Pellé, closes #6894)
2020-09-07 18:53:21 +02:00
71ccd03ee8 patch 8.2.0967: unnecessary type casts for vim_strnsave()
Problem:    Unnecessary type casts for vim_strnsave().
Solution:   Remove the type casts.
2020-06-12 22:59:11 +02:00
a80faa8930 patch 8.2.0559: clearing a struct is verbose
Problem:    Clearing a struct is verbose.
Solution:   Define and use CLEAR_FIELD() and CLEAR_POINTER().
2020-04-12 19:37:17 +02:00
f4140488c7 patch 8.2.0260: several lines of code are duplicated
Problem:    Several lines of code are duplicated.
Solution:   Move duplicated code to a function. (Yegappan Lakshmanan,
            closes #5330)
2020-02-15 23:06:45 +01:00
7c77b34967 patch 8.2.0033: crash when make_extmatch() runs out of memory
Problem:    Crash when make_extmatch() runs out of memory.
Solution:   Check for NULL. (Dominique Pelle, closs #5392)
2019-12-22 19:40:40 +01:00
9490b9a61c patch 8.1.2010: new file uses old style comments
Problem:    New file uses old style comments.
Solution:   Change to new style comments. (Yegappan Lakshmanan, closes #4910)
2019-09-08 17:20:12 +02:00