CI: Manage multibyte characters in syntax tests

As reported in #16559, bytes of a multibyte character may
be written as separate U+FFFD characters in a ":terminal"
window on a busy machine.  The testing facilities currently
offer an optional filtering step to be carried out between
reading and comparing the contents of two screendump files
for each such file.  This filtering has been resorted to
(#14767 and #16560) in an attempt to unconditionally replace
known non-Latin-1 characters with an arbitrary substitute
ASCII character and keep this rendering mishap from causing
syntax test failures.  However, it was overlooked at the
time that the metadata description (in shorthand) following
spurious U+FFFD characters may be *distinct* and make the
remainder of such a line, ASCII characters and whatnot, also
unequal between compared screendump files.

While it is straightforward to adapt current filter files to
ignore the line characters after the leftmost U+FFFD,

> It is challenging and error-prone to keep up to date filter
> files because moving around examples in source files will
> likely make redundant some previously required filter files
> and, at the same time, it may require creating new filter
> files for the same source file; substituting one multibyte
> character for another multibyte character will also demand
> a coordinated change for filter files.

Besides, unconditionally dropping arbitrary parts of a line
is rather too blunt an instrument.  An alternative approach
is to not use the supported filtering for this purpose; let
a syntax test pass or fail initially; then *if* the same
failure is imminent, drop the leftmost U+FFFD and the rest
of the previously seen line (repeating it for all previously
seen unequal lines) before another round of comparing file
contents.  The obvious disadvantage of this filtering,
unconditional and otherwise, is that if there are consistent
failures for _other reasons_ and the unequal parts happen to
be after U+FFFDs, then tests can pass spuriously when the
stars align for _a particular test runner_.
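
The alternative approach can be sketched roughly as follows.
This is an illustration in Python, not the code of the test
runner; the function names are hypothetical, and cutting both
lines at the offset of the leftmost U+FFFD found in either of
them is an assumption made for the sketch.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, written for a mangled multibyte character


def leftmost_replacement(*lines):
    """Return the smallest offset of U+FFFD in any given line, or None."""
    positions = [p for p in (line.find(REPLACEMENT) for line in lines) if p >= 0]
    return min(positions) if positions else None


def lines_match_leniently(expected_line, actual_line):
    """Compare strictly first; retry only when a U+FFFD is present."""
    if expected_line == actual_line:
        return True
    cut = leftmost_replacement(expected_line, actual_line)
    if cut is None:
        # Unequal for some other reason: nothing is dropped.
        return False
    # Assumption: drop the leftmost U+FFFD and the rest of BOTH lines.
    return expected_line[:cut] == actual_line[:cut]


def dumps_match(expected, actual):
    """Compare two screendumps given as lists of lines."""
    if len(expected) != len(actual):
        return False
    return all(lines_match_leniently(e, a) for e, a in zip(expected, actual))
```

Note that the sketch also reproduces the disadvantage described
above: whatever differs to the right of a U+FFFD is never
looked at, whether or not the rendering mishap caused it.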

Hence syntax test authors should strive to write as little
significant text after multibyte characters as syntactically
permissible, write multibyte characters closer to EOL in
general, and make sure that their checked-in and published
"*.dump" files do not have any U+FFFDs.
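
A check along these lines could be run before publishing
screendumps.  It is a minimal sketch with a hypothetical
function name, assuming the dumps are UTF-8 files kept in a
single directory; it looks for the encoded form of U+FFFD
(0xEF 0xBF 0xBD) so that decoding is not required.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD.
REPLACEMENT_BYTES = b"\xef\xbf\xbd"


def dumps_with_replacement_chars(directory):
    """Return the names of "*.dump" files that contain U+FFFD."""
    offenders = []
    for path in sorted(Path(directory).glob("*.dump")):
        if REPLACEMENT_BYTES in path.read_bytes():
            offenders.append(path.name)
    return offenders
```

An empty result means that no U+FFFD was found in that
directory; any listed file is a candidate for regeneration.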

It is also practical to refrain from attempting screendump
generation if U+FFFDs can already be discovered, and instead
try re-running the syntax test at hand from scratch, while
accepting other recently generated screendumps without going
through with new rounds of verification.

Reference:
https://github.com/vim/vim/pull/16470#issuecomment-2599848525

closes: #17704

Signed-off-by: Aliaksei Budavei <0x000c70@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>
Aliaksei Budavei
2025-07-25 20:08:52 +02:00
committed by Christian Brabandt
parent 43b99c9376
commit 0fde6aebdd
4 changed files with 265 additions and 73 deletions

@@ -61,8 +61,6 @@ an "input/setup/java.vim" script file with the following lines:
Both inline setup commands and setup scripts may be used at the same time, the
script file will be sourced before any VIM_TEST_SETUP commands are executed.
Every line of a source file must not be longer than 1425 (19 x 75) characters.
If there is no further setup required, you can now run all tests:
make test
@@ -112,6 +110,20 @@ If they look OK, move them to the "dumps" directory:
If you now run the test again, it will succeed.
Limitations for syntax plugin tests
-----------------------------------
Do not compose ASCII lines that do not fit a 19 by 75 window (1425 columns).
Use multibyte characters, if at all, sparingly (see #16559). When possible,
move multibyte characters closer to the end of a line and keep the line short:
no more than a 75-byte total of displayed characters. A poorly rendered line
may otherwise become wrapped when enough of spurious U+FFFD (0xEF 0xBF 0xBD)
characters claim more columns than are available (75) and then invalidate line
correspondence under test. Refrain from mixing non-spurious U+FFFD characters
with other multibyte characters in the same line.
Adjusting a syntax plugin test
------------------------------