patch 9.1.1627: fuzzy matching can be improved

Problem:  fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
          (Girish Palya)

Replace fuzzy matching algorithm with improved fzy-based implementation

The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:

* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
  ranking middle matches higher.

After evaluating alternatives (see my comments
[here](https://github.com/vim/vim/issues/17531#issuecomment-3112046897)
and
[here](https://github.com/vim/vim/issues/17531#issuecomment-3121593900)),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:

* Resolves the aforementioned issues.
* Performs better.

Implementation details

This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.

* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
  compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
  the original, but now operates on codepoints rather than single-byte
  characters.

Performance

Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.

```
Search String            Current Algo    FZY Algo
-------------------------------------------------
init                          131.759    66.916
main                          83.688     40.861
sig                           98.348     39.699
index                         109.222    30.738
ab                            72.222     44.357
cd                            83.036     54.739
a                             58.94      62.242
b                             43.612     43.442
c                             64.39      67.442
k                             40.585     36.371
z                             34.708     22.781
w                             38.033     30.109
cpa                           82.596     38.116
arz                           84.251     23.964
zzzz                          35.823     22.75
dimag                         110.686    29.646
xa                            43.188     29.199
nha                           73.953     31.001
nedax                         94.775     29.568
dbue                          79.846     25.902
fp                            46.826     31.641
tr                            90.951     55.883
kw                            38.875     23.194
rp                            101.575    55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519     30.921
```

```vim
vim9script

var haystack = readfile('/Users/gp/linux.files')

var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
    'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
    'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
    var start = reltime()
    var tmp = matchfuzzy(haystack, needle)
    echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```

Additional changes

* Removed the "camelcase" option from both matchfuzzy() and
  matchfuzzypos(), as it's now obsolete with the improved algorithm.

related: neovim/neovim#34101
fixes #17531
closes: #17900

Signed-off-by: Girish Palya <girishji@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>
This commit is contained in:
Girish Palya
2025-08-12 22:22:52 +02:00
committed by Christian Brabandt
parent 5ba6e41d37
commit 7e0df5eee9
25 changed files with 1289 additions and 1311 deletions

View File

@ -6429,8 +6429,8 @@ vgr_match_buflines(
long lnum;
colnr_T col;
int pat_len = (int)STRLEN(spat);
if (pat_len > MAX_FUZZY_MATCHES)
pat_len = MAX_FUZZY_MATCHES;
if (pat_len > FUZZY_MATCH_MAX_LEN)
pat_len = FUZZY_MATCH_MAX_LEN;
for (lnum = 1; lnum <= buf->b_ml.ml_line_count && *tomatch > 0; ++lnum)
{
@ -6483,13 +6483,13 @@ vgr_match_buflines(
char_u *str = ml_get_buf(buf, lnum, FALSE);
colnr_T linelen = ml_get_buf_len(buf, lnum);
int score;
int_u matches[MAX_FUZZY_MATCHES];
int_u matches[FUZZY_MATCH_MAX_LEN];
int_u sz = ARRAY_LENGTH(matches);
// Fuzzy string match
CLEAR_FIELD(matches);
while (fuzzy_match(str + col, spat, FALSE, &score,
matches, sz, TRUE) > 0)
matches, sz) > 0)
{
// Pass the buffer number so that it gets used even for a
// dummy buffer, unless duplicate_name is set, then the