feat(server): implement FTS5-based full-text search (#5079)

* build: add sqlite_fts5 build tag to enable FTS5 support

* feat: add SearchBackend config option (default: fts)

* feat: add buildFTS5Query for safe FTS5 query preprocessing

* feat: add FTS5 search backend with config toggle, refactor legacy search

- Add searchExprFunc type and getSearchExpr() for backend selection
- Rename fullTextExpr to legacySearchExpr
- Add ftsSearchExpr using FTS5 MATCH subquery
- Update fullTextFilter in sql_restful.go to use configured backend

* feat: add FTS5 migration with virtual tables, triggers, and search_participants

Creates FTS5 virtual tables for media_file, album, and artist with
unicode61 tokenizer and diacritic folding. Adds search_participants
column, populates from JSON, and sets up INSERT/UPDATE/DELETE triggers.

* feat: populate search_participants in PostMapArgs for FTS5 indexing

* test: add FTS5 search integration tests

* fix: exclude FTS5 virtual tables from e2e DB restore

The restoreDB function iterates all tables in sqlite_master and
runs DELETE + INSERT to reset state. FTS5 contentless virtual tables
cannot be directly deleted from. Since triggers handle FTS5 sync
automatically, simply skip tables matching *_fts and *_fts_* patterns.

* build: add compile-time guard for sqlite_fts5 build tag

Same pattern as netgo: compilation fails with a clear error if
the sqlite_fts5 build tag is missing.

* build: add sqlite_fts5 tag to reflex dev server config

* build: extract GO_BUILD_TAGS variable in Makefile to avoid duplication

* fix: strip leading * from FTS5 queries to prevent "unknown special query" error

* feat: auto-append prefix wildcard to FTS5 search tokens for broader matching

Every plain search token now gets a trailing * appended (e.g., "love" becomes
"love*"), so searching for "love" also matches "lovelace", "lovely", etc.
Quoted phrases are preserved as exact matches without wildcards. Results are
ordered alphabetically by name/title, so shorter exact matches naturally
appear first.

* fix: clarify comments about FTS5 operator neutralization

The comments said "strip" but the code lowercases operators to
neutralize them (FTS5 operators are case-sensitive). Updated comments
to accurately describe the behavior.

* fix: use fmt.Sprintf for FTS5 phrase placeholders

The previous encoding used rune('0'+index) which silently breaks with
10+ quoted phrases. Use fmt.Sprintf for arbitrary index support.

* fix: validate and normalize SearchBackend config option

Normalize the value to lowercase and fall back to "fts" with a log
warning for unrecognized values. This prevents silent misconfiguration
from typos like "FTS", "Legacy", or "fts5".

* refactor: improve documentation for build tags and FTS5 requirements

Signed-off-by: Deluan <deluan@navidrome.org>

* refactor: convert FTS5 query and search backend normalization tests to DescribeTable format

Signed-off-by: Deluan <deluan@navidrome.org>

* fix: add sqlite_fts5 build tag to golangci configuration

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: add UISearchDebounceMs configuration option and update related components

Signed-off-by: Deluan <deluan@navidrome.org>

* fix: fall back to legacy search when SearchFullString is enabled

FTS5 is token-based and cannot match substrings within words, so
getSearchExpr now returns legacySearchExpr when SearchFullString
is true, regardless of SearchBackend setting.

* fix: add sqlite_fts5 build tag to CI pipeline and Dockerfile

* fix: add WHEN clauses to FTS5 AFTER UPDATE triggers

Added WHEN clauses to the media_file_fts_au, album_fts_au, and
artist_fts_au triggers so they only fire when FTS-indexed columns
actually change. Previously, every row update (e.g., play count, rating,
starred status) triggered an unnecessary delete+insert cycle in the FTS
shadow tables. The WHEN clauses use IS NOT for NULL-safe comparison of
each indexed column, avoiding FTS index churn for non-indexed updates.

* feat: add SearchBackend configuration option to data and insights components

Signed-off-by: Deluan <deluan@navidrome.org>

* fix: enhance input sanitization for FTS5 by stripping additional punctuation and special characters

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: add search_normalized column for punctuated name search (R.E.M., AC/DC)

Add index-time normalization and query-time single-letter collapsing to
fix FTS5 search for punctuated names. A new search_normalized column
stores concatenated forms of punctuated words (e.g., "R.E.M." → "REM",
"AC/DC" → "ACDC") and is indexed in FTS5 tables. At query time, runs of
consecutive single letters (from dot-stripping) are collapsed into OR
expressions like ("R E M" OR REM*) to match both the original tokens and
the normalized form. This enables searching by "R.E.M.", "REM", "AC/DC",
"ACDC", "A-ha", or "Aha" and finding the correct results.

* refactor: simplify isSingleUnicodeLetter to avoid []rune allocation

Use utf8.DecodeRuneInString to check for a single Unicode letter
instead of converting the entire string to a []rune slice.

* feat: define ftsSearchColumns for flexible FTS5 search column inclusion

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: update collapseSingleLetterRuns to return quoted phrases for abbreviations

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: implement extractPunctuatedWords to handle artist/album names with embedded punctuation

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: implement extractPunctuatedWords to handle artist/album names with embedded punctuation

Signed-off-by: Deluan <deluan@navidrome.org>

* refactor: punctuated word handling to improve processing of artist/album names

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: add CJK support for search queries with LIKE filters

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: enhance FTS5 search by adding album version support and CJK handling

Signed-off-by: Deluan <deluan@navidrome.org>

* refactor: search configuration to use structured options

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: enhance search functionality to support punctuation-only queries and update related tests

Signed-off-by: Deluan <deluan@navidrome.org>

---------

Signed-off-by: Deluan <deluan@navidrome.org>
This commit is contained in:
Deluan Quintão
2026-02-21 17:52:42 -05:00
committed by GitHub
parent 6f5f58ae9d
commit 54de0dbc52
35 changed files with 1283 additions and 56 deletions
+261
View File
@@ -0,0 +1,261 @@
package persistence
import (
"fmt"
"regexp"
"strings"
"unicode"
"unicode/utf8"
. "github.com/Masterminds/squirrel"
"github.com/navidrome/navidrome/log"
)
// containsCJK returns true if the string contains any CJK (Chinese/Japanese/Korean) characters.
// CJK text doesn't use spaces between words, so FTS5's unicode61 tokenizer treats entire
// CJK phrases as single tokens, making token-based search ineffective for CJK content.
func containsCJK(s string) bool {
for _, r := range s {
if unicode.Is(unicode.Han, r) ||
unicode.Is(unicode.Hiragana, r) ||
unicode.Is(unicode.Katakana, r) ||
unicode.Is(unicode.Hangul, r) {
return true
}
}
return false
}
// fts5SpecialChars matches characters that should be stripped from user input.
// We keep only Unicode letters, numbers, whitespace, * (prefix wildcard), " (phrase quotes),
// and \x00 (internal placeholder marker). All punctuation is removed because the unicode61
// tokenizer treats it as token separators, and characters like ' can cause FTS5 parse errors
// as unbalanced string delimiters.
var fts5SpecialChars = regexp.MustCompile(`[^\p{L}\p{N}\s*"\x00]`)
// fts5PunctStrip strips everything except letters and numbers (no whitespace, wildcards, or quotes).
// Used for normalizing words at index time to create concatenated forms (e.g., "R.E.M." → "REM").
var fts5PunctStrip = regexp.MustCompile(`[^\p{L}\p{N}]`)
// fts5Operators matches FTS5 boolean operators as whole words (case-insensitive).
var fts5Operators = regexp.MustCompile(`(?i)\b(AND|OR|NOT|NEAR)\b`)
// fts5LeadingStar matches a * at the start of a token. FTS5 only supports * at the end (prefix queries).
var fts5LeadingStar = regexp.MustCompile(`(^|[\s])\*+`)
// normalizeForFTS takes multiple strings, strips non-letter/non-number characters from each word,
// and returns a space-separated string of words that changed after stripping (deduplicated).
// This is used at index time to create concatenated forms: "R.E.M." → "REM", "AC/DC" → "ACDC".
func normalizeForFTS(values ...string) string {
seen := make(map[string]struct{})
var result []string
for _, v := range values {
for _, word := range strings.Fields(v) {
stripped := fts5PunctStrip.ReplaceAllString(word, "")
if stripped == "" || stripped == word {
continue
}
lower := strings.ToLower(stripped)
if _, ok := seen[lower]; ok {
continue
}
seen[lower] = struct{}{}
result = append(result, stripped)
}
}
return strings.Join(result, " ")
}
// isSingleUnicodeLetter returns true if token is exactly one Unicode letter.
func isSingleUnicodeLetter(token string) bool {
r, size := utf8.DecodeRuneInString(token)
return size == len(token) && size > 0 && unicode.IsLetter(r)
}
// namePunctuation is the set of characters commonly used as separators in artist/album
// names (hyphens, slashes, dots, apostrophes). Only words containing these are candidates
// for punctuated-word processing; other special characters (^, :, &) are just stripped.
const namePunctuation = `-/.''`
// processPunctuatedWords handles words with embedded name punctuation before the general
// special-character stripping. For each punctuated word it produces either:
// - A quoted phrase for dotted abbreviations: R.E.M. → "R E M"
// - A phrase+concat OR for other patterns: a-ha → ("a ha" OR aha*)
func processPunctuatedWords(input string, phrases []string) (string, []string) {
words := strings.Fields(input)
var result []string
for _, w := range words {
if strings.HasPrefix(w, "\x00") || strings.ContainsAny(w, `*"`) || !strings.ContainsAny(w, namePunctuation) {
result = append(result, w)
continue
}
concat := fts5PunctStrip.ReplaceAllString(w, "")
if concat == "" || concat == w {
result = append(result, w)
continue
}
subTokens := strings.Fields(fts5SpecialChars.ReplaceAllString(w, " "))
if len(subTokens) < 2 {
// Single sub-token after splitting (e.g., N' → N): just use the stripped form
result = append(result, concat)
continue
}
// Dotted abbreviations (R.E.M., U.K.) — all single letters separated by dots only
if isDottedAbbreviation(w, subTokens) {
phrases = append(phrases, fmt.Sprintf(`"%s"`, strings.Join(subTokens, " ")))
} else {
// Punctuated names (a-ha, AC/DC, Jay-Z) — phrase for adjacency + concat for search_normalized
phrases = append(phrases, fmt.Sprintf(`("%s" OR %s*)`, strings.Join(subTokens, " "), concat))
}
result = append(result, fmt.Sprintf("\x00PHRASE%d\x00", len(phrases)-1))
}
return strings.Join(result, " "), phrases
}
// isDottedAbbreviation returns true if w uses only dots as punctuation and all sub-tokens
// are single letters (e.g., "R.E.M.", "U.K." but not "a-ha" or "AC/DC").
func isDottedAbbreviation(w string, subTokens []string) bool {
for _, r := range w {
if !unicode.IsLetter(r) && !unicode.IsNumber(r) && r != '.' {
return false
}
}
for _, st := range subTokens {
if !isSingleUnicodeLetter(st) {
return false
}
}
return true
}
// buildFTS5Query preprocesses user input into a safe FTS5 MATCH expression.
// It preserves quoted phrases and * prefix wildcards, neutralizes FTS5 operators
// (by lowercasing them, since FTS5 operators are case-sensitive) and strips
// special characters to prevent query injection.
func buildFTS5Query(userInput string) string {
q := strings.TrimSpace(userInput)
if q == "" {
return ""
}
var phrases []string
result := q
for {
start := strings.Index(result, `"`)
if start == -1 {
break
}
end := strings.Index(result[start+1:], `"`)
if end == -1 {
// Unmatched quote — remove it
result = result[:start] + result[start+1:]
break
}
end += start + 1
phrase := result[start : end+1] // includes quotes
phrases = append(phrases, phrase)
result = result[:start] + fmt.Sprintf("\x00PHRASE%d\x00", len(phrases)-1) + result[end+1:]
}
// Neutralize FTS5 operators by lowercasing them (FTS5 operators are case-sensitive:
// AND, OR, NOT, NEAR are operators, but and, or, not, near are plain tokens)
result = fts5Operators.ReplaceAllStringFunc(result, strings.ToLower)
// Handle words with embedded punctuation (a-ha, AC/DC, R.E.M.) before stripping
result, phrases = processPunctuatedWords(result, phrases)
result = fts5SpecialChars.ReplaceAllString(result, " ")
result = fts5LeadingStar.ReplaceAllString(result, "$1")
tokens := strings.Fields(result)
// Append * to plain tokens for prefix matching (e.g., "love" → "love*").
// Skip tokens that are already wildcarded or are quoted phrase placeholders.
for i, t := range tokens {
if strings.HasPrefix(t, "\x00") || strings.HasSuffix(t, "*") {
continue
}
tokens[i] = t + "*"
}
result = strings.Join(tokens, " ")
for i, phrase := range phrases {
placeholder := fmt.Sprintf("\x00PHRASE%d\x00", i)
result = strings.ReplaceAll(result, placeholder, phrase)
}
return result
}
// likeSearchColumns defines the core columns to search with LIKE queries.
// These are the primary user-visible fields for each entity type.
// Used as a fallback when FTS5 cannot handle the query (e.g., CJK text, punctuation-only input).
var likeSearchColumns = map[string][]string{
"media_file": {"title", "album", "artist", "album_artist"},
"album": {"name", "album_artist"},
"artist": {"name"},
}
// likeSearchExpr generates LIKE-based search filters against core columns.
// Each word in the query must match at least one column (AND between words),
// and each word can match any column (OR within a word).
// Used as a fallback when FTS5 cannot handle the query (e.g., CJK text, punctuation-only input).
func likeSearchExpr(tableName string, s string) Sqlizer {
s = strings.TrimSpace(s)
if s == "" {
log.Trace("Search using LIKE backend, query is empty", "table", tableName)
return nil
}
columns, ok := likeSearchColumns[tableName]
if !ok {
log.Trace("Search using LIKE backend, couldn't find columns for this table", "table", tableName)
return nil
}
words := strings.Fields(s)
wordFilters := And{}
for _, word := range words {
colFilters := Or{}
for _, col := range columns {
colFilters = append(colFilters, Like{tableName + "." + col: "%" + word + "%"})
}
wordFilters = append(wordFilters, colFilters)
}
log.Trace("Search using LIKE backend", "query", wordFilters, "table", tableName)
return wordFilters
}
// ftsSearchColumns defines which FTS5 columns are included in general search.
// Columns not listed here are indexed but not searched by default,
// enabling future additions (comments, lyrics, bios) without affecting general search.
var ftsSearchColumns = map[string]string{
"media_file": "{title album artist album_artist sort_title sort_album_name sort_artist_name sort_album_artist_name disc_subtitle search_participants search_normalized}",
"album": "{name sort_album_name album_artist search_participants discs catalog_num album_version search_normalized}",
"artist": "{name sort_artist_name search_normalized}",
}
// ftsSearchExpr generates an FTS5 MATCH-based search filter.
// If the query produces no FTS tokens (e.g., punctuation-only like "!!!!!!!"),
// it falls back to LIKE-based search.
func ftsSearchExpr(tableName string, s string) Sqlizer {
q := buildFTS5Query(s)
if q == "" {
s = strings.TrimSpace(s)
if s != "" {
log.Trace("Search using LIKE fallback for non-tokenizable query", "table", tableName, "query", s)
return likeSearchExpr(tableName, s)
}
return nil
}
ftsTable := tableName + "_fts"
matchExpr := q
if cols, ok := ftsSearchColumns[tableName]; ok {
matchExpr = cols + " : (" + q + ")"
}
filter := Expr(
tableName+".rowid IN (SELECT rowid FROM "+ftsTable+" WHERE "+ftsTable+" MATCH ?)",
matchExpr,
)
log.Trace("Search using FTS5 backend", "table", tableName, "query", q, "filter", filter)
return filter
}