feat(server): implement FTS5-based full-text search (#5079)

* build: add sqlite_fts5 build tag to enable FTS5 support

* feat: add SearchBackend config option (default: fts)

* feat: add buildFTS5Query for safe FTS5 query preprocessing

* feat: add FTS5 search backend with config toggle, refactor legacy search

- Add searchExprFunc type and getSearchExpr() for backend selection
- Rename fullTextExpr to legacySearchExpr
- Add ftsSearchExpr using FTS5 MATCH subquery
- Update fullTextFilter in sql_restful.go to use configured backend

* feat: add FTS5 migration with virtual tables, triggers, and search_participants

Creates FTS5 virtual tables for media_file, album, and artist with unicode61 tokenizer and diacritic folding. Adds search_participants column, populates from JSON, and sets up INSERT/UPDATE/DELETE triggers.

* feat: populate search_participants in PostMapArgs for FTS5 indexing

* test: add FTS5 search integration tests

* fix: exclude FTS5 virtual tables from e2e DB restore

The restoreDB function iterates all tables in sqlite_master and runs DELETE + INSERT to reset state. FTS5 contentless virtual tables cannot be directly deleted from. Since triggers handle FTS5 sync automatically, simply skip tables matching *_fts and *_fts_* patterns.

* build: add compile-time guard for sqlite_fts5 build tag

Same pattern as netgo: compilation fails with a clear error if the sqlite_fts5 build tag is missing.

* build: add sqlite_fts5 tag to reflex dev server config

* build: extract GO_BUILD_TAGS variable in Makefile to avoid duplication

* fix: strip leading * from FTS5 queries to prevent "unknown special query" error

* feat: auto-append prefix wildcard to FTS5 search tokens for broader matching

Every plain search token now gets a trailing * appended (e.g., "love" becomes "love*"), so searching for "love" also matches "lovelace", "lovely", etc. Quoted phrases are preserved as exact matches without wildcards. Results are ordered alphabetically by name/title, so shorter exact matches naturally appear first.

* fix: clarify comments about FTS5 operator neutralization

The comments said "strip" but the code lowercases operators to neutralize them (FTS5 operators are case-sensitive). Updated comments to accurately describe the behavior.

* fix: use fmt.Sprintf for FTS5 phrase placeholders

The previous encoding used rune('0'+index) which silently breaks with 10+ quoted phrases. Use fmt.Sprintf for arbitrary index support.

* fix: validate and normalize SearchBackend config option

Normalize the value to lowercase and fall back to "fts" with a log warning for unrecognized values. This prevents silent misconfiguration from typos like "FTS", "Legacy", or "fts5".

* refactor: improve documentation for build tags and FTS5 requirements

Signed-off-by: Deluan <deluan@navidrome.org>

* refactor: convert FTS5 query and search backend normalization tests to DescribeTable format

Signed-off-by: Deluan <deluan@navidrome.org>

* fix: add sqlite_fts5 build tag to golangci configuration

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: add UISearchDebounceMs configuration option and update related components

Signed-off-by: Deluan <deluan@navidrome.org>

* fix: fall back to legacy search when SearchFullString is enabled

FTS5 is token-based and cannot match substrings within words, so getSearchExpr now returns legacySearchExpr when SearchFullString is true, regardless of SearchBackend setting.

* fix: add sqlite_fts5 build tag to CI pipeline and Dockerfile

* fix: add WHEN clauses to FTS5 AFTER UPDATE triggers

Added WHEN clauses to the media_file_fts_au, album_fts_au, and artist_fts_au triggers so they only fire when FTS-indexed columns actually change. Previously, every row update (e.g., play count, rating, starred status) triggered an unnecessary delete+insert cycle in the FTS shadow tables.

The WHEN clauses use IS NOT for NULL-safe comparison of each indexed column, avoiding FTS index churn for non-indexed updates.

* feat: add SearchBackend configuration option to data and insights components

Signed-off-by: Deluan <deluan@navidrome.org>

* fix: enhance input sanitization for FTS5 by stripping additional punctuation and special characters

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: add search_normalized column for punctuated name search (R.E.M., AC/DC)

Add index-time normalization and query-time single-letter collapsing to fix FTS5 search for punctuated names. A new search_normalized column stores concatenated forms of punctuated words (e.g., "R.E.M." → "REM", "AC/DC" → "ACDC") and is indexed in FTS5 tables. At query time, runs of consecutive single letters (from dot-stripping) are collapsed into OR expressions like ("R E M" OR REM*) to match both the original tokens and the normalized form.

This enables searching by "R.E.M.", "REM", "AC/DC", "ACDC", "A-ha", or "Aha" and finding the correct results.

* refactor: simplify isSingleUnicodeLetter to avoid []rune allocation

Use utf8.DecodeRuneInString to check for a single Unicode letter instead of converting the entire string to a []rune slice.

* feat: define ftsSearchColumns for flexible FTS5 search column inclusion

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: update collapseSingleLetterRuns to return quoted phrases for abbreviations

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: implement extractPunctuatedWords to handle artist/album names with embedded punctuation

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: implement extractPunctuatedWords to handle artist/album names with embedded punctuation

Signed-off-by: Deluan <deluan@navidrome.org>

* refactor: punctuated word handling to improve processing of artist/album names

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: add CJK support for search queries with LIKE filters

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: enhance FTS5 search by adding album version support and CJK handling

Signed-off-by: Deluan <deluan@navidrome.org>

* refactor: search configuration to use structured options

Signed-off-by: Deluan <deluan@navidrome.org>

* feat: enhance search functionality to support punctuation-only queries and update related tests

Signed-off-by: Deluan <deluan@navidrome.org>

---------

Signed-off-by: Deluan <deluan@navidrome.org>
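
The index-time normalization described in the commit message ("R.E.M." → "REM", "AC/DC" → "ACDC") can be tried outside the server with a small standalone program. The sketch below copies the logic of normalizeForFTS from this commit into package main. The names normalizeSketch and punctStrip are illustrative only and are not part of the codebase.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// punctStrip removes everything except Unicode letters and numbers
// (same pattern as fts5PunctStrip in the diff).
var punctStrip = regexp.MustCompile(`[^\p{L}\p{N}]`)

// normalizeSketch is an illustrative copy of normalizeForFTS: it returns the
// stripped form of every word that changed when punctuation was removed,
// deduplicated case-insensitively and joined with spaces.
func normalizeSketch(values ...string) string {
	seen := make(map[string]struct{})
	var out []string
	for _, v := range values {
		for _, word := range strings.Fields(v) {
			stripped := punctStrip.ReplaceAllString(word, "")
			if stripped == "" || stripped == word {
				continue // nothing left, or the word had no punctuation
			}
			key := strings.ToLower(stripped)
			if _, dup := seen[key]; dup {
				continue
			}
			seen[key] = struct{}{}
			out = append(out, stripped)
		}
	}
	return strings.Join(out, " ")
}

func main() {
	// Words without punctuation ("Plain", "Name") contribute nothing.
	fmt.Println(normalizeSketch("R.E.M.", "AC/DC", "a-ha", "Plain Name"))
	// prints: REM ACDC aha
}
```

Only words that change under stripping are stored, so the extra column stays empty for the common case of names without punctuation.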
2026-02-21 17:52:42 -05:00
parent 6f5f58ae9d
commit 54de0dbc52
35 changed files with 1283 additions and 56 deletions
@@ -0,0 +1,261 @@
+package persistence
+
+import (
+	"fmt"
+	"regexp"
+	"strings"
+	"unicode"
+	"unicode/utf8"
+
+	. "github.com/Masterminds/squirrel"
+	"github.com/navidrome/navidrome/log"
+)
+
+// containsCJK returns true if the string contains any CJK (Chinese/Japanese/Korean) characters.
+// CJK text doesn't use spaces between words, so FTS5's unicode61 tokenizer treats entire
+// CJK phrases as single tokens, making token-based search ineffective for CJK content.
+func containsCJK(s string) bool {
+	for _, r := range s {
+		if unicode.Is(unicode.Han, r) ||
+			unicode.Is(unicode.Hiragana, r) ||
+			unicode.Is(unicode.Katakana, r) ||
+			unicode.Is(unicode.Hangul, r) {
+			return true
+		}
+	}
+	return false
+}
+
+// fts5SpecialChars matches characters that should be stripped from user input.
+// We keep only Unicode letters, numbers, whitespace, * (prefix wildcard), " (phrase quotes),
+// and \x00 (internal placeholder marker). All punctuation is removed because the unicode61
+// tokenizer treats it as token separators, and characters like ' can cause FTS5 parse errors
+// as unbalanced string delimiters.
+var fts5SpecialChars = regexp.MustCompile(`[^\p{L}\p{N}\s*"\x00]`)
+
+// fts5PunctStrip strips everything except letters and numbers (no whitespace, wildcards, or quotes).
+// Used for normalizing words at index time to create concatenated forms (e.g., "R.E.M." → "REM").
+var fts5PunctStrip = regexp.MustCompile(`[^\p{L}\p{N}]`)
+
+// fts5Operators matches the FTS5 operator keywords AND, OR, NOT, and NEAR as whole words, in any letter case.
+var fts5Operators = regexp.MustCompile(`(?i)\b(AND|OR|NOT|NEAR)\b`)
+
+// fts5LeadingStar matches a * at the start of a token. FTS5 only supports * at the end (prefix queries).
+var fts5LeadingStar = regexp.MustCompile(`(^|[\s])\*+`)
+
+// normalizeForFTS takes multiple strings, strips non-letter/non-number characters from each word,
+// and returns a space-separated string of words that changed after stripping (deduplicated).
+// This is used at index time to create concatenated forms: "R.E.M." → "REM", "AC/DC" → "ACDC".
+func normalizeForFTS(values ...string) string {
+	seen := make(map[string]struct{})
+	var result []string
+	for _, v := range values {
+		for _, word := range strings.Fields(v) {
+			stripped := fts5PunctStrip.ReplaceAllString(word, "")
+			if stripped == "" || stripped == word {
+				continue
+			}
+			lower := strings.ToLower(stripped)
+			if _, ok := seen[lower]; ok {
+				continue
+			}
+			seen[lower] = struct{}{}
+			result = append(result, stripped)
+		}
+	}
+	return strings.Join(result, " ")
+}
+
+// isSingleUnicodeLetter returns true if token is exactly one Unicode letter.
+func isSingleUnicodeLetter(token string) bool {
+	r, size := utf8.DecodeRuneInString(token)
+	return size == len(token) && size > 0 && unicode.IsLetter(r)
+}
+
+// namePunctuation is the set of characters commonly used as separators in artist/album
+// names (hyphens, slashes, dots, apostrophes). Only words containing these are candidates
+// for punctuated-word processing; other special characters (^, :, &) are just stripped.
+const namePunctuation = `-/.''`
+
+// processPunctuatedWords handles words with embedded name punctuation before the general
+// special-character stripping. For each punctuated word it produces either:
+//   - A quoted phrase for dotted abbreviations: R.E.M. → "R E M"
+//   - A phrase+concat OR for other patterns:    a-ha  → ("a ha" OR aha*)
+func processPunctuatedWords(input string, phrases []string) (string, []string) {
+	words := strings.Fields(input)
+	var result []string
+	for _, w := range words {
+		if strings.HasPrefix(w, "\x00") || strings.ContainsAny(w, `*"`) || !strings.ContainsAny(w, namePunctuation) {
+			result = append(result, w)
+			continue
+		}
+		concat := fts5PunctStrip.ReplaceAllString(w, "")
+		if concat == "" || concat == w {
+			result = append(result, w)
+			continue
+		}
+		subTokens := strings.Fields(fts5SpecialChars.ReplaceAllString(w, " "))
+		if len(subTokens) < 2 {
+			// Single sub-token after splitting (e.g., N' → N): just use the stripped form
+			result = append(result, concat)
+			continue
+		}
+		// Dotted abbreviations (R.E.M., U.K.) — all single letters separated by dots only
+		if isDottedAbbreviation(w, subTokens) {
+			phrases = append(phrases, fmt.Sprintf(`"%s"`, strings.Join(subTokens, " ")))
+		} else {
+			// Punctuated names (a-ha, AC/DC, Jay-Z) — phrase for adjacency + concat for search_normalized
+			phrases = append(phrases, fmt.Sprintf(`("%s" OR %s*)`, strings.Join(subTokens, " "), concat))
+		}
+		result = append(result, fmt.Sprintf("\x00PHRASE%d\x00", len(phrases)-1))
+	}
+	return strings.Join(result, " "), phrases
+}
+
+// isDottedAbbreviation returns true if w uses only dots as punctuation and all sub-tokens
+// are single letters (e.g., "R.E.M.", "U.K." but not "a-ha" or "AC/DC").
+func isDottedAbbreviation(w string, subTokens []string) bool {
+	for _, r := range w {
+		if !unicode.IsLetter(r) && !unicode.IsNumber(r) && r != '.' {
+			return false
+		}
+	}
+	for _, st := range subTokens {
+		if !isSingleUnicodeLetter(st) {
+			return false
+		}
+	}
+	return true
+}
+
+// buildFTS5Query preprocesses user input into a safe FTS5 MATCH expression.
+// It preserves quoted phrases and * prefix wildcards, neutralizes FTS5 operators
+// (by lowercasing them, since FTS5 operators are case-sensitive) and strips
+// special characters to prevent query injection.
+func buildFTS5Query(userInput string) string {
+	q := strings.TrimSpace(userInput)
+	if q == "" {
+		return ""
+	}
+
+	var phrases []string
+	result := q
+	for {
+		start := strings.Index(result, `"`)
+		if start == -1 {
+			break
+		}
+		end := strings.Index(result[start+1:], `"`)
+		if end == -1 {
+			// Unmatched quote — remove it
+			result = result[:start] + result[start+1:]
+			break
+		}
+		end += start + 1
+		phrase := result[start : end+1] // includes quotes
+		phrases = append(phrases, phrase)
+		result = result[:start] + fmt.Sprintf("\x00PHRASE%d\x00", len(phrases)-1) + result[end+1:]
+	}
+
+	// Neutralize FTS5 operators by lowercasing them (FTS5 operators are case-sensitive:
+	// AND, OR, NOT, NEAR are operators, but and, or, not, near are plain tokens)
+	result = fts5Operators.ReplaceAllStringFunc(result, strings.ToLower)
+
+	// Handle words with embedded punctuation (a-ha, AC/DC, R.E.M.) before stripping
+	result, phrases = processPunctuatedWords(result, phrases)
+
+	result = fts5SpecialChars.ReplaceAllString(result, " ")
+	result = fts5LeadingStar.ReplaceAllString(result, "$1")
+	tokens := strings.Fields(result)
+
+	// Append * to plain tokens for prefix matching (e.g., "love" → "love*").
+	// Skip tokens that already end in * and phrase placeholders (quoted phrases or punctuated words).
+	for i, t := range tokens {
+		if strings.HasPrefix(t, "\x00") || strings.HasSuffix(t, "*") {
+			continue
+		}
+		tokens[i] = t + "*"
+	}
+
+	result = strings.Join(tokens, " ")
+
+	for i, phrase := range phrases {
+		placeholder := fmt.Sprintf("\x00PHRASE%d\x00", i)
+		result = strings.ReplaceAll(result, placeholder, phrase)
+	}
+
+	return result
+}
+
+// likeSearchColumns defines the core columns to search with LIKE queries.
+// These are the primary user-visible fields for each entity type.
+// Used as a fallback when FTS5 cannot handle the query (e.g., CJK text, punctuation-only input).
+var likeSearchColumns = map[string][]string{
+	"media_file": {"title", "album", "artist", "album_artist"},
+	"album":      {"name", "album_artist"},
+	"artist":     {"name"},
+}
+
+// likeSearchExpr generates LIKE-based search filters against core columns.
+// Each word in the query must match at least one column (AND between words),
+// and each word can match any column (OR within a word).
+// Used as a fallback when FTS5 cannot handle the query (e.g., CJK text, punctuation-only input).
+func likeSearchExpr(tableName string, s string) Sqlizer {
+	s = strings.TrimSpace(s)
+	if s == "" {
+		log.Trace("Search using LIKE backend, query is empty", "table", tableName)
+		return nil
+	}
+	columns, ok := likeSearchColumns[tableName]
+	if !ok {
+		log.Trace("Search using LIKE backend, couldn't find columns for this table", "table", tableName)
+		return nil
+	}
+	words := strings.Fields(s)
+	wordFilters := And{}
+	for _, word := range words {
+		colFilters := Or{}
+		for _, col := range columns {
+			colFilters = append(colFilters, Like{tableName + "." + col: "%" + word + "%"})
+		}
+		wordFilters = append(wordFilters, colFilters)
+	}
+	log.Trace("Search using LIKE backend", "query", wordFilters, "table", tableName)
+	return wordFilters
+}
+
+// ftsSearchColumns defines which FTS5 columns are included in general search.
+// Columns not listed here are indexed but not searched by default,
+// enabling future additions (comments, lyrics, bios) without affecting general search.
+var ftsSearchColumns = map[string]string{
+	"media_file": "{title album artist album_artist sort_title sort_album_name sort_artist_name sort_album_artist_name disc_subtitle search_participants search_normalized}",
+	"album":      "{name sort_album_name album_artist search_participants discs catalog_num album_version search_normalized}",
+	"artist":     "{name sort_artist_name search_normalized}",
+}
+
+// ftsSearchExpr generates an FTS5 MATCH-based search filter.
+// If the query produces no FTS tokens (e.g., punctuation-only like "!!!!!!!"),
+// it falls back to LIKE-based search.
+func ftsSearchExpr(tableName string, s string) Sqlizer {
+	q := buildFTS5Query(s)
+	if q == "" {
+		s = strings.TrimSpace(s)
+		if s != "" {
+			log.Trace("Search using LIKE fallback for non-tokenizable query", "table", tableName, "query", s)
+			return likeSearchExpr(tableName, s)
+		}
+		return nil
+	}
+	ftsTable := tableName + "_fts"
+	matchExpr := q
+	if cols, ok := ftsSearchColumns[tableName]; ok {
+		matchExpr = cols + " : (" + q + ")"
+	}
+
+	filter := Expr(
+		tableName+".rowid IN (SELECT rowid FROM "+ftsTable+" WHERE "+ftsTable+" MATCH ?)",
+		matchExpr,
+	)
+	log.Trace("Search using FTS5 backend", "table", tableName, "query", q, "filter", filter)
+	return filter
+}
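
To see what the main preprocessing steps of buildFTS5Query do to a query, here is a reduced, standalone sketch. It assumes input without quoted phrases or punctuated names (the real function handles both with extra steps) and reuses the three regular expressions from the diff above. sketchQuery is a hypothetical name, not a function in this commit.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Patterns copied from the diff above.
var (
	specialChars = regexp.MustCompile(`[^\p{L}\p{N}\s*"\x00]`)
	operators    = regexp.MustCompile(`(?i)\b(AND|OR|NOT|NEAR)\b`)
	leadingStar  = regexp.MustCompile(`(^|[\s])\*+`)
)

// sketchQuery is a reduced illustration of buildFTS5Query. It skips quoted
// phrase extraction and punctuated-word handling, and keeps only:
// operator lowercasing, special-character stripping, leading-* removal,
// and appending * to every token for prefix matching.
func sketchQuery(input string) string {
	q := strings.TrimSpace(input)
	q = operators.ReplaceAllStringFunc(q, strings.ToLower)
	q = specialChars.ReplaceAllString(q, " ")
	q = leadingStar.ReplaceAllString(q, "$1")
	tokens := strings.Fields(q)
	for i, t := range tokens {
		if !strings.HasSuffix(t, "*") {
			tokens[i] = t + "*"
		}
	}
	return strings.Join(tokens, " ")
}

func main() {
	for _, in := range []string{"love", "Rock AND Roll", "*beat", "!!!"} {
		fmt.Printf("%q -> %q\n", in, sketchQuery(in))
	}
	// "love" -> "love*"
	// "Rock AND Roll" -> "Rock* and* Roll*"
	// "*beat" -> "beat*"
	// "!!!" -> ""
}
```

The last case produces an empty string, which is the situation in which ftsSearchExpr switches to the LIKE-based fallback.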