How it works

The pipeline has seven stages. Each runs independently and on its own schedule.

1. Capture

A class extension on the core search class hands every completed search to the capture listener. Before a row is written, the listener evaluates a sequence of gates in order:

  1. Master switch (Enable capturing) is on.
  2. The search is a genuine, member-initiated search submission, not a results-page re-run, a widget search, or a search-as-you-type autocomplete request.
  3. Query length is at least Minimum query length.
  4. The search type is not in the private-by-design exclusion list (direct messages, conversations).
  5. The search type appears in Capture these search types.
  6. The visitor is not a guest, or Capture guest searches is on.
  7. The visitor is not a member of an Excluded user group.
  8. The query does not match a denylist entry.
  9. The normalised query is non-empty.

If all gates pass, the row is pushed into an in-memory write buffer.
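
The gate sequence behaves like a short-circuit chain: the first gate that fails stops evaluation and nothing is buffered. The Python sketch below is only an illustration of that ordering, not the add-on's code. The Settings and Search fields, the normalise helper, and first_failed_gate are hypothetical names, and exact-match denylist checking is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:  # hypothetical option names mirroring the settings above
    enable_capturing: bool = True
    min_query_length: int = 3
    private_types: frozenset = frozenset({"direct_message", "conversation"})
    captured_types: frozenset = frozenset({"post", "thread"})
    capture_guests: bool = False
    excluded_groups: frozenset = frozenset()
    denylist: frozenset = frozenset()


@dataclass
class Search:  # hypothetical shape of a completed search
    query: str
    search_type: str
    member_initiated: bool = True
    is_guest: bool = False
    user_groups: frozenset = frozenset()


def normalise(query: str) -> str:
    # Assumed normalisation: lowercase and collapse whitespace.
    return " ".join(query.lower().split())


def first_failed_gate(s: Search, cfg: Settings) -> Optional[int]:
    """Return the number of the first gate that fails, or None if all pass."""
    gates = [
        lambda: cfg.enable_capturing,                           # 1
        lambda: s.member_initiated,                             # 2
        lambda: len(s.query) >= cfg.min_query_length,           # 3
        lambda: s.search_type not in cfg.private_types,         # 4
        lambda: s.search_type in cfg.captured_types,            # 5
        lambda: (not s.is_guest) or cfg.capture_guests,         # 6
        lambda: not (s.user_groups & cfg.excluded_groups),      # 7
        lambda: normalise(s.query) not in cfg.denylist,         # 8 (exact match assumed)
        lambda: normalise(s.query) != "",                       # 9
    ]
    for number, gate in enumerate(gates, start=1):
        if not gate():
            return number
    return None
```

Returning the gate number rather than a plain boolean makes it easy to see which rule dropped a given search when debugging a configuration.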

2. Buffer and flush

The buffer holds rows until one of three conditions is met:

  • The buffer reaches Write buffer size (default 50 entries).
  • The oldest buffered row is older than Write buffer max age (default 30 seconds).
  • The request shuts down.

A single multi-row INSERT writes the batched rows into xf_mc_sa_search_log. If the write fails, the buffer increments a telemetry counter; the search request itself is never affected.
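
A minimal Python sketch of the three flush triggers follows. It is an illustration under stated assumptions rather than the add-on's implementation: the WriteBuffer class, its write_rows callback (standing in for the multi-row INSERT), and the failed_flushes counter are hypothetical. The sketch checks the age condition only when a new row arrives, and it drops rows whose write failed; the source does not say what happens to those rows.

```python
import time


class WriteBuffer:
    """Hypothetical in-memory buffer; names and structure are illustrative."""

    def __init__(self, write_rows, max_size=50, max_age=30.0, clock=time.monotonic):
        self._write_rows = write_rows  # callable that performs one multi-row INSERT
        self._max_size = max_size      # "Write buffer size"
        self._max_age = max_age        # "Write buffer max age", in seconds
        self._clock = clock
        self._rows = []
        self._oldest = None            # arrival time of the oldest buffered row
        self.failed_flushes = 0        # stand-in for the telemetry counter

    def push(self, row):
        if not self._rows:
            self._oldest = self._clock()
        self._rows.append(row)
        too_full = len(self._rows) >= self._max_size
        too_old = self._clock() - self._oldest > self._max_age
        if too_full or too_old:
            self.flush()

    def flush(self):
        """Write everything buffered; also meant to be called at request shutdown."""
        if not self._rows:
            return
        rows, self._rows, self._oldest = self._rows, [], None
        try:
            self._write_rows(rows)
        except Exception:
            # A failed write is counted and swallowed so the search request
            # that triggered it is unaffected. Dropping the rows is an assumption.
            self.failed_flushes += 1
```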

3. Daily rollup

The Daily rollup cron entry runs once a day. It groups raw rows by (query_hash, search_type, day) and writes one pseudonymised row per group into xf_mc_sa_search_aggregate. Aggregate rows carry counts only: no IP, no user id. Rollup chunk size caps how many aggregate rows are processed per pass.
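
The grouping step can be pictured as a counter keyed by (query_hash, search_type, day). The Python sketch below is illustrative only: the raw row fields (query, searched_at, user_id, ip_hash), the output field names, and the use of SHA-256 and UTC days are assumptions, since the source specifies neither the hash function nor the column names.

```python
import hashlib
from collections import Counter
from datetime import datetime, timezone


def query_hash(normalised_query: str) -> str:
    # Assumption: the real hash function is not documented; SHA-256 is a stand-in.
    return hashlib.sha256(normalised_query.encode("utf-8")).hexdigest()


def rollup(raw_rows):
    """Group raw rows by (query_hash, search_type, day) and keep counts only."""
    counts = Counter()
    for row in raw_rows:
        # Assumption: days are bucketed in UTC from a Unix timestamp.
        day = datetime.fromtimestamp(row["searched_at"], tz=timezone.utc).date()
        counts[(query_hash(row["query"]), row["search_type"], day)] += 1
    # Identifying fields such as user_id and ip_hash are never copied over.
    return [
        {"query_hash": h, "search_type": t, "day": d.isoformat(), "searches": n}
        for (h, t, d), n in counts.items()
    ]
```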

4. Clustering

When Enable intent clustering is on, the rollup pass also collapses lexically similar normalised queries into shared cluster IDs. Stopwords are stripped using the lists selected in Clustering stopword languages. Clusters smaller than Minimum cluster size are kept but flagged as low confidence. See Clusters.
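
The source does not describe the similarity measure, so the Python sketch below substitutes a deliberately simple one: queries that share the same set of non-stopword tokens land in the same cluster. The stopword list, cluster_key, cluster, and the reading of Minimum cluster size as a count of distinct queries are all assumptions made for illustration.

```python
from collections import defaultdict

# Tiny illustrative list; the add-on selects real lists per language.
STOPWORDS_EN = frozenset({"a", "an", "the", "how", "to", "in", "of", "for"})


def cluster_key(normalised_query, stopwords=STOPWORDS_EN):
    """Strip stopwords, then ignore word order and duplicates."""
    tokens = [t for t in normalised_query.split() if t not in stopwords]
    return " ".join(sorted(set(tokens)))


def cluster(queries, min_cluster_size=2):
    groups = defaultdict(list)
    for q in queries:
        groups[cluster_key(q)].append(q)
    return [
        {
            "key": key,
            "queries": members,
            # Small clusters are kept, only flagged.
            "low_confidence": len(members) < min_cluster_size,
        }
        for key, members in groups.items()
    ]
```

With this key, "how to change avatar", "change the avatar", and "avatar change" collapse into one cluster, while a lone "reset password" is kept and flagged as low confidence.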

5. Trending

The Trending cron entry runs once a day and refreshes all three snapshot periods (day, week, month). For each candidate query, the detector compares the current period's volume against the average of the previous four periods. Queries with fewer than Minimum searches for trending in the current period are excluded. Results are written to xf_mc_sa_search_trending_snapshot. See Trending.
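
The comparison against the previous four periods can be sketched as a simple ratio. This Python example is a hedged illustration: trend_ratio, the default threshold of 5, and the choice to return None for a query with no prior volume are assumptions, as the source does not say how the detector scores or ranks candidates.

```python
def trend_ratio(current, previous_four, min_searches=5):
    """Return current volume divided by the mean of the previous four periods.

    Returns None when the query falls below the minimum-searches threshold
    or when there is no baseline to compare against.
    """
    if len(previous_four) != 4:
        raise ValueError("expected exactly four previous periods")
    if current < min_searches:
        return None  # excluded: too few searches in the current period
    baseline = sum(previous_four) / 4
    if baseline == 0:
        return None  # assumption: brand-new queries are not scored in this sketch
    return current / baseline
```

For example, a query searched 40 times this week after four weeks of 10 searches each yields a ratio of 4.0.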

6. Widget cache and retention

The Widget cache cron entry rebuilds public widget payloads every fifteen minutes and stores them in the data registry. Payloads expire after Widget cache lifetime. The Retention prune cron entry runs daily and deletes raw rows older than Raw log retention and aggregate rows older than Aggregate retention. Both deletions are chunked by Retention prune chunk size.
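
Chunked pruning keeps each DELETE small so that no single statement holds locks for long. The Python sketch below shows the pattern, with SQLite standing in for the forum's database. The prune function and the searched_at column are hypothetical, and looping until nothing is left within a single run is an assumption; the real job may process a limited number of chunks per run.

```python
import sqlite3


def prune(conn, table, date_column, cutoff, chunk_size):
    """Delete rows older than cutoff in chunks; return the total rows deleted.

    table and date_column must be trusted constants, never user input,
    because they are interpolated into the SQL text.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f"SELECT rowid FROM {table} WHERE {date_column} < ? LIMIT ?)",
            (cutoff, chunk_size),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < chunk_size:
            # A short chunk means no older rows remain.
            return total
```

The same function would be called once per tier, each with its own table and cutoff.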

7. Search-engine diagnostics

The Diagnostics sweep cron entry runs hourly. It looks at clusters whose searches are almost all zero-result (80 percent or more) and works out why those searches keep failing, refreshing each cluster's verdict about once a day.
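
The selection rule for the sweep can be sketched as a predicate over a cluster. In this Python illustration the field names (searches, zero_result_searches, verdict_at), the function needs_diagnosis, and the exact 24-hour refresh interval are assumptions; only the 80 percent threshold comes from the description above.

```python
ZERO_RESULT_PERCENT = 80          # threshold stated above
VERDICT_MAX_AGE = 24 * 3600       # "about once a day"; exact interval is assumed


def needs_diagnosis(cluster, now):
    """Return True when a cluster is mostly zero-result and its verdict is stale."""
    searches = cluster["searches"]
    if searches == 0:
        return False
    # Integer comparison avoids floating-point edge cases at exactly 80 percent.
    mostly_failing = cluster["zero_result_searches"] * 100 >= searches * ZERO_RESULT_PERCENT
    last_verdict = cluster.get("verdict_at")  # Unix timestamp, or missing
    stale = last_verdict is None or now - last_verdict >= VERDICT_MAX_AGE
    return mostly_failing and stale
```

Because the sweep runs hourly but verdicts refresh roughly daily, most hourly passes would skip clusters that already carry a recent verdict.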

When Enhanced Search (XFES) is installed, the sweep queries the Elasticsearch index for a typo correction, a relevance verdict (the content is there but ranks poorly, against a genuine content gap), and suggested tags. Without XFES it falls back to tag suggestions taken from the most-used site tags. The verdict shows on the cluster detail view. See Clusters.

Two independent retention tiers

Tier       Storage                    Default   What it loses on prune
Raw        xf_mc_sa_search_log        30 days   The identifiable row (optional user id, optional IP hash).
Aggregate  xf_mc_sa_search_aggregate  730 days  The pseudonymised count. Survives raw prune untouched.

The aggregate tier is what powers every chart, trending detection, and clustering long-term. Shortening the raw window only shortens the per-query drill-down.