Mirror of https://github.com/TecharoHQ/anubis.git
.github/actions/spelling/allow.txt (vendored, 3 changes)
@@ -12,3 +12,6 @@ maintnotifications
 azurediamond
 cooldown
 verifyfcrdns
+Spintax
+spintax
+clampip
@@ -95,49 +95,49 @@ bots:
   #   weight:
   #     adjust: -10

-  # # Assert behaviour that only genuine browsers display. This ensures that Chrome
-  # # or Firefox versions
-  # - name: realistic-browser-catchall
-  #   expression:
-  #     all:
-  #       - '"User-Agent" in headers'
-  #       - '( userAgent.contains("Firefox") ) || ( userAgent.contains("Chrome") ) || ( userAgent.contains("Safari") )'
-  #       - '"Accept" in headers'
-  #       - '"Sec-Fetch-Dest" in headers'
-  #       - '"Sec-Fetch-Mode" in headers'
-  #       - '"Sec-Fetch-Site" in headers'
-  #       - '"Accept-Encoding" in headers'
-  #       - '( headers["Accept-Encoding"].contains("zstd") || headers["Accept-Encoding"].contains("br") )'
-  #       - '"Accept-Language" in headers'
-  #   action: WEIGH
-  #   weight:
-  #     adjust: -10
+  # Assert behaviour that only genuine browsers display. This ensures that Chrome
+  # or Firefox versions
+  - name: realistic-browser-catchall
+    expression:
+      all:
+        - '"User-Agent" in headers'
+        - '( userAgent.contains("Firefox") ) || ( userAgent.contains("Chrome") ) || ( userAgent.contains("Safari") )'
+        - '"Accept" in headers'
+        - '"Sec-Fetch-Dest" in headers'
+        - '"Sec-Fetch-Mode" in headers'
+        - '"Sec-Fetch-Site" in headers'
+        - '"Accept-Encoding" in headers'
+        - '( headers["Accept-Encoding"].contains("zstd") || headers["Accept-Encoding"].contains("br") )'
+        - '"Accept-Language" in headers'
+    action: WEIGH
+    weight:
+      adjust: -10

-  # # The Upgrade-Insecure-Requests header is typically sent by browsers, but not always
-  # - name: upgrade-insecure-requests
-  #   expression: '"Upgrade-Insecure-Requests" in headers'
-  #   action: WEIGH
-  #   weight:
-  #     adjust: -2
+  # The Upgrade-Insecure-Requests header is typically sent by browsers, but not always
+  - name: upgrade-insecure-requests
+    expression: '"Upgrade-Insecure-Requests" in headers'
+    action: WEIGH
+    weight:
+      adjust: -2

-  # # Chrome should behave like Chrome
-  # - name: chrome-is-proper
-  #   expression:
-  #     all:
-  #       - userAgent.contains("Chrome")
-  #       - '"Sec-Ch-Ua" in headers'
-  #       - 'headers["Sec-Ch-Ua"].contains("Chromium")'
-  #       - '"Sec-Ch-Ua-Mobile" in headers'
-  #       - '"Sec-Ch-Ua-Platform" in headers'
-  #   action: WEIGH
-  #   weight:
-  #     adjust: -5
+  # Chrome should behave like Chrome
+  - name: chrome-is-proper
+    expression:
+      all:
+        - userAgent.contains("Chrome")
+        - '"Sec-Ch-Ua" in headers'
+        - 'headers["Sec-Ch-Ua"].contains("Chromium")'
+        - '"Sec-Ch-Ua-Mobile" in headers'
+        - '"Sec-Ch-Ua-Platform" in headers'
+    action: WEIGH
+    weight:
+      adjust: -5

-  # - name: should-have-accept
-  #   expression: '!("Accept" in headers)'
-  #   action: WEIGH
-  #   weight:
-  #     adjust: 5
+  - name: should-have-accept
+    expression: '!("Accept" in headers)'
+    action: WEIGH
+    weight:
+      adjust: 5

   # Generic catchall rule
   - name: generic-browser
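For context, the `expression` blocks in these rules are CEL (Common Expression Language) programs. As a rough, self-contained sketch of how one such check can be evaluated with the cel-go library (the variable declarations below are illustrative assumptions, not Anubis' actual wiring):

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	// Declare the variables that the bot-policy expressions reference.
	// These declarations are assumptions for illustration.
	env, err := cel.NewEnv(
		cel.Variable("userAgent", cel.StringType),
		cel.Variable("headers", cel.MapType(cel.StringType, cel.StringType)),
	)
	if err != nil {
		panic(err)
	}

	// A condition in the style of the realistic-browser-catchall rule above.
	ast, iss := env.Compile(`"Accept" in headers && userAgent.contains("Chrome")`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}

	prg, err := env.Program(ast)
	if err != nil {
		panic(err)
	}

	out, _, err := prg.Eval(map[string]any{
		"userAgent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/126.0",
		"headers":   map[string]string{"Accept": "text/html"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // true
}
```

When such an expression evaluates to true, the rule's `WEIGH` action applies its `adjust` value to the client's weight, as the config above shows.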
@@ -28,6 +28,12 @@ Anubis is back and better than ever! Lots of minor fixes with some big ones inte
 - Open Graph passthrough now reuses the configured target Host/SNI/TLS settings, so metadata fetches succeed when the upstream certificate differs from the public domain. ([1283](https://github.com/TecharoHQ/anubis/pull/1283))
 - Stabilize the CVE-2025-24369 regression test by always submitting an invalid proof instead of relying on random POW failures.
 
+### Dataset poisoning
+
+Anubis has the ability to engage in [dataset poisoning attacks](https://www.anthropic.com/research/small-samples-poison) using the [dataset poisoning subsystem](./admin/honeypot/overview.mdx). This lets every Anubis instance act as a honeypot that attracts and flags abusive scrapers, with no administrator action required to ban them.
+
+There is much more information about this feature in [the dataset poisoning subsystem documentation](./admin/honeypot/overview.mdx). Administrators who are interested in learning how this feature works should consult that documentation.
+
 ### Deprecate `report_as` in challenge configuration
 
 Previously, Anubis let you lie to users about the difficulty of a challenge in order to interfere with operators of malicious scrapers as a psychological attack:
docs/docs/admin/honeypot/_category_.json (new file, 8 lines)
@@ -0,0 +1,8 @@
{
  "label": "Honeypot",
  "position": 40,
  "link": {
    "type": "generated-index",
    "description": "Honeypot features in Anubis, allowing Anubis to passively detect malicious crawlers."
  }
}
docs/docs/admin/honeypot/overview.mdx (new file, 40 lines)
@@ -0,0 +1,40 @@
---
title: Dataset poisoning
---

Anubis offers the ability to participate in [dataset poisoning](https://www.anthropic.com/research/small-samples-poison) attacks, similar to what [iocaine](https://iocaine.madhouse-project.org/) and other tools offer. Currently this feature is in a preview state where many details are hard-coded in order to test the viability of the approach.

In essence, when Anubis challenge and error pages are rendered, they include a small bit of HTML that browsers ignore but that scrapers interpret as a link to ingest. That link leads into a small forest of recursive nothing pages (a sketch of the markup follows the list below) designed according to the following principles:

- These pages are _cheap_ to render, taking at most ten milliseconds on decently specced hardware.
- These pages are _vacuous_: they are essentially devoid of content, so a human would find them odd and click away, but a scraper has no way to know that and will continue through the forest.
- These pages are _fairly large_, so that scrapers don't mistake them for error pages or otherwise conclude they are devoid of content.
- These pages are _fully self-contained_, so that they load fast without incurring additional load from resource fetches.

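As a concrete illustration of the entrance mechanism described above, here is a minimal sketch of the kind of markup involved: a link hidden from humans but plainly visible to any scraper that follows every `href`. The template, attributes, and path below are assumptions for illustration, not Anubis' actual output:

```go
package main

import (
	"html/template"
	"os"
)

// mazeLink is a hypothetical template for a poisoned entrance link:
// hidden from humans via CSS, skipped by assistive tech via aria-hidden,
// but still an ordinary <a href> to a naive scraper.
var mazeLink = template.Must(template.New("maze").Parse(
	`<a href="{{.}}" style="display:none" aria-hidden="true" tabindex="-1">&#8203;</a>`))

func main() {
	// Illustrative path only; per the implementation notes below, the real
	// routes live under the /.within.website/x/cmd/anubis URL hierarchy.
	if err := mazeLink.Execute(os.Stdout, "/.within.website/x/cmd/anubis/maze"); err != nil {
		panic(err)
	}
}
```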
In this limited preview state, Anubis generates pages using [spintax](https://outboundly.ai/blogs/what-is-spintax-and-how-to-use-it/). Spintax is a syntax used to create variants of an utterance, originally for marketing messages and email spam that evade word filtering. In its current form, Anubis' dataset poisoning ships with AI-generated spintax that produces vapid LinkedIn posts with some Western occultism thrown in for good measure. This results in utterances like the following:

> There's a moment when visionaries are being called to realize that the work can't be reduced to optimization, but about resonance. We don't transform products by grinding endlessly, we do it by holding the vision. Because meaning can't be forced, it unfolds over time when culture are in integrity. This moment represents a fundamental reimagining in how we think about work. This isn't a framework, it's a lived truth that requires courage. When we get honest, we activate nonlinear growth that don't show up in dashboards, but redefine success anyway.

To humans it should be fairly transparent that this is pseudoprofound anti-content and a signal to click away.

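For readers unfamiliar with the technique, a spintax expander fits in a few lines of Go. This toy sketch (the function and grammar handling are assumptions, not Anubis' generator) picks one alternative for each `{a|b|c}` group, resolving the rightmost group first so that nested groups also work:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// expand resolves spintax by repeatedly rewriting the rightmost '{...}'
// group, which is always innermost, with one randomly chosen alternative.
func expand(s string, rng *rand.Rand) string {
	for {
		open := strings.LastIndex(s, "{")
		if open == -1 {
			return s
		}
		rel := strings.Index(s[open:], "}")
		if rel == -1 {
			return s // unbalanced braces; bail out
		}
		end := open + rel
		options := strings.Split(s[open+1:end], "|")
		s = s[:open] + options[rng.Intn(len(options))] + s[end+1:]
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(expand("We don't transform {products|teams} by {grinding endlessly|optimizing}, we do it by {holding the vision|honoring the process}.", rng))
}
```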
## Plans

Future versions of this feature will allow for more customization. In the near term, it will be configurable via the following mechanisms:

- WebAssembly logic for customizing how the poisoning data is generated (with examples including the existing spintax method).
- Weight thresholds and logic for how they are interpreted by Anubis.
- Other configuration settings as facts and circumstances dictate.

## Implementation notes

In its current implementation, the Anubis dataset poisoning feature has the following flaws that may hinder production deployments:

- All Anubis instances use the same method for generating dataset poisoning information. This may be easy for malicious actors to detect and ignore.
- Anubis dataset poisoning routes are under the `/.within.website/x/cmd/anubis` URL hierarchy. This may be easy for malicious actors to detect and ignore.

Right now Anubis assigns 30 weight points if the following criteria are met:

- A client's User-Agent has been observed in the dataset poisoning maze at least 25 times.
- The network-clamped IP address (/24 for IPv4 and /48 for IPv6) has been observed in the dataset poisoning maze at least 25 times.

Additionally, when a given client has been observed by both User-Agent and network-clamped IP address, Anubis will emit log lines warning about it so that administrative action can be taken, up to and including [filing abuse reports with the network owner](/blog/2025/file-abuse-reports).

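A sketch of what that network clamping means in practice, using Go's net/netip (the function name is hypothetical, not Anubis' API):

```go
package main

import (
	"fmt"
	"net/netip"
)

// clampNetwork reduces an address to its containing /24 (IPv4) or /48
// (IPv6) prefix, mirroring the clamping described above, so that a whole
// neighborhood of addresses counts as one observed network.
func clampNetwork(addr netip.Addr) netip.Prefix {
	bits := 24
	if addr.Is6() && !addr.Is4In6() {
		bits = 48
	}
	return netip.PrefixFrom(addr, bits).Masked()
}

func main() {
	fmt.Println(clampNetwork(netip.MustParseAddr("203.0.113.77"))) // 203.0.113.0/24
	fmt.Println(clampNetwork(netip.MustParseAddr("2001:db8::1")))  // 2001:db8::/48
}
```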
@@ -172,11 +172,9 @@ func (i *Impl) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	if stage == "init" {
 		lg.Debug("found new entrance point", "id", id, "stage", stage, "userAgent", r.UserAgent(), "clampedIP", network)
 	} else {
-		if networkCount >= 50 && networkCount%256 == 0 {
-			lg.Warn("found possible crawler", "id", id)
-		}
-		if uaCount >= 50 && uaCount%256 == 0 {
-			lg.Warn("found possible crawler", "id", id)
+		switch {
+		case networkCount%256 == 0, uaCount%256 == 0:
+			lg.Warn("found possible crawler", "id", id, "network", network)
 		}
 	}