Currently the honeypotting feature has no limits or delays anywhere and
uses honeypot hits to feed an internal greylist of IP networks. This can
cause issues such as #1613, where Claude's crawler seemed to pick up on
it and egressed data at over one megabit per second until the
administrator noticed and blocked the address range.
This takes a different approach, inspired by how the classic #xkcd
IRC bot Robot9000 works. The first time a given IPv4 /24 or IPv6 /48
visits a honeypot page, Anubis sleeps for 1 millisecond. The second
time it sleeps for 2 milliseconds, the third time 4 milliseconds, and
so on. The goal is to make the scraping inherently self-limiting so
that the scrapers go off in their own corner where they won't really
hurt anyone.
Let's see if this works out according to keikaku.
Ref: https://github.com/TecharoHQ/anubis/issues/1613
Signed-off-by: Xe Iaso <me@xeiaso.net>
* fix(policy): correctly wire subrequest mode through CEL/path checkers
Previously, Anubis only checked for the X-Original-Url header when using
subrequest mode. This header is used by the example nginx config to pass
the request path through from the original client request to Anubis in
order to do path-based filtering.
As it turns out, Traefik hardcodes its own set of headers[1]:
```text
httpdebug-1 | GET /.within.website/x/cmd/anubis/api/check
httpdebug-1 | X-Forwarded-Method: GET
httpdebug-1 | X-Forwarded-Proto: http
httpdebug-1 | X-Forwarded-Server: b9a5d299c929
httpdebug-1 | X-Forwarded-Port: 8080
httpdebug-1 | X-Forwarded-Uri: /
httpdebug-1 | X-Real-Ip: 172.18.0.1
httpdebug-1 | Accept-Encoding: gzip
httpdebug-1 | User-Agent: curl/8.20.0
httpdebug-1 | Accept: */*
httpdebug-1 | X-Forwarded-For: 172.18.0.1
httpdebug-1 | X-Forwarded-Host: localhost:8080
```
As a result, path-based filtering did not work behind Traefik.
This commit fixes the issue by amending how the path-based checking
logic works:
* For CEL based checks, this pipes through the `subrequestMode` flag from
main and alters the behaviour if either `X-Original-Url` or
`X-Forwarded-Url` is found. These values are currently hardcoded for
convenience but probably need to be made configurable in the policy
file at a future date.
* For path-based checks, this uses the existing `subrequestMode` flag
from main and adds `X-Forwarded-Url` to the list of headers it checks.
A smoke test was added to make sure that Traefik in this mode continues
to work. Thank you https://github.com/flifloo for filing a detailed
issue with the relevant configuration fragments. Those configuration
fragments formed the core of this smoke test.
[1]: https://doc.traefik.io/traefik/v3.4/middlewares/http/forwardauth/
Closes: https://github.com/TecharoHQ/anubis/issues/1628
Signed-off-by: Xe Iaso <me@xeiaso.net>
Co-Authored-By: flifloo <flifloo@gmail.com>
* chore: spelling
Signed-off-by: Xe Iaso <me@xeiaso.net>
---------
Signed-off-by: Xe Iaso <me@xeiaso.net>
Co-authored-by: flifloo <flifloo@gmail.com>
* fix: patch GHSA-6wcg-mqvh-fcvg
PR https://github.com/TecharoHQ/anubis/pull/1015 added the ability for
reverse proxies using Anubis in subrequest auth mode to look at the path
of a request as there are many rules in the wild that rely on checking
the path. This is how access to things like robots.txt or anything in the
.well-known directory is unaffected by Anubis.
However, this logic was also enabled for non-subrequest deployments of
Anubis, meaning that a specially crafted request could include a
/.well-known/ path in it and get around Anubis with little effort.
This fix gates the logic behind a new plumbed variable named subrequestMode
that only fires when Anubis is running in subrequest auth mode. This
properly contains that workaround so that the logic does not fire in
most deployments.
Signed-off-by: Xe Iaso <me@xeiaso.net>
* chore: spelling
Signed-off-by: Xe Iaso <me@xeiaso.net>
---------
Signed-off-by: Xe Iaso <me@xeiaso.net>
Using the User-Agent as a filtering vector for the honeypot maze was a
decent idea; however, in practice it can become a DoS vector, as a
malicious client can add a lot of points to Google Chrome's User-Agent
string. In practice the worst offenders also seem to use vanilla
Google Chrome User-Agent strings, meaning that this backfires
horribly.
Gotta crack a few eggs to make omelettes.
Closes: #1580
Signed-off-by: Xe Iaso <me@xeiaso.net>
* Resolve #1193
Address documentation and error message issues around REDIRECT_DOMAINS and required keywords in bot specifications.
* Add CHANGELOG entry
* fix: enable CEL iterators
Signed-off-by: Jason Cameron <jason.cameron@stanwith.me>
* test: add unit tests for CELChecker map iteration
Signed-off-by: Jason Cameron <jason.cameron@stanwith.me>
* fix: implement map iterators for HTTPHeaders and URLValues to resolve CEL internal errors
Signed-off-by: Jason Cameron <jason.cameron@stanwith.me>
* fix: replace checker.NewMapIterator with newMapIterator for HTTPHeaders and URLValues
Signed-off-by: Jason Cameron <jason.cameron@stanwith.me>
---------
Signed-off-by: Jason Cameron <jason.cameron@stanwith.me>
When displayed in Japanese, `バージョン` (version) sits in the middle of the sentence while the version number is appended at the end, so the message reads strangely. Improve this.
**"version_info":**
```
このウェブサイトはAnubisバージョンで動作しています
```
to
```
このウェブサイトはAnubisで動作しています バージョン
```
Signed-off-by: BALLOON | FU-SEN <5434159+fu-sen@users.noreply.github.com>
* Add Wikimedia Foundation citoid services file
Wikimedia Foundation runs a service called citoid which retrieves citation metadata from URLs in order to create formatted citations.
This file contains the IP ranges allocated to the WMF (https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations) from which the services make requests, as well as regexes for the User-Agents of both services used to generate citations: citoid, and Zotero's translation-server, which citoid also queries in order to generate the metadata.
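For readers unfamiliar with Anubis's policy data files, an allow entry of this general shape is what such a file contains. The field names follow Anubis's bot policy format, but the regex and CIDR values below are placeholders, not the actual ones shipped in the data file:

```yaml
# Illustrative only; see the shipped data file for the real values.
- name: wikimedia-citoid
  user_agent_regex: "(?i)citoid"   # placeholder pattern
  remote_addresses:
    - "198.51.100.0/24"            # placeholder WMF range
  action: ALLOW
```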
Signed-off-by: Marielle Volz <marielle.volz@gmail.com>
* Add Wikimedia Citoid crawler to allowed list
Signed-off-by: Marielle Volz <marielle.volz@gmail.com>
* chore: update spelling
Signed-off-by: Xe Iaso <me@xeiaso.net>
---------
Signed-off-by: Marielle Volz <marielle.volz@gmail.com>
Signed-off-by: Xe Iaso <me@xeiaso.net>
Co-authored-by: Xe Iaso <me@xeiaso.net>