Merge branch 'main' into Xe/changelog-mention-migration-breakage

Signed-off-by: Xe Iaso <xe.iaso@techaro.lol>
feat(blog): incident report for TI-20250709-0001 (#795)
2026-04-17 05:44:57 +00:00 · 2025-07-09 16:53:13 -04:00 · 2025-07-09 14:56:12 +00:00 · 2025-07-09 13:31:28 +00:00 · 2025-07-09 08:26:25 -04:00
6 changed files with 133 additions and 16 deletions
--- a/.github/actions/spelling/expect.txt
+++ b/.github/actions/spelling/expect.txt
@@ -147,6 +147,7 @@ Imagesift
 imgproxy
 impressum
 inp
+internets
 IPTo
 iptoasn
 iss
@@ -312,6 +313,8 @@ Velen
 vendored
 vhosts
 videotest
+VKE
+Vultr
 waitloop
 weblate
 webmaster
--- a/docs/blog/2025-07-09-incident-report/index.mdx
+++ b/docs/blog/2025-07-09-incident-report/index.mdx
@@ -0,0 +1,105 @@
+---
+slug: incident/TI-20250709-0001
+title: "TI-20250709-0001: IPv4 traffic failures for Techaro services"
+authors: [xe]
+tags: [incident]
+image: ./window-portal.jpg
+---
+
+![](./window-portal.jpg)
+
+Techaro services were down for IPv4 traffic on July 9th, 2025. This blog post reports what happened, what actions were taken to resolve the situation, and what actions are being taken in the near future to prevent this problem from happening again. Enjoy this incident report!
+
+{/* truncate */}
+
+:::note
+
+In other companies, this kind of documentation would be kept internal. At Techaro, we believe that you deserve radical candor and the truth. As such, we are proving our lofty words with actions by publishing details about how things go wrong publicly.
+
+Everything past this point follows my standard incident root cause meeting template.
+
+:::
+
+This incident report will focus on the services affected, a timeline of what happened at each stage of the incident, where we got lucky, the root cause analysis, and what action items are being planned or taken to prevent this from happening in the future.
+
+## Timeline
+
+All events take place on July 9th, 2025.
+
+| Time (UTC) | Description                                                                                                                                                                                  |
+| :--------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 12:32      | Uptime Kuma reports that another unrelated website on the same cluster was timing out.                                                                                                       |
+| 12:33      | Uptime Kuma reports that Thoth's production endpoint is failing gRPC health checks.                                                                                                          |
+| 12:35      | Investigation begins, [announcement made on Xe's Bluesky](https://bsky.app/profile/xeiaso.net/post/3ltjtdczpwc2x) due to the impact including their personal blog.                           |
+| 12:39      | `ingress-nginx` logs on the production cluster show IPv6 traffic but an abrupt cutoff in IPv4 traffic around 12:32 UTC. Ticket is opened with the hosting provider.                          |
+| 12:41      | IPv4 traffic resumes long enough for Uptime Kuma to report uptime, but then immediately fails again.                                                                                         |
+| 12:46      | IPv4 traffic resumes long enough for Uptime Kuma to report uptime, but then immediately fails again. (repeat instances of this have been scrubbed, but it happened about every 5-10 minutes) |
+| 12:48      | First reply from the hosting provider.                                                                                                                                                       |
+| 12:57      | Reply to hosting provider, ask to reboot the load balancer.                                                                                                                                  |
+| 13:00      | Incident responder became busy due to a meeting under the belief that the downtime was out of their control and that uptime monitoring software would let them know if it came back up.      |
+| 13:20      | Incident responder ended meeting and went back to monitoring downtime and preparing this document.                                                                                           |
+| 13:34      | IPv4 traffic starts to show up in the `ingress-nginx` logs.                                                                                                                                  |
+| 13:35      | All services start to report healthy. Incident status changes to monitoring.                                                                                                                 |
+| 13:48      | Incident closed.                                                                                                                                                                             |
+| 14:07      | Incident re-opened. Issues seem to be manifesting as BGP issues in the upstream provider.                                                                                                    |
+| 14:10      | IPv4 traffic resumes and then stops.                                                                                                                                                         |
+| 14:18      | IPv4 traffic resumes again. Incident status changes to monitoring.                                                                                                                           |
+| 14:40      | Incident closed.                                                                                                                                                                             |
+
+## Services affected
+
+| Service name                                        | User impact        |
+| :-------------------------------------------------- | :----------------- |
+| [Anubis Docs](https://anubis.techaro.lol) (IPv4)    | Connection timeout |
+| [Anubis Docs](https://anubis.techaro.lol) (IPv6)    | None               |
+| [Thoth](/docs/admin/thoth/) (IPv4)                  | Connection timeout |
+| [Thoth](/docs/admin/thoth/) (IPv6)                  | None               |
+| Other websites colocated on the same cluster (IPv4) | Connection timeout |
+| Other websites colocated on the same cluster (IPv6) | None               |
+
+## Root cause analysis
+
+To simplify server management, Techaro runs a [Kubernetes](https://kubernetes.io/) cluster on [Vultr VKE](https://www.vultr.com/kubernetes/) (Vultr Kubernetes Engine). When you do this, Vultr needs to provision a [load balancer](https://docs.vultr.com/how-to-use-a-vultr-load-balancer-with-vke) to bridge the gap between the outside world and the Kubernetes world, kinda like this:
+
+```mermaid
+---
+title: Overall architecture
+---
+
+flowchart LR
+    UT(User Traffic)
+    subgraph Provider Infrastructure
+      LB[Load Balancer]
+    end
+    subgraph Kubernetes
+        IN(ingress-nginx)
+        TH(Thoth)
+        AN(Anubis Docs)
+        OS(Other sites)
+
+        IN --> TH
+        IN --> AN
+        IN --> OS
+    end
+
+    UT --> LB --> IN
+```
+
+Techaro controls everything inside the Kubernetes side of that diagram. Anything else is out of our control. That load balancer is routed to the public internet via [Border Gateway Protocol (BGP)](https://en.wikipedia.org/wiki/Border_Gateway_Protocol).
+
+If there is an interruption with the BGP sessions in the upstream provider, this can manifest as things either not working or inconsistently working. This is made more difficult by the fact that the IPv4 and IPv6 internets are technically separate networks. With this in mind, it's very possible to have IPv4 traffic fail but not IPv6 traffic.
+
+The root cause is that the hosting provider we use for production services had flapping IPv4 BGP sessions in its Toronto region. When this happens, all we can do is open a ticket and wait for it to come back up.
+
+## Where we got lucky
+
+The Uptime Kuma instance that caught this incident runs on an IPv4-only network. If it was dual stack, this would not have been caught as quickly.
+
+The `ingress-nginx` logs print IP addresses of remote clients to the log feed. If this was not the case, it would be much more difficult to find this error.
+
+## Action items
+
+- A single instance of downtime like this is not enough reason to move providers. Moving providers because of this is thus out of scope.
+- Techaro needs a status page hosted on a different cloud provider than is used for the production cluster (`TecharoHQ/TODO#6`).
+- Health checks for IPv4 and IPv6 traffic need to be created (`TecharoHQ/TODO#7`).
+- Remove the requirement for [Anubis to pass Thoth health checks before it can start if Thoth is enabled](https://github.com/TecharoHQ/anubis/pull/794).
--- a/docs/blog/2025-07-09-incident-report/window-portal.jpg
+++ b/docs/blog/2025-07-09-incident-report/window-portal.jpg
--- a/docs/docs/CHANGELOG.md
+++ b/docs/docs/CHANGELOG.md
@@ -13,12 +13,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 <!-- This changes the project to: -->

-### Added
-
-Anubis now supports these new languages:
-
-- [Italian](https://github.com/TecharoHQ/anubis/pull/778)
-
 ## v1.21.0: Minfilia Warde

 > Please, be at ease. You are among friends here.
@@ -40,11 +34,16 @@ Anubis now is able to store things persistently [in memory](./admin/policies.mdx
 Anubis now supports localized responses. Locales can be added in [lib/localization/locales/](https://github.com/TecharoHQ/anubis/tree/main/lib/localization/locales). This release includes support for the following languages:

 - [Brazilian Portuguese](https://github.com/TecharoHQ/anubis/pull/726)
+- [Chinese (Simplified)](https://github.com/TecharoHQ/anubis/pull/774)
 - [Chinese (Traditional)](https://github.com/TecharoHQ/anubis/pull/759)
 - English
-- [Estonian](https://github.com/TecharoHQ/anubis/pull/783) 
+- [Estonian](https://github.com/TecharoHQ/anubis/pull/783)
+- [Filipino](https://github.com/TecharoHQ/anubis/pull/775)
 - [French](https://github.com/TecharoHQ/anubis/pull/716)
 - [German](https://github.com/TecharoHQ/anubis/pull/741)
+- [Icelandic](https://github.com/TecharoHQ/anubis/pull/780)
+- [Italian](https://github.com/TecharoHQ/anubis/pull/778)
+- [Japanese](https://github.com/TecharoHQ/anubis/pull/772)
 - [Spanish](https://github.com/TecharoHQ/anubis/pull/716)
 - [Turkish](https://github.com/TecharoHQ/anubis/pull/751)

@@ -99,9 +98,22 @@ There are a bunch of other assorted features and fixes too:
 - Make the [Open Graph](./admin/configuration/open-graph.mdx) subsystem and DNSBL subsystem use [storage backends](./admin/policies.mdx#storage-backends) instead of storing everything in memory by default.
 - Allow [Common Crawl](https://commoncrawl.org/) by default so scrapers have less incentive to scrape
 - The [bbolt storage backend](./admin/policies.mdx#bbolt) now runs its cleanup every hour instead of every five minutes.
+- Don't block Anubis starting up if [Thoth](./admin/thoth.mdx) health checks fail.

 ### Potentially breaking changes

+We try to avoid introducing breaking changes as much as possible, but these are the changes that may be relevant for you as an administrator:
+
+#### Challenge format change
+
+Previously Anubis did no accounting for challenges that it issued. This means that if Anubis restarted while a client was solving a challenge, the client would be able to proceed once Anubis came back online.
+
+During the upgrade to v1.21.0 and when v1.21.0 (or later) restarts with the [in-memory storage backend](./admin/policies.mdx#memory), you may see a higher rate of failed challenges than normal. If this persists beyond a few minutes, [open an issue](https://github.com/TecharoHQ/anubis/issues/new).
+
+If you are using the in-memory storage backend, please consider using [a different storage backend](./admin/policies.mdx#storage-backends).
+
+#### Systemd service changes
+
 The following potentially breaking change applies to native installs with systemd only:

 Each instance of systemd service template now has a unique `RuntimeDirectory`, as opposed to each instance of the service sharing a `RuntimeDirectory`. This change was made to avoid [the `RuntimeDirectory` getting nuked any time one of the Anubis instances restarts](https://github.com/TecharoHQ/anubis/issues/748).
--- a/docs/docs/admin/policies.mdx
+++ b/docs/docs/admin/policies.mdx
@@ -268,6 +268,12 @@ The memory backend is an in-memory cache. This backend works best if you don't u

 The biggest downside is that there is not currently a limit to how much data can be stored in memory. This will be addressed at a later time.

+:::warning
+
+The in-memory backend exists mostly for validation, testing, and to ensure that the default configuration of Anubis works as expected. Do not use this persistently in production.
+
+:::
+
 #### Configuration

 The memory backend does not require any configuration to use.
--- a/internal/thoth/thoth.go
+++ b/internal/thoth/thoth.go
@@ -60,15 +60,6 @@ func New(ctx context.Context, thothURL, apiToken string, plaintext bool) (*Clien

 	hc := healthv1.NewHealthClient(conn)

-	resp, err := hc.Check(ctx, &healthv1.HealthCheckRequest{})
-	if err != nil {
-		return nil, fmt.Errorf("can't verify thoth health at %s: %w", thothURL, err)
-	}
-
-	if resp.Status != healthv1.HealthCheckResponse_SERVING {
-		return nil, fmt.Errorf("thoth is not healthy, wanted %s but got %s", healthv1.HealthCheckResponse_SERVING, resp.Status)
-	}
-
 	return &Client{
 		conn:    conn,
 		health:  hc,
Author	SHA1	Message	Date
Xe Iaso	9661a0a0ab	Merge branch 'main' into Xe/changelog-mention-migration-breakage Signed-off-by: Xe Iaso <xe.iaso@techaro.lol>	2025-07-09 16:53:13 -04:00
Xe Iaso	fa3fbfb0a5	feat(blog): incident report for TI-20250709-0001 (#795 ) * feat(blog): incident report for TI-20250709-0001 Signed-off-by: Xe Iaso <me@xeiaso.net> * chore: spelling check-spelling run (pull_request) for Xe/TI-20250709-0001 Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com> on-behalf-of: @check-spelling <check-spelling-bot@check-spelling.dev> * fix(blog/TI-20250709-0001): add TecharoHQ/anubis#794 Signed-off-by: Xe Iaso <me@xeiaso.net> * fix(blog/TI-20250709-0001): amend grammar Signed-off-by: Xe Iaso <me@xeiaso.net> --------- Signed-off-by: Xe Iaso <me@xeiaso.net> Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>	2025-07-09 14:56:12 +00:00
Xe Iaso	3c739c1305	fix(internal/thoth): don't block Anubis starting if healthcheck fails (#794 ) Signed-off-by: Xe Iaso <me@xeiaso.net>	2025-07-09 13:31:28 +00:00
Xe Iaso	d00417e556	docs: update CHANGELOG for language changes Signed-off-by: Xe Iaso <me@xeiaso.net>	2025-07-09 08:26:25 -04:00