Compare commits

...

4 Commits

Author SHA1 Message Date
Xe Iaso
9661a0a0ab Merge branch 'main' into Xe/changelog-mention-migration-breakage
Signed-off-by: Xe Iaso <xe.iaso@techaro.lol>
2025-07-09 16:53:13 -04:00
Xe Iaso
fa3fbfb0a5 feat(blog): incident report for TI-20250709-0001 (#795)
* feat(blog): incident report for TI-20250709-0001

Signed-off-by: Xe Iaso <me@xeiaso.net>

* chore: spelling

check-spelling run (pull_request) for Xe/TI-20250709-0001

Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
on-behalf-of: @check-spelling <check-spelling-bot@check-spelling.dev>

* fix(blog/TI-20250709-0001): add TecharoHQ/anubis#794

Signed-off-by: Xe Iaso <me@xeiaso.net>

* fix(blog/TI-20250709-0001): amend grammar

Signed-off-by: Xe Iaso <me@xeiaso.net>

---------

Signed-off-by: Xe Iaso <me@xeiaso.net>
Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
2025-07-09 14:56:12 +00:00
Xe Iaso
3c739c1305 fix(internal/thoth): don't block Anubis starting if healthcheck fails (#794)
Signed-off-by: Xe Iaso <me@xeiaso.net>
2025-07-09 13:31:28 +00:00
Xe Iaso
d00417e556 docs: update CHANGELOG for language changes
Signed-off-by: Xe Iaso <me@xeiaso.net>
2025-07-09 08:26:25 -04:00
6 changed files with 133 additions and 16 deletions

View File

@@ -147,6 +147,7 @@ Imagesift
imgproxy
impressum
inp
internets
IPTo
iptoasn
iss
@@ -312,6 +313,8 @@ Velen
vendored
vhosts
videotest
VKE
Vultr
waitloop
weblate
webmaster

View File

@@ -0,0 +1,105 @@
---
slug: incident/TI-20250709-0001
title: "TI-20250709-0001: IPv4 traffic failures for Techaro services"
authors: [xe]
tags: [incident]
image: ./window-portal.jpg
---
![](./window-portal.jpg)
Techaro services were down for IPv4 traffic on July 9th, 2025. This blogpost is a report of what happened, what actions were taken to resolve the situation, and what actions are being done in the near future to prevent this problem. Enjoy this incident report!
{/* truncate */}
:::note
In other companies, this kind of documentation would be kept internal. At Techaro, we believe that you deserve radical candor and the truth. As such, we are proving our lofty words with actions by publishing details about how things go wrong publicly.
Everything past this point follows my standard incident root cause meeting template.
:::
This incident report will focus on the services affected, timeline of what happened at which stage of the incident, where we got lucky, the root cause analysis, and what action items are being planned or taken to prevent this from happening in the future.
## Timeline
All events take place on July 9th, 2025.
| Time (UTC) | Description |
| :--------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 12:32 | Uptime Kuma reports that another unrelated website on the same cluster was timing out. |
| 12:33 | Uptime Kuma reports that Thoth's production endpoint is failing gRPC health checks. |
| 12:35 | Investigation begins, [announcement made on Xe's Bluesky](https://bsky.app/profile/xeiaso.net/post/3ltjtdczpwc2x) due to the impact including their personal blog. |
| 12:39 | `nginx-ingress` logs on the production cluster show IPv6 traffic but an abrupt cutoff in IPv4 traffic around 12:32 UTC. Ticket is opened with the hosting provider. |
| 12:41 | IPv4 traffic resumes long enough for Uptime Kuma to report uptime, but then immediately fails again. |
| 12:46 | IPv4 traffic resumes long enough for Uptime Kuma to report uptime, but then immediately fails again. (repeat instances of this have been scrubbed, but it happened about every 5-10 minutes) |
| 12:48 | First reply from the hosting provider. |
| 12:57 | Reply to hosting provider, ask to reboot the load balancer. |
| 13:00 | Incident responder because busy due to a meeting under the belief that the downtime was out of their control and that uptime monitoring software would let them know if it came back up. |
| 13:20 | Incident responder ended meeting and went back to monitoring downtime and preparing this document. |
| 13:34 | IPv4 traffic starts to show up in the `ingress-nginx` logs. |
| 13:35 | All services start to report healthy. Incident status changes to monitoring. |
| 13:48 | Incident closed. |
| 14:07 | Incident re-opened. Issues seem to be manifesting as BGP issues in the upstream provider. |
| 14:10 | IPv4 traffic resumes and then stops. |
| 14:18 | IPv4 traffic resumes again. Incident status changes to monitoring. |
| 14:40 | Incident closed. |
## Services affected
| Service name | User impact |
| :-------------------------------------------------- | :----------------- |
| [Anubis Docs](https://anubis.techaro.lol) (IPv4) | Connection timeout |
| [Anubis Docs](https://anubis.techaro.lol) (IPv6) | None |
| [Thoth](/docs/admin/thoth/) (IPv4) | Connection timeout |
| [Thoth](/docs/admin/thoth/) (IPv6) | None |
| Other websites colocated on the same cluster (IPv4) | Connection timeout |
| Other websites colocated on the same cluster (IPv6) | None |
## Root cause analysis
In simplify server management, Techaro runs a [Kubernetes](https://kubernetes.io/) cluster on [Vultr VKE](https://www.vultr.com/kubernetes/) (Vultr Kubernetes Engine). When you do this, Vultr needs to provision a [load balancer](https://docs.vultr.com/how-to-use-a-vultr-load-balancer-with-vke) to bridge the gap between the outside world and the Kubernetes world, kinda like this:
```mermaid
---
title: Overall architecture
---
flowchart LR
UT(User Traffic)
subgraph Provider Infrastructure
LB[Load Balancer]
end
subgraph Kubernetes
IN(ingress-nginx)
TH(Thoth)
AN(Anubis Docs)
OS(Other sites)
IN --> TH
IN --> AN
IN --> OS
end
UT --> LB --> IN
```
Techaro controls everything inside the Kubernetes side of that diagram. Anything else is out of our control. That load balancer is routed to the public internet via [Border Gateway Protocol (BGP)](https://en.wikipedia.org/wiki/Border_Gateway_Protocol).
If there is an interruption with the BGP sessions in the upstream provider, this can manifest as things either not working or inconsistently working. This is made more difficult by the fact that the IPv4 and IPv6 internets are technically separate networks. With this in mind, it's very possible to have IPv4 traffic fail but not IPv6 traffic.
The root cause is that the hosting provider we use for production services had flapping IPv4 BGP sessions in its Toronto region. When this happens all we can do is open a ticket and wait for it to come back up.
## Where we got lucky
The Uptime Kuma instance that caught this incident runs on an IPv4-only network. If it was dual stack, this would not have been caught as quickly.
The `ingress-nginx` logs print IP addresses of remote clients to the log feed. If this was not the case, it would be much more difficult to find this error.
## Action items
- A single instance of downtime like this is not enough reason to move providers. Moving providers because of this is thus out of scope.
- Techaro needs a status page hosted on a different cloud provider than is used for the production cluster (`TecharoHQ/TODO#6`).
- Health checks for IPv4 and IPv6 traffic need to be created (`TecharoHQ/TODO#7`).
- Remove the requirement for [Anubis to pass Thoth health checks before it can start if Thoth is enabled](https://github.com/TecharoHQ/anubis/pull/794).

Binary file not shown.

After

Width:  |  Height:  |  Size: 30 KiB

View File

@@ -13,12 +13,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
<!-- This changes the project to: -->
### Added
Anubis now supports these new languages:
- [Italian](https://github.com/TecharoHQ/anubis/pull/778)
## v1.21.0: Minfilia Warde
> Please, be at ease. You are among friends here.
@@ -40,11 +34,16 @@ Anubis now is able to store things persistently [in memory](./admin/policies.mdx
Anubis now supports localized responses. Locales can be added in [lib/localization/locales/](https://github.com/TecharoHQ/anubis/tree/main/lib/localization/locales). This release includes support for the following languages:
- [Brazilian Portugese](https://github.com/TecharoHQ/anubis/pull/726)
- [Chinese (Simplified)](https://github.com/TecharoHQ/anubis/pull/774)
- [Chinese (Traditional)](https://github.com/TecharoHQ/anubis/pull/759)
- English
- [Estonian](https://github.com/TecharoHQ/anubis/pull/783)
- [Estonian](https://github.com/TecharoHQ/anubis/pull/783)
- [Filipino](https://github.com/TecharoHQ/anubis/pull/775)
- [French](https://github.com/TecharoHQ/anubis/pull/716)
- [German](https://github.com/TecharoHQ/anubis/pull/741)
- [Icelandic](https://github.com/TecharoHQ/anubis/pull/780)
- [Italian](https://github.com/TecharoHQ/anubis/pull/778)
- [Japanese](https://github.com/TecharoHQ/anubis/pull/772)
- [Spanish](https://github.com/TecharoHQ/anubis/pull/716)
- [Turkish](https://github.com/TecharoHQ/anubis/pull/751)
@@ -99,9 +98,22 @@ There are a bunch of other assorted features and fixes too:
- Make the [Open Graph](./admin/configuration/open-graph.mdx) subsystem and DNSBL subsystem use [storage backends](./admin/policies.mdx#storage-backends) instead of storing everything in memory by default.
- Allow [Common Crawl](https://commoncrawl.org/) by default so scrapers have less incentive to scrape
- The [bbolt storage backend](./admin/policies.mdx#bbolt) now runs its cleanup every hour instead of every five minutes.
- Don't block Anubis starting up if [Thoth](./admin/thoth.mdx) health checks fail.
### Potentially breaking changes
We try to introduce breaking changes as much as possible, but these are the changes that may be relevant for you as an administrator:
#### Challenge format change
Previously Anubis did no accounting for challenges that it issued. This means that if Anubis restarted during a client, the client would be able to proceed once Anubis came back online.
During the upgrade to v1.21.0 and when v1.21.0 (or later) restarts with the [in-memory storage backend](./admin/policies.mdx#memory), you may see a higher rate of failed challenges than normal. If this persists beyond a few minutes, [open an issue](https://github.com/TecharoHQ/anubis/issues/new).
If you are using the in-memory storage backend, please consider using [a different storage backend](./admin/policies.mdx#storage-backends).
#### Systemd service changes
The following potentially breaking change applies to native installs with systemd only:
Each instance of systemd service template now has a unique `RuntimeDirectory`, as opposed to each instance of the service sharing a `RuntimeDirectory`. This change was made to avoid [the `RuntimeDirectory` getting nuked any time one of the Anubis instances restarts](https://github.com/TecharoHQ/anubis/issues/748).

View File

@@ -268,6 +268,12 @@ The memory backend is an in-memory cache. This backend works best if you don't u
The biggest downside is that there is not currently a limit to how much data can be stored in memory. This will be addressed at a later time.
:::warning
The in-memory backend exists mostly for validation, testing, and to ensure that the default configuration of Anubis works as expected. Do not use this persistently in production.
:::
#### Configuration
The memory backend does not require any configuration to use.

View File

@@ -60,15 +60,6 @@ func New(ctx context.Context, thothURL, apiToken string, plaintext bool) (*Clien
hc := healthv1.NewHealthClient(conn)
resp, err := hc.Check(ctx, &healthv1.HealthCheckRequest{})
if err != nil {
return nil, fmt.Errorf("can't verify thoth health at %s: %w", thothURL, err)
}
if resp.Status != healthv1.HealthCheckResponse_SERVING {
return nil, fmt.Errorf("thoth is not healthy, wanted %s but got %s", healthv1.HealthCheckResponse_SERVING, resp.Status)
}
return &Client{
conn: conn,
health: hc,