docs: add THR1 spec

Signed-off-by: Xe Iaso <me@xeiaso.net>
Xe Iaso
2025-06-04 23:14:17 -04:00
parent 76fa3e01a5
commit 3a4b1086af


@@ -0,0 +1,187 @@
# Techaro HTTP Request Fingerprinting Version 1
The naïve way to identify HTTP clients is to use the HTTP User-Agent string as a signal. In an ideal world, this would give you a perfect view of what clients are connecting to your server. We do not live in that ideal world. As such, we need an alternative method that can scale to the world we have.
## Prior Art
The biggest source of prior art is [FoxIO's JA4H fingerprinting method](https://github.com/FoxIO-LLC/ja4/blob/main/technical_details/JA4H.md). This is fine, but there's a problem with it in the real world: JA4H depends on the order in which headers arrived, and Go's `net/http` stores request headers in a map (`http.Header`), so that order can't be observed. As Anubis is written in Go and I don't feel like boiling the HTTP server ocean today, there needs to be an alternative.
## THR1
The fingerprint consists of four components joined with underscores:
```text
<thr1_head>_<thr1_lang>_<thr1_sec>_<thr1_all>
```
Example:
```text
get20cr1004_enca-d6b272e5b_sec-a9649072c_2a347fcf7
```
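Going the other way, a fingerprint can be split back into its components. This is a minimal sketch, not part of the spec: `split_thr1` is a hypothetical helper, and it assumes no component contains an underscore (which could break for an HTTP method with `_` in its first three characters).

```python
from typing import Dict


def split_thr1(fingerprint: str) -> Dict[str, str]:
    """Split a THR1 fingerprint into its four named components.

    Assumption: components never contain "_" themselves.
    """
    parts = fingerprint.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}")
    names = ("thr1_head", "thr1_lang", "thr1_sec", "thr1_all")
    return dict(zip(names, parts))


parts = split_thr1("get20cr1004_enca-d6b272e5b_sec-a9649072c_2a347fcf7")
print(parts["thr1_head"])  # get20cr1004
```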
Each component is described below:
### `thr1_head`
A summary of the request: method, protocol version, cookie and Referer presence, and header counts. It is the concatenation of:
- First three letters of the HTTP method, lowercased (e.g. `get`, `pos`).
- HTTP protocol version formatted in two digits (`10` for HTTP/1.0, `11` for HTTP/1.1, `20` for HTTP/2, `30` for HTTP/3 etc.).
- Single letter indicating if the request has cookies: `c` if present, `n` if not.
- Single letter indicating Referer header presence: `r` if present, `n` if absent.
- Number of HTTP headers sent by the client, zero-padded to two digits (e.g. `10`).
- Number of `Sec-*` headers sent by the client, zero-padded to two digits (e.g. `04`).
Example:
```text
get20cr1004
```
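The fields above can be assembled as follows. This is a sketch under a few assumptions the text does not settle: the function name and signature are made up for illustration, the header count includes every header the client sent (`Cookie`, `Referer`, and `Sec-*` included), and counts above 99 are not handled specially.

```python
from typing import Iterable


def thr1_head(method: str, major: int, minor: int, header_names: Iterable[str]) -> str:
    """Build the thr1_head component from request metadata."""
    names = [name.lower() for name in header_names]
    cookie = "c" if "cookie" in names else "n"
    referer = "r" if "referer" in names else "n"
    sec_count = sum(1 for name in names if name.startswith("sec-"))
    # Assumption: counts are only zero-padded; the spec does not say what
    # happens past 99, so this sketch would emit three digits there.
    return (
        f"{method[:3].lower()}{major}{minor}"
        f"{cookie}{referer}{len(names):02d}{sec_count:02d}"
    )


headers = [
    "Accept", "Accept-Encoding", "Accept-Language", "Cookie", "Referer",
    "User-Agent", "Sec-Fetch-Dest", "Sec-Fetch-Mode", "Sec-Fetch-Site",
    "Sec-Fetch-User",
]
print(thr1_head("GET", 2, 0, headers))  # get20cr1004
```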
### `thr1_lang`
`Accept-Language` header details.
- If no `Accept-Language` header is set, then:
  ```text
  -000000000
  ```
- Otherwise, join these two parts with `-`:
  - The first 4 alphanumeric characters of the header value (lowercased, right-padded with `0` to length 4), e.g. `enca`.
  - The first 9 hex characters of the SHA-256 hash of the full `Accept-Language` header value.
Example:
```text
enca-d6b272e5b
```
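A minimal sketch of this component, assuming that "alphanumeric" means ASCII letters and digits, that the hash is taken over the UTF-8 bytes of the raw header value, and that `None` stands for a missing header. The function name is illustrative only.

```python
import hashlib
from typing import Optional


def thr1_lang(accept_language: Optional[str]) -> str:
    """Build the thr1_lang component; None means the header was not sent."""
    if accept_language is None:
        return "-000000000"
    # Keep only ASCII letters and digits, then take the first four.
    alnum = "".join(ch for ch in accept_language if ch.isascii() and ch.isalnum())
    prefix = alnum[:4].lower().ljust(4, "0")
    digest = hashlib.sha256(accept_language.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:9]}"


print(thr1_lang("en-CA,en;q=0.9"))  # starts with "enca-"
print(thr1_lang(None))              # -000000000
```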
### `thr1_sec`
Details about the `Sec-*` headers sent by the client.
```text
thr1_sec = "sec-" + HASH9
```
Where:
- Collect **all headers whose names start with `sec-` (case-insensitive)**.
- For each header:
  1. Normalize the header name by lowercasing.
  2. If the header is one of the `Sec-CH-UA` family, apply the **special normalization rules** (see below). The family is:
     - `Sec-CH-UA`
     - `Sec-CH-UA-Mobile`
     - `Sec-CH-UA-Platform`
     - `Sec-CH-UA-Platform-Version`
     - `Sec-CH-UA-Model`
     - `Sec-CH-UA-Full-Version`
  3. For all other `sec-` headers:
     - Unquote values if quoted.
     - Trim leading/trailing whitespace.
     - Keep the value as-is (do not parse further).
- Sort all included headers by their normalized header name (ASCII order).
- Serialize each header as:
  ```text
  <header_name>:<normalized_value>
  ```
- Join all serialized lines with `\n`.
- Compute SHA-256 hash of the resulting canonical string.
- Take the first 9 hex characters of the hash and prefix with `sec-`.
Example:
```text
sec-a9649072c
```
#### Special Normalization Rules for `Sec-CH-UA*` headers
| Header | Normalization |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Sec-CH-UA` | Parse into `{brand, version}` pairs. Omit any with brand `"Not=A?Brand"`. Sort by brand ASC. Serialize as: `ua:Brand1/Version1,Brand2/Version2,...` |
| `Sec-CH-UA-Mobile` | Convert `"?1"` → `true`, `"?0"` → `false`. Serialize as: `mobile:true` or `mobile:false` |
| `Sec-CH-UA-Platform` | Lowercase, unquoted, trimmed. Serialize as: `platform:<value>` |
| `Sec-CH-UA-Platform-Version` | Unquoted, trimmed. Serialize as: `platform_version:<value>` |
| `Sec-CH-UA-Model` | Unquoted, trimmed. Serialize as: `model:<value>` |
| `Sec-CH-UA-Full-Version` | Unquoted, trimmed. Serialize as: `full_version:<value>` |
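
The table can be read as a per-header serialization function. The sketch below is one possible reading, not a reference implementation: `serialize_sec_header` and `unquote` are hypothetical names, the `Sec-CH-UA` parser is a simple regular expression that does not handle escaped quotes inside brand names, and any `Sec-CH-UA-Mobile` value other than `?1` is treated as `false`.

```python
import re

# Matches one `"Brand";v="Version"` entry of a Sec-CH-UA value.
_BRAND = re.compile(r'"([^"]*)"\s*;\s*v="([^"]*)"')


def unquote(value: str) -> str:
    """Trim whitespace and strip one pair of surrounding double quotes."""
    value = value.strip()
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return value.strip()


def serialize_sec_header(name: str, value: str) -> str:
    """Return the serialized `name:value` line for one Sec-* header."""
    name = name.lower()
    if name == "sec-ch-ua":
        pairs = [(b, v) for b, v in _BRAND.findall(value) if b != "Not=A?Brand"]
        # Sort by brand ascending (version breaks ties).
        return "ua:" + ",".join(f"{b}/{v}" for b, v in sorted(pairs))
    if name == "sec-ch-ua-mobile":
        return "mobile:" + ("true" if unquote(value) == "?1" else "false")
    if name == "sec-ch-ua-platform":
        return "platform:" + unquote(value).lower()
    if name == "sec-ch-ua-platform-version":
        return "platform_version:" + unquote(value)
    if name == "sec-ch-ua-model":
        return "model:" + unquote(value)
    if name == "sec-ch-ua-full-version":
        return "full_version:" + unquote(value)
    # All other sec-* headers: lowercased name, unquoted and trimmed value.
    return f"{name}:{unquote(value)}"


print(serialize_sec_header("Sec-CH-UA-Platform", '"Windows"'))  # platform:windows
```
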
Given these headers:
```text
Sec-CH-UA: "Google Chrome";v="123", "Not=A?Brand";v="8", "Chromium";v="123"
Sec-CH-UA-Mobile: ?1
Sec-CH-UA-Platform: "Windows"
Sec-CH-UA-Platform-Version: "10.0.0"
Sec-CH-UA-Model: "Pixel 7"
Sec-CH-UA-Full-Version: "123.0.6312.122"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
```
Each header normalized and serialized, before sorting:
```text
sec-fetch-dest:document
sec-fetch-mode:navigate
mobile:true
platform:windows
platform_version:10.0.0
full_version:123.0.6312.122
model:Pixel 7
ua:Chromium/123,Google Chrome/123
```
Then sort the lines by their serialized name to get the canonical string that is hashed:
```text
full_version:123.0.6312.122
mobile:true
model:Pixel 7
platform:windows
platform_version:10.0.0
sec-fetch-dest:document
sec-fetch-mode:navigate
ua:Chromium/123,Google Chrome/123
```
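The sort-and-hash step can be sketched as below. It assumes the lines are ordered by the text before the first `:` (which is what the sorted example shows, even for the renamed `Sec-CH-UA*` entries) and that the canonical string is hashed as UTF-8. Both function names are illustrative.

```python
import hashlib
from typing import Iterable, List


def canonical_sec_lines(lines: Iterable[str]) -> List[str]:
    """Sort serialized lines by their name part, in ASCII order."""
    return sorted(lines, key=lambda line: line.split(":", 1)[0])


def thr1_sec(lines: Iterable[str]) -> str:
    """Join the sorted lines with newlines, hash, and keep 9 hex characters."""
    canonical = "\n".join(canonical_sec_lines(lines))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "sec-" + digest[:9]


serialized = [
    "sec-fetch-dest:document",
    "sec-fetch-mode:navigate",
    "mobile:true",
    "platform:windows",
    "platform_version:10.0.0",
    "full_version:123.0.6312.122",
    "model:Pixel 7",
    "ua:Chromium/123,Google Chrome/123",
]
print("\n".join(canonical_sec_lines(serialized)))
print(thr1_sec(serialized))
```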
### `thr1_all`
A hash of the canonicalized form of request headers.
To construct a `thr1_all`:
1. Collect all header keys excluding:
   - `Cookie`
   - `Referer`
   - `User-Agent`
   - Any header starting with `X-`
2. Sort header keys by lowercase name.
3. Serialize each header as a line of the following form, joined by newlines:
   ```text
   name:value
   ```
4. Compute the SHA-256 checksum of that string and take the first 9 hex digits.
Example output:
```text
2a347fcf7
```
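A sketch of these steps, with several assumptions the text leaves open: header names are lowercased in the serialized lines, the exclusions are matched case-insensitively, values are used as received, and repeated headers keep their incoming order. The function name and the `(name, value)` input shape are illustrative.

```python
import hashlib
from typing import Iterable, Tuple

# Headers left out of thr1_all, compared in lowercase.
EXCLUDED = {"cookie", "referer", "user-agent"}


def thr1_all(headers: Iterable[Tuple[str, str]]) -> str:
    """Hash the canonicalized request headers into 9 hex characters."""
    kept = []
    for name, value in headers:
        lname = name.lower()
        if lname in EXCLUDED or lname.startswith("x-"):
            continue
        kept.append((lname, value))
    # Stable sort by lowercase name; duplicates keep their relative order.
    kept.sort(key=lambda pair: pair[0])
    canonical = "\n".join(f"{name}:{value}" for name, value in kept)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:9]


request_headers = [
    ("User-Agent", "ExampleBrowser/1.0"),
    ("Accept", "text/html"),
    ("X-Forwarded-For", "192.0.2.1"),
    ("Accept-Language", "en-CA,en;q=0.9"),
]
print(thr1_all(request_headers))
```

Because the excluded headers never reach the hash, two requests that differ only in `Cookie`, `Referer`, `User-Agent`, or `X-*` headers produce the same `thr1_all`.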