@ mccrmx
2025-05-15 02:04:26
Bottom-up Load-balancing for Nostr Relays
Goal: protect your relay by distributing clients more evenly across all relays without top-down coordination.
The problem
Nostr relays are a public good with the costs borne by benevolent volunteers running them. So far this system has been robust, supporting tens of thousands of active users per month. There is some low-key evidence of strain on these servers (see below), and overload could become a bigger problem as the network grows and attracts spammers, griefers, and other bad actors. It could become expensive to run the most popular relays even under ideal conditions with conscientious users.
The main strategies currently employed by relays to mitigate overload are a) throttling requests, b) pre-set PoW, and c) authentication of some form, with throttling being common on popular servers. Throttling is an opaque strategy: client apps only receive the kill signal once they have been throttled. Pre-set PoW is exclusionary to most clients that don't implement it, and authentication is centralizing.
The current strategy employed by clients for selecting relays is for client apps to choose a set of default relays, and for users to modify those defaults manually in the UI. This results in natural aggregation of clients on the more popular relays. This concentrates load on a smaller number of relays and is centralizing.
A solution
A solution that helps both clients and relays is to make the relay overload signal legible.
Our proposal is simple:
- Relay software measures common load utilization metrics, such as CPU, memory, and disk, as percentages.
- Overloaded relays publish their current load expressed as a proof-of-work requirement (NIP-11).
- Clients upon seeing the proof-of-work requirement can do the PoW or switch relays (NIP-13, NIP-01).
This sets up a basic PoW market where "prices" help "consumers" decide where to allocate their "capital" (events with proof-of-work tokens). The likely effect is that clients will automatically re-distribute their traffic away from "expensive" overloaded relays towards "cheap" underloaded relays, making the entire network healthier.
The new thing here is tying PoW to load. Existing NIPs cover most of the PoW part of the solution. NIP-11 covers publishing of PoW requirements by relays. NIP-13 and NIP-01 cover clients performing PoW and relays reacting to it.
A pleasant side-effect of using PoW as the load signal is that would-be spammers are forced to pay an actual cost in the form of energy expenditure and hardware.
How to participate
Participation is opt-in. Any relay or client can participate in this scheme if they want to try it out, or ignore it if they don't. It doesn't require large scale changes to the protocol.
Relay implementation
To participate in this protocol relays would make these changes:
- Collect load metrics.
- Publish PoW requirements (NIP-11).
- Check PoW on incoming client events (NIP-13, NIP-01).
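As a rough illustration of the "publish" step, here is a minimal sketch of building a NIP-11 relay information document that advertises the current requirement through the limitation.min_pow_difficulty field. The function name, the relay name, and the choice of which other fields to include are assumptions made for this example; serving the document over HTTP is left out.

```python
import json


def relay_info_document(name: str, pow_bits: int) -> str:
    """Build a minimal NIP-11 document advertising the current PoW requirement.

    `name` and the set of fields shown are illustrative; a real relay
    would include its usual NIP-11 metadata alongside them.
    """
    doc = {
        "name": name,
        "supported_nips": [1, 11, 13],
        # NIP-11 "limitation" object: minimum NIP-13 difficulty for new events.
        "limitation": {"min_pow_difficulty": max(0, int(pow_bits))},
    }
    return json.dumps(doc)
```

A relay taking part in the scheme would regenerate this document whenever its computed load changes, so clients always see an up-to-date "price".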
Collecting load metrics is straightforward. Relays can collect CPU loadavg, memory usage (averaged), and disk utilization as percentages. We suggest the final load percentage value should be:
max(cpu%, mem%, disk%)
as a starting point. The reason for using max() is that problems arise if any one of your resources is exhausted, so we should rely on the most exhausted resource as the actual load value.
PoW required can be calculated (see section below) by using a sensible starting point for tolerable load: the zero-PoW-load point. A max-bits difficulty can be specified for the 100% load point. The zero-PoW-load value is the load percentage up to which the relay is comfortable offering connections "free" without requiring any PoW. This situation would be the same as it is currently, with clients able to freely connect to most relays. Once the zero-PoW-load is reached the server would start publishing a number of bits of PoW required using NIP-11. The PoW required would scale up with the load experienced. Of course, zero-PoW-load can be set low to facilitate early PoW market signals.
New metrics that relays care about can be added to the calculation in future without clients caring. Other metrics could be used, such as percent of available TCP/websocket connection slots used, or actual server rental costs vs. maximum cost a volunteer will bear.
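The metric collection described above could be sketched as follows, using only the Python standard library. This is an assumption-laden example rather than a reference implementation: the function names are made up, cpu_percent relies on os.getloadavg (Unix only), and mem_percent reads /proc/meminfo (Linux only).

```python
import os
import shutil


def cpu_percent() -> float:
    """1-minute load average as a percentage of available cores (Unix only)."""
    load_1min = os.getloadavg()[0]
    cores = os.cpu_count() or 1
    return 100.0 * load_1min / cores


def mem_percent(meminfo_path: str = "/proc/meminfo") -> float:
    """Share of memory in use, from MemTotal and MemAvailable (Linux only)."""
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            fields[key] = int(rest.split()[0])  # first token is the numeric value
    return 100.0 * (1.0 - fields["MemAvailable"] / fields["MemTotal"])


def disk_percent(path: str = "/") -> float:
    """Share of disk space used on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def current_load(*percentages: float) -> float:
    """The most exhausted resource wins; the result is clamped to 0..100."""
    if not percentages:
        raise ValueError("at least one metric is required")
    return max(0.0, min(100.0, max(percentages)))
```

A relay might call current_load(cpu_percent(), mem_percent(), disk_percent()) on a timer. Because current_load accepts any number of percentages, extra metrics such as connection-slot usage can be added later without changing what clients see.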
Validating PoW on incoming client events is something that has already been discussed and implemented in relays. See NIPs 11, 13, and 01.
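For reference, the core of the NIP-13 check is counting the leading zero bits of the event id. The sketch below shows that check in isolation; the function names are illustrative, and it assumes the relay has already verified that the id matches the event hash and that the signature is valid, as NIP-01 requires.

```python
def leading_zero_bits(hex_id: str) -> int:
    """NIP-13 difficulty: number of leading zero bits in a hex-encoded id."""
    total_bits = len(hex_id) * 4
    return total_bits - int(hex_id, 16).bit_length()


def meets_pow(event: dict, required_bits: int) -> bool:
    """True if the event id carries at least `required_bits` of work.

    Assumes event["id"] was already checked against the event hash and
    signature; this function only looks at the difficulty.
    """
    if required_bits <= 0:
        return True
    return leading_zero_bits(event["id"]) >= required_bits
```

NIP-13 also lets an event commit to a target difficulty in the third entry of its nonce tag, and a relay may additionally reject events whose committed target is below the current requirement. That policy choice is left out of this sketch.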
PoW Required Calculation
- zero_pow_load is a setting for the minimum load percentage where PoW should kick in with 1 bit of PoW. 80% could be a sensible starting point. Operators of popular relays can use their own historical data to determine this.
- max_bits is set to the highest PoW clients will have to do when server load goes near 100%. This can be calibrated to some value like 1 hour of PoW on an average modern device. It's not expected that clients will actually perform this, but it sets the scale, and it should prevent highly resourced spammers from driving any relay to 100% utilization.
- current_load is the computed maximum of all of the load metric percentages gathered.
Then the formula for calculating the required PoW bits is:
pow_bits_required = max_bits * max(0, (current_load - zero_pow_load) / (100 - zero_pow_load))
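A direct translation of the formula into Python might look like the sketch below. Rounding up to a whole number of bits and clamping loads above 100% are assumptions added here (NIP-11 publishes the difficulty as an integer); the formula itself does not specify them.

```python
import math


def pow_bits_required(current_load: float, zero_pow_load: float, max_bits: int) -> int:
    """Scale required PoW bits linearly from zero_pow_load up to 100% load."""
    if not 0.0 <= zero_pow_load < 100.0:
        raise ValueError("zero_pow_load must be in the range [0, 100)")
    fraction = (current_load - zero_pow_load) / (100.0 - zero_pow_load)
    fraction = max(0.0, min(1.0, fraction))  # clamp: below threshold -> 0, above 100% -> 1
    # Assumption: round up so any load above the threshold costs at least 1 bit.
    return math.ceil(max_bits * fraction)
```

With zero_pow_load = 80 and max_bits = 30, a load of 50% requires 0 bits, 90% requires 15 bits, and 100% requires the full 30 bits.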
Since the work required grows exponentially with the number of bits (each additional bit doubles the expected number of hashes), some scaling function may be required to smooth this off.
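To make the scale concrete, the sketch below converts between bits and expected work, and derives a max_bits value from a time budget such as the one-hour figure suggested above. The hash rate is a placeholder that an operator would have to measure on representative client hardware; it is not a figure from this proposal.

```python
import math


def expected_hashes(bits: int) -> int:
    """Expected number of hash attempts needed to find `bits` leading zero bits."""
    return 2 ** bits


def calibrate_max_bits(hashes_per_second: float, budget_seconds: float = 3600.0) -> int:
    """Largest difficulty whose expected work fits within the time budget.

    `hashes_per_second` is an assumed, operator-measured figure.
    """
    if hashes_per_second <= 0 or budget_seconds <= 0:
        raise ValueError("hash rate and budget must be positive")
    return max(0, math.floor(math.log2(hashes_per_second * budget_seconds)))
```

For example, under a hypothetical rate of one million hashes per second, a one-hour budget corresponds to max_bits = 31, while a single extra bit would double the expected time to about two hours.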
Client implementation
These are the changes clients would make to participate in this protocol.
- Select relays based on PoW (favour PoW-cheap + known reliable relays).
- Do PoW if connecting to overloaded servers (NIP-11 and NIP-13).
Selecting relays for the user then might mean keeping a larger list of potential default relays and selecting a subset at setup time based on PoW requirements. It might also mean actively monitoring for high-PoW relays and switching away if a relay is frequently expensive.
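One possible selection policy is sketched below: drop relays whose advertised PoW exceeds what the client is willing to pay, then prefer cheaper relays, breaking ties in favour of known-reliable ones. The function name, the relay URLs, and the tie-breaking order are assumptions for this example. The PoW figures would come from each relay's NIP-11 document, which is requested over HTTP with the Accept: application/nostr+json header.

```python
def select_relays(candidates: dict[str, int], count: int, max_affordable_bits: int,
                  reliable: frozenset[str] = frozenset()) -> list[str]:
    """Pick up to `count` relays, cheapest PoW first, reliable relays first on ties.

    `candidates` maps relay URL -> advertised min_pow_difficulty (0 if none).
    """
    affordable = [(url, bits) for url, bits in candidates.items()
                  if bits <= max_affordable_bits]
    # Sort key: PoW cost, then reliability (False sorts first), then URL for stability.
    affordable.sort(key=lambda item: (item[1], item[0] not in reliable, item[0]))
    return [url for url, _ in affordable[:count]]


# Hypothetical relay list for illustration only.
example_candidates = {
    "wss://a.example": 0,
    "wss://b.example": 20,
    "wss://c.example": 0,
    "wss://d.example": 40,
}
chosen = select_relays(example_candidates, 2, 24, reliable=frozenset({"wss://c.example"}))
```

Re-running the selection periodically gives the "switch away if a relay is frequently expensive" behaviour. A client that weights reliability more heavily than price could reorder the sort key accordingly.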
It is in a client's best interest to select a low-PoW, underloaded set of relays to publish to, whilst still favouring known-reliable relays. If the whole network becomes loaded then PoW acts as a deterrent for non-critical use and spammers. Only the most committed clients and users will participate. It can also be a transparent signal to the community that more relays are required.
Addenda
So that's the main specification of the scheme. The following are related addenda.
Evidence of strain
Some random low-key evidence of strain on relays, and worries about the costs of running them.
nostr:note1kp57fvd8jz6639g86ugv5zy4q2sn52mz30kqp6xwgtlgph44q22stt7hkk
nostr:note19cp5rzvrmu7gc7n6czv8650wlyc355yffmy6amxtmd3pmut4umcqsr9dm4
nostr:note16vmlqsucqqgyjuac66wvpux903xrw5gea4wewtvz9ufhy5s8y83qscu5wv
TODO: sample some nostr events/users randomly from the firehose and get stats on relay centralization.
More weird PoW ideas
One thing not covered in NIPs is PoW-on-connect which could help in future if there are malicious clients camping on websockets.
Another idea is building PoW into npubs, similar to vanitygen, which would put some skin-in-the-game onto users when creating keys. Some relays may choose only to service high PoW npubs that have proven their commitment to the network and protocol.
On resource rationing
Some ways to ration unexpectedly demanded goods in an emergency:
(1) market prices ("price gouging") (2) waiting in line (3) centrally planned rationing (4) don't ration: just let the resource run out
Market prices are a least-worst business-as-usual option.