
@ David
2025-06-14 08:28:50
#realy #devstr #progressreport
exploiting the benefits of my custom streaming varint codec, i have now changed the timestamps in the database indexes to these compact varints. this has shrunk the indexes from 64MB down to 59MB, for a stash of events totalling 204MB as minified wire-encoded JSON
current unix timestamps still fit within 4 bytes of value, and will remain that size until 2038, so for now they encode as just 4 bytes, after which they will expand seamlessly toward the full range of standard 64 bit signed integer timestamps.
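the actual codec isn't shown in this note, so here is a minimal illustrative sketch of the idea: one scheme that gives 4 bytes for any pre-2038 timestamp is a marker bit on the first byte that escapes to an 8-byte form (the function names and the marker convention are hypothetical, not the real realy code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// hypothetical sketch: timestamps below 2^31 (i.e. until 2038) encode as
// 4 big-endian bytes with the top bit clear; larger values emit a marker
// byte with the top bit set, followed by 8 big-endian bytes. this also
// preserves lexicographic ordering of encoded keys, which matters for
// database indexes.
func encodeTimestamp(ts int64) []byte {
	if ts >= 0 && ts < 1<<31 {
		var b [4]byte
		binary.BigEndian.PutUint32(b[:], uint32(ts))
		return b[:]
	}
	out := make([]byte, 9)
	out[0] = 0x80 // marker: extended 8-byte form follows
	binary.BigEndian.PutUint64(out[1:], uint64(ts))
	return out
}

func decodeTimestamp(b []byte) int64 {
	if b[0]&0x80 != 0 {
		return int64(binary.BigEndian.Uint64(b[1:9]))
	}
	return int64(binary.BigEndian.Uint32(b[:4]))
}

func main() {
	now := int64(1718353730) // a 2025 timestamp, fits in 31 bits
	fmt.Println(len(encodeTimestamp(now))) // 4
}
```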
in other development notes, i have reorganised the documentation of the index type enums so that they are grouped in logical order. the metadata keys - pubkey, kind and the created_at timestamp - form a group of 6: the 3 individual indexes, the two pairs kind/created_at and pubkey/created_at, and the three way combination kind/pubkey/created_at
these 6 indexes cover all of the primary metadata searches on events themselves. then there is a series of tag indexes: a-tags, which carry kind/pubkey/identifier; e and p tags - the p tag index tolerates having hashtags stored in it, as appears in some clients' broken implementations of follow lists (why they don't use t-tags idk, but it is what it is); other standard single-letter tags (which would include things like the mimetype and other categories); the d-tag, which is the identifier used on addressable (parameterized replaceable) events; and nonstandard tags also get indexed, so they will actually be searchable, though not via the standard hash-letter style filters - that will require an extension in the HTTP API
lastly there are some cache-management GC type indexes that store first seen, last accessed and access count, which will be usable for deciding whether to prune events out of the store, for an archive/cache two layer architecture.
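the grouping described above can be sketched as a single enum in three blocks (all identifier names here are my own illustration, not the actual realy enum names):

```go
package main

import "fmt"

// hypothetical reconstruction of the index type grouping described above;
// the real names in realy will differ.
type IndexType byte

const (
	// event metadata indexes (group of 6)
	IdxPubkey IndexType = iota
	IdxKind
	IdxCreatedAt
	IdxKindCreatedAt
	IdxPubkeyCreatedAt
	IdxKindPubkeyCreatedAt

	// tag indexes
	IdxATag // kind/pubkey/identifier
	IdxETag
	IdxPTag // tolerates hashtags from broken follow-list implementations
	IdxSingleLetterTag
	IdxDTag // identifier for addressable (parameterized replaceable) events
	IdxNonstandardTag

	// cache-management / GC indexes
	IdxFirstSeen
	IdxLastAccessed
	IdxAccessCount
)

func main() {
	fmt.Println(IdxKindPubkeyCreatedAt) // 5, last of the metadata group
}
```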
i've also been thinking about how to do query forwarding. for this, http API clients will open a single subscription using SSE, which will deliver any out-of-order events or subscription filter matches. once query forwarding is implemented, standard filter queries will also return a relay-generated filter identifier at the top of the list of event IDs. when a forwarded query returns, that identifier is prefixed to its results, enabling the relay to recognise which client made the original query and send those results to it over the subscription SSE connection.
a second distributed relay architecture feature i am designing will allow relays to subscribe to other relays' latest events (usually just an empty query that continuously forwards all new events to the subscribing client/cache relay). this entails building a subscription queue management system and an event ack message, so that the archive or source relay forwarding the events can tolerate the subscription connection dropping: it will recover the connection, send the events that arrived while the subscription was down, and continue. this out-of-band state enables the implementation of a full relay distribution strategy
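the ack/resume bookkeeping could look something like this minimal sketch (type and method names are hypothetical): the source relay keeps, per downstream subscriber, the events sent but not yet acknowledged, so a dropped connection resumes without loss.

```go
package main

import "fmt"

// SubQueue holds event IDs sent to a downstream subscriber but not yet
// acked, in send order. hypothetical sketch of the design described above.
type SubQueue struct {
	pending []string
}

// Send records an event as forwarded but unacknowledged.
func (q *SubQueue) Send(id string) { q.pending = append(q.pending, id) }

// Ack drops everything up to and including the acknowledged event ID.
func (q *SubQueue) Ack(id string) {
	for i, p := range q.pending {
		if p == id {
			q.pending = q.pending[i+1:]
			return
		}
	}
}

// Resume returns the events that must be resent after a reconnect.
func (q *SubQueue) Resume() []string { return q.pending }

func main() {
	q := &SubQueue{}
	q.Send("a")
	q.Send("b")
	q.Send("c")
	q.Ack("b") // acks implicitly cover everything sent earlier too
	fmt.Println(q.Resume()) // [c]
}
```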
the last thing that would be required is auth proxying. for this, clients need to be able to understand that they are authing to a different address than the relay they sent the query to. this would enable a cache type relay to serve a client exclusively, eliminating the replication of traffic that the current architecture forces, which currently causes big problems with huge bandwidth usage. with this, you will be able to deploy a cluster of relays - a few archives and many caches - and clients will connect to only one of them at a time, with the cache relay storing all query results it receives via the forwarded query protocol, which requires this auth proxy protocol. additionally, cache relays in a cluster would hold subscriptions to each other, so that when forwarded queries bring in new events, as well as returning the results to the client, they propagate them horizontally - meaning that from that point, any other client on the cluster would quickly be able to see events that were not published directly to the specific cache relay that triggered the query forward.
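the core of the auth proxy idea is that the challenge a cache relay hands the client names a different address than the one the client dialed. a minimal sketch of that shape, assuming a NIP-42-style challenge (the struct and field names are purely illustrative, not an actual protocol):

```go
package main

import "fmt"

// AuthChallenge is a hypothetical sketch: the cache relay's challenge
// carries the upstream (archive) relay URL the client should commit to
// in its signed auth event, so the cache can forward the proof upstream.
type AuthChallenge struct {
	Challenge string // random nonce from the upstream relay
	RelayURL  string // upstream address, not the cache the client dialed
}

func main() {
	c := AuthChallenge{
		Challenge: "8f3e1c", // illustrative nonce
		RelayURL:  "wss://archive.example.com",
	}
	// the client signs against c.RelayURL even though it is connected
	// to a cache at a different address
	fmt.Println(c.RelayURL)
}
```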
with all of these new strategies in place, both geographical distribution and consortium style collaboration become possible, further shrinking the attack surface available for suppressing the transit of events on nostr.