Mike Dilger
2023-07-29 03:27:23

Gossip: The HTTP Fetcher
Gossip is a desktop nostr client. This post is about the code that fetches HTTP resources.
Gossip fetches many kinds of HTTP resources: images, videos, NIP-05 JSON files, and so on. The part of gossip that does this is called the fetcher.
We have had a fetcher for some time, but it was poorly designed and had problems. For example, it never expired items in the cache.
We've made a lot of improvements to the fetcher recently. It's pretty good now, but there is still room for improvement.
Caching
Our fetcher caches data. Each URL that is fetched is hashed, and the content is stored in a cache file named by that hash.
If a requested resource is already in the cache, we don't make an HTTP request; we serve it directly from the cache.
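A minimal sketch of that scheme (assuming SHA-256 via the sha2 and hex crates; the hash gossip actually uses may differ):

```rust
use sha2::{Digest, Sha256};
use std::path::{Path, PathBuf};

/// Map a URL to its cache file: the file name is the hash of the URL.
fn cache_file(cache_dir: &Path, url: &str) -> PathBuf {
    let hash = Sha256::digest(url.as_bytes());
    cache_dir.join(hex::encode(hash))
}

/// Serve a cached copy if we have one; None means we must fetch.
fn try_cache(cache_dir: &Path, url: &str) -> Option<Vec<u8>> {
    std::fs::read(cache_file(cache_dir, url)).ok()
}
```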
But cached data gets stale. Sometimes resources at a URL change. We generally check resources again after three days.
We save the server's ETag value for each resource. When we check the content again, we supply an If-None-Match header with that ETag so the server can respond with 304 Not Modified, in which case we don't need to download the resource again; we just bump the filetime to now.
In the event that our cached data is stale but the server gives us an error, we serve up the stale data (stale is better than nothing).
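A rough sketch of that re-check, assuming the reqwest crate (`recheck` and `stored_etag` are illustrative names, not gossip's actual API); on an Err the caller would fall back to the stale copy:

```rust
use reqwest::{header, Client, StatusCode};

/// Re-check a possibly-stale resource. Returns the new bytes, or None
/// when the cached copy is still valid (server answered 304 Not Modified).
async fn recheck(
    client: &Client,
    url: &str,
    stored_etag: Option<&str>,
) -> Result<Option<Vec<u8>>, reqwest::Error> {
    let mut req = client.get(url);
    if let Some(etag) = stored_etag {
        // Ask the server to skip the body if our copy is still current.
        req = req.header(header::IF_NONE_MATCH, etag);
    }
    let resp = req.send().await?;
    if resp.status() == StatusCode::NOT_MODIFIED {
        // Cached copy is fine; the caller just bumps the filetime to now.
        return Ok(None);
    }
    // Any HTTP error becomes Err here; the caller serves the stale data.
    let resp = resp.error_for_status()?;
    Ok(Some(resp.bytes().await?.to_vec()))
}
```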
Queueing
We used to fire off HTTP GET requests as soon as we knew that we needed a resource. This was not looked on too kindly by servers and CDNs, which responded with either 403 Forbidden or 429 Too Many Requests.
So we moved to a queue system. The host is extracted from each URL, and each host is given at most 3 requests at a time. If we want 29 images from the same host, we ask for only three, and the remaining 26 stay in the queue for next time. When one of those requests completes, we decrement the host's load so we know we can send it another request later.
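A sketch of that per-host bookkeeping (a plain HashMap with a hard-coded limit of 3; gossip's real data structures may differ):

```rust
use std::collections::HashMap;

const MAX_PER_HOST: usize = 3;

/// Count of in-flight requests per host.
struct HostLoad(HashMap<String, usize>);

impl HostLoad {
    /// Try to claim a request slot for this host.
    /// Returning false means "leave the URL queued for a later pass".
    fn try_claim(&mut self, host: &str) -> bool {
        let count = self.0.entry(host.to_string()).or_insert(0);
        if *count < MAX_PER_HOST {
            *count += 1;
            true
        } else {
            false
        }
    }

    /// Called when a request completes, freeing a slot for that host.
    fn release(&mut self, host: &str) {
        if let Some(count) = self.0.get_mut(host) {
            *count = count.saturating_sub(1);
        }
    }
}
```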
We process the queue in an infinite loop, waiting 1200 milliseconds between passes. Passes take time themselves, and sometimes must wait out a timeout. Each pass fetches potentially many HTTP resources in parallel, asynchronously. If we have 300 resources spread across 100 different hosts, three per host, we could get them all in a single pass. More likely a bunch of resources are at the same host, and we make multiple passes to work through them.
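The loop might be skeletonized like this (a sketch using tokio and futures; `claim_batch` and `fetch_one` are hypothetical stand-ins for the queue and fetch logic above):

```rust
use std::time::Duration;

// Hypothetical stand-ins for the queue and fetch logic described above.
fn claim_batch() -> Vec<String> {
    Vec::new() // would pop up to 3 queued URLs per host
}
async fn fetch_one(_url: String) {
    // would do the HTTP GET, then decrement the host's load
}

#[tokio::main]
async fn main() {
    loop {
        // Fetch everything claimed for this pass concurrently; the pass
        // ends only when every fetch has completed or timed out.
        let batch = claim_batch();
        futures::future::join_all(batch.into_iter().map(fetch_one)).await;

        // Wait 1200 ms before the next pass.
        tokio::time::sleep(Duration::from_millis(1200)).await;
    }
}
```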
Timeouts
When we fetch URLs in parallel asynchronously, we wait until all of the fetches complete before waiting another 1200 ms and starting the next pass. Sometimes one of the fetches times out. To keep things moving, we use short timeouts: 10 seconds for a connect and 15 seconds for a response.
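In reqwest terms, that client configuration might look like this (a sketch; gossip's actual builder options may differ):

```rust
use std::time::Duration;

fn build_client() -> Result<reqwest::Client, reqwest::Error> {
    reqwest::Client::builder()
        // Give up if a connection can't be established within 10 seconds.
        .connect_timeout(Duration::from_secs(10))
        // Give up if the whole request takes longer than 15 seconds.
        .timeout(Duration::from_secs(15))
        .build()
}
```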
Handling Errors
Some kinds of errors are more serious than others. When we encounter one of these, we sin-bin the server: we don't try fetching from it again until a specified period has elapsed.
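A sketch of such a penalty box, assuming a simple map from host to release time:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hosts we refuse to contact until their release time has passed.
struct SinBin(HashMap<String, Instant>);

impl SinBin {
    /// Bench a host after a serious error.
    fn punish(&mut self, host: &str, penalty: Duration) {
        self.0.insert(host.to_string(), Instant::now() + penalty);
    }

    /// Check before fetching: is this host still benched?
    fn is_benched(&self, host: &str) -> bool {
        match self.0.get(host) {
            Some(until) => Instant::now() < *until,
            None => false,
        }
    }
}
```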