-

@ mleku
2025-05-17 07:11:37
#realy #devstr #progressreport
the full text search is theoretically now implemented
it iterates the new fulltext index searching for word matches
then it checks the rest of the filter criteria and eliminates candidates that fail any criterion the filter sets, i.e. results:
- that are outside of since<->until,
- that aren't one of the kinds requested,
- that aren't published by one of the specified pubkeys,
- that don't have the requested language tag, and
- that don't have one of the tags in the filter
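
a rough sketch of that elimination step in go. the `Candidate` and `Filter` types and the `matches` function are hypothetical names made up for illustration, not realy's actual code, and the sketch assumes that an unset criterion eliminates nothing:

```go
package main

// Candidate is a hypothetical stand-in for the metadata of an event that
// matched a fulltext index entry; it is not realy's actual type.
type Candidate struct {
	CreatedAt int64
	Kind      int
	Pubkey    string
	Lang      string
	Tags      []string
}

// Filter is a hypothetical, simplified filter. A zero or empty field means
// that criterion is not set and does not eliminate anything (an assumption).
type Filter struct {
	Since, Until int64
	Kinds        []int
	Authors      []string
	Lang         string
	Tags         []string
}

// contains reports whether x is in xs.
func contains[T comparable](xs []T, x T) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// matches reports whether c survives every criterion that f sets.
func matches(f Filter, c Candidate) bool {
	if f.Since != 0 && c.CreatedAt < f.Since {
		return false
	}
	if f.Until != 0 && c.CreatedAt > f.Until {
		return false
	}
	if len(f.Kinds) > 0 && !contains(f.Kinds, c.Kind) {
		return false
	}
	if len(f.Authors) > 0 && !contains(f.Authors, c.Pubkey) {
		return false
	}
	if f.Lang != "" && c.Lang != f.Lang {
		return false
	}
	if len(f.Tags) > 0 {
		// the candidate needs at least one of the requested tags
		found := false
		for _, t := range c.Tags {
			if contains(f.Tags, t) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```

within one criterion any of the listed values is enough, but a candidate has to pass all of the criteria that are set.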
then it groups the resulting fulltext index entries by event, one unit per event, and creates a map of them
then it converts the randomly iterating map into a flat array
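
a minimal sketch of the group-then-flatten step, assuming a hypothetical `Hit` record per fulltext index entry (event id, index of the word in the search, position of the word in the event text). the names are invented for illustration:

```go
package main

// Hit is a hypothetical fulltext index entry: which event it belongs to,
// which search word it matched (by index in the search text), and where
// in the event text that word appears.
type Hit struct {
	EventID string
	Term    int
	Pos     int
}

// Match is one unit: all hits that belong to the same event.
type Match struct {
	EventID string
	Hits    []Hit
}

// groupByEvent collects hits per event in a map, then flattens the map
// into a slice. Go map iteration order is randomized, so the order of
// the returned slice is arbitrary until it is sorted later.
func groupByEvent(hits []Hit) []Match {
	byEvent := make(map[string]*Match)
	for _, h := range hits {
		m, ok := byEvent[h.EventID]
		if !ok {
			m = &Match{EventID: h.EventID}
			byEvent[h.EventID] = m
		}
		m.Hits = append(m.Hits, h)
	}
	out := make([]Match, 0, len(byEvent))
	for _, m := range byEvent {
		out = append(out, *m)
	}
	return out
}
```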
it iterates the array and calculates the distance between the first and last word match in the sequence of the text of the event
then it segments the results by the number of words that match in each event, so events with the same number of word matches end up in the same group
then it calculates the distance between the first and last word matches in the text (the matches are sorted by their position in the text; since a term can appear more than once, this is only an approximation)
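
the distance metric could look roughly like this, assuming the word positions of an event's matches are available as plain ints (the function name `distance` is made up for this sketch):

```go
package main

// distance returns the span between the earliest and latest matched word
// positions in an event's text. It scans for min and max, so the input
// does not need to be sorted. With no matches or a single match the
// distance is 0.
func distance(positions []int) int {
	if len(positions) == 0 {
		return 0
	}
	lo, hi := positions[0], positions[0]
	for _, p := range positions[1:] {
		if p < lo {
			lo = p
		}
		if p > hi {
			hi = p
		}
	}
	return hi - lo
}
```

for example, matches at word positions 12, 3 and 7 give a distance of 9.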
then it gets a list of all the terms in the event by their sequence number in the original search, and with this array it counts how many of those matches are in ascending order matching the search terms
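
one possible reading of the sequence metric, as a sketch: take the search-term indexes in the order the words appear in the event text, and count the adjacent pairs that ascend. the exact counting rule in realy may differ; `sequenceScore` is a hypothetical name:

```go
package main

// sequenceScore takes the search-term indexes of an event's matches,
// ordered by where the words appear in the event text, and counts how
// many adjacent pairs are in ascending order, i.e. follow the order of
// the words in the search text. This counting rule is an assumption.
func sequenceScore(termOrder []int) int {
	score := 0
	for i := 1; i < len(termOrder); i++ {
		if termOrder[i] > termOrder[i-1] {
			score++
		}
	}
	return score
}
```

so for a three word search, an event where the words appear in search order (0, 1, 2) scores 2, and one where they appear as 2, 0, 1 scores 1.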
with these two metrics, we can calculate relevance, as the higher the number of items in matching sequence the more relevant, and the closer they are together, the more likely they are to match the search text
then we sort each group of events with the same count of matched search words by distance AND sequence, meaning the top results in each group have the lowest distance and the highest sequence
then with the segments individually sorted, we zip them back together into a list, extract their event id, pubkey and timestamp and return that to the caller (this same result is used by the filter in the HTTP API so it can filter out pubkeys and sort the results by timestamp descending or ascending according to the query parameters).
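
the segment, sort and zip steps can be sketched like this. two things here are assumptions and not stated above: that segments with more matched words come first when zipping, and that distance is compared first with sequence as the tie-breaker. `Scored` and `rank` are hypothetical names:

```go
package main

import "sort"

// Scored is a hypothetical per-event result carrying the two metrics.
type Scored struct {
	EventID  string
	Words    int // number of search words matched in the event
	Distance int // span between first and last match, lower is better
	Sequence int // matches in search order, higher is better
}

// rank segments results by matched word count, sorts each segment by
// distance ascending then sequence descending, and joins the segments
// back into one list, highest word count first (an assumption).
func rank(results []Scored) []Scored {
	segments := make(map[int][]Scored)
	for _, r := range results {
		segments[r.Words] = append(segments[r.Words], r)
	}
	counts := make([]int, 0, len(segments))
	for c := range segments {
		counts = append(counts, c)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(counts)))

	out := make([]Scored, 0, len(results))
	for _, c := range counts {
		seg := segments[c]
		sort.SliceStable(seg, func(i, j int) bool {
			if seg[i].Distance != seg[j].Distance {
				return seg[i].Distance < seg[j].Distance
			}
			return seg[i].Sequence > seg[j].Sequence
		})
		out = append(out, seg...)
	}
	return out
}
```

the same ordering could be had from a single sort with three keys; the sketch keeps the segments explicit only to mirror the steps described.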
and finally it trims the list down to the number of results requested in the limit *or* the configured max limit (512, generally, but the HTTP endpoint will allow 1000 for unauthed users and 10000 for authed users, as it only returns the event id, so the client can paginate them on their side)
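
the trimming itself is simple; a sketch, assuming a requested limit of zero or less means "no limit given" (that convention and the name `trim` are made up here):

```go
package main

// trim cuts the ranked list down to the requested limit, capped by the
// configured maximum (e.g. 512, or 1000 / 10000 on the HTTP endpoint).
// A requested limit <= 0 is treated as "not set" (an assumption).
func trim(ids []string, requested, maxLimit int) []string {
	limit := maxLimit
	if requested > 0 && requested < maxLimit {
		limit = requested
	}
	if len(ids) > limit {
		return ids[:limit]
	}
	return ids
}
```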
i know that all sounds complicated but it's now all written and this will enable a fairly decent relevance sorted full text search for nostr text events.
the hard part now is going to be testing it. i will probably make two endpoints for this; one of them will be disabled later, but it will return the full events in the result, as an array, so i can see them in the api and squash any bugs or logic errors in the search code... and also see how long it takes to produce the results.