-

@ David
2025-06-16 13:58:04
#realy #devstr #progressreport
after cleaning up this nilsimsa hash function i decided to add a new index. it's a little like a bayesian distance number, in that similar texts produce hashes that differ in only a few bits, but it will be precalculated when saving events that are defined as text.
it would, for example, be useful for detecting plagiarism like the yoda-speak bot that is currently doing the rounds on the spam-infested free relays, but i think it would also have an interesting use in finding follows to recommend.
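a minimal sketch of how the plagiarism check could look, assuming each saved text event already has a 32-byte nilsimsa digest stored next to its id. the names `hamming`, `near_duplicates`, the dict-shaped `index` and the `max_bits` threshold are all made up for illustration, and a plain linear scan stands in for whatever the real index would do.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length digests."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def near_duplicates(digest: bytes, index: dict, max_bits: int = 24) -> list:
    """Return ids of stored events whose digest is within max_bits of digest.

    index maps a hypothetical event id to its precalculated 32-byte digest.
    max_bits is an arbitrary illustrative threshold, not a tuned value.
    """
    return [
        event_id
        for event_id, stored in index.items()
        if hamming(digest, stored) <= max_bits
    ]


# Placeholder digests, not real nilsimsa output.
index = {"original": bytes(32), "unrelated": bytes([0xFF]) * 32}
incoming = bytes([0x01]) + bytes(31)  # differs from "original" by one bit
print(near_duplicates(incoming, index))
```

the threshold would need tuning against real posts, since a reworded copy will flip more bits than a verbatim one.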
computing a whole graph of comparison vectors would be insanely expensive, but a similarity score between the posts of two different users could probably still be useful for augmenting WoT graph calculations.
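to make the user-to-user score concrete, here is a rough sketch. the usual nilsimsa comparison score is 128 minus the number of differing bits between two 256-bit digests, so it runs from -128 (opposite) to 128 (identical). averaging that over every pair of posts is just one possible way to combine them, and `user_similarity` is a hypothetical name, not something from realy.

```python
def nilsimsa_score(a: bytes, b: bytes) -> int:
    """128 minus the number of differing bits between two 32-byte digests."""
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 128 - differing


def user_similarity(posts_a: list, posts_b: list) -> float:
    """Mean pairwise score between two users' post digests.

    This does len(posts_a) * len(posts_b) comparisons, which is why
    doing it for every pair of users on a relay gets expensive fast.
    """
    if not posts_a or not posts_b:
        return 0.0
    total = sum(nilsimsa_score(a, b) for a in posts_a for b in posts_b)
    return total / (len(posts_a) * len(posts_b))


# Placeholder digests, not real nilsimsa output.
same = bytes(32)
opposite = bytes([0xFF]) * 32
print(nilsimsa_score(same, same))                   # identical digests
print(user_similarity([same], [same, opposite]))    # one match, one opposite
```

limiting the comparison to pairs of users that are already close in the WoT graph would keep the number of pairs manageable.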
it could also discover the breadth of a user's content by a simple XOR of all of their posts' nilsimsa hashes: the more varied their stuff, the more bits would be left not zeroed. this could then be used as part of a recommendation system as well, in that the wider the variance of a user's content, the more interesting it probably is. it would also pick up a lot on the vocabulary of the user, since more vocabulary would tend to produce a higher score when all of their posts are XORed together.
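the XOR fold described here is small enough to sketch directly. `xor_fold` and `breadth` are hypothetical names, and the digests are hand-made placeholders rather than real nilsimsa output.

```python
def xor_fold(digests: list) -> bytes:
    """XOR a list of 32-byte digests together into one 32-byte value."""
    acc = bytes(32)
    for d in digests:
        acc = bytes(x ^ y for x, y in zip(acc, d))
    return acc


def breadth(digests: list) -> int:
    """Number of bits left set after folding all digests together."""
    return sum(bin(byte).count("1") for byte in xor_fold(digests))


# Placeholder digests, not real nilsimsa output.
a = bytes([0x0F]) + bytes(31)
b = bytes([0xF0]) + bytes(31)
print(breadth([a, b]))     # two different posts
print(breadth([a, a]))     # the same post twice
print(breadth([a, a, a]))  # the same post three times
```

one thing the sketch makes visible is that XOR only tracks parity: a bit set in an even number of posts cancels out, so the same post twice folds to zero and a third copy brings it back. if that turns out to matter, keeping a per-bit count instead of a single XOR would be one alternative, at the cost of a bigger index entry.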
a time series of such values could also be an interesting data set to look at: create a set of interstitial comparisons between successive posts over time and then evaluate the time series in some interesting ways.
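one way to read "interstitial comparisons" is the distance between each post and the one before it. the sketch below assumes that reading, with posts given as hypothetical `(timestamp, digest)` pairs, and `drift_series` is a made-up name.

```python
def drift_series(posts: list) -> list:
    """Return (timestamp, bits_changed) for each post after the first.

    posts is a list of (timestamp, 32-byte digest) pairs in any order.
    bits_changed is the hamming distance to the previous post's digest.
    """
    ordered = sorted(posts, key=lambda p: p[0])
    series = []
    for (_, prev), (ts, cur) in zip(ordered, ordered[1:]):
        changed = sum(bin(x ^ y).count("1") for x, y in zip(prev, cur))
        series.append((ts, changed))
    return series


# Placeholder timestamps and digests, not real nilsimsa output.
posts = [
    (1, bytes(32)),
    (2, bytes([0x01]) + bytes(31)),
    (3, bytes([0x03]) + bytes(31)),
]
print(drift_series(posts))
```

a long run of near-zero values would point at someone reposting the same thing, while a steady spread would suggest varied content.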