Related content

For anyone wanting to pick this up and work on it, a few weeks ago I submitted a PR to @bbalet’s stopwords repo that significantly improves its performance by working with byte slices instead of strings. He’s yet to respond, but we can fork and move on if we want. After working on stopwords, I appreciate his work. I think he’s just too busy to work on it, and I’m too busy right now to take it to the next step.

Having said that, I wanted to share my thoughts. I was thinking that we could implement interfaces for this:

// WordStopper removes stop words for a given language from raw content.
type WordStopper interface {
    RemoveStopWords(content []byte, lang string) (cleaned []byte)
}

// ContentRelater fingerprints content and measures how far apart two
// fingerprints are.
type ContentRelater interface {
    Simhash(content []byte) uint64
    Distance(x, y uint64) int
}

Then implement those interfaces with stopwords and simhash. The interfaces would let us swap in a different implementation later, or even open the mechanism up to plugins, which means we need to make good choices about how we construct and name them. A rough sketch of what that could look like is below.
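To make that concrete, here is a minimal, self-contained sketch of the two implementations. The stop-word set and the word hashing are deliberately naive stand-ins (a real implementation would delegate to the stopwords and simhash packages discussed in this thread), and every name below is illustrative rather than anything that exists in Hugo today:

package related

import (
    "bytes"
    "hash/fnv"
    "math/bits"
)

// naiveStopper is a toy WordStopper: it drops words found in a small
// hard-coded English stop-word set. A real implementation would wrap
// the stopwords package and honour the lang argument.
type naiveStopper struct{}

var englishStopWords = map[string]bool{
    "a": true, "an": true, "and": true, "the": true, "of": true,
    "to": true, "in": true, "is": true, "it": true, "on": true,
}

func (naiveStopper) RemoveStopWords(content []byte, lang string) []byte {
    // lang is ignored in this sketch; only English is handled.
    var kept [][]byte
    for _, w := range bytes.Fields(content) {
        if !englishStopWords[string(bytes.ToLower(w))] {
            kept = append(kept, w)
        }
    }
    return bytes.Join(kept, []byte(" "))
}

// simhasher is a toy ContentRelater using word-level simhash:
// each word is hashed to 64 bits, every bit position votes +1/-1,
// and the sign of each vote becomes a bit of the fingerprint.
type simhasher struct{}

func (simhasher) Simhash(content []byte) uint64 {
    var votes [64]int
    for _, w := range bytes.Fields(content) {
        h := fnv.New64a()
        h.Write(w)
        wordHash := h.Sum64()
        for i := 0; i < 64; i++ {
            if wordHash&(1<<uint(i)) != 0 {
                votes[i]++
            } else {
                votes[i]--
            }
        }
    }
    var fingerprint uint64
    for i, v := range votes {
        if v > 0 {
            fingerprint |= 1 << uint(i)
        }
    }
    return fingerprint
}

// Distance is the Hamming distance between two fingerprints:
// the number of bit positions in which they differ.
func (simhasher) Distance(x, y uint64) int {
    return bits.OnesCount64(x ^ y)
}

The nice part is that hugolib would only ever hold the interface values, so swapping in the real libraries, or a plugin, later would not touch the call sites.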


Hi,

Having related posts integrated in Hugo would be so valuable!

I would like to toss in an idea about this. As I see it, Taxonomies provide a semantic layer that sits on top of the content layer.

So far this discussion has focussed on fast and efficient content representation for Pages using hashes. And although the hashes lend themselves fairly well to direct 1-on-1 document comparison, just tagging on Taxonomies in a sorted/weighted fashion feels extremely kludgy.

If I understand the SimHash function correctly, the hash of two documents concatenated should be the same as, or extremely close to, the sum/average of the individual hashes (roughly 2 * hash("doc A doc B") == hash("doc A") + hash("doc B")). It might be a weighted sum, but in any case it is very simple and cheap to compute.

Taxonomy Terms give authors a way to organise/segment their pages, so each term defines a subset of all pages. A natural representation for a Term would then be the sum of the content hashes of the pages in its subset; the resulting hash is the content representation for that term.

Assuming the site defines N taxonomy terms, we can compare each page hash to the N term hashes, giving each page a new representation as a vector of length N. That vector tells you how much the page relates to each semantic concept.

Now, instead of comparing two document hashes, one would compare two N-vectors. I think cosine similarity is a good metric here; it is fast, though not as fast as Hamming distance. In any case, this should be a much more natural way to incorporate the taxonomy structure into the “page relatedness” measure.
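A minimal sketch of that pipeline, assuming the 64-bit simhash fingerprints discussed earlier, and assuming the term fingerprint is built by a per-bit majority vote over its pages' hashes rather than a literal integer sum (my reading of the idea, nothing decided here); all names are illustrative:

package related

import (
    "math"
    "math/bits"
)

// termFingerprint aggregates the simhashes of all pages carrying a term
// by a per-bit majority vote, which keeps the result within 64 bits.
func termFingerprint(pageHashes []uint64) uint64 {
    var votes [64]int
    for _, h := range pageHashes {
        for i := 0; i < 64; i++ {
            if h&(1<<uint(i)) != 0 {
                votes[i]++
            } else {
                votes[i]--
            }
        }
    }
    var fp uint64
    for i, v := range votes {
        if v > 0 {
            fp |= 1 << uint(i)
        }
    }
    return fp
}

// termVector maps a page hash onto the site's N taxonomy terms:
// component i is the bit-similarity (64 minus the Hamming distance)
// between the page and term i.
func termVector(pageHash uint64, termHashes []uint64) []float64 {
    vec := make([]float64, len(termHashes))
    for i, th := range termHashes {
        vec[i] = float64(64 - bits.OnesCount64(pageHash^th))
    }
    return vec
}

// cosineSimilarity compares two term vectors; 1 means the vectors point
// in the same direction, 0 means they are unrelated.
func cosineSimilarity(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

The most related pages would then be the pair whose term vectors have the highest cosine similarity, which keeps the taxonomy structure in the measure while still storing only one uint64 per page and per term.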

I think the overhead would be small compared to the improvement in the quality of related documents, but I understand Hugo’s focus is on speed. Just wanted to share the idea.

Hey @bep, has anyone taken care of this feature?

No…

@bep would you mind pointing to the right place in the code as a starting point? I’m migrating my website to Hugo and I need this feature. I’ll see if I can implement it.

hugolib.Page – but this is a hard task to get right (it has been discussed before). Finding the correct “starting point” is the trivial part.

I suggest that you sketch a plan/design and discuss it here before you spend too much time implementing something.


@bep sounds good. Thank you.


See (and comment on it if you want):

https://github.com/gohugoio/hugo/pull/3815#issuecomment-326216823