Related content

For anyone wanting to pick this up and work on it, a few weeks ago I submitted a PR to @bbalet’s stopwords repo that significantly improves its performance by working with byte slices instead of strings. He’s yet to respond, but we can fork and move on if we want. After working on stopwords, I appreciate his work. I think he’s just too busy to work on it, and I’m too busy right now to take it to the next step.

Having said that, I wanted to share my thoughts. I was thinking that we could implement interfaces for this:

// WordStopper removes stop words for a given language from raw content.
type WordStopper interface {
    RemoveStopWords(content []byte, lang string) (cleaned []byte)
}

// ContentRelater fingerprints content and measures how far apart two
// fingerprints are.
type ContentRelater interface {
    Simhash(content []byte) uint64
    Distance(x, y uint64) int
}

Then implement those interfaces with stopwords and simhash. The interfaces would let us swap in a different implementation later, or even open the mechanism up to plugins, which means we need to make good choices about how we construct and name them. A rough sketch of what that could look like is below.
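To make that concrete, here is a minimal, self-contained sketch of the two implementations. The stop-word set and the word hashing are deliberately naive stand-ins (a real implementation would delegate to the stopwords and simhash packages discussed in this thread), and every name below is illustrative rather than anything that exists in Hugo today:

package related

import (
    "bytes"
    "hash/fnv"
    "math/bits"
)

// naiveStopper is a toy WordStopper: it drops words found in a small
// hard-coded English stop-word set. A real implementation would wrap
// the stopwords package and honour the lang argument.
type naiveStopper struct{}

var englishStopWords = map[string]bool{
    "a": true, "an": true, "and": true, "the": true, "of": true,
    "to": true, "in": true, "is": true, "it": true, "on": true,
}

func (naiveStopper) RemoveStopWords(content []byte, lang string) []byte {
    // lang is ignored in this sketch; only English is handled.
    var kept [][]byte
    for _, w := range bytes.Fields(content) {
        if !englishStopWords[string(bytes.ToLower(w))] {
            kept = append(kept, w)
        }
    }
    return bytes.Join(kept, []byte(" "))
}

// simhasher is a toy ContentRelater using word-level simhash:
// each word is hashed to 64 bits, every bit position votes +1/-1,
// and the sign of each vote becomes a bit of the fingerprint.
type simhasher struct{}

func (simhasher) Simhash(content []byte) uint64 {
    var votes [64]int
    for _, w := range bytes.Fields(content) {
        h := fnv.New64a()
        h.Write(w)
        wordHash := h.Sum64()
        for i := 0; i < 64; i++ {
            if wordHash&(1<<uint(i)) != 0 {
                votes[i]++
            } else {
                votes[i]--
            }
        }
    }
    var fingerprint uint64
    for i, v := range votes {
        if v > 0 {
            fingerprint |= 1 << uint(i)
        }
    }
    return fingerprint
}

// Distance is the Hamming distance between two fingerprints:
// the number of bit positions in which they differ.
func (simhasher) Distance(x, y uint64) int {
    return bits.OnesCount64(x ^ y)
}

The nice part is that hugolib would only ever hold the interface values, so swapping in the real libraries, or a plugin, later would not touch the call sites.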


Hi,

Having related posts integrated in Hugo would be so valuable!

I would like to toss in an idea about this. As I see it, Taxonomies provide a semantic layer that sits on top of the content layer.

So far this discussion has focussed on fast and efficient content representation for Pages using hashes. And although the hashes lend themselves fairly well to direct 1-on-1 document comparison, just tagging on Taxonomies in a sorted/weighted fashion feels extremely kludgy.

If I understand the SimHash function correctly, the hash of two documents concatenated should be the same as, or extremely close to, the sum/average of the individual hashes (roughly 2 * hash("doc A doc B") == hash("doc A") + hash("doc B")). It might be a weighted sum, but in any case it is very simple and cheap to compute.

Taxonomy Terms give authors a way to organise/segment their pages, so each term defines a subset of all pages. A natural representation for a Term would then be the sum of the content hashes of the pages in its subset; the resulting hash is the content representation for that term.

Assuming the site defines N taxonomy terms, we can compare each page hash to the N term hashes, giving each page a new representation as a vector of length N. That vector tells you how much the page relates to each semantic concept.

Now, instead of comparing two document hashes, one would compare two N-vectors. I think cosine similarity is a good metric here; it is fast, though not as fast as Hamming distance. In any case, this should be a much more natural way to incorporate the taxonomy structure into the “page relatedness” measure.
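A minimal sketch of that pipeline, assuming the 64-bit simhash fingerprints discussed earlier, and assuming the term fingerprint is built by a per-bit majority vote over its pages' hashes rather than a literal integer sum (my reading of the idea, nothing decided here); all names are illustrative:

package related

import (
    "math"
    "math/bits"
)

// termFingerprint aggregates the simhashes of all pages carrying a term
// by a per-bit majority vote, which keeps the result within 64 bits.
func termFingerprint(pageHashes []uint64) uint64 {
    var votes [64]int
    for _, h := range pageHashes {
        for i := 0; i < 64; i++ {
            if h&(1<<uint(i)) != 0 {
                votes[i]++
            } else {
                votes[i]--
            }
        }
    }
    var fp uint64
    for i, v := range votes {
        if v > 0 {
            fp |= 1 << uint(i)
        }
    }
    return fp
}

// termVector maps a page hash onto the site's N taxonomy terms:
// component i is the bit-similarity (64 minus the Hamming distance)
// between the page and term i.
func termVector(pageHash uint64, termHashes []uint64) []float64 {
    vec := make([]float64, len(termHashes))
    for i, th := range termHashes {
        vec[i] = float64(64 - bits.OnesCount64(pageHash^th))
    }
    return vec
}

// cosineSimilarity compares two term vectors; 1 means the vectors point
// in the same direction, 0 means they are unrelated.
func cosineSimilarity(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

The most related pages would then be the pair whose term vectors have the highest cosine similarity, which keeps the taxonomy structure in the measure while still storing only one uint64 per page and per term.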

I think the overhead would be small compared to the improvement in the quality of related documents, but I understand Hugo’s focus is on speed. Just wanted to share the idea.

Hey @bep, has anyone taken care of this feature?

No…

@bep would you mind pointing to the right place in the code as a starting point? I’m migrating my website to Hugo and I need this feature. I’ll see if I can implement it.

hugolib.Page – but this is a hard task to get right (it has been discussed before). Finding the correct “starting point” is the trivial part.

I suggest that you sketch a plan/design and discuss it here before you spend too much time implementing something.


@bep sounds good. Thank you.


See (and comment on it if you want):

https://github.com/gohugoio/hugo/pull/3815#issuecomment-326216823