225.000 docs, 19 taxonomies, 160 minutes

As @michael_henderson puts it so nicely:

It’s weird because I’d have supposed that techies with large sites would be the target but it seems that people are pulled in because of the speed.

I’m also in it for the big and complex websites, and no real showstoppers have turned up yet.

In the beginning it was even big and fast at the same time. Hugo managed to generate my 25.000+ docs website in about 20 minutes.

But speed has gone down significantly since I put on a little more weight, as this small table illustrates:

| Setup | Md-docs | Expired | HTML-files | Languages | Taxonomies | Minutes |
| --- | --- | --- | --- | --- | --- | --- |
| Old | 25.000 | 0 | 170.000 | 2 | 22 | 20 |
| New | 30.000 | 8.500 | 225.000 | 4 | 19 | 160 |

For reference: The “new” run setup has resulted in this website.

Hugo may be the only static site generator capable of doing this in less than a day, and I am grateful for that. Still, this is a substantial increase in rendering time (x8) for just a slight increase in workload.
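To make the imbalance concrete, the growth of each quantity can be computed directly from the table. This is only a small Python sketch using the table's figures; the dictionary keys are illustrative names, not anything Hugo reports.

```python
# Figures copied from the table above (old vs. new setup).
old = {"md_docs": 25_000, "html_files": 170_000, "minutes": 20}
new = {"md_docs": 30_000, "html_files": 225_000, "minutes": 160}

# Growth factor of each quantity between the two runs.
growth = {key: new[key] / old[key] for key in old}

for key, factor in growth.items():
    print(f"{key}: x{factor:.2f}")

# Build time spent per generated HTML file, in milliseconds.
ms_per_file_old = old["minutes"] * 60_000 / old["html_files"]
ms_per_file_new = new["minutes"] * 60_000 / new["html_files"]
print(f"ms per HTML file: {ms_per_file_old:.1f} -> {ms_per_file_new:.1f}")
```

The source documents grew by a factor of 1.2 and the generated files by about 1.3, while the build time grew by a factor of 8.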

The hardware/software setup hasn’t been altered - except for an upgrade from Hugo 0.21 to 0.22.

  1. Do the extra languages force Hugo into more iterations that take extra time?

  2. Do the expired documents take up considerable time for evaluation?

  3. The number of taxonomies in front matter has been increased by approx. 10, and these new ones are not used (not present in config.yaml). The new, inactive taxonomies generally carry a handful of values each. Perhaps it’s time to make a full confession up front. I’m a taxonomy aficionado, and I want every imaginable future requirement covered in those taxonomies. That’s why the total count at present reaches 73 in the new setup. Might these taxonomies (even if they are inactive) be the culprits?

  4. Could anything else have happened that I just haven’t paid attention to?

In other words:
Is a little discipline required even with Hugo’s infinite power at one’s disposal?

Full log from the “new” run:

Jan@JLKM1 MINGW64 /f/data/uv
$ hugo -d /w/uv/public
Started building sites ...
Built site for language en:
0 draft content
0 future content
0 expired content
0 regular pages created
20 other pages created
0 non-page files copied
0 paginator pages created
0 tags created
0 sources created
0 places created
0 related created
0 countries created
0 year created
0 keywordspairs created
0 cities created
0 countriestags created
0 regions created
0 persons created
0 keywordspairsinverse created
0 keywords created
0 documenttype created
0 categories created
0 authors created
0 aspects created
0 countrieskeywordspairs created
Built site for language da:
0 of 11 drafts rendered
0 future content
0 of 8592 expired rendered
29487 regular pages created
55312 other pages created
0 non-page files copied
78363 paginator pages created
18 aar created
2502 kilder created
179 noegleordpar created
20480 steder created
18 noegleord created
5564 relaterede created
179 noegleordparomvendt created
482 regioner created
251 lande created
3141 landenoegleordpar created
13855 personer created
7 dokumenttype created
4 kategorier created
4144 skribenter created
14 aspekter created
2982 byer created
1315 landenoegleord created
150 emner created
0 landeemneord created
Built site for language nb:
0 draft content
0 future content
0 of 82 expired rendered
308 regular pages created
1220 other pages created
0 non-page files copied
1314 paginator pages created
37 byer created
5 dokumenttype created
1 kategorier created
61 noekkelordpar created
13 aar created
185 personer created
61 noekkelordparomvendt created
160 steder created
90 relaterte created
56 regioner created
76 emner created
125 landnoekkelord created
12 aspekter created
0 landemne created
181 landnoekkelordpar created
51 land created
16 noekkelord created
67 skribenter created
Built site for language sv:
0 draft content
0 future content
0 of 60 expired rendered
230 regular pages created
1235 other pages created
0 non-page files copied
1301 paginator pages created
12 aar created
1 kategorier created
161 landenyckelordpar created
219 landeaemnen created
53 nyckelordparomvaent created
48 regioner created
53 lande created
53 nyckelordpar created
67 aemnen created
144 staellen created
17 relaterade created
166 personer created
34 staeder created
16 nyckelord created
9 aspekter created
121 landenyckelord created
33 skribenter created
5 dokumenttyp created
total in 9548677 ms
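
To see at a glance which taxonomies dominate, the "created" lines of such a log can be ranked by count. A small Python sketch, assuming the `<count> <name> created` line shape seen above; the excerpt holds only the seven largest Danish taxonomies, and the variable names are illustrative.

```python
import re

# Excerpt of the "da" section of the log above (largest taxonomies only).
log_excerpt = """\
20480 steder created
13855 personer created
5564 relaterede created
4144 skribenter created
3141 landenoegleordpar created
2982 byer created
2502 kilder created
"""

# Each taxonomy line has the shape "<count> <name> created".
pattern = re.compile(r"^(\d+) (\w+) created$", re.MULTILINE)
terms = {name: int(count) for count, name in pattern.findall(log_excerpt)}

# Print the taxonomies from largest to smallest with their share of the total.
total = sum(terms.values())
for name, count in sorted(terms.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {count:6d}  {count / total:5.1%}")
```

Multi-word lines such as "29487 regular pages created" do not match the pattern, so the same expression should also work on the full log.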

In my tests in your previous thread, it was the number of distinct terms in the taxonomies that had the biggest impact on performance at scale, so “steder” and “personer” are responsible for most of your build time. One thing the build output doesn’t mention is how many articles match each term in those large taxonomies; on average, how many articles contain a particular value of “steder”?
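
One way to answer that question is to count, per term, how many pages carry it. The following is a minimal Python sketch under the assumption that the front matter has already been parsed into dictionaries; the sample pages and the `term_usage` helper are hypothetical.

```python
from collections import Counter

# Hypothetical front matter, already parsed into dicts; a real site would
# load these from its content files.
pages = [
    {"title": "a", "steder": ["Oslo", "Bergen"]},
    {"title": "b", "steder": ["Oslo"]},
    {"title": "c", "steder": ["Oslo", "Malmö", "Bergen"]},
    {"title": "d"},  # page without the taxonomy
]

def term_usage(pages, taxonomy):
    """Count how many pages carry each distinct term of one taxonomy."""
    usage = Counter()
    for page in pages:
        # set() so that a term repeated within one page is counted once
        usage.update(set(page.get(taxonomy, [])))
    return usage

usage = term_usage(pages, "steder")
average = sum(usage.values()) / len(usage)
print(len(usage), "distinct terms,", average, "pages per term on average")
```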

Update:

For comparison, here’s a site with about twice as many regular pages (62,186 content, 77 _index.md), but only one taxonomy, with only 794 terms:

0 draft content
0 future content
0 expired content
62186 regular pages created
872 other pages created
0 non-page files copied
11195 paginator pages created
794 categories created
total in 88824 ms

So, my test has about 2x the files in content, yours has 3x the files in public, but since mine has only one (relatively) small taxonomy, it builds over 100 times faster. It’s also a much simpler site design, I’m sure; for this test, I just took my recipe site and added 47 new content sections containing the contents of [BBQDan]'s archives.
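
The comparison can be checked from the two logs. A rough Python sketch using only the "regular pages created" and "total in ... ms" figures quoted above; the dictionary and function names are illustrative.

```python
# Figures copied from the two build logs above.
# The four-language site: regular pages for da + nb + sv (en has none).
big_taxonomy_site = {"regular_pages": 29_487 + 308 + 230, "total_ms": 9_548_677}
one_taxonomy_site = {"regular_pages": 62_186, "total_ms": 88_824}

def pages_per_second(site):
    """Regular pages rendered per second of total build time."""
    return site["regular_pages"] / (site["total_ms"] / 1000)

speedup = big_taxonomy_site["total_ms"] / one_taxonomy_site["total_ms"]
print(f"build time ratio: {speedup:.1f}x")
print(f"pages/s: {pages_per_second(big_taxonomy_site):.1f} vs "
      f"{pages_per_second(one_taxonomy_site):.1f}")
```

This ignores list, taxonomy, and paginator pages, so it is only a coarse measure of throughput.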

-j
[BBQDan]: Bill Wight's Food and Recipe Page

Doubling the number of languages (sites) may have some effect, as those are processed serially, and with these numbers even small adjustments can have big effects. But Hugo currently isn’t really built for this kind of use, and tagging it with “support” may be stretching it a little.

No offence intended by tagging this post with “support”; I just couldn’t find a more suitable tag. The bottom line is that I’m more than satisfied with the performance of Hugo. That also goes for the perhaps rather extreme conditions that I’m offering. I just wanted some useful input in order to reduce rendering time, and good ideas have turned up already. They’re much appreciated.

No problem, I like reading about how far it is currently possible to stretch it.

Re. taxonomies, my initial conclusion (@jgreely may not agree) is that a surprising amount of resources is spent on decoding front matter; currently, JSON and TOML are much faster than YAML. I did a quick test where I just manually parsed the tags lists (bytes.Split) and this issue vanished … But I cannot do that in a general way, I’m afraid.
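
To illustrate the kind of shortcut meant by "manually parsed", here is a Python analogue of splitting the raw bytes on newlines instead of running a YAML parser. It is a sketch only, not Hugo's code; the `split_tags` helper is hypothetical and handles just the simplest block-list shape.

```python
def split_tags(front_matter: bytes, key: bytes = b"tags") -> list[bytes]:
    """Pull a simple block-style list out of YAML-like front matter by
    splitting on newlines, without a YAML parser.

    Only handles the shape
        key:
        - item
    and stops at the first line after the key that is not a list item.
    """
    tags = []
    in_list = False
    for line in front_matter.split(b"\n"):
        stripped = line.strip()
        if stripped == key + b":":
            in_list = True
        elif in_list and stripped.startswith(b"- "):
            tags.append(stripped[2:].strip())
        elif in_list:
            break
    return tags

sample = b'title: "foo"\ntags:\n- tag1\n- tag2\n'
print(split_tags(sample))
```

The sketch ignores quoting, inline lists such as `[tag1, tag2]`, nesting, and comments, which hints at why such a shortcut is hard to apply in a general way.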

I have some ideas about how to get “Hugo to handle the big, big sites”, but none are trivial.

Does the parsing change if they’re not taxonomies, but just params? If everything still gets parsed, a simple test would be for @JLKM to simply try removing the two largest taxonomies (steder and personer) from his config file and checking the build time.

If you compare to his original thread, he’s added only ~4,000 content pages, but the steder taxonomy went from 2,292 to 20,480 distinct terms.

I still have the files from this test, and when I leave the twenty 10,000-term taxonomies in the front matter but comment them out of the config file, the build time is about the same as when they weren’t in the front matter at all.
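
For anyone who wants to try a similar experiment, test content with many large taxonomies can be generated with a short script. This is a Python sketch under stated assumptions, not the script used for the test above: the `write_test_content` helper, the `tax0` ... `tax19` key names, and the file naming are all hypothetical.

```python
import random
import tempfile
from pathlib import Path

def write_test_content(root, pages=100, taxonomies=20, terms=10_000,
                       terms_per_page=3, seed=0):
    """Write `pages` Markdown files whose YAML front matter carries
    `taxonomies` lists, each drawing from a pool of `terms` distinct values."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for n in range(pages):
        lines = ["---", f'title: "page {n}"']
        for t in range(taxonomies):
            lines.append(f"tax{t}:")
            # pick distinct term numbers for this page and taxonomy
            for term in rng.sample(range(terms), terms_per_page):
                lines.append(f"- term{term}")
        lines += ["---", "", f"Body of page {n}.", ""]
        (root / f"page-{n}.md").write_text("\n".join(lines), encoding="utf-8")

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    write_test_content(tmp, pages=5, taxonomies=2, terms=50)
    file_count = len(list(Path(tmp).glob("*.md")))
    sample = (Path(tmp) / "page-0.md").read_text(encoding="utf-8")
print(file_count)
print(sample)
```

Pointing `root` at a section of the site's content directory and then listing, or commenting out, the generated keys in the config file's taxonomies section would allow timing both variants.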

-j

No, it gets encoded into a map even if you don’t use it. But I see a significant difference between:

title: "foo"

And

title: "foo"
tags:
- tag1
- tag2

(must be the list/array handling; it lights up like a Christmas tree in the benchmarks)

So if you say:

title: "foo"
notataxonomy:
- tag1
- tag2

You will save some taxonomy page creation/rendering, but you will still get a front matter penalty.