Is it possible to preprocess .md files and skip certain blocks before building the pages?

Hi everyone,

I’m wondering if in Hugo it’s possible to skip certain blocks present in a .md file and exclude them from the built page.

In particular I would like to skip something like this:

---
Annotations: 0,6310 SHA-256 0650af9e723d7401b1a63e81582eb7bd  
@Anton Sotkov: 15 20,107 128,186 490,642 1319 1562,632 2197,29 2227,18 2267,356 2624,15 2680,19 2705,21 2732,14 2752,4 2764,2846 5611,699  
@Oliver Reichenstein: 314,176  
@Iain Humm: 2194,3 2226 2245,22 2623 2639,40 2699,6 2726,6 2746,6 2756,8 5610  
...

which is something that iA Writer (a Markdown editor) automatically adds at the end of the file if you use their latest authorship feature.

I reckon this is not standard Markdown and that it shouldn’t be added to the file, but here we are.

In case Hugo doesn’t have a native way to configure blocks to be skipped, I’m going to run a script like this:

#!/bin/bash

# Directory where your Markdown files are located
CONTENT_DIR="path/to/your/content"

# Find all Markdown files and process them
find "$CONTENT_DIR" -type f -name "*.md" | while read -r file; do
    # Use GNU sed to delete the block that starts with a "---" line followed by
    # an "Annotations:" line and ends with a "..." line
    # -i.bak creates a backup before modifying the file
    sed -i.bak '/^---$/{N;/\nAnnotations:/{:a;N;/\n\.\.\.$/!ba;d}}' "$file"
    
    # Optionally, remove the backup file if you're confident the changes are correct
    rm "${file}.bak"
done

in my building pipeline, but if I could use a native way I would be happier.

Thank you so much.

How about using shortcodes to remove the parts of the content you don’t want from an expurgated version of your website? Like this:
{{< politically-incorrect >}}
something most visitors are not meant to read
{{</ politically-incorrect >}}
and in politically-incorrect.html:

{{ if not (index site.Params "expurged-version") }}
{{ .Inner | .Page.RenderString }}
{{  end }}

However, it would have the downside of not appearing in the ToC. But you could also use a regexp to render only an expurgated version in the base/single template, if the switch is active. Probably better. I don’t know how to render specific pages depending on such a config switch, though.

No, it is not.

I suggest logging a feature request with the iA Writer team, asking them to (optionally?) wrap the Markdown Annotations as HTML comments, e.g.,

<!--
---
Annotations: 0,6310 SHA-256 0650af9e723d7401b1a63e81582eb7bd  
@Anton Sotkov: 15 20,107 128,186 490,642 1319 1562,632 2197,29 2227,18 2267,356 2624,15 2680,19 2705,21 2732,14 2752,4 2764,2846 5611,699  
@Oliver Reichenstein: 314,176  
@Iain Humm: 2194,3 2226 2245,22 2623 2639,40 2699,6 2726,6 2746,6 2756,8 5610  
-->

Thanks for your reply Tom, but there are two problems with the solution you propose:

  1. To wrap something with a shortcode, it needs to be visible in my editor and if you are using iA Writer the Annotation section is completely hidden in the text you see (it’s only used to render words in a different color)

  2. If I had to manually edit the page, instead of adding a shortcode, I would just delete that section entirely since it’s not something that is supposed to appear in the rendered page, it’s only used by iA Writer when you use their editor

Hi, I tried to send them feedback, but they replied with:

At the beginning, at the end, separately, YAML… We looked at all of this and there is no perfect solution. There are plenty of considerations next to making it run on Hugo. Something will always break. A separate file would still break the annotation if you add YAML. Dislike non pristine Markdown—but YAML is fine? Hm… its primary purpose is that, as you write, you always know what’s yours. In the end you can saveas/export it without Annotations, and Hugo’s happy

so they don’t seem very open to changing it :confused: If at least they had included the annotations in the YAML, Hugo would have simply ignored them (just like any unknown YAML key)

Then if I were you I’d modify the preproc script to wrap instead of remove. That way it’s reversible if needed. Maybe someday this, or something like it, will become part of the markdown specification (CommonMark and/or GFM), but until then it’s just another proprietary implementation.
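
A minimal sketch of the wrap-instead-of-remove idea, assuming the block always starts with a `---` line followed by an `Annotations:` line and ends with a `...` line. `wrap_annotations` and `unwrap_annotations` are made-up helper names (not part of Hugo or iA Writer); both read Markdown on stdin and write to stdout.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Hugo or iA Writer.
# Both read Markdown on stdin and write the result to stdout.

# Put "<!--" before and "-->" after each annotations block.
wrap_annotations() {
    awk '
    # A "---" line may open an annotations block; peek at the next line.
    # If the previous line is "<!--", the block is already wrapped.
    /^---$/ && !open && prev != "<!--" {
        held = $0
        if ((getline) <= 0) { print held; exit }
        if ($0 ~ /^Annotations:/) { print "<!--"; open = 1 }
        print held
    }
    { print }
    # The "..." line closes the block, so close the comment after it.
    open && /^\.\.\.$/ { print "-->"; open = 0 }
    { prev = $0 }
    '
}

# Reverse of wrap_annotations: drop the comment markers again.
unwrap_annotations() {
    awk '
    # Hold a "<!--" line until we know whether an annotations block follows.
    $0 == "<!--" && !open {
        c = $0
        if ((getline) <= 0) { print c; exit }
        if ($0 == "---") {
            d = $0
            if ((getline) <= 0) { print c; print d; exit }
            if ($0 ~ /^Annotations:/) { open = 1; print d }  # drop "<!--"
            else { print c; print d }
        } else print c
    }
    # Drop the "-->" that closes a wrapped annotations block.
    open && $0 == "-->" { open = 0; next }
    { print }
    '
}
```

Something like `wrap_annotations < post.md > post.wrapped.md` would then run before the build, and `unwrap_annotations` restores the original text if the annotations are needed again.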

I didn’t read well, my bad.

For anyone interested, this is the final version of my script:

#!/bin/bash
# This script is necessary to remove the annotations block from each Markdown file.
# These annotations are being written by iA Writer and are not necessary for the website.

# Define the directory to start searching from. Adjust this to your specific folder.
START_DIR="sub-folder"

# Process each Markdown file in the specified directory and its subfolders.
find "$START_DIR" -type f -name "*.md" | while read -r file; do
    echo "Processing: $file"

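    # Use awk to drop the annotations block: a "---" line followed by an
    # "Annotations:" line, up to and including the closing "..." line.
    awk '
    !skip_block && /^---$/ {
      dashes = $0
      if ((getline) <= 0) { print dashes; next }   # "---" was the last line
      if ($0 ~ /^Annotations:/) { skip_block = 1; next }
      print dashes                                 # ordinary "---" (e.g. front matter): keep it
    }
    skip_block && /^\.\.\.$/ {
      skip_block = 0
      next
    }
    !skip_block { print }
    ' "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"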

    echo "Finished processing: $file"
done

I tested it locally and it works, but it has one issue which I haven’t been able to fix: it will remove any annotations block, even if you put it inside a code block. This means if you want to write a blog post about this thing, you won’t be able to include an Annotations example because it will be stripped out :smiley:
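
One possible way around the code-block problem is to track fenced code blocks while scanning and only strip annotation blocks found outside of them. The sketch below assumes fences start at the beginning of a line with three backticks or three tildes; `strip_annotations` is a made-up helper name that reads Markdown on stdin and writes to stdout.

```shell
#!/bin/bash
# Hypothetical helper: strip annotations blocks, but leave any that sit
# inside a fenced code block untouched.
strip_annotations() {
    awk '
    # Toggle the fence state on lines that open or close a fenced code block.
    !skip && /^(```|~~~)/ { in_fence = !in_fence }

    # Outside a fence, a "---" line may start an annotations block.
    !in_fence && !skip && /^---$/ {
        held = $0
        if ((getline) <= 0) { print held; exit }
        if ($0 ~ /^Annotations:/) { skip = 1; next }
        print held                      # not an annotations block: keep "---"
        if ($0 ~ /^(```|~~~)/) in_fence = !in_fence
    }

    # The literal "..." line closes the annotations block.
    skip && /^\.\.\.$/ { skip = 0; next }

    !skip { print }
    '
}
```

Indented code blocks and fences nested inside other constructs are not handled, so this only covers the simple case of an example shown inside a regular fenced block.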

Hi andreagrandi,

a tip for your script: awk can do multiline patterns, so you could match ---\nAnnotations: until end of file and tell awk to only print non-matching lines. GNU awk also has, I guess, -i inplace for in-place editing.

just installed WSL so if you are interested I could elaborate that.

For Hugo I have the idea to use the .RawContent and render the result yourself.

I wrote a partial where you pass the current page and it will strip off the annotation at the end and then render the HTML.

{{ .RawContent | replaceRE `(?ms:^---\r?\nAnnotations:.*$)` "" | markdownify | safeHTML }}

in your block definition use

{{ partial "skip_annotation" . }}

instead of

{{ .Content }}

have a look at GitHub - irkode/hugodefault at discourse-48649 to see it in action. you may compare with main branch to see the differences.

It’s a standard hugo default repo where I added your Annotation snippet to the homepage/_index.html and post-1.md

Maybe one of our Hugo gurus can give a statement on these:

  • I’m not sure if markdownify | safeHTML is really the same as normal processing of .Content .
  • What about performance?

it’s also possible to apply the regex to the .Content, but then it has to match the rendered content and may depend on the layouts and theme. With Hugo standard it would be something like <hr>\r?\n<p>Annotations.*?</p>

regards

played around to refresh my knowledge on that

for preprocessing try this one (GNU sed on Ubuntu 22.04 LTS)

sed -z 's/[[:space:]]*\n---\s*\nAnnotations:.*/\n/' "$file"
# for in-place editing (incl. backup, written to $file.bak)
sed -z -i.bak 's/[[:space:]]*\n---\s*\nAnnotations:.*/\n/' "$file"

which will

s/                     # substitute
    [[:space:]]*       # any run of whitespace, including blank lines
# followed by
    \n---\s*           # three dashes at the beginning of the next line (maybe some whitespace after)
# followed by
    \nAnnotations:     # 'Annotations:' at the beginning of the next line
# slurping up 
    .*                 # everything to the end of file
# and replace it 
/\n/                   # with just a newline

Guess you are aware that you will lose the Annotations if you commit after your script :wink:
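
If keeping the annotations in the repository matters, one option is to preprocess a throwaway copy of the content tree instead of the working copy. The following is only a sketch under assumptions: it needs GNU sed (for `-z` and `-i`), and `make_clean_copy` is a made-up helper name that prints the path of the cleaned copy.

```shell
#!/bin/bash
# Hypothetical helper: copy a content directory to a temporary location and
# strip the trailing annotations from the copy only.
make_clean_copy() {
    src=$1
    tmp=$(mktemp -d) || return 1
    # Copy the whole tree; the original files are never modified.
    cp -R "$src"/. "$tmp"/ || return 1
    # GNU sed: remove everything from the "---" / "Annotations:" pair
    # to the end of each Markdown file in the copy.
    find "$tmp" -type f -name '*.md' -exec \
        sed -z -i 's/[[:space:]]*\n---[[:space:]]*\nAnnotations:.*/\n/' {} +
    printf '%s\n' "$tmp"
}

# Assumed usage in a build step (requires hugo on PATH):
#   clean=$(make_clean_copy content) && hugo --contentDir "$clean"
```

Because the build reads from the copy, committing after the script has run does not lose anything.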

Hi and thanks for your script! I still haven’t had a chance to test it. I don’t mind losing Annotations. In iA Writer they are mostly useful while you are still writing (to distinguish between copy-pasted text and what you write/change). Once I decide to commit+push it means I’m fine with the end result and I just want to publish the page (without any annotations at the end).

About partials, I’ve never used them so I guess I need to read a bit of docs first.

Does the partial replace having a separate bash script? (I would run the script through my existing CircleCI build pipeline, so it would be a tiny addition to something I already have in place).

p.s: I will try your sed version too. Does it behave differently or is it a much shorter version of the awk based one? Worth saying I did not write the awk version… I asked GPT to do it :sweat_smile:

Hiho,

About partials, I’ve never used them so I guess I need to read a bit of docs first.

Definitely a topic you should be aware of - usually all themes use them under the hood for their layouts.

Does the partial replace having a separate bash script?

Partial processing is a basic feature in Hugo. No extra step needed. Partials will be evaluated when generating the pages without affecting the source markdown. So no separate script necessary.

Does it behave differently or is it a much shorter version of the awk based one?

yes, it will slurp in the whole file at once - guess not a problem

It will remove everything including and after these lines until the end of the file

# some blank lines
---
Annotations:

I thought the three dots in your example meant “some more of these lines”.
But rechecking your awk, it seems that the three dots are the end marker of the Annotations block.
Even weirder that they enclose their stuff in --- and ... since that will break a lot of markdown engines around.
So with that version you will lose all text after the start marker

and just rechecked - same problem with such a block in the middle (also in the partial version)
I’ll elaborate on that, but I need to know how that looks at the end.

  • ... is the closing block sequence?
  • ... are the last characters in the file?
  • is a newline or blanks allowed after it?
  • or is there even more valid markdown after it to keep?

I disagree. Anything inside a shortcode won’t figure inside the Table Of Content, with or without %. Until that is taken care of, external preprocessing will keep its use. What I and that person need, is a means to edit files, whatever they may be (content in that case), under command line control, before anything else is done in hugo.

To clarify, markdown headings within a shortcode called with the {{% %}} notation are included when calling the page’s .TableOfContents method.

Then I can’t replicate that!
file:

{{% content essai %}}

shortcode:

{{- with .Get 0 | site.GetPage -}}
{{.Content}}
{{end}}

essai.md

---
Title: "sds"
---

## attempt

d

With % the HTML code is generated but not rendered as HTML. Instead it goes inside a pair of <code> tags. With < and >, it is rendered correctly but the heading doesn’t show in the ToC. Is this the expected behavior?

Oh… The above is what I used to do. I had to include the content of a snippet in a different file. I don’t understand.
But the rendering of

{{ if eq site.Params.with_expurged true}}
{{.Inner}}
{{end}}

does indeed show in the toc, so thanks, I stand corrected. Long live hugo. But then we can’t preprocess

Just to be sure, I think I did not get the jump from partial to shortcode.

Is this a yes to the question of whether the above renders the complete page?

I can imagine a .Pages loop somewhere else on the site using .Content, which could still have the annotations?

I don’t understand the connection either. Shortcodes are not relevant to this discussion. I responded to correct an erroneous statement about Hugo’s capabilities.

1) .Content != .RawContent | markdownify

2) .Content == .RawContent | .RenderString

With respect to #2, it may not be exactly equal, but it’s close. See the documentation for markdownify to understand why .Page.RenderString is a better choice (it’s related to render hook integrity). Also, you don’t need to pass either through safeHTML because both return the template.HTML data type (i.e., they’re already marked as safe).

My view on this? Parsing .Content or .RawContent is (a) fragile, and (b) additional work. I either wrap the annotations within an HTML comment before the build, or convince the iA Writer team to do so… they’re the ones generating non-compliant markdown.

Agreed, much safer to do it from outside. In a CI build it would not even be destructive.

So let’s wrap it in a separate script. I’m more used to Perl regex than sed hold spaces or awk…

 perl -0777 -i -pe 's/^(.*\n)(---\s*\nAnnotations:.*?\.{3})(.*?)\z/\1<\!--\n\2\n-->\3/xs' FILENAME

which will

  • wrap the block in HTML comments
  • affect the last occurrence only
  • allow other text after it
  • edit in place (-i), so overwriting the original file

explanations on the regex if needed after answering the End of File behaviour. :slight_smile:

cheers