Convert HTML to Markdown with PowerShell


Hi All,

About a year ago I moved my blog from Subtext to Hugo.

Since then I have been writing my blog articles in Markdown, and I am pretty happy with that.

I exported all the old blog articles from Subtext, together with the relevant information, to HTML. That HTML code contains a lot of unnecessary formatting. While Hugo can use HTML as input, the old pages sometimes do not render very well.

I was looking for a way to convert HTML to Markdown and found the MarkdownPrince module on the PowerShell Gallery.

Installing the PowerShell Module

Install-PSResource MarkdownPrince -Scope CurrentUser
Get-InstalledPSResource MarkdownPrince -Scope CurrentUser

Converting the HTML with “-UnknownTags drop” leaves me with empty output.

$htmlfile = "C:\GIT_WorkingDir\blog.icewolf.ch\blog.icewolf.ch\content\202304\exchange-online-sends-now-dmarc-aggregate-reports.html"
#Convert HTML to Markdown
[string]$Markdown = ConvertFrom-HTMLToMarkdown -Path $HTMLfile -UnknownTags drop -GithubFlavored
$Markdown

Converting the HTML with “-UnknownTags bypass” gives me a better result. But the front matter, which tells Hugo some details about the article, is destroyed because the line breaks (CRLF) are stripped out of the result.

$htmlfile = "C:\GIT_WorkingDir\blog.icewolf.ch\blog.icewolf.ch\content\202304\exchange-online-sends-now-dmarc-aggregate-reports.html"
#Convert HTML to Markdown
[string]$Markdown = ConvertFrom-HTMLToMarkdown -Path $HTMLfile -UnknownTags bypass -GithubFlavored
$Markdown

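I need a way to extract the front matter from the original HTML file.

#Extract FrontMatter
$html = Get-Content -Path $htmlfile -Raw
$start = $html.IndexOf("---")
#Look for the closing delimiter after the opening one
$end = $html.IndexOf("---",$start+3)
#Substring takes a start index and a length, not an end index
$frontmatter = $html.Substring($start,$end-$start+3)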
$frontmatter
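
To make the offsets easier to follow, here is a minimal, self-contained sketch of the same extraction on an in-memory string instead of a file. The function name `Get-FrontMatter` and the sample text are assumptions made up for this illustration. The second argument of `Substring` is a length, so the sketch computes it relative to the start index.

```powershell
# Hypothetical helper: returns the front matter block including both '---' delimiters,
# or $null when no complete block is found.
function Get-FrontMatter {
    param([Parameter(Mandatory)][string]$Text)

    $start = $Text.IndexOf('---')
    if ($start -lt 0) { return $null }

    # Search for the closing delimiter after the opening one
    $end = $Text.IndexOf('---', $start + 3)
    if ($end -lt 0) { return $null }

    # Substring(startIndex, length): the length runs up to and including the closing '---'
    return $Text.Substring($start, $end - $start + 3)
}

# Sample input (invented for this example)
$sample = "---`ntitle: Demo`ndate: 2023-04-01`n---`n<p>Hello</p>"
$frontmatter = Get-FrontMatter -Text $sample
$frontmatter
```

Returning `$null` when a delimiter is missing avoids an exception from `Substring` on files that have no front matter.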

I remove the flattened front matter from the converted output, replace it with the original front matter extracted in the step above, and save everything as a Markdown (*.md) file.

#Remove converted FrontMatter and add original FrontMatter
$mdstart = $Markdown.IndexOf("---",4)
$mdend = $Markdown.Length - $mdstart -3
$MD = $frontmatter + $Markdown.Substring($mdstart+3,$mdend)

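#Change the file extension to .md (a plain Replace("html","md") would also change "html" elsewhere in the path)
$mdfile = [System.IO.Path]::ChangeExtension($htmlfile, ".md")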

#Save converted Markdown
Set-Content -path $Mdfile -value $md
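
The splice and the file name handling can be tried out on sample strings before touching real files. The following is only a sketch: the sample values, the path `C:\html-export\post.html`, and the variable names `$body` and `$riskyName` are invented for illustration, and the flattened one-line front matter is an assumption about what the converted output roughly looks like. It also trims leading whitespace from the body and adds a newline after the front matter, which the script above does not do.

```powershell
# Sample values (invented for this example)
$frontmatter = "---`ntitle: Demo`ndate: 2023-04-01`n---"
$Markdown    = "--- title: Demo date: 2023-04-01 --- # Heading`n`nBody text"

# Locate the closing '---' of the flattened front matter in the converted output
$mdstart = $Markdown.IndexOf('---', 3)

# Substring(startIndex) with a single argument returns everything up to the end of the string
$body = $Markdown.Substring($mdstart + 3).TrimStart()
$md   = $frontmatter + "`n" + $body

# Path.ChangeExtension only touches the extension; String.Replace changes every match
$htmlfile  = 'C:\html-export\post.html'
$mdfile    = [System.IO.Path]::ChangeExtension($htmlfile, '.md')
$riskyName = $htmlfile.Replace('html', 'md')

$md
$mdfile
$riskyName
```

In this sample `$riskyName` also rewrites the folder name, which is why changing only the extension is the safer choice when a path may contain "html" in more than one place.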

The result is very good, although it still needs some polishing. That is far less work than converting everything manually.

Once again, a small PowerShell script has saved me a ton of work 😍

Regards
Andres Bohren
