Convert HTML to Markdown with PowerShell
Hi All,
About a year ago, I moved my blog from Subtext to Hugo.
Since then I have been writing my blog articles in Markdown, and I am pretty happy with that.
I exported all the old blog articles from Subtext, together with the relevant information, to HTML. That HTML code contains a lot of unnecessary formatting. While Hugo can use HTML as input, the old pages sometimes do not render very well.
I was looking for a way to convert HTML to Markdown and found the MarkdownPrince PowerShell module on the PowerShell Gallery.
Installing the PowerShell Module
Install-PSResource MarkdownPrince -Scope CurrentUser
Get-InstalledPSResource MarkdownPrince -Scope CurrentUser
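A quick check that the module can be found, and which commands it exports, might look like the following. This is only a sketch: it uses the built-in Get-Module and Get-Command cmdlets, and the $module and $status variables are names I made up for the example.

```powershell
# Look for the module in the module path without importing it
$module = Get-Module -ListAvailable -Name MarkdownPrince | Select-Object -First 1

if ($module) {
    # List the commands the module exports
    Get-Command -Module MarkdownPrince | Select-Object -Property Name, CommandType
    $status = "MarkdownPrince $($module.Version) is installed"
} else {
    $status = 'MarkdownPrince was not found'
}
$status
```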
Converting the HTML with “-UnknownTags drop” leaves me with empty output
$htmlfile = "C:\GIT_WorkingDir\blog.icewolf.ch\blog.icewolf.ch\content\202304\exchange-online-sends-now-dmarc-aggregate-reports.html"
#Convert HTML to Markdown
[string]$Markdown = ConvertFrom-HTMLToMarkdown -Path $HTMLfile -UnknownTags drop -GithubFlavored
$Markdown
Converting the HTML with “-UnknownTags bypass” gives me a better result. But the front matter that tells Hugo some details about the article has been destroyed, because the line breaks (CRLF) have been stripped out of the result
$htmlfile = "C:\GIT_WorkingDir\blog.icewolf.ch\blog.icewolf.ch\content\202304\exchange-online-sends-now-dmarc-aggregate-reports.html"
#Convert HTML to Markdown
[string]$Markdown = ConvertFrom-HTMLToMarkdown -Path $HTMLfile -UnknownTags bypass -GithubFlavored
$Markdown
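I need a way to extract the front matter from the original HTML file
#Extract front matter (the block between the first two "---" markers)
$html = Get-Content -Path $htmlfile -Raw
$start = $html.IndexOf("---")
#Search for the closing marker after the opening one
$end = $html.IndexOf("---", $start + 3)
#Substring takes a start index and a length, not an end index
$frontmatter = $html.Substring($start, $end - $start + 3)
$frontmatter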
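As an alternative to the index arithmetic above, the front matter can also be pulled out with a regular expression. This is just a sketch under assumptions: the $sample text is made up for illustration, and the pattern expects the front matter to sit at the very top of the file between two lines containing only "---".

```powershell
# Hypothetical sample: Hugo front matter followed by exported HTML
$sample = @'
---
title: "Demo"
date: 2023-04-01
---
<p>Hello <b>World</b></p>
'@

# (?s) lets "." match line breaks, \A anchors at the start of the string,
# and the lazy ".*?" stops at the first closing "---" line
$pattern = '(?s)\A\s*(---\r?\n.*?\r?\n---)'

if ($sample -match $pattern) {
    $frontmatter = $Matches[1]                      # the front matter block
    $body = $sample.Substring($Matches[0].Length)   # everything after it
} else {
    $frontmatter = ''
    $body = $sample
}
$frontmatter
```

The same pattern handles both CRLF and LF line endings, so it should not matter how the HTML file was saved.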
I remove the converted front matter from the output, replace it with the original front matter extracted in the step above, and save everything as a Markdown (*.md) file
#Remove converted front matter and add original front matter
$mdstart = $Markdown.IndexOf("---",4)
#Keep everything after the second "---" marker, on a new line
$MD = $frontmatter + [Environment]::NewLine + $Markdown.Substring($mdstart+3)
#Replace only the file extension with .md (Replace("html","md") would also change folder names containing "html")
$mdfile = [System.IO.Path]::ChangeExtension($htmlfile, ".md")
#Save converted Markdown
Set-Content -Path $mdfile -Value $MD
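To run the steps above over a whole folder of exported articles, they could be bundled into two small functions. This is a rough sketch, not a finished tool: Merge-FrontMatter and Convert-HtmlFolder are hypothetical names I picked for this example, and the wrapper assumes that ConvertFrom-HTMLToMarkdown from MarkdownPrince is installed and is called with the same parameters as above.

```powershell
# Hypothetical helper: swap the flattened front matter in the converted
# Markdown for the original front matter taken from the HTML text.
function Merge-FrontMatter {
    param(
        [string]$Html,
        [string]$Markdown
    )
    # Original front matter: from the first "---" to the second one
    $start = $Html.IndexOf('---')
    $end = if ($start -ge 0) { $Html.IndexOf('---', $start + 3) } else { -1 }

    # The converted front matter ends at the second "---" in the Markdown
    $mdFirst = $Markdown.IndexOf('---')
    $mdSecond = if ($mdFirst -ge 0) { $Markdown.IndexOf('---', $mdFirst + 3) } else { -1 }

    if ($end -lt 0 -or $mdSecond -lt 0) {
        # No complete front matter found: return the Markdown unchanged
        return $Markdown
    }

    $frontmatter = $Html.Substring($start, $end - $start + 3)
    $body = $Markdown.Substring($mdSecond + 3).TrimStart()
    return $frontmatter + [Environment]::NewLine + [Environment]::NewLine + $body
}

# Hypothetical wrapper: convert every *.html file in a folder to *.md.
# Assumes the MarkdownPrince module is installed.
function Convert-HtmlFolder {
    param([Parameter(Mandatory)][string]$Folder)

    foreach ($file in Get-ChildItem -Path $Folder -Filter *.html -File) {
        $html = Get-Content -Path $file.FullName -Raw
        [string]$markdown = ConvertFrom-HTMLToMarkdown -Path $file.FullName -UnknownTags bypass -GithubFlavored
        $mdfile = [System.IO.Path]::ChangeExtension($file.FullName, '.md')
        Set-Content -Path $mdfile -Value (Merge-FrontMatter -Html $html -Markdown $markdown)
    }
}
```

Keeping the text handling in Merge-FrontMatter separate from the file handling makes it possible to try the splice on plain strings first, before letting Convert-HtmlFolder write any files.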
The result is very good, but it still needs some polishing. That is still far less work than converting everything manually.
Once again, a small PowerShell script has saved me a ton of work 😍
Regards
Andres Bohren