stripformat.pl is a script that removes all the ugly MS FrontPage and MS Word HTML. It does not strip all the HTML code (though that is part of it); it strips the styling, such as <o:p> or class=MsoNormal, etc.
I originally based this on a script written by Duramecho titled StripFormattingFromWordGeneratedHtml.pl. The scope of what I wanted was a little different, so this script is more interactive via the command line and is geared more toward removing all styling and formatting from a page than just Word-generated HTML.
This script will remove:
- class="Mso*" attributes
- <font> tags
- <o:p> and <city:state> type tags
- <span> tags
- <style> blocks and style="..." attributes
- <script> blocks
- empty <b></b> pairs
- id=Autonumber1 from table tags

Download the version 4 stripformat.pl script
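
To make the removals listed above concrete, here is a minimal sketch of the same idea using regular expressions. It is written in Python for illustration only and is not taken from stripformat.pl. The function name strip_formatting and every pattern in it are assumptions, including the choice to treat any id of the form Autonumber followed by digits like id=Autonumber1 and to treat whitespace-only bold pairs as empty. Regex-based cleanup like this is only a rough approach and can misfire on unusual markup.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Attribute patterns (hypothetical; chosen to mirror the list of removals).
MSO_CLASS = re.compile(r"\s+class\s*=\s*(\"Mso[^\"]*\"|'Mso[^']*'|Mso[^\s>]*)", FLAGS)
STYLE_ATTR = re.compile(r"\s+style\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)", FLAGS)
AUTONUMBER_ID = re.compile(r"\s+id\s*=\s*[\"']?Autonumber\d+[\"']?", FLAGS)


def _clean_tag(match):
    """Strip styling attributes from a single tag."""
    tag = match.group(0)
    tag = MSO_CLASS.sub("", tag)
    tag = STYLE_ATTR.sub("", tag)
    # Assumption: the Autonumber id is only removed from <table> tags.
    if tag.lower().startswith("<table"):
        tag = AUTONUMBER_ID.sub("", tag)
    return tag


def strip_formatting(html):
    """Remove Word/FrontPage styling from an HTML string (illustrative sketch)."""
    # Drop <style> and <script> blocks together with their contents.
    html = re.sub(r"<(style|script)\b[^>]*>.*?</\1\s*>", "", html, flags=FLAGS)
    # Drop namespaced tags such as <o:p> or <city:state>, keeping inner text.
    html = re.sub(r"</?\w+:\w+\b[^>]*>", "", html, flags=FLAGS)
    # Drop <font> and <span> tags, keeping inner text.
    html = re.sub(r"</?(font|span)\b[^>]*>", "", html, flags=FLAGS)
    # Clean the attributes of every remaining tag.
    html = re.sub(r"<[^<>]+>", _clean_tag, html)
    # Drop bold pairs that are empty or contain only whitespace.
    html = re.sub(r"<b\b[^>]*>\s*</b\s*>", "", html, flags=FLAGS)
    return html


if __name__ == "__main__":
    sample = (
        "<p class=MsoNormal style='margin:0'><span lang=EN-US>"
        '<font face="Arial">Hello</font></span><o:p></o:p><b></b></p>'
    )
    print(strip_formatting(sample))  # prints <p>Hello</p>
```

Attributes are cleaned tag by tag rather than across the whole document, so text in the page body that merely looks like class="Mso..." is left alone, and classes that do not start with Mso are kept.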