This is the third in a series of articles on my migration to WordPress. In this post, I’ll talk about how I removed non-core material from the website before I imported it. This is vital because WordPress adds its own headers and footers, which my website already contains. The first step is to remove all that cruft before creating the input file for the RSS Importer (first used in the previous post).
The RSS file could be created manually, but that would be quite a bit of work. Instead, I’ll modify and use an existing program. Perhaps you can do something similar for your situation; we’ll talk about how to do that in a future post.
Where’s the cruft
If you look at the existing HTML for one of my articles, you’ll find a lot of stuff that’s just not required for WordPress. If you’ve written with any blogging software before, you’ll know you don’t usually have to deal with <html>, <head>, <body>, etc. You just type stuff. WordPress figures out the rest. Well… that’s true for my website too.
Compare this code with the code that follows it. The first is a complete HTML file, as copied directly from the website. The second contains only the basic post. You’ll see the difference.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <TITLE>The FreeBSD Diary -- FreeBSD 4.1.1-RELEASE</TITLE> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <link rel="stylesheet" type="text/css" href="/css/site.css"> <meta name="title" content="FreeBSD 4.1.1-RELEASE"> <link rel="image_src" href="images/freebsddiary-logo.gif"> <META NAME="description" CONTENT="A new tag in the tree"> <META NAME="keywords" CONTENT="freebsd,diary,FreeBSD"> <LINK REL="SHORTCUT ICON" HREF="/favicon.ico" type="image/x-icon"> <LINK REL="ICON" HREF="/favicon.ico" type="image/x-icon"> <meta name="MSSmartTagsPreventParsing" content="TRUE"> <META http-equiv="Pragma" CONTENT="no-cache"> <META HTTP-EQUIV="Expires" CONTENT="0"> <META HTTP-EQUIV="Cache-Control" CONTENT="no-cache"> <META HTTP-EQUIV="Pragma-directive" CONTENT="no-cache"> <META HTTP-EQUIV="cache-directive" CONTENT="no-cache"> <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"> <META NAME="ROBOTS" CONTENT="NOARCHIVE"> <meta http-equiv="pics-label" content='(pics-1.1 "http://www.icra.org/ratingsv02.html" comment "ICRAonline v2.0" l gen true for "http://www.freebsddiary.org/" r (nz 1 vz 1 lz 1 oz 1 ca 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.freebsddiary.org/" r (n 0 s 0 v 0 l 0))'> <META http-equiv="PICS-Label" content='(PICS-1.1 "http://www.classify.org/safesurf/" l gen true for "http://www.freebsddiary.org/" r (SS~~000 1))'> <meta http-equiv="Certification" content='"http://www.ufcws.org/license.html" l r true comment "United Federation of ChildSafe Web Sites" for "http://www.freebsddiary.org" on "2002.08.10" r"'> <!-- Begin ICCS Certified Web Site Statement --> <!-- VERIFICATION="PICS-Label / ICCS_Certification" CONTENT="ICCS Certified Web Site" RATING="Class 1 SafeSurf" and/or "Class 2 ICRA" TASK="ICCS-Certification under the iWatchDog Program for "http://www.freebsddiary.org" EXPIRATION_DATE="2003.08.10"//--> <!-- End Statement --> <link rel="alternate" 
type="application/rss+xml" title="The FreeBSD Diary" href="http://www.freebsddiary.org/news.php"> </HEAD> <BODY BGCOLOR="#FFFFFF" TEXT="#000000"> <table border="0" cellspacing="2"> <tr> <TD ALIGN="right" CLASS="sans" VALIGN="middle"><h1>The FreeBSD Diary</h1></TD></tr></table> <TABLE WIDTH="100%" CELLPADDING="0" CELLSPACING="0" BORDER="0" align="center"> <TR> <TD width="450" ALIGN="right" VALIGN="top"><A HREF="/"><IMG SRC="/images/freebsddiary-logo.gif" ALT="The FreeBSD Diary" WIDTH="431" HEIGHT="98" BORDER="0"></A></td> <td width="5" align="left" ><small><a href="/other-copyrights.php">(TM)</a></small></td> <TD ALIGN="right" CLASS="sans" VALIGN="bottom"><h3>Providing practical examples since 1998</h3></TD> </TR> </TABLE> <div id="linksheader" align="right"> [ <A HREF="/">HOME</A> | <A HREF="/topics.php">TOPICS</A> | <A HREF="/chronological.php">INDEX</A> | <A HREF="/help.php">WEB RESOURCES</A> | <A HREF="/booksmags.php">BOOKS</A> | <A HREF="/contribute.php">CONTRIBUTE</A> | <A HREF="/search.php">SEARCH</A> | <A HREF="/feedback.php">FEEDBACK</A> | <A HREF="/faq.php">FAQ</A> | <A HREF="/phorum/">FORUMS</A> ] </div> <TABLE WIDTH="100%" ALIGN="center" BORDER="0"> <TR><TD VALIGN="top"> <TABLE WIDTH="100%" BORDER="0"> <tr><td align="center"> <script language='JavaScript' type='text/javascript' src='http://ads.unixathome.org/phpPgAds/adx.js'></script> <script language='JavaScript' type='text/javascript'> <!-- if (!document.phpAds_used) document.phpAds_used = ','; phpAds_random = new String (Math.random()); phpAds_random = phpAds_random.substring(2,11); document.write ("<" + "script language='JavaScript' type='text/javascript' src='"); document.write ("http://ads.unixathome.org/phpPgAds/adjs.php?n=" + phpAds_random); document.write ("&what=zone:50"); document.write ("&exclude=" + document.phpAds_used); if (document.referrer) document.write ("&referer=" + escape(document.referrer)); document.write ("'><" + "/script>"); //--> </script><noscript><a 
href='http://ads.unixathome.org/phpPgAds/adclick.php?n=ab34d058' target='_blank'><img src='http://ads.unixathome.org/phpPgAds/adview.php?what=zone:50&n=50' border='0' alt=''></a></noscript> </td></tr> <TR> <TD> <div class="heading"> <span class="left">FreeBSD 4.1.1-RELEASE</span> <span class="right">26 September 2000</span> </div> </TD> </TR> <TR><TD ALIGN="right"><div style="float:left"><a name="fb_share" type="button_count" href="http://www.facebook.com/sharer.php">Share</a><script src="http://static.ak.fbcdn.net/connect.php/js/FB.Share" type="text/javascript"></script></div><div class="headingmoreinfo">Need more help on this topic? <A HREF="/phorum/post.php?f=1">Click here</A><BR>This article has <a href="phorum/post.php?f=3&article_id=400">no comments</a><br>Show me <a href="topics.php?aid=400">similar articles</a><br></div></TD></TR> <TR> <TD> For your information: the FreeBSD source tree has been tagged for RELENG_4_1_1_RELEASE.<P>What does this mean? At this point in my day (early morning) I have no idea.</P> <P>But it means you can now get 4.1.1-RELEASE via cvsup. 
When I get back from work, I'll see what I know then.</TD> </TR> <TR> <TD></TD> </TR> </TABLE> </TD><TD VALIGN="top" WIDTH="140"> <div align="center"> <script language='JavaScript' type='text/javascript' src='http://ads.unixathome.org/phpPgAds/adx.js'></script> <script language='JavaScript' type='text/javascript'> <!-- if (!document.phpAds_used) document.phpAds_used = ','; phpAds_random = new String (Math.random()); phpAds_random = phpAds_random.substring(2,11); document.write ("<" + "script language='JavaScript' type='text/javascript' src='"); document.write ("http://ads.unixathome.org/phpPgAds/adjs.php?n=" + phpAds_random); document.write ("&what=zone:54"); document.write ("&exclude=" + document.phpAds_used); if (document.referrer) document.write ("&referer=" + escape(document.referrer)); document.write ("'><" + "/script>"); //--> </script><noscript><a href='http://ads.unixathome.org/phpPgAds/adclick.php?n=a02ff4f4' target='_blank'><img src='http://ads.unixathome.org/phpPgAds/adview.php?what=zone:54&n=54' border='0' alt=''></a></noscript> </div> </TD></TR> </TABLE> <div align="center"> <br> <script language='JavaScript' type='text/javascript' src='http://ads.unixathome.org/phpPgAds/adx.js'></script> <script language='JavaScript' type='text/javascript'> <!-- if (!document.phpAds_used) document.phpAds_used = ','; phpAds_random = new String (Math.random()); phpAds_random = phpAds_random.substring(2,11); document.write ("<" + "script language='JavaScript' type='text/javascript' src='"); document.write ("http://ads.unixathome.org/phpPgAds/adjs.php?n=" + phpAds_random); document.write ("&what=zone:50"); document.write ("&exclude=" + document.phpAds_used); if (document.referrer) document.write ("&referer=" + escape(document.referrer)); document.write ("'><" + "/script>"); //--> </script><noscript><a href='http://ads.unixathome.org/phpPgAds/adclick.php?n=a0185fbb' target='_blank'><img src='http://ads.unixathome.org/phpPgAds/adview.php?what=zone:50&n=50' border='0' 
alt=''></a></noscript> </div> <TABLE WIDTH="100%" CELLPADDING="3" CELLSPACING="0" BORDER="0"> <TR><TD ALIGN="right"><div style="float:left"><a name="fb_share" type="button_count" href="http://www.facebook.com/sharer.php">Share</a><script src="http://static.ak.fbcdn.net/connect.php/js/FB.Share" type="text/javascript"></script></div><div class="headingmoreinfo">Need more help on this topic? <A HREF="/phorum/post.php?f=1">Click here</A><BR>This article has <a href="phorum/post.php?f=3&article_id=400">no comments</a><br>Show me <a href="topics.php?aid=400">similar articles</a><br></div></TD></TR> <TR> <TD ALIGN="center"> <div id="linksfooter"> [ <A HREF="/">HOME</A> | <A HREF="/topics.php">TOPICS</A> | <A HREF="/chronological.php">INDEX</A> | <A HREF="/help.php">WEB RESOURCES</A> | <A HREF="/booksmags.php">BOOKS</A> | <A HREF="/contribute.php">CONTRIBUTE</A> | <A HREF="/search.php">SEARCH</A> | <A HREF="/feedback.php">FEEDBACK</A> | <A HREF="/faq.php">FAQ</A> | <A HREF="/phorum/">FORUMS</A> ] </div> </TD> </TR> </TABLE> <div class="footer"> <span class="left"> Servers and bandwidth provided by <A HREF="http://www.nyi.net/" TARGET="_new">New York Internet</A> and <A HREF="http://www.supernews.com/" TARGET="_new">SuperNews</A> </span> <span class="right"> Valid <a href="http://validator.w3.org/check/referer">HTML</a>, <a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a> , and <a href="http://feedvalidator.org/check.cgi?url=http://www.freebsddiary.org/news.php">RSS</a>.<BR> <A HREF="/legal.php">Copyright</a> © 1997-2012 <A HREF="http://www.dvl-software.com/">DVL Software Ltd.</A><BR>All rights reserved. </span> </div> </body> </html>
The above all just boils down to these few lines of code:
<h3 class="heading">FreeBSD 4.1.1-RELEASE</h3> For your information: the FreeBSD source tree has been tagged for RELENG_4_1_1_RELEASE.<P>What does this mean? At this point in my day (early morning) I have no idea.</P> <P>But it means you can now get 4.1.1-RELEASE via cvsup. When I get back from work, I'll see what I know then.
So how did I get there? I modified the code which supplied those headers, footers, and sidebars. If you don’t have such common functions, I’m sorry, but you’ll have to either write a program or script to alter all your individual files, or do it by hand.
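If you do end up scripting it, a minimal sketch might look like the following. It assumes, purely for illustration, that each page keeps its article body between two marker comments, each on a line of its own: `<!-- content-start -->` and `<!-- content-end -->`. My pages have no such markers, so substitute whatever reliably brackets the content in your files. The function names `extract_body` and `strip_all` and the `stripped/` output directory are made up for this example.

```shell
#!/bin/sh
# Sketch only: the marker comments below are hypothetical; use whatever
# text reliably surrounds the article body in your own pages.
START_MARKER='<!-- content-start -->'
END_MARKER='<!-- content-end -->'

# Print the lines strictly between the two marker lines of one file.
extract_body() {
    awk -v begin_marker="$START_MARKER" -v end_marker="$END_MARKER" '
        index($0, end_marker)   { inside = 0 }   # stop at the end marker
        inside                  { print }        # emit lines between markers
        index($0, begin_marker) { inside = 1 }   # start after the start marker
    ' "$1"
}

# Write a stripped copy of every page into stripped/; originals are untouched.
strip_all() {
    mkdir -p stripped
    for f in *.html; do
        [ -f "$f" ] || continue
        extract_body "$f" > "stripped/$f"
    done
}
```

Running `strip_all` in a directory of pages leaves the originals alone and writes the reduced copies next to them, which makes it easy to compare before and after.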
Code modification
When I modified my website for import, I set up a duplicate website and started there. This used a different hostname, but provided the same content. Then I went to work on that new installation. For example, this is the code which displays a section title. That’s the bit with a horizontal yellow box and a title in it. This shows the patch:
function diary_SectionHeader() {
-echo '
- </TD>
- </TR>
-
-
-';
-
echo diary_BannerSpace();
echo diary_BannerSection();
-
-echo '
- <TR>
- <TD>
- <P>
-';
}
Now, that may seem cryptic, but here is the before code:
function diary_SectionHeader() {
echo '
 </TD>
 </TR>


';

echo diary_BannerSpace();
echo diary_BannerSection();

echo '
 <TR>
 <TD>
 <P>
';
}
My website made heavy use of tables and rows. But no more. I’m removing all that. The resulting code for that section header is:
function diary_SectionHeader($Title) {
    echo diary_BannerSpace();
    echo diary_BannerSection($Title);
}
And the simple section headers are now produced by this:
function diary_BannerSection($Title) {
    // Emit the section title as a plain heading; no table markup needed.
    echo "<h3 class=\"section\">$Title</h3>";
}
I also tossed out all the code that created the <html>, <head>, and <body> sections. They are no longer required. WordPress will do all that for me.
This was an iterative process. I would make some changes, look at the website, find something else to change, and repeat. Eventually, all the cruft was gone.
OK, but that’s not enough
The above process deals with the common material that appears on every page. But each page also contains markup of its own that needs to go. In my case, I’m using Perl to clean that up. What kind of markup? Here are the commands I ran:
520 perl -pi -e 's:</body>::g' *.php
523 perl -pi -e 's:</html>::g' *.php
527 perl -pi -e 's:<TR><TD>::g' *.php
529 perl -pi -e 's:<TD height="10"></TD>::g' *.php
531 perl -pi -e 's:</TD></TR>::g' *.php
539 perl -pi -e 's:</TD>::g' *.php
540 perl -pi -e 's:</td>::g' *.php
541 perl -pi -e 's:</tr>::g' *.php
542 perl -pi -e 's:</TR>::g' *.php
544 perl -pi -e 's:<tr>::g' *.php
545 perl -pi -e 's:<TR>::g' *.php
546 perl -pi -e 's:<TD>::g' *.php
547 perl -pi -e 's:<tr>::g' *.php
549 perl -pi -e 's:<td>::g' *.php
560 perl -pi -e 's:^<P>$::g' *.php
561 perl -pi -e 's:^<p>$::g' *.php
562 perl -pi -e 's:^</p>$::g' *.php
563 perl -pi -e 's:^</P>$::g' *.php
566 perl -pi -e 's:^</TABLE>$::g' *.php
These come from my shell command history (the leading numbers are history entry numbers) and show Perl editing the files in place to remove various tags. For example, line 566 strips the tag from every line that contains only </TABLE>, leaving an empty line behind. Not shown are the tests I ran first on individual files to verify that each command was correct.
Eventually, through much trial and error, I stripped the website down to its basics: something I was ready to import into WordPress. This bulk import will be the subject of the next post.