Oct 14 2012
 

This is the third in a series of articles on my migration to WordPress. In this post, I’ll talk about how I removed non-core material from the website before I imported it. This is vital because WordPress adds its own headers and footers, which my website already contains. The first step is to remove all that cruft before creating the input file for the RSS Importer (first used in the previous post).

The RSS file could be created manually, but that would be quite a bit of work. Instead, I’ll use and modify an existing program. Perhaps you can do something similar for your situation; we’ll talk about how to do that in a future post.

Where’s the cruft?

If you look at the existing HTML for one of my articles, you’ll find a lot of stuff that’s just not required for WordPress. If you’ve written with any blogging software before, you’ll know you don’t usually have to deal with <html>, <head>, <body>, etc. You just type stuff. WordPress figures out the rest. Well… that’s true for my website too.

Compare this code with the code that follows it. The first is a complete HTML file, as copied directly from the website. The second contains only the basic post. You’ll see the difference.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
<HEAD>

	<TITLE>The FreeBSD Diary -- FreeBSD 4.1.1-RELEASE</TITLE>
	
	<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

	<link rel="stylesheet" type="text/css" href="/css/site.css">

	<meta name="title" content="FreeBSD 4.1.1-RELEASE">
	<link rel="image_src" href="images/freebsddiary-logo.gif">
	<META NAME="description" CONTENT="A new tag in the tree">
	<META NAME="keywords"    CONTENT="freebsd,diary,FreeBSD">

	<LINK REL="SHORTCUT ICON" HREF="/favicon.ico" type="image/x-icon">
	<LINK REL="ICON"          HREF="/favicon.ico" type="image/x-icon">
	<meta name="MSSmartTagsPreventParsing" content="TRUE">
	<META http-equiv="Pragma"              CONTENT="no-cache">
	<META HTTP-EQUIV="Expires"             CONTENT="0">
	<META HTTP-EQUIV="Cache-Control"       CONTENT="no-cache">
	<META HTTP-EQUIV="Pragma-directive"    CONTENT="no-cache">
	<META HTTP-EQUIV="cache-directive"     CONTENT="no-cache">
	<META NAME="GOOGLEBOT"                 CONTENT="NOARCHIVE">
	<META NAME="ROBOTS"                    CONTENT="NOARCHIVE">

	<meta http-equiv="pics-label" content='(pics-1.1 "http://www.icra.org/ratingsv02.html" comment "ICRAonline v2.0" l gen true for "http://www.freebsddiary.org/"  r (nz 1 vz 1 lz 1 oz 1 ca 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.freebsddiary.org/"  r (n 0 s 0 v 0 l 0))'>
	<META http-equiv="PICS-Label" content='(PICS-1.1 "http://www.classify.org/safesurf/"                             l gen true for "http://www.freebsddiary.org/"  r (SS~~000 1))'> 
	<meta http-equiv="Certification" content='"http://www.ufcws.org/license.html" l r true comment "United Federation of ChildSafe Web Sites" for "http://www.freebsddiary.org" on "2002.08.10" r"'>
	<!-- Begin ICCS Certified Web Site Statement -->
	<!-- VERIFICATION="PICS-Label / ICCS_Certification" CONTENT="ICCS Certified Web Site"
	RATING="Class 1 SafeSurf" and/or "Class 2 ICRA" TASK="ICCS-Certification under the iWatchDog Program for "http://www.freebsddiary.org" EXPIRATION_DATE="2003.08.10"//-->
	<!-- End Statement -->

	<link rel="alternate" type="application/rss+xml" title="The FreeBSD Diary" href="http://www.freebsddiary.org/news.php">


</HEAD>

<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<table border="0" cellspacing="2">
<tr>
	<TD ALIGN="right" CLASS="sans" VALIGN="middle"><h1>The FreeBSD Diary</h1></TD></tr></table>
<TABLE WIDTH="100%" CELLPADDING="0" CELLSPACING="0" BORDER="0" align="center">
<TR>

	<TD width="450" ALIGN="right" VALIGN="top"><A HREF="/"><IMG SRC="/images/freebsddiary-logo.gif" ALT="The FreeBSD Diary" WIDTH="431" HEIGHT="98" BORDER="0"></A></td>
	<td width="5" align="left" ><small><a href="/other-copyrights.php">(TM)</a></small></td>
<TD ALIGN="right" CLASS="sans" VALIGN="bottom"><h3>Providing practical examples since 1998</h3></TD>
</TR>
</TABLE>


<div id="linksheader" align="right">
[ <A HREF="/">HOME</A> | <A HREF="/topics.php">TOPICS</A> | <A HREF="/chronological.php">INDEX</A> | <A HREF="/help.php">WEB RESOURCES</A> | <A HREF="/booksmags.php">BOOKS</A> | <A HREF="/contribute.php">CONTRIBUTE</A> | <A HREF="/search.php">SEARCH</A> | <A HREF="/feedback.php">FEEDBACK</A> | <A HREF="/faq.php">FAQ</A> | <A HREF="/phorum/">FORUMS</A> ]
</div>

<TABLE WIDTH="100%" ALIGN="center" BORDER="0">
		<TR><TD VALIGN="top">
		<TABLE WIDTH="100%" BORDER="0">
		<tr><td align="center">
<script language='JavaScript' type='text/javascript' src='http://ads.unixathome.org/phpPgAds/adx.js'></script>
<script language='JavaScript' type='text/javascript'>
<!--
   if (!document.phpAds_used) document.phpAds_used = ',';
   phpAds_random = new String (Math.random()); phpAds_random = phpAds_random.substring(2,11);
   
   document.write ("<" + "script language='JavaScript' type='text/javascript' src='");
   document.write ("http://ads.unixathome.org/phpPgAds/adjs.php?n=" + phpAds_random);
   document.write ("&amp;what=zone:50");
   document.write ("&amp;exclude=" + document.phpAds_used);
   if (document.referrer)
      document.write ("&amp;referer=" + escape(document.referrer));
   document.write ("'><" + "/script>");
//-->
</script><noscript><a href='http://ads.unixathome.org/phpPgAds/adclick.php?n=ab34d058' target='_blank'><img src='http://ads.unixathome.org/phpPgAds/adview.php?what=zone:50&amp;n=50' border='0' alt=''></a></noscript>

	</td></tr>
  <TR>
    <TD>

<div class="heading">
<span class="left">FreeBSD 4.1.1-RELEASE</span>
<span class="right">26 September 2000</span>
</div>
    </TD>
  </TR>
<TR><TD ALIGN="right"><div style="float:left"><a name="fb_share" type="button_count" href="http://www.facebook.com/sharer.php">Share</a><script src="http://static.ak.fbcdn.net/connect.php/js/FB.Share" type="text/javascript"></script></div><div class="headingmoreinfo">Need more help on this topic? <A HREF="/phorum/post.php?f=1">Click here</A><BR>This article has <a href="phorum/post.php?f=3&amp;article_id=400">no comments</a><br>Show me <a href="topics.php?aid=400">similar articles</a><br></div></TD></TR>
  <TR>
	<TD>
	For your information: the FreeBSD source tree has been tagged for
    RELENG_4_1_1_RELEASE.<P>What does this mean?&nbsp; At this point in my day (early morning)
    I have no idea.</P>
    <P>But it means you can now get 4.1.1-RELEASE via cvsup.&nbsp; When I get back from work,
    I'll see what I know then.</TD>
  </TR>
  <TR>
    <TD></TD>
  </TR>

</TABLE>



			</TD><TD VALIGN="top" WIDTH="140">
			<div align="center">

		
<script language='JavaScript' type='text/javascript' src='http://ads.unixathome.org/phpPgAds/adx.js'></script>
<script language='JavaScript' type='text/javascript'>
<!--
   if (!document.phpAds_used) document.phpAds_used = ',';
   phpAds_random = new String (Math.random()); phpAds_random = phpAds_random.substring(2,11);
   
   document.write ("<" + "script language='JavaScript' type='text/javascript' src='");
   document.write ("http://ads.unixathome.org/phpPgAds/adjs.php?n=" + phpAds_random);
   document.write ("&amp;what=zone:54");
   document.write ("&amp;exclude=" + document.phpAds_used);
   if (document.referrer)
      document.write ("&amp;referer=" + escape(document.referrer));
   document.write ("'><" + "/script>");
//-->
</script><noscript><a href='http://ads.unixathome.org/phpPgAds/adclick.php?n=a02ff4f4' target='_blank'><img src='http://ads.unixathome.org/phpPgAds/adview.php?what=zone:54&amp;n=54' border='0' alt=''></a></noscript>

	</div>
			</TD></TR>
			</TABLE>
			<div align="center">
<br>

<script language='JavaScript' type='text/javascript' src='http://ads.unixathome.org/phpPgAds/adx.js'></script>
<script language='JavaScript' type='text/javascript'>
<!--
   if (!document.phpAds_used) document.phpAds_used = ',';
   phpAds_random = new String (Math.random()); phpAds_random = phpAds_random.substring(2,11);
   
   document.write ("<" + "script language='JavaScript' type='text/javascript' src='");
   document.write ("http://ads.unixathome.org/phpPgAds/adjs.php?n=" + phpAds_random);
   document.write ("&amp;what=zone:50");
   document.write ("&amp;exclude=" + document.phpAds_used);
   if (document.referrer)
      document.write ("&amp;referer=" + escape(document.referrer));
   document.write ("'><" + "/script>");
//-->
</script><noscript><a href='http://ads.unixathome.org/phpPgAds/adclick.php?n=a0185fbb' target='_blank'><img src='http://ads.unixathome.org/phpPgAds/adview.php?what=zone:50&amp;n=50' border='0' alt=''></a></noscript>

	</div>
<TABLE WIDTH="100%" CELLPADDING="3" CELLSPACING="0" BORDER="0">
		<TR><TD ALIGN="right"><div style="float:left"><a name="fb_share" type="button_count" href="http://www.facebook.com/sharer.php">Share</a><script src="http://static.ak.fbcdn.net/connect.php/js/FB.Share" type="text/javascript"></script></div><div class="headingmoreinfo">Need more help on this topic? <A HREF="/phorum/post.php?f=1">Click here</A><BR>This article has <a href="phorum/post.php?f=3&amp;article_id=400">no comments</a><br>Show me <a href="topics.php?aid=400">similar articles</a><br></div></TD></TR>
		<TR>
			<TD ALIGN="center">
			<div id="linksfooter">
			[ <A HREF="/">HOME</A> | <A HREF="/topics.php">TOPICS</A> | <A HREF="/chronological.php">INDEX</A> | <A HREF="/help.php">WEB RESOURCES</A> | <A HREF="/booksmags.php">BOOKS</A> | <A HREF="/contribute.php">CONTRIBUTE</A> | <A HREF="/search.php">SEARCH</A> | <A HREF="/feedback.php">FEEDBACK</A> | <A HREF="/faq.php">FAQ</A> | <A HREF="/phorum/">FORUMS</A> ]
			</div>
			</TD>
		</TR>
		</TABLE>
		
<div class="footer">
<span class="left">
Servers and bandwidth provided by <A HREF="http://www.nyi.net/" TARGET="_new">New York Internet</A> and <A HREF="http://www.supernews.com/" TARGET="_new">SuperNews</A>
</span>
<span class="right">
Valid 
<a href="http://validator.w3.org/check/referer">HTML</a>, 

<a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a>
, and
<a href="http://feedvalidator.org/check.cgi?url=http://www.freebsddiary.org/news.php">RSS</a>.<BR>
<A HREF="/legal.php">Copyright</a> &copy; 1997-2012 <A HREF="http://www.dvl-software.com/">DVL Software Ltd.</A><BR>All rights reserved.
</span>
</div>

</body>
</html>

The above all just boils down to these few lines of code:

<h3 class="heading">FreeBSD 4.1.1-RELEASE</h3>
  
	
	For your information: the FreeBSD source tree has been tagged for
    RELENG_4_1_1_RELEASE.<P>What does this mean?&nbsp; At this point in my day (early morning)
    I have no idea.</P>
    <P>But it means you can now get 4.1.1-RELEASE via cvsup.&nbsp; When I get back from work,
    I'll see what I know then.

So how did I get there? I modified the code which supplied those headers, footers, and sidebars. If you don’t have such common functions, I’m sorry, but you’ll have to either write a program/script to alter all your individual files or do it by hand.
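If you do end up writing a script for static files, a minimal sketch might look like the following. It is not what I used: the file name page.html, the scratch directory, and the assumption that everything you want sits between the opening and closing body tags are all made up for illustration. Banners and sidebars inside the body would still need their own patterns.

```shell
# Hypothetical static page, created here only so the example is self-contained.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<HTML>
<HEAD><TITLE>Example</TITLE></HEAD>
<BODY BGCOLOR="#FFFFFF">
<P>The article text.</P>
</BODY>
</HTML>
EOF

# -0777 makes perl read the whole file as one string, so a pattern can span lines.
# The first substitution drops everything up to and including the opening body tag;
# the second drops the closing body tag and everything after it.
perl -0777 -p -e 's:\A.*?<body[^>]*>\s*::is; s:\s*</body>.*\z:\n:is' \
    "$workdir/page.html" > "$workdir/page.stripped"

# Show what is left of the page.
cat "$workdir/page.stripped"
```

The result is written to a separate file rather than over the original, so you can inspect it before deciding to run the same substitution across the whole site.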

Code modification

When I modified my website for import, I set up a duplicate website and started there. This used a different hostname, but provided the same content. Then I went to work on that new installation. For example, this is the code which displays a section title. That’s the bit with a horizontal yellow box and a title in it. This shows the patch:

function diary_SectionHeader() {

-echo '
-    </TD>
-  </TR>
-
-
-';

    echo diary_BannerSpace();
    echo diary_BannerSection();

-echo '
-  <TR>
-    <TD>
-    <P>
-';
}

Now, that may seem cryptic, but here is the before code:

function diary_SectionHeader() {

echo '
    </TD>
  </TR>


';

    echo diary_BannerSpace();
    echo diary_BannerSection();

echo '
  <TR>
    <TD>
    <P>
';
}

My website made heavy use of tables and rows. But no more. I’m removing all that. The resulting code for that section header is:

function diary_SectionHeader($Title) {

    echo diary_BannerSpace();
    echo diary_BannerSection($Title);
}

And the simple section headers are now produced by this:

function diary_BannerSection($Title) {

    // Emit the section title as a plain heading; WordPress supplies the layout around it.
    echo "<h3 class=\"section\">$Title</h3>";

}

I also tossed out all the code that created the <html>, <head>, and <body> sections. They are no longer required. WordPress will do all that for me.

This was an iterative process. I would make some changes, look at the website, find something to change, repeat. Eventually, you get rid of all the cruft.

OK, but that’s not enough

The above process deals with the common stuff: the material which appears on every page. But each page itself contains markup you need to remove. In my case, I’m using perl to clean all that up. What kind of markup? Well, here’s a list of the commands I ran:

  520  perl -pi -e 's:</body>::g' *.php
  523  perl -pi -e 's:</html>::g' *.php
  527  perl -pi -e 's:<TR><TD>::g' *.php
  529  perl -pi -e 's:<TD height="10"></TD>::g' *.php
  531  perl -pi -e 's:</TD></TR>::g' *.php
  539  perl -pi -e 's:</TD>::g' *.php
  540  perl -pi -e 's:</td>::g' *.php
  541  perl -pi -e 's:</tr>::g' *.php
  542  perl -pi -e 's:</TR>::g' *.php
  544  perl -pi -e 's:<tr>::g' *.php
  545  perl -pi -e 's:<TR>::g' *.php
  546  perl -pi -e 's:<TD>::g' *.php
  547  perl -pi -e 's:<tr>::g' *.php
  549  perl -pi -e 's:<td>::g' *.php
  560  perl -pi -e 's:^<P>$::g' *.php
  561  perl -pi -e 's:^<p>$::g' *.php
  562  perl -pi -e 's:^</p>$::g' *.php
  563  perl -pi -e 's:^</P>$::g'  *.php
  566  perl -pi -e 's:^</TABLE>$::g' *.php

These are from my command history, and show perl altering the files in place to remove various tags. For example, line 566 blanks every line which has just </TABLE> on it: the tag is removed and an empty line is left behind. Not shown in the above are the tests I ran first on individual files to verify that I had the correct command.
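A test on an individual file can be sketched along these lines. Treat it as one possible approach under stated assumptions: the file sample.php and the scratch directory are hypothetical and exist only for this illustration. Dropping -i makes perl print the result instead of rewriting the file, so the original stays untouched while you compare.

```shell
# Hypothetical scratch area and sample file, used only for this illustration.
workdir=$(mktemp -d)
printf '%s\n' '<TABLE>' '<TR><TD>' 'Some article text.' '</TD></TR>' '</TABLE>' \
    > "$workdir/sample.php"

# Without -i, perl writes the result to stdout and leaves the file alone.
perl -p -e 's:^</TABLE>$::g' "$workdir/sample.php" > "$workdir/sample.new"

# Review exactly what would change before committing to an in-place edit.
# diff exits non-zero when the files differ, hence the "|| true".
diff "$workdir/sample.php" "$workdir/sample.new" || true
```

Once the diff shows only the change you expect, the same expression can be run with -pi against *.php.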

Eventually, through much trial and error, I was able to strip the website down to its basics: something that I was ready to import into WordPress. This bulk import will be the subject of the next post.
