WordPress Migration: The import

This is the fourth in a series of articles on my migration to WordPress. In this post, I’ll talk about how I exported my existing website into an XML file, which I then imported into WordPress. All this was possible because of the preparations previously described.

I was fortunate in that my website already had an RSS feed. It did not contain everything needed to do an import, but it did represent a starting point.

I should point out that after doing this import, I went back later and did it again, after adding in author information to the website. That in itself is an interesting task. One might claim that the original import was then wasted. I say it was not. This entire process is iterative. Do a proof of concept. Expand upon it. Repeat until done. Just keep in mind that there are more steps to complete.

The RSS format

As mentioned in the WordPress Importing: The first attempts post, this is the basic format for my RSS feed. It is an XML file, although not strictly valid: for example, the dc: and content: prefixes are never declared as namespaces. It is good enough for the importing script.

<item>
  <title>This is the article title</title>
  <dc:date>2012-10-12 12:34:56</dc:date>
  <content:encoded>
This is the article content.
  </content:encoded>
  <category>FreeBSD</category>
</item>

My RSS generation script

My RSS feed was written in PHP. It looks more or less like this:

<?php
        #
        # $Id: news.php,v 1.15 2012/10/10 22:04:08 dan Exp $
        #
        # Copyright (c) 1998-2003 DVL Software Limited
        #

        require($_SERVER["DOCUMENT_ROOT"] . "/include/common.php");
        require($_SERVER["DOCUMENT_ROOT"] . "/include/freebsddiary.php");
        require($_SERVER["DOCUMENT_ROOT"] . "/include/databaselogin.php");

   $HTML  = '';
   $HTML .= '<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"' . "\n";
   $HTML .= '        "http://www.rssboard.org/rss-0.91.dtd">' . "\n";
   $HTML .= '<rss version="0.91">' . "\n";

   $HTML .= "\n";

   $HTML .= '<channel>' . "\n";
   $HTML .= '  <title>The FreeBSD Diary</title>' . "\n";
   $HTML .= '  <link>http://www.freebsddiary.org/</link>' . "\n";
   $HTML .= '  <description>The largest collection of practical examples for FreeBSD!</description>' . "\n";
   $HTML .= '  <language>en-us</language>' . "\n";
   $HTML .= '  <copyright>Copyright ' . GetCopyrightYears() . ', DVL Software Limited.</copyright>' . "\n";

   $HTML .= "\n";

   $sql = "SELECT description, name, filename
             FROM articles
            where actual_date  <= '" . date("Y/m/d", time()) . "'
                  and completed = 'Y'
         order by actual_date desc
            limit 10 ";

#die("<pre>$sql</pre>");

   $result = pg_query($db, $sql);
   while ($myrow = pg_fetch_array($result)) {
      $HTML .= '  <item>' . "\n";
      $HTML .= '    <title>' .  htmlentities($myrow["name"]) . '</title>' . "\n";
      $HTML .= '    <link>http://www.freebsddiary.org/' . htmlentities($myrow["filename"]) . '</link>' . "\n";
      $HTML .= '    <description>' . htmlentities($myrow["description"]) . '</description>' . "\n";
      $HTML .= '  </item>' . "\n";
   }

   $HTML .= '</channel>' . "\n";
   $HTML .= '</rss>' . "\n";

   header('Content-type: text/xml');

   echo '<?xml version="1.0"?>', "\n";
   echo $HTML;

?>

This produced a basic structure, but I needed much more information. Here is what I am using for the migration:

<?php
        #
        # $Id: news.php,v 1.14 2010/02/08 16:29:22 dan Exp $
        #
        # Copyright (c) 1998-2003 DVL Software Limited
        #

        require($_SERVER["DOCUMENT_ROOT"] . "/include/common.php");
        require($_SERVER["DOCUMENT_ROOT"] . "/include/freebsddiary.php");
        require($_SERVER["DOCUMENT_ROOT"] . "/include/databaselogin.php");

$NOIMPORT = array('index.php' => 1, 'article-feedback.php' => 1, 'topics.php' => 1);

$HTML = '';

   $HTML .= '<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"' . "\n";
   $HTML .= '        "http://www.rssboard.org/rss-0.91.dtd">' . "\n";
   $HTML .= '<rss version="0.91">' . "\n";

   $HTML .= "\n";

   $HTML .= '<channel>' . "\n";
   $HTML .= '  <title>The FreeBSD Diary</title>' . "\n";
   $HTML .= '  <link>http://www.freebsddiary.org/</link>' . "\n";
   $HTML .= '  <description>The largest collection of practical examples for FreeBSD!</description>' . "\n";
   $HTML .= '  <language>en-us</language>' . "\n";
   $HTML .= '  <copyright>Copyright ' . GetCopyrightYears() . ', DVL Software Limited.</copyright>' . "\n";

   $HTML .= "\n";

   $sql = "SELECT id, description, name, filename, actual_date as date
             FROM articles
            where completed = 'Y'
         order by actual_date desc";

   $result = pg_query($db, $sql);
   while ($myrow = pg_fetch_array($result)) {
      // we don't want these files
      if (isset($NOIMPORT[$myrow['filename']]))
      {
         continue;
      }
      $HTML .= '  <item>' . "\n";
      $HTML .= '    <title>' .  htmlentities($myrow["name"]) . '</title>' . "\n";
      $HTML .= '    <link>http://www.freebsddiary.org/' . htmlentities($myrow["filename"]) . '</link>' . "\n";
      $HTML .= '    <dc:date>' .  $myrow["date"] . '</dc:date>' . "\n";
      $HTML .= '    <description>' . htmlentities($myrow["description"]) . '</description>' . "\n";
      $HTML .= '    <content:encoded>' .  getContents($myrow['filename']) . '</content:encoded>' . "\n";
      $HTML .= getCategories($db, $myrow['id']);
      $HTML .= '  </item>' . "\n";
   }

   $HTML .= '</channel>' . "\n";
   $HTML .= '</rss>' . "\n";

   header('Content-type: text/xml');

   echo '<?xml version="1.0"?>', "\n";
   echo $HTML;

   function getCategories($db, $id)
   {
      $sql = "SELECT T.name
                FROM topic_articles TA, topics T
               WHERE TA.article_id = $id
                 AND TA.topic_id   = T.id";

      $HTML = '';
      $result = pg_query($db, $sql);
      while ($myrow = pg_fetch_array($result))
      {
         $HTML .= '    <category>' . $myrow['name'] . '</category>' . "\n";
      }

      return $HTML;
   }

   function getContents($filename)
   {
      $contents = file_get_contents('http://' . $_SERVER['SERVER_NAME'] . '/' . $filename);

      return $contents;
   }

A few notes about this code. Line numbers count from the opening <?php of the second listing:

Lines 16-29 (the DOCTYPE, <rss>, and <channel> header) are not really needed. The import will work without them.
Line 31 is the SQL used to get a list of the articles from my database. You may need to obtain this list from disk, or somewhere else.
There are some files I don’t want to import because they are automatically created on my website and will be created by WordPress. These files are listed on line 12 and referenced on line 39.
Line 48 obtains the content of the article by fetching it from the website. Yes, just like your browser would fetch it. It invokes the function found on line 78.
Line 49 gets a list of categories for this article. WordPress will automatically create these during the import process.
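If your articles are not in a database, the list could come from disk instead. What follows is only a sketch under assumptions: listArticleFiles() is a hypothetical helper, the "*.php" pattern is a guess at how the article files are named, and the file modification time merely stands in for the actual_date column.

```php
<?php
// Hypothetical helper: build the article list from files on disk instead of
// from a database query. The "*.php" pattern is an assumption.
function listArticleFiles($dir, $noImport)
{
    $paths = glob(rtrim($dir, '/') . '/*.php');
    if ($paths === false) {
        // glob() can return false on error; treat that as "no files".
        $paths = array();
    }

    $files = array();
    foreach ($paths as $path) {
        $filename = basename($path);

        // Skip generated pages, mirroring the $NOIMPORT check in the script.
        if (isset($noImport[$filename])) {
            continue;
        }

        $files[] = array(
            'filename' => $filename,
            // Assumption: modification time stands in for actual_date.
            'date'     => date('Y-m-d H:i:s', filemtime($path)),
        );
    }

    return $files;
}
```

Each returned entry has the filename and date keys that the main loop reads from $myrow, so something like listArticleFiles($_SERVER['DOCUMENT_ROOT'], $NOIMPORT) could feed that loop. Titles, descriptions, and categories would still have to come from somewhere else.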

Fetching the RSS file

I work mostly in a Unix-like environment, specifically FreeBSD. Sure, I use my MacBook for most things, but all of my servers run FreeBSD. With FreeBSD comes a built-in command: fetch. With it, I can do this:

$ fetch http://www.freebsddiary.org/news.php
news.php                                      100% of 2546  B 2544 kBps

That’s not the whole import. That’s just ten items. That’s my original unchanged script. I won’t give you the URL for the real script.

Using Chrome, I downloaded the URL and saved it to local disk. Now I’m ready to import.

The import

I was using WordPress 3.4.2. To import an RSS feed, I followed these steps:

click on Tools
click on Import
click on RSS*
click on Choose File
click on Upload file and Import

* If you have not already done so, you’ll need to download and install the RSS importer. I used version 0.2.

You should check the size of the file you are going to import. If it exceeds the Maximum size shown in step 4 above, you need to do one of two things:

split the file into multiple parts: this should be straightforward; just be sure to keep all of one <item>..</item> in the same file.
alter your web server parameters so you can upload a larger file
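For the first option, the splitting can be scripted. This is a minimal sketch, not something I ran as part of this import: splitFeed() is a hypothetical helper, it assumes items are never nested and that no article body contains a literal </item>, and it assumes the feed ends with the same </channel> and </rss> lines my script emits.

```php
<?php
// Hypothetical helper: split a feed into chunks of at most $perFile items,
// keeping every <item>...</item> block intact.
function splitFeed($xml, $perFile)
{
    // Everything before the first <item> is reused as the header of each part.
    $firstItem = strpos($xml, '<item>');
    if ($firstItem === false) {
        return array($xml);
    }
    $header = substr($xml, 0, $firstItem);

    // Assumption: the feed closes with these two elements.
    $footer = "</channel>\n</rss>\n";

    // Collect whole items; the non-greedy match stops at the first </item>.
    preg_match_all('~<item>.*?</item>~s', $xml, $matches);

    $parts = array();
    foreach (array_chunk($matches[0], $perFile) as $chunk) {
        $parts[] = $header . implode("\n  ", $chunk) . "\n" . $footer;
    }

    return $parts;
}
```

Each element of the returned array could then be written out with file_put_contents() and uploaded separately. For the second option, on a typical PHP setup the upload limit comes from the upload_max_filesize and post_max_size settings in php.ini, though your web server may impose its own limit as well.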

The results

I had to repeat this import process several times. I kept seeing things in the files that were messing up the process or creating issues. Situations such as text like this:

<content>/<port>

Why is this a problem? It is the same tag as used by the RSS Import script, which caused one article to be incomplete and created an extra article.
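One way to find such articles before importing is to scan each body for text that looks like one of the feed's own tags. The following is only a sketch: findFeedTagCollisions() is a hypothetical helper, and the list of tag names is my assumption about which elements an importer cares about, not something taken from the RSS importer's code.

```php
<?php
// Hypothetical helper: report which feed element names also appear as
// literal tags inside an article body.
function findFeedTagCollisions($content, $feedTags)
{
    $found = array();
    foreach ($feedTags as $tag) {
        // Match <tag>, </tag>, or <tag attr="..."> case-insensitively.
        $pattern = '~</?' . preg_quote($tag, '~') . '(\s[^>]*)?>~i';
        if (preg_match($pattern, $content)) {
            $found[] = $tag;
        }
    }

    return $found;
}

// Assumed list of element names used in the feed.
$feedTags = array('item', 'title', 'link', 'description',
                  'category', 'content:encoded', 'dc:date');

// Made-up article body that mentions a feed tag literally.
$body = '<p>Wrap each entry in <item> and close it with </item>.</p>';

print_r(findFeedTagCollisions($body, $feedTags));
```

Any article for which this returns a non-empty list is worth inspecting by hand before the import.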

To redo the import, I just used WordPress to completely delete all the existing articles and start again. Repeat.

But wait, there’s more!

This was a good start. I have all the content. It looks good. But not everything is there yet.

In future articles, I will import the authors (which I suspect will require reimporting all the posts). And I will import the comments. I think that will be the hardest task.

I also have to redirect the old URLs to the new URLs. This will keep search engines happy, bookmarks will continue to find the page, and links from within my website and from other websites will still work.

So far, so good. I’m enjoying this and the prognosis is good.