WordPress Migration: redirecting old URLs

This is the sixth in a series of articles on my migration to WordPress. In this post, I’ll talk about how I kept the old URLs working. This matters only if you want the ‘old’ content to be found, and it matters most if your website is well established: people will have links to your website from their websites, and search engines have results which need to continue to be valid. Myself? I know the URLs of a few articles and I want them to work. They are also much shorter.

The WordPress API

I did look at the WordPress Rewrite API, but I abandoned that idea. I feel much more comfortable doing this via Apache than via WordPress. I think that, long term, letting Apache do this is a much better strategy, especially when dealing with about 650 URLs.

The Theory

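How am I going to connect the old URL to the new URL? I will do it with Apache redirects. There are other ways to do this, but I’m going to place entries such as this within my Apache configuration file. Note that RedirectPermanent is a mod_alias directive, so it does not depend on the RewriteEngine line.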

RewriteEngine on
RedirectPermanent /bacula.php   /2004/02/01/bacula-cross-platform-client-server-backups/

With this, I can type this into my browser:

http://www.freebsddiary.org/bacula.php

And I will be automatically redirected to:

http://www.freebsddiary.org/2004/02/01/bacula-cross-platform-client-server-backups/

This is done with an HTTP status code of 301 (Moved Permanently). This allows search engines to update their references.

This wget output shows the status codes and outlines what happens with this solution:

$ wget -S http://wp.freebsddiary.org/bacula.php
--2012-11-07 20:43:14--  http://wp.freebsddiary.org/bacula.php
Resolving wp.freebsddiary.org... 64.90.182.120
Connecting to wp.freebsddiary.org|64.90.182.120|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 301 Moved Permanently
  Date: Wed, 07 Nov 2012 20:43:14 GMT
  Server: Apache/2.2.22 (FreeBSD) PHP/5.4.5 mod_ssl/2.2.22 OpenSSL/0.9.8q DAV/2
  Location: http://wp.freebsddiary.org/2004/02/01/bacula-cross-platform-client-server-backups/
  Content-Length: 290
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: text/html; charset=iso-8859-1
Location: http://wp.freebsddiary.org/2004/02/01/bacula-cross-platform-client-server-backups/ [following]
--2012-11-07 20:43:14--  http://wp.freebsddiary.org/2004/02/01/bacula-cross-platform-client-server-backups/
Reusing existing connection to wp.freebsddiary.org:80.
HTTP request sent, awaiting response... 
  HTTP/1.0 200 OK
  Date: Wed, 07 Nov 2012 20:43:14 GMT
  Server: Apache/2.2.22 (FreeBSD) PHP/5.4.5 mod_ssl/2.2.22 OpenSSL/0.9.8q DAV/2
  X-Powered-By: PHP/5.4.5
  X-Pingback: http://wp.freebsddiary.org/xmlrpc.php
  Link: <http://wp.freebsddiary.org/?p=2966>; rel=shortlink
  Connection: close
  Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
Saving to: `bacula.php.2'

    [  <=>                                                              ] 82,006       285K/s   in 0.3s    

2012-11-07 20:43:15 (285 KB/s) - `bacula.php.2' saved [82006]

$

On line 6, you can see the 301 status code, and on line 17, you can see the request for the other URL. All of this happens transparently for the user.
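
If you want to check a redirect from a script instead of reading wget output by eye, something like the following might do. It is only a sketch: parse_redirect() is a hypothetical helper of my own naming, and the header lines are copied by hand from the wget output above, not fetched live.

```php
<?php
// Hypothetical helper: read the status code and Location header of the
// FIRST response found in a list of raw header lines.
function parse_redirect(array $headerLines)
{
    $result = array('status' => null, 'location' => null);
    foreach ($headerLines as $line) {
        $line = trim($line);
        if (preg_match('|^HTTP/\S+\s+(\d{3})|', $line, $m)) {
            if ($result['status'] !== null) {
                break; // a second status line means the next response has started
            }
            $result['status'] = (int) $m[1];
        } elseif (stripos($line, 'Location:') === 0) {
            $result['location'] = trim(substr($line, strlen('Location:')));
        }
    }
    return $result;
}

// Header lines taken from the wget output shown above.
$headers = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://wp.freebsddiary.org/2004/02/01/bacula-cross-platform-client-server-backups/',
    'Content-Type: text/html; charset=iso-8859-1',
);

$redirect = parse_redirect($headers);
echo $redirect['status'], ' -> ', $redirect['location'], "\n";
```

A status of 301 together with the expected Location value is what you want to see for every old URL.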

What about .htaccess?

I expect this approach can be used with your .htaccess file, although I won’t be using it and I have not tested it. I will be altering my virtual host definition in httpd.conf.

Matching OLD with NEW

What may initially seem complex is how to link the old URL with the new URL. Let’s start with the old URLs, assuming we have the following information for each one:

filename (e.g. bacula.php)
title (Bacula: Cross-Platform Client-Server Backups)
Author (Dan Langille)
Date (1 February 2004)

In my post about importing comments, I showed you how I linked author, date, and article title together to find the new post. This resulted in a post ID. From the post ID, you can get the full URL. If you know the old URL, you can now link that to the new URL. It is very straightforward.
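
The matching idea can be sketched in a few lines of plain PHP. This is an illustration only: make_key() is a hypothetical helper, the sample row stands in for what would really come from the WordPress posts table, and the midnight timestamp is an assumption. The post ID 2966 is the one visible in the shortlink header of the wget output earlier.

```php
<?php
// Hypothetical helper: build one composite key from author ID, date and title.
// The ' @@ ' separator matches the one used in the importer code in this post.
function make_key($authorId, $date, $title)
{
    return $authorId . ' @@ ' . $date . ' @@ ' . $title;
}

// Sample data standing in for rows from the WordPress posts table.
$posts = array(
    array(
        'ID'          => 2966,
        'post_author' => 1,
        'post_date'   => '2004-02-01 00:00:00', // assumed time of day
        'post_title'  => 'Bacula: Cross-Platform Client-Server Backups',
    ),
);

// Index the posts by composite key so each old article is a single lookup.
$postLookup = array();
foreach ($posts as $post) {
    $key = make_key($post['post_author'], $post['post_date'], $post['post_title']);
    $postLookup[$key] = $post['ID'];
}

// One record from the old site, already normalised to the same shape.
$old = array(
    'filename' => 'bacula.php',
    'author'   => 1,
    'date'     => '2004-02-01 00:00:00',
    'title'    => 'Bacula: Cross-Platform Client-Server Backups',
);

$key    = make_key($old['author'], $old['date'], $old['title']);
$postId = isset($postLookup[$key]) ? $postLookup[$key] : null;

echo $old['filename'], ' => post ', $postId, "\n";
```

The lookup only works if both sides format the date and title identically, which is why the normalisation steps matter more than the lookup itself.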

Extracting data

Here is the code, as rough as it is, which pulls the above list of data out of the original FreeBSD Diary database, and creates an XML file.

<?php
	#
	# $Id: news.php,v 1.14 2010/02/08 16:29:22 dan Exp $
	#
	# Copyright (c) 1998-2003 DVL Software Limited
	#

	require($_SERVER["DOCUMENT_ROOT"] . "/include/common.php");
	require($_SERVER["DOCUMENT_ROOT"] . "/include/freebsddiary.php");
	require($_SERVER["DOCUMENT_ROOT"] . "/include/databaselogin.php");
	
$MaxArticles = 10000;

$HTML = '';

   $HTML .= '<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"' . "\n";
   $HTML .= '        "http://www.rssboard.org/rss-0.91.dtd">' . "\n";
   $HTML .= '<rss version="0.91">' . "\n";

   $HTML .= "\n";

   $HTML .= '<channel>' . "\n";
   $HTML .= '  <title>The FreeBSD Diary</title>' . "\n";
   $HTML .= '  <link>http://www.freebsddiary.org/</link>' . "\n";
   $HTML .= '  <description>The largest collection of practical examples for FreeBSD!</description>' . "\n";
   $HTML .= '  <language>en-us</language>' . "\n";
   $HTML .= '  <copyright>Copyright ' . GetCopyrightYears() . ', DVL Software Limited.</copyright>' . "\n";

   $HTML .= "\n";

   $sql = "SELECT A.id as article_id,
                  A.name AS article_name, 
                  A.author as author, 
                  A.actual_date as date,
                  A.filename
             FROM articles A
            WHERE A.completed = 'Y'
         order by date, article_id ";

#die("<pre>$sql</pre>");

   $result = pg_query($db, $sql);
   while ($myrow = pg_fetch_array($result)) {
      $HTML .= '  <item>' . "\n";
      $HTML .= '    <title>' .  htmlentities($myrow["article_name"]) . '</title>' . "\n";
      $HTML .= '    <dc:date>' .  $myrow["date"] . '</dc:date>' . "\n";
      $HTML .= '    <author>' .  $myrow["author"] . '</author>' . "\n";
      $HTML .= '    <filename>' . htmlentities($myrow['filename']) . '</filename>' . "\n";
      $HTML .= '  </item>' . "\n";
   }

   $HTML .= '</channel>' . "\n";
   $HTML .= '</rss>' . "\n";

   header('Content-type: text/xml');

   echo '<?xml version="1.0"?>', "\n";   
   echo $HTML;

The extracted data

Here is an example of that XML output, which shows the first two articles:

<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
        "http://www.rssboard.org/rss-0.91.dtd">
<rss version="0.91">

<channel>
  <title>The FreeBSD Diary</title>
  <link>http://www.freebsddiary.org/</link>
  <description>The largest collection of practical examples for FreeBSD!</description>
  <language>en-us</language>
  <copyright>Copyright 1997-2012, DVL Software Limited.</copyright>

  <item>
    <title>Why I wanted FreeBSD before I knew it existed</title>
    <dc:date>1998-02-11</dc:date>
    <author></author>
    <filename>why.php</filename>
  </item>
  <item>
    <title>How I found FreeBSD</title>
    <dc:date>1998-05-11</dc:date>
    <author></author>
    <filename>introduction.php</filename>
  </item>
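
To give a feel for how these items can be read back, here is a small sketch that pulls the fields out with regular expressions, the same general approach the importer takes. extract_items() is a hypothetical name, and the sample data is the two items shown above.

```php
<?php
// Hypothetical helper: return one array per <item>, keyed by tag name.
function extract_items($xml)
{
    $items = array();
    preg_match_all('|<item>(.*?)</item>|is', $xml, $matches);
    foreach ($matches[1] as $item) {
        $row = array();
        foreach (array('title', 'dc:date', 'author', 'filename') as $tag) {
            $pattern   = '|<' . $tag . '>(.*?)</' . $tag . '>|is';
            $row[$tag] = preg_match($pattern, $item, $m) ? trim($m[1]) : '';
        }
        $items[] = $row;
    }
    return $items;
}

$xml = <<<'XML'
  <item>
    <title>Why I wanted FreeBSD before I knew it existed</title>
    <dc:date>1998-02-11</dc:date>
    <author></author>
    <filename>why.php</filename>
  </item>
  <item>
    <title>How I found FreeBSD</title>
    <dc:date>1998-05-11</dc:date>
    <author></author>
    <filename>introduction.php</filename>
  </item>
XML;

$items = extract_items($xml);
foreach ($items as $item) {
    echo $item['filename'], ' | ', $item['dc:date'], ' | ', $item['title'], "\n";
}
```

Regular expressions are good enough here because the file is generated by my own script and has a fixed shape; for arbitrary feeds a real XML parser would be the safer choice.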

The code for importing

I took the code I created for importing comments and altered it so it would output a list of RedirectPermanent statements. That code appears below and is installed at wp-content/plugins/rss-importer-redirects/

<?php
/*
Plugin Name: RSS Importer Redirects
Plugin URI: 
Description: Import article headers from RSS feed and create redirects. Based upon http://wordpress.org/extend/plugins/rss-importer/
Author: Dan Langille
Author URI: http://wordpress.org/
Version: 0.1
Stable tag: 0.1
License: GPL version 2 or later - http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
*/

if ( !defined('WP_LOAD_IMPORTERS') )
	return;

// Load Importer API
require_once ABSPATH . 'wp-admin/includes/import.php';

if ( !class_exists( 'WP_Importer' ) ) {
	$class_wp_importer = ABSPATH . 'wp-admin/includes/class-wp-importer.php';
	if ( file_exists( $class_wp_importer ) )
		require_once $class_wp_importer;
}

/**
 * RSS Importer
 *
 * @package WordPress
 * @subpackage Importer
 */

/**
 * RSS Importer
 *
 * Will process a RSS feed for importing posts into WordPress. This is a very
 * limited importer and should only be used as the last resort, when no other
 * importer is available.
 *
 * @since unknown
 */
if ( class_exists( 'WP_Importer' ) ) {
class RSS_ImportRedirects extends WP_Importer {

	var $posts = array ();
	var $file;
	
	var $authors = array();

	// Fetch every user so authors can be matched by display name or login.
	function get_all_authors() {
		global $wpdb;

		// There are no placeholders in this query, so $wpdb->prepare() is not needed.
		$query = "SELECT ID, display_name, user_login, user_email FROM $wpdb->users";

		return $wpdb->get_results( $query );
	}

	// Fetch the ID, author, date and title of every post.
	function get_all_posts() {
		global $wpdb;

		$query = "SELECT ID, post_author, post_date, post_title FROM $wpdb->posts WHERE post_type = 'post'";

		return $wpdb->get_results( $query );
	}

	function header() {
		echo '<div class="wrap">';
		screen_icon();
		echo '<h2>'.__('Import RSS redirects', 'rss-importer-redirects').'</h2>';
	}

	function footer() {
		echo '</div>';
	}

	function greet() {
		echo '<div class="narrow">';
		echo '<p>'.__('Howdy! This importer allows you to import redirects from an RSS 2.0 file into your WordPress site. This is useful if you want to use the old URLS but redirect them to the new URLS. Pick an RSS file to upload and click Import.', 'rss-importer-redirects').'</p>';
		wp_import_upload_form("admin.php?import=rssredirects&amp;step=1");
		echo '</div>';
	}

	function _normalize_tag( $matches ) {
		return '<' . strtolower( $matches[1] );
	}

	function get_posts() {
		global $wpdb;

		set_magic_quotes_runtime(0);
		$datalines = file($this->file); // Read the file into an array
		$importdata = implode('', $datalines); // squish it
		$importdata = str_replace(array ("\r\n", "\r"), "\n", $importdata);

		$authors = $this->get_all_authors();
		$authorLoginLookup = array();
		$authorNameLookup  = array();
		$authorLookup      = array();
		foreach($authors as $author)
		{
			$key = $author->display_name . ' ## ' . $author->user_email;
			if (isset($authorLookup[$key]))
			{
				echo 'WARNING: duplicate display name / email combination ' . $key . '<br>';
			}
			$authorLookup[$key] = $author->ID;
			if (isset($authorNameLookup[$author->display_name]))
			{
#				echo 'WARNING: duplicate display_name ' . $author->display_name . '<br>';
			}
			if (isset($authorLoginLookup[$author->user_login]))
			{
#				echo 'WARNING: duplicate user_login ' . $author->user_login . '<br>';
			}
			$authorNameLookup [$author->display_name] = $author->ID;
			$authorLoginLookup[$author->user_login]   = $author->ID;
		}
                
#		echo ' the authors are <pre>' . print_r($authorNameLookup,  true) . '</pre>';
#		echo ' the authors are <pre>' . print_r($authorLoginLookup, true) . '</pre>';

		$posts = $this->get_all_posts();
		$postLookup = array();
		foreach($posts as $post)
		{
		        $key = $post->post_author . ' @@ ' .  $post->post_date . ' @@ ' . $post->post_title;
		        if (isset($postLookup[$key]))
		        {
                		echo ' the posts are <pre>' . print_r($postLookup, true) . '</pre>';
		                die('duplicate post found: ' . $key);
                        }
		        $postLookup[$key] = $post->ID;
                }
#                echo ' the posts are <pre>' . print_r($postLookup, true) . '</pre>';
                echo 'we have ' . count($posts) . ' fetched from WordPress<br>';
                echo 'we have ' . count($postLookup) . ' in postLookup<br>';
                


		// pull each <item> out of the uploaded file and print one redirect per article;
		// str_pad() below pads the filenames so the redirect URLs are aligned in the output
		preg_match_all('|<item>(.*?)</item>|is', $importdata, $this->posts);
		$this->posts = $this->posts[1];
		echo "<pre>";
		foreach ($this->posts as $post) {
			preg_match('|<title>(.*?)</title>|is', $post, $post_title);
			$post_title = str_replace(array('<![CDATA[', ']]>'), '', $wpdb->escape( trim($post_title[1]) ));

			// parse the article date from <dc:date> and convert it to the local post date
			preg_match('|<dc:date>(.*?)</dc:date>|is', $post, $post_date_gmt);
			$post_date_gmt = preg_replace('|([-+])([0-9]+):([0-9]+)$|', '\1\2\3', $post_date_gmt[1]);
			$post_date_gmt = str_replace('T', ' ', $post_date_gmt);
			$post_date_gmt = strtotime($post_date_gmt);

			$post_date_gmt = gmdate('Y-m-d H:i:s', $post_date_gmt);
			$post_date = get_date_from_gmt( $post_date_gmt );

                        // by default, all posts belong to the user id = 1.
			$post_author = 1;

			preg_match('|<filename.*?>(.*?)</filename>|is', $post, $filename);
			$filename = $filename[1];

			preg_match('|<author.*?>(.*?)</author>|is', $post, $authorName);
			if ($authorName)
			{
				$authorName = $wpdb->escape(trim($authorName[1]));
				if (isset($authorNameLookup[$authorName]))
				{
				        $post_author = $authorNameLookup[$authorName];
                                }
                                else
                                {
        				if (isset($authorLoginLookup[$authorName]))
	        			{
		        		        $post_author = $authorLoginLookup[$authorName];
                                        }
                                }
                        }

			// look up the post id for this article
			
			$key = $post_author . ' @@ ' .  $post_date . ' @@ ' . stripslashes($post_title);
			if (isset($postLookup[$key]))
			{
			        $post_ID = $postLookup[$key];
                        }
                        else
                        {
                                echo 'could not find post id for ' . $authorName . ' ' . $key . '<br>';
                                continue;
                        }
                        
                        $post_URL = get_permalink( $post_ID );
                        $post_URL = str_replace('http://' . $_SERVER['HTTP_HOST'], '', $post_URL);

                        echo "RedirectPermanent /" . str_pad($filename, 50) . "$post_URL\n";
		}
		echo "</pre>";
		exit;
	}

	function import_posts() {
		echo '<ol>';

		foreach ($this->posts as $post) {
			echo "<li>".__('Importing redirects...', 'rss-importer-redirects');

			extract($post);

			if ($post_id = post_exists($post_title, $post_content, $post_date)) {
				_e('Post already imported', 'rss-importer-redirects');
			} else {
				$post_id = wp_insert_post($post);
				if ( is_wp_error( $post_id ) )
					return $post_id;
				if (!$post_id) {
					_e('Couldn&#8217;t get post ID', 'rss-importer-redirects');
					return;
				}

				if (0 != count($categories))
					wp_create_categories($categories, $post_id);
				_e('Done!', 'rss-importer-redirects');
			}
			echo '</li>';
		}

		echo '</ol>';

	}

	function import() {
		$file = wp_import_handle_upload();
		if ( isset($file['error']) ) {
			echo $file['error'];
			return;
		}

		$this->file = $file['file'];
		$this->get_posts();
		$result = $this->import_posts();
		if ( is_wp_error( $result ) )
			return $result;
		wp_import_cleanup($file['id']);
		do_action('import_done', 'rssredirects');

		echo '<h3>';
		printf(__('All done. <a href="%s">Have fun!</a>', 'rss-importer-redirects'), get_option('home'));
		echo '</h3>';
	}

	function dispatch() {
		if (empty ($_GET['step']))
			$step = 0;
		else
			$step = (int) $_GET['step'];

		$this->header();

		switch ($step) {
			case 0 :
				$this->greet();
				break;
			case 1 :
				check_admin_referer('import-upload');
				$result = $this->import();
				if ( is_wp_error( $result ) )
					echo $result->get_error_message();
				break;
		}

		$this->footer();
	}

	function RSS_ImportRedirects() {
		// Nothing.
	}
}

$rss_import = new RSS_ImportRedirects();

register_importer('rssredirects', __('RSS REDIRECTS', 'rss-importer-redirects'), __('Import redirects from an RSS feed.', 'rss-importer-redirects'), array ($rss_import, 'dispatch'));

} // class_exists( 'WP_Importer' )

function rss_importer_redirects_init() {
    load_plugin_textdomain( 'rss-importer-redirects', false, dirname( plugin_basename( __FILE__ ) ) . '/languages' );
}
add_action( 'init', 'rss_importer_redirects_init' );

Installing the plugin

Apart from the above file, this new plugin is the same as rss-importer-comment. So… to create this plugin, you do this:

# cd wp-content/plugins
# cp -rp rss-importer rss-importer-redirects
# cd rss-importer-redirects
# mv rss-importer.php rss-importer-redirects.php

Then copy and paste the above code into rss-importer-redirects.php. You also need to enable the plugin, then run Tools | Import, and click on RSS REDIRECTS. You then find and upload the XML file you created in the previous step.

The output

The new plugin does not alter your database. It creates output which is then copy/pasted into the .htaccess file (or httpd.conf file). Here is an example of that output:

Import RSS redirects
WARNING: duplicate display name / email combination Gerard Samuel ## 
WARNING: duplicate display name / email combination John J. Rushford Jr ## 
we have 646 fetched from WordPress
we have 646 in postLookup
RedirectPermanent /why.php                                           /1998/02/11/why-i-wanted-freebsd-before-i-knew-it-existed/
RedirectPermanent /introduction.php                                  /1998/05/11/how-i-found-freebsd/
RedirectPermanent /install.php                                       /1998/06/11/the-installation/
RedirectPermanent /natd.php                                          /1998/06/21/natd-network-address-translation-ip-masquerading-ip-aliasing/
RedirectPermanent /cdrom.php                                         /1998/07/09/cd-rom-saga-a-funny-story/
RedirectPermanent /dns.php                                           /1998/07/10/the-dns-problem-which-was-an-natd-problem/
RedirectPermanent /filtering.php                                     /1998/07/11/firewalls-filtering-ipfw-and-ftp-clients/
RedirectPermanent /http.php                                          /1998/07/12/redirecting-port-requests/
RedirectPermanent /mail.php                                          /1998/07/15/reading-my-mail-from-nt1-qpopper/
RedirectPermanent /shell.php                                         /1998/07/26/changing-the-shell-bash/
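
Each of those lines is built from just two pieces of information: the old filename and the new permalink with the host stripped off. A minimal sketch of that step, outside WordPress, could look like this. format_redirect() is a hypothetical helper; in the plugin the permalink comes from get_permalink() and the host from the current request.

```php
<?php
// Hypothetical helper: build one RedirectPermanent line from an old
// filename and the full permalink of the new post.
function format_redirect($filename, $permalink, $host)
{
    // Keep only the path portion of the permalink.
    $path = str_replace('http://' . $host, '', $permalink);

    // Pad the filename to 50 characters so the targets line up in a column.
    return 'RedirectPermanent /' . str_pad($filename, 50) . $path;
}

$line = format_redirect(
    'why.php',
    'http://wp.freebsddiary.org/1998/02/11/why-i-wanted-freebsd-before-i-knew-it-existed/',
    'wp.freebsddiary.org'
);
echo $line, "\n";
```

The padding is purely cosmetic. Apache only needs whitespace between the two paths, but aligned columns make a file of several hundred redirects much easier to scan.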

What do you do with that?

I have 646 URLs. Rather than put them all into the httpd.conf file, I put them in another file and include it. For example:

RewriteEngine On
include    /usr/websites/wp.freebsddiary.org/configuration/redirects

With this, and a restart/reload of Apache, the redirects start happening.
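
Before reloading Apache, it may be worth a quick sanity check of the generated file. The following is a sketch under my own assumptions, not something I ran as part of this migration: check_redirects() is a hypothetical helper that expects every non-blank line to have exactly three whitespace-separated fields and flags any old path that appears twice.

```php
<?php
// Hypothetical sanity check for a list of generated redirect lines.
// Returns an array of problem descriptions; an empty array means no problems.
function check_redirects(array $lines)
{
    $seen     = array();
    $problems = array();
    foreach ($lines as $n => $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        $fields = preg_split('/\s+/', $line);
        if (count($fields) !== 3 || $fields[0] !== 'RedirectPermanent') {
            $problems[] = 'line ' . ($n + 1) . ': unexpected format';
            continue;
        }
        if (isset($seen[$fields[1]])) {
            $problems[] = 'line ' . ($n + 1) . ': duplicate source ' . $fields[1];
        }
        $seen[$fields[1]] = true;
    }
    return $problems;
}

// Sample lines; in practice they could be read with file() from the redirects file.
$problems = check_redirects(array(
    'RedirectPermanent /why.php          /1998/02/11/why-i-wanted-freebsd-before-i-knew-it-existed/',
    'RedirectPermanent /introduction.php /1998/05/11/how-i-found-freebsd/',
    'RedirectPermanent /why.php          /1998/05/11/how-i-found-freebsd/',
));
foreach ($problems as $problem) {
    echo $problem, "\n";
}
```

This only checks the shape of the file. Apache’s own syntax check (apachectl configtest) is still the final word on whether the configuration will load.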

What’s next?

As mentioned when I imported the comments, I still have some tidying up to do.

Remove <HTML></HTML> tags from many comments. I think these are artifacts from Phorum.
Deal with [%sig%] macros in various comments. These relate to signatures for users.

From what I can tell, the only issues remaining are the above. They should be easily dealt with.
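
For what it is worth, the cleanup itself could be as small as the sketch below. It is untested against the real comments. tidy_comment() is a hypothetical helper, and simply deleting the [%sig%] macro is an assumption about what "dealing with" it will mean.

```php
<?php
// Hypothetical helper: strip leftover wrappers and macros from one comment.
function tidy_comment($text)
{
    // Drop literal <HTML> and </HTML> tags, whatever their case.
    $text = preg_replace('|</?html>|i', '', $text);

    // Assumption: the unexpanded signature macro is simply removed.
    $text = str_replace('[%sig%]', '', $text);

    return trim($text);
}

echo tidy_comment("<HTML>Thanks, this worked for me.</HTML>\n[%sig%]"), "\n";
```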