This is the sixth in a series of articles on my migration to WordPress. In this post, I’ll talk about how I imported the comments from my website into the WordPress database. The comments in question are feedback on the articles on the website. I consider them to be an important part of the website.
The comments on an article can be simple questions and answers, or careful elaborations upon an obscure point. Either way, they are part and parcel of a website. That’s why I want to import them.
When I built my website, blogging software was very uncommon. In fact, the term blog was not common at all. That’s why The FreeBSD Diary had a custom built solution. For article comments, I used a customized version of a popular forum package, Phorum. Fortunately, the basic structure used by Phorum is similar to that used by the commenting solution of WordPress.
Let’s look at the WordPress comments
WordPress stores commends in the wp_comments table. Here are the columns:
mysql> show columns from wp_comments; +----------------------+---------------------+------+-----+---------------------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------------------+---------------------+------+-----+---------------------+----------------+ | comment_ID | bigint(20) unsigned | NO | PRI | NULL | auto_increment | | comment_post_ID | bigint(20) unsigned | NO | MUL | 0 | | | comment_author | tinytext | NO | | NULL | | | comment_author_email | varchar(100) | NO | | | | | comment_author_url | varchar(200) | NO | | | | | comment_author_IP | varchar(100) | NO | | | | | comment_date | datetime | NO | | 0000-00-00 00:00:00 | | | comment_date_gmt | datetime | NO | MUL | 0000-00-00 00:00:00 | | | comment_content | text | NO | | NULL | | | comment_karma | int(11) | NO | | 0 | | | comment_approved | varchar(20) | NO | MUL | 1 | | | comment_agent | varchar(255) | NO | | | | | comment_type | varchar(20) | NO | | | | | comment_parent | bigint(20) unsigned | NO | MUL | 0 | | | user_id | bigint(20) unsigned | NO | | 0 | | +----------------------+---------------------+------+-----+---------------------+----------------+ 15 rows in set (0.01 sec) mysql>
Comments are stored one per row in this table. Each comment has a unique identifier, comment_ID (line 5). Each comment relates to exactly one post, identified by comment_post_ID (line 6). A comment may be a reply to another comment, in which case, comment_parent is the ID of the parent comment. And finally, user_id, if not zero, refers to an entry in the wp_users table.
We will be populating these fields:
- comment_post_ID
- comment_author
- comment_author_email (if supplied)
- comment_author_url (if supplied)
- comment_date
- comment_date_gmt
- comment_content
- comment_approved
- comment_parent
- user_id
The difficult parts will be relating comments to posts and to their author.
The starting SQL
After a bit of messing around with SQL, I came up with this SQL as a starting point:
SELECT A.id AS article_id, AF.id AS comment_id, AFX.thread_id, A.name AS article_name, A.actual_date AS date, AF.parent AS comment_parent_id, AF.author, AF.subject, AF.email, AF.approved, AF.host FROM articles A, article_feedback_xref AFX, article_feedback AF, article_feedback_bodies AFB WHERE A.completed = 'Y' AND A.id = AFX.article_id AND AFX.thread_id = AF.thread AND AFB.thread = AF.thread AND AF.id = AFB.id AND A.id = 664 order by actual_date desc
This query specifically looks only at the comments for article 664. You can see them here (if you get a NOT FOUND on that URL, sorry, the website has already transitioned to WordPress and the Phorum links will no longer work).
article_id | comment_id | thread_id | article_name | date | commend_parent_id | author | subject ------------+------------+-----------+------------------------------+------------+-------------------+-----------------------+----------------------------------------------- 308 | 46 | 46 | Stripping ^M from your files | 2000-03-04 | 0 | Dan Langille | see also 308 | 452 | 452 | Stripping ^M from your files | 2000-03-04 | 0 | Zev | yet Another way 308 | 695 | 695 | Stripping ^M from your files | 2000-03-04 | 0 | jason kawaja | shell script for multiple files 308 | 1187 | 695 | Stripping ^M from your files | 2000-03-04 | 695 | maximus | Re: shell script for multiple files 308 | 1140 | 786 | Stripping ^M from your files | 2000-03-04 | 786 | m.hassan | The VIM way 308 | 786 | 786 | Stripping ^M from your files | 2000-03-04 | 0 | Dan Langille | See also: 308 | 1496 | 1496 | Stripping ^M from your files | 2000-03-04 | 0 | Dan | Using tr 308 | 1581 | 1581 | Stripping ^M from your files | 2000-03-04 | 0 | Taz | Stripping ^M (8 rows)
The strategy
For some time now, I’ve been thinking about a strategy for loading comments. I think I have a straight forward solution. I’ll outline it here.
For each comment:
- Find the post it belongs to
- If it has a parent comment, find it
- Insert the comment using the above values
There will be a few things to look up, and they will be kept in an array (i.e. cache), to minimize the amount of database access required.
What will I cache?
- All the posts. I will create an array where the key is (title, date, author). The value will be the post id.
- All the comments loaded so far. The key will be the original comment ID. The value stored there will be the new WorddPress comment ID
Why do we need to cache existing comments? Well, when you go to load a comment into WordPress, any other replies to that comment need to refer to it. Thus, you need to know both the ID from the old system, and the ID from the WordPress database. It is more convenient to store this information locally than to refer to the database each time you need it.
What will I lose
The old system allowed for automatic appending of a custom signature to each post. I may have to deal with that later.
The old URLs for comments will no longer work. I’m not sure how to deal with those yet. By default, they’ll start getting 404s…
How to import a comment
WordPress supplies a function for adding a comment: wp_insert_comment(). The function takes a single parameter. This array contains all the data associated with the new array. For my purposes, I am supplying the following fields:
- comment_post_ID
- user_id
- comment_parent
- comment_date
- comment_date_gmt
- comment_content
The function returns the ID of the just-inserted comment. I use that ID as a reference for any future inserts which may be related to this comment (e.g. a reply).
It turns out to be quite easy to add a comment. So let’s get going. The first step is to import a single comment. From there, all the comments for one post. And then, all comments. We start small. Refine our approach. Repeat.
The first attempts
The first attempt took some time. I managed to import one of my own comments. Then I imported all the comments for an article. That worked well. The only problem: all the comments were anonymous.
Oh.
In order for comments to not be anonymous. I think I need to import more authors… That is, put more entries into the wp_users table.
NOTE: Looking at the wp_comments table, I am not convinced of this requirement to populate wp_users in order to have non-anonymous comments. Perhaps there was a problem elsewhere that I overlooked.
Adding more authors
I went back to my website and started accumulating a list of more authors. This time, the authors of comments. I found this database query worked fine for that:
select distinct('(''' || author || ''', '''', ''' || email || ''', ''' || email || ''', '''', '''', 0, ''' || author || '''),') from article_feedback;
If you look back at the article, you’ll see how I used the output of the above to do an insert into the users table.
NOTE: If the author’s name contains an apostrophe (e.g. Daniel O’Connor), this insert will fail. I fixed such instances manually by added a \ before the ‘.
The export code
The following PHP code exports the comments for one forum, the article feedback forum, via an RSS file.
<?php // // $Id: news.php,v 1.14 2010/02/08 16:29:22 dan Exp $ // // Copyright (c) 1998-2003 DVL Software Limited // require( $_SERVER["DOCUMENT_ROOT"] . "/include/common.php" ); require( $_SERVER["DOCUMENT_ROOT"] . "/include/freebsddiary.php" ); require( $_SERVER["DOCUMENT_ROOT"] . "/include/databaselogin.php" ); $MaxArticles = 10000; $HTML = ''; $HTML .= '<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"' . "\n"; $HTML .= ' "http://www.rssboard.org/rss-0.91.dtd">' . "\n"; $HTML .= '<rss version="0.91">' . "\n"; $HTML .= "\n"; $HTML .= '<channel>' . "\n"; $HTML .= ' <title>The FreeBSD Diary</title>' . "\n"; $HTML .= ' <link>http://www.freebsddiary.org/</link>' . "\n"; $HTML .= ' <description>The largest collection of practical examples for FreeBSD!</description>' . "\n"; $HTML .= ' <language>en-us</language>' . "\n"; $HTML .= ' <copyright>Copyright ' . GetCopyrightYears() . ', DVL Software Limited.</copyright>' . "\n"; $HTML .= "\n"; $sql = "SELECT A.id as article_id, A.name AS article_name, A.author as author, A.actual_date as date, AF.id as comment_id, AF.datestamp as comment_date, AFX.thread_id, AF.parent AS comment_parent_id, AF.author as comment_author, AF.subject, AF.email, AF.approved, AF.host, AFB.body FROM articles A, article_feedback_xref AFX, article_feedback AF, article_feedback_bodies AFB WHERE A.completed = 'Y' AND A.id = AFX.article_id AND AFX.thread_id = AF.thread AND AFB.thread = AF.thread AND AF.id = AFB.id AND AF.approved = 'Y' order by comment_id asc limit $MaxArticles "; //die("<pre>$sql</pre>"); $result = pg_query( $db, $sql ); while ( $myrow = pg_fetch_array( $result ) ) { $email = trim( $myrow["email"] ); $HTML .= ' <item>' . "\n"; $HTML .= ' <title>' . htmlentities( $myrow["article_name"] ) . '</title>' . "\n"; $HTML .= ' <dc:date>' . $myrow["date"] . '</dc:date>' . "\n"; $HTML .= ' <author>' . $myrow["author"] . '</author>' . "\n"; $HTML .= ' <comment_id>' . $myrow["comment_id"] . '</comment_id>' . "\n"; $HTML .= ' <thread_id>' . $myrow["thread_id"] . '</thread_id>' . "\n"; $HTML .= ' <parent_id>' . $myrow["comment_parent_id"] . '</parent_id>' . "\n"; $HTML .= ' <comment_author>' . trim( $myrow["comment_author"] ) . '</comment_author>' . "\n"; $HTML .= ' <comment_date>' . trim( $myrow["comment_date"] ) . '</comment_date>' . "\n"; $HTML .= ' <email>' . htmlentities( $email ) . '</email>' . "\n"; $HTML .= ' <subject>' . htmlentities( $myrow["subject"] ) . '</subject>' . "\n"; $HTML .= ' <host>' . htmlentities( $myrow["host"] ) . '</host>' . "\n"; $HTML .= ' <body>' . htmlentities( $myrow["body"] ) . '</body>' . "\n"; $HTML .= ' </item>' . "\n"; } $HTML .= '</channel>' . "\n"; $HTML .= '</rss>' . "\n"; header( 'Content-type: text/xml' ); echo '<?xml version="1.0"?>', "\n"; echo $HTML;
Lines 62-64 are used to uniquely identify the article being commented upon. This is used to lookup the article/post within WordPress.
The import code
This is pretty rough code. Hopefully, it’ll be useful to you. It is not production quality, but it is migration quality.
It is based entirely upon the RSS Importer. I copied the original code and altered some things.
# cd wp-content/plugins # cp -rp rss-importer rss-importer-comments # cd rss-importer-comments # mv rss-importer.php rss-importer-comments.php
I then proceeded to modify rss-importer-comments.php, as shown below.
<?php /* Plugin Name: RSS Importer Comments Plugin URI: Description: Import comments from an RSS feed. Based upon http://wordpress.org/extend/plugins/rss-importer/ Author: Dan Langille Author URI: http://wordpress.org/ Version: 0.1 Stable tag: 0.1 License: GPL version 2 or later - http://www.gnu.org/licenses/old-licenses/gpl-2.0.html */ if ( !defined( 'WP_LOAD_IMPORTERS' ) ) return; // Load Importer API require_once ABSPATH . 'wp-admin/includes/import.php'; if ( !class_exists( 'WP_Importer' ) ) { $class_wp_importer = ABSPATH . 'wp-admin/includes/class-wp-importer.php'; if ( file_exists( $class_wp_importer ) ) require_once $class_wp_importer; } /** * RSS Importer * * @package WordPress * @subpackage Importer */ /** * RSS Importer * * Will process a RSS feed for importing posts into WordPress. This is a very * limited importer and should only be used as the last resort, when no other * importer is available. * * @since unknown */ if ( class_exists( 'WP_Importer' ) ) { class RSS_ImportComments extends WP_Importer { var $posts = array(); var $file; var $authors = array(); function get_all_authors() { global $wpdb; $query = "SELECT ID, display_name, user_login, user_email FROM $wpdb->users"; return $wpdb->get_results( $wpdb->prepare( $query, $this->authors ) ); return 0; } function get_all_posts() { global $wpdb; $query = "SELECT ID, post_author, post_date, post_title FROM $wpdb->posts WHERE post_type = 'post'"; return $wpdb->get_results( $wpdb->prepare( $query, $this->posts ) ); return 0; } function header() { echo '<div class="wrap">'; screen_icon(); echo '<h2>' . __( 'Import RSS comments', 'rss-importer-comments' ) . '</h2>'; } function footer() { echo '</div>'; } function greet() { echo '<div class="narrow">'; echo '<p>' . __( 'Howdy! This importer allows you to import comments from an RSS 2.0 file into your WordPress site. This is useful if you want to import your comments from a system that is not handled by a custom import tool. Pick an RSS file to upload and click Import.', 'rss-importer-comments' ) . '</p>'; wp_import_upload_form( "admin.php?import=rsscomments&step=1" ); echo '</div>'; } function _normalize_tag( $matches ) { return '<' . strtolower( $matches[1] ); } function get_posts() { global $wpdb; set_magic_quotes_runtime( 0 ); $datalines = file( $this->file ); // Read the file into an array $importdata = implode( '', $datalines ); // squish it $importdata = str_replace( array( "\r\n", "\r" ), "\n", $importdata ); $authors = $this->get_all_authors(); $authorLoginLookup = array(); $authorNameLookup = array(); $authorLookup = array(); foreach ( $authors as $author ) { $key = $author->display_name . ' ## ' . $author->user_email; if ( isset( $authorLookup[$key] ) ) { echo 'WARNING: duplicate display name / email combination ' . $key . '<br>'; } $authorLookup[$key] = $author->ID; if ( isset( $authorNameLookup[$author->display_name] ) ) { // echo 'WARNING: duplicate display_name ' . $author->display_name . '<br>'; } if ( isset( $authorNameLookup[$author->user_login] ) ) { //echo 'WARNING: duplicate user_login ' . $author->user_login . '<br>'; } $authorNameLookup[$author->display_name] = $author->ID; $authorLoginLookup[$author->user_login] = $author->ID; } // echo ' the authors are <pre>' . print_r($authorNameLookup, true) . '</pre>'; // echo ' the authors are <pre>' . print_r($authorLoginLookup, true) . '</pre>'; $posts = $this->get_all_posts(); $postLookup = array(); foreach ( $posts as $post ) { $key = $post->post_author . ' @@ ' . $post->post_date . ' @@ ' . $post->post_title; if ( isset( $postLookup[$key] ) ) { echo ' the posts are <pre>' . print_r( $postLookup, true ) . '</pre>'; die( 'duplicate post found: ' . $key ); } $postLookup[$key] = $post->ID; } // echo ' the posts are <pre>' . print_r($postLookup, true) . '</pre>'; // this is the look up array for the comments added into the system already $comments = array(); preg_match_all( '|<item>(.*?)</item>|is', $importdata, $this->posts ); $this->posts = $this->posts[1]; $index = 0; foreach ( $this->posts as $post ) { preg_match( '|<title>(.*?)</title>|is', $post, $post_title ); $post_title = str_replace( array( '<![CDATA[', ']]>' ), '', $wpdb->escape( trim( $post_title[1] ) ) ); preg_match( '|<pubdate>(.*?)</pubdate>|is', $post, $post_date_gmt ); // if we don't already have something from pubDate preg_match( '|<dc:date>(.*?)</dc:date>|is', $post, $post_date_gmt ); $post_date_gmt = preg_replace( '|([-+])([0-9]+):([0-9]+)$|', '\1\2\3', $post_date_gmt[1] ); $post_date_gmt = str_replace( 'T', ' ', $post_date_gmt ); $post_date_gmt = strtotime( $post_date_gmt ); $post_date_gmt = gmdate( 'Y-m-d H:i:s', $post_date_gmt ); $post_date = get_date_from_gmt( $post_date_gmt ); // by default, all posts belong to the user id = 1. $post_author = 1; preg_match( '|<author.*?>(.*?)</author>|is', $post, $authorName ); if ( $authorName ) { $authorName = $wpdb->escape( trim( $authorName[1] ) ); if ( isset( $authorNameLookup[$authorName] ) ) { $post_author = $authorNameLookup[$authorName]; } else { if ( isset( $authorLoginLookup[$authorName] ) ) { $post_author = $authorLoginLookup[$authorName]; } } } // if we don't already have something from pubDate preg_match( '|<comment_date>(.*?)</comment_date>|is', $post, $comment_date_gmt ); $comment_date_gmt = preg_replace( '|([-+])([0-9]+):([0-9]+)$|', '\1\2\3', $comment_date_gmt[1] ); $comment_date_gmt = str_replace( 'T', ' ', $comment_date_gmt ); $comment_date_gmt = strtotime( $comment_date_gmt ); $comment_date_gmt = gmdate( 'Y-m-d H:i:s', $comment_date_gmt ); $comment_date = get_date_from_gmt( $comment_date_gmt ); preg_match( '|<comment_id.*?>(.*?)</comment_id>|is', $post, $old_comment_id ); $old_comment_id = $wpdb->escape( trim( $old_comment_id[1] ) ); preg_match( '|<thread_id.*?>(.*?)</thread_id>|is', $post, $old_thread_id ); $old_thread_id = $wpdb->escape( trim( $old_thread_id[1] ) ); preg_match( '|<parent_id.*?>(.*?)</parent_id>|is', $post, $old_parent_id ); $old_parent_id = $wpdb->escape( trim( $old_parent_id[1] ) ); preg_match( '|<comment_author.*?>(.*?)</comment_author>|is', $post, $commentAuthorName ); unset( $user_id ); if ( $commentAuthorName ) { $commentAuthorName = $wpdb->escape( trim( $commentAuthorName[1] ) ); if ( isset( $authorNameLookup[$commentAuthorName] ) ) { $user_id = $authorNameLookup[$commentAuthorName]; } else { if ( isset( $authorLoginLookup[$commentAuthorName] ) ) { $post_author = $authorLoginLookup[$commentAuthorName]; } } } preg_match( '|<body>(.*?)</body>|is', $post, $comment_content ); $comment_content = str_replace( array( '<![CDATA[', ']]>' ), '', $wpdb->escape( trim( $comment_content[1] ) ) ); // Clean up content $comment_content = preg_replace_callback( '|<(/?[A-Z]+)|', array( &$this, '_normalize_tag' ), $comment_content ); $comment_content = str_replace( '<br>', '<br />', $comment_content ); $comment_content = str_replace( '<hr>', '<hr />', $comment_content ); // look up the post id for this comment $key = $post_author . ' @@ ' . $post_date . ' @@ ' . stripslashes( $post_title ); if ( isset( $postLookup[$key] ) ) { $comment_post_ID = $postLookup[$key]; } else { die( 'could not find post id for ' . $authorName . ' ' . $key ); } // look up the comment id for the parent unset( $comment_parent ); if ( isset( $old_parent_id ) && !empty( $old_parent_id ) ) { echo 'yes, we have a parent<br>'; // echo '<pre>' . print_r($comments, true) . '</pre>'; if ( isset( $comments[$old_parent_id] ) ) { $comment_parent = $comments[$old_parent_id]; } else { die( 'cold not find parent id for ' . $old_parent_id ); } } // print_r("<pre>" . htmlentities("$post_title $post_date $post_author $old_comment_id $old_comment_id $old_parent_id $comment_content") . "</pre>"); $comments[$old_comment_id] = $old_comment_id; $commentdata = compact( 'comment_post_ID', 'user_id', 'comment_parent', 'comment_date', 'comment_date_gmt', 'comment_content' ); echo "<pre>" . print_r( $commentdata, true ) . "</pre>"; // continue; $comment_ID = wp_insert_comment( $commentdata ); $comments[$old_comment_id] = $comment_ID; $index++; } echo '<pre>' . print_r( $comments, true ) . '</pre>'; echo 'we had ' . $index . ' comments<br>'; exit; } function import_posts() { echo '<ol>'; foreach ( $this->posts as $post ) { echo "<li>" . __( 'Importing comments...', 'rss-importer-comments' ); extract( $post ); if ( $post_id = post_exists( $post_title, $post_content, $post_date ) ) { _e( 'Post already imported', 'rss-importer-comments' ); } else { $post_id = wp_insert_post( $post ); if ( is_wp_error( $post_id ) ) return $post_id; if ( !$post_id ) { _e( 'Couldn’t get post ID', 'rss-importer-comments' ); return; } if ( 0 != count( $categories ) ) wp_create_categories( $categories, $post_id ); _e( 'Done!', 'rss-importer-comments' ); } echo '</li>'; } echo '</ol>'; } function import() { $file = wp_import_handle_upload(); if ( isset( $file['error'] ) ) { echo $file['error']; return; } $this->file = $file['file']; $this->get_posts(); $result = $this->import_posts(); if ( is_wp_error( $result ) ) return $result; wp_import_cleanup( $file['id'] ); do_action( 'import_done', 'rsscomments' ); echo '<h3>'; printf( __( 'All done. <a href="%s">Have fun!</a>', 'rss-importer-comments' ), get_option( 'home' ) ); echo '</h3>'; } function dispatch() { if ( empty( $_GET['step'] ) ) $step = 0; else $step = (int) $_GET['step']; $this->header(); switch ( $step ) { case 0: $this->greet(); break; case 1: check_admin_referer( 'import-upload' ); $result = $this->import(); if ( is_wp_error( $result ) ) echo $result->get_error_message(); break; } $this->footer(); } function RSS_ImportComments() { // Nothing. } } $rss_import = new RSS_ImportComments(); register_importer( 'rsscomments', __( 'RSS COMMENTS', 'rss-importer-comments' ), __( 'Import comments from an RSS feed.', 'rss-importer-comments' ), array( $rss_import, 'dispatch' ) ); } // class_exists( 'WP_Importer' ) function rss_importer_comments_init() { load_plugin_textdomain( 'rss-importer-comments', false, dirname( plugin_basename( __FILE__ ) ) . '/languages' ); } add_action( 'init', 'rss_importer_comments_init' );
Notes on the code
- Line 110 refers to a abandoned strategy of using both the author display_name and user_email to lookup a comment author.
- Line 239 is the key used to lookup the post_id for this comment. This associated the comment with a given post.
- line 276 is the actual insertion of the comment
- line 286 terminates the script… because of the nature of comments, I wanted to insert each one as I went. This enabled the code to always know the parent_comment ID
In conclusion
I had to do quite a bit of import, delete, modify code, repeat… before I got things to a reasonable situation. I’m sure you’ll need to do similar.
I still have some tidy up to do.
- Remove <HTML></HTML> tags from many comments. I think there are artifacts from Phorum.
- Deal with [%sig%] macros in various comments. These relate to signatures for users.
So far so good. I’m happy with the import so far.