Programatically Importing Data Into Drupal
I'm currently consulting on a project that entails importing thousands of records from a home grown content management system into Drupal. The legacy system was coded in-house using Coldfusion and storing the records in Oracle. While there are several Drupal modules for importing nodes I was not able to find one that fit the bill exactly. Having the ability to import the data in a relatively automated fashion was also a requirement.
I experimented with numerous modules to find an import solution that fit my needs. In the past I've use the Node Import Module for importing simple data from a CSV file. In this case I needed more granular control of the data. There needed to be conditional processing of legacy data (particularly with setting authoring information). The Import/Export API module was also a candidate but seemed overly complex for my (relatively) basic needs. In the end I created a JSON Export API on the legacy site and a php command line script for importing the data. This method safely imported several thousand nodes in a couple of minutes.
The project was broken into 3 major phases:
- Data analysis and data entry interface design
- Legacy Data Export
- Legacy Data Import
Data Analysis/Data Entry Interface Design
This was a necessary step that would be required of any data migration project. It entailed looking at the legacy content management system and analyzing the data model. The majority of that work was reviewing screen shots of the legacy data entry interface and a "describe tablename" on the database. Once I had a clear picture of the data model I fired up Drupal and created the nodes using CCK.
Data Export
How many times have you worked on a data import project only to have to re-import the data again do to a mistake or data change? I wanted the ability to pull in the data without having to manually export it. In the case of this project I was putting some legacy Coldfusion apps out of their misery. In order to get the data out I coded up a CFML page that outputs the data as JSON. Below is code similar to what I used in production:
I ended up choosing JSON as the export format. Why JSON? It is lean, easy to serialize and easy to consume. Some kind of XML based webservice would have worked too. This was more of an academic exercise to see if it could be done effectively.
This script relies on several CF UDF's available from www.cflib.org. The most critical one being jsonencode. I added both the JSON UDF's to a CFC for organizational purposes. If you are lucky enough to be running CF8 JSON is baked right in. Below is some conceptual code hacked together.
<cfscript>
json_obj = createObject('component','cfc.JSONSerializer'); //UDF's for working with JSON.
text_utils = createObject('component','cfc.textUtils'); //A greatest hits of text utilities packed into a cfc
legacy_data_obj = createObject('component','cfc.legacyDataComponent'); //A CFC with the database functionality
</cfscript>
<!-- Get a list of all the row ID's and output as JSON -->
<cfif>
<cfset legacy_indexes="legacy_data_obj.getIndexes()">
<cfoutput>#json_obj.jsonEncode(legacy_indexes)#</cfoutput>
<!-- Display JSON of a row if ID is passed -->
<cfelse>
<cfset legacy_row="legacy_data_obj.getRowByID(url.id)">
<cfoutput>#json_obj.jsonEncode(legacy_row)#</cfoutput>
</cfset>
</cfelse></cfset></cfif>The above code will display a list of ID's by hitting http://yoursite.com/legacy_api.cfm. Initially I tried having a "mega json" file with all the results but the php json_decode function was not happy with a 5mb JSON file. This allowed me to iterate through the indexes and pull in each row. Highly inefficient? Probably! Gets the job done? Absolutely!
Date Import
Now that I had a nice easy way to get the data I needed to put it into Drupal. After a bit of googling I located several recipes for programmatically inserting data as a CCK node. While it would be possible to make an elaborate module for these kind of tasks, a php cli script is the quick and easy way.
/** * @file * Used to import campus news articles as nodes */ // Include Drupal require_once('./includes/bootstrap.inc'); drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL); //Define Config Variables define("NODE_TYPE","news"); define("JSON_LOCATION","http://www.mysite.com/api/legacy_api.cfm"); /** * Returns a node object from legacy json * @param obj $legacy_obj a json pr object * @param obj $node a node obj * legacy_obj Data Definition * $legacy_obj->TITLE * $legacy_obj->SUMMARY * $legacy_obj->CONTACT_ID * $legacy_obj->CONTENT * $legacy_obj->PUBLISH_ON * $legacy_obj->SUBCAT1 * $legacy_obj->SUBCAT2 **/ function legacyimporter_createnode($legacy_obj) { $user = legacyimporter_contactid_to_user($legacy_obj->CONTACT_ID); //Function for mapping legacy User ID's to UID/NAME $node = new stdClass(); $node->type = NODE_TYPE; $node->language = ''; $node->uid = $user["uid"]; $node->status = 1; $node->created = $legacy_obj->PUBLISH_ON; $node->title = $legacy_obj->TITLE; $node->name = $user["name"]; $node->body = $legacy_obj->CONTENT; //Custom CCK Field $node->field_subtitle[0] = array("value" => $legacy_obj->SUBTITLE); return $node; } /** * Function for downloading a legacy JSON row * @param int $id ID of article * @return obj $node object **/ function legacyimporter_process_json($id) { $url = JSON_LOCATION . "?id=" . $id; $node = FALSE; $result = drupal_http_request($url); if($result->code == 200) { $wms_obj = json_decode(trim($result->data)); $node = legacyimporter_createnode($wms_obj); } return $node; } //Get List of Files $result = drupal_http_request(JSON_LOCATION); //Set article type for media coverage $article_tally = 0; if($result->code == 200) { $legacy_index = json_decode(trim($result->data)); if(is_array($legacy_index)) { foreach($legacy_index as $i) { $node = legacyimporter_process_json($i->ID); $valid = legacyimporter_validatenode($node); //This is a function to validate the node if($valid['isValid']) { node_save($node); $article_tally++; } else { //Handle Node Errors Here } } } } print "Saved " . $article_tally . "nodes!";
The preceding example is a greatly simplified version of what the final product was. It will work, however, and illustrates what one would need to do to programatically import a CCK Node. I would love to hear how other people are solving this problem. Email away!
- Login to post comments