Poudro's blog
CTO / Data Scientist / Problem Solver - Consultant

Integrating Full-Text search into WordPress using Xapian

Integrating Full-Text search into WordPress using Xapian - poudro.com développeur web freelance


I recently needed to integrate a natural language full-text search bar with a suggestion drop down list into a WordPress project I was working on. Although the default WordPress solution worked ok, I found it to be relatively slow and clumsy. So I decided to have a look at if and how I could integrate Xapian into WordPress.



Context



WordPress has a semi built-in search system that basically issues a “LIKE” query on the posts and comments contents to the SQL database that backs WordPress. This is usually considered very slow (there are some examples and performance comparisons to this end here) and if you need to search something else than the contents of the posts (which I did) you need to write a custom query anyway. So I started out trying the full-text indexing capabilities of MySQL (version 5.1).


To do this, and so as not to touch the basic WP table structure (to reduce risks of breaking on upgrade and things like that), I created a new table which would hold only the post ID and the data I wanted indexed (in this case the post title) in a column with a MySQL FULLTEXT index for natural language processing. Then throwing something like this :

$wpdb->get_results( 
  $wpdb->prepare(
     "SELECT wposts.* 
      FROM $wpdb->posts wposts, indexed_content_table ict 
      WHERE 
        wposts.ID = ict.post_id 
        AND MATCH (ict.title) AGAINST (%s);"
     , $search) 
);



would result in a set of posts that correspond to the query.


There are two issues with this though, the first one being that it is still relatively slow (you at least have to start up the WP system, 200ms minimum from what I have measured) and the second one being that it performs a search on full words only. That means that we have to expand the query somehow if we want to have a suggestion list as soon as the user types in text. There are a number of options you can pass along with the SQL query, but nothing that really does the job straight out of the box. One way of getting around this limitation is to reformat the indexed data and the query into something that can be processed and matched quickly (i.e. usually without wildcard expansion), and which yields coherent results. This can be done, at least in part, by using a trigram (or more generally n-gram) representation of the data. But we are still too slow for a practical suggestion box (about 300ms per query, i.e. per entered character in the search field) and, although we could bypass the WP loading scheme to tap directly into the database, I didn’t want to do that for several reasons, one of which is that using MySQL for full-text searches is really no fun at all. Ultimately, in my opinion, it would be nice to achieve something well below 100ms per query to make the experience as fluid as possible (the human eye has a refresh rate of about 100ms).



About Xapian



Xapian “is an Open Source Search Engine Library” (as defined on the website). It is written in C++ and has a great number of features and customisations available (I am using version 1.2.7). From what I’ve been able to gather, although without any real study I can find to back this up (and I may very well be totally wrong about this), Xapian is slightly outperformed by Lucene/Solr in larger settings (i.e. millions of documents distributed over a number of servers). On the other hand, Xapian is ultra light and has a small memory signature compared to Lucene based projects.


What I’m interested in here is that it is also magnitudes easier to set up than Lucene/Solr and that few WordPress sites grow to such a point that we could actually start having the kind of issues that appear with large sets. Finally, what really set me to using it is that it has fast, robust and simple PHP bindings (WP is PHP).


I’m not going to go into the details of Xapian here, they have all the documentation you might need for any setup and they have very responsive mailing lists from what I understand (haven’t actually used them myself though). What I can say is that it is very straightforward, and the basics (being insert, delete and search) are very easy to access and super fast.


Entries are called documents. To every document can be attached information in three different “forms” : term, value and data. The three forms have different uses which you can read all about here. The terms are generally the indexed part, but you can very easily induce extra text indexing with respect to the document however you like (i.e. the trigram representation already mentionned). Search results are filtered and sorted through the use of values (e.g. date range and such). And finally, the data is a blob that cannot be used for searching.



Putting Xapian into WordPress



To use Xapian on the system where WP is set up means you need to be on a proper server where you can install xapian-core and xapian-bindings for PHP (i.e. you can’t use simple hosting services, unless they support Xapian, which, I’m sure, is rare). Once installed, a very good source of inspiration to start interacting with Xapian through PHP can be found in the XapianWrapper.php script which can be found here. It would need to be changed a bit to fit your needs, but I think it’s a good base to start with. I recommend getting yourself acquainted with Xapian (if not already) and especially figuring out the proper query procedure you want to use and test it thoroughly before you put it into WP. I made the mistake of putting it in too early and lost a lot of time trying to figure out if bugs where coming from WP or Xapian.


The first thing to do is write (or not) your own wrapper to Xapian. In my case I wanted titles to be indexed while needing date as well as other filterings. So I started by creating my basic document structure with terms, values, data and indexing as needed (an example can be found in the index and index_row methods of the aforementionned wrapper). Similarly delete and, of course, search methods from the XapianWrapper script are also inspiring.


Once that is done, you need to index a document when it is published/untrashed in WordPress and remove it from the index when it is trashed and/or deleted. To do this, the usual WP way is to use hooks. To keep it simple, I’m only going to write about posts, you would have to do the same for pages, comments… For example, you can use the following hooks :

  add_action('publish_post', 'insert_into_Xapian');
  add_action('untrashed_post', 'insert_into_Xapian');

  add_action('delete_post', 'remove_from_Xapian');
  add_action('trashed_post', 'remove_from_Xapian');



where the functions can be something like this

  function insert_into_Xapian($post_id)
  {
    require_once('XapianWrapper.php');
    $x = new XapianWrapper();
    $p = get_post($post_id);
  
    $params = array( 'id'=>$post_id, 'title'=>$p->post_title, WHAT_YOU_NEED );
    $x->index( $params );
  }

  function remove_from_Xapian($post_id)
  {
    require_once('XapianWrapper.php');
    $x = new XapianWrapper();
  
    $params = array( 'id'=>$post_id, WHAT_YOU_NEED );
    $x->delete( $params );
}



That should index and delete your posts using the parameters you set up. Then you need something that calls Xapian directly to process searches upon request, which, in the form of a very simple PHP script can look like this :

  if (!isset($_GET['s']))
    exit();

  require_once('XapianWrapper.php');
  $x = new XapianWrapper();
  $search = urldecode($_GET['s']);

  $response = $x->search( $search );
  
  DO_STUFF_AND_PRINT($response);



With this method, on a relatively slow and moderately busy server, in testing on about 10000 documents, I have managed to bring response time down to around 50ms from 300ms which it was with the MySQL FULLTEXT version running through WP. That means that when you type in a query, you get the answer almost instantly in the suggestion box. Which is exactly what I wanted, thank you Xapian !



Some Stats



To wrap up I wanted to really test the speed difference, at the query level, between Xapian and MySQL FULLTEXT indices. To do this, I needed a lot of different titles which is basically what I’m calling a body of documents composed of texts of up to 10 to 15 words at most on average. The easiest I had on hand was to extract the titles of the articles from a dump of the English version of Wikipedia.


The methodology is as follows, I started out by extracting a certain number N of random title entries from the dump and indexing these in Xapian and in a dedicated MySQL table. Then, from this list of N documents I would extract 5 random sublists of 1000 entries and use these sublists to query the index (i.e. whatever the value of the total number of documents in the index, there will always be the same number of queries performed). I would then measure, as a function of N, the average time it would take for every query to run through the entire process (parsing and transforming the query, connecting to the DB, fetching and validating results and finally returning the answer). The graphs below show the results I obtained.


Xapian vs MySQL - Total number of documents below 50,000


Xapian vs MySQL


As you can see, for small database sizes, MySQL actually outperforms Xapian on my system. But, as soon as we hit the 10,000 documents mark, roles become reversed. MySQL’s performance degrades more rapidly with the total number of documents in the table than Xapian which manages to stay relatively flat. (Both in fact degrade somewhat linearly with N, but MySQL faster than Xapian).








In conclusion, it may not have been totally necessary to move away from MySQL to generate a full-text index on the pages, and actually, for reasonable database sizes (below 10,000 documents) MySQL performs better than Xapian. Why use Xapian, then ? Good question. For one, these tests are all performed locally, on a local MySQL installation on a local hard drive. It is quite standard that the database is accessed through a network which means extra lag. Xapian can be run through the network, but the idea here was to have a small enough index that it can sit locally without too much overhead. For 250,000 documents, both databases holding the same information, the complete index size was 132M for Xapian, with very little extra room needed for the library itself, vs 159M for MySQL which requires all the overhead that comes with the management system. Another reason I can think of, is that by seperating the search from the database, there is less chance of one (mainly the search) interfering with the other (the WP database). Indeed, while playing with MySQL FULLTEXT, I managed to lock the database fairly easily which meant the site was not accessible during that time. I was never able to achieve that with Xapian. I would also like to run the same tests with concurrent requests, but it’s more difficult to simulate concurrent accesses.


Overall it was more of an experiment in full-text indices and search engine performance than anything else. I am thinking, though, of maybe (if I have the time) adapting this into plugin form (with support for both Xapian and MySQL?), stay tuned for more.




21 Dec 2011