Rewriting Thinking Sphinx: Introducing Realtime Indices
The one other feature in the rewrite of Thinking Sphinx that I wanted to highlight should most certainly be considered in beta, but it’s gradually getting to the point where it can be used reliably: real-time indices (which are now covered pretty decently in the official TS documentation).
Real-time indices are built into Sphinx, and are indices that can be updated dynamically, on the fly - they're not backed by a database source. They do have a defined structure with fields and attributes (so they're not a NoSQL key/value store), but they remove the need for delta indices, because each record in a real-time index can be updated directly. Within Thinking Sphinx, you also get the benefit of referring to Ruby methods instead of tables and columns.
The recent 3.0.4 release of Thinking Sphinx provides support for this, but the workflow’s a little different from the SQL-backed indices:
Define your indices
Presuming a Product model defined just so:
```ruby
class Product < ActiveRecord::Base
  has_many :categorisations
  has_many :categories, :through => :categorisations
end
```
You can put together an index like this:
```ruby
ThinkingSphinx::Index.define :product, :with => :real_time do
  indexes name

  has category_ids, :type => :integer, :multi => true
end
```
You can see here that it's very similar to a SQL-backed index, but we're referring to Ruby methods (such as category_ids, auto-generated by our associations), and we're specifying the attribute type explicitly - as we can't be sure what a method returns.
Set up callbacks
Every time a record is updated in your database, you want those changes to be reflected in Sphinx as well. Sometimes you may want associated models to prompt a change too - hence, these callbacks aren't added automatically.
In our example above, we'd want after_save callbacks in our Product model (of course) and also our Categorisation model - as that will impact a product's category_ids attribute:

```ruby
# within product.rb
after_save ThinkingSphinx::RealTime.callback_for(:product)

# within categorisation.rb
after_save ThinkingSphinx::RealTime.callback_for(:product, [:product])
```
The first argument is the reference to the indices involved - matching the first argument when you define your index. The second argument in the Categorisation example is the method chain required to get to the objects involved in the index.
Generate the configuration
We've no need for the old ts:index task, as that preloads index data via the database.
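If I have the task name right (check `rake -T` in your own app to confirm), generating the configuration file looks like this:

```shell
# Writes the Sphinx configuration file for the current environment:
rake ts:configure
```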
All of our interactions with Sphinx are through the daemon - and so, Sphinx must be running before we can add records into the indices.
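So, before anything else, start the daemon - again, presuming the standard task name:

```shell
# The daemon must be running before records can be added to the indices:
rake ts:start
```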
Populate the initial data
The ts:generate task will go through each index, load each record of the appropriate model, and insert (or update, if it already exists) the data for that record in the real-time indices. If you've got a tonne of records or complex index definitions, this could take a while.
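With the daemon running, populating is a single task:

```shell
# Loads every record of each indexed model into the real-time indices:
rake ts:generate
```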
Everything at once
The regenerate task will stop Sphinx (if it’s running), clear out all Sphinx index files, generate the configuration file again, start Sphinx, and then repopulate all the data.
Essentially, this is the rake task you want to call when you’ve changed the structure of your Sphinx indices.
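Presuming the task is named after the behaviour described above:

```shell
# Stops Sphinx, clears index files, reconfigures, restarts, repopulates:
rake ts:regenerate
```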
Handle with care
Once you have everything in place, then searching will work, and as your models are updated, your indices will be too. In theory, it should be pretty smooth sailing indeed!
Of course, there could be glitches, and so if you spot inconsistencies between your database and Sphinx, consider where you may be making changes to your database without firing the after_save callbacks. You can run the ts:generate task at any point to bring your Sphinx indices back up to date.
I don’t yet have Flying Sphinx providing full support for real-time indices - it should work fine, but there’s not yet any automated backup (whereas SQL-backed indices are backed up every time you process the index files). This means if a server fails it’d be up to you to restore your index files. It’s on my list of things to do!
I’m keen to provide hooks to allow the callbacks to fire off background jobs instead of having that Sphinx update part of the main process - though it’s certainly not as bad as the default delta approach (you’re not shelling out to another process, and you’re only updating a single record).
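As a rough sketch of what such a hook might look like - entirely hypothetical: `IndexUpdateJob` and the in-process queue below are invented stand-ins for a real job library like Sidekiq or Resque, and none of this ships with Thinking Sphinx:

```ruby
# An invented job that would perform the Sphinx update outside the
# request cycle. The queue here is a plain array standing in for a real
# background-job backend.
QUEUE = []

class IndexUpdateJob
  def self.enqueue(reference, id)
    QUEUE << [reference, id]
  end

  def self.perform(reference, id)
    # In a real app, this is where you'd load the record and invoke the
    # Thinking Sphinx real-time callback to push its data to the daemon.
  end
end

# A model's after_save would then enqueue rather than updating Sphinx inline:
IndexUpdateJob.enqueue(:product, 42)
QUEUE.map(&:first) # => [:product]
```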
I’m starting to play with this in my own apps, and am keen to see it used in production. It is a different way of using Sphinx, but it’s certainly one worth considering. If you give it a spin, let me know how you go!