Rewriting Thinking Sphinx: Introducing Realtime Indices

The one other feature in the rewrite of Thinking Sphinx that I wanted to highlight should most certainly be considered in beta, but it’s gradually getting to the point where it can be used reliably: real-time indices (which are now covered pretty decently in the official TS documentation).

Real-time indices are built into Sphinx, and are indices that can be dynamically updated on the fly - and are not backed by a database sources. They do have a defined structure with fields and attributes (so they’re not a NoSQL key/value store), but they remove the need for delta indices, because each record in a real-time index can be updated directly. You also get the benefit, within Thinking Sphinx, to refer to Ruby methods instead of tables and columns.

The recent 3.0.4 release of Thinking Sphinx provides support for this, but the workflow’s a little different from the SQL-backed indices:

Define your indices

Presuming a Product model defined just so:

class Product < ActiveRecord::Base
  has_many :categorisations
  has_many :categories, :through => :categorisations
end

You can put together an index like this:

ThinkingSphinx::Index.define :product, :with => :real_time do
  indexes name

  has category_ids, :type => :integer, :multi => true
end

You can see here that it’s very similar to a SQL-backed index, but we’re referring to Ruby methods (such as category_ids, perhaps auto-generated by associations), and we’re specifying the attribute type explicitly - as we can’t be sure what a method returns.

Add Callbacks

Every time a record is updated in your database, you want those changes to be reflected in Sphinx as well. Sometimes you may want associated models to prompt a change - hence, these callbacks aren’t added automatically.

In our example above, we’d want after_save callbacks in our Product model (of course) and also our Categorisation model - as that will impact a product’s category_ids value.

# within product.rb
after_save ThinkingSphinx::RealTime.callback_for(:product)

# within categorisation.rb
after_save ThinkingSphinx::RealTime.callback_for(:product, [:product])

The first argument is the reference to the indices involved - matching the first argument when you define your index. The second argument in the Categorisation example is the method chain required to get to the objects involved in the index.

Generate the configuration

rake ts:configure

We’ve no need for the old ts:index task as that’s preloading index data via the database.

Start Sphinx

rake ts:start

All of our interactions with Sphinx are through the daemon - and so, Sphinx must be running before we can add records into the indices.

Populate the initial data

rake ts:generate

This will go through each index, load each record for the appropriate model and insert (or update, if it exists) the data for that into the real-time indices. If you’ve got a tonne of records or complex index definitions, then this could take a while.

Everything at once

rake ts:regenerate

The regenerate task will stop Sphinx (if it’s running), clear out all Sphinx index files, generate the configuration file again, start Sphinx, and then repopulate all the data.

Essentially, this is the rake task you want to call when you’ve changed the structure of your Sphinx indices.

Handle with care

Once you have everything in place, then searching will work, and as your models are updated, your indices will be too. In theory, it should be pretty smooth sailing indeed!

Of course, there could be glitches, and so if you spot inconsistencies between your database and Sphinx, consider where you may be making changes to your database without firing the after_save callback. You can run the ts:generate task at any point to update your Sphinx dataset.

I don’t yet have Flying Sphinx providing full support for real-time indices - it should work fine, but there’s not yet any automated backup (whereas SQL-backed indices are backed up every time you process the index files). This means if a server fails it’d be up to you to restore your index files. It’s on my list of things to do!

What’s next?

I’m keen to provide hooks to allow the callbacks to fire off background jobs instead of having that Sphinx update part of the main process - though it’s certainly not as bad as the default delta approach (you’re not shelling out to another process, and you’re only updating a single record).

I’m starting to play with this in my own apps, and am keen to see it used in production. It is a different way of using Sphinx, but it’s certainly one worth considering. If you give it a spin, let me know how you go!

Kirill Maximov left a comment on 5 Aug, 2013:

Hello,

This weekend I tried migrating to TS 3.0 with RT indexes - and I’ve faced the problem with initial data population. As far as I understand, there is no way to add “includes” and “joins” when generating initial data which gives N+1 problem and serious performance issues. Could it be possible to add some syntax which would allow to specify includes for data (re)generation?

Thanks,

JeremyLi left a comment on 20 Mar, 2014:

I create two index(:title, :content) in a model Article, and if i want to search data only in the :title index, how to do that?

Mikhail left a comment on 19 Jun, 2014:

Hi.

Is there any way to set callback for deleting model from index?

Prior to this: Gutentag: Simple Rails Tagging

More recently: Melbourne Ruby Retrospective for 2013