Rewriting Thinking Sphinx: Introducing Realtime Indices
The one other feature in the rewrite of Thinking Sphinx that I wanted to highlight should most certainly be considered in beta, but it’s gradually getting to the point where it can be used reliably: real-time indices (which are now covered pretty decently in the official TS documentation).
Real-time indices are built into Sphinx, and are indices that can be dynamically updated on the fly - and are not backed by a database sources. They do have a defined structure with fields and attributes (so they’re not a NoSQL key/value store), but they remove the need for delta indices, because each record in a real-time index can be updated directly. You also get the benefit, within Thinking Sphinx, to refer to Ruby methods instead of tables and columns.
The recent 3.0.4 release of Thinking Sphinx provides support for this, but the workflow’s a little different from the SQL-backed indices:
Define your indices
Presuming a Product model defined just so:
class Product < ActiveRecord::Base
has_many :categorisations
has_many :categories, :through => :categorisations
end
You can put together an index like this:
ThinkingSphinx::Index.define :product, :with => :real_time do
indexes name
has category_ids, :type => :integer, :multi => true
end
You can see here that it’s very similar to a SQL-backed index, but we’re
referring to Ruby methods (such as category_ids
, perhaps
auto-generated by associations), and we’re specifying the attribute type
explicitly - as we can’t be sure what a method returns.
Add Callbacks
Every time a record is updated in your database, you want those changes to be reflected in Sphinx as well. Sometimes you may want associated models to prompt a change - hence, these callbacks aren’t added automatically.
In our example above, we’d want after_save
callbacks in our Product
model (of course) and also our Categorisation model - as that will
impact a product’s category_ids
value.
# within product.rb
after_save ThinkingSphinx::RealTime.callback_for(:product)
# within categorisation.rb
after_save ThinkingSphinx::RealTime.callback_for(:product, [:product])
The first argument is the reference to the indices involved - matching the first argument when you define your index. The second argument in the Categorisation example is the method chain required to get to the objects involved in the index.
Generate the configuration
rake ts:configure
We’ve no need for the old ts:index
task as that’s preloading index
data via the database.
Start Sphinx
rake ts:start
All of our interactions with Sphinx are through the daemon - and so, Sphinx must be running before we can add records into the indices.
Populate the initial data
rake ts:generate
This will go through each index, load each record for the appropriate model and insert (or update, if it exists) the data for that into the real-time indices. If you’ve got a tonne of records or complex index definitions, then this could take a while.
Everything at once
rake ts:regenerate
The regenerate task will stop Sphinx (if it’s running), clear out all Sphinx index files, generate the configuration file again, start Sphinx, and then repopulate all the data.
Essentially, this is the rake task you want to call when you’ve changed the structure of your Sphinx indices.
Handle with care
Once you have everything in place, then searching will work, and as your models are updated, your indices will be too. In theory, it should be pretty smooth sailing indeed!
Of course, there could be glitches, and so if you spot inconsistencies
between your database and Sphinx, consider where you may be making
changes to your database without firing the after_save callback. You
can run the ts:generate
task at any point to update your Sphinx
dataset.
I don’t yet have Flying Sphinx providing full support for real-time indices - it should work fine, but there’s not yet any automated backup (whereas SQL-backed indices are backed up every time you process the index files). This means if a server fails it’d be up to you to restore your index files. It’s on my list of things to do!
What’s next?
I’m keen to provide hooks to allow the callbacks to fire off background jobs instead of having that Sphinx update part of the main process - though it’s certainly not as bad as the default delta approach (you’re not shelling out to another process, and you’re only updating a single record).
I’m starting to play with this in my own apps, and am keen to see it used in production. It is a different way of using Sphinx, but it’s certainly one worth considering. If you give it a spin, let me know how you go!
Hello,
This weekend I tried migrating to TS 3.0 with RT indexes - and I’ve faced the problem with initial data population. As far as I understand, there is no way to add “includes” and “joins” when generating initial data which gives N+1 problem and serious performance issues. Could it be possible to add some syntax which would allow to specify includes for data (re)generation?
Thanks,
I create two index(:title, :content) in a model Article, and if i want to search data only in the :title index, how to do that?
Hi.
Is there any way to set callback for deleting model from index?