Indexing your Models
- Basic Indexing
- Callbacks
- Index Names
- Real-time Indices vs SQL-backed Indices
- Fields
- Attributes
- Conditions and Groupings
- Sanitizing SQL
- Index Options
- Multiple Indices
- Processing your Index
Basic Indexing
Everything to set up the indices for your models goes in files in app/indices. The files themselves can be named however you like, but I generally opt for model_name_index.rb. At the very least, the file name should not be the same as your model’s file name. Here’s an example of what goes in the file:
ThinkingSphinx::Index.define :article, :with => :active_record do
  indexes subject, :sortable => true
  indexes content
  indexes author.name, :as => :author, :sortable => true
  has author_id, created_at, updated_at
end
You’ll notice the first argument is the model name downcased and as a symbol, and we are specifying the processor (:active_record) to use SQL-backed indices. Everything inside the block is just like previous versions of Thinking Sphinx, if you’re familiar with that (and if not, keep reading).
An equivalent index definition if you want to use real-time indices would be:
ThinkingSphinx::Index.define :article, :with => :real_time do
  indexes subject, :sortable => true
  indexes content
  indexes author.name, :as => :author, :sortable => true
  has author_id, :type => :integer
  has created_at, :type => :timestamp
  has updated_at, :type => :timestamp
end
For both SQL-backed and real-time indices, you’ll also want to add callbacks to the models that are being indexed.
When you’re defining indices for namespaced models, the model reference is a lowercase string that uses / as the namespace separator, cast to a symbol:
# For a model named Blog::Article:
ThinkingSphinx::Index.define 'blog/article'.to_sym, :with => :active_record
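If you would rather derive that reference than type it by hand, the conversion is mechanical. Here is a minimal sketch in plain Ruby. The index_reference helper is hypothetical (it is not part of Thinking Sphinx), and its underscoring is a simplified stand-in for ActiveSupport's String#underscore, so acronym-heavy class names may come out differently:

```ruby
# Hypothetical helper: turns a class name such as "Blog::Article" into
# the symbol form used as the first argument of an index definition.
def index_reference(class_name)
  class_name
    .gsub("::", "/")                   # namespaces become slashes
    .gsub(/([a-z\d])([A-Z])/, '\1_\2') # BlogPost becomes Blog_Post
    .downcase
    .to_sym
end

index_reference("Blog::Article")   # => :"blog/article"
index_reference("Admin::BlogPost") # => :"admin/blog_post"
```

In a Rails app, Blog::Article.name.underscore.to_sym gives the same result for these names without any custom helper.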
Callbacks
To ensure changes to your database models are reflected in Sphinx, you need to explicitly add callbacks to indexed models. This was done automatically in Thinking Sphinx v4 and earlier, but the performance overhead on all model changes was less than ideal, so you must now specify the callbacks for just the indexed models.
# if your indexed model is app/models/article.rb:
class Article < ApplicationRecord
  # if you're using SQL-backed indices:
  ThinkingSphinx::Callbacks.append(
    self, :behaviours => [:sql]
  )

  # if you're using SQL-backed indices with deltas:
  ThinkingSphinx::Callbacks.append(
    self, :behaviours => [:sql, :deltas]
  )

  # if you're using real-time indices
  ThinkingSphinx::Callbacks.append(
    self, :behaviours => [:real_time]
  )

  # if you're using namespaced models:
  ThinkingSphinx::Callbacks.append(
    self, 'admin/article', :behaviours => [:real_time]
  )

  # If you've got the `attribute_updates` setting enabled in
  # your config/thinking_sphinx.yml file, you'll want to
  # include the callbacks for that as well:
  ThinkingSphinx::Callbacks.append(
    self, :behaviours => [:sql, :updates]
  )
  # Though given this feature isn't enabled by default, I
  # suspect not many people will need to do this. The setting
  # is only useful for updating attribute values in SQL-backed
  # indices that aren't using deltas. The only way for fields
  # to be updated is by using deltas or real-time indices.
end
If you’re using real-time indices and want changes to associated data to fire Sphinx updates for a related model, you can specify a method chain for the callback. You’ll also need to pass the index reference (the first argument in the index definition, usually the model’s name as an underscored, lowercase symbol) as the second argument:
# in app/models/comment.rb, presuming a comment belongs_to :article
# note the second argument is :article, as per the
# ThinkingSphinx::Index.define call.
ThinkingSphinx::Callbacks.append(
  self, :article, :behaviours => [:real_time], :path => [:article]
)
The path option is a chain, and should be in the form of an array of symbols, each symbol representing methods called to get to the indexed object (so, an instance of the Article model in the example above).
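Conceptually, following a path means calling each method in turn, starting from the record that fired the callback. The sketch below illustrates that idea in plain Ruby. The resolve_path helper and the Struct stand-ins are hypothetical, and this is not a copy of Thinking Sphinx's internals:

```ruby
# Stand-ins for ActiveRecord models, purely for illustration.
Article = Struct.new(:title)
Comment = Struct.new(:article)
Reply   = Struct.new(:comment)

# Hypothetical helper: call each method named in the path, one after
# the other, starting from the instance that changed.
def resolve_path(instance, path)
  path.inject(instance) { |object, method_name| object.public_send(method_name) }
end

article = Article.new("Sphinx tips")
comment = Comment.new(article)
reply   = Reply.new(comment)

resolve_path(comment, [:article])         # one hop: the comment's article
resolve_path(reply, [:comment, :article]) # two hops: reply, comment, article
```

So for a model two associations away from the indexed one, the path would list both steps in order, such as [:comment, :article].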
If you wish to have your callbacks update Sphinx only in certain conditions, you can either define your own callback and then invoke Thinking Sphinx if and when needed:
after_save :populate_to_sphinx
# ...
def populate_to_sphinx
  return unless indexing?
  ThinkingSphinx::RealTime::Callbacks::RealTimeCallbacks.new(
    :article
  ).after_save self
end
Or supply a block when appending the callback, which returns an array of instances to process:
# if your model is app/models/article.rb:
ThinkingSphinx::Callbacks.append(self, :behaviours => [:real_time]) { |instance|
  instance.indexing? ? [instance] : []
}
If you’re combining custom indexing conditions with associated data, you’ll still need to supply the reference (as noted above), but the :path option is ignored; the block must return the appropriate instances instead:
# if your model is app/models/comment.rb
# and you want to process related articles:
ThinkingSphinx::Callbacks.append(self, :article, :behaviours => [:real_time]) { |instance|
  # instance is a comment
  instance.saved_changes.keys.include?("content") ? [instance.article] : []
}
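The contract for these blocks is simple: they receive the changed record and return an array of the records that should be sent to Sphinx, which may be empty. The following plain-Ruby sketch mimics that contract so it can be tried without Rails. The instances_to_process helper and the Post stand-in are hypothetical, and the no-block behaviour shown is an assumption for illustration only:

```ruby
# Stand-in for a model with an `indexing?` flag, purely for illustration.
Post = Struct.new(:title, :published) do
  def indexing?
    published
  end
end

# Hypothetical helper: without a block, the changed instance itself is
# processed; with a block, whatever array the block returns is processed.
def instances_to_process(instance, &block)
  block ? block.call(instance) : [instance]
end

only_when_indexing = ->(post) { post.indexing? ? [post] : [] }

draft = Post.new("Draft", false)
live  = Post.new("Live", true)

instances_to_process(draft, &only_when_indexing) # empty array, nothing to update
instances_to_process(live, &only_when_indexing)  # array containing the live post
```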
Index Names
When translating these index definitions into Sphinx configuration, Thinking Sphinx will use the model’s name for the index, and append a _core suffix to it. So, an index for Article will be named article_core.
If you’re using SQL-backed indices with deltas, then there is also a corresponding index with the _delta suffix - e.g. article_delta.
You can set different index names if you wish, using the :name option (as noted later in this documentation related to multiple indices for a single model). However, the suffixes will always be applied.
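The naming rules above can be summarised in a few lines of plain Ruby. This is only a sketch: sphinx_index_names is a hypothetical helper, not a Thinking Sphinx method, and it deliberately ignores namespaced references:

```ruby
# Hypothetical helper mirroring the naming rules described above:
# the base name is the :name option if given, otherwise the reference,
# and the _core (and, with deltas, _delta) suffixes are always applied.
def sphinx_index_names(reference, name: nil, deltas: false)
  base  = (name || reference).to_s
  names = ["#{base}_core"]
  names << "#{base}_delta" if deltas
  names
end

sphinx_index_names(:article)                      # => ["article_core"]
sphinx_index_names(:article, deltas: true)        # => ["article_core", "article_delta"]
sphinx_index_names(:article, name: "article_foo") # => ["article_foo_core"]
```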
Real-time Indices vs SQL-backed Indices
Thinking Sphinx allows for definitions of both real-time indices and SQL-backed indices. (In previous versions, only SQL-backed indices were available.)
Real-time indices are processed using Sphinx’s SphinxQL protocol, and thus are managed by Thinking Sphinx via Ruby, with the following advantages:
- Your fields and attributes reference Ruby methods.
- Real-time records can be updated directly, thus keeping your Sphinx data up-to-date almost immediately. This removes the need for delta indices.
The SQL-backed indices, however, have the potential to be much faster: the indexing process avoids the need to iterate through every record separately, and can use SQL joins to load association data directly.
You’ll need to consider which approach will work best for your application, but certainly if your data is changing frequently and you’d like it to be up-to-date, it’s worth starting with real-time indices.
The two approaches are distinguished by the :with option:
# for real-time indices:
ThinkingSphinx::Index.define :article, :with => :real_time do
  # ...
end
# for SQL-backed indices:
ThinkingSphinx::Index.define :article, :with => :active_record do
  # ...
end
Any differences in behaviour within an index definition are noted in the documentation below.
Fields
The indexes method adds one (or many) fields, by referencing the model’s method names (for real-time indices) or column names (for SQL-backed indices). You cannot reference model methods with SQL-backed indices - in this case, Sphinx talks directly to your database, and Ruby doesn’t get loaded.
indexes content
You don’t need to keep the same names as your model, though. Use the :as option to signify a new name. Field and attribute names must be unique, so when the same column is used for both a field and an attribute, giving at least one of them a custom name is essential.
indexes content, :as => :post
You can also flag fields as being sortable.
indexes subject, :sortable => true
Use the :facet option to signify a facet.
indexes authors.name, :as => :author, :facet => true
For real-time indices, you can drill down on methods that return single objects (such as belongs_to associations):
indexes author.name, :as => :author
If you want to collect multiple values into a single field, you will need a method in your model to aggregate this:
# in index:
indexes comment_texts
# in model:
def comment_texts
  comments.collect(&:text).join(' ')
end
With SQL-backed indices, if there are associations in your model you can drill down through them to access other columns. Explicit names with the :as option are required when doing this.
indexes author.name, :as => :author
indexes author.location, :as => :author_location
There may be times when a normal column value isn’t exactly what you’re after, so you can also define your indexes as raw SQL:
indexes "LOWER(first_name)", :as => :first_name, :sortable => true
Again, in this situation, an explicit name is required, and it only works with SQL-backed indices.
Attributes
The has method adds one (or many) attributes, and just like the indexes method, it requires references to the model’s methods (for real-time indices) or column names (for SQL-backed indices).
Real-time indices require the attribute types to be set manually, but SQL-backed indices have the ability to introspect on the database to determine types. Known types for real-time indices are: integer, boolean, string, timestamp, float, bigint and json.
# In a real-time index:
has author_id, :type => :integer
# In a SQL-backed index:
has author_id
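Because real-time attribute types are set by hand, a missing or misspelled type is an easy mistake to make. As a rough sketch, you could sanity-check a list of planned attributes against the known types before writing the index. The constant and helper names here are hypothetical and not part of Thinking Sphinx:

```ruby
# The real-time attribute types listed above.
REAL_TIME_ATTRIBUTE_TYPES = %i[
  integer boolean string timestamp float bigint json
].freeze

# Hypothetical helper: given attribute names mapped to their intended
# types, return the names whose type is missing or not recognised.
def attributes_with_bad_types(declarations)
  declarations
    .reject { |_name, type| REAL_TIME_ATTRIBUTE_TYPES.include?(type) }
    .keys
end

attributes_with_bad_types(
  :author_id  => :integer,
  :created_at => :timestamp,
  :rating     => :decimal, # not a known type
  :views      => nil       # type forgotten
)
# => [:rating, :views]
```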
The syntax is very similar to setting up fields. You can set custom names, and drill down into associations. You don’t ever need to label an attribute as :sortable though - in Sphinx, all attributes can be used for sorting.
You’ll also see below that multi-value attributes in real-time indices need the :multi option to be set.
Please note that Sphinx only supports multi-value attributes for 32-bit and 64-bit integers and timestamps. This applies to both SQL-backed and real-time indices. Strings are sadly not supported.
# In a real-time index:
has id, :as => :article_id, :type => :integer
has tag_ids, :multi => true
# In a SQL-backed index:
has id, :as => :article_id
has tag_ids, :as => :tag_ids
Again: fields and attributes cannot share names - they must all be unique. Use the :as option to provide custom names when a column is being used more than once.
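A quick way to spot clashes is to list the final names (after any :as renaming) of your fields and attributes and look for repeats. A minimal sketch follows, using a hypothetical duplicate_index_names helper that is not part of Thinking Sphinx:

```ruby
# Hypothetical helper: returns every name that appears more than once
# across the combined list of field and attribute names.
def duplicate_index_names(fields, attributes)
  (fields + attributes)
    .map(&:to_s)
    .group_by(&:itself)
    .select { |_name, occurrences| occurrences.size > 1 }
    .keys
end

fields     = [:subject, :content, :author]
attributes = [:author, :created_at]

duplicate_index_names(fields, attributes) # => ["author"]
```

In this example, renaming the attribute with :as (for instance to :author_id_value, a made-up name) would resolve the clash.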
Conditions and Groupings
Because SQL-backed indices are translated to SQL, you may want to add some custom conditions or groupings manually - and for that, you’ll want the where and group_by methods:
where "status = 'active'"
group_by "user_id"
For real-time indices you can define a custom scope to preload associations or apply custom conditions:
scope { Article.includes(:comments) }
This scope only comes into play when populating all records at once, not when single records are created or updated.
Sanitizing SQL
Note: this section applies only to SQL-backed indices.
As previously mentioned, your index definition is translated into SQL built from the fields, attributes, conditions and groupings. With this in mind, it may be useful to simplify the SQL fragments you write in your index.
One way would be to use something like ActiveRecord::Base.sanitize_sql to generate the required SQL for you. For example:
where sanitize_sql(["published", true])
This will produce the expected WHERE published = 1 for MySQL.
Index Options
Most Sphinx index configuration options can be set on a per-index basis using the set_property method within your index definition. Here’s an example for the min_infix_len option:
ThinkingSphinx::Index.define :article, :with => :active_record do
  # ...
  set_property :min_infix_len => 3
end
set_property takes a hash of options, and can also be called as many times as you’d like.
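In other words, one call with several options and several calls with one option each should leave you with the same settings. The toy class below illustrates that accumulation in plain Ruby. PropertyCollector is hypothetical, and the rule that a later call overrides an earlier value for the same key is an assumption of this sketch, not documented Thinking Sphinx behaviour:

```ruby
# Hypothetical stand-in for the part of an index definition that
# gathers set_property calls.
class PropertyCollector
  attr_reader :properties

  def initialize
    @properties = {}
  end

  # Accepts a hash of options; repeated calls accumulate.
  # Assumption: a later value for the same key wins.
  def set_property(options)
    @properties.merge!(options)
    self
  end
end

single = PropertyCollector.new
single.set_property :min_infix_len => 3, :html_strip => 1

repeated = PropertyCollector.new
repeated.set_property :min_infix_len => 3
repeated.set_property :html_strip => 1

single.properties == repeated.properties # => true
```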
Multiple Indices
If you want more than one index defined for a given model, just add further ThinkingSphinx::Index.define calls - but make sure you give every index a unique name, and have the same attributes defined in all indices.
ThinkingSphinx::Index.define(
  :article, :name => 'article_foo', :with => :active_record
) do
  # index definition
end
ThinkingSphinx::Index.define(
  :article, :name => 'article_bar', :with => :active_record
) do
  # index definition
end
These index definitions can be in the same file or separate files - it’s up to you.
Processing your Index
Once you’ve got your index set up just how you like it, you can run the rake task to get Sphinx to process the data.
rake ts:index
If you have made structural changes to your index (which is anything except adding new data into the database tables), you’ll need to stop Sphinx, re-process, and then re-start Sphinx - which can be done through a single rake call.
rake ts:rebuild
Index Guard Files
Any given SQL-backed index cannot be processed more than once concurrently. To avoid multiple indexing requests, Thinking Sphinx adds a lock file in the indices directory while indexing occurs, named ts-INDEXNAME.tmp. When you’re processing all indices in the one call (via either of the above rake tasks), then the lock file is instead named ts--all.tmp.
In rare cases (generally when the parent process crashes completely), orphan lock files may remain - these are safe to remove if no indexing is occurring. If you’re finding some of your indices aren’t being processed reliably, checking for these lock files is recommended.
These lock files are not created when processing real-time indices.
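If you suspect orphaned lock files, a short script can list them for you. The following is a sketch only: both helpers are hypothetical, and the directory passed in at the end is an assumed example path, so substitute wherever your Sphinx index files actually live:

```ruby
# Hypothetical helper: the lock file name for one index, or for an
# all-indices run when no name is given (per the naming described above).
def guard_file_name(index_name = nil)
  index_name ? "ts-#{index_name}.tmp" : "ts--all.tmp"
end

# Hypothetical helper: every lock file currently in the given directory.
def guard_files_in(indices_directory)
  Dir.glob(File.join(indices_directory, "ts-*.tmp")).sort
end

guard_file_name("article_core") # => "ts-article_core.tmp"
guard_file_name                 # => "ts--all.tmp"

# Assumed example path; returns an empty array if nothing matches.
guard_files_in("db/sphinx/development").each do |path|
  puts "Lock file present: #{path}"
end
```

Only delete a listed file once you are sure no indexing process is still running.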
You can disable the use of these lock files if you wish, by changing the guarding strategy:
# This can go in an initialiser:
ThinkingSphinx::Configuration.instance.guarding_strategy =
  ThinkingSphinx::Guard::None
Processing Approaches
By default, ts:index will instruct Sphinx to process all indices (and this has always been how Thinking Sphinx has behaved). This means that Sphinx will prepare all of the new data together before switching the daemon over to use it.
It is possible, though, to instead process each index one at a time (and thus, the daemon uses each index’s new data as that index’s processing is completed):
# This can go in an initialiser:
ThinkingSphinx::Configuration.instance.indexing_strategy =
  ThinkingSphinx::IndexingStrategies::OneAtATime
Should you wish to build your own indexing strategy, you can give ThinkingSphinx::Configuration.instance.indexing_strategy anything you like that responds to call, expects an array of index options, and yields index names. The implementations of the two built-in approaches can be found in the Thinking Sphinx source code.
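As a rough illustration of that contract, here is a hypothetical strategy that yields index names one at a time in alphabetical order. It is a sketch under two assumptions: the array it receives holds index names, and it is never empty. The built-in strategies may treat an empty array differently, which this sketch does not attempt to reproduce:

```ruby
# Hypothetical strategy: responds to `call`, takes an array, and yields
# each index name in turn, sorted alphabetically.
module AlphabeticalOneAtATime
  def self.call(index_names, &block)
    index_names.sort.each { |name| block.call(name) }
  end
end

# Exercising the strategy on its own, outside of Thinking Sphinx:
processed = []
AlphabeticalOneAtATime.call(%w[user_core article_core]) do |name|
  processed << name
end
processed # => ["article_core", "user_core"]
```

Such an object would be assigned to ThinkingSphinx::Configuration.instance.indexing_strategy in an initialiser, in the same way as the built-in strategy shown earlier.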
You can also process just specific indices via the INDEX_FILTER environment variable:
rake ts:index INDEX_FILTER=article_core,user_delta