Indexing your Models

Basic Indexing

All the code that sets up indices for your models goes in files in app/indices. The files themselves can be named however you like, but I generally opt for model_name_index.rb. At the very least, the file name should not be the same as your model’s file name. Here’s an example of what goes in the file:

ThinkingSphinx::Index.define :article, :with => :active_record do
  indexes subject, :sortable => true
  indexes content
  indexes author.name, :as => :author, :sortable => true

  has author_id, created_at, updated_at
end

You’ll notice the first argument is the model name downcased and as a symbol, and we are specifying the processor - :active_record - to use SQL-backed indices. Everything inside the block is just like previous versions of Thinking Sphinx, if you’re familiar with that (and if not, keep reading).

An equivalent index definition if you want to use real-time indices would be:

ThinkingSphinx::Index.define :article, :with => :real_time do
  indexes subject, :sortable => true
  indexes content
  indexes author.name, :as => :author, :sortable => true

  has author_id,  :type => :integer
  has created_at, :type => :timestamp
  has updated_at, :type => :timestamp
end

You’ll also want to add a real-time callback to your model.

When you’re defining indices for namespaced models, use a lowercase string with /’s for namespacing as the model reference:

# For a model named Blog::Article:
ThinkingSphinx::Index.define 'blog/article', :with => :active_record
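
If you're unsure what the reference for a given model should look like, the mapping can be sketched in plain Ruby. The index_reference helper below is hypothetical (it is not part of Thinking Sphinx), and it assumes multi-word class names follow the usual Rails underscore convention:

```ruby
# Hypothetical helper (not part of Thinking Sphinx): turns a class name
# into the lowercase, slash-separated reference described above.
def index_reference(class_name)
  class_name
    .gsub('::', '/')                    # namespaces become path segments
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')  # split camel-cased words
    .downcase
end

puts index_reference('Article')
puts index_reference('Blog::Article')
puts index_reference('Blog::FeaturedArticle')
```

In a Rails application you likely don't need a helper like this at all, as ActiveSupport's String#underscore performs the same conversion for these names.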

Thinking Sphinx v1/v2

Note: Index definitions for Thinking Sphinx versions before 3.0.0 went in the model files instead, inside a define_index call.

Don't forget to place this block below your associations and any accepts_nested_attributes_for calls, otherwise any references to them for fields and attributes will not work.

class Article < ActiveRecord::Base
  # ...

  define_index do
    indexes subject, :sortable => true
    indexes content
    indexes author(:name), :as => :author, :sortable => true

    has author_id, created_at, updated_at
  end

  # ...
end

Real-time Indices vs SQL-backed Indices

Thinking Sphinx allows for definitions of both real-time indices and SQL-backed indices. (In previous versions, only SQL-backed indices were available.)

Real-time indices are processed using Sphinx’s SphinxQL protocol, and thus are managed by Thinking Sphinx via Ruby, with the following advantages:

  • Your fields and attributes reference Ruby methods.
  • Real-time records can be updated directly, thus keeping your Sphinx data up-to-date almost immediately. This removes the need for delta indices.

The SQL-backed indices, however, have the potential to be much faster: the indexing process avoids the need to iterate through every record separately, and can use SQL joins to load association data directly.

You’ll need to consider which approach will work best for your application, but certainly if your data is changing frequently and you’d like it to be up-to-date, it’s worth starting with real-time indices.

The two approaches are distinguished by the :with option:

# for real-time indices:
ThinkingSphinx::Index.define :article, :with => :real_time do
# ...

# for SQL-backed indices:
ThinkingSphinx::Index.define :article, :with => :active_record do
# ...

Any differences in behaviour within an index definition are noted in the documentation below.

Fields

The indexes method adds one (or many) fields, by referencing the model’s method names (for real-time indices) or column names (for SQL-backed indices). You cannot reference model methods with SQL-backed indices - in this case, Sphinx talks directly to your database, and Ruby doesn’t get loaded.

indexes content

Thinking Sphinx v1/v2

Keep in mind that if you're referencing a column that shares its name with a core Ruby method (such as id, name or type) and you're using Thinking Sphinx v1 or v2, then you'll need to specify it using a symbol.

indexes :name

You don’t need to keep the same names as your model, though. Use the :as option to signify a new name. Field and attribute names must be unique across the index, so a custom name is essential whenever the same column is used as both a field and an attribute.

indexes content, :as => :post

You can also flag fields as being sortable.

indexes subject, :sortable => true

Use the :facet option to signify a facet.

indexes authors.name, :as => :author, :facet => true

For real-time indices, you can drill down on methods that return single objects (such as belongs_to associations):

indexes author.name, :as => :author

If you want to collect multiple values into a single field, you will need a method in your model to aggregate this:

# in index:
indexes comment_texts

# in model:
def comment_texts
  comments.collect(&:text).join(' ')
end
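
To see what such an aggregating method returns, here is a minimal sketch using plain Ruby objects as stand-ins for the ActiveRecord models (the Struct and the sample data are assumptions made for illustration). It also adds compact, which is not in the snippet above, so that comments without any text don't leave doubled spaces in the field:

```ruby
# Plain Ruby stand-ins for the ActiveRecord models (illustration only).
Comment = Struct.new(:text)

class Article
  attr_reader :comments

  def initialize(comments)
    @comments = comments
  end

  # Joins every comment's text into one string for the field.
  # compact drops nil texts before joining.
  def comment_texts
    comments.collect(&:text).compact.join(' ')
  end
end

article = Article.new(
  [Comment.new('Great post'), Comment.new(nil), Comment.new('Thanks')]
)
puts article.comment_texts
```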

With SQL-backed indices, if there are associations in your model you can drill down through them to access other columns. Explicit names with the :as option are required when doing this.

indexes author.name,     :as => :author
indexes author.location, :as => :author_location

There may be times when a normal column value isn’t exactly what you’re after, so you can also define your indexes as raw SQL:

indexes "LOWER(first_name)", :as => :first_name, :sortable => true

Again, in this situation, an explicit name is required, and it only works with SQL-backed indices.

Attributes

The has method adds one (or many) attributes, and just like the indexes method, it requires references to the model’s methods (for real-time indices) or column names (for SQL-backed indices).

Real-time indices require the attribute types to be set manually, but SQL-backed indices have the ability to introspect on the database to determine types. Known types for real-time indices are: integer, boolean, string, timestamp, float, bigint and json.

# In a real-time index:
has author_id, :type => :integer

# In a SQL-backed index:
has author_id

The syntax is very similar to setting up fields. You can set custom names, and drill down into associations. You don’t ever need to label an attribute as :sortable though - in Sphinx, all attributes can be used for sorting.

You’ll also see below that multi-value attributes in real-time indices need the :multi option to be set.

# In a real-time index:
has id, :as => :article_id, :type => :integer
has tag_ids, :type => :integer, :multi => true

# In a SQL-backed index:
has id, :as => :article_id
has tags.id, :as => :tag_ids

Again: fields and attributes cannot share names - they must all be unique. Use the :as option to provide custom names when a column is being used more than once.
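
If you'd like to check a planned set of names before writing the index, the uniqueness rule can be sketched in a few lines of plain Ruby. The duplicate_names helper is hypothetical and not something Thinking Sphinx provides; it just reports any name used more than once across fields and attributes:

```ruby
# Hypothetical check (not a Thinking Sphinx API): given the names used by
# fields and attributes, return every name that appears more than once.
def duplicate_names(fields, attributes)
  (fields + attributes)
    .group_by(&:itself)
    .select { |_name, uses| uses.size > 1 }
    .keys
end

fields     = [:subject, :content, :author]
attributes = [:author_id, :created_at, :subject]

# :subject is used as both a field and an attribute, so one of the two
# would need a custom name via the :as option.
p duplicate_names(fields, attributes)
```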

Conditions and Groupings

Because SQL-backed indices are translated to SQL, you may want to add some custom conditions or groupings manually - and for that, you’ll want the where and group_by methods:

where "status = 'active'"

group_by "user_id"

For real-time indices you can define a custom scope to preload associations or apply custom conditions:

scope { Article.includes(:comments) }

This scope only comes into play when populating all records at once, not when single records are created or updated.

Sanitizing SQL

Note: this section applies only to SQL-backed indices.

As previously mentioned, your index definition is translated into SQL built from its fields, attributes, conditions, groupings and so on. With this in mind, it can be useful to generate fragments of that SQL rather than writing them by hand.

One way is to use ActiveRecord::Base.sanitize_sql, which quotes values for your database adapter. For example:

where sanitize_sql(["published = ?", true])

This will produce the expected WHERE published = 1 for MySQL.

Index Options

Most Sphinx index configuration options can be set on a per-index basis using the set_property method within your index definition. Here’s an example for the min_infix_len option:

ThinkingSphinx::Index.define :article, :with => :active_record do
  # ...

  set_property :min_infix_len => 3
end

set_property takes a hash of options, but also can be called as many times as you’d like.

Multiple Indices

If you want more than one index defined for a given model, just add further ThinkingSphinx::Index.define calls - but make sure you give every index a unique name, and have the same attributes defined in all indices.

ThinkingSphinx::Index.define(
  :article, :name => 'article_foo', :with => :active_record
) do
  # index definition
end

ThinkingSphinx::Index.define(
  :article, :name => 'article_bar', :with => :active_record
) do
  # index definition
end

These index definitions can be in the same file or separate files - it’s up to you.

Thinking Sphinx v1/v2

Note: Defining multiple indices in Thinking Sphinx v2 or older is just a matter of using define_index multiple times, and supplying a unique name for each:

define_index 'article_foo' do
  # index definition
end

define_index 'article_bar' do
  # index definition
end

Real-time Callbacks

If you’re using real-time indices, you will want to add a callback to your model to ensure changes are reflected in Sphinx:

# if your model is app/models/article.rb:
after_save ThinkingSphinx::RealTime.callback_for(:article)

If you want changes to associated data to fire Sphinx updates for a related model, you can specify a method chain for the callback.

# in app/models/comment.rb, presuming a comment belongs_to :article
after_save ThinkingSphinx::RealTime.callback_for(
  :article, [:article]
)

The first argument, in all situations, should match the index definition’s first argument: a symbolised version of the model name. The second argument is a chain, and should be in the form of an array of symbols, each symbol representing methods called to get to the indexed object (so, an instance of the Article model in the example above).
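
The idea behind the chain can be sketched with plain Ruby objects. This is a simplification under assumed models (the Reply struct is made up for the example), not the library's actual implementation: each symbol is called in turn, starting from the record that was saved, until the indexed object is reached.

```ruby
# Plain Ruby stand-ins for the models (illustration only).
Article = Struct.new(:title)
Comment = Struct.new(:article)
Reply   = Struct.new(:comment)

# Sketch of resolving a chain: call each method in turn, starting from
# the saved instance. An empty chain returns the instance itself.
def resolve_chain(instance, chain)
  chain.inject(instance) { |object, method_name| object.public_send(method_name) }
end

article = Article.new('Indexing')
comment = Comment.new(article)
reply   = Reply.new(comment)

puts resolve_chain(comment, [:article]).title
puts resolve_chain(reply, [:comment, :article]).title
```

So a callback in a hypothetical Reply model, two steps away from the article, would use a chain of [:comment, :article].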

If you want your callbacks to update Sphinx only under certain conditions, you can define your own callback and invoke Thinking Sphinx if and when needed:

after_save :populate_to_sphinx

# ...

def populate_to_sphinx
  return unless indexing?

  ThinkingSphinx::RealTime::Callbacks::RealTimeCallbacks.new(
    :article
  ).after_save self
end

Alternatively, supply a block to the callback, which returns an array of the instances to process:

# if your model is app/models/article.rb:
after_save(
  ThinkingSphinx::RealTime.callback_for(:article) { |instance|
    instance.indexing? ? [instance] : []
  }
)
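
Both examples above rely on an indexing? method, which Thinking Sphinx does not supply: it's something you write yourself. As a rough sketch, assuming hypothetical published and archived attributes on the model, it might look like this, with a lambda standing in for the block passed to callback_for:

```ruby
# Plain Ruby stand-in for the model; `published` and `archived` are
# hypothetical attributes chosen for this example.
Article = Struct.new(:title, :published, :archived) do
  # Hypothetical predicate: only published, non-archived articles
  # should be sent to Sphinx.
  def indexing?
    published && !archived
  end
end

# Same shape as the callback block: return the instances to process,
# or an empty array to skip the update.
instances_to_index = lambda { |instance| instance.indexing? ? [instance] : [] }

live  = Article.new('Live', true, false)
draft = Article.new('Draft', false, false)

p instances_to_index.call(live).size
p instances_to_index.call(draft).size
```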

Processing your Index

Once you’ve got your index set up just how you like it, you can run the rake task to get Sphinx to process the data.

# If you're using SQL-backed indices:
rake ts:index

# If you're using real-time indices:
rake ts:generate

However, if you have made structural changes to your index (which is anything except adding new data into the database tables), you’ll need to stop Sphinx, re-process, and then re-start Sphinx - which can be done through a single rake call.

# If you're using SQL-backed indices:
rake ts:rebuild

# If you're using real-time indices:
rake ts:regenerate