Advanced Sphinx Configuration
Thinking Sphinx provides a good set of defaults out of the box, and for some people, those options are exactly what they need. Sometimes, though, you may need to customise how Sphinx works - and this can usually be done by adding some settings to a file named thinking_sphinx.yml in your config directory. Much like database.yml, settings are defined for each environment. Here’s an example:
development:
  mysql41: 9312
test:
  mysql41: 9313
production:
  mysql41: 9312
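To make the per-environment layout concrete, here is a minimal Ruby sketch that reads a file shaped like the one above and picks out the settings for a single environment. It is only an illustration, not Thinking Sphinx's own loading code: the CONFIG_YAML constant and the settings_for helper are hypothetical names introduced for this example.

```ruby
require "yaml"

# Hypothetical stand-in for the contents of config/thinking_sphinx.yml.
CONFIG_YAML = <<~YAML
  development:
    mysql41: 9312
  test:
    mysql41: 9313
  production:
    mysql41: 9312
YAML

# Returns the settings hash for one environment, or an empty hash
# when that environment has no entry in the file.
def settings_for(environment, yaml = CONFIG_YAML)
  all_settings = YAML.safe_load(yaml) || {}
  all_settings.fetch(environment, {})
end

puts settings_for("test")["mysql41"]
```

An environment that is missing from the file simply yields an empty hash here, which mirrors the idea that unspecified settings fall back to defaults.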
Now, Sphinx has a lot of different settings you can play with, and they’re pretty much all supported by Thinking Sphinx as well.
This page covers most of the more important settings, but there is also a full overview which is far more exhaustive.
Any paths provided in this configuration are expected to be absolute by default. If you want to provide relative paths, you will need to add absolute_paths: true to each relevant environment in config/thinking_sphinx.yml. These paths will then be translated to absolute paths from the root of the Rails application, within the generated Sphinx configuration.
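The translation from a relative path to an absolute one can be pictured with a small Ruby sketch. This is an assumption about the general approach, not the library's implementation: resolve_path is a hypothetical helper, and /var/www/my_app stands in for the root of a Rails application.

```ruby
require "pathname"

# Hypothetical helper: resolves a configured path against the application root.
# Absolute paths are returned unchanged; relative ones are joined to app_root.
def resolve_path(path, app_root)
  pathname = Pathname.new(path)
  return pathname.to_s if pathname.absolute?

  Pathname.new(app_root).join(pathname).cleanpath.to_s
end

puts resolve_path("shared/sphinx", "/var/www/my_app")
puts resolve_path("/srv/sphinx", "/var/www/my_app")
```

In a real application the root would presumably come from Rails.root rather than a literal string.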
Index File Location
You can customise the location of your Sphinx index files using the indices_location option.
Thinking Sphinx defaults to putting these files in db/sphinx/ENVIRONMENT - which makes life easier if you’re running integration tests with a live Sphinx setup. It’s worth keeping this in mind and ensuring your file locations are unique for each environment when they share a machine. Indeed, you’ll probably only want to change this value on your production machine.
production:
  indices_location: "/var/www/my_app/shared/sphinx"
# ... repeat for other environments if necessary
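If several environments share a machine, it can be worth checking that no two of them point at the same index directory. The following Ruby sketch shows one way to do that over an already-parsed settings hash. The SETTINGS data and the shared_index_locations helper are hypothetical and exist only for this illustration.

```ruby
# Hypothetical settings, shaped like a parsed config/thinking_sphinx.yml.
# Environments without indices_location use the per-environment default,
# so they are ignored by the check below.
SETTINGS = {
  "development" => { "indices_location" => "/tmp/sphinx" },
  "test"        => { "indices_location" => "/tmp/sphinx" },
  "production"  => { "indices_location" => "/var/www/my_app/shared/sphinx" },
  "staging"     => {}
}

# Returns a hash of index location => environment names for every
# explicitly configured location used by more than one environment.
def shared_index_locations(settings)
  settings
    .group_by { |_environment, options| options["indices_location"] }
    .select { |location, entries| location && entries.size > 1 }
    .transform_values { |entries| entries.map(&:first) }
end

p shared_index_locations(SETTINGS)
```

With the sample data, development and test are reported as sharing /tmp/sphinx, while production and staging are not flagged.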
Configuration, PID and Log File Locations
In the same vein as the above setting, you can nominate custom locations for your configuration, log and pid files.
Here’s some example syntax, using Thinking Sphinx’s defaults. Uppercase words are placeholders for system variables in the example only - you can’t actually use them in your YAML file.
development:
  configuration_file: "RAILS_ROOT/config/ENVIRONMENT.sphinx.conf"
  log: "RAILS_ROOT/log/searchd.log"
  query_log: "RAILS_ROOT/log/searchd.query.log"
  pid_file: "RAILS_ROOT/log/searchd.ENVIRONMENT.pid"
# ... repeat for other environments
Daemon Address and Port
If your Sphinx Daemon (also known as searchd) is running on a different machine or port, you’re going to need to tell Thinking Sphinx the critical details:
production:
  address: 10.0.0.4
  mysql41: 3200
# ... repeat for other environments if necessary
Hosting via a UNIX Socket
It is possible to run the Sphinx daemon on a UNIX socket. To do this, you will need to specify the path to the socket in your config/thinking_sphinx.yml file per environment:
production:
  socket: "RAILS_ROOT/tmp/production.sphinx"
If you specify mysql41 and/or address, then the daemon will also be available via TCP, but connections from Thinking Sphinx to Sphinx will still be made via the UNIX socket.
This feature is unfortunately not supported in JRuby (as there doesn’t seem to be a way to use UNIX sockets to connect to the MySQL protocol). It also means that Sphinx cannot be interacted with by other servers - only the machine with the Sphinx daemon. So, remote searches and deletions are not possible with this UNIX socket setting.
Indexer Memory Usage
Sphinx indexes your data using the indexer command-line tool. This tool runs with a fixed memory limit - defaulting to 64 megabytes. You can change this to something else if you’d like - the more memory, the faster your indexes will be processed.
development:
  mem_limit: 128M
# ... repeat for other environments
Word Stemming / Morphology
By default, Sphinx and Thinking Sphinx don’t get too smart about the words you’re searching for - they assume you know exactly what you’re after. However, sometimes you may want them to recognise that certain words share pretty much the same meaning. For example: think and thinking.
To enable this kind of behaviour, you need to specify a morphology (or stemming library) to Sphinx. It comes with English (stem_en) and Russian (stem_ru) built-in. You can also use other stemmers via Snowball’s libstemmer library. Have a read of Sphinx’s documentation for more clues.
development:
  morphology: stem_en
# ... repeat for other environments
Wildcard/Star Syntax
If you’re using Sphinx 2.2.2 or newer, wildcard syntax will be respected by default (though you’ll also need infixes or prefixes, as covered in the next section).
If you’re using an older version of Sphinx, then you can enable wildcard syntax using the enable_star option:
development:
  enable_star: true
# ... repeat for other environments
Infix and Prefix Indexing
If you want partial word matching, then you’re going to need to tell Sphinx to either index prefixes (the beginnings of words) or infixes (substrings of words). You cannot enable both at once, though.
You need to tell Sphinx what the minimum infix or prefix length is - the smaller the number is, the larger your index gets. If you set it to zero, though, that disables this feature. If you want absolutely everything, down to the last character, then set min_infix_len to 1 - but be prepared for the performance hit.
development:
  min_infix_len: 3
  # OR
  min_prefix_len: 3
# ... repeat for other environments
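To get a feel for why a smaller minimum length makes the index larger, the Ruby sketch below enumerates the prefixes and infixes of a single word for a given minimum length. It is a rough illustration of the growth in indexed fragments, not a description of how Sphinx stores them internally. The prefixes and infixes helpers are hypothetical names for this example.

```ruby
# All prefixes of `word` that are at least `min_len` characters long.
# A minimum below 1 is treated as "feature disabled" and yields nothing.
def prefixes(word, min_len)
  return [] if min_len < 1

  (min_len..word.length).map { |length| word[0, length] }
end

# All distinct substrings of `word` that are at least `min_len` characters long.
def infixes(word, min_len)
  return [] if min_len < 1

  (0...word.length).flat_map do |start|
    (min_len..(word.length - start)).map { |length| word[start, length] }
  end.uniq
end

puts prefixes("pancakes", 3).size # 6 prefixes: "pan" up to "pancakes"
puts infixes("pancakes", 3).size  # 21 distinct substrings of 3+ characters
```

Even for one eight-letter word, infixes produce several times more fragments than prefixes, and lowering the minimum length widens that gap further.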
Character Sets and Tables
By default, Sphinx and Thinking Sphinx use the UTF-8 character set. If you’re using an older version of Sphinx and prefer Sphinx’s inbuilt sbcs encoding, you’ll need to specify it via the charset_type setting:
development:
  charset_type: sbcs
# ... repeat for other environments
This changes the default character mappings, which you can read about in the Sphinx documentation. You can also set your own character mappings - which is recommended when using UTF-8 - to include other characters. James Healy has posted his extensive settings which cover most (if not all) accented characters. If you don’t want to click through, it’s all done via the charset_table setting:
development:
  charset_table: "0..9, A..Z->a..z, _, a..z, \
    U+410..U+42F->U+430..U+44F, U+430..U+44F"
# ... repeat for other environments
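A rule such as A..Z->a..z maps each character in the first range onto the character at the same offset in the second range. The Ruby sketch below expands one such rule into explicit character pairs, purely to illustrate the notation. It is not how Sphinx parses the table, it handles only the range-to-range form, and code_point and expand_range_mapping are hypothetical helpers.

```ruby
# Converts one endpoint of a range ("A" or "U+410") to a Unicode code point.
def code_point(token)
  token.start_with?("U+") ? token[2..-1].to_i(16) : token.ord
end

# Expands a "FROM..TO->FROM..TO" rule into a hash of source => target characters.
# Each source character maps to the target character at the same offset.
def expand_range_mapping(rule)
  source, target = rule.split("->").map(&:strip)
  source_first, source_last = source.split("..").map { |token| code_point(token) }
  target_first = code_point(target.split("..").first)

  (source_first..source_last).each_with_object({}) do |code, mapping|
    mapping[[code].pack("U")] = [code - source_first + target_first].pack("U")
  end
end

latin = expand_range_mapping("A..Z->a..z")
cyrillic = expand_range_mapping("U+410..U+42F->U+430..U+44F")

puts latin["Q"]
puts cyrillic["\u0410"]
```

The two rules shown fold uppercase Latin and uppercase Cyrillic letters onto their lowercase forms, which is what makes searches case-insensitive for those alphabets.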
Large Result Sets
To keep searching fast, Sphinx has a default limit of 1000 records being available via pagination, even if there are more matches than that. The reasons for this limit are discussed in the Sphinx documentation.
However, you can change this value. Firstly, in your config/thinking_sphinx.yml file, you need to set max_matches to your upper limit:
development:
  max_matches: 10000
# ... repeat for other environments
Don’t forget to reconfigure and restart your Sphinx daemon so it is aware of the change.
rake ts:stop ts:configure ts:start
And you also need to specify it in your searches (Sphinx doesn’t assume you want the higher number by default):
Article.search 'pancakes', :max_matches => 10_000
This does not mean you will get 10,000 results returned in one request, but you can paginate up to the ten-thousandth result. If you want them all at once (which will be slow, because you’re asking Rails to instantiate 10,000 records), use the per_page option.
Article.search 'pancakes',
  :max_matches => 10_000,
  :per_page    => 10_000
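The relationship between max_matches, per_page and the number of pages you can reach is simple arithmetic, sketched below in plain Ruby. The reachable_pages helper is hypothetical, and the figure of 25,000 total matches and the page size of 20 are made-up values for illustration.

```ruby
# Number of pages that can be paginated through when Sphinx makes at most
# `max_matches` of the `total_matches` results available.
def reachable_pages(total_matches, max_matches, per_page)
  available = [total_matches, max_matches].min
  (available + per_page - 1) / per_page # integer ceiling division
end

puts reachable_pages(25_000, 1_000, 20)      # 50 pages with the default limit
puts reachable_pages(25_000, 10_000, 20)     # 500 pages with max_matches raised
puts reachable_pages(25_000, 10_000, 10_000) # 1 page holding everything available
```

Raising max_matches changes how deep pagination can go, while per_page only changes how the available results are sliced.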
Word Forms, Exceptions, and Stop Words
To configure Thinking Sphinx for any of these features, simply specify the path to the appropriate file in your config/thinking_sphinx.yml file:
development:
  wordforms: "/full/path/to/wordforms.txt"
  exceptions: "/full/path/to/exceptions.txt"
  stopwords: "/full/path/to/stopwords.txt"
# ... repeat for other environments
For full details on what these features actually do, please refer to the Sphinx documentation.