Freelancing Gods 2013

30 May 2011

Searching with Sphinx on Heroku

Just over two weeks ago, I released Flying Sphinx – which provides Sphinx search capability for Heroku apps. I’ll talk more about how I built it and the challenges faced at some point, but right now I just want to introduce the service and how you may go about using it.

Why Sphinx?

Perhaps you’re not familiar with Sphinx and how it can be useful. It’s a full-text search tool – think of your own personal Google for within your website. It comes with two main moving parts: the indexer tool, which interprets and stores your search data (indices), and the searchd tool, which runs as a daemon, accepts search requests, and returns the most appropriate matches for a given search query.

In most situations, Sphinx is very fast at indexing your data, and connects directly to MySQL and PostgreSQL databases – so it’s quite a good fit for a lot of Rails applications.

Using Sphinx in Rails

I’ve written a gem, Thinking Sphinx, which integrates Sphinx neatly with ActiveRecord. It allows you to define indices in your models, and then use rake tasks to handle the processing of these indices, along with managing the searchd daemon.

If you want to install Sphinx, have a read through this guide from the Thinking Sphinx documentation – in most cases it should be reasonably painless.

Installing Thinking Sphinx in a Rails 3 application is quite simple – just add the gem to your Gemfile:

gem 'thinking-sphinx', '2.0.5'

For older versions of Rails, the Thinking Sphinx docs have more details.

I’m not going to get too caught up in the details of how to structure indices – this is also covered within the Thinking Sphinx documentation – but here’s a quick example, for a user account model:

class User < ActiveRecord::Base
  # ...
  
  define_index do
    indexes name, :sortable => true
    indexes location
    
    has admin, created_at
  end
  
  # ...
end

The indexes method defines fields – which are the textual data that people can search for. In this case, we’ve got the user names and locations covered. The has method is for attributes – which are used for filtering and sorting (fields can’t be used for sorting by default). The distinction of fields and attributes is quite important – make sure you understand the difference.
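To make the distinction concrete, here is a toy model in plain Ruby. It is a sketch only: TinyIndex and its methods are hypothetical names, the matching is a naive substring check, and none of this reflects how Sphinx stores or ranks data internally. It simply shows text being matched against fields, while filtering and sorting are restricted to attributes.

```ruby
# Toy illustration of fields versus attributes. TinyIndex is a hypothetical
# class for this example, not part of Sphinx or Thinking Sphinx.
class TinyIndex
  def initialize(fields:, attributes:)
    @fields     = fields      # textual columns that queries are matched against
    @attributes = attributes  # typed columns used for filtering and sorting
    @documents  = []
  end

  def add(record)
    @documents << record
  end

  # Match the query against field text only, then filter and sort using
  # attribute values only.
  def search(query, with: {}, order: nil)
    unknown = (with.keys + [order].compact) - @attributes
    raise ArgumentError, "not an attribute: #{unknown.join(', ')}" unless unknown.empty?

    matches = @documents.select do |doc|
      @fields.any? { |field| doc[field].to_s.downcase.include?(query.downcase) }
    end
    matches = matches.select { |doc| with.all? { |attr, value| doc[attr] == value } }
    order ? matches.sort_by { |doc| doc[order] } : matches
  end
end

index = TinyIndex.new(fields: [:name, :location], attributes: [:admin, :created_at])
index.add(name: 'Pat',      location: 'Melbourne', admin: true,  created_at: 2)
index.add(name: 'Patricia', location: 'Sydney',    admin: false, created_at: 1)

everyone = index.search('pat')                        # matched on the name field
admins   = index.search('pat', with: { admin: true }) # filtered on an attribute
oldest   = index.search('pat', order: :created_at)    # sorted on an attribute
```

Asking this toy to sort by :location raises an error, because location is only a field. In Thinking Sphinx itself, the :sortable => true option on a field is what makes sorting by that field possible.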

Now that we have our index defined, we can have Sphinx grab the required data from our database, which is done via a rake task:

rake ts:index

What Sphinx does here is grab all the required data from the database, interpret it and store it in a custom format. This allows Sphinx to be smarter about ranking search results and matching words within your fields.

Once that’s done, we next start up the Sphinx daemon:

rake ts:start

And now we can search! Either in script/console or in an appropriate action, just use the search method on your model:

User.search 'pat'

This returns the first page of users that match your search query. Sphinx always paginates results – though you can set the page size to be quite large if you wish – and Thinking Sphinx search results can be used by both WillPaginate and Kaminari pagination view helpers.

Instead of sorting by the most relevant matches, here are examples where we sort by name and created_at:

User.search 'pat', :order => :name
User.search 'pat', :order => :created_at

And if we only want admin users returned in our search, we can filter on the admin attribute:

User.search 'pat', :with => {:admin => true}

There are many more options for search calls – the documentation (yet again) covers most of them quite well.

One more thing to remember – if you change your index structures, or add/remove index definitions, then you should restart and reindex Sphinx. This can be done in a single rake task:

rake ts:rebuild

If you just want the latest data to be processed into your indices, there’s no need to restart Sphinx – a normal ts:index call is fine.

Using Thinking Sphinx with Heroku

Now that we’ve got a basic search setup working quite nicely, let’s get it sorted out on Heroku as well. Firstly, let’s add the flying-sphinx gem to our Gemfile (below our thinking-sphinx reference):

gem 'flying-sphinx', '0.5.0'

Get that change (along with your indexed model setup) deployed to Heroku, then inform Heroku you’d like to use the Flying Sphinx add-on (the entry level plan costs $12 USD per month):

heroku addons:add flying_sphinx:wooden

And finally, let’s get our data on the site indexed and the daemon running:

heroku rake fs:index
heroku rake fs:start

Note the fs prefix instead of the ts prefix in those rake calls – the normal Thinking Sphinx tasks are only useful on your local machine (or on servers that aren’t Heroku).

When you run those rake tasks, you will probably see the following output:

Sphinx cannot be found on your system. You may need to configure the
following settings in your config/sphinx.yml file:
  * bin_path
  * searchd_binary_name
  * indexer_binary_name

For more information, read the documentation:
http://freelancing-god.github.com/ts/en/advanced_config.html

This is because Thinking Sphinx doesn’t have access to Sphinx locally, and isn’t sure which version of Sphinx is available. To have these warnings silenced, you should add a config/sphinx.yml file to your project, with the version set for the production environment:

production:
  version: 1.10-beta

Push that change up to Heroku, and you won’t see the warnings again.
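If the per-environment layout of that file is unfamiliar, the following sketch shows how such a file is typically read: the top-level key is the environment name, and only the matching section applies. This is an illustration using Ruby's standard YAML library, not Thinking Sphinx's actual loading code; the sphinx_settings helper is a hypothetical name, and the file contents are inlined so the example runs on its own.

```ruby
require 'yaml'

# Contents of config/sphinx.yml, inlined for the purposes of this sketch.
SPHINX_YAML = <<~CONFIG
  production:
    version: 1.10-beta
CONFIG

# Hypothetical helper: returns the settings for one environment, or an
# empty hash when the file has no section for that environment.
def sphinx_settings(yaml_text, environment)
  YAML.safe_load(yaml_text).fetch(environment, {})
end

production  = sphinx_settings(SPHINX_YAML, 'production')
development = sphinx_settings(SPHINX_YAML, 'development')

production['version'].class  # String, thanks to the "-beta" suffix
```

One thing worth knowing: YAML reads a bare value such as 2.0 as a number rather than text, so quoting version values is a safe habit.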

For the more curious of you: the Sphinx daemon is located on a Flying Sphinx server, also located within the Amazon cloud (just like Heroku) to keep things fast and cheap. This is all managed by the flying-sphinx gem, though – you don’t need to worry about IP addresses or port numbers.

Also: the same rules apply with Flying Sphinx for modifying index structures or adding/removing index definitions – make sure you restart Sphinx so it’s aware of the changes:

heroku rake fs:rebuild

The final thing to note is that you’ll want the data in your Sphinx indices updated regularly – perhaps every day or every hour. This is best done on Heroku via their Cron add-on – since that’s just a rake task as well.

If you don’t have a cron task already, the following (perhaps in lib/tasks/cron.rake) will do the job:

desc 'Have cron index the Sphinx search indices'
task :cron => 'fs:index'

Otherwise, maybe something more like the following suits:

desc 'Have cron index the Sphinx search indices'
task :cron => 'fs:index' do
  # Other things to do when Cron comes calling
end
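If the ordering in that second form is unclear, this small standalone script shows it: Rake runs a task's prerequisites before the task's own block. The fs:index task here is only a stand-in that records a call (the real one comes from the flying-sphinx gem), and the calls array is a hypothetical name used for the demonstration.

```ruby
require 'rake'

calls = []

namespace :fs do
  # Stand-in for the real fs:index task, so this script runs anywhere.
  task(:index) { calls << 'fs:index' }
end

desc 'Have cron index the Sphinx search indices'
task :cron => 'fs:index' do
  # Runs only after every prerequisite has finished.
  calls << 'cron body'
end

Rake::Task[:cron].invoke
calls  # fs:index is recorded first, then the cron body
```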

If you’d like your search data to have your latest changes, then I recommend you read up on delta indexing – both for Thinking Sphinx and for Flying Sphinx.

Further Sources

Keep in mind this is just an introduction – the documentation for Thinking Sphinx is pretty good, and Flying Sphinx is improving regularly. There’s also the Thinking Sphinx google group and the Flying Sphinx support site if you have questions about either, along with numerous blog posts (though the older they are, the more likely they’ll be out of date). And finally – I’m always happy to answer questions about this, so don’t hesitate to get in touch.

12 Mar 2010

Using Thinking Sphinx with Cucumber

While I highly recommend you stub out your search requests in controller unit tests/specs, I also recommend you give your full stack a work-out when running search scenarios in Cucumber.

This has gotten a whole lot easier with the ThinkingSphinx::Test class and the integrated Cucumber support, but it’s still not perfect, mainly because generally everyone (correctly) keeps their database changes within a transaction. Sphinx talks to your database outside Rails’ context, and so can’t see anything, unless you turn these transactions off.

It’s not hard to turn transactions off in your features/support/env.rb file:

Cucumber::Rails::World.use_transactional_fixtures = false

But this makes Cucumber tests far more fragile, because either the scenarios must never conflict with one another, or the database needs to be cleaned before and after each scenario is run.

Pretty soon after I added the initial documentation for this, a few expert Cucumber users pointed out that you can flag certain feature files to be run without transactional fixtures, and the rest use the default:

@no-txn
Feature: Searching
  In order to find things as easily as possible
  As a user
  I want to search across all data on the site

This is a good step in the right direction, but it’s not perfect – you’ll still need to clean up the database. Writing steps to do that is easy enough:

Given /^a clean slate$/ do
  Object.subclasses_of(ActiveRecord::Base).each do |model|
    next unless model.table_exists?
    model.connection.execute "TRUNCATE TABLE `#{model.table_name}`"
  end
end

(You can also use Database Cleaner, as noted by Thilo in the comments).

But adding that to the start and end of every single scenario isn’t particularly DRY.

Thankfully, there are Before and After hooks in Cucumber, and they can be limited to scenarios marked with certain tags. Now we’re getting somewhere!

Before('@no-txn') do
  Given 'a clean slate'
end

After('@no-txn') do
  Given 'a clean slate'
end

And here’s a bonus step, to make indexing data a little easier:

Given /^the (\w+) indexes are processed$/ do |model|
  model = model.titleize.gsub(/\s/, '').constantize
  ThinkingSphinx::Test.index *model.sphinx_index_names
end
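The first line of that step turns the captured word into a model class using ActiveSupport's titleize and constantize. For anyone curious what that conversion does, here is a plain Ruby approximation. The model_from_step helper and the two empty classes are hypothetical stand-ins, and the sketch only handles simple underscored names.

```ruby
# Hypothetical classes standing in for ActiveRecord models.
class User; end
class BlogPost; end

# Plain Ruby approximation of `titleize.gsub(/\s/, '').constantize`:
# "blog_post" becomes "BlogPost", which is then looked up as a constant.
def model_from_step(name)
  class_name = name.split('_').map(&:capitalize).join
  Object.const_get(class_name)
end

model_from_step('user')       # resolves to the User class
model_from_step('blog_post')  # resolves to the BlogPost class
```

Because the step's pattern captures a single \w+ word, multi-word models need to be written with underscores in the feature file, for example "Given the blog_post indexes are processed".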

So, how do things look now? Well, you can write your features normally – just flag them with @no-txn, and your database will be cleaned up both before and after each scenario.

My current preferred approach is adding a file named features/support/sphinx.rb, containing this code:

require 'cucumber/thinking_sphinx/external_world'

Cucumber::ThinkingSphinx::ExternalWorld.new

Before('@no-txn') do
  Given 'a clean slate'
end

After('@no-txn') do
  Given 'a clean slate'
end

And I put the step definitions in either features/step_definitions/common_steps.rb or features/step_definitions/search_steps.rb.

So, now you have no excuse to not use Thinking Sphinx with your Cucumber suite. Get testing!

03 Jan 2010

A Month in the Life of Thinking Sphinx

It’s just over two months since I asked for – and received – support from the Ruby community to work on Thinking Sphinx for a month. A review of this would be a good idea, hey?

I’m going to write a separate blog post about how it all worked out, but here’s a long overview of the new features.

Internal Cucumber Cleanup

This one’s purely internal, but it’s worth knowing about.

Thinking Sphinx has a growing set of Cucumber features to test behaviour with a live Sphinx daemon. This has made the code far more reliable, but there was a lot of hackery to get it all working. I’ve cleaned this up considerably, and it is now re-usable for other gems that extend Thinking Sphinx.

External Delta Gems

Of course, it was my own re-use that was driving that need: I wanted to use it in gems for the delayed job and datetime delta approaches.

There was a clear need for removing these two pieces of functionality from Thinking Sphinx: to keep the main library as slim as possible, and to make better use of gem dependencies, allowing people to use whichever version of delayed job they like.

So, if you’ve not upgraded in a while, it’s worth re-reading the delta page of the documentation, which covers the new setup pretty well.

Testing Helpers

Internal testing is all very well, but what’s much more useful for everyone using Thinking Sphinx is the new testing class. This provides a clean, simple interface for processing indexes and starting the Sphinx daemon.

There’s also a Cucumber world that simplifies things even further – automatically starting and stopping Sphinx when your features are run. I’ve been using this myself in a project over the last few days, and I’m figuring out a neat workflow. More details soon, but in the meantime, have a read through the documentation.

No Vendored Code for Gems

One of the uglier parts of Thinking Sphinx is the fact that it vendors Riddle and AfterCommit (and for a while, Delayed Job), two essential libraries. This is not ideal at all, particularly when gem dependencies can manage this for you.

So, Thinking Sphinx no longer vendors these libraries if you install it as a gem – instead, the riddle and after_commit gems will get brought along for the ride.

The one catch is that they’re still vendored for plugin installations. I recommend people use Thinking Sphinx as a gem, but there are valid reasons for going down the plugin path.

Default Sphinx Scopes

Thanks to some hard work by Joost Hietbrink of the Netherlands, Thinking Sphinx now supports default sphinx scopes. All I had to do was merge this in – Joost was the first contributor to Thinking Sphinx (and there are now over 100!), so he knows the code pretty well.

In lieu of any real documentation, here’s a quick sample – define a scope normally, and then set it as the default:

class Article < ActiveRecord::Base
  # ...
  
  sphinx_scope(:by_date) {
    {:order => :created_at}
  }
  
  default_sphinx_scope :by_date
  
  # ...
end

Thread Safety

I’ve made some changes to improve the thread safety of Thinking Sphinx. It’s not perfect, but I think all critical areas are covered. Most of the dynamic behaviour occurs when the environment is initialised anyway.

That said, I’m anything but an expert in this area, so consider this a tentative feature.

Sphinx Select Option

Another community-sourced patch – this time from Andrei Bocan in Romania: if you’re using Sphinx 0.9.9, you can make use of its custom select statements:

Article.search 'pancakes',
  :sphinx_select => '*, @weight + karma AS superkarma'

This is much like the :select option in ActiveRecord – but make sure you use :sphinx_select (as the former gets passed through to ActiveRecord’s find calls).

Multiple Index Support

You can now have more than one index in a model. I don’t see this as being a widely needed feature, but there are definitely times when it comes in handy (such as having one index with stemming, and one without). The one thing to note is that all indexes after the first one need explicit names:

define_index 'stemmed' do
  # ...
end

You can then specify explicit indexes when searching:

Article.search 'pancakes',
  :index => 'stemmed_core'
Article.search 'pancakes',
  :index => 'article_core,stemmed_core'

Don’t forget that the default index name is the model’s name in lowercase and underscores. All index names are suffixed with _core, and if you’ve enabled deltas, then a matching index with the _delta suffix exists as well.
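As a quick way to check which names to pass to :index, this sketch reproduces the naming convention: the base name in lowercase with underscores, plus a _core name and, with deltas enabled, a _delta name. The helper is a hypothetical one written for this example, not part of Thinking Sphinx, and it ignores namespaced models.

```ruby
# Hypothetical helper mirroring the index naming convention.
def core_and_delta_names(name, deltas: false)
  # "BlogPost" becomes "blog_post"; "stemmed" is left as it is.
  base = name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  names = ["#{base}_core"]
  names << "#{base}_delta" if deltas
  names
end

default_names = core_and_delta_names('Article')
delta_names   = core_and_delta_names('BlogPost', deltas: true)

# Index names joined with commas, as used in the :index option.
both = (core_and_delta_names('Article') + core_and_delta_names('stemmed')).join(',')
```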

Building on from this, you can also now have indexes on STI subclasses when superclasses are already indexed.

While the commits to this feature are mine, I was reading code from a patch by Jonas von Andrian – so he’s the person to thank, not me.

Lazy Initialisation

Thinking Sphinx needs to know which models have indexes for searching and indexing – and so it would load every single model when the environment is initialised, just to figure this out. While this was necessary, it also is slow for applications with more than a handful of models… and in development mode, this hit happens on every single page load.

Now, though, Thinking Sphinx only runs this load request when you’re searching or indexing. While this doesn’t make a difference in production environments, it should make life on your workstations a little happier.

Lazy Index Definition

In a similar vein, anything within the define_index block is now evaluated when it’s needed. This means you can have it anywhere in your model files, whereas before, it had to appear after association definitions, else Thinking Sphinx would complain that they didn’t exist.
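The general technique is simple enough to sketch: store the block when define_index is called, and only run it the first time the index is needed. The code below is a minimal illustration under that assumption, not Thinking Sphinx's actual implementation. LazyIndex, index_definition and the has_many stand-in are hypothetical names.

```ruby
# Hypothetical mixin showing a stored block that is evaluated on first use.
module LazyIndex
  # Keep the block for later instead of running it straight away.
  def define_index(&block)
    @index_block = block
  end

  def associations
    @associations ||= []
  end

  # Stand-in for ActiveRecord's association macro.
  def has_many(name)
    associations << name
  end

  # Run the stored block the first time it is needed, then cache the result.
  def index_definition
    @index_definition ||= instance_eval(&@index_block)
  end
end

class Article
  extend LazyIndex

  # Appears before the association it depends on. Evaluated eagerly, this
  # block would raise, because :comments has not been declared yet.
  define_index do
    raise 'unknown association: comments' unless associations.include?(:comments)
    { :fields => [:title, :comments] }
  end

  has_many :comments
end

definition = Article.index_definition
```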

This feature actually introduced a fair few bugs, but (thanks to some patience from early adopters), it now runs smoothly. And if it doesn’t, you know where to find me.

Sphinx Auto-Version detection

Over the course of the month, Thinking Sphinx and Riddle went through some changes as to how they’d be required (depending on your version of Sphinx). First, there were separate gems for 0.9.8 and 0.9.9, and then single gems with different require statements. Neither of these approaches was ideal, which Ben Schwarz clarified for me.

So I spent a day or two working on a solution, and now Thinking Sphinx will automatically detect which version you have installed. You don’t need any version numbers in your require statements.
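One plausible way to do this kind of detection is to read the version out of the banner that the Sphinx binaries print. The sketch below assumes a banner beginning with "Sphinx" followed by a version number; both the banner format and the sphinx_version_from helper are assumptions for illustration, and the gem's real detection code may well differ.

```ruby
# Hypothetical helper: pulls a version such as "0.9.9" out of banner text.
# Returns nil when no version can be found.
def sphinx_version_from(banner)
  banner[/Sphinx (\d+(?:\.\d+)+)/, 1]
end

# Assumed sample banner text, not captured from a real installation.
found   = sphinx_version_from('Sphinx 0.9.9-release')
missing = sphinx_version_from('indexer: command not found')
```

On a machine with Sphinx installed, the banner text would come from running the indexer binary and capturing its output, which is why Sphinx needs to be present wherever the detection runs.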

The one catch with this is that you currently need Sphinx installed on every machine that needs to know about it, including web servers that talk to Sphinx on a separate server. There’s an issue logged for this, and I’ll be figuring out a solution soon.

Sphinx 0.9.9

This isn’t quite a Thinking Sphinx feature, but it’s worth noting that Sphinx 0.9.9 final release is now available. If you’re upgrading (which should be painless), the one thing to note is that the default port for Sphinx has changed from 3312 to 9312.

Upgrading

If you want to grab the latest and greatest Thinking Sphinx, then version 1.3.14 is what to install. And read the documentation on upgrading!

28 Oct 2009

Funding Thinking Sphinx

Update: I’ve now hit my target. If you want to donate more, I won’t turn you away, but perhaps you should send those funds to other worthy open source projects, or a local charity. A massive thank you to all who have pitched in to the pledgie, your generosity and support is amazing.

Over the past two years, Thinking Sphinx has grown massively – in lines of code, in the numbers of users, in complexity, in time required to support it. I’m regularly amazed and touched by the recommendations I see on Twitter, and the feedback I get in conversations. The fact that there have been almost one hundred contributors is staggering.

It’s not all fun and games, though… there are still plenty of features that can be added, and bugs to be fixed, and documentation to write. So, what I’d really like to do is spend November working close to full-time on just Thinking Sphinx. I have a long task list. All I need is a bit of financial help to cover living expenses.

I have an existing pledgie tied to the GitHub project, currently sitting on $600. If I can get another $2000, then I won’t have to worry at all about how I’m going to pay bills or rent for November. Even $1400 will make it viable for me, albeit maybe with some help from my savings.

If you or your workplace can make a donation, that would be very much appreciated. I’m happy to provide weekly updates on where things are at if people request it – but of course, watching the GitHub projects for Thinking Sphinx itself and the documentation site is the most reliable way to keep an eye on my progress.

I’m hoping to get Thinking Sphinx to a point where the documentation is by far the best place for support, and it’s only the really tricky problems (and bug reports) that end up in my inbox.

I want it to be a model Ruby library that doesn’t get in your way, is as fast as possible, and plays nicely with other libraries.

I want the testing suite to be rock-solid. I’ve been much better at writing tests first over the last six months, and using Cucumber has made the test suite so much more reliable, but there’s still some way to go.

This is not a rewrite – it’s polishing.

I’ve been toying with this idea for a while, and it’s time to have a stab at it. Hopefully you can provide some assistance to do this.

About Freelancing Gods

Freelancing Gods is written by Pat, who works on the web as a web developer in Melbourne, Australia, specialising in Ruby on Rails.

In case you're wondering what the likely content here will be about (besides code), keep in mind that Pat is passionate about the internet, music, politics, comedy, bringing people together, and making a difference. And pancakes.

His ego isn't as bad as you may think. Honest.

Here's more than you ever wanted to know.

All original content on this site is available through a Creative Commons by-nc-sa licence.