Freelancing Gods 2013

God
19 Oct 2011

A Sustainable Flying Sphinx?

In which I muse about what a sustainable web service could look like – but first, the backstory:

A year ago – almost to the day – I sat in a wine bar in Sydney’s Surry Hills with Steve Hopkins. I’d been thinking about how to get Sphinx working on Heroku, and ran him through the basic idea in my head of how it could work. His first question was “So, what are you working on tomorrow, then?”

By the end of the following day, I had some idea of how it would work. Over the next few months I had a proof of concept working, hit some walls, began again, and finally got to a point where I could launch an alpha release of Flying Sphinx.

In May, Flying Sphinx became available for all Heroku users – and earlier today (five months later), I received my monthly provider payment from Heroku, with the happy news that I’m now earning enough to cover all related ongoing expenses – things like AWS for the servers, Scalarium to manage them, and Tender for support.

Now, I’m not rolling in cash, and I’m certainly not earning enough through Flying Sphinx to pay rent, let alone be in a position to drop all client work and focus on Flying Sphinx full-time. That’s cool, either of those targets would be amazing.

And of course, money isn’t the be all and end all – even though this is a business, and I certainly don’t want to run at a loss. I want Flying Sphinx to be sustainable – in that it covers not only the hosting costs, but my time as well, along with supporting the broader system around it – code, people and beyond.

But what does a sustainable web service look like, particularly beyond the standard (outmoded) financial axis?

Sustainable Time

Firstly (and selfishly), it should cover the time spent maintaining and expanding the service. Flying Sphinx doesn’t use up a huge amount of my time right at the moment, but I’m definitely keen to improve a few things (in particular, offer Sphinx 2.0.1 alongside the existing 1.10-beta installation), and there is the occasional support query to deal with.

This one’s relatively straight-forward, really – I can track all time spent on Flying Sphinx and multiply that by a decent hourly rate. If it turns out I can’t manage all the work myself, then I pay someone else to help.

It certainly doesn’t look like I’m going to need anyone helping in the near future, mind you – nor am I drowning in support requests.

Sustainable Software

Ignoring the time I spend writing code for Flying Sphinx (as that’s covered by the previous section), pretty much every other piece of software involved with the service is open source. Front and centre among these is Sphinx itself.

I certainly don’t expect to be paid for my own open source contributions, but it certainly helps when there’s some funds trickling in to help motivate dealing with support questions, fixing bugs and adding features. It can also provide a stronger base to build a community as well.

With this in mind, I’m considering setting aside a percentage of any profit for Sphinx development – as any improvements to that help make Flying Sphinx a stronger offering.

(I could also cover my time spent on Thinking Sphinx either with a percentage cut – either way it would end up in my pocket though.)

Sustainable Hardware

This is where things get a little trickier – we’re not just dealing with bits and electrons, but also silicon and metals. The human race is pretty bad at weaning itself off of limited (as opposed to renewable) resources, and the hardware industry certainly is going to hit some limits in the future as certain metals become harder to source.

Of course, the servers use a lot of energy, so one thing I will be doing is offsetting the carbon. I’ve not yet figured out the best service to do this, but will start by looking at Brighter Planet.

From a social perspective, there’s also questions about how those resources are sourced. We should be considering the working conditions of where the metals are mined (and by whom), the people who are soldering the logic boards, and those who place the finished products into racks in data centres.

As an example, let’s look at Amazon. Given the recent issues raised with the conditions for staff in their warehouses, I think it’s fair to seek clarification on the situation of their web service colleagues. And what if there were significant ethical issues for using AWS? What then for Flying Sphinx, which runs EC2 instances and is an add-on for Heroku, a business built entirely on top of Amazon’s offerings?

I could at least use servers elsewhere – but that means bandwidth between servers and Heroku apps starts to cost money – and we introduce a step of latency into the service. Neither of those things are ideal. Or I could just say that I don’t want to support Amazon at all, and shut down Flying Sphinx, remove all my Heroku apps, and find some other hosting service to use.

Am I getting a little too carried away? Perhaps, but this is all hypothetical anyway. I’m guessing Amazon’s techs are looked after decently (though I’d love some confirmation on this), and am hoping the situation improves for their warehouse staff as well.

I am still searching for answers for what truly sustainable hardware – and moreso, sustainable web services – financially, socially, environmentally, and technically. What’s your take? What have I forgotten?

30 May 2011

Searching with Sphinx on Heroku

Just over two weeks ago, I released Flying Sphinx – which provides Sphinx search capability for Heroku apps. I’ll talk more about how I built it and the challenges faced at some point, but right now I just want to introduce the service and how you may go about using it.

Why Sphinx?

Perhaps you’re not familiar with Sphinx and how it can be useful. For those who are new to Sphinx, it’s a full-text search tool – think of your own personal Google for within your website. It comes with two main moving parts – the indexer tool for interpreting and storing your search data (indices), and the searchd tool, which runs as a daemon accepting search requests, and returns the most appropriate matches for a given search query.

In most situations, Sphinx is very fast at indexing your data, and connects directly to MySQL and PostgreSQL databases – so it’s quite a good fit for a lot of Rails applications.

Using Sphinx in Rails

I’ve written a gem, Thinking Sphinx, which integrates Sphinx neatly with ActiveRecord. It allows you to define indices in your models, and then use rake tasks to handle the processing of these indices, along with managing the searchd daemon.

If you want to install Sphinx, have a read through of this guide from the Thinking Sphinx documentation – in most cases it should be reasonably painless.

Installing Thinking Sphinx in a Rails 3 application is quite simple – just add the gem to your Gemfile:

gem 'thinking-sphinx', '2.0.5'

For older versions of Rails, the Thinking Sphinx docs have more details.

I’m not going to get too caught up in the details of how to structure indices – this is also covered within the Thinking Sphinx documentation – but here’s a quick example, for user account:

class User < ActiveRecord::Base
  # ...
  
  define_index do
    indexes name, :sortable => true
    indexes location
    
    has admin, created_at
  end
  
  # ...
end

The indexes method defines fields – which are the textual data that people can search for. In this case, we’ve got the user names and locations covered. The has method is for attributes – which are used for filtering and sorting (fields can’t be used for sorting by default). The distinction of fields and attributes is quite important – make sure you understand the difference.

Now that we have our index defined, we can have Sphinx grab the required data from our database, which is done via a rake task:

rake ts:index

What Sphinx does here is grab all the required data from the database, inteprets it and stores it in a custom format. This allows Sphinx to be smarter about ranking search results and matching words within your fields.

Once that’s done, we next start up the Sphinx daemon:

rake ts:start

And now we can search! Either in script/console or in an appropriate action, just use the search method on your model:

User.search 'pat'

This returns the first page of users that match your search query. Sphinx always paginates results – though you can set the page size to be quite large if you wish – and Thinking Sphinx search results can be used by both WillPaginate and Kaminari pagination view helpers.

Instead of sorting by the most relevant matches, here’s examples where we sort by name and created_at:

User.search 'pat', :order => :name
User.search 'pat', :order => :created_at

And if we only want admin users returned in our search, we can filter on the admin attribute:

User.search 'pat', :with => {:admin => true}

There’s many more options for search calls – the documentation (yet again) covers most of them quite well.

One more thing to remember – if you change your index structures, or add/remove index defintions, then you should restart and reindex Sphinx. This can be done in a single rake task:

rake ts:rebuild

If you just want the latest data to be processed into your indices, there’s no need to restart Sphinx – a normal ts:index call is fine.

Using Thinking Sphinx with Heroku

Now that we’ve got a basic search setup working quite nicely, let’s get it sorted out on Heroku as well. Firstly, let’s add the flying-sphinx gem to our Gemfile (below our thinking-sphinx reference):

gem 'flying-sphinx', '0.5.0'

Get that change (along with your indexed model setup) deployed to Heroku, then inform Heroku you’d like to use the Flying Sphinx add-on (the entry level plan costs $12 USD per month):

heroku addons:add flying_sphinx:wooden

And finally, let’s get our data on the site indexed and the daemon running:

heroku rake fs:index
heroku rake fs:start

Note the fs prefix instead of the ts prefix in those rake calls – the normal Thinking Sphinx tasks are only useful on your local machine (or on servers that aren’t Heroku).

When you run those rake tasks, you will probably see the following output:

Sphinx cannot be found on your system. You may need to configure the
following settings in your config/sphinx.yml file:
  * bin_path
  * searchd_binary_name
  * indexer_binary_name

For more information, read the documentation:
http://freelancing-god.github.com/ts/en/advanced_config.html

This is because Thinking Sphinx doesn’t have access to Sphinx locally, and isn’t sure which version of Sphinx is available. To have these warnings silenced, you should add a config/sphinx.yml file to your project, with the version set for the production environment:

production:
  version: 1.10-beta

Push that change up to Heroku, and you won’t see the warnings again.

For the more curious of you: the Sphinx daemon is located on a Flying Sphinx server, also located within the Amazon cloud (just like Heroku) to keep things fast and cheap. This is all managed by the flying-sphinx gem, though – you don’t need to worry about IP addresses or port numbers.

Also: the same rules apply with Flying Sphinx for modifying index structures or adding/removing index definitions – make sure you restart Sphinx so it’s aware of the changes:

heroku rake fs:rebuild

The final thing to note is that you’ll want the data in your Sphinx indices updated regularly – perhaps every day or every hour. This is best done on Heroku via their Cron add-on – since that’s just a rake task as well.

If you don’t have a cron task already, the following (perhaps in lib/tasks/cron.rake) will do the job:

desc 'Have cron index the Sphinx search indices'
task :cron => 'fs:index'

Otherwise, maybe something more like the following suits:

desc 'Have cron index the Sphinx search indices'
task :cron => 'fs:index' do
  # Other things to do when Cron comes calling
end

If you’d like your search data to have your latest changes, then I recommend you read up on delta indexing – both for Thinking Sphinx and for Flying Sphinx.

Further Sources

Keep in mind this is just an introduction – the documentation for Thinking Sphinx is pretty good, and Flying Sphinx is improving regularly. There’s also the Thinking Sphinx google group and the Flying Sphinx support site if you have questions about either, along with numerous blog posts (though the older they are, the more likely they’ll be out of date). And finally – I’m always happy to answer questions about this, so don’t hesitate to get in touch.

12 Mar 2010

Using Thinking Sphinx with Cucumber

While I highly recommend you stub out your search requests in controller unit tests/specs, I also recommend you give your full stack a work-out when running search scenarios in Cucumber.

This has gotten a whole lot easier with the ThinkingSphinx::Test class and the integrated Cucumber support, but it’s still not perfect, mainly because generally everyone (correctly) keeps their database changes within a transaction. Sphinx talks to your database outside Rails’ context, and so can’t see anything, unless you turn these transactions off.

It’s not hard to turn transactions off in your features/support/env.rb file:

Cucumber::Rails::World.use_transactional_fixtures = false

But this makes Cucumber tests far more fragile, because either each scenario can’t conflict with each other, or the database needs to be cleaned before and after each scenario is run.

Pretty soon after I added the inital documentation for this, a few expert Cucumber users pointed out that you can flag certain feature files to be run without transactional fixtures, and the rest use the default:

@no-txn
Feature: Searching
  In order to find things as easily as possible
  As a user
  I want to search across all data on the site

This is a good step in the right direction, but it’s not perfect – you’ll still need to clean up the database. Writing steps to do that is easy enough:

Given /^a clean slate$/ do
  Object.subclasses_of(ActiveRecord::Base).each do |model|
    next unless model.table_exists?
    model.connection.execute "TRUNCATE TABLE `#{model.table_name}`"
  end
end

(You can also use Database Cleaner, as noted by Thilo in the comments).

But adding that to the start and end of every single scenario isn’t particularly DRY.

Thankfully, there’s Before and After hooks in Cucumber, and they can be limited to scenarios marked with certain tags. Now we’re getting somewhere!

Before('@no-txn') do
  Given 'a clean slate'
end

After('@no-txn') do
  Given 'a clean slate'
end

And here’s a bonus step, to make indexing data a little easier:

Given /^the (\w+) indexes are processed$/ do |model|
  model = model.titleize.gsub(/\s/, '').constantize
  ThinkingSphinx::Test.index *model.sphinx_index_names
end

So, how do things look now? Well, you can write your features normally – just flag them with no-txn, and your database will be cleaned up both before and after each scenario.

My current preferred approach is adding a file named features/support/sphinx.rb, containing this code:

require 'cucumber/thinking_sphinx/external_world'

Cucumber::ThinkingSphinx::ExternalWorld.new

Before('@no-txn') do
  Given 'a clean slate'
end

After('@no-txn') do
  Given 'a clean slate'
end

And I put the step definitions in either features/step_definitions/common_steps.rb or features/step_definitions/search_steps.rb.

So, now you have no excuse to not use Thinking Sphinx with your Cucumber suite. Get testing!

03 Jan 2010

A Month in the Life of Thinking Sphinx

It’s just over two months since I asked for – and received – support from the Ruby community to work on Thinking Sphinx for a month. A review of this would be a good idea, hey?

I’m going to write a separate blog post about how it all worked out, but here’s a long overview of the new features.

Internal Cucumber Cleanup

This one’s purely internal, but it’s worth knowing about.

Thinking Sphinx has a growing set of Cucumber features to test behaviour with a live Sphinx daemon. This has made the code far more reliable, but there was a lot of hackery to get it all working. I’ve cleaned this up considerably, and it is now re-usable for other gems that extend Thinking Sphinx.

External Delta Gems

Of course, it was my own re-use that was driving that need: I wanted to use it in gems for the delayed job and datetime delta approaches.

There was a clear need for removing these two pieces of functionality from Thinking Sphinx: to keep the main library as slim as possible, and to make better use of gem dependencies, allowing people to use whichever version of delayed job they like.

So, if you’ve not upgraded in a while, it’s worth re-reading the delta page of the documentation, which covers the new setup pretty well.

Testing Helpers

Internal testing is all very well, but what’s much more useful for everyone using Thinking Sphinx is the new testing class. This provides a clean, simple interface for processing indexes and starting the Sphinx daemon.

There’s also a Cucumber world that simplifies things even further – automatically starting and stopping Sphinx when your features are run. I’ve been using this myself in a project over the last few days, and I’m figuring out a neat workflow. More details soon, but in the meantime, have a read through the documentation.

No Vendored Code for Gems

One of the uglier parts of Thinking Sphinx is the fact that it vendors Riddle and AfterCommit (and for a while, Delayed Job), two essential libraries. This is not ideal at all, particularly when gem dependencies can manage this for you.

So, Thinking Sphinx no longer vendors these libraries if you install it as a gem – instead, the riddle and after_commit gems will get brought along for the ride.

The one catch is that they’re still vendored for plugin installations. I recommend people use Thinking Sphinx as a gem, but there are valid reasons for going down the plugin path.

Default Sphinx Scopes

Thanks to some hard work by Joost Hietbrink of the Netherlands, Thinking Sphinx now supports default sphinx scopes. All I had to do was merge this in – Joost was the first contributor to Thinking Sphinx (and there’s now over 100!), so he knows the code pretty well.

In lieu of any real documentation, here’s a quick sample – define a scope normally, and then set it as the default:

class Article < ActiveRecord::Base
  # ...
  
  sphinx_scope(:by_date) {
    {:order => :created_at_}
  }
  
  default_sphinx_scope :by_date
  
  # ...
end

Thread Safety

I’ve made some changes to improve the thread safety of Thinking Sphinx. It’s not perfect, but I think all critical areas are covered. Most of the dynamic behaviour occurs when the environment is initialised anyway.

That said, I’m anything but an expert in this area, so consider this a tentative feature.

Sphinx Select Option

Another community-sourced patch – this time from Andrei Bocan in Romania: if you’re using Sphinx 0.9.9, you can make use of its custom select statements:

Article.search 'pancakes',
  :sphinx_select => '*, @weight + karma AS superkarma'

This is much like the :select option in ActiveRecord – but make sure you use :sphinx_select (as the former gets passed through to ActiveRecord’s find calls).

Multiple Index Support

You can now have more than one index in a model. I don’t see this as being a widely needed feature, but there’s definitely times when it comes in handy (such as having one index with stemming, and one without). The one thing to note is that all indexes after the first one need explicit names:

define_index 'stemmed' do
  # ...
end

You can then specify explicit indexes when searching:

Article.search 'pancakes',
  :index => 'stemmed_core'
Article.search 'pancakes',
  :index => 'article_core,stemmed_core'

Don’t forget that the default index name is the model’s name in lowercase and underscores. All indexes are prefixed with _core, and if you’ve enabled deltas, then a matching index with the _delta suffix exists as well.

Building on from this, you can also now have indexes on STI subclasses when superclasses are already indexed.

While the commits to this feature are mine, I was reading code from a patch by Jonas von Andrian – so he’s the person to thank, not me.

Lazy Initialisation

Thinking Sphinx needs to know which models have indexes for searching and indexing – and so it would load every single model when the environment is initialised, just to figure this out. While this was necessary, it also is slow for applications with more than a handful of models… and in development mode, this hit happens on every single page load.

Now, though, Thinking Sphinx only runs this load request when you’re searching or indexing. While this doesn’t make a difference in production environments, it should make life on your workstations a little happier.

Lazy Index Definition

In a similar vein, anything within the define_index block is now evaluated when it’s needed. This means you can have it anywhere in your model files, whereas before, it had to appear after association definitions, else Thinking Sphinx would complain that they didn’t exist.

This feature actually introduced a fair few bugs, but (thanks to some patience from early adopters), it now runs smoothly. And if it doesn’t, you know where to find me.

Sphinx Auto-Version detection

Over the course of the month, Thinking Sphinx and Riddle went through some changes as to how they’d be required (depending on your version of Sphinx). First, there was separate gems for 0.9.8 and 0.9.9, and then single gems with different require statements. Neither of these approaches were ideal, which Ben Schwarz clarified for me.

So I spent a day or two working on a solution, and now Thinking Sphinx will automatically detect which version you have installed. You don’t need any version numbers in your require statements.

The one catch with this is that you currently need Sphinx installed on every machine that needs to know about it, including web servers that talk to Sphinx on a separate server. There’s an issue logged for this, and I’ll be figuring out a solution soon.

Sphinx 0.9.9

This isn’t quite a Thinking Sphinx feature, but it’s worth noting that Sphinx 0.9.9 final release is now available. If you’re upgrading (which should be painless), the one thing to note is that the default port for Sphinx has changed from 3312 to 9312.

Upgrading

If you want to grab the latest and greatest Thinking Sphinx, then version 1.3.14 is what to install. And read the documentation on upgrading!

28 Oct 2009

Funding Thinking Sphinx

Update: I’ve now hit my target. If you want to donate more, I won’t turn you away, but perhaps you should send those funds to other worthy open source projects, or a local charity. A massive thank you to all who have pitched in to the pledgie, your generosity and support is amazing.

Over the past two years, Thinking Sphinx has grown massively – in lines of code, in the numbers of users, in complexity, in time required to support it. I’m regularly amazed and touched by the recommendations I see on Twitter, and the feedback I get in conversations. The fact that there’s been almost one hundred contributors is staggering.

It’s not all fun and games, though… there’s still plenty of features that can be added, and bugs to be fixed, and documentation to write. So, what I’d really like to do is spend November working close to full-time on just Thinking Sphinx. I have a long task list. All I need is a bit of financial help to cover living expenses.

I have an existing pledgie tied to the GitHub project, currently sitting on $600. If I can get another $2000, then I won’t have to worry at all about how I’m going to pay bills or rent for November. Even $1400 will make it viable for me, albeit maybe with some help from my savings.

If you or your workplace can make a donation, that would be very much appreciated. I’m happy to provide weekly updates on where things are at if people request it – but of course, watching the GitHub projects for Thinking Sphinx itself and the documentation site is the most reliable way to keep an eye on my progress.

I’m hoping to get Thinking Sphinx to a point where the documentation is by far the best place for support, and it’s only the really tricky problems (and bug reports) that end up in my inbox.

I want it to be a model Ruby library that doesn’t get in your way, is as fast as possible, and plays nicely with other libraries.

I want the testing suite to be rock-solid. I’ve been much better at writing tests first over the last six months, and using Cucumber has made the test suite so much more reliable, but there’s still some way to go.

This is not a rewrite – it’s polishing.

I’ve been toying with this idea for a while, and it’s time to have a stab at it. Hopefully you can provide some assistance to do this.

22 Apr 2009

A Visit to Vegas

In case you’ve not booked your ticket yet, but were considering coming and needed one more reason: I’ll be giving a tutorial on Sphinx at RailsConf in Las Vegas next month. To get 15% discount, use the code ‘RC09FOS’. (O’Reilly seem to hand out codes everywhere. If you paid full price this time, keep that in mind for next year.)

If you will be there, I’m more than happy to have a chat – about Sphinx or anything else – so let me know if you want to meet up.

Crowdsourced Research

For those of you who know Sphinx and Thinking Sphinx already, I’d love to hear about the things you found a bit difficult when learning. What are the topics I should make sure I cover in my tutorial, that are maybe lacking in documentation?

09 Jan 2009

Link: Thinking Sphinx in Arabic/Unicode | ExpressionLab

"here is what to do to support Arabic (Unicode) search."

06 Jan 2009

Thinking Sphinx Delta Changes

There’s been a bit of changes under the hood with Thinking Sphinx lately, and some of the more recent commits are pretty useful.

Small Stuff

First off, something neat but minor – you can now use decimal, date and timestamp columns as attributes – the plugin automatically maps those to float and datetime types as needed.

There’s also now a cucumber-driven set of feature tests, which can run on MySQL and PostgreSQL. While that’s not important to most users, it makes it much less likely that I’ll break things. It’s also useful for the numerous contributors – just over 50 people as of this week! You all rock!

New Delta Possibilities

The major changes are around delta indexing, though. As well as the default delta column approach, there’s now two other methods of getting your changes into Sphinx. The first, requested by some Ultrasphinx users, and heavily influenced by a fork by Ed Hickey, is datetime-driven deltas. You can use a datetime column (the default is updated_at), and then run the thinking_sphinx:index:delta rake task on a regular basis to load recent changes into Sphinx.

Your define_index block would look something like the following:

define_index do
  # ... field and attribute definitions
  
  set_property :delta => :datetime, :threshold => 1.day
end

If you want to use a column other than updated_at, set it with the :delta_column option.

The above situation is if you’re running the rake task once a day. The more often you run it, the lower you can set your threshold. This is a bit different to the normal delta approach, as changes will not appear in search results straight away – only whenever the rake task is run.

Delayed Reaction

One of the biggest complaints with the default delta structure is that it didn’t scale. Your delta index got larger and larger every time records were updated, and that meant each change got slower and slower, because the indexing time increased. When running multiple servers, you could get a few indexer processes running at once. That ain’t good.

So now, we have delayed deltas, using the delayed_job plugin. You’ll need to have the job queue being processed (via the thinking_sphinx:delayed_delta rake task), but everything is pushed off into that, instead of overloading your web server. It means the changes take slightly longer to get into Sphinx, but that’s almost certainly not going to be a problem.

Firstly, you’ll need to create the delayed_jobs table (see the delayed_job readme for example code), and then change your define_index block so it looks something like this:

define_index do
  # ... field and attribute definitions
  
  set_property :delta => :delayed
end

Riddle Update

As part of the restructuring over the last couple of months, I’ve also added some additional code to Riddle, my Ruby API for Sphinx. It now has objects to represent all of the configuration elements of Sphinx (ie: settings for sources, indexes, indexer and searchd), and can generate the configuration file for you. This means you don’t need to worry about doing text manipulation, just do everything in neat, clean Ruby.

Documentation on this is non-existent, mind you, but the source shouldn’t be too hard to grok. I also need to update Thinking Sphinx’s documentation to cover the delta changes – for now, this blog post will have to do. If you get stuck, check out the Google Group.

Sphinx 0.9.9

One more thing: Thinking Sphinx and Riddle now both have Sphinx 0.9.9 branches – not merged into master, as most people are still using Sphinx 0.9.8, but you can find both code sets on GitHub.

24 Oct 2008

Thinking Sphinx PDF at Peepcode

A quick note to let anyone using (or interested in using) Sphinx and/or Thinking Sphinx that my Peepcode PDF has just been published, and contains a wealth of information on how to mix Sphinx with ActiveRecord (via Rails or Merb).

It’s been great working with Geoffrey Grosenbach to get this written up, and I’m pretty stoked to see the final results – hopefully others will enjoy it as well.

Also, a massive thank you to all the contributors to Thinking Sphinx – it wouldn’t be quite so cool if it wasn’t for all the patches (facilitated by GitHub’s forking).

12 Jul 2008

Link: Jedlinski.pl devblog » Thinking Sphinx as Windows service

"Thinking Sphinx for Windows - a batch of simple rake tasks dedicated for Windows users."

12 Jun 2008

A Concise Guide to Using Thinking Sphinx

Okay, it’s well past time for the companion piece to my Sphinx Primer – let’s go through the basic process of using Thinking Sphinx with Rails.

Just to recap: Sphinx is a search engine that indexes data, and then you can query it with search terms to find out which documents are relevant. Why do you want to use it with Rails? Because it saves having to write messy SQL, and it’s so damn fast.

(If you’re getting a feeling of deja-vu, then it’s probably because you’ve read an old post on this blog that dealt with an old version of Thinking Sphinx. I’ve had a few requests for an updated article, so this is it.)

Installation

So: first step is to install Sphinx. This may be tricky on some systems – but I’ve never had a problem with it with Mac OS X or Ubuntu. My process is thus:

curl -O http://sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz
tar zxvf sphinx-0.9.8-rc2.tar.gz
cd sphinx-0.9.8-rc2
./configure
make
sudo make install

If you’re using Windows, you can just grab the binaries.

Once that’s taken care of, you then want to take your Rails app, and install the plugin. If you’re running edge or 2.1, this is a piece of cake:

script/plugin install git://github.com/freelancing-god/thinking-sphinx.git

Otherwise, you’ve got a couple of options. The first is, if you have git installed, just clone to your vendor/plugins directory:

git clone git://github.com/freelancing-god/thinking-sphinx.git
  vendor/plugins/thinking-sphinx

If you’re not yet using git, then the easiest way is to download the tar file of the code. Try the following:

curl -L http://github.com/freelancing-god/thinking-sphinx/tarball/master
  -o thinking-sphinx.tar.gz
tar -xvf thinking-sphinx.tar.gz -C vendor/plugins
mv vendor/plugins/freelancing-god-thinking-sphinx* vendor/plugins/thinking-sphinx

Oh, and it’s worth noting: if you’re not using MySQL or PostgreSQL, then you’re out of luck – Sphinx doesn’t talk to any other relational databases.

Configuration

Next step: let’s get a model or two indexed. It might be worth refreshing your memory on what fields and attributes are for – can I recommend my Sphinx article (because I’m not at all biased)?

Ok, now let’s work with a simple Person model, and add a few fields:

class Person < ActiveRecord::Base
  define_index do
    indexes [first_name, last_name], :as => :name
    indexes location
  end
end

Nothing too scary – we’ve added two fields. The first is the first and last names of a person combined to one field with the alias ‘name’. The second is simply location.

Adding attributes is just as easy:

define_index do
  # ...
  
  has birthday
end

This attribute is the datetime value birthday (so you can now sort and filter your results by birthdays).

Managing Sphinx

We’ve set up a basic index – now what? We tell Sphinx to index the data, and then we can start searching. Rake is our friend for this:

rake thinking_sphinx:index
rake thinking_sphinx:start

Searching

Now for the fun stuff:

Person.search "Melbourne"

Or with some sorting:

Person.search "Melbourne", :order => :birthday

Or just people born within a 10 year window:

Person.search "Melbourne", :with => {:birthday => 25.years.ago..15.years.ago}

If you want to keep certain search terms to specific fields, use :conditions:

Person.search :conditions => {:location => "Melbourne"}

Just remember: :conditions is for fields, :with is for attributes (and :without for exclusive attribute filters).

Change

Your data changes – but unfortunately, Sphinx doesn’t update your indexes to match automatically. So there’s two things you need to do. Firstly, run rake thinking_sphinx:index regularly (using cron or something similar). ‘Regularly’ can mean whatever time frame you want – weekly, daily, hourly.

The second step is optional, but it’s needed to have your indexes always up to date. First, add a boolean column to your model, named ‘delta’, and have it default to true. Then, tell your index to use that delta field to keep track of changes:

define_index do
  # ...
  
  set_property :delta => true
end

Then you need to tell Sphinx about the updates:

rake thinking_sphinx:stop
rake thinking_sphinx:index
rake thinking_sphinx:start

Once that’s done, a delta index will be created – which holds any recent changes (since the last proper indexing), and gets re-indexed whenever a model is edited or created. This doesn’t mean you can stop the regular indexing, as that’s needed to keep delta indexes as small (and fast) as possible.

String Sorting

If you remember the details about fields and attributes, you’ll know that you can’t sort by fields. Which is a pain, but there’s ways around this – and it’s kept pretty damn easy in Thinking Sphinx. Let’s say we wanted to make our name field sortable:

define_index do
  indexes [first_name, last_name], :as => :name, :sortable => true
  
  # ...
end

Re-index and restart Sphinx, and sorting by name will work.

How is this done? Thinking Sphinx creates an attribute under the hood, called name_sort, and uses that, as Sphinx is quite fine with sorting by strings if they’re converted to ordinal values (which happens automatically when they’re attributes).

Pagination

Sphinx paginates automatically – in fact, there’s no way of turning that off. But that’s okay… as long as you can use your will_paginate helper, right? Never fear, Thinking Sphinx plays nicely with will_paginate, so your views don’t need to change at all:

<%= will_paginate @search_results %>

Associations

Sometimes you’ll want data in your fields (or attributes) from associations. This is a piece of cake:

define_index do
  indexes photos.caption, :as => :captions
  indexes friends.photos.caption, :as => :friends_photos
  
  # ...
end

Polymorphic associations are fine as well – but keep in mind, the more complex your index fields and attributes, the slower it will be for Sphinx to index (and you’ll definitely need some database indexes on foreign key columns to help it stay as speedy as possible).

Gotchas

In case things aren’t working, here’s some things to keep in mind:

  • Added an attribute, but can’t sort or filter by it? Have you reindexed and restarted Sphinx? It doesn’t automatically pick up these changes.
  • Sorting not working? If you’re specifying the attribute to sort by as a string, you’ll need to include the direction to sort by, just like with SQL: “birthday ASC”.
  • Using name or id columns in your fields or attributes? Make sure you specify them using symbols, as they’re core class methods in Ruby.
define_index do
  indexes :name
  
  # ...
  
  has photos(:id), :as => :photo_ids
end

And Next?

I realise this article is pretty light on details – but if you want more information, the first stop should be the extended usage page on the Thinking Sphinx site, quickly followed by the documentation. There’s also an email list to ask questions on.

21 May 2008

Sphinx + Rails + PostgreSQL

In case you’ve not been watching every commit carefully flow through Thinking Sphinx on GitHub – PostgreSQL support has been added. I’ve done a little bit of testing, and I’ve had some excellent support from Björn Andreasson and Tim Riley, so I feel it’s ready for people to start kicking the tires.

I’m no PostgreSQL expert – I definitely fall into the n00b category – so if you think there’s better ways to do what I’m doing, would love to hear them.

11 May 2008

Updates for Thinking Sphinx

I’ve been working away on Thinking Sphinx when I’ve had the time – and we’re nearing what I feel is a solid 1.0 release. I say we, because I’ve had some very generous people supply patches over the last few weeks – namely Arthur Zapparoli, James Healy, Chris Heald and Jae-Jun Hwang. Switching to git – and GitHub in particular – has made it very easy to accept patches.

Mind you, all of these changes aren’t committed just yet – and even when they are, there’ll still be a few more things to cross off the list before we hit the big 1.0 milestone, namely PostgreSQL support and solid spec coverage. Slowly edging closer.

In other news – to help share the Thinking Sphinx knowledge (after some prompting by a few users of the plugin), I’ve created a Google Group for it – so this will be the ideal place to ask questions about how to implement Sphinx into your app, suggest features, or report bugs.

If you’ve been pondering how to deploy Thinking Sphinx via Capistrano, I recommend you read a blog post by Wade Winningham – or if you’re interested in better ways of handling UTF-8 characters outside of the ‘normal’ set (ie: without accents and so forth), make sure you peruse James Healy’s solution.

And one last reminder – if you’re in Sydney on Wednesday evening and interested in learning a bit more about Sphinx in general and Thinking Sphinx in particular, come along to the monthly Ruby meet at the Crown Hotel in Surry Hills, as those are the topics I’ll be presenting on.

26 Apr 2008

Sphinx: A Primer

On Thursday night I presented to the Melbourne Ruby Group about Sphinx – first with a non-Ruby perspective, and then using Ruby, and more specifically Rails. I’ll be presenting again at the Sydney group in a couple of weeks, but I am also adapting the talk to a few blog posts – to allow a bit more detail in a few doses.

First up: Sphinx itself. Why should you read this? Because understanding Sphinx will help you use whichever library (Ruby or otherwise) smarter. It might also teach you some things you had no idea about (ie: this is the article I should have read when I started using Sphinx).

What is Sphinx?

Sphinx is a search engine. You feed it documents, each with a unique identifier and a bunch of text, and then you can send it search terms, and it will tell you the most relevant documents that match them. If you’re familiar with Lucene, Ferret or Solr, it’s pretty similar to those systems. You get the daemon running, your data indexed, and then using a client of some sort, start searching.

When indexing your data, Sphinx talks directly to your data source itself – which must be one of MySQL, PostgreSQL, or XML files – which means it can be very fast to index (if your SQL statements aren’t too complex, anyway).

Sphinx Structure

A Sphinx daemon (the process known as searchd) can talk to a collection of indexes, and each index can have a collection of sources. Sphinx can be directed to search a specific index, or all of them, but you can’t limit the search to a specific source explicitly.

Each source tracks a set of documents, and each document is made up of fields and attributes. While in other areas of software you could use those two terms interchangeably, they have distinct meanings in Sphinx (and thus require their own sections in this post).

Fields

Fields are the content for your search queries – so if you want words tied to a specific document, you better make sure they’re in a field in your source. They are only string data – you could have numbers and dates and such in your fields, but Sphinx will only treat them as strings, nothing else.

Attributes

Attributes are used for sorting, filtering and grouping your search results. Their values do not get paid any attention by Sphinx for search terms, though, and they’re limited to the following data types: integers, floats, datetimes (as Unix timestamps – and thus integers anyway), booleans, and strings. Take note that string attributes are converted to ordinal integers, which is especially useful for sorting, but not much else.

Multi-Value Attributes

There is also support in Sphinx to handle arrays of attributes for a single document – which go by the name of multi-value attributes. Currently (Sphinx version 0.9.8rc2) only integers are supported, so this isn’t quite as flexible as normal attributes, but it’s worth keeping in mind.

Filters

Filters are useful with attributes to limit your searches to certain sets of results – for example, limiting a forum post search to entries by a specific user id. Sphinx’s filters accept arrays or ranges – so if filtering by a single value, just put that in an array. The range filters are particularly useful for getting results from a certain time span.

Relevance

Relevancy is the default sorting order for Sphinx. I’ve no idea exactly how it is calculated, but there are a couple of things you can do easily enough in your queries to influence it. The first is index-level weighting, where you give specific indexes higher rankings than others. The other, similar in nature, but at a lower level, is field weightings. Generally these are set before each query, but it will depend on the library you use.

Keeping Your Indexes Updated

One thing that sets Sphinx apart from Ferret and other search engines is that there is no way to update fields for a specific document in your indexes. The main approach around this is having delta indexes – a small index with all the recent changes (which will be super-fast to index), so Sphinx will include that and the main index for its searches. Of the Rails plugins, both Thinking Sphinx and Ultrasphinx have support for this – I’ve no idea for other languages, mind you.

What’s next?

Next is when we’ll dive into some actual code – we’ll go through some of the common tasks for setting up Sphinx with Rails using Thinking Sphinx.

10 Apr 2008

Thinking Sphinx Reborn

So, over the last month or so I’ve been working hard on rewriting Thinking Sphinx – and it’s now time to release those changes publicly. The site’s now got a brief quickstart page and a detailed usage page beyond the rdoc files, and there will be more added over the coming weeks.

A quick overview of what’s shiny and new:

Better index definition syntax

This part reworked many times, finally to something I’m pretty happy with:

define_index do
  indexes [first_name, last_name], :as => :name, :sortable => true
  indexes email, location
  indexes [posts.content, posts.subject], :as => :posts
end

Polymorphic association support in indexes

When you’re drilling down into your associations for relevant field data, it’s now safe to use polymorphic associations – Thinking Sphinx is pretty smart about figuring what models to look at. Make sure you put table indexes on your _type columns though.

MVA Support

Multi-Value Attributes now work nice and cleanly – so you can tie an array of integers to any record.

Multi-Model Searching

Just like before, you can search for records of a specific model. This time around though, you can also search across all your models – and the results still use will_paginate if it’s installed.

ThinkingSphinx::Search.search "help"

Better Filter Support

It was kinda in there to start with, but now it’s much smarter – and it all goes into the conditions hash, just like a find call:

User.search :conditions => {:role_id => 5}
Article.search :conditions => {:author_ids => [12, 24, 48]}

Sorting by Fields

As you may have noticed in the first code block of this post, you can mark fields as :sortable – what this does is it uses Sphinx’s string attributes, and creates a matching attribute that acts as a sort-index to the field. When specifying the search options though, you can just use the field’s name – Thinking Sphinx knows what you’re talking about.

User.search "John", :order => :name
User.search "Smith", :order => "name DESC"

Even More

I’m so eager to share this new release that there’s probably a few things that need a bit more documentation – that will appear both on the Thinking Sphinx site and here on the blog. I’m planning on writing some articles that provide a solid overview to Sphinx in general – which will hopefully be some help no matter what plugin you use – and then dive into some regular ‘recipes’ of Thinking Sphinx usage, and some detailed posts of the cool new features as well.

Also in the pipeline is Merb support – just for ActiveRecord initially, but I’d love to get it talking to DataMapper as well.

Update: Jonathan Conway’s got a branch working in Merb and Rails – needless to say, I’ll be updating trunk with his patch as soon as possible.

14 Mar 2008

Sphinx 0.9.8-rc1 Updates

Another small sphinx-related post.

In line with the first release candidate release of Sphinx 0.9.8 last week, I’ve updated both my API, Riddle, and my plugin, Thinking Sphinx, to support it. Also, for those inclined, you can now get Riddle as a gem.

I’m slowly making progress on some major changes to Thinking Sphinx, so hopefully I’ll have something cool to show people soon. Oh, but some features that aren’t reflected in the documentation: most of Sphinx’s search options can be passed through when you call Model.search – including :group_by, :group_function, :field_weights, :sort_mode, etc. Consider it an exercise for the reader to figure out the details until I get around to improving the docs.

17 Jan 2008

Sphinx 0.9.8r1065

Short post, as befitting the importance of the content: Riddle and Thinking Sphinx have both been updated to support the current version of Sphinx, 0.9.8r1065.

27 Dec 2007

Updates for Sphinx 0.9.8r985

Another quick Sphinx post – Riddle is updated to support Sphinx’s latest release (0.9.8r985), and Thinking Sphinx now has that new version of Riddle as well.

I’ve not tested any of this with the recently released Ruby 1.9 yet, though (but it’s on my list of things to do).

Also, thank-you to Joost Hietbrink (again) and Jonathan Conway for their patches to Thinking Sphinx – very much appreciated.

02 Dec 2007

Sphinx-related Updates

Two Sphinx-related tidbits:

Riddle

Riddle now has a tag in SVN for the 0.9.8-r909 release of Sphinx – not that there were really any functional changes compared to r871, besides the two extra match modes (Full Scan and Extended 2 – the latter isn’t going to hang around for long anyway).

Thinking Sphinx

As well as supporting the above version of Sphinx, I’ve now added some brief documentation to Thinking Sphinx that discusses attributes, sorting and delta indexes. To summarise kinda-briefly:

Attributes

Attributes are defined in the define_index block as follows:

define_index do |index|
  index.has.created_at
  index.has.updated_at
  # Field definitions go here
end

They can only be from the indexed model (not associations), and in line with Sphinx’s limitations, must be either integers, floats or timestamps.

Sorting

Ties in closely with attributes – as that’s all Sphinx will let you order by. Use it in the same way as you would in a find call:

Invoice.search :conditions => "expensive", :order => "created_at DESC"

Same approach works for the :include parameter (although this has nothing to do with Sphinx itself).

Delta Indexes

Delta indexes track changes to model records between proper indexes (ie: from the rake task thinking_sphinx:index) – all they require is a boolean field in the model’s table called delta, and for delta indexing to be enabled as follows:

define_index do |index|
  index.delta = true
  # Fields and attributes go here
end

The one catch – at this point, delta indexes are one step off current, as they get indexed before the current transaction to the database is committed. This will get better soon, thanks to some help from Joost Hietbrink and his colleagues at YelloYello – once I find some free time, I’ll get that working much more neatly.

14 Nov 2007

Sphinx's Riddle

Edit: I’ve changed the Subversion reference to Github, and it’s worth noting that Riddle works with Sphinx 0.9.8, 0.9.9 and 1.10-beta at the time of writing (January 2011). Original post continues below:

Built out of the work I’ve done for Thinking Sphinx (which has just got basic support for delta indexes, attributes and sorting – although the documentation doesn’t reflect that), I’ve extracted a new Ruby client that communicates with Sphinx, which I’ve named Riddle.

I’m not going to delve into the code here – because I’m not expecting it to be that useful to many people (and I just wrote examples in the documentation – go read that instead!) – but I’m very happy with how it’s ended up, and it’s got some level of specs to give it a thorough test. It’s also compatible with the most recent release of Sphinx (0.9.8 r871). Should you wish to poke around with it, just clone it from Github:

git clone \
  git://github.com/freelancing-god/riddle.git

It’s also being used in Evan Weaver’s UltraSphinx plugin, which I’m pretty pleased about.

RssSubscribe to the RSS feed

About Freelancing Gods

Freelancing Gods is written by , who works on the web as a web developer in Melbourne, Australia, specialising in Ruby on Rails.

In case you're wondering what the likely content here will be about (besides code), keep in mind that Pat is passionate about the internet, music, politics, comedy, bringing people together, and making a difference. And pancakes.

His ego isn't as bad as you may think. Honest.

Here's more than you ever wanted to know.

Ruby on Rails Projects

Other Sites

Creative Commons Logo All original content on this site is available through a Creative Commons by-nc-sa licence.