Thinking Sphinx Reborn

So, over the last month or so I’ve been working hard on rewriting Thinking Sphinx - and it’s now time to release those changes publicly. The site’s now got a brief quickstart page and a detailed usage page beyond the rdoc files, and there will be more added over the coming weeks.

A quick overview of what’s shiny and new:

Better index definition syntax

This part reworked many times, finally to something I’m pretty happy with:

define_index do
  indexes [first_name, last_name], :as => :name, :sortable => true
  indexes email, location
  indexes [posts.content, posts.subject], :as => :posts
end

Polymorphic association support in indexes

When you’re drilling down into your associations for relevant field data, it’s now safe to use polymorphic associations - Thinking Sphinx is pretty smart about figuring what models to look at. Make sure you put table indexes on your _type columns though.

MVA Support

Multi-Value Attributes now work nice and cleanly - so you can tie an array of integers to any record.

Multi-Model Searching

Just like before, you can search for records of a specific model. This time around though, you can also search across all your models - and the results still use will_paginate if it’s installed.

ThinkingSphinx::Search.search "help"

Better Filter Support

It was kinda in there to start with, but now it’s much smarter - and it all goes into the conditions hash, just like a find call:

User.search :conditions => {:role_id => 5}
Article.search :conditions => {:author_ids => [12, 24, 48]}

Sorting by Fields

As you may have noticed in the first code block of this post, you can mark fields as :sortable - what this does is it uses Sphinx’s string attributes, and creates a matching attribute that acts as a sort-index to the field. When specifying the search options though, you can just use the field’s name - Thinking Sphinx knows what you’re talking about.

User.search "John", :order => :name
User.search "Smith", :order => "name DESC"

Even More

I’m so eager to share this new release that there’s probably a few things that need a bit more documentation - that will appear both on the Thinking Sphinx site and here on the blog. I’m planning on writing some articles that provide a solid overview to Sphinx in general - which will hopefully be some help no matter what plugin you use - and then dive into some regular ‘recipes’ of Thinking Sphinx usage, and some detailed posts of the cool new features as well.

Also in the pipeline is Merb support - just for ActiveRecord initially, but I’d love to get it talking to DataMapper as well.

Update: Jonathan Conway’s got a branch working in Merb and Rails - needless to say, I’ll be updating trunk with his patch as soon as possible.

Chris Parker left a comment on 10 Apr, 2008:

I really love this plugin. It makes indexing a breeze and is much easier to use than the alternatives.

Thanks alot!

Thuva left a comment on 10 Apr, 2008:

Pat, congratulations on releasing a new version of Thinking Sphinx. I’m looking forward to using it in my current project.

Chris Parker left a comment on 10 Apr, 2008:

Could you give the option for DISTINCT in GROUP_CONCAT?

For example:

CAST (GROUP_CONCAT(DISTINCT `vendor_details`.`group_id` SEPARATOR ‘ ‘) AS CHAR)

Chris Parker left a comment on 10 Apr, 2008:

oops:

@ CAST (GROUP_CONCAT(DISTINCT `vendor_details`.`group_id` SEPARATOR ‘ ‘) AS CHAR)@

pat left a comment on 10 Apr, 2008:

Thuva: Thank you :)

Chris: Great that you’re enjoying the plugin. Why do you need the distinct though? I think I know a pretty easy way for me to add it in, I’m just not (yet) sure why it’s needed.

Rob Olson left a comment on 11 Apr, 2008:

Does Thinking Sphinx support eager loading of associated models? Similar to the :include option in ActiveRecord find calls?

pat left a comment on 11 Apr, 2008:

Rob: Yes, if you specify :include in your search calls, it will be respected.

The one caveat - since Sphinx returns an array of ids, each model has to be loaded in a separate call to the database, so the full speed benefits of :include aren’t quite reached.

ie: Think of it like User.find(:first, :include => [:posts]) instead of User.find(:all, :include => [:posts])

Oliver Beddows left a comment on 11 Apr, 2008:

I’m loving this plugin! Most excellent work!

I have one “query” related to searching a selected group of indexes.

It would great if you could supply a list of class names (like UltraSphinx does).

For example:
ThinkSphinx::Search.search(‘query’, { :class_names => [‘Article’, ‘Product’, ‘Customer’] })

Looking at ThinkSphinx::Search it seems to be capable of handing one index or all indexes, but not a selected group of indexes.

Is this a feature you have planned or thought about? Just curious… ;o)

pat left a comment on 12 Apr, 2008:

Oliver: It’s not something I’d ever thought of - but easy enough to add. I’ll see if I can get it working over the next few days.

Thanks for the feedback :)

Paul Goscicki left a comment on 21 Apr, 2008:

I’ve been reading about Thinking Sphinx a little and so far it looks promising (especially as a Ferret replacement). I’d suggest you write a comprehensive tutorial on Thinking Sphinx usage in Rails. Similar to the one Rails Envy did for Ferret (with highlighting, field boosting, pagination, etc).

pat left a comment on 21 Apr, 2008:

Hi Paul - writing some detailed tutorials is definitely on my list of things to do. I’m hoping to start with a solid introduction to Sphinx first (by this weekend would be nice), and then I’ll move on to how to use Thinking Sphinx.

Brian Johnson left a comment on 24 Apr, 2008:

This looks very interesting. I have narrowed down my options to Thinking Sphinx and Ultrasphinx. Does Ultrasphinx still have more features with this latest release? If so, what does it have the Thinking Sphinx doesn’t? In particular, faceting is very important for my application.

pat left a comment on 24 Apr, 2008:

Hi Brian

At this stage, if you need faceting, go with Ultrasphinx. I know Patrick Lenz is working on a patch for faceting with Thinking Sphinx (check out his git tree on GitHub)… I don’t have the time to focus on it myself at the moment, so until he’s done, it looks like Ultrasphinx is what suits your needs best.

Beyond that though, I think the two plugins are pretty close in regards to features.

Brian Johnson left a comment on 24 Apr, 2008:

My needs for faceting are still a few months out, so something in the works might be ok. There must be some fundamental difference between Ultrasphinx and Thinking Sphinx, otherwise there would be no need for multiple plugins. What are the two approaches and what are the benefits/problems with each approach. I’d love to have the time to simply try both of them out and learn for myself, but unfortunately I have too many other things on my plate right now. A brief explanation would be much appreciated. Thanks.

pat left a comment on 25 Apr, 2008:

I feel there’s two main differences - the first is how you define indexes. I think my syntax is clean and obvious, and also allows you to get data through associations very easily.

The second is approach - I aim to have everything managed through Ruby and a bit of YAML. That is, you don’t need to edit the Sphinx configuration file yourself - the plugin sets it all up for you. I also aim for a very convention-over-configuration setup - if you don’t want to, you don’t need to worry about file paths - it puts it all in your application’s directory, and handles multiple environments fine.

Caveat with all this: obvious bias, plus I haven’t spent any time using Ultraphinx for a good 8 months or so.

Björn Andreasson left a comment on 28 Apr, 2008:

Hi. Just tried your Sphinx plugin (third plugin this last week) and stumbled upon some problems.

Your plugin seems to be MySQL only? Found a “mysql” in plugins/thinking_sphinx/lib/thinking_sphinx/configuration.rb and replaced it to pgsql to make it work with my database.
… however, the SQL-string that gets generated won’t work. The postgres doesn’t like your “´” and when deleting those, it gives me the error tbl_archive.collect doesn’t exists. Which is true. Why does it try to get a column that doesn’t exist? What is “collect”?

Would love to get some answers :)

Cheri A left a comment on 1 May, 2008:

Does Eager Loading work with Thinking Sphinx?

pat left a comment on 2 May, 2008:

Cheri - to some extent it is - see the comments above between myself and Rob Olsen. I do have plans to support it fully though - hopefully they’ll be implemented in the next week or two.

Cheri A left a comment on 2 May, 2008:

Great plugin! Do all the indexing features of Thinking Sphinx negate the need to add indices to my tables in sql?

pat left a comment on 3 May, 2008:

Hi Cheri - no, you still need indexes in your database tables. Sphinx uses MySQL to extract the data, and indexes on your tables makes those queries faster.

Also - there are still plenty of times when you’ll need to use normal find calls. Sphinx doesn’t replace that.

Cheri A left a comment on 4 May, 2008:

Thanks, Pat. So if I have a simple request, such as selecting an entry based on a primary id, it’s better to use find_by_sql and not sphinx, correct?

pat left a comment on 4 May, 2008:

Cheri: yup - or even just find(id), if it’s that simple :)

Artur Jedlinski left a comment on 12 Jul, 2008:

If you would like to use thinking_sphinx as Windows service (which is much more comfortable than typing “rake ts:start” again and again), you may find yourself interested in:

http://www.jedlinski.pl/blog/2008/07/12/thinking-sphinx-as-windows-service/

mauro catenacci left a comment on 21 Sep, 2008:

Hi, I would like to know if I can associate models that are not directly associated in the model where you specify the define_index. i:e
in my model I have
belongs_to :imageasset
belongs_to :raider

now imageasset has associations of it’s own and I need to get them in here somehow. Is that possible?

pat left a comment on 22 Sep, 2008:

Hi Mauro

You should be able to drill through your associations’ associations, like so:

indexes imageasset.creator.login

Cheers

joren left a comment on 23 Oct, 2008:

hi,

i’m really loving this plugin but I stumbled on some small issues.
Say I have a list of names, john, jon, jonn,…
an I want to search for all names where “jo” comes in.

Is this supported, or do I have to edit some setting or is there a specific way to search?

thx

pat left a comment on 23 Oct, 2008:

Hi Joren

It is supported once you add a couple of settings to config/sphinx.yml, for each environment (like you would in config/database.yml): set enable_star to true, and min_prefix_len to 1 (or longer, if you only want prefixes of 2 characters or more?). Stop Sphinx, reindex, restart Sphinx, and then try searching for “jo*”.

joren left a comment on 24 Oct, 2008:

thx,

Peepcode published a pdf today about thinking sphinx. So that can be a help to.
Got it this morning, and there is some good information in it about how it works and about more settings.
http://peepcode.com/products/thinking-sphinx-pdf

joren left a comment on 24 Oct, 2008:

ok, didn’t see till now, you’re already blogged about peepcode :)

joren left a comment on 24 Oct, 2008:

shouldn’t it be useful to only reindex the new values and the values that were updated since the last index?
’cause I don’t really like the extra field you need to use the delta functionality.

pat left a comment on 24 Oct, 2008:

Joren: not sure what you mean - the delta index is only used for new/updated records.

joren left a comment on 24 Oct, 2008:

we have a lot of records and don’t want even more writing in our db when we work with delta.

so every time we index we want to check only these records where the updated_at has changed since the previous time we indexed.

if we index every 15min, only reindex every record where updated_at is 15 minutes ago.

pat left a comment on 24 Oct, 2008:

Joren: At this stage, there’s nothing like that in the official Thinking Sphinx - although I’d like to have it as a feature at some point.

For the moment though, have a look at Ed Hickey’s fork.

joren left a comment on 28 Oct, 2008:

thx, this is the kind of stuff I needed.

great work for both of you

scott left a comment on 16 Nov, 2008:

Hi there,

I am an avid ultrasphinx user and stumbled on your plugin recently. I have yet to try it but plan on doing so soon.

Question for you - does thinking sphinx generate a single sphinx index, that contains multiple models (as ultrasphinx does), or does it actually create a separate index and source for each Active Record model?

IE - in ultrasphinx, you get one index (besides delta), ‘main’, and in main there is fields added like class and class_id, so at the end of the configuration and indexing, you have one gigantic index, that contains all your data.

To make sphinx and ultrasphinx work for our project we have had to hand modify the sphinx.conf file, and manually specify to sphinx which indexes to search. We have 1.2 billion rows we are searching, so having things in a single index is not efficient.

So, in asking you this I am just curious your approach on generating the actual index(es), as if ThinkingSphinx keeps them all separated per model, this would make deployment and maintenance much less for us.

Thank you very much for your hard work!

pat left a comment on 19 Nov, 2008:

Hi Scott

Thinking Sphinx generates an index per model - indeed, it actually generates two - user_core and user. The latter is a distributed index, used as a way to access user_core and user_delta (if deltas are enabled) in one single query.

Hopefully this separation makes life easier for you.

Cheers

Canvas left a comment on 15 Dec, 2008:

Hi,

I have a very complecated view to search data from. The view joins 6 tables, incuding inner and outter joins. Triggers are used to update data to help the join when data in the 6 tables are updated. As a result, the view is very complicated.

I used thining_sphix to build index and searched on the view, it worked pretty nice. But since it’s a view, I can not add column “delta” into it and can not use the delta index as a result. But I really need to use delta index as my data updates quite often.

The solution I can think of now is to use a table to replace the view. And update data into the new table when any of the 6 related tables are updated. Therefore I have to extract the logic in the view and triggers and implement it somewhere. The side effect of this solution is that synchronize the data in the new table and the other six is quite messy.

Any suggestion will be appreciated. Thank you very much.

Pat left a comment on 16 Dec, 2008:

Hi Canvas

It’s best to post your question to the google group so we can get more minds thinking about the best approach.

Canvas left a comment on 17 Dec, 2008:

Thank you Pat. I just posted my question to the google group.

joren left a comment on 18 Mar, 2009:

Hi,

Is there any reason why I only can search for the first 3 characters of an id?

I have this id 123456, when I search for ‘123*’ I get a result, bt when I search for ‘1234*’ or the complete id, My result is empty, whithout any error output in any logs.

Is this a setting or an option I can adjust in my config?
thanks

pat left a comment on 23 Mar, 2009:

Joren: I’m not sure what the problem is - perhaps it’s worth discussing this on the google group

Alex left a comment on 6 Jun, 2009:

indexes email, location

My instinct says this is gross. Why use method_missing instead of a symbol? Are you saving the : character or is there some reason for that?

pat left a comment on 7 Jun, 2009:

Alex: I’m using method missing because it makes more sense when defining fields and attributes - see the posts example in the line below what you’ve highlighted.

I agree it’s not brilliant, but it helps to keep the index definition clean.

wxianfeng left a comment on 23 Sep, 2009:

it is easier to use than ultrasphinx………

Prior to this: Fast CSV reading and writing with MySQL

More recently: Pangea Day