Y U NO GEMSPEC!? Wednesday, March 14, 2012

tl;dr

  1. Team Nokogiri are not 10-foot-tall code-crunching robots, so master is usually unstable.
  2. Unstable code can corrupt your data and crash your application, which would make everybody look bad.
  3. Therefore, the risk associated with using unstable code is severe; for you and for Team Nokogiri.
  4. The absence of a gemspec is a risk mitigation tactic.
  5. You can always ask for an RC release.

Why Isn't There a Gemspec!?

OHAI! Thank you for asking this question!

Team Nokogiri gets asked this pretty frequently.

Sometimes people imply that we've forgotten, or that we don't know how to properly manage our codebase. Those people are super fun to respond to!

We've gone back and forth a couple of times over the past few years, but the current policy of Team Nokogiri is to not provide a gemspec in the Github repo. This is a conscious choice, not an oversight.

But You Didn't Answer the Question!

Ah, I was hoping you wouldn't notice. Well, OK, let's do this, if you're serious about it.

I'd like to start by talking about risk. Specifically, the risk associated with using a known-unstable version of Nokogiri.

Risk

One common way to evaluate the risk of an incident is:

risk = probability x impact

You can read more about this on the internets.

The risk associated with a Nokogiri bug could be loosely defined by answering the questions:

  • "How likely is it that a bug exists?" (probability)
  • "How severe will the consequences of a bug be?" (impact)

Probability

The master branch should be considered unstable. Team Nokogiri are not 10-foot-tall code-crunching robots; we are humans. We make mistakes, and as a result, any arbitrary commit on master is likely to contain bugs.

Just as an example, Nokogiri master was unstable for about five months between November 2011 and March 2012. It was unstable not because we were sloppy, or didn't care, but because the fixes were hard and unobvious.

When we release Nokogiri, we test for memory leaks and invalid memory access on all kinds of platforms with many flavors of Ruby and lots of versions of libxml2. Because these tests are time-consuming, we don't run them on every commit. We run them often when preparing a release.

If we're releasing Nokogiri, it means we think it's rock solid.

And if we're not releasing it, it means there are probably bugs.

Impact

Nokogiri is a gem with native extensions. This means it's not pure Ruby -- there's C or Java code being compiled and run, which means that there's always a chance that the gem will crash your application, or worse. Possible outcomes include:

  • leaking memory
  • corrupting data
  • making benign code crash (due to memory corruption)

So, then, a bug in a native extension can have a much worse downside than you might think. It's not just going to do something unexpected; it's possibly going to do terrible, awful things to your application and data.

Nobody wants that to happen. Especially Team Nokogiri.

Risk, Redux

So, if you accept the equation

risk = probability x impact

and you believe me when I say that:

  • the probability of a bug in unreleased code is high, and
  • the impact of a bug is likely to be severe,

then you should easily see that the risk associated with a bug in Nokogiri is quite high.

Part of Team Nokogiri's job is to try to mitigate this risk. We have a number of tactics that we use to accomplish this:

  • we respond quickly to bug reports, particularly when they are possible memory issues
  • we review each others' commits
  • we have a thorough test suite, and we test-drive new features
  • we discuss code design and issues on a core developer mailing list
  • we use valgrind to test for memory issues (leaks and invalid access) on multiple combinations of OS, libxml2 and Ruby
  • we package release candidates, and encourage devs to use them
  • we do NOT commit a gemspec in our git repository

Yes, that's right, the absence of a gemspec is a risk mitigation tactic. Not only does Team Nokogiri not want to imply support for master, we want to actively discourage people from using it. Because it's not stable.

But I Want to Do It Anyway

Another option is to email the nokogiri-talk list and ask for a release candidate to be built. We're pretty accommodating if there's a bugfix that's a blocker for you. And if we can't release an RC, we'll tell you why.

And in the end, nothing is stopping you from cloning the repo and generating a private gemspec. This is an extra step or two, but it has the benefit of making sure developers have thought through the costs and risks involved; and it tends to select for developers who know what they're doing.
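
If you do go that route, something like the following sketch is the general shape of a private gemspec. To be clear, this is hypothetical and hedged: the version string, summary, and file globs are invented for illustration, not Team Nokogiri's actual build configuration.

# nokogiri.gemspec -- a hypothetical, minimal private gemspec
Gem::Specification.new do |spec|
  spec.name       = "nokogiri"
  spec.version    = "1.5.0.dev"  # invented pre-release version; pick your own
  spec.summary    = "Private build of Nokogiri from an unreleased commit"
  spec.authors    = ["you"]
  spec.files      = Dir["lib/**/*.rb"]          # illustrative file globs only
  spec.extensions = ["ext/nokogiri/extconf.rb"]
end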

In Conclusion

Team Nokogiri takes stability very seriously. We want everybody who uses Nokogiri to have a pleasant experience. And so we want to make sure that you're using the best software we can make.

Please keep in mind that we're trying very hard to do the right thing for all Nokogiri users out there in Rubyland. Nokogiri loves you very much, and we hope you love it back.

Fairy-Wing Wrapup: Nokogiri Performance Wednesday, May 18, 2011

TL;DR

  • Nokogiri’s DOM parser was extremely way faster than either the SAX or Reader parsers, in this particular real-world example.
  • ActiveSupport Hash#from_xml, I am disappoint.
  • On JRuby, Nokogiri 1.5.0 is extremely way faster than Nokogiri 1.4.4, in this particular real-world example.

Artist's Pre-enactment

(Shout-out to @jonathanpberger for the Artist’s Pre-enactment of Paul Dix wearing the fairy wings.)

Previously, on the Fairy Wing Throwdown …

So, you might remember that a few months back, @pauldix bet me that JSON parsing is an order of magnitude faster than XML parsing. (If you’re not in the loop, you can read the dramatization of the bet).

TL;DR, Paul lost that bet, and so will be wearing my daughter’s dress-up fairy wings during his RailsConf 2011 talk on Redis on Thursday. Awesome!

You can view the winning benchmark here.

I want to go to there.

The bet revolved around a real-world use case (Paul and I both work at Benchmark Solutions, a stealth financial market data startup in NYC).

You can view the data structure at the Official Fairy-Wing Throwdown Repo™, https://github.com/flavorjones/fairy-wing-throwdown, but the summary is that it's 54K when serialized as JSON, and consists (mostly) of an array of key-value stores (i.e., hashes).

Because I wanted to not just win, but to destroy Paul, I implemented the same parsing task using Nokogiri’s DOM parser, SAX parser, and Reader parser, expecting that code complexity and performance would correlate, somehow. In my mind, the graph looked like this:

Expected complexity and performance

But I was shocked and dismayed to see the real results:

Reality bites

What the WHAT?

Yes, that’s right. My payback for increasing the complexity of the code was a reduction in performance. The DOM parser was extremely way faster than either the Reader or SAX parsers.

Let me say that again: the DOM parser implementation was compellingly faster (1.3x) than the SAX parser implementation.

Why would that be? Good question, which I’ll deep-dive into in my next post. But suffice to say, the SAX parser is bottlenecked on making lots of callbacks from C into Ruby-land.
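
To make that concrete, here's a stripped-down sketch of what the SAX approach looks like. The <record> element name and structure are assumptions for illustration, not the actual throwdown schema; the point is that every handler method below is a callback crossing the C-to-Ruby boundary.

require 'rubygems'
require 'nokogiri'

class RecordHandler < Nokogiri::XML::SAX::Document
  attr_reader :records

  def initialize
    @records = []
  end

  def start_element(name, attrs = [])
    @record = {} if name == "record"
    @current = name
  end

  def characters(text)
    (@record[@current] ||= "") << text if @record
  end

  def end_element(name)
    @records << @record if name == "record"
  end
end

handler = RecordHandler.new
xml = "<records><record><id>1</id><price>42.0</price></record></records>"
Nokogiri::XML::SAX::Parser.new(handler).parse(xml)
handler.records # => [{"id"=>"1", "price"=>"42.0"}]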

ActiveSupport, I am disappoint.

Another big wow for me was how slow ActiveSupport’s Hash#from_xml method is. The benchmark shows that it’s about 40 times slower than the partial implementation using Nokogiri’s DOM parser.

Somebody should work on that! It wouldn’t be tough to hack an alternative implementation of Hash#from_xml on top of Nokogiri. If anybody’s looking for an interesting project, there it is.
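
For the curious, a toy version might start out something like this. This is a hedged sketch only: it ignores attributes, repeated sibling elements, and all the type-casting that ActiveSupport's version performs.

require 'rubygems'
require 'nokogiri'

def hash_from_xml(xml)
  doc = Nokogiri::XML(xml)
  { doc.root.name => node_to_value(doc.root) }
end

def node_to_value(node)
  children = node.element_children
  return node.text if children.empty? # leaf node: return its text
  children.each_with_object({}) do |child, hash|
    hash[child.name] = node_to_value(child)
  end
end

hash_from_xml("<root><a>1</a><b><c>2</c></b></root>")
# => {"root"=>{"a"=>"1", "b"=>{"c"=>"2"}}}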

You can be my @yokolet

Here’s a chart of how the DOM parser implementation works on various platforms:

DOM parser on various platforms

Holy cow! The pure-Java implementation on Nokogiri 1.5.0.beta.4 is 4 times faster than the FFI-to-C implementation on Nokogiri 1.4.4 (28s vs 117s). That’s crazytown!

Thanks to everyone who’s committed to the pure-Java code, notably @headius, @yokolet, @pmahoney and @serabe.

Chart Notes

The “expected performance” line chart is in imaginary units.

The “actual performance” line chart renders performance in number of records processed per second, so bigger is better. The Saikuro and Flog scores were normalized on their values for #transform_via_dom.

The “DOM parser on various platforms” bar chart renders total benchmark runtime, so smaller is better.

JSON vs XML: The Fairy-Wing Throwdown Thursday, March 31, 2011

TL;DR

  1. Is XML parsing more than an order of magnitude (i.e., 10x) slower than JSON parsing in real world situations?
  2. Both @pauldix and @flavorjones think XML parsing is slower than JSON parsing.
  3. @pauldix says XML parsing is more than an order of magnitude slower than JSON parsing.
  4. @flavorjones says XML parsing is less than an order of magnitude slower than JSON parsing.
  5. The loser must wear @flavorjones’s daughter’s dress-up fairy wings on stage throughout @pauldix’s RailsConf 2011 presentation.
  6. Benchmarks must be performed by close of business Friday, April 1. (No, this is not an April Fool's joke.)

And man, I hope Confreaks is filming it.

The Fairy-Wing Throwdown

or

JSON v XML

or

How much slower, exactly, is XML in the real world?

(A One Act Drama)

Dramatis Personae

Paul, Mike, John, and the Chorus.

Act I, Scene I

Cast is gathered together, drinking beverages, nerding.

John: Hark! My love for Scala knows no bounds. Also, Ruby is Not Half Bad.

Mike: Rememberest thou when we first met? You had time and love for naught but Java and its dear-lov’d cousin, strong static typing.

Chorus: And don’t forget XML!

Paul: Ha ha! Java doth go nowhere without bountiful XML following it around like a little puppy.

Mike: Ha ha! And aided by Spring’s alchemy you wrote Java in XML!

Chorus: Ha ha!

John: A cold and drowsy humour to hear you mock XML so.

Chorus: Why dost thou wring thy hands?

John: Because Nokogiri hath been brought forth from his loins, and he hath intimate knowledge of XML.

Mike: Aye, I know it well, and thus my disaffection has measure and reason. In particular, namespaces are really quite broken.

Paul: Plus, it’s SO SLOW.

John: Gentle Dix, put thy rapier up.

Paul: I do protest I never injur’d thee! I would wager that XML has got to be an order of magnitude slower than JSON, at least!

(Pause.)

Mike: (to John) Forbear this outrage? For shame.

(John shrugs.)

Mike: Knowest I libxml2 so well, it is mos def dishonorably slow. But an order of magnitude? I will take that bet.

Paul: I am not affrighted, nor have I need for your money.

Mike: Then … let’s make it … interesting.

Exeunt omnes.

Tentative Conditions

  1. Benchmarks must be performed on Ruby 1.8.7 with any standard compiled extensions / gems.
  2. Objective is a specific data structure actually used by these Gentlemen at their place of business, Benchmark Solutions.
  3. Code must take a string (JSON or XML), and return an inflated Ruby data structure exactly matching the objective.
  4. Timing should encompass only in-memory operations (not IO).
  5. @jvshahid will be the arbiter of whether the implementations violate the spirit of “real worldiness”.

Easily Valgrind & GDB your Ruby C Extensions Monday, June 08, 2009

Update: John Barnette (@jbarnette) has packaged these rake tasks up as a Hoe plugin: hoe-debugging.

When developing Nokogiri, the most valuable tool I use to track down memory-related errors is Valgrind. It rocks! Aaron and I run the entire Nokogiri test suite under Valgrind before releasing any version.

I could wax poetic about Valgrind all day, but for now I'll keep it brief and just say: if you write C code and you're not familiar with Valgrind, get familiar with it. It will save you countless hours of tracking down heisenbugs and memory leaks some day.

In any case, I've been meaning to package up my utility scripts and tools for quite a while. But they're so small, and it's so hard to make them work for every project ... it's looking pretty likely that'll never happen, so blogging about them is probably the best thing for everyone.

Basics

Let's get to it. Here's how to run a ruby process under valgrind:

# hello-world.rb
require 'rubygems'
puts 'hello world'

# run from cmdline
valgrind ruby hello-world.rb

Oooh! But that's not actually what you want. The Matz Ruby Interpreter does a lot of funky things in the name of speed, like using uninitialized variables and reading past the ends of malloced blocks that aren't on an 8-byte boundary. As a result, something as simple as require 'rubygems' will give you 3800 lines of error messages (see this gist for full output).

Let's try this:

valgrind --partial-loads-ok=yes --undef-value-errors=no ruby hello-world.rb

==15535== Memcheck, a memory error detector.
==15535== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==15535== Using LibVEX rev 1804, a library for dynamic binary translation.
==15535== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==15535== Using valgrind-3.3.0-Debian, a dynamic binary instrumentation framework.
==15535== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==15535== For more details, rerun with: -v
==15535== 
hello world
==15535== 
==15535== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==15535== malloc/free: in use at exit: 10,403,440 bytes in 138,986 blocks.
==15535== malloc/free: 420,496 allocs, 281,510 frees, 155,680,688 bytes allocated.
==15535== For counts of detected errors, rerun with: -v
==15535== searching for pointers to 138,986 not-freed blocks.
==15535== checked 10,654,020 bytes.
==15535== 
==15535== LEAK SUMMARY:
==15535==    definitely lost: 21,280 bytes in 1,330 blocks.
==15535==      possibly lost: 27,368 bytes in 1,840 blocks.
==15535==    still reachable: 10,354,792 bytes in 135,816 blocks.
==15535==         suppressed: 0 bytes in 0 blocks.
==15535== Rerun with --leak-check=full to see details of leaked memory.

Ahhh, much better. We don't see any spurious errors.

Without going too far off-topic, I should just mention that those "leaks" aren't really leaks; they're characteristic of how the Ruby interpreter manages its internal memory. (You can see this by running this example with --leak-check=full.)
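
That is:

valgrind --partial-loads-ok=yes --undef-value-errors=no --leak-check=full ruby hello-world.rb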

Rakified!

Here's an easy way to run Valgrind on your gem's existing test suite. This rake task assumes you've got Hoe 1.12.1 or higher.

namespace :test do
  # partial-loads-ok and undef-value-errors necessary to ignore
  # spurious (and eminently ignorable) warnings from the ruby
  # interpreter
  VALGRIND_BASIC_OPTS = "--num-callers=50 --error-limit=no \
                         --partial-loads-ok=yes --undef-value-errors=no"

  desc "run test suite under valgrind with basic ruby options"
  task :valgrind => :compile do
    cmdline = "valgrind #{VALGRIND_BASIC_OPTS} ruby #{HOE.make_test_cmd}"
    puts cmdline
    system cmdline
  end
end

Those basic options will give you a decent-sized stack walkback on errors, will make sure you see every error, and will skip all the BS output mentioned above. You can read Valgrind's documentation for more information, and to tune the output.

If you're not testing a gem, or don't have Hoe installed, try this for Test::Unit suites:

def test_suite_cmdline
  require 'find'
  files = []
  Find.find("test") do |f|
    files << f if File.basename(f) =~ /.*test.*\.rb$/
  end
  # the RUBY constant (path to the ruby executable) is provided by rake
  cmdline = "#{RUBY} -w -I.:lib:ext:test -rtest/unit \
               -e '%w[#{files.join(' ')}].each {|f| require f}'"
end

namespace :test do
  # partial-loads-ok and undef-value-errors necessary to ignore
  # spurious (and eminently ignorable) warnings from the ruby
  # interpreter
  VALGRIND_BASIC_OPTS = "--num-callers=50 --error-limit=no \
                         --partial-loads-ok=yes --undef-value-errors=no"

  desc "run test suite under valgrind with basic ruby options"
  task :valgrind => :compile do
    cmdline = "valgrind #{VALGRIND_BASIC_OPTS} #{test_suite_cmdline}"
    puts cmdline
    system cmdline
  end
end

Getting this to work for rspec suites is left as an exercise for the reader. :-\

A Note for OS X Users

Valgrind isn't just for Linux. You can make Valgrind work on your fancy-pants OS, too! Check out http://www.sealiesoftware.com/valgrind/ for details.

GDB FTW!

Another thing I find myself doing pretty often is running the test suite under the gdb debugger:

gdb --args ruby -S rake test

or in your Rakefile:

namespace :test do
  desc "run test suite under gdb"
  task :gdb => :compile do
    system "gdb --args ruby #{HOE.make_test_cmd}"
  end
end
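
Once you're at the (gdb) prompt, a typical session might go something like this. The breakpoint target below is just an example; substitute whatever C function you're actually chasing.

(gdb) break xmlFreeDoc    # example: break inside a libxml2 function
(gdb) run
(gdb) backtrace           # inspect the stack when it hits (or segfaults)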

Nokogiri, Your New Swiss Army Knife Monday, November 17, 2008

Prologue

Today I'd like to talk about the use of regular expressions to parse and modify HTML. Or rather, the misuse.

I'm going to try to convince you that it's a very bad idea to use regexes for HTML. And I'm going to introduce you to Nokogiri, my new best friend and life companion, who can do this job way better, and nearly as fast.

For those of you who just want the meat without all the starch:

  1. You don't parse Ruby or YAML with regular expressions, so don't do it with HTML, either.
  2. If you know how to use Hpricot, you know how to use Nokogiri.
  3. Nokogiri can parse and modify HTML more robustly than regexes, with less penalty than formatting Markdown or Textile.
  4. Nokogiri is 4 to 10 times faster than Hpricot at performing the typical HTML-munging operations benchmarked.

The Scene

On one of the open-source projects I contribute to (names will be withheld for the protection of the innocent, this isn't Daily WTF), I came across the following code:

def spanify_links(text)
  text.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>')
end

In case it's not clear, the goal of this method is to insert a <span> element inside the link, converting hyperlinks from

<a href='http://foo.com/'> Foo! </a>

to

<a href='http://foo.com/'> <span> Foo! </span> </a>

for CSS styling.

The Problem

Look, I love regexes as much as the next guy, but this regex is seriously busticated. If there is more than one <a> tag on a line, only the final one will be spanified. If the tag contains an embedded newline, nothing will be spanified. There are probably other unobvious bugs, too, and that means there's a code smell here.

Sure, the regex could be fixed to work in these cases. But does a trivial feature like this justify the time spent writing test cases and playing whack-a-mole with regex bugs? Code smell.

Let's look at it another way: If you were going to modify Ruby code programmatically, would you use regular expressions? I seriously doubt it. You'd use something like ParseTree, which understands all of Ruby's syntax and will correctly interpret everything in context, not just in isolation.

What about YAML? Would you modify YAML files with regular expressions? Hells no. You'd slurp it with YAML.parse(), modify the in-memory data structures, and then write it back out.
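
Something like this sketch, in other words (using YAML.load for simplicity; the file name and key are invented):

require 'yaml'

# slurp, modify the in-memory data structure, write it back out
config = YAML.load(File.read("config.yml"))
config["timeout"] = 30
File.open("config.yml", "w") { |f| f.write(YAML.dump(config)) }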

Why wouldn't you do the same with HTML, which has its own nontrivial (and DTD-dependent) syntax?

Regular expressions just aren't the right tool for this job. Jamie Zawinski said it best:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Why, God? Why?

So, what drives otherwise intelligent people (myself included) to whip out regular expressions when it comes time to munge HTML?

My only guess is this: A lack of worthy XML/HTML libraries.

Whoa, whoa, put down the flamethrower and let me explain myself. By "worthy", I mean three things:

  • fast, high-performance, suitable for use in a web server
  • nice API, easy for a developer to learn and use
  • will successfully parse broken HTML commonly found on the intarwebs

libxml2 and libxml-ruby have been around for ages, and they're incredibly fast. But have you seen the API? It's totally sadistic, and as a result it's inappropriate and not easily usable in simple cases like the one described above.

Now, Hpricot is pure genius. It's pretty fast, and the API is absolutely delightful to work with. It supports CSS as well as XPath queries. I've even used it (with feed-normalizer) in a Rails application, and it performed reasonably well. But it's still much slower than regexes. Here's a (totally unfair) sample benchmark comparing Hpricot to a comparable (though buggy) regular expression (see below for a link to the benchmark gist):

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

So, historically, I haven't used Hpricot everywhere I could have, and that's because I was overly cautious about performance.

Get On With It, Already

Oooooh, if only there was a library with libxml2's speed and Hpricot's API. Then maybe people wouldn't keep trying to use regular expressions where an HTML parser is needed.

Oh wait, there is. Everyone, meet Nokogiri.

Check out the full benchmark, comparing the same operation (spanifying links and removing possibly-unsafe tags) across regular expressions, Hpricot and Nokogiri:

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000       1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10         0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

Wow! Nokogiri parsed and modified blog-sized HTML snippets in under 2 milliseconds! Though still significantly slower than regular expressions, that's fast enough for me to consider using it in a web application server.

Hell, that's as fast as (faster, actually, than) BlueCloth or RedCloth can render Markdown or Textile of similar length. If you can justify using those in your web application, you can certainly afford the overhead of Nokogiri.

And as for usability, let's compare the regular expressions to the Nokogiri operations:

html.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>') # broken regex
html.gsub(/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/, '')

doc.search("a/text()").wrap("<span></span>")
doc.search("script","noscript","object","embed","style","frameset","frame","iframe").unlink

The Nokogiri version is much clearer. More maintainable, more robust and, for me, just fast enough to start jamming into all kinds of places.
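
For the record, here's the whole round-trip in one place, reusing the exact calls from above (a sketch; the input string is invented):

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML("<a href='http://foo.com/'> Foo! </a>")

doc.search("a/text()").wrap("<span></span>")
doc.search("script", "noscript", "object", "embed",
           "style", "frameset", "frame", "iframe").unlink

puts doc.to_html
# => something like: ...<a href="http://foo.com/"><span> Foo! </span></a>...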

Where Else Can I Use Nokogiri?

You can use Nokogiri anywhere you read, write or modify HTML or XML. It's your new swiss army knife.

What about your test cases? Merb is using Nokogiri extensively in their controller tests, and they're reportedly much faster than before. And those Merb dudes are S-M-R-T.

Have you thought about using Nokogiri::Builder to generate XML, instead of the default Rails XML template builder? Boy, I have. Upcoming blog post, hopefully.

Let me know where else you've found Nokogiri useful! Or better yet, join the mailing list and tell the community!

Nokogiri: World's Finest (XML/HTML) Saw Friday, October 31, 2008

Yesterday was a big day, and I nearly missed it, since I spent nearly all of the sunlight hours at the wheel of a car. Nine hours sitting on your butt is no way to ... oh wait, that's actually how I spend every day. Just usually not in a rental Hyundai. Never mind, I digress.

It was a big day because Nokogiri was released. I've spent quite a bit of time over the last couple of months working with Aaron Patterson (of Mechanize fame) on this excellent library, and so I'm walking around, feeling satisfied.

"What's Nokogiri?" Good question, I'm glad I asked it.

Nokogiri is the best damn XML/HTML parsing library out there in Rubyland. What makes it so good? You can search by XPath. You can search by CSS. You can search by both XPath and CSS. Plus, it uses libxml2 as the parsing engine, so it's fast. But the best part is, it's got a dead-simple interface that we shamelessly lifted from Hpricot, everyone's favorite delightful parser.
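
A quick taste (the URL and selectors here are made up for illustration):

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.example.com/"))

doc.search("//h3")       # search by XPath
doc.search("div.entry")  # search by CSS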

I had big plans to do a series of posts with examples and benchmarks, but right now I'm in DST Hell and don't have the quality time to invest.

So, as I am wont to do, I'm punting. Thankfully, Aaron was his usual prolific self, and has kindly provided lots of documentation and examples.

Use it in good health! Carry on.

P.S. Please start following Aaron on Twitter. :)

Rails Model Firewall Mixin Tuesday, August 26, 2008

At my company, Pharos, we're about to launch a new product which will contain sensitive data for multiple firms in a single database. This is essentially a lightweight version of our flagship product, which was built for a single client.

Of course, as a result, I had to refactor like crazy to get rid of the implicit "one-firm" assumption that was built into the code and database schemas.

The essential task was to add "firm_id" to each of the private table schemas, and then make sure that all the code that accesses the model specifies the firm in the query. The two access idioms in wide use were (unsurprisingly):

results = ClassName.find(:all, :conditions => [....])

and

results = ClassName.find_by_entity_id_and_hour(...)

I was able to make minimal changes to the code by supporting the following new idioms through a mixin (the mixin code is at the end of the article):

results = ClassName.find_in_firm_scope(firm_id, :all, :conditions => [....])

results = ClassName.with_firm_scope(firm_id) do |klass|
  klass.find_by_entity_id_and_hour(...)
end

(I found the second idiom easier to write, and the resulting diff easier to read, than:

ClassName.find_by_firm_id_and_entity_id_and_hour(firm_id, ...)

but really, that's a matter of taste.)

But I was still nervous. What if I missed an instance of a database lookup that wasn't specifying the firm, and as a result one client saw another client's records? That would be a Really Bad Thing™, and I wanted to explicitly make sure it couldn't happen. But how?

After a half hour of poking around and futzing, I came up with a find()-and-friends implementation that will check with_scope conditions as well as the :conditions parameter to the find() call:

>> My::PrivateModel.find_by_entity_id(1)
RuntimeError: My::PrivateModel PrivateRecord find() did not specify firm_id

Without further ado, here's the mixin:

# lib/private_record.rb
module PrivateRecord
  def self.included(base)
    base.validates_presence_of :firm_id
    base.extend PrivateRecordClassExtendor
  end
end

module PrivateRecordClassExtendor

  # all finders (find, find_by_*, etc.) eventually funnel through
  # find_every, so this is the chokepoint for enforcing the check
  def find_every(*args)
    check_for_firm_id(*args)
    super(*args)
  end

  # the DRY idiom here is: results = ClassName.with_firm_scope(firm) {|klass| klass.find(...) }
  def with_firm_scope(firm, &block)
    with_scope(:find => {:conditions => "firm_id = #{firm}"}, :create => {:firm_id => firm}) do
      yield self
    end
  end

  def find_in_firm_scope(firm, *args)
    with_firm_scope(firm) do
      find(*args)
    end
  end

private
  FIRM_ID_RE = /firm_id =/
  def check_for_firm_id(*args)
    ok = false
    # first, look for a firm_id condition in any active with_scope
    if scoped_methods
      scoped_methods.each do |j|
        if j[:find] && j[:find][:conditions] && j[:find][:conditions] =~ FIRM_ID_RE
          ok = true
          break
        end
      end
    end
    # then, look for a firm_id condition passed directly to find()
    if !ok
      args.each do |j|
        if j.is_a?(Hash) && j[:conditions]
          if (j[:conditions].is_a?(String) && j[:conditions] =~ FIRM_ID_RE) \
             or (j[:conditions].is_a?(Hash) && j[:conditions][:firm_id])
            ok = true
            break
          end
        end
      end
    end
    raise "#{self} PrivateRecord find() did not specify firm_id" if !ok
  end
end

The magic is all in the check_for_firm_id() method. To use this, simply:

include PrivateRecord

and go to town.
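
That is, given a model like this (the class name echoes the test cases below):

class My::PrivateModel < ActiveRecord::Base
  include PrivateRecord
end

My::PrivateModel.find(1)                    # raises RuntimeError
My::PrivateModel.find_in_firm_scope(42, 1)  # OK: scoped to firm 42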

Oh, and lest ye be skeptical, here are the test cases:

require File.dirname(__FILE__) + '/../test_helper'

class PrivateModelTest < ActiveSupport::TestCase

  fixtures :isone_da_schedules

  def test_privaterecord_disallow_find_requirement
    assert_raises(RuntimeError) { My::PrivateModel.find(1) }
    assert_raises(RuntimeError) { My::PrivateModel.find_by_entity_id(1) }
    assert_raises(RuntimeError) { My::PrivateModel.find_all_by_entity_id(1) }
    assert_raises(RuntimeError) { My::PrivateModel.find(:all, :conditions => 'entity_id = 1') }
    assert_raises(RuntimeError) { My::PrivateModel.find(:first, :conditions => 'entity_id = 1') }
  end

  def test_privaterecord_allow_find_requirement
    assert_nothing_thrown { My::PrivateModel.find_in_firm_scope(1, 1) }
    assert_nothing_thrown { My::PrivateModel.with_firm_scope(1) {|k| k.find_by_entity_id(1) } }
    assert_nothing_thrown { My::PrivateModel.with_firm_scope(1) {|k| k.find_all_by_entity_id(1) } }
    assert_nothing_thrown { My::PrivateModel.find_in_firm_scope(1, :all, :conditions => 'entity_id = 0') }
    assert_nothing_thrown { My::PrivateModel.find_in_firm_scope(1, :first, :conditions => 'entity_id = 0') }
    assert_nothing_thrown { My::PrivateModel.find_by_firm_id_and_entity_id(1, 1) }
  end

end

Let me know in the comments if you found this at all useful! Keep coding.