Nokogiri, Your New Swiss Army Knife Monday, November 17, 2008

Prologue

Today I'd like to talk about the use of regular expressions to parse and modify HTML. Or rather, the misuse.

I'm going to try to convince you that it's a very bad idea to use regexes for HTML. And I'm going to introduce you to Nokogiri, my new best friend and life companion, who can do this job way better, and nearly as fast.

For those of you who just want the meat without all the starch:

  1. You don't parse Ruby or YAML with regular expressions, so don't do it with HTML, either.
  2. If you know how to use Hpricot, you know how to use Nokogiri.
  3. Nokogiri can parse and modify HTML more robustly than regexes, with less penalty than formatting Markdown or Textile.
  4. Nokogiri is 4 to 10 times faster than Hpricot performing the typical HTML-munging operations benchmarked.

The Scene

On one of the open-source projects I contribute to (names will be withheld for the protection of the innocent, this isn't Daily WTF), I came across the following code:

def spanify_links(text)
  text.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>')
end

In case it's not clear, the goal of this method is to insert a <span> element inside the link, converting hyperlinks from

<a href='http://foo.com/'> Foo! </a>

to

<a href='http://foo.com/'> <span> Foo! </span> </a>

for CSS styling.

The Problem

Look, I love regexes as much as the next guy, but this regex is seriously busticated. If there is more than one <a> tag on a line, only the final one will be spanified. If the tag contains an embedded newline, nothing will be spanified. There are probably other unobvious bugs, too, and that means there's a code smell here.
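Both failure modes are easy to reproduce in irb. The snippet below reuses the method exactly as shown above; the two sample strings are made up for illustration.

```ruby
# The method from the project, unchanged.
def spanify_links(text)
  text.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>')
end

# Two links on one line: the greedy (.*) swallows the first link whole,
# so only the last link gets a <span>.
two_links = "<a href='/a'>A</a> and <a href='/b'>B</a>"
puts spanify_links(two_links)
# prints: <a href='/a'>A</a> and <a href='/b'><span>B</span></a>

# A newline between attributes: "." does not match "\n" without the /m flag,
# so the regex never matches and the input comes back untouched.
multiline = "<a href='/a'\n   class='x'>A</a>"
puts spanify_links(multiline) == multiline
# prints: true
```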

Sure, the regex could be fixed to work in these cases. But does a trivial feature like this justify the time spent writing test cases and playing whack-a-mole with regex bugs? Code smell.

Let's look at it another way: If you were going to modify Ruby code programmatically, would you use regular expressions? I seriously doubt it. You'd use something like ParseTree, which understands all of Ruby's syntax and will correctly interpret everything in context, not just in isolation.

What about YAML? Would you modify YAML files with regular expressions? Hells no. You'd slurp it with YAML.load(), modify the in-memory data structures, and then write it back out.
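For the record, that round trip is only a few lines. This is a minimal sketch using YAML.load, which returns plain hashes and arrays; the document contents and key names are invented for the example.

```ruby
require 'yaml'

# A made-up YAML document standing in for a real file.
source = <<~DOC
  site:
    title: Old Title
    tags:
      - ruby
DOC

# Parse into ordinary Ruby hashes and arrays.
config = YAML.load(source)

# Modify the in-memory structure; no string surgery involved.
config['site']['title'] = 'New Title'
config['site']['tags'] << 'nokogiri'

# Serialize it back out as YAML.
output = YAML.dump(config)
puts output
```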

Why wouldn't you do the same with HTML, which has its own nontrivial (and DTD-dependent) syntax?

Regular expressions just aren't the right tool for this job. Jamie Zawinski said it best:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Why, God? Why?

So, what drives otherwise intelligent people (myself included) to whip out regular expressions when it comes time to munge HTML?

My only guess is this: A lack of worthy XML/HTML libraries.

Whoa, whoa, put down the flamethrower and let me explain myself. By "worthy", I mean three things:

  • fast, high-performance, suitable for use in a web server
  • nice API, easy for a developer to learn and use
  • will successfully parse broken HTML commonly found on the intarwebs

libxml2 and libxml-ruby have been around for ages, and they're incredibly fast. But have you seen the API? It's totally sadistic, and as a result it's inappropriate and not easily usable in simple cases like the one described above.

Now, Hpricot is pure genius. It's pretty fast, and the API is absolutely delightful to work with. It supports CSS as well as XPath queries. I've even used it (with feed-normalizer) in a Rails application, and it performed reasonably well. But it's still much slower than regexes. Here's a (totally unfair) sample benchmark comparing Hpricot to a comparable (though buggy) regular expression (see below for a link to the benchmark gist):

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

So, historically, I haven't used Hpricot everywhere I could have, and that's because I was overly cautious about performance.

Get On With It, Already

Oooooh, if only there were a library with libxml2's speed and Hpricot's API. Then maybe people wouldn't keep trying to use regular expressions where an HTML parser is needed.

Oh wait, there is. Everyone, meet Nokogiri.

Check out the full benchmark, comparing the same operation (spanifying links and removing possibly-unsafe tags) across regular expressions, Hpricot and Nokogiri:

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000       1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10         0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

Wow! Nokogiri parsed and modified blog-sized HTML snippets in under 2 milliseconds! This performance, though significantly slower than regular expressions, is fast enough for me to consider using it in a web application server.

Hell, that's as fast as (faster, actually, than) BlueCloth or RedCloth can render Markdown or Textile of similar length. If you can justify using those in your web application, you can certainly afford the overhead of Nokogiri.

And as for usability, let's compare the regular expressions to the Nokogiri operations:

html.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>') # broken regex
html.gsub(/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/, '')

doc.search("a/text()").wrap("<span></span>")
doc.search("script","noscript","object","embed","style","frameset","frame","iframe").unlink

The Nokogiri version is much clearer. More maintainable, more robust and, for me, just fast enough to start jamming into all kinds of places.

Where Else Can I Use Nokogiri?

You can use Nokogiri anywhere you read, write or modify HTML or XML. It's your new swiss army knife.

What about your test cases? Merb is using Nokogiri extensively in their controller tests, and they're reportedly much faster than before. And those Merb dudes are S-M-R-T.

Have you thought about using Nokogiri::Builder to generate XML, instead of the default Rails XML template builder? Boy, I have. Upcoming blog post, hopefully.

Let me know where else you've found Nokogiri useful! Or better yet, join the mailing list and tell the community!

Nokogiri: World's Finest (XML/HTML) Saw Friday, October 31, 2008

Yesterday was a big day, and I nearly missed it, since I spent nearly all of the sunlight hours at the wheel of a car. Nine hours sitting on your butt is no way to ... oh wait, that's actually how I spend every day. Just usually not in a rental Hyundai. Never mind, I digress.

It was a big day because Nokogiri was released. I've spent quite a bit of time over the last couple of months working with Aaron Patterson (of Mechanize fame) on this excellent library, and so I'm walking around, feeling satisfied.

"What's Nokogiri?" Good question, I'm glad I asked it.

Nokogiri is the best damn XML/HTML parsing library out there in Rubyland. What makes it so good? You can search by XPath. You can search by CSS. You can search by both XPath and CSS. Plus, it uses libxml2 as the parsing engine, so it's fast. But the best part is, it's got a dead-simple interface that we shamelessly lifted from Hpricot, everyone's favorite delightful parser.

I had big plans to do a series of posts with examples and benchmarks, but right now I'm in DST Hell and don't have the quality time to invest.

So, as I am wont to do, I'm punting. Thankfully, Aaron was his usual prolific self, and has kindly provided lots of documentation and examples:

Use it in good health! Carry on.

P.S. Please start following Aaron on Twitter. :)

Rails Model Firewall Mixin Tuesday, August 26, 2008

At my company, Pharos, we're about to launch a new product which will contain sensitive data for multiple firms in a single database. This is essentially a lightweight version of our flagship product, which was built for a single client.

Of course, as a result, I had to refactor like crazy to get rid of the implicit "one-firm" assumption that was built into the code and database schemas.

The essential task was to add "firm_id" to each of the private table schemas, and then make sure that all the code that accesses the model specifies the firm in the query. Two access idioms were widely used in the code (unsurprisingly):

results = ClassName.find(:all, :conditions => [....])

and

results = ClassName.find_by_entity_id_and_hour(...)

I was able to make minimal changes to the code by supporting the following new idioms through a mixin (the mixin code is at the end of the article):

results = ClassName.find_in_firm_scope(firm_id, :all, :conditions => [....])

results = ClassName.with_firm_scope(firm_id) do |klass|
  klass.find_by_entity_id_and_hour(...)
end

(I found the second idiom easier to write, and its diff easier to read, than:

ClassName.find_by_firm_id_and_entity_id_and_hour(firm_id, ...)

but really, that's a matter of taste.)

But I was still nervous. What if I missed an instance of a database lookup that wasn't specifying firm, and as a result one client saw another client's records? That would be a Really Bad Thing™, and I want to explicitly make sure that can't happen. But how?

After a half hour of poking around and futzing, I came up with a find()-and-friends implementation that will check with_scope conditions as well as the :conditions parameter to the find() call:

>> My::PrivateModel.find_by_entity_id(1)
RuntimeError: My::PrivateModel PrivateRecord find() did not specify firm_id

Without further ado, here's the mixin:

# lib/private_record.rb
module PrivateRecord
  def self.included(base)
    base.validates_presence_of :firm_id
    base.extend PrivateRecordClassExtendor
  end
end

module PrivateRecordClassExtendor

  def find_every(*args)
    check_for_firm_id(*args)
    super(*args)
  end

  # the DRY idiom here is: results = ClassName.with_firm_scope(firm) {|klass| klass.find(...) }
  def with_firm_scope(firm, &block)
    with_scope(:find => {:conditions => "firm_id = #{firm}"}, :create => {:firm_id => firm}) do
      yield self
    end
  end

  def find_in_firm_scope(firm, *args)
    with_firm_scope(firm) do
      find(*args)
    end
  end

private
  FIRM_ID_RE = /firm_id =/
  def check_for_firm_id(*args)
    ok = false
    if scoped_methods
      scoped_methods.each do |j|
        if j[:find] && j[:find][:conditions] && j[:find][:conditions] =~ FIRM_ID_RE
          ok = true 
          break
        end
      end
    end
    if !ok
      args.each do |j|
        if j.is_a?(Hash) && j[:conditions]
          if (j[:conditions].is_a?(String) && j[:conditions] =~ FIRM_ID_RE) \
             or (j[:conditions].is_a?(Hash) && j[:conditions][:firm_id])
            ok = true 
            break
          end
        end
      end
    end
    raise "#{self} PrivateRecord find() did not specify firm_id" if !ok
  end
end

The magic is all in the check_for_firm_id() method. To use this, simply:

include PrivateRecord

and go to town.
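If you want to see what the check accepts and rejects without loading Rails, here is a standalone sketch of the same decision logic. The method name firm_scoped? is hypothetical, and it returns true or false instead of raising; the scopes argument stands in for scoped_methods.

```ruby
FIRM_ID_RE = /firm_id =/

# Hypothetical stand-in for check_for_firm_id.
# scopes: array of with_scope-style hashes; args: the arguments given to find().
def firm_scoped?(scopes, args)
  # First look for a firm_id condition in any enclosing scope.
  scoped = scopes.any? do |scope|
    conditions = scope.dig(:find, :conditions)
    conditions.is_a?(String) && conditions.match?(FIRM_ID_RE)
  end
  return true if scoped

  # Otherwise look at the :conditions option passed to find() itself.
  args.any? do |arg|
    next false unless arg.is_a?(Hash)
    conditions = arg[:conditions]
    (conditions.is_a?(String) && conditions.match?(FIRM_ID_RE)) ||
      (conditions.is_a?(Hash) && !conditions[:firm_id].nil?)
  end
end

puts firm_scoped?([], [:all, { :conditions => 'entity_id = 1' }])               # false
puts firm_scoped?([{ :find => { :conditions => 'firm_id = 1' } }], [:all])      # true
puts firm_scoped?([], [:first, { :conditions => { :firm_id => 1 } }])           # true
puts firm_scoped?([], [:all, { :conditions => ['firm_id = ?', 1] }])            # false
```

Note the last case: as far as I can tell from reading the mixin, Array-style conditions are rejected even when they mention firm_id, because only String and Hash conditions are inspected. That errs on the side of raising, which is the safe direction.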

Oh, and lest ye be skeptical, here are the test cases:

require File.dirname(__FILE__) + '/../test_helper'

class PrivateModelTest < ActiveSupport::TestCase

  fixtures :isone_da_schedules

  def test_privaterecord_disallow_find_requirement
    assert_raises(RuntimeError) { My::PrivateModel.find(1) }
    assert_raises(RuntimeError) { My::PrivateModel.find_by_entity_id(1) }
    assert_raises(RuntimeError) { My::PrivateModel.find_all_by_entity_id(1) }
    assert_raises(RuntimeError) { My::PrivateModel.find(:all, :conditions => 'entity_id = 1') }
    assert_raises(RuntimeError) { My::PrivateModel.find(:first, :conditions => 'entity_id = 1') }
  end

  def test_privaterecord_allow_find_requirement
    assert_nothing_raised { My::PrivateModel.find_in_firm_scope(1, 1) }
    assert_nothing_raised { My::PrivateModel.with_firm_scope(1) {|k| k.find_by_entity_id(1) } }
    assert_nothing_raised { My::PrivateModel.with_firm_scope(1) {|k| k.find_all_by_entity_id(1) } }
    assert_nothing_raised { My::PrivateModel.find_in_firm_scope(1, :all, :conditions => 'entity_id = 0') }
    assert_nothing_raised { My::PrivateModel.find_in_firm_scope(1, :first, :conditions => 'entity_id = 0') }
    assert_nothing_raised { My::PrivateModel.find_by_firm_id_and_entity_id(1, 1) }
  end

end

Let me know in the comments if you found this at all useful! Keep coding.

Freezing Deep Ruby Data Structures Sunday, August 24, 2008

On one of my current Ruby projects, I'm reading in a YAML file and using the generated data structure as a hackish set of global configuration settings:

firm_1:
    departments:
        sales: 419
        executive: 999
        IT: 232
    locations:
        NY: 19
        WV: 27
        CA: 102
firm_2:
    ...

Because these should be treated as constants, they should not be overwritten (accidentally, of course). I wanted to go ahead and freeze them:

global_conf = YAML.load_file("...")
global_conf.freeze
global_conf['firm_1'] = {'foo' => 'bar'}
=> TypeError: can't modify frozen hash

But, as you probably know, Ruby's freeze doesn't affect the objects in a container.

global_conf['firm_1']['departments'] = {'foo' => 'bar'}
=> {"foo"=>"bar"}

That's bad.

So I hacked up a quick monkeypatch (or whatever the duck punchers call it these days) to recursively freeze containers:

#
#  allow us to freeze deep data structures by recursively freezing each nested object
#
class Hash
    def deep_freeze # har, har, har
        each { |k,v| v.deep_freeze if v.respond_to? :deep_freeze }
        freeze
    end
end
class Array
    def deep_freeze
        each { |j| j.deep_freeze if j.respond_to? :deep_freeze }
        freeze
    end
end

After loading these patches, calling deep_freeze does what we want:

global_conf = YAML.load_file("...")
global_conf.deep_freeze
global_conf['firm_1']['departments'] = {'foo' => 'bar'}
=> TypeError: can't modify frozen hash

Nice!
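One gap worth knowing about: the patch above only recurses into objects that respond to deep_freeze, so leaf values such as strings stay mutable. Here's a minimal sketch of a standalone variant that freezes the leaves as well. The method name and the sample data are made up for this example, and on current Rubies the exception raised is FrozenError rather than the TypeError shown above.

```ruby
# Hypothetical helper: freeze a nested Hash/Array structure, leaves included.
def freeze_deeply(object)
  case object
  when Hash
    object.each do |key, value|
      freeze_deeply(key)
      freeze_deeply(value)
    end
  when Array
    object.each { |element| freeze_deeply(element) }
  end
  object.freeze
end

conf = freeze_deeply({
  'firm_1' => {
    'departments' => { 'sales' => 419 },
    'aliases'     => ['Firm One']
  }
})

puts conf['firm_1']['departments'].frozen?   # true
puts conf['firm_1']['aliases'][0].frozen?    # true

begin
  conf['firm_1']['departments']['sales'] = 0
rescue StandardError => e
  puts "refused: #{e.class}"
end
```

A plain method keeps Hash and Array unpatched, which is a matter of taste; the recursion is the same idea either way.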

Flash and the Firefox Reframe Problem Thursday, May 22, 2008

So I spent the last two days trying to figure out why Firefox insists on reloading flash content whenever I flip around in my tasty javascript-y tabbed interface.

You haven't seen this? I'm not surprised, it really only occurs if you're embedding flash into a web page whose layout is being managed by javascript. Some examples of UI libraries like this are Scriptaculous, Ext-JS, and my latest BSO, jQuery.

All of these libraries modify the style of the flash object's parent <div> in ways (usually display:none, but position:absolute will do it, too) that somehow goad Firefox into helpfully reloading the Flash from scratch. Reportedly it's not just swfobjects -- any generic <object> or <embed>, including Java applets, will get reloaded.

For flash charting components (we're playing with amCharts at my company, Pharos), this problem is multiplied by the fact that the flash application will re-download whatever historical data you're trying to present, delaying the presentation and using up more bandwidth. (Hey, how's your ETag support looking?)

It actually took me about two hours to find what the root problem is, and you're not going to believe it:

https://bugzilla.mozilla.org/show_bug.cgi?id=90268

This bug has been open since July 2001! That's back when the browser was Mozilla 0.9! Holy cripey!

Worse, it's still not fixed, even in the brand-spanking-new Firefox 3.

The good news is, there's a relatively easy way to get around this, if your JS library is using CSS for hiding elements. What I mean by that is, the javascript code is hiding elements by adding a class to them (in Ext-JS, this class name defaults to .x-hide-display), and is not setting display:none directly on your DOM elements. (You'll probably need to look at the implementation of hide() and show() for your specific library to know for sure.)

So if hiding DOM elements is done via style classes, the low-hanging fruit is to redefine the CSS rule to look like this:

.x-hide-display {
    display:block!important; /* overrides the display:none in the original rule */
    height:0!important;
    width:0!important;
    border:none!important;
    visibility:hidden!important;
}

(this is exactly what I did to make my flash charts work in Ext-JS).

You can probably override the hide() and show() functions in your particular library to do something like this, as well. YMMV.

Now, you're saying to yourself, "Dude, you must be breaking something else that used to depend on the display:none behavior." Well, you're probably right, but I haven't found it yet. If you know, or if you find out, let me know in the comments.

jQuery UI and Closable Tabs Thursday, May 15, 2008

So last week I decided (at my company, Pharos) to dump Ext-JS in favor of jQuery.

The short version is that Ext-JS is hard to style with CSS, plus I was getting odd sizing of objects in my (pretty complicated) layouts that I just couldn't figure out. (The longer version has to do with how easy (hard) it is to write and find contributed extensions.)

Anyway, I'm getting off-topic. jQuery rocks. And the jQuery-UI project is really coming along, in terms of functionality. They're pushing hard to get the 1.5 release candidate out the door.

There are some missing pieces, though, as in any young GUI project. But because I'm betting on jQuery, I'm willing to work to make it do what I want. Until today, this meant contributing some (very) minor bugfixes.

But this afternoon, I implemented closable tabs. Check out jQuery trac 2470 for the patch and working examples (including CSS).

Here's a couple of screenshots showing the closable tabs in 'all' mode, and 'selected' mode. And you can play around with it on the demo page!

General description:

  • A clickable "button" (really an A tag) appears on the tab. When the button is clicked, the tab is removed.
  • LI tags are dynamically modified to contain a second tag:
              <a onclick="return false;"><span>#{text}</span></a>
    
  • The #{text} snippet will be replaced by the configuration option closeText (which is '(x)' by default), and the snippet itself can be set via the configuration option closeTemplate.

Some specifics:

  • New creation option closable can be set to false, 'all' or 'selected'
    • default is false, meaning no closable tabs.
    • 'all' means all tabs are closable.
    • 'selected' means only the selected tab is closable.
  • New creation options closeTemplate and closeText allow overriding default markup.
  • When a tab is closable, a second A is dynamically added to the tab LI after the normal tab anchor
    • this tag is only added to the DOM if options.closable is non-false
    • this tag is hidden in unselected tabs if options.closable is 'selected'
  • CSS / styles
    • Note that this patch is backwards-compatible with CSS as long as the closable option is not turned on.
    • Close-button tag has class ui-tabs-close
    • However, existing CSS will probably need to be modified to support the new close button.
    • A new class, ui-tabs-tab is associated with the normal A to allow differentiation for themes/styles.
    • see examples.tar.gz for example CSS support

So, if you find the code useful, let me know! It's attached to jQuery trac 2470, along with the sample CSS and code in the snapshot. And don't forget to test drive it at the demo page!

Managing Git Submodules With git.rake Tuesday, May 06, 2008

Update 2008-05-21: Tim Dysinger and Pat Maddox pointed out that git submodules are inherently not well-suited for frequently updated projects. Read the comments for more details, and please use submodules with caution on projects where you can't guarantee a shared repository has not changed between 'pull' and 'push' operations.

Today I'm releasing git.rake into the wild under an open-source license. It's a rakefile for managing multiple git submodules in a shared-server development environment.

We've been using it internally at my company, Pharos Enterprise Intelligence, for the last 5 months and it's been a huge timesaver for us. Read below for a detailed description of the features and its use.

The code is being released under the MIT license and the git repository is being hosted on github. Take a look:

http://github.com/flavorjones/git-rake

What git.rake Is

A set of rake tasks that will:

  • Keep your superproject in synch with multiple submodules, and vice versa. This includes branching, merging, pushing and pulling to/from a shared server, and committing. (Biff!)

  • Keep a description of all changes made to submodules in the commit log of the superproject. (Bam!)

  • Display the status of each submodule and the superproject in an easily-scannable representation, suppressing what you don't want or need to see. (Pow!)

  • Execute arbitrary commands in each repository (submodule and superproject), terminating execution if something fails. (Whamm!)

  • Configure a rails project for use with git. (Although, you've seen that elsewhere and are justifiably unimpressed.)

Prerequisites

If you're not sure how to add a submodule to your repo, or you're not sure what a submodule is, take a quick trip over to the Git Submodule Tutorial, and then come back. In fact, even if you ARE familiar with submodules, it's probably worth reviewing.

The Problem We're Trying to Solve Here

Let's start with stating our basic assumptions:

  1. you're using a shared repository (like github)
  2. you're actively developing in one or more submodules

This model of development can get very tedious very quickly if you don't have the right tools, because every time you decide to "checkpoint" and commit your code (either locally or up to the shared server), you have to:

  • iterate through your submodules, doing things like:
    • making sure you're on the right branch,
    • making sure you've pulled changes down from the server,
    • making sure that you've committed your changes,
    • and pushed all your commits
  • and then making sure that your superproject's references to the submodules have also been committed and pushed.

If you do this a few times, you'll see that it's tedious and error-prone. You could mistakenly push a version of the superproject that refers to a local commit of a submodule. When people try to pull that down from the server, all hell will break loose because that commit won't exist for them.

Ugh! This is monkey work. Let's automate it.

Simple Solution

OK, fixing this issue sounds easy. All we have to do is:

  • develop some primitives for iterating over the submodules (and optionally the superproject),
  • and then throw some actual functionality on top for sanity checking, pulling, pushing and committing.
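To give a feel for how small those primitives can be, here is a rough sketch in plain Ruby. It is not the git.rake code: both helper names are hypothetical, the sample .gitmodules text is invented, and the real tasks do more checking than this.

```ruby
# Hypothetical helper: pull the submodule paths out of the text of a .gitmodules file.
def submodule_paths(gitmodules_text)
  gitmodules_text.scan(/^\s*path\s*=\s*(.+?)\s*$/).flatten
end

# Hypothetical helper: visit each repository in turn and stop at the first failure.
# The block does the actual work and returns true or false, for example:
#   for_each_repo(paths) { |path| system('git', 'status', :chdir => path) }
def for_each_repo(paths)
  paths.each do |path|
    raise "command failed in #{path}" unless yield(path)
  end
end

# Invented sample input; the URLs are placeholders.
gitmodules = <<~TEXT
  [submodule "vendor/plugins/core"]
    path = vendor/plugins/core
    url = git://example.com/core.git
  [submodule "vendor/plugins/pharos_library"]
    path = vendor/plugins/pharos_library
    url = git://example.com/pharos_library.git
TEXT

paths = submodule_paths(gitmodules)
for_each_repo(paths) { |path| puts "would run in #{path}"; true }
```

Passing the work in as a block keeps the iteration separate from whatever gets run, so the same loop can serve status, pull, push and commit.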

The Tasks

git-rake presents a set of tasks for dealing with the submodules:

    git:sub:commit     # git commit for submodules
    git:sub:diff       # git diff for submodules
    git:sub:for_each   # Execute a command in the root directory of each submodule.\
                         Requires CMD='command' environment variable.
    git:sub:pull       # git pull for submodules
    git:sub:push       # git push for submodules
    git:sub:status     # git status for submodules

And the corresponding tasks that run for the submodules PLUS the superproject:

    git:commit         # git commit for superproject and submodules
    git:diff           # git diff for superproject and submodules
    git:for_each       # Run command in all submodules and superproject. \
                         Requires CMD='command' environment variable.
    git:pull           # git pull for superproject and submodules
    git:push           # git push for superproject and submodules
    git:status         # git status for superproject and submodules

It's worth noting here that most of these tasks do pretty much just what they advertise, in some cases less, and certainly nothing more (well, maybe a sanity check or two, but no destructive actions).

The exception is git:commit, which depends on git:update, and that has some pixie dust in it. More on this below.

Leaving only the following specialty tasks to be explained:

    git:configure      # Configure Rails for git
    git:update         # Update superproject with current submodules

The first is simple: configuration of a rails project for use with git.

The other, git:update, does two powerful things:

  1. (Only if on branch 'master') Submodules are pushed to the shared server. This guarantees that the superproject will not have any references to local-only submodule commits.

  2. For each submodule, retrieve the git-log for all uncommitted (in the superproject) revisions, and jam them into a superproject commit message.

Here's an example of such a superproject commit message:

    commit 17272d53c298bd6a8ccee6528e0bc0d62104c268
    Author: Mike Dalessio <mike@csa.net>
    Date:   Mon May 5 20:48:13 2008 -0400

            updating to latest vendor/plugins/pharos_library

            > commit f4dbbce6177de4b561aa8388f3fa9f7bf015fa0b
            > Author: Mike Dalessio <mike@csa.net>
            > Date:   Mon May 5 20:47:46 2008 -0400
            >
            >     git:for_each now exits if any of the subcommands fails.
            >
            > commit 6f15dee8c52ced20c98eef63b3f3fd1c29d91bbf
            > Author: Mike Dalessio <mike@csa.net>
            > Date:   Fri May 2 13:58:17 2008 -0400
            >
            >     think i've got the tempfile handling correct now. awkward, but right.
            >

Excellent! Not only did git:update automatically generate a useful log message for me (indicating that we're updating to the latest submodule version), but it's also embedding original commit logs for all the changes included in that commit! That makes it much easier to find a specific submodule commit in the superproject commit log.
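The quoting convention in that message is simple enough to sketch. Again, this is an illustration of the format visible in the example and not the actual git.rake implementation; the method name is hypothetical, and the log text is assumed to come from running git log inside the submodule.

```ruby
# Hypothetical helper: build a superproject commit message that quotes
# the submodule's log, one "> " prefixed line per log line.
def superproject_message(submodule_path, submodule_log)
  quoted = submodule_log.each_line.map do |line|
    # Blank log lines become a bare ">" with no trailing space.
    "> #{line.chomp}".rstrip
  end
  (["updating to latest #{submodule_path}", ''] + quoted).join("\n")
end

log = <<~LOG
  commit f4dbbce6177de4b561aa8388f3fa9f7bf015fa0b
  Author: Mike Dalessio <mike@csa.net>

      git:for_each now exits if any of the subcommands fails.
LOG

puts superproject_message('vendor/plugins/pharos_library', log)
```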

A Note on Branching and Merging

Note that there are no tasks for handling branching and merging. This is intentional! It could be very dangerous to try to read your mind about actions on branches, and frankly, I'm just not up to it today.

For example, let's say I invented a task to copy the current branch master to a new branch foo (the equivalent of git checkout -b foo master) in all submodules, but one of the submodules already has a branch named foo!

Do we reduce this action to a simple git checkout foo for that submodule? That could yield unexpected results if we a) forgot we had a branch named foo and b) that branch is very different from the master we expected to copy.

Well, then -- we can delete (or rename) the existing foo branch and follow that up by copying master to foo. But then we're silently renaming (or deleting) branches that a) could be upstream on the shared server or b) we intended to keep around, but forgot to git-stash.

In any case, my point is that it can get complicated, and so I'm punting. If you want to copy branches or do simple checkouts, you should use the git:for_each command.

Everyday Use of git.rake

In my day job, I've taken the vendor-everything approach and refactored lots of common code (across clients) into plugins, which are each a git submodule. My current project has 14 submodules, of which I am actively coding in probably 5 to 7 at any one time. (Plenty of motivation for creating git.rake right there.)

Let's say I've hacked for an hour or two and am ready to commit to my local repository. Let's first take a look at what's changed:

    $ rake git:status

    All repositories are on branch 'master'
    /home/mike/git-repos/demo1/vendor/plugins/core: master, changes need to be committed
    #   modified:   app/models/user_mailer.rb
    #   public/images/mail_alert.png        (may need to be 'git add'ed)
    WARNING: vendor/plugins/core needs to be pushed to remote origin
    /home/mike/git-repos/demo1/vendor/plugins/pharos_library: master, changes need to be committed
    #   deleted:    tasks/rake/git.rake

You'll notice first of all that, despite having 14 submodules, I'm only seeing output for the ones that need commits, and even that output is minimal, listing only the specific files and not all the cruft in the original message. It tells me that all submodules are on the same branch. It's smart enough to tell me that a file may need to be git-added. It will even alert me when a repo needs to be pushed to the origin.

I'll have to manually cd to the submodule and git-add that one file, but once that's done, I can commit my changes by running:

    $ rake git:commit

which will run git commit -a -v for each submodule, fire up the editor for commit messages along the way, push each submodule to the shared server, and then automagically create verbose commit logs for the superproject.

To pull changes from the shared server:

    $ rake git:pull

When you run this command, you'll notice that the output is filtered, so if no changes were pulled, you'll see no output. Silence is golden.

To push?

    $ rake git:push

Not only will this be silent if there's nothing to push, but the rake task is smart enough to not even attempt to push to the server if master is no different from origin/master. So it's silent and fast.

Let's say I want to copy the current branch, master, to a new branch, working.

    $ rake git:for_each CMD='git checkout -b working master'

If the command fails for any submodules, the rake task will terminate immediately.

Merging changes from 'working' back into 'master' for every submodule (and the superproject)?

    $ rake git:for_each CMD='git checkout master'
    $ rake git:for_each CMD='git merge working'

What git.rake Doesn't Do

A couple of things that come quickly to mind that git.rake should probably do:

  • Push to the shared server for ANY branch that we're tracking from a remote branch.

  • Be more intelligent about when we push to the server. Right now, the code pushes submodules to the shared server every time we want to commit the superproject. We might be able to get away with only pushing the submodules when we push the superproject.

  • Parsing the output from various 'git' commands is prone to breakage if the git crew starts modifying some of the strings.

  • There should probably be some unit/functional tests. See previous item.

Anyway, the code is all up on github. Go hack it, and send back patches!

JS development on IE is busted. Monday, April 21, 2008

My git commit for this afternoon, following 3 hours of debugging and work, contained the following description:

IE7 fixes. DAMN that browser is busted.

Look, I'm not going to go off on a rant, but there are lots of things that can be done to make debugging Javascript in the browser easier, and Microsoft (and the windows community) has done exactly none of them.

1. Javascript console

Hello? I'd like to see what the error is, and where it's happening. By default, all that IE gives you is the Gray Box of Doom that tells you the problem is on line 24696, but won't tell you which file it's referring to.

A quick Google query for IE7 javascript console does a good job at showing the general level of pain about this out there.

Firefox has a basic Javascript console built in. Open source 1, Microsoft 0.

2. Javascript debugger

Microsoft Script Debugger is the only standalone tool available, and it's no longer supported by MS. The other options require installation of either FrontPage or Visual Studio. Puh-lease.

Firebug is free for Firefox. Open source 2, Microsoft 0.

I did find a nice tool called DebugBar, but it's only free for personal use. Even when I test-drove it, though, most functionality didn't work properly for dynamically-created DOM elements. So, anything you've created or updated via AJAX calls is not going to be debuggable by DebugBar. Lose! This is basically everything that excellent javascript frameworks and libraries like ExtJS, Dojo and Scriptaculous have been working towards for years.

3. Basic EcmaScript extensions

Array.forEach() doesn't work? That's been around since JavaScript 1.6! That's right, IE7 still doesn't implement any of the crafty Array iterator methods.

I'm just going to point you at this terrific blog entry detailing the changelist for Microsoft Javascript support since 2001. (Hint: the changelist is empty.)

Got that? In seven years, IE has not improved its Javascript support one whit. Where were you in 2001?

Just. Effing. Boggling.

At the end of all that, which quite frankly made me more dumb than when I started, I found myself asking the question: "Can I get away without supporting IE in my product?"

The realistic answer is obvious, but doesn't the fact that I'm asking the question in the first place tell you that something is seriously busticated?

The details behind the Pit of Despair Known As Internet Explorer have been covered in way more detail (and by more knowledgeable people) than I can hope to do. I'm just adding my voice to the chorus of "WTF?"s that are already out there.

If anyone knows of better tools for debugging a rich javascript application on IE7, puh-lease let me know.

(Re-) Starting Up Thursday, April 17, 2008

Hey there. It's been a while. Sorry about that. Thankfully, the interweb (that's you!) hasn't gone anywhere.

I've recently started up my own software company with a close friend from college, so I figured I'd resurrect my blog to record for posterity how the startup is going. We're doing some interesting software development (by some people's standards, anyway), so there'll be some articles in that vein, as well as anecdotes about running the business, deep thoughts on existential topics like "What kind of monkey is best?" (answer: they're all the best) and just plain old me-being-me (my Mom tells me I'm funny).

My company is Pharos Enterprise Intelligence, and just this week our alpha test site (invitation only) went live with Engine Yard. I'll talk more about the product, the technology and our business model in later posts. You'll just have to wait.

And to top it off, our public-facing intarnets site went live this week. If you really loved me, you'd subscribe to our RSS feed.