Upgrading a Rails application incrementally

By Luke Francl (look@recursion.org), January 2018

One of the major projects I led at my last Rails gig was upgrading our application from Rails 3.2 to Rails 4.2. I'd like to share some lessons I learned about upgrading a large Rails application with minimal problems. Some operational issues I'll discuss are specific to the Rails 3.2 to 4.2 upgrade, but most of this should be relevant to any incremental Rails upgrade.

There are three ways you can upgrade a Rails application:

  1. Stop the world. No feature development, everyone works on the Rails upgrade.
  2. Long-lived upgrade branch. Developers attempt to keep up with ongoing master branch development.
  3. Incremental upgrade. Get the application working under multiple versions of Rails and incrementally fix compatibility problems by conditionalizing code to produce the same behavior in the old and new version.

We didn't want to drop everything to do the upgrade, and we knew from prior experience that a long-lived upgrade branch would be painful. Therefore, we decided to attempt an incremental upgrade and gradual rollout. This is more work, but it has a lot of advantages.

To accomplish this, my plan was to get the application running on both Rails 3.2 and Rails 4.2. Once the tests were passing for both versions, we would verify the new version with a small amount of API traffic and then gradually switch servers to the new version. Finally, once everything was running under 100% Rails 4, we could delete the conditional logic necessary to support Rails 3.

Prerequisites for a gradual Rails rollout

Before jumping into an incremental upgrade, you need to be operationally ready.

Thorough tests and continuous integration

This is probably the most important prerequisite, because without tests and CI you cannot even get started. Once your application can bundle successfully with both versions of Rails, you can create a separate build for the upgrade version. A wide breadth of tests is important (we had unit tests, integration tests, and Capybara-based system tests), but no test suite can catch all upgrade problems, because the test environment differs from production.

While we worked on the upgrade, the Rails 4 CI build was marked as optional, but once all the tests were fixed, we marked it as required and a failure would block merging to master.

Canary deployment

Even the best tests are not sufficient. Even with extraordinary measures to synchronize behavior between the test environment and production, there will be differences -- particularly things like class loading, caching, and assets -- and some errors will only manifest in production.

Tests run in an internally consistent environment. During a gradual rollout, your production environment will not be internally consistent (this is also true of any rolling deployment). You need a way to determine how your system will interact when running multiple versions simultaneously.

Testing with production traffic uncovers unknown problems not covered by tests. This could be due to legacy data only present in the production database, performance problems caused by scale of requests or data size, or simply missing test cases.

Monitoring

When you're running your application with multiple versions of Rails, you will need monitoring to get insight into how the versions compare.

We hooked into the ActiveSupport instrumentation to enable this. We used code like this to grab ActiveSupport events and record each one in StatsD under several metric names, including the Rails version in the name.

# Replace anything that is not a word character so the name is safe in a StatsD key.
def sanitize_metric_name(metric_name)
  metric_name.to_s.strip.gsub(/\W/, '-')
end

# Subscribe to the event Rails publishes after each controller action.
ActiveSupport::Notifications.subscribe(/process_action.action_controller/) do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  controller = event.payload[:controller]
  action = event.payload[:action]

  prefix = "app"
  suffix = "controller.#{sanitize_metric_name(controller)}.action.#{sanitize_metric_name(action)}"

  # Record the same timing under a global scope and two version-specific scopes.
  scopes = [
    "global",
    "rails-#{Rails::VERSION::MAJOR}-#{Rails::VERSION::MINOR}",
    "rails-#{sanitize_metric_name(Rails.version)}"
  ]

  scopes.each do |scope|
    StatsD.measure("#{prefix}.#{scope}.#{suffix}", event.duration)
  end
end

With this approach, you can create a dashboard that compares number of requests, performance characteristics, and error rates between the versions.

Originally, I recorded a metric with the full Rails version, as shown in the code above, because this could be useful for patch upgrades. But this used too much storage space and had to be removed. Other tools may not have this limitation (for example, with Datadog, you could use a tag instead).
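As a rough illustration of the tag-based alternative, here is a minimal sketch that records one metric per request and attaches the Rails version as tags. This is not what we ran: RecordingStatsClient, record_action_timing, and the tag names are hypothetical, and the client is a stand-in so the example runs without any gems. With the dogstatsd-ruby gem you would pass a real client whose timing method accepts a tags: option.

```ruby
# Hypothetical stand-in for a tag-aware StatsD client. It only records calls
# so the sketch runs without any gems installed.
class RecordingStatsClient
  attr_reader :calls

  def initialize
    @calls = []
  end

  def timing(name, milliseconds, tags: [])
    @calls << { name: name, ms: milliseconds, tags: tags }
  end
end

def sanitize_metric_name(metric_name)
  metric_name.to_s.strip.gsub(/\W/, '-')
end

# Record a single timing and describe the Rails version with tags,
# instead of emitting one metric name per version scope.
def record_action_timing(client, controller:, action:, duration_ms:, rails_version:)
  major, minor = rails_version.split('.').first(2)
  name = "app.controller.#{sanitize_metric_name(controller)}" \
         ".action.#{sanitize_metric_name(action)}"
  tags = ["rails_major_minor:#{major}.#{minor}", "rails_version:#{rails_version}"]
  client.timing(name, duration_ms, tags: tags)
end
```

Because the version lives in a tag rather than in the metric name, a dashboard can group or filter by version without a separate series for every scope.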

Process for a no-downtime Rails upgrade

Once you have the prerequisites in place, you will follow roughly this process:

  1. Research
  2. Initial exploration
  3. Parameterize your Gemfile
  4. Opportunistic dependency upgrade and removal
  5. Fix the tests, adding conditional logic where necessary
  6. Fix operational problems
  7. Gradually roll out the new version
  8. Remove obsolete code

Research

Before getting started, do some research. Read the Rails upgrade guide thoroughly -- you will need to refer back to it often. Also, check out blog posts about upgrading to the targeted version and review your major gems to see if they have compatibility problems with your targeted version of Rails. Ready for Rails can help.

Finally, read up on gradual rollouts. Upgrading GitHub to Rails 3 with Zero Downtime by Shay Frendt lays out how to approach a zero-downtime upgrade and provided inspiration for my work.

Initial exploration

After doing some research, the next thing I recommend is getting your app booting under the target version of Rails. Update the rails gem and run bundle install until it works, updating or commenting out dependencies as you go. This will tell you (roughly) how much pain you are in for.

When I did this, it was hack-and-slash, throwaway work. I wanted to see how many gems would be affected by the upgrade. I was able to run the tests locally and use some parts of the app in my development environment. This gave me some data to work with.

Parameterize your Gemfile

Since multiple Rails versions cannot coexist in the same Gemfile.lock, you will need one lockfile for each version of Rails. Fortunately, a Gemfile is really a Ruby program (this is why Bundler 2 renames Gemfile to gems.rb), so it's possible to parameterize it with an environment variable and generate multiple Gemfile.locks.

Your Gemfile will end up looking something like this:

def rails3?
  ENV['RAILS_VERSION'] == '3'
end
 
if rails3?
  gem 'rails', '3.2.0.1'
  # ...more Rails 3 only gems
else
  gem 'rails', '4.2.1.1'
  # ...more Rails 4 only gems
end
 
# ...lots more gems

There are two gotchas here:

  1. If you vendor your dependencies, you need to make sure the different versions don't prune each other's gems. The --no-prune option does this.
  2. You must update all lock files every time you bundle. The --gemfile argument can be used to choose the Gemfile. A file named Gemfile-rails4 will cause a Gemfile-rails4.lock file to be created.

We created a script to ensure that Bundler was always invoked with the correct arguments and run for both versions of Rails.

In our setup, we had one Gemfile, with Gemfile-rails3 and Gemfile-rails4 as symlinks to it. Then to bundle, our script/bundle wrapper would run RAILS_VERSION=3 bundle install --gemfile Gemfile-rails3 --no-prune, then RAILS_VERSION=4 bundle install --gemfile Gemfile-rails4 --no-prune.
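A wrapper along these lines could look like the sketch below. It is a reconstruction, not our actual script: the names bundle_invocation and bundle_all are hypothetical, and the runner argument exists only so the logic can be exercised without invoking Bundler.

```ruby
# Hypothetical script/bundle wrapper: run `bundle install` once per Rails version.
RAILS_VERSIONS = %w[3 4].freeze

# Build the environment and command for one Rails version.
def bundle_invocation(rails_version)
  env = { 'RAILS_VERSION' => rails_version }
  command = ['bundle', 'install', '--gemfile', "Gemfile-rails#{rails_version}", '--no-prune']
  [env, command]
end

# Run every invocation in order, stopping at the first failure.
# Returns true only if all versions bundled successfully.
def bundle_all(runner: ->(env, command) { system(env, *command) })
  RAILS_VERSIONS.all? do |version|
    env, command = bundle_invocation(version)
    runner.call(env, command)
  end
end

# In the script itself, the last line might be:
#   exit(bundle_all ? 0 : 1)
```

Stopping at the first failure makes it harder to commit one updated lockfile while the other is stale.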

To launch your application with one Gemfile or the other, use the BUNDLE_GEMFILE environment variable: BUNDLE_GEMFILE=Gemfile-rails4 bundle exec rails server

One final bundler gotcha: --no-prune is a persistent setting that's added to .bundle/config. Once you're done upgrading, remove that configuration so you can prune the unnecessary gems.

Opportunistic dependency upgrade and removal

In the run-up to the Rails upgrade, my colleagues and I took every opportunity to remove unneeded or easily replaceable dependencies. This entailed cleaning up dead code, removing gems, and deleting monkey patches. Some of the gems or monkey patches were incompatible with Rails 4, but others were removed to lower the surface area for the upgrade.

There may also be backports of new Rails features that you can include in your application to reduce version-conditional code. For example, the strong_parameters gem brings parameter whitelisting to Rails 3. Additionally, many deprecated Rails features have been pulled out into gems like activerecord-deprecated_finders. If your app uses deprecated APIs, consider using a gem now and worrying about fixing the deprecations after you've finished the Rails upgrade.

Fix the tests, adding conditional logic where necessary

The bulk of the work will be in fixing tests until they pass under the new version of Rails. Each change should be a separate pull request and you should merge frequently to master to avoid the curse of a long-running upgrade branch.

Encapsulating API changes

To keep conditional logic isolated, we encapsulated it in a RailsVersionSupport module. Our rule of thumb was that you could only use the RailsVersionSupport.rails3? and RailsVersionSupport.rails4? methods inside RailsVersionSupport itself. Intention-revealing methods were used everywhere else in the codebase.

For example, different modules needed to be included in our API controllers depending on the Rails version, so we extracted that into a method call:

class Api::SomeController < ActionController::Metal
  RailsVersionSupport.include_api_controller_modules!(self)

  def index
    # ...
  end
end
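To make the shape of such a module concrete, here is a minimal sketch. It is an illustration only: Rails3ApiModules and Rails4ApiModules are hypothetical stand-ins for whatever controller modules your app needs, and the fallback to the RAILS_VERSION environment variable is an assumption added so the example runs outside a Rails process.

```ruby
# A sketch of a version-support module. Version checks live here and nowhere else.
module RailsVersionSupport
  # Hypothetical stand-ins for the real per-version controller modules.
  module Rails3ApiModules; end
  module Rails4ApiModules; end

  # Use the loaded Rails version when available; otherwise fall back to the
  # RAILS_VERSION environment variable (an assumption for this sketch).
  def self.rails_major
    if defined?(::Rails::VERSION::MAJOR)
      ::Rails::VERSION::MAJOR
    else
      Integer(ENV.fetch('RAILS_VERSION', '4'))
    end
  end

  def self.rails3?
    rails_major == 3
  end

  def self.rails4?
    rails_major == 4
  end

  # Intention-revealing entry point: callers never ask which version is running.
  def self.include_api_controller_modules!(controller_class)
    mod = rails3? ? Rails3ApiModules : Rails4ApiModules
    controller_class.send(:include, mod)
  end
end
```

Keeping the version checks behind named methods means that removing Rails 3 support later only touches this one module.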

As another example, the SimpleForm gem needed to be conditionally upgraded, but it had a slightly different API for text labels. To support both APIs, we added a simple_form_text_label method like this:

def self.simple_form_text_label
  if rails3?
    lambda { |label, required| label }
  else
    lambda { |label, required, explicit_label| label }
  end
end

Then, where the label_text is set, the correct API for the current Rails version is used:

config.label_text = RailsVersionSupport.simple_form_text_label

As the project proceeded, RailsVersionSupport grew. Finally, when Rails 3 support was removed, all the Rails 3 supporting code was removed from RailsVersionSupport. But we kept the module. Rails 5 had just been released, after all...

Fix operational problems

Once all the tests are passing, it's time to test the upgrade with some live traffic. This will show operational problems which you need to fix.

Almost all of the problems we encountered were due to shared state between application instances. These are the type of problems that only occur while you're running multiple versions at the same time in production.

The two versions of the app are writing to and reading from the same database, cache, and cookies. If that data is not backwards and forwards compatible, there will be problems.

The problems you will run into depend on the Rails versions you are upgrading from and to, and on the libraries you use, so it's impossible to cover every problem. Here are a few that I think will be common. Remember that the errors may occur in either version of the application because they are sharing state.

Incompatible cache serialization

One of the first problems we noticed when running Rails 4 in the canary was exceptions in the Rails 3 app servers. This happened because Rails 3 and 4 write incompatible ActiveSupport::Cache::Entry instances into the cache. Rails 4 could read objects written to the cache by Rails 3, but not vice-versa. We made an attempt to monkey patch ActiveSupport::Cache::Entry but after repeated problems we decided to split the cache using cache namespaces. We gave each environment/version combination its own isolated cache.
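Splitting the cache can be as small as deriving the namespace from the environment and Rails version. The sketch below is a guess at the shape, not our exact configuration: cache_namespace is a hypothetical helper, and the memcached store and host name in the comment are assumptions. The :namespace option itself is a standard ActiveSupport cache store option.

```ruby
# Build a cache namespace that is unique per environment and Rails major version,
# so Rails 3 and Rails 4 processes never read each other's cache entries.
def cache_namespace(environment, rails_major)
  "#{environment}-rails#{rails_major}"
end

# In an environment file this might be used as follows (store and host assumed):
#   config.cache_store = :mem_cache_store, 'cache.example.internal',
#                        { namespace: cache_namespace(Rails.env, Rails::VERSION::MAJOR) }
```

The trade-off is that the two versions no longer share entries, so the new version starts with a cold cache and an expiration issued by one version does not reach the other's copy.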

Incompatible Rails flash serialization

There was a similar problem with the Rails flash. Rails 3 cannot deserialize the Rails 4 flash. If your app servers are a mix of Rails 3 and 4 while you roll out, this is a problem.

Unicode escaping fix

One of the subtlest problems we ran into was related to JSON and Unicode. While running the upgrade canary, some input came in containing emoji. Under Rails 3, ActiveSupport::JSON.encode escaped Unicode characters. However, this was broken for Unicode characters outside the Basic Multilingual Plane (like most, but not all, emoji), resulting in these characters being stripped. This was fixed in Rails 4.0 by not escaping these characters (since JSON strings are UTF-8, escaping is unnecessary).

However, our analytics database was configured to use MySQL's broken utf8 encoding. The Rails 4 JSON encoding fix resulted in an event being emitted from the canary with a multibyte character. The event was consumed by our analytics pipeline, but then the resulting metric row failed to insert into our misconfigured MySQL table. Due to the asynchronous nature of this process, it took us quite a while to figure out the source of the failed analytics job!
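If you suspect the same issue, a check like the following can flag strings that a 3-byte utf8 column cannot store. It is a small sketch with hypothetical helper names, and it assumes the input strings are UTF-8 encoded.

```ruby
# MySQL's legacy utf8 charset stores at most 3 bytes per character, which covers
# only the Basic Multilingual Plane. Code points above U+FFFF need 4 bytes in UTF-8.
def outside_bmp?(string)
  string.each_char.any? { |char| char.ord > 0xFFFF }
end

# Return the characters that a 3-byte utf8 column could not store.
def four_byte_characters(string)
  string.each_char.select { |char| char.bytesize == 4 }
end
```

For example, "\u{1F600}" (a grinning face) is outside the BMP and is flagged, while "\u2764" (a heart) is inside the BMP and is not, which is why only some emoji triggered the failure.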

Side note: This problem caused me to forget my daughter at day care. I was deep into debugging at a coworker's desk and lost track of time. Oops.

Asset pipeline changes

Ah, everyone's favorite Rails feature, the asset pipeline. Changes in the asset pipeline between Rails 3 and Rails 4 caused us a lot of headaches because problems did not show up in tests, asset manifests aren't used in local development, and we didn't have an easy way of testing a partial rollout.

In Rails 4, the asset digest algorithm changed, so a different digested filename is produced, and non-digested files are no longer included in the generated assets. The manifest file format also changed. We worked around these changes by building assets twice -- once under Rails 3 and once under Rails 4 -- and deploying both sets of digested assets and manifest files.

There were also changes in how the asset helpers behaved. The fallback behavior when an asset couldn't be found produced a different path on Rails 3 and Rails 4. As part of this work, we discovered that our background workers were using assets (for example, for image URLs in HTML emails) without the asset manifest. Instead, they relied on the fallback behavior. This happened to work because Rails 3 compiled assets included the undigested file, but Rails 4 no longer included these files. We fixed this by uploading the asset manifest to work servers as well.

Sprockets is difficult to understand and debug. If I had to do it over, I would investigate switching to webpack instead of fixing the assets to work on both versions.

Gradually roll out the new version

When I worked on the project, we had a monolithic application that ran in multiple environments, including admin, web, work, and api.

These were separate Rails environments that inherited the same base production settings.

Because our application ran the same code under both versions of Rails, the only configuration change necessary to switch the version on a server was changing an environment variable.
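One way to make a single variable drive everything is to derive BUNDLE_GEMFILE from RAILS_VERSION at boot. The sketch below is an assumption about how that could be wired, not a description of our setup: gemfile_for is a hypothetical helper, and the default of '3' is arbitrary.

```ruby
# Map the RAILS_VERSION environment variable to the matching Gemfile path.
# File names follow the Gemfile-rails3 / Gemfile-rails4 convention; the default
# version of '3' is an assumption for this sketch.
def gemfile_for(env, app_root)
  version = env.fetch('RAILS_VERSION', '3')
  unless %w[3 4].include?(version)
    raise ArgumentError, "unsupported RAILS_VERSION: #{version.inspect}"
  end
  File.join(app_root, "Gemfile-rails#{version}")
end

# Near the top of config/boot.rb, before Bundler is set up, this could be:
#   ENV['BUNDLE_GEMFILE'] ||= gemfile_for(ENV, File.expand_path('../..', __FILE__))
```

Failing loudly on an unknown value avoids silently booting a server on the wrong version because of a typo in its configuration.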

Our rollout looked like this:

  1. Test on the canary repeatedly until every problem we could find was identified and fixed
  2. Deploy the Rails upgrade to the admin environment
  3. Deploy the upgrade to the web environment
  4. Switch work servers over one by one
  5. Switch api servers over one by one

After each rollout step, we would check the metrics to make sure all looked well before proceeding. All told, the full rollout took about one week.

If you do everything right, you'll be able to enjoy a graph like this:

What a wonderful feeling.

Remove obsolete code

🎉 You're done! Once the new Rails version has burned in for a while in production, the last step is getting rid of all the crufty code and configuration for running the old version. I recommend keeping your RailsVersionSupport and Gemfile configuration code, because you'll probably need to upgrade again soon.

Is it worth it?

It felt good to get to a modern version of Rails, but upgrading is never any company's top priority. It's hard to say when it makes sense to attempt a big upgrade. After all, what you have is working fine.

But as more time goes by, your version gets cruftier and cruftier, slowing you down, like barnacles on a ship. Documentation goes out of date. You can't use new features. Libraries release fixes that are only compatible with a version you can't use. Morale and recruiting suffer as developers get the message that modern tools are not a priority. Eventually, security updates are no longer released and you put the business at risk.

However, the incremental upgrade techniques discussed here help reduce the risk and cost of upgrading, making it easier to stay current. If you build the institutional knowledge about safe upgrades, you can make it part of your workflow instead of a scary task to put off for some other time.

Thanks to Tom Copeland and Brian Stevenson for reviewing drafts of this article. All errors are mine.