Debugging the whole ecosystem

I was recently exporting a large number of records via CSV, some 130k rows. Luckily, the rows only consisted of three columns.

When testing the output, I noticed that only 65535 rows had been written.
A nice round number, being the highest positive integer that can be stored in an unsigned 16-bit value (2^16 - 1).

I checked the docs for Ruby’s IO#write and noted, from the related IO#write_nonblock, that output may be buffered.
So, stepping through the loop from the working case to the failing one:

File.open(path, 'wb') do |f|
  records.each.with_index do |record, index|
    binding.pry if index >= 65534
    f.write(record.csv_row)
  end
end

I could see that the write failed to alter the size of the file at path, while manually calling f.flush && f.write(row_data) did seem to change the file size.

Next, I inserted an IO#flush call into the output loop every 100 rows or so.

File.open(path, 'wb') do |f|
  records.each.with_index do |record, index|
    f.write(record.csv_row)
    f.flush if index % 100 == 0
  end
end

Loading the output into Apple Pages, there was still no difference: still limited to 65535 rows. On a hunch, I opened the file in vim. Lo and behold! 130k lines ready and waiting for me. Going back, I removed the IO#flush call, regenerated the file, and the result was the same: Pages reports 65535 rows, vim reports 130k lines.
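
If you want to verify a row count without trusting any particular viewer, a one-liner in Ruby does the job. This is purely an illustrative check, not part of the original export code:

# Count the lines in the generated CSV, independent of any GUI tool
puts File.foreach(path).count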

So, moral of the story… don’t rely on pretty software for large data sets? Trust but verify? Commandline wins?

A Rate-limited Sidekiq Job – Part 2

In the last post, I talked about an initial approach to a rate-limited Sidekiq job. Sadly, it didn’t scale for us: thousands of jobs were being tried and retried, all of them exiting early because the rate limit had been exceeded.

The implementation we have settled on is slightly different.

Instead of raising an exception when the rate limit has been met (forcing the job back onto the queue, only for the next one to be tried immediately), we wait.

Our original ratelimited() method changes from

def ratelimited(&block)
  raise "Ratelimit met" if ratelimit.exceeded?(RL_SUBJECT, interval: RL_INTERVAL, threshold: RL_THRESHOLD)
  ratelimit.add(RL_SUBJECT)
  block.call
end

to

def ratelimited(&block)
  ratelimit.exec_within_threshold(RL_SUBJECT, interval: RL_INTERVAL, threshold: RL_THRESHOLD) do
    ratelimit.add(RL_SUBJECT)
    block.call
  end
end

Here, the Ratelimit gem provides a method that takes a block and executes it if the limit has not been met. Otherwise, it sleeps until enough of the RL_INTERVAL has elapsed, then runs the block.
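
Conceptually, exec_within_threshold behaves something like the sketch below. This is a simplification for illustration, not the gem’s actual source (the real implementation sleeps in small bucket-sized increments):

def exec_within_threshold(subject, options = {})
  # Wait in short increments while the subject is still over its threshold...
  sleep(1) while exceeded?(subject, options)
  # ...then run the caller's block once we're back under the limit
  yield
end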

Sleeping until the rate limit is no longer exceeded prevents Sidekiq from thrashing through jobs, but it has a drawback: if you have other jobs in the same queue, these will be blocked too.

We solve this by running different Sidekiq queues: one for the rate-limited task (in our case, Strava user sync) and another for everything else. However, we still have a problem, as the strava queue will block the Sidekiq process. To solve this, we specify that Sidekiq should run multiple processes, each with a different set of queues.
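
The post doesn’t show how jobs end up on the strava queue; one common way is via sidekiq_options in the job class, along these lines (a sketch, assuming the StravaSyncUserJob from the previous post):

class StravaSyncUserJob
  include Sidekiq::Worker

  # Route this job to the dedicated, rate-limited queue
  sidekiq_options queue: 'strava'
end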

As we deploy with Capistrano, the Sidekiq processes are configured using the capistrano-sidekiq gem.

In config/deploy/production.rb:

set :sidekiq_processes, 2
set :sidekiq_options_per_process, [
  "--concurrency 1 --queue strava",
  "--concurrency 10 --queue default --queue some_other_queue"
]

Above, we see capistrano-sidekiq properties being set to ensure two separate Sidekiq processes are running. Each runs a different set of queues, and we ask the process handling the strava queue to only bother with one thread. More would be fine, but unless the jobs take a long time, one thread can probably saturate the Strava API rate limit.

A Rate-limited Sidekiq Job – Part 1

Problem: A background task needs to hit an external service, but not too frequently

Solution: Use the Ratelimit gem

What do we do when the limit has been met?

Recently, I’ve been working on the above. I thought I’d write about the initial solution I tried, which is fine, but doesn’t quite scale if you are frequently hitting the limit. The next post will address an alternate take.

TL;DR: raise an exception if the rate limit has been met, and let Sidekiq re-queue the job for later.

Sidekiq has a great feature, in that failed jobs will be re-queued.

Some details on the rate limits involved:

We want to hit the external service (Strava API) a maximum of 600 times in a fifteen minute period (900 seconds).

Let’s build the basic Sidekiq job, retrying every fifteen minutes on failure (to let the rate limit replenish):

class StravaSyncUserJob
  include Sidekiq::Worker
  
  sidekiq_retry_in do |count|
    15.minutes
  end

  def perform(user_id)
  end
end
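
For completeness, these jobs would be enqueued with the standard Sidekiq API, for example once per user due a sync. The scheduling code isn’t shown in the post, so treat this as illustrative (it assumes an ActiveRecord User model):

# Enqueue a sync job per user; Sidekiq picks these up asynchronously
User.find_each do |user|
  StravaSyncUserJob.perform_async(user.id)
end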

Now, if our StravaSyncUserJob#perform implementation raises, the task will be shelved for a later attempt. Let’s configure the rate limiting with the Ratelimit gem.

In our Gemfile:

gem 'ratelimit'

Then install the gem with bundle install

Now we’ll set up some values:

RL_SUBJECT = "users" # Just a way to separate different ratelimit counts, can be any string in our case

# 600 hits per 15 minutes
RL_THRESHOLD = 600
RL_INTERVAL = 15.minutes

And implement our sidekiq perform method:

def perform(user_id)
  ratelimited do
    fetch_strava_data(user_id)
  end
end

Creating the ratelimit object is easy: we give it a unique key and, optionally, an instance of the Redis client if we already have one for other purposes:

def ratelimit
  @ratelimit ||= Ratelimit.new(
    "strava_sync",
    redis: $redis # We are already using Redis elsewhere in the app. If you aren't, leave out this parameter
  )
end

Next, we’ll implement the ratelimited method, which accepts a block and only calls it if the service has not exceeded the limit.

def ratelimited(&block)
  raise "Ratelimit met" if ratelimit.exceeded?(RL_SUBJECT, interval: RL_INTERVAL, threshold: RL_THRESHOLD)
  block.call
end

Above, raising with just a message string implicitly raises a RuntimeError. This exception will trigger Sidekiq to re-queue the job and retry in 15 minutes.
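
That is, the two forms below are equivalent (shown here purely for clarity):

raise "Ratelimit met"
# is shorthand for
raise RuntimeError, "Ratelimit met"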

Earlier, I hinted this was not the solution we ended up with (I’ll write that up in the next post). We have somewhere around 6000 users for whom we want a twice-daily sync of Strava trips. Coupled with the short nature of the job (many requests do little work, as not all users will have recorded a ride since our last check), this causes lots of retries once we hit the 600 requests per 15 minutes ceiling, and the approach above sees perhaps a majority of the jobs re-queued by Sidekiq. That is fine in itself: they will get serviced eventually, and Sidekiq is very efficient at its work. But thousands of requeued jobs being the norm (not the exception) feels wrong.