At the core of how Buffer schedules posts is one line of cronjob configuration that hasn't been touched since the very start, when Joel founded Buffer.

We still rely on that single cronjob that runs every minute of every day.

While this configuration is the same, everything else around it has evolved. Today, Buffer schedules on average 300 posts per minute and over 432,000 posts a day. Here’s a look at some of the challenges and iterations we’ve made to the core of what we do—scheduling posts on social media.

Buffer: October 2010-June 2012

When Joel started Buffer in his bedroom in 2010, his goal was to validate that Buffer would be something people wanted to pay for. A way to schedule tweets in advance was his hypothesis, and he moved fast to get something working. Looking back at our codebase, Joel's 4th commit was to put up a pricing page. Commit #13 (a week later) is when he checked in the first instance of cron scheduling. That's what I'd call lean.

The first iteration of the scheduling worked like this:

  1. A cronjob executed every minute, grabbing the status updates from the database that were due right now.
  2. The same cronjob process would then post them one by one and mark each as sent.

This all ran as a single cronjob on the same Linode server ($20/mo with 512 MB of RAM) that hosted the web server and database.
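
To make this concrete, here's a minimal sketch of what that loop might have looked like. The table name, columns, and the post_tweet() helper are made up for illustration rather than pulled from the original code.

<?php
// Runs once a minute via cron (hypothetical schema and helper names).
$db = new PDO('mysql:host=localhost;dbname=buffer', 'user', 'pass');

// 1. Grab the status updates that are due right now.
$due = $db->query(
    "SELECT id, profile_id, text FROM updates
     WHERE sent = 0 AND due_at <= NOW()"
)->fetchAll(PDO::FETCH_ASSOC);

// 2. Post them one by one and mark each as sent.
foreach ($due as $update) {
    post_tweet($update); // hypothetical call to the Twitter API
    $db->prepare("UPDATE updates SET sent = 1 WHERE id = ?")
       ->execute([$update['id']]);
}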

For a while this worked well enough. However, as customers started to schedule more posts, the cronjob had to process more every minute, and it slowly became more CPU-intensive. When Joel hit high CPU or memory usage, he vertically scaled the machine or optimized SQL queries. Each time he upgraded the server there would be a few hours of downtime, but this was totally acceptable at that level of scale.

The switch to AWS—June 2012

In June 2012, with 100,000 users and some level of product-market fit in hand, Joel decided it was time to migrate infrastructure so that scaling would become a lot more, well, scalable. The team made the switch to AWS, using Elastic Beanstalk, SQS, and S3.

With this change, here’s how post scheduling looked.

  1. A cron job runs every minute on a single server and grabs the updates that are due now.
  2. For each due update, the cron job creates an SQS (Amazon Simple Queue Service) message.
  3. A worker running in a cluster of utility servers picks off a message and processes it: it posts the status update to Twitter, Facebook, etc. and marks it as sent.
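
As a rough sketch (using the AWS SDK for PHP in its modern form, with a made-up queue URL and hypothetical helpers rather than our actual code), the two halves might look like this:

<?php
use Aws\Sqs\SqsClient;

require 'vendor/autoload.php';

$sqs      = new SqsClient(['region' => 'us-east-1', 'version' => '2012-11-05']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/updates-due';

// --- Cron side, once a minute: enqueue whatever is due now ---
foreach (get_due_updates() as $update) {               // hypothetical DB lookup
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode(['update_id' => $update['id']]),
    ]);
}

// --- Worker side, on the utility cluster: drain the queue ---
while (true) {
    $result = $sqs->receiveMessage(['QueueUrl' => $queueUrl, 'WaitTimeSeconds' => 20]);
    foreach ((array) $result->get('Messages') as $message) {
        $payload = json_decode($message['Body'], true);
        post_and_mark_sent($payload['update_id']);      // hypothetical: post to the network, mark as sent
        $sqs->deleteMessage([
            'QueueUrl'      => $queueUrl,
            'ReceiptHandle' => $message['ReceiptHandle'],
        ]);
    }
}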

This new architecture formed the foundation for how Buffer works today. We’ve had an amazing experience using SQS and, in almost two years of using it, haven’t had any unexpected message delay or downtime with it (knock on wood). It’s incredibly well architected and handles everything we throw at it.

The separation of scheduling and processing was fundamental to tackling the challenges faced in scaling. Now it’s as simple as adding more workers to process the queue instead of upgrading to a server with beefier specs.

Concurrency

One of the early problems we encountered when switching to this queue/worker setup was random, unexpected behavior. Duplicate posting was probably the biggest and most noticeable one. Concurrency is often hard to debug, so a lot of thought went into understanding the flow and why issues like duplicate posting would occur.

To solve this, we designed a life cycle state machine for an update. These are the states an update moves through over the course of being scheduled:

  • buffer — currently in the buffer
  • pending — picked off by the cronjob and added to SQS to be scheduled
  • processing — picked off by a worker and currently being processed
  • analytics/sent — finished sending and is viewable in analytics (analytics checking has its own state machine)
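
One way to picture this (an illustrative encoding of the states above, not our actual code) is an explicit transition map that every state change gets checked against:

<?php
// Allowed transitions in the update life cycle described above.
$allowedTransitions = [
    'buffer'     => ['pending'],     // the cron job picks it up and queues it on SQS
    'pending'    => ['processing'],  // a worker picks it off the queue
    'processing' => ['sent'],        // posted; analytics checking has its own state machine
];

function canTransition(array $map, $from, $to) {
    return in_array($to, isset($map[$from]) ? $map[$from] : [], true);
}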

Even with this paradigm in place, we still noticed that race conditions would occur, leading to duplicate posts. This was a major issue for us, and we worked hard to get it under control. To solve duplicate posting, we had to absolutely ensure atomicity when changing states. This is where MongoDB's findAndModify comes in handy.

db.updates.findAndModify({
    query:  { '_id': ObjectId(), 'status': 'pending' },
    update: { $set: { 'status': 'processing' } }
});

findAndModify allows one DB connection to query for an update with a 'pending' status and atomically change its state, so that another connection querying for the same update will see that it's already being processed. This general rule of making state changes through findAndModify has helped resolve most of our concurrency issues.
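
From a PHP worker, the same atomic claim might look like this (a sketch using the legacy Mongo driver; the database name is made up, the collection and field names follow the shell example above, and $updateId would come from the SQS message):

<?php
$mongo   = new MongoClient();
$updates = $mongo->selectDB('buffer')->selectCollection('updates');

$claimed = $updates->findAndModify(
    ['_id' => new MongoId($updateId), 'status' => 'pending'],  // only if still pending
    ['$set' => ['status' => 'processing']]                     // atomically claim it
);

if (!$claimed) {
    return; // another worker got there first -- skip instead of double-posting
}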

Duplicate posts make us cringe

Spiky load

One of the really interesting challenges we face at Buffer is that the load from scheduling is incredibly spiky. Many Buffer users have set up their schedules to be hourly, on the dot. This means that on a Thursday morning at 7:59am PST there are 255 updates scheduled to be posted. As soon as 8:00 hits, we schedule close to 5,000 updates to be posted. Then at 8:01 it's back down to ~400 posts.

Here’s a visualization of when updates are due on Thursday (Feb 27) morning.

Posting delay

As these spikes started to grow above 3k per minute, we started experiencing larger posting delays. When a customer scheduled a post exactly on the hour, the post would show up on the social network a few minutes later than expected. The worst case we saw was an 8:00am post going out at 8:04. This was not acceptable.

After noticing these delays were a trend, we quickly realized that our poor cron job couldn't handle grabbing the due updates and adding them to our SQS queue fast enough during these spikes. Since the cron runs once a minute, the job would time out, as it could only process a few thousand updates per run. And because the cronjob's processing was part of the real-time scheduling path, it was what introduced the delay, especially for the last thousand or so posts scheduled on the hour.

How we schedule posts today

In January 2014 we deployed a major change to our scheduling to ensure better real-time posting with minimal delays. To do this, we used SQS’s Delay Message feature. Effectively with SQS, we can schedule a message to be processed by our workers up to 15 minutes into the future. We made several changes so that our cronjob can start scheduling posts ahead of when they’re actually due.

Now the scheduling cronjob looks for any post that is due in the next 15 minutes and adds it to the SQS queue with a delay, so that our workers process it at the exact time it's due. This cuts the cronjob out as the middleman in the real-time scheduling path and ensures we're more real-time.
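
Here's a rough sketch of that look-ahead pass, with a hypothetical get_updates_due_within() helper and the same illustrative queue as before. SQS caps DelaySeconds at 900 seconds, which is exactly what makes 15 minutes the natural horizon.

<?php
use Aws\Sqs\SqsClient;

require 'vendor/autoload.php';

$sqs      = new SqsClient(['region' => 'us-east-1', 'version' => '2012-11-05']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/updates-due';

// Cron job, once a minute: enqueue everything due within the next 15 minutes,
// delayed so that workers receive each message at the moment the post is due.
foreach (get_updates_due_within(15 * 60) as $update) {        // hypothetical lookup
    $delay = max(0, min(900, $update['due_at'] - time()));    // SQS allows at most 900 seconds
    $sqs->sendMessage([
        'QueueUrl'     => $queueUrl,
        'MessageBody'  => json_encode(['update_id' => $update['id']]),
        'DelaySeconds' => $delay,
    ]);
    mark_as_pending($update['id']);  // hypothetical: per the state machine above, so the next run doesn't enqueue it twice
}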

As you might imagine, there were some major considerations and changes needed to handle the case where a post has already been added to our scheduling SQS queue and its due time then changes (i.e., a customer reschedules the post).
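
One illustrative safeguard (not necessarily how we handle it in production) is for the worker to re-read the update when the delayed message finally arrives and drop the message if the stored due time no longer matches what was queued:

<?php
// $updateId and $queuedDueAt come from the delayed SQS message (hypothetical fields).
function shouldStillPost(MongoCollection $updates, $updateId, $queuedDueAt) {
    $update = $updates->findOne(['_id' => new MongoId($updateId)]);

    // If the customer rescheduled or deleted the post after it was enqueued,
    // the stored due time no longer matches the queued one, so skip this message.
    return $update !== null && $update['due_at'] === $queuedDueAt;
}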

This change has been a huge relief: we can now horizontally scale post processing by increasing the number of workers, and our cronjob is no longer the limiting factor.

I’d love to hear your thoughts!

This post was super fun to write, as it gave me a chance to reflect on the challenges and evolution of how we schedule posts at Buffer. I'm still amazed that, from the beginning up to now, this one line has powered our core product:

* * * * * php app/webroot/index.php cron tweet

Do you have any questions about this cronjob or Buffer engineering in general? I’d love to hear from you in the comments and answer any questions at all you might have.

P.S. Interested in working on some unique scaling challenges to manage sending out 3 million posts per week? Join our small team!

This post originally appeared on Medium. You can find other Buffer posts on Medium by browsing the Buffer collection.

Image Credit: kjetikor

Written by Sunil Sadasivan

Buffer CTO and disc tosser.

  • appneeti

    Nice post, Sunil. I love the way Buffer has adopted openness and transparency in its culture. I am myself a founder and love your product as well as the content you people disseminate. A lot to learn from you guys. Thanks and keep it up.

    I have one quick question. Do you guys decide and schedule a day's worth of stories in advance, or do you schedule multiple times a day on various themes?

  • Tigran Hakobyan

    Thanks for this update Sunil. I know you guys are running PHP CodeIgniter for most back-end services, right? Do you think you will continue to scale this framework, or might you switch to other technologies for your back end? Thanks!!

  • Yves

    Are you sure the customers care if posts go out 4 minutes after the scheduled time? If I were a customer I wouldn't care about precision by the minute. Have you considered letting users schedule posts based on approximate time, to validate whether this requirement is important?

    • Adam Birnbaum

      I agree, that was the first thing I thought of when I read this article. I have prescheduled tweets before, and in my experience precise timing plus or minus a few minutes is not a big deal. I could imagine, however, that some industries, such as a news feed, may be more concerned with the exact timestamp on a tweet. This of course also has implications for reducing the worker capacity required at peak post times.

    • KhalJaaidi

      A 4-minute delay could become 8, 12, 30 minutes, etc. Solve problems before they become complaints, and solve them early.

    • Yves, it's the core of the business to send out posts at exactly the time customers want them to go. If I saw that posts were posted 2, 4, or 8 minutes late, I would start thinking the service is not reliable.

      It's the same as if you ordered from Amazon with next-day delivery and got it 2 days later.

  • Gil

    First of all, thanks for your transparency guys!

    Your approach seems good to me for now, but I think you will face the same problem in the future with your current growth.

    Have you guys thought about storing the future posts in Redis? Since Redis is single-threaded, its operations are atomic. It's easily scalable, and this gives you the chance to have more machines running this cron at the same time without posting duplicates.

  • Kim Slawson

    I also wondered if spreading the peaks out might be a sane solution to reducing spiky load. User-facing change: change "at" to "around" and have it be the default, with "at" as an option if needed.

    • sunils34

      This is a great idea! We have thought about presenting a more approximate time your post will go out. It's been such a fun and unique challenge to provide a more accurate guarantee of timing, and it's definitely something we think is solvable, so this has really helped push us to get to a stable solution :).

  • Ahmed M. Araby

    First of all, I think you have default scheduling times for all users, as far as I remember; you could scramble these default times and the separating periods using a random function.
    Second, you don't need to send them serially, guys.
    There's parallel programming in PHP: the script can fork itself, you can assign a number of tweets to each fork, and the main master process can wait till all are done before closing, since closing before the children finish can be messy.
    Try to optimize the method of sending tweets, because at higher loads 24 hours will not be enough to send everything.

  • ShaneHarter

    Sunil, I’d love to hear what Buffer has in place to monitor the cron jobs and queue workers that keep everything running. I recently launched a dead simple cron job monitoring tool, https://cronitor.io (might be useful to your readers in the sub-$4-million cron range). It would honestly be awesome to hear what you think about Cronitor and a little more about how you do monitoring over there.
