Every time a video is recorded by a CrossBrowserTesting user, it goes through a handful of steps.
The video is recorded in the Flash Video (*.flv*) format because it is computationally cheap to write in real time. Since browsers can't play FLV natively, we then transcode the FLV video to a web-compatible standard inside an MP4 container.
Unfortunately, this transcoding can be computationally expensive, especially when multiple videos are being transcoded simultaneously. When other services share those boxes, they slow to a crawl, degrading the entire user experience, end to end.
Given our growth, we routinely see scenarios where one overloaded system cascades into others, so we needed to start making these services independent of each other.
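For the curious, the conversion itself boils down to a single ffmpeg invocation. The sketch below is illustrative rather than our exact command, and the H.264/AAC codec choice is an assumption on my part, not a detail dictated by the pipeline:

```python
import subprocess

def transcode_flv_to_mp4(src: str, dst: str) -> None:
    """Re-encode an FLV recording as H.264/AAC in an MP4 container.

    Codec choices are illustrative; any web-playable combination works,
    as long as it is the same on every run.
    """
    subprocess.run(
        [
            "ffmpeg",
            "-y",               # overwrite the output file if it exists
            "-i", src,          # input: the raw .flv recording
            "-c:v", "libx264",  # H.264 video, playable in every major browser
            "-c:a", "aac",      # AAC audio
            "-movflags", "+faststart",  # move metadata up front for streaming
            dst,
        ],
        check=True,             # raise if ffmpeg exits non-zero
    )
```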
Investigating Our Options
When we first started looking into offloading the transcoding process, we considered services such as Amazon’s Elastic Transcoder. These services are versatile, simple, and can scale to effectively any size we could conceivably need.
There’s just one problem — for our use case, these services are prohibitively expensive.
CrossBrowserTesting users record nearly 300,000 videos every month, and that number is only expected to grow as we do. Under the pricing model of Elastic Transcoder and similar services, we would be charged per minute of video transcoded.
Since almost all of our recordings are considered HD by Amazon’s standards, Elastic Transcoder costs 3 cents per minute of video output. While that doesn’t sound like a lot, when you have 300,000 videos, that will add up quickly. So, unfortunately, Elastic Transcoder and its ilk are not viable options for our requirements.
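To put rough numbers on it: assuming an average recording length of, say, two minutes, that works out to 300,000 × 2 × $0.03, or about $18,000 every month, just for transcoding, and that figure grows linearly with usage.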
Fortunately, our needs are pretty simple: all we really need to do is transcode each video at its original resolution, with the same codecs and container format every time. Very little changes between transcoding runs.
So, we ended up building our own scalable transcoding service for customer videos.
How It All Works
A week after starting this search, we were ready to start sending customers’ videos to the new transcoder.
When a user finishes recording a video, the resulting FLV file is sent to Amazon S3 for storage. S3 can send a notification any time a file is created in a bucket (or, as in our case, any time a file matching a pattern is created), and those notifications can be delivered to Amazon’s Simple Queue Service (SQS), a simple message queue.
These notifications sit in the queue until they are handled, pulled off by a continuously-running program on some number of EC2 instances. This program, written in Python, pulls a notification, downloads the file from S3, transcodes it using ffmpeg, and uploads the result back to S3.
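Setting this up is a single bucket-level configuration. Here's a minimal sketch using boto3; the bucket name, queue ARN, and suffix filter are placeholders, not our real resources:

```python
import boto3

s3 = boto3.client("s3")

# Send an SQS message whenever an object ending in .flv lands in the bucket.
s3.put_bucket_notification_configuration(
    Bucket="example-recordings-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:example-transcode-queue",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "suffix", "Value": ".flv"},
                        ]
                    }
                },
            }
        ]
    },
)
```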
A call to our API marks the video as completed, and the transcoding of a single video is done. This program is set up to handle a certain number of ‘slots’ for transcoding, allowing a single box to transcode multiple videos simultaneously.
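Stripped of error handling and our internal API details, the loop looks roughly like the sketch below. The queue URL and the mark_completed helper are illustrative stand-ins, not our actual code:

```python
import json
import os
import subprocess
import tempfile

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# Placeholder queue URL, not our real resource.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-transcode-queue"

def mark_completed(key: str) -> None:
    """Stand-in for the internal API call that flags a video as ready."""
    ...

def run_forever() -> None:
    while True:
        # Long-poll SQS so idle workers don't burn requests.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            # S3 event notifications arrive as JSON in the message body.
            record = json.loads(msg["Body"])["Records"][0]
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            with tempfile.TemporaryDirectory() as tmp:
                src = os.path.join(tmp, "in.flv")
                dst = os.path.join(tmp, "out.mp4")

                s3.download_file(bucket, key, src)
                # Same kind of ffmpeg invocation as the sketch earlier.
                subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)
                s3.upload_file(dst, bucket, key.rsplit(".", 1)[0] + ".mp4")

            mark_completed(key)
            # Only delete the message once the video is safely uploaded.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```

In production, several of these loops run in parallel on each box; those are the 'slots' mentioned above.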
How Not to Scale
When we first built this service for ourselves, we set it up to run on two t2.xlarge EC2 instances, which provided a good balance of cost-efficiency and speed. After a little trial and error, we settled on this number because it handled our normal load handily and left us plenty of room for expansion.
At least, that’s what we thought. One day, we noticed that videos were taking an exceedingly long time to become available to our users. This was due to a massive spike in the number of videos in our queue; while our normal load was easily handled by two of these boxes, our load during this incident peaked at over 1,100 videos waiting to be transcoded.
Our two boxes simply couldn’t keep up with our demand, so the queue grew steadily, and I eventually stood up two more transcoding servers. Thanks to the way these are deployed, it doesn’t take long to set up more of these instances, but that was still more work and took more time than I would like to dedicate to handling an incident like this.
After this incident, I set up an autoscaling group for these transcoding boxes. Because the service was designed with elasticity in mind (the ability to start and stop additional horizontally-scaled workers without causing issues), this was really simple: take a running box, shut it down, and create an image from it; apply that image to a launch configuration; and use the launch configuration as the definition for a scaling profile.
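In boto3 terms, that sequence looks roughly like this; the AMI ID, names, and group sizes are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# The launch configuration wraps the image baked from a running box.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="transcoder-lc",
    ImageId="ami-0123456789abcdef0",  # AMI created from a transcoding box
    InstanceType="t3.xlarge",
)

# The group keeps a baseline of two boxes and can grow under load.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="transcoder-asg",
    LaunchConfigurationName="transcoder-lc",
    MinSize=2,
    MaxSize=10,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```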
Scaling on the boxes' CPU utilization worked great at first, but one day I received an alert that the queue was beginning to creep up again, despite the autoscaling. As it turns out, under a specific set of circumstances, the boxes end up in a weird state where their CPU utilization never goes above 50%, even when they're transcoding on all slots.
At this point, I realized that CPU utilization, in addition to being fallible, is also a really bad proxy for what we really want out of the transcoder — we don’t care how busy the boxes are, only that the videos are being processed in a timely manner.
After a little experimentation, we now have the autoscaling set up to track the number of videos waiting in the queue. By scaling on the metric we actually care about, we will never again end up in a situation where a bad proxy (in this case, CPU utilization standing in for the number of videos waiting to be processed) lets the system lag behind.
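One way to wire that up, sketched below with illustrative names and thresholds, is a simple scaling policy on the group, triggered by a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Add one instance to the group each time the alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="transcoder-asg",
    PolicyName="scale-out-on-backlog",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Fire when the transcode queue backs up; the threshold is illustrative.
cloudwatch.put_metric_alarm(
    AlarmName="transcode-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "example-transcode-queue"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```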
The End Result
In the end, our new transcoder system runs an average of about 1.3 t3.xlarge instances over a month, and can scale as needed to match our customer demand.
This is a significant savings over services like Elastic Transcoder, while still easily meeting our needs by only doing the things we really need it to do.
It does one thing — transcodes video for the web — and it does it well, allowing us to focus on making the system better every day instead of putting out fires.