
Video Upload Latency Improvements at Instagram


In June 2013, Instagram introduced video uploads. At the time, the system was simple: to ingest video and make it available for playback, the Instagram app uploaded the entire video file to the server once the client finished recording it. We then transcoded the video into a fixed set of versions at different qualities to make sure the file was playable on as many devices as possible. Once all of the versions were available, we “published” the video and made it available for viewing.

A simple video upload pipeline

At Instagram, our community is sensitive to upload times. Users want to see their video stories and direct video messages made available for others as soon as possible. For that reason, upload latency is an important metric at Instagram. Over the years we have developed strategies for reducing that latency.

Let’s start by defining upload latency, for this article, as the time from when the server has received all video bytes from the client until the video is made “publishable,” or available for viewing.

Publishing Signal

A simple improvement for reducing video upload latency is to do only the minimal work necessary before a video is considered publishable. The idea is that instead of blocking until all video versions are available, we can publish the video as soon as the highest-quality version is available. The remaining versions are not mandatory for playback, but they provide a playback experience with less stalling for users with bandwidth constraints. This reduces latency in cases where the lower-quality versions take longer to process than the highest-quality version, and it increases the success rate of video uploads, since we depend on only one version instead of all of them.

Publishing signal now only depends on the highest-quality version

We represent our video data model with a graph-based storage system. All video versions are attached to a “video asset” parent node, which allows us to reason about the video at the uploaded-media level instead of the video-version level. This abstraction enables simple, unified publishing logic. To implement the improvement above, we mark the video asset as “publishable” by flipping a boolean on it when the callback is received from our video processing service.
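To make this concrete, here is a minimal sketch of what that callback logic might look like. The VideoAsset and VideoVersion types and the on_transcode_complete handler are hypothetical stand-ins for the graph-based data model, not the actual internal API.

```python
# A minimal sketch, assuming hypothetical VideoAsset/VideoVersion types;
# names are illustrative, not Instagram's internal API.
from dataclasses import dataclass, field

@dataclass
class VideoVersion:
    quality: str  # e.g. "1080p", "720p"
    url: str

@dataclass
class VideoAsset:
    asset_id: str
    versions: list = field(default_factory=list)
    is_publishable: bool = False  # the boolean we flip

def on_transcode_complete(asset: VideoAsset, version: VideoVersion,
                          highest_quality: str) -> None:
    """Callback from the video processing service."""
    asset.versions.append(version)
    # Publish as soon as the mandatory (highest-quality) version lands;
    # the lower-quality versions remain optional and fill in later.
    if version.quality == highest_quality:
        asset.is_publishable = True
```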

By making the publishing signal depend on only this one version, we also open up the opportunity to make our video processing model more resilient and flexible. If the pipeline fails to produce the optional encodings for some time, we can still allow videos to be published, since we have the highest-quality version available. Later, we can fill in the rest of the versions on demand.

One tradeoff of making the publishing signal depend only on the highest-quality version is that users with bandwidth constraints may experience suboptimal playback until the rest of the versions are complete. The video is published as soon as the highest-quality version is ready, but a few lower-quality versions may still be processing and will not be available until later. Users with bandwidth constraints who view the video during that window may experience a higher stall rate, since only the highest-quality version is available. In practice, in the significant majority of cases, the remaining encodings become available soon after the mandatory version is complete.

Segmented Video Upload Processing

Another approach to making video uploads faster is to have the client cut the video into segments after it has been recorded. Once the video is chunked into segments, the client uploads them to the server, labeling each segment with an index so they can be recombined in order later. When the server receives the segments, it transcodes them in parallel, saving time. Once all the segments are transcoded, we combine them so the video is available for playback. The client-side step is sketched below.
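Here is a hedged sketch of that client-side step, using ffmpeg’s segment muxer to split the recording at keyframe boundaries and a hypothetical upload endpoint; the real client is a native app, so this is purely illustrative.

```python
# A sketch of client-side segmentation: split the recorded video into
# ~5-second segments (stream copy, split at keyframes), then upload each
# piece labeled with its index so the server can recombine them in order.
# The upload endpoint and parameters are assumptions for illustration.
import glob
import subprocess
import requests

def split_and_upload(path: str, upload_url: str, upload_id: str) -> None:
    subprocess.run(
        ["ffmpeg", "-i", path, "-c", "copy", "-f", "segment",
         "-segment_time", "5", "-reset_timestamps", "1",
         "/tmp/seg_%03d.mp4"],
        check=True,
    )
    for index, seg in enumerate(sorted(glob.glob("/tmp/seg_*.mp4"))):
        with open(seg, "rb") as f:
            # The segment_index lets the server restore the original order.
            requests.post(
                upload_url,
                params={"upload_id": upload_id, "segment_index": index},
                data=f.read(),
            )
```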

The processing portion of the pipeline is now split into segments

On the server side, we structure each video processing pipeline as a directed acyclic graph (DAG). Each node is a unit of execution that runs on a worker machine, and each edge represents a dependency between two nodes. A node runs once all of its dependencies have finished. As an example, here is a simplified pipeline execution for a basic nonsegmented pipeline:

In this example pipeline, the majority of the work happens in the transcoding node. If we can parallelize that portion, we can reduce upload latency significantly. Our segmented pipeline does this by adding a transcoding task per segment, plus a stitch task that concatenates the frames of each segment’s video and places the resulting video in a new container. The stitch task depends on every per-segment task in the pipeline, as follows:
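In code, building that dependency structure might look like the following sketch, assuming a hypothetical Node abstraction rather than the actual pipeline framework:

```python
# A minimal sketch of the segmented pipeline DAG: one transcode node per
# segment, plus a stitch node that depends on all of them and therefore
# runs only after every per-segment transcode has finished.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    deps: list = field(default_factory=list)  # edges: this node's dependencies

def build_segmented_pipeline(num_segments: int) -> Node:
    transcodes = [Node(f"transcode_segment_{i}") for i in range(num_segments)]
    # The stitch task concatenates each segment's frames into one container;
    # it cannot start until every per-segment transcode is done.
    return Node("stitch", deps=transcodes)
```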

Segmented uploads reduce upload latency in many cases but come with a few tradeoffs. For one, they increase the complexity of the pipeline. Some quality metrics, such as SSIM, are only available per segment at transcode time, and these metrics are not helpful to us on a per-segment basis. We therefore take a duration-weighted average of the SSIM of all segments to come up with the SSIM of the whole video. Similarly, handling exceptions is more complex, since there are more cases to handle.
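The duration-weighted average itself is straightforward; a small sketch:

```python
# Combine per-segment SSIM scores into a whole-video score, weighting
# each segment by its duration.
def weighted_ssim(segments: list[tuple[float, float]]) -> float:
    """segments: list of (duration_seconds, ssim) pairs."""
    total_duration = sum(d for d, _ in segments)
    return sum(d * s for d, s in segments) / total_duration

# e.g. three 5s segments and a 2s tail:
# weighted_ssim([(5, 0.97), (5, 0.95), (5, 0.96), (2, 0.90)]) ≈ 0.953
```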

By segmenting the video, we have also introduced another step in the pipeline that stitches all the transcoded segments together. This requires additional CPU that wasn’t necessary in the nonsegmented case, and it is another step that could fail, decreasing our success rate. More importantly, the stitching step significantly increases the IO requirements of the resulting system. Each segment is transcoded on an individual worker machine, and when stitching the segments together we want all of them available locally on the worker machine performing the stitch. That machine must therefore download all the segments over the network, which greatly increases IO utilization.

As for segment length: the smaller we make the segments, the more work we can do in parallel. However, since there is some fixed overhead in setting up a worker machine to transcode a segment, we want to keep the segment length above a certain threshold. If we make the segments too small, we waste resources allocating machines to do tiny amounts of work. In practice, a segment length on the order of a few seconds works well for us.

In addition, segmenting isn’t always a net win in terms of upload latency. The benefits of segmented uploads diminish as the video gets shorter. The figure below compares nonsegmented and segmented video processing plotted against time for a short video and a long video, assuming in both cases that processing time is directly proportional to the length of the video. Δt is the upload-latency win of the segmented pipeline over the nonsegmented one; the win is much smaller for the short video than for the long one:
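A back-of-the-envelope model makes the same point numerically. All of the constants below (transcode rate, per-segment setup overhead, stitch cost) are illustrative assumptions, not measured values:

```python
TRANSCODE_RATE = 0.5    # assumed seconds of processing per second of video
SEGMENT_OVERHEAD = 2.0  # assumed fixed worker-setup cost per segment (s)
STITCH_TIME = 1.0       # assumed cost of the stitch step (s)
SEGMENT_LEN = 5.0       # segment length (s)

def nonsegmented_latency(video_len: float) -> float:
    return TRANSCODE_RATE * video_len

def segmented_latency(video_len: float) -> float:
    # All segments transcode in parallel, so latency is roughly one
    # segment's transcode cost plus the setup overhead and stitch step.
    longest_segment = min(video_len, SEGMENT_LEN)
    return SEGMENT_OVERHEAD + TRANSCODE_RATE * longest_segment + STITCH_TIME

# Under these assumptions, Δt ≈ 2s for a 15s video but ≈ 24.5s for a
# 60s video, matching the intuition from the figure above.
```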

Overall, we decide whether to segment a video at the beginning of the upload process, depending on the product and the length of the video. Some video products, such as Stories, enforce length maximums that are short enough that segmenting isn’t worth the complexity. On the other hand, video products like IGTV enforce a minimum length that is long enough to make segmented uploads always worthwhile.
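That decision reduces to a simple check. Here is a hedged sketch in which the product names come from the text above but the threshold and fallback logic are assumptions:

```python
MIN_SEGMENTABLE_SECONDS = 10.0  # assumed threshold; below it, overhead outweighs the win

def should_segment(product: str, duration_seconds: float) -> bool:
    if product == "stories":  # short enforced maximum: never worth it
        return False
    if product == "igtv":     # long enforced minimum: always worth it
        return True
    return duration_seconds >= MIN_SEGMENTABLE_SECONDS
```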

Passthrough Uploads

Another performance optimization we use to improve upload latency and reduce CPU utilization is something we call a “passthrough” upload. In some cases, the uploaded media is already ready for playback on most devices. When it is, we can skip video processing altogether and store the video directly in our data model. This reduces upload latency because we do not need to transcode the video at all.

Passthrough check is now the critical path for publishing

An important part of this approach is putting comprehensive checks in place so that video entering Instagram conforms to our standards for playback. In addition to the pipelines that transcode video, we add another pipeline that checks some of the properties of the incoming video, such as its codec and bitrate, to confirm that it is eligible for passthrough. If the video uses a poorly supported codec, fewer Instagram users will be able to play it. Similarly, if the bitrate is too high, loading the video for playback over the network will take too long.
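The eligibility check itself might look like the following sketch; the codec allowlist and bitrate ceiling are illustrative assumptions, not the real thresholds:

```python
PLAYABLE_CODECS = {"h264"}   # assumed allowlist of widely supported codecs
MAX_BITRATE_BPS = 8_000_000  # assumed ceiling so playback loads quickly

def is_passthrough_eligible(codec: str, bitrate_bps: int) -> bool:
    # Both checks must pass before the file can skip transcoding.
    return codec in PLAYABLE_CODECS and bitrate_bps <= MAX_BITRATE_BPS
```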

Once the codec and bitrate pass our eligibility criteria, we check the video file with an internal tool that reports on its topology and the consistency of its streams and storage. If the video file is inconsistent, we attempt to repair the original file. This internal tool also reliably identifies buffer-overflow attack scenarios, so that we do not serve any malicious files to our users. Finally, we transmux the repaired video with the original audio and store the result in our data model:
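The internal tool is not public, but the final transmux step is conceptually similar to an ffmpeg stream copy, which rewraps the streams in a fresh container without re-encoding:

```python
# A sketch of transmuxing with ffmpeg: copy the video and audio streams
# into a new container without any lossy transcode. This stands in for
# the internal tooling described above.
import subprocess

def transmux(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-i", src,
         "-c", "copy",               # copy streams; no re-encode
         "-movflags", "+faststart",  # metadata up front for quick playback start
         dst],
        check=True,
    )
```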

The resulting passthrough pipeline completes much more quickly than the transcoding pipelines. Since our checks guarantee that this video version is playable, we allow this callback to mark the video asset as publishable. This greatly improves our video processing latency and additionally improves the quality of the content since transcoding is a lossy process.

The main tradeoff here is captured in the bitrate ceiling, which we set for a few reasons. If the bitrate of the original video is too high and we perform a passthrough upload, we store a much larger file than we would have if we had transcoded the video. Also, there are diminishing returns in visual quality as bitrate increases, which is even more apparent when videos are played on mobile devices with limited screen sizes. For high-bitrate originals, there is little visual-quality win in the passthrough version compared with the highest-quality transcoded version. In practice, our bitrate ceiling lets us control these tradeoffs.

All in all, passthrough uploads can be especially useful for our most latency-dependent video products such as direct video messaging.

What’s next?

Over the years, video processing at Instagram has improved significantly. This infrastructure has provided much more value to our users in the form of efficiency, reliability, and quality. We are currently working on making the high-level procedures mentioned above even more efficient.

One promising area is generating and purging encodings on demand as the video ages and interacts with the outside world.

For instance, we may want to alter which representations a certain video has based on data such as its popularity or age. Since older content isn’t viewed as much, we may not need to store all of its video versions; we can keep just a subset for the small amount of traffic that looks far into the past. If an old video suddenly becomes popular, we can regenerate those versions on demand.

Designing the system that manages each video’s representations over its life cycle involves many interesting challenges. Which signals should we consume? How do we manage and iterate over the existing videos efficiently?

There is high potential for impact considering the scale of Instagram, where many of our 1B+ users create or consume video every day. If this sounds interesting to you, join us!

Many thanks to my team members, Runshen Zhu, Ang Li, Rex Jin, Haixia Shi, and Juehui Zhang, who helped build the above improvements to our infrastructure.

Ryan Peterman is a software engineer on Instagram’s Media Infrastructure team in San Francisco.
