How Facebook Live Streams to 800,000 Simultaneous Viewers


Fewer companies know how to build world-spanning distributed services than there are countries with nuclear weapons. Facebook is one of those companies, and its live video streaming system, Facebook Live, is one of those world-spanning services. Facebook CEO Mark Zuckerberg:

The big decision we made was to shift a lot of our video efforts to focus on Live, because it is this emerging new format, not the kind of videos that have been online for the past five or ten years... We're entering a new golden age of video. I wouldn't be surprised if you fast-forward five years and most of the content people see on Facebook, and share on a day-to-day basis, is video.

If you are in the advertising business, what could be more exciting than a supply of ad-ready content that never ends, keeps growing, and is freely generated? It's the same economics Google exploited when it started putting ads on an exponentially growing web.

A good example of Facebook's streaming prowess is the 45-minute video of two people exploding a watermelon with rubber bands. It peaked at over 800,000 simultaneous viewers, who also racked up more than 300,000 comments. That is the kind of viral scale you can reach with a social network of 1.5 billion users.

For comparison, the 2015 Super Bowl (the NFL's annual championship game) was watched by 114 million viewers, with an average of about 2.36 million on the live stream. During E3 2015, Twitch peaked at 840,000 viewers. The September 16th Republican debate peaked at 921,000 simultaneous live-stream viewers.

So Facebook is right up there with the state of the art. Keep in mind that Facebook also has to handle a large number of other streams at the same time.

A Wired article quotes Facebook's chief product officer Chris Cox, who said that Facebook:

Has more than a hundred people working on Live (it started with about 12, and there are now more than 150 engineers on the project).
Needs to be able to serve up millions of simultaneous streams without crashing.
Needs to be able to support millions of simultaneous viewers on a single stream, as well as seamless streaming across different devices and service providers around the world.

Cox said, "It turns out it's a really hard infrastructure problem." Wouldn't it be interesting to have some details on how that problem was solved? As it happens, we do.

Federico Larumbe works on Facebook's Traffic team, which builds the caching software powering Facebook's CDN and its global load balancing system. He gave an excellent talk, Scaling Facebook Live, in which he shares some details of how Live works.

Here are my notes on the talk. It's impressive.

Origin Story

Live is a new feature that lets people share video in real time.
Live launched in April 2015 and could initially only be used through the Mentions app, as a way for celebrities to interact with their fans.
A year of product improvement and protocol iteration followed.
    They started with HLS (HTTP Live Streaming). It is supported by the iPhone and let them use their existing CDN architecture.
    At the same time they investigated RTMP (Real-Time Messaging Protocol), a TCP-based protocol. The phone sends a separate video stream and audio stream to the Live Stream servers.
        Advantage: RTMP has lower end-to-end latency between the broadcaster and viewers, which matters a lot for an interactive broadcast. Shaving a few seconds of delay makes a big difference to the experience.
        Disadvantage: it requires a whole new architecture because it is not HTTP-based. A new RTMP proxy had to be developed to make it scale.
    They also investigated MPEG-DASH (Dynamic Adaptive Streaming over HTTP).
        Advantage: it is 15% more space-efficient than HLS.
        Advantage: it supports adaptive bit rates, so encoding quality can be varied based on network throughput.
Live launched in dozens of countries in December 2015.

Live Video Is Different, and That Causes Problems

The traffic pattern of the watermelon video mentioned earlier:
    A very steep initial rise: within a few minutes it passed 100 requests per second and kept climbing until the video ended.
    Then traffic dropped like a rock.
    In other words, the traffic is spiky.
Live video is different from normal video: it produces spiky traffic patterns.
    Live videos are more engaging, so they tend to get watched three times more than normal videos.
    Live videos appear at the top of the news feed, so they are more likely to be watched.
    Notifications are sent to all the fans of a page, so yet another group of people may end up watching.
Spiky traffic causes problems for both the caching system and the load balancer.
Caching problems
    Many users may want to watch a live video at the same time. This is the classic Thundering Herd problem.
    The spiky traffic pattern puts pressure on the caching system.
    Video is segmented into one-second files, and the servers that cache these segments can overload when traffic spikes.
Global load balancing problems
    Facebook's PoPs (Points of Presence) are distributed around the world, and traffic is distributed globally.
    The challenge is preventing a spike from overloading a PoP.

Big-Picture Architecture

This is how a live stream gets from the broadcaster to the viewers:

The broadcaster starts a live video on their phone.
The phone sends an RTMP stream to a Live Stream server.
The Live Stream server decodes the video and transcodes it into multiple bit rates.
For each bit rate, a set of one-second MPEG-DASH segments is continuously produced (see the sketch after this list).
Segments are stored in a datacenter cache.
From the datacenter cache, segments are forwarded to the caches in the PoPs.
On the viewing side, the viewer receives the live story.
The player on the viewer's device fetches segments from a PoP cache at a rate of one per second.
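
To make the flow concrete, here is a minimal Python sketch of the production side. The names (encode_segment, datacenter_cache, the bit-rate ladder) are placeholders, not Facebook's code; only the shape of the flow described above (one-second segments per bit rate, written to a datacenter cache and later pulled into PoP caches) comes from the talk.

```python
import time

# Schematic only: encode_segment and datacenter_cache are stand-ins used to
# illustrate the pipeline; the real system is a distributed service.
BITRATES_KBPS = [200, 500, 1000]   # the Live Stream server transcodes into several bit rates
datacenter_cache = {}              # stands in for the real datacenter cache

def encode_segment(stream_id: str, bitrate_kbps: int, seq: int) -> bytes:
    return b"..."                  # one second of encoded video at this bit rate

def produce_segments(stream_id: str, duration_s: int) -> None:
    """Live Stream server loop: emit one MPEG-DASH segment per second per bit rate."""
    for seq in range(duration_s):
        for kbps in BITRATES_KBPS:
            key = f"{stream_id}/{kbps}k/seg-{seq}.m4s"
            datacenter_cache[key] = encode_segment(stream_id, kbps, seq)
            # PoP caches later pull these keys on demand as viewers request them.
        time.sleep(1.0)            # segments are one second long
```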

How Does It Scale?

There is one point of multiplication between the datacenter cache and the PoP caches: users access the PoP caches, not the datacenter cache, and there are many PoP caches distributed around the world.
Another multiplication factor sits inside each PoP (see the sketch below).
    A PoP is split into two layers: a layer of HTTP proxies and a layer of caches.
    Viewers request a segment from an HTTP proxy. The proxy checks whether the segment is in cache; if it is, the segment is returned, otherwise a request is sent to the datacenter.
    Different segments are stored in different caches, which helps balance load across the caching hosts.
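
Below is a hedged sketch of what those two layers inside a PoP might look like. CacheHost, pick_cache_host, and fetch_from_datacenter are illustrative stand-ins, not Facebook APIs, but the lookup path (proxy hashes the segment to a cache host, falls back to the datacenter on a miss) follows the description above.

```python
import hashlib

class CacheHost:
    """One cache host inside a PoP (toy in-memory version)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def put(self, key, value):
        self.store[key] = value

CACHE_HOSTS = [CacheHost() for _ in range(4)]   # hypothetical PoP with 4 cache hosts

def pick_cache_host(segment_key: str) -> CacheHost:
    # Different segments hash to different cache hosts, spreading load across them.
    h = int(hashlib.md5(segment_key.encode()).hexdigest(), 16)
    return CACHE_HOSTS[h % len(CACHE_HOSTS)]

def fetch_from_datacenter(segment_key: str) -> bytes:
    return b"..."   # stub: the real proxy would request the segment from the datacenter cache

def proxy_get(segment_key: str) -> bytes:
    """HTTP proxy layer: serve from the PoP cache, fall back to the datacenter."""
    cache = pick_cache_host(segment_key)
    segment = cache.get(segment_key)
    if segment is None:
        segment = fetch_from_datacenter(segment_key)
        cache.put(segment_key, segment)
    return segment
```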

Protecting the Datacenter from the Thundering Herd

What happens when all the viewers request the same segment at the same time?
If the segment is not in cache, one request per viewer is sent to the datacenter.
Request coalescing. Adding request coalescing to the PoP cache reduces the number of requests, so only one request is sent to the datacenter. The other requests are held until the first response arrives, and the data is then returned to all the viewers.
A new caching layer is added to the proxy to avoid the hot-server problem (see the sketch below).
    Sending all viewers to the same cache host to wait for the segment could overload that host.
    The proxy adds its own caching layer. Only the first request actually hits the cache; the proxy serves the remaining requests itself.
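
Here is a minimal sketch of request coalescing combined with the proxy-side cache, assuming a single-process proxy and a stub fetch_from_datacenter. The real system is distributed, but the idea is the same: only the first request for a segment leaves the PoP, and the followers wait for that one response.

```python
import threading

_cache = {}       # small cache inside the proxy itself (avoids the hot-server problem)
_inflight = {}    # segment key -> Event for the request already on its way to the datacenter
_lock = threading.Lock()

def fetch_from_datacenter(key: str) -> bytes:
    return b"..."   # stub: real code would request the segment from the datacenter cache

def coalesced_get(key: str) -> bytes:
    """Return a segment, sending at most one datacenter request per key."""
    while True:
        with _lock:
            if key in _cache:                   # served directly from the proxy cache
                return _cache[key]
            leader = key not in _inflight       # first request becomes the leader
            if leader:
                _inflight[key] = threading.Event()
            event = _inflight[key]
        if leader:
            data = fetch_from_datacenter(key)   # only one request leaves the PoP
            with _lock:
                _cache[key] = data
                del _inflight[key]
            event.set()                         # wake the waiting followers
            return data
        event.wait()                            # followers wait for the leader's response
        # loop back: the segment is now in the proxy cache
```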

PoPs Are Still at Risk: Global Load Balancing to the Rescue

The datacenter is now protected from the Thundering Herd, but the PoPs are still at risk. The problem with Live is that the spikes are so big that a PoP can be overloaded before its load measurement ever reaches the load balancer.
Each PoP has a limited number of servers and limited connectivity. How can a spike be prevented from overloading a PoP?
A system called Cartographer maps Internet subnets to PoPs. It measures the latency between each subnet and each PoP.
The load of each PoP is measured, and each user request is sent to the closest PoP that has enough capacity. Counters in the proxies record how much load they are receiving; aggregating those counters gives the load of each PoP.
This becomes an optimization problem: respect each PoP's capacity constraints while minimizing latency.
With a control system there is a delay to measure and a delay to react.
They reduced the load-measurement window from 1.5 minutes to 3 seconds, but even 3 seconds means acting on stale data.
The solution is to predict the load before it actually happens.
They built a load estimator that extrapolates from the previous and current load of each PoP to the future load.
    How can the estimator predict that load will decrease when it is currently increasing?
    Cubic spline interpolation is used as the interpolation function (see the sketch after this list).
    Take the first and second derivatives: if the first derivative (the speed) is positive, load is increasing; if the second derivative (the acceleration) is negative, the speed is dropping and will eventually reach zero, after which the load starts to fall.
    Cubic splines can predict more complex traffic patterns than linear interpolation.
    Avoiding oscillation: the interpolation function also solves the oscillation problem.
    The delay in measuring and reacting means decisions are made on stale data. Interpolation reduces the error and predicts more accurately, while also reducing oscillation, so the load can stay closer to the capacity target.
    Prediction is currently based on the last three intervals, each 30 seconds long, so it works from nearly instantaneous load.
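
The sketch below illustrates the prediction idea with scipy's CubicSpline: fit the last three 30-second load samples, inspect the first and second derivatives, extrapolate one step ahead, and route users to the closest PoP whose predicted load leaves spare capacity. The 30-second spacing and the derivative test come from the talk; the function names, data structures, and routing fallback are assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def predict_load(samples, step=30.0):
    """samples: the last three load measurements for one PoP, 30 seconds apart."""
    t = np.arange(len(samples)) * step        # e.g. [0, 30, 60]
    spline = CubicSpline(t, samples)
    speed = float(spline(t[-1], 1))           # first derivative: is load rising?
    accel = float(spline(t[-1], 2))           # second derivative: is the rise slowing?
    return float(spline(t[-1] + step)), speed, accel

def route(user_pops, predicted_load, capacity):
    """Send the user to the closest PoP whose predicted load leaves spare capacity.

    user_pops: list of (pop, latency_ms) from Cartographer's latency measurements.
    predicted_load / capacity: dicts keyed by PoP name (hypothetical shapes).
    """
    for pop, _latency in sorted(user_pops, key=lambda p: p[1]):
        if predicted_load[pop] < capacity[pop]:
            return pop
    return min(user_pops, key=lambda p: p[1])[0]   # fall back to the closest PoP

# Example: load is still rising (positive speed) but decelerating (negative
# acceleration), so the extrapolation flattens out instead of overshooting.
nxt, speed, accel = predict_load([1000.0, 1800.0, 2200.0])
```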

Testing

You need a way to overload a PoP.
They built a load-testing service, distributed across the PoPs, that simulates live traffic.
It can simulate 10x production load.
It can simulate viewers that request one segment at a time (see the sketch below).
This system helped reveal and fix problems in the load estimator, tune configuration parameters, and verify that the caching layer really solves the Thundering Herd problem.
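
A toy version of such a load generator might look like the sketch below: each simulated viewer requests one segment per second from a PoP, like a real client. The URL layout and viewer count are made up; Facebook's actual service is globally distributed and drives 10x production load.

```python
import threading
import time
import urllib.request

def simulated_viewer(pop_url: str, stream_id: str, seconds: int) -> None:
    """One fake client: request one segment per second, matching segment length."""
    for seq in range(seconds):
        try:
            urllib.request.urlopen(f"{pop_url}/{stream_id}/seg-{seq}.m4s", timeout=2).read()
        except OSError:
            pass          # a real harness would count errors and latencies
        time.sleep(1.0)

def run_load_test(pop_url: str, stream_id: str, viewers: int = 1000, seconds: int = 60) -> None:
    threads = [threading.Thread(target=simulated_viewer, args=(pop_url, stream_id, seconds))
               for _ in range(viewers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```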

Upload Reliability

Uploading video in real time is a challenge.
Take, for example, an upload with between 100 Kbps and 300 Kbps of available bandwidth.
Audio requires 64 Kbps of throughput.
Standard-definition video requires 500 Kbps of throughput.
Adaptive encoding on the phone makes up for the throughput deficit: the video encoding bit rate is adjusted to whatever network bandwidth is actually available.
The phone decides the upload bit rate by measuring the bytes uploaded on the RTMP connection and taking a weighted average over the last few intervals (see the sketch below).
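
As a rough sketch of that phone-side logic: the weighted average of measured RTMP throughput and the 64 Kbps audio budget come from the talk, while the smoothing factor, clamp values, and function names are assumptions.

```python
AUDIO_KBPS = 64
MIN_VIDEO_KBPS, MAX_VIDEO_KBPS = 50, 500     # illustrative bounds for the video encoder

def update_estimate(prev_estimate_kbps: float, bytes_sent: int, interval_s: float,
                    alpha: float = 0.5) -> float:
    """Weighted average of the throughput measured on the RTMP connection."""
    measured_kbps = (bytes_sent * 8 / 1000) / interval_s
    return alpha * measured_kbps + (1 - alpha) * prev_estimate_kbps

def pick_video_bitrate(estimate_kbps: float) -> float:
    """Whatever bandwidth remains after audio goes to the video encoder."""
    return max(MIN_VIDEO_KBPS, min(MAX_VIDEO_KBPS, estimate_kbps - AUDIO_KBPS))

# Example: on a ~250 Kbps uplink the video encoder gets roughly 186 Kbps,
# well below the ~500 Kbps needed for standard definition, so quality drops.
video_kbps = pick_video_bitrate(update_estimate(250.0, bytes_sent=62_500, interval_s=2.0))
```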

Future Directions

Investigate a push mechanism to replace the request-pull mechanism, leveraging HTTP/2 to push segments to the PoPs before they have been requested.
