
In October 2020, Chrome enabled HTTP/3 by default. HTTP/3 (RFC 9114) runs over IETF QUIC (RFC 9000). Default-enabling HTTP/3 in Chrome resulted in improved performance compared not only to HTTP/1 and HTTP/2, but also to Google QUIC. Benefits included reduced Google search latency and fewer rebuffers for YouTube.

The journey to optimizing performance did not end when HTTP/3 was default enabled. Recent advancements include the implementation of the HTTP/3 ORIGIN frame (RFC 9412) and Server's Preferred Address (RFC 9000 Section 9.6). The former enhances connection coalescing, while the latter reduces a connection's round trip time (RTT). Both features have been enabled by default in M131, which was released to Stable on 11/19.

ORIGIN Frame

When a connection is established for a specific hostname, the server’s certificate typically contains numerous other hostnames for which the server is authoritative. However, a client cannot immediately send requests for one of those other hostnames on that connection without first performing a DNS lookup for that hostname and verifying that the IP address of the connection matches the resolved address. This additional DNS resolution introduces latency and significantly reduces the likelihood of connection pooling due to potential IP mismatches. Metrics from Chrome indicate that nearly 20% of HTTP/3 connections would be unnecessary if not for this IP mismatch.

Creating a new connection, even with QUIC 0-RTT, is expensive in terms of latency, memory, and CPU usage. This is because:

  • DNS resolution adds latency unless cached locally in Chrome’s DNS cache.
  • Both client and server must send multiple packets to complete a QUIC handshake.
  • TLS necessitates CPU-intensive asymmetric cryptography on both ends.
  • The congestion controller begins in its default state, potentially leading to under- or over-sending.
  • 0-RTT might fail.
  • Non-safe requests aren't sent via 0-RTT.
  • More connections consume more memory.

Additionally, features like HTTP priorities (RFC 9218) are only effective if there are multiple simultaneous responses to send.

The HTTP/3 ORIGIN Frame (RFC 9412) enables a server to indicate which domains it would like to pool onto a connection. Additionally, once the frame is received, domains that it does not list should not be pooled onto that connection, even if they are in the certificate.
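To make the pooling rule concrete, here is a minimal sketch of the reuse decision described above. It is not Chrome's implementation: the `Connection` class, the `can_reuse` function, and their fields are hypothetical names, certificate matching is reduced to exact hostnames (no wildcards), and origins are reduced to `https://` plus the hostname on the default port.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Connection:
    """Hypothetical model of an established HTTP/3 connection."""
    server_ip: str
    cert_hostnames: Set[str]               # exact names the certificate covers
    origin_set: Optional[Set[str]] = None  # stays None until an ORIGIN frame arrives


def can_reuse(conn: Connection, hostname: str,
              resolved_ip: Optional[str] = None) -> bool:
    """Return True if a request for `hostname` may be pooled onto `conn`."""
    # The certificate must cover the hostname in every case.
    if hostname not in conn.cert_hostnames:
        return False
    if conn.origin_set is not None:
        # ORIGIN frame received: only advertised origins are pooled,
        # and no DNS lookup is needed to make the decision.
        return f"https://{hostname}" in conn.origin_set
    # No ORIGIN frame: fall back to DNS and require the IP to match.
    return resolved_ip is not None and resolved_ip == conn.server_ip
```

With this model, a hostname that is in the certificate but resolves to a different IP is rejected until the server advertises it in an ORIGIN frame, after which it is accepted without any lookup.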

Server’s Preferred Address

In some cases, the initial server address to which the client connects is not the most efficient route. It might be behind an L4 load balancer, and connecting directly could increase stability. Particularly when using Anycast, it’s possible the server is distant from where traffic enters the network, creating a 3-legged path that increases the round trip time.

Once the handshake is confirmed, Server’s Preferred Address allows a server to indicate it would like the client to migrate to a different server IP. Though a QUIC connection is not bound to a single 4-tuple like TCP, this is the only type of migration in RFC 9000 where the server can change its address.
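The client side of this migration can be sketched roughly as follows. This is an illustration under simplifying assumptions, not Chrome's code: `PreferredAddress`, `pick_server_address`, and the `validate_path` callback are hypothetical names, and the real transport parameter also carries a connection ID and a stateless reset token, which are omitted here. The fallback to the original address when path validation fails follows RFC 9000 Section 9.6.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]  # (ip, port)


@dataclass
class PreferredAddress:
    """Simplified stand-in for the server's advertised preferred address."""
    ip: str
    port: int


def pick_server_address(
    original: Address,
    preferred: Optional[PreferredAddress],
    handshake_confirmed: bool,
    validate_path: Callable[[str, int], bool],
) -> Address:
    """Return the server address the client should use for future packets."""
    # Nothing advertised, or too early to migrate: stay on the original path.
    if preferred is None or not handshake_confirmed:
        return original
    # Probe the new path first; migrate only if validation succeeds.
    if validate_path(preferred.ip, preferred.port):
        return (preferred.ip, preferred.port)
    # Validation failed: keep using the original server address.
    return original
```

Because the client validates the new path before switching, a failed migration costs only the probe packets and the connection carries on over the original address.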

So far, only Google’s Media CDN has widely enabled advertising an alternative address, but we expect more servers to adopt it soon. Testing has shown that this migration is successful over 99% of the time in Chrome and reduces average RTT by 40-80%.

Today’s The Fast and the Curious post covers how Chrome achieved best-in-class Speedometer scores on mobile devices, resulting in faster and smoother web experiences for Android users.

Chrome has always been about speed. Whether it's loading pages quickly, running complex web apps smoothly, or delivering a seamless browsing experience, performance is at the heart of our browser. And we're always looking for ways to make Chrome even faster.

Over the last two years, we have been hard at work on a number of performance improvements for Android devices. We're excited to share some of the progress we've made.

Speedometer on Android

One of the key metrics we use to track Chrome's performance is the Speedometer benchmark. This benchmark is developed in collaboration with other major web browser engines and measures how quickly Chrome can complete interactions with web pages, including parsing/rendering HTML or CSS and running JavaScript.

Since the release of Chrome M112, we've seen a significant increase in Speedometer 2.1 scores on Android devices [1]. In fact, on many devices, scores more than doubled, with the newest Snapdragon® 8 Elite Mobile Platform setting new records for Speedometer performance on mobile devices! These huge accomplishments are a testament to the work not only of the Chrome and Android teams, but also our silicon and SoC partners.

Since Chrome M112, Speedometer 2.1 scores have more than doubled on many Android devices. [1]

How Did We Do It?

The improvements resulted from several changes, including:

  • Build optimizations: We've made a number of changes to the way Chrome is built, which has resulted in faster code execution tuned to modern premium Android devices and SoCs.
  • V8 and Blink improvements: Many improvements to the JavaScript engine (V8) and the rendering engine (Blink) have further boosted performance.
  • Scheduling, OS and SoCs: We worked closely with Android partners to optimize the way Chrome interacts with the operating system and its thread scheduling to make the best use of the silicon on the devices.

Let's take a closer look at each of these areas.

Build optimizations

The Android device ecosystem is very diverse. From entry-level phones to the newest premium ones, Chrome needs to run well on all devices. Up until last year, we shipped the same Chrome build to all these different Android devices. The memory and disk size constraints on entry-level devices resulted in Chrome having to prioritize a small binary size. Consequently, many modern build optimizations were out of reach, as they resulted in much larger binaries.

With M113, Chrome was finally able to ship a separate higher-performance build targeting premium Android devices via the Google Play Store. While we still ship a more binary-size-constrained build to other devices, this approach allowed us to land some of those modern optimizations into the new premium build:

  • By targeting 64-bit Arm instead of 32-bit Arm, we can make use of more efficient Arm instruction set features and larger 64-bit operations.
  • Since binary size is less relevant on premium devices with large disks and sufficient memory, we can now compile C++ code optimized for speed (-O2 / -O3) rather than size (-Oz).
  • Furthermore, we tweaked the inlining thresholds used by the compiler to enable more inlining in hot code (within and across modules), while updating the model and policy used by another compiler pass (MLGO) to reduce inlining in cold code.
  • We now also apply profile-guided optimization (PGO) techniques to the build to further improve the code layout and optimization level for hot code.
  • Finally, we improved cross-function code ordering by aligning Chrome's orderfile generation with the new 64-bit build. We also now include Speedometer 3, the latest version of the industry-standard browser speed benchmark, in the workloads used to generate the orderfile.

Together, these build optimizations account for more than half of the overall Speedometer score improvements. This progress was facilitated by our collaboration with Arm, who contributed valuable insights and improvements, including help in identifying and addressing inefficiencies in Chrome's PGO setup and inlining.

V8 and Blink improvements

Chrome continuously improves the performance of its JavaScript and web rendering engines, V8 and Blink. Most optimizations are small in individual impact, but stacked together, these improvements add up and contribute most of the remaining Speedometer impact! Notable ones include:

  • We now utilize an optimized fast-path HTML parser to parse innerHTML attributes.
  • V8 launched its Sparkplug compiler tier, a super fast baseline compiler that sits right above its Ignition interpreter and generates non-optimized code very quickly. Later, V8 also launched Maglev, a new mid-tier compiler that generates semi-optimized code. It takes longer to do so than Sparkplug, but much less time than Turbofan, V8's ultra-optimizing compiler tier. All together, this new tiering hierarchy allows V8 to tier up more gradually, improving both performance and power consumption.
  • We tuned our heuristics that decide when garbage collection occurs, targeting times when the rendering engine is idle or when users navigate away from pages.
  • We landed many other incremental optimizations, e.g. to V8 and our parsing, style, layout, and text rendering engines.
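As a toy illustration of the gradual tier-up described in the list above, the sketch below maps how often a function has run to the tier that would execute it. The tier order (Ignition, Sparkplug, Maglev, Turbofan) comes from the text; the invocation thresholds and the `tier_for` function are invented for illustration, and V8's real tiering heuristics are not described here and are not based on a simple call count like this.

```python
# (minimum invocations, tier name). Thresholds are hypothetical, not V8's.
TIERS = [
    (0, "Ignition"),     # interpreter: starts running immediately
    (10, "Sparkplug"),   # baseline compiler: very fast, non-optimized code
    (100, "Maglev"),     # mid-tier compiler: semi-optimized code
    (1000, "Turbofan"),  # top tier: slowest to compile, most optimized
]


def tier_for(invocations: int) -> str:
    """Return the highest tier whose threshold the function has reached."""
    current = TIERS[0][1]
    for threshold, name in TIERS:
        if invocations >= threshold:
            current = name
    return current
```

The point of the intermediate tiers is that moderately hot code gets cheaper compilation than Turbofan would give it, while only the hottest code pays the full optimization cost.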

Scheduling and OS

To achieve the best possible performance, Android partners invest heavily in tuning the operating system's thread scheduling and frequency scaling policies, as well as improving the performance of the silicon itself.

We worked closely with our partners to improve their tuning for Chrome and Speedometer. In particular, our collaboration with Qualcomm was very fruitful: By combining optimized scheduling policies with improved hardware performance, their newest Snapdragon 8 Elite mobile platform realized a 60-80% improvement in Speedometer 3.0 compared to its predecessor, resulting in class-leading web performance. This collaboration also highlighted important bottlenecks in Chrome's code, such as the need for improved PGO and opportunities in V8.

Speedometer 3.0 on Snapdragon 8 Gen 3 (left) compared to Snapdragon 8 Elite (right), Chrome M131

Why do these improvements matter?

Higher Speedometer scores translate to improvements in real user interactions with web content, such as faster page loads and more responsive interactions. Back at M112, loading a Google Docs document on Pixel Tablet took more than 50% longer than it does today -- that's the effect of a doubled Speedometer score!

Chrome M112 vs. M129 on Pixel Tablet, loading a Google Doc (frame count)

[1] Speedometer 3 was released during M122, so results from Speedometer 2.1 are provided for a full picture. Measurements shown in graphs were taken on Pixel Tablet.