
Case Study

Trip.com Group

How Trip.com Group Switched to Cilium for Scalable and Cloud Native Networking

Challenge

Trip.com Group Limited, a multinational travel service conglomerate based in Shanghai, China, stands as a giant in the travel industry. The company operates several renowned travel fare aggregators and metasearch engines, including Trip.com, Skyscanner, Qunar, Travix, and MakeMyTrip, catering to users in over 40 languages across 200 countries.

Trip.com initially built their Kubernetes platform with an internally developed CNI based on OpenStack Neutron, but quickly ran into issues. As their Kubernetes clusters expanded, they struggled with performance and scalability issues, centralized IPAM bottlenecks, and hardware limitations. The design could not keep up with the rapid growth of the business, and the number of entries on their core network devices was rapidly approaching the hardware limit. They run over 20,000 nodes across on-premises physical clusters and spot instances on Alibaba Cloud and AWS, and needed a solution that would scale with them.

Solution

After evaluating several options, Trip.com found Cilium to be the best fit. Cilium’s cloud native networking model met their scalability requirements, and its use of eBPF instead of iptables made it more performant than the other solutions they evaluated. Network policy let them strengthen their security posture, and Cilium’s rich feature set and active community made it the ideal choice.

Impact

With Cilium in place, Trip.com now has a unified network and security solution across their hybrid cloud. They deploy Kubernetes clusters in both private and public clouds, using Cilium for network connectivity and security policies and Hubble for network and security audit events. Their data plane has maintained stable performance for over four years without any major outages. The cohesive technology stack and the reliability of Cilium have not only addressed the scalability issues of their previous setup but also significantly reduced operational costs, even as their clusters continue to grow.

Published: September 13, 2023

By the numbers

Hybrid cloud with over 20,000 nodes

200,000 Hubble events per second

3,000+ Cilium network policy entries

Switching to Cilium for Scalable and Cloud Native Networking

Trip.com Group Limited, a multinational travel service conglomerate, serves customers in over 40 languages and 200 countries. Their operations are supported by a vast IT infrastructure, with Kubernetes clusters deployed both on-premises and in cloud environments, including AWS and Alibaba Cloud. The platform team has more than 100 people who manage everything from Kubernetes to CI/CD and support 10,000 engineers.

The networking team within the platform team has 9 people, 3 of whom currently work on Cilium. This team manages a deployment that spans over 20,000 nodes, including on-premises physical nodes and spot instances on Alibaba Cloud and AWS, collectively supporting over 350,000 pods.

Originally, Trip.com’s infrastructure relied heavily on an internally developed CNI based on OpenStack Neutron. However, as their Kubernetes clusters began to expand, they encountered several challenges because their solution was built for virtual machines rather than the dynamic world of containers and cloud native. Performance and stability issues quickly became evident: centralized IPAM turned into a bottleneck, and limits on how quickly network device configurations could be changed hurt scalability. Their existing network design was struggling to support the rapid growth of the business, and they were nearing the hardware limit for core network device entries. Motivated by these hurdles, they started searching for a cloud native solution.

Trip.com evaluated several potential options, including popular Kubernetes networking solutions and their own internally developed CNI. Their criteria were clear: they needed a solution that could overcome their current hardware limitations, resolve the performance bottleneck of centralized IPAM, enhance cluster scalability, be cloud native for future integration with Kubernetes, provide network policy for security, work across their hybrid cloud, and provide superior data plane performance.

After their evaluation, Cilium emerged as the ideal choice. Its node-local networking model and use of eBPF instead of iptables perfectly aligned with Trip.com’s scalability requirements. Moreover, Cilium’s cloud native and feature-rich nature was exactly what Trip.com was looking for. The active and vibrant community backing Cilium further solidified their decision.
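To illustrate why a node-local model can sidestep a centralized IPAM bottleneck, here is a minimal Python sketch. It is not Cilium’s implementation and not Trip.com’s configuration: the class name, function name, CIDR sizes, and node names are all hypothetical. The idea it shows is that a cluster range is carved into per-node blocks once, after which each node hands out pod IPs from its own block without contacting a central service.

```python
import ipaddress


class NodeLocalIPAM:
    """Hypothetical per-node allocator: serves pod IPs from one node's block."""

    def __init__(self, pod_cidr):
        self.network = ipaddress.ip_network(pod_cidr)
        self._free = list(self.network.hosts())  # usable addresses in the block
        self._used = {}  # pod name -> allocated IP

    def allocate(self, pod):
        # Idempotent: asking again for the same pod returns the same IP.
        if pod in self._used:
            return self._used[pod]
        if not self._free:
            raise RuntimeError(f"pod CIDR {self.network} exhausted")
        ip = self._free.pop(0)
        self._used[pod] = ip
        return ip

    def release(self, pod):
        # Return the pod's IP to the local pool, if it had one.
        ip = self._used.pop(pod, None)
        if ip is not None:
            self._free.append(ip)


def carve_node_cidrs(cluster_cidr, node_prefix, nodes):
    """One-time step: give each node a distinct block of the cluster range."""
    blocks = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    return {node: str(block) for node, block in zip(nodes, blocks)}


# Illustrative values only.
cidrs = carve_node_cidrs("10.0.0.0/16", 24, ["node-a", "node-b"])
node_a = NodeLocalIPAM(cidrs["node-a"])
node_b = NodeLocalIPAM(cidrs["node-b"])
print(node_a.allocate("pod-1"), node_b.allocate("pod-1"))  # 10.0.0.1 10.0.1.1
```

Because the blocks never overlap, the two nodes can allocate concurrently without coordinating, so the only central step is the initial block assignment.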

“We tried Flannel and Calico and to extend our own solution to support Kubernetes, but found that Cilium’s approach of replacing iptables with eBPF and removing kube-proxy created a much more performant and scalable solution. With Cilium, we have very fast IPAM and it just scales even with thousands of nodes in the clusters.”

Jaff Cheng, Senior Software Developer, Trip.com

Cilium’s Benefits Beyond Just Networking

After choosing Cilium, Trip.com began the transition from their existing networking setup to Cilium. Their deployment strategy was comprehensive: in the private cloud, they used Cilium’s direct routing combined with BIRD for BGP route advertisement, while in the public cloud, they used the corresponding IPAM plugin to allocate IP addresses from VPC subnets. This gives them the same networking experience wherever their clusters are running.

Cilium also gives them a consistent security experience across clouds. Security policies are synchronized between clusters through Kubernetes federation, and they use Cilium’s host firewall feature to enforce policies on both pods and hosts. Network access and audit events are collected by Hubble and displayed via a self-managed ELK stack. Hubble also captures network flow events, such as TCP connection requests, which help them understand what happened to their applications at a given point in time.
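As a rough illustration of the "what happened at a certain point in time" use case, the Python sketch below filters flow records by time window and verdict. The record layout (`time`, `verdict`, `src`, `dst`, `dst_port`) is a simplified, hypothetical shape chosen for this example; it is not Hubble’s actual export schema, and the sample events are invented.

```python
import json
from datetime import datetime


def flows_in_window(lines, start, end, verdict=None):
    """Yield flow records with start <= time <= end, optionally by verdict."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the event stream
        flow = json.loads(line)
        ts = datetime.fromisoformat(flow["time"])
        if not (start <= ts <= end):
            continue
        if verdict is not None and flow.get("verdict") != verdict:
            continue
        yield flow


# Invented sample events, one JSON object per line.
sample = [
    '{"time": "2023-09-13T10:00:01+00:00", "verdict": "FORWARDED",'
    ' "src": "shop/web", "dst": "shop/db", "dst_port": 3306}',
    '{"time": "2023-09-13T10:00:05+00:00", "verdict": "DROPPED",'
    ' "src": "shop/batch", "dst": "shop/db", "dst_port": 3306}',
    '{"time": "2023-09-13T11:30:00+00:00", "verdict": "DROPPED",'
    ' "src": "shop/web", "dst": "shop/cache", "dst_port": 6379}',
]

start = datetime.fromisoformat("2023-09-13T10:00:00+00:00")
end = datetime.fromisoformat("2023-09-13T10:05:00+00:00")
dropped = list(flows_in_window(sample, start, end, verdict="DROPPED"))
for flow in dropped:
    print(flow["src"], "->", flow["dst"], flow["dst_port"])
```

In a setup like the one described, the same kind of query would more likely run against the ELK stack that stores the events; the sketch only shows the shape of the question being asked.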

“eBPF brings a lot of possibilities that allows Cilium to build powerful features on top of it”

Jaff Cheng, Senior Software Developer, Trip.com

Establishing a Unified Network and Security Solution with Cilium

Trip.com’s transition to Cilium is a huge success for their platform team. It addressed their network scalability challenges, reduced operational costs, and enhanced stability. The shift allowed them to centralize network functionalities across all of their infrastructure and prepare for future growth. Beyond just networking, Cilium also gives Trip.com the ability to observe and secure their applications. As they look to the future, Trip.com plans to further leverage Cilium’s features for deeper insights and performance optimization, ensuring they meet the evolving demands of the travel industry.

“Cilium is just stable. We have been running it in production for almost 5 years and we haven’t had any major incidents in the dataplane which is very important for our applications. When you don’t have a problem, you just don’t notice it. We believe Cilium is not only production ready for large scale, but also one of the best candidates in terms of performance, features, and community” 

Jaff Cheng, Senior Software Developer, Trip.com

To dive into the technical details of their use of Cilium, check out these blogs: 

And their talk at the last eBPF Summit: