In this post, I'll explore how to think about breaking circular dependencies between critical system-level components. We’ll use Katanemo’s control plane vs. data plane architecture to highlight the problem, and share a simple solution that scales well and is secure. But first, a few words about Katanemo, as it's my first blog post with them.
Nearly a year ago, I joined Katanemo to re-imagine the role of infrastructure for developers building SaaS applications in the cloud. Why SaaS? Because I have written more than three tier-1 services during my time at Microsoft, Cloudera and Oracle, and in each of those instances I failed to adequately balance crufty, error-prone infrastructure work with desirable feature work. In every single instance, multi-tenant infrastructure design became the rate limiter to innovation.
Katanemo is how we are re-imagining the role of infrastructure for SaaS builders. Our guiding principle has been to envision a future where infrastructure becomes virtually invisible: the future of infrastructure is no infrastructure. The rapid proliferation of infrastructure primitives has made it daunting for developers to make the right selections and then operate and scale those choices, diverting attention from their core competency.
As we build towards this future, we have taken a component-level approach to making the lives of developers easy. Today, Katanemo helps developers instantly build critical authentication and safety features so they can focus on moving faster. It's an identity (CIAM) and fine-grained permissions service, unified as one. With Katanemo’s purpose-built (SaaS) CIAM, developers can instantly sign up users and teams, and upsell into enterprise use cases in minutes with robust governance features. One problem: Katanemo is itself a SaaS offering, so what CIAM stack should we use for our own service?
One key challenge we faced in building a unified identity and fine-grained permissions service was that we needed something like that for our own cloud service. How robust and delightful would Katanemo be if we weren’t confident in using our own service?
Have you heard the phrase “eating your own dog food” or “drinking your own champagne”? Well, to measure the robustness and completeness of our solution, Katanemo had to be built on Katanemo. This approach offers numerous advantages, like catching bugs before our customers do, measuring the effectiveness of our workflows and the developer experience, feeling the pain of any performance degradation, etc. But this approach also creates a circular dependency in system-level design: our control plane now depends on our data plane, which depends on our control plane. 🤯
Before diving further into breaking circular dependencies, let's take a moment to clarify the terms "control plane" and "data plane". These concepts have gained popularity through their implementation in massive-scale cloud services. At its core, this design pattern aims to segregate critical elements such as metadata, configurations, upgrades, etc. (Control Plane) from the core functionality of the service (Data Plane) along the data path.
The Control Plane assumes the responsibility of storing, managing, and controlling the actual execution layer: the aptly named Data Plane. To explain this concept in simple terms, let's consider the following example. Imagine an airport with a Control Tower and airplanes. In this analogy:
Control Plane: The Control Tower represents the Control Plane. It is the central hub that handles all the planning, coordination, and decision-making. Air traffic controllers sit in the Control Tower, managing the overall air traffic, issuing instructions, and ensuring the safe and efficient movement of airplanes. They deal with tasks like deciding which planes should take off, land, change altitude, or follow specific routes.
Data Plane: The airplanes themselves represent the Data Plane. They are the core entities performing the actual tasks, such as carrying passengers and cargo from one location to another. The airplanes follow the instructions given by the Control Tower and execute specific actions accordingly. They do not worry about the overall coordination; their main focus is to perform their designated function effectively: getting passengers from point A to point B.
In Summary:
--> Control Plane (Control Tower): Responsible for planning, coordination, and decision-making.
--> Data Plane (Airplanes): Responsible for executing specific tasks based on the instructions received from the Control Plane.
Now let’s dive into the problem: building Katanemo on Katanemo creates a circular dependency between our Control plane (the service that handles calls to create roles, permission policies, etc.) and our Data plane (the service that authorizes requests at scale).
Our Control plane must rely on the Data plane (the authorizer) to determine whether a request to create roles, resource permissions or ABAC policies should be allowed or denied. Simultaneously, the authorizer service depends on the Control plane for critical data, such as role details and resource tags, to evaluate an authorization request. If we don't break this circular dependency, a single request to our Control plane might never terminate: Call to Control plane to create a role --> Call to the authorizer (Data plane) to validate permissions --> Call to Control plane to access permissions data --> loop.
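To make the loop concrete, here is a minimal Python sketch of the cycle described above. Every name in it (ControlPlane, Authorizer, MAX_HOPS, the "roles:create" action string) is hypothetical and chosen for illustration; it is not Katanemo's actual code. A hop counter is added purely so the demo stops instead of recursing forever.

```python
# Illustrative sketch only: all class, method, and action names are assumptions.
MAX_HOPS = 10  # arbitrary guard so the demo halts instead of looping forever


class CircularDependencyError(RuntimeError):
    """Raised when the call chain bounces between the two planes too many times."""


class Authorizer:
    """Data plane: decides allow/deny, but fetches role data from the control plane."""

    def __init__(self):
        self.control_plane = None  # wired up once both objects exist

    def authorize(self, caller_role, action, hops=0):
        if hops > MAX_HOPS:
            raise CircularDependencyError(f"gave up after {hops} hops")
        # Reading role details is itself a control plane call...
        permissions = self.control_plane.get_role(caller_role, caller_role, hops + 1)
        return action in permissions


class ControlPlane:
    """Control plane: manages roles, but asks the authorizer before every call."""

    def __init__(self, authorizer):
        self.authorizer = authorizer
        self.roles = {"admin": {"roles:create", "roles:read"}}

    def get_role(self, caller_role, role_name, hops=0):
        # ...which must be authorized first, and so the loop closes.
        if not self.authorizer.authorize(caller_role, "roles:read", hops + 1):
            raise PermissionError("denied")
        return self.roles[role_name]

    def create_role(self, caller_role, role_name, permissions):
        if not self.authorizer.authorize(caller_role, "roles:create"):
            raise PermissionError("denied")
        self.roles[role_name] = set(permissions)


def demo_cycle():
    authorizer = Authorizer()
    control_plane = ControlPlane(authorizer)
    authorizer.control_plane = control_plane
    try:
        control_plane.create_role("admin", "viewer", {"roles:read"})
    except CircularDependencyError:
        return "cycle"
    return "ok"
```

Calling demo_cycle() never reaches the line that stores the new role: each authorization check triggers a role lookup, which triggers another authorization check, until the guard trips.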
We considered several approaches to solve the above circular dependency. For example, one approach involved making the Control plane a super service with privileged access to read/write authorization rules in our DynamoDB tables. At best, this would involve a shared common library between the Control plane and our authorization engine (Data plane). But we quickly saw the authorization logic diverging, with deeply nested if/else statements in the Control plane. Not only did this violate "drinking your own champagne", but it also increased the blast radius of a security incident if the super service ever got compromised.
To address this problem, we decided to introduce an intermediary storage layer between the Control plane and the Authorizer (Data plane), which acts as the source of truth for the state of various objects like resource permissions and role-based policies. This not only resolved the circular dependency but also created a key advantage: our services can now scale independently, based on their actual load. No longer will the Control plane be overwhelmed by Authorizer requests, and vice versa.
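As a rough illustration of this design, the sketch below puts a shared store between the two services. It assumes an in-memory dictionary as a stand-in for the intermediary storage layer, and the names PolicyStore, put_role, get_role, and the seeded "admin" role are all hypothetical; the post does not describe the real storage technology, schema, or bootstrap process.

```python
# Illustrative sketch only: names and the in-memory store are assumptions.
class PolicyStore:
    """Stand-in for the intermediary storage layer (source of truth for roles)."""

    def __init__(self):
        self._roles = {}

    def put_role(self, name, permissions):
        self._roles[name] = frozenset(permissions)

    def get_role(self, name):
        # Unknown roles have no permissions, so the default is deny.
        return self._roles.get(name, frozenset())


class Authorizer:
    """Data plane: reads role data from the store, never from the control plane."""

    def __init__(self, store):
        self.store = store

    def authorize(self, caller_role, action):
        return action in self.store.get_role(caller_role)


class ControlPlane:
    """Control plane: still authorized by the data plane, but writes go to the store."""

    def __init__(self, store, authorizer):
        self.store = store
        self.authorizer = authorizer

    def create_role(self, caller_role, role_name, permissions):
        if not self.authorizer.authorize(caller_role, "roles:create"):
            raise PermissionError(f"{caller_role} may not create roles")
        self.store.put_role(role_name, permissions)


def build_system():
    store = PolicyStore()
    # Assumed bootstrap step: seed one role so the first request can be authorized.
    store.put_role("admin", {"roles:create", "roles:read"})
    authorizer = Authorizer(store)
    return store, authorizer, ControlPlane(store, authorizer)
```

In this arrangement the Control plane still depends on the Authorizer for its allow/deny decision, but the Authorizer depends only on the store, so the call chain always terminates. Because the two services share nothing but the store, each can be scaled on its own.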
The above solution is based on our past experiences and lessons learned while building infrastructure services at the likes of AWS, Meta, Lyft, etc. Stay tuned for more exciting updates and insights into our continuous journey of improvement!