Build Your K8s Environment For The Real World Part 1 – Day Zero Ops

Michael Levan

When you’re designing a Kubernetes environment, whether it’s small or large, there are a few things that you must think about prior to writing the code to deploy the cluster or implementing the GitOps Controller for all of your Continuous Delivery needs. First, you must plan. Planning is the most important phase.

In blog one of the Build Your K8s Environment For The Real World series, you’ll learn all of the Day Zero Ops needs, why they’re important, and ultimately how to think about them to ensure a successful Kubernetes build-out.

Security Fundamentals

Your overall security planning comes on Day Zero. (We’ll cover implementation, cluster hardening, container image scanning, and other best practices on Day One and Day Two).

At Day Zero, the biggest two questions are:

  1. Do I have security requirements from management/leadership?
  2. Do I have compliance measures I need to follow?

For example, in a healthcare environment you’ll need to follow certain PHI and HIPAA regulations. These aren’t exactly technical, but they will come into play when you’re thinking about where you want to run Kubernetes.

Aside from those two questions, consider security best practices from the perspectives of users, infrastructure, networking, and Kubernetes.

A user perspective includes authentication, authorization and Role-Based Access Control (RBAC), which you’ll be learning about in the next section.

An infrastructure perspective considers where and how the clusters run. If you’re creating Kubernetes clusters on-prem, all of the virtual machines they run on must be securely managed, including applying updates and patches and fixing security issues in the applications and binaries deployed to the cluster.

For networking in Kubernetes, you can use a service mesh, a security-centric Container Network Interface (CNI) such as Cilium or Calico, or a combination of both. You’ll learn more about these throughout the series, but in short, a service mesh and security-centric CNI are what help encrypt east-west traffic using mTLS. A service mesh also provides observability.
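As a concrete sketch of the service mesh piece, assuming Istio is the mesh in use, a single mesh-wide PeerAuthentication resource can enforce mTLS for all east-west traffic (the `istio-system` root namespace is Istio’s default; adjust for your install):

```yaml
# Hypothetical sketch: enforce strict mTLS mesh-wide.
# Assumes Istio is installed with istio-system as its root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace makes this mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plaintext (non-mTLS) traffic
```

With `STRICT` mode, workloads without sidecar proxies can no longer talk to meshed workloads, so roll this out per-Namespace first if you’re migrating an existing cluster.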

Last, and certainly not least, is securing Kubernetes itself. There’s a plethora of information and best practices around this, but a solid list to follow is:

  1. Network segregation and encrypted traffic with service mesh and a security-centric CNI along with eBPF
  2. Proper Namespace configuration with RBAC permissions tied to each Namespace
  3. Resource quotas to ensure that only “X” amount of resources are allowed to be created
  4. Proper Quotas, Limits, and Requests per Namespace
  5. Proper network policies to ensure that network traffic does not cross Namespace or label boundaries.
    1. For example, Pods labeled app=client1 should not be able to reach app=client2
  6. Open Policy Agent (OPA) in place for overall policy management.
  7. Audit logging
  8. RBAC

Outside of the above list, I recommend looking at the CIS Benchmarks for Kubernetes, which you can find here: https://www.cisecurity.org/benchmark/kubernetes
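To make item 5 concrete, here’s a sketch of a standard Kubernetes NetworkPolicy that isolates one tenant’s Pods from another’s (the Namespace name and labels here are hypothetical):

```yaml
# Hypothetical sketch: allow ingress to app=client1 Pods only from other
# app=client1 Pods in the same Namespace; all other ingress is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-client1
  namespace: client1        # hypothetical tenant Namespace
spec:
  podSelector:
    matchLabels:
      app: client1          # policy applies to these Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: client1  # only same-label Pods may connect
```

Note that NetworkPolicy enforcement requires a CNI that supports it, such as Cilium or Calico.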

RBAC

A few different types of identities will need access to Kubernetes clusters:

  1. Users
  2. Teams
  3. Service Accounts

Users and teams are the engineers working on the Kubernetes environment. Service accounts are used to deploy Kubernetes Resources (Pods, Services, etc.) to the cluster.

Although users/teams and service accounts are very different and used very differently, they all need permissions. Furthermore, they should only have the permissions and privileges they need to perform their tasks. Limiting permissions helps reduce the risk of an account being used for unwanted or malicious behavior.

When planning your authentication and authorization architecture, map out which engineers need access and which don’t, along with the type of access each engineer or group of engineers needs. For example, not all engineers need write access to the production cluster. The same goes for service accounts.
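That mapping ends up expressed as Roles and RoleBindings. As a sketch (the Namespace, group name, and resource list are all hypothetical), a read-only grant for a team in production might look like:

```yaml
# Hypothetical sketch: a read-only Role for the production Namespace,
# bound to an identity-provider group. Names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: production
rules:
  - apiGroups: ["", "apps"]                       # core + apps API groups
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch"]               # read-only verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: production
subjects:
  - kind: Group
    name: platform-engineers    # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io
```

Engineers in that group can inspect workloads in production but cannot create, update, or delete anything.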

One important note to remember is that Kubernetes RBAC (the Role and RoleBinding resources in the rbac.authorization.k8s.io API group) only deals with authorization/permissions, not authentication. You need a separate approach for authentication. Because of this, organizations will typically look for an OpenID Connect (OIDC) solution that handles both authentication and authorization for Kubernetes. Two common solutions in the cloud are Azure Active Directory and AWS IAM.

Kubernetes Architecture

What are your business needs? Do you need one cluster? Two clusters? Multitenancy? Multi-cloud?

Creating a Kubernetes cluster and turning on auto-scaling in the cloud is a straightforward task. Creating a Kubernetes environment for what your business actually needs is an entirely different task.

When you’re planning the architecture for your Kubernetes environment, ask these questions:

  1. How many applications are running? This will give you an idea of what resources you need (CPU, memory, etc.).
  2. How many third-party services (GitOps, Service Mesh, etc.) do you have for the Management Cluster (a cluster used by itself for these services without anything else installed)?
  3. Do you have any idea of the resources needed for the application stack that’s currently running?
  4. Do you have any specific compliance requirements to meet?
  5. What dependencies exist? For example, do you have a database you need to connect to?

With the above five questions in mind, you should be able to determine how and where you’re going to run a Kubernetes cluster.

The recommended starting size is 3 Control Plane nodes (an odd number keeps etcd quorum healthy; go to 5 for larger environments) and 3 to 5 Worker Nodes. If you’re running a Kubernetes cluster in the cloud (GKE, AKS, EKS, etc.), you won’t have to worry about the Control Plane portion, just the Worker Nodes. In terms of the size/resources (CPU/memory/storage) available for each Worker Node, it really comes down to question number 3 – the resources needed for the application stack.
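In a managed cloud, that sizing can be captured declaratively. Here’s a minimal sketch using an eksctl ClusterConfig for EKS (the cluster name, region, and instance type are assumptions; size the instances based on your application stack):

```yaml
# Hypothetical sketch: an EKS cluster with 3-5 autoscaling Worker Nodes.
# AWS manages the Control Plane, so only node groups are declared here.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster        # hypothetical cluster name
  region: us-east-1         # hypothetical region
managedNodeGroups:
  - name: workers
    instanceType: m5.large  # pick based on your application stack's needs
    minSize: 3
    maxSize: 5
    desiredCapacity: 3
```

The equivalent can be expressed in Terraform or Pulumi; the point is that node counts and sizes belong in versioned code, not in a console.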

Considering Best Practices

Here are a few standard best practices:

  1. Ensure that you set Limits, Requests, and Quotas.
  2. Ensure proper application stack segregation with Namespaces.
  3. Ensure that users/teams only have permissions to the Kubernetes resources they need.
  4. Always pick tools/platforms and solutions for a specific business need, not just because they look fun to use.
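Items 1 and 2 above go hand in hand: segregate each application stack into its own Namespace, then cap what that Namespace can consume. A sketch (the Namespace name and all numbers are illustrative, not recommendations):

```yaml
# Hypothetical sketch: a per-team Namespace with a ResourceQuota capping
# the aggregate Requests, Limits, and Pod count inside it.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"       # sum of all container CPU requests
    requests.memory: 8Gi    # sum of all container memory requests
    limits.cpu: "8"         # sum of all container CPU limits
    limits.memory: 16Gi     # sum of all container memory limits
    pods: "20"              # max Pods in the Namespace
```

Once a ResourceQuota is in place, Pods without explicit requests/limits are rejected in that Namespace, which conveniently enforces best practice number 1 as well.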

The Kubernetes docs are a great place to start in terms of defining what best practices should look like for you as they give you a few solid ideas. Below are some links.

Best practices: https://kubernetes.io/docs/setup/best-practices/

Configuration best practices: https://kubernetes.io/docs/concepts/configuration/overview/

Security best practices: https://kubernetes.io/blog/2016/08/security-best-practices-kubernetes-deployment/

While there are general best practices that apply to most organizations, your own company will also have its own set of requirements, risk profiles, compliance mandates, and so on. These company-specific requirements will also influence what constitutes a “best practice.” For example, your company might prioritize fixing security vulnerabilities immediately, while another company might address vulnerabilities during scheduled change windows. Take advantage of policy enforcement tools like Open Policy Agent (OPA) and Kyverno to define your best practices as policies and ensure that they’re followed.
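As an example of encoding a best practice as policy, here’s a sketch of a Kyverno ClusterPolicy that rejects any Pod whose containers lack CPU/memory requests and limits (assumes Kyverno is installed; the policy name is illustrative):

```yaml
# Hypothetical sketch: require resource requests and limits on every container.
# With validationFailureAction: Enforce, non-compliant Pods are rejected
# at admission time rather than merely reported.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"     # any non-empty value
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

Start with the audit mode instead of Enforce while you measure how much of your existing workload would fail the policy.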

Tools and Platforms

To wrap up your Day Zero Ops plan, you’ll have to think about the tools and platforms you need to ensure a robust Kubernetes environment.

The categories include:

  1. Autoscaler
  2. Service Mesh
  3. Monitoring
  4. Observability
  5. OIDC (authentication and authorization)
  6. Cluster Deployment
  7. Application Deployment
  8. Secrets Management
  9. Scanning (cluster scanning, container image scanning, etc.)
  10. Cost and resource optimization

There are a lot of tools in these categories. For example, from a Cluster Deployment perspective, are you going with Terraform? Pulumi? CloudFormation? For everything else, especially in the CNCF landscape, there are over 1,300 tools. Because of the vast number of tools that all essentially do the same thing, it comes down to one important question: do you want an enterprise solution or a homegrown solution?

“Enterprise solution vs homegrown solution” means this – do you want to use Datadog or do you want to patch together multiple open-source tools for monitoring and observability? Do you want to pay licensing costs for a tool so it’s all in one place and managed for you, or do you have enough engineering staff to maintain an open-source solution?

There’s no right or wrong answer, as you’re paying either way: either a tool/platform to do the work for you, or engineers to do it. Both paths have pros and cons, and both can work well. It really comes down to the organization.

About Michael Levan: Michael Levan is a seasoned engineer and consultant in the Kubernetes space who spends his time working with startups and enterprises around the globe on Kubernetes and cloud-native projects. He also performs technical research, creates real-world, project-focused content, and coaches engineers on how to cognitively embark on their engineering journey. He is a DevOps pro, HashiCorp Ambassador, AWS Community Builder, and loves helping the tech community by public speaking internationally, blogging, and authoring tech books.