Deploying Kubernetes on AWS EC2: Infrastructure Management Made Easy for Data Engineers

Introduction
In data engineering, efficient infrastructure management is as crucial as building robust data pipelines. As data volumes grow, scalability and resilience become essential for seamless processing and analytics. Kubernetes simplifies this by automating deployment, scaling, and management of containerized applications, ensuring efficient resource use and fault tolerance. While AWS EKS provides a fully managed Kubernetes service, AWS EC2 offers a flexible, scalable environment for setting up Kubernetes clusters on virtual machines, optimizing performance and cost. In this blog, we’ll explore how to deploy Kubernetes on AWS EC2 to streamline infrastructure management for data engineering.
About Kubernetes
Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It was originally developed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). Kubernetes provides a unified way to manage infrastructure, ensuring that applications run reliably across distributed environments. For data engineers, infrastructure management often involves provisioning and maintaining resources to support ETL pipelines, data lakes, real-time streaming, and machine learning workloads. Kubernetes streamlines this process in several ways:
Automated Deployment & Scaling – Kubernetes automatically manages the deployment of containerized workloads and scales them based on demand. This ensures efficient resource utilization, whether handling small data transformations or large-scale batch processing.
High Availability & Fault Tolerance – Kubernetes ensures system resilience by distributing workloads across multiple nodes. If a container fails, Kubernetes automatically restarts it or redistributes the workload, preventing data pipeline failures.
Resource Optimization – Kubernetes allows engineers to define CPU and memory limits, ensuring that applications don’t consume excessive resources. This is particularly useful when running multiple workloads on shared infrastructure.
Simplified Multi-Cloud & Hybrid Deployments – Kubernetes abstracts away underlying infrastructure, enabling seamless deployment across on-premises, AWS, Google Cloud, or Azure environments. This flexibility is crucial for enterprises with hybrid or multi-cloud strategies.
Efficient Job Scheduling – Kubernetes provides job scheduling capabilities for batch workloads, ensuring that ETL or AI/ML pipelines run at optimal times with the necessary resources.
Service Discovery & Load Balancing – Kubernetes automatically manages networking between services, making it easier to build and maintain distributed data processing applications.
Kubernetes has become a cornerstone of modern data engineering due to its ability to provide a consistent, scalable, and automated infrastructure. Its popularity stems from industry adoption by major cloud providers like AWS, support for big data and AI workloads, cost efficiency, and open-source support.
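To make the job-scheduling point above concrete, a batch ETL step can be expressed as a Kubernetes Job with resource requests and a retry limit. This is only an illustrative sketch: the job name, image, and command below are placeholders, not part of any setup described in this post.

```yaml
# Hypothetical batch ETL step as a Kubernetes Job (names, image, and command are illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-etl
spec:
  backoffLimit: 2              # retry a failed pod up to 2 times
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: etl
        image: python:3.12-slim                      # placeholder image
        command: ["python", "-c", "print('run ETL step here')"]
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
```

Kubernetes runs the Job's pod to completion, retries on failure up to `backoffLimit`, and schedules it only on nodes with the requested CPU and memory available.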
About AWS EC2
Amazon Elastic Compute Cloud (AWS EC2) is a scalable cloud computing service that provides virtual servers, known as instances, to run applications. It allows users to deploy, configure, and manage virtual machines with flexible CPU, memory, storage, and networking options. EC2 supports auto-scaling, load balancing, and on-demand provisioning, making it ideal for hosting applications, databases, and services like Kubernetes. By leveraging Kubernetes on AWS EC2, data engineers can deploy and manage their workloads efficiently, ensuring scalability, reliability, and automation—all critical factors in modern data-driven applications. In the next sections, we’ll explore how to set up a Kubernetes cluster on AWS EC2 to optimize data engineering workflows.
Configuration Steps
To configure the Kubernetes cluster on an AWS EC2 instance, the following steps are involved.
Navigate to AWS EC2
Open the AWS Management Console, search for EC2, and navigate to the EC2 dashboard.

Launch a New EC2 instance
Click on Launch Instance to create a new EC2 instance.

Configuration of EC2 instance
Provide a suitable name for the EC2 instance, and select Ubuntu as the machine image.

As per Kubernetes documentation, a VM should have at least 2 GB RAM and 2 CPUs for deployment. Select an instance type accordingly. Here, we choose t2.medium (2 vCPUs and 4 GB memory), which is suitable for Kubernetes deployment.

Generate a new SSH Key Pair for connecting to the instance.

Then, we need to configure network settings to open the required ports on the VM. As per the Kubernetes documentation, the following ports need to be open: on the control plane, TCP 6443 (Kubernetes API server), 2379-2380 (etcd client API), 10250 (kubelet API), 10257 (kube-controller-manager), and 10259 (kube-scheduler); on worker nodes, TCP 10250 (kubelet API), 10256 (kube-proxy), and 30000-32767 (NodePort services).

Since we are deploying Kubernetes on a single EC2 instance (acting as both the control plane and worker node), all required ports must be open.
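For readers who script their infrastructure, the same ingress rules can be added with the AWS CLI instead of the console. This is a sketch under assumptions: the security-group ID is a placeholder, the port list follows the Kubernetes documentation, and `0.0.0.0/0` should be narrowed to your own IP range in practice.

```shell
# Placeholder security-group ID; substitute the group attached to your EC2 instance.
SG_ID=sg-0123456789abcdef0

# Open the ports Kubernetes documents for control-plane and worker nodes
# (API server, etcd, kubelet, kube-proxy, controller-manager, scheduler, NodePort range).
for PORT in 6443 2379-2380 10250 10256 10257 10259 30000-32767; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$PORT" --cidr 0.0.0.0/0
done
```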

Configure storage for the VM by allocating 30 GB of gp3 storage, as Kubernetes needs extra space for container images and workloads.

Finally, click on “Launch Instance” to create a new instance with the configured settings.
This will take some time, and once the VM is ready, we can connect to it using SSH and then start configuring Kubernetes inside this EC2 VM.

Connecting to EC2 Instance using SSH
We will use the previously generated SSH Key to connect to the EC2 instance.
Here, select the EC2 instance and then click on Connect.

It will now provide the commands that can be used to connect to the VM using its public DNS.

The following command is used:

ssh -i "ec2-kube-vm-key.pem" ubuntu@ec2-34-203-28-122.compute-1.amazonaws.com

Once connected, we get into the Ubuntu CLI as shown below.

Configuration of Kubernetes
Update and install the necessary packages and dependencies:

sudo apt update && sudo apt upgrade && sudo apt autoremove

Update /etc/hosts with the hostname and private IP of the VM.

Reboot VM
sudo reboot

After some time, connect to the VM again using SSH and then disable swap by running the following command.

sudo swapoff -a

Kubernetes requires swap to be disabled for stable memory management.
Load the required kernel modules by running the following commands:

sudo tee /etc/modules-load.d/containerd.conf <<EOF
br_netfilter
EOF
sudo modprobe br_netfilter
sudo sysctl -w net.ipv4.ip_forward=1
lsmod | grep br_netfilter
sysctl net.ipv4.ip_forward

Here we load br_netfilter, which enables bridged traffic filtering, allowing the Linux kernel to process network packets that pass through bridges. net.ipv4.ip_forward enables IP forwarding, allowing the system to act as a router and forward traffic between networks. Kubernetes uses container networking extensively, and these settings ensure proper communication between containers, pods, and nodes.
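One caveat worth noting: `sysctl -w` applies a setting only until the next reboot. A minimal sketch of persisting it via the standard sysctl.d mechanism follows; the bridge-nf-call-iptables line is not in the steps above but is commonly recommended alongside br_netfilter in the Kubernetes docs.

```shell
# Persist kernel settings across reboots; `sysctl -w` alone does not survive one.
cat <<'EOF' | sudo tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sudo sysctl --system   # reload all sysctl configuration files
```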
Install Container Runtime
A container runtime is a software responsible for running and managing containers on a host system. It handles tasks like pulling container images, creating and starting containers, allocating resources, and managing the lifecycle of containers. Kubernetes itself does not directly run containers—it relies on a container runtime to do so, hence we install a container runtime.
Here I have installed containerd as the container runtime, as it is lightweight, efficient, and backed by the CNCF. Alternatively, you can use other container runtimes such as CRI-O, Docker Engine, or Mirantis Container Runtime.
Before that, install the necessary dependencies for containerd:

sudo apt install -y curl gnupg2 software-properties-common apt-transport-https ca-certificates

Download the GPG key for Docker’s official repository and add a new repository to the apt package manager’s list of sources. This allows us to install software from Docker’s official repository.
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/docker.gpg
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

Update the apt repository and then install containerd.

sudo apt update
sudo apt install -y containerd.io

Configure containerd for Kubernetes deployment:
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null 2>&1
sudo sed -i 's/SystemdCgroup \= false/SystemdCgroup \= true/g' /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable containerd

Enabling SystemdCgroup = true ensures compatibility with the host system's resource management (especially on systemd-based Linux distributions).

Install Kubernetes (kubelet, kubectl, kubeadm)
For this, download the GPG public key for the Kubernetes package repository and add the Kubernetes repository to the system’s APT configuration, which allows us to install Kubernetes components from the official source.
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list

Then update the repository and install Kubernetes.

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

Initialize Kubernetes Cluster using Kubeadm
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --control-plane-endpoint=34.203.28.122

Here we specify the IP address range (CIDR block) to be used for pod networking.
This takes some time, and on a successful run you will get the following output. The 10.244.0.0/16 range means that pod IPs will be assigned from this subnet (e.g., 10.244.0.1, 10.244.0.2, etc.). This option is often required when using certain CNI (Container Network Interface) plugins like Flannel, which rely on a predefined pod network CIDR. We also specify the endpoint (the public IP address or DNS name of the VM) that the control plane uses for communication.
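As a quick sanity check of the numbers involved (not part of the cluster setup itself), the pod CIDR can be inspected with Python's standard ipaddress module, available as python3 on the Ubuntu image used here:

```shell
# Inspect the pod network chosen above: how many addresses it holds and the first pod IPs.
python3 - <<'EOF'
import ipaddress
net = ipaddress.ip_network("10.244.0.0/16")
print(net.num_addresses)   # total addresses in the pod network: 65536
print(net[1], net[2])      # first assignable pod IPs: 10.244.0.1 10.244.0.2
EOF
```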
The output describes the next steps for configuring a multi-node Kubernetes cluster, which include setting up the kubeconfig and joining worker nodes to the Kubernetes control plane.
As we are using a single node as both the control plane and worker node, we skip the join step; instead, we remove the control-plane taint from the master node so that it acts as both the master node and worker node in this single-node Kubernetes cluster.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl taint nodes ip-172-31-26-199 node-role.kubernetes.io/control-plane-

In the current state, if we check the Kubernetes node status, it will be NotReady.

This is because we still need to configure a networking add-on for the Kubernetes cluster. In our case, we will install Flannel as the networking add-on.
Installing the flannel networking addon
We install the Flannel add-on using the following manifest file, flannel.yaml. Flannel is a popular CNI (Container Network Interface) plugin used in Kubernetes to provide networking between pods across different nodes in a cluster. It is a simple, lightweight, and easy-to-configure overlay network that enables communication between containers running on different nodes. In Kubernetes, each pod gets its own unique IP address, and these pods need to communicate with each other regardless of which node they are running on. Flannel ensures that pods can communicate seamlessly across the cluster by creating an overlay network.
For larger clusters or environments requiring advanced networking features, we might consider alternatives like Calico, Cilium, or Weave Net.
---
kind: Namespace
apiVersion: v1
metadata:
  name: kube-flannel
  labels:
    pod-security.kubernetes.io/enforce: privileged
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-flannel
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flannel
  namespace: kube-flannel
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni-plugin
        #image: flannelcni/flannel-cni-plugin:v1.1.0 for ppc64le and mips64le (dockerhub limitations may apply)
        image: docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.0
        command:
        - cp
        args:
        - -f
        - /flannel
        - /opt/cni/bin/flannel
        volumeMounts:
        - name: cni-plugin
          mountPath: /opt/cni/bin
      - name: install-cni
        #image: flannelcni/flannel:v0.20.2 for ppc64le and mips64le (dockerhub limitations may apply)
        image: docker.io/rancher/mirrored-flannelcni-flannel:v0.20.2
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        #image: flannelcni/flannel:v0.20.2 for ppc64le and mips64le (dockerhub limitations may apply)
        image: docker.io/rancher/mirrored-flannelcni-flannel:v0.20.2
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: false
          capabilities:
            add: ["NET_ADMIN", "NET_RAW"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: EVENT_QUEUE_DEPTH
          value: "5000"
        volumeMounts:
        - name: run
          mountPath: /run/flannel
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
        - name: xtables-lock
          mountPath: /run/xtables.lock
      volumes:
      - name: run
        hostPath:
          path: /run/flannel
      - name: cni-plugin
        hostPath:
          path: /opt/cni/bin
      - name: cni
        hostPath:
          path: /etc/cni/net.d
      - name: flannel-cfg
        configMap:
          name: kube-flannel-cfg
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate

In the flannel config, please make sure that the Network value inside net-conf.json matches the pod CIDR used during the kubeadm init command. Then apply the manifest by running the following command.
kubectl apply -f flannel.yaml

Now check the node status and pods again. The node should be in Ready status and all pods should be Running; verify with the following commands.
kubectl get nodes
kubectl get pods -n kube-system
kubectl get pods -n kube-flannel

The following output is expected.

Test Kubernetes cluster operation by deploying a simple "Hello World" NGINX server
To test whether the Kubernetes cluster is ready and capable of deploying applications, we will deploy a simple "Hello World" NGINX server application in the recently configured Kubernetes cluster.
The following manifest file will be used for this "Hello World" application:
hello-world.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello-world-service
spec:
  selector:
    app: hello-world
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer

Run the following commands to deploy this hello-world application in a new namespace.
kubectl create ns hello-world
kubectl apply -f hello-world.yaml -n hello-world

Now check the resources deployed in this new namespace by running the following command:

kubectl get all -n hello-world

The following output is expected.

Because no cloud load-balancer controller runs in this self-managed cluster, the LoadBalancer service is reachable through the NodePort that Kubernetes assigns it, so we can access the hello-world application using the VM’s public IP and NodePort as shown below.
<VM-PUBLIC-IP>:<NodePort>
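If you don't want to read the NodePort off the console output, kubectl can extract it directly from the service created above (this assumes the cluster and namespace from this walkthrough):

```shell
# Print the NodePort Kubernetes assigned to the hello-world service.
kubectl get svc hello-world-service -n hello-world \
  -o jsonpath='{.spec.ports[0].nodePort}'
```

The printed port is what replaces <NodePort> in the URL; by default it falls in the 30000-32767 NodePort range.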

We can see that the application is running successfully, which verifies the successful deployment of Kubernetes inside the EC2 instance. Now we can use this cluster to deploy other tools as required for data engineering projects.
Conclusion
To wrap up, deploying Kubernetes on AWS EC2 provides a flexible and cost-effective way to manage containerized workloads for data engineering projects. By automating the deployment, scaling, and management of applications, Kubernetes ensures efficient resource utilization and fault tolerance, which are critical for handling large-scale data pipelines, ETL jobs, and machine learning workloads. Through this blog, we demonstrated how to set up a Kubernetes cluster on an EC2 instance, configure networking with Flannel, and validate the setup by deploying a simple "Hello World" NGINX application. This foundation not only verifies the cluster's readiness but also showcases how Kubernetes can streamline infrastructure management, making it a powerful tool for modern data engineering workflows. With this setup, you can now confidently deploy and scale tools tailored to your project’s requirements, leveraging the scalability and flexibility of Kubernetes on AWS EC2.


