Your Kubernetes pods might suddenly die, and when you investigate, you find they were OOMKilled. These frustrating terminations can happen even when your nodes have plenty of free memory, and the root cause lies in how the Linux kernel manages memory.
In Kubernetes, exit code 137 signals that the OOM Killer has terminated a process. Several factors can trigger this: resource limits that are set too low or missing entirely, memory leaks in application code, unexpected traffic spikes, or resource-hungry containers competing for memory on the same node. Understanding the OOM Killer itself is just as important: it assigns a score to every process and decides what to terminate based on several criteria, including the pod’s Quality of Service (QoS) class.
This article will help you find hidden memory leaks, understand what really causes OOMKilled errors, and apply practical solutions that keep these issues from disrupting your production environments.

Understanding OOMKilled in Kubernetes: Beyond Exit Code 137
Exit code 137 shows up frequently in Kubernetes logs, and making sense of it requires a closer look at how memory is managed in containerized environments. The code indicates termination by a SIGKILL signal (128 + 9 = 137), which the Linux kernel’s Out-Of-Memory (OOM) Killer usually sends.
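You can reproduce this exit code locally to see where the number comes from. The short demo below assumes a bash shell: it kills a process with signal 9, then inspects the resulting exit status.

# Kill a background process with SIGKILL (signal 9) and inspect its exit status
sleep 100 &
kill -9 $!
wait $!
echo $?   # prints 137 (128 + 9)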
What happens when Kubernetes kills a pod
You might see pods suddenly stopping with the status “OOMKilled” while watching your cluster. Many developers assume Kubernetes sends this signal. In reality, the Linux kernel on the node hosting your pod sends it when the container crosses its memory limit.
The container crossing its memory limit starts this chain of events:
- The Linux kernel terminates the container process with SIGKILL (signal 9)
- Kubelet detects this termination and notifies the Kubernetes control plane
- The pod’s status is updated to show the container as “Terminated” with reason “OOMKilled”
- The pod’s overall status depends on its remaining containers
The pod keeps its “Running” status if some of its containers are still running, even with one container OOMKilled. If every container has stopped, the pod’s status changes to “Failed”. Depending on your restart policy, Kubernetes may restart the killed container, which can turn into an endless restart loop (CrashLoopBackOff) if the underlying memory issue is never fixed.
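When a pod is killed this way, the reason is recorded in its status. A quick way to confirm an OOMKill (the jsonpath variant assumes a single-container pod) is:

# Show the last termination state, including the reason and exit code
kubectl describe pod <pod-name> | grep -A 7 'Last State'

# Or pull the reason straight from the pod status of the first container
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'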
How the Linux OOM Killer makes decisions
The OOM Killer acts as the memory sheriff of Linux systems. It decides which processes live or die when resources run low. This system kicks in under three main conditions: physical memory runs out, swap space is gone, or memory reclamation fails.
The oom_score calculation drives these decisions. Every running process gets a score based on several key factors:
- Memory consumption: Including Resident Set Size (RSS), virtual memory allocation, and shared memory usage
- Process longevity: Older processes get a small advantage
- CPU utilization: Both total CPU time and recent usage affect scoring
- Process privilege: Root processes get 30 points taken off their score
The Linux kernel picks processes with the highest “badness” score to terminate when memory gets tight. Kubernetes adds to this by setting an oom_score_adj value based on the pod’s Quality of Service (QoS) class:
- Guaranteed pods: -997 (least likely to be killed)
- Burstable pods: Variable score between 2-999
- BestEffort pods: 1000 (first to be terminated)
This hierarchy helps critical workloads with properly set requests and limits survive memory pressure at the expense of less important processes.
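You can verify the adjustment Kubernetes applied to your own workload by reading the oom_score_adj of the container’s main process. This quick check assumes the application runs as PID 1 inside the container:

# Guaranteed pods print -997, BestEffort pods print 1000, Burstable pods something in between
kubectl exec <pod-name> -- cat /proc/1/oom_score_adj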
Common misconceptions about OOMKilled errors
People often misunderstand how OOMKilled errors work in Kubernetes environments:
Misconception #1: Exit code 137 always indicates memory issues.
Reality: Memory problems usually cause this error, but failed health checks can also trigger a SIGKILL signal with the same exit code.
Misconception #2: OOM Killer only activates when a node runs completely out of memory.
Reality: The OOM Killer can stop processes even with free memory on the node. It works based on cgroup limits and memory pressure thresholds instead of complete memory depletion.
Misconception #3: Simply increasing memory limits always solves OOMKilled problems.
Reality: More memory might help temporarily, but fixing memory leaks in application code will give a better long-term solution.
Misconception #4: The largest memory consumer always gets killed first.
Reality: The OOM Killer uses a complex scoring system. Memory size is just one factor; QoS class, process age, and other attributes also weigh heavily in termination decisions.
Misconception #5: OOMKilled errors originate from Kubernetes.
Reality: These errors come from the Linux kernel’s memory management system. Kubernetes just reports them and helps influence the kernel’s decisions through resource settings.
These details help developers move from just reacting to OOMKilled errors to building memory-efficient applications with the right resource settings.
Identifying Hidden Memory Leaks in Containerized Applications
Memory leaks hide under the surface of containerized applications. These leaks slowly eat up resources until Kubernetes pods hit OOMKilled status. Each programming language needs its own approach and special tools to spot these hidden resource drains.
Memory leak patterns in Java applications
Java apps in containers face particular memory management challenges. Container memory limits add constraints that older JVM versions (before JDK 10 and 8u191) did not detect by default, so the garbage collector could size the heap as if the whole node’s memory were available. Java memory leaks happen when object references are retained needlessly, which stops garbage collection from freeing that memory.
Common Java leak patterns in Kubernetes include:
- Unbounded caches: Collections that grow without size limits
- ThreadLocal variables: Large objects stuck in thread-local storage
- Improper resource handling: Database connections or streams left open
- Static collections: Collections that keep growing throughout the app’s life
Heap dump analysis is vital for Java apps in containers. Tools like jmap (for example, jmap -dump:format=b,file=heap.bin <pid>) take memory snapshots you can analyze offline. Getting heap dumps in containerized environments can be tricky, though: many production containers ship a JRE instead of a full JDK, which lacks these diagnostic tools.
Here’s what you can do:
- Put JDK in your container image for troubleshooting
- Use Spring Boot Actuator’s /actuator/heapdump endpoint if you have it (see the example after this list)
- Look at multiple heap dumps taken over time to spot growing collections
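If your service ships Spring Boot Actuator, you can pull a heap dump without any JDK tooling in the image at all. A minimal sketch, assuming the heapdump endpoint is exposed and the application listens on port 8080 (both are assumptions about your setup):

# Forward the application port, then download the heap dump over HTTP
kubectl port-forward <pod-name> 8080:8080
curl -o heap.hprof http://localhost:8080/actuator/heapdump

The resulting file opens directly in heap analysis tools such as Eclipse MAT.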
Node.js memory leak detection in Kubernetes pods
Node.js apps often run into memory issues in Kubernetes environments. Problems pop up with closures, callbacks, or event listeners. The V8 JavaScript engine uses generational garbage collection. Reference retention can throw this off track.
The Inspector API helps track down Node.js memory leaks in Kubernetes pods. Send a SIGUSR1 signal to the running Node.js process and port-forward with kubectl port-forward <pod-name> 9229. This lets you connect Chrome DevTools to analyze the heap.
Taking heap snapshots one after another reveals memory growth patterns. In one real-world case, AJV (a JSON schema validation library) was creating thousands of validation objects without reusing them. Node.js 19 and later also changed how heap space sizing relates to container memory limits, which can add garbage collection overhead in Kubernetes.
Python memory management challenges in containers
Python manages memory through a private heap that belongs just to the Python process. Python doesn’t give freed memory back to the operating system right away. Instead, it keeps this memory for future use.
Memory gets released at the arena level, not at individual blocks or pools. Python containers in Kubernetes might show ever-growing memory usage even when garbage collection runs.
Python memory issues in containers usually come from:
- Big objects that don’t get released properly
- Data structures that grow forever without limits
- Circular references that block garbage collection
Running memory-heavy functions in separate processes helps alleviate these issues: the operating system reclaims all of a process’s memory when it exits, regardless of what Python’s internal memory management does. Tools like memory_profiler and the resource module help track memory usage patterns in containers.
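To observe this behavior from outside the application, watch the resident set size of the interpreter process and check whether freed memory ever returns to the operating system. A rough check, assuming the app runs as PID 1 and the image includes grep:

# VmRSS is the current resident set size, VmHWM the peak; compare the values over time
kubectl exec <pod-name> -- grep -E 'VmRSS|VmHWM' /proc/1/status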
Go memory profiling for Kubernetes workloads
Go apps have efficient garbage collection but can still hit memory issues in Kubernetes. Since version 1.19, Go’s memory management works better with container limits thanks to GOMEMLIMIT.
Pprof is a powerful built-in tool to profile Go apps in Kubernetes. Here’s how to use it:
- Get to the /debug/pprof/heap endpoint through port-forwarding
- Get heap profiles using go tool pprof <binary> http://localhost:<port>/debug/pprof/heap
- Look through the results to find memory retention patterns
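A minimal end-to-end sketch of those steps, assuming the application registers net/http/pprof and serves it on port 6060 (both are assumptions about your setup):

# Forward the pprof port from the pod to your workstation
kubectl port-forward <pod-name> 6060:6060

# Fetch and explore the live heap profile; 'top' and 'list' are useful commands inside pprof
go tool pprof http://localhost:6060/debug/pprof/heap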
eBPF-based profiling offers another way to watch memory continuously with little overhead. Unlike traditional profiling, eBPF gathers data regularly without slowing things down much. This makes it great for production environments.
Continuous profiling helps pinpoint where apps use memory resources. Developers can spot and fix performance bottlenecks before Kubernetes pods hit OOMKilled events.
Advanced Kubernetes Pod Memory Debugging Techniques
Debugging memory issues in Kubernetes needs specialized tools to investigate pods that experience OOMKilled terminations. Simple resource metrics are not enough. Teams need advanced analysis techniques to uncover hidden memory consumption patterns that trigger exit code 137 errors.
Using cAdvisor for container memory analysis
cAdvisor (Container Advisor) runs as part of the kubelet on every Kubernetes node and collects container-level resource statistics. This Google-developed tool discovers containers automatically and gathers the key memory metrics. cAdvisor exposes several vital metrics for analyzing pod memory:
- container_memory_usage_bytes - current memory consumption, including memory that has not been accessed recently
- container_memory_working_set_bytes - the current working set, which is what the OOM killer watches
- container_memory_max_usage_bytes - maximum recorded memory usage
- container_memory_failcnt - how many times memory usage has hit the limit
cAdvisor’s metrics help teams identify containers that breach their memory boundaries during Kubernetes pod OOMKilled incidents. These metrics are exposed through a REST API that external monitoring systems can consume.
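The same kubelet/cAdvisor statistics also feed the Metrics API, so you can get a quick view of current per-container usage straight from kubectl, assuming metrics-server is installed in the cluster:

# Current memory (working set) and CPU for each container in the pod
kubectl top pod <pod-name> --containers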
Leveraging Prometheus and Grafana for memory trend visualization
Prometheus serves as a metrics server that scrapes and stores data from various sources, including cAdvisor endpoints. Teams can track the memory consumption patterns that lead to OOMKilled events with Prometheus queries like:
sum(container_memory_working_set_bytes{container!=""}) by (pod) / sum(kube_pod_container_resource_limits{resource="memory"}) by (pod) > 0.8
This query helps find pods that use more than 80% of their memory limits, suggesting potential problems.
Grafana connects to Prometheus data sources and creates dashboards that show memory trends over time. These visualizations help teams spot the gradual memory growth that precedes OOMKilled terminations. Memory usage dashboards should include time series panels that track pod-level and container-level memory consumption across the cluster.
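Prometheus can also surface the OOMKill events themselves rather than just the memory trend. One hedged example, assuming kube-state-metrics is deployed in the cluster:

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

This expression flags containers whose most recent termination was an OOMKill, and it pairs well with the usage-versus-limit query above as an alerting rule.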
Memory dump analysis in Kubernetes environments
Memory dumps are snapshots of application memory at specific points in time. For Java applications that experience OOMKilled errors, capture a heap dump with:
kubectl exec <pod-name> -- jmap -dump:format=b,file=heap.bin <pid>
Teams can transfer files using kubectl cp:
kubectl cp <pod-name>:/path/to/dump local-path
Tools like JVisualVM, MAT, or language-specific analyzers help identify memory leaks in these dumps. .NET applications require dotnet-dump collect, followed by analysis with WinDbg or Visual Studio.
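The .NET collection step can also run through kubectl. A sketch, assuming the dotnet-dump global tool is present in the container image and the application runs as PID 1:

# Capture a process dump inside the container, then copy it out for analysis
kubectl exec <pod-name> -- dotnet-dump collect -p 1 -o /tmp/core_dump
kubectl cp <pod-name>:/tmp/core_dump ./core_dump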
Real-World Case Studies of K8s OOMKilled Incidents
Real-life incidents teach us the most about memory issues in Kubernetes environments. Case studies show what happens when abstract concepts like OOMKilled errors appear in production systems.
E-commerce platform memory leak investigation
Robert, an engineer managing a multi-tenant e-commerce system, faced a puzzling series of failures after deploying new software. The original deployment looked fine with normal weekend traffic showing no problems. The trouble started when Monday’s traffic brought alerts showing high HTTP failure rates in almost all tenant backend services.
Robert found that report service pods kept restarting without clear application errors. A deeper look at Kubernetes logs showed an unusual spike in OOMKilled events. The logs proved these pods stopped working due to memory limits rather than application logic failures.
The data showed memory usage steadily climbing to almost the defined limit (400MB) before each pod stopped and restarted. The memory needed to run reports changed based on the number and size of requested reports.
Robert fixed the problem with a simple change: he raised the pod’s memory limit from 400MB to 600MB, which stopped the reports from triggering the OOM killer. The whole ordeal also showed why load and memory testing matter before new versions go to production.
Microservice architecture memory cascade failures
Memory issues in complex microservice architectures rarely stay isolated. One overloaded service often creates a chain reaction through connected systems. This chain reaction, known as cascading failure, is one of the most dangerous patterns in Kubernetes deployments.
A good example is an e-commerce platform with microservices that handle product catalogs, shopping carts, order processing, and payment gateways. A memory leak in one component spreads as other services retry the failed operations.
Memory leaks in applications often come from poorly configured caches or unused objects staying in memory. These leaks slowly use up resources until OOMKilled events happen and disrupt dependent services.
Teams need to look at the entire service topology to find cascade failures instead of focusing on components that show symptoms. The root cause often lies somewhere else in the service mesh. Teams must monitor across service boundaries to catch memory issues before they spread through the system.
Memory Profiling Tools for Different Runtime Environments
Language-specific memory profiling tools reveal how memory is being consumed before containers hit their limits and trigger Kubernetes pod OOMKilled events. Memory analysis works best with a solid understanding of each runtime environment.
JVM memory analysis with jmap and MAT
Java applications that face OOMKilled errors in Kubernetes need heap dumps to investigate memory leaks. Here’s how I capture these snapshots using jmap:
I start by accessing the container:
kubectl exec -it <pod-name> -- bash
Next, I run jps to find the Java process ID and take a heap dump:
jmap -dump:live,format=b,file=<filename>.bin <process_id>
The heap dump needs to move to my local machine:
kubectl cp <namespace>/<pod>:<container_path>/<filename>.bin <local_path>
Eclipse Memory Analyzer Tool (MAT) does a great job analyzing these dumps. It automatically spots potential memory leaks through its “leak suspects” report.
Node.js heap snapshots in Kubernetes
Node.js applications in Kubernetes need a different approach to memory analysis when pods exit with code 137. I enable the Inspector API by sending a SIGUSR1 signal to the running Node.js process:
kubectl exec <pod-name> -- kill -SIGUSR1 1
The next step is setting up port forwarding:
kubectl port-forward <pod-name> 9229:9229
Chrome DevTools’ memory profiler can now connect and analyze heap usage. Multiple snapshots taken over time show memory growth patterns. My analysis focuses on growing array and string allocations between snapshots.
Node.js 19+ comes with a change that links new heap space sizing to container memory limits. This might increase garbage collection frequency in Kubernetes environments.
Python memory profilers for containerized applications
Python applications that face Kubernetes pod OOMKilled situations have several profiling options. tracemalloc, part of the standard library, offers straightforward memory tracking:
import tracemalloc
tracemalloc.start()  # begin tracking allocations
# ... application code execution ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:  # print the ten largest allocation sites
    print(stat)
Memray gives a more complete analysis, especially in production environments. You can set up continuous profiling with memray:
memray run --follow-fork -o /app/profiles/profile.bin -m gunicorn app:app
The profile files can be analyzed using memray summary or memray flamegraph to see memory consumption patterns.
Each runtime has its own memory management quirks. Choosing a profiling tool that matches your application’s language is the first step toward fixing OOMKilled issues in Kubernetes environments.
Conclusion
Memory management issues are among the hardest problems to track down in Kubernetes environments, but understanding OOMKilled errors lets us build stronger systems. This piece explored how container memory limits, Linux kernel behavior, and application behavior interact.
Memory leaks manifest differently across runtime environments. Java’s garbage collection quirks, Node.js memory patterns, Python’s private heap management, and Go’s resource strategies each call for their own debugging methods. With language-specific profiling tools and monitoring solutions like cAdvisor, Prometheus, and Grafana, we can detect and fix these problems before they affect production workloads.
The real-world case studies showed how a basic memory leak can lead to system-wide failures, and why proper resource configuration and continuous monitoring matter. The right mix of debugging techniques, tools, and investigation methods is the foundation of keeping Kubernetes clusters healthy.
Becoming skilled at memory management takes technical knowledge and constant vigilance. OOMKilled errors might look intimidating at first, but the methods and strategies covered here offer a clear path through these challenges. Successful container orchestration ultimately relies on understanding memory issues and taking steps to prevent them.
FAQs
Q1. What causes OOMKilled errors in Kubernetes?
OOMKilled errors occur when a container exceeds its memory limit. This can be due to improperly defined resource limits, memory leaks in application code, unexpected traffic spikes, or resource-hungry containers sharing the same node.
Q2. How can I identify memory leaks in my Kubernetes pods?
To identify memory leaks, use language-specific profiling tools like jmap for Java, Chrome DevTools for Node.js, and tracemalloc for Python. Additionally, leverage monitoring tools like cAdvisor, Prometheus, and Grafana to visualize memory trends over time.
Q3. What should I do if my pod keeps getting OOMKilled?
First, analyze the pod’s memory usage patterns and logs. Then, consider increasing the memory limit temporarily, optimizing the application code, or implementing automatic pod restarts. For critical services, adding more replicas can help maintain availability during OOMKilled events.
Q4. How does the Linux OOM Killer decide which process to terminate?
The OOM Killer assigns scores to processes based on factors like memory consumption, process longevity, CPU utilization, and process privilege. In Kubernetes, the pod’s Quality of Service (QoS) class also influences this score, with Guaranteed pods being least likely to be killed.
Q5. Are there any misconceptions about OOMKilled errors in Kubernetes?
Yes, common misconceptions include thinking that OOMKilled errors always indicate memory issues (they can also be caused by failed health checks), believing the OOM Killer only activates when a node runs out of memory completely, and assuming that simply increasing memory limits always solves the problem without addressing underlying issues.