Memory leaks are a common problem in software development, and they can happen for various reasons. At Paxos, we’ve had several memory leaks over the years across our services. Each instance is unique, and they’re always a challenge to solve. Over time, however, we’ve become better at solving them by putting great tooling in place for our Golang applications that helps us root-cause and resolve issues quickly.
In this blog post, we’ll cover how to identify a memory leak, how to debug memory leaks in Golang applications, and, finally, an interesting case: the most difficult memory leak we’ve had to debug.
How can you tell you have a memory leak?
At Paxos, we run applications that require very high uptime and low latency, so we hold ourselves to strict uptime and latency Service Level Objectives (SLOs). To meet our SLOs, we leverage a range of great monitoring tools. We watch for signs of degradation across many aspects of our applications so that we are alerted when they are not operating correctly and can act quickly.
Latency graph of an application with a memory leak
During one instance of a memory leak, we were alerted to increasing latency in our application. The above graph shows the 99th percentile (P99) latency for an application with a real memory leak we encountered at Paxos. Although our P99 latency normally sits around 30 milliseconds, this application’s latency climbed well above that, into the 80-140 millisecond range.
The latency graph also shows a sawtooth pattern, which is a key indicator of a memory leak. This pattern occurs because the application runs out of memory, reboots and resets its memory usage back to a base state, which temporarily brings latency back down. Over time, memory usage climbs again without ever being released, and all requests on that application begin to slow down once more.
Slow memory leak dashboard
As we dug deeper into the issue, we looked at memory usage on our pods and saw a consistent linear increase across all of them. We knew it was time to pull out our tools and solve this memory leak.
Debugging a memory leak in Golang
We run PProf, a Google open-source library, on our Golang applications to pull in various stats about memory usage. PProf collects a profiling sample that you can download from the web server running on your application. Once you’ve gathered the profile, you can use local tooling to visualize memory usage in your application and understand peak usage.
Setting up PProf
Setting up PProf on any Golang application is very straightforward. You add it as a blank import (imported only for its side effects) and make sure that you are running an HTTP server.
package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
    // Serve the pprof endpoints in the background.
    go func() {
        http.ListenAndServe("localhost:8080", nil)
    }()
    // ... the rest of your application ...
}
Once you’ve set up your application to run PProf, you can download reports like a memory heap dump with a simple command and then use PProf locally to visualize the memory usage on your computer.
# Download a pprof heap profile from the running application
curl http://127.0.0.1:8080/debug/pprof/heap > /tmp/profile.prof
# Visualize it on your local machine
go tool pprof -http=:8088 /tmp/profile.prof
Memory leak shown in PProf
The heap dump profiles PProf provides are incredibly useful, whether viewed visually or in text form. Here is an example of a heap dump that clearly shows a memory leak in the LFU-style cache of GCache. As you can see in this profile, the vast majority of memory is held by the increment function. In this incorrect cache implementation, the increment function grew an internal list (visible in the container/list frames below) every time a read occurred on the cache. Fixing that unnecessary growth brings the allocated memory back down.
$ go tool pprof ~/test /tmp/profile958427541/mem.pprof
File: test
Type: inuse_space
Time: Dec 4, 2020 at 12:38pm (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 14MB, 100% of 14MB total
Showing top 5 nodes out of 8
flat flat% sum% cum cum%
10.58MB 75.53% 75.53% 14MB 100% github.com/bluele/gcache.(*LFUCache).increment
3.43MB 24.47% 100% 3.43MB 24.47% container/list.(*List).insertValue (inline)
0 0% 100% 3.43MB 24.47% container/list.(*List).InsertAfter (inline)
0 0% 100% 14MB 100% github.com/bluele/gcache.(*LFUCache).Get
0 0% 100% 14MB 100% github.com/bluele/gcache.(*LFUCache).get
A more complex case: Goroutine-based memory leaks
Occasionally, you’ll encounter a more complex memory leak that isn’t immediately obvious. In our case, we at least knew the memory usage was growing at a linear rate, so we took heap profiles over time to try to spot differences in usage. Even that didn’t reveal where the memory was going. Since nothing obvious stood out, we tried viewing the heap in several different ways to identify what could be causing the linear memory growth.
There are four main ways to view the heap profile:
- inuse_space: memory allocated but not yet released
- inuse_objects: objects allocated but not yet released
- alloc_space: the total amount of memory allocated over the program’s lifetime, including memory that has since been freed
- alloc_objects: the total number of objects allocated over the program’s lifetime
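For reference, you can switch between these views when opening the same heap profile with go tool pprof; the port and file path below are just placeholders:
# Open the profile ranked by total allocations instead of in-use memory
go tool pprof -alloc_space -http=:8088 /tmp/profile.prof
# Or rank by the number of live objects rather than bytes
go tool pprof -inuse_objects -http=:8088 /tmp/profile.prof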
Typically, we start with inuse_space to view where the memory leak occurs. This is supposed to represent currently used memory that has yet to be released. This value should match what we see as the current memory usage of the application. In this case, the amount of memory shown as inuse_space did not equal the amount of memory our Grafana dashboards indicated was in use – another strange phenomenon.
We viewed the different types of metrics in the heap dump to see if there were any clues as to where this memory was being allocated. The alloc_space metric, which shows total memory allocation over time, showed particularly high usage in certain areas where we do in-memory caching. Given the access pattern of the cache, we concluded this was normal: all of that memory seemed to be evicted at a consistent enough rate that it shouldn’t be a problem. As we continued to investigate, we needed to determine where memory was being used that wasn’t showing up on our regular heap dump profile.
Goroutines
Golang has a powerful, lightweight concurrency model built around goroutines. We use them extensively in our applications for speed optimizations and for running background tasks. The goroutine model also includes channels, which let different goroutines communicate and pass each other the information they need to operate. This enables you to parallelize tasks or perform other complex operations with multiple goroutines processing the information.
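As a quick refresher, here is a minimal sketch (not code from our services) of goroutines fanning work out and reporting results back over a channel:

package main

import "fmt"

func main() {
    nums := []int{1, 2, 3, 4}
    results := make(chan int, len(nums)) // buffered so the senders never block

    // Fan out: one goroutine per item, each sending its result on the channel.
    for _, n := range nums {
        go func(n int) {
            results <- n * n
        }(n)
    }

    // Collect exactly as many results as we handed out pieces of work.
    for range nums {
        fmt.Println(<-results)
    }
}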
As we couldn’t find anything else useful in the heap dump profile or a CPU profile, we decided to explore what other data PProf could provide us. PProf offers great tooling that lets you see how many goroutines your application runs and what those goroutines are waiting on. Below are some commands you can run to count the number of goroutines your application is running and gather details about why they are still running.
# Goroutine profile: counts of running goroutines grouped by stack trace
curl "http://127.0.0.1:8080/debug/pprof/goroutine?debug=1" > goroutine-count
# Full dump of every goroutine, including its state and how long it has been waiting
curl "http://127.0.0.1:8080/debug/pprof/goroutine?debug=2" > goroutine-log.txt
After running these PProf commands to get the count of running goroutines and to pull the exact details on each of them, we finally identified the cause of the memory leak in our application. Thousands of goroutines were running and unable to exit, and most of them were stuck with a ‘chan send’ wait reason.
goroutine 67973 [chan send, 31 minutes]:
A ‘chan send’ wait occurs when a goroutine sends on an unbuffered channel and nothing ever receives the value. In our case, the consumer of the messages stopped reading from the channel once it had what it needed, so the sender’s next message was never received and its goroutine was left blocked on the send, held open indefinitely.
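To make that concrete, here is a minimal sketch of the kind of pattern that leaks a goroutine on every call; the function and values are purely illustrative, not our production code:

package main

import (
    "fmt"
    "runtime"
)

// firstResult returns only the first value. The producer goroutine keeps
// trying to send the remaining values into an unbuffered channel that nobody
// reads anymore, so every call leaks one goroutine stuck in "chan send".
func firstResult(values []string) string {
    ch := make(chan string) // unbuffered

    go func() {
        for _, v := range values {
            ch <- v // blocks forever once the caller stops receiving
        }
    }()

    return <-ch // take one value and walk away
}

func main() {
    for i := 0; i < 1000; i++ {
        firstResult([]string{"a", "b", "c"})
    }
    fmt.Println("goroutines still running:", runtime.NumGoroutine()) // roughly 1001
}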
Now that we’d identified the issue, we had two options: ensure the reader of the channel closes it once it’s done, or create a buffered (fixed-length) channel large enough that the sender never blocks. The first option has a catch: sending a message on a closed channel causes the sender to panic. We decided against the buffered channel because we didn’t consistently send a predictable number of messages, so sizing the buffer would have been guesswork. Instead, we ensured the channel is closed after the receiver completes its operations and handled the resulting panic on the sender side so it wouldn’t cause any application failures.
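A minimal sketch of the fix we chose, with illustrative names rather than our production code, might look like this:

package main

import (
    "fmt"
    "sync"
)

// produce sends values into ch. If the receiver closes the channel early, the
// blocked send panics with "send on closed channel"; the recover turns that
// into a clean exit instead of a leaked goroutine.
func produce(ch chan int, values []int, wg *sync.WaitGroup) {
    defer wg.Done()
    defer func() {
        // If the channel was closed while we were still sending, the send
        // panics; recover so this goroutine exits cleanly instead of crashing.
        recover()
    }()
    for _, v := range values {
        ch <- v
    }
}

func main() {
    ch := make(chan int) // unbuffered, as in the leaky version
    var wg sync.WaitGroup

    wg.Add(1)
    go produce(ch, []int{1, 2, 3, 4, 5}, &wg)

    // The receiver only needs the first two values.
    for i := 0; i < 2; i++ {
        fmt.Println(<-ch)
    }

    // Closing the channel unblocks the producer (via the recovered panic)
    // instead of leaving it stuck in "chan send" forever.
    close(ch)
    wg.Wait()
}

Recovering from the ‘send on closed channel’ panic is a deliberate trade-off; another common option is to have the producer select on a done channel or context so it can stop without any panic at all.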
Memory leak gone!
At last, the memory leak was gone! After deploying the code change, we could see over the next few hours that our memory usage had completely stabilized in our applications, and our latency metrics returned to normal as well. This was a significantly more challenging memory leak to debug because goroutine memory usage does not show up on a regular heap dump profile.
Final thoughts
At Paxos, I’ve seen firsthand how disruptive memory leaks can be, affecting both performance and stability. Introducing PProf was a game-changer for us. But while direct memory leaks were straightforward to identify, it was the more nuanced challenges of Golang’s goroutine model that truly tested our troubleshooting skills. I cannot stress enough the importance of deeply understanding goroutines and their communication channels. Throughout this journey, we’ve learned the value of a systematic approach, the right tools and a deep grasp of the language and its architecture. It’s a continuous learning process, and we keep fostering a culture of continuous improvement at Paxos!
Interested in exploring careers at Paxos?
Check out our open engineering roles.