I’ve been watching the 2017 GopherCon videos from here. There are many good talks on that list that I’d recommend watching.

One that I really wanted to try out on my own was the profiling presentation that Peter Bourgon did. I assume that, for the sake of simplicity, he left some details out, so I’ve been trying to figure them out on my own.

I was inspired to try to profile my own Golang web apps using the method he presented in his talk. So in this post I’ll show you a simplified example of how to profile your own web applications in more detail.

The web app

To see a practical example of profiling while keeping things simple, we need to create a web app that, when called on a particular route, performs some sort of calculation and returns the result in its payload.

If we were to mimic real life scenarios, we would be here all day. So just to keep it simple, we will create a single route that calculates the 25th Fibonacci number when it’s called and returns it in the response body.

I’ve written two versions of the Fibonacci function beforehand: one that uses recursion (exponential time, very CPU intensive) and one that calculates the number iteratively using a slice (linear time, not very CPU intensive).

In order to use the profiling tool, you need to import the net/http/pprof package and register some routes. In the presentation I mentioned earlier, the speaker pointed out that we can leave this imported even in production environments, since it does not affect performance.

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/pprof"
)

// O(n) Fibonacci
func linearFibonacci(n int) int {
	// Create an int array of size n + 1
	v := make([]int, n+1)

	// F(0) = 0
	v[0] = 0
	// F(1) = 1
	v[1] = 1

	// F(i) = F(i-1) + F(i-2)
	for i := 2; i <= n; i++ {
		v[i] = v[i-1] + v[i-2]
	}

	// F(n) - return the n-th Fibonacci number
	return v[n]
}

// O(2^n) Fibonacci
func exponentialFibonacci(n int) int {
	// F(0) = 0
	if n == 0 {
		return 0
	}

	// F(1) = 1
	if n == 1 {
		return 1
	}

	// F(n) = F(n-1) + F(n-2) - return the n-th Fibonacci number
	return exponentialFibonacci(n-1) + exponentialFibonacci(n-2)
}

// HTTP request handler
func handler(w http.ResponseWriter, r *http.Request) {
	// return the 25th Fibonacci number in the response payload
	fmt.Fprintf(w, "%d", exponentialFibonacci(25))
}

func main() {
	// Create a new HTTP multiplexer
	mux := http.NewServeMux()

	// Register our handler for the / route
	mux.HandleFunc("/", handler)

	// Add the pprof routes
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	mux.Handle("/debug/pprof/block", pprof.Handler("block"))
	mux.Handle("/debug/pprof/goroutine", pprof.Handler("goroutine"))
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))
	mux.Handle("/debug/pprof/threadcreate", pprof.Handler("threadcreate"))

	// Start listening on port 8080
	if err := http.ListenAndServe(":8080", mux); err != nil {
		log.Fatalf("Error when starting or running http server: %v", err)
	}
}

As you can see, this is a really simple application. Save it to a file called main.go and build it like this: go build -o myserver main.go.

Now you can run your binary with ./myserver, and to check that it’s working we’ll send a request to it:

$ curl http://localhost:8080
75025
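
As a side note, the explicit pprof route registration above is only needed because we’re using our own ServeMux. If you’re happy to serve everything from http.DefaultServeMux, a blank import of net/http/pprof registers the same /debug/pprof routes for you automatically. Here’s a minimal sketch of that variant (not from the talk, reusing the exponentialFibonacci function from above):

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Register the application handler on the default mux
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%d", exponentialFibonacci(25))
	})

	// Passing nil uses http.DefaultServeMux, which now also serves the pprof routes
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Error when starting or running http server: %v", err)
	}
}

Registering the routes on your own mux, as the full example does, gives you more control over exactly which mux exposes the debug endpoints. The rest of this post sticks with that version.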

The CPU profile

Now, while the server is running, you will need to run two commands in parallel. First start the profiling tool, which will record data for 30 seconds from the moment it is run, and while it is running, use the Apache Benchmark tool to send a few requests its way.

So you will need to run the profiling tool like this:

go tool pprof -seconds 30 myserver http://localhost:8080/debug/pprof/profile

While that’s running, run the benchmark:

ab -k -c 8 -n 100000 "http://127.0.0.1:8080/"

This will send a total of 100,000 keep-alive requests to your server, using 8 concurrent connections. To better understand what this benchmark command does, see the explanation in the link provided.

After 30 seconds, the profiling tool will drop you into an interactive prompt that looks something like this:

$ go tool pprof -seconds 30 myserver http://localhost:8080/debug/pprof/profile
Fetching profile from http://localhost:8080/debug/pprof/profile?seconds=30
Please wait... (30s)
Saved profile in /Users/username/pprof/pprof.myserver.localhost:8080.samples.cpu.013.pb.gz
Entering interactive mode (type "help" for commands)
(pprof)

Here you can run commands that show how much CPU time each function took, along with other useful information. For example, if I run top5, it will list the top 5 functions that are hogging the CPU:

(pprof) top5
84.57s of 85.18s total (99.28%)
Dropped 87 nodes (cum <= 0.43s)
Showing top 5 nodes out of 30 (cum >= 1.09s)
      flat  flat%   sum%        cum   cum%
    51.93s 60.97% 60.97%     51.93s 60.97%  main.exponentialFibonacci
    20.57s 24.15% 85.11%     20.59s 24.17%  fmt.(*pp).doPrintf
    12.06s 14.16% 99.27%     12.06s 14.16%  syscall.Syscall
     0.01s 0.012% 99.28%     11.22s 13.17%  net.(*netFD).Write
         0     0% 99.28%      1.09s  1.28%  bufio.(*Reader).ReadLine

As you can see, the exponentialFibonacci function really hogs the CPU.

Please note: These values might differ on your machine. For reference I’m using a MacBook Pro (Retina, 13-inch, Early 2015), 2.7 GHz Intel Core i5 Processor and 8 GB 1867 MHz DDR3 of Memory.

If we want to see a graph of this profile, we need to run the web command (this requires Graphviz to be installed):

(pprof) web
(pprof)

This will open up your default browser and display an image of the profile. Here’s a crop of the image that concerns our Fibonacci function:

[Image: pprof call graph, cropped to the exponentialFibonacci call path]

So during that profile, the exponentialFibonacci function was running on the CPU 60.97% of the time.

Optimizations

Now, we know from the theory that O(n) < O(2^n). Let’s see if this holds up in practice if we replace the exponentialFibonacci call with linearFibonacci inside the handler function.
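
The change is a single line in the handler; the rest of main.go stays the same:

// HTTP request handler - now using the linear version
func handler(w http.ResponseWriter, r *http.Request) {
	// return the 25th Fibonacci number in the response payload
	fmt.Fprintf(w, "%d", linearFibonacci(25))
}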

Now we run the profile again. You can immediately tell that it’s faster, because the benchmark finishes much sooner this time.

If we run top5 now, the linearFibonacci function doesn’t even make the cut. Even if you try top100 you will not find it, because the compiler inlined that particular code.

So we need to rebuild the application with the compiler flag that disables inlining, like this:

go build -gcflags -l -o myserver main.go
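
Alternatively, though this isn’t covered in the talk, you can leave inlining enabled for the rest of the build and tell the gc compiler to skip inlining just this one function by placing the //go:noinline directive directly above its declaration:

// O(n) Fibonacci, excluded from inlining so it shows up in profiles
//go:noinline
func linearFibonacci(n int) int {
	v := make([]int, n+1)
	v[0], v[1] = 0, 1
	for i := 2; i <= n; i++ {
		v[i] = v[i-1] + v[i-2]
	}
	return v[n]
}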


Now, even with inlining disabled, I had a hard time finding the function in the top. So I went ahead and increased the hard-coded value for the n-th Fibonacci number to 10,000. The 10,000th Fibonacci number doesn’t even fit inside Go’s int type, so the calculation overflows many times along the way, but that doesn’t matter here since we only care about the CPU work being done. I also increased the benchmark to 1,000,000 requests.

Now if I run top5 I get:

(pprof) top5
36.21s of 46.49s total (77.89%)
Dropped 226 nodes (cum <= 0.23s)
Showing top 5 nodes out of 102 (cum >= 1.97s)
      flat  flat%   sum%        cum   cum%
    25.94s 55.80% 55.80%     26.23s 56.42%  syscall.Syscall
     3.38s  7.27% 63.07%      3.38s  7.27%  runtime.kevent
     2.60s  5.59% 68.66%      2.60s  5.59%  runtime.usleep
     2.32s  4.99% 73.65%      4.26s  9.16%  main.linearFibonacci
     1.97s  4.24% 77.89%      1.97s  4.24%  runtime.mach_semaphore_signal

Or in graphical format:

[Image: pprof call graph showing linearFibonacci]

As you can see, it barely even makes a dent.

So for this test, calculating the 25th Fibonacci number recursively takes about 61% of the CPU, while calculating the 10,000th Fibonacci number linearly takes about 5% of the CPU (without inlining).

Another useful pprof command for seeing how much CPU time a function takes is the list command. Or, if you’re like me, for finding out whether a function is actually being called at all.

For our linearFibonacci function it looks like this:

(pprof) list linearFibonacci
Total: 46.49s
ROUTINE ======================== main.linearFibonacci in /Users/username/workspace/go/src/github.com/username/test_profiling/main.go
     2.32s      4.26s (flat, cum)  9.16% of Total
         .          .      8:)
         .          .      9:
         .          .     10:// O(n) Fibonacci
         .          .     11:func linearFibonacci(n int) int {
         .          .     12:	// Create an int array of size n + 1
      10ms      1.95s     13:	v := make([]int, n+1)
         .          .     14:
         .          .     15:	// F(0) = 0
         .          .     16:	v[0] = 0
         .          .     17:	// F(1) = 1
         .          .     18:	v[1] = 1
         .          .     19:
         .          .     20:	// F(i) = F(i-1) + F(i-2)
     260ms      260ms     21:	for i := 2; i <= n; i++ {
     2.05s      2.05s     22:		v[i] = v[i-1] + v[i-2]
         .          .     23:	}
         .          .     24:
         .          .     25:	// F(n) - return the n-th Fibonacci number
         .          .     26:	return v[n]
         .          .     27:}

A better comparison

A better way to compare the two methods, and to check the theory against practice, is this:

  1. Knowing that the exponentialFibonacci method is O(2^n), calculating the 25th Fibonacci number takes on the order of 2^25 = 33554432 operations.
  2. Calculating the 33554432th Fibonacci number linearly should therefore take roughly the same amount of work as calculating the 25th number exponentially.

So following the methodology above we do this:

  1. Build the application using the exponentialFibonacci(25) call.
  2. Start the application.
  3. Start the Apache Benchmark for 1,000,000 requests.
  4. Start the CPU profile for 30 seconds.

We get this:

(pprof) top5
98.27s of 99.02s total (99.24%)
Dropped 64 nodes (cum <= 0.50s)
Showing top 5 nodes out of 30 (cum >= 1.30s)
      flat  flat%   sum%        cum   cum%
    60.78s 61.38% 61.38%     60.78s 61.38%  main.exponentialFibonacci
    24.54s 24.78% 86.16%     24.54s 24.78%  fmt.(*pp).doPrintf
    12.95s 13.08% 99.24%     12.95s 13.08%  syscall.Syscall
         0     0% 99.24%      1.30s  1.31%  bufio.(*Reader).ReadLine
         0     0% 99.24%      1.30s  1.31%  bufio.(*Reader).ReadSlice

Now for the second part:

  1. Build the application using the linearFibonacci(33554432) call.
  2. Start the application.
  3. Start the Apache Benchmark for 1,000,000 requests.
  4. Start the CPU profile for 30 seconds.

We get this:

(pprof) top5
49280ms of 49870ms total (98.82%)
Dropped 92 nodes (cum <= 249.35ms)
Showing top 5 nodes out of 29 (cum >= 470ms)
      flat  flat%   sum%        cum   cum%
   28650ms 57.45% 57.45%    44400ms 89.03%  main.linearFibonacci
   15660ms 31.40% 88.85%    15660ms 31.40%  runtime.memclr
    3910ms  7.84% 96.69%     3910ms  7.84%  runtime.usleep
     590ms  1.18% 97.87%      590ms  1.18%  runtime.duffcopy
     470ms  0.94% 98.82%      470ms  0.94%  runtime.mach_semaphore_timedwait

As you can see, the flat percentages, which show how much of the time was spent in the routine itself, are roughly the same: 61.38% vs 57.45%, a difference of about 4 percentage points.
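
If you just want to compare the two functions head to head, without HTTP or the profiler in the loop, Go’s built-in benchmarking support is a quicker sanity check. This isn’t part of the original post; here’s a minimal sketch, assuming it sits next to main.go in a file called main_test.go:

package main

import "testing"

// Package-level sink to keep the compiler from optimizing the calls away.
var result int

// Benchmarks the recursive version on the same input used by the handler.
func BenchmarkExponentialFibonacci25(b *testing.B) {
	r := 0
	for i := 0; i < b.N; i++ {
		r = exponentialFibonacci(25)
	}
	result = r
}

// Benchmarks the linear version on the "equivalent" input from the comparison above.
func BenchmarkLinearFibonacci33554432(b *testing.B) {
	r := 0
	for i := 0; i < b.N; i++ {
		r = linearFibonacci(33554432)
	}
	result = r
}

Running go test -bench=. in the same directory reports ns/op for each benchmark, which gives you numbers to compare directly.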

Profiling memory

Using the same process, you can run the following command to profile memory (the -alloc_objects option reports the total number of allocated objects; -alloc_space, -inuse_objects, and -inuse_space are the other available modes):

go tool pprof -alloc_objects myserver http://localhost:8080/debug/pprof/heap

If you run a top command you should see something like this:

(pprof) top10
9741685 of 9927382 total (98.13%)
Dropped 7 nodes (cum <= 49636)
Showing top 10 nodes out of 33 (cum >= 99079)
      flat  flat%   sum%        cum   cum%
   3182489 32.06% 32.06%    3182489 32.06%  net/textproto.(*Reader).ReadMIMEHeader
   2050835 20.66% 52.72%    2050835 20.66%  context.WithCancel
   1068043 10.76% 63.47%    8447075 85.09%  net/http.(*conn).readRequest
    675175  6.80% 70.28%    5155947 51.94%  net/http.readRequest
    667729  6.73% 77.00%     667729  6.73%  net/url.parse
    655370  6.60% 83.60%    1414760 14.25%  main.handler
    618866  6.23% 89.84%     618866  6.23%  main.linearFibonacci
    589833  5.94% 95.78%     589833  5.94%  net/textproto.(*Reader).ReadLine
    134266  1.35% 97.13%     172250  1.74%  net/http.newBufioWriterSize
     99079     1% 98.13%      99079     1%  sync.(*Pool).pinSlow

Conclusion

Now that you’ve seen the basics of how to profile your Golang web apps, you can start diving into heavier stuff like this. Take some time and run a profile on your own apps.

Also, you should watch the GopherCon talk I mentioned at the start of this post; it’s quite good.