Instrumenting Go Services with Prometheus: The Right Way


Backend Developer | Golang & Python. I enjoy building reliable APIs, distributed systems, and automation tools. Writing here about backend engineering, system design, and real-world dev experiences.

Introduction

Logs tell you what happened. Metrics tell you how often it happened and how long it took.

Here's the thing: logs are great for debugging specific issues, but they're terrible for answering questions like:

  • How many requests per second are we handling?

  • What's our 99th percentile latency?

  • How many errors happened in the last hour?

For those questions, you need metrics. And in the Go world, Prometheus is the de facto standard.

In this post, I'll show you how to instrument your Go services with Prometheus. We'll cover the core metric types, how to instrument HTTP handlers, and some real-world patterns I use in production.

If you're running microservices and you're not collecting metrics yet, this is your wake-up call.

Why Prometheus?

There are other metrics systems out there—Datadog, New Relic, CloudWatch—but Prometheus has some unique advantages:

  1. Pull-based: Services expose metrics, Prometheus scrapes them. Simple, stateless, easy to debug.

  2. Open-source: No vendor lock-in, runs anywhere.

  3. Powerful queries: PromQL lets you slice metrics however you want.

  4. Ecosystem: Grafana integration, alerting, tons of exporters.

Plus, it's the industry standard for cloud-native apps. Learn it once, use it everywhere.

Core Metric Types

Prometheus has four core metric types. Knowing when to use each one is critical.

Counter

A counter only goes up. Think "total requests", "total errors", "total bytes sent".

var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests",
})

// In your handler:
requestsTotal.Inc()

On its own, a raw counter value isn't very interesting. You almost always query it through the rate() function:

rate(http_requests_total[5m])  # per-second average rate over the last 5 minutes

Gauge

A gauge can go up or down. Think "active connections", "memory usage", "queue size".

var activeConnections = promauto.NewGauge(prometheus.GaugeOpts{
    Name: "http_active_connections",
    Help: "Current active HTTP connections",
})

// When connection opens:
activeConnections.Inc()

// When connection closes:
activeConnections.Dec()
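
A common place for the Inc/Dec pair is a middleware that tracks in-flight requests. Here's a minimal sketch of that pattern. The names incDecer, trackInFlight, fakeGauge, and inFlightDemo are mine, not part of any library, and fakeGauge is just a stand-in so the snippet runs without the Prometheus client. A real prometheus.Gauge has the same Inc() and Dec() methods, so it should drop in for the interface.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// incDecer is the subset of the prometheus.Gauge interface this sketch needs.
type incDecer interface {
	Inc()
	Dec()
}

// trackInFlight increments g when a request starts and decrements it when
// the handler returns. Using defer keeps the gauge accurate even if next panics.
func trackInFlight(g incDecer, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		g.Inc()
		defer g.Dec()
		next.ServeHTTP(w, r)
	})
}

// fakeGauge is a hypothetical stand-in so the sketch runs without the
// Prometheus client library.
type fakeGauge struct{ n int64 }

func (g *fakeGauge) Inc()         { atomic.AddInt64(&g.n, 1) }
func (g *fakeGauge) Dec()         { atomic.AddInt64(&g.n, -1) }
func (g *fakeGauge) value() int64 { return atomic.LoadInt64(&g.n) }

// inFlightDemo serves one request and reports the gauge value seen inside
// the handler and the value after the handler has returned.
func inFlightDemo() (during, after int64) {
	g := &fakeGauge{}
	h := trackInFlight(g, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		during = g.value()
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return during, g.value()
}
```

If you'd rather not write this yourself, promhttp ships InstrumentHandlerInFlight, which wraps a handler in the same way.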

Histogram

A histogram samples observations (usually durations or sizes) and counts them in buckets. Think "request duration", "response size".

var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration",
    Buckets: prometheus.DefBuckets,  // 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
})

// In your handler:
start := time.Now()
// ... do work ...
requestDuration.Observe(time.Since(start).Seconds())

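Histograms let you estimate percentiles at query time. The result is interpolated within a bucket, so it is only as precise as your bucket boundaries: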

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # p99 latency
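
To see why the bucket boundaries matter, it helps to know roughly how that estimate is produced. The sketch below is my own simplified version of the linear interpolation idea, not the actual Prometheus implementation: it ignores edge cases such as negative bucket bounds and native histograms, and bucket, estimateQuantile, exampleBuckets, and overflowBuckets are hypothetical names.

```go
package main

import "math"

// bucket mirrors one cumulative histogram bucket: count is the number of
// observations less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// estimateQuantile approximates the q-quantile (0 <= q <= 1) from cumulative
// buckets sorted by upperBound, where the last bucket is +Inf.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if q < 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		// The rank falls beyond every finite bucket: the best we can
		// report is the largest finite upper bound.
		if math.IsInf(b.upperBound, 1) {
			return buckets[len(buckets)-2].upperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].upperBound, buckets[i-1].count
		}
		if b.count == below {
			return b.upperBound
		}
		// Assume observations are spread evenly inside the bucket.
		return lower + (b.upperBound-lower)*(rank-below)/(b.count-below)
	}
	return math.NaN()
}

// exampleBuckets: 50 requests <= 0.1s, 90 <= 0.5s, all 100 <= 1s.
var exampleBuckets = []bucket{
	{upperBound: 0.1, count: 50},
	{upperBound: 0.5, count: 90},
	{upperBound: 1, count: 100},
	{upperBound: math.Inf(1), count: 100},
}

// overflowBuckets: every request was slower than the only finite bucket.
var overflowBuckets = []bucket{
	{upperBound: 0.1, count: 0},
	{upperBound: math.Inf(1), count: 10},
}
```

With exampleBuckets, the p99 lands in the 0.5s to 1s bucket and comes out as roughly 0.95s. With overflowBuckets, the estimate is stuck at 0.1s no matter how slow the requests really were, which is the practical reason to make sure your largest finite bucket sits above the latencies you care about.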

Summary

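A summary is similar to a histogram but calculates quantiles on the client side. I rarely use these. Precomputed quantiles from different instances can't be meaningfully combined, whereas histogram buckets can be summed across instances and turned into a quantile at query time.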

Instrumenting HTTP Handlers

Here's a complete example of instrumenting an HTTP service:

package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests by path and status",
        },
        []string{"path", "status"},
    )

    httpDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration by path",
            Buckets: prometheus.DefBuckets,
        },
        []string{"path"},
    )
)

func main() {
    // Expose /metrics endpoint
    http.Handle("/metrics", promhttp.Handler())

    // Your API handlers
    http.HandleFunc("/", metricsMiddleware(helloHandler))
    http.HandleFunc("/api/orders", metricsMiddleware(ordersHandler))

    // ListenAndServe only returns on failure, so surface the error
    log.Fatal(http.ListenAndServe(":8080", nil))
}

func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Wrap the ResponseWriter to capture status code
        ww := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

        // Call the actual handler
        next.ServeHTTP(ww, r)

        // Record metrics after handler completes.
        // The status label is the numeric code ("200", "500") so that
        // PromQL matchers such as status=~"5.." work.
        // Caveat: the "/" pattern also matches any unregistered path, so
        // r.URL.Path is not bounded here. In a real service, label by
        // route template instead of the raw path.
        duration := time.Since(start).Seconds()
        httpDuration.WithLabelValues(r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.URL.Path, strconv.Itoa(ww.statusCode)).Inc()
    }
}

// Wrapper to capture status code
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func helloHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Hello!"))
}

func ordersHandler(w http.ResponseWriter, r *http.Request) {
    // simulate some work
    time.Sleep(50 * time.Millisecond)
    w.Write([]byte(`{"orders": []}`))
}

Now if you hit http://localhost:8080/metrics, you'll see something like this (abridged; the default registry also exports Go runtime and process metrics):

# HELP http_requests_total Total HTTP requests by path and status
# TYPE http_requests_total counter
http_requests_total{path="/",status="200"} 42
http_requests_total{path="/api/orders",status="200"} 17

# HELP http_request_duration_seconds HTTP request duration by path
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{path="/",le="0.005"} 40
http_request_duration_seconds_bucket{path="/",le="0.01"} 42
http_request_duration_seconds_sum{path="/"} 0.123
http_request_duration_seconds_count{path="/"} 42

That's what Prometheus scrapes every 15 seconds (or whatever interval you configure).
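
The middleware above leans on one subtle detail: a handler that only calls Write never calls WriteHeader, so the wrapper has to default to 200. Here's a small sketch for checking that behavior with httptest. The responseWriter type is copied from the example; capturedStatus, okHandler, and failingHandler are hypothetical helpers I made up for the check.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// responseWriter is the same status-capturing wrapper used by the middleware.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

// capturedStatus runs h against a fake request and returns the status code
// the wrapper recorded.
func capturedStatus(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	ww := &responseWriter{ResponseWriter: rec, statusCode: http.StatusOK}
	h(ww, req)
	return ww.statusCode
}

// okHandler writes a body without ever calling WriteHeader.
func okHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

// failingHandler sets an explicit 500 through http.Error.
func failingHandler(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "boom", http.StatusInternalServerError)
}
```

Calling capturedStatus(okHandler) should give 200 and capturedStatus(failingHandler) should give 500, which are the values that end up in the status label.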

Setting Up Prometheus

Create a prometheus.yml config file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-go-app'
    static_configs:
      - targets: ['localhost:8080']

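Run Prometheus with Docker. One catch: inside the container, localhost refers to the container itself, not your machine. On Docker Desktop, change the target above to host.docker.internal:8080. On Linux, you can instead add --network host to the command (and drop the -p flag).

docker run -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus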

Now open http://localhost:9090 and you can query your metrics.

Useful PromQL Queries

Here are the queries I use most often:

Requests per second (total across all paths and statuses)

sum(rate(http_requests_total[5m]))

Error rate (assuming 5xx = errors)

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

99th percentile latency (per path, aggregated across instances)

histogram_quantile(0.99, sum by (le, path) (rate(http_request_duration_seconds_bucket[5m])))

50th percentile (median, per path)

histogram_quantile(0.5, sum by (le, path) (rate(http_request_duration_seconds_bucket[5m])))

Top 5 slowest endpoints

topk(5, histogram_quantile(0.99, sum by (le, path) (rate(http_request_duration_seconds_bucket[5m]))))

Requests by status code

sum by (status) (rate(http_requests_total[5m]))

Best Practices

Use Labels Wisely

Labels are powerful, but every unique combination of label values creates a separate time series, so careless labels can explode your cardinality.

Good labels:

  • path (limited set of routes)

  • status (limited set of HTTP codes)

  • method (GET, POST, PUT, DELETE)

Bad labels:

  • user_id (unbounded, could be millions)

  • request_id (unique per request)

  • trace_id (unique per request)

If you add high-cardinality labels, you'll run out of memory fast.
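
Even path can turn into a bad label if the raw URL contains IDs, since /api/orders/1 and /api/orders/2 become two different series. Here's a minimal sketch of collapsing such paths before using them as a label value. It assumes IDs are purely numeric segments, and normalizePath and isNumeric are hypothetical helpers, not library functions. If your router exposes the matched route template, that is usually a better source for the label.

```go
package main

import "strings"

// normalizePath replaces purely numeric path segments with ":id" so that
// /api/orders/1 and /api/orders/2 map to the same label value.
func normalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, seg := range segments {
		if isNumeric(seg) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

// isNumeric reports whether s is non-empty and consists only of ASCII digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}
```

With this in place, normalizePath("/api/orders/12345") yields "/api/orders/:id". UUIDs or slugs would need their own rules, so treat this as a starting point rather than a complete solution.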

Choose the Right Histogram Buckets

The default buckets (prometheus.DefBuckets) range from 5ms to 10s, which suits typical web request latencies:

[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

But if your service has different characteristics, customize them:

Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}  // for fast APIs
Buckets: []float64{1, 5, 10, 30, 60, 120}  // for slow background jobs
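
You don't have to type boundaries by hand. client_golang has helpers such as prometheus.ExponentialBuckets(start, factor, count) and prometheus.LinearBuckets(start, width, count). To show what the exponential variant produces, here's a standard-library-only sketch of the same idea. exponentialBuckets is my own reimplementation for illustration, and it returns nil on invalid arguments, which may not match how the real helper treats them.

```go
package main

// exponentialBuckets returns count upper bounds, beginning at start and
// multiplying by factor at each step. It returns nil for arguments that
// cannot produce an increasing series.
func exponentialBuckets(start, factor float64, count int) []float64 {
	if count < 1 || start <= 0 || factor <= 1 {
		return nil
	}
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}
```

For example, exponentialBuckets(0.001, 2, 5) gives boundaries of roughly 1ms, 2ms, 4ms, 8ms, and 16ms, which keeps resolution high at the fast end where a low-latency API spends most of its time.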

Don't Instrument Everything

More metrics = more noise. Focus on what matters:

  • Request rate, latency, errors (the RED method)

  • Resource usage (CPU, memory, connections)

  • Business metrics (orders processed, payments completed)

Skip stuff like "function X was called"—that's what logs and tracing are for.

Use Counter, Not Gauge, for Totals

Common mistake: using a gauge for something that should be a counter.

Bad:

totalRequests = promauto.NewGauge(...)
totalRequests.Inc()

Good:

totalRequests = promauto.NewCounter(...)
totalRequests.Inc()

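Counters are designed for this. Functions like rate() and increase() assume a value that only goes up, so when the value drops (for example after a process restart) they treat it as a counter reset instead of a negative change. A gauge gets no such treatment.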
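
Here's a rough sketch of what that reset handling looks like. It is a simplified illustration, not the Prometheus implementation (the real functions also extrapolate to the edges of the time window), and increase here is a hypothetical helper name.

```go
package main

// increase returns the total growth of a counter across consecutive samples.
// A drop between two samples is treated as a reset to zero, so the whole
// post-reset value counts as growth.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// Counter reset: the process restarted and began counting from 0.
			delta = samples[i]
		}
		total += delta
	}
	return total
}
```

For the samples 100, 110, 5, 15 (a restart between 110 and 5), this returns 25, whereas naively subtracting the first sample from the last would give -85.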

Real-World Example

Here's a more complete example from one of my production services:

package metrics

import (
    "time"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    RequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "api_requests_total",
            Help: "Total API requests",
        },
        []string{"method", "path", "status"},
    )

    RequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "api_request_duration_seconds",
            Help:    "API request duration",
            Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
        },
        []string{"method", "path"},
    )

    DatabaseQueriesTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "db_queries_total",
            Help: "Total database queries",
        },
        []string{"query_type", "status"},
    )

    DatabaseQueryDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "db_query_duration_seconds",
            Help:    "Database query duration",
            Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1},
        },
        []string{"query_type"},
    )

    CacheHitsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "cache_hits_total",
        Help: "Total cache hits",
    })

    CacheMissesTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "cache_misses_total",
        Help: "Total cache misses",
    })
)

// Track database query
func TrackDBQuery(queryType string, fn func() error) error {
    start := time.Now()
    err := fn()
    duration := time.Since(start).Seconds()

    status := "success"
    if err != nil {
        status = "error"
    }

    DatabaseQueriesTotal.WithLabelValues(queryType, status).Inc()
    DatabaseQueryDuration.WithLabelValues(queryType).Observe(duration)

    return err
}

Usage:

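func GetOrder(ctx context.Context, orderID string) (*Order, error) {
    // Scan needs one destination per selected column. The columns and
    // fields used here (id, status) are placeholders for whatever your
    // orders table and Order struct actually contain.
    var order Order

    err := metrics.TrackDBQuery("get_order", func() error {
        return db.QueryRowContext(ctx,
            "SELECT id, status FROM orders WHERE id = ?", orderID,
        ).Scan(&order.ID, &order.Status)
    })
    if err != nil {
        return nil, err
    }

    return &order, nil
}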

This gives you visibility into database performance, cache hit rates (once you call CacheHitsTotal.Inc() and CacheMissesTotal.Inc() in your cache lookups), and API latency: everything you need to understand how your service is performing.

Integrating with Grafana

Prometheus is great for querying, but Grafana is better for dashboards.

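  1. Run Grafana:
docker run -d -p 3000:3000 grafana/grafana
  2. Add Prometheus as a data source (http://localhost:9090; if Grafana runs in Docker, use an address the container can reach, such as http://host.docker.internal:9090 on Docker Desktop)

  3. Create a dashboard with panels like: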

    • Request rate (line graph)

    • Error rate (single stat)

    • P50/P99 latency (line graph)

    • Request breakdown by path (pie chart)

Now you have a real-time dashboard showing how your service is performing.

Common Pitfalls

  1. High cardinality: Don't use unbounded labels (user IDs, request IDs).

  2. Too many metrics: Focus on what's actionable.

  3. Wrong metric type: Use counters for totals, histograms for durations.

  4. Forgetting to expose /metrics: Prometheus can't scrape if the endpoint isn't exposed.

  5. Not testing scraping: Use curl http://localhost:8080/metrics to verify.
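
If you want something more repeatable than eyeballing curl output, you can check the exposition text in a test or CI step. Below is a minimal sketch that scans the text for a "# TYPE" line. declaresMetric, sampleExposition, and checkSample are hypothetical names, and the sample payload is canned text standing in for a real /metrics response.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// declaresMetric reports whether the exposition text read from r declares a
// metric family called name via a "# TYPE <name> <type>" line.
func declaresMetric(r io.Reader, name string) (bool, error) {
	prefix := "# TYPE " + name + " "
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), prefix) {
			return true, nil
		}
	}
	return false, sc.Err()
}

// sampleExposition is canned output standing in for a real /metrics body.
const sampleExposition = `# HELP http_requests_total Total HTTP requests by path and status
# TYPE http_requests_total counter
http_requests_total{path="/",status="200"} 42
`

// checkSample runs declaresMetric against the canned payload.
func checkSample(name string) bool {
	ok, err := declaresMetric(strings.NewReader(sampleExposition), name)
	return err == nil && ok
}
```

Against a live service you would pass the response body from an HTTP GET on /metrics instead of the canned string.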

Wrapping Up

Metrics are essential for running production services. They let you answer questions like "is the service slow?" or "are we seeing more errors?" in seconds instead of hours.

Start simple: instrument your HTTP handlers with request count and duration. Once that's working, add database metrics, cache metrics, business metrics. Build it incrementally.

Next up, I'll cover distributed tracing with OpenTelemetry—because metrics tell you what is slow, but tracing tells you why.

Questions? Drop a comment. Always happy to talk about observability.

Thanks for reading! This is part of my series on building production-ready observability in Go. Follow along for more posts on tracing, logging, and alerting.
