Over the past five years, our team has shipped more than fifty production microservice systems built on Go and Kubernetes. The architecture has matured from "Docker containers with shell scripts" to a fairly opinionated stack that we now reach for by default. This article walks through the design decisions, the tradeoffs we accept, and the patterns we have learned, the hard way, to avoid.
Service Boundaries
The single most important decision in any microservice system is where to draw service boundaries. Get this wrong and no amount of clever infrastructure rescues you. Get it right and most other problems become tractable.
Our rule of thumb: a service should own a clearly bounded business capability and one data store. If two services need to write to the same table, they are the same service. If a single service is writing to three data stores with no transactional relationship between them, it is probably three services.
Communication Patterns
For synchronous service-to-service calls we use gRPC. For asynchronous events we use Kafka. We deliberately avoid REST between internal services — not because REST is bad, but because gRPC gives us versioned schemas, code-generated clients, and clear streaming semantics out of the box.
A common failure mode in microservice systems is the unbounded synchronous call chain. Service A calls B, which calls C, which calls D. When D slows down, every caller upstream of it stalls as well. Four patterns help here:
- Set aggressive client-side deadlines on every gRPC call and propagate them through the chain.
- Lift any non-critical step out of the request path and onto Kafka, with a separate consumer that can retry independently.
- Use circuit breakers at every service boundary — we use the sony/gobreaker library and have not regretted it.
- Cache aggressively for read paths, with TTLs short enough that staleness is acceptable.
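The first pattern above, one deadline set at the edge and propagated through the chain, can be sketched with nothing but the standard library. This is a minimal illustration, not our production code: handleA, callB, callC, callD, and timedOut are hypothetical names, and plain function calls stand in for gRPC hops. With grpc-go the same context.Context is passed to each generated client method, and the remaining deadline travels with the request to the next service.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callD stands in for the slowest downstream dependency (hypothetical).
// It finishes after `work` unless the caller's deadline expires first.
func callD(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callC and callB pass the context straight through, so the deadline
// chosen at the edge bounds every hop in the chain.
func callC(ctx context.Context, work time.Duration) error {
	if err := callD(ctx, work); err != nil {
		return fmt.Errorf("C -> D: %w", err)
	}
	return nil
}

func callB(ctx context.Context, work time.Duration) error {
	if err := callC(ctx, work); err != nil {
		return fmt.Errorf("B -> C: %w", err)
	}
	return nil
}

// handleA is the entry point: it sets a single budget for the whole request.
func handleA(budget, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return callB(ctx, work)
}

// timedOut reports whether a request with the given budget fails with
// context.DeadlineExceeded when D needs workMs milliseconds to answer.
func timedOut(budgetMs, workMs int) bool {
	err := handleA(
		time.Duration(budgetMs)*time.Millisecond,
		time.Duration(workMs)*time.Millisecond,
	)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("slow D exceeds the budget:", timedOut(20, 500))
	fmt.Println("fast D exceeds the budget:", timedOut(500, 5))
}
```

Because each hop wraps the error with %w, the caller at the edge can still detect context.DeadlineExceeded with errors.Is and also see which hop was waiting when the budget ran out.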
Deployment Topology
We deploy on Kubernetes, one namespace per environment, with each microservice as its own Deployment. We are conservative about operators: most services need only the standard primitives (Deployment, Service, ConfigMap, Secret, HPA), and we resist the urge to introduce CRDs unless the operational burden of not having them is demonstrably high.
Observability That Pays Off
Three streams matter: metrics, logs, and traces. We instrument every gRPC method and every Kafka consumer with OpenTelemetry. Metrics flow to Prometheus, traces to Grafana Tempo, logs to Loki. The combined view in Grafana lets an engineer follow a request across six services in a single timeline.
The most consequential observability investment we made was requiring every service to expose a /healthz endpoint that performs real dependency checks (database ping, Kafka consumer lag, downstream gRPC ping) rather than unconditionally returning 200. This cut our average incident detection time from 14 minutes to 3 minutes.
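A /healthz handler of this kind might look roughly like the sketch below. It uses only the standard library. The Check type, healthzHandler, and probe are hypothetical names, and the checks are fakes. In a real service the database check could wrap something like db.PingContext from database/sql, and the Kafka and gRPC checks would depend on the client libraries in use.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Check probes one dependency and returns nil when it is healthy.
type Check func(ctx context.Context) error

// healthzHandler runs every check under a shared timeout. It answers 200
// only if all checks pass, otherwise 503 with a per-dependency report.
func healthzHandler(checks map[string]Check, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()

		status := http.StatusOK
		results := make(map[string]string, len(checks))
		for name, check := range checks {
			if err := check(ctx); err != nil {
				status = http.StatusServiceUnavailable
				results[name] = err.Error()
				continue
			}
			results[name] = "ok"
		}

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		_ = json.NewEncoder(w).Encode(results)
	}
}

// probe calls the handler with fake checks and returns the HTTP status code.
// It exists only to demonstrate the handler without real dependencies.
func probe(dbUp bool) int {
	checks := map[string]Check{
		"database": func(ctx context.Context) error {
			if !dbUp {
				return errors.New("ping failed")
			}
			return nil
		},
		"kafka": func(ctx context.Context) error { return nil },
	}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	healthzHandler(checks, time.Second)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe(true), probe(false)) // prints "200 503"
}
```

The checks run sequentially here to keep the sketch short. With several slow dependencies, running them concurrently under the same timeout would keep the endpoint responsive.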
Graceful Degradation
When a downstream service fails, the question is not "how do we keep working as if nothing happened" — that is rarely possible — but "what is the smallest graceful failure we can present to the user."
For a product detail page, that might mean rendering everything except the personalized recommendations module when the recommendation service is down. For a checkout flow, it might mean accepting an order with a delayed fraud-check and surfacing the result asynchronously. Every team should keep a written list of degradation modes per service.
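The product page case can be sketched as follows. The ProductPage type, the RecommendFunc signature, the buildPage and demo helpers, and the 100 ms budget are all assumptions made for illustration. The point is only that a failed or slow recommendation call removes one module rather than failing the whole page.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ProductPage is a hypothetical view model for a product detail page.
type ProductPage struct {
	Title           string
	Recommendations []string // nil when the module is omitted
	Degraded        bool     // true when an optional module was dropped
}

// RecommendFunc abstracts the call to the recommendation service.
type RecommendFunc func(ctx context.Context, productID string) ([]string, error)

// buildPage always returns a renderable page. The recommendation call gets
// its own short budget, and any error only removes that module.
func buildPage(ctx context.Context, productID string, recommend RecommendFunc) ProductPage {
	page := ProductPage{Title: "Product " + productID}

	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()

	recs, err := recommend(ctx, productID)
	if err != nil {
		// Degrade: keep the core content, drop the personalized module.
		page.Degraded = true
		return page
	}
	page.Recommendations = recs
	return page
}

// demo builds a page against a fake recommendation service that is up or down.
func demo(serviceUp bool) (degraded bool, recCount int) {
	recommend := func(ctx context.Context, id string) ([]string, error) {
		if !serviceUp {
			return nil, errors.New("recommendation service unavailable")
		}
		return []string{"sku-1", "sku-2"}, nil
	}
	page := buildPage(context.Background(), "42", recommend)
	return page.Degraded, len(page.Recommendations)
}

func main() {
	degraded, n := demo(true)
	fmt.Println("service up:  ", degraded, n)
	degraded, n = demo(false)
	fmt.Println("service down:", degraded, n)
}
```

Recording the degradation explicitly (the Degraded field here) also gives the service something concrete to count in metrics, which makes each written degradation mode observable in practice.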
Distributed systems are not hard because the technology is hard. They are hard because the failure modes compound and most of them are invisible until they bite you. Plan for failure, observe everything, degrade gracefully.
Closing Thoughts
There is no "right" microservice architecture. There is an architecture that fits your organization, your domain, and your operational maturity. The patterns above have served us well across e-commerce, fintech, and logistics clients — but the most important thing we have learned is that the boundaries should follow the business, not the technology. When in doubt, draw the line where teams already draw it.