Set Up Cloud Workloads With Metrics You'll Wish You Already Had

It's hard to imagine successful software development today without DevOps. But DevOps has become increasingly complex, burdened by a zoo of tools and process gaps (think "silent tech debt"). At CloudGeometry, we have years of experience working through DevOps challenges across hundreds of client engagements, delivering high-quality releases into production all day, every day.

Today, the market has begun to use "Platform Engineering" to refer to more robust approaches to DevOps. Yet even before the term was in common use, we were already taking a rigorous, repeatable approach to resolving DevOps problems by leaning into proven open-source technologies. We call this integrated toolchain CGDevX. It's our reference implementation for Platform Engineering. You'll find our ongoing efforts in our repo on GitHub.

The culture of collaboration on which DevOps is built focuses on trust. That said, the increased velocity and variety that come with cloud-native versatility greatly intensify the rate of change in software delivery.

Sh*t is just happening faster. There's just more and more of it. No Slack channels or standup meetings can replace the power and autonomy unleashed by well-specified data.

That's why CGDevX places special emphasis on adding easy-to-write, easy-to-read data. When you want to make the most of cloud-native, it's a vital step towards closing the gaps between conventional DevOps and the untapped potential of a platform engineering approach.

Video Highlights

Open-source standard tools like Prometheus and Grafana add new dimensions of productivity and precision. They are essential to answering the questions that drive continuous improvement in your 21st-century SDLC. With CGDevX, we focus on enhancing workload monitoring and troubleshooting with a user-friendly approach to collecting and visualizing a wide range of metrics across Kubernetes clusters. Here’s a detailed summary of the key aspects of our approach:

Monitoring Setup and Tools: We use Prometheus to collect data from Kubernetes clusters, covering a variety of metric types such as counters, gauges, histograms, and summaries. Our setup includes the community kube-prometheus-stack, essential for medium to large-scale Kubernetes deployments.
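
To make the metric types above more concrete, here is a minimal, dependency-free TypeScript sketch of how a counter and a histogram look once rendered in the Prometheus text exposition format. The metric names (`http_requests_total`, `http_request_duration_seconds`) and the bucket boundaries are illustrative assumptions, not CGDevX's actual metrics, and a real application would rely on a client library rather than hand-rolled classes like these.

```typescript
// Illustrative sketch only: a hand-rolled counter and histogram rendered in the
// Prometheus text exposition format. Metric names and buckets are assumptions.

class Counter {
  private value = 0;

  constructor(private readonly name: string, private readonly help: string) {}

  // Counters only ever increase (they restart from zero when the process restarts).
  inc(by: number = 1): void {
    if (by < 0) throw new Error("a counter cannot decrease");
    this.value += by;
  }

  render(): string {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
      `${this.name} ${this.value}`,
    ].join("\n");
  }
}

class Histogram {
  private readonly bucketCounts: number[];
  private sum = 0;
  private count = 0;

  // upperBounds must be sorted in ascending order.
  constructor(
    private readonly name: string,
    private readonly help: string,
    private readonly upperBounds: number[]
  ) {
    this.bucketCounts = upperBounds.map(() => 0);
  }

  observe(value: number): void {
    this.sum += value;
    this.count += 1;
    // Buckets are cumulative: an observation counts toward every bucket
    // whose upper bound ("le") is greater than or equal to the value.
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.bucketCounts[i] += 1;
    });
  }

  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} histogram`,
    ];
    this.upperBounds.forEach((le, i) => {
      lines.push(`${this.name}_bucket{le="${le}"} ${this.bucketCounts[i]}`);
    });
    // The +Inf bucket always equals the total number of observations.
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n");
  }
}

const requests = new Counter("http_requests_total", "Total HTTP requests served.");
requests.inc();
requests.inc();

const latency = new Histogram(
  "http_request_duration_seconds",
  "Request latency in seconds.",
  [0.1, 0.5, 1]
);
[0.25, 0.5, 2].forEach((seconds) => latency.observe(seconds));

console.log(requests.render());
console.log(latency.render());
```

A gauge would look like the counter without the "cannot decrease" guard. Summaries differ from histograms in that quantiles are computed inside the application rather than derived from buckets at query time.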

Metrics Collection: We ensure the provision of all necessary Kubernetes metrics, supplemented with standard extensions like compute node and OS state metrics. For specific needs, we expose custom metrics like garbage collection or business metrics, which are collected via our Auto Discovery mechanism.

Demo Application Deployment: We deploy our example Node.js app, built with the Express framework, to show how to incorporate Prometheus metrics using the express-prom-bundle package. This setup automatically records metrics for all application routes and exposes them at the '/metrics' endpoint.
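
For readers curious about what a metrics middleware does behind the scenes, the sketch below shows the general idea in dependency-free TypeScript: count each finished request by method, route, and status code, then render the totals as the body of a `/metrics` response. This is not express-prom-bundle's implementation. The class name `RouteMetrics`, the `FinishedRequest` shape, and the label names are all hypothetical and chosen for illustration.

```typescript
// Dependency-free sketch of the bookkeeping a metrics middleware performs.
// This is NOT express-prom-bundle's code; every name here is hypothetical.

interface FinishedRequest {
  method: string;
  route: string; // the route pattern (e.g. "/users/:id"), not the raw URL
  statusCode: number;
}

class RouteMetrics {
  // Maps a rendered label set to the number of requests seen with those labels.
  private readonly counts: Record<string, number> = {};

  // Call once per completed request. In an Express app this would typically
  // happen inside a middleware, after the response has finished.
  record(req: FinishedRequest): void {
    const labels =
      `method="${req.method}",path="${req.route}",status_code="${req.statusCode}"`;
    this.counts[labels] = (this.counts[labels] || 0) + 1;
  }

  // Text to return from the /metrics route, in Prometheus exposition format.
  render(): string {
    const lines = ["# TYPE http_requests_total counter"];
    Object.keys(this.counts)
      .sort()
      .forEach((labels) => {
        lines.push(`http_requests_total{${labels}} ${this.counts[labels]}`);
      });
    return lines.join("\n") + "\n";
  }
}

const routeMetrics = new RouteMetrics();
routeMetrics.record({ method: "GET", route: "/hello", statusCode: 200 });
routeMetrics.record({ method: "GET", route: "/hello", statusCode: 200 });
routeMetrics.record({ method: "POST", route: "/orders", statusCode: 500 });

console.log(routeMetrics.render());
```

Labeling by route pattern rather than raw URL keeps the number of distinct time series bounded, which matters once Prometheus scrapes the endpoint at scale.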

Data Visualization with Grafana: To make sense of the plethora of metrics, we use Grafana for visualization, equipped with pre-packaged and customized dashboards. Our Grafana setup includes SSO access for read-only users, typically developers, ensuring secure and controlled access.

Granular and Global Views: Our Grafana dashboards offer both a global view of Kubernetes metrics and the ability to zoom into specific services or environments. We provide detailed views of nodes, pods, containers, and even Kubernetes internals like API server health.

Service-Specific Dashboards: We include dashboards for platform services like Argo CD and nginx, offering insights into application status, resource usage, request latency, and more.

Workload-Specific Dashboards: For workload-specific analysis, we offer dashboards like the USE method (Utilization, Saturation, Errors) view, which shows system performance metrics such as CPU, memory, and network utilization.
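
Utilization figures like these are usually derived from cumulative counters rather than read directly. The TypeScript sketch below shows the arithmetic under simplifying assumptions: two scrapes of a "busy CPU seconds" counter are turned into a per-second rate, which is then divided by the number of cores. The function names and sample values are hypothetical, and this is a simplification of what a PromQL `rate()` query does (the real function also handles extrapolation across the query window).

```typescript
// Simplified sketch: deriving CPU utilization from two scrapes of a
// cumulative "busy CPU seconds" counter. Names and values are hypothetical.

interface CounterSample {
  timestampSec: number; // scrape time, in seconds
  value: number; // cumulative counter value at that time
}

// Average per-second increase between two scrapes. If the counter went down,
// the process restarted and the counter reset to zero, so the newer value is
// taken as the increase since the reset.
function perSecondRate(older: CounterSample, newer: CounterSample): number {
  const elapsed = newer.timestampSec - older.timestampSec;
  if (elapsed <= 0) throw new Error("samples must be in chronological order");
  const increase =
    newer.value >= older.value ? newer.value - older.value : newer.value;
  return increase / elapsed;
}

// Utilization as a fraction between 0 and 1: busy CPU-seconds consumed per
// second of wall-clock time, divided by the number of available cores.
function cpuUtilization(
  older: CounterSample,
  newer: CounterSample,
  cores: number
): number {
  if (cores <= 0) throw new Error("cores must be positive");
  return perSecondRate(older, newer) / cores;
}

// 60 busy CPU-seconds accumulated over 60 seconds on a 2-core node.
const utilization = cpuUtilization(
  { timestampSec: 0, value: 100 },
  { timestampSec: 60, value: 160 },
  2
);
console.log(utilization);
```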

Access Management and Alerting: Access to Grafana is managed centrally using Vault as the OIDC provider, with RBAC configured to prevent unauthorized changes. We also emphasize the importance of alerting, with pre-configured rules and the ability for users to customize and export these rules to the GitOps repository.

Real-Time Monitoring and Alerting: After modifying our demo application, we monitor the impact on application metrics, set up alerts for specific thresholds, and configure notifications through channels like Slack. We successfully observe and resolve triggered alerts, demonstrating the effectiveness of our monitoring setup.
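
The alert lifecycle described above can be sketched in a few lines of TypeScript. Threshold alerts in Prometheus-style rules pair a condition with a hold duration (`for` in a Prometheus rule): the alert is pending while the condition holds and only fires once it has held for the full duration. The rule name, threshold, and sample data below are hypothetical. The `slackPayload` helper only builds a message body (Slack incoming webhooks accept JSON with a `text` field); it does not send anything and is not how Grafana itself delivers notifications.

```typescript
// Sketch of threshold alerting with a hold duration. All names, thresholds,
// and sample values are hypothetical illustrations.

type AlertState = "inactive" | "pending" | "firing";

interface MetricSample {
  timestampSec: number;
  value: number;
}

// Returns the alert state after each sample for the rule
// "value > threshold, sustained for at least forSec seconds".
function evaluateAlert(
  samples: MetricSample[],
  threshold: number,
  forSec: number
): AlertState[] {
  const states: AlertState[] = [];
  let breachedSince: number | null = null;

  for (let i = 0; i < samples.length; i++) {
    const sample = samples[i];
    if (sample.value <= threshold) {
      // Condition cleared: the alert resolves and the hold timer restarts.
      breachedSince = null;
      states.push("inactive");
      continue;
    }
    if (breachedSince === null) breachedSince = sample.timestampSec;
    const heldFor = sample.timestampSec - breachedSince;
    states.push(heldFor >= forSec ? "firing" : "pending");
  }
  return states;
}

// Builds (but does not send) a JSON body for a Slack incoming webhook.
function slackPayload(alertName: string, state: AlertState, value: number): string {
  return JSON.stringify({
    text: `[${state.toUpperCase()}] ${alertName}: current value ${value}`,
  });
}

// Error ratio sampled once a minute; alert if above 5% for two minutes.
const errorRatio: MetricSample[] = [
  { timestampSec: 0, value: 0.01 },
  { timestampSec: 60, value: 0.08 },
  { timestampSec: 120, value: 0.09 },
  { timestampSec: 180, value: 0.1 },
  { timestampSec: 240, value: 0.02 },
];

const alertStates = evaluateAlert(errorRatio, 0.05, 120);
console.log(alertStates.join(" -> "));
console.log(slackPayload("HighErrorRatio", "firing", 0.1));
```

The hold duration is what keeps a single noisy scrape from paging anyone: the brief breach has to persist before the state moves from pending to firing.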

In conclusion, this approach to Kubernetes monitoring leverages the strengths of open-source tools to simplify the challenges of workload monitoring and troubleshooting. With pre-configured metric collection, customizable dashboards, and efficient alerting mechanisms, you can achieve enhanced observability and operational efficiency while keeping the complexities of user and permission management under control.

David is a longtime Silicon Valley executive and a skilled & experienced tech leader, with decades of experience in customer-facing roles practicing product and service management grounded in process analytics. His work spans cloud infrastructure, analytics, mobile/embedded and open source. He’s a startup veteran (10+ venture-funded companies, both successful outcomes and the other kind), and has also served 12+ years in product & business leadership roles at publicly-traded enterprise tech corporations.