With your ingest data charted correctly, you have the ability to start optimizing your telemetry to reduce your redunant ingest data and lower your ingest costs. The first step is to create an optimization plan, and then use data optimization techniques to put that plan into action.
Understand your observability objectives
One of the most important parts of the data ingest governance framework is to create observability value drivers. These help you align your data with concrete metrics that you can use to measure how important (or redundant) they are to your goals.
They also help you understand your objectives when you need to configure new telemetry in the future. When you do, you'll want to understand what it delivers to your overall observability systems to prevent any needless overlap. If you find yourself creating new telemetry data that doesn't align to any of the objectives below, it may be a good sign that the data isn't needed, and that you can prevent creating it to help cut down on costs:
Meeting an internal SLA
Meeting an external SLA
Supporting feature innovation (A/B performance and adoption testing)
Monitoring customer experience
Holding vendors and internal service providers to their SLA
Monitoring business process health
Other compliance requirements
Alignment to these objectives allows you to make the right decisions about value prioritization and helps guide teams to know where to start when instrumenting new platforms and services.
Develop an optimization plan
Once you understand your objectives, it's time to get an optimization plan in place. This plan will help you measure the value of your ingest data, and helps you find what data you can safely exclude to help keep costs down.
For this section, we'll make two core assumptions:
You have the tools and techniques from the baseline your data ingest doc to properly track where your data is coming from.
You have a good understanding of the observability maturity value drivers, which are crucial to set a value and a priority to groups of telemetry.
We have three examples below of how you can assess your own telemetry ingest and make the right decisions for your organization's needs. Although each of these examples focuses on one value driver, most instrumentation provides data on many areas of value.
An account is ingesting about 20% more than they'd budgeted for. They've been asked by a manager to find some way to reduce consumption. Their most important value driver is Uptime, performance, and reliability.
Their estate includes:
(dev, staging, prod)
Distributed tracing
Browser
Infrastructure monitoring of 100 hosts
Kubernetes monitoring (dev, staging, prod)
Logs (dev, staging, prod - including debug)
Optimization plan
Omit debug logs (knowning they can be turned on if there is an issue), saving 5%.
Omit several Kubernetes state metrics which are not required to display the Kubernetes cluster explorer, saving 10%.
Drop some custom browser events they were collecting when they tested of new features, saving 10%.
After executing those changes the team is 5% below their budget and has freed up some space to do a NPM pilot, completing the task assigned to them by their manager.
Final outcome
5% under their original budget
Headroom created for an NPM pilot which serves uptime, performance, and reliability objectives
Minimal loss of uptime and reliability observability
A team responsible for a new user-facing platform with an emphasis on and browser monitoring is running 50% over budget. They'll need to right-size their ingest, but they're adamant about not sacrificing any Customer experience observability.
Their estate includes:
Mobile
Browser
APM
Distributed tracing
Infrastructure on 30 hosts, including process samples
Serverless monitoring for some backend asynchronous processes
Logs from their serverless functions
Various cloud integrations
Optimization plan
Omit the serverless logs, which are redundant for thier needs due to the Lambda integration.
Decrease the process sample rate on their hosts to every one minute.
Drop process sample data in dev environments.
Turn off EC2 integration which is highly redundant with other infrastructure monitoring provided by the New Relic infra agent.
Final outcome
5% over their original budget
Sufficient to get them through peak season
No loss of customer experience observability
After executing the changes, they're now just 5% over their original budget and conclude that this is enough to carry them through peak season.
A team is in the process of refactoring a large Python monolith to four microservices. The monolith shares infrastructure with the new architecture including a customer database and a cache layer. They're 70% over budget and have two months before they can officially decommission the monolith.
Their estate includes:
Kubernetes monitoring (microservices)
New Relic host monitoring (monolith)
APM (microservices and host monitoring)
Distributed tracing (microservices and host monitoring)
Postgresql (shared)
Redis(shared)
MSSQL (future DB for the microservice architecture)
Load balancer logging (microservices and host monitoring)
Optimization plan
Configure load balancer logging to only monitor 5xx response codes.
Set the custom sample rate on ProcessSample, StorageSample, and NetworkSample to 60s for hosts running the monolith
Disable MSSQL monitoring since the new architecture doesn't use it.
Disable distributed tracing for the monolith as it's far less useful then it will be for the microservice architecture.
Final Outcome
1% under their original budget
No loss of Innovation and growth observability
Sugerencia
We recommend you to track the plan in a task managing tool to help manage the optimization plan and understand the effect each optimization task takes. You can use this Data optimization plan template to help.