While running the KTranslate Docker container for New Relic network monitoring, you can monitor the health of the container to proactively detect potential issues.
The KTranslate container image accepts the -tee_logs=true and -metrics=jchf options at runtime, which allow it to send its own logs and health metrics directly to New Relic. Both options are enabled by default when you install network monitoring through the New Relic guided install. We recommend that you set them when you install network monitoring manually.
Logs from KTranslate
Tip
If you want to check the logs locally from the Docker host, run docker logs $CONTAINER_NAME. For example, docker logs ktranslate-snmp.
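For quick local triage, the output of docker logs can be narrowed to entries that are not at Info severity. The following is a minimal Python sketch, not part of KTranslate: the function names non_info_lines and container_log_lines are hypothetical, and it assumes each local log line carries its severity in brackets (for example [Info]), which is the same marker the message-based NRQL query in this section filters on.

```python
import subprocess


def non_info_lines(lines):
    """Return the log lines that are not tagged with the [Info] severity."""
    # Assumption: the severity appears in the line as a bracketed tag.
    return [line for line in lines if "[Info]" not in line]


def container_log_lines(container_name):
    """Collect the output of `docker logs` for one container as a list of lines."""
    result = subprocess.run(
        ["docker", "logs", container_name],
        capture_output=True,
        text=True,
        check=True,
    )
    # A container may write to stdout, stderr, or both, so read both streams.
    return (result.stdout + result.stderr).splitlines()
```

On a Docker host, non_info_lines(container_log_lines("ktranslate-snmp")) would return only the entries worth a closer look.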
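The -tee_logs=true option sends logs to New Relic when polling devices. To see the log entries that are not at Info severity, use one of the following queries.
Without a parsing rule applied to your logs
NRQL:
FROM Log SELECT * WHERE `collector.name` = 'ktranslate' AND `message` NOT LIKE '%[Info]%'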
With a parsing rule applied to your logs
Logs UI:
collector.name:"ktranslate" severity:-"Info"
NRQL:
FROM Log SELECT * WHERE `collector.name` = 'ktranslate' AND `severity` != 'Info'
Expected Results:
KTranslate>cisco-7513 There was an SNMP polling error with the CustomDeviceMetrics walking OID .1.3.6.1.2.1.4.31.1.1.21 after 0 retries: request timeout(after 0 retries).
Tip
KTranslate has the following log severity levels: Info, Warn, and Error.
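To find the log entries that report how many match attributes were added to each device, run this NRQL query:
FROM Log SELECT * WHERE `collector.name` = 'ktranslate' AND `message` LIKE '%Match Attribute%'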
Expected Results:
KTranslate>cisco-7513 Added 1 Match Attribute(s)
All devices are expected to have at least 1 Match Attribute, inherited from the default monitor_admin_shut: true configuration. A device to which you have added a single match attribute should therefore show a value of 2.
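The count check described above can be automated once the log messages have been exported. This is a small Python sketch under stated assumptions: the message text follows the "Added N Match Attribute(s)" pattern shown in the expected results, the default monitor_admin_shut: true attribute is in effect, and the helper names are hypothetical.

```python
import re

# Pattern taken from the expected log message "Added 1 Match Attribute(s)".
MATCH_ATTRIBUTE_RE = re.compile(r"Added (\d+) Match Attribute\(s\)")


def match_attribute_count(message):
    """Return the number of match attributes reported in a message, or None."""
    found = MATCH_ATTRIBUTE_RE.search(message)
    return int(found.group(1)) if found else None


def has_expected_attributes(message, user_defined):
    """Compare the reported count with 1 default attribute plus user-defined ones."""
    return match_attribute_count(message) == 1 + user_defined
```

For a device with one custom match attribute, has_expected_attributes(message, 1) is true only when the message reports 2.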
Tip
You can further filter these results by adding the device name to your query: collector.name:"ktranslate" message:"*$DEVICE_NAME*Match Attribute*".
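If you check many devices, the query in this tip can be generated instead of edited by hand. The helper below is a hypothetical Python sketch that only substitutes the device name into the query string shown above; it assumes the name contains no double quotes or other characters that would need escaping.

```python
def match_attribute_log_query(device_name):
    """Build the Logs UI query that narrows Match Attribute messages to one device."""
    # Assumption: device_name needs no escaping inside the quoted pattern.
    return f'collector.name:"ktranslate" message:"*{device_name}*Match Attribute*"'
```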
Metrics from KTranslate
The -metrics option captures the following performance metrics when polling devices:
| Metric | Granularity | Description |
| --- | --- | --- |
| baseserver_healthcheck_execution_total | Top Level | Rate of internal health checks. It mainly shows that the process is not deadlocked, and it should always be greater than 0. |
| inputq | Top Level | Messages per second (msg/sec) received over the last 60 seconds from all SNMP, Flow, and VPC inputs combined. |
| jchfq | Top Level | Gauge of the number of available pre-allocated buffers. It should be about 8,000. |
| delivery_metrics_nr | Delivery to New Relic | Batches per second (batches/sec) sent over the last 60 seconds for all metrics to New Relic. |
| delivery_logs_nr | Delivery to New Relic | Logs per second (logs/sec) sent over the last 60 seconds for all logs to New Relic. |
| delivery_wins_nr | Delivery to New Relic | Successful deliveries per second (wins/sec), counted as HTTP 200 responses received over the last 60 seconds from sending metrics and events to New Relic. |
| device_metrics | SNMP | Polls per second (polls/sec) of SNMP polling over the last 60 seconds for device level metrics. |
| interface_metrics | SNMP | Polls per second (polls/sec) of SNMP polling over the last 60 seconds for interface level metrics. |
| snmp_fail | SNMP | Gauge that shows whether SNMP polling is working, faceted by device_name: 1 means good and 2 means fail. |
| netflow.flows | Netflow | Flows per second (fps) received over the last 60 seconds for all device flow data: IPFIX, NetFlow, or sFlow. |
| syslog_queue | Syslog | Gauge of syslog messages waiting to be processed. |
| syslog_errors | Syslog | Errors per second (errors/sec) over the last 60 seconds while processing syslog messages. |
| syslog_messages | Syslog | Messages per second (msg/sec) received over the last 60 seconds for all syslog data. |
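The expected values listed for these metrics can be turned into a simple health check. The following Python sketch is illustrative only: the function name, the shape of the input dictionary, and the 10% tolerance below 8,000 for jchfq are assumptions, and retrieving the metric values from New Relic is outside its scope.

```python
def ktranslate_health_issues(metrics):
    """Return human-readable problems found in a snapshot of KTranslate metrics.

    `metrics` maps metric names to their latest values; the `snmp_fail`
    entry maps each device_name to its gauge value.
    """
    issues = []

    # The health check rate should always be greater than 0.
    if metrics.get("baseserver_healthcheck_execution_total", 0) <= 0:
        issues.append("health checks are not running (possible deadlock)")

    # jchfq should be about 8,000; the 10% margin is an assumed tolerance.
    jchfq = metrics.get("jchfq")
    if jchfq is not None and jchfq < 8000 * 0.9:
        issues.append(f"jchfq is low: {jchfq} available buffers")

    # snmp_fail: 1 means good and 2 means fail, per device.
    for device, value in sorted(metrics.get("snmp_fail", {}).items()):
        if value == 2:
            issues.append(f"SNMP polling is failing for {device}")

    return issues
```

An empty list means none of the three checks found a problem; it does not cover the rate metrics in the table, whose healthy ranges depend on your environment.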