Hello folks, this is going to be a quick one (well, compared to the others…). A while back we had to switch our SNMP monitoring to SNMPv3, which was a bit of a problem as it wasn’t supported on all of our switches. The newer models supported both auth and priv modes, but the older ones mostly supported just auth (no encryption). It was a bit of a pain to separate which was which, since we also had to make sure the hashing and encryption algorithms were supported. After fighting a few battles with our various NMS tools, we finally claimed victory.
Ninja InfluxDB Assassins or Runaway Tags
Of course it wasn’t quite that simple with our TIG stack: the procedure meant that I had to separate the switches into two categories, split each config file in our telegraf.d directory (the folder from which every config file is loaded) in two, and adjust the settings for the two SNMPv3 modes. It was a tedious task, it took some time, and, as I was interrupted multiple times, I left some mistakes behind. Since most things worked when I finished, I never found out about them and went on to the next project with a happy smile on my face.
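For context, here is a minimal sketch of the two kinds of inputs.snmp configs involved. The agent addresses, user name and passwords are hypothetical placeholders, and I am only showing the SNMPv3 security options, not the field/tag definitions:

# newer switches: authentication and encryption (authPriv)
[[inputs.snmp]]
  agents = ["10.0.2.10"]
  version = 3
  sec_name = "monuser"
  sec_level = "authPriv"
  auth_protocol = "SHA"
  auth_password = "authpass"
  priv_protocol = "AES"
  priv_password = "privpass"

# older switches: authentication only, no encryption (authNoPriv)
[[inputs.snmp]]
  agents = ["10.0.1.10"]
  version = 3
  sec_name = "monuser"
  sec_level = "authNoPriv"
  auth_protocol = "SHA"
  auth_password = "authpass"

The sec_level value is where the two categories differ.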
At some point during last week we faced a few problems with our VPN infrastructure. Naturally we turned to our TIG stack for performance graphs. But something was not right: for some nodes, the graphs were not working. First I checked whether the containers were up:
docker ps
That wasn’t it. On the same TIG nodes some things worked and some didn’t. As I had neglected to study how to properly create and assign retention policies to my databases, I felt guilt and fear creeping up on me. So I turned to the logs:
docker-compose logs --follow --tail 100
That’s when I noticed a message complaining about max tag values:
lvl=warn msg="max-values-per-tag limit may be exceeded soon"
I thought that total data retention had hit me, so following what is described in this link I altered the default retention policy “autogen” with the following:
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 52w SHARD DURATION 120d DEFAULT
That of course only works after you enter the container, run the influx CLI, authenticate, and choose your database:
docker exec -it influxdb /bin/bash
influx
auth
use telegraf
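If you want to verify that the new policy is in place, you can list the retention policies on the database (standard InfluxQL):
SHOW RETENTION POLICIES ON "telegraf"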
I also deleted some data before a certain point in time with:
DELETE WHERE time < '2020-03-01 01:01:01'
but I found out that deleting data doesn’t remove the series from the index; the DROP SERIES command does that, and I had some trouble using it. I started reading more from the section of the InfluxDB documentation where the InfluxDB Query Language is described: https://docs.influxdata.com/influxdb/v1.7/query_language/ (version 1.7).
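For reference, DROP SERIES accepts a WHERE clause on tags (but not on time); a minimal example with a made-up tag value:
DROP SERIES FROM "snmp" WHERE "hostname" = '12345'
Unlike DELETE, this removes the matching series from the index along with the data.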
As the log messages kept coming, it dawned on me that deleting data wasn’t the answer: my problem was that I had too many tag values, apparently (duh..), I just didn’t want to believe it. Since the main source of tag values in my case was the different hostname values from our network nodes, that could not be normal; the number of nodes per TIG host was way below the limit (the max-values-per-tag setting, 100000 by default). I used the following command to see what tag values I had:
SHOW TAG VALUES WITH KEY = "hostname"
That did indeed produce too many values, but what was strange was that the surplus values were all integers. Something was feeding bad data into my DBs. I needed to do two things:
- Find the source of the problem so the flooding of bad values would stop.
- Clean up the mess.
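By the way, if you want to put a number on the problem first, InfluxDB 1.4+ supports cardinality queries (standard InfluxQL):
SHOW SERIES CARDINALITY
SHOW TAG VALUES CARDINALITY WITH KEY = "hostname"
The first estimates the number of series in the current database, the second the number of distinct values for a given tag key.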
Learn to sleep on it
I was getting tired and it was time to go home, so I gave up for the day. Next morning, the result of too many google searches, big data and Big Brother watching us bore fruit: the following post appeared on my google discover page as I was walking my dog before going to work: https://www.influxdata.com/blog/solving-runaway-series-cardinality-when-using-influxdb/ . It was obvious that this was my issue. I didn’t quite understand what that language was or what the references to buckets were about, but I soon discovered that they didn’t apply to my version of InfluxDB, only to version 2.0: https://docs.influxdata.com/influxdb/v1.8/concepts/glossary/#retention-policy-rp (“A bucket is a named location where time series data is stored in InfluxDB 2.0. In InfluxDB 1.8+, each combination of a database and a retention policy (database/retention-policy) represents a bucket.“). So I had to find another way to clean things up.
Solve the problem
Just like the famous phrase from “A.Friend” in the film Disclosure, I had to solve the problem. I used the commands:
SHOW TAG VALUES WITH KEY = "hostname"
SHOW SERIES FROM snmp
(snmp is the measurement where I look up the hostname and define it as a tag that is later inherited by the interface tables; basic monitoring stuff from my older post series, part1 and part2). The second command gave me a quick idea of which nodes were giving me problems. I went back to my telegraf snmp plugin configs and found out I had two kinds of problems:
- I had copied configs from another group of switches that did support auth and priv mode into configs for older switches that didn’t support encryption. So the responses came back encrypted, and that’s how the bogus hostnames / tag values were created.
- I had cloned configs and modified them correctly to support older switches with SNMPv3, but had left the node addresses in the original file too, resulting in the same problem as above.
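To illustrate (with made-up values), a healthy series key in the SHOW SERIES output looked something like
snmp,agent_host=10.0.1.10,hostname=switch-access-01
while the bogus ones looked like
snmp,agent_host=10.0.1.10,hostname=28531
which made the offending agents easier to spot.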
For both of those issues I had to look inside the configs and run a lot of grep commands for the IP addresses contained in the “agent_hostname” field, to find out whether each IP address was present in multiple config files. For example, while in the telegraf.d folder, running the following:
grep "10.0.1.10" *
… will return all the occurrences of the IP address 10.0.1.10 along with the names of the config files that contain it.
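To speed that up, here is a small shell sketch that flags every IPv4 address appearing in more than one config file (run inside the telegraf.d folder; the *.conf glob is an assumption, adjust it to your file naming):

# collect every IPv4-looking string, then count the files each one appears in
for ip in $(grep -hoE '([0-9]{1,3}\.){3}[0-9]{1,3}' *.conf | sort -u); do
  count=$(grep -lF "$ip" *.conf | wc -l)
  if [ "$count" -gt 1 ]; then
    echo "$ip appears in $count files:"
    grep -lF "$ip" *.conf
  fi
done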
Of course the list of bad tag values was enormous (thousands), so it was too difficult to track down all the causes, even going first for the “low hanging fruit” as the blog post from InfluxData suggested (the values causing the most problems). My eyes literally hurt and I felt I was losing the battle. I had to find a way to clean up my DBs faster.
Clean up the mess
In an older post I described how I used the influxdb python library to create data points for the number of Checkpoint VPN users and then graphed the results with Grafana, a solution still in use today, as the pandemic is still fully going on. Using Python was probably a good idea, so I looked the documentation up again, but methods to manage my DB, and in particular to drop series matching a certain set of criteria, were not implemented. What was implemented was a way to execute raw InfluxQL queries, which was better than nothing. It would not be as fast, but it would certainly be faster than me typing those queries by hand, and my eyes would hurt less. So after some trial and error, I came up with the following script.
#!/usr/bin/python3
from influxdb import InfluxDBClient

# Define the connection object and select the database
client = InfluxDBClient(host='servername', port=8086, username='influxdbuser', password='influxdbpassword')
client.switch_database('telegraf')

# Set a query to find the tag values that come from the "hostname" tag defined in the telegraf config
query = 'show tag values from snmp with key = "hostname"'
# Execute the query
result = client.query(query)
# Store the different tag values; each item is a [key, value] pair
values = result.raw['series'][0]['values']

# Loop through the tag values; if a value is a positive integer, drop that series
# from the index. That takes care of the tag value as well (it's removed).
for item in values:
    if item[1].isdigit():
        print("item:")
        print(item)
        dropquery = f'drop series where "hostname" = \'{item[1]}\''
        dropresult = client.query(dropquery)
After running that, I had to restart the influxdb instance so it was cleaned up and back in business as usual (I used docker-compose in my setup to combine telegraf and influxdb for each node, as described in the series of posts I mentioned earlier).
docker-compose restart
I made sure (using the logs) that influxdb had booted up normally and was back to accepting writes (POST requests) from telegraf before going in to check again.
docker exec -it influxdb /bin/bash
influx
auth
show series
show series from snmp
show series from interface
show series from "interface old"
show tag values with key = "hostname"
No strange tag values there, so all good!!
Wrap-up!
Well that was it! Not too long this time, right? Same here, it only took about an hour to write and get everything right. If you want to know more about how to use the influxdb python client, besides taking a look at the documentation (https://influxdb-python.readthedocs.io/en/latest/), take a look at their github page, where you can find the code and examples. Take care, until next time; you can always look me up on Twitter under the handle mythryll.