In the previous episode..
In part 1, we set up the playground. In part 2, we look at the planning.
Migration planning – creating an appetite
What’s for dinner?
So what did we build exactly? And how does it need to work? Here is what we needed the code to do for each migration:
- Get the mac addresses of end hosts/machines from the access switches participating in the migration, along with port/vlan/switch ip/hostname information.
- Get the arp data for the same end hosts/machines from the relevant distribution switches doing the inter-vlan routing for those sites. Combine the arp data with the data already collected, in order to get the ip addresses of the end hosts/machines into the same records. If there is no ip address for a mac address (a rare occurrence), just leave it blank. Things to keep in mind for this:
- Both of the above tasks happen in parallel for each block of actions to minimize collection time. It usually takes only a few seconds, depending on the access switch platforms (platforms supporting only telnet transport take longer to connect and get data back).
- A common approach and code flow is used regardless of whether the transport for the access switches is ssh or telnet: the same code is called with a function passed as a parameter, which directs the flow towards the appropriate nornir filter.
- Execute name lookups for all the ip addresses collected and get hostnames back. If no hostname comes back, fill in a special text value (‘resolve failed’).
- Store the data in a file if the respective option was used in the code (we may also want to just print or just return the data; a sketch of this flow follows the list).
- All code produced should be reusable in a different context (for example as part of a different application, called as a library function, where debug/print output would make no sense). For that reason, storing to files, printing and logging should all be optional and activated with the respective flags.
- All code should follow the DRY principle, to make sure that work or mistakes are not replicated and that the same code flow can be re-used in a different context.
- Functions should not be too large, so that the DRY principle is easier to follow. In that spirit, each function should perform as small a number of tasks as possible.
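To make this more concrete, here is a minimal sketch of the enrichment flow described above, assuming the mac and arp data have already been collected; all function and field names are illustrative, not the actual code from the repo:

```python
# A minimal sketch of the collection/enrichment flow. Function and field
# names are illustrative, not the actual code from the repo.
import json
import socket


def enrich_with_arp(mac_entries, arp_entries):
    """Attach an ip address to each mac-table record; leave it blank if absent."""
    mac_to_ip = {e["mac"]: e["ip"] for e in arp_entries}
    for entry in mac_entries:
        entry["ip"] = mac_to_ip.get(entry["mac"], "")
    return mac_entries


def enrich_with_dns(entries):
    """Reverse-resolve each ip; fill in a marker value when resolution fails."""
    for entry in entries:
        if not entry["ip"]:
            entry["hostname"] = ""
            continue
        try:
            entry["hostname"] = socket.gethostbyaddr(entry["ip"])[0]
        except OSError:
            entry["hostname"] = "resolve failed"
    return entries


def collect(mac_entries, arp_entries, to_file=None, do_print=False):
    """Printing/storing are optional and flag-controlled, so the same function
    can also be called as a library routine from a different application."""
    results = enrich_with_dns(enrich_with_arp(mac_entries, arp_entries))
    if do_print:
        for record in results:
            print(record)
    if to_file:
        with open(to_file, "w") as fh:
            json.dump(results, fh, indent=2)
    return results
```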
I struggled with the question of whether I should go for an object-oriented architecture or not. I am still not sure what’s best. So far, no classes. Maybe in a future iteration of the code.
But that’s just the initial data collection before the migration, right?
Exactly. Then the actual migration takes place, and as each location finished, we would need to check. So:
- If we are migrating from an old rack to a new rack, but still with legacy network devices, then the data collection algorithm would need to be run again, but this time the results would be compared against historical data to determine what happened to each end host:
- Did we get everyone back?
- Did we reconnect hosts to the right switch/port/vlan?
- If we are migrating from the legacy network to the SDA network (same rack), then after the migration all the necessary data would be in Cisco ISE, so a query would need to be made there and the result again compared with historical data, to answer similar questions (see the comparison sketch after this list):
- Did we get everyone back?
- Are hosts and users authenticated with 802.1x or MAB?
- Are they idle?
- Did they end up on the guest network?
- etc.
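The comparison itself can stay simple. A minimal sketch, assuming before/after records that carry "mac", "switch", "port" and "vlan" keys (illustrative names):

```python
# A minimal before/after comparison, assuming records with "mac", "switch",
# "port" and "vlan" keys (illustrative names).
def _location(record):
    return (record["switch"], record["port"], record["vlan"])


def compare_migration(before, after):
    before_by_mac = {r["mac"]: r for r in before}
    after_by_mac = {r["mac"]: r for r in after}

    # Hosts present before the migration but missing afterwards.
    lost = sorted(set(before_by_mac) - set(after_by_mac))

    # Hosts that came back, but on a different switch/port/vlan.
    moved = [
        mac
        for mac in set(before_by_mac) & set(after_by_mac)
        if _location(before_by_mac[mac]) != _location(after_by_mac[mac])
    ]
    return lost, moved
```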
How does this end?
If we get everyone back on the first go, celebrate, that’s a huge success! But usually, after all these years, something is missed, or a cable may be in a bad state, and for whatever reason we may get bad results, like lost hosts. Or we may have made a mistake and connected a host to a different switch/port/vlan than the original connection. Depending on what exactly happened, we need to ask the team performing the re-patching part of the migration to check the specific cable connections for errors/omissions, and then we have to check again.
Or if the migration is towards the SD-Access network, then perhaps a host may fail to pass 802.1x or MAB authentication, end up in a different network than the desired one (for example a guest network) or the port may need flapping to work correctly. In some cases we may have to intervene on the network devices themselves or the hosts or we may need to provide workarounds to force naughty devices (e.g. wayward printers) to behave correctly.
If some problems remain, the necessary data must be collected in order to re-queue them as faults and try to correct them on the next business day, first thing in the morning. Getting the right data to the right people in this case may prove valuable and will certainly minimize impact and time to recovery. You can’t win all the time, but you can try to mitigate the damage.
Taking things a couple of steps further
Who is who
In quite a few cases, we would need to find out which user was using which workstation. If already migrated to SDA, that can be easy: ISE has all the information already – well, almost all of it, but you can get the rest easily. On the legacy network, however, that’s a little harder. We can get almost everything we need during the collection phase, except for which user corresponds to each workstation (802.1x is not configured/available on the legacy network).
Why do we need that? Well here are two main use cases, but more can be derived:
- Before each migration, we needed to gather the list of users involved and the list of departments affected, so we could send an email to all of them, warning of the change and asking them to keep their workstations powered on throughout the migration (so we could check that we didn’t miss anyone).
- Once the migration was done, we would need to see whether any hosts were left without a connection (it happens when the cabling is old or when there is just too much re-patching to do), and if so, which users those ‘lost’ hosts correspond to. If we know who they are:
- then maybe we can trace their cable connection to their office and then to the patch panel,
- or at least we can find out who our colleagues from the help desk will need to call on Monday morning and where the cable technicians will need to go to look for that connection.
So we do need a way to get to the user information, and cross-reference that to the workstations, so we can get to the users affected by the migration.
Results to go please..
I also definitely wanted to make sure that the teams doing the migrations had access to the results from wherever they were, so I needed to send the data over to MS-Teams; that way, anyone carrying a smartphone could get to them without needing to use a laptop at the location of the network racks.
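A minimal sketch of that push, assuming the target channel has an incoming-webhook connector configured (the URL is a placeholder):

```python
# Push a results summary to an MS-Teams channel through an incoming webhook.
# The webhook URL is a placeholder; create one on the target channel first.
import requests

TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/..."  # placeholder


def post_to_teams(summary: str) -> None:
    response = requests.post(TEAMS_WEBHOOK_URL, json={"text": summary}, timeout=10)
    response.raise_for_status()


post_to_teams("Migration check, site X: 118/120 hosts back, 2 lost (details in file)")
```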
A way to cause the check to be re-run would be even better.. but I am not there yet (chat-ops, one day Jacob, one day).
Anything left out?
Oh yes, there were a few more things I planned for but never actually got to complete (I did reach an MVP stage with each one, though).
One was getting the dhcp reservations for the hosts involved in the migration, filtering them, and getting back a result with their new ip addresses after the migration, so they could be reserved again in the new ip subnets. I found a way to get the dhcp data from the dhcp server using powershell and wrote some python code to read the data and get it into usable form (a list of dictionaries, ofc); a sketch of that reading step follows. I didn’t go through with integrating that into the rest of the code, so these actions were completed in stages with individual scripts (they are not in the repo). It would have been great if I could call the powershell commands from within python, but that wasn’t possible (resist the urge, it wasn’t, really, I just won’t write another paragraph explaining why, blame Microsoft).
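Here is a minimal sketch of that reading step, assuming the reservations were first exported to CSV on the DHCP server (for example with Get-DhcpServerv4Reservation piped to Export-Csv); the column names are assumptions about the export, not a documented schema:

```python
# Read dhcp reservations exported to CSV and derive their post-migration
# addresses. The column names ("IPAddress", "NewIPAddress") are assumptions
# about the export, not a documented schema.
import csv


def load_reservations(path):
    """Return the exported reservations as a list of dictionaries."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


def remap_reservations(reservations, old_prefix="10.1.", new_prefix="10.2."):
    """Illustrative prefix swap towards the new ip subnets."""
    for reservation in reservations:
        if reservation["IPAddress"].startswith(old_prefix):
            reservation["NewIPAddress"] = reservation["IPAddress"].replace(
                old_prefix, new_prefix, 1
            )
    return reservations
```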
I also didn’t bother to modify the code to gather data for the wireless access points that also got migrated to the new network, those being a specific use case; before the migration they were on separate switches, and recovering a connection or rebooting an access point is a different story. Finally, they were migrated at a much smaller scale than user workstations and other types of end hosts. I guess I could have done something separate for that using some of the same code (DRY always), but I just didn’t do it.
Finally, I wanted to check whether the workstations involved in the migration were included in firewall rules by their ip address, as that would change, so the objects would need to be changed too. I went halfway through with this but didn’t have enough time to complete it. Again, these actions remained manual, as did the rest of the work that could not be done by code, unless we trained our own AI, for example to inspect host data and determine special cases (e.g. access controllers), connections that would need to be migrated at a later time (e.g. media encoders), connections that needed to be assigned to specific switch ports in the new equipment (e.g. industrial/medical equipment), etc.
All those remained in the care of my workmate, who put in most of the networking technical work, while I handled the PM role and that of caretaker of the little robots..
How do we do all that?
Let me get my pickax
Well how would an engineer do it manually?
- show mac address-table at each access switch to get mac address, port, vlan, etc, save data to files
- show ip arp at each distribution switch to get ip address/mac address combinations, save data to files
- cross-reference those files to produce enriched data (good luck there)
- nslookup or similar commands to get a hostname for each ip address, save data to files
- cross-reference those files to produce even more enriched data (again, good luck)
- access the management software for the checkpoint firewall, get the AD user id for each workstation.
- find the user’s data in the company directory (section/department, phone, office, etc)
But that’s only the first phase, the data collection before the migration.
What about after the migration?
Depending on which kind of migration, the engineer would:
- either need to do it all again, if we were just moving from old racks to new ones while preserving the legacy network devices, or
- query Cisco ISE for data using the mac address and find out whether each migrated host’s mac address is present on the SDA network, whether it’s listed in the authentication sessions, whether authentication took place with 802.1x or MAB, whether the PC is idle with no user connected, and whether the host is in a situation where the port needs flapping.
As an alternative for the second case, DNA Center (now Catalyst Center) could be used to find and collect mac address data along with the rest, like ip address, hostname and even the MS-AD user id (provided DNAC and ISE are connected through PxGrid).
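For the ISE case, a minimal sketch of such a query; the monitoring (MnT) endpoint path and the XML tag names below are assumptions to verify against your ISE version:

```python
# Query the ISE monitoring (MnT) API for the active session of one mac
# address. Endpoint path and XML tag names are assumptions to be verified
# against your ISE version and its documentation.
import xml.etree.ElementTree as ET

import requests


def ise_session(ise_host, mac, auth):
    url = f"https://{ise_host}/admin/API/mnt/Session/MACAddress/{mac}"
    response = requests.get(url, auth=auth, verify=False, timeout=15)
    response.raise_for_status()
    root = ET.fromstring(response.text)

    def field(tag):
        return root.findtext(tag) or ""

    return {
        "mac": mac,
        "ip": field("framed_ip_address"),
        "auth_method": field("auth_method"),  # e.g. 802.1x vs MAB
        "user": field("user_name"),
    }
```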
Trust issues
Do you trust the code?
The code was written in waves of effort, so there was a lot of intermittent testing involved, in order to apply concepts to practice and verify that they work. Every piece of code that proved to work was integrated into a group of functions, depending on the group of operations it belonged to. So there was one group for mac and arp collection, a different group for ISE data collection, a different one for Checkpoint FW Management data collection, etc.
Beyond that, I did write some tests to verify the code works, so that later on I would be able to integrate them into some form of continuous integration/continuous deployment (CI/CD) and make sure that further development and modification would not break the code. However, I didn’t get to do this for everything I wanted; there’s still a whole body of work to do in that area. That is easy to see in the code itself.
Don’t avoid the question
If I do have to answer with a single word, then YES. I do trust it. It does exactly what an engineer would do, only incredibly faster, and it performs the tasks without getting tired and without errors, provided it’s tested against diverse enough data, which I did test it against, to the best of my knowledge (I had to correct/complete it enough times to compensate).
But we did follow the four-eyes principle for each migration anyway, just to be sure, so migration checks were also performed manually, in parallel, by my workmate. That may sound strange, but it was a conscious choice.
You made an engineer do all those in parallel??? (arrest that man!)
No, of course not. That’s not practical or even possible. And no one would go for it either. It was only possible to do summary checks, for example counting mac addresses before and after the migration.
Also, the only other historical data available came from Cisco Prime Infrastructure, which does gather end host connection data; but as that is done periodically and over a time window of a few days, you can’t rely on it for what exactly was connected to which port right before the migration, or even on the previous day. It’s useful when no other data source is available, but it’s not accurate enough.
Did everyone else trust the code?
No.
Well, not at first anyway.
Fear of flying
I could tell you a fairy tale where, once upon a time, there was an organization where every network engineer embraced change, was inspired by network automation and programmability, and answered the call to apply it to network operations and deployments without fear or hesitation, or even distrust towards the ones evangelizing and pushing for the change.. but it would be pointless, as it would be just that: a fairy tale. That’s just not how things happen in the real world.
So naturally there was fear, hesitation, distrust, even conflict and challenge. I am not sure whether that is, strangely, part of the final success, whether it motivates me or even keeps me in check at the same time. I feel that it could have been avoided, or even that more and better things could have been accomplished, if the code development had been done by a team working behind it with a common vision. There are also other factors to consider, like pre-existing stress from an overloaded schedule, the fear that your co-worker will not have your back if you venture into new territory, or even just plain old fatigue. The result was that I wrote the code alone, but we made use of it as a team.
Are you sure it was possible for this to be different?
I am not. I can tell you one thing though: counting is not knowing. It’s guessing. You are guessing you got everybody back. You are guessing everything is fine. And you need to know. That’s the only way to fight your fears and make sure it all went well. You need to know it all went well. Fear at best can only shake you up and wake you up. It can’t really help you and it can certainly keep you down or even paralyze you. If you want to stop fearing the beast (code), test it. Print, debug, log. But of course you need time for that. And that is not so simple (you need time to make time, remember that?).
Are you saying doing things by hand is useless?
Not at all. Getting a count never hurts, and manual checking remains fine for a small number of hosts involved in the migration. You get the output of a few commands before and after, and you compare if needed (e.g. if your count is different). But if the number is not that small, then scale is where it hurts:
- Speed – You can’t do the checks fast enough; code will always be faster than you.
- Accuracy – You can’t be sure your eyes are not deceiving you; tested code will not go wrong doing a comparison.
- Fatigue – Performing a few hundred checks can be a heavy load; code doesn’t get tired.
Were you right to trust the code? How did it go?
After performing all but 5 of the 80 location migrations (we are nearly done), I can say that it was a real life saver: it helped avoid problems and complaints that would have surfaced on the next business day, and it made repeating the checks again and again so much easier, allowing us to adapt to situations we would not have been able to handle without it.
Automation teams – Reaching Critical Mass
I have to say that I totally respect engineers who have reservations about automation, who prefer to trust what they know has worked for them over the years, even if that seems to be under-performing in some cases (in this case, it was certain not to be enough). You can’t just throw someone out of a plane with a parachute strapped to his/her back and shout “Believe!” as he/she falls from the plane..
At the same time, to really reach a point where complete products/pieces of work are made, you need more than one person in the team continuously providing unwavering support to the common vision and doing the actual work for it. I have read/heard/seen amazing success stories with just 2 or 3 people onboard. But you need at least that many to reach critical mass, and everyone must be committed. It’s really not that different from any other important goal in life or work.
Philosophy and Ideas
Code overlays
I got the idea of using functions as parameters (everything is an object in Python) while reading about lambda functions, specifically in my daughter’s course notes from her introductory python course at the university, where she is currently studying IT. I hate lambda functions, but that idea was attractive, so I put it to use.
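A minimal sketch of that idea as used here: the same collection driver runs for ssh and telnet devices, and the caller passes the filter function as a parameter (names are illustrative, and get_mac_table is a hypothetical Nornir task):

```python
# The same driver runs for ssh and telnet inventories; the caller passes the
# filter function as a parameter. get_mac_table is a hypothetical Nornir task.
from nornir import InitNornir


def get_mac_table(task):
    # Placeholder task: the real code would run "show mac address-table"
    # over the device connection and parse the output here.
    ...


def ssh_hosts(host):
    return host.data.get("transport") == "ssh"


def telnet_hosts(host):
    return host.data.get("transport") == "telnet"


def collect_macs(nr, transport_filter):
    """One code path, whichever set of devices it runs against."""
    return nr.filter(filter_func=transport_filter).run(task=get_mac_table)


nr = InitNornir(config_file="config.yaml")  # placeholder inventory/config
ssh_results = collect_macs(nr, ssh_hosts)
telnet_results = collect_macs(nr, telnet_hosts)
```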
Function separation
I also tried to keep code that did a specific task a certain way in one place, and not mix it with code doing something else, as much as possible. The idea was to maximize code re-use across more projects, so functions needed to be granular enough to allow that.
The accidental company data model
As much as you may try to create reusable code without thinking too much about the data, you will probably end up making assumptions/decisions about what the data should look like, what the structure should be, what type of info should be included, how the ‘keys’ should be named, etc.
That unavoidably leads to the creation of a data model (or a group of data models) that is specific to you or your organization, for this project at least. Perhaps you didn’t mean for that to happen, but let’s be honest: what seems logical to you might seem wrong or incomplete to someone else, so there is really no single obvious universal way to plan for data. If you had thought about that before starting the project, maybe some things would be different, even just for your case. But who thinks like that, right?
As it turns out, a lot of people. Maybe not aspiring mid level network automation/programmability engineers like me, but people who have been doing this their whole professional life, being a programmer and a network engineer at the same time, since the start. People like Ivan Pepelnjak.
The Heavy Networking Podcast, obviously
I recently listened to a great podcast with Ivan Pepelnjak, Dinesh Dutt, Claudia de Luna and David Sinn:
Packet Pushers, HN717: Network Source(s) Of Truth – A Roundtable Discussion, with Ethan Banks and Andrew Conry-Murray as the hosts.
What a great listen! What a great round of experts! Listen to them for yourselves. If you don’t know who they are yet, you should. I don’t know everyone, ofc. Follow Ivan for the blunt truth, Claudia for the inspiration and Dinesh for effectiveness & simplicity. Btw, Ethan is a master of a host. Great discussion.
At some point in that discussion, Ivan talks about the data model, and my immediate reaction was “do you really need to design the data model before you even start?”. But I think the answer is that, sooner or later, you will have to. You will do it whether you realize it or not. Or you will do a number of iterations on it, because things will feel wrong until they don’t, so you will be modifying the data model again and again. You could design it from the start, intentionally. It might save some time and effort, or even regret, later on. Or not. Just wing it. It’s going to be fine, no worries.. right? RIGHT? (“Come on, it’ll go just fine..”)
Data form
Let’s look at some examples of how the data looks in our case. You will probably need to zoom in to “take a look”, but you can also browse this data at the github repo where the code is published, or the other one (read on).
Network Data
When the code executes, a list of dictionaries is returned; when stored to a file, it looks like this:
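For readability, here is one illustrative record with fake values; the exact field set is in the repo’s sample data:

```python
# One illustrative record (fake values); the exact field set is in the
# repo's sample data.
{
    "mac": "aa:bb:cc:dd:ee:01",
    "vlan": "120",
    "interface": "GigabitEthernet1/0/14",
    "switch_ip": "10.10.1.2",
    "switch_hostname": "acc-sw-01",
    "ip": "10.20.120.33",
    "hostname": "WS-0421.example.local",
}
```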
User directory
This is an example of the form of the user directory when stored as a csv:
It’s this file that we use to cross-reference the data we get from the checkpoint management api about the workstations involved in the migration.
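A minimal sketch of that cross-reference; the field names on both sides are assumptions:

```python
# Join Checkpoint identity data (hostname -> AD user id) with the user
# directory csv (user id -> section/department, phone, office). All field
# names here are assumptions.
import csv


def load_directory(path):
    with open(path, newline="") as fh:
        return {row["user_id"]: row for row in csv.DictReader(fh)}


def match_users(checkpoint_records, directory):
    matched = []
    for record in checkpoint_records:
        user = directory.get(record.get("user_id", ""), {})
        matched.append({
            "hostname": record.get("hostname", ""),
            "user_id": record.get("user_id", ""),
            "department": user.get("department", ""),
            "phone": user.get("phone", ""),
        })
    return matched
```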
User and section/department report
Here is what a list of matched users looks like (sorry, no colors, the code snapshot tool would not grab this one):
and here is the list of matched sections/departments to which the users belong, used when we needed to alert the heads of those sections/departments before the migration.
Is that all the data?
No, not all of it. What you can’t see here is an example of what you get back from ISE about the active authentication sessions, from which we draw our mac address/ip address/hostname data after the migration, or an example of the data we get from the Checkpoint Management API to cross-reference hostnames with MS-AD user IDs. The code reveals that the ISE API responds in XML, and that the Checkpoint Management API SDK gets back a list of logs with a complicated multi-level structure that you need to explore (hint: not all fields are always there.. some are missing in some cases, so you need to anticipate and compensate).
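The ‘anticipate and compensate’ part boils down to defensive access. A minimal sketch over one log record, with purely illustrative nested keys:

```python
# Walk one Checkpoint log record defensively: any level may be missing,
# so default at every step. The nested keys shown are purely illustrative.
def extract_identity(log: dict) -> dict:
    src = log.get("src") or {}
    return {
        "machine": (src.get("machine") or {}).get("name", ""),
        "user": (src.get("user") or {}).get("name", ""),
        "ip": src.get("ipv4-address", ""),
    }
```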
You can also use Postman or another REST client to explore the ISE API (Nicolas Russo has a nice ISE Postman Collection you can download and try for yourself; I have published one of mine for the Checkpoint Management REST API).
Wait.. was that your real data?
Hehe, ofc not. Let me say that again, just to be sure it’s clear. As I already said in part 1, this is fake data, created by ChatGPT-4 using faker. I presented the form I needed for each case, along with the rules that determine the values (in some fields), and it went ahead and created both the scripts that generate the data and a set of data for each case. I didn’t do that with the ISE or Checkpoint responses (although I could have, but at some point it becomes too much trouble; if you can’t apply the code to a real use case, either contact me to help you understand the structure or use your imagination).
Not done yet..
We have explained the process to some extent. Now we stop again for a small break. Once you are ready, move on to part 3 for the completion of this series.