Challenges in Network Troubleshooting
In large-scale networks, when an issue such as latency or packet drops occurs, it is often very hard to pinpoint where the issue lies.
Multi-vendor network environments: different platforms, different types of issues.
How do we reduce the MTTR (Mean Time To Resolve)?
How do we correlate data from different monitoring systems (syslog servers, graphs, etc.) to isolate the issue?

What we need
A network scanning utility that can scan any part of the network and highlight potential issues in the least possible time.
A utility that can scrape APIs from different internal monitoring systems and show the data in a single place.

Solution: LiTracert (Li-Traceroute)
A troubleshooting utility based on routing table parsing. It gathers the uplink information from the routing tables for any source-to-destination pair and creates a “routing-tree” data structure. The routing tree holds all the uplinks and outgoing interfaces from each node towards the destination IP.
Platform-based network scanning: for example, if you want to scan all Juniper devices in the network, LiTracert performs the scan in a few minutes.

Use Case 1: Scan a data center network from any given source to any destination

Step 1: Creation of “routing-tree”
1) The script finds the network device on top of the source host and performs a route lookup for the destination host. From the route lookup data, it creates a dict of the uplinks.
2) The routing output is parsed by a data-parsing function that can handle output from different vendors. It creates a dictionary of the uplinks at each hop.
[Diagram labels: Source, uplink_device_1, uplink_device_2, uplink_device_3, uplink_device_4]

Step 2: Multiprocessing using Pool
[Diagram: DC topology from Source to Destination through uplink_device_1 to uplink_device_4 and onward uplink_device_n nodes]
3) All uplink IPs are processed in parallel using the Pool function, which makes the scan fast.
4) Route parsing stops when the destination IP is seen as a connected route.
5) The routing tree is passed to the main LiTracert code, which checks the tree data points for issues.
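The hop-by-hop walk in steps 1 to 5 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the LiTracert code: the device names, the `FAKE_ROUTES` table and the `route_lookup` helper are hypothetical stand-ins for logging in to a device and parsing its route output. It also uses `multiprocessing.pool.ThreadPool`, which exposes the same `map` interface as the `Pool` named on the slide, so that the sketch runs in any environment.

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for parsed "show ip route <destination ip>" output.
# A real tool would log in to each device and parse vendor-specific text.
FAKE_ROUTES = {
    "tor1":   {"connected": False, "uplinks": {"spine1": "et-0/0/1", "spine2": "et-0/0/2"}},
    "spine1": {"connected": False, "uplinks": {"tor9": "et-0/0/9"}},
    "spine2": {"connected": False, "uplinks": {"tor9": "et-0/0/9"}},
    "tor9":   {"connected": True,  "uplinks": {}},
}


def route_lookup(device):
    """Return (device, parsed route data) for one device."""
    return device, FAKE_ROUTES[device]


def build_routing_tree(source_device, workers=4):
    """Walk from the source towards the destination, one hop level at a time."""
    tree = {}
    frontier = [source_device]
    with ThreadPool(workers) as pool:
        while frontier:
            next_frontier = []
            # Look up every device at this hop level in parallel.
            for device, route in pool.map(route_lookup, frontier):
                tree[device] = {
                    "outgoing_interface": list(route["uplinks"].values()),
                    "uplink_devices": list(route["uplinks"].keys()),
                }
                # Stop descending once the destination is a connected route.
                if not route["connected"]:
                    next_frontier.extend(route["uplinks"])
            # Skip devices already in the tree and de-duplicate the next level.
            frontier = sorted(set(next_frontier) - set(tree))
    return tree


tree = build_routing_tree("tor1")
```

Processing one hop level per `map` call keeps the stop condition simple: a device whose route to the destination is connected contributes no further uplinks, so the walk ends when the frontier is empty.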

Workflow
1) Start at the source network node (SRC).
2) Run "show ip route <destination ip>".
3) Gather the uplink information.
4) Parse the routing data from the uplinks.
5) Create the routing tree object.
6) Is the destination route a connected route?
   No: continue with the uplink devices (uplink device, uplink interface and uplink platform for each).
   Yes: route parsing is complete. Create the routing tree (DST reached).

Routing Tree Object
Source node:
device_dict = {
    "device": {
        "outgoing_interface": [interface_1, interface_2, interface_3],
        "uplink_devices": [uplink_device_1, uplink_device_2, uplink_device_3],
        "bgp_peers": [peer_1, peer_2, peer_3, peer_4]
    }
}

Separate files for per-platform checks
LiTracert takes the routing tree as input and parses it. It determines the platform type of each routing-tree node and executes the respective functions for that platform. For example, if the node type is Juniper, it performs all the Juniper-related checks. It parses the tree to find the outgoing interfaces of each routing-tree node and detects any drops or errors on them using SNMP. After parsing the entire routing tree, it presents a consolidated report to the user.
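One way to picture the per-platform dispatch is a lookup table from platform name to check function. The sketch below is an assumption-laden illustration, not the tool's code: the `platform` key on each node, the function names and the report strings are all hypothetical, and the check bodies are placeholders where a real implementation would poll SNMP counters.

```python
def juniper_checks(device, node):
    # Placeholder: a real check would poll SNMP drop/error counters per interface.
    return [f"{device}: juniper check on {iface}" for iface in node["outgoing_interface"]]


def cisco_checks(device, node):
    # Placeholder for Cisco-specific checks.
    return [f"{device}: cisco check on {iface}" for iface in node["outgoing_interface"]]


# Hypothetical mapping from platform type to the checks for that platform.
PLATFORM_CHECKS = {"juniper": juniper_checks, "cisco": cisco_checks}


def scan_tree(routing_tree):
    """Run the matching platform checks on every node and collect one report."""
    report = []
    for device, node in routing_tree.items():
        checks = PLATFORM_CHECKS.get(node["platform"])
        if checks is None:
            report.append(f"{device}: no checks defined for platform {node['platform']!r}")
            continue
        report.extend(checks(device, node))
    return report


example_tree = {
    "tor1": {"platform": "juniper", "outgoing_interface": ["et-0/0/1"]},
    "spine1": {"platform": "cisco", "outgoing_interface": ["Eth1/1", "Eth1/2"]},
    "fw1": {"platform": "other", "outgoing_interface": []},
}
report = scan_tree(example_tree)
```

Keeping each platform's functions in its own module means supporting a new vendor only requires a new module and one more entry in the table.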

Sample Output: The output below shows a scan run on all our Juniper routers for all possible issues. The data sources include syslog logs, API scraping of monitoring tools, and outputs from the devices. And it all happens within a few minutes!

Use Case 2 - Scanning the Network for any given type of platform.
E.g. I want to scan all Cisco Nexus devices in my network.
1) The user inputs a query.
2) REST API calls to our inventory system fetch the list of devices.
3) Using Pool, all the devices are processed in parallel, and a report is generated and shown to the user.

Challenges in Network Maintenance
Problem Statement: A growing infrastructure is always changing, and maintenances are an inevitable part of our daily work. How do I make sure that my maintenance doesn't cause an outage? Sometimes we overlook things during a maintenance, which can cause unforeseen outages.
Solution: We need a tool that shows the state changes of the gear that was under maintenance, giving us a clear picture of the network state change brought by a maintenance.

NetSMART (Network Simplified Maintenance and Reporting Tool)
An in-house tool to track the changes brought by any maintenance.

Simple workflow of NetSMART
1) The user creates a list of network devices under maintenance.
2) The user passes that filename to NetSMART.
3) NetSMART executes prechecks on the input devices and stores the results in JSON format.
4) The engineer performs the maintenance.
5) NetSMART executes the postchecks and compares them with the precheck file.
6) A report is generated with the diffs and presented to the engineer.

List of commands for Juniper platform

cisco_functions.py, juniper_functions.py
Separate modules for each platform, each containing functions specific to that platform.

netsmart.py
Pool is used to process multiple devices at the same time. JSON compare compares two JSON files and returns the diff.
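The JSON compare step can be illustrated with a small recursive diff. This is a minimal sketch rather than NetSMART's implementation; the `json_diff` function, the `MISSING` marker and the precheck/postcheck field names are all assumptions made up for the example.

```python
import json

# Marker used when a key exists on only one side of the comparison.
MISSING = "<missing>"


def json_diff(pre, post, path=""):
    """Return {dotted.path: {"pre": ..., "post": ...}} for every changed value."""
    if isinstance(pre, dict) and isinstance(post, dict):
        diffs = {}
        for key in sorted(set(pre) | set(post)):
            child = f"{path}.{key}" if path else key
            diffs.update(json_diff(pre.get(key, MISSING), post.get(key, MISSING), child))
        return diffs
    if pre != post:
        return {path: {"pre": pre, "post": post}}
    return {}


# Hypothetical precheck and postcheck snapshots for one device.
precheck = json.loads('{"r1": {"bgp_peers_up": 4, "interfaces": {"et-0/0/1": "up"}}}')
postcheck = json.loads('{"r1": {"bgp_peers_up": 3, "interfaces": {"et-0/0/1": "up"}}}')

diff = json_diff(precheck, postcheck)
print(json.dumps(diff, indent=2))
```

Here only `r1.bgp_peers_up` is reported, because the interface state is identical before and after. Reporting changes by dotted path keeps the final report readable even when the check output is deeply nested.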

Sample NetSMART output
<<Snipped>>

At the end of the maintenance, a detailed report is generated and sent out. Fields marked in red show the changes caused by the maintenance. This helps engineers see any unexpected changes and take proactive steps.

Device Decom Tool
Problem Statement: How do we decommission network devices cleanly and securely?
Solution: We need a tool that first validates that a device is good to be decommissioned, then performs the decom steps on the device, and finally checks whether the device has been cleanly decommissioned. At times we need to decom part of a data center, or a whole data center, in a very short span of time, so the tool must be able to decom multiple devices simultaneously. This is achieved using Pool multiprocessing in Python.
Use Case: We decommissioned 2 full-scale DCs with nodes in just 4 days!

Workflow
1) The user generates a list of devices to be decommissioned.
2) Check that the devices don't have any server ports up other than the uplinks.
3) Create a JIRA Change Record (CR) with the device info.
4) A manager approves the JIRA Change Request.
5) The engineer executes the decom script, giving the CR number as input.
6) The script checks whether the CR is approved. If approved, it performs the decom steps relevant to the platform.
7) Finally, the script does a ping check on the devices to make sure none of them are still up on the network.
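The approval gate and the final reachability check from the workflow above might look something like the sketch below. Everything here is hypothetical: the `run_decom` function, the callables passed in (standing in for the JIRA lookup, the platform decom steps and the ping check) and the CR numbers are invented for illustration. `ThreadPool` is used in place of the `Pool` mentioned on the slide because it has the same `map` interface and runs anywhere.

```python
from multiprocessing.pool import ThreadPool


def run_decom(devices, cr_number, is_cr_approved, decom_device, is_reachable, workers=4):
    """Decom devices only if the CR is approved; return devices that are still up."""
    if not is_cr_approved(cr_number):
        raise PermissionError(f"{cr_number} is not approved; refusing to decom")
    # Run the platform-specific decom steps on all devices in parallel.
    with ThreadPool(workers) as pool:
        pool.map(decom_device, devices)
    # Post-check: a device that still answers was not cleanly decommissioned.
    return [device for device in devices if is_reachable(device)]


# Stand-ins for the JIRA lookup, the decom steps and the ping check.
approved_crs = {"CR-1"}
decommissioned = set()

still_up = run_decom(
    ["sw1", "sw2"],
    "CR-1",
    is_cr_approved=lambda cr: cr in approved_crs,
    decom_device=decommissioned.add,
    is_reachable=lambda device: device not in decommissioned,
)
```

Passing the three checks in as callables keeps the gate logic independent of how approval, decom and ping are actually implemented, which also makes it easy to dry-run.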

Open-sourcing Plans
Work is in progress to open-source these tools after abstraction. We will be rolling out the open-source code in August.

Questions/Feedback?
