Telco Cloud Troubleshooting Overview (NCIR)

Narration: Hi and welcome to the Telco Cloud Troubleshooting Overview for NCIR course! My name is Julie. I am an Operation and Maintenance engineer and I'll be your tutor for this course. The course is self-paced, so the time needed to complete it differs for each student. We estimate that the average student needs about 30 minutes.
Audio available (put your headset on). Also available as PDF.

Legal notice Intellectual Property Rights
All copyrights and intellectual property rights for Nokia training documentation, product documentation and slide presentation material, all of which are forthwith known as Nokia training material, are the exclusive property of Nokia. Nokia owns the rights to copying, modification, translation, adaptation or derivatives including any improvements or developments. Nokia has the sole right to copy, distribute, amend, modify, develop, license, sublicense, sell, transfer and assign the Nokia training material. Individuals can use Nokia training material for their own personal self-development only; those same individuals cannot subsequently pass on that same Intellectual Property to others without the prior written agreement of Nokia. The Nokia training material cannot be used outside of an agreed Nokia training session for development of groups without the prior written agreement of Nokia. This slide does not contain narration.

How to navigate
Take a moment and click on the markers to get familiar with the navigation items.
Narration: Let's take a moment to look at the navigation items of this eLearning tool. Click on the markers for specific information. From this point forward, you are in control. Click [next] to advance.
(Developer's Note: Following text will appear as pop-ups in final WBT output.)
Course Menu: A menu on demand. Use this cloud icon to display and select the chapters of the course.
Volume Control: (no text needed, only the title)
Slide Tool Bar: Use this bar to play or pause the current screen.
Page Navigation: Use these buttons to view the previous or next screen.
Top Navigation Bar: This navigation bar gives access to resource files, a glossary of terms used in this course, and the narration. You can use these links anytime during the course.
Information Marker (i): The information marker provides additional information.
Zoom in/out (pin icon, available in the WBT authoring tool): To zoom the eLearning frame in or out, hold the "Ctrl" key and press "+" or "-". Alternatively, hold the "Ctrl" key and roll the mouse wheel forward or backward.

How to navigate (course menu open)
Course Intro
1. Problems
2. Tools
3. Process
4. Examples
(Developer's Note: Example of the menu when it's open. This slide will not be included in the final outcome. Add as layer in final output.)

Course outline: Kinds of Problems, Tools, Basic Process, Examples
Narration: During this course I’ll show you the fundamentals of troubleshooting the Telco Cloud for NCIR. We’ll first look at the types of problems we may face. Then we’ll check the tools and basic process we can use. Finally, we’ll see some examples of common malfunctions along with the best way to handle them.

Course objectives
After completing this module you will be able to:
- Describe each tool used to troubleshoot the Telco Cloud for NCIR, specifying how to access it and when to use it.
- Given a description of a problem in the Telco Cloud for NCIR, use the simple troubleshooting process to select appropriate tools and explain how to perform a root cause analysis.
Narration: At the end of this training you will be able to describe each tool that is used to troubleshoot the Telco Cloud for NCIR, specifying how to access it and when to use it. Then, given a description of a common issue, you will be able to use a simple troubleshooting process to select appropriate tools and explain how to perform a root cause analysis.

Module 1: Types of problems
After completing this module you will be able to: Identify the types of problems that are most likely to occur.
Narration: So, let's start by understanding the kinds of problems we may face.

Kinds of problems
[Diagram labels: Guest Software, Host Software, Hardware, Connectivity; VMs, AVS, Hypervisor, Host OS, Compute Node, Controller Node, NDCS, Server, Switch]
Narration: There are three kinds of problems that cause malfunctions. They can be related to a piece of equipment, to software, or to connectivity. These problems can be independent of each other or related. As a best practice, before taking any troubleshooting steps, review all active alarms and other information, such as logs, error messages, and node states, to determine whether there is a relationship between them. Let us take a closer look.

Hardware
[Diagram labels: Hardware, NDCS, Server, Switch; VMs, AVS, Hypervisor, Host OS, Compute Node, Controller Node]
Hardware consists of the racks, Power Distribution Units, servers, switches, transceivers, and cabling.

Software
[Diagram labels: Guest Software, Host Software; VMs, AVS, Hypervisor, Host OS, Compute Node, Controller Node; Provider Network, OAM Network, Data interfaces, Internal mgmt ntwk, OAM ntwk; NDCS, Server, Switch]
Software consists of host and guest software. The host software contains the compute nodes, controller nodes, host operating system, hypervisor, and Accelerated vSwitch. The guest software contains the virtual machines with the Virtual Network Functions.
The compute nodes run the computing services and host the deployed virtual machines. The compute nodes connect to the controller nodes via the internal management network, and to provider networks via data interfaces.
Controller nodes run the services used to manage the cloud infrastructure. Controller nodes manage hosts via the internal management network. They also provide administration interfaces via the OAM network. We will discuss these networks in more detail while looking into connectivity.
The host operating system for all nodes within an NCIR cluster deployment is Wind River Linux.
The hypervisor virtualizes computing resources and applications. It is implemented with Kernel-based Virtual Machine (KVM) along with Quick Emulator (QEMU). KVM is used to set up guest virtual machines and feed the guest simulated Input/Output (I/O). QEMU is a hosted virtual machine monitor that emulates CPUs and provides a set of device models.
The Accelerated vSwitch (AVS) is the scalable, DPDK-based, user-space L2 switch.

Connectivity
[Diagram labels: Tenant Network, Provider Network, OAM Network, Physical ntwk, Internal mgmt ntwk, Infrastructure ntwk, OAM ntwk, Data interfaces; VMs, AVS, Hypervisor, Host OS, Compute Node, Controller Node; NDCS, Server, Switch]
Connectivity consists of the L2 switching facility that houses the internal management network and the infrastructure network. Connectivity also includes the physical network, the provider and tenant networks, and the OAM network.
The internal management network is an isolated L2 network implemented on the internal L2 switch. This network uses a dedicated port-based VLAN. It enables communications between the hosts and controller nodes for software installation and management. It is internal to the NCIR cluster and its operations are transparent from the cloud host perspective.
The infrastructure network is an optional network used to improve overall performance for a variety of operational functions. When available, it is used for control and data synchronization when migrating virtual machines between compute nodes. Basically, the infrastructure network gives NCIR a target for offloading heavy traffic from the internal management network. This prevents sensitive traffic from being starved for bandwidth when there is background traffic. When the infrastructure network is unavailable, all infrastructure traffic is carried over the internal management network.
The physical network is a physical transport resource used to interconnect compute nodes with each other and with external networks such as provider and tenant networks. This network is not configured by the NCIR administrator. It is physically provisioned by the Data Center where the NCIR cluster is deployed.
The provider network is a virtual network that provides the underlying network connectivity needed to instantiate the tenant networks. The NCIR administrator creates a provider network as one of two types: flat or VLAN.
- Flat means that the network is mapped entirely over the physical network. Each physical network can realize one flat provider network. A flat provider network supports a maximum of one tenant network.
- VLAN provider networks are implemented over a range of VLAN identifiers supported by the physical network. Multiple provider networks can be defined for the same physical network, all operating via non-overlapping sets of VLAN IDs.
The tenant network is a virtual network associated with a tenant. A tenant network is instantiated on a compute node and makes use of a provider network. It provides switching facilities to the virtual service instances communicating with external resources and with other virtual service instances running on the same or different compute nodes.
The OAM network is a physical network used to access the configuration and management facilities. The web administration interface and the console interfaces to the controllers are available on this network.
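The non-overlap rule for VLAN provider networks can be checked mechanically. The following is a minimal Python sketch, not an NCIR API: the `VlanProviderNet` record, the `find_overlaps` helper, and all network names and VLAN ranges are hypothetical and only illustrate the rule described above.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class VlanProviderNet:
    """Hypothetical record for a VLAN provider network (not an NCIR object)."""
    name: str
    physical_network: str
    vlan_min: int  # first VLAN ID in the range, inclusive
    vlan_max: int  # last VLAN ID in the range, inclusive


def find_overlaps(nets):
    """Return name pairs whose VLAN ranges overlap on the same physical network."""
    clashes = []
    for a, b in combinations(nets, 2):
        same_phys = a.physical_network == b.physical_network
        # Two inclusive ranges overlap when each one starts before the other ends.
        if same_phys and a.vlan_min <= b.vlan_max and b.vlan_min <= a.vlan_max:
            clashes.append((a.name, b.name))
    return clashes


# Illustrative data only; names and ranges are made up.
nets = [
    VlanProviderNet("providernet-a", "physnet0", 100, 199),
    VlanProviderNet("providernet-b", "physnet0", 200, 299),
    VlanProviderNet("providernet-c", "physnet0", 250, 260),
    VlanProviderNet("providernet-d", "physnet1", 100, 199),
]
print(find_overlaps(nets))
```

Here only providernet-b and providernet-c clash. providernet-d reuses the range of providernet-a, which is acceptable in this sketch because it sits on a different physical network.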

Source of information in case of malfunction
Alarms (!): Primary troubleshooting source
Error messages: General messages explaining why the system did not perform a task
Logs: Information that can be analyzed and shared
Statistical reports: Useful data for monitoring
Node status: Node states
Narration: Now that we have a basic understanding of how the system works, what can we use to troubleshoot a problem? There are several sources of information you will rely on. Used collectively, these sources will help you know when, and where, there is a malfunction. Alarms are the primary source of information in most troubleshooting situations. Logs are useful data that can be analyzed and shared. Take logs from the node sending the alarm and from the node that is the object of the alarm. The more information the better! Node status is the node state, which may also indicate problems. Error messages tell you why the system cannot carry out a task. They can appear in the supplementary information fields of alarms and logs. Statistical reports contain useful data for monitoring and assessing the system, and may indicate forthcoming problems before they affect traffic. Let us take a closer look at each of these sources.

Alarms (!)
[Slide labels: Resource, Maintenance, Storage, Data Networking, Controller HA, Backup and Restore, System Configuration, Software Management, VM Instances]
There are many types of alarms reported by the NCIR. They are:
- Resource alarms, which let you know that there are CPU usage and memory threshold problems with the hosts and filesystems.
- Maintenance alarms, which tell you about failures and service degradation with the hosts, ports, and interfaces.
- The storage alarm, which lets you know there is a problem with the storage for a cluster.
- Data Networking alarms, which notify you that there are problems with the ports, interfaces, agents, and provider networks.
- Controller HA alarms, which inform you that there are problems with the service domains and hosts.
- The backup and restore alarm, which notifies you that a system backup and restore is in progress on the host.
- System Configuration alarms, which tell you that there are problems with the configuration.
- Software Management alarms, which let you know there are problems with the patches on the host.
- Virtual Machine Instance alarms, which tell you that there are non-recoverable failures, rebooting problems, shutoffs, and migrations with the VM instances.

Logs
Transient events. No node state changes. Typically, no action required. Examples: instance deletions, failed migrations.
With logs you can obtain information about transient events. These are certain system events that do not result in node state changes, and typically do not require immediate customer action. Instance deletions or failed migration attempts are recorded in Customer Logs. Each log describes a single event. The logs are held in a buffer, with older logs discarded as needed to release logging space. The logs are displayed in a list, along with summary information. For each individual log, you can view detailed information.
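The buffer behavior described above (older logs are discarded to make room for newer ones) can be illustrated with a bounded queue. This is a minimal Python sketch under stated assumptions: the capacity of 3 and the event texts are hypothetical, and the real NCIR buffer size and log format are not given in this course.

```python
from collections import deque


def make_log_buffer(capacity):
    """Create a bounded buffer; appending to a full deque drops its oldest entry."""
    return deque(maxlen=capacity)


log_buffer = make_log_buffer(3)  # hypothetical capacity, for illustration only

events = [
    "instance-1 deleted",
    "migration of instance-2 failed",
    "instance-3 deleted",
    "migration of instance-4 failed",
]
for event in events:
    log_buffer.append(event)

# Only the three most recent events remain; the oldest one was discarded.
print(list(log_buffer))
```

The practical consequence is that logs relevant to an older problem may no longer be in the buffer, so it can help to collect logs soon after a problem is noticed.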

Node status
[Slide table labels]
Node availability: Available, Degraded, Failed, In-test, Online, Offline, Power off
Operational State: Enabled, Disabled
Admin State: Unlocked, Locked
Other labels on the slide: Permanent, Transient, Migrating VMs out, SW install, (auto-discovered)
The operational states of a node are enabled and disabled. The administrative states of a node are locked and unlocked. For example, if a compute node is locked it cannot provide service. Using this information along with the other troubleshooting sources, you can perform a root cause analysis of the problem. Remember to consider all information available before performing any troubleshooting steps.
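The two state dimensions can be modeled as a pair of enumerations. The sketch below is illustrative Python only. The course states that a locked compute node cannot provide service; the additional rule that a disabled node cannot provide service either is an assumption made here for the example, and the function name is hypothetical.

```python
from enum import Enum


class AdminState(Enum):
    LOCKED = "locked"
    UNLOCKED = "unlocked"


class OperState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


def can_provide_service(admin, oper):
    """Assumption: a node provides service only when it is unlocked AND enabled."""
    return admin is AdminState.UNLOCKED and oper is OperState.ENABLED


print(can_provide_service(AdminState.LOCKED, OperState.ENABLED))    # locked node
print(can_provide_service(AdminState.UNLOCKED, OperState.ENABLED))  # normal case
```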

Error messages
What is occurring? Where is it occurring? Why is it occurring? Is it still occurring?
Error messages are supplementary information that may explain why a task could not be properly performed. In addition to the alarms and logs, they may indicate which piece of equipment or software is having problems. An example of when error messages may appear is during a software configuration change. To ensure that the change completes without problems, you can monitor the error messages. They should all clear once the operation is successfully completed.

Statistical reports
CPU and memory utilization. Network traffic counters. Storage space.
Performance measurements are periodically collected by Ceilometer from different resources such as hosts, virtual machine instances, and the Accelerated Virtual Switch. The measurements include CPU and memory utilization, network traffic counters, and storage space. Ceilometer is available via the WebGUI and the Command Line Interface (CLI). Reports can be generated with specific information to aid in performing a root cause analysis or in predicting whether a problem is about to arise.

Knowledge Check
Narration: It's time to check your understanding of the kinds of problems occurring in the Telco Cloud system.

What are the three types of problems you may encounter?
- Hardware, software, and connectivity (correct)
- NE, interfaces, and external system
- System, NE, and VM
- Hardware, firmware, and software
(Developer's Notes: No narration on this slide. Correct answer in bold.)

What is the primary source of troubleshooting information?
- Alarms (correct)
- Logs
- Statistical reports
- Error messages
(Developer's Notes: No narration on this slide. Correct answer in bold.)

Module 2: Troubleshooting Tools
After completing this module you will be able to: Describe the tools that can be used to troubleshoot the Telco Cloud for NCIR Narration: In the troubleshooting process, you are going to use specific tools to investigate the causes of malfunctions. After this module, you will be able to describe such tools, specifying how to access them, and when they should be used.

Troubleshooting Tools
Tool                            Use
NASM                            Hardware
BMC Web Interface               Hardware
Dashboard Controls              Hardware, software, and connectivity
Command Line Interface (CLI)    Hardware, software, and connectivity
Narration: Here are the main troubleshooting tools. Through our exploration of these you will learn how to access them and when to use them. Each plays a unique role in the troubleshooting process. Let us take a closer look.

Nokia AirFrame System Manager (NASM)
So, what is the NASM? Why use it? It is the system management tool used to monitor and manage the entire Data Center hardware and software. You can check the health of the Data Center using the Dashboard, and review event logs in Event Logs.
By now, you are probably saying this is all well and good, but how do I access it? Here is how: First, in a web browser, preferably Firefox or Chrome, enter the address of the NASM server ({server IP address}:8443/airframe). Then log in as administrator using the default user ID and password, which are both admin. Remember to change the password after your first login.
So, when should you use the NASM? Say a host is having some problems. You can use this tool to verify the health of the hardware and review its event logs. That can help you determine whether a problem is hardware-based or whether the root cause lies elsewhere.

BMC WebGUI
(Developer's Note: NEED SCREEN CAPTURE)
The Baseboard Management Controller (BMC) WebGUI is an external, independent hardware management tool used to manage the server hardware. Since it is external, it is available even when the host hardware or operating system is hung or unable to function.
Only users with OEM Proprietary, Administrator, or Operator privileges are authorized to log in to the BMC web interface. Log in through the BMC network port. Open a web browser and enter the {BMC IP Address}. Both username and password are admin.
When do you use it? If a server is having some problems, such as a blinking fault LED, you can check the Event Logs in Server Health on the WebGUI. Events on the WebGUI can help to diagnose more problems related to the server hardware and firmware.

Dashboard Controls
(Developer's Note: NEED UPDATED SCREEN CAPTURE)
Dashboard Controls. What is it? It is the NCIR web administration interface that helps you manage all aspects of the system. It plays a key role in the troubleshooting process. Most troubleshooting information is available here, specifically on the Admin tab.
Need performance information? Overview shows basic performance charts including hosts' status, provider network port utilization, and compute node processing, memory, and disk usage. Use the Resource Usage information to obtain a more detailed performance analysis.
Need to know how the hypervisors are doing? The Hypervisors menu contains charts depicting their resource usage on compute nodes.
Need to verify alarms or review customer logs? This information is located under Fault Management.
So, how do you access all this useful information? Generally speaking, the Dashboard is installed on the controller node. In a web browser with JavaScript and cookies enabled, enter the hostname or IP address for the dashboard. On the login page, enter your unique username and password, or the default username and password, which are both admin.
The visible tabs and functions in the dashboard depend on your access permissions, or roles. For example, if you are logged in as an end user, the Project tab and Identity tab are displayed. If you are logged in as an administrator, you see the Admin tab in addition to the Project and Identity tabs.

Command Line Interfaces (CLIs)
[Slide labels: Server debug log (IPMI), Switch log, System alarms, Customer logs; prompts: $ wrsroot, $ {keystone_admin}]
Command Line Interface, or CLI. What is it? For starters, it is a robust selection of commands available for managing every aspect of the system, from hardware, to software, to connectivity. Here we will examine the main CLIs used for collecting or viewing data when there is a possible problem. This includes commands for collecting various logs and system alarms. Commands can be used remotely through SSH or Telnet, or directly through a serial port on the switch.
Great! But how do you access CLIs? Remember that NCIR has two types of user accounts for all administration, operation, and general hosting purposes. These are not related to the cloud user tenant accounts. The first is the wrsroot account, which is a local, per-host account. Its default initial password is wrsroot. The second type is LDAP accounts, which are used to create regular Linux user accounts. These are centrally managed and automatically propagated to all hosts in a cluster.
Execute the OpenStack administrative commands as the Keystone "admin" user. To acquire Keystone administrative privileges, run the /etc/nova/openrc script. The system prompt will indicate the newly acquired Keystone privileges, {keystone_admin}.
Click on each command type to learn more.

Server debug log (IPMI)
IPMI commands (only a few of them):
ipmitool -I lan -H <BMC IPADDR> -U admin -P admin -V
ipmitool -I lan -H <IPADDR> -U admin -P admin mc info
ipmitool -I lan -H <IPADDR> -U admin -P admin sel elist
ipmitool -I lan -H <IPADDR> -U admin -P admin sel elist -v
ipmitool -I lan -H <IPADDR> -U admin -P admin sdr
ipmitool -I lan -H <IPADDR> -U admin -P admin sensor
ipmitool -I lan -H <IPADDR> -U admin -P admin fru
ipmitool -I lan -H <IPADDR> -U admin -P admin chassis status
Is a server having problems that are not easily fixed? You can use the Intelligent Platform Management Interface (IPMI) commands to collect a server debug log for troubleshooting. Refer to the customer documentation (Troubleshooting AirFrame Data Center Solution Hardware > Troubleshooting a Server) for a complete list of IPMI commands used to collect a server debug log.
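Because the listed commands differ only in their final subcommand, they can be generated from one base invocation. The following Python sketch only builds the command lines and runs nothing. The helper name `build_ipmi_commands` and the address 192.0.2.10 (a documentation-only IP) are hypothetical; the subcommands and the admin/admin defaults are copied from the list above.

```python
import shlex

# Subcommands copied from the list above; "-V" is kept exactly as written there.
DEBUG_LOG_SUBCOMMANDS = [
    ["-V"],
    ["mc", "info"],
    ["sel", "elist"],
    ["sel", "elist", "-v"],
    ["sdr"],
    ["sensor"],
    ["fru"],
    ["chassis", "status"],
]


def build_ipmi_commands(bmc_ip, user="admin", password="admin"):
    """Return one argv list per ipmitool invocation for the given BMC address."""
    base = ["ipmitool", "-I", "lan", "-H", bmc_ip, "-U", user, "-P", password]
    return [base + sub for sub in DEBUG_LOG_SUBCOMMANDS]


for argv in build_ipmi_commands("192.0.2.10"):
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

To actually collect the output, each argv list could be passed to `subprocess.run(argv, capture_output=True, text=True)` on a machine that has ipmitool installed and can reach the BMC network port.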

Switch log
Switch log commands:
show hardware
show logging buffered
show tech-support (only from serial console)
Is a switch giving you problems? Then use these commands to collect a switch log for troubleshooting. The show hardware and show logging buffered commands can be executed either from a serial console or remotely via SSH or Telnet. The show tech-support command is only available on a serial console.

System Alarms CLI commands
Alarm commands: system alarm-list, system alarm-show, system alarm-delete, system alarm-history-list
Narration: Are there alarms? What can I do with alarm information? Often alarms are the first indicator that there may be a problem. Use these commands to interact with the NCIR alarm sub-system. Find out if there are problems so you can begin your root cause analysis. Remember, though, it is best to review all available information before taking any troubleshooting steps.
Consider this scenario: An interface is down, which triggers an alarm for the interface. That same downed interface also prevents the host from communicating properly, which triggers an alarm for the host. If you try to fix only the host alarm, it is likely to keep reappearing. Fix the downed interface and both active alarms are cleared. You need to examine all the available information together, understand how the system works, and perform a root cause analysis before taking any troubleshooting actions. Click on each alarm type to learn more.
{Following information displayed in pop-ups in final WBT}
$ system alarm-list -q 'alarm_id= ;entity_instance_id=service_domain=controller.service_group=directory-services'
Displays currently active alarms. Each alarm object is listed with a unique UUID which you can use to obtain additional information. Specific subsets of alarms, or a particular alarm, can be listed using the query filter. Refer to customer documentation for details on the query filter.
$ system alarm-show {uuid}
Shows additional information about a currently active alarm.
$ system alarm-delete {uuid}
Manually deletes an alarm that remains active for no apparent reason, which may occur in rare conditions. Alarms usually clear automatically when the trigger condition is corrected. Do not manually delete an alarm unless it is absolutely clear that there is no reason for the alarm to be active.
$ system alarm-history-list {-l limit}
By default, shows the 30 most recent change events, most recent first. Use the limit option to specify the size of the list. Specific subsets of alarms, or a particular alarm, can be listed using the query filter. Refer to customer documentation for details on the query filter.
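The scenario above, where one downed interface raises both an interface alarm and a host alarm, suggests grouping active alarms by host before acting on any single one. The Python sketch below does this for pipe-delimited rows. It assumes the five-column layout (UUID, alarm ID, reason text, entity instance ID, severity) seen in the alarm row shown in Example 1 of this course; the real `system alarm-list` output may differ. The two sample rows, their UUID placeholders, and the `interface=mgmt0` entity are hypothetical.

```python
from collections import defaultdict


def parse_alarm_row(row):
    """Split one pipe-delimited row into a dict (the column order is an assumption)."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    uuid, alarm_id, reason, entity, severity = cells
    return {"uuid": uuid, "alarm_id": alarm_id, "reason": reason,
            "entity": entity, "severity": severity}


def host_of(entity):
    """Extract the host from an entity such as 'host=controller-0.filesystem=/scratch'."""
    first = entity.split(".")[0]  # e.g. 'host=controller-0'
    key, _, value = first.partition("=")
    return value if key == "host" else None


def group_by_host(rows):
    """Map each host name to the list of alarm reasons reported against it."""
    groups = defaultdict(list)
    for row in rows:
        alarm = parse_alarm_row(row)
        groups[host_of(alarm["entity"])].append(alarm["reason"])
    return dict(groups)


# Hypothetical rows modeled on the scenario; not real NCIR output.
SAMPLE_ROWS = [
    "uuid-1 | | Host communication failure | host=compute-0 | major",
    "uuid-2 | | Management interface failure | host=compute-0.interface=mgmt0 | major",
]
print(group_by_host(SAMPLE_ROWS))
```

Seeing both reasons listed under the same host is the cue to ask whether one alarm explains the other before fixing either in isolation.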

Customer Log CLI commands
Customer log commands: collect all, collect {node}, system log-list, system log-show
Narration: So, it seems as if there may be a general system problem brewing, or a problem with a specific node. You need information. What can you do? Collect logs for analysis. Use the commands shown to collect and view logs. Click on the customer log commands to learn more about each.
$ collect all
Collects all logs from all nodes. This may be useful when troubleshooting a problem that spans many nodes.
$ collect {node}
Collects all logs for a specific node. This may be useful when troubleshooting a problem on a specific node.
$ system log-list {-q query} {-l limit}
Shows all logs. Use the query filter to find specific logs. Use the limit filter to set the maximum number of logs to return, beginning with the most recent.
$ system log-show {uuid}
Shows details of a specific, individual log.

Knowledge Check
Narration: It's time to check your understanding of the tools used to troubleshoot the Telco Cloud in a Radio environment.

Match the CLI command with the corresponding task:
- system alarm-list: List currently active alarms.
- system alarm-show: Show additional information about a specific currently active alarm.
- system alarm-delete: Manually delete an alarm.
- system alarm-history-list: Query historical alarms.
- system log-list: List current customer logs.
- system log-show: View details on an individual customer log.
(Developer's notes: The slide shows correct answers. The WBT authoring tool will mix the tool/functions. No narration on this slide.)

Module 3: Basic troubleshooting process
After completing this module you will be able to: Describe the basic Telco Cloud Infrastructure troubleshooting process Narration: Now that you are familiar with the tools you will be using, let’s see how you’ll carry out troubleshooting. After this module you will be able to describe the basic Telco Cloud Infrastructure troubleshooting process.

How do you know there is something wrong?
Virtual Machines: Are they working as expected?
  Dashboard Controls > Overview
  Dashboard Controls > Fault Mgmt
  CLI commands
Interfaces: Are they working as expected?
  Dashboard Controls > Inventory > Interfaces
  Dashboard Controls > Fault Mgmt
  CLI commands
Hardware: Is it working as expected?
  Dashboard Controls > Inventory
  On-site at Data Center: physical LED inspection
Narration: How do you know there is something wrong? Easy! Use the information available (alarms, logs, node status, error messages, and statistical reports) to perform a comprehensive analysis.
Check if the Virtual Machines are working as expected. This is done by checking the hosts' status on the Dashboard Controls > Overview page. If you notice a problem with a host, check for alarms against that host using either Dashboard Controls > Fault Management or CLI commands.
Next, verify how the interfaces are performing. This can be done by checking the Dashboard Controls > Inventory > Interfaces page. If you notice a problem, you can then check for alarms against the interface using the previously described methods.
If the problem is still not solved, verify that the hardware is working properly. Again, you can use the Dashboard Controls > Inventory page. Or, if you are on-site at the Data Center, you can do a physical inspection of the LEDs.
Remember the earlier scenario: A downed interface is causing communication problems on the host. Following this process, you would find an alarm for the host indicating that it is experiencing intermittent network infrastructure communication failure. Then, upon examining the interfaces, you notice that there is an alarm for a management interface failure. Using this information, you begin to realize that the downed interface may be causing the host's communication problems.

Module 4: Examples of basic troubleshooting process
After completing this module you will be able to: Explain how to use the web interface and CLI commands to troubleshoot problems.
Narration: Let's now check how to use the web interface and CLI commands to troubleshoot the NCIR system. In this module, three scenarios are introduced. You will answer some questions on what to do. I'll provide feedback as we go. A bar will inform you of your progress on each scenario. After completing this module you will be able to explain how to troubleshoot problems using the web interface and Command Line Interface (CLI).

Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
Over the last two weeks you performed several "collect all" system log requests. Did your actions trigger an alarm?
$ collect all
Select the command used to check alarms.
- system alarm-list
- system alarm-show
- system alarm-view
[Scenario progress bar]
Narration: Over the last two weeks you performed several "collect all" system log requests without cleaning out the /scratch folder. Remember that the "collect all" command collects data from all nodes that can be used for troubleshooting. That is a lot of information! Now, you want to know if any alarms were triggered by your actions. Which command do you use to check alarms? Select the correct answer.
Question feedback:
- system alarm-list -> That's right! This command shows a list of all active alarms on the system.
- system alarm-show -> This command shows the Alarm Details for a specific alarm. Try again!
- system alarm-view -> Sorry, that is not a valid command for NCIR. Try again!

Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
$ system alarm-list
You notice that Alarm 100.104 appears on the active Alarm list.
d221accc-c7cb-43b1-a5ef-b7a2fde495ea | | Filesystem exceeded; threshold: 70%, actual: 79.00%. | host=controller-0.filesystem=/scratch | minor
What should you do next?
- Use command system alarm-show
- Contact Nokia NCIR Technical Support
- Use command system alarm-delete
[Scenario progress bar]
Narration: You notice that the alarm "Filesystem exceeded" is set. What should you do next? Select the correct answer.
Question feedback:
- Use command system alarm-show -> That's right! Using the UUID of the alarm, this command displays the Alarm Details, which contain troubleshooting information.
- Contact Nokia NCIR Technical Support -> That is a possibility but you may want to try fixing the problem on your own first. Try again!
- Use command system alarm-delete -> You may feel like deleting the alarm, but remember that alarms should be manually deleted only when you are absolutely sure the alarm should not be set. Probably not the best solution. Try again!

Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
$ system alarm-show d221accc-c7cb-43b1-a5ef-b7a2fde495ea
You review the details for Alarm 100.104.
d221accc-c7cb-43b1-a5ef-b7a2fde495ea | | ... host=controller-0.filesystem=/scratch | minor ...
Now what should you do?
- Verify the filesystem using command df -h
- Ignore the alarm
- Contact Nokia NCIR Technical Support
[Scenario progress bar]
Narration: Now that you have viewed the alarm details, what should you do? Select the correct answer.
Question feedback:
- Verify the filesystem using the df -h command -> Verifying the filesystem is a smart move!
- Ignore the alarm -> It is never a good idea to ignore an alarm. Try again!
- Contact Nokia NCIR Technical Support -> That is a possibility but you may want to try fixing the problem on your own first by following the troubleshooting instructions. Try again!

Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
$ df -h
You confirm that the filesystem is at 79% for the scratch folder.
Filesystem Size Used Avail Use% Mounted on
...
/dev/mapper/cgts--vg-scratch--lv 7.6G 5.6G 1.6G 79% /scratch
Now what should you do?
- List the scratch folder using the ls command
- Worry about another alarm
- Contact Nokia NCIR Technical Support
[Scenario progress bar]
Narration: You verified that the filesystem is at 79%. Now what should you do? Select the correct answer.
Question feedback:
- List the scratch folder using the ls command -> Good choice! Listing the scratch folder will show if there are any old files that can be removed.
- Worry about another alarm -> Before worrying about another alarm you will want to follow the troubleshooting instructions for this alarm. Try again!
- Contact Nokia NCIR Technical Support -> That is a possibility but you may want to try fixing the problem on your own first by following the troubleshooting instructions. Try again!
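The comparison made in this step (reading the Use% column for /scratch and comparing it with the 70% alarm threshold) can also be scripted. This Python sketch parses captured `df -h` text; the function name is hypothetical, and it assumes each data row has exactly six whitespace-separated fields, which does not hold for mount points containing spaces.

```python
def parse_use_percent(df_output, mount_point):
    """Return the Use% value (as an int) for mount_point, or None if it is not listed."""
    for line in df_output.splitlines():
        fields = line.split()
        # A data row has six fields: the fifth is Use%, the sixth is the mount point.
        # The header row splits into seven fields ("Mounted on") and is skipped.
        if len(fields) == 6 and fields[5] == mount_point:
            return int(fields[4].rstrip("%"))
    return None


# Header and data row as shown in the example above.
SAMPLE = """\
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/cgts--vg-scratch--lv  7.6G  5.6G  1.6G  79% /scratch
"""

THRESHOLD = 70  # percent, taken from the alarm text

usage = parse_use_percent(SAMPLE, "/scratch")
print(usage, usage > THRESHOLD)
```

When the script runs on the node itself, `shutil.disk_usage("/scratch")` from the Python standard library is an alternative that avoids text parsing altogether.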

Clean extra files from /scratch folder using rm command
Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
$ ls -l /scratch
You notice that there are several old files on the filesystem. Ten, to be exact.
-rw-r--r-- 1 root root Jul 13 07:39 ALL_NODES_ tar.gz
...
-rw-r--r-- 1 root root ALL_NODES_ tar.gz.9
Now what should you do?
Clean extra files from /scratch folder using rm command
Worry about another alarm
Contact Nokia NCIR Technical Support
Narration: There are several old files listed. What should you do next? Select the correct answer.
Question feedback:
Clean extra files from /scratch folder using rm command -> That’s right! This will clean up the folder and potentially lower usage below the threshold that triggered the alarm.
Worry about another alarm -> Before worrying about another alarm, you will want to follow the troubleshooting instructions for this alarm. Try again!
Contact Nokia NCIR Technical Support -> That is a possibility, but you may want to try fixing the problem on your own first by following the troubleshooting instructions. Try again!

Extra files are cleaned.
Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
Extra files are cleaned.
$ rm ALL_NODES_ tar.gz.9
$ rm ALL_NODES_ tar.gz.8
Now what should you do?
Verify /scratch folder is below usage threshold using df -h command
Worry about another alarm
Contact Nokia NCIR Technical Support
Narration: The extra files are now cleaned from the /scratch folder. What should you do next? Select the correct answer.
Question feedback:
Verify /scratch folder is below usage threshold using df -h command -> Excellent thought! Removing files should lower the folder usage below the threshold. If not, consider removing more old files until the usage is under the threshold.
Worry about another alarm -> Before worrying about another alarm, you will want to follow the troubleshooting instructions for this alarm. Try again!
Contact Nokia NCIR Technical Support -> That is a possibility, but you may want to try fixing the problem on your own first by following the troubleshooting instructions. Try again!
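
Before removing anything with rm, it can help to list removal candidates oldest first. The following is a minimal Python sketch under the assumption that old archives can be recognized by their modification time. The function name oldest_files and the keep_newest parameter are hypothetical. The sketch only lists files; it deletes nothing.

```python
import os


def oldest_files(directory, keep_newest):
    """Return regular files in directory, oldest first, excluding the
    keep_newest most recently modified ones. Nothing is deleted."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    files.sort(key=os.path.getmtime)  # oldest modification time first
    if keep_newest <= 0:
        return files
    return files[:-keep_newest]
```

Calling oldest_files("/scratch", keep_newest=2) would return every regular file in /scratch except the two newest; an administrator would still review the list before removing anything.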

The usage is confirmed to be below the threshold.
Example 1: Alarm Filesystem exceeded; threshold: 70%, actual: 79.00%
The usage is confirmed to be below the threshold.
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/mapper/cgts--vg-scratch--lv 7.6G 4.5G 2.7G 63% /scratch
Now what should you do?
Verify alarm cleared using command system alarm-history-list
Worry about another alarm
Contact Nokia NCIR Technical Support
Narration: The usage is confirmed to be below the threshold. What should you do next? Select the correct answer.
Question feedback:
Verify alarm cleared using command system alarm-history-list -> Excellent thought! If the alarm is cleared, then the problem is resolved.
Worry about another alarm -> Before worrying about another alarm, you will want to follow the troubleshooting instructions for this alarm. Try again!
Contact Nokia NCIR Technical Support -> That is a possibility, but you may want to try fixing the problem on your own first by troubleshooting the problem. Try again!

Example 1: Alarm 100.104 - Filesystem exceeded; threshold: 70%, actual: 79.00%
$ system alarm-history-list
UUID | Time Stamp | Alarm State | Alarm ID ...
d221accc-c7cb-43b1-a5ef-b7a2fde495ea | T12:13: | clear |
Congratulations! Alarm is cleared.
Narration: Congratulations! The alarm is cleared.

Example 2: Server Port State is DOWN after successful migration
A live migration is needed.
Narration: Cloud administrators often need to bring a host down for maintenance. In these cases, a live migration must be performed to transfer the virtual machine to another compute node. So what happens when the live migration succeeds but problems are noticed afterwards? Let us explore this scenario.

$ date; nova live-migration --block-migrate migr2-loc-loc-image
Migration command is executed.
Narration: Using the nova live-migration command, you execute the live migration. The live migration is now underway.

$ nova list
Migration was successful!
| ID | Name | Status | Task State | Power State | Networks |
| 49d69c1c fe21b1eb506 | migr2-loc-loc-image | ACTIVE | - | Running ...
Now what should you do?
Review alarms
Contact Nokia NCIR Technical Support
Review the alarms list
Narration: Congratulations! The migration was a success. Now what should you do? Select the correct answer.
Question feedback:
Review alarms -> Good choice! Verify that the migration did not trigger any alarms.
Contact Nokia NCIR Technical Support -> Probably not at this point. There are no issues or difficulties to report and investigate. Try again!
Review the alarms list -> This is a good possibility, but log files may be more useful at this point. Try again!

$ system alarm-list
Oh no! There was a problem with the live migration.
d221accc-c7cb-43b1-a5ef-b7a2fde495ea | | Data port failed | host = <hostname>.port = < port-uuid> | ...
Now what should you do?
Collect and review log files
Contact Nokia NCIR Technical Support
Review the alarms list
Narration: Oh no! You notice that there is an alarm. Now what should you do? Select the correct answer.
Question feedback:
Collect and review log files -> Nice choice! Using the collect all command will give you a complete picture of the system so you can trace what occurred during the live migration.
Contact Nokia NCIR Technical Support -> Probably not at this point. There are no issues or difficulties to report and investigate. Try again!
Review the alarms list -> This is a good possibility, but log files may be more useful at this point. Try again!

You notice that a port is DOWN.
Now what should you do?
Check the switch port LED status
Review the alarms list
Ignore the collected log files
Narration: In the log files you notice that a port is DOWN. This may be the cause of the port failure alarm. What should you do next? Select the correct answer.
Question feedback:
Check the switch port LED status -> That’s right! Working with the Data Center personnel, check the switch port LED status. This can help you determine whether it is a hardware or a connectivity problem. If the LED is off, there is no network activity or the port is disabled. Verify that the port is on and that the cabling is good.
Review the alarms list -> Maybe, but it is probably a good idea to get additional support at this point. Try again!
Ignore the collected log files -> Probably not a good idea. It would be good to get additional support on this problem. Try again!

$ system alarm-list
Port is still DOWN. Working with the Data Center personnel did not help.
d221accc-c7cb-43b1-a5ef-b7a2fde495ea | | Data port failed | host = <hostname>.port = < port-uuid> | ...
Now what should you do?
Contact Nokia NCIR Technical Support
Review the alarms list
Ignore the collected log files
Narration: After working with the Data Center personnel, you notice that the alarm is still not cleared and the logs still show the problem. What should you do next? Select the correct answer.
Question feedback:
Contact Nokia NCIR Technical Support -> That’s right! Create a clear ticket that describes what was done and what happened during your analysis. Attaching the collected log files to the ticket will help with the investigation. You may also want to include a switch log or SNMP trap information.
Review the alarms list -> Maybe, but it is probably a good idea to get additional support at this point. Try again!
Ignore the collected log files -> Probably not a good idea. It would be good to get additional support on this problem. Try again!

Example 3: VM remains in Error state after system restore from backup
System backup and restore performed. However, a VM remains in Error state.
$ nova list
| ID | Name | Status | Task State | Power State | Networks |
| dde0d3a4-a d4-6bb35d87ea65 | Stabi_01_cpuburn | ERROR | - | NOSTATE |...
What should you do?
Check for alarm on the VM
Contact Nokia NCIR Technical Support
Ignore the error
Narration: You performed a system backup and restore procedure in a minimal two-server configuration. However, after the restore a VM remains in Error state with power state “NOSTATE”. What should you do? Select the correct answer.
Question feedback:
Check for alarm on the VM -> That’s right! Using the nova ...
Contact Nokia NCIR Technical Support -> That is a possibility, but you may want to try fixing the problem on your own first by troubleshooting the problem. Try again!
Ignore the error -> It is never a good idea to ignore an error. Try again!
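
Spotting instances in ERROR state can be automated when many VMs are listed. The following is a minimal Python sketch that reads a pipe-delimited table in the style of the nova list output and returns the names of instances whose Status is ERROR. The helper names parse_table and instances_in_error are hypothetical, and the IDs and the second VM in the sample are placeholders, not values from a real system.

```python
def parse_table(text):
    """Parse a pipe-delimited CLI table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first pipe row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def instances_in_error(text):
    """Return the Name of every row whose Status column is ERROR."""
    return [row["Name"] for row in parse_table(text) if row.get("Status") == "ERROR"]


# Placeholder IDs; only the Stabi_01_cpuburn row mirrors this example.
SAMPLE_NOVA_LIST = """
+------+------------------+--------+------------+-------------+----------+
| ID   | Name             | Status | Task State | Power State | Networks |
+------+------------------+--------+------------+-------------+----------+
| id-1 | Stabi_01_cpuburn | ERROR  | -          | NOSTATE     |          |
| id-2 | example-vm       | ACTIVE | -          | Running     |          |
+------+------------------+--------+------------+-------------+----------+
"""

print(instances_in_error(SAMPLE_NOVA_LIST))
```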

| UUID | Alarm ID | Reason Text | Entity Instance ID | Severity | Time Stamp |
857b20fd-730e-46da-87df-e022bf671b07 | | Instance Stabi_01_cpuburn owned by admin has failed on host controller-0| ... | critical | ...
There is a critical alarm for the problematic VM. What should you do next?
Check the Proposed Repair Action for the alarm
Contact Nokia NCIR Technical Support
Perform the system restore again
Narration: You notice there is a critical alarm on the problematic VM. What should you do next? Select the correct answer.
Question feedback:
Check the Proposed Repair Action for the alarm -> That’s right! Reviewing the Proposed Repair Action will let you know if there is anything that needs to be done.
Contact Nokia NCIR Technical Support -> That is a possibility, but you may want to try fixing the problem on your own first by troubleshooting the problem. Try again!
Perform the system restore again -> Since you do not know what caused the initial problem, it is probably not a good idea to reinitiate a restore at this point. Try again!

$ system alarm-show {uuid}
Proposed Repair Action: The system will automatically attempt to restart the instance at regular intervals. No repair action needed.
Alarm ID ...
Proposed Repair Action The system will automatically attempt to re-start the instance at regular intervals. No repair action required.
Now what should you do?
Monitor the instance details
Contact Nokia NCIR Technical Support
Perform the system restore again
Narration: You learn from the Proposed Repair Action in the alarm details that the system will attempt to restart the instance at regular intervals and that no repair action is needed. Now what should you do? Select the correct answer.
Question feedback:
Monitor the instance details -> That is a smart move! If the error does not clear, you may want to contact Nokia NCIR Technical Support. If it does clear, the problem is resolved.
Contact Nokia NCIR Technical Support -> That is a possibility, but you may want to try fixing the problem on your own first by troubleshooting the problem. Try again!
Perform the system restore again -> Since you do not know what caused the initial problem, it is probably not a good idea to reinitiate a restore at this point. Try again!

$ nova show Stabi_01_cpuburn
Excellent work! Monitor the instance details. If the error state does not clear automatically, contact Nokia NCIR Technical Support.
Property | Value
...
OS-EXT-SRV-ATTR:host | controller-0
...
OS-EXT-STS:power_state | 0
OS-EXT-STS:vm_state | error
...
Name | Stabi_01_cpuburn
Status | ERROR
...
Narration: Excellent work! Monitoring the instance details will let you know when the error state clears. If the error state does not clear automatically, contact Nokia NCIR Technical Support.
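
Monitoring the instance can be expressed as a bounded polling loop: check the state a fixed number of times, and escalate if it never leaves the error state. The following is a minimal Python sketch of that idea. The function wait_for_state_change and its parameters are hypothetical, and get_state stands in for whatever actually reads the vm_state value (for example, from nova show output); no real NCIR or nova API is called here.

```python
import time


def wait_for_state_change(get_state, bad_state="error", attempts=5,
                          delay_seconds=60.0, sleep=time.sleep):
    """Poll get_state() up to `attempts` times.

    Returns the first state that differs from bad_state, or None if the
    state never changes (the point at which support would be contacted).
    """
    for attempt in range(attempts):
        state = get_state()
        if state != bad_state:
            return state
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next check
    return None


# Simulated readings: the instance recovers on the third check.
readings = iter(["error", "error", "active"])
result = wait_for_state_change(lambda: next(readings), delay_seconds=0)
print(result)
```

The attempt count and delay are arbitrary illustration values; the source does not state how often the system retries the restart.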

This course covered the following topics:
Course summary
This course covered the following topics:
Main causes of malfunction in the Telco Cloud
Tools to troubleshoot the Telco Cloud
Basic troubleshooting process
Examples of common scenarios
Narration: You have reached the end of this course. Let’s briefly sum up the topics we covered. We looked at the Telco Cloud NCIR architecture and identified the most common issues that may cause malfunction. We also learned about the sources of information that help us understand the root cause of a problem. Then we looked at key tools that allow us to troubleshoot the Telco Cloud. Finally, we explored some common scenarios and learned what to do in those situations.

Thank you!
Congratulations! You have completed this course. You now know the tools and the basic process needed to troubleshoot the Telco Cloud for NCIR, and you should be ready to continue your troubleshooting training with the hands-on courses. Thank you for joining this training!
IPMI – Intelligent Platform Management Interface: a network connection directly to the hardware instead of the OS or login shell. It is external, part of the Baseboard Management Controller (BMC), and is available through a CLI and a web GUI.
