How to Use Mellanox Firmware Tools (MFT): A Beginner’s Guide

Troubleshooting Mellanox Firmware Tools: Common Issues and Fixes

Overview

Mellanox Firmware Tools (MFT) manage firmware and settings for Mellanox/ConnectX network adapters. This guide covers frequent issues, diagnostic commands, and concrete fixes to restore device functionality quickly.

1. Unable to Detect Adapter

  • Symptoms: mlxup, mst status, or lspci do not list the adapter.
  • Quick checks:
    • Hardware: ensure card seated, power/cables connected.
    • PCI visibility: run lspci | grep -i mellanox.
    • Kernel modules: check lsmod | grep mlx5_core.
  • Fixes:
    1. Reseat card and reboot host.
    2. If not visible at BIOS level, check motherboard slot/power and test the card in another host.
    3. Ensure drivers are installed: install or confirm Mellanox OFED or the in-kernel drivers for your distro. Example (Debian/Ubuntu, assuming the NVIDIA MLNX_OFED apt repository is already configured):

      sudo apt-get update
      sudo apt-get install -y mlnx-ofed-basic
    4. If PCIe ACS/ACS override or SR-IOV settings were changed in BIOS, revert or test with defaults.
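
The quick checks above can be scripted. The following is a minimal sketch, not an official tool: the function names are made up for illustration, and it assumes the adapter reports the Mellanox PCI vendor ID 15b3 and uses the mlx5_core driver (older ConnectX-3 cards use mlx4_core instead).

```shell
# Reads `lspci -nn` output on stdin; succeeds if any line carries the
# Mellanox PCI vendor ID 15b3 in its [vendor:device] tag.
has_mellanox_pci() {
    grep -qi '\[15b3:[0-9a-f]\{4\}\]'
}

# Reads `lsmod` output on stdin; succeeds if the mlx5_core module is loaded.
has_mlx5_module() {
    grep -q '^mlx5_core[[:space:]]'
}

# Typical use on a live host:
#   lspci -nn | has_mellanox_pci || echo "adapter not visible on the PCI bus"
#   lsmod | has_mlx5_module || echo "mlx5_core is not loaded"
```

Matching on the vendor ID rather than the word "mellanox" avoids false negatives when the local PCI ID database is outdated and lspci cannot print the vendor name.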

2. mlxup / mft Commands Fail with Permission Denied

  • Symptoms: Commands error out when run as non-root.
  • Fixes:
    • Run commands as root or with sudo, for example sudo mst status or sudo mlxup.
    • Add appropriate udev rules or group permissions if you need non-root access (create a group, set device permissions in /etc/udev/rules.d/).
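
For scripts that may be started by either root or a regular user, one option is to add sudo only when it is needed. This is a small sketch under that assumption; priv_prefix is a hypothetical helper name, not part of MFT.

```shell
# Prints "sudo" when the given UID (default: the current user) is not root,
# and prints nothing when it is root.
priv_prefix() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -ne 0 ]; then
        printf 'sudo\n'
    fi
}

# Typical use (left unquoted on purpose so an empty prefix disappears):
#   $(priv_prefix) mst status
```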

3. Firmware Update Fails or Hangs

  • Symptoms: Firmware upload stops, device unresponsive after update.
  • Diagnostics:
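    • Check current firmware: mlxfwmanager -d <device> --query or mlxup --query.
    • Inspect kernel logs for errors: sudo journalctl -k or sudo dmesg. mlxfwmanager is a command-line tool rather than a systemd service, so it has no unit log of its own.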
  • Fixes:
    1. Use the correct firmware package for your adapter model and ASIC revision. Verify with lspci -nn and mlxfwmanager output.
    2. Ensure uninterrupted power and avoid network traffic during update; perform during maintenance window.
    3. Retry with the force flag only if needed: mlxfwmanager -d <device> -i <firmware_image> -u -f.
    4. If device becomes unresponsive, perform a cold reboot. If still bricked, contact vendor support for recovery tools or RMA.
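
Before retrying or forcing an update, it can help to compare the running firmware version with the version of the image you plan to write. The sketch below assumes dotted numeric versions such as 16.35.2000 and relies on GNU sort -V; fw_is_older is a hypothetical helper name.

```shell
# Succeeds when version $1 sorts strictly before version $2.
# Relies on `sort -V` (version sort from GNU coreutils).
fw_is_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use, with both versions copied by hand from the query output:
#   fw_is_older 16.35.2000 16.35.3006 && echo "image is newer than the device"
```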

4. MFT Reports Version Mismatch or Unsupported Image

  • Symptoms: mlxfwmanager warns image unsupported or versions mismatch.
  • Fixes:
    • Confirm the image matches the device model and firmware family. Use mlxfwmanager -d <device> --query to get device identifiers such as the part number and PSID.
    • Download official firmware from the Mellanox/NVIDIA support site for your exact SKU.
    • Convert or extract images if using vendor-supplied bundles, following the vendor's instructions.
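
To compare identifiers without copying them by hand, the query output can be parsed. This sketch assumes the output uses "Name:   value" lines (for example a "PSID:" line); that layout is an assumption and may differ between MFT versions. query_field is a hypothetical helper name.

```shell
# Reads query output on stdin and prints the value of the named field,
# e.g. "PSID" or "Part Number". Assumes lines shaped like "  Name:   value".
query_field() {
    awk -v name="$1" -F': *' '
        { key = $1; sub(/^[[:space:]]+/, "", key) }
        key == name { print $2; exit }
    '
}

# Typical use on a live host:
#   sudo mlxfwmanager --query | query_field "PSID"
```

Only the first matching line is printed, so on hosts with several adapters the output should be narrowed to one device first.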

5. Loss of Link or Poor Performance After Update

  • Symptoms: Packet loss, link flaps, reduced throughput after firmware change.
  • Diagnostics:
    • Check link state: ethtool <interface> for Ethernet, iblinkinfo for InfiniBand.
    • Review driver/modules: modinfo mlx5_core, dmesg for link errors.
  • Fixes:
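    1. Roll back to previous firmware if available and known-good: mlxfwmanager -d <device> -i <previous_image> -u -f. The force flag is typically needed because the image is older than the running firmware.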
    2. Update drivers/kernel to compatible versions recommended alongside firmware.
    3. Verify port speed/auto-negotiation settings with ethtool and switch side configuration.
    4. Test with direct-connected cable or different switch port.
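
When checking several ports before and after a firmware change, a one-line summary per interface is easier to compare than full ethtool output. This is a minimal sketch that reads the standard "Speed:" and "Link detected:" lines; link_summary is a hypothetical helper name.

```shell
# Reads `ethtool <interface>` output on stdin and prints a one-line summary
# built from the "Speed:" and "Link detected:" lines.
link_summary() {
    awk -F': ' '
        /^[[:space:]]*Speed:/         { speed = $2 }
        /^[[:space:]]*Link detected:/ { link = $2 }
        END { printf "link=%s speed=%s\n", link, speed }
    '
}

# Typical use, replacing the interface name with your own:
#   ethtool eth0 | link_summary
```

Saving this summary before the update gives a baseline to compare against if throughput drops afterwards.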

6. SR-IOV or VF Issues

  • Symptoms: Virtual Functions not appearing or unstable.
  • Diagnostics:
    • Confirm SR-IOV is enabled in BIOS, then list any existing VFs: lspci | grep -i "Virtual Function".
    • Check sysfs: cat /sys/class/net/<interface>/device/sriov_totalvfs and sriov_numvfs in the same directory.
  • Fixes:
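    1. Create VFs on the device: echo <num_vfs> | sudo tee /sys/class/net/<interface>/device/sriov_numvfs.
    2. Ensure firmware and driver versions support SR-IOV for your model, and that it is enabled in the firmware configuration (mlxconfig -d <device> query lists SRIOV_EN and NUM_OF_VFS).
    3. Recreate VFs and rebind them to the correct driver (vfio-pci for passthrough, mlx5_core for use on the host).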
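
Recreating VFs can be wrapped in a small function that validates the requested count first. This is a sketch, not a vendor script: set_vfs is a hypothetical name, it must run as root on a real host, and the optional third argument (an alternative sysfs directory) exists only to make dry runs possible.

```shell
# set_vfs <interface> <count> [sysfs_net_dir]
# Checks <count> against sriov_totalvfs, then resets and sets sriov_numvfs.
set_vfs() {
    dev_dir="${3:-/sys/class/net}/$1/device"
    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$2" -gt "$total" ]; then
        echo "requested $2 VFs, device supports at most $total" >&2
        return 1
    fi
    # The kernel rejects changing one non-zero VF count to another,
    # so reset to 0 before writing the new value.
    echo 0 > "$dev_dir/sriov_numvfs" || return 1
    echo "$2" > "$dev_dir/sriov_numvfs"
}

# Typical use as root, replacing the interface name with your own:
#   set_vfs eth0 4
```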

7. Incompatible Tool Versions or Broken Package

  • Symptoms: mft utilities crash, missing dependencies after distro upgrade.
  • Fixes:
    • Reinstall MFT package from Mellanox/NVIDIA repos for your OS and kernel version.
    • Use distribution packages when possible; for custom kernels, build or install matching driver/tool versions.
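
After a distro upgrade, a quick way to see whether the package is partly broken is to check which tools are still on PATH. A minimal sketch; missing_tools is a hypothetical helper name, and the tool list in the usage comment is only a suggestion.

```shell
# Prints each named command that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Typical use:
#   missing_tools mst flint mlxconfig mlxfwmanager mlxlink
```

Empty output means every listed tool was found; any names printed point to a package that needs reinstalling.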

Diagnostic Command Reference (common)

  • lspci | grep -i mellanox
  • sudo mlxfwmanager -d <device> --query
  • sudo mlxfwmanager -d <device> -i <firmware_image> -u
  • ethtool <interface>
  • sudo dmesg | tail -n 200
  • sudo journalctl -k
  • lsmod | grep mlx5_core

When to Contact Support

  • Device remains unresponsive after a firmware write, even following a cold reboot.
  • Firmware image clearly mismatched and recovery tools needed.
  • Suspected hardware failure after diagnostics.

Quick Recovery Checklist

  1. Confirm hardware visibility (lspci).
  2. Check driver/modules (lsmod, dmesg).
  3. Validate firmware image matches device.
  4. Use official mlxfwmanager with proper flags.
  5. Reboot and retest link and performance.
  6. Contact vendor support with device logs and mlxfwmanager outputs if unresolved.
