My partner told me that one of their customers couldn't connect to the office network from home over the weekend. He thought it was normal downtime and decided to check on it on Monday.
He went back on Monday only to find a badly charred data room with his UPS almost melted away. It seems the UPS caused a fire, but the fire alarm didn't sound and the fire was confined to the data room.
1. Found out the Air Conditioning in the server room was not working.
2. Found out the ink was about to run out in the printer before it actually ran out.
3. I discovered that many of my machines had processor utilization at 80% most of the time.
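The third item above is the kind of thing a simple monitoring check can surface. As a minimal sketch (the function name, threshold, and sample data are all illustrative, not from any particular monitoring tool), a check might flag hosts whose sampled CPU utilization stays high most of the time:

```python
# Sketch of a check that flags sustained high CPU utilization.
# The 80%/80%-of-samples thresholds are illustrative assumptions.

def sustained_high_cpu(samples, threshold=80.0, fraction=0.8):
    """Return True if at least `fraction` of the utilization
    samples (percentages) meet or exceed `threshold`."""
    if not samples:
        return False
    over = sum(1 for s in samples if s >= threshold)
    return over / len(samples) >= fraction

# Example: a host sitting near or above 80% for most of the day.
day = [82, 85, 79, 91, 88, 84, 90, 83, 76, 87]
print(sustained_high_cpu(day))  # → True (8 of 10 samples >= 80)
```

In practice the samples would come from whatever your monitoring system collects (SNMP, an agent, etc.); the point is that "80% most of the time" is easy to define and alert on once you are actually collecting the data.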
Well, it depends on the scope of what you are monitoring, but I have seen some weird things. Most problems stem from simple issues but are often hard to find. One such experience of mine was old switch firmware that caused an entire site to go down: the switch would die when SNMP was configured and polled. This caused nothing but constant outages and aggravation for the client I was working with. The resolution was to upgrade the firmware version and/or get newer hardware.
Hmm, the most surprising issue (not for me, but for management) has been power outages. I worked at a local health department until recently, and of course we were emergency responders, required to be operational even during earthquakes, storms, and volcanic eruptions. Management wanted to place our applications into 'the cloud' for fault tolerance and redundancy, but could not understand that this didn't cover last-mile connectivity. The problem comes with extended outages; cell tower UPSes have about 12 hours of capacity. Even setting that aside, cellular services are overloaded during emergencies, making access problematic.
Simply kicking the problem over the fence to another entity solves nothing. Eventually I was able to get management to understand the problem, but it took a couple of years and their direct experience with cell outages.
1. Hundreds of rats chewed fiber links in a crawlspace in London.
2. Monkeys tore apart the local loop in Thailand and attacked technicians, preventing repair.
3. 2 km of copper wire, the entire local loop, stolen.
4. A street vendor refused to move his cart for three weeks to give access to a manhole so a circuit could be serviced.
5. A national telco refused to open new tickets because their team had just lost the World Cup (football/soccer).
6. An on-campus wire ran alongside a parking garage door opener motor, causing intermittent interference.
7. A micro-break in an on-campus wire that ran under a road on a military base. The circuit would only error when a tank drove over the road; lighter vehicles would not cause the problem.
8. Cisco TAC attributing a router failure to "sun spot activity".
9. A cleaning crew unplugging a router every night to plug in their vacuum cleaner.
10. Entire segments of network going dark as the twin towers fell on Sept 11, 2001.
Something that comes as no surprise but is often the case: the network gets blamed for an application problem. If you don't have complete visibility into both application and network performance, you can be stumped by what looks like a slow network when it's actually the application that is hurting your user experience.
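One cheap way to start separating the two is to time the TCP handshake on its own and compare it to the full response time. A minimal sketch (the function names and the 50% attribution threshold are assumptions for illustration, not a standard method):

```python
import socket
import time

def time_tcp_connect(host, port, timeout=3.0):
    """Measure just the TCP handshake in milliseconds --
    a rough proxy for network latency, excluding the application."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

def likely_bottleneck(connect_ms, total_ms):
    """Crude attribution: if the handshake is a large share of the
    total response time, suspect the network; otherwise suspect
    the application."""
    if total_ms <= 0:
        return "unknown"
    return "network" if connect_ms / total_ms > 0.5 else "application"

# Example: a 10 ms handshake inside a 900 ms response points
# at the application, not the network.
print(likely_bottleneck(10, 900))  # → application
```

It's crude, but even this level of split is often enough to stop the "the network is slow" conversation and look at the right layer first.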