Valuable Features
To me, the most valuable features come down to a few things. First of all, the probe set is fantastic. Probably even more valuable is the fact that we can manage the probes and the robots without having root access to the boxes. Prior to using UIM, we used some other tools that I'll leave unnamed. Robots going down can still cause a problem with UIM, but if robots are flaky and need to be restarted, we can do that through the console without root access. If probes go down, we can restart those. If we need to install or remove probes, we can do that too. With our previous monitoring tools, we couldn't do any of those things. That matters in a lot of companies, but especially in the banking world where I come from, because we're siloed. We're mandated by the federal government that our teams only have the rights they need to do their job. Because of that, we can't give the monitoring team root access to anything.
That was a huge plus, when we found a tool that allowed us to do a lot of this maintenance and troubleshooting without root access. With the previous tool, we would have to open a ticket and assign it to a completely different team, and then, depending on their workload, it could take days for them to get something back up and running for us. With UIM, we can do almost all of that ourselves. The only gotcha is if the robot has actually crashed or is not running at all. That's the only exception, and it essentially eliminated 80% of the issues that would have required us to go to another team for a fix, which helped my team be more productive.
On top of that there is the probe set, and the probe I love most is the LogMon probe. Looking at all the other tools, including the ones put out by EMC, HP, and IBM, none of them had anything close to it. The UIM LogMon probe is, in my opinion, far above and beyond anything from the big four. Most of the others required you to write scripts for almost anything like that. Some of the other UIM probes were also just much more mature and user-friendly.
The other thing I really love about the tool is that it was developed by one company, mostly Nimsoft, which means that all the probes and all the features fit nicely into one console. The learning curve was way lower. The big four tend to purchase, adopt, and combine, and before you know it, you have a tool that is a conglomerate of 16 different companies. When we were doing our research on each of the big four tools, the learning curve was very steep on all of them. With Nimsoft/UIM, you basically just learn the one console and how a probe works. You learn it once and you know it for the whole tool, whereas with the other ones, because they're a mish-mash of a dozen tools or more, you essentially have to learn a dozen different ways to do these simple things.
In a lot of ways, there was a ton of crossover in those tools, too. When we were looking into them, we would ask, "How would you monitor this specific thing?" They would go, "Oh, well, there are three ways to do that." That's not very effective, because now the question is, "Well, which one do we pick?" They wouldn't really give you an answer, because all three of them work, but now you've got three different places a monitoring point can live. If you ever have to go back and troubleshoot, you've got to look in three different places.
I love the fact that since CA bought Nimsoft, they've kept it very similar to how it was before, which I am very grateful for. We had big fears that it would be ripped apart, but it looks like they're keeping it fairly intact.
Room for Improvement
We've had a lot of difficulty integrating with our ticketing software, which is currently Remedy. I was really surprised, when we bought Remedy, that we had such a difficult time integrating, because they're one of the big four, and CA is one of the big four. I would have thought that was just boilerplate stuff, but it wasn't. We had to have custom work done, and it was built off of a really old probe. It's real backwater duct tape and twine keeping that system together. Since then, CA has come out with a 2.0, but that doesn't work with our current version. We can't make it work.
I don't really want to go into all the details on that, but essentially we're having a real rough time getting a ticket number back instantly. With our current system, we open a ticket, eventually they send us a ticket number back, and we connect it, but the new system doesn't really make that work. I haven't had enough time to dig into it. Primarily, our Remedy side of the house has had a lot of turnover and, basically, we have no support on that side of the fence to make it work. The fact that CA doesn't have a plug-and-play integration that works well is a little frustrating.
Stability Issues
One of the things I've noticed over the years of working with the console and the different hubs and robots is how much the role of the database has changed. It used to be that the database was primarily used for storing historical data points. If it was slow or down, you couldn't store those points, but the tool still functioned properly. We're finding that more and more has been moved into the database, meaning that the tool itself, the IM console, relies more and more on data out of the database. So, if you have a slow connection, or if your database is buggy or down, you basically can't control your environment at all.
That's a negative change I've seen in the tool, because it used to be that if the database went down, we could still access all of our hubs and all of our robots through the IM console and control them. Alerts that came in would still create a ticket, because we pass them to our ticketing software, and all of that still functioned. That's not the case anymore. If the database is down or if there's something going on with it, the console becomes very buggy and very, very slow, and sometimes impossible to use. That's one thing that I wouldn't mind someone looking into.
Scalability Issues
We've had to scale out a lot more than support told us we would. They claim a hub can support X amount of boxes, and we find that's just not even close to true. We don't turn any of the data points on by default. We just run the CDM probe (CPU, disk, and memory) for alerts only, and we're not storing any historical data. Because we're not capturing those data points, we're basically on a bare-bones infrastructure for each box. Support told us we could support 2000 boxes per hub, and they were talking fully loaded with all the data points, and we simply can't. We're maxing out at anywhere from 300 to 500 robots reporting to a hub. Beyond that, most hubs start to get bogged down. We've had to just add additional hubs.
We also struggle with backup hubs and with coordinating the configuration between a primary hub and a fail-over hub. We have backup fail-over hubs that basically sit empty, just waiting to take on the load. Coordinating the configuration files between them has become impossible. We haven't put tons of effort into it lately, but they had an HA probe, and the HA probe only does so much. It turns a few things on, but there was nothing that would sync up configuration files for certain probes. Without that syncing of config files, it was impossible to keep up.
Customer Service and Technical Support
I go back ten years with Nimsoft, and CA has owned them for something like three or four years. I'll tell you, when CA first bought them, support was terrible. It would take them two to three weeks to respond to a ticket, and typically the response was a question. We would update immediately, and three or four days later there would be another question. It literally took a month of clarifying questions. That got me to the point where I stopped using support. We would search the forums, and I've been using the product long enough that I could usually figure out workarounds. Lately, though, that has immensely improved. When a few big things have happened and we've been forced to go to support, they've been way more responsive. That's probably been a big change in the last two years, I would say.
Initial Setup
We have a multi-tiered setup, so we have several hubs that each control certain zones, with about 500 robots per hub. We call those our secondary hubs, because we then have a primary hub, which we call the MOM. We have a DR MOM as well, so we basically have a three-level structure. When we first set it up, everybody acted like that was the way to go. Everyone in support told us that was the way to go, but since then, every time we have to deal with support on it, they act like the three-level structure is just not normal. I'm not sure how else we would do it, because when we really call them in and try to figure it out, they just say to leave it the way it is.
But it was a pretty straightforward installation. The hard part is all the tweaking once you get it installed. Making sure your tickets or alarms flow properly, and that rules get fired properly to do certain things, is where it gets real tricky. You have to make sure rules aren't crossing each other and creating circles, endless loops, and things like that. We've had a few headaches with some of the pieces doing that, especially with DR forwarding alarms to two boxes, but then they have to update as well, and the next thing you know, you've got a loop. That's been a little difficult over the years, but we've got it worked out.
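The kind of forwarding loop described above can be caught before it bites by treating each hub as a node and each "forward alarms to" rule as a directed edge, then checking for a cycle. The following is only a minimal sketch of that idea in Python. The hub names and the `rules` dictionary are hypothetical, and nothing here reads real UIM configuration or uses any UIM API.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_forwarding_loop(rules: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one loop as a list of hubs (first hub repeated at the end), or None.

    `rules` maps a hub name to the hubs it forwards alarms to (hypothetical model).
    """
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(hub: str) -> Optional[List[str]]:
        state[hub] = GREY
        path.append(hub)
        for target in rules.get(hub, []):
            seen = state.get(target, WHITE)
            if seen == GREY:
                # target is already on the current path, so we have a loop
                return path[path.index(target):] + [target]
            if seen == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[hub] = BLACK
        return None

    for hub in rules:
        if state.get(hub, WHITE) == WHITE:
            loop = visit(hub)
            if loop:
                return loop
    return None


# Hypothetical three-level layout: a secondary hub forwards to both the MOM and
# the DR MOM, and the two MOMs also update each other.
rules = {
    "secondary-hub-1": ["mom", "dr-mom"],
    "mom": ["dr-mom"],
    "dr-mom": ["mom"],
}
loop = find_forwarding_loop(rules)
print(loop)
```

A check like this only flags that a loop exists in the rule graph. Whether the loop is harmful depends on how each rule filters or tags the alarms it forwards, which this sketch does not model.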
Other Advice
If you are currently in the process of comparing the products out there and trying to boil down that decision, one thing that I did was make a chart. I took all of the things we needed to monitor, like CPU, disk, memory, log files, simple URL monitoring, and more advanced scripted website monitoring, broke everything down, and built a template. Then I went through and asked, does UIM cover this? Yes, yes, yes. In other words, I took CA and asked which CA products it takes to cover all the points I need. UIM covered 95% of that. When I went to IBM and HP, as an example, I did the same thing, and it took anywhere from eight to twelve products to do the same thing.
The way that I sold this up the command chain was to point out that there's a steep learning curve for these types of tools. If you have to learn eight or twelve tools versus one tool, the one tool not only makes your job easier, it also takes fewer man-hours to get efficient, and that sells up the chain. That's one of the tips I'd give to customers who are looking at the product: the learning curve is much lower because you're learning one tool to cover X amount of things you've got to do, compared to eight or twelve.
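The comparison chart described above can be reproduced in a few lines of code. The sketch below is one possible way to do it in Python, not the spreadsheet I actually used. The requirement list follows the review, but the product names and their coverage sets are made up for illustration, and the selection uses a simple greedy heuristic (pick whichever product covers the most still-uncovered requirements), which is not guaranteed to find the smallest possible set.

```python
from typing import Dict, List, Set, Tuple

REQUIREMENTS = [
    "cpu", "disk", "memory", "log files",
    "url checks", "scripted website checks",
]


def products_needed(
    requirements: List[str], catalog: Dict[str, Set[str]]
) -> Tuple[List[str], float]:
    """Greedily pick products until nothing more can be covered.

    Returns the chosen product names and the fraction of requirements covered.
    """
    wanted = set(requirements)
    if not wanted or not catalog:
        return [], 0.0
    remaining = set(wanted)
    chosen: List[str] = []
    while remaining:
        # Product that covers the most requirements we still lack
        best = max(catalog, key=lambda name: len(catalog[name] & remaining))
        gained = catalog[best] & remaining
        if not gained:
            break  # no product in this catalog covers what is left
        chosen.append(best)
        remaining -= gained
    return chosen, 1 - len(remaining) / len(wanted)


# Hypothetical vendor catalogs, invented for this example.
vendor_a = {
    "single-console": {"cpu", "disk", "memory", "log files", "url checks"},
}
vendor_b = {
    "server-agent": {"cpu", "disk", "memory"},
    "log-tool": {"log files"},
    "web-tool": {"url checks", "scripted website checks"},
}

chosen_a, covered_a = products_needed(REQUIREMENTS, vendor_a)
chosen_b, covered_b = products_needed(REQUIREMENTS, vendor_b)
print(chosen_a, f"{covered_a:.0%}")
print(chosen_b, f"{covered_b:.0%}")
```

The number that matters for selling it up the chain is `len(chosen)`: every extra product in that list is another console and another learning curve for the team.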
Disclosure: I am a real user, and this review is based on my own experience and opinions.
While I agree with your assessment of UIM (it is a leader in the NW monitoring space), I would urge you to compare these tools to vendor offerings. This company is less of an innovator and much more of an acquirer (Spectrum & NetQoS rocked, still do).
If this is your first exposure to centralized dashboards and top-level manager-of-manager approaches, you will likely find other companies offering more innovative approaches.
Your point on Human vs. Product resonates with my experience too. Take a look at some of the free and open source software (FOSS) offerings out there, if you feel your team can make the difference. You may find there is no need to pay for the licensing and maintenance of commercial off-the-shelf (COTS) solutions.
Opinions I express here are based solely on my own experience and do not in any way reflect the leanings of my employer.