What is our primary use case?
The solution is our upper-tier all-flash shared storage for our customers. We have shared virtual private cloud environments that we serve storage to. This solution is our high-tier flash storage offering.
What is most valuable?
The product is easy to manage and deploy. It's got full API functionality and the performance is pretty steady.
The initial setup is pretty straightforward.
What needs improvement?
We have had some issues with it scaling as high as the marketing says it can. We've got some very large clusters of more than 20 nodes, and when you get to that size your upgrades tend to take a long time or just sit waiting. We tend to have issues beyond 20 nodes.
The upgrade process could be better. Lately, we've also had a lot of hardware failures. It seems like at least once a month we're replacing an entire node, as opposed to the drive failures or small component failures you would normally expect. The hardware reliability isn't quite there.
For how long have I used the solution?
I've used the solution since 2014. I've had it for the last seven years.
What do I think about the stability of the solution?
The product is pretty stable. On the software side, it's got the Double Helix protection, and it's very redundant. However, due to hardware issues, we've had some performance problems, since a node failure takes all of that node's drives out of commission. That's partly a capacity management issue; however, as a result, we've had to plan for much lower usage.
We need to have a much bigger buffer there to deal with node failures and ensure they don't impact performance. If you're running at 70% and you suddenly lose a node, you're now hitting cluster-full alarms, and that can impact performance as well as the ability to continue creating volumes and things like that for customers.
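To make the buffer arithmetic concrete, here is a minimal sketch in Python. It assumes equal-sized nodes and that the stored data ends up spread evenly over the surviving nodes after a failure; both are simplifications. The function names and the 80% alarm threshold are hypothetical and are not SolidFire's actual fullness thresholds.

```python
def utilization_after_node_loss(used_fraction, nodes, failed=1):
    """Fraction of remaining capacity in use after `failed` nodes drop out.

    Assumes equal-sized nodes and that the same amount of data is
    redistributed evenly across the surviving nodes (a simplification).
    """
    if not 0 <= used_fraction <= 1:
        raise ValueError("used_fraction must be between 0 and 1")
    if nodes < 1 or failed < 0 or failed >= nodes:
        raise ValueError("need at least one surviving node")
    return used_fraction * nodes / (nodes - failed)


def max_safe_utilization(alarm_threshold, nodes, failed=1):
    """Highest pre-failure utilization that stays at or below
    `alarm_threshold` after losing `failed` nodes.

    `alarm_threshold` is a hypothetical value chosen by the operator.
    """
    if nodes < 1 or failed < 0 or failed >= nodes:
        raise ValueError("need at least one surviving node")
    return alarm_threshold * (nodes - failed) / nodes


if __name__ == "__main__":
    # Running at 70% full, then losing one node, for a few cluster sizes.
    for n in (4, 10, 20):
        after = utilization_after_node_loss(0.70, n)
        print(f"{n} nodes at 70%: {after:.1%} after one node fails")
    # Buffer needed to stay under a hypothetical 80% alarm threshold.
    print(f"4 nodes, 80% alarm: stay below {max_safe_utilization(0.80, 4):.0%}")
```

Under these assumptions, a 4-node cluster at 70% lands at roughly 93% after losing one node, while a 20-node cluster at 70% lands at roughly 74%. The smaller the cluster, the larger the buffer you need to absorb a whole-node failure.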
Other than that, it works as expected as far as maintaining redundancy. We've never had a problem with losing data or anything like that, even with those hardware failures.
What do I think about the scalability of the solution?
The scalability limits may be a function of the distributed architecture. When you do an upgrade on a cluster of 20-plus nodes, it goes through and upgrades each node one at a time. As a result, an upgrade can sometimes take upwards of 48 to 50 hours to complete. It can stretch out even further, since you can only do one node at a time and may need to hold off during the day.
One thing that we run into during upgrades is that very large volumes can become disconnected. Some of that might just be that we need to limit the size of the volumes that we support. However, for our customers, we have had cases where a node reboots and a volume is disconnected from the ESXi host for longer than the host can tolerate, and then you get an all-paths-down (APD) condition.
Right now, we have hundreds of end-users on this solution. We are a service provider and likely have thousands of users if you take into account our customers' user bases.
We're not planning on expanding SolidFire. We're looking at different options for our private cloud environment, and we're moving towards more of an HCI architecture. SolidFire may stick around as a dedicated platform or as an all-flash option; however, it's not going to be our primary shared storage.
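As a rough back-of-the-envelope check on those numbers, the sketch below models a one-node-at-a-time upgrade in Python. The per-node time of 2.4 hours is simply 48 hours divided by 20 nodes, derived from the figures above rather than measured. The 10-hour nightly maintenance window and the function names are hypothetical.

```python
import math


def rolling_upgrade_hours(nodes, hours_per_node):
    """Total hands-on upgrade time when nodes are upgraded strictly
    one at a time."""
    return nodes * hours_per_node


def maintenance_windows_needed(nodes, hours_per_node, window_hours):
    """Number of maintenance windows needed if upgrades pause outside
    the window and a node upgrade is never split across two windows."""
    nodes_per_window = math.floor(window_hours / hours_per_node)
    if nodes_per_window < 1:
        raise ValueError("window is too short to upgrade a single node")
    return math.ceil(nodes / nodes_per_window)


if __name__ == "__main__":
    per_node = 48 / 20  # estimate derived from the review's 48-hour figure
    total = rolling_upgrade_hours(20, per_node)
    windows = maintenance_windows_needed(20, per_node, window_hours=10)
    print(f"{total:.0f} hours of upgrade time across {windows} nightly windows")
```

Under these assumptions, a 20-node cluster needs five nightly windows, which is how a two-day upgrade can turn into most of a working week once you hold off during business hours.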
How are customer service and support?
I haven't dealt with technical support much myself. There was another engineer who mostly dealt with them. The times that I did deal with support, they were pretty knowledgeable, but that was before NetApp purchased SolidFire. Back then, their support was top-notch. Since then, if we can get past the first layer of support, it seems to be better than what I would expect. However, there have been issues with calling in and not getting the right support. It just takes time to get past the level that isn't as knowledgeable as the next level up.
Which solution did I use previously and why did I switch?
I'm also familiar with Pure, NetApp, and VNX.
Pure has a more traditional two-controller architecture, as opposed to the distributed architecture of SolidFire. It's also all-flash, just like SolidFire. It's even simpler than SolidFire in terms of deployment and management. They've got an active controller configuration, so upgrades are essentially transparent as you upgrade a node or scale. It's just the way that the architecture is designed on the back end.
How was the initial setup?
We have found the initial setup to be fairly easy. The implementation process is pretty smooth. It's self-explanatory.
We've got two people dedicated to the SolidFire arrays. We have several arrays in our data centers and a whole support task force that deals with tickets. However, in general, there are two platform owners who ensure everything is up to date and any issues are resolved as needed.
What's my experience with pricing, setup cost, and licensing?
While I don't know specifics about the licensing costs or procedures, my understanding is that it's comparable to other products. We did a comparison with Pure recently, and the per-gigabyte charge differed by only five to 10 cents.
What other advice do I have?
We are currently a customer and an end-user.
I did not use the latest version of the solution. We were a couple of years behind. I've been out of the operations group for probably the last eight months or so, and it's my understanding that they recently updated it; however, the last version I worked with was version 11.
I'd rate the solution a solid eight out of ten. It loses points for the hardware issues, which have been pretty impactful lately, and for the issues we've seen with upgrades.
*Disclosure: My company does not have a business relationship with this vendor other than being a customer.