What is our primary use case?
We use this solution for data engineering, data transformation, preparing data for machine learning, and running queries. Between 30 and 40 people use this solution.
How has it helped my organization?
Some very large datasets are difficult to process with pandas and other Python libraries. Spark SQL has helped us a lot with that.
What is most valuable?
The query language, and the ability to use it to process large datasets, has been the biggest benefit for us.
What needs improvement?
It takes some time to get used to this solution compared with pandas, as it has a steep learning curve. You need a fairly high level of SQL skill to use it. People for whom SQL is not a primary language might find it difficult to get used to.
This solution could be improved with a bridge between pandas and Spark SQL, for example one that translates pandas operations into SQL and then lets you work with the generated queries.
In a future release, it would be useful to have a real-time dashboard instead of batch updates to Power BI.
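As a rough illustration of the kind of bridge described above, here is a minimal sketch that records chained, DataFrame-style operations and emits a SQL string. The `LazyFrame` class, its methods, and the `sales` table are hypothetical names made up for this example, not part of pandas or Spark. SQLite from the Python standard library stands in for a Spark cluster so the snippet runs anywhere.

```python
import sqlite3
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical wrapper: records operations and renders them as SQL."""
    table: str
    select: tuple = ("*",)
    where: tuple = ()
    group_by: tuple = ()

    def filter(self, condition: str) -> "LazyFrame":
        # Record a row filter; nothing is executed yet.
        return replace(self, where=self.where + (condition,))

    def groupby_sum(self, key: str, value: str) -> "LazyFrame":
        # Record a group-by with a SUM aggregate over one column.
        return replace(
            self,
            select=(key, f"SUM({value}) AS {value}_sum"),
            group_by=(key,),
        )

    def to_sql(self) -> str:
        # Render the recorded operations as a single SQL statement.
        # Strings are interpolated directly, so this is unsafe for untrusted input.
        sql = f"SELECT {', '.join(self.select)} FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.where)
        if self.group_by:
            sql += " GROUP BY " + ", ".join(self.group_by)
        return sql


# Chained operations, written the way one might in a DataFrame library.
query = (
    LazyFrame("sales")
    .filter("amount > 10")
    .groupby_sum("region", "amount")
    .to_sql()
)

# Run the generated query against an in-memory SQLite table (stand-in for Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 5.0), ("north", 20.0), ("south", 30.0), ("south", 15.0)],
)
rows = sorted(conn.execute(query).fetchall())
conn.close()
```

On a Spark cluster, a generated string like `query` could in principle be passed to `spark.sql(...)` and then reviewed or tuned by hand, which is the workflow suggested above.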
For how long have I used the solution?
I have been using this solution for four years.
What do I think about the stability of the solution?
I would rate it a nine out of ten for stability. We have not had any crashes or major bugs. Occasionally the behavior is different from what we expected and we have to change the queries, but I haven't had stability issues.
What do I think about the scalability of the solution?
I haven't had major issues scaling to very large datasets. I would rate it a nine out of ten for scalability.
How are customer service and support?
We have worked with Hortonworks for the management of this solution. In terms of online help, there is a lot of information.
Which solution did I use previously and why did I switch?
We previously used pandas DataFrames, but they typically do not scale well. I have also used Dask, a distributed alternative to pandas, but neither of these solutions works as well as Spark and Spark SQL.
How was the initial setup?
In terms of setting up the on-premise cluster, it can be quite complex. I would rate it a six or seven in terms of complexity. Using it on the cloud is very straightforward.
We use it mostly on-premise, but we also have cloud instances where we use Spark, so we have a mix of use cases.
The deployment can be done by one person. Typically, we have bigger teams of at least two or three people, but one person can look after it and maintain it.
What about the implementation team?
The deployment was done jointly by our in-house team and an external Hortonworks team. Our internal team included developers and data engineers.
What's my experience with pricing, setup cost, and licensing?
The on-premise solution is quite expensive in terms of hardware, cluster setup, memory, and other resources. It depends on the use case, but in our case, with a fairly large shared cluster, it is quite expensive.
I would rate the pricing a seven or eight out of ten. However, it is easy to run into pricing issues with something like a Databricks cluster if you don't manage usage properly.
What other advice do I have?
Training is quite important to get users up to scratch with Spark and Spark SQL. Planning is needed in terms of training and skill sets. This training is particularly important for typical DevOps and MLOps deployments with pipelines. Otherwise, you may end up with lots of functionality and queries that are difficult to change, deploy, or maintain.
I would rate this solution an eight out of ten. In terms of scalability, it is very useful.
Which deployment model are you using for this solution?
Hybrid Cloud