We all know it's really hard to get good pricing and cost information. Please share what you can to help your peers.
We use the open-source version, which is free to use. However, you do need servers; we have three or four, and they can be on-premises or in the cloud.
Apache Spark is open-source. You only have to pay when you use a bundled product, such as Cloudera.
I'm unsure as to how much the licensing is for the solution. It's not an aspect of the product I deal with directly.
Apache Spark is available through cloud services such as AWS and Azure. You have to choose the specific service that fits your use case. For example, you can use AWS Glue, which runs Spark for ETL processes, or AWS EMR / Azure Databricks for on-demand data processing in the cloud. Basically, it depends on how much data you will be processing. It is recommended to get started with a minimal configuration and to stop the services when not in use.
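To illustrate what "minimal configuration" might look like, here is a sketch of a small `spark-defaults.conf`. The specific values (executor counts, memory sizes, partition count) are hypothetical assumptions for a small starter workload, not recommendations, and should be tuned to your data volume:

```
# spark-defaults.conf — a minimal starting point (illustrative values)

# Start with a small cluster: two modest executors
spark.executor.instances         2
spark.executor.memory            2g
spark.executor.cores             2

# Let Spark release executors when they are idle,
# so an on-demand cloud cluster scales down on its own
spark.dynamicAllocation.enabled  true

# Small data needs far fewer shuffle partitions than the default of 200
spark.sql.shuffle.partitions     8
```

On the cost side, the same idea applies at the service level: managed offerings such as EMR and Databricks let you terminate or auto-stop clusters when they are idle, so you only pay for the capacity while a job is actually running.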
The initial setup is straightforward. It took us around one week to set it up, after which the requirements and the creation of the project flow and design needed to be done. The design stage took three to four weeks, so in total it required four to five weeks.
I would suggest not trying to do everything at once. Identify the area where you want to solve a problem, start small, and expand incrementally, slowly broadening your vision. For example, if I have a problem that requires streaming, I focus just on the streaming and not on the machine learning that Spark also offers. It offers a lot of things, but you need to focus on one thing so that you can learn. That is what I have learned from the limited experience I have with Spark. Focus on your objective and let the tools help you, rather than letting the tools drive the work. That is my advice.
If you were talking to someone whose organization is considering Apache Spark, what would you say?
How would you rate it and why? Any other tips or advice?