Lokesh Jayanna - PeerSpot reviewer
Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Real User
Nov 26, 2023
Stable product with a valuable SQL tool
Pros and Cons
  • "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
  • "At the initial stage, the product provides no container logs to check the activity."

What is our primary use case?

We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.

What is most valuable?

The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.

What needs improvement?

At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers should start quickly, as they do in Jupyter Notebook.

For how long have I used the solution?

We have been using Apache Spark for eight months to one year.

Buyer's Guide
Apache Spark
December 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: December 2025.
879,425 professionals have used our research since 2012.

What do I think about the stability of the solution?

It is a stable product. I rate its stability an eight out of ten.

What do I think about the scalability of the solution?

We have 45 Apache Spark users. I rate its scalability a nine out of ten.

How was the initial setup?

The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.

What's my experience with pricing, setup cost, and licensing?

The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.

What other advice do I have?

I advise others to analyze their data and understand their business requirements before purchasing the product. I rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Jagannadha Rao - PeerSpot reviewer
Lead Data Scientist at a university with 51-200 employees
Real User
Oct 24, 2023
A flexible solution that can be used for storage and processing
Pros and Cons
  • "The most valuable feature of Apache Spark is its flexibility."
  • "Apache Spark's GUI and scalability could be improved."

What is our primary use case?

We use Apache Spark for storage and processing.

What is most valuable?

The most valuable feature of Apache Spark is its flexibility.

What needs improvement?

Apache Spark's GUI and scalability could be improved.

For how long have I used the solution?

I have been using Apache Spark for four to five years.

What do I think about the scalability of the solution?

Around 15 data scientists are using Apache Spark in our organization.

How was the initial setup?

Apache Spark's initial setup is slightly complex compared to other solutions. Data scientists could install our previous tools with minimal supervision, whereas Apache Spark requires some IT support. Apache Spark's installation is a time-consuming process because it requires ensuring that all the required ports are properly accessible, following certain guidelines.

What about the implementation team?

While installing Apache Spark, I must look at the documentation and be very specific about the configuration settings. Only then will I be able to install it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an expensive solution.

What other advice do I have?

I would recommend Apache Spark to other users.

Overall, I rate Apache Spark an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Data Engineer at a manufacturing company with 51-200 employees
Real User
Aug 3, 2023
A useful and easy-to-deploy product that has an excellent data processing framework
Pros and Cons
  • "The data processing framework is good."
  • "The solution must improve its performance."

What is our primary use case?

Our customers configure their software applications, and I use Apache Spark to check them. We use it for data processing.

What is most valuable?

The data processing framework is good. The product is very useful.

What needs improvement?

The solution must improve its performance.

For how long have I used the solution?

I have been using the solution for four to five years.

What do I think about the stability of the solution?

The tool is stable. I rate the stability more than nine out of ten.

What do I think about the scalability of the solution?

We have a small business. Around four people in my organization use the solution.

How was the initial setup?

The deployment was easy.

What about the implementation team?

The solution was deployed with the help of third-party consultants.

What other advice do I have?

Overall, I rate the product more than eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer2208003 - PeerSpot reviewer
Quantitative Developer at a marketing services firm with 11-50 employees
Real User
Jul 12, 2023
Seamless in distributing tasks, including its impressive map-reduce functionality
Pros and Cons
  • "The distribution of tasks, like the seamless map-reduce functionality, is quite impressive."
  • "When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."

What is our primary use case?

Predominantly, I use Spark for data analysis on top of datasets containing tens of millions of records.

How has it helped my organization?

I have an example. We had a single-threaded application that used to run for about four to five hours, but with Spark, it got reduced to under one hour.

What is most valuable?

The distribution of tasks, like the seamless map-reduce functionality, is quite impressive. For the user, it appears as simple single-line data manipulations, but behind the scenes, the executor pool intelligently distributes the map and reduce functions.
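
To make the map and reduce steps concrete, here is a minimal sketch in plain Python. It is not Spark's API and not how Spark's executors are implemented; the thread pool only stands in for an executor pool, and the names count_words, merge_counts, and word_count are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(partition):
    """Map step: count the words in one partition of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(partitions, workers=4):
    """Run the map step on a worker pool, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partials, Counter())


# Hypothetical input: two partitions of text lines.
partitions = [
    ["spark maps work", "spark reduces results"],
    ["executors run tasks", "spark schedules tasks"],
]
totals = word_count(partitions)
```

The caller only sees the single word_count call, which is roughly the "single-line" experience described above; the splitting and merging happen behind it.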

What needs improvement?

The visualization could be improved.

For how long have I used the solution?

I have been working with Apache Spark for only a few months, not too long.

What do I think about the stability of the solution?

I haven't faced any stability issues. It has been stable in my experience.

What do I think about the scalability of the solution?

When it comes to the scalability of Spark, it's primarily a processing engine, not a database engine. I haven't tested it extensively with large record sizes.

In my organization, quite a few people are using Spark. In my smaller team, there are only two users.

What about the implementation team?

In terms of maintenance, when the load hits around 95%, we need to prioritize scripts and analysis within the team. 

We coordinate and prioritize based on the available resources. If there were self-service tools or better hand-holding for such situations, it would make things easier.

Which other solutions did I evaluate?

Currently, we extensively use pandas and Polars. We are leveraging Docker and Kubernetes as a framework, along with AWS Batch for distribution. This is the closest substitute we have for Spark Distribution.

Both Docker and Kubernetes are more general-purpose solutions. If someone is already using Kubernetes and it's provided as a service, it can be used for special-purpose utilization, similar to Docker and Kubernetes.


In such cases, users may need to write the parallelization logic themselves, but it's relatively easy to onboard and start with a distributed load. Spark, on the other hand, is primarily used for special-purpose utilization. Users typically choose Spark when they have data-intensive tasks.

Another significant issue with Spark is its syntax. For instance, if we have libraries like pandas or Polars, we can run them single-threaded on a single core, or we can distribute them leveraging Kubernetes.

We don't need to rewrite that code base for Spark. However, if we are writing code specifically for Spark Executors, it will not be amenable to running it locally.
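
As a rough sketch of that portability point, the plain-Python example below writes the business logic once and then runs it either on a single core or fanned out over chunks. Everything here is hypothetical: the thread pool merely stands in for remote workers such as Kubernetes pods or AWS Batch jobs, and the function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def double_amounts(records):
    """Business logic written once, with no knowledge of how it is scheduled."""
    return [{"id": r["id"], "amount": r["amount"] * 2} for r in records]


def split(records, size):
    """Hand-written parallelization logic: cut the input into fixed-size chunks."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_local(records):
    """Single-threaded run on one core."""
    return double_amounts(records)


def run_distributed(records, chunk_size=2, workers=3):
    """Same function, fanned out over a pool that stands in for remote workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(double_amounts, split(records, chunk_size)))
    # pool.map preserves input order, so flattening restores the original order.
    return [row for chunk in chunks for row in chunk]


# Hypothetical sample data.
records = [{"id": i, "amount": i * 10} for i in range(5)]
local_result = run_local(records)
distributed_result = run_distributed(records)
```

The chunking in split is the extra effort mentioned earlier, but double_amounts itself does not change between the two runs.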

What other advice do I have?

I would recommend understanding the use case better. Go for it only if it fits your use case. But it is a great tool.

Overall, I would rate Apache Spark an eight out of ten. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Director of Engineering at a tech services company with 501-1,000 employees
Real User
Aug 1, 2022
Easy to code, fast, open-source, very scalable, and great for big data
Pros and Cons
  • "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
  • "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

What is our primary use case?

I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

How has it helped my organization?

Spark has been at the forefront of data processing engines. I have used Apache Spark for multiple projects for different clients. It is an excellent tool to process massive amounts of data.

What is most valuable?

Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.

Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.

What needs improvement?

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.

For how long have I used the solution?

I have been using this solution for around 7 years.

What do I think about the stability of the solution?

There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to version 1.6, and it seems that this issue was fixed later on.

What do I think about the scalability of the solution?

It is very scalable. You can scale it a lot.

How are customer service and support?

I haven't contacted them.

How was the initial setup?

The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premises environment, and the initial setup was a little cumbersome. It required a lot of setup on Unix systems. We also had to do a lot of configurations and install a lot of things.

After I moved to the Cloudera CDH version, it was a little easier. It is a bundled product, so you just install whatever you want and use it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.

What other advice do I have?

I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.

I would rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
AmitMataghare - PeerSpot reviewer
Associate Director at a consultancy with 10,001+ employees
Real User
Apr 29, 2022
High performance, beneficial in-memory support, and useful online community support
Pros and Cons
  • "One of Apache Spark's most valuable features is that it supports in-memory processing, so jobs execute very fast compared to traditional tools."
  • "Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."

What is our primary use case?

Apache Spark is a data processing framework that you program in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.

What is most valuable?

One of Apache Spark's most valuable features is that it supports in-memory processing, so jobs execute very fast compared to traditional tools.

What needs improvement?

Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors.

For how long have I used the solution?

I have been using Apache Spark for approximately five years.

What do I think about the stability of the solution?

Apache Spark is stable.

What do I think about the scalability of the solution?

I have found Apache Spark to be scalable.

How are customer service and support?

Apache Spark is open-source, so there is no team that will give you dedicated support, but you can post your queries on the community forums, and usually, you will receive a good response. Since it's open-source, you depend on freelance developers to respond to you, and you cannot put a time limit on that, but the response, on average, is pretty good.

How was the initial setup?

If Apache Spark is in the cloud, setting it up will require only minutes. If it's on Amazon, GCP, or Microsoft cloud, it'll take minutes to set everything up. However, if you are using the on-premise version, then it might take some time to set up the environment.

What other advice do I have?

I rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Senior Test Automation Specialist at a financial services firm with 501-1,000 employees
Real User
Top 20
Feb 26, 2022
Useful for big data and scientific purposes, but needs better query handling, stability, and scalability
Pros and Cons
  • "It is useful for handling large amounts of data. It is very useful for scientific purposes."
  • "We are building our own queries on Spark, and it can be improved in terms of query handling."

What is our primary use case?

We are using it for big data. We are using a small part of it, which is related to using data.

What is most valuable?

It is useful for handling large amounts of data. It is very useful for scientific purposes.

What needs improvement?

There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.

They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.

We are building our own queries on Spark, and it can be improved in terms of query handling.

For how long have I used the solution?

In my company, it has been used for several years, but I have been using it for seven months.

What do I think about the scalability of the solution?

It is not scalable. Scalability is one of the issues.

How are customer service and support?

It is open source from my point of view. So, there is no support.

What other advice do I have?

I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.

Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at a computer software company with 11-50 employees
Real User
Top 5
Jan 13, 2022
Handles large volume data, cloud and on-premise deployments, but difficult to use
Pros and Cons
  • "Apache Spark can do large volume interactive data analysis."
  • "Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."

What is our primary use case?

The solution can be deployed on the cloud or on-premise.

How has it helped my organization?

We are using Apache Spark for large-volume interactive data analysis.

MechBot is an enterprise, one-click installation, trusted data excellence platform. Underneath, I am using Apache Spark, Kafka, Hadoop HDFS, and Elasticsearch.

What is most valuable?

Apache Spark can do large volume interactive data analysis.

What needs improvement?

Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn.

For how long have I used the solution?

I have been using Apache Spark for approximately 11 years.

What do I think about the stability of the solution?

The solution is stable.

What do I think about the scalability of the solution?

Apache Spark is scalable. However, it needs enormous technical skills to make it scalable. It is not a simple task.

We have approximately 20 people using this solution.

How was the initial setup?

If you want to distribute Apache Spark in a certain way, it is simple. Not every engineer can do it, though. You need specialized DevOps skills on Spark.

If we are going to deploy the solution in a one-layer laptop installation, it is very straightforward, but this is not what someone is going to deploy in the production site.

What's my experience with pricing, setup cost, and licensing?

Since we are using the Apache Spark version, not the Databricks version, it is the Apache-licensed version, and support and bug resolution are delayed. The Apache license is free.

What other advice do I have?

We are well versed in Spark, the version, the internal structure of Spark, and we know what exactly Spark is doing. 

The solution cannot be easier. Everything cannot be made simpler because it involves core data, computer science, pro-engineering, and not many people are actually aware of it.

I rate Apache Spark a six out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user