Try our new research platform with insights from 80,000+ expert users
Miodrag Milojevic - PeerSpot reviewer
Senior Data Archirect at Yettel
Real User
Top 20
Parallel computing helped create data lakes with near real-time loading
Pros and Cons
  • "It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
  • "If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of D allocation."

What is our primary use case?

I use the solution for data lakes and big data solutions. I can combine it with the other program languages.

What is most valuable?

One of the reasons we use Spark is so we can use parallelism in data lakes. So in our case, we can get many data nodes, and the main power of Hadoop and big data solutions is the number of nodes usable for different operations. It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance. Also, Spark has an option for near real-time loading and processing. We use micro batches of Spark.

What needs improvement?

If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of D allocation. In combination with other tools, many sessions remain, even if you think they've stopped. This is the main problem with big data sessions, where zombie sessions reside that you have to take care of. Otherwise, they spend resources and cause problems.

For how long have I used the solution?

I've been using Apache Spark for more than two years. I'm using the latest version.

Buyer's Guide
Apache Spark
October 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: October 2025.
869,832 professionals have used our research since 2012.

What do I think about the stability of the solution?

The solution is stable, but not completely. For example, we use Exadata as an extremely stable data warehouse, but that's not possible with big data. There are things that you have to fix sometimes. The stability is similar to the cloud solution, but that depends on the solution you need.

What do I think about the scalability of the solution?

The solution is scalable, but adding new nodes is not easy. It will take some time to do that, but it's scalable. We have about 20 users using Apache Spark. We regularly use the solution.

How are customer service and support?

We use Cloudera distribution, so we ask Cloudera for support, which is not open-source.

How was the initial setup?

When you install the complete environment, you install Spark as a part of this solution. The setup can be tricky when introducing security, such as connecting Spark using Kerberos. It can be tricky because when you use it, you have to distribute your architecture with many servers, and even then, you have to prepare Kerberos on every server. It's not possible to do this in one place.

Deploying Apache Spark is pretty complex. But that is a problem with the security approach. Our security guys requested this security, so we use Kerberos authentication mandatorily, which can be complex. We had five people for maintenance and deployment, not to mention deployment or other roles.

What about the implementation team?

We had an external integrator, but we also had in-house knowledge. Sometimes, we need to change or install something, and it's not good to ask the integrator for everything because of availability and planning. We had more freedom thanks to our internal knowledge.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera. But in that case, you don't have any support. If you face a problem, you might find something in the community, but you cannot ask Cloudera about it. If you have open source, you don't have support, but you have a community. Cloudera has different packages, which are licensed versions of products like Apache Spark. In this case, you can ask Cloudera for everything.

What other advice do I have?

Spark was written in Scala. Scala is a programming language fundamentally in Java and useful for data lakes.

We thought about using Flink instead, but it wasn't useful for us and wouldn't gain any additional value. Besides, Spark's community is much wider, so information is available and is better than Flink's.

I rate Apache Spark an eight out of ten.

If you plan to implement Apache Spark on a large-scale system, you should learn to use parallelism, partitioning, and everything from the physical level to get the best performance from Spark. And it will be good to know Python, especially for data scientists using PySpark for analysis. Likewise, it's good to know Scala because you can be very efficient in preparing some datasets since it is Spark's native language.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Suriya Senthilkumar - PeerSpot reviewer
Analyst at Deloitte
Real User
Top 10
Processes a larger volume of data efficiently and integrates with different platforms
Pros and Cons
  • "The product’s most valuable features are lazy evaluation and workload distribution."
  • "They could improve the issues related to programming language for the platform."

What is our primary use case?

We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.

What is most valuable?

The product’s most valuable features are lazy evaluation and workload distribution.

What needs improvement?

They could improve the issues related to programming language for the platform. 

For how long have I used the solution?

We have been using Apache Spark for around two and a half years.

What do I think about the stability of the solution?

The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.

What do I think about the scalability of the solution?

We have more than 100 Apache Spark users in our organization.

Which solution did I use previously and why did I switch?

Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. However, Spark emerged as a superior choice comparatively.

How was the initial setup?

The initial setup complexity depends on whether it's on the cloud or on-premise. For cloud deployments, especially using platforms like Databricks, the process is straightforward and can be configured with ease. However, if the deployment is on-premise, the setup tends to be more time-consuming, although not overly complex.

What's my experience with pricing, setup cost, and licensing?

They provide an open-source license for the on-premise version. However, we have to pay for the cloud version including data centers and virtual machines.

What other advice do I have?

Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.

I rate it a ten out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Buyer's Guide
Apache Spark
October 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: October 2025.
869,832 professionals have used our research since 2012.
Ilya Afanasyev - PeerSpot reviewer
Senior Software Development Engineer at Yahoo!
Real User
Reliable, able to expand, and handle large amounts of data well
Pros and Cons
  • "There's a lot of functionality."
  • "I know there is always discussion about which language to write applications in and some people do love Scala. However, I don't like it."

What is our primary use case?

It's a root product that we use in our pipeline.

We have some input data. For example, we have one system that supplies some data to MongoDB, for example, and we pull this data from MongoDB, enrich this data from other systems - with some additional fields - and write to S3 for other systems. Since we have a lot of data, we need a parallel process that runs hourly.

What is most valuable?

We use batch processing. It works well with our formats and file versions. There's a lot of functionality. 

In our pipeline each hour, we make a copy of data from MongoDB, of the changes from MongoDB to some specific file. Each time pipeline copied all of the data, it would do it each time without changes to all of the tables. Tables have a lot of data, and in the last MongoDB version, there is a possibility to read only changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000.

The solution is scalable.

It's a stable product.

What needs improvement?

The primary language for developers on Spark is Scala. Now it's also about Java. I prefer Java versus Scala, and since they are supported, it is good. I know there is always discussion about which language to write applications in, and some people do love Scala. However, I don't like it.

They use currently have a JDK version which is a little bit old. Not all features are on it. Maybe they should pull support of the JDK version.

For how long have I used the solution?

I've used the solution for a year and a half. 

What do I think about the stability of the solution?

The solution is stable. There are no bugs or glitches. It doesn't crash or freeze. 

What do I think about the scalability of the solution?

The product scales well. It's fine to expand if needed. 

Many teams use Spark. For example, we have a few kinds of pipelines, huge pipelines. One of them processes 300 billion events each day. It's our core technology currently.

We do not plan to increase usage. We keep our legacy system on Spark, and we are now discussing Flink and Spark and what we would prefer. However, most of the people are already migrating new systems to Flink. We will keep Spark for a few more years still. 

How are customer service and support?

We have an internal team, and they participate in process of developing Spark. They are Spark contributors, and if we have some problems, we turn to them. It's our own people, yet they work with Spark. Generally, if the problem is more minor, we look at some sites or have some discussion about Spark or internal guys who have experience with Spark. 

Which solution did I use previously and why did I switch?

We also use Flink.

Before Spark, I worked with another company that we used some different technology, including Kafka, Radius, Postgres SQL, S3, and Spring. 

How was the initial setup?

I didn't handle the initial setup. We were using this pipeline and clusters already. I just installed it on my local server. However, in terms of difficulty, I didn't see any problem. The deployment might only take a few hours. 

I found some documentation. I got the documentation from the site and downloaded the archive and unzipped it, and installed it. I can't say that I installed something from a special configuration. I just installed a few nodes for debugging and for running locally, and that's all. Also, in one case I used, for example, a Docker configuration with Spark. It all worked fine.

What's my experience with pricing, setup cost, and licensing?

It's an open-source product. I don't know much about the licensing aspect. 

Which other solutions did I evaluate?

We have compared Flink and Spark as two possible options. 

What other advice do I have?

I can recommend the product. It's a nice system for batch processing huge data.

I'd rate the solution eight out of ten. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Lokesh Jayanna - PeerSpot reviewer
Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Real User
Top 5
Stable product with a valuable SQL tool
Pros and Cons
  • "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
  • "At the initial stage, the product provides no container logs to check the activity."

What is our primary use case?

We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.

What is most valuable?

The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.

What needs improvement?

At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers could start quickly, similar to that of Jupyter Notebook.

For how long have I used the solution?

We have been using Apache Spark for eight months to one year.

What do I think about the stability of the solution?

It is a stable product. I rate its stability an eight out of ten.

What do I think about the scalability of the solution?

We have 45 Apache Spark users. I rate its scalability a nine out of ten.

How was the initial setup?

The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.

What's my experience with pricing, setup cost, and licensing?

The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.

What other advice do I have?

I advise others to analyze data and understand your business requirements before purchasing the product. I rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Jagannadha Rao - PeerSpot reviewer
Lead Data Scientist at International School of Engineering
Real User
Top 5
A flexible solution that can be used for storage and processing
Pros and Cons
  • "The most valuable feature of Apache Spark is its flexibility."
  • "Apache Spark's GUI and scalability could be improved."

What is our primary use case?

We use Apache Spark for storage and processing.

What is most valuable?

The most valuable feature of Apache Spark is its flexibility.

What needs improvement?

Apache Spark's GUI and scalability could be improved.

For how long have I used the solution?

I have been using Apache Spark for four to five years.

What do I think about the scalability of the solution?

Around 15 data scientists are using Apache Spark in our organization.

How was the initial setup?

Apache Spark's initial setup is slightly complex compared to other other solutions. Data scientists could install our previous tools with minimal supervision, whereas Apache Spark requires some IT support. Apache Spark's installation is a time-consuming process because it requires ensuring that all the ports have been accessed properly following certain guidelines.

What about the implementation team?

While installing Apache Spark, I must look at the documentation and be very specific about the configuration settings. Only then I'll be able to install it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an expensive solution.

What other advice do I have?

I would recommend Apache Spark to other users.

Overall, I rate Apache Spark an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer1759647 - PeerSpot reviewer
Information Technology Business Analyst at a aerospace/defense firm with 10,001+ employees
Real User
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
  • "The product is useful for analytics."
  • "The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Armando Becerril - PeerSpot reviewer
Partner / Head of Data & Analytics at Intelligence Software Consulting
Real User
Top 10
Great for machine learning applications; good documentation available
Pros and Cons
  • "Provides a lot of good documentation compared to other solutions."
  • "The migration of data between different versions could be improved."

What is our primary use case?

We use Spark for machine learning applications, clustering, and segmentation of customers.

What is most valuable?

Apache provides a lot of good documentation compared to other solutions. 

What needs improvement?

The migration of data between different versions could be improved. 

For how long have I used the solution?

I've been using this solution for four years. 

What do I think about the stability of the solution?

The solution is stable. 

What do I think about the scalability of the solution?

The solution is scalable. 

How are customer service and support?

If you pay for customer support then you get a quick and efficient response, otherwise the community support offers good help. 

How was the initial setup?

The initial setup has been simplified over the past few years and is now relatively straightforward. 

What's my experience with pricing, setup cost, and licensing?

Licensing costs depend on where you source the solution. 

What other advice do I have?

This is a good solution for big data use cases and I rate it eight out of 10. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
CTO at Hammerknife
Real User
Top 5Leaderboard
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: October 2025
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.