What needs improvement with Apache Spark?

Please share with the community what you think needs improvement with Apache Spark.

What are its weaknesses? What would you like to see changed in a future version?

Miriam Tover

Service Delivery Manager at PeerSpot

Helped 900,125 peers since 2012

39 Answers

Last answered Feb 27, 2026

Michael Lierheimer

Consultant, Chief Engineer, Teamleiter at infoteam Software AG

Consultant

Top 20

Feb 27, 2026

I find that I really lack the technical depth to make any recommendations for future updates of Apache Spark. I used it for two years for our prototype work and for testing, but because I had no final project that was released and running at a customer site, I cannot say what I would expect if I wanted to use it in a real project. Regarding the current licensing cost, I would say it is in the medium range. However, because I do not have a licensed project for our customer, I do not know if it would be too high for our customers if they had to buy the license themselves. For me, compared to other things, the licensing was acceptable.

Devindra Weerasooriya

Data Architect at Devtech

Real User

Top 10

Nov 20, 2025

Areas for improvement are obviously ease of use considerations, though there are limitations in doing that, so while various tools like Informatica, TIBCO, or Talend offer specific aspects, licensing can be costly; I prefer to work this way, which does not imply being anti-tooling, but since your focus was on my technology, these will continue to be my technologies.

Omar Khaled

Data Engineer at a tech company with 10,001+ employees

Real User

Top 10

Aug 11, 2025

Regarding Apache Spark, I have only used Apache Spark Structured Streaming, not the machine learning components. I am uncertain about specific improvements needed today. However, after five years, there will be many new cloud providers, connectors, and solutions. Every year, there should be some enhancement to remain competitive in the market. As new solutions emerge frequently, the basic improvement would be to have integration with these solutions.

KamleshPant

Senior Software Architect at USEReady

MSP

Top 5 Leaderboard

Apr 24, 2025

There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, worker nodes, stages, and tasks. Additionally, clustering may be a bit complex to set up.
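
As a rough illustration of the terms in the answer above: Spark splits a job into stages at shuffle boundaries, and each stage runs as one task per partition, with the driver scheduling those tasks onto the workers. The toy model below is plain Python, not Spark's API. The function names `split_into_stages` and `count_tasks`, the narrow/wide tags, and the partition count of 4 are all assumptions made up for this sketch.

```python
# Toy model of how a Spark job breaks into stages and tasks.
# This is NOT Spark code: the function names and the op list are hypothetical.

def split_into_stages(ops):
    """Group operations into stages; a wide (shuffle) operation starts a new stage."""
    stages = []
    for name, kind in ops:
        if kind == "wide" or not stages:
            stages.append([])
        stages[-1].append(name)
    return stages


def count_tasks(stages, partitions):
    """Each stage runs one task per partition (simplified: same count in every stage)."""
    return len(stages) * partitions


# A hypothetical job: three narrow steps, one shuffle, one more narrow step.
job = [
    ("read", "narrow"),
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
]

stages = split_into_stages(job)
total_tasks = count_tasks(stages, partitions=4)

for index, stage in enumerate(stages):
    print(f"stage {index}: {' -> '.join(stage)}")
print(f"tasks scheduled by the driver: {total_tasks}")
```

In a real cluster the partition count can differ between stages, so the task total is only indicative here.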

Bharghava Raghavendra Beesa

Senior Developer at Infosys

MSP

Top 5 Leaderboard

Jan 21, 2025

The Spark solution could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks, so it requires an external scheduler such as Airflow, or scripts. For instance, one task should trigger another based on completion; however, Spark can't manage these dependent loads. We focus on specific compute tasks that we can deliver.
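
The "scripts" workaround mentioned above can be sketched in a few lines of standard-library Python (3.9 or later). This is only a sketch under assumptions: the job names, the `run_in_dependency_order` helper, and the placeholder callables are hypothetical, and in practice each callable would probably launch a Spark job and check its exit status before the next one starts.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(jobs, depends_on):
    """Run each job only after every job it depends on has finished.

    jobs:       mapping of job name -> zero-argument callable (hypothetical stand-ins)
    depends_on: mapping of job name -> set of job names that must run first
    """
    finished = []
    for name in TopologicalSorter(depends_on).static_order():
        jobs[name]()  # a real script might invoke spark-submit here
        finished.append(name)
    return finished


log = []

# Placeholder jobs that only record that they ran.
jobs = {
    "ingest": lambda: log.append("ingest done"),
    "transform": lambda: log.append("transform done"),
    "report": lambda: log.append("report done"),
}

# transform waits for ingest; report waits for transform.
depends_on = {
    "transform": {"ingest"},
    "report": {"transform"},
}

order = run_in_dependency_order(jobs, depends_on)
print(order)
```

A dedicated scheduler such as Airflow adds retries, monitoring, and time-based triggers on top of this kind of ordering, which is why the answer names it as the usual companion to Spark.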

Aleksandr Motuzov

Head of Data Science center of excellence at Ameriabank CJSC

Real User

Top 5 Leaderboard

Sep 23, 2024

The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Consequently, alternatives like Doc DB are preferable. Additionally, Spark is slower in some cases, with alternatives running two to five times faster.

Dunstan Matekenya

Data Scientist at a financial services firm with 10,001+ employees

Real User

Top 5 Leaderboard

Jul 10, 2024

Apache Spark lacks geospatial data support.

Ilya Afanasyev

Senior Software Development Engineer at Yahoo!

Real User

Aug 3, 2022

The primary language for developers on Spark is Scala, though Java is now supported as well. I prefer Java to Scala, so it is good that both are supported. I know there is always discussion about which language to write applications in, and some people do love Scala. However, I don't like it. They currently use a JDK version that is a little bit old, and not all features are available on it. Maybe they should update the supported JDK version.

Gopi Krishnan

Works at Ideas2IT Technologies

Real User

Jun 10, 2020

There is still plenty of room for improvement in Apache Spark in terms of integration and speed. The Apache Spark community could use Rust or C++ implementations to improve performance.

Suriya Senthilkumar

Analyst at Deloitte

Real User

Feb 26, 2024

They could improve the issues related to programming language for the platform.

Hamid M. Hamid

Data architect at Banking Sector

Real User

Feb 5, 2024

The product has matured at the moment. The product's interoperability is an area of concern where improvements are required. Apache Spark can be integrated with high-tech tools like Informatica. Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users.

Suresh_Srinivasan

Co-Founder at FORMCEPT Technologies

Real User

Jan 31, 2024

In data analysis, you need to take real-time data from different data sources. You need to process it in under a second and do the transformation in under a second.

Sachin Shukre

Sr Manager at a transportation company with 10,001+ employees

Real User

Dec 6, 2023

Apart from the restrictions that come with its in-memory implementation, it has been improved significantly up to version 3.0, which is currently in use. Once I get those insights, I can let you know if the restrictions have been overcome. For example, there is an issue with heap memory getting full in version 1.6. There are other improvements in 3.0, so I will check those. In future releases, I would like to see the cost reduced.

reviewer1283880

CEO International Business at a tech services company with 1,001-5,000 employees

MSP

Nov 10, 2023

It requires overcoming a significant learning curve due to its robust and feature-rich nature.

Jagannadha Rao

Lead Data Scientist at International School of Engineering

Real User

Oct 20, 2023

Apache Spark's GUI and scalability could be improved.

Farzam Khodaei

Data Engineer at Berief Food GmbH

Real User

Jul 26, 2023

The solution must improve its performance.

reviewer2208003

Quantitative Developer at a marketing services firm with 11-50 employees

Real User

Jul 6, 2023

The visualization could be improved.

Armando Becerril

Partner / Head of Data & Analytics at Intelligence Software Consulting

Real User

Feb 13, 2023

The migration of data between different versions could be improved.

reviewer1904019

Chief Technology Officer at a tech services company with 11-50 employees

Real User

Jul 4, 2022

Apache Spark can improve the use case scenarios from the website. There is not any information on how you can use the solution across the relational databases toward multiple databases.

AmitMataghare

Associate Director at a consultancy with 10,001+ employees

Real User

Apr 27, 2022

Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors.

Salvatore Campana

CEO & Founder at Xautomata

Real User

Apr 27, 2022

An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.

Onur Tokat

Big Data Engineer Consultant at Collective[i]

Consultant

Feb 15, 2022

Spark could be improved by adding support for other open-source storage layers than Delta Lake. The UI could also be enhanced to give more data on resource management.

Suresh_Srinivasan

Co-Founder at FORMCEPT Technologies

Real User

Dec 28, 2021

Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn.

Oscar Estorach

Chief Data Strategist And Director at theworkshop.es

Real User

Top 20

Aug 18, 2021

If you are developing projects, and you need to not put them in a production scenario, you might need more than a cluster of servers, as it requires distributed computing. It's not easy to install. You are typically dealing with a big data system. It's not a simple, straightforward architecture.

reviewer1535340

Senior Solutions Architect at a retailer with 10,001+ employees

Real User

Mar 27, 2021

The logging for the observability platform could be better.

NitinKumar

Director of Engineering at Sigmoid

Real User

Feb 1, 2021

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
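
For context on the separate setup described above, the history server reads event logs that each application has to be configured to write. The sketch below, in plain Python, only renders the relevant lines for a `spark-defaults.conf` file. The three property names are Spark's documented settings, but the helper names and the `hdfs:///spark-events` directory are assumptions chosen for illustration.

```python
# Sketch: build the settings that let applications write event logs
# and let the history server read them from the same place.
# The helper names and the log directory are hypothetical.

def history_server_conf(log_dir):
    """Return the event-log and history-server properties for one shared directory."""
    return {
        "spark.eventLog.enabled": "true",          # applications write event logs
        "spark.eventLog.dir": log_dir,             # where applications write them
        "spark.history.fs.logDirectory": log_dir,  # where the history server reads them
    }


def render_spark_defaults(conf):
    """Render properties in the 'key value' layout used by spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


conf = history_server_conf("hdfs:///spark-events")  # assumed path
text = render_spark_defaults(conf)
print(text)
```

After these settings are in place, the history server still has to be started as its own process (Spark ships the `sbin/start-history-server.sh` script for this), which is the extra step the answer would like to see folded into the default setup.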

Kürşat Kurt

Software Architect at Akbank

Real User

Oct 28, 2020

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.

Rajendran Veerappan

Director at Nihil Solutions

Real User

Jul 23, 2020

There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good. The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate. There should be more information shared to the user. The solution already has all the information tracked in the cluster. It just needs to be accessible or searchable.

KamleshKhollam

Managing Consultant at a computer software company with 501-1,000 employees

Real User

Feb 2, 2020

I would like to see integration with data science platforms to optimize the processing capability for these tasks.

Suresh_Srinivasan

Co-Founder at FORMCEPT Technologies

Real User

Jan 29, 2020

We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data.

it_user1223676

Lead Consultant at a tech services company with 51-200 employees

Consultant

Jan 29, 2020

We use big data manager but we cannot use it as conditional data, so whenever we're trying to fetch the data, it takes a bit of time. There is some latency in the system and latency in the data caching. The main issue is that we need to design it in a way that data will be available to us very quickly. It takes a long time, and the latest data should be available to us much quicker.

reviewer879201

Technical Consultant at a tech services company with 1-10 employees

Consultant

Dec 23, 2019

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as if you want to use Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows where you have the whole solution and it does the back-end. So I think that kind of solution would help. But still, it can do everything for a data scientist. Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best. Overall, it offers everything that I can imagine right now.

Mohamed Ghorbel

Director of BigData Offer at IVIDATA

Real User

Dec 9, 2019

The solution needs to optimize shuffling between workers.

reviewer1046250

Senior Consultant & Training at a tech services company with 51-200 employees

Consultant

Oct 13, 2019

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable. When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

Snrsecengin567

Snr Security Engineer at a tech vendor with 201-500 employees

Real User

Jul 14, 2019

The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.

it_user946074

Principal Architect at a financial services firm with 1,001-5,000 employees

Real User

Jul 10, 2019

The search could be improved. Usually, we are using other tools to search for specific stuff. We'll be using it how I use other tools, to get the details, but if there is any way to search for little things, that would be better. It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster. In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script.

it_user1059558

Portfolio Manager, Enterprise Solutions Architect at Capgemini

Real User

Apr 8, 2019

Better data lineage support.

Sumanth Punyamurthula

Director - Data Management, Governance and Quality at Hilton Worldwide

Real User

Mar 17, 2019

It is like going back to the '80s for the complicated coding that is required to write efficient programs.

reviewer894894

Solutions Architect at a computer software company with 51-200 employees

User

Jun 27, 2018

I would suggest for it to support more programming languages, and also provide an internal scheduler to schedule spark jobs with monitoring capability.
