Apache Spark Room for Improvement
Apache Spark lacks built-in support for geospatial data.
The Spark solution could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks; it requires an environment like the Airflow scheduler, or scripts. For instance, one task should trigger another on completion, but Spark cannot manage these dependent loads by itself. We focus on the specific compute tasks that we can deliver.
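Since Spark has no built-in orchestrator, dependent jobs are typically chained with an external scheduler. Below is a minimal sketch of the Airflow approach the reviewer mentions; the DAG id, job-script paths, and connection id are hypothetical placeholders, and it assumes the apache-airflow-providers-apache-spark package and a recent Airflow release.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="spark_dependent_jobs", start_date=datetime(2025, 1, 1),
         schedule="@daily", catchup=False) as dag:
    extract = SparkSubmitOperator(
        task_id="extract",
        application="/jobs/extract.py",    # hypothetical job script
        conn_id="spark_default",
    )
    transform = SparkSubmitOperator(
        task_id="transform",
        application="/jobs/transform.py",  # hypothetical job script
        conn_id="spark_default",
    )
    # Airflow, not Spark, enforces the dependency: transform starts only
    # after extract completes successfully.
    extract >> transform
```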
View full review »The only issue I faced with the tool was that I used to choose the compute device to support parallel processing, and it has to be more like scaling up horizontally. The tool should be more scalable, not in terms of increasing the CPU or something, but more in the area of units. If two units are not enough, the third or fourth unit should be able to come into the picture.
From my perspective, the only thing that needs improvement is the interface, which is not easy to understand. Sometimes I get an RDD-related error, and it becomes difficult to work out where things went wrong. When I deal with datasets using the Pandas library in Python, I can apply a function to each column and get a transformed column back. Trying the same thing with Apache Spark works, but it is not straightforward; I need to approach it a little differently, and even then it sometimes throws an error saying it is looping back on itself, errors I never got in Pandas.
In future updates, the tool should be made more user-friendly. I want to run fifty parallel processes rather than one, and to pick particular columns to split by partition, so if the tool offered that clarity and flexibility, it would be good.
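To illustrate the pandas-versus-Spark gap described above, here is a minimal sketch of the same per-column transformation in both APIs; the column name and data are made up.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: apply a function directly to a column
pdf = pd.DataFrame({"amount": [1.0, 2.0, 3.0]})
pdf["amount_taxed"] = pdf["amount"].apply(lambda x: x * 1.2)

# PySpark: the same transform is written as a column expression, which the
# engine can optimize; a Python UDF would also work but runs slower.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf[["amount"]])
sdf = sdf.withColumn("amount_taxed", F.col("amount") * 1.2)
sdf.show()
```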
There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, worker nodes, stages, and tasks. Additionally, clustering may be a bit complex to set up.
Aleksandr Motuzov
Head of Data Science center of excellence at Ameriabank CJSC
The main concern is the overhead of Java when distributed processing is not necessary. In such cases, the operations can often be done on one node, making Spark's distributed mode unnecessary; consequently, alternatives like Doc DB are preferable. Additionally, performance is slower in some cases, with alternatives being two to five times faster.
Sachin Shukre
Sr Manager at a transportation company with 10,001+ employees
The main drawbacks are the restrictions that come with its in-memory implementation. It has improved significantly up to version 3.0, which is the version currently in use.
Once I get those insights, I can let you know whether the restrictions have been overcome. For example, there is an issue with heap memory filling up in version 1.6. There are other improvements in 3.0, so I will check those.
In future releases, I would like to reduce the cost.
For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial.
There could be enhancements in optimization techniques, as there are some limitations in this area that could be addressed to further refine Spark's performance.
If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of deallocation issues. In combination with other tools, many sessions remain alive even if you think they've stopped. This is the main problem with big data sessions: zombie sessions linger, and you have to take care of them, otherwise they consume resources and cause problems.
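One way to keep sessions from lingering is to stop them explicitly even on failure. A minimal sketch, assuming a short-lived PySpark job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("short-lived-job").getOrCreate()
try:
    spark.range(1_000_000).selectExpr("sum(id) AS total").show()
finally:
    # Stop the session deterministically so executors are released even when
    # the job above raises, instead of leaving a zombie session behind.
    spark.stop()
```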
The primary language for developers on Spark is Scala, and now Java is also an option. I prefer Java over Scala, and since both are supported, that is good. I know there is always discussion about which language to write applications in, and some people do love Scala; however, I don't like it.
The JDK version they currently support is a little bit old, and not all features are available on it. Maybe they should pull support for that old JDK version.
There can be challenges in getting a good developer for Apache Spark; finding people in the market with the right skill set is tough. This can be considered an area for improvement around the product.
At times during the deployment process, the tool goes down, making it look less robust. To take care of the issues in the deployment process, users need to do manual interventions occasionally. I feel that the use of large datasets can be a cause of concern during the tool's deployment phase, making it an area where improvements are required.
Vineeth Marar
Cloud solution architect
The setup I worked on was really complex.
The product is mature at this point. Its interoperability is an area of concern where improvements are required.
Apache Spark can be integrated with high-end tools like Informatica, but deploying and running such tools on Apache Spark requires an engineer's technical expertise. Making that process easier for users is an area where improvement is required.
Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.
It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework.
Apache Spark should add some resource-management improvements to its algorithms, so the solution can handle skew more efficiently through the physical and logical plans across the different data sets you are joining.
One limitation is that not all machine learning libraries and models support it. While libraries like scikit-learn may work with some Spark-compatible models, not all machine learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of using Spark directly for training.
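A minimal sketch of that workaround, sampling a Spark DataFrame down and training scikit-learn locally; the columns and sampling fraction are made up:

```python
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(float(i), 2.0 * i) for i in range(10_000)], ["x", "y"])

# Shrink the data before leaving Spark, then train outside of it.
pdf = sdf.sample(fraction=0.1, seed=42).toPandas()
model = LinearRegression().fit(pdf[["x"]], pdf["y"])
print(model.coef_)
```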
The product could improve the user interface and make it easier for new users. It has a steep learning curve.
They could improve the programming-language support for the platform.
At the initial stage, the product provides no container logs for checking activity. It remains inactive for a long time without giving us any information. The containers should start quickly, similar to a Jupyter Notebook.
Apache Spark's GUI and scalability could be improved.
Farzam Khodaei
Data Engineer at Berief Food GmbH
The solution must improve its performance.
The visualization could be improved.
The migration of data between different versions could be improved.
View full review »If you are developing projects, and you need to not put them in a production scenario, you might need more than a cluster of servers, as it requires distributed computing.
It's not easy to install. You are typically dealing with a big data system.
It's not a simple, straightforward architecture.
Slaven Batnozic
CTO at Hammerknife
There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.
Marco Amhof
PLC Programmer at Alzero
This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed.
There is also limited Python compatibility, which should be improved.
Nitin Kumar
Director of Engineering at Sigmoid
Its UI can be better. Maintaining the history server is a little cumbersome and should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to create a history server separately and run it; such things could be made easier. Instead of installing the history server separately, it could be made part of the whole setup, so that it is available as soon as you set Spark up.
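For context, these are the event-log settings the history server reads today; the server itself is still launched separately with sbin/start-history-server.sh, which is the extra step the reviewer objects to. The HDFS path below is a made-up example.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("job-with-history")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")  # history server reads events from here
    .getOrCreate()
)
```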
Apache Spark could improve the connectors it supports. There are a lot of open-source and cloud databases in the market, such as Redshift, Snowflake, and Synapse, and Apache Spark should ship connectors for them. Today a lot of workarounds are required to connect to those databases; the connectors should be built in.
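Until first-class connectors exist, the usual workaround is the generic JDBC source. A minimal sketch for Redshift, assuming the Redshift JDBC driver jar is on the classpath; the URL, table, and credentials are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-host:5439/dev")  # placeholder endpoint
    .option("dbtable", "public.sales")                       # placeholder table
    .option("user", "spark_user")                            # placeholder credentials
    .option("password", "***")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .load()
)
```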
Peter-Paul Eijkenboom
Senior Test Automation Specialist at APG
There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.
They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.
We are building our own queries on Spark, and it can be improved in terms of query handling.
Apache Spark is very difficult to use; it really requires a data engineer. It is not approachable for every engineer today, because they need to understand the different concepts of Spark, which are very difficult and not easy to learn.
Kürşat Kurt
Software Architect at Akbank
Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.
Rajendran Veerappan
Director at Nihil Solutions
There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good.
The graphical user interface (UI) could be a bit clearer. It's very hard to go through the execution logs and understand how long each step takes. If an execution is lost, it's not easy to understand why or where it went wrong; I have to manually drill down into the data processes, which takes a lot of time. Maybe a metrics monitor, or an improved log analysis overall, would make it easier to understand and navigate.
There should be more information shared with the user. The solution already tracks all the information in the cluster; it just needs to be accessible or searchable.
In data analysis, you need to take real-time data from different data sources, process it in a subsecond, and do the transformation in a subsecond.
It requires overcoming a significant learning curve due to its robust and feature-rich nature.
An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to change this parameter in real time rather than having to stop the solution and restart it with a higher number of nodes.
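Dynamic allocation partially addresses this by letting Spark grow and shrink the executor count at runtime instead of fixing it at startup. A minimal sketch; the bounds are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # illustrative lower bound
    .config("spark.dynamicAllocation.maxExecutors", "20")   # illustrative upper bound
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```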
When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve, because your job may fail if it is stuck for a long time while garbage collection is happening. This is the main problem we have.
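A common first mitigation is switching executors to the G1 collector and surfacing GC logs. A minimal sketch; the memory size and pause target are illustrative, not a universal recommendation:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")  # illustrative size
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc")
    .getOrCreate()
)
```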
Spark could be improved by adding support for open-source storage layers other than Delta Lake. The UI could also be enhanced to give more data on resource management.
The logging for the observability platform could be better.
View full review »KK
Kamlesh Khollam
Managing Consultant at a computer software company with 501-1,000 employees
I would like to see integration with data science platforms to optimize the processing capability for these tasks.
We use the big data manager, but we cannot use it for conditional data, so whenever we try to fetch data it takes a bit of time. There is some latency in the system and in the data caching. The main issue is that we need to design it so that data is available to us very quickly; it currently takes a long time, and the latest data should be available to us much quicker.
We've had problems using a Python process to access something in a large volume of data. It crashes if somebody gives me the wrong code, because it cannot handle such a large volume of data.
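For the read-latency side of this complaint, caching a hot DataFrame lets repeated queries skip the slow source fetch. A minimal sketch; the path and column are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # placeholder path

df.cache()    # mark for in-memory storage
df.count()    # first action materializes the cache
df.filter("value > 10").show()  # subsequent reads are served from memory
```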
I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like the contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows, where you have the whole solution and it handles the back end. That kind of solution would help. Still, it can do everything a data scientist needs.
Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.
Overall, it offers everything that I can imagine right now.
Mohamed Ghorbel
Director of BigData Offer at IVIDATA
The solution needs to optimize shuffling between workers.
When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.
When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.
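Two settings that often tame both the shuffle cost and the out-of-memory surprises described above are adaptive query execution and a shuffle partition count sized to the data. A minimal sketch; the partition count is illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")   # coalesces shuffle partitions at runtime (Spark 3.x)
    .config("spark.sql.shuffle.partitions", "400")  # illustrative; tune to data volume
    .getOrCreate()
)
```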
The search could be improved. Usually we use other tools to search for specific things and then use Spark to get the details, but if there were a way to search for those little things directly, that would be better.
It needs a new interface and a better way to get at the data. In terms of writing our scripts, some processes could be faster.
In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script.
Snrsecengin567
Snr Security Engineer at a tech vendor with 201-500 employees
The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.
It needs better data lineage support.
Sumanth Punyamurthula
Director - Data Management, Governance and Quality at Hilton Worldwide
It is like going back to the '80s, given the complicated coding required to write efficient programs.
I would suggest that it support more programming languages and provide an internal scheduler to schedule Spark jobs, with monitoring capability.
Include more machine learning algorithms and the ability to handle true streaming of data, as opposed to micro-batch processing.
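For reference, Structured Streaming defaults to micro-batches; a lower-latency continuous trigger exists but is still marked experimental, which is the gap this reviewer points at. A minimal sketch using the built-in rate source and console sink:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()  # built-in test source

query = (
    stream.writeStream.format("console")
    .trigger(continuous="1 second")  # experimental continuous mode; omit for micro-batch
    .start()
)
# query.awaitTermination() would block here in a real job
```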
Dynamic DataFrame options are not yet available.
This product is already improving, as the community is developing it rapidly. More ML-based algorithms should be added to it, to make it algorithm-rich for developers.
Stability in terms of the API (things were difficult when transitioning from RDD to DataFrames, and then to Datasets).
Apache Spark provides very good performance. The tuning phase is still tricky.
Spark is actually very good for batch analysis, much better than Hadoop: it's much simpler, much quicker, and so on. But it lacks the ability to perform real-time querying the way Vertica or Redshift can.
Also, it is more difficult for an end user to work with Spark than with a normal database, even compared with an analytic database like Vertica or Redshift.
View full review »Question of improvement always comes to mind of the developers. Just like the most common need of the developers, if a user-friendly GUI along with 'drag & drop' feature can be attached to this framework, then it would be easier to access it.
Another thing to mention, there always is a place for improvement in terms of the memory usage. If in future, it is achievable to use less memory for processing, it would obviously be better.
It needs to be simpler to use the machine learning algorithms supported by Octave (for example, polynomial regression and polynomial interpolation).
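For comparison, this is roughly what polynomial regression takes in Spark ML today, a multi-stage pipeline rather than Octave's one-liner; the data and degree are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import PolynomialExpansion, VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(x / 10.0, (x / 10.0) ** 2) for x in range(100)], ["x", "y"]
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="xv"),                    # wrap the column as a vector
    PolynomialExpansion(inputCol="xv", outputCol="features", degree=2),  # add polynomial terms
    LinearRegression(featuresCol="features", labelCol="y"),
])
model = pipeline.fit(df)
```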
Apache Spark as a data processing engine has come a long way since its inception. Although you can perform complex transformations using Spark libraries, support for SQL-based transformations is still limited. You can alleviate some of these limitations by running Spark within the Hadoop ecosystem and leveraging the fairly evolved HiveQL.
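A minimal sketch of that mitigation, enabling Hive support and pushing a transformation through SQL; the table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
result = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales            -- placeholder Hive table
    GROUP BY region
""")
result.show()
```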
A good tool to analyse Spark application performance would be welcome. Right now there are still many parameters to tune in order to get good performance out of a Spark application; I would like to see auto-tuning of those parameters.
Better monitoring ability is needed, especially monitoring integration with customer code.
Better integration with BI tools would be a much appreciated improvement.
It needs better documentation as well as examples for all the Spark libraries. That would be very helpful in maximizing its capabilities and results.
Like I said, scalability is still an issue, and so is stability. Spark on YARN still doesn't seem to have a programmatic submission API, so you have to rely on the spark-submit script to run jobs on YARN. The Scala and Java APIs have performance differences, which sometimes requires coding in Scala.
Apache Spark could improve the use-case scenarios presented on its website. There is no information on how you can use the solution across relational databases or with multiple databases.
View full review »Buyer's Guide
Apache Spark
July 2025

Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: July 2025.
861,390 professionals have used our research since 2012.