Apache Spark Streaming Room for Improvement
There are various ways to improve Apache Spark Streaming through best practices. The first is batch interval tuning: choosing a small micro-batch interval based on latency requirements helps prevent backpressure. We can use columnar data formats such as Parquet or ORC for storage that needs faster reads, and leverage their predicate push-down optimizations.
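As a sketch, the knobs above map onto real Spark settings; the values here are illustrative, not recommendations:

```properties
# Let the ingestion rate adapt to processing speed (backpressure)
spark.streaming.backpressure.enabled        true
# Illustrative cap on per-partition Kafka ingest while backpressure warms up
spark.streaming.kafka.maxRatePerPartition   10000
# Predicate pushdown for Parquet scans (on by default; shown for clarity)
spark.sql.parquet.filterPushdown            true
```

The micro-batch interval itself is set in code, for example when constructing the StreamingContext in the DStream API, or via a trigger in Structured Streaming.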
We can use efficient serialization such as Kryo in place of default Java serialization. Serializing and deserializing key-value data in formats such as XML and JSON also adds overhead worth minimizing. We can also implement caching to avoid recomputing data used by multiple operations.
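Kryo, for instance, is enabled through configuration rather than code; a minimal fragment:

```properties
# Use Kryo instead of default Java serialization
spark.serializer                  org.apache.spark.serializer.KryoSerializer
# Set to true to fail fast when a class has not been registered with Kryo
spark.kryo.registrationRequired   false
```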
We can use broadcast joins, which help when one side of the join is small, and minimize shuffles in distributed joins. We can rely on Project Tungsten, Spark's memory and CPU efficiency optimization. Additionally, load balancing, checkpointing, and schema evolution are areas to consider based on performance bottlenecks. We can use tools such as Bugzilla for issue tracking and Splunk to monitor system utilization and the performance of jobs built on DataFrames or Datasets.
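Broadcast joins, for example, are governed by a size threshold: when one side's estimated size is below it, Spark ships that table to every executor instead of shuffling both sides. The default is shown:

```properties
# Broadcast the smaller join side when its estimated size is under ~10 MB
spark.sql.autoBroadcastJoinThreshold    10485760
```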
The positive impact of Apache Spark Streaming is its near real-time capability. It has a good ecosystem that provides solid support. However, if you need purely real-time data, you would go with Flink. Apache Spark Streaming is good for near real-time data and requires less maintenance, which is beneficial for developers and companies.
The new feature coming in Apache Spark Streaming 4 is continuous streaming. If continuous streaming becomes stable and performs comparably to Flink, then Apache Spark Streaming would be preferred everywhere due to its good maintenance and support system.
While it is reliable, there are some issues with Apache Spark Streaming as it is not 100% reliable. Sometimes it fails, requiring numerous configurations such as checkpointing, watermarking, and other features. If you select a 10-minute window and the data arrives at the 30th minute, it sometimes loses data in between. You also have to apply back pressure when numerous messages are coming in. It requires constant monitoring and maintenance. I would say it is 90-95% reliable, but multiple configurations and frequent maintenance make it slightly less reliable.
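The window-and-watermark behaviour described here can be sketched in plain Python, independent of Spark: events that arrive after the watermark delay past the window's end are silently dropped. All names and times below are illustrative.

```python
from datetime import datetime, timedelta

def filter_late_events(events, window_end, watermark_delay):
    """Keep only events that arrive within the watermark.

    Mirrors, in plain Python, how a streaming engine with a windowed
    aggregation and a watermark drops events that arrive too long
    after the window closes.
    """
    cutoff = window_end + watermark_delay
    kept, dropped = [], []
    for event_time, value in events:
        (kept if event_time <= cutoff else dropped).append(value)
    return kept, dropped

# A 10-minute window ending at 12:10 with a 5-minute watermark:
window_end = datetime(2025, 1, 1, 12, 10)
delay = timedelta(minutes=5)
events = [
    (datetime(2025, 1, 1, 12, 12), "slightly-late"),   # within the watermark, kept
    (datetime(2025, 1, 1, 12, 40), "30-min-late"),     # past the watermark, dropped
]
kept, dropped = filter_late_events(events, window_end, delay)
```

This is the trade-off the reviewer describes: a longer watermark delay loses less data but forces the engine to hold state longer, which is why the configuration needs constant attention.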
The continuous deployment feature being in beta phase could benefit everyone if released earlier.
Apache Storm, which is also a real-time processing engine, and Flink both had the ability to append small data files: you can append to the same file until reaching a certain limit, then start writing to another file. I miss this feature in Apache Spark Streaming. From an architecture point of view, it's not possible for Apache Spark Streaming, but this is a feature I really miss compared to Flink or Apache Storm.
Monitoring is an area where they could definitely improve Apache Spark Streaming. When you have a streaming application, it generates numerous logs. After some time, the logs become meaningless because they're quite large and impossible to open. Monitoring how the streaming is progressing, including record rejection rates, failures, and successes is crucial. The rejected records are most critical, being those that cannot be compacted or processed in the streaming line. While monitoring features exist in the Spark UI with graphs showing input limits, incoming rates, and output rates, accessing this information requires navigating through Spark's UI. Furthermore, if the Spark application runs for an extended period, the UI becomes inaccessible, making it impossible to monitor your Apache Spark Streaming application.
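A minimal sketch (plain Python, not a Spark API) of the kind of per-batch counters the reviewer wants surfaced outside the Spark UI: inputs, successes, rejections, and an overall rejection rate.

```python
class BatchMetrics:
    """Running totals for the per-batch counters described above:
    input records, successes, and rejected records."""

    def __init__(self):
        self.input = 0
        self.success = 0
        self.rejected = 0

    def record_batch(self, n_input, n_success, n_rejected):
        self.input += n_input
        self.success += n_success
        self.rejected += n_rejected

    def rejection_rate(self):
        # Fraction of all input records that could not be processed
        return self.rejected / self.input if self.input else 0.0

m = BatchMetrics()
m.record_batch(100, 95, 5)    # illustrative batch counts
m.record_batch(200, 190, 10)
rate = m.rejection_rate()     # 15 rejected out of 300 inputs
```

In a real job these totals would be pushed to an external sink (e.g. Splunk, as another reviewer mentions) so they survive after the Spark UI becomes inaccessible.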
Another significant missing feature is the handling of slowly changing dimensions. When dealing with big data in Apache Spark Streaming, there are two different types of datasets: static data and streaming data. Apache Spark Streaming doesn't provide a way to automatically update static data when joining it with streaming data. For example, if you have customer data as static data and network data as streaming data, the application starts consuming network data but loads customer data from a previous snapshot. After 24 hours, Apache Spark Streaming cannot reload the customer data independently. The application must be stopped and restarted to consume the latest customer snapshot for joining with streaming data.
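The missing behaviour can be sketched in plain Python: reloading the static side on every micro-batch so a new customer snapshot is picked up without restarting the job. Names and data are illustrative.

```python
def stream_join(stream_batches, load_static):
    """Join each streaming micro-batch against static data that is
    re-loaded per batch, so updates to the static side (e.g. a new
    customer snapshot) are picked up without restarting the job --
    the behaviour the review says is missing out of the box."""
    out = []
    for batch in stream_batches:
        static = load_static()  # fresh snapshot for this batch
        out.append([(key, value, static.get(key)) for key, value in batch])
    return out

# Simulated: the customer snapshot changes between two batches.
snapshots = iter([{"c1": "Alice"}, {"c1": "Alicia"}])
result = stream_join(
    [[("c1", 10)], [("c1", 20)]],
    load_static=lambda: next(snapshots),
)
```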
Buyer's Guide
Apache Spark Streaming
September 2025

Learn what your peers think about Apache Spark Streaming. Get advice and tips from experienced pros sharing their opinions. Updated: September 2025.
867,676 professionals have used our research since 2012.
One improvement I would expect is real-time processing instead of micro-batch or near real-time. Frameworks such as Apache Beam or Apache Flink process data in real-time, and integrating this capability into Apache Spark Streaming would be beneficial.
Another improvement could be in the job stopping process. In the DataProc environment, we have to place a file for the Apache Spark Streaming job to detect and stop gracefully. A feature to stop the job directly from the UI would be helpful.
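The marker-file pattern described here can be sketched in plain Python, with the file check done between micro-batches so the current batch completes before the job stops. The paths and batch logic are illustrative.

```python
import os
import tempfile

# Hypothetical marker path; a real job would watch a path on GCS/HDFS.
marker = os.path.join(tempfile.mkdtemp(), "stop_streaming.marker")

def run_until_marker(batches, marker_path):
    """Process micro-batches, checking for a stop-marker file between
    batches so the job finishes its current batch and exits cleanly."""
    processed = []
    for batch in batches:
        if os.path.exists(marker_path):
            break
        processed.append(batch)
        if batch == 1:  # simulate an operator placing the marker mid-stream
            open(marker_path, "w").close()
    return processed

result = run_until_marker(range(5), marker)  # stops after batch 1
```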
I believe the downside of Apache Spark Streaming is that it primarily supports structured data. Currently, my organization has thousands of transcripts that need to be handled during live conversations. If Apache Spark Streaming allowed all unstructured data to be transferred, that would be a really great use case.
In terms of improvement, the UI could be better. Additionally, Spark Streaming works well for various use cases, but improvements could be made for ultra-fast scenarios where seconds matter. While some business processes require real-time data every second, not all projects demand such speed. For instance, batch processing, short intervals for competitive intelligence, or operational intelligence actions might not need sub-second precision. Streaming is versatile but needs careful consideration based on the specific use case and problem at hand.
The service structure of Apache Spark Streaming can improve. There are a lot of issues with memory management and latency, and there are no real-time analytics. We recommend it for use cases that can tolerate a five-second latency, but not for millisecond-level, IoT-based, or anomaly-detection use cases. Flink as a service is much better.
Apache Spark Streaming does not have auto-tuning. A customer needs to invest a lot in management and maintenance.
AM
Aleksandr Motuzov
Head of Data Science center of excellence at Ameriabank CJSC
We don't have enough experience to be judgmental about its flaws, as we've only used stable features like micro-batching. Integration poses no problem; however, I don't use some features and can't judge those.
DR
Daleep R
Chief Technology Officer at Teslon Technologies Pvt Ltd
In terms of disadvantages, it was a bit cumbersome due to its size. It wasn't quite cloud-native back then, meaning it wasn't easy to deploy it in a Kubernetes cluster and similar environments. I found it a bit challenging, but I'm not sure if that's still the case now. It probably has better support.
It was on-prem, and when we wanted to migrate it to the cloud, especially onto Kubernetes, I remember facing some difficulties in migrating the system successfully.
View full review »Apache Spark Streaming is a native integration of some libraries in terms of cost and load-related optimizations. The cost and load-related optimizations are areas where the tool lacks and needs improvement.
SB
Srikanth Bhuvanagiri
Sr Technical Analyst at Sumtotal
The initial setup is quite complex.
We would like to have the ability to do arbitrary stateful functions in Python.
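As an illustration of what "arbitrary stateful functions" means, here is a plain-Python sketch (not a Spark API) of user code keeping per-key state across micro-batches:

```python
def stateful_count(batches):
    """Maintain per-key running counts across micro-batches --
    the kind of arbitrary, user-defined state the reviewer
    wants to manage from Python."""
    state = {}      # survives from one batch to the next
    snapshots = []  # state as seen after each batch
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        snapshots.append(dict(state))
    return snapshots

counts = stateful_count([["a", "b"], ["a"]])
```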
The installation is difficult; you definitely need more than one person. That said, if you are implementing it in the cloud, it's easier.
The solution itself could be easier to use.
The solution is free to use as it is open-source.
RK
RajeevKumar10
DevOps engineer at Vvolve management consultants
The scalability features are already good, but they could be further enhanced. Additionally, the debugging aspect could use some improvement.
There could be an improvement in the user configuration section; it should be less developer-focused and more business-user-focused. For example, it is still not as plug-and-play and ready to use as some of the cloud offerings. It is not at the leading edge.
View full review »The product's event handling capabilities, particularly compared to Kaspersky, need improvement. Integrating event-level streaming capabilities could be beneficial. This aligns with the idea of expanding Spark's functionality to cover unaddressed areas, potentially enhancing its competitiveness.