What advice do you have for others considering Apache Spark Streaming?

Question

If you were talking to someone whose organization is considering Apache Spark Streaming, what would you say? How would you rate it and why? Any other tips or advice?

Khoa Dang Le · Accepted Answer

One thing I would share with other organizations considering Apache Spark Streaming is the necessity of effective data storage: make sure you can acquire and manage your data storage well. I would always recommend Apache Spark Streaming as a solution for managing an entire data pipeline. On a scale of 1-10, I rate Apache Spark Streaming an 8.

Ajay Hiremath · Answer

The product I discussed was Apache Spark Streaming. I used the free version for academic work. As for maintenance, I do not think it requires any on my part. I rate Apache Spark Streaming a seven out of ten.

Himansu Jena · Answer

Most features in Apache Spark Streaming are used for database operations, with a focus on speed, fault tolerance, and scalability across batch processing, real-time processing, SQL analytics, machine learning, and graph processing, along with lazy evaluation and compatibility. Distributed systems provide more accuracy and allow clustering of machines across large data sets. The data is divided into partitions, or small pieces, and processed in parallel across multiple worker nodes, which significantly reduces processing time compared to single-machine solutions. In-memory computing keeps data in memory, reduces frequent disk input-output, and enables faster algorithms.

We use NumPy and Pandas for matrix operations, creating algorithms that generate models fitting our deep learning or machine learning techniques. The accuracy level typically reaches 90% and above, depending on the data quality. When dealing with various data types including COBOL, Excel, JSON, video, audio, and MPG files, challenges can arise with incomplete or missing values. This particularly affects GIS data accuracy, such as predicting transport routes or electrical pole placements. While we achieve 90% efficiency, working with historical data versus current data presents challenges in business growth predictions.

When encountering fault tolerance issues, we communicate directly with the Apache Spark Streaming development team through LinkedIn channels or their on-site team. They provide customer support where issues can be reported via SMS or email with the file name for solution assistance. The team helps address issues with data frames, data sets, RDD functionality, version migrations, and integration with tools such as Miniconda, Anaconda, and Node.js server. I rate Apache Spark Streaming 9 out of 10.
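To make the partitioning idea concrete, here is a minimal plain-Python sketch, not Spark code. It assumes a small in-memory list, and the helper names (`split_into_partitions`, `partial_sum`, `parallel_sum_of_squares`) are made up for illustration. Worker threads stand in for worker nodes: each one handles a single partition, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into contiguous chunks of nearly equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional record.
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def partial_sum(partition):
    """Work done independently on one partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(records, num_partitions=4):
    """Process each partition in its own worker, then combine the results."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

In a real cluster the partitions live on different machines, so the speed-up comes from extra hardware rather than threads, but the split, process, and combine shape is the same.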

Venkata Phaneendra Reddy Janga · Answer

I would suggest Apache Spark for stream processing if they want to manage clusters on their own. For serverless options, exploring other use cases could be beneficial. Apache Spark is a good starting point, considering its strong open source community contributions. We have not experienced downtime with Apache Spark Streaming, but we have had occasional crashes. Sometimes the state memory keeps piling up, so we have to tune it. We were able to check specific load times and resolve those issues. On a scale of 1-10, I rate Apache Spark Streaming an 8.
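As a rough illustration of why state can pile up and what tuning it can look like, here is a plain-Python sketch, not Spark code and not necessarily what this team did. It assumes integer, non-negative event times in seconds, and the class `WindowedCounter` and its parameters are hypothetical. The idea is the one behind watermarking: once the newest event time has moved far enough ahead, old windows are dropped so state stays bounded.

```python
class WindowedCounter:
    """Counts events per fixed time window and evicts expired windows."""

    def __init__(self, window_seconds, allowed_lateness):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.counts = {}          # window start time -> event count
        self.max_event_time = 0   # newest event time seen so far

    def add(self, event_time):
        """Record one event; return False if it arrived too late."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        window_start = (event_time // self.window_seconds) * self.window_seconds
        if window_start + self.window_seconds <= watermark:
            return False  # its window has already been finalized
        self.counts[window_start] = self.counts.get(window_start, 0) + 1
        self._evict(watermark)
        return True

    def _evict(self, watermark):
        """Drop every window that ended at or before the watermark."""
        expired = [w for w in self.counts
                   if w + self.window_seconds <= watermark]
        for w in expired:
            del self.counts[w]


if __name__ == "__main__":
    counter = WindowedCounter(window_seconds=10, allowed_lateness=5)
    for t in [1, 3, 12, 27]:
        counter.add(t)
    print(counter.counts)
```

Without the eviction step, `counts` would grow for as long as the stream runs. A shorter allowed lateness frees memory sooner, at the cost of dropping more late events.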

Aleksandr Motuzov · Answer

The solution rates a nine out of ten.

reviewer2392494 · Answer

Spark does not encounter integration issues, particularly due to its utilization of JDBC connectors. These connectors facilitate seamless integration with third-party solutions. Furthermore, successful integration with tools like SAP HANA indicates its versatility in handling various data sources. Additionally, its performance surpasses Informatica in certain scenarios, especially when real-time streaming capabilities are crucial. It remains a preferred choice for businesses requiring efficient real-time data processing. I rate it an eight.

Oscar Estorach · Answer

For those starting with Apache Spark Streaming, I recommend studying and understanding data relationships. While it might seem complex at first, there are helpful resources available. Overall, I would rate Apache Spark Streaming as a nine out of ten.

Prashast Tripathi · Answer

Apache Spark Streaming has very specific use cases and needs to be evaluated based on the needs of an individual before choosing it. Overall, I rate the solution an eight out of ten.

Daleep R · Answer

I would highly recommend Spark Streaming for standard streaming or IoT use cases. The entire Spark ecosystem, including Spark Core, streaming, ML, and other components, can be highly beneficial. It's better to stick with the Spark ecosystem rather than use other platforms and frameworks. For streaming and IoT, Spark Streaming is a great choice. The only issue I found, at least during the time I actively worked with it, was that it was resource-intensive, even for small-scale applications. In comparison, some other platforms, like Pulsar, had lighter resource consumption and performed better in terms of resource usage and associated costs, at least to begin with. Because Spark Streaming is a bit heavy and resource-intensive at the start, I rate the solution an eight out of ten.

Srikanth Bhuvanagiri · Answer

It's important to be familiar with Spark Streaming and the Spark libraries. Familiarity with the relevant scripts and coding languages makes it easier to work within the Spark code ecosystem, whether you are integrating Spark Streaming or creating Spark clusters. I rate this solution eight out of 10.
