Software Architect at Akbank
Real User
Provides fast aggregations, AI libraries, and a lot of connectors
Pros and Cons
  • "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
  • "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

What is our primary use case?

We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase. 

Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them.  

This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It tries to find out suspicious activities and executes rules that are being developed or written by our business team. An example of a rule is that if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose to continue or not continue with the transaction.

How has it helped my organization?

Aggregations are very fast in our project since we started to use Spark. We can tell results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds. 

Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide and resolve them at the earliest because users are performing the transaction. If our result or process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, fast results time is very important for us.

What is most valuable?

AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI. 

What needs improvement?

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.

Buyer's Guide
Apache Spark
April 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,065 professionals have used our research since 2012.

For how long have I used the solution?

I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.

What do I think about the stability of the solution?

It is the most stable platform. As compare to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.

What do I think about the scalability of the solution?

It is scalable, but we don't have the need to scale it. 

It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.

We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.

How are customer service and support?

We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project. 

How was the initial setup?

I don't have any idea about it. We are a big company, and we have another group for setting up Spark.

What other advice do I have?

I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult. 

If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job execution. 

I would rate Apache Spark a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Enables us to process data from different data sources
Pros and Cons
  • "We use Spark to process data from different data sources."
  • "In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

What is our primary use case?

Our primary use case is for interactively processing large volume of data.

What is most valuable?

We use Spark to process data from different data sources. 

What needs improvement?

In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, and do the transformation in a subsecond

For how long have I used the solution?

I have been using Apache Spark for eight to nine years. 

What do I think about the stability of the solution?

It is a stable solution. The solution is ten out of ten on stability. 

What do I think about the scalability of the solution?

The solution is highly scalable. All of the technical guys use Spark. Our product is used by many people within our customers' company.

How was the initial setup?

The initial setup is straightforward. 

What's my experience with pricing, setup cost, and licensing?

The solution is moderately priced. 

What other advice do I have?

I rate the overall solution a ten out of ten. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Flag as inappropriate
PeerSpot user
Buyer's Guide
Apache Spark
April 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,065 professionals have used our research since 2012.
CTO at Hammerknife
Real User
Top 5
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Flag as inappropriate
PeerSpot user
PLC Programmer at Alzero
Real User
Top 20
Highly-recommended robust solution for data processing
Pros and Cons
  • "I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
  • "The solution’s integration with other platforms should be improved."

What is our primary use case?

We are a software solutions company that serves a variety of industries, including banking, insurance, and industrial sectors. The product is specifically employed for managing data platforms for our customers.


What is most valuable?

The solution, as a package, excels across the board. I appreciate everything, not just one or two specific features.


What needs improvement?

The solution’s integration with other platforms should be improved.


For how long have I used the solution?

I have been using the solution for the past eight years. Currently, I’m using the latest version of the solution.


What do I think about the stability of the solution?

The solution is highly stable. I rate it a perfect ten.


What do I think about the scalability of the solution?

The solution is highly scalable. I rate it a perfect ten.


How was the initial setup?

The initial setup was straightforward and was conducted on the cloud. The entire deployment process took just 15 minutes. The deployment process involves provisioning the computational part tool using Terraform.


What's my experience with pricing, setup cost, and licensing?

The solution is affordable and there are no additional licensing costs.


What other advice do I have?

I recommend using the solution. Overall, I rate the solution a perfect ten.


Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
Flag as inappropriate
PeerSpot user
Lokesh Jayanna - PeerSpot reviewer
Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Real User
Top 5
Stable product with a valuable SQL tool
Pros and Cons
  • "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
  • "At the initial stage, the product provides no container logs to check the activity."

What is our primary use case?

We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.

What is most valuable?

The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.

What needs improvement?

At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers could start quickly, similar to that of Jupyter Notebook.

For how long have I used the solution?

We have been using Apache Spark for eight months to one year.

What do I think about the stability of the solution?

It is a stable product. I rate its stability an eight out of ten.

What do I think about the scalability of the solution?

We have 45 Apache Spark users. I rate its scalability a nine out of ten.

How was the initial setup?

The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.

What's my experience with pricing, setup cost, and licensing?

The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.

What other advice do I have?

I advise others to analyze data and understand your business requirements before purchasing the product. I rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Flag as inappropriate
PeerSpot user
Jagannadha Rao - PeerSpot reviewer
Lead Data Scientist at International School of Engineering
Real User
Top 10
A flexible solution that can be used for storage and processing
Pros and Cons
  • "The most valuable feature of Apache Spark is its flexibility."
  • "Apache Spark's GUI and scalability could be improved."

What is our primary use case?

We use Apache Spark for storage and processing.

What is most valuable?

The most valuable feature of Apache Spark is its flexibility.

What needs improvement?

Apache Spark's GUI and scalability could be improved.

For how long have I used the solution?

I have been using Apache Spark for four to five years.

What do I think about the scalability of the solution?

Around 15 data scientists are using Apache Spark in our organization.

How was the initial setup?

Apache Spark's initial setup is slightly complex compared to other other solutions. Data scientists could install our previous tools with minimal supervision, whereas Apache Spark requires some IT support. Apache Spark's installation is a time-consuming process because it requires ensuring that all the ports have been accessed properly following certain guidelines.

What about the implementation team?

While installing Apache Spark, I must look at the documentation and be very specific about the configuration settings. Only then I'll be able to install it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an expensive solution.

What other advice do I have?

I would recommend Apache Spark to other users.

Overall, I rate Apache Spark an eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Flag as inappropriate
PeerSpot user
Data Engineer at Berief Food GmbH
Real User
Top 5Leaderboard
A useful and easy-to-deploy product that has an excellent data processing framework
Pros and Cons
  • "The data processing framework is good."
  • "The solution must improve its performance."

What is our primary use case?

Our customers configure their software applications, and I use Apache to check them. We use it for data processing.

What is most valuable?

The data processing framework is good. The product is very useful.

What needs improvement?

The solution must improve its performance.

For how long have I used the solution?

I have been using the solution for four to five years.

What do I think about the stability of the solution?

The tool is stable. I rate the stability more than nine out of ten.

What do I think about the scalability of the solution?

We have a small business. Around four people in my organization use the solution.

How was the initial setup?

The deployment was easy.

What about the implementation team?

The solution was deployed with the help of third-party consultants.

What other advice do I have?

Overall, I rate the product more than eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Flag as inappropriate
PeerSpot user
Quantitative Developer at a marketing services firm with 11-50 employees
Real User
Top 20
Seamless in distributing tasks, including its impressive map-reduce functionality
Pros and Cons
  • "The distribution of tasks, like the seamless map-reduce functionality, is quite impressive."
  • "When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."

What is our primary use case?

Predominantly, I use Spark for data analysis on top of datasets containing tens of millions of records.

How has it helped my organization?

I have an example. We had a single-threaded application that used to run for about four to five hours, but with Spark, it got reduced to under one hour.

What is most valuable?

The distribution of tasks, like the seamless map-reduce functionality, is quite impressive. For the user, it appears as simple single-line data manipulations, but behind the scenes, the executor pool intelligently distributes the map and reduce functions.

What needs improvement?

The visualization could be improved.

For how long have I used the solution?

I have been working with Apache Spark for only a few months, not too long.

What do I think about the stability of the solution?

I haven't faced any stability issues. It has been stable in my experience.

What do I think about the scalability of the solution?

When it comes to the scalability of Spark, it's primarily a processing engine, not a database engine. I haven't tested it extensively with large record sizes.

In my organization, quite a few people are using Spark. In my smaller team, there are only two users.

What about the implementation team?

In terms of maintenance, when the load hits around 95%, we need to prioritize scripts and analysis within the team. 

We coordinate and prioritize based on the available resources. If there were self-service tools or better hand-holding for such situations, it would make things easier.

Which other solutions did I evaluate?

Currently, we extensively use pandas and Polaris. We are leveraging Docker and Kubernetes as a framework, along with AWS Batch for distribution. This is the closest substitute we have for Spark Distribution.

Both Docker and Kubernetes are more general-purpose solutions. If someone is already using Kubernetes and it's provided as a service, it can be used for special-purpose utilization, similar to Docker and Kubernetes.


In such cases, users may need to write the parallelization logic themselves, but it's relatively easy to onboard and start with a distributed load. Spark, on the other hand, is primarily used for special-purpose utilization. Users typically choose Spark when they have data-intensive tasks.

Another significant issue with Spark is its syntactics. For instance, if we have libraries like Panda or Polaris, we can run them single-threaded on a single core, or we can distribute them leveraging Kubernetes.

We don't need to rewrite that code base for Spark. However, if we are writing code specifically for Spark Executors, it will not be amenable to running it locally.

What other advice do I have?

I would recommend understanding the use case better. Only if it fits your use case, then go for it. But it is a great tool.

Overall, I would rate Apache Spark an eight out of ten. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: April 2024
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.