Onur Tokat - PeerSpot reviewer
Big Data Engineer Consultant at Collective[i]
Consultant
Scala-based solution with good data evaluation functions and distribution
Pros and Cons
  • "Spark can handle small to huge data and is suitable for any size of company."
  • "Spark could be improved by adding support for open-source storage layers other than Delta Lake."

What is our primary use case?

I mainly use Spark to prepare data for processing because it has APIs for data evaluation. 

What is most valuable?

The most valuable feature is that Spark uses Scala, which has good data evaluation functions. Spark also supports good distribution on the clusters and provides optimization on the APIs.

What needs improvement?

Spark could be improved by adding support for open-source storage layers other than Delta Lake. The UI could also be enhanced to give more data on resource management.

For how long have I used the solution?

I've been using Spark for six years.

Buyer's Guide
Apache Spark
April 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,065 professionals have used our research since 2012.

What do I think about the stability of the solution?

Generally, Spark works correctly without any errors. It may give out some errors if your data changes, but in that case, it's a problem with the configuration, not with Spark.

What do I think about the scalability of the solution?

The cloud version of Spark is very easy to scale.

How was the initial setup?

The initial setup is not complex, but it depends on which components are in the architecture. For example, if you use Hadoop, setup may not be easy. Deployment takes about a week, but the Spark cluster can be installed in the virtual architecture in a day.

What other advice do I have?

Spark can handle small to huge data and is suitable for any size of company. I would rate Spark as eight out of ten. 

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user
Director at Nihil Solutions
Real User
Stable and easy to set up with a very good memory processing engine
Pros and Cons
  • "The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
  • "The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."

What is our primary use case?

When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to the Apache file in the data lake. The machine learning program then runs an analysis using the ML prediction algorithm.

What is most valuable?

The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly.

What needs improvement?

There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good.

The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate.

There should be more information shared to the user. The solution already has all the information tracked in the cluster. It just needs to be accessible or searchable.

For how long have I used the solution?

I started using the solution about four years ago. However, it's been on and off since then. I would estimate in total I have about a year and a half of experience using the solution.

What do I think about the stability of the solution?

The stability of the solution is very, very good. It doesn't crash or have glitches. It's quite reliable for us.

What do I think about the scalability of the solution?

The scalability of the solution is very good. If a company has to expand it, they can do so.

Right now, we have about six or seven users that are directly on the product. We're encouraging them to use more data. We do plan to increase usage in the future.

How are customer service and technical support?

I'm a developer, so I don't interact directly with technical support. I can't speak to the quality of their service as I've never directly dealt with them.

Which solution did I use previously and why did I switch?

We previously used a lot of different mechanisms. However, we needed something that was good at processing data for analytical purposes, and this solution fit the bill. It's a very powerful tool. I haven't seen other tools that could do precisely what this one does.

How was the initial setup?

The initial setup isn't too complex. It's quite straightforward.

We use CI/CD DevOps for deployment. We only use Spark for processing and for the Databricks cluster to spin up and do the job. It's continuously running in the background.

There isn't really any maintenance required per se. We just click the button and it comes up automatically, with the whole cluster and the Spark and everything ready to go.

What's my experience with pricing, setup cost, and licensing?

I'm unsure as to how much the licensing is for the solution. It's not an aspect of the product I deal with directly.

What other advice do I have?

We're customers and also partners with Apache.

While we are on version 2.6, we are considering upgrading to version 3.0.

I'd rate the solution nine out of ten. It works very well for us and suits our purposes almost perfectly.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user
Technical Consultant at a tech services company with 1-10 employees
Consultant
Good streaming features enable data ingestion and analysis within Spark Streaming
Pros and Cons
  • "I feel the streaming is its best feature."
  • "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

What is our primary use case?

We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.

What is most valuable?

I have worked with Hadoop a lot in my career and you need to do a lot of things to get it to Hello World. But in Spark it is easy. You could say it's an umbrella that puts everything on one shelf. It also has Spark Streaming. I feel the streaming is its best feature because I have been able to ingest data and run analysis within Spark Streaming.

What needs improvement?

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that handled all these configurations, like in Windows, where you install the whole solution and it takes care of the back end. I think that kind of solution would help. But still, it can do everything for a data scientist.

Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.

Overall, it offers everything that I can imagine right now. 

For how long have I used the solution?

I have been using Apache Spark for a couple of months.

What do I think about the stability of the solution?

In terms of stability, I have not seen any bugs, glitches or crashes. Even if there are some, that's fine, because I would probably take care of them and then I'd have progressed further in the process.

What do I think about the scalability of the solution?

I have not tested the scalability yet.

In my company, there are two or three people that are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company so they do what they do, and they ask us to do everything. They have pretty much outsourced their whole big data initiative to us.

Which solution did I use previously and why did I switch?

I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.

In my previous work, not at this company, I was working with some big data, but I was extracting it using a single core of my PC. I realized over time that my system had eight cores, so I used all of those cores for multi-core programming. Then I realized that Hadoop and Spark do the same thing but with different PCs. That was when I used multi-core programming and that's the point - Spark needs to go and search Hadoop and other things.
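
The step from one core to several cores can be sketched with Python's standard library. This is only an illustration of the idea the reviewer describes, not their actual program: the word-count task, the function names, and the default of eight workers are assumptions made for the example. Spark applies the same split, process, and merge pattern, but spreads the chunks over the machines of a cluster instead of the cores of one PC.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_words(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_word_count(lines, workers=8, executor_cls=ProcessPoolExecutor):
    """Run the map step on several workers, then merge the partial counts."""
    chunks = split_into_chunks(lines, workers)
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        # Each chunk is counted independently; merging is the reduce step.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total
```

Calling `parallel_word_count(lines)` on a list of text lines uses up to eight worker processes. On platforms that start worker processes by spawning, the call should sit under an `if __name__ == "__main__":` guard.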

How was the initial setup?

The initial setup to get it to Hello World is pretty easy, you just have to install it. But when you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources. But you can get a lot of help from different sources on the internet. So it's great. A lot of people are doing it.

I work with a startup company. You know that in startups you do not have the luxury of different people doing different things, you have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you only have restricted SOPs, you have to work within the boundaries. In my organization, I have to set up all the things, configure it, and work on it myself.

What's my experience with pricing, setup cost, and licensing?

I would suggest not to try to do everything at once. Identify the area where you want to solve the problem, start small and expand it incrementally, slowly expand your vision. For example, if I have a problem where I need to do streaming, just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you rather than the tools drive the work. That is my advice.

What other advice do I have?

On a scale of 1 to 10, I'd put it at an eight.

To make it a perfect 10 I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened in the configuration and back end. So I think installation and configuration with some other tools could be improved. We are technical people, we could figure it out, but if aspects like that were improved then other people who are less technical would use it and it would be more adaptable to the end-user.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user786777 - PeerSpot reviewer
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
Real User
We can now harness richer data sets and benefit from use cases
Pros and Cons
  • "With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
  • "Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."

How has it helped my organization?

Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.

What is most valuable?

Distributed in-memory processing. Some of the algorithms are resource heavy and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware.

What needs improvement?

Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing.

For how long have I used the solution?

Three to five years.

What do I think about the stability of the solution?

At times, when users do not know how to use Spark and request a lot of resources, the underlying JVMs can crash, which is a big source of worry.

What do I think about the scalability of the solution?

No issues.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Director of Engineering at Sigmoid
Real User
Top 20
Easy to code, fast, open-source, very scalable, and great for big data
Pros and Cons
  • "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
  • "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

What is our primary use case?

I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

How has it helped my organization?

Spark has been at the forefront of data processing engines. I have used Apache Spark for multiple projects for different clients. It is an excellent tool to process massive amounts of data.

What is most valuable?

Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.

Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.
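
The two styles mentioned above can be put side by side in a short PySpark sketch. The table name `sales`, the columns `region` and `amount`, and the function names are hypothetical and chosen only for illustration. PySpark is imported inside the functions, so the module can be read and loaded without a Spark installation.

```python
# Hypothetical table and column names, used only for illustration.
QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"


def totals_with_sql(spark, df):
    """Register the DataFrame as a temporary view and run plain SQL on it."""
    df.createOrReplaceTempView("sales")
    return spark.sql(QUERY)


def totals_with_api(df):
    """Express the same aggregation through the DataFrame API."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    return df.groupBy("region").agg(F.sum("amount").alias("total"))


def demo():
    """Build a tiny local DataFrame and run both variants on it."""
    from pyspark.sql import SparkSession  # lazy import; requires PySpark

    spark = (
        SparkSession.builder.master("local[1]").appName("sql-vs-api").getOrCreate()
    )
    df = spark.createDataFrame(
        [("north", 10), ("south", 5), ("north", 7)], ["region", "amount"]
    )
    totals_with_sql(spark, df).show()
    totals_with_api(df).show()
    spark.stop()
```

Calling `demo()` in an environment with PySpark installed runs both variants. They describe the same aggregation, so the choice between them is mostly a matter of which style a team finds easier to read.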

What needs improvement?

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
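
For context on the separate setup described above, here is a minimal sketch of the settings involved, written as a small Python helper that renders lines for `spark-defaults.conf`. The three property names are standard Spark configuration keys. The helper names and the log directory used in the usage note are assumptions for illustration.

```python
def history_server_conf(log_dir):
    """Return the properties that let applications write event logs
    and let the history server read them from the same location."""
    return {
        # Applications record their events while they run.
        "spark.eventLog.enabled": "true",
        # Where applications write those event logs.
        "spark.eventLog.dir": log_dir,
        # Where the history server looks for finished applications.
        "spark.history.fs.logDirectory": log_dir,
    }


def render_spark_defaults(conf):
    """Format the properties as 'key value' lines for spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())
```

For example, `print(render_spark_defaults(history_server_conf("hdfs:///spark-logs")))` prints three lines that can be added to `conf/spark-defaults.conf`. The history server itself is still started as a separate process with `sbin/start-history-server.sh`, which is the extra step the reviewer would like to see folded into the main setup.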

For how long have I used the solution?

I have been using this solution for around 7 years.

What do I think about the stability of the solution?

There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to version 1.6, and it seems that this issue has been fixed since then.
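
The slowdown described above appears to match a known cost of chaining many column operations. The Spark API documentation for `withColumn` notes that each call adds a projection internally and recommends a single `select` when many columns are added. The sketch below contrasts the two approaches for renaming. The function names and the `_clean` suffix are hypothetical, and this is not the reviewer's actual pipeline.

```python
def rename_plan(columns, suffix="_clean"):
    """Map every existing column name to its new name (hypothetical rule)."""
    return {name: f"{name}{suffix}" for name in columns}


def rename_with_chained_calls(df, plan):
    """One call per column: simple, but the query plan grows with each step."""
    for old, new in plan.items():
        df = df.withColumnRenamed(old, new)
    return df


def rename_with_single_select(df, plan):
    """Apply all renames in one projection, keeping unplanned columns as-is."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    return df.select([F.col(c).alias(plan.get(c, c)) for c in df.columns])
```

With a DataFrame `df`, `rename_with_single_select(df, rename_plan(df.columns))` produces the same column names as the chained version in one step.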

What do I think about the scalability of the solution?

It is very scalable. You can scale it a lot.

How are customer service and support?

I haven't contacted them.

How was the initial setup?

The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premises environment, and the initial setup was a little cumbersome. It required a lot of setup on Unix systems. We also had to do a lot of configurations and install a lot of things.

After I moved to the Cloudera CDH version, it was a little easier. It is a bundled product, so you just install whatever you want and use it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.

What other advice do I have?

I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.

I would rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Senior Test Automation Consultant / Architect at a tech services company with 11-50 employees
Consultant
Useful for big data and scientific purposes, but needs better query handling, stability, and scalability
Pros and Cons
  • "It is useful for handling large amounts of data. It is very useful for scientific purposes."
  • "We are building our own queries on Spark, and it can be improved in terms of query handling."

What is our primary use case?

We are using it for big data. We are using a small part of it, which is related to using data.

What is most valuable?

It is useful for handling large amounts of data. It is very useful for scientific purposes.

What needs improvement?

There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.

They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.

We are building our own queries on Spark, and it can be improved in terms of query handling.

For how long have I used the solution?

In my company, it has been used for several years, but I have been using it for seven months.

What do I think about the scalability of the solution?

It is not scalable. Scalability is one of the issues.

How are customer service and support?

It is open source from my point of view. So, there is no support.

What other advice do I have?

I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.

Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Senior Solutions Architect at a retailer with 10,001+ employees
Real User
A unified analytics engine with a valuable parallel processing feature
Pros and Cons
  • "I like that it can handle multiple tasks in parallel. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
  • "The logging for the observability platform could be better."

What is our primary use case?

We use Apache Spark to prepare data for transformation and encryption, depending on the columns. We use AES-256 encryption. We're building a proof of concept at the moment. We prepare patches on Spark for Kubernetes on-premise and Google Cloud Platform.

What is most valuable?

I like that it can handle multiple tasks in parallel. I also like the automation feature. JavaScript also helps with the parallel streaming of the library.

What needs improvement?

The logging for the observability platform could be better.

For how long have I used the solution?

I have known about this technology for a long time, maybe for about three years.

Which solution did I use previously and why did I switch?

Because my area is data analytics and analytics solutions, I use BigQuery, SQL, and ETL. I also use Dataproc and DataFlow.

What about the implementation team?

We use an integrator sometimes, but recently we put together a team to support the infrastructural requirements. This is because the proof of concept is self-administered.

What other advice do I have?

I would recommend Apache Spark to new users, but it depends on the use case. Sometimes, it's not the best solution.

On a scale from one to ten, I would give Apache Spark a ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user946074 - PeerSpot reviewer
Principal Architect at a financial services firm with 1,001-5,000 employees
Real User
Fast performance and has an easy initial setup
Pros and Cons
  • "I found the solution stable. We haven't had any problems with it."
  • "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."

What is our primary use case?

We use the solution for analytics.

How has it helped my organization?

I'm not sure how it has improved my organization but I believe that it's a good product.

What is most valuable?

The fast performance is the most valuable aspect of the solution.

What needs improvement?

The search could be improved. Usually, we use other tools to search for specific things. We use it the way I use other tools, to get the details, but if there were a way to search for little things, that would be better.

It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.

In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script. 

For how long have I used the solution?

I've been using the solution for two years.

What do I think about the stability of the solution?

I found the solution stable. We haven't had any problems with it.

How are customer service and technical support?

Usually, we can fix any issues. If we have problems, we google a little bit to find the issue. 

Which solution did I use previously and why did I switch?

I was using some other systems and we moved to Spark later. We faced performance and other issues with the other solution.

How was the initial setup?

The initial setup was easy. We keep on getting data from different sources so we will keep on porting in little bits. It's not done in a single sitting, so I can't really say how long it takes.

What other advice do I have?

I would recommend the solution. I would rate it an eight or nine out of 10.

For some areas, I would give it ten but I cannot use some parts. If you are going to use it for a consumer then I would be able to recommend it and you should go ahead. It doesn't work for me as I have different clients and different engagements.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user