Get our free report covering Amazon, Microsoft, VMware, and other competitors of Apache Spark Streaming. Updated: January 2022.
563,208 professionals have used our research since 2012.

Read reviews of Apache Spark Streaming alternatives and competitors

Mohammad Masudu Rahaman
Founder at Talkingdeal.com LLC
Real User
Top 10
Good logging mechanisms, a strong infrastructure and pretty scalable
Pros and Cons
  • "There are a lot of options in Spring Cloud. It's flexible in terms of how we can use it. It's a full infrastructure."
  • "The configurations could be better. Some configurations are a little bit time-consuming in terms of trying to understand using the Spring Cloud documentation."

What is our primary use case?

Mostly the use cases are related to building a data pipeline. There are multiple microservices working in the Spring Cloud Data Flow infrastructure, and we are building a data pipeline that processes data step by step using Kafka. Most of the processors, sinks, and sources are developed based on the customers' business requirements or use cases.

In the example of the bank we work with, we are building a document analysis pipeline. There are defined sources from which we get the documents. Later on, we extract some information, such as a summary, and we export the data to multiple destinations. We may export it to a PostgreSQL database and/or to a Kafka topic.

For CoreLogic, we were doing data imports into Elasticsearch. We had a BigQuery data source, and from there we did some transformation of the data and then imported it into the Elasticsearch clusters. That was the ETL solution.
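In Spring Cloud Data Flow, a pipeline like the ones described above is registered as a stream definition of the form source | processor | sink. A minimal sketch from the Data Flow shell, with illustrative app names standing in for the custom apps (not the actual apps from these projects):

```shell
# Hypothetical document-analysis stream; "doc-source" and "doc-extractor"
# stand for custom apps, and "jdbc" is a sink that could write the
# extracted data to a relational database.
dataflow:> stream create --name doc-pipeline \
  --definition "doc-source | doc-extractor | jdbc" --deploy
```

Each pipe in the definition is backed by the message broker (Kafka in this case), so every stage runs as an independent Spring Boot app exchanging messages over topics.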

How has it helped my organization?

All the cloud services, PCF for example, have their own microservice management infrastructure. However, if you have CDF running, then the developer has more control over the messaging platform. Being able to control how the data flows from one microservice to another is great; as a developer, I feel more in control. Some hosted services or hosted infrastructure let us run smaller microservices, but they are infrastructure dependent. If anything happens, like a bug or any other issue, it can be difficult to trace the problem. That's not true here. CDF is really good at logging, so as a developer, I have my Spring Boot logging mechanism to check what the problem is, and that helps a lot.

I've been working with this kind of infrastructure for eight or nine years at this point, so I feel comfortable with it. CDF is actually infrastructure for Spring Boot applications running inside it, either as tasks or as long-lived microservices in the data pipeline. If you have a Spring Cloud Data Flow server implemented in your project, that means you have your own data pipeline architecture, and you can design the flow of the data processing as you wish.

There is also logging for these short-lived tasks. Logging when a task started and when it stopped also helps provide more transparency.

In terms of the direct benefit to the company, they spend less money: if you have some kind of hosted BPM or hosted service to orchestrate your microservices, you need to pay fees to a company to manage it. However, if your developer can manage CDF, that management cost is reduced. I'm not sure of the actual hard costs; however, I am aware of the savings.

What is most valuable?

Mostly we enjoy the orchestration of microservices: you can have a Spring Boot application and build your own steps, and you can chain multiple processors as you need. There is also Spring Cloud Task inside CDF, which is helpful for temporary jobs. You can trigger a task, it will do something briefly, and then it finishes. No memory stays occupied by a running microservice: the task starts, does some work, then dies, and the memory is released. These kinds of temporary activities are also helpful.

It's a low-resource type of product. You have a scheduler running and a lot of smaller tasks to be done by the scheduler, so you don't need to keep a microservice running. You can trigger the task, the task will be executed, the JAR execution will finish, and then the memory will be released. So you never need to keep any long-lived microservices.
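The short-lived task pattern described here can be sketched in plain Java. This is an illustration of the idea only, not the actual Spring Cloud Task API: the work runs once, the result is reported, and the process exits, so no memory stays held by a long-lived service.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustration of the short-lived task pattern (not the Spring Cloud Task API):
// fetch a batch, process it, return, and let the JVM exit so memory is released.
public class ShortLivedTask {

    // The unit of work a task performs; the Supplier stands in for a data source.
    static int runTask(Supplier<List<Integer>> source) {
        List<Integer> batch = source.get();
        // "Process" the batch; here we just total the records.
        return batch.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        int processed = runTask(() -> List.of(1, 2, 3));
        System.out.println("task finished, processed=" + processed);
        // The process now exits; nothing keeps running or holding memory.
    }
}
```

In CDF, the equivalent would be a Spring Boot app registered as a task and launched on demand or on a schedule.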

There are a lot of options in Spring Cloud. It's flexible in terms of how we can use it. It's a full infrastructure.

What needs improvement?

The configurations could be better. Some configurations are a little bit time-consuming in terms of trying to understand using the Spring Cloud documentation. 

The documentation on offer is not that good. The Spring Cloud Data Flow documentation for the configurations is not exactly clear. Sometimes they provide examples which are not complete: some parts are described in the documentation but not shown in example code. When we tried to implement multiple configurations, for example, when we integrated PCF (Pivotal Cloud Foundry) with CDF, there were issues. PCF has a workspace concept; however, when we tried to implement the workspace in CDF, some of the boundary configuration was not integrating properly. Then we went to the documentation and tried to customize things a little at the configuration level, not at the code level, to get the solution working.

It is open source. Therefore, you need to work a little bit. You need to do some brainstorming on your own. There's no one to ask. We cannot call someone and ask what the problem is. It is an open-source project without technical support. It's up to us to figure out what the problem is.

For how long have I used the solution?

I've been working with the solution for more than 11 months on two separate projects in California and Illinois. However, I have been familiar with the solution since 2017 and have used it on and off since then on a variety of projects.

What do I think about the stability of the solution?

Spring Cloud Data Flow is an open-source project and a lot of developers are working on it. It is really stable right now. The configuration part may need some improvement, or rather simplifying; some configuration could be simplified somehow. For a simpler implementation or a smaller project, there is no problem. However, whether you deploy the CDF server in PCF or in Kubernetes, there are some integrations involved.

What do I think about the scalability of the solution?

The solution scales well. 

The main reason to use the Spring Cloud Data Flow server is to scale your project. You can split it into multiple microservices and then deploy it onto multiple servers. We took help from the PCF (Pivotal Cloud Foundry) platform, which has the Spring Cloud Data Flow server integrated right in. Our microservices ran in their cluster, in multiple instances, and we can increase the number of instances of these microservices as we need.
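In Spring Cloud Data Flow, this per-app scaling is exposed through deployer properties; `deployer.<app>.count` sets how many instances of a given app in the stream are launched. A sketch, with illustrative stream and app names:

```shell
# Deploy the stream with three instances of the processor app.
# Stream and app names here are illustrative.
dataflow:> stream deploy doc-pipeline \
  --properties "deployer.doc-extractor.count=3"
```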

How are customer service and technical support?

The solution is open-source, so there really isn't technical support to speak of. If there are issues, we need to troubleshoot them ourselves. We need to go through the code and work through the issues independently.

Which solution did I use previously and why did I switch?

We've had experience with Apache NiFi and also Spark; however, Spark is mostly just an execution engine. They have similar architectures, and Apache NiFi, like this solution, is also open-source. We've also looked at AWS Step Functions; however, their concept is closer to a serverless architecture. You don't even need to do the boilerplate coding to run the application as a microservice; you just supply the part of the code you need to execute as a function. In CDF, what we do is write a microservice as a deployable application and then run that code inside the microservice. With AWS Step Functions and Lambda, you only put in the part of the code that you need to execute, then use their platform to connect all the steps. Amazon can be expensive, as you do need to pay for their services; the others you can just install on your own servers.

How was the initial setup?

During the initial setup, when I ran the CDF server (just one JAR, with the Skipper server as another JAR), I created some tasks and created a source stream with a service stream. These tasks are all simple. However, if you try to integrate with some kind of platform, for example, another platform where you're going to deploy CDF, then complexity comes into play. Otherwise, if you can run it in a single ECS instance or any kind of Linux box or server instance, there is no issue; you can do everything.

I used Docker Compose, and we Dockerized a lot of things. It was a quick deployment.
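As a rough sketch of that Compose setup, assuming the official Spring Cloud images (a real file also wires in a database, Kafka, and further configuration):

```yaml
# Minimal sketch only: the Data Flow server plus the Skipper server.
version: "3"
services:
  skipper-server:
    image: springcloud/spring-cloud-skipper-server
    ports:
      - "7577:7577"
  dataflow-server:
    image: springcloud/spring-cloud-dataflow-server
    ports:
      - "9393:9393"
    environment:
      # Point the Data Flow server at Skipper for stream deployments.
      - SPRING_CLOUD_SKIPPER_CLIENT_SERVER_URI=http://skipper-server:7577/api
```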

That said, each deployment is unique to each client. It's not always the same steps across the board.

What other advice do I have?

While the deployment is on-premises, the data center is not on our premises. It's in a different geographical location; however, it is the client's own data center. We deployed there, and we installed the CDF server, then the Skipper server, and everything else, including all the microservices. We used the PCF (Pivotal Cloud Foundry) platform, and for the bank, we deployed in Kubernetes.

The Spring Cloud Data Flow server is pretty standard to implement. A year ago it was a new project; however, now it is already implemented in many, many projects. I think developers should start using it if they are not using it yet. In the future, there could be some more improvements in the area of the data pipeline ETL process. That said, I'm happy with the Spring Cloud Data Flow server right now.

Our biggest takeaway has been to design the pipeline depending on the customer's needs. We cannot just think about everything as developers; sometimes we need to think about what the customer needs instead. Everything needs to be based on the customer's flow. That helps us design a proper data pipeline. The task mechanism is also helpful when we can run some tasks instead of keeping an application live 24 hours a day.

Overall, I'd rate the solution nine out of ten. It's a really good solution and a lot cheaper than a lot of infrastructure provided by big companies like Google or Amazon.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
RameshCh
Sr. BigData Architect at ITC Infotech
MSP
Top 5
Very elastic, easy to scale, and a straightforward setup
Pros and Cons
  • "It's easy to increase performance as required."
  • "Instead of relying on a massive instance, the solution should offer micro partition levels. They're working on it, however, they need to implement it to help the solution run more effectively."

What is our primary use case?

We work with clients in the insurance space mostly. Insurance companies need to process claims. Their claim systems run under Databricks, where we do multiple transformations of the data. 
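As a plain-Java illustration of the kind of claims transformation involved (not actual Databricks code; on Databricks this logic would be expressed with the Spark APIs, and the field names here are made up): filter claims by status, then aggregate amounts by region.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of a filter-and-aggregate claims transformation.
// Field names (region, status, amount) are hypothetical.
public class ClaimsTransform {
    record Claim(String region, String status, double amount) {}

    // Keep only approved claims, then total the amounts per region.
    static Map<String, Double> totalApprovedByRegion(List<Claim> claims) {
        return claims.stream()
                .filter(c -> c.status().equals("APPROVED"))
                .collect(Collectors.groupingBy(Claim::region,
                        Collectors.summingDouble(Claim::amount)));
    }

    public static void main(String[] args) {
        List<Claim> claims = List.of(
                new Claim("north", "APPROVED", 100.0),
                new Claim("north", "DENIED", 50.0),
                new Claim("south", "APPROVED", 75.0));
        System.out.println(totalApprovedByRegion(claims));
    }
}
```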

What is most valuable?

The elasticity of the solution is excellent.

The storage, etc., can be scaled up quite easily when we need it to.

It's easy to increase performance as required.

The solution runs on Spark very well.

What needs improvement?

Instead of relying on a massive instance, the solution should offer micro partition levels. They're working on it, however, they need to implement it to help the solution run more effectively.

They're currently coming out with a new feature, Delta Lake. It will come with a new layer of data compliance.

For how long have I used the solution?

We've been using the solution for two years.

What do I think about the stability of the solution?

I don't see any stability issues, down to the cluster level. It will certainly be fine if it's maintained. It's highly available; even if nodes are dropped, it will still be up and running. I would describe it as very reliable. We don't have issues with crashing, and there aren't bugs or glitches that affect the way it works.

What do I think about the scalability of the solution?

The system is extremely scalable. It's one of its greatest features and a big selling point. If a company needs to scale or expand, they can do so very easily.

We use the solution daily, even though we don't directly work with Databricks on a day-to-day basis. Because we schedule everything we need and it triggers the work that needs to be done, it's used often. Do you need to log into the Databricks console every day? No. You just need to configure it one time, and then it will deliver everything needed in the time required.

How are customer service and technical support?

We use Microsoft support, so we are enterprise customers for them. We raise a service request for Databricks, however, we use Microsoft. Overall, we've been satisfied with the support we've been given. They're responsive to our needs.

Which solution did I use previously and why did I switch?

We work with multiple clients and this solution is just one of the examples of products we work with. We use several others as well, depending on the client.

They're all wrappers around the same underlying open-source systems, for example, Spark. We've worked with those systems directly as well as with the wrappers around them, whether the company was labeled Databricks, IBM BigInsights, Cloudera, etc. These wrappers are all built on the same open-source core.

If we go with Azure, we take Databricks from there. Otherwise, we would have to create VMs separately. Those things are not needed because Azure already provides them for us.

How was the initial setup?

The situation may have been a bit different for me than for many users or organizations. I've been in this industry for more than 15 or 17 years. I have a lot of experience. I also took the time to do some research and preparation for the setup. It was straightforward for me.

The deployment with Microsoft usually can be done in 20 minutes. However, it can take 40 to 45 minutes to complete. An organization only requires one person to upload the data and have complete access to the account.

What about the implementation team?

I deployed the solution myself. I didn't require any assistance, so I didn't enlist any resellers or consultants to help with the process.

What's my experience with pricing, setup cost, and licensing?

The solution is expensive. It's not like a lot of competitors, which are open-source.

What other advice do I have?

There isn't really a version, per se. 

It's a popular service, and I'd recommend the solution. It is cloud-agnostic right now, so it can really go into any cloud. It's the users leveraging installed environments who can have these services, no matter whether they are using Azure, Ubiquiti, or other systems.

I don't think you can find any other tool or service that is faster than Databricks right now. It's your best option.

Overall, I'd rate the solution eight out of ten. The reason I'm not giving it full marks is that it's expensive compared to open source alternatives. Also, the configuration is difficult, so sometimes you need to spend a couple of hours to get it right.

Which deployment model are you using for this solution?

Public Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Partner / Head of Data & Analytics at a computer software company with 11-50 employees
Real User
Top 10
Gives us low latency for fast, real-time data, with useful alerts for live data processing
Pros and Cons
  • "The top feature of Apache Flink is its low latency for fast, real-time data. Another great feature is the real-time indicators and alerts which make a big difference when it comes to data processing and analysis."
  • "One way to improve Flink would be to enhance integration between different ecosystems. For example, there could be more integration with other big data vendors and platforms similar in scope to how Apache Flink works with Cloudera. Apache Flink is a part of the same ecosystem as Cloudera, and for batch processing it's actually very useful but for real-time processing there could be more development with regards to the big data capabilities amongst the various ecosystems out there."

What is our primary use case?

We use Apache Flink to monitor the network consumption for mobile data in fast, real-time data architectures in Mexico. The projects we get from clients are typically quite large, and there are around 100 users using Apache Flink currently.

For maintenance and deployment, we split our team into two squads, with one squad that takes care of the data architecture and the other squad that handles the data analysis technology. Each squad is three members each.

What is most valuable?

The top feature of Apache Flink is its low latency for fast, real-time data. Another great feature is the real-time indicators and alerts which make a big difference when it comes to data processing and analysis.
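As a stdlib-only sketch of the alerting idea (not the Flink API; in Flink this would typically be a keyed DataStream with a process function, and the threshold here is invented): scan consumption readings and emit an alert when one crosses a threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative threshold alerting over a stream of data-consumption readings.
// Stdlib-only sketch of the concept, not actual Flink code.
public class ConsumptionAlerts {

    // Return one alert message per reading that exceeds the threshold.
    static List<String> alerts(List<Double> megabytesPerMinute, double threshold) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < megabytesPerMinute.size(); i++) {
            if (megabytesPerMinute.get(i) > threshold) {
                out.add("reading " + i + " above threshold: " + megabytesPerMinute.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One reading (95.0) exceeds the illustrative 50.0 MB/min threshold.
        System.out.println(alerts(List.of(10.0, 95.0, 20.0), 50.0));
    }
}
```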

What needs improvement?

One way to improve Flink would be to enhance integration between different ecosystems. For example, there could be more integration with other big data vendors and platforms similar in scope to how Apache Flink works with Cloudera. Apache Flink is a part of the same ecosystem as Cloudera, and for batch processing it's actually very useful but for real-time processing there could be more development with regards to the big data capabilities amongst the various ecosystems out there.

I am also looking for more possibilities in terms of what can be implemented in containers and not in Kubernetes. I think our architecture would work really great with more options available to us in this sense.

Finally, it's a challenge to find people with the appropriate skills for using Flink. There are a lot of people who know what should be done better in big data systems, but there are still very few people with Flink capabilities.

For how long have I used the solution?

I've been using Apache Flink for about one year.

What do I think about the stability of the solution?

We have not really had many issues. 

What do I think about the scalability of the solution?

Scaling Apache Flink is easily done because we use Kubernetes and containers.
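With the deployment on Kubernetes, scaling the Flink workers can be as simple as changing a replica count. The deployment name below is illustrative:

```shell
# Add capacity by running more task manager pods (name is illustrative).
kubectl scale deployment flink-taskmanager --replicas=5
```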

How are customer service and technical support?

I can't comment on Apache Flink's technical support but I feel that the documentation is complete and adequate for our needs when doing configuration or solving technical issues.

How was the initial setup?

The setup was complex. The most challenging part of it was identifying how to realize the real-time low latency, fail-over, and high availability within our container and Kubernetes architecture. The configuration of all of this was not simple and it took about a month to get fully set up.

What about the implementation team?

We have two squads in our company that manage the implementation. One squad takes care of the data architecture and the other squad handles the data analysis technology.

What's my experience with pricing, setup cost, and licensing?

Apache Flink is open source so we pay no licensing for the use of the software.

Which other solutions did I evaluate?

Our clients had previously compared Apache Flink with Apache Spark and Apache Spark Streaming. The main advantage of Flink in comparison is that Flink handles complex processing better.

What other advice do I have?

My advice to others when using Apache Flink is to hire good people to manage it. When you have the right team, it's very easy to operate and scale big data platforms.

I would rate Apache Flink a nine out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner