Apache Flink Reviews and Pricing

ZHIZHENG

Product Operations Manager at OKX

Mar 16, 2023

Download

Great data streaming tool but documentation needs to be more accessible

Pros and Cons

"Apache Flink's best feature is its data streaming tool."

"Apache Flink's documentation should be available in more languages."

What is most valuable?

Apache Flink's best feature is its data streaming tool.

What needs improvement?

Apache Flink's documentation should be available in more languages.

For how long have I used the solution?

I've been using Apache Flink for four to five years.

What do I think about the stability of the solution?

Apache Flink is quite stable.

Buyer's Guide

Apache Flink

July 2025

Free Report: Apache Flink Reviews and More

Learn what your peers think about Apache Flink. Get advice and tips from experienced pros sharing their opinions. Updated: July 2025.

DOWNLOAD NOW

864,155 professionals have used our research since 2012.

What do I think about the scalability of the solution?

Apache Flink is scalable.

How was the initial setup?

The initial setup was easy, and the cloud-based version takes no more than a minute to deploy.

What other advice do I have?

I would recommend Apache Flink to other users and rate it seven out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Armando Becerril

Partner / Head of Data & Analytics at Intelligence Software Consulting

Apr 27, 2021

Download

Gives us low latency for fast, real-time data, with useful alerts for live data processing

Pros and Cons

"The top feature of Apache Flink is its low latency for fast, real-time data. Another great feature is the real-time indicators and alerts which make a big difference when it comes to data processing and analysis."

"One way to improve Flink would be to enhance integration between different ecosystems. For example, there could be more integration with other big data vendors and platforms similar in scope to how Apache Flink works with Cloudera. Apache Flink is a part of the same ecosystem as Cloudera, and for batch processing it's actually very useful but for real-time processing there could be more development with regards to the big data capabilities amongst the various ecosystems out there."

What is our primary use case?

We use Apache Flink to monitor the network consumption for mobile data in fast, real-time data architectures in Mexico. The projects we get from clients are typically quite large, and there are around 100 users using Apache Flink currently.

For maintenance and deployment, we split our team into two squads, with one squad that takes care of the data architecture and the other squad that handles the data analysis technology. Each squad is three members each.

What is most valuable?

The top feature of Apache Flink is its low latency for fast, real-time data. Another great feature is the real-time indicators and alerts which make a big difference when it comes to data processing and analysis.

What needs improvement?

One way to improve Flink would be to enhance integration between different ecosystems. For example, there could be more integration with other big data vendors and platforms similar in scope to how Apache Flink works with Cloudera. Apache Flink is a part of the same ecosystem as Cloudera, and for batch processing it's actually very useful but for real-time processing there could be more development with regards to the big data capabilities amongst the various ecosystems out there.

I am also looking for more possibilities in terms of what can be implemented in containers and not in Kubernetes. I think our architecture would work really great with more options available to us in this sense.

Finally, it's a challenge to find people with the appropriate skills for using Flink. There are a lot of people who know what should be done better in big data systems, but there are still very few people with Flink capabilities.

For how long have I used the solution?

I've been using Apache Flink for about one year.

What do I think about the stability of the solution?

We have not really had many issues.

What do I think about the scalability of the solution?

Scaling Apache Flink is easily done because we use Kubernetes and containers.

How are customer service and technical support?

I can't comment on Apache Flink's technical support but I feel that the documentation is complete and adequate for our needs when doing configuration or solving technical issues.

How was the initial setup?

The setup was complex. The most challenging part of it was identifying how to realize the real-time low latency, fail-over, and high availability within our container and Kubernetes architecture. The configuration of all of this was not simple and it took about a month to get fully set up.

What about the implementation team?

We have two squads in our company that manage the implementation. One squad takes care of the data architecture and the other squad handles the data analysis technology.

What's my experience with pricing, setup cost, and licensing?

Apache Flink is open source so we pay no licensing for the use of the software.

Which other solutions did I evaluate?

Our clients had previously compared Apache Flink with Apache Spark and Apache Spark Streaming. The main advantage of Flink in comparison is that Flink handles complex processing better.

What other advice do I have?

My advice to others when using Apache Flink is to hire good people to manage it. When you have the right team, it's very easy to operate and scale big data platforms.

I would rate Apache Flink a nine out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company has a business relationship with this vendor other than being a customer. Partner

Buyer's Guide

Apache Flink

July 2025

Free Report: Apache Flink Reviews and More

Learn what your peers think about Apache Flink. Get advice and tips from experienced pros sharing their opinions. Updated: July 2025.

DOWNLOAD NOW

864,155 professionals have used our research since 2012.

reviewer1450275

Software Development Engineer III at a tech services company with 5,001-10,000 employees

Nov 10, 2020

Download

Provides truly real-time data streaming with better control over resources; ML library could be more flexible

Pros and Cons

"This is truly a real-time solution."

"The machine learning library is not very flexible."

What is our primary use case?

My company is a cab aggregator, similar to Uber in terms of scale as well. Just like Uber, we have two sources of real-time events. One is the customer mobile app, one is the driver mobile app. We get a lot of events from both of these sources, and there are a lot of things which you have to process in real-time and that is our primary use case of Flink. It includes things like surge pricing, where you might have a lot of people wanting to book a cab so the price increases and if there are fewer people, the price drops. All that needs to be done quickly and precisely. We also need to process events from drivers' mobile phones and calculate distances. It all requires a lot of data to be processed very quickly and in real-time.

How has it helped my organization?

The end-to-end latency was drastically reduced, and our capability of handling high throughput has increased by using Flink. It provides a lot of functionality with its windows and maps and that gives us a lot of extra features and power that other frameworks don't have. The solution has helped us by enabling a lot of creative features so we are now able to detect if something abnormal is happening, like a driver has deviated from the set route or the car has not moved for a long period of time, all in real time. Being able to check this has led to more secure rides for our customers.

What is most valuable?

The most valuable feature of Apache Flink would be that it is truly real-time. Unlike Spark and other technologies, it's not recurring batch processing and it also allows me better control over resources. For example, in Spark, it's very difficult to create multiple parallel streams and it consumes the memory of your entire cluster very greedily. With Flink, I have very good control, can choose the number of task managers with a fixed amount of memory, and configured parallelism. This flexibility is very useful in scaling of Flink.

What needs improvement?

Flink has become a lot more stable but the machine learning library is still not very flexible. There are some models which are not able to plug and play. In order to use some of the libraries and models, I need to have a Python library because there might be some pre-processing or post-processing requirements, or to even parse and use the models. The lack of Python support is something they can maybe work on in the future.

For how long have I used the solution?

I've been using this solution for two years.

What do I think about the stability of the solution?

The solution has become a lot more stable over time. We have around 10-12 users and most are software developers. Even if we are running our task managers on cheap servers, we make sure that our job manager definitely runs on a very expensive server, which never goes down. Things remain more stable that way. We're a large company and have teams dedicated to dealing with the infrastructure and taking care of the maintenance and infra, making sure that jobs runs smoothly. A small company could do the maintenance itself.

What do I think about the scalability of the solution?

It's a good product and allows me to scale very easily. If I'm getting more and more data, I can very easily increase the memory allocated for every task manager, and increase the number of parallelism. We are increasing usage of Flink as much as possible. There are some things that we still run on Spark, but whenever we need to scale, have easy resource management, and a more real-time streaming solution, we usually now go for Flink. We have never faced any scalability issues and we are running at a very high profile. I think we are yet to reach the scaling limits of Flink.

How are customer service and technical support?

We have never used Apache's tech support. We usually just Google for our questions. If we don't get the answers directly from Google, we go through the documentation, which is comprehensive, and usually find our answers there.

Which solution did I use previously and why did I switch?

We previously used Spark for streaming, but not for real time applications. We have moved some of our services from Spark to Flink. We also use Kafka extensively, but that is mostly for asynchronous communication between different services. Kafka is a totally different use case. You cannot substitute it with Flink. Overall in terms of streaming, we have used Spark, Kafka and Flink.

How was the initial setup?

The initial setup was not very straightforward because compared to other frameworks, Flink is quite new. There isn't yet a good online community, online blogs, or guidance. You have to rely more or less on their documentation for everything. Even if you go to Stack Overflow, for Spark, you will get lots of questions and answers which will help you. With Flink, you have to actually read a lot. It's not as straightforward as other frameworks.

What's my experience with pricing, setup cost, and licensing?

We have only used the open-source version of Flink.

What other advice do I have?

My advice would be to make sure you understand your requirements, flink's architecture, how it works and whether it is the right solution for you. They provide very good documentation which is useful. The solution isn't suitable for every case and it may be that Spark or some other framework is more suitable. If you are a major company that cannot afford any downtime, and given that Flink is a relatively new technology, it might be worthwhile investing in the monitoring. That would include writing scripts for monitoring and making sure that the throughput of the applications is always steady. Make sure your monitoring and your SOPs around monitoring, are in place.

I would rate this solution a seven out of 10.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Vinod Iyer

Principal Software Engineer at a tech services company with 1,001-5,000 employees

Oct 30, 2020

Download

Offers good API extraction and in-memory state management

Pros and Cons

"Apache Flink is meant for low latency applications. You take one event opposite if you want to maintain a certain state. When another event comes and you want to associate those events together, in-memory state management was a key feature for us."

"In terms of improvement, there should be better reporting. You can integrate with reporting solutions but Flink doesn't offer it themselves."

What is our primary use case?

The last POC we did was for map-making. I work for a map-making company. India is one ADR and you have states within, you have districts within, and you have cities within. There are certain hierarchical areas. When you go to Google and when you search for a city within India, you would see the entire hierarchy. It's falls in India.

We get third party sources, government sources, or we get it from different sources, if we can. We get the data, and this data is geometry. It's not a straightforward index. If we get raw geometry, we will get the entire map and the layout.

We do geometry processing. Our POC was more of processing geometry in a distributed way. The exploration that I did was more about distributing this geometry and breaking this big geometry.

How has it helped my organization?

Flink moved on to becoming a standard technology for location platform. There's only one location platform available right now by an open location platform. That platform leverages Flink. Because Flink is the component for streaming screen property. It's an optic whose data extends within the organization. Anywhere we need low latency applications, we use Flink.

What is most valuable?

Apache Flink is meant for low latency applications. You can take one event opposite if you want to maintain a certain state. When another event comes and you want to associate those events together, in-memory state management is a key feature for this.

Checkpointing was important since we have the consumption done by Kafka. There was a continuous pool of data coming in from cars and which was put into Kafka. This particular Apache Flink component came in and it started processing it. When there's a failure or something effective, checkpointing is very important.

It also helped us in exactly one standard. Another valuable feature is API extraction, which is nicely done. Anyone can understand it. It's not very complex. Anyone can go through all the transformations and everything they have. It's easy to use that. It's a well-balanced extraction.

What needs improvement?

In terms of improvement, there should be better reporting. You can integrate with reporting solutions but Flink doesn't offer it themselves.

They're more about the processing side. Low latency processing is out of their scope. As ar as low latency is concerned, you can integrate to other backend solutions as well. They have that flexibility. APIs are good enough. Its in-memory is so fast, you could have faster-developed data and stuff like that.

What do I think about the stability of the solution?

The stability was good enough. There are a few issues that were application dependent. From the processing standpoint, it did what it was expected to do as such, there were a few issues with Python integration like checkpointing. The checkpointing was not done properly at times, but again that was more about integration going faster and also optimizing our checkpointing intervals and stuff like that. As Flink is concerned, they have good checkpointing and safe points.

There were 50 developers and DevOps working on it.

How are customer service and technical support?

I was on the DevOps side. Support was all driven from Chicago. There was a different team in Chicago who was driving all this stuff. I was a completely hands-on developer. My interaction was more from using the API and developing applications. I don't need to use support. Flink was straightforward.

How was the initial setup?

The deployment can be declared on any kind of distributed managers. I haven't used it, but that's a good option that you could even integrate it within APIs. This adds flexibility to it.

I was not part of the deployment when it was initially done. When I came into the picture, it was more about the API. We had already started using it at the application level at the organization. initially, when it was done in my previous organization, it was an earlier version of Flink. I think they started off in 2016 and there might have been some glitches or some technical issues. When I came in it was pretty smooth. I didn't find any issues and really I hopped into Flink.

Which other solutions did I evaluate?

We also looked at Spark Streaming versus Apache. Spark Streaming is not real-time. That's where we understood that Flink is good enough when you want to have real-time Processing. That's the only process that we have right now and Spark Streaming is more of a big data set.

If you want real-deal real-time processing, you have to invest in Flink but part of it is more of when you use Flink, you have everything stored. You store the state also in-memory so you add up the cost of using that engine compared to PAC streaming. It's not mandatory it's up to the application but if you want to really have real-time processing if you want to store the state and if you really want to have a low latency application, that's when I would go with Flink. Whereas Spark streaming would be more of whenever it's okay to have a bit of a delay like really low latency applications.

Flink gives you flexibility. The reason we chose Spark was because people in our company were already familiar with it. We haven't started working on it yet because it's a half-done POC.

What other advice do I have?

Flink is really simple and simple to adopt. You can use any backend state management tools, like DB or something of that sort. it has the visibility to integrate with different technologies, that's also very important. It's pretty welded and I believe for low latency. The API is pretty well written that way to support you.

I would rate Apache Flink an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Hitesh Baid

Lead Software Engineer at a tech services company with 5,001-10,000 employees

Oct 19, 2020

Download

Drastically reduces the turnaround/ processing time, Documentation is in depth and most of the things are straight forward.

Pros and Cons

"The event processing function is the most useful or the most used function. The filter function and the mapping function are also very useful because we have a lot of data to transform. For example, we store a lot of information about a person, and when we want to retrieve this person's details, we need all the details. In the map function, we can actually map all persons based on their age group. That's why the mapping function is very useful. We can really get a lot of events, and then we keep on doing what we need to do."

"The TimeWindow feature is a bit tricky. The timing of the content and the windowing is a bit changed in 1.11. They have introduced watermarks. A watermark is basically associating every data with a timestamp. The timestamp could be anything, and we can provide the timestamp. So, whenever I receive a tweet, I can actually assign a timestamp, like what time did I get that tweet. The watermark helps us to uniquely identify the data. Watermarks are tricky if you use multiple events in the pipeline. For example, you have three resources from different locations, and you want to combine all those inputs and also perform some kind of logic. When you have more than one input screen and you want to collect all the information together, you have to apply TimeWindow all. That means that all the events from the upstream or from the up sources should be in that TimeWindow, and they were coming back. Internally, it is a batch of events that may be getting collected every five minutes or whatever timing is given. Sometimes, the use case for TimeWindow is a bit tricky. It depends on the application as well as on how people have given this TimeWindow. This kind of documentation is not updated. Even the test case documentation is a bit wrong. It doesn't work. Flink has updated the version of Apache Flink, but they have not updated the testing documentation. Therefore, I have to manually understand it. We have also been exploring failure handling. I was looking into changelogs for which they have posted the future plans and what are they going to deliver. We have two concerns regarding this, which have been noted down. I hope in the future that they will provide this functionality. Integration of Apache Flink with other metric services or failure handling data tools needs some kind of update or its in-depth knowledge is required in the documentation. We have a use case where we want to actually analyze or get analytics about how much data we process and how many failures we have. For that, we need to use Tomcat, which is an analytics tool for implementing counters. We can manage reports in the analyzer. This kind of integration is pretty much straightforward. They say that people must be well familiar with all the things before using this type of integration. They have given this complete file, which you can update, but it took some time. There is a learning curve with it, which consumed a lot of time. It is evolving to a newer version, but the documentation is not demonstrating that update. The documentation is not well incorporated. Hopefully, these things will get resolved now that they are implementing it. Failure is another area where it is a bit rigid or not that flexible. We never use this for scaling because complexity is very high in case of a failure. Processing and providing the scaled data back to Apache Flink is a bit challenging. They have this concept of offsetting, which could be simplified."

What is our primary use case?

Services that need real-time and fast updates as well as lot of data to process, flink is the way to go. Apache Flink with kubernetes is a good combination. Lots of data transformation grouping, keying, state mangements are some of the features of Flink. My use case is to provide faster and latest data as soon as possible in real time.

How has it helped my organization?

The main advantage is the turnaround time, which has been reduced drastically because of Apache Flink. Earlier, it used to take a lot of processing time but now things have changed, and everything is in almost real time. We get the latest data in a very less time. There is no waiting or lag of data in the application, time has been one of the important factors.

The other factor is memory. The utilization of the machine has been more efficient since we started to use this solution. The big data applications definitely use a large group of machines to process the data. These machines are not optimally utilized. Some of the machines might not have been required, but they still take hold of the resources. In Kubernetes, we can provide resources, and in Apache Flink, there is a configuration where you can do the deployment in combination with a single cluster node. Scalability is quite flexible in flink with task managers and resource configuration.

What is most valuable?

MapFunction, FilterFunction are the most useful or the most used function in Flink. Data transformation becomes easy for example, Applications that store information about people and when they want to retrieve those person's details in some kind of relation, in the map function, we can actually filter all persons based on their age group. That's why the mapping function is very useful. This could be helpful in analytics to target specific news to specific age group.

What needs improvement?

TimeWindow feature. The timing of the content and the windowing is a bit changed in 1.11. They have introduced watermarks.

Watermark is basically associating data in the stream with a timestamp. Documentation can be referred. They have updated rest of the documentaion but not the testing documentation. Therefore, We have to manually try and understand few concepts.

Integration of Apache Flink with other metric services or failure handling data tools needs some kind of update or its in-depth knowledge is expected before integrating. Consider a use case where you want to actually analyze or get analytics about how much data you have processed and how many failed? Prometheus is one of the common metric tools out of the box supported by flink, along with other metric services. The documentation is straight forward. There is a learning curve with metric services, which can consume a lot of time, if not well versed with those tools.

Failure handling basic documentation is provided by flink, like restart on task failure, fixed delay restart...etc.

For how long have I used the solution?

I have been using Apache Flink for almost nine months.

What do I think about the stability of the solution?

The uptime of our services has increased, resources are better utilized, restarts is automated on failures, alerts are triggered when infrastructure breaches threshold and application failure metrics are logged, due to which the application has become robust, scaling is something which can be tweaked as per usage of application, business rules. Also Flink parameters are configurable, altogether it made our application more stable and maintainable.

What do I think about the scalability of the solution?

I haven't actually hit a lot of performance or stress test to see the scaling.
There is no detailed documentation for scaling. There is no fixed solution as well, it depends on use cases, there are rest APIs to scale your task dynamically in Flink Documentation, haven't personally tried it yet.

How are customer service and technical support?

I haven't actually used their technical support yet.

Which solution did I use previously and why did I switch?

I have tried batch processing, It was not that much effective in my use case. It was time consuming and not real time solution.

How was the initial setup?

The initial setup is straightforward. A new joined person (with some experience in software industry) would not find it that much difficult to understand and then contribute to the application. When you want to start writing the code for it, then things get tricky and more complex because if you are not involved with what Apache Flink is providing and what you need to do, it is very difficult. If followed the documentation, there are various examples provided and different deployment strategies as well.

What was our ROI?

It is good solution for time and cost saving.

What's my experience with pricing, setup cost, and licensing?

Being open source licensed, cost is not a factor. The community is strong to support.

What other advice do I have?

To get your hands wet on streaming or big data processing applications, to understand the basic concepts of big data processing and how complex analytics or complications can be made simple. For eg: If you want to analyze tweets or patterns, its a simple use case where you just use flink-twitter-connector and provide that as your input source to Flink. The stream of random tweets keeps on coming and then you can apply your own grouping, keying, filtering logic to understand their concepts.

An important thing I learned while using flink, is basic concepts of windowing, transformation, Data Stream API should be clear, or atleast be aware of what is going to be used in your application, or else you might end up increasing the time rather than decreasing. You should also understand your data, process, pipeline, flow, Is flink the right candidate for your architecture or an over kill?
It is flexible and powerful is all I can say.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Sandesh Deshmane

Software Architect at a tech vendor with 501-1,000 employees

Oct 15, 2020

Download

Provides out-of-the-box checkpointing and state management

Pros and Cons

"With Flink, it provides out-of-the-box checkpointing and state management. It helps us in that way. When Storm used to restart, sometimes we would lose messages. With Flink, it provides guaranteed message processing, which helped us. It also helped us with maintenance or restarts."

"The state maintains checkpoints and they use RocksDB or S3. They are good but sometimes the performance is affected when you use RocksDB for checkpointing."

What is our primary use case?

We have our own infrastructure on AWS. We deploy Flink on Kubernetes Cluster in AWS. The Kubernetes cluster is managed by our internal Devops team.

We also use Apache Kafka. That is where we get our event streams. We get millions of events through Kafka. There are more than 300K to 500K events per second that we get through that channel.

We aggregate the events and generate reporting metrics based on the actual events that are recorded. There are certain real-time high-volume events that are coming through Kafka which are like any other stream. We use Flink for aggregation purposes in this case. So we read this high volume events from Kafka and then we aggregate. There is a lot of business logic running behind the scenes. We use Flink to aggregate those messages and send the result to a database so that our API layer or BI users can directly read from database.

How has it helped my organization?

Flink has improved my organisation by enabling us to become independent of Redis which is used as an intermediate caching layer with Apache Storm for aggregation. Redis was a bottleneck. With an increasing number of messages, Redis was becoming full and also had a higher chance of errors because we were doing the checkpoints and state management manually.

With Flink, it provides out-of-the-box checkpointing and state management. It helps us in that way. When Storm used to restart, sometimes we would lose messages or intermediate state . With Flink, it provides guaranteed message processing, which helped us. It also helped us with application maintenance/deployments and restarts.

What is most valuable?

When we use the Flink streaming pipeline, the first thing we use is the Windowing mechanism with event time feature. So as that happens, the Flink aggregation is very easy. The next thing is we were using Apache Storm. Apache Storm is stateless, and Apache Flink is stateful. With Apache Storm, we were supposed to use an intermediate distributed cache. But because we use Flink, and it is stateful, we can manage the state or failure mechanism. The result is that we do aggregation every 10 minutes and we do not need to worry about our application stopping in between those 10 minutes and then restarting.

When we were using Storm, we used to manage all of it ourselves. We created manual checkpoints in Redis, but with Flink, it supports inbuilt features like checkpointing and statefulness. There is event time or author time that you can have for your messages.

Another important thing is the out-of-order message processing. When you use any streaming mechanism, there is a chance that your source is producing messages that are out-of-order. When you build a state machine, it's very important that you can have the messages in order, so that your computations/results are correct. What happens with Storm or any other framework that you're using is that to get messages in order, you have to use an intermediate Redis cache, and then sort the messages. When we use Flink, it has an inbuilt way to have the messages in order, and we can process them. It saves a lot of time and a lot of code.

I have written both Storm and Flink code. With Storm, I used to write a lot of code, hundreds of lines but with Flink, it's less, around 50 to 60 lines. I don't need to use Redis to do the intermediate cache. There is a lot of code that is being saved. I have to aggregate around 10 minutes and there is an inbuilt mechanism. With Storm, I need to write out logic and then I need to write a bunch of connected bolts and intermediate Redis. The same code that would take me one week to write in Storm, I could do that same thing in a couple of days with Flink.

I started with Flink five to six years ago for one use case and the community support and documentation were not good at that time. When we started back again in 2019, we saw that documentation and community support were good.

What needs improvement?

The state maintains checkpoints and they use RocksDB or S3. They are good but sometimes the performance is affected when you use RocksDB for checkpointing.

We can write python bolts/applications inside Apache Storm Code and it supports Python as a programming language, but with Flink, the Python support is not that great. When we do machine learning, data science, or ML work, we want to integrate the data science or machine learning pipeline with our real-time pipeline and most of the data science or machine learning work is in Python.

It was very easy with Storm. Storm supports native Python language, so integration was easy. But Flink is mostly Java. The integration of Python with Java is difficult, so it's not direct integration. We need to find an alternative way. We created an API layer in between so the Java and Python layers were communicating by using an API. We just called data science models or ML models using the API which runs in Python while Flink runs in Java. We would like to see improvement where we can have another way to run it. Currently, it's there, but it's not that great. This is an area that we would like to see improvement.

For how long have I used the solution?

I have been using Apache Flink for one-and-a-half years now.

What do I think about the stability of the solution?

Stability-wise, it's good and stable. We do aggregations on data streams from received Kafka. Flink Application connects to multiple Kafka topic and read the data. The number of messages generated in Kafka are very high . Sometimes in production, we see some glitches, where data is mismatched. Our Flink runs on Kubernetes Cluster , so sometime when worker node crashes or application restarts we see mismatch in aggregation results.

We are yet to verify whether it's a problem with the Flink framework or it's a problem with the code which does aggregation and checkpointing. We are yet to figure out whether the data is lost when a worker nodes crashes or we restart Flink application, or there is a problem with the way we have done the implementation. This problem is intermittent not always replicated.

What do I think about the scalability of the solution?

It's easy to scale because it supports Docker. Once you have Docker/Containers, you can deploy it on Kubernetes or any other container Orchestrator. So scalability-wise, it's good, you can just launch the cluster. When you have an automated cluster launching mechanism, you can easily scale up and down.

So far, there are close to 10 users who use Flink and most of them are software engineers, senior software engineers, DevOps guys, DevOps architect, and a Cloud architect.

Most of our work was on Storm but we saw improvement with Flink. So we have moved one business application. We have a couple of other main business applications or a data pipeline and we would like to move that as well.

How are customer service and technical support?

We have not used technical support. There are good forums and community support.

Which solution did I use previously and why did I switch?

We switched from Storm to Flink. We looked at Apache Spark Streaming as well, but some of the use cases were better in Apache Flink. We chose Flink over Spark Streaming and Kafka Streams. We thought Flink was better and so we went with it.

Spark is micro-batch but this Flink offers complete streaming. Memory management with Apache Spark is not that great, but Flink has automatic memory management. For our use case, we found Flink is faster as compared to Spark. The windowing mechanism that Flink provides is better than Spark.

How was the initial setup?

In terms of the implementation, we initially set up our development instances for Mac, which was easy. We have the documentation available. For the setup, when we wanted to move it to production, it provided the setup on Kubernetes. That Kubernetes setup is a little bit complicated. You need a person who understands Kubernetes well. A developer alone cannot do it. When you want to take it to production, the setup on Kubernetes using Docker is a little bit complicated. We need something like a one-click deployment script that can launch the cluster so that you can then do it.

In another case, we used AWS. There is Flink support in AWS EMR that we could have readily. It's a manged service and so it was easier for us. We don't need to bother with launching the cluster and running our workload. When we have to manage our own cluster using Kubernetes and Flink, it's a little bit complicated. There are a bunch of manual steps that need to be done.

Moving to production, we did EMR a couple of days. But for the Kubernetes cluster setup, it took us two to three weeks. The setup required a couple of team members from the DevOps team and engineering side.

In terms of our deployment strategy, we were already using the Kubernetes cluster for most of the use cases, and we wanted to use the same Kubernetes cluster. The first thing we wanted to do is Dockerize the application that we were running and then use the same Kubernetes cluster or create a separate workspace in that and use it.

What about the implementation team?

We did the deployment ourselves. We have a team of three or four DevOps guys who manage our Kubernetes cluster.

For the deployment, we needed one or two guys and for development, we are three to four people. We had a lot of other business applications that are in Flink.

Which other solutions did I evaluate?

Apache Storm, Spark Streaming , Kafka Streams

What other advice do I have?

My advice would be to validate your use case. If you are using already a streaming mechanism, I suggest that you validate what your actual use cases are and what the advantages of Flink are. Make sure that the use case that you are trying can be done by Flink. If you're doing simple aggregation and you don't want to worry about the message order then it's fine. You can use Storm or whatever you are using. If you see features that are there and are useful for you, then you should go for Flink.

Validate your use case, validate your data and pipeline, do a small POC, and see if it is useful. If you think it's useful and worth doing a migration from your existing solution, then go for it. But if you don't already have a solution and Flink will be your first one, then it's always better to use Flink.

The biggest lesson I have learned is that the deployment using Kubernetes was a little bit difficult. We did not evaluate when we started the work, so we migrated on the code part, but we did not take on the deployment part. Initially, if we would have seen the deployment part, then we could have chosen Kafka Streams as well because we were getting a similar result, but on the deployment side, Kafka Streams was easy. You don't need to worry about the cluster.

I would rate Apache Flink an eight out of ten. I would have given it a nine or so if it wasn't for that the deployment on Kubernetes is a little bit complicated.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Jyala Rahul Jyala

Sr Software Engineer at a tech vendor with 10,001+ employees

Oct 13, 2020

Download

Good documentation, API support, and metrics, but it only has partial Python support

Pros and Cons

"The documentation is very good."

"We have a machine learning team that works with Python, but Apache Flink does not have full support for the language."

What is our primary use case?

We are using Flink as a pipeline for data cleaning. We are not using all of the features of Flink. Rather, we are using Flink Runner on top of Apache Beam.

We are a CRM product-based company and we have a lot of customers that we provide our CRM for. We like to give them as much insight as we can, based on their activities. This includes how many transitions they do over a particular time. We do have other services, including machine learning, and so far, the resulting data is not very clean. This means that you have to clean it up manually. In real-time, working with Big Data in this circumstance is not very good.

We use Apache Flink with Apache Beam as part of our data cleaning pipeline. It is able to perform data normalization and other features for clearing the data, which ultimately provides the customer with the feedback that they want. We also have a separate machine learning feature that is available, which can be optionally purchased by the customer.

How has it helped my organization?

We have a set of pipeline services that we run. For example, we might use Apache Beam for running a four-hour service, and we use Flink to run it. You can run any job using Flink, including an Apache Spark job.

We have many systems including Elasticsearch Database, MongoDB, and other services. Based on what we have running, we want to clean and transform some of our data.

Currently, we have two implementations of Flink and one of them is running Kafka, whereas the other one is Cassandra. Based on that, we process all of the things that we want and if it's streaming then we used Kafka, whereas if it is batch then we use Cassandra. The result of all of these services is that it can provide a much better user experience.

What is most valuable?

The most valuable feature is that there is no distinction between batch and streaming data. When we want to use batch mode, we use Apache Spark. The problem with Spark is that when it comes to time-series data, it does not train well. With Flink, however, we can have the streaming capability that we want.

The documentation is very good.

A lot of metrics are supported and there is also logging capability.

There is API support.

What needs improvement?

We have a machine learning team that works with Python, but Apache Flink does not have full support for the language. We needed to use Java to implement some of our job posting pipelines.

For how long have I used the solution?

We have been using Apache Flink for between one and one and a half years.

What do I think about the stability of the solution?

Stability is pretty good and we haven't had any problem with it.

We are using this product extensively and we have new products being onboarded.

What do I think about the scalability of the solution?

Apache Flink scales well. As long as we are using Kubernetes, we are happy to scale as much as you want.

We have a data team with between twenty and twenty-five people. It is split into two groups where the first group works on reporting, machine learning, and background operations. The second group works with Big Data.

How are customer service and technical support?

We have not used technical support from Apache.

Community support is available.

Which solution did I use previously and why did I switch?

Prior to Flink, we used Apache Spark.

We had to move to Flink because of the streaming capabilities that it has. In our architecture, we have one layer for batch processing and the other for streaming. This is quite a pain for us because we don't want to have two separate jobs to handle both streaming and batch processing. Using Flink, we are able to utilize the API and handle both of these jobs.

How was the initial setup?

The complexity of the initial setup is dependent on your use cases and what it is that you are trying to achieve. I found that we didn't have any problem with it.

This product can be deployed on-premises or as a SaaS on the cloud. It depends on the requirements of the customer.

The deployment using Kubernetes takes approximately 30 minutes to complete.

What about the implementation team?

Our in-house team is responsible for scaling and other maintenance. There is very good documentation available for this.

What's my experience with pricing, setup cost, and licensing?

This is an open-source platform that can be used free of charge.

Which other solutions did I evaluate?

We got to learn about Apache Flink through using Apache Beam. Originally, I did not know very much about Flink. The problem with Apache Beam is that you cannot run it alone. Once you create the jobs, you need a tool to run them. There are two options left, being Apache Spark and Apache Flink. We chose Flink because it was more compatible with what we wanted to do.

What other advice do I have?

We are very happy with the product, and we have been able to achieve all of the use cases that we are expected to deliver for our customers.

Over time, I have seen many improvements including in the documentation. An example is that when we first started using this product, almost two years ago, there was no support available.

At this point, we do not have much opt-in but we have some use cases to ensure that our system is not breaking. We have QA who can validate these things based on what is expected versus what we have done.

My advice for anybody who is considering Flink is that it has very mature documentation and you can do what you want. It is a very good way to implement streaming pipelines and you won't have any problems.

The biggest lesson that I have learned from using Flink is how we can customize the experience for the customer and how important it is to keep up with the industry. We don't want to be left behind.

I would rate this solution a seven out of ten.

Which deployment model are you using for this solution?

Public Cloud

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Ertugrul Akbas

Manager at ANET

Jan 30, 2023

Download

Easy to use, stable, scalable, and has good community support with a lot of documentation

Pros and Cons

"It is user-friendly and the reporting is good."

"There is a learning curve. It takes time to learn."

What is our primary use case?

We use Apache Flink in-house to develop the Tectonic platform.

What is most valuable?

It's usable and affordable.

It is user-friendly and the reporting is good.

What needs improvement?

There is a learning curve. It takes time to learn.

The initial setup is complex, it could be simplified.

For how long have I used the solution?

I have been using Apache Flink for more than one year.

I am using the latest version.

What do I think about the stability of the solution?

Apache Flink is a scalable product. We have no issues with the stability.

What do I think about the scalability of the solution?

It's a very scalable solution. We have more than 100 people in our organization who are using it.

How are customer service and support?

We use community resources. There is a lot of documentation available online.

How was the initial setup?

The initial setup is complex.

What's my experience with pricing, setup cost, and licensing?

It's an open-source solution.

Which other solutions did I evaluate?

We have not evaluated competitors. We followed the trends and based on the experience and opinions of people from all over the world, we selected Apache Flink.

What other advice do I have?

I would recommend Apache Flink to others who are interested in using it.

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.