What is our primary use case?
It serves as a versatile tool for data ingestion and preparation, handling tasks such as converting data from one type or format to another and processing it so that it is ready for downstream use.
How has it helped my organization?
We leverage Apache Airflow to orchestrate our data pipelines, primarily because of the multitude of data sources we manage. These sources vary in nature: some deliver streaming data, while others arrive over protocols such as FTP or through landing areas. We use Airflow to orchestrate tasks such as data ingestion, transformation, and preparation, ensuring that the data is formatted appropriately for further processing. Typically, this involves normalization, enrichment, and structuring the data for consumption by Spark or other similar platforms in our ecosystem.
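To illustrate what such an orchestration looks like, here is a minimal sketch of a daily DAG with ingest, normalize, and enrich stages. The DAG name and task callables are hypothetical placeholders rather than our production code, and the sketch assumes Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Illustrative callables -- stand-ins for real ingestion logic.
def ingest_from_ftp(**context):
    """Pull raw files from an FTP landing area (placeholder)."""
    print("ingesting raw files")


def normalize(**context):
    """Normalize field names and types (placeholder)."""
    print("normalizing records")


def enrich(**context):
    """Join in reference data before handing off to Spark (placeholder)."""
    print("enriching records")


with DAG(
    dag_id="ingest_and_prepare",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_from_ftp)
    norm = PythonOperator(task_id="normalize", python_callable=normalize)
    prep = PythonOperator(task_id="enrich", python_callable=enrich)

    # Airflow expresses the pipeline as explicit task dependencies.
    ingest >> norm >> prep
```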
The scheduling and monitoring functionalities enhance our data processing workflows. While the interface could be more user-friendly, proficiency in scheduling and monitoring can be attained through practice and skill development.
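For readers new to these functionalities, the snippet below sketches how scheduling and basic failure monitoring are typically expressed in an Airflow 2.x DAG definition; the DAG name, cron expression, and alert hook are assumptions for illustration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_on_failure(context):
    # Placeholder alert hook; in practice this might page an on-call team.
    print(f"Task {context['task_instance'].task_id} failed")


with DAG(
    dag_id="monitored_pipeline",      # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",    # run nightly at 02:00
    catchup=False,
    default_args={
        "retries": 3,                 # absorb transient failures
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_on_failure,
    },
) as dag:
    BashOperator(task_id="load_batch", bash_command="echo loading")
```

Retries handle transient failures automatically, while the failure callback is where alerting would normally be wired in, complementing the built-in web UI monitoring.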
The scalability of Apache Airflow effectively accommodates our increasing data processing demands. While occasional server problems may arise, particularly around scaling, overall, the product remains reliably stable.
It offers a straightforward means to orchestrate numerous data sources efficiently, thanks to its user-friendly interface. Learning to use it is relatively quick and straightforward, although some experimentation, practice, and training may be required to master certain aspects.
What is most valuable?
Apache Airflow greatly streamlines our data workflow management. Its user-friendly interface makes it straightforward to operate, and it offers a wide range of features for data preparation, buffering, and format conversion. With these capabilities, Airflow serves as a comprehensive tool for managing our data workflows effectively.
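As one concrete example of the format-conversion work described above, the sketch below converts a CSV extract to Parquet using pandas; the file paths are hypothetical, and in practice a function like this would be wrapped in a PythonOperator inside a DAG:

```python
import pandas as pd


def csv_to_parquet(src: str, dest: str) -> None:
    """Convert a CSV extract to Parquet for downstream Spark jobs."""
    df = pd.read_csv(src)
    # Writing Parquet requires pyarrow (or fastparquet) to be installed.
    df.to_parquet(dest, index=False)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    csv_to_parquet("/data/landing/orders.csv", "/data/prepared/orders.parquet")
```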
What needs improvement?
The current pricing of Apache Airflow is considerably higher than anticipated, catching us off guard as it has evolved from its initial pricing structure. It would be beneficial to improve the pricing structure. Enhancing the interface further would also be highly beneficial.
For how long have I used the solution?
We have been using it for approximately two years.
What do I think about the stability of the solution?
While the stability of the system is satisfactory, maintaining it requires vigilance and attention to various factors. Occasional issues arise during usage, particularly with on-premises configurations. For instance, a single hard disk failure on a physical node can pose a challenge, since the node must be shut down for disk replacement, and the process of taking the node offline and bringing it back up is intricate and requires careful handling.
What do I think about the scalability of the solution?
Scalability is achievable, but it comes with its challenges, particularly in terms of temporary downsizing due to failures or other unforeseen circumstances. While scaling up is feasible, each additional node introduced into the cluster adds complexity and raises the likelihood of potential failures. Dealing with failures involves following standard procedures, yet reinstating the cluster to its fully operational state can be a demanding task.
Approximately ten technical staff members and an equivalent number of data scientists utilize the platform. Additionally, a segment of the network team employs it for network quality analysis, leveraging reporting tools built on top of Impala, which is integrated into the cluster.
How was the initial setup?
The setup process is notably intricate, particularly considering our cluster configuration consisting of twelve data nodes and various additional components. Furthermore, unforeseen issues may arise, such as disk space constraints for Airflow or similar challenges, necessitating vigilance and attention to detail to avoid complications.
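One common way to guard against the kind of disk-space constraints mentioned above is a small maintenance DAG that prunes old task logs. The log path and 30-day retention window below are assumptions; adjust them to match the base_log_folder configured in airflow.cfg:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="airflow_log_cleanup",  # hypothetical maintenance DAG
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Delete task log files older than 30 days; the path below is an
    # assumption -- point it at your actual Airflow log directory.
    BashOperator(
        task_id="purge_old_logs",
        bash_command="find /opt/airflow/logs -type f -mtime +30 -delete",
    )
```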
What about the implementation team?
The initial phase of the deployment process involves creating a comprehensive plan outlining the setup of our cluster, considering all nodes involved. Since we are deploying on-premises, we need to determine which components will reside on physical machines and which can be accommodated on virtual machines or clusters. This assessment guides the allocation of resources to each server, ensuring an optimal configuration.
Following this, the configuration phase begins, taking into account the specific requirements of our organization and our stringent security measures. Access to the clusters must be carefully managed, categorized, and restricted per security protocols. It is imperative to prepare everything meticulously prior to deployment to ensure a smooth and successful implementation.
We undertook the deployment partially in-house and partially with the assistance of a system integrator dedicated to this project. For maintenance and deployment tasks, we rely on a team of ten technical personnel; typically, only two or three individuals are needed to monitor operations and address issues as they arise. Moreover, we have the backing of our system integrator for additional support if necessary.
What's my experience with pricing, setup cost, and licensing?
The pricing is on the higher side.
What other advice do I have?
I would confidently recommend Apache Airflow to others. In my opinion, it is a mature and efficient product that delivers reliable performance. Overall, I would rate it nine out of ten.
Which deployment model are you using for this solution?
On-premises
*Disclosure: My company does not have a business relationship with this vendor other than being a customer.