What is our primary use case?
My main use case for Azure Databricks is an ingestion platform that integrates different types of sources, such as Oracle ERP, SAP ERP, APIs, SQL Server, and Oracle databases. We ingest the data into the raw layer, transform it into a silver layer, and then create a golden layer of transformed data for Power BI to consume.
When I open Azure Databricks, the first thing I typically do is determine which Azure services or applications are needed for a first end-to-end architecture, and decide which services are best for ingestion and for transformation into the silver and golden layers. I then build a proof of concept that implements an end-to-end framework, which guides the successful delivery of the entire project in a regular stream.
My day-to-day workflow begins with validating the proof-of-concept framework, choosing the right service for the ingestion pattern, and then building or running notebooks for the raw, silver, and gold layer transformations. I am involved in building the data architecture, and I coordinate with my senior tech leads or COE groups when I have questions, to ensure we understand the required solutions. After we present the architecture to the customer, we proceed with development and testing of the deliverables.
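As a rough illustration of the raw, silver, and gold flow described above, here is a minimal sketch in plain Python rather than PySpark, so that it runs anywhere. The record layout, the `to_silver` and `to_gold` helpers, and the sample values are all hypothetical and are not taken from my actual pipelines.

```python
from collections import defaultdict

# Hypothetical raw-layer records, kept exactly as they arrived from the sources.
RAW_ORDERS = [
    {"order_id": "1", "region": " emea ", "amount": "100.50"},
    {"order_id": "2", "region": "APAC", "amount": "20"},
    {"order_id": "2", "region": "APAC", "amount": "20"},            # duplicate delivery
    {"order_id": "3", "region": "emea", "amount": "not-a-number"},  # bad value
]


def to_silver(raw_rows):
    """Cleanse raw rows: cast types, normalize text, drop bad rows and duplicates."""
    seen_ids = set()
    silver = []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            # A real pipeline would more likely quarantine the row than drop it.
            continue
        order_id = int(row["order_id"])
        if order_id in seen_ids:
            continue
        seen_ids.add(order_id)
        silver.append(
            {
                "order_id": order_id,
                "region": row["region"].strip().upper(),
                "amount": amount,
            }
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleansed rows into a reporting-friendly shape (revenue per region)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver_orders = to_silver(RAW_ORDERS)
gold_revenue_by_region = to_gold(silver_orders)
print(gold_revenue_by_region)
```

In a Databricks notebook, each step would typically read from and write to a Delta table instead of passing Python lists around, but the layering idea is the same.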
What is most valuable?
The features of Azure Databricks I use the most include schema evolution, auto merge, and efficient handling of shuffles and transformations, along with Z-ordering and partitioning, which improve the performance of data transformation. Additionally, the Delta format, time travel, versioning, and ACID properties are sophisticated features that provide great support for a cloud Lakehouse architecture.
Azure Databricks has positively impacted my organization by allowing us to build a medallion architecture or Kappa architecture for handling both scheduled and streaming data. Once the ingestion framework is established, the existing data services handle transformations and cleansing effectively, which means we rarely need to change services. I also make use of Unity Catalog, RBAC, ACLs, and Purview for data governance and security, which lets us manage scalable solutions efficiently across the organization.
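To make a few of these features concrete, the sketch below builds the Delta Lake SQL statements usually associated with them: enabling schema auto merge, Z-ordering a table, and querying an earlier version through time travel. The helper function names and the `silver.orders` table are hypothetical. The helpers only build strings, and in a notebook each string would be passed to `spark.sql(...)`.

```python
def enable_auto_merge_sql():
    """Session setting that lets MERGE and appends evolve the target schema."""
    return "SET spark.databricks.delta.schema.autoMerge.enabled = true"


def optimize_zorder_sql(table, columns):
    """Compact a Delta table's files and co-locate rows by the given columns."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


def time_travel_sql(table, version):
    """Read the table as it was at an earlier Delta version."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def history_sql(table):
    """List the table's versions, which is how a version number is found."""
    return f"DESCRIBE HISTORY {table}"


# Hypothetical table name, for illustration only. These helpers interpolate
# names directly, so they should only be given trusted identifiers.
statements = [
    enable_auto_merge_sql(),
    optimize_zorder_sql("silver.orders", ["customer_id", "order_date"]),
    history_sql("silver.orders"),
    time_travel_sql("silver.orders", 3),
]
for statement in statements:
    print(statement)
```

Z-ordering tends to help most on columns that appear often in filters or joins. Which columns those are depends on the workload, so the two used here are only placeholders.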
What needs improvement?
The biggest friction point I have experienced with Azure Databricks is cost. For projects with low data volumes, I would advise using Microsoft Fabric services instead, as Azure Databricks may not be suitable for low-volume processing.
I wish Azure Databricks offered deeper integration of advanced features such as Genie AI, which simplifies coding by automatically building PySpark code based on scenarios. I also find that downtime of Azure Databricks clusters significantly impacts the service and the execution of existing pipelines, requiring careful management of backlog items. On my side, I need to enhance my knowledge of Azure data lineage using Purview.
For how long have I used the solution?
I have been using Azure Databricks for more than four years.
What do I think about the stability of the solution?
When I first implemented Azure Databricks in my environment, setup took a long time because we had to coordinate with the customer on all the necessary data platform setups. However, if the customer already has an Azure data platform, it is easy to onboard new people quickly and start the development phases on time.
What do I think about the scalability of the solution?
When my team started using Azure Databricks, they needed formal training to become comfortable with it. IT organizations typically require newly onboarded or recruited individuals to undergo multiple training sessions on different platforms, including Azure, over four to six months. After that, they are assigned to projects and begin shadowing senior engineers or developers and picking up small initiatives.
Which solution did I use previously and why did I switch?
Before standardizing on Azure Databricks, we were using traditional ETL tools such as Informatica, Ab Initio, DataStage, Oracle ODI, and SAP BODS for data warehousing. With data modernization, companies began migrating to the Azure data platform and other cloud services.
What was our ROI?
My pipelines are now significantly faster than with older ETL tools. For example, processing 2 GB of source data used to take 12 to 14 hours or more in Synapse Analytics, and it now completes within 5 hours using the Azure Databricks framework for the transformation part, which is a substantial improvement in performance.
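For a sense of scale, the figures above work out as follows. This is plain arithmetic on the numbers I quoted, treating 12 to 14 hours and 5 hours as rough bounds rather than measured benchmarks.

```python
# Runtimes quoted in the review, in hours.
old_hours_low, old_hours_high = 12, 14
new_hours = 5

# Speedup factor: how many times faster the new pipeline is.
speedup_low = old_hours_low / new_hours
speedup_high = old_hours_high / new_hours

# Fraction of the original runtime that was saved.
reduction_low = 1 - new_hours / old_hours_low
reduction_high = 1 - new_hours / old_hours_high

print(f"Speedup: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"Runtime reduction: {reduction_low:.0%} to {reduction_high:.0%}")
```

That is roughly a 2.4x to 2.8x speedup, or about 58% to 64% less runtime.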
Which other solutions did I evaluate?
When evaluating modern options, I considered other tools besides Azure Databricks, including Snowflake, Synapse, and AWS Glue. I chose Azure Databricks specifically because it handles high volumes of source data efficiently, managing daily ingestion in gigabytes or terabytes, and because it facilitates the transformation of data into modern data models suitable for machine learning and AI purposes.
What other advice do I have?
Although Delta Live Tables and streaming services are features I discussed and planned to use during implementation, I have utilized them very minimally for my applications.
My advice for someone considering Azure Databricks with a similar workflow is to ensure their solution can operate independently of any single cloud provider, so that it can work across Azure, GCP, or AWS, and to implement a plug-in/plug-out architecture in which features can be added or removed without disrupting the overall framework.
I would rate this product a seven out of ten.
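One way to read the plug-in/plug-out advice is as a small registry of source connectors that the framework iterates over, so that adding or removing a source never touches the core ingestion loop. The sketch below is a hypothetical illustration in plain Python. The `ConnectorRegistry` class and the two reader functions are invented for this example and are not part of any Databricks API.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Reader = Callable[[], Iterable[Record]]


class ConnectorRegistry:
    """Holds named source readers that can be plugged in or out at any time."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}

    def plug_in(self, name: str, reader: Reader) -> None:
        self._readers[name] = reader

    def plug_out(self, name: str) -> None:
        # Removing an unknown connector is a no-op, so the framework keeps running.
        self._readers.pop(name, None)

    def ingest_all(self) -> List[Record]:
        """Read every registered source and tag each record with its origin."""
        rows: List[Record] = []
        for name in sorted(self._readers):
            for record in self._readers[name]():
                rows.append({**record, "_source": name})
        return rows


# Hypothetical readers standing in for real source connectors.
def read_oracle_erp() -> List[Record]:
    return [{"invoice_id": 101}]


def read_sap_erp() -> List[Record]:
    return [{"invoice_id": 202}]


registry = ConnectorRegistry()
registry.plug_in("oracle_erp", read_oracle_erp)
registry.plug_in("sap_erp", read_sap_erp)
all_rows = registry.ingest_all()

registry.plug_out("sap_erp")
rows_after_plug_out = registry.ingest_all()
print(all_rows, rows_after_plug_out)
```

Because the core loop only depends on the reader signature, the same framework could in principle sit on top of Azure, GCP, or AWS connectors.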
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure