Real-world, scenario-driven questions asked in mid- to senior-level Data Engineering interviews
Azure Databricks has become a core platform for building modern Lakehouse architectures on Azure. Interviewers today focus less on definitions and more on real-world implementation, performance, governance, and cost optimization.
This blog presents a curated and logically sequenced list of Azure Databricks interview questions, covering ingestion, transformation, governance, CI/CD, streaming, security, and architecture — exactly how interviews flow in real scenarios.
Section 1: Databricks & Lakehouse Fundamentals
1. Why Databricks? Explain the architecture of Azure Databricks
- Core components (Control Plane, Data Plane)
- Why Databricks over traditional Spark or Synapse
- Role in Lakehouse architecture
2. What is Lakehouse Medallion Architecture (Bronze / Silver / Gold)?
- Benefits and drawbacks
- How it compares with traditional data warehouses
- When it may not be the right choice
3. What are the benefits of Delta Lake file format?
- ACID transactions
- Schema enforcement & evolution
- Time Travel
- Performance optimizations
4. External vs Internal (Managed) Tables in Databricks
- Storage ownership
- Use cases
- Governance implications with Unity Catalog
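A quick illustration of the difference, as a minimal sketch; the catalog, schema, table names, and storage path are placeholders:

```python
# Managed: Unity Catalog owns the storage; DROP TABLE also deletes the data.
spark.sql("""
    CREATE TABLE main.sales.orders_managed (order_id BIGINT, amount DECIMAL(10,2))
""")

# External: you own the files at LOCATION; DROP TABLE removes only metadata.
spark.sql("""
    CREATE TABLE main.sales.orders_external (order_id BIGINT, amount DECIMAL(10,2))
    LOCATION 'abfss://data@mystorageacct.dfs.core.windows.net/sales/orders'
""")
```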
Section 2: Data Ingestion & Incremental Processing
5. What is Databricks Auto Loader and how does it perform incremental loading?
- File notification vs directory listing
- Exactly-once semantics
- Schema inference & evolution
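A minimal Auto Loader sketch, assuming a JSON landing zone; the paths, checkpoint locations, and target table are placeholders:

```python
# Incrementally ingest only new files with the cloudFiles (Auto Loader) source.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/checkpoints/orders_schema")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("abfss://landing@mystorageacct.dfs.core.windows.net/orders/")
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/checkpoints/orders")
    .trigger(availableNow=True)  # process all pending files, then stop
    .toTable("main.bronze.orders_raw")
)
```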
6. How do you implement CDC (Change Data Capture) in Databricks?
- CDC from source systems
- Merge-based CDC
- Streaming vs batch CDC
- Delta Change Data Feed (CDF)
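As one example, reading the Change Data Feed between versions; this assumes the table was created with `delta.enableChangeDataFeed = true`, and all names are illustrative:

```python
# Pull row-level changes since version 5 of the table.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("silver.customers")
)
# _change_type is one of: insert, update_preimage, update_postimage, delete
changes.filter("_change_type != 'update_preimage'").show()
```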
7. Explain SCD Type 1 vs SCD Type 2. When do you use each?
- Business scenarios
- Storage vs history trade-offs
8. Write and explain an SCD Type 2 MERGE command
- Surrogate keys
- Active flags
- Effective start/end dates
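A hedged sketch of the core MERGE, using illustrative table and column names (`dim_customer`, `is_active`, `start_date`, `end_date`). Note that a single MERGE can expire the old version of a changed row, but inserting its replacement typically requires pre-expanding the staging set or a second pass:

```python
# Illustrative SCD Type 2 MERGE: expire current versions of changed rows,
# insert genuinely new keys as active records.
spark.sql("""
    MERGE INTO silver.dim_customer AS tgt
    USING staging.stg_customer AS src
      ON tgt.customer_id = src.customer_id AND tgt.is_active = true
    WHEN MATCHED AND tgt.email <> src.email THEN
      UPDATE SET is_active = false,            -- expire the current version
                 end_date  = current_date()
    WHEN NOT MATCHED THEN
      INSERT (customer_id, email, is_active, start_date, end_date)
      VALUES (src.customer_id, src.email, true, current_date(), NULL)
""")
# Changed rows now need their new version inserted (often done by unioning
# "changed" source rows into the staging set before the MERGE).
```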
Section 3: Delta Lake Internals & Performance
9. What is Delta Time Travel? How is it implemented?
- Version-based vs timestamp-based queries
- Use cases (debugging, rollback)
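Both query styles in a short sketch; the table name, version number, and timestamp are examples:

```python
# Read an older snapshot by version or by timestamp.
v5  = spark.read.option("versionAsOf", 5).table("silver.orders")
old = spark.read.option("timestampAsOf", "2024-01-01").table("silver.orders")

# The SQL equivalents:
spark.sql("SELECT * FROM silver.orders VERSION AS OF 5")
spark.sql("SELECT * FROM silver.orders TIMESTAMP AS OF '2024-01-01'")
```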
10. What are OPTIMIZE and VACUUM commands? When do you use them?
- Small file problem
- Retention periods
- Cost vs performance trade-offs
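For example (table and column names are placeholders; 168 hours is the default retention floor):

```python
# Compact small files and co-locate data on a common filter column.
spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id)")
# Remove data files no longer referenced within the retention window.
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")
```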
11. What is Liquid Clustering? How is it better than Z-ORDER?
- Dynamic clustering
- No fixed columns
- When Z-ORDER still makes sense
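A sketch of a table defined with Liquid Clustering (names are illustrative); unlike Z-ORDER, the clustering keys can later be changed without rewriting the table:

```python
# Keys are declared once and can be swapped later via
# ALTER TABLE silver.events CLUSTER BY (...).
spark.sql("""
    CREATE TABLE silver.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        user_id  BIGINT
    )
    CLUSTER BY (user_id, event_ts)
""")
```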
12. How do you scale a Databricks pipeline?
- Horizontal vs vertical scaling
- Auto-scaling clusters
- Partitioning strategies
- Streaming scale considerations
Section 4: Streaming & Real-Time Processing
13. Explain Databricks Streaming concepts
- Watermarking
- Windowed aggregations
- Tumbling vs sliding windows
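A minimal sketch of a watermarked tumbling-window aggregation, assuming a streaming DataFrame `events` with an `event_time` timestamp column:

```python
from pyspark.sql import functions as F

# Tumbling 10-minute window; late events beyond 15 minutes are dropped.
agg = (
    events
    .withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "10 minutes"), "user_id")
    .count()
)
# A sliding window adds a slide interval:
# F.window("event_time", "10 minutes", "5 minutes")
```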
14. What is foreachBatch? When do you use it?
- Stateful streaming
- Streaming + MERGE patterns
- Idempotent writes
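A common pattern, sketched with placeholder names: foreachBatch hands each micro-batch to a function that MERGEs into a Delta table, which keeps replays idempotent:

```python
from delta.tables import DeltaTable

def upsert_batch(micro_df, batch_id):
    # MERGE keeps the write idempotent: replaying the same micro-batch
    # after a failure converges to the same table state.
    target = DeltaTable.forName(micro_df.sparkSession, "silver.orders")
    (
        target.alias("t")
        .merge(micro_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    stream_df.writeStream  # stream_df: an upstream streaming DataFrame
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/Volumes/main/silver/checkpoints/orders_merge")
    .start()
)
```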
Section 5: Data Quality & Modeling
15. How do you implement data quality checks in Medallion Architecture?
- Bronze: schema & null checks
- Silver: business validations
- Gold: aggregations & reconciliation
- Tools (DLT expectations, custom checks)
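For the DLT option, a small sketch with illustrative rule names and tables:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="silver_orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop bad rows
@dlt.expect("positive_amount", "amount > 0")                   # warn only
def silver_orders():
    return dlt.read_stream("bronze_orders").withColumn(
        "loaded_at", F.current_timestamp()
    )
```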
16. Dimensional Modeling: Star vs Snowflake Schema
- Benefits and trade-offs
- Query performance
- Maintenance complexity
- When to choose which
Section 6: Security, Governance & Compliance
17. How does Unity Catalog enable data governance in Databricks?
- Centralized metastore
- Fine-grained access control
- Audit logging
18. Explain catalogs, schemas, and metastores in Unity Catalog
- Hierarchy (metastore → catalog → schema → table)
- Privileges and access patterns
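A short sketch of how privileges are granted along that hierarchy; the principal and object names are placeholders:

```python
# A principal needs USE CATALOG and USE SCHEMA before a table-level
# SELECT grant takes effect.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.silver TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE main.silver.orders TO `data_engineers`")
```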
19. How do you implement column-level and row-level security (PII masking)?
- Dynamic views
- Column masking policies
- Role-based filtering
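A dynamic-view sketch combining column masking and row filtering; the group names, tables, and masking literal are placeholders:

```python
# Mask email unless the caller is in `pii_readers`; filter rows unless
# the caller is in `admins`.
spark.sql("""
    CREATE OR REPLACE VIEW gold.customers_masked AS
    SELECT
      customer_id,
      CASE WHEN is_account_group_member('pii_readers')
           THEN email ELSE '***MASKED***' END AS email,
      region
    FROM silver.customers
    WHERE is_account_group_member('admins') OR region = 'EU'
""")
```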
20. How do you secure Databricks resources and clusters?
- Cluster policies
- Network isolation
- Secrets & credential passthrough
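Secrets, for instance, are read through dbutils rather than hard-coded; the scope and key names below are placeholders:

```python
# The returned value is redacted if printed in a notebook.
db_password = dbutils.secrets.get(scope="prod-kv", key="sql-password")
```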
Section 7: Cloud Integration & Migration
21. How do you connect Databricks with ADLS Gen2 securely?
- ABFS connector
- External credentials
- Managed identities
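A sketch of service-principal (OAuth) access via the ABFS driver, with placeholder account, secret-scope, and tenant values; with Unity Catalog, storage credentials and external locations replace this per-cluster configuration:

```python
account = "mystorageacct"  # placeholder storage account
base = f"{account}.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{base}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{base}",
    dbutils.secrets.get(scope="prod-kv", key="sp-client-id"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{base}",
    dbutils.secrets.get(scope="prod-kv", key="sp-client-secret"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{base}",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)
```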
22. On-prem to Azure cloud migration – explain the end-to-end steps
- Assessment
- Data transfer
- Schema conversion
- Validation & optimization
23. Explain Delta Sharing (the open sharing protocol)
- Secure data sharing outside Databricks
- Cross-org / cross-cloud use cases
Section 8: Cost Optimization & Monitoring
24. How do you optimize costs on Azure Databricks?
- Cluster sizing
- Job vs all-purpose clusters
- Spot instances
- Storage optimizations
25. How do you compute Databricks costs without using the Azure Portal?
- System tables
- Information schema
- Usage & billing views
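For example, a simplified DBU-spend query over the billing system tables; this assumes system tables are enabled on the workspace, and a production query should also match rows to the price validity window (price_start_time/price_end_time):

```python
spark.sql("""
    SELECT u.sku_name,
           SUM(u.usage_quantity)                     AS dbus,
           SUM(u.usage_quantity * p.pricing.default) AS est_cost
    FROM system.billing.usage u
    JOIN system.billing.list_prices p
      ON u.sku_name = p.sku_name AND u.usage_unit = p.usage_unit
    WHERE u.usage_date >= date_sub(current_date(), 30)
    GROUP BY u.sku_name
    ORDER BY est_cost DESC
""").show()
```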
Section 9: Orchestration, CI/CD & DevOps
26. How do you schedule Databricks jobs?
- Jobs UI
- Cron schedules
- Event-driven triggers
27. What are the types of triggers in Databricks Workflows?
- Time-based
- File arrival
- Job completion
28. Explain CI/CD in Databricks using Azure DevOps
- Git integration
- Databricks Asset Bundles (DAB)
- Environment promotion (Dev → QA → Prod)
Section 10: Testing & Code Quality
29. How do you implement unit testing in Databricks?
- PyTest
- Mocking Spark sessions
- Testing transformations
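A minimal PyTest sketch: the transformation under test (add_order_total) is a stand-in for your own function, and the local SparkSession lets tests run without a Databricks cluster:

```python
import pytest
from pyspark.sql import SparkSession, functions as F

def add_order_total(df):
    """Transformation under test: derive total from qty * price."""
    return df.withColumn("total", F.col("qty") * F.col("price"))

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession shared across the test session.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

def test_add_order_total(spark):
    df = spark.createDataFrame([(2, 5.0)], ["qty", "price"])
    assert add_order_total(df).collect()[0]["total"] == 10.0
```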
30. What Python and SQL quality tools do you use?
- Black – code formatting
- SQLFluff – SQL linting
- Importance of readable, maintainable pipelines
Section 11: Cluster Configuration & Policies
31. What cluster configurations have you used in Databricks?
- Job vs interactive clusters
- Auto-scaling
- Photon-enabled clusters
32. How do cluster policies help with security and cost control?
- Restrict node types
- Enforce tagging
- Prevent misuse
Section 12: Azure Ecosystem Awareness
33. What is the role of Azure Data Factory (ADF) in Databricks projects?
- Orchestration vs transformation
- When to use ADF vs Databricks Workflows
34. What is Azure Purview (Microsoft Purview)?
- Data discovery
- Lineage
- Governance integration with Databricks
Final Thoughts
These questions reflect how real Databricks interviews are conducted today — focused on:
- Architecture decisions
- Performance tuning
- Security & governance
- Cost awareness
- Production-ready pipelines
If you master these areas, you are well prepared for mid- to senior-level Azure Databricks roles.