Databricks • DCDEP
Validates advanced proficiency in building and optimizing production-grade data engineering solutions on Databricks, covering data processing with Delta Lake and Structured Streaming, data modeling using Medallion Architecture, Databricks tooling including Workflows and REST APIs, and security, governance, and deployment.
Questions
628
Duration
120 minutes
Passing Score
70%
Difficulty
Professional
Last Updated
Feb 2026
The Databricks Certified Data Engineer Professional certification validates advanced proficiency in building, optimizing, and maintaining production-grade data engineering solutions on the Databricks Data Intelligence Platform. Successful candidates demonstrate deep expertise across core platform capabilities including Delta Lake, Unity Catalog, Auto Loader, Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables), Databricks Compute (including serverless), Lakeflow Jobs, and the Medallion Architecture. The exam was updated in 2025 to reflect a Data Intelligence Platform framing, with expanded coverage of AI-driven features, Delta Sharing, Lakehouse Federation, Databricks Asset Bundles (DAB), and enhanced Unity Catalog governance.
This certification assesses the ability to design secure, reliable, and cost-effective ETL pipelines; process complex data from diverse sources using Python and SQL; implement Change Data Capture (CDC), SCD1, and SCD2 patterns; and apply best practices in schema management, observability, performance optimization, and data governance. Candidates are also evaluated on streaming workloads using Structured Streaming, workflow orchestration via Databricks Workflows, and deployment automation using the Databricks CLI, REST API, and Asset Bundles.
This certification is designed for experienced data engineering professionals with at least one year of hands-on experience building and operating production data pipelines on Databricks. Ideal candidates include Senior Data Engineers, Lead Analytics Engineers, Data Architects, and Big Data professionals who work daily with Apache Spark, Delta Lake, and the Databricks Lakehouse Platform. Those who have already obtained the Databricks Certified Data Engineer Associate credential are especially well-positioned, as the Professional exam builds substantially on that foundational knowledge.
Professionals transitioning from traditional ETL development, data scientists who regularly build and maintain pipelines, and Solutions Architects seeking to validate deep platform expertise are also strong candidates. Code examples on the exam are primarily in Python and SQL, so comfort with PySpark and Spark SQL is essential.
There are no formal prerequisite certifications required, but Databricks strongly recommends holding or demonstrating mastery of the skills covered by the Databricks Certified Data Engineer Associate certification before attempting the Professional exam. Candidates should have at least one year of hands-on experience performing the data engineering tasks outlined in the official exam guide.
Recommended knowledge areas include proficiency with Apache Spark (PySpark and Spark SQL), Delta Lake operations (MERGE, OPTIMIZE, ZORDER, VACUUM, Change Data Feed), Structured Streaming concepts (Auto Loader, windowing, watermarking), Unity Catalog for data governance, Databricks Workflows for job orchestration, and familiarity with DevOps practices including version control and CI/CD pipelines. Candidates should be comfortable working in the Databricks Workspace, using the Databricks CLI and REST API, and applying the Medallion Architecture (Bronze, Silver, Gold layers) in real-world pipeline design.
The Databricks Certified Data Engineer Professional exam consists of approximately 60 scored multiple-choice questions, with some candidates reporting up to 65 questions. The exam duration is 120 minutes (2 hours). As with other Databricks exams, the form may include a small number of unscored survey items used to gather statistical data for future exam development; these are not identified and do not affect the final score, and additional time is factored in to account for them.
The exam is delivered online via a remote proctoring platform and costs $200 USD (plus applicable taxes). The passing score is 70%. Questions are scenario-based and require applied knowledge rather than rote memorization, frequently presenting realistic production engineering challenges in PySpark and SQL. Recertification is required every two years by retaking the current version of the exam.
The Databricks Certified Data Engineer Professional credential is recognized as an advanced-tier validation of Lakehouse Platform expertise, positioning holders for senior roles such as Senior Data Engineer, Lead Analytics Engineer, Data Architect, and Solutions Architect. Databricks is used by more than 7,000 organizations globally, including approximately 40% of Fortune 500 companies, creating sustained demand for certified professionals. Certified data engineers in the US typically earn between $115,000 and $150,000 annually, with top earners exceeding $160,000 depending on experience, location, and industry. Glassdoor data places the average at approximately $131,000, with a range extending to $170,000 at the 75th percentile.
Compared to the Associate-level certification, the Professional credential signals the ability to architect and operate enterprise-grade solutions, not just implement them, which substantially increases leverage in salary negotiations and job applications. The certification also serves as a differentiator against candidates holding generalist cloud data engineering credentials (e.g., AWS, Azure, GCP data engineer certs), as Databricks expertise is platform-specific and increasingly in demand as organizations adopt the Lakehouse architecture for unified analytics and AI workloads. Recertification every two years ensures holders stay current with the rapidly evolving platform.
5 sample questions with explanations, drawn from the full bank of 628 questions.
1. A streaming join combines impression events with click events to calculate click-through rates. Both streams have watermarks defined: impressions with a 1-hour watermark on impression_time, and clicks with a 2-hour watermark on click_time. The join condition includes expr('click_time BETWEEN impression_time AND impression_time + INTERVAL 30 MINUTES'). What is the purpose of the time range constraint in the join condition? (Select one!)
Explanation
Stream-stream joins require time range constraints in the join condition to enable state cleanup and prevent unbounded state growth. Without time bounds, Spark must maintain state for all records indefinitely because any future record could potentially match. The time range constraint allows Spark to determine when state for a record can be safely discarded because no future records can match it based on watermarks. While the 30-minute window may also serve business logic purposes, the technical requirement for stream-stream joins is that time constraints enable state management. Performance improvement is a side effect, not the primary purpose. Watermarks are enforced independently of the time range constraint, though both work together for complete stateful operation management.
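The state-cleanup argument can be illustrated with a small pure-Python model. This is a simplified sketch, not Spark's implementation: it ignores how Spark combines the two streams' watermarks into a global watermark, and the names `impression_expiry` and `can_evict` are hypothetical helpers introduced only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def impression_expiry(impression_time: datetime,
                      max_click_delay: Optional[timedelta]) -> Optional[datetime]:
    """Latest click_time that could still match this impression.

    Returns None when the join condition has no upper bound, meaning a
    matching click could carry any future event time.
    """
    if max_click_delay is None:
        return None
    return impression_time + max_click_delay


def can_evict(impression_time: datetime,
              click_watermark: datetime,
              max_click_delay: Optional[timedelta]) -> bool:
    """True once no click that is still accepted could match the impression."""
    expiry = impression_expiry(impression_time, max_click_delay)
    # Clicks with click_time below the watermark are treated as too late, so
    # once the watermark passes the expiry the buffered row can be dropped.
    return expiry is not None and click_watermark > expiry


impression = datetime(2026, 2, 1, 12, 0)
bound = timedelta(minutes=30)  # mirrors "impression_time + INTERVAL 30 MINUTES"

# Click watermark at 12:15: a click at 12:20 could still arrive and match.
early = can_evict(impression, datetime(2026, 2, 1, 12, 15), bound)
# Click watermark at 12:45: every click that could match is already too late.
late = can_evict(impression, datetime(2026, 2, 1, 12, 45), bound)
# No upper bound in the join condition: the row can never be evicted.
unbounded = can_evict(impression, datetime(2026, 2, 3, 0, 0), None)
```

In this model the bounded join releases the impression once the click watermark passes 12:30, while the unbounded join keeps it forever, which is the unbounded state growth the explanation describes.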
2. A production Delta table has the following properties: delta.logRetentionDuration = 60 days and delta.deletedFileRetentionDuration = 14 days. VACUUM was last run 10 days ago with default retention. A user attempts to query VERSION AS OF from 20 days ago. What is the expected outcome? (Select one!)
Explanation
The 60-day transaction log retention ensures that the metadata for the version from 20 days ago is still available. The 14-day deleted file retention means VACUUM does not remove a data file until at least 14 days after it was logically removed from the table. VACUUM ran 10 days ago without a RETAIN clause, so it applied the table's 14-day threshold and deleted only files that had been removed more than 14 days before that run, that is, more than 24 days ago. Any file the 20-day-old version needs was still part of the table 20 days ago, so at the time of the VACUUM run it had been removed for at most 10 days and was kept. The query succeeds because both the transaction log entry and the required data files are still present. VACUUM does not remove files based on version age alone, but based on when they were logically deleted from the table.
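The retention arithmetic can be checked with a short worst-case calculation. This sketch assumes that VACUUM without a RETAIN clause applies the table's delta.deletedFileRetentionDuration (14 days in this scenario), and it simplifies log cleanup to a plain age comparison. The function `time_travel_is_safe` is a hypothetical helper, not a Delta Lake API.

```python
def time_travel_is_safe(version_age_days: float,
                        days_since_vacuum: float,
                        file_retention_days: float,
                        log_retention_days: float) -> bool:
    """Return True if the version is guaranteed readable in this simple model."""
    # The transaction log entry for the version must still be retained.
    if version_age_days > log_retention_days:
        return False
    # Worst case: a file the version needs was removed from the table
    # immediately after that version was committed, so its tombstone is as
    # old as the version itself.
    tombstone_age_at_vacuum = max(version_age_days - days_since_vacuum, 0)
    # VACUUM only deletes files whose removal is older than the threshold.
    return tombstone_age_at_vacuum <= file_retention_days


# Scenario from the question: version 20 days old, VACUUM ran 10 days ago,
# 14-day file retention, 60-day log retention.
scenario = time_travel_is_safe(
    version_age_days=20,
    days_since_vacuum=10,
    file_retention_days=14,
    log_retention_days=60,
)
```

Here the oldest possible tombstone was 10 days old when VACUUM ran, which is inside the 14-day window, so `scenario` evaluates to True.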
3. A data pipeline uses MERGE INTO to perform upserts on a partitioned Delta table with 50 TB of data across 2000 partitions. The merge operation takes 45 minutes to complete. The ON clause includes: ON target.id = source.id AND target.partition_date = source.partition_date. The partition_date column is the partition key. What optimization would provide the most significant performance improvement? (Select one!)
Explanation
The MERGE operation is already optimized because including the partition column in the ON clause enables partition pruning, which limits the merge operation to only the partitions containing matching data. This is the most effective optimization for partitioned tables in MERGE operations. Broadcast hints are typically ineffective for large source tables in MERGE operations. AQE provides benefits but partition pruning has the most significant impact on merge performance. While liquid clustering offers advantages, replacing an already optimized partitioned merge would require significant rewriting without guaranteed improvement for this specific pattern.
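The effect of partition pruning can be pictured with a conceptual model. This is only a sketch of the idea, not how the Delta engine plans a MERGE, and whether pruning actually happens depends on the runtime version and the query plan. The function `partitions_to_scan` and the sample data are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_scan(target_partitions: set, source_rows: list) -> set:
    """Target partitions that can contain a match for the ON clause."""
    # With target.partition_date = source.partition_date in the ON clause,
    # only partitions whose date appears in the source can hold matches.
    source_dates = {row["partition_date"] for row in source_rows}
    return target_partitions & source_dates


start = date(2020, 1, 1)
# 2000 daily partitions, as in the question.
target = {start + timedelta(days=i) for i in range(2000)}

# A small incremental batch that only touches the two most recent days.
source = [
    {"id": 1, "partition_date": start + timedelta(days=1998)},
    {"id": 2, "partition_date": start + timedelta(days=1999)},
    {"id": 3, "partition_date": start + timedelta(days=1999)},
]

touched = partitions_to_scan(target, source)
```

In this example only 2 of the 2000 partitions are candidates for the merge, which is why the partition column in the ON clause matters so much for runtime.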
4. A Delta table stores sensitive customer data and has the following table properties configured: delta.deletedFileRetentionDuration set to interval 7 days and delta.logRetentionDuration set to interval 30 days. After running VACUUM with default settings, a data analyst attempts to query VERSION AS OF from 10 days ago and receives a FileNotFoundException. What is the cause? (Select one!)
Explanation
VACUUM physically deletes data files that are no longer referenced by the current table version and that were removed from the table longer ago than the retention threshold set by delta.deletedFileRetentionDuration, here 7 days (also the default). The version from 10 days ago depends on files that were removed from the table more than 7 days ago, so VACUUM deleted them. The transaction log retention of 30 days is sufficient and still contains the metadata for the 10-day-old version, but the actual Parquet data files are gone. Change Data Feed is unrelated to time travel functionality. Time travel is limited by data file retention, which VACUUM enforces, not by an arbitrary 7-day limit.
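A file-level simulation shows how the log can still describe a version whose data files are gone. This is a minimal sketch under simplifying assumptions: file ages are given in days, and `DataFile`, `vacuum`, and `files_for_version` are hypothetical stand-ins, not Delta Lake APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataFile:
    name: str
    added_days_ago: float
    removed_days_ago: Optional[float] = None  # None: still in the current version


def vacuum(files, retention_days):
    """Keep live files and tombstoned files still inside the retention window."""
    return [f for f in files
            if f.removed_days_ago is None or f.removed_days_ago <= retention_days]


def files_for_version(files, version_age_days):
    """Names of the files that made up the table at the given version age."""
    return {f.name for f in files
            if f.added_days_ago >= version_age_days
            and (f.removed_days_ago is None
                 or f.removed_days_ago < version_age_days)}


# The transaction log still lists all three files (30-day log retention).
log = [
    DataFile("part-0001", added_days_ago=30, removed_days_ago=9),  # rewritten 9 days ago
    DataFile("part-0002", added_days_ago=30),
    DataFile("part-0003", added_days_ago=9),                       # replacement file
]

needed = files_for_version(log, version_age_days=10)
on_disk = {f.name for f in vacuum(log, retention_days=7)}
missing = needed - on_disk
```

The version from 10 days ago needs part-0001, but that file was removed from the table 9 days ago, which is outside the 7-day window, so it ends up in `missing`. That gap is what surfaces as the FileNotFoundException.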
5. A data governance team uses Unity Catalog system tables to monitor query usage and identify tables accessed by specific users. They need to generate a report showing all tables queried by the finance_analysts group in the last 30 days. Which system table should they query? (Select one!)
Explanation
The system.query.history table contains detailed information about executed queries including user identity, query text, and execution timestamps. Joining this data with group membership information and parsing the query text for table references enables identifying which tables were accessed by group members within the time window. The system.access.table_lineage table records lineage events between source and target tables (it is not limited to DLT pipelines), so it describes data flow rather than a per-query history with statement text. The system.access.audit table contains account-level audit logs but is not optimized for analyzing query-level table access patterns. The information_schema.tables view contains metadata about table structure but does not record actual access history or usage patterns.
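The "parse the query text" step could look roughly like the sketch below, applied to rows already fetched from the query history for the last 30 days. The field names `executed_by` and `statement_text` are assumed to match the system table's columns and should be verified against the current schema. The group member list is assumed to come from elsewhere, and the regular expression is deliberately naive (it misses CTEs, views, subqueries, and single-part names).

```python
import re
from collections import defaultdict

# Two- or three-part names following FROM or JOIN, with optional backticks.
TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+((?:`[^`]+`|\w+)(?:\.(?:`[^`]+`|\w+)){1,2})",
    re.IGNORECASE,
)


def tables_by_user(history_rows, group_members):
    """Map each group member to the set of table names found in their queries."""
    accessed = defaultdict(set)
    for row in history_rows:
        user = row["executed_by"]
        if user not in group_members:
            continue  # not part of the group being reported on
        for name in TABLE_REF.findall(row["statement_text"]):
            accessed[user].add(name.replace("`", "").lower())
    return dict(accessed)


# Hypothetical sample rows and group membership.
rows = [
    {"executed_by": "ana@example.com",
     "statement_text": "SELECT * FROM main.finance.invoices i "
                       "JOIN main.finance.customers c ON i.cid = c.id"},
    {"executed_by": "bob@example.com",
     "statement_text": "SELECT count(*) FROM main.hr.salaries"},
]
finance_analysts = {"ana@example.com"}

report = tables_by_user(rows, finance_analysts)
```

A production report would more likely rely on a proper SQL parser or on lineage data, since regex matching over query text is easy to fool.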