Databricks • DCASD
Validates the ability to use Apache Spark DataFrame API and Spark SQL for data manipulation tasks, covering Spark architecture and execution model, DataFrame transformations and actions, Structured Streaming, Spark Connect, and performance tuning.
Questions
604
Duration
90 minutes
Passing Score
70%
Difficulty
Associate
Last Updated
Feb 2026
The Databricks Certified Associate Developer for Apache Spark validates a candidate's ability to use the Apache Spark DataFrame API and core Spark concepts to perform essential data manipulation tasks within a Spark session. The exam was significantly updated in April 2025 (replacing the legacy Spark 3.0 version) and now covers the Spark DataFrame API for selecting, renaming, and manipulating columns; filtering, dropping, sorting, and aggregating rows; handling missing data; combining, reading, writing, and partitioning DataFrames with schemas; and working with user-defined functions (UDFs) and Spark SQL functions. Code in the exam is presented exclusively in Python.
Beyond the DataFrame API, the certification assesses foundational knowledge of the Spark architecture and execution model, including execution and deployment modes, the execution hierarchy (jobs, stages, tasks), fault tolerance, garbage collection, lazy evaluation, shuffling, actions, and broadcasting. The updated exam also includes coverage of Structured Streaming fundamentals, Spark Connect, the Pandas API on Apache Spark, and common performance tuning and troubleshooting techniques. This breadth makes it a comprehensive entry-level credential for working with Apache Spark in production data environments.
This certification is designed for early-to-mid-career data practitioners who work with Apache Spark in Python on a regular basis. Ideal candidates include data engineers, data analysts, and software developers who build or maintain Spark-based data pipelines and need to demonstrate foundational proficiency. Databricks recommends at least six months of hands-on experience performing the tasks covered in the exam guide before attempting the exam.
The certification is well-suited for professionals transitioning into big data engineering roles, data platform engineers working on Databricks-based lakehouse architectures, and developers at organizations that use Apache Spark at scale. Because Apache Spark is used across industries—from finance and retail to healthcare and technology—this credential is valuable beyond Databricks-specific roles.
There are no formal prerequisites required to register for this exam. However, Databricks strongly recommends that candidates have at least six months of practical, hands-on experience with Apache Spark before attempting the certification. Candidates should be comfortable writing PySpark code using the DataFrame API and have a working understanding of Spark's execution model.
Recommended preparatory knowledge includes familiarity with Python programming, basic SQL, and core big data concepts such as distributed computing and partitioning. Databricks' official training courses—particularly 'Apache Spark™ Programming with Databricks' and 'Developing Applications with Apache Spark™'—are strongly recommended as preparation. Prior completion of an introductory Databricks or Spark course, combined with hands-on practice in a Databricks workspace, will substantially improve a candidate's readiness.
The exam consists of 45 scored multiple-choice questions and must be completed within 90 minutes, delivered in a proctored format either online or at an authorized testing center. All code snippets and questions are presented in Python. The exam costs $200 USD per attempt, and the certification is valid for two years, after which recertification is required. The passing score is 70%.
As with many proctored certification exams, the exam may include a small number of unscored pilot items used to gather statistical data for future exam development; these items are not identified and do not impact the final score, and additional time is factored in to account for them. Questions are scenario-based and frequently require candidates to evaluate PySpark code snippets, making hands-on coding experience essential for success.
Earning the Databricks Certified Associate Developer for Apache Spark credential signals verified, entry-level proficiency in one of the most widely deployed distributed data processing frameworks in the industry. Because Apache Spark is used at scale by organizations in virtually every sector, the certification is relevant beyond Databricks-specific roles. Common target positions for certified professionals include Data Engineer, Big Data Developer, Analytics Engineer, and Data Platform Engineer. With experience, certified professionals often move into senior data engineering, data architecture, and principal engineering roles.
In terms of compensation, mid-level data engineers with Spark proficiency in the United States commonly earn in the range of $130,000–$180,000 annually, with senior roles exceeding $200,000 at major technology firms. Industry surveys suggest that certified professionals can command 10–20% salary premiums over non-certified peers with equivalent experience. The certification is valid for two years and pairs well with other Databricks credentials—such as the Databricks Certified Data Engineer Associate or the Databricks Certified Machine Learning Associate—for professionals building a broader Databricks certification portfolio.
5 sample questions with explanations.
1. A Spark SQL query uses the following window function: df.withColumn('running_total', sum('amount').over(Window.partitionBy('customer_id').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow))). What does this window specification compute? (Select one!)
Explanation
The window specification with rowsBetween(Window.unboundedPreceding, Window.currentRow) creates a cumulative calculation that includes all rows from the beginning of the partition to the current row, computing a running total. The partitionBy ensures calculations are done separately for each customer_id, and orderBy date ensures chronological accumulation. A window with neither orderBy nor an explicit frame would compute the sum across all rows in the partition. With orderBy but no explicit frame, Spark defaults to a range frame from unboundedPreceding to currentRow, which is also cumulative but treats rows with equal dates as peers. A sliding window would use specific offsets like rowsBetween(-1, 1). The function uses sum, not avg, so it cannot compute averages.
2. A streaming application performs a stream-stream left outer join between impressions and clicks with the following code: impressions.withWatermark('impressionTime', '2 hours').join(clicks.withWatermark('clickTime', '3 hours'), expr('impressionAdId = clickAdId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour'), 'left_outer'). An impression at 10:00 has no matching click. When will the outer NULL result be emitted? (Select one!)
Explanation
In stream-stream outer joins, NULL results are delayed until the watermark guarantees that no future matches can arrive. For an impression at 10:00 with a 1-hour time constraint, a matching click can carry an event time of up to 11:00. Because clicks may arrive up to 3 hours late, the engine must wait until the watermark has advanced past 11:00, and the watermark trails the latest observed event time by the configured delay. The result is not emitted immediately because late clicks might still arrive. The 2-hour impression watermark alone does not determine when outer results are emitted. Left outer joins do emit NULL results, but they are delayed based on watermark guarantees.
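The timing in this explanation reduces to simple event-time arithmetic. The sketch below is plain Python, not Spark code, and rests on a simplifying assumption: that the watermark governing the join trails the latest click event time by the 3-hour delay. It ignores micro-batch boundaries and the way Spark combines watermarks from multiple streams. The calendar date is a placeholder.

```python
from datetime import datetime, timedelta

# Values taken from the question; the calendar date is a placeholder.
impression_time = datetime(2024, 1, 1, 10, 0)
match_window = timedelta(hours=1)           # clickTime <= impressionTime + interval 1 hour
click_watermark_delay = timedelta(hours=3)  # withWatermark('clickTime', '3 hours')

# Latest event time a matching click could carry.
latest_matching_click = impression_time + match_window

# Simplifying assumption: watermark = latest observed click event time - delay.
# Under that model the watermark reaches the latest matching click time only
# once an event stamped at or after this time has been observed.
event_time_needed = latest_matching_click + click_watermark_delay

print("latest matching click:", latest_matching_click.time())
print("event time needed before the NULL row can be emitted:", event_time_needed.time())
```

Under these assumptions the unmatched impression is held in state for about four hours of event time before the NULL row can appear.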
3. A data pipeline reads CSV files using spark.read.csv() with inferSchema set to true. The pipeline runs daily but experiences performance degradation as the number of input files grows. The schema remains consistent across all files. Which approach will improve read performance while maintaining correct data types? (Select one!)
Explanation
Defining an explicit schema with StructType and passing it to spark.read.schema() is the fastest approach because it eliminates the expensive schema inference pass that reads the input files to determine data types. With inferSchema enabled, Spark must scan the data before the actual read, and that scan becomes slower as files accumulate. Using inferSchema=false with the default StringType avoids the scan but loses correct data types, requiring manual casting later. The samplingRatio option reduces inference time but still performs scanning and may produce incorrect schemas if the sample is not representative. maxPartitionBytes controls partition sizing for reading data, not schema inference performance.
4. A data engineer uses df.transform(lambda x: x.filter(col('status') == 'active').withColumn('processed', lit(True))) to chain transformations. What is the primary benefit of using transform() over direct method chaining? (Select one!)
Explanation
The transform method allows defining reusable transformation functions that can be applied to multiple DataFrames, improving code modularity and reusability. This is especially useful for complex transformation logic that needs to be applied consistently across different DataFrames. Transform operations go through the same Catalyst optimization as direct method chaining. Transform maintains lazy evaluation just like direct method chaining. Transform does not change the memory usage patterns compared to direct chaining.
5. A data pipeline reads Parquet files with the following code: df = spark.read.parquet('/data/year=2024/month=01/', '/data/year=2024/month=02/'). The files were written with different schemas where month=02 has an additional column revenue. What happens when mergeSchema option is not specified? (Select one!)
Explanation
When Parquet files with different schemas are read and mergeSchema is not enabled (the default is false), Spark does not reconcile the schemas. It takes the schema from a single source, described here as the first file or directory encountered, and applies it to every file. For files that lack a column present in that schema, Spark fills the column with null values, so when the schema in use includes revenue, rows from month=01 show null for it. A column that is absent from the schema in use is not read at all. Spark does not throw exceptions for schema mismatches by default. Schema merging must be explicitly enabled with option('mergeSchema', 'true'). Spark does not automatically read only common columns; it uses the full schema from the source it picked.