Databricks • DCASD

Databricks Certified Associate Developer for Apache Spark Practice Test

Validates the ability to use Apache Spark DataFrame API and Spark SQL for data manipulation tasks, covering Spark architecture and execution model, DataFrame transformations and actions, Structured Streaming, Spark Connect, and performance tuning.

Exam Details

Questions: 604 (practice question bank; the live exam has 45 scored questions)
Duration: 90 minutes
Passing Score: 70%
Difficulty: Associate
Last Updated: Feb 2026

Exam Overview

The Databricks Certified Associate Developer for Apache Spark validates a candidate's ability to use the Apache Spark DataFrame API and core Spark concepts to perform essential data manipulation tasks within a Spark session. The exam was significantly updated in April 2025 (replacing the legacy Spark 3.0 version) and now covers the Spark DataFrame API for selecting, renaming, and manipulating columns; filtering, dropping, sorting, and aggregating rows; handling missing data; combining, reading, writing, and partitioning DataFrames with schemas; and working with user-defined functions (UDFs) and Spark SQL functions. Code in the exam is presented exclusively in Python.

Beyond the DataFrame API, the certification assesses foundational knowledge of the Spark architecture and execution model, including execution and deployment modes, the execution hierarchy (jobs, stages, tasks), fault tolerance, garbage collection, lazy evaluation, shuffling, actions, and broadcasting. The updated exam also includes coverage of Structured Streaming fundamentals, Spark Connect, the Pandas API on Apache Spark, and common performance tuning and troubleshooting techniques. This breadth makes it a comprehensive entry-level credential for working with Apache Spark in production data environments.


Who Should Take This Exam

This certification is designed for early-to-mid-career data practitioners who work with Apache Spark in Python on a regular basis. Ideal candidates include data engineers, data analysts, and software developers who build or maintain Spark-based data pipelines and need to demonstrate foundational proficiency. Databricks recommends at least six months of hands-on experience performing the tasks covered in the exam guide before attempting the exam.

The certification is well-suited for professionals transitioning into big data engineering roles, data platform engineers working on Databricks-based lakehouse architectures, and developers at organizations that use Apache Spark at scale. Because Apache Spark is used across industries—from finance and retail to healthcare and technology—this credential is valuable beyond Databricks-specific roles.

Prerequisites

There are no formal prerequisites required to register for this exam. However, Databricks strongly recommends that candidates have at least six months of practical, hands-on experience with Apache Spark before attempting the certification. Candidates should be comfortable writing PySpark code using the DataFrame API and have a working understanding of Spark's execution model.

Recommended preparatory knowledge includes familiarity with Python programming, basic SQL, and core big data concepts such as distributed computing and partitioning. Databricks' official training courses—particularly 'Apache Spark™ Programming with Databricks' and 'Developing Applications with Apache Spark™'—are strongly recommended as preparation. Prior completion of an introductory Databricks or Spark course, combined with hands-on practice in a Databricks workspace, will substantially improve a candidate's readiness.

Exam Format

The exam consists of 45 scored multiple-choice questions and must be completed within 90 minutes, delivered in a proctored format either online or at an authorized testing center. All code snippets and questions are presented in Python. The exam costs $200 USD per attempt, and the certification is valid for two years, after which recertification is required. The passing score is 70%.

As with many proctored certification exams, the exam may include a small number of unscored pilot items used to gather statistical data for future exam development; these items are not identified and do not impact the final score, and additional time is factored in to account for them. Questions are scenario-based and frequently require candidates to evaluate PySpark code snippets, making hands-on coding experience essential for success.

Skills Measured

1. Spark DataFrame API – Core Operations (approximately 70% of exam): Selecting, renaming, and manipulating columns; filtering, dropping, sorting, and aggregating rows; handling missing and null data; combining DataFrames via joins and unions; reading and writing DataFrames in various formats (Parquet, CSV, JSON, Delta); managing schemas and partitioning; applying user-defined functions (UDFs); and using Spark SQL functions.
2. Spark Architecture and Execution Model: Understanding Spark's execution and deployment modes (client vs. cluster, local vs. distributed); the execution hierarchy (applications, jobs, stages, tasks); lazy evaluation and the DAG scheduler; fault tolerance and lineage; garbage collection behavior; shuffling and its performance implications; actions vs. transformations; and broadcast variables and accumulators.
3. Structured Streaming: Fundamentals of Spark Structured Streaming, including reading from and writing to streaming sources and sinks, output modes (append, complete, update), triggers, and watermarking for handling late data.
4. Spark Connect: Understanding the Spark Connect architecture and its role in enabling a decoupled client-server model for submitting Spark workloads, including how it differs from the traditional Spark driver model.
5. Pandas API on Apache Spark: Using the Pandas API on Spark (formerly Koalas) to write Pandas-compatible code that executes on a distributed Spark cluster, including awareness of its behavioral differences from native Pandas.
6. Performance Tuning and Troubleshooting: Common optimization techniques including caching and persistence strategies, partition management, avoiding data skew, understanding the Spark UI for diagnosing bottlenecks, and applying best practices for efficient Spark job execution.

Study Tips

Work through the official Databricks training course 'Apache Spark™ Programming with Databricks' (available on the Databricks Academy platform), as its curriculum is directly aligned with the exam objectives and includes hands-on labs in a real Databricks environment.
Practice writing PySpark DataFrame transformations from scratch in a free Databricks Community Edition workspace. Focus heavily on chaining transformations, reading/writing various file formats, handling nulls, and writing UDFs—these topics represent the bulk of exam questions.
Study the official Databricks exam guide and topic outline published on the certification page to ensure you cover every listed objective, paying special attention to new topics in the 2025 version: Spark Connect, Structured Streaming, and the Pandas API on Spark.
Read the relevant chapters of 'Spark: The Definitive Guide' by Bill Chambers and Matei Zaharia (O'Reilly) to build a solid understanding of Spark's execution model, lazy evaluation, shuffling, and fault tolerance—conceptual questions on these topics appear alongside code-based questions.
Take at least one full-length timed practice exam under real conditions (45 questions, 90 minutes) to build familiarity with the question format and identify weak areas. Review every incorrect answer by testing the corresponding code in a Databricks notebook rather than just reading explanations.
Review the Spark documentation and Databricks documentation pages for the specific functions and classes that appear in the exam—particularly `pyspark.sql.functions`, `DataFrame` methods, streaming API (`readStream`/`writeStream`), and `SparkSession` configuration options.
For Structured Streaming and Spark Connect, focus on conceptual understanding rather than memorizing syntax—these topics tend to appear as scenario or concept questions rather than code-evaluation questions in the exam.

Career Benefits

Earning the Databricks Certified Associate Developer for Apache Spark credential signals verified, entry-level proficiency in one of the most widely deployed distributed data processing frameworks in the industry. Apache Spark is used at scale by organizations across virtually every sector, meaning this certification is relevant beyond Databricks-specific roles. Common target positions for certified professionals include Data Engineer, Big Data Developer, Analytics Engineer, and Data Platform Engineer. With experience, certified professionals move into senior data engineering, data architecture, and principal engineering roles.

In terms of compensation, mid-level data engineers with Spark proficiency in the United States commonly earn in the range of $130,000–$180,000 annually, with senior roles exceeding $200,000 at major technology firms. Industry surveys suggest that certified professionals can command 10–20% salary premiums over non-certified peers with equivalent experience. The certification is valid for two years and pairs well with other Databricks credentials—such as the Databricks Certified Data Engineer Associate or the Databricks Certified Machine Learning Associate—for professionals building a broader Databricks certification portfolio.

Sample Questions

Five sample questions with correct answers and explanations.

1. A Spark SQL query uses the following window function: df.withColumn('running_total', sum('amount').over(Window.partitionBy('customer_id').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow))). What does this window specification compute? (Select one!)

A. The sum of all amounts for each customer across all rows

B. The cumulative sum of amounts for each customer from the first row to the current row

C. The average amount for each customer up to the current date

D. The sum of amounts in a sliding window of 2 rows for each customer

Correct answer: B

Explanation

The window specification with rowsBetween unboundedPreceding and currentRow creates a cumulative calculation that includes all rows from the beginning of the partition to the current row, computing a running total. The partitionBy ensures calculations are done separately for each customer_id, and orderBy date ensures chronological accumulation. A window without rowsBetween would compute the sum across all rows in the partition. A sliding window would use specific offsets like rowsBetween(-1, 1). The function uses sum, not avg, so it cannot compute averages.
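To make the running-total semantics concrete, here is a minimal plain-Python model of what that window specification computes. It is not PySpark code: the helper name running_totals and the sample rows are hypothetical, and sorting by (customer_id, date) stands in for partitionBy plus orderBy.

```python
from collections import defaultdict


def running_totals(rows):
    """Plain-Python model of sum('amount') over a window partitioned by
    customer_id, ordered by date, from unboundedPreceding to currentRow."""
    totals = defaultdict(int)
    out = []
    # Sorting by (customer_id, date) plays the role of partitionBy + orderBy.
    for row in sorted(rows, key=lambda r: (r["customer_id"], r["date"])):
        # Each customer keeps its own accumulator, like a separate partition.
        totals[row["customer_id"]] += row["amount"]
        out.append({**row, "running_total": totals[row["customer_id"]]})
    return out


# Hypothetical sample data, deliberately out of date order.
rows = [
    {"customer_id": "c1", "date": "2024-01-02", "amount": 5},
    {"customer_id": "c1", "date": "2024-01-01", "amount": 10},
    {"customer_id": "c2", "date": "2024-01-01", "amount": 7},
]
result = running_totals(rows)
```

Customer c1 accumulates 10 and then 15, while c2 starts again from its own first row at 7, which is the behavior option B describes.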

2. A streaming application performs a stream-stream left outer join between impressions and clicks with the following code: impressions.withWatermark('impressionTime', '2 hours').join(clicks.withWatermark('clickTime', '3 hours'), expr('impressionAdId = clickAdId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour'), 'left_outer'). An impression at 10:00 has no matching click. When will the outer NULL result be emitted? (Select one!)

A. After the watermark advances past 11:00 plus the maximum watermark delay to ensure all potential matches are considered

B. After exactly 2 hours when the impression-side watermark expires the record

C. Immediately when the watermark advances past 10:00 since no match exists

D. Never, because left outer joins do not emit NULL results in streaming queries

Correct answer: A

Explanation

In stream-stream outer joins, NULL results are delayed until the watermark guarantees that no future matches can arrive. For an impression at 10:00 with a 1-hour time constraint, potential clicks can arrive up to 11:00. With a 3-hour watermark on clicks, the system must wait until the watermark advances past 11:00 plus the watermark delay to ensure no late clicks will match. The result is not emitted immediately because late clicks might still arrive. The 2-hour impression watermark alone does not determine when outer results are emitted. Left outer joins do emit NULL results, but they are delayed based on watermark guarantees.
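The timing in this explanation can be worked through as simple arithmetic. The sketch below is a simplified model, not Spark's implementation: it assumes each watermark equals the maximum observed event time minus its delay, that the join waits for both streams' watermarks (the minimum-watermark policy, which is Spark's default), and it ignores micro-batch scheduling. The function name and the calendar date are hypothetical.

```python
from datetime import datetime, timedelta


def min_event_times_for_null(impression_time, max_click_lag,
                             impression_delay, click_delay):
    # Latest click event time that could still match this impression.
    latest_match = impression_time + max_click_lag
    # Assumed model: watermark = max observed event time - delay, so a
    # stream's watermark passes latest_match only once its max observed
    # event time exceeds latest_match + delay.
    return {
        "latest_match": latest_match,
        "impressions_must_exceed": latest_match + impression_delay,
        "clicks_must_exceed": latest_match + click_delay,
    }


# Impression at 10:00 with the 1-hour join window and the 2h/3h watermarks
# from the question; the date itself is an arbitrary placeholder.
bounds = min_event_times_for_null(
    impression_time=datetime(2024, 1, 1, 10, 0),
    max_click_lag=timedelta(hours=1),
    impression_delay=timedelta(hours=2),
    click_delay=timedelta(hours=3),
)
```

Under these assumptions the last possible match is at 11:00, and the NULL row can only be emitted once click event times beyond 14:00 (and impression event times beyond 13:00) have been observed, which is why the larger delay governs the wait.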

3. A data pipeline reads CSV files using spark.read.csv() with inferSchema set to true. The pipeline runs daily but experiences performance degradation as the number of input files grows. The schema remains consistent across all files. Which approach will improve read performance while maintaining correct data types? (Select one!)

A. Use spark.read.csv() with inferSchema=false and let Spark use default StringType for all columns

B. Increase spark.sql.files.maxPartitionBytes to enlarge partitions and reduce schema inference overhead

C. Define an explicit StructType schema and pass it to spark.read.schema() without using inferSchema

D. Use spark.read.format('csv').option('samplingRatio', 0.1) to reduce schema inference scan

Correct answer: C

Explanation

Defining an explicit StructType schema and passing it to spark.read.schema() is the fastest approach because it eliminates the schema inference pass entirely. With inferSchema enabled, Spark must make an extra pass over the input to determine column types, and that pass grows more expensive as files accumulate. Setting inferSchema=false and accepting the default StringType avoids the scan but loses the correct data types, so columns must be cast manually later. The samplingRatio option shortens inference but still scans part of the data and may produce an incorrect schema if the sample is not representative. maxPartitionBytes controls how input files are split into partitions for reading; it has no effect on schema inference.

4. A data engineer uses df.transform(lambda x: x.filter(col('status') == 'active').withColumn('processed', lit(True))) to chain transformations. What is the primary benefit of using transform() over direct method chaining? (Select one!)

A. Transform reduces memory usage by processing data in smaller batches

B. Transform operations are automatically optimized by Catalyst while direct chaining is not

C. Transform enables reusable transformation functions that can be applied to multiple DataFrames

D. Transform executes transformations immediately rather than using lazy evaluation

Correct answer: C

Explanation

The transform method allows defining reusable transformation functions that can be applied to multiple DataFrames, improving code modularity and reusability. This is especially useful for complex transformation logic that needs to be applied consistently across different DataFrames. Transform operations go through the same Catalyst optimization as direct method chaining. Transform maintains lazy evaluation just like direct method chaining. Transform does not change the memory usage patterns compared to direct chaining.
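As an illustration of that reusability, the lambda from the question can be split into small named functions. This is a sketch under assumptions: the function names and the orders_df and users_df DataFrames in the comments are hypothetical, and SQL string expressions are used in place of col() and lit() so the functions need no imports. They rely only on the DataFrame methods filter, selectExpr, and transform.

```python
def only_active(df):
    # Keep rows whose status column equals 'active' (SQL string predicate).
    return df.filter("status = 'active'")


def mark_processed(df):
    # Keep every existing column and append a boolean column named processed.
    return df.selectExpr("*", "true AS processed")


def prepare(df):
    # Same pipeline as the lambda in the question, but named and reusable.
    return df.transform(only_active).transform(mark_processed)


# Usage with real PySpark DataFrames (hypothetical names):
#   clean_orders = orders_df.transform(prepare)
#   clean_users = users_df.transform(prepare)
```

Each function still only builds a logical plan, so Catalyst optimization and lazy evaluation behave exactly as they would with direct method chaining.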

5. A data pipeline reads Parquet files with the following code: df = spark.read.parquet('/data/year=2024/month=01/', '/data/year=2024/month=02/'). The files were written with different schemas where month=02 has an additional column revenue. What happens when mergeSchema option is not specified? (Select one!)

A. Spark reads only the common columns present in both directories

B. Spark automatically merges schemas by default for Parquet format

C. Spark uses the schema from the first file and fills missing revenue values with null for month=01

D. Spark throws an exception due to schema mismatch between the two directories

Correct answer: C

Explanation

When Parquet files with different schemas are read and mergeSchema is not enabled (the default is false), Spark does not reconcile schemas across files. It takes the schema from a summary file or, if none is available, from a single data file, so which file supplies the schema is not strictly guaranteed. Files that lack a column in the chosen schema return null for it; when that schema includes revenue, the month=01 rows have null in the revenue column. Spark does not throw an exception for this kind of schema mismatch by default. Schema merging must be explicitly enabled with option('mergeSchema', 'true'). Spark also does not restrict the read to the columns common to both directories.
