Databricks • DCMLEA
Validates foundational knowledge of machine learning on the Databricks platform, covering AutoML, Feature Store, ML workflows and experiment tracking with MLflow, model development with Spark ML, and model deployment and serving.
Questions: 630
Duration: 90 minutes
Passing Score: 70%
Difficulty: Associate
Last Updated: Feb 2026
The Databricks Certified Machine Learning Associate certification validates foundational knowledge and practical ability to perform core machine learning tasks on the Databricks Lakehouse Platform. The exam covers the full ML lifecycle, including exploratory data analysis, feature engineering, model training, hyperparameter tuning, evaluation, and deployment using Databricks-native tooling such as AutoML, the Feature Store, Unity Catalog integration, and Managed MLflow for experiment tracking and model registry. Candidates are expected to demonstrate proficiency with both single-node and distributed machine learning approaches, including Spark ML APIs, Hyperopt with SparkTrials, and Pandas UDFs.
The certification was updated on October 28, 2024, to reflect current platform capabilities including real-time, batch, and streaming inference patterns as well as MLOps best practices such as model metadata tagging. All machine learning code on the exam is in Python; data manipulation code outside ML-specific tasks may appear in SQL. The exam is administered online through Databricks' exam delivery platform and costs $200 USD, with local taxes potentially applicable.
This certification is designed for data scientists, machine learning engineers, and ML-adjacent data engineers who perform machine learning workflows on Databricks and want to validate their skills at an associate level. Candidates are typically early-to-mid career practitioners with approximately 6 or more months of hands-on experience using Databricks for machine learning tasks including model training, tuning, and deployment.
The exam is also well suited to analytics consultants and data engineers who collaborate closely with ML teams and want to deepen their understanding of the Databricks ML platform. It serves as a stepping stone toward the Databricks Certified Machine Learning Professional certification.
There are no formal prerequisites required to sit for this exam. However, Databricks recommends at least 6 months of hands-on experience performing machine learning tasks on the Databricks platform as outlined in the official exam guide. Candidates should have practical familiarity with Databricks workspaces, clusters, Repos, and Jobs, as well as the Databricks Runtime for Machine Learning and its bundled libraries.
A foundational understanding of machine learning concepts—including supervised learning, feature engineering, model evaluation metrics, and hyperparameter tuning—is expected. Familiarity with Python and a working knowledge of Apache Spark concepts (DataFrames, distributed computation) are strongly recommended, as Spark ML accounts for the largest share of exam content.
The Databricks Certified Machine Learning Associate exam consists of 48 scored multiple-choice and multiple-response questions to be completed within 90 minutes. The passing score is 70%. The exam may include a small number of unscored items used to gather statistical data for future exam development; these items are not identified on the form, do not count toward the final score, and are accounted for in the total allotted time.
The exam is delivered online through Databricks' exam delivery platform and can be taken remotely. All ML code presented in questions is written in Python; SQL may appear for non-ML data manipulation scenarios. The certification is valid for two years from the date of passing, after which recertification is required to maintain certified status. The exam fee is $200 USD (local taxes may apply).
Holding the Databricks Certified Machine Learning Associate credential signals verified proficiency with the Databricks Lakehouse Platform for ML, a platform widely adopted by enterprises running on Azure, AWS, and Google Cloud. It is recognized by employers hiring for data scientist, ML engineer, and MLOps roles where Databricks is part of the production stack. The certification is particularly valuable at organizations that have standardized on Databricks for unified data and AI workloads, as it demonstrates readiness to contribute to ML pipelines without extensive onboarding.
While Databricks does not publish official salary data tied to this specific credential, practitioners with Databricks ML certifications and associated skills (Spark, MLflow, cloud ML platforms) command salaries broadly in the $110,000–$160,000+ USD range for ML engineer and data scientist roles in the US market, depending on seniority and location. The Associate-level certification serves as a recognized stepping stone to the Databricks Certified Machine Learning Professional exam, which tests advanced topics such as model monitoring, feature engineering at scale, and custom MLflow integrations.
5 sample questions with correct answers and explanations.
1. A machine learning team transitions a model version from Staging to Production in the Workspace Model Registry using the UI. They navigate to the Models page, click on the model name customer_churn_predictor, and see the model overview showing description and permissions. They need to transition version 4 from Staging to Production and archive all existing Production versions. Where should they navigate to perform this transition? (Select one!)
Explanation
Stage transitions in the Workspace Model Registry UI must be performed from the individual version details page, not from the model overview page. Users must click on the specific version number to access the version page where the Stage dropdown and transition controls are located. The model overview page shows aggregate information about all versions but does not provide stage transition controls. The Serving tab is for creating serving endpoints and is separate from Model Registry stage management. There is no batch transition feature on the Models home page; transitions are performed per version on the version details page.
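Teams that script the same promotion instead of using the UI can do so through the MLflow client. The following is a minimal sketch, assuming the Workspace Model Registry (not Unity Catalog) and an MLflow version that still provides `transition_model_version_stage`. The helper name `promote_to_production` is hypothetical; the client is passed in so the function can be exercised without a workspace.

```python
def promote_to_production(client, model_name, version):
    """Move one model version to Production and archive versions already there.

    `client` is expected to be an mlflow.tracking.MlflowClient, or any object
    exposing the same transition_model_version_stage method.
    """
    return client.transition_model_version_stage(
        name=model_name,
        version=str(version),
        stage="Production",
        archive_existing_versions=True,  # archive the current Production versions
    )


# Example usage in a Databricks notebook (requires mlflow):
#   from mlflow.tracking import MlflowClient
#   promote_to_production(MlflowClient(), "customer_churn_predictor", 4)
```

As in the UI, the call targets a single version, and `archive_existing_versions=True` plays the role of the option to archive existing Production versions.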
2. A machine learning engineer uses StringIndexer to encode categorical variables before training a GBTClassifier for customer churn prediction with three classes: Low Risk, Medium Risk, and High Risk. What error will occur? (Select one!)
Explanation
GBTClassifier in Spark MLlib only supports binary classification with exactly two classes, not multiclass problems with three or more classes. Attempting to train a GBTClassifier with three classes (Low Risk, Medium Risk, High Risk) will fail. StringIndexer supports any number of categories and is appropriate for tree-based models. GBTClassifier works directly with StringIndexer output and does not require OneHotEncoder; tree-based models handle indexed categorical variables efficiently without one-hot encoding. GBTClassifier does not require scaling for any features as tree-based models are scale-invariant.
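One way to avoid this failure is to switch to a tree ensemble that supports multiclass labels, such as `RandomForestClassifier`. The sketch below is illustrative only: the helper names, the `risk_level` label column, and the pre-assembled `features` vector column are assumptions, and `pyspark` is imported lazily so the selection logic can be checked without a Spark installation.

```python
def tree_classifier_name(num_classes):
    """Pick a Spark ML tree ensemble that supports the given number of classes."""
    if num_classes < 2:
        raise ValueError("classification needs at least two classes")
    # GBTClassifier handles binary labels only; RandomForestClassifier
    # also handles three or more classes.
    return "GBTClassifier" if num_classes == 2 else "RandomForestClassifier"


def build_pipeline(num_classes, label_col="risk_level", features_col="features"):
    """Build a StringIndexer -> tree classifier pipeline (requires pyspark)."""
    from pyspark.ml import Pipeline, classification
    from pyspark.ml.feature import StringIndexer

    # Index the string label (e.g. Low/Medium/High Risk) into numeric classes.
    indexer = StringIndexer(inputCol=label_col, outputCol="label")
    classifier_cls = getattr(classification, tree_classifier_name(num_classes))
    classifier = classifier_cls(labelCol="label", featuresCol=features_col)
    return Pipeline(stages=[indexer, classifier])
```

For the three-class churn problem in the question, `build_pipeline(3)` would yield a pipeline ending in `RandomForestClassifier` rather than `GBTClassifier`.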
3. A fintech company deploys a fraud detection model to a Model Serving endpoint with GPU_SMALL workload type. During peak transaction periods, the endpoint experiences increased latency. They want to enable scale_to_zero_enabled to reduce costs during off-peak hours. What is the primary risk of this configuration for production workloads? (Select one!)
Explanation
The primary risk of scale to zero for GPU endpoints is cold start latency when the endpoint needs to scale up from zero instances, combined with the fact that GPU capacity is not guaranteed when scaling from zero. This can cause unacceptable delays during sudden traffic spikes. Scale to zero is supported for GPU workload types including GPU_SMALL. Model artifacts are not lost when scaling to zero as they are stored separately. Databricks documentation specifically warns against using scale to zero for production workloads due to these reliability concerns.
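The trade-off can be made explicit in the endpoint configuration. The sketch below builds the served-entity portion of a Model Serving endpoint config as a plain dictionary. The field names follow the serving endpoints REST API as best understood here and should be checked against the current API reference; the model name `fraud_detector`, the version number, and the helper name are hypothetical.

```python
def served_entity_config(model_name, model_version, production):
    """Return one served-entity entry for a Model Serving endpoint config."""
    return {
        "entity_name": model_name,
        "entity_version": str(model_version),
        "workload_type": "GPU_SMALL",
        "workload_size": "Small",
        # Production keeps capacity provisioned to avoid cold starts;
        # non-production endpoints may scale to zero to save cost.
        "scale_to_zero_enabled": not production,
    }


prod_config = {"served_entities": [served_entity_config("fraud_detector", 3, production=True)]}
dev_config = {"served_entities": [served_entity_config("fraud_detector", 3, production=False)]}
```

Keeping the flag tied to an explicit `production` argument makes it harder to carry a cost-saving development setting into a latency-sensitive endpoint by accident.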
4. An ML platform team orchestrates a multi-stage ML pipeline using Databricks Jobs with three sequential tasks: data preprocessing, model training, and model registration. The training task needs the feature table path from the preprocessing task. Which Databricks Jobs feature enables passing this information between tasks? (Select one!)
Explanation
Task values provide the native Databricks mechanism for passing data between job tasks. The preprocessing task uses dbutils.jobs.taskValues.set(key, value) to store the feature table path, and the training task retrieves it using dbutils.jobs.taskValues.get(taskKey, key). This is the recommended approach for lightweight data transfer between pipeline stages. While sharing via notebook variables with %run or Delta tables technically works, these are not purpose-built for inter-task communication. Environment variables are set at cluster creation time and cannot be dynamically updated between tasks.
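The set and get calls described above can be sketched as two small helpers. This is an illustration under assumptions: `dbutils` is only available inside Databricks, so it is passed in as an argument, and the task key `data_preprocessing`, the key name `feature_table_path`, and the debug path are hypothetical.

```python
FEATURE_PATH_KEY = "feature_table_path"


def publish_feature_table_path(dbutils, path):
    """Call from the preprocessing task to store the path as a task value."""
    dbutils.jobs.taskValues.set(key=FEATURE_PATH_KEY, value=path)


def read_feature_table_path(dbutils, upstream_task_key="data_preprocessing"):
    """Call from the training task to read the value set by the upstream task."""
    return dbutils.jobs.taskValues.get(
        taskKey=upstream_task_key,
        key=FEATURE_PATH_KEY,
        # Returned when the notebook runs interactively, outside a job.
        debugValue="/tmp/debug_feature_table",
    )
```

In a notebook, the preprocessing task would call `publish_feature_table_path(dbutils, path)` as its last step, and the training task would call `read_feature_table_path(dbutils)` before loading features. The `taskKey` must match the upstream task's name in the job definition.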
5. A healthcare analytics team trains classification models to predict patient readmission risk. They use Databricks AutoML and want to optimize for the area under the Precision-Recall curve instead of ROC AUC because their dataset has severe class imbalance (5% positive cases). Which primary_metric value should they specify in automl.classify()? (Select one!)
Explanation
Databricks AutoML classification supports the following primary_metric values: f1 (default), log_loss, precision, accuracy, and roc_auc. Area under the Precision-Recall curve is not available as a primary metric option in AutoML, even though it would be beneficial for imbalanced datasets. Teams needing PR AUC optimization must use custom training pipelines with MLflow instead of AutoML. There is no pr_auc, auc_pr, or precision_recall metric option in the AutoML classify() API. For imbalanced classification with AutoML, f1 or precision are the closest available alternatives.
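A small guard can catch an unsupported metric before an AutoML run is started. The sketch below validates `primary_metric` against the values listed in the explanation above and builds keyword arguments for `automl.classify()`. The helper name, the `readmitted` target column, and the 30-minute timeout are hypothetical.

```python
SUPPORTED_CLASSIFICATION_METRICS = ("f1", "log_loss", "precision", "accuracy", "roc_auc")


def classify_kwargs(target_col, primary_metric="f1", timeout_minutes=30):
    """Build keyword arguments for automl.classify(), rejecting unknown metrics."""
    if primary_metric not in SUPPORTED_CLASSIFICATION_METRICS:
        raise ValueError(
            f"primary_metric {primary_metric!r} is not supported; "
            f"choose one of {SUPPORTED_CLASSIFICATION_METRICS}"
        )
    return {
        "target_col": target_col,
        "primary_metric": primary_metric,
        "timeout_minutes": timeout_minutes,
    }


# Example usage in a Databricks ML runtime notebook:
#   from databricks import automl
#   summary = automl.classify(dataset=train_df, **classify_kwargs("readmitted", "f1"))
```

Passing `"pr_auc"` to the helper raises a `ValueError` locally, which mirrors the point of the question: the PR AUC objective has to be handled in a custom training pipeline rather than through AutoML.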