ML Engine
Thermocline exposes in-database ML through aggregation stages in storage-engine.
Stages
- `$mlTrain`
- `$mlPredict`
- `$mlListModels`
- `$mlDeleteModel`
Stage parsing/dispatch anchors:
- `services/storage-engine/src/aggregation/pipeline.rs`
- `services/storage-engine/src/aggregation/ml.rs`
Runtime note:
- Storage-engine ML stage initialization forces the ML engine enabled for stage execution (`ml_engine()` sets `config.enabled = true`). `ML_ENGINE_ENABLED` still exists in plugin config parsing but does not disable `$ml*` stages in storage-engine.
$mlTrain Required Fields
| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | Yes | Model name/key |
| `type` | string | Yes | Algorithm type/alias |
| `features` | string[] | Yes | Numeric feature field paths |
| `target` | string | Supervised models | Required for supervised model families |
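
The required-field rules in the table can be checked on the client before a pipeline is sent. This is a minimal sketch, not the engine's own validation: `checkTrainSpec` and `SUPERVISED_TYPES` are hypothetical names, and the set of types treated as supervised is an assumption based on general ML usage rather than on the engine source. It checks canonical type names only.

```javascript
// Hypothetical client-side pre-check; the engine performs its own validation.
// ASSUMPTION: this list of supervised types is inferred, not confirmed.
const SUPERVISED_TYPES = new Set([
  "linearRegression", "logisticRegression", "elasticNet",
  "naiveBayes", "multinomialNaiveBayes", "decisionTree", "randomForest",
]);

function checkTrainSpec(spec) {
  const problems = [];
  const isNonEmptyString = (v) => typeof v === "string" && v.length > 0;

  if (!isNonEmptyString(spec.model)) problems.push("model must be a non-empty string");
  if (!isNonEmptyString(spec.type)) problems.push("type must be a non-empty string");
  if (!Array.isArray(spec.features) || spec.features.length === 0 ||
      !spec.features.every(isNonEmptyString)) {
    problems.push("features must be a non-empty array of field paths");
  }
  // target is only required when the type is (assumed to be) supervised.
  if (SUPERVISED_TYPES.has(spec.type) && !isNonEmptyString(spec.target)) {
    problems.push("target is required for supervised model types");
  }
  return problems;
}

// A supervised spec without a target yields one problem.
console.log(checkTrainSpec({ model: "rf_churn_v1", type: "randomForest", features: ["age"] }));
```

An empty result only means the spec passed this local check; the stage can still reject it.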
Supported Model Types and Aliases
| Canonical Type | Accepted Aliases |
|---|---|
| `linearRegression` | `linearRegression` |
| `logisticRegression` | `logisticRegression` |
| `elasticNet` | `elasticNet` |
| `kmeans` | `kmeans` |
| `miniBatchKmeans` | `miniBatchKmeans` |
| `dbscan` | `dbscan` |
| `pca` | `pca` |
| `truncatedSvd` | `truncatedSvd`, `svd` |
| `naiveBayes` | `naiveBayes`, `gaussianNaiveBayes` |
| `multinomialNaiveBayes` | `multinomialNaiveBayes`, `multinomialNB` |
| `decisionTree` | `decisionTree` |
| `randomForest` | `randomForest` |
| `isolationForest` | `isolationForest` |
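
The alias table can be mirrored as a lookup for client code that needs the canonical name. This is a sketch only: `TYPE_ALIASES` and `canonicalType` are hypothetical, and exact, case-sensitive matching is an assumption (the engine may be more lenient).

```javascript
// Alias -> canonical type, transcribed from the table above.
const TYPE_ALIASES = new Map([
  ["linearRegression", "linearRegression"],
  ["logisticRegression", "logisticRegression"],
  ["elasticNet", "elasticNet"],
  ["kmeans", "kmeans"],
  ["miniBatchKmeans", "miniBatchKmeans"],
  ["dbscan", "dbscan"],
  ["pca", "pca"],
  ["truncatedSvd", "truncatedSvd"],
  ["svd", "truncatedSvd"],
  ["naiveBayes", "naiveBayes"],
  ["gaussianNaiveBayes", "naiveBayes"],
  ["multinomialNaiveBayes", "multinomialNaiveBayes"],
  ["multinomialNB", "multinomialNaiveBayes"],
  ["decisionTree", "decisionTree"],
  ["randomForest", "randomForest"],
  ["isolationForest", "isolationForest"],
]);

// Returns the canonical type, or null when the name is not in the table.
function canonicalType(name) {
  return TYPE_ALIASES.has(name) ? TYPE_ALIASES.get(name) : null;
}

console.log(canonicalType("svd"), canonicalType("multinomialNB"));
```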
Model Hyperparameters
Supervised Regression / Classification
- Logistic regression: `maxIterations`, `tolerance`
- ElasticNet: `alpha`, `l1Ratio`, `maxIterations`, `tolerance`
- DecisionTree: `maxDepth`, `minSamplesSplit`, `minSamplesLeaf`
- RandomForest: `nEstimators`/`nTrees`, `maxDepth`, `minSamplesSplit`, `minSamplesLeaf`, `maxFeatures`
Clustering / Decomposition / Anomaly
- KMeans: `k`, `maxIterations`, `tolerance`
- MiniBatchKmeans: `k`, `batchSize`, `maxIterations`, `tolerance`
- DBSCAN: `eps`, `minPoints`
- PCA: `nComponents`
- TruncatedSvd: `nComponents`
- IsolationForest: `nTrees`, `subsampleSize`, `contamination`
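
As an illustration of how these hyperparameters might be passed, here is a sketch of an unsupervised `$mlTrain` spec for KMeans. It assumes hyperparameters sit at the top level of the stage spec, as in the random forest example later in this document. The model name and all values are hypothetical; defaults and valid ranges are not documented here.

```javascript
// Illustrative values only.
const kmeansTrain = [
  {
    $mlTrain: {
      model: "segments_v1", // hypothetical model name
      type: "kmeans",
      features: ["age", "tenure_months", "support_tickets"],
      k: 4,
      maxIterations: 100,
      tolerance: 1e-4
      // no target and no evaluation options: this is an unsupervised model
    }
  }
];

// In a shell connected to the database, the pipeline would be passed to
// aggregate, for example: db.training.aggregate(kmeansTrain)
console.log(JSON.stringify(kmeansTrain, null, 2));
```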
Evaluation Options ($mlTrain)
You can request built-in validation output for supervised models:
- `evaluation.testRatio` (or `testRatio` / `test_ratio`)
- `evaluation.kFolds` (or `kFolds` / `k_folds`)
- `evaluation.shuffleSeed` (or `shuffleSeed` / `shuffle_seed`)
Unsupervised models reject evaluation options with explicit stage errors.
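
The accepted spellings can be collapsed into one canonical shape on the client. This is a sketch under assumptions: `normalizeEvaluation` is a hypothetical helper, the short spellings are read as top-level fields of the stage spec, and the precedence used here (nested `evaluation.*` first, then camelCase, then snake_case) is not documented and may differ from the engine's.

```javascript
// Top-level spellings accepted for each evaluation option.
const EVAL_KEYS = {
  testRatio: ["testRatio", "test_ratio"],
  kFolds: ["kFolds", "k_folds"],
  shuffleSeed: ["shuffleSeed", "shuffle_seed"],
};

function normalizeEvaluation(spec) {
  const out = {};
  const nested = spec.evaluation === undefined ? {} : spec.evaluation;
  for (const [name, topLevelKeys] of Object.entries(EVAL_KEYS)) {
    // ASSUMPTION: the nested form wins when several spellings are present.
    if (nested[name] !== undefined) {
      out[name] = nested[name];
      continue;
    }
    const key = topLevelKeys.find((k) => spec[k] !== undefined);
    if (key !== undefined) out[name] = spec[key];
  }
  return out;
}

console.log(normalizeEvaluation({ evaluation: { testRatio: 0.2 }, k_folds: 5, shuffleSeed: 42 }));
```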
Returned Metrics
Regression
- `r_squared`, `mae`, `mse`, `rmse`, `explained_variance`
Classification
- `accuracy`
- `precision_macro`, `recall_macro`, `f1_macro`
- `precision_weighted`, `recall_weighted`, `f1_weighted`
- `support`
- `confusion_matrix`
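
To show how the macro and weighted variants differ, the sketch below derives them from a confusion matrix using the standard definitions. It is not the engine's code. It assumes rows are actual classes and columns are predicted classes, and it treats a zero denominator as 0; the engine's conventions for both are not confirmed here. `classificationSummary` is a hypothetical name.

```javascript
function classificationSummary(cm) {
  const sum = (xs) => xs.reduce((a, b) => a + b, 0);
  const ratio = (num, den) => (den === 0 ? 0 : num / den);

  const support = cm.map((row) => sum(row));                       // actual count per class
  const predicted = cm.map((_, j) => sum(cm.map((row) => row[j]))); // predicted count per class
  const total = sum(support);

  const precision = cm.map((row, i) => ratio(row[i], predicted[i]));
  const recall = cm.map((row, i) => ratio(row[i], support[i]));
  const f1 = precision.map((p, i) => ratio(2 * p * recall[i], p + recall[i]));

  // Macro: plain mean over classes. Weighted: mean weighted by class support.
  const macro = (xs) => ratio(sum(xs), xs.length);
  const weighted = (xs) => ratio(sum(xs.map((x, i) => x * support[i])), total);

  return {
    accuracy: ratio(sum(cm.map((row, i) => row[i])), total),
    precision_macro: macro(precision),
    recall_macro: macro(recall),
    f1_macro: macro(f1),
    precision_weighted: weighted(precision),
    recall_weighted: weighted(recall),
    f1_weighted: weighted(f1),
    support,
  };
}

// Imbalanced two-class example: 100 samples of class 0, 10 of class 1.
console.log(classificationSummary([[90, 10], [5, 5]]));
```

On imbalanced data the weighted figures track the majority class, while the macro figures give each class equal weight.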
Binary Ranking (where applicable)
- `roc_auc`, `pr_auc`, `log_loss`
Tree Explainability
DecisionTree and RandomForest include:
- `feature_importance`
- `shap.base_value`
- `shap.mean_abs`
$mlPredict
$mlPredict loads model metadata/features by model name and appends prediction fields to each input document.
Common output fields:
- `prediction`
- `_prediction`
Family-specific outputs can include:
- `cluster` (clustering)
- `projection`, `_projection` (PCA/SVD)
- `anomaly_score`, `is_anomaly` (isolation forest)
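
A scoring pipeline that keeps only the fields of interest might be built as below. This is a sketch: `predictPipeline` and the family keys in `FAMILY_FIELDS` are hypothetical client-side names, and the field lists are taken from the lists above, which say these outputs "can" be present rather than guaranteeing them.

```javascript
// Family-specific output fields, as listed above.
const FAMILY_FIELDS = {
  clustering: ["cluster"],
  decomposition: ["projection", "_projection"],
  isolationForest: ["anomaly_score", "is_anomaly"],
};

// Builds [$mlPredict, $project] for a model, keeping caller-chosen fields
// plus the common and family-specific prediction outputs.
function predictPipeline(model, family, keep = []) {
  const extra = FAMILY_FIELDS[family] === undefined ? [] : FAMILY_FIELDS[family];
  const fields = [...keep, "prediction", "_prediction", ...extra];
  const project = Object.fromEntries(fields.map((f) => [f, 1]));
  return [{ $mlPredict: { model } }, { $project: project }];
}

// Would be passed to aggregate on the collection to score,
// for example: db.scoring.aggregate(pipeline)
const pipeline = predictPipeline("iforest_v1", "isolationForest", ["customer_id"]);
console.log(JSON.stringify(pipeline));
```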
Model Registry and Persistence
ML models are cached in memory and persisted to a disk-backed model store.
Store root resolution order:
- `ML_ENGINE_MODEL_STORE_DIR`
- `SE_DATA_DIR/ml-models`
- `DATA_DIR/ml-models`
- temp dir fallback
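
The resolution order can be sketched as a first-match lookup. This is an illustration, not the engine's Rust code: `resolveModelStoreDir` is hypothetical, treating these names as environment-style settings and treating empty values as unset are both assumptions, and the exact temp directory layout is not specified in this document, so the fallback is passed in by the caller.

```javascript
// env: plain object of settings (for example process.env in Node).
// tempDir: fallback location chosen by the caller.
function resolveModelStoreDir(env, tempDir) {
  // ASSUMPTION: an empty string counts as "not set".
  const isSet = (v) => typeof v === "string" && v.length > 0;

  if (isSet(env.ML_ENGINE_MODEL_STORE_DIR)) return env.ML_ENGINE_MODEL_STORE_DIR;
  if (isSet(env.SE_DATA_DIR)) return `${env.SE_DATA_DIR}/ml-models`;
  if (isSet(env.DATA_DIR)) return `${env.DATA_DIR}/ml-models`;
  return tempDir;
}

console.log(resolveModelStoreDir({ DATA_DIR: "/var/lib/app" }, "/tmp"));
```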
Lifecycle stages:
- `$mlListModels`: list persisted + loaded models
- `$mlDeleteModel`: delete model from memory and persistent store
Example
// Train
db.training.aggregate([
  {
    $mlTrain: {
      model: "rf_churn_v1",
      type: "randomForest",
      features: ["age", "tenure_months", "support_tickets"],
      target: "churned",
      nEstimators: 200,
      maxDepth: 12,
      minSamplesSplit: 5,
      minSamplesLeaf: 2,
      evaluation: { testRatio: 0.2, kFolds: 5, shuffleSeed: 42 }
    }
  }
])

// Predict
db.scoring.aggregate([
  { $mlPredict: { model: "rf_churn_v1" } },
  { $project: { customer_id: 1, prediction: 1, _prediction: 1 } }
])

// List and delete models
db.any.aggregate([{ $mlListModels: {} }])
db.any.aggregate([{ $mlDeleteModel: { model: "rf_churn_v1" } }])
Input Constraints and Errors
The stage enforces strict validation (examples):
- Missing/empty `model`, `type`, or `features`
- Invalid model-specific parameters
- Logistic regression requires exactly 2 classes
- Multinomial Naive Bayes requires non-negative feature values
These are surfaced as explicit stage errors in $mlTrain/$mlPredict.
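
Some of these data-level constraints can be approximated on the client before training. The sketch below is not the engine's validation: `checkTrainingRows` is hypothetical, it only handles flat (non-nested) feature names, and treating `NaN` or infinite values as non-numeric is an assumption.

```javascript
// spec: the $mlTrain spec; rows: sample documents from the training collection.
function checkTrainingRows(spec, rows) {
  const problems = [];

  for (const field of spec.features) {
    const values = rows.map((r) => r[field]);
    if (!values.every((v) => typeof v === "number" && Number.isFinite(v))) {
      problems.push(`feature "${field}" has non-numeric values`);
    } else if (spec.type === "multinomialNaiveBayes" && values.some((v) => v < 0)) {
      problems.push(`feature "${field}" has negative values`);
    }
  }

  if (spec.type === "logisticRegression") {
    // Count distinct target values across the sampled rows.
    const classes = new Set(rows.map((r) => r[spec.target]));
    if (classes.size !== 2) {
      problems.push(`logisticRegression needs exactly 2 classes, found ${classes.size}`);
    }
  }
  return problems;
}

const sample = [
  { age: 31, churned: "yes" },
  { age: 45, churned: "no" },
  { age: 52, churned: "maybe" },
];
console.log(checkTrainingRows(
  { type: "logisticRegression", features: ["age"], target: "churned" },
  sample
));
```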