ML Engine

Thermocline exposes in-database ML through aggregation stages in the storage engine (storage-engine).

Stages

$mlTrain
$mlPredict
$mlListModels
$mlDeleteModel

Stage parsing and dispatch are implemented in:

services/storage-engine/src/aggregation/pipeline.rs
services/storage-engine/src/aggregation/ml.rs

Runtime note:

When storage-engine initializes the ML stages, it forces the ML engine on for stage execution (ml_engine() sets config.enabled = true).
ML_ENGINE_ENABLED is still parsed as part of the plugin config, but it does not disable $ml* stages in storage-engine.

`$mlTrain` Required Fields

Field	Type	Required	Notes
`model`	string	Yes	Model name/key
`type`	string	Yes	Algorithm type/alias
`features`	string[]	Yes	Numeric feature field paths
`target`	string	Supervised models	Required for supervised model families
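
As a rough illustration of the table above, a client could check a `$mlTrain` spec for missing fields before sending it. This is a minimal sketch, not the server's validation logic: `missingTrainFields` is a hypothetical helper, and the caller states whether the model family is supervised because this page does not enumerate which types count as supervised.

```javascript
// Hypothetical client-side pre-check. The server performs its own validation.
function missingTrainFields(spec, isSupervised) {
  const isNonEmptyString = (v) => typeof v === "string" && v.length > 0;
  const missing = [];
  if (!isNonEmptyString(spec.model)) missing.push("model");
  if (!isNonEmptyString(spec.type)) missing.push("type");
  // features must be a non-empty array of field paths
  if (!Array.isArray(spec.features) || spec.features.length === 0) {
    missing.push("features");
  }
  // target is only required for supervised model families
  if (isSupervised && !isNonEmptyString(spec.target)) missing.push("target");
  return missing;
}

const kmeansSpec = { model: "segments_v1", type: "kmeans", features: ["age"] };
const brokenSpec = { type: "randomForest", features: [] };
console.log(missingTrainFields(kmeansSpec, false));
console.log(missingTrainFields(brokenSpec, true));
```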

Supported Model Types and Aliases

Canonical Type	Accepted Aliases
`linearRegression`	`linearRegression`
`logisticRegression`	`logisticRegression`
`elasticNet`	`elasticNet`
`kmeans`	`kmeans`
`miniBatchKmeans`	`miniBatchKmeans`
`dbscan`	`dbscan`
`pca`	`pca`
`truncatedSvd`	`truncatedSvd`, `svd`
`naiveBayes`	`naiveBayes`, `gaussianNaiveBayes`
`multinomialNaiveBayes`	`multinomialNaiveBayes`, `multinomialNB`
`decisionTree`	`decisionTree`
`randomForest`	`randomForest`
`isolationForest`	`isolationForest`
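
The alias table can be mirrored on the client to normalize user input before building a stage. The sketch below is an assumption-laden illustration: `canonicalModelType` is a hypothetical helper, and it matches names exactly (case-sensitively), which may or may not be how the server resolves aliases.

```javascript
// Canonical type names, copied from the table above.
const CANONICAL_TYPES = [
  "linearRegression", "logisticRegression", "elasticNet", "kmeans",
  "miniBatchKmeans", "dbscan", "pca", "truncatedSvd", "naiveBayes",
  "multinomialNaiveBayes", "decisionTree", "randomForest", "isolationForest",
];

// Extra aliases that differ from their canonical name.
const TYPE_ALIASES = new Map([
  ["svd", "truncatedSvd"],
  ["gaussianNaiveBayes", "naiveBayes"],
  ["multinomialNB", "multinomialNaiveBayes"],
]);

// Returns the canonical type for a name, or null if the name is unknown.
// Assumption: exact, case-sensitive matching.
function canonicalModelType(name) {
  if (CANONICAL_TYPES.includes(name)) return name;
  return TYPE_ALIASES.get(name) ?? null;
}

console.log(canonicalModelType("svd"));
console.log(canonicalModelType("xgboost"));
```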

Model Hyperparameters

Supervised Regression / Classification

Logistic regression: maxIterations, tolerance
ElasticNet: alpha, l1Ratio, maxIterations, tolerance
DecisionTree: maxDepth, minSamplesSplit, minSamplesLeaf
RandomForest: nEstimators/nTrees, maxDepth, minSamplesSplit, minSamplesLeaf, maxFeatures

Clustering / Decomposition / Anomaly

KMeans: k, maxIterations, tolerance
MiniBatchKmeans: k, batchSize, maxIterations, tolerance
DBSCAN: eps, minPoints
PCA: nComponents
TruncatedSvd: nComponents
IsolationForest: nTrees, subsampleSize, contamination
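
To show how these hyperparameters might be passed, the sketch below builds `$mlTrain` stages for two unsupervised families. It assumes hyperparameters sit at the top level of the stage spec for every family, as they do in the random forest example on this page. `trainStage` is a hypothetical helper, and the model names, field names, and parameter values are illustrative only.

```javascript
// Hypothetical builder: merges hyperparameters into the top level of the spec.
function trainStage(model, type, features, hyperparameters = {}) {
  return { $mlTrain: { model, type, features, ...hyperparameters } };
}

// KMeans: k, maxIterations, tolerance
const kmeansStage = trainStage(
  "segments_v1",
  "kmeans",
  ["recency_days", "order_count"],
  { k: 4, maxIterations: 100, tolerance: 1e-4 }
);

// DBSCAN: eps, minPoints
const dbscanStage = trainStage(
  "dense_regions_v1",
  "dbscan",
  ["lat", "lon"],
  { eps: 0.5, minPoints: 5 }
);

// Against a live deployment these would be run as, for example:
//   db.customers.aggregate([kmeansStage])
console.log(JSON.stringify(kmeansStage, null, 2));
```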

Evaluation Options (`$mlTrain`)

You can request built-in validation output for supervised models:

evaluation.testRatio (or testRatio / test_ratio)
evaluation.kFolds (or kFolds / k_folds)
evaluation.shuffleSeed (or shuffleSeed / shuffle_seed)

Unsupervised models reject evaluation options with explicit stage errors.
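
Because each evaluation option has three accepted spellings, a client may want to normalize them into the nested `evaluation` form before sending a spec. The following is a sketch under stated assumptions: `normalizeEvaluation` is a hypothetical helper, and the precedence it applies (nested form, then camelCase, then snake_case) simply follows the order the options are listed above; the server's actual precedence is not documented here.

```javascript
// [camelCase key, snake_case alias] pairs from the list above.
const EVAL_KEYS = [
  ["testRatio", "test_ratio"],
  ["kFolds", "k_folds"],
  ["shuffleSeed", "shuffle_seed"],
];

// Returns a copy of the spec with all evaluation options under `evaluation`.
// The input spec is not mutated.
function normalizeEvaluation(spec) {
  const { evaluation = {}, ...rest } = spec;
  const merged = { ...evaluation };
  for (const [camel, snake] of EVAL_KEYS) {
    // Assumed precedence: nested, then camelCase, then snake_case.
    const value = merged[camel] ?? rest[camel] ?? rest[snake];
    delete rest[camel];
    delete rest[snake];
    if (value !== undefined) merged[camel] = value;
  }
  return Object.keys(merged).length > 0 ? { ...rest, evaluation: merged } : rest;
}

const normalized = normalizeEvaluation({ model: "m", test_ratio: 0.2, kFolds: 5 });
console.log(JSON.stringify(normalized));
```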

Returned Metrics

Regression

r_squared, mae, mse, rmse, explained_variance

Classification

accuracy
precision_macro, recall_macro, f1_macro
precision_weighted, recall_weighted, f1_weighted
support
confusion_matrix
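
To clarify how the macro and weighted variants differ, the sketch below derives per-class support and both F1 averages from a confusion matrix using the standard definitions. It is not the engine's implementation. It assumes rows are actual classes and columns are predicted classes, and that undefined ratios (zero denominators) count as 0; neither convention is confirmed by this page.

```javascript
// Computes support, macro F1, and support-weighted F1 from a square
// confusion matrix. Assumption: cm[actual][predicted].
function f1Summary(cm) {
  const n = cm.length;
  const support = cm.map((row) => row.reduce((a, b) => a + b, 0));
  const total = support.reduce((a, b) => a + b, 0);
  let f1Macro = 0;
  let f1Weighted = 0;
  for (let c = 0; c < n; c++) {
    const tp = cm[c][c];
    const predicted = cm.reduce((sum, row) => sum + row[c], 0);
    const precision = predicted > 0 ? tp / predicted : 0;
    const recall = support[c] > 0 ? tp / support[c] : 0;
    const f1 =
      precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
    f1Macro += f1 / n; // every class counts equally
    f1Weighted += (f1 * support[c]) / total; // classes weighted by support
  }
  return { support, f1_macro: f1Macro, f1_weighted: f1Weighted };
}

const summary = f1Summary([
  [3, 1],
  [1, 1],
]);
console.log(summary);
```

With an imbalanced matrix like the one above, the weighted average leans toward the larger class, while the macro average treats both classes equally.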

Binary Ranking (where applicable)

roc_auc, pr_auc, log_loss

Tree Explainability

DecisionTree and RandomForest include:

feature_importance
shap.base_value
shap.mean_abs

`$mlPredict`

$mlPredict loads model metadata/features by model name and appends prediction fields to each input document.

Common output fields:

prediction
_prediction

Family-specific outputs can include:

cluster (clustering)
projection, _projection (PCA/SVD)
anomaly_score, is_anomaly (isolation forest)
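
The output fields listed above can be combined into a `$project` stage that keeps only the prediction columns for a given model family. This is a sketch only: `predictPipeline` and the family keys (`clustering`, `decomposition`, `isolationForest`) are hypothetical names introduced for illustration, and the model and field names are made up.

```javascript
// Output fields as listed on this page.
const COMMON_FIELDS = ["prediction", "_prediction"];
const FAMILY_FIELDS = {
  clustering: ["cluster"],
  decomposition: ["projection", "_projection"],
  isolationForest: ["anomaly_score", "is_anomaly"],
};

// Builds [$mlPredict, $project] keeping `keep` plus the relevant outputs.
function predictPipeline(model, family, keep = []) {
  const fields = [...keep, ...COMMON_FIELDS, ...(FAMILY_FIELDS[family] ?? [])];
  const projection = Object.fromEntries(fields.map((f) => [f, 1]));
  return [{ $mlPredict: { model } }, { $project: projection }];
}

const anomalyPipeline = predictPipeline("iso_v1", "isolationForest", ["customer_id"]);
// Against a live deployment: db.scoring.aggregate(anomalyPipeline)
console.log(JSON.stringify(anomalyPipeline, null, 2));
```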

Model Registry and Persistence

ML models are cached in memory and persisted to a disk-backed model store.

Store root resolution order:

ML_ENGINE_MODEL_STORE_DIR
SE_DATA_DIR/ml-models
DATA_DIR/ml-models
temp dir fallback
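
The resolution order above can be expressed as a small lookup. The sketch below runs under Node.js and is only an approximation of the documented order: `resolveModelStoreDir` is a hypothetical helper, it treats empty strings as unset (an assumption), and it returns the bare OS temp directory for the fallback because any subdirectory the engine uses there is not documented on this page.

```javascript
const os = require("os");
const path = require("path");

// Returns the first matching store root, in the documented order.
function resolveModelStoreDir(env = process.env) {
  const isSet = (v) => typeof v === "string" && v.length > 0;
  if (isSet(env.ML_ENGINE_MODEL_STORE_DIR)) return env.ML_ENGINE_MODEL_STORE_DIR;
  if (isSet(env.SE_DATA_DIR)) return path.join(env.SE_DATA_DIR, "ml-models");
  if (isSet(env.DATA_DIR)) return path.join(env.DATA_DIR, "ml-models");
  // Fallback: temp dir (exact subdirectory, if any, is not documented).
  return os.tmpdir();
}

console.log(resolveModelStoreDir({ SE_DATA_DIR: "/var/lib/se", DATA_DIR: "/data" }));
```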

Lifecycle stages:

$mlListModels: list persisted + loaded models
$mlDeleteModel: delete model from memory and persistent store

Example

// Train
db.training.aggregate([
  {
    $mlTrain: {
      model: "rf_churn_v1",
      type: "randomForest",
      features: ["age", "tenure_months", "support_tickets"],
      target: "churned",
      nEstimators: 200,
      maxDepth: 12,
      minSamplesSplit: 5,
      minSamplesLeaf: 2,
      evaluation: { testRatio: 0.2, kFolds: 5, shuffleSeed: 42 }
    }
  }
])

// Predict
db.scoring.aggregate([
  { $mlPredict: { model: "rf_churn_v1" } },
  { $project: { customer_id: 1, prediction: 1, _prediction: 1 } }
])

// List and delete models
db.any.aggregate([{ $mlListModels: {} }])
db.any.aggregate([{ $mlDeleteModel: { model: "rf_churn_v1" } }])

Input Constraints and Errors

The stages enforce strict validation. Examples of rejected input:

Missing/empty model, type, or features
Non-numeric feature values
Invalid model-specific parameters
Logistic regression requires exactly 2 classes
Multinomial Naive Bayes requires non-negative feature values

These are surfaced as explicit stage errors in $mlTrain/$mlPredict.
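
Some of these constraints can be checked on sample documents before training, which may give faster feedback than a failed stage. The following is a hedged sketch, not a reproduction of the server's checks: `precheckTrainingDocs` and `getPath` are hypothetical helpers, dotted feature paths are resolved by splitting on ".", and treating NaN or infinite values as non-numeric is an assumption.

```javascript
// Resolves a dotted field path such as "plan.seats" against a document.
function getPath(doc, fieldPath) {
  return fieldPath
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Returns a list of human-readable problems; empty means no issue was found.
function precheckTrainingDocs(docs, spec) {
  const problems = [];
  docs.forEach((doc, i) => {
    for (const feature of spec.features) {
      const value = getPath(doc, feature);
      if (typeof value !== "number" || !Number.isFinite(value)) {
        problems.push(`doc ${i}: feature "${feature}" is not numeric`);
      } else if (spec.type === "multinomialNaiveBayes" && value < 0) {
        problems.push(`doc ${i}: feature "${feature}" is negative`);
      }
    }
  });
  if (spec.type === "logisticRegression") {
    const classes = new Set(docs.map((doc) => getPath(doc, spec.target)));
    if (classes.size !== 2) {
      problems.push(`target "${spec.target}" has ${classes.size} classes, expected 2`);
    }
  }
  return problems;
}

const sampleDocs = [
  { age: 30, plan: { seats: 2 }, churned: 0 },
  { age: "unknown", plan: { seats: 3 }, churned: 1 },
];
const sampleSpec = {
  type: "logisticRegression",
  features: ["age", "plan.seats"],
  target: "churned",
};
console.log(precheckTrainingDocs(sampleDocs, sampleSpec));
```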
