ML Engine

Thermocline exposes in-database ML through aggregation stages implemented in the storage-engine service.

Stages

  • $mlTrain
  • $mlPredict
  • $mlListModels
  • $mlDeleteModel

Stage parsing/dispatch anchors:

  • services/storage-engine/src/aggregation/pipeline.rs
  • services/storage-engine/src/aggregation/ml.rs

Runtime note:

  • When storage-engine initializes an ML stage, it forces the ML engine on for stage execution: ml_engine() sets config.enabled = true.
  • ML_ENGINE_ENABLED is still parsed as part of the plugin config, but it does not disable $ml* stages in storage-engine.

$mlTrain Required Fields

| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | Yes | Model name/key |
| type | string | Yes | Algorithm type/alias |
| features | string[] | Yes | Numeric feature field paths |
| target | string | Supervised models | Required for supervised model families |

Supported Model Types and Aliases

| Canonical Type | Accepted Aliases |
|---|---|
| linearRegression | linearRegression |
| logisticRegression | logisticRegression |
| elasticNet | elasticNet |
| kmeans | kmeans |
| miniBatchKmeans | miniBatchKmeans |
| dbscan | dbscan |
| pca | pca |
| truncatedSvd | truncatedSvd, svd |
| naiveBayes | naiveBayes, gaussianNaiveBayes |
| multinomialNaiveBayes | multinomialNaiveBayes, multinomialNB |
| decisionTree | decisionTree |
| randomForest | randomForest |
| isolationForest | isolationForest |
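
The alias table can be mirrored on the client, for example to normalize a user-supplied type before building a $mlTrain stage. The following is a minimal sketch: the helper name canonicalModelType is hypothetical, and exact, case-sensitive matching is an assumption, since this page does not say how the server compares alias strings.

```javascript
// Alias -> canonical type, copied from the table on this page.
const MODEL_TYPE_ALIASES = new Map([
  ["linearRegression", "linearRegression"],
  ["logisticRegression", "logisticRegression"],
  ["elasticNet", "elasticNet"],
  ["kmeans", "kmeans"],
  ["miniBatchKmeans", "miniBatchKmeans"],
  ["dbscan", "dbscan"],
  ["pca", "pca"],
  ["truncatedSvd", "truncatedSvd"],
  ["svd", "truncatedSvd"],
  ["naiveBayes", "naiveBayes"],
  ["gaussianNaiveBayes", "naiveBayes"],
  ["multinomialNaiveBayes", "multinomialNaiveBayes"],
  ["multinomialNB", "multinomialNaiveBayes"],
  ["decisionTree", "decisionTree"],
  ["randomForest", "randomForest"],
  ["isolationForest", "isolationForest"],
]);

// Hypothetical client-side helper (not part of Thermocline).
// Assumes exact, case-sensitive alias matching.
function canonicalModelType(type) {
  const canonical = MODEL_TYPE_ALIASES.get(type);
  if (canonical === undefined) {
    throw new Error(`Unsupported model type: ${type}`);
  }
  return canonical;
}
```

With this sketch, canonicalModelType("svd") resolves to "truncatedSvd", and an unknown name fails fast on the client instead of waiting for a stage error.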

Model Hyperparameters

Supervised Regression / Classification

  • Logistic regression: maxIterations, tolerance
  • ElasticNet: alpha, l1Ratio, maxIterations, tolerance
  • DecisionTree: maxDepth, minSamplesSplit, minSamplesLeaf
  • RandomForest: nEstimators/nTrees, maxDepth, minSamplesSplit, minSamplesLeaf, maxFeatures

Clustering / Decomposition / Anomaly

  • KMeans: k, maxIterations, tolerance
  • MiniBatchKmeans: k, batchSize, maxIterations, tolerance
  • DBSCAN: eps, minPoints
  • PCA: nComponents
  • TruncatedSvd: nComponents
  • IsolationForest: nTrees, subsampleSize, contamination
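
As an illustration of how these hyperparameters might be passed, the sketch below builds a KMeans training stage as a plain object. It assumes hyperparameters sit at the top level of $mlTrain, as in the random forest example on this page. The helper name, model name, feature names, and parameter values are all hypothetical and are not documented defaults.

```javascript
// Hypothetical helper: builds a $mlTrain stage for KMeans.
// Assumption: hyperparameters are top-level fields of the stage spec,
// as in the random forest example on this page.
function kmeansTrainStage(model, features, { k, maxIterations, tolerance }) {
  return {
    $mlTrain: {
      model,
      type: "kmeans",
      features,
      // No `target`: it is only required for supervised model families.
      k,
      maxIterations,
      tolerance,
    },
  };
}

// Illustrative values only.
const kmeansStage = kmeansTrainStage(
  "segments_v1",
  ["age", "tenure_months", "support_tickets"],
  { k: 4, maxIterations: 100, tolerance: 1e-4 }
);

// In a shell session the stage would be run as:
//   db.training.aggregate([kmeansStage])
```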

Evaluation Options ($mlTrain)

You can request built-in validation output for supervised models:

  • evaluation.testRatio (or testRatio / test_ratio)
  • evaluation.kFolds (or kFolds / k_folds)
  • evaluation.shuffleSeed (or shuffleSeed / shuffle_seed)

Unsupervised models reject evaluation options with explicit stage errors.
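
The option spellings above can be sketched as a small client-side reader. This is an approximation under stated assumptions: the precedence (nested evaluation.* first, then camelCase, then snake_case) is a guess, the set of types treated as unsupervised is inferred from the "Clustering / Decomposition / Anomaly" grouping on this page, and readEvaluationOptions is a hypothetical name.

```javascript
// Assumption: these canonical types are the unsupervised ones.
const UNSUPERVISED_TYPES = new Set([
  "kmeans", "miniBatchKmeans", "dbscan", "pca", "truncatedSvd", "isolationForest",
]);

// Returns the first argument that is not undefined.
function firstDefined(...values) {
  return values.find((value) => value !== undefined);
}

// Hypothetical helper: collects evaluation options from a $mlTrain spec.
// Assumed precedence: evaluation.<name>, then camelCase, then snake_case.
function readEvaluationOptions(spec) {
  const nested = spec.evaluation ?? {};
  const options = {
    testRatio: firstDefined(nested.testRatio, spec.testRatio, spec.test_ratio),
    kFolds: firstDefined(nested.kFolds, spec.kFolds, spec.k_folds),
    shuffleSeed: firstDefined(nested.shuffleSeed, spec.shuffleSeed, spec.shuffle_seed),
  };
  const requested = Object.values(options).some((value) => value !== undefined);
  if (requested && UNSUPERVISED_TYPES.has(spec.type)) {
    // Mirrors the documented behavior: unsupervised models reject evaluation options.
    throw new Error(`evaluation options are not supported for type ${spec.type}`);
  }
  return options;
}
```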

Returned Metrics

Regression

  • r_squared, mae, mse, rmse, explained_variance
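
For orientation, the regression metrics can be reproduced from actual and predicted values using their standard textbook definitions. The sketch below is not the engine's implementation, which is not shown on this page; the function name regressionMetrics is hypothetical, and population (divide-by-n) variance is assumed.

```javascript
// Textbook definitions; assumes population variance (divide by n).
// A constant `actual` array makes the variance 0, so r_squared and
// explained_variance are not meaningful in that case.
function regressionMetrics(actual, predicted) {
  if (actual.length === 0 || actual.length !== predicted.length) {
    throw new Error("inputs must be non-empty and of equal length");
  }
  const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

  const residuals = actual.map((y, i) => y - predicted[i]);
  const mae = mean(residuals.map((r) => Math.abs(r)));
  const mse = mean(residuals.map((r) => r * r));

  const actualMean = mean(actual);
  const totalVariance = mean(actual.map((y) => (y - actualMean) ** 2));
  const residualMean = mean(residuals);
  const residualVariance = mean(residuals.map((r) => (r - residualMean) ** 2));

  return {
    r_squared: 1 - mse / totalVariance,
    mae,
    mse,
    rmse: Math.sqrt(mse),
    explained_variance: 1 - residualVariance / totalVariance,
  };
}

// Illustrative data: residuals are [1, 0, -1, 0].
const regressionExample = regressionMetrics([3, 5, 7, 9], [2, 5, 8, 9]);
```

r_squared and explained_variance coincide whenever the residuals average to zero, as they do in this example; they diverge when predictions are systematically biased.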

Classification

  • accuracy
  • precision_macro, recall_macro, f1_macro
  • precision_weighted, recall_weighted, f1_weighted
  • support
  • confusion_matrix
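
The difference between the macro and weighted variants is easiest to see when they are derived from a confusion matrix. The following sketch uses the standard definitions and makes several assumptions that this page does not confirm: rows are actual classes and columns are predicted classes, support is reported as per-class counts, and a ratio with a zero denominator is treated as 0. The function name is hypothetical.

```javascript
// Assumptions: matrix[i][j] counts samples of actual class i predicted as class j;
// a ratio with a zero denominator is reported as 0.
function classificationMetrics(matrix) {
  const sum = (xs) => xs.reduce((total, x) => total + x, 0);
  const ratio = (num, den) => (den === 0 ? 0 : num / den);
  const total = sum(matrix.flat());

  const perClass = matrix.map((row, i) => {
    const truePositives = row[i];
    const support = sum(row);                           // actual samples of class i
    const predictedCount = sum(matrix.map((r) => r[i])); // samples predicted as class i
    const precision = ratio(truePositives, predictedCount);
    const recall = ratio(truePositives, support);
    const f1 = ratio(2 * precision * recall, precision + recall);
    return { precision, recall, f1, support };
  });

  // Macro: every class counts equally. Weighted: classes count by support.
  const macro = (key) => sum(perClass.map((c) => c[key])) / perClass.length;
  const weighted = (key) => ratio(sum(perClass.map((c) => c[key] * c.support)), total);

  return {
    accuracy: ratio(sum(matrix.map((row, i) => row[i])), total),
    precision_macro: macro("precision"),
    recall_macro: macro("recall"),
    f1_macro: macro("f1"),
    precision_weighted: weighted("precision"),
    recall_weighted: weighted("recall"),
    f1_weighted: weighted("f1"),
    support: perClass.map((c) => c.support),
    confusion_matrix: matrix,
  };
}

// Illustrative imbalanced example: 4 samples of class 0, 16 of class 1.
const classificationExample = classificationMetrics([
  [3, 1],
  [2, 14],
]);
```

On imbalanced data the weighted scores lean toward the majority class, while the macro scores expose weak performance on the minority class.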

Binary Ranking (where applicable)

  • roc_auc, pr_auc, log_loss

Tree Explainability

DecisionTree and RandomForest include:

  • feature_importance
  • shap.base_value
  • shap.mean_abs

$mlPredict

$mlPredict loads model metadata/features by model name and appends prediction fields to each input document.

Common output fields:

  • prediction
  • _prediction

Family-specific outputs can include:

  • cluster (clustering)
  • projection, _projection (PCA/SVD)
  • anomaly_score, is_anomaly (isolation forest)

Model Registry and Persistence

ML models are cached in memory and persisted to a disk-backed model store.

Store root resolution order:

  1. ML_ENGINE_MODEL_STORE_DIR
  2. SE_DATA_DIR/ml-models
  3. DATA_DIR/ml-models
  4. temp dir fallback
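
The resolution order can be sketched as a first-match lookup. This is an illustration only, not the storage-engine code: the function name and the tempDir parameter are hypothetical, treating empty strings as unset is an assumption, and the exact location used inside the temp directory is not specified on this page.

```javascript
// Hypothetical mirror of the documented resolution order.
// `env` is a plain object of environment variables (for example process.env);
// `tempDir` stands in for the platform temp directory.
function resolveModelStoreRoot(env, tempDir = "/tmp") {
  // 1. An explicit override wins.
  if (env.ML_ENGINE_MODEL_STORE_DIR) return env.ML_ENGINE_MODEL_STORE_DIR;
  // 2. Storage-engine data directory.
  if (env.SE_DATA_DIR) return `${env.SE_DATA_DIR}/ml-models`;
  // 3. Generic data directory.
  if (env.DATA_DIR) return `${env.DATA_DIR}/ml-models`;
  // 4. Temp dir fallback. The exact path under the temp dir is not documented,
  //    so this sketch returns the temp dir itself.
  return tempDir;
}
```

Because the last fallback is a temp directory, models trained without any of the three variables set may not survive a restart or temp cleanup, depending on the platform.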

Lifecycle stages:

  • $mlListModels: lists persisted and loaded models
  • $mlDeleteModel: deletes a model from memory and from the persistent store

Example

// Train
db.training.aggregate([
  {
    $mlTrain: {
      model: "rf_churn_v1",
      type: "randomForest",
      features: ["age", "tenure_months", "support_tickets"],
      target: "churned",
      nEstimators: 200,
      maxDepth: 12,
      minSamplesSplit: 5,
      minSamplesLeaf: 2,
      evaluation: { testRatio: 0.2, kFolds: 5, shuffleSeed: 42 }
    }
  }
])

// Predict
db.scoring.aggregate([
  { $mlPredict: { model: "rf_churn_v1" } },
  { $project: { customer_id: 1, prediction: 1, _prediction: 1 } }
])

// List and delete models
db.any.aggregate([{ $mlListModels: {} }])
db.any.aggregate([{ $mlDeleteModel: { model: "rf_churn_v1" } }])

Input Constraints and Errors

Both stages validate their input strictly. Examples of what is checked:

  • Missing/empty model, type, or features
  • Non-numeric feature values
  • Invalid model-specific parameters
  • Logistic regression requires exactly 2 classes
  • Multinomial Naive Bayes requires non-negative feature values

These are surfaced as explicit stage errors in $mlTrain/$mlPredict.
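
Some of these checks can be approximated on the client before a pipeline is submitted. The sketch below is a hypothetical pre-check, not the server's validation logic: the function names are invented, only canonical type names are handled (aliases are not), dotted feature paths are assumed to address nested fields, and how the server treats missing or null values is not covered.

```javascript
// Resolve a dotted field path such as "profile.age" against a document.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Hypothetical client-side pre-check. Returns a list of problems (empty if none).
function precheckTrainSpec(spec, docs) {
  const problems = [];

  for (const field of ["model", "type"]) {
    if (typeof spec[field] !== "string" || spec[field].length === 0) {
      problems.push(`missing or empty ${field}`);
    }
  }
  const features = Array.isArray(spec.features) ? spec.features : [];
  if (features.length === 0) problems.push("missing or empty features");

  for (const feature of features) {
    const values = docs.map((doc) => getPath(doc, feature));
    if (!values.every((v) => typeof v === "number" && Number.isFinite(v))) {
      problems.push(`non-numeric value in feature ${feature}`);
    } else if (spec.type === "multinomialNaiveBayes" && values.some((v) => v < 0)) {
      problems.push(`negative value in feature ${feature}`);
    }
  }

  if (spec.type === "logisticRegression") {
    if (typeof spec.target !== "string" || spec.target.length === 0) {
      problems.push("missing target");
    } else {
      const classes = new Set(docs.map((doc) => getPath(doc, spec.target)));
      if (classes.size !== 2) {
        problems.push(`logisticRegression needs exactly 2 classes, found ${classes.size}`);
      }
    }
  }
  return problems;
}
```

A pre-check like this only shortens the feedback loop; the stage errors from $mlTrain/$mlPredict remain the authoritative result.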