Training Data Extraction in Axum (Rust)
Training Data Extraction in Axum with Rust — how this specific combination creates or exposes the vulnerability
Training data extraction in an Axum service written in Rust occurs when an application inadvertently exposes datasets, model artifacts, or intermediate data used during model training. This typically happens through debug endpoints, verbose error messages, or overly permissive file serving that allows path traversal into directories containing training corpora, preprocessing scripts, or labeled examples.
In Axum, routing and handler composition can inadvertently expose sensitive training data if response serialization is too permissive. For example, a handler that returns a struct containing raw training samples will serialize every field of that struct if it derives Serialize without controlling which fields are included. Rust’s strong type system reduces some risks, but field visibility does not limit serialization: serde's derive emits all fields, public or private, unless they are explicitly skipped with an attribute such as #[serde(skip_serializing)]. The result is an accidental data leak when handlers return training-related structs directly.
Another vector arises from how Axum integrates with middleware and extractors. If an extractor like State holds a reference to a training dataset in memory, and a debug route exposes that state via an unrestricted endpoint, an attacker can enumerate or dump training data through crafted requests. This is especially relevant when combined with OpenAPI/Swagger introspection that reveals endpoint behavior, giving an attacker insight into how training data flows through the application.
Because middleBrick tests unauthenticated attack surfaces, it can detect endpoints that return training data by analyzing response content for patterns indicative of datasets, such as repeated token sequences or labeled examples. Findings from such scans map to the OWASP API Security Top 10 categories ‘Broken Object Level Authorization’ and ‘Excessive Data Exposure’ (the latter is a 2019 category that the 2023 edition folds into ‘Broken Object Property Level Authorization’), highlighting insecure direct object references or missing authorization on data-rich endpoints. These issues are also relevant to compliance frameworks like PCI-DSS and SOC 2, because exposed training data can reveal sensitive patterns or personally identifiable information embedded in corpora.
Using middleBrick’s LLM/AI Security checks, this unauthenticated probing can additionally detect whether model outputs or debug traces leak training data through generated text, such as memorized strings or code snippets. This is critical for Rust services where training data pipelines might feed into LLM applications, as leaked data can lead to model inversion or membership inference attacks.
Rust-Specific Remediation in Axum — concrete code fixes
To prevent training data exposure in Axum services written in Rust, apply strict serialization controls, endpoint hygiene, and data compartmentalization. The following examples demonstrate secure patterns.
1. Controlled Serialization with Serde
Ensure that any struct exposed through API responses explicitly controls which fields are serialized. Use #[serde(skip_serializing)] for sensitive training metadata.
use axum::Json;
use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize)]
pub struct PublicResponse {
    pub prediction: String,
    // Skipped fields are omitted from serialized output but still exist in memory.
    #[serde(skip_serializing)]
    pub training_sample_id: String,
    #[serde(skip_serializing)]
    pub raw_training_data: Vec<f32>,
}
// Handler that returns only non-sensitive fields in the JSON body.
// The struct is wrapped in Json because a plain struct does not implement IntoResponse.
async fn get_prediction() -> Json<PublicResponse> {
    Json(PublicResponse {
        prediction: "class_a".to_string(),
        training_sample_id: "internal_id_123".to_string(),
        raw_training_data: vec![],
    })
}
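An alternative to skipping fields is to keep the internal record and the response type separate, so sensitive fields cannot be serialized even if an attribute is later removed. The following is a minimal sketch using only the standard library; `InternalPrediction` and `PublicPrediction` are hypothetical names, and in a real Axum handler only `PublicPrediction` would derive `serde::Serialize`.

```rust
// Hypothetical internal record; field names mirror the example above.
pub struct InternalPrediction {
    pub prediction: String,
    pub training_sample_id: String,
    pub raw_training_data: Vec<f32>,
}

// Response type that contains only what a client may see.
// In an Axum service this is the type that would derive `serde::Serialize`.
#[derive(Debug, PartialEq)]
pub struct PublicPrediction {
    pub prediction: String,
}

impl From<InternalPrediction> for PublicPrediction {
    fn from(internal: InternalPrediction) -> Self {
        // Only the prediction crosses the boundary; the sample ID and raw
        // training data are dropped here.
        PublicPrediction {
            prediction: internal.prediction,
        }
    }
}
```

With this split, a handler converts with `internal.into()` before responding, and adding a new sensitive field to the internal type does not change what the API returns.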
2. Isolate Training Data State
Keep training datasets in application state that is not exposed via debug or introspection routes. Use Axum’s State to hold data but avoid creating handlers that dump it.
use axum::{routing::get, Router};
use std::sync::Arc;
struct AppState {
    // Training data kept private, not cloned or exposed
    training_corpus: Arc<Vec<String>>,
}
async fn health_check() -> String {
    "OK".to_string()
}
async fn get_model_output() -> String {
    "prediction".to_string()
}
fn build_router() -> Router {
    let state = Arc::new(AppState {
        training_corpus: Arc::new(vec![]), // loaded securely elsewhere
    });
    Router::new()
        .route("/health", get(health_check))
        .route("/predict", get(get_model_output))
        .with_state(state)
}
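State can also leak through formatting rather than through a route: a derived `Debug` implementation prints every field, so a `{:?}` in a log line or error message would dump the corpus. The sketch below, which assumes the same `AppState` shape as above plus a hypothetical `new` constructor, implements `Debug` by hand so that only a count is printed.

```rust
use std::fmt;
use std::sync::Arc;

pub struct AppState {
    training_corpus: Arc<Vec<String>>,
}

impl AppState {
    // Hypothetical constructor, added for illustration.
    pub fn new(corpus: Vec<String>) -> Self {
        AppState {
            training_corpus: Arc::new(corpus),
        }
    }
}

// Manual `Debug` so that `{:?}` prints the number of samples, not the samples.
impl fmt::Debug for AppState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("AppState")
            .field("training_corpus_len", &self.training_corpus.len())
            .finish()
    }
}
```

The field stays private to the module, so other code can neither read the corpus directly nor print it by accident.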
3. Disable Debug Routes in Production
If using tracing or debug middleware, ensure production builds exclude verbose output that could reveal data paths. Configure logging levels to suppress payload details.
// In production configuration, avoid including debug extractors
// that return full request/response bodies.
// Use axum::extract::State read-only access without clone-on-request.
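One way to keep debug routes out of production is to decide at build time which paths get registered. The sketch below is standard-library only and uses hypothetical route paths (`/debug/state` is an assumption, not a route from the examples above); in an Axum service the same flag would guard the corresponding `.route(...)` calls.

```rust
// Hypothetical route paths used only for illustration.
const PUBLIC_ROUTES: [&str; 2] = ["/health", "/predict"];
const DEBUG_ROUTES: [&str; 1] = ["/debug/state"];

// Returns the paths that should be registered on the router.
pub fn enabled_routes(include_debug: bool) -> Vec<&'static str> {
    let mut routes = PUBLIC_ROUTES.to_vec();
    if include_debug {
        routes.extend_from_slice(&DEBUG_ROUTES);
    }
    routes
}

// `cfg!(debug_assertions)` is false in release builds under Cargo's default
// profiles, so debug routes are left out of `cargo build --release`.
pub fn routes_for_this_build() -> Vec<&'static str> {
    enabled_routes(cfg!(debug_assertions))
}
```

Tying the flag to the build profile rather than to a runtime setting means a misconfigured environment variable cannot re-enable the debug surface in a release binary.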
4. Validate and Restrict File Serving
If serving static files, disable directory listing and restrict paths to prevent traversal into training directories.
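use axum::{extract::Path, http::StatusCode, routing::get, Router};
use std::path::{Component, Path as FsPath};
// Serves one file from /safe/public and rejects anything that is not a plain file name.
// A starts_with check on the joined path is not enough on its own, because
// "/safe/public/../x" still starts with "/safe/public" component-wise.
async fn safe_file_service(Path(name): Path<String>) -> Result<Vec<u8>, StatusCode> {
    // Accept exactly one normal component: no "..", no separators, no absolute paths.
    let mut components = FsPath::new(&name).components();
    match (components.next(), components.next()) {
        (Some(Component::Normal(_)), None) => {}
        _ => return Err(StatusCode::NOT_FOUND),
    }
    let requested_path = FsPath::new("/safe/public").join(&name);
    // Assumes tokio with the "fs" feature enabled.
    tokio::fs::read(requested_path).await.map_err(|_| StatusCode::NOT_FOUND)
}
fn file_router() -> Router {
    // Axum 0.7 path syntax; Axum 0.8 writes this as "/files/{name}".
    Router::new().route("/files/:name", get(safe_file_service))
}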
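A purely lexical prefix check does not stop traversal, because `Path::starts_with` compares components without resolving `..` or symlinks. A stricter option is to canonicalize both the base directory and the requested path before comparing them. The helper below is a minimal standard-library sketch; `resolve_under` is a hypothetical name, and it treats nonexistent files as "not found".

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Resolves `name` under `base` and returns the canonical path only if it
// still lies inside `base` after `..` segments and symlinks are resolved.
pub fn resolve_under(base: &Path, name: &str) -> Option<PathBuf> {
    let base = fs::canonicalize(base).ok()?;
    // `canonicalize` fails for paths that do not exist, which maps to `None`.
    let candidate = fs::canonicalize(base.join(name)).ok()?;
    if candidate.starts_with(&base) {
        Some(candidate)
    } else {
        None
    }
}
```

A file handler would call `resolve_under` with the public directory and the requested name, then read the returned path only when it is `Some`. Because the check runs on resolved paths, it also rejects absolute names and symlinks that point outside the base directory.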