If you’re exploring machine learning in Python, chances are you’ve already come across scikit-learn. This open-source library is one of the most powerful tools for building, evaluating, and refining models. Whether you’re training classifiers, scaling features, or experimenting with hyperparameters — scikit-learn is your go-to Swiss Army knife.
This guide is a quick primer on key features: from fitting your first model to chaining full pipelines — with real, runnable Python code. Let’s dive in.
Estimators: Fit, Predict, Repeat
Every model in scikit-learn starts with an estimator. These are classes like RandomForestClassifier, LinearRegression, or KMeans, and they all follow the same interface:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]
clf.fit(X, y)
print(clf.predict([[4, 5, 6], [14, 15, 16]]))
# Output: [0 1]
Scikit-learn expects:
- X: a 2D array of shape (n_samples, n_features)
- y: a 1D array of target labels (for supervised learning)
Once trained, .predict() works on new data instantly, with no retraining required.
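The same interface carries over to unsupervised estimators such as KMeans, which fit on X alone. A minimal sketch, using toy points made up for illustration:
from sklearn.cluster import KMeans
# Two obvious clusters in toy 2D data (made up for illustration)
X = [[1, 2], [1, 4], [10, 2], [10, 4]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)  # no y needed for unsupervised estimators
print(kmeans.predict([[0, 0], [12, 3]]))  # cluster index for each new point
# Output: two cluster indices, e.g. [1 0]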
🔧 Transformers: Preprocess Like a Pro
Before training, your data often needs cleaning or scaling. This is where transformers shine. They follow the same .fit() / .transform() pattern:
from sklearn.preprocessing import StandardScaler
X = [[0, 15], [1, -10]]
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)
print(X_scaled)
# Output:
# [[-1.  1.]
#  [ 1. -1.]]
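Since fitting and transforming in one go is so common, transformers also expose a fit_transform() shortcut that does both:
X_scaled = scaler.fit_transform(X)  # equivalent to scaler.fit(X).transform(X)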
Want to apply different scalers to different columns? Use ColumnTransformer.
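Here's a minimal sketch, assuming a toy pandas DataFrame with one numeric and one categorical column (the column names and values are made up for illustration):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Toy frame: one numeric column, one categorical column (made up for illustration)
X = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})
# Scale the numeric column, one-hot encode the categorical one
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])
print(ct.fit_transform(X))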
🔗 Pipelines: Keep It Clean and Leak-Free
Machine learning isn’t just about fitting models — it’s about chaining steps together correctly. That’s why pipelines are essential. They prevent data leakage and streamline your workflow.
Here’s a full example using a pipeline to scale features and fit a logistic regression model:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
pipe = make_pipeline(StandardScaler(), LogisticRegression())
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe.fit(X_train, y_train)
print("Accuracy:", accuracy_score(pipe.predict(X_test), y_test))
# Output: ~0.97
Cross-Validation: Don’t Trust a Single Split
Fitting a model is easy. Trusting it? Not so fast.
Scikit-learn offers tools like train_test_split and cross_validate to evaluate models properly. Here's 5-fold cross-validation in action:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y)
print(result['test_score'])  # five R² scores (cross_validate defaults to 5-fold CV)
# Output: [1. 1. 1. 1. 1.]  (perfect scores: make_regression generates noise-free data)
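If you want a different split strategy or metric, cross_validate also accepts cv and scoring arguments. A small sketch, using shuffled 10-fold splits and mean absolute error as an assumed example metric:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
# Shuffled 10-fold splits scored with (negated) mean absolute error
cv = KFold(n_splits=10, shuffle=True, random_state=0)
result = cross_validate(lr, X, y, cv=cv, scoring="neg_mean_absolute_error")
print(result['test_score'])  # ten scores; closer to 0 is better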
Hyperparameter Search: Automate the Tuning
Hyperparameters like max_depth, n_estimators, or alpha can make or break your model.
Scikit-learn lets you automate tuning using RandomizedSearchCV or GridSearchCV. Here's a random search over a small parameter space for RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from scipy.stats import randint
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_distributions = {
    'n_estimators': randint(1, 5),
    'max_depth': randint(5, 10)
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    random_state=0
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test score:", search.score(X_test, y_test))
# Output: e.g., {'max_depth': 9, 'n_estimators': 4}, score ~0.73
Important: Always wrap your models and pre-processing steps in a pipeline during tuning. Otherwise, you risk contaminating your training data with test set info — a cardinal sin in ML.
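A minimal sketch of what that looks like, reusing the California housing data but swapping in a Ridge regressor for speed (my substitution, not part of the example above). Parameters of a pipeline step are addressed as <step_name>__<parameter>:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# The scaler sits inside the pipeline, so each CV fold is scaled
# using only its own training portion: no leakage from held-out data
pipe = make_pipeline(StandardScaler(), Ridge())
# Step names come from make_pipeline (lowercased class names)
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid=param_grid)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test score:", search.score(X_test, y_test))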
Final Thoughts: Where to Go Next
This guide gave you a hands-on tour of:
- Fitting estimators
- Scaling and transforming data
- Building robust pipelines
- Evaluating with cross-validation
- Performing automatic parameter searches
Scikit-learn’s consistency across models and tools is one of its greatest strengths. Once you learn the interface, you can plug in any model, scale, or selector — and it just works.
For next steps:
- Browse scikit-learn’s User Guide
- Explore the API Reference
- Check out the Examples Gallery