Introduction to Machine Learning#


Python is probably the most commonly used programming language for training and running machine learning models.

For shallow machine learning models, sklearn is undoubtedly the most popular library you can use. Sklearn has tutorials (check the great getting started and common pitfalls tutorials!), toy datasets, extensive descriptions of all models and algorithms, and much more. It's incredible, and I highly recommend using this library if you choose to continue working with machine learning.

When it comes to deep learning, Python also offers a lot of options. Personally, I prefer to use PyTorch, but there are other options.

Anyway, we won't bother with deep learning in this course, so let's skip straight to sklearn. As opposed to numpy, I prefer to directly import the classes and functions I use instead of importing the whole module. I don't know why, but it's what people seem to do.

The first thing we will do now is fetch the breast cancer dataset from sklearn. It is a common, easy-to-use dataset that allows us to set up an easy binary classification task. It contains features computed from digitized images of a breast mass and a binary label: malignant (0) or benign (1).

from sklearn.datasets import load_breast_cancer

import pandas as pd
import numpy as np

# imports for plotting
import seaborn as sns
import matplotlib.pyplot as plt

We set as_frame=True to get both X and y as pandas objects. Sklearn estimators work on numpy arrays internally (and also accept dataframes as input), but pandas is a nice way to look at your data in a meaningful way.

X,y = load_breast_cancer(return_X_y=True,as_frame=True)
print(X.shape,y.shape)
(569, 30) (569,)

To make our data more intuitive, we can create a new variable called diagnosis which holds the actual labels that correspond to 0 and 1 in our y variable. We should not use this to train our model, but it makes sense to use this variable for data exploration.

Before that, we should swap the 0s and 1s so that a malignant tumor is the positive (1) class and a benign tumor is the negative (0) class.

y = 1-y
diagnosis = y.map({1:"Malignant",0:"Benign"})
X.head(n=10)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
5 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 0.2087 0.07613 ... 15.47 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440
6 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 0.1794 0.05742 ... 22.88 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368
7 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 0.2196 0.07451 ... 17.06 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510
8 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 0.2350 0.07389 ... 15.49 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720
9 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 0.2030 0.08243 ... 15.09 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750

10 rows × 30 columns

y.head(n=10)
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: target, dtype: int32

Inspecting the shape of our feature matrix X, we see that we have 569 samples and 30 features.

The ground truth label vector y is one-dimensional.

X.shape,y.shape
((569, 30), (569,))

Exercise#

It makes a lot of sense to examine the data you are dealing with. Using the seaborn library, examine the distribution (how many entries there are per class) of the y variable.

Hint:

You want to use sns.histplot here.

sns.histplot(y,discrete=True)

Exercise#

Use the sns.boxplot() function to plot the distribution of a feature of your choice (e.g., a column from X), split by the labels stored in diagnosis.

Hint:

Pass diagnosis to the x argument and the column of your choice to the y argument. Don't forget to be explicit when passing the arguments to seaborn (use keyword arguments).

sns.boxplot(x=diagnosis,y=X["mean radius"])

Exercise#

The next thing to check is our input data. We have 30 features ready to use in our dataframe. It makes a lot of sense to check the correlation between these features. Highly correlated features add redundancy and noise to the data and may contribute to an overfitting model, meaning the model can end up relying on features that are not actually informative.

Typically, we want to train (adapt weights!) our model with features that are highly informative.

Since our feature matrix X is a pandas dataframe, we can get the correlation scores between features by calling df.corr(). Be sure to replace df with your own dataframe (X).

Create a variable called corr and store in that variable the correlation scores. Then, use sns.heatmap() to plot the correlation scores. Do you notice any interesting trends?

Hint:

Try to pass the cmap="coolwarm" argument to the sns.heatmap function.
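One possible solution sketch (figure size and colormap are just a matter of taste):

corr = X.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")   # red = strong positive, blue = strong negative correlation
plt.show()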

You can exclude features before you actually train your model, especially if you suspect that they might hurt the performance of your model. Since we only have 30 features, we can still go on with the full dataset.
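If you did want to remove some redundant columns, a purely illustrative sketch could look like this (the chosen columns are just examples of features that correlate very strongly with mean radius):

# hypothetical example: drop two columns that are highly correlated with "mean radius"
X_reduced = X.drop(columns=["mean perimeter", "mean area"])
print(X_reduced.shape)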

One very, very important thing you want to do before you train your model is to normalize your features! Non-normalized features can heavily influence the performance of your model. You don't want that to happen to you.

Using the df.describe() method, we see that all features differ heavily in their mean, min, max and std value.

X.describe()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

Luckily, sklearn has a solution for this. We can use the StandardScaler class from the preprocessing module. StandardScaler allows us to scale each feature to a mean of 0 and unit variance:

      z = (x - u) / s

where u is the mean and s is the standard deviation of the training samples.
from sklearn.preprocessing import StandardScaler
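As a quick sanity check (using a made-up feature column), the formula above is just the classic z-score per feature:

x = np.array([1.0, 2.0, 3.0, 4.0])
z = (x - x.mean()) / x.std()   # exactly what StandardScaler does for each column
print(z)                       # approximately [-1.34 -0.45  0.45  1.34]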

But now one very important thing before we proceed!

Scaling your data is super important, but we only want to fit the scaler on our training data! If we scale the whole dataset in one go, we contaminate our evaluation by leaking distribution information from the test set into the training procedure.

This will render all our conclusions and our analysis meaningless.
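To make the pitfall concrete, this is the pattern to avoid (a deliberately wrong sketch):

# WRONG: the scaler is fit on ALL samples, so the mean and std of the future
# test samples leak into the preprocessing of the training data
X_all_scaled = StandardScaler().fit_transform(X)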

That's why we first want to split our dataset into a training and a test set.

Of course, sklearn offers us an easy solution: the train_test_split function.

from sklearn.model_selection import train_test_split

There are some key arguments we need to pass here. First, we need to pass our full dataset, so X and y.

Then, we need to define the size of the split using the train_size argument. A good value depends a lot on the amount of data you have available; choosing 70% to 80% for training is a good rule of thumb.

Finally, we can also set the random_state argument to make the split reproducible.

X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,random_state=42)

Exercise#

Inspect the shape of your training and test data to see if your split worked as intended. You want to see fewer samples in your test data and more samples in your training data.

X_train.shape,X_test.shape,y_train.shape,y_test.shape

Now with this out of the way, we can actually scale our data. We want to call Scaler.fit_transform on our training data and Scaler.transform on our test data.

This way, we make sure that the test data is scaled using the statistics of the training distribution, and no information leaks beforehand.

Scaler = StandardScaler()

X_train_scaled = Scaler.fit_transform(X_train)
X_test_scaled = Scaler.transform(X_test)

Exercise#

Using the np.mean() function, examine how the scaled and the unscaled arrays differ.

Hint: You can use the axis argument to either view the mean of the samples (rows) or the features (columns). axis=0 will give you the mean per column (the average of each feature across all samples).

np.mean(X_train,axis=0),np.mean(X_train_scaled,axis=0)

Now that our features are properly scaled, we can turn to our classification model. We already discussed the logistic regression classifier in more detail, but there are of course many more estimators one could use.

For example, there are support vector machines, decision trees or the random forest classifier. We'll start our model training with the logistic regression and then compare different classifiers.

from sklearn.linear_model import LogisticRegression

After loading the model class, we first need to instantiate it. We can pass several arguments to the LogisticRegression class (try LogisticRegression?), but for now we keep it simple and only pass random_state.

model = LogisticRegression(random_state=42)

With our data properly preprocessed and scaled, we are ready to train the model.

Model training in sklearn works by calling the model.fit() method.

The fit() method finds the proper model coefficients (weights). We need to pass both X_train_scaled and y_train.

model.fit(X_train_scaled,y_train)
LogisticRegression(random_state=42)

Now the first thing we want to do is check the achieved training accuracy. We do that with the model.score() method, to which we need to pass both the X and the y variables.

model.score(X_train_scaled,y_train)
0.9868131868131869

Exercise#

Calculate the test accuracy of the trained model instance.

model.score(X_test_scaled,y_test)

While accuracy is certainly one metric to check after fitting a classifier, there is more to look at. We can inspect our results per class using a classification report. The classification report returns not only the accuracy, but also the precision, recall and F1-score for each class.

from sklearn.metrics import classification_report,confusion_matrix

To use the classification report, we need to pass both y_true and y_pred. We already have y_true, but we have not explicitly calculated y_pred yet. You can get the predicted values using model.predict(X).

Exercise#

Get the predicted y labels from your test data and name it y_pred_test. Apply the classification_report function and print the report afterwards.

y_pred_test = model.predict(X_test_scaled)

print(classification_report(y_test,y_pred_test))

Explanation of the metrics:

Precision = Of all the times we predicted positive, how many were actually positive? High precision means fewer false alarms (false positives); relates more to quality.

Recall = Of all the actual positives, how many did we correctly predict? Relates more to quantity.

F1-score = Harmonic mean of precision and recall.
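In terms of the confusion matrix entries (TP = true positives, FP = false positives, FN = false negatives), these metrics are:

      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      f1        = 2 * precision * recall / (precision + recall)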

We can further visualize the performance of our classifier using a confusion matrix, which we display using seaborn.

Exercise#

Insert your true and predicted y values into the confusion_matrix(y_true, y_pred) call in the code snippet below and run the cell. How did our classifier do?

#initiate the confusion matrix
cm = confusion_matrix(y_test, y_pred_test)

# Define class names and store it in a list
class_names = ['Benign', 'Malignant']

# Plot the confusion matrix using sns.heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, 
            yticklabels=class_names)

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Now, while that is pretty interesting, we can also investigate further how the training actually influenced and changed the weights of the model.

Remember that the weights were adapted using gradient descent. How would you interpret these coefficients?

Hint: The model equation for the logistic regression was

      z = X(i) * w + b
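Also remember that z is passed through the sigmoid to obtain a probability for the positive class:

      p = 1 / (1 + exp(-z))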

Try to answer the following questions:

  1. What does a negative and a positive coefficient mean?

  2. Do you have an idea why there is an imbalance between positive and negative coefficients?

feature_names = X.columns  
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model.coef_[0]
})
print(coef_df.sort_values(by='Coefficient', ascending=False)[:10])
                    Feature  Coefficient
15        compactness error     0.685972
19  fractal dimension error     0.613405
5          mean compactness     0.542106
18           symmetry error     0.499980
8             mean symmetry     0.235713
11            texture error     0.188640
16          concavity error     0.180815
9    mean fractal dimension     0.076701
25        worst compactness     0.005207
4           mean smoothness    -0.066754
sns.barplot(data=coef_df,x="Feature",y="Coefficient")
plt.xticks(rotation=90);
[Figure: bar plot of the logistic regression coefficients per feature]

Cross Validation#

Now, sometimes you may end up working with datasets that have limited data available. Since sufficient data is crucial for machine learning models, splitting your data into a train, test and validation set might really impair the predictive abilities of your model.

That is a situation where you want to use cross validation. There are multiple ways to go about cross validation; the most common one is the K-Fold split.

Here, the whole dataset is split into k different folds. In each round, the model is fit on k-1 of the folds and evaluated on the remaining one. We repeat this procedure for every fold and report the average performance across all k splits.

[Figure: schematic of K-Fold cross validation]
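To make the mechanics concrete, here is a minimal, purely illustrative sketch of a K-Fold loop written out by hand; the convenience functions introduced below do exactly this for us, including the per-fold scaling:

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
    fold_scaler = StandardScaler()
    X_tr = fold_scaler.fit_transform(X_tr)   # fit the scaler on this fold's training part only
    X_te = fold_scaler.transform(X_te)
    fold_model = LogisticRegression(random_state=42)
    fold_model.fit(X_tr, y_tr)
    fold_scores.append(fold_model.score(X_te, y_te))
print(np.mean(fold_scores))                  # average accuracy across the k folds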

Ideally, you still want to use a hold-out test set to evaluate your final model, but you save yourself a separate validation split.

Also, cross validation can help guard against overfitting to one arbitrary data split, because you train and evaluate your model on multiple different splits.

This gives us a more robust and reliable estimate of model performance.

We import the cross_val_score function from sklearn.model_selection. We don't actually need to scale our data by hand now, because this happens implicitly when we add the Scaler as a step of a Pipeline.

We also want to use something called Stratified K-Fold. This is especially helpful when dealing with imbalanced data: Stratified K-Fold makes sure that each of the K splits contains approximately the same proportion of each label (target class).

from sklearn.model_selection import cross_val_score,StratifiedKFold
from sklearn.pipeline import Pipeline 

After importing the modules, we instantiate a Pipeline object. We pass both the Scaler and the Classifier we want to evaluate. Then, we instantiate the stratified K-Fold; we define K with the n_splits argument.

Finally, we simply call cross_val_score.

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),        
    ('classifier', LogisticRegression())  
])

# Set up Stratified K-Fold 
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# call cross validation
cross_val_score(pipeline, X, y, cv=stratified_kfold)
array([0.96842105, 0.96842105, 0.98941799])
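To actually report the single average number described above, you can take the mean (and standard deviation) of the returned scores:

scores = cross_val_score(pipeline, X, y, cv=stratified_kfold)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")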

Exercise#

Try to run the cross validation with a varying number of splits. Does the output change?

Try to run the cross validation without the scaler in the Pipeline object. Does the behavior differ?

# Create the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),        
    ('classifier', LogisticRegression())  
])

# Set up Stratified K-Fold 
stratified_kfold = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)

# call cross validation
cross_val_score(pipeline, X, y, cv=stratified_kfold)
# Create the pipeline
pipeline = Pipeline([       
    ('classifier', LogisticRegression())  
])

# Set up Stratified K-Fold 
stratified_kfold = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)

# call cross validation
cross_val_score(pipeline, X, y, cv=stratified_kfold)

Beyond Classification: Regression#

So far, we’ve seen how logistic regression can be used for classification—deciding which category an observation belongs to, like predicting whether someone will click a button or not. Now, we’re shifting gears from classification to prediction in a continuous sense.

In psychology and many real-world applications, we often want to predict a continuous outcome—like how anxious someone will feel in a situation, or how long they’ll take to respond to a stimulus—based on traits or behaviors. This is called regression.

The idea is: given a set of known features (like personality traits, age, or behavior), can we predict an individual's score on some outcome? This lets us model individual differences in a meaningful way.

Here, we will use the diabetes dataset from sklearn.

This dataset contains data from 442 patients, and includes predictors like age, BMI (body mass index), blood pressure, and several blood test results. The goal is to predict a continuous outcome: a disease progression score one year after baseline.

from sklearn.datasets import load_diabetes

X_diabetes,y_diabetes = load_diabetes(return_X_y=True,as_frame=True)
diabetes_df = X_diabetes.copy()  # copy so that adding the target column does not modify X_diabetes itself
diabetes_df['target'] = y_diabetes
diabetes_df
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641 135.0
... ... ... ... ... ... ... ... ... ... ... ...
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018118 0.044485 104.0
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491 132.0
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044528 -0.025930 220.0
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 -0.039493 -0.004220 0.003064 57.0

442 rows × 11 columns

A low target score corresponds to low disease progression, whereas a high target score corresponds to high disease progression.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error

# Initialize model and LOOCV
model = LinearRegression()
loo = LeaveOneOut()

In LOOCV, for each iteration, the model is trained on all but one data point (i.e., “leaving out” one sample) and then tested on that left-out data point.

In this dataframe we have data from many individual patients. We want to see how well our model can generalize across individuals.

# Store predictions
y_pred = []
y_true = []
X_diabetes = X_diabetes.to_numpy()
y_diabetes = y_diabetes.to_numpy()
# Perform LOOCV
for train_index, test_index in loo.split(X_diabetes):
    X_train, X_test = X_diabetes[train_index], X_diabetes[test_index]
    y_train, y_test = y_diabetes[train_index], y_diabetes[test_index]

    model.fit(X_train, y_train)
    y_pred.append(model.predict(X_test)[0])
    y_true.append(y_test[0])

# Get Loss
mse = mean_squared_error(y_true, y_pred)
print(f"LOOCV Mean Squared Error: {mse:.2f}")

plt.figure(figsize=(6, 6))
plt.scatter(y_true, y_pred, alpha=0.5)
plt.plot([min(y_true), max(y_true)], [min(y_true), max(y_true)], 'r--')
plt.xlabel("True Target")
plt.ylabel("Predicted Target")
plt.title("LOOCV Predictions")
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: scatter plot of the LOOCV predictions against the true target values]