Using a support vector machine to predict breast cancer malignancy

Apr 12, 2025

Breast cancer is a disease where abnormal cells in the breast grow uncontrollably, forming a tumor. It often begins in the milk ducts or lobules of the breast. If untreated, these cancerous cells can spread to nearby tissues or other parts of the body, a process known as metastasis.

It’s the most common cancer in women globally, though men can also develop it. Risk factors include age, genetics (like BRCA mutations), hormonal changes, obesity, and lifestyle factors such as alcohol consumption.

Symptoms might include a lump in the breast, changes in breast shape or size, nipple discharge, or skin changes like dimpling. Early detection through screenings like mammograms is crucial for effective treatment.

Treatment options vary depending on the type and stage of cancer but often include surgery, radiation, chemotherapy, hormone therapy, or targeted drugs like capivasertib, which has shown promise in slowing the progression of advanced breast cancer.

Breast cancer is the fourth most common cause of cancer death in the UK, accounting for 7% of all cancer deaths. In females, breast cancer is the second most common cause of cancer death, accounting for 15% of all female cancer deaths. Overall, the five-year survival rate for breast cancer is 91.2% and the ten-year survival rate is 84%.

Since breast cancer is such a prevalent disease in modern society, I thought it would be a good idea to take a breast cancer dataset from the Kaggle data science website and develop an algorithm to make predictions on whether a breast tumour is malignant or benign.

After creating a Jupyter Notebook and saving it in my Kaggle account, I imported the libraries that I would need to execute the program:-

  • pandas to create dataframes and process the data
  • NumPy to create arrays and perform mathematical computations
  • os to interact with the computer's operating system
  • scikit-learn to provide machine learning functionality
  • Matplotlib and seaborn to visualise the data
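
The import block below is a minimal sketch of those libraries; the aliases are conventional rather than copied from the notebook:-

```python
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```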

I used pandas to make sure that all of the columns of the dataframe created later in the script would be displayed:-
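
In a notebook this is typically done with a pandas display option; a minimal sketch:-

```python
# Show every column when a dataframe is printed, instead of truncating the output
pd.set_option('display.max_columns', None)
```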

I used pandas to read the CSV file containing the input data and create a dataframe, df:-
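
A sketch of that step; the file path is an assumption based on the usual layout of Kaggle input datasets:-

```python
# The exact path depends on the dataset attached to the notebook
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
df.head()
```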

I dropped the id column from the dataframe. It is just a record identifier, so it carries no predictive information, and pandas automatically indexes every row of a dataframe in any case:-
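
A sketch, with the caveat that some copies of this CSV also ship an empty trailing column, which is worth removing at the same time:-

```python
# The id column is just a record identifier, so it adds no predictive value
df = df.drop(columns=['id'])

# Some versions of the CSV include an empty trailing column; drop any all-NaN columns
df = df.dropna(axis=1, how='all')
```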

I used matplotlib to create a bar chart of the target. As can be seen from the image below, there are more benign tumours than malignant ones:-
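
One way to draw that chart; in this dataset the target column is diagnosis, with B (benign) and M (malignant) labels:-

```python
# Count benign (B) and malignant (M) tumours and plot the counts as bars
df['diagnosis'].value_counts().plot(kind='bar')
plt.title('Tumour diagnosis counts')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.show()
```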

I used seaborn to create a heatmap of the dataframe and noted there are quite a few highly correlated features:-
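
A minimal sketch of the heatmap; the figure size is an arbitrary choice:-

```python
# Correlation matrix of the numeric features, drawn as a heatmap
plt.figure(figsize=(18, 14))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')
plt.show()
```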

I dropped the features that have more than 90% correlation with another feature, in an attempt to reduce the number of features in the dataframe:-
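
A common way to do this is to scan the upper triangle of the correlation matrix and drop one feature from each highly correlated pair; a sketch, not necessarily the exact code used:-

```python
# Absolute correlations between the numeric features
corr = df.corr(numeric_only=True).abs()

# Keep the upper triangle only, so each pair of features is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
```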

I defined the dependent and independent variables. The dependent variable is represented by y, while the independent variables are represented by X:-
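
A sketch; mapping the M/B labels to integers is my own convention rather than necessarily what the notebook does:-

```python
# Map the labels to integers: malignant = 1, benign = 0
y = df['diagnosis'].map({'M': 1, 'B': 0})

# Everything except the target is a feature
X = df.drop(columns=['diagnosis'])
```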

I used sklearn’s StandardScaler to scale the data, because it is easier for the model to make predictions when the features are on a similar scale:-
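
A minimal sketch of the scaling step:-

```python
from sklearn.preprocessing import StandardScaler

# Standardise each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

Strictly speaking, fitting the scaler on the full dataset before splitting leaks a little information from the validation set into training; fitting it on the training split alone is the safer pattern.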

I used sklearn’s train_test_split method to split the X and y variables into training and validation sets:-
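
A sketch; the split proportion and random seed are assumptions:-

```python
from sklearn.model_selection import train_test_split

# Hold back 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```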

I then chose the model that I would use in this instance: a support vector classifier.

A Support Vector Classifier (SVC) is a machine learning algorithm used for classification tasks. It is based on Support Vector Machines (SVMs), which aim to find a hyperplane that best separates the classes in a dataset. In scikit-learn, the SVC class provides an easy-to-use implementation.

Key Concepts of SVC are:-

1. Hyperplane: The goal is to find a hyperplane (line in 2D, plane in 3D, etc.) that best separates the classes. In cases where the data is not linearly separable, SVC uses kernels to map the data to a higher-dimensional space.

2. Support Vectors: Support vectors are the data points that are closest to the hyperplane. These points are crucial because they define the hyperplane’s position and orientation.

3. Margin: The margin is the distance between the hyperplane and the nearest data points (support vectors) from each class. SVC maximizes this margin to ensure better generalization.

Parameters of SVC are:-

  • kernel: Specifies the type of kernel function to use (e.g., 'linear', 'rbf', 'poly').
  • C: A regularization parameter. Higher values make the model fit the data more strictly but might overfit.
  • gamma: Kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels. Controls the influence of individual points.

Advantages of SVC are:-

  • Effective in high-dimensional spaces.
  • Works well with non-linear boundaries using kernels.
  • Robust to overfitting in some cases, especially with proper regularization.
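
With those concepts in mind, fitting the classifier is brief; the parameter values below are scikit-learn's defaults rather than values confirmed from the notebook:-

```python
from sklearn.svm import SVC

# Fit a support vector classifier on the training split
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
```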

I made predictions on the validation set:-
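
A one-line sketch of that step:-

```python
# Predict the diagnosis for the held-out validation rows
y_pred = model.predict(X_val)
```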

I achieved an accuracy of 96.5% on the validation set, which is not too bad:-
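
A sketch of the scoring step:-

```python
from sklearn.metrics import accuracy_score

# Fraction of validation predictions that match the true labels
print(accuracy_score(y_val, y_pred))
```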

I could always try a different model, such as a tree-based model, in an attempt to improve the score.

I have created a code review to accompany this blog post and it can be viewed here:- https://youtu.be/Z2L-HiwDnRU

Written by Crystal X

I have over five decades of experience in the world of work, spanning fast food, the military, business, non-profits, and the healthcare sector.
