
Assignment Task

Introduction

In this assessment, you will experiment with a classifier using a standard dataset. You will analyze the data and evaluate the results obtained by the classifier. Based on your understanding of the method and data, you will measure performance and propose steps for improvement.

Purpose

This assessment will provide you with the opportunity to: frame classification methods, introduced in previous courses, in the context of machine learning; demonstrate the importance of good practice for data handling and cleaning; practise use of the Python tool suite for machine learning tasks; and demonstrate informed analysis of performance and results.

Tasks

This assessment is divided into three tasks:

1. Prepare and analyze the dataset

2. Train a regularised stochastic gradient descent classifier

3. Evaluate the classifier.

Directions

In this assessment, you will train a classifier to detect breast cancer from a set of characteristics of the cell nuclei in an image of a fine needle aspirate of a breast mass. The Breast Cancer Wisconsin dataset is available from the collection of example datasets in scikit-learn. You will practise using pipelines and hyperparameter optimisation to train a classification model, evaluate the classifier in a white-box fashion, and become familiar with handling multi-dimensional data.

Prepare and analyze the dataset

Description

1. Load the Breast Cancer Wisconsin dataset from the scikit-learn sample data collection.

2. Summarise the dataset with pandas functions info() and describe().

3. Briefly describe in your own words what each of the 10 parameters of the cell nuclei means, using the documentation of the dataset in scikit-learn, the paper of the original collectors of the dataset, and the example image in Figure .

4. How are the classes Malignant (cancer) and Benign (healthy) encoded in the dataset? What is the default assumption for label coding of sklearn.metrics.precision_score? How will you handle the label coding?

5. Plot histograms of all the 30 features, using in each diagram two distributions, one for each class. Use three figures with 10 subplots each.

6. Plot receiver-operating-characteristic (ROC) curves of the individual features in three figures, one figure for each of the groups of 10. The ROC curve of one feature can be calculated by moving a class decision threshold value across the value range of the feature.

7. Which of the parameters seem promising based on the histograms and ROC curves? Justify your choice, referring to the particular features in the figures that indicate a good separation. Choose your top five candidate features.

8. Split the dataset into a training set (80%) and a test set (20%) using train_test_split in sklearn. The test data will not be used until the final evaluation.
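
The loading, label-coding, single-feature ROC and splitting steps above can be sketched as follows. This is a minimal illustration rather than a model answer. It assumes that recoding malignant as the positive class (1) is the chosen way to handle the label coding, that larger feature values point towards malignancy (reasonable for "mean radius", not for every feature), and that a stratified split with a fixed random_state is wanted. The helper single_feature_roc and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset as a pandas DataFrame (features) and Series (target).
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Summarise the dataset.
X.info()
print(X.describe())

# Show how the classes are encoded: index in target_names -> class name.
print(dict(enumerate(data.target_names)))

# Assumption: treat malignant as the positive class by recoding it to 1.
# Metrics such as precision_score use pos_label=1 by default.
y_malignant = (y == 0).astype(int)


def single_feature_roc(values, labels):
    """ROC points obtained by sweeping a threshold over one feature.

    A sample is predicted positive when its value is >= the threshold.
    (Hypothetical helper, written for illustration only.)
    """
    thresholds = np.sort(np.unique(values))[::-1]  # high to low
    positives = labels == 1
    negatives = ~positives
    tpr = [np.mean(values[positives] >= t) for t in thresholds]
    fpr = [np.mean(values[negatives] >= t) for t in thresholds]
    # Prepend the (0, 0) point: a threshold above every value.
    return np.array([0.0] + fpr), np.array([0.0] + tpr)


fpr, tpr = single_feature_roc(
    X["mean radius"].to_numpy(), y_malignant.to_numpy()
)
# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(f"AUC for 'mean radius': {auc:.3f}")

# 80/20 split; stratify and random_state are assumptions made here
# to keep the class ratio and make the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_malignant, test_size=0.2, stratify=y_malignant, random_state=0
)
print(X_train.shape, X_test.shape)
```

For a feature where smaller values indicate malignancy, the curve from this helper would fall below the diagonal; flipping the comparison (or negating the feature) gives the equivalent curve above it.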

Create a stochastic gradient descent classifier

1. Explain in your own words what the parameter class_weight does. Set it to ‘balanced’.

2. Explain in your own words what the parameter learning_rate does and how the learning rate will be adjusted using the selection ‘optimal’. Set it to ‘optimal’.

3. Build a pipeline (sklearn.pipeline.Pipeline) that includes a standard scaler object (sklearn.preprocessing.StandardScaler) and the classifier object. If you have not come across pipelines before, they are explained in Géron Ch2, “Transformation Pipelines”, as well as in the sklearn documentation. Explain what the standard scaler does and what would happen if you did not scale the data with it.
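
A minimal sketch of the three steps above is given below, under some assumptions: the default label coding of the dataset is kept for brevity, the split parameters stratify and random_state are choices made here, and the step names "scaler" and "sgd" are arbitrary. With class_weight set to ‘balanced’, scikit-learn weights each class by n_samples / (n_classes * count of that class), which the code checks against compute_class_weight. With learning_rate set to ‘optimal’, the scikit-learn documentation gives the step size as eta = 1.0 / (alpha * (t + t0)), where t0 is chosen by a heuristic, so the step shrinks as training proceeds.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# What 'balanced' means: weights inversely proportional to class frequency.
weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_train), y=y_train
)
expected = len(y_train) / (2 * np.bincount(y_train))
print("balanced class weights:", weights)

# Scaler and classifier chained so that scaling is fitted on training data only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDClassifier(
        class_weight="balanced",
        learning_rate="optimal",
        random_state=0,  # assumption: fixed seed for reproducibility
    )),
])
pipeline.fit(X_train, y_train)

# After scaling, each training feature has mean ~0 and standard deviation ~1.
scaled = pipeline.named_steps["scaler"].transform(X_train)
print("feature means:", scaled.mean(axis=0).round(3)[:3])
print("feature stds: ", scaled.std(axis=0).round(3)[:3])

# Training accuracy only; the test set is held back for the final evaluation.
train_accuracy = pipeline.score(X_train, y_train)
print(f"training accuracy: {train_accuracy:.3f}")
```

Putting the scaler inside the pipeline means that any later cross-validation or hyperparameter search refits it on each training fold, so no statistics from held-out data leak into the model.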
