Built-in Scikit-Learn datasets for machine learning

by Alex

The Scikit-Learn library ships with clean datasets that you can use when building machine learning models. They come with Scikit-Learn, so you don’t need to download anything: with just a few lines of code the data is ready to go.

Ready-made datasets are a huge advantage because you can start building models right away, without spending time collecting, cleaning, and transforming data – something data scientists spend a lot of their time on. Even with all that preparatory work done for you, the Scikit-Learn samples may seem a little confusing at first. Don’t worry: in a few minutes you’ll know exactly how to use these datasets, and you’ll be on your way to exploring the world of artificial intelligence.

This article assumes that you have Python, scikit-learn, pandas, and Jupyter Notebook installed (or you can use Google Colab). Let’s get started.

Introduction to Scikit-Learn datasets

Scikit-Learn provides seven datasets, which its documentation calls “toy datasets.” Don’t be fooled by the word “toy”: these datasets are reasonably large and serve as a good starting point for learning machine learning (hereafter ML). Here are some examples of the available datasets and what they can be used for:

  • Boston Housing Prices – use ML to predict housing prices based on attributes such as the number of rooms and the local crime rate.
  • Breast Cancer Diagnosis Dataset (Wisconsin) – use ML to diagnose cancer as benign (not spreading to the rest of the body) or malignant (spreading).
  • Wine recognition – use ML to identify the type of wine by chemical properties.

In this article, we will work with the “Breast Cancer Wisconsin” dataset. We’ll import the data and figure out how to read it. As a bonus, we will build a simple machine learning model that classifies tumor scans as malignant or benign. To learn more about the available samples, see the toy datasets section of the Scikit-Learn documentation.

How do I import the datasets module?

The available datasets can be found in sklearn.datasets. Let’s import the necessary data. First, we add the datasets module, which contains all seven samples.

from sklearn import datasets

Each dataset has a corresponding function that loads it. These functions all follow the same format: “load_DATASET()”, where DATASET is the name of the dataset. To load the breast cancer dataset, we use load_breast_cancer(); likewise, for wine recognition we would call load_wine(). Let’s load the selected data and store it in the variable data.

data = datasets.load_breast_cancer()

Up to this point, we haven’t encountered any problems. But the load functions mentioned above (such as load_breast_cancer()) don’t return the data in the tabular format we’ve come to expect. Instead, they hand us a Bunch object.

Don’t know what a Bunch is? No worries: think of the Bunch object as Scikit-Learn’s fancy version of a Python dictionary. A quick refresher: a dictionary is a data structure in which data is stored as key–value pairs. Think of it like the similarly named book – you look up a word of interest (the key) and get its definition (the value). Keys and their corresponding values can be almost anything (words, numbers, and so on). For example, to store personal contacts, the keys would be names and the values phone numbers. So a dictionary in Python is not limited to its everyday meaning – it can be applied to whatever you like.
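To make the comparison concrete, a Bunch can be read both like a dictionary and through attribute access – a minimal sketch:

```python
from sklearn import datasets

# Load the dataset; the return value is a Bunch object
data = datasets.load_breast_cancer()

# A Bunch behaves like a dictionary, but also allows attribute access
print(data["target_names"])   # dictionary-style lookup
print(data.target_names)      # attribute-style, same value
print(data.data.shape)        # 569 samples, 30 features each
```

Both access styles return exactly the same objects, so you can use whichever reads better in your code.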

What’s in our Bunch Dictionary?

The Bunch dictionary provided by Sklearn is quite a powerful tool. Let’s find out what keys are available to us.


print(data.keys())

We get the following keys:

  • data is the data needed for the prediction (measurements from the scan, such as radius, area, and others), as a NumPy array.
  • target is the target data (the variable you want to predict – in this case, whether the tumor is malignant or benign), as a NumPy array.

The values of these two keys provide the data we need for training. The other keys (see below) serve an explanatory purpose. It is important to note that all datasets in Scikit-Learn are divided into data and target: data holds the features, the variables the model learns from, while target holds the actual class labels. In our case, target is a single column that labels each tumor as either 0 (malignant) or 1 (benign).

  • feature_names are the names of the features, in other words the column names in data.
  • target_names are the names of the target classes, in other words what each label in target means.
  • DESCR is short for “description” and contains a text description of the dataset.
  • filename is the path to the CSV file containing the data.
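To see how target and target_names fit together, note that the integer labels in target index into target_names – a small sketch:

```python
from sklearn import datasets

data = datasets.load_breast_cancer()
# The labels in data.target are 0 or 1; target_names gives their meaning
print(data.target_names)                  # ['malignant' 'benign']
# Translate the first sample's label into its class name
print(data.target_names[data.target[0]])  # malignant
```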

To see the value of a key, you can type data.KEYNAME, where KEYNAME is the key of interest. So, if we want to see a description of the dataset:


print(data.DESCR)

Here is a small part of the output (the full version is too long to include in the article):

.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
...

You can also find this information in the Scikit-Learn documentation, which is more readable and better organized.

Working with a dataset

Now that we understand what the load function returns, let’s see how we can use the dataset in our machine learning model. First of all, if you want to explore the data, pandas is the tool for the job. Like this:


# import pandas
import pandas as pd
# Build a DataFrame from the feature data and column names
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add a "target" column and fill it with the target data
df['target'] = data.target
# Look at the first five rows
df.head()

   mean radius  mean texture  mean perimeter  mean area  ...  worst symmetry  worst fractal dimension  target
0        17.99         10.38          122.80     1001.0  ...          0.4601                  0.11890       0
1        20.57         17.77          132.90     1326.0  ...          0.2750                  0.08902       0
2        19.69         21.25          130.00     1203.0  ...          0.3613                  0.08758       0
3        11.42         20.38           77.58      386.1  ...          0.6638                  0.17300       0
4        20.29         14.34          135.10     1297.0  ...          0.2364                  0.07678       0

(output truncated; the full DataFrame has 31 columns)

You have loaded the training data into a pandas DataFrame, which is now ready to explore and use. To really see what this dataset contains, run:
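With the DataFrame in hand, a quick sanity check (not in the original article) is to count how many samples fall into each class:

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
# Count samples per class: 1 = benign, 0 = malignant
print(df["target"].value_counts())  # 357 benign, 212 malignant
```

The classes are somewhat imbalanced, which is worth keeping in mind when judging accuracy later.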

df.info()

RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 target 569 non-null int32
dtypes: float64(30), int32(1)
memory usage: 135.7 KB

A few things to watch out for:

  • There is no missing data, all columns contain 569 values. This saves us from having to consider missing values.
  • All data types are numeric. This is important because Scikit-Learn models do not accept categorical variables as-is. In the real world, when we receive such variables, we encode them as numbers. Scikit-Learn datasets contain no categorical values.

In other words, Scikit-Learn has already done the data cleaning for us. These datasets are extremely handy, and you will have fun learning machine learning with them.
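The first point above can be verified directly: summing the missing-value counts over all columns should give zero (a quick check, using the same DataFrame built earlier):

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Total number of missing values across every column
print(df.isnull().sum().sum())  # 0
```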

Learning on datasets from sklearn.datasets

Finally, the fun part. Next, we’ll build a model that classifies tumors as malignant or benign. This will show you how to use the data in your own models. We will build a simple K-nearest neighbors model. First, let’s split the sample in two: one part to train the model – giving it data to learn from – and one part to test it, to see how well the model handles data (scan results) it hasn’t seen before.


X = data.data
y = data.target
# we split the data with Scikit-Learn's train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

This gives us two datasets, one for training and one for testing. Let’s start training the model.
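If you want the split to be reproducible, train_test_split accepts a random_state seed, and stratify keeps the malignant/benign ratio the same in both parts – optional settings not used above, shown here as a sketch:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
# random_state makes the split reproducible; stratify preserves the
# class balance in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42, stratify=data.target)
# By default 25% of the samples go to the test set
print(X_train.shape, X_test.shape)  # (426, 30) (143, 30)
```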


from sklearn.neighbors import KNeighborsClassifier
# "knn" is a clearer name for a K-nearest neighbors model
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

Did you get about 0.92? That means the model is roughly 92% accurate! (Your exact number may differ, because train_test_split shuffles the data randomly.) In just a few minutes, you created a model that classifies tumor scans with over 90% accuracy. Of course, in the real world things are more complicated, but it’s a good start. You will learn a lot by trying to build models using datasets from Scikit-Learn. Have fun learning artificial intelligence!
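Beyond a single accuracy number, a confusion matrix shows which kinds of mistakes the model makes – a sketch, where random_state=42 is an assumption added for reproducibility:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
# Rows are true classes (0 = malignant, 1 = benign), columns are predictions
print(confusion_matrix(y_test, knn.predict(X_test)))
```

For a medical task like this one, the off-diagonal cell in row 0 (malignant tumors predicted as benign) is the costliest kind of error, which is why looking past raw accuracy matters.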
