Introduction to Classification

Classification Algorithms in Machine Learning

  • Decision Trees
  • Naive Bayes
  • Linear Discriminant Analysis
  • k-Nearest Neighbors
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines (SVM)

K-Nearest Neighbours

  • Multi-class classifier: A classifier that can predict a field with multiple discrete values.
  • KNN: K-Nearest Neighbors, a method that classifies a case based on its similarity to other cases. It relies on the idea that cases with the same class label lie near each other. It can also be used to estimate values for a continuous target.

Procedure

  1. Pick a value for K.
  2. Calculate the distance from the unknown case to every case in the training data, for example the Euclidean distance.
  3. Select the K observations in the training data that are "nearest" to the unknown data point.
  4. Predict the response of the unknown data point using the most popular response value from the K-nearest neighbors.
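
The four steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`euclidean`, `knn_predict`) and the toy data are made up for the example, and Euclidean distance is just one possible choice.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the unknown case to every training case
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k closest observations
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: most common label among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated groups
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

# Step 1: pick K, then classify a new point
prediction = knn_predict(train_X, train_y, query=(1.1, 1.0), k=3)
print(prediction)
```

An odd K is a common choice for two-class problems because it avoids tied votes.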

Which K

1-NN

16a575a36a4df6a7a6f0492c136080da.png

5-NN

f0e3953dd4222e22564e629f90650cc3.png

Based on Evaluation

Evaluate the model on held-out test data for several values of K, then pick the K that gives the best accuracy. dad30fb4767ed1f46e9b007206aca619.png c856acc65beed589493b407144f490b0.png 7f94fe31d19b38dad103b859a2687fa5.png
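
One way to run this search is sketched below. It assumes a simple train/test split and synthetic two-cluster data generated with a fixed seed. The helper names and the candidate range of K (1 to 10) are arbitrary choices for the example.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Synthetic data (an assumption for illustration): two noisy, overlapping clusters
rng = random.Random(0)
points = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
points += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(40)]
rng.shuffle(points)

# Hold out the last 20 points for testing
train, test = points[:60], points[60:]
train_X, train_y = [p for p, _ in train], [c for _, c in train]
test_X, test_y = [p for p, _ in test], [c for _, c in test]

# Accuracy for each candidate K, then the K with the highest score
scores = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in range(1, 11)}
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```

A very small K tends to follow noise in the training data, while a very large K smooths the decision boundary too much, which is why the accuracy is checked on data the model has not seen.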

Used for Regression

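KNN can also be used for regression. The "distance" between cases is computed over all of their attributes, and the prediction is the average (or median) of the target values of the K nearest neighbors. 0c22cc73a54987049d2c6e6e5ff280be.png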
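
A minimal sketch of KNN regression, assuming the prediction is the plain average of the neighbors' target values. The function name `knn_regress` and the data are hypothetical.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Estimate a continuous target as the mean target of the k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Made-up one-feature data: the last point is far away and should not influence the estimate
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

estimate = knn_regress(train_X, train_y, query=(2.1,), k=3)
print(estimate)
```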

Evaluation – Classification Accuracy

Jaccard index

Also known as the Jaccard similarity coefficient or score (intersection over union).

  • y: Actual labels
  • \hat{y}: Predicted labels

J(y, \hat{y}) = \dfrac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \dfrac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}
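
The formula can be evaluated directly on two label lists. The sketch below follows the formula as written in these notes and counts a position as part of the intersection when the predicted label equals the actual one. The function name and the labels are made up. Library implementations such as scikit-learn's `jaccard_score` compute the index per class from the confusion matrix, so their result may differ.

```python
def jaccard_index(y_true, y_pred):
    """Jaccard index |y ∩ ŷ| / (|y| + |ŷ| - |y ∩ ŷ|) for two label lists of equal length."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Intersection: positions where the prediction agrees with the actual label
    intersection = sum(t == p for t, p in zip(y_true, y_pred))
    union = len(y_true) + len(y_pred) - intersection
    return intersection / union

# Made-up labels: 8 of the 10 predictions are correct
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

score = jaccard_index(y_true, y_pred)
print(round(score, 2))  # 8 / (10 + 10 - 8)
```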

f05597befe47a6eccaec5ede834bead3.png

Example

2cf65cb1fe4613dd80d066ba5239d147.png

F1-score

Confusion Matrix

ce8cda2dd09b77338e5a2be28bce7363.png

  • Precision = \dfrac{TP}{TP + FP}
  • Recall = \dfrac{TP}{TP + FN}
  • F1-score = 2 \times \dfrac{Precision \times Recall}{Precision + Recall}
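
As a small worked example, the three quantities can be computed from raw labels by counting TP, FP and FN. The function name, the `positive` parameter and the labels are assumptions made for the illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: TP = 3, FP = 1, FN = 1
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

The F1-score is the harmonic mean of precision and recall, so it is high only when both are high.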

50d156ed73b59107283e38a1a94e54b4.png

Log loss

Performance of a classifier where the predicted output is a probability value between 0 and 1.

LogLoss = -\dfrac{1}{n}\sum\left(y \times \log(\hat{y}) + (1-y) \times \log(1-\hat{y})\right)
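
A short sketch of the formula, assuming the natural logarithm and binary labels in {0, 1}. The clipping constant `eps` is an added safeguard against log(0) and is not part of the formula above.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Made-up predictions for the same three labels
confident = log_loss([1, 0, 1], [0.9, 0.1, 0.8])  # probabilities close to the labels
unsure = log_loss([1, 0, 1], [0.6, 0.4, 0.5])     # probabilities close to 0.5
print(confident, unsure)
```

Lower is better: predictions that put high probability on the correct class give a smaller log loss.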

c418ca5751b866fcbe590ea339e6fe08.png

Decision Trees

6d22aeeaf09fbe5c1d53f247227cab9f.png

  • Each internal node corresponds to a test
  • Each branch corresponds to a result of the test
  • Each leaf node assigns a classification

Building Procedure

  1. Choose an attribute from the dataset
  2. Calculate the significance of the attribute in splitting the data (entropy of the data, then information gain)
  3. Split the data based on the value of the best attribute
  4. Go to step 1 and repeat for each branch

Find the best attribute

Bad attribute

756af69345dd09c67c8b90cbfeaa1a5c.png

Better attribute

  • More Predictiveness
  • Less Impurity
  • Lower Entropy d01c52a3786941264a58e2033158154e.png

Entropy

Measure of randomness or uncertainty

Entropy = -p(A)\log_2(p(A)) - p(B)\log_2(p(B))

If a node is totally homogeneous, its entropy is 0. If it is split half and half between two classes, its entropy is 1.

  • The lower the entropy, the less uniform the distribution, the purer the node
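
The two boundary cases can be checked numerically. The sketch below assumes base-2 logarithms (which is what makes the half-and-half case equal 1) and takes class counts as input. The function name and the counts are illustrative.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a node, given the number of samples in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # a class with no samples contributes nothing
            p = c / total
            result -= p * math.log2(p)
    return result

print(entropy([14, 0]))  # totally homogeneous node
print(entropy([7, 7]))   # half and half
print(entropy([9, 5]))   # somewhere in between
```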

5b8ebe658b03dd449ed51c2f7df2d4ee.png

Gain(s, Cholesterol) = 0.940 - [(8/14)0.811 + (6/14)1.0] = 0.048

418f9d45c87be9b4bc124c4797765182.png

Gain(s, Sex) = 0.940 - [(7/14)0.985 + (7/14)0.592] = 0.151

The Sex attribute yields the higher information gain, so choose Sex as the splitting attribute

Information Gain

Information gain is the increase in certainty (the reduction in entropy) obtained by splitting on an attribute

Information \space Gain = (Entropy \space before \space split) - (Weighted \space entropy \space after \space split)
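
The worked numbers earlier in this section can be reproduced with a few lines of Python. The class counts below are assumptions, chosen so that they give the entropies quoted above (0.940 before the split, 0.985 and 0.592 for the Sex branches, 0.811 and 1.0 for the other split). The function names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) from per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy before the split minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

parent = [9, 5]  # assumed class counts for the 14 samples before splitting

# Split on Sex: two branches of 7 samples each (assumed counts)
gain_sex = information_gain(parent, [[3, 4], [6, 1]])
# The other split: branches of 8 and 6 samples (assumed counts)
gain_other = information_gain(parent, [[6, 2], [3, 3]])

print(round(gain_sex, 3), round(gain_other, 3))
```

The results match the hand calculation up to rounding, and the split with the larger gain is the one to choose.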

48de4905914ef0851665154fb02f4e6b.png 897e790f612af62785dd317a538bf2fa.png

Python Programming

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

Get the data

$ wget -O drug200.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv

Show the first 5 lines

my_data = pd.read_csv("drug200.csv", delimiter=",")
my_data[0:5]

The data size

my_data.size    # total number of elements (rows times columns); my_data.shape gives the dimensions

Preprocess the data

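# Feature matrix: the five predictor columns as a NumPy array
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]

# Encode the categorical columns as integers, since sklearn decision trees need numeric input
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X[0:5]

# Target vector (assumes the label column in drug200.csv is named 'Drug')
y = my_data['Drug']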

Setting up the decision tree

Split the dataset

from sklearn.model_selection import train_test_split
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

Modeling

drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)

drugTree.fit(X_trainset, y_trainset)

Prediction

predTree = drugTree.predict(X_testset)

To make an intuitive comparison

print(predTree[0:5])
print(y_testset[0:5])

Evaluation

from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTree's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

To calculate the accuracy without sklearn

le_Drug = preprocessing.LabelEncoder()
le_Drug.fit(['drugA', 'drugB', 'drugC', 'drugX', 'drugY'])
testDrug = le_Drug.transform(y_testset.values)
predDrug = le_Drug.transform(predTree)
# Accuracy is the fraction of test cases whose predicted label matches the actual label
np.mean(testDrug == predDrug)

Logistic Regression

Logistic regression is a classification algorithm for categorical target variables. It is a good fit when:

  • The target is binary (multi-class is also supported)
  • A probabilistic decision is required
  • A linear decision boundary is needed
  • You want to understand the impact of a feature

Logistic Function

Also called the sigmoid function: \sigma(\theta^TX) = \dfrac{1}{1 + e^{-\theta^TX}}
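
A small sketch of the function. The branch on the sign of z is an added numerical safeguard, not part of the definition, and `predict_proba` with its parameters is a hypothetical helper showing how \theta^TX feeds into it.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Probability of the positive class for one case: sigmoid of the dot product theta . x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))    # the midpoint of the curve
print(sigmoid(10))   # approaches 1 for large positive inputs
print(sigmoid(-10))  # approaches 0 for large negative inputs
```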

7ca9811abd39d4c48d9bef79dec29647.png

Training Process

  1. Initialize \theta
  2. Calculate \hat{y} = \sigma (\theta^TX) for a customer
  3. Compare the output with the actual one, and record the error
  4. Calculate the errors for all customers
  5. Change the \theta to reduce the cost.
  6. Go back to step 2.

Cost Function

Complex Version

Cost(\hat{y}, y) = \dfrac{1}{2}(\sigma(\theta^TX) - y)^2

J(\theta) = \dfrac{1}{m}\sum_{i=1}^m Cost(\hat{y}, y)

Simplified

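J(\theta) = -\dfrac{1}{m}\sum_{i=1}^m \left(y \times \log(\hat{y}) + (1-y) \times \log(1-\hat{y})\right)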

Minimize the Cost function

The gradient is a vector that points in the direction of steepest increase of the cost, so gradient descent steps in the opposite direction. fa01533c262a0a7cbf53cdaa477f7028.png

  1. Initialize the parameters randomly
  2. Feed the training set to the cost function and calculate the error
  3. Calculate the gradient of the cost function
  4. Update the weights with new values
  5. Go to step 2 until the cost is small enough
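
The loop above can be sketched as follows, with several assumptions: the cost is the log loss defined earlier in these notes, the parameters start at zero instead of random values (for reproducibility), the learning rate and iteration count are arbitrary, and the loop runs a fixed number of steps rather than testing whether the cost is small enough. All names and data are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Mean log loss of the model over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(y)

def gradient(theta, X, y):
    """Gradient of the mean log loss with respect to each parameter."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += error * v / len(y)
    return grad

# Toy data: the first feature is a constant 1 (intercept); the classes overlap slightly
X = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0), (1.0, 5.0), (1.0, 6.0)]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]    # step 1 (zeros here instead of random values)
learning_rate = 0.1   # assumed step size
initial_cost = cost(theta, X, y)

for _ in range(5000):
    grad = gradient(theta, X, y)                                  # steps 2 and 3
    theta = [t - learning_rate * g for t, g in zip(theta, grad)]  # step 4

final_cost = cost(theta, X, y)
print(initial_cost, final_cost, theta)
```

Each update moves the parameters against the gradient, which is why the cost after training is lower than the starting cost.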

SVM – Support Vector Machine

  1. Map the data to a high-dimensional feature space
  2. Find a separator

Kernelling – The transformation

Kernelling transforms the data into a higher-dimensional space. Common kernel functions to try:

  • Linear
  • Polynomial
  • RBF(Radial basis function)
  • Sigmoid
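
For reference, the four kernels can be written as plain functions of two vectors. The parameter names (`gamma`, `coef0`, `degree`) follow the convention used by scikit-learn's `SVC`. The default values chosen here are arbitrary assumptions for the example.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # Depends only on the squared distance between the two points
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)

x, z = (1.0, 2.0), (2.0, 0.0)
print(linear_kernel(x, z))
print(polynomial_kernel(x, z))
print(rbf_kernel(x, z))
print(sigmoid_kernel(x, z))
```

A kernel returns the similarity of two points as if they had already been mapped to the higher-dimensional space, so the mapping itself never has to be computed explicitly.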

Find the hyperplane

05a83e8331fbdc49ce6b9da455dff3f3.png

Pros and cons

  • Advantages:

    • Accurate in high-dimensional spaces
    • Memory efficient
  • Disadvantages

    • Prone to over-fitting
    • No probability estimation

Applications

  • Image Recognition
  • Text category assignment
  • Detecting spam
  • Sentiment analysis
  • Gene expression classification
  • Regression, outlier detection and clustering

Python Programming

Dependencies

import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt

Load the Cancer data

$ wget -O cell_samples.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv

Load Data from CSV File

cell_df = pd.read_csv("cell_samples.csv")
cell_df.head(10)    # to have a look at the data

923094afadf208420839c4e2d99148aa.png

To have an intuitive look,

ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()

db1c8f90d846527ba25a045fc2909df9.png

Data Pre-processing and selection

Have a look at column data types

cell_df.dtypes

701a1bd84dfb34c304c0a3af01f5c78e.png

Convert non-numerical values to numerical

cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes
  • errors='coerce' turns values that cannot be parsed as numbers into NaN, so notnull() then filters those rows out

6be9279493f7d73ba23f38825253c6ea.png

Transform the table to array

feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]

Before: 16945b0a3fdf2effcee103f337b0ee52.png After: 702a4cc11711232549ac81eff1814f34.png

Transform the value of Class

cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
y[0:10]

ed8f725f8f6e562a7184fc12a5e9bb97.png

Split into Train/Test dataset

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
  • random_state seeds the pseudo-random generator, so the same split is produced on every run 32ab7fc7297218d1987274f4a9637b1a.png

Modeling(SVM with Scikit-learn)

Fit

from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
  • clf: classifier
  • SVC: Support Vector Classification

Predict new

yhat = clf.predict(X_test)
yhat[0:5]

563af095d2e134e5b6d69bf73c236670.png

Evaluation

from sklearn.metrics import classification_report, confusion_matrix
import itertools

Confusion Matrix

To plot the confusion matrix 3628f8747c17bbb4f188470097161e5d.png

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)

print(classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
# (plot_confusion_matrix is a user-defined plotting helper; it must be defined before this call)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize= False, title='Confusion matrix')

f6ff585d97b0b6ecad08711d6303b804.png

f1_score

from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')

d9c5632e74f3a2c87982fcfb6cf4fec6.png

Jaccard index for accuracy

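from sklearn.metrics import jaccard_score
# jaccard_similarity_score is no longer available in recent scikit-learn releases; jaccard_score replaces it
# and its value can differ. Labels are 2 (benign) and 4 (malignant), so the positive label is set explicitly.
jaccard_score(y_test, yhat, pos_label=2)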

25927e5f9df73b2f711bf4e5a6bb8003.png

Practice with Linear Kernel

ec05735235c53df6ad4e64f4098ad092.png 680d58345d3695b732ba1e7e88e26e1f.png