# Intro to ML & Regression – IBM at Coursera

These are the notes I took while studying Weeks 1 & 2 of the course Machine Learning with Python on Coursera.

## AI, Machine Learning, Deep Learning

• ML is the statistics part of AI
• Deep Learning is a special part of ML

## Major machine learning techniques

• Regression / Estimation: Predicting continuous values
• Classification: Predicting the item class / category of a case
• Clustering: Finding the structure of data; summarization
• Associations: Associating frequent co-occurring items/events
• Anomaly detection: Discovering abnormal and unusual cases
• Sequence mining: Predicting next events; click-stream (Markov Model, HMM)
• Dimension Reduction: Reducing the size of data (PCA)
• Recommendation systems: Recommending items

Python libraries for machine learning:

• NumPy
• SciPy
• Matplotlib
• pandas
• scikit-learn

## Supervised vs Unsupervised

### Supervised Learning

Deals with labeled data

• regression
• classification

### Unsupervised learning

Finds patterns and groupings in unlabeled data

#### Techniques

• Dimension Reduction
• Density Estimation
• clustering
• Discovering structure
• summarization
• Anomaly detection

## Regression

The process of predicting a continuous value

• X: Independent variable
• Y: Dependent variable, continuous

Regression can be

• linear
• non-linear

### Simple Regression

One feature to predict another

### Multiple Regression

Many features to predict one

### Linear Regression

#### Simple Linear Regression

$\hat{y} = \theta_0 + \theta_1 x_1$

• $\hat{y}$: response variable, predicted value
• $x_1$: a single predictor
• $\theta_0$: intercept
• $\theta_1$: slope, gradient, coefficient

Residual value: the error, $y - \hat{y}$

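##### Calculation of $\theta$

With $\overline{x}$ and $\overline{y}$ the sample means, least squares gives

$\theta_1 = \dfrac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2}$

$\theta_0 = \overline{y} - \theta_1 \overline{x}$

#### Multiple Linear Regression

$\hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2+...+\theta_nx_n$

$\hat{y} = \theta^TX$

$\theta^T=[\theta_0, \theta_1, \theta_2, ..., \theta_n]$

$X = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_n \end{bmatrix}^T$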

##### Estimating $\theta$

• Ordinary Least Squares: solved with linear algebra operations; takes a long time for large datasets (10K+ rows)
• An optimization algorithm, e.g. Newton’s Method
OLS: Ordinary Least Squares

First, use a scatter plot to check whether the relationship looks linear.
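As a minimal sketch of the Ordinary Least Squares route, the snippet below estimates $\theta$ with NumPy's `np.linalg.lstsq`. The toy data, the variable names and the "true" coefficients are made up for illustration and are not from the course. Note that here `X` is a design matrix with one row per sample (plus a leading column of ones), whereas the formula $\hat{y} = \theta^TX$ treats $X$ as the feature vector of a single sample.

```python
import numpy as np

# Hypothetical toy data: two features generated from known coefficients,
# without noise, so the recovered theta can be checked directly.
rng = np.random.default_rng(0)
features = rng.uniform(0, 10, size=(50, 2))
true_theta = np.array([4.0, 2.5, -1.0])        # [theta_0, theta_1, theta_2]
y = true_theta[0] + features @ true_theta[1:]

# Prepend a column of ones so that theta_0 acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: find theta minimizing ||X @ theta - y||^2.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ theta          # predictions for every row
residuals = y - y_hat      # y - y_hat, as defined for the residual value
print("theta:", theta)
```

Because the data is noise-free, the estimated `theta` should match `true_theta` up to floating-point error; with real data the residuals would not vanish.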

### Polynomial Regression

$\hat{y} = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3$

Substituting

• $x_1 = x$
• $x_2 = x^2$
• $x_3 = x^3$

gives $\hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3$

Polynomial Regression $\rightarrow$ Multiple Linear Regression $\rightarrow$ Least Squares

• Least Squares: Minimizing the sum of the squares of the differences between $y$ and $\hat{y}$
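The reduction from polynomial regression to multiple linear regression can be sketched as follows: build the columns $x$, $x^2$, $x^3$ as separate features and solve an ordinary least-squares problem. The cubic used to generate the data is a made-up example, and `np.linalg.lstsq` is just one convenient solver, not necessarily what the course uses.

```python
import numpy as np

# Hypothetical noise-free cubic, so the fitted coefficients are easy to verify.
x = np.linspace(-2, 2, 30)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3

# Treat x, x^2 and x^3 as three separate features x_1, x_2, x_3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Least squares on the expanded features is plain multiple linear regression.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("theta_0..theta_3:", theta)
```

The model is non-linear in $x$ but still linear in the parameters $\theta$, which is why the linear solver applies.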

### Non-linear Regression

#### Examples

• Exponential
• Logarithmic
• Sigmoidal/Logistic: $Y = a + \dfrac{b}{1 + c^{X-d}}$

#### Fit Process

##### Plotting the Dataset

##### Choosing a model

A sigmoid might fit:

$\hat{Y} = \dfrac{1}{1 + e^{-\beta_1(X - \beta_2)}}$

• $\beta_1$: controls the curve’s steepness
• $\beta_2$: slides the curve along the x-axis

##### Building the Model

Construct the model function

    import numpy as np

    def sigmoid(x, Beta_1, Beta_2):
        # logistic curve: Beta_1 sets the steepness, Beta_2 shifts it along x
        y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
        return y


Visualize and compare with initial values

    import matplotlib.pyplot as plt

    # x_data (years) and y_data (GDP) are assumed to be arrays loaded earlier
    beta_1 = 0.10
    beta_2 = 1990.0

    # logistic function
    Y_pred = sigmoid(x_data, beta_1, beta_2)

    # plot the initial prediction against the data points
    plt.plot(x_data, Y_pred * 15000000000000.)
    plt.plot(x_data, y_data, 'ro')

Then normalize x and y before finding the parameters

    xdata = x_data / max(x_data)
    ydata = y_data / max(y_data)


curve_fit uses non-linear least squares to fit our sigmoid function

    from scipy.optimize import curve_fit

    popt, pcov = curve_fit(sigmoid, xdata, ydata)
    # print the final parameters
    print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))


Then plot to see if that model works well

    x = np.linspace(1960, 2015, 55)
    x = x / max(x)
    plt.figure(figsize=(8, 5))
    y = sigmoid(x, *popt)
    plt.plot(xdata, ydata, 'ro', label='data')
    plt.plot(x, y, linewidth=3.0, label='fit')
    plt.legend(loc='best')
    plt.ylabel('GDP')
    plt.xlabel('Year')
    plt.show()

##### Evaluation

We can verify the accuracy of our model by using model evaluation.

    from sklearn.metrics import r2_score

    # split data into train/test (roughly 80% / 20%)
    msk = np.random.rand(len(xdata)) < 0.8
    train_x = xdata[msk]
    test_x = xdata[~msk]
    train_y = ydata[msk]
    test_y = ydata[~msk]

    # build the model using the train set
    popt, pcov = curve_fit(sigmoid, train_x, train_y)

    # predict using the test set
    y_hat = sigmoid(test_x, *popt)

    print("Mean absolute error: %.4f" % np.mean(np.absolute(y_hat - test_y)))
    print("Mean squared error (MSE): %.6f" % np.mean((y_hat - test_y) ** 2))
    # r2_score expects (y_true, y_pred)
    print("R2-score: %.2f" % r2_score(test_y, y_hat))

## Model evaluation approaches

### Training and Test on the Same Dataset

$Error = \dfrac{1}{n} \sum_{j=1}^n |y_j - \hat{y}_j|$

• High training accuracy
• Low out-of-sample accuracy

#### Training Accuracy

• High accuracy is not always good: the model may overfit, capturing noise and producing a non-generalized model

#### Out-of-Sample Accuracy

The accuracy of predictions on unknown data, i.e. data the model was not trained on
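To make the gap between training accuracy and out-of-sample accuracy concrete, here is a small sketch on synthetic data (a noisy straight line that I made up for illustration). A degree-7 polynomial through 8 training points fits them almost perfectly, yet does worse on held-out points than its near-zero training error suggests. `np.polyfit` and `np.polyval` are used only for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(x):
    # Hypothetical ground truth: a straight line plus Gaussian noise.
    return 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.shape)

x_train = np.linspace(0.0, 1.0, 8)
x_test = x_train[:-1] + 0.5 * (x_train[1] - x_train[0])  # midpoints, unseen in training
y_train = make_data(x_train)
y_test = make_data(x_test)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

results = {}
for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mae(y_train, np.polyval(coeffs, x_train))
    test_err = mae(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train MAE = {train_err:.4f}, test MAE = {test_err:.4f}")
```

The degree-7 model interpolates the training points (training error close to zero), which is exactly the "capturing noise" situation described above.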

### Train/Test Split

• mutually exclusive
• More accurate evaluation on out-of-sample accuracy
• Highly dependent on which data end up in the training and test sets
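A rough sketch of a train/test split with mutually exclusive index sets is shown below. The helper name `train_test_split_indices` and its parameters are hypothetical (they are not a library API); it simply shuffles the indices and cuts them at a fixed fraction, which gives exact set sizes unlike a random boolean mask.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Hypothetical helper: return disjoint (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = train_test_split_indices(10, test_fraction=0.2)
print("train:", train_idx)
print("test: ", test_idx)
```

Changing `seed` changes which rows land in each set, which is one way to see how much the evaluation depends on the particular split.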

### K-fold cross-validation

Split the data into k folds. Each fold is used once as the test set while the remaining k-1 folds form the training set. The overall accuracy is the average over the k folds.

### Regression Evaluation Metrics

• MAE: mean absolute error
• MSE: mean squared error
• RMSE: root mean squared error
• RAE: Relative Absolute Error
• RSE: Relative Squared Error

#### MAE

$MAE = \dfrac{1}{n}\sum_{j=1}^n|y_j - \hat{y}_j|$

#### MSE

$MSE = \dfrac{1}{n}\sum_{j=1}^n(y_j - \hat{y}_j)^2$

#### RMSE

$RMSE = \sqrt{\dfrac{1}{n}\sum_{j=1}^n(y_j - \hat{y}_j)^2}$

#### RAE

$RAE = \dfrac{\sum_{j=1}^n|y_j - \hat{y}_j|}{\sum_{j=1}^n|y_j - \overline{y}|}$

#### RSE

$RSE = \dfrac{\sum_{j=1}^n(y_j - \hat{y}_j)^2}{\sum_{j=1}^n(y_j - \overline{y})^2}$

$R^2 = 1 - RSE$
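The metrics above can be computed directly with NumPy. This is a small sketch, with a hypothetical helper name `regression_metrics` and made-up sample values, that follows the formulas term by term.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Hypothetical helper: compute MAE, MSE, RMSE, RAE, RSE and R^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    mse = np.mean(err ** 2)
    # RAE and RSE normalize by the error of always predicting the mean of y.
    rae = np.sum(np.abs(err)) / np.sum(np.abs(y - y.mean()))
    rse = np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "RAE": rae,
        "RSE": rse,
        "R2": 1 - rse,
    }

# Made-up actual and predicted values for illustration.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these values the errors are 0.5, 0, -0.5, 0, so MAE = 0.25, MSE = 0.125, RAE = 0.125, RSE = 0.025 and $R^2$ = 0.975.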