Applied ML Assignment 1#
(This notebook, along with its PDF version, can be found in this GitHub Repo)#
Heart Attack Analysis#
Introduction#
A heart attack occurs when an artery supplying the heart with blood and oxygen becomes blocked, for example by a blood clot. This heart attack analysis helps to understand the chance of an attack occurring in a person based on varied health conditions.
Dataset#
The dataset is Heart_Attack_Analysis_Data.csv. It has been uploaded to elearn.
This dataset contains data about a few hundred patients, with the following attributes:
- Age
- Sex
- Exercise Induced Angina (1 = YES, 0 = NO)
- CP_Type (Chest Pain) (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
- ECG Results
- Blood Pressure
- Cholesterol
- Blood Sugar
- Family History (Number of persons affected in the family)
- Maximum Heart Rate
- Target (0 = LESS CHANCE , 1 = MORE CHANCE)
Aim#
- Building a Predictive Model using the Naïve Bayesian Approach (Which features decide heart attack?)
- Comment on the performance of this model using AUC-ROC, Precision, Recall, F_score, Accuracy
You need to
1. Preprocess the data to enhance quality
2. Carry out descriptive summarization of data and make observations
3. Identify relevant and irrelevant attributes for building the model.
4. Use data visualization tools and make observations
5. Carry out the chosen analytic task. Show results including intermediate results, as needed
6. Evaluate the solution
Following are some points for you to take note of, while doing the assignment in Jupyter Notebook:
- State all your assumptions clearly
- List all intermediate steps and learnings
- Mention your observations/findings
Submission Plan#
The following will be done in this notebook:
1. Verify the datatypes of the values given in the dataset and validate with the information given in the document.
2. Check for invalid values based on domain knowledge by checking if values are present in humanly possible ranges.
3. Figure out which columns are numeric and categorical based on the unique values each column has and based on information given in the assignment document.
4. Check for trends among numerical features and among categorical features to see if feature reduction can be done (via pairplots etc)
5. Check which numerical attributes are relevant and which are irrelevant and drop irrelevant ones.
6. Scale numerical attributes with a standard scaler.
7. Train a Gaussian Naive Bayes (GNB) MODEL 1, fit on data where only the numerical features are scaled.
8. Check for outliers via boxplots and remove them with the IQR method.
9. Train a GNB MODEL 2, fit on data where the numerical features are scaled and outliers are removed using the IQR method.
10. One-hot encode the categorical features.
11. Train a GNB MODEL 3, fit on data where the numerical features are scaled, outliers are removed using the IQR method, and the categorical features are one-hot encoded.
We will then compare the Accuracy, Precision, Recall, F-Score and AUC-ROC of the three models trained.
My Submission#
Importing necessary packages#
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sb
import warnings
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, precision_recall_fscore_support, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings('ignore')
Viewing Data#
Check the shape of the dataframe loaded into memory from the CSV file and see the datatypes used in the dataset given
df = pd.read_csv("./Heart_Attack_Analysis_Data.csv")
print("Dataframe Shape: {}".format(df.shape))
print("----------------------------------\n")
print("With following data types:\n")
df.info()
print("----------------------------------\n")
print("First 5 rows of Dataframe:")
df.head()
Dataframe Shape: (303, 11)
----------------------------------
With following data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 303 non-null int64
1 Sex 303 non-null int64
2 CP_Type 303 non-null int64
3 BloodPressure 303 non-null int64
4 Cholestrol 303 non-null int64
5 BloodSugar 303 non-null int64
6 ECG 303 non-null int64
7 MaxHeartRate 303 non-null int64
8 ExerciseAngina 303 non-null int64
9 FamilyHistory 303 non-null int64
10 Target 303 non-null int64
dtypes: int64(11)
memory usage: 26.2 KB
----------------------------------
First 5 rows of Dataframe:
| | Age | Sex | CP_Type | BloodPressure | Cholestrol | BloodSugar | ECG | MaxHeartRate | ExerciseAngina | FamilyHistory | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 1 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 0 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 1 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0 | 1 |
print("Stats on Dataframe:")
df.describe()
Stats on Dataframe:
| | Age | Sex | CP_Type | BloodPressure | Cholestrol | BloodSugar | ECG | MaxHeartRate | ExerciseAngina | FamilyHistory | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.204620 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.525860 | 22.905161 | 0.469794 | 1.096825 | 0.498835 |
| min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 47.500000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.500000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 55.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 0.000000 | 1.000000 | 153.000000 | 0.000000 | 1.000000 | 1.000000 |
| 75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 274.500000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 2.000000 | 1.000000 |
| max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 5.000000 | 1.000000 |
All columns are of integer type and every column has a value in all 303 rows (303 non-null), so there are no missing values.
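As a quick verification (an extra check, not one of the original cells), the per-column null counts can be printed; all of them should be zero for this dataset:
# Sanity check: count missing values per column (expected to be all zeros here)
print("Missing values per column:")
print(df.isnull().sum())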
Data Preprocessing#
Checking certain columns against humanly possible ranges based on domain knowledge#
The following are the assumed ranges for these columns:
1. 0 < Age <= 100 years
2. 90 <= BloodPressure <= 200 (mmHg)
3. 60 <= MaxHeartRate <= 220 (beats per minute)
print("\nMinimum Age = {}".format(df["Age"].min()))
print("Maximum Age = {}".format(df["Age"].max()))
print("\nMinimum Blood Pressure = {}".format(df["BloodPressure"].min()))
print("Maximum Blood Pressure = {}".format(df["BloodPressure"].max()))
print("\nMinimum Heart Rate = {}".format(df["MaxHeartRate"].min()))
print("Maximum Heart Rate = {}".format(df["MaxHeartRate"].max()))
Minimum Age = 29
Maximum Age = 77
Minimum Blood Pressure = 94
Maximum Blood Pressure = 200
Minimum Heart Rate = 71
Maximum Heart Rate = 202
All of the mentioned columns have values within acceptable ranges.
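The same check can also be scripted. The sketch below is illustrative only; the `valid_ranges` dictionary simply restates the assumed ranges listed above:
# Assumed humanly possible ranges (domain-knowledge assumptions from above)
valid_ranges = {
    "Age": (1, 100),            # Age assumed to be at least 1 year
    "BloodPressure": (90, 200),
    "MaxHeartRate": (60, 220),
}
for col, (low, high) in valid_ranges.items():
    # Count how many rows fall outside the assumed range
    n_invalid = ((df[col] < low) | (df[col] > high)).sum()
    print("{}: {} value(s) outside [{}, {}]".format(col, n_invalid, low, high))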
Checking number of unique values for each column#
Column Name -> Unique value count
for column in list(df.columns):
    print("{} -> {}".format(column, df[column].value_counts().shape[0]))
Age -> 41
Sex -> 2
CP_Type -> 4
BloodPressure -> 49
Cholestrol -> 152
BloodSugar -> 2
ECG -> 3
MaxHeartRate -> 91
ExerciseAngina -> 2
FamilyHistory -> 6
Target -> 2
Taking columns that have a maximum of 4 unique values (And based on information given in assignment document) as categorical and the rest as numeric:
category_list = ["Sex", "CP_Type", "BloodSugar", "ECG", "ExerciseAngina"]
numeric_list = ["Age", "BloodPressure", "Cholestrol", "MaxHeartRate", "FamilyHistory"]
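The same grouping can be derived programmatically from the unique-value counts; the threshold of 4 below is just the rule stated above, and `Target` is excluded because it is the label rather than a feature:
# Columns with at most 4 unique values are treated as categorical, the rest as numeric
unique_counts = df.drop(columns=["Target"]).nunique()
auto_categorical = unique_counts[unique_counts <= 4].index.tolist()
auto_numeric = unique_counts[unique_counts > 4].index.tolist()
print("Categorical:", auto_categorical)
print("Numeric:", auto_numeric)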
Checking for trends in numeric features through Pair Plots#
df_number = df.loc[:, numeric_list]
df_number["Target"] = df["Target"]
sb.pairplot(df_number, hue = "Target", palette="husl")
plt.show()
The pair plots do not show any strong trends that could be used to reduce the numerical features. There is a slight relation between Age and MaxHeartRate, but the points are scattered enough that the two are not treated as redundant.
Checking the frequency of each categorical feature with respect to the Target column to see how well it is balanced#
df_category = df.loc[:, category_list]
df_category["Target"] = df["Target"]
for i in category_list:
    plt.figure()
    sb.countplot(x = i, data = df_category, hue = "Target", palette="husl")
    plt.title(i)
Here we see that there are very few rows with ECG value = 2 and, similarly, very few rows with BloodSugar value = 1.#
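To quantify this imbalance (an additional check, not one of the original cells), the sparse categories can be cross-tabulated against the target:
# Row counts of each ECG and BloodSugar value per Target class
print(pd.crosstab(df["ECG"], df["Target"]))
print(pd.crosstab(df["BloodSugar"], df["Target"]))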
Checking relevant numerical features#
To do this we shall use the f_oneway function from scipy.stats. This function performs a one-way ANOVA (Analysis of Variance) to test the null hypothesis that two groups of data have the same population mean.
A feature is considered relevant only if the sample of its values for one target category is statistically significantly different from the sample of its values for the other target category.
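The individual tests below could equally be run in one loop; this sketch assumes the `df` and `numeric_list` defined above and uses the usual 0.05 significance threshold:
# One-way ANOVA p-value for each numeric feature, grouped by Target class
for col in numeric_list:
    result = stats.f_oneway(df[col][df["Target"] == 0],
                            df[col][df["Target"] == 1])
    verdict = "relevant" if result.pvalue < 0.05 else "irrelevant"
    print("{}: p-value = {:.4g} -> {}".format(col, result.pvalue, verdict))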
Checking relevance of Age#
result = stats.f_oneway(df["Age"][df["Target"] == 0],
df["Age"][df["Target"] == 1])
result.pvalue
7.524801303442268e-05
The p-value is < 0.05. This shows that the means of the two distributions (Age for the lower-chance group vs. Age for the higher-chance group) are significantly different statistically, hence Age is relevant.
Checking relevance of BloodPressure#
result = stats.f_oneway(df["BloodPressure"][df["Target"] == 0],
df["BloodPressure"][df["Target"] == 1])
result.pvalue
0.011546059200233376
The p-value is < 0.05. This shows that the means of the two distributions (BloodPressure for the lower-chance group vs. BloodPressure for the higher-chance group) are significantly different statistically, hence BloodPressure is relevant.
Checking relevance of Cholestrol#
result = stats.f_oneway(df["Cholestrol"][df["Target"] == 0],
df["Cholestrol"][df["Target"] == 1])
result.pvalue
0.1387903269560108
The p-value is > 0.05. This shows that the means of the two distributions (Cholestrol for the lower-chance group vs. Cholestrol for the higher-chance group) are not significantly different statistically, hence Cholestrol is irrelevant.
Checking relevance of MaxHeartRate#
result = stats.f_oneway(df["MaxHeartRate"][df["Target"] == 0],
df["MaxHeartRate"][df["Target"] == 1])
result.pvalue
1.6973376386560805e-14
The p-value is < 0.05. This shows that the means of the two distributions (MaxHeartRate for the lower-chance group vs. MaxHeartRate for the higher-chance group) are significantly different statistically, hence MaxHeartRate is relevant.
Checking relevance of FamilyHistory#
result = stats.f_oneway(df["FamilyHistory"][df["Target"] == 0],
df["FamilyHistory"][df["Target"] == 1])
result.pvalue
0.6172651404419242
The p-value is > 0.05. This shows that the means of the two distributions (FamilyHistory for the lower-chance group vs. FamilyHistory for the higher-chance group) are not significantly different statistically, hence FamilyHistory is irrelevant.
Dropping the irrelevant features#
df.drop(["Cholestrol"], axis = 1, inplace= True)
df.drop(["FamilyHistory"], axis = 1, inplace= True)
numeric_list.remove("Cholestrol")
numeric_list.remove("FamilyHistory")
df.head()
| | Age | Sex | CP_Type | BloodPressure | BloodSugar | ECG | MaxHeartRate | ExerciseAngina | Target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 1 | 0 | 150 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 0 | 1 | 187 | 0 | 1 |
| 2 | 41 | 0 | 1 | 130 | 0 | 0 | 172 | 0 | 1 |
| 3 | 56 | 1 | 1 | 120 | 0 | 1 | 178 | 0 | 1 |
| 4 | 57 | 0 | 0 | 120 | 0 | 1 | 163 | 1 | 1 |
Scaling the numeric attributes in the dataframe with a standard scaler#
scaler = StandardScaler()
df[numeric_list] = scaler.fit_transform(df[numeric_list])
df.head()
| | Age | Sex | CP_Type | BloodPressure | BloodSugar | ECG | MaxHeartRate | ExerciseAngina | Target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 1 | 3 | 0.763956 | 1 | 0 | 0.015443 | 0 | 1 |
| 1 | -1.915313 | 1 | 2 | -0.092738 | 0 | 1 | 1.633471 | 0 | 1 |
| 2 | -1.474158 | 0 | 1 | -0.092738 | 0 | 0 | 0.977514 | 0 | 1 |
| 3 | 0.180175 | 1 | 1 | -0.663867 | 0 | 1 | 1.239897 | 0 | 1 |
| 4 | 0.290464 | 0 | 0 | -0.663867 | 0 | 1 | 0.583939 | 1 | 1 |
Model 1#
Training a Gaussian Naive Bayes model with the data (with scaling applied to the numerical features)#
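For reference, Gaussian Naive Bayes assumes the features are conditionally independent given the class and models each feature with a per-class normal distribution:
\(P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)\), with \(P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_{iy}^2}} \exp\!\left(-\frac{(x_i - \mu_{iy})^2}{2 \sigma_{iy}^2}\right)\)
where \(\mu_{iy}\) and \(\sigma_{iy}^2\) are the mean and variance of feature \(i\) within class \(y\), estimated from the training data; the predicted class is the \(y\) that maximizes this product.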
df1 = df.copy()
X = df1.drop(["Target"], axis = 1)
y = df1[["Target"]]
Split X and y to training and test data#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)
print("X_train: {}".format(X_train.shape))
print("y_train: {}".format(y_train.shape))
print("X_test: {}".format(X_test.shape))
print("y_test: {}".format(y_test.shape))
X_train: (242, 8)
y_train: (242, 1)
X_test: (61, 8)
y_test: (61, 1)
Prediction Analysis:#
The following is the analysis for a naive bayes model that was trained with the following done on the data:
1. Scale the numerical features with a standard scaler
2. Split the data into training and test sets, with 20% held out for testing
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Model 1 Results:")
print("----------------------------------\n")
ns_probs = [0 for _ in range(len(y_test))]
ns_auc = roc_auc_score(y_test, ns_probs)
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
y_probs = gnb.predict_proba(X_test)
gnb_probs = y_probs[:, 1]
gnb_auc = roc_auc_score(y_test, gnb_probs)
gnb_fpr, gnb_tpr, temp = roc_curve(y_test, gnb_probs)
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(gnb_fpr, gnb_tpr, marker='.', label='Gaussian NB')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
print("----------------------------------\n")
print("AUC-ROC Score: {0:0.2f}%".format(roc_auc_score(y_test, gnb_probs) * 100))
print("Precision Score: {0:0.2f}%".format(precision_score(y_test,y_pred) * 100))
print("Recall Score: {0:0.2f}%".format(recall_score(y_test,y_pred) * 100))
print("F Score: {0:0.2f}%".format(f1_score(y_test,y_pred) * 100))
print("Accuracy Score: {0:0.2f}%".format(accuracy_score(y_test,y_pred) * 100))
print("\n----------------------------------\n")
print("Confusion matrix:")
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm,display_labels =["Less Chance","More Chance"]).plot()
plt.show()
Model 1 Results:
----------------------------------
----------------------------------
AUC-ROC Score: 85.71%
Precision Score: 82.50%
Recall Score: 82.50%
F Score: 82.50%
Accuracy Score: 77.05%
----------------------------------
Confusion matrix:
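The evaluation cell above is repeated almost verbatim for Models 2 and 3. A small helper such as the hypothetical `evaluate_model` below could remove that duplication; it only wraps the metric calls already imported in this notebook:
def evaluate_model(model, X_test, y_test, label="Model"):
    """Print AUC-ROC, Precision, Recall, F-score and Accuracy for a fitted classifier."""
    y_pred = model.predict(X_test)
    y_probs = model.predict_proba(X_test)[:, 1]
    print("{} Results:".format(label))
    print("AUC-ROC Score:   {0:0.2f}%".format(roc_auc_score(y_test, y_probs) * 100))
    print("Precision Score: {0:0.2f}%".format(precision_score(y_test, y_pred) * 100))
    print("Recall Score:    {0:0.2f}%".format(recall_score(y_test, y_pred) * 100))
    print("F Score:         {0:0.2f}%".format(f1_score(y_test, y_pred) * 100))
    print("Accuracy Score:  {0:0.2f}%".format(accuracy_score(y_test, y_pred) * 100))

# Example usage, equivalent to the metrics printed above:
# evaluate_model(gnb, X_test, y_test, label="Model 1")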
Investigating dataset for outliers#
Analyzing the boxplot for scaled numeric attributes to check for outliers#
plt.figure(figsize=(10,8))
sb.boxplot(data=df[numeric_list], palette="husl")
plt.show()
print("\n----------------------------------\n")
----------------------------------
We can see some outlier values for BloodPressure and MaxHeartRate. We can drop these outliers using the IQR method.
Dropping outliers with IQR method#
Going with \(\pm 1.6 \times IQR\) beyond the quartiles to accommodate data up to roughly \(3\sigma\) from the mean before removing outliers.
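As a rough justification under a normality assumption: the quartiles of a normal distribution lie at \(\pm 0.674\sigma\), so \(IQR \approx 1.349\sigma\) and the cutoff becomes
\(Q_3 + 1.6 \times IQR \approx 0.674\sigma + 1.6 \times 1.349\sigma \approx 2.83\sigma\)
which is close to the \(3\sigma\) target (the conventional \(1.5 \times IQR\) whiskers correspond to roughly \(2.7\sigma\)).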
print("Original shape of dataframe: {}".format(df.shape))
for i in numeric_list:
    Q25 = np.percentile(df.loc[:, i], 25)
    Q75 = np.percentile(df.loc[:, i], 75)
    IQR = Q75 - Q25
    # Select outlier rows by index label (np.where returns positions, which no
    # longer match labels once earlier iterations have dropped rows)
    upper_bound = df.index[df.loc[:, i] >= (Q75 + 1.6 * IQR)]
    lower_bound = df.index[df.loc[:, i] <= (Q25 - 1.6 * IQR)]
    df.drop(upper_bound, inplace = True)
    df.drop(lower_bound, inplace = True)
print("Shape of dataframe after dropping outliers: {}".format(df.shape))
Original shape of dataframe: (303, 9)
Shape of dataframe after dropping outliers: (293, 9)
Model 2#
Training a Gaussian Naive Bayes model with the data (with scaling applied to the numerical features and outliers removed)#
df2 = df.copy()
X = df2.drop(["Target"], axis = 1)
y = df2[["Target"]]
Split X and y to training and test data#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)
print("X_train: {}".format(X_train.shape))
print("y_train: {}".format(y_train.shape))
print("X_test: {}".format(X_test.shape))
print("y_test: {}".format(y_test.shape))
X_train: (234, 8)
y_train: (234, 1)
X_test: (59, 8)
y_test: (59, 1)
Prediction Analysis:#
The following is the analysis for a naive bayes model that was trained with the following done on the data:
1. Scale the numerical features with a standard scaler
2. Remove outliers more than 1.6 × IQR below Q1 or above Q3
3. Split the data into training and test sets, with 20% held out for testing
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Model 2 Results:")
print("----------------------------------\n")
ns_probs = [0 for _ in range(len(y_test))]
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
y_probs = gnb.predict_proba(X_test)
gnb_probs = y_probs[:, 1]
gnb_fpr, gnb_tpr, temp = roc_curve(y_test, gnb_probs)
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(gnb_fpr, gnb_tpr, marker='.', label='Gaussian NB')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
print("----------------------------------\n")
print("AUC-ROC Score: {0:0.2f}%".format(roc_auc_score(y_test, gnb_probs) * 100))
print("Precision Score: {0:0.2f}%".format(precision_score(y_test,y_pred) * 100))
print("Recall Score: {0:0.2f}%".format(recall_score(y_test,y_pred) * 100))
print("F Score: {0:0.2f}%".format(f1_score(y_test,y_pred) * 100))
print("Accuracy Score: {0:0.2f}%".format(accuracy_score(y_test,y_pred) * 100))
print("\n----------------------------------\n")
print("Confusion matrix:")
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm,display_labels =["Less Chance","More Chance"]).plot()
plt.show()
Model 2 Results:
----------------------------------
----------------------------------
AUC-ROC Score: 85.92%
Precision Score: 84.21%
Recall Score: 80.00%
F Score: 82.05%
Accuracy Score: 76.27%
----------------------------------
Confusion matrix:
Finding correlation between features through a heatmap#
corr_features = set()
corr_matrix = df.corr()
plt.figure(figsize = (10,8))
sb.heatmap(corr_matrix, annot = True, cmap="magma")
plt.show()
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.5:
            colname = corr_matrix.columns[i]
            corr_features.add(colname)
print("\n----------------------------------\n")
print("The number of correlating features: {}".format(len(corr_features)))
print("The correlating features are: {}".format(corr_features))
print("\n----------------------------------\n")
----------------------------------
The number of correlating features: 0
The correlating features are: set()
----------------------------------
None of the feature pairs have an absolute correlation above 0.5, so no further feature reduction can be done here.
Model 3#
Training a Gaussian Naive Bayes model with the data (with scaling applied to the numerical features, outliers removed and categorical features one-hot encoded)#
Original dataframe:#
df3 = df.copy()
df3.head()
| | Age | Sex | CP_Type | BloodPressure | BloodSugar | ECG | MaxHeartRate | ExerciseAngina | Target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 1 | 3 | 0.763956 | 1 | 0 | 0.015443 | 0 | 1 |
| 1 | -1.915313 | 1 | 2 | -0.092738 | 0 | 1 | 1.633471 | 0 | 1 |
| 2 | -1.474158 | 0 | 1 | -0.092738 | 0 | 0 | 0.977514 | 0 | 1 |
| 3 | 0.180175 | 1 | 1 | -0.663867 | 0 | 1 | 1.239897 | 0 | 1 |
| 4 | 0.290464 | 0 | 0 | -0.663867 | 0 | 1 | 0.583939 | 1 | 1 |
One hot encoded dataframe:#
df3 = pd.get_dummies(df3, columns = category_list, drop_first = True)
df3.head()
| | Age | BloodPressure | MaxHeartRate | Target | Sex_1 | CP_Type_1 | CP_Type_2 | CP_Type_3 | BloodSugar_1 | ECG_1 | ECG_2 | ExerciseAngina_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 0.763956 | 0.015443 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | -1.915313 | -0.092738 | 1.633471 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | -1.474158 | -0.092738 | 0.977514 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.180175 | -0.663867 | 1.239897 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.290464 | -0.663867 | 0.583939 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
X = df3.drop(["Target"], axis = 1)
y = df3[["Target"]]
Split X and y to training and test data#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)
print("X_train: {}".format(X_train.shape))
print("y_train: {}".format(y_train.shape))
print("X_test: {}".format(X_test.shape))
print("y_test: {}".format(y_test.shape))
X_train: (234, 11)
y_train: (234, 1)
X_test: (59, 11)
y_test: (59, 1)
Prediction Analysis:#
The following is the analysis for a naive bayes model that was trained with the following done on the data:
1. One hot encode categorical features
2. Scale the numerical features with a standard scaler
3. Remove outliers more than 1.6 × IQR below Q1 or above Q3
4. Split the data into training and test sets, with 20% held out for testing
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Model 3 Results:")
print("----------------------------------\n")
ns_probs = [0 for _ in range(len(y_test))]
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
y_probs = gnb.predict_proba(X_test)
gnb_probs = y_probs[:, 1]
gnb_fpr, gnb_tpr, temp = roc_curve(y_test, gnb_probs)
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(gnb_fpr, gnb_tpr, marker='.', label='Gaussian NB')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
print("----------------------------------\n")
print("AUC-ROC Score: {0:0.2f}%".format(roc_auc_score(y_test, gnb_probs) * 100))
print("Precision Score: {0:0.2f}%".format(precision_score(y_test,y_pred) * 100))
print("Recall Score: {0:0.2f}%".format(recall_score(y_test,y_pred) * 100))
print("F Score: {0:0.2f}%".format(f1_score(y_test,y_pred) * 100))
print("Accuracy Score: {0:0.2f}%".format(accuracy_score(y_test,y_pred) * 100))
print("\n----------------------------------\n")
print("Confusion matrix:")
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm,display_labels =["Less Chance","More Chance"]).plot()
plt.show()
Model 3 Results:
----------------------------------
----------------------------------
AUC-ROC Score: 90.53%
Precision Score: 85.00%
Recall Score: 85.00%
F Score: 85.00%
Accuracy Score: 79.66%
----------------------------------
Confusion matrix:
Comparing the results of the three models:#
From the results obtained from above prediction analysis, we can tabulate them together below:
Model/Metric | AUC-ROC | Precision | Recall | F_Score | Accuracy |
---|---|---|---|---|---|
Model 1 | 85.71% | 82.50% | 82.50% | 82.50% | 77.05% |
Model 2 | 85.92% | 84.21% | 80.00% | 82.05% | 76.27% |
Model 3 | 90.53% | 85.00% | 85.00% | 85.00% | 79.66% |
This shows that the best Accuracy is obtained when scaling is applied to the numerical attributes, outliers are dropped and one-hot encoding is applied to the categorical features (Model 3).
We can also see that Model 2 has slightly lower Accuracy but higher Precision and lower Recall than Model 1, which suggests that removing outliers traded a few false positives for false negatives rather than improving the model overall.
We know that:
\(\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\)
and
\(\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\)
Upon viewing the confusion matrix for each model, it is clear that Model 3 records fewer false negatives and false positives than Model 1 and Model 2, which accounts for its better Precision and Recall scores.
Finally, we know that:
\(F = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
This also explains why Model 3 has the best F-score, since both its Precision and Recall are higher than those of the other two models.
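As a quick arithmetic check with Model 3's figures:
\(F = 2 \times \frac{85.00\% \times 85.00\%}{85.00\% + 85.00\%} = 85.00\%\)
which matches the tabulated value.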