Checking null or NaN values
We can check the data types and, at the same time, the number of non-null values of all features by using the info() method of pandas. If you only want to check the data types of the features, you can use dtypes.
Here is the result of calling info() on the dataset:
# info() prints its summary directly and returns None, so call it on its own line
print("Dataset info :-")
df.info()
## Output
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null float64
1 sex 303 non-null float64
2 cp 303 non-null float64
3 restbps 303 non-null float64
4 chol 303 non-null float64
5 fbs 303 non-null float64
6 restecg 303 non-null float64
7 thalach 303 non-null float64
8 exang 303 non-null float64
9 oldpeak 303 non-null float64
10 slope 303 non-null float64
11 ca 303 non-null object
12 thal 303 non-null object
13 hd 303 non-null int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
"""
It provides key information about our dataset, such as the total number of samples (rows), the column names, the number of non-null values, and the data type of each column.
Our dataset doesn’t appear to have any null values: the index has 303 entries (0 to 302), and every column reports 303 non-null values.
The “ca” column should contain the number of major vessels (0-3), which should be a float or int, but its data type is shown as object. Let’s explore it.
print(f"Unique values of ca variable :- \n {df['ca'].unique()}")
## Output
"""
Unique values of ca variable :-
array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)
"""
As we can see, “ca” has missing values marked by ‘?’. Next, we are going to check the “thal” column in the same way and count how many rows contain missing values.
# print out the number of rows that contain missing values
len(df.loc[(df['ca']=='?')
|
(df['thal']=='?')])
## Output
"""
6
"""
Creating a mask and dropping missing values.
df = df.loc[(df['ca'] != '?')
&
(df['thal'] != '?')]
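The same check and drop can also be sketched with pandas' missing-value tools by first converting the ‘?’ placeholder to NaN. The toy frame below is made up for illustration and is not the heart disease data; the names toy, toy_nan, and toy_clean are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the '?' placeholders in 'ca' and 'thal'
toy = pd.DataFrame({
    "ca": ["0.0", "3.0", "?", "1.0"],
    "thal": ["3.0", "?", "7.0", "6.0"],
    "hd": [0, 2, 1, 0],
})

# Turn the '?' placeholder into a real missing value
toy_nan = toy.replace("?", np.nan)

# Count missing values per column
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)

# Count rows that have at least one missing value
rows_with_missing = int(toy_nan.isna().any(axis=1).sum())
print(rows_with_missing)

# Drop those rows and convert the cleaned columns to numeric types
toy_clean = toy_nan.dropna().astype({"ca": float, "thal": float})
print(toy_clean.dtypes)
```

One possible benefit of this route is that the cleaned columns end up as floats rather than strings, which matches the preferred data type of “ca”.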
As you can see, all 6 rows with missing values are dropped.
print(f"Dataset Shape :- \n {df.shape}")
## Output
"""
Dataset Shape:-
(297, 14)
"""
Splitting the dataset into dependent and independent variables
Now we will take all the independent columns (every column except the target) as X and the dependent target column as y.
# Features and target creations
X = df.drop(['hd'],axis=1).copy()
y = df[['hd']].copy()
The preferred data type of each column should be as follows.
age – Float
sex – Category
cp, chest pain – Category
    1 = typical angina
    2 = atypical angina
    3 = non-anginal pain
    4 = asymptomatic
restbps, resting blood pressure (in mm Hg) – Float
chol, serum cholesterol in mg/dl – Float
fbs, fasting blood sugar – Category
    1 = greater than 120 mg/dl
    0 = otherwise
restecg, resting electrocardiographic results – Category
    0 = normal
    1 = having ST-T wave abnormality
    2 = showing probable or definite left ventricular hypertrophy
thalach, maximum heart rate achieved – Float
exang, exercise induced angina – Category
oldpeak, ST depression induced by exercise relative to rest – Float
slope, the slope of the peak exercise ST segment – Category
    1 = upsloping
    2 = flat
    3 = downsloping
ca, number of major vessels (0-3) colored by fluoroscopy – Float
thal, thallium heart scan – Category
    3 = normal
    6 = fixed defect
    7 = reversible defect
However, the current data types of our data look like this.
# info() prints its summary directly and returns None, so call it on its own line
print("Dataset info :-")
df.info()
## Output
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null float64
1 sex 303 non-null float64
2 cp 303 non-null float64
3 restbps 303 non-null float64
4 chol 303 non-null float64
5 fbs 303 non-null float64
6 restecg 303 non-null float64
7 thalach 303 non-null float64
8 exang 303 non-null float64
9 oldpeak 303 non-null float64
10 slope 303 non-null float64
11 ca 303 non-null object
12 thal 303 non-null object
13 hd 303 non-null int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
"""
Two popular methods for one-hot encoding are ColumnTransformer() from scikit-learn and get_dummies() from pandas. Here we are going to use get_dummies() on the categorical columns that have more than two categories.
X_encoded = pd.get_dummies(X, columns=['cp',
'restecg',
'slope',
'thal'])
X_encoded.head()
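To make the effect of get_dummies() concrete, here is a minimal sketch on a made-up two-column frame (the toy data and the name encoded are hypothetical). Each distinct value of the encoded column becomes its own indicator column, named after the original column and the value.

```python
import pandas as pd

# Hypothetical toy frame using two of the chest-pain codes described above
toy = pd.DataFrame({"age": [63.0, 41.0, 57.0], "cp": [1.0, 4.0, 1.0]})

# One indicator column is created per distinct value of 'cp'
encoded = pd.get_dummies(toy, columns=["cp"])
print(encoded.columns.tolist())  # ['age', 'cp_1.0', 'cp_4.0']
print(encoded)
```

Columns that are not listed in columns=, such as age here, are passed through unchanged.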
Let’s see what unique values the target variable has. We know “hd” is the target variable, and all the remaining columns are features of our dataset.
print(f"Unique values of target variable :- \n {df['hd'].unique()}")
## Output
"""
Unique values of target variable :-
[0 2 1 3 4]
"""
Here we only want to detect whether someone has heart disease or not. For this example, we are not concerned with the degree of heart disease, so we are going to convert 1, 2, 3, and 4 to 1.
y_not_zero_index = y > 0  # boolean mask that is True for the non-zero values
y[y_not_zero_index] = 1
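An equivalent way to express this conversion without modifying the frame in place is to compare against zero and cast the result to integers. This is only a sketch on a made-up target column; y_toy and y_binary are hypothetical names.

```python
import pandas as pd

# Hypothetical target column with the same five levels as 'hd'
y_toy = pd.DataFrame({"hd": [0, 2, 1, 3, 4, 0]})

# True where heart disease is present (any level above 0), then cast to 0/1
y_binary = (y_toy > 0).astype(int)
print(y_binary["hd"].tolist())  # [0, 1, 1, 1, 1, 0]
```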
Now, we need to split the whole dataset into train and test datasets. Training data is used at the time of building the model and a test dataset is used to evaluate trained models.
Using the train_test_split() function from the sklearn library, we can split the dataset into a 70% training set and a 30% test set.
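Before running the split on the real data, the following sketch shows how test_size=0.3 translates into row counts for a dataset of 297 rows. The arrays are stand-ins filled with placeholder values, and the stratify option at the end is an optional variation that this article does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the encoded dataset (297 rows, 22 columns)
X_demo = np.zeros((297, 22))
y_demo = np.arange(297) % 2  # alternating 0/1 labels, purely illustrative

# The test set size is rounded up: ceil(0.3 * 297) = 90 rows, leaving 207 for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=29
)
print(X_tr.shape, X_te.shape)  # (207, 22) (90, 22)

# Optional variation: stratify keeps the class ratio similar in both splits
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=29, stratify=y_demo
)
print(round(float(y_tr_s.mean()), 2), round(float(y_te_s.mean()), 2))
```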
#split the data into train and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=29,test_size=0.3)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
## Output
"""
(207, 22)
(90, 22)
(207, 1)
(90, 1)
"""
Building Heart Disease Detection using Machine Learning algorithm
Now our dataset is ready for building models. Let’s jump to the development of a preliminary model using the decision tree machine learning algorithm.
Preliminary Decision tree algorithm Implementation using python sklearn library
## Building decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# initialize object for DecisionTreeClassifier class
dt_classifier = DecisionTreeClassifier(random_state=29)
# train model by using fit method
print("Model training starts........")
dt_classifier.fit(X_train,y_train)
print("Model training completed")
acc_score = dt_classifier.score(X_test, y_test)
print(f'Accuracy of model on test dataset :- {acc_score}')
# predict result using test dataset
y_pred = dt_classifier.predict(X_test)
# confusion matrix
print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
# classification report for f1-score
print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
# Output
"""
Model training starts.
Model training completed
Accuracy of model on test dataset :- 0.7333333333333333
Confusion Matrix :-
[[32 10]
[14 34]]
Classification Report :-
precision recall f1-score support
0 0.70 0.76 0.73 42
1 0.77 0.71 0.74 48
accuracy 0.73 90
macro avg 0.73 0.74 0.73 90
weighted avg 0.74 0.73 0.73 90
"""
Now, let’s visualize the decision tree to understand the tree structure.
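import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dt_classifier,
          filled=True,
          rounded=True,
          class_names=['No HD', 'HD'],
          feature_names=list(X_encoded.columns));  # a plain list is accepted across scikit-learn versions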
Visualization of preliminary decision tree
Cost complexity pruning part 1: Visualize alpha
There are many parameters, such as max_depth and min_samples_split, that reduce overfitting. However, cost complexity pruning can simplify the search for a smaller tree that improves accuracy on the testing dataset.
Pruning a decision tree is all about finding the right value for the pruning parameter alpha, which controls how little or how much pruning happens.
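For reference, the cost complexity measure used in minimal cost complexity pruning, as described in the scikit-learn documentation, can be written as below. Here R(T) is the total impurity of the leaves of tree T, |T̃| is the number of leaves, and alpha (the ccp_alpha parameter) is a non-negative penalty per leaf.

```latex
R_\alpha(T) = R(T) + \alpha \, \lvert \widetilde{T} \rvert
```

With alpha = 0 the full tree minimizes this measure, while a larger alpha makes every extra leaf more costly, so more of the tree is pruned away.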
We omit the maximum value of alpha with ccp_alphas = ccp_alphas[:-1] as it prunes all leaves, leaving us with only a root instead of a tree.
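# Assumption: the candidate alphas come from the pruning path on the training data (random_state=42 is assumed)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
ccp_alphas = ccp_alphas[:-1]  # drop the largest alpha, which prunes the tree down to the root
# fit one decision tree per candidate alpha
clf_dts = [DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_train, y_train) for a in ccp_alphas]
train_scores = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]
fig, ax = plt.subplots()
ax.set_xlabel('alpha')
ax.set_ylabel('accuracy')
ax.set_title('Accuracy vs alpha for training and testing sets')
ax.plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle='steps-post')
ax.plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle='steps-post')
ax.legend()
plt.show()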
Cost complexity pruning part 2: cross-validation for finding the best alpha
The graph we just drew suggested one value for alpha, 0.014, but another set of data might suggest another optimal value.
To demonstrate this, we use the cross_val_score() function to generate different training and testing datasets, then train and test the tree with each of them.
from sklearn.model_selection import cross_val_score
clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.014)
# we use 5-fold cross-validation as we don't have lots of data
scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
# use a separate name so the heart disease DataFrame df is not overwritten
cv_df = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
cv_df.plot(x='tree', y='accuracy', marker='o', linestyle='--')
The graph above shows that using different training and testing data with the same alpha resulted in different accuracies, suggesting that alpha is sensitive to the datasets. So instead of picking a single training dataset and a single testing dataset, let’s use cross-validation to find the optimal value for ccp_alpha.
import numpy as np
alpha_loop_values = []  # store the results
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
    alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])
# Now we can draw a graph of the mean and standard deviation of the scores
alpha_results = pd.DataFrame(alpha_loop_values,
                             columns=['alpha', 'mean_accuracy', 'std'])
alpha_results.plot(x='alpha',
                   y='mean_accuracy',
                   yerr='std',
                   marker='o',
                   linestyle='--')
# pick the alpha with the highest mean cross-validated accuracy (the first one if there is a tie)
best_row = alpha_results['mean_accuracy'].idxmax()
ideal_ccp_alpha = float(alpha_results.loc[best_row, 'alpha'])
print(f"Ideal ccp alpha:- \n {ideal_ccp_alpha}")
# Output
"""
Ideal ccp alpha:- 0.734262
"""
Building, Evaluating, Drawing in the Final Classification Tree
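from sklearn.metrics import ConfusionMatrixDisplay
clf_dt_pruned = DecisionTreeClassifier(random_state=42,
                                       ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(clf_dt_pruned, X_test, y_test, display_labels=['No HD', 'HD'])
# predict with the pruned tree so the report describes this model, not the preliminary one
y_pred_pruned = clf_dt_pruned.predict(X_test)
print(f"Classification Report :- \n {classification_report(y_test, y_pred_pruned)}")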
# Output
"""
Classification Report :-
precision recall f1-score support
0 0.72 0.79 0.75 42
1 0.80 0.73 0.76 48
accuracy 0.76 90
macro avg 0.76 0.76 0.76 90
weighted avg 0.76 0.76 0.76 90
"""
Now, let’s visualize the decision tree to understand the tree structure.
plt.figure(figsize=(20,10))
plot_tree(clf_dt_pruned,
filled=True,
rounded=True,
class_names=['No HD', 'HD'],
feature_names=list(X_encoded.columns));
Visualization of pruned decision tree
Accuracy comparison between the two models.
acc = dt_classifier.score(X_test, y_test)*100
print("Preliminary Decision Tree Test Accuracy {:.2f}%".format(acc))
acc = clf_dt_pruned.score(X_test, y_test)*100
print("Pruned Decision Tree Test Accuracy {:.2f}%".format(acc))
# Output
"""
Preliminary Decision Tree Test Accuracy 73.33%
Pruned Decision Tree Test Accuracy 75.56%
"""
Here we have to discuss a few terms related to the confusion matrix. Using the confusion matrix, we can measure the effectiveness of our model and also get a better idea of what types of errors it is making.
True Positive (TP):- The number of positive labels correctly predicted by the trained model, that is, the number of Class-1 samples correctly predicted as Class-1.
True Negative (TN):- The number of negative labels correctly predicted by the trained model, that is, the number of Class-0 samples correctly predicted as Class-0.
False Positive (FP):- The number of negative labels incorrectly predicted as positive by the trained model, that is, the number of Class-0 samples incorrectly predicted as Class-1.
False Negative (FN):- The number of positive labels incorrectly predicted as negative by the trained model, that is, the number of Class-1 samples incorrectly predicted as Class-0.
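These four counts can be read directly from scikit-learn's confusion_matrix(), which for a binary problem is laid out as [[TN, FP], [FN, TP]]. The sketch below uses made-up label arrays arranged so that they reproduce the preliminary tree's matrix [[32, 10], [14, 34]] reported earlier; y_true and y_hat are hypothetical names, not the article's variables.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels arranged to reproduce the matrix [[32, 10], [14, 34]]
y_true = np.array([0] * 42 + [1] * 48)
y_hat = np.array([0] * 32 + [1] * 10 + [0] * 14 + [1] * 34)

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 32 10 14 34

# Common metrics derived from the four counts (for the positive class, Class-1)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))  # 0.73 0.77 0.71 0.74
```

The rounded values agree with the accuracy and the Class-1 row of the preliminary model's classification report.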
Now we are creating confusion matrices based on the preliminary and pruned decision tree models.
Conclusion
Finally, our model gives a test accuracy of 75.56% by using a pruned decision tree, which is an improvement of about 2 percentage points over the preliminary tree (73.33%). By applying additional data preprocessing techniques or using another model such as Random Forest, Support Vector Machines (SVM), or k-nearest neighbors, we may get a better result.
Anup Das I'm obsessed with python and write articles about python tutorials for Django, Data Science, and Automation.