Author: Anderson Hitoshi Uyekita
Course: Data Science - Foundation I
COD: ND110
Date: 13/12/2018
Dataset: No-show appointments
Version: 1.0
I decided to tackle this specific problem because I am from Brazil, and it will be a big challenge for me to draw any conclusions from this data. I am an Electrical Engineer, I have never been to Vitória, and, even more, I do not know anything about its public health care system.
It is a bit complicated to pose any kind of question beforehand when I do not have a clue about the variables in this dataset. Now that I have already loaded the data and printed some summaries, my approach for this project is to aggregate new variables (all of this new information is public and free to access) to enhance my analysis.
These new variables include the average monthly income, the number of inhabitants in each neighborhood (divided into male and female), the neighborhood itself, and the Regional Administration.
I hope that by the end of this document I have posed good questions and, most importantly, answered them properly.
Although I did not have a question at the start, I see this project as an opportunity to practice all the content learned "in class". So, my objectives for this project are:
Along this document I have posed three questions:
1. Are patients with many appointments more likely to show up than patients with fewer than two appointments?
It seems to be true, but I lack information to affirm it. What I can say is that when I analyse patients with a higher number of appointments, they tend to have a better show-up rate.
2. Are patients with many appointments on average older than those with few appointments?
I also identified a positive correlation: when I analyse patients with a higher number of appointments, they tend to have a higher average age than the population.
3. Do patients with disabilities tend to have better show-up rates? Does the number of disabilities raise the show-up rate?
Due to the small number of patients with disabilities, it is difficult to answer this question, but I have found a better show-up rate in the group of patients with disabilities; however, patients with many disabilities tend to have worse show-up rates.
I have answered all these questions using the original data and additional data collected on the web. My report is linear: I start with Data Wrangling (Gather and Assess); after that I start the EDA, during which I stop for a while to gather new data on the internet, and then I come back to finish the EDA and to pose and answer the questions.
At the end of the document, I have inserted an appendix to keep the additional data I have found separate from the main document.
Obs.: I really do not know if I am doing well, and for this reason I have decided to submit my report to receive feedback. I am not sure if I am on the "right path". I know I have not used all the resources I have gathered on the web, and I do want to use them to pose a new question, but first I need preliminary feedback.
I have written this report using Jupyter Notebook to allow anyone to reproduce each step. Although I have created some datasets to enhance my analysis, which makes everything harder to reproduce, I have stored them on GitHub, so they are available for anyone to download and use.
All this research is performed using:
To be honest, I did not use the Python IDLE in this project, because I wrote directly in the Jupyter Notebook; to perform this study I coded everything directly in Opera.
I have used Atom to push to the GitHub repository, and nothing more than that.
I kindly ask you to install each of these packages before you run the next steps.
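If any of them is missing, a minimal way to install them from a notebook cell (a sketch; it assumes pip is available, and the package list simply mirrors the imports used below):
# Install the required packages (run once).
!pip install pandas numpy matplotlib seaborn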
You can access all files of this report in this repository:
This file is available at Kaggle, maintained by @JoniHoppen, and is also available for download at GDrive.
This dataset records medical appointments from March to June of 2016 carried out in public hospitals in Vitória, a medium-sized city (almost 320,000 inhabitants) of Brazil located in the southeast region.
(**) Both pictures extracted from Wikipedia.
Vitória is the capital of Espírito Santo state and has almost 320,000 inhabitants (also according to Wikipedia), spread over 9 Regional Administration areas, which contain a total of 80 neighborhoods.
The public health system is composed of Basic Health Units (Unidade Básica de Saúde, UBS for the sake of this document) spread all over the city; Figure 2 shows how these UBS are located in the city.
I have extracted this illustration from the document Relatório de Gestão 2008 from the Sistema Único de Saúde of the Vitória City Council Health Secretary, based on that document and on these two:
After briefly compiling information from these three sources, I have counted 38 UBS (or similar facilities) in Vitória City (see details here).
I will assume for this report that all appointments could take place in any of these 38 premises listed above.
Observation: This is not important to the project, but it could be important to someone who wants to learn more about this dataset. The Regional Administration Area is slightly different from the Health Territory Area: the Regional Administration has 9 areas and the Health Territory has 6, which means some neighbourhoods were relocated to other regions. It becomes much more explicit if you compare these two files (Regional Administrative Area and Health Territory Area).
If you are still curious about the location of each UBS, you can access GeoWeb, a web-based tool with a thematic map.
Unfortunately, this dataset lacks a codebook; despite this hurdle, most of the problems can be resolved by reading the Kaggle Discussions channel. This dataset has:
Each of these variables can be interpreted using Table 1.
Variable | Type | Description |
---|---|---|
PatientId | int | Internal ID to track the patient |
AppointmentID | int | Internal ID to track the appointment |
Gender | cat | F: Female and M: Male |
ScheduledDay | date | Date and time at which the appointment was scheduled (booked) |
AppointmentDay | date | When the appointment occurs |
Age | int | Patient's age; if Age is negative, the person is pregnant |
Neighbourhood | str | In Brazil, each city is divided into Regional Administration (Regiões Administrativas) and then in neighborhood (bairros). Read more in Appendix A. |
Scholarship | bool | 0: No and 1: Yes |
Hipertension | bool | 0: No and 1: Yes |
Diabetes | bool | 0: No and 1: Yes |
Alcoholism | bool | 0: No and 1: Yes |
Handcap | int | Number of disabilities the patient has (0, 1, 2, 3, or 4) |
SMS_received | int | Number of SMS sent to the patient |
No-show | cat | No: Show-up and Yes: No-show |
I have found some good information available on the internet and decided to aggregate it into my analysis. At the end of this document, in the Appendix, I have added details on where I got this information. As a result, I have written two datasets:
Let's start by analysing the number of rows and columns, the number of null values, the number of duplicated rows, and the data types. I am also interested in the first and last rows, to assess whether there is any problem with the header or a non-standard value at the end of the file.
These are all the libraries I will use throughout this report:
# Importing Libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Let's load the noshowappointments-kagglev2-may-2016.csv file, according to the instructions:
This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.
It has roughly 100,000 medical appointments.
# Loading the dataset.
df_vit = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
# Plot the dimensions
print("Rows: {}\nColumns: {}".format(*df_vit.shape))
Rows: 110527 Columns: 14
As I expected, the number of rows is roughly 110k, with 14 columns.
Now, I want to see the first and the last 5 rows, just to ensure everything is OK.
df_vit.head(5) # Print the first 5 rows.
PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
df_vit.tail(5) # Print the last 5 rows.
PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110522 | 2.572134e+12 | 5651768 | F | 2016-05-03T09:15:35Z | 2016-06-07T00:00:00Z | 56 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110523 | 3.596266e+12 | 5650093 | F | 2016-05-03T07:27:33Z | 2016-06-07T00:00:00Z | 51 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110524 | 1.557663e+13 | 5630692 | F | 2016-04-27T16:03:52Z | 2016-06-07T00:00:00Z | 21 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110525 | 9.213493e+13 | 5630323 | F | 2016-04-27T15:09:23Z | 2016-06-07T00:00:00Z | 38 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
110526 | 3.775115e+14 | 5629448 | F | 2016-04-27T13:30:56Z | 2016-06-07T00:00:00Z | 54 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
After checking the first and last rows and confirming nothing is wrong with them, I will analyse the number of duplicated entries.
# Sum the number of rows with duplicated entries.
dupe = sum(df_vit.duplicated())
print("Number of duplicated rows: {}".format(dupe))
Number of duplicated rows: 0
Nice dataframe: no duplicated entries (identical rows). Now, let's look for NA values or any other strange values using the .info() method. Note that with the results of .info() I can also check whether there is any inconsistency in the data type of each variable.
# Check if any rows has NA values.
df_vit.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110527 entries, 0 to 110526 Data columns (total 14 columns): PatientId 110527 non-null float64 AppointmentID 110527 non-null int64 Gender 110527 non-null object ScheduledDay 110527 non-null object AppointmentDay 110527 non-null object Age 110527 non-null int64 Neighbourhood 110527 non-null object Scholarship 110527 non-null int64 Hipertension 110527 non-null int64 Diabetes 110527 non-null int64 Alcoholism 110527 non-null int64 Handcap 110527 non-null int64 SMS_received 110527 non-null int64 No-show 110527 non-null object dtypes: float64(1), int64(8), object(5) memory usage: 11.8+ MB
Based on the .info() results, I have found some problems, so I will comment only on what I think is wrong:
- PatientId: ID to identify the patient; it must be int.
- ScheduledDay and AppointmentDay: these are dates; I will need to fix them.
- Scholarship, Hipertension, Diabetes, No-show, and Alcoholism: boolean variables; I will also need to fix them later.
Let's look at the column names; maybe an adjustment from upper case to lower case will be necessary.
# Print the columns names.
df_vit.columns
Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'], dtype='object')
Fortunately, I did not find any spaces in the variable names; however, there are plenty of variables with an uppercase letter in the middle. To make everything uniform, I will later apply .lower().
The .nunique() method prints the number of unique values of each variable.
# Print the number of unique values for each variable.
df_vit.nunique()
PatientId 62299 AppointmentID 110527 Gender 2 ScheduledDay 103549 AppointmentDay 27 Age 104 Neighbourhood 81 Scholarship 2 Hipertension 2 Diabetes 2 Alcoholism 2 Handcap 5 SMS_received 2 No-show 2 dtype: int64
The .nunique() results are very clarifying, because I realize some PatientId values appear more than once, i.e., some patients had more than one appointment.
unique_patientid = round(100*(110527-62299)/110527,2),round(100*(62299)/110527,2)
print("Appointed more than once: {}%\nAppointed one time: {}%".format(*unique_patientid))
Appointed more than once: 43.63% Appointed one time: 56.37%
Let's take a look at the Age variable. I want to know the .max(), .min(), .mean(), etc.; the .describe() method shows a very good summary.
# Shows the mean, max, min etc.
df_vit['Age'].describe()
count 110527.000000 mean 37.088874 std 23.110205 min -1.000000 25% 18.000000 50% 37.000000 75% 55.000000 max 115.000000 Name: Age, dtype: float64
The minimum age of -1 is quite weird, so I will print all the observations with an age below zero.
# Cases of age less than zero.
df_vit[df_vit.Age < 0]
PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
99832 | 4.659432e+14 | 5775010 | F | 2016-06-06T08:58:13Z | 2016-06-06T00:00:00Z | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No |
Only one observation has an age below zero; for this reason, in the next chapter (Data Cleaning) I will remove it.
I would also like to see in which months these appointments were scheduled.
nov = sum(df_vit['ScheduledDay'].str.contains('2015-11'))
dec = sum(df_vit['ScheduledDay'].str.contains('2015-12'))
jan = sum(df_vit['ScheduledDay'].str.contains('2016-01'))
feb = sum(df_vit['ScheduledDay'].str.contains('2016-02'))
mar = sum(df_vit['ScheduledDay'].str.contains('2016-03'))
apr = sum(df_vit['ScheduledDay'].str.contains('2016-04'))
may = sum(df_vit['ScheduledDay'].str.contains('2016-05'))
jun = sum(df_vit['ScheduledDay'].str.contains('2016-06'))
print("nov: {}\ndec: {}\njan: {}\nfeb: {}\nmar: {}\napr: {}\nmay: {}\njun: {}".format(nov,dec,jan,feb,mar,apr,may,jun))
nov: 1 dec: 61 jan: 60 feb: 281 mar: 3614 apr: 25339 may: 67421 jun: 13750
The scheduling distribution is very left-skewed; most of the appointments were scheduled in May, April, and June.
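As a side note, the same monthly counts can be obtained more compactly (a sketch; it relies on ScheduledDay still being the raw ISO string, so the first 7 characters are the 'YYYY-MM' prefix):
# Count scheduled appointments per month using the 'YYYY-MM' prefix.
print(df_vit['ScheduledDay'].str[:7].value_counts().sort_index())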
Moving on to the Handcap variable, which has 5 levels:
# If a person has disabilities (blindness, deafness, etc.)
df_vit['Handcap'].value_counts() # This is cumulative, and I do not know what kind of disabilities each patient has.
0 108286 1 2042 2 183 3 13 4 3 Name: Handcap, dtype: int64
As shown above, most of the patients do not have disabilities.
Let's see the patient gender.
# Gender
df_vit.Gender.value_counts()
F 71840 M 38687 Name: Gender, dtype: int64
Women are predominant in these medical appointments.
The next variable is SMS_received; adopting the interpretation of the creator/maintainer of the dataset, this variable is the number of messages sent to the patient.
df_vit.SMS_received.value_counts()
0 75045 1 35482 Name: SMS_received, dtype: int64
This variable looks like a boolean, but let's treat it as a quantity, because the maintainer said it is the number of messages sent to the patient.
The next 4 variables are all boolean, so I will bind the results into a single table.
# Just to save some lines in the report
b_family = df_vit['Scholarship'].value_counts() # Counts the number of Scholarship
b_alcohol = df_vit['Alcoholism'].value_counts() # Counts the number of Alcoholism
b_hiper = df_vit['Hipertension'].value_counts() # Counts the number of Hipertension
b_diab = df_vit['Diabetes'].value_counts() # Counts the number of Diabetes
pd.concat([b_family,b_alcohol,b_hiper,b_diab],axis = 1)
Scholarship | Alcoholism | Hipertension | Diabetes | |
---|---|---|---|---|
0 | 99666 | 107167 | 88726 | 102584 |
1 | 10861 | 3360 | 21801 | 7943 |
After this extensive Data Assessing, let's start the Data Cleaning.
In this chapter I will edit the df_vit dataframe, modifying some variable types, removing undesirable observations, etc.
Based on the long Data Assessing, this dataframe is quite clean; I just need to convert some variables and remove a non-standard negative age.
Renaming the columns will make the study much easier and faster; the uppercase letters in the middle of the variable names are annoying because you must remember the exact form when typing.
# The rename method deals with changing dashes to underscores and lower-casing everything.
df_vit.rename(columns=lambda x: x.strip().lower().replace("-", "_"), inplace=True)
# Print columns name to ensure the modification.
df_vit.columns
Index(['patientid', 'appointmentid', 'gender', 'scheduledday', 'appointmentday', 'age', 'neighbourhood', 'scholarship', 'hipertension', 'diabetes', 'alcoholism', 'handcap', 'sms_received', 'no_show'], dtype='object')
This minor change will save me time in the future, helping me avoid typos.
Now, let's convert patientid, scheduledday, and appointmentday.
# Convert float to int
df_vit.patientid = df_vit.patientid.astype(np.int64) # I am coercing to int64, because the default converts to int32.
The conversion of scheduledday and appointmentday will be performed with the numpy package.
# Converting the ScheduledDay to date.
df_vit.scheduledday = df_vit.scheduledday.apply(np.datetime64)
# Converting the AppointmentDay to date.
df_vit.appointmentday = df_vit.appointmentday.apply(np.datetime64)
# Print the first 5 rows.
df_vit.head()
patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29872499824296 | 5642903 | F | 2016-04-29 18:38:08 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
1 | 558997776694438 | 5642503 | M | 2016-04-29 16:08:27 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
2 | 4262962299951 | 5642549 | F | 2016-04-29 16:19:04 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
3 | 867951213174 | 5642828 | F | 2016-04-29 17:29:31 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
4 | 8841186448183 | 5642494 | F | 2016-04-29 16:07:23 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
Both datetime variables are converted, but with minor differences:
- scheduledday: it is OK;
- appointmentday: it loses the hours, minutes, and seconds, probably because all cases have the same time 00:00:00.
Now, let's convert the categorical variables:
- gender: male and female categories;
- no_show: yes or no.
: yes or no.# Converting the No-show variable as categorical
df_vit['no_show'] = pd.Categorical(df_vit['no_show'])
# Converting the Gender variable as categorical
df_vit['gender'] = pd.Categorical(df_vit['gender'])
Finally, I will convert all boolean variables.
# Converting int64 variables to bool.
df_vit.scholarship = df_vit.scholarship.apply(bool)
df_vit.hipertension = df_vit.hipertension.apply(bool)
df_vit.diabetes = df_vit.diabetes.apply(bool)
df_vit.alcoholism = df_vit.alcoholism.apply(bool)
Let's print the .info() once again to check the conversion.
df_vit.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110527 entries, 0 to 110526 Data columns (total 14 columns): patientid 110527 non-null int64 appointmentid 110527 non-null int64 gender 110527 non-null category scheduledday 110527 non-null datetime64[ns] appointmentday 110527 non-null datetime64[ns] age 110527 non-null int64 neighbourhood 110527 non-null object scholarship 110527 non-null bool hipertension 110527 non-null bool diabetes 110527 non-null bool alcoholism 110527 non-null bool handcap 110527 non-null int64 sms_received 110527 non-null int64 no_show 110527 non-null category dtypes: bool(4), category(2), datetime64[ns](2), int64(5), object(1) memory usage: 7.4+ MB
I will remove all observations with an age below zero because they are not correct. Since there is only one observation with this characteristic, it is unlikely this deletion will affect the analysis.
# Cases of Age less than zero. I have updated my variable name to lowercase age.
df_vit = df_vit[df_vit.age >= 0]
# Print the number of rows
print("Number of rows: {}".format(df_vit.shape[0]))
Number of rows: 110526
As I expected, the number of rows has decreased by one.
Here, I finished the Data Wrangling.
Sadly, I did not come up with a good and interesting question beforehand, but as a kickstart to my Exploratory Analysis, I will try to find a relationship between patients who had more than one appointment and no-shows.
My Question: Do patients with many appointments show up more frequently?
First, I want to know the no-show percentage across all appointments.
no_show_p = df_vit['no_show'].value_counts() # First show-up and second no-show.
print(no_show_p)
print("Appointments with Show-up: {}\nAppointments with No-show: {}\n".format(*no_show_p)) # Show-up (No): 88207; No-show (Yes): 22319
print("No-show Percentage: {}%".format(round(100*no_show_p[1]/sum(no_show_p), 2)))
No 88207 Yes 22319 Name: no_show, dtype: int64 Appointments with Show-up: 88207 Appointments with No-show: 22319 No-show Percentage: 20.19%
The no-show rate is roughly 20% over all appointments, but I know that some patients had more than one appointment. Let's dig deeper into this question by plotting a histogram.
# Histogram of patient id.
df_vit['patientid'].value_counts().hist(bins = [1, 2, 3, 4, 5, 6, 7, 8, 9], # 9 bin edges to be shown
figsize= [15,4], # Stretch the image horizontally
cumulative = False); # Not cumulative
plt.title('Graphic 1 - Vitória City Appointments per Patient ID Histogram') # Add title
plt.xlabel('Appointments per Patient ID') # X axis Label
plt.ylabel('Frequency'); # Y axis Label
The histogram suggests that only a few patients have several appointments, although it is not entirely clear. I will print the patientid values with the most appointments to analyse this in detail.
# This is the patientid with multiple appointments.
df_vit['patientid'].value_counts().head(5)
822145925426128 88 99637671331 84 26886125921145 70 33534783483176 65 258424392677 62 Name: patientid, dtype: int64
As you can see, there are some patients with high frequency; for this reason, I would like to test the 80/20 rule, so I will .sum() the appointments of the 20% of patientid values with the highest frequency.
# Number of unique patient id
unq_pat = df_vit['patientid'].nunique()
# Summation of the 20% patients with higher number of appointments
res = sum(df_vit['patientid'].value_counts().head(int(unq_pat * 0.20))) # I use int() to ensure an integer number as index.
# Print
print('Number of appointments : {}'.format(res))
print('Number of appointments (%): {}%'.format(round(100*res/110526, 2)))
Number of appointments : 48767 Number of appointments (%): 44.12%
The 20% with the highest frequency (patientid values with many appointments) are responsible for roughly 44% of the total appointments; it is far from the 80%, but it shows a kind of "asymmetry".
Are patients with many appointments more likely to show up than patients with fewer than two appointments?
Let's divide the dataset into two groups, using a threshold of at least 3 appointments to classify each patientid:
- (df_gp1) the one with the higher number of appointments, and;
- (df_gp2) the one with the lower number of appointments.
# Count the number of appointments for each patient id.
# (I split the code over several lines because it is much easier to revise and understand later.)
n_appoin_per_pat_id = df_vit['patientid'].value_counts()
# Creates a "vector" of boolean (True or False) for each patient id.
is_greater_3 = n_appoin_per_pat_id >= 3
is_not_great_3 = n_appoin_per_pat_id < 3
# Quantity of Patients with more or equal 3 appointments.
is_greater_3.value_counts()
False 51814 True 10484 Name: patientid, dtype: int64
There are 10,484 patients with 3 or more appointments and 51,814 patients with fewer than 3 appointments.
Obs.: I have read this thread on Stack Overflow to understand how to extract the "row names."
# Creating lists of patientid values split by the 3-appointment threshold.
more_3 = list(n_appoin_per_pat_id[is_greater_3].index) # List of patient ids with 3 or more appointments
less_3 = list(n_appoin_per_pat_id[is_not_great_3].index) # List of patient ids with fewer than 3 appointments
# First 10 patient ids with 3 or more appointments.
more_3[:10] # Slicing an Example
[822145925426128, 99637671331, 26886125921145, 33534783483176, 258424392677, 75797461494159, 871374938638855, 6264198675331, 66844879846766, 872278549442]
The more_3 and less_3 lists of patientid values will be used to select the rows of df_vit.
# Testing that no patient id was lost.
print(df_vit['patientid'].nunique() == len(more_3) + len(less_3)) # The total must be equal to these two subsets.
True
I have inserted a test to ensure these two subsets together account for the same number of patients as the original dataframe. If the result is True, everything is OK!
# Print the number of patient id in each list.
print("Dataframe with higher number appointments: {}\nDataframe with lower number appointments: {}".format(len(more_3),len(less_3)))
Dataframe with higher number appointments: 10484 Dataframe with lower number appointments: 51814
These two lists (more_3 and less_3) will be used to "filter" and separate which rows pertain to df_gp1 or df_gp2.
Obs.: I have read this Stack Overflow thread to learn how to use .isin().
# Subsetting the dataframe with the patients with 3 or more appointments
df_gp1 = df_vit[df_vit['patientid'].isin(more_3)]
# Subsetting the dataframe with the patients with fewer than 3 appointments
df_gp2 = df_vit[df_vit['patientid'].isin(less_3)]
# Printing the dimension of each dataframe
print("df_gp1\nObservations: {}\nColumns: {}\n\ndf_gp2\nObservations: {}\nColumns: {}".format(*df_gp1.shape,*df_gp2.shape))
df_gp1 Observations: 44817 Columns: 14 df_gp2 Observations: 65709 Columns: 14
Both data frames are ready to calculate the no-show rate.
# Assign the results of value count to a variable, I need it to make calculations
df_gp1_res_no_show = df_gp1.no_show.value_counts()
# Percentage Calculation
df_gp1_res_no_show_p = 100 * df_gp1_res_no_show[1]/(df_gp1_res_no_show[0] + df_gp1_res_no_show[1])
# Print results and rounding.
print("No-show: {}%".format(round(df_gp1_res_no_show_p,2)))
No-show: 21.09%
# Assign the results of value count to a variable, I need it to make calculations
df_gp2_res_no_show = df_gp2.no_show.value_counts()
# Percentage Calculation
df_gp2_res_no_show_p = 100 * df_gp2_res_no_show[1]/(df_gp2_res_no_show[0] + df_gp2_res_no_show[1])
print("No-show: {}%".format(round(df_gp2_res_no_show_p,2)))
No-show: 19.58%
Although the difference between these two results is slight, patients from group df_gp1 have a higher no-show percentage than those from df_gp2, going the opposite way from my expectations (and negating my question).
I will perform a sensitivity analysis, varying the threshold from 3 appointments to 4, 5, and so on; therefore, I will create a function to encapsulate these lines of code.
# I decided to encapsulate this in one function because it will save me lines of code.
def attendance(data, n, complete = False):
"""
-----------------------------------------------------------------------------------------------------
|DESCRIPTION: |
| |
| This function calculates the no-show rate, separating the dataframe into two groups based      |
| on a threshold n.                                                                               |
| |
-----------------------------------------------------------------------------------------------------
|INPUT: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| data dataframe This dataset must be the 'noshowappointments-kagglev2-may-2016.csv' |
| imported file, but it requires some treatment before using this function.                      |
| The columns name should be renamed to lower case. |
| |
| n int The threshold number of appointment to classify each patient id. |
| |
| complete bool If False return percentage and If True also returns data frames. |
| |
-----------------------------------------------------------------------------------------------------
|OUTPUT: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| [m_no_show_p, list The first element of the list is the percentage of no-show of patients|
| l_no_show_p] with more than or equal to n appointments in the data , the second |
| element is the percentage of the group with less than n appointments. |
| |
| df_more, data frame The former with patientid with more or equal n appointments and the |
| df_less latter with patientid with less than n appointments. |
| |
-----------------------------------------------------------------------------------------------------
"""
greater = data['patientid'].value_counts() >= n # Creating a vector to filter the rows
lesser = data['patientid'].value_counts() < n # Creating a vector to filter the rows
# Extracting the patient id as list
more = list((data['patientid'].value_counts())[greater].index)
less = list((data['patientid'].value_counts())[lesser].index)
# Subsetting the data into two dataframes.
df_more = data[data['patientid'].isin(more)]
df_less = data[data['patientid'].isin(less)]
# Calculation of no-show from the first dataframe (Patients with more than or equal to n appointments)
m_no_show = df_more.no_show.value_counts()
m_no_show_p = 100 * m_no_show[1]/(m_no_show[0] + m_no_show[1])
# Calculation of no-show from the second dataframe (patients with fewer than n appointments)
l_no_show = df_less.no_show.value_counts()
l_no_show_p = 100 * l_no_show[1]/(l_no_show[0] + l_no_show[1])
if complete == False:
return [round(m_no_show_p,2), round(l_no_show_p,2)] # Return results in percentage and in a list.
else:
return [round(m_no_show_p,2), round(l_no_show_p,2), df_more, df_less]
The attendance function will calculate, for each appointment-count threshold, the no-show percentage of these two groups.
# Variable Initialization
res_no_show_2_88 = []
# Loop to generate for each index value (threshold) the probabilities.
for index in range(2,88):
    res_no_show_2_88.append(attendance(df_vit, index))
The variable res_no_show_2_88 is a list in which each element is a pair of no-show percentages. For this reason, I need to unpack this list to plot the graphic.
# Unpacking the res_no_show_2_88 variable
no_show_2_88, show_up_2_88 = zip(*res_no_show_2_88)
# X axis variable
threshold = np.arange(2,88)
I have created two variables (no_show_2_88 and show_up_2_88) to be plotted, and a third variable for my threshold values (threshold).
# Defining the size of the Plot.
plt.figure(figsize= [12,4])
# red dashes and blue squares
plt.plot(threshold, no_show_2_88, 'r--',threshold,show_up_2_88, 'bs')
plt.xlabel('Number of appointments by Patient ID') # X label
plt.ylabel('No-show [%]') # Y label
plt.title('Graphic 2 - No-show rate by Patient ID') # Graphic Title
plt.text(25, 5.5, r'Knee') # Text
plt.grid(True) # Grid
plt.legend(('Group1', 'Group2')) # Legend
plt.show() # Plot the graphic
The graphic above shows the behaviour of the no-show rate when the threshold varies from 2 to 88. Keep in mind that I defined the threshold as the number of appointments of each patient.
As I expected, a higher appointment frequency for a given patientid leads to higher show-up (i.e., lower no-show) rates; this is probably a consequence of illness severity (chemotherapy, radiotherapy, dialysis, etc.), which obliges the patient to go to the hospital several times.
I have recognised a shortage of information, which is why I will pause this analysis here to collect new data. Later, I will come back for a deeper analysis, because I think there is an opportunity for more exploration.
I have gathered some additional data from the Vitória City Council website; let's load it to make a new analysis. I have not yet figured out the second question, and binding this additional data could lead me to new insights.
# Loading the Vitoria Auxiliary data.
vit_aux = pd.read_csv('02-Datasets/vit_aux.csv', sep=";");
# Print the first 5 rows.
vit_aux.head()
id_ra | reg_adm | nbh | pop_2000 | pop_2010 | male | female | avg_inc_mon | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Centro | Centro | 9240 | 9838 | 4416 | 5422 | 1791,15 |
1 | 1 | Centro | Do Moscoso | 854 | 795 | 369 | 426 | 647,05 |
2 | 1 | Centro | Fonte Grande | 1459 | 1231 | 592 | 639 | 706,03 |
3 | 1 | Centro | Ilha do Príncipe | 2810 | 2613 | 1194 | 1419 | 623,21 |
4 | 1 | Centro | Parque Moscoso | 1708 | 1773 | 791 | 982 | 1754,79 |
df_vit.head(2)
patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29872499824296 | 5642903 | F | 2016-04-29 18:38:08 | 2016-04-29 | 62 | JARDIM DA PENHA | False | True | False | False | 0 | 0 | No |
1 | 558997776694438 | 5642503 | M | 2016-04-29 16:08:27 | 2016-04-29 | 56 | JARDIM DA PENHA | False | False | False | False | 0 | 0 | No |
Comparing these two datasets, I realized the only variable in common is the neighborhood (or neighbourhood, in British English); sadly, neither the variable name nor the content is equivalent. This means I need to edit the column in both data frames. Again, I will use a tailored function to deal with this problem.
def lower_ahu(data):
"""
-----------------------------------------------------------------------------------------------------
|DESCRIPTION: |
| |
| This function converts any uppercase letter to lowercase, converts spaces into underscores,   |
| and changes any non-standard letter such as á, é, í, ó, ú, ã, õ, ç, etc. to a "regular"       |
| letter without the accent.                                                                     |
| |
-----------------------------------------------------------------------------------------------------
|INPUT: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| data            list          The input is expected to be a list of strings.                  |
| |
| |
-----------------------------------------------------------------------------------------------------
|OUTPUT: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| data            list          The converted version of the list, with accented words          |
|                               normalized.                                                     |
| |
-----------------------------------------------------------------------------------------------------
"""
# This helps to reduce the number of "types" of letters.
data = [x.lower() for x in data]
# My list of non standard letter and its accents.
non_standard = [['ã','a'],['á','a'],['à','a'],['â','a'],
['é','e'],['è','e'],['ê','e'],
['í','i'],['ì','i'],
['õ','o'],['ó','o'],['ò','o'],['ô','o'],
['ú','u'],['ù','u'],
['ç','c'],
[" ",'_'],["-",'_'],["'",'_'],["`",'_'],["´",'_'],[",",'.']]
for wrong,correct in non_standard:
data = [x.replace(wrong,correct) for x in data] # Where the magic happens!
return data # Return the data without accents.
After defining the function, let's use it.
# Converting the content of the nbh variable to lowercase and removing Portuguese accents.
vit_aux['nbh'] = lower_ahu(vit_aux['nbh'])
vit_aux['reg_adm'] = lower_ahu(vit_aux['reg_adm'])
# Converting the content of the neighbourhood variable to lowercase and removing Portuguese accents.
df_vit['neighbourhood'] = lower_ahu(df_vit['neighbourhood'])
# Let's compare both sets of categories. (Using == for value comparison, not 'is'.)
sorted(list(vit_aux.nbh.unique())) == sorted(list(df_vit.neighbourhood.unique()))
False
Unfortunately, it is False. Let's dig in to find the cause of the different numbers of neighborhoods.
# Number of neighborhood in each dataframe.
nbh_comparison = len(vit_aux.nbh.unique()),len(df_vit.neighbourhood.unique())
print("Auxiliary Dataframe: {}\nVitória Hospital Appointment: {}".format(*nbh_comparison))
Auxiliary Dataframe: 80 Vitória Hospital Appointment: 81
So, there is one neighborhood missing from the vit_aux data frame, and this neighborhood, after a few minutes comparing these two lists, turns out to be ilhas_oceanicas_de_trindade. Curiously, this island is 1,200 kilometers from the coast; it is an oceanic territory where only some 32 Navy personnel live and no one else.
# Print all observation from ilhas_oceanicas_de_trindade.
df_vit[df_vit['neighbourhood'] == 'ilhas_oceanicas_de_trindade']
patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | no_show | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
48754 | 534986855114 | 5583947 | F | 2016-04-14 12:25:43 | 2016-05-13 | 51 | ilhas_oceanicas_de_trindade | False | False | False | False | 0 | 0 | Yes |
48765 | 7256429752481 | 5583948 | F | 2016-04-14 12:26:13 | 2016-05-13 | 58 | ilhas_oceanicas_de_trindade | False | False | False | False | 0 | 0 | Yes |
As you can see, there are only 2 observations, from two different patients, and both no-showed. Due to the inconsistency of these two observations, I will remove them, as I am convinced doing so will not affect my results.
# Creating a vector to remove only the observations with ilhas_oceanicas_de_trindade as the neighbourhood.
rm_trindade = list(map(lambda x : not(x), df_vit['neighbourhood'] == 'ilhas_oceanicas_de_trindade'))
# Subsetting using the vector rm_trindade to remove these observations.
df_vit = df_vit[rm_trindade]
# Print dimensions
print("Rows: {}\nColumns: {}".format(*df_vit.shape))
Rows: 110524 Columns: 14
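For reference, the same removal can be written as a direct boolean filter (a sketch, equivalent to the map/lambda above):
# Equivalent, more idiomatic filter on the neighbourhood column.
df_vit = df_vit[df_vit['neighbourhood'] != 'ilhas_oceanicas_de_trindade']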
Finally, I can compare the neighborhood columns of these two data frames.
# Comparing the categories of these two columns. If true, I could join them.
sorted(list(df_vit.neighbourhood.unique())) == sorted(list(vit_aux.nbh.unique()))
True
It is OK, so I will continue by renaming the column so both data frames share the same name (neighbourhood will be replaced by nbh).
# Creating a vector to assign a new name to the variable neighbourhood
new_name_column = list(df_vit.columns)
# Renaming the columns names
new_name_column[6] = 'nbh'
# Defining variables names.
df_vit.columns = new_name_column
Before any merge, I will convert avg_inc_mon to float64. You can see that avg_inc_mon was originally imported as str/obj.
vit_aux.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 80 entries, 0 to 79 Data columns (total 8 columns): id_ra 80 non-null int64 reg_adm 80 non-null object nbh 80 non-null object pop_2000 80 non-null int64 pop_2010 80 non-null int64 male 80 non-null int64 female 80 non-null int64 avg_inc_mon 80 non-null object dtypes: int64(5), object(3) memory usage: 5.1+ KB
# Converting Brazilian number formatting to US style - changing , to .
vit_aux['avg_inc_mon'] = lower_ahu(vit_aux['avg_inc_mon'])
# Converting str/obj to float
vit_aux['avg_inc_mon'] = vit_aux['avg_inc_mon'].astype(float)
Everything is set up and ready for the merge.
Obs.: I have read how to merge on the pandas website; this is the page I consulted.
# Merging the two datasets on the nbh column.
df_vit = pd.merge(df_vit, vit_aux, how='left', on='nbh')
# Print the first 5 rows
df_vit.head()
patientid | appointmentid | gender | scheduledday | appointmentday | age | nbh | scholarship | hipertension | diabetes | ... | handcap | sms_received | no_show | id_ra | reg_adm | pop_2000 | pop_2010 | male | female | avg_inc_mon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29872499824296 | 5642903 | F | 2016-04-29 18:38:08 | 2016-04-29 | 62 | jardim_da_penha | False | True | False | ... | 0 | 0 | No | 9 | jardim_da_penha | 24623 | 30571 | 13702 | 16869 | 2510.89 |
1 | 558997776694438 | 5642503 | M | 2016-04-29 16:08:27 | 2016-04-29 | 56 | jardim_da_penha | False | False | False | ... | 0 | 0 | No | 9 | jardim_da_penha | 24623 | 30571 | 13702 | 16869 | 2510.89 |
2 | 4262962299951 | 5642549 | F | 2016-04-29 16:19:04 | 2016-04-29 | 62 | mata_da_praia | False | False | False | ... | 0 | 0 | No | 9 | jardim_da_penha | 9317 | 10594 | 5005 | 5589 | 4119.31 |
3 | 867951213174 | 5642828 | F | 2016-04-29 17:29:31 | 2016-04-29 | 8 | pontal_de_camburi | False | False | False | ... | 0 | 0 | No | 9 | jardim_da_penha | 992 | 889 | 452 | 437 | 1498.52 |
4 | 8841186448183 | 5642494 | F | 2016-04-29 16:07:23 | 2016-04-29 | 56 | jardim_da_penha | False | True | True | ... | 0 | 0 | No | 9 | jardim_da_penha | 24623 | 30571 | 13702 | 16869 | 2510.89 |
5 rows × 21 columns
Now, df_vit has a few more variables: id_ra, reg_adm, pop_2000, pop_2010, male, female, and avg_inc_mon.
Let's look at .describe() and .info() to check whether there were any problems during the merging process.
df_vit.describe()
patientid | appointmentid | age | handcap | sms_received | id_ra | pop_2000 | pop_2010 | male | female | avg_inc_mon | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.105240e+05 | 1.105240e+05 | 110524.000000 | 110524.000000 | 110524.000000 | 110524.000000 | 110524.000000 | 110524.000000 | 110524.000000 | 110524.000000 | 110524.000000 |
mean | 1.474960e+14 | 5.675306e+06 | 37.088904 | 0.022249 | 0.321034 | 4.600087 | 6682.462234 | 8183.312665 | 3825.968857 | 4357.343808 | 1063.303482 |
std | 2.560959e+14 | 7.129503e+04 | 23.110111 | 0.161545 | 0.466876 | 2.309988 | 6619.251330 | 10209.917257 | 4705.474679 | 5506.939496 | 793.033265 |
min | 3.921700e+04 | 5.030230e+06 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 4.172693e+12 | 5.640287e+06 | 18.000000 | 0.000000 | 0.000000 | 3.000000 | 2486.000000 | 2565.000000 | 1198.000000 | 1327.000000 | 558.110000 |
50% | 3.173184e+13 | 5.680574e+06 | 37.000000 | 0.000000 | 0.000000 | 4.000000 | 4087.000000 | 4402.000000 | 2102.000000 | 2290.000000 | 741.850000 |
75% | 9.439068e+13 | 5.725523e+06 | 55.000000 | 0.000000 | 1.000000 | 7.000000 | 7585.000000 | 7913.000000 | 3771.000000 | 4142.000000 | 1300.280000 |
max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 4.000000 | 1.000000 | 9.000000 | 24623.000000 | 39157.000000 | 18230.000000 | 20927.000000 | 10843.740000 |
df_vit.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110524 entries, 0 to 110523 Data columns (total 21 columns): patientid 110524 non-null int64 appointmentid 110524 non-null int64 gender 110524 non-null category scheduledday 110524 non-null datetime64[ns] appointmentday 110524 non-null datetime64[ns] age 110524 non-null int64 nbh 110524 non-null object scholarship 110524 non-null bool hipertension 110524 non-null bool diabetes 110524 non-null bool alcoholism 110524 non-null bool handcap 110524 non-null int64 sms_received 110524 non-null int64 no_show 110524 non-null category id_ra 110524 non-null int64 reg_adm 110524 non-null object pop_2000 110524 non-null int64 pop_2010 110524 non-null int64 male 110524 non-null int64 female 110524 non-null int64 avg_inc_mon 110524 non-null float64 dtypes: bool(4), category(2), datetime64[ns](2), float64(1), int64(10), object(2) memory usage: 14.1+ MB
Everything looks pretty good, but before posing any question, I would like to point out some considerations:
So let's start some interesting analysis with these new features.
# Monthly Average Income Histogram
df_vit['avg_inc_mon'].hist(bins = 11, figsize = [12,4]);
plt.xlim(0,5000)
plt.title('Graphic 3 - Histogram of Monthly Average Income') # Add title
plt.xlabel('Monthly Average Income [BRL]') # X axis Label
plt.ylabel('Frequency'); # Y axis Label
Concretely, the majority of the public attended by the public health system in Vitória has an average income of around 1,000 BRL/month.
However, in some cases one patientid has multiple entries, meaning the patient went to the hospital several times; I must remove these duplicated patientid rows for further analysis, for instance to calculate the .mean() and .count().
Let's remove the duplicated patientid rows from df_vit.
# Creating a vector that signals whether a patientid is duplicated
dupe = df_vit['patientid'].duplicated()
# Negating the vector: True becomes False and vice versa.
dupe = list(map( lambda x : not(x) , dupe ) )
# Selecting only unique patients, with no duplicates.
df_clean = df_vit[dupe]
# Dimensions of the cleaned data set.
df_clean.shape # The number of rows from shape is the same from len(df_vit['patientid'].unique())
(62296, 21)
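Pandas also offers a one-liner for this; the sketch below keeps the first row of each patientid, just like the mask above:
# Equivalent one-liner using drop_duplicates.
df_clean = df_vit.drop_duplicates(subset='patientid')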
I want to calculate the average age and the average income over all distinct patients.
# Average Age Calculation
print("{} years".format(round(df_clean['age'].mean(),2)))
print("{} BRL".format(round(df_clean['avg_inc_mon'].mean(),2)))
36.7 years 1068.71 BRL
I will encapsulate the removal of duplicated patientid rows in a function.
# Defining a function to eliminate duplicated patientid rows.
def no_dupe_pat(data_dupe):
"""
-----------------------------------------------------------------------------------------------------
|DESCRIPTION: |
| |
| This function removes all duplicated rows of a given patientid. |
| |
-----------------------------------------------------------------------------------------------------
|INPUT: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| data_dupe data frame Any data frame with at least the patientid column. |
| |
| |
-----------------------------------------------------------------------------------------------------
|OUTPUT: |
| |
| VARIABLE TYPE DESCRIPTION |
| |
| data_dupe data frame A data frame with no duplicated patientid. |
| |
-----------------------------------------------------------------------------------------------------
"""
# Creating a vector that signals whether a patientid is duplicated
dupe_less = data_dupe['patientid'].duplicated()
# Negating the vector: True becomes False and vice versa.
dupe_less = list(map( lambda x : not(x) , dupe_less ) )
# Selecting only unique patients, with no duplicates, and returning.
return data_dupe[dupe_less]
With this function, I can calculate more accurate values of the average age, average income, etc.
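A minimal usage sketch, equivalent to the manual steps above:
# Per-patient averages after dropping duplicated patientid rows.
df_unique = no_dupe_pat(df_vit)
print("{} years".format(round(df_unique['age'].mean(), 2)))
print("{} BRL".format(round(df_unique['avg_inc_mon'].mean(), 2)))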
I will now come back to the first question, to go deeper and to answer some new questions. I will reproduce Graphic 2 as a starter.
# Defining the size of the Plot.
plt.figure(figsize= [12,4])
# red dashes and blue squares
plt.plot(threshold, no_show_2_88, 'r--',threshold,show_up_2_88, 'bs')
plt.xlabel('Number of appointments by Patient ID') # X label
plt.ylabel('No-show [%]') # Y label
plt.title('Graphic 2 - No-show rate by Patient ID') # Graphic Title
plt.text(25, 5.5, r'Knee') # Text
plt.grid(True) # Grid
plt.legend(('Group1', 'Group2')) # Legend
plt.show() # Plot the graphic
I am going to split the dataset approximately at the knee of the graphic (I guess 25 is a good approximation).
# Calculating the percentage of no-show and data frames.
no_show_more_25,no_show_less_25,df_more_25,df_less_25 = attendance(df_vit, 25, complete= True)
After munging the data, I came up with a question.
Are patients with many appointments on average older than those with few appointments?
# Create a list of the average age of each data frame.
avg_age_25 = [round(df_more_25['age'].mean(),2), round(df_less_25['age'].mean(),2)]
# Print
print("Average Age for Patients with more or equal to 25 appointments: {} years".format(avg_age_25[0]))
print("Average Age for Patients with less than 25 appointments: {} years".format(avg_age_25[1]))
Average Age for Patients with more or equal to 25 appointments: 46.99 years Average Age for Patients with less than 25 appointments: 36.96 years
Well, let's take a look at the distribution.
Before that, I will explain my function no_dupe_pat(): it removes all duplicated rows of a given patientid, which means that if a patientid has 88 appointments, 87 rows will be excluded.
# Remove duplicated patientid entries and plot the histogram of ages - only patients with 25 or more appointments.
no_dupe_pat(df_more_25).age.hist(figsize =[12,4]);
plt.title('Graphic 4 - Histogram of Age for Patients with more than or equal to 25 appointments') # Add title
plt.xlabel('Age [years]') # X axis Label
plt.ylabel('Frequency'); # Y axis Label
# Remove duplicated patientid entries and plot the histogram of ages - only patients with fewer than 25 appointments.
no_dupe_pat(df_less_25).age.hist(figsize =[12,4]);
plt.title('Graphic 5 - Histogram of Age for Patients with less than 25 appointments') # Add title
plt.xlabel('Age [years]') # X axis Label
plt.ylabel('Frequency'); # Y axis Label
Based on these two histograms:
A single point does not explain much of this question, so let's vary the number of appointments and plot a graph.
# Initializing variable
list_avg_age = []
# Loop to calculate the average age, varying the appointment-count threshold per patientid
for index in range(2,88):
    no_show_more, no_show_less, df_more, df_less = attendance(df_vit, index, complete= True)
    list_avg_age.append([round(df_more['age'].mean(),2), round(df_less['age'].mean(),2)]) # Calculate the .mean()
# Unpacking the list_avg_age variable.
avg_age_more, avg_age_less = zip(*list_avg_age)
# Generator of x axis.
threshold = np.arange(2,88)
# Defining the size of the Plot.
plt.figure(figsize= [12,4])
# red dashes, blue squares.
plt.plot(threshold, avg_age_more, 'r--',threshold,avg_age_less, 'bs')
plt.xlabel('Number of appointments by Patient ID') # X label
plt.ylabel('Average Age [years]') # Y label
plt.title('Graphic 6 - Average Age by Patient ID') # Graphic Title
plt.grid(True) # Grid
plt.legend(('Group1', 'Group2')) # Legend
plt.show() # Plot the graphic
Based on the graphic, there is a positive relationship between the number of appointments per patient and the average age; however, this could be a trap, because the sample (the number of observations in the data frame with more than "x" appointments) becomes too small as the number of appointments per patientid rises. Table 2 shows how the number of observations drops in the data frame called df_more (a sketch to reproduce these counts follows the table).
k | Observations of df_more | Observations of df_less |
---|---|---|
2 | 24,379 | 37,917 |
3 | 10,484 | 51,812 |
4 | 4,984 | 57,312 |
5 | 2,617 | 59,679 |
10 | 333 | 61,963 |
25 | 28 | 62,268 |
50 | 13 | 62,283 |
88 | 1 | 62,295 |
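For reference, a sketch of how the patient counts in Table 2 can be reproduced with the attendance() function defined earlier (counting distinct patientid values in each subset):
# Count distinct patients in each subset for a few thresholds k.
for k in [2, 3, 4, 5, 10, 25, 50, 88]:
    p_more, p_less, df_more_k, df_less_k = attendance(df_vit, k, complete = True)
    print("k = {}: df_more = {}, df_less = {}".format(k, df_more_k['patientid'].nunique(), df_less_k['patientid'].nunique()))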
It seems the average age increases with the number of appointments, and it is also true that the show-up rate increases with the number of appointments.
Do patients with disabilities tend to have better show-up rates? Does the number of disabilities raise the show-up rate?
Let's start analysing the number of patients with disabilities.
# Number of appointments of Disabled Patient.
df_vit.handcap.value_counts()
0 108283 1 2042 2 183 3 13 4 3 Name: handcap, dtype: int64
# Calculation of total appointments with Disabled Patient.
tot_dis = sum(df_vit.handcap.value_counts()) - df_vit.handcap.value_counts()[0]
print("The total of appointments with patients with disabilities: {}".format(tot_dis))
The total of appointments with patients with disabilities: 2241
# Calculation of total Disabled Patient.
tot_pat_dis = sum(df_clean.handcap.value_counts()) - df_clean.handcap.value_counts()[0]
print("The total of patients with disabilities: {}".format(tot_pat_dis))
The total of patients with disabilities: 1133
# Subsetting to plot a graphic.
dis_gp = df_vit.handcap.value_counts();
# Creating the Pie Chart.
dis_gp[1:].plot(kind='pie',figsize = (8,8), title = 'Graphic 7 - Pie chart of Disabled Patient Appointments');
plt.xlabel('Number of Disabilities by Patient ID'); # X label
plt.ylabel(''); # Y label
plt.legend(('1 Disability', '2 Disabilities','3 Disabilities','4 Disabilities')); # Legend
As you can see, among the appointments of patients with disabilities, almost 91% involve exactly one disability, 8.1% two disabilities, 0.6% three disabilities, and 0.1% four disabilities.
Now, I will calculate the no-show rate.
# Variable initialization
no_show_dis_all = []
no_show_dis_all_p = []
# Loop to calculate the show-up and no-show counts for each disability level.
for index in range(1,5):
    no_show_dis_all.append(list(df_vit[df_vit.handcap == index].no_show.value_counts()))
# Unpacking no_show_dis_all (value_counts lists the 'No' count first, then 'Yes').
show_up_dis, no_show_dis = zip(*no_show_dis_all)
# Loop to calculate the no-show percentage.
for index in range(0,4):
    no_show_dis_all_p.append(round(100 * no_show_dis[index]/(no_show_dis[index] + show_up_dis[index]),2))
# Print the percentage
print("1 Disability: {}% no-show\n2 Disabilities: {}% no-show\n3 Disabilities: {}% no-show\n4 Disabilities: {}% no-show".format(*no_show_dis_all_p))
1 Disability: 17.92% no-show 2 Disabilities: 20.22% no-show 3 Disabilities: 23.08% no-show 4 Disabilities: 33.33% no-show
# No-show and show-up of any patient with more than or equal to 1 disability.
dis_more_one = df_vit[df_vit.handcap >= 1].no_show.value_counts()
# Percentage
dis_more_one_p = 100 * dis_more_one[1]/(dis_more_one[0]+dis_more_one[1])
# No-show to the all sample of patients with more than or equal to 1 disability.
print("Rate of No-show for any level of handcap: {}%".format(round(dis_more_one_p,2)))
Rate of No-show for any level of handcap: 18.16%
Analysing each disability group separately (1, 2, 3, and 4), the no-show rate increases as the number of disabilities increases. Keep in mind the sample sizes: there are 3 patients with 4 disabilities, 6 with 3, 9 with 2, and 1,025 with 1 disability.
Using the whole sample of patients with disabilities, the no-show rate (18.16%) is slightly lower than for the whole dataset (roughly 20%), so on average a patient with a disability could have a better show-up rate.
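The per-level rates above can also be computed in one pass (a sketch using groupby; no_show == 'Yes' marks a no-show):
# No-show rate (%) per number of disabilities.
print(round(100 * df_vit.groupby('handcap')['no_show'].apply(lambda s: (s == 'Yes').mean()), 2))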
Throughout this document, I have posed three questions. I will repeat the same text from the Synopsis:
1. Are patients with many appointments more likely to show up than patients with fewer than two appointments?
It seems to be true, but I lack information to affirm it. What I can say is that when I analyse patients with a higher number of appointments, they tend to have a better show-up rate.
2. Are patients with many appointments on average older than those with few appointments?
I also identified a positive correlation: when I analyse patients with a higher number of appointments, they tend to have a higher average age than the population.
3. Do patients with disabilities tend to have better show-up rates? Does the number of disabilities raise the show-up rate?
Due to the small number of patients with disabilities, it is difficult to answer this question, but I have found a better show-up rate in the group of patients with disabilities; however, patients with many disabilities tend to have worse show-up rates.
This appendix goes in depth into the new data found on the internet. Most of this information is public and disclosed on the Vitória City Council website.
The city of Vitória is divided into 9 (nine) Regional Administrations, according to the Vitória City Council website. Do not expect to read any information in English there; the most important things to gather on that page are the address of each Regional Administration building and the official name of each Regional Administration.
Regional Administration | Address | General Information |
---|---|---|
Central 1 - Centro | Praça Américo Poli Monjardim, Forte São João | See more |
Central 2 - Santo Antônio | Avenida Santo Antônio, 1400, Santo Antônio | See more |
Central 3 - Jucutuquara | Rua Santa Rita de Cássia, S/N, De Lourdes | See more |
Central 4 - Maruípe | Rua Marechal Floriano, 709, Maruípe | See more |
Central 5 - Praia do Canto | Avenida Rio Branco, 80, Santa Lúcia | See more |
Central 6 - Goiabeiras | Rua Desembargador Cassiano Castelo, 65, Goiabeiras | See more |
Central 7 - São Pedro | Avenida Beira Mar, 360, São Pedro | See more |
Central 8 - Jardim Camburi | Avenida Santos Evangelista, 15, Jardim Camburi | See more |
Central 9 - Jardim da Penha | Praça Philogomiro Lannes, S/N, Jardim da Penha | See more |
Digging deeper into the Vitória City Council website, I have found a map of each Regional Administration (Table A1, column General Information) with general information, such as:
I have recorded this information into one dataset (I do not know how to web-scrape these infos, and learning how would probably take too much time from my studies; I am new to Python). This dataset, called vit_reg_adm.csv, is available in my GitHub repository. Table A2 shows the variable descriptions.
Variable | Type | Description |
---|---|---|
id_ra | int | ID of the Regional Administration (following the sequence established in Table A1) |
reg_adm | str | Regional Administration Name |
nbh_qty | int | Quantity of Neighborhood in each Regional Administration |
area_m2 | float | Regional Administration Area in m² |
pop_2010 | int | Population in 2010 |
den_km2 | int | Demographic density (inhabitants per km²) |
home_2010 | int | Number of homes in 2010 |
avg_inc_2010 | float | Monthly Average Income (in BRL) |
eco_act_2012 | int | Commercial Stores Quantity |
Keep in mind that in Brazil "." is the thousands separator, whereas "," is the decimal separator.
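As a side note, pandas can parse this locale directly at load time, which would avoid the string replacement done with lower_ahu() earlier (a sketch, assuming the same file layout):
# Alternative load: let pandas parse Brazilian-formatted numbers directly.
vit_reg_adm = pd.read_csv('02-Datasets/vit_reg_adm.csv', sep=';', decimal=',', thousands='.')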
# Loading the Vitoria Regional Administration data.
vit_reg_adm = pd.read_csv('02-Datasets/vit_reg_adm.csv', sep=";");
# Print
vit_reg_adm.head()
id_ra | reg_adm | nbh_qty | area_m2 | pop_2010 | den_km2 | home_2010 | avg_inc_2010 | eco_act_2012 | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Centro | 8 | 2072047 | 19611 | 13140 | 6952 | 1425,82 | 27796 |
1 | 2 | Santo Antônio | 12 | 4425494 | 35261 | 2179 | 10796 | 649,84 | 4738 |
2 | 3 | Jucutuquara | 14 | 4793706 | 34141 | 6418 | 10580 | 1217,69 | 18621 |
3 | 4 | Maruípe | 12 | 5684216 | 54402 | 7122 | 17009 | 806,72 | 10903 |
4 | 5 | Praia do Canto | 9 | 5334352 | 34236 | 9570 | 12133 | 3844,97 | 48150 |
# Number of Neighborhood and Variables
vit_reg_adm.shape
(9, 9)
Vitória City has 80 (eighty) neighborhoods, as you can check on the Vitória City Council website; at this link you can access all the neighborhood boundaries, and in each of these PDFs you can see the roads and streets and a short summary of each neighborhood. Figure A1 shows an example of the information in each PDF.
Based on the links in the Table A1 column General Information and the list of all neighborhoods, I will create an auxiliary dataset to record additional data, called vit_aux.csv.
In this auxiliary dataset I have followed the neighborhood classification of the Vitória City Council website (see Table A1; each row has its own URL, where you can see the picture of each Regional Administration); in some cases I have found the same neighborhood in different Regional Administrations. Table A3 shows the description and type of each variable/feature.
Variable | Type | Description |
---|---|---|
id_ra | int | ID of the Regional Administration (following the sequence established in Table A1) |
reg_adm | str | Regional Administration Name |
nbh | str | Neighborhood Name |
pop_2000 | int | Population in Census 2000 |
pop_2010 | int | Population in Census 2010 |
male | int | Number of males in the given neighborhood |
female | int | Number of females in the given neighborhood |
avg_inc_mon | float | Average monthly income (in BRL) |
Keep in mind that in Brazil "." is the thousands separator, whereas "," is the decimal separator.
At this link you can see in detail a big map of Vitória and its Regional Administration Areas.
# Loading the Vitoria Auxiliary data.
vit_aux = pd.read_csv('02-Datasets/vit_aux.csv', sep=";");
# Print
vit_aux.head()
id_ra | reg_adm | nbh | pop_2000 | pop_2010 | male | female | avg_inc_mon | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Centro | Centro | 9240 | 9838 | 4416 | 5422 | 1791,15 |
1 | 1 | Centro | Do Moscoso | 854 | 795 | 369 | 426 | 647,05 |
2 | 1 | Centro | Fonte Grande | 1459 | 1231 | 592 | 639 | 706,03 |
3 | 1 | Centro | Ilha do Príncipe | 2810 | 2613 | 1194 | 1419 | 623,21 |
4 | 1 | Centro | Parque Moscoso | 1708 | 1773 | 791 | 982 | 1754,79 |
# Number of Neighborhood and Variables
vit_aux.shape
(80, 8)
I have found a table on the Vitória City Council website (click here to visit the original) with 20 age categories (5-year bins), split by gender.
age | male | female |
---|---|---|
0 to 04 years | 9932 | 9666 |
05 to 09 years | 10165 | 9727 |
10 to 14 years | 11944 | 11686 |
15 to 19 years | 12450 | 12932 |
20 to 24 years | 15297 | 15970 |
25 to 29 years | 15511 | 16621 |
This dataset is important for understanding the population distribution by age, and how it may affect the appointments.
# Loading the Vitoria Age data.
vit_age = pd.read_csv('02-Datasets/vit_age.csv', sep=";");
# Print
vit_age.head()
age | male | female | |
---|---|---|---|
0 | 0 to 04 years | 9932 | 9666 |
1 | 05 to 09 years | 10165 | 9727 |
2 | 10 to 14 years | 11944 | 11686 |
3 | 15 to 19 years | 12450 | 12932 |
4 | 20 to 24 years | 15297 | 15970 |
# Horizontal bar chart of the male population by 5-year age group.
y = np.arange(len(vit_age['male'])) # Bar positions
x = vit_age['male'] # Bar lengths (inhabitants)
objects = vit_age['age'] # Age group labels
plt.barh(y, x, align='center', alpha=0.5)
plt.yticks(y, objects)
plt.xlabel('Inhabitants')
plt.title('Male Distribution by 5-year Age Group')
plt.show()
Vitória City has 38 UBS.
Health Territory Area | Number of Neighborhoods Covered | UBS Family | US PACS | Reference Center | Specialized Center | Polyclinic | Total |
---|---|---|---|---|---|---|---|
Centro | 11 | 3 | 2 | 2 | 1 | 0 | 8 |
Forte São João | 20 | 2 | 2 | 4 | 0 | 0 | 8 |
Santo Antônio | 9 | 3 | 0 | 0 | 0 | 0 | 3 |
Penha | 14 | 1 | 4 | 0 | 0 | 0 | 5 |
Maruípe | 18 | 6 | 0 | 0 | 0 | 0 | 6 |
São Pedro | 10 | 4 | 0 | 2 | 1 | 1 | 8 |
I have not created any external document for this information.
Throughout the report I have inserted many links to different sources; for this reason, this chapter does not have too many links, as I do not think it is necessary to repeat them here. So, in this chapter, I only list the most important URLs for this research.