The data I obtained shows the gross and annual salaries for employees of the Baltimore City Government from July 2017 to June 2018.
I found this data through the suggestions for project 1 ideas on Blackboard.
Given the income-inequality discussions around tax season and the new tax laws implemented this year, I decided to study whether there is any income inequality within the Baltimore City Government.
import pandas as pd
import time
import datetime
import chardet
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
print('Pandas',pd.__version__)
start_time = time.time()
import os
print('File Size In MB : ',(os.path.getsize('Baltimore_City_Employee_Salaries_FY2018.csv')/1048576),' MB')
Encoding check for the input CSV file to ensure data is in the right format
with open('Baltimore_City_Employee_Salaries_FY2018.csv','rb') as fraw:
    file_content = fraw.read()
chardet.detect(file_content)
Character encoding of the CSV file is ASCII and the confidence level is 1 (100%).
Exploring file contents from the CSV:
!head -n 3 Baltimore_City_Employee_Salaries_FY2018.csv
Next, I will extract the data from the CSV file and load it into a dataframe for processing
pd.options.display.max_rows = 40
start_time_before_load = time.time()
salary_df = pd.read_csv("Baltimore_City_Employee_Salaries_FY2018.csv")
print('Time taken to load the data : ',time.time() - start_time_before_load,'seconds')
salary_df.shape
The CSV/dataframe contains 13,683 rows and 7 columns
Let us explore the data a bit using head(), tail(), info(), describe()
salary_df.head()
salary_df.tail()
salary_df.info()
salary_df.describe()
salary_df.describe(include='all')
salary_df.describe(include='object')
Next, I will explore the column metadata...
first_n_entries=5
print('Total rows in the dataframe:', salary_df.shape[0])
for col, col_type in salary_df.dtypes.items():
    if col_type == 'object':
        print('\n', col, 'has', salary_df[col].nunique(), 'unique entries')
        print('First', first_n_entries, 'entries are\n')
        print(salary_df[col][0:first_n_entries])
def uniq_per_col(df, count):
    for this_column in df.columns:
        print("\n==== ", this_column, "has", df[this_column].nunique(), "unique entries ====")
        print(df[this_column].value_counts().head(count))
uniq_per_col(salary_df,5)
"""
Next, I transform the object data type for JOBTITLE to 'category' data type
"""
salary_df['JOBTITLE'] = salary_df['JOBTITLE'].astype('category')
salary_df['JOBTITLE'].dtype
"""
Next, I transform the object data type for DEPTID to 'category' data type
"""
salary_df['DEPTID'] = salary_df['DEPTID'].astype('category')
salary_df['DEPTID'].dtype
"""
Next, I transform the object data type for DESCR to 'category' data type
"""
salary_df['DESCR'] = salary_df['DESCR'].astype('category')
salary_df['DESCR'].dtype
salary_df.dtypes
Now the dataframe stores JOBTITLE, DEPTID, and DESCR as 'category' columns instead of generic objects.
"Benford's law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in any large, randomly produced set of natural numbers, such as tables of logarithms or corporate sales statistics, around 30 percent will begin with the digit 1, 18 percent with 2, and so on, with the smallest percentage beginning with 9."
This law has famously been used in accounting fraud detection to flag anomalous and fabricated data.
Presented below is the distribution of the first digits of the gross salary data in our source.
"""
Generating histogram for first digit of gross pay
"""
first_digit_list_of_gross = []
for this_amt in list(salary_df['GROSS']):
    first_digit_list_of_gross.append(str(this_amt)[0])
first_digit_list_of_gross.sort()
plt.figure(figsize=(15,10))
_=plt.hist(first_digit_list_of_gross,bins=11,density=True)
plt.title("Benford's law analysis on gross pay",fontsize=22)
plt.xlabel('First digit of gross pay',fontsize=16)
plt.ylabel('Percentage of rows',fontsize=16)
plt.show()
"""
Generating histogram for first digit of annual pay
"""
first_digit_list_of_annual_rt = []
for this_amt in list(salary_df['ANNUAL_RT']):
    first_digit_list_of_annual_rt.append(str(this_amt)[0])
first_digit_list_of_annual_rt.sort()
plt.figure(figsize=(15,10))
_=plt.hist(first_digit_list_of_annual_rt,bins=11,density=True)
plt.title("Benford's Law analysis on annual pay",fontsize=22)
plt.xlabel('First digit of annual pay',fontsize=16)
plt.ylabel('Percentage of rows',fontsize=16)
plt.show()
Anomaly detection conclusion...
Continuing anomaly detection...
print('Max. annual pay is:', salary_df['ANNUAL_RT'].max(), 'and min. annual pay is:', salary_df['ANNUAL_RT'].min())
print('Max. gross pay is:', salary_df['GROSS'].max(), 'and min. gross pay is:', salary_df['GROSS'].min())
"""
Finding the indices that contain 0.0 as gross pay
"""
min_gross_idx=salary_df[salary_df['GROSS']==0.0].index
min_gross_idx
"""
Replacing each 0.0 gross value with the corresponding annual pay minus 2000.0
"""
gross_list=salary_df['GROSS'].tolist()
annual_rt_list=salary_df['ANNUAL_RT'].tolist()
for idx in min_gross_idx:
    gross_list[idx] = annual_rt_list[idx] - 2000.0
    print(gross_list[idx])
salary_df['GROSS']=gross_list
Black = filled; white = empty
"""
Searching for missing data in sample set of 30 randomly selected data points
"""
_=msno.matrix(salary_df.sample(30))
plt.xlabel('Features in data',fontsize=16)
plt.ylabel('Gaps in data',fontsize=16)
plt.show()
"""
Searching for missing data in sample set of 300 randomly selected data points
"""
_=msno.matrix(salary_df.sample(300))
plt.xlabel('Features in data',fontsize=16)
plt.ylabel('Gaps in data',fontsize=16)
plt.show()
"""
Searching for missing data in sample set of 3000 randomly selected data points
"""
_=msno.matrix(salary_df.sample(3000))
plt.xlabel('Features in data',fontsize=16)
plt.ylabel('Gaps in data',fontsize=16)
plt.show()
I will perform the following tasks here:
print('before:',salary_df.shape)
print('Drop all rows and columns where entire row or column is NaN.')
salary_df.dropna(how='all',axis=0,inplace=True) # rows
salary_df.dropna(how='all',axis=1,inplace=True) # columns
print('Drop columns with duplicate data or with 50% missing value.')
half_count = len(salary_df)*.5
salary_df = salary_df.dropna(thresh=half_count, axis=1)
salary_df = salary_df.drop_duplicates()
print('Drop columns where all rows have the same value.')
for this_column in salary_df.columns:
    if salary_df[this_column].nunique() == 1:
        salary_df.drop([this_column], axis=1, inplace=True)
print('after:',salary_df.shape)
In my dataset...
salary_df.head().T
"""
Counting null data per column
"""
salary_df.isnull().sum()
"""
Percentage of missing data per column
"""
(salary_df.isnull().sum()/len(salary_df)).sort_values(ascending=False)
The GROSS and HIRE_DT columns have missing data.
For both columns, I am going to fill in those missing values.
How many missing data points are there in the HIRE_DT column?
salary_df['HIRE_DT'].isnull().sum()
Viewing the missing data
salary_df[salary_df['HIRE_DT'].isnull()]
How many cells are missing in total across the rows with missing HIRE_DT data?
salary_df[salary_df['HIRE_DT'].isnull()].isnull().sum().sum()
salary_df[salary_df['HIRE_DT'].isnull()].shape
Out of the (3 * 7) = 21 cells associated with the three rows, only 3 cells are empty.
Here, I am replacing the missing dates with a randomly generated date, with help from SO Link
"""
Generating a random hire date between max and min hire dates
"""
from datetime import datetime, timedelta
from random import randrange
def get_date(d1):
    return datetime.strptime(d1, "%m/%d/%Y %H:%M:%S %p")
salary_df['HIRE_DATE']=salary_df['HIRE_DT'].dropna().apply(get_date)
import random
from datetime import datetime, timedelta
start = salary_df['HIRE_DATE'].min()
end = salary_df['HIRE_DATE'].max()
random_date = start + (end - start) * random.random()
d_as_str = datetime.strftime(random_date,"%m/%d/%Y %H:%M:%S %p")
salary_df['HIRE_DT'].fillna(d_as_str,inplace=True)
salary_df.drop(['HIRE_DATE'],axis=1,inplace=True)
How many missing data points are there in the GROSS column?
salary_df['GROSS'].isnull().sum()
How many cells are missing in total across the rows with missing GROSS data?
salary_df[salary_df['GROSS'].isnull()].isnull().sum().sum()
salary_df[salary_df['GROSS'].isnull()].shape
Out of the (161 * 7) = 1127 cells associated with the 161 rows, only 161 cells are empty.
Here I am using a fixed rule (the value of ANNUAL_RT for the corresponding row minus 2000.0) to fill in the missing GROSS data.
salary_df['GROSS'].fillna(salary_df['ANNUAL_RT']-2000.0,inplace=True)
Black = filled; white = empty
After cleanup and gap filling, the missing values in the dataset have been removed.
_=msno.matrix(salary_df.sample(3000))
plt.xlabel('Features in data',fontsize=16)
plt.ylabel('Gaps in data',fontsize=16)
plt.show()
Following is some basic numerical study of the salary data from the Baltimore City Government.
Total annual pay in FY2018 paid by the Baltimore City Government
'${:,.2f}'.format(salary_df['ANNUAL_RT'].sum())
Average Annual Pay in Baltimore City Government
'${:,.2f}'.format(salary_df['ANNUAL_RT'].mean())
Average Gross Pay in Baltimore City Government
'${:,.2f}'.format(salary_df['GROSS'].mean())
Maximum Annual Pay in Baltimore City Government
'${:,.2f}'.format(salary_df['ANNUAL_RT'].max())
Maximum Gross Pay in Baltimore City Government
'${:,.2f}'.format(salary_df['GROSS'].max())
Minimum Annual Pay in Baltimore City Government
'${:,.2f}'.format(salary_df['ANNUAL_RT'].min())
Minimum Gross Pay in Baltimore City Government
'${:,.2f}'.format(salary_df['GROSS'].min())
Employee details with highest annual pay in Baltimore City Government
salary_df[salary_df['ANNUAL_RT']== salary_df['ANNUAL_RT'].max()]
Employee details with highest gross pay in Baltimore City Government
salary_df[salary_df['GROSS']== salary_df['GROSS'].max()]
Employee details with lowest annual pay in Baltimore City Government
salary_df[salary_df['ANNUAL_RT']== salary_df['ANNUAL_RT'].min()]
Employee details with lowest gross pay in Baltimore City Government
salary_df[salary_df['GROSS']== salary_df['GROSS'].min()]
Average annual pay and average gross pay per job title in Baltimore City Government
salary_df.groupby('JOBTITLE').mean(numeric_only=True)[['ANNUAL_RT','GROSS']]
Average annual pay and average gross pay per department in Baltimore City Government
salary_df.groupby('DEPTID').mean(numeric_only=True)[['ANNUAL_RT','GROSS']]
Number of unique job titles in Baltimore City Government
salary_df['JOBTITLE'].nunique()
Number of departments in Baltimore City Government
salary_df['DEPTID'].nunique()
List of unique job titles in Baltimore City Government
salary_df['JOBTITLE'].unique()
List of unique departments in Baltimore City Government
salary_df['DESCR'].unique()
Employee count per job title in Baltimore City Government
salary_df['JOBTITLE'].value_counts()
Ten most common jobs in Baltimore City Government
salary_df['JOBTITLE'].value_counts().head(10)
Ten least common jobs in Baltimore City Government
salary_df['JOBTITLE'].value_counts().tail(10)
Employee count per department in Baltimore City Government
salary_df['DEPTID'].value_counts()
Top ten departments by employee count in Baltimore City Government
salary_df['DEPTID'].value_counts().head(10)
Bottom ten departments by employee count in Baltimore City Government
salary_df['DEPTID'].value_counts().tail(10)
... for Salary data in Baltimore City Government
plt.style.use('bmh')
sns.set_style('whitegrid')
plt.figure(figsize=(15,10))
salary_df['ANNUAL_RT'].plot.hist(bins=15,alpha=.8)
plt.title("Histogram showing how many employees gets this annual pay",fontsize=22)
plt.xlabel("Annual pay in Dollars",fontsize=16)
plt.ylabel("Number of employees",fontsize=16)
plt.show()
plt.figure(figsize=(15,10))
salary_df['GROSS'].plot.hist(bins=15,alpha=.8)
plt.title("Histogram showing how employees gets this gross pay",fontsize=22)
plt.xlabel("Gross pay in Dollars",fontsize=16)
plt.ylabel("Number of employees",fontsize=16)
plt.show()
import numpy as np
salary_in_thousands_df = (salary_df['ANNUAL_RT']/1000).to_frame()
freq_count = []
for i in range(0, 26):
    freq_count.append(0)
for salary in salary_in_thousands_df['ANNUAL_RT']:
    freq_count[int(salary/10)] += 1
freq_percent = []
for i in range(0, 26):
    freq_percent.append(round(freq_count[i]/len(salary_df['ANNUAL_RT'])*100, 2))
freq_df = pd.DataFrame({'freq':freq_percent})
plt.figure(figsize=(15,10))
freq_df['freq'].plot(kind='line')
plt.xticks(np.arange(1,27))
plt.title("Percentage of population getting a certain Annual pay in 10000s",fontsize=22)
plt.xlabel("Annual pay in 10000s of Dollars",fontsize=16)
plt.ylabel("Percentage of population",fontsize=16)
plt.show()
"""
Next, the frequency percentage list is reversed.
Let's find the index at which to split, which will allow computing the salary
of the top-paid 0.01% of employees in the Baltimore City Government.
"""
count_percent = len(freq_percent)
temp_copy = freq_percent.copy()
new_percent_list = []
for i in range(count_percent):
    new_percent_list.append(temp_copy.pop())
cumulative_sum_list = np.cumsum(new_percent_list, dtype=float)
print([round(cmsum,2) for cmsum in cumulative_sum_list])
To find the total salary of the top-paid employees, let's sum salaries up to index 2 of the reversed list, i.e., from index 25 of the original frequency list.
split_index = 25
freq_cumulative_sum_list_count = 1
top_1_percent_salary = 0
employees_in_top_1_percent = 0
bottom_99_percent_salary = 0
employees_in_bottom_99_percent = 0
for salary in salary_in_thousands_df['ANNUAL_RT']:
    if int(salary/10) >= split_index:
        top_1_percent_salary += salary*1000
        employees_in_top_1_percent += 1
    else:
        bottom_99_percent_salary += salary*1000
        employees_in_bottom_99_percent += 1
print("In total, the top", str(cumulative_sum_list[freq_cumulative_sum_list_count]), "percent earns:",
      '${:,.2f}'.format(top_1_percent_salary),
      "\nand the bottom", str(100-cumulative_sum_list[freq_cumulative_sum_list_count]), "percent earns",
      '${:,.2f}'.format(bottom_99_percent_salary),
      "in the Baltimore City Government.")
print("\nOn average, each top", str(cumulative_sum_list[freq_cumulative_sum_list_count]), "percent employee earns:",
      '${:,.2f}'.format(top_1_percent_salary/employees_in_top_1_percent),
      "\nand each bottom", str(100-cumulative_sum_list[freq_cumulative_sum_list_count]), "percent employee earns",
      '${:,.2f}'.format(bottom_99_percent_salary/employees_in_bottom_99_percent),
      "in the Baltimore City Government.")
print("\nThe top", str(cumulative_sum_list[freq_cumulative_sum_list_count]), "to bottom",
      str(100-cumulative_sum_list[count_percent-split_index]), "income ratio in the Baltimore City Government is:",
      '{:,.3f}'.format((top_1_percent_salary/employees_in_top_1_percent) /
                       (bottom_99_percent_salary/employees_in_bottom_99_percent)))
The income ratio between the top 0.01% and the bottom 99.99% is 4.406.
The Baltimore City Government pays fairly well and does not show huge income inequality compared with the ratio of 18.9 for the whole state.
salary_df[['ANNUAL_RT','GROSS']].plot(kind='kde',figsize=(15,10))
plt.title('Density plot for Annual Pay and Gross Pay',fontsize=22)
plt.xlabel('Salary in Dollars',fontsize=16)
plt.show()
Average annual pay for the top ten most common jobs in Baltimore City Government
common_job_filter = salary_df['JOBTITLE'].value_counts().head(10).keys()
ten_most_common_job = salary_df[salary_df['JOBTITLE'].isin(common_job_filter)] \
    .groupby('JOBTITLE').mean(numeric_only=True).dropna().sort_values(['ANNUAL_RT'], ascending=True)
plt.figure(figsize=(15,10))
plt.barh(ten_most_common_job.index.values, ten_most_common_job['ANNUAL_RT'])
plt.title('Average annual pay of the top ten most common jobs',fontsize=22)
plt.xlabel('Average annual pay in Dollars',fontsize=16)
plt.ylabel('Ten most common jobs',fontsize=16)
plt.show()
Not surprisingly, higher-ranked officials, for example a Police Sergeant, make more money than a Police Officer.
Jobs requiring less specialized training and experience, for example a crossing guard, pay less.
Average annual pay for departments that are the top ten employers in Baltimore City Government
top_dept_filter = salary_df['DESCR'].value_counts().head(10).keys()
ten_most_common_dept = salary_df[salary_df['DESCR'].isin(top_dept_filter)] \
    .groupby('DESCR').mean(numeric_only=True).dropna().sort_values(['ANNUAL_RT'], ascending=True)
plt.figure(figsize=(15,10))
plt.barh(ten_most_common_dept.index.values, ten_most_common_dept['ANNUAL_RT'])
plt.title('Average annual pay for departments that are the top ten employers',fontsize=22)
plt.xlabel('Average annual pay in Dollars',fontsize=16)
plt.ylabel('Departments that are the top ten employers',fontsize=16)
plt.show()
Once more, employees in departments requiring more training or carrying potentially higher risk, for example the Police and Sheriff's departments, tend to make more money.
job_wise_head_count = salary_df.groupby('JOBTITLE').count()['NAME'].sort_values(ascending=False)
top_ten_head_count_job_wise = job_wise_head_count.head(10)
top_ten_head_count_job_wise
plt.figure(figsize=(15,10))
plt.barh(top_ten_head_count_job_wise.index.values, top_ten_head_count_job_wise)
plt.title('Top ten job titles with most employees',fontsize=22)
plt.xlabel('Employee count',fontsize=16)
plt.ylabel('Job Titles',fontsize=16)
plt.show()
The most common job title in Baltimore City Government is that of a Police Officer.
Given articles with headlines like "'Neighborhoods are crying out': Baltimore has highest homicide rate of U.S. big cities", the large police force is probably justified.
In most cases, with experience people tend to receive higher salaries.
Next, I will explore if this is true for the Baltimore City Government...
from datetime import datetime
def years_of_experience(d1):
    d1 = datetime.strptime(d1, "%m/%d/%Y %H:%M:%S %p")
    d2 = datetime.now()
    years_of_experience = (abs((d2 - d1).days)) // 365
    return int(years_of_experience)
"""
Looking at employees with more than 10 years of experience who are not hourly laborers
and who work in one of the top ten most common job titles.
"""
top_ten_jobs_filter = top_ten_head_count_job_wise.keys()
salary_df['YRS_OF_EXPERIENCE']=salary_df['HIRE_DT'].apply(years_of_experience)
salary_df = salary_df[salary_df['YRS_OF_EXPERIENCE'] > 10]
salary_df = salary_df[salary_df['JOBTITLE'].isin(top_ten_jobs_filter)]
salary_df = salary_df[salary_df['JOBTITLE'] != 'Laborer Hourly']
plt.figure(figsize=(15,10))
plt.scatter(x=salary_df['YRS_OF_EXPERIENCE'], y=salary_df['ANNUAL_RT'], cmap='coolwarm', marker='o', s=20)
plt.title('Years of experience vs Annual pay',fontsize=22)
plt.xlabel('Years of experience',fontsize=16)
plt.ylabel('Annual pay in Dollar',fontsize=16)
plt.show()
salary_df.corr(numeric_only=True)
plt.figure(figsize=(15,10))
sns.heatmap(salary_df.corr(numeric_only=True), cmap='coolwarm', annot=True)
plt.title('Correlation heatmap for salary and years of experience', fontsize=22)
plt.show()
There is low correlation between the generated field YRS_OF_EXPERIENCE and annual pay. This is likely because our assumption that hire date reflects actual experience is untrue.
print('Total Time taken:',time.time() - start_time,'seconds')
While working on this project I learned how to explore data, find anomalies in datasets, clean up data, and fill in missing data points where necessary.
I also learned how to apply a number of different visualization techniques and use them to characterize the inherent properties of an unknown dataset.
I learned that data headers can be misleading and invite prejudgments, but visualizations can help us correct our inferences.
My key conclusion from the project is: the Baltimore City Government is a fair wage payer and does not create drastic income inequality.
I observed that both annual and gross pay exhibit "long-tail" characteristics, i.e., there are few "very high" or "very low" salaries. This led me to the next logical step of determining the ratio of average income for the top 0.01% to the bottom 99.99% of employees. This came out to 4.406, which seems fairly low compared to the ratio of 18.9 for the whole state.
Other conclusions drawn from the analysis and characterization: