HR Analytics: Job Change of Data Scientists

Looking at the categorical variables, experience and being a full-time student appear to be good indicators. HR Analytics Job Change of Data Scientists, by Priyanka Dandale, Nerd For Tech (Medium). To summarize our data, we created a correlation matrix to see whether and how strongly pairs of variables were related. As that matrix (and many other plots we examined) showed, some of our data is imbalanced. MICE (Multiple Imputation by Chained Equations) is a multiple imputation method and is generally better than a single imputation method such as mean imputation. In this project I want to explore people who join data science training from the company and their interest in changing jobs or becoming data scientists at the company. Classification models (CART, RandomForest, LASSO, RIDGE) identified the following three variables as significant for an employee's decision to leave or stay with the company. (Difference in years between previous job and current job.) This dataset is designed to understand the factors that lead a person to leave their current job, for HR research as well. It involves using model(s) to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors affecting the employee's decision. Recommendation: This could be due to various reasons. People with more experience (11+ years) are probably good candidates to screen for when hiring for training, as they are more likely to stay and work for the company. There is also a need to explore why people with less than one year or 1-5 years of experience are more likely to leave. All data come from the personal information trainees provide when they register for the training.
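The chained-equations imputation mentioned above can be sketched with scikit-learn's IterativeImputer, which is inspired by MICE but returns a single imputed dataset by default. This is a minimal illustration on synthetic data, not the pipeline used in any of the notebooks; the array shapes, the 10% missing rate, and all variable names are assumptions made for the example.

```python
import numpy as np
# IterativeImputer is still experimental, so this import is required to expose it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic numeric data (assumption): column 2 depends on column 0,
# so the other columns carry information about its missing entries.
X = rng.normal(size=(200, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)

# Knock out roughly 10% of the entries at random.
X_missing = X.copy()
mask = rng.random(X.shape) < 0.1
X_missing[mask] = np.nan

# Each feature with missing values is modelled from the other features,
# cycling through the features for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print("missing before:", int(np.isnan(X_missing).sum()))
print("missing after:", int(np.isnan(X_imputed).sum()))
```

To get several imputed datasets, as in multiple imputation proper, one option is to rerun the imputer with sample_posterior=True and different random_state values.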
Many people sign up for the training. We have seen that experience is a driver of job change; maybe expectations are different? Please refer to the following task for more details. For the full end-to-end ML notebook with the complete codebase, please visit my Google Colab notebook. Generally, the higher the AUC-ROC, the better the model is at predicting the classes. For our second model, we used a Random Forest Classifier. Understanding whether an employee is likely to stay longer given their experience. The odds show that candidates with experience or enrolled in university tend to have higher odds of moving, and the weight of evidence shows the same for experience and university enrollment. We can see from the plot that people who are looking for a job change (target 1) are at least 50% more likely to be enrolled in a full-time course than those who are not (target 0). Information regarding how the data was collected is currently unavailable. The goal is to a) understand the demographic variables that may lead to a job change, and b) predict whether an employee is looking for a job change. Data set introduction. Abdul Hamid - abdulhamidwinoto@gmail.com. Hence there is a need to understand those employees better, through more surveys or more work-life balance opportunities, as new employees are generally people who are also starting a family and trying to balance their job with a spouse and kids. But first, let's take a look at potential correlations between each feature and the target. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates.
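As a rough illustration of scoring a Random Forest classifier with AUC-ROC, the sketch below trains on a synthetic, imbalanced binary dataset. The data, the roughly 75/25 class split, and the hyperparameters are assumptions for the example and are not taken from the original notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR data (assumption): about 75% class 0, 25% class 1.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)

# Stratify so both splits keep the same class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# AUC-ROC is computed from predicted probabilities of the positive class,
# not from hard 0/1 predictions.
val_scores = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_scores)
print(f"validation AUC-ROC: {auc:.3f}")
```

Because AUC-ROC depends only on how the scores rank the two classes, it is less sensitive to class imbalance than plain accuracy.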
February 26, 2021. We used the RandomizedSearchCV function from the sklearn library to select the best parameters. Around 73% of people have no university enrollment. Are there any missing values in the data? Information related to demographics, education, and experience is available from candidates' signup and enrollment. The baseline model achieves a 0.74 ROC AUC score without any feature engineering steps. There are many people who sign up. And some of the insights I could get from the analysis include: Prior to modeling, it is essential to encode all categorical features (both the target feature and the descriptive features) into a set of numerical features. https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. HR Analytics: Job Change of Data Scientists. Introduction. Anh Tran. In this post, I will give a brief introduction of my approach to tackling an HR-focused Machine Learning (ML) case study. Kaggle Competition - Predict the probability that a candidate will work for the company. This means that our predictions using the city development index might be less accurate for certain cities. To predict which candidates will change jobs, simple statistics are not enough; we need machine learning so that the company can categorize candidates into those who are and are not looking for a job change. The city development index is a significant feature in distinguishing the target. We hope to use more models in the future for even better efficiency! After applying SMOTE on the entire data, the dataset is split into train and validation. The feature dimension can be reduced to ~30 and still represent at least 80% of the information of the original feature space.
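A hyperparameter search with RandomizedSearchCV, as mentioned above, might look like the following sketch. The estimator, the parameter ranges, the number of sampled candidates, and the synthetic data are all assumptions chosen to keep the example small; the original search space is not given in the text.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset (assumption) so the search runs quickly.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical search space: each candidate is sampled from these distributions.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter settings
    scoring="roc_auc",   # select the candidate with the best mean AUC-ROC
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 3))
```

Random search evaluates a fixed number of sampled settings rather than every combination, which keeps the cost predictable when the search space is large.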
So I performed Label Encoding to convert these features into a numeric form. A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses conducted by the company. This Kaggle competition is designed to understand the factors that lead a person to leave their current job, for HR research as well. Insight: Lastnewjob is the second most important predictor of an employee's decision according to the random forest model. First, I'd like to take a look at how categorical features are correlated with the target variable. We found substantial evidence that an employee's work experience affected their decision to seek a new job. Metric Evaluation: Exploring the numerical features in the data: what is the correlation between the city development index and training hours? A more detailed and quantified exploration shows an inverse relationship between experience (in number of years) and perpetual job dissatisfaction that leads to job hunting.
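The label encoding step described above can be sketched as follows. The tiny data frame and its values are made up for illustration; only the column names gender, education_level, and training_hours come from the dataset discussed here. Keeping one fitted encoder per column makes it possible to decode the integers later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows (assumption), not real records from the dataset.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Other"],
    "education_level": ["Graduate", "Masters", "Graduate", "Phd"],
    "training_hours": [36, 47, 83, 52],
})

# Encode only the non-numeric columns; numeric columns are left untouched.
categorical_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

encoders = {}
encoded = df.copy()
for col in categorical_cols:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0..n_classes-1.
    encoded[col] = encoder.fit_transform(df[col].tolist())
    encoders[col] = encoder  # keep the fitted encoder for decoding later

print(encoded)
print(encoders["gender"].inverse_transform([1, 0, 1, 2]))
```

Label encoding imposes an arbitrary order on the categories, which tree-based models tolerate well; for linear models, one-hot encoding is a common alternative.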
Using these histograms, I checked the relationship between gender and education_level and found that most of the males had more education than the females. Then I checked the relationship between enrolled_university and relevent_experience and found that most candidates have experience in the field, and that those who are not enrolled in university have more experience. This content can be referenced for research and education purposes. For details of the dataset, please visit here. Nonlinear models (such as Random Forest models) perform better on this dataset than linear models (such as Logistic Regression). March 9, 2021. For instance, there is a disproportionately large population of employees who belong to the private sector. Data Source. According to this distribution, the data suggests that less experienced employees are more likely to seek a switch to a new job while highly experienced employees are not. This is in line with our deduction above. Synthetically sampling the data using the Synthetic Minority Oversampling Technique (SMOTE) results in the best performing Logistic Regression model, as seen from its highest F1 and Recall scores. We believe that our analysis will pave the way for further research on the subject, given its massive significance to employers around the world. Of course, there is a lot of work left to drive this analysis further if time permits. We conclude with our results and give recommendations based on them.
The model I created shows an AUC (area under the curve) of 0.75. What I wanted to see, though, are the coefficients produced by the model: they intuitively show that years of experience are one of the indicators of job movement as a data scientist. This needed adjustment as well. After a final check for remaining null values, we moved on to visualization. We see an imbalanced dataset: most people are not job-seeking. In terms of individual cities, 56% of our data was collected from only 5 cities. There has been only a slight increase in accuracy and AUC score from applying LightGBM over XGBoost, but there is a significant difference in the execution time of the training procedure. Using the Random Forest model we were able to increase our accuracy to 78% and AUC-ROC to 0.785. The pipeline I built for the analysis consists of 5 parts: After hyperparameter tuning, I ran the final trained model with the optimal hyperparameters on both the train and the test set to compute the confusion matrix, accuracy, and ROC curves for both. Variable 3: Discipline Major. Next, we converted the city attribute to numerical values using the ordinal encode function. Since our purpose is to determine whether a data scientist will change their job or not, we set the looking-for-job variable as the label and the remaining data as training data. Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. Determine a suitable metric to rate the performance of the model. For further recommendations, please check the Notebook.
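Converting the city attribute to numbers and separating the label from the features, as described above, could be sketched like this. It assumes scikit-learn's OrdinalEncoder stands in for the "ordinal encode function" and that the label column is named target; the rows and index values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative rows (assumption); the city names follow the dataset's city_<id> pattern.
df = pd.DataFrame({
    "city": ["city_103", "city_21", "city_160", "city_21"],
    "city_development_index": [0.90, 0.60, 0.90, 0.60],  # made-up values
    "target": [0, 1, 1, 0],
})

# OrdinalEncoder expects a 2D input, hence the double brackets.
# Categories are sorted as strings, so the codes carry no real ordering of cities.
encoder = OrdinalEncoder()
df["city"] = encoder.fit_transform(df[["city"]]).ravel()

# The label is the job-change indicator; everything else is a feature.
y = df["target"]
X = df.drop(columns=["target"])

print(encoder.categories_[0])
print(X)
```

Since the integer codes follow string order rather than any property of the cities, a model should not read meaning into their magnitude.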
HR can focus on offering jobs to candidates who live in city_160, because all candidates from this city are looking for a new job, and in city_21, because the proportion of candidates looking for a job change there is higher than the proportion who are not. HR can also develop data collection methods to obtain additional features for analysis and better data quality, helping data scientists build a better prediction model. This is therefore one important factor for a company to consider when deciding on a location to begin in or relocate to. Taking Rumi's words to heart, "What you seek is seeking you", life begins with discoveries and continues with becomings. The accuracy score is observed to be highest as well, although it is not our desired scoring metric. Juan Antonio Suwardi - antonio.juan.suwardi@gmail.com. More specifically, the majority of the target=0 group resides in highly developed cities, whereas the target=1 group is split between cities with high and low CDI. Choose an appropriate number of iterations by analyzing the evaluation metric on the validation dataset. Furthermore, we wanted to understand whether a greater number of job seekers came from developed areas. Once missing values are imputed, the data can be split into train and validation (test) parts and the model can be built on the training dataset. The number of STEM majors is quite high compared to the others. Introduction. This article presents the basic and professional tools used in Data Science fields in 2021. This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. Since SMOTENC, used for data augmentation, accepts non-label-encoded data, I need to save the fitted label encoders to decode the categories after KNN imputation. We calculated the distribution of experience among the employees in our dataset to better understand experience as a factor that impacts the employee's decision.
Goals: The dataset is imbalanced and most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. The approach to cleaning up the data had 6 major steps: Besides renaming a few columns for better visualization, there were no more apparent issues with our data. Well, personally I would agree with it. Explore people who join data science training from the company and their interest in changing jobs or becoming data scientists at the company. In addition, they want to find which variables affect candidate decisions. StandardScaler is fitted on the training dataset, and the same transformation is then applied to the validation dataset. So I finished by making a quick heatmap, which led me to conclude that the actual relationship between these variables is weak; that's why I always end up getting weak results. Further work can be pursued on answering one inference question: Which features are in turn affected by an employee's decision to leave their job or remain at their current job? However, according to a survey, it seems some candidates leave the company once trained. So I turned to using other variables to try to predict education_level, but first I had to make some changes to the data: I changed the gender and education level columns. Identify important factors affecting the decision to stay or leave using MeanDecreaseGini from the RandomForest model. Recommendation: The data suggests that employees who have been in the company for less than a year, or for 1 or 2 years, are more likely to leave than someone who has been in the company for 4+ years. Only label encode columns that are categorical. I also wanted to see how the categorical features related to the target variable.
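The scaling step above, fitting StandardScaler on the training data and reusing the same transformation for validation, can be sketched as follows. The synthetic features and the split ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic numeric features (assumption), e.g. something like training hours.
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training split only ...
X_train_scaled = scaler.fit_transform(X_train)
# ... and apply those same statistics to the validation split.
X_val_scaled = scaler.transform(X_val)

print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("val mean:  ", X_val_scaled.mean(axis=0).round(3))
```

The validation means will generally not be exactly zero, which is expected: the validation split is scaled with statistics it did not contribute to, so no information leaks from it into training.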
I made some predictions: I used city_development_index and enrollee_id to try to predict training_hours with linear regression, but I got a bad result. Using model(s) built on the current credentials, demographics, and experience data, you need to predict the probability that a candidate is looking for a new job or will work for the company, and interpret the factors affecting the employee's decision. Next, we need to convert categorical data to numeric format because sklearn cannot handle it directly. Let us first start by removing unnecessary columns, i.e., enrollee_id, as its values are unique, and city, as it is not very significant in this case. 75% of people's current employer are Pvt. As we can see here, highly experienced candidates are looking to change their jobs the most. StandardScaler removes the mean and scales each feature/variable to unit variance. If the company uses the old method, it needs to make offers to all candidates, which costs more money; HR departments also have limited time, so they can't ask all candidates one by one and will usually pick candidates at random. Agatha Putri Algustie - agthaptri@gmail.com. A violin plot plays a similar role to a box and whisker plot.
Recommendation: The data suggests that employees with a STEM discipline major are more likely to leave than those from other disciplines (Business, Humanities, Arts, Others). Which to me, as a baseline, looks alright :). Simple countplots and histograms of the features can give us a general idea of how each feature is distributed. The whole dataset is divided into train and test sets. The Colab notebooks for this real-world use case are available at my GitHub repository, or check here to learn how you can download data directly from Kaggle to your Google Drive and readily use it in Google Colab! Some of the features are numeric, others are categorical. city_development_index: Development index of the city (scaled). These are the 4 most important features of our model. At this stage, a brief analysis of the data will be carried out, as follows: At this stage, further analysis of the information will be carried out, as follows: At this stage, the data will be prepared and processed before being used for modeling, as follows: At this stage, the machine learning model will be built and optimized, as follows: At this stage, the decision making of the machine learning model will be explained, in the following ways: At this stage, we try to apply machine learning to solve the business problem and meet the business objective. As seen above, there are 8 features with missing values. I got -0.34 for the coefficient, indicating a moderate negative relationship, which matches the negative relationship we saw in the violin plot.
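A coefficient like the -0.34 reported above can be obtained as a Pearson correlation between a numeric feature and the 0/1 target. The sketch below assumes that is how the coefficient was computed and uses synthetic data in which a higher development index lowers the chance of a job change; it does not reproduce the original value.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic city development index in a plausible range (assumption).
cdi = rng.uniform(0.45, 0.95, size=n)

# Assumed relationship: probability of seeking a job change falls
# linearly from 0.7 at CDI 0.45 to 0.1 at CDI 0.95.
p_change = 0.7 - 1.2 * (cdi - 0.45)
target = (rng.random(n) < p_change).astype(int)

# Pearson correlation between a numeric variable and a binary one
# (equivalent to the point-biserial correlation).
r = np.corrcoef(cdi, target)[0, 1]
print(f"correlation between CDI and target: {r:.2f}")
```

A negative coefficient here means that candidates from more developed cities are, on average, less likely to carry the job-change label.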
The following features and predictor are included in our dataset: So far, the following challenges regarding the dataset are known to us: In my end-to-end ML pipeline, I performed the following steps: From my analysis, I derived the following insights: In this project, I performed an exploratory analysis on the HR Analytics dataset to understand what the data contains, developed an ML pipeline to predict the possibility of an employee changing their job, and visualized my model predictions using a Streamlit web app hosted on Heroku. I got my data for this project from Kaggle. Does the type of university education matter? First, the prediction target is severely imbalanced (far more target=0 than target=1). Therefore we can conclude that the type of company definitely matters in terms of job satisfaction, even though there is no apparent correlation between satisfaction and company size. The pipeline I built for prediction reflects these aspects of the dataset. Some notes about the data: the data is imbalanced, most features are categorical, some with high cardinality, and missing-value imputation can be part of the pipeline (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=sample_submission.csv). What is the effect of company size on the desire for a job change? For this dataset, we assume that the course is free video learning. RPubs link: https://rpubs.com/ShivaRag/796919. Classify the employees into staying or leaving categories using predictive analytics classification models. That's because I set the threshold to a relative difference of 50%, so that labels for groups with small differences won't clutter up the plot.
Introduction: Companies actively involved in big data and analytics spend money on employees to train and hire them for data scientist positions. The stackplot shows groups as percentages of each target label, rather than as raw counts. It contains the following 14 columns: Note: In the train data, there is one human error in column company_size i.e. A company engaged in big data and data science wants to hire data scientists from among people who have successfully passed its courses. Maybe job satisfaction? Hence, to reduce the cost of training, the company wants to predict which candidates are really interested in working for the company and which may look for new employment once trained. Each employee is described by various demographic features. Note that after imputing, I round the imputed label-encoded categories so they can be decoded as valid categories. This is the violin plot for the numeric variable city_development_index (CDI) and target. Before jumping into the data visualization, it's good to take a look at the meaning of each feature. We can see the dataset includes numerical and categorical features, some of which have high cardinality.
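The rounding step noted above, where imputed label-encoded categories are rounded so they decode to valid categories, can be sketched with scikit-learn's KNNImputer. The data frame, the neighbor count, and the clipping guard are assumptions for illustration rather than the author's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Illustrative rows (assumption) with one missing category.
df = pd.DataFrame({
    "education_level": ["Graduate", "Masters", None, "Graduate", "Phd", "Masters"],
    "training_hours": [36.0, 47.0, 40.0, 35.0, 90.0, 50.0],
})
col = "education_level"

# Fit the encoder on observed values only and keep it for decoding.
observed = df[col].notna()
encoder = LabelEncoder().fit(df.loc[observed, col].tolist())

# Encoded column: integer codes where observed, NaN where missing.
codes = np.full(len(df), np.nan)
codes[observed.to_numpy()] = encoder.transform(df.loc[observed, col].tolist())

# KNN imputation averages the codes of the nearest rows,
# so the imputed value may be fractional.
numeric = np.column_stack([codes, df["training_hours"].to_numpy()])
imputed = KNNImputer(n_neighbors=2).fit_transform(numeric)

# Round to the nearest integer code and clip to the valid range before decoding.
rounded = np.clip(np.rint(imputed[:, 0]), 0, len(encoder.classes_) - 1).astype(int)
df[col] = encoder.inverse_transform(rounded)
print(df)
```

Rounding an averaged code treats the categories as if they were ordered, which is a reasonable approximation for an ordinal feature such as education level but cruder for purely nominal ones.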
Predicting the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors affecting the employee's decision. https://github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb