This is my capstone project.
Please click through to explore my work.
In this project I put myself in the shoes of a loan issuer and manage credit risk by using past data to decide whom to lend to in the future. The text files contain complete data for all loans issued by XYZ Corp. from 2007 through 2015, including a default indicator, payment information, credit history, etc.
I divided the data into a training set (June 2007 – May 2015) and an out-of-time test set (June 2015 – December 2015). I used the training data to build the models/analytical solution and then applied them to the test data to measure the performance and robustness of the models.
The full dataset has 855,969 rows and 73 columns, and it poses several problems: many columns have missing values, some columns have the wrong data type for the analysis, and the date column is jumbled, which made it difficult to split the data into the required training (June 2007 – May 2015) and out-of-time test (June 2015 – December 2015) sets.
So I first treated the date column ('issue_d'): I split it into two separate columns, replaced the values as required, and then used the map function to join the split columns back into a single new column ('period').
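The split-map-join step above can be sketched as follows. This is a minimal example, assuming 'issue_d' holds strings in the Mon-YYYY format typical of this dataset; the sample values are hypothetical stand-ins for the real loan records.

```python
import pandas as pd

# Hypothetical sample rows; the real file has 855,969 of these.
df = pd.DataFrame({"issue_d": ["Dec-2015", "Jun-2007", "May-2015"]})

# Split 'issue_d' into month and year parts.
parts = df["issue_d"].str.split("-", expand=True)

# Map month names to zero-padded numbers so the result sorts correctly.
month_map = {m: f"{i:02d}" for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Join year and month back into one sortable 'period' column (YYYYMM).
df["period"] = (parts[1] + parts[0].map(month_map)).astype(int)

# Sort by 'period' and make it the index for slicing.
df = df.sort_values("period").set_index("period")

# Train: June 2007 – May 2015; out-of-time test: June 2015 – Dec 2015.
train = df.loc[:201505]
test = df.loc[201506:]
```

With the index sorted, `df.loc` label slicing is inclusive on both ends, which is what makes the May/June 2015 cut-off clean.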
Then I sorted the 'period' column and made it the index so the data could be sliced as required. After sorting, I treated the missing values and dropped the columns that were not relevant to the problem statement. I also converted columns to their proper data types, which is important: in pandas the default dtype for strings is object, and if a column contains NaN values alongside integers or floats, pandas treats the whole column as object.
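The dtype problem described above can be shown with a small sketch. The column name and values here are hypothetical; median imputation is just one simple choice for the fill step, not necessarily the one used in the project.

```python
import pandas as pd

# A numeric-looking column read in as object because of missing entries.
df = pd.DataFrame({"revol_util": ["45.2", "13.0", None, "77.5"]})
assert df["revol_util"].dtype == object  # stored as object, not float

# Convert to its proper numeric type; unparseable entries become NaN.
df["revol_util"] = pd.to_numeric(df["revol_util"], errors="coerce")

# Treat the remaining missing values (median imputation as one option).
df["revol_util"] = df["revol_util"].fillna(df["revol_util"].median())
```

After `pd.to_numeric` the column is float64, so numeric operations and model fitting work on it directly.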
Not all columns were useful for prediction, so I deleted the columns where the majority of values were missing or that were not relevant to the model. I also used RandomForestClassifier to select important features, and with a forward stepwise process I computed the AUC score at each step to judge whether the selected features were good enough for the model. I evaluated model accuracy with a confusion matrix and a classification report.
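One way to combine random-forest importance ranking with a forward stepwise AUC check is sketched below. The data here is a synthetic stand-in for the loan features and default indicator, and logistic regression is used as the scoring model at each step; the project's exact setup may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (features + default flag).
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank features by random-forest importance, most important first.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
ranked = np.argsort(rf.feature_importances_)[::-1]

# Forward stepwise: add one feature at a time and track the test AUC.
scores = []
for k in range(1, len(ranked) + 1):
    cols = ranked[:k]
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    scores.append(auc)
```

Plotting or inspecting `scores` shows where adding more features stops improving the AUC, which is the cut-off for the feature set.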
I applied different machine learning techniques to this problem, such as KNN classification and logistic regression.
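Fitting and comparing the two model types, with the confusion matrix and classification report mentioned earlier, can be sketched like this. Again the data is a synthetic placeholder; the scaling step is an assumption (KNN is distance-based, so unscaled features would distort it).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the prepared loan data.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Both models behind a scaler, since KNN depends on feature distances.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = confusion_matrix(y_te, pred)
    print(name)
    print(classification_report(y_te, pred))
```

Comparing the two confusion matrices (and the precision/recall lines in each report) shows which model handles the default class better on the out-of-time test data.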