For the end of master, I’m doing my internship at French Development Agency as a data analyst in Guarantee Division (GAR), which offers guarantees to banks providing loans to SMEs, thus limiting their risks.
In order to record the business operation of GAR, we have several databases. Thus, during this internship, my principal mission is according to the business procedure of GAR, merge all datasources into one with the process ETL (Extraction -Transfer-Load), then analyse data to get the valuable information, which contributes to our business.
Since I will finish it in two weeks, here I will share you its most important and interesting part – data analysis with machine learning algorithm. In consideration of the business of our division, the most important thing is not only establishing business relation with more and more banks, but also limiting the amount of compensation. Thus, I analysed the influence of different elements on total amount of compensation in R, with Linear Regression Model and Decision Tree.
Linear Regression Model
Model
As I mentioned above, the aim of analysis is to state the relationship between each different elements and total amount of compensation. So the model is as following.
The dependant variable represents the total amount of compensation in euro. The explanatory variable includes several control variables such as the loan amount in euro, the guarantee amount in euro, and dummy variables of countries, guarantee types, primary beneficiaries (our clients, i.e. banks), final beneficiaries (banks’ clients, i.e. loanees). Among the omitted factors represented by the error term, income of final beneficiaries may directly affect the loan amount.
Data
Since we focused on the total amount of compensation, the dataset “guarantee activation” has been used for analysing. In this dataset, we have 296 observations of 79 variables, I used 11 most important variables of them: total amount of compensation in euro, N° concours, guarantee type, country, primary beneficiary, N° primary beneficiary, final beneficiary, N° final beneficiary, loan amount in euro, guarantee amount in euro and exchange rate.
Moreover, 36 observations are excluded due to incomplete information and 94 observations are excluded because of zero compensation amount. Thus, the final sample consists of 166 observations instead of 296 observations with missing value. Besides, since the amount of compensation, the loan amount and the guarantee amount are pretty large, I reduced them by one thousand, which wouldn’t influence the result of analysis.
Data visualisation
Before analysing, let’s make some graphs to better understand our data.
In this graph, x-axis stands for the amount of compensation (in thousand euro), y-axis shows the name of country, legend indicates three guarantee types. So we can clearly find that banks in Cameroon demand more compensation than others which is mainly in the case of Portfolio Guarantee with the amount lower the 100 k€. Besides, for Single Deal Guarantee, the compensation amount is more various, but most of them are lower than 300 k€. Moreover, the compensation of Comprehensive Agreement is the least one, that might because this type of guarantee is no longer commercialised.
According to this graph, we can get similar idea as the previous one.
This graph shows that with the increasing of loan amount, the compensation increases, which is the same as we expect.
Result
I show the most important coefficients as following:
As Table 1 shows, according to the p-values which are under the significant level, I got the variables that have significant effects on the total amount of compensation. The country Maurice, the primary beneficiary BOA Benin, Bank of Mascarene and MCB Ltd have significant effects on total amount of compensation at 1%. At 5% level, the primary beneficiary ALIOS has a significant impact on total amount of compensation and at 10% level, the beneficiary Credit of Senegal and the loan amount also have a significant impact on it.
Some of the variables above have positive effects on the amount of compensation. Ceteris paribus, signing one contract with BOA Benin increases 179.6 thousand euros of total compensation. Signing one contract with Bank of Mascarene leads to 679.9 thousand more euros, ceteris paribus. Thanks to the estimation analysis, one contract with MCB Ltd increases total compensation by 609 thousand euros. Besides, there are other two variables whose effects are not that outstanding. Ceteris paribus, one contract with Credit of Senegal results in 37.87 thousand euros’ compensation. In addition, a €1000-increasing in the loan amount augments compensation amount by 17.05%.
Moreover, we can observe that the model fits well the data since the value of R2 is 89.82%, which is pretty high. However, the p-value and R2 are not enough. In order to check whether this linear model is really confidential, I firstly checked if the observations are independent with following codes and got the graph below:
According to this graph, we can observe that there is no correlation between fitted values and residual, which means the observations are independent. Then, I checked if all errors are normally distributed with the same variance and mean equals to 0.
This graph compares the quantiles of the residuals to the quantiles of the normal distribution. It presents approximately a line, which indicated all errors are normally distributed.
The statistical analysis by R tells us which bank has higher probability of demanding the compensation, which means we should pay attention on it. For instance, as the result of analysis, Bank of Mascarene has a significant effect on total amount of compensation at 1%, so we may pay attention on this bank and consider the rationality of our agreement with it.
Decision Tree
In this part, the algorithm CART(Classification And Regression Tree) is used.
Firstly, the amount of compensation is divided into 3 groups(A, B and C),
according to the quantile. Then, all data are separated into training set and
test set, with the ratio 3/1. Next, in order to create a Decision Tree, we
need to use library(rpart)
and library(rpart.plot)
. Now, let’s see some
details!
Regression Tree
The Regression Tree can be created by the function rpart()
with method =
"anova"
, then use rpart.plot()
to get the following graph.
Classification Tree
Similar as Regression Tree, Classification Tree is created by the function
rpart()
but with method = "class"
, and use rpart.plot()
to get its graph.
Moreover, we can check the prediction with test set and find the error rate is 7.9%. Besides, according to the following confusion matrix, we can get accuracy is 92%, which is pretty high.
If you are interested in the R script, please check it on my Github, all propositions are welcome!
Reference
- freephotocc, “notebook laptop macbook conceptual”, pixabay.com. [Online]. Available: https://pixabay.com/photos/notebook-laptop-macbook-conceptual-1280538/