Feature Engineering using Machine Learning on Large Financial Datasets

Awhan Mohanty
Towards Data Science
7 min read · Sep 21, 2018


For data scientists and analysts working on large financial data sets in banks, figuring out the probability of credit default or bad debt is one of the most critical activities they perform. It is an important task because it defines the credit policies, risk management and credit operations that are the true competitive advantages of a bank selling credit products to prospective customers.

However, with large data sets this becomes an extremely judgement-based (and often inaccurate) call for analysts, with downstream financial impacts. Moreover, without eliminating the features that are not critical for decision making, even the most advanced machine learning algorithms become powerless, because they are being fed "nonsense" data.

Recent machine learning frameworks provide built-in algorithms that help data analysts deliver business insights to sales and operations teams, so they can take proactive action on customer acquisition as well as campaign management.

The process below has been developed using real customer data (public information) from Lending Club. Of course, this does not include confidential customer information such as names, addresses, contact details and social security numbers. For the purpose of this exercise we don't need any confidential customer information; the features provided in the public data set are good enough to arrive at our intended insights.

We will be using a Jupyter Notebook to write a short Python program covering the following activities —

  • Data Procurement
  • Data Exploration & Cleaning
  • Feature Importance
  • Plot & Visualizations

So let's begin.

Step 1 — Procure Data

We can use the Lending Club data available in the public domain, which contains real-world lending and credit default data points. Once the data (CSV files) has been downloaded, it can be uploaded to an appropriate folder for the Jupyter notebook.

Note: If you don't have Jupyter notebook, it is highly recommended to download and install Anaconda (https://www.anaconda.com/download/).

Step 2 — Data Exploration & Cleaning

Import necessary packages and read the data set
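A minimal sketch of the setup, assuming the downloaded Lending Club CSV has been saved as loan.csv in the notebook folder (the file name is illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Read the Lending Club file; low_memory=False avoids mixed-type warnings on wide files
df = pd.read_csv('loan.csv', low_memory=False)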

Drop non-relevant fields
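The exact columns to drop depend on the extract you downloaded; the identifier and free-text fields below are only illustrative examples of columns that carry no predictive signal:

# Drop identifier and free-text columns that don't help the model (illustrative list)
cols_to_drop = ['id', 'member_id', 'url', 'desc', 'title', 'zip_code']
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns])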

Do some basic explorations
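Basic exploration is usually just a look at the shape, column types and summary statistics, for example:

# Basic exploration: shape, column types and summary statistics
print(df.shape)
df.info()
print(df.describe())
df.head()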

For someone from this domain it is intuitive that employment length is one of the critical factors for credit management, yet the data for employment length is completely messed up. We need to clean that, as in the sketch below.
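One way to clean it, assuming the raw values look like '10+ years', '< 1 year' and 'n/a', is to keep only the number of years:

# Convert emp_length strings such as '10+ years' or '< 1 year' to integers
df['emp_length'] = (df['emp_length']
                    .str.extract(r'(\d+)', expand=False)
                    .astype(float))
df['emp_length'] = df['emp_length'].fillna(0).astype(int)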

Similarly, clean the information available for the term and interest rate fields.
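For the term column, which typically holds strings such as ' 36 months' and ' 60 months', a simple sketch is to keep just the number of months (the interest rate gets the same treatment further below):

# ' 36 months' / ' 60 months' -> 36.0 / 60.0
df['term'] = df['term'].str.extract(r'(\d+)', expand=False).astype(float)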

Convert the critical categorical values to relevant numbers using dummy variables.
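pd.get_dummies handles this; the column names below (grade, home_ownership, purpose, verification_status) are typical Lending Club categoricals and are listed here only as an example:

# One-hot encode the key categorical columns (example list)
cat_cols = [c for c in ['grade', 'home_ownership', 'purpose', 'verification_status'] if c in df.columns]
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)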

Now start working on the target feature, which is the loan status. Except for "Fully Paid" and "Current" customers, we can put all remaining customers in a delinquent status, and hence with a higher propensity to default.
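A sketch of that mapping: "Fully Paid" and "Current" become 0 (good standing) and everything else becomes 1 (delinquent, higher propensity to default):

# Binary target: 0 = good standing, 1 = delinquent or defaulted
good_status = ['Fully Paid', 'Current']
df['loan_status'] = np.where(df['loan_status'].isin(good_status), 0, 1)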

The interest rate field also seems to be completely messed up and needs to be cleaned.
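Assuming the interest rate is stored as a string with a trailing '%' sign, stripping it and casting to float is enough:

# '13.56%' -> 13.56
df['int_rate'] = df['int_rate'].astype(str).str.rstrip('%').str.strip().astype(float)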

Now we have fairly clean data on which we can run the algorithms and get some decent results. Before that, it is a good idea to back up the clean file in CSV format for future reference and offline reporting.
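A one-liner takes care of the backup (the file name is just an example):

# Save the cleaned frame for future reference and offline reporting
df.to_csv('lending_club_clean.csv', index=False)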

Let’s do some basic exploration to investigate the data properties using some plotting techniques.

# Distribution of loan amounts
plt.figure(figsize=(10,6))
plt.hist(df['loan_amnt'], histtype='bar', align='mid', color='red')
plt.xlabel('loan_amnt')

# Interest rate distributions split by loan status
plt.figure(figsize=(10,6))
df[df['loan_status']==0]['int_rate'].hist(alpha=0.5, color='blue', bins=30, label='loan_status=0')
df[df['loan_status']==1]['int_rate'].hist(alpha=0.5, color='orange', bins=30, label='loan_status=1')
plt.legend()
plt.xlabel('int_rate')
plt.show()

Step 3 — Feature Importance using Random Forests

This is the most important step of this article, highlighting the technique to figure out the top critical features for analysis using random forests. It is extremely useful for evaluating the importance of features in a machine learning task, particularly when we are working with a large number of features. In other words, this is an advanced stage of data cleaning that removes non-essential data which does not contribute in any meaningful way to our target feature.

# Use random forests to rank the importance of features
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

X = np.array(df.drop('loan_status', axis=1))
y = np.array(df['loan_status'])

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Feature ranking:
1. feature 20 (0.124970)
2. feature 24 (0.074578)
3. feature 23 (0.070656)
4. feature 19 (0.067598)
5. feature 18 (0.066329)
6. feature 16 (0.055700)
7. feature 17 (0.054894)
8. feature 25 (0.053542)
9. feature 0 (0.044944)
10. feature 1 (0.044642)
11. feature 2 (0.044474)
12. feature 4 (0.043504)
13. feature 105 (0.027864)
14. feature 3 (0.019966)
15. feature 106 (0.015091)
16. feature 21 (0.015081)
17. feature 104 (0.012508)
18. feature 121 (0.012266)
19. feature 107 (0.008324)
20. feature 111 (0.006282)
21. feature 22 (0.006171)
22. feature 108 (0.005095)
23. feature 110 (0.003950)
24. feature 112 (0.003154)
25. feature 49 (0.002939)
26. feature 118 (0.002719)
27. feature 109 (0.002582)
28. feature 7 (0.002469)
29. feature 77 (0.002402)
30. feature 56 (0.002310)
31. feature 72 (0.002296)
32. feature 79 (0.002294)
33. feature 55 (0.002291)
34. feature 52 (0.002280)
35. feature 66 (0.002280)
36. feature 70 (0.002266)
37. feature 6 (0.002249)
38. feature 84 (0.002242)
39. feature 65 (0.002238)
40. feature 68 (0.002235)
41. feature 9 (0.002233)
42. feature 67 (0.002215)
43. feature 58 (0.002212)
44. feature 51 (0.002210)
45. feature 71 (0.002208)
46. feature 14 (0.002203)
47. feature 15 (0.002199)
48. feature 62 (0.002198)
49. feature 69 (0.002175)
50. feature 57 (0.002153)
51. feature 60 (0.002148)
52. feature 45 (0.002145)
53. feature 5 (0.002130)
54. feature 83 (0.002115)
55. feature 12 (0.002114)
56. feature 73 (0.002112)
57. feature 82 (0.002102)
58. feature 85 (0.002088)
59. feature 50 (0.002071)
60. feature 33 (0.002038)
61. feature 10 (0.001880)
62. feature 78 (0.001859)
63. feature 59 (0.001828)
64. feature 120 (0.001733)
65. feature 63 (0.001660)
66. feature 27 (0.001549)
67. feature 32 (0.001509)
68. feature 61 (0.001481)
69. feature 117 (0.001448)
70. feature 64 (0.001417)
71. feature 11 (0.001401)
72. feature 115 (0.001389)
73. feature 8 (0.001366)
74. feature 119 (0.001329)
75. feature 13 (0.001158)
76. feature 80 (0.001037)
77. feature 100 (0.000837)
78. feature 76 (0.000823)
79. feature 97 (0.000781)
80. feature 116 (0.000766)
81. feature 113 (0.000741)
82. feature 99 (0.000681)
83. feature 81 (0.000679)
84. feature 44 (0.000572)
85. feature 26 (0.000486)
86. feature 40 (0.000466)
87. feature 102 (0.000458)
88. feature 43 (0.000440)
89. feature 42 (0.000400)
90. feature 98 (0.000398)
91. feature 38 (0.000395)
92. feature 101 (0.000348)
93. feature 41 (0.000345)
94. feature 48 (0.000344)
95. feature 35 (0.000343)
96. feature 103 (0.000343)
97. feature 39 (0.000342)
98. feature 37 (0.000333)
99. feature 34 (0.000305)
100. feature 47 (0.000272)
101. feature 53 (0.000272)
102. feature 46 (0.000271)
103. feature 36 (0.000264)
104. feature 31 (0.000135)
105. feature 54 (0.000131)
106. feature 75 (0.000103)
107. feature 30 (0.000052)
108. feature 29 (0.000045)
109. feature 74 (0.000034)
110. feature 28 (0.000004)
111. feature 114 (0.000000)
112. feature 93 (0.000000)
113. feature 91 (0.000000)
114. feature 90 (0.000000)
115. feature 89 (0.000000)
116. feature 96 (0.000000)
117. feature 88 (0.000000)
118. feature 87 (0.000000)
119. feature 95 (0.000000)
120. feature 86 (0.000000)
121. feature 94 (0.000000)
122. feature 92 (0.000000)

This step can take a considerable amount of time to run depending on the size of the data set, so have some patience :-)

All 122 features are now displayed in descending order of importance, and for a data analyst/scientist working in this domain it should not be a difficult task to remove the non-critical features.
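A sketch of how that pruning might look, keeping only the top 30 columns by importance (the cutoff is arbitrary here and would normally be chosen with domain knowledge or a validation set):

# Keep only the top-k features by importance (k is an illustrative choice)
k = 30
feature_names = df.drop('loan_status', axis=1).columns
top_features = feature_names[indices[:k]]
df_reduced = df[list(top_features) + ['loan_status']]
print(df_reduced.shape)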

Step 4 — Plot & Visualizations

Even though we have a list of important features, it is always advisable to provide a visual confirmation of this list for operational or management purposes.

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Since we have 122 features, the plot above looks messy and ugly. Let's explore just the 10 most critical features and see how the plot comes out.

# Plot the 10 most important features
p = importances[indices][:10]
q = indices[:10]
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), p, color="r", yerr=std[q], align="center")
plt.xticks(range(10), q)
plt.xlim([-1, 10])
plt.show()

With this we can more or less conclude that feature 20 is the most critical feature for determining loan default. You can adapt the plot to the requirements of your organization. You can also cut the number of features from "thousands" down to less than a hundred through this mechanism, which gives banks a powerful way to optimize their data before it is fed to sophisticated machine learning or deep learning algorithms.

Thanks for the read — if you found this article interesting and would like to stay in touch, you can find me on Twitter here or LinkedIn



Consulting Partner — Financial Services @ Wipro Ltd. Deep interests in Product Management, Machine Learning & Cognitive Intelligence.