Oversampling with a Non-Minority Class

Using the SMOTENC methodology as intended, and in a possibly unintended way

Rushil Sheth
Towards Data Science


Balancing Act ~ lawyers and datasets; Image from Robert J. Debry Associates

One of my friends recently decided to apply to law school, a grueling process that takes years of preparation. Taking the LSAT is daunting on its own, and it is only the first step in the long slog of rounding up recommendations and making every application school-specific.

That’s when I decided to build a predictor that estimates the probability of being rejected, waitlisted, or accepted at a specific law school. It helped her, and will hopefully help others decide where to apply and where to invest their energy and money (law school apps aren’t cheap!).

I scraped my data from a non-profit site that lets previous law school applicants enter their statistics and outcomes at each school. My predictors from this source are undergraduate GPA, LSAT score, URM status, and work experience. It is important to note that all of this data is self-reported.

Law schools are often judged by their rank, which can change each year. Rankings are based on many variables, including graduation rate, median LSAT score of the incoming class, job placement success, and median undergraduate GPA of the incoming class. Every year, law schools adjust their admission standards. I took this into account when creating my predictor by “weighting” the most recent year’s data more heavily than that of previous years.

In theory this sounds like a great idea, but I had some trouble with the Python implementation. What was the best way to weight the most recent year? I treated the 2019–2020 admission cycle as my test data and wanted to weight 2018–2019 more heavily than the other previous years.

First, I simply replicated the 2018–2019 rows and found a lift in F-measure over my baseline model.

# Duplicate the 2018-2019 cycle rows; pd.concat replaces the now-removed DataFrame.append
school_df = pd.concat([school_df, df_cycle18], ignore_index=True)

I decided, however, that this wasn’t giving the model any additional information. I wanted to create genuinely new, feasible data, so I first considered MCMC but eventually settled on SMOTE. To use SMOTE to increase the most recent year’s observations, I had to restructure the problem.

The admissions cycle is now the target, and my new predictors are GPA, LSAT, URM, and work experience AND the decision (the old target: admitted, rejected, or waitlisted). I specified which cycle to oversample by building a dictionary and passing it to SMOTE’s sampling_strategy parameter:

import numpy as np
from imblearn.over_sampling import SMOTE

# Keep every cycle at its current size, then double the most recent one
smt_dict = {}
for cycle in np.unique(smote_cycle_y):
    smt_dict[cycle] = smote_cycle_y[smote_cycle_y == cycle].shape[0]
smt_dict[np.max(smote_cycle_y)] = smt_dict[np.max(smote_cycle_y)] * 2

oversample = SMOTE(sampling_strategy=smt_dict)
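The snippet stops at building the sampler, so to be explicit: the resampling itself is a single fit_resample call. Here smote_cycle_X is a name I’m assuming for the restructured feature matrix; it isn’t defined in the original code.

# Assumed usage: smote_cycle_X holds GPA, LSAT, URM, work experience,
# and the old decision target as features
X_cycle_res, y_cycle_res = oversample.fit_resample(smote_cycle_X, smote_cycle_y)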

Then I took a look at my resampled data and realized that SMOTE interpolates between observations, so by default it treats every feature as continuous. It had turned my categorical decision variable (accepted, rejected, or waitlisted) into floats, and my previously well-defined three-class problem now had far more than three classes. As a quick fix I rounded these floats back to 0, 1, or 2, which did surprisingly well. I needed a better solution, however.
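For reference, the quick fix amounted to something like the following sketch; the frame and column names (X_cycle_res, decision_numeric) are my assumptions, not the original code.

# Assumed quick fix: snap the interpolated decision values back onto
# the three valid classes 0, 1, 2
X_cycle_res['decision_numeric'] = (
    X_cycle_res['decision_numeric'].round().clip(0, 2).astype(int)
)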

Next, I went searching for a SMOTE variant that could handle categorical features, and debated writing a custom class before I came across the answer: SMOTENC. SMOTENC lets you declare which features are nominal (categorical) and which are continuous, hence the “NC”. Oversampling the most recent year was now seamless:

from imblearn.over_sampling import SMOTENC

# Same counting logic as before, but now the categorical feature
# columns are flagged explicitly by index
smt_dict2 = {}
for cycle in np.unique(smt_cycle_y):
    smt_dict2[cycle] = smt_cycle_y[smt_cycle_y == cycle].shape[0]
smt_dict2[np.max(smt_cycle_y)] = smt_dict2[np.max(smt_cycle_y)] * 2

oversample = SMOTENC(sampling_strategy=smt_dict2,
                     categorical_features=[0, 3, 4], random_state=7)
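One step the snippets leave implicit: after oversampling by cycle, the data has to be folded back into its original shape, with decision as the target again. Roughly, and with assumed variable and column names (smt_cycle_X, decision_numeric):

# Resample by cycle, then restore decision as the target (names assumed)
X_weighted, cycle_weighted = oversample.fit_resample(smt_cycle_X, smt_cycle_y)
y_train = X_weighted['decision_numeric']
X_train = X_weighted.drop(columns=['decision_numeric'])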

The results were great and boosted the F-measure once again. I then used SMOTENC on the minority class of my original problem as well, which helped tremendously.

# Back to the original framing: decision is the target again
smote_2_y = df_resample_train['decision_numeric']

# Bring every decision class up to the size of the largest class
samp_dict_2 = {}
max_class = np.max(smote_2_y.value_counts())
for decision_class in np.unique(smote_2_y):
    samp_dict_2[decision_class] = max_class

oversample_2 = SMOTENC(sampling_strategy=samp_dict_2,
                       categorical_features=[2, 3])
X_fin, y_fin = oversample_2.fit_resample(smote_2_X, smote_2_y)
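The post doesn’t show the final model, so here is a hedged sketch of how the metrics below might be computed; the RandomForestClassifier and the held-out X_test/y_test split are my assumptions, not necessarily what the original model used.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, log_loss

# Assumed classifier, trained on the fully resampled data and scored
# on the held-out 2019-2020 cycle
clf = RandomForestClassifier(random_state=7).fit(X_fin, y_fin)
print('F-measure:', f1_score(y_test, clf.predict(X_test), average='macro'))
print('Log loss:', log_loss(y_test, clf.predict_proba(X_test)))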

The F-measure increased, log loss decreased, and the overall model improved.

Hopefully this article was helpful and showed you the power of oversampling. To read more about the SMOTE methodology, check out the documentation here and this great explanation. Here are the code repo and a link to the working law school predictor app (made using Streamlit!).

If this article was helpful to you, I’d love to hear about it! Reach out to me on LinkedIn.
