
Predicting Essay Effectiveness through MLP + BERT

Utilizing categorical and numerical features alongside BERT to improve model performance

Writing effectively has always been key to communicating ideas, thinking critically, and persuading others, and it remains a necessary skill for success in the modern world. However, while global literacy levels are slowly increasing, the growth is not equal across countries and socio-demographic groups. For example, multiple studies have shown evidence of educational inequality in the United States. Left unaddressed, this phenomenon could ultimately result in a vicious cycle of widening income inequality, reflected in a rising Gini coefficient.

One solution to the problem would be enabling students to assess their own writing effectiveness. With automated guidance, students can evaluate their own writing, which allows them to learn more time-efficiently and with fewer dedicated resources. This solution could take the form of a web application where students upload their argumentative essays and receive an assessment of the essays' effectiveness.

The goal of this competition is to classify argumentative elements in student writing as "effective", "adequate" or "ineffective".

Dataset

The dataset was provided by Kaggle through Georgia State University. No data will be distributed in this article; it can be found on Kaggle.

Data Exploration

On the surface, the dataset provided contains only a single column of text. However, with contextual knowledge we can engineer multiple features that help the model perform better.

Citation

Every college student knows how important citation is in making an essay more effective. Therefore, we create a binary (True/False) feature that flags whether any "source" is detected in the text body.
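A minimal sketch of this feature, assuming the essay text lives in a pandas column named `discourse_text` (the column name and sample rows here are illustrative):

```python
import pandas as pd

# Illustrative rows; the competition data has one discourse element per row.
df = pd.DataFrame({"discourse_text": [
    "According to a source from NASA, the face on Mars is a natural landform.",
    "I think school should start later because students are tired.",
]})

# Binary citation feature: True if the word "source" appears anywhere in the text.
df["has_source"] = df["discourse_text"].str.lower().str.contains("source")
print(df[["has_source"]])
```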

It is evident that the effective essays have "source" present.

Spelling errors

I iterated over the corpus (using the Python library tqdm to track progress) and counted spelling errors in each text. Instead of making it a binary feature (spelling error: yes/no), I made it a numerical feature containing the number of spelling errors.
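A sketch of how such a count could be computed; the article does not name the spell-checking library, so using `pyspellchecker` here is an assumption, with tqdm only displaying progress over the corpus:

```python
from spellchecker import SpellChecker
from tqdm import tqdm

spell = SpellChecker()

def count_spelling_errors(text: str) -> int:
    # Naive whitespace tokenization, stripping surrounding punctuation.
    words = [w.strip(".,!?;:\"'()") for w in text.split()]
    words = [w for w in words if w.isalpha()]
    # spell.unknown() returns the set of distinct tokens not found in the dictionary.
    return len(spell.unknown(words))

texts = [
    "This essay is writen with severel spelling erors.",
    "This essay is written without spelling errors.",
]
# tqdm simply shows a progress bar while iterating over the corpus.
error_counts = [count_spelling_errors(t) for t in tqdm(texts)]
print(error_counts)
```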

While the median number of spelling errors is approximately the same across all target classes, the spread is clearly wider for ineffective essays.

Number of words

Effective essays have a relatively higher median word count compared to the other classes.
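This count can be derived with a simple whitespace split; a minimal sketch (again assuming an illustrative `discourse_text` column):

```python
import pandas as pd

df = pd.DataFrame({"discourse_text": [
    "A short claim.",
    "A much longer piece of evidence that cites a source and elaborates on the argument in detail.",
]})

# Word count per discourse element via a simple whitespace split.
df["num_words"] = df["discourse_text"].str.split().str.len()
print(df["num_words"])
```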

Polarity and subjectivity

Polarity and subjectivity are derived from the Python package TextBlob, which provides a simple NLP API. The polarity of the text is scored between -1 and 1, where -1 represents negative sentiment and +1 represents positive sentiment.

The subjectivity of the text is scored between 0 and 1, where 0 represents objectivity and 1 represents subjectivity. Based on the dataset, effective essays generally tend to have a slightly higher median than the other classes.
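A sketch of deriving both scores with TextBlob:

```python
from textblob import TextBlob

text = "I strongly believe this policy is a terrible idea."
blob = TextBlob(text)

# polarity in [-1, 1]: negative to positive sentiment
# subjectivity in [0, 1]: objective to subjective
print(blob.sentiment.polarity, blob.sentiment.subjectivity)
```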

Data Modelling

The features highlighted in orange are processed through the Multimodal Toolkit. A data preprocessing code snippet can be found below: I used one-hot encoding for the categorical features and the Yeo-Johnson transformation for the numerical features, since some of the numerical values may be very small or negative.
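A sketch of such a preprocessing step using scikit-learn; the feature names are illustrative, and the actual pipeline in the repo may differ:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

df = pd.DataFrame({
    "has_source": [True, False, True],            # categorical (binary) feature
    "num_spelling_errors": [3, 0, 7],             # numerical features
    "num_words": [120, 45, 310],
    "polarity": [0.12, -0.35, 0.40],
    "subjectivity": [0.55, 0.20, 0.61],
})

# One-hot encode the categorical feature.
ohe = OneHotEncoder(handle_unknown="ignore")
cat_features = ohe.fit_transform(df[["has_source"]]).toarray()

# Yeo-Johnson transformation handles small, zero, and negative values.
pt = PowerTransformer(method="yeo-johnson")
num_features = pt.fit_transform(
    df[["num_spelling_errors", "num_words", "polarity", "subjectivity"]]
)

tabular_features = np.hstack([cat_features, num_features])
print(tabular_features.shape)
```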

Defining Metrics

Since this is a multi-class classification problem, the results were evaluated using accuracy, F1 score, and training loss. The model achieved a marginal increase in accuracy (0.63 to 0.68) and a larger gain in F1 (0.46 to 0.59).
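As a sketch, both metrics can be computed with scikit-learn; the averaging scheme for F1 is not stated here, so macro averaging is an assumption, and the labels below are hypothetical:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical predictions over the three classes.
y_true = ["Effective", "Adequate", "Ineffective", "Adequate", "Effective"]
y_pred = ["Effective", "Adequate", "Adequate", "Adequate", "Ineffective"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```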

The loss curve shows convergence; however, it reaches a steady state after about 10k steps.

Conclusion

In the very first run of this competition, I took a vanilla BERT model and trained it on the corpus. However, training did not converge and the results were poor.

In the second iteration, I created additional features that provided context to the problem statement, much like the marking criteria or checklists teachers intuitively use when assessing essay effectiveness. This resulted in large improvements in model performance across all metrics. The results would be more interpretable if evaluated with multi-class log-loss, since the model outputs a probability for each class. For more information on performance evaluation, refer to my GitHub repo, where I uploaded both the EDA and modelling notebooks.

Another key improvement would be to include the discourse type as a feature. Without it, the model assesses a lead paragraph in the same way as a counterclaim or rebuttal.

Sources

[1] Dataset: https://www.kaggle.com/competitions/feedback-prize-effectiveness.

[2] Modelling: https://github.com/georgian-io/Multimodal-Toolkit
