Preserving Our Better Legislators with Data

Andres Gonzalez
Towards Data Science
7 min read · Sep 11, 2017


Do you have a favorite member of Congress? Even in an institution with remarkably low approval ratings, people tend to like their own federal representatives. My case is no different; I, too, grew to admire my local Representative, Ileana Ros-Lehtinen.

Beyond district concerns, I was particularly impressed with her legislative record on Latin American issues. After becoming the first Hispanic woman elected to Congress, she took an interest in foreign policy and joined the U.S. House Committee on Foreign Affairs. Her background as a Cuban exiled from the island by Castro colored her perspective on the issues, but it was never a hindrance to a perfect record of supporting peace and democracy in Latin America. As a Colombian, I noticed her efforts to help the country regain stability in the late '90s. Later, while working on Capitol Hill, I watched her fight for Congressional support against the budding Chavez/Maduro dictatorship in Venezuela. Similar actions appear in the Congressional Record regarding Nicaragua, Ecuador, and other countries.

Rep. Ros-Lehtinen is retiring after this session of Congress. As a country, we will sorely miss her passionate voice and gregarious personality on the political stage. More importantly, her retirement leaves the question of what her followers will make of her legacy. In this post, I will contribute my grain of sand to the preservation of her ideals, using data science.

This is retiring member Ileana Ros-Lehtinen. She has an even happier aura in person.

The project is designed to predict how a long-standing member of Congress would vote on future legislation once they retire. Using the bills she sponsored in the Committee during the past four sessions of Congress, I will build a prediction model that outputs a yes-or-no answer to whether Congresswoman Ros-Lehtinen would have supported a given bill, based on its text.

To leverage the text as features, I performed count vectorization on the bill text. This process required a lot of critical decisions, such as how much context is lost when lemmatizing the tokens before counting, and which stop words to add on top of the standard 'english' set in NLTK. Language that is standard across bills and belongs to the more procedural parts of bill text was incorporated into the stop words list where appropriate.
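As a rough sketch of that step, here is what the vectorizer looked like in scikit-learn. The extra procedural stop words listed are illustrative rather than my exact list, and the NLTK tokenizer/lemmatizer pairing is one reasonable setup, not the only one:

```python
# A minimal sketch of lemmatized count vectorization with bill-specific
# stop words added to NLTK's standard 'english' set.
# Requires the NLTK 'punkt', 'wordnet', and 'stopwords' downloads.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(doc):
    """Tokenize a bill and lemmatize each purely alphabetic token."""
    return [lemmatizer.lemmatize(tok.lower())
            for tok in word_tokenize(doc) if tok.isalpha()]

# Illustrative procedural terms that appear in nearly every bill.
procedural_terms = ['act', 'amended', 'congress', 'section', 'shall',
                    'secretary', 'united', 'states']
stop_words = stopwords.words('english') + procedural_terms

vectorizer = CountVectorizer(tokenizer=lemma_tokenizer,
                             stop_words=stop_words,
                             min_df=5)  # drop terms that appear in few bills
```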

The resulting data frame from the count vectorization will be modeled using logistic regression and XGBoost, with the intention of leaning on boosting methods if the signal is too weak.

Data Dictionary and Cleaning

The data was collected mainly from the now-deprecated (at least its data arm) Sunlight Foundation, the U.S. Government Publishing Office (GPO), and ProPublica.

From the Sunlight Foundation I scraped a list of all the bills referred to the U.S. House Committee on Foreign Affairs (the Committee) from 2009 to the present. In the scrape I kept several variables that help identify each bill, including potential targets for count vectorization such as the bill title and nicknames.

Using the bill identification information, I built a second scraper that iterates over the output of the first one and downloads each bill's text from the GPO.
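In spirit, that second scraper looked like the sketch below. The URL pattern and the `bills` structure are stand-ins for illustration; the real GPO endpoints and package names may differ from what I actually used.

```python
# A rough sketch of the GPO download step. The URL pattern is illustrative;
# `bills` is assumed to be the list of identifiers from the first scrape.
import time
import requests
from bs4 import BeautifulSoup

def fetch_bill_text(congress, bill_type, number, version='ih'):
    """Download one bill version as plain text (hypothetical URL pattern)."""
    package = f'BILLS-{congress}{bill_type}{number}{version}'
    url = f'https://www.govinfo.gov/content/pkg/{package}/html/{package}.htm'
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, 'html.parser').get_text()

texts = {}
for bill in bills:
    key = (bill['congress'], bill['bill_type'], bill['number'])
    try:
        texts[key] = fetch_bill_text(*key)
    except requests.HTTPError:
        texts[key] = None        # not every bill version is posted
    time.sleep(1)                # be polite to the server
```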

Finally, I used the Committee-specific bill information from the Sunlight Foundation to parse ProPublica's bill records and identify which entries in my set Rep. Ileana Ros-Lehtinen had sponsored or cosponsored. I chose sponsorship over votes because sponsorships are considerably more common: only a small fraction of the bills introduced ever come up for a vote. Sponsorship also shows a more personal commitment to the language and content of the proposed legislation.

After acquiring the necessary data, the next step was to prepare it for count vectorization. The bill text needed a lot of formatting work to correct whitespace and isolate the most useful and distinctive part of each bill. Prior to running the count vectorizer, the data was reduced to just the text and Ros-Lehtinen's sponsorship record.
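The cleaning amounted to a handful of regular expressions. The version below is a simplification (it assumes a dataframe `df` with the raw text in a `bill_text` column), but it captures the idea:

```python
# Simplified cleaning: collapse whitespace, cut everything before the
# enacting clause, and strip section numbering and non-word characters.
import re

def clean_bill_text(raw):
    text = re.sub(r'\s+', ' ', raw)            # normalize whitespace
    match = re.search(r'Be it enacted by the Senate and House of Representatives',
                      text, flags=re.IGNORECASE)
    if match:                                  # keep the substantive part
        text = text[match.end():]
    text = re.sub(r'SEC(TION)?\.?\s*\d+\.?', ' ', text, flags=re.IGNORECASE)
    text = re.sub(r'[^A-Za-z\s]', ' ', text)   # symbols and digits
    return text.lower().strip()

df['clean_text'] = df['bill_text'].apply(clean_bill_text)
```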

At this point I performed a train/test split and proceeded with the count vectorization of the data.
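Something like this, with the vectorizer fit on the training bills only so no vocabulary leaks in from the test set (`sponsored` is the response variable described further down):

```python
# Split first, then learn the vocabulary from the training bills only.
from sklearn.model_selection import train_test_split

X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['clean_text'], df['sponsored'],
    test_size=0.25, stratify=df['sponsored'], random_state=42)

X_train = vectorizer.fit_transform(X_train_text)   # vectorizer from earlier
X_test = vectorizer.transform(X_test_text)
```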

The following graph is just a word count; I'm sharing it because I found it funny that the most common words basically amount to telling the Secretary of State what to do.

Keeping SoS busy
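For the curious, the chart is nothing fancier than a column sum over the count matrix, roughly:

```python
# Sum each term's column in the sparse count matrix and plot the top terms.
import pandas as pd

counts = pd.Series(X_train.sum(axis=0).A1,
                   index=vectorizer.get_feature_names_out())
counts.sort_values(ascending=False).head(20).plot.barh(
    title='Most frequent terms in Committee bills')
```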

Feature and Model Selection

Given the scope of the project and the data I collected, I chose the following features on which to build the model.

The main features are the result of lemmatized count vectorization applied to the corpus of bill text. The body of text used had been stripped of the bill's preamble (which includes the sponsors and co-sponsors, as well as the process under which a bill is referred to committee) and of non-word elements such as symbols and section numbering. In addition to the count vectorization, I also used dummy variables to represent what sort of legislation each entry is (HR, HRES, S, etc.).
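Gluing the two feature sets together is straightforward with scipy's sparse `hstack`; the `bill_type` column name below is an assumption about how my dataframe was organized:

```python
# Combine the lemmatized counts with dummy variables for the bill type
# (HR, HRES, S, ...). `bill_type` is an assumed column in df.
import pandas as pd
from scipy.sparse import hstack

type_train = pd.get_dummies(df.loc[X_train_text.index, 'bill_type'])
type_test = (pd.get_dummies(df.loc[X_test_text.index, 'bill_type'])
             .reindex(columns=type_train.columns, fill_value=0))

X_train_full = hstack([X_train, type_train.values]).tocsr()
X_test_full = hstack([X_test, type_test.values]).tocsr()
```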

The response variable is, of course, a metric of Congresswoman Ros-Lehtinen's behavior. However, considering how rarely bills submitted to the Committee end up being voted on the floor, I had to use a different signal: sponsorship or co-sponsorship of a bill. The positive class thus rises to roughly 15% of the data. Using this criterion also keeps me from missing data points that occur in the form of voice votes. A voice vote is common for broadly agreed-upon legislation, and it is incumbent on the opposition to request the yeas and nays for individual vote positions to enter the Congressional Record.

The most aggressive decision I made to address class imbalance was to upsample my positive responses to match the negative ones. I did this by resampling with replacement, setting the parameters so the classes were evenly split at 2,150 entries each. After building the models on this balanced set, I will revert to the original set and run it through the model; the resulting report will be part of my model scoring.

A reminder of how, and in which order, to deal with highly unbalanced classes, courtesy of the one and only Chris Albon. If his website is not among your most visited, make it a thing.
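The upsampling itself is a few lines with scikit-learn's `resample`, applied to the training rows only; a sketch using the variable names from above:

```python
# Resample the sponsored bills with replacement until both classes are even
# (about 2,150 entries each in my case). Only the training data is touched.
import numpy as np
from sklearn.utils import resample

pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

pos_upsampled = resample(pos_idx, replace=True,
                         n_samples=len(neg_idx), random_state=42)
balanced_idx = np.concatenate([neg_idx, pos_upsampled])

X_balanced = X_train_full[balanced_idx]
y_balanced = y_train.iloc[balanced_idx]
```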

Equipped with my features and response variable, I built several classification models to see which algorithm captured the signal in my response variable best.

Results

Even with the many permutations of model and vectorizer, the models struggled to get an AUC score above 0.60. Pairing CountVectorizer and TfidfVectorizer with K-nearest neighbors, logistic regression, Bernoulli naive Bayes, multinomial naive Bayes, XGBoost, and random forests, I looked for the best-scoring models possible. Accuracy was always high, but per the confusion matrix and AUC score, the models kept skewing toward a very low count of positive predictions.
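The search looked roughly like the loop below: a text-only simplification (the bill-type dummies are left out for brevity) that pairs each vectorizer with each model and scores the pipeline by cross-validated ROC AUC:

```python
# Every vectorizer paired with every model, scored by cross-validated AUC.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

vectorizers = {'counts': CountVectorizer(stop_words='english'),
               'tfidf': TfidfVectorizer(stop_words='english')}
models = {'knn': KNeighborsClassifier(),
          'logreg': LogisticRegression(max_iter=1000),
          'bernoulli_nb': BernoulliNB(),
          'multinomial_nb': MultinomialNB(),
          'xgboost': XGBClassifier(),
          'random_forest': RandomForestClassifier(n_estimators=200)}

for vec_name, vec in vectorizers.items():
    for model_name, model in models.items():
        pipe = Pipeline([('vec', vec), ('clf', model)])
        auc = cross_val_score(pipe, X_train_text, y_train,
                              cv=5, scoring='roc_auc').mean()
        print(f'{vec_name} + {model_name}: AUC = {auc:.3f}')
```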

After the upsampling described above, my models' results improved. When I passed the original data through the models trained on the upsampled set, I found that the best balance of accuracy and computational efficiency came from running a logistic regression on TF-IDF-vectorized text. The XGBoost model showed promise similar to the logistic regression, but not enough to justify how slow it was.
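Put together, the winning combination and the scoring against the untouched split looked roughly like this, reusing `balanced_idx` from the upsampling step:

```python
# Final pipeline: TF-IDF features into a logistic regression, trained on the
# upsampled text and evaluated against the original, untouched test split.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

final_model = Pipeline([('vec', TfidfVectorizer(stop_words='english')),
                        ('clf', LogisticRegression(max_iter=1000))])
final_model.fit(X_train_text.iloc[balanced_idx], y_train.iloc[balanced_idx])

probs = final_model.predict_proba(X_test_text)[:, 1]
print('AUC:', round(roc_auc_score(y_test, probs), 3))
print(classification_report(y_test, final_model.predict(X_test_text)))
```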

The final cross-validated metrics for the chosen logistic regression model were 82% accuracy and 0.69 AUC. I am conflicted, since this falls slightly short of my stated goal of a 0.70 AUC score, but with accuracy well above my original goal, I feel the proof of concept is achieved and that further development with more appropriate natural language processing tools is warranted.

Model feature importances, for those curious about which terms mattered most.

Evaluation

The task of predicting voting behavior is tricky in parliamentary systems where votes are not always recorded. Moreover, legislation is usually drafted by staffers in time-honored, formulaic ways, which reduces how much text vectorizers can pick up about the voting member's feelings or style. Still, I believe there is enough signal in the text to characterize what sort of legislation a member will support.

In specific reference to my results, I am concerned about the low recall for the positive responses in my target variable. The model's recall of roughly 0.45 for bill sponsorship does cast a slight shadow over my results and assertions, but I am confident it is a shortcoming that other NLP tools will help correct.

Final results when the original data is run through the model trained on the upsampled dataset: a logistic regression on TF-IDF-vectorized text.

A follow-up analysis that would deepen the understanding of voting behavior based on legislation text is to run unsupervised learning algorithms such as LDA. Deploying this project's model and an LDA as a single integrated system is not feasible, but topic modeling will deepen the subject-matter expertise and provide insights for future model development.
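As a teaser of what that could look like, scikit-learn's `LatentDirichletAllocation` can be run on the same count matrix (raw counts, not TF-IDF, are the natural input for LDA):

```python
# Fit an LDA topic model on the count matrix and print the top terms per topic.
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X_train)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(f'Topic {topic_idx}: ' + ', '.join(top_terms))
```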

Final Thoughts

I will continue to work on this project to better predict voting and sponsoring behavior. I still have the rest of the current session of Congress to test the model further, specifically on Latin America-related bills, which is what this project was originally conceived to predict. I will also split the bills into subsets focused on specific issues and see whether my model works better on some of them.
