This article describes the free-to-use, 70,000-response data set from the Colorado Learning Attitudes about Science Survey for Experimental Physics (E-CLASS). You can find this data set here: https://github.com/Lewandowski-Labs-PER/eclass-public

Across social science fields there has been a replication crisis. Many studies, when revisited, cannot produce the same result (e.g., within physics education research (PER), it has been demonstrated that courses taken, not grades as previously reported, are most predictive of whether or not a student remains in a physics degree program). In some cases, if data were freely available, studies at different institutions could be compared to see whether the results replicate. Data sharing also encourages the long-term preservation of data, which maintains data integrity, and the preserved data can serve as a training tool for future scientists. Free and open data also encourage conversations around specific research questions, and results can be re-analysed with different methods to establish them further. Within the field of physics education research, there are essentially no large, free, and publicly available data sets for research.
We have created a large (70,000-response) data set using the Colorado Learning Attitudes about Science Survey for Experimental Physics (E-CLASS). These data cover 133 universities, 599 unique courses, and 204 instructors, and were collected between 2016 and 2019. The survey assesses the attitudes of students in physics courses towards solving physics problems with laboratory skills. In this post we introduce the data set and the Python library made to interact with it. We encourage you to download the data set and play around with it yourself!
E-CLASS Data Set
The E-CLASS was developed to help instructors and PER researchers measure the impacts of different lab course implementations and interventions. It was developed to address a large variety of learning goals that can be roughly categorized as exploring students’ epistemology and expectations of experimental physics. To be able to address such a large range of goals, the survey was not designed to measure just one or a few latent factors. Additionally, the survey was designed to measure students’ progression of ideas as they move from introductory courses to more advanced-level courses. To achieve this, many questions are directed at either the introductory or advanced level. Thus, we stress that although one can consider an “overall” E-CLASS score, the real power of the assessment comes from examining responses to individual questions, and in particular, ones that align with a particular course’s learning goals.
The E-CLASS Dataset contains 39505 responses to the presurvey and 31093 responses to the post survey. Students can, in some cases, respond to the survey more than once. Thus, there are a total of 35380 unique responses to the presurvey and 28282 unique responses to the post survey. In this case, “unique” is defined as the first response to the pre or post survey. The dataset represents 133 unique universities, 204 unique instructors, and 599 unique courses. The dataset contains data for students in both introductory and "Beyond the First Year" (BFY) courses. The total data collected per semester have increased over the course of the data collection period as shown in the figure below.
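The "first response" definition of a unique response can be sketched with pandas. This is only an illustration on toy data: the column names `student_id`, `submitted`, and `q01` are hypothetical and are not taken from the E-CLASS files, and this is not necessarily how the published counts were produced.

```python
import pandas as pd

# Toy stand-in for a presurvey table. Column names are hypothetical
# and are NOT the names used in the E-CLASS CSV files.
responses = pd.DataFrame(
    {
        "student_id": ["a", "b", "a", "c", "b"],
        "submitted": pd.to_datetime(
            ["2017-01-10", "2017-01-11", "2017-01-12", "2017-01-13", "2017-01-15"]
        ),
        "q01": [4, 2, 5, 3, 1],
    }
)

# Sort chronologically, then keep only the first row seen for each student.
unique_responses = (
    responses.sort_values("submitted")
    .drop_duplicates(subset="student_id", keep="first")
    .reset_index(drop=True)
)

print(len(responses), "total responses")
print(len(unique_responses), "unique responses")
```

Here students `a` and `b` each responded twice, so the five total responses reduce to three unique ones.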

The E-CLASS itself is made up of 30 Likert-style questions to assess student epistemologies and expectations in comparison to experts. Students are asked to respond to each statement (from strongly agree to strongly disagree) both from their own view and as a prediction of the view of experimental physicists. In some cases, the expert-like response is disagree. The data have been preprocessed to convert the Likert responses, so that all data are on a five-point scale from non-expert-like (indicated by 1 in the dataset) to expert-like (indicated by 5 in the dataset). All of the research to date has been done by first collapsing the five-point scale to a three-point scale, but the full range is included in the public dataset. We warn researchers that the survey was not designed to reliably distinguish between the two outermost points on either end of the scale (i.e., “agree” and “strongly agree” or “disagree” and “strongly disagree”). Additionally, on the post survey only, students are asked which items (23 out of the 30) were important for earning a good grade in the course. Finally, students were also asked a set of demographic, interest, and career plan questions.
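Collapsing the five-point scale to three points, as described above, amounts to merging the two outermost values on each end. A minimal sketch follows. The labels -1, 0, and +1 for the collapsed scale are an assumption made for illustration, and the names `COLLAPSE` and `collapse_likert` are hypothetical rather than part of the DataHelper library.

```python
# Map the five-point scale (1 = non-expert-like, 5 = expert-like) onto
# three points. The -1 / 0 / +1 labels are an illustrative assumption.
COLLAPSE = {1: -1, 2: -1, 3: 0, 4: 1, 5: 1}


def collapse_likert(value):
    """Collapse a single 1-5 response to -1, 0, or +1."""
    return COLLAPSE[value]


raw = [1, 2, 3, 4, 5, 5, 2]
collapsed = [collapse_likert(v) for v in raw]
print(collapsed)
```

Because the mapping is a plain dictionary, the same collapse can be applied to a pandas column with `Series.map(COLLAPSE)`.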
DataHelper Python Library
In addition to the dataset itself, we also wrote a library to help researchers access the data. While the raw data are in CSV format across several tables, users will often want to reduce the data set in certain ways, such as keeping only students who attended introductory courses or only students who responded to both the pre and post surveys. Below we show some code examples of how to do this.
The DataHelper library requires pandas and uses pandas paradigms to organize the data.
To import the data set we only need to use the following code:
import DataHelper
e = DataHelper.eclass_data()
This code imports the library, then creates the full dataset in the e object. This e object can then be used to access the underlying data. For example, if we want to print the total number of presurvey responses we could use the functions e.get_buffy_pre() and e.get_intro_pre().
print('total number of pre responses:', e.get_buffy_pre().shape[0] + e.get_intro_pre().shape[0])
>>> total number of pre responses: 39505
These functions return pandas dataframes that contain only the BFY data and the intro data, respectively. Thus, taking the first element of each dataframe's shape gives us its row count.
If we want to get only the matched data we can perform a similar operation:
print('Number of matched responses intro:', e.get_intro_matched().shape)
print('Number of matched responses bfy:', e.get_buffy_matched().shape)
>>> Number of matched responses intro: (19445, 175)
>>> Number of matched responses bfy: (3096, 175)
Here, the functions e.get_intro_matched() and e.get_buffy_matched() perform the matching operation for us.
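To give a sense of what a matching operation involves, here is a minimal sketch using a pandas inner merge on toy data. It is not the DataHelper implementation, which may work differently. The column names `student_id` and `q01` are hypothetical.

```python
import pandas as pd

# Toy pre and post tables. Column names are hypothetical and are NOT
# the names used in the E-CLASS CSV files.
pre = pd.DataFrame({"student_id": ["a", "b", "c"], "q01": [2, 4, 3]})
post = pd.DataFrame({"student_id": ["b", "c", "d"], "q01": [5, 3, 4]})

# An inner merge keeps only students who appear in both tables.
matched = pre.merge(
    post, on="student_id", how="inner", suffixes=("_pre", "_post")
)

# With matched rows, a pre-to-post shift can be computed per student.
matched["q01_shift"] = matched["q01_post"] - matched["q01_pre"]

print(matched.shape)
```

Only students `b` and `c` appear in both toy tables, so the matched dataframe has two rows, with one set of question columns for the presurvey and one for the post survey.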
There are many example notebooks in the GitHub repository.
You can find the data set and DataHelper library here: https://github.com/Lewandowski-Labs-PER/eclass-public
This post summarizes the paper https://journals.aps.org/prper/abstract/10.1103/PhysRevPhysEducRes.17.020144