
Automating Scientific Data Analysis Part 4:

How to Derive, Validate, and Document Regressions from the Complete Data Set

Generating Regressions in the Automated Data Analysis Process

This is the sixth and final article in a series teaching you how to automate scientific data analysis. The first article introduced the concept, provided motivation, and described the general process. The second, third, and fourth articles described how to control data sets, identify the conditions in each test, process the data, and check the data for errors. The fifth article described how to store the data in a logical manner to facilitate generating regressions. This final article will teach you how to develop regressions from those results, and how to validate and document the resulting models.

Generating Regressions from Stored Data

The final topic to discuss in Python-based automation of laboratory data analysis is that of generating and validating regressions from the stored data. This is often the ultimate goal of a laboratory data analysis project, but there are still several things to think through before declaring the project complete. When generating regressions, the following topics must be considered:

  • How to determine the optimal order of the regression,
  • Creating 1-dimensional regressions with numpy,
  • An introduction to scikit-learn for creating n-dimensional regressions,
  • Validating models, and
  • Visualizing the results of the regression and validation.

This section occasionally uses the example of regressions predicting the performance of heat exchangers. For an introduction to that technology, and the testing and data analysis used to model them, see Part 2.

Determining the Optimal Regression Structure

The order of the regression has a dramatic impact on the prediction accuracy of the resulting model. Higher order regressions are much more flexible, giving them the ability to statistically match a data set more closely than lower order regressions, or to match more complex data sets. However, that added flexibility can lead to models that statistically appear to fit the data well, but make dramatically inaccurate predictions between the measured points in the data set. Lower order regressions are less flexible, and often cannot match the data as closely, but they provide more stable predictions. This can mean lower accuracy at the measured points, but higher accuracy in between.

The difference is driven by the number of data points in the data set: higher order regressions need more data points to remain stable. Figures 1 and 2 show this effect plainly. Both figures show 2nd, 3rd, and 4th order regressions. Figure 1 shows the regressions created with a thorough data set. Notice that there are no dramatic differences between the regressions. The 4th order regression fits the data set a little closer than the 3rd, and the 3rd slightly better than the 2nd. But they are all quite close, and since their trends closely follow the measured data, they can be trusted to accurately predict intermediate values. In this case, the 4th order regression is clearly the one to choose.

Figure 1: Example Regression Created from a Thorough Data Set

The same cannot be said of the regressions in Figure 2. Figure 2 shows regressions created using a subset of the same set of measured data. In this case, the 4th order regression again matches the measured data points better than the 2nd or 3rd order regressions. However, there’s a combination of too much complexity and not enough data for the higher order regressions. As they attempt to match the shape of the curve presented by the limited data set, they become less stable. This can be observed in how wavy the 3rd, and especially the 4th, order regressions are. While the 4th order regression clearly matches the measured data the best of the three, it cannot be trusted to interpolate within the bounds of the data set. This is especially true at flow rates above 3.5 gal/min.

Figure 2: Regressions Created from a Limited Data Set
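As a minimal sketch of how such a comparison might be generated, the snippet below fits 2nd, 3rd, and 4th order polynomials to a small, hypothetical data set using numpy.polyfit and numpy.poly1d (both discussed in detail later in this article), then evaluates each fit at a flow rate that falls between measured points. The data values are invented purely for illustration.

import numpy as np

# Hypothetical measured data; the values are invented purely for illustration
flow = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.5, 5.5, 7.5])
effectiveness = np.array([0.88, 0.82, 0.77, 0.73, 0.70, 0.66, 0.61, 0.58])

# Fit 2nd, 3rd, and 4th order polynomials to the same data
fits = {order: np.poly1d(np.polyfit(flow, effectiveness, order)) for order in (2, 3, 4)}

# Evaluate each fit at a flow rate between the measured points to compare stability
for order, fit in fits.items():
    print(order, fit(4.5))

Comparing the three predictions at an unmeasured point, alongside plots like Figures 1 and 2, is a quick way to see whether a higher order regression is interpolating sensibly or merely chasing the measured points.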

There are several possible solutions to this problem. They are, listed in order of preference, as follows:

  • Collect more data to create a better data set,
  • Split the data set into multiple sections and generate multiple lower order regressions, and
  • Attempt to identify the best regression from the currently available options.

Collecting More Data

Collecting more data yields a more stable regression by providing more points constraining it. When identifying new tests to improve the data set, specify tests that fill wide gaps in the current set. For instance, the reduced data set shown in Figure 2 would be much more complete if additional points were added to fill the gaps from 3.5 to 5.5 gal/min, and from 5.5 to 7.5 gal/min.

Splitting the Data Set

Splitting the data means identifying multiple sections that could each be described by a separate, smooth regression, splitting the data set into those sections, and creating an individual regression for each one. While this reduces the amount of data available to each regression, it gives each regression a smoother shape to match. As a result, it’s possible to get a smoother curve, and a better match, even with the smaller data set. In this case, splitting the data set to create one regression for 0.5–2 gal/min, and a second for 2.5–7.5 gal/min, yields much better results. Figure 3 shows the result of creating second and third order regressions using that split.

Figure 3: Regressions Created from a Limited Data Set, Split at 2 gal/min

The results show a much better match with the split data set than with the single data set. The split occurs at 2 gal/min and is evident from a sudden jump in the prediction from the 2nd order regression. This bump indicates increased error when using the 2nd order regression in the range from 2.0–2.5 gal/min. Aside from that region, the 2nd order regression matches the data set very closely. The 3rd order regression does not show the same uncertainty at the split point, and closely matches the data over the entire 0.5–7.5 gal/min range. This shows that splitting the data set at 2 gal/min, and creating separate regressions above and below that point, is very effective for this data set.
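A minimal sketch of this splitting approach is shown below, assuming the measured results are stored in a pandas data frame called Data with the column names used in the NumPy examples later in this article; the 2 gal/min split point matches Figure 3, and everything else is illustrative.

import numpy as np

# Split the data frame at 2 gal/min and fit a separate regression to each piece
Data_Low = Data[Data['Flow Rate (gal/min)'] <= 2.0]
Data_High = Data[Data['Flow Rate (gal/min)'] > 2.0]

Effectiveness_Low = np.poly1d(np.polyfit(Data_Low['Flow Rate (gal/min)'], Data_Low['Effectiveness (-)'], 2))
Effectiveness_High = np.poly1d(np.polyfit(Data_High['Flow Rate (gal/min)'], Data_High['Effectiveness (-)'], 2))

# Use whichever regression covers the requested flow rate
def Predict_Effectiveness(Flow_Rate):
    return Effectiveness_Low(Flow_Rate) if Flow_Rate <= 2.0 else Effectiveness_High(Flow_Rate)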

Identify the Best Option

The least effective approach, reserved for when it’s not possible to collect more data and splitting the data set doesn’t yield satisfying results, is to choose the best regression from the available options. For this data set, and the regressions presented in Figure 2, this means using the fourth order regression. The second order regression may be the smoothest of the three options, but it clearly overpredicts the effectiveness at low flow rates and underpredicts at high flow rates. The third order regression is a little closer, but also a little less stable. The fourth order regression matches both the data points and the trend very closely up to 3.5 gal/min. From 3.5 to 5.5 gal/min it will overpredict, and from 5.5 to 7.5 gal/min it will underpredict; however, inaccurate predictions in that range are common to all three options, and the fourth order regression is not much worse there than the third. When using this regression, it will be important to remember that it’s extremely accurate below 2.5 gal/min, and less trustworthy at higher flow rates.

Creating Regressions with NumPy

NumPy is a package of scientific computing tools available in Python. Among the many tools it provides is a toolset for creating 1-dimensional regressions. The two main tools are:

  • polyfit: This tool identifies the coefficients of a 1-dimensional, nth order polynomial fit to the x and y data provided by the user. It is called with the syntax "numpy.polyfit(Data_x, Data_y, Order)".
  • poly1d: This tool creates a polynomial function from user-specified coefficients and assigns it to a variable so that the regression can easily be referenced later. It is called with the syntax "numpy.poly1d(Coefficients)".

These tools can generate regressions from the results quite naturally if the results were stored in a manner consistent with the recommendations in Storing Intermediate Results for Later Analysis. The process uses the following steps:

  1. Open the data file containing the desired data, and store it in a data frame. For this example, the data frame will be called "Data."
  2. Use numpy.polyfit to identify the coefficients for the data set and store them to a variable. This can be done with code similar to the following:
Coefficients = numpy.polyfit(Data['Flow Rate (gal/min)'], Data['Effectiveness (-)'], 2)

In this case, the coefficients for a 2nd order regression with flow rate as the x data and effectiveness as the y data are stored to the variable "Coefficients". For the heat exchanger example, this data set has only a single flow rate, representing a situation with the same flow rate on either side of the device.

  3. Call numpy.poly1d with the coefficients to assign the regression as a function to a variable. For example, the coefficients from the previous step could be used to create a regression using the following:
Effectiveness_SecondOrder = numpy.poly1d(Coefficients)

Now the effectiveness of this heat exchanger can be estimated using a second order regression by calling the "Effectiveness_SecondOrder" function.

  4. The final step is calling the function to generate the desired regressed estimate. A few examples of how this could be used are as follows:
  • If using the regression to evaluate the result at a specific condition, call the function referencing that condition and assign it to a descriptive variable. For example, the effectiveness of this heat exchanger with the same 2 gallons/minute of flow on either side could be identified and saved with:
Effectiveness_2_GalPerMin = Effectiveness_SecondOrder(2)
  • Another possible use is filling a column of a data frame with estimated regressed values. This is extremely useful when validating models, as the regressed estimates can be stored in a data frame right next to the measured values and easily compared. This can be done with code similar to:
Data['Effectiveness, Estimated, Second Order (-)'] = Effectiveness_SecondOrder(Data['Flow Rate (gal/min)'])

  • A third possibility allows plotting of smooth regressed curves, and uses the NumPy "arange" function. numpy.arange creates a range of values as specified by the user. It is called with the syntax:

numpy.arange(Minimum Value, Maximum Value, Step)

For example, numpy.arange(0.5, 7.6, 0.1) creates a range starting at 0.5 and increasing in steps of 0.1 up to, but not including, 7.6 (so the final value is 7.5). This works well for plotting smooth curves, because it’s a quick way to evaluate the regression at a fine, regular interval. For instance, the following code will create x and y data sets for plotting the curve of this regression at an interval of 0.1 gal/min from a minimum of 0.5 to a maximum of 7.5 gal/min:

x_array = np.arange(0.5, 7.6, 0.1)
Effectiveness_array = Effectiveness_SecondOrder(x_array)
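Pulling these steps together, a minimal end-to-end sketch might look like the following. The file name Effectiveness_Results.csv is a placeholder assumption, and the column names follow the storage conventions described earlier in this series.

import numpy as np
import pandas as pd

# Step 1: Open the stored results and load them into a data frame
Data = pd.read_csv('Effectiveness_Results.csv')

# Step 2: Identify the coefficients of a 2nd order polynomial fit
Coefficients = np.polyfit(Data['Flow Rate (gal/min)'], Data['Effectiveness (-)'], 2)

# Step 3: Turn the coefficients into a callable regression
Effectiveness_SecondOrder = np.poly1d(Coefficients)

# Step 4: Use the regression at a single point, across the data frame, and on a smooth curve
Effectiveness_2_GalPerMin = Effectiveness_SecondOrder(2)
Data['Effectiveness, Estimated, Second Order (-)'] = Effectiveness_SecondOrder(Data['Flow Rate (gal/min)'])
x_array = np.arange(0.5, 7.6, 0.1)
Effectiveness_array = Effectiveness_SecondOrder(x_array)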

An Introduction to Scikit-Learn

Currently the best tool for creating n-dimensional, nth order regressions is the scikit-learn package. It’s built on NumPy, SciPy, and matplotlib, and is commonly used in the fields of data science and machine learning. It provides many different tools for regression, based on many different approaches. Those interested in learning about all of the capabilities of scikit-learn should consult its documentation, and be prepared to learn about many different types of regression techniques. An understanding of the fundamentals of data science would also be beneficial, and Andrew Ng has published an informative, free course on Coursera.org. Since a full discussion of data science and scikit-learn is entirely too large for this text, this section focuses on introducing some basic linear regression techniques.

Two very useful tools for creating linear regressions with scikit-learn are 1) sklearn.preprocessing.PolynomialFeatures and 2) sklearn.linear_model.LinearRegression.

PolynomialFeatures provides a set of matrix operations, which are used to massage the data into the form needed by LinearRegression. It creates a matrix consisting of all combinations of the input variables with degree less than or equal to that specified by the user. For example, a matrix created using PolynomialFeatures with two input variables (x and y) preparing a 2nd order model will have the terms [1, x, y, x², xy, y²]. This approach is sometimes referred to as "black-box modeling" because it does not attempt to understand the structure behind why the model behaves this way, and instead aims to create a matching regression based solely on mathematical techniques.
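A quick way to see this structure is to transform a single, made-up pair of inputs and inspect the result; the sketch below uses x = 2 and y = 3 with a 2nd order PolynomialFeatures object.

from sklearn.preprocessing import PolynomialFeatures

# Transform one made-up input pair (x = 2, y = 3) with a 2nd order model
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform([[2, 3]]))
# Prints [[1. 2. 3. 4. 6. 9.]], i.e. the terms [1, x, y, x², xy, y²]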

The input matrix can be created using the following steps:

  1. Create the matrix object by calling PolynomialFeatures and specifying the order of the regression. Note that scikit-learn uses the term "degree" instead of "order."
  2. Arrange the input data in the correct format for the upcoming linear regression. For use with PolynomialFeatures, this means a single matrix containing submatrices specifying the conditions for each test. For example, the input matrix with three tests and two input conditions will be of the form [[x_1, y_1], [x_2, y_2], [x_3, y_3]].
  3. Call the PolynomialFeatures fit_transform function on the input data, and use numpy.delete to remove the first column (the constant term, since LinearRegression fits its own intercept).

As an example, the following code generates the correct matrix for a 5th order regression with the two inputs ‘Hot Flow Rate (gal/min)’ and ‘Cold Flow Rate (gal/min)’. Note that the input data matrix is created using a for loop that reads values from the Effectiveness data frame.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=5)

# Build the input matrix, one [hot, cold] flow rate pair per test
All_Flow_Pairs = []
for i in Effectiveness.index:
    FlowRate_Hot = Effectiveness.loc[i, 'Hot Flow Rate (gal/min)']
    FlowRate_Cold = Effectiveness.loc[i, 'Cold Flow Rate (gal/min)']
    All_Flow_Pairs.append([FlowRate_Hot, FlowRate_Cold])

# Transform the inputs into polynomial terms, then drop the constant column
All_Flow_Pairs_Transformed = poly.fit_transform(All_Flow_Pairs)
All_Flow_Pairs_Transformed = np.delete(All_Flow_Pairs_Transformed, 0, axis=1)

LinearRegression is designed to accept this matrix directly and generate the specified regression. It provides four functions and attributes that are especially useful for this process. They are:

  • .fit: This function accepts the input data matrix, and a matching result matrix, and fits the regression specified in PolynomialFeatures to the data. This is the step that creates the regression.
  • .coef_: This attribute contains the coefficients describing the regression. Storing the value of .coef_ to a variable, and printing it at the end of the program, allows storage of the coefficients for later, external use.
  • .intercept_: This attribute contains the value of the regression when all inputs are set to 0.
  • .score: This function returns the r² value of the regression when compared to the data set used to create it. It provides a measure of how capable the regression is of predicting the input data set. This concept will be discussed more in the next section.

The following code continues the example from PolynomialFeatures, and shows how the functions of LinearRegression can create, document, and test a regression.

from sklearn.linear_model import LinearRegression

# Fit the regression to the transformed inputs and the measured effectiveness values
clf = LinearRegression()
clf.fit(All_Flow_Pairs_Transformed, Effectiveness['Effectiveness (-)'])

# Store the coefficients, intercept, and r² score for documentation and checking
n_Dimensional_Coefficients = clf.coef_
n_Dimensional_Intercept = clf.intercept_
n_Dimensional_Score = clf.score(All_Flow_Pairs_Transformed, Effectiveness['Effectiveness (-)'])
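Once the regression has been fit, it can also be used to predict the effectiveness at new conditions. Continuing the example above, the new flow rates (chosen arbitrarily here) must be passed through the same PolynomialFeatures transformation, with the constant column removed, before calling .predict:

# Predict the effectiveness for a new, hypothetical pair of hot and cold flow rates
New_Flow_Pair = poly.transform([[3.0, 4.5]])
New_Flow_Pair = np.delete(New_Flow_Pair, 0, axis=1)
Predicted_Effectiveness = clf.predict(New_Flow_Pair)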

Validating Models

The most important part of model development, and one that is entirely too elaborate to describe in complete detail in this conceptual text, is validation. Validation is the process of ensuring that the predictions generated by a model match the phenomena being modeled. This is a complex, multiple-stage process that is never really finished. The steps to perform a quality model validation are as follows:

  • Ensure that the model matches the data set used to create it,
  • Compare the model predictions to an additional data set, and
  • Publish a report detailing the strengths, weaknesses, and appropriate applications of the model.

Verifying Models

The first step is to compare the model predictions to the source data. This step is commonly referred to as "verification." The goal of this step is simple: ensure that the model matches the experimental data used to create it. In situations where a specific data set was created explicitly for building the model, as is discussed in this series, the error in the model should be extremely low. When performing this step, there are two general questions to answer:

  1. Does the model match the data set within an acceptable limit of error?
  2. What aspects of the data set are well/poorly represented by the model?

Typically, a model is considered to match the data set within an acceptable limit of error if the error is less than the uncertainty of the measurements used to collect the data. This threshold is used because it represents the limit of available knowledge: if the error between the model and the measurements is less than the uncertainty in the measurements, then there’s no way to know that the model is imperfect.
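As a minimal sketch of this check, assuming the measured and regressed effectiveness values are stored side by side in the data frame as described earlier, and assuming an illustrative measurement uncertainty of ±0.03, the comparison might look like:

# Flag any points where the model error exceeds the measurement uncertainty (±0.03 is illustrative)
Measurement_Uncertainty = 0.03
Data['Error (-)'] = (Data['Effectiveness, Estimated, Second Order (-)'] - Data['Effectiveness (-)']).abs()
Points_Outside_Uncertainty = Data[Data['Error (-)'] > Measurement_Uncertainty]
print(Points_Outside_Uncertainty[['Flow Rate (gal/min)', 'Error (-)']])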

Inevitably, the model will be better at some predictions than others. It’s important to identify and investigate these differences. They may reveal deficiencies in the model, and different modeling approaches may be needed to correct them. If they’re minor issues that don’t have a significant impact on the final results, then noting them in the documentation is sufficient. The section Determining the Optimal Regression Structure provided a few responses to these situations.

Validating Models

The second step in model validation is comparing the model predictions to a second data set. This is an important step because the first step didn’t really validate the model; it merely showed that the model is capable of matching the data set that it was mathematically forced to match.

The solution to this challenge is to include extra tests in the test plan, but not include them in the data set used to generate the regression. Keep them in a separate data set, created solely for the purposes of model validation. This way they have no impact on the regression, and provide additional points which can be used for the validation check.
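A minimal sketch of this separation, assuming each test in the results data frame has been tagged with a hypothetical 'Data Set' column marking it as either 'Regression' or 'Validation', might look like:

import numpy as np

# Keep tests tagged for validation out of the data used to generate the regression
Regression_Data = Data[Data['Data Set'] == 'Regression']
Validation_Data = Data[Data['Data Set'] == 'Validation']

# Fit the regression with the regression subset only, then check it against the validation subset
Effectiveness_SecondOrder = np.poly1d(np.polyfit(Regression_Data['Flow Rate (gal/min)'], Regression_Data['Effectiveness (-)'], 2))
Validation_Error = (Effectiveness_SecondOrder(Validation_Data['Flow Rate (gal/min)']) - Validation_Data['Effectiveness (-)']).abs()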

The tests in the validation data set should exercise the model as thoroughly as possible. Include tests in regions where the model is highly sensitive to changes in inputs, to ensure that it gets the predictions in those ranges correct. Include variation in all of the different variables in the model, to ensure that it accurately predicts the response to changes in those variables. Place tests in areas where there are large gaps between data points in the original data set, to ensure that the model remains stable in those regions, as was shown in Determining the Optimal Regression Structure.

During this validation phase, attempt to answer the same questions as in the first phase. Is the model accurate within measurement uncertainty? Are there areas where it’s more accurate than others? If it isn’t accurate within measurement uncertainty, can the model be improved? As before, any issues that can be improved should be improved, and any issues that can’t be improved should be documented.

Documentation

The final step of the model validation process is to publish the findings, while recognizing that the outcome is more complicated than a simple "valid" or "not valid" declaration. No model is valid in every way, or invalid in every way. Simply proclaiming a model to be invalid doesn’t reflect the value that even a limited model can provide, and proclaiming a model to be "valid" gives the impression that it predicts everything accurately, with no weaknesses.

The solution to this problem is straightforward: Instead of proclaiming the model to be valid, publish the strengths and weaknesses of the model. This provides users with a more accurate understanding of appropriate uses of the model, allowing them to decide when they should use the model and when they should not. It also provides documentation supporting future efforts to improve the model.

As an example, these articles have used the example of developing regressions predicting the steady state effectiveness of heat exchangers. That model was designed to predict the steady state effectiveness when the flow rates are between 0.5 and 7.5 gal/min. It is extremely accurate within that range, so long as the flow rate is identical on either side of the device. The model does not address flow rates outside of that range, or situations with differing flow rates on either side of the device. Also, it is designed specifically for steady state effectiveness and should not be used for predictions in transient simulations.

Conclusion

This series of articles has introduced you to all of the concepts necessary to write programs that automatically analyze your scientific data sets for you. You’ve learned about the following topics:

  • Some possible benefits of automating data analysis,
  • The structure of programs that automate data analysis,
  • How to structure data sets to enable automation,
  • How to identify the conditions of each data test,
  • How to split data files,
  • How to write a for loop iterating through each data file,
  • How to analyze the data in each file,
  • How to check the data in each file for errors,
  • How to store the data from each file so it’s ready for use generating regressions, and
  • How to generate, validate, and document the final regressions from your data set.

Hopefully these articles have helped you develop new skills, and hopefully you can use them to find more ease and more satisfaction in your career.

