More on Science, Truth, and Consistency

“Science,” according to “the scientific method,” operates in 4 steps:

  1. Build a model.
  2. Collect relevant data.
  3. Look for deviations between the model and the data.
  4. Go back to step 1, revise the model, and see if the deviations are reduced in the next round.
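Here is a toy sketch of that loop in Python; everything in it (the made-up quadratic “reality,” the polynomial candidate models) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: collect data from some "reality" (here a made-up quadratic law plus noise).
x = rng.uniform(-3, 3, size=200)
y = 1.5 * x**2 - x + rng.normal(scale=1.0, size=x.shape)

for degree in (1, 2, 3):                                 # Step 4: successive revisions
    model = np.polynomial.Polynomial.fit(x, y, degree)   # Step 1: build/revise the model
    deviations = y - model(x)                            # Step 3: model-vs-data deviations
    print(f"degree {degree}: mean squared deviation = {np.mean(deviations**2):.3f}")
```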

One might amend this four-step scheme slightly to arrive at “the engineering method,” where the focus is not on the generic deviations between the model and the data but on specific subsets of those deviations: you want to build a better mousetrap, even if it is lousier at catching rats, so the deviations with respect to mice are reduced further even at the cost of increasing the deviations with respect to rats, or something along those lines.
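A sketch with made-up numbers, just to illustrate the difference in objectives: the engineering criterion weights the deviations, so a model can win on the weighted score while losing on the unweighted one.

```python
# Made-up squared deviations for two hypothetical trap designs.
model_a = {"mice": 4.0, "rats": 4.0}   # decent at both
model_b = {"mice": 1.0, "rats": 9.0}   # better mousetrap, lousier rat trap

def total_deviation(m):
    # "Scientific" criterion: all deviations count equally.
    return m["mice"] + m["rats"]

def weighted_deviation(m, w_mice=0.9, w_rats=0.1):
    # "Engineering" criterion: we mostly care about catching mice.
    return w_mice * m["mice"] + w_rats * m["rats"]

for name, m in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: total {total_deviation(m):.1f}, weighted {weighted_deviation(m):.1f}")
# B is worse overall (10 vs 8) but better for the purpose at hand (1.8 vs 4.0).
```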

Step 3 pertains to consistency: greater consistency means fewer and smaller deviations. If Step 2 has been performed appropriately, which is to say that the (training) data is an unbiased (and sufficiently large) sample of reality, then a more consistent model is closer to that reality, i.e. it is “more true.” Ceteris paribus, a more consistent model is the better model.

But “holding all things equal” is the weasel phrase. Whenever the models are trained on different (training) data, all things are, by definition, not being held equal. Further, statisticians know that, in principle, there is an error-minimizing optimum for any sample: the sample itself, the ultimate overfitting. This is why textbook statistics discourages adding ever more parts to a model, and it is the rationale behind various validation techniques.
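To make the overfitting point concrete, a minimal sketch along the lines of the loop above (again with made-up data): as the model acquires more parts, the in-sample deviations keep shrinking, while a held-out sample, one simple validation technique, shows that the improvement eventually stops being real.

```python
import numpy as np

rng = np.random.default_rng(1)

# The same sort of made-up reality: a quadratic law observed with noise.
x = rng.uniform(-3, 3, size=60)
y = 1.5 * x**2 - x + rng.normal(scale=2.0, size=x.shape)

# A crude holdout split: 30 observations to fit, 30 to validate.
x_fit, y_fit = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

for degree in (1, 2, 5, 10):
    model = np.polynomial.Polynomial.fit(x_fit, y_fit, degree)
    mse_fit = np.mean((y_fit - model(x_fit)) ** 2)
    mse_val = np.mean((y_val - model(x_val)) ** 2)
    # In-sample error falls monotonically as the degree rises;
    # the holdout error stops improving and typically gets worse.
    print(f"degree {degree:2d}: in-sample MSE {mse_fit:6.2f}, holdout MSE {mse_val:6.2f}")
```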

The choice (even if unintentional) of training, testing, and actual application sets for the models, however, forces us back to Step 2. (I realize that there is no generic term, as far as I am aware, for an “application set,” at least not on par with training and testing sets. I think there should be.) If a model is found to be robust for a certain subset of the data, characterized by a set of conditions, then that is about as close to the “truth” as we can get, but only conditional on those conditions. If the knights in a certain forest say “ni,” and they do so consistently, then the model “knights in that forest say ‘ni’” is a good model, as long as you don’t try to apply it to all knights. We don’t know what a generic knight, unconditionally, says, except by extrapolation without any logical basis.

Put differently, making practical use of this model crucially depends on the interplay between Step 2 and Step 3. If we naively trained the model on data mostly from the knights in that certain forest, but tested it on data mostly from other forests, we would mistakenly reject it. So we should be careful when comparing training and testing sets: if the testing set seems to invalidate what you found in the training set, the problem may be that the testing set is genuinely different from the training set. For the subset of the testing set that is more like the training set, the model may still hold (and, likewise, for the subset of the training set that is more like the testing set, it may not have held in the first place). The (implicit) assumption that the training set and the testing set are drawn from the same population is part of the “wilful ignorance” we need to engage in (I love that turn of phrase by Weisberg!) in order to keep things simple. But we need to keep in mind that it is probably not true. Of course, even if the model is good, it is only good for the knights of the forest in question. Go to another forest, and the knights there will not know what “ni” means. When not in the Monty Python forest, don’t bother with the Monty Python model.
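A minimal sketch of that diagnostic, with a hypothetical “forest” label attached to every observation: instead of scoring the model on the testing set as a whole, score it separately on the slice that resembles the training data and on the slice that does not.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical observations: each knight has a forest label and either says "ni" (1) or not (0).
n = 1000
forest = rng.choice(["X", "Y", "Z"], size=n, p=[0.2, 0.4, 0.4])
says_ni = np.where(forest == "X", rng.random(n) < 0.95, rng.random(n) < 0.05).astype(int)

# The model, trained (let us say) mostly on forest-X knights: "every knight says 'ni'."
predicted = np.ones(n, dtype=int)

# A testing set drawn mostly from the other forests.
test = np.arange(n // 2, n)
wrong = predicted[test] != says_ni[test]

print("overall error rate on the testing set:", round(float(wrong.mean()), 2))

# Scored separately, the model still holds on the slice that looks like the training set.
in_x = forest[test] == "X"
print("error rate among forest-X knights:", round(float(wrong[in_x].mean()), 2))
print("error rate among the other knights:", round(float(wrong[~in_x].mean()), 2))
```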

So if the results from the training and testing sets don’t match, there are two possibilities:

  1. The samples are close enough → the model is not good (as per Step 3).
  2. The samples are different → the model may be good, but only conditionally (as per Step 2).

Both possibilities require further investigation. An experiment may or may not be necessary at this stage. What you really need, after uncovering the differences between the training and testing sets and the possibility that the model applies only to the knights in a certain forest, is more observations conditional on that forest (you already have plenty of observations from knights outside it). The question before you is simply whether P(“ni” | forest X) ≫ P(“ni” | not forest X). Perhaps you already have enough data to answer that question, at least provisionally; perhaps you don’t (now that you have used up your old testing set, you have nothing left to “test” it on). But there is no real need for a full-on experiment if you can get new data from forest X.
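With (hypothetical) new field notes from forest X in hand, the tally takes only a few lines; all the records below are made up.

```python
from collections import Counter

# Hypothetical field notes: (forest, utterance) pairs.
observations = [
    ("X", "ni"), ("X", "ni"), ("X", "ni"), ("X", "ekke ekke"),
    ("X", "ni"), ("X", "ni"), ("Y", "halt"), ("Y", "ni"),
    ("Y", "halt"), ("Z", "halt"), ("Z", "who goes there"), ("Z", "halt"),
]

counts = Counter((forest == "X", utterance == "ni") for forest, utterance in observations)

p_ni_in_x = counts[(True, True)] / (counts[(True, True)] + counts[(True, False)])
p_ni_elsewhere = counts[(False, True)] / (counts[(False, True)] + counts[(False, False)])

print(f"P(ni | forest X)     ≈ {p_ni_in_x:.2f}")      # 5/6 ≈ 0.83 in this made-up sample
print(f"P(ni | not forest X) ≈ {p_ni_elsewhere:.2f}")  # 1/6 ≈ 0.17 in this made-up sample
```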

Statistical identification of the “truth,” then, needs to be a somewhat nuanced process: we want to identify not merely “consistent” observations, unconditionally speaking, but “conditionally consistent” ones, and most of the hard work goes into the “conditionally” part. There may be some universally consistent truths that are observable without any conditionalities attached, but if so, they would be very easy to detect and no fancy “SCIENCE!” would be required to uncover them. When we are looking for conditional truths, however, sampling becomes the key: conditional truths apply to particular subsets of reality more than to others, and we will learn of them only when we find the right samples. If we don’t, we screw up and fail to see things that are there, simply because they are not everywhere.

PS. A related point that emerges from a very common experience (certainly for me, and presumably for others) is whether it is even possible for users of these models to keep track of how wrong their model is, when, and where. For the more traditional use of statistics, as a means of post facto analysis, it is easy: we have the data, which we don’t pretend is of use beyond being a record of “past” events, and once we fit the model, we can tell at once where the deviations are. For models that yield quantitative predictions coupled with recordkeeping, even if the “application data” does not yet exist when the model is created, the extent of the mismatch can be studied retrospectively. For various other applications, it is not clear whether it is even possible to review the models retrospectively. I get targeted ads, search results, and recommendations that are hilariously wrong all the time, and I imagine we all do. If I were on the other end, I would like some information on how wrong the algorithm was, for which audiences, and under what conditions. That information seems essentially impossible to get in a meaningful fashion. Occasionally, we do get the opportunity to report whether the findings/results/ads/recommendations are appropriate, admittedly, but I would not imagine the response rate is random across users and circumstances.
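The kind of recordkeeping I have in mind would look something like the sketch below (the field names and the tiny log are hypothetical): log each prediction together with the conditions under which it was served and, once the outcome is known, break the error rate down by condition rather than reporting a single overall number.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical prediction log: what was served, under what conditions,
# and (recorded afterwards) whether it turned out to be relevant.
log = StringIO("""\
user_language,ad_language,relevant
es,vi,0
es,es,1
hu,hu,1
es,vi,0
de,de,1
es,hu,0
""")

tally = defaultdict(lambda: [0, 0])  # (user_lang, ad_lang) -> [relevant, total]
for row in csv.DictReader(log):
    key = (row["user_language"], row["ad_language"])
    tally[key][0] += int(row["relevant"])
    tally[key][1] += 1

# Errors broken down by condition, not just overall.
for (user_lang, ad_lang), (relevant, total) in sorted(tally.items()):
    print(f"user={user_lang} ad={ad_lang}: wrong {total - relevant} of {total}")
```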

Sometimes the errors are obvious: I do not know any Vietnamese, so ads in Vietnamese are obviously wrong. But I do speak Spanish and can read German and Hungarian a bit (just examples, not necessarily a reflection of reality). So even if an ad in Hungarian might be “wrong” in the sense that I cannot read it well and was not looking for the information, it could serendipitously yield something relevant that I could make use of, even though I had not actively sought it. Often users do not themselves know what they are looking for. We do not know what the “right” translation of Pushkin’s poetry is (and, often, that is the point). “Right answers” often do not exist in a precisely defined form, except for those we conjure up under the mistaken assumption that they exist in conditions where they do not. (Again, Keynes’ beauty contest game is pertinent here: the shared expectation of what the players think the common standard of beauty is, rather than what they really consider “beauty,” tends to drive the “market.” What is more, this shared but mistaken belief is more consistent than the truth. Asking the players what “beauty” is makes no sense, however, since it has no clearly and universally defined answer.) Even if we know the matches between the users and the algorithmic outputs, then, quantifying the “errors” can be tricky. But most targeted ads, search results, and recommendations cannot even be meaningfully paired with whatever the user was looking for.

This seems especially troubling: if you don’t know how wrong your model is on contact with reality, and cannot estimate how the errors vary across conditions, how can you do science?