Saturday, May 30, 2015

Decision Analytics and Teacher Qualifications

Disclaimers:
  • This is a post about statistics versus decision analytics, not a prescription for improving the educational system in the United States (or anywhere else, for that matter).
  • tl;dr.

The genesis of today's post is a blog entry I read on Spartan Ideas titled "Is Michigan Turning Away Good Teachers?" (Spartan Ideas is a "metablog", curated by our library, that reposts other blogs by members of the Michigan State University community. The original post can be found here.) The focus of that post is on changes to the certification examination that would-be teachers in Michigan are required to pass. I'll quote a couple of key passages here, but invite you to read the full post to get the entire context:
Research has found that only about 8% of the differences in student achievement can be attributed to teachers and only 3% of that can be attributed to the combined impact of teachers’ certification, ACT/SAT scores, degrees, and experience.
...
Because teachers’ examination scores have been found to be only weak predictors of their impact on student learning, an assessment that has a low pass rate by design may prevent some who would be effective teachers from obtaining a teaching certificate, a concern that is supported by research.
(The link in the quote is a 2002 article in Education Next by Dan Goldhaber, senior research associate at the Urban Institute.)

My first reaction to the "weak" connection between teacher characteristics and learning outcomes is that it sounded like bad news for people on all sides of the current debates about educational reform. On the one hand, to the "blame the teacher" crowd (who like to attribute perceived problems in the public education system to poor or bad teachers, teacher tenure etc.), one might say that if teacher quality explains "only" 8% of variance in learning outcomes, quit picking on them and look elsewhere. On the other hand, to people (often affiliated with teacher unions) pushing for better compensation, better working conditions etc., one might point out that those are generally incentives to recruit and retain better teachers; so if teacher quality explains "only" 8% of variance in learning outcomes, perhaps those dollars are better spent elsewhere (upgrading schools, improving neighborhood economic conditions, ...).

What struck me second about the original post was the use of the phrases "only about" and "weak predictors". This seems to me to relate to a difference between statistics, as it is commonly taught (and used), and what some people now refer to as "decision analytics". In my experience, the primary focus of statistics (and its sibling "data analytics") is to identify patterns and explain things (along with making predictions). That makes measures such as correlation and percentage of dependent variance explained relevant. In contrast, decision analytics emphasizes changing things. Where are we now, where do we want to be, which levers can we pull to help us get there, how much should we pull each, and what will it cost us to do so? That perspective may put more emphasis on measures of location (such as means), and on which input factors provide us "leverage" (in the Archimedean sense of the term, not the regression sense), than on measures of dispersion (variance).

It is common, at least in the social sciences, to categorize predictors as "strong" or "weak" according to how much variation in the dependent variable they predict. This is the statistics perspective. I understand the attractiveness of this, particularly when the point of the model is to "explain" what happens in the dependent variable. At the same time, I think this categorization can be a bit dangerous from a decision analytics standpoint.

Fair warning: I'm about to start oversimplifying things, for the sake of clarity (and to reduce how much typing I need to do). Suppose that I have a unidimensional measure $L$ of learning outcomes and a unidimensional measure $T$ of teacher quality. Suppose further that I posit a linear model (since I'm all about simplicity today) of the form $$L = \alpha + \beta T + \epsilon$$ with $\epsilon$ the "random noise" component (the aggregation of all things not related to teacher quality). Let's assume that $T$ and $\epsilon$ are independent of each other, which gives me the usual (for regression) decomposition of variances: $$\sigma_L^2 = \beta^2 \sigma_T^2 + \sigma_\epsilon^2.$$ From the research cited above, we expect to find $\beta^2 \sigma_T^2$ to be about 8% of $\sigma_L^2$.
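
A quick bit of arithmetic (mine, not the cited researchers') helps put that "8% of variance explained" figure on the slope scale that will matter below. In this one-predictor model, the fraction of variance explained is just the squared correlation between $T$ and $L$, so $$\frac{\beta^2 \sigma_T^2}{\sigma_L^2} \approx 0.08 \quad \Longrightarrow \quad \frac{|\beta|\,\sigma_T}{\sigma_L} \approx \sqrt{0.08} \approx 0.28.$$ In other words, a one standard deviation improvement in teacher quality would be expected to move learning outcomes by a bit more than a quarter of a standard deviation, which sounds more like an invitation to a cost-benefit calculation than grounds for dismissal.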

Tell a decision analyst that the goal is to "improve learning", and questions along the following lines should arise:
  • How do we measure "learning"? (Assume that's our $L$ here.)
  • What is our goal (achieve the biggest bang on a fixed budget, achieve a fixed target at minimal cost, ...)?
  • Is the goal expressed in terms of mean result, median result, achievement by students at some fractile of the learning distribution (e.g., boost the bottom quartile of $L$ to some level), or something else (e.g., beat those pesky Taiwanese kids on international math tests)? Reducing variance in $L$, or the range of $L$, could be a goal, but I doubt it would be many people's first choice, since a uniform level of mediocrity would achieve it.
  • What are our levers? Teacher quality (our $T$) would seem to be one. Improving other measures of school quality (infrastructure, information technology) might be another. We might also look at improving socioeconomic factors, either at the school (more free lunches or even breakfasts, more after-school activities, more security on the routes to and from the schools) or elsewhere (safer neighborhoods, better food security, more/better jobs, programs to support two-parent households, ...).
  • How much progress toward our goal do we get from feasible changes to each of those levers?
  • What does it cost us to move those levers?
The (presumably regression-based) models in the research cited earlier address the penultimate question, the connection between levers and outcomes. They may not, however, directly address cost/benefit calculations, and fixating on percentage of variance explained may lead our hypothetical decision analyst to focus on the wrong levers. Socioeconomic factors may well account for more variance in learning outcomes than anything else, but the cost of nudging that lever might be enormous and the return on the investment we can afford might be very modest. In contrast, teacher quality might be easier to control, and investing in it might yield more "bang for the buck", despite the seemingly low 8% variance explained tag hung on it.

In my simplified version of the regression model, $\Delta L = \beta \Delta T$. The same ingredients that lead to the estimate of 8% variance explained also allow us to take an educated guess as to whether $\beta$ is really zero (teacher quality does not impact learning; what we're seeing in the data is a random effect) and to estimate a confidence interval $[\beta_L, \beta_U]$ for $\beta$. Assuming that $\beta_L > 0$, so that we are somewhat confident teacher quality relates positively to learning outcomes, and assuming for convenience that our goals are expressed in terms of mean learning outcome, a decision analyst should focus on identifying ways to increase $T$ (and, in the process, a plausible range of attainable values for $\Delta T$), the benefit of the outcome $\Delta L$ for any attainable $\Delta T$, and the cost of that $\Delta T$.
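
To make the "bang for the buck" comparison a bit more concrete (my notation, not anything from the cited research), suppose lever $i$ has estimated effect $\beta_i$ on the mean outcome, can feasibly be moved by $\Delta x_i$, and costs $c_i(\Delta x_i)$ to move that far. The decision analyst's ranking criterion looks less like variance explained and more like $$\frac{\beta_i \, \Delta x_i}{c_i(\Delta x_i)},$$ the expected gain in $L$ per dollar spent, and a lever with a small share of variance explained can still come out on top if it is cheap to move.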

Stricter teacher certification exams may be a way to increase teacher quality. Assuming that certification requirements do in fact improve teacher quality (which is a separate statistical assessment), and assuming that we do not want to increase class sizes or turn students away (and therefore need to maintain approximately the same size teaching work force), imposing tighter certification standards will likely result in indirect costs (increasing salaries or benefits to attract and retain the better teachers, creating recruiting programs to draw more qualified people into the profession, ...). As with the connection between teacher quality and learning outcomes, the connection between certification standards and teacher quality may be weak in the statistical sense (small amount of variance explained), but our hypothetical analyst still needs to assess the costs and impacts to see if it is a cost-effective lever to pull.

So, to recap the point I started out intending to make (which may have gotten lost in the above), explaining variance is a useful statistical concept but decision analysis should be more about cost-effective ways to move the mean/median/whatever.

And now I feel that I should take a pledge to avoid the word "assuming" for at least a week ... assuming I can remember to keep the pledge.

Sunday, May 17, 2015

Feelings of Rejection

This is a quick (?) recap of an answer I posted on a support forum earlier today. I will couch it in terms somewhat specific to CPLEX, but with minor tweaks it should apply to other mixed-integer programming solvers as well.

It is possible to "warm start" CPLEX (and, I'm pretty sure, at least some other solvers) by feeding it an initial solution, or a partial solution (meaning values for some but not all variables). CPLEX will try to complete the solution if necessary, test for feasibility, and perhaps try to repair the solution if it is not feasible.
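
For concreteness, here is roughly what supplying a warm start looks like in the Java API. This is a minimal sketch built around a toy knapsack model of my own invention (the class name, data and variable names are all arbitrary); only the addMIPStart call and the MIPStartEffort levels come from the CPLEX Java API as I recall it, so check the documentation for your version.

    import ilog.concert.IloException;
    import ilog.concert.IloIntVar;
    import ilog.concert.IloRange;
    import ilog.cplex.IloCplex;

    public final class WarmStartDemo {
        public static void main(String[] args) throws IloException {
            IloCplex cplex = new IloCplex();

            // A toy knapsack: maximize 5x + 4y + 3z subject to 2x + 3y + z <= 4, all binary.
            IloIntVar x = cplex.boolVar("x");
            IloIntVar y = cplex.boolVar("y");
            IloIntVar z = cplex.boolVar("z");
            cplex.addMaximize(cplex.sum(cplex.prod(5.0, x), cplex.prod(4.0, y), cplex.prod(3.0, z)));
            IloRange capacity =
                cplex.addLe(cplex.sum(cplex.prod(2.0, x), cplex.prod(3.0, y), cplex.prod(1.0, z)), 4.0);

            // Suggest a starting solution (here a complete one: x = 1, y = 0, z = 1).
            // The effort level hints how hard CPLEX should work on the start;
            // Repair asks it to try to complete/fix the start if necessary.
            IloIntVar[] startVars = { x, y, z };
            double[] startVals = { 1.0, 0.0, 1.0 };
            cplex.addMIPStart(startVars, startVals, IloCplex.MIPStartEffort.Repair);

            if (cplex.solve()) {
                System.out.println("Objective = " + cplex.getObjValue());
            }
            cplex.end();
        }
    }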

Today's question involved a warm start that the author was confident was feasible, but that CPLEX was asserting was not (message: "Retaining values of one MIP start for possible repair"). I've seen essentially the same question other times, sometimes with the "retained for repair" message and sometimes with a flat-out rejection. What can you do if CPLEX disagrees that your starting solution is feasible?

First note that the more complete the starting solution, the easier it is for CPLEX to digest it. If you specify, say, values for only the integer variables, CPLEX will try to solve for the corresponding values of the continuous variables. That's usually relatively painless, although (as with anything else MIP-related) your mileage will vary. If you specify values for just a portion of the integer variables, though, repairing the starting solution turns into a MIP problem unto itself. I'm pretty sure it is rarely the case that a warm start that incomplete will pay any dividends, even if CPLEX manages to repair it.

There is a parameter (MIP.Limits.RepairTries) that governs how much effort CPLEX spends trying to repair a starting solution. Higher values lead to more time spent trying to fix your warm start. The author of today's question had that cranked up quite high, so he needed to look elsewhere.
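
In the Java API the parameter is set roughly as follows (a fragment, reusing the cplex object from the sketch above; 20 is just an arbitrary value I picked for illustration):

    // Let CPLEX make up to 20 attempts at repairing an infeasible MIP start.
    cplex.setParam(IloCplex.Param.MIP.Limits.RepairTries, 20);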

You may be convinced that the warm start you are supplying is feasible, but sometimes bugs happen, or a bit of rounding bites you in the sitzfleisch. A useful thing to try is fixing all the variables (or at least all the integer variables) to match the starting solution, by setting both the lower and upper bound of each variable to its value in the warm start. In effect, you collapse the feasible region down to a single point. Now tell CPLEX to solve the model. If it comes back immediately with your starting solution as the optimal solution, then your warm start really is feasible. On the other hand, if CPLEX claims the problem is now infeasible, run the conflict refiner and get a set of constraints that CPLEX feels are mutually inconsistent. Check your starting solution against those constraints manually and look for a discrepancy.
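
In code, the "fix everything and re-solve" check might look something like the following (again a sketch against the Java API; the class name, method name and output file name are my own inventions):

    import java.util.Arrays;

    import ilog.concert.IloConstraint;
    import ilog.concert.IloException;
    import ilog.concert.IloNumVar;
    import ilog.cplex.IloCplex;

    public final class WarmStartCheck {
        // Fix every variable to its warm-start value and re-solve. If CPLEX then
        // declares the model infeasible, run the conflict refiner over the supplied
        // constraints and write the conflict to a file for manual inspection.
        static void checkWarmStart(IloCplex cplex, IloNumVar[] startVars,
                                   double[] startVals, IloConstraint[] constraints)
                throws IloException {
            // Collapse the feasible region to the single point given by the warm start.
            for (int i = 0; i < startVars.length; i++) {
                startVars[i].setLB(startVals[i]);
                startVars[i].setUB(startVals[i]);
            }
            if (cplex.solve()) {
                System.out.println("Warm start is feasible; objective = " + cplex.getObjValue());
            } else {
                // Give every constraint equal preference and ask the refiner for a
                // subset it considers inconsistent with the now-fixed variable bounds.
                double[] prefs = new double[constraints.length];
                Arrays.fill(prefs, 1.0);
                if (cplex.refineConflict(constraints, prefs)) {
                    cplex.writeConflict("warmstart_conflict.clp"); // inspect this by hand
                }
            }
        }
    }

With the toy model from the first sketch, the call would be something like checkWarmStart(cplex, startVars, startVals, new IloConstraint[]{ capacity }).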

Update: It turns out that there is an API function (IloCplex.refineMIPStartConflict in the Java API, various pseudonyms in other APIs) that lets you skip the "fix the variables" step and directly identify a conflict in the starting solution (if, in fact, one exists).
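
With that function, the bound-fixing gymnastics above can be skipped. A sketch of the call (index 0 refers to the first MIP start added; this is my recollection of the signature, so double-check it against your version's documentation):

    // Ask CPLEX to look for a conflict directly in the first MIP start (index 0),
    // without manually fixing variable bounds.
    if (cplex.refineMIPStartConflict(0)) {
        cplex.writeConflict("mipstart_conflict.clp");
    }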

One other possibility is numerical instability in the model (basically the MIP version of "gremlins"). Solve the model (without fixing any variables as described in the previous suggestion) and collect information on basis condition numbers. I've covered that subject before ("Ill-conditioned Bases and Numerical Instability"), so I won't rehash it here. If your model is unstable, small rounding errors in your warm start can blow up into deal-breaking errors in constraint satisfaction. There is a parameter (Emphasis.Numerical) that tells CPLEX to spend extra effort on numerical precision, but the more general approach to instability is to look for reformulations of the model that are more stable. In particular, avoiding a mix of large and small constraint coefficients (such as is commonly induced by the dreaded "big M" method of formulating logical constraints) may be helpful.
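
The corresponding setting in the Java API is a one-liner (same caveat as before: a fragment assuming the cplex object from the earlier sketch):

    // Ask CPLEX to trade some speed for extra numerical caution.
    cplex.setParam(IloCplex.Param.Emphasis.Numerical, true);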

Tuesday, May 12, 2015

Hell Hath No Fury Like a Dependency Scorned

There is a recent package for the R statistics system named "Radiant", available either through the CRAN repository or from the author's GitHub site. It runs R in the background and lets you poke at data in a browser interface. If you are looking for a way to present data to end-users and let them explore the data dynamically, rather than looking at a bunch of static plots, you might find radiant rather useful.

So I set out to install it, first on my PC and then, when that failed, on my laptop. If you look at the CRAN page, you can see which other R packages radiant requires. That's interesting but not terribly crucial, since the R package installation system by default handles dependencies for you.

The problem is that the package installer will compile radiant, and its various dependencies, on your system, or in my case die trying. What is not visible (with radiant or any other R package) is which non-R compilers and libraries are required. As it turns out, beyond the usual suspects (mainly gcc and g++, the GNU C and C++ compilers, which I think are a given on pretty much every Linux installation), radiant (or some of its dependencies) requires the GNU Fortran compiler (gfortran), which I had on my PC but not my laptop. That much was easy to figure out.

Harder to understand were compilation error messages for some of the dependencies saying "cannot find -lblas" and "cannot find -llapack". For the nontechnical user, even finding those messages in the torrent of compiler output might be a bit iffy. As it happens, I know enough (just barely) to know that "-l" is a prefix telling the compiler to link against a library and that "blas" and "lapack" were the library names. I even know that "blas" is the Basic Linear Algebra Subprograms library and "lapack" is the Linear Algebra Package library. What I did not know was why the compiler could not find them, given that both libraries are installed on both machines.

Trial and error yielded an answer: the compiler needs the static versions of both. Rather ironically, the Wikipedia entry for static libraries (the immediate preceding link) lists one advantage of static libraries over their dynamic counterparts (the ones I already had installed) as "... avoid[ing] dependency problems, known colloquially as DLL Hell or more generally dependency hell". Little do they know ...

Anyway, on Linux Mint (or, presumably, Ubuntu), you can find them listed in the Synaptic package manager under "libblas-dev" and "liblapack-dev". Install those first (along with "gfortran" if not already installed) using Synaptic or apt (from a command line), then install radiant and its dependencies from CRAN using R, and things should go fairly smoothly. For Windows or Mac users, you'll need the same two libraries, but I'm afraid you're on your own finding them.

Monday, May 4, 2015

Model Credibility

Someone asked an interesting question on a support forum recently. The gist was: "How do I confirm that my model is correct?"
On the occasions that I taught simulation modeling, this was a standard topic. Looking back, I don't recall spending nearly as much time on it when teaching optimization, which was a mistake on my part. In those days, operations research/management science topics tended to be taught in OR/MS courses, at least at my institution. Since then, OR/MS topics have to some extent shifted into courses in application areas (such as supply chain management and its siblings), where they necessarily receive less coverage, since they share the course with application content. I have a suspicion that model correctness has slipped even further through the cracks as a result.
There may also be a bit of instructor bias involved. If you are an OR person teaching, say, optimization, you are probably more enthused with the mathematics (and perhaps the computational aspects) than with the quotidian details of applying it. I certainly was. If you are a supply chain instructor, you may want to spend as little time as possible on optimization (including implementation details) because you are anxious to get to the application of the results. Regardless of how it happens, when we don't teach model correctness, I suspect it is employers who pay the price.
Borrowing a bit from an excellent (albeit long in the tooth) simulation book by Law and Kelton[1], I thought I'd review what I know about building credible models. (Since I'm an academic, you may choose to take all this with a large grain of salt.)

From problem to solution

There are many diagrams in textbooks about the evolution of a model. Here's my contribution:
[Figure: model flow diagram]
You (or someone) starts with a problem. You turn that into a conceptual model, which could be a mathematical program, simulation, queuing model, forecasting model or whatever – I'm an equal-opportunity offender. The conceptual model exists on paper (or whatever 21st century replacement for paper you use). You translate the conceptual model into computer code (or delegate it to some minion – one reason doctoral students were invented). This could mean writing fairly high-level code in a modeling language (examples: OPL, SIMSCRIPT), a general purpose mathematical/statistical language (examples: MATLAB, R), or a general purpose coding language, likely linked to some libraries (Java, anyone?). With probability approaching 1.0, the conceptual model will involve parameters, which you will need either to obtain from the end user or estimate from data. Once the model is coded and parameterized, you inflict it on some unwary computer and, after the usual endless debugging, hopefully obtain some results. With that, you're done, right? Not quite. There's still the matter of whether the results are (a) correct and (b) useful.

Validation and Verification

There are two key steps in ascertaining whether the results actually are meaningful in the context of the problem. Neither is necessarily easy to do. Verification refers to confirming that the code accurately represents the conceptual model. It occurs once in the flow diagram (at the marked location, where the conceptual model is turned into code).
Validation refers to confirming that the conceptual model conforms to the user's reality. Validation occurs at multiple locations in our flow diagram.
  • Linking problem to model: Does your model actually address the user's original problem? This is not frivolous, especially where academics are concerned. I'm pretty sure I've seen models that mysteriously migrated from the relatively simple thing the user needed to a more intriguing (i.e., publishable) problem bearing at most a tangential relation to the user's original issue.
  • Linking assumptions to model: Are your assumptions appropriate?
    • Sometimes the modeler is seduced into an alternate universe by the quest for computational tractability. I once saw a book on modeling (whose title I've sadly forgotten) that had a chapter about designing an automated chicken plucker. The chapter title was “Assume a Spherical Chicken”. The search for that led me to the Wikipedia page about "spherical cows".
    • Sometimes the modeler simply knows no better. In an ideal world, an OR analyst spends time observing (or, better still, participating in) operations before attempting to model and improve them. You help load the trucks or work on the assembly line to get a sense of what actually goes on. Often, though, that's a luxury; as the analyst, you either lack the time or the necessary access. In those cases, you need to be extra diligent in checking your assumptions.
  • Linking data to code: Is your data relevant and correctly analyzed? Getting accurate parameter estimates should be considered a part of model validation. A nontrivial part of this is “cleaning” the data. As I once learned the hard way, having operational data in a corporate database and having correct operational data in a corporate database are two entirely distinct things.
So, in a nutshell, pretty much everything other than the one step I marked in the diagram as subject to verification should be considered subject to validation.

How to verify

There are several common techniques for verification. Here are the ones I think are most important. Law and Kelton list a few others as well.
  • If the model's output is a deterministic function of its inputs, and if you can find (or construct) test cases with independently known results, you can compare the computational results on those cases with the expected outcomes.
  • You can show the model to one or more competent coders (with the background to understand the model statement) and ask them to review the code.
  • You can create test cases with “extreme” inputs, run them, and verify that the code produces plausible outputs. For instance, in a queuing model or simulation, you can set the arrival rate equal to or greater than the service rate and verify that the queue explodes. In an optimization model, you can tweak parameters to force a particular constraint to be binding or to have slack, or a particular decision option to be too good to pass up or too expensive to consider, and see if the output matches your tweaking. As a concrete example, I have some code that assigns students to project teams according to certain criteria (mainly, that teams should be as similar to each other as possible). I can easily create test data that would allow for perfectly identical teams to be formed, and I can easily create test data that would prohibit certain mandatory requirements from being met. My code should produce perfect teams in the first case and spit up a useful error message in the second case.

How to validate

Law and Kelton define a model with face validity as “… a model that, on the surface, seems reasonable to people who are knowledgeable about the system under study.” So a good starting point is to describe, in nonmathematical terms, what your model says, and see if the users agree that it sounds appropriate. Note that I wrote “users” (plural). Even if only one person will be responsible for running the code or implementing the solution, it pays to get input from a variety of people familiar with different aspects of the problem. It does little good to cook up a production scheduling model based on input from the person doing the scheduling, only to be told after the fact that the logistics folks either cannot warehouse that much product or cannot move that much product to market in a timely manner.
What I think of as historical validity (I'm not sure that's an official term) is worth checking if historical data for the system is available. Run the model with the historical inputs (parameter settings) and compare the output to the historical results. In a simulation model, you would like the model output to match the historical results (within the confines of what one can expect from an inherently stochastic model). In an optimization model, you would like the model's results to do as well as, if not better than, the historical results. It would also be instructive to check whether the historical solution is feasible in your model. If not, either your constraints are suspect or the users have some way of finagling violations … which should perhaps be baked into the model.
Another thing to try (which might again fit under the umbrella of face validity) is to run scripted scenarios (including edge cases), describe the scenarios to users, show them the model results, and ask if the results seem credible given the scenarios (and, if not, why not). A variation of the scripted scenario option is sensitivity analysis. Start with a scenario the users understand and confirm that the output is credible. Slightly modify one parameter, or at most a small number of parameters, rerun, and show the users the changes in the output. Ask them if they would buy those changes as the appropriate reaction to the change in inputs. For instance, if you are simulating a customer service operation, try adding one server and see if the reductions (hopefully) in waiting time or increases (hopefully) in throughput seem plausible to the people with experience in the operation.

Credibility (or, staying off the shelf)

The ultimate tests of model credibility come in two places. The first is whether the model is ever implemented, or whether it languishes “on the shelf”. Even credible models can end up on the shelf; it happened to me once, when the model's champion within the company was reassigned. Lack of credibility, though, is perhaps the number one reason for a model never to be implemented. The second place credibility comes into play is when the model is implemented: did it make things better?
Validation is a big factor in making the model credible, and keeping it off the shelf, for more reasons than just the obvious one (correctness of the model). Involving users in the model design stage (through face validation) helps to get them familiar with the model, and increases their comfort using it. Subtly, it may also give them a sense of ownership. If they have invested time and energy in the model development, and in particular if they feel their voices have been heard, they will have a stake in seeing the model implemented and in seeing it succeed. They may also be more inclined to revisit assumptions and have an analyst (you?) tinker with the model if changes in the environment cause it to stop tracking reality accurately, whereas uninvested users may be quicker to scrap the model and go back to doing what worked before (or what they find comfortable).

References