Sunday, May 29, 2011

Peeking Inside the Black Box

From the perspective of an analyst, rather than a developer of algorithms, the computational side of operations research might be described in terms of a black box: insert model and data; turn crank; make sense of the results. When it comes to the question of how much the analyst needs to know about what goes on inside the box, I find myself walking a somewhat narrow line between the Yin and the Yang.

When teaching statistics to end users, I resist the exhortations of some of my colleagues that students must do computations by hand to “really learn” about statistics. For instance, said colleagues will argue that in order for students to grasp the concept of correlation, they must take a set of observations of a pair of variables and (manually) center the variables, compute squares and cross-products, total the squares and cross-products, and use the totals to compute standard deviations and the Pearson correlation coefficient. To me, that is just gratuitous grunt work. Students need to know the range of the correlation coefficient (-1 to +1), when to use it (looking for linear relationships) or not (looking for nonlinear relationships), and its interpretation (including the dangers of making a leap from correlation to causation). The actual computation can safely be left to software.
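
To make the point concrete, here is a minimal R sketch (the numbers are made up purely for illustration):
# made-up data, purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)  # roughly linear in x
cor(x, y)          # Pearson correlation, close to +1
cor(x, (x - 3)^2)  # strong nonlinear relationship, but the Pearson coefficient is 0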

On the other hand, the widespread availability of modeling software and powerful solvers has changed the practice of optimization in a very fundamental way. Once the exclusive province of people who had taken extensive coursework in the mathematical theory behind the pertinent algorithms (not to mention formal training in numerical analysis), optimization is now increasingly being undertaken by people who apparently have minimal formal training, and who may understand the contextual domain of the problem they are trying to solve but perhaps do not understand the mathematical underpinnings of the methods being used. I've seen evidence of this in a number of forums recently. One of the symptoms is a failure to understand issues relating to rounding error and numerical stability, ranging from the subtle (poor scaling leading to incorrect answers or solver failures) to the blatant (“My binary variable came out 0.99999999999 -- what went wrong??”).

A few recent posts suggested that some users either do not understand the difference between local and global optima or do not understand the role of convexity in optimization models. In particular, I cringe whenever I see nonlinear equality constraints in a model, whereas they don't seem to bother the model authors very much. (I'm not saying that nonlinear equality constraints are fundamentally wrong; I'm just saying that I'd really like not to have any in my model.) A very simple example will illustrate why.

Consider the following trivial optimization problems:\[ \begin{array}{clccclc} \textrm{(D)} & \min & x & \,\,\, & \textrm{(C)} & \min & x\\ & \textrm{s.t.} & x^{2}+y^{2}\le1 & & & \textrm{s.t.} & x^{2}+y^{2}=1\\ & & -2\le x\le2 & & & & -2\le x\le2\\ & & -2\le y\le2 & & & & -2\le y\le2\end{array}\]The (convex) feasible region of problem (D) is a disk, while the (nonconvex) feasible region of problem (C) is a circle. The solution to both problems is clearly $(x,y)=(-1,0)$.

Let's perform a little experiment (using AMPL to model the problems and MINOS 5.51 to solve them). We will solve both (D) and (C) with initial values $(x,y)=(\cos(\theta),\sin(\theta))$ where $\theta$ is an integer multiple of $\frac{\pi}{4}$. The AMPL code is as follows:

# problem definitions
var x >= -2 <= 2;
var y >= -2 <= 2;
minimize Obj: x;
s.t. Disk: x^2 + y^2 <= 1;
s.t. Circle: x^2 + y^2 == 1;
problem D: x, y, Obj, Disk;
problem C: x, y, Obj, Circle;

# bit used to define starting locations
param pi4 := atan(1); # pi/4
set ANGLE := 0..7; # from 0 to 7/4*pi

# storage for optimal solutions
param dsol {ANGLE};
param csol {ANGLE};

# solve disk problem from all starting angles
problem D;
for {t in ANGLE} {
  let x := cos(t*pi4);
  let y := sin(t*pi4);
  solve;
  let dsol[t] := x;
}

# solve circle problem from all starting angles
problem C;
for {t in ANGLE} {
  let x := cos(t*pi4);
  let y := sin(t*pi4);
  solve;
  let csol[t] := x;
}

display dsol, csol;

The results? For all starting angles, MINOS determines that $(x,y)=(-1,0)$ is the optimal solution to (D). That also applies to (C), with one glaring exception: when $\theta=0$, MINOS decides that the initial value $(x,y)=(1,0)$ is the optimal solution.

The figure below tells the tale. MINOS uses a reduced gradient method, which is a souped-up form of steepest descent. If allowed to, it will move in the direction pointed to by the gradient of the objective function at the current point when maximizing, and in the opposite direction when minimizing. (In this example, the gradient is a constant $(1,0)$.) When movement in that direction is infeasible, it will project the positive or negative gradient onto the hyperplane tangent to the feasible region at the current point, and search in that direction.


The red arrows indicate the (constant) negative gradient direction; the blue arrows are the projections of the gradient onto the unit circle. Having a hard time seeing the blue arrows anchored at $(\pm1,0)$? That's because they have length zero. So for any starting point other than $(\pm1,0)$ in problem (D), MINOS moves right to left across the unit disk until it reaches the boundary (the unit circle), then wends its way around the circle until it reaches the optimum at $(-1,0)$. If it starts from $(+1,0)$, it cuts straight across the disk along the $x$-axis and arrives directly at the optimum. For problem (C), cutting across the disk is not an option. Given any start other than $(\pm1,0)$, MINOS winds around the circle. Started at $(+1,0)$, however, MINOS sees a zero projection of the gradient, declares victory, stages a photo op and goes home.
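
For the curious, here is a small R sketch of the geometry (not MINOS itself, just an illustration of projecting the descent direction $(-1,0)$ onto the tangent line at a point $(\cos(\theta),\sin(\theta))$ on the circle):
# projection of the descent direction onto the tangent line of the unit circle
proj_descent <- function(theta) {
  d <- c(-1, 0)                          # negative gradient of the objective x
  tangent <- c(-sin(theta), cos(theta))  # unit tangent at (cos(theta), sin(theta))
  sum(d * tangent) * tangent             # component of d along the tangent line
}
proj_descent(pi/4)  # nonzero, so the search can continue around the circle
proj_descent(0)     # c(0, 0): zero projection at (1, 0), so the search stops there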

The behavior of MINOS on problem (C) is correct given what's inside the black box. MINOS only guarantees local optima, and even that can only be delivered if the problem is moderately well behaved and/or the starting point is chosen well. Ipopt, another popular nonlinear solver, has similar limitations. From some recent posts that I've seen, though, I have the impression that some users are creating highly nonconvex models and expecting optimal solutions they are not destined to get. Compounding this, of course, is the fact that the solver will generally not know it has arrived at a suboptimal solution, and therefore cannot warn the user.

Perhaps modeling languages and solvers should ship with a “warning label” stating that use of the software without first reading a book or taking a course in linear/nonlinear/integer programming is inadvisable.

Speaking of books, I can recommend David Luenberger's "Linear and Nonlinear Programming" (an updated version of which has apparently been published with coauthor Yinyu Ye).

Monday, May 23, 2011

Stepwise Regression in R: Part II

Someone requested a demonstration of how to use the R stepwise function I previously posted, so here goes.  I'll use the "swiss" data set that comes with R (in the datasets package).  There are six variables in the data set; I will use Fertility as the dependent variable (mainly because that is what the example code in the help file for regsubsets does). The code below assumes that my function is saved in a file named stepwise.R, and that the working directory is set to the folder containing that file.

Decisions to add or delete a variable in my function are based on an F-test comparing the model with and without the variable.  In both cases, the null hypothesis is that the variable's true coefficient is zero, so a variable is deleted if we fail to reject the null hypothesis and added if we succeed in rejecting the null hypothesis.  For demonstration purposes, I will use significance levels $\alpha=0.05$ for additions and $\alpha=0.10$ for deletions.
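
For readers who want to see the underlying test outside my function, here is a minimal sketch using R's built-in anova() to compare nested models on the swiss data; the p-value it reports should match the one shown when Education enters in the first run below:
# partial F-test for adding Education to a constant-only model
data(swiss)
small <- lm(Fertility ~ 1, data = swiss)
big   <- lm(Fertility ~ 1 + Education, data = swiss)
anova(small, big)  # F statistic and p-value for H0: the Education coefficient is zero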

Here is the code:
source('./stepwise.R') # load the stepwise function
data(swiss) # 47 observations of 6 variables, from the datasets package
attach(swiss) # so we don't have to prefix every variable
# we will (arbitrarily) use alpha = 0.05 to add a variable and alpha = 0.10 to remove one
# the first run will start with just a constant term
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, 0.05, 0.10)
# the second run starts with the complete model and initially winnows it
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, 0.05, 0.10)

The output for the first call (which starts with just a constant term, and builds from there) is as follows:

            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 70.14255   1.822101 38.49542 1.212895e-36

S = 12.491697, R-sq = 0.000000, R-sq(adj) = 0.000000, C-p = 94.805296
=====
+++ Adding Education 

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 79.6100585  2.1040971 37.835734 9.302464e-36
Education   -0.8623503  0.1448447 -5.953619 3.658617e-07

S = 9.446029, R-sq = 0.440616, R-sq(adj) = 0.428185, C-p = 35.204895
=====
+++ Adding Catholic 

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 74.2336892 2.35197061 31.562337 7.349828e-32
Education   -0.7883293 0.12929324 -6.097219 2.428340e-07
Catholic     0.1109210 0.02980965  3.720974 5.598332e-04

S = 8.331442, R-sq = 0.574507, R-sq(adj) = 0.555167, C-p = 18.486158
=====
+++ Adding Infant.Mortality 

                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      48.67707330 7.91908348  6.146806 2.235983e-07
Education        -0.75924577 0.11679763 -6.500524 6.833658e-08
Catholic          0.09606607 0.02721795  3.529511 1.006201e-03
Infant.Mortality  1.29614813 0.38698777  3.349326 1.693753e-03

S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====

So all the predictors other than Examination are added. Now here is the output from the second call:

                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

S = 7.165369, R-sq = 0.706735, R-sq(adj) = 0.670971, C-p = 6.000000
=====
--- Dropping Examination 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
===== 
 
We start with the full model, delete Examination, and stop there.

Not content to leave well enough alone, let's see what happens if we start  with the "E" variables (Education and Examination) and go from there.  The code for the function call is

stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ Examination + Education, 0.05, 0.10)

and the output is

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 85.2532753  3.0854981 27.630312 1.945244e-29
Examination -0.5572183  0.2319374 -2.402451 2.057160e-02
Education   -0.5394570  0.1924380 -2.803277 7.497224e-03

S = 8.981812, R-sq = 0.505485, R-sq(adj) = 0.483007, C-p = 28.135883
=====
+++ Adding Infant.Mortality 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      55.2746618  8.8077340  6.275696 1.451652e-07
Examination      -0.5108888  0.2063175 -2.476226 1.729132e-02
Education        -0.5225093  0.1709099 -3.057221 3.832793e-03
Infant.Mortality  1.4556114  0.4064507  3.581274 8.644778e-04

S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = 14.252399
=====
+++ Adding Catholic 

                    Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      50.02820666 8.66076269  5.7764204 8.325568e-07
Examination      -0.10580461 0.26036962 -0.4063631 6.865390e-01
Education        -0.70415772 0.17969218 -3.9186887 3.221868e-04
Infant.Mortality  1.30567908 0.39150335  3.3350393 1.790664e-03
Catholic          0.08631125 0.03649293  2.3651501 2.271709e-02

S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = 9.993398
=====
--- Dropping Examination 

                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      48.67707330 7.91908348  6.146806 2.235983e-07
Education        -0.75924577 0.11679763 -6.500524 6.833658e-08
Infant.Mortality  1.29614813 0.38698777  3.349326 1.693753e-03
Catholic          0.09606607 0.02721795  3.529511 1.006201e-03

S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
===== 
 
All roads seem to lead to the same model. Also note that the constant term was not explicitly stated in the third call (no "1+" in the formulas). As usual with R, a constant term is assumed unless explicitly omitted (using "0 +").
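
As a quick illustration of that convention (a side calculation, not part of the stepwise demo):
# R includes an intercept unless it is explicitly suppressed with "0 +"
coef(lm(Fertility ~ Education, data = swiss))      # intercept estimated by default
coef(lm(Fertility ~ 0 + Education, data = swiss))  # no intercept in this fit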

Update (08/21/17): I've posted a new version of the function, together with a new version of this demo, in a single R notebook.

Saturday, May 21, 2011

A Shoutout to Anki

I'd like to point out Anki, an open source (GPL 3) virtual flashcard program I've been using. Anki runs on all major PC platforms (and an impressive array of mobile devices), and is feature-rich. It has a plug-in architecture. Cards can contain not just text but also images and audio files. Anki supports LaTeX, so it should be easy to put mathematical formulas on cards. Use of card decks is very flexible (including an easy way to have cards representing difficult concepts repeat sooner than cards representing easy concepts). It is easy to create or import card decks, and you can share card decks (and plug-ins) with other people through their AnkiWeb site. (Disclaimer: I have not looked at the available shared decks, nor have I looked at any plug-ins. The basic download fills my needs.)

My use for Anki has been to learn my students' names. At the outset of each semester, I acquire a class list with student photos, strip out the photos, and turn them into an Anki deck. During idle moments at coffee shops, I play the decks and learn to associate names with faces. (Addressing students by name on the first day of class freaks them out; I think they tend to be more docile thereafter.)

I'm not a big fan of rote memorization in education, but I suppose it has its place.  I could see using Anki to learn formulas from queuing theory (for instance, "M/M/1 steady-state queue length" on the front, the corresponding formula on the back) or perhaps some terminology from linear programming (graph of a feasible region and an objective hyperplane tangent along an edge on the front, "multiple optima" on the back). As there are more things in heaven and earth than are dreamt of in my philosophy, I suppose there are more ways Anki could be used in the teaching of operations research.

Tuesday, May 17, 2011

My Infinity Is Bigger Than Your Infinity

Mathematicians are used to the concept that some infinities (e.g., $\aleph_1$) are bigger than other infinities (e.g., $\aleph_0$). In mathematical programming, though, "infinity" is really just a placeholder for "I know there is a finite limit but I don't know what it is" (or "I know there is a finite limit but I'm too lazy to look for it"). The only place you'll see $\infty$ in a mathematical program is as an omitted bound on a variable.

There are two other truisms, related to this post, that are well known to anyone with a reasonable amount of experience solving mathematical programs via user-written code (including APIs to solver libraries). The first is that solvers run on double-precision floating point arithmetic, and $\infty$ is not a d-p number. So it is reasonable to treat any d-p number larger than some specified limit $M$ as a surrogate for $\infty$. In the CPLEX APIs, $M=1e+20$. (In the Java version of the Concert API, the static constants Double.MAX_VALUE and Double.POSITIVE_INFINITY work well for $\infty$, while -Double.MAX_VALUE and Double.NEGATIVE_INFINITY work for $-\infty$; but you can just as easily use $\pm 1e20$.) The second truism is that mixing very large coefficients with not-so-large  coefficients in constraints is an invitation to numerical instability.
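
As a rough illustration of the double-precision side of this (in R rather than any of the CPLEX APIs):
# double-precision landmarks (R analogues of Java's Double.MAX_VALUE and POSITIVE_INFINITY)
.Machine$double.xmax         # largest finite double, roughly 1.8e+308
1e20 < .Machine$double.xmax  # TRUE: the surrogate "infinity" is an ordinary finite double
1e20 + 1 == 1e20             # TRUE: at this magnitude, unit-sized quantities are lost entirely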

This brings me to something I tripped over accidentally a day or so ago. Consider the following trivial linear programming problem: $$\begin{array}{lc}\mathrm{maximize} & x \\ \mathrm{s.t.} & x\le M\end{array}$$ where $M>0$ is a large (but finite) number. The problem has an obvious (and trivial) optimal solution ($x=M$) ... unless you are using a solver, and $M$ is large enough to look like $\infty$. Indeed, for $M\ge 1e+20$, CPLEX (using any of the APIs) will (and should) declare the problem unbounded.

The following table shows the results of a little experiment I did (with CPLEX 12.2.0.2, but the results are not version-dependent):

                  Presolve            No Presolve
$M \lt 1e+10$     Optimal ($x=M$)     Optimal ($x=M$)
$M = 1e+10$       Unbounded           Optimal ($x=M$)

Note the result in the lower left corner: the CPLEX presolver asserts that the problem is unbounded. An e-mail exchange with Dr. Ed Klotz of IBM sorted this out for me, and the rationale does in fact make sense. Suppose that the presolver encounters a variable $x$ that it can fix at a large bound $M$. As it eliminates the variable from the model, any instance of $x$ in the model (say, $a_i x$ in the $i^{\mathrm{th}}$ constraint) is replaced by a potentially large number ($a_i M$), which may introduce numerical instability (see second truism above). So CPLEX is somewhat conservative during presolve about how large is too large. In effect, $\infty=1e+20$ in the solver but $\infty=1e+10$ in the presolver.
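
To put made-up numbers on that: suppose constraint $i$ is $0.001x+y\le 5$ and the presolver fixes $x$ at $M=10^{10}$. The substitution yields $$0.001(10^{10})+y\le 5\quad\Longrightarrow\quad y\le 5-10^{7},$$ and a constraint whose coefficients were all of modest size now involves a quantity seven orders of magnitude larger, which is precisely the sort of mixing the second truism warns against.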

I discovered this purely by accident, because I'm a firm believer that no model should have coefficients (including bounds) that require me to type two-digit exponent fields in scientific notation. If there are values that large, it's time to rescale. It just caught me by surprise that the surrogate for $\infty$ in the presolver was not the same one used in the APIs.

Sunday, May 15, 2011

Rejiggering an R Data Frame

This is a new experience for me.  Mark Allen (author of the Open Source Research blog) tweeted a question about rearranging a data frame in R.  The question being longer than 140 characters, he used Deck.ly to post an extended tweet.  So I learned something new about the Twitterverse.  I don't have a TweetDeck account, though, so my best option to post a response is to put it here.

Mark has a data frame with one row for each response to any of a set of questions, and three columns: respondent ID; question number; response. Here's a chunk of R code to create a small demo data frame along those lines:
# create some data
m <- matrix(c(1,1,11,2,3,23,1,2,12,2,1,21,2,2,22,1,3,13), nrow=6, ncol=3, byrow=TRUE)
d <- data.frame(m)
names(d) <- c("ID","Question","Answer")
print(d)


The output is:

  ID Question Answer
1  1        1     11
2  2        3     23
3  1        2     12
4  2        1     21
5  2        2     22
6  1        3     13

Here is code to rearrange it:

# sort the data by ID, then by Question
d <- d[do.call(order,d),]
# extract a list of unique IDs and Question numbers
id <- unique(d[,"ID"])
q <- unique(d[,"Question"])
# rearrange the answers into the desired matrix layout
m <- matrix(d[,"Answer"], nrow=length(id), ncol=length(q), byrow=TRUE)
# add the ids and make a new data frame
m <- cbind(id, m)
dd <- data.frame(m)
names(dd) <- c("ID", paste("Q", q, sep=""))
print(dd)


The output of the last line (the rejiggered data frame) is:

  ID Q1 Q2 Q3
1  1 11 12 13
2  2 21 22 23
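
Incidentally, base R's reshape() can produce essentially the same wide layout in one call; here is a quick sketch, applied to the sorted data frame d from the code above:
# an alternative pivot using base R's reshape() on the sorted data frame d
wide <- reshape(d, idvar = "ID", timevar = "Question", direction = "wide")
print(wide)  # columns ID, Answer.1, Answer.2, Answer.3, with the same values as dd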

Friday, May 13, 2011

Will Analytics Drag O.R. Back to Its Roots?

The INFORMS blog challenge for May is "O.R. and Analytics", so here goes ...

There are about as many definitions of "analytics" as there are explanations for the recent global economic crash. My take on "analytics" has been that, at least until now, what it meant to business executives was "take this huge pile of data and make sense of it". Operations research models in the real world have always relied on some level of data analysis, because models contain parameters and parameters must be estimated. (Outside the real world -- which is to say in academe and government service -- modelers get to make up whatever parameter values they like.) That said, I've never really associated heavy duty data analysis with O.R. Statisticians analyze data and data miners try to read sheep entrails (according to the statisticians); O.R. analysts build and use models.

As the term "analytics" grabs mind share among executives, though, O.R. societies and O.R. practitioners are trying, to borrow a phrase from Microsoft, to "embrace and extend" it. The societies see this as a way to boost membership and conference attendance, and both the societies and practitioners see it as a way to enhance the visibility of O.R. to the people who will (hopefully) employ O.R. analysts, directly or as consultants. I would not be surprised if the data analysis crowd see this as O.R. people trying to share the spotlight uninvited. Fortunately, since their forte is recognizing patterns as opposed to prescribing solutions, they'll see us coming but probably won't be able to keep us out.

Extending the O.R. brand will require more than just saying "operations research ... including data analysis" or "operations research is analytics ... plus so much more". If we're serious about this, we'll need to reconsider what we mean by "operations research", and perhaps how we go about its practice. Therein lies an opportunity to return to O.R.'s roots.

The history of O.R. traces back to World War II, and the stories from that era have a common thread. Someone has a problem. The problem appears somewhat quantitative in nature (or at least amenable to measurement and quantification). We want to model it, and see what the model suggests might solve the problem. Absent from this description is any pigeon-holing of the problem as a linear program, or a queueing problem, or a discrete event simulation. One of the classic examples was finding ways to move men and materiel from the U.S. to the U.K. more safely during the Battle of the Atlantic. The answer involved determining the optimal shape of a convoy (which was not a mathematical programming problem, the word "optimal" notwithstanding), recognizing that small convoys of fast ships (small possibly meaning a single ship) might not need an escort (they were too fast for the U-boats to pick on), and so forth.

As O.R. bloomed after the war, we developed more and better tools (theory, algorithms, software) ... and along the way, we became more specialized. So now we have experts in optimization and experts in simulation and so on, and we tend to adhere to the maxim (which I will shamelessly bastardize) that if you're really good with a screwdriver, everything looks like a screw. At a recent INFORMS conference, I attended a session about teaching modeling to MBAs. Given the topic, I suspect most of the attendees were fellow academics, so I apologize if I offend any practitioners by tarring them with the same brush. At one point in the session, the presenters posed a scenario to us (the Red Cross is thinking about paying for blood donations, and wants some consulting on it), and, with not much more detail than what I just gave, turned us loose on it. The optimizers started writing optimization models. The simulators started sketching out simulation models. If there were any decision analysts in the room, I'm sure they were drawing decision trees. In other words, most of us framed the problem in terms of our favorite tools, rather than fleshing out the problem (which was quite vague) and then looking for tools (possibly mathematically unsophisticated tools) that would match it.

As we try to figure out how data mining and "business intelligence" (note that I'm skipping all oxymoron jokes here, a severe exercise in restraint) fit with O.R., perhaps we can seize the opportunity to starting conceptualizing (and teaching) O.R. as first and foremost sorting out and describing what the problem actually is. My very limited understanding of data mining suggests that it leans more toward letting the data tell you what it wants to tell you than to making the data fit a preconceived model; extend that to the overall problem, rather than just the data, and we're back to something I think the progenitors of O.R. would recognize.