Sunday, February 27, 2011

Stepwise Regression in R

Let me start with a disclaimer:  I am not an advocate of stepwise regression.  I teach it in a doctoral seminar (because it's in the book, and because the students may encounter it reading papers), but I try to point out to them some of its limitations.  If you want to read some interesting discussions on the issues with stepwise, search the USENET group sci.stat.math for references to stepwise.  (If you are a proponent of stepwise, I suggest that you don flame retardant underwear first.)

Since I teach stepwise in my seminar, I would like to demonstrate it in R (not to mention some of my students are learning R while doing their homework, which includes a stepwise problem).  The catch is that R seems to lack any library routines to do stepwise as it is normally taught.  There is a function (leaps::regsubsets) that does both best subsets regression and a form of stepwise regression, but it uses AIC or BIC to select models.  That's fine for best subsets, but stepwise (at least as I've seen it in every book or paper where I've encountered it) uses nested model F tests to make decisions.  Again, if you search around, you can find some (fairly old) posts on help forums by people searching (fruitlessly, it appears) for a stepwise implementation in R.
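For concreteness, the nested-model ("partial") F statistic that drives those enter/leave decisions is easy to compute by hand.  Here is a quick sketch in Python (the numbers are invented purely for illustration):

```python
def partial_f(rss_reduced, rss_full, df_reduced, df_full):
    """Nested-model (partial) F statistic comparing a reduced model to a
    full model.  rss_* are residual sums of squares; df_* are residual
    degrees of freedom."""
    return ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)

# Invented numbers: dropping one predictor raises RSS from 40 to 55,
# with residual df going from 20 (full model) to 21 (reduced model).
f_stat = partial_f(55.0, 40.0, 21, 20)
print(f_stat)  # ((55 - 40) / 1) / (40 / 20) = 7.5
```

The p-value for that statistic (from an F distribution with 1 and 20 degrees of freedom) is what gets compared to the enter/leave thresholds.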

So I finally forced myself to write a stepwise function for R.  My knowledge of R coding is very, very limited, so I made no attempt to tack on very many bells and whistles, let alone make it a library package.  Unlike most R routines, it does not create an object; it just merrily writes to the standard output stream.  There are a number of limitations (expressed in the comments), and I've only tested it on a few data sets.  All that said, I'm going to post it below, in case someone else is desperate to do conventional stepwise regression in R.

=== code follows ===
    #
    # This is an R function to perform stepwise regression based on a "nested model" F test for inclusion/exclusion
    # of a predictor.  To keep it simple, I made no provision for forcing certain variables to be included in
    # all models, did not allow for specification of a data frame, and skipped some consistency checks (such as whether
    # the initial model is a subset of the full model).
    #
    # One other note: since the code uses R's drop1 and add1 functions, it respects hierarchy in models. That is,
    # regardless of p values, it will not attempt to drop a term while retaining a higher order interaction
    # involving that term, nor will it add an interaction term if the lower order components are not all present.
    # (You can of course defeat this by putting interactions into new variables and feeding it what looks like
    # a first-order model.)
    #
    # Consider this to be "beta" code (and feel free to improve it).  I've done very limited testing on it.
    #
    # Author: Paul A. Rubin (rubin@msu.edu)
    #
    stepwise <- function(full.model, initial.model, alpha.to.enter, alpha.to.leave) {
      # full.model is the model containing all possible terms
      # initial.model is the first model to consider
      # alpha.to.enter is the significance level below which a variable may enter the model
      # alpha.to.leave is the significance level above which a variable may be deleted from the model
      # (Useful things for someone to add: specification of a data frame; a list of variables that must be included)
      full <- lm(full.model);  # fit the full model
      msef <- (summary(full)$sigma)^2;  # MSE of full model
      n <- length(full$residuals);  # sample size
      allvars <- attr(full$terms, "predvars");  # list of the variables in the full model (response included; not actually used below)
      current <- lm(initial.model);  # this is the current model
      while (TRUE) {  # process each model until we break out of the loop
        temp <- summary(current);  # summary output for the current model
        rnames <- rownames(temp$coefficients);  # list of terms in the current model
        print(temp$coefficients);  # write the model description
        p <- dim(temp$coefficients)[1];  # current model's size
        mse <- (temp$sigma)^2;  # MSE for current model
        cp <- (n-p)*mse/msef - (n-2*p);  # Mallows' Cp statistic
        fit <- sprintf("\nS = %f, R-sq = %f, R-sq(adj) = %f, C-p = %f",
                       temp$sigma, temp$r.squared, temp$adj.r.squared, cp);
        write(fit, file="");  # show the fit
        write("=====", file="");  # print a separator
        if (p > 1) {  # don't try to drop a term if only one is left
          d <- drop1(current, test="F");  # looks for significance of terms based on F tests
          pmax <- max(d[-1,6]);  # maximum p-value of any term (skip the first row, "<none>", whose p-value is NA)
          if (pmax > alpha.to.leave) {
            # we have a candidate for deletion
            var <- rownames(d)[d[,6] == pmax];  # name of variable to delete
            if (length(var) > 1) {
              # the "<none>" row's NA p-value puts an NA first in the list,
              # and there also could be ties for worst p-value;
              # taking the second entry if there is more than one is a safe solution to both issues
              var <- var[2];
            }
            write(paste("--- Dropping", var, "\n"), file="");  # print out the variable to be dropped
            f <- formula(current);  # current formula
            f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" - ")));  # modify the formula to drop the chosen variable (by subtracting it)
            current <- lm(f);  # fit the modified model
            next;  # return to the top of the loop
          }
        }
        # if we get here, we failed to drop a term; try adding one
        # note: add1 throws an error if nothing can be added (current == full), which we trap with tryCatch
        a <- tryCatch(add1(current, scope=full, test="F"), error=function(e) NULL);  # looks for significance of possible additions based on F tests
        if (is.null(a)) {
          break;  # there are no unused variables (or something went splat), so we bail out
        }
        pmin <- min(a[-1,6]);  # minimum p-value of any candidate term (again skipping the NA in the "<none>" row)
        if (pmin < alpha.to.enter) {
          # we have a candidate for addition to the model
          var <- rownames(a)[a[,6] == pmin];  # name of variable to add
          if (length(var) > 1) {
            # same issues with the "<none>" row's NA and with ties as above
            var <- var[2];
          }
          write(paste("+++ Adding", var, "\n"), file="");  # print the variable being added
          f <- formula(current);  # current formula
          f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" + ")));  # modify the formula to add the chosen variable
          current <- lm(f);  # fit the modified model
          next;  # return to the top of the loop
        }
        # if we get here, we failed to make any changes to the model; time to punt
        break;
      } 
    }


Saturday, February 26, 2011

Finding Multiple Solutions in a Binary MIP

I'm going to post some AMPL code here that illustrates how to find more than one solution to a mixed integer program in which all integer variables are binary.  The problem I'll use is a multiple knapsack where the supply of each item is one.  The AMPL code should be pretty self-explanatory, with the possible exception of the exclusion constraints.  The notion for those is as follows.  Suppose that my binary variables are $x\in \{0,1\}^d$.  I solve the original model and obtain a solution with $x=\tilde{x}$.  I now add the following constraint:$$\sum_{i : \tilde{x}_i = 0} x_i + \sum_{i : \tilde{x}_i =1}(1-x_i)\ge 1.$$Note that this constraint forces at least one of the binary variables to change its value. (Exclusion constraints for general integer variables are harder to come by.  The only way I know off-hand involves binary expansions of the general integer variables.)
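That exclusion constraint can be generated mechanically from an incumbent solution.  A minimal sketch in Python (the function name is mine, not part of the AMPL model), which treats any solver value above 0.5 as a 1 to guard against rounding error:

```python
def exclusion_constraint(x_tilde):
    """Given an incumbent 0-1 vector x_tilde, return (coef, rhs) such that
    sum(coef[i] * x[i]) >= rhs encodes the cut
    sum over zeros of x_i + sum over ones of (1 - x_i) >= 1."""
    coef, rhs = [], 1
    for v in x_tilde:
        if v > 0.5:          # treat as a 1 (robust to solver rounding error)
            coef.append(-1)  # the (1 - x_i) term: the constant 1 moves to the rhs
            rhs -= 1
        else:                # treat as a 0
            coef.append(1)   # the x_i term appears directly
    return coef, rhs

# The incumbent itself violates the cut...
x = [1, 0, 1]
coef, rhs = exclusion_constraint(x)
print(sum(c * v for c, v in zip(coef, x)) >= rhs)  # False
# ...while any 0-1 vector differing in at least one position satisfies it.
print(sum(c * v for c, v in zip(coef, [1, 1, 1])) >= rhs)  # True
```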

The AMPL code solves the model in a loop, recording each solution and adding an exclusion constraint.  The small example I created happens by chance to have multiple optimal solutions, enough so that the five solutions generated by the code are all optimal (but distinct).  More generally, you could get a sequence of increasingly suboptimal solutions.

Three things are perhaps noteworthy in the code.  The first is the check for infeasibility:  if you've already enumerated all possible solutions (and excluded them), the model will become infeasible.  The second is that I gave default values for the exclusion constraint coefficients (and right hand sides).  AMPL otherwise solved the base model (with no exclusion constraints) but then nagged me when I tried to record the result, because the first exclusion constraint used undefined coefficients.  Apparently recording the result (or even trying to display it) triggered a reevaluation of the model.  I think I could have ducked that by waiting to update nfound, but default values were an easy fix.  Third, note that I did not check whether a binary variable equaled 0 or 1; that might bump into rounding issues.  Saying that anything above (below) 0.5 is a 1 (0) is a very safe test.  (If your rounding error is that bad, you need to clean the beads on your abacus.)

One last comment is that this is not the most efficient way to find multiple solutions, in that it requires the (expanded) model to be solved ab initio each time.  If the solver supports callbacks, a more efficient approach is to use callbacks to record incumbents, reject them, and add exclusion constraints on the fly.  CPLEX has a "solution pool" feature that lets you track multiple solutions (and if CPLEX has it, my guess is that other commercial solvers do or soon will).  Using that is probably even more efficient than using callbacks.  If the model is easy enough to solve, though, or if your solver lacks those features, the method demonstrated below may be a good choice.

Here is the code:
# Example of how to generate and record multiple solutions to a binary MIP.
#
# The sample problem is a 0-1 multiple knapsack.
#
set ITEMS;  # items available for the knapsacks
set KNAPSACKS;  # available knapsacks
param value {ITEMS} > 0;  # item values
param size {ITEMS} > 0;  # item sizes
param capacity {KNAPSACKS} > 0;  # knapsack capacities
var Pack {ITEMS, KNAPSACKS} binary;
  # Pack[i,k] = 1 iff item i is packed in knapsack k
maximize LoadValue: sum {i in ITEMS, k in KNAPSACKS} value[i]*Pack[i,k];
  # maximize value of load
s.t. Capacity {k in KNAPSACKS}:
  sum {i in ITEMS} size[i]*Pack[i,k]  <= capacity[k];  # capacity limits
s.t. Supply {i in ITEMS}:
  sum {k in KNAPSACKS} Pack[i,k]  <= 1;  # only one of each item available

option solver cplexamp;  # to use CPLEX as the solver

data;  # ordinarily I'd use a separate file for the data
set ITEMS := 1 2 3 4 5 6 7 8 9;
set KNAPSACKS := 1 2;
param capacity := 1 6 2 4;
param:  value  size  :=
   1       6     3
   2       5     2
   3       4     4
   4       3     1
   5       2     1
   6       4     5
   7       7     6
   8       2     3
   9       1     1  ;

model;

# we now create some structure to record multiple solutions
param want := 5;  # number of solutions we want
set SOLUTIONS := 1..want;  # index set for solutions
param packed {ITEMS, KNAPSACKS, SOLUTIONS};
  # packed[i,k,s] will be 1 if solution s packed item i in knapsack k
param payoff {SOLUTIONS};  # objective values of recorded solutions
param coef {ITEMS, KNAPSACKS, SOLUTIONS} default 0;
  # coefficients of solution exclusion constraints
param rhs {SOLUTIONS} default 0;  # right hand sides of exclusion constraints
param nfound default 0;  # number of solutions found

# we add constraints to the previous model to exclude
# solutions we've already seen
s.t. Exclude {s in 1..nfound}:
  sum {i in ITEMS, k in KNAPSACKS} coef[i,k,s]*Pack[i,k] >= rhs[s];

# solve until either the problem becomes infeasible or
# the right number of solutions has been found
for {s in SOLUTIONS} {
  solve;  # solve the model
  if solve_result = 'infeasible' then break;
    # if the model has become infeasible, give up
  let nfound := s;  # bump the counter for solutions found
  let payoff[s] := LoadValue.val;
  let rhs[s] := 1;
  # create an exclusion constraint that forces at least
  # one binary variable to change value
  for {i in ITEMS, k in KNAPSACKS} {
    if Pack[i,k] > 0.5 then {
      # treat this as a 1
      let packed[i,k,s] := 1;
      let coef[i,k,s] := -1;
      let rhs[s] := rhs[s] - 1;
    }
    else {
      # treat this as a 0
      let packed[i,k,s] := 0;
      let coef[i,k,s] := 1;
    }
  }
}
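The same solve-record-exclude pattern is easy to mimic in any language.  Here is a toy Python sketch that uses brute-force enumeration in place of a real MIP solver (a single knapsack with made-up data); set membership plays the role of the exclusion constraints:

```python
from itertools import product

# Tiny 0-1 knapsack (invented data): one knapsack of capacity 6.
values = [6, 5, 4, 3]
sizes = [3, 2, 4, 1]
capacity = 6

def solve(excluded):
    """Brute-force 'solver': best feasible packing not in the excluded set."""
    best, best_val = None, -1
    for x in product((0, 1), repeat=len(values)):
        if x in excluded:  # plays the role of an exclusion constraint
            continue
        if sum(s * xi for s, xi in zip(sizes, x)) > capacity:
            continue
        val = sum(v * xi for v, xi in zip(values, x))
        if val > best_val:
            best, best_val = x, val
    return best, best_val

# Solve, record, exclude, repeat -- the same loop as the AMPL script.
found = []
for _ in range(5):
    x, val = solve({x for x, _ in found})
    if x is None:
        break  # analogous to the infeasibility check
    found.append((x, val))

for x, val in found:
    print(x, val)
```

On this instance the recorded objective values decrease (with one tie): 14, 11, 9, 9, 8, illustrating the "increasingly suboptimal" sequence mentioned above.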

Tuesday, February 15, 2011

An R Resource

Catching up on some backlogged blog reading, I came across several interesting posts about R on John D. Cook's blog (The Endeavour).  Happily, there is an index of all R-related posts there, so I can record one link here (where I'm unlikely to lose it) and catch any new posts with it.  There's a ton of information to be acquired from those posts.  (I originally wrote "mined", but I didn't want to get the data-mining crowd whipped into a frenzy, hence the edit.)  Among other things, I learned why S (and R by inheritance) has the convention of using a dot to create compound variable names (apparently underscore is an assignment operator in S).  That's one less itch to scratch.

On the same blog, there is also an interesting (and interactive) diagram of the relationships among various distributions.  This could be quite useful in a probability and statistics course.

Tuesday, February 8, 2011

Bounding Dual Variables

An interesting question came up on sci.op-research today: can we (easily) determine bounds for the variables in the dual of a linear program?  If the goal is tight bounds, I don't see any easy solution; but it turns out that, at least in some problems, there may be finite (reasonable?) bounds to be had cheaply.

Let's start with a primal problem in canonical form (alternative forms are left to the reader as an exercise):\[ \begin{array}{lrclr} \textrm{minimize} & c'x\\ \textrm{s.t.} & Ax & \ge & b & \textrm{(P)}\\ & x & \ge & 0\end{array}\] with dual\[ \begin{array}{lrclr} \textrm{maximize} & b'y\\ \textrm{s.t.} & A'y & \le & c & \textrm{(D)}\\ & y & \ge & 0\end{array}.\]To get an upper bound on dual variable $y_j$, we could solve a modified form of the dual problem:\[ \begin{array}{lrclr} \textrm{maximize} & y_j\\ \textrm{s.t.} & A'y & \le & c & \textrm{(D*)}\\ & y & \ge & 0\end{array}.\]This looks for the largest value of $y_j$ (which we hope is finite) consistent with the dual constraints. The dual to this problem is:\[ \begin{array}{lrclr} \textrm{minimize} & c'x\\ \textrm{s.t.} & Ax & \ge & e_j & \textrm{(P*)}\\ & x & \ge & 0\end{array}\]where $e_j$ is a vector whose $j$-th component is 1 and whose other components are all 0.

Now suppose we know a primal vector $\tilde{x}\ge 0$ for (P) such that $A\tilde{x}=h\gg 0$, where $h\gg 0$ means that $h$ is strictly positive in every component. Let $x^*=(1/h_j)\tilde{x}$. Then $x^*$ is feasible in (P*), which means $c'x^*$ is an upper bound for the optimal value of (P*), and therefore also an upper bound for the optimal value of (D*). So we can safely assert $y_j\le c'x^*$ in (D).

If $b\gg 0$, then any feasible solution to (P) is a candidate value of $\tilde{x}$, and the closer $\tilde{x}$ is to optimal, the tighter the bound on $y_j$. Also note that a single $\tilde{x}$ provides bounds for all of the dual variables $y_j$. So the trick is to know a vector $x\ge 0$ for which $Ax\gg 0$.
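As a sanity check, here is a small numeric instance in Python (the data are invented) confirming that $(c'\tilde{x})/h_j$ really does bound $y_j$ over the entire dual feasible region, which for simplicity is swept by a crude grid search rather than an LP solver:

```python
from itertools import product

# Primal data (invented): minimize c'x subject to Ax >= b, x >= 0.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [4.0, 6.0]
c = [3.0, 5.0]

# A primal-feasible point whose image h = Ax is strictly positive.
x_tilde = [3.0, 2.0]
h = [sum(A[i][j] * x_tilde[j] for j in range(2)) for i in range(2)]  # [8.0, 9.0]

# The bound derived above: y_j <= c'(x_tilde / h_j) = (c'x_tilde) / h_j.
bounds = [sum(c[j] * x_tilde[j] for j in range(2)) / h[i] for i in range(2)]
print(bounds)  # roughly [2.375, 2.111]

# Grid search over the dual feasible region {y >= 0 : A'y <= c} confirms
# that no feasible y_j exceeds its bound.
grid = [k / 100.0 for k in range(301)]
for i in range(2):
    worst = max(y[i] for y in product(grid, grid)
                if all(sum(A[r][j] * y[r] for r in range(2)) <= c[j]
                       for j in range(2)))
    assert worst <= bounds[i]
print("bounds hold")
```

(On this instance the dual optimum is $y=(0.8, 1.4)$, comfortably inside the bounds; a $\tilde{x}$ closer to primal-optimal would tighten them.)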

Saturday, February 5, 2011

Twitter and Firefox

In the beginning, I completely ignored Twitter.  Recently, I started following a few Twitter feeds (from OR people) in Google Reader, mainly because they tended to post links to things I actually found interesting (i.e., not what the tweeter had for lunch, nor where said lunch occurred).  Finally, I bit the bullet and opened a Twitter account, largely so that I could respond to the occasional provocative tweet.  (I think I'm up to three tweets so far, which should hold me for a while.)

The next step was to look for an extension to either Thunderbird or Firefox (both of which I keep open when being abused by a computer), so that I could tweet, retweet or ___ (is there a term for responding to a tweet?  tweeply?) easily.  Of course, I could just open Twitter in a Firefox tab, but that would be too low tech, right?  I found one Tbird extension, but it was an alpha version.  I found several promising-looking Firefox extensions, of which the most complete and polished seemed to be Echofon.  So yesterday I installed it on my office PC (Linux Mint Isadora), where it worked well, and this morning I installed it on my home PC (Win XP), where it also worked well.

Unfortunately, in between I installed it on my laptop (also Mint Isadora), where it proceeded to spew authentication errors in an alarmingly OCD way.  In fact, short of killing Firefox from a terminal, the only way I could stop the error messages (which popped up an average of about two seconds apart) was to turn off my WiFi connection.  (I couldn't log out of Twitter, nor exit Firefox gracefully, because the incessant error dialogs were modal.)

Fortunately, I could diff my settings between the laptop and the (working) office PC.  It turns out there was a conflict with another Firefox extension, the Electronic Frontier Foundation's HTTPS-Everywhere.  This extension automatically converts insecure HTTP connections to many sites to secure HTTPS connections.  I installed it on my laptop (which likes to go out for a cup of coffee pretty often) but not my office PC (which hides behind a firewall) (although I may eventually install it there as well).  It works fine with Twitter when I surf to my Twitter account, but apparently it does not get along at all well with Echofon.  Fortunately, HTTPS-Everywhere allows you to configure which sites it will try to seduce into secure connections; by deselecting Twitter.com, I got Echofon to work properly (if perhaps not entirely securely).

More Presentation Software

I previously wrote about the tools I use to create presentations. Lately, I've been looking for tools that allow an instructor to:
  • annotate files being projected (from a PC or laptop, through a digital projector);
  • turn the display into a whiteboard and draw on it;
  • share the screen between slides and a whiteboard (some of my colleagues like to work homework problems by hand while simultaneously displaying either the problem statement or the relevant formulas); and
  • save some of the better doodling as an image (to be uploaded to our course management system).
The motivation for this is largely that we have grown our classroom technology incrementally, adding features on top of features in rooms that were not designed for them. A particular recurring theme is rooms with wall-mounted whiteboards that are largely obscured by screens for the digital projectors (leaving a small margin on either side accessible to the instructor).

I've come across several useful programs that I thought I'd list here.  (I'm still working on the other piece of the puzzle, which is finding appropriate and reliable input methods.  My attempts to write with a mouse cause MDs to giggle uncontrollably.)  I should mention that I'm looking for tools for multiple platforms:  our classroom PCs are predominantly Windows-based, and most of my colleagues run Windows on their laptops, but I run Linux on mine and there are at least a few Mac users to be considered.  Also, I'm looking exclusively at free (preferably but not necessarily open source) software, and I'm not looking at "smart board" technology (we may get into that, but most classrooms will continue to have stupid boards, if I may be un-PC). What I've found so far:
  • ZoomIt (Windows only, any version): This is a very lightweight program (267KB download; no installation, no writing to the registry) that does what it does very well.  It requires keyboard use to control but handles pen input for the actual writing.  You freeze whatever is currently on the screen (optionally zooming in on it), then draw in one of six colors (red, blue, orange, green, yellow, pink -- no black).  The width of the pen stroke can be varied.  You can draw rectangles, ellipses and straight lines by holding a key down as you go.  There's blanket erase (with one keystroke) and incremental undo, but no redo.  With one keystroke, you can turn the entire display into a whiteboard or blackboard.  You can copy the screen (to be pasted into an appropriate program) or save it to disk (as an image).  Three caveats: if you switch among screen annotation, whiteboard and blackboard you lose your annotations; there is no way to return to presentation mode, then come back and retain your annotations; and the image capture is always the full screen (so you may need to crop in another program before uploading it to its final home).  Image manipulations can be done in a variety of programs (I recommend IrfanView).
  • Ardesia (Linux, Windows Vista/7): Ordinarily I find it easier to find free/open-source applications for Linux than for Windows, but it took me a couple of hours of searching to find a screen annotation program I liked. Gromit works pretty well, but it has limited flexibility and requires you to tweak a configuration file to change pen color, etc. Ardesia seems to do almost everything I'd want.  Documentation is a bit lacking, so you need to read the tool tips and do some experimenting, but it works very well.  You can draw in any color, using one of three or four stroke widths, in one of three modes (totally freehand, or with automatic conversion of some strokes to either lines or splines).  You can also type text. If you draw an enclosed region, you can fill it (something I'm not sure I'd use much).  You can add arrow heads, erase portions of the screen (or the whole screen with one click), undo and redo, and save your screen (PDF or PNG).  There's even a recording feature (which I've not figured out yet).  Once it's running, everything is controlled from a toolbar docked on the perimeter of the screen, which is helpful when you're using a convertible laptop and the screen/tablet is covering the keyboard.  There are a few "eye-candy" options I have not mentioned (and have not installed ... yet). Ardesia is available as a .deb package for Ubuntu, and allegedly compiles on other Linux/BSD systems.  It requires a composite manager.  I have not tested the Win 7 version (yet), but it might end up replacing ZoomIt for me.  The only real drawback (other than documentation) from my perspective is that there are no ellipse and rectangle tools.  (Hypothetically I can draw polygons and ellipses in the straight line or spline mode, but so far that has proved to be purely hypothetical.)
  • Whyteboard (Windows, Linux, Mac): If you want to turn your display into a whiteboard (either full screen or windowed), rather than writing on top of some other image, I doubt you'll do better than this. You get a tabbed interface (so that you can have multiple screens of writing, and switch among them at will) with a full palette of drawing tools (text, arrows, lines, basic shapes), with multiple colors and an eraser.  You can save and reopen tabs, and you can attach notes to them. Unlike the screen drawing programs, you can grab any shape and move or resize it, and change its color.  A history of your drawings is kept and can be replayed.  Audio and video players can be dropped onto the whiteboard. If ImageMagick is installed, you can suck in a PDF and draw over it.  Windows 7 (or XP tablet version) users can do some of this in Windows Journal, but having tried both I'm going with Whyteboard (plus it's cross-platform, which is useful to me). The biggest drawback I've found so far is that, while you can save your drawings in a program-specific format, there does not seem to be an image export feature.  [Editor's note:  Author is apparently blind.  Program does export images.  See comment below.] You can, however, select a portion of a tab and copy it to the clipboard, so you can paste it into a program (such as IrfanView, which incidentally runs fine on Linux under Wine) and then crop, rotate, fiddle and export.
Now if I could just lick that illegible screen writing problem ...

Tuesday, February 1, 2011

Binary Exclusive Or

Lately I've been playing with a mixed integer programming model that involves Hamming distances between binary vectors.  That requires what amounts to computing the exclusive or of two binary variables.  In a previous post I made a comment about exclusive or being somewhat tricky to implement, but that was in reference to general linear constraints -- requiring that exactly one of two possible linear equalities/inequalities be satisfied.  Fortunately, an exclusive or between two binary variables is easy.

Let $x$ and $y$ be two binary variables, and let $z=x\oplus y$  be the exclusive disjunction of $x$ and $y$, also declared as a binary variable. The truth table is as follows:
\[\begin{array}{cc|c}
x & y & z=x\oplus y\\
\hline
0 & 0 & 0\\
0 & 1 & 1\\
1 & 0 & 1\\
1 & 1 & 0
\end{array}\]

We accomplish this with the following inequalities:\begin{eqnarray*}z\le x + y\\z\le 2 - x - y\\ z\ge x - y\\z \ge y-x.\end{eqnarray*} Verification that this works is left to the reader as an exercise.
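The verification is also easy to automate; a quick brute-force check in Python over all eight binary triples:

```python
def xor_feasible(x, y, z):
    """True iff (x, y, z) satisfies all four linearization inequalities."""
    return z <= x + y and z <= 2 - x - y and z >= x - y and z >= y - x

# For every (x, y), the only feasible z is the exclusive or x ^ y.
for x in (0, 1):
    for y in (0, 1):
        assert [z for z in (0, 1) if xor_feasible(x, y, z)] == [x ^ y]
print("linearization verified")
```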