Thursday, November 10, 2016

MIP Models in R with OMPR

A number of R libraries now exist for formulating and solving various types of mathematical programs within R (or formulating them in R and handing them off to external solvers). For a comprehensive listing, see the Optimization and Mathematical Programming task view on CRAN. I decided to experiment with Dirk Schumacher's OMPR package for R. OMPR provides a domain-specific language for linear and mixed-integer linear programs. OMPR currently relies on the R Optimization Infrastructure package (ROI), which uses a plugin architecture to act as a bridge between R code and third-party solvers (including CPLEX). The CPLEX plugin for ROI, in turn, depends on the Rcplex package to connect to CPLEX.
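In case it helps to see the moving parts, here is a minimal sketch of the setup (the open-source GLPK plugin stands in for CPLEX here; the CPLEX route additionally requires ROI.plugin.cplex, Rcplex and a local CPLEX installation):

# One-time installation of the modeling layer and the ROI bridge.
install.packages(c("ompr", "ompr.roi", "ROI", "ROI.plugin.glpk"))

library(ompr)      # domain-specific language for LP/MILP models
library(ompr.roi)  # lets OMPR hand finished models to ROI
library(ROI)       # solver abstraction layer with plugin architecture

# Which solver plugins does ROI currently see?
ROI_registered_solvers()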

Of course, to try out an optimization modeling package you need to have something to optimize. I went back to some ancient research I did, specifically a paper in 1990 on using MIP models to choose linear (or polynomial) discriminant functions for classifying observations into one of two groups. For the sleep deprived, the full citation is:
Rubin, P. A. Heuristic solution procedures for a mixed-integer programming discriminant model. Managerial and Decision Economics, Vol. 11, No. 4, October 1990.
I used Anderson's iris data for the test case (since it's conveniently available in R, through the datasets package), and just for the halibut I also threw in a support vector machine (SVM) model using the kernlab package for R. Comparing the two is a bit misleading, since SVM models are inherently nonlinear, but I just felt like doing it. In any case, the purpose is to see how OMPR works with CPLEX, not to compare MIP discriminant models and SVMs.

The details are too lengthy for a blog post, so I posted them separately in an R notebook. If you're not familiar with R notebooks, you can find details here. The web page generated by my notebook contains the source code, and there's a control in the upper right of the web page that will let you download it as a notebook (R Markdown) file. You can also grab it from my GitLab repository for the project. As with other content of this blog, it's yours to play with under a Creative Commons license.

The MIP model is as follows. We start with samples of two classes ("positive" and "negative") containing $p$ features. Let $X_1$ be the $N_1\times p$ matrix of data for the negative sample and let $X_2$ be the $N_2\times p$ matrix of data for the positive sample. The discriminant function we wish to train is $f(x)=w'x+c$; we will classify an observation $x$ as belonging to the positive or negative class according to the sign of $f(x)$. To allow for both limited measurement accuracy of the data and the inevitable adventures with rounding error, we arbitrarily choose some constant $\delta > 0$ and declare a positive observation correctly classified if $f(x)\ge \delta$ and a negative observation correctly classified if $f(x)\le -\delta$.

To avoid the pitfall of having the solver scale $w$ up to some ridiculous magnitude in order to force borderline observations to "look" correctly classified (i.e., to get around the use of $\delta$ as a minimum magnitude for non-zero classification scores), we bound $w$ via its sup norm: $-1 \le w_j \le 1 \, \forall j\in \{1,\dots,p\}$. The constant term $c$ is unbounded and unrestricted in sign.

To detect and count misclassifications (including ambiguous classifications, $-\delta < f(x) < \delta$), we introduce binary variables $y_i,\, i\in\{1,\dots,N_1\}$ for the negative sample and $z_i,\, i\in\{1,\dots,N_2\}$ for the positive sample. Each will take value 0 if the corresponding observation is correctly classified and 1 if not. In the OMPR demo, I just count total misclassifications ($\sum_i y_i + \sum_i z_i$); in general, misclassifications can be weighted to account for prior probabilities, oversampling of one group relative to the other, or relative importance of different types of errors (e.g., false positives looking for cancer tumors are bad, but false negatives can be fatal). There is also a variable $d$ that captures the minimum absolute value of any correct classification score (i.e., how far correct scores are from the boundary value of 0). Larger values are rewarded in the objective function (using a coefficient $\epsilon > 0$).

The model also contains "big M" type constraints to define the binary variables. Formulas for selecting the values of the parameters $M_1$, $M_2$ and $\epsilon$ are given in the paper. So, finally, we get to the MIP model:


\begin{alignat*}{2}
& \textrm{minimize} & \mathbf{1}^{\prime}y+\mathbf{1}^{\prime}z-\epsilon d\\
& \textrm{subject to}\quad & X_{1}w+c \cdot \mathbf{1}+d \cdot \mathbf{1}-M_{1} \cdot y & \le 0\\
& & X_{2}w+c \cdot \mathbf{1}-d \cdot \mathbf{1}+M_{2} \cdot z & \ge 0\\
& & w & \le \mathbf{1}\\
& & w & \ge -\mathbf{1}\\
& & c \quad & \textrm{free}\\
& & d \quad & \ge \delta\\
& & y, \: z \quad & \textrm{binary}
\end{alignat*}
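To give a flavor of the OMPR syntax without making you open the notebook, here is a minimal sketch of how the model above might be coded. This is not the exact notebook code: X1, X2, M1, M2, delta and eps are assumed to have been computed already, and the GLPK plugin stands in for CPLEX.

library(dplyr)      # for the pipe operator
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

# Assumed inputs: X1 (N1 x p matrix, negative sample), X2 (N2 x p matrix,
# positive sample), big-M constants M1 and M2, minimum score delta, and
# objective weight eps for the separation variable d.
n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)

model <- MIPModel() %>%
  add_variable(w[j], j = 1:p, type = "continuous", lb = -1, ub = 1) %>%  # sup norm bound on w
  add_variable(c0, type = "continuous") %>%                              # free constant term
  add_variable(d, type = "continuous", lb = delta) %>%                   # minimum correct score
  add_variable(y[i], i = 1:n1, type = "binary") %>%                      # negative-sample errors
  add_variable(z[i], i = 1:n2, type = "binary") %>%                      # positive-sample errors
  set_objective(sum_expr(y[i], i = 1:n1) + sum_expr(z[i], i = 1:n2) - eps * d,
                sense = "min") %>%
  # negative observations: X1 w + c + d - M1 y <= 0
  add_constraint(sum_expr(X1[i, j] * w[j], j = 1:p) + c0 + d - M1 * y[i] <= 0,
                 i = 1:n1) %>%
  # positive observations: X2 w + c - d + M2 z >= 0
  add_constraint(sum_expr(X2[i, j] * w[j], j = 1:p) + c0 - d + M2 * z[i] >= 0,
                 i = 1:n2)

result <- solve_model(model, with_ROI(solver = "glpk"))
get_solution(result, w[j])   # fitted coefficient vector

The only non-obvious bit is that the constant term is named c0 rather than c, to avoid colliding with R's built-in c() function.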

Monday, November 7, 2016

Interactive R Plots with GGPlot2 and Plotly

I refactored a recent Shiny project, using Hadley Wickham's ggplot2 library to produce high quality plots. One particular feature the project requires is the ability to hover over a plot and get information about the nearest point (generally referred to as "hover text" or a "tool tip"). There are multiple ways to turn static ggplots into interactive plots, and I've experimented with a few. I ended up using the R plotly library, which worked well. (The free community version was sufficient for my purposes.) The Plotly site has a short introduction to facilitate getting started. I did bump into a few "wrinkles", and thought I would record how I handled them.

Before getting into details, I'll just mention the obvious: besides installing the libraries, you need to load them in your application with library(ggplot2) and library(plotly). Also, I'll point out the default behavior for the plotly library with regard to hover text: hovering shows a pop-up with the coordinates of the point over which (or close to which) you are hovering. You can optionally add text of your choice to the pop-up.

Partly as a consequence of how variables were named in the data frames being passed to the ggplot() function, I sometimes encountered difficulty getting the coordinates properly labeled. For instance, I might want the coordinates labeled "MPG" and "Residual" but I'd end up with "x" and "y". Eventually I decided the simplest solution was to suppress the display of the coordinates (displaying just text) and then put everything I wanted (coordinate values, coordinate labels, observation labels) into a single, formatted hover text entry. The details of how I did that are the gist of this post.

ui.R


In the user interface file, displaying a plot is extremely simple. If the plot is going to be passed in output$myplot, the code to display it is just plotlyOutput("myplot"). In some cases, to avoid overly large plots or plots with unappealing aspect ratios, I would display them with plotlyOutput("myplot", height = "auto").
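For context, a stripped-down ui.R might look something like this (the title and output name are placeholders):

library(shiny)
library(plotly)

# Minimal UI: a single interactive plot region. Using height = "auto"
# keeps the plot from growing to an awkward size or aspect ratio.
shinyUI(fluidPage(
  titlePanel("Regression diagnostics"),
  plotlyOutput("myplot", height = "auto")
))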

server.R


In the server file, assume that the plot in question (an instance of class "ggplot") is in a variable p. The code to convert it for display via Plotly with default hover information is just output$myplot <- renderPlotly(ggplotly(p)). To suppress the default information and show only my custom text, it's output$myplot <- renderPlotly(ggplotly(p, tooltip = "text")).
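Putting that together in a stripped-down server.R (hypothetical names again, with a toy plot built from the mtcars data just so the example is self-contained):

library(shiny)
library(ggplot2)
library(plotly)

shinyServer(function(input, output) {
  output$myplot <- renderPlotly({
    # Build a ggplot with a custom "text" aesthetic for the tool tip.
    p <- ggplot(mtcars,
                aes(x = wt, y = mpg,
                    text = sprintf("Car: %s<br>MPG: %.1f", rownames(mtcars), mpg))) +
      geom_point()
    # tooltip = "text" suppresses the default coordinate pop-up and
    # shows only the custom text.
    ggplotly(p, tooltip = "text")
  })
})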

When generating a plot with custom text, the text is provided as an "aesthetic", an argument to the aes() function either in the call to ggplot() or in the call to a "geom" being plotted. One of the more complicated instances in my application involves plotting a histogram of residuals from a regression model with a smoothed density curve superimposed. The code for that (in which the residuals are supplied in a data frame df containing a single column named "r") is as follows:

df %>%
   ggplot(mapping = aes(x = r)) +
   labs(x = "Standardized Residual", y = "Density") +
   geom_histogram(aes(y = ..density..,
                      text = sprintf("Std.Res.: %.2f<br>Count: %d", x, ..count..)
                  )
   ) +
   geom_density(color = "Navy",
                aes(text = sprintf("Std.Res: %.2f<br>Density: %.2f", x, ..density..)
                )
   ) + 
   geom_vline(xintercept = 0, color = "DarkGreen")

Note the calls to sprintf() to format the tool tips (with different information for histogram bars and the density curve). Also note the use of the HTML break tag (<br>)  to separate lines within the tool tip.

The inevitable edge case


As von Moltke (the Elder) noted, no battle plan survives contact with the enemy ... and no coding scheme handles all cases. One of my plots, a QQ plot of residuals, tripped me up. I wanted to identify each point with its row name. No matter where I put aes(text = ...), it failed to work. So I resorted to an alternative approach.

Assume that the QQ plot (an instance of class "ggplot") is in variable p, the source data is in data frame df, and the residuals are in a numeric vector rf, with the graph ending up in output$qPlot. The code in the user interface is unchanged. The server code looks like the following:

output$qPlot <-
  renderPlotly({
    z <- plotly_build(p)
    ix <- sort(rf, index.return = TRUE)$ix
    z$x$data[[1]]$text <- rownames(df)[ix]
    z
  })

Some explanation is clearly in order. The plotly_build() function turns a plot generated by ggplot() into a list of instructions to the plotly.js Javascript library. That in turn is formatted by the renderPlotly() function into something plotlyOutput() can digest. Within the list plotly_build() created (stored in local variable z), z$x$data[[1]]$text turns out to be a vector of hover text entries, one per point. I replace those with the row names from the data set. (I could use sprintf() to include the coordinates of the points, but for this particular application that wasn't particularly useful.)

Finally, I have to adjust for the fact that, in a QQ plot, observations are plotted in ascending order, not in their original order.  To match the correct row name to each observation, I sort the residuals, store the sort order index vector in ix, and then use that to reorder the row names.

Friday, November 4, 2016

Surviving an nVidia Driver Update

Scenario: I'm running Linux Mint 17.3 Rebecca (based on Ubuntu 14.04) on a PC with a GeForce 6150SE nForce 430 graphics card. My desktop environment is Cinnamon. The graphics card is a bit long in the tooth, but it's been running fine with the supported nVidia proprietary driver for quite some time. Unfortunately, having no reason to do so, I had not noted down what version of the driver I had.

Yesterday I installed a couple of "recommended" updates, one of which bumped the nVidia driver to version 304.132. Apparently this driver is "recommended" for people who don't want to see their desktops. On the next boot after the upgrade, I got a black screen. To be clear, there's no problem before the desktop loads -- I can see the BIOS messages at the start of the boot and the grub menu where I get to choose which version of the operating system to boot. It's only when we get to the desktop that the display fails.

A bit of searching showed me that I'm far from the only person experiencing this. What's lacking (at least as of this writing) is a definitive fix. I'll skip the gory details of a day and a half of fruitless hacking and cut to the chase scene.

Getting to a terminal


The first step was to escape the black hole of my desktop. That was easier said than done. You can't right click on an invisible desktop (or at least if you do it's unproductive). Ditto trying to get to the application menu. Fortunately, control+alt+F2 got me out of the (worthless) X session and onto a virtual terminal. (The display worked fine in terminal mode.)

Getting to a desktop


It's a bit like cutting off your leg to get rid of a broken toe, but one way to get out of nVidia Hell is to get nVidia off your system. So in the terminal I ran

sudo apt-get purge nvidia*

(which deleted all nVidia packages) followed by

sudo reboot now

(which did exactly what you would think). With nVidia gone, the system was forced to use the open source "nouveau" driver. Unfortunately, the nouveau driver seemed to be hopelessly confused about what sort of display I had (it called it a "laptop display") and what resolution to use. The result was a largely unusable (but at least mostly visible) desktop.

Rolling way, way back


My hope was to roll back to the previous nVidia driver, but that hope was quickly dashed. I was able to run the device manager. (You have two ways to do this, depending on how good or bad the nouveau display is. One is to use the Mint menu to run "Device Manager", if you can. The other is to open a terminal and run "gksudo device-manager".) The device manager listed three possible drivers for me. The first was the multiple-expletives-deleted 304.132 nVidia driver, the second was the nouveau driver, and the third was the version 173.14.39 nVidia driver. So I picked the third, applied the changes and restarted.

This got me a fully functional desktop (at the correct resolution), but performance was less than stellar, as one might expect from that old a driver. There were noticeable lags between some operations (such as clicking the close control on a window) and their results (window actually closing). More importantly, if I suspended the machine, when I tried to resume I could not get the desktop back. So version 173 was not the permanent solution.

Rolling back just a little


I've mentioned the sgfxi script before. I tried running it, but it wanted to install the latest supported version, which was that nasty 304.132 thing. After screwing around for way too long, I discovered I could in fact roll back with the script.

The first step was to kill the X server, since sgfxi won't run while X is running. So I used control+alt+F2 to get to a terminal, logged in, and ran

sudo service mdm stop

to get rid of the display manager. That forced me to log in again, after which I ran

sudo su -
sgfxi --advanced-options | less

for two reasons. One was to find the correct option for specifying a particular driver version (it's -o). The other was to get a list of available versions, which appears near the top of the output.

I tried a few of the more recent ones listed, but was told either that they weren't supported (despite appearing in the list) or that some other package was the wrong version to mesh with them. Fortunately, 304.131 could be installed. I assume that was released immediately before the ill-fated 304.132. So once more unto the breach: running (as root)

sgfxi -o 304.131

worked. I was prompted to make a few decisions (one or two of which I simply guessed about), and I got one error message during a cleanup phase, but the script did install the driver and terminated. I rebooted and the system seems to be working normally. It doesn't feel sluggish any more, and returning from a nap is no problem.

The earlier purge had also removed the nvidia-settings package, so I used the Synaptic package manager to reinstall version 304 of that. It provides a graphical interface to adjust display settings, although so far the defaults seem to work just fine for me.

Now I just need to be sure never, ever, ever again to update that driver.