Saturday, June 30, 2012

The White Knight Returns

Back in May, on his excellent blog The Endeavor, John D. Cook posed (and subsequently answered) the following problem: if a knight begins in one corner of an empty chessboard and moves randomly, each time choosing uniformly over the available moves from its current position, what is the mean number of moves until it returns to the corner from which it started?

Several people posed exact or approximate answers in the comments. The approximate answers were based on simulation. The most elegant solution among the comments, posted by 'deinst', modeled the problem as a reversible Markov chain on a graph. John Cook's answer used a rather powerful theorem about a random walk on a graph, which I think boils down to the same logic used by 'deinst'. (Cook also includes Python code for a simulation.)

When I read the problem, my first reaction was also to model it as a Markov chain, but sadly I've forgotten most of what I knew about MC's, including the concept of reversibility. I did, at least, recall the essentials of first passage time. The question reduces to computing the expected first passage time from the corner square to itself. The Wikipedia link for first passage times is not entirely helpful, as it is not specific to Markov chains, but the concept is easy to explain. Let the corner in which the knight starts be position $s$, and let $\mu_j$ denote the expected number of moves required until the knight first returns to position $s$, given that it is currently at position $j$ (i.e., the expected first passage time from $j$ to $s$). The answer to the puzzle is $\mu_s$, the expected first passage time from $s$ to itself. Further, let $M_j$ denote the set of positions that the knight can reach in one move, given that it is currently at position $j$. Letting $p_{jk}$ denote the probability that a knight at position $j$ next moves to position $k$, the first passage times $\mu_\cdot$ satisfy the following system of linear equations:\[ \mu_{j}=1+\sum_{k\in M_{j}\backslash\{s\}}p_{jk}\mu_{k}\quad\forall j. \] In our case, $p_{jk} = 1/|M_j|$ if $k\in M_j$ and 0 otherwise. The solution to this system has $\mu_s=168$ (matching the answers of John Cook and deinst).

Other than a quick refresher on Markov chains, my main interest in this was that I'm teaching myself Python while also trying to get a handle on using Sage. So I wrote a Sage notebook to set up and solve the first passage time equations. Here is the content of that notebook, in case anyone is curious. (Please bear in mind I'm new to Python.)

Define the board

board = [(i, j) for i in range(8) for j in range(8)]

Assign a variable for the expected first passage time (to the lower left corner) from each square of the board

vlist = {b : var('r_' + str(b[0]) + '_' + str(b[1])) for b in board}

List the eight moves a knight can make

moves = [(i, j) for i in [-2, -1, 1, 2] for j in [-2, -1, 1, 2] if abs(i) + abs(j) == 3]

Define a function that takes a position and move and returns the new position if the move is legal, None otherwise

def move(position, change): 
    if (position in board) and (change in moves): 
        landing = (position[0] + change[0], position[1] + change[1])
        return landing if landing in board else None
        return None

Compute all legal moves from each board position

legal = {p : [move(p, m) for m in moves if not(move(p, m) is None)] for p in board}

Set up the first passage time equations

eqns = [vlist[b] == 1.0 + (1.0/len(legal[b]))*sum([vlist[x] for x in legal[b] if x != (0, 0)]) for b in board]

Solve the equations

sols = solve(eqns, vlist.values(), solution_dict = True)

Verify that the solution is unique

len(sols) == 1


Find the mean first passage time from position (0,0) to itself



One last note: I had originally intended to give the post a Christian Bale buzz by titling it "The Dark Night Returns". Turns out DC Comics beat me to that title. Oh, well.

Friday, June 29, 2012

The Linux - Nook Color Connection

My local public library now provides e-books (via a consortium), and I just wandered through the maze to download my first one.  I have a first generation Nook Color, so ePub format works fine ... if I can just get the book onto the device with digital rights management handled.  The consortium requires me to install Adobe Digital Editions on a PC, which then acts as an intermediary between the book source and any reading devices.  (ADE also acts as a reader on the PC.)  A friend of mine, a confirmed Windows user, has already been through the process, so I know it works.

Adobe, like a lot of other software publishers, supports Windows and Mac OS but not Linux.  <insert usual imprecations and rants> The good news is that ADE runs fine under Wine, once you get it configured correctly. The bad news is that (a) Adobe does their best to hide the fact that you can download a complete installer program -- their web site recognizes I'm using a Linux variant and tells me to sod off -- and (b) getting ADE running under Wine to recognize the Nook (connected to a USB port) is not entirely obvious.

Fortunately, when I started searching for information on the latter point, I came across the following complete, accurate and readable post by gorthx.  There's no need for me to expand on it; I'm just posting the link here to help boost its position in Google (and so that I don't lose track of it).

Monday, June 25, 2012

Birthing a Litter in CPLEX

In a previous post, I discussed (at a conceptual level) how to work around CPLEX's limitation of at most two children for any node in the search tree of a mixed-integer program. Following up on a question on one of their support forums, I'll add some implementation details and an example (using the Java API).

The example that I will use is a team assignment problem.  Given $N$ students (indexed $1\ldots N$), with student $i$ having a grade point average (GPA) equal to $g_i$ (on a scale from 0.0 to 4.0), form teams containing between $S_{min}$ and $S_{max}\ge S_{min}$ students on each team, so that the mean GPAs of the teams are as close as possible (minimal spread).

The use of mean GPA, combined with a variable team size (unless $S_{max}=S_{min}$), actually makes this a nonlinear model.  I'll finesse that by binary variables, in a Type 1 Special Ordered Set (SOS1), to control the size of each team.

Mathematical Model


  • $N$ = number of students to assign
  • $g_i\in [0.0,4.0]$ = GPA of student $i$
  • $S_{min}$ = minimum allowable size of any team
  • $S_{max}$ = maximum allowable size of any team
  • $T_{max}=\left\lfloor \frac{N}{S_{min}}\right\rfloor $ = maximum number of teams needed
  • $T_{min}=\left\lceil \frac{N}{S_{max}}\right\rceil $ = minimum number of teams needed


  • $x_{ij}$ = 1 if student $i\in \{1,\dots,N\}$ is assigned to team $j\in \{1,\dots,T_{max}\}$, else 0
  • $y_j$ = 1 if team $j\in \{1,\dots,T_{max}\}$ is used, else 0
  • $s_{jk}$ = 1 if team $j\in \{1,\dots,T_{max}\}$ has size $k\in \{S_{min},\dots,S_{max}\}$, else 0
  • $G_{max}$ = highest average (mean) GPA of any team created
  • $G_{min}$ = lowest average (mean) GPA of any team created
  • $G_j$ = sum of the GPAs of students assigned to team  $j\in \{1,\dots,T_{max}\}$


\[\min G_{max} - G_{min}\]


  • $\sum_{j=1}^{T_{max}} x_{ij}=1\quad\forall  i\in \{1,\dots,N\}$ (1)
  • $\sum_{i=1}^Nx_{ij}=\sum_{k=S_{min}}^{S_{max}}k*s_{jk}\quad\forall j\in \{1,\dots,T_{max}\}$ (2)
  • $\sum_{k=S_{min}}^{S_{max}}s_{jk}=y_j\quad \forall j\in \{1,\dots,T_{max}\}$ (3)
  • $y_j\le y_{j-1}\quad \forall j\in \{2,\dots,T_{max}\}$ (4)
  • $G_j=\sum_{i=1}^Ng_i x_{ij}\quad\forall j\in \{1,\dots,T_{max}\}$ (5)
  • $G_j \le k G_{max}+4S_{max}\left(1-s_{jk} \right) \quad \forall j\in \{1,\dots,T_{max}\}, k\in \{S_{min},\dots,S_{max}\}$ (6)
  • $G_j \ge k G_{min}-4k\left(1-s_{jk} \right) \quad \forall j\in \{1,\dots,T_{max}\}, k\in \{S_{min},\dots,S_{max}\}$ (7)

Constraint (1) ensures that each student is assigned to exactly one team. Constraint (2) simultaneously defines the $s$ variables (team sizes) and ensures that every team is either empty or has a size between $S_{min}$ and $S_{max}$. Constraint (3) simultaneously makes the $s$ variables for each team a SOS1 and ensures that the size is nonzero if and only if the team is used ($y_j=1$). Constraint (4) eliminates some symmetry in the model by "packing" teams (so that, if fewer than $T_{max}$ teams are used, the ones used are those with lowest indices). Constraint (5) computes the sum of the GPAs of students assigned to each team.

Finally, constraints (6) and (7) define the smallest ($G_{min}$) and largest ($G_{max}$) team GPAs in a linear manner. If $s_{jk}=1$, then team $j$ has size $k$, and (6) and (7) resolve to \[kG_{min}\le G_j \le kG_{max},\] or equivalently \[G_{min}\le G_j/k \le G_{max},\] where $G_j/k$ is the mean GPA of team $j$. If $s_{jk}=0$, (6) and (7) are both nonbinding; given the 0-4 grading scale, the right side of (6) becomes $k G_{max}+4S_{max}\ge 4S_{max}$ (the highest possible sum of GPAs on any valid team), while the right side of (7) becomes $k\left(G_{min} - 4\right)\le 0$.

Branching Strategy


To illustrate how to simulate more than two children at a node, I will adopt the following branching strategy. At the root node, I will branch on the number of teams being utilized (which must be between $T_{min}$ and $T_{max}$ inclusive), creating one child for each possible team count. At any other nodes, I will let CPLEX branch as it pleases. Note that I am not advocating this strategy as an effective method for problems of this nature; it is strictly a coding example.

Of course, I cannot literally create $T_{max}-T_{min}+1$ children at the root node, since CPLEX limits a node to at most children. So my branch callback will attach to each node a data object containing the fewest ($a$) and most ($b$) teams allowed at that node and its descendants.  When I branch on that node, I will bisect $\{a,\dots, b\}$ into $\{a,\dots,m\}$ and $\{m+1,\dots,b\}$, where $\m=\left\lfloor (a+b)/2 \right\rfloor$. If $a=m$ or $b=m+1$, I will create a "normal" child corresponding to that (with no attached data). If $a<m$ or $m+1 < b$, the corresponding child (which I will refer to as a "composite" node) will have the new range ($[a,m]$ or $[m+1,b]$) attached as node data, and will receive the same special treatment when the time comes to branch on it.

Suppose that $T_{min}=3$ and $T_{max}=7$. This is the top of the search tree we want:
desired search tree
The root node would be "composite", whereas all the children would be "normal". This, on the other hand, is the top of the tree we will generate (given CPLEX's two-child limit):
actual search tree
Here there are three "composite" nodes in addition to the root node; the five leaf nodes are "normal".

Java Code


Shown next is the main program. (My usual disclaimer applies: I'm an amateur coder, and I make no claims that this is optimized for anything.) The entire program can be downloaded here (Java source code, ~5 KB). [Note: This link was updated on 10/02/2014 to point to a new repository.]

There is nothing particularly special about the main program. The GPA is generated randomly; model parameters are set at the top of the main method. I set the absolute tolerance (EpAGap parameter) to 0.01; there is no point in splitting hairs when working with GPA data.

 * Example of how to generate > 2 child nodes in CPLEX.
 * Context: assigning students to teams to balance average GPA.
 * Team sizes are constrained, and the head counts of the largest and smallest
 * teams should differ by at most one.
package multichild;

import ilog.concert.IloException;
import ilog.concert.IloLinearNumExpr;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;
import java.util.Random;

 * @author Paul A. Rubin (
public class MultiChild {

   * Main program
  public static void main(String[] args) {
    // problem parameters
    int nStudents = 60;  // number of students to form into teams
    int minSize = 4;     // minimum team size
    int maxSize = 6;    // maximum team size
    int maxTeams = (int) Math.floor((1.0*nStudents)/minSize);   // max # of teams
    int minTeams = (int) Math.ceil((1.0*nStudents)/maxSize);  // max # of teams
    double[] gpa = new double[nStudents];
    // generate random GPAs
    Random gen = new Random(666L);
    for (int i = 0; i < nStudents; i++) {
      gpa[i] = Math.rint(200.0 * (1.0 + gen.nextDouble()))/100;
        // using Math.rint to get GPAs with exactly two decimal places
     * The MIP model will look like the following, where
     *   x[i][j] = 1 if student i is assigned to team j, else 0
     *   y[j] = 1 if team j is used, else 0
     *   s[j][k] = 1 if team j has size k (k = minSize, ..., maxSize), else 0
     *   gmax = highest composite GPA of any team
     *   gmin = lowest composite GPA of any team
     *   gsum[j] = sum of GPAs of students on team j
     * minimize gmax - gmin
     * s.t.
     *   sum_j x[i][j] = 1 for all i (assign every student once)
     *   sum_k k*s[j][k] = sum_i x[i][j] for all j (size of each team)
     *   sum_k s[j][k] = y[j] for all j  (size is SOS1; unused teams have size 0)
     *   y[j] <= y[j - 1] for all j > 0 (unused teams come at end of list)
     *   gsum[j] = sum_i gpa[i]*x[i][j] for all j (GPA sums)
     *   gsum[j] <= k*gmax + 4*maxSize*(1 - s[j][k]) for all j, k (defines gmax)
     *   gsum[j] >= k*gmin - 4*k*(1 - s[j][k]) for all j, k (defines gmin)
    try {
      IloCplex cplex = new IloCplex();
      // create the variables
      IloNumVar[][] x = new IloNumVar[nStudents][maxTeams];
      IloNumVar[] y = new IloNumVar[maxTeams];
      IloNumVar[][] s = new IloNumVar[maxTeams][maxSize];
      IloNumVar[] gsum = new IloNumVar[maxTeams];
      IloNumVar gmax = cplex.numVar(0.0, 4.0, "gmax");
      IloNumVar gmin = cplex.numVar(0.0, 4.0, "gmin");
      for (int i = 0; i < nStudents; i++) {
        for (int j = 0; j < maxTeams; j++) {
          x[i][j] = cplex.boolVar("x_" + i + "_" + j);
      for (int j = 0; j < maxTeams; j++) {
        y[j] = cplex.boolVar("y_" + j);
        gsum[j] = cplex.numVar(0.0, Double.MAX_VALUE, "gsum_" + j);
        for (int k = minSize - 1; k < maxSize; k++) {
          s[j][k] = cplex.boolVar("s_" + j + "_" + (k + 1) );
      // add the objective function
      cplex.addMinimize(cplex.diff(gmax, gmin));
      // add the constraints
      for (int i = 0; i < nStudents; i++) {
        cplex.addEq(cplex.sum(x[i]), 1.0); // single assignment, student i
      for (int j = 0; j < maxTeams; j++) {
        IloLinearNumExpr expr1 = cplex.linearNumExpr();
        IloLinearNumExpr expr2 = cplex.linearNumExpr();
        for (int k = minSize - 1; k < maxSize; k++) {
          expr1.addTerm(s[j][k], k + 1);
        for (int i = 0; i < nStudents; i++) {
          expr2.addTerm(x[i][j], 1.0);
        cplex.addEq(expr1, expr2); // size definition for team j
        for (int k = minSize - 1; k < maxSize; k++) {
          expr1.addTerm(s[j][k], 1.0);
        cplex.addEq(expr1, y[j]); // size is SOS1; unused teams have size 0
        if (j > 0) {
          cplex.addLe(y[j], y[j - 1]);  // used teams come first
        for (int i = 0; i < nStudents; i++) {
          expr1.addTerm(x[i][j], gpa[i]);
        cplex.addEq(expr1, gsum[j]);  // define GPA sums
        for (int k = minSize - 1; k < maxSize; k++) {
          int kk = k + 1;
          expr1.addTerm(gmax, kk);
          expr2.addTerm(gmin, kk);
          expr1.addTerm(s[j][k], -4*maxSize);
          expr2.addTerm(s[j][k], 4*kk);
          cplex.addLe(gsum[j], expr1); // upper bound on GPA sum
          cplex.addGe(gsum[j], expr2); // lower bound on GPA sum
      // set a tolerance so that we don't split hairs
      cplex.setParam(IloCplex.DoubleParam.EpAGap, 0.01);
      // attach a branch callback -- comment this out to run without
      // special branching
      cplex.use(new Brancher(minTeams, maxTeams, y));
      // solve the model
      if (cplex.getStatus() == IloCplex.Status.Optimal) {
        // fetch the solution
        double[][] xx = new double[nStudents][];
        for (int i = 0; i < nStudents; i++) {
          xx[i] = cplex.getValues(x[i]);
        double[] yy = cplex.getValues(y);
        double[] ggsum = cplex.getValues(gsum);
        // display the optimal solution
        System.out.println("\nAssign " + nStudents + " students into teams "
                           + "of size " + minSize + " to " + maxSize + ".");
        System.out.println("GPA spread is from " + cplex.getValue(gmin)
                           + " to " + cplex.getValue(gmax)
                           + " (a spread of " + cplex.getObjValue() + ")");
        System.out.println("CPLEX explored " + cplex.getNnodes() + " nodes.");
        for (int j = 0; j < maxTeams; j++) {
          if (yy[j] > 0.5) {
            int size = 0;
            for (int i = 0; i < nStudents; i++) {
              if (xx[i][j] > 0.5) {
                System.out.println(j + "\t" + i + "\t" + gpa[i]);
            System.out.println("\tOVERALL\t" + ggsum[j]/size + "\n");
      else {
        System.err.println("CPLEX returned status " + cplex.getStatus());
    } catch (IloException ex) {
      System.err.println("CPLEX pitched a hissy fit:\n" + ex.getMessage());
    } catch (Exception ex) {
      System.err.println("Could not construct the callback!\n" + ex.getMessage());

The information object attached to composite nodes is a simple container class that provides just the minimum and maximum number of teams at that node:
 * BranchInfo defines an object that can be attached to nodes and guides
 * branching decisions.
package multichild;

 * @author Paul A. Rubin (
public class BranchInfo {
  private int minTeams;  // minimum number of teams to use in all child nodes
  private int maxTeams;  // maximum number of teams to use in all child nodes
   * Constructor.
   * @param min minimum number of teams to be used
   * @param max maximum number of teams to be used
  public BranchInfo(int min, int max) {
    this.minTeams = min;
    this.maxTeams = max;

   * Get the maximum number of teams to create among child nodes.
   * @return maximum number of teams to create among child nodes
  public int getMaxTeams() {
    return maxTeams;

   * Get the minimum number of teams to create among child nodes.
   * @return minimum number of teams to create among child nodes
  public int getMinTeams() {
    return minTeams;


Finally, I present below the branch callback implementation. It contains a bunch of extra print statements just to illustrate how it is behaving. The callback is called at every node. At a "normal" node, where the number of teams is already fixed, it returns without doing anything (allowing CPLEX to branch as usual). At a "composite" node, it produces two children as described above (other than in the pathological case of $T_{min}=T_{max}$, in which case it produces a single "normal" child).

The callback detects a "composite" node by the presence of an information object attached (by an earlier call to the callback) to that node. The one exception is the root node, which we know is composite but to which we cannot attach an information object. I handle that by setting a flag in the callback (at the time of construction) saying that the next node encountered will be the root. The first time branching occurs (which must be at the root), the callback recognizes it as a "composite" node and clears the flag.
 * Custom branch callback creating 3+ children.
 * When branching at the root node, this will create one child node for
 * each possible count on the number of teams. At other nodes, it will
 * let CPLEX branch as usual.
package multichild;

import ilog.concert.IloException;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;
import java.util.ArrayList;

 * @author Paul A. Rubin (
public class Brancher extends IloCplex.BranchCallback {
  private int minTeams;  // minimum number of teams to create
  private int maxTeams;  // maximum number of teams to create
  private IloNumVar[] teamVars;  // boolean variables for team use
  private boolean atRoot;  // are we at the root node?
  private ArrayList<Double> bounds; // bounds for variables when branching
  private ArrayList<IloCplex.BranchDirection> dirs; // branch directions
  private ArrayList<IloNumVar> vars;  // variables involved in branching
  private double[] aBds;  // bounds as doubles
  private IloCplex.BranchDirection[] aDirs;  // directions as array
  private IloNumVar[] aVars;  // variables as array

   * Constructor.
   * @param minTeams  minimum number of teams to create
   * @param maxTeams  maximum number of teams to create
   * @param v         boolean variables for team use
   * @throws an exception if the variable array has the wrong dimension
  public Brancher(int minTeams, int maxTeams, IloNumVar[] v) throws Exception {
    this.minTeams = minTeams;
    this.maxTeams = maxTeams;
    this.teamVars = v;
    // make sure the variable vector has the correct dimension
    if (v.length != maxTeams) {
      throw new Exception("Variable array dimension " + v.length
                          + " does not match maxTeams = " + maxTeams);
    this.atRoot = true;

   * The actual callback operation.
   * @throws IloException if CPLEX gets miffed
  protected void main() throws IloException {
    BranchInfo info = null;
    Object raw;
    String msg = ">>> At node " + getNodeId() + ", ";
    IloCplex.NodeId id;
    // check whether we are at the root node (which will have no attachment)
    if (atRoot) {
      raw = new BranchInfo(minTeams, maxTeams);  // create an attachment
      atRoot = false;  // clear the root flag
    else {
      raw = getNodeData();  // get the current node's attachment, if any
    if (raw == null) {
      // no data object -- let CPLEX handle branching
    else if (raw instanceof BranchInfo) {
      // convert the object to an instance of BranchInfo
      info = (BranchInfo) raw;
    else {
      // unknown node data type -- should never happen
      System.err.println("Encountered unknown node data type!");
    // the presence of node data means we are branching on team count
    // if the range of team sizes at this node is one or two, just create
    // normal children; otherwise, bisect the range of counts and create
    // two "special" children
    int min = info.getMinTeams();
    int max = info.getMaxTeams();
    msg += "team range is " + min + " to " + max;
    double obj = getObjValue();  // current node bound
    switch (max - min) {
      case 0: // create a single child with exactly this many teams
        // this case should only occur (at the outset) if minTeams = maxTeams
        cut(min, max);
        id = makeBranch(aVars, aBds, aDirs, obj); // attach no data
        msg += " -- adding one normal child (" 
               + id + ") with exactly " + min + " teams";
      case 1: // create two normal children
        // first child contains exactly min teams, no attached data
        cut(min, min);
        id = makeBranch(aVars, aBds, aDirs, obj); // attach no data
        msg += " -- adding two normal children (" + id + " using exactly "
               + min + " teams and ";
        // second child contains exactly max = min + 1 teams, no attachment
        cut(max, max);
        id = makeBranch(aVars, aBds, aDirs, obj); // attach no data
        msg += id + " using exactly " + max + " teams)";
      case 2: // create one normal child and one composite child
        // first child uses min to min+1 teams
        cut(min, min + 1);
        // attach node info to this child
        info = new BranchInfo(min, min + 1);
        id = makeBranch(aVars, aBds, aDirs, obj, info);
        msg += " -- adding one composite child (" + id + ") with range "
               + min + " to " + (min + 1);
        // second child uses exactly max = min + 2 teams, no attached data
        cut(max, max);
        id = makeBranch(aVars, aBds, aDirs, obj);
        msg += " and one normal child (" + id + ") using exactly "
               + max + " teams";
      default: // max > min + 2: create two composite children
        int mid = (max + min)/2;  // midpoint of size range (rounding down)
        // first child uses min to mid team count
        cut(min, mid);
        info = new BranchInfo(min, mid);
        id = makeBranch(aVars, aBds, aDirs, obj, info);
        msg += " -- adding two composite children (" + id
               + " with range " + min + " to " + mid;
        // second child uses mid + 1 to max team count
        cut(mid + 1, max);
        info = new BranchInfo(mid + 1, max);
        id = makeBranch(aVars, aBds, aDirs, obj, info);
        msg += " and " + id + " with range " + (mid + 1) + " to " + max + ")";
   * Sets up the arguments for a cut that ensures somewhere between min and
   * max teams are used.
   * @param min minimum number of teams to use
   * @param max maximum number of teams to use
  private void cut(int min, int max) {
    this.bounds = new ArrayList<Double>();
    this.dirs = new ArrayList<IloCplex.BranchDirection>();
    this.vars = new ArrayList<IloNumVar>();
    for (int j = 0; j < maxTeams; j++) {
      if (j < min) {  // force the first min variables to be 1
      else if (j >= max) {  // force variables above max to be 0
    // convert the lists to arrays
    aVars = vars.toArray(new IloNumVar[0]);
    aDirs = dirs.toArray(new IloCplex.BranchDirection[0]);
    // unavoidable PITA -- bounds must be converted from Double[] to double[]
    // with a loop
    Double[] temp = bounds.toArray(new Double[0]);
    aBds = new double[temp.length];
    for (int i = 0; i < temp.length; i++) {
      aBds[i] = temp[i];

Related post: "Multiple Children - Again!" (updated code, changing the composite nodes to clones of their parents)

Friday, June 22, 2012

The Normal Density Is Not A Fractal

Early in May, my radio alarm awoke me to NPR news doing a story about a recently published paper [1]. The main point of the paper seemed to be that productivity of people in several fields (academic research, professional and high-level amateur athletics, entertainment, politics) does not seem to follow a Gaussian (normal) distribution, but rather follows a Pareto (power) distribution. Perhaps the primary differences are that a small number of individuals generate a disproportionate share of the output, and that a high proportion of individuals have "below average" output (median well below mean).

My first reaction was "No fooling!" (Actually, "fooling" is an editorial substitution. I can be a bit cranky when I'm torn from the arms of Morpheus.) Hot on that first reaction was this: "The normal density is not a fractal." Specifically, the right tail of a bell curve is not itself a bell curve ... and if I can recognize that while half asleep, it must be fairly obvious.  So where's the beef?

The first figure below shows density functions for Gaussian (red) and Pareto (blue) distributions. The second figure shows a plausible Gaussian distribution for athletic ability among the overall population. Professional athletes, or Division I collegiate athletes, presumably fall in the right tail (shaded). The right tail bears considerably more resemblance to the Pareto distribution in the first figure than it does to a Gaussian distribution.

Gaussian and Pareto distributions Bell curve for athletic ability

The research documented in the paper covered five studies (one for each of the fields I listed above) using 198 samples (a variety of performance measures for each field, or subsets of each field), involving a total of 633,263 individuals. As with any statistical study, you can question methods, sample definitions, interpretation of results etc. I've read the paper, and I find the evidence fairly compelling, particularly as it confirms my initial intuition (expressed in my waking reaction).

I'm not sold on Pareto as the correct choice among one-tailed distributions, and I'm not sold on single tail distributions in general.  Consider, for instance, a performance measure on a [0,1] scale, such as career batting average for baseball players.  The bounded domain precludes a long, thick tail on either side.  A friend of mine did in fact download and fit some batting average data.  I won't reproduce it here, since I'm not sure how his sample was defined (all players or just selected ones), but the histogram he showed me was a lot closer to Gaussian than to Pareto.

That said, it seems intuitive to me that if performance correlates strongly to ability, if ability has a roughly Gaussian distribution, and if there is a selection mechanism in play that selects the most able to compete, then performance among the competitors will not be Gaussian.  This needs to be qualified in a variety of ways, including the following:
  • Not everyone with high ability may choose to compete.  Athletes may find it more lucrative (and less dangerous) to act in adventure films; people capable of great research may find it more lucrative to work in industry (where opportunities to publish are greatly diminished).
  • Not everyone with high ability may have the opportunity to compete.  Some star athletes are forced to retire prematurely, or to abandon hope of starting a professional career, due to health concerns.  Potential stars in any field may go undiscovered due to where they live or what schools they attend.  At the risk of creating a distraction by introducing a somewhat charged topic, women or members of an ethnic (here) or religious (elsewhere) minority may be excluded or discouraged from entering the competition, regardless of their ability.
  • Nepotism (particularly but not exclusively in the entertainment field) may introduce competitors who do not reside in the right tail of the ability distribution.  To a lesser extent (I think), diligence, hard work or "heart" may allow some people with above median but less than excellent athletic ability to compete in high level events.
  • Productivity is not always a monotonic function of talent.  A very talented wide receiver may not catch many balls if he is on a team with a relatively poor quarterback, a strong running game, and/or a stellar defense (allowing the team to practice a conservative offense).  Conversely, a modestly talented receiver on a team with no running game and a poor defense (so that they are perpetually playing catch-up) may catch a disproportionate number of passes.  Similar things can happen to an academic with great research potential working at a "teaching school", or in a department lacking resources and not providing colleagues or doctoral students with whom to collaborate.  (Conversely, some faculty are adept at riding the coattails of their more productive colleagues, showing up as coauthors of all sorts of papers.)
Disclaimers aside, suppose that we accept the central premise of the paper, that in many cases productivity looks more like a power distribution than like a bell curve.  So what?  Here's the point (and the reason that my second reaction, the "not a fractal" comment, was perhaps a bit unjust in evaluating the significance of the paper).  Probably the most common things I see in academic studies involving any sort of performance measure are F-tests of the statistical significance of groups of terms and t-tests of the significance of individual terms in (usually linear) regression models.  Both those tests are predicated on normally distributed residuals.  They are somewhat "robust" with respect to the normality assumption, which is a hand-wavy way of saying "we're screwed if the buggers aren't normal, unless we have an infinite sample size, so we'll just call our sample size close enough to infinite and forge ahead". If the residuals are not sufficiently close to Gaussian, and the sample size is not large enough, F- and t-tests may induce falsely high levels of confidence.

Now it is not the case that the response variable (here, the performance measure) need be normally distributed in order for the residuals to be normally distributed.  Unless ability is adequately covered by the explanatory variables, though, the effects of ability will be seen in the residuals, and if the distribution of ability among the sample (those who likely were chosen at least in part on the basis of ability) bears any resemblance to a Pareto distribution, it seems fairly unlikely that the residuals will be normally distributed ... and fairly risky to assume that they are.  Some academic papers cite specific tests of the normality of residuals, but in my experience it is far from a universal practice.

The authors of the paper point out a second issue related to this.  Extremely high values of performance are more likely to occur with Pareto distributions than with Gaussian distributions.  Some (many?) authors, taking normality for granted, treat extreme values as outliers, assume the observations are defective, and "sanitize" the data by excluding them.

So consumers of academic papers studying performance may be buying a pig in a poke.

[1] O'Boyle Jr. and Aguinis, "The Best and the Rest: Revisiting the Norm of Normality of Individual Performances." Personnel Psychology 65 (2012), 79-119.

Wednesday, June 20, 2012

MythTV versus Media Center

I'm still getting the hang of MythTV, but I think I have enough experience to make a reasonably informed comparison to Windows Media Center (WMC), which I used for years. A quick but important disclaimer is that, while I used the same hardware with both systems (other than an upgraded tuner that plays no role in the comparison), MythTV runs on the latest version of Mythbuntu whereas WMC ran on Windows Vista. I would expect the latest WMC, running on Windows 7, to be more stable and possibly have an improved feature set.

Where WMC has the edge


  • Setup is much easier than with MythTV. I've documented some of my experiences setting up MythTV in other posts (click "MythTV" in the tag cloud, lower right, to find them). Familiarity with Linux (preferably Ubuntu), good skills using Google and nerves of steel are pretty much requirements (as might be the patience of Job). In contrast, with WMC you tell Windows whether you're using an antenna, cable or whatever, identify your signal provider, and you're off to the races.
  • Data for the television guide seemed to download and install more rapidly (although this is a rather subject perception measure, and the guide source was different).
  • The guide is free (or prepaid, if you prefer to look at it as bundled into the cost of the software). With MythTV, I pay a modest annual fee for guide data.  (This is actually not an issue for me, since I was buying that data even when I used WMC, for reasons I'll articulate below.)
  • The automatic download of guide data may be a bit more reliable than with MythTV (where I've already seen one instance in which the scheduled replenishment of guide data failed silently).

Where MythTV has the edge

  • MythTV records shows in a reasonably standard format (MPEG-4, or something quite close to it). That means the recordings can be played by other software and can be transcoded. WMC, on the other hand, uses a rather nonstandard format (DVR-MS). Web searches suggest that it is possible to transcode them, but my first two tries ended in failure. Since VideoLAN's VLC player can play them, and since I have no interest in retaining the recordings once I've watched them, I gave up trying to transcode my accumulated DVR-MS files.
  • MythTV can wake the PC from a  G3 powered off state to record programs (assuming, of course, that the PC remains plugged into a live outlet). WMC, to the best of my knowledge, could only wake the PC from an S3 standby (sleep) state. I use the PC in question solely for recording and viewing television. Not having to leave it in sleep mode saves power, but perhaps more importantly it insulates me against power failures while I'm away on a trip. With WMC, if power happened to fail while I was away, the PC remained off once power was restored, and WMC could not wake it to record shows.
  • I can schedule recordings by looking up the name of the show or by jumping to a date and time in the guide. WMC required me to open the guide (to the current date/time) and then scroll to the date and time of the show I wanted to recorded. I found it surpassingly strange that it had no option (or at least none I could find) to jump to a specific date/time.
  • I can control how many days the guide contains. WMC downloaded however much data it wanted to download. In a few instances, when planning for a trip, I had to wait until the last minute to program recordings because dates late in the trip had not yet materialized in the guide. Although it never happened to me (that I recall), presumably on a long enough trip one would be barred from scheduling some recordings. (This "feature" is why I was already subscribed to the guide data that I ended up using in MythTV.)
  • I have yet to see an analog to the Black Screen of Obfuscation (BSoO).

Regarding the last point, the BSoO was apparently a relative of the Blue Screen of Death (BSoD). This may be a Vista-specific "feature". Occasionally, when I opened WMC, I would encounter an utterly featureless black screen (maximized). Blindly clicking in the upper right corner would kill WMC, so the screen was actually alive and populated; it was just not painting. "Flipping" also worked (that's alt-tab task switching, not flipping the bird -- although I certainly tried the latter more than once). The truly annoying part, though, was that if you killed WMC during a BSoO and then restarted it, you were guaranteed to get another BSoO. This was also true if you killed it using the task manager. Once you saw a BSoO, your only remedy was to reboot the PC -- which could be rather inconvenient if you were trying to record live TV (particularly as Vista set no speed records for rebooting).

I think it was the BSoO that finally convinced me to brave MythTV. So far, so good with that change.

Friday, June 15, 2012

Migrating POPFile

For years now, I've been using POPFile (an open source, cross-platform Bayesian classifier) to filter spam. Once trained, it seems to be more accurate than what my former employer used (or at least what they were using when I installed POPFile).

When I switched my university mail account from POP3 to IMAP, I also switched to the IMAP module in POPFile. It's pretty sweet. Whereas POPFile, like most filters, inserts itself between the mail client and a POP3 mail server (mail client polls POPFile, which in turn polls the server), with IMAP, POPFile works in parallel with the mail client (POPFile polls the server and moves spam-o-grams into selected folders, mail client directly polls the server). This has several advantages:
  • it simplifies configuration a bit, especially when you use different mail clients on different platforms;
  • you can train it just by dragging messages from the folder POPFile put them in to the folder you want them in;
  • if you have POPFile running nonstop (say, on a server), you get spam filtering even when you are using a mail client on a platform where POPFile is not installed (in my case, an Android tablet); and
  • if POPFile should go down or fail to start, your mail client is not disconnected from the server.
The only real issue I see with the IMAP module is that it can only filter one server. I can live with that (particularly as I have no choice.)

POPFile can do more than filter spam, though.  You can teach it to distinguish personal messages from business messages, or sales offers from sports updates, and route messages to specific folders. I highly recommend it.

I had POPFile running on my office PC (Linux Mint), but with my retirement that machine is no longer mine to abuse. So I installed POPFile on my home PC (also Linux Mint). A slightly less than current version of POPFile can be installed effortlessly through Synaptic (the package manager).

Having used it for so long, I have POPFile highly trained, and I was loath to start from scratch. There are pretty good instructions for copying the corpus (database) to another machine, particularly here, but I also copied the configuration file, and somehow ran into a problem getting POPFile to start. If I started it manually using sudo /usr/share/popfile/, the daemon would run. The autostart script, however, failed silently, even if I ran it manually (sudo /etc/init.d/popfile start). A bit of digging indicated that it was complaining about listening on port 110. On the one hand, this makes sense. When you install POPFile, it creates a special user and group with the name popfile, which owns the program file. Even if you run the init.d start script as root, it actually starts the server using the user group popfile. On Linux systems, I think that only the root account has server access to privileged ports (including 110), so the popfile account would not be allowed to hang a server on that port. What did not make sense is that POPFile would be trying to do so, since it was configured (both by default and in the configuration file I was copying over) to listen on nonprivileged port 7070. Even turning off POP3 monitoring completely did not fix the problem. Also, instructions that include copying the configuration file to the new machine do not warn you that it will copy the administrative password (encrypted) and the host name (which will now be incorrect, unless the new machine happens to have the same name as the old one).

So, to jog my memory the next time around, here is the step-by-step process that worked. On Windows or Mac OS, the file names should be the same, but paths will be different.
  1. Install POPFile on the new machine (via the package manager).
  2. Access the control panel on the new machine (point a browser at and shut POPFile down from there. This creates a configuration file the first time you do it. Try to reload just to make sure the server is down.
  3. Copy popfile.cfg (configuration) and popfile.db (corpus) from /var/lib/popfile on the old machine to a temporary directory on the new one. (This will require root access on the old machine.)
  4. As root, move the copy of popfile.db to /var/lib/popfile on the new machine, overwriting the file installed by Synaptic.
  5. On the new machine, run sudo chmod o+rw /var/lib/popfile/popfile.cfg to make the configuration file writable.
  6. Using a file merging program (I like meld), compare the configuration files from the old and new machines, and merge the IMAP settings (lines starting imap_...) from the old configuration into the new one. Save the new configuration file. (Don't worry about resetting the permissions on it; POPFile will do that automatically.)
  7. On the new machine, run sudo /etc/init.d/popfile start, then browse to and verify that POPFile successfully started.

Thursday, June 14, 2012

Decline to Review

The three pillars of academic life are research, teaching and service. (We don't consider endless faculty meetings to be a "pillar"; they're more like a pillory.) One aspect of service is performing (uncompensated) reviews for journals. I've gotten my share of requests, and usually accept the assignment. In fact, I'm working on one now. Sometimes, though, I just have to turn one down, as the following redacted exchange demonstrates.

Reviewer response form:

Editor's reply:

There's more to the story, but the rest will cost you a beer at some conference.

Sunday, June 10, 2012

MythTV Wakeup

My MythTV installation remains a work in progress, but it is coming along. (For other installments in this saga, click "MythTV" in the label cloud at the bottom right of this or any post.) One source of confusion has been the wake-up feature, which I think I've got partially licked.

Instructions for configuring MythTV to wake the PC when a recording is scheduled can be found several places, including here. Following a similar but slightly different set of instructions (whose location I've since forgotten), I tested my BIOS, created the script and tested that, configured both the back end and mythwelcome, and then scheduled some test recordings.  The script worked fine when run (with administrator privileges) from a command prompt, but I found that the PC woke up at times it was not supposed to, and failed to wake up when recordings were scheduled.

Long story short, the key was in the mythwelcome setup (accessed from a terminal via mythwelcome --setup).  There is a field labeled nvram-wakeup Restart Command, and for some reason it was populated by default.  The instructions on the MythTV wiki state that this must be blank, and they're not kidding.  Clearing this field at least partially fixed the wake up issues.  MythTV now wakes the PC (from an S5 state, i.e., powered off) via a BIOS alarm and successfully handles scheduled recordings.

Two wakeup issues remain:
  • I find the PC booted and running at times when nothing is scheduled (and I have no indication why it powered up); and
  • the back end is supposed to wake the machine between 0200 and 0500 local time to download program listings, but that appears not to be happening, at least not regularly.
Update: I fixed the first problem (unscheduled boots). Details are posted here.