Improving the Sudoku Solver


Image: Graniers (Flickr / creative commons)

I had planned to write about something else, but the poor efficiency of the list-based Sudoku Solver keeps nagging me. I think I need to address that before venturing into other topics.

The list-based Sudoku solver was my idea of the “simplest thing that could work” beyond the brute-force approaches. Since the board was maintained as a flat list of cells, every access meant iterating through all 81 elements and filtering out whatever was needed, which added up to a lot of unnecessary work.

In the previous post, I sketched a couple of ideas for improvement. Although parallel computation might make things happen faster, it does not make the approach any more efficient. I would prefer to tackle the problem with data structures and algorithms. Since my efforts at elegant tail-recursive solvers have remained inelegant, the list of improvements boils down to two: a two-dimensional array as the data structure and more elaborate pruning.

The array-based Sudoku board makes use of the array’s indexed positions when accessing cells, rows, columns, etc. (see the complete board class here).

case class ArrayBasedBoard(grid: Array[Array[Cell]]) {

  override def cell(x: Int, y: Int): Cell = grid(y)(x)

  override def row(y: Int): List[Cell] = grid(y).toList

  override def col(x: Int): List[Cell] =
    grid.map(r => r(x)).toList

  ...

The basic array in Scala is mutable, as it maps directly onto Java’s native arrays. Although mutability goes against the grain of functional programming, I’ll make blatant use of it. For instance, when guessing, i.e., trying out a value for a cell from its set of candidates, I make a copy of the board and then assign the guessed value to the new grid. The copies are needed because I might have to backtrack if the recursive iteration runs into a dead end.

def +(c: Cell): ArrayBasedBoard = {
  val newGrid = gridCopy
  newGrid(c.y)(c.x) = c
  ArrayBasedBoard(newGrid)
}

private def gridCopy: Array[Array[Cell]] = {
  grid.map { row =>
    val newArray = new Array[Cell](row.size)
    row.copyToArray(newArray)
    newArray
  }
}

I used the first 100 samples from the University of Warwick’s msk_009 dataset, which contains mostly “easy” puzzles, as a quick touchstone. The new data structure alone seemed to yield an improvement of about 20% over the list-based approach: from about 1300 ms down to about 1000 ms.

However, a bigger improvement arose from the changes in pruning. In addition to the pruning based on neighboring values described in the previous post, the candidates of each cell can be pruned based on neighboring candidates. Consider the puzzle below, for example. After eliminating the redundant candidates from each cell based on the initial values (large dark gray numbers), some cells have been solved (large light blue numbers), but this simple pruning cannot go any further. To proceed, we would have to make a guess.

[Figure: example puzzle with candidate lists after pruning by neighboring values]

However, if the same pair of candidates occurs exactly twice in the same row, column, or block, that pair cannot occur anywhere else in that row, column, or block. Making use of this rule, we can continue pruning and avoid guessing.

In the lower left corner, the two cells with a green background have identical candidate lists (small red numbers) comprising 2 and 3. Since 2 and 3 must occur in those two cells, they cannot occur anywhere else in that column, so we can eliminate them from the other candidate lists of the column. Since both cells also lie in the same bottom-left 3 x 3 block, we can extend the pruning to that block. The area of effect is colored light blue.

There is another pair in the top-right corner: 4 and 6 occur in two cells that share both a row and a block. That pair can be used to prune the cells colored light yellow.

The pruning leads to the following outcome. Three further cells have been solved (green background), and the candidate lists in the areas of effect have been pruned. As new cells have been resolved, the “normal” pruning based on neighboring values can pick up from here, followed by another round of candidate-pair pruning, and so on. Quite often, or should I say typically, the Sudoku puzzles in newspapers can be solved with pruning alone.

[Figure: the same puzzle after pair elimination; three further cells solved and the affected candidate lists pruned]

This sort of pair-elimination pruning (see the function eliminatePairs below) involves two phases: finding the pairs and pruning the candidate lists. The function twiceOccurringPairs is given the cells of a row, a column, or a block as a parameter, and it finds the pairs of candidates that occur exactly twice. Scala provides quite elegant tools for this kind of collection manipulation.

def twiceOccurringPairs(cells: Traversable[Cell]): Set[Set[Int]] = {
  cells.filter(_.cands.size == 2).
    groupBy(_.cands).map(x => x._1 -> x._2.size).
    filter(_._2 == 2).map(_._1).toSet
}

def eliminatePairs(cells: Traversable[Cell]): List[Cell] = {
  twiceOccurringPairs(cells).flatMap { pair =>
    cells.collect {
      case cell if (cell.value.isEmpty &&
              cell.cands != pair &&
              cell.cands.diff(pair) != cell.cands) =>
        Cell(cell.x, cell.y, cell.cands.diff(pair))
    }
  }.toList
}

The elimination boils down to producing a list of cells for which the candidate list has been pruned. Again, placing those cells back on the board exploits the mutability of the array.

eliminatePairs(board.col(y)).foreach { cell =>
  board.grid(cell.y)(cell.x) = cell
}
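
The snippet above handles a single column; to apply the rule everywhere, eliminatePairs has to be run over every row, column, and block. Below is a minimal sketch of such a pass; the block accessor and the driver function itself are my assumptions, not code from the solver.

// Sketch of one full pair-elimination pass over the mutable array-based board.
// Assumes an accessor block(bx, by) that returns the cells of the 3 x 3 block at
// block coordinates (bx, by); the actual solver may organize this differently.
def eliminatePairsOnBoard(board: ArrayBasedBoard): Unit = {
  val units: Seq[List[Cell]] =
    (0 until 9).map(board.row) ++
    (0 until 9).map(board.col) ++
    (for (bx <- 0 until 3; by <- 0 until 3) yield board.block(bx, by))

  units.foreach { unit =>
    eliminatePairs(unit).foreach(cell => board.grid(cell.y)(cell.x) = cell)
  }
}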

Anyway, this pruning works for pairs of identical candidate lists. A similar rule applies to identical triplets occurring three times in the same row, column, or block.
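
For what it is worth, the detection part generalizes with little effort. Below is a sketch, not part of the current solver, of a generalized twiceOccurringPairs that finds candidate sets of size n occurring in exactly n cells.

// Generalization of twiceOccurringPairs: candidate sets of size n that appear as the
// complete candidate list of exactly n cells in the given row, column, or block.
// n = 2 gives the pair rule above; n = 3 would cover the triplet rule.
def nTuplesOccurringNTimes(cells: Traversable[Cell], n: Int): Set[Set[Int]] = {
  cells.filter(_.cands.size == n).
    groupBy(_.cands).
    filter(_._2.size == n).
    keySet
}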

The presented pruning improves the efficiency considerably. Each eliminated candidate pair means one guess avoided, i.e., one branch cut from the search tree. Using the 100 samples from the msk_009 dataset, the pruning cuts the average processing time down to a third: from about 1000 ms to about 300 ms.

The situation is similar with the University of Warwick’s dataset of 95 hard Sudoku puzzles. The list-based solver took about 51 seconds per puzzle on average, while the array-based solver with the new pruning takes about 17 seconds (a 67% improvement). The median time fell from about 5 seconds to 1 second (an 80% improvement).

The complete source can be found on GitHub.

Sudoku solver in Scala


Image: Neha Viswanathan (Flickr / creative commons)

I have worked out quite a few Sudoku puzzles on paper, but there has always been a sense of futility in the effort. It’s not that I’m against wasting time; rather, instead of solving these Japanese number puzzles one by one, I have felt I should be working on a general solution, a program that can solve any Sudoku puzzle. I think the time has come to get this problem out of the way.

Sudoku is the perfect programming finger exercise: it has simple rules, there is no need for complicated data structures or integration with external interfaces, and it involves no large amounts of data. You can attack it with brute force or try out something more elaborate. There may be regular competitions, I don’t know, but it certainly seems I am not alone: while looking for evaluation data, I noticed that Peter Norvig, among others, had posted a Sudoku solver apparently out of pretty much the same motives.

There is probably a wealth of Sudoku strategies and heuristics explained somewhere. Nevertheless, this time I am approaching the exercise with no background work. I’ll work out a simple Sudoku solver in Scala impromptu.

A Sudoku puzzle looks something like this: you have a 9-by-9 board, or grid, partitioned into 3-by-3 blocks, with some numbers filled in. The purpose is to fill in the blanks with numbers from 1 to 9 so that each number occurs exactly once in every row, every column, and every 3-by-3 block.

[Figure: an example Sudoku puzzle]

With the general solution in mind, the first thing to decide is how to represent the puzzle, or the board. I’ve chosen the simplest structure I could come up with: a list of individual cells (I’m hoping this will make the recursion easier). Each cell is defined by its coordinates and a set of candidate values, i.e., all values between 1 and 9 that are still valid for the cell.

case class Cell(x: Int, y: Int, cands: Set[Int]) {
  def value: Option[Int] = {
    if (cands.size == 1)
      cands.headOption
    else
      None
  }
}

The board is a bit more elaborate. Since it is based on a list of cells, each access to a row, a column, or a block requires a full scan of the 81 elements in the list. This is the downside of a simple data structure.

case class Board(cells: List[Cell]) {
  def row(y: Int): List[Cell] = {
    cells.filter(_.y == y)
  }
  def col(x: Int): List[Cell] = {
    cells.filter(_.x == x)
  }
  ...
}

The neighborhood of a cell consists of the row, the column, and the block of the cell. In the board below, the neighborhood of the red cell is illustrated by the yellow row and column together with the pinkish block.

[Figure: the neighborhood of a cell, i.e., its row, column, and block]

The idea is that by imposing the constraints, i.e., what the neighboring cells already contain, we prune the list of candidates for each cell. In the example above, the list of possible values for the red cell cannot contain 2, 3, 4, 5, 7, 8, or 9, which leaves us with only 1. And so on. When all the candidate lists are pruned down to one element, the puzzle is finished.

Sometimes, however, pruning is not enough, and we have to guess. We try out one candidate for a cell and continue from there. Sometimes guesses lead to dead ends, and the board becomes inconsistent, having the same number twice in a row, a column, or a block. In such a case, we have to backtrack and guess again.
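
Both the pruning and the iteration functions below lean on a few Board methods that the snippets here leave out: the neighborhood of a cell and the validity and completeness checks. The full class is on GitHub; what follows is only a minimal sketch of how they might look for the list-based board.

// Sketch of the elided helpers (members of case class Board); coordinates run from 0 to 8.
def block(x: Int, y: Int): List[Cell] =
  cells.filter(c => c.x / 3 == x / 3 && c.y / 3 == y / 3)

def neighborhood(x: Int, y: Int): Set[Cell] =
  (row(y) ++ col(x) ++ block(x, y)).filterNot(c => c.x == x && c.y == y).toSet

// A group of cells is consistent if no resolved value occurs in it twice.
private def noDuplicates(cs: List[Cell]): Boolean = {
  val values = cs.flatMap(_.value)
  values.distinct.size == values.size
}

def isValid: Boolean =
  cells.forall(_.cands.nonEmpty) &&
    (0 until 9).forall(i => noDuplicates(row(i)) && noDuplicates(col(i))) &&
    (for (bx <- 0 until 3; by <- 0 until 3) yield block(bx * 3, by * 3)).forall(noDuplicates)

def isFinished: Boolean = isValid && cells.forall(_.value.nonEmpty)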

In my simple solver, prune is a tail-recursive function. It goes through all the cells, eliminating candidates based on each cell’s neighborhood. If pruning increases the number of resolved cells without producing any empty candidate lists, we prune further. If no improvement was attained, or if pruning was too vigorous and some cell ran out of candidates, we return the board as is.

@tailrec
def prune(board: Board): Board = {
  val countBefore = board.cells.count(_.value.nonEmpty)
  val prunedCells: List[Cell] = board.cells.map { cell =>
    val prunedCandidates = if (cell.value.nonEmpty)
      cell.cands
    else
      cell.cands.diff(board.neighborhood(cell.x, cell.y).
        flatMap(_.value))
    Cell(cell.x, cell.y, prunedCandidates)
  }
  val countAfter = prunedCells.count(_.value.nonEmpty)
  val empties = prunedCells.count(_.cands.isEmpty)
  if (countBefore < countAfter && empties == 0)
    prune(Board(prunedCells))
  else
    Board(prunedCells)
}

The actual iteration takes place in another recursive function, iterate, which receives a board as a parameter. If the board is invalid, for instance because some cells have been pruned empty of candidates, we stop in failure. On the other hand, if the board contains a complete solution, we return it. Otherwise, we go through the unassigned cells (starting with the one with the shortest candidate list) and try out candidates until we find a solution (or fail completely).

def iterate(board: Board): Option[Board] = {
  if (!board.isValid) {
    None
  } else if (board.isFinished) {
    Some(board)
  } else {
    prune(board).cells.filter(_.value.isEmpty).
      sortBy(_.cands.size).map { cell =>
      cell.cands.map { c =>
        val result = iterate(prune(board +
          Cell(cell.x, cell.y, Set[Int](c))))

        if (result.nonEmpty && result.get.isFinished)
          return result
      }
      return None
    }
    None
  }
}

I am not quite happy with this function. It is inelegant and certainly not tail-recursive. Nevertheless, it gets the job done. Let’s see how well.

The University of Warwick hosts datasets for benchmarking automatic Sudoku solvers. Trying out their top95 dataset of 95 hard Sudoku puzzles, my program yields an average response time of about 51 seconds, with the median at 5 seconds and the maximum at friggin’ 15 minutes! It is faster than the pen-and-paper approach but far from competitive. It is something of a consolation that all the hard Sudokus were solved: the program works correctly, if sluggishly.

To improve the efficiency I could:

  • replace the list-based representation of the board with a two-dimensional array for faster access to cells, rows, and columns,
  • adopt more effective pruning methods (e.g., if two cells on the same row/column have the same two candidates, those candidates can be eliminated from the rest of the neighborhood),
  • parallelize the computation, and
  • rewrite the iteration as a loop or a proper tail-recursive function.

And so, solving the initial problem of finding a general solution for Sudoku puzzles transforms into a new problem of finding a more efficient general solution. Sigh.

The source code (slightly refactored) can be found on GitHub.

One learns only from mistakes


Image: National Museum of the U.S. Navy (Flickr / public domain)

When giving my first presentation at an international conference as a young postgraduate, I was asked some tough questions about my work. The critique certainly had its merits, but the tone was a little harsh. The professor — I was told afterwards — was a believer in trial by fire, that is, testing the students under pressure. I didn’t mind it that much, but the organizers went on to hold a meeting to lay down some ground rules for the following sessions. Anyway, as I was about to step down from the podium I managed to respond: “Like Rosenblatt said, one learns only from mistakes”. What a silly thing to say!

Although my response was merely an attempt to brush off the humiliation, there was a grain of truth in it, at least with respect to the computational theory of mind. Its main idea is that although the human mind is not a computer as such, it processes information along the same principles as modern digital computers.

Frank Rosenblatt (1928–1971) was one of the pioneers of artificial intelligence. In 1958 he presented the idea of the Perceptron, an algorithm that showed an elementary capability of learning and storing knowledge. The device Rosenblatt built for the U.S. Navy could be taught, for instance, to recognize basic geometric shapes.

Basically, the Perceptron was a simple linear classifier based on the model of artificial neurons developed by Warren McCulloch and Walter Pitts in 1943. As illustrated below, the neuron receives an input signal x composed of n values, does some computation, and outputs a result y that represents a binary decision, yes or no.

[Figure: model of an artificial neuron with input signal x and binary output y]

The input values could be the values of pixels in an image (e.g., character recognition), the occurrences of certain words in a piece of text (e.g., spam detection), or readings from some physical tests (e.g., diagnosis). The output value y represents a decision: “yes, it is a C”, “no, it is not spam”, or “yes, there is a risk of heart disease”. Essentially, a neuron can be taught to recognize one concept: whether a signal is an instance of the concept (a character, a spam email, a case of heart disease) or not.

The Perceptron has a weight w associated with each input x, representing the relative importance of that input with respect to the taught concept. For instance, in the case of spam detection, the presence of words like win, money, and special is likely to be more important than words like, say, escalator, and hence their weights are likely to be higher. In addition to the weights, the Perceptron has a threshold value, or bias, to gauge the output.

One can view the input signal as an n-dimensional vector and the weights as another n-dimensional vector. The decision can then be expressed in terms of linear algebra: a linear function f(x) that is the inner product of the signal and the weights plus the bias:

f(x) = \langle w, x \rangle + b

The result of f(x) is still a real value, not a decision. In the illustration above, the function φ(v) translates that real value into a decision: if f(x) > 0, the decision is 1 (positive), otherwise -1 (negative).
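
In Scala, the decision step could be sketched as follows. The innerProduct helper is also used by the trainer later on; the name decide is only for illustration.

// Sketch of the decision: a weighted sum of the inputs plus the bias, thresholded at zero.
def innerProduct(x: Array[Double], w: Array[Double]): Double =
  x.zip(w).map { case (xi, wi) => xi * wi }.sum

def decide(x: Array[Double], w: Array[Double], b: Double): Int =
  if (innerProduct(x, w) + b > 0) 1 else -1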

The weight vector and the bias define a hyperplane that divides the n-dimensional space of input signals into two parts. The illustration below shows a two-dimensional space with red and blue dots separated by a hyperplane defined by weights w and bias b. The weights form the normal of the hyperplane, and the bias sets its distance from the origin.

[Figure: a two-dimensional space of red and blue dots separated by the hyperplane defined by w and b]

The function f(x) measures how far a dot lies from the hyperplane and on which side. If the outcome is positive, the decision is yes; if zero or negative, no. All possible input signals (dots) fall on one side of the hyperplane or the other. That’s a dichotomy if there ever was one.

How does one figure out the weights? This is where one learns only from mistakes. Training is done by presenting the Perceptron a set of signals accompanied by labels. In other words, to learn to recognize spam, the Perceptron could be trained with 1,000 samples of spam emails and 5,000 or 10,000 samples of normal emails; the proportions should reflect the actual frequency of the concept.

So, one presents the data one by one to the Perceptron and compares the output to the accompanying label. If they are the same, all is good. If they are not, the Perceptron has made an error and the weights have to be adjusted. The adjustments are made in small steps, and the whole training data is run through the Perceptron multiple times. Each run through the complete training data is called an epoch.

Over the years, I have written perceptrons in various languages, but this is my first attempt at it in Scala. Scala is a functional programming language that encourages referential transparency, i.e., keeping all the effects of a function in its return value. Typically, one uses read-only variables and resorts to recursion instead of loops.

The whole source code can be found on GitHub, but I’ll post some of the juicy parts here. First, we have two data structures: one for labeled samples composed of input signals and a label (1 for positives and -1 for negatives), and another for the weight vector and the bias.

case class LabeledSample(x: Array[Double], label: Int)
case class Weights(w: Array[Double], b: Double)

The trainer itself has two parameters: the learning rate eta (η), for tuning how coarse and fast or how refined and slow the training should be, and the maximum number of epochs to run.

class LinearPerceptronTrainer(eta: Double, epochs: Int)

The training epochs are run in a tail-recursive loop. Each trial (one pass over the data) adjusts the weights for the next one and reports the number of errors it made on the training data. Should a trial commit no classification errors, we stop.

In addition to the learning rate, we use a radius (r in the snippet below) as a guide when adjusting the bias. The radius is simply the length of the longest input vector in the training data. The idea was adopted from Cristianini & Shawe-Taylor (2000). Naturally, there are other ways to go about adjusting the bias.
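
The radius is not computed in the snippets shown here; below is a sketch of how it could be derived from the training data (the helper name is mine).

// The radius R: the largest Euclidean norm among the input vectors in the training data.
def radius(data: List[LabeledSample]): Double =
  data.map(s => math.sqrt(s.x.map(v => v * v).sum)).max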

@tailrec
final def iterate(epoch: Int, data: List[LabeledSample],
  weights: Weights, r: Double): (Weights, Int) = {
  val (adjWeights, errors) = trial(data, weights, r)
  if (epoch == 0 || errors == 0) {
    (adjWeights, errors)
  } else {
    iterate(epoch - 1, data, adjWeights, r)
  }
}

After all this delegation, the actual learning takes place during a single run through the data. Scala’s foldLeft operation lets us iterate through the data and accumulate (into acc) the changes resulting from each error. The outcome of the whole foldLeft is the adjusted weights together with the number of errors.

First, we calculate the inner product of the sample vector (the input signal) and the current weight vector, plus the bias. If its sign agrees with the label, we continue on; if they disagree, i.e., one is positive and the other negative, we adjust the weights.

def trial(data: List[LabeledSample],
  weights: Weights, r: Double): (Weights, Int) = {
  data.foldLeft((weights, 0)) { (acc, s) =>
    val p = innerProduct(s.x, weights.w) + weights.b
    if (p * s.label <= 0) {
      (Weights(
        acc._1.w.zip(s.x). // pair weights and samples (inputs)
        map(w => w._1 + eta * s.label * w._2), // adjust weights
        acc._1.b + (eta * s.label * r * r)),   // adjust bias
        acc._2 + 1)                            // adjust errors
    } else {
      acc
    }
  }
}

At first glance, the adjustment phase may appear a bit messy. (I’ve named the fields and variables with short and convenient code snippets in mind rather than good practice; I’m not sure that was a good decision.) Effectively, it produces a pair of a Weights object and the error count. The zip operation merely pairs up the elements of the two arrays so they can be processed in one go. The adjustment of the weights reads

w_{t+1} = w_t + \eta \, y_i x_i

where w_{t+1} is the adjusted weight vector, w_t the current weights, η the learning rate, y_i the label of the i-th sample, and x_i the i-th sample vector in the data. In a similar vein, the adjustment of the bias can be expressed as

b_{t+1} = b_t + \eta \, y_i R^2

where b_{t+1} is the new bias, b_t the current bias, and R the radius.

We’ll test it with a small mock dataset.

val data = List(
  LabeledSample(Array[Double](0.0d, 2.0d), -1),
  LabeledSample(Array[Double](1.0d, 2.0d), -1),
  LabeledSample(Array[Double](2.0d, 1.0d), -1),
  LabeledSample(Array[Double](2.0d, 5.0d), 1),
  LabeledSample(Array[Double](3.0d, 3.0d), 1),
  LabeledSample(Array[Double](4.0d, 3.0d), 1)
)
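
For reference, here is a sketch of how the run below might be wired together. The entry point in the actual code on GitHub may differ; the radius helper and the epoch limit of 100 are my assumptions.

// Sketch of running the trainer on the mock data; names not shown in the snippets above are assumed.
val trainer = new LinearPerceptronTrainer(eta = 0.001, epochs = 100)
val start   = Weights(Array.fill(data.head.x.length)(0.0), 0.0)  // zero weights and zero bias
val r       = radius(data)                                       // radius as sketched earlier
val (model, errors) = trainer.iterate(100, data, start, r)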

We’ll start with a zero weight vector and zero bias. The learning rate is set to 0.001. The training runs for eight epochs as follows:

epoch 1 : (w: [0.0,0.0], b: 0.0, errors: 6)
epoch 2 : (w: [0.006,0.006], b: 0.0, errors: 3)
epoch 3 : (w: [0.003,0.001], b: -0.087, errors: 3)
epoch 4 : (w: [0.012,0.012], b: 0.0, errors: 3)
epoch 5 : (w: [0.009,0.007], b: -0.087, errors: 3)
epoch 6 : (w: [0.018,0.018], b: 0.0, errors: 3)
epoch 7 : (w: [0.014,0.013], b: -0.087, errors: 1)
epoch 8 : (w: [0.018,0.016], b: -0.058, errors: 0)
Finished in 31 ms (w: [0.018,0.016], b: -0.058, errors: 0)

So, there it is. The weight vector and bias have been trained to recognize a mock concept in a two-dimensional space. The learning process is error-driven, i.e., it learns only from mistakes. There is a wisp of generalization (however meager): instead of storing the actual samples, the model contains a derived hyperplane that applies to all input signals, not just those it has encountered. Some have animated the progress of Perceptron training using larger data.

Perceptron is certainly not a state-of-the-art classifier. However, from this simple classifier, I hope to make casual excursions into related topics, various other classifiers and classification problems. And whatever comes to mind.

Further reading:

  • Cristianini, Nello and Shawe-Taylor, John (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK.
  • Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, 65(6), 386–408.

The LaTeX snippets were created with the CodeCogs online editor. The source code can be found on GitHub.