Conway's Game of Life in Haskell
Today I came across this excellent game of life implementation in Clojure, and also was learning about monads in Haskell. So I ported the former, using the latter!
The logic translates pretty much the same. Wondering if there is more monads to be had on the newCell assignment line (the one with concatMap and friends), even at the expense of readability. This is a learning exercise, after all. I went for bonus points by writing a function to render the grid, it didn’t go as well. Would love some feedback on it. Here is a forkable version.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |
import Data.List
import Control.Monad
type Cell = (Int, Int)
type Grid = [Cell]
-- Game Logic
neighbours :: Cell -> Grid
neighbours (x, y) = do
dx <- [-1..1]
dy <- [-1..1]
guard (dx /= 0 || dy /= 0)
return (x + dx, y + dy)
step :: Grid -> Grid
step cells = do
(newCell, n) <- frequencies $ concatMap neighbours cells
guard $ (n == 3) || (n == 2 && newCell `elem` cells)
return newCell
-- This is the only deviation from the Clojure version, since it is not a
-- built-in in Haskell.
frequencies :: Ord a => [a] -> [(a, Int)]
frequencies xs = do
x <- group $ sort xs
return (head x, length x)
-- UI
-- Feel like I'm missing a concept. Not so happy with this function:
-- * Can `eol` be done a better way? I tried nested maps but it was urgh.
-- * `marker` seems long for a simple tenary. Same issue as `eol` I guess.
formatGrid :: Grid -> String
formatGrid grid = do
y <- ys
x <- xs
[marker x y] ++ eol x
where
marker x y
| (x, y) `elem` grid = '*'
| otherwise = ' '
eol x
| x == maximum xs = ['\n']
| otherwise = []
xs = gridRange fst
ys = gridRange snd
gridRange f = [min grid .. max grid]
where
min = minimum . map f
max = maximum . map f
main = do
mapM_ printGrid . take 3 $ iterate step beacon
where
beacon = [(0, 0), (1, 0), (0, 1), (3, 3), (2, 3), (3, 2)]
printGrid :: Grid -> IO ()
printGrid grid = do
putStrLn $ formatGrid grid
putStrLn ""
|
Project Euler solutions 1-25 in Literate Haskell
A holiday project!
Project Euler is a series of challenging mathematical/computer programming problems that will require more than just mathematical insights to solve. Although mathematics will help you arrive at elegant and efficient methods, the use of a computer and programming skills will be required to solve most problems. — Project Euler.Follows is my attempt at the problems. I use Haskell and mathematics, two subjects in which I am a novice. In writing about my solutions and discoveries, I hope to present a gentle introduction to both.
Please do enjoy: Project Euler solutions 1-25 in Literate Haskell.
DataMapper Retrospective
I introduced DataMapper on my last two major projects. As those projects matured after I had left, they both migrated to a different ORM. That deserves a retrospective, I think. As I’ve left both projects, I don’t have the insider level of detail on the decision to abandon DataMapper, but developers from both projects kindly provided background for this blog post.
Project A
Web application and a batch processing component built on top of a legacy Oracle database.
Good
- Field mappings, nice ruby names and able to ignore fields we didn’t care about.
Bad
- Had to roll our own locking and time zone integration.
- Not great for batch processing (trying to write SQL through DM abstraction.)
It turned out this project required a lot more batch processing than we anticipated, which DataMapper does not shine at. It was migrated to Sequel which provides a far better abstraction for working closer to SQL.
Project B
A fairly typical Rails 3 application. A couple of tens of thousands of lines of code.
Good
- No migrations (pre-release).
- Foreign keys, composite primary keys.
- Auto-validations.
Bad
- Auto-validations with nested attributes was uncharted territory (needed bug fixes).
- Performance on large object graphs was unusable for page rendering (close to two seconds for our home page, which admittedly had a stupid amount of stuff on it).
- Performance was suboptimal (though passable) on smaller pages.
- Tracing through what his happening across multiple gems (particularly around transactions) was tricky.
- The maintenance/interactions of all the various gems was problematic (e.g. gems X,Y work with 1.9.3 but Z doesn’t yet).
- Inability to easily “break the abstraction” when SQL was required.
The performance issues were clear in our code base, but eluded much effort to reduce them down to smaller reproducible problems. The best quick win I found was ~15% by disabling assertions, but I suspect that given the large scope of the problem DataMapper is trying to solve there may not be any approachable way of tackling the issue (would love to be proven wrong!)
We ran into obvious integration bugs (apologies for not having kept a concrete list), a symptom of a library not widely used. As a commiter on the project this wasn’t an issue, since they were easily fixed and moved past (the DataMapper code base is really nice to work on), but having a commiter on your team isn’t a tenable strategy.
DataMapper takes an all-ruby-all-the-time approach, which means things get tricky when the abstraction leaks. Much of the SQL generation is hidden in private methods. Compare some code to create a composable full text search query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
def self.search(keywords, options = {}) options = { conditions: ["true"] }.merge(options) current_query = query.merge(options) a = repository.adapter columns_sql = a.send(:columns_statement, current_query.fields, false) conditions = a.send(:conditions_statement, current_query.conditions, false) order_sql = a.send(:order_statement, current_query.order, false) limit_sql = current_query.limit || 50 conditions_sql, conditions_values = *conditions bind_values = [keywords] + conditions_values find_by_sql([<<-SQL, *bind_values]) SELECT #{columns_sql}, ts_rank_cd(search_vector, query) AS rank FROM things CROSS JOIN plainto_tsquery(?) query WHERE #{conditions_sql} AND (query @@ search_vector) ORDER BY rank DESC, #{order_sql} LIMIT #{limit_sql} SQL end |
To the ActiveRecord equivalent (Sequel is similar):
1 2 3 4 5 6 |
def self.search(keywords) select("things.*, ts_rank_cd(search_vector, query) AS rank") .joins(sanitize_sql_array(["CROSS JOIN plainto_tsquery(?) query", keywords])) .where("query @@ search_vector") .order("rank DESC") end |
Switching to ActiveRecord took a week of all hands (~4) on deck, plus another week alongside other feature work to get it stable. From beginning to in production was two weeks. The end result was a drop in response time (the deploy is pretty blatant in the graph below), start up time, plus 3K less lines of code (a lot of custom code for dropping down to SQL was able to be removed).

Do differently
Ultimately, DataMapper provides an abstraction that I just don’t need, and even if I did it hasn’t had its tires kicked sufficiently that a team can use it without having to delve down to the internals. The applications I find myself writing are about data, and the store in which that data lives is vitally important to the application. Abstracting away those details seems to be heading in the wrong direction for writing simple applications. As an intellectual achievement in its own right I really dig DataMapper, but it is too complicated a component to justify using inside other applications.
Rich Hickey’s talk Simple Made Easy has been rattling around my head a lot.
Nowadays I’m back to ActiveRecord for team conformance. It’s more work to keep on top of foreign keys and the like, but overall it does the job. It’s still too complicated, but has the non-trivial benefit of being used by lots of people. This is my responsible choice at the moment.
On my own projects I first reach for Sequel. It supports all the nice database features I want to use, while providing a thin layer over SQL. In other words, I don’t have to worry about the abstraction leaking because the abstraction is still SQL, just expressed in ruby (which is a huge win for composeability that you don’t get with raw SQL). While it does have “ORM” features, it feels more like the most convenient way of accessing my database rather than an abstraction layer. It’s actively maintained and the only bug I have found was something that Rails broke, and a patch was already available. There are no open issues in the bug tracker. My experiences have been overwhelmingly positive. I haven’t built anything big enough with it yet to have confidence using it on a team project though.
I still have a soft spot in my heart for DataMapper, I just don’t see anywhere for me to use it anymore.
Exercises in style
Let us make a stack machine! It can add numbers! This may be a winding journey. Have some time and an irb up your sleeve. Maybe it is more of a meditation than a blog post? Onwards!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
def push_op(value) lambda {|x| [value, x + [value]] } end def add_op lambda {|x| [x[-1] + x[-2], x[0..-3]] } end [ push_op(1), push_op(2), add_op ].inject([nil, []]) {|(result, state), op| op[state] } |
Get it? Pushes 1, pushes 2, then the add_op pops them off the stack and makes 3. Not a lot of metadata in those lambdas though, and we can’t combine them in interesting way.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
class Operation < Struct.new(:block) def +(other) CompositeOperation.new(self, other) end def run(state) @block.call(state) end end class CompositeOperation < Operation def initialize(a, b) @a = a @b = b super(lambda {|x| @b.block[@a.block[x][1]] }) end def desc @a.desc + "\n" + @b.desc end end class PushOperation < Operation def initialize(value) @value = value super(lambda {|x| [value, x + [value]] }) end def desc "push #{@value}" end end class AddOperation < Operation def initialize super(lambda {|x| [x[-1] + x[-2], x[0..-3]] }) end def desc "add top two digits on stack" end end |
A lot more setup, but now we also get a description of operations!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
def tagged_push_op(value) PushOperation.new(value) end def tagged_add_op AddOperation.new end ops = tagged_push_op(1) + tagged_push_op(2) + tagged_add_op puts ops.desc puts ops.run(start_state).inspect |
Ok you get that. What else can we do?
“every monad [.] embodies a particular computational strategy. A ‘motto of computation,’ if you will.” — Mental Guy
hmmm. What does it mean?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
class VerboseStackEvaluator < Struct.new(:stack) attr_accessor :result, :stack def pass(op) puts op.desc results = op.call(stack) self.class.new(results[1]).tap do |x| x.result = results[0] end end def self.identity new([]) end end e = evaluator.identity. pass(tagged_push_op(1)). pass(tagged_push_op(2)). pass(tagged_add_op) p [e.result, e.stack] |
Oh so now we have one structure (the pass stuff) that we can run through different evaluators. Let us make a recursive one!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
class RecursiveLazyStackEvaluator < Struct.new(:stack) def pass(op) self.class.new(lambda { op.call(stack) }) end def self.identity new(lambda { [nil, []] }) end def result; evaled[0]; end def stack; evaled[1]; end private def evaled @evaled ||= @stack.call end end |
Do you see it is now lazy. Rather than evaluate each operation when pass is called, it saves them up until a result is requested. Look out! Haskell in your Ruby! Recursion might blow out our stack though. Let us isomorphically (I just learned this word) translate it to use iteration!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
class LazyStackEvaluator attr_accessor :steps def initialize(stack, steps = []) @stack = stack @steps = steps end def pass(op) self.class.new(@stack, steps + [op]) end def self.identity new([]) end def result; evaled[0]; end def stack; evaled[1]; end protected def evaled @evaled ||= steps.inject([nil, @stack]) {|(r, s), op| op.call(s) } end end |
Not too shabby. Let’s try something more useful. Given we only have one operation that pops the stack (add), and it only pops two numbers, if we have more than two numbers in a row they start becoming redundant. Let us optimize!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
class OptimizingEvaluator < LazyStackEvaluator def evaled @evaled ||= begin accumulator = [] new_steps = [] steps.each do |step| accumulator << step if !step.is_a?(PushOperation) new_steps += accumulator accumulator = [] elsif accumulator.length > 2 accumulator = accumulator[1..-1] end end new_steps += accumulator new_steps.inject([nil, @stack]) {|(r, s), op| op.call(s) } end end end e = evaluator.identity. pass(tagged_push_op(1)). # This won't get run! pass(tagged_push_op(1)). pass(tagged_push_op(2)). pass(tagged_add_op) p [e.result, e.stack] |
Ok one more. This one is pretty useless for this problem, but perhaps it will inspire thought. Let us multithread!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
class ThreadingEvaluator < LazyStackEvaluator def evaled @evaled ||= begin accumulator = [] workers = [] steps.each do |step| accumulator << step if step.is_a?(AddOperation) workers << spawn_thread(accumulator) accumulator = [] end end workers << spawn_thread(accumulator) unless accumulator.empty? workers.each(&:join) workers.last[:result] end end def spawn_thread(accumulator) Thread.new do sleep rand / 3 Thread.current[:result] = begin e = accumulator.inject(VerboseStackEvaluator.identity) {|e, s| e.pass(s) } [e.result, e.stack] end end end end e = evaluator.identity. pass(tagged_push_op(1)). pass(tagged_push_op(1)). pass(tagged_push_op(2)). pass(tagged_add_op). pass(tagged_push_op(3)). pass(tagged_push_op(4)). pass(tagged_add_op) p [e.result, e.stack] |
Ok that is all. Here is an exercise for you: how would you allow the threading and optimizing evaluators to be combined?
SICP Lisp interpreter in Clojure
On a lazy Sunday morning I can oft be found meandering through the classics of computer science literature. This weekend was no exception, as I put together a LISP interpreter in Clojure based off chapter 4 of The Structure and Interpretation of Computer Progams.
The code is on github, rather than including it inline here, since at 90 lines plus tests it’s getting a tad long for a snippet.
It differs from the SICP version in that the environment variable is immutable, so new versions have to be passed through to each function. This resulted in the “context” concept that encapsulates both the current expression and the environment that does with. It causes a small amount of clunky code (see map-reducer), but also allows easier managing of scoping for lambdas (see do-apply and env-extend). It matches the functional paradigm much better anyway. I also used some higher level primitives such as map and reduce that SICP doesn’t - SICP is demonstrating that they aren’t necessary, but that’s a point I’ve already conceeded and don’t feel I need to replicate.
Critique of my style warmly encouraged, I’m still new to Clojure.
Vim and tmux on OSX
I recently switched from MacVim to vim inside tmux, using iTerm in full screen mode (Command+Enter). It’s pretty rad. I tried screen first, but even after a lot of screwing around there was still a lot of brokeness, and I don’t like how it does split panes anyways. Follows are some notes about what is required for tmux.
Get the latest vim and tmux
Latest vim required for proper clipboard sharing, if you don’t want to install it you can use the pbcopy plugin mentioned below. A formula won’t ever be in the main homebrew repository because they have a policy of not including packages already provided by the system, but homebrew-alt is pretty legit.
1 2 |
brew install --HEAD https://raw.github.com/adamv/homebrew-alt/master/duplicates/vim.rb brew install tmux |
Set up pretty colors
I use the solarized color scheme. To make this work, ensure you are not overriding the TERM variable in your .{bash|zsh}rc, then create an alias for tmux:
1 2 |
# .zshrc alias tmux="TERM=screen-256color-bce tmux" |
I also have a tmux config:
1 2 |
# .tmux.conf set -g default-terminal "screen-256color" |
Clipboard sharing
Up until I wrote this blog post, I had been using the pbcopy plugin to share clipboard using a cute hack involving ssh’ing back into your machine to run pbcopy/pbpaste. In researching some more details on this though I found an excellent write up of the problem and a far better solution by Chris Johnsen that enables proper sharing without ssh’ing, and therefore also the * register (use "*y to copy, "*p to paste – note this does not work with the vim that ships with OSX).
Mouse integration
The mouse is good for two things: scrolling, and selecting text from your scrollback.
For the first, put the following config:
1 2 |
# ~/.tmux.conf set -g mode-mouse on |
For the second, hold the option key while you select.
Workflow
Find another reference for basic keys, this here are notes on top of that. Ctrl-B sucks as an escape sequence, rebind it to Ctrl-A to match screen. Most online references don’t mention it, but the default binding for horizontal split is prefix " (it’s in the man page). I tend to have a main pane for editing and a smaller pane for a REPL or log. If I need to investigate the smaller pane, I press Ctrl-A Ctrl-O, which switches the two panes to give me the log in the larger one.
I use the tslime.vim plugin to send text directly from vim to the supplementary pane. This is a killer feature. As well as the built in Ctrl-C shortcut, I also use a trick I learned from Gary Bernhardt and remap <leader>t on the fly to send whatever command I am currently testing to the other pane. Some examples:
1 2 3 4 |
; Load a file into a clojure repl
:map ;t :w\|:call Send_to_Tmux("\n\n\n(load-file \"./myfile.clj\")\n")<CR>
; Run rspec in zsh
:map ;t :w\|:call Send_to_Tmux("rspec spec/my_spec.rb\n")<CR>
|
If I need to interact with a shell I’ll usually Ctrl-Z vim, do what I need to do, then fg back again. If it’s a context switch, I’ll start a new tmux window then exit it after I’m done with the distraction.
I don’t use sessions. I prefer setting up from scratch each time since it takes no time at all, and eases my brain into the problem. Clean desk and all that.
That’s it. Nothing too fancy, but I’ve been meaning to make the switch from MacVim for a while and with this set up I can’t ever see myself going back.
OCR with Clojure and ImageMagick
Let’s write some Clojure to recognize hand-written digits. It will be fun. But first, some notes.
NOTE THE FIRST: If you actually want proper OCR with Clojure that is actually useful, perhaps try this blog post on using OpenCV and Tesseract. If you want to have some fun from first principles, come with me.
NOTE THE SECOND: This post was heavily inspired by Chapter 2 in Machine Learning in Action, which details the K nearest neighbour algorithm and pointed me to the dataset. If you dig this post, you should buy that book.
OK let’s go! Here’s what we’re going to do:
- Take a snapshot of your handwriting.
- Use ImageMagick to post-process it.
- Convert the snapshot to a text format matching our training data.
- Download and parse a training set of data.
- Identify the digit written in the snapshot using the training data.
It’s going to be great.
Take a snapshot
Draw a single numeric digit on a piece of paper. Take a photo of it and get it on your computer. I used Photo Booth and the built-in camera on my Mac. Tight crop the picture around the number, so it looks something like:

Don’t worry if it’s a bit grainy or blurry, our classifier is going to be pretty smart.
Use ImageMagick to post-process it
The ImageMagick command line utility convert is one of those magic tools that once you learn you can never imagine how you did without it. It can do anything you need to an image. Anything. For instance, resize our image to 32×32 pixels and convert it into black and white.
1 2 3 4 5 6 7 |
(ns ocr.main
(:use [clojure.contrib.shell-out :only (sh)]))
(defn convert-image
[in out]
(sh "convert" in "-colorspace" "gray" "+dither" "-colors" "2"
"-normalize" "-resize" "32x32!" out))
|
It took me a while to figure out this incantation. The user manual for quantize is probably the best reference you’ll find. Note that the exclamation mark in “32×32!” will stretch the dimensions of the image to be square. This is desirable since most people write too skinny, and maybe some write too fat, but we need the digits to be square otherwise everything will look like a “1”. Converting the above “5” will look like this:

I am shelling out from Clojure to transform the file. There are two other options: JMagick, which uses the C API directly using JNI, and im4java which still shells out but gives you a nice interface over the top of it. I couldn’t get the first one working (it looks like a pretty dead project, no updates for a few years), and the latter wouldn’t give me anything helpful in this case.
Convert the image into a text format
The convert program automatically formats the output file based on the file extension, you can easily convert between any graphic format you choose. For instance, convert JPG to PNG:
1 |
convert myfile.jpg myfile.png |
As well as graphic formats though, it also supports the txt format, which looks like this:
1 2 3 4 |
# ImageMagick pixel enumeration: 32,32,255,rgb 0,0: (255,255,255) #FFFFFF white 1,0: ( 0, 0, 0) #000000 black # etc... |
That’s handy, because it can be easily translated into a bitmap with “1” representing black and “0” representing white. The “5” from above will look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
10000000000000000000000000000000 00000000000000000000000000000000 00000000000000000000000000000000 00000000000000000000001111111111 00000000000000111111111111111111 00000000000011111111111111111111 00000000000011111111111111111110 00000000000111111111100000000000 00000000000111100000000000000000 00000000001111100000000000000000 00000000001111000000000000000000 00000000011110000000000000000000 00000000111110000000000000000000 00000000111110000000000000000000 00000000111110000000000000000000 00000000111111111000000000000000 00000000111111111000000000000000 00000000001111111100000000000000 00000000000111111110000000000000 00000000000001111111000000000000 00000000000000111111000000000000 00000000000000011111000000000000 00000000000000001111000000000000 00000000000000000111100000000000 00000000000000000111100000000000 00000000000000011111000000000000 00011111111111111111000000000000 00011111111111111110000000000000 00011111111111111100000000000000 00000111111111111000000000000000 00000000001110000000000000000000 00000000000000000000000000000000 |
I used the duck-streams library found in clojure.contrib to read and write the file from disk, and applied some light processing to get the data into the required format. I also used a temporary file on disk to store the data - I’m pretty sure there would be a way to get convert to write to STDOUT then process that in memory, but I didn’t figure it out. It’s handy for debugging to have the file there anyways.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
(ns ocr.main
(:use [clojure.contrib.shell-out :only (sh)]))
(:use [clojure.contrib.duck-streams :only (read-lines write-lines)]))
(defn read-text-image-line [line]
(if (= "white" (last (split line #"[,:\s]+"))) "0" "1"))
(defn load-text-image
[filename]
(let [lines (vec (drop 1 (read-lines filename)))
converted (map read-text-image-line lines) ]
(map #(apply str %) (partition 32 converted))))
(defn convert-image
[in out]
(sh "convert" in "-colorspace" "gray" "+dither" "-colors" "2"
"-normalize" "-resize" "32x32!" out)
(write-lines out (load-text-image out)))
(def temp-outfile "/tmp/clj-converted.txt")
|
One more function is needed to be able to load that file up again into memory. This one doesn’t need to use read-lines, since the desired format for the classification below is actually just a vector of ones and zeros, so slurp is a quick alternative which is in the core libraries.
1 2 3 4 5 6 |
(defn load-char-file [file]
(let [filename (.getName file)
tokens (split filename #"[_\.]")
label (first tokens)
contents (parse-char-row (slurp file))]
[label contents]))
|
Fetch some training data
The University of California Irving provides some sweet datasets if you’re getting into machine learning. In particular, the Optical Recognition of Handwritten Digits Data Set contains nearly 2000 labeled digits provided in the 32×32 text format the snapshot is now in. All digits are in one file, with a few header rows that can be dropped and ignored.
1 2 |
wget http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits-orig.tra.Z gunzip optdigits-orig.tra.Z |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
(defn parse-char-row [row]
(map #(Integer/parseInt %) (filter #(or (= % "1") (= % "0")) (split row #""))))
(defn parse-char-data [element]
(let [label (trim (last element))
rows (take 32 element)]
[label (vec (flatten (map parse-char-row rows)))]))
(defn load-training-data
[filename]
(let [lines (drop 21 (read-lines filename))
elements (partition 33 lines)]
(map parse-char-data elements)
))
(def training-set (load-training-data "optdigits-orig.tra"))
|
This code returns an array of all the training data, each element being an array itself with the first element a label (“0”, “1”, “2”, etc…) and the second element a vector of all the data (new lines ignored, they’re not important).
Note that I’m using vec throughout. This is to force lazy sequences to be evaluated, which is a required performance optimization for this program otherwise it won’t finish calculating.
Classify our digit
This is the exciting part! I won’t go into the algorithm here (buy the Machine Learning book!), but it’s called K Nearest Neighbour and it’s not particularly fancy but works surprisingly well. If you read my last blog post, you’ll note I’ve dropped the Incanter library. It was too much mucking about and didn’t provide any value for this project. Reading datasets is pretty easy with Clojure anyways.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
(defn minus-vector [& args]
(map #(apply - %) (apply map vector args)))
(defn sum-of-squares [coll]
(reduce (fn [a v] (+ a (* v v))) coll))
(defn calculate-distances [in]
(fn [row]
(let [vector-diff (minus-vector (last in) (last row))
label (first row)
distance (sqrt (sum-of-squares vector-diff))]
[label distance])))
(defn classify [in]
(let [k 10
diffs (map (calculate-distances in) training-set)
nearest-neighbours (frequencies (map first (take k (sort-by last diffs))))
classification (first (last (sort-by second nearest-neighbours)))]
classification))
|
Now to tie it all together with a main function that converts all the snapshots you pass in as arguments.
1 2 3 4 5 6 7 |
(defn classify-image [filename]
(convert-image filename temp-outfile)
(classify (load-char-file (java.io.File. temp-outfile))))
(defn -main [& args]
(doseq [filename args]
(println "I think that is the number" (classify-image filename))))
|
That’s the lot. Use it like so:
1 2 |
> lein run myDigits/5_0.jpg I think that is the number 5 |
Hooray! Here is the full script as a gist. Let me know if you do anything fun with it.
Profiling Clojure
Tonight I was so impressed by how easy it was to profile some Clojure code using built-in JVM tools that I had to share:
Today I also learned more about the Incanter API, and wrote some good code to transform columns, among other things.
Exploring data with Clojure, Incanter, and Leiningen
I’m working through Machine Learning in Action at the moment, and it’s done in Python. I don’t really know Python, but I’d prefer to learn Clojure, so I’m redoing the code samples.
This blog posts show how to read a CSV file, manipulate it, then graph it. Turns out Clojure is pretty good for this, in combination with the Incanter library (think R for the JVM). It took me a while to get an environment set up since I’m unfamiliar with basically everything.
Install Clojure
I already had it installed so can’t remember if there were any crazy steps to get it working. Hopefully this is all you need:
1 |
sudo brew install clojure |
Install Leiningen
Leiningen is a build tool which does many things, but most importantly for me is it manages the classpath. I was jumping through all sorts of hoops trying to get Incanter running without it.
There are easy to follow instructions in the README
*UPDATE: * As suggested in the comments, you can probably just `brew install lein` here and that will get you Leiningen and Clojure in one command.
Create a new project
1 |
lein new hooray-data && cd hooray-data |
Add Incanter as a dependency to the project.clj file, and also a main target:
1 2 3 4 5 6 |
(defproject clj "1.0.0-SNAPSHOT"
:description "FIXME: write"
:dependencies [[org.clojure/clojure "1.2.0"]
[org.clojure/clojure-contrib "1.2.0"]
[incanter "1.2.3-SNAPSHOT"]]
:main hooray_data.core)
|
Add some Incanter code to src/hooray_data/core.clj
1 2 3 4 5 6 |
(ns hooray_data.core (:gen-class) (:use (incanter core stats charts io datasets))) (defn -main [& args] (view (histogram (sample-normal 1000))) |
Then fire it up:
1 2 |
lein deps lein run |
If everything runs to plan you’ll see a pretty graph.
Code
First, a simple categorized scatter plot. read-dataset works with both URLs and files, which is pretty handy.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
(ns hooray_data.core
(:use (incanter core stats charts io)))
; Sample data set provided by Incanter
(def plotData (read-dataset
"https://raw.github.com/liebke/incanter/master/data/iris.dat"
:delim \space
:header true))
(def plot (scatter-plot
(sel plotData :cols 0)
(sel plotData :cols 1)
:x-label "Sepal Length"
:y-label "Sepal Width"
:group-by (sel plotData :cols 4)))
(defn -main [& args]
(view plot))
|
Second, the same data but normalized. The graph will look the same, but the underlying data is now ready for some more math.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
(ns hooray_data.core
(:use (incanter core stats charts io)))
; Sample data set provided by Incanter
(def data (read-dataset
"https://raw.github.com/liebke/incanter/master/data/iris.dat"
:delim \space
:header true))
(defn extract [f]
(fn [data]
(map #(apply f (sel data :cols %)) (range 0 (ncol data)))))
(defn fill [n row] (map (fn [x] row) (range 0 n)))
(defn matrix-row-operation [operand row matrix]
(operand matrix
(fill (nrow matrix) row)))
; Probably could be much nicer using `reduce`
(defn normalize [matrix]
(let [shifted (matrix-row-operation minus ((extract min) matrix) matrix)]
(matrix-row-operation div ((extract max) shifted) shifted)))
(def normalized-data
(normalize (to-matrix (sel data :cols [0 1]))))
(def normalized-plot (scatter-plot
(sel normalized-data :cols 0)
(sel normalized-data :cols 1)
:x-label "Sepal Length"
:y-label "Sepal Width"
:group-by (sel data :cols 4)))
(defn -main [& args]
(view normalized-plot))
|
I was kind of hoping the normalize function would have already been written for me in a standard library, but I couldn’t find it.
I’ll report back if anything else of interest comes up as I’m working through the book.
Interface Mocking
UPDATE: This is a gem now: rspec-fire The code in the gem is better than that presented here.
Here is a screencast I put together in response to a recent Destroy All Software screencast on test isolation and refactoring, showing off an idea I’ve been tinkering around with for automatic validation of your implicit interfaces that you stub in tests.
Here is the code for InterfaceMocking:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 |
module InterfaceMocking # Returns a new interface double. This is equivalent to an RSpec double, # stub or, mock, except that if the class passed as the first parameter # is loaded it will raise if you try to set an expectation or stub on # a method that the class has not implemented. def interface_double(stubbed_class, methods = {}) InterfaceDouble.new(stubbed_class, methods) end module InterfaceDoubleMethods include RSpec::Matchers def should_receive(method_name) ensure_implemented(method_name) super end def should_not_receive(method_name) ensure_implemented(method_name) super end def stub!(method_name) ensure_implemented(method_name) super end def ensure_implemented(*method_names) if recursive_const_defined?(Object, @__stubbed_class__) recursive_const_get(Object, @__stubbed_class__). should implement(method_names, @__checked_methods__) end end def recursive_const_get object, name name.split('::').inject(Object) {|klass,name| klass.const_get name } end def recursive_const_defined? object, name !!name.split('::').inject(Object) {|klass,name| if klass && klass.const_defined?(name) klass.const_get name end } end end class InterfaceDouble < RSpec::Mocks::Mock include InterfaceDoubleMethods def initialize(stubbed_class, *args) args << {} unless Hash === args.last @__stubbed_class__ = stubbed_class @__checked_methods__ = :public_instance_methods ensure_implemented *args.last.keys # __declared_as copied from rspec/mocks definition of `double` args.last[:__declared_as] = 'InterfaceDouble' super(stubbed_class, *args) end end end RSpec::Matchers.define :implement do |expected_methods, checked_methods| match do |stubbed_class| unimplemented_methods( stubbed_class, expected_methods, checked_methods ).empty? end def unimplemented_methods(stubbed_class, expected_methods, checked_methods) implemented_methods = stubbed_class.send(checked_methods) unimplemented_methods = expected_methods - implemented_methods end failure_message_for_should do |stubbed_class| "%s does not publicly implement:\n%s" % [ stubbed_class, unimplemented_methods( stubbed_class, expected_methods, checked_methods ).sort.map {|x| " #{x}" }.join("\n") ] end end RSpec.configure do |config| config.include InterfaceMocking end |
Static Asset Caching on Heroku Cedar Stack
I recently moved this blog over to Heroku, and in the process added in some proper HTTP caching headers. The dynamic pages use the build in fresh_when and stale? Rails helpers, combined with Rack::Cache and the free memcached plugin available on Heroku. That was all pretty straight forward, what was more difficult was configuring Heroku to serve all static assets (such as images and stylesheets) with a far-future max-age header so that they will be cached for eternity. What I’ve documented here is somewhat of a hack, and hopefully Heroku will provide a better way of doing this in the future.
By default Heroku serves everything in public directly via nginx. This is a problem for us since we don’t get a chance to configure the caching headers. Instead, use the Rack::StaticCache middleware (provided in the rack-contrib gem) to serve static files, which by default adds far future max age cache control headers. This needs to be out of different directory to public since there is no way to disable the nginx serving. I renamed by public folder to public_cached.
1 2 3 4 5 6 7 8 9 10 |
# config/application.rb config.middleware.use Rack::StaticCache, urls: %w( /stylesheets /images /javascripts /robots.txt /favicon.ico ), root: "public_cached" |
I also disabled the built in Rails serving of static assets in development mode, so that it didn’t interfere:
1 2 |
# config/environments/development.rb config.serve_static_assets = false |
In the production config, I configured the x_sendfile_header option to be “X-Accel-Redirect”. It was “X-Sendfile” which is an apache directive, and was causing nginx to hang (Heroku would never actually serve the assets to the browser).
1 2 |
# config/environments/production.rb config.action_dispatch.x_sendfile_header = 'X-Accel-Redirect' |
A downside of this approach is that if you have a lot of static assets, they all have to hit the Rails stack in order to be served. If you only have one dyno (the free plan) then the initial load can be slower than it otherwise would be if nginx was serving them directly. As I mentioned in the introduction, hopefully Heroku will provide a nicer way to do this in the future.
Speeding up Rails startup time
In which I provide easy instructions to try a new patch that drastically improves the start up time of Ruby applications, in the hope that with wide support it will be merged into the upcoming 1.9.3 release. Skip to the bottom for instructions, or keep reading for the narrative.
UPDATE: If you have trouble installing, grab a recent copy of rvm: rvm get head.
Background
Recent releases of MRI Ruby have introduced some fairly major performance regressions when requiring files:

For reference, our medium-sized Rails application requires around 2200 files &emdash; off the right-hand side of this graph. This is problematic. On 1.9.2 it takes 20s to start up, on 1.9.3 it takes 46s. Both are far too long.
There are a few reasons for this, but the core of the problem is the basic algorithm which looks something like this:
1 2 3 4 5 6 7 |
def require(file) $loaded.each do |x| return false if x == file end load(file) $loaded << file end |
That loop is no good, and gets worse the more files you have required. I have written a patch for 1.9.3 which changes this algorithm to:
1 2 3 4 5 |
def require(file) return false if $loaded[file] load(file) $loaded[file] = true end |
That gives you a performance curve that looks like this:

Much nicer.
That’s just a synthetic benchmark, but it works in the real world too. My main Rails application now loads in a mite over 10s, down from 20s it was taking on 1.9.2. A blank Rails app loads in 1.1s, which is even faster than 1.8.7.

Getting the fix
Here is how you can try out my patch right now in just ten minutes using RVM.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# First get a baseline measurement cd /your/rails/app time script/rails runner "puts 1" # Install a patched ruby curl https://gist.github.com/raw/996418/e2b346fbadeed458506fc69ca213ad96d1d08c3e/require-performance-fix-r31758.patch > /tmp/require-performance-fix.patch rvm install ruby-head --patch /tmp/require-performance-fix.patch -n patched # ... get a cup of tea, this took about 8 minutes on my MBP # Get a new measurement cd /your/rails/app rvm use ruby-head-patched gem install bundler --no-rdoc --no-ri bundle time script/rails runner "puts 1" |
How you can help
I need a lot more eyeballs on this patch before it can be considered for merging into trunk. I would really appreciate any of the following:
- Try it out on your app and report timings in the comments.
- Code review the patch on this GitHub pull request (it’s C code, but don’t let that scare you off).
- Try it on Windows.
- Report any bugs you find.
Next steps
I imagine there will be a bit more work to get this into Ruby 1.9.3, but after that this is just the first step of many to try and speed up the time Rails takes to start up. Bundler and RubyGems still spend a lot of time doing … something, which I want to investigate. I also want to port these changes over to JRuby which has similar issues (Rubinius isn’t quite as fast out of the gate, but does not degrade exponentially so would not benefit from this patch).
Thank you for your time.
Deleting duplicate data with PostgreSQL
Here is an update to a query I posted a while back for detecting duplicate data. It allows you to select all but one of the resulting duplicates, for easy deletion. It only works on PostgreSQL, but is pretty neat. It uses a window function!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
DELETE FROM users USING ( SELECT id, first_value(id) OVER ( PARTITION BY name ORDER BY created_at DESC ) first_id FROM users WHERE name IN ( SELECT name FROM users GROUP BY name HAVING count(name) > 1 ) ) dups WHERE dups.id != dups.first_id AND users.id = dups.id; |
The order by is optional, but handy if you need to select a particular row rather than just an arbitrary one. You need an extra sub-query because you can’t have window functions in a where clause.
For more tasty PostgreSQL tricks, check out my Meet PostgreSQL screencast, a steal at only $12 plug plug plug.
New Column: Code Safari
I am writing a regular weekly column at the newly launched Sitepoint project RubySource. The column is named “Code Safari”, where I explore the jungle of ruby libraries and gems and figure out how they work. It’s an introductory series designed to not just explain how things operate, but show you the tools and techniques so that you can figure it out yourself.
Three posts have already been published:
- Understanding Concurrent Programming With Ruby’s Goliath, in which I dig into the new Goliath web server to figure out how it uses the new 1.9 Fibers to work some magic.
- Configuring Capybara, in which I investigate how Capybara implemented its configuration DSL, and then make one for myself.
- TWSS and Bayesian Classification of Twitter Searches, in which I inspect the pipes of a beautiful piece of plumbing.
The format is a bit different but I’m really happy with how it is working so far. Let me know what you think.
PostgreSQL 9 and ruby full text search tricks
I have just released an introduction to PostgreSQL screencast, published through PeepCode. It is over an hour long and covers a large number of juicy topics:
- Setup full text search
- Optimize search with triggers and indexes
- Use Postgres with Ruby on Rails 3
- Optimize indexes by including only the rows that you need
- Use database standards for more reliable queries
- Write powerful reports in only a few lines of code
- Convert an existing MySQL application to use Postgres
It’s a steal at only $12. You can buy it over at PeepCode.
In it, I introduce full text search in postgres, and use a trigger to keep a search vector up to date. I’m not going to cover that here, but the point I get to is:
1 2 3 4 |
CREATE TRIGGER posts_search_vector_refresh BEFORE INSERT OR UPDATE ON posts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(search_vector, 'pg_catalog.english', body, title); |
That is good for simple models, but what if you want to index child models as well? For instance, we want to include comment authors in the search index. I rolled up my sleeves an came up with this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
CREATE OR REPLACE FUNCTION search_trigger() RETURNS trigger AS $$ DECLARE search TEXT; child_search TEXT; begin SELECT string_agg(author_name, ' ') INTO child_search FROM comments WHERE post_id = new.id; search := ''; search := search || ' ' || coalesce(new.title); search := search || ' ' || coalesce(new.body); search := search || ' ' child_search; new.search_index := to_tsvector(search); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER posts_search_vector_refresh BEFORE INSERT OR UPDATE ON posts FOR EACH ROW EXECUTE PROCEDURE search_trigger(); |
Getting a bit ugly eh. It might be nice to move that logic back into ruby land, but we have the problem that we need to call a database function to convert our search document into the correct data-type. In this case, a quick work around is to store a search_document in a text field on the model, then use a trigger to only index that field into our search_vector field. The search_document field can then easily be set from your ORM.
Of course, any self-respecting rubyist should hide all this complexity behind a neat interface. I have come up with one using DataMapper that automatically adds the required triggers and indexes via auto-migrations. You use it thusly:
1 2 3 4 5 6 7 8 9 10 |
class Post include DataMapper::Resource include Searchable property :id, Serial property :title, String property :body, Text searchable :title, :body # Provides Post.search('keyword') end |
You can find the Searchable module code over on github. In it you can also find a fugly proof-of-concept for a DSL that generates the above SQL for indexing child models using DataMapper’s rich property model. It worked, but I’m not using it in any production code so I can hardly recommend it. Maybe you want to have a play though.
