tag:www.rhnh.net,2008:/codeCode - Xavier Shay's Blog2012-01-02T03:16:46ZEnkiXavier Shaynotreal@rhnh.nettag:www.rhnh.net,2008:Post/8592012-01-02T03:16:00Z2012-01-02T03:16:46ZConway's Game of Life in Haskell<p>Today I came across this <a href="http://clj-me.cgrand.net/2011/08/19/conways-game-of-life/">excellent game of life implementation in Clojure</a>, and was also learning about <a href="http://learnyouahaskell.com/a-fistful-of-monads#the-list-monad">monads in Haskell</a>. So I ported the former, using the latter!</p>
<p>The logic translates almost directly. I wonder whether there are more monads to be had on the <code>newCell</code> assignment line (the one with <code>concatMap</code> and friends), even at the expense of readability. This is a learning exercise, after all. I went for bonus points by writing a function to render the grid, but it didn’t go as well. I would love some feedback on it. Here is a <a href="https://github.com/xaviershay/sandbox/blob/master/misc/game_of_life.hs">forkable version</a>.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">import Data.List<tt>
</tt>import Control.Monad<tt>
</tt><tt>
</tt>type Cell = (Int, Int)<tt>
</tt>type Grid = [Cell]<tt>
</tt><tt>
</tt>-- Game Logic<tt>
</tt><tt>
</tt>neighbours :: Cell -> Grid<tt>
</tt>neighbours (x, y) = do<tt>
</tt>  dx <- [-1..1]<tt>
</tt>  dy <- [-1..1]<tt>
</tt>  guard (dx /= 0 || dy /= 0)<tt>
</tt>  return (x + dx, y + dy)<tt>
</tt><tt>
</tt>step :: Grid -> Grid<tt>
</tt>step cells = do<tt>
</tt>  (newCell, n) <- frequencies $ concatMap neighbours cells<tt>
</tt>  guard $ (n == 3) || (n == 2 && newCell `elem` cells)<tt>
</tt>  return newCell<tt>
</tt><tt>
</tt>-- This is the only deviation from the Clojure version, since it is not a<tt>
</tt>-- built-in in Haskell.<tt>
</tt>frequencies :: Ord a => [a] -> [(a, Int)]<tt>
</tt>frequencies xs = do<tt>
</tt>  x <- group $ sort xs<tt>
</tt>  return (head x, length x)<tt>
</tt><tt>
</tt><tt>
</tt>-- UI<tt>
</tt><tt>
</tt>-- Feel like I'm missing a concept. Not so happy with this function:<tt>
</tt>-- * Can `eol` be done a better way? I tried nested maps but it was urgh.<tt>
</tt>-- * `marker` seems long for a simple ternary. Same issue as `eol` I guess.<tt>
</tt>formatGrid :: Grid -> String<tt>
</tt>formatGrid grid = do<tt>
</tt>  y <- ys<tt>
</tt>  x <- xs<tt>
</tt>  [marker x y] ++ eol x<tt>
</tt>  where<tt>
</tt>    marker x y<tt>
</tt>      | (x, y) `elem` grid = '*'<tt>
</tt>      | otherwise          = ' '<tt>
</tt>    eol x<tt>
</tt>      | x == maximum xs = ['\n']<tt>
</tt>      | otherwise       = []<tt>
</tt><tt>
</tt>    xs = gridRange fst<tt>
</tt>    ys = gridRange snd<tt>
</tt>    gridRange f = [min grid .. max grid]<tt>
</tt>      where<tt>
</tt>        min = minimum . map f<tt>
</tt>        max = maximum . map f<tt>
</tt><tt>
</tt>main = do<tt>
</tt>  mapM_ printGrid . take 3 $ iterate step beacon<tt>
</tt>  where<tt>
</tt>    beacon = [(0, 0), (1, 0), (0, 1), (3, 3), (2, 3), (3, 2)]<tt>
</tt><tt>
</tt>    printGrid :: Grid -> IO ()<tt>
</tt>    printGrid grid = do<tt>
</tt>      putStrLn $ formatGrid grid<tt>
</tt>      putStrLn ""<tt>
</tt></pre></td>
</tr></table>
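<p>For comparison, here is a rough Ruby rendering of the same neighbour-counting idea. It is a sketch rather than a line-for-line port: cells are two-element arrays, <code>frequencies</code> is built from <code>group_by</code>, and the final <code>sort</code> is only there to make results easy to compare.</p>

```ruby
# Loose Ruby sketch of the step logic above; names mirror the Haskell version.
NEIGHBOUR_OFFSETS = [-1, 0, 1].product([-1, 0, 1]) - [[0, 0]]

# All eight cells surrounding the given cell.
def neighbours(cell)
  x, y = cell
  NEIGHBOUR_OFFSETS.map { |dx, dy| [x + dx, y + dy] }
end

# Pairs of [item, number of occurrences].
def frequencies(items)
  items.group_by { |item| item }.map { |item, group| [item, group.length] }
end

# A cell is alive next generation if it has three live neighbours,
# or two live neighbours and is already alive.
def step(cells)
  frequencies(cells.flat_map { |cell| neighbours(cell) })
    .select { |cell, n| n == 3 || (n == 2 && cells.include?(cell)) }
    .map(&:first)
    .sort
end

beacon = [[0, 0], [1, 0], [0, 1], [3, 3], [2, 3], [3, 2]]
```

<p>Stepping the beacon twice should give back the starting cells, which makes for a handy sanity check.</p>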
tag:www.rhnh.net,2008:Post/8572011-11-29T04:39:00Z2011-11-29T04:24:34ZDataMapper Retrospective<p>I introduced <a href="http://datamapper.org/">DataMapper</a> on my last two major projects. As those projects matured after I had left, they both migrated to a different <span class="caps">ORM</span>. That deserves a retrospective, I think. As I’ve left both projects, I don’t have the insider level of detail on the decision to abandon DataMapper, but developers from both projects kindly provided background for this blog post.</p>
<h2>Project A</h2>
<p>Web application and a batch processing component built on top of a legacy Oracle database.</p>
<h3>Good</h3>
<ul>
<li>Field mappings: nice Ruby names, and the ability to ignore fields we didn’t care about.</li>
</ul>
<h3>Bad</h3>
<ul>
<li>Had to roll our own locking and time zone integration.</li>
<li>Not great for batch processing (trying to write <span class="caps">SQL</span> through the DM abstraction).</li>
</ul>
<p>It turned out this project required a lot more batch processing than we anticipated, which is not where DataMapper shines. It was migrated to <a href="http://sequel.rubyforge.org/">Sequel</a>, which provides a far better abstraction for working closer to <span class="caps">SQL</span>.</p>
<h2>Project B</h2>
<p>A fairly typical Rails 3 application. A couple of tens of thousands of lines of code.</p>
<h3>Good</h3>
<ul>
<li>No migrations (pre-release).</li>
<li>Foreign keys, composite primary keys.</li>
<li>Auto-validations.</li>
</ul>
<h3>Bad</h3>
<ul>
<li>Auto-validations with nested attributes were uncharted territory (needed bug fixes).</li>
<li>Performance on large object graphs was unusable for page rendering (close to two seconds for our home page, which admittedly had a stupid amount of stuff on it).</li>
<li>Performance was suboptimal (though passable) on smaller pages.</li>
<li>Tracing through what is happening across multiple gems (particularly around transactions) was tricky.</li>
<li>Maintaining all the various gems and their interactions was problematic (e.g. gems X,Y work with 1.9.3 but Z doesn’t yet).</li>
<li>Inability to easily “break the abstraction” when <span class="caps">SQL</span> was required.</li>
</ul>
<p>The performance issues were clear in our code base, but resisted considerable effort to reduce them to smaller reproducible problems. The best quick win I found was ~15% from disabling assertions, but I suspect that, given the large scope of the problem DataMapper is trying to solve, there may not be any approachable way of tackling the issue (would love to be proven wrong!).</p>
<p>We ran into obvious integration bugs (apologies for not having kept a concrete list), a symptom of a library not widely used. As a committer on the project this wasn’t an issue for me, since they were easily fixed and moved past (the DataMapper code base is really nice to work on), but having a committer on your team isn’t a tenable strategy.</p>
<p>DataMapper takes an all-ruby-all-the-time approach, which means things get tricky when the abstraction leaks. Much of the <span class="caps">SQL</span> generation is hidden in private methods. Compare some code to create a composable full text search query:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">def</span> <span class="pc">self</span>.<span class="fu">search</span>(keywords, options = {})<tt>
</tt> options = {<tt>
</tt> <span class="ke">conditions</span>: [<span class="s"><span class="dl">"</span><span class="k">true</span><span class="dl">"</span></span>]<tt>
</tt> }.merge(options)<tt>
</tt><tt>
</tt> current_query = query.merge(options)<tt>
</tt><tt>
</tt> a = repository.adapter<tt>
</tt> columns_sql = a.send(<span class="sy">:columns_statement</span>, current_query.fields, <span class="pc">false</span>)<tt>
</tt> conditions = a.send(<span class="sy">:conditions_statement</span>, current_query.conditions, <span class="pc">false</span>)<tt>
</tt> order_sql = a.send(<span class="sy">:order_statement</span>, current_query.order, <span class="pc">false</span>)<tt>
</tt> limit_sql = current_query.limit || <span class="i">50</span><tt>
</tt> conditions_sql, conditions_values = *conditions<tt>
</tt><tt>
</tt> bind_values = [keywords] + conditions_values<tt>
</tt><tt>
</tt> find_by_sql([<span class="s"><span class="dl"><<-SQL</span></span>, *bind_values])<span class="s"><span class="k"><tt>
</tt> SELECT </span><span class="il"><span class="idl">#{</span>columns_sql<span class="idl">}</span></span><span class="k">, ts_rank_cd(search_vector, query) AS rank<tt>
</tt> FROM things<tt>
</tt> CROSS JOIN plainto_tsquery(?) query<tt>
</tt> WHERE </span><span class="il"><span class="idl">#{</span>conditions_sql<span class="idl">}</span></span><span class="k"> AND (query @@ search_vector)<tt>
</tt> ORDER BY rank DESC, </span><span class="il"><span class="idl">#{</span>order_sql<span class="idl">}</span></span><span class="k"><tt>
</tt> LIMIT </span><span class="il"><span class="idl">#{</span>limit_sql<span class="idl">}</span></span><span class="dl"><tt>
</tt> SQL</span></span><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
<p>To the ActiveRecord equivalent (Sequel is similar):</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">def</span> <span class="pc">self</span>.<span class="fu">search</span>(keywords)<tt>
</tt> select(<span class="s"><span class="dl">"</span><span class="k">things.*, ts_rank_cd(search_vector, query) AS rank</span><span class="dl">"</span></span>)<tt>
</tt> .joins(sanitize_sql_array([<span class="s"><span class="dl">"</span><span class="k">CROSS JOIN plainto_tsquery(?) query</span><span class="dl">"</span></span>, keywords]))<tt>
</tt> .where(<span class="s"><span class="dl">"</span><span class="k">query @@ search_vector</span><span class="dl">"</span></span>)<tt>
</tt> .order(<span class="s"><span class="dl">"</span><span class="k">rank DESC</span><span class="dl">"</span></span>)<tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
<p>Switching to ActiveRecord took a week of all hands (~4) on deck, plus another week alongside other feature work to get it stable. From start to production took two weeks. The end result was a drop in response time (the deploy is pretty blatant in the graph below) and start-up time, plus 3K fewer lines of code (a lot of custom code for dropping down to <span class="caps">SQL</span> could be removed).</p>
<p><img src="http://a.yfrog.com/img739/7449/4h5.png" alt="" /></p>
<h2>Do differently</h2>
<p>Ultimately, DataMapper provides an abstraction that I just don’t need, and even if I did it hasn’t had its tires kicked sufficiently that a team can use it without having to delve down to the internals. The applications I find myself writing are about data, and the store in which that data lives is vitally important to the application. Abstracting away those details seems to be heading in the wrong direction for writing simple applications. As an intellectual achievement in its own right I really dig DataMapper, but it is too complicated a component to justify using inside other applications.</p>
<p>Rich Hickey’s talk <a href="http://www.infoq.com/presentations/Simple-Made-Easy">Simple Made Easy</a> has been rattling around my head a lot.</p>
<p>Nowadays I’m back to ActiveRecord for team conformance. It’s more work to keep on top of foreign keys and the like, but overall it does the job. It’s still too complicated, but has the non-trivial benefit of being used by <strong>lots</strong> of people. This is my responsible choice at the moment.</p>
<p>On my own projects I first reach for Sequel. It supports all the nice database features I want to use, while providing a thin layer over <span class="caps">SQL</span>. In other words, I don’t have to worry about the abstraction leaking because the abstraction is still <span class="caps">SQL</span>, just expressed in Ruby (which is a huge win for composability that you don’t get with raw <span class="caps">SQL</span>). While it does have “<span class="caps">ORM</span>” features, it feels more like the most convenient way of accessing my database than an abstraction layer. It’s actively maintained, and the only bug I have found was something that Rails broke, for which a patch was already available. There are no open issues in the bug tracker. My experiences have been overwhelmingly positive. I haven’t built anything big enough with it yet to have confidence using it on a team project though.</p>
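<p>To make the composability point concrete, here is a toy sketch in plain Ruby. It is emphatically <em>not</em> Sequel’s API: <code>ToyQuery</code> and its methods are made up for illustration. The idea it shows is that a query held as a value can be refined piece by piece and reused, which is awkward with a raw <span class="caps">SQL</span> string.</p>

```ruby
# Toy illustration only: ToyQuery is a hypothetical immutable query value,
# not part of Sequel, ActiveRecord or DataMapper.
ToyQuery = Struct.new(:table, :wheres, :orders) do
  # Each refinement returns a new query, leaving the receiver untouched.
  def where(clause)
    ToyQuery.new(table, wheres + [clause], orders)
  end

  def order(clause)
    ToyQuery.new(table, wheres, orders + [clause])
  end

  # The SQL is only assembled at the very end.
  def to_sql
    sql = "SELECT * FROM #{table}"
    sql += " WHERE " + wheres.map { |w| "(#{w})" }.join(" AND ") unless wheres.empty?
    sql += " ORDER BY " + orders.join(", ") unless orders.empty?
    sql
  end
end

things   = ToyQuery.new("things", [], [])
visible  = things.where("hidden = false")
searched = visible.where("query @@ search_vector").order("rank DESC")

puts visible.to_sql
puts searched.to_sql
```

<p>Here <code>visible</code> can be handed to other code and extended without anyone parsing or concatenating strings by hand, and <code>things</code> itself is never modified.</p>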
<p>I still have a soft spot in my heart for DataMapper; I just don’t see anywhere for me to use it anymore.</p>tag:www.rhnh.net,2008:Post/8562011-09-05T01:48:00Z2011-09-05T01:48:38ZExercises in style<p>Let us make a stack machine! It can add numbers! This may be a winding journey. Have some time and an <code>irb</code> up your sleeve. Maybe it is more of a meditation than a blog post? Onwards!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">def</span> <span class="fu">push_op</span>(value)<tt>
</tt> lambda {|x| [value, x + [value]] }<tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="r">def</span> <span class="fu">add_op</span><tt>
</tt> lambda {|x| [x[<span class="i">-1</span>] + x[<span class="i">-2</span>], x[<span class="i">0</span>..<span class="i">-3</span>]] }<tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt>[<tt>
</tt> push_op(<span class="i">1</span>),<tt>
</tt> push_op(<span class="i">2</span>),<tt>
</tt> add_op<tt>
</tt>].inject([<span class="pc">nil</span>, []]) {|(result, state), op|<tt>
</tt> op[state]<tt>
</tt>}<tt>
</tt></pre></td>
</tr></table>
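<p>If the fold above is hard to follow in your head, it can be traced one step at a time. This sketch repeats the two lambdas so it runs on its own, and collects every intermediate <code>[result, stack]</code> pair; the <code>trace</code> variable is only for illustration.</p>

```ruby
def push_op(value)
  lambda { |stack| [value, stack + [value]] }
end

def add_op
  lambda { |stack| [stack[-1] + stack[-2], stack[0..-3]] }
end

# Start from [nil, []] and feed each operation the stack left by the previous one.
trace = [push_op(1), push_op(2), add_op].each_with_object([[nil, []]]) do |op, seen|
  seen << op[seen.last[1]]
end

trace.each { |result, stack| p [result, stack] }
```

<p>Note that <code>add_op</code> returns the sum as the result but does not push it back on to the stack.</p>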
<p>Get it? Pushes 1, pushes 2, then the <code>add_op</code> pops them off the stack and makes 3. Not a lot of metadata in those lambdas though, and we can’t combine them in interesting ways.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">class</span> <span class="cl">Operation</span> < <span class="co">Struct</span>.new(<span class="sy">:block</span>)<tt>
</tt> <span class="r">def</span> <span class="fu">+</span>(other)<tt>
</tt> <span class="co">CompositeOperation</span>.new(<span class="pc">self</span>, other)<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">run</span>(state)<tt>
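</tt> block.call(state) <span class="c"># Struct members are read via accessors, not instance variables</span><tt>
</tt> <span class="r">end</span><tt>
</tt> alias_method <span class="sy">:call</span>, <span class="sy">:run</span> <span class="c"># the evaluators below invoke operations with #call</span><tt>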
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="r">class</span> <span class="cl">CompositeOperation</span> < <span class="co">Operation</span><tt>
</tt> <span class="r">def</span> <span class="fu">initialize</span>(a, b)<tt>
</tt> <span class="iv">@a</span> = a<tt>
</tt> <span class="iv">@b</span> = b<tt>
</tt> <span class="r">super</span>(lambda {|x| <span class="iv">@b</span>.block[<span class="iv">@a</span>.block[x][<span class="i">1</span>]] })<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">desc</span><tt>
</tt> <span class="iv">@a</span>.desc + <span class="s"><span class="dl">"</span><span class="ch">\n</span><span class="dl">"</span></span> + <span class="iv">@b</span>.desc<tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="r">class</span> <span class="cl">PushOperation</span> < <span class="co">Operation</span><tt>
</tt> <span class="r">def</span> <span class="fu">initialize</span>(value)<tt>
</tt> <span class="iv">@value</span> = value<tt>
</tt> <span class="r">super</span>(lambda {|x| [value, x + [value]] })<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">desc</span><tt>
</tt> <span class="s"><span class="dl">"</span><span class="k">push </span><span class="il"><span class="idl">#{</span><span class="iv">@value</span><span class="idl">}</span></span><span class="dl">"</span></span><tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="r">class</span> <span class="cl">AddOperation</span> < <span class="co">Operation</span><tt>
</tt> <span class="r">def</span> <span class="fu">initialize</span><tt>
</tt> <span class="r">super</span>(lambda {|x| [x[<span class="i">-1</span>] + x[<span class="i">-2</span>], x[<span class="i">0</span>..<span class="i">-3</span>]] })<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">desc</span><tt>
</tt> <span class="s"><span class="dl">"</span><span class="k">add top two digits on stack</span><span class="dl">"</span></span><tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
<p>A lot more setup, but now we also get a description of operations!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">def</span> <span class="fu">tagged_push_op</span>(value)<tt>
</tt> <span class="co">PushOperation</span>.new(value)<tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="r">def</span> <span class="fu">tagged_add_op</span><tt>
</tt> <span class="co">AddOperation</span>.new<tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt>ops =<tt>
</tt> tagged_push_op(<span class="i">1</span>) +<tt>
</tt> tagged_push_op(<span class="i">2</span>) +<tt>
</tt> tagged_add_op<tt>
</tt><tt>
</tt>puts ops.desc<tt>
</tt>puts ops.run([]).inspect <span class="c"># start from an empty stack</span><tt>
</tt></pre></td>
</tr></table>
<p>Ok you get that. What else can we do?</p>
<p><em>“every monad [.] embodies a particular computational strategy. A ‘motto of computation,’ if you will.”</em> — <a href="http://moonbase.rydia.net/mental/writings/programming/monads-in-ruby/02array">Mental Guy</a></p>
<p>Hmmm. What does it mean?</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">class</span> <span class="cl">VerboseStackEvaluator</span> < <span class="co">Struct</span>.new(<span class="sy">:stack</span>)<tt>
</tt> attr_accessor <span class="sy">:result</span> <span class="c"># :stack already comes from the Struct; redefining it would shadow it with nil</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">pass</span>(op)<tt>
</tt> puts op.desc<tt>
</tt> results = op.call(stack)<tt>
</tt> <span class="pc">self</span>.class.new(results[<span class="i">1</span>]).tap <span class="r">do</span> |x|<tt>
</tt> x.result = results[<span class="i">0</span>]<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="pc">self</span>.<span class="fu">identity</span><tt>
</tt> new([])<tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
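</tt>evaluator = <span class="co">VerboseStackEvaluator</span> <span class="c"># swap in any evaluator class here</span><tt>
</tt>e = evaluator.identity.<tt>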
</tt> pass(tagged_push_op(<span class="i">1</span>)).<tt>
</tt> pass(tagged_push_op(<span class="i">2</span>)).<tt>
</tt> pass(tagged_add_op)<tt>
</tt><tt>
</tt>p [e.result, e.stack]<tt>
</tt></pre></td>
</tr></table>
<p>Oh so now we have one structure (the <code>pass</code> stuff) that we can run through different evaluators. Let us make a recursive one!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">class</span> <span class="cl">RecursiveLazyStackEvaluator</span> < <span class="co">Struct</span>.new(<span class="sy">:stack</span>)<tt>
</tt> <span class="r">def</span> <span class="fu">pass</span>(op)<tt>
</tt> <span class="pc">self</span>.class.new(lambda {<tt>
</tt> op.call(stack)<tt>
</tt> })<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="pc">self</span>.<span class="fu">identity</span><tt>
</tt> new(lambda { [<span class="pc">nil</span>, []] })<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">result</span>; evaled[<span class="i">0</span>]; <span class="r">end</span><tt>
</tt> <span class="r">def</span> <span class="fu">stack</span>; evaled[<span class="i">1</span>]; <span class="r">end</span><tt>
</tt><tt>
</tt> private<tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">evaled</span><tt>
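</tt> <span class="iv">@evaled</span> ||= <span class="pc">self</span>[<span class="sy">:stack</span>].call <span class="c"># read the Struct member directly; @stack is never set</span><tt>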
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
<p>Do you see? It is now lazy. Rather than evaluate each operation when <code>pass</code> is called, it saves them up until a result is requested. Look out! Haskell in your Ruby! Recursion might blow out our stack though. Let us isomorphically (I just learned this word) translate it to use iteration!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">class</span> <span class="cl">LazyStackEvaluator</span><tt>
</tt> attr_accessor <span class="sy">:steps</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">initialize</span>(stack, steps = [])<tt>
</tt> <span class="iv">@stack</span> = stack<tt>
</tt> <span class="iv">@steps</span> = steps<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">pass</span>(op)<tt>
</tt> <span class="pc">self</span>.class.new(<span class="iv">@stack</span>, steps + [op])<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="pc">self</span>.<span class="fu">identity</span><tt>
</tt> new([])<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">result</span>; evaled[<span class="i">0</span>]; <span class="r">end</span><tt>
</tt> <span class="r">def</span> <span class="fu">stack</span>; evaled[<span class="i">1</span>]; <span class="r">end</span><tt>
</tt><tt>
</tt> protected<tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">evaled</span><tt>
</tt> <span class="iv">@evaled</span> ||= steps.inject([<span class="pc">nil</span>, <span class="iv">@stack</span>]) {|(r, s), op|<tt>
</tt> op.call(s)<tt>
</tt> }<tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
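<p>One way to convince yourself that nothing runs until a result is requested is to count calls. The sketch below uses two small stand-ins so it runs on its own: <code>CountingPush</code> and <code>TinyLazyEvaluator</code> are hypothetical names, and the evaluator is a cut-down copy of the iterative one above.</p>

```ruby
# A push operation that remembers how many times it has been invoked.
class CountingPush
  attr_reader :calls

  def initialize(value)
    @value = value
    @calls = 0
  end

  def call(stack)
    @calls += 1
    [@value, stack + [@value]]
  end
end

# Cut-down iterative lazy evaluator: pass only records steps.
class TinyLazyEvaluator
  def initialize(stack, steps = [])
    @stack = stack
    @steps = steps
  end

  def pass(op)
    self.class.new(@stack, @steps + [op])
  end

  def result
    evaled[0]
  end

  private

  # Runs the recorded steps once, then memoizes the outcome.
  def evaled
    @evaled ||= @steps.inject([nil, @stack]) { |(_, s), op| op.call(s) }
  end
end

op = CountingPush.new(1)
e  = TinyLazyEvaluator.new([]).pass(op).pass(op)

calls_before = op.calls # nothing has run yet
value        = e.result # forces evaluation of both steps
calls_after  = op.calls
```

<p>Asking for <code>e.result</code> a second time does not bump the counter again, because the outcome is memoized in <code>@evaled</code>.</p>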
<p>Not too shabby. Let’s try something more useful. Given we only have one operation that pops the stack (add), and it only pops two numbers, if we have more than two numbers in a row they start becoming redundant. Let us optimize!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">class</span> <span class="cl">OptimizingEvaluator</span> < <span class="co">LazyStackEvaluator</span><tt>
</tt> <span class="r">def</span> <span class="fu">evaled</span><tt>
</tt> <span class="iv">@evaled</span> ||= <span class="r">begin</span><tt>
</tt> accumulator = []<tt>
</tt> new_steps = []<tt>
</tt> steps.each <span class="r">do</span> |step|<tt>
</tt> accumulator << step<tt>
</tt> <span class="r">if</span> !step.is_a?(<span class="co">PushOperation</span>)<tt>
</tt> new_steps += accumulator<tt>
</tt> accumulator = []<tt>
</tt> <span class="r">elsif</span> accumulator.length > <span class="i">2</span><tt>
</tt> accumulator = accumulator[<span class="i">1</span>..<span class="i">-1</span>]<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt> new_steps += accumulator<tt>
</tt> new_steps.inject([<span class="pc">nil</span>, <span class="iv">@stack</span>]) {|(r, s), op|<tt>
</tt> op.call(s)<tt>
</tt> }<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt>evaluator = <span class="co">OptimizingEvaluator</span><tt>
</tt>e = evaluator.identity.<tt>
</tt> pass(tagged_push_op(<span class="i">1</span>)). <span class="c"># This won't get run!</span><tt>
</tt> pass(tagged_push_op(<span class="i">1</span>)).<tt>
</tt> pass(tagged_push_op(<span class="i">2</span>)).<tt>
</tt> pass(tagged_add_op)<tt>
</tt><tt>
</tt>p [e.result, e.stack]<tt>
</tt></pre></td>
</tr></table>
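<p>The pruning rule on its own, separated from the evaluator, might look like the sketch below. To keep it short, steps are modelled as plain arrays (<code>[:push, n]</code> or <code>[:add]</code>) rather than operation objects; that representation and the function name are assumptions for illustration.</p>

```ruby
# Keep at most the last two pushes of any consecutive run, since add only
# ever consumes two values.
def prune_redundant_pushes(steps)
  kept = []
  run  = [] # current run of consecutive pushes
  steps.each do |step|
    if step.first == :push
      run << step
      run.shift if run.length > 2 # an older push can no longer be consumed
    else
      kept.concat(run) << step
      run = []
    end
  end
  kept + run
end

p prune_redundant_pushes([[:push, 1], [:push, 1], [:push, 2], [:add]])
```

<p>One thing to keep in mind: dropping a push leaves the result alone but changes what is left on the stack afterwards, so this is only safe when the leftover stack does not matter.</p>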
<p>Ok one more. This one is pretty useless for this problem, but perhaps it will inspire thought. Let us multithread!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">class</span> <span class="cl">ThreadingEvaluator</span> < <span class="co">LazyStackEvaluator</span><tt>
</tt> <span class="r">def</span> <span class="fu">evaled</span><tt>
</tt> <span class="iv">@evaled</span> ||= <span class="r">begin</span><tt>
</tt> accumulator = []<tt>
</tt> workers = []<tt>
</tt> steps.each <span class="r">do</span> |step|<tt>
</tt> accumulator << step<tt>
</tt> <span class="r">if</span> step.is_a?(<span class="co">AddOperation</span>)<tt>
</tt> workers << spawn_thread(accumulator)<tt>
</tt> accumulator = []<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt> workers << spawn_thread(accumulator) <span class="r">unless</span> accumulator.empty?<tt>
</tt> workers.each(&<span class="sy">:join</span>)<tt>
</tt><tt>
</tt> workers.last[<span class="sy">:result</span>]<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">spawn_thread</span>(accumulator)<tt>
</tt> <span class="co">Thread</span>.new <span class="r">do</span><tt>
</tt> sleep rand / <span class="i">3</span><tt>
</tt> <span class="co">Thread</span>.current[<span class="sy">:result</span>] = <span class="r">begin</span><tt>
</tt> e = accumulator.inject(<span class="co">VerboseStackEvaluator</span>.identity) {|e, s| e.pass(s) }<tt>
</tt> [e.result, e.stack]<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt>evaluator = <span class="co">ThreadingEvaluator</span><tt>
</tt>e = evaluator.identity.<tt>
</tt> pass(tagged_push_op(<span class="i">1</span>)).<tt>
</tt> pass(tagged_push_op(<span class="i">1</span>)).<tt>
</tt> pass(tagged_push_op(<span class="i">2</span>)).<tt>
</tt> pass(tagged_add_op).<tt>
</tt> pass(tagged_push_op(<span class="i">3</span>)).<tt>
</tt> pass(tagged_push_op(<span class="i">4</span>)).<tt>
</tt> pass(tagged_add_op)<tt>
</tt><tt>
</tt>p [e.result, e.stack]<tt>
</tt></pre></td>
</tr></table>
<p>Ok that is all. Here is an exercise for you: how would you allow the threading and optimizing evaluators to be combined?</p>tag:www.rhnh.net,2008:Post/8552011-08-21T21:00:00Z2011-08-21T17:08:33ZSICP Lisp interpreter in Clojure<p>On a lazy Sunday morning I can oft be found meandering through the classics of computer science literature. This weekend was no exception, as I put together a <span class="caps">LISP</span> interpreter in Clojure based on chapter 4 of <a href="http://mitpress.mit.edu/sicp/">Structure and Interpretation of Computer Programs</a>.</p>
<p>The <a href="https://github.com/xaviershay/sandbox/blob/master/clj/lisptwo.clj">code is on github</a>, rather than including it inline here, since at 90 lines plus tests it’s getting a tad long for a snippet.</p>
<p>It differs from the <span class="caps">SICP</span> version in that the environment variable is immutable, so new versions have to be passed through to each function. This resulted in the “context” concept that encapsulates both the current expression and the environment that goes with it. It causes a small amount of clunky code (see <code>map-reducer</code>), but also makes it easier to manage scoping for lambdas (see <code>do-apply</code> and <code>env-extend</code>). It matches the functional paradigm much better anyway. I also used some higher level primitives such as <code>map</code> and <code>reduce</code> that <span class="caps">SICP</span> doesn’t. <span class="caps">SICP</span> is demonstrating that they aren’t necessary, but that’s a point I’ve already conceded and don’t feel I need to replicate.</p>
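<p>To make the “pass the environment through” idea concrete, here is a minimal sketch in Ruby rather than Clojure. It is not the code from the repository: the <code>evaluate</code> function and the tiny two-form language (<code>:define</code> and <code>:+</code>) are made up for illustration. Every call returns a value together with a possibly new, frozen environment instead of mutating the one it was given.</p>

```ruby
# Hypothetical mini-evaluator (not the lisptwo.clj code): each call returns
# [value, env] and never mutates the environment it received.
def evaluate(expr, env)
  case expr
  when Numeric
    [expr, env]
  when Symbol
    [env.fetch(expr), env]
  when Array
    op, *args = expr
    case op
    when :define
      name, body = args
      value, env = evaluate(body, env)
      # merge builds a new hash, so the caller's environment is untouched.
      [value, env.merge(name => value).freeze]
    when :+
      # Thread the environment through each argument in turn.
      args.reduce([0, env]) do |(sum, e), arg|
        value, e = evaluate(arg, e)
        [sum + value, e]
      end
    else
      raise ArgumentError, "unknown form: #{op.inspect}"
    end
  else
    raise ArgumentError, "cannot evaluate: #{expr.inspect}"
  end
end

env0 = {}.freeze
_, env1 = evaluate([:define, :x, 2], env0)
value, _ = evaluate([:+, :x, 3], env1)
p value # => 5
p env0  # => {}
```

<p>The cost is the same clunkiness mentioned above: every caller has to unpack a pair and hand the new environment on.</p>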
<p>Critique of my style warmly encouraged, I’m still new to Clojure.</p>tag:www.rhnh.net,2008:Post/8542011-08-20T10:35:00Z2011-08-20T22:35:06ZVim and tmux on OSX<p>I recently switched from MacVim to vim inside tmux, using iTerm in full screen mode (<code>Command+Enter</code>). It’s pretty rad. I tried screen first, but even after a lot of screwing around there was still a lot of brokenness, and I don’t like how it does split panes anyways. What follows are some notes about what is required for tmux.</p>
<h3>Get the latest vim and tmux</h3>
<p>The latest vim is required for proper clipboard sharing. If you don’t want to install it, you can use the <code>pbcopy</code> plugin mentioned below. A vim formula won’t ever be in the main homebrew repository because they have a policy of not including packages already provided by the system, but homebrew-alt is pretty legit.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">brew install --HEAD https://raw.github.com/adamv/homebrew-alt/master/duplicates/vim.rb<tt>
</tt>brew install tmux<tt>
</tt></pre></td>
</tr></table>
<h3>Set up pretty colors</h3>
<p><a href="https://img.skitch.com/20110821-gfp7b3g8xrk25bfdxgrxdnghnr.jpg"><img src="https://img.skitch.com/20110821-gfp7b3g8xrk25bfdxgrxdnghnr.jpg" width='640' alt='my vim/tmux setup' /></a></p>
<p>I use the <a href="http://ethanschoonover.com/solarized">solarized</a> color scheme. To make this work, ensure you are not overriding the <code>TERM</code> variable in your <code>.{bash|zsh}rc</code>, then create an alias for tmux:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"># .zshrc<tt>
</tt>alias tmux="TERM=screen-256color-bce tmux"<tt>
</tt></pre></td>
</tr></table>
<p>I also have a tmux config:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"># .tmux.conf<tt>
</tt>set -g default-terminal "screen-256color"<tt>
</tt></pre></td>
</tr></table>
<h3>Clipboard sharing</h3>
<p>Up until I wrote this blog post, I had been using the <a href="https://github.com/mortice/pbcopy.vim">pbcopy plugin</a> to share clipboard using a cute hack involving ssh’ing back into your machine to run <code>pbcopy</code>/<code>pbpaste</code>. In researching some more details on this though I found an <strong>excellent</strong> write up of the problem and a <a href="https://github.com/ChrisJohnsen/tmux-MacOSX-pasteboard">far better solution by Chris Johnsen</a> that enables proper sharing without ssh’ing, and therefore also the <code>*</code> register (use <code>"*y</code> to copy, <code>"*p</code> to paste – note this does <strong>not</strong> work with the vim that ships with <span class="caps">OSX</span>).</p>
<h3>Mouse integration</h3>
<p>The mouse is good for two things: scrolling, and selecting text from your scrollback.</p>
<p>For the first, put the following config:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"># ~/.tmux.conf<tt>
</tt>set -g mode-mouse on<tt>
</tt></pre></td>
</tr></table>
<p>For the second, hold the option key while you select.</p>
<h3>Workflow</h3>
<p>Find another reference for the basic keys; these are notes on top of that. <code>Ctrl-B</code> sucks as an escape sequence, so rebind it to <code>Ctrl-A</code> to match screen. Most online references don’t mention it, but the default binding for horizontal split is <code>prefix "</code> (it’s in the man page). I tend to have a main pane for editing and a smaller pane for a <code>REPL</code> or log. If I need to investigate the smaller pane, I press <code>Ctrl-A Ctrl-O</code>, which switches the two panes to give me the log in the larger one.</p>
<p>I use the <a href="https://github.com/xaviershay/tslime.vim" title="I had to patch it">tslime.vim plugin</a> to send text directly from vim to the supplementary pane. This is a killer feature. As well as the built in <code>Ctrl-C</code> shortcut, I also use a trick I learned from <a href="http://blog.extracheese.org/">Gary Bernhardt</a> and remap <code>&lt;leader&gt;t</code> on the fly to send whatever command I am currently testing to the other pane. Some examples:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">; Load a file into a clojure repl<tt>
</tt>:map ;t :w\|:call Send_to_Tmux("\n\n\n(load-file \"./myfile.clj\")\n")&lt;CR&gt;<tt>
</tt>; Run rspec in zsh<tt>
</tt>:map ;t :w\|:call Send_to_Tmux("rspec spec/my_spec.rb\n")&lt;CR&gt;<tt>
</tt></pre></td>
</tr></table>
<p>If I need to interact with a shell I’ll usually <code>Ctrl-Z</code> vim, do what I need to do, then <code>fg</code> back again. If it’s a context switch, I’ll start a new tmux window then exit it after I’m done with the distraction.</p>
<p>I don’t use sessions. I prefer setting up from scratch each time since it takes no time at all, and eases my brain into the problem. Clean desk and all that.</p>
<p>That’s it. Nothing too fancy, but I’ve been meaning to make the switch from MacVim for a while and with this set up I can’t ever see myself going back.</p>tag:www.rhnh.net,2008:Post/8532011-08-07T06:46:20Z2011-08-07T06:46:20ZOCR with Clojure and ImageMagick<p>Let’s write some Clojure to recognize hand-written digits. It will be fun. But first, some notes.</p>
<p><strong><span class="caps">NOTE</span> <span class="caps">THE</span> <span class="caps">FIRST</span>:</strong> If you want proper <span class="caps">OCR</span> with Clojure that is actually useful, perhaps try <a href="http://antoniogarrote.wordpress.com/2011/01/30/ocr-with-clojure-tesseract-and-opencv/">this blog post on using OpenCV and Tesseract.</a> If you want to have some fun from first principles, come with me.</p>
<p><strong><span class="caps">NOTE</span> <span class="caps">THE</span> <span class="caps">SECOND</span>:</strong> This post was heavily inspired by Chapter 2 in <a href="http://www.manning.com/pharrington/">Machine Learning in Action</a>, which details the K nearest neighbour algorithm and pointed me to the dataset. If you dig this post, you should buy that book.</p>
<p>OK let’s go! Here’s what we’re going to do:</p>
<ul>
<li>Take a snapshot of your handwriting.</li>
<li>Use ImageMagick to post-process it.</li>
<li>Convert the snapshot to a text format matching our training data.</li>
<li>Download and parse a training set of data.</li>
<li>Identify the digit written in the snapshot using the training data.</li>
</ul>
<p>It’s going to be great.</p>
<h2>Take a snapshot</h2>
<p>Draw a single numeric digit on a piece of paper. Take a photo of it and get it on your computer. I used Photo Booth and the built-in camera on my Mac. Crop the picture tightly around the number, so it looks something like:</p>
<p><img src="https://img.skitch.com/20110807-nndqigyck9u3bf8gxx14m9dhgj.jpg" alt="" /></p>
<p>Don’t worry if it’s a bit grainy or blurry, our classifier is going to be pretty smart.</p>
<h2>Use ImageMagick to post-process it</h2>
<p>The ImageMagick command line utility <code>convert</code> is one of those magic tools that once you learn you can never imagine how you did without it. It can do anything you need to an image. <em>Anything</em>. For instance, resize our image to 32×32 pixels and convert it into black and white.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(ns ocr.main<tt>
</tt> (:use [clojure.contrib.shell-out :only (sh)]))<tt>
</tt><tt>
</tt>(defn convert-image<tt>
</tt> [in out]<tt>
</tt> (sh "convert" in "-colorspace" "gray" "+dither" "-colors" "2"<tt>
</tt> "-normalize" "-resize" "32x32!" out))<tt>
</tt></pre></td>
</tr></table>
<p>It took me a while to figure out this incantation. The <a href="http://www.imagemagick.org/Usage/quantize/#monochrome">user manual for quantize</a> is probably the best reference you’ll find. Note that the exclamation mark in “32×32!” will stretch the dimensions of the image to be square. This is desirable since most people write too skinny, and maybe some write too fat, but we need the digits to be square otherwise everything will look like a “1”. Converting the above “5” will look like this:</p>
<p><img src="https://img.skitch.com/20110807-f5qrdt34ccsj2d137rq94fwajn.jpg" alt="" /></p>
<p>I am shelling out from Clojure to transform the file. There are two other options: JMagick, which uses the C <span class="caps">API</span> directly using <span class="caps">JNI</span>, and im4java which still shells out but gives you a nice interface over the top of it. I couldn’t get the first one working (it looks like a pretty dead project, no updates for a few years), and the latter wouldn’t give me anything helpful in this case.</p>
<h2>Convert the image into a text format</h2>
<p>The <code>convert</code> program automatically formats the output file based on the file extension, so you can easily convert between any graphic formats you choose. For instance, convert <span class="caps">JPG</span> to <span class="caps">PNG</span>:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">convert myfile.jpg myfile.png<tt>
</tt></pre></td>
</tr></table>
<p>As well as graphic formats though, it also supports the <code>txt</code> format, which looks like this:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"># ImageMagick pixel enumeration: 32,32,255,rgb<tt>
</tt>0,0: (255,255,255) #FFFFFF white<tt>
</tt>1,0: ( 0, 0, 0) #000000 black<tt>
</tt># etc...<tt>
</tt></pre></td>
</tr></table>
<p>That’s handy, because it can be easily translated into a bitmap with “1” representing black and “0” representing white. The “5” from above will look like this:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">10000000000000000000000000000000<tt>
</tt>00000000000000000000000000000000<tt>
</tt>00000000000000000000000000000000<tt>
</tt>00000000000000000000001111111111<tt>
</tt>00000000000000111111111111111111<tt>
</tt>00000000000011111111111111111111<tt>
</tt>00000000000011111111111111111110<tt>
</tt>00000000000111111111100000000000<tt>
</tt>00000000000111100000000000000000<tt>
</tt>00000000001111100000000000000000<tt>
</tt>00000000001111000000000000000000<tt>
</tt>00000000011110000000000000000000<tt>
</tt>00000000111110000000000000000000<tt>
</tt>00000000111110000000000000000000<tt>
</tt>00000000111110000000000000000000<tt>
</tt>00000000111111111000000000000000<tt>
</tt>00000000111111111000000000000000<tt>
</tt>00000000001111111100000000000000<tt>
</tt>00000000000111111110000000000000<tt>
</tt>00000000000001111111000000000000<tt>
</tt>00000000000000111111000000000000<tt>
</tt>00000000000000011111000000000000<tt>
</tt>00000000000000001111000000000000<tt>
</tt>00000000000000000111100000000000<tt>
</tt>00000000000000000111100000000000<tt>
</tt>00000000000000011111000000000000<tt>
</tt>00011111111111111111000000000000<tt>
</tt>00011111111111111110000000000000<tt>
</tt>00011111111111111100000000000000<tt>
</tt>00000111111111111000000000000000<tt>
</tt>00000000001110000000000000000000<tt>
</tt>00000000000000000000000000000000<tt>
</tt></pre></td>
</tr></table>
<p>I used the <code>duck-streams</code> library found in <code>clojure.contrib</code> to read and write the file from disk, and applied some light processing to get the data into the required format. I also used a temporary file on disk to store the data. I’m pretty sure there would be a way to get <code>convert</code> to write to <span class="caps">STDOUT</span> then process that in memory, but I didn’t figure it out. It’s handy for debugging to have the file there anyways.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(ns ocr.main<tt>
</tt> (:use [clojure.contrib.shell-out :only (sh)] [clojure.string :only (split trim)]) ; split/trim assumed to come from clojure.string<tt>
</tt> (:use [clojure.contrib.duck-streams :only (read-lines write-lines)]))<tt>
</tt><tt>
</tt>(defn read-text-image-line [line]<tt>
</tt> (if (= "white" (last (split line #"[,:\s]+"))) "0" "1"))<tt>
</tt><tt>
</tt>(defn load-text-image<tt>
</tt> [filename]<tt>
</tt> (let [lines (vec (drop 1 (read-lines filename)))<tt>
</tt> converted (map read-text-image-line lines) ]<tt>
</tt> (map #(apply str %) (partition 32 converted))))<tt>
</tt><tt>
</tt>(defn convert-image<tt>
</tt> [in out]<tt>
</tt> (sh "convert" in "-colorspace" "gray" "+dither" "-colors" "2"<tt>
</tt> "-normalize" "-resize" "32x32!" out)<tt>
</tt> (write-lines out (load-text-image out)))<tt>
</tt><tt>
</tt>(def temp-outfile "/tmp/clj-converted.txt")<tt>
</tt></pre></td>
</tr></table>
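<p>For comparison, the same line-by-line translation can be sketched in a few lines of Ruby. This is only an illustration of the idea, not part of the Clojure script: the <code>SAMPLE</code> text is a made-up 2×2 image in the <code>txt</code> format shown earlier, and <code>pixel_bit</code> and <code>text_image_to_rows</code> are hypothetical helper names.</p>

```ruby
# Made-up 2x2 image in ImageMagick's txt format (header line first).
SAMPLE = <<~TXT
  # ImageMagick pixel enumeration: 2,2,255,rgb
  0,0: (255,255,255)  #FFFFFF  white
  1,0: (  0,  0,  0)  #000000  black
  0,1: (  0,  0,  0)  #000000  black
  1,1: (255,255,255)  #FFFFFF  white
TXT

# The colour name is the last token on each line: white is "0", anything else "1".
def pixel_bit(line)
  line.split.last == "white" ? "0" : "1"
end

# Drop the header, translate each pixel, then group pixels into rows.
# Pixels are listed with x varying fastest, so each slice is one image row.
def text_image_to_rows(text, width)
  text.lines.drop(1).map { |line| pixel_bit(line) }.each_slice(width).map(&:join)
end

puts text_image_to_rows(SAMPLE, 2)
# prints:
# 01
# 10
```

<p>For the real snapshot the width would be 32, matching the <code>partition 32</code> call in the Clojure version.</p>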
<p>One more function is needed to be able to load that file up again into memory. This one doesn’t need to use <code>read-lines</code>, since the desired format for the classification below is actually just a vector of ones and zeros, so <code>slurp</code> is a quick alternative which is in the core libraries.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(defn load-char-file [file]<tt>
</tt> (let [filename (.getName file)<tt>
</tt> tokens (split filename #"[_\.]")<tt>
</tt> label (first tokens)<tt>
</tt> contents (parse-char-row (slurp file))]<tt>
</tt> [label contents]))<tt>
</tt></pre></td>
</tr></table>
<h2>Fetch some training data</h2>
<p>The <a href="http://archive.ics.uci.edu/ml/">University of California, Irvine</a> provides some sweet datasets if you’re getting into machine learning. In particular, the <a href="http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits">Optical Recognition of Handwritten Digits Data Set</a> contains nearly 2000 labeled digits provided in the 32×32 text format the snapshot is now in. All digits are in one file, with a few header rows that can be dropped and ignored.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">wget http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits-orig.tra.Z<tt>
</tt>gunzip optdigits-orig.tra.Z<tt>
</tt></pre></td>
</tr></table>
<table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(defn parse-char-row [row]<tt>
</tt> (map #(Integer/parseInt %) (filter #(or (= % "1") (= % "0")) (split row #""))))<tt>
</tt><tt>
</tt>(defn parse-char-data [element]<tt>
</tt> (let [label (trim (last element))<tt>
</tt> rows (take 32 element)]<tt>
</tt> [label (vec (flatten (map parse-char-row rows)))]))<tt>
</tt><tt>
</tt>(defn load-training-data<tt>
</tt> [filename]<tt>
</tt> (let [lines (drop 21 (read-lines filename))<tt>
</tt> elements (partition 33 lines)]<tt>
</tt> (map parse-char-data elements)<tt>
</tt> ))<tt>
</tt><tt>
</tt>(def training-set (load-training-data "optdigits-orig.tra"))<tt>
</tt></pre></td>
</tr></table>
<p>This code returns an array of all the training data, each element being an array itself with the first element a label (“0”, “1”, “2”, etc…) and the second element a vector of all the data (new lines ignored, they’re not important).</p>
<p>Note that I’m using <code>vec</code> throughout. This forces lazy sequences to be evaluated, which is a required performance optimization for this program; without it, the calculation won’t finish.</p>
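<p>The shape of the parsed training data may be easier to see in a small Ruby sketch. It follows the same steps as the Clojure above (drop the header, take groups of 32 rows plus one label line), but <code>parse_training_data</code> and its keyword arguments are hypothetical, and the test input below is a toy 2×2 “dataset” rather than the real file.</p>

```ruby
# Hypothetical helper mirroring load-training-data: skip the header lines,
# then read records of `size` bitmap rows followed by one label line.
def parse_training_data(lines, header_lines: 21, size: 32)
  lines.drop(header_lines).each_slice(size + 1).map do |chunk|
    label = chunk.last.strip
    # Keep only the 0/1 characters and flatten all rows into one vector.
    bits = chunk.first(size).flat_map { |row| row.scan(/[01]/).map(&:to_i) }
    [label, bits]
  end
end

# Toy input: one header line, then two 2x2 digits labeled 5 and 7.
toy = ["header", "01", "10", " 5", "11", "00", " 7"]
p parse_training_data(toy, header_lines: 1, size: 2)
# => [["5", [0, 1, 1, 0]], ["7", [1, 1, 0, 0]]]
```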
<h2>Classify our digit</h2>
<p>This is the exciting part! I won’t go into the algorithm here (buy the Machine Learning book!), but it’s called K Nearest Neighbour and it’s not particularly fancy but works surprisingly well. If you read <a href="http://rhnh.net/2011/08/02/exploring-data-with-clojure-incanter-and-leiningen">my last blog post</a>, you’ll note I’ve dropped the <code>Incanter</code> library. It was too much mucking about and didn’t provide any value for this project. Reading datasets is pretty easy with Clojure anyways.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(defn minus-vector [& args]<tt>
</tt> (map #(apply - %) (apply map vector args)))<tt>
</tt><tt>
</tt>(defn sum-of-squares [coll]<tt>
</tt> (reduce (fn [a v] (+ a (* v v))) 0 coll)) ; seed with 0 so the first element is squared too<tt>
</tt><tt>
</tt>(defn calculate-distances [in]<tt>
</tt> (fn [row]<tt>
</tt> (let [vector-diff (minus-vector (last in) (last row))<tt>
</tt> label (first row)<tt>
</tt> distance (sqrt (sum-of-squares vector-diff))]<tt>
</tt> [label distance])))<tt>
</tt><tt>
</tt>(defn classify [in]<tt>
</tt> (let [k 10<tt>
</tt> diffs (map (calculate-distances in) training-set)<tt>
</tt> nearest-neighbours (frequencies (map first (take k (sort-by last diffs))))<tt>
</tt> classification (first (last (sort-by second nearest-neighbours)))]<tt>
</tt> classification))<tt>
</tt></pre></td>
</tr></table>
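<p>If the Clojure is hard to follow, here is roughly the same nearest-neighbour vote sketched in Ruby. It is a simplified illustration under my own assumptions: the five-element <code>TRAINING</code> set is made up, the vectors have four entries instead of 1024, and <code>k</code> is passed in rather than fixed at 10.</p>

```ruby
# Straight-line distance between two equal-length vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Take the k training rows closest to the input and return the most common label.
def classify(input, training_set, k)
  nearest = training_set
            .sort_by { |_label, vector| euclidean_distance(input, vector) }
            .first(k)
  votes = Hash.new(0)
  nearest.each { |label, _vector| votes[label] += 1 }
  votes.max_by { |_label, count| count }.first
end

# Made-up training data: [label, vector] pairs, like the parsed digits.
TRAINING = [
  ["0", [0, 0, 0, 0]],
  ["0", [0, 0, 0, 1]],
  ["1", [1, 1, 1, 1]],
  ["1", [1, 1, 1, 0]],
  ["1", [1, 1, 0, 1]]
].freeze

p classify([1, 1, 1, 1], TRAINING, 3) # => "1"
p classify([0, 0, 0, 0], TRAINING, 3) # => "0"
```

<p>Sorting the whole training set is wasteful for large data, but with a couple of thousand rows it is the simplest thing that works.</p>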
<p>Now to tie it all together with a main function that converts and classifies all the snapshots you pass in as arguments.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(defn classify-image [filename]<tt>
</tt> (convert-image filename temp-outfile)<tt>
</tt> (classify (load-char-file (java.io.File. temp-outfile))))<tt>
</tt><tt>
</tt>(defn -main [& args]<tt>
</tt> (doseq [filename args]<tt>
</tt> (println "I think that is the number" (classify-image filename))))<tt>
</tt></pre></td>
</tr></table>
<p>That’s the lot. Use it like so:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">> lein run myDigits/5_0.jpg<tt>
</tt>I think that is the number 5<tt>
</tt></pre></td>
</tr></table>
<p>Hooray! Here is the <a href="https://gist.github.com/1130134">full script as a gist</a>. Let me know if you do anything fun with it.</p>tag:www.rhnh.net,2008:Post/8522011-08-03T10:43:42Z2011-08-03T10:43:42ZProfiling Clojure<p>Tonight I was so impressed by how easy it was to profile some Clojure code using built-in <span class="caps">JVM</span> tools that I had to share:</p>
<p><iframe src="http://player.vimeo.com/video/27237005?title=0&byline=0&portrait=0&color=FFFACD" width="600" height="358" frameborder="0"></iframe><p><a href="http://vimeo.com/27237005">Profiling Clojure</a>.</p></p>
<p>Today I also learned more about the Incanter <span class="caps">API</span>, and wrote some good code to <a href="http://stackoverflow.com/questions/5481777/how-can-i-modify-a-column-in-an-incanter-dataset/6921703#6921703">transform columns</a>, among other things.</p>tag:www.rhnh.net,2008:Post/8512011-08-02T08:45:00Z2011-08-02T08:45:35ZExploring data with Clojure, Incanter, and Leiningen<p>I’m working through <a href="http://www.manning.com/pharrington/">Machine Learning in Action</a> at the moment, and it’s done in Python. I don’t really know Python, but I’d prefer to learn Clojure, so I’m redoing the code samples.</p>
<p>This blog post shows how to read a <span class="caps">CSV</span> file, manipulate it, then graph it. Turns out Clojure is pretty good for this, in combination with the Incanter library (think R for the <span class="caps">JVM</span>). It took me a while to get an environment set up since I’m unfamiliar with basically everything.</p>
<h2>Install Clojure</h2>
<p>I already had it installed so can’t remember if there were any crazy steps to get it working. Hopefully this is all you need:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">brew install clojure<tt>
</tt></pre></td>
</tr></table>
<h2>Install Leiningen</h2>
<p>Leiningen is a build tool which does many things, but most importantly for me is it manages the classpath. I was jumping through all sorts of hoops trying to get Incanter running without it.</p>
<p><a href="https://github.com/technomancy/leiningen">There are easy to follow instructions in the <span class="caps">README</span></a></p>
<p><strong><span class="caps">UPDATE</span>:</strong> As suggested in the comments, you can probably just <code>brew install lein</code> here and that will get you Leiningen and Clojure in one command.</p>
<h2>Create a new project</h2><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">lein new hooray-data && cd hooray-data<tt>
</tt></pre></td>
</tr></table>
<p>Add Incanter as a dependency to the <code>project.clj</code> file, and also a main target:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(defproject clj "1.0.0-SNAPSHOT"<tt>
</tt> :description "FIXME: write"<tt>
</tt> :dependencies [[org.clojure/clojure "1.2.0"]<tt>
</tt> [org.clojure/clojure-contrib "1.2.0"]<tt>
</tt> [incanter "1.2.3-SNAPSHOT"]]<tt>
</tt> :main hooray_data.core)<tt>
</tt></pre></td>
</tr></table>
<p>Add some Incanter code to <code>src/hooray_data/core.clj</code></p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(ns hooray_data.core<tt>
</tt> (:gen-class)<tt>
</tt> (:use (incanter core stats charts io datasets)))<tt>
</tt><tt>
</tt>(defn -main [& args]<tt>
</tt> (view (histogram (sample-normal 1000))))<tt>
</tt></pre></td>
</tr></table>
<p>Then fire it up:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">lein deps<tt>
</tt>lein run<tt>
</tt></pre></td>
</tr></table>
<p>If everything runs to plan you’ll see a pretty graph.</p>
<h2>Code</h2>
<p>First, a simple categorized scatter plot. <code>read-dataset</code> works with both URLs and files, which is pretty handy.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(ns hooray_data.core<tt>
</tt> (:use (incanter core stats charts io)))<tt>
</tt><tt>
</tt>; Sample data set provided by Incanter<tt>
</tt>(def plotData (read-dataset <tt>
</tt> "https://raw.github.com/liebke/incanter/master/data/iris.dat" <tt>
</tt> :delim \space <tt>
</tt> :header true))<tt>
</tt><tt>
</tt>(def plot (scatter-plot<tt>
</tt> (sel plotData :cols 0)<tt>
</tt> (sel plotData :cols 1)<tt>
</tt> :x-label "Sepal Length"<tt>
</tt> :y-label "Sepal Width"<tt>
</tt> :group-by (sel plotData :cols 4)))<tt>
</tt><tt>
</tt>(defn -main [& args]<tt>
</tt> (view plot))<tt>
</tt></pre></td>
</tr></table>
<p>Second, the same data but normalized. The graph will look the same, but the underlying data is now ready for some more math.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">(ns hooray_data.core<tt>
</tt> (:use (incanter core stats charts io)))<tt>
</tt><tt>
</tt>; Sample data set provided by Incanter<tt>
</tt>(def data (read-dataset <tt>
</tt> "https://raw.github.com/liebke/incanter/master/data/iris.dat" <tt>
</tt> :delim \space <tt>
</tt> :header true))<tt>
</tt><tt>
</tt>(defn extract [f]<tt>
</tt> (fn [data]<tt>
</tt> (map #(apply f (sel data :cols %)) (range 0 (ncol data)))))<tt>
</tt><tt>
</tt>(defn fill [n row] (map (fn [x] row) (range 0 n)))<tt>
</tt><tt>
</tt>(defn matrix-row-operation [operand row matrix] <tt>
</tt> (operand matrix <tt>
</tt> (fill (nrow matrix) row)))<tt>
</tt><tt>
</tt>; Probably could be much nicer using `reduce`<tt>
</tt>(defn normalize [matrix]<tt>
</tt> (let [shifted (matrix-row-operation minus ((extract min) matrix) matrix)]<tt>
</tt> (matrix-row-operation div ((extract max) shifted) shifted)))<tt>
</tt><tt>
</tt>(def normalized-data<tt>
</tt> (normalize (to-matrix (sel data :cols [0 1]))))<tt>
</tt><tt>
</tt>(def normalized-plot (scatter-plot<tt>
</tt> (sel normalized-data :cols 0)<tt>
</tt> (sel normalized-data :cols 1)<tt>
</tt> :x-label "Sepal Length"<tt>
</tt> :y-label "Sepal Width"<tt>
</tt> :group-by (sel data :cols 4)))<tt>
</tt><tt>
</tt>(defn -main [& args]<tt>
</tt> (view normalized-plot))<tt>
</tt></pre></td>
</tr></table>
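<p>The normalization itself is just “subtract the column minimum, then divide by the column range”. Here is a small Ruby sketch of that arithmetic on plain arrays, with no Incanter involved. The <code>normalize</code> name matches the Clojure function, but the sample rows are made up, and the sketch assumes no column is constant (a zero range would divide by zero).</p>

```ruby
# Min-max normalization: rescale every column to the range 0.0..1.0.
# Assumes each column has at least two distinct values.
def normalize(rows)
  columns = rows.transpose
  mins = columns.map(&:min)
  ranges = columns.map { |column| column.max - column.min }
  rows.map do |row|
    row.each_with_index.map { |value, i| (value - mins[i]).to_f / ranges[i] }
  end
end

# Made-up data: two columns on very different scales.
p normalize([[1, 10], [3, 20], [5, 40]])
# first column becomes 0.0, 0.5, 1.0; second becomes 0.0, 1/3, 1.0
```

<p>Working on transposed columns avoids building the repeated “row of minimums” matrix that <code>fill</code> constructs in the Clojure version.</p>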
<p>I was kind of hoping the <code>normalize</code> function would have already been written for me in a standard library, but I couldn’t find it.</p>
<p>I’ll report back if anything else of interest comes up as I’m working through the book.</p>tag:www.rhnh.net,2008:Post/8502011-07-30T05:45:00Z2011-09-03T22:46:34ZInterface Mocking<p><strong><span class="caps">UPDATE</span>:</strong> This is a gem now: <a href="https://github.com/xaviershay/rspec-fire">rspec-fire</a>. The code in the gem is better than that presented here.</p>
<p>Here is a screencast I put together in response to a recent Destroy All Software screencast on <a href="https://www.destroyallsoftware.com/screencasts/catalog/test-isolation-and-refactoring">test isolation and refactoring</a>. It shows off an idea I’ve been tinkering with: automatic validation of the implicit interfaces that you stub in tests.</p>
<p><iframe src="http://player.vimeo.com/video/27079042?title=0&byline=0&portrait=0&color=FFFACD" width="600" height="338" frameborder="0"></iframe><p><a href="http://vimeo.com/27079042">Interface Mocking screencast</a>.</p></p>
<p>Here is the code for <code>InterfaceMocking</code>:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">module</span> <span class="cl">InterfaceMocking</span><tt>
</tt><tt>
</tt> <span class="c"># Returns a new interface double. This is equivalent to an RSpec double,</span><tt>
</tt> <span class="c"># stub, or mock, except that if the class passed as the first parameter</span><tt>
</tt> <span class="c"># is loaded it will raise if you try to set an expectation or stub on</span><tt>
</tt> <span class="c"># a method that the class has not implemented.</span><tt>
</tt> <span class="r">def</span> <span class="fu">interface_double</span>(stubbed_class, methods = {})<tt>
</tt> <span class="co">InterfaceDouble</span>.new(stubbed_class, methods)<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">module</span> <span class="cl">InterfaceDoubleMethods</span><tt>
</tt><tt>
</tt> include <span class="co">RSpec</span>::<span class="co">Matchers</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">should_receive</span>(method_name)<tt>
</tt> ensure_implemented(method_name)<tt>
</tt> <span class="r">super</span><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">should_not_receive</span>(method_name)<tt>
</tt> ensure_implemented(method_name)<tt>
</tt> <span class="r">super</span><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">stub!</span>(method_name)<tt>
</tt> ensure_implemented(method_name)<tt>
</tt> <span class="r">super</span><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">ensure_implemented</span>(*method_names)<tt>
</tt> <span class="r">if</span> recursive_const_defined?(<span class="co">Object</span>, <span class="iv">@__stubbed_class__</span>)<tt>
</tt> recursive_const_get(<span class="co">Object</span>, <span class="iv">@__stubbed_class__</span>).<tt>
</tt> should implement(method_names, <span class="iv">@__checked_methods__</span>)<tt>
</tt> <span class="r">end</span><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">recursive_const_get</span> object, name<tt>
</tt> name.split(<span class="s"><span class="dl">'</span><span class="k">::</span><span class="dl">'</span></span>).inject(<span class="co">Object</span>) {|klass,name| klass.const_get name }<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">recursive_const_defined?</span> object, name<tt>
</tt> !!name.split(<span class="s"><span class="dl">'</span><span class="k">::</span><span class="dl">'</span></span>).inject(<span class="co">Object</span>) {|klass,name|<tt>
</tt> <span class="r">if</span> klass && klass.const_defined?(name)<tt>
</tt> klass.const_get name<tt>
</tt> <span class="r">end</span><tt>
</tt> }<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">class</span> <span class="cl">InterfaceDouble</span> < <span class="co">RSpec</span>::<span class="co">Mocks</span>::<span class="co">Mock</span><tt>
</tt><tt>
</tt> include <span class="co">InterfaceDoubleMethods</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">initialize</span>(stubbed_class, *args)<tt>
</tt> args << {} <span class="r">unless</span> <span class="co">Hash</span> === args.last<tt>
</tt><tt>
</tt> <span class="iv">@__stubbed_class__</span> = stubbed_class<tt>
</tt> <span class="iv">@__checked_methods__</span> = <span class="sy">:public_instance_methods</span><tt>
</tt> ensure_implemented *args.last.keys<tt>
</tt><tt>
</tt> <span class="c"># __declared_as copied from rspec/mocks definition of `double`</span><tt>
</tt> args.last[<span class="sy">:__declared_as</span>] = <span class="s"><span class="dl">'</span><span class="k">InterfaceDouble</span><span class="dl">'</span></span><tt>
</tt> <span class="r">super</span>(stubbed_class, *args)<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="co">RSpec</span>::<span class="co">Matchers</span>.define <span class="sy">:implement</span> <span class="r">do</span> |expected_methods, checked_methods|<tt>
</tt> match <span class="r">do</span> |stubbed_class|<tt>
</tt> unimplemented_methods(<tt>
</tt> stubbed_class,<tt>
</tt> expected_methods,<tt>
</tt> checked_methods<tt>
</tt> ).empty?<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> <span class="r">def</span> <span class="fu">unimplemented_methods</span>(stubbed_class, expected_methods, checked_methods)<tt>
</tt> implemented_methods = stubbed_class.send(checked_methods)<tt>
</tt> unimplemented_methods = expected_methods - implemented_methods<tt>
</tt> <span class="r">end</span><tt>
</tt><tt>
</tt> failure_message_for_should <span class="r">do</span> |stubbed_class|<tt>
</tt> <span class="s"><span class="dl">"</span><span class="k">%s does not publicly implement:</span><span class="ch">\n</span><span class="k">%s</span><span class="dl">"</span></span> % [<tt>
</tt> stubbed_class,<tt>
</tt> unimplemented_methods(<tt>
</tt> stubbed_class,<tt>
</tt> expected_methods,<tt>
</tt> checked_methods<tt>
</tt> ).sort.map {|x|<tt>
</tt> <span class="s"><span class="dl">"</span><span class="k"> </span><span class="il"><span class="idl">#{</span>x<span class="idl">}</span></span><span class="dl">"</span></span><tt>
</tt> }.join(<span class="s"><span class="dl">"</span><span class="ch">\n</span><span class="dl">"</span></span>)<tt>
</tt> ]<tt>
</tt> <span class="r">end</span><tt>
</tt><span class="r">end</span><tt>
</tt><tt>
</tt><span class="co">RSpec</span>.configure <span class="r">do</span> |config|<tt>
</tt><tt>
</tt> config.include <span class="co">InterfaceMocking</span><tt>
</tt><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
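<p>To see the core check in isolation, here is a minimal sketch that does not depend on RSpec at all. <code>CheckedDouble</code> and <code>EmailNotifier</code> are hypothetical names invented for this illustration, and unlike the module above the sketch takes the class itself rather than its name, so it does not cover the “only check if the class is loaded” behaviour.</p>

```ruby
# A hypothetical, RSpec-free illustration of interface checking.
class CheckedDouble
  def initialize(klass, stubs = {})
    @klass = klass
    @stubs = {}
    stubs.each { |name, value| stub(name, value) }
  end

  # Register a canned return value, but only for methods that the real
  # class publicly implements.
  def stub(name, value)
    unless @klass.public_instance_methods.include?(name.to_sym)
      raise ArgumentError, "#{@klass} does not publicly implement: #{name}"
    end
    @stubs[name.to_sym] = value
    self
  end

  # Answer stubbed methods with their canned value, reject everything else.
  def method_missing(name, *args)
    @stubs.key?(name) ? @stubs[name] : super
  end

  def respond_to_missing?(name, include_private = false)
    @stubs.key?(name) || super
  end
end

# Stand-in for a real collaborator.
class EmailNotifier
  def notify(user); end
end

double = CheckedDouble.new(EmailNotifier, :notify => true)
double.notify("xavier") # returns the canned value

begin
  # notify_all is not defined on EmailNotifier, so this raises.
  CheckedDouble.new(EmailNotifier, :notify_all => true)
rescue ArgumentError => e
  puts e.message
end
```

<p>The point of the design is that a typo or a renamed method fails at the moment the stub is declared, rather than passing silently in an isolated test.</p>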
tag:www.rhnh.net,2008:Post/8492011-07-29T05:55:00Z2011-07-29T05:55:02ZStatic Asset Caching on Heroku Cedar Stack<p>I recently moved this blog over to <a href="http://heroku.com">Heroku</a>, and in the process added some proper <span class="caps">HTTP</span> caching headers. The dynamic pages use the built-in <code>fresh_when</code> and <code>stale?</code> Rails helpers, combined with <code>Rack::Cache</code> and the free memcached plugin available on Heroku. That was all pretty straightforward. What was more difficult was configuring Heroku to serve all static assets (such as images and stylesheets) with a far-future <code>max-age</code> header so that they will be cached for eternity. What I’ve documented here is somewhat of a hack, and hopefully Heroku will provide a better way of doing this in the future.</p>
<p>By default Heroku serves everything in <code>public</code> directly via nginx. This is a problem for us since we don’t get a chance to configure the caching headers. Instead, use the <code>Rack::StaticCache</code> middleware (provided in the <code>rack-contrib</code> gem) to serve static files, which by default adds far-future <code>max-age</code> cache control headers. It needs to serve from a different directory to <code>public</code>, since there is no way to disable the nginx serving. I renamed my <code>public</code> folder to <code>public_cached</code>.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="c"># config/application.rb</span><tt>
</tt>config.middleware.use <span class="co">Rack</span>::<span class="co">StaticCache</span>, <tt>
</tt> <span class="ke">urls</span>: <span class="s"><span class="dl">%w(</span><span class="k"><tt>
</tt> /stylesheets<tt>
</tt> /images<tt>
</tt> /javascripts<tt>
</tt> /robots.txt<tt>
</tt> /favicon.ico<tt>
</tt> </span><span class="dl">)</span></span>,<tt>
</tt> <span class="ke">root</span>: <span class="s"><span class="dl">"</span><span class="k">public_cached</span><span class="dl">"</span></span><tt>
</tt></pre></td>
</tr></table>
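<p>For a rough idea of what such a middleware does, here is a minimal sketch of a Rack-style middleware that adds a far-future header to matching paths. <code>FarFutureCache</code> is a hypothetical class written for this illustration, not the <code>Rack::StaticCache</code> source, and it only sets the header rather than serving files from disk.</p>

```ruby
# Hypothetical middleware for illustration only.
class FarFutureCache
  ONE_YEAR = 365 * 24 * 60 * 60 # seconds

  def initialize(app, options = {})
    @app  = app
    @urls = options.fetch(:urls, [])
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Only touch responses whose path starts with a configured prefix.
    if @urls.any? { |prefix| env["PATH_INFO"].start_with?(prefix) }
      headers = headers.merge("Cache-Control" => "public, max-age=#{ONE_YEAR}")
    end
    [status, headers, body]
  end
end

# A stand-in for the rest of the application stack.
app   = lambda { |env| [200, {"Content-Type" => "text/plain"}, ["ok"]] }
stack = FarFutureCache.new(app, :urls => %w(/stylesheets /images))

asset_headers = stack.call("PATH_INFO" => "/images/logo.png")[1]
page_headers  = stack.call("PATH_INFO" => "/posts/1")[1]

puts asset_headers["Cache-Control"]
puts page_headers.key?("Cache-Control")
```

<p>Dynamic pages pass through untouched, which leaves them free to use <code>fresh_when</code> and <code>stale?</code> as described above.</p>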
<p>I also disabled the built-in Rails serving of static assets in development mode, so that it didn’t interfere:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="c"># config/environments/development.rb</span><tt>
</tt>config.serve_static_assets = <span class="pc">false</span><tt>
</tt></pre></td>
</tr></table>
<p>In the production config, I set the <code>x_sendfile_header</code> option to “X-Accel-Redirect”. It was “X-Sendfile”, which is an Apache directive, and was causing nginx to hang (Heroku would never actually serve the assets to the browser).</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="c"># config/environments/production.rb</span><tt>
</tt>config.action_dispatch.x_sendfile_header = <span class="s"><span class="dl">'</span><span class="k">X-Accel-Redirect</span><span class="dl">'</span></span><tt>
</tt></pre></td>
</tr></table>
<p>A downside of this approach is that if you have a lot of static assets, they all have to hit the Rails stack in order to be served. If you only have one dyno (the free plan) then the initial load can be slower than it otherwise would be if nginx was serving them directly. As I mentioned in the introduction, hopefully Heroku will provide a nicer way to do this in the future.</p>tag:www.rhnh.net,2008:Post/8472011-05-28T01:08:00Z2011-05-28T01:08:12ZSpeeding up Rails startup time<p>In which I provide easy instructions to try a new patch that drastically improves the start up time of Ruby applications, in the hope that with wide support it will be merged into the upcoming 1.9.3 release. Skip to the bottom for instructions, or keep reading for the narrative.</p>
<p><strong><span class="caps">UPDATE</span>:</strong> If you have trouble installing, grab a recent copy of rvm: <code>rvm get head</code>.</p>
<h2>Background</h2>
<p>Recent releases of <span class="caps">MRI</span> Ruby have introduced some fairly major performance regressions when requiring files:</p>
<p><img src="https://img.skitch.com/20110528-xigici83u5texbpnwnwntfrkuq.jpg" alt="" /></p>
<p>For reference, our medium-sized Rails application requires around 2200 files, which is off the right-hand side of this graph. This is problematic. On 1.9.2 it takes 20s to start up, and on 1.9.3 it takes 46s. Both are far too long.</p>
<p>There are a few reasons for this, but the core of the problem is the basic algorithm which looks something like this:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">def</span> <span class="fu">require</span>(file)<tt>
</tt> <span class="gv">$loaded</span>.each <span class="r">do</span> |x|<tt>
</tt> <span class="r">return</span> <span class="pc">false</span> <span class="r">if</span> x == file<tt>
</tt> <span class="r">end</span><tt>
</tt> load(file)<tt>
</tt> <span class="gv">$loaded</span> << file<tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
<p>That loop is no good, and gets worse the more files you have required. I have written a patch for 1.9.3 which changes this algorithm to:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">def</span> <span class="fu">require</span>(file)<tt>
</tt> <span class="r">return</span> <span class="pc">false</span> <span class="r">if</span> <span class="gv">$loaded</span>[file] <tt>
</tt> load(file)<tt>
</tt> <span class="gv">$loaded</span>[file] = <span class="pc">true</span><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
<p>That gives you a performance curve that looks like this:</p>
<p><img src="https://img.skitch.com/20110528-gtsgba1twaiwkd3frewen54ts.jpg" alt="" /></p>
<p>Much nicer.</p>
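<p>The difference between the two algorithms can be reproduced in plain Ruby. This is only a sketch of the idea, not the C patch itself: <code>require_with_array</code> and <code>require_with_hash</code> are hypothetical helpers, the file names are made up, and nothing is actually loaded.</p>

```ruby
require 'benchmark'

# Pretend feature names; no files are actually loaded.
files = (1..2_000).map { |i| "lib/file_#{i}.rb" }

# Original approach: scan the list of loaded features on every require.
def require_with_array(loaded, file)
  return false if loaded.include?(file)
  loaded << file
  true
end

# Patched approach: a hash gives a constant-time membership test.
def require_with_hash(loaded, file)
  return false if loaded[file]
  loaded[file] = true
end

Benchmark.bm(6) do |x|
  x.report("array") do
    loaded = []
    files.each { |f| require_with_array(loaded, f) }
  end
  x.report("hash") do
    loaded = {}
    files.each { |f| require_with_hash(loaded, f) }
  end
end
```

<p>Each array-based call has to walk everything loaded so far, so the total work grows with the square of the number of files, while the hash version stays roughly linear.</p>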
<p>That’s just a synthetic benchmark, but it works in the real world too. My <a href="http://theconversation.edu.au">main Rails application</a> now loads in a mite over 10s, down from the 20s it was taking on 1.9.2. A blank Rails app loads in 1.1s, which is even faster than 1.8.7.</p>
<p><img src="https://img.skitch.com/20110528-cu9nux6619fxruh5rq6ppywp7p.jpg" alt="" /></p>
<h2>Getting the fix</h2>
<p>Here is how you can try out my patch right now in just ten minutes using <span class="caps">RVM</span>.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"># First get a baseline measurement<tt>
</tt>cd /your/rails/app<tt>
</tt>time script/rails runner "puts 1"<tt>
</tt><tt>
</tt># Install a patched ruby<tt>
</tt>curl https://gist.github.com/raw/996418/e2b346fbadeed458506fc69ca213ad96d1d08c3e/require-performance-fix-r31758.patch > /tmp/require-performance-fix.patch<tt>
</tt>rvm install ruby-head --patch /tmp/require-performance-fix.patch -n patched<tt>
</tt># ... get a cup of tea, this took about 8 minutes on my MBP<tt>
</tt><tt>
</tt># Get a new measurement<tt>
</tt>cd /your/rails/app<tt>
</tt>rvm use ruby-head-patched<tt>
</tt>gem install bundler --no-rdoc --no-ri<tt>
</tt>bundle<tt>
</tt>time script/rails runner "puts 1"<tt>
</tt></pre></td>
</tr></table>
<h2>How you can help</h2>
<p>I need a lot more eyeballs on this patch before it can be considered for merging into trunk. I would really appreciate any of the following:</p>
<ul>
<li>Try it out on your app and report timings in the comments.</li>
<li><a href="https://github.com/ruby/ruby/pull/25">Code review the patch on this GitHub pull request</a> (it’s C code, but don’t let that scare you off).</li>
<li>Try it on Windows.</li>
<li>Report any bugs you find.</li>
</ul>
<h2>Next steps</h2>
<p>I imagine there will be a bit more work to get this into Ruby 1.9.3, but after that this is just the first step of many to try and speed up the time Rails takes to start up. Bundler and RubyGems still spend a lot of time doing … something, which I want to investigate. I also want to port these changes over to JRuby, which has similar issues (Rubinius isn’t quite as fast out of the gate, but does not degrade in the same way as more files are required, so would not benefit from this patch).</p>
<p>Thank you for your time.</p>tag:www.rhnh.net,2008:Post/8462011-04-30T03:05:45Z2011-04-30T03:05:45ZDeleting duplicate data with PostgreSQL<p>Here is an update to a query I posted a while back for <a href="http://rhnh.net/2010/08/22/duplicate-data">detecting duplicate data</a>. It allows you to select all but one of the resulting duplicates, for easy deletion. It only works on PostgreSQL, but is pretty neat. It uses a window function!</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="r">DELETE</span> <span class="r">FROM</span> users <tt>
</tt><span class="r">USING</span> (<tt>
</tt> <span class="r">SELECT</span> id, first_value(id) OVER (<tt>
</tt> PARTITION <span class="r">BY</span> name <span class="r">ORDER</span> <span class="r">BY</span> created_at DESC<tt>
</tt> ) first_id<tt>
</tt> <span class="r">FROM</span> users<tt>
</tt> <span class="r">WHERE</span> name IN (<tt>
</tt> <span class="r">SELECT</span> name <tt>
</tt> <span class="r">FROM</span> users <tt>
</tt> <span class="r">GROUP</span> <span class="r">BY</span> name <tt>
</tt> HAVING <span class="pd">count</span>(name) > <span class="i">1</span><tt>
</tt> )<tt>
</tt>) dups<tt>
</tt><span class="r">WHERE</span> dups.id != dups.first_id <span class="r">AND</span> users.id = dups.id;<tt>
</tt></pre></td>
</tr></table>
<p>The order by is optional, but handy if you need to select a particular row rather than just an arbitrary one. You need an extra sub-query because you can’t have window functions in a where clause.</p>
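<p>If the window function is unfamiliar, the same selection logic can be sketched over an in-memory array in Ruby. The <code>User</code> struct and its rows are made-up sample data, and <code>created_at</code> is a plain year to keep the example short.</p>

```ruby
User = Struct.new(:id, :name, :created_at)

users = [
  User.new(1, "xavier", 2009),
  User.new(2, "don",    2010),
  User.new(3, "xavier", 2011),
  User.new(4, "xavier", 2010)
]

# PARTITION BY name
ids_to_delete = users.group_by(&:name).flat_map do |_name, rows|
  # ORDER BY created_at DESC, then first_value(id)
  first_id = rows.sort_by { |u| -u.created_at }.first.id
  # WHERE dups.id != dups.first_id
  rows.map(&:id).reject { |id| id == first_id }
end

puts ids_to_delete.sort.inspect
```

<p>Within each name, the newest row is kept and every other id is marked for deletion. Names that appear only once contribute nothing to the list.</p>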
<p>For more tasty PostgreSQL tricks, check out my <a href="http://peepcode.com/products/postgresql">Meet PostgreSQL screencast</a>, a steal at only $12 <strong>plug plug plug</strong>.</p>tag:www.rhnh.net,2008:Post/8452011-04-18T00:48:42Z2011-04-18T00:48:14ZNew Column: Code Safari<p>I am writing a regular weekly column at the newly launched Sitepoint project <a href="http://rubysource.com">RubySource</a>. The column is named “Code Safari”, where I explore the jungle of ruby libraries and gems and figure out how they work. It’s an introductory series designed to not just explain how things operate, but show you the tools and techniques so that you can figure it out yourself.</p>
<p>Three posts have already been published:</p>
<ul>
<li><a href="http://rubysource.com/understanding-concurrent-programming-with-ruby-goliath/">Understanding Concurrent Programming With Ruby’s Goliath</a>, in which I dig into the new Goliath web server to figure out how it uses the new 1.9 Fibers to work some magic.</li>
<li><a href="http://rubysource.com/code-safari-configuring-capybara/">Configuring Capybara</a>, in which I investigate how Capybara implemented its configuration <span class="caps">DSL</span>, and then make one for myself.</li>
<li><a href="http://rubysource.com/code-safari-twss-and-bayesian-classification-of-twitter-searches/"><span class="caps">TWSS</span> and Bayesian Classification of Twitter Searches</a>, in which I inspect the pipes of a beautiful piece of plumbing.</li>
</ul>
<p>The format is a bit different but I’m really happy with how it is working so far. Let me know what you think.</p>tag:www.rhnh.net,2008:Post/8432011-01-31T10:37:00Z2011-01-31T10:37:27ZYAML Tutorial<p>Many years ago I wrote a <a href="http://rhnh.net/2006/06/25/yaml-tutorial">tutorial on using <span class="caps">YAML</span> in ruby</a>. It still sees the most google traffic of any post, by far. So people want to know about <span class="caps">YAML</span>? I’ll help them out.</p>
<h3>What is <span class="caps">YAML</span>?</h3>
<p><span class="caps">YAML</span> is a flexible, human readable file format that is ideal for storing object trees. <span class="caps">YAML</span> stands for “<span class="caps">YAML</span> Ain’t Markup Language”. It is easier to read (by humans) than <span class="caps">JSON</span>, and can contain richer meta data. It is far nicer than <span class="caps">XML</span>. There are libraries available for all mainstream languages including Ruby, Python, C++, Java, Perl, C#/.<span class="caps">NET</span>, Javascript, <span class="caps">PHP</span> and Haskell. It looks like this:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="head"><span class="head">---</span></span> <tt>
</tt>- <span class="s">name: Xavier</span><tt>
</tt> <span class="ke">country</span>: <span class="s">Australia</span><tt>
</tt> <span class="ke">age</span>: <span class="s">24</span><tt>
</tt>- <span class="s">name: Don</span><tt>
</tt> <span class="ke">country</span>: <span class="s">US</span><tt>
</tt></pre></td>
</tr></table>
<p>That is a simple array of hashes. You can nest any combination of these simple data structures however you like. Most parsers will also detect the 24 as an integer too. Quoting strings is optional, and was omitted in this example.</p>
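<p>As a quick check of that claim, here is a small sketch that loads the document above with Ruby’s standard <code>yaml</code> library and inspects the result.</p>

```ruby
require 'yaml'

people = YAML.load(<<-YAML_DOC)
---
- name: Xavier
  country: Australia
  age: 24
- name: Don
  country: US
YAML_DOC

puts people.class          # the top level is an array
puts people.first["name"]  # each element is a hash with string keys
puts people.first["age"].class
puts people.last.key?("age")
```

<p>The second entry simply has no <code>age</code> key, since hashes in the same list do not need to share a shape.</p>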
<p><span class="caps">YAML</span> allows you to add tags to your objects, which is extra meta-data that your application can use to deserialize portions into complex data structures. For instance, in ruby if you serialize a set object it looks like this:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="c"># Set.new([1,2]).to_yaml</span><tt>
</tt><span class="head"><span class="head">---</span></span> <span class="ty">!ruby/object</span>:<span class="cl">Set</span> <tt>
</tt><span class="ke">hash</span>: <tt>
</tt> <span class="ke">1</span>: <span class="s">true</span><tt>
</tt> <span class="ke">2</span>: <span class="s">true</span><tt>
</tt></pre></td>
</tr></table>
<p>Notice that ruby has added the <code>ruby/object:Set</code> tag so that the correct object can be instantiated on deserialization, while maintaining a human readable rendition of a set. These tags can be anything you like, ruby just happens to use that particular format.</p>
<p>You can remove duplication from <span class="caps">YAML</span> files by using anchors (&) and aliases (*). You typically see this in configuration files, such as:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="ke">defaults</span>: <span class="v">&defaults</span><tt>
</tt> <span class="ke">adapter</span>: <span class="s">postgres</span><tt>
</tt> <span class="ke">host</span>: <span class="s">localhost</span><tt>
</tt><tt>
</tt><span class="ke">development</span>:<tt>
</tt> <span class="ke">database</span>: <span class="s">myapp_development</span><tt>
</tt> <span class="cv"><<</span>: <span class="gv">*defaults</span><tt>
</tt><tt>
</tt><span class="ke">test</span>:<tt>
</tt> <span class="ke">database</span>: <span class="s">myapp_test</span><tt>
</tt> <span class="cv"><<</span>: <span class="gv">*defaults</span><tt>
</tt></pre></td>
</tr></table>
<p><code>&</code> sets up the name of the anchor (“defaults”), <code><<</code> means “merge the given hash into the current one”, and <code>*</code> includes the named anchor (“defaults” again). The expanded version looks like this:</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="ke">defaults</span>:<tt>
</tt> <span class="ke">adapter</span>: <span class="s">postgres</span><tt>
</tt> <span class="ke">host</span>: <span class="s">localhost</span><tt>
</tt><tt>
</tt><span class="ke">development</span>:<tt>
</tt> <span class="ke">database</span>: <span class="s">myapp_development</span><tt>
</tt> <span class="ke">adapter</span>: <span class="s">postgres</span><tt>
</tt> <span class="ke">host</span>: <span class="s">localhost</span><tt>
</tt><tt>
</tt><span class="ke">test</span>:<tt>
</tt> <span class="ke">database</span>: <span class="s">myapp_test</span><tt>
</tt> <span class="ke">adapter</span>: <span class="s">postgres</span><tt>
</tt> <span class="ke">host</span>: <span class="s">localhost</span><tt>
</tt></pre></td>
</tr></table>
<p>Note that the defaults hash hangs around, even though it isn’t really required anymore.</p>
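<p>The expansion can be confirmed from Ruby. This sketch assumes the Psych engine: more recent Psych versions refuse aliases in a plain <code>YAML.load</code>, so it falls back to <code>YAML.unsafe_load</code> where that method exists. Only do that with input you trust.</p>

```ruby
require 'yaml'

config_yaml = <<-YAML_DOC
defaults: &defaults
  adapter: postgres
  host: localhost

development:
  database: myapp_development
  <<: *defaults
YAML_DOC

# Prefer unsafe_load when available, because newer Psych versions reject
# aliases in YAML.load by default.
config = if YAML.respond_to?(:unsafe_load)
  YAML.unsafe_load(config_yaml)
else
  YAML.load(config_yaml)
end

puts config["development"].inspect
puts config.keys.inspect
```

<p>The <code>development</code> hash ends up with all three keys, and <code>defaults</code> is still present at the top level.</p>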
<p><span class="caps">YAML</span> generators use this technique to correctly serialize repeated references to the same object, and even cyclic references. That’s pretty clever.</p>
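<p>A short sketch of that behaviour, again assuming the Psych engine: dumping a hash that refers to the same array twice should produce an anchor and an alias, and loading it back should restore a single shared object. The exact anchor names are chosen by the emitter.</p>

```ruby
require 'yaml'

shared = ["red", "blue"]
yaml = { "first" => shared, "second" => shared }.to_yaml
puts yaml

# Aliases need unsafe_load on newer Psych versions; use trusted input only.
loaded = if YAML.respond_to?(:unsafe_load)
  YAML.unsafe_load(yaml)
else
  YAML.load(yaml)
end

# Both keys point at the very same array again, not two equal copies.
puts loaded["first"].equal?(loaded["second"])
```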
<h3>Flow style</h3>
<p><span class="caps">YAML</span> has an alternate syntax called “flow style”, which allows arrays and hashes to be written inline without having to rely on indentation, using square brackets and curly brackets respectively.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="head"><span class="head">---</span></span> <tt>
</tt><span class="c"># Arrays</span><tt>
</tt><span class="ke">colors</span>:<tt>
</tt> - <span class="s">red</span><tt>
</tt> - <span class="s">blue</span><tt>
</tt><span class="c"># in flow style...</span><tt>
</tt><span class="ke">colors</span>: <span class="s">[red, blue]</span><tt>
</tt><tt>
</tt><span class="c"># Hashes</span><tt>
</tt>- <span class="s">name: Xavier</span><tt>
</tt> <span class="ke">age</span>: <span class="s">24</span><tt>
</tt><span class="c"># in flow style...</span><tt>
</tt>- <span class="s">{name: Xavier, age: 24}</span><tt>
</tt></pre></td>
</tr></table>
<p>This has the curious effect of making <span class="caps">YAML</span> a superset of <span class="caps">JSON</span>. A valid <span class="caps">JSON</span> document is also a valid <span class="caps">YAML</span> document.</p>
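<p>Both points are easy to try out in Ruby. The sketch below uses a made-up <span class="caps">JSON</span> string; it shows the overlap for a typical document and is not a proof that every edge case of <span class="caps">JSON</span> is accepted by every <span class="caps">YAML</span> parser.</p>

```ruby
require 'yaml'
require 'json'

# Flow style and block style describe the same data.
flow  = YAML.load("colors: [red, blue]")
block = YAML.load("colors:\n  - red\n  - blue\n")
puts flow == block

# A typical JSON document parses the same way with either library.
json = '{"name": "Xavier", "colors": ["red", "blue"], "age": 24}'
from_yaml = YAML.load(json)
from_json = JSON.parse(json)
puts from_yaml == from_json
```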
<h3>Performance</h3>
<p>Given YAML’s richness and human readability, you would expect it to be slower than native serialization or <span class="caps">JSON</span>. This would be correct. My <a href="https://github.com/xaviershay/sandbox/blob/master/misc/yaml-test.rb">brief testing</a> shows it is about an order of magnitude slower. For the typical configuration use-case, this is irrelevant, but worth keeping in mind if you are doing something crazy. Remember to run your own benchmarks that represent your specific need.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"> user system total real<tt>
</tt>Marshal serialize 0.090000 0.000000 0.090000 ( 0.091822)<tt>
</tt>Marshal deserialize 0.090000 0.000000 0.090000 ( 0.092186)<tt>
</tt>JSON serialize 0.480000 0.010000 0.490000 ( 0.480291)<tt>
</tt>JSON deserialize 0.130000 0.010000 0.140000 ( 0.134860)<tt>
</tt>YAML serialize 2.040000 0.020000 2.060000 ( 2.065693)<tt>
</tt>YAML deserialize 0.520000 0.010000 0.530000 ( 0.526048)<tt>
</tt>Psych serialize 2.530000 0.030000 2.560000 ( 2.565116)<tt>
</tt>Psych deserialize 1.510000 0.120000 1.630000 ( 1.622601)<tt>
</tt></pre></td>
</tr></table>
<p>Curiously, the new <span class="caps">YAML</span> parser Psych included in ruby 1.9.2 appears significantly slower than the old one. Not sure what is going on there.</p>
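<p>For running your own comparison, a benchmark along these lines is a reasonable starting point. It is a minimal sketch rather than the linked test script: the sample data and iteration count are arbitrary, and it measures whichever <span class="caps">YAML</span> engine your Ruby uses by default.</p>

```ruby
require 'benchmark'
require 'json'
require 'yaml'

# Arbitrary sample data: plain hashes, strings, integers and arrays.
data = (1..200).map do |i|
  { "id" => i, "name" => "user#{i}", "tags" => %w(a b c) }
end
n = 20

marshal_dump = Marshal.dump(data)
json_dump    = JSON.generate(data)
yaml_dump    = data.to_yaml

Benchmark.bm(20) do |x|
  x.report("Marshal serialize")   { n.times { Marshal.dump(data) } }
  x.report("Marshal deserialize") { n.times { Marshal.load(marshal_dump) } }
  x.report("JSON serialize")      { n.times { JSON.generate(data) } }
  x.report("JSON deserialize")    { n.times { JSON.parse(json_dump) } }
  x.report("YAML serialize")      { n.times { data.to_yaml } }
  x.report("YAML deserialize")    { n.times { YAML.load(yaml_dump) } }
end
```

<p>Swap in data that resembles your real workload before drawing conclusions, since the relative cost depends heavily on the shape and size of what you serialize.</p>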
<h3>Reading <span class="caps">YAML</span> from a file with ruby</h3><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">require <span class="s"><span class="dl">'</span><span class="k">yaml</span><span class="dl">'</span></span><tt>
</tt><tt>
</tt>parsed = <span class="r">begin</span><tt>
</tt> <span class="co">YAML</span>.load_file(<span class="s"><span class="dl">"</span><span class="k">/tmp/test.yml</span><span class="dl">"</span></span>)<tt>
</tt><span class="r">rescue</span> <span class="co">ArgumentError</span> => e<tt>
</tt> puts <span class="s"><span class="dl">"</span><span class="k">Could not parse YAML: </span><span class="il"><span class="idl">#{</span>e.message<span class="idl">}</span></span><span class="dl">"</span></span><tt>
</tt><span class="r">end</span><tt>
</tt></pre></td>
</tr></table>
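<p>One caveat on the <code>rescue</code> above: <code>ArgumentError</code> is what the old parser raised on bad input. When Psych is doing the parsing it raises <code>Psych::SyntaxError</code> instead. Here is a hedged sketch of the same idea for that case, assuming Psych is your <span class="caps">YAML</span> engine and is recent enough to have <code>safe_load</code>. The <code>load_config</code> name is made up for this example, and it uses a temp file rather than a real path.</p>

```ruby
require 'yaml'
require 'tempfile'

# Parse a YAML file, returning nil (and printing a message) on failure.
# load_config is a hypothetical helper name, not part of any library.
def load_config(path)
  # safe_load only builds plain types (hashes, arrays, strings, numbers...),
  # which is what you want for configuration files.
  YAML.safe_load(File.read(path))
rescue Psych::SyntaxError => e
  puts "Could not parse YAML: #{e.message}"
  nil
rescue Errno::ENOENT
  puts "No such file: #{path}"
  nil
end

# Demonstrate with a throwaway file.
Tempfile.create(["test", ".yml"]) do |f|
  f.write("name: Xavier\nlanguages:\n  - ruby\n  - haskell\n")
  f.flush
  p load_config(f.path)
end
```

<p>Returning <code>nil</code> keeps the shape of the original snippet, where <code>parsed</code> ends up <code>nil</code> after a failed parse. Raise your own error instead if a missing or broken config should stop the program.</p>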
<h3>Writing <span class="caps">YAML</span> to a file with ruby</h3><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">require <span class="s"><span class="dl">'</span><span class="k">yaml</span><span class="dl">'</span></span><tt>
</tt><tt>
</tt>data = {<span class="s"><span class="dl">"</span><span class="k">name</span><span class="dl">"</span></span> => <span class="s"><span class="dl">"</span><span class="k">Xavier</span><span class="dl">"</span></span>}<tt>
</tt><span class="co">File</span>.open(<span class="s"><span class="dl">"</span><span class="k">path/to/output.yml</span><span class="dl">"</span></span>, <span class="s"><span class="dl">"</span><span class="k">w</span><span class="dl">"</span></span>) {|f| f.write(data.to_yaml) }<tt>
</tt></pre></td>
</tr></table>
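<p>To convince yourself that what you wrote can be read back, a quick round trip helps. This is a small sketch that writes into a temporary directory so it does not touch real files. The <code>write_yaml</code> helper is a made-up name, and <code>File.write</code> assumes ruby 1.9.3 or newer.</p>

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical helper: serialize data to YAML and write it to path.
def write_yaml(path, data)
  File.write(path, data.to_yaml) # opens, writes and closes the file
end

data = {"name" => "Xavier"}

Dir.mktmpdir do |dir|
  path = File.join(dir, "output.yml")
  write_yaml(path, data)

  puts File.read(path)             # the raw YAML document
  restored = YAML.load_file(path)  # parse it back into a ruby hash
  puts "round trip ok: #{restored == data}"
end
```

<p>The round trip is only guaranteed to compare equal for plain data like strings, numbers, arrays and hashes. Custom objects need more care.</p>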
<p>Anything else you’d like to know? Leave a comment.</p>tag:www.rhnh.net,2008:Post/8422011-01-31T10:27:00Z2011-01-31T10:29:15ZPsych YAML in ruby 1.9.2 with RVM and Snow Leopard OSX<p>Note that you must have libyaml installed <em>before</em> you compile ruby, so this probably means you’ll need to recompile your current version.</p><table class="CodeRay"><tr>
<td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">brew install libyaml<tt>
</tt>rvm install ruby-1.9.2 --with-libyaml-dir=/usr/local<tt>
</tt>ruby -rpsych -e 'puts Psych.load("win: true")'<tt>
</tt></pre></td>
</tr></table>
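<p>For a slightly more informative check than the one-liner, you can ask Psych which libyaml it was built against. A minimal sketch, assuming your Psych exposes the <code>LIBYAML_VERSION</code> and <code>VERSION</code> constants (very old builds may not):</p>

```ruby
require 'psych'

# Version of the libyaml C library this Psych build is linked against.
puts "libyaml: #{Psych::LIBYAML_VERSION}"
# Version of the Psych bindings themselves.
puts "psych:   #{Psych::VERSION}"

# Same smoke test as the one-liner: parse a tiny document.
p Psych.load("win: true")
```

<p>If the <code>require</code> fails outright, ruby was most likely compiled without libyaml available, which is the situation the note above is warning about.</p>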