Robot Has No Heart

Xavier Shay blogs here


Exploring data with Clojure, Incanter, and Leiningen

I’m working through Machine Learning in Action at the moment, and it’s done in Python. I don’t really know Python, but I’d prefer to learn Clojure, so I’m redoing the code samples.

This blog post shows how to read a CSV file, manipulate it, then graph it. Turns out Clojure is pretty good for this, in combination with the Incanter library (think R for the JVM). It took me a while to get an environment set up since I’m unfamiliar with basically everything.

Install Clojure

I already had it installed so can’t remember if there were any crazy steps to get it working. Hopefully this is all you need:

sudo brew install clojure

Install Leiningen

Leiningen is a build tool that does many things, but most importantly for me, it manages the classpath. I was jumping through all sorts of hoops trying to get Incanter running without it.

There are easy-to-follow instructions in the README.

*UPDATE:* As suggested in the comments, you can probably just `brew install lein` here and that will get you Leiningen and Clojure in one command.

Create a new project

lein new hooray-data && cd hooray-data

Add Incanter as a dependency to the project.clj file, and also a main target:

(defproject clj "1.0.0-SNAPSHOT"
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]
                 [incanter "1.2.3-SNAPSHOT"]]
  :main hooray_data.core)

Add some Incanter code to src/hooray_data/core.clj:

(ns hooray_data.core
  (:gen-class)
  (:use (incanter core stats charts io datasets)))

(defn -main [& args]
  (view (histogram (sample-normal 1000))))

Then fire it up:

lein deps
lein run

If everything runs to plan you’ll see a pretty graph.

Code

First, a simple categorized scatter plot. read-dataset works with both URLs and files, which is pretty handy.

(ns hooray_data.core
  (:use (incanter core stats charts io)))

; Sample data set provided by Incanter
(def plotData (read-dataset 
            "https://raw.github.com/liebke/incanter/master/data/iris.dat" 
            :delim \space 
            :header true))

(def plot (scatter-plot
            (sel plotData :cols 0)
            (sel plotData :cols 1)
            :x-label "Sepal Length"
            :y-label "Sepal Width"
            :group-by (sel plotData :cols 4)))

(defn -main [& args]
  (view plot))
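
To make the shape of that data a bit more concrete, here is a minimal plain-Clojure sketch of what reading a space-delimited file with a header row involves, followed by the kind of grouping that `:group-by` relies on. This is not how Incanter implements `read-dataset`. The `parse-table` function, the sample rows, and the column names are all made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample in the spirit of iris.dat: a header row followed by
;; space-delimited values. These rows are invented, not copied from the file.
(def sample-text
  "Sepal.Length Sepal.Width Species\n5.1 3.5 setosa\n7.0 3.2 versicolor\n4.9 3.0 setosa")

(defn parse-table
  "Splits text into lines, treats the first line as column names, and
  returns one map per remaining line. Values are left as strings."
  [text]
  (let [[header & rows] (map #(str/split % #" ") (str/split-lines text))]
    (map #(zipmap header %) rows)))

(def rows (parse-table sample-text))

;; Group the rows by species, then count how many rows landed in each group.
(def counts-by-species
  (into {}
        (map (fn [[species group]] [species (count group)])
             (group-by #(get % "Species") rows))))
```

The scatter plot does the same sort of split on column 4, then draws each group as its own series.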

Second, the same data but normalized, with each column rescaled to run from 0 to 1. The shape of the graph will look the same, but the underlying data is now ready for some more math.

(ns hooray_data.core
  (:use (incanter core stats charts io)))

; Sample data set provided by Incanter
(def data (read-dataset 
            "https://raw.github.com/liebke/incanter/master/data/iris.dat" 
            :delim \space 
            :header true))

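; Returns a function that applies f (e.g. min or max) to each column of
; data, giving one value per column.
(defn extract [f]
  (fn [data]
     (map #(apply f (sel data :cols %)) (range 0 (ncol data)))))

; Builds n copies of row, so one row can be combined with a whole matrix.
(defn fill [n row] (map (fn [x] row) (range 0 n)))

; Applies operand (e.g. minus or div) between matrix and row, with row
; repeated once per matrix row.
(defn matrix-row-operation [operand row matrix]
  (operand matrix
    (fill (nrow matrix) row)))

; Shifts each column so its minimum is 0, then divides by the new maximum.
; Probably could be much nicer using `reduce`
(defn normalize [matrix]
  (let [shifted (matrix-row-operation minus ((extract min) matrix) matrix)]
   (matrix-row-operation div ((extract max) shifted) shifted)))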

(def normalized-data
  (normalize (to-matrix (sel data :cols [0 1]))))

(def normalized-plot (scatter-plot
            (sel normalized-data :cols 0)
            (sel normalized-data :cols 1)
            :x-label "Sepal Length"
            :y-label "Sepal Width"
            :group-by (sel data :cols 4)))

(defn -main [& args]
  (view normalized-plot))

I was kind of hoping the normalize function would have already been written for me in a standard library, but I couldn’t find it.
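
The arithmetic itself is small enough to sketch without Incanter. The following is a rough plain-Clojure version of the same min-max scaling, working on rows held as ordinary vectors. The function names and the sample numbers are my own inventions for illustration, and it assumes no column has all-identical values (that would divide by zero).

```clojure
(defn normalize-column
  "Rescales a sequence of numbers so the smallest maps to 0 and the
  largest maps to 1. Assumes the values are not all identical."
  [xs]
  (let [lo   (apply min xs)
        span (- (apply max xs) lo)]
    (map #(/ (- % lo) span) xs)))

(defn transpose
  "Turns a sequence of rows into a sequence of columns, and back again."
  [rows]
  (apply map vector rows))

(defn normalize-rows
  "Normalizes every column of a row-oriented data set."
  [rows]
  (transpose (map normalize-column (transpose rows))))

;; Hypothetical measurements, not taken from iris.dat.
(def sample-rows [[5.0 2.0]
                  [6.0 4.0]
                  [7.0 3.0]])

(normalize-rows sample-rows)
;; => ([0.0 0.0] [0.5 1.0] [1.0 0.5])
```

Transposing first means each column can be handled as a simple sequence, which avoids having to repeat a row of minimums to match the shape of the matrix.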

I’ll report back if anything else of interest comes up as I’m working through the book.

  1. Ben Atkin says:

    Nice writeup. I would run (sudo) brew install lein instead of brew install clojure (I use Cinderella and have homebrew in my home folder so I don't run it with sudo). Installing lein also installs clojure.

  2. Christopher says:

    I enjoyed this post, but I wanted to point out a couple of items in your final example that I think are useful to know.

    First, in the code above, you created a function called `fill` that takes a value and a number (n) and returns a sequence with that value repeated n times. Clojure already has a function that does exactly that called `repeat`.

    So, rather than doing:

    (fill (nrow matrix) row)

    you could just do:

    (repeat (nrow matrix) row)

    and drop the `fill` function completely.

    Second, you mentioned in the post that you were surprised there wasn't already a built-in way to do the normalization. Incanter also has the processing library, which is essentially a Clojure interface to Processing (http://processing.org/), and it has a function called `norm` that does exactly what you're looking for. The following code is basically just a refactoring of your code above, using the `norm` function from the processing library in place of your custom normalization code.

    (ns hooray-data.core
      (:gen-class)
      (:use (incanter core stats charts io))
      (:use [incanter.processing :only (norm)]))
    
    (def data (read-dataset
                    "https://raw.github.com/liebke/incanter/master/data/iris.dat"
                    :delim \space
                    :header true))
    
    (defn normalize
      "Normalizes a set of data"
      [data]
      (let [start     (apply min data)
            stop      (apply max data)]
        (map #(norm % start stop) data)))
    
    (def plot (scatter-plot
               (normalize (sel data :cols 0))
               (normalize (sel data :cols 1))
               :x-label "Sepal Length"
               :y-label "Sepal Width"
               :group-by (sel data :cols 4)))
    
    (defn -main [& args]
      (view plot))
    

    Hope that helps out. I enjoyed the post, keep up the good work.

  3. Xavier Shay says:

    Thanks Chris, great comment.
