Exploring data with Clojure, Incanter, and Leiningen
I’m working through Machine Learning in Action at the moment, and it’s done in Python. I don’t really know Python, but I’d prefer to learn Clojure, so I’m redoing the code samples.
This blog post shows how to read a CSV file, manipulate it, then graph it. Turns out Clojure is pretty good for this, in combination with the Incanter library (think R for the JVM). It took me a while to get an environment set up since I’m unfamiliar with basically everything.
Install Clojure
I already had it installed so can’t remember if there were any crazy steps to get it working. Hopefully this is all you need:
sudo brew install clojure
Install Leiningen
Leiningen is a build tool which does many things, but most importantly for me, it manages the classpath. I was jumping through all sorts of hoops trying to get Incanter running without it.
There are easy-to-follow instructions in the README.
*UPDATE:* As suggested in the comments, you can probably just `brew install lein` here and that will get you Leiningen and Clojure in one command.
Create a new project
lein new hooray-data && cd hooray-data
Add Incanter as a dependency to the project.clj file, and also a main target:
(defproject clj "1.0.0-SNAPSHOT"
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]
                 [incanter "1.2.3-SNAPSHOT"]]
  :main hooray_data.core)
Add some Incanter code to src/hooray_data/core.clj:
(ns hooray_data.core
  (:gen-class)
  (:use (incanter core stats charts io datasets)))

(defn -main [& args]
  (view (histogram (sample-normal 1000))))
Then fire it up:
lein deps
lein run
If everything runs to plan you’ll see a pretty graph.
Code
First, a simple categorized scatter plot. read-dataset works with both URLs and files, which is pretty handy.
(ns hooray_data.core
  (:use (incanter core stats charts io)))

; Sample data set provided by Incanter
(def plotData (read-dataset
                "https://raw.github.com/liebke/incanter/master/data/iris.dat"
                :delim \space
                :header true))

(def plot (scatter-plot
            (sel plotData :cols 0)
            (sel plotData :cols 1)
            :x-label "Sepal Length"
            :y-label "Sepal Width"
            :group-by (sel plotData :cols 4)))

(defn -main [& args]
  (view plot))
Second, the same data but normalized, so that each column is rescaled to run from 0 to 1. The graph will look the same apart from the axis scales, but the underlying data is now ready for some more math.
(ns hooray_data.core
  (:use (incanter core stats charts io)))

; Sample data set provided by Incanter
(def data (read-dataset
            "https://raw.github.com/liebke/incanter/master/data/iris.dat"
            :delim \space
            :header true))

; Returns a function that applies f to each column, giving one value per column
(defn extract [f]
  (fn [data]
    (map #(apply f (sel data :cols %)) (range 0 (ncol data)))))

; Repeats row n times
(defn fill [n row] (map (fn [x] row) (range 0 n)))

; Applies operand to the matrix and to row repeated once per matrix row
(defn matrix-row-operation [operand row matrix]
  (operand matrix
           (fill (nrow matrix) row)))

; Probably could be much nicer using `reduce`
; Subtracts each column's minimum, then divides by each shifted column's maximum
(defn normalize [matrix]
  (let [shifted (matrix-row-operation minus ((extract min) matrix) matrix)]
    (matrix-row-operation div ((extract max) shifted) shifted)))

(def normalized-data
  (normalize (to-matrix (sel data :cols [0 1]))))

(def normalized-plot (scatter-plot
                       (sel normalized-data :cols 0)
                       (sel normalized-data :cols 1)
                       :x-label "Sepal Length"
                       :y-label "Sepal Width"
                       :group-by (sel data :cols 4)))

(defn -main [& args]
  (view normalized-plot))
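For anyone who wants to see the arithmetic without Incanter in the way, here is a rough sketch of the same per-column min-max scaling in plain Clojure. The names `min-max-scale` and `scale-columns` and the sample rows are made up for illustration; they are not part of Incanter, and this is not a drop-in replacement for the matrix code above.

```clojure
;; Min-max scaling using only clojure.core.
;; `min-max-scale` and `scale-columns` are hypothetical helper names.

(defn min-max-scale
  "Rescales a sequence of numbers to the range [0, 1].
  A constant column (max equal to min) maps to all zeros to avoid dividing by zero."
  [xs]
  (let [lo   (apply min xs)
        span (- (apply max xs) lo)]
    (if (zero? span)
      (map (constantly 0.0) xs)
      (map #(double (/ (- % lo) span)) xs))))

(defn scale-columns
  "Applies min-max-scale to each column of a collection of equal-length rows."
  [rows]
  (let [columns (apply map vector rows)      ; transpose rows into columns
        scaled  (map min-max-scale columns)]
    (apply map vector scaled)))              ; transpose back into rows

;; Made-up sample rows: [sepal-length sepal-width]
(def sample-rows [[5.0 3.0] [6.0 2.0] [7.0 4.0]])

(scale-columns sample-rows)
;; => ([0.0 0.5] [0.5 0.0] [1.0 1.0])
```

Each column is handled independently, so the smallest value in a column becomes 0 and the largest becomes 1, which is what the shift-then-divide steps in `normalize` work out to.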
I was kind of hoping the normalize function would have already been written for me in a standard library, but I couldn’t find it.
I’ll report back if anything else of interest comes up as I’m working through the book.
August 02, 2011 at 8:57 AM
Nice writeup. I would run (sudo) brew install lein instead of brew install clojure (I use Cinderella and have homebrew in my home folder so I don't run it with sudo). Installing lein also installs clojure.
August 14, 2011 at 10:28 PM
I enjoyed this post, but I wanted to point out a couple of items in your final example that I think are useful to know.
First, in the code above, you created a function called `fill` that takes a value and a number (n) and returns a sequence with that value repeated n times. Clojure already has a function that does exactly that called `repeat`.
So, rather than doing `(fill (nrow matrix) row)`, you could just do `(repeat (nrow matrix) row)`.
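A quick REPL-style check of that equivalence, as a sketch: `fill` is copied from the post, while `sample-row` is a made-up stand-in for a row of column minimums.

```clojure
;; `fill` exactly as defined in the post
(defn fill [n row] (map (fn [x] row) (range 0 n)))

;; Hypothetical sample row, for illustration only
(def sample-row [4.3 2.0])

;; Both expressions yield the row repeated three times
(= (fill 3 sample-row)
   (repeat 3 sample-row))
;; => true
```

Note the argument order: `repeat` takes the count first and the value second, the same order `fill` uses.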
Second, you mentioned in the post that you were surprised there wasn't already a built-in way to do the normalization. I just wanted to point out that Incanter also has the processing library, which is essentially a Clojure interface to Processing (http://processing.org/), and it has a function called `norm` which does exactly what you're looking for. The following code is basically just a refactoring of your code above, using the `norm` function from the processing library in place of your custom normalization code.
(ns hooray-data.core
  (:gen-class)
  (:use (incanter core stats charts io))
  (:use [incanter.processing :only (norm)]))

(def data (read-dataset
            "https://raw.github.com/liebke/incanter/master/data/iris.dat"
            :delim \space
            :header true))

(defn normalize
  "Normalizes a set of data"
  [data]
  (let [start (apply min data)
        stop  (apply max data)]
    (map #(norm % start stop) data)))

(def plot (scatter-plot
            (normalize (sel data :cols 0))
            (normalize (sel data :cols 1))
            :x-label "Sepal Length"
            :y-label "Sepal Width"
            :group-by (sel data :cols 4)))

(defn -main [& args]
  (view plot))
Hope that helps out. I enjoyed the post, keep up the good work.
August 21, 2011 at 12:28 AM
Thanks Chris, great comment.