YAML Tutorial
Many years ago I wrote a tutorial on using YAML in Ruby. It still sees the most Google traffic of any post, by far. So people want to know about YAML? I’ll help them out.
What is YAML?
YAML is a flexible, human-readable file format that is ideal for storing object trees. YAML stands for “YAML Ain’t Markup Language”. It is easier for humans to read than JSON, and can contain richer metadata. It is far nicer than XML. There are libraries available for all mainstream languages including Ruby, Python, C++, Java, Perl, C#/.NET, JavaScript, PHP and Haskell. It looks like this:
    ---
    - name: Xavier
      country: Australia
      age: 24
    - name: Don
      country: US
That is a simple array of hashes. You can nest any combination of these simple data structures however you like. Most parsers will also detect the 24 as an integer. Quoting strings is optional, and was omitted in this example.
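To make the array-of-hashes claim concrete, here is a minimal sketch that parses the document above with Ruby's bundled YAML library. The variable names are mine, and `safe_load` is used on the assumption that you are on a reasonably recent Ruby.

```ruby
require 'yaml'

document = <<~DOC
  ---
  - name: Xavier
    country: Australia
    age: 24
  - name: Don
    country: US
DOC

# The top-level sequence becomes an Array, each mapping a Hash with String keys.
people = YAML.safe_load(document)

people.each do |person|
  # Don has no "age" key, so fetch with a fallback instead of indexing.
  puts "#{person['name']} (#{person['country']}), age: #{person.fetch('age', 'unknown')}"
end

# The unquoted 24 arrives as a number, not a string.
puts people.first['age'].is_a?(Integer)
```

Unquoted values are type-detected by the parser, so quote anything that must stay a string (version numbers, postcodes and the like).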
YAML allows you to add tags to your objects: extra metadata that your application can use to deserialize portions into complex data structures. For instance, in Ruby if you serialize a Set object it looks like this:
    # Set.new([1,2]).to_yaml
    --- !ruby/object:Set
    hash:
      1: true
      2: true
Notice that Ruby has added the ruby/object:Set tag so that the correct object can be instantiated on deserialization, while maintaining a human readable rendition of a set. These tags can be anything you like; Ruby just happens to use that particular format.
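The same tagging happens for your own classes. The sketch below uses a hypothetical `Point` class (not from the original post) and assumes a Psych version whose `safe_load` accepts the `permitted_classes:` keyword, which I believe means Ruby 2.6 or later.

```ruby
require 'yaml'

# Hypothetical class, used only to show the tag Ruby emits.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

yaml = Point.new(1, 2).to_yaml
puts yaml # first line carries the !ruby/object:Point tag

# safe_load refuses to build arbitrary objects unless the class is listed.
point = YAML.safe_load(yaml, permitted_classes: [Point])
puts point.x + point.y
```

The tag is what tells the loader which class to instantiate. Without it you would get back a plain hash.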
You can remove duplication from YAML files by using anchors (&) and aliases (*). You typically see this in configuration files, such as:
    defaults: &defaults
      adapter: postgres
      host: localhost

    development:
      database: myapp_development
      <<: *defaults

    test:
      database: myapp_test
      <<: *defaults
& sets up the name of the anchor (“defaults”), << means “merge the given hash into the current one”, and * includes the named anchor (“defaults” again). The expanded version looks like this:
    defaults:
      adapter: postgres
      host: localhost

    development:
      database: myapp_development
      adapter: postgres
      host: localhost

    test:
      database: myapp_test
      adapter: postgres
      host: localhost
Note that the defaults hash hangs around, even though it isn’t really required anymore.
YAML generators use this technique to correctly serialize repeated references to the same object, and even cyclic references. That’s pretty clever.
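Both directions can be tried from Ruby. This is a sketch under the assumption that your Psych version supports the `aliases:` keyword on `safe_load`. The hash names in the second half are made up for illustration.

```ruby
require 'yaml'

config_text = <<~CONFIG
  defaults: &defaults
    adapter: postgres
    host: localhost

  development:
    database: myapp_development
    <<: *defaults
CONFIG

# Loading: safe_load rejects aliases unless explicitly allowed.
config = YAML.safe_load(config_text, aliases: true)
development = config['development']
puts development['adapter']  # merged in from defaults
puts development['database'] # defined locally

# Dumping: one Hash object referenced twice is written once with an anchor,
# and the second reference becomes an alias to it.
shared = { 'host' => 'localhost' }
dumped = { 'primary' => shared, 'replica' => shared }.to_yaml
puts dumped
```

Note that the anchor is tied to object identity, not equality: two separate but equal hashes would be written out twice in full.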
Flow style
YAML has an alternate syntax called “flow style” that allows arrays and hashes to be written inline without having to rely on indentation, using square brackets and curly brackets respectively.
    ---
    # Arrays
    colors:
      - red
      - blue
    # in flow style...
    colors: [red, blue]

    # Hashes
    - name: Xavier
      age: 24
    # in flow style...
    - {name: Xavier, age: 24}
This has the curious effect of making YAML a superset of JSON. A valid JSON document is also a valid YAML document.
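A quick way to convince yourself is to feed the same data to the YAML parser in block style, flow style and as JSON. This is only a sketch with a deliberately simple document; Ruby's parser follows an older YAML revision, so unusual JSON may not behave identically.

```ruby
require 'yaml'
require 'json'

block_style = <<~DOC
  colors:
    - red
    - blue
DOC
flow_style = 'colors: [red, blue]'
json_text  = '{"colors": ["red", "blue"]}'

from_block = YAML.safe_load(block_style)
from_flow  = YAML.safe_load(flow_style)
from_json_via_yaml = YAML.safe_load(json_text) # JSON text, YAML parser
from_json  = JSON.parse(json_text)

puts from_block == from_flow
puts from_flow == from_json_via_yaml
puts from_json_via_yaml == from_json
```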
Performance
Given YAML’s richness and human readability, you would expect it to be slower than native serialization or JSON. This would be correct. My brief testing shows it is about an order of magnitude slower. For the typical configuration use-case, this is irrelevant, but worth keeping in mind if you are doing something crazy. Remember to run your own benchmarks that represent your specific need.
                             user     system      total        real
    Marshal serialize    0.090000   0.000000   0.090000 (  0.091822)
    Marshal deserialize  0.090000   0.000000   0.090000 (  0.092186)
    JSON serialize       0.480000   0.010000   0.490000 (  0.480291)
    JSON deserialize     0.130000   0.010000   0.140000 (  0.134860)
    YAML serialize       2.040000   0.020000   2.060000 (  2.065693)
    YAML deserialize     0.520000   0.010000   0.530000 (  0.526048)
    Psych serialize      2.530000   0.030000   2.560000 (  2.565116)
    Psych deserialize    1.510000   0.120000   1.630000 (  1.622601)
Curiously, the new YAML parser Psych included in Ruby 1.9.2 appears significantly slower than the old one. Not sure what is going on there.
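If you want to run your own numbers, something along these lines is a reasonable starting point. To be clear, this is not the script that produced the table above: the payload, the iteration count and the labels are all assumptions of mine, and it only measures whichever YAML engine your Ruby ships with.

```ruby
require 'benchmark'
require 'json'
require 'yaml'

# Hypothetical payload; substitute data shaped like your own.
data = Array.new(1_000) do |i|
  { 'name' => "user#{i}", 'age' => i % 90, 'active' => i.even? }
end

# Serialize once up front so the deserialize timings have input to work on.
marshalled = Marshal.dump(data)
json = JSON.generate(data)
yaml = YAML.dump(data)

iterations = 10

Benchmark.bm(20) do |x|
  x.report('Marshal serialize')   { iterations.times { Marshal.dump(data) } }
  x.report('Marshal deserialize') { iterations.times { Marshal.load(marshalled) } }
  x.report('JSON serialize')      { iterations.times { JSON.generate(data) } }
  x.report('JSON deserialize')    { iterations.times { JSON.parse(json) } }
  x.report('YAML serialize')      { iterations.times { YAML.dump(data) } }
  x.report('YAML deserialize')    { iterations.times { YAML.safe_load(yaml) } }
end
```

Treat the absolute times as meaningless outside your own machine. The ratios between the rows are the interesting part.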
Reading YAML from a file with Ruby
    require 'yaml'

    parsed = begin
      YAML.load(File.open("/tmp/test.yml"))
    rescue ArgumentError => e
      puts "Could not parse YAML: #{e.message}"
    end
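On Rubies where YAML is backed by Psych, malformed input is reported as `Psych::SyntaxError`, which as far as I know is not an `ArgumentError`, so the rescue above may not fire there. Here is a hedged sketch of the same idea for that case. The `load_yaml_file` helper and the temp files are my own scaffolding, there only to make the example self-contained.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical helper: returns the parsed document, or nil when the file
# is missing or does not contain valid YAML.
def load_yaml_file(path)
  YAML.safe_load(File.read(path))
rescue Psych::SyntaxError => e
  warn "Could not parse YAML: #{e.message}"
  nil
rescue Errno::ENOENT
  warn "No such file: #{path}"
  nil
end

good = Tempfile.new(['good', '.yml'])
good.write("name: Xavier\n")
good.close

bad = Tempfile.new(['bad', '.yml'])
bad.write("colors: [red, blue\n") # unclosed flow sequence
bad.close

good_result = load_yaml_file(good.path)
bad_result  = load_yaml_file(bad.path)

puts good_result['name']
puts bad_result.nil?
```

Reading the file with `File.read` also avoids leaving an open file handle behind.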
Writing YAML to a file with Ruby
    require 'yaml'

    data = {"name" => "Xavier"}
    File.open("path/to/output.yml", "w") {|f| f.write(data.to_yaml) }
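A round trip is the easiest way to check that what you wrote is what you meant. The sketch below writes into a temporary directory (my choice, so it can run anywhere) and uses `YAML.dump` with an IO as the second argument, which writes straight to the file.

```ruby
require 'yaml'
require 'tmpdir'
require 'fileutils'

data = { 'name' => 'Xavier', 'languages' => %w[ruby yaml] }

dir  = Dir.mktmpdir
path = File.join(dir, 'output.yml')

# Equivalent to f.write(data.to_yaml), without building the string first.
File.open(path, 'w') { |f| YAML.dump(data, f) }

written  = File.read(path)
restored = YAML.safe_load(written)

puts written
puts restored == data

FileUtils.remove_entry(dir) # clean up the temporary directory
```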
Anything else you’d like to know? Leave a comment.
Psych YAML in Ruby 1.9.2 with RVM and Snow Leopard OS X
Note that you must have libyaml installed before you compile Ruby, so this probably means you’ll need to recompile your current version.
    sudo brew install libyaml
    rvm install ruby-1.9.2 --with-libyaml-dir=/usr/local
    ruby -rpsych -e 'puts Psych.load("win: true")'
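Once Ruby is built, a short script can confirm which engine you ended up with. This is a sketch that assumes Psych loaded successfully. On 1.9.2 the `YAML` constant pointed at the older engine by default, so the first line may print false there, while on later Rubies I would expect true.

```ruby
require 'yaml'
require 'psych'

# Is the YAML constant the same module as Psych?
puts "YAML is Psych: #{YAML.equal?(Psych)}"

# Versions of the Ruby binding and of the libyaml it was compiled against.
puts "psych #{Psych::VERSION}, libyaml #{Psych::LIBYAML_VERSION}"

result = Psych.safe_load('win: true')
puts result['win']
```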
YAML in Ruby Tutorial
UPDATE 2011-01-31: I have posted a newer tutorial which is probably going to be more useful to you than this one: YAML Tutorial
So you’ve got all these tasty Ruby objects lying around in memory, and they’re going to be lost when your program ends. Such a tragic end. What’s a robot to do? Why, store them to disk in a language-agnostic format, of course! Enter YAML, a language perfectly suited to the task, more so than its heavier brethren, XML. YAML support comes built in to Ruby, and it couldn’t be easier to use. Every object automagically gets a to_yaml method that returns a string containing appropriate YAML markup once you require the right file.
    require 'yaml' # Assumed in future examples

    puts "hello".to_yaml
Of course this works for any object, using some of that oh-so-sweet reflection. to_yaml recursively calls itself on all of your instance variables, and even knows how to handle complex structures like arrays and hashes. It even copes with cyclic references! How’s that for value?
    class Square
      def initialize width, height
        @width = width
        @height = height
        @bonus = ['yo', {:msg => 'YAML 4TW'}]
        @me = self
      end
    end

    puts Square.new(2, 2).to_yaml
Now that you’ve got a handy YAML string you can do whatever you like with it: write it to disk, store it in a database, email it to your cousin Benny. But Benny is going to spin out – how does he reproduce your shiny Ruby objects? Thoughtfully, Ruby makes it just about as easy to create an object from YAML markup – in other words to go the other way. The YAML::load method takes either a string or an IO object and gives you back an object, ready to use. It’s worth noting that the initialize method is not called on the new object – a fact that will become pertinent later.
    serialized = Square.new(2, 2).to_yaml
    new_obj = YAML::load(serialized)
    # Square defines no reader for @width, so read the instance variable directly
    puts new_obj.instance_variable_get(:@width)
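The point about initialize not being called is easy to verify. The sketch below uses a hypothetical `Tracked` class that counts constructor calls, and loads through `safe_load` with `permitted_classes:`, on the assumption that you are on a Ruby recent enough to need that.

```ruby
require 'yaml'

# Hypothetical class that counts how often initialize runs.
class Tracked
  class << self
    attr_accessor :initialize_calls
  end
  self.initialize_calls = 0

  attr_reader :width

  def initialize(width)
    self.class.initialize_calls += 1
    @width = width
  end
end

original = Tracked.new(2)
copy = YAML.safe_load(original.to_yaml, permitted_classes: [Tracked])

puts copy.width               # state was restored...
puts Tracked.initialize_calls # ...but only the original ran initialize
```

The loader allocates the object and fills in its instance variables directly, which is why anything computed in initialize needs special handling.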
Transience
The YAML serializer works in essentially the same manner as a sledgehammer. There’s no finesse – it will serialize all of your instance variables. Always. This is generally not a problem, but every now and then for reasons of space, security, beauty or public health you will have a transient variable that you really just don’t want to be serialized. There is no neat way in the supplied library to do this. You could override to_yaml and blank out the transient fields before you call super, but then you need to restore them afterwards. And what if those fields were calculated on initialization – how do you restore them when the object is deserialized?
Not to worry, our gallant hero (yours truly) has created a helper script that allows you to specify which fields are to be persisted in a declarative manner using a class attribute.
    require 'rhnh/yaml_helper' # Assumed in future examples

    class Square
      persistent :width, :height

      def initialize width, height
        @width = width
        @height = height
        @me = self # @me will not be serialized
      end
    end
The script also provides a post_deserialize hook that is called, not surprisingly, after deserialization. It essentially acts as initialize for deserialized objects. No setup is necessary to use this hook; its mere presence will attract enough attention.
    class OnTheBall
      def post_deserialize
        puts "I'm awake!"
      end
    end

    YAML::load(OnTheBall.new.to_yaml)
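For readers on a Ruby that uses Psych, the engine has its own pair of hooks that cover similar ground: `encode_with` chooses what gets written, and `init_with` runs when the object is rebuilt. The sketch below is not the helper script described above, just an illustration of those hooks, with `@area` standing in for a field calculated on initialization.

```ruby
require 'yaml'

class Square
  attr_reader :width, :height, :area

  def initialize(width, height)
    @width = width
    @height = height
    @area = width * height # derived value, treated as transient here
  end

  # Called by Psych when dumping: only these keys are written out.
  def encode_with(coder)
    coder['width'] = @width
    coder['height'] = @height
  end

  # Called by Psych when loading, in place of initialize.
  def init_with(coder)
    @width = coder['width']
    @height = coder['height']
    @area = @width * @height # recompute the transient field
  end
end

yaml = Square.new(2, 3).to_yaml
puts yaml

copy = YAML.safe_load(yaml, permitted_classes: [Square])
puts copy.area
```

The trade-off against a declarative `persistent` line is verbosity: every class spells out its own fields twice.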
In closing
YAML is an excellent choice for serializing your Ruby objects. Its brevity and readability give it the edge over both XML and Marshal, and with the addition of YAML Helper it becomes more flexible as well.
YAML persistence
Fixed up my persistence code to not have to specify variables as an array, and committed my changes to CVS. Funny that on the day I got developer access to clxmlserial, I switched it out of my project in favour of YAML. Of course, I need to add a persistent attribute to that also, but it works a little differently from XML:
    class Object
      # Returns (creating if needed) the list of persistent fields for a class name
      def self._persist klass
        begin
          @@persist
        rescue
          @@persist = {}
        end
        @@persist[klass] = [] if !@@persist[klass]
        @@persist[klass]
      end

      # Walks up the superclass chain until a class with declared fields is found
      def self._persist_with_parent klass
        begin
          @@persist
        rescue
          @@persist = {}
        end
        p = nil
        while (!p) && klass
          p = @@persist[klass.to_s]
          klass = klass.superclass
        end
        p
      end

      # Class-level declaration: persistent :width, :height
      def self.persistent *var
        p = self._persist(self.to_s)
        for i in (0..var.length-1)
          var[i] = var[i].to_s
        end
        p.concat(var)
      end

      def to_yaml ( opts = {} )
        p = self.class._persist_with_parent(self.class)
        # p is nil when no class in the hierarchy declared persistent fields
        if p && p.size > 0
          YAML::quick_emit( object_id, opts ) do |out|
            out.map( taguri, to_yaml_style ) do |map|
              p.each do |m|
                map.add( m, instance_variable_get( '@' + m ) )
              end
            end
          end
        else
          # Nothing declared: emit every instance variable as usual
          YAML::quick_emit( object_id, opts ) do |out|
            out.map( taguri, to_yaml_style ) do |map|
              to_yaml_properties.each do |m|
                map.add( m[1..-1], instance_variable_get( m ) )
              end
            end
          end
        end
      end

      def save(filename)
        File.open( filename + '.yaml', 'w' ) do |out|
          YAML.dump( self, out )
        end
      end
    end
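The code above leans on the YAML emitter API of its day. For comparison, here is a rough sketch of the same declarative idea on top of Psych's `encode_with` hook. The `Persistent` module and its method names are hypothetical, and it is opt-in per class instead of patching Object, which is a different design choice from the original.

```ruby
require 'yaml'

# Hypothetical module, not the code above: a `persistent` class macro
# implemented with Psych's encode_with hook.
module Persistent
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def persistent(*names)
      persistent_fields.concat(names.map(&:to_s))
    end

    def persistent_fields
      @persistent_fields ||= []
    end

    # Use the nearest class in the ancestry that declared any fields.
    def effective_persistent_fields
      klass = self
      while klass.respond_to?(:persistent_fields)
        return klass.persistent_fields unless klass.persistent_fields.empty?
        klass = klass.superclass
      end
      []
    end
  end

  # Psych calls encode_with, when defined, instead of dumping every ivar.
  def encode_with(coder)
    fields = self.class.effective_persistent_fields
    # Nothing declared anywhere: fall back to all instance variables.
    fields = instance_variables.map { |ivar| ivar.to_s[1..-1] } if fields.empty?
    fields.each { |name| coder[name] = instance_variable_get("@#{name}") }
  end
end

class Square
  include Persistent
  persistent :width, :height

  def initialize(width, height)
    @width = width
    @height = height
    @me = self # not declared persistent, so it is left out
  end
end

# Inherits the declaration from Square, like _persist_with_parent does.
class ColoredSquare < Square; end

square_yaml  = Square.new(2, 2).to_yaml
colored_yaml = ColoredSquare.new(3, 3).to_yaml
puts square_yaml
puts colored_yaml
```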