Tuesday, December 29, 2009

Bit (not Byte) Manipulation in Ruby

I was recently tasked with creating a rough version of the Lempel-Ziv 77 encoder/decoder engine for use in most operating systems (i.e. Windows, Linux, Mac). The application would need to read a binary file and compress it or decompress it to another binary file. The compression algorithm involved a format bit specifying compressed or literal bytes to follow and then distance and length bits of instructions for compressed data. Such an application would clearly involve a good deal of bit manipulation and consequently require a solid bit manipulation library.

The logical language of choice for me was C++ because of its closeness to memory, inherent ease of bit manipulation, and presence on every computer since I was born. Unfortunately, I can probably barely compile a "hello, world!" application in C++ =( Next I considered Java since it's open source and present on most people's computers. However, my Java skillz have sadly dwindled since college to the point that I discarded that project in frustration about an hour after I started. Finally, I decided upon Ruby as my language of choice -- mainly because I like coding in Ruby.

My project got off to a good start until I realized that the original research I'd done on manipulating bits in Ruby had been incomplete. Ruby inherently manages characters and bytes synonymously, but bits are another story. I was building my bit patterns as integers, and when an integer is written to a file with <<, Ruby converts it with to_s behind the scenes, so what lands in the file is its decimal string representation. For example, 0xff was ending up as the three-character string "255" instead of a single byte when I was writing it to a file.

Finally, after much worrying, reading of documentation, online research, and irb investigation, I had an answer.
  • A byte can be written out bit by bit as a binary literal, e.g. 255 = 0b1111_1111 (the underscore is an optional visual separator, used here to group the bits in fours). This was important for me since I was doing a lot of shifting and didn't want to worry about the actual numerical values in my unit testing.
  • Bytes can be written explicitly to files in Ruby using the << operator along with Array#pack (the "C" directive packs each integer as one unsigned 8-bit value).
File.open("foo.txt", "wb+") { |f| f << [0xff].pack("C") }
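  • Bytes can be easily read using each_byte on an open File (IO#each_byte), which yields each byte as an integer.
  • The byte value of a given character can be accessed using "a"[0] in Ruby 1.8. In Ruby 1.9 and later that expression returns the one-character string "a", so use "a".ord or "a".getbyte(0) instead.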
  • Binary file manipulation on Windows must be done using the "b" flag when opening the file. Otherwise the file is opened in text mode, where certain bytes can be treated specially: a byte such as 0x1A (Ctrl-Z) may be taken as an end-of-file marker, and the remainder of the file is ignored. I learned this the hard way because each_byte would just inexplicably stop reading in bytes from my file before the file was finished.
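To tie the points above together, here is a minimal round-trip sketch. The helper names write_bytes and read_bytes are made up for illustration, and the code assumes a reasonably modern Ruby (Tempfile.create and String#ord are not available on 1.8).

```ruby
require "tempfile"

# Hypothetical helper: write an array of integers (0..255) to +path+ as raw bytes.
def write_bytes(path, bytes)
  # "wb" opens the file for binary writing; pack("C*") makes one unsigned byte per integer.
  File.open(path, "wb") { |f| f << bytes.pack("C*") }
end

# Hypothetical helper: read a file back as an array of integers, one per byte.
def read_bytes(path)
  result = []
  # "rb" reads the raw bytes; each_byte yields every byte as an Integer.
  File.open(path, "rb") { |f| f.each_byte { |b| result << b } }
  result
end

byte    = 0b1111_1111          # same value as 0xff and 255
shifted = (byte << 4) & 0xff   # mask back down to 8 bits: 0b1111_0000
sample  = [byte, shifted, 0x1a, "a".ord]

Tempfile.create("bits_demo") do |tmp|
  tmp.close                    # only the path is needed here
  write_bytes(tmp.path, sample)
  p read_bytes(tmp.path)       # => [255, 240, 26, 97]
  p File.size(tmp.path)        # => 4
end
```

The sample deliberately includes 0x1a, the byte that tends to cut a read short when the "b" flag is left off on Windows.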
After I had all of this figured out, Ruby proved to be a very nice environment for writing the app.
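For the curious, here is a rough sketch of how a format bit followed by literal or distance/length bits could be packed into bytes. It is not the code from my app: the BitWriter class, the encode_literal and encode_match helpers, and the field widths (8-bit literal, 12-bit distance, 4-bit length) are all assumptions chosen for illustration.

```ruby
# Accumulates individual bits and emits whole bytes, most significant bit first.
class BitWriter
  def initialize
    @bytes   = []  # completed bytes
    @current = 0   # bits collected for the byte in progress
    @count   = 0   # number of bits held in @current
  end

  # Append the low +width+ bits of +value+, most significant bit first.
  def write(value, width)
    (width - 1).downto(0) do |i|
      @current = (@current << 1) | ((value >> i) & 1)
      @count += 1
      flush_byte if @count == 8
    end
    self
  end

  # Pad the last partial byte with zero bits and return a binary string.
  def to_s
    bytes = @bytes.dup
    bytes << (@current << (8 - @count)) if @count > 0
    bytes.pack("C*")
  end

  private

  def flush_byte
    @bytes << @current
    @current = 0
    @count = 0
  end
end

# Assumed layout: format bit 0, then an 8-bit literal byte.
def encode_literal(writer, byte)
  writer.write(0, 1).write(byte, 8)
end

# Assumed layout: format bit 1, then a 12-bit distance and a 4-bit length.
def encode_match(writer, distance, length)
  writer.write(1, 1).write(distance, 12).write(length, 4)
end

w = BitWriter.new
encode_literal(w, "a".ord)   # bits: 0 01100001
encode_match(w, 5, 3)        # bits: 1 000000000101 0011
p w.to_s.unpack("C*")        # => [48, 192, 20, 192]
```

The 26 bits written here do not fill a whole number of bytes, so the last byte is padded with zeros. A real decoder would need some way to know where the data ends, which this sketch leaves out.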

1 comment:

  1. Thank you for the useful information.
