Recently, a coworker asked if I could help with manipulating some large CSV files. Each file contained approximately 1.2 million records and was over 200 megabytes in size. I had never worked with such large CSV files before, so I had no idea how long they would take to process.
I decided to start by writing a Ruby script that would parse and manipulate only a subset of the data. I used Ruby’s Benchmark module to time the run, then extrapolated from that measurement: processing the entire data set would take roughly 25 minutes.
My first attempt looked like this:
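The sketch below shows one way such a script could be structured. It is an illustration under stated assumptions rather than the original code: `transform`, `process_subset`, `estimate_total_seconds`, the file names, and the 10,000-row sample size are all hypothetical.

```ruby
require "benchmark"
require "csv"

# Hypothetical stand-in for the real manipulation: upcase every field.
def transform(row)
  row.map { |field| field.to_s.upcase }
end

# Stream at most `limit` rows from input_path, transform each one, and
# write it to output_path. Returns the number of rows written.
def process_subset(input_path, output_path, limit)
  written = 0
  CSV.open(output_path, "w") do |out|
    CSV.foreach(input_path) do |row|
      break if written >= limit
      out << transform(row)
      written += 1
    end
  end
  written
end

# Scale the time measured for a sample up to the full record count.
def estimate_total_seconds(sample_seconds, sample_rows, total_rows)
  sample_seconds * total_rows / sample_rows.to_f
end

INPUT_PATH = "input.csv"    # hypothetical file name
OUTPUT_PATH = "output.csv"  # hypothetical file name
SAMPLE_ROWS = 10_000        # assumed sample size
TOTAL_ROWS = 1_200_000      # approximate record count per file

if File.exist?(INPUT_PATH)
  elapsed = Benchmark.realtime do
    process_subset(INPUT_PATH, OUTPUT_PATH, SAMPLE_ROWS)
  end
  estimate = estimate_total_seconds(elapsed, SAMPLE_ROWS, TOTAL_ROWS)
  puts format("Estimated time for the full file: %.1f minutes", estimate / 60)
end
```

Streaming with `CSV.foreach` avoids loading a 200-megabyte file into memory at once, which is why the sketch uses it instead of `CSV.read`.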
Coincidentally, I had been reading about working with threads and processes in Ruby in the evenings. I thought that this would be a great opportunity to use some of my new knowledge.
According to Ruby’s documentation, Kernel.fork “creates a subprocess. If a block is specified, the block is run in the subprocess…” The documentation also points out that it should be used with Process.detach to avoid accumulating zombie processes.
Here is an example:
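A minimal sketch of that pattern, assuming a Unix-like system where `fork` is available. The `sleep` call is only a placeholder for real work, and the variable names are illustrative.

```ruby
# Kernel.fork is only available on platforms that support forking
# (Unix-like systems); it is not implemented on Windows.
pid = fork do
  # Everything inside this block runs in the child process.
  sleep 1
  puts "child #{Process.pid} finished"
end

# Process.detach starts a thread that reaps the child once it exits,
# so the child does not linger as a zombie. It returns that thread.
reaper = Process.detach(pid)

puts "parent #{Process.pid} carries on without waiting for child #{pid}"
```

If the parent does need the result later, calling `reaper.join` blocks until the child has exited, and `reaper.value` then holds its `Process::Status`.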
Reworking the Code
Integrating Kernel.fork into my existing solution was very simple. All I had to do was insert a strategically placed fork block into my code. As a result, I saved myself and my coworker several minutes of waiting.
Here’s what I ended up with:
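The following sketch illustrates the idea rather than reproducing the original script. It assumes the work is split so that each file is handled by its own child process; `transform`, `process_file`, and `process_in_parallel` are hypothetical names.

```ruby
require "csv"

# Hypothetical stand-in for the real manipulation: upcase every field.
def transform(row)
  row.map { |field| field.to_s.upcase }
end

# Stream one CSV file through transform and write the result.
def process_file(input_path, output_path)
  CSV.open(output_path, "w") do |out|
    CSV.foreach(input_path) { |row| out << transform(row) }
  end
end

# Fork one child per [input, output] pair so the files are processed
# concurrently. Each child is detached so it cannot become a zombie.
# Returns the reaper threads created by Process.detach.
def process_in_parallel(jobs)
  jobs.map do |input_path, output_path|
    pid = fork { process_file(input_path, output_path) }
    Process.detach(pid)
  end
end

# Example usage (file names are hypothetical):
#   reapers = process_in_parallel("a.csv" => "a_out.csv", "b.csv" => "b_out.csv")
#   reapers.each(&:join) # optional: block until every child has exited
```

Because each child is a separate process, the files are parsed in parallel on a multi-core machine, which is where the time saving comes from.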
You can achieve the same result in Bash using ampersands (&). In Bash, the ampersand is a control operator that runs the preceding command asynchronously in a subshell, that is, in the background. This means that if I were to modify the previous example to accept an input file and an output file as arguments, I could start one invocation per file from Bash, ending each command with & so that it runs in the background, and achieve similar results.