A Brief Introduction to Ruby's Kernel.fork
Recently, a coworker asked if I could help with manipulating some large CSV files. Each file contained approximately 1.2 million records and was over 200 megabytes in size. I’ve never worked with such large CSV files before, so I had no idea how long they would take to process.
I decided to start by writing a Ruby script that would parse and manipulate only a subset of the data. I used Benchmark to figure out how long it would take the program to run. I then did some quick calculations and the result was that it would take roughly 25 minutes to process the entire data set.
My first attempt looked like this:
require 'benchmark'
require 'csv'
# Read each record in sample.csv, manipulate it,
# and write it to sample.out.csv
bm = Benchmark.realtime do
File.open('sample.out.csv', 'w+') do |output|
CSV.foreach('sample.csv') do |input|
# Do some manipulations...
output.write(input.to_csv)
end
end
end
puts bm
Coincidentally, I had been reading about working with threads and processes in Ruby in the evenings. I thought that this would be a great opportunity to use some of my new knowledge.
About Kernel.fork
According to Ruby’s documentation, Kernel.fork
“creates a subprocess. If a block is specified, the block is run in the subprocess…” The documentation also points out that it should be used with Process.wait
or Process.detach
to avoid accumulating zombie processes.
Here is an example:
fork do
# Do stuff in a subprocess
end
# Wait for the subprocess(es) to finish
Process.wait
Reworking the Code
Integrating Kernel.fork
into my existing solution was very simple. All I had to do was insert a strategically placed fork
block into my code. The result being that I saved myself and my coworker several extra minutes of waiting.
Here’s what I ended up with:
require 'benchmark'
require 'csv'
bm = Benchmark.realtime do
['file1.csv', 'file2.csv', 'file3.csv'].each do |input_file|
fork do
output_file = input_file + '.out'
File.open(output_file, 'w+') do |output|
CSV.foreach(input_file) do |input|
# Do some manipulations...
output.write(input.to_csv)
end
end
end
end
end
puts bm
Bonus Tip
You can achieve the same result in Bash using ampersands (&). In Bash, the ampersand is a control operator used to fork processes. This means that if I were to modify the previous example to accept an input file and output file as arguments, I could then run the following command in Bash and achieve similar results:
ruby parse.rb in1.csv out1.csv & \
ruby parse.rb in2.csv out2.csv & \
ruby parse.rb in3.csv out3.csv &