Episode #190: Gsub


Episode Script

Greetings, citizen. Here at the Ministry of Public Truthiness, it is our sacred duty to provide government records to the American people, without compromising national security. This sometimes means we have to slightly alter documents before they are published, so as not to leak sensitive information.

Take this document for instance. It summarizes the history of US contact with certain… foreign entities. There are elements here which the general public cannot yet be privy to. It must be massaged before release.

TEXT = <<EOF
Since the early 1950s, United States officials have been in contact
with extraterrestrial life forms. Our first contact came after a group
of Grey aliens crash-landed near Roswell, New Mexico. Through them we
learned of the imminent threat of invasion by lizard people from
Sirius.  Since then we have worked with the Greys to develop an
Alien/Human hybrid capable of mind-control. As more and more UFO
sightings occur, it becomes increasingly difficult for
the men in black to control public awareness of the
extraterrestrial presence.
EOF

Fortunately, we have a lexicon of approved alternate wordings. Here it is. It is a hash of keys and values, with the keys being sensitive terms and the values being safe alternatives.

LEXICON = {
  "extraterrestrial"     => "CANADIAN",
  "life forms"           => "REPRESENTATIVES",
  "Grey aliens"          => "POLITE BACON RANCHERS",
  "the Greys"            => "OUR FRIENDS",
  "Roswell, New Mexico"  => "NIAGARA FALLS",
  "crash-landed"         => "DROPPED BY",
  "lizard people"        => "FOLK SINGERS",
  "Sirius"               => "SASKATCHEWAN",
  "Alien"                => "WILLIAM SHATNER",
  "Human"                => "MILEY CIRUS",
  "mind-control"         => "MAPLE SYRUP MANUFACTURE",
  "men in black"         => "MOUNTIES",
  "UFO"                  => "MOOSE"
}

We must use this lexicon to process documents before they are published.

The most obvious approach to this problem is the String#gsub method. It takes a pattern and a replacement string, and returns a copy of the string in which every match of the pattern has been replaced.

"lizard people".gsub(/lizard/, "party")
# => "party people"
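A few details of #gsub are easier to see in isolation. The pattern may be a plain string as well as a regular expression, every match is replaced rather than only the first, and the receiver itself is left untouched. This is a minimal sketch; the sample string is made up for illustration and is not part of the Ministry's document.

```ruby
# Hypothetical sample text, used only for this illustration.
report = "lizard people love lizard parties"

# A regex pattern replaces every match, not just the first.
by_regex = report.gsub(/lizard/, "party")
# => "party people love party parties"

# A string pattern is matched literally.
by_string = report.gsub("lizard", "party")
# => "party people love party parties"

# #sub, by contrast, replaces only the first match.
first_only = report.sub("lizard", "party")
# => "party people love lizard parties"

# The receiver is unchanged; #gsub returns a new string.
report
# => "lizard people love lizard parties"
```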

To perform all of these substitutions, we could just loop over the lexicon. On each iteration, we replace one source term with its alternate. At the end, we have a sanitized text, ready for publication.

require "./text"
require "./lexicon"

sanitized = TEXT
LEXICON.each do |term, alternate|
  sanitized = sanitized.gsub(term, alternate)
end
puts sanitized
# >> Since the early 1950s, United States officials have been in contact
# >> with CANADIAN REPRESENTATIVES. Our first contact came after a group
# >> of POLITE BACON RANCHERS DROPPED BY near NIAGARA FALLS. Through them we
# >> learned of the imminent threat of invasion by FOLK SINGERS from
# >> SASKATCHEWAN.  Since then we have worked with OUR FRIENDS to develop an
# >> WILLIAM SHATNER/MILEY CIRUS hybrid capable of MAPLE SYRUP MANUFACTURE. As more and more MOOSE
# >> sightings occur, it becomes increasingly difficult for
# >> the MOUNTIES to control public awareness of the
# >> CANADIAN presence.

Now, as a government agency, we try to be as efficient as possible. Stop laughing. That was not a joke.

Iteration is inefficient. We can, and will, do better.

Our first task is to combine the keys of the lexicon into a composite pattern which will match any of them. We can do this by joining the terms together with a pipe character between them. The pipe is the alternation operator in a regular expression. Then we convert the joined string to a regular expression object.

We pass this pattern to #gsub. Now we need to supply the replacement for each match.

One way to do this is to pass a block to #gsub, instead of a replacement string. The block will receive the matched string as an argument, and should return the replacement. In our case that just means using the match to look up the alternate text in our lexicon.

require "./text"
require "./lexicon"

terms = LEXICON.keys.join("|")
# => "extraterrestrial|life forms|Grey aliens|the Greys|Roswell, New Mexico|crash-landed|lizard people|Sirius|Alien|Human|mind-control|men in black|UFO"
pattern = Regexp.new(terms)
# => /extraterrestrial|life forms|Grey aliens|the Greys|Roswell, New Mexico|crash-landed|lizard people|Sirius|Alien|Human|mind-control|men in black|UFO/

sanitized = TEXT.gsub(pattern) { |match| LEXICON[match] }
puts sanitized

# >> Since the early 1950s, United States officials have been in contact
# >> with CANADIAN REPRESENTATIVES. Our first contact came after a group
# >> of POLITE BACON RANCHERS DROPPED BY near NIAGARA FALLS. Through them we
# >> learned of the imminent threat of invasion by FOLK SINGERS from
# >> SASKATCHEWAN.  Since then we have worked with OUR FRIENDS to develop an
# >> WILLIAM SHATNER/MILEY CIRUS hybrid capable of MAPLE SYRUP MANUFACTURE. As more and more MOOSE
# >> sightings occur, it becomes increasingly difficult for
# >> the MOUNTIES to control public awareness of the
# >> CANADIAN presence.
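The single-pass version can also behave differently from the loop, not just make fewer calls. In the loop, each #gsub runs over the output of the previous one, so an approved wording that happened to contain a later sensitive term would be rewritten a second time. A single #gsub does not rescan text it has already replaced. The two-entry lexicon below is hypothetical, chosen only to show the difference; the Ministry's actual lexicon does not appear to trigger it.

```ruby
# Hypothetical lexicon: the first replacement contains the second term.
chained = {
  "aliens" => "UFO pilots",
  "UFO"    => "MOOSE"
}
text = "aliens were seen"

# Loop: the second pass rewrites the output of the first.
looped = text
chained.each do |term, alternate|
  looped = looped.gsub(term, alternate)
end
looped
# => "MOOSE pilots were seen"

# Single pass: replaced text is not scanned again.
pattern = Regexp.new(chained.keys.join("|"))
single = text.gsub(pattern) { |match| chained[match] }
single
# => "UFO pilots were seen"
```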

Note that our terms contain no characters which are treated specially in regular expressions. If they did, we would need to escape them before combining them into a regex.

terms = LEXICON.keys.map { |k| Regexp.escape(k) }.join("|")
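Regexp.union offers a shorter route to the same kind of pattern: given an array of strings, it escapes each one and joins them with the alternation operator. A related point is that alternatives are tried left to right, so if one term were a prefix of another, the shorter one could win. Sorting longer terms first is one way to avoid that. The terms below are hypothetical, picked only to show both effects.

```ruby
# Hypothetical terms: one contains regex metacharacters,
# and "Grey" is a prefix of "Grey aliens".
lexicon = {
  "Grey"        => "POLITE",
  "Grey aliens" => "POLITE BACON RANCHERS",
  "Area 51 (?)" => "NIAGARA FALLS"
}
text = "Grey aliens at Area 51 (?)"

# Regexp.union escapes string arguments, so "(?)" is matched literally.
# In hash order, "Grey" is tried before "Grey aliens" and wins.
naive = Regexp.union(lexicon.keys)
naive_result = text.gsub(naive) { |match| lexicon[match] }
# => "POLITE aliens at NIAGARA FALLS"

# Putting longer terms first lets the more specific term match.
longest_first = Regexp.union(lexicon.keys.sort_by { |k| -k.length })
sorted_result = text.gsub(longest_first) { |match| lexicon[match] }
# => "POLITE BACON RANCHERS at NIAGARA FALLS"
```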

We are now using a single #gsub instead of calling it once for every proscribed term. But we can do better still.

Instead of a string or a block, #gsub can accept a hash to use as the replacement.

require "./text"
require "./lexicon"

terms = LEXICON.keys.join("|")
# => "extraterrestrial|life forms|Grey aliens|the Greys|Roswell, New Mexico|crash-landed|lizard people|Sirius|Alien|Human|mind-control|men in black|UFO"
pattern = Regexp.new(terms)
# => /extraterrestrial|life forms|Grey aliens|the Greys|Roswell, New Mexico|crash-landed|lizard people|Sirius|Alien|Human|mind-control|men in black|UFO/

sanitized = TEXT.gsub(pattern, LEXICON)
puts sanitized

# >> Since the early 1950s, United States officials have been in contact
# >> with CANADIAN REPRESENTATIVES. Our first contact came after a group
# >> of POLITE BACON RANCHERS DROPPED BY near NIAGARA FALLS. Through them we
# >> learned of the imminent threat of invasion by FOLK SINGERS from
# >> SASKATCHEWAN.  Since then we have worked with OUR FRIENDS to develop an
# >> WILLIAM SHATNER/MILEY CIRUS hybrid capable of MAPLE SYRUP MANUFACTURE. As more and more MOOSE
# >> sightings occur, it becomes increasingly difficult for
# >> the MOUNTIES to control public awareness of the
# >> CANADIAN presence.

In this case, #gsub looks up the matched text in the given hash, and uses the resulting value as the replacement.
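One consequence of this lookup deserves a quick sketch: the key is the matched text exactly as it appears in the input. If the pattern can match something that is not a key, the lookup returns nil and the match is replaced with an empty string. The case-insensitive flag below is an assumption made for illustration; the pattern built above is case-sensitive and only matches the keys themselves.

```ruby
# One-entry lexicon, reduced from the full one for illustration.
lexicon = { "UFO" => "MOOSE" }

# The case-insensitive pattern also matches "ufo", which is not a key,
# so that occurrence is replaced with an empty string.
dropped = "A ufo and a UFO".gsub(/ufo/i, lexicon)
# => "A  and a MOOSE"

# A block can normalize the match before the lookup and fall back
# to the original text when no entry exists.
kept = "A ufo and a UFO".gsub(/ufo/i) { |match| lexicon.fetch(match.upcase, match) }
# => "A MOOSE and a MOOSE"
```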

This final version is satisfactory. With this code, we will be able to publish more truthiness than ever before. Our people will be informed, our secrets will stay out of the hands of our enemies, and our skies will be kept safe from… folk singers.

Happy hacking, citizen!