Episode #249: Unindent

Episode Script

Today I want to talk about unindenting text. It's a little problem, and one that has been solved already in various ways. But I think that examining the different possible solutions may prove to be instructive.

First, let's lay out the problem. In episode #247, we learned about using heredocs for multiline strings. We used an example similar to this one:

module Wonderland
  JABBERWOCKY = <<-EOF
      'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
      All mimsy were the borogoves,
      And the mome raths outgrabe.

    -- From "jabberwocky", by Lewis Carroll
  EOF
end

In order to make our code neat and easy to read, we've indented the text of this poem inside the heredoc. Unfortunately, heredoc syntax does not ignore these leading spaces. It includes them in the resulting string, as we can see if we print the constant.

require "./wonderland"

puts Wonderland::JABBERWOCKY

# >>       'Twas brillig, and the slithy toves
# >>       Did gyre and gimble in the wabe;
# >>       All mimsy were the borogoves,
# >>       And the mome raths outgrabe.
# >>
# >>     -- From "jabberwocky", by Lewis Carroll

For some types of text, we might not care about this leading whitespace. For instance, if the heredoc contained HTML, extra whitespace probably wouldn't matter. But for plain text like this, we'd like a way to use an indented heredoc and then strip off the indentation before use.

Sadly, Ruby doesn't have a built-in method for this. And it turns out to be a slightly trickier problem than it might first appear.
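
A note for readers on newer Rubies: while there is no String method for this, Ruby 2.3 and later provide a "squiggly heredoc" syntax, <<~, which strips the common leading indentation when the heredoc is parsed. Here is a minimal sketch, assuming Ruby 2.3 or later. The constant name JABBERWOCKY_SQUIGGLY is made up for illustration and is not part of the episode's code.

```ruby
module Wonderland
  # <<~ removes the indentation of the least-indented non-blank line
  # (4 spaces here) from every line. Requires Ruby 2.3 or later.
  JABBERWOCKY_SQUIGGLY = <<~EOF
      'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
      All mimsy were the borogoves,
      And the mome raths outgrabe.

    -- From "jabberwocky", by Lewis Carroll
  EOF
end

puts Wonderland::JABBERWOCKY_SQUIGGLY

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
```

The rest of this episode still applies to older Rubies and to strings that don't come from a heredoc literal.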

One approach we might try is to simply remove the first 4 spaces from each line using gsub. Technically, this works. But it requires knowing exactly how many spaces the text is indented with. If we ever move the code around and re-indent to fit better in its new home, the unindent will be broken.

require "./wonderland"

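# Anchor the match to the start of each line so that runs of four
# spaces elsewhere in the text are left alone.
puts Wonderland::JABBERWOCKY.gsub(/^ {4}/, "")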

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

So we need a strategy that will adapt automatically to the indent of the text as it is found. This, again, is not quite as easy as it looks. A naive algorithm might use the indent of the first line to determine the number of characters to strip from every line. But this would mangle our sample text, because in the sample the last line is indented less than the first line.
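
To make the failure concrete, here is a small sketch of that naive first-line approach. The helper name naive_unindent and the shortened sample text are made up for illustration.

```ruby
# Hypothetical helper: measure the indent of the first line only,
# then strip that many characters from the start of every line.
def naive_unindent(text)
  indent = text[/\A[ \t]*/].size
  text.gsub(/^.{#{indent}}/, "")
end

# The first line is indented 6 spaces, the last only 4.
sample = "      'Twas brillig\n    -- From \"jabberwocky\"\n"
puts naive_unindent(sample)

# >> 'Twas brillig
# >>  From "jabberwocky"
```

Because the last line has only four spaces of indent, stripping six characters eats the leading dashes as well.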

One novel solution I've run across involves some special formatting in the indented text. By putting a pipe character at the beginning of each line to indicate the left margin, we make it possible to strip out any amount of indent with a little postprocessing. All we need to do is remove everything up to and including the first pipe character in each line in order to get our unindented text.

module Wonderland
  JABBERWOCKY = <<-EOF
    |  'Twas brillig, and the slithy toves
    |  Did gyre and gimble in the wabe;
    |  All mimsy were the borogoves,
    |  And the mome raths outgrabe.
    |
    |-- From "jabberwocky", by Lewis Carroll
  EOF
end

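# Match only leading whitespace before the margin pipe. A greedy /^.*\|/
# would strip up to the LAST pipe on a line that contains more than one.
puts Wonderland::JABBERWOCKY.gsub(/^[ \t]*\|/, "")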

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

But this makes it more difficult to reformat the text inside the heredoc, since we'd have to adjust the location of the pipe characters every time we did so. Let's put this solution aside, and look at how we might unindent text with no special preparation. A robust solution will have to find the smallest line indent in the whole string, and then unindent by that much. To determine the minimum indent, we might do something like this:

  • First, split the string into individual lines.
  • Next, map over the list of lines, finding the index of the first non-whitespace character in each line (if any). We last talked about the #index method in episode #224.
  • Some lines might only contain whitespace and thus produce a nil index. We don't care about blank lines, so we filter those results out with #compact.
  • Then we pick out the smallest index using #min.
  • We guard against the chance that no indents were found by explicitly converting the result to an integer. If the result is nil, this will turn it into a zero instead. Otherwise it has no effect.
  • Now that we know the minimum indent, we can use #gsub to replace that many characters at the start of each line with the empty string.

require "./wonderland"

s = Wonderland::JABBERWOCKY
nchars = s.split("\n") # => ["      'Twas brillig, and the slithy toves", "      Did gyre and gimble in the wabe;", "      All mimsy were the borogoves,", "      And the mome raths outgrabe.", "", "    -- From \"jabberwocky\", by Lewis Carroll"]
  .map{ |l| l.index(/\S/) } # => [6, 6, 6, 6, nil, 4]
  .compact                  # => [6, 6, 6, 6, 4]
  .min.to_i                 # => 4
puts s.gsub(/^.{#{nchars}}/, "")

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

This works, but it's a lot of code. Let's see if we can golf it down a bit, and maybe learn a few tricks along the way.

First off, instead of splitting the string on newlines, we can use String#scan to search through the original string for indents. (We met String#scan in episode #41.) To identify an indent, we use a regex which looks for spaces or tabs followed by some non-whitespace character. We wrap the non-whitespace character in a zero-width lookahead assertion so that it must be present but won't be part of the resulting match string.
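
As a quick illustration of what the lookahead buys us, here is a sketch comparing the same pattern with and without it. The sample string and variable names are made up for the comparison.

```ruby
# Two indented lines with a whitespace-only line between them.
text = "  foo\n   \n    bar\n"

# With the lookahead, \S must follow the indent but is not captured.
with_lookahead = text.scan(/^[ \t]+(?=\S)/)
with_lookahead    # => ["  ", "    "]

# Without it, the first non-whitespace character leaks into each match.
without_lookahead = text.scan(/^[ \t]+\S/)
without_lookahead # => ["  f", "    b"]
```

Either way, the whitespace-only line produces no match, because nothing non-whitespace follows its spaces.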

Once we have a list of indent strings, we can just map them to their sizes and pick the smallest. We don't anticipate any nils in the results of the scan, so we no longer need to compact the array first.

require "./wonderland"

s = Wonderland::JABBERWOCKY
nchars = s.scan(/^[ \t]+(?=\S)/) # => ["      ", "      ", "      ", "      ", "    "]
  .map(&:size)               # => [6, 6, 6, 6, 4]
  .min.to_i                  # => 4
puts s.gsub(/^.{#{nchars}}/, "")

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

Now let's make another pass. We assume that all lines will be indented with the same prefix: the same sequence of spaces, tabs or (horror of horrors) a mix of the two. Instead of working with indent sizes, we could just pick out a single minimal indent string to match on.

After episode #248, we know a quick way to find the shortest prefix: we can use #min_by with the :size method. And instead of removing a certain number of characters from the beginning of each line, we switch to replacing the indent prefix string that we picked out.

require "./wonderland"

s = Wonderland::JABBERWOCKY
prefix = s.scan(/^[ \t]+(?=\S)/) # => ["      ", "      ", "      ", "      ", "    "]
  .min_by(&:size)                # => "    "
puts s.gsub(/^#{prefix}/, "")

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

Are we done yet? Heck no! We are picking out the minimal string by size. But Ruby strings can be compared to each other directly, as well. And it just so happens that the way Ruby strings compare to each other, a shorter sequence of a character will always be considered “smaller” than a longer sequence of the same character.

["aaaa", "aa", "aaa"].sort      # => ["aa", "aaa", "aaaa"]

So if we assume that a given string will only use spaces or tabs for indents, we could just pick the “smallest” of the prefix strings; in other words, the #min string.

require "./wonderland"

s = Wonderland::JABBERWOCKY
prefix = s.scan(/^[ \t]+(?=\S)/) # => ["      ", "      ", "      ", "      ", "    "]
  .min                           # => "    "
puts s.gsub(/^#{prefix}/, "")

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
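
The assumption about a single indent character matters here. As a rough sketch (not from the episode, with made-up variable names), this is how plain string comparison behaves once tabs and spaces are mixed:

```ruby
space_indents = ["      ", "    "]
mixed_indents = ["\t\t", " "]

# With one repeated character, the shortest string is also the smallest.
smallest_space = space_indents.min
smallest_space    # => "    "

# A tab (byte 9) sorts before a space (byte 32), so the longer
# tab indent compares as "smaller" than the single space.
smallest_mixed = mixed_indents.min
smallest_mixed    # => "\t\t"

# Comparing by size still picks the shortest string.
shortest_mixed = mixed_indents.min_by(&:size)
shortest_mixed    # => " "
```

So the #min shortcut is only safe when every line is indented with the same character.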

At this point we can just inline the search for the minimum indent prefix into the gsub.

require "./wonderland"

s = Wonderland::JABBERWOCKY
puts s.gsub(/^#{s.scan(/^[ \t]+(?=\S)/).min}/, "")

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

And there we have it: a one-liner for unindenting multiline strings.

Now, like I said at the beginning, there are off-the-shelf solutions to this problem as well. If we are using a project that already includes ActiveSupport, we can use the #strip_heredoc method to accomplish the same end.

require "./wonderland"
require "active_support/core_ext"

puts Wonderland::JABBERWOCKY.strip_heredoc

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

There is also a standalone RubyGem called unindent which adds an unindent method to String.

require "./wonderland"
require "unindent"

puts Wonderland::JABBERWOCKY.unindent

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll

However, the technique that we've just worked through is behavior-compatible with both of these gem methods; it is equivalent or faster in execution time, and it has the shortest implementation of all of them.

To use this in our program, we could encapsulate it in a method and use it at the start of the heredoc. Our string constant is then correctly unindented right from the start.

def unindent(s)
  s.gsub(/^#{s.scan(/^[ \t]+(?=\S)/).min}/, "")
end

module Wonderland
  JABBERWOCKY = unindent(<<-EOF)
      'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
      All mimsy were the borogoves,
      And the mome raths outgrabe.

    -- From "jabberwocky", by Lewis Carroll
  EOF
end

puts Wonderland::JABBERWOCKY

# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
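
One edge case is worth checking before relying on this method: text in which some line has no indent at all. Because the scan pattern uses +, such a line contributes nothing to the list of indents, so the smallest indent found is stripped from the other lines even though the text was already flush left. The sketch below shows this, along with one possible adjustment: using * so that an empty indent can be found. The name unindent_allowing_flush_lines is made up for illustration.

```ruby
# The method from this episode, unchanged.
def unindent(s)
  s.gsub(/^#{s.scan(/^[ \t]+(?=\S)/).min}/, "")
end

# Hypothetical variant: `*` lets a zero-length indent match, so a line
# that starts at column zero makes the minimum indent the empty string.
def unindent_allowing_flush_lines(s)
  s.gsub(/^#{s.scan(/^[ \t]*(?=\S)/).min}/, "")
end

text = "flush left\n    indented\n"

unindent(text)                      # => "flush left\nindented\n"
unindent_allowing_flush_lines(text) # => "flush left\n    indented\n"
```

For heredocs indented to match the surrounding code, as in our example, both versions produce the same result.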

I'd like to extend a special thanks to Daniel Fone, who came up with the elegantly minimalist solution you see here. And that's it for today. Happy hacking!