Episode Script
Today I want to talk about unindenting text. It’s a little problem, and one that has been solved already in various ways. But I think that examining the different possible solutions may prove to be instructive.
First, let’s lay out the problem. In episode #247, we learned about using heredocs for multiline strings. We used an example similar to this one:
module Wonderland
  JABBERWOCKY = <<-EOF
      'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
      All mimsy were the borogoves,
      And the mome raths outgrabe.

    -- From "jabberwocky", by Lewis Carroll
  EOF
end
In order to make our code neat and easy to read, we’ve indented the text of this poem inside the heredoc. Unfortunately, heredoc syntax does not ignore these leading spaces. It includes them in the resulting string, as we can see if we print the constant.
require "./wonderland"
puts Wonderland::JABBERWOCKY
# >>       'Twas brillig, and the slithy toves
# >>       Did gyre and gimble in the wabe;
# >>       All mimsy were the borogoves,
# >>       And the mome raths outgrabe.
# >>
# >>     -- From "jabberwocky", by Lewis Carroll
For some types of text, we might not care about this leading whitespace. For instance, if the heredoc contained HTML, extra whitespace probably wouldn’t matter. But for plain text like this, we’d like a way to use an indented heredoc and then strip off the indentation before use.
Sadly, Ruby doesn’t have a method for this built-in. And it turns out to be a slightly trickier problem than it might first appear.
One approach we might try is to simply remove the first 4 spaces from each line using gsub. Technically, this works. But it requires knowing exactly how many spaces the text is indented with. If we ever move the code around and re-indent to fit better in its new home, the unindent will be broken.
require "./wonderland"
puts Wonderland::JABBERWOCKY.gsub(/^ {4}/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
So we need a strategy that will adapt automatically to the indent of the text as it is found. This, again, is not quite as easy as it looks. A naive algorithm might use the indent of the first line to determine the number of characters to strip from every line. But this would mangle our sample text, because in the sample the last line is indented less than the first line.
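To make that failure concrete, here is a minimal sketch of the naive approach. The helper name naive_unindent and the shortened sample text are ours, invented for illustration; they are not part of the episode's code.

```ruby
# Hypothetical naive helper: measures only the first line's indent and
# strips that many characters from the start of every line.
def naive_unindent(text)
  width = text[/\A[ \t]*/].size
  text.gsub(/^.{#{width}}/, "")
end

# First line indented by 6 spaces, last line by only 4.
sample = "      'Twas brillig\n    -- Lewis Carroll\n"

puts naive_unindent(sample)
# >> 'Twas brillig
# >>  Lewis Carroll
```

The second line loses its leading dashes, because six characters are removed even though only four of them are indentation.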
One novel solution I’ve run across involves some special formatting in the indented text. By putting a pipe character at the beginning of each line to indicate the left margin, we make it possible to strip out any amount of indent with a little postprocessing. All we need to do is remove everything up to and including the first pipe character in each line in order to get our unindented text.
module Wonderland
  JABBERWOCKY = <<-EOF
    |  'Twas brillig, and the slithy toves
    |  Did gyre and gimble in the wabe;
    |  All mimsy were the borogoves,
    |  And the mome raths outgrabe.
    |
    |-- From "jabberwocky", by Lewis Carroll
  EOF
end
# The non-greedy .*? stops at the first pipe on each line, so a pipe
# appearing later in the text is left alone.
puts Wonderland::JABBERWOCKY.gsub(/^.*?\|/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
But this makes it more difficult to reformat the text inside the heredoc, since we’d have to adjust the location of the pipe characters every time we did so. Let’s put this solution aside, and look at how we might unindent text with no special preparation. A robust solution will have to find the smallest line indent in the whole string, and then unindent by that much. To determine the minimum indent, we might do something like this:
- First, split the string into individual lines.
- Next, map through the list of lines and find the index of the first non-whitespace character in each line (if any). We last talked about the #index method in episode #224.
- Some lines might only contain whitespace and thus produce a nil index. We don’t care about blank lines, so we filter those results out with #compact.
- Then we pick out the smallest one using #min.
- We guard against the chance that no indents were found by explicitly converting the result to an integer. If the result is nil, this will turn it into a zero instead. Otherwise it has no effect.
- Now that we know the minimum indent, we can use #gsub to replace that many characters at the start of each line with the empty string.
require "./wonderland"
s = Wonderland::JABBERWOCKY
nchars = s.split("\n")        # => ["      'Twas brillig, and the slithy toves", "      Did gyre and gimble in the wabe;", "      All mimsy were the borogoves,", "      And the mome raths outgrabe.", "", "    -- From \"jabberwocky\", by Lewis Carroll"]
  .map{ |l| l.index(/\S/) }   # => [6, 6, 6, 6, nil, 4]
  .compact                    # => [6, 6, 6, 6, 4]
  .min.to_i                   # => 4
puts s.gsub(/^.{#{nchars}}/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
This works, but it’s a lot of code. Let’s see if we can golf it down a bit, and maybe learn a few tricks along the way.
First off, instead of splitting the string on newlines, we can use String#scan to search through the original string for indents. (We met String#scan in episode #41.) To identify an indent, we use a regex which looks for spaces or tabs followed by some non-whitespace character. We use a zero-length lookahead assertion around the non-whitespace character so that it must be found but won’t be part of the resulting match string.
Once we have a list of indent strings, we can just map them to their sizes and pick the smallest. We don’t anticipate any nils in the results of the scan, so we no longer need to compact the array first.
require "./wonderland"
s = Wonderland::JABBERWOCKY
nchars = s.scan(/^[ \t]+(?=\S)/)  # => ["      ", "      ", "      ", "      ", "    "]
  .map(&:size)                    # => [6, 6, 6, 6, 4]
  .min.to_i                       # => 4
puts s.gsub(/^.{#{nchars}}/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
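To see what the lookahead buys us, here is a small sketch comparing the scan pattern with and without it. The sample strings and variable names are made up for illustration.

```ruby
line = "    -- From \"jabberwocky\""

# With the lookahead, the non-whitespace character must be present
# but is not included in the match.
with_lookahead = line.scan(/^[ \t]+(?=\S)/)     # => ["    "]

# Without it, the first character of the text is swallowed too.
without_lookahead = line.scan(/^[ \t]+\S/)      # => ["    -"]

# A whitespace-only line yields no match at all, which is why the
# scan results never need compacting.
mixed = "   \n  x".scan(/^[ \t]+(?=\S)/)        # => ["  "]
```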
Now let’s make another pass. We assume that all lines will be indented with the same prefix: the same sequence of spaces, tabs or (horror of horrors) a mix of the two. Instead of working with indent sizes, we could just pick out a single minimal indent string to match on.
After episode #248, we know a quick way to find the shortest prefix: we can use #min_by with the :size method. And instead of removing a certain number of characters from the beginning of each line, we switch to replacing the indent prefix string that we picked out.
require "./wonderland"
s = Wonderland::JABBERWOCKY
prefix = s.scan(/^[ \t]+(?=\S)/)  # => ["      ", "      ", "      ", "      ", "    "]
  .min_by(&:size)                 # => "    "
puts s.gsub(/^#{prefix}/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
Are we done yet? Heck no! We are picking out the minimal string by size. But Ruby strings can be compared to each other directly, as well. And it just so happens that the way Ruby strings compare to each other, a shorter sequence of a character will always be considered “smaller” than a longer sequence of the same character.
["aaaa", "aa", "aaa"].sort # => ["aa", "aaa", "aaaa"]
So if we assume that a given string will use only spaces or only tabs for its indents, never a mix of the two, we could just pick the “smallest” of the prefix strings; in other words, the #min string.
require "./wonderland"
s = Wonderland::JABBERWOCKY
prefix = s.scan(/^[ \t]+(?=\S)/)  # => ["      ", "      ", "      ", "      ", "    "]
  .min                            # => "    "
puts s.gsub(/^#{prefix}/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
At this point we can just inline the search for the minimum indent prefix into the gsub.
require "./wonderland"
s = Wonderland::JABBERWOCKY
puts s.gsub(/^#{s.scan(/^[ \t]+(?=\S)/).min}/, "")
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
And there we have it: a one-liner for unindenting multiline strings.
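Before relying on the one-liner, it may help to check a few edge cases. The sketch below wraps it in a helper named strip_common_indent; that name and the tiny sample strings are ours, chosen only for illustration.

```ruby
# Hypothetical wrapper around the one-liner; the name is an assumption.
def strip_common_indent(s)
  s.gsub(/^#{s.scan(/^[ \t]+(?=\S)/).min}/, "")
end

# No indented lines at all: scan finds nothing, min is nil, and the
# interpolated pattern becomes /^/, so the text comes back unchanged.
strip_common_indent("a\nb\n")            # => "a\nb\n"

# Blank lines are ignored when picking the prefix and are left alone.
strip_common_indent("    a\n\n  b\n")    # => "  a\n\nb\n"

# A line with no indent at all is invisible to the scan (the pattern
# needs at least one space or tab), so the relative indent of "a" is
# lost in this case.
strip_common_indent("  a\nb\n")          # => "a\nb\n"
```

The last case only matters for text where some line already sits at column zero, which is unusual for an indented heredoc but worth knowing about.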
Now, like I said at the beginning, there are off-the-shelf solutions to this problem as well. If we are using a project that already includes ActiveSupport, we can use the #strip_heredoc method to accomplish the same end.
require "./wonderland"
require "active_support/core_ext"
puts Wonderland::JABBERWOCKY.strip_heredoc
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
There is also a standalone RubyGem called unindent which adds an unindent method to String.
require "./wonderland"
require "unindent"
puts Wonderland::JABBERWOCKY.unindent
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
However, the technique that we’ve just worked through is behavior-compatible with both of these gem methods; it is equivalent or faster in execution time, and it has the shortest implementation of all of them.
To use this in our program, we could encapsulate it in a method and use it at the start of the heredoc. Our string constant is then correctly unindented right from the start.
def unindent(s)
  s.gsub(/^#{s.scan(/^[ \t]+(?=\S)/).min}/, "")
end
module Wonderland
  JABBERWOCKY = unindent(<<-EOF)
      'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
      All mimsy were the borogoves,
      And the mome raths outgrabe.

    -- From "jabberwocky", by Lewis Carroll
  EOF
end
puts Wonderland::JABBERWOCKY
# >>   'Twas brillig, and the slithy toves
# >>   Did gyre and gimble in the wabe;
# >>   All mimsy were the borogoves,
# >>   And the mome raths outgrabe.
# >>
# >> -- From "jabberwocky", by Lewis Carroll
I’d like to extend a special thanks to Daniel Fone, who came up with the elegantly minimalist solution you see here. And that’s it for today. Happy hacking!
This can be updated to mention the “squiggly heredoc” feature available since Ruby 2.3. It’s well described at: http://www.virtuouscode.com/2016/01/06/about-the-ruby-squiggly-heredoc-syntax/
Congrats for your contribution to the Ruby language syntax 🙂
This doesn’t actually support mixed tabs & spaces.
["\t\t\t", " "].min # => "\t\t\t"
Best not to suggest that it could, unless you want to golf tab expansion as well 😉