I ran into an interesting problem yesterday. At some point, while scripting updates to a collection of repos, I managed to duplicate a few lines in several files. I ended up with something like this:
README.md:
# `dotfiles-role-javascript`
…
I figured there had to be an easy solution using awk, so I grabbed the first SO thread I saw and ran with it.
$ awk '!seen[$0]++' README.md
This one’s particularly opaque. Before using it, let’s see how it works.
- seen[$0] creates an entry in the seen associative array whose key is the current line, $0.
- seen isn’t a magic array; it’s just an easy convention. qqq[$0] achieves the same results.
- x++ post-increments the value. The expression evaluates to the old value, and the stored value increases immediately afterward.
- !x negates the value of the expression that follows it. When the result is false, awk skips the line instead of printing it.
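To make the pieces above concrete, here is a small sketch. The sample input and the variable names (input, with_seen, with_qqq, post) are made up for illustration; only the awk expressions come from the post.

```shell
# Hypothetical sample input: three lines, one of them repeated.
input=$(printf 'a\nb\na\n')

# The same filter with two different array names; the name is arbitrary.
with_seen=$(printf '%s\n' "$input" | awk '!seen[$0]++')
with_qqq=$(printf '%s\n' "$input" | awk '!qqq[$0]++')

# Post-increment: y receives the old value of x, then x grows by one.
post=$(awk 'BEGIN { x = 5; y = x++; print y, x }')

printf '%s\n' "$with_seen"   # prints a, then b
printf '%s\n' "$with_qqq"    # same two lines
printf '%s\n' "$post"        # prints 5 6
```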
Normally awk prints every line for which the pattern is true. In this script, the first time awk sees a line, seen[$0] is uninitialized, which counts as 0 in a numeric context. Post-increment binds more tightly than negation, so the increment is evaluated first: it yields the old value (0) and then stores 1. The negation turns that 0 into true, and awk prints the line. On every later occurrence the stored count is nonzero, the negation is false, and the line is skipped. With the grouping spelled out, the expression is:
!(seen[$0]++)
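One way to watch this happen is to print the counter before and after each line. This is only a sketch: the four-line input and the names trace, before, and keep are assumptions made for the example.

```shell
# Hypothetical input: "a" appears three times, "b" once.
trace=$(printf 'a\nb\na\na\n' | awk '{
    before = seen[$0] + 0      # count before this line is processed
    keep = !seen[$0]++         # same expression as the one-liner
    print $0, before, seen[$0], (keep ? "print" : "skip")
}')

printf '%s\n' "$trace"
# a 0 1 print
# b 0 1 print
# a 1 2 skip
# a 2 3 skip
```

The second column is the value the negation actually sees, and the third is what is stored once the post-increment has taken effect.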
As clever as it is, it’s got some major flaws, especially for my use case:
$ awk '!seen[$0]++' README.md > tmp.md
tmp.md:
# `dotfiles-role-javascript`
…
It’s removed the duplicate lines, but also every empty line after the first and some necessary repeated elements (see how the second fenced block lost its fence). The whitespace is pretty easy to get back; empty lines won’t have any fields, so NF (Number of Fields) will be 0. We can have awk print a line when NF is 0 or the line hasn’t been seen. In other words,
$ awk '!NF || !seen[$0]++' README.md > tmp2.md
tmp2.md:
# `dotfiles-role-javascript`
…
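A small stand-in file shows both the fix and the flaw that remains. The sample below is hypothetical (it uses ~~~ fences so it can sit inside a fenced block), as are the names sample, fields, and kept.

```shell
# Hypothetical README-like input: a heading and two fenced blocks.
sample=$(printf '# title\n\n~~~\nfoo\n~~~\n\n~~~\nbar\n~~~\n')

# NF per line: blank lines report 0 fields.
fields=$(printf '%s\n' "$sample" | awk '{ print NF }' | paste -sd' ' -)

# Blank lines now survive, but the repeated fences are still dropped.
kept=$(printf '%s\n' "$sample" | awk '!NF || !seen[$0]++')

printf '%s\n' "$fields"   # prints 2 0 1 1 1 0 1 1 1
printf '%s\n' "$kept"
# # title
#
# ~~~
# foo
#
# bar
```

Because !NF is true for blank lines, the || short-circuits and those lines never touch the seen array at all.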
As it turns out, I need a few lines to appear more than once, so common awk solutions don’t work very well. My problem is really centered around evaluating each line against its immediate neighbors.
$ awk 'BEGIN{ old = "" } { new = $0 } old == new && old != "" { next } { old = $0; print }' README.md > tmp3.md
- BEGIN{ old = "" } seeds old once, before any input is read, rather than at each line.
- { new = $0 } runs on each line, updating the value of new.
- old == new && old != "" will be true only if the lines are equal and nonempty.
- { next } fires if the conditional is true, skipping immediately to the next record (i.e. not printing the second, duplicated line).
- { old = $0; print } updates the value of old and passes the line on to stdout.
tmp3.md:
# `dotfiles-role-javascript`
…
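Here is the same script run against a made-up sample, with the rules spread over several lines. The input and the names sample, deduped, and with_uniq are assumptions for illustration. The standard uniq utility is included only for comparison: it also collapses adjacent duplicates, but it treats blank lines like any other line.

```shell
# Hypothetical input: adjacent duplicates ("# title", "foo"), two blank
# lines in a row, and fences that legitimately repeat.
sample=$(printf '# title\n# title\n\n\n~~~\nfoo\nfoo\n~~~\n\n~~~\nbar\n~~~\n')

deduped=$(printf '%s\n' "$sample" | awk '
    BEGIN { old = "" }                  # runs once, before any input
    { new = $0 }                        # current line
    old == new && old != "" { next }    # skip adjacent nonempty repeats
    { old = $0; print }                 # remember and print everything else
')

# For comparison: uniq also merges the two adjacent blank lines.
with_uniq=$(printf '%s\n' "$sample" | uniq)

printf '%s\n' "$deduped"
```

On this input the awk version keeps both blank lines and every fence, and drops only the second "# title" and the second "foo".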
I’m still pretty new to awk scripting, so there might be a better way to do this. If there is, I’d love to know about it!