Remove Consecutive Duplicate Lines With awk

I ran into an interesting problem yesterday. At some point, while scripting updates to a collection of repos, I managed to duplicate a few lines in several files. I ended up with something like this:

README.md

# `dotfiles-role-javascript`
# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)

...

Finally, these variables must be set:

```yml
author_name
author_email
author_url
```

## Dependencies

```yml
---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
```

I figured there had to be an easy solution using awk, so I grabbed the first SO thread I saw and ran with it.

$ awk '!seen[$0]++' README.md

This one’s particularly opaque. Before using it, let’s see how it works.

seen[$0] looks up the entry in the seen associative array whose key is the current line, $0, creating it if it doesn’t exist yet. seen isn’t a magic array; it’s just an easy convention. qqq[$0] achieves the same results.
x++ post-increments the value. That means the expression evaluates to the old value, and the stored value increases by one immediately afterward.
!x negates its operand: zero or empty becomes 1 (true), and anything else becomes 0 (false). Here it turns the old count into “print only if this line hasn’t been seen before.”

A pattern with no action prints every line for which the pattern is true. Post-increment binds more tightly than !, so the pattern is parsed as !(seen[$0]++). The first time awk sees a line, seen[$0] is uninitialized, which counts as 0 in a numeric context. The post-increment yields that 0 and then bumps the stored count to 1; negating 0 gives 1, so the line is printed. On any later occurrence the old count is at least 1, the negation gives 0, and the line is skipped.

!(seen[$0]++)
!(0)
1
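
To watch that evaluation order in action, here’s a small sketch that prints, for each line, the count before the test, the value of the pattern, and the count afterward. The three-line input and the variable names before, keep, and trace are made up for illustration.

```shell
# Trace how awk evaluates !seen[$0]++ on a tiny made-up input.
# Output columns: line, count before, pattern value, count after.
trace=$(printf 'a\nb\na\n' | awk '{
    before = seen[$0] + 0    # count so far (0 for a line not seen yet)
    keep = !seen[$0]++       # parsed as !(seen[$0]++)
    print $0, before, keep, seen[$0]
}')
printf '%s\n' "$trace"
# a 0 1 1
# b 0 1 1
# a 1 0 2
```

The second a is the only line whose pattern value is 0, which is why it’s the only one the one-liner drops.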

As clever as it is, it’s got some major flaws, especially for my use case:

$ awk '!seen[$0]++' README.md > tmp.md

tmp.md

# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)
...
Finally, these variables must be set:
```yml
author_name
author_email
author_url
```
## Dependencies
---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git

It’s removed the duplicate lines, but also the empty lines and some necessary repeated elements (see how the second fenced block lost both of its fences). The whitespace is pretty easy to get back: empty lines don’t have any fields, so NF, the number of fields, is 0. We can tell awk to print a line when NF is 0 or the line hasn’t been seen. In other words,

$ awk '!NF || !seen[$0]++' README.md > tmp2.md

tmp2.md

# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)

...

Finally, these variables must be set:

```yml
author_name
author_email
author_url
```

## Dependencies

---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
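
The same effect shows up on a much smaller input. This is only a sketch: the word FENCE is a made-up stand-in for the README’s code-fence lines, and a and b stand in for ordinary content.

```shell
# Blank lines survive because !NF is true for them, but every
# repeat of a nonblank line is dropped, adjacent or not.
out=$(printf '%s\n' FENCE a a FENCE '' FENCE b FENCE |
    awk '!NF || !seen[$0]++')
printf '%s\n' "$out"
# FENCE
# a
#
# b
```

Only the first FENCE makes it through, so the second block ends up with no fences at all, just like tmp2.md.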

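The blank lines are back, but the second fenced block is still missing its fences. As it turns out, I need a few lines to appear more than once, so the common awk solutions, which drop every repeat anywhere in the file, don’t work very well. My problem is really centered around comparing each line against the one immediately before it.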

$ awk 'BEGIN{ old = "" } { new = $0 } old == new && old != "" { next } { old = $0; print }' README.md > tmp3.md

BEGIN{ old = "" } runs once, before any input is read, and seeds old with an empty string
{ new = $0 } runs on every line, updating the value of new
old == new && old != "" is true only when the current line equals the previous one and is nonempty
{ next } is fired if the conditional is true, skipping immediately to the next record (i.e. not printing the second, duplicated line)
{ old = $0; print } will update the value of old and pass the line on to stdout
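
Spread over several lines, the same program is a little easier to read. This is just a sketch of the one-liner above: the wrapper name dedupe_adjacent and the sample input are made up, and FENCE again stands in for a code-fence line.

```shell
# Hypothetical wrapper around the one-liner; reads files or stdin.
dedupe_adjacent() {
    awk '
        BEGIN { old = "" }                  # runs once, before any input
        { new = $0 }                        # runs on every line
        old == new && old != "" { next }    # nonempty repeat of the previous line: skip
        { old = $0; print }                 # otherwise remember the line and print it
    ' "$@"
}

out=$(printf '%s\n' FENCE a a FENCE '' '' FENCE b FENCE | dedupe_adjacent)
printf '%s\n' "$out"
```

Only the second a is dropped. Every FENCE survives because no two of them are adjacent, and both blank lines survive because of the old != "" check.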

tmp3.md

# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)

...

Finally, these variables must be set:

```yml
author_name
author_email
author_url
```

## Dependencies

```yml
---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
```
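
For comparison, the standard uniq utility also drops adjacent repeats, but it treats blank lines like any other line. Here’s a quick sketch of the difference on a made-up input with two consecutive blank lines; the variable names awk_out and uniq_out are only for illustration.

```shell
# The awk script keeps runs of blank lines; uniq collapses them.
awk_out=$(printf '%s\n' a a '' '' b |
    awk 'BEGIN{ old = "" } { new = $0 } old == new && old != "" { next } { old = $0; print }')
uniq_out=$(printf '%s\n' a a '' '' b | uniq)

printf 'awk:\n%s\n' "$awk_out"      # a, two blank lines, b
printf 'uniq:\n%s\n' "$uniq_out"    # a, one blank line, b
```

If a file never has two blank lines in a row, the two should give the same result; the awk version only matters when runs of blank lines need to be preserved.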

I’m still pretty new to awk scripting, so there might be a better way to do this. If there is, I’d love to know about it!

CJ Harries