n8henrie/treecount.sh

## treecount.sh
Well, this ended up being easier to implement than I'd expected, using nothing but awk, grep, and sort.

Wrapped it up into a little script that sorts by count and removes anything with only 1 result (like files).

It should be pretty easy to also add in a `du -sh` to get sizes if one wanted. Currently it runs in <2s on that 500,000-line file on my M1 Mac. Sharing in case it's useful for anyone else.
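As a rough sketch of the `du -sh` idea (not part of the script itself): the hypothetical `add_sizes` function below assumes it is fed the script's `count path` lines on stdin, and it skips any path that is no longer a directory on disk.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of treecount.sh.
# Reads "count path" lines on stdin and prints "count<TAB>size<TAB>path"
# for every path that currently exists as a directory.
add_sizes() {
  local count path size
  while read -r count path; do
    # Paths from old snapshots may be gone by now; skip those (and files).
    [[ -d ${path} ]] || continue
    # du prints "size<TAB>path"; keep only the size column.
    size=$(du -sh "${path}" 2>/dev/null | cut -f1)
    printf '%s\t%s\t%s\n' "${count}" "${size}" "${path}"
  done
}
```

Usage would be something like `./treecount.sh changes.txt | add_sizes`. Running `du` once per directory will be much slower than the counting step on a large list, so it probably makes sense to `tail` the output down to the top entries first.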

```bash
#!/usr/bin/env bash
# treecount.sh https://gist.github.com/a7c3b48eb971f662c03e9da17ecb9ea4
#
# Given an input file of paths as $1, counts the number of subfiles for each
# directory. Useful for determining which directories change most frequently
# and might be good candidates for exclusion from restic backups (like caches
# that don't have a `CACHEDIR.TAG`).
#
# USAGE: `$ ./treecount.sh changes.txt`
#
# changes.txt should be a list of file paths without duplicates (`sort -u` is
# your friend) and no other content. For my use case, I use `restic snapshots`
# to get a list of snapshots, and with a little processing run `restic diff`
# on each of those snapshots to get a list of modified files. I then keep only
# the lines that start with `-`, `+`, or `M` (which indicate removals,
# additions, and modifications, respectively) and deduplicate the resulting
# output.
#
# By default anything with fewer than 2 results is not included in the output
# of this script.
#
# inb4: I know the grep and sort could be done in awk, but grep and sort sure
# make it easy, don't they?

set -Eeuf -o pipefail
shopt -s inherit_errexit

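main() {
  local infile=$1
  # Count every ancestor prefix of each path, drop entries seen only once
  # (individual files), and sort ascending by count.
  awk < "${infile}" -F/ '{
      path=""
      # Field 1 is empty for absolute paths, so start at field 2.
      for (idx=2; idx<=NF; idx++) {
        path = path "/" $idx
        paths[path]++
      }
    }
    END {
      for (path in paths) {
        print paths[path], path
      }
    }' |
    grep -v '^1 ' |
    sort -n
}
main "$@"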
```
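
The input preparation described in the header comment could look roughly like the sketch below. The function name `extract_changed_paths` is made up for illustration, and it assumes each change line in the `restic diff` output is a single status character (`-`, `+`, or `M`), then whitespace, then an absolute path; anything else, such as summary lines, is dropped.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of treecount.sh.
# Reads `restic diff`-style output on stdin and prints a deduplicated list
# of the changed paths.
# Assumption: change lines look like "<status><whitespace></absolute/path>".
extract_changed_paths() {
  # -n plus the p flag prints only lines where the substitution matched,
  # which discards everything that is not a -, + or M line.
  sed -n -E 's#^[-+M][[:space:]]+(/.*)$#\1#p' | sort -u
}
```

With diff output collected from several snapshot pairs in a file, say a hypothetical `diffs.txt`, `extract_changed_paths < diffs.txt > changes.txt` would produce the input file that `treecount.sh` expects.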