Data Wrangling

PrettyMeng, Wed Nov 03 2021 • code

`sed` + regular expressions

sed 's/.*Disconnected from //'
- s/REGEX/SUBSTITUTION/
  - s stands for substitution
  - REGEX is the pattern you want to match
  - SUBSTITUTION is the text that replaces the matched text
- A tricky case: Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]
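  - Here someone tried to log in as a user named "Disconnected from"
  - * and + are greedy: they match as much text as they can, so .*Disconnected from matches up to the last "Disconnected from " on the line and the username is removed as well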
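
To see the greedy behaviour concretely, here is a small sketch that feeds the log line from the tricky case through the same sed command. The variable names (line, greedy) are only for illustration.

```shell
# The log line from the tricky case above.
line='Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]'

# .* is greedy, so it runs up to the LAST "Disconnected from " on the line.
# The username (which is literally "Disconnected from") is deleted too.
greedy=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')

printf '%s\n' "$greedy"
# prints: 46.97.239.16 port 55920 [preauth]
```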
Pass -E to sed to use extended regular expressions, so that special characters such as ( and ) keep their special meaning without a \ in front of them
Capture groups
- the text matched by a part of the regex surrounded by parentheses is stored in a numbered capture group; refer to it in the substitution as \1, \2, ...
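
A possible way to use capture groups on the tricky log line is sketched below. The pattern is an assumption about the log format (a username followed by an address, `port`, a number, and an optional `[preauth]`); it is not taken from these notes, so adjust it to your own logs.

```shell
line='Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]'

# Match the WHOLE line and replace it with capture group 2, the username.
# Group 1 is the optional "invalid " / "authenticating " prefix,
# group 3 is the optional " [preauth]" suffix.
user=$(printf '%s\n' "$line" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/')

printf '%s\n' "$user"
# prints: Disconnected from
```

Because the pattern requires `user` right after the first "Disconnected from ", the greedy `.*` at the start can no longer run on to the second occurrence.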

sort sorts its input
uniq -c collapses consecutive identical lines into a single line, prefixed with the number of occurrences (so sort the input first)
paste -sd, joins all lines into one line (-s), separated by the single character given with -d (here a comma)
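
A minimal sketch of these three tools working together. The usernames are made up for illustration, and the trailing `-` tells paste to read standard input (some versions require it).

```shell
# Hypothetical usernames, one per line.
users=$(printf '%s\n' root admin root test root admin)

# sort puts identical lines next to each other so uniq -c can count them.
counts=$(printf '%s\n' "$users" | sort | uniq -c)
printf '%s\n' "$counts"

# Distinct names joined into one comma-separated line.
joined=$(printf '%s\n' "$users" | sort | uniq | paste -sd, -)
printf '%s\n' "$joined"
# prints: admin,root,test
```

The exact amount of leading whitespace in the `uniq -c` output differs between implementations, so treat the count column as whitespace-delimited.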

awk '{print $2}' prints the second field of each line; fields are separated by whitespace by default, and the delimiter can be changed with -F
awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l uses a pattern: the block runs only on lines whose first field equals 1 and whose second field matches the regular expression (starts with c and ends with e). wc -l then counts the lines that match
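
The two awk commands can be tried on a small made-up input, as in the sketch below. The data is hypothetical, in the "count name" shape that `uniq -c` would produce.

```shell
# Hypothetical "count name" lines.
data=$(printf '%s\n' '1 code' '2 cake' '1 come' '1 dice')

# -F changes the field delimiter: here fields are split on ":".
field=$(printf 'a:b:c\n' | awk -F: '{print $2}')
printf '%s\n' "$field"
# prints: b

# Count lines whose first field is 1 and whose second field
# starts with c and ends with e ("code" and "come").
# tr strips the padding that some wc implementations add.
matches=$(printf '%s\n' "$data" |
  awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l | tr -d ' ')
printf '%s\n' "$matches"
# prints: 2
```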

awk as a programming language

BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }
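
One way to run the program above is to pass it to awk as a single quoted string, sketched here on made-up input. BEGIN runs once before any input is read, the middle rule runs for every matching line, and END runs once after the last line.

```shell
# Hypothetical "count name" lines.
data=$(printf '%s\n' '1 code' '2 cake' '1 come' '1 dice')

total=$(printf '%s\n' "$data" | awk '
BEGIN { rows = 0 }                           # before the first line
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }   # for each matching line
END { print rows }                           # after the last line
')

printf '%s\n' "$total"
# prints: 2
```

Since every matching line has 1 in its first field, adding $1 gives the same number as counting the matching lines with wc -l.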