Mastering Data Wrangling on the Command Line: From System Logs to Visual Insights


Source: YouTube video by Missing Semester (video ID: sz_dsktIjt4)

What Is Data Wrangling?

Data wrangling is the process of converting raw data from one format into another that is easier to analyse. In a Unix‑like environment this often means taking text streams (log files, CSVs, command output) and shaping them with small, composable tools.

A Real‑World Example: Mining SSH Login Attempts

  1. Source data – The lecture uses the system journal (journalctl) from a Linux server in the Netherlands.
  2. Initial filtering – grep ssh extracts only lines that mention SSH connections.
  3. Remote vs. local processing – Instead of pulling the whole log over the network, the same pipeline (journalctl | grep ssh | less) is executed on the remote host via SSH, sending back only the relevant lines.
  4. Saving for reuse – The filtered output is stored locally (ssh user@host "journalctl | grep ssh" > ssh.log) so subsequent analysis works on a static file.
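The collect-and-filter stage can be rehearsed locally. The journal lines below are made-up stand-ins for real journalctl output, so only the grep step mirrors the lecture exactly:

```shell
# Fake journal lines (hypothetical stand-ins for real journalctl output)
printf '%s\n' \
  'Jan 17 srv sshd[4151]: Disconnected from invalid user admin 46.97.239.16 port 55920' \
  'Jan 17 srv kernel: eth0: link becomes ready' \
  'Jan 17 srv sshd[4152]: Disconnected from invalid user root 95.190.211.12 port 49202' \
  > journal.txt

# The same filter the remote side would run; only the sshd lines survive
grep ssh journal.txt > ssh.log
grep -c ssh journal.txt
# 2
```

Once ssh.log exists locally, every later step runs against a static file instead of re-querying the server.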

Cleaning the Log with sed

  • sed -e 's/.*disconnected from //' -e 's/ disconnected from.*//' removes timestamps, hostnames, and the constant phrase disconnected from, leaving just the usernames.
  • Regular expressions (.*, +, ?, []) let you match any character, repeat patterns, or create optional parts. The -r (or -E) flag enables extended syntax, reducing the need for backslashes.
  • Capture groups () store parts of the match for later reuse (\1, \2). They are essential when you need to keep the username while discarding surrounding text.
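A minimal sketch of capture groups on one sample line (the log line itself is made up, but the pattern follows the lecture's approach):

```shell
line='Jan 17 srv sshd[4151]: Disconnected from invalid user admin 46.97.239.16 port 55920'

# -E enables extended syntax, so ( ) group without backslashes.
# Group 1 soaks up the optional "invalid " prefix; group 2 captures the
# username; \2 in the replacement replays only what group 2 matched.
echo "$line" | sed -E 's/.*Disconnected from (invalid )?user ([^ ]+) .*/\2/'
# admin
```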

Debugging Regular Expressions

  • Online regex debuggers visualise matches, highlight capture groups, and show why a pattern fails on edge cases (e.g., usernames that contain the word disconnected).
  • Greedy vs. non‑greedy quantifiers (* vs. *?) control how much of the line is consumed.

Aggregating Results with Classic Unix Tools

| Tool | Purpose | Example in the lecture |
| --- | --- | --- |
| `wc -l` | Count lines | `wc -l ssh.log` → 198 000 attempts |
| `sort` | Order lines (numeric `-n`, key/column `-k`) | `sort -nrk1,1` |
| `uniq -c` | Collapse duplicates and count them | `uniq -c` after sorting |
| `awk` | Column-oriented processing, arithmetic, conditionals | `awk '{print $2}'` to print usernames; `awk 'NR%2==0'` for every second line |
| `paste -sd,` | Join lines with a delimiter | Create a comma-separated list of top usernames |
| `bc` | Command-line calculator | `echo "1+2" \| bc` |
| `plot` | Quick histogram from a stream | Visualise frequency of top usernames |
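These tools compose into the lecture's counting pipeline. The username list here is a made-up sample standing in for the sed output:

```shell
# Hypothetical usernames, as they would come out of the sed stage
printf '%s\n' root admin root guest root admin > users.txt

# Count each name, rank by count, keep the two busiest
sort users.txt | uniq -c | sort -nk1,1 | tail -n2 | awk '{print $1, $2}'
# 2 admin
# 3 root

# paste turns the ranked names into one comma-separated line
sort users.txt | uniq -c | sort -nk1,1 | awk '{print $2}' | paste -sd, -
# guest,admin,root
```

The first sort groups duplicates so uniq -c can count them; the second sort orders by the count column.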

Advanced Filtering with awk

  • awk '$1==1 && $2 ~ /^C.*e$/' prints the lines whose count (field 1) is exactly one and whose username (field 2) starts with C and ends with e.
  • awk 'BEGIN{cnt=0} {cnt++} END{print cnt}' replicates wc -l inside a single awk script, useful when you already have an awk pipeline.
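Both bullets can be checked on a small uniq -c-style sample (the count/name pairs below are invented):

```shell
# "count name" pairs, as produced by uniq -c (made-up sample)
printf '%s\n' '1 Candide' '2 root' '1 Claire' '1 guest' > counts.txt

# Names seen exactly once that start with C and end with e
awk '$1 == 1 && $2 ~ /^C.*e$/ {print $2}' counts.txt
# Candide
# Claire

# Counting inside awk, replicating wc -l
awk 'BEGIN {cnt = 0} {cnt++} END {print cnt}' counts.txt
# 4
```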

Turning Lists into Command‑Line Arguments with xargs

  • xargs reads whitespace‑separated items and appends them to a command. Example: removing old Rust toolchains.
rustup toolchain list | grep nightly | sed 's/ (default)//' | xargs -n1 rustup toolchain uninstall
  • This eliminates tedious copy‑paste and demonstrates how data wrangling can automate system administration tasks.
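A safe way to rehearse the xargs pattern is a dry run, with echo standing in for the destructive command (the toolchain names are invented):

```shell
# echo prints the command xargs would run instead of executing rustup
printf '%s\n' nightly-2023-01-01 nightly-2023-06-01 |
  xargs -n1 echo rustup toolchain uninstall
# rustup toolchain uninstall nightly-2023-01-01
# rustup toolchain uninstall nightly-2023-06-01
```

-n1 runs the command once per input item; drop the echo only after the dry run prints exactly what you intend.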

Working with Binary Streams

  • ffmpeg can read from a device (/dev/video0) and output a single frame to stdout (-f image2 -vframes 1 -).
  • convert (ImageMagick) reads the raw image from stdin, converts it to grayscale, and writes to stdout (-).
  • By chaining ffmpeg | convert | gzip | ssh remote "cat > frame.png.gz", you can capture, transform, compress, and transfer binary data without ever writing intermediate files.
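The same stream-chaining works with any binary-safe tool; here gzip stands in for the heavier ffmpeg and convert stages to show a lossless round trip with no intermediate files:

```shell
# Bytes in, compressed bytes through the pipe, original bytes out
printf 'hello webcam' | gzip | gzip -d
# hello webcam
```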

Putting It All Together – A Typical Workflow

  1. Collect – Pull raw data (logs, sensor output, command output).
  2. Filter – Use grep, sed, or awk to keep only the relevant rows.
  3. Transform – Strip unwanted fields, extract identifiers, or reformat dates.
  4. Aggregate – Sort, uniq -c, or awk to compute counts, sums, averages.
  5. Visualise – Pipe numeric results to plot, gnuplot, or a quick awk‑paste‑bc calculation.
  6. Act – Feed the final list into xargs or another script to perform automated actions (e.g., block abusive usernames).
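The whole workflow fits in one pipe. The three log lines below are made-up samples, but each stage maps onto a step above:

```shell
printf '%s\n' \
  'sshd: Disconnected from invalid user admin 1.2.3.4 port 22' \
  'sshd: Disconnected from invalid user root 5.6.7.8 port 22' \
  'sshd: Disconnected from invalid user root 9.9.9.9 port 22' |
  sed -E 's/.*user ([^ ]+) .*/\1/' |   # transform: keep only the username
  sort | uniq -c | sort -nk1,1 |       # aggregate: count and rank
  awk '{print $2}' | paste -sd, -      # summarise: one comma-separated line
# admin,root
```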

Why Learn These Tools?

  • They are always available on any Unix‑like system – no extra libraries needed.
  • Each tool does one thing well, and together they form powerful pipelines that can handle text, numbers, and even binary streams.
  • Mastery of regular expressions and stream editors (sed, awk) dramatically reduces the time spent writing ad‑hoc scripts in higher‑level languages.

Tips for Getting Started

  • Start with simple patterns (grep "error" file.log).
  • Incrementally add sed or awk transformations, testing each step with less or head.
  • Use man <tool> and online regex testers to explore options.
  • Keep a notebook of useful one‑liners – they become a personal toolbox.

Exercises Suggested in the Lecture

  • Extract usernames from a system log and list the top 20 attackers.
  • Compute how many distinct usernames attempted a login.
  • Automate removal of old Rust toolchains using xargs.
  • Capture a webcam frame, convert it to grayscale, and store it on a remote server.

What Comes Next?

The next lecture will shift focus to command‑line environments (shell configuration, scripting, and environment management). Mastering the data‑wrangling techniques above will make those topics much easier to absorb.

Data wrangling on the command line turns raw, noisy streams into actionable insights by chaining simple, purpose‑built tools—making complex analysis possible without writing full programs.


