I have a side project that it’s kinda abandoned. I am moving it from Google Cloud to Microsoft Azure and adding some new features to it. It’s an excellent way to learn more about Azure and practice some skills. This project is about helping you find a movie to watch. In this effort it requires ingesting some IMDB datasets. The datasets can be found here. The datasets are TSV files (tab separated).
In the past I used Python (and Java in another occassion) to merge two of those sets and transform them into a CSV file. In this occassion I dropped the merging, I just wanted to transform them in CSV and fast! Somehow I wasn’t in the mood repurposing my Python script for this transformation. Plus I wanted something to quickly run from my command line. So I stumbled upon an answer on StackOverflow that explained how to do this with AWK. And it did. It transformed a file with 1.230.953 lines in a second or so. I am still in shock.
Here’s the script:
awk 'BEGIN { FS="\t"; OFS="," } {
rebuilt=0
for(i=1; i<=NF; ++i) {
if ($i ~ /,/ && $i !~ /^".*"$/) {
gsub("\"", "\"\"", $i)
$i = "\"" $i "\""
rebuilt=1
}
}
if (!rebuilt) { $1=$1 }
print
}' file.tsv > file.csv
I’m not gonna lie, I have little idea what’s going on here. I know that a lot of this script is redundant for my usecase. I know there is a simpler way to do it like so:
awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
but that’s not the point. The point is that I feel like a B class Indiana Jones of computer science. I feel like I discovered an ancient artifact and I need to know more about it.
So yeah, I’ll be soon doing some AWK tutorials. Because this feels like a very sharp tool that I need to have in my toolchain. Not sure what to do with it yet but I’m sure I’ll find out.