AWK: a precursor to RegEx

over 2 years ago


AWK is a text-processing language created for early versions of Unix. You can think of it as a forerunner of the way regular expressions (RegEx) are used today: you write short scripts that search the lines of a file for a text pattern and filter the ones you want.

The first version appeared in 1977 as a scripting language for text processing. It extended what shell scripts could do, and it went on to influence several other languages, such as Perl, Lua and later versions of the Unix shell.

The language was substantially revised in the 1980s, gaining features such as user-defined functions and regular expressions built at run time.

As the original Unix gave way to GNU/Linux, the BSDs and other variants, the original AWK was left behind. Several AWK implementations appeared, mainly to keep the scripts of existing Unix users working. The most popular are:

BWK: the oldest of these interpreters, with direct involvement from AWK's original creators, mainly Brian Kernighan. It is the version used on FreeBSD, OpenBSD, NetBSD and macOS;
GAWK: GNU AWK is the most popular interpreter today. It comes by default with most Linux distros and is available in the repositories of almost all the others. It is still widely used in the community for matching text patterns, and it adds extensions such as TCP/IP networking;
TAWK: Thompson AWK is an AWK compiler for DOS, Solaris, OS/2 and Windows. It was formerly sold by Thompson Automation Software, and today you can get it for free from the official website. Although a Windows version is offered, it is only officially compatible up to Windows XP, so bugs may occur on later versions of Windows;
AWKA: an AWK compiler that translates the script into C and then compiles it, so long scripts can run considerably faster than they would under an interpreter. It is compatible with GAWK 3.1.0, including its native functions for TCP/IP network interfaces and the like. You can download it from the official website.

Hello World

There are two ways to run AWK. One is to write the program inline, all on one line, and execute it directly from the terminal. The other is to put the whole program in a file and execute that file.

Directly in the terminal

First, you can run it directly in the terminal. Run the following command:

awk 'BEGIN{print "Hello World!"}'

This prints Hello World! in the terminal. Simple, isn't it?

In a file

Save the following code in a file named hello.awk:

BEGIN {
    print "Hello World!";
}

You can run it in two ways. First, pass the file to awk with the -f option:

awk -f hello.awk

And that's it, you'll have Hello World written on the screen.

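In the second (and most common) way, you add a hashbang line at the very top of the file, before the code. The path must match where gawk is installed on your system (on many distros it is /usr/bin/gawk):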

#!/bin/gawk -f

This tells the system which interpreter to use to run the script. Save the file and run the following command:

chmod +x hello.awk

This adds execute permission to the hello.awk file. Now just run:

./hello.awk

And that's it, you'll have Hello World written on the screen.
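
The interpreter path in the hashbang differs between systems, so a hard-coded path may not work everywhere. As a minimal sketch (assuming a POSIX shell and any awk on the PATH, not necessarily gawk), the steps above can be reproduced by looking the path up with command -v and writing it into the file:

```shell
# Find the awk interpreter on this system (the path differs between distros).
AWK_PATH="$(command -v awk)"

# Write hello.awk with a hashbang that points at the interpreter we found.
printf '#!%s -f\nBEGIN {\n    print "Hello World!";\n}\n' "$AWK_PATH" > hello.awk

# Add execute permission and run the script directly.
chmod +x hello.awk
./hello.awk
```

If you specifically want GNU AWK, replace awk with gawk in the command -v lookup.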

Practical use - Listing only specific data

Let's work with the gender_submission.csv file from a Kaggle Titanic dataset (available here). Let's start by printing all the lines of the file:

cat gender_submission.csv | awk '{print $0;}'

For this file you will need to change the default field separator. By default AWK splits each line on whitespace (spaces and tabs), but our file uses the CSV convention, which is a comma. How do we change it? Simple: we use the BEGIN block. The BEGIN block is executed once, before any input is read, while the block that follows it is executed once per line. So let's set the FS variable, which defines the field separator, right before the rest of the code runs:

cat gender_submission.csv | awk 'BEGIN{FS=",";} {print $0;}'

Want to know if it worked? How about placing two arrows between one field and another?

cat gender_submission.csv | gawk 'BEGIN{FS=",";}{print $1 " → → " $2}'
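
If you do not have the Kaggle file at hand, the effect of FS can be seen on a small stand-in. The sketch below is only an illustration: sample.csv and its rows are hypothetical data in the same two-column layout, and plain awk is assumed to be any POSIX awk.

```shell
# Hypothetical sample in the same two-column layout as gender_submission.csv.
printf 'PassengerId,Survived\n892,0\n893,1\n894,0\n896,1\n' > sample.csv

# Default separator: the lines contain no whitespace, so $1 is the whole line and $2 is empty.
awk '{print "[" $1 "] [" $2 "]"}' sample.csv

# Comma separator set in BEGIN: $1 and $2 are now the two columns.
awk 'BEGIN{FS=","} {print "[" $1 "] [" $2 "]"}' sample.csv

# The -F option on the command line is shorthand for the same assignment.
awk -F, '{print "[" $1 "] [" $2 "]"}' sample.csv
```

The brackets make an empty field visible, which is an easy way to check whether the separator took effect.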

Our aim here is to list only the IDs, and only for the people who survived. How can we do that? By adding an if conditional. When field 2 is equal to 1, the passenger survived, and in that case we print the ID on the screen. Our code looks like this:

cat gender_submission.csv | gawk 'BEGIN{FS=",";}{if ($2 == 1) print($1);}'

And that's it, you'll have a list of desired IDs. Simple, no?
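
AWK also lets you put the condition in front of the block instead of writing an if inside it. The sketch below shows that pattern-action form on a hypothetical sample.csv (made-up rows in the same layout as the Kaggle file); survivors.txt is likewise just an illustrative name.

```shell
# Hypothetical sample in the same two-column layout as gender_submission.csv.
printf 'PassengerId,Survived\n892,0\n893,1\n894,0\n896,1\n' > sample.csv

# Pattern-action form: the block runs only on lines where the pattern is true.
awk -F, '$2 == 1 {print $1}' sample.csv

# NR is the current line number, so NR > 1 skips the header line explicitly.
awk -F, 'NR > 1 && $2 == 1 {print $1}' sample.csv > survivors.txt
cat survivors.txt
```

The header is already excluded by $2 == 1, since the text Survived is not equal to 1, but the NR > 1 test states that intent openly.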

You can also send the output to a file:

(cat gender_submission.csv | gawk 'BEGIN{FS=",";}{if ($2 == 1) print($1);}') \
>> titanic_survivors_id.txt

Note that I put the output redirection on a separate line in the shell to improve readability. Because >> appends to the file, running the command again adds the same IDs a second time.
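
The difference between the two redirection operators is easy to check for yourself. This is a small sketch on hypothetical data (sample.csv and ids.txt are illustrative names), which also shows that awk can read the file directly, without cat or the parentheses:

```shell
# Hypothetical sample in the same two-column layout as gender_submission.csv.
printf 'PassengerId,Survived\n892,0\n893,1\n894,0\n896,1\n' > sample.csv
rm -f ids.txt

# awk reads the file named after the program, so cat and the subshell are optional.
# '>' empties the output file first, so running the command twice leaves two lines.
awk -F, '$2 == 1 {print $1}' sample.csv > ids.txt
awk -F, '$2 == 1 {print $1}' sample.csv > ids.txt
wc -l < ids.txt

# '>>' appends, so one more run brings the file to four lines.
awk -F, '$2 == 1 {print $1}' sample.csv >> ids.txt
wc -l < ids.txt
```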

Improving usage

You can also do the same thing by running commands directly from a file. How to do this? Come with me and I'll show you.

First write all your code inside a gs.awk file:

BEGIN {
    FS=",";
}
{
    if ($2 == 1)
        print($1);
}

Save it in the same folder as gender_submission.csv. Now you can run the command like this:

cat gender_submission.csv | gawk -f gs.awk >> titanic_survivors_id.txt

We can simplify it even further using the hashbang. Just insert the following line at the top of the file:

#!/bin/gawk -f

Your code will look like this:

#!/bin/gawk -f
BEGIN {
    FS=",";
}
{
    if ($2 == 1)
        print($1);
}

Now save the file and give it execute permission:

chmod +x gs.awk

And then your command will look like this:

cat gender_submission.csv | ./gs.awk >> titanic_survivors_id.txt

Of course, there are many other improvements we could make, like writing the lines directly to the right file from inside the script, but for an introduction this is already quite interesting, don't you agree?
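
As one possible take on that last improvement, AWK can redirect print to a file by itself. The sketch below is an assumption about how it could look, not the author's script: gs_direct.awk, the out variable and sample.csv are all hypothetical names, and the data is made up.

```shell
# Hypothetical sample in the same two-column layout as gender_submission.csv.
printf 'PassengerId,Survived\n892,0\n893,1\n894,0\n896,1\n' > sample.csv

# Hypothetical script name; the output file name is set inside the script.
cat > gs_direct.awk <<'EOF'
BEGIN {
    FS = ",";
    out = "titanic_survivors_id.txt";
    # Open the file empty so earlier runs do not leave old IDs behind.
    printf "" > out;
}
$2 == 1 {
    # Inside awk, > keeps writing to the file opened above for the whole run.
    print $1 > out;
}
EOF

awk -f gs_direct.awk sample.csv
cat titanic_survivors_id.txt
```

With this version the shell command no longer needs any redirection, and running it twice gives the same file.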

Interesting projects

If you want to study further, there are several complex projects that push the limits of the language, as well as comprehensive tutorials that explore its details. Here are some cool repositories:

AWK Raycaster: a DOOM-style game created to run in terminal
JSON.awk: a JSON reader written in AWK
Opera Bookmarks: converts Chromium and derivative bookmarks data to SQLite and CSV
AWKLISP: a LISP interpreter written in AWK
learn_gnuawk: AWK tutorial
AHO: a complete implementation of GIT written in AWK


