https://www.digitalocean.com/community/tutorials/how-to-use-the-awk-language-to-manipulate-text-in-linux

Introduction

Linux utilities often follow the Unix philosophy of design. Tools are encouraged to be small, use plain text files for input and output, and operate in a modular manner. Because of this legacy, we have great text processing functionality with tools like sed and awk.

In this guide, we will discuss awk. Awk is both a programming language and text processor that can be used to manipulate text data in very useful ways. We will be discussing this on an Ubuntu 12.04 VPS, but it should operate the same on any modern Linux system.

 

Basic Syntax

An awk implementation is included by default on nearly all modern Linux systems (typically gawk or mawk), so we do not need to install anything to begin using it.

Awk is most useful when handling text files that are formatted in a predictable way. For instance, it is excellent at parsing and manipulating tabular data. It operates on a line-by-line basis and iterates through the entire file.

By default, it uses whitespace (spaces, tabs, etc.) to separate fields. Luckily, many configuration files on your Linux system use this format.

The basic format of an awk command is:

awk '/search_pattern/ { action_to_take_on_matches; another_action; }' file_to_parse

You can omit either the search portion or the action portion from any awk command. By default, the action taken if the "action" portion is not given is "print". This simply prints all lines that match.

If the search portion is not given, awk performs the action listed on each line.

If both are given, awk uses the search portion to decide whether the current line matches the pattern, and then performs the actions only on matching lines.
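The three forms can be compared side by side. This is a minimal sketch that assumes a small hypothetical file, sample.txt, created here only for illustration:

```shell
# Create a small sample file (hypothetical data, for illustration only).
printf 'alpha 1\nbeta 2\ngamma 3\n' > sample.txt

# Pattern only: the default action prints each matching line.
awk '/beta/' sample.txt
# beta 2

# Action only: the action runs on every line.
awk '{ print }' sample.txt
# alpha 1
# beta 2
# gamma 3

# Pattern and action: the action runs only on matching lines.
awk '/gamma/ { print "found gamma" }' sample.txt
# found gamma
```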

 

Simple Uses

In its simplest form, we can use awk like cat to simply print all lines of a text file out to the screen.

Let's print out our server's fstab file, which lists the filesystems that it knows about:

awk '{print}' /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
#
proc /proc proc nodev,noexec,nosuid 0 0
# / was on /dev/vda1 during installation
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd / ext4 noatime,errors=remount-ro 0 1

This isn't very useful. Let's try out awk's search filtering capabilities:

awk '/UUID/' /etc/fstab
# device; this may be used with UUID= as a more robust way to name devices
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd / ext4 noatime,errors=remount-ro 0 1

As you can see, awk now only prints the lines that have "UUID" in them. We can get rid of the extraneous comment line by specifying that UUID must be located at the very beginning of the line:

awk '/^UUID/' /etc/fstab
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd /               ext4    noatime,errors=remount-ro 0       1

Similarly, we can use the action section to specify which pieces of information we want to print. For instance, to print only the first column, we can type:

awk '/^UUID/ {print $1;}' /etc/fstab
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd

We can reference every column (as delimited by whitespace) through a variable named after its column number. The first column can be referenced by $1, for instance. The entire line can be referenced by $0.
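To see the field variables in isolation, we can pipe a single line into awk. The line below is a made-up fstab-style record stored in a hypothetical shell variable named line:

```shell
# A made-up, whitespace-separated record (not taken from a real system).
line="UUID=1234 / ext4 noatime 0 1"

# $1 is the first field.
echo "$line" | awk '{ print $1 }'
# UUID=1234

# Fields can be printed in any order; a comma inserts a space between them.
echo "$line" | awk '{ print $3, $2 }'
# ext4 /

# $0 is the entire line.
echo "$line" | awk '{ print $0 }'
# UUID=1234 / ext4 noatime 0 1
```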

 

Awk Internal Variables and Expanded Format

Awk uses some internal variables to assign certain pieces of information as it processes a file.

The internal variables that awk uses are:

  • FILENAME: References the current input file.
  • FNR: References the number of the current record relative to the current input file. For instance, if you have two input files, this would tell you the record number of each file instead of as a total.
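  • FS: The field separator used to split each record into fields. By default, this is a single space, which awk treats specially: fields are separated by runs of whitespace (spaces and tabs), and leading and trailing whitespace is ignored.
  • NF: The number of fields in the current record.
  • NR: The number of the current record, counted across all input files.
  • OFS: The field separator for the outputted data. By default, this is a single space.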
  • ORS: The record separator for the outputted data. By default, this is a newline character.
  • RS: The record separator used to distinguish separate records in the input file. By default, this is a newline character.
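The difference between NR and FNR is easiest to see with two input files. The sketch below uses two hypothetical files, first.txt and second.txt, created only for this demonstration:

```shell
# Two small sample files (hypothetical data, for illustration only).
printf 'a b c\nd e\n' > first.txt
printf 'f g h i\n' > second.txt

# Print the file name, overall record number, per-file record number,
# and field count for every record.
awk '{ print FILENAME, NR, FNR, NF }' first.txt second.txt
# first.txt 1 1 3
# first.txt 2 2 2
# second.txt 3 1 4
```

NR keeps counting across files, while FNR starts over at 1 when awk moves on to second.txt.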

We can change the values of these variables at will to match the needs of our files. Usually we do this during the initialization phase of our awk processing.

This brings us to another important concept. Awk syntax is actually slightly more complex than what we showed initially. There are also optional BEGIN and END blocks that can contain commands to execute before and after the file processing, respectively.

This makes our expanded syntax look something like this:

awk 'BEGIN { action; }
/search/ { action; }
END { action; }' input_file

The BEGIN and END keywords are special patterns that work just like the search patterns. BEGIN matches once before any input is read, and END matches once after all input has been processed.

This means that we can change some of the internal variables in the BEGIN section. For instance, the /etc/passwd file is delimited with colons (:) instead of whitespace. If we wanted to print out the first column of this file, we could type:

awk 'BEGIN { FS=":"; }
{ print $1; }' /etc/passwd
root
daemon
bin
sys
sync
games
man
. . .
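The same idea can be tried without touching system files. This sketch assumes a hypothetical passwd-style file named users.txt with made-up accounts. It also sets OFS to change what print places between fields, and shows the -F command-line option, which is a shorthand for setting FS:

```shell
# A made-up, colon-delimited file in the style of /etc/passwd.
printf 'alice:x:1000:1000::/home/alice:/bin/bash\nbob:x:1001:1001::/home/bob:/bin/sh\n' > users.txt

# Set both the input separator (FS) and the output separator (OFS).
awk 'BEGIN { FS=":"; OFS=" -> " } { print $1, $7 }' users.txt
# alice -> /bin/bash
# bob -> /bin/sh

# -F: on the command line has the same effect as FS=":" in a BEGIN block.
awk -F: '{ print $1 }' users.txt
# alice
# bob
```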

We can use the BEGIN and END blocks to print simple information about the fields we are printing:

awk 'BEGIN { FS=":"; print "User\t\tUID\t\tGID\t\tHome\t\tShell\n--------------"; }
{print $1,"\t\t",$3,"\t\t",$4,"\t\t",$6,"\t\t",$7;}
END { print "---------\nFile Complete" }' /etc/passwd
User        UID     GID     Home        Shell
--------------
root 0 0 /root /bin/bash
daemon 1 1 /usr/sbin /bin/sh
bin 2 2 /bin /bin/sh
sys 3 3 /dev /bin/sh
sync 4 65534 /bin /bin/sync
. . .
---------
File Complete

As you can see, we can format things quite nicely by taking advantage of some of awk's features.
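Tabs only line up when the values have similar lengths. As an alternative sketch (not the approach used above), awk's printf statement can pad each column to a fixed width. The file passwd_sample.txt and the column widths are assumptions chosen for illustration:

```shell
# A made-up, colon-delimited sample in the style of /etc/passwd.
printf 'root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/bin/sh\n' > passwd_sample.txt

# %-10s left-justifies a string in a 10-character column, and so on.
awk 'BEGIN { FS=":"; fmt = "%-10s %-6s %-6s %-12s %s\n";
             printf fmt, "User", "UID", "GID", "Home", "Shell" }
{ printf fmt, $1, $3, $4, $6, $7 }' passwd_sample.txt
```

Unlike print, printf does not add a newline automatically, which is why the format string ends with \n.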

Each of the expanded sections is optional. In fact, the main action section itself is optional if another section is defined. We can do things like this:

awk 'BEGIN { print "We can use awk like the echo command"; }'
We can use awk like the echo command
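
An END block can likewise be used on its own. In this small sketch, which assumes a hypothetical three-line file named lines.txt, awk reads the whole file without printing anything and then reports how many records it saw:

```shell
# A small sample file (hypothetical data, for illustration only).
printf 'one\ntwo\nthree\n' > lines.txt

# In the END block, NR still holds the number of the last record read.
awk 'END { print "Records read:", NR }' lines.txt
# Records read: 3
```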
 

Awk Field Searching and Compound Expressions

In one of the examples above, we printed the line in the /etc/fstab file that began with "UUID". This was easy because we were looking for the beginning of the entire line.

What if we wanted to find out if a search pattern matched at the beginning of a field instead?

We can create a favorite_food.txt file which lists an item number and the favorite foods of a group of friends:

echo "1 carrot sandy
2 wasabi luke
3 sandwich brian
4 salad ryan
5 spaghetti jessica" > favorite_food.txt

If we want to find all foods from this file that begin with "sa", we might begin by trying something like this:

awk '/sa/' favorite_food.txt
1 carrot sandy
2 wasabi luke
3 sandwich brian
4 salad ryan

Here, we are matching any instance of "sa" anywhere on the line. This includes things like "wasabi", which has the pattern in the middle, and "sandy", which is not in the column we want. We are only interested in words beginning with "sa" in the second column.

We can tell awk to only match at the beginning of the second column by using this command:

awk '$2 ~ /^sa/' favorite_food.txt
3 sandwich brian
4 salad ryan

As you can see, this allows us to only search at the beginning of the second column for a match.

The "^" character anchors the match to the beginning of the text being searched, which here is the field rather than the whole line. The "$2 ~" part tells awk to test the pattern against the second column only.
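The field and the anchor can be varied independently. The sketch below recreates the same favorite_food.txt file so that it can be run on its own, then tries two variations:

```shell
# Recreate the sample file from above.
printf '1 carrot sandy\n2 wasabi luke\n3 sandwich brian\n4 salad ryan\n5 spaghetti jessica\n' > favorite_food.txt

# Without "^", the pattern may match anywhere inside the second column.
awk '$2 ~ /sa/' favorite_food.txt
# 2 wasabi luke
# 3 sandwich brian
# 4 salad ryan

# The same anchored pattern applied to the third column instead.
awk '$3 ~ /^sa/' favorite_food.txt
# 1 carrot sandy
```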

We can just as easily search for things that do not match by including the "!" character before the tilde (~). This command will return all lines that do not have a food that starts with "sa":

awk '$2 !~ /^sa/' favorite_food.txt
1 carrot sandy
2 wasabi luke
5 spaghetti jessica

If we decide later on that we are only interested in lines where the above is true and the item number is less than 5, we could use a compound expression like this:

awk '$2 !~ /^sa/ && $1 < 5' favorite_food.txt

This introduces two new things. The first is the && operator, which requires both conditions to be true for a line to match. Using this, you can combine an arbitrary number of conditions.

The second is a numeric comparison: $1 < 5 checks that the value of the first column is less than 5.
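Running the compound expression against the sample file shows its effect. The sketch below recreates favorite_food.txt so that it is self-contained. It also includes the || (logical OR) operator for contrast, which is not covered above but is a standard awk operator:

```shell
# Recreate the sample file from above.
printf '1 carrot sandy\n2 wasabi luke\n3 sandwich brian\n4 salad ryan\n5 spaghetti jessica\n' > favorite_food.txt

# Both conditions must hold: food does not start with "sa" AND item number < 5.
awk '$2 !~ /^sa/ && $1 < 5' favorite_food.txt
# 1 carrot sandy
# 2 wasabi luke

# Either condition may hold: food starts with "sa" OR item number equals 5.
awk '$2 ~ /^sa/ || $1 == 5' favorite_food.txt
# 3 sandwich brian
# 4 salad ryan
# 5 spaghetti jessica
```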

 

Conclusion

By now, you should have a basic understanding of how awk can manipulate, format, and selectively print text files. Awk is a much larger topic though, and is actually an entire programming language complete with variable assignment, control structures, built-in functions, and more. It can be used in scripts to easily format text in a reliable way.

To learn more about how to work with awk, check out the great online resources for awk, and more relevantly, gawk, the GNU version of awk present on modern Linux distributions.

By Justin Ellingwood
