Home page of Eric Pement


The awk programming language

awk is a programming language that gets its name from the 3 people who invented it (Aho, Weinberger, and Kernighan). Developed on a Unix operating system, its name is usually printed in lower-case ("awk") instead of capitalized ("Awk"). awk is distributed as free software, meaning that you don't have to pay anything for it and you can get the source code to build awk yourself.

awk is much easier to learn than C, C++, Java, or many other languages. awk excels at handling text and data files, the kind that are created in Notepad or (for example) HTML files. You wouldn't use awk to modify a Microsoft Word document or an Excel spreadsheet. However, if you take the Word document and Save As "Text Only" or if you take the Excel spreadsheet and Save As tab-delimited (*.txt) or comma-separated (*.csv) output files, then awk could do a good job at handling them.
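
As a minimal sketch of that kind of job, assume a tab-delimited export whose second column holds numbers. The helper name sum_column2 and the sample data are made up for illustration. The quoting shown is for a Unix-style shell; under CMD.EXE the awk program would go in "double quotes" instead.

```shell
# Hypothetical helper: add up the second column of tab-delimited input.
# -F '\t' tells awk to split each line into fields on tab characters.
sum_column2() {
    awk -F '\t' '{ total += $2 } END { print total }' "$@"
}

# Two sample rows; the second field of each row is the number to add.
printf 'apples\t3\npears\t4\n' | sum_column2
```

With no file names given, the function reads standard input; with file names, awk reads those files instead. A header row would simply count as zero, since awk treats non-numeric text as 0 in arithmetic.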

I like awk because it's concise. The shortest awk program that does anything useful is just 1 character:

awk 1 yourfile

On a DOS/Windows machine, this converts Unix line endings (LF) to standard DOS line endings (CR+LF). awk programs are often called "scripts" because they don't require an intermediate stage of compiling the program into an executable form like an *.EXE file. In fact, awk programs are almost never compiled into *.EXE files (though it's possible to do this). Thus, many people refer to awk as a "scripting language" instead of a "programming language."
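
Why does a one-character program do anything at all? An awk program is a list of "pattern { action }" rules. The pattern 1 is always true, and a rule with no action defaults to printing the current line, so "awk 1" behaves like "awk '{ print }'". The sketch below compares the two forms under a Unix-style shell; the function names are made up for illustration. On Unix the lines pass through unchanged. The line-ending conversion described above appears to be a side effect of how a DOS/Windows build writes its output, not something the program itself asks for.

```shell
# Short form: the always-true pattern 1 with the default action.
copy_short() { awk 1 "$@"; }

# Long form: no pattern (matches every line) with an explicit print.
copy_long() { awk '{ print }' "$@"; }

# Both commands print the same two lines.
printf 'first\nsecond\n' | copy_short
printf 'first\nsecond\n' | copy_long
```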

Normally, awk is run from a command prompt. However, if you need to run a custom awk program from the Windows desktop (usually, because you want to run the same script over and over), instead of creating a desktop shortcut to "awk.exe", create a shortcut to a script or batch file. An awk batch file for Windows could look like this:

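   @echo off
   :: Switch to the drive and directory that hold the script and its input
   c:
   cd \path\to\some\directory
   :: Run the awk program and redirect its output to a file
   awk.exe -f myscript.awk inputfile.txt > outputfile.txt

   :: All done. Show a message to the Windows user
   :: (write a one-line VBScript to a temp directory, run it, then delete it)
   if not exist c:\temp\NUL mkdir c:\temp
   echo result = MsgBox("Output file successfully created",0,"File created") > c:\temp\msg.vbs
   %windir%\system32\cscript.exe //Nologo c:\temp\msg.vbs
   del /q c:\temp\msg.vbs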

If you work in an enterprise or commercial environment, your version of Windows may have "cscript.exe" (a/k/a Windows Script Host) turned off or removed from the PATH for security reasons, as it can be a vehicle for malicious exploitation. It might be available, but just not in the expected directory (or maybe %windir% is not defined). So if the last part of the script does not work correctly, that's the most likely reason.

Get precompiled binaries for awk

Get awk, precompiled for Windows, from one of these locations:

  • EZWinPorts — This is a set of Unix files ported to Windows by Eli Zaretskii, who maintains the Windows port of GNU awk. There are a number of files here, but you especially want to download the README file and the GNU awk binary. This is the compilation I am currently using under Windows 7 and Windows 8.1.
  • Klabaster freeware — This site contains ports of GNU awk ("gawk"), Michael Brennan's awk ("mawk"), Andrew Sumner's tool ("awka") to translate awk source to ANSI C, and a few other tools. The version of GNU awk here is v4.1.4. Though the Klabaster compilation and the Zaretskii compilation of GNU awk are different, the version numbers are identical.
  • Mawk — If execution speed is ever an issue, try running Mawk instead of awk or gawk. It is typically much faster than either awk or gawk, though it lacks many of the options of GNU awk, and the error messages of mawk are much less informative.

NOTE: There are other versions of GNU awk for Win32, including compilations called GnuWin32 and UnxUtils (both on Sourceforge, if you want to search for them), but they are significantly older and less reliable than the ones above. There is also a compilation called DJGPP from Delorie.com, designed to work for Intel 80386 (and higher) PCs running MS-DOS or DOS compatibles, such as PC-DOS, DR-DOS, PTS-DOS, or FreeDOS. If you are running one of these versions of DOS, you may benefit from the Delorie versions.

Aside from that, the DJGPP utilities (all of them, not just awk) have one other unique feature (or benefit). Because they are written with Unix users in mind, they emulate the 'single quote' and "double quote" system of parsing command-line arguments. In other words, with Windows utilities such as EZWinPorts, Klabaster awk, Mawk, GnuWin32, and UnxUtils, parameters to the utilities must be entered in "double quotes" only. If you enter this at the DOS command line:

     echo Hello | sed 's/.*/&, world/'

it will not be recognized as a valid command, due to the presence of the 'single quotes'. The CMD shell wants "double quotes". The Delorie utilities allow you to use 'single quotes' or "double quotes". That is the one benefit that these versions have, although there are limits. In a true Unix shell (ksh, bash, etc.), single quotes protect special characters such as the redirection arrows (<, >) and the pipe (|). The Delorie utilities do not protect these characters with single quotes.
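
For comparison, here is the same command written both ways. This is only a sketch, run under a Unix-style shell, where both forms work; the "double quote" form is the one that the native Windows ports expect at the CMD prompt.

```shell
# Double quotes: accepted by a Unix shell, and the form CMD.EXE wants.
echo Hello | sed "s/.*/&, world/"

# Single quotes: fine in a Unix shell (and, per the text above, with the
# Delorie utilities), but not with the native Windows ports under CMD.
echo Hello | sed 's/.*/&, world/'
```

The double-quoted form is portable here only because this particular sed command contains nothing that a Unix shell expands inside double quotes. A command containing $ or a backslash would need more care.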

If you are running in a Microsoft Windows environment such as Windows 7, using CMD.EXE (or better, a command shell like Take Command), the GnuWin32, EZWinPorts, or Klabaster utilities are probably a better choice.

Things I wrote for awk

  • awk1line.txt - one-line scripts for awk. Modeled after my "sed one-liners" file, but specific to awk.
  • awktail.txt - the proper way to assign the "rest of the line" to a variable. I did it wrong for a year or two, and now I don't want to forget.
  • awk_sed.txt - a table comparing similar commands between sed and awk. How to do substitutions, deletions, etc., in both sed and awk.
  • Using system commands - the real way to embed system commands (say, calls to sed or perl or fmt) within an awk script, so they can be used just on a particular hunk of text.
  • Endnote - I created ENDNOTE because I wanted to create documents in a plain-ASCII text editor (such as Emacs, vim, or Notepad++), complete with formal, numbered footnotes, and at the same time be able to re-arrange and move the notes to different locations without renumbering everything.

    Eric Meyer first thought of it, and created a system for WordStar. I used Eric Meyer's system and rewrote it in both awk and perl.

    It works like this: Put the reference in square brackets[##] in your text, and directly below the paragraph insert the actual citation, such as (Dante, Book 3, sect. 2). Rearrange the document to your heart's delight. When the document is complete, use this awk script to sequentially number all the references, gather the notes together, and print them at the end of the file with numbers corresponding to the in-text references. Totally cool.

    The same script is also available in the perl section of this web site.

  • italbold.awk - given a textfile marked up in _pseudo-italic_ or else in *pseudo-bold* (or _*both*_), convert those tags to bona-fide HTML or some other desired output.
  • longest.awk - print the longest line in a file, with its length. Or just print the line. Or just print the length.
  • outline_classic.awk - given a document created in Emacs "outline-mode", convert the outline markers to traditional Outline format (e.g., A, B, C, 1, 2, (a), (b), etc.)
  • outline_numbered.awk - given a document created in Emacs "outline-mode", convert the outline markers to numbered outline format (e.g., 1, 2, 3, 3.1, 3.2, 3.2.1, 3.2.2, etc.)
  • paragrep.awk - when grepping (searching) a textfile, print the entire paragraph that contains the search expression, not just the line that it's on.
  • pmailadd.awk - how to take a list of names and e-mail addresses, and use awk to convert them to a format for immediate import into the Pegasus Mail program.
  • printf.txt - memory jog of how printf() works in awk
  • titlecase.awk - This is a function for taking a string in "ALL CAPS", "lowercase", or "mIXeD cAsE" and converting it to "Title Case", such as would be used for book or chapter titles. It keeps Roman numerals and special abbreviations (like USA, LXX, NT, NY) in caps, but keeps articles, conjunctions, and prepositions between words in lowercase. Names like D'Arcy, O'Reilly, and McDonald are properly capitalized, as are abbreviations like Ph.D. or D.Min. Obeys most style manual rules.

    This is really the best "titlecase" function I've seen, and I say that honestly, even though I'm the author. It even has an option switch to handle a situation where two-letter abbreviations for U.S. States and Territories should be capitalized, because most of the time, you do not want to auto-capitalize AL, CA, ID, IN, LA, MA, ME, OH, OR, and VI (Virgin Islands).
  • uniq-1.awk - sample script to show how to remove duplicate data.
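
Several of the ideas in the list above are small enough to sketch in a line or two. The functions below are not the scripts named above and do not have their options; they are minimal illustrations, written for a Unix-style shell, of three of the tasks: finding the longest line, removing duplicate lines, and printing whole paragraphs that match a search expression. The function names are made up for illustration, and "paragraph" is assumed to mean a block of lines separated by blank lines.

```shell
# Print the length of the longest line, then the line itself.
# On a tie, the first line of that length wins.
longest_line() {
    awk 'length($0) > max { max = length($0); line = $0 }
         END { print max ": " line }' "$@"
}

# Print each distinct line once, in order of first appearance.
# seen[$0]++ is 0 (false) the first time a line shows up, so !0 is true.
drop_duplicates() {
    awk '!seen[$0]++' "$@"
}

# Print every paragraph that matches the regular expression in $1.
# An empty RS puts awk in paragraph mode: records end at blank lines.
para_grep() {
    pat=$1; shift
    awk -v RS= -v ORS='\n\n' -v pat="$pat" '$0 ~ pat' "$@"
}

printf 'ab\nabcd\nabc\n' | longest_line
printf 'a\nb\na\nc\nb\n' | drop_duplicates
printf 'one\ntwo\n\nthree\nfour\n\nfive\n' | para_grep three
```

Each function reads standard input when no file names are given. Under CMD.EXE the awk programs would need "double quotes" in place of the single quotes shown here.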

Tutorials

Discussion forums, newsgroups

These pages created with GNU Emacs, xhtmlpp, Take Command, and Altap Salamander. Icons courtesy of Qbullets
Last modified: 2017-08-07