Some notes on computer stuff

csplit: splitting text files by separator line pattern

July 23, 2014
[gnu] [howto] [linux] [shell]

To split a file by number of bytes or lines there is the standard split command, but how do you break a file into pieces of variable size that are delimited by a separator line? A search turned up the csplit tool, hence this quick overview with a small example.

Only a string pattern is shown here; see the documentation for the other pattern types.

Let's assume the following input file named input (note the variable size of the sections):

first section

=============

second section
second section

=============

third section
third section
third section

The desired output is as follows (or something close to it; the example is artificial, and in real life the file might contain thousands of lines and tens of sections):

first section
second section
second section
third section
third section
third section

Break once

First try, in which we pass a regular expression that matches the separator (===...):

$ csplit input /===/

and get two files as output: xx00 and xx01. The input file was split only once, at the first separator: everything above it went to xx00 and everything else, starting with the separator line itself, went to xx01.

Break as many times as possible

To split at every separator, request pattern repetition by passing {*}:

$ csplit input /===/ {*}

This time three files are created: xx00, xx01 and xx02.
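A minimal sketch for reproducing both invocations in a scratch directory, assuming GNU csplit and a POSIX shell. The sample input contains two separator lines, so the single split should leave two pieces and the repeated split three. The -s option (--quiet) only silences the byte counts csplit prints, and the quotes around {*} are a precaution against shell expansion, not a requirement.

```shell
# Work in a scratch directory so the generated xx* files are easy to remove.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input: three sections, two separator lines.
printf '%s\n' \
    'first section' '' '=============' '' \
    'second section' 'second section' '' '=============' '' \
    'third section' 'third section' 'third section' > input

# Single split at the first separator.
csplit -s input /===/
once=$(ls xx* | wc -l)

# Repeat the pattern until the input is exhausted; the files from the
# previous run are overwritten and one more piece is added.
csplit -s input /===/ '{*}'
repeated=$(ls xx* | wc -l)

echo "one split: $once files, repeated: $repeated files"
```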

Better file names

Names like xx{digits} are not very informative, so it's better to provide a custom output file prefix with the --prefix option (-f is its short version):

$ csplit --prefix=input. input /===/ {*}

And get:

  • input.00
first section
 
  • input.01
=============

second section
second section
 
  • input.02
=============

third section
third section
third section

Output files might still require some post-processing to remove leading or trailing lines, but this is still much better than splitting long text files manually. One doesn't even need to write a program or a script to do it: the tool is already there.
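One possible way to do that post-processing, as a sketch rather than the only approach: run sed over each piece and delete every line that is empty or consists only of = characters. The loop and the temporary .tmp suffix are illustrative choices, and the pattern assumes that no real content line is made up solely of = signs.

```shell
# Work in a scratch directory and recreate the sample input.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
    'first section' '' '=============' '' \
    'second section' 'second section' '' '=============' '' \
    'third section' 'third section' 'third section' > input

# Split as in the article, with the custom prefix.
csplit -s --prefix=input. input /===/ '{*}'

# Strip separator lines and blank lines from every piece.
for piece in input.[0-9][0-9]; do
    # /^=*$/ matches empty lines and lines made only of '=' characters.
    sed '/^=*$/d' "$piece" > "$piece.tmp" && mv "$piece.tmp" "$piece"
done

# Each piece now holds only the lines of its section.
cat input.01
```

Newer GNU coreutils releases also provide a --suppress-matched option that leaves the matched separator lines out of the output, which may remove part of the need for this step; blank lines around the sections would still have to be trimmed separately.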

There are some more useful options and pattern types, so take a look at this page, which repeats the content of the manual page in HTML format, or just read man csplit.
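As a taste of those options, here is a small sketch assuming GNU csplit. The input is a hypothetical variant of the sample that starts with a separator line, which would normally produce an empty first piece. -z (--elide-empty-files) drops such empty pieces, and -b (--suffix-format) takes a printf-style format for the numeric suffix, used here to add a .txt extension. The section. prefix is an arbitrary name chosen for the illustration.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input that begins with a separator line.
printf '%s\n' '=====' 'first section' '=====' 'second section' > input

# -z removes the empty piece before the leading separator,
# -b '%02d.txt' keeps the two-digit counter and appends ".txt",
# -s silences the byte counts.
csplit -s -z -f section. -b '%02d.txt' input /===/ '{*}'

ls section.*
```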