NAME
bzip2, bunzip2 - a block-sorting file compressor, v1.0
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS
bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename

DESCRIPTION
bzip2 compresses files using the Burrows-Wheeler block sorting
text compression algorithm and Huffman coding. Compression is
generally considerably better than that achieved by more
conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those
of GNU gzip, but they are not identical.

bzip2 expects a list of file names to accompany the command-line
flags. Each file is replaced by a compressed version of itself,
with the name "original_name.bz2". Each compressed file has the
same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these
properties can be correctly restored at decompression time.
File name handling is naive in the sense that there is no
mechanism for preserving original file names, permissions,
ownerships or dates in filesystems which lack these concepts, or
have serious file name length restrictions, such as MS-DOS.

bzip2 and bunzip2 will by default not overwrite existing files.
If you want this to happen, specify the -f flag.

If no file names are specified, bzip2 compresses from standard
input to standard output. In this case, bzip2 will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

bunzip2 (or bzip2 -d) decompresses all specified files. Files
which were not created by bzip2 will be detected and ignored,
and a warning issued. bzip2 attempts to guess the filename for
the decompressed file from that of the compressed file as
follows:

    filename.bz2    becomes   filename
    filename.bz     becomes   filename
    filename.tbz2   becomes   filename.tar
    filename.tbz    becomes   filename.tar
    anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings, .bz2,
.bz, .tbz2 or .tbz, bzip2 complains that it cannot guess the
name of the original file, and uses the original name with .out
appended.

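The naming rules above can be sketched in a few lines of Python.
This is only an illustration of the table, not bzip2's own code;
the names SUFFIX_MAP and guess_output_name are hypothetical.

```python
# Hypothetical helper mirroring the suffix table above; not bzip2's own code.
SUFFIX_MAP = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]

def guess_output_name(name):
    """Return (output_name, recognised) for a compressed file name."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement, True
    # Unrecognised ending: keep the name and append ".out".
    return name + ".out", False

for example in ("notes.bz2", "notes.bz", "src.tbz2", "src.tbz", "anyothername"):
    print(example, "->", guess_output_name(example)[0])
```

The second element of the result marks the case in which bzip2
would complain that it cannot guess the original name.
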
As with compression, supplying no filenames causes decompression
from standard input to standard output.

bunzip2 will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (-t) of concatenated compressed files is also supported.

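The concatenation property can be checked without the command
line tools. The sketch below uses Python's standard bz2 module,
which wraps the same compression library, rather than bunzip2
itself; the sample contents are made up.

```python
import bz2

first = b"first file contents\n"
second = b"second file contents\n"

# Two independently compressed streams, joined byte for byte.
joined = bz2.compress(first) + bz2.compress(second)

# Decompressing the joined data yields the joined originals.
restored = bz2.decompress(joined)
print(restored == first + second)
```
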
You can also compress or decompress files to the standard output
by giving the -c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed
sequentially to stdout. Compression of multiple files in this
manner generates a stream containing multiple compressed file
representations. Such a stream can be decompressed correctly
only by bzip2 version 0.9.0 or later. Earlier versions of bzip2
will stop after decompressing the first file in the stream.

bzcat (or bzip2 -dc) decompresses all specified files to the
standard output.

bzip2 will read arguments from the environment variables BZIP2
and BZIP, in that order, and will process them before any
arguments read from the command line. This gives a convenient
way to supply default arguments.

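The processing order can be pictured as building one argument
list. The sketch below is a model of that ordering only; the
function name is hypothetical, and splitting each variable on
whitespace is an assumption rather than something stated here.

```python
def effective_arguments(environ, command_line):
    """Order arguments as described: BZIP2, then BZIP, then the command line."""
    args = []
    for variable in ("BZIP2", "BZIP"):
        # Assumption: the variable's value is split on whitespace.
        args.extend(environ.get(variable, "").split())
    return args + list(command_line)

print(effective_arguments({"BZIP2": "-9 -v", "BZIP": "-k"}, ["-c", "data.txt"]))
```
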
Compression is always performed, even if the compressed file is
slightly larger than the original. Files of less than about one
hundred bytes tend to get larger, since the compression
mechanism has a constant overhead in the region of 50 bytes.
Random data (including the output of most file compressors) is
coded at about 8.05 bits per byte, giving an expansion of around
0.5%.

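Both effects are easy to observe. The sketch below uses Python's
standard bz2 module instead of the bzip2 command, with a five
byte input and 100,000 bytes from os.urandom standing in for
incompressible data; the exact sizes printed will vary.

```python
import bz2
import os

tiny = b"hello"
random_data = os.urandom(100_000)  # stands in for already-compressed input

packed_sizes = {}
for label, data in (("tiny", tiny), ("random", random_data)):
    packed_sizes[label] = len(bz2.compress(data))
    print(label, len(data), "->", packed_sizes[label], "bytes")
```
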
As a self-check for your protection, bzip2 uses 32-bit CRCs to
make sure that the decompressed version of a file is identical
to the original. This guards against corruption of the
compressed data, and against undetected bugs in bzip2 (hopefully
very unlikely). The chance of data corruption going undetected
is microscopic, about one chance in four billion for each file
processed. Be aware, though, that the check occurs upon
decompression, so it can only tell you that something is wrong.
It can't help you recover the original uncompressed data. You
can use bzip2recover to try to recover data from damaged files.

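A small experiment shows the check at work. The sketch below
uses Python's standard bz2 module rather than the bzip2 command,
flips one bit in the middle of some compressed data, and reports
whether decompression notices. The sample data is arbitrary, and
the exception types are those raised by the Python module.

```python
import bz2
import random

# Arbitrary, reproducible sample data.
rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
packed = bytearray(bz2.compress(original))

# Flip one bit in the middle of the compressed data to simulate damage.
packed[len(packed) // 2] ^= 0x01

try:
    bz2.decompress(bytes(packed))
    detected = False
except (OSError, ValueError):
    # Python's bz2 reports damaged or truncated data with these exceptions.
    detected = True

print("corruption detected:", detected)
```

As the text says, detection is all this provides: nothing in the
failed decompression helps to reconstruct the original bytes.
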
Return values: 0 for a normal exit, 1 for environmental problems
(file not found, invalid flags, I/O errors, etc.), 2 to indicate
a corrupt compressed file, 3 for an internal consistency error
(e.g. a bug) which caused bzip2 to panic.

OPTIONS
-c --stdout
    Compress or decompress to standard output.

-d --decompress
    Force decompression. bzip2, bunzip2 and bzcat are really the
    same program, and the decision about what actions to take is
    done on the basis of which name is used. This flag overrides
    that mechanism, and forces bzip2 to decompress.

-z --compress
    The complement to -d: forces compression, regardless of the
    invocation name.

-t --test
    Check integrity of the specified file(s), but don't
    decompress them. This really performs a trial decompression
    and throws away the result.

-f --force
    Force overwrite of output files. Normally, bzip2 will not
    overwrite existing output files. Also forces bzip2 to break
    hard links to files, which it otherwise wouldn't do.

-k --keep
    Keep (don't delete) input files during compression or
    decompression.

-s --small
    Reduce memory usage, for compression, decompression and
    testing. Files are decompressed and tested using a modified
    algorithm which only requires 2.5 bytes per block byte. This
    means any file can be decompressed in 2300k of memory,
    albeit at about half the normal speed.

    During compression, -s selects a block size of 200k, which
    limits memory use to around the same figure, at the expense
    of your compression ratio. In short, if your machine is low
    on memory (8 megabytes or less), use -s for everything. See
    MEMORY MANAGEMENT below.

-q --quiet
    Suppress non-essential warning messages. Messages pertaining
    to I/O errors and other critical events will not be
    suppressed.

-v --verbose
    Verbose mode -- show the compression ratio for each file
    processed. Further -v's increase the verbosity level,
    spewing out lots of information which is primarily of
    interest for diagnostic purposes.

-L --license -V --version
    Display the software version, license terms and conditions.

-1 to -9
    Set the block size to 100 k, 200 k .. 900 k when
    compressing. Has no effect when decompressing. See MEMORY
    MANAGEMENT below.

--
    Treats all subsequent arguments as file names, even if they
    start with a dash. This is so you can handle files with
    names beginning with a dash, for example: bzip2 --
    -myfilename.

--repetitive-fast --repetitive-best
    These flags are redundant in versions 0.9.5 and above. They
    provided some coarse control over the behaviour of the
    sorting algorithm in earlier versions, which was sometimes
    useful. 0.9.5 and above have an improved algorithm which
    renders these flags irrelevant.

MEMORY MANAGEMENT
bzip2 compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory
needed for compression and decompression. The flags -1 through
-9 specify the block size to be 100,000 bytes through 900,000
bytes (the default) respectively. At decompression time, the
block size used for compression is read from the header of the
compressed file, and bunzip2 then allocates itself just enough
memory to decompress the file. Since block sizes are stored in
compressed files, it follows that the flags -1 to -9 are
irrelevant to and so ignored during decompression.

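The stored block size can be read back from a compressed stream.
The sketch below relies on a detail of the bzip2 file format
that this manual does not spell out, namely that a stream begins
with "BZh" followed by a digit from 1 to 9; treat that, and the
helper name, as assumptions. It uses Python's standard bz2
module to produce the test streams.

```python
import bz2

def block_size_from_header(stream):
    """Return the block size in bytes recorded in a bzip2 stream header."""
    # Assumption (from the bzip2 file format, not this manual): a stream
    # starts with the bytes "BZh" followed by one digit, '1' to '9'.
    if stream[:3] != b"BZh" or not b"1" <= stream[3:4] <= b"9":
        raise ValueError("not a bzip2 stream")
    return int(stream[3:4]) * 100_000

for level in (1, 5, 9):
    packed = bz2.compress(b"example data", compresslevel=level)
    print(level, block_size_from_header(packed))
```
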
Compression and decompression requirements, in bytes, can be
estimated as:

    Compression:   400k + ( 8 x block size )

    Decompression: 100k + ( 4 x block size ), or
                   100k + ( 2.5 x block size )

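The estimates are simple enough to evaluate for any flag. The
sketch below is one reading of the formulas, taking "k" as 1000
bytes to match the 100,000 byte block sizes; the function and
key names are hypothetical.

```python
def memory_estimate(level):
    """Estimated memory use in bytes for flags -1 (level 1) to -9 (level 9)."""
    k = 1000                      # "k" read as 1000 bytes
    block = level * 100 * k       # block size selected by the flag
    return {
        "compress": 400 * k + 8 * block,
        "decompress": 100 * k + 4 * block,
        "decompress_small": 100 * k + (5 * block) // 2,   # 2.5 x block, with -s
    }

for level in (1, 9):
    in_k = {name: value // 1000 for name, value in memory_estimate(level).items()}
    print(level, in_k)
```

For -9 this gives 7600k, 3700k and 2350k respectively.
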
Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three
hundred k of block size, a fact worth bearing in mind when using
bzip2 on small machines. It is also important to appreciate that
the decompression memory requirement is set at compression time
by the choice of block size.

For files compressed with the default 900k block size, bunzip2
will require about 3700 kbytes to decompress. To support
decompression of any file on a 4 megabyte machine, bunzip2 has
an option to decompress using approximately half this amount of
memory, about 2300 kbytes. Decompression speed is also halved,
so you should use this option only where necessary. The relevant
flag is -s.

In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by
block size.

Another significant point applies to files which fit in a single
block -- that means most files you'd encounter using a large
block size. The amount of real memory touched is proportional to
the size of the file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
-9 will cause the compressor to allocate around 7600k of memory,
but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly,
the decompressor will allocate 3700k but only touch 100k + 20000
* 4 = 180 kbytes.

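The same arithmetic, written out for any file size: the sketch
below caps the per-byte term at the block size, which is how the
example above is read here. The function name is hypothetical
and "k" is again taken as 1000 bytes.

```python
def touched_bytes(file_size, level=9):
    """Memory actually touched when the file may be smaller than one block."""
    block = level * 100_000
    used = min(file_size, block)          # only this much of the block is filled
    return {
        "compress": 400_000 + 8 * used,
        "decompress": 100_000 + 4 * used,
    }

print(touched_bytes(20_000))
```

For the 20,000 byte file this gives 560000 and 180000 bytes,
matching the figures in the text.
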
Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed
size for 14 files of the Calgary Text Compression Corpus
totalling 3,141,622 bytes. This column gives some feel for how
compression varies with block size. These figures tend to
understate the advantage of larger block sizes for larger files,
since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6000k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

RECOVERING DATA FROM DAMAGED FILES
bzip2 compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error
causes a multi-block .bz2 file to become damaged, it may be
possible to recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a
48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty. Each block also carries
its own 32-bit CRC, so damaged blocks can be distinguished from
undamaged ones.

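To give a feel for how block boundaries can be located, here is
a sketch that scans a compressed stream bit by bit for the
delimiter. The pattern value 0x314159265359 and the most
significant bit first ordering come from the bzip2 file format,
not from this manual, so treat both as assumptions; the code is
an illustration and not how bzip2recover is implemented. The
test stream is produced with Python's standard bz2 module.

```python
import bz2
import random

# Assumption (from the bzip2 file format, not this manual): every block
# starts with the 48-bit value 0x314159265359, written most significant
# bit first, and blocks are not aligned to byte boundaries.
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1

def find_block_starts(data):
    """Return the bit offsets at which the block marker appears."""
    starts = []
    window = 0
    position = 0                      # number of bits consumed so far
    for byte in data:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            position += 1
            if position >= 48 and window == BLOCK_MAGIC:
                starts.append(position - 48)
    return starts

# 150,000 bytes of noise compressed with 100k blocks spans two blocks.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(150_000))
starts = find_block_starts(bz2.compress(noise, compresslevel=1))
print(starts)
```

The first offset is 32 because the first block follows the
four-byte stream header directly. Later offsets are generally
not multiples of 8, which is why the scan works on bits rather
than bytes.
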
bzip2recover is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use bzip2 -t to test the integrity of the
resulting files, and decompress those which are undamaged.

bzip2recover takes a single argument, the name of the damaged
file, and writes a number of files "rec0001file.bz2",
"rec0002file.bz2", etc, containing the extracted blocks. The
output filenames are designed so that the use of wildcards in
subsequent processing -- for example, "bzip2 -dc rec*file.bz2 >
recovered_data" -- lists the files in the correct order.

bzip2recover should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly futile
to use it on damaged single-block files, since a damaged block
cannot be recovered. If you wish to minimise any potential data
loss through media or transmission errors, you might consider
compressing with a smaller block size.

PERFORMANCE NOTES
The sorting phase of compression gathers together similar
strings in the file. Because of this, files containing very long
runs of repeated symbols, like "aabaabaabaab ..." (repeated
several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions
in this respect. The ratio between worst-case and average-case
compression time is in the region of 10:1. For previous
versions, this figure was more like 100:1. You can use the -vvvv
option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

bzip2 usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion.
This means that performance, both for compressing and
decompressing, is largely determined by the speed at which your
machine can service cache misses. Because of this, small changes
to the code to reduce the miss rate have been observed to give
disproportionately large performance improvements. I imagine
bzip2 will perform best on machines with very large caches.

CAVEATS
I/O error messages are not as helpful as they could be. bzip2
tries hard to detect I/O errors and exit cleanly, but the
details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0 of bzip2. Compressed
data created by this version is entirely forwards and backwards
compatible with the previous public releases, versions 0.1pl2,
0.9.0 and 0.9.5, but with the following exception: 0.9.0 and
above can correctly decompress multiple concatenated compressed
files. 0.1pl2 cannot do this; it will stop after decompressing
just the first file in the stream.

bzip2recover uses 32-bit integers to represent bit positions in
compressed files, so it cannot handle compressed files more than
512 megabytes long. This could easily be fixed.

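The 512 megabyte figure follows from counting in bits rather
than bytes. The short calculation below assumes the 32-bit
integers are unsigned and that a megabyte is 1024 * 1024 bytes;
the variable names are only for illustration.

```python
BITS_ADDRESSABLE = 2 ** 32            # distinct bit positions in a 32-bit integer
bytes_addressable = BITS_ADDRESSABLE // 8
print(bytes_addressable // (1024 * 1024), "megabytes")
```
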
AUTHOR
Julian Seward, jseward@acm.org.

http://sourceware.cygnus.com/bzip2
http://www.muraroa.demon.co.uk

The ideas embodied in bzip2 are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder),
Peter Fenwick (for the structured coding model in the original
bzip, and many refinements), and Alistair Moffat, Radford Neal
and Ian Witten (for the arithmetic coder in the original bzip).
I am much indebted for their help, support and advice. See the
manual in the source distribution for pointers to sources of
documentation. Christian von Roques encouraged me to look for
faster sorting algorithms, so as to speed up compression. Bela
Lubkin encouraged me to improve the worst-case compression
performance. Many people sent patches, helped with portability
problems, lent machines, gave advice and were generally helpful.