- The correct title of this article is bzip2. The initial letter is shown capitalized due to technical restrictions.
bzip2 is a free software/open source data compression algorithm and program developed by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor's stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000. A filename extension is a suffix to the name of a computer file applied to show its format. ...
Multipurpose Internet Mail Extensions (MIME) is an Internet Standard that extends the format of e-mail to support text in character sets other than US-ASCII, non-text attachments, multi-part message bodies, and header information in non-ASCII character sets. ...
A type code is a mechanism used in pre-Mac OS X versions of the Macintosh operating system to denote a files format, in a manner similar to file extensions in other operating systems. ...
Julian Seward is a compiler hacker and open source contributor. ...
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. ...
Software development is the translation of a user need or marketing goal into a software product. ...
Julian Seward is a compiler hacker and open source contributor. ...
A software release refers to the creation and availability of a new version of a computer software product. ...
December 20 is the 354th day of the year (355th in leap years) in the Gregorian calendar. ...
For the Manfred Mann album, see 2006 (album). ...
An operating system (OS) is a set of computer programs that manage the hardware and software resources of a computer. ...
A cross-platform (or platform independent) programming language, software application or hardware device works on more than one system platform (e. ...
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. ...
A software license is a legal agreement which may take the form of a proprietary or gratuitous license as well as a memorandum of contract between a producer and a user of computer software. ...
A website (or Web site) is a collection of web pages, images, videos and other digital assets and hosted on a particular domain or subdomain on the World Wide Web. ...
This article is about free software as defined by the sociopolitical free software movement; for information on software distributed without charge, see freeware. ...
Open source software refers to computer software available with its source code and under an open source license. ...
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. ...
In mathematics, computing, linguistics, and related disciplines, an algorithm is a procedure (a finite set of well-defined instructions) for accomplishing some task which, given an initial state, will terminate in a defined end-state. ...
Julian Seward is a compiler hacker and open source contributor. ...
Compression efficiency bzip2 compresses most files more effectively than more traditional gzip or ZIP but is slower. In this manner it is fairly similar to other recent-generation compression algorithms. Unlike other formats such as RAR or ZIP (and similar to gzip), bzip2 is only a data compressor, not an archiver. The program itself has no facilities for multiple files, encryption or archive-splitting, in the UNIX tradition instead relying on separate external utilities such as tar and GnuPG for these tasks. The correct title of this article is . ...
The ZIP file format is a popular data compression and archival format. ...
REDIRECT RAR (file format) ...
Filiation of Unix and Unix-like systems Unix (officially trademarked as UNIX®) is a computer operating system originally developed in the 1960s and 1970s by a group of AT&T employees at Bell Labs including Ken Thompson, Dennis Ritchie and Douglas McIlroy. ...
In computing, the tar (file) format (derived from tape archive) is a type of archive bitstream or file format. ...
The GNU Privacy Guard (GnuPG or GPG) is a free software replacement for the PGP suite of cryptographic software, released under the GNU General Public License. ...
In some cases, bzip2 is surpassed by 7z and RAR formats in terms of absolute compression efficiency. According to the author, bzip2 gets within ten to fifteen percent of the "best" class of compression algorithms currently known (PPM), while being roughly twice as fast at compression and six times faster at decompression. 7z is a compressed archive file format that supports several different data compression, encryption and pre-processing filters. ...
PPM is an adaptive statistical data compression technique based on context modeling and prediction. ...
bzip2 uses the Burrows-Wheeler transform to convert frequently recurring character sequences into strings of identical letters, and then applies a move-to-front transform and finally Huffman coding. In bzip2 the blocks are generally all the same size in plaintext, which can be selected by a command-line argument between 100kB–900kB. Compression blocks are delimited by a 48-bit sequence (magic number) derived from the binary-coded decimal representation of π, 0x314159265359, with the end-of-stream similarly delimited by a value representing sqrt(π), 0x177245385090. The Burrows-Wheeler transform (BWT, also called block-sorting compression), is an algorithm used in data compression techniques such as bzip2. ...
The move-to-front (or MTF) transform is an encoding of data (typically a stream of bytes) designed to improve the performance of entropy encoding techniques of compression. ...
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. ...
A kilobyte (derived from the SI prefix kilo-, meaning 1000) is a unit of information or computer storage equal to the decimal 1024 bytes (2 to the 10th power, or 1,024 bytes based in the binary system). ...
In computer programming, a magic number is a constant used to identify the file or data type employed. ...
In computing and electronic systems, Binary-coded decimal (BCD) is an encoding for decimal numbers in which each digit is represented by its own binary sequence. ...
When a circles diameter is 1, its circumference is Ï. The mathematical constant Ï is an irrational real number, approximately equal to 3. ...
In mathematics, a square root of a number x is a number whose square (the result of multiplying the number by itself) is x. ...
When a circles diameter is 1, its circumference is Ï. The mathematical constant Ï is an irrational real number, approximately equal to 3. ...
Originally, bzip2's ancestor bzip used arithmetic coding after the blocksort; this was discontinued because of the patent restriction to be replaced by the Huffman coding currently used in bzip2. Arithmetic coding is a method for lossless data compression. ...
Software patent does not have a universally accepted definition. ...
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. ...
Compression stack Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression: - Run-length encoding (RLE), any sequence of four (4) duplicate symbols are adjusted so that the following symbol is replaced by a repeat length of between 0 and 255. Thus the sequence
"AAAAAAABBBBCCCD" is replaced with "AAAA3BBBB0CCCD". Runs of symbols are always transformed after three consecutive symbols, even if the run-length is set to zero. This transformation is particularly good at taking care of large zeroed areas of files. In the worst case, it can cause a pre-BWT expansion of 1.25 and in the best case a reduction to <0.02 of original size. - Burrows-Wheeler transform (BWT), this is the reversible block-sort that is at the core of bzip2. The block is entirely self contained, with input and output buffers remaining the same size—in bzip2, the operating limit for this stage is 900kB. For the block-sort, a (notional) matrix is created in which row i contains the whole of the buffer, rotated to start from the ith symbol. Following rotation, the rows of the matrix are sorted into alphabetic (numerical) order. A 24-bit pointer is stored marking the starting position for when the block is untransformed. In practice, it is not necessary to construct the full matrix, rather the sort is performed using pointers for each position in the buffer. The output buffer is the last column of the matrix; this contains the whole buffer, but reordered so that it is likely to contain large runs of identical symbols.
- Move to front (MTF), again, this transform does not alter the size of the processed block. Each of the symbols in use in the document is placed in an array. When a symbol is processed, it is replaced by its subscript (offset) in the array and that symbol is shuffled to the front of the array. The effect is that immediately recurring symbols are replaced by zero symbols (long runs of any arbitrary symbol thus become runs of zero symbols), while other symbols are remapped according to their local frequency.
A lot of "natural" data contains identical symbols that recur within a limited range (text is a good example). As the MTF transform assigns low values to symbols that reappear frequently, this results in a data stream which contains a lot of symbols in the low integer range, many of them being identical (different recurring input symbols can actually map to the same output symbol). Such data can be very efficiently encoded by any legacy compression method. - Run-length encoding (RLE), long strings of repeated symbols in the output (normally zeros by this time) are replaced by a combination of the symbol and a sequence of two special codes,
RUNA and RUNB, which represent the run-length as a binary number greater than one (1). The sequence 0,0,0,0,1 would be represented as 0,RUNB,RUNA,1; RUNB and RUNA representing the value 11 in binary, or three (3) in decimal. The run-length code is terminated by reaching another normal symbol. - Huffman coding, this process replaces fixed length symbols (8-bit bytes) with variable length codes based on the frequency of use. More frequently used codes end up shorter (2-3 bits) whilst rare codes can be allocated up to 20 bits. The codes are selected carefully so that no sequence of bits can be confused for a different code. An additional code is used to mark the end of the stream: if fewer than 256 symbols are in use, then the end-of-stream will be the smallest used value, which could be as low as three (3). The code mappings are allocated as follows:
0: RUNA 1: RUNB 2-257: byte values 0-255 258: end of stream, finish processing. (could be as low as 3). - Multiple Huffman tables, several identically-sized Huffman tables can be used with a block if the gain from using them is greater than the cost of including the extra table. At least two (2) and up to six (6) tables can be present, with the most appropriate table being reselected before every 50 symbols processed. This has the advantage of having very responsive Huffman dynamics without having to continuously supply new tables, as would be required in DEFLATE. Run-length encoding in the previous step is designed to take care of codes that have an inverse probability of use higher than the shortest code Huffman code in use.
- Unary Base 1 encoding, if multiple Huffman tables are in use, the selection of each table (numbered 0..5) is done from a list by a zero-terminated bit run between one (1) and six (6) bits in length. The selection is into a MTF list of the tables. Using this feature results in a maximum expansion of around 1.015, but generally less. This expansion is likely to be greatly over-shadowed by the advantage of selecting more appropriate Huffman tables and the common-case of continuing to use the same Huffman table is represented as a single bit. Rather than unary encoding, effectively this is an extreme form of a Huffman tree where each code has half the probability of the previous code.
- Delta encoding (Δ); Huffman code bit-lengths are required to reconstruct each of the used Canonical Huffman tables. Each bit-length is stored as an encoded difference against the previous code bit-length. A zero-bit (0) means that the previous bit-length should be duplicated for the current code, whilst a one-bit (1) means that a further bit should be read and the bit-length incremented or decremented based on that value. In the common case a single bit is used per symbol per table and the worst case—going from length one (1) to length twenty (20)—would require approximately 37 bits. As a result of the earlier MTF encoding, code lengths would start at 2-3bits long (very frequently used codes) and gradually increase, meaning that the delta format is fairly efficient—requiring around 300-bits (38 bytes) per full Huffman table.
- Sparse Bit array, a bitmap is used to show which symbols are used inside the block and should be included in the Huffman trees. Binary data is likely to use all 256 symbols representable by a byte, whereas textual data may only use a small subset of available values, perhaps covering the ASCII range between 32 and 126. Storing 256 zero bits would be inefficient if they were mostly unused. A sparse method is used, the 256 symbols are divided up into 16 ranges and only if symbols are used within that block is a 16-bit array included. The presence of each of these 16 ranges is indicated by an additional 16-bit bit array at the front. The total bitmap uses between 32 and 272 bits of storage (4–34 bytes). For contrast, the DEFLATE algorithm would show the absence of a symbol by encoded the symbol as having zero bit-length; this very inefficient with the Delta-encoding used in bzip2 which is the reason for bzip2 having a separate bit array.
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. ...
The Burrows-Wheeler transform (BWT, also called block-sorting compression), is an algorithm used in data compression techniques such as bzip2. ...
The move-to-front (or MTF) transform is an encoding of data (typically a stream of bytes) designed to improve the performance of entropy encoding techniques of compression. ...
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. ...
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. ...
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. ...
DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. ...
The unary numeral system (base-1) is the simplest numeral system to represent natural numbers: in order to represent a number N, an arbitrarily chosen symbol is repeated N times. ...
MTF is a three-letter acronym with multiple meanings, as described below: Male-to-female (gender identity); see transwoman or transsexual Modulation Transfer Function, the magnitude of an Optical Transfer Function, a description of an optical systems response to fine detail. ...
Delta encoding is a way of storing or transmitting data in form of differences between sequential data rather than complete files. ...
In the mathematical subfield of numerical analysis a sparse matrix is a matrix populated primarily with zeros. ...
A bit array is an array data structure which compactly stores individual bits (boolean values). ...
There are 95 printable ASCII characters, numbered 32 to 126. ...
DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. ...
File format A .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks, immediately followed by an end-of-stream marker containing a 32-bit CRC for the plaintext whole stream processed. The compressed blocks are bit-aligned and no padding occurs. /* Paul Sladen, 2007-01-11 */ .magic:16 = 'BZ' signature/magic number .version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated) .hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB .compressed_magic:48 = 0x314159265359 (BCD (pi)) .crc:32 = checksum for this block .randomised:1 = 0=>normal, 1=>randomised (deprecated) .origPtr:24 = starting pointer into BWT for after untransform .huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present .huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16) .huffman_groups:3 = 2..6 number of different Huffman tables in use .selectors_used:15 = number of times that the Huffman tables are swapped (each 50 bytes) *.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used) .start_huffman_length:5 = 0..20 starting bit length for Huffman deltas *.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length { 1=>length--; 0=>length++ } (*(symbols+2)*groups) .contents:2..∞ = Huffman encoded data stream until end of block .eos_magic:48 = 0x177245385090 (BCD sqrt(pi) .crc:32 = checksum for whole stream .padding:0..7 = align to whole byte Note for implementors: Because of the first-stage RLE compression (see above), the maximum length of plaintext that a single 900 kB bzip2 block can contain is around 46 MB (45,899,235 bytes). This staggering feat of compression can occur if the whole plaintext consists entirely of repeated values (the resulting .bz2 file in this case is 46 bytes long).[1]
Use In Unix, bzip2 can be used combined with or independently of tar: bzip2 file to compress and bzip2 -d file.bz2 to uncompress (the alias bunzip2 for decompression may also be used). Filiation of Unix and Unix-like systems Unix (officially trademarked as UNIX®) is a computer operating system originally developed in the 1960s and 1970s by a group of AT&T employees at Bell Labs including Ken Thompson, Dennis Ritchie and Douglas McIlroy. ...
bzip2's command line flags are mostly the same as in gzip. So, to extract from a bzip2-compressed tar-file: The correct title of this article is . ...
bzip2 -d < archivefile.tar.bz2 | tar -xvf - or bunzip2 < archivefile.tar.bz2 | tar -xvf - To create a bzip2-compressed tar-file: tar -cvf - filenames | bzip2 > archivefile.tar.bz2 GNU tar supports a -j flag, which allows creation of tar.bz2 files without a pipeline: GNU (pronounced ) is a computer operating system - consisting of a kernel, libraries, system utilities, compilers, and end-user application software - composed entirely of free software. ...
tar -cvjf archivefile.tar.bz2 file-list Decompressing in GNU tar: tar -xvjf archivefile.tar.bz2 Implementations - bzip2: Julian Seward's original reference implementation available under a BSD license.
- 7-Zip: written by Igor Pavlov in C++, the 7-Zip suite contains a bzip2 encoder/decoder which is freely licensed.
- micro-bzip2: a version by Rob Landley designed for reduced compiled code size and and available under the GNU LGPL.
- org.apache.tools.bzip2: Java implementation from the Apache Ant build system available under the Apache License.
- pbzip2: Parallel MPI-based implementation by Christian Siebert.
- PBZIP2: Parallel pthreads-based implementation in C++ by Jeff Gilchrist.
- bzip2smp: a modification to
libbzip2 that has SMP parallelisation "hacked in" by Konstantin Isakov. - smpbzip2: Another go at parallel bzip2, by Niels Werensteijn.
- pyflate: a pure-Python stand-alone bzip2 and DEFLATE (gzip) decoder by Paul Sladen. Probably useful for research and prototyping, made available under the BSD/GPL/LGPL/DFSG licenses.
Julian Seward is a compiler hacker and open source contributor. ...
The BSD license is a permissive license and is one of the most widely used free software licenses. ...
7-Zip is a file archiver designed originally for the Microsoft Windows operating system, and later made available to other systems. ...
C++ (pronounced see plus plus, IPA: ) is a general-purpose, high-level programming language with low-level facilities. ...
GNU logo The GNU Lesser General Public License (formerly the GNU Library General Public License) is an FSF approved Free Software license designed as a compromise between the GNU General Public License and simple permissive licenses such as the BSD license and the MIT License. ...
Java is an object-oriented programming language developed by Sun Microsystems in the early 1990s. ...
Apache Ant is a software tool for automating software build processes. ...
The Apache License (Apache Software License previous to version 2. ...
The Message Passing Interface (MPI) is a computer communications protocol. ...
POSIX Threads is a POSIX standard for threads. ...
C++ (pronounced see plus plus, IPA: ) is a general-purpose, high-level programming language with low-level facilities. ...
Symmetric Multiprocessing, or SMP, is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory. ...
Python is a programming language created by Guido van Rossum in 1990. ...
DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. ...
The correct title of this article is . ...
The BSD license is a permissive license and is one of the most widely used free software licenses. ...
The GNU logo The GNU General Public License (GNU GPL or simply GPL) is a widely-used free software license, originally written by Richard Stallman for the GNU project. ...
GNU logo The GNU Lesser General Public License (formerly the GNU Library General Public License) is a free software license published by the Free Software Foundation. ...
The Debian Free Software Guidelines (DFSG) are a set of guidelines that the Debian Project uses to determine whether a software license is free software license, which in turn is used to determine whether a piece of software can be included in the main, free software distribution of Debian. ...
See also Image File history File links Portal. ...
This is a list of file formats used by archivers and compressors. ...
It has been suggested that this article or section be merged into Comparison of file archivers. ...
The following tables compare general and technical information for a number of file archivers. ...
This is a list of Unix programs. ...
Rzip is a data compression program based on bzip2. ...
External links This article or section does not cite its references or sources. ...
Mac OS X (official IPA pronunciation: ) is a line of proprietary, graphical operating systems developed, marketed, and sold by Apple Inc. ...
Parallel computing is the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain results faster. ...
References - ^
$ dd if=/dev/zero bs=45899235 count=1 | bzip2 -vvvv | wc -c A even smaller file of 40 bytes can be achieved by using a input containing entirely values of 251, an apparent compression ratio of 1147480:1. | v • d • e Codecs implementations (See Compression methods for methods and Compression formats and standards - for formats) | Video codecs (Comparison) | MPEG-4 ASP 3ivx · DivX · FFmpeg MPEG-4 · HDX4 · Xvid | H.264/MPEG-4 AVC CoreAVC · QuickTime H.264 · x264 | Lossless CorePNG · FFV1 · Huffyuv · Lagarith · MSU Lossless | Others Cinepak · Dirac · Indeo · VP3 · VP7 · Pixlet · Snow · Tarkin · Theora · WMV A Codec is a device or program capable of performing encoding and decoding on a digital data stream or signal. ...
Look up Implementation in Wiktionary, the free dictionary. ...
A video codec is a device or software module that enables video compression or decompression for digital video. ...
This is a comparison of video codecs. ...
MPEG-4 Part 2 is a video compression technology developed by MPEG. It belongs to the MPEG-4 ISO/IEC standard (ISO/IEC 14496-2). ...
3ivx is a video codec created by 3ivx Technologies. ...
DivX is a brand name of products created by DivX, Inc. ...
FFmpeg is a collection of free software that can record, convert and stream digital audio and video. ...
HDX4 is a MPEG4 codec developed by a German company named Jomigo Visual Technology. ...
Xvid (formerly XviD) is a video codec library following the MPEG-4 standard. ...
H.264 is a high compression digital video codec standard written by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG) as the product of a collective partnership effort known as the Joint Video Team (JVT). ...
CoreAVC is a video decoder developed by CoreCodec, implementing the MPEG-4 AVC standard (also known as H.264) used, for example, in next-generation video disc formats HD DVD and Blu-Ray. ...
QuickTime is a multimedia framework developed by Apple Inc. ...
x264 is a free software library for encoding H.264/MPEG-4 AVC video streams. ...
Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. ...
CorePNG is a lossless codec based on PNG. Essentially, each frame is compressed as a PNG, so if PNG does it, this codec does too. ...
FFV1 which stands for FF video codec 1 is an experimental video encoder and decoder featuring lossless, intra-frame only and relatively high compression. ...
Huffyuv (or HuffYUV) is a very fast, lossless Win32 video codec written by Ben Rudiak-Gould, meant to replace uncompressed YUV as a video capture format. ...
Lagarith is an open source lossless video codec written by Ben Goldman. ...
MSU Lossless video codec is an lossless video codec written by MSU Graphics&Media Lab Video Group. ...
Compressed with Cinepak, quality 40% Cinepak is a video codec, developed by Radius Inc to accommodate 1x (150 kbyte/s) CD-ROM transfer rates. ...
Dirac is a prototype algorithm for the encoding and decoding (see codec) of raw video. ...
Indeo Video (commonly known now simply as Indeo) is a video codec developed by Intel in 1992. ...
VP3 was originally a proprietary video codec developed by On2 Technologies. ...
TrueMotion VP7 is a video codec developed by On2 Technologies as a successor to earlier efforts such as VP3, VP5 and TrueMotion VP6. ...
Pixlet is a video codec created by Apple Computer and based on wavelets, designed to enable viewing of full resolution, high resolution movies in real time at low DV data rates. ...
Snow is an experimental video codec developed by Michael Niedermayer for the FFmpeg package. ...
This page is about the video compression codec. ...
Theora is a video codec being developed by the Xiph. ...
Windows Media Video (WMV) is a generic name for the set of video codec technologies developed by Microsoft. ...
| | Audio codecs (Comparison) | | Archivers (Comparison) | | |