Parity Volume Set Specification v1.0

Stefan Wehlus

Tobias Reiper

Kilroy Balore

Willem Monsuwe

Karl Vogel

Ryan Gallagher

The introduced principles, the file format secification and it's details can be used free by everyone for any type of software. There will be no license or other limitations.

It doesn't matter, if you are writing open, closed or shared source, public domain, freeware, shareware or commercial software - you can use this spec.

But if you use the PAR/PXX file format and want to make changes to the format, you have to discuss them with the project (http://parchive.sourceforge.net) to maintain compatibility.

October 14th, 2001

Revision History
Revision 1.0	April 2nd, 2003
Formatting Changes (Docbook Simplified by Ryan)
Revision 1.0	October 14th, 2001
Stable
Revision 0.9	October 7th, 2001
Pass 3 draft revision
Revision 0.3	July 16th, 2001
Pass 2 draft revision
Revision 0.1	July 12th, 2001
Initial Submission, by Beaker

Table of Contents

Overview

General Description

Format Specifications for a .PAR/.PXX File

Comments

Implementation Issues

Overview

The following document describes a file format called Parity Volume Set.

A Parity Volume Set provides RAID like parity records for a given number of files. With this parity records it is possible to restore missing files of the set.

Example 73. An example file set

You have 10 data files:

Data Files
Foobar.d01
Foobar.d02
Foobar.d03
Foobar.d04
Foobar.d05
Foobar.d06
Foobar.d07
Foobar.d08
Foobar.d09
Foobar.d10

With a program using this specification you can calculate parity volumes for these files.

Parity Volumes
Foobar.p01
Foobar.p02
Foobar.p03

The parity volumes contain a reed solomon checksum of the data of the data files. So if a file is missing or corrupt, it can be reconstructed out of the remaining data files and the parity volumes. Any file of this set can be reconstructed with any parity volume. You just need as much parity volumes, as files you are missing.

For instance, in the example above, Foobar.d03 is lost. So it is possible to restore Foobar.d03 using the remaining data files and one parity volume. It doesn't matter, which parity volume (p01/p02/p03) - everyone will do...

In case you miss 2 files, you need 2 parity volumes. What parity volumes and their combination (p01+p02/p01+p03/p02+p03) doesn't matter too - you just need two of them. (They have to be different: renaming a copy of foobar.p01 to foobar.p02 and using it with foobar.p01 to restore will not work...)

General Description

A parity volume set consists of two parts. First there is a .PAR file (the index file).

It contains information about the files, which are "saved" in the volume sets:

what files are stored in the set
their checksums (MD5)
their size
their filename
a general comment for the set

Second there are the parity volume files. They are named .P00, .P01, P02..... PXX. (After .P99, there will be .Q00...)

They contain:

the same information as the PAR file (besides the comment)
the calculated parity data of the stored files

Format Specifications for a .PAR/.PXX File

The data is stored in ‘Intel Notation’.

Table 145. File Format Specification

Offset	Type (Size)	Data
0x0000	8 Bytes	Identification String: ‘PAR’\0 Comment: The string ‘PAR’ followed by 5 null bytes.
0x0008	1 QWord	Version Number The version number consists of two parts. The low Dword (bit 0..31) is the version - $00010000 for v01.00 The high Dword (bit 32..63) is a identifier for the generating program. The client can use it to see, which program generated the parity volume set. This may come in handy, if your program needs to know, if it was the generator itself. (Maybe to see, if it can use some proprietary bits in the status register or to notify, if new versions are out...) The coding is ‘program’-‘version’-‘subversion’-‘subsubversion’. ‘Program’ codes so far: 00 - Undefined 01 - Mirror 02 - PAR Example: Mirror 0.2.1 has $01000201 as program identifier.
0x0010	16 Bytes	Control Hash a MD5 hash of the rest of the file starting from 0x0020 to the end used to detect a corrupt file.
0x0020	16 Bytes	set hash used as an identifier for the parity volume set Creation: Make a array of bytes (one dimendion, size=used files*16). Then put the MD5 hashes (not the MD516k hashes) there, starting with the first file. Use only the files, which are included in the parity data (status bit 0 = 1). Then calculate the MD5 hash of this array.
0x0030	1 QWord	number of files the number of files stored in the parity volume set (only the input data files - the parity volumes are not counted) The files, which are not stored in the parity data, but in the file list for CRC checking, are counted too. So this is the number of files in the file list.
0x0040	1 QWord	start offset of the file list the start offset of the list of stored files
0x0048	1 QWord	file list size the size of the file list in bytes
0x0050	1 QWord	start offset of the data the start offset of the data area
0x0058	1 QWord	data size the size of the data area in bytes
0x0060	...	file list this is the start of the list of the saved files it consists of file entrys

Table 252. A File Entry has the following format:

Offset	Type (Size)	Data
0x0000	1 QWord	entry size the size of the file list entry in bytes counting starts at 1, this record is counted too
0x0008	1 QWord	status field 64 bits for status flags or numbers currently only bit 0 and 1 are used: bit0 = 0 - file is not saved in the parity volume set bit0 = 1 - file is saved in the parity volume set bit1 = 0 - file is not checked successfully yet bit1 = 1 - file was successfully checked
0x0010	1 QWord	size the size of the stored file in bytes
0x0018	16 Bytes	MD5 Hash a MD5 hash of the whole file
0x0028	16 Bytes	16k MD5 hash a MD5 hash of the first 16384 Bytes of the file if the file is smaller than 16k, only the exisdting bytes are used
0x0038	X Words	filename the full filename with extension in Unicode ("foobar.d01") paths are not allowed no end indicators, you have to compute the size out of the entrysize (chars = (entrysize - 0x38) DIV 2)

After the file list, the data area starts.

For the PAR file, the data area contains a comment for the volume set. The comment is stored in uniode, control characters are allowed.

For the Pxx files, the data area contains the checksum data.

Comupting of the checksum data:

You have n files and want to create m parity data. So you create a data field D1..Dm, where D1 is the first byte of file1, D2 is the first byte of file2, D3 is the first byte of file 3... Dn is the first byte of file n. For this field you create a field of m parity checksums C1..Cm. C1 is stored as the first parity byte of .p01, C2 is stored as the first parity byte of .p02, C3 is stored as the first parity byte of .p03... Cm is stored as the first parity byte of .pm.

Then process the second byte of all files and so on, until you reached the end of the biggest input file. If a file has no bytes left, use 0 for the calculation. So the size of the parity data will be the size of the biggest file stored in the parity volume set.

For calculating the checksums, the Reed-Solomon algorithm is used. A good description of this algorithm for programmers is http://www.cs.utk.edu/~plank/plank/papers/CS-96-332.html.

The "wordsize" is 8 bit, so you calculate with Bytes only. (This restricts the number of files + parity volumes to be less than 256.)

Comments

Why the 16k MD5 hashes?
One demanded feature is, that the program will find its files even if the filenames are changed. So maybe, if the program won't find the files due to renaming, you click the "search" button and then it checks the MD5 hashes of all files in the directory to find his files. But with large numbers of large files this will take a long time. With the 16k hashes, the program only needs to check the first 16k of each file. This will be much faster. If it finds a file, it can make a full hash too, just to be sure...
Why the file is saved - status flag?
Maybe the PAR file is only used to provide file comments and checksums. The files are not saved in a PXX volume then.
Why the file is checked - status flag?
So you can build in a function, that your program skips already successfully checked files. Repeated checking of the same set will be faster this way. (Like with QuickSFV)

Implementation Issues

A file present in the file list doesn't have to be stored in the parity data at all times. So, some (or all) files can be just for checksum control or file comments in the PAR file. (In case, only a .SFV or .NFO functionality is needed...)

The saved files can be of any type and size.

There are no informations about the parity volumes stored in the header. So the program has to scan the directory for present parity volumes. They can be identified with the set hash.

There is no explicit rule for the order of the entrys in the filelist, but I recommend to sort the filenames in ascending order.