Discussion:
[gentoo-portage-dev] [RFC] Improving Gentoo package format
Michał Górny
2018-11-10 13:09:03 UTC
Hi, everyone.

Gentoo's tbz2/xpak package format is quite old. We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design stayed
the same. I think we should consider changing it, for the reasons
outlined below.

The rough format description can be found in xpak(5). Basically, it's
a regular compressed tarball with a binary metadata blob appended
to the end. As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is an entirely custom format and needs dedicated tools
to manipulate.
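
For illustration, today the xpak segment can be split out and inspected
with the dedicated portage-utils tools (a rough sketch from memory;
exact flags may differ):

$ qtbz2 -s bar-1.tbz2      # split into bar-1.tar.bz2 + bar-1.xpak
$ qxpak -l bar-1.xpak      # list the metadata keys
$ qxpak -x bar-1.xpak USE  # extract a single key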


The current format has a few advantages that would probably be worth
preserving:

+ The binary package is a single flat file.

+ It is reasonably compatible with a regular compressed tarball, so
users can unpack it using standard tools (except for the metadata).

+ The metadata is uncompressed and can be quickly found without touching
the compressed data.

+ The metadata can be updated (e.g. as a result of a pkgmove) without
touching the compressed data.


However, it has a few disadvantages as well:

- The metadata is an entirely custom binary format, requiring dedicated
tools to read or edit.

- The metadata format relies on the customary behavior of compression
tools that ignore junk following the compressed data.

- By placing the metadata at the end of the file, we make it rather hard
to read the metadata from a remote location (via FTP, HTTP) without fetching
the whole file. [NB: it's technically possible but probably not worth
the effort]

- By requiring the custom format to be at the end of the file, we make it
impossible to trivially cover it with an OpenPGP signature without
introducing another custom format.

- While the format might allow for some extensibility, it's rather an
evolutionary dead end.


I think the key points of the new format should be:

1. It should reuse common file formats as much as possible, inventing
as little custom code as possible.

2. It should allow for easy introspection and editing by users without
dedicated tools.

3. The metadata should allow for lookup without fetching the whole
binary package.

4. The format should allow for some extensions without having to
reinvent the wheel every time.

5. It would be nice to preserve the existing advantages.


My proposal
===========

Basic format
------------
The base of the format is a regular compressed tarball. There's no junk
appended to it; instead, the metadata is stored inside it as
/var/db/pkg/${PF}. The contents are as compatible with the actual vdb
format as possible.
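
For illustration, listing such a package could look roughly like this
(hypothetical package and file set):

$ tar -tf bar-1.gpkg.tar.xz
var/db/pkg/bar-1/CONTENTS
var/db/pkg/bar-1/USE
var/db/pkg/bar-1/environment.bz2
usr/bin/bar
usr/share/man/man1/bar.1.bz2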

This has the following advantages:

+ Binary package is still stored as a single file.

+ It uses a standard compressed .tar format, with minimal customization.

+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).

+ If we can maintain a reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.


Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
could (ab)use the volume label field, e.g. use:

$ tar -V 'gpkg: app-foo/bar-1' -c ...

This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.
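
For instance, a magic(5) fragment along these lines ought to match it
(a sketch; note that for a compressed package, file(1) needs -z to look
inside the compression layer):

0       string  gpkg:   Gentoo binary package
>257    string  ustar   (POSIX tar)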

Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.


Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it. This problem can be
addressed by a few optimization tricks.

Firstly, all metadata files are packed into the archive before the data files.
With a slightly customized unpacker, we can stop decompressing as soon
as we're past metadata and avoid decompressing the whole archive. This
will also make it possible to read metadata from remote files without
fetching far past the compressed metadata block.
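
As a rough illustration of the remote case (a sketch assuming GNU tar,
an xz-compressed package and a server honoring range requests; the
64 KiB figure is arbitrary):

# fetch only the first 64 KiB and salvage the leading metadata members;
# xz and tar will complain about the truncated tail, hence 2>/dev/null
$ curl -s -r 0-65535 https://binhost.example/bar-1.gpkg.tar.xz \
    | xz -dcq 2>/dev/null \
    | tar -xf - --wildcards 'var/db/pkg/*' 2>/dev/null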

Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately. This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.
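
A minimal sketch of this trick (assuming xz, whose format permits
concatenated streams; GNU tar then needs -i/--ignore-zeros to read past
the first block's end-of-archive marker):

$ tar -cf - var/db/pkg/bar-1 | xz > metadata.tar.xz
$ tar -cf - usr | xz > data.tar.xz
$ cat metadata.tar.xz data.tar.xz > bar-1.gpkg.tar.xz

# the metadata block can be replaced without recompressing the data:
$ cat new-metadata.tar.xz data.tar.xz > bar-1.gpkg.tar.xz

# regular tools still extract the whole package:
$ xz -dc bar-1.gpkg.tar.xz | tar -xi -f -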

What's important is that both proposed tricks maintain backwards
compatibility with regular compressed tarballs. That is, the user will
still be able to extract it with regular archiving tools.


Adding OpenPGP signatures
-------------------------
This is the main open question here.

Technically, the most obvious solution is to cover the entire tarball
with an OpenPGP signature. However, this has the disadvantage that
the verification requires fetching the whole file.
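
For reference, the whole-file variant would be a plain detached
signature, e.g.:

$ gpg --detach-sign bar-1.gpkg.tar.xz   # writes bar-1.gpkg.tar.xz.sig
$ gpg --verify bar-1.gpkg.tar.xz.sig bar-1.gpkg.tar.xz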

I will look into the possibility of having partial signatures.
--
Best regards,
Michał Górny
Alec Warner
2018-11-10 14:37:39 UTC
Post by Michał Górny
Hi, everyone.
The Gentoo's tbz2/xpak package format is quite old. We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design stayed
the same. I think we should consider changing it, for the reasons
outlined below.
The rough format description can be found in xpak(5). Basically, it's
a regular compressed tarball with binary metadata blob appended
to the end. As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is entirely custom format and needs dedicated tools
to manipulate.
The current format has a few advantages whose preserving would probably
be worthwhile:
+ The binary package is a single flat file.
+ It is reasonably compatible with regular compressed tarball,
so the users can unpack it using standard tools (except for metadata).
+ The metadata is uncompressed and can be quickly found without touching
the compressed data.
+ The metadata can be updated (e.g. as result of pkgmove) without
touching the compressed data.
- The metadata is entirely custom binary format, requiring dedicated
tools to read or edit.
- The metadata format is relying on customary behavior of compression
tools that ignore junk following the compressed data.
I agree this is a problem in theory, but I haven't seen it as a problem in
practice. Have you observed any problems around this setup?
Post by Michał Górny
- By placing the metadata at the end of file, we make it rather hard to
read the metadata from remote location (via FTP, HTTP) without fetching
the whole file. [NB: it's technically possible but probably not worth
the effort]
- By requiring the custom format to be at the end of file, we make it
impossible to trivially cover it with a OpenPGP signature without
introducing another custom format.
It's trivial to cover with a detached sig, no?
Post by Michał Górny
- While the format might allow for some extensibility, it's rather
evolutionary dead end.
I'm not even sure how to quantify this; it just sounds like your subjective
opinion (which is fine, but it's not factual).
Post by Michał Górny
1. It should reuse common file formats as much as possible, with
inventing as little custom code as possible.
2. It should allow for easy introspection and editing by users without
dedicated tools.
So I'm less confident in the editing use cases; do users edit their binpkgs
on a regular basis?
Post by Michał Górny
3. The metadata should allow for lookup without fetching the whole
binary package.
4. The format should allow for some extensions without having to
reinvent the wheel every time.
5. It would be nice to preserve the existing advantages.
My proposal
===========
Basic format
------------
The base of the format is a regular compressed tarball. There's no junk
appended to it but the metadata is stored inside it as
/var/db/pkg/${PF}. The contents are as compatible with the actual vdb
format as possible.
Just to clarify, you are suggesting we store the metadata inside the
contents of the binary package itself (e.g. where the other files that get
merged to the liveFS are)? What about collisions?

E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
overwrite files in the VDB at qmerge time?
Post by Michał Górny
+ Binary package is still stored as a single file.
+ It uses a standard compressed .tar format, with minimal customization.
+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).
+ If we can maintain reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.
I'm not certain this is really desired.
Post by Michał Górny
Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
$ tar -V 'gpkg: app-foo/bar-1' -c ...
This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.
Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.
Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it. This problem can be
addressed by a few optimization tricks.
These performance goals seem a little bit ill-defined.

1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?

I could easily see a potential user with many large binpkgs, with the
current implementation causing them issues because they have to decompress
and seek a bunch to read the metadata out of their 1.2 GB binpkg. But I'm
pretty sure this isn't most users.
Post by Michał Górny
Firstly, all metadata files are packed to the archive before data files.
With a slightly customized unpacker, we can stop decompressing as soon
as we're past metadata and avoid decompressing the whole archive. This
will also make it possible to read metadata from remote files without
fetching far past the compressed metadata block.
So this seems to basically go against your goals of simple common tooling?
Post by Michał Górny
Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately. This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.
What's important is that both tricks proposed maintain backwards
compatibility with regular compressed tarballs. That is, the user will
still be able to extract it with regular archiving tools.
So my recollection is that Debian uses common-format ar files for the main
.deb.
Then they have 2 compressed tarballs, one for metadata, and one for data.

This format seems to jibe with many of your requirements:

- 'ar' can retrieve individual files from the archive.
- The deb file itself is not compressed, but the tarballs inside *are*
compressed.
- The metadata and data are compressed separately.
- Anyone can edit this with normal tooling (ar, tar)

In short: why should we invent a new format?
Post by Michał Górny
Adding OpenPGP signatures
-------------------------
This is the main XXX here.
Technically, the most obvious solution is to cover the entire tarball
with OpenPGP signature. However, this has the disadvantage that
the verification requires fetching the whole file.
I will look into possibility of having partial signatures.
--
Best regards,
Michał Górny
Zac Medico
2018-11-11 05:46:32 UTC
Post by Michał Górny
Hi, everyone.
The Gentoo's tbz2/xpak package format is quite old.  We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design stayed
the same.  I think we should consider changing it, for the reasons
outlined below.
The rough format description can be found in xpak(5).  Basically, it's
a regular compressed tarball with binary metadata blob appended
to the end.  As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is entirely custom format and needs dedicated tools
to manipulate.
The current format has a few advantages whose preserving would probably
be worthwhile:
+ The binary package is a single flat file.
+ It is reasonably compatible with regular compressed tarball,
so the users can unpack it using standard tools (except for metadata).
+ The metadata is uncompressed and can be quickly found without touching
the compressed data.
+ The metadata can be updated (e.g. as result of pkgmove) without
touching the compressed data.
- The metadata is entirely custom binary format, requiring dedicated
tools to read or edit.
- The metadata format is relying on customary behavior of compression
tools that ignore junk following the compressed data.
I agree this is a problem in theory, but I haven't seen it as a problem
in practice. Have you observed any problems around this setup?
In portage we use head -c to select the compressed data, since zstd
doesn't handle the xpak trailer well.
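
Roughly like this (a sketch; SIZE stands for the file size minus the
length of the xpak segment and its trailer, and the filename is
hypothetical):

$ head -c "$SIZE" bar-1.tbz2 | zstd -dc | tar -xf -
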
Post by Michał Górny
- By placing the metadata at the end of file, we make it rather hard to
read the metadata from remote location (via FTP, HTTP) without fetching
the whole file.  [NB: it's technically possible but probably not worth
the effort] 
- By requiring the custom format to be at the end of file, we make it
impossible to trivially cover it with a OpenPGP signature without
introducing another custom format.
Its trivial to cover with a detached sig, no?
 
- While the format might allow for some extensibility, it's rather
evolutionary dead end.
I'm not even sure how to quantify this, it just sounds like your
subjective opinion (which is fine, but its not factual.)
Yeah the xpak trailer is flexible enough, but I'm not opposed to
supporting a different format.
Post by Michał Górny
1. It should reuse common file formats as much as possible, with
inventing as little custom code as possible.
2. It should allow for easy introspection and editing by users without
dedicated tools.
So I'm less confident in the editing use cases; do users edit their
binpkgs on a regular basis?
Yes, gentoo/profiles/updates package renames and slot moves are a form of
this.
Post by Michał Górny
3. The metadata should allow for lookup without fetching the whole
binary package.
4. The format should allow for some extensions without having to
reinvent the wheel every time.
5. It would be nice to preserve the existing advantages.
My proposal
===========
Basic format
------------
The base of the format is a regular compressed tarball.  There's no junk
appended to it but the metadata is stored inside it as
/var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
format as possible.
Just to clarify, you are suggesting we store the metadata inside the
contents of the binary package itself (e.g. where the other files that
get merged to the liveFS are?) What about collisions?
E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
that already has 'machine-images/gentoo-disk-image-1.2.3' installed,
won't it overwrite files in the VDB at qmerge time?
I haven't looked into it but maybe we can use "nil control directory
names" to embed things, like http://savannah.gnu.org/projects/swbis
claims to use.
Post by Michał Górny
+ Binary package is still stored as a single file.
+ It uses a standard compressed .tar format, with minimal customization.
+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).
+ If we can maintain reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.
I'm not certain this is really desired.
Yeah, I don't like it either; I'd prefer to keep the metadata someplace
where it can't overwrite files in the installed package database.
Post by Michał Górny
Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
  $ tar -V 'gpkg: app-foo/bar-1' -c ...
This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.
Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.
Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it.  This problem can be
addressed by a few optimization tricks.
These performance goals seem a little bit ill defined.
1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?
Yeah I'd like more information here too.
Post by Michał Górny
Like I could easily see a potential user with many large binpkgs, and
the current implementation causing them issues because
they have to decompress and seek a bunch to read the metadata out of
their 1.2GB binpkg. But i'm pretty sure this isn't most users.
 
Firstly, all metadata files are packed to the archive before data files.
 With a slightly customized unpacker, we can stop decompressing as soon
as we're past metadata and avoid decompressing the whole archive.  This
will also make it possible to read metadata from remote files without
fetching far past the compressed metadata block.
So this seems to basically go against your goals of simple common tooling?
 
Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately.  This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.
What's important is that both tricks proposed maintain backwards
compatibility with regular compressed tarballs.  That is, the user will
still be able to extract it with regular archiving tools.
So my recollection is that debian uses common format AR files for the
main deb.
Then they have 2 compressed tarballs, one for metadata, and one for data.
 - 'ar' can retrieve individual files from the archive.
 - The deb file itself is not compressed, but the tarballs inside *are*
compressed.
 - The metadata and data are compressed separately.
 - Anyone can edit this with normal tooling (ar, tar)
In short: why should we invent a new format?
Maybe we can borrow some ideas from
http://savannah.gnu.org/projects/swbis, which claims to be capable of
creating and verifying a tarball with GPG signatures embedded in the
tarball.
Post by Michał Górny
Adding OpenPGP signatures
-------------------------
This is the main XXX here.
Technically, the most obvious solution is to cover the entire tarball
with OpenPGP signature.  However, this has the disadvantage that
the verification requires fetching the whole file.
I will look into possibility of having partial signatures.
--
Best regards,
Michał Górny
--
Thanks,
Zac
Michał Górny
2018-11-11 08:29:14 UTC
Post by Alec Warner
Post by Michał Górny
Hi, everyone.
The Gentoo's tbz2/xpak package format is quite old. We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design stayed
the same. I think we should consider changing it, for the reasons
outlined below.
The rough format description can be found in xpak(5). Basically, it's
a regular compressed tarball with binary metadata blob appended
to the end. As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is entirely custom format and needs dedicated tools
to manipulate.
The current format has a few advantages whose preserving would probably
be worthwhile:
+ The binary package is a single flat file.
+ It is reasonably compatible with regular compressed tarball,
so the users can unpack it using standard tools (except for metadata).
+ The metadata is uncompressed and can be quickly found without touching
the compressed data.
+ The metadata can be updated (e.g. as result of pkgmove) without
touching the compressed data.
- The metadata is entirely custom binary format, requiring dedicated
tools to read or edit.
- The metadata format is relying on customary behavior of compression
tools that ignore junk following the compressed data.
I agree this is a problem in theory, but I haven't seen it as a problem in
practice. Have you observed any problems around this setup?
Historically one of the parallel compressor variants did not support
this.
Post by Alec Warner
Post by Michał Górny
- By placing the metadata at the end of file, we make it rather hard to
read the metadata from remote location (via FTP, HTTP) without fetching
the whole file. [NB: it's technically possible but probably not worth
the effort]
- By requiring the custom format to be at the end of file, we make it
impossible to trivially cover it with a OpenPGP signature without
introducing another custom format.
Its trivial to cover with a detached sig, no?
Post by Michał Górny
- While the format might allow for some extensibility, it's rather
evolutionary dead end.
I'm not even sure how to quantify this, it just sounds like your subjective
opinion (which is fine, but its not factual.)
Post by Michał Górny
1. It should reuse common file formats as much as possible, with
inventing as little custom code as possible.
2. It should allow for easy introspection and editing by users without
dedicated tools.
So I'm less confident in the editing use cases; do users edit their binpkgs
on a regular basis?
It's useful for debugging stuff. I had to use hexedit on xpak
in the past. Believe me, it's nowhere close to pleasant.
Post by Alec Warner
Post by Michał Górny
3. The metadata should allow for lookup without fetching the whole
binary package.
4. The format should allow for some extensions without having to
reinvent the wheel every time.
5. It would be nice to preserve the existing advantages.
My proposal
===========
Basic format
------------
The base of the format is a regular compressed tarball. There's no junk
appended to it but the metadata is stored inside it as
/var/db/pkg/${PF}. The contents are as compatible with the actual vdb
format as possible.
Just to clarify, you are suggesting we store the metadata inside the
contents of the binary package itself (e.g. where the other files that get
merged to the liveFS are?) What about collisions?
E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
overwrite files in the VDB at qmerge time?
Portage will obviously move the files out, and process them as metadata.
The idea is precisely to use a directory that can't normally be part
of binary packages, so it can't cause collisions with real files (even if
such collisions are very unlikely to ever happen).
Post by Alec Warner
Post by Michał Górny
+ Binary package is still stored as a single file.
+ It uses a standard compressed .tar format, with minimal customization.
+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).
+ If we can maintain reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.
I'm not certain this is really desired.
Are you saying it's better that a user emergency-installs a package
without recording it in the vdb, and ends up with a mess of collisions
and untracked files?

Just because you don't like some use case doesn't mean it's not gonna
happen. Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.
Post by Alec Warner
Post by Michał Górny
Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
$ tar -V 'gpkg: app-foo/bar-1' -c ...
This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.
Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.
Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it. This problem can be
addressed by a few optimization tricks.
These performance goals seem a little bit ill defined.
1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?
Those are optimizations meant to avoid slowness compared to the current
format. The main use case is recreating the package index, which would
require rereading the metadata of all binary packages.
Post by Alec Warner
Like I could easily see a potential user with many large binpkgs, and the
current implementation causing them issues because
they have to decompress and seek a bunch to read the metadata out of their
1.2GB binpkg. But i'm pretty sure this isn't most users.
Post by Michał Górny
Firstly, all metadata files are packed to the archive before data files.
With a slightly customized unpacker, we can stop decompressing as soon
as we're past metadata and avoid decompressing the whole archive. This
will also make it possible to read metadata from remote files without
fetching far past the compressed metadata block.
So this seems to basically go against your goals of simple common tooling?
No. My goal is to make it compatible with simple common tooling. You
can still use the simple tooling to read/write them. The optimized
tools are only needed to efficiently handle special use cases.
Post by Alec Warner
Post by Michał Górny
Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately. This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.
What's important is that both tricks proposed maintain backwards
compatibility with regular compressed tarballs. That is, the user will
still be able to extract it with regular archiving tools.
So my recollection is that debian uses common format AR files for the main
deb.
Then they have 2 compressed tarballs, one for metadata, and one for data.
- 'ar' can retrieve individual files from the archive.
- The deb file itself is not compressed, but the tarballs inside *are*
compressed.
- The metadata and data are compressed separately.
- Anyone can edit this with normal tooling (ar, tar)
In short: why should we invent a new format?
Because nobody knows how to use 'ar', compared to how almost every
Gentoo user can use 'tar' immediately? Of course we could alternatively
just use a nested tarball but I wanted to keep the possibility
of actually being able to 'tar -xf' it without having to extract nested
archives.
--
Best regards,
Michał Górny
Rich Freeman
2018-11-11 10:56:16 UTC
Post by Michał Górny
Post by Alec Warner
Post by Michał Górny
+ If we can maintain reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.
IMO just overwriting vdb doesn't seem like a great idea. If somebody
does do a plain untar then the package manager has no ability to
sanitize anything going in, which means dealing with a potential mess
after the fact. Users wouldn't think to go messing with /var/db/pkg
on their own, but they certainly will be tempted to just untar a file.

Perhaps a package with the same name/version was already installed,
but the new files aren't the same as the old files. Now we have
orphans because the package manager never had a chance to clean up and
lost all its state regarding what was already there.

Plus, this is basically in-band signaling and that is the sort of
thing that usually ends up being regretted sooner or later.

I'm not sure if vdb is entirely optimal, but if we wanted to stick
with that metadata format, why not just stick it in a separate
tarball?
Post by Michał Górny
Are you saying it's better that user emergency-installs a package
without recording it in vdb, and ends up with mess of collisions
and untracked files?
IMO this is no different than a user unpacking any other random
tarball that goes and creates orphans. I think the better solution is
some kind of tool to cleanly install a tarball, assuming it doesn't
already exist. Short of turning /usr into a squashfs or whatever I'm
not sure any distro has a great solution for keeping users from
bypassing the package manager entirely.
Post by Michał Górny
Post by Alec Warner
In short; why should we event a new format?
Because nobody knows how to use 'ar', compared to how almost every
Gentoo user can use 'tar' immediately? Of course we could alternatively
just use a nested tarball but I wanted to keep the possibility
of actually being able to 'tar -xf' it without having to extract nested
archives.
IMO a nested tarball would be a better solution. I agree that ar is
obscure, and I don't see how it adds any value. If we were talking
about something going into a bootloader then optimizing for unpacking
efficiency might be a concern, but there is no reason not to use more
standard tools, unless there was something about the .deb format
itself we wanted to completely preserve (seems unlikely).

Overall though, I definitely think that a better binary format makes a
lot of sense, and I think you're on the right track.

One thing you didn't touch on is file naming. Right now two binary
packages with different USE/etc configurations are going to collide.
Would it make sense to toss in some kind of content hash of certain
metadata in the filename or something so that it would be much simpler
to host and auto-fetch binary packages? I realize this is going
beyond your initial scope, but if we wanted to do this it would be a
good time to do so.
--
Rich
Alec Warner
2018-11-11 13:43:21 UTC
Post by Michał Górny
Post by Alec Warner
Post by Michał Górny
Hi, everyone.
The Gentoo's tbz2/xpak package format is quite old. We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design stayed
the same. I think we should consider changing it, for the reasons
outlined below.
The rough format description can be found in xpak(5). Basically, it's
a regular compressed tarball with binary metadata blob appended
to the end. As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is entirely custom format and needs dedicated tools
to manipulate.
The current format has a few advantages whose preserving would probably
be worthwhile:
+ The binary package is a single flat file.
+ It is reasonably compatible with regular compressed tarball,
so the users can unpack it using standard tools (except for metadata).
+ The metadata is uncompressed and can be quickly found without touching
the compressed data.
+ The metadata can be updated (e.g. as result of pkgmove) without
touching the compressed data.
- The metadata is entirely custom binary format, requiring dedicated
tools to read or edit.
- The metadata format is relying on customary behavior of compression
tools that ignore junk following the compressed data.
I agree this is a problem in theory, but I haven't seen it as a problem in
practice. Have you observed any problems around this setup?
Historically one of the parallel compressor variants did not support
this.
Post by Alec Warner
Post by Michał Górny
- By placing the metadata at the end of file, we make it rather hard to
read the metadata from remote location (via FTP, HTTP) without fetching
the whole file. [NB: it's technically possible but probably not worth
the effort]
- By requiring the custom format to be at the end of file, we make it
impossible to trivially cover it with a OpenPGP signature without
introducing another custom format.
Its trivial to cover with a detached sig, no?
Post by Michał Górny
- While the format might allow for some extensibility, it's rather
evolutionary dead end.
I'm not even sure how to quantify this, it just sounds like your subjective
opinion (which is fine, but it's not factual).
Post by Michał Górny
1. It should reuse common file formats as much as possible, with
inventing as little custom code as possible.
2. It should allow for easy introspection and editing by users without
dedicated tools.
So I'm less confident in the editing use cases; do users edit their binpkgs
on a regular basis?
It's useful for debugging stuff. I had to use hexedit on xpak
in the past. Believe me, it's nowhere close to pleasant.
Post by Alec Warner
Post by Michał Górny
3. The metadata should allow for lookup without fetching the whole
binary package.
4. The format should allow for some extensions without having to
reinvent the wheel every time.
5. It would be nice to preserve the existing advantages.
My proposal
===========
Basic format
------------
The base of the format is a regular compressed tarball. There's no junk
appended to it but the metadata is stored inside it as
/var/db/pkg/${PF}. The contents are as compatible with the actual vdb
format as possible.
Just to clarify, you are suggesting we store the metadata inside the
contents of the binary package itself (e.g. where the other files that get
merged to the liveFS are?) What about collisions?
E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
overwrite files in the VDB at qmerge time?
Portage will obviously move the files out, and process them as metadata.
The idea is to precisely use a directory that can't be normally part
of binary packages, so can't cause collisions with real files (even if
they're very unlikely to ever happen).
Post by Alec Warner
Post by Michał Górny
+ Binary package is still stored as a single file.
+ It uses a standard compressed .tar format, with minimal customization.
+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).
+ If we can maintain reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.
I'm not certain this is really desired.
Are you saying it's better that user emergency-installs a package
without recording it in vdb, and ends up with mess of collisions
and untracked files?
Just because you don't like some use case doesn't mean it's not gonna
happen. Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.
I would argue for splitting the requirements into 3 bands:

1) Must do
2) Should do
3) Nice to have

To me, manually unpacking a tarball and having it recorded in the VDB is a
'nice to have' feature. If we can make it work, great.
I tend to lean with Rich here that recording the data in-band is risky.
There is also this premise that binpkgs can 'maintain VDB compatibility':
I could make a binpkg, wait 2 years, then install it, and we have to make
sure that everything still works.

IMHO it's a pretty high cost to pay (tight coupling) for what, to me, is a
nice-to-have feature.
Post by Michał Górny
Post by Alec Warner
Post by Michał Górny
Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
$ tar -V 'gpkg: app-foo/bar-1' -c ...
This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.
Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.
Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it. This problem can be
addressed by a few optimization tricks.
These performance goals seem a little bit ill defined.
1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?
Those are optimizations to not cause slowness compared to the current
format. Main use case is recreating package index which would require
rereading the metadata of all binary packages.
Post by Alec Warner
Like I could easily see a potential user with many large binpkgs, and the
current implementation causing them issues because
they have to decompress and seek a bunch to read the metadata out of their
1.2GB binpkg. But I'm pretty sure this isn't most users.
Post by Michał Górny
Firstly, all metadata files are packed to the archive before data files.
With a slightly customized unpacker, we can stop decompressing as soon
as we're past metadata and avoid decompressing the whole archive. This
will also make it possible to read metadata from remote files without
fetching far past the compressed metadata block.
So this seems to basically go against your goals of simple common
tooling?
No. My goal is to make it compatible with simple common tooling. You
can still use the simple tooling to read/write them. The optimized
tools are only needed to efficiently handle special use cases.
Post by Alec Warner
Post by Michał Górny
Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately. This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.
What's important is that both tricks proposed maintain backwards
compatibility with regular compressed tarballs. That is, the user will
still be able to extract it with regular archiving tools.
So my recollection is that debian uses common format AR files for the main
deb.
Then they have 2 compressed tarballs, one for metadata, and one for data.
- 'ar' can retrieve individual files from the archive.
- The deb file itself is not compressed, but the tarballs inside *are*
compressed.
- The metadata and data are compressed separately.
- Anyone can edit this with normal tooling (ar, tar)
In short: why should we invent a new format?
Because nobody knows how to use 'ar', compared to how almost every
Gentoo user can use 'tar' immediately? Of course we could alternatively
just use a nested tarball but I wanted to keep the possibility
of actually being able to 'tar -xf' it without having to extract nested
archives.
I think 'man ar' could help them pretty easily. That being said, I'm not
wedded to 'ar'; I'm just trying to show how this problem was solved in a
similar domain.
Post by Michał Górny
--
Best regards,
Michał Górny
Francesco Riosa
2018-11-11 16:05:37 UTC
Post by Michał Górny
[...]
Post by Alec Warner
Post by Michał Górny
My proposal
===========
Basic format
------------
The base of the format is a regular compressed tarball. There's no junk
appended to it but the metadata is stored inside it as
/var/db/pkg/${PF}. The contents are as compatible with the actual vdb
format as possible.
Just to clarify, you are suggesting we store the metadata inside the
contents of the binary package itself (e.g. where the other files that get
merged to the liveFS are?) What about collisions?
E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
overwrite files in the VDB at qmerge time?
Portage will obviously move the files out, and process them as metadata.
The idea is to precisely use a directory that can't be normally part
of binary packages, so can't cause collisions with real files (even if
they're very unlikely to ever happen).
Post by Alec Warner
Post by Michał Górny
+ Binary package is still stored as a single file.
+ It uses a standard compressed .tar format, with minimal customization.
+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).
+ If we can maintain reasonable level of vdb compatibility, the user can
even emergency-install a package without causing too much hassle (as it
will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.
I'm not certain this is really desired.
Are you saying it's better that user emergency-installs a package
without recording it in vdb, and ends up with mess of collisions
and untracked files?
Just because you don't like some use case doesn't mean it's not gonna
happen. Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.
Another option would be to install into a nearby but not overlapping
directory, for example:
/var/db/pkg/${PF}-binpkg

This way, the user who knows what to do with that data can play with it;
portage could also be instructed to stat() that directory and take action
(halt, maybe?) if present.
Duncan
2018-11-11 18:15:17 UTC
On Sun, 11 Nov 2018 at 09:29, Michał Górny
Post by Michał Górny
[...]
Post by Alec Warner
My proposal
===========
Basic format
------------
The base of the format is a regular compressed tarball.
There's no junk appended to it but the metadata is stored
inside it as /var/db/pkg/${PF}. The contents are as compatible
with the actual vdb format as possible.
Just to clarify, you are suggesting we store the metadata inside
the contents of the binary package itself (e.g. where the other
files that get merged to the liveFS are?) What about collisions?
E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
that already has 'machine-images/gentoo-disk-image-1.2.3' installed,
won't it overwrite files in the VDB at qmerge time?
Portage will obviously move the files out, and process them as metadata.
The idea is to precisely use a directory that can't be normally part
of binary packages, so can't cause collisions with real files (even if
they're very unlikely to ever happen).
Post by Alec Warner
+ Binary package is still stored as a single file.
Breaking these down into RFC-style MUST/SHOULD/MAY levels (as already
suggested elsewhere), for me, this is...

SHOULD/MAY

(Would be a MAY, nice to have, but the existing solution has it, thus
arguably raising the priority to SHOULD.)
Post by Michał Górny
Post by Alec Warner
+ It uses a standard compressed .tar format, with minimal
customization.
MUST

(Losing the existing functionality here would be horrible. FWIW I
routinely use binpkgs as a reference, for "clean" config files, comparing
install trees of old and new versions, etc. Having tools that allow
browsing standard compressed tar archives as virtual extensions to the
filesystem makes that a breeze. =:^)
Post by Michał Górny
Post by Alec Warner
+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).
MUST

(As pointed out, portage itself already does this when doing binpkg
moves, etc. Losing that would be horrible!)
Post by Michał Górny
Post by Alec Warner
+ If we can maintain reasonable level of vdb compatibility, the
user can even emergency-install a package without causing too much
hassle (as it will be recorded in vdb); ideally Portage would
detect this vdb entry and support fixing the install afterwards.
I'm not certain this is really desired.
SHOULD/MAY

(I'd say SHOULD, but while possible to emergency-install via untarring
now, portage doesn't do anything with it at all, so the detect and fix
functionality is a bonus, thus arguably lowering it to a MAY.)
Post by Michał Górny
Are you saying it's better that user emergency-installs a package
without recording it in vdb, and ends up with mess of collisions and
untracked files?
Just because you don't like some use case doesn't mean it's not gonna
happen. Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.
I think I've had to do this twice in ~1.5 decades, plus once reaching
into the tarball to extract a single file that was broken in a newly
installed glibc, breaking portage (and much of the system, but bunzip
still worked!), so I couldn't undo it using portage.

The first time I didn't know enough to clean up manually, but the second
time (and the reach-in time) I did. It'd *definitely* be nice to have
portage be able to clean up automatically.
Another option would be to install in a near but not overlapping directory,
/var/db/pkg/${PF}-binpkg
this way the user that know what to do with that data can play with it,
also portage could be instructed to stat() that directory and take
action (halt maybe?) if present.
Idea ++

Detect and fix has already been proposed, but detect and halt with an
error and a pointer to manual fix instructions is arguably already better
than the current behavior.

Which suggests an easy implementation split, delaying the "fix" step
until later, if it would complicate the initial implementation too much.

[Bikeshed] I was thinking binpkg-${PF} to emphasize the binpkg part and
group any emergency-installed packages together in an alphabetical
listing. But whichever's easiest for portage to work with, which
probably makes the -binpkg suffix version a better choice, requiring less
modification to existing code.


Is there any interest at all in binpkgs, perhaps when improved, from the
other PMs? Or are they effectively dead now or not interested in binpkgs
even if the format were to be improved, or simply too hard to work with?
Because "it'd be nice" (aka MAY level) to have this formally standardized
in PMS... if there's any interest from the other PMs.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
M. J. Everitt
2018-11-11 18:31:44 UTC
Post by Duncan
Is there any interest at all in binpkgs, perhaps when improved, from the
other PMs? Or are they effectively dead now or not interested in binpkgs
even if the format were to be improved, or simply too hard to work with?
Because "it'd be nice" (aka MAY level) to have this formally standardized
to PMS... if there's any interest from the other PMs.
Binpkgs are an important part of catalyst/releng stage-building runs, as
they allow portage to 'cache' a lot of the packages needed/used.

Binpkgs are also a popular component of a few downstream distros based on
Gentoo (thinking of Pentoo right now as an easy example).

So we don't want to break existing users of this format without considering
the ramifications for these scenarios, as you'll have some very grumpy devs...
Rich Freeman
2018-11-11 18:41:51 UTC
Post by M. J. Everitt
Binpkgs are also a popular component of a few downstream distro's based on
Gentoo (thinking pentoo right now as an easy example).
So we don't want to break existing users of this format without considering
the ramifications for these scenarios, as you'll have some very grumpy devs...
I'd argue that they'd be more important for Gentoo if they were more
useful. IMO the main limitation with them is the inability to
auto-download them from a repository, detecting the binpkg USE flags
BEFORE downloading. This is why I suggested hashing the USE flags or
similar and sticking that in the filename.

Obviously you can't host a repository with all the USE combinations.
However, you could have a reference repo and the package manager could
check it before doing a build. If you get a hit then you can install
the binpkg. If you don't then you can do a source build.

Portage already checks the USE flags inside the binpkg before merging
it and by default doesn't use a non-matching binpkg. The problem with
the current approach is:
1. You have to download the package to check this (could be a big file).
2. You can't host multiple versions of a binpkg with different USE
flags since the filenames collide.

I suggested a content hash because you can use it for an arbitrary
amount of metadata, vs. having to cram arch/USE/multilib (and I'm sure
something I'm missing) into a filename. Make the hash as short as is
economical - it isn't like we have THAT many permutations, the PM can
still check the internal metadata, and this isn't a security feature.
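
Something along these lines would do (a sketch; the inputs to hash and
the hash length are exactly the things that would need defining):

$ hash=$(printf '%s\n' "$ARCH" "$USE" | sha1sum | cut -c1-8)
$ mv bar-1.gpkg.tar.xz "bar-1-${hash}.gpkg.tar.xz"
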
--
Rich
M. J. Everitt
2018-11-11 19:02:34 UTC
Post by Rich Freeman
Post by M. J. Everitt
Binpkgs are also a popular component of a few downstream distro's based on
Gentoo (thinking pentoo right now as an easy example).
So we don't want to break existing users of this format without considering
the ramifications for these scenarios, as you'll have some very grumpy devs...
I'd argue that they'd be more important for Gentoo if they were more
useful. IMO the main limitation with them is the inability to
auto-download them from a repository, detecting the binpkg USE flags
BEFORE downloading. This is why I suggested hashing the USE flags or
similar and sticking that in the filename.
Obviously you can't host a repository with all the USE combinations.
However, you could have a reference repo and the package manager could
check it before doing a build. If you get a hit then you can install
the binpkg. If you don't then you can do a source build.
Portage already checks the USE flags inside the binpkg before merging
it and by default doesn't use a non-matching binpkg. The problem with
the current approach is:
1. You have to download the package to check this (could be a big file).
2. You can't host multiple versions of a binpkg with different USE
flags since the filenames collide.
I suggested a content hash because you can use it for an arbitrary
amount of metadata, vs having to cram arch/USE/multilib and I'm sure
something I'm missing into a filename. Make the hash as short as is
economical - it isn't like we have THAT many permutations, the PM can
still check the internal metadata, and this isn't a security feature.
If you can really present a decent argument for replicating the
functionality of other distros like Debian, Arch, Ubuntu etc., then let's
hear it. For now, the strength of Gentoo is being able to fully customise a
system to your own requirements, not being trapped by some distro
maintainer's arbitrary choices. Play to your USPs and strengths rather
than chasing rainbows...
Rich Freeman
2018-11-11 19:20:56 UTC
Post by M. J. Everitt
If you can really present a decent argument for replicating the
functionality of other distros like Debian, Arch, Ubuntu etc., then let's
hear it. For now, the strength of Gentoo is being able to fully customise a
system to your own requirements, not being trapped by some distro
maintainer's arbitrary choices. Play to your USP's and strengths rather
than chasing rainbows ..
Why do we support binary packages at all? Simple: compiling packages
is expensive, and if you happen to already have them compiled, fully
customized to your own requirements, then there is no point in
recompiling them. You're just spending a ton of resources to build
the exact same files you already have.

The only change I'm suggesting is that portage could take all the
configuration you're already supplying, and then optionally go see if
somebody you trust has already built the package that meets your
requirements. If so, then it would be downloaded and installed,
otherwise it would just compile from source.

You get the exact same files installed on your system either way.
--
Rich
M. J. Everitt
2018-11-11 19:25:54 UTC
Post by Rich Freeman
Post by M. J. Everitt
If you can really present a decent argument for replicating the
functionality of other distros like Debian, Arch, Ubuntu etc., then let's
hear it. For now, the strength of Gentoo is being able to fully customise a
system to your own requirements, not being trapped by some distro
maintainer's arbitrary choices. Play to your USP's and strengths rather
than chasing rainbows ..
Why do we support binary packages at all? Simple: compiling packages
is expensive, and if you happen to already have them compiled, fully
customized to your own requirements, then there is no point in
recompiling them. You're just spending a ton of resources to build
the exact same files you already have.
The only change I'm suggesting is that portage could take all the
configuration you're already supplying, and then optionally go see if
somebody you trust has already built the package that meets your
requirements. If so, then it would be downloaded and installed,
otherwise it would just compile from source.
You get the exact same files installed on your system either way.
OK, so I get the principle, but who's gonna provide the tools to make this
feasible, and perhaps more interestingly, who's going to curate, provide,
host and maintain the binpkg repos you propose? We barely have enough
developers to maintain a working source package repository, let alone
add new distro "features"... unless perhaps you have a few hours every
week to spare?

I see no sense in reinventing the wheel here, besides #thegentooway....
Alec Warner
2018-11-11 19:31:06 UTC
Post by M. J. Everitt
Post by M. J. Everitt
If you can really present a decent argument for replicating the
functionality of other distros like Debian, Arch, Ubuntu etc., then let's
hear it. For now, the strength of Gentoo is being able to fully customise a
system to your own requirements, not being trapped by some distro
maintainer's arbitrary choices. Play to your USP's and strengths rather
than chasing rainbows ..
Why do we support binary packages at all? Simple: compiling packages
is expensive, and if you happen to already have them compiled, fully
customized to your own requirements, then there is no point in
recompiling them. You're just spending a ton of resources to build
the exact same files you already have.
The only change I'm suggesting is that portage could take all the
configuration you're already supplying, and then optionally go see if
somebody you trust has already built the package that meets your
requirements. If so, then it would be downloaded and installed,
otherwise it would just compile from source.
You get the exact same files installed on your system either way.
I think this conversation is a bit off track. I'm not saying this isn't a
great idea, but I think it's very orthogonal to the binpkg format itself.

For example, the binhost pkg index file can contain this metadata and
portage can be designed to fetch the binpkg index metadata and do matching
(afaik it already does this; it just needs extending with more metadata.)
The binpkg format itself seems not too relevant to this.

-A
Michał Górny
2018-11-11 20:53:33 UTC
Permalink
Hi,

Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.

The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):

1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).

2. "metadata.tar${comp}" tarball containing binary package metadata
as files.

3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.

4. "contents.tar${comp}" tarball containing files to be installed.

5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.

Notes:

a. ${comp} can be any compression format supported by binary packages.
Technically, metadata and content archives may use different
compression. Either or both may be uncompressed as well.

b. While signatures are optional, the PM should have a switch
controlling whether to expect them, and fail hard if they're not present
when expected.


Advantages
----------
Guaranteed:

+ The binary package is still one file, so can be fetched easily.

+ File format is trivial and can be extracted using tar(1) + compressor.

+ The metadata and contents are compressed independently, and so can be
easily extracted or modified independently.

+ The package format provides for separate metadata and content
signatures, so they can be verified independently.

+ Metadata can be compressed now.

Achieved by regular archives (but might be broken if modified by the user):

+ Easy recognition by magic(1).

+ The metadata archive (and its signature) is packed first, so it may be
read without fetching the whole binpkg.


Why not .ar format?
-------------------
The use of the .ar format has been proposed, as in Debian. While
the option is mostly feasible, and the simplicity of the .ar format would
reduce the outer size of binary packages, I think the format is simply
too obscure. It lives on mostly as a static library format, and the tooling
for it is part of binutils. The LSB considers it deprecated. While I don't
see it going away anytime soon, I'd rather not rely on it in order to
save a few KiB.


Is there anything left to address?
--
Best regards,
Michał Górny
Michał Górny
2018-11-11 21:17:31 UTC
Permalink
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
a. ${comp} can be any compression format supported by binary packages.
Technically, metadata and content archives may use different
compression. Either or both may be uncompressed as well.
b. While signatures are optional, the PM should have a switch
controlling whether to expect them, and fail hard if they're not present
when expected.
Advantages
----------
+ The binary package is still one file, so can be fetched easily.
+ File format is trivial and can be extracted using tar(1) + compressor.
+ The metadata and contents are compressed independently, and so can be
easily extracted or modified independently.
+ The package format provides for separate metadata and content
signatures, so they can be verified independently.
+ Metadata can be compressed now.
+ Easy recognition by magic(1).
+ The metadata archive (and its signature) is packed first, so it may be
read without fetching the whole binpkg.
Why not .ar format?
-------------------
The use of the .ar format has been proposed, as in Debian. While
the option is mostly feasible, and the simplicity of the .ar format would
reduce the outer size of binary packages, I think the format is simply
too obscure. It lives on mostly as a static library format, and the tooling
for it is part of binutils. The LSB considers it deprecated. While I don't
see it going away anytime soon, I'd rather not rely on it in order to
save a few KiB.
Is there anything left to address?
Hmm, I've missed one disadvantage compared to xpak and v1: at least with
the standard tools, we can't build the binary package on the fly without
creating temporary archives (and therefore duplicating disk space use).

In other words, xpak and v1 formats made it possible to tar
the installation image straight to the new package.

The v2 format requires creating "contents.tar${comp}" first, and then
creating the actual binary package with it. I don't think we can avoid
this without creating a custom .tar writing tool that supports adding
data on-the-fly (e.g. by writing the file data, then seeking back to
update the size record).

Of course, one option would be to use ZIP ;-).
--
Best regards,
Michał Górny
Francesco Riosa
2018-11-12 00:21:36 UTC
Permalink
[...]
Of course, one option would be to use ZIP ;-).
Zip archives have another big advantage; there is an index of files, so
listing the archive contents and extracting a single file is very fast and
does not depend on its position in the archive.
The big disadvantage is that only the "desktop" profile has unzip by default.

Best regards,
-Francesco
Michał Górny
2018-11-12 15:40:51 UTC
Permalink
Post by Francesco Riosa
[...]
Of course, one option would be to use ZIP ;-).
I wasn't serious there.
Post by Francesco Riosa
Zip archives have another big advantage; there is an index of files, so
listing the archive contents and extracting a single file is very fast and
does not depend on its position in the archive.
The big disadvantage is that only the "desktop" profile has unzip by default.
The two main problems with ZIP are that:

1. As you noted, it's not present in core system packages.

2. It uses a trailer format, which means that you need to fetch the whole
file before being able to process it.

There was also some patent hassle back in the day but I think it's no
longer applicable today.
--
Best regards,
Michał Górny
Francesco Riosa
2018-11-13 01:03:16 UTC
Permalink
Post by Michał Górny
Post by Francesco Riosa
[...]
Of course, one option would be to use ZIP ;-).
I wasn't serious there.
Post by Francesco Riosa
Zip archives have another big advantage; there is an index of files, so
listing the archive contents and extracting a single file is very fast and
does not depend on its position in the archive.
The big disadvantage is that only the "desktop" profile has unzip by default.
1. As you noted, it's not present in core system packages.
2. It uses a trailer format, which means that you need to fetch the whole
file before being able to process it.
Well, with some protocols (HTTP/1.1 range requests) and with a well-behaving
server (Content-Length) this is doable, but limited indeed.
However, the same is true for the tarball if the order of the contained
files is uncertain.

Since a tar file is sequential, even listing the contained files requires
seeking through the tarball, and the number of seeks depends on the
number of files included in the tarball before the wanted one.
Post by Michał Górny
There was also some patent hassle back in the day but I think it's no
longer applicable today.
Fabian Groffen
2018-11-12 16:51:03 UTC
Permalink
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
a. ${comp} can be any compression format supported by binary packages.
Technically, metadata and content archives may use different
compression. Either or both may be uncompressed as well.
I'm wondering here, how much sense does it make to compress 2., 3.
and/or 4. if you compress the whole gpkg? I have the impression
compression on compression isn't beneficial here. Shouldn't just
compressing the gpkg tar be sufficient?

As to allowing different compressors for a single gpkg, I think it would
be better to require all compressors to be the same, such that a PM or
tool can quickly see if it can "read" the file from the gpkg filename,
instead of having to fetch and open it first. Obviously, if you drop
compression of the inner tars, this point goes away.

Thanks,
Fabian
Post by Michał Górny
b. While signatures are optional, the PM should have a switch
controlling whether to expect them, and fail hard if they're not present
when expected.
Advantages
----------
+ The binary package is still one file, so can be fetched easily.
+ File format is trivial and can be extracted using tar(1) + compressor.
+ The metadata and contents are compressed independently, and so can be
easily extracted or modified independently.
+ The package format provides for separate metadata and content
signatures, so they can be verified independently.
+ Metadata can be compressed now.
+ Easy recognition by magic(1).
+ The metadata archive (and its signature) is packed first, so it may be
read without fetching the whole binpkg.
Why not .ar format?
-------------------
The use of the .ar format has been proposed, as in Debian. While
the option is mostly feasible, and the simplicity of the .ar format would
reduce the outer size of binary packages, I think the format is simply
too obscure. It lives on mostly as a static library format, and the tooling
for it is part of binutils. The LSB considers it deprecated. While I don't
see it going away anytime soon, I'd rather not rely on it in order to
save a few KiB.
Is there anything left to address?
--
Best regards,
Michał Górny
--
Fabian Groffen
Gentoo on a different level
Michał Górny
2018-11-12 16:59:00 UTC
Permalink
Post by Fabian Groffen
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
a. ${comp} can be any compression format supported by binary packages.
Technically, metadata and content archives may use different
compression. Either or both may be uncompressed as well.
I'm wondering here, how much sense does it make to compress 2., 3.
and/or 4. if you compress the whole gpkg? I have the impression
compression on compression isn't beneficial here. Shouldn't just
compressing the gpkg tar be sufficient?
Please read the spec again. It explicitly says it's not compressed.
--
Best regards,
Michał Górny
Ulrich Mueller
2018-11-12 17:33:44 UTC
Permalink
Post by Michał Górny
Post by Fabian Groffen
I'm wondering here, how much sense does it make to compress 2., 3.
and/or 4. if you compress the whole gpkg? I have the impression
compression on compression isn't beneficial here. Shouldn't just
compressing the gpkg tar be sufficient?
Please read the spec again. It explicitly says it's not compressed.
Isn't that the wrong way around? The tar format contains a lot of
padding, so using uncompressed tar for the outer archive would be
somewhat wasteful. Why not leave the inner tar files uncompressed, but
compress the whole binpkg instead?

Also, what would be wrong with ar? It's a standard POSIX tool, and
should be available everywhere.

Ulrich
Michał Górny
2018-11-12 18:00:46 UTC
Permalink
Post by Ulrich Mueller
Post by Michał Górny
Post by Fabian Groffen
I'm wondering here, how much sense does it make to compress 2., 3.
and/or 4. if you compress the whole gpkg? I have the impression
compression on compression isn't beneficial here. Shouldn't just
compressing the gpkg tar be sufficient?
Please read the spec again. It explicitly says it's not compressed.
Isn't that the wrong way around? The tar format contains a lot of
padding, so using uncompressed tar for the outer archive would be
somewhat wasteful. Why not leave the inner tar files uncompressed, but
compress the whole binpkg instead?
Uncompressed tar is mostly suitable for random access. Compressed tar
isn't suitable for random access at all.

With uncompressed tar, it's trivial to access one of the members. With
compressed tar, you always end up decompressing everything.

With uncompressed tar, it's easy to rewrite the metadata (read: apply
package updates) without updating the rest. With compressed tar, you'd
have to recompress all the huge packages in order to apply updates.
Post by Ulrich Mueller
Also, what would be wrong with ar? It's a standard POSIX tool, and
should be available everywhere.
The original post says what's wrong with ar. Please be more specific if
you disagree with it.
--
Best regards,
Michał Górny
Ulrich Mueller
2018-11-12 20:23:49 UTC
Permalink
Post by Michał Górny
Post by Ulrich Mueller
Also, what would be wrong with ar? It's a standard POSIX tool, and
should be available everywhere.
The original post says what's wrong with ar. Please be more specific
if you disagree with it.
AFAICS, the arguments are that ar would be obscure, and that the LSB
considers it deprecated. I don't find either of them convincing.
Since when do we care about the LSB?

Ulrich
Alec Warner
2018-11-12 20:59:54 UTC
Permalink
Post by Ulrich Mueller
Post by Michał Górny
Post by Ulrich Mueller
Also, what would be wrong with ar? It's a standard POSIX tool, and
should be available everywhere.
The original post says what's wrong with ar. Please be more specific
if you disagree with it.
AFAICS, the arguments are that ar would be obscure, and that the LSB
considers it deprecated. I don't find either of them convincing.
Since when do we care about the LSB?
I assert that it doesn't matter which tool we pick, so we have arbitrarily
chosen tar because we like it.

If you have a basis for preferring ar over tar, I'd love to hear it. I only
brought it up because I know Debian uses it.

-A
Post by Ulrich Mueller
Ulrich
Michał Górny
2018-11-12 21:29:30 UTC
Permalink
Post by Ulrich Mueller
Post by Michał Górny
Post by Ulrich Mueller
Also, what would be wrong with ar? It's a standard POSIX tool, and
should be available everywhere.
The original post says what's wrong with ar. Please be more specific
if you disagree with it.
AFAICS, the arguments are that ar would be obscure, and that the LSB
considers it deprecated. I don't find either of them convincing.
Since when do we care about the LSB?
Do you have a convincing argument for using ar?

I think it's quite obvious that tar is the only sane choice for
the inner archive format since we need to preserve permissions,
ownership etc. ar can't do it.

Once tar is used for the inner archive format, it is also a natural choice
for the outer format. If you believe we should use another format, that
is, introduce a second distinct archive format and depend on a second
tool, you need a good justification for it.

So yes, ar is an option, as well as cpio. In both cases the format is
simpler (yet obscure), and the files are smaller. But does that justify
using a second tool that serves the same purpose as tar, given that tar
works and we need to use it anyway? Even if we skip the fact that ar is
bundled as part of binutils rather than as a stand-alone archiver, we're
introducing the unnecessary complexity of learning a second tool.
And both ar(1) and cpio(1) have weird CLIs compared to tar(1).

Plus, ar apparently doesn't support directories, so we end up adding
extra complexity to get it unpacked sanely.

For the record, I did a little experiment and here are the results:

-rw-r--r-- 1 mgorny mgorny 112928836 11-12 22:13 wine-any-3.20-1.gpkg.ar
-rw-r--r-- 1 mgorny mgorny 112929280 11-12 22:21 wine-any-3.20-1.gpkg.cpio
-rw-r--r-- 1 mgorny mgorny 112936960 11-12 22:11 wine-any-3.20-1.gpkg.tar

So yes, we are saving around 8 KiB... out of 108 MiB. Of course,
the savings may become relevant in case of tiny archives but do we
really need to be concerned about that?

The whole point of the proposal is to make the format simpler and easier to
introspect and modify. I believe limiting the number of formats
in use certainly serves that purpose, while starting to depend on obscure
tools in order to save 8 KiB is a case of premature optimization.
--
Best regards,
Michał Górny
Ulrich Mueller
2018-11-12 23:45:13 UTC
Permalink
Post by Michał Górny
Once tar is used for the inner archive format, it is also a natural choice
for the outer format. If you believe we should use another format, that
is, introduce a second distinct archive format and depend on a second
tool, you need a good justification for it.
Right, that's a better reason. :)
Post by Michał Górny
So yes, ar is an option, as well as cpio. In both cases the format is
simpler (yet obscure), and the files are smaller. But does that justify
using a second tool that serves the same purpose as tar, given that tar
works and we need to use it anyway? Even if we skip the fact that ar is
bundled as part of binutils rather than as a stand-alone archiver, we're
introducing the unnecessary complexity of learning a second tool.
And both ar(1) and cpio(1) have weird CLIs compared to tar(1).
cpio is not feasible because of file size limitations (4 GiB IIRC).

Ulrich
Michał Górny
2018-11-13 08:59:02 UTC
Permalink
Post by Ulrich Mueller
Post by Michał Górny
Once tar is used for the inner archive format, it is also a natural choice
for the outer format. If you believe we should use another format, that
is, introduce a second distinct archive format and depend on a second
tool, you need a good justification for it.
Right, that's a better reason. :)
Post by Michał Górny
So yes, ar is an option, as well as cpio. In both cases the format is
simpler (yet obscure), and the files are smaller. But does that justify
using a second tool that serves the same purpose as tar, given that tar
works and we need to use it anyway? Even if we skip the fact that ar is
bundled as part of binutils rather than as a stand-alone archiver, we're
introducing the unnecessary complexity of learning a second tool.
And both ar(1) and cpio(1) have weird CLIs compared to tar(1).
cpio is not feasible because of file size limitations (4 GiB IIRC).
FWICS, ar has a limit of 10 decimal digits, so around 9.3 GiB.
--
Best regards,
Michał Górny
Zac Medico
2018-11-13 17:49:32 UTC
Permalink
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We'll want to access "contents.tar${comp}.sig" very early, but in the
absence of an index containing offsets, normally we'd have to read all
of "contents.tar${comp}" first. However, I suppose we could search
backwards for the "contents.tar${comp}.sig" entry.
--
Thanks,
Zac
Zac Medico
2018-11-13 17:53:48 UTC
Permalink
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We'll want to access "contents.tar${comp}.sig" very early, but in the
absence of an index containing offsets, normally we'd have to read all
of "contents.tar${comp}" first. However, I suppose we could search
backwards for the "contents.tar${comp}.sig" entry.
We could solve this problem by adding an index file containing offsets
as the last file in the outer tar file.
--
Thanks,
Zac
Zac Medico
2018-11-13 18:03:18 UTC
Permalink
Post by Zac Medico
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We'll want to access "contents.tar${comp}.sig" very early, but in the
absence of an index containing offsets, normally we'd have to read all
of "contents.tar${comp}" first. However, I suppose we could search
backwards for the "contents.tar${comp}.sig" entry.
We could solve this problem by adding an index file containing offsets
as the last file in the outer tar file.
Actually, the tar entry for "contents.tar${comp}" should contain the
length, so we should be able to efficiently seek to the
"contents.tar${comp}.sig" entry. So, an index is not really needed.
--
Thanks,
Zac
Michał Górny
2018-11-13 18:18:39 UTC
Permalink
Post by Zac Medico
Post by Zac Medico
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We'll want to access "contents.tar${comp}.sig" very early, but in the
absence of an index containing offsets, normally we'd have to read all
of "contents.tar${comp}" first. However, I suppose we could search
backwards for the "contents.tar${comp}.sig" entry.
We could solve this problem by adding an index file containing offsets
as the last file in the outer tar file.
Actually, the tar entry for "contents.tar${comp}" should contain the
length, so we should be able to efficiently seek to the
"contents.tar${comp}.sig" entry. So, an index is not really needed.
Precisely. Tar is pretty much a linked-list, so seeking is quite
efficient, as long as it's not compressed.
--
Best regards,
Michał Górny
Michał Górny
2018-11-13 18:22:59 UTC
Permalink
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We'll want to access "contents.tar${comp}.sig" very early, but in the
absence of an index containing offsets, normally we'd have to read all
of "contents.tar${comp}" first. However, I suppose we could search
backwards for the "contents.tar${comp}.sig" entry.
Hmm, I guess in order to verify the file without actually having to
unpack it first? Yes, I suppose packing signatures before actual files
may make sense, and that we should be able to verify-then-unpack those
data files without actually extracting them from the archive.
--
Best regards,
Michał Górny
Zac Medico
2018-11-13 18:50:54 UTC
Permalink
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We need to establish the procedure for signature verification of the
files in "contents.tar${comp}" at any point in the future *after* they
have been installed. In order to identify corruption of a particular
installed file, we'll need separate digests for each of the installed
files, and a signature covering the separate digests.
--
Thanks,
Zac
Zac Medico
2018-11-13 18:55:02 UTC
Permalink
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We need to establish the procedure for signature verification of the
files in "contents.tar${comp}" at any point in the future *after* they
have been installed. In order to identify corruption of a particular
installed file, we'll need separate digests for each of the installed
files, and a signature covering the separate digests.
We need separate digests for the files in "metadata.tar${comp}" too, for
the same reason. Note that environment.bz2 is mutable because it is
deserialized/reserialized for each pkg_* phase. If the installation
process has access to a trusted signing key, it can sign environment.bz2
after each mutation.
--
Thanks,
Zac
Michał Górny
2018-11-13 19:10:50 UTC
Permalink
Post by Zac Medico
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We need to establish the procedure for signature verification of the
files in "contents.tar${comp}" at any point in the future *after* they
have been installed. In order to identify corruption of a particular
installed file, we'll need separate digests for each of the installed
files, and a signature covering the separate digests.
We need separate digests for the files in "metadata.tar${comp}" too, for
the same reason. Note that environment.bz2 is mutable because it is
deserialized/reserialized for each pkg_* phase. If the installation
process has access to a trusted signing key, it can sign environment.bz2
after each mutation.
Most of the other metadata is also mutable because of package moves.
--
Best regards,
Michał Górny
Michał Górny
2018-11-13 19:11:42 UTC
Permalink
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We need to establish the procedure for signature verification of the
files in "contents.tar${comp}" at any point in the future *after* they
have been installed. In order to identify corruption of a particular
installed file, we'll need separate digests for each of the installed
files, and a signature covering the separate digests.
I should note that package contents are strongly mutable in Gentoo --
preinst/postinst, instprep, custom hooks...
--
Best regards,
Michał Górny
Zac Medico
2018-11-13 20:19:35 UTC
Permalink
Post by Michał Górny
Post by Zac Medico
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
We need to establish the procedure for signature verification of the
files in "contents.tar${comp}" at any point in the future *after* they
have been installed. In order to identify corruption of a particular
installed file, we'll need separate digests for each of the installed
files, and a signature covering the separate digests.
I should note that package contents are strongly mutable in Gentoo --
preinst/postinst, instprep, custom hooks...
It should be limited to a small subset of files. Maybe at some point we
can introduce a helper that installation processes can use to sign
modified files.
--
Thanks,
Zac
Michał Górny
2018-11-14 20:57:21 UTC
Permalink
Post by Michał Górny
Hi,
Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.
The outer tarball is uncompressed and uses '.gpkg.tar' suffix. It
contains (preferably in order but PM should also handle packages with
mismatched order):
1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).
2. "metadata.tar${comp}" tarball containing binary package metadata
as files.
3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.
4. "contents.tar${comp}" tarball containing files to be installed.
5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.
a. ${comp} can be any compression format supported by binary packages.
Technically, metadata and content archives may use different
compression. Either or both may be uncompressed as well.
b. While signatures are optional, the PM should have a switch
controlling whether to expect them, and fail hard if they're not present
when expected.
Advantages
----------
+ The binary package is still one file, so can be fetched easily.
+ File format is trivial and can be extracted using tar(1) + compressor.
+ The metadata and contents are compressed independently, and so can be
easily extracted or modified independently.
+ The package format provides for separate metadata and content
signatures, so they can be verified independently.
+ Metadata can be compressed now.
+ Easy recognition by magic(1).
+ The metadata archive (and its signature) is packed first, so it may be
read without fetching the whole binpkg.
Why not .ar format?
-------------------
The use of the .ar format has been proposed, as in Debian. While
the option is mostly feasible, and the simplicity of the .ar format would
reduce the outer size of binary packages, I think the format is simply
too obscure. It lives on mostly as a static library format, and the tooling
for it is part of binutils. The LSB considers it deprecated. While I don't
see it going away anytime soon, I'd rather not rely on it in order to
save a few KiB.
Is there anything left to address?
Here's a quick & dirty xpak2gpkg converter:

https://gist.github.com/mgorny/cca78fb93f14aad11f43abe352caad06

It can be used to try out the format practically and flesh out
the details before we go for a formal spec.
--
Best regards,
Michał Górny