This is a discussion of data packages, as they are currently implemented in nibabel / nipy.
This API proved to be very uncomfortable, and we intend to replace it fairly soon. See data_packages.rst in the nibabel wiki for our current thinking, not yet implemented.
When developing or using nipy, many data files can be useful.
We need some standard way to provide the larger data sets. To do this, we are here defining the idea of a data package. This document is a draft specification of what a data package looks like and how to use it.
This section needs some healthy beating to make the ideas clearer. However, in the interests of the 0SAGA software model, here are some ideas that may be separable.
This idea is rather difficult to define, but is a bit like a data project, that is, a set of information that the packager believed had something in common. The package then is an abstract idea, and what is in the package could change completely over the course of the life of the package. The package then is a little bit like a namespace, having itself no content other than a string (the package name) and the data it contains.
This is a string that gives a name to the package.
By instantiation we mean some particular actual set of data for a particular package. By actual, we mean stuff that can be read as bytes. As we add and remove data from the package, the instantiation changes. In version control, the instantiation would be the particular state of the working tree at any moment, whether this has been committed or not.
It might not be enjoyable, but we’ll call a package instantiation a pinstance.
A revision is an instantiation of the working tree that has a unique label - the revision id.
The revision id is a string that identifies a particular pinstance. This is the equivalent of the revision number in subversion, or the commit hash in systems like git or mercurial. There is only one pinstance for any given revision id, but there can be more than one revision id for a pinstance. For example, you might have a revision of id ‘200’, delete a file, restore the file, call this revision id ‘201’, but they might both refer to the same instantiation of the package. Or they might not, that’s up to you, the author of the package.
A tag is a memorable string that refers to a particular pinstance. It differs from a revision id only in that there is not likely to be a tag for every revision. It’s possible to imagine pinstances without a revision id but with a tag, but perhaps it’s reasonable to restrict tags to refer to revisions. A tag is equivalent to a tag name in git or mercurial - a memorable string that refers to a static state of the data. An example might be a numbered version. So, a package may have a revision uniquely identified by a revision id af5bd6. We might decide to label this revision release-0.3 (the equivalent of applying a git tag). release-0.3 is the tag and af5bd6 is the revision id. Different sources of the same package might possibly produce different tags [1].
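The relationship between tags, revision ids and tag sources can be sketched as a small lookup table. This is only an illustration: the source names and the second revision id (`c01d9e`) are made up, and nothing here is an existing nibabel or nipy API.

```python
# Hypothetical tag tables from two independent sources of one package.
# Each table maps a tag (memorable name) to a revision id.
tags_by_source = {
    "source-a": {"release-0.3": "af5bd6"},
    "source-b": {"release-0.3": "c01d9e"},  # made-up revision id
}


def resolve_tag(source, tag):
    """Return the revision id that `source` attaches to `tag`, or None."""
    return tags_by_source.get(source, {}).get(tag)


def sources_agree(tag):
    """True if every source that knows `tag` maps it to one revision id."""
    revisions = {
        tags[tag] for tags in tags_by_source.values() if tag in tags
    }
    return len(revisions) <= 1
```

Because a tag is resolved relative to a source, a query for a tagged pinstance probably needs to carry the source as well as the tag.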
A pinstance might also have a version. A version is just a tag that can be compared using some algorithm.
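As an illustration of "a tag that can be compared", here is one possible comparison algorithm. It assumes that a version tag ends in a dotted number, optionally after a hyphenated prefix, as in release-0.3. That convention, and the function names, are assumptions of this sketch rather than part of any specification.

```python
import re


def version_key(tag):
    """Turn a tag such as 'release-0.10' into a sortable tuple of ints.

    Returns None when the tag does not end in a dotted number, in which
    case it is treated as a plain tag rather than a version.
    """
    match = re.search(r"(?:^|-)(\d+(?:\.\d+)*)$", tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_newer(tag_a, tag_b):
    """True if tag_a names a later version than tag_b."""
    key_a, key_b = version_key(tag_a), version_key(tag_b)
    if key_a is None or key_b is None:
        raise ValueError("both tags must be versions to be compared")
    return key_a > key_b
```

Comparing integer tuples rather than strings means that release-0.10 sorts after release-0.3, which a plain string comparison would get wrong.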
Maybe we could call this a “prundle”.
The provider bundle is something that can deliver the bytes of a particular pinstance. For example, if you have a package named “interesting-images”, you might have a revision of that package identified by revision id “f745dc2” and tagged with “version-0.2”. There might be a provider bundle of that instantiation that is a zipfile interesting-images-version-0.2.zip. There might also be a directory on an http server with the same contents http://my.server.org/packages/interesting-images/version-0.2. The zipfile and the http directory would both be provider bundles of the particular instantiation. When I unpack the zipfile onto my hard disk, I might have a directory /my/home/packages/interesting-images/version-0.2. Now this path is a provider bundle.
In the example above, the zipfile, the http directory and the local path are three different provider bundle formats delivering the same package instantiation. Let’s call those formats:
A release might be a package instantiation that one person has:
We discover a package bundle when we ask a system (local or remote) whether it has a package bundle at a given revision, tag, or bundle format. That implies two discoveries - local discovery (is the package bundle on my local system, if so where is it?); and remote discovery (is the package bundle on your expensive server and if so, how do I get it?). For the Debian distributions, the sources.list file identifies sources from which we can query for software packages. Those would be sources for remote discovery in our language.
A prundle discovery source is somewhere that can answer prundle discovery queries.
One such thing might be a prundle registry, where an element in the registry contains information about a particular prundle. At a first pass this might contain:
Maybe it should also contain information about where the information came from.
We query a pinstance when we know that a particular system (local or remote) has a package bundle of the pinstance we want. Then we get some information about that pinstance.
By definition, different prundles relating to the same pinstance have the same metadata.
A pinstance metadata query source is somewhere that can answer pinstance metadata queries.
Obviously a source may well be both a prundle discovery source and a pinstance metadata query source.
We install a pinstance when we get some prundle containing the pinstance and place it on local storage, such that we can discover the prundle on our own (local) system. That is we take some prundle and convert it to a local-path format bundle and we register this local-path format bundle to a discovery source.
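A minimal sketch of such an install step, for the case where the prundle is a zipfile: unpack it to a local path, then record that path so it can be discovered later. The directory layout `<packages_root>/<name>/<tag>`, the JSON registry file and the function name are all assumptions made for illustration; they are not a defined format.

```python
import json
import zipfile
from pathlib import Path


def install_zip_prundle(zip_path, name, tag, packages_root, registry_file):
    """Unpack a zipfile prundle and register the resulting local path.

    Both the <packages_root>/<name>/<tag> layout and the JSON registry
    are hypothetical choices for this sketch.
    """
    # Convert the zipfile bundle to a local-path bundle.
    target = Path(packages_root) / name / tag
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)

    # Register the local-path bundle with a (file-based) discovery source.
    registry_file = Path(registry_file)
    registry = {}
    if registry_file.exists():
        registry = json.loads(registry_file.read_text())
    registry.setdefault(name, {})[tag] = str(target)
    registry_file.write_text(json.dumps(registry, indent=2))
    return target
```

A plain JSON file is used here only because it can be read from almost any language without special tooling.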
In which we compare the package terminology above to the terminology of Debian packaging.
We want to build a package system that is very simple (‘S’ in 0SAGA). For the moment, the main problems we want to solve are: creation of a package instantiation, installation of package instantiations, local discovery of package instantiations. For now we are not going to try and solve queries.
At least local discovery should be so simple that it can be implemented in any language, and should not require a particular tool to be installed. We hope we can write a spec that makes all of (creation, installation, local discovery) clearly defined, so that it would be simple to write an implementation. Obviously we’re going to end up writing our own implementation, or adapting someone else’s. datapkg looks like the best candidate at the moment.
From a brief scan of the Debian package management documentation.
(no plan at the moment)
For dependency and validation, see the Debian secure apt page. One related proposal would be:
The obvious differences are:
The size of data packages probably means that using git itself will not work well. git stores (effectively) all previous versions of the files in the repository, as zlib compressed blobs. The working tree is an uncompressed instantiation of the current state. Thus, if we have, over time, had four different versions of a large file with little standard diff relationship to one another, the repository will have four zlib compressed versions of the file in the .git/objects database, and one uncompressed version in the working tree. The files in data packages may or may not compress well.
In contrast to the full git model, we may want to avoid duplicates of the data. We probably won’t by default want to keep all previous versions of the data together at least locally.
We probably do want to be able to keep track of which files are the same across different instantiations of the package, in the case where we already have one instantiation on local disk, and we are asking for another, with some shared files. We might well want to avoid downloading duplicate data in that case.
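One way to track shared files, sketched here under assumptions, is to keep a manifest of content hashes for each instantiation and compare manifests before downloading. The manifest format (relative path mapped to SHA-256 hex digest) and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to cope with large data."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root, POSIX style) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def files_to_fetch(local_manifest, wanted_manifest):
    """List paths in wanted_manifest whose bytes we do not already hold.

    A file is skipped when any local file has the same digest, even under
    a different name, because its bytes could be copied locally instead.
    """
    local_digests = set(local_manifest.values())
    return sorted(
        path
        for path, digest in wanted_manifest.items()
        if digest not in local_digests
    )
```

With this approach only the manifest of the wanted instantiation has to be fetched up front; the large files follow only where the digests differ.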
Maybe the way to think of it is of the different costs that become important as files get larger. So the cost for holding a full history becomes very large, whereas the benefit decreases a little bit (compared to code).
from ourpkg import default_registry

# Look up the local path of the installed bundle; None means "not found".
mypkg_path = default_registry.pathfor('mypkg', '0.3')
if mypkg_path is None:
    raise RuntimeError('It looks like mypkg version 0.3 is not installed')
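The snippet above does not say what stands behind `default_registry`. As a sketch only, a registry offering `pathfor` could be as simple as a lookup in a directory tree. The `Registry` class and the `<root>/<name>/<version>` layout are assumptions for illustration, not an implemented API.

```python
from pathlib import Path


class Registry:
    """Hypothetical local discovery source backed by a directory tree.

    Assumes each installed bundle lives at <root>/<name>/<version>.
    """

    def __init__(self, root):
        self.root = Path(root)

    def pathfor(self, name, version):
        """Return the bundle directory as a string, or None if absent."""
        candidate = self.root / name / version
        return str(candidate) if candidate.is_dir() else None
```

Local discovery of this kind needs nothing beyond checking whether a directory exists, which keeps it easy to reimplement in other languages.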
Footnotes
[1] Revision ids could, for example, be hashes of the package instantiation (package contents), so they could be globally unique to the contents, wherever the contents were when the identifier was made. However, tags are just names that someone has attached to a particular revision id. If there is more than one person providing versions of a particular package, there may not be agreement on the revision that a particular tag is attached to. For example, I might think that release-0.3 of some-package refers to the package state identified by revision id af5bd6, but you might think that release-0.3 of some-package refers to some other package state. In this case you and I are both tag sources for the package. The state that a particular tag refers to can then depend on the source from which the tag came.