[sword-devel] InstallMgr details.

Fri May 15 01:03:43 MST 2009

Troy A. Griffitts wrote:

> There is a basic and practical difference between a local and a remote
> installation, however abstract you want to get.
> 
> Remote repositories have concepts like 'refresh from remote source'
> (apt-get update)

Short term for SWORD, that is likely to remain an important difference.
Long term, either a local repo just returns "OK, done, nothing changed"
when asked to refresh, or you can rethink whether this is really needed...

In the apt-get example you mentioned, I think the whole idea of needing
apt-get update for a local metadata database arises only when the
repository needs significant metadata to be usable, or when they get so
large (like tens of thousands of items in the repo, as with apt) that it
is currently impractical for performance reasons to obtain the index
info / metadata dynamically upon opening the repo.  If the largest
Debian package repository on the planet held 200 packages, would we
really all be using the "apt-get update" type of approach, and keeping a
local metadata database?

Thinking of other counter-examples: We do not "web update" before we can
browse to a new web page.  Nor do we "pdf update" before we can browse
to a new PDF file, or even "video update" for a video file.  If YouTube
is considered a large repository of video files, one does not "youtube
update" before one can watch a new video :)

Some (probably too idealistic and blue sky) ideas and thoughts for the
distant future of SWORD that arise when I think about this:

(1) If Peter von Kaehne's idea that "a SWORD module is like a PDF" is
accurate and appropriate, then the whole idea of "installing" a SWORD
module is an unhelpful anachronism that can go away at some point in
future development.  The end user does not really want to "install" a
SWORD module, they want to use (read/search/annotate/etc.) it!

(2) Or, if modules are always going to be only "installable" entities,
for whatever reason, then it seems to me to make little sense to provide
them online as a tree of files per module.  It is surely simpler, more
efficient, and maybe more logical(?) to provide them as a single
compressed archive file per module.  Then you let the "install" process
also decompress them (either after transport to the machine running the
application, or decompress the byte stream as it arrives, if that is
better for overall performance).  Remote network transport time and disk
write time are likely to dwarf any decompression time, even on embedded
low power CPUs.

Given this "SWORD modules are installable entities, not documents like
PDFs" vision of the future, apt-get is a very workable analogy.  Debian
package repositories do not require that the repository unpacks every
.deb file they offer, so that repository users can access and download
the files inside one at a time (those files are not generally useful
individually anyway!).  Instead, each one stores the .deb files, which are
compressed archives, along with meta info about them to aid in
searching.  The client does the decompression and unpacking of the
archives.  I can imagine a SWORD repo operating this way, too.

Unless you allow direct remote access (RPC-like, or maybe even
NFS-like?) to the items in the remote SWORD repo (potentially a nice
blue sky idea, but not currently implemented!), what is the benefit of
the "unpacked tree of files" format for repo owners, for front end
developers, or for end users?

Right now, without knowing all the history, my understanding is that
SWORD sort of does both, and so (to me) is confusing... online
repositories are unpacked, but there is also a "raw zip" standardized
way to store (and so transport) SWORD modules.  When does the user pick
one rather than the other?  Why is the user being asked to make that choice?

Is there really enough added value in having both to justify the
additional system complexity that ensues from this "do both" approach to
SWORD module storage in repositories?

(3) Ignoring backward compatibility (!), one could in future make SWORD
modules available as .zip files (or some other defined compressed
archive file format), *only*.  An installer would then use URLs to find
collections of these archive files (and the related repo metadata if
such is needed/useful), and more specific URLs to download the
individual archive files, and then install them locally.  This (as Greg
pointed out) would allow for a very nicely abstracted set of methods
that could expand to encompass any desired number of different URI
schemes, from http: to ftp: to file: to sshfs: to something not yet
invented.
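
To make the "abstracted set of methods" idea a little more concrete, here
is a minimal sketch of dispatching on the URI scheme.  Everything in it
(schemeOf, Fetcher, FetcherRegistry) is a hypothetical name made up for
illustration and is not part of the existing SWORD API; the real
transports (HTTP, FTP, plain file copy) would be plugged in as handlers.

```cpp
#include <cctype>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Returns the lower-cased scheme of `url` ("http", "ftp", "file", ...),
// or "" if the text does not begin with a syntactically valid "scheme:".
std::string schemeOf(const std::string &url) {
    std::string::size_type colon = url.find(':');
    if (colon == std::string::npos || colon == 0) return "";
    if (!std::isalpha(static_cast<unsigned char>(url[0]))) return "";
    std::string scheme;
    for (std::string::size_type i = 0; i < colon; ++i) {
        unsigned char c = static_cast<unsigned char>(url[i]);
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') return "";
        scheme += static_cast<char>(std::tolower(c));
    }
    return scheme;
}

// Hypothetical transport signature: copy the archive at `url` to
// `destPath` and return true on success.
using Fetcher =
    std::function<bool(const std::string &url, const std::string &destPath)>;

// Maps a scheme name to the code that knows how to fetch from it.
class FetcherRegistry {
public:
    void add(const std::string &scheme, Fetcher fetcher) {
        fetchers_[scheme] = std::move(fetcher);
    }
    bool supports(const std::string &scheme) const {
        return fetchers_.count(scheme) != 0;
    }
    // Looks up the handler for the URL's scheme; fails if none is registered.
    bool fetch(const std::string &url, const std::string &destPath) const {
        auto it = fetchers_.find(schemeOf(url));
        return it != fetchers_.end() && it->second(url, destPath);
    }
private:
    std::map<std::string, Fetcher> fetchers_;
};
```

With something like this, supporting a new scheme would mean registering
one more handler, and the installer code calling fetch() would not need to
change.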

[Aside: If this *is* done, it can often make a lot of sense to use an
"embedded magic number" approach to being able to identify the files as
being SWORD modules rather than a generic zip/gzip/bzip2/etc compressed
file, as this permits special treatment of them without having to use
ugly workarounds like renaming them to end in something hopefully
unique. Debian/Ubuntu .deb packages have such numbers, for example --
you can rename a .deb to a .foobar if you really want, and the file *
command will still identify it as being a Debian binary package, and so
you can still set up your file manager to do the right thing when you
double click it! ]
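
As a rough sketch of what magic-number detection looks like in practice,
the function below classifies a file by its leading bytes.  The zip, gzip,
bzip2 and .deb signatures are the real ones; the "SWORDMOD" signature is
purely hypothetical (no such SWORD format exists today), and for simplicity
the sketch assumes the signature sits at the very start of the file.

```cpp
#include <string>

enum class ArchiveKind { Unknown, SwordModule, DebPackage, Zip, Gzip, Bzip2 };

// True if `data` begins with the bytes in `magic`.
static bool startsWith(const std::string &data, const std::string &magic) {
    return data.size() >= magic.size() &&
           data.compare(0, magic.size(), magic) == 0;
}

// Classifies a file from its first few bytes, ignoring its name entirely.
// `head` is whatever was read from the start of the file (32 bytes is plenty).
ArchiveKind identify(const std::string &head) {
    // Hypothetical signature for a single-file SWORD module format.
    if (startsWith(head, "SWORDMOD")) return ArchiveKind::SwordModule;
    // A .deb is an ar archive whose first member is named "debian-binary".
    if (startsWith(head, "!<arch>\ndebian-binary")) return ArchiveKind::DebPackage;
    if (startsWith(head, "PK\x03\x04")) return ArchiveKind::Zip;
    if (startsWith(head, "\x1f\x8b")) return ArchiveKind::Gzip;
    if (startsWith(head, "BZh")) return ArchiveKind::Bzip2;
    return ArchiveKind::Unknown;
}
```

A front end or file manager could read the first few bytes of a file,
call identify(), and only offer the "install as SWORD module" action when
the result is SwordModule, whatever the file happens to be called.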

It seems to me that movement in that general direction (SWORD modules
are available as a single file in some defined format, decompressed and
installed locally) might be better (long term) than adding features to
the current (SWORD modules online are a hierarchy of many files) approach.

This is potentially also a useful first step on the long road to "SWORD
modules are a single data file which the application opens, just like a
PDF, no installation of them is needed" -- first make modules be single
files, then make doing the equivalent of whatever an installer does fast
enough that you can do it at file (module) open time :)

> Local repositories usually aren't 'entered' in a list by a user, though
> I suppose they could be if it was useful.  Practically there is usually
> 1 local source (a CD or USB drive) and the user can Browse... to the
> location.  This is not easily replicated for remote sources.  They are
> typically 'configured' and their configuration stored for future reference.

Consider bookmarks in a web browser -- one can bookmark both local files
and remote ones in there... there is no difference to the user interface
at all in that case.  Why do SWORD modules require such a distinction?
How does it help the end user to have that distinction be visible to
them?  How does it help the front end developer to have that distinction
be visible to them in the API?  If the distinction is unhelpful to the
users, then can it be abstracted away?

> Some I can think of:  We now have just added support in 1.6.0 for
> non-anonymous FTP, so the user can input username/password if
> necessary-- useful for access to a private beta repository.  We have
> supported Passive FTP as an option.  With HTTP access, we might also add
> HTTP proxy features.  These all require frontend user preferences.

I'd think that all of this can either be in the URL (username and
password) or else a systemwide config option (proxies, passive vs active
FTP -- though the "good default" these days for FTP seems to be try
passive, and if it fails in a certain way, fall back to active).  This
probably needs a way for the "open a URL" method to prompt the user for
authentication information (username and pw, usually), but that's all.
The underlying subsystems (and control panels for users to set proxies,
etc.) for doing it that way already exist, on most and perhaps all
platforms, as far as I know, so making them preferences that SWORD front
ends need to handle specially seems like extra work for both SWORD
developers and SWORD end users, for no real benefit?
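
To illustrate the "username and password in the URL" part, here is a
small sketch that pulls the user-info portion out of a URL of the form
scheme://user:password@host/path.  The Credentials struct and the
credentialsFrom function are hypothetical names, not existing SWORD API,
and the sketch deliberately skips percent-decoding.  If no credentials are
present, the caller would know it may need to prompt the user.

```cpp
#include <string>

// Hypothetical holder for login details carried inside a URL.
struct Credentials {
    std::string user;
    std::string password;
    bool present;  // false when the URL carries no user-info at all
};

// Extracts "user" and "password" from scheme://user:password@host/path.
// Percent-encoded characters are returned as-is (no decoding is done).
Credentials credentialsFrom(const std::string &url) {
    Credentials c{"", "", false};
    std::string::size_type start = url.find("://");
    if (start == std::string::npos) return c;
    start += 3;
    // The authority part ends at the first '/', '?' or '#', if any.
    std::string::size_type end = url.find_first_of("/?#", start);
    std::string authority = url.substr(
        start, end == std::string::npos ? std::string::npos : end - start);
    std::string::size_type at = authority.rfind('@');
    if (at == std::string::npos) return c;
    std::string userinfo = authority.substr(0, at);
    std::string::size_type colon = userinfo.find(':');
    c.user = userinfo.substr(0, colon);
    if (colon != std::string::npos) c.password = userinfo.substr(colon + 1);
    c.present = true;
    return c;
}
```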

> There will almost certainly be additional work for the frontends when we
> add HTTP support, though not as different as you fear might be.

I think that a careful design might be able to avoid that.  Even if that
is impractical for this first implementation, I think a design that
moves the library towards a more unified approach to acquiring and
opening/using modules would be good.

For instance, longer term still, given a fast enough network pipe, why
download and install any modules at all -- one should conceivably be
able to have more of an RPC style approach to accessing a remote
module... a little like accessing a remote SQL database today... or even
just a file on a network share... you don't have to copy the entire
database (or file) to your PC first, before you can use it :)

> Libraries of modules are exposed as SWMgr objects.
> An SWMgr object can be easily created from a local path:
> 
> SWMgr localLibrary("/path");
> 
> So for local sources, you don't need InstallMgr to obtain an SWMgr object.

Why not see that parameter as potentially being a URL, in a later
version of the library, so unifying remote and local access?  If a URL
parsing library says it isn't a URL, treat it as file:// as a fallback...

SWMgr library("http://www.example.com/sword/");

or similar?  Other than performance over a slow network connection, is
there any technical requirement for this to be restricted to "local"?
Since "local" can in fact be very remote if one mounts a network
filesystem, I'm not sure the distinction is all that useful to the end
user anyway, is it?
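
The "treat it as file:// as a fallback" step could be as small as the
sketch below.  Both function names are hypothetical, the URL check is a
hand-rolled stand-in for a real URL parsing library, and the fallback
assumes the plain path is absolute (a relative path would need resolving
first).

```cpp
#include <cctype>
#include <string>

// True if `location` starts with something shaped like "scheme://".
bool looksLikeUrl(const std::string &location) {
    std::string::size_type sep = location.find("://");
    if (sep == std::string::npos || sep == 0) return false;
    if (!std::isalpha(static_cast<unsigned char>(location[0]))) return false;
    for (std::string::size_type i = 1; i < sep; ++i) {
        unsigned char c = static_cast<unsigned char>(location[i]);
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') return false;
    }
    return true;
}

// Leaves URLs untouched and turns a plain absolute path into a file:// URL,
// so that the caller only ever has to deal with URLs.
std::string toUrl(const std::string &location) {
    if (looksLikeUrl(location)) return location;
    return "file://" + location;
}
```

With that in place, "/path" becomes "file:///path" and
"http://www.example.com/sword/" is passed through unchanged, so one
constructor argument could cover both cases.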

Going even further, is it necessary or helpful for the API to have the
concept of "libraries" at all, other than as bookmarks to open modules
to install?  We don't normally expect PDF files to exist grouped into
"libraries"; why would we expect SWORD modules to be so grouped, if they
are in effect just like PDFs? (Even if they are in some ways perhaps
more like databases, with all the searching and indexing stuff that they
need... we don't generally group databases into sets of databases based
on their physical or network location, either).

> int InstallMgr::installModule(SWMgr *destMgr, const char *fromLocation,
> const char *modName);
> // which I don't hate

Can SWORD not go back to this, and allow *fromLocation to be a URL?  I
think this is more or less what Greg is suggesting :)  Adding capability
without changing the API is really nice if you can do it, and if the
resulting API is actually *simpler* than before (plus as a bonus you
don't hate it)... that sounds good all around.

Jonathan