::scr Ramblings of a Classic Refugee or How I Learned To Stop Worrying and Love OS X

Alaric B Snell scr@thegestalt.org
Tue, 5 Feb 2002 16:34:41 +0000


> > 1. performance -
> > FS access would be *slow*. But memory is cheap and I've got clock cycles
> > to burn so we'll let Mr Moore take care of that for us.
>
> I don't know much about BeOS, but I think I've read something that
> claimed they'd fixed this problem? Maybe. I can't remember.

There's a myth about things 'having to be slow'. If you took a current 
filesystem and bolted extra stuff on willy-nilly, the result might well be 
'slow'. If you do it properly, it won't be. The hierarchical filesystem wasn't 
designed to 'be fast'; it was designed because it seemed like a logical way to 
organise files at the time.

> > Effectively DB file systems and also individual files would have the
> > equivalent of DTDs (the file that describes an XML file for the buzzword
> > protected) but for lovely shiny binary files. Al Snell would be proud.

Hello.

If the abstraction layer to the data store is strings of bits, then yes, 
you'll need something basically isomorphic to grammars to define allowable 
bit strings if you want to perform data validation. Attributing that grammar 
with semantic declarations ('this is a pornographic image in JPEG format', 
'this is an integer', etc.) will also be useful, since those semantic types 
can be looked up in a registry to automatically find pre-written grammars for 
validity checking, and can also link to code that can view/edit/validate in 
deep semantic ways (making sure that the porn image contains something that 
an image recognition algorithm classifies as a cum-soaked farmyard babe).
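
As a rough sketch of how such a registry might hang together - the type 
names, the JPEG magic-number check, and the class itself are all invented for 
illustration, not taken from any existing system:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Predicate;

    // Sketch of a semantic-type registry: each declared type name maps to a
    // validator that decides whether a raw bit string is acceptable.
    public class TypeRegistry {
        private final Map<String, Predicate<byte[]>> validators = new HashMap<>();

        public void register(String semanticType, Predicate<byte[]> validator) {
            validators.put(semanticType, validator);
        }

        public boolean validate(String semanticType, byte[] data) {
            // Unknown types fail rather than passing silently.
            Predicate<byte[]> v = validators.get(semanticType);
            return v != null && v.test(data);
        }

        public static void main(String[] args) {
            TypeRegistry registry = new TypeRegistry();
            // A shallow structural check: JPEG files start with 0xFF 0xD8.
            registry.register("image/jpeg", data ->
                data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8);
            System.out.println(registry.validate("image/jpeg",
                new byte[]{(byte) 0xFF, (byte) 0xD8, 0x00}));  // true
        }
    }

Deeper semantic checks (the image-recognition sort) would just be richer 
predicates plugged into the same lookup.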

Performing data validation can be a very, very good thing, but it's important 
that it doesn't become a pain in the ass. Whether this is best solved by 
allowing unstructured types that aren't validated against a schema - as the 
X.500 / LDAP folks have done recently - or by munging the minds of 
programmers until they feel that the time spent writing schemas for 
everything they do is far, far outweighed by the cost of debugging things 
without them... remains to be seen.

> A binary file with a separate definition document gets rid of the
> primary advantage of XML, namely that it can be edited in a standard
> text editor.

This is not the primary advantage of XML. People used to say things like that 
on XML-DEV, but then the Collapse came, and now they're cursing textual 
encodings.

The primary advantage of XML is that it's a well-supported standard, full 
stop; its biggest pains in the ass are the actual textual encoding rules and 
the lack of semantic declarations.

1) There are problems with character encodings. XML is based upon Unicode, 
which is still a developing standard. There have been problems with the 
definitions of whitespace, and XML 1.1 is set not to be backwards compatible 
with XML 1.0 because of this; XML 1.0 documents may not necessarily be valid 
XML 1.1 documents...

2) There are problems with the fact that only a subset of interesting data 
maps well to text. Images don't. Encrypted data doesn't. The two solutions 
are Base64-encoding the 'binary' data inline, or storing it in a separate 
data store with links in between that need to be kept up to date.
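
A rough sketch of the Base64 route in Java - the element name and the payload 
are made up; the point is only the shape of the round trip:

    import java.util.Arrays;
    import java.util.Base64;

    // Sketch: embedding arbitrary bytes in an XML element by Base64-encoding
    // them, then recovering the original bytes from the encoded form.
    public class Base64Embed {
        public static void main(String[] args) {
            byte[] payload = {(byte) 0xFF, (byte) 0xD8, 0x10, 0x42};
            String encoded = Base64.getEncoder().encodeToString(payload);
            String xml = "<attachment encoding=\"base64\">" + encoded + "</attachment>";
            System.out.println(xml);

            // The round trip is lossless, at the cost of roughly a third more
            // bytes in the document and an extra encode/decode step.
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(Arrays.equals(payload, decoded));  // true
        }
    }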

3) XML isn't that readable. Really simple stuff is, but the verbosity and so 
on quickly make it next to impossible to make sense of in a text editor; this 
is why specialist XML editors exist. XML configuration files are a right pain 
:-( It gets particularly bad when the data model in use isn't all that 
tree-structured and there are IDREF links between elements - the text-editor 
user will have big problems editing *that*.
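
To make the IDREF point concrete, here's a sketch - the document and its 
id/ref attribute names are invented - of what following one of those 
cross-links takes programmatically; anyone editing the same file by hand has 
to do the same bookkeeping in their head:

    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;

    // Sketch: a non-tree data model squeezed into XML via id/ref attributes.
    // Following the cross-reference takes an explicit second lookup.
    public class IdrefLookup {
        public static void main(String[] args) throws Exception {
            String xml =
                "<orders>" +
                "  <customer id='c42'><name>Acme Ltd</name></customer>" +
                "  <order number='1001' customer='c42'/>" +
                "</orders>";

            XPath xpath = XPathFactory.newInstance().newXPath();
            // First find which customer the order points at...
            String ref = xpath.evaluate(
                "string(/orders/order[@number='1001']/@customer)",
                new InputSource(new StringReader(xml)));
            // ...then resolve the reference with a second query on the id.
            String name = xpath.evaluate(
                "string(/orders/customer[@id='" + ref + "']/name)",
                new InputSource(new StringReader(xml)));
            System.out.println(name);  // Acme Ltd
        }
    }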

And as for the lack of semantic declarations: the XML designers took the path 
of 'we define a syntax alone, and let applications worry about semantics'. 
This has caused problems, because things like namespaces and namespace URIs 
and ID attributes and processing instructions and schemas and DTDs have 
multiple de facto standards for their semantics, and when they clash it can 
get ugly - both because there are now undefined cases and because developers 
must be familiar with multiple competing 'mental models' of what XML 
constructs *mean*, and must switch between them in their heads.

The lack of semantics in base XML 1.0 is a problem; there are multiple 
available semantics:

1) The DOM model, in which text nodes may be arbitrarily broken at various 
places (see the sketch further down)

2) The Infoset, which is like the DOM but different in some ways

3) The post schema validation Infoset, which is like the Infoset but 
different in some ways.

4) The XPath tree model, which is like the DOM but different in some ways

I can't remember all the differences, but they're mainly to do with things 
like whether a parsed entity reference is presented to the application as a 
reference or expanded at parse time (which may only be possible if there's an 
Internet connection available at parse time).
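
As a concrete illustration of the first point (a sketch only - the helper 
below isn't from any spec): the DOM doesn't promise that an element's 
character data arrives as a single text node, so code that wants 'the text' 
has to concatenate the pieces (or call normalize() first):

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    // Sketch: under the DOM, character data can be split across several
    // adjacent text/CDATA nodes, so robust code concatenates them rather
    // than assuming a single child node.
    public class TextNodes {
        static String textOf(Element e) {
            StringBuilder sb = new StringBuilder();
            NodeList kids = e.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node k = kids.item(i);
                if (k.getNodeType() == Node.TEXT_NODE
                        || k.getNodeType() == Node.CDATA_SECTION_NODE) {
                    sb.append(k.getNodeValue());
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<p>one <![CDATA[two]]> three</p>")));
            Element p = doc.getDocumentElement();
            System.out.println(p.getChildNodes().getLength());  // 3, not 1
            System.out.println(textOf(p));                      // one two three
        }
    }

The other models make their own choices about exactly this sort of detail, 
which is part of why code written against one can trip over another.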

Also, XML is *complex*. XML 1.0 itself is simple. Add namespaces and it gets 
a bit more complex. Add XML schemas and the Infoset and the PSVI and it gets 
more complex still.

> In that case, why is a binary file like that better than
> files being objects? If they had specified accessor methods, they could
> pretend to be any file type they needed too.

Indeed! Sod all this low level crap, people just wrap it in a library and 
call it 'serialisation' anyway... we don't want to have to be worrying about 
bit formats. We don't worry about bit layouts when we declare a struct full 
of integers in C or an object in Java or a table in SQL, do we?
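
A quick sketch of exactly that, using Java's built-in serialisation (the 
class and its fields are invented for illustration) - nowhere below does byte 
order, field offset, or padding appear:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Sketch: the library owns the bit-level format; the application code
    // only ever sees objects going in and coming back out.
    public class RoundTrip {
        static class Point implements Serializable {
            int x, y;
            Point(int x, int y) { this.x = x; this.y = y; }
        }

        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(new Point(3, 4));   // the library picks the bit format
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                Point p = (Point) in.readObject(); // and undoes it for us
                System.out.println(p.x + "," + p.y);
            }
        }
    }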

ABS

-- 
Alaric B. Snell, Developer
abs@frontwire.com