Anita de Waard (Advanced Technology Group, Elsevier)
Molenwerf 1, 1014 AG Amsterdam

Joost Kircz (KRA-Publishing Research) *
Prins Hendrikkade 141, 1011 AS Amsterdam
kircz@kra.nl
With the impressive growth of hyper-linked information objects on the World Wide Web, the best possible way of finding gems in the desert is to create a system of filters - sieves that enable a large throughput of information in the hope that the residue is of relevance to the working scientist. Two methodological directions can be taken to find relevant information. One approach starts from the assumption that information growth cannot be tamed. Purely statistical information retrieval techniques are a prime example of such an approach, which can be devoid of any semantic knowledge about the content at stake. In these IR techniques, context is inferred from patterns that contain the query words. In the extreme case, not even words are used, as in the powerful n-grams technique [1,2].
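To make the contrast with word-free techniques concrete, the sketch below matches a query against a document purely on overlapping character n-grams; the function names and the choice of trigrams are our own illustration, not the specific method of [1, 2].

```python
# Minimal sketch of character n-gram matching, as used in purely statistical
# retrieval: no words, no semantics, only overlapping substrings of length n.
# Names and parameters are illustrative only.

def ngrams(text, n=3):
    """Return the character n-grams of a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_similarity(query, document, n=3):
    """Dice coefficient over character n-grams (one of many possible measures)."""
    q, d = set(ngrams(query, n)), set(ngrams(document, n))
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))

if __name__ == "__main__":
    # The query matches despite a spelling variant, without any word-level knowledge.
    print(ngram_similarity("metadata standardisation",
                           "standardization of metadata elements"))
```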
The other approach is based on denoting information. Every relevant piece of information is augmented with data describing the information object, so-called metadata. Metadata can be seen as filters, as they distribute information over classes, such as a name, an address, a keyword, etc. Looking for the name of the person Watt, we only have to look in the class of authors, whilst looking for the notion watt (as a unit of electric power) we only have to look in the class of keywords belonging to the field of electrical engineering. Due to the ambiguity of words, metadata are normally added by hand or based on the structure of the information object, e.g., a document. In a standardised environment we can infer with 100% certainty what the name of the author is, which is impossible if we deal with a document with an arbitrary structure in a language we don't master.
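As a toy illustration of metadata acting as such a class filter, consider the Watt example in a minimal sketch; the records and field names are invented:

```python
# Toy sketch of metadata as a filter: the same string "Watt" is unambiguous
# once we know which class (field) to search in. Records are invented.

records = [
    {"author": "James Watt", "keywords": ["steam engine", "patents"]},
    {"author": "A. Volta", "keywords": ["watt", "electric power", "units"]},
]

def search(records, field, term):
    """Yield records whose given metadata field contains the term."""
    term = term.lower()
    for rec in records:
        value = rec[field]
        values = value if isinstance(value, list) else [value]
        if any(term in v.lower() for v in values):
            yield rec

print(list(search(records, "author", "Watt")))    # the person
print(list(search(records, "keywords", "watt")))  # the unit
```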
It goes without saying that both approaches, purely statistical and pre-coordinated, are needed in a real-life environment. Statistical approaches have a number of obvious problems (lack of semantic knowledge, inability to interpret irony or casual references), while full pre-coding by the author might on the one hand be impossible to achieve, and on the other hand prevent the browsing reader from stumbling on unexpected relationships or cross-disciplinary similarities. The challenge is how we can prepare information in order to enable quick and relevant retrieval, while not overburdening the author or indexer.
In adding metadata to documents, more and more computer-assisted techniques are used. Some types of metadata are more or less obvious, e.g., bibliographic information, while others demand a deep knowledge of the content at issue. At the content level we deal with authors, who are the only ones who can tell us what they want to convey, and professional indexers, who try, with the help of systematic keyword systems, to contextualise the document within a specific domain. In particular the latter craft creates essential added value by anchoring idiosyncratic individual documents in a domain context, using well-designed metadata systems in the form of thesauri and other controlled keyword systems.
We are currently working on the design of a system which enables the author to add as much relevant information as possible to her/his work in order to enhance retrievability. As writing cultures do change as a result of the technology used, we propose to fully exploit the electronic capabilities to change the culture of authoring information. In such an approach, it is the author who contextualises the information in such a way that most ambiguities are pre-empted before release of the work. Such an environment is much more demanding for the author and editor, but ensures that the context of the work is well-grounded.
To build a useful development environment, in this contribution we define different categories of metadata that are created, validated and used at different stages of the publishing process. Given the importance of metadata, we believe it should be treated with the reverence usually reserved for regular data; in other words, we need to worry about its creation, standardisation, validation and property rights. We explore how metadata is used, and consider the issues of versioning, standardisation and property rights. We then come up with a proposed, and very preliminary, classification of metadata items, and discuss some issues concerning the items mentioned. As we believe that metadata should be treated on an equal footing with the objects it describes, in other words that metadata are information objects in themselves, we show that all issues that pertain to information objects also pertain to metadata.
This contribution is meant to support our own work in building an authoring environment, and therefore does not present any conclusions yet; but we invite responses to this proposed classification and to the issues at hand (versioning, validation, standardisation and property rights of metadata), preferably based on a comparison of documents from different scientific domains, as it turns out that different domains can differ substantially in structure and style. As is clear from the above, and in particular from the table, many issues are still uncertain and in full development. For the design of an easy-to-use and versatile authoring environment, where the author can quickly denote her/his own writing and create and name the links that connotate the work, an analytically sound scaffolding is needed before such a system can be built.
Below we discuss a classification of metadata leading to an overview presented in a table. Items in the table refer to further elaboration via hyperlinks. As this presentation also has to be printed, in this version the elaborations and digressions are located linearly as sections after the table.
To a first approximation we make a distinction between three broad categories of metadata, which are accompanied by three uses of information:
Metadata can be created by different parties: authors, editors, indexers and publishers, to name a few. It is important to realise that the creating party is sometimes not the validator; also, if the creating party is not part of the versioning cycle, the party creating the latest version may not be aware of necessary updates to the metadata. Therefore, only the creator can add and validate such items as her/his own name or references to other works. Additional metadata can be generated by machine intervention, such as automatic file-type and size identification, whilst professional indexers, be it by hand or computer-assisted, will add domain-dependent context to a work.
Very often, metadata is not validated per se. For convenience's sake, it is often assumed that links, figure captions, titles, references and keywords are correct. An extra challenge in electronic publishing is the validation of non-text items; for one thing, most reviewers and editors still work from paper, thereby missing the hypertextual and/or interactive aspects of a paper (hyperlinks that no longer work are an obvious example of this problem).
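Broken hyperlinks are at least amenable to mechanical checking; a minimal sketch of such a check, using only Python standard-library calls and placeholder URLs, could look like this:

```python
# Minimal link-validation sketch: report hyperlinks that no longer resolve.
# Uses only the standard library; the URLs below are placeholders.
import urllib.error
import urllib.request

def check_link(url, timeout=10):
    """Return (url, status), where status is an HTTP code or an error message."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, resp.status
    except (urllib.error.URLError, ValueError) as exc:
        return url, f"unreachable ({exc})"

for url in ["http://www.doi.org", "http://example.org/no-longer-there"]:
    print(check_link(url))
```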
The role of Intellectual Property Rights (IPR), and Copyright in particular, is a hot issue in the discussions on so-called self-publishing. A great deal of the difficulty lies in the differences between the various IPR systems, in particular between (continental) Europe and the US. However, besides this issue, electronic publishing generates a series of even more complicated questions that have to be addressed. As metadata allow the retrieval of information, they become "objects of trade" by themselves. Below we only indicate some issues pertaining to our discussion. A more detailed overview of the complicated legal aspects of ICT-based research is given in Kampermann et al. ([3] and references therein). The short list below shows that the heated debate on the so-called copyright transfer (or really: reproduction rights) from the author to a publisher is only a small part of the issue. Metadata as information objects face at least the same rights problems as the document per se.
Using the categories defined above, we can come to a first list of metadata items, including comments on their usage, creation/validation and rights, and define a number of issues, which are described in the paragraphs below.
What is it | Category | Who creates | Who validates | Who has rights | Issues |
Author name | content | Author | Author | Author | Unique author ID (see below 3.1) |
Author affiliation | content | Author's Institute | Editor? Publisher? | Author? | Corresponding author address only? Present address vs. address at the time of writing; in other words, is the article coupled to the author and her institution during creation, or does an article follow an author in time? |
Author index | content | Publisher | Publisher | Publisher/Library | Author name issues (Y. Li issue, see below 3.1) |
Keywords | content | Author, editor, publisher, A&I service, library, on-the-fly | Editor, publisher, A&I, library | See section 2.4 | Multi-thesaurus indexing (see below 3.2) |
Abstract | content | Author, A&I service | Editor, A&I editor | Author/A&I service | Types of abstracts? Usage of abstracts? (see below 3.3) |
References | location | Author | Editor, Publisher | None for individual reference; document collection - yes | DOI, http as reference; link reference to referring part of document; versioning! See also Links (below 3.4) |
Title, section division, headers | content | Author/Publisher | Publisher | Publisher? | Presently based on essayistic narratives produced for paper |
Bibliographic info (publisher's data) | location | Publisher | Publisher | Publisher™ | DOI refers to a document, but is intrinsically able to refer to a sub-document unit; no pagination in an electronic file, so referencing is now point-to-point instead of page-to-page |
Bibliographic info (other data) | location | Library | Library | Library | Multiple copies in a library system, signature, etc. Does this all evaporate with the new licence agreements, where the document is hosted in the Publisher's database? |
Clinical data | content | Author | Editorial | Doctor/patient? | Privacy; standardisation; usage? |
Link (object to dataset, object to object) | location/content | Author, Publisher | Publisher | Author? Publisher? | Links are information objects (see below 3.4) |
Multimedia objects (visuals, audio, video, simulations/animations) | content/format | Author, Publisher | Editor? Publisher? | Rights to format (cf. ISO and JPEG) vs. rights to content | Who owns a SwissProt nr? A Genbank® nr? A chemical structure format? JPEG org? |
Document status, version | content | Editor, publisher (author for preprint/OAI) | Publisher | Publisher | Version issue (see below 3.6) |
Peer review data | content | Reviewer | Editor | Reviewer? | How to ensure connection to article? Privacy vs. versions of articles? Open or closed refereeing procedures |
Document | content/ | Author, Publisher, Reviewer | Editor, Publisher | Author ("creator") | Integrity of components that make up the document; versioning |
DTD | content/format | Publisher | Publisher | Open source, copyleft? | Versioning? Standard DTD (see below 3.7) (Dublin Core)? Ownership |
Exchange protocols, e.g. OAI | location/format | Library, Publisher, archive | "Creator" | ?! | Rights! Open standards |
Document collection - Journal (e.g. NTvG) | content/location/format | Editor/Publisher | Editor/Publisher | Publisher | Integrity of collection; multiple collections; E-version versus P-version |
Document collection - Database (e.g. SwissProt) | content/location/format | Publisher - Editor? | Publisher | Organization? | Validation? Rights? |
Data sets, collaboratories - Earth System Grid | content/location/format | Federated partners | Nobody! | Creator? | Validation? Usage? |
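To make the table above concrete for a development environment, each row can itself be represented as a small record carrying its creator, validator and rights holder, underlining that metadata items are information objects in their own right. A hedged sketch, with invented field names:

```python
# Sketch of a metadata item as an information object in its own right,
# mirroring the columns of the table above. All names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class MetadataItem:
    name: str            # "What is it"
    category: List[str]  # content / location / format
    created_by: str
    validated_by: str
    rights_holder: str
    issues: str = ""

keywords = MetadataItem(
    name="Keywords",
    category=["content"],
    created_by="Author, editor, publisher, A&I service, library",
    validated_by="Editor, publisher, A&I, library",
    rights_holder="See section 2.4",
    issues="Multi-thesaurus indexing (see 3.2)",
)
print(keywords)
```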
The demand for a unique author ID is as simple as it is reasonable. However, in the real world we encounter the following caveats:
So, do we want to use a social security number (or, in The Netherlands, a SOFI number), or a picture of an iris scan? Or even introduce a Personal Publishing Identification Number (PPIN)? A lot of practical and legal issues still stand in the way of truly unique identification, but first steps on this path are being taken by publishers, agents and online parties to come to a common unique ID, the INTERPARTY initiative being one of them.
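The "Y. Li issue" from the table makes clear why a name string cannot serve as an identifier: several distinct researchers collapse onto the same normalised name, and only an explicit identifier keeps them apart. A toy sketch, with invented names and a purely hypothetical PPIN:

```python
# Toy sketch of the author-name ambiguity problem: distinct researchers share
# the normalised name "y. li", so only an explicit identifier (here a
# hypothetical PPIN) keeps them apart.

def normalise(name):
    """Collapse a full name to 'initial. surname', as many indexes do."""
    parts = name.lower().replace(".", "").split()
    return f"{parts[0][0]}. {parts[-1]}"

authors = [
    {"name": "Yan Li",  "ppin": "PPIN-0001"},   # hypothetical identifiers
    {"name": "Yong Li", "ppin": "PPIN-0002"},
    {"name": "Y. Li",   "ppin": "PPIN-0003"},
]

by_name = {}
for a in authors:
    by_name.setdefault(normalise(a["name"]), []).append(a["ppin"])

print(by_name)  # {'y. li': ['PPIN-0001', 'PPIN-0002', 'PPIN-0003']}
```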
Indexing systems are as old as science. The ultimate goal is to assign an unambiguous term to a complex phenomenon or piece of reasoning. As soon as something has a name, we can manipulate, use and re-use the term without long descriptions. In principle, a numerical approach would be easiest, because we can assign an infinite number of IDs to an infinite number of objects. In reality, as nobody thinks in numerical strings, simple names are used. However, as soon as we use names we introduce ambiguities, as a name normally has multiple meanings.
A known problem is that author-added keywords are normally inferior to keywords added by trained publishing staff: professional indexers add wider context, whereas individual authors mainly target terms that are fashionable in the discussion at the time of writing, as experience in the journal-making industry shows. Adding uncontrolled index terms to information objects therefore rarely adds true descriptive value to an article, which is a prime reason to use well-grounded thesauri and ontologies.
A so-called ontology is meant to be a structured keyword system with inference rules and mutual relationships beyond "broader/narrower" terms. At present we are still dealing with a mixed approach of numerical systems, such as classification codes, e.g. in chemistry or pharmacology, and domain-specific thesauri or structured keyword systems, such as EMTREE and MeSH terms in the biomedical field. Therefore, most ontologies still rely on existing indices, and ontology mapping is still a matter of much debate and research. Currently, multifarious index systems are still needed, based on the notion that readers can come from different angles and not necessarily via the front door of the well-established journal title. Index systems must overlap fan-wise, and links have to indicate what kind of relationship they encode. The important issue of rules, and in particular the argumentational structure of these roles, is part of our research programme and is discussed elsewhere [5, 9].
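A minimal sketch of the difference between a broader/narrower keyword system and the richer, typed relations an ontology adds; the terms and the relation name are invented for illustration:

```python
# Minimal sketch of a structured keyword system: broader/narrower terms plus
# one typed relation beyond them, of the kind an ontology would add.
# Terms and relations are invented for illustration.

thesaurus = {
    "cardiovascular disease": {"narrower": ["myocardial infarction", "hypertension"]},
    "myocardial infarction":  {"broader": ["cardiovascular disease"],
                               "treated_with": ["beta blocker"]},
    "hypertension":           {"broader": ["cardiovascular disease"]},
}

def related(term, relation):
    """Return terms linked to `term` by the given relation, if any."""
    return thesaurus.get(term, {}).get(relation, [])

print(related("cardiovascular disease", "narrower"))
print(related("myocardial infarction", "treated_with"))  # beyond broader/narrower
```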
The history of abstracts follows the history of the scientific paper. No abstracts were needed when the number of articles in a field was fairly small. Only after the explosion of scientific information after WWII do we see the emergence of abstracts as a regular component of a publication. Abstracting services came into existence, and in most cases specialists wrote abstracts for specialised abstracting journals (like the Excerpta Medica series). Only after the emergence of bibliographic databases did the abstract become compulsory, as it was not yet possible to deliver the full text. After a keyword search, the next step towards assessing the value of retrieved document identifiers was reading the on-line abstract. In an electronic environment (where the full article appears on the screen as quickly as the abstract) the role of the abstract as an information object is under scrutiny, since for many readers it often replaces the full text of the article. As already said in section 2.1, abstracts are identifiers for a larger information object: the document. In that sense an abstract is a metadata element.
In a study at the University of Amsterdam [6] to assess the roles of the abstract in an electronic environment, the following distinctions are made:
Functions:
Types of abstracts:
This analysis shows that the database field "abstract" now has to be endowed with extra specifying denotation. As our research is on design models for e-publishing environments, we have to realise that at the authoring stage of an abstract a clear statement about function and role is needed, as multiple abstracts, of different types, might be needed to cater for different reader communities.
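One possible way to endow the database field "abstract" with such extra denotation is to attach explicit type and function attributes to each abstract; the attribute values below are invented examples, not the categories of the cited study [6]:

```python
# Sketch of an "abstract" field carrying explicit denotation of its type and
# function, so different abstracts of one document can serve different reader
# communities. The attribute values are illustrative only.
from dataclasses import dataclass

@dataclass
class Abstract:
    document_id: str
    text: str
    abstract_type: str   # e.g. "indicative" or "informative" (illustrative)
    function: str        # e.g. "selection aid", "stand-in for full text"
    created_by: str      # author or A&I service

abstracts = [
    Abstract("doc-42", "...", "indicative", "selection aid", "author"),
    Abstract("doc-42", "...", "informative", "stand-in for full text", "A&I service"),
]
for a in abstracts:
    print(a.document_id, a.abstract_type, a.function)
```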
As already discussed above, analysing the components of a creative work into coherent information objects means that we also have to define how we synthesise the elements again into a well-behaved (new) piece of work. The glue for this puzzle is the hyperlink. An important aspect of our research programme is to combine denotative systems with named link structures that add connotation to the object descriptors. By integrating a proper linking system with a clear domain-dependent keyword system, a proper context can be generated.
If we analyse hyperlinks we have to accept that they are much richer objects than just a connection sign, as:
All in all, hyperlinks are information objects with a creation date, authorship, etc. and hence can be treated like any other information object. This means that we have to extend our discussion of metadata as data describing information to hyperlinks.
Apart from the obvious attributes such as author, date, etc. we can think about an ontology for links. This ontology will be on a more abstract level than an ontology of objects in a particular scientific field, as here we deal with relationships that are to a large extent domain-independent.
A first approach towards such a system might go as follows:
A) Organisational
B) Representational
C) Discourse
The great challenge in designing a link ontology and metadata system lies in developing a concise but coherent set of coordinates, as discussed in more detail elsewhere [7, 8].
We suggest the following main categories:
In conclusion: as links are information objects, we have to be aware of their validation and versioning in the same way as for textual or visual objects and data sets!
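A hedged sketch of a hyperlink treated as an information object, carrying authorship, a creation date, a version, and a named relation drawn from the Organisational / Representational / Discourse grouping sketched above; the relation names themselves are illustrative only:

```python
# Sketch of a hyperlink as an information object in its own right: it has an
# author, a creation date, a version, and a named type drawn from the three
# categories above. Relation names and identifiers are illustrative.
from dataclasses import dataclass
from datetime import date

CATEGORIES = {
    "organisational":   ["part-of", "next-section"],
    "representational": ["is-figure-of", "is-dataset-of"],
    "discourse":        ["supports", "contradicts", "elaborates"],
}

@dataclass
class Link:
    source: str
    target: str
    category: str        # organisational / representational / discourse
    relation: str        # named relation within that category
    author: str
    created: date
    version: int = 1

link = Link("doc-42#results", "dataset-7", "representational",
            "is-dataset-of", "author-1", date(2003, 5, 1))
assert link.relation in CATEGORIES[link.category]
print(link)
```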
In an electronic environment where documents (or parts thereof) are interlinked, no stand-alone (piece of) work is created/edited/published anymore. All creative actions are part of a network. So, all parties need to discuss and use standards: (partly) across fields, (certainly) across value chains. However, "The great thing about standards is that there are so many to choose from..." and they evolve all the time.
In library systems, we rely on a more or less certified system of index terms, such as Machine-Readable Cataloging (MARC) records, where a distinction is made between Bibliographic, Authority, Holdings, Classification and Community information. In a more general perspective we see all kinds of standardisation attempts to ensure the interchange of information in such a way that the meaning of the information object remains intelligible in the numerous exchanges over the Internet (see e.g. the National Information Standards Organization (NISO) in the USA for the Information Interchange Format, and the Dublin Core Metadata Element Set).
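For instance, a single document described with a handful of elements from the Dublin Core Metadata Element Set might look as follows; the element names are the standard dc ones, the values are invented:

```python
# Sketch of describing one document with a few Dublin Core elements (dc:*).
# Element names follow the Dublin Core Metadata Element Set; values are
# invented for illustration, and the identifier is a placeholder.
record = {
    "dc:title":      "Metadata in a publishing environment",
    "dc:creator":    ["de Waard, A.", "Kircz, J."],
    "dc:subject":    ["metadata", "electronic publishing"],
    "dc:date":       "2003",
    "dc:type":       "Text",
    "dc:format":     "text/html",
    "dc:identifier": "doi:10.0000/example",   # placeholder identifier
    "dc:language":   "en",
}

for element, value in record.items():
    print(f"{element}: {value}")
```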
An immediate concern is the level of penetration of a standard in the field and its ownership, public or commercial. Who has the right to change a standard, who has the duty to maintain a standard, how is the standardisation work financed, and who is able to make financial gains out of a standard? For that reason the discussion of standardisation, and of Open Standards in particular, is crucial in this period of time.
An interesting new phenomenon appears here. As is well known, in many fields so-called salami publishing is popular. First a paper is presented as a short contribution at a conference, then a larger version is presented at another conference, and after some iterations a publication appears in a journal. It is also common practice that people publish partial results in different presentations and then review them again in a more comprehensive publication. This practice can be overcome if we realise that an electronic environment is essentially defined as an environment of multiple use and re-use of information. The answer to the great variety of versions and sub-optimal publications might lie in breaking up the linear document into a series of inter-connected, well-defined modules. In a modular environment the dynamic patchwork of modules allows for a creative re-use of information in such a way that the integrity of the composing modules remains secured and a better understanding of what is old and what is new can be reached. Such a development is only possible if the description of the various information objects (or modules) is unique and standardised [7, 8, 9, 10].
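A minimal sketch of what such a modular environment could look like: modules are versioned information objects, and a publication is a named composition of module versions, so re-used material and genuinely new material stay distinguishable. All identifiers are invented:

```python
# Sketch of modular publishing: modules are versioned information objects,
# and a "publication" is a composition of module versions, so re-used
# material stays identifiable. Identifiers are invented.

modules = {
    ("methods-x", 1): "Description of experimental set-up ...",
    ("results-a", 1): "First partial results ...",
    ("results-a", 2): "Extended results, superseding version 1 ...",
}

conference_paper = [("methods-x", 1), ("results-a", 1)]
journal_article  = [("methods-x", 1), ("results-a", 2)]   # re-uses the methods module

def new_material(later, earlier):
    """Module versions in `later` that do not occur in `earlier`."""
    return [m for m in later if m not in earlier]

print(new_material(journal_article, conference_paper))  # [('results-a', 2)]
```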
As said in section 2.1, the description of the format of the information is an essential feature for rendering, manipulating and data-mining the information. This means that we need a full set of technical descriptors identifying the technical formats, as well as identifiers that describe the structure and shape of the document. In contrast to simple technical metadata (e.g., are we dealing with ASCII or Unicode?), the metadata that describe the various linguistic components and the structure of a document are interconnected one way or the other. This means that we need a description of this interconnection, hence metadata on a higher level. A Document Type Definition (or its cousin, a Schema) defines the interrelationship between the various components of a document. It provides rules that enable checking (parsing) of files. For that reason a DTD, like an abstract, belongs to the metadata of a document.
Based on such a skeleton, DTDs and style sheets can be designed that keep the integrity of the information (up to a level that is to be defined) while tailoring it to various output/presentation devices (CRT, handheld, paper, etc.).
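As an illustration of a DTD providing "rules that enable checking (parsing) of files", the sketch below validates two toy documents against an invented DTD using the third-party lxml library:

```python
# Minimal sketch of a DTD acting as metadata that constrains document
# structure. The toy DTD is invented; validation uses the third-party lxml
# library (pip install lxml).
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO("""
<!ELEMENT article (title, abstract, section+)>
<!ELEMENT title    (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT section  (#PCDATA)>
"""))

good = etree.fromstring(
    "<article><title>T</title><abstract>A</abstract><section>S</section></article>")
bad = etree.fromstring(
    "<article><title>T</title></article>")   # abstract and section missing

print(dtd.validate(good))   # True
print(dtd.validate(bad))    # False
print(dtd.error_log.filter_from_errors())
```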
Within this problem area it is important to mention the difference between content-driven publications, i.e. publications that allow different presentations of the same information content and can be well catered for by a DTD, and lay-out-driven publications, i.e. publications where, e.g., the time correlation between the various elements is essential for the presentation. See e.g. the work done at the CWI [11].
*) Also at: Van der Waals-Zeeman Institute, University of Amsterdam, and the Research in Semantic Scholarly Publishing project of the University Library, Erasmus University, Rotterdam
DOI: http://www.doi.org
Elsevier: http://www.elsevier.com
Dublin Core: http://dublincore.org/
Earth System Grid: http://www.earthsystemgrid.org/
Genbank: http://www.ncbi.nlm.nih.gov/Genbank/index.html
Interparty: http://www.interparty.org/
JPEG-Org: http://www.jpeg.org/
JPEG: http://www.theregus.com/content/4/25711.html
KRA: http://www.kra.nl
MARC: http://www.loc.gov/marc/
NISO: http://www.niso.org/standards/
NTvG: http://www.ntvg.nl/
OAI: http://www.openarchives.org/
Ontologies: http://protege.stanford.edu/ontologies/ontologies.html
Research in Semantic Scholarly Publishing project: http://rssp.org/
SwissProt: http://www.ebi.ac.uk/swissprot/index.html
TREC: http://trec.nist.gov/
ZING: http://www.loc.gov/z3950/agency/zing/zing-home.html