A modular structure for electronic scientific articles

F.A.P. Harmsze1, M.C. van der Tol2 and J.G. Kircz1
Web site: http://www.wins.uva.nl/projects/commphys/home.htm
1Van der Waals-Zeeman Institute, University of Amsterdam
Valckenierstraat 65, 1018 XE Amsterdam, The Netherlands
2Speech Communication, Argumentation Theory and Rhetoric, University of Amsterdam
Spuistraat 134, 1012 VB Amsterdam, The Netherlands

Abstract

We have developed a modular structure for electronic articles on experimental science. Modular articles consist of different types of explicitly characterised modules and explicitly characterised links expressing different types of relations. The modules can be located, retrieved and consulted both separately and in conjunction with other modules.


The project

At present, we face a revolution in the dissemination and handling of scientific papers. Most major publishers make an important share of their publications available via an Internet site and many independent new initiatives are launched. With a few exceptions, these electronic publications are in fact reproductions of paper-based products. As always, with the introduction of a new technology, the first steps in a new era are characterised by the translation of the old methods and models into the new situation. Only when the intrinsic characteristics of the new technologies are fully appreciated do real novel developments get a chance.

In the project 'Communication in Physics', we try to go a step further and propose a new model for the creation and evaluation of electronic scientific articles, taking into account the intrinsic features of the new medium, the requirements of adequate scientific communication as well as the societal traditions on which regular scientific communication is based. This model is intended to work in a fully electronic environment, where all papers are linked to each other and where new scientific contributions are added to the existing pool of papers in an organic way.

Rather than concentrating on the capability of present-day software, we choose an analytical approach. In other words, we design a new way of presenting scientific results, based on the assumption that appropriate software will become available in the foreseeable future. We analyse the role of articles in scientific communication following the standard literature (Garvey, 1979; Meadows, 1998). Subsequently, we draft a profile of the interactants in the communication process. Here, we rely on discourse and argumentation studies concerning rational communication (Van Eemeren et al., 1993). This way we are able to connect the characteristics of scientific articles with the various stages in the communication process. This leads to a series of specific requirements that electronic scientific articles have to satisfy to allow for effective and efficient communication. These requirements include: a) dissemination requirements like indexing and logistics tools as well as proper identification and registration of intellectual ownership and integrity of the work, and b) creation requirements, resulting in authoring tools and electronic templates.

Based on our analyses of the role of the article and the communication criteria in academia, we conclude that in an electronic environment the traditional linear essay form becomes obsolete and has to be replaced by a modular framework (Kircz, 1998).

In order to ensure that our model is grounded in scientific practice, we developed the model in conjunction with an analysis of a coherent corpus of printed articles in the field of experimental physics. In this analysis, we identify different types of information and relations in the corpus and re-organise that information in a novel, modular structure. We found that the modular structure indeed allows for the creation of scientific articles that meet the necessary requirements.

In an earlier presentation in this series of conferences, we gave an outline of our programme (Harmsze et al., 1996). In this contribution, we would like to present the final model, which will soon be fully reported elsewhere (Harmsze, 2000). That thesis will specify the model in terms of instructions to authors and will provide recommendations for software implementation.

The framework

Because electronic media are suitable for multiple (re)usage and reshuffling of information units, as well as for additions of new components to published work, our guiding principle is 'modularity' (Kircz, 1998). We develop a structure for modular articles, based on the idea that an electronic article can be made up of well-defined modules and links that, following the SGML-philosophy, can be identified with tags. In our modular framework, we define the modules that can represent the different types of information in an article. In order to guarantee and express the coherence of the information in and between different modules, we introduce a systematic way of linking the modules, both within the same article and between different publications. Thus, a modular article represents a sub-network of information within the network of all published information. In our model, both modules and links are explicitly characterised 'information objects' that can be handled using state-of-the-art database management and information retrieval techniques.

Modules

We define a module as a uniquely characterised, self-contained representation of a conceptual information unit that is aimed at communicating that information. Not its length, but the coherence and completeness of the information it contains makes it a module. Modules can be located, retrieved and consulted separately as well as in conjunction with related modules.

The relations between modules can be expressed not only in links, but also in the composition of elementary modules into higher-level, complex modules. We define a complex module as a module that consists of a coherent collection of (elementary or complex) modules and the links between them. Using a metaphor, elementary modules are 'atomic' entities that can be composed into a 'molecular' entity: a complex module.

We distinguish two types of complex modules: compound modules and cluster modules. In a compound module, related (albeit possibly dissimilar) modules are aggregated to form a new module on a higher level. An example of an aggregated module is the module 'Experimental methods' that is composed of lower-level modules representing the various components of a measuring device. In our corpus we encounter molecular beam apparatuses that have, as relatively independent components, things like: one or more sources of a particle beam, a beam transport system, an interaction chamber and a detector. The central concept of a cluster module is the generalisation of specific concepts, focused on in its constituent modules. An example of a cluster module is the module 'Raw data' composed of various elementary modules reporting the results of the same general type of measurements involving different molecules.
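The composition of elementary modules into complex modules can be illustrated with a small recursive data structure. The following Python fragment is only a sketch under our own assumptions: the names Module, ComplexKind and elementary_parts are hypothetical and not part of the model, and the example titles merely paraphrase the beam-apparatus and 'Raw data' examples given above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ComplexKind(Enum):
    COMPOUND = "compound"  # aggregation of related, possibly dissimilar modules
    CLUSTER = "cluster"    # generalisation over modules of the same general type


@dataclass
class Module:
    title: str
    kind: Optional[ComplexKind] = None  # None marks an elementary ('atomic') module
    parts: List["Module"] = field(default_factory=list)

    def is_elementary(self) -> bool:
        return self.kind is None

    def elementary_parts(self) -> List["Module"]:
        """Return the elementary modules contained in this module, depth first."""
        if self.is_elementary():
            return [self]
        found: List["Module"] = []
        for part in self.parts:
            found.extend(part.elementary_parts())
        return found


# Hypothetical example content, loosely following the corpus examples.
methods = Module(
    "Experimental methods",
    ComplexKind.COMPOUND,
    [Module("Beam source"), Module("Beam transport system"),
     Module("Interaction chamber"), Module("Detector")],
)
raw_data = Module(
    "Raw data",
    ComplexKind.CLUSTER,
    [Module("Measurement, molecule A"), Module("Measurement, molecule B")],
)
article = Module("Article", ComplexKind.COMPOUND, [methods, raw_data])

print([m.title for m in article.elementary_parts()])
```

Because a complex module may itself contain complex modules, the same structure covers both the 'atomic' and the 'molecular' level of the metaphor.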

In order to be able to determine what is 'similar information' to be grouped together and represented in a self-contained module and, subsequently, in order to be able to determine how to tag the resulting module, we need an unambiguous typology of scientific information. Therefore, we introduce a typology by which we characterise the information from four complementary points of view. In this typology, we incorporate the characterisations from two classical points of view: the domain-oriented characterisation that can be expressed in keywords and the characterisation by specified bibliographic data. In addition, we introduce a characterisation by the range of the information and a characterisation by its conceptual function, i.e. by the role the information plays in the scientific problem-solving process.

By characterising information by its range, so-called microscopic, mesoscopic and macroscopic modules can be introduced. A microscopic module represents information that belongs only to one particular article, e.g., information concerning the specific problem addressed in that article. A mesoscopic module functions at the level of an entire research project; it is created for multiple use in several articles issued from the same project. For example, information about the experimental set-up that has been used in a series of experiments can be represented in a mesoscopic module and connected to several articles reporting experimental results. A macroscopic module represents information that transcends the level of the research project; this type of firmly established information is given in, e.g., books, lecture notes.
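To make the four-fold characterisation more concrete, a module's typing could be recorded as a small record that a retrieval system can filter on. The Python fragment below is a sketch only; the names Range, Characterisation and select, the field names and all example values (keywords, author names) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Range(Enum):
    MICROSCOPIC = "one particular article"
    MESOSCOPIC = "an entire research project"
    MACROSCOPIC = "beyond the research project"


@dataclass
class Characterisation:
    keywords: List[str]            # domain-oriented characterisation
    bibliographic: Dict[str, str]  # specified bibliographic data
    info_range: Range              # range of the information
    conceptual_function: str       # role in the problem-solving process


def select(items: List[Characterisation],
           info_range: Optional[Range] = None,
           conceptual_function: Optional[str] = None) -> List[Characterisation]:
    """Return the items that match every criterion that was supplied."""
    result = []
    for item in items:
        if info_range is not None and item.info_range is not info_range:
            continue
        if (conceptual_function is not None
                and item.conceptual_function != conceptual_function):
            continue
        result.append(item)
    return result


# Hypothetical catalogue entries.
catalogue = [
    Characterisation(["molecular beam"], {"author": "A. Author"},
                     Range.MICROSCOPIC, "Raw data"),
    Characterisation(["molecular beam", "detector"], {"author": "A. Author"},
                     Range.MESOSCOPIC, "Experimental methods"),
    Characterisation(["scattering theory"], {"author": "B. Author"},
                     Range.MACROSCOPIC, "Experimental methods"),
]

shared_setup = select(catalogue, info_range=Range.MESOSCOPIC)
print([c.conceptual_function for c in shared_setup])
```

A query such as the last one would locate the description of an experimental set-up that is shared by several articles from the same project.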

Our main division in modules is based on the characterisation of the information by its conceptual function. Our starting point is the prototypical section structure of scientific papers: Introduction, Methods, Results, Discussion and Conclusions. This sequence represents the normal flow of a scientific narrative, but the way it is used in practice presupposes that the article will, indeed, be read sequentially from the beginning to the end. One of the main arguments in favour of modularity is that knowledgeable readers hardly read articles sequentially but browse through them, looking for useful bits and pieces. In our approach, we take that behaviour as our starting point and define our modules as entities that can be read independently. Thus, every module represents only one well-defined aspect of the article. Of course, this independence does not mean that one module is in general sufficient to understand the whole work. Modularity enables the reader to zoom in immediately on those aspects he/she is interested in. If so desired, the whole work, i.e., all the related modules and if needed the necessary related information presented in meso- and macroscopic modules, can be retrieved and read as if it were a traditional article.
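Reading the whole work 'as if it were a traditional article' amounts to following a chain of links from one module to the next. A minimal Python sketch of such a traversal is given below. It assumes that each module has at most one successor on the reading path; the function name reading_path and the module identifiers, which simply reuse the prototypical section names, are hypothetical.

```python
from typing import Dict, List


def reading_path(start: str, next_module: Dict[str, str]) -> List[str]:
    """Follow successor links from `start` until the path ends or would repeat."""
    path = [start]
    seen = {start}
    current = start
    while current in next_module:
        current = next_module[current]
        if current in seen:  # guard against circular paths
            break
        seen.add(current)
        path.append(current)
    return path


# Hypothetical successor relation between module identifiers.
successors = {
    "Introduction": "Methods",
    "Methods": "Results",
    "Results": "Discussion",
    "Discussion": "Conclusions",
}

print(reading_path("Introduction", successors))
print(reading_path("Results", successors))
```

The second call shows the point of modularity: a reader who is only interested in the results can enter the path there and need not start at the beginning.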

We derive a list of distinctive conceptual functions for our corpus. From this analysis, we distinguish the following modules based on these conceptual functions.

Figure 1: an overview of the modules

Links

In the present practice of hypertext linking, the relations between the linked objects are often left unclear to the reader. A standard hyperlink only indicates that the author has some relation in mind between, for example, a blue underlined word and something else. In a standard HTML-document full of links, we are directed from nowhere to everywhere and back.

In our modular structure, a link is defined as an explicitly characterised directed connection, between modules or parts thereof (e.g., words or sentences), that represents one or more different kinds of relevant relation. Characterising links by the relations they express and by the modules they connect enables the reader, firstly, to make a well-considered choice whether or not to follow the link and, secondly, to take the links into account in the process of locating and retrieving relevant information. This way, a link becomes a proper information object with clear characteristics. In a retrieval situation, the reader can now search for modules and links, thereby enhancing the whole disclosure process. For this reason we also endow each link with the bibliographic data of the author who identified these relations and created the link. This makes it possible for a commentator on a modular article to add links to an already-published work. These links can strengthen the original work, but they can also challenge the results by, e.g., pointing to incompatible results of others. Thus, by endowing the object "link" with the traditional bibliographic data, we ensure the authenticity and priority of each information object when new links or modules are added to published work. Links and modules now have an equal standing.
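A link as an information object in its own right can be sketched as a record that carries its relation types and the bibliographic data of its creator. The Python fragment below is an illustration under our own assumptions: the class Link, its fields, the function links_from and all identifiers and names in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Link:
    source: str                 # module (or part of a module) the link starts from
    target: str                 # module (or part of a module) the link points to
    relations: Tuple[str, ...]  # one or more kinds of relation expressed by the link
    author: str                 # who identified the relation and created the link
    year: int                   # when the link was created


def links_from(links: List[Link], source: str,
               relation: Optional[str] = None) -> List[Link]:
    """Return the links leaving `source`, optionally restricted to one relation."""
    return [
        link for link in links
        if link.source == source
        and (relation is None or relation in link.relations)
    ]


# Hypothetical links: one by the original author, one added later by a commentator.
links = [
    Link("article-1/results", "handbook/term", ("Definition",), "A. Author", 1999),
    Link("article-1/results", "article-2/results", ("Similarity",),
         "C. Commentator", 2000),
]

for link in links_from(links, "article-1/results", relation="Similarity"):
    print(link.target, "added by", link.author, link.year)
```

Since every link records its own author and date, a link added afterwards by a commentator can be distinguished from the links supplied by the original author.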

In our analysis, we identify different types of relation that are relevant in modular scientific articles, and formulate a typology for the links in the modular structure. We distinguish two main classes of relations: organisational relations and scientific discourse relations.

Organisational relations

In the class of organisational relations, which express the organisational coherence of the modular network, we distinguish the following six types of relation:

Figure 2: an overview of the organisational relations

  1. hierarchical: an asymmetric relation between complex modules and their constituent modules,
  2. proximity-based: a symmetric relation between linked modules expressing whether they are part of the same collection (in particular, the same article or set of articles),
  3. range-based: an asymmetric relation expressing the difference in range between linked modules,
  4. administrative: an asymmetric relation between conceptual modules and the module representing their meta-information,
  5. sequential: an asymmetric relation between modules linked to form a complete or a more easy-going reading path,
  6. representational: an asymmetric relation between different representations of the same information (e.g., between texts, tables and figures).

An important aspect of links based on organisational relations is that they can often be assigned semi-automatically, provided the authors have appropriate authoring tools at their disposal.
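As an example of such semi-automatic assignment, hierarchical links can be derived directly from the way complex modules are composed. The following Python sketch assumes that an authoring tool records the composition as a mapping from each complex module to its constituents; the function name, the tuple layout and the choice to direct the link from the complex module to its constituent are our assumptions.

```python
from typing import Dict, List, Tuple

# (source, target, relation type); a hypothetical minimal link representation
LinkTuple = Tuple[str, str, str]


def hierarchical_links(composition: Dict[str, List[str]]) -> List[LinkTuple]:
    """Create one hierarchical link from every complex module to each constituent."""
    links: List[LinkTuple] = []
    for parent, children in composition.items():
        for child in children:
            links.append((parent, child, "hierarchical"))
    return links


# Hypothetical composition recorded by an authoring tool.
composition = {
    "Article": ["Experimental methods", "Raw data"],
    "Experimental methods": ["Beam source", "Detector"],
}

for source, target, relation in hierarchical_links(composition):
    print(f"{source} -> {target} ({relation})")
```

The author only has to confirm or correct the generated links, which is why we call the assignment semi-automatic.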

Scientific discourse relations

The second main class of relations, scientific discourse relations, allows authors to indicate why they refer to another module or another part of the same module. Following speech communication research, we arrive at two subclasses of scientific discourse relations. One subclass is based on the communicative function; the other consists of Content relations between two relata.

Figure 3: an overview of the content relations

1. Communicative function relations

The two basic aims of the author are to increase the reader's understanding of the message or to increase his/her acceptance of it. In order to understand or accept a module, readers may need additional information, for instance about the causes of a certain phenomenon. The author can make that information available to the readers by means of a link. The target of the link then consists of, e.g., a figure, a statement or a whole module, which has a particular communicative function with respect to the source of that link; for instance that of an explanation. Hence, this asymmetric relation can be made explicit by the characterisation of the link.

In practice we can often easily make a distinction between Elucidation links and Argumentation links. In the case of elucidation, the aim is to increase the reader's understanding. Within the Elucidation relations, we make a further distinction between Explanation and Clarification. An explanation is given when the author anticipates that part of the intended readership will not understand how a particular state of affairs has come into being. When the author anticipates that part of the intended readership will not understand what he/she means by a particular text or figure, he/she will make a clarification available in the module or through a link to another module. Within Clarification, a further refinement is possible between a Definition relation and a Specification relation. Thus, the author can, for instance, connect a difficult term to an "encyclopaedic" macroscopic module by a link expressing a Definition relation.

In the argumentative case, the aim is to increase the reader's acceptance of a standpoint. These are cases where the author can presume that not every reader of the intended readership will immediately accept a particular statement.

2. Content relations

The second subclass of scientific discourse relations comprises Content relations, such as Dependency, Elaboration, Similarity, Synthesis and Causality.

Dependency is an asymmetric relation between steps in the problem-solving process of the reported research. A link can express the fact that the source depends on the target in the way in which, for instance, results depend on the methods used to generate them. A special case is the Transfer relation, in which items are taken from one module and included in another. This is often the case with mathematical formulae or values that are used as input in calculations.

With an Elaboration relation, we indicate an asymmetric relation where the target contains an elaboration of the statement in the source. A mesoscopic sketch of the Situation can provide more information than a short statement in a Situation module at the microscopic level. Within this class, we can make a further distinction between Resolution relations that point to more fine-grained information, i.e. more details, and Context relations, pointing to more broad sweeping accounts of the subject, i.e. more context. We link information that is similar in relevant details, e.g., results of the same kind of investigation by different authors, by links expressing Similarity relations.

In the case of Synthesis relations we deal with: a) Aggregation, expressed in links in which the source of the link is a component of the target, and b) Generalisation, where more-or-less the same concepts are grouped together (for instance in the case where, on the microscopic level, specific parameters of an apparatus are fully described in an Experimental Methods meso-module).

As a final example we identify the Causal relations in which clear cause and effect relations are covered.
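The typology of scientific discourse relations described in this section forms a hierarchy, so that a retrieval system could match a specific relation against a more general one. The Python sketch below encodes the relation names mentioned in the text as a child-to-parent mapping. The encoding, the root label "Scientific discourse" and the functions ancestors and is_a are our own assumptions, and the mapping covers only the relations named here, which are presented as examples.

```python
from typing import Dict, List

# Child -> parent, following the relation types named in this section.
PARENT: Dict[str, str] = {
    "Communicative function": "Scientific discourse",
    "Content": "Scientific discourse",
    "Elucidation": "Communicative function",
    "Argumentation": "Communicative function",
    "Explanation": "Elucidation",
    "Clarification": "Elucidation",
    "Definition": "Clarification",
    "Specification": "Clarification",
    "Dependency": "Content",
    "Transfer": "Dependency",
    "Elaboration": "Content",
    "Resolution": "Elaboration",
    "Context": "Elaboration",
    "Similarity": "Content",
    "Synthesis": "Content",
    "Aggregation": "Synthesis",
    "Generalisation": "Synthesis",
    "Causality": "Content",
}


def ancestors(relation: str) -> List[str]:
    """Return the chain of more general relation types, nearest first."""
    chain = []
    while relation in PARENT:
        relation = PARENT[relation]
        chain.append(relation)
    return chain


def is_a(relation: str, general: str) -> bool:
    """True if `relation` equals `general` or is a refinement of it."""
    return relation == general or general in ancestors(relation)


print(ancestors("Definition"))
print(is_a("Transfer", "Content"), is_a("Transfer", "Elucidation"))
```

With such a hierarchy, a reader looking for all Elucidation links would also retrieve links that were characterised more precisely as Definition or Explanation.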

Applicability of our model

We developed the model in conjunction with an analysis of a corpus of articles published by a single research group in the field of experimental molecular dynamics. However, a short inspection of examples of publications in other domains showed that modular structures for other types of publications could be derived from our model.

To test the model, we rewrote two strongly related articles from our corpus as modular electronic articles (demo in progress). Although the modular framework is explicitly intended for the creation and evaluation of new work, we found, recasting old work in the new mould, that modular electronic articles can meet our pre-defined requirements better than linear articles. In particular:

Acknowledgements

This work is part of the 'Communication in Physics' project of the Foundation Physica; it is financially supported by the Foundation Physica, the Shell Research and Technology Centre Amsterdam, the Royal Dutch Academy of Sciences, the Royal Library, and Elsevier Science NL.

Bibliographic references

 
(Garvey, 1979) W.D. Garvey, Communication: the essence of science - Facilitating information exchange among librarians, scientists, engineers and students. (Pergamon Press, Oxford, 1979)
 
(Meadows, 1998) A.J. Meadows, Communicating research. (Academic Press, San Diego, 1998)

(Van Eemeren et al., 1993) Eemeren, F.H. van, R. Grootendorst, Sally Jackson and Scott Jacobs, Reconstructing argumentative discourse. Studies in rhetoric and communication. (The University of Alabama Press, Tuscaloosa, 1993)

(Kircz, 1998) J.G. Kircz, Modularity: the next form of scientific information presentation? Journal of Documentation, Vol. 54, no. 2, March 1998, pp. 210-235. Electronic version: http://www.wins.uva.nl/projects/commphys/papers/jkmodul.htm

(Harmsze et al., 1996) F.A.P. Harmsze, M. van der Tol and J. Kircz, Naar een modulair model voor natuurwetenschappelijke informatie in elektronische artikelen. In: Informatiewetenschap 1996, Wetenschappelijke bijdragen aan de Vierde Interdisciplinaire Conferentie Informatiewetenschap (Delft, 13 december 1996). Van der Meer (Werkgemeenschap Informatiewetenschap, 1996), pp. 53-71. Electronic version: http://www.wins.uva.nl/projects/commphys/papers/delft/delft.htm

(Harmsze, 2000) F.A.P. Harmsze, A modular structure for scientific articles in an electronic environment, PhD thesis, to be published, Amsterdam 2000

(Van der Tol, 1999) M.C. van der Tol, The abstract as an orientation tool in modular electronic articles. To be published in the proceedings of the First International Conference on Document Design, Tilburg, December 17 and 18, 1998. Electronic version: http://www.wins.uva.nl/projects/commphys/papers/docdes/docdes.html