A modular structure for electronic scientific articles

F.A.P. Harmsze¹, M.C. van der Tol² and J.G. Kircz¹
Web site: http://www.wins.uva.nl/projects/commphys/home.htm
¹Van der Waals-Zeeman Institute, University of Amsterdam
Valckenierstraat 65, 1018 XE Amsterdam, The Netherlands
²Speech Communication, Argumentation Theory and Rhetoric, University of Amsterdam
Spuistraat 134, 1012 VB Amsterdam, The Netherlands

Abstract

We have developed a modular structure for electronic articles on experimental science. Modular articles consist of different types of explicitly characterised modules and explicitly characterised links expressing different types of relations. The modules can be located, retrieved and consulted both separately and in conjunction with other modules.

The project

At present, we face a revolution in the dissemination and handling of scientific papers. Most major publishers make an important share of their publications available via an Internet site, and many independent new initiatives are being launched. With a few exceptions, these electronic publications are in fact reproductions of paper-based products. As always with the introduction of a new technology, the first steps in a new era are characterised by the translation of the old methods and models into the new situation. Only when the intrinsic characteristics of the new technologies are fully appreciated do truly novel developments get a chance.

In the project 'Communication in Physics', we try to go a step further and propose a new model for the creation and evaluation of electronic scientific articles, taking into account the intrinsic features of the new medium, the requirements of adequate scientific communication as well as the societal traditions on which regular scientific communication is based. This model is intended to work in a fully electronic environment, where all papers are linked to each other and where new scientific contributions are added to the existing pool of papers in an organic way.

Rather than concentrating on the capability of present-day software, we choose an analytical approach. In other words, we design a new way of presenting scientific results, based on the assumption that appropriate software will become available in the foreseeable future. We analyse the role of articles in scientific communication following the standard literature (Garvey, 1979; Meadows, 1998). Subsequently, we draft a profile of the interactants in the communication process. Here, we rely on discourse and argumentation studies concerning rational communication (Van Eemeren et al., 1993). This way we are able to connect the characteristics of scientific articles with the various stages in the communication process. This leads to a series of specific requirements that electronic scientific articles have to satisfy to allow for effective and efficient communication. These requirements include: a) dissemination requirements, such as indexing and logistics tools as well as proper identification and registration of intellectual ownership and integrity of the work, and b) creation requirements, resulting in authoring tools and electronic templates.

Based on our analyses of the role of the article and the communication criteria in academia, we conclude that in an electronic environment the traditional linear essay form becomes obsolete and has to be replaced by a modular framework (Kircz, 1998).

In order to ensure that our model is grounded in scientific practice, we developed the model in conjunction with an analysis of a coherent corpus of printed articles in the field of experimental physics. In this analysis, we identify different types of information and relations in the corpus and re-organise that information in a novel, modular structure. We found that the modular structure indeed allows for the creation of scientific articles that meet the necessary requirements.

In an earlier presentation in this series of conferences, we gave an outline of our programme (Harmsze et al., 1996). In this contribution, we present the final model, which will soon be fully reported elsewhere (Harmsze, 2000). That thesis will specify the model in terms of instructions to authors and will provide recommendations for software implementation.

The framework

Because electronic media are suitable for multiple (re)usage and reshuffling of information units, as well as for additions of new components to published work, our guiding principle is 'modularity' (Kircz, 1998). We develop a structure for modular articles, based on the idea that an electronic article can be made up of well-defined modules and links that, following the SGML philosophy, can be identified with tags. In our modular framework, we define the modules that can represent the different types of information in an article. In order to guarantee and express the coherence of the information in and between different modules, we introduce a systematic way of linking the modules, both within the same article and between different publications. Thus, a modular article represents a sub-network of information within the network of all published information. In our model, both modules and links are explicitly characterised 'information objects' that can be handled using state-of-the-art database management and information retrieval techniques.

Modules

We define a module as a uniquely characterised, self-contained representation of a conceptual information unit that is aimed at communicating that information. Not its length, but the coherence and completeness of the information it contains makes it a module. Modules can be located, retrieved and consulted separately as well as in conjunction with related modules.

The relations between modules can be expressed not only in links, but also in the composition of elementary modules into higher-level, complex modules. We define a complex module as a module that consists of a coherent collection of (elementary or complex) modules and the links between them. Using a metaphor, elementary modules are 'atomic' entities that can be composed into a 'molecular' entity: a complex module.

We distinguish two types of complex modules: compound modules and cluster modules. In a compound module, related (albeit possibly dissimilar) modules are aggregated to form a new module on a higher level. An example of a compound module is the module 'Experimental methods', which is composed of lower-level modules representing the various components of a measuring device. In our corpus we encounter molecular beam apparatuses that have, as relatively independent components, one or more sources of a particle beam, a beam transport system, an interaction chamber and a detector. The central concept of a cluster module is the generalisation of the specific concepts that its constituent modules focus on. An example of a cluster module is the module 'Raw data', composed of various elementary modules reporting the results of the same general type of measurement involving different molecules.
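The composition of elementary modules into complex modules can be illustrated with a small data structure. The following Python sketch is an illustration only and not part of the specification of the model: the class names Module and ComplexModule, the field kind and the method elementary_modules are assumptions made for this example, and the component titles are taken from the molecular beam apparatus mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """An elementary ('atomic') module: one self-contained information unit."""
    title: str

    def elementary_modules(self) -> List["Module"]:
        # An elementary module has no constituents other than itself.
        return [self]


@dataclass
class ComplexModule(Module):
    """A 'molecular' module composed of elementary or complex modules.

    kind is 'compound' (aggregation of related, possibly dissimilar modules)
    or 'cluster' (generalisation over modules of the same general type).
    """
    kind: str = "compound"
    constituents: List[Module] = field(default_factory=list)

    def elementary_modules(self) -> List[Module]:
        # Walk the hierarchy recursively down to the elementary modules.
        found: List[Module] = []
        for part in self.constituents:
            found.extend(part.elementary_modules())
        return found


# Hypothetical compound module for a molecular beam apparatus.
experimental_methods = ComplexModule(
    title="Experimental methods",
    kind="compound",
    constituents=[
        Module("Beam source"),
        Module("Beam transport system"),
        Module("Interaction chamber"),
        Module("Detector"),
    ],
)
titles = [m.title for m in experimental_methods.elementary_modules()]
```

Because a complex module may itself contain complex modules, the recursive traversal also works when, for instance, this compound module is placed inside a higher-level Methods module.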

In order to be able to determine what is 'similar information' to be grouped together and represented in a self-contained module and, subsequently, in order to be able to determine how to tag the resulting module, we need an unambiguous typology of scientific information. Therefore, we introduce a typology by which we characterise the information from four complementary points of view. In this typology, we incorporate the characterisations from two classical points of view: the domain-oriented characterisation that can be expressed in keywords and the characterisation by specified bibliographic data. In addition, we introduce a characterisation by the range of the information and a characterisation by its conceptual function, i.e. by the role the information plays in the scientific problem-solving process.

By characterising information by its range, we can introduce so-called microscopic, mesoscopic and macroscopic modules. A microscopic module represents information that belongs to one particular article only, e.g., information concerning the specific problem addressed in that article. A mesoscopic module functions at the level of an entire research project; it is created for multiple use in several articles issuing from the same project. For example, information about the experimental set-up that has been used in a series of experiments can be represented in a mesoscopic module and connected to several articles reporting experimental results. A macroscopic module represents information that transcends the level of the research project; this type of firmly established information is given in, e.g., books and lecture notes.
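The four complementary characterisations, including the range, can be pictured as a record attached to each module. The sketch below is a minimal illustration under our own assumptions: the names Range, Characterisation, info_range and reusable_modules are hypothetical, and the example values are invented for demonstration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Range(Enum):
    MICROSCOPIC = "microscopic"  # belongs to one particular article
    MESOSCOPIC = "mesoscopic"    # shared by articles from one research project
    MACROSCOPIC = "macroscopic"  # established knowledge beyond the project


@dataclass
class Characterisation:
    keywords: List[str]                 # domain-oriented characterisation
    bibliographic_data: Dict[str, str]  # e.g. author and year
    info_range: Range                   # range of the information
    conceptual_function: str            # role in the problem-solving process


def reusable_modules(items: List[Characterisation]) -> List[Characterisation]:
    """Select the modules intended for use beyond a single article."""
    return [c for c in items if c.info_range is not Range.MICROSCOPIC]


# Invented example records.
modules = [
    Characterisation(["molecular beam"], {"author": "A. Author", "year": "1999"},
                     Range.MICROSCOPIC, "Central Problem"),
    Characterisation(["molecular beam", "detector"], {"author": "A. Author", "year": "1998"},
                     Range.MESOSCOPIC, "Experimental Methods"),
]
shared = [c.conceptual_function for c in reusable_modules(modules)]
```

In this example only the Experimental Methods record is selected, which corresponds to the case of a set-up description connected to several articles.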

Our main division in modules is based on the characterisation of the information by its conceptual function. Our starting point is the prototypical section structure of scientific papers: Introduction, Methods, Results, Discussion and Conclusions. This sequence represents the normal flow of a scientific narrative, but the way it is used in practice presupposes that the article will, indeed, be read sequentially from the beginning to the end. One of the main arguments in favour of modularity is that knowledgeable readers hardly read articles sequentially but browse through them, looking for useful bits and pieces. In our approach, we take that behaviour as our starting point and define our modules as entities that can be read independently. Thus, every module represents only one well-defined aspect of the article. Of course, this independence does not mean that one module is in general sufficient to understand the whole work. Modularity enables the reader to zoom in immediately on those aspects he/she is interested in. If so desired, the whole work, i.e., all the related modules and if needed the necessary related information presented in meso- and macroscopic modules, can be retrieved and read as if it were a traditional article.

From our corpus, we derive a list of distinctive conceptual functions. On the basis of these conceptual functions, we distinguish the following modules.

Figure 1: an overview of the modules

Positioning is a complex module consisting of the module Situation, describing the embedding of the work, and the module Central Problem, stating the why of the work in question. This complex module groups together all the information the reader needs about the background of the problem in question and the particular aspects dealt with in the article. Separating the two constituent modules allows the reader to make a choice: to read only the Central Problem in case he/she is conversant with the subject, or to be introduced to the background as well by reading both constituent modules. It is immediately clear that the module Situation, which reviews the embedding of the work, can be replaced by a pointer linking the work in question to a description elsewhere. Such an introduction is a typical kind of mesoscopic information. This way, the enormous redundancy of information presented in introductions of articles can be avoided. It goes without saying that the module Central Problem is an essential module, as it provides the intentions of the author of a particular article, given the context. For an informed reader, this module can play a decisive role in the decision to drop the article or to consult the rest of it as well.

Methods is a complex module that can be built up from separate modules representing the theoretical, experimental, and/or numerical methods employed. If an article is one of a series, a substantial part of the information about the methods can be represented in mesoscopic modules for multiple use; e.g., in a purely experimental article using a standard instrument and employing a standard theory, both the Experimental Method and the Theoretical Method can be described elsewhere. In fact, this is already often the case. However, the paper medium forces the author to repeat, in his/her own words, the description of the methods, whereas in an electronic environment a simple link suffices.

The complex module Results allows readers to inspect the results without reading the whole article, for example when only a particular number is sought. One of its two constituents is the module Raw Data. In printed articles, these data are hardly ever published, as that would require too much space. In an electronic environment, on the other hand, these data can become directly available to the reader. As a result, the reader is able to use the data without the preferred interpretation of the originator. This enables the reader to merge his/her own data directly with the presented data for comparison and analysis. It also allows different people to apply different methods of data reduction to the same data. The second constituent of the module Results is the module Treated Results. Here the raw data are handled according to the author's choice of data reduction and further treatment. The module Treated Results presents the smoothed data in the usual form of figures and tables, familiar from traditional journals.

The module Interpretation contains the core of the scientific reasoning in the article. Here, the author interprets the experimental results in the light of a theoretical model, for example, by comparing them with theoretical results and experimental results obtained by others. An important observation in our analysis is that it is this module that maintains most of the characteristics of a classical paper. One can argue that our procedure in fact strips the traditional article of those components that can be presented as independent entities. The remaining core, the real scientific reasoning, argumentation and conjectures, remains an essay-like text. It is this part, in fact representing knowledge rather than pure data or quantitative information, that is the most difficult to deal with.

Within the complex module Outcome, we distinguish a compulsory module Findings, in which the author tries to answer the central questions stated in the module Central Problem, and an optional module Leads to Further Research, in which ideas and suggestions for new work are expressed. A reader who wants to learn what happened, without the how and why, can simply consult the modules Findings and Treated Results.

Besides the conceptual modules, we define a module Meta-Information that comprises all traditional metadata. We mention two ingredients that are particularly important, given the complexity of a system of modules and links: 1) the Abstract, which in a modular environment has to be rethought, and 2) a clear graphical Map of contents. With regard to the Abstract, the main obstacle is that no clear theory is available about its role and content. The standard literature gives many dos and don'ts for writing an abstract, but no systematic work has been done to define the proper roles of an abstract as a representation of the underlying information. In a separate research programme, Maarten van der Tol is tackling this problem for a modular environment (Van der Tol, 1999).
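The conceptual modules described above, and the possibility of reading them either selectively or in sequence, can be sketched as follows. This is an illustrative sketch only: the dictionary STRUCTURE, the function reading_path and its skip parameter are assumptions made for this example, and the names of the three Methods constituents are our own shorthand for the theoretical, experimental and numerical methods.

```python
from typing import Dict, Iterable, List

# Complex modules mapped to their constituents; an empty list marks a
# module that is treated here as elementary.
STRUCTURE: Dict[str, List[str]] = {
    "Positioning": ["Situation", "Central Problem"],
    "Methods": ["Theoretical Methods", "Experimental Methods", "Numerical Methods"],
    "Results": ["Raw Data", "Treated Results"],
    "Interpretation": [],
    "Outcome": ["Findings", "Leads to Further Research"],
}


def reading_path(structure: Dict[str, List[str]],
                 skip: Iterable[str] = ()) -> List[str]:
    """Return module titles in sequential order, omitting those in skip.

    A complex module is replaced by its constituents, so the path lists
    only the modules that are actually read.
    """
    skipped = set(skip)
    path: List[str] = []
    for module, constituents in structure.items():
        for title in (constituents or [module]):
            if title not in skipped:
                path.append(title)
    return path


# The whole work, read as if it were a traditional article.
full_path = reading_path(STRUCTURE)

# A reader conversant with the subject who does not need the raw data.
short_path = reading_path(STRUCTURE, skip={"Situation", "Raw Data"})
```

The sketch relies on Python dictionaries preserving insertion order, so the sequence Positioning, Methods, Results, Interpretation, Outcome is kept.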

Links

In the present practice of hypertext linking, the relations between the linked objects are often left unclear to the reader. A standard hyperlink only indicates that the author has some relation in mind between, for example, a blue underlined word and something else. In a standard HTML-document full of links, we are directed from nowhere to everywhere and back.

In our modular structure, a link is defined as an explicitly characterised, directed connection between modules or parts thereof (e.g., words or sentences) that represents one or more kinds of relevant relation. Characterising links by the relations they express and by the modules they connect enables the reader, firstly, to make a well-considered choice whether or not to follow the link and, secondly, to take the links into account in the process of locating and retrieving relevant information. This way, a link becomes a proper information object with clear characteristics. In a retrieval situation, the reader can now search for modules as well as links, which enhances the whole disclosure process. For this reason, we also endow each link with the bibliographic data of the author who identified the relations and created the link. This makes it possible for a commentator on a modular article to add links to an already-published work. These links can strengthen the original work, but they can also challenge the results by, e.g., pointing to incompatible results of others. Thus, by endowing the object "link" with the traditional bibliographic data, we ensure the authenticity and priority of each information object when new links or modules are added to published work. Links and modules now have an equal standing.
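A link treated as an information object in its own right can be sketched as a record that carries its relations and the bibliographic data of its creator. The Python fragment below is a minimal illustration under our own assumptions: the class Link, the function find_links, the module identifiers and the creator names are all hypothetical, and the relation labels are borrowed from the typology of links in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Link:
    source: str                 # module (or part of a module) the link starts from
    target: str                 # module (or part of a module) the link points to
    relations: Tuple[str, ...]  # one or more explicitly characterised relations
    creator: str                # who identified the relations and created the link
    year: int                   # when the link was created


def find_links(links: List[Link],
               relation: Optional[str] = None,
               creator: Optional[str] = None) -> List[Link]:
    """Retrieve links by the relation they express and/or by their creator."""
    result = []
    for link in links:
        if relation is not None and relation not in link.relations:
            continue
        if creator is not None and link.creator != creator:
            continue
        result.append(link)
    return result


# Invented example: one link by the original author and one added later
# by a commentator pointing to results reported elsewhere.
links = [
    Link("article-1/Interpretation", "article-1/Treated Results",
         ("Dependency",), "Original Author", 1999),
    Link("article-1/Findings", "article-2/Findings",
         ("Similarity", "Argumentation"), "Later Commentator", 2000),
]
added_by_commentator = find_links(links, creator="Later Commentator")
```

Since the creator and date are stored on the link itself, links added after publication remain distinguishable from those of the original author.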

In our analysis, we identify different types of relation that are relevant in modular scientific articles, and formulate a typology for the links in the modular structure. We distinguish two main classes of relations: organisational relations and scientific discourse relations.

Organisational relations

In the class of organisational relations, which express the organisational coherence of the modular network, we distinguish the following six types of relation:

Figure 2: an overview of the organisational relations

hierarchical: an asymmetric relation between complex modules and their constituent modules,

proximity-based: a symmetric relation between linked modules expressing whether they are part of the same collection (in particular, the same article or set of articles),

range-based: an asymmetric relation expressing the difference in range between linked modules,

administrative: an asymmetric relation between conceptual modules and the module representing their meta-information,

sequential: an asymmetric relation between modules linked to form a complete reading path or an easier one,

representational: an asymmetric relation between different representations of the same information (e.g., between texts, tables and figures).

An important aspect of links based on organisational relations is that they can often be assigned semi-automatically, provided the authors have appropriate authoring tools at their disposal.
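How such links might be assigned semi-automatically can be illustrated for the hierarchical relation, which follows directly from the composition of complex modules. The sketch below is an assumption-laden illustration rather than a description of an existing authoring tool: the enumeration OrganisationalRelation, the function hierarchical_links and the choice to direct each link from the complex module to its constituent are all our own.

```python
from enum import Enum
from typing import Dict, List, Tuple


class OrganisationalRelation(Enum):
    """The six organisational relations, each with its symmetry."""
    HIERARCHICAL = ("hierarchical", False)
    PROXIMITY_BASED = ("proximity-based", True)
    RANGE_BASED = ("range-based", False)
    ADMINISTRATIVE = ("administrative", False)
    SEQUENTIAL = ("sequential", False)
    REPRESENTATIONAL = ("representational", False)

    def __init__(self, label: str, symmetric: bool) -> None:
        self.label = label
        self.symmetric = symmetric


def hierarchical_links(structure: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Derive (source, target, relation) triples from module composition.

    Assumption: the complex module is the source and the constituent
    module is the target of each hierarchical link.
    """
    label = OrganisationalRelation.HIERARCHICAL.label
    return [(parent, child, label)
            for parent, children in structure.items()
            for child in children]


derived = hierarchical_links({"Results": ["Raw Data", "Treated Results"]})
symmetric_relations = [r.label for r in OrganisationalRelation if r.symmetric]
```

The author only has to declare the composition of the modules; the corresponding hierarchical links then follow without further input.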

Scientific discourse relations

The second main class of relations, the scientific discourse relations, allows authors to indicate why they refer to another module or to another part of the same module. Following speech communication research, we arrive at two subclasses of scientific discourse relations. One subclass is based on the communicative function; the other consists of Content relations between two relata.

Figure 3: an overview of the content relations

1. Communicative function relations

The two basic aims of the author are to increase the reader's understanding of the message or to increase his/her acceptance of it. In order to understand or accept a module, readers may need additional information, for instance about the causes of a certain phenomenon. The author can make that information available to the readers by means of a link. The target of the link then consists of, e.g., a figure, a statement or a whole module, which has a particular communicative function with respect to the source of that link, for instance that of an explanation. Hence, this asymmetric relation can be made explicit by the characterisation of the link.

In practice, we can often easily make a distinction between Elucidation links and Argumentation links. In the case of elucidation, the aim is to increase the reader's understanding. Within the Elucidation relations, we make a further distinction between Explanation and Clarification. An explanation is given when the author anticipates that part of the intended readership will not understand how a particular state of affairs has come into being. When the author anticipates that part of the intended readership will not understand what he/she means by a particular text or figure, he/she will make a clarification available in the module or through a link to another module. A further refinement is then possible between a Definition relation and a Specification relation. Thus, the author can, for instance, connect a difficult term to an "encyclopaedic" macroscopic module by a link expressing a Definition relation.

In the argumentative case, the aim is to increase the reader's acceptance of a standpoint. These are cases where the author can presume that not every reader of the intended readership will immediately accept a particular statement.

2. Content relations

The second subclass of scientific discourse relations comprises Content relations, such as Dependency, Elaboration, Similarity, Synthesis and Causality.

The Dependency relation is an asymmetric relation between steps in the problem-solving process of the reported research. A link can express the fact that the source depends on the target in the way in which, for instance, results depend on the methods used to generate them. A special case is the Transfer relation, in which items are taken from one module and included in another. This is often the case with mathematical formulae or values that are used as input in calculations.

With an Elaboration relation, we indicate an asymmetric relation where the target contains an elaboration of the statement in the source. For example, a mesoscopic sketch of the Situation can provide more information than a short statement in a Situation module at the microscopic level. Within this class, we can make a further distinction between Resolution relations, which point to more fine-grained information, i.e. more details, and Context relations, which point to broader accounts of the subject, i.e. more context. We link information that is similar in relevant details, e.g., results of the same kind of investigation by different authors, by links expressing Similarity relations.

In the case of Synthesis relations, we deal with: a) Aggregation, expressed in links in which the source of the link is a component of the target, and b) Generalisation, where more or less the same concepts are grouped together (for instance in the case where, on the microscopic level, specific parameters of an apparatus are fully described in an Experimental Methods meso-module).

As a final example we identify the Causal relations in which clear cause and effect relations are covered.
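The typology of scientific discourse relations described in this section can be summarised as a small tree. The following Python sketch is a reading aid only: the nested dictionary DISCOURSE_RELATIONS and the function path_to are assumptions made for this example, and the tree contains only the relations named in the text, so it should not be taken as a complete list.

```python
from typing import Dict, List, Optional, Tuple

# Each key is a relation; its value holds the refinements named in the text.
DISCOURSE_RELATIONS: Dict[str, dict] = {
    "Communicative function": {
        "Elucidation": {
            "Explanation": {},
            "Clarification": {"Definition": {}, "Specification": {}},
        },
        "Argumentation": {},
    },
    "Content": {
        "Dependency": {"Transfer": {}},
        "Elaboration": {"Resolution": {}, "Context": {}},
        "Similarity": {},
        "Synthesis": {"Aggregation": {}, "Generalisation": {}},
        "Causality": {},
    },
}


def path_to(name: str, tree: Dict[str, dict],
            prefix: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Return the chain of relations from the top of the tree down to name."""
    for label, subtree in tree.items():
        path = prefix + (label,)
        if label == name:
            return list(path)
        found = path_to(name, subtree, path)
        if found is not None:
            return found
    return None


definition_path = path_to("Definition", DISCOURSE_RELATIONS)
transfer_path = path_to("Transfer", DISCOURSE_RELATIONS)
```

A retrieval system could use such a path to match a link labelled with a specific relation, such as Definition, when the reader asks for the more general Elucidation links.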

Applicability of our model

We developed the model in conjunction with an analysis of a corpus of articles published by a single research group in the field of experimental molecular dynamics. However, a short inspection of examples of publications in other domains showed that modular structures for other types of publications could be derived from our model.

To test the model, we rewrote two strongly related articles from our corpus as modular electronic articles (demo in progress). Although the modular framework is explicitly intended for the creation and evaluation of new work, we found, in recasting old work in the new mould, that modular electronic articles can meet our pre-defined requirements better than linear articles. In particular:

The possibility of multiple usage enhances the author's efficiency.
The explicit labelling of modules and links allows for better information retrieval.
The reader can selectively locate, retrieve and consult precisely those parts of the published works that are relevant, so that the reader's efficiency is increased.
As the modular structure is more systematic and explicit, modular publications can be clearer than linear ones.

Acknowledgements

This work is part of the 'Communication in Physics' project of the Foundation Physica; it is financially supported by the Foundation Physica, the Shell Research and Technology Centre Amsterdam, the Royal Dutch Academy of Sciences, the Royal Library, and Elsevier Science NL.

Bibliographic references

(Garvey, 1979) W.D. Garvey, Communication: the essence of science - Facilitating information exchange among librarians, scientists, engineers and students. (Pergamon Press, Oxford, 1979)

(Meadows, 1998) A.J. Meadows, Communicating research, (Academic Press, San Diego, 1998)

(Van Eemeren et al., 1993) F.H. van Eemeren, R. Grootendorst, S. Jackson and S. Jacobs, Reconstructing argumentative discourse, Studies in rhetoric and communication (The University of Alabama Press, Tuscaloosa, 1993)

(Kircz, 1998) J.G. Kircz, Modularity: the next form of scientific information presentation? Journal of Documentation, Vol. 54, no. 2, March 1998, pp. 210-235. Electronic version: http://www.wins.uva.nl/projects/commphys/papers/jkmodul.htm

(Harmsze et al., 1996) F.A.P. Harmsze, M. van der Tol and J. Kircz, Naar een modulair model voor natuurwetenschappelijke informatie in elektronische artikelen. In: Informatiewetenschap 1996, Wetenschappelijke bijdragen aan de Vierde Interdisciplinaire Conferentie Informatiewetenschap (Delft, 13 december 1996). Van der Meer (Werkgemeenschap Informatiewetenschap, 1996), pp. 53-71. Electronic version: http://www.wins.uva.nl/projects/commphys/papers/delft/delft.htm

(Harmsze, 2000) F.A.P. Harmsze, A modular structure for scientific articles in an electronic environment, PhD thesis, to be published, Amsterdam 2000

(Van der Tol, 1999) M.C. van der Tol, The abstract as an orientation tool in modular electronic articles. To be published in the proceedings of the First International Conference on Document Design, Tilburg, December 17 and 18, 1998. Electronic version: http://www.wins.uva.nl/projects/commphys/papers/docdes/docdes.html