LWW Assessment

Cliff Joslyn
February, 1999

Introduction

This report is intended to provide a unified overview of the resources, environments, and facilities of the Library Without Walls (LWW). This information is intended to support the future design of appropriate methods to integrate software developments in the Active Recommendation Systems for a Library Without Walls project (which we will call the Active Recommendation Project (ARP)) into the LWW.

Below we consider the following issues:

There are a few items which are incomplete in this report, in particular a detailed consideration of the content crossover among the various LWW corpi. Also, we rely on information provided externally on the LWW network architecture and server configurations.

Database Formats

It will be useful first to identify and describe the various database environments used within the LWW.

Advance

Advance is a database system maintained and enhanced by the LWW, based on legacy systems used to support library cataloging with MARC format records. It is based on PIC, with some C code. PIC is a legacy hybrid OS-database environment which is currently supported in an emulation mode within the Universe product running on UNIX. In this context, it appears as a flat-file system with relational capabilities. An SQL interface is available within Universe. Implementation is in terms of a great many separate UNIX executables with complex interactions. Advance is optimized for variable-length text storage, and provides direct (machine-level) access to data records. Data typing is partial and dynamic, based on text fields. Numeric and boolean types are supported contextually based on available operations.

Within Advance, documents are identified with DocIDs (below).

The Advance architecture is as follows. HTML form information is passed through UNIX pipes to the WebZ gateway, which then parses it and constructs a Z39.50 query. This is then sent in Z39.50 format to Advance, and in particular to the Z39.50 server which is a portion of Advance. Other portions of Advance then parse and execute the search, return DocIDs, format MARC records, retrieve URLs, etc. This information is then passed back to the Z39.50 server, sent in Z39.50 format back to WebZ, which then formats an HTML response.

Each Advance corpus has its own set of indexed fields. Some (e.g. title) are common to all corpi, others (e.g. taxonomic classification in Biosis) are specialized to particular corpi. Some (e.g. subject) differ semantically among corpi, but can be identified with each other. To represent this, 22 common index "slots" are available, with each specialized index term mapped into a particular slot. The following table shows this mapping.
 

Index Slot Main OPAC BIOSIS DOE Energy Eng Index INSPEC
A Author Author Author Author Author
B Organization Assignee Organization Sponsor Sponsor
D Conference Conference Conference Conference Conference
E TOC author Institution Affiliation Affiliation Institution
F TOC title Journal Journal Journal
G Date Date Date Date Date
H Publisher Publisher Publisher Publisher Publisher
I Format Doc Type Doc Type Doc Type Doc Type
J Genre Material Affiliation
K TOC general
L Language Language Language Language Language
N Note Note Note
O Abstract Abstract Abstract Abstract Abstract
P Series Sequence data Assignee
Q Journal Issue Journal Issue Issue
R Url Classification
S Subject Subject Patent Subject
T Title Title Title Title Title
U Taxonomy Major Subject Subject FreeTerm
V Minor Subject Descriptor Numeric data
W Subject Concept Classification Chemical
Y Subject Other FreeTerm Astronomy
Advance Index Slots
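
The slot mechanism can be illustrated with a small lookup structure. This is only a sketch: it copies the few rows of the table above that list an entry for every corpus (so the cell boundaries are unambiguous), and the dictionary layout and function names are assumptions made for illustration, not part of Advance.

```python
# Corpus order follows the column headings of the slot table.
CORPORA = ["Main OPAC", "BIOSIS", "DOE Energy", "Eng Index", "INSPEC"]

# Rows copied from the table, reading the cells left to right.
# The dict-of-lists layout is an assumption for this sketch.
SLOT_FIELDS = {
    "A": ["Author", "Author", "Author", "Author", "Author"],
    "B": ["Organization", "Assignee", "Organization", "Sponsor", "Sponsor"],
    "E": ["TOC author", "Institution", "Affiliation", "Affiliation", "Institution"],
    "T": ["Title", "Title", "Title", "Title", "Title"],
}


def field_for(slot, corpus):
    """Return the corpus-specific index name held in a common slot."""
    return SLOT_FIELDS[slot][CORPORA.index(corpus)]


def fields_across_corpora(slot):
    """Map each corpus to the index name it stores in the given slot."""
    return dict(zip(CORPORA, SLOT_FIELDS[slot]))
```

A query against slot B would thus search the Assignee index in BIOSIS but the Sponsor index in INSPEC, while a query against slot T reaches the Title index everywhere.
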
For each of the 5,000 most frequently queried words, keylists are prepared in advance: one per word for each of a number of collections of slots. Each keylist contains a list of the documents containing the word, in DocID format. Keylists are generally updated every week.

Single-word queries first check for pre-made keylists, otherwise the keylist is constructed at query time. Multiple-word queries are broken into multiple single-word queries, whose results are then combined with standard UNIX text manipulation commands (e.g. sort, merge, uniq).
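
The query flow described above can be sketched as follows. This is a minimal illustration, not the Advance implementation: the in-memory dictionaries, the function names, and the choice between intersection and union when combining results are all assumptions (the report states only that results are combined with UNIX text tools).

```python
def build_keylist(word, documents):
    """Scan documents (DocID -> text) and return sorted DocIDs containing word."""
    return sorted(
        doc_id for doc_id, text in documents.items()
        if word in text.lower().split()
    )


def keylist_for(word, premade, documents):
    """Use a pre-made keylist if one exists, otherwise build it at query time."""
    if word in premade:
        return premade[word]
    return build_keylist(word, documents)


def combine(keylists, mode="and"):
    """Combine keylists: 'and' keeps DocIDs present in every list
    (comparable to sort | uniq -d for two lists), 'or' keeps all DocIDs
    once (comparable to sort -u)."""
    sets = [set(keylist) for keylist in keylists]
    if not sets:
        return []
    combined = set.intersection(*sets) if mode == "and" else set.union(*sets)
    return sorted(combined)


def search(query, premade, documents, mode="and"):
    """Break a multiple-word query into single-word queries and combine them."""
    words = query.lower().split()
    return combine([keylist_for(w, premade, documents) for w in words], mode)


# Hypothetical sample data for illustration only.
DOCUMENTS = {
    10000001: "neutron scattering data",
    10000002: "neutron star",
    10000003: "light scattering",
}
PREMADE = {"neutron": [10000001, 10000002]}
```

With this sample data, `search("neutron scattering", PREMADE, DOCUMENTS)` uses the pre-made keylist for the first word and builds the second at query time.
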

Topic/Verity

Topic is the name of a database product of the Verity corporation. Verity is phasing this name out, so we will refer to the product as "Verity" only. Verity provides a hierarchically structured database which is highly optimized for free text retrieval, and supports some relatively sophisticated Boolean capabilities for classical information retrieval.

Topic is structured as follows.

Interface with Verity is through an API. Currently, a LANL product, Explorer, is the interface. The SciSearch interface passes HTML form information to Explorer, which constructs Verity API calls, queries the Verity engine, retrieves the results, and formats HTML for output.

This HTML/Explorer/Verity architecture is not unique to the LWW, but rather is used lab-wide in a number of different Explorer flavors for a number of different support environments. In the LWW, Verity is currently specialized to the SciSearch database from ISI (below).

Verity is highly optimized for text storage and retrieval, and retrieval of non-indexed text fields is feasible from any collection. Support for numeric data is weak; for example, there is no indexing. Similarly, there is no relational information. Further, no linking capability is available among Verity collections.

ScienceServer/MPS

This is a proprietary database originally produced by the Orion corporation, and supplemented by front-end routines called the MPS Information Server, and indexing routines provided by FSConsulting. It serves the EJournals (below). We will call this the MPS system.

Similar to Verity, MPS is a dedicated information retrieval product on a proprietary database format. It supports a variety of boolean and other search capabilities (see http://www.fsconsult.com/products/mps-server.html).  The LWW currently has a binary license for MPS, but will be receiving a source license shortly.

Oracle

The above environments are currently available in the LWW. The ARP may require access to an ODBC-compliant relational environment such as Oracle.

There are currently plans to bring up an instance of Oracle on a PC from Sandia National Laboratories in order to provide a dedicated environment on which SciSearch interaction with the VxInsight visualization product can be demonstrated. This machine will generally not be available to the ARP.

Alternatively, it has been suggested that an Oracle instance might be made available on the Sun Ultra 3000 soon-to-be development machine (see Server Configurations below), or on a new Linux machine.

Components of the LWW "Meta-Base"

The LWW currently consists of a number of distinct components, which we will call corpi. A major task of Phase III of the LWW, and of the ARP, is to try to integrate these corpi in various ways.

General Issues

For each of the components, we focus on a number of key issues:

Online Catalog

The main "card catalog" for the library, called the OPAC, is held in an Advance database which is Z39.50 compliant for inter-library use. It has a special Web interface emulating Explorer. It is currently intended for the OPAC to remain as a legacy Advance corpus.

OPAC contains records for listings for all materials accessible through the Research Library which have MARC records. This includes:

These listings are at the "title" (i.e. journal title) rather than "article" (i.e. article title) level. Thus each of the other LWW corpi below has a single entry, as does the OPAC corpus itself.

As a database of MARC records, the OPAC benefits from keywords and from the controlled vocabulary maintained by the Library of Congress (LC) system. Thus most ISBN and ISSN entries have corresponding LC numbers.

Scientific Catalogs

The LWW maintains a variety of catalogs of bibliographic, abstract, and indexing information provided by third parties. They are all currently in the Advance format, although Biosis is in the process of being converted to Verity. It is intended that all Advance databases will eventually be converted to Verity.

Inspec

From the web page:
"INSPEC (Information Service in Physics, Electrotechnology and Control) is the leading English-language citation database for the world's literature on all aspects of physics, electronics, and computing. INSPEC scans papers from approximately 4,200 journals, 1000 conferences, and other publications. INSPEC corresponds to the print publications Physics Abstracts, Electrical and Electronics Abstracts, and Computer and Control Abstracts.

Subjects: Astronomy; Chemistry; Computer Science; Engineering; Earth Sciences; Mathematics; Nuclear Information; Physics; Science (General/Popular)."

Inspec is provided by IEEE, and includes books, journals, conferences, and some grey literature. Keywords are provided initially by authors and editors, and matched against the IEEE hierarchical knowledge taxonomy to provide a controlled vocabulary. Matches are identified, and non-matches are included as "free terms". Access is restricted to LANL and UNM.

Biosis

From the web page:
"Citations, with abstracts, from Biological abstracts and Biological abstracts/RRM (reports, reviews, meetings). BIOSIS indexes approximately 6500 journals and 2000 meetings/year, as well as books and other materials. Subject coverage includes biological and biomedical sciences, botany, biochemistry, biophysics, biotechnology, medicine, public health, radiation biology, ecology and the environment.

Subjects: Biology/Genetics; Environment"

Contains biological material only, as opposed to medical. Includes books, journals, proceedings, and some grey literature. Keyword control is similar to Inspec. Indexing terms are quite elaborate, including taxonomic classification, sequencing data, etc. Access is limited to LANL and Stanford.

Engineering Index (EngInd)

From the web page:
"Engineering Index provides worldwide coverage of approximately 4,500 journals, and numerous conference proceedings, reports, books and dissertations. Over half a million records are for papers in published proceedings. Subjects covered are civil, energy, environmental, geological, and biological engineering; electrical, electronics, and control engineering; chemical, mining, metals, and fuel engineering; mechanical, automotive, nuclear, and aerospace engineering; and computers, robotics, and industrial robots. Engineering Index at LANL corresponds to the print publication Engineering Index, with additional coverage of conference papers and proceedings.

Subjects: Engineering; Earth Sciences; Nuclear Information; Reports"

Provided by Elsevier. Includes books, journals, proceedings, and some grey literature. Access is limited to LANL.

DOE Engineering (DOEng)

From the web page:
"A multi-disciplinary database containing worldwide references to basic and applied scientific and technical research literature. The primary focus is on energy and related topics. The database includes references to publications provided by the U.S. Department of Energy, its contractors, and other government agencies; also information from the International Energy Agency's Energy Technology Data Exchange (ETDE) and the International Atomic Energy Agency's International Nuclear Information System (INIS). Approximately half of the references are from sources outside the United States. Abstracts are included for records from 1976 to the present. Approximately half of the references are to journal literature and 25% to technical report literature.

Subjects: Biology/Genetics; Chemistry; Computer Science; Defense; Environment; Engineering; Mathematics; Nuclear Information; Patents; Physics; Reports"

Provided by DOE OSTI. Derived from the old Nuclear Science Abstracts (through 1972). Focuses primarily on DOE research reports, with some limited journal coverage. The keyword vocabulary is controlled from a thesaurus and a hierarchical classification. Free terms are maintained. Access is limited to LANL.

SciSearch

This is the premier citation database available for scientific literature, and is provided by the Institute for Scientific Information (ISI). SciSearch holds abstract and indexing information about a broad and deep collection of scientific journals, and most importantly, cross-referencing information about citations among published papers.

The Social SciSearch database is coming online soon, and holds a similar collection of social scientific literature. Here we focus on regular SciSearch only.

Keywords come in two varieties:

SciSearch is available to LANL staff and to a variety of partners (see http://scisearch2.lanl.gov).

SciSearch Architecture and Loads

Weekly SciSearch loads proceed as follows:

Note that:
  • Information on which articles cite a given article is stored in the Verity Collections only, and not in the text files.
  • Conversely, detailed reference information on the articles cited in a particular article is stored only in the text files, whereas in the Verity collection this information is represented only indirectly through a list of appropriate CitationIDs, which can then be queried iteratively.
  • Communication from the Verity Collections back to the text files is facilitated by a Verity structured field holding, for each article, its text file name and byte offset.
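
The last point above, locating an article in a text file from a stored file name and byte offset, can be sketched as follows. The record terminator (a blank line) and the function names are assumptions for illustration; the actual layout of the SciSearch text files is not described in this report.

```python
import os
import tempfile


def read_record(path, offset, terminator=b"\n\n", chunk_size=4096):
    """Return the bytes of one record starting at the given byte offset.

    The record is assumed to end at the first terminator after the offset,
    or at end of file if no terminator is found.
    """
    buffer = b""
    with open(path, "rb") as handle:
        handle.seek(offset)  # jump straight to the article, no scanning
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return buffer
            buffer += chunk
            end = buffer.find(terminator)
            if end != -1:
                return buffer[:end]


def demo():
    """Write a two-record file and fetch the second record by offset."""
    data = b"article one\nrefs for one\n\narticle two\nrefs for two\n\n"
    offset = data.index(b"article two")  # stands in for the stored offset
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return read_record(path, offset)
    finally:
        os.remove(path)
```

Storing the pair (file name, offset) in the index means the detailed reference information never has to be duplicated in the Verity collection; one seek retrieves it on demand.
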

    Historical Queries

    Queries sent to SciSearch are stored for historical purposes in human-readable, ASCII log files containing the following information:

    Note that the range of years requested in the search is not available.

    Standing alerts are not logged distinctly, but information about them is available. In particular, every week standing alerts generate appropriate standard queries, which are then recorded in the log files. Queries generated from alerts can be distinguished from queries generated manually by the fact that they include ranges of internal Verity record identifiers used to limit queries to the weeks in question.

    Electronic Journals

    The Electronic Journals (EJournals) portion of the LWW provides desktop access to the content of scientific journals. Journals from a number of publishers are stored locally, while the others are linked to externally. Here we focus on locally stored journals, since access to them will be less problematic for the ARP.

    The LWW has entered into a close relationship with Elsevier (not actually a publisher, but rather a distributor for multiple publishers such as Pergamon), other publishing companies, and FSConsult, in order to provide desktop access to a wide variety of electronic journals. These parties all have a mutual interest in working together to develop these capabilities. We will focus specifically on Elsevier here, which is the major effort, but Academic Press is following behind.

    On a weekly basis, CDs are received from the publisher. A TOC file (in a tagged flat-file format) contains abstract information per issue on the CD, including authors, title, keywords, etc. Papers are available in PDF, SGML, and raw text. LWW code runs an indexing routine (both the TOC and the document text are indexed), creates SICIs, and updates the SICI database (see below).

    No citation information is available in indexed form, although in principle it can be extracted at least from the raw text copies of the papers, if not the tagged SGML versions.

    A subject specialist LWW librarian makes LANL subject (category) classifications of all EJournals. This is how they are grouped on the web page.

    EJournals are restricted to LANL and the Air Force Research Library, although discussions are ongoing with others (e.g. Stanford).

    xxx.lanl.gov

    This database represents a very innovative project in electronic publishing and communication among working scientists, with the potential to point the way towards a revolution in the sociology of science. We anticipate being able to have more information about this corpus, and how it might interact with the LWW, in the near future.

    Other Corpi

    Unclassified Publications

    From the web page:
    "The Los Alamos Unclassified Publications Database is a multi-disciplinary database  containing references to basic and applied scientific and technical literature authored by the  staff of Los Alamos National Laboratory, its contractors, and collaborators."
    This corpus currently exists in a hybrid Explorer-Advance database where a particular instance of Explorer is used to access an Advance database.

    LANL Technical Reports

    A corpus of classified reports exists in a secure vault in the library. It is another instance of an Advance interface to a collection of PDF documents.

    Patents

    This corpus currently exists in the Barn Owl network tool. It is not currently accessible to users, and has no GUI.

    Citation Analysis

    Yearly reports are generated from SciSearch concerning the following issues:

    These are manipulated in an Excel spreadsheet and converted to HTML for user access.

    Summary

    Summary information is provided below for relevant corpi.
    Database                  | Format   | Keywords                    | Size              | Update/interval (avg)      | Started | Searches/wk (avg)
    OPAC                      | Advance  | Library of Congress         | 250 KRecs, 3 GB   | Continuous                 | 1800's  | 7000
    Inspec                    | Advance  | IEEE: Controlled and Free   | 6.2 MRecs, 64 GB  | 6 KRecs/wk, 35 MB/wk       | 1969    | 700
    Biosis                    | Advance  | Biosis: Controlled and Free | 11.6 MRecs, 95 GB | 8 KRecs/1.3wk, 50 MB/1.3wk | 1969    | 1000
    EngInd                    | Advance  | **                          | 4.3 MRecs, 38 GB  | 4 KRecs/wk, 15 MB/wk       | 1969    | 400
    DOEng                     | Advance  | OSTI: Controlled and Free   | 3.8 MRecs, 38 GB  | 6 KRecs/2wk, 20 MB/2wk     | 1940's  | 250
    SciSearch                 | Verity   | Authors and Keywords+       | 16.5 MRecs, 50 GB | 20 KRecs/wk, 50 MB/wk      | 1991    | 5200 (LANL), 9800 (other)
    EJournals                 | MPS      | Authors                     | ?, 300 GB         | ?, 1 GB/wk                 | 1994    | 100 Journals, 1000 Articles
    Unclassified Publications | Advance  | Partial: DOE or LC          | 54 KRecs, 500 MB  | Irregular                  | 1943    | **Irma
    LANL Patents              | Barn Owl |                             |                   |                            |         |
    The LWW Meta-Base.

    Current and Anticipated Linkages and Crossover

    Content Crossover

    Here we consider the amount of content which is shared among the Advance scientific corpi, the EJournals, and SciSearch. This is currently rather difficult information to obtain. LWW staff have been able to provide a few rough estimates based on their particular experience, and it will be possible in the future to gather analytical information about, for example, the number of journal titles which are held in common among the different corpi, and possibly the number of article titles. For the present, however, this section is a skeletal and incomplete representation of information which will await future analysis.

    Generally, however, we are informed by the following considerations.

    Specifically, we observe the following:

    When completed, the following table will show rough estimates (qualitative or quantitative) by LWW staff of the amount of data which is held redundantly in multiple corpi. An entry in a cell indicates the amount of the corpus in that row which is held within the corpus of that column. This is generally an asymmetric relation.
     
    Inspec Biosis EngInd DOEng SciSearch EJournals
    Inspec 20%
    Biosis 20% Moderate
    EngInd
    DOEng
    SciSearch 90%
    EJournals 60%
    Estimated Overlaps among Corpi
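
The asymmetry of the overlap relation can be made concrete with a short sketch. The corpus contents below are hypothetical placeholders, not LWW data, and the function name is an assumption; the point is only that the fraction of a row corpus held within a column corpus changes when the two are swapped.

```python
def overlap_fraction(row_corpus, column_corpus):
    """Fraction of the row corpus's items that also appear in the column corpus."""
    row = set(row_corpus)
    if not row:
        return 0.0
    return len(row & set(column_corpus)) / len(row)


# Hypothetical journal identifiers for a small and a large corpus.
SMALL = {"j1", "j2"}
LARGE = {"j1", "j2", "j3", "j4", "j5", "j6", "j7", "j8"}
```

Here all of `SMALL` is contained in `LARGE`, but only a quarter of `LARGE` is contained in `SMALL`, which is why each cell of the table must be read in the row-to-column direction.
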

    Database Crossover

    Here we consider physical movements which either exist or are anticipated among the LWW corpi.

    Importing xxx.lanl.gov

    There are currently plans to integrate xxx.lanl.gov with the LWW. This project will begin in 1Q 1999, and will be carried out by staff who will work in conjunction with the ARP.

    Movement to Verity

    Plans are currently to migrate the Advance databases in stages to Verity. The first candidate is Biosis, which will likely be available in Verity in either 1Q or 2Q 1999.

    Document Reference Crossover

    In order to facilitate crossover among corpi, it is necessary to have unique keys which can identify documents existing in multiple corpi. We first describe the references available, and then describe their relations.

    Document References

    Digital Object Identifier (DOI)
    DOIs are being advanced by the publishing industry to support universal references to electronically published documents. They act effectively as URLs down to the directory level, with final access to the article level unspecified. Although DOIs are available in Inspec, they are not currently supported by the LWW.
    Serial Item Contribution Identifier (SICI)
    SICIs are currently the most universally available inter-corpus reference. Multiple SICI formats for different publication types are defined in an ANSI standard established as a cooperative venture involving the academic and the digital library communities.

    SICIs using the "article within serial publication" format are constructed at load-time in all of the Advance, EJournal, and SciSearch databases. The syntax used in the Advance and EJournal databases is:

    ISSN(yyyy[mm[dd]])vol[:issue[:part]]<page:title_code>control check_digit
    where:

    It is regrettably not possible to match full SICIs from queries into the database, due to inconsistencies in the publisher-supplied databases. In order to provide unambiguous matches, it is actually necessary to construct one full and five partial SICIs for each document, as follows:

    SICIs are not currently indexed in Advance.

    Due to the limitations of indexing information provided by ISI, SciSearch maintains its own format of the SICI, with syntax:

    ISSN()vol[:issue[:part]]<page:>
    where the fields are as above. This is due to be updated, but currently only 75% of SciSearch SICIs match the corresponding non-SciSearch SICIs.
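
The relation between the two SICI syntaxes above can be sketched with a small parser. This is an illustration under stated assumptions: the regular expression follows the syntax lines given in this section, but the character classes for each field, the treatment of everything after the closing angle bracket as a single control segment, and the function names are assumptions. The sample string in the usage note is illustrative and not taken from an LWW record.

```python
import re

# Pattern for: ISSN(yyyy[mm[dd]])vol[:issue[:part]]<page:title_code>control...
# Field character classes are assumptions made for this sketch.
SICI_PATTERN = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])"
    r"\((?P<date>\d{4}(?:\d{2}(?:\d{2})?)?)\)"
    r"(?P<vol>[^:<]+)"
    r"(?::(?P<issue>[^:<]+))?"
    r"(?::(?P<part>[^:<]+))?"
    r"<(?P<page>[^:>]*):(?P<title_code>[^>]*)>"
    r"(?P<control>.*)$"
)


def parse_sici(sici):
    """Split a full SICI into named fields, or return None if it does not match."""
    match = SICI_PATTERN.match(sici)
    return match.groupdict() if match else None


def to_scisearch_form(sici):
    """Reduce a full SICI to the SciSearch syntax ISSN()vol[:issue[:part]]<page:>."""
    fields = parse_sici(sici)
    if fields is None:
        return None
    enumeration = fields["vol"]
    for key in ("issue", "part"):
        if fields[key]:
            enumeration += ":" + fields[key]
    return "{}(){}<{}:>".format(fields["issn"], enumeration, fields["page"])
```

For example, `to_scisearch_form("0095-4403(199502)21:3<12:WATIIB>2.0.TX;2-J")` drops the date, title code, and control segment, leaving only the fields that the SciSearch syntax carries. Matching on such a reduced form is one way to compare the two families of SICIs; whether it corresponds to any of the partial SICIs actually built by the LWW is not established here.
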
    DocIDs
    Documents in the Advance corpi are identified uniquely by an eight-digit DocID, beginning at 10,000,000, arbitrarily assigned. DocIDs can be repeated among corpi, meaning that it is the combination <Corpus,DocID> which uniquely identifies a particular document. Furthermore, the same document contained in different corpi is likely to have a different DocID in each corpus.
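
A brief sketch of why the pair, rather than the DocID alone, must serve as the key. The record contents and the dictionary layout are hypothetical.

```python
# Hypothetical records: the same DocID appears in two corpi
# and refers to two unrelated documents.
RECORDS = {
    ("Inspec", 10000001): "record held in Inspec",
    ("Biosis", 10000001): "record held in Biosis",
}


def lookup(corpus, doc_id):
    """Fetch a record by the composite <Corpus,DocID> key, or None if absent."""
    return RECORDS.get((corpus, doc_id))


def corpi_holding(doc_id):
    """List the corpi in which a given DocID value occurs."""
    return sorted(corpus for corpus, held_id in RECORDS if held_id == doc_id)
```

Any ARP table that stores references to Advance documents would therefore need a corpus column alongside the DocID.
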
    Citation ID
    SciSearch has its own internal identifiers, with the syntax:
    lname initials # yyyy # journal_char vol # page
    where:

    CitationIDs are used by SciSearch to match citations against other entries available in SciSearch.
    URLs
    URLs are unique identifiers within the EJournals.

    When the EJournals are loaded, a "SICI database" is updated, holding <SICI,URL> pairs, with six entries for each URL.

    URL information for journals which are available remotely to the LWW is also available. A table holds publisher-specific URL templates which can be instantiated in predictable ways using partial SICI information (e.g. ISSN and enumeration).

    Both of these tables are made available to Advance, and then sent to SciSearch.
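
The two tables described above can be combined into a single URL resolution step, sketched below. Everything concrete here is an assumption for illustration: the sample SICIs and URLs are invented, the templates are keyed by ISSN rather than by publisher for brevity, and Python format strings stand in for whatever template notation the LWW actually uses.

```python
# Hypothetical SICI database: SICI variant -> URL of a locally stored article.
SICI_DB = {
    "1234-5679()7:2<15:>": "http://example.org/local/a15.pdf",
}

# Hypothetical URL templates for remotely held journals, keyed by ISSN.
URL_TEMPLATES = {
    "2345-6789": "http://example.org/{issn}/v{volume}/i{issue}",
}


def resolve_url(candidate_sicis, sici_db, url_templates, issn, volume, issue):
    """Resolve an article URL from SICI information.

    Local journals: try each SICI variant against the SICI database.
    Remote journals: instantiate a template from partial SICI information.
    """
    for sici in candidate_sicis:
        if sici in sici_db:
            return sici_db[sici]
    template = url_templates.get(issn)
    if template is not None:
        return template.format(issn=issn, volume=volume, issue=issue)
    return None
```

Trying several SICI variants in turn mirrors the reason the SICI database holds six entries per URL: a query may carry only some of the fields, so any one of the stored variants may be the one that matches.
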

    Document Reference Crossover

    There is no single, consistent, and accurate reference available across all of the LWW corpi. The following table shows available paths to the different corpi, given a starting point in the left-most column.
     
    Reference | Advance | SciSearch | EJournals
    Non-SciSearch SICI | Lookup OK. | Partial match through SciSearch SICI. | Lookup OK through SICI database.
    SciSearch SICI | Partial match through non-SciSearch SICI. | Lookup OK. | Partial match through SICI database.
    DocID | Lookup OK. | Partial match through non-SciSearch SICI. | Lookup OK through non-SciSearch SICI, then SICI database.
    CitationID | Partial match through SciSearch SICI. | Lookup OK. | Partial match through SciSearch SICI, then SICI database.
    URL | Lookup OK through SICI database. | Partial match through SICI database. | Lookup OK.
    Reference paths through LWW corpi
    Note that complete paths would be available if SciSearch and non-SciSearch SICIs could be made consistent.

    Engineering Issues

    ARP Read/Write Access

    ARP programs will require at least read access to multiple LWW corpi. Whether the ARP will require write access to LWW corpi will depend on whether LWW front-end applications (e.g. the SciSearch pages) are modified to support ARP information, or whether ARP applications (e.g. the TalkMine testbed) will provide a self-contained user environment. The first option saves ARP development effort at the expense of LWW development effort, and the second does the reverse.

    Read Access

    All LWW database formats will allow read access with varying degrees of difficulty.

    Write Access

    One consideration for ARP development is the ability to work in the native LWW database format. It is difficult to make a final determination at this point whether any of the database environments currently in use in the LWW will be appropriate for use by the ARP, but indications are negative. In particular:

    It may be advisable for the ARP to maintain its own data server for write access, with read access to the LWW corpi.

    Server Configuration and Loading

    All major LWW systems are supported by UNIX servers. Most run Solaris 2.5 or 2.6, except for sciserver.lanl.gov (the ScienceServer), which runs Linux. Most servers are in the Green network. The current host of library.lanl.gov (the Advance corpi) will shortly be moving into the Blue network and becoming a development environment. It is a Sun Ultra 3000 Enterprise machine with six 250 MHz processors running Solaris 2.6.

    Further details are available at http://lib-www.lanl.gov/~mlbm/libnet/index.html, currently in draft form.

    Acknowledgements

    I would like to extend my thanks to the LWW staff for their great cooperation in preparing this report, in particular Miriam Blake, Doug Chafe, Frances Knudson, Abe Lederman, and Mark Martinez.